The purpose behind this project is manifold. Exposing a machine learning model via an API is among the easier means of putting said model into production.
The Iris data set is included with Python's scikit-learn package. Collected decades ago and made famous by the statistician R.A. Fisher, it appears often in statistics curricula.
The data themselves comprise a well-balanced sample of three distinct iris species, with four continuous measurements for each observation. There is, honestly, little noise in identifying species from these measurements: the classes are largely linearly separable.
I split the data into mutually exclusive training and test sets.
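For reference, a minimal sketch of what that might look like; the 80/20 split and the seed below are illustrative assumptions, not necessarily the values I used:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the bundled Iris data as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)

# Stratify so each species stays balanced across the two sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```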
I estimated the models using scikit-learn and developed the API using Flask and Flask-RESTful in a virtual environment running Python 3.10.
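The skeleton of such a resource might look something like the following. This is a condensed sketch under my own assumptions (the class name, the pickle file, and the dict of fitted estimators are all illustrative); the actual code lives in the GitHub repository.

```python
import pickle

from flask import Flask, request
from flask_restful import Api, Resource
import pandas as pd

app = Flask(__name__)
api = Api(app)

# Assumed: the fitted estimators were pickled at training time as a dict,
# e.g. {"Random_forest": ..., "Naive_bayes": ...}, fit on string species labels
with open("models.pkl", "rb") as f:
    models = pickle.load(f)

class Predict(Resource):
    def post(self, model_name):
        # Each feature name maps to a list of values, so the JSON body
        # parses straight into a DataFrame with one row per observation
        X = pd.DataFrame(request.get_json())
        preds = models[model_name].predict(X)
        return {"predictions": list(preds)}

api.add_resource(Predict, "/<string:model_name>/pred")

if __name__ == "__main__":
    app.run()
```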
I tested the API locally using Postman.
I pushed everything to a GitHub repository and used it to deploy the API on Heroku, which provides hosting (for free!). Specifically, the API is available at the subdomain snyderjos-iris-api of herokuapp.com.
Unfortunate update: Heroku is discontinuing its free tier as of late November 2022, so this API will become unavailable in December of 2022.
import requests
r = requests.get("http://snyderjos-iris-api.herokuapp.com/list")
print(r.status_code)
print(r.text)
200 {"models":["Random_forest","Naive_bayes","SVM","Ensemble"],"necessary features":["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"],"outputs":["pred","prob"]}
Let's take a look at the test data:
import json
with open("X_test.json", "r") as file:
    parsed = json.load(file)
print(parsed)
{'sepal length (cm)': [6.1, 5.7, 7.7, 6.0, 6.8, 5.4, 5.6, 6.9], 'sepal width (cm)': [2.8, 3.8, 2.6, 2.9, 2.8, 3.4, 2.9, 3.1], 'petal length (cm)': [4.7, 1.7, 6.9, 4.5, 4.8, 1.5, 3.6, 5.1], 'petal width (cm)': [1.2, 0.3, 2.3, 1.5, 1.4, 0.4, 1.3, 2.3]}
The API expects each feature name to map to a list of values, as in the JSON above.
url = "http://snyderjos-iris-api.herokuapp.com/Random_forest/pred"
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
r = requests.post(url, data=open("X_test.json", "rb"), headers=headers)
print(r.status_code)
print(r.text)
200 {"predictions": ["versicolor", "setosa", "virginica", "versicolor", "versicolor", "setosa", "versicolor", "virginica"]}
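Incidentally, nothing requires a file on disk: the same request can be made by passing a plain dict through requests' `json` parameter, which also sets the Content-Type header for you. The values below are just the first two observations from X_test.json.

```python
payload = {
    "sepal length (cm)": [6.1, 5.7],
    "sepal width (cm)": [2.8, 3.8],
    "petal length (cm)": [4.7, 1.7],
    "petal width (cm)": [1.2, 0.3],
}
r = requests.post(url, json=payload)
print(r.json()["predictions"])  # expect ["versicolor", "setosa"]
```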
The call above returns the species predicted for each test observation by the random forest model.
url = "http://snyderjos-iris-api.herokuapp.com/Random_forest/prob"
r = requests.post(url, data=open("X_test.json", "rb"), headers=headers)
print(r.status_code)
print(r.text)
200 {"probabilities": {"setosa": {"0": 0.0, "1": 0.96, "2": 0.0, "3": 0.0, "4": 0.0, "5": 1.0, "6": 0.0, "7": 0.0}, "versicolor": {"0": 1.0, "1": 0.04, "2": 0.02, "3": 1.0, "4": 0.85, "5": 0.0, "6": 1.0, "7": 0.05}, "virginica": {"0": 0.0, "1": 0.0, "2": 0.98, "3": 0.0, "4": 0.15, "5": 0.0, "6": 0.0, "7": 0.95}}}
The above returns, for each test observation, the random forest model's predicted probability of each species.
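Since each species maps observation indices to probabilities, the response drops straight into a pandas DataFrame if you want something easier to read:

```python
import pandas as pd

# Columns are species, rows are test observations
probs = pd.DataFrame(r.json()["probabilities"])
print(probs)
```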
What about other models?
url = "http://snyderjos-iris-api.herokuapp.com/SVM/pred"
r = requests.post(url, data=open("X_test.json", "rb"), headers=headers)
print(r.status_code)
print(r.text)
url = "http://snyderjos-iris-api.herokuapp.com/Naive_bayes/pred"
r = requests.post(url, data=open("X_test.json", "rb"), headers=headers)
print(r.status_code)
print(r.text)
url = "http://snyderjos-iris-api.herokuapp.com/Ensemble/pred"
r = requests.post(url, data=open("X_test.json", "rb"), headers=headers)
print(r.status_code)
print(r.text)
200 {"predictions": ["versicolor", "setosa", "virginica", "versicolor", "versicolor", "setosa", "versicolor", "virginica"]}
200 {"predictions": ["versicolor", "setosa", "virginica", "versicolor", "versicolor", "setosa", "versicolor", "virginica"]}
200 {"predictions": ["versicolor", "setosa", "virginica", "versicolor", "versicolor", "setosa", "versicolor", "virginica"]}
Note that they all agree. In fact, the train-test split was a lucky one: every model achieved 100% accuracy on the test set.
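If you'd rather verify that agreement programmatically than eyeball the output, a short loop over the model names returned by the /list endpoint does it:

```python
base = "http://snyderjos-iris-api.herokuapp.com"
preds = {}
for model in ["Random_forest", "Naive_bayes", "SVM", "Ensemble"]:
    with open("X_test.json", "rb") as f:
        r = requests.post(f"{base}/{model}/pred", data=f, headers=headers)
    preds[model] = r.json()["predictions"]

# All four models return identical label sequences on this test set
print(len(set(map(tuple, preds.values()))) == 1)  # True
```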
How useful is an API that estimates the species of an iris based on petal and sepal dimensions?
Not terribly.
How does this deal with missing data?
It doesn't at the moment.
It currently accepts only batch inputs. Would it not be useful to provide an endpoint that accepts and returns a single example?
Possibly yes.
The JSON output is rather unwieldy. Would the API not benefit from offering multiple output formats?
Yes, it would.
Do you foresee implementing any of the above?
In all honesty, no. Implementing these would entail a great deal of development and testing, and there are other projects that interest me more and align better with my goals. Perfection only pays so many dividends.
¯\_(ツ)_/¯
| Tool | Role in this project |
| --- | --- |
| Scikit-Learn | Python's premier machine learning library, with any number of out-of-the-box solutions. |
| Flask | Flask has a large number of uses; here, I use it to create the API that exposes the estimated models. |
| Docker | A tool intended to permanently solve the retort of "Well, it works on my computer." |