sklearn - Simple Project Rundown
We'll be building a simple machine learning model using Python and the standard algorithms from the sklearn toolkit. It's not as involved as building your own neural network with TensorFlow, but it's a good starting point for learning machine learning.
The following 5 libraries are common for machine learning projects:
- numpy: provides N-dimensional arrays and lower-level math tools.
- pandas: provides data structures, analysis and data processing tools.
- matplotlib: provides 2D plotting tools.
- scipy: tools for scientific work, e.g. solving equations and optimization.
- sklearn: standard machine learning algorithms.
Anaconda is the easiest way to install all of these. Wrapping everything in a Docker container with Jupyter notebooks is not a bad idea either.
conda install numpy pandas matplotlib scipy scikit-learn
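If you're not using Anaconda, the same packages can typically be installed with pip instead (assuming a standard Python 3 environment):
pip install numpy pandas matplotlib scipy scikit-learn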
Check if they're all installed:
import sys
print('Python: {}'.format(sys.version))
import numpy
print('numpy: {}'.format(numpy.__version__))
import pandas
print('pandas: {}'.format(pandas.__version__))
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
import scipy
print('scipy: {}'.format(scipy.__version__))
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
Use pandas to import the dataset from a publicly available CSV file.
import numpy as np
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)
assert type(dataset) == pd.core.frame.DataFrame
assert dataset.shape == (150, 5)
# you can think of pandas DataFrame as a wrapper for numpy N-dimensional array.
assert type(dataset.values) == np.ndarray
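If the UCI URL is ever unreachable, a roughly equivalent DataFrame can be built from the copy of the dataset bundled with sklearn. This is only a sketch: note that the bundled class labels are 'setosa', 'versicolor' and 'virginica' rather than the 'Iris-...' strings used in the CSV.
from sklearn.datasets import load_iris
iris = load_iris()
fallback = pd.DataFrame(iris.data, columns=names[:-1])
# map the numeric targets (0, 1, 2) to their string labels
fallback['class'] = [iris.target_names[i] for i in iris.target]
assert fallback.shape == (150, 5)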
Statistical Exploration
Use pandas to explore the initial data.
print(dataset.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 150 entries, 0 to 149
# Data columns (total 5 columns):
# sepal-length 150 non-null float64
# sepal-width 150 non-null float64
# petal-length 150 non-null float64
# petal-width 150 non-null float64
# class 150 non-null object
# dtypes: float64(4), object(1)
# memory usage: 5.9+ KB
print(dataset.head())
# sepal-length sepal-width petal-length petal-width class
# 0 5.1 3.5 1.4 0.2 Iris-setosa
# 1 4.9 3.0 1.4 0.2 Iris-setosa
# 2 4.7 3.2 1.3 0.2 Iris-setosa
# 3 4.6 3.1 1.5 0.2 Iris-setosa
# 4 5.0 3.6 1.4 0.2 Iris-setosa
print()
print('Statistical summary:')
print(dataset.describe())
# sepal-length sepal-width petal-length petal-width
# count 150.000000 150.000000 150.000000 150.000000
# mean 5.843333 3.054000 3.758667 1.198667
# std 0.828066 0.433594 1.764420 0.763161
# min 4.300000 2.000000 1.000000 0.100000
# 25% 5.100000 2.800000 1.600000 0.300000
# 50% 5.800000 3.000000 4.350000 1.300000
# 75% 6.400000 3.300000 5.100000 1.800000
# max 7.900000 4.400000 6.900000 2.500000
print()
print('Class distribution:')
print(dataset.groupby('class').size())
# class
# Iris-setosa 50
# Iris-versicolor 50
# Iris-virginica 50
# dtype: int64
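The same groupby can also be used to compare the attributes per class, which already hints that the classes are separable (a sketch, output omitted):
# per-class mean of the four numeric attributes
print(dataset.groupby('class').mean())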
Visual Exploration
Univariate plots are used to understand each individual attribute.
import matplotlib.pyplot as plt
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
# histograms
dataset.hist()
plt.show()
Multivariate plots are used to understand the relationships between attributes.
from pandas.plotting import scatter_matrix
# scatterplots
# Note the diagonal grouping of some pairs of attributes.
# This suggests a high correlation and a predictable relationship.
scatter_matrix(dataset)
plt.show()
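The correlation hinted at by the scatterplots can also be checked numerically; a minimal sketch using pandas' built-in corr() on the four numeric columns:
numeric_columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
# Pearson correlation between each pair of attributes; values near 1 or -1
# indicate the strong linear relationships visible in the scatter matrix.
print(dataset[numeric_columns].corr())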
Preparing the Data
Create a validation dataset. You can't use the same data both to train and to validate the model, as the resulting accuracy estimate would be overly optimistic due to overfitting.
from sklearn import model_selection
data_values = dataset.values
assert type(data_values) == np.ndarray
assert data_values.shape == (150, 5) # 150 rows with 5 columns each
# We want to split our data to input `x` and expected output `y`.
# The last column ("class") becomes our `y` aka. the label.
x = data_values[:, 0:-1]
y = data_values[:, -1]
assert len(x) == len(y) == 150
assert len(x[0]) == 4
assert type(x[0]) == np.ndarray
assert type(y[0]) == str
# We will extract 20% of the data to be used in validation.
validation_proportion = 0.20
train_x, test_x, train_y, test_y = model_selection.train_test_split(
    x,
    y,
    test_size=validation_proportion,
    random_state=7
)
assert len(train_x) == len(train_y) == 120
assert len(test_x) == len(test_y) == 30
assert type(train_x) == type(train_y) == np.ndarray
assert type(test_x) == type(test_y) == np.ndarray
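Since this is a classification problem with perfectly balanced classes, you may prefer a stratified split that keeps the 50/50/50 class proportions in both sets; a sketch of the same call with the extra stratify argument:
# stratify=y keeps the class proportions equal in the train and test sets
train_x, test_x, train_y, test_y = model_selection.train_test_split(
    x,
    y,
    test_size=validation_proportion,
    random_state=7,
    stratify=y
)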
Finding the Right Learning Algorithm
Let's evaluate two linear algorithms from the sklearn toolkit:
- Logistic Regression (LR).
- Linear Discriminant Analysis (LDA).
And four nonlinear algorithms from the same toolkit:
- K-Nearest Neighbors (KNN).
- Classification and Regression Trees (CART).
- Gaussian Naive Bayes (NB).
- Support Vector Machines (SVM).
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
    # 10-fold cross-validation splits our training data into 10 parts,
    # trains on 9 and tests on 1, repeating for all combinations
    # of train-test splits. shuffle=True with a fixed random_state
    # ensures each algorithm is evaluated on the same data splits.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    # We want to find the best algorithm based on the 'accuracy' score.
    scoring = 'accuracy'
    # Evaluate a score by cross-validation.
    cross_result = model_selection.cross_val_score(
        model,
        train_x,
        train_y,
        cv=kfold,
        scoring=scoring
    )
    assert type(cross_result) == np.ndarray
    results.append(cross_result)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cross_result.mean(), cross_result.std())
    print(msg)
# LR: 0.966667 (0.040825)
# LDA: 0.975000 (0.038188)
# KNN: 0.983333 (0.033333)
# CART: 0.975000 (0.038188)
# NB: 0.975000 (0.053359)
# SVM: 0.991667 (0.025000)
# thus SVM has the largest estimated accuracy score here
# (exact numbers vary slightly with library versions and the random seed).
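To compare how the scores are spread for each algorithm, the collected cross-validation results can be shown as side-by-side box plots (a sketch, reusing the matplotlib import from earlier):
fig, ax = plt.subplots()
ax.boxplot(results)
ax.set_xticks(range(1, len(names) + 1))
ax.set_xticklabels(names)
ax.set_title('Algorithm comparison')
plt.show()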
Training the Model
Use the winning algorithm (SVM) from sklearn to train the final model.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# Create and train the SVM model.
model = SVC()
model.fit(train_x, train_y)
# Use the model to create predictions.
predictions = model.predict(test_x)
# We can see that the accuracy on the validation set is about 93%.
print(accuracy_score(test_y, predictions))
# 0.933333333333
# In a confusion matrix:
# - Each column of the matrix represents the instances in a predicted class.
# - Each row represents the instances in an actual class.
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(confusion_matrix(test_y, predictions, labels=labels))
# [[ 7 0 0]
# [ 0 10 2]
# [ 0 0 11]]
# Here we can see that two instances of 'Iris-versicolor' were
# incorrectly predicted to be 'Iris-virginica'.
# In a classification report:
# Precision is the ratio `true_positive / (true_positive + false_positive)`.
# Recall is the ratio `true_positive / (true_positive + false_negative)`.
# f1-score is the harmonic mean of precision and recall, from 0 to 1.
# Support is the number of class samples in `test_y`.
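# As a sanity check against the confusion matrix above (hedged arithmetic):
# for 'Iris-versicolor', precision = 10 / (10 + 0) = 1.00 and
# recall = 10 / (10 + 2) = 0.83, which matches the report below.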
print(classification_report(test_y, predictions))
# precision recall f1-score support
#
# Iris-setosa 1.00 1.00 1.00 7
# Iris-versicolor 1.00 0.83 0.91 12
# Iris-virginica 0.85 1.00 0.92 11
#
# avg / total 0.94 0.93 0.93 30
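As a last step, the trained model can classify a brand-new measurement; a minimal sketch with made-up sepal and petal values (in the same column order as the training data):
new_flower = [[5.1, 3.5, 1.4, 0.2]]  # hypothetical sepal/petal measurements
print(model.predict(new_flower))
# expected to print something like ['Iris-setosa']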