ruk·si

sklearn
Simple Project Rundown

Updated at 2017-06-13 13:43

We'll build a simple machine learning model using Python and machine learning algorithms from the sklearn toolkit. It's not as complex as building your own neural network with TensorFlow, but it's a good starting point for learning machine learning.

The following 5 libraries are common for machine learning projects:

  • numpy: provides N-dimensional arrays and lower level math tools.
  • pandas: provides data structures and tools for data analysis and processing.
  • matplotlib: provides 2D plotting tools.
  • scipy: tools for scientific work e.g. solving equations and optimization.
  • sklearn: standard machine learning algorithms.

Anaconda is the easiest way to install all of these. Wrapping everything in a Docker container with Jupyter notebooks is not a bad idea either.

conda install numpy pandas matplotlib scipy scikit-learn

Check if they're all installed:

import sys
print('Python: {}'.format(sys.version))

import numpy
print('numpy: {}'.format(numpy.__version__))

import pandas
print('pandas: {}'.format(pandas.__version__))

import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))

import scipy
print('scipy: {}'.format(scipy.__version__))

import sklearn
print('sklearn: {}'.format(sklearn.__version__))
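
The same check can be written as a loop that reports missing packages instead of stopping at the first failed import. This is only a minimal sketch; the helper name `installed_versions` is hypothetical and not part of any library.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to 'unknown'.
        versions[name] = getattr(module, '__version__', 'unknown')
    return versions


if __name__ == '__main__':
    wanted = ['numpy', 'pandas', 'matplotlib', 'scipy', 'sklearn']
    for name, version in installed_versions(wanted).items():
        print('{}: {}'.format(name, version or 'NOT INSTALLED'))
```

Note that the import name is `sklearn` even though the package is installed as `scikit-learn`.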

Use pandas to import the dataset from a publicly available CSV file.

import numpy as np
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(url, names=names)
assert type(dataset) == pd.core.frame.DataFrame
assert dataset.shape == (150, 5)

# you can think of pandas DataFrame as a wrapper for numpy N-dimensional array.
assert type(dataset.values) == np.ndarray
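
The `names` argument matters because `iris.data` has no header row. The sketch below illustrates this with two rows in the same layout, read from an in-memory string rather than the URL so that it runs offline. The variable names `csv_text`, `sample` and `headerless` are made up for this example.

```python
import io

import numpy as np
import pandas as pd

# Two header-less rows in the same layout as iris.data.
csv_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# With names=..., the first row is treated as data, not as a header.
sample = pd.read_csv(io.StringIO(csv_text), names=names)
assert isinstance(sample, pd.DataFrame)
assert sample.shape == (2, 5)

# Without names=..., pandas consumes the first row as column labels.
headerless = pd.read_csv(io.StringIO(csv_text))
assert headerless.shape == (1, 5)

# .values exposes the data as a numpy array.
assert isinstance(sample.values, np.ndarray)
```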

Statistical Exploration

Use pandas to explore the initial data.

print(dataset.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 150 entries, 0 to 149
# Data columns (total 5 columns):
# sepal-length    150 non-null float64
# sepal-width     150 non-null float64
# petal-length    150 non-null float64
# petal-width     150 non-null float64
# class           150 non-null object
# dtypes: float64(4), object(1)
# memory usage: 5.9+ KB

print(dataset.head())
#    sepal-length  sepal-width  petal-length  petal-width        class
# 0           5.1          3.5           1.4          0.2  Iris-setosa
# 1           4.9          3.0           1.4          0.2  Iris-setosa
# 2           4.7          3.2           1.3          0.2  Iris-setosa
# 3           4.6          3.1           1.5          0.2  Iris-setosa
# 4           5.0          3.6           1.4          0.2  Iris-setosa

print()
print('Statistical summary:')
print(dataset.describe())
#        sepal-length  sepal-width  petal-length  petal-width
# count    150.000000   150.000000    150.000000   150.000000
# mean       5.843333     3.054000      3.758667     1.198667
# std        0.828066     0.433594      1.764420     0.763161
# min        4.300000     2.000000      1.000000     0.100000
# 25%        5.100000     2.800000      1.600000     0.300000
# 50%        5.800000     3.000000      4.350000     1.300000
# 75%        6.400000     3.300000      5.100000     1.800000
# max        7.900000     4.400000      6.900000     2.500000

print()
print('Class distribution:')
print(dataset.groupby('class').size())
# class
# Iris-setosa        50
# Iris-versicolor    50
# Iris-virginica     50
# dtype: int64
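
Grouping by class is also handy for per-class statistics, not only for counting rows. A minimal sketch on a tiny made-up frame (the values and the class names `a` and `b` are illustrative, not taken from the iris data):

```python
import pandas as pd

# Toy data: two classes with two measurements each.
toy = pd.DataFrame({
    'petal-length': [1.0, 2.0, 4.0, 6.0],
    'class': ['a', 'a', 'b', 'b'],
})

# Mean of one attribute within each class.
per_class_mean = toy.groupby('class')['petal-length'].mean()
print(per_class_mean)
```

Applied to the real dataset, the same pattern would be `dataset.groupby('class').mean()`.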

Visualized Exploration

Univariate plots are used to understand each individual attribute.

import matplotlib.pyplot as plt

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

# histograms
dataset.hist()
plt.show()

Figure: univariate plots (box-and-whisker plots and histograms)

Multivariate plots are used to understand the relationships between attributes.

from pandas.plotting import scatter_matrix

# scatterplots
# Note the diagonal grouping of some pairs of attributes.
# This suggests a high correlation and a predictable relationship.
scatter_matrix(dataset)
plt.show()

Figure: multivariate plots (scatter matrix)
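
The correlation hinted at by the scatter plots can also be put into numbers. The sketch below computes Pearson correlation coefficients with numpy. To run offline it uses the copy of the iris data bundled with sklearn, which is an assumption of this example and may differ slightly from the UCI file.

```python
import numpy as np
from sklearn.datasets import load_iris

# Columns: sepal length, sepal width, petal length, petal width (in cm).
features = load_iris().data

# np.corrcoef expects one variable per row, hence the transpose.
correlations = np.corrcoef(features.T)
assert correlations.shape == (4, 4)

petal_length_vs_width = correlations[2, 3]
print('petal-length vs petal-width: {:.2f}'.format(petal_length_vs_width))
```

With the DataFrame used in this article, `dataset.drop(columns='class').corr()` should give the equivalent table.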

Preparing the Data

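Create a validation dataset. You can't use the same data to both train and validate the model: the model would be scored on examples it has already seen, so the accuracy estimate would be overly optimistic and overfitting would go unnoticed.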

from sklearn import model_selection

data_values = dataset.values
assert type(data_values) == np.ndarray
assert data_values.shape == (150, 5) # 150 rows with 5 columns each

# We want to split our data into input `x` and expected output `y`.
# The last column ("class") becomes our `y`, a.k.a. the label.
x = data_values[:, 0:-1]
y = data_values[:, -1]
assert len(x) == len(y) == 150
assert len(x[0]) == 4
assert type(x[0]) == np.ndarray
assert type(y[0]) == str

# We will extract 20% of the data to be used in validation.
validation_proportion = 0.20

train_x, test_x, train_y, test_y = model_selection.train_test_split(
    x,
    y,
    test_size=validation_proportion,
    random_state=7
)

assert len(train_x) == len(train_y) == 120
assert len(test_x) == len(test_y) == 30
assert type(train_x) == type(train_y) == np.ndarray
assert type(test_x) == type(test_y) == np.ndarray
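
A plain random split does not guarantee that every class is equally represented in the validation set. One possible variation, not used in this article, is to pass `stratify=y` so that both parts keep the class proportions. The sketch uses the iris copy bundled with sklearn (integer labels instead of class names) so that it runs without a download.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target  # target holds integer labels 0, 1, 2

train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.20, random_state=7, stratify=y
)

# Each class is one third of the data, so a stratified 30-sample
# validation set holds 10 samples per class.
print(np.bincount(test_y))
```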

Finding the Right Learning Algorithm

Let's evaluate 2 linear algorithms from the sklearn toolkit:

  • Logistic Regression (LR).
  • Linear Discriminant Analysis (LDA).

And 4 nonlinear algorithms from the same sklearn toolkit:

  • K-Nearest Neighbors (KNN).
  • Classification and Regression Trees (CART).
  • Gaussian Naive Bayes (NB).
  • Support Vector Machines (SVM).

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

results = []
names = []
for name, model in models:
    # 10-fold cross validation will split our dataset into 10 parts,
    # train on 9 and test on 1 and repeat for all combinations
    # of train-test splits. Without shuffling, KFold always produces
    # the same folds, so each algorithm is evaluated using the same
    # data split. (Passing random_state without shuffle=True has no
    # effect and raises an error in newer sklearn versions.)
    kfold = model_selection.KFold(n_splits=10)

    # We want to find the best algorithm based on result 'accuracy'.
    scoring = 'accuracy'

    # Evaluate a score by cross-validation.
    cross_result = model_selection.cross_val_score(
        model,
        train_x,
        train_y,
        cv=kfold,
        scoring=scoring
    )
    assert type(cross_result) == np.ndarray

    results.append(cross_result)
    names.append(name)

    msg = "%s: %f (%f)" % (name, cross_result.mean(), cross_result.std())
    print(msg)

# LR:   0.966667 (0.040825)
# LDA:  0.975000 (0.038188)
# KNN:  0.983333 (0.033333)
# CART: 0.975000 (0.038188)
# NB:   0.975000 (0.053359)
# SVM:  0.991667 (0.025000)
# thus SVM has the largest estimated accuracy score.
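
To see what 10-fold cross validation does with the 120 training rows, the following sketch runs `KFold` on a plain index array and inspects the folds. The names `indices`, `fold_sizes` and `seen_in_test` are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(120)  # stand-in for the 120 training rows
kfold = KFold(n_splits=10)

fold_sizes = []
seen_in_test = []
for train_idx, test_idx in kfold.split(indices):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_in_test.extend(test_idx.tolist())

# Every fold trains on 108 rows and tests on the remaining 12.
assert fold_sizes == [(108, 12)] * 10
# Each row lands in the test part exactly once.
assert sorted(seen_in_test) == list(range(120))
```

`cross_val_score` fits a fresh copy of the model on each of these 10 train parts and returns one score per fold, which is why `cross_result` has a mean and a standard deviation.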

Training the Model

Train the winning algorithm, SVM, on the training data and check it against the validation data.

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Create and train SVM model.
model = SVC()
model.fit(train_x, train_y)

# Use the model to create predictions.
predictions = model.predict(test_x)

# We can see that the accuracy is 93%.
print(accuracy_score(test_y, predictions))
# 0.933333333333

# In a confusion matrix:
# - Each column of the matrix represents the instances in a predicted class.
# - Each row represents the instances in an actual class.
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(confusion_matrix(test_y, predictions, labels=labels))
# [[ 7  0  0]
#  [ 0 10  2]
#  [ 0  0 11]]
# Here we can see that two instances of 'Iris-versicolor' were
# incorrectly predicted to be 'Iris-virginica'.

# In a classification report:
# Precision is the ratio `true_positive / (true_positive + false_positive)`.
# Recall is the ratio `true_positive / (true_positive + false_negative)`.
# f1-score is the harmonic mean of precision and recall, from 0 to 1.
# Support is the number of class samples in `test_y`.
print(classification_report(test_y, predictions))
#                  precision    recall  f1-score   support
#
#     Iris-setosa       1.00      1.00      1.00         7
# Iris-versicolor       1.00      0.83      0.91        12
#  Iris-virginica       0.85      1.00      0.92        11
#
#     avg / total       0.94      0.93      0.93        30
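
The per-class numbers in the report can be recomputed by hand from the confusion matrix printed above. A minimal sketch with numpy, using the same formulas as in the comments:

```python
import numpy as np

# Confusion matrix from above: rows are actual classes, columns are predictions.
matrix = np.array([
    [7, 0, 0],
    [0, 10, 2],
    [0, 0, 11],
])

true_positive = np.diag(matrix)
precision = true_positive / matrix.sum(axis=0)  # column sums: everything predicted as the class
recall = true_positive / matrix.sum(axis=1)     # row sums: everything actually in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positive.sum() / matrix.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(float(accuracy), 4))
```

The row sums (7, 12 and 11) are the support column of the report, and the accuracy is 28 correct predictions out of 30.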
