MNIST - A Study of ML Classification

The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image, and each image is labeled with the digit it represents. It is a perfect dataset for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. Whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST, and whenever someone learns Machine Learning, sooner or later they tackle MNIST.


More about the dataset here: http://yann.lecun.com/exdb/mnist/

First, import a few common libraries, ensure matplotlib plots figures inline, and prepare a function to save the figures:

In [1]:
import numpy as np
import os

np.random.seed(42)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # ensure the output directory exists
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
    
import warnings
warnings.filterwarnings(action='ignore')
In [2]:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist
Out[2]:
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([0., 0., 0., ..., 9., 9., 9.])}
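
Note: fetch_mldata depends on mldata.org, which has since gone offline, and the function was deprecated in scikit-learn 0.20 and later removed. A hedged equivalent for newer versions is sketched below; fetch_openml returns the labels as strings and does not sort the rows by label, so outputs may differ slightly from those shown in this notebook:

from sklearn.datasets import fetch_openml
# On scikit-learn >= 0.22 you may want as_frame=False to get NumPy arrays
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"].astype(np.uint8)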
In [3]:
X, y = mnist["data"], mnist["target"]
In [4]:
print("X shape:", X.shape)
print("y shape:", y.shape)
X shape: (70000, 784)
y shape: (70000,)
In [5]:
random_digit = X[24000]
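# Reshape the flat 784-value feature vector into a 28x28 pixel-intensity image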
random_digit_image = random_digit.reshape(28,28)
plt.imshow(random_digit_image, cmap = matplotlib.cm.binary,
          interpolation="nearest")
plt.axis("off")

save_fig("random_digit")
plt.show()
Saving figure random_digit
In [6]:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = matplotlib.cm.binary,
               interpolation="nearest")
    plt.axis("off")
In [7]:
def plot_digits(instances, images_per_row=5):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = matplotlib.cm.binary)
    plt.axis("off")
In [8]:
plt.figure(figsize=(10,10))
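# The mnist-original data is sorted by label, so strided slices sample all ten digits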
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=5)
save_fig("more_digits_plot")
plt.show()
Saving figure more_digits_plot
In [9]:
y[24000]
Out[9]:
3.0
In [10]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
In [11]:
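# Shuffle the training set so that cross-validation folds contain a mix of all digits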
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Training a Binary Classifier

To simplify the problem for now, let's only try to identify one digit: the number 3. This "3-detector" will be an example of a binary classifier, capable of distinguishing between just two classes: 3 and not-3.

In [12]:
y_train_3 = (y_train == 3)
y_test_3 = (y_test == 3)

To pick a classifier, a good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. This classifier can handle very large datasets efficiently, partly because it deals with training instances independently, one at a time.

In [13]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_3)
Out[13]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False)
In [14]:
sgd_clf.predict([random_digit])
Out[14]:
array([ True])

Performance Measures

Evaluating a classifier is significantly trickier than evaluating a regressor. There are many performance measures available.

Measuring Accuracy Using Cross-Validation

A good way to evaluate a model is to use cross-validation:

In [15]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_3, cv=3, scoring="accuracy")
Out[15]:
array([0.94815259, 0.96385   , 0.9340467 ])

That's over 90% accuracy across all 3 folds. But this is less impressive than it looks: only about 10% of the images are 3s, so a classifier that always guesses "not a 3" would also be right about 90% of the time. This is why accuracy is generally not the preferred performance measure for skewed datasets.
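
To see why, here is a minimal sketch (a "Never3Classifier", by analogy with the classic never-5 example; the class is ours, not part of the original run) of a dumb classifier that always predicts the negative class:

from sklearn.base import BaseEstimator

class Never3Classifier(BaseEstimator):
    """Baseline that always predicts "not a 3"."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)

never_3_clf = Never3Classifier()
cross_val_score(never_3_clf, X_train, y_train_3, cv=3, scoring="accuracy")
# roughly 0.90 on every fold, confirming the base rate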

1) Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 3s with 6s, we would look at the cell in the row for 3s and the column for 6s of the confusion matrix. To compute the confusion matrix, we first need a set of predictions. We can get predictions for each instance in the training set using the cross_val_predict() function.

In [16]:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3)
In [17]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_3, y_train_pred)
Out[17]:
array([[51523,  2346],
       [  733,  5398]], dtype=int64)

The first row of this matrix considers non-3 images (the negative class): 51,523 were correctly classified as non-3s (true negatives), while the remaining 2,346 were wrongly classified as 3s (false positives).
The second row considers images of 3s (the positive class): 733 were wrongly classified as non-3s (false negatives), while the remaining 5,398 were correctly classified as 3s (true positives).
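
As a convenience, the four counts can be unpacked directly (scikit-learn's confusion_matrix flattens in the order tn, fp, fn, tp for binary problems):

tn, fp, fn, tp = confusion_matrix(y_train_3, y_train_pred).ravel()
# tn=51523, fp=2346, fn=733, tp=5398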

2) Precision and Recall

In [18]:
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_3, y_train_pred) # 5398 / (5398 + 2346) 
Out[18]:
0.6970557851239669
In [19]:
recall_score(y_train_3, y_train_pred) # 5398 / (5398 + 733)
Out[19]:
0.8804436470396346

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular when we need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall: F1 = 2 × precision × recall / (precision + recall). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values, so a classifier only gets a high F1 score if both recall and precision are high.

In [20]:
from sklearn.metrics import f1_score
f1_score(y_train_3, y_train_pred)
Out[20]:
0.7780900900900901
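
As a sanity check, the same number falls out of the harmonic-mean formula applied to the precision and recall computed above:

# F1 = 2 * precision * recall / (precision + recall)
p = precision_score(y_train_3, y_train_pred)
r = recall_score(y_train_3, y_train_pred)
2 * p * r / (p + r)  # 0.77809..., matching f1_score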

3) Precision/Recall Tradeoff

For each instance, the SGDClassifier computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class. Raising the threshold increases precision but reduces recall, and vice versa:

In [21]:
y_scores = sgd_clf.decision_function([random_digit])
y_scores
Out[21]:
array([18165.28807677])
In [22]:
threshold = 0
y_random_digit_pred = (y_scores > threshold)
y_random_digit_pred
Out[22]:
array([ True])
In [23]:
threshold = 200000
y_random_digit_pred = (y_scores > threshold)
y_random_digit_pred
Out[23]:
array([False])
In [24]:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3, method="decision_function")
In [25]:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_3, y_scores)
In [26]:
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.xlim([-700000, 700000])
    plt.ylim([0,1])

plt.figure(figsize=(10, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
save_fig("pression_recall_vs_threshold_plot")
plt.show()
Saving figure precision_recall_vs_threshold_plot
In [27]:
y_train_pred_90 = (y_scores > 70000)
In [28]:
precision_score(y_train_3, y_train_pred_90)
Out[28]:
0.823249713067716
In [29]:
recall_score(y_train_3, y_train_pred_90)
Out[29]:
0.8189528625020388
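
The threshold of 70,000 was read off the plot by eye, and despite the variable's name it lands at roughly 82% precision and 82% recall. Below is a minimal sketch for choosing the lowest threshold that actually achieves 90% precision (assuming the precisions, thresholds, and y_scores arrays computed above; np.argmax returns the index of the first True):

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90_precise = (y_scores >= threshold_90_precision)
precision_score(y_train_3, y_train_pred_90_precise)  # >= 0.90 by construction
recall_score(y_train_3, y_train_pred_90_precise)     # the recall we pay for it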
In [30]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0,1,0,1])
plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
save_fig("precision_vs_recall_plot")
plt.show()
Saving figure precision_vs_recall_plot

4) ROC Curves

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 - specificity. To plot the ROC curve, we first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:

In [31]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_3, y_scores)
In [32]:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0,1],[0,1], 'k--')
    plt.axis([0,1,0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    
plt.figure(figsize=(8,6))
plot_roc_curve(fpr,tpr)
save_fig("roc_curve_plot")
plt.show()
Saving figure roc_curve_plot
In [33]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_3, y_scores)
Out[33]:
0.9650257133358359
In [34]:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_3, cv=3,
                                    method="predict_proba")
In [35]:
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_3,y_scores_forest)
In [36]:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
save_fig("roc_curve_comparison_plot")
plt.show()
Saving figure roc_curve_comparison_plot
In [37]:
roc_auc_score(y_train_3, y_scores_forest)
Out[37]:
0.9871330465842308
In [38]:
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_3, cv=3)
precision_score(y_train_3, y_train_pred_forest)
Out[38]:
0.9854662898667743
In [39]:
recall_score(y_train_3, y_train_pred_forest)
Out[39]:
0.7962811939324743

Multiclass Classification

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes. Some algorithms (such as SGD) are strictly binary, but Scikit-Learn detects when a binary algorithm is given a multiclass task and automatically runs a one-versus-all (OvA) strategy: it trains 10 binary detectors, one per digit, and picks the class whose detector outputs the highest decision score. This is why decision_function below returns 10 scores per instance.

In [40]:
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([random_digit])
Out[40]:
array([3.])
In [41]:
random_digit_scores = sgd_clf.decision_function([random_digit])
random_digit_scores
Out[41]:
array([[ -499582.55705237,  -550812.09184579,  -735375.44069824,
           18165.28807677,  -816488.32333227,  -216282.22151398,
         -112654.73529414, -1426490.88561816, -1173446.3761975 ,
        -1126319.4885984 ]])
In [42]:
np.argmax(random_digit_scores)
Out[42]:
3
In [43]:
sgd_clf.classes_
Out[43]:
array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [44]:
sgd_clf.classes_[3]
Out[44]:
3.0
In [45]:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([random_digit])
Out[45]:
array([3.])
In [46]:
len(ovo_clf.estimators_)
Out[46]:
45
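
The count 45 is no accident: one-versus-one (OvO) trains one binary classifier for every pair of classes, i.e. 10 × 9 / 2 = 45 for ten digits. For comparison, here is a minimal sketch (not part of the original run) forcing the one-versus-rest strategy, which trains just one detector per class:

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SGDClassifier(max_iter=5, random_state=42))
ovr_clf.fit(X_train, y_train)
len(ovr_clf.estimators_)  # 10: one binary detector per digit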
In [47]:
forest_clf.fit(X_train, y_train)
forest_clf.predict([random_digit])
Out[47]:
array([3.])
In [48]:
forest_clf.predict_proba([random_digit])
Out[48]:
array([[0. , 0. , 0. , 0.9, 0. , 0. , 0. , 0. , 0. , 0.1]])
In [49]:
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Out[49]:
array([0.84063187, 0.84899245, 0.86652998])
In [50]:
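# Simply standardizing the inputs (zero mean, unit variance per feature) lifts accuracy above 90%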
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Out[50]:
array([0.91011798, 0.90874544, 0.906636  ])

Error Analysis

Here, we will assume that we have found a promising model and we want to find ways to improve it. One way to do this is to analyze the types of errors it makes.

In [51]:
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
Out[51]:
array([[5725,    3,   24,    9,   10,   49,   50,   10,   39,    4],
       [   2, 6493,   43,   25,    7,   40,    5,   10,  109,    8],
       [  51,   41, 5321,  104,   89,   26,   87,   60,  166,   13],
       [  47,   46,  141, 5342,    1,  231,   40,   50,  141,   92],
       [  19,   29,   41,   10, 5366,    9,   56,   37,   86,  189],
       [  73,   45,   36,  193,   64, 4582,  111,   30,  193,   94],
       [  29,   34,   44,    2,   42,   85, 5627,   10,   45,    0],
       [  25,   24,   74,   32,   54,   12,    6, 5787,   15,  236],
       [  52,  161,   73,  156,   10,  163,   61,   25, 5027,  123],
       [  43,   35,   26,   92,  178,   28,    2,  223,   82, 5240]],
      dtype=int64)
In [52]:
"""To plot with color and a colorbar"""
def plot_confusion_matrix(matrix):
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)
In [53]:
plt.matshow(conf_mx, cmap=plt.cm.gray)
save_fig("confusion_matrix_plot", tight_layout=False)
plt.show()

plot_confusion_matrix(conf_mx)
Saving figure confusion_matrix_plot
In [54]:
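# Divide each count by the number of images in its true class, so we compare
# error rates rather than absolute counts (classes differ in size)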
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
In [55]:
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

plot_confusion_matrix(norm_conf_mx)
In [56]:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
save_fig("error_analysis_digits_plot")
plt.show()
Saving figure error_analysis_digits_plot

Multilabel Classification

In some cases we may want our classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? It should attach one label per person it recognizes. The code below creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9) and the second indicates whether or not it is odd.

In [57]:
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
Out[57]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
In [58]:
knn_clf.predict([random_digit])
Out[58]:
array([[False,  True]])
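
There are many ways to evaluate a multilabel classifier. One hedged sketch: compute the F1 score for each label and take the unweighted average via average="macro" (note this cross-validates KNN over the full training set, which is slow):

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")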

Multioutput Classification

Multioutput Classification is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values). Let’s build a system that removes noise from images. It will take as input a noisy digit image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images.

In [59]:
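# Corrupt the images with random pixel noise; the targets are the clean originals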
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
In [60]:
random_index = 3500
plt.subplot(121); plot_digit(X_test_mod[random_index])
plt.subplot(122); plot_digit(y_test_mod[random_index])
save_fig("noisy_digit_example_plot")
plt.show()
Saving figure noisy_digit_example_plot

On the left is the noisy input image, and on the right is the clean target image. Now let’s train the classifier and make it clean this image:

In [61]:
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[random_index]])
plot_digit(clean_digit)
save_fig("cleaned_digit_example_plot")
Saving figure cleaned_digit_example_plot

The above image looks close enough to the target!