More about the dataset here: http://yann.lecun.com/exdb/mnist/
import numpy as np
import os
np.random.seed(42)
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
def save_fig(fig_id, tight_layout=True):
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)
import warnings
warnings.filterwarnings(action='ignore')
from sklearn.datasets import fetch_openml
# fetch_mldata('MNIST original') no longer works: mldata.org is offline and the
# function was removed from Scikit-Learn. fetch_openml is the replacement
# (assumes scikit-learn >= 0.23 for the as_frame parameter); note that OpenML
# serves the labels as strings and orders the rows differently.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)  # cast string labels to integers
mnist
X, y = mnist["data"], mnist["target"]
print("X shape:", X.shape)  # (70000, 784): 70,000 images, 28 x 28 = 784 pixels each
print("y shape:", y.shape)  # (70000,)
random_digit = X[24000]
random_digit_image = random_digit.reshape(28,28)
plt.imshow(random_digit_image, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
save_fig("random_digit")
plt.show()
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=matplotlib.cm.binary, interpolation="nearest")
    plt.axis("off")
def plot_digits(instances, images_per_row=5):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size, size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))  # pad the last row with blanks
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap=matplotlib.cm.binary)
    plt.axis("off")
plt.figure(figsize=(10,10))
example_images = np.r_[X[:12000:600], X[13000:30600:600], X[30600:60000:590]]
plot_digits(example_images, images_per_row=5)
save_fig("more_digits_plot")
plt.show()
y[24000]
The MNIST dataset is already split into a training set (the first 60,000 images) and a test set (the last 10,000 images). Shuffling the training set ensures that all cross-validation folds contain a similar mix of digits:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
To simplify the problem for now, let's only try to identify one digit: the number 3. This "3-detector" will be an example of a binary classifier, capable of distinguishing between just two classes, 3 and not-3.
y_train_3 = (y_train == 3)
y_test_3 = (y_test == 3)
To pick a classifier, a good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. This classifier can handle very large datasets efficiently, because it processes training instances one at a time.
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_3)
sgd_clf.predict([random_digit])
Evaluating a classifier is often significantly trickier than evaluating a regressor, and there are many performance measures available. A good starting point is to measure accuracy using cross-validation:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_3, cv=3, scoring="accuracy")
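If you ever need more control over the cross-validation process than cross_val_score() gives you, you can implement it yourself. Here is a minimal sketch that does roughly the same thing as the call above, using StratifiedKFold:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3)  # same 3-fold split, done by hand

for train_index, test_index in skfolds.split(X_train, y_train_3):
    clone_clf = clone(sgd_clf)  # fresh, untrained copy for each fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_3[train_index]
    X_test_fold = X_train[test_index]
    y_test_fold = y_train_3[test_index]
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    print(np.mean(y_pred == y_test_fold))  # accuracy for this fold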
Over 90% accuracy across all 3 folds looks impressive, but it is misleading: only about 10% of the images are 3s, so a classifier that always guesses "not a 3" would be right about 90% of the time.
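To drive the point home, here is a hypothetical classifier (the Never3Classifier name is ours, not a Scikit-Learn class) that simply predicts "not 3" for every single image:
from sklearn.base import BaseEstimator

class Never3Classifier(BaseEstimator):
    """A do-nothing classifier that always predicts the negative class."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)  # "not 3", every time

never_3_clf = Never3Classifier()
cross_val_score(never_3_clf, X_train, y_train_3, cv=3, scoring="accuracy")
# expect roughly 0.90 on each fold, since only ~10% of the images are 3s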
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 3s with 6s, we would look in the row for 3s and the column for 6s of the confusion matrix. To compute the confusion matrix, we first need a set of predictions. We can get predictions on the training set using the cross_val_predict() function, which returns the predictions made for each instance in the training set.
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_3, y_train_pred)
The first row of this matrix considers non-3 images (the negative class): 51523 were correctly classified as non-3s (true negatives), while the remaining 2346 were wrongly classified as 3s (false positives).
The second row considers images of 3s (the positive class): 733 were wrongly classified as non-3s (false negatives), while the remaining 5398 were correctly classified as 3s (true positives).
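If you prefer these four counts as named values, the 2x2 matrix can be unpacked directly:
tn, fp, fn, tp = confusion_matrix(y_train_3, y_train_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)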
from sklearn.metrics import precision_score, recall_score
precision_score(y_train_3, y_train_pred) # 5398 / (5398 + 2346)
recall_score(y_train_3, y_train_pred) # 5398 / (5398 + 733)
It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if we need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall: whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
from sklearn.metrics import f1_score
f1_score(y_train_3, y_train_pred)
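As a sanity check, the same number can be computed by hand from the precision and recall obtained above, since F1 = 2 x precision x recall / (precision + recall):
p = precision_score(y_train_3, y_train_pred)
r = recall_score(y_train_3, y_train_pred)
2 * p * r / (p + r)  # harmonic mean; should match f1_score above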
Scikit-Learn does not let us set the decision threshold directly, but we can call the classifier's decision_function() method, which returns a score for each instance, and then make predictions using any threshold we want:
y_scores = sgd_clf.decision_function([random_digit])
y_scores
threshold = 0
y_random_digit_pred = (y_scores > threshold)
y_random_digit_pred
threshold = 200000
y_random_digit_pred = (y_scores > threshold)
y_random_digit_pred
To decide which threshold to use, we first get the scores of all instances in the training set, this time telling cross_val_predict() to return decision scores instead of predictions:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_3, cv=3, method="decision_function")
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_3, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.xlim([-700000, 700000])
    plt.ylim([0, 1])
plt.figure(figsize=(10, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
save_fig("pression_recall_vs_threshold_plot")
plt.show()
Suppose we aim for about 90% precision. Reading off the plot, a threshold of around 70000 should do:
y_train_pred_90 = (y_scores > 70000)
precision_score(y_train_3, y_train_pred_90)
recall_score(y_train_3, y_train_pred_90)
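Instead of eyeballing the plot, we can search for the lowest threshold that gives at least 90% precision. A small sketch (note that precisions has one more element than thresholds):
threshold_90_precision = thresholds[np.argmax(precisions[:-1] >= 0.90)]  # first threshold reaching 90%
y_train_pred_90 = (y_scores >= threshold_90_precision)
print(threshold_90_precision)
print(precision_score(y_train_3, y_train_pred_90))  # should be just above 0.90
print(recall_score(y_train_3, y_train_pred_90))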
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0, 1, 0, 1])
plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
save_fig("precision_vs_recall_plot")
plt.show()
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR). The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate (TNR), which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity, so the ROC curve plots sensitivity (recall) versus 1 - specificity. To plot the ROC curve, we first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_3, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
plt.figure(figsize=(8,6))
plot_roc_curve(fpr,tpr)
save_fig("roc_curve_plot")
plt.show()
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC equal to 1, whereas a purely random classifier has a ROC AUC equal to 0.5:
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_3, y_scores)
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_3, cv=3,
method="predict_proba")
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_3,y_scores_forest)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
save_fig("roc_curve_comparison_plot")
plt.show()
roc_auc_score(y_train_3, y_scores_forest)
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_3, cv=3)
precision_score(y_train_3, y_train_pred_forest)
recall_score(y_train_3, y_train_pred_forest)
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes. Some algorithms can handle multiple classes directly, while others are strictly binary; when a binary classifier such as SGDClassifier is given a multiclass training set, Scikit-Learn automatically runs one-versus-all (OvA): it trains 10 binary classifiers, one per digit, and selects the class whose classifier outputs the highest decision score.
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([random_digit])
random_digit_scores = sgd_clf.decision_function([random_digit])
random_digit_scores
np.argmax(random_digit_scores)
sgd_clf.classes_
sgd_clf.classes_[3]
Scikit-Learn can also be forced to use one-versus-one (OvO), which trains one binary classifier for every pair of digits, i.e., 10 x 9 / 2 = 45 classifiers in total:
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([random_digit])
len(ovo_clf.estimators_)
forest_clf.fit(X_train, y_train)
forest_clf.predict([random_digit])
forest_clf.predict_proba([random_digit])
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
Here, we will assume that we have found a promising model and we want to find ways to improve it. One way to do this is to analyze the types of errors it makes.
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
"""To plot with color and a colorbar"""
def plot_confusion_matrix(matrix):
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
cax = ax.matshow(matrix)
fig.colorbar(cax)
plt.matshow(conf_mx, cmap=plt.cm.gray)
save_fig("confusion_matrix_plot", tight_layout=False)
plt.show()
plot_confusion_matrix(conf_mx)
To focus on the errors rather than on absolute counts, let's divide each value in the confusion matrix by the number of images in the corresponding class (the row sums), then fill the diagonal with zeros so that only the errors remain:
rows_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / rows_sums
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
plot_confusion_matrix(norm_conf_mx)
Analyzing individual errors can also give insight. For example, let's plot examples of 3s and 5s, two digits that are easy to confuse:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
save_fig("error_analysis_digits_plot")
plt.show()
In some cases we may want our classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? The following code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9) and the second indicates whether or not it is odd.
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([random_digit])
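There are many ways to evaluate a multilabel classifier. One common approach is to compute the F1 score for each label and average them; here is a sketch using macro averaging (fair warning: this cross-validation run takes a long time on the full training set):
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")  # unweighted mean over the two labels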
Multioutput classification is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values). Let's build a system that removes noise from images: it will take a noisy digit image as input and will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images.
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise  # training inputs: digits with random pixel noise added
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train  # the targets are the original, clean images
y_test_mod = X_test
random_index = 3500
plt.subplot(121); plot_digit(X_test_mod[random_index])
plt.subplot(122); plot_digit(y_test_mod[random_index])
save_fig("noisy_digit_example_plot")
plt.show()
On the left is the noisy input image, and on the right is the clean target image. Now let’s train the classifier and make it clean this image:
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[random_index]])
plot_digit(clean_digit)
save_fig("cleaned_digit_example_plot")
The above image looks close enough to the target!
Github- https://github.com/jaynilpatel
LinkedIn- https://www.linkedin.com/in/jaynil-patel-48b43212b/