Preface:

  1. These are just my own study notes; if anything here infringes on someone's rights, please let me know.
  2. The notes follow the Chinese translation of cs231n by Du Ke (杜客) and others.

k-Nearest Neighbor (kNN) exercise

The kNN classifier consists of two stages:

  1. Training stage: the classifier takes the training data and simply remembers it.
  2. Testing stage: kNN classifies each test image by comparing it to every training image and transferring the labels of the k most similar training examples.

The value of k is chosen by cross-validation. In code, the whole pipeline looks like the sketch below.
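
A minimal sketch of the two stages, assuming the assignment's KNearestNeighbor class (the real steps are walked through in detail below):

from cs231n.classifiers.k_nearest_neighbor import KNearestNeighbor

# Stage 1: "training" just memorizes the data.
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)

# Stage 2: each test image is compared with all training images; the k most
# similar training examples vote on the label (k = 5 here as an example).
y_pred = classifier.predict(X_test, k=5)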

Code walkthrough:

Step 1: Set up the notebook

# Run some setup code for this notebook.
from __future__ import print_function  # __future__ imports must come first
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see "Autoreload of modules in IPython"
%load_ext autoreload
%autoreload 2

Step 2: Load the CIFAR-10 dataset

# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

Notes on the load_CIFAR10 helper (its code lives in cs231n/data_utils.py):

Note:

"load_CIFAR_batch(filename): CIFAR-10 is stored in batches; this loads a single batch.

Returns: X, Y: the data and labels of that batch."

"load_CIFAR10(ROOT): loads the entire CIFAR-10 dataset.

Returns: X_train, Y_train: training data and labels;

X_test, Y_test: test data and labels."
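
For reference, a minimal sketch of what these two loaders look like, following the standard CIFAR-10 batch layout (the course's actual cs231n/data_utils.py may differ in details):

import os
import pickle
import numpy as np

def load_CIFAR_batch(filename):
    """Load one CIFAR-10 batch; return its images and labels."""
    with open(filename, 'rb') as f:
        datadict = pickle.load(f)
        X = datadict['data']    # (10000, 3072) uint8 rows
        Y = datadict['labels']
        # each 3072-byte row is 1024 R + 1024 G + 1024 B values
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype('float')
        return X, np.array(Y)

def load_CIFAR10(ROOT):
    """Load all five training batches and the test batch."""
    xs, ys = [], []
    for b in range(1, 6):
        X, Y = load_CIFAR_batch(os.path.join(ROOT, 'data_batch_%d' % b))
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # (50000, 32, 32, 3)
    Ytr = np.concatenate(ys)  # (50000,)
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte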

Output:

Training data shape:  (50000L, 32L, 32L, 3L)

Training labels shape:  (50000L,)

Test data shape:  (10000L, 32L, 32L, 3L)

Test labels shape:  (10000L,)

## That is, the training set holds 50,000 images; each image is 32*32 pixels, and each pixel is represented by 3 RGB values.

Step 3: Visualize part of the dataset (kNN itself does not need this step; it just shows the developer what the data looks like)

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']  # list of class names
num_classes = len(classes)  # number of classes
samples_per_class = 7  # number of samples drawn per class
for y, cls in enumerate(classes):  # enumerate() yields (index, name); each pass handles one class
    idxs = np.flatnonzero(y_train == y)  # indices of all training examples whose label is y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)  # randomly pick 7 of them
    for i, idx in enumerate(idxs):  # i is the sample counter, idx its position in the training set
        plt_idx = i * num_classes + y + 1  # position within the subplot grid
        plt.subplot(samples_per_class, num_classes, plt_idx)  # the grid is samples_per_class x num_classes (7 x 10)
        plt.imshow(X_train[idx].astype('uint8'))  # draw the image
        plt.axis('off')
        if i == 0:
            plt.title(cls)  # label the column with the class name
plt.show()

## The 50,000 images carry 10 labels: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.

The figure is divided into a 7*10 grid of subplots. Each pass of the inner loop draws one image of one class; a completed inner loop fills one column with images of a single class.

The outer loop finds where the sampled examples of each class sit in X_train and displays them.

Because the samples are drawn at random (by np.random.choice(idxs, samples_per_class, replace=False)), the figure looks slightly different on every run of this code.

Step 4: Subsample and reshape the dataset

## The shape changes from (50000L, 32L, 32L, 3L) to (5000L, 3072L), i.e. each image's pixels are flattened into a single row.

## The training set is cut down to 5,000 examples and the test set to 500.

# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))  # indices of the first 5000 training examples (not random)
X_train = X_train[mask]  # keep those training examples
y_train = y_train[mask]  # and their labels
num_test = 500
mask = list(range(num_test))  # indices of the first 500 test examples
X_test = X_test[mask]
y_test = y_test[mask]
#print(X_test.shape)
#print(y_test.shape)

Output:

(500L, 32L, 32L, 3L)

(500L,)

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))  # flatten each image; -1 lets numpy infer 32*32*3 = 3072 columns
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

Output:

(5000L, 3072L) (500L, 3072L)

Step 5: Train the kNN classifier

## Classifying with kNN involves two parts:

compute the distances between every test example and every training example;

once the distances are computed, for each test example find the k nearest training examples and return the most common of their labels.

from cs231n.classifiers.k_nearest_neighbor import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:
# the Classifier simply remembers the data and does no further processing
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)  # "training" just stores the data

The train function inside KNearestNeighbor is shown below.

## For kNN, training amounts to storing the data and nothing more.
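
A minimal sketch of that train method (the body in cs231n/classifiers/k_nearest_neighbor.py is essentially two assignments):

def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this just
    memorizes the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data.
    - y: A numpy array of shape (num_train,) containing the training labels.
    """
    self.X_train = X
    self.y_train = y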

Step 6: Compute distances with compute_distances_two_loops

# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.
# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

There are two main distance formulas for kNN: L1 (Manhattan distance) and L2 (Euclidean distance). For reference, with p indexing the pixels of images I_1 and I_2:
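
d_1(I_1, I_2) = \sum_p \lvert I_1^p - I_2^p \rvert

d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}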

The compute_distances_two_loops() function is given below; the exercise requires the L2 formula:

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]  # number of test examples
    num_train = self.X_train.shape[0]  # number of training examples
    dists = np.zeros((num_test, num_train))  # initialize the distance matrix
    for i in xrange(num_test):
      for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        # (the original `pass` placeholder is replaced by the solution)
        # L2 distance: sqrt of the squared differences summed over pixels
        dists[i, j] = np.sqrt(np.dot(X[i] - self.X_train[j], X[i] - self.X_train[j]))
        #####################################################################
        #                        END OF YOUR CODE                           #
        #####################################################################
    return dists

Output: (500L, 5000L)
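
As an aside, the same matrix can be computed with no Python loops at all via the expansion ||x - t||^2 = ||x||^2 - 2*x.t + ||t||^2 (this is essentially what the assignment's later compute_distances_no_loops part asks for). A sketch, using the flattened X_test (500, 3072) and X_train (5000, 3072) from Step 4:

# Fully vectorized L2 distances (same result as the double loop):
sq = (np.sum(X_test ** 2, axis=1)[:, np.newaxis]   # (500, 1) test norms
      - 2.0 * X_test.dot(X_train.T)                # (500, 5000) cross terms
      + np.sum(X_train ** 2, axis=1))              # (5000,) broadcast train norms
dists_vec = np.sqrt(np.maximum(sq, 0))  # clamp round-off negatives before sqrt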

Step 7: Visualize the dists distance matrix

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()

Output:

The shade of each pixel encodes the distance (the closer the colors, the closer the distance values).

Step 8: Run the label-prediction function predict_labels()

The predict_labels() code is as follows:

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y_pred: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y_pred[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]  # number of test examples
    y_pred = np.zeros(num_test)  # initialize the prediction vector
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      dists_order = np.argsort(dists[i])  # training indices sorted by increasing distance
      closest_y = self.y_train[dists_order[:k]]  # labels of the k nearest neighbors
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################
      # np.bincount counts each label; np.argmax returns the first (i.e.
      # smallest) label with the maximum count, which breaks ties as required.
      y_pred[i] = np.argmax(np.bincount(closest_y))
      #########################################################################
      #                           END OF YOUR CODE                            #
      #########################################################################

    return y_pred
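
With the body filled in, the notebook then predicts with k = 1 and measures accuracy; a sketch of that cell:

# Predict labels for the test examples using the precomputed distances.
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples.
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))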

Step 9: Cross-validation

# Cross-validation is used both to pick the value of k and to decide whether L1 or L2 works better. These choices are called hyperparameters.

# The idea: split the training set into training folds and a validation fold, and use them to screen for good hyperparameters.

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []

################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
# Split the training set into 5 folds.
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                               END OF YOUR CODE                               #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and the#
# last fold as a validation set. Store the accuracies for all fold and all    #
# values of k in the k_to_accuracies dictionary.                              #
################################################################################
# For each k, run cross-validation. This mirrors the earlier train/test
# accuracy computation: four folds are combined into a training set and
# evaluated against the held-out fold.
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):  # fold i is held out as the validation set
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])  # the other four folds
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])

        # Collapse the fold axis: 4/5 of the rows for training, 1/5 for
        # validation; -1 lets numpy infer the number of columns itself.
        Xtr = np.reshape(Xtr, (X_train.shape[0] * 4 / 5, -1))
        ytr = np.reshape(ytr, (y_train.shape[0] * 4 / 5,))  # keep labels 1-D
        Xte = np.reshape(Xte, (X_train.shape[0] / 5, -1))
        yte = np.reshape(yte, (y_train.shape[0] / 5,))

        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy
################################################################################
#                               END OF YOUR CODE                               #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

Step 10: Plot the results

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)  # one scatter point per fold

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

Step 11: Pick the best k and evaluate on the test set
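
The closing cell looks roughly like the following; best_k should be whichever value had the highest mean cross-validation accuracy above (read it off your own plot):

# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data.
best_k = 10  # assumption: replace with the best k from your own cross-validation

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))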
