A few words up front:

  1. These are just my personal study notes; if anything here infringes a copyright, please let me know.
  2. The notes follow the Chinese translation of cs231n by 杜客 and others.

k-Nearest Neighbor (kNN) exercise

The kNN classifier consists of two stages:

  1. Training stage: the classifier simply memorizes the training data.
  2. Test stage: kNN compares each test image against all training images and predicts its label from the labels of the k most similar training examples.

The value of k is chosen by cross-validation.

Code walkthrough:

step 1 Load the required setup

# Run some setup code for this notebook.
from __future__ import print_function  # keep __future__ imports first
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see "Autoreload of modules in IPython"
%load_ext autoreload
%autoreload 2

step 2 Load the CIFAR-10 dataset

# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)

The load_CIFAR10 function's code is given below.

note:

「load_CIFAR_batch(filename): the CIFAR-10 dataset is stored as separate batches; this loads a single batch.

Returns: X, Y: the data and labels of one CIFAR batch.」

「load_CIFAR10(ROOT): reads and loads the entire CIFAR-10 dataset.

X_train, Y_train: training-set data and labels;

X_test, Y_test: test-set data and labels.」
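A minimal sketch of the two functions, consistent with the descriptions above (the starter code ships this in cs231n/data_utils.py; details such as the pickle module used may differ by version):

import os
import pickle  # the original Python 2 starter code uses cPickle
import numpy as np

def load_CIFAR_batch(filename):
    """Load a single batch of CIFAR-10; returns data X and labels Y."""
    with open(filename, 'rb') as f:
        datadict = pickle.load(f)
        X = datadict['data']
        Y = datadict['labels']
        # reshape the flat rows into (N, 32, 32, 3) images
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype('float')
        Y = np.array(Y)
    return X, Y

def load_CIFAR10(ROOT):
    """Load all of CIFAR-10: the five training batches plus the test batch."""
    xs, ys = [], []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % b)
        X, Y = load_CIFAR_batch(f)
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)  # (50000, 32, 32, 3)
    Ytr = np.concatenate(ys)  # (50000,)
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte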

Output:

Training data shape: (50000L, 32L, 32L, 3L)

Training labels shape: (50000L,)

Test data shape: (10000L, 32L, 32L, 3L)

Test labels shape: (10000L,)

## That is, the training set contains 50,000 images, each 32*32 pixels, with every pixel represented by 3 RGB values.

step 3 Visualize part of the dataset (this step is not needed for kNN itself; it just shows what the data looks like)

# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']  # list of class names
num_classes = len(classes)  # number of classes
samples_per_class = 7       # number of samples to show per class
for y, cls in enumerate(classes):  # y is the class index, cls the class name;
    # enumerate() yields (index, value), so each pass handles one class
    idxs = np.flatnonzero(y_train == y)  # positions of all training images whose label equals y
    idxs = np.random.choice(idxs, samples_per_class, replace=False)  # randomly pick 7 of them, without replacement
    for i, idx in enumerate(idxs):  # i is the sample number, idx its position in the training set
        plt_idx = i * num_classes + y + 1  # position of this subplot in the grid
        plt.subplot(samples_per_class, num_classes, plt_idx)  # the grid is samples_per_class x num_classes (7 x 10)
        plt.imshow(X_train[idx].astype('uint8'))  # draw the image
        plt.axis('off')
        if i == 0:
            plt.title(cls)  # label the column with the class name
plt.show()

## The 50,000 images carry 10 labels: plane, car, bird, cat, deer, dog, frog, horse, ship, truck.

The figure is divided into a 7*10 grid of subplots. Each pass of the inner loop draws one image of one class, so one complete inner loop fills one column, i.e. one class.

The outer loop picks, for each class, the positions of the sampled images in X_train and draws them in the grid.

Because the samples are drawn at random (by np.random.choice(idxs, samples_per_class, replace=False)), the displayed figure differs slightly each time this code is run.
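A quick worked check of the subplot index: for class y = 2 (bird) and sample i = 3, plt_idx = 3 * 10 + 2 + 1 = 33. In a 7*10 grid numbered row by row, subplot 33 sits in row 4, column 3; in general, sample i of class y lands in row i + 1 of column y + 1.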

step 4 Resize the datasets

## Reshape from the (50000L, 32L, 32L, 3L) form into the (5000L, 3072L) form, i.e. represent all the pixel values of one image as a single row.

## This step also shrinks the training set to 5000 images and the test set to 500.

# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))  # indices 0..4999: the first num_training samples (not random)
X_train = X_train[mask]  # keep the first 5000 training samples
y_train = y_train[mask]  # and their labels
num_test = 500
mask = list(range(num_test))  # indices 0..499: the first num_test samples
X_test = X_test[mask]  # keep the first 500 test samples
y_test = y_test[mask]  # and their labels
#print(X_test.shape)
#print(y_test.shape)

Output:

(500L, 32L, 32L, 3L)

(500L,)

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))  # flatten each image; the -1 dimension works out to 32*32*3 = 3072
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)

Output:

(5000L, 3072L) (500L, 3072L)

step 5 Train the kNN classifier

## kNN prediction (not training) consists of two parts:

compute the distances between every test sample and every training sample;

then, for each test sample, find its k nearest training samples and return a label based on theirs.

from cs231n.classifiers.k_nearest_neighbor import KNearestNeighbor

# Create a kNN classifier instance.
# Remember that training a kNN classifier is a noop:  # "training" does no real work
# the Classifier simply remembers the data and does no further processing  # it only stores the data
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)  # call the training function

The train function inside KNearestNeighbor() is as follows:

## For kNN, "training" is nothing more than storing the data.
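A minimal sketch of that train method, matching the behavior described above (the real code lives in cs231n/classifiers/k_nearest_neighbor.py):

def train(self, X, y):
    """
    Train the classifier. For k-nearest neighbors this is just
    memorizing the training data.

    Inputs:
    - X: A numpy array of shape (num_train, D) containing the training data.
    - y: A numpy array of shape (num_train,) containing the training labels.
    """
    self.X_train = X
    self.y_train = y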

step 6 Compute distances with the classifier's compute_distances_two_loops

# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.
# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

kNN mainly uses one of two distance formulas: L1 (Manhattan distance) and L2 (Euclidean distance):
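These are the standard definitions, with $I_1$ and $I_2$ the two image vectors and the sum running over all pixels $p$:

$$d_1(I_1, I_2) = \sum_p \left| I_1^p - I_2^p \right|$$

$$d_2(I_1, I_2) = \sqrt{\sum_p \left( I_1^p - I_2^p \right)^2}$$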

The compute_distances_two_loops() function is shown below; the exercise asks for the L2 distance:

def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]               # number of test samples
    num_train = self.X_train.shape[0]   # number of training samples
    dists = np.zeros((num_test, num_train))  # initialize the distance matrix
    for i in xrange(num_test):
        for j in xrange(num_train):  # this inner part is the code we fill in
            #####################################################################
            # TODO:                                                             #
            # Compute the l2 distance between the ith test point and the jth   #
            # training point, and store the result in dists[i, j]. You should  #
            # not use a loop over dimension.                                    #
            #####################################################################
            # L2 distance: the square root of the dot product of the
            # difference vector with itself
            dists[i, j] = np.sqrt(np.dot(X[i] - self.X_train[j], X[i] - self.X_train[j]))
            #####################################################################
            #                       END OF YOUR CODE                            #
            #####################################################################
    return dists

Output: (500L, 5000L)
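The same matrix can also be computed with no Python loops at all by expanding the square, ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, over all pairs at once; the notebook's later compute_distances_no_loops exercise asks for exactly this. A minimal sketch, written here as a standalone function and assuming X and X_train are the flattened (N, 3072) arrays from step 4:

def compute_distances_no_loops(X, X_train):
    # Squared norm of every test row and every training row.
    test_sq = np.sum(X ** 2, axis=1, keepdims=True)  # shape (num_test, 1)
    train_sq = np.sum(X_train ** 2, axis=1)          # shape (num_train,)
    # All cross terms at once via a single matrix multiply.
    cross = X.dot(X_train.T)                         # shape (num_test, num_train)
    # Broadcasting combines the three pieces into the full pairwise matrix.
    return np.sqrt(test_sq + train_sq - 2 * cross)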

step 7 Visualize the distance matrix dists

# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()

Output:

Distance is shown as color intensity (similar colors mean similar distance values).

step 8 Run the label-prediction function predict_labels()

The predict_labels() code is as follows:

def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y_pred: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y_pred[i] is the predicted label for the test point X[i].
    """
    num_test = dists.shape[0]    # number of test samples
    y_pred = np.zeros(num_test)  # initialize the vector of predicted labels
    for i in xrange(num_test):
        # A list of length k storing the labels of the k nearest neighbors to
        # the ith test point.
        closest_y = []
        #########################################################################
        # TODO:                                                                 #
        # Use the distance matrix to find the k nearest neighbors of the ith    #
        # testing point, and use self.y_train to find the labels of these       #
        # neighbors. Store these labels in closest_y.                           #
        # Hint: Look up the function numpy.argsort.                             #
        #########################################################################
        # argsort sorts row i in ascending order and returns the indices
        dists_order = np.argsort(dists[i])
        closest_y = self.y_train[dists_order[:k]]  # labels of the k nearest neighbors
        #########################################################################
        # TODO:                                                                 #
        # Now that you have found the labels of the k nearest neighbors, you    #
        # need to find the most common label in the list closest_y of labels.   #
        # Store this label in y_pred[i]. Break ties by choosing the smaller     #
        # label.                                                                #
        #########################################################################
        # bincount tallies each label; argmax returns the most common one,
        # and on a tie it returns the smaller label, as required
        y_pred[i] = np.argmax(np.bincount(closest_y))
        #########################################################################
        #                           END OF YOUR CODE                            #
        #########################################################################

    return y_pred
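To check the implementation, the notebook next predicts test labels from dists with k = 1 and measures accuracy; a sketch of that cell (the original exercise says to expect roughly 27% accuracy on this 5000/500 subsample):

# Predict labels for the 500 test images using the single nearest neighbor.
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples.
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))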

step 9 Cross-validation

# Cross-validation is mainly used to choose the value of k and to decide whether L1 or L2 works better. Such choices are called hyperparameters.

# The idea: split the training set into training folds and a validation fold, and use them to pick good hyperparameters.

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []

################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and   #
# y_train_folds should each be lists of length num_folds, where               #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].    #
# Hint: Look up the numpy array_split function.                               #
################################################################################
# Split the training set into 5 folds
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                              END OF YOUR CODE                                #
################################################################################

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
# Initialize the dictionary that maps each k to its per-fold accuracies
k_to_accuracies = {}

################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each       #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,  #
# where in each case you use all but one of the folds as training data and the#
# last fold as a validation set. Store the accuracies for all fold and all    #
# values of k in the k_to_accuracies dictionary.                              #
################################################################################
# For each k, cross-validate exactly the way we computed test accuracy above:
# merge four folds into a training set and evaluate on the held-out fold.
for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):  # loop over the held-out fold
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])  # merge the four training folds
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])  # the held-out validation fold
        yte = np.array(y_train_folds[i])

        # Flatten the stacked folds back into rows of images; the -1 lets numpy
        # work out the column count itself (integer division under Python 2).
        # The labels are kept 1-D so predict_labels can index and bincount them.
        Xtr = np.reshape(Xtr, (X_train.shape[0] * 4 / 5, -1))
        ytr = np.reshape(ytr, (y_train.shape[0] * 4 / 5,))
        Xte = np.reshape(Xte, (X_train.shape[0] / 5, -1))
        yte = np.reshape(yte, (y_train.shape[0] / 5,))

        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy
################################################################################
#                              END OF YOUR CODE                                #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))

step 10 Plot the results

# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)  # scatter plot of the raw per-fold accuracies

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

step 11 Retrain with the best k
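Following the notebook's final cell: pick the k with the highest mean cross-validation accuracy from the plot above, retrain on the full 5000-image training set, and evaluate on the 500 test images. A sketch, with best_k = 10 as an assumed placeholder (read the actual best value off your own plot):

# Based on the cross-validation results above, choose the best value for k,
# retrain the classifier using all the training data, and test it on the test
# data.
best_k = 10  # assumed here; set this to the k that did best in your run

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))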
