Covariate Shift: Starting from a Practical Quiz Question

From the column 从零学AI (Learning AI from Scratch)

(The aim is to explain the principle behind the phenomenon in the clearest language possible.)

First, the question:

After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the "citizens' data". Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.

You should not add the citizens' data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?

This question comes from the Week 1 assignment of Structuring Machine Learning Projects.

So, True or False?

The answer first: False.

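Intuitive explanation:

(1) Can the data go into the test set? No. Doing so would introduce covariate shift, and the test set must directly reflect the real-world goal. This model's goal is to recognize birds, so the test set must test exactly that.

(2) Can the data go into the training set? Yes. The training set may differ from the dev set and the test set. In the long run, a broader training set gives the model better generalization and makes it more robust.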

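How the conventional machine learning approach would handle it:

Original data set: 10,000,000 images of the sky above Peacetopia. Following common data-splitting practice, in deep learning we can split it 95% / 2.5% / 2.5%, so that the training set, the cross-validation (dev) set, and the test set all share the same distribution. Only when the training and test sets are identically distributed can we be sure that the patterns the model learns match our goal, and that the hyperparameters tuned on the dev set suit the real situation.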
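As a minimal sketch of the 95% / 2.5% / 2.5% split described above: shuffle the example indices once, then cut them into three parts. The function name `split_indices`, the fixed seed, and the scaled-down data set size are my own assumptions for illustration, not part of the course material.

```python
import random


def split_indices(n_examples, fractions=(0.95, 0.025, 0.025), seed=0):
    """Shuffle example indices once, then cut them into train/dev/test."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # one shuffle keeps all splits on the same distribution
    n_train = round(n_examples * fractions[0])
    n_dev = round(n_examples * fractions[1])
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test


# A scaled-down stand-in for the 10,000,000 images (same proportions).
train, dev, test = split_indices(100_000)
print(len(train), len(dev), len(test))
```

Because every index is drawn from the same shuffled pool, the three subsets are random samples of one distribution, which is exactly the property the conventional approach relies on.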

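This brings us to the concept this article is about: covariate shift. Models are trained and tested on data that has already happened. Put plainly, covariate shift means that the training set and the test set follow different distributions, which is especially common in practice, where new data keeps being generated. Traditional machine learning mostly assumes that the training and test sets are independent and identically distributed: whatever result you want, you train in the corresponding environment and eventually deploy in that same environment.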

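Following this original line of thinking, in theory we should distribute the extra 1,000,000 images proportionally between the training and test sets, so that we train on whatever the goal is and finally test on the same thing. Moreover, in practice, how many of these extra photos actually contain birds? What is their quality? We do not have the resources to check them one by one, so the safest approach would seem to be splitting them proportionally between the training and test sets.

This touches on another situation: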

You've handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your data is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?

Answer:

√ Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.

× Add the 1,000 images into your dataset and reshuffle into a new train/dev/test split.

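So why is splitting the data proportionally between the training and test sets wrong in this question?

Here, a new species of bird has appeared, and there are only 1,000 samples of it (unlike the 1,000,000 images earlier). In my view, the best way to handle it is: first, address the problem quickly by designing a new evaluation metric, which is the correct answer in the quiz; then keep collecting images of the new species (1,000 is clearly not enough) in order to train a new model.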
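One way such a metric could look, purely as a sketch: report accuracy on the original dev examples and on the new-species dev examples separately, then combine them with a weight. The quiz does not specify any particular metric; the function names, the `new_weight` parameter, and the toy labels below are all hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def species_aware_metric(old_labels, old_preds, new_labels, new_preds, new_weight=0.5):
    """Score the old data and the new species separately, then blend the two.

    new_weight is an assumed knob: how much the new species counts in the
    combined score, regardless of how few new images exist.
    """
    acc_old = accuracy(old_labels, old_preds)
    acc_new = accuracy(new_labels, new_preds)
    combined = (1 - new_weight) * acc_old + new_weight * acc_new
    return {"old": acc_old, "new_species": acc_new, "combined": combined}


# Toy example: perfect on the old data, half right on the new species.
scores = species_aware_metric(
    old_labels=[1, 0, 1, 1], old_preds=[1, 0, 1, 1],
    new_labels=[1, 1, 1, 1], new_preds=[1, 0, 0, 1],
)
print(scores)
```

The point of weighting rather than pooling is that 1,000 new images would barely move a plain accuracy computed over millions of examples, so a pooled metric could hide exactly the degradation the city is complaining about.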

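How the quiz handles it:

In the quiz, Andrew's approach is quite neat: use the extra 1,000,000 images only as training data, and not as dev/test data. Would this cause the problem the question describes, namely a difference between the test and training set distributions that in turn leads to covariate shift?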

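The answer is: no. Covariate shift arises because new data appears in the real world that the old model can no longer regress or classify well. If you use the new data directly as training data to change the model, there is of course no problem. (This is like the contradiction between the people's ever-growing material and cultural needs and backward social production: the old productive forces can no longer meet the material and cultural needs of the new era, so the assumptions naturally have to change. What is more, even the principal contradiction of the old era no longer fits the new era; when the principal contradiction itself has changed, the model naturally has to change too.) In addition, adding training samples can effectively mitigate high variance and reduce overfitting to the original data.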
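The bookkeeping behind this answer can be sketched in a few lines: the citizens' images are appended to the training set only, while the dev and test sets keep coming exclusively from the original (security camera) distribution. The record format `(source, image_id)`, the function name `build_splits`, and the scaled-down sizes are assumptions made for illustration.

```python
def build_splits(camera_train, camera_dev, camera_test, citizens):
    """Add the citizens' images to the training set only; leave dev/test untouched."""
    train = list(camera_train) + list(citizens)
    dev = list(camera_dev)
    test = list(camera_test)
    return train, dev, test


# Hypothetical records: (source, image_id), sizes scaled down for the example.
camera = [("camera", i) for i in range(1000)]
citizens = [("citizens", i) for i in range(100)]

train, dev, test = build_splits(camera[:950], camera[950:975], camera[975:], citizens)

print(len(train), len(dev), len(test))
print({source for source, _ in dev + test})  # dev/test contain camera images only
```

The training set becomes broader, but the target the model is evaluated against stays the same.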

Can the data be used directly as the test set, then?

Another quiz question illustrates this:

One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens' data images to the test set. You object because:

(1) This would cause the dev and test set distributions to become different. This is a bad idea because you're not aiming where you want to hit.

(2) The test set no longer reflects the distribution of data (security cameras) you most care about.

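When building a model in practice, we must be clear about what the model is meant to achieve. The test set exists to measure how well the model will perform in the real setting. If the new images are added directly to the test set, the model may simply appear to perform poorly, which leads to misjudging it. In practice, covariate shift arises precisely because reality keeps changing and the model keeps encountering new data, which in turn degrades the real-world performance of the machine learning model. Adding the new data directly to the test set would undoubtedly aggravate covariate shift.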

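With the quiz questions settled, let us look at covariate shift itself. I probably first noticed the concept in the Batch Normalization paper [4], which uses BN to make deep neural networks easier to train and also makes models more robust. Many models achieve excellent results in data science competitions, yet perform disappointingly once they are put to real use. The reason is that competition data is carefully curated and differs greatly from the constantly changing data of real application scenarios.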

(Figure: covariate shift changes the test samples, so that the learned function no longer matches the target.)

Covariate shift is a type of dataset shift. It describes a change in the distribution of the input variable $X$ only, and is defined as the case where $P_{train}(y|x)=P_{test}(y|x)$ but $P_{train}(x) \neq P_{test}(x)$ [5].
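To make the definition concrete, here is a small simulation sketch. The relationship between $x$ and $y$ is held fixed, so $P(y|x)$ is identical everywhere, and only the input distribution $P(x)$ moves. The quadratic target, the Gaussian inputs, the straight-line model, and all names and constants are my own assumptions chosen for illustration.

```python
import random


def true_function(x):
    # P(y|x) is the same for training and test data: y = x^2 plus small noise.
    return x * x


def sample(rng, mean, n):
    """Draw n inputs around `mean`; only this input distribution P(x) ever changes."""
    xs = [rng.gauss(mean, 0.5) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = sample(rng, mean=1.0, n=2000)  # training inputs centred at 1
same_x, same_y = sample(rng, mean=1.0, n=2000)    # test inputs, same P(x)
shift_x, shift_y = sample(rng, mean=3.0, n=2000)  # test inputs, shifted P(x)

model = fit_line(train_x, train_y)
mse_same = mse(model, same_x, same_y)
mse_shift = mse(model, shift_x, shift_y)
print(f"MSE with the same P(x): {mse_same:.3f}, with a shifted P(x): {mse_shift:.3f}")
```

The line is a reasonable local approximation of the curve where the training inputs live, but it is a poor one where the shifted test inputs live, so the error grows sharply even though the underlying rule never changed.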

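In machine learning practice, we must always pay attention to the effect of a mismatch between the distribution of the training data and the distribution of the data produced in the real world.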

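In fact, this way of thinking is common in other fields as well. In soil mechanics, for example, no matter how many soil property tests are run in the laboratory, in an actual pile foundation project we still try to carry out an in-situ test, because the distribution of laboratory soil is bound to differ from real conditions. As another example, when Xu Xiaonian (许小年) was interviewed by New Fortune (新财富) magazine, he said something along these lines: an excellent analyst must have independent insights, must be able to stay half a step ahead of the market, and must predict changes in companies and markets. AI applications rest on big data, and a machine can only look for patterns in massive amounts of historical data. When a company's business model or technology changes fundamentally, or when the market departs from its historical trend, the machine has to learn all over again, but at that point there is not enough data, and during this gap human experience and judgment are indispensable. When facing covariate shift, massive data plus human experience and judgment is the key to solving real problems!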

This article aims to introduce the concept of covariate shift. If any of the mathematical descriptions are unclear, corrections are very welcome!

References:

1. http://blog.smola.org/post/4110255196/real-simple-covariate-shift-correction

2. 【机器学习】covariate shift现象的解释 ([Machine Learning] An Explanation of the Covariate Shift Phenomenon)

3. Coursera | Online Courses From Top Universities. Join for Free

4. Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

5. http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf

6. Covariate Shift - Unearthing hidden problems in Real World Data Science

7. What is Covariate shift?