Covariate Shift: Starting from a Practical Exercise

From the column 從零學AI (Learning AI from Scratch)

(The aim is to explain the principles behind the phenomenon in the clearest language possible.)

First, the question:

After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the "citizens' data". Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.

You should not add the citizens' data to the training set, because this will cause the training and dev/test set distributions to become different, thus hurting dev and test set performance. True/False?

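This question comes from the Week 1 assignment of Structuring Machine Learning Projects.

So, True or False?

The answer first: False

Intuitive explanation:

(1) Can the data go into the test set? No. It would introduce a degree of covariate shift, and the test set must directly reflect the real-world goal. The goal of this model is to detect birds, so the test set must measure bird detection on the images the system will actually see.

(2) Can the data go into the training set? Yes. The training data may differ from the dev set and test set. In the long run, a broader training set gives the model better generalization and makes it more robust.

How a conventional machine learning mindset would handle this: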

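Original data set: 10,000,000 images of the sky above Peacetopia. Following the usual splitting practice in deep learning, we can split it 95% / 2.5% / 2.5%, which keeps the training set, the cross-validation (dev) set, and the test set on the same distribution. Only when the training and test sets share a distribution can we be sure that the patterns the model learns match our goal, and that the hyperparameters tuned on the dev set suit the real situation.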
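As a rough illustration of the 95% / 2.5% / 2.5% split described above, here is a minimal sketch using only the Python standard library. The function name `split_indices` and its parameters are hypothetical, chosen for this example; it shuffles image indices once and cuts them into three parts so that all three sets are drawn from the same pool.

```python
import random

def split_indices(n_samples, train_frac=0.95, dev_frac=0.025, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train/dev/test.

    Hypothetical helper for illustration: because all three parts come from
    one shuffled pool, they share the same distribution.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_frac)
    n_dev = round(n_samples * dev_frac)
    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]  # whatever remains goes to the test set
    return train, dev, test

# Demonstrated on 100,000 indices to keep the run fast. The same fractions
# applied to 10,000,000 images give 9,500,000 / 250,000 / 250,000.
train_idx, dev_idx, test_idx = split_indices(100_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

With a dataset this large, 2.5% is still a very large dev or test set, which is why the split can be so lopsided toward training.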

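This brings us to the concept this article is about: covariate shift. Models are trained and tested on data that has already happened. Put plainly, covariate shift means that the training set and the test set follow different distributions, which is especially common in practice because new data keeps arriving. Traditional machine learning mostly assumes that the training and test sets are independent and identically distributed: you train in the environment that matches the result you want, and you deploy in that same environment.

Following that line of thinking, in theory we should distribute the additional 1,000,000 images proportionally between the training and test sets, so that what we aim for is what we train on and also what we test on. In practice, too, how many of these extra photos actually contain birds? How good is their quality? We do not have the resources to check them one by one, so the safest approach would seem to be a proportional split between the training and test sets.

This brings up another scenario: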

You've handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your data is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?

Answer:

√ Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.

× Add the 1,000 images into your dataset and reshuffle into a new train/dev/test split.

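So why, in this question, is it wrong to split the data proportionally between the training and test sets?

Here the situation is that a new species of bird has appeared and there are only 1,000 samples (unlike the earlier 1,000,000 images). If those 1,000 images were reshuffled into the full dataset, they would make up only a tiny fraction of the dev/test sets, which would then barely reflect the new species. In my view the best approach is: first address the problem quickly by designing a new evaluation metric, which is the correct answer in the quiz; then keep collecting images of the new species (1,000 is clearly not enough) and train a new model.

How the quiz handles it:

In the quiz, Andrew's approach is rather clever: use the additional 1,000,000 images only as training data, and not as part of the dev/test sets. Would this cause the problem the question describes, namely that the test and training distributions differ, which in turn leads to covariate shift?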

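The answer is no, at least not in a harmful way: the dev and test sets still come from the distribution we care about, so the target has not moved. Covariate shift arises when new data appears in the real world that the old model can no longer regress or classify well. Using the new data directly as training data to change the model is of course not a problem. (It is like the contradiction between the people's ever-growing material and cultural needs and backward social production: the old productive forces can no longer meet the material and cultural needs of the new era, so the assumptions have to change. What is more, even the principal contradiction of the old era no longer fits the new era; when the principal contradiction itself has changed, the model naturally has to change too.) Moreover, adding training samples can effectively alleviate high variance and reduce overfitting to the original data.

Then can the data be used directly as the test set?

Another quiz question illustrates this: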
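The arrangement described above, where the extra images go only into training, can be sketched as follows. This is a minimal illustration under assumptions of my own: `build_splits` is a hypothetical helper, and the `("camera", i)` / `("citizen", i)` tuples are placeholder records standing in for the original images and the citizens' images. The dev and test sets are drawn only from the original data, and the extra data is appended to the training set.

```python
import random

def build_splits(camera_images, citizen_images, dev_size, test_size, seed=0):
    """Hypothetical helper: dev/test come only from the original (camera) data,
    while the extra (citizen) data is added to the training set only."""
    rng = random.Random(seed)
    camera = list(camera_images)
    rng.shuffle(camera)
    dev = camera[:dev_size]
    test = camera[dev_size:dev_size + test_size]
    # Remaining original images plus all of the extra images form the training set.
    train = camera[dev_size + test_size:] + list(citizen_images)
    rng.shuffle(train)
    return train, dev, test

# Placeholder records; in the quiz these would be 10,000,000 and 1,000,000 images.
camera_images = [("camera", i) for i in range(1000)]
citizen_images = [("citizen", i) for i in range(100)]

train, dev, test = build_splits(camera_images, citizen_images, dev_size=25, test_size=25)
print(len(train), len(dev), len(test))
print({source for source, _ in dev + test})  # sources present in dev/test
```

The design point is that the evaluation sets never see the citizens' data, so dev and test performance still measure the model on the distribution it will be deployed on.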

One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens' data images to the test set. You object because:

(1) This would cause the dev and test set distributions to become different. This is a bad idea because you're not aiming where you want to hit.

(2) The test set no longer reflects the distribution of data (security cameras) you most care about.

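When building a model in practice, we must be clear about its goal. The test set exists to measure how the model will perform in the real setting. If the new images were added directly to the test set, the model might appear to perform poorly, leading to a wrong assessment. In practice, covariate shift arises precisely because reality keeps changing and the model keeps encountering new data, which affects the real-world performance of a machine learning model. Adding the new data directly to the test set would undoubtedly aggravate the covariate shift.

With the quiz questions explained, let us look at covariate shift itself. I believe I first noticed the concept in the Batch Normalization paper [4], which uses BN to make deep neural networks easier to train and the models more robust. Many models score very well in data science competitions but perform disappointingly once they are deployed. The reason is that competition data is carefully curated and differs greatly from the constantly changing data of real applications.

Figure: covariate shift changes the test samples, so the learned function no longer matches the target.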

Covariate shift is one kind of dataset shift. It describes a change in the distribution of the input variable X only, and is defined as the case where $P_{train}(y|x)=P_{test}(y|x)$ but $P_{train}(x) \neq P_{test}(x)$ [5].
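To make the definition concrete, here is a small synthetic sketch using only the Python standard library. Everything in it is an assumption made for illustration: the target function $y = x^2$ plus Gaussian noise, the uniform input ranges, and the helper names. The relationship between x and y is identical for all samples, so $P(y|x)$ does not change; only the range that x is drawn from differs between training and the shifted test set. A straight line fitted on the training range is then evaluated on both.

```python
import random

def true_function(x):
    # Assumed target for illustration; P(y|x) is the same for every sample.
    return x * x

def sample(rng, low, high, n, noise_sd=0.1):
    """Draw x uniformly from [low, high] and y = x^2 + Gaussian noise."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [true_function(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given samples."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
x_train, y_train = sample(rng, 0.0, 1.0, 2000)  # P_train(x): uniform on [0, 1]
x_same, y_same = sample(rng, 0.0, 1.0, 2000)    # test set with the same P(x)
x_shift, y_shift = sample(rng, 1.0, 2.0, 2000)  # test set with shifted P(x)

slope, intercept = fit_line(x_train, y_train)
mse_same = mse(x_same, y_same, slope, intercept)
mse_shift = mse(x_shift, y_shift, slope, intercept)
print("MSE on same-distribution test set:", mse_same)
print("MSE on shifted test set:", mse_shift)
```

The line is a reasonable approximation where the training data lives, but on the shifted inputs its error is much larger even though the underlying rule has not changed. This matches the situation in the figure caption above: the learned function stops matching the target once the test samples move.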

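In machine learning practice, we must watch for the effects of a mismatch between the distribution of the training data and the distribution of the data produced in the real world.

This way of thinking is also common in other fields. For example, in soil mechanics, no matter how many soil property tests are run in the laboratory, in an actual pile foundation project we still try to carry out in-situ tests, because the distribution of laboratory soil samples inevitably differs from field conditions. As another example, Xu Xiaonian said the following in an interview with New Fortune magazine: an excellent analyst must have independent views, stay half a step ahead of the market, and predict changes in companies and the market. The prospects of AI applications lie in big data, and machines can only find patterns in massive amounts of historical data. When a company's business model or technology changes fundamentally, or the market departs from its historical trend, the machine has to learn again, but at that point there is not enough data. During this gap, human experience and judgment are indispensable. When facing covariate shift, massive data plus human experience and judgment is the key to solving real problems!

This article aims to introduce the concept of covariate shift. If any of the mathematical descriptions are unclear, corrections are welcome!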

References:

1. http://blog.smola.org/post/4110255196/real-simple-covariate-shift-correction

2. [Machine Learning] An explanation of the covariate shift phenomenon

3. Coursera | Online Courses From Top Universities. Join for Free

4. Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015).

5. http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf

6. Covariate Shift - Unearthing hidden problems in Real World Data Science

7. What is Covariate shift?