An Implementation of a Deep Relevance Matching Model (DRMM)
0. Background
When retrieval relevance comes up, one thinks of the traditional BM25, as well as global-representation models such as DSSM. Where DSSM focuses on semantic matching, DRMM focuses on relevance matching.
As the paper (A Deep Relevance Matching Model for Ad-hoc Retrieval) puts it:
In this way, the relevance matching could happen in any part of a relevant document, and we
do not require the document as a whole to be relevant to a query.
In real applications the model can be trained on user clicks, submissions, or moderation data. Below, the implementation details are walked through using a wiki question-answering dataset as an example.
1. Data Processing
1.1 Input (sample.txt)
0 how are glacier caves formed ? A partly submerged glacier cave on Perito Moreno Glacier .
0 how are glacier caves formed ? The ice facade is approximately 60 m high
0 how are glacier caves formed ? Ice formations in the Titlis glacier cave
1 how are glacier caves formed ? A glacier cave is a cave formed within the ice of a glacier .
0 how are glacier caves formed ? Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice .
0 How are the directions of the velocity and force vectors related in a circular motion In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path .
0 How are the directions of the velocity and force vectors related in a circular motion It can be uniform , with constant angular rate of rotation ( and constant speed ) , or non-uniform with a changing rate of rotation .
........
Format: label|q|d
Delimiter: a tab character
1.2 Data Preparation
1.2.1 Generate the corpus (corpus.txt)
Q2470 what is homebrew for wii
Q2477 where is the brisket from
Q2476 who is mary matalin married to
Q2475 who are all of the jonas brothers
Q2474 what is the main component of vaccines
D1222 The Australian Recording Industry Association Music Awards ( commonly known as ARIA Music Awards or ARIA Awards ) is an annual series of awards nights celebrating the Australian music industry , put on by the Australian Recording Industry Association ( ARIA ) .
D1223 The event has been held annually since 1987 and encompasses the general genre-specific and popular awards ( these are what is usually being referred to as `` the ARIA awards ) as well as Fine Arts Awards and Artisan Awards ( held separately from 2004 ) , Lifetime Achievement Awards and ARIA Hall of Fame – held separately from 2005 to 2010 but returned to the general ceremony in 2011 .
D1220 In India , sound was the transformative element that led to the rapid expansion of the nation s film industry—the most productive such industry in the world since the early 1960s .
D1221 African people
.......
Format: q_id|q or d_id|d
Delimiter: space
q_id format: "Q" followed by an auto-incrementing number
d_id format: "D" followed by an auto-incrementing number
1.2.2 Generate the relation file (relation_train.txt)
0 Q1 D0
0 Q1 D1
0 Q1 D2
1 Q1 D3
0 Q1 D4
0 Q2 D5
0 Q2 D6
.......
Format: label|q_id|d_id
Delimiter: space
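The two generation steps above (1.2.1 and 1.2.2) can be sketched in Python. This is a minimal illustration, assuming sample.txt is tab-separated (label, q, d) and that IDs simply auto-increment from 0; the function name and numbering scheme are illustrative, not from any particular library:

```python
from collections import OrderedDict

def build_corpus_and_relations(sample_path, corpus_path, relation_path):
    """Split each `label \t q \t d` line into corpus and relation entries."""
    q_ids, d_ids = OrderedDict(), OrderedDict()
    relations = []
    with open(sample_path) as f:
        for line in f:
            label, q, d = line.rstrip('\n').split('\t')
            if q not in q_ids:                      # deduplicate repeated queries
                q_ids[q] = 'Q%d' % len(q_ids)
            if d not in d_ids:
                d_ids[d] = 'D%d' % len(d_ids)
            relations.append((label, q_ids[q], d_ids[d]))
    with open(corpus_path, 'w') as f:               # one `id text` line per entry
        for text, cid in list(q_ids.items()) + list(d_ids.items()):
            f.write('%s %s\n' % (cid, text))
    with open(relation_path, 'w') as f:             # one `label q_id d_id` line per pair
        for label, qid, did in relations:
            f.write('%s %s %s\n' % (label, qid, did))
```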
1.3 Data Preprocessing
1.3.1 Generate the vocabulary (word_dict.txt)
latest 0
season 1
psych 2
fruitcake 3
korean 4
money 5
called 6
homebrew 7
wii 8
brisket 9
mary 10
matalin 11
married 12
jonas 13
brothers 14
main 15
component 16
......
Format: word|word_id
Delimiter: space
1.3.2 Generate word statistics (word_stats.txt)
0 9 9 7.787672
1 331 276 4.466240
2 4 4 8.480820
3 2 2 8.991645
4 22 18 7.145819
5 92 78 5.720810
6 425 411 4.069234
7 11 11 7.605351
8 29 22 6.954763
9 5 5 8.298498
10 39 37 6.452671
.......
Format: word_id|total term frequency|document frequency|log((1+total number of documents)/(1+document frequency))
Delimiter: space
The last column can be regarded as the word's IDF.
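The statistics above can be computed with a short sketch; `word_stats` is an illustrative helper (not an existing API), assuming the corpus entries are already tokenized:

```python
import math
from collections import Counter

def word_stats(docs):
    """docs: list of token lists. Returns {word: (tf, df, idf)} where
    tf is total term frequency, df is document frequency, and
    idf = log((1 + N) / (1 + df)) with N the number of corpus entries."""
    tf, df = Counter(), Counter()
    for tokens in docs:
        tf.update(tokens)        # count every occurrence
        df.update(set(tokens))   # count each document at most once
    n = len(docs)
    return {w: (tf[w], df[w], math.log((1. + n) / (1. + df[w]))) for w in tf}
```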
1.3.3 Generate the preprocessed corpus (replacing each text with its word_id sequence) (corpus_preprocessed.txt)
Q2473 3 0 1 2
Q2472 1 3
Q2471 3 4 5 6
Q2470 2 7 8
Q2477 1 9
Q2476 3 10 11 12
Q2475 2 13 14
Q2474 3 15 16 17
.......
Format: q_id (or d_id)|number of words in the text|word_id sequence
Delimiter: space
1.4 Generate the embedding mapping
1.4.1 Prepare the embedding model file (e.g. glove.6B.300d.txt)
Either download pre-trained word embeddings or train them on your own corpus.
An example of the model file:
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353 0.59868 0.28825 -0.11547 -0.041848 -0.67989 -0.25063 0.18472 0.086876 0.46582 0.015035 0.043474 -1.4671 -0.30384 -0.023441 0.30589 -0.21785 3.746 0.0042284 -0.18436 -0.46209 0.098329 -0.11907 0.23919 0.1161 0.41705 0.056763 -6.3681e-05 0.068987 0.087939 -0.10285 -0.13931 0.22314 -0.080803 -0.35652 0.016413 0.10216
of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375
to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044
Format: word|n-dimensional float vector (n is fixed when the embeddings are trained)
Delimiter: space
1.4.2 Generate embeddings for the vocabulary (embed_glove_d300.txt)
1172 1.5525 0.42452 0.46391 0.20761 -0.33586 0.85766 0.59856 0.78584 0.50756 -0.37813 1.2359 1.0378 0.93049 0.11673 -1.9087 0.57627 -0.43614 1.2179 0.15718 0.03569 0.34481 -0.93458 0.54173 0.11415 0.72905 0.77742 0.54737 0.28123 -0.48073 -0.32181 -0.098982 0.39377 0.64479 -0.39624 0.055439 0.63578 -0.29754 -0.096153 0.86403 -0.11456 0.47816 -1.6625 -0.85222 0.039357 0.85039 0.086544 0.51599 -0.44902 0.45219 -0.98423 ......
5282 0.15187 0.81713 -0.80558 -1.2145 0.27922 0.41656 -0.52632 -0.86776 -0.69063 0.83811 -0.66271 0.81108 -0.19408 -0.65114 0.14332 -1.037 -0.34031 -1.154 0.53 0.52483 0.16909 0.2206 0.67088 -0.12022 0.88935 -0.6516 -1.4625 -0.82655 -0.83824 0.091111 1.4623 -1.4354 0.33846 -1.4664 -0.40797 0.04179 -0.43726 0.43884 -0.81091 1.0104 0.83639 0.012196 -0.30462 0.87028 0.11972 0.26389 -0.045789 0.31057 -0.53455 -1.0437 ......
11492 -0.12748 0.085313 1.038 0.51177 -0.82343 1.2981 0.73726 0.32713 1.3706 0.71542 1.3761 -0.54163 -0.30436 1.4557 -0.080133 0.69019 -1.6733 0.46343 0.17175 -0.027124 -0.30607 -0.023095 -0.072841 -0.63228 -0.13369 -0.72023 -0.65965 -0.013201 -0.71407 -0.053591 0.23813 0.424 0.039012 -0.088765 -0.63599 0.67317 1.5351 1.0761 0.73815 -0.62496 0.52729 0.061921 -1.2465 0.90744 0.33899 -0.18433 0.4204 -0.14775 -0.72917 -0.13983 ......
........
Format: word_id|n-dimensional float vector
Delimiter: space
If a vocabulary word does not appear in the embedding model file, a random vector is generated for it.
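A minimal sketch of this lookup, assuming a GloVe-style text file and a `{word: word_id}` dictionary; the helper name `build_embedding` and the uniform range chosen for the random fallback vectors are illustrative:

```python
import numpy as np

def build_embedding(word_dict, glove_path, dim=300, seed=0):
    """word_dict: {word: word_id}. Returns a matrix whose rows are indexed by word_id."""
    rng = np.random.RandomState(seed)
    # start with random vectors, so unseen words get the random fallback described above
    emb = rng.uniform(-0.2, 0.2, size=(len(word_dict), dim))
    with open(glove_path) as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word, vec = parts[0], parts[1:]
            if word in word_dict and len(vec) == dim:
                emb[word_dict[word]] = np.asarray(vec, dtype=np.float64)
    return emb
```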
1.4.3 Normalize the embeddings (embed_glove_d300_norm.txt)
Normalize each embedding above by dividing every element of the n-dimensional float vector by the vector's L2 norm, i.e. the square root of the sum of its squared elements.
1172 0.307593 0.084109 0.091913 0.041133 -0.066543 0.169926 0.118591 0.155697 0.100562 -0.074918 0.244866 0.205617 0.184356 0.023127 -0.378166 0.114175 -0.086411 0.241300 0.031142 0.007071 0.068316 -0.185166 0.107332 0.022616 0.144445 0.154028 0.108449 0.055719 -0.095246 -0.063759 -0.019611 0.078017 0.127751 -0.078506 0.010984 0.125966 -0.058951 -0.019051 0.171188 -0.022698 0.094737 -0.329387 -0.168848 0.007798 0.168486 0.017147 0.102232 -0.088963 0.089591 -0.195003
5282 0.029213 0.157181 -0.154960 -0.233618 0.053710 0.080129 -0.101242 -0.166920 -0.132848 0.161217 -0.127477 0.156017 -0.037333 -0.125252 0.027569 -0.199475 -0.065461 -0.221981 0.101950 0.100955 0.032526 0.042434 0.129049 -0.023125 0.171073 -0.125340 -0.281323 -0.158993 -0.161242 0.017526 0.281285 -0.276110 0.065105 -0.282073 -0.078476 0.008039 -0.084110 0.084414 -0.155985 0.194358 0.160886 0.002346 -0.058596 0.167405 0.023029 0.050761 -0.008808 0.059741 -0.102825 -0.200764
11492 -0.025103 0.016800 0.204401 0.100777 -0.162148 0.255620 0.145180 0.064418 0.269896 0.140879 0.270979 -0.106657 -0.059934 0.286654 -0.015780 0.135911 -0.329503 0.091258 0.033821 -0.005341 -0.060271 -0.004548 -0.014344 -0.124508 -0.026326 -0.141826 -0.129897 -0.002600 -0.140613 -0.010553 0.046892 0.083493 0.007682 -0.017479 -0.125238 0.132560 0.302289 0.211904 0.145355 -0.123066 0.103833 0.012193 -0.245459 0.178692 0.066753 -0.036298 0.082784 -0.029095 -0.143587 -0.027535
.......
Comparing the two files shows that each element has indeed become smaller. The purpose of normalization is to make each vector's L2 norm equal to 1, so that the inner product of two vectors is exactly their cosine similarity.
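The normalization itself is one line of NumPy; `l2_normalize` is an illustrative helper:

```python
import numpy as np

def l2_normalize(emb, eps=1e-12):
    """Divide each row by its L2 norm so that dot products equal cosine similarity."""
    norms = np.sqrt((emb ** 2).sum(axis=1, keepdims=True))
    return emb / np.maximum(norms, eps)  # eps guards against all-zero rows
```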
1.4.4 Another form of embedding (embed.idf.txt)
Note that the embeddings above are standard word embeddings.
The paper also uses another, one-dimensional form of embedding: the word's IDF.
0 7.787672
1 4.466240
2 8.480820
3 8.991645
4 7.145819
5 5.720810
6 4.069234
7 7.605351
8 6.954763
9 8.298498
10 6.452671
.......
This simply takes the first column (word_id) and the fourth column (idf) of the word-statistics file above.
Format: word_id|log((1+total number of documents)/(1+document frequency))
1.5 Build the model input (relation_train.hist-60.txt)
This step uses the normalized embedding mapping, the preprocessed corpus, and the relation file:
- Iterate over the relation file, taking each <q_id, d_id> pair
- Look up the word_ids of q_id and d_id in the preprocessed corpus
- Look up the embedding of every word in q and d via the embedding mapping
Next, compute the matching histograms.
Suppose q contains 4 words and d contains 7 words; then q and d are 4 × n and 7 × n matrices, where n is the embedding dimension.
- Compute the inner products of q and d, i.e. q × (transpose of d), giving a 4 × 7 similarity matrix whose entries are the similarities between each word of q and each word of d.
- Because document lengths vary, the number of columns of the similarity matrix varies; the values are therefore quantized into a fixed number of bins, fixing the width.
- Quantization assigns the values to a fixed number of bins, where the bin count is an input parameter. Using the paper's example: similarities lie in [-1, 1]; with 5 bins these are [-1,-0.5), [-0.5,0), [0,0.5), [0.5,1), and [1,1]. Suppose q is (car) and d is (car, rent, truck, bump, injunction, runway).
If the similarities between q and d are (1, 0.2, 0.7, 0.3, -0.1, 0.1),
then the histogram is [0, 1, 3, 1, 1].
Where does [0, 1, 3, 1, 1] come from? The histogram counts how many similarities fall into each bin: no similarity falls in [-1, -0.5), so the first entry is 0; one (-0.1) falls in [-0.5, 0); and so on.
The paper actually proposes three ways to compute the bin values: the raw counts used in the example above, counts normalized by the total, and the logarithm of the counts; in the paper's experiments the log variant works best.
- The bin index is computed as
vid = int((v + 1.) / 2. * (hist_size - 1.))
- The final histogram value is therefore
log10(1 + count)
The resulting histograms look like this:
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.477121263742 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.60206001997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.60206001997 0.301030009985 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.477121263742 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.301030009985 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.477121263742 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.477121263742 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.301030009985 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
.......
Each row holds the histogram values for one (q, d) pair, as a float array
of length (number of words in q) × (number of bins).
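The histogram construction described above can be sketched as follows, assuming the embeddings are already L2-normalized so that the inner product is the cosine similarity; this implements the log-count variant and uses the bin-index formula given above:

```python
import numpy as np

def match_histogram(q_embed, d_embed, hist_size=60):
    """q_embed: (len_q, n) L2-normalized matrix; d_embed: (len_d, n) L2-normalized matrix.
    Returns a flat array of length len_q * hist_size holding log10(1 + count) per bin."""
    sim = np.dot(q_embed, d_embed.T)  # (len_q, len_d) cosine similarities in [-1, 1]
    hist = np.zeros((q_embed.shape[0], hist_size))
    for i, row in enumerate(sim):
        for v in row:
            # map v in [-1, 1] to a bin index in [0, hist_size - 1]
            vid = int((v + 1.) / 2. * (hist_size - 1.))
            hist[i, vid] += 1.
    return np.log10(1. + hist).flatten()
```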
2. Training
2.1 Load the data
2.1.1 Load the embeddings
Load the normalized word embeddings (embed_glove_d300_norm.txt) or the IDF embedding (embed.idf.txt).
2.1.2 Load the corpus
Load the preprocessed corpus (corpus_preprocessed.txt).
2.1.3 Load the relations
Load the relation file (relation_train.txt) and the corresponding histograms (relation_train.hist-60.txt).
Build a key-value map whose key is the (q_id, d_id) tuple and whose value is the corresponding histogram (a fixed-size float array, as shown above).
2.2 Model Preparation
2.2.1 Initialize the training batch source
Convert the relation data shown above
0 Q1 D0
0 Q1 D1
0 Q1 D2
1 Q1 D3
0 Q1 D4
0 Q2 D5
0 Q2 D6
.......
into the following structure:
{
Q1:{0: [D0, D1, D2, D4], 1: [D3]},
Q2:{0: [D5, D6, D8, D9, D10, D11], 1: [D7]}
...........
}
and finally into a set of triples:
[
(Q1,D3,D0),
(Q1,D3,D1),
(Q1,D3,D2),
(Q1,D3,D4),
(Q2,D7,D5),
(Q2,D7,D6),
.......
]
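The triple generation above can be sketched as follows (an illustrative helper, assuming binary labels; each positive document is paired with every negative document of the same query):

```python
from collections import defaultdict

def make_train_triples(relations):
    """relations: list of (label, q_id, d_id) with label '0' or '1'.
    Returns (q_id, d_pos, d_neg) triples for pairwise training."""
    by_query = defaultdict(lambda: {0: [], 1: []})
    for label, qid, did in relations:
        by_query[qid][int(label)].append(did)
    triples = []
    for qid, docs in by_query.items():
        for d_pos in docs[1]:          # each positive document...
            for d_neg in docs[0]:      # ...paired with each negative document
                triples.append((qid, d_pos, d_neg))
    return triples
```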
2.2.2 Initialize the evaluation batch source
Convert the relation data shown above
0 Q2241 D19682
0 Q2241 D19683
0 Q2241 D19684
0 Q2241 D19685
0 Q2241 D19686
1 Q2241 D19687
0 Q2242 D19688
0 Q2242 D19689
0 Q2242 D19690
0 Q2242 D19691
1 Q2242 D19692
0 Q2242 D19693
..........
into:
[
(Q2241, [(1, D19687), (0, D19686), (0, D19685), (0, D19684), (0, D19683), (0, D19682)]),
(Q2242, [(1, D19692), (0, D19693), (0, D19691), (0, D19690), (0, D19689), (0, D19688)])
.......
]
The evaluation input is grouped at the query level.
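A sketch of the per-query grouping (the positives-first ordering shown in the listing above is an assumption of this sketch, not required by the model):

```python
from collections import defaultdict

def make_eval_lists(relations):
    """relations: list of (label, q_id, d_id).
    Groups documents per query as (q_id, [(label, d_id), ...]), positives first."""
    by_query = defaultdict(list)
    for label, qid, did in relations:
        by_query[qid].append((int(label), did))
    # sort each query's candidates so labeled-1 documents come first
    return [(qid, sorted(pairs, key=lambda p: -p[0]))
            for qid, pairs in by_query.items()]
```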
2.3 Initialize the model