0. Background

When retrieval relevance comes up, one thinks of the traditional BM25, as well as DSSM, a neural model based on global representations. Whereas DSSM focuses on semantic matching, DRMM focuses more on relevance matching.

The paper (A Deep Relevance Matching Model for Ad-hoc Retrieval) puts it this way:

In this way, the relevance matching could happen in any part of a relevant document, and we do not require the document as a whole to be relevant to a query.

In real applications, the model can be trained on user clicks, submissions, or human-reviewed data. Below, a WikiQA question-answering example is used to walk through the implementation details.

1. Data Processing

1.1 Input (sample.txt)

0 how are glacier caves formed ? A partly submerged glacier cave on Perito Moreno Glacier .
0 how are glacier caves formed ? The ice facade is approximately 60 m high
0 how are glacier caves formed ? Ice formations in the Titlis glacier cave
1 how are glacier caves formed ? A glacier cave is a cave formed within the ice of a glacier .
0 how are glacier caves formed ? Glacier caves are often called ice caves , but this term is properly used to describe bedrock caves that contain year-round ice .
0 How are the directions of the velocity and force vectors related in a circular motion In physics , circular motion is a movement of an object along the circumference of a circle or rotation along a circular path .
0 How are the directions of the velocity and force vectors related in a circular motion It can be uniform , with constant angular rate of rotation ( and constant speed ) , or non-uniform with a changing rate of rotation .
........

Format: label|q|d, with the three fields separated by a delimiter (presumably a tab, since q and d themselves contain spaces)

1.2 Data Preparation

1.2.1 Generate the Corpus (corpus.txt)

Q2470 what is homebrew for wii
Q2477 where is the brisket from
Q2476 who is mary matalin married to
Q2475 who are all of the jonas brothers
Q2474 what is the main component of vaccines
D1222 The Australian Recording Industry Association Music Awards ( commonly known as ARIA Music Awards or ARIA Awards ) is an annual series of awards nights celebrating the Australian music industry , put on by the Australian Recording Industry Association ( ARIA ) .
D1223 The event has been held annually since 1987 and encompasses the general genre-specific and popular awards ( these are what is usually being referred to as `` the ARIA awards ) as well as Fine Arts Awards and Artisan Awards ( held separately from 2004 ) , Lifetime Achievement Awards and ARIA Hall of Fame – held separately from 2005 to 2010 but returned to the general ceremony in 2011 .
D1220 In India , sound was the transformative element that led to the rapid expansion of the nation s film industry—the most productive such industry in the world since the early 1960s .
D1221 African people
.......

Format: q_id|q or d_id|d; the delimiter is a space

q_id format: Q + an auto-incrementing number

d_id format: D + an auto-incrementing number

1.2.2 Generate the Relation File (relation_train.txt)

0 Q1 D0
0 Q1 D1
0 Q1 D2
1 Q1 D3
0 Q1 D4
0 Q2 D5
0 Q2 D6
.......

Format: label|q_id|d_id; the delimiter is a space

1.3 Data Preprocessing

1.3.1 Generate the Dictionary (word_dict.txt)

latest 0
season 1
psych 2
fruitcake 3
korean 4
money 5
called 6
homebrew 7
wii 8
brisket 9
mary 10
matalin 11
married 12
jonas 13
brothers 14
main 15
component 16
......

Format: word|word_id; the delimiter is a space

1.3.2 Generate Word Statistics (word_stats.txt)

0 9 9 7.787672
1 331 276 4.466240
2 4 4 8.480820
3 2 2 8.991645
4 22 18 7.145819
5 92 78 5.720810
6 425 411 4.069234
7 11 11 7.605351
8 29 22 6.954763
9 5 5 8.298498
10 39 37 6.452671
.......

Format: word_id|corpus-wide term frequency|document frequency|log((1 + total number of documents)/(1 + document frequency)); the delimiter is a space

The last field can be regarded as the IDF.
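As a reference, here is a minimal sketch of how such statistics could be computed; `docs` (a list of word_id lists covering the whole corpus) and the function name are illustrative assumptions:

```python
import math
from collections import Counter

def word_stats(docs):
    """docs: list of documents, each a list of word_ids."""
    term_freq = Counter()  # corpus-wide term frequency
    doc_freq = Counter()   # number of documents containing the word
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    n_docs = len(docs)
    # word_id -> (total frequency, document frequency, smoothed idf)
    return {w: (term_freq[w], doc_freq[w],
                math.log((1.0 + n_docs) / (1.0 + doc_freq[w])))
            for w in term_freq}
```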

1.3.3 Generate the New Corpus (original text mapped to word_id sequences) (corpus_preprocessed.txt)

Q2473 3 0 1 2
Q2472 1 3
Q2471 3 4 5 6
Q2470 2 7 8
Q2477 1 9
Q2476 3 10 11 12
Q2475 2 13 14
Q2474 3 15 16 17
.......

Format: q_id|number of words|word_id sequence; the delimiter is a space

1.4 Produce the Embedding Mapping

1.4.1 Prepare the Embedding Model (file) (e.g. glove.6B.300d.txt)

Either download pre-trained word embeddings or train them on your own corpus.

An example of the model (file):

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353 0.59868 0.28825 -0.11547 -0.041848 -0.67989 -0.25063 0.18472 0.086876 0.46582 0.015035 0.043474 -1.4671 -0.30384 -0.023441 0.30589 -0.21785 3.746 0.0042284 -0.18436 -0.46209 0.098329 -0.11907 0.23919 0.1161 0.41705 0.056763 -6.3681e-05 0.068987 0.087939 -0.10285 -0.13931 0.22314 -0.080803 -0.35652 0.016413 0.10216
of 0.70853 0.57088 -0.4716 0.18048 0.54449 0.72603 0.18157 -0.52393 0.10381 -0.17566 0.078852 -0.36216 -0.11829 -0.83336 0.11917 -0.16605 0.061555 -0.012719 -0.56623 0.013616 0.22851 -0.14396 -0.067549 -0.38157 -0.23698 -1.7037 -0.86692 -0.26704 -0.2589 0.1767 3.8676 -0.1613 -0.13273 -0.68881 0.18444 0.0052464 -0.33874 -0.078956 0.24185 0.36576 -0.34727 0.28483 0.075693 -0.062178 -0.38988 0.22902 -0.21617 -0.22562 -0.093918 -0.80375
to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.43653 0.33418 0.67846 0.057204 -0.34448 -0.42785 -0.43275 0.55963 0.10032 0.18677 -0.26854 0.037334 -2.0932 0.22171 -0.39868 0.20912 -0.55725 3.8826 0.47466 -0.95658 -0.37788 0.20869 -0.32752 0.12751 0.088359 0.16351 -0.21634 -0.094375 0.018324 0.21048 -0.03088 -0.19722 0.082279 -0.09434 -0.073297 -0.064699 -0.26044

Format: word|n-dimensional float vector (n is specified at training time); the delimiter is a space

1.4.2 Produce the Embeddings for the Dictionary (embed_glove_d300.txt)

1172 1.5525 0.42452 0.46391 0.20761 -0.33586 0.85766 0.59856 0.78584 0.50756 -0.37813 1.2359 1.0378 0.93049 0.11673 -1.9087 0.57627 -0.43614 1.2179 0.15718 0.03569 0.34481 -0.93458 0.54173 0.11415 0.72905 0.77742 0.54737 0.28123 -0.48073 -0.32181 -0.098982 0.39377 0.64479 -0.39624 0.055439 0.63578 -0.29754 -0.096153 0.86403 -0.11456 0.47816 -1.6625 -0.85222 0.039357 0.85039 0.086544 0.51599 -0.44902 0.45219 -0.98423 ......
5282 0.15187 0.81713 -0.80558 -1.2145 0.27922 0.41656 -0.52632 -0.86776 -0.69063 0.83811 -0.66271 0.81108 -0.19408 -0.65114 0.14332 -1.037 -0.34031 -1.154 0.53 0.52483 0.16909 0.2206 0.67088 -0.12022 0.88935 -0.6516 -1.4625 -0.82655 -0.83824 0.091111 1.4623 -1.4354 0.33846 -1.4664 -0.40797 0.04179 -0.43726 0.43884 -0.81091 1.0104 0.83639 0.012196 -0.30462 0.87028 0.11972 0.26389 -0.045789 0.31057 -0.53455 -1.0437 ......
11492 -0.12748 0.085313 1.038 0.51177 -0.82343 1.2981 0.73726 0.32713 1.3706 0.71542 1.3761 -0.54163 -0.30436 1.4557 -0.080133 0.69019 -1.6733 0.46343 0.17175 -0.027124 -0.30607 -0.023095 -0.072841 -0.63228 -0.13369 -0.72023 -0.65965 -0.013201 -0.71407 -0.053591 0.23813 0.424 0.039012 -0.088765 -0.63599 0.67317 1.5351 1.0761 0.73815 -0.62496 0.52729 0.061921 -1.2465 0.90744 0.33899 -0.18433 0.4204 -0.14775 -0.72917 -0.13983 ......
........

Format: word_id|n-dimensional float vector; the delimiter is a space

If a word in the dictionary is not found in the embedding model file, a random vector is generated for it.
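A minimal sketch of this lookup, assuming `word_dict` maps word to word_id as in word_dict.txt; the uniform range for the random vectors is an assumption:

```python
import numpy as np

def build_embedding(word_dict, glove_path, dim=300, seed=42):
    """Map each word_id to its GloVe vector, or to a random vector if missing."""
    rng = np.random.RandomState(seed)
    glove = {}
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    embed = {}
    for word, wid in word_dict.items():
        # fall back to a random vector for out-of-vocabulary words
        embed[wid] = glove.get(word, rng.uniform(-0.2, 0.2, dim).astype(np.float32))
    return embed
```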

1.4.3 Normalize the Embeddings (embed_glove_d300_norm.txt)

Normalize the embeddings above (the n-dimensional float vectors): divide each element of a vector by the vector's L2 norm, i.e. the square root of the sum of the squares of all its elements.

1172 0.307593 0.084109 0.091913 0.041133 -0.066543 0.169926 0.118591 0.155697 0.100562 -0.074918 0.244866 0.205617 0.184356 0.023127 -0.378166 0.114175 -0.086411 0.241300 0.031142 0.007071 0.068316 -0.185166 0.107332 0.022616 0.144445 0.154028 0.108449 0.055719 -0.095246 -0.063759 -0.019611 0.078017 0.127751 -0.078506 0.010984 0.125966 -0.058951 -0.019051 0.171188 -0.022698 0.094737 -0.329387 -0.168848 0.007798 0.168486 0.017147 0.102232 -0.088963 0.089591 -0.195003
5282 0.029213 0.157181 -0.154960 -0.233618 0.053710 0.080129 -0.101242 -0.166920 -0.132848 0.161217 -0.127477 0.156017 -0.037333 -0.125252 0.027569 -0.199475 -0.065461 -0.221981 0.101950 0.100955 0.032526 0.042434 0.129049 -0.023125 0.171073 -0.125340 -0.281323 -0.158993 -0.161242 0.017526 0.281285 -0.276110 0.065105 -0.282073 -0.078476 0.008039 -0.084110 0.084414 -0.155985 0.194358 0.160886 0.002346 -0.058596 0.167405 0.023029 0.050761 -0.008808 0.059741 -0.102825 -0.200764
11492 -0.025103 0.016800 0.204401 0.100777 -0.162148 0.255620 0.145180 0.064418 0.269896 0.140879 0.270979 -0.106657 -0.059934 0.286654 -0.015780 0.135911 -0.329503 0.091258 0.033821 -0.005341 -0.060271 -0.004548 -0.014344 -0.124508 -0.026326 -0.141826 -0.129897 -0.002600 -0.140613 -0.010553 0.046892 0.083493 0.007682 -0.017479 -0.125238 0.132560 0.302289 0.211904 0.145355 -0.123066 0.103833 0.012193 -0.245459 0.178692 0.066753 -0.036298 0.082784 -0.029095 -0.143587 -0.027535
.......

Comparing the two files, each element has indeed become smaller. The purpose of normalization is to make each vector's L2 norm equal to 1, so that the inner product of two vectors is equivalent to their cosine similarity.
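A one-line sketch of the normalization, given the embedding map from the previous step:

```python
import numpy as np

def l2_normalize(embed):
    """Divide each vector by its L2 norm so that dot product equals cosine similarity."""
    return {wid: v / np.linalg.norm(v) for wid, v in embed.items()}
```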

1.4.4 Another Form of Embedding (embed.idf.txt)

Note that the embeddings discussed so far are typical word embeddings.

The paper also uses another form of embedding with a single dimension: the word's IDF.

0 7.787672
1 4.466240
2 8.480820
3 8.991645
4 7.145819
5 5.720810
6 4.069234
7 7.605351
8 6.954763
9 8.298498
10 6.452671
.......

This simply takes the first column (word_id) and the fourth column (idf) of the word statistics file above.

Format: word_id|log((1 + total number of documents)/(1 + document frequency))

1.5 Construct the Model Input (relation_train.hist-60.txt)

This step uses the normalized embedding mapping, the final corpus, and the relation file produced above:

  • Iterate over the relation file, fetching each <q_id, d_id> pair
  • Look up the word_ids of q_id and d_id via the corpus
  • Look up the embedding of each word in q and d via the embedding mapping

Next, compute the histograms.

Suppose q contains 4 words and d contains 7 words; then q and d are 4 × n and 7 × n matrices respectively, where n is the embedding dimension.

  • Compute the inner products between q and d, i.e. q × (transpose of d), yielding a 4 × 7 similarity matrix that holds the similarity of every word in d to every word in q
  • Since document length varies, the number of columns of the similarity matrix varies as well, so the results are quantized, which fixes the number of columns
  • Quantization assigns the results to a fixed number of bins; the bin count is passed in as an input parameter. Take the example from the paper: similarities fall in [-1, 1], and with 5 bins the intervals are [-1, -0.5), [-0.5, 0), [0, 0.5), [0.5, 1), [1, 1]. Now suppose q is (car) and d is (car, rent, truck, bump, injunction, runway).

    If the similarities between q and d are (1, 0.2, 0.7, 0.3, -0.1, 0.1),

    then the hist is [0, 1, 3, 1, 1]

Where does [0, 1, 3, 1, 1] come from? The hist holds the count for each bin: since no similarity falls within [-1, -0.5), the first entry is 0, and so on for the rest.

The paper actually gives three ways to fill the bins. The example just shown is the bin count; the other two are the normalized count and the logarithm of the count. The final experiments showed the logarithm works best.

  • The bin index is computed as vid = int((v + 1.) / 2. * (hist_size - 1.))
  • The final histogram value is log10(1 + count); see the sketch below
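Putting the steps above together, a minimal sketch of the histogram computation (the log-count variant), assuming both inputs are already L2-normalized:

```python
import numpy as np

def match_histogram(q_embed, d_embed, hist_size=60):
    """DRMM log-count histogram for one (q, d) pair.

    q_embed: (num_q_words, n) L2-normalized query word embeddings
    d_embed: (num_d_words, n) L2-normalized document word embeddings
    Returns a (num_q_words, hist_size) array.
    """
    sim = q_embed @ d_embed.T              # cosine similarities in [-1, 1]
    hist = np.zeros((q_embed.shape[0], hist_size))
    for i, row in enumerate(sim):
        for v in row:
            # map v in [-1, 1] to a bin index; v == 1 lands in the last bin
            vid = int((v + 1.0) / 2.0 * (hist_size - 1.0))
            hist[i, vid] += 1.0
    return np.log10(1.0 + hist)            # log of (1 + count)
```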

The resulting histograms look like this:

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.477121263742 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.60206001997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.60206001997 0.301030009985 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.477121263742 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.301030009985 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.477121263742 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.477121263742 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.301030009985 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.301030009985 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.477121263742 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.301030009985 0.301030009985 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
.......

Each row holds the hist values of one (q, d) pair: a float array of length (number of query words) × (fixed number of bins).

2. Training

2.1 Load the Data

2.1.1 Load the Embeddings

Load the normalized word embeddings (embed_glove_d300_norm.txt) or the IDF embeddings (embed.idf.txt).

2.1.2 Load the Corpus

Load the corpus (corpus_preprocessed.txt).

2.1.3 Load the Relations

Load the relations (relation_train.txt) and the corresponding histograms (relation_train.hist-60.txt).

Build a key-value map whose key is the (q_id, d_id) tuple and whose value is the corresponding histogram (a fixed-size float array, as shown above).
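A sketch of this loading step; it assumes line i of the histogram file corresponds to line i of the relation file, which is how the files were generated above:

```python
def load_relations(rel_path, hist_path):
    """Return (label, q_id, d_id) triples and a {(q_id, d_id): histogram} map."""
    relations, hist_map = [], {}
    with open(rel_path) as rf, open(hist_path) as hf:
        for rel_line, hist_line in zip(rf, hf):
            label, q_id, d_id = rel_line.split()
            relations.append((int(label), q_id, d_id))
            hist_map[(q_id, d_id)] = [float(x) for x in hist_line.split()]
    return relations, hist_map
```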

2.2 Model Preparation

2.2.1 Initialize the Training Batch Input Source

Take the relation data mentioned above:

0 Q1 D0
0 Q1 D1
0 Q1 D2
1 Q1 D3
0 Q1 D4
0 Q2 D5
0 Q2 D6
.......

and transform it into the following structure:

{
  Q1: {0: [D0, D1, D2, D4], 1: [D3]},
  Q2: {0: [D5, D6, D8, D9, D10, D11], 1: [D7]},
  ...........
}

which is finally turned into a set of triples:

[
(Q1,D3,D0),
(Q1,D3,D1),
(Q1,D3,D2),
(Q1,D3,D4),
(Q2,D7,D5),
(Q2,D7,D6),
.......
]
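A minimal sketch of this transformation; pairing every positive document with every negative document of the same query reproduces the triples listed above:

```python
from collections import defaultdict

def make_triples(relations):
    """relations: iterable of (label, q_id, d_id). Returns (q, pos_d, neg_d) triples."""
    by_query = defaultdict(lambda: {0: [], 1: []})
    for label, q_id, d_id in relations:
        by_query[q_id][label].append(d_id)
    triples = []
    for q_id, groups in by_query.items():
        for pos in groups[1]:       # each positive doc...
            for neg in groups[0]:   # ...paired with each negative doc
                triples.append((q_id, pos, neg))
    return triples
```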

2.2.2 Initialize the Evaluation Batch Input Source

Take the relation data (here from the evaluation split):

0 Q2241 D19682
0 Q2241 D19683
0 Q2241 D19684
0 Q2241 D19685
0 Q2241 D19686
1 Q2241 D19687
0 Q2242 D19688
0 Q2242 D19689
0 Q2242 D19690
0 Q2242 D19691
1 Q2242 D19692
0 Q2242 D19693
..........

and transform it into:

[
(Q2241, [(1, D19687), (0, D19686), (0, D19685), (0, D19684), (0, D19683), (0, D19682)]),
(Q2242, [(1, D19692), (0, D19693), (0, D19691), (0, D19690), (0, D19689), (0, D19688)])
.......
]

The input granularity here is per query.
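A sketch of the per-query grouping; listing positives first mirrors the example above and is otherwise an arbitrary choice:

```python
from collections import defaultdict

def make_eval_lists(relations):
    """Group (label, d_id) pairs by query for query-level evaluation."""
    by_query = defaultdict(list)
    for label, q_id, d_id in relations:
        by_query[q_id].append((label, d_id))
    # positives first, then the rest
    return [(q, sorted(pairs, reverse=True)) for q, pairs in by_query.items()]
```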

2.3 Initialize the Model

Expressed in formulas (following the paper):

z_i^(0) = h(q_i ⊗ d),  i = 1, …, M
z_i^(l) = tanh(W^(l) · z_i^(l-1) + b^(l)),  i = 1, …, M,  l = 1, …, L
s = Σ_{i=1..M} g_i · z_i^(L)

The first formula is the histogram computation described above: the mapping function h turns the local interactions between query word q_i and document d into a fixed-length histogram.

The second formula is the overall output of the feed forward matching network.

The third formula is the final score, where g_i is the output of the term gating network:

g_i = exp(w_g · x_i^(q)) / Σ_{j=1..M} exp(w_g · x_j^(q))

Here x_i^(q) is the gating feature of the i-th query word (its word embedding, or simply its IDF).

As for the feed forward matching network: in the actual implementation the last layer does not use tanh as its activation function and instead outputs the raw fully connected result. I specifically asked the author about this, and he replied:

There are no special reason for the tanh in last full connection. In the original paper, we added tanh in the TREC dataset, and found it worked good. While we work on larger datasets like WikiQA, we found it is better to remove it. In fact, I have no better idea about it. There needs more experiments to figure it out.

In my own tests, using tanh in the last layer lowered MAP by about 3 points compared to leaving it out, and NDCG also dropped slightly, though this need not hold on every dataset.
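For reference, a minimal Keras sketch of the model under the shapes used in this example (10 query words, 60 bins, padding id = dictionary size). The hidden layer sizes are assumptions, the gating uses the 1-dimensional IDF embedding, and the last dense layer has no tanh, per the discussion above; masking of padded query positions is omitted for brevity:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_drmm(q_len=10, hist_size=60, vocab_size=18670, hidden=(5, 1)):
    query = keras.Input(shape=(q_len,), name='query')         # word ids
    doc = keras.Input(shape=(q_len, hist_size), name='doc')   # histograms

    # term gating network: softmax over the per-word idf
    idf = layers.Embedding(vocab_size, 1, name='idf_embedding')(query)
    gate = layers.Softmax(axis=1)(layers.Flatten()(idf))      # (batch, q_len)

    # feed forward matching network applied to each query word's histogram
    z = doc
    for units in hidden[:-1]:
        z = layers.Dense(units, activation='tanh')(z)
    z = layers.Dense(hidden[-1])(z)                           # no tanh on the last layer
    z = layers.Flatten()(z)                                   # (batch, q_len)

    # final score: gate-weighted sum of per-word matching outputs
    score = layers.Dot(axes=1)([z, gate])
    return keras.Model(inputs=[query, doc], outputs=score)
```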

2.4 Initialize the Loss

rank_hinge_loss
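A minimal sketch of a pairwise rank hinge loss, assuming the batch alternates positive and negative rows as described in 2.6.1 below; the margin value is an assumption:

```python
import tensorflow as tf

def rank_hinge_loss(y_true, y_pred, margin=1.0):
    """Even rows are scores of positive pairs, odd rows of negative pairs."""
    y_pos = y_pred[0::2]
    y_neg = y_pred[1::2]
    return tf.reduce_mean(tf.maximum(0.0, margin + y_neg - y_pos))
```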

2.5 Initialize the Evaluation Metrics

MAP

NDCG
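A minimal sketch of both per-query metrics with binary relevance; the exact NDCG gain/discount variant is an assumption:

```python
import math

def average_precision(labels, scores):
    """labels: 0/1 relevance; scores: model scores for one query."""
    ranked = [l for _, l in sorted(zip(scores, labels), reverse=True)]
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / max(hits, 1)

def ndcg(labels, scores, k=10):
    ranked = [l for _, l in sorted(zip(scores, labels), reverse=True)]
    dcg = sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(ranked[:k]))
    ideal = sorted(labels, reverse=True)
    idcg = sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```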

2.6 Model Iteration

2.6.1 Training Batch Input

Randomly sample batch triples from the triple set described above.

Take one triple as an example:

(Q551, D5349, D5350)

This yields two records: (Q551, D5349), a positive pair, and (Q551, D5350), a negative pair,

which produce the following three input forms:

  • query

[712 18669 18669 18669 18669 18669 18669 18669 18669 18669]
[712 18669 18669 18669 18669 18669 18669 18669 18669 18669]

Q551 has only one query word (712); the remaining positions are padded with the dictionary size (18669) used as the padding id.

  • doc

[[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.30103 0.30103 0.30103 0.60206 0. 0.7781513
0.30103 0. 0.47712126 0.47712126 0.60206 0.47712126
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.30103 0. ]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
..........

]
[[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.30103
0. 0.30103 0. 0.47712126 0.47712126 0.69897
0. 0.47712126 0.60206 0. 0.47712126 0.30103
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
.........
]

The above are the histograms described earlier.

The first 10 × 60 2-D array is the result for (Q551, D5349); since the query has only one word, only the first 60-dim row has values and the rest is padding.

The second 10 × 60 array, for (Q551, D5350), works the same way.

  • Y, i.e. the labels

[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0...]

Since the batch is built from triples and each triple yields one positive and one negative pair, batch triples expand into batch * 2 records, and the labels are batch alternating pairs of 1 and 0.

2.6.2 Evaluation Batch Input

Take batch inputs from the evaluation input source.

Take a single input as an example:

(Q2417, [(1, D21129), (0, D21128), (0, D21127), (0, D21126), (0, D21125), (0, D21124), (0, D21123)])

This yields 7 (q, d) pairs

and produces the following input forms:

  • query

[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]
[ 7573 1757 164 18669 18669 18669 18669 18669 18669 18669]

Since all rows share the same query, they are identical.

Q2417 has 3 words (7573 1757 164); the remaining positions are padded with the dictionary size as the padding id.

  • doc

This produces 7 histograms of shape 10 × 60, generated the same way as the training data, so the details are not repeated.

  • Y, i.e. the labels

[1,0,0,0,0,0,0]

2.6.3 Why Do the Training Batch and the Evaluation Batch Differ in Granularity?

One works at triple granularity, the other at query granularity. The difference exists mainly to make the subsequent metric evaluation convenient: MAP and NDCG are computed per query.

2.7 Model Saving

During iteration the model is saved periodically: the layers described above, together with their weights (if any), are written to files so they can be loaded later for prediction.

3. Prediction

3.1 Load the Data (Same as Training)

3.1.1 Load the Embeddings

Same as loading the embeddings during training.

3.1.2 Load the Corpus

Same as loading the corpus during training (corpus_preprocessed.txt).

3.1.3 Load the Relations

Same as loading the relations during training.

3.2 Initialize the Prediction Batch Input Source

Same as the initialization of the evaluation batch input source described above.

3.3 Load the Model

Initialize the model (same as in training), then load the weights file saved at the end of training.

3.4 Initialize the Prediction Metrics

Same as the initialization of the evaluation metrics described above.

3.5 Batch Prediction

Same as the evaluation batches.

