Relation Extraction Tool: DeepDive Environment Setup and Practical Troubleshooting

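System: ubuntu-16.04.5

Goal: get the equity-transaction relation extraction example shipped with the Chinese-enabled deepdive running end to end.

References: the deepdive official website, "Deepdive抽取演員-電影間關係" (extracting actor-film relations with Deepdive), the Chinese-enabled deepdive project, and the PostgreSQL 10.1 manual.

Note: the second half of this article is the same as the corresponding part of the example tutorial of the Chinese-enabled deepdive.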

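Preparation

Installing deepdive

Download CNdeepdive from the Chinese-enabled deepdive project and unpack it. Then enter the CNdeepdive directory, run install.sh, and choose option 1 to install deepdive: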

./install.sh

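If an error appears at this point, edit install.sh and change line 193 from tar xzvf "$tarball" -C "$PREFIX" to tar xvf "$tarball" -C "$PREFIX". Without the z flag, tar no longer forces gzip decompression and detects the archive format itself.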

Configure the environment variable. The deepdive executables are usually installed in ~/local/bin. Add the following line to ~/.bashrc and save:

export PATH="/<directory containing the local folder>/local/bin:$PATH"

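To find that path, go to the directory that contains the local folder and run pwd.

In the test environment, the line finally added to ~/.bashrc was therefore: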

export PATH="/home/linux/local/bin:$PATH"

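Save the change and run source ~/.bashrc to apply the environment variable.

If you see deepdive: command not found, the cause is usually a mistake in this environment-variable step. Two variants were tried during testing:

1. adding export PATH="/root/local/bin:$PATH" to ~/.bashrc;

2. adding export PATH="/root/local/bin:$PATH" to ~/.bash_profile.

Neither variant set the environment variable correctly in the test environment, presumably because deepdive had been installed under /home/linux/local/bin rather than /root/local/bin.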
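A quick way to diagnose the command not found case is to check whether the exported directory really contains the executable. The following is only a minimal sketch: find_on_path and its extra_dir argument are made-up names for this illustration, and the lookup relies on nothing more than Python's standard shutil.which.

```python
import os
import shutil


def find_on_path(command, extra_dir=None):
    """Return the full path of `command`, or None if it cannot be found.

    `extra_dir` (a hypothetical helper argument) is searched before PATH,
    mimicking `export PATH="<extra_dir>:$PATH"`.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = os.path.expanduser(extra_dir) + os.pathsep + search_path
    return shutil.which(command, path=search_path)


if __name__ == "__main__":
    # ~/local/bin is where the article says deepdive is usually installed.
    location = find_on_path("deepdive", extra_dir="~/local/bin")
    print(location or "deepdive not found; re-check the PATH line in ~/.bashrc")
```

If this prints a path under a different user's home directory than the one in your export line, the PATH entry is pointing at the wrong place.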

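Installing postgresql

Background knowledge: 1. creating a database (the createdb command, used below); 2. dropping a database (PostgreSQL's matching command-line tool is dropdb).

Run install.sh and choose option 6 to install postgresql. Then run: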

psql postgres

If this opens the psql prompt, postgresql was installed successfully.

Run \q to leave psql.

Then run:

createdb transaction

to create the database; then run:

psql transaction

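to connect to it; run \l to list all databases, and \q to leave psql.

Installing the NLP environment

Run nlp_setup.sh to configure the Chinese Stanford NLP environment.

Setting up the project skeleton

Create your own project folder, transaction, and create the database configuration file inside it: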

echo "postgresql://localhost:5432/transaction" >db.url

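Then, under transaction, create the input data folder input, the script folder udf, the user configuration file app.ddlog, and the model configuration file deepdive.conf; the layout of the sample transaction folder shipped with CNdeepdive can serve as a template. (The transaction folder created here holds the project to be built, whereas the transaction folder inside the CNdeepdive directory is an already completed project; the scripts and data files needed later can be copied straight from the corresponding places in CNdeepdive.) During testing, writing the database configuration file the way the official Chinese deepdive tutorial does, with echo "postgresql://$USER@$HOSTNAME:5432/db_name" >db.url, did not connect deepdive to postgresql correctly.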
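When the connection fails, it helps to see exactly which user, host, port, and database name a db.url line encodes. The sketch below is an illustration only: describe_db_url is a made-up helper built on Python's standard urllib.parse, not something deepdive provides.

```python
from urllib.parse import urlparse


def describe_db_url(url):
    """Split a db.url string into the parts a PostgreSQL client needs."""
    parts = urlparse(url.strip())
    return {
        "scheme": parts.scheme,
        "user": parts.username,              # None when no user is given
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # The form that worked in the test environment described above.
    print(describe_db_url("postgresql://localhost:5432/transaction"))
```

Comparing the output for the two db.url variants makes the difference visible: the working one names no user and uses localhost as the host.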

Copy the bazzar folder from CNdeepdive/transaction/udf/ into the udf/ folder of the current example project. This module has to be recompiled: enter bazzar/parser and run the build command:

sbt/sbt stage

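When the build finishes, the executable is generated under target.

Experiment steps

Importing prior data

We need entity pairs that are already known to have a transaction relation, taken from a knowledge base, to serve as training data. The data used in this project comes from the Guotai'an (國泰安) database.

By matching the stock-code pairs that have transactions against the code-to-company pairs, the company pairs with a transaction relation are filtered out and stored in transaction_dbdata.csv, which then goes into the input/ folder. Here it is enough to copy transaction_dbdata.csv from CNdeepdive/transaction/input into the input/ folder of the current example project.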
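The matching step just described can be sketched as a simple lookup join. The format of the raw Guotai'an data is not given here, so everything about the inputs is an assumption: a list of (code1, code2) pairs and a dictionary from stock code to company name, with made-up function names and a header-less two-column CSV as output.

```python
import csv
import io


def company_pairs(code_pairs, code_to_company):
    """Map traded stock-code pairs to company-name pairs.

    Pairs with a code missing from the mapping, or mapping to the same
    company twice, are skipped; duplicates are emitted only once.
    """
    seen = set()
    for code1, code2 in code_pairs:
        name1 = code_to_company.get(code1)
        name2 = code_to_company.get(code2)
        if name1 is None or name2 is None or name1 == name2:
            continue
        if (name1, name2) not in seen:
            seen.add((name1, name2))
            yield name1, name2


def to_csv(pairs):
    """Serialise name pairs as two-column CSV text (no header row)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(pairs)
    return buffer.getvalue()
```

The two columns correspond to company1_name and company2_name of the transaction_dbdata relation.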

Define the corresponding table in app.ddlog:

@source transaction_dbdata( @key company1_name text, @key company2_name text ).

Generate the postgresql table from the command line:

$ deepdive compile && deepdive do transaction_dbdata

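Whenever app.ddlog has changed, run deepdive compile first so that the changes take effect before anything is executed. For a table that does not depend on other tables, deepdive automatically looks in the input folder for a csv file of the same name, creates the table in postgresql, and imports the data. When the command runs, deepdive opens an execution-plan file in the current terminal; it behaves like vi, so press Esc and then type :wq to save, execute, and exit.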
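Before running deepdive do on a relation, you can check that input/ really contains a file named after it. This is a small illustrative sketch: find_input_files is a made-up helper, and matching on the part of the file name before the first dot is an assumption made here, not deepdive's documented lookup rule.

```python
import os


def find_input_files(app_dir, relation):
    """List files in <app_dir>/input whose base name equals `relation`."""
    input_dir = os.path.join(app_dir, "input")
    if not os.path.isdir(input_dir):
        return []
    return sorted(
        name for name in os.listdir(input_dir)
        if name.split(".", 1)[0] == relation
    )
```

An empty list for transaction_dbdata would explain a failed import before deepdive is even started.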

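When the command above finishes, check the status it reports. If it shows run/ABORTED, an error occurred while creating the table and importing the data into the database.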

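Importing the articles to extract from

The articles to extract from (the example uses announcements of listed companies) are already available in CNdeepdive/transaction/input; copy them into the input directory of the current example project.

Create the corresponding articles table in app.ddlog: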

articles( id text, content text ).

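Separating the fields in the declaration with spaces is sufficient.

As before, run the following command to import the articles into postgresql: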

$ deepdive do articles

deepdive can query the database directly: use a query statement, or deepdive sql "SQL statement", to run database operations. Query the ids to check that the import succeeded:

$ deepdive query '?- articles(id, _).'

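The result is shown in a figure in the original post (not reproduced here).

Because the running time of the next step grows with the number of rows in the articles to extract from, only the first five rows of the original articles.csv were kept here as a small test dataset.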
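Cutting the file down can be done with any tool; the sketch below uses Python's csv module so that an article whose text contains line breaks inside a quoted field is still counted as one record. The function name head_csv and the file names in the usage note are illustrative assumptions.

```python
import csv
import itertools


def head_csv(src_path, dst_path, n_rows=5):
    """Copy the first `n_rows` CSV records from src_path to dst_path.

    Returns the number of records actually written.
    """
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        count = 0
        for row in itertools.islice(csv.reader(src), n_rows):
            writer.writerow(row)
            count += 1
    return count
```

For example, head_csv("articles_full.csv", "input/articles.csv") would keep the first five articles, assuming the full file was first saved under the hypothetical name articles_full.csv.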

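Text processing with the nlp module

deepdive uses Stanford NLP for text processing by default. Given the text data, the nlp module works sentence by sentence and returns each sentence's word segmentation, lemmas, POS tags, NER tags, and dependency parse, in preparation for the later feature extraction. We store these results in the sentences table.

Define the sentences table in app.ddlog to hold the nlp results: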

sentences( doc_id text, sentence_index int, sentence_text text, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], doc_offsets int[], dep_types text[], dep_tokens int[] ).

Define the NLP processing function nlp_markup:

function nlp_markup over ( doc_id text, content text ) returns rows like sentences implementation "udf/nlp_markup.sh" handles tsv lines.

Call nlp_markup with the following syntax, reading the input from the articles table and writing the output to the sentences table:

sentences += nlp_markup(doc_id, content) :- articles(doc_id, content).

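The declaration above defines a ddlog function that takes an article's doc_id and content as input and produces output in the column format of the sentences table. The function calls udf/nlp_markup.sh to invoke the nlp module; nlp_markup.sh can be found in the udf/ folder of the transaction example code, and it in turn calls run.sh under udf/bazzar/parser. nlp_markup.sh has to be copied from transaction/udf in the CNdeepdive example code into the same directory of the current project.

After compiling and executing the sentences relation (deepdive compile && deepdive do sentences, following the same pattern as the other steps), run the following command to query the generated result: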

deepdive query 'doc_id, index, tokens, ner_tags | 5 ?- sentences(doc_id, index, text, tokens, lemmas, pos_tags, ner_tags, _, _, _).'

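The result is shown in a figure in the original post (not reproduced here).

This step runs very slowly and may take four or five hours. That is why the number of rows in articles was reduced earlier, so that the demo finishes quickly.

Entity extraction and candidate entity-pair generation

In this step we extract the candidate entities (companies) from the text and generate candidate entity pairs. First define the entity table in app.ddlog: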

company_mention( mention_id text, mention_text text, doc_id text, sentence_index int, begin_index int, end_index int ).

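Each entity mention is one row of the table, which stores the mention's id, its text, the id of the document it occurs in, the sentence index, and its start and end positions within the sentence.

Next, define the entity extraction function: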

function map_company_mention over ( doc_id text, sentence_index int, tokens text[], ner_tags text[] ) returns rows like company_mention implementation "udf/map_company_mention.py" handles tsv lines.

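map_company_mention.py also has to be copied from transaction/udf in the CNdeepdive example code into the corresponding directory of the current project. The script walks through every sentence in the database, finds the runs of consecutive tokens whose NER tag is ORG, applies some further filtering, and returns the candidate entities. It is written as a generator function that returns its output rows with yield statements. All the other scripts and files under transaction/udf in the CNdeepdive example code need to be copied over as well (including company_full_short.csv).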
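The core idea of the script, finding maximal runs of ORG tags and yielding one row per run, can be sketched as follows. This is not the code of map_company_mention.py: the function name, the mention_id format, and the choice to join tokens without spaces are assumptions for illustration, and the extra filtering the real script performs is left out.

```python
def org_mentions(doc_id, sentence_index, tokens, ner_tags, tag="ORG"):
    """Yield one candidate mention per maximal run of `tag` in ner_tags.

    Each row follows the column order of company_mention:
    (mention_id, mention_text, doc_id, sentence_index, begin, end).
    """
    start = None
    # A sentinel after the last tag closes a run that reaches the end.
    for i, current in enumerate(list(ner_tags) + [None]):
        if current == tag and start is None:
            start = i
        elif current != tag and start is not None:
            end = i - 1
            # Hypothetical id scheme: document, sentence, and token span.
            mention_id = "%s_%d_%d_%d" % (doc_id, sentence_index, start, end)
            mention_text = "".join(tokens[start:end + 1])
            yield mention_id, mention_text, doc_id, sentence_index, start, end
            start = None
```

Because the function is a generator, the caller can print each yielded row as one tab-separated line, which is the shape a handles tsv lines function is expected to emit.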

Then write the function call in app.ddlog, taking input from the sentences table and writing output to company_mention:

company_mention += map_company_mention( doc_id, sentence_index, tokens, ner_tags) :- sentences(doc_id, sentence_index, _, tokens, _, _, ner_tags, _, _, _).

Finally, compile and execute:

$ deepdive compile && deepdive do company_mention

Check the entity table just extracted:

$ deepdive query 'mention_id, mention_text, doc_id, sentence_index, begin_index, end_index | 50 ?- company_mention(mention_id, mention_text, doc_id, sentence_index, begin_index, end_index).'

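Next, generate the entity pairs, that is, the two companies whose relation is to be predicted. In this step we take the Cartesian product of the entity table with itself and use a custom script to filter out companies that cannot plausibly form a transaction. Define the table as follows: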

transaction_candidate( p1_id text, p1_name text, p2_id text, p2_name text ).

Count the number of entity mentions in each sentence:

num_company(doc_id, sentence_index, COUNT(p)) :- company_mention(p, _, doc_id, sentence_index, _, _).

Define the filtering function:

function map_transaction_candidate over ( p1_id text, p1_name text, p2_id text, p2_name text ) returns rows like transaction_candidate implementation "udf/map_transaction_candidate.py" handles tsv lines.

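The rules for screening candidate entities can be defined inside this function. Call the function, combine it with further conditions to screen the entity pairs, and store the result in the transaction_candidate table: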

transaction_candidate += map_transaction_candidate(p1, p1_name, p2, p2_name) :- num_company(same_doc, same_sentence, num_p), company_mention(p1, p1_name, same_doc, same_sentence, p1_begin, _), company_mention(p2, p2_name, same_doc, same_sentence, p2_begin, _), num_p < 5, p1_name != p2_name, p1_begin != p2_begin.

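Some simple filters can be expressed directly in app.ddlog with database-style conditions; for example, p1_name != p2_name filters out pairs made of two identical entities.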
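To make the effect of these conditions concrete, the sketch below reproduces the three inline filters of the rule (fewer than five mentions in the sentence, different names, different start positions) in plain Python. It illustrates the join logic only; the function name is made up, and the additional screening done by map_transaction_candidate.py is not modelled.

```python
from collections import defaultdict
from itertools import permutations


def candidate_pairs(mentions, max_per_sentence=5):
    """Yield (p1_id, p1_name, p2_id, p2_name) for mentions in one sentence.

    `mentions` are tuples in company_mention column order:
    (mention_id, mention_text, doc_id, sentence_index, begin, end).
    """
    by_sentence = defaultdict(list)
    for m in mentions:
        by_sentence[(m[2], m[3])].append(m)
    for group in by_sentence.values():
        if len(group) >= max_per_sentence:      # mirrors num_p < 5
            continue
        # Ordered pairs, as in the self-join of company_mention.
        for m1, m2 in permutations(group, 2):
            if m1[1] != m2[1] and m1[4] != m2[4]:
                yield m1[0], m1[1], m2[0], m2[1]
```

Note that each unordered pair appears twice, once in each direction, just as it does in the Cartesian product described above.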

Compile and execute:

$ deepdive compile && deepdive do transaction_candidate

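This generates the candidate entity-pair table.

If this step reports an error, try changing the relative path of company_full_short.csv (ENTITY_FILE) in udf/transform.py to an absolute path.

Then check the candidate entity-pair table just generated.

Feature extraction

In this step we extract text features for the candidate entity pairs.

Define the feature table: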

transaction_feature( p1_id text, p2_id text, feature text ).

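The feature column holds the text features observed between the two entities of a pair.

Generating the feature table takes the entity-pair table and the text table as input; the input and output attributes are defined in app.ddlog as follows: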

function extract_transaction_features over ( p1_id text, p2_id text, p1_begin_index int, p1_end_index int, p2_begin_index int, p2_end_index int, doc_id text, sent_index int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], dep_types text[], dep_tokens int[] ) returns rows like transaction_feature implementation "udf/extract_transaction_features.py" handles tsv lines.

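The function calls extract_transaction_features.py to extract the features. The script uses the ddlib library that ships with deepdive to obtain various window features over POS tags, NER tags, and word sequences. Custom features can also be added here.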
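As an illustration of what a custom window feature might look like, here is a dependency-free sketch. It does not use ddlib and does not reproduce ddlib's feature set; the function name and the feature-name prefixes are invented for this example.

```python
def window_features(tokens, p1_begin, p1_end, p2_begin, p2_end, window=2):
    """Yield simple string features for one mention pair in one sentence."""
    # Order the two spans by position so that "between" is well defined.
    (b1, e1), (b2, e2) = sorted([(p1_begin, p1_end), (p2_begin, p2_end)])
    between = tokens[e1 + 1:b2]
    yield "WORDS_BETWEEN_[%s]" % " ".join(between)
    yield "NUM_WORDS_BETWEEN_%d" % len(between)
    left = tokens[max(0, b1 - window):b1]        # tokens before the first span
    right = tokens[e2 + 1:e2 + 1 + window]       # tokens after the second span
    if left:
        yield "LEFT_WINDOW_[%s]" % " ".join(left)
    if right:
        yield "RIGHT_WINDOW_[%s]" % " ".join(right)
```

Each yielded string would become one row of transaction_feature for the pair, alongside p1_id and p2_id.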

Join the sentences table with the mention table, feed the result to the function, and write the output to the transaction_feature table:

transaction_feature += extract_transaction_features( p1_id, p2_id, p1_begin_index, p1_end_index, p2_begin_index, p2_end_index, doc_id, sent_index, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_tokens ) :- company_mention(p1_id, _, doc_id, sent_index, p1_begin_index, p1_end_index), company_mention(p2_id, _, doc_id, sent_index, p2_begin_index, p2_end_index), sentences(doc_id, sent_index, _, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_tokens).

Then compile and execute to generate the feature table:

$ deepdive compile && deepdive do transaction_feature

Run the following statement to inspect the result:

deepdive query '| 20 ?- transaction_feature(p1_id, p2_id, feature).'

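We now have the entity pairs whose relation we want to decide, together with their feature sets.

Labeling samples

In this step we want to mark some of the candidate entity pairs as positive or negative examples, in two ways: by linking the known entity pairs to the candidate pairs, and by assigning some positive and negative labels with rules.

First define the transaction_label table in app.ddlog to store the supervision data: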

@extraction transaction_label( @key @references(relation="has_transaction", column="p1_id", alias="has_transaction") p1_id text, @key @references(relation="has_transaction", column="p2_id", alias="has_transaction") p2_id text, @navigable label int, @navigable rule_id text ).

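rule_id is the name of the rule that determined the label. A positive label means positive correlation and a negative label means negative correlation; the larger the absolute value, the stronger the correlation.

Initialize the table by copying transaction_candidate with every label set to zero: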

transaction_label(p1, p2, 0, NULL) :- transaction_candidate(p1, _, p2, _).

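Import the transaction_dbdata prepared at the beginning into the transaction_label table, with rule_id set to "from_dbdata". Because the Guotai'an data is fairly authoritative, it can be given a relatively high weight, set to 3 here. Define this in app.ddlog as follows: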

transaction_label(p1,p2, 3, "from_dbdata") :- transaction_candidate(p1, p1_name, p2, p2_name), transaction_dbdata(n1, n2), [ lower(n1) = lower(p1_name), lower(n2) = lower(p2_name) ; lower(n2) = lower(p1_name), lower(n1) = lower(p2_name) ].

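If only the downloaded entity pairs are used, their overlap with the pairs extracted from unseen text may be small, which does not help in inferring the feature weights. Some logical rules can therefore be used to pre-label the unseen text: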

function supervise over ( p1_id text, p1_begin int, p1_end int, p2_id text, p2_begin int, p2_end int, doc_id text, sentence_index int, sentence_text text, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], dep_types text[], dep_tokens int[] ) returns ( p1_id text, p2_id text, label int, rule_id text ) implementation "udf/supervise_transaction.py" handles tsv lines.

This labeling function, declared in app.ddlog, calls udf/supervise_transaction.py; the rule names and the weights they carry are defined in that script.
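The shape of such a rule script can be sketched as below. The rules themselves are purely hypothetical and are not those of supervise_transaction.py: the trigger words, the distance threshold, the rule names, and the simplified argument list are all assumptions chosen to show how a rule yields a (p1_id, p2_id, label, rule_id) row.

```python
POSITIVE_KEYWORDS = {"收購", "轉讓", "購買"}   # hypothetical trigger words
MAX_DISTANCE = 40                              # hypothetical token distance


def supervise(p1_id, p1_begin, p1_end, p2_id, p2_begin, p2_end, tokens):
    """Yield (p1_id, p2_id, label, rule_id) rows for one candidate pair."""
    # Tokens strictly between the two (non-overlapping) mentions.
    first_end = min(p1_end, p2_end)
    second_begin = max(p1_begin, p2_begin)
    between = tokens[first_end + 1:second_begin]
    if len(between) > MAX_DISTANCE:
        yield p1_id, p2_id, -1, "neg:far_apart"
    if POSITIVE_KEYWORDS.intersection(between):
        yield p1_id, p2_id, 1, "pos:keyword_between"
```

A pair can receive rows from several rules at once, which is why the labels need to be reconciled afterwards.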

Call the labeling function and write the rows produced by the rules into the transaction_label table:

transaction_label += supervise( p1_id, p1_begin, p1_end, p2_id, p2_begin, p2_end, doc_id, sentence_index, sentence_text, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_token_indexes ) :- transaction_candidate(p1_id, _, p2_id, _), company_mention(p1_id, p1_text, doc_id, sentence_index, p1_begin, p1_end), company_mention(p2_id, p2_text, _, _, p2_begin, p2_end), sentences( doc_id, sentence_index, sentence_text, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_token_indexes ).

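Different rules may cover the same entity pair and therefore give it different or even opposite labels. Create the transaction_label_resolved table to unify the labels of each entity pair: the labels are summed, so that the rules and the knowledge-base labels together cast a vote for each pair.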

transaction_label_resolved(p1_id, p2_id, SUM(vote)) :-transaction_label(p1_id, p2_id, vote, rule_id).
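The SUM(vote) aggregation in this rule amounts to a group-by over the pair of ids. A minimal Python equivalent, with a made-up function name and invented sample rows in the usage note, looks like this:

```python
from collections import defaultdict


def resolve_labels(label_rows):
    """Sum the label votes for every (p1_id, p2_id) pair.

    `label_rows` are (p1_id, p2_id, label, rule_id) tuples, i.e. the
    column order of transaction_label.
    """
    totals = defaultdict(int)
    for p1_id, p2_id, label, _rule_id in label_rows:
        totals[(p1_id, p2_id)] += label
    return dict(totals)
```

For instance, a pair that received the initial 0, a 3 from "from_dbdata", and a -1 from some negative rule ends up with a total of 2, so the knowledge-base evidence outweighs the single negative rule.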

Run the following command to obtain the final labels:

$ deepdive do transaction_label_resolved

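Building the model

All the operations above have produced the data that needs to be prepared in advance. Now we build the model.

Defining the variable table

First define the table that stores the final result. The ? marks it as a variable table in the user schema, i.e. the table whose relation is to be inferred. Here we predict whether a transaction relation exists between two companies: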

@extraction has_transaction?( p1_id text, p2_id text ).

Write the known variables according to the labeling result:

has_transaction(p1_id, p2_id) = if l > 0 then TRUE else if l < 0 then FALSE else NULL end :- transaction_label_resolved(p1_id, p2_id, l).
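The if/else expression in this rule maps the summed vote to a three-valued label. Written out in Python for clarity (the function name is made up, and None stands in for NULL):

```python
def vote_to_truth(vote):
    """Map a summed label vote to True, False, or None (unknown)."""
    if vote > 0:
        return True     # net positive evidence: known positive example
    if vote < 0:
        return False    # net negative evidence: known negative example
    return None         # no net evidence: left for inference to decide
```

Pairs that map to None are exactly the ones whose transaction probability the model is asked to predict.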

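Some of the variables in the variable table now have known labels and act as prior (evidence) variables.

Finally, compile and execute the has_transaction table: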

deepdive compile && deepdive do has_transaction

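Building the factor graph

Specifying features: connect each entity pair in has_transaction to the feature table; through the feature factors, the weights of these features are learned globally. Define in app.ddlog: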

@weight(f) has_transaction(p1_id, p2_id) :- transaction_candidate(p1_id, _, p2_id, _), transaction_feature(p1_id, p2_id, f).

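Specifying dependencies between variables: we can state a rule that two variable tables obey and give that rule a weight. For example, if c1 has a transaction with c2, it follows that c2 has a transaction with c1. This always holds, so the rule is given a relatively high weight: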

@weight(3.0) has_transaction(p1_id, p2_id) => has_transaction(p2_id, p1_id) :- transaction_candidate(p1_id, _, p2_id, _).

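Dependencies between variable tables are what let deepdive handle extraction over multiple relations well.

Finally, compile and generate the final probabilistic model: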

$ deepdive compile && deepdive do probabilities

Inspect the predicted probabilities of a transaction relation between companies:

$ deepdive sql "SELECT p1_id, p2_id, expectation FROM has_transaction_label_inference ORDER BY random() LIMIT 20"

This completes our transaction relation extraction demo.