Relation extraction tool: DeepDive environment setup and troubleshooting in practice

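System: Ubuntu 16.04.5

Goal: run the equity transaction relation extraction example that ships with the Chinese-language deepdive (CNdeepdive) from start to finish.

References: the deepdive official site, "Extracting actor-film relations with Deepdive", the Chinese-language deepdive (CNdeepdive), and the PostgreSQL 10.1 manual.

Note: the second half of this article is the same as the corresponding part of the CNdeepdive example tutorial.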

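Preparation

Installing deepdive

Download CNdeepdive from the Chinese-language deepdive project and unpack it. Enter the CNdeepdive directory, run install.sh, and choose option 1 to install deepdive: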

./install.sh

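If an error appears at this point, edit install.sh and change line 193 from tar xzvf "$tarball" -C "$PREFIX" to tar xvf "$tarball" -C "$PREFIX".

Configure the environment variable. The deepdive executables are usually installed under ~/local/bin. Add the following line to ~/.bashrc and save it:

export PATH="/<path to the directory containing local>/local/bin:$PATH"

To find that path, change to the directory that contains the local folder and run pwd.

In this test the local folder was under /home/linux, so the line finally added to ~/.bashrc is: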

export PATH="/home/linux/local/bin:$PATH"

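Save the change and run source ~/.bashrc to apply it.

If you see deepdive: command not found, the environment variable configuration in this step is usually the cause. Two variants were tried during testing:

1. Adding export PATH="/root/local/bin:$PATH" to ~/.bashrc;

2. Adding export PATH="/root/local/bin:$PATH" to ~/.bash_profile.

Neither of them set the environment variable correctly in testing; the path has to point at the local/bin directory where deepdive was actually installed.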
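One quick way to tell whether a given PATH value would make deepdive reachable is to look the executable up programmatically. The sketch below uses only Python's standard shutil.which; the helper name find_on_path is illustrative and is not part of deepdive.

```python
import os
import shutil


def find_on_path(name, path=None):
    """Return the full path of executable `name` found on `path`, else None.

    `path` is a PATH-style string; it defaults to the current PATH variable.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_on_path("deepdive")
    if location is None:
        print("deepdive is not on PATH; check the export line in ~/.bashrc")
    else:
        print("deepdive found at", location)
```

Passing an explicit path argument, for example find_on_path("deepdive", "/home/linux/local/bin"), checks a candidate directory before it is written into ~/.bashrc.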

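Installing postgresql

Background knowledge: 1. A database can be created with, for example, createdb db_name.

2. A database can be dropped with, for example, dropdb db_name.

Run install.sh and choose option 6 to install postgresql. Then run: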

psql postgres

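If psql starts and shows its interactive prompt, postgresql was installed successfully.

Run \q to quit psql.

Run:

createdb transaction

to create the database. Then run:

psql transaction

to connect to that database. Run \l to list all databases and \q to quit psql.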

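Installing the nlp environment

Run nlp_setup.sh to set up the Chinese Stanford NLP environment.

Setting up the project skeleton

Create your own project folder transaction, and create the database configuration file inside it: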

echo "postgresql://localhost:5432/transaction" >db.url

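Then, under transaction, create the input data folder input, the script folder udf, the user configuration file app.ddlog, and the model configuration file deepdive.conf, following the layout of the provided transaction sample folder. (The newly created transaction folder holds the project to be compiled. The transaction folder inside the CNdeepdive directory is an already finished project, and the scripts and data files needed later can be copied directly from the corresponding parts of it.) In testing, setting the database configuration file the way the official Chinese deepdive tutorial does, with echo "postgresql://$USER@$HOSTNAME:5432/db_name" >db.url, failed to connect deepdive to postgresql.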
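The layout described above can also be created in one go. The following is a minimal sketch that only builds the empty skeleton; the function name create_project_skeleton is an illustrative assumption, and the db.url content is the same string as in the echo command above.

```python
from pathlib import Path


def create_project_skeleton(root, db_url="postgresql://localhost:5432/transaction"):
    """Create the folders and placeholder files of a deepdive project."""
    root = Path(root)
    (root / "input").mkdir(parents=True, exist_ok=True)  # input data
    (root / "udf").mkdir(exist_ok=True)                  # user-defined scripts
    # Empty placeholders; their contents are written in the later steps.
    (root / "app.ddlog").touch()
    (root / "deepdive.conf").touch()
    # Database connection string read by deepdive.
    (root / "db.url").write_text(db_url + "\n")
    return root
```

Calling create_project_skeleton("transaction") from the parent directory produces the same folders and files as the manual steps.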

Copy the bazzar folder from CNdeepdive/transaction/udf/ into udf/ of the current example project. This module has to be recompiled. Enter the bazzar/parser directory and run the build command:

sbt/sbt stage

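When the build finishes, the executables are generated under target.

Experiment steps

Importing prior data

We need entity pairs that are known to have a transaction relation, taken from a knowledge base, to serve as training data. The data used in this project comes from the Guotai'an (國泰安) database.

By matching pairs of stock codes that have transactions against code-to-company pairs, the company pairs with a transaction relation are filtered out and stored in transaction_dbdata.csv; the csv file is then placed in the input/ folder. Here it is enough to copy transaction_dbdata.csv from CNdeepdive/transaction/input into input/ of the current example project.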
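The matching step can be pictured as a simple lookup join. The sketch below is only an illustration: the article does not describe the format of the Guotai'an export, so the inputs (a list of stock-code pairs and a code-to-name dictionary) and the function names are assumptions.

```python
import csv


def company_pairs(code_pairs, code_to_name):
    """Map (code1, code2) pairs to (name1, name2), skipping unknown codes."""
    pairs = []
    for code1, code2 in code_pairs:
        if code1 in code_to_name and code2 in code_to_name:
            pairs.append((code_to_name[code1], code_to_name[code2]))
    return pairs


def write_pairs(pairs, path):
    """Write the company pairs as a two-column csv file without a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(pairs)
```

For example, company_pairs([("000001", "000002")], {"000001": "Company A", "000002": "Company B"}) yields one pair of names, which write_pairs can then store as input/transaction_dbdata.csv. The codes and names here are made up.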

Define the corresponding table in app.ddlog:

@source transaction_dbdata( @key company1_name text, @key company2_name text ).

Generate the postgresql table from the command line:

$ deepdive compile && deepdive do transaction_dbdata

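If app.ddlog has been changed, deepdive compile must be run first for the change to take effect. For a table that does not depend on other tables, deepdive automatically looks for the csv file of the same name under the input folder, then creates the table in postgresql and loads the file into it. When the command runs, deepdive opens an execution plan file in the current terminal; as in vi, press Esc and then type :wq to save, execute, and exit.

After the command above finishes, check the status it reports. If it shows run/ABORTED, an error occurred while creating the table and loading the data into the database.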

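Importing the articles to extract from

The articles to extract from (the example uses announcements of listed companies) already exist in CNdeepdive/transaction/input; copy them directly into the input directory of the current example project.

Create the corresponding articles table in app.ddlog: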

articles( id text, content text ).

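Plain spaces are enough as separators inside the table definition.

In the same way, run the following command to import the articles into postgresql.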

$ deepdive do articles

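deepdive can query the database directly, either with a query statement or with deepdive sql "SQL statement". Query the ids to check that the import succeeded:

$ deepdive query '?- articles(id, _).'

The query returns the ids of the imported articles.

Because the time taken by the next step depends on the number of rows of articles to extract from, only the first five rows of the original articles.csv were used here as a small test dataset.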
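Taking the first rows with a csv-aware reader is safer than cutting by text lines, since an article body may contain line breaks inside a quoted field. A minimal sketch, assuming articles.csv has no header row; the function name head_csv and the output file name are illustrative.

```python
import csv
from itertools import islice

# Article bodies can exceed the csv module's default field size limit.
csv.field_size_limit(10**9)


def head_csv(src, dst, n=5):
    """Copy the first `n` csv records of `src` to `dst`; return the count."""
    count = 0
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in islice(csv.reader(fin), n):
            writer.writerow(row)
            count += 1
    return count
```

Writing to a separate file, for example head_csv("articles_full.csv", "articles.csv"), keeps the original data available for a full run later.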

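Text processing with the nlp module

By default deepdive uses Stanford NLP for text processing. Given the input text, the nlp module works sentence by sentence and returns the tokens, lemmas, POS tags, NER tags, and dependency parse of each sentence, in preparation for the later feature extraction. We store these results in the sentences table.

Define the sentences table in app.ddlog to hold the nlp results: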

sentences( doc_id text, sentence_index int, sentence_text text, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], doc_offsets int[], dep_types text[], dep_tokens int[] ).

Define the NLP processing function nlp_markup:

function nlp_markup over ( doc_id text, content text ) returns rows like sentences implementation "udf/nlp_markup.sh" handles tsv lines.

Call nlp_markup with the following syntax, reading the input from the articles table and storing the output in the sentences table:

sentences += nlp_markup(doc_id, content) :- articles(doc_id, content).

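This declares a ddlog function that takes an article's doc_id and content as input and outputs rows in the format of the sentences table. The function calls udf/nlp_markup.sh to invoke the nlp module. The content of nlp_markup.sh can be found in the udf/ folder of the transaction example code; it is implemented by calling run.sh under udf/bazzar/parser. Here nlp_markup.sh has to be copied from transaction/udf in the CNdeepdive example code directory into the corresponding directory of the current project.

After compiling and running the sentences target (deepdive compile && deepdive do sentences), run the following command to query the generated result:

deepdive query 'doc_id, index, tokens, ner_tags | 5 ?- sentences(doc_id, index, text, tokens, lemmas, pos_tags, ner_tags, _, _, _).'

This step runs very slowly and may take four to five hours. That is why the number of rows in articles was reduced here, to shorten the run and finish the demo quickly.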

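Entity extraction and candidate entity pair generation

In this step we extract the candidate entities (companies) from the text and generate candidate entity pairs. First define the entity table in app.ddlog: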

company_mention( mention_id text, mention_text text, doc_id text, sentence_index int, begin_index int, end_index int ).

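Each entity is one row of the table, which stores the entity's id, its text, the id of the document it appears in, the sentence index, and its start and end positions within the sentence.

Then define the entity extraction function: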

function map_company_mention over ( doc_id text, sentence_index int, tokens text[], ner_tags text[] ) returns rows like company_mention implementation "udf/map_company_mention.py" handles tsv lines.

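map_company_mention.py also has to be copied from transaction/udf in the CNdeepdive example code directory into the corresponding directory of the current project. The script iterates over every sentence in the database, finds runs of consecutive tokens whose NER tag is ORG, applies some further filtering, and returns the candidate entities. It is a generator function that returns its output rows with yield statements. All the other scripts and files under transaction/udf in the CNdeepdive example code directory have to be copied over as well (including company_full_short.csv).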
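The core idea of the script, finding maximal runs of ORG-tagged tokens, can be sketched as follows. This is not the CNdeepdive script itself: it leaves out the extra filtering, and the mention_id format (doc id, sentence index, begin, end joined by underscores) is an assumption made for illustration.

```python
def org_mentions(doc_id, sentence_index, tokens, ner_tags):
    """Yield one company_mention row per maximal run of ORG-tagged tokens."""
    begin = None
    # The trailing None acts as a sentinel that closes a run ending the sentence.
    for i, tag in enumerate(list(ner_tags) + [None]):
        if tag == "ORG" and begin is None:
            begin = i                      # a new run starts here
        elif tag != "ORG" and begin is not None:
            end = i - 1                    # inclusive end index of the run
            # Chinese tokens are joined without spaces.
            text = "".join(tokens[begin:end + 1])
            mention_id = "%s_%d_%d_%d" % (doc_id, sentence_index, begin, end)
            yield (mention_id, text, doc_id, sentence_index, begin, end)
            begin = None
```

Each yielded tuple has the same column order as the company_mention table defined above.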

Then write the function call in app.ddlog, taking the input from the sentences table and writing the output to company_mention:

company_mention += map_company_mention( doc_id, sentence_index, tokens, ner_tags) :- sentences(doc_id, sentence_index, _, tokens, _, _, ner_tags, _, _, _).

Finally, compile and run:

$ deepdive compile && deepdive do company_mention

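Check the entity table that was just extracted:

$ deepdive query 'mention_id, mention_text, doc_id, sentence_index, begin_index, end_index | 50 ?- company_mention(mention_id, mention_text, doc_id, sentence_index, begin_index, end_index).'

Next we generate the entity pairs, that is, the two companies whose relation is to be predicted. In this step we take the Cartesian product of the entity table with itself and use a custom script to filter out companies that do not meet the conditions for forming a transaction. Define the table as follows: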

transaction_candidate( p1_id text, p1_name text, p2_id text, p2_name text ).

Count the number of entities in each sentence:

num_company(doc_id, sentence_index, COUNT(p)) :- company_mention(p, _, doc_id, sentence_index, _, _).

Define the filter function:

function map_transaction_candidate over ( p1_id text, p1_name text, p2_id text, p2_name text ) returns rows like transaction_candidate implementation "udf/map_transaction_candidate.py" handles tsv lines.

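The rules for screening candidate entities can be defined inside this function. Call the function, combined with other rules, to filter the entity pairs further, and store the result in the transaction_candidate table: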

transaction_candidate += map_transaction_candidate(p1, p1_name, p2, p2_name) :- num_company(same_doc, same_sentence, num_p), company_mention(p1, p1_name, same_doc, same_sentence, p1_begin, _), company_mention(p2, p2_name, same_doc, same_sentence, p2_begin, _), num_p < 5, p1_name != p2_name, p1_begin != p2_begin.

Some simple filters can be expressed directly in the ddlog rule in app.ddlog, for example p1_name != p2_name, which filters out pairs made up of two identical entities.
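To make the conditions in the rule concrete, here is a plain-Python sketch of the same pairing logic: mentions are grouped by sentence, sentences with five or more mentions are skipped, and ordered pairs with equal names or equal start positions are dropped. It mirrors only the ddlog conditions, not the extra filtering done in udf/map_transaction_candidate.py, and the function name is illustrative.

```python
from collections import defaultdict
from itertools import permutations


def candidate_pairs(mentions, max_per_sentence=5):
    """Yield (p1_id, p1_name, p2_id, p2_name) for mentions in one sentence.

    `mentions` holds company_mention rows:
    (mention_id, mention_text, doc_id, sentence_index, begin_index, end_index).
    """
    by_sentence = defaultdict(list)
    for m in mentions:
        by_sentence[(m[2], m[3])].append(m)
    for group in by_sentence.values():
        if len(group) >= max_per_sentence:      # rule condition: num_p < 5
            continue
        for m1, m2 in permutations(group, 2):   # ordered pairs, as in the join
            if m1[1] != m2[1] and m1[4] != m2[4]:
                yield (m1[0], m1[1], m2[0], m2[1])
```

Because the pairs are ordered, each unordered pair of companies appears twice, once in each direction, just as the self-join in the rule produces it.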

Compile and run:

$ deepdive compile && deepdive do transaction_candidate

This generates the candidate entity table.

If this step reports an error, change the relative path of company_full_short.csv (ENTITY_FILE) in udf/transform.py to an absolute path.
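Instead of hard-coding an absolute path, the path can be derived from the location of the script, so it stays valid when the project folder moves. This is a sketch under the assumption that the error comes from the script being started from a different working directory; path_next_to is a hypothetical helper, not something defined in udf/transform.py.

```python
import os


def path_next_to(script_file, filename):
    """Return the absolute path of `filename` in the folder of `script_file`."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)


if __name__ == "__main__":
    # Inside a udf script, __file__ is the script's own path.
    print(path_next_to(__file__, "company_full_short.csv"))
```

In the script, ENTITY_FILE could then be set from path_next_to(__file__, "company_full_short.csv"), provided the csv file sits in the same folder as the script.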

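The resulting candidate entity pair table can be checked with a deepdive query on transaction_candidate, in the same way as company_mention was checked above.

Feature extraction

In this step we extract text features for the candidate entity pairs.

Define the feature table:

transaction_feature( p1_id text, p2_id text, feature text ).

The feature column here holds the collection of text features between an entity pair.

Generating the feature table takes the entity pair table and the text table as input. The input and output attributes are defined in app.ddlog as follows: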

function extract_transaction_features over ( p1_id text, p2_id text, p1_begin_index int, p1_end_index int, p2_begin_index int, p2_end_index int, doc_id text, sent_index int, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], dep_types text[], dep_tokens int[] ) returns rows like transaction_feature implementation "udf/extract_transaction_features.py" handles tsv lines.

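The function calls extract_transaction_features.py to extract the features. It uses the ddlib library that ships with deepdive to obtain various window features over POS tags, NER tags, and word sequences. Custom features can also be defined here.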
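As an illustration of what a custom window feature could look like, the sketch below builds a few string features from the tokens around and between two mention spans. It deliberately does not use ddlib, and the feature names are made up; it only shows the general shape of such features.

```python
def window_features(tokens, pos_tags, begin1, end1, begin2, end2, window=2):
    """Return simple string features for two mention spans.

    Span indices are inclusive token positions within one sentence.
    """
    # Order the spans so that the first one is the leftmost in the sentence.
    (lb, le), (rb, re_) = sorted([(begin1, end1), (begin2, end2)])
    between = tokens[le + 1:rb]
    left = tokens[max(0, lb - window):lb]       # tokens before the left span
    right = tokens[re_ + 1:re_ + 1 + window]    # tokens after the right span
    return [
        "WORDS_BETWEEN_[%s]" % " ".join(between),
        "POS_BETWEEN_[%s]" % " ".join(pos_tags[le + 1:rb]),
        "NUM_WORDS_BETWEEN_%d" % len(between),
        "LEFT_WINDOW_[%s]" % " ".join(left),
        "RIGHT_WINDOW_[%s]" % " ".join(right),
    ]
```

Each returned string would become one row (p1_id, p2_id, feature) of the feature table.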

Join the sentences table with the mention table, feed the result to the function, and write the output to the transaction_feature table.

transaction_feature += extract_transaction_features( p1_id, p2_id, p1_begin_index, p1_end_index, p2_begin_index, p2_end_index, doc_id, sent_index, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_tokens ) :- company_mention(p1_id, _, doc_id, sent_index, p1_begin_index, p1_end_index), company_mention(p2_id, _, doc_id, sent_index, p2_begin_index, p2_end_index), sentences(doc_id, sent_index, _, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_tokens).

Then compile and run to generate the feature table:

$ deepdive compile && deepdive do transaction_feature

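Run the following statement to inspect the generated result:

deepdive query '| 20 ?- transaction_feature(p1_id, p2_id, feature).'

We now have the entity pairs whose relation we want to decide, together with their feature sets.

Labeling samples

In this step we want to mark some positive and negative examples among the candidate entity pairs, in two ways: by linking the known entity pairs to the candidate pairs, and by assigning some positive and negative labels with rules.

First define the transaction_label table in app.ddlog to store the supervision data: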

@extraction transaction_label( @key @references(relation="has_transaction", column="p1_id", alias="has_transaction") p1_id text, @key @references(relation="has_transaction", column="p2_id", alias="has_transaction") p2_id text, @navigable label int, @navigable rule_id text ).

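rule_id is the name of the rule that determined the label. A positive label means positive correlation and a negative label means negative correlation; the larger the absolute value, the stronger the correlation.

Initialize the table by copying the transaction_candidate table, with every label set to zero: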

transaction_label(p1, p2, 0, NULL) :- transaction_candidate(p1, _, p2, _).

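Import the transaction_dbdata data prepared at the beginning into the transaction_label table, with rule_id set to "from_dbdata". Because the Guotai'an data is fairly authoritative, it can be given a relatively high weight, set to 3 here. Define this in app.ddlog as follows: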

transaction_label(p1,p2, 3, "from_dbdata") :- transaction_candidate(p1, p1_name, p2, p2_name), transaction_dbdata(n1, n2), [ lower(n1) = lower(p1_name), lower(n2) = lower(p2_name) ; lower(n2) = lower(p1_name), lower(n1) = lower(p2_name) ].
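The matching condition in this rule is a case-insensitive comparison that accepts the two names in either order. A plain-Python sketch of the same logic, with an illustrative function name:

```python
def label_from_dbdata(candidates, dbdata, weight=3):
    """Yield (p1_id, p2_id, weight, "from_dbdata") for known company pairs.

    `candidates` holds (p1_id, p1_name, p2_id, p2_name) rows and
    `dbdata` holds (company1_name, company2_name) rows.
    """
    known = set()
    for n1, n2 in dbdata:
        a, b = n1.lower(), n2.lower()
        known.add((a, b))
        known.add((b, a))  # the rule accepts either order of the two names
    for p1_id, p1_name, p2_id, p2_name in candidates:
        if (p1_name.lower(), p2_name.lower()) in known:
            yield (p1_id, p2_id, weight, "from_dbdata")
```

Candidates whose names do not occur in the known set simply produce no row, so they keep only their initial label of zero.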

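If only the downloaded entity pairs are used, their overlap with the entity pairs extracted from unseen text may be small, which does not help in learning the feature weights. So some logic rules can be used to pre-label the unseen text.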

function supervise over ( p1_id text, p1_begin int, p1_end int, p2_id text, p2_begin int, p2_end int, doc_id text, sentence_index int, sentence_text text, tokens text[], lemmas text[], pos_tags text[], ner_tags text[], dep_types text[], dep_tokens int[] ) returns ( p1_id text, p2_id text, label int, rule_id text ) implementation "udf/supervise_transaction.py" handles tsv lines.

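The labeling function is defined in app.ddlog and calls udf/supervise_transaction.py; the rule names and their weights are defined in the script.

Call the labeling function and write the data labeled by the rules into the transaction_label table: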

transaction_label += supervise( p1_id, p1_begin, p1_end, p2_id, p2_begin, p2_end, doc_id, sentence_index, sentence_text, tokens, lemmas, pos_tags, ner_tags, dep_types, dep_token_indexes ) :- transaction_candidate(p1_id, _, p2_id, _), company_mention(p1_id, p1_text, doc_id, sentence_index, p1_begin, p1_end), company_mention(p2_id, p2_text, _, _, p2_begin, p2_end), sentences( doc_id, sentence_index, sentence_text, tokens, lemmas, pos_tags, ner_tags, _, dep_types, dep_token_indexes ).

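Different rules may cover the same entity pair and so give different or even opposite labels. Create the transaction_label_resolved table to unify the labels of each entity pair: the labels are summed, so that the results from the multiple rules and from the knowledge base vote on each pair.

transaction_label_resolved(p1_id, p2_id, SUM(vote)) :- transaction_label(p1_id, p2_id, vote, rule_id).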
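The vote is a grouped sum over (p1_id, p2_id). In plain Python the same aggregation can be sketched like this; the function name is illustrative.

```python
from collections import defaultdict


def resolve_labels(labels):
    """Sum the votes of all (p1_id, p2_id, vote, rule_id) rows per pair."""
    totals = defaultdict(int)
    for p1_id, p2_id, vote, _rule_id in labels:
        totals[(p1_id, p2_id)] += vote
    return dict(totals)
```

A pair that received 0 from the initialization, 3 from from_dbdata, and -1 from some negative rule therefore ends up with a total of 2.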

Run the following command to get the final labels:

$ deepdive do transaction_label_resolved

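Building the model

The steps above produced all the data that has to be prepared up front. Now we start building the model.

Defining the variable table

First define the table that stores the final result. The ? suffix marks it as a variable table in the user schema, that is, the table whose relation is to be inferred. Here we predict whether a transaction relation exists between companies: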

@extraction has_transaction?( p1_id text, p2_id text ).

Write the known variables according to the labeling result:

has_transaction(p1_id, p2_id) = if l > 0 then TRUE else if l < 0 then FALSE else NULL end :- transaction_label_resolved(p1_id, p2_id, l).

Some of the variables in the variable table now have known labels and serve as prior (evidence) variables.
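The conditional expression in the rule maps each summed vote to a three-valued label. A small sketch of the same mapping, using None for NULL; the function names are illustrative.

```python
def to_evidence(total):
    """Map a summed vote to True, False, or None (unknown)."""
    if total > 0:
        return True
    if total < 0:
        return False
    return None


def evidence_counts(resolved):
    """Count how many pairs are positive, negative, and unlabeled."""
    counts = {True: 0, False: 0, None: 0}
    for total in resolved.values():
        counts[to_evidence(total)] += 1
    return counts
```

Counting the three groups before training is a cheap check that the supervision produced both positive and negative examples.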

Finally, compile and run the variable table:

deepdive compile && deepdive do has_transaction

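Building the factor graph

Specify the features: join each entity pair in has_transaction with the feature table; through the feature factors, the weights of these features are learned globally. Define in app.ddlog: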

@weight(f) has_transaction(p1_id, p2_id) :- transaction_candidate(p1_id, _, p2_id, _), transaction_feature(p1_id, p2_id, f).
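For intuition about what one weight per feature means, consider a simplified case: a single boolean variable connected only to its feature factors, with no symmetry rule and no evidence. Its marginal probability is then the logistic function of the sum of the weights of its features. The sketch below is a toy illustration with made-up weights; it is not how deepdive performs learning or inference over the full factor graph.

```python
import math


def isolated_marginal(features, weights):
    """P(variable is true) for one variable with only unary feature factors.

    `weights` maps a feature string to its learned weight; unknown features
    contribute nothing.
    """
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))
```

With no informative features the probability is 0.5, and features with positive weights push it toward 1.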

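Specify dependencies between variables: we can state a rule that two variable tables obey and give that rule a weight. For example, if c1 has a transaction with c2, it follows that c2 has a transaction with c1. This always holds, so the rule is given a relatively high weight: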

@weight(3.0) has_transaction(p1_id, p2_id) => has_transaction(p2_id, p1_id) :- transaction_candidate(p1_id, _, p2_id, _).

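Dependencies between variable tables let deepdive support extraction over multiple relations well.

Finally, compile and generate the final probabilistic model: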

$ deepdive compile && deepdive do probabilities

View the predicted probabilities of transaction relations between companies:

$ deepdive sql "SELECT p1_id, p2_id, expectation FROM has_transaction_label_inference ORDER BY random() LIMIT 20"

With that, our transaction relation extraction demo is complete.
