Feature-based BERT Chinese Title Classification in Practice

In the previous post (https://zhuanlan.zhihu.com/p/72448986) we implemented fine-tuning-based Chinese title classification on top of the pretrained BERT model provided by pytorch-pretrained-bert. In fact, pytorch-pretrained-bert ships fairly rich, ready-made wrappers for downstream NLP tasks, such as BertForSequenceClassification for text classification, BertForTokenClassification for token-level classification, and BertForNextSentencePrediction for judging whether one sentence follows another.

In fact, all of the classes mentioned above are adaptations of the underlying BertModel to specific applications: BertForSequenceClassification and BertForNextSentencePrediction add a dense layer on top of the pooled output (with one output per class, or two outputs for next-sentence prediction), while BertForTokenClassification adds a dense layer with one output per label on top of each token's hidden state. The code is therefore largely the same across tasks, and all of them are very convenient to use. If you want to inspect the concrete differences between the network structures, you can visualize them with the tool introduced in zhuanlan.zhihu.com/p/71 ; below is the network structure diagram of the title classification model from the previous article.

As the diagram shows, since there are 28 categories in total, the final fully connected layer maps the 768-dimensional pooled output to 28 outputs. In short, visualization helps us understand the differences between the networks more intuitively.
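To make the adaptation concrete, here is a minimal sketch, not the library's exact source code, of how a sequence-classification head of the kind just described sits on top of BertModel's pooled output; the 28-way title classifier from the previous post follows this pattern (the class name and dropout rate are illustrative assumptions).

import torch.nn as nn
from pytorch_pretrained_bert import BertModel

class TitleClassifier(nn.Module):
    """A BertForSequenceClassification-style head: pooled [CLS] output -> one logit per class."""
    def __init__(self, num_labels=28):
        super(TitleClassifier, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(768, num_labels)  # 768 -> 28 for the title data set

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # the second element of BertModel's output is the pooled [CLS] representation
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
                                     output_all_encoded_layers=False)
        return self.classifier(self.dropout(pooled_output))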

The previous post focused on the fine-tuning-based approach; this one focuses on the feature-based approach, i.e. using BERT as a tool for extracting/generating semantic text features. By producing a low-dimensional dense feature vector for each sample, we can quickly plug the data into all kinds of machine-learning and deep-learning models. This approach may not exploit BERT's representation-learning capacity to the full, but it offers great convenience and freedom in choosing and designing the downstream model. The data used here was described in detail in the previous article of this column, so please refer to it if needed.

Loading the libraries

import csv
import os
import sys
import pickle
import pandas as pd
import numpy as np
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Conv1d, BatchNorm1d, MaxPool1d, ReLU, Dropout
from torch.optim import Adam
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.nn import CrossEntropyLoss, BCEWithLogitsLoss
from tqdm import tqdm_notebook, trange
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
import matplotlib.pyplot as plt
%matplotlib inline

Data preprocessing

class DataPrecessForSingleSentence(object):
    """
    Text preprocessing for single sentences.
    """

    def __init__(self, bert_tokenizer, max_workers=10):
        """
        bert_tokenizer : the BERT tokenizer
        max_workers   : size of the thread pool used for tokenization
        """
        self.bert_tokenizer = bert_tokenizer
        # create a thread pool (multiprocessing is problematic inside notebooks)
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def get_input(self, dataset, max_seq_len=30):
        """
        Tokenize, convert to IDs, truncate and pad the input texts (via the thread pool)
        to produce sequences that can be fed into the model.

        Args:
            dataset     : a pandas DataFrame with two columns, the first containing the text
                          and the second the integer-encoded class label.
            max_seq_len : target sequence length; choose it after inspecting the length
                          distribution of the texts, and keep it <= 512 (BERT's maximum).

        Returns:
            seqs         : token-ID sequences with [CLS] and [SEP] prepended/appended and
                           zero-padded at the end up to max_seq_len.
            seq_masks    : 0/1 sequences of the same length as seqs; 1 marks a real token,
                           0 marks a padding position.
            seq_segments : same shape as seqs; all zeros, because the input is a single sentence.
            labels       : the integer-encoded class labels.
        """
        sentences = dataset.iloc[:, 0].tolist()
        labels = dataset.iloc[:, 1].tolist()
        # tokenize
        tokens_seq = list(
            self.pool.map(self.bert_tokenizer.tokenize, sentences))
        # produce fixed-length sequences together with their masks and segment IDs
        result = list(
            self.pool.map(self.trunate_and_pad, tokens_seq,
                          [max_seq_len] * len(tokens_seq)))
        seqs = [i[0] for i in result]
        seq_masks = [i[1] for i in result]
        seq_segments = [i[2] for i in result]
        return seqs, seq_masks, seq_segments, labels

    def trunate_and_pad(self, seq, max_seq_len):
        """
        1. Since this class handles single sentences, following BERT's input convention the
           special tokens [CLS] and [SEP] are prepended/appended, so the raw sequence must be
           truncated to at most max_seq_len - 2 tokens.
        2. The resulting [CLS] + seq + [SEP] sequence is zero-padded up to max_seq_len.

        Args:
            seq         : the input token sequence (a single sentence here).
            max_seq_len : the sequence length after adding [CLS] and [SEP].

        Returns:
            seq         : the token-ID sequence with [CLS]/[SEP] added and zero padding at the end.
            seq_mask    : 0/1 sequence of the same length; 1 for real tokens, 0 for padding.
            seq_segment : same shape as seq; all zeros, because the input is a single sentence.
        """
        # truncate sequences that are too long
        if len(seq) > (max_seq_len - 2):
            seq = seq[0:(max_seq_len - 2)]
        # add the special tokens at the beginning and the end
        seq = ['[CLS]'] + seq + ['[SEP]']
        # convert tokens to IDs
        seq = self.bert_tokenizer.convert_tokens_to_ids(seq)
        # build the padding according to max_seq_len and the current length
        padding = [0] * (max_seq_len - len(seq))
        # build seq_mask: 1 for real tokens, 0 for padding
        seq_mask = [1] * len(seq) + padding
        # build seq_segment: all zeros for a single sentence
        seq_segment = [0] * len(seq) + padding
        # append the padding to seq
        seq += padding
        assert len(seq) == max_seq_len
        assert len(seq_mask) == max_seq_len
        assert len(seq_segment) == max_seq_len
        return seq, seq_mask, seq_segment
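To make the truncate-and-pad logic concrete, here is a toy illustration using a hypothetical stand-in tokenizer; the vocabulary below is made up for the example, whereas the article itself uses the real BertTokenizer.

# toy illustration only: a fake tokenizer whose IDs come from a tiny made-up vocabulary
class _FakeTokenizer:
    vocab = {'[CLS]': 101, '[SEP]': 102, '深': 1, '度': 2, '學': 3, '習': 4}
    def tokenize(self, text):
        return list(text)
    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(t, 0) for t in tokens]

toy = DataPrecessForSingleSentence(bert_tokenizer=_FakeTokenizer(), max_workers=1)
seq, seq_mask, seq_segment = toy.trunate_and_pad(list('深度學習'), max_seq_len=8)
print(seq)          # [101, 1, 2, 3, 4, 102, 0, 0]
print(seq_mask)     # [1, 1, 1, 1, 1, 1, 0, 0]
print(seq_segment)  # [0, 0, 0, 0, 0, 0, 0, 0]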

Loading the data

data = pd.read_pickle("/home/zhaoqi/title_category.pkl")
data.columns = ["text", "label"]
le = LabelEncoder()
le.fit(data.label.tolist())
data["label"] = le.transform(data.label.tolist())

Generating the training data

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-chinese", do_lower_case=False)
processor = DataPrecessForSingleSentence(bert_tokenizer=bert_tokenizer)
seqs, seq_masks, seq_segments, labels = processor.get_input(
    dataset=data, max_seq_len=30)

Loading the pretrained BERT model

bert = BertModel.from_pretrained("bert-base-chinese")

Formatting the data

t_seqs = torch.tensor(seqs, dtype=torch.long)
t_seq_masks = torch.tensor(seq_masks, dtype=torch.long)
t_seq_segments = torch.tensor(seq_segments, dtype=torch.long)
t_labels = torch.tensor(labels, dtype=torch.long)

train_data = TensorDataset(t_seqs, t_seq_masks, t_seq_segments, t_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(dataset=train_data, sampler=train_sampler, batch_size=256)
bert.eval()

Feature extraction

The PyTorch pytorch-pretrained-bert implementation builds the model from nested nn.Module components (the twelve BertLayer blocks of the encoder live in an nn.ModuleList), so inspecting the model's structure reveals a very clear hierarchy.
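The listing below is simply the result of printing the loaded model; individual sub-modules can be reached via attribute access in the same way, for example:

print(bert)                          # prints the module hierarchy shown below
first_layer = bert.encoder.layer[0]  # the first of the twelve BertLayer blocks
print(bert.pooler)                   # Linear(768 -> 768) followed by Tanh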

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): BertLayerNorm()
          (dropout): Dropout(p=0.1)
        )
      )
      ... layers (1) through (11) are identical to layer (0) and omitted here ...
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

BERT's output is a tuple. The first element is a list of length 12 (one tensor per encoder layer), where each tensor has shape [batch_size, max_seq_len, 768]; with the settings above, each tensor is therefore [256, 30, 768]. The second element is a tensor of size [batch_size, 768]. The second element can be regarded as a condensed version of the first: every input sequence starts with the [CLS] token, and since BERT is a bidirectional encoder, the hidden state of [CLS] can be taken as a high-level representation that fuses information from the whole sequence, so it suffices to keep only the hidden state of each sample's first token. This way of reducing dimensionality can be seen as a form of pooling; in the official code, a fully connected layer with both input and output size 768 (followed by a Tanh) is additionally applied on top of it.
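A quick way to confirm these shapes (an illustrative check, assuming one batch drawn from the train_dataloader defined above):

# pull one batch and inspect the two elements of BertModel's output
batch_seqs, batch_seq_masks, batch_seq_segments, batch_labels = next(iter(train_dataloader))
with torch.no_grad():
    encoded_layers, pooled_output = bert(batch_seqs,
                                         token_type_ids=batch_seq_segments,
                                         attention_mask=batch_seq_masks)
print(len(encoded_layers))        # 12, one tensor per encoder layer
print(encoded_layers[-1].shape)   # torch.Size([256, 30, 768])
print(pooled_output.shape)        # torch.Size([256, 768])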

In this article we only extract the second tensor of BERT's output as the representation of each sequence sample; the code is as follows:

bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

train_features = []
train_labels = []
with torch.no_grad():
    for step, batch_data in enumerate(
            tqdm_notebook(train_dataloader, desc="Iteration")):
        batch_seqs, batch_seq_masks, batch_seq_segments, batch_labels = batch_data
        # keep only the pooled [CLS] representation (the second element of the output);
        # note that the mask and segment tensors must go to the right keyword arguments
        features = bert(batch_seqs,
                        token_type_ids=batch_seq_segments,
                        attention_mask=batch_seq_masks)[1]
        train_features.append(features.detach())
        train_labels.append(batch_labels.detach())
train_features = torch.cat(train_features)
train_labels = torch.cat(train_labels)

Training

We build a simple neural network to predict the category.

class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()

        # despite the name, this is a small MLP head: 768 -> 384 -> 28
        self.classifier = nn.Sequential(*[
            nn.Linear(in_features=768, out_features=384, bias=True),
            BatchNorm1d(384),
            ReLU(inplace=True),
            Dropout(0.5),
            nn.Linear(in_features=384, out_features=28, bias=True)
        ])

    def forward(self, x):
        return self.classifier(x)

lr = LogisticRegression()
train_data = TensorDataset(train_features, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(dataset=train_data, sampler=train_sampler, batch_size=512)

param_optimizer = list(lr.parameters())
optimizer = Adam(param_optimizer, lr=1e-04)

loss_function = CrossEntropyLoss()
loss_collect = []

for i in trange(10, desc="Epoch"):
    for step, batch_data in enumerate(
            tqdm_notebook(train_dataloader, desc="Iteration")):
        batch_features, batch_labels = batch_data
        # CrossEntropyLoss expects raw logits and integer class indices,
        # so no softmax and no one-hot encoding of the labels is needed here
        logits = lr(batch_features)
        loss = loss_function(logits, batch_labels)
        loss.backward()
        loss_collect.append(loss.item())
        print("\r%f" % loss.item(), end="")
        optimizer.step()
        optimizer.zero_grad()

Visualizing the training loss

plt.figure(figsize=(12, 8))
plt.plot(range(len(loss_collect)), loss_collect, 'g.')
plt.grid(True)
plt.show()
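Once trained, the head can be applied to the extracted features directly. The snippet below is a minimal sketch of scoring on the training features themselves; for a real evaluation you would of course hold out a separate set, which this article does not do.

# illustrative only: predict classes for the already-extracted features
lr.eval()
with torch.no_grad():
    preds = lr(train_features).argmax(dim=1)
train_acc = (preds == train_labels).float().mean().item()
print("accuracy on the training features: %.4f" % train_acc)
# le.inverse_transform can map the predicted integer IDs back to the original category names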

As the plot shows, the loss on the training data drops quickly, but in absolute terms it is still far from converged: since this is only meant to illustrate the approach, the number of epochs was kept very small and no hyper-parameter tuning was done at all. In summary, when compute is limited, using BERT as a feature generator to extract dense, low-dimensional features for the many kinds of text tasks, and quickly building models on top of them, is a perfectly viable path.
