Feature-based BERT Chinese Title Classification in Practice

In the previous post (https://zhuanlan.zhihu.com/p/72448986) we implemented fine-tuned Chinese title classification with the pretrained BERT model provided by pytorch-pretrained-bert. In fact, pytorch-pretrained-bert ships fairly rich wrappers for downstream NLP tasks, such as BertForSequenceClassification for text classification, BertForTokenClassification for token-level classification, and BertForNextSentencePrediction for judging whether one sentence follows another.

All of the classes above are adaptations of the base BertModel for particular applications. BertForSequenceClassification and BertForNextSentencePrediction simply add a dense layer on top of the pooled output whose output size equals the number of classes (2 in the next-sentence case), while BertForTokenClassification adds a dense layer of size num_labels on top of every token's hidden state (a dense layer whose output size equals the vocabulary size is what BertForMaskedLM uses). The code is therefore very similar across tasks and all of them are convenient to use. If you want to inspect the concrete differences between the network structures, you can visualize them with the tools mentioned in zhuanlan.zhihu.com/p/71 ; below is the network structure diagram of the title classification model from the previous article.

From the figure above you can see that, since there are 28 categories in total, the model ends with a fully connected layer mapping the 768-dimensional pooled output to 28 classes. In short, visualization helps us understand the differences between the networks more intuitively.
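To make the relationship between BertModel and these task wrappers concrete, here is a minimal sketch of such an adaptation. It is only an illustration of the idea, not the library's own implementation; the class name and the dropout rate are my own choices, and only BertModel from pytorch_pretrained_bert is taken from the library.

import torch.nn as nn
from pytorch_pretrained_bert import BertModel

class BertWithClassificationHead(nn.Module):
    """Illustrative sketch: a dense layer on top of BERT's pooled output,
    roughly how BertForSequenceClassification is put together."""
    def __init__(self, num_labels=28):
        super(BertWithClassificationHead, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-chinese')
        self.dropout = nn.Dropout(0.1)
        # The pooled output has size 768; the head maps it to the number of classes.
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        # BertModel returns (encoded_layers, pooled_output); only the pooled output is needed here.
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
        return self.classifier(self.dropout(pooled_output))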

The previous post focused on the fine-tuning approach; this one focuses on the feature-based approach, i.e. using BERT as a tool for extracting/generating semantic text features. By producing low-dimensional dense features for each sample, we can quickly plug them into a wide range of machine learning and deep learning models. This may not exploit BERT's representation-learning capacity to the full, but it offers a lot of convenience and freedom in choosing and designing the downstream model. The data used in this article was described in detail in the previous post, so please refer to the previous article in this column if needed.

Loading the libraries

import csv
import os
import sys
import pickle
import pandas as pd
import numpy as np
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Conv1d, BatchNorm1d, MaxPool1d, ReLU, Dropout
from torch.optim import Adam
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.nn import CrossEntropyLoss, BCEWithLogitsLoss
from tqdm import tqdm_notebook, trange
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule
import matplotlib.pyplot as plt
%matplotlib inline

Data preprocessing

class DataPrecessForSingleSentence(object):
    """
    Preprocessing for single-sentence text.
    """

    def __init__(self, bert_tokenizer, max_workers=10):
        """
        bert_tokenizer : the BERT tokenizer
        """
        self.bert_tokenizer = bert_tokenizer
        # Create a thread pool (multiprocessing is awkward to use inside a notebook)
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def get_input(self, dataset, max_seq_len=30):
        """
        Tokenize, convert to IDs, truncate and pad the input texts (in parallel threads)
        to obtain the sequences that can be fed to the model.

        Args:
            dataset : a pandas dataframe with two columns; the first column is the text,
                      the second column is the label (an integer class ID).
            max_seq_len : target sequence length. It should be chosen after inspecting the
                          text-length distribution beforehand, and must not exceed 512
                          (BERT's maximum sequence length).

        Returns:
            seqs : token-ID sequences with [CLS] and [SEP] added at the head and tail,
                   padded with 0 at the end up to max_seq_len.
            seq_masks : 0/1 sequences of the same length as seqs, marking whether a
                        position holds a real token (1) or padding (0).
            seq_segments : same shape as seqs; all zeros because the input is a single sentence.
            labels : the label of each sample.
        """
        sentences = dataset.iloc[:, 0].tolist()
        labels = dataset.iloc[:, 1].tolist()
        # Tokenize
        tokens_seq = list(
            self.pool.map(self.bert_tokenizer.tokenize, sentences))
        # Produce fixed-length sequences together with their masks and segment IDs
        result = list(
            self.pool.map(self.trunate_and_pad, tokens_seq,
                          [max_seq_len] * len(tokens_seq)))
        seqs = [i[0] for i in result]
        seq_masks = [i[1] for i in result]
        seq_segments = [i[2] for i in result]
        return seqs, seq_masks, seq_segments, labels

    def trunate_and_pad(self, seq, max_seq_len):
        """
        1. Because this class handles single sentences, the special tokens [CLS] and [SEP]
           have to be prepended and appended following BERT's input convention, so the raw
           sequence may contain at most max_seq_len - 2 tokens and is truncated if longer.
        2. The resulting [CLS] + seq + [SEP] sequence is padded with 0 up to max_seq_len.

        Args:
            seq : the input token sequence (a single sentence here).
            max_seq_len : the sequence length after [CLS] and [SEP] have been added.

        Returns:
            seq : token IDs with [CLS]/[SEP] added, padded with 0 up to max_seq_len.
            seq_mask : 0/1 sequence of the same length; 1 for real tokens, 0 for padding.
            seq_segment : same shape as seq; all zeros because the input is a single sentence.
        """
        # Truncate over-long sequences
        if len(seq) > (max_seq_len - 2):
            seq = seq[0:(max_seq_len - 2)]
        # Add the special tokens at the head and tail
        seq = ['[CLS]'] + seq + ['[SEP]']
        # Convert tokens to IDs
        seq = self.bert_tokenizer.convert_tokens_to_ids(seq)
        # Build the padding according to max_seq_len and the current length
        padding = [0] * (max_seq_len - len(seq))
        # Build seq_mask
        seq_mask = [1] * len(seq) + padding
        # Build seq_segment
        seq_segment = [0] * len(seq) + padding
        # Append the padding to seq
        seq += padding
        assert len(seq) == max_seq_len
        assert len(seq_mask) == max_seq_len
        assert len(seq_segment) == max_seq_len
        return seq, seq_mask, seq_segment
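As a quick sanity check of the output format, the processor can be exercised on a single sentence. The title text below is made up purely for illustration, and the example assumes the bert-base-chinese vocabulary can be downloaded.

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', do_lower_case=False)
processor = DataPrecessForSingleSentence(bert_tokenizer=tokenizer)
tokens = tokenizer.tokenize('人工智能改变世界')
seq, seq_mask, seq_segment = processor.trunate_and_pad(tokens, max_seq_len=12)
# seq: 12 token IDs, starting with [CLS], ending the real tokens with [SEP], then 0-padding
# seq_mask: 1 for real tokens (including [CLS]/[SEP]), 0 for the padding positions
# seq_segment: all zeros, because the input is a single sentence
print(seq, seq_mask, seq_segment)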

Loading the data

data = pd.read_pickle("/home/zhaoqi/title_category.pkl")
data.columns = ['text', 'label']
le = LabelEncoder()
le.fit(data.label.tolist())
data['label'] = le.transform(data.label.tolist())
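If you do not have the pickle file from the previous post, a toy dataframe with the same two-column layout can stand in for it so the rest of the code still runs. The titles and categories below are invented purely for illustration.

data = pd.DataFrame({
    'text': ['球队在主场赢下关键一战', '新款手机发布会定档下月', '央行宣布下调存款准备金率'],
    'label': ['体育', '科技', '财经']
})
le = LabelEncoder()
# Encode the string categories into integer class IDs, just like the real dataset
data['label'] = le.fit_transform(data['label'].tolist())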

Generating the training data

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese', do_lower_case=False)
processor = DataPrecessForSingleSentence(bert_tokenizer=bert_tokenizer)
seqs, seq_masks, seq_segments, labels = processor.get_input(
    dataset=data, max_seq_len=30)

Loading the pretrained BERT model

bert = BertModel.from_pretrained('bert-base-chinese')

Formatting the data

t_seqs = torch.tensor(seqs, dtype=torch.long)
t_seq_masks = torch.tensor(seq_masks, dtype=torch.long)
t_seq_segments = torch.tensor(seq_segments, dtype=torch.long)
t_labels = torch.tensor(labels, dtype=torch.long)

train_data = TensorDataset(t_seqs, t_seq_masks, t_seq_segments, t_labels)
train_sampler = RandomSampler(train_data)
train_dataloder = DataLoader(dataset=train_data, sampler=train_sampler, batch_size=256)
bert.eval()

Feature extraction

The PyTorch version of pretrained BERT builds the model out of nested nn.Module sub-modules (the encoder layers, for example, live in an nn.ModuleList), so printing the model reveals a very clear hierarchy.

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(21128, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): BertLayerNorm()
          (dropout): Dropout(p=0.1)
        )
      )
      ... layers (1) through (11): eleven more BertLayer blocks identical to layer (0), omitted here for brevity ...
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

BERT's output is a tuple. The first element is a list of length 12 (one tensor per encoder layer), and each tensor has shape [batch_size, max_seq_len, 768]; with the settings above, each tensor here is [256, 30, 768]. The second element is a tensor of size [batch_size, 768]. The second element can be thought of as a condensed version of the first: every input sequence starts with the [CLS] token, and because BERT is a bidirectional encoder, the [CLS] position can be regarded as having fused the high-level representation of the whole sequence, so it is enough to take the hidden state of each sample's first token. This kind of dimensionality reduction can also be seen as a form of pooling; on top of it, the official code additionally applies a fully connected layer whose input and output sizes are both 768 (followed by a tanh, as shown in the pooler above).
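These shapes are easy to verify on a single batch. The following is a minimal check that reuses the dataloader defined above; note the keyword arguments, since BertModel.forward expects token_type_ids before attention_mask, and the expected sizes assume batch_size=256 and max_seq_len=30 as set earlier.

with torch.no_grad():
    batch_seqs, batch_seq_masks, batch_seq_segments, _ = next(iter(train_dataloder))
    encoded_layers, pooled_output = bert(batch_seqs,
                                         token_type_ids=batch_seq_segments,
                                         attention_mask=batch_seq_masks)
print(len(encoded_layers))        # 12, one tensor per encoder layer
print(encoded_layers[-1].shape)   # torch.Size([256, 30, 768])
print(pooled_output.shape)        # torch.Size([256, 768])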

In this article we only take the second element of BERT's output as the representation of each sequence. The code is as follows:

bert = BertModel.from_pretrained('bert-base-chinese')
bert.eval()

train_features = []
train_labels = []
with torch.no_grad():
    for step, batch_data in enumerate(
            tqdm_notebook(train_dataloder, desc='Iteration')):
        batch_seqs, batch_seq_masks, batch_seq_segments, batch_labels = batch_data
        # BertModel.forward takes (input_ids, token_type_ids, attention_mask), so pass the
        # segment IDs and the mask by keyword; [1] selects the pooled [CLS] output.
        features = bert(batch_seqs,
                        token_type_ids=batch_seq_segments,
                        attention_mask=batch_seq_masks)[1]
        train_features.append(features.detach())
        train_labels.append(batch_labels.detach())
train_features = torch.cat(train_features)
train_labels = torch.cat(train_labels)
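Since feature extraction is by far the slowest step, it is convenient to cache the result so the classifier below can be iterated on without re-running BERT. A minimal sketch with torch.save is shown here; the file name is arbitrary.

# Cache the extracted features and labels to disk
torch.save({'features': train_features, 'labels': train_labels}, 'bert_title_features.pt')
# Later, e.g. in another session, reload them without touching BERT again
cached = torch.load('bert_title_features.pt')
train_features, train_labels = cached['features'], cached['labels']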

Training

Build a simple neural network to predict the category.

class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()

        self.classifier = nn.Sequential(*[
            nn.Linear(in_features=768, out_features=384, bias=True),
            BatchNorm1d(384),
            ReLU(inplace=True),
            Dropout(0.5),
            nn.Linear(in_features=384, out_features=28, bias=True)
        ])

    def forward(self, x):
        x = self.classifier(x)
        return x

lr = LogisticRegression()
train_data = TensorDataset(train_features, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloder = DataLoader(dataset=train_data, sampler=train_sampler, batch_size=512)

param_optimizer = list(lr.parameters())
optimizer = Adam(param_optimizer, lr=1e-04)

# CrossEntropyLoss applies log-softmax internally and takes integer class labels,
# so the raw logits are passed in directly, with no explicit softmax or one-hot encoding.
loss_function = CrossEntropyLoss()
loss_collect = []

for i in trange(10, desc='Epoch'):
    for step, batch_data in enumerate(
            tqdm_notebook(train_dataloder, desc='Iteration')):
        batch_features, batch_labels = batch_data
        logits = lr(batch_features)
        loss = loss_function(logits, batch_labels)
        loss.backward()
        loss_collect.append(loss.item())
        print("\r%f" % loss, end='')
        optimizer.step()
        optimizer.zero_grad()

Visualizing the training loss

plt.figure(figsize=(12, 8))
plt.plot(range(len(loss_collect)), loss_collect, 'g.')
plt.grid(True)
plt.show()

We can see that the loss on the training data drops quickly, although judging from its absolute value there is still plenty of room before convergence: since this is only meant as an illustration, the number of epochs was kept very small and no hyperparameter tuning was done. In summary, when compute is limited, using BERT as a feature generator to extract dense, low-dimensional features for all kinds of text tasks and then quickly building a model on top of them is a perfectly viable path.
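To illustrate the flexibility this buys, the very same cached features can be dropped into a classical model with almost no extra code. Below is a sketch using scikit-learn's logistic regression (aliased to avoid clashing with the class defined above); the split ratio and hyperparameters are chosen arbitrarily for illustration.

from sklearn.linear_model import LogisticRegression as SklearnLR
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The extracted features live on the CPU, so they convert to numpy arrays directly
X = train_features.numpy()
y = train_labels.numpy()
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SklearnLR(max_iter=1000, multi_class='multinomial')
clf.fit(X_train, y_train)
print(accuracy_score(y_valid, clf.predict(X_valid)))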
