NO·14 - Text Classification with an RNN
This text-classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment classification.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf
Import matplotlib and create a helper function to plot graphs:
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()
1. Setting up the input pipeline
The IMDB large movie review dataset is a binary classification dataset: every review carries either a positive or a negative sentiment label.
Download the dataset using TFDS; it comes with a built-in subword tokenizer.
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
Since this is a subword tokenizer, it can be passed any string and will tokenize it.
tokenizer = info.features['text'].encoder
print('Vocabulary size: {}'.format(tokenizer.vocab_size))
Vocabulary size: 8185
sample_string = 'TensorFlow is cool.'
tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))
original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))
assert original_string == sample_string
Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
The original string: TensorFlow is cool.
If a string is not in the tokenizer's vocabulary, the tokenizer encodes it by breaking it into subwords.
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
4265 ----> ow
9 ----> is
2724 ----> cool
7975 ----> .
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
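As a rough illustration of what `padded_batch` does inside the pipeline, here is a minimal pure-Python sketch; the `pad_batch` helper and the sample token ids are made up for this example:

```python
def pad_batch(batch, pad_value=0):
    """Pad each sequence to the length of the longest one in the batch,
    mirroring what padded_batch does in the tf.data pipeline."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in batch]

batch = [[6307, 2327], [9], [2724, 7975, 4043]]
print(pad_batch(batch))  # [[6307, 2327, 0], [9, 0, 0], [2724, 7975, 4043]]
```

Every batch is padded independently, so different batches can have different sequence lengths.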
2. Building the model
Build a tf.keras.Sequential model, starting with an embedding layer. An embedding layer stores one vector per word; when called, it converts sequences of word indices into sequences of vectors. These vectors are trainable: after training (on enough data), words with similar meanings often end up with similar vectors.
This index lookup is more efficient than the equivalent operation of passing one-hot encoded vectors through a tf.keras.layers.Dense layer.
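The equivalence can be sketched in a few lines of NumPy; the matrix `E` below stands in for the embedding layer's weights and is randomly initialized purely for illustration:

```python
import numpy as np

vocab_size, embed_dim = 8, 4
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))  # stand-in embedding matrix

token_ids = np.array([3, 1, 6])

# Embedding layer: a direct row lookup into E
looked_up = E[token_ids]

# Equivalent Dense-layer view: one-hot encode, then matrix-multiply by E
one_hot = np.eye(vocab_size)[token_ids]
via_dense = one_hot @ E

assert np.allclose(looked_up, via_dense)
```

The lookup touches only the rows it needs, whereas the matrix multiply computes over the whole vocabulary, which is why the lookup is cheaper.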
A recurrent neural network (RNN) processes sequential input by iterating over its elements. The RNN passes the output of one timestep to its input at the next timestep.
The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. It propagates the input forward and backward through the RNN layer and then concatenates the outputs, which helps the RNN learn long-range dependencies.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
3. Training the model
history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)
Epoch 1/10
391/391 [==============================] - 75s 191ms/step - loss: 0.5536 - accuracy: 0.7140 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3922 - accuracy: 0.8311 - val_loss: 0.5141 - val_accuracy: 0.7940
Epoch 3/10
391/391 [==============================] - 71s 182ms/step - loss: 0.3120 - accuracy: 0.8807 - val_loss: 0.4517 - val_accuracy: 0.8098
Epoch 4/10
391/391 [==============================] - 78s 199ms/step - loss: 0.2548 - accuracy: 0.9030 - val_loss: 0.4383 - val_accuracy: 0.8235
Epoch 5/10
391/391 [==============================] - 72s 185ms/step - loss: 0.2387 - accuracy: 0.9078 - val_loss: 0.4918 - val_accuracy: 0.8214
Epoch 6/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1905 - accuracy: 0.9293 - val_loss: 0.4849 - val_accuracy: 0.8162
Epoch 7/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1900 - accuracy: 0.9282 - val_loss: 0.5919 - val_accuracy: 0.8257
Epoch 8/10
391/391 [==============================] - 74s 190ms/step - loss: 0.1321 - accuracy: 0.9526 - val_loss: 0.6331 - val_accuracy: 0.7657
Epoch 9/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3290 - accuracy: 0.8516 - val_loss: 0.6709 - val_accuracy: 0.6501
Epoch 10/10
391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873
Test Loss: 0.553319326714
Test Accuracy: 0.787320017815
The model above does not mask the padding applied to the sequences, which can lead to skew if it is trained on padded sequences and tested on unpadded ones. Ideally the model would learn to ignore the padding, but as you can see below, the padding does have a small effect on the output.
If the prediction is >= 0.5, it is positive; otherwise it is negative.
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec
def sample_predict(sentence, pad):
    tokenized_sample_pred_text = tokenizer.encode(sentence)
    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)
    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions
# Predict on the sample text with padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
[[ 0.68634349]]
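The score above maps to a label via the 0.5 threshold described earlier; a minimal sketch, where the `to_label` helper is made up for this example:

```python
def to_label(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return 'positive' if score >= threshold else 'negative'

print(to_label(0.68634349))  # positive
```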
plot_graphs(history, 'accuracy')