This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment classification.

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow_datasets as tfds
import tensorflow as tf

Import matplotlib and create a helper function to plot graphs:

import matplotlib.pyplot as plt

def plot_graphs(history, string):
    # Plot the training and validation curves for the given metric.
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel('Epochs')
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

1. Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review carries either a positive or a negative sentiment label.

Download the dataset using TFDS. The dataset ships with a pre-built subword tokenizer.

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

Since this is a subword tokenizer, it can be passed any string and will tokenize it.

tokenizer = info.features['text'].encoder
print('Vocabulary size: {}'.format(tokenizer.vocab_size))

Vocabulary size: 8185

sample_string = 'TensorFlow is cool.'

tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))

assert original_string == sample_string

Tokenized string is [6307, 2327, 4043, 4265, 9, 2724, 7975]
The original string: TensorFlow is cool.

If a string is not in the tokenizer's vocabulary, the tokenizer encodes it by breaking it into substrings.

for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))

6307 ----> Ten
2327 ----> sor
4043 ----> Fl
4265 ----> ow
9 ----> is
2724 ----> cool
7975 ----> .

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)
test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)
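
Note that dataset.output_shapes exists only in older TensorFlow releases. A minimal sketch of the same pipeline, assuming TensorFlow 2.2 or newer where padded_shapes is optional and defaults to padding each batch to its longest element:

# Equivalent pipeline for newer TF versions (assumption: TF >= 2.2).
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)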

2. Create the model

Build a tf.keras.Sequential model, starting with an embedding layer. The embedding layer stores one vector per word; when called, it converts a sequence of word indices into a sequence of vectors. These vectors are trainable, and after training (on enough data), words with similar meanings often end up with similar vectors.

This index lookup is more efficient than the equivalent operation of passing one-hot encoded vectors through a tf.keras.layers.Dense layer.
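
The two operations are mathematically equivalent; a small illustrative sketch (not part of the original tutorial) compares the lookup with the one-hot matrix multiplication:

# Illustrative sketch: the embedding lookup and a one-hot matmul give the same
# vectors, but the lookup avoids materializing the huge, mostly-zero one-hot tensor.
embedding = tf.keras.layers.Embedding(tokenizer.vocab_size, 64)
word_ids = tf.constant([6307, 2327, 4043])                    # three token ids
vectors = embedding(word_ids)                                 # direct lookup -> (3, 64)
one_hot = tf.one_hot(word_ids, depth=tokenizer.vocab_size)    # (3, 8185), mostly zeros
same_vectors = tf.matmul(one_hot, embedding.embeddings)       # same (3, 64) result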

A recurrent neural network (RNN) processes sequence input by iterating over its elements. The RNN passes the output of one timestep back to its input at the next timestep.
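
Conceptually (an illustrative sketch, not the Keras internals), the recurrence looks like this:

# Conceptual sketch of the recurrence: a state is updated at every timestep
# and fed back in at the next one. Shapes here are arbitrary illustrations.
inputs = tf.random.uniform((1, 5, 8))            # (batch, timesteps, features)
cell = tf.keras.layers.SimpleRNNCell(16)
state = [tf.zeros((1, 16))]
for t in range(inputs.shape[1]):
    output, state = cell(inputs[:, t, :], state)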

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. It propagates the input both forward and backward through the RNN layer and then concatenates the outputs, which helps the RNN learn long-range dependencies.
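
A quick shape check (illustrative only): because the forward and backward outputs are concatenated, the last dimension doubles.

# Illustrative: Bidirectional concatenates the forward and backward LSTM outputs.
bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))
dummy = tf.random.uniform((32, 10, 8))   # (batch, timesteps, features)
print(bi_lstm(dummy).shape)              # (32, 128) instead of (32, 64)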

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

3. Train the model

history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)

Epoch 1/10
391/391 [==============================] - 75s 191ms/step - loss: 0.5536 - accuracy: 0.7140 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3922 - accuracy: 0.8311 - val_loss: 0.5141 - val_accuracy: 0.7940
Epoch 3/10
391/391 [==============================] - 71s 182ms/step - loss: 0.3120 - accuracy: 0.8807 - val_loss: 0.4517 - val_accuracy: 0.8098
Epoch 4/10
391/391 [==============================] - 78s 199ms/step - loss: 0.2548 - accuracy: 0.9030 - val_loss: 0.4383 - val_accuracy: 0.8235
Epoch 5/10
391/391 [==============================] - 72s 185ms/step - loss: 0.2387 - accuracy: 0.9078 - val_loss: 0.4918 - val_accuracy: 0.8214
Epoch 6/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1905 - accuracy: 0.9293 - val_loss: 0.4849 - val_accuracy: 0.8162
Epoch 7/10
391/391 [==============================] - 71s 182ms/step - loss: 0.1900 - accuracy: 0.9282 - val_loss: 0.5919 - val_accuracy: 0.8257
Epoch 8/10
391/391 [==============================] - 74s 190ms/step - loss: 0.1321 - accuracy: 0.9526 - val_loss: 0.6331 - val_accuracy: 0.7657
Epoch 9/10
391/391 [==============================] - 73s 187ms/step - loss: 0.3290 - accuracy: 0.8516 - val_loss: 0.6709 - val_accuracy: 0.6501
Epoch 10/10
391/391 [==============================] - 70s 180ms/step - loss: 0.3074 - accuracy: 0.8692 - val_loss: 0.5533 - val_accuracy: 0.7873

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

391/Unknown - 19s 47ms/step - loss: 0.5533 - accuracy: 0.7873
Test Loss: 0.553319326714
Test Accuracy: 0.787320017815

The model above does not mask the padding applied to the sequences. This can introduce skew if the model is trained on padded sequences but tested on unpadded ones. Ideally the model would learn to ignore the padding, but as you can see below, it has only a small effect on the output.
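
One common way to let the model ignore padding (an assumption here, not part of this tutorial's code) is to enable masking in the embedding layer; downstream layers such as LSTM then skip the zero-padded timesteps:

# Sketch of a masking variant (assumption, not trained in this tutorial):
# mask_zero=True makes downstream layers ignore the zero-padded positions.
masked_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])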

If the prediction is >= 0.5, the review is positive; otherwise it is negative.

def pad_to_size(vec, size):
    # Pad the token vector with zeros up to the given size.
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sentence, pad):
    tokenized_sample_pred_text = tokenizer.encode(sentence)

    if pad:
        tokenized_sample_pred_text = pad_to_size(tokenized_sample_pred_text, 64)

    predictions = model.predict(tf.expand_dims(tokenized_sample_pred_text, 0))
    return predictions

# Predict on a sample text with padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

[[ 0.68634349]]
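
Applying the 0.5 threshold described above, this score corresponds to a positive review. A tiny hypothetical helper (not part of the original code) makes the rule explicit:

# Hypothetical helper: map the sigmoid score to a label using the 0.5 threshold.
def to_sentiment(score):
    return 'positive' if score >= 0.5 else 'negative'

print(to_sentiment(predictions[0][0]))  # 'positive' for the score above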

plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

4. Stack two or more LSTM layers

Keras recurrent layers have two modes, controlled by the return_sequences constructor argument:

  • Return the full sequence of successive outputs for each timestep (a 3D tensor of shape (batch_size, timesteps, output_features)).
  • Return only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)); see the shape check right after the code below.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10, validation_data=test_dataset)
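
The shape difference between the two modes can be checked directly (an illustrative sketch, separate from the training run whose output follows):

# Illustrative: return_sequences=True yields one output per timestep,
# otherwise only the final output is returned.
x = tf.random.uniform((32, 10, 8))                                 # (batch, timesteps, features)
print(tf.keras.layers.LSTM(16, return_sequences=True)(x).shape)    # (32, 10, 16)
print(tf.keras.layers.LSTM(16)(x).shape)                           # (32, 16)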

Epoch 1/10
391/391 [==============================] - 155s 397ms/step - loss: 0.6349 - accuracy: 0.6162 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 155s 396ms/step - loss: 0.6333 - accuracy: 0.6134 - val_loss: 0.5872 - val_accuracy: 0.6914
Epoch 3/10
391/391 [==============================] - 153s 391ms/step - loss: 0.4199 - accuracy: 0.8217 - val_loss: 0.4361 - val_accuracy: 0.8187
Epoch 4/10
391/391 [==============================] - 156s 398ms/step - loss: 0.3088 - accuracy: 0.8785 - val_loss: 0.4131 - val_accuracy: 0.8319
Epoch 5/10
391/391 [==============================] - 153s 391ms/step - loss: 0.3328 - accuracy: 0.8564 - val_loss: 0.4689 - val_accuracy: 0.7958
Epoch 6/10
391/391 [==============================] - 156s 398ms/step - loss: 0.2383 - accuracy: 0.9128 - val_loss: 0.4299 - val_accuracy: 0.8404
Epoch 7/10
391/391 [==============================] - 152s 388ms/step - loss: 0.2426 - accuracy: 0.9039 - val_loss: 0.4934 - val_accuracy: 0.8299
Epoch 8/10
391/391 [==============================] - 155s 396ms/step - loss: 0.1638 - accuracy: 0.9440 - val_loss: 0.5106 - val_accuracy: 0.8279
Epoch 9/10
391/391 [==============================] - 150s 383ms/step - loss: 0.1616 - accuracy: 0.9420 - val_loss: 0.5287 - val_accuracy: 0.8245
Epoch 10/10
391/391 [==============================] - 154s 394ms/step - loss: 0.1120 - accuracy: 0.9643 - val_loss: 0.5646 - val_accuracy: 0.8070

test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

391/Unknown - 45s 115ms/step - loss: 0.5646 - accuracy: 0.8070
Test Loss: 0.564571284348
Test Accuracy: 0.80703997612

# Predict on a sample text without padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

[[ 0.00393916]]

# Predict on a sample text with padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

[[ 0.01098633]]

plot_graphs(history, 'accuracy')

plot_graphs(history, 'loss')

Check out other existing recurrent layers, such as GRU layers.
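
For example, a minimal sketch (an assumption using the same setup as above, not trained in this tutorial) that swaps the first model's LSTM for a GRU:

# Sketch: the same architecture with a GRU layer instead of an LSTM.
gru_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
gru_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])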

