In this setup, one RNN reads in a sentence and then a different one outputs a sentence. There's a modification to this called the attention model. The attention idea has been one of the most influential ideas in deep learning.

Let's assume you have an input sentence and you use a bidirectional RNN, bidirectional GRU, or bidirectional LSTM to compute features for every word. In practice, GRUs and LSTMs are often used for this, with LSTMs perhaps being more common.

attention model

We have features computed from both the forward recurrence and the backward recurrence of the bidirectional RNN, and I'm just going to use a<t'> to represent both of these concatenated together. So a<t'> is going to be the feature vector for input time step t'. To be consistent with the notation, where t indexes the output, I'm going to index the input with t'.

t --- index of output, t' --- index of input
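As a concrete illustration, here is a minimal numpy sketch of these encoder features: a simple tanh RNN run forward and backward over the sentence, with a<t'> taken as the concatenation of the two activations. The dimensions, weight names, and the plain-RNN cell (instead of a GRU or LSTM) are all illustrative assumptions, not the course's exact implementation.

```python
import numpy as np

# Illustrative sizes: input length, word-vector size, hidden size (assumed values).
Tx, n_x, n_a = 10, 50, 32
x = np.random.randn(Tx, n_x)                  # embedded input sentence, one vector per word

Wf = np.random.randn(n_a, n_a + n_x) * 0.01   # forward-direction RNN weights
Wb = np.random.randn(n_a, n_a + n_x) * 0.01   # backward-direction RNN weights
bf = np.zeros(n_a)
bb = np.zeros(n_a)

a_fwd = np.zeros((Tx, n_a))
a_bwd = np.zeros((Tx, n_a))

h = np.zeros(n_a)
for t in range(Tx):                           # forward pass: left to right
    h = np.tanh(Wf @ np.concatenate([h, x[t]]) + bf)
    a_fwd[t] = h

h = np.zeros(n_a)
for t in reversed(range(Tx)):                 # backward pass: right to left
    h = np.tanh(Wb @ np.concatenate([h, x[t]]) + bb)
    a_bwd[t] = h

# a[t'] is the concatenation of the forward and backward activations at step t'.
a = np.concatenate([a_fwd, a_bwd], axis=1)    # shape (Tx, 2 * n_a)
```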

attention mechanism

The context will depend on the attention parameters, so alpha<1,1>, alpha<1,2> and so on tell us how much attention to pay. These alpha parameters tell us how much the context depends on the features, or the activations, we're getting from the different time steps. The way we define the context is as a weighted sum of the features from the different time steps, weighted by these attention weights: c<t> = sum over t' of alpha<t,t'> a<t'>. More formally, the attention weights satisfy two properties: they are all non-negative, and they sum to one. We use the softmax function to guarantee that.
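Continuing the numpy sketch above, and pretending for a moment that the attention weights for one output step are already given (the random values below are a stand-in, not learned weights), the context is just this weighted sum:

```python
# Sketch only: assume these attention weights were already computed for output
# step t; they are non-negative and sum to one.
alpha_t = np.random.rand(Tx)
alpha_t = alpha_t / alpha_t.sum()

# Context for output step t: weighted sum of the encoder features over t'.
c_t = (alpha_t[:, None] * a).sum(axis=0)      # shape (2 * n_a,)
```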

So alpha<t,t'> is the amount of attention that y<t> should pay to a<t'>: when you're generating the t-th output word, how much you should be paying attention to the t'-th input word. That's one step of generating the output, and then at the next time step you generate the next output, and again you will have a new set of attention weights.

This is the formula you can use to compute alpha<t,t'>: first compute the terms e<t,t'>, and then apply essentially a softmax so that the weights sum to one over t':

alpha<t,t'> = exp(e<t,t'>) / sum over t'' of exp(e<t,t''>)

So for every fixed value of t, these weights sum to one if you're summing over t'. Using this softmax normalization just ensures they properly sum to one.
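In code, that normalization is just a softmax over t' for each fixed t. Continuing the numpy sketch (the random energies here stand in for the real e<t,t'> values, which are computed below):

```python
def softmax(v):
    v = v - v.max()                    # subtract the max for numerical stability
    e = np.exp(v)
    return e / e.sum()

e_t = np.random.randn(Tx)              # placeholder energies e<t, t'> for one output step t
alpha_t = softmax(e_t)                 # attention weights for that step
assert np.isclose(alpha_t.sum(), 1.0)  # they sum to one over t'
```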

To calculate e<t, t'> we use a small neural network.

So look at the network we have. If you're trying to generate y<t>, then s<t-1> is the decoder's hidden state from the previous step (the one that feeds into s<t>), and that's one input to a very small neural network, usually with just one hidden layer because you need to compute these values a lot. And a<t'> -- the features from input time step t' -- is the other input.

It seems pretty natural that alpha<t,t'> and e<t,t'> should depend on these two quantities, but we don't know what the function is. So one thing you can do is just train a very small neural network to learn whatever this function should be, and trust backpropagation and gradient descent to learn the right function. It turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job of telling you how much attention y<t> should pay to a<t'>, and as you generate the output one word at a time, it actually pays attention to the right parts of the input sentence, learning all of this automatically with gradient descent.
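Continuing the numpy sketch, here is one way such a one-hidden-layer scoring network could look. The decoder state size, hidden-layer size, and weight names (W1, b1, w2, b2) are illustrative assumptions, and in a real model these parameters would be learned by gradient descent rather than left random.

```python
n_s, n_hidden = 64, 10                         # decoder state size, hidden units (assumed)
W1 = np.random.randn(n_hidden, n_s + 2 * n_a) * 0.01
b1 = np.zeros(n_hidden)
w2 = np.random.randn(n_hidden) * 0.01
b2 = 0.0

def energy(s_prev, a_tprime):
    """Small one-hidden-layer network mapping (s<t-1>, a<t'>) to a scalar e<t, t'>."""
    hidden = np.tanh(W1 @ np.concatenate([s_prev, a_tprime]) + b1)
    return w2 @ hidden + b2

s_prev = np.zeros(n_s)                          # decoder hidden state from the previous step
e_t = np.array([energy(s_prev, a[tp]) for tp in range(Tx)])
alpha_t = softmax(e_t)                          # how much y<t> attends to each input word
c_t = (alpha_t[:, None] * a).sum(axis=0)        # context fed to the decoder at step t
```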

Reference:

Sequence Models | Coursera

www.coursera.org