
Commit 4e470b3

committed: ch16

1 parent 5bbdd73 commit 4e470b3

File tree

7 files changed: +50 -31 lines changed

etc/docsify-to-pdf/static/main.md

Lines changed: 6 additions & 6 deletions
@@ -952,7 +952,7 @@ When solving tasks like text classification, we need to be able to represent tex

<img src="images/bow.png" width="90%"/>

-> Image by author
+> Image by the author

BoW essentially represents which words appear in a text and in what quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as *president* and *country*, while a scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.

@@ -980,7 +980,7 @@ By using embedding layer as a first layer in our classifier network, we can swit

![Image showing an embedding classifier for five sequence words.](../../../lessons/5-NLP/14-Embeddings/images/embedding-classifier-example.png)

-> Image by author
+> Image by the author

## Continue in Notebooks

@@ -1054,7 +1054,7 @@ To capture the meaning of text sequence, we need to use another neural network a

![RNN](../../../lessons/5-NLP/16-RNN/images/rnn.png)

-> Image by author
+> Image by the author

Given the input sequence of tokens X<sub>0</sub>,...,X<sub>n</sub>, an RNN creates a sequence of neural network blocks and trains this sequence end-to-end using backpropagation. Each network block takes a pair (X<sub>i</sub>,S<sub>i</sub>) as input and produces S<sub>i+1</sub> as a result. The final state S<sub>n</sub> (or output Y<sub>n</sub>) goes into a linear classifier to produce the result. All network blocks share the same weights and are trained end-to-end using one backpropagation pass.

@@ -1070,7 +1070,7 @@ Simple RNN cell has two weight matrices inside: one transforms input symbol (let

<img alt="RNN Cell Anatomy" src="images/rnn-anatomy.png" width="50%"/>

-> Image by author
+> Image by the author

In many cases, input tokens are passed through an embedding layer before entering the RNN to lower the dimensionality. In this case, if the dimension of the input vectors is *emb_size* and the state vector has dimension *hid_size*, the size of W is *emb_size*&times;*hid_size* and the size of H is *hid_size*&times;*hid_size*.

@@ -1138,7 +1138,7 @@ When generating text (during inference), we start with some **prompt**, which is

<img src="images/rnn-generate-inf.png" width="60%"/>

-> Image by author
+> Image by the author

## Continue to Notebooks

* [Generative Networks with PyTorch](GenerativePyTorch.ipynb)
@@ -1208,7 +1208,7 @@ We then mix the token position with token embedding vector. To transform positio

<img src="images/pos-embedding.png" width="50%"/>

-> Image by author
+> Image by the author

The result we get with positional embedding embeds both the original token and its position within the sequence.

lessons/5-NLP/13-TextRep/README.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ When solving tasks like text classification, we need to be able to represent tex

<img src="images/bow.png" width="90%"/>

-> Image by author
+> Image by the author

A BoW essentially represents which words appear in a text and in what quantities, which can indeed be a good indication of what the text is about. For example, a news article on politics is likely to contain words such as *president* and *country*, while a scientific publication would have something like *collider*, *discovered*, etc. Thus, word frequencies can in many cases be a good indicator of text content.
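The bag-of-words idea in the paragraph above can be sketched with the standard library alone; the tiny corpus, vocabulary, and the `bag_of_words` helper here are our own illustration, not code from the lesson:

```python
# A minimal bag-of-words sketch: count how often each vocabulary word
# appears in a text, producing a fixed-size vector of word frequencies.
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["president", "country", "collider", "discovered"]
news = "The president addressed the country as the country celebrated"
science = "Physicists discovered a new particle at the collider"

print(bag_of_words(news, vocabulary))     # [1, 2, 0, 0] -> politics words dominate
print(bag_of_words(science, vocabulary))  # [0, 0, 1, 1] -> science words dominate
```

As the text notes, even this crude representation often separates topics well, because topical words dominate the counts.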

lessons/5-NLP/14-Embeddings/README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ By using an embedding layer as a first layer in our classifier network, we can s

![Image showing an embedding classifier for five sequence words.](images/embedding-classifier-example.png)

-> Image by author
+> Image by the author

## ✍️ Exercises: Embeddings

lessons/5-NLP/16-RNN/README.md

Lines changed: 37 additions & 21 deletions
@@ -2,67 +2,83 @@

## [Pre-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/116)

-In the previous sections, we have been using rich semantic representations of text, and a simple linear classifier on top of the embeddings. What this architecture does is to capture aggregated meaning of words in a sentence, but it does not take into account the **order** of words, because aggregation operation on top of embeddings removed this information from the original text. Because these models are unable to model word ordering, they cannot solve more complex or ambiguous tasks such as text generation or question answering.
+In previous sections, we have been using rich semantic representations of text and a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of words in a sentence, but it does not take into account the **order** of words, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models are unable to model word ordering, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need to use another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one symbol at a time, and the network produces some **state**, which we then pass to the network again with the next symbol.

![RNN](./images/rnn.png)

-> Image by author
+> Image by the author

-Given the input sequence of tokens X<sub>0</sub>,...,X<sub>n</sub>, RNN creates a sequence of neural network blocks, and trains this sequence end-to-end using back propagation. Each network block takes a pair (X<sub>i</sub>,S<sub>i</sub>) as an input, and produces S<sub>i+1</sub> as a result. Final state S<sub>n</sub> or (output Y<sub>n</sub>) goes into a linear classifier to produce the result. All network blocks share the same weights, and are trained end-to-end using one back propagation pass.
+Given the input sequence of tokens X<sub>0</sub>,...,X<sub>n</sub>, an RNN creates a sequence of neural network blocks and trains this sequence end-to-end using backpropagation. Each network block takes a pair (X<sub>i</sub>,S<sub>i</sub>) as input and produces S<sub>i+1</sub> as a result. The final state S<sub>n</sub> (or output Y<sub>n</sub>) goes into a linear classifier to produce the result. All the network blocks share the same weights and are trained end-to-end using one backpropagation pass.

-Because state vectors S<sub>0</sub>,...,S<sub>n</sub> are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, it can learn to negate certain elements within the state vector, resulting in negation.
+Because the state vectors S<sub>0</sub>,...,S<sub>n</sub> are passed through the network, it is able to learn the sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements within the state vector.

-> Since weights of all RNN blocks on the picture are shared, the same picture can be represented as one block (on the right) with a recurrent feedback loop, which passes output state of the network back to the input.
+> Since the weights of all RNN blocks in the picture above are shared, the same picture can be represented as one block (on the right) with a recurrent feedback loop, which passes the output state of the network back to the input.

-## Anatomy of RNN Cell
+## Anatomy of an RNN Cell

-Let's see how simple RNN cell is organized. It accepts previous state S<sub>i-1</sub> and current symbol X<sub>i</sub> as inputs, and has to produce output state S<sub>i</sub> (and, sometimes, we are also interested in some other output Y<sub>i</sub>, as in case with generative networks).
+Let's see how a simple RNN cell is organized. It accepts the previous state S<sub>i-1</sub> and the current symbol X<sub>i</sub> as inputs, and has to produce the output state S<sub>i</sub> (and, sometimes, we are also interested in some other output Y<sub>i</sub>, as in the case of generative networks).

-Simple RNN cell has two weight matrices inside: one transforms input symbol (let call it W), and another one transforms input state (H). In this case the output of the network is calculated as &sigma;(W&times;X<sub>i</sub>+H&times;S<sub>i-1</sub>+b), where &sigma; is the activation function, b is additional bias.
+A simple RNN cell has two weight matrices inside: one transforms the input symbol (let's call it W), and another one transforms the input state (H). In this case the output of the network is calculated as &sigma;(W&times;X<sub>i</sub>+H&times;S<sub>i-1</sub>+b), where &sigma; is the activation function and b is an additional bias.

<img alt="RNN Cell Anatomy" src="images/rnn-anatomy.png" width="50%"/>

-> Image by author
+> Image by the author

In many cases, input tokens are passed through an embedding layer before entering the RNN to lower the dimensionality. In this case, if the dimension of the input vectors is *emb_size* and the state vector has dimension *hid_size*, the size of W is *emb_size*&times;*hid_size* and the size of H is *hid_size*&times;*hid_size*.
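The cell computation and the matrix sizes above can be sketched in a few lines of numpy. This is a hedged illustration, not the lesson's code: the weights are random rather than trained, *tanh* stands in for the activation &sigma;, and the sizes are made up:

```python
# One RNN cell step: new_state = tanh(x @ W + s_prev @ H + b),
# with W of shape (emb_size, hid_size) and H of shape (hid_size, hid_size).
import numpy as np

emb_size, hid_size = 4, 3
rng = np.random.default_rng(0)

W = rng.normal(size=(emb_size, hid_size))  # transforms the input symbol
H = rng.normal(size=(hid_size, hid_size))  # transforms the previous state
b = np.zeros(hid_size)                     # additional bias

def rnn_cell(x, s_prev):
    """Combine the current input x and previous state s_prev into a new state."""
    return np.tanh(x @ W + s_prev @ H + b)

# Run a sequence of 5 embedded tokens through the cell, reusing the same weights
state = np.zeros(hid_size)
for x in rng.normal(size=(5, emb_size)):
    state = rnn_cell(x, state)

print(state.shape)  # (3,) -- one state vector of dimension hid_size
```

Note how a single pair of weight matrices is reused at every step; this weight sharing is what makes the unrolled picture collapse into one recurrent block.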

## Long Short Term Memory (LSTM)

-One of the main problems of classical RNNs is so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one back-propagation pass, it is having hard times propagating error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. One of the ways to avoid this problem is to introduce **explicit state management** by using so called **gates**. There are two most known architectures of this kind: **Long Short Term Memory** (LSTM) and **Gated Relay Unit** (GRU).
+One of the main problems of classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in one backpropagation pass, they have a hard time propagating error to the first layers of the network, and thus the network cannot learn relationships between distant tokens. One of the ways to avoid this problem is to introduce **explicit state management** by using so-called **gates**. There are two well-known architectures of this kind: **Long Short Term Memory** (LSTM) and the **Gated Recurrent Unit** (GRU).
![Image showing an example long short term memory cell](./images/long-short-term-memory-cell.svg)

-LSTM Network is organized in a manner similar to RNN, but there are two states that are being passed from layer to layer: actual state C, and hidden vector H. At each unit, hidden vector H<sub>i</sub> is concatenated with input X<sub>i</sub>, and they control what happens to the state C via **gates**. Each gate is a neural network with sigmoid activation (output in the range [0,1]), which can be thought of as bitwise mask when multiplied by the state vector. There are the following gates (from left to right on the picture above):
+> Image source TBD

-* **forget gate** takes hidden vector and determines, which components of the vector C we need to forget, and which to pass through.
-* **input gate** takes some information from the input and hidden vector, and inserts it into state.
-* **output gate** transforms state via some linear layer with *tanh* activation, then selects some of its components using hidden vector H<sub>i</sub> to produce new state C<sub>i+1</sub>.
+The LSTM network is organized in a manner similar to an RNN, but there are two states being passed from layer to layer: the actual state C, and the hidden vector H. At each unit, the hidden vector H<sub>i</sub> is concatenated with the input X<sub>i</sub>, and together they control what happens to the state C via **gates**. Each gate is a neural network with sigmoid activation (output in the range [0,1]), which can be thought of as a bitwise mask when multiplied by the state vector. There are the following gates (from left to right in the picture above):

-Components of the state C can be thought of as some flags that can be switched on and off. For example, when we encounter a name *Alice* in the sequence, we may want to assume that it refers to female character, and raise the flag in the state that we have female noun in the sentence. When we further encounter phrases *and Tom*, we will raise the flag that we have plural noun. Thus by manipulating state we can supposedly keep track of grammatical properties of sentence parts.
+* The **forget gate** takes the hidden vector and determines which components of the vector C we need to forget, and which to pass through.
+* The **input gate** takes some information from the input and hidden vectors and inserts it into the state.
+* The **output gate** transforms the state via a linear layer with *tanh* activation, then selects some of its components using the hidden vector H<sub>i</sub> to produce the new state C<sub>i+1</sub>.

-> **Note**: A great resource for understanding internals of LSTM is this great article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.
+Components of the state C can be thought of as flags that can be switched on and off. For example, when we encounter the name *Alice* in the sequence, we may want to assume that it refers to a female character, and raise a flag in the state that we have a female noun in the sentence. When we further encounter the phrase *and Tom*, we will raise a flag that we have a plural noun. Thus, by manipulating the state we can supposedly keep track of the grammatical properties of sentence parts.
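The three gates described above can be sketched as one numpy step. This is a hedged illustration of the standard LSTM update rather than the lesson's implementation: weights are random and untrained, biases are omitted, and the name `lstm_step` is our own:

```python
# One LSTM step: the forget gate masks the old state, the input gate inserts
# new candidate values, and the output gate selects what to expose as H.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hid = 4                       # size of both the state C and hidden vector H
rng = np.random.default_rng(1)
# One weight matrix per gate (plus one for the candidate state),
# each applied to the concatenation of [H_i, X_i]
Wf, Wi, Wo, Wc = (rng.normal(size=(2 * hid, hid)) for _ in range(4))

def lstm_step(c_prev, h_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(hx @ Wf)       # forget gate: [0,1] mask over the old state
    i = sigmoid(hx @ Wi)       # input gate: [0,1] mask over candidate values
    cand = np.tanh(hx @ Wc)    # candidate values computed from input + hidden
    c = f * c_prev + i * cand  # keep some of the old state, insert some new
    o = sigmoid(hx @ Wo)       # output gate: select components of the state
    h = o * np.tanh(c)         # new hidden vector
    return c, h

c, h = np.zeros(hid), np.zeros(hid)
for x in rng.normal(size=(6, hid)):  # a sequence of 6 input vectors
    c, h = lstm_step(c, h, x)
print(c.shape, h.shape)  # (4,) (4,)
```

Because the sigmoid gates multiply the state elementwise, each gate acts like the soft "bitwise mask" the text describes; in practice one would use a trained implementation such as `torch.nn.LSTM`.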

-## Bidirectional and multilayer RNNs
+> ✅ An excellent resource for understanding the internals of LSTM is this great article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.
+
+## Bidirectional and Multilayer RNNs

We have discussed recurrent networks that operate in one direction, from the beginning of a sequence to its end. This seems natural, because it resembles the way we read and listen to speech. However, since in many practical cases we have random access to the input sequence, it might make sense to run the recurrent computation in both directions. Such networks are called **bidirectional** RNNs. When dealing with a bidirectional network, we need two hidden state vectors, one for each direction.

-Recurrent network, one-directional or bidirectional, captures certain patterns within a sequence, and can store them into state vector or pass into output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher level patterns, build from low-level patterns extracted by the first layer. This leads us to the notion of **multi-layer RNN**, which consists of two or more recurrent networks, where output of the previous layer is passed to the next layer as input.
+A recurrent network, either one-directional or bidirectional, captures certain patterns within a sequence, and can store them in a state vector or pass them into the output. As with convolutional networks, we can build another recurrent layer on top of the first one to capture higher-level patterns built from the low-level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

![Image showing a multilayer LSTM RNN](./images/multi-layer-lstm.jpg)

*Picture from [this wonderful post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López*
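The bidirectional idea can be sketched by running a simple recurrent cell over the sequence in both directions and keeping one state vector per direction. This is a simplified, hand-rolled illustration: real bidirectional RNNs learn separate weights for each direction, whereas here we reuse one random set for brevity:

```python
# Bidirectional sketch: one state from a left-to-right pass, one from a
# right-to-left pass over the same sequence, concatenated at the end.
import numpy as np

emb, hid = 4, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(emb, hid))   # input transform (shared for brevity)
H = rng.normal(size=(hid, hid))   # state transform (shared for brevity)

def run_rnn(sequence):
    s = np.zeros(hid)
    for x in sequence:
        s = np.tanh(x @ W + s @ H)
    return s

seq = rng.normal(size=(5, emb))
forward = run_rnn(seq)         # reads the sequence left to right
backward = run_rnn(seq[::-1])  # second hidden state, from the reverse pass
bidir_state = np.concatenate([forward, backward])
print(bidir_state.shape)  # (6,) -- two hidden state vectors, one per direction
```

In PyTorch the same effect is obtained by passing `bidirectional=True` (and `num_layers` for the multilayer case) to the recurrent layer.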

-## Continue to Notebooks
+## ✍️ Exercises: RNNs
+
+Continue your learning in the following notebooks:

* [RNNs with PyTorch](RNNPyTorch.ipynb)
* [RNNs with TensorFlow](RNNTF.ipynb)

-## RNNs for other tasks
+## Conclusion

In this unit, we have seen that RNNs can be used for sequence classification, but in fact, they can handle many more tasks, such as text generation, machine translation, and more. We will consider those tasks in the next unit.

+## 🚀 Challenge
+
+Read through some literature about LSTMs and consider their applications:
+
+- [Grid Long Short-Term Memory](https://arxiv.org/pdf/1507.01526v1.pdf)
+- [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/pdf/1502.03044v2.pdf)

## [Post-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/216)

-> ✅ Todo: conclusion, Assignment, challenge, reference.
+## Review & Self Study
+
+- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah
+
+## [Assignment: Notebooks](assignment.md)

lessons/5-NLP/16-RNN/assignment.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# Assignment: Notebooks
+
+Using the notebooks associated with this lesson (either the PyTorch or the TensorFlow version), rerun them with your own dataset, perhaps one from Kaggle, used with attribution. Rewrite the notebook to highlight your own findings. Try a different kind of dataset and document what you discover, for example [this Kaggle competition dataset about weather tweets](https://www.kaggle.com/competitions/crowdflower-weather-twitter/data?select=train.csv).

lessons/5-NLP/17-GenerativeNetworks/README.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ When generating text (during inference), we start with some **prompt**, which is

<img src="images/rnn-generate-inf.png" width="60%"/>

-> Image by author
+> Image by the author

## Continue to Notebooks

lessons/5-NLP/18-Transformers/README.md

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ We then mix the token position with token embedding vector. To transform positio

<img src="images/pos-embedding.png" width="50%"/>

-> Image by author
+> Image by the author

The result we get with positional embedding embeds both the original token and its position within the sequence.
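Mixing a token's embedding with its position can be sketched as two lookup tables added together. This is a hedged sketch with made-up sizes and random, untrained tables; it illustrates the learned-positional-embedding variant, not necessarily the exact scheme in the lesson's figure:

```python
# Positional embedding sketch: each output vector is the sum of a
# per-token embedding and a per-position embedding.
import numpy as np

vocab_size, max_len, emb_size = 100, 16, 8
rng = np.random.default_rng(42)

token_emb = rng.normal(size=(vocab_size, emb_size))  # one vector per token id
pos_emb = rng.normal(size=(max_len, emb_size))       # one vector per position

def embed(token_ids):
    """Each output vector encodes both the token and where it occurs."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + pos_emb[positions]

seq = np.array([3, 17, 3])  # the same token appears at positions 0 and 2
out = embed(seq)
print(out.shape)  # (3, 8)
# out[0] and out[2] differ even though the token is the same,
# because their position components differ.
```

Subtracting the position component recovers the shared token component, which is exactly what "embeds both original token and its position" means here.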
