1 00:00:00,000 --> 00:00:04,320 Hi everyone. So in this video, I'd like us to cover the process of tokenization in large 2 00:00:04,320 --> 00:00:10,040 language models. Now, you see here that I have a sad face, and that's because tokenization is my 3 00:00:10,040 --> 00:00:14,020 least favorite part of working with large language models, but unfortunately it is necessary to 4 00:00:14,020 --> 00:00:18,600 understand in some detail because it is fairly hairy, gnarly, and there's a lot of hidden foot 5 00:00:18,600 --> 00:00:23,560 guns to be aware of, and a lot of oddness with large language models typically traces back 6 00:00:23,560 --> 00:00:29,720 to tokenization. So what is tokenization? Now, in my previous video, Let's Build GPT 7 00:00:29,720 --> 00:00:34,480 from Scratch, we actually already did tokenization, but we did a very naive, 8 00:00:34,780 --> 00:00:39,080 simple version of tokenization. So when you go to the Google Colab for that video, 9 00:00:39,880 --> 00:00:46,120 you see here that we loaded our training set, and our training set was this Shakespeare data set. 10 00:00:47,360 --> 00:00:51,820 Now, in the beginning, the Shakespeare data set is just a large string in Python. It's just text, 11 00:00:52,300 --> 00:00:57,900 and so the question is, how do we plug text into large language models? And in this case here, 12 00:00:58,780 --> 00:00:59,700 we created 13 00:00:59,700 --> 00:01:05,740 a vocabulary of 65 possible characters that we saw occur in this string. These were the possible 14 00:01:05,740 --> 00:01:11,860 characters, and we saw that there are 65 of them, and then we created a lookup table for converting 15 00:01:11,860 --> 00:01:19,280 from every possible character, a little string piece, into a token, an integer. So here, for 16 00:01:19,280 --> 00:01:26,080 example, we tokenized the string, hi there, and we received this sequence of tokens. And here, 17 00:01:26,080 --> 00:01:29,100 we took the first 1,000 characters of our data set, 18 00:01:29,700 --> 00:01:36,440 we encoded it into tokens. And because this is character level, we received 1,000 tokens in 19 00:01:36,440 --> 00:01:45,460 the sequence. So, token 1847, etc. Now, later, we saw that the way we plug these tokens into the 20 00:01:45,460 --> 00:01:52,540 language model is by using an embedding table. And so basically, if we have 65 possible tokens, 21 00:01:52,540 --> 00:01:58,300 then this embedding table is going to have 65 rows. And roughly speaking, we're taking the 22 00:01:58,300 --> 00:01:59,040 integer 23 00:01:59,040 --> 00:02:04,040 with every single token. We're using that as a lookup into this table and we're 24 00:02:04,040 --> 00:02:09,220 plucking out the corresponding row and this row is trainable parameters 25 00:02:09,220 --> 00:02:12,840 that we're going to train using back propagation and this is the vector that 26 00:02:12,840 --> 00:02:16,440 then feeds into the transformer and that's how the transformer sort of 27 00:02:16,440 --> 00:02:22,360 perceives every single token. So here we had a very naive tokenization process 28 00:02:22,360 --> 00:02:27,240 that was a character level tokenizer but in practice in state-of-the-art language 29 00:02:27,240 --> 00:02:31,000 models people use a lot more complicated schemes unfortunately for 30 00:02:31,000 --> 00:02:37,260 constructing these token vocabularies. 
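For reference, here is a minimal sketch of that character-level tokenizer; it assumes the Shakespeare dataset has been read into a single Python string called text, and n_embd is just a placeholder name for the embedding width.

```python
# Minimal sketch of a character-level tokenizer (assumes the Shakespeare
# dataset has been loaded into the string `text`, as in the Colab).
chars = sorted(set(text))                      # the 65 unique characters in that dataset
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hi there"))           # one integer per character
print(decode(encode("hi there")))   # round-trips back to "hi there"

# each token id then indexes one row of a trainable embedding table,
# e.g. torch.nn.Embedding(len(chars), n_embd), and that row is the vector
# that feeds into the transformer
```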
So we're not dealing on the character level 31 00:02:37,260 --> 00:02:42,660 we're dealing on chunk level, and the way these character chunks are constructed 32 00:02:42,660 --> 00:02:46,840 is using algorithms such as, for example, the byte pair encoding algorithm, which 33 00:02:46,840 --> 00:02:52,680 we're going to go into in detail and cover in this video. I'd like to briefly 34 00:02:52,680 --> 00:02:56,940 show you the paper that introduced byte-level encoding as a mechanism for 35 00:02:56,940 --> 00:02:57,200 tokenization 36 00:02:57,240 --> 00:03:01,140 in the context of large language models, and I would say that that's probably the 37 00:03:01,140 --> 00:03:07,260 GPT-2 paper. And if you scroll down here to the section input representation, this 38 00:03:07,260 --> 00:03:10,600 is where they cover tokenization, the kinds of properties that you'd like the 39 00:03:10,600 --> 00:03:14,760 tokenization to have, and they conclude here that they're going to have a 40 00:03:14,760 --> 00:03:18,840 tokenizer where you have a vocabulary of fifty thousand two hundred and fifty 41 00:03:18,840 --> 00:03:26,460 seven possible tokens, and the context size is going to be 1024 tokens. So in 42 00:03:26,460 --> 00:03:26,940 the attention 43 00:03:26,940 --> 00:03:31,020 layer of the transformer neural network, every single 44 00:03:31,020 --> 00:03:34,500 token is attending to the previous tokens in the sequence, and it's going to 45 00:03:34,500 --> 00:03:42,360 see up to 1024 tokens. So tokens are this like fundamental unit, the atom of large 46 00:03:42,360 --> 00:03:45,240 language models if you will, and everything is in units of tokens, 47 00:03:45,240 --> 00:03:50,160 everything is about tokens, and tokenization is the process for translating strings or 48 00:03:50,160 --> 00:03:56,640 text into sequences of tokens and vice versa. When you go into the Llama 2 paper 49 00:03:56,640 --> 00:04:00,600 as well, I can show you that when you search token, you're going to get 63 hits, 50 00:04:00,600 --> 00:04:05,120 and that's because tokens are again pervasive. So here they mention that 51 00:04:05,120 --> 00:04:10,340 they trained on two trillion tokens of data, and so on. So we're going to build 52 00:04:10,340 --> 00:04:13,860 our own tokenizer. Luckily the byte pair encoding algorithm is not 53 00:04:13,860 --> 00:04:18,200 that super complicated and we can build it from scratch ourselves, and we'll see 54 00:04:18,200 --> 00:04:21,980 exactly how this works. Before we dive into code, I'd like to give you a brief 55 00:04:21,980 --> 00:04:25,760 taste of some of the complexities that come from the tokenization, because I 56 00:04:25,760 --> 00:04:26,460 just want to make sure that we're 57 00:04:26,640 --> 00:04:33,200 motivated sufficiently for why we are doing all this and why this is so gross. So tokenization 58 00:04:33,200 --> 00:04:36,960 is at the heart of a lot of weirdness in large language models and I would advise that you do 59 00:04:36,960 --> 00:04:42,240 not brush it off. A lot of the issues that may look like just issues with the neural network 60 00:04:42,240 --> 00:04:47,680 architecture or the large language model itself are actually issues with the tokenization and 61 00:04:47,680 --> 00:04:53,440 fundamentally trace back to it.
So if you've noticed any issues where large language models 62 00:04:53,440 --> 00:04:58,400 can't, you know, do spelling tasks very easily, that's usually due to tokenization. 63 00:04:58,960 --> 00:05:03,600 Simple string processing can be difficult for the large language model to perform natively. 64 00:05:04,720 --> 00:05:09,360 Non-English languages can work much worse, and to a large extent this is due to tokenization. 65 00:05:10,160 --> 00:05:17,520 Sometimes LLMs are bad at simple arithmetic; that can also be traced to tokenization. GPT-2 specifically 66 00:05:17,520 --> 00:05:22,880 would have had quite a bit more issues with Python than future versions of it due to tokenization. 67 00:05:23,600 --> 00:05:27,040 There's a lot of other issues. Maybe you've seen weird warnings about a trailing whitespace; 68 00:05:27,040 --> 00:05:34,880 this is a tokenization issue. If you had asked GPT earlier about SolidGoldMagikarp and what it is, 69 00:05:34,880 --> 00:05:40,080 you would see the LLM go totally crazy and it would start going off on a completely unrelated 70 00:05:40,080 --> 00:05:44,640 tangent topic. Maybe you've been told to use YAML over JSON for structured data; 71 00:05:44,640 --> 00:05:48,800 all of that has to do with tokenization. So basically tokenization is at the heart 72 00:05:48,800 --> 00:05:52,960 of many issues. I will loop back around to these at the end of the video, 73 00:05:53,040 --> 00:05:58,400 but for now let me just skip over it a little bit and let's go to this web app, 74 00:05:59,520 --> 00:06:04,880 tiktokenizer.vercel.app. So I have it loaded here, and what I like about this web app is that 75 00:06:04,880 --> 00:06:10,560 tokenization is running sort of live in your browser in JavaScript. So you can just type 76 00:06:10,560 --> 00:06:18,720 stuff here, hello world, and the whole string retokenizes. So here what we see on the left 77 00:06:18,720 --> 00:06:22,480 is the string that you put in; on the right we're currently using the GPT-2 tokenizer, 78 00:06:23,120 --> 00:06:28,000 and we see that this string that I pasted here is currently tokenizing into 300 tokens, 79 00:06:28,560 --> 00:06:33,440 and here they are sort of shown explicitly in different colors for every single token. 80 00:06:34,240 --> 00:06:43,680 So for example this word tokenization became two tokens, the tokens 30642 and 1634. 81 00:06:44,880 --> 00:06:52,400 The token " is" is token 318. So be careful: on the bottom you can show whitespace, 82 00:06:53,200 --> 00:06:58,480 and keep in mind that there are spaces and \n newline characters in here, 83 00:06:58,480 --> 00:07:05,440 but you can hide them for clarity. The token " at" is token 379, 84 00:07:06,400 --> 00:07:14,960 the token " the" is 262, etc. So you notice here that the space is part of that token chunk. 85 00:07:16,800 --> 00:07:22,240 Now, so this is kind of like how our English sentence broke up, and that seems all well and good. 86 00:07:23,040 --> 00:07:32,400 Now here I put in some arithmetic. So we see that 127 is a single token, plus is a token, and then 677 breaks into the token space 6 87 00:07:32,400 --> 00:07:37,760 followed by the token 77. So what's happening here is that 127 is feeding in as a single token into 88 00:07:37,760 --> 00:07:44,880 the large language model, but the number 677 will actually feed in as two separate tokens. 89 00:07:45,680 --> 00:07:52,320 And so the large language model has to sort of take account of that and process it correctly in 90 00:07:52,320 --> 00:07:58,240 its network.
And see here 804 will be broken up into two tokens, and it's all completely arbitrary. 91 00:07:58,880 --> 00:08:03,280 And here I have another example of four-digit numbers, and they break up in the way that they 92 00:08:03,280 --> 00:08:08,480 break up, and it's totally arbitrary. Sometimes you have multiple digits in a single token, sometimes 93 00:08:08,480 --> 00:08:13,120 you have individual digits as many tokens, and it's all kind of pretty arbitrary and coming out 94 00:08:13,120 --> 00:08:21,440 of the tokenizer. Here's another example. We have the string Egg, and you see here that this became two tokens. 95 00:08:23,200 --> 00:08:27,680 But for some reason when I say I have an Egg, you see when it's a space Egg 96 00:08:28,640 --> 00:08:34,480 it's a single token. So just Egg by itself in the beginning of a sentence is 97 00:08:34,480 --> 00:08:40,960 two tokens, but here as a space Egg it is suddenly a single token for the exact same string. 98 00:08:41,600 --> 00:08:46,880 Here lowercase egg turns out to be a single token, and in particular notice that 99 00:08:46,880 --> 00:08:51,840 the color is different, so this is a different token, so this is case sensitive. And of course, 100 00:08:52,480 --> 00:08:58,720 if you look at all-uppercase EGG, it would also be different tokens. And again this 101 00:08:58,720 --> 00:09:04,160 would be two tokens arbitrarily. So for the same concept egg, depending on if it's in the beginning 102 00:09:04,160 --> 00:09:09,360 of a sentence, at the end of a sentence, lowercase, uppercase or mixed, all this will be basically 103 00:09:09,360 --> 00:09:14,160 very different tokens and different IDs. And the language model has to learn from raw data, from 104 00:09:14,160 --> 00:09:17,520 all the internet text that it's going to be training on, that these are actually all the 105 00:09:17,520 --> 00:09:21,600 exact same concept. And it has to sort of group them in the parameters of the neural network and understand, just based on data, 106 00:09:21,600 --> 00:09:25,360 just based on the data patterns, that these are all very similar, maybe not 107 00:09:25,360 --> 00:09:32,220 exactly identical, but very, very similar. After the demonstration here I 108 00:09:32,220 --> 00:09:41,160 have an introduction from OpenAI's ChatGPT in Korean, so 만나서 반가워요 (nice to meet you), etc. 109 00:09:41,160 --> 00:09:46,320 So this is in Korean, and the reason I put this here is because you'll notice 110 00:09:46,320 --> 00:09:53,940 that non-English languages work slightly worse in ChatGPT. Part of this is 111 00:09:53,940 --> 00:09:57,720 because of course the training data set for ChatGPT is much larger for 112 00:09:57,720 --> 00:10:01,360 English than for everything else, but the same is true not just for the large 113 00:10:01,360 --> 00:10:05,580 language model itself but also for the tokenizer.
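If you want to reproduce these experiments locally rather than in the web app, here is a rough sketch assuming the tiktoken library is installed; the exact token IDs may differ from the screenshots above, but the behavior is the same.

```python
# Rough local sketch of the experiments above, assuming the tiktoken library
# (pip install tiktoken). The point is the behavior, not the specific IDs.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT-2 tokenizer

# the same word tokenizes differently depending on case and leading space
for s in ["Egg", " Egg", " egg", "EGG"]:
    print(repr(s), "->", enc.encode(s))

# digits get chunked fairly arbitrarily
print(enc.encode("127 + 677 = 804"))

# the same greeting in English vs. Korean uses very different token counts
print(len(enc.encode("Hello, nice to meet you!")))
print(len(enc.encode("안녕하세요, 만나서 반가워요!")))
```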
So when we train the tokenizer 114 00:10:05,580 --> 00:10:08,520 we're going to see that there's a training set as well, and there's a lot 115 00:10:08,520 --> 00:10:12,540 more English than non-English, and what ends up happening is that we're going to 116 00:10:12,540 --> 00:10:16,300 have a lot more long tokens for English. 117 00:10:16,300 --> 00:10:21,400 So how do I put this: if you have a single sentence in English and you 118 00:10:21,400 --> 00:10:25,460 tokenize it, you might see that it's 10 tokens or something like that, but if you 119 00:10:25,460 --> 00:10:29,360 translate that sentence into say Korean or Japanese or something else, you'll 120 00:10:29,360 --> 00:10:33,520 typically see that the number of tokens used is much larger, and that's because the 121 00:10:33,520 --> 00:10:38,440 chunks here are a lot more broken up, so we're using a lot more tokens for the 122 00:10:38,440 --> 00:10:43,720 exact same thing. And what this does is it bloats up the sequence length of all 123 00:10:43,720 --> 00:10:45,920 the documents, so you're using up 124 00:10:45,920 --> 00:10:50,160 more tokens, and then in the attention of the transformer, when these tokens try to attend to 125 00:10:50,160 --> 00:10:56,320 each other, you are running out of context in the maximum context length of that transformer. 126 00:10:56,320 --> 00:11:02,400 And so basically all the non-English text is stretched out from the perspective of the 127 00:11:02,400 --> 00:11:07,600 transformer, and this just has to do with the training set used for the tokenizer and the 128 00:11:07,600 --> 00:11:13,760 tokenization itself. So it will create a lot bigger tokens and a lot larger groups in English, and it 129 00:11:13,760 --> 00:11:17,200 will have a lot of little boundaries for all the other non-English text. 130 00:11:19,280 --> 00:11:23,040 So if we translated this into English, it would be significantly fewer tokens. 131 00:11:24,240 --> 00:11:28,320 The final example I have here is a little snippet of Python for doing FizzBuzz, 132 00:11:29,040 --> 00:11:35,920 and what I'd like you to notice is: look, all these individual spaces are all separate tokens. They are 133 00:11:35,920 --> 00:11:43,680 token 220, so 220 220 220 220, and then space if is a single 134 00:11:43,760 --> 00:11:48,280 token. And so what's going on here is that when the transformer is going to consume or try to 135 00:11:48,280 --> 00:11:55,740 create this text, it needs to handle all these spaces individually. They all feed in one by one 136 00:11:55,740 --> 00:12:01,440 into the entire transformer in the sequence. And so this is being extremely wasteful, tokenizing it 137 00:12:01,440 --> 00:12:07,900 in this way. And so as a result of that, GPT-2 is not very good with Python, and it's not anything 138 00:12:07,900 --> 00:12:12,080 to do with coding or the language model itself. It's just that if you use a lot of indentation 139 00:12:12,080 --> 00:12:18,040 using space in Python, like we usually do, you just end up bloating out all the text, and it's 140 00:12:18,040 --> 00:12:22,020 separated across way too much of the sequence, and we are running out of the context length 141 00:12:22,020 --> 00:12:26,660 in the sequence. That's roughly speaking what's happening. We're being way too wasteful. We're 142 00:12:26,660 --> 00:12:31,000 taking up way too much token space. Now we can also scroll up here, and we can change the tokenizer.
143 00:12:31,640 --> 00:12:38,140 So note here that the GPT-2 tokenizer creates a token count of 300 for this string here. We can change 144 00:12:38,140 --> 00:12:42,020 it to cl100k_base, which is the GPT-4 tokenizer. And we see 145 00:12:42,020 --> 00:12:42,060 that 146 00:12:42,060 --> 00:12:48,440 the token count drops to 185. So for the exact same string, we are now roughly halving the number 147 00:12:48,440 --> 00:12:54,540 of tokens. And roughly speaking, this is because the number of tokens in the GPT-4 tokenizer is 148 00:12:54,540 --> 00:12:59,840 roughly double that of the number of tokens in the GPT-2 tokenizer. So we went from roughly 50K 149 00:12:59,840 --> 00:13:05,920 to roughly 100K. Now you can imagine that this is a good thing, because the same text is now 150 00:13:06,200 --> 00:13:12,000 squished into half as many tokens. So this is a lot denser input to the 151 00:13:12,000 --> 00:13:17,840 transformer. And in the transformer, every single token has a finite number of tokens before it 152 00:13:17,840 --> 00:13:23,240 that it's going to pay attention to. And so what this is doing is we're roughly able to see twice 153 00:13:23,240 --> 00:13:29,800 as much text as context for what token to predict next because of this change. But of course, 154 00:13:29,980 --> 00:13:34,940 just increasing the number of tokens is not strictly better infinitely, because as you 155 00:13:34,940 --> 00:13:40,280 increase the number of tokens, now your embedding table is sort of getting a lot larger. And also 156 00:13:40,280 --> 00:13:41,980 at the output, we are trying to predict 157 00:13:42,000 --> 00:13:45,880 the next token, and there's the softmax there, and that grows as well. We're going to go into more 158 00:13:45,880 --> 00:13:51,400 detail later on this. But there's some kind of a sweet spot somewhere, where you have a just right 159 00:13:51,400 --> 00:13:55,560 number of tokens in your vocabulary, where everything is appropriately dense and still 160 00:13:55,560 --> 00:14:01,080 fairly efficient. Now, one thing I would like you to note specifically for the GPT-4 tokenizer 161 00:14:01,080 --> 00:14:07,480 is that the handling of the whitespace for Python has improved a lot. You see that here, 162 00:14:07,480 --> 00:14:11,960 these four spaces are represented as one single token, and same for the three spaces here. 163 00:14:12,000 --> 00:14:19,820 And here, seven spaces were all grouped into a single token. So we're being a lot more efficient 164 00:14:19,820 --> 00:14:23,900 in how we represent Python. And this was a deliberate choice made by OpenAI when they 165 00:14:23,900 --> 00:14:29,960 designed the GPT-4 tokenizer, and they group a lot more whitespace into a single token. 166 00:14:30,480 --> 00:14:37,720 What this does is it densifies Python, and therefore we can attend to more code before it 167 00:14:37,720 --> 00:14:41,940 when we're trying to predict the next token in the sequence. And so the improvement in 168 00:14:42,000 --> 00:14:47,680 Python coding ability from GPT-2 to GPT-4 is not just a matter of the language model and the 169 00:14:47,680 --> 00:14:52,080 architecture and the details of the optimization, but a lot of the improvement here is also coming 170 00:14:52,080 --> 00:14:57,160 from the design of the tokenizer and how it groups characters into tokens.
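Here is a rough sketch of that comparison, again assuming tiktoken, where cl100k_base is the GPT-4 tokenizer; the FizzBuzz snippet is just an illustrative stand-in for the one in the web app.

```python
# Comparing GPT-2 vs GPT-4 tokenization of whitespace-heavy Python,
# assuming the tiktoken library; the snippet is an illustrative stand-in.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")

code = """for i in range(1, 16):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""

print("gpt2:", len(gpt2.encode(code)))         # indentation spaces mostly become separate tokens
print("cl100k_base:", len(gpt4.encode(code)))  # runs of spaces are grouped, so far fewer tokens
```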
Okay, so let's 171 00:14:57,160 --> 00:15:03,200 now start writing some code. So remember what we want to do. We want to take strings and feed them 172 00:15:03,200 --> 00:15:09,080 into language models. For that, we need to somehow tokenize strings into some integers 173 00:15:09,080 --> 00:15:11,320 in some fixed vocabulary. 174 00:15:12,000 --> 00:15:17,040 And then we will use those integers to make a lookup into a lookup table of vectors, and feed 175 00:15:17,040 --> 00:15:22,240 those vectors into the transformer as an input. Now, the reason this gets a little bit tricky, 176 00:15:22,240 --> 00:15:26,320 of course, is that we don't just want to support the simple English alphabet. We want to support 177 00:15:26,320 --> 00:15:31,440 different kinds of languages. So this is annyeonghaseyo in Korean, which is hello. 178 00:15:31,440 --> 00:15:35,520 And we also want to support many kinds of special characters that we might find on the internet, 179 00:15:35,520 --> 00:15:41,920 for example, emoji. So how do we feed this text into a transformer? 180 00:15:42,000 --> 00:15:48,560 Well, what is this text anyway in Python? So if you go to the documentation of a string in 181 00:15:48,560 --> 00:15:55,600 Python, you can see that strings are immutable sequences of Unicode code points. Okay, what are 182 00:15:55,600 --> 00:16:03,040 Unicode code points? We can go to Wikipedia. So Unicode code points are defined by the Unicode 183 00:16:03,040 --> 00:16:09,520 Consortium as part of the Unicode standard. And what this really is, is that it's just a definition 184 00:16:09,520 --> 00:16:11,680 of roughly 150,000 characters 185 00:16:12,000 --> 00:16:18,560 right now, and roughly speaking, what they look like and what integers represent those characters. 186 00:16:18,560 --> 00:16:24,560 So it says 150,000 characters across 161 scripts as of right now. So if you scroll down here, 187 00:16:24,560 --> 00:16:29,760 you can see that the standard is very much alive. The latest standard 15.1 is September 2023. 188 00:16:31,040 --> 00:16:38,800 And basically, this is just a way to define lots of types of characters, like, for example, 189 00:16:38,800 --> 00:16:40,400 all these characters across different scripts. 190 00:16:42,000 --> 00:16:46,480 The way we can access the Unicode code point given a single character is by using the ord 191 00:16:46,480 --> 00:16:52,800 function in Python. So for example, I can pass in ord of h, and I can see that for the single 192 00:16:52,800 --> 00:17:01,280 character h, the Unicode code point is 104. Okay. But this can be arbitrarily complicated. So we 193 00:17:01,280 --> 00:17:06,560 can take, for example, our emoji here, and we can see that the code point for this one is 128,000. 194 00:17:07,520 --> 00:17:08,960 Or we can take the Korean character 안. 195 00:17:12,000 --> 00:17:18,800 Now keep in mind, you can't plug in strings here, because this doesn't have a single code point. 196 00:17:18,800 --> 00:17:25,520 It only takes a single Unicode code point character and tells you its integer. So in this 197 00:17:25,520 --> 00:17:32,640 way, we can look up all the characters of this specific string and their code points. So ord of 198 00:17:32,640 --> 00:17:41,520 x for x in this string. And we get this encoding here. Now see here, we've already turned the raw
So why can't we simply just use these integers and not have any 200 00:17:47,360 --> 00:17:51,520 tokenization at all? Why can't we just use this natively as is and just use the code point? 201 00:17:52,320 --> 00:17:56,160 Well, one reason for that, of course, is that the vocabulary in that case would be quite long. 202 00:17:56,160 --> 00:18:02,880 So in this case, for Unicode, this is a vocabulary of 150,000 different code points. But more 203 00:18:02,880 --> 00:18:08,640 worryingly than that, I think the Unicode standard is very much alive and it keeps changing. And so 204 00:18:08,640 --> 00:18:11,920 it's not kind of a stable representation necessarily that we may want to use. 205 00:18:12,000 --> 00:18:16,720 So for those reasons, we need something a bit better. So to find something better, 206 00:18:16,720 --> 00:18:21,200 we turn to encodings. So if you go to the Wikipedia page here, we see that the Unicode 207 00:18:21,200 --> 00:18:29,440 consortium defines three types of encodings, UTF-8, UTF-16, and UTF-32. These encodings are 208 00:18:29,440 --> 00:18:35,120 the way by which we can take Unicode text and translate it into binary data or byte streams. 209 00:18:36,000 --> 00:18:41,840 UTF-8 is by far the most common. So this is the UTF-8 page. Now, this Wikipedia page is actually 210 00:18:41,840 --> 00:18:46,880 quite long, but what's important for our purposes is that UTF-8 takes every single code point 211 00:18:47,680 --> 00:18:53,440 and it translates it to a byte stream. And this byte stream is between one to four bytes. So it's 212 00:18:53,440 --> 00:18:57,840 a variable length encoding. So depending on the Unicode point according to the schema, 213 00:18:57,840 --> 00:19:03,200 you're going to end up with between one to four bytes for each code point. On top of that, there's 214 00:19:03,200 --> 00:19:10,880 UTF-8, UTF-16, and UTF-32. UTF-32 is nice because it is fixed length instead of variable length, 215 00:19:10,880 --> 00:19:11,840 but it has many other variables. So here's the UTF-8 code, and here's UTF-16 code, and here's UTF-32. 216 00:19:11,840 --> 00:19:18,020 downsides as well. So the full kind of spectrum of pros and cons of all these 217 00:19:18,020 --> 00:19:21,980 different tree encodings are beyond the scope of this video. I just like to point 218 00:19:21,980 --> 00:19:26,600 out that I enjoyed this blog post and this blog post at the end of it also has 219 00:19:26,600 --> 00:19:31,660 a number of references that can be quite useful. One of them is UTF-8 Everywhere 220 00:19:31,660 --> 00:19:37,160 manifesto and this manifesto describes the reason why UTF-8 is significantly 221 00:19:37,160 --> 00:19:42,480 preferred and a lot nicer than the other encodings and why it is used a lot 222 00:19:42,480 --> 00:19:48,700 more prominently on the Internet. One of the major advantages does just give you 223 00:19:48,700 --> 00:19:53,180 a sense is that UTF-8 is the only one of these that is backwards compatible to 224 00:19:53,180 --> 00:19:57,780 the much simpler ASCII encoding of text, but I'm not going to go into the full 225 00:19:57,780 --> 00:20:02,220 detail in this video. So suffice to say that we like the UTF-8 encoding and 226 00:20:02,220 --> 00:20:07,060 let's try to take the string and see what we get if we encode it into UTF-8. 227 00:20:07,060 --> 00:20:07,120 So let's try to take the string and see what we get if we encode it into UTF-8. 
229 00:20:07,140 --> 00:20:13,120 The string class in Python actually has .encode, and you can give it the encoding, which is, 230 00:20:13,120 --> 00:20:20,700 say, UTF-8. Now what we get out of this is not very nice, because this is a bytes object, and 231 00:20:20,700 --> 00:20:25,680 it's not very nice in the way that it's printed. So I personally like to take it through a list, 232 00:20:25,680 --> 00:20:32,760 because then we actually get the raw bytes of this encoding. So these are the raw bytes 233 00:20:32,760 --> 00:20:40,360 that represent this string according to the UTF-8 encoding. We can also look at UTF-16; we get a 234 00:20:40,360 --> 00:20:45,880 slightly different byte stream, and here we start to see one of the disadvantages of UTF-16. You see 235 00:20:45,880 --> 00:20:50,100 how we have zero something, zero something, zero something: we're starting to get a sense 236 00:20:50,100 --> 00:20:55,800 that this is a bit of a wasteful encoding, and indeed, for simple ASCII characters or English 237 00:20:55,800 --> 00:21:01,080 characters here, we just have the structure of zero something, zero something, and it's not exactly 238 00:21:01,080 --> 00:21:02,740 nice. 239 00:21:02,760 --> 00:21:07,320 But if we look at UTF-32, when we expand this, we can start to get a sense of the wastefulness of 240 00:21:07,320 --> 00:21:13,640 this encoding for our purposes: you see a lot of zeros followed by something, and so this is not 241 00:21:13,640 --> 00:21:22,840 desirable. So suffice it to say that we would like to stick with UTF-8 for our purposes. However, if we 242 00:21:22,840 --> 00:21:30,120 just use UTF-8 naively, these are byte streams, so that would imply a vocabulary length of only 256 243 00:21:30,120 --> 00:21:31,080 possible tokens. 244 00:21:32,120 --> 00:21:32,520 But this 245 00:21:32,760 --> 00:21:37,880 vocabulary size is very, very small. What this is going to do, if we just were to use it naively, 246 00:21:37,880 --> 00:21:44,840 is that all of our text would be stretched out over very, very long sequences of bytes. And so 247 00:21:47,320 --> 00:21:51,720 what this does is that, certainly, the embedding table is going to be tiny, and the prediction at 248 00:21:51,720 --> 00:21:56,360 the top of the final layer is going to be very tiny, but our sequences are very long. And remember 249 00:21:56,360 --> 00:22:02,680 that we have pretty finite context lengths in the attention that we can support in a transformer for 250 00:22:02,760 --> 00:22:08,440 computational reasons, and so we only have so much context length, but now we have very, very long 251 00:22:08,440 --> 00:22:12,840 sequences, and this is just inefficient, and it's not going to allow us to attend to sufficiently 252 00:22:12,840 --> 00:22:19,560 long text before us for the purposes of the next-token prediction task. So we don't want to use 253 00:22:20,280 --> 00:22:26,760 the raw bytes of the UTF-8 encoding. We want to be able to support a larger vocabulary size that 254 00:22:26,760 --> 00:22:32,280 we can tune as a hyperparameter, but we want to stick with the UTF-8 encoding of these strings. 255 00:22:32,760 --> 00:22:37,320 So what do we do? Well, the answer of course is we turn to the byte pair encoding algorithm, 256 00:22:37,320 --> 00:22:39,800 which will allow us to compress these byte sequences 257 00:22:41,160 --> 00:22:45,960 to a variable amount. So we'll get to that in a bit, but I just want
to briefly speak to the 258 00:22:45,960 --> 00:22:51,880 fact that I would love nothing more than to be able to feed raw byte sequences into 259 00:22:52,920 --> 00:22:57,800 language models. In fact, there's a paper about how this could potentially be done, from summer 260 00:22:57,800 --> 00:23:02,280 last year. Now the problem is, you actually have to go in and you have to modify the transformer, 261 00:23:03,080 --> 00:23:07,720 because, as I mentioned, you're going to have a problem where the attention will start to become 262 00:23:07,720 --> 00:23:13,720 extremely expensive, because the sequences are so long. And so in this paper they propose kind of a 263 00:23:13,720 --> 00:23:19,880 hierarchical structuring of the transformer that could allow you to just feed in raw bytes. And so 264 00:23:19,880 --> 00:23:24,040 at the end they say, together these results establish the viability of tokenization-free 265 00:23:24,040 --> 00:23:27,960 autoregressive sequence modeling at scale. So tokenization-free would indeed be 266 00:23:27,960 --> 00:23:32,680 amazing; we would just feed byte streams directly into our models. But unfortunately, 267 00:23:32,760 --> 00:23:37,480 I don't know that this has really been proven out yet by sufficiently many groups and at sufficient 268 00:23:37,480 --> 00:23:41,880 scale. But something like this at one point would be amazing, and I hope someone comes up with it. 269 00:23:41,880 --> 00:23:46,840 But for now we have to come back, and we can't feed this directly into language models, and we have to 270 00:23:46,840 --> 00:23:51,320 compress it using the byte pair encoding algorithm. So let's see how that works. So as I mentioned, the 271 00:23:51,320 --> 00:23:55,640 byte pair encoding algorithm is not all that complicated, and the Wikipedia page is actually 272 00:23:55,640 --> 00:24:00,680 quite instructive as far as the basic idea goes. What we're doing is, we have some kind of an input 273 00:24:00,680 --> 00:24:06,600 sequence. Like, for example, here we have only four elements in our vocabulary, a, b, c, and d, and we have 274 00:24:06,600 --> 00:24:13,400 a sequence of them. So instead of bytes, let's say we just had a vocab size of four. The sequence 275 00:24:13,400 --> 00:24:21,320 is too long and we'd like to compress it. So what we do is we iteratively find the pair of tokens that 276 00:24:21,320 --> 00:24:29,320 occurs the most frequently, and then, once we've identified that pair, we replace that pair with 277 00:24:29,320 --> 00:24:30,600 just a single new token 278 00:24:30,680 --> 00:24:37,560 that we append to our vocabulary. So, for example, here the byte pair aa occurs most often, so we 279 00:24:37,560 --> 00:24:43,880 mint a new token, let's call it capital Z, and we replace every single occurrence of aa by Z. 280 00:24:44,680 --> 00:24:52,440 So now we have two Z's here. So here we took a sequence of 11 characters with vocabulary size 281 00:24:52,440 --> 00:25:00,360 four, and we've converted it to a sequence of only nine tokens, but now with a vocabulary of five, 285 00:25:00,660 --> 00:25:02,800 because we have a fifth vocabulary element 286 00:25:02,800 --> 
00:25:03,840 that we just created, 287 00:25:03,840 --> 00:25:07,340 and it's Z standing for concatenation of AA. 288 00:25:07,340 --> 00:25:09,700 And we can, again, repeat this process. 289 00:25:09,700 --> 00:25:12,040 So we, again, look at the sequence 290 00:25:12,040 --> 00:25:16,720 and identify the pair of tokens that are most frequent. 291 00:25:16,720 --> 00:25:19,180 Let's say that that is now AB. 292 00:25:19,180 --> 00:25:20,660 Well, we are going to replace AB 293 00:25:20,660 --> 00:25:23,500 with a new token that we mint, called Y. 294 00:25:23,500 --> 00:25:24,640 So Y becomes AB, 295 00:25:24,640 --> 00:25:26,280 and then every single occurrence of AB 296 00:25:26,280 --> 00:25:28,160 is now replaced with Y. 297 00:25:28,160 --> 00:25:29,820 So we end up with this. 298 00:25:30,660 --> 00:25:32,900 So now we only have one, two, three, 299 00:25:32,900 --> 00:25:36,120 four, five, six, seven characters in our sequence, 300 00:25:36,120 --> 00:25:41,120 but we have not just four vocabulary elements, 301 00:25:41,420 --> 00:25:43,600 or five, but now we have six. 302 00:25:43,600 --> 00:25:45,560 And for the final round, 303 00:25:45,560 --> 00:25:47,320 we, again, look through the sequence, 304 00:25:47,320 --> 00:25:51,520 find that the pair ZY is the most common, 305 00:25:51,520 --> 00:25:55,520 and replace it one more time with another character, 306 00:25:55,520 --> 00:25:56,560 let's say X. 307 00:25:56,560 --> 00:25:59,960 So X is ZY, and we replace all occurrences of ZY, 308 00:25:59,960 --> 00:26:02,020 and we get this following sequence. 309 00:26:02,020 --> 00:26:04,840 So basically, after we have gone through this process, 310 00:26:04,840 --> 00:26:09,840 instead of having a sequence of 11 tokens 311 00:26:12,500 --> 00:26:14,680 with a vocabulary length of four, 312 00:26:14,680 --> 00:26:19,680 we now have a sequence of one, two, three, four, five tokens, 313 00:26:20,780 --> 00:26:24,120 but our vocabulary length now is seven. 314 00:26:24,120 --> 00:26:25,240 And so in this way, 315 00:26:25,240 --> 00:26:27,580 we can iteratively compress our sequence 316 00:26:27,580 --> 00:26:29,280 as we mint new tokens. 317 00:26:29,960 --> 00:26:31,320 In the exact same way, 318 00:26:31,320 --> 00:26:34,260 we start out with byte sequences. 319 00:26:34,260 --> 00:26:37,500 So we have 256 vocabulary size, 320 00:26:37,500 --> 00:26:38,960 but we're now going to go through these 321 00:26:38,960 --> 00:26:42,140 and find the byte pairs that occur the most. 322 00:26:42,140 --> 00:26:44,900 And we're going to iteratively start minting new tokens, 323 00:26:44,900 --> 00:26:48,140 appending them to our vocabulary, and replacing things. 324 00:26:48,140 --> 00:26:48,980 And in this way, 325 00:26:48,980 --> 00:26:51,540 we're going to end up with a compressed training dataset, 326 00:26:51,540 --> 00:26:54,880 and also an algorithm for taking any arbitrary sequence 327 00:26:54,880 --> 00:26:58,200 and encoding it using this vocabulary, 328 00:26:58,200 --> 00:26:59,820 and also decoding it back to strings, 329 00:26:59,820 --> 00:27:02,060 so we can translate between text and tokens. 330 00:27:02,060 --> 00:27:03,960 So let's now implement all that. 331 00:27:03,960 --> 00:27:05,360 So here's what I did. 332 00:27:05,360 --> 00:27:07,560 I went to this blog post that I enjoyed, 333 00:27:07,560 --> 00:27:08,960 and I took the first paragraph 334 00:27:08,960 --> 00:27:11,740 and I copy pasted it here into text.
335 00:27:11,740 --> 00:27:13,900 So this is one very long line here. 336 00:27:15,200 --> 00:27:17,340 Now, to get the tokens, as I mentioned, 337 00:27:17,340 --> 00:27:20,300 we just take our text and we encode it into UTF-8. 338 00:27:20,300 --> 00:27:23,440 The tokens here at this point will be a raw bytes, 339 00:27:23,440 --> 00:27:25,580 single stream of bytes. 340 00:27:25,580 --> 00:27:27,620 And just so that it's easier to work with, 341 00:27:27,620 --> 00:27:29,820 instead of just a bytes object, 342 00:27:29,820 --> 00:27:32,120 we can import all those bytes to integers, 343 00:27:32,120 --> 00:27:33,500 and then create a list of it, 344 00:27:33,500 --> 00:27:35,100 just so it's easier for us to manipulate 345 00:27:35,100 --> 00:27:37,300 and work with in Python and visualize. 346 00:27:37,300 --> 00:27:38,400 And here I'm printing all of that. 347 00:27:38,400 --> 00:27:41,400 So this is the original paragraph, 348 00:27:44,560 --> 00:27:48,800 and its length is 533 code points. 349 00:27:48,800 --> 00:27:52,840 And then here are the bytes encoded in UTF-8. 350 00:27:52,840 --> 00:27:56,240 And we see that this has a length of 616 bytes 351 00:27:56,240 --> 00:27:58,580 at this point, or 616 tokens. 352 00:27:58,580 --> 00:27:59,820 And the reason this is more, 353 00:27:59,820 --> 00:28:03,220 is because a lot of these simple ASCII characters, 354 00:28:03,220 --> 00:28:06,220 or simple characters, they just become a single byte. 355 00:28:06,220 --> 00:28:08,860 But a lot of these Unicode, more complex characters, 356 00:28:08,860 --> 00:28:10,740 become multiple bytes, up to four. 357 00:28:10,740 --> 00:28:12,760 And so we are expanding that size. 358 00:28:13,860 --> 00:28:15,900 So now what we'd like to do as a first step of the algorithm 359 00:28:15,900 --> 00:28:17,700 is we'd like to iterate over here 360 00:28:17,700 --> 00:28:22,040 and find the pair of bytes that occur most frequently, 361 00:28:22,040 --> 00:28:23,840 because we're then going to merge it. 362 00:28:23,840 --> 00:28:26,480 So if you are working along on a notebook on a side, 363 00:28:26,480 --> 00:28:28,740 then I encourage you to basically click on the link, 364 00:28:28,740 --> 00:28:29,680 find this notebook, 365 00:28:29,820 --> 00:28:31,920 and try to write that function yourself. 366 00:28:31,920 --> 00:28:34,060 Otherwise, I'm going to come here and implement first 367 00:28:34,060 --> 00:28:36,300 the function that finds the most common pair. 368 00:28:36,300 --> 00:28:37,660 Okay, so here's what I came up with. 369 00:28:37,660 --> 00:28:39,800 There are many different ways to implement this, 370 00:28:39,800 --> 00:28:41,400 but I'm calling the function getStats. 371 00:28:41,400 --> 00:28:43,300 It expects a list of integers. 372 00:28:43,300 --> 00:28:45,200 I'm using a dictionary to keep track 373 00:28:45,200 --> 00:28:46,800 of basically the counts. 374 00:28:46,800 --> 00:28:48,040 And then this is a Pythonic way 375 00:28:48,040 --> 00:28:51,280 to iterate consecutive elements off this list, 376 00:28:51,280 --> 00:28:53,680 which we covered in a previous video. 377 00:28:53,680 --> 00:28:55,980 And then here, I'm just keeping track of 378 00:28:55,980 --> 00:28:59,320 just incrementing by one for all the pairs. 379 00:28:59,820 --> 00:29:02,160 So if I do this on all the tokens here, 380 00:29:02,160 --> 00:29:04,460 then the stats comes out here. 381 00:29:04,460 --> 00:29:05,860 So this is the dictionary. 
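Here is one possible version of that pair-counting function, matching the description above; tokens is assumed to be the list of UTF-8 byte values from the previous cell.

```python
# One possible version of the pair-counting function described above.
def get_stats(ids):
    """Count how often each consecutive pair of integers occurs in ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):          # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)   # `tokens` is the list of UTF-8 byte values from above
```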
382 00:29:05,860 --> 00:29:09,700 The keys are these tuples of consecutive elements, 383 00:29:09,700 --> 00:29:11,340 and this is the count. 384 00:29:11,340 --> 00:29:14,940 So just to print it in a slightly better way, 385 00:29:14,940 --> 00:29:17,740 this is one way that I like to do that, 386 00:29:17,740 --> 00:29:20,980 where it's a little bit convoluted here, 387 00:29:20,980 --> 00:29:22,380 so you can pause if you like. 388 00:29:22,380 --> 00:29:24,640 But we iterate all the items. 389 00:29:24,640 --> 00:29:29,220 Calling items on the dictionary returns pairs of (key, value); 390 00:29:29,220 --> 00:29:34,100 instead, I create a list here of (value, key), 391 00:29:34,100 --> 00:29:36,260 because if it's a (value, key) list, 392 00:29:36,260 --> 00:29:38,060 then I can call sort on it. 393 00:29:38,060 --> 00:29:42,800 And by default, Python will use the first element, 394 00:29:42,800 --> 00:29:44,100 which in this case will be the value, 395 00:29:44,100 --> 00:29:46,540 to sort by if it's given tuples. 396 00:29:46,540 --> 00:29:49,240 And then reverse, so it's descending, and print that. 397 00:29:49,240 --> 00:29:52,680 So basically, it looks like 101 comma 32 398 00:29:52,680 --> 00:29:55,580 was the most commonly occurring consecutive pair, 399 00:29:55,580 --> 00:29:57,480 and it occurred 20 times. 400 00:29:57,480 --> 00:29:59,180 We can double check that that makes 401 00:29:59,220 --> 00:30:00,220 reasonable sense. 402 00:30:00,220 --> 00:30:03,160 So if I just search 101, 32, 403 00:30:03,160 --> 00:30:07,660 then you see that these are the 20 occurrences of that pair. 404 00:30:09,440 --> 00:30:11,460 And if we'd like to take a look at what exactly 405 00:30:11,460 --> 00:30:13,940 that pair is, we can use chr, 406 00:30:13,940 --> 00:30:16,340 which is the opposite of ord in Python. 407 00:30:16,340 --> 00:30:19,180 So we give it a Unicode code point, 408 00:30:19,180 --> 00:30:24,180 so chr of 101 and chr of 32, and we see that this is e and space. 409 00:30:24,940 --> 00:30:28,220 So basically, there's a lot of e space here, 410 00:30:28,220 --> 00:30:29,220 meaning that a lot of these words 411 00:30:29,220 --> 00:30:30,580 seem to end with e. 412 00:30:30,580 --> 00:30:33,120 So here's e space as an example. 413 00:30:33,120 --> 00:30:34,460 So there's a lot of that going on here, 414 00:30:34,460 --> 00:30:36,720 and this is the most common pair. 415 00:30:36,720 --> 00:30:39,040 So now that we've identified the most common pair, 416 00:30:39,040 --> 00:30:41,860 we would like to iterate over the sequence. 417 00:30:41,860 --> 00:30:46,240 We're going to mint a new token with the ID of 256, right? 418 00:30:46,240 --> 00:30:49,640 Because these tokens currently go from zero to 255. 419 00:30:49,640 --> 00:30:51,080 So when we create a new token, 420 00:30:51,080 --> 00:30:53,680 it will have an ID of 256. 421 00:30:53,680 --> 00:30:58,340 And we're going to iterate over this entire list, 422 00:30:58,340 --> 00:30:59,180 and every 423 00:30:59,220 --> 00:31:02,020 time we see 101 comma 32, 424 00:31:02,020 --> 00:31:04,860 we're going to swap that out for 256. 425 00:31:04,860 --> 00:31:06,960 So let's implement that now, 426 00:31:06,960 --> 00:31:09,800 and feel free to do that yourself as well. 427 00:31:09,800 --> 00:31:11,600 So first I commented this, 428 00:31:11,600 --> 00:31:14,900 just so we don't pollute the notebook too much.
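In code, the sorting and the chr lookup described above might look roughly like this (using the stats dictionary from before):

```python
# Sorting the (count, pair) items to see the most frequent pairs first.
print(sorted(((count, pair) for pair, count in stats.items()), reverse=True)[:5])

# chr is the inverse of ord: it maps a code point back to a character.
print(chr(101), chr(32))   # 'e' and ' ', the most common consecutive pair in this text
```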
429 00:31:14,900 --> 00:31:17,980 This is a nice way of, in Python, 430 00:31:17,980 --> 00:31:20,400 obtaining the highest ranking pair. 431 00:31:20,400 --> 00:31:24,980 So we're basically calling the max on this dictionary stats, 432 00:31:24,980 --> 00:31:28,580 and this will return the maximum key. 433 00:31:28,580 --> 00:31:31,720 And then the question is, how does it rank keys? 434 00:31:31,720 --> 00:31:34,820 So you can provide it with a function that ranks keys, 435 00:31:34,820 --> 00:31:36,860 and that function is just stats.get. 436 00:31:37,720 --> 00:31:41,100 Stats.get would basically return the value. 437 00:31:41,100 --> 00:31:42,720 And so we're ranking by the value 438 00:31:42,720 --> 00:31:44,460 and getting the maximum key. 439 00:31:44,460 --> 00:31:47,500 So it's 101 comma 32, as we saw. 440 00:31:47,500 --> 00:31:49,840 Now to actually merge 101, 32, 441 00:31:51,040 --> 00:31:52,200 this is the function that I wrote, 442 00:31:52,200 --> 00:31:54,800 but again, there are many different versions of it. 443 00:31:54,800 --> 00:31:56,880 So we're going to take a list of IDs 444 00:31:56,880 --> 00:31:57,780 and the pair that we want to replace, and we're going to do that. 445 00:31:57,780 --> 00:31:58,580 And we're going to take a list of IDs, and the pair that we want to replace, 446 00:31:58,580 --> 00:32:02,820 and that pair will be replaced with the new index IDX. 447 00:32:02,820 --> 00:32:04,980 So iterating through IDs, 448 00:32:04,980 --> 00:32:08,220 if we find the pair, swap it out for IDX. 449 00:32:08,220 --> 00:32:11,660 So we create this new list, and then we start at zero, 450 00:32:11,660 --> 00:32:13,820 and then we go through this entire list sequentially 451 00:32:13,820 --> 00:32:14,820 from left to right. 452 00:32:15,760 --> 00:32:18,000 And here we are checking for equality 453 00:32:18,000 --> 00:32:20,400 at the current position with the pair. 454 00:32:22,200 --> 00:32:24,640 So here we are checking that the pair matches. 455 00:32:24,640 --> 00:32:26,340 Now here's a bit of a tricky condition 456 00:32:26,340 --> 00:32:28,580 that you have to append if you're trying to be careful. 457 00:32:28,580 --> 00:32:31,620 And that is that you don't want this here 458 00:32:31,620 --> 00:32:34,220 to be out of bounds at the very last position 459 00:32:34,220 --> 00:32:36,580 when you're on the rightmost element of this list. 460 00:32:36,580 --> 00:32:39,320 Otherwise this would give you an out of bounds error. 461 00:32:39,320 --> 00:32:40,700 So we have to make sure that we're not 462 00:32:40,700 --> 00:32:42,720 at the very, very last element. 463 00:32:42,720 --> 00:32:45,760 So this would be false for that. 464 00:32:45,760 --> 00:32:50,500 So if we find a match, we append to this new list 465 00:32:50,500 --> 00:32:54,000 that replacement index, and we increment the position by two. 466 00:32:54,000 --> 00:32:56,340 So we skip over that entire pair. 467 00:32:56,340 --> 00:32:58,580 But otherwise, if we haven't found a matching pair, 468 00:32:58,580 --> 00:33:02,380 we just sort of copy over the element at that position 469 00:33:02,380 --> 00:33:06,020 and increment by one, and then return this. 470 00:33:06,020 --> 00:33:07,520 So here's a very small toy example. 
471 00:33:07,520 --> 00:33:09,960 If we have a list five, six, six, seven, nine, one, 472 00:33:09,960 --> 00:33:13,300 and we wanna replace the occurrences of 67 with 99, 473 00:33:13,300 --> 00:33:17,000 then calling this on that will give us 474 00:33:17,000 --> 00:33:18,300 what we're asking for. 475 00:33:18,300 --> 00:33:21,140 So here the six, seven is replaced with 99. 476 00:33:22,340 --> 00:33:25,940 So now I'm gonna uncomment this for our actual use case, 477 00:33:25,940 --> 00:33:28,200 where we wanna take our tokens, 479 00:33:28,580 --> 00:33:31,020 we wanna take the top pair here 480 00:33:31,020 --> 00:33:34,080 and replace it with 256 to get tokens2. 481 00:33:34,080 --> 00:33:36,660 If we run this, we get the following. 482 00:33:38,220 --> 00:33:43,220 So recall that previously we had a length 616 in this list. 483 00:33:44,700 --> 00:33:47,540 And now we have a length 596, right? 484 00:33:47,540 --> 00:33:50,200 So this decreased by 20, which makes sense 485 00:33:50,200 --> 00:33:52,400 because there are 20 occurrences. 486 00:33:52,400 --> 00:33:55,480 Moreover, we can try to find 256 here, 487 00:33:55,480 --> 00:33:57,600 and we see plenty of occurrences of it. 488 00:33:58,580 --> 00:33:59,780 And moreover, just to double check, 489 00:33:59,780 --> 00:34:02,340 there should be no occurrence of 101, 32. 490 00:34:02,340 --> 00:34:04,980 So this is the original array, plenty of them. 491 00:34:04,980 --> 00:34:08,320 And in the second array, there are no occurrences of 101, 32. 492 00:34:08,320 --> 00:34:11,500 So we've successfully merged this single pair. 493 00:34:11,500 --> 00:34:13,320 And now we just iterate this. 494 00:34:13,320 --> 00:34:15,440 So we are gonna go over the sequence again, 495 00:34:15,440 --> 00:34:17,740 find the most common pair and replace it. 496 00:34:17,740 --> 00:34:20,180 So let me now write a while loop that uses these functions 497 00:34:20,180 --> 00:34:22,820 to do this sort of iteratively. 498 00:34:22,820 --> 00:34:25,180 And how many times do we do it for? 499 00:34:25,180 --> 00:34:27,460 Well, that's totally up to us as a hyperparameter. 500 00:34:27,460 --> 00:34:28,460 The more, the better. 501 00:34:28,580 --> 00:34:32,380 The more steps we take, the larger will be our vocabulary 502 00:34:32,380 --> 00:34:34,460 and the shorter will be our sequence. 503 00:34:34,460 --> 00:34:36,940 And there is some sweet spot that we usually find 504 00:34:36,940 --> 00:34:38,760 works the best in practice. 505 00:34:38,760 --> 00:34:41,560 And so this is kind of a hyperparameter and we tune it 506 00:34:41,560 --> 00:34:44,000 and we find good vocabulary sizes. 507 00:34:44,000 --> 00:34:47,760 As an example, GPT-4 currently uses roughly 100,000 tokens, 508 00:34:47,760 --> 00:34:51,560 and ballpark, those are reasonable numbers currently 509 00:34:51,560 --> 00:34:53,460 in state-of-the-art language models. 510 00:34:53,460 --> 00:34:56,560 So let me now write, putting it all together 511 00:34:56,560 --> 00:34:58,340 and iterating these steps.
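For reference, here is one way the merge function and the toy example above might look; this is a sketch consistent with the description, not necessarily the exact notebook code.

```python
# One way to write the merge step described above.
def merge(ids, pair, idx):
    """Replace every consecutive occurrence of pair in ids with the new token idx."""
    newids = []
    i = 0
    while i < len(ids):
        # check for a match, being careful not to run past the end of the list
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2                      # skip over both elements of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # -> [5, 6, 99, 9, 1]

top_pair = max(stats, key=stats.get)           # (101, 32) for this text
tokens2 = merge(tokens, top_pair, 256)         # length drops from 616 to 596 here
```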
512 00:34:58,580 --> 00:35:00,480 Okay, now, before we dive into the while loop, 513 00:35:00,480 --> 00:35:03,120 I wanted to add one more cell here 514 00:35:03,120 --> 00:35:04,520 where I went to the blog post 515 00:35:04,520 --> 00:35:07,040 and instead of grabbing just the first paragraph or two, 516 00:35:07,040 --> 00:35:08,680 I took the entire blog post 517 00:35:08,680 --> 00:35:10,740 and I stretched it out in a single line. 518 00:35:10,740 --> 00:35:12,400 And basically just using longer text 519 00:35:12,400 --> 00:35:14,680 will allow us to have more representative statistics 520 00:35:14,680 --> 00:35:15,880 for the byte pairs, 521 00:35:15,880 --> 00:35:18,900 and we'll just get more sensible results out of it 522 00:35:18,900 --> 00:35:20,240 because it's longer text. 523 00:35:21,420 --> 00:35:23,140 So here we have the raw text. 524 00:35:23,140 --> 00:35:27,380 We encode it into bytes using the UTF-8 encoding. 525 00:35:27,380 --> 00:35:28,460 And then here, 526 00:35:28,460 --> 00:35:30,600 as before, we are just changing it 527 00:35:30,600 --> 00:35:32,200 into a list of integers in Python, 528 00:35:32,200 --> 00:35:34,020 just so it's easier to work with 529 00:35:34,020 --> 00:35:36,340 instead of the raw bytes objects. 530 00:35:36,340 --> 00:35:40,200 And then this is the code that I came up with 531 00:35:40,200 --> 00:35:43,840 to actually do the merging and loop. 532 00:35:43,840 --> 00:35:45,660 These two functions here are identical 533 00:35:45,660 --> 00:35:46,700 to what we had above. 534 00:35:46,700 --> 00:35:48,080 I only included them here 535 00:35:48,080 --> 00:35:50,740 just so that you have the point of reference here. 536 00:35:51,640 --> 00:35:53,980 So these two are identical, 537 00:35:53,980 --> 00:35:56,500 and then this is the new code that I added. 538 00:35:56,500 --> 00:35:58,340 So the first thing you wanna do is you want to decide on 539 00:35:58,340 --> 00:36:00,140 the final vocabulary size 540 00:36:00,140 --> 00:36:02,440 that we want our tokenizer to have. 541 00:36:02,440 --> 00:36:03,980 And as I mentioned, this is a hyperparameter 542 00:36:03,980 --> 00:36:05,240 and you set it in some way 543 00:36:05,240 --> 00:36:07,480 depending on your best performance. 544 00:36:07,480 --> 00:36:10,100 So let's say for us, we're going to use 276, 545 00:36:10,100 --> 00:36:13,960 because that way we're going to be doing exactly 20 merges. 546 00:36:13,960 --> 00:36:18,380 And 20 merges because we already have 256 tokens 547 00:36:18,380 --> 00:36:20,240 for the raw bytes. 548 00:36:20,240 --> 00:36:23,560 And to reach 276, we have to do 20 merges 549 00:36:23,560 --> 00:36:25,120 to add 20 new tokens. 550 00:36:26,480 --> 00:36:27,560 Here, this is one way in Python 553 00:36:28,620 --> 00:36:30,720 to just create a copy of a list. 554 00:36:31,580 --> 00:36:33,240 So I'm taking the tokens list 555 00:36:33,240 --> 00:36:34,880 and by wrapping it in list, 556 00:36:34,880 --> 00:36:36,900 Python will construct a new list 557 00:36:36,900 --> 00:36:38,060 of all the individual elements. 558 00:36:38,060 --> 00:36:39,660 So this is just a copy operation. 559 00:36:40,960 --> 00:36:44,300 Then here, I'm creating a merges dictionary.
560 00:36:44,300 --> 00:36:47,120 So this merges dictionary is going to maintain basically 561 00:36:47,120 --> 00:36:52,120 the child one, child two mapping to a new token. 562 00:36:52,260 --> 00:36:53,700 And so what we're going to be building up here 563 00:36:53,700 --> 00:36:56,460 is a binary tree of merges. 564 00:36:56,460 --> 00:36:57,840 But actually it's not exactly a tree, 565 00:36:57,840 --> 00:37:00,500 because a tree would have a single root node 566 00:37:00,500 --> 00:37:02,220 with a bunch of leaves. 567 00:37:02,220 --> 00:37:04,640 For us, we're starting with the leaves on the bottom, 568 00:37:04,640 --> 00:37:06,060 which are the individual bytes. 569 00:37:06,060 --> 00:37:08,600 Those are the starting 256 tokens. 570 00:37:08,600 --> 00:37:11,400 And then we're starting to like merge two of them at a time. 571 00:37:11,400 --> 00:37:13,800 And so it's not a tree, it's more like a forest 572 00:37:16,180 --> 00:37:18,360 as we merge these elements. 573 00:37:18,360 --> 00:37:21,560 So for 20 merges, 574 00:37:21,560 --> 00:37:24,840 we're going to find the most commonly occurring pair. 575 00:37:24,840 --> 00:37:27,840 We're going to mint a new token integer for it. 576 00:37:27,840 --> 00:37:29,360 So i here will start at zero, 577 00:37:29,360 --> 00:37:31,840 so the new token IDs are going to start at 256. 578 00:37:31,840 --> 00:37:33,520 We're going to print that we're merging it. 579 00:37:33,520 --> 00:37:36,560 And we're going to replace all the occurrences of that pair 580 00:37:36,560 --> 00:37:39,440 with that newly minted token. 581 00:37:39,440 --> 00:37:42,900 And we're going to record that this pair of integers 582 00:37:42,900 --> 00:37:45,340 merged into this new integer. 583 00:37:46,400 --> 00:37:49,860 So running this gives us the following output. 584 00:37:52,080 --> 00:37:54,060 So we did 20 merges. 585 00:37:54,060 --> 00:37:57,180 And for example, the first merge was exactly as before. 586 00:37:57,180 --> 00:38:02,180 The 101, 32 tokens merging into a new token, 256. 587 00:38:02,900 --> 00:38:06,460 Now keep in mind that the individual tokens 101 and 32 588 00:38:06,460 --> 00:38:09,240 can still occur in the sequence after merging. 589 00:38:09,240 --> 00:38:11,820 It's only when they occur exactly consecutively 590 00:38:11,820 --> 00:38:13,700 that that becomes 256 now. 591 00:38:15,700 --> 00:38:17,640 And in particular, the other thing to notice here 592 00:38:17,640 --> 00:38:20,920 is that the token 256, which is the newly minted token, 593 00:38:20,920 --> 00:38:22,860 is also eligible for merging. 594 00:38:22,860 --> 00:38:25,660 So here on the bottom, the 20th merge was a merge of 256 595 00:38:25,660 --> 00:38:27,180 and 259 596 00:38:27,180 --> 00:38:29,800 becoming 275. 597 00:38:29,800 --> 00:38:32,180 So every time we replace these tokens, 598 00:38:32,180 --> 00:38:33,680 they become eligible for merging 599 00:38:33,680 --> 00:38:35,820 in the next round of the iteration. 600 00:38:35,820 --> 00:38:38,440 So that's why we're building up a small sort of binary forest 601 00:38:38,440 --> 00:38:40,240 instead of a single individual tree. 602 00:38:41,260 --> 00:38:43,020 One thing we can take a look at as well 603 00:38:43,020 --> 00:38:44,740 is we can take a look at the compression ratio 604 00:38:44,740 --> 00:38:46,120 that we've achieved. 605 00:38:46,120 --> 00:38:49,360 So in particular, we started off with this tokens list.
606 00:38:50,240 --> 00:38:53,480 So we started off with 24,000 bytes. 607 00:38:53,480 --> 00:38:56,940 And after merging 20 times, we now have, you know, 608 00:38:56,940 --> 00:39:01,240 we now have only 19,000 tokens. 609 00:39:01,240 --> 00:39:02,700 And so therefore, the compression ratio, 610 00:39:02,700 --> 00:39:06,240 simply just dividing the two, is roughly 1.27. 611 00:39:06,240 --> 00:39:08,380 So that's the amount of compression we were able to achieve 612 00:39:08,380 --> 00:39:11,120 of this text with only 20 merges. 613 00:39:12,180 --> 00:39:15,380 And of course, the more vocabulary elements you add, 614 00:39:15,380 --> 00:39:17,700 the greater the compression ratio here would be. 615 00:39:20,320 --> 00:39:24,160 Finally, so that's kind of like the training 616 00:39:24,160 --> 00:39:25,760 of the tokenizer, if you will. 617 00:39:25,760 --> 00:39:26,940 Now, one point that I wanted 618 00:39:26,940 --> 00:39:29,360 to make is that, and maybe this is a diagram 619 00:39:29,360 --> 00:39:32,320 that can help kind of illustrate, 620 00:39:32,320 --> 00:39:34,780 is that the tokenizer is a completely separate object 621 00:39:34,780 --> 00:39:36,780 from the large language model itself. 622 00:39:36,780 --> 00:39:37,920 So everything in this lecture, 623 00:39:37,920 --> 00:39:40,080 we're not really touching the LLM itself. 624 00:39:40,080 --> 00:39:41,660 We're just training the tokenizer. 625 00:39:41,660 --> 00:39:44,820 This is a completely separate pre-processing stage usually. 626 00:39:44,820 --> 00:39:47,440 So the tokenizer will have its own training set, 627 00:39:47,440 --> 00:39:48,640 just like a large language model 628 00:39:48,640 --> 00:39:51,540 has a potentially different training set. 629 00:39:51,540 --> 00:39:53,300 So the tokenizer has a training set of documents 630 00:39:53,300 --> 00:39:55,720 on which you're going to train the tokenizer. 631 00:39:55,720 --> 00:39:56,880 And then, 632 00:39:56,940 --> 00:39:59,400 we're performing the byte-pair encoding algorithm, 633 00:39:59,400 --> 00:40:00,480 as we saw above, 634 00:40:00,480 --> 00:40:03,620 to train the vocabulary of this tokenizer. 635 00:40:03,620 --> 00:40:05,020 So it has its own training set. 636 00:40:05,020 --> 00:40:06,340 It has a pre-processing stage 637 00:40:06,340 --> 00:40:08,780 that you would run a single time in the beginning. 638 00:40:09,880 --> 00:40:11,700 And the tokenizer is trained 639 00:40:11,700 --> 00:40:13,800 using byte-pair encoding algorithm. 640 00:40:13,800 --> 00:40:15,120 Once you have the tokenizer, 641 00:40:15,120 --> 00:40:17,160 once it's trained and you have the vocabulary 642 00:40:17,160 --> 00:40:19,180 and you have the merges, 643 00:40:19,180 --> 00:40:22,260 we can do both encoding and decoding. 644 00:40:22,260 --> 00:40:24,300 So these two arrows here. 645 00:40:24,300 --> 00:40:26,420 So the tokenizer is a translation layer 646 00:40:26,420 --> 00:40:28,180 between raw text, 647 00:40:28,180 --> 00:40:31,800 which is as we saw the sequence of Unicode code points. 648 00:40:31,800 --> 00:40:35,500 It can take raw text and turn it into a token sequence 649 00:40:35,500 --> 00:40:36,340 and vice versa, 650 00:40:36,340 --> 00:40:38,060 it can take a token sequence 651 00:40:38,060 --> 00:40:40,100 and translate it back into raw text. 
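As a small aside, the compression ratio mentioned a moment ago is just the length of the raw byte sequence divided by the length of the merged sequence (the numbers here are the approximate ones quoted above):

    print("bytes length:", len(tokens))   # the raw UTF-8 byte sequence, roughly 24,000 here
    print("ids length:", len(ids))        # after the 20 merges, roughly 19,000 here
    print(f"compression ratio: {len(tokens) / len(ids):.2f}X")   # roughly 1.27X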
652 00:40:41,860 --> 00:40:44,100 So now that we have trained the tokenizer 653 00:40:44,100 --> 00:40:45,900 and we have these merges, 654 00:40:45,900 --> 00:40:48,080 we are going to turn to how we can do the encoding 655 00:40:48,080 --> 00:40:49,360 and the decoding step. 656 00:40:49,360 --> 00:40:51,980 If you give me text, here are the tokens and vice versa. 657 00:40:51,980 --> 00:40:54,200 If you give me tokens, here's the text. 658 00:40:54,200 --> 00:40:55,040 Once we have that, 659 00:40:55,040 --> 00:40:55,960 we can translate between these two. 660 00:40:55,960 --> 00:40:59,140 And then the language model is going to be trained 661 00:40:59,140 --> 00:41:01,080 as a step two afterwards. 662 00:41:01,080 --> 00:41:05,240 And typically in a sort of a state of the art application, 663 00:41:05,240 --> 00:41:06,640 you might take all of your training data 664 00:41:06,640 --> 00:41:07,760 for the language model 665 00:41:07,760 --> 00:41:09,420 and you might run it through the tokenizer 666 00:41:09,420 --> 00:41:11,260 and sort of translate everything 667 00:41:11,260 --> 00:41:13,020 into a massive token sequence. 668 00:41:13,020 --> 00:41:14,640 And then you can throw away the raw text. 669 00:41:14,640 --> 00:41:17,000 You're just left with the tokens themselves. 670 00:41:17,000 --> 00:41:19,100 And those are stored on disk. 671 00:41:19,100 --> 00:41:21,420 And that is what the large language model is actually reading 672 00:41:21,420 --> 00:41:22,920 when it's training on them. 673 00:41:22,920 --> 00:41:24,140 So that's one approach that you can take 674 00:41:24,140 --> 00:41:25,880 as a single massive pre-processing state. 675 00:41:25,960 --> 00:41:30,220 So yeah, basically, 676 00:41:30,220 --> 00:41:31,800 I think the most important thing I want to get across 677 00:41:31,800 --> 00:41:33,440 is that this is completely separate stage. 678 00:41:33,440 --> 00:41:36,340 It usually has its own entire training set. 679 00:41:36,340 --> 00:41:38,460 You may want to have those training sets be different 680 00:41:38,460 --> 00:41:40,720 between the tokenizer and the large language model. 681 00:41:40,720 --> 00:41:43,160 So for example, when you're training the tokenizer, 682 00:41:43,160 --> 00:41:44,000 as I mentioned, 683 00:41:44,000 --> 00:41:46,420 we don't just care about the performance of English text. 684 00:41:46,420 --> 00:41:49,460 We care about multi, many different languages. 685 00:41:49,460 --> 00:41:51,640 And we also care about code or not code. 686 00:41:51,640 --> 00:41:54,600 So you may want to look into different kinds of mixtures 687 00:41:54,600 --> 00:41:55,960 of different kinds of languages 688 00:41:55,960 --> 00:41:58,760 and different amounts of code and things like that, 689 00:41:58,760 --> 00:42:01,340 because the amount of different language 690 00:42:01,340 --> 00:42:03,720 that you have in your tokenizer training set 691 00:42:03,720 --> 00:42:07,280 will determine how many merges of it there will be. 692 00:42:07,280 --> 00:42:09,420 And therefore that determines the density 693 00:42:09,420 --> 00:42:14,420 with which this type of data is sort of has 694 00:42:14,500 --> 00:42:16,220 in the token space. 
695 00:42:16,220 --> 00:42:18,800 And so roughly speaking, intuitively, 696 00:42:18,800 --> 00:42:20,000 if you add some amount of data, 697 00:42:20,000 --> 00:42:21,720 like say you have a ton of Japanese data 698 00:42:21,720 --> 00:42:24,060 in your tokenizer training set, 701 00:42:24,300 --> 00:42:26,800 then that means that more Japanese tokens will get merged, 702 00:42:26,800 --> 00:42:29,900 and therefore Japanese will have shorter sequences. 703 00:42:29,900 --> 00:42:31,120 And that's going to be beneficial 704 00:42:31,120 --> 00:42:32,340 for the large language model, 705 00:42:32,340 --> 00:42:34,260 which has a finite context length 706 00:42:34,260 --> 00:42:37,780 on which it can work in the token space. 707 00:42:37,780 --> 00:42:39,300 So hopefully that makes sense. 708 00:42:39,300 --> 00:42:42,080 So we're now going to turn to encoding and decoding 709 00:42:42,080 --> 00:42:44,000 now that we have trained a tokenizer. 710 00:42:44,000 --> 00:42:46,120 So we have our merges, 711 00:42:46,120 --> 00:42:48,300 and now how do we do encoding and decoding? 712 00:42:48,300 --> 00:42:50,080 Okay, so let's begin with decoding, 713 00:42:50,080 --> 00:42:51,840 which is this arrow over here. 714 00:42:51,840 --> 00:42:53,780 So given a token sequence, 715 00:42:53,780 --> 00:42:54,980 let's go through the tokenizer 716 00:42:54,980 --> 00:42:57,200 to get back a Python string object, 717 00:42:57,200 --> 00:42:59,100 so the raw text. 718 00:42:59,100 --> 00:43:01,820 So this is the function that we'd like to implement. 719 00:43:01,820 --> 00:43:03,240 We're given the list of integers, 720 00:43:03,240 --> 00:43:04,960 and we want to return a Python string. 721 00:43:04,960 --> 00:43:07,400 If you'd like, try to implement this function yourself. 722 00:43:07,400 --> 00:43:08,560 It's a fun exercise. 723 00:43:08,560 --> 00:43:12,240 Otherwise, I'm going to start pasting in my own solution. 724 00:43:12,240 --> 00:43:15,080 So there are many different ways to do it. 725 00:43:15,080 --> 00:43:16,400 Here's one way. 726 00:43:16,400 --> 00:43:19,440 I will create a kind of preprocessing variable 727 00:43:19,440 --> 00:43:20,760 that I will call vocab. 731 00:43:23,780 --> 00:43:26,020 And vocab is a mapping or dictionary in Python 732 00:43:26,020 --> 00:43:31,020 from the token ID to the bytes object for that token. 733 00:43:31,560 --> 00:43:35,880 So we begin with the raw bytes for tokens from zero to 255. 734 00:43:35,880 --> 00:43:38,760 And then we go in order of all the merges, 735 00:43:38,760 --> 00:43:41,720 and we sort of populate this vocab dictionary 736 00:43:41,720 --> 00:43:43,360 by doing an addition here. 737 00:43:43,360 --> 00:43:46,860 So this is basically the bytes representation 738 00:43:46,860 --> 00:43:49,940 of the first child, followed by the second one. 739 00:43:49,940 --> 00:43:51,740 And remember, these are bytes objects. 740 00:43:51,740 --> 00:43:53,620 So this addition here is an addition 741 00:43:53,620 --> 00:43:57,080 of two bytes objects, just concatenation. 742 00:43:57,080 --> 00:43:58,440 So that's what we get here. 
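A sketch of that vocab construction, following the description here (raw bytes first, then the merges in insertion order):

    # token id -> bytes object for that token
    vocab = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]   # bytes concatenation of the two children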
743 00:43:59,700 --> 00:44:01,540 One tricky thing to be careful with, by the way, 744 00:44:01,540 --> 00:44:04,340 is that I'm iterating a dictionary in Python 745 00:44:04,340 --> 00:44:05,900 using a dot items. 746 00:44:05,900 --> 00:44:09,420 And it really matters that this runs in the order 747 00:44:09,420 --> 00:44:13,220 in which we inserted items into the merges dictionary. 748 00:44:13,220 --> 00:44:15,120 Luckily, starting with Python 3.7, 749 00:44:15,120 --> 00:44:16,640 this is guaranteed to be the case. 750 00:44:16,640 --> 00:44:18,280 But before Python 3.7, 751 00:44:18,280 --> 00:44:20,140 this iteration may have been out of order 752 00:44:20,140 --> 00:44:22,880 with respect to how we inserted elements into merges. 753 00:44:22,880 --> 00:44:23,540 And this may not have been the case. 754 00:44:24,440 --> 00:44:27,120 But we are using modern Python, so we're okay. 755 00:44:29,120 --> 00:44:30,400 And then here, given the IDs, 756 00:44:30,400 --> 00:44:33,520 the first thing we're going to do is get the tokens. 757 00:44:33,520 --> 00:44:37,840 So the way I implemented this here is I'm taking, 758 00:44:37,840 --> 00:44:39,880 I'm iterating over all the IDs. 759 00:44:39,880 --> 00:44:42,080 I'm using vocab to look up their bytes. 760 00:44:42,080 --> 00:44:44,660 And then here, this is one way in Python 761 00:44:44,660 --> 00:44:48,940 to concatenate all these bytes together to create our tokens. 762 00:44:48,940 --> 00:44:52,000 And then these tokens here at this point are raw bytes. 763 00:44:52,000 --> 00:44:59,180 bytes so I have to decode using UTF-8 now back into Python strings. So 764 00:44:59,180 --> 00:45:03,240 previously we called dot encode on a string object to get the bytes and now 765 00:45:03,240 --> 00:45:07,680 we're doing it opposite. We're taking the bytes and calling a decode on the bytes 766 00:45:07,680 --> 00:45:15,920 object to get a string in Python and then we can return text. So this is how 767 00:45:15,920 --> 00:45:21,700 we can do it. Now this actually has a issue in the way I implemented it. In 768 00:45:21,700 --> 00:45:25,660 this could actually throw an error. So try to figure out why this code 769 00:45:25,660 --> 00:45:31,660 could actually result in an error if we plug in some sequence of IDs that is 770 00:45:31,660 --> 00:45:36,880 unlucky. So let me demonstrate the issue. When I try to decode just something like 771 00:45:36,880 --> 00:45:43,720 97, I am going to get a letter A here back so nothing too crazy happening. But 772 00:45:43,720 --> 00:45:50,220 when I try to decode 128 as a single element, the token 128 is what in string 773 00:45:50,220 --> 00:45:51,660 or in Python. 774 00:45:51,700 --> 00:45:58,240 So this is an object, Unicode decoder. UTF-8 can't decode byte 0x80 which is 775 00:45:58,240 --> 00:46:03,340 this in hex in position 0 invalid start byte. What does that mean? Well to 776 00:46:03,340 --> 00:46:07,120 understand what this means we have to go back to our UTF-8 page that I briefly 777 00:46:07,120 --> 00:46:12,280 showed earlier and this is Wikipedia UTF-8 and basically there's a specific 778 00:46:12,280 --> 00:46:17,920 schema that UTF-8 bytes take. So in particular if you have a multi byte 779 00:46:17,920 --> 00:46:21,640 object for some of the Unicode characters they have to have the 780 00:46:21,700 --> 00:46:26,500 special sort of envelope in how the encoding works. 
And so what's happening 781 00:46:26,500 --> 00:46:33,220 here is that invalid start byte. That's because 128, the binary representation of 782 00:46:33,220 --> 00:46:39,580 it, is 1 followed by all zeros. So we have 1 and then all 0 and we see here that 783 00:46:39,580 --> 00:46:43,000 that doesn't conform to the format because 1 followed by all 0 just doesn't 784 00:46:43,000 --> 00:46:48,100 fit any of these rules so to speak. So it's an invalid start byte, which is 785 00:46:48,100 --> 00:46:51,600 this one here. This 1 must have a 1 following it 786 00:46:51,700 --> 00:46:57,220 and then a 0 following it, and then the content of your Unicode character goes in the x's here. So 787 00:46:57,220 --> 00:47:01,300 basically we don't exactly follow the UTF-8 standard and this cannot be 788 00:47:01,300 --> 00:47:10,760 decoded. And so the way to fix this is to use the errors= argument of the bytes.decode 789 00:47:10,760 --> 00:47:16,500 function of Python. And by default errors is strict so we will throw an error if 790 00:47:16,500 --> 00:47:21,680 it's not a valid UTF-8 byte encoding. But there are many different things that 791 00:47:21,700 --> 00:47:25,240 you could put here for error handling. This is the full list of all the errors 792 00:47:25,240 --> 00:47:28,820 that you can use. And in particular, instead of strict, let's change it to 793 00:47:28,820 --> 00:47:35,560 replace. And that will replace with this special marker. This is the replacement 794 00:47:35,560 --> 00:47:44,200 character. So errors equals replace. And now we just get that character back. So 795 00:47:44,200 --> 00:47:50,640 basically not every single byte sequence is valid UTF-8. And if it happens that 796 00:47:50,640 --> 00:47:51,680 your large language 797 00:47:51,700 --> 00:47:57,160 model for example predicts your tokens in a bad manner, then they might not fall 798 00:47:57,160 --> 00:48:03,220 into valid UTF-8. And then we won't be able to decode them. So the standard 799 00:48:03,220 --> 00:48:07,720 practice is to basically use errors equals replace. And this is what you will 800 00:48:07,720 --> 00:48:12,700 also find in the OpenAI code that they released as well. But basically 801 00:48:12,700 --> 00:48:15,480 whenever you see this kind of a character in your output, in that case 802 00:48:15,480 --> 00:48:21,080 something went wrong and the LLM output was not a valid sequence of tokens. 803 00:48:22,420 --> 00:48:27,620 OK. And now we're going to go the other way. So we are going to implement this arrow right here, 804 00:48:27,620 --> 00:48:30,960 where we are going to be given a string and we want to encode it into tokens. 805 00:48:32,100 --> 00:48:35,060 So this is the signature of the function that we're interested in 806 00:48:35,060 --> 00:48:41,820 and this should basically return a list of integers, the tokens. So again try 807 00:48:41,820 --> 00:48:45,840 to maybe implement this yourself if you'd like a fun exercise, and pause here. 808 00:48:45,840 --> 00:48:49,920 Otherwise I'm going to start putting in my solution. So again there are many ways 809 00:48:49,920 --> 00:48:50,780 to do this. 810 00:48:50,780 --> 00:48:51,420 So 811 00:48:51,700 --> 00:48:56,820 this is one of the ways that I came up with. 812 00:48:56,820 --> 00:48:59,380 The first thing we're going to do is we are going 813 00:48:59,380 --> 00:49:04,940 to take our text, encode it into UTF-8 to get the raw bytes. 
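Before continuing with the encode implementation, here is a consolidated sketch of the decode function as just described, including the errors="replace" fix:

    def decode(ids):
        # given a list of integer token ids, return a Python string
        tokens = b"".join(vocab[idx] for idx in ids)
        text = tokens.decode("utf-8", errors="replace")  # invalid byte sequences become the replacement character
        return text

    print(decode([97]))    # 'a'
    print(decode([128]))   # the replacement character, instead of a UnicodeDecodeError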
814 00:49:04,940 --> 00:49:06,940 Then as before, we're going to call list on 815 00:49:06,940 --> 00:49:11,480 the bytes object to get a list of integers of those bytes. 816 00:49:11,480 --> 00:49:13,200 Those are the starting tokens. 817 00:49:13,200 --> 00:49:15,300 Those are the raw bytes of our sequence. 818 00:49:15,300 --> 00:49:18,800 But now, of course, according to the merges dictionary above, 819 00:49:18,800 --> 00:49:21,360 and recall this was the merges, 820 00:49:21,360 --> 00:49:25,680 some of the bytes may be merged according to this lookup. 821 00:49:25,680 --> 00:49:27,340 In addition to that, remember that 822 00:49:27,340 --> 00:49:29,280 the merges was built from top to bottom. 823 00:49:29,280 --> 00:49:32,520 This is the order in which we inserted stuff into merges. 824 00:49:32,520 --> 00:49:34,980 We prefer to do all these merges in 825 00:49:34,980 --> 00:49:37,580 the beginning before we do these merges later. 826 00:49:37,580 --> 00:49:39,540 Because for example, 827 00:49:39,540 --> 00:49:43,980 this merge over here relies on the 256 which got merged here. 828 00:49:43,980 --> 00:49:47,140 We have to go in the order from top to bottom 829 00:49:47,140 --> 00:49:48,760 if we are going to be merging anything. 830 00:49:48,760 --> 00:49:52,080 Now, we expect to be doing a few merges, 831 00:49:52,080 --> 00:49:54,720 so we're going to be doing while true. 832 00:49:55,320 --> 00:49:59,160 Now, we want to find a pair of bytes that is 833 00:49:59,160 --> 00:50:03,060 consecutive that we are allowed to merge according to this. 834 00:50:03,060 --> 00:50:06,120 In order to reuse some of the functionality that we've already written, 835 00:50:06,120 --> 00:50:09,540 I'm going to reuse the function getStats. 836 00:50:09,540 --> 00:50:14,780 Recall that getStats will basically count up how many times 837 00:50:14,780 --> 00:50:18,200 every single pair occurs in our sequence of tokens. 838 00:50:18,200 --> 00:50:20,140 Return that as a dictionary. 839 00:50:20,140 --> 00:50:23,640 The dictionary was a mapping from 840 00:50:23,640 --> 00:50:26,260 all the different byte pairs 841 00:50:26,260 --> 00:50:28,700 to the number of times that they occur. 842 00:50:28,700 --> 00:50:33,000 At this point, we don't actually care how many times they occur in the sequence. 843 00:50:33,000 --> 00:50:36,600 We only care what the raw pairs are in that sequence. 844 00:50:36,600 --> 00:50:39,700 I'm only going to be using basically the keys of the dictionary. 845 00:50:39,700 --> 00:50:41,140 I only care about the set of 846 00:50:41,140 --> 00:50:44,200 possible merge candidates, if that makes sense. 847 00:50:44,200 --> 00:50:48,080 Now, we want to identify the pair that we're going to be merging at this stage, 848 00:50:48,080 --> 00:50:49,320 of the loop. 849 00:50:49,320 --> 00:50:50,120 So, what do we want? 850 00:50:50,120 --> 00:50:54,740 We want to find the pair, or like a key inside stats, 851 00:50:54,740 --> 00:50:59,360 that has the lowest index in the merges dictionary, 852 00:50:59,360 --> 00:51:04,040 because we want to do all the early merges before we work our way to the late merges. 853 00:51:04,040 --> 00:51:06,180 So, again, there are many different ways to implement this, 854 00:51:06,180 --> 00:51:12,280 but I'm going to do something a little bit fancy here. 855 00:51:12,280 --> 00:51:15,920 So, I'm going to be using the min over an iterator. 
856 00:51:15,920 --> 00:51:18,080 In Python, when you call min on an iterator, 857 00:51:18,080 --> 00:51:19,880 and stats here is a dictionary, 858 00:51:19,880 --> 00:51:23,440 we're going to be iterating the keys of this dictionary in Python. 859 00:51:23,440 --> 00:51:28,380 So, we're looking at all the pairs inside stats, 860 00:51:28,380 --> 00:51:30,440 which are all the consecutive pairs. 861 00:51:30,440 --> 00:51:34,180 And we're going to be taking the consecutive pair inside tokens 862 00:51:34,180 --> 00:51:37,040 that has the minimum, what. 863 00:51:37,040 --> 00:51:38,920 The min takes a key, 864 00:51:38,920 --> 00:51:41,880 which gives us the function that is going to return a value 865 00:51:41,880 --> 00:51:44,080 over which we're going to do the min. 866 00:51:44,080 --> 00:51:47,320 And the one we care about is we care about taking merges, 867 00:51:47,320 --> 00:51:53,720 and basically getting that pair's index. 868 00:51:53,720 --> 00:51:58,180 So, basically, for any pair inside stats, 869 00:51:58,180 --> 00:52:02,680 we are going to be looking into merges at what index it has, 870 00:52:02,680 --> 00:52:05,760 and we want to get the pair with the min number. 871 00:52:05,760 --> 00:52:08,280 So, as an example, if there's a pair 101 and 32, 872 00:52:08,280 --> 00:52:10,560 we definitely want to get that pair. 873 00:52:10,560 --> 00:52:12,520 We want to identify it here and return it, 874 00:52:12,520 --> 00:52:16,780 and pair would become 101, 32 if it occurs. 875 00:52:16,780 --> 00:52:20,780 And the reason that I'm putting a float inf here as a fallback 876 00:52:20,780 --> 00:52:23,820 is that in the get function, when we call, 877 00:52:23,820 --> 00:52:28,340 when we basically consider a pair that doesn't occur in the merges, 878 00:52:28,340 --> 00:52:30,680 then that pair is not eligible to be merged, right? 879 00:52:30,680 --> 00:52:32,640 So, if in the token sequence, 880 00:52:32,640 --> 00:52:35,120 there's some pair that is not a merging pair, 881 00:52:35,120 --> 00:52:36,380 it cannot be merged, 882 00:52:36,380 --> 00:52:38,580 then it doesn't actually occur here, 883 00:52:38,580 --> 00:52:40,180 and it doesn't have an index, 884 00:52:40,180 --> 00:52:41,780 and it cannot be merged, 885 00:52:41,780 --> 00:52:44,080 which we will denote as float inf. 886 00:52:44,080 --> 00:52:46,480 And the reason infinity is nice here is because for sure, 887 00:52:46,480 --> 00:52:48,720 we're guaranteed that it's not going to participate 888 00:52:48,720 --> 00:52:51,840 in the list of candidates when we do the min. 889 00:52:51,840 --> 00:52:55,080 So, this is one way to do it. 890 00:52:55,080 --> 00:52:56,380 So, basically, long story short, 891 00:52:56,380 --> 00:53:00,720 this returns the most eligible merging candidate pair 892 00:53:00,720 --> 00:53:02,340 that occurs in the tokens. 893 00:53:02,340 --> 00:53:05,140 Now, one thing to be careful with here is 894 00:53:05,140 --> 00:53:09,520 this function here might fail in the following way. 895 00:53:09,520 --> 00:53:11,180 If there is nothing to merge, 896 00:53:11,180 --> 00:53:16,180 then there's nothing in merges that satisfies this function. 897 00:53:16,480 --> 00:53:18,720 If there is nothing that is satisfied anymore, 898 00:53:18,720 --> 00:53:19,820 there's nothing to merge. 899 00:53:19,820 --> 00:53:22,180 Everything just returns float infs, 900 00:53:22,180 --> 00:53:23,980 and then the pair, I think, 901 00:53:23,980 --> 00:53:27,980 will just become the very first element of stats. 
902 00:53:27,980 --> 00:53:29,780 But this pair is not actually a mergeable pair. 903 00:53:29,780 --> 00:53:33,480 It just becomes the first pair inside stats arbitrarily 904 00:53:33,480 --> 00:53:36,980 because all of these pairs evaluate to float inf 905 00:53:36,980 --> 00:53:38,720 for the merging criterion. 906 00:53:38,720 --> 00:53:41,320 So, basically, it could be that this doesn't succeed 907 00:53:41,320 --> 00:53:42,620 because there's no more merging pairs. 908 00:53:42,620 --> 00:53:45,780 So, if this pair is not in merges that was returned, 909 00:53:45,780 --> 00:53:46,320 then this is a failure. 910 00:53:46,480 --> 00:53:48,320 So, this signals for us that, actually, 911 00:53:48,320 --> 00:53:49,720 there was nothing to merge. 912 00:53:49,720 --> 00:53:51,720 No single pair can be merged anymore. 913 00:53:51,720 --> 00:53:55,480 In that case, we will break out. 914 00:53:55,480 --> 00:53:59,320 Nothing else can be merged. 915 00:53:59,320 --> 00:54:00,680 You might come up with a different implementation, 916 00:54:00,680 --> 00:54:00,980 by the way. 917 00:54:00,980 --> 00:54:05,480 This is kind of like really trying hard in Python. 918 00:54:05,480 --> 00:54:07,240 But really, we're just trying to find a pair 919 00:54:07,240 --> 00:54:10,720 that can be merged with a lowest index here. 920 00:54:10,720 --> 00:54:15,520 Now, if we did find a pair that is inside merges 921 00:54:15,520 --> 00:54:15,780 with the lowest index, 922 00:54:15,780 --> 00:54:17,680 then we can merge it. 923 00:54:17,680 --> 00:54:22,420 So, we're going to look into the mergers dictionary 924 00:54:22,420 --> 00:54:25,240 for that pair to look up the index, 925 00:54:25,240 --> 00:54:28,620 and we're going to now merge into that index. 926 00:54:28,620 --> 00:54:30,220 So, we're going to do tokens equals, 927 00:54:30,220 --> 00:54:34,440 and we're going to replace the original tokens. 928 00:54:34,440 --> 00:54:36,620 We're going to be replacing the pair pair, 929 00:54:36,620 --> 00:54:39,020 and we're going to be replacing it with index IDX. 930 00:54:39,020 --> 00:54:41,580 And this returns a new list of tokens 931 00:54:41,580 --> 00:54:44,440 where every occurrence of pair is replaced with IDX. 932 00:54:44,440 --> 00:54:45,620 So, we're doing a merge. 933 00:54:45,780 --> 00:54:47,540 And we're going to be continuing this 934 00:54:47,540 --> 00:54:49,320 until eventually nothing can be merged. 935 00:54:49,320 --> 00:54:51,280 We'll come out here and we'll break out. 936 00:54:51,280 --> 00:54:54,180 And here, we just return tokens. 937 00:54:54,180 --> 00:54:56,780 And so, that's the implementation, I think. 938 00:54:56,780 --> 00:54:58,140 So, hopefully, this runs. 939 00:54:58,140 --> 00:55:00,980 Okay, cool. 940 00:55:00,980 --> 00:55:02,880 Yeah, and this looks reasonable. 941 00:55:02,880 --> 00:55:05,520 So, for example, 32 is a space in ASCII. 942 00:55:05,520 --> 00:55:08,380 So, that's here. 943 00:55:08,380 --> 00:55:09,920 So, this looks like it worked. 944 00:55:09,920 --> 00:55:10,680 Great. 945 00:55:10,680 --> 00:55:13,440 Okay, so let's wrap up this section of the video, at least. 946 00:55:13,440 --> 00:55:15,680 I wanted to point out that this is not quite the right implementation. 947 00:55:15,680 --> 00:55:18,680 Just yet, because we are leaving out a special case. 948 00:55:18,680 --> 00:55:21,480 So, in particular, if we try to do this, 949 00:55:21,480 --> 00:55:23,180 this would give us an error. 
950 00:55:23,180 --> 00:55:26,280 And the issue is that if we only have a single character 951 00:55:26,280 --> 00:55:28,880 or an empty string, then stats is empty. 952 00:55:28,880 --> 00:55:30,920 And that causes an issue inside min. 953 00:55:30,920 --> 00:55:36,080 So, one way to fight this is if len of tokens is at least two. 954 00:55:36,080 --> 00:55:38,720 Because if it's less than two, it's just a single token or no tokens, 955 00:55:38,720 --> 00:55:40,880 then let's just, there's nothing to merge. 956 00:55:40,880 --> 00:55:42,320 So, we just return. 957 00:55:42,320 --> 00:55:45,380 So, that would fix that case. 958 00:55:45,380 --> 00:55:46,280 Okay. 959 00:55:46,280 --> 00:55:49,880 And then second, I have a few test cases here for us as well. 960 00:55:49,880 --> 00:55:55,080 So, first, let's make sure about, or let's note the following. 961 00:55:55,080 --> 00:55:58,680 If we take a string and we try to encode it and then decode it back, 962 00:55:58,680 --> 00:56:00,980 you'd expect to get the same string back, right? 963 00:56:00,980 --> 00:56:05,580 Is that true for all strings? 964 00:56:05,580 --> 00:56:07,380 So, I think, so here it is the case. 965 00:56:07,380 --> 00:56:11,080 And I think in general, this is probably the case. 966 00:56:11,080 --> 00:56:15,180 But notice that going backwards is not, is not, you're not going to have an identity. 967 00:56:15,380 --> 00:56:16,280 Going backwards. 968 00:56:16,280 --> 00:56:24,380 Because as I mentioned, not all token sequences are valid UTF-8 sort of byte streams. 969 00:56:24,380 --> 00:56:28,480 And so, therefore, some of them can't even be decodable. 970 00:56:28,480 --> 00:56:31,080 So, this only goes in one direction. 971 00:56:31,080 --> 00:56:33,980 But for that one direction, we can check here. 972 00:56:33,980 --> 00:56:37,380 If we take the training text, which is the text that we trained the tokenizer on, 973 00:56:37,380 --> 00:56:41,680 we can make sure that when we encode and decode, we get the same thing back, which is true. 974 00:56:41,680 --> 00:56:43,280 And here I took some validation data. 975 00:56:43,280 --> 00:56:45,280 So, I went to, I think, this web page. 976 00:56:45,280 --> 00:56:46,880 And I grabbed some text. 977 00:56:46,880 --> 00:56:49,280 So, this is text that the tokenizer has not seen. 978 00:56:49,280 --> 00:56:52,580 And we can make sure that this also works. 979 00:56:52,580 --> 00:56:55,880 So, that gives us some confidence that this was correctly implemented. 980 00:56:55,880 --> 00:56:59,080 So, those are the basics of the byte pair encoding algorithm. 981 00:56:59,080 --> 00:57:03,580 We saw how we can take some training set, train a tokenizer. 982 00:57:03,580 --> 00:57:07,780 The parameters of this tokenizer really are just this dictionary of merges. 983 00:57:07,780 --> 00:57:12,480 And that basically creates the little binary forest on top of raw bytes. 984 00:57:12,480 --> 00:57:14,580 Once we have this, the merges table, 985 00:57:14,580 --> 00:57:18,780 we can both encode and decode between raw text and token sequences. 986 00:57:18,780 --> 00:57:22,180 So, that's the simplest setting of the tokenizer. 987 00:57:22,180 --> 00:57:25,180 What we're going to do now, though, is we're going to look at some of the state-of-the-art 988 00:57:25,180 --> 00:57:28,380 large language models and the kinds of tokenizers that they use. 989 00:57:28,380 --> 00:57:30,980 And we're going to see that this picture complexifies very quickly. 
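For completeness, here is a consolidated sketch of the encode function described above, with the guard for very short inputs folded into the loop condition:

    def encode(text):
        # given a string, return a list of integer token ids
        tokens = list(text.encode("utf-8"))
        while len(tokens) >= 2:
            stats = get_stats(tokens)
            # of all candidate pairs present, pick the one that was merged earliest during training
            pair = min(stats, key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break  # nothing left to merge
            idx = merges[pair]
            tokens = merge(tokens, pair, idx)
        return tokens

    # the round trip holds in the text -> tokens -> text direction
    print(decode(encode("hello world")) == "hello world")   # True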
990 00:57:30,980 --> 00:57:37,080 So, we're going to go through the details of this complexification one at a time. 991 00:57:37,080 --> 00:57:39,880 So, let's kick things off by looking at the GPT series. 992 00:57:39,880 --> 00:57:43,180 So, in particular, I have the GPT-2 paper here. 993 00:57:43,180 --> 00:57:47,780 And this paper is from 2019 or so, so five years ago. 994 00:57:47,780 --> 00:57:50,980 And let's scroll down to input representation. 995 00:57:50,980 --> 00:57:54,880 This is where they talk about the tokenizer that they're using for GPT-2. 996 00:57:54,880 --> 00:57:59,680 Now, this is all fairly readable, so I encourage you to pause and read this yourself. 997 00:57:59,680 --> 00:58:03,680 But this is where they motivate the use of the byte pair encoding algorithm 998 00:58:03,680 --> 00:58:08,380 on the byte level representation of UTF-8 encoding. 999 00:58:08,380 --> 00:58:12,780 So, this is where they motivate it, and they talk about the vocabulary sizes and everything. 1000 00:58:13,180 --> 00:58:16,080 Now, everything here is exactly as we've covered it so far, 1001 00:58:16,080 --> 00:58:18,780 but things start to depart around here. 1002 00:58:18,780 --> 00:58:23,580 So, what they mention is that they don't just apply the naive algorithm as we have done it. 1003 00:58:23,580 --> 00:58:26,280 And in particular, here's a motivating example. 1004 00:58:26,280 --> 00:58:28,480 Suppose that you have common words like dog. 1005 00:58:28,480 --> 00:58:32,680 What will happen is that dog, of course, occurs very frequently in the text, 1006 00:58:32,680 --> 00:58:36,180 and it occurs right next to all kinds of punctuation, as an example. 1007 00:58:36,180 --> 00:58:40,880 So, dog dot, dog exclamation mark, dog question mark, et cetera. 1008 00:58:40,880 --> 00:58:42,980 And naively, you might imagine that the BP algorithm 1009 00:58:42,980 --> 00:58:45,580 could merge these to be single tokens. 1010 00:58:45,580 --> 00:58:48,080 And then you end up with lots of tokens that are just like dog 1011 00:58:48,080 --> 00:58:50,080 with a slightly different punctuation. 1012 00:58:50,080 --> 00:58:52,480 And so, it feels like you're clustering things that shouldn't be clustered. 1013 00:58:52,480 --> 00:58:56,480 You're combining kind of semantics with punctuation. 1014 00:58:56,480 --> 00:58:58,780 And this feels suboptimal. 1015 00:58:58,780 --> 00:59:01,480 And indeed, they also say that this is suboptimal, 1016 00:59:01,480 --> 00:59:03,380 according to some of the experiments. 1017 00:59:03,380 --> 00:59:06,280 So, what they want to do is they want to top-down, in a manual way, 1018 00:59:06,280 --> 00:59:12,680 enforce that some types of characters should never be merged together. 1019 00:59:12,680 --> 00:59:14,780 So, they want to enforce these merging rules 1020 00:59:14,780 --> 00:59:17,680 on top of the byte pair encoding algorithm. 1021 00:59:17,680 --> 00:59:21,480 So, let's take a look at their code and see how they actually enforce this 1022 00:59:21,480 --> 00:59:24,280 and what kinds of mergers they actually do perform. 1023 00:59:24,280 --> 00:59:29,480 So, I have the tab open here for GPT-2 under OpenAI on GitHub. 1024 00:59:29,480 --> 00:59:33,980 And when we go to source, there is an encoder.py. 1025 00:59:33,980 --> 00:59:36,280 Now, I don't personally love that they call it encoder.py 1026 00:59:36,280 --> 00:59:38,080 because this is the tokenizer. 1027 00:59:38,080 --> 00:59:41,080 And the tokenizer can do both encode and decode. 
1028 00:59:41,080 --> 00:59:42,580 So, it feels kind of awkward to me that it's called that. 1029 00:59:42,680 --> 00:59:45,780 It's called encoder, but that is the tokenizer. 1030 00:59:45,780 --> 00:59:46,880 And there's a lot going on here, 1031 00:59:46,880 --> 00:59:49,580 and we're going to step through it in detail at one point. 1032 00:59:49,580 --> 00:59:53,380 For now, I just want to focus on this part here. 1033 00:59:53,380 --> 00:59:56,480 They create a regex pattern here that looks very complicated, 1034 00:59:56,480 --> 00:59:58,880 and we're going to go through it in a bit. 1035 00:59:58,880 --> 01:00:02,880 But this is the core part that allows them to enforce rules 1036 01:00:02,880 --> 01:00:07,280 for what parts of the text will never be merged for sure. 1037 01:00:07,280 --> 01:00:09,880 Now, notice that re.compile here is a little bit misleading 1038 01:00:09,880 --> 01:00:12,380 because we're not just doing import re, 1039 01:00:12,380 --> 01:00:15,680 we're doing import regex as re, 1040 01:00:15,680 --> 01:00:18,780 and regex is a Python package that you can install, 1041 01:00:18,780 --> 01:00:21,780 pip install regex, and it's basically an extension of re, 1042 01:00:21,780 --> 01:00:26,080 so it's a bit more powerful re. 1043 01:00:26,080 --> 01:00:29,680 So, let's take a look at this pattern and what it's doing 1044 01:00:29,680 --> 01:00:32,280 and why this is actually doing the separation 1045 01:00:32,280 --> 01:00:33,880 that they are looking for. 1046 01:00:33,880 --> 01:00:35,780 Okay, so I've copy pasted the pattern here 1047 01:00:35,780 --> 01:00:38,280 to our Jupyter notebook where we left off, 1048 01:00:38,280 --> 01:00:40,680 and let's take this pattern for a spin. 1049 01:00:40,680 --> 01:00:42,180 So, in the exact same way that 1050 01:00:42,380 --> 01:00:45,780 Jupyter code does, we're going to call an re.findall 1051 01:00:45,780 --> 01:00:48,180 for this pattern on any arbitrary string 1052 01:00:48,180 --> 01:00:49,480 that we are interested in. 1053 01:00:49,480 --> 01:00:53,680 So, this is the string that we want to encode into tokens 1054 01:00:53,680 --> 01:00:56,880 to feed into an LLM like GPT-2. 1055 01:00:56,880 --> 01:00:59,080 So, what exactly is this doing? 1056 01:00:59,080 --> 01:01:01,080 Well, re.findall will take this pattern 1057 01:01:01,080 --> 01:01:04,980 and try to match it against this string. 1058 01:01:04,980 --> 01:01:07,780 The way this works is that you are going from left to right 1059 01:01:07,780 --> 01:01:11,380 in the string, and you're trying to match the pattern. 1060 01:01:11,380 --> 01:01:12,280 And re.fall, 1061 01:01:12,380 --> 01:01:15,080 re.findall will get all the occurrences 1062 01:01:15,080 --> 01:01:17,380 and organize them into a list. 1063 01:01:17,380 --> 01:01:20,480 Now, when you look at this pattern, 1064 01:01:20,480 --> 01:01:23,880 first of all, notice that this is a raw string, 1065 01:01:23,880 --> 01:01:26,180 and then these are three double quotes 1066 01:01:26,180 --> 01:01:27,780 just to start the string. 1067 01:01:27,780 --> 01:01:29,380 So, really, the string itself, 1068 01:01:29,380 --> 01:01:32,380 this is the pattern itself, right? 1069 01:01:32,380 --> 01:01:35,280 And notice that it's made up of a lot of ors. 1070 01:01:35,280 --> 01:01:36,480 So, see these vertical bars? 1071 01:01:36,480 --> 01:01:39,380 Those are ors in regex. 
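This is the pattern in question as it appears in encoder.py, together with a findall call of the kind walked through next (the example string mirrors the walkthrough below):

    import regex as re   # note: the `regex` package (pip install regex), not the built-in `re`

    gpt2pat = re.compile(
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

    print(re.findall(gpt2pat, "Hello world how are you"))
    # ['Hello', ' world', ' how', ' are', ' you']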
1072 01:01:39,380 --> 01:01:41,380 And so, you go from left to right in this pattern 1073 01:01:41,380 --> 01:01:44,480 and try to match it against the string wherever you are. 1074 01:01:44,480 --> 01:01:47,980 So, we have hello, and we're going to try to match it. 1075 01:01:47,980 --> 01:01:49,480 Well, it's not apostrophe s. 1076 01:01:49,480 --> 01:01:52,480 It's not apostrophe t or any of these, 1077 01:01:52,480 --> 01:01:57,180 but it is an optional space followed by dash p of, 1078 01:01:57,180 --> 01:02:00,280 sorry, slash p of l one or more times. 1079 01:02:00,280 --> 01:02:01,880 What is slash p of l? 1080 01:02:01,880 --> 01:02:06,880 It is coming to some documentation that I found. 1081 01:02:06,880 --> 01:02:09,180 There might be other sources as well. 1082 01:02:09,180 --> 01:02:11,180 Slash p of l is a letter. 1083 01:02:11,180 --> 01:02:13,680 Any kind of letter from any language. 1084 01:02:13,680 --> 01:02:16,080 And hello is made up of letters. 1085 01:02:16,080 --> 01:02:18,380 H-E-L-L-O, et cetera. 1086 01:02:18,380 --> 01:02:21,480 So, optional space followed by a bunch of letters, 1087 01:02:21,480 --> 01:02:24,780 one or more letters, is going to match hello, 1088 01:02:24,780 --> 01:02:28,780 but then the match ends because a white space is not a letter. 1089 01:02:28,780 --> 01:02:33,280 So, from there on begins a new sort of attempt 1090 01:02:33,280 --> 01:02:35,880 to match against the string again. 1091 01:02:35,880 --> 01:02:39,180 And starting in here, we're going to skip over all of these again 1092 01:02:39,180 --> 01:02:40,980 until we get to the exact same point again. 1093 01:02:41,180 --> 01:02:43,580 And we see that there's an optional space. 1094 01:02:43,580 --> 01:02:46,180 This is the optional space followed by a bunch of letters, 1095 01:02:46,180 --> 01:02:47,180 one or more of them. 1096 01:02:47,180 --> 01:02:48,680 And so, that matches. 1097 01:02:48,680 --> 01:02:53,080 So, when we run this, we get a list of two elements, hello, 1098 01:02:53,080 --> 01:02:55,680 and then space world. 1099 01:02:55,680 --> 01:02:58,880 So, how are you if we add more letters? 1100 01:02:58,880 --> 01:03:01,180 We would just get them like this. 1101 01:03:01,180 --> 01:03:03,680 Now, what is this doing and why is this important? 1102 01:03:03,680 --> 01:03:08,380 We are taking our string and instead of directly encoding it 1103 01:03:08,380 --> 01:03:11,080 for tokenization, we are first splitting it. 1104 01:03:11,180 --> 01:03:13,980 And when you actually step through the code, 1105 01:03:13,980 --> 01:03:16,180 and we'll do that in a bit more detail, 1106 01:03:16,180 --> 01:03:20,680 what really it's doing on a high level is that it first splits your text 1107 01:03:20,680 --> 01:03:24,480 into a list of texts, just like this one. 1108 01:03:24,480 --> 01:03:27,480 And all these elements of this list are processed independently 1109 01:03:27,480 --> 01:03:29,080 by the tokenizer. 1110 01:03:29,080 --> 01:03:33,080 And all of the results of that processing are simply concatenated. 1111 01:03:33,080 --> 01:03:35,180 So, hello, world. 1112 01:03:35,180 --> 01:03:37,480 Oh, I missed how. 1113 01:03:37,480 --> 01:03:39,380 Hello, world, how are you? 1114 01:03:39,380 --> 01:03:41,180 We have five elements of a list. 1115 01:03:41,180 --> 01:03:48,080 All of these will independently go from text to a token sequence. 1116 01:03:48,080 --> 01:03:50,580 And then that token sequence is going to be concatenated. 
1117 01:03:50,580 --> 01:03:52,680 It's all going to be joined up. 1118 01:03:52,680 --> 01:03:57,380 And roughly speaking, what that does is you're only ever finding merges 1119 01:03:57,380 --> 01:03:59,280 between the elements of this list. 1120 01:03:59,280 --> 01:04:01,480 So, you can only ever consider merges within every one 1121 01:04:01,480 --> 01:04:04,080 of these elements individually. 1122 01:04:04,080 --> 01:04:07,880 And after you've done all the possible merging for all 1123 01:04:07,880 --> 01:04:10,180 of these elements individually, the results of all 1124 01:04:10,180 --> 01:04:11,080 that will be joined up. 1125 01:04:11,180 --> 01:04:19,180 So, basically, what you're doing effectively is you are never going 1126 01:04:19,180 --> 01:04:23,480 to be merging this E with this space because they are now parts 1127 01:04:23,480 --> 01:04:25,880 of the separate elements of this list. 1128 01:04:25,880 --> 01:04:31,080 And so, you are saying we are never going to merge E space 1129 01:04:31,080 --> 01:04:33,580 because we're breaking it up in this way. 1130 01:04:33,580 --> 01:04:36,580 So, basically, using this regex pattern to chunk 1131 01:04:36,580 --> 01:04:41,080 up the text is just one way of enforcing that some merges 1132 01:04:41,180 --> 01:04:42,380 are not to happen. 1133 01:04:42,380 --> 01:04:44,980 And we're going to go into more of this text and we'll see 1134 01:04:44,980 --> 01:04:47,180 that what this is trying to do on a high level is we're trying 1135 01:04:47,180 --> 01:04:50,180 to not merge across letters, across numbers, 1136 01:04:50,180 --> 01:04:52,680 across punctuation, and so on. 1137 01:04:52,680 --> 01:04:54,480 So, let's see in more detail how that works. 1138 01:04:54,480 --> 01:04:55,880 So, let's continue now. 1139 01:04:55,880 --> 01:04:59,680 We have slash P of N. If you go to the documentation, 1140 01:04:59,680 --> 01:05:04,380 slash P of N is any kind of numeric character in any script. 1141 01:05:04,380 --> 01:05:05,880 So, it's numbers. 1142 01:05:05,880 --> 01:05:07,880 So, we have an optional space followed by numbers 1143 01:05:07,880 --> 01:05:09,680 and those would be separated out. 1144 01:05:09,680 --> 01:05:11,180 So, letters and numbers are being separated out. 1145 01:05:11,180 --> 01:05:14,980 So, if I do hello world, one, two, three, how are you? 1146 01:05:14,980 --> 01:05:19,580 Then world will stop matching here because one is not a letter anymore. 1147 01:05:19,580 --> 01:05:22,480 But one is a number, so this group will match for that 1148 01:05:22,480 --> 01:05:26,780 and we'll get it as a separate entity. 1149 01:05:26,780 --> 01:05:28,380 Let's see how these apostrophes work. 1150 01:05:28,380 --> 01:05:36,680 So, here, if we have slash V or, I mean, apostrophe V as an example, 1151 01:05:36,680 --> 01:05:40,580 then apostrophe here is not a letter or a number. 1152 01:05:40,580 --> 01:05:45,880 So, hello will stop matching and then we will exactly match this with that. 1153 01:05:45,880 --> 01:05:49,180 So, that will come out as a separate thing. 1154 01:05:49,180 --> 01:05:51,580 So, why are they doing the apostrophes here? 1155 01:05:51,580 --> 01:05:54,780 Honestly, I think that these are just like very common apostrophes 1156 01:05:54,780 --> 01:05:57,980 that are used typically. 1157 01:05:57,980 --> 01:05:59,580 I don't love that they've done this 1158 01:05:59,580 --> 01:06:06,380 because let me show you what happens when you have some Unicode apostrophes. 
1159 01:06:06,380 --> 01:06:10,080 Like, for example, you can have, if you have house, 1160 01:06:10,080 --> 01:06:13,080 then this will be separated out because of this matching. 1161 01:06:13,080 --> 01:06:17,180 But if you use the Unicode apostrophe like this, 1162 01:06:17,180 --> 01:06:19,880 then suddenly this does not work. 1163 01:06:19,880 --> 01:06:23,680 And so, this apostrophe will actually become its own thing now. 1164 01:06:23,680 --> 01:06:28,180 And so, it's basically hard-coded for this specific kind of apostrophe 1165 01:06:28,180 --> 01:06:33,180 and otherwise they become completely separate tokens. 1166 01:06:33,180 --> 01:06:37,580 In addition to this, you can go to the GPT-2 docs 1167 01:06:37,580 --> 01:06:39,880 and here when they define the pattern, they say, 1168 01:06:40,080 --> 01:06:42,280 should have added re.ignorecase. 1169 01:06:42,280 --> 01:06:45,580 So, BP merges can happen for capitalized versions of contractions. 1170 01:06:45,580 --> 01:06:48,480 So, what they're pointing out is that you see how this is apostrophe 1171 01:06:48,480 --> 01:06:50,780 and then lowercase letters. 1172 01:06:50,780 --> 01:06:53,880 Well, because they didn't do re.ignorecase, 1173 01:06:53,880 --> 01:06:59,880 then these rules will not separate out the apostrophes if it's uppercase. 1174 01:06:59,880 --> 01:07:04,780 So, house would be like this. 1175 01:07:04,780 --> 01:07:09,880 But if I did house from uppercase, then notice, 1176 01:07:10,080 --> 01:07:13,280 suddenly the apostrophe comes by itself. 1177 01:07:13,280 --> 01:07:17,480 So, the tokenization will work differently in uppercase and lowercase, 1178 01:07:17,480 --> 01:07:19,880 inconsistently separating out these apostrophes. 1179 01:07:19,880 --> 01:07:23,880 So, it feels extremely gnarly and slightly gross. 1180 01:07:23,880 --> 01:07:25,780 But that's how that works. 1181 01:07:25,780 --> 01:07:27,280 Okay, so let's come back. 1182 01:07:27,280 --> 01:07:29,880 After trying to match a bunch of apostrophe expressions, 1183 01:07:29,880 --> 01:07:33,280 by the way, the other issue here is that these are quite language-specific probably. 1184 01:07:33,280 --> 01:07:37,080 So, I don't know that all the languages, for example, use or don't use apostrophes, 1185 01:07:37,080 --> 01:07:39,880 but that would be inconsistently tokenized as a result. 1186 01:07:40,080 --> 01:07:44,280 Well, then we try to match letters, then we try to match numbers, 1187 01:07:44,280 --> 01:07:47,480 and then if that doesn't work, we fall back to here. 1188 01:07:47,480 --> 01:07:51,280 And what this is saying is, again, optional space followed by something that is not a letter, 1189 01:07:51,280 --> 01:07:55,080 number, or a space, and one or more of that. 1190 01:07:55,080 --> 01:07:58,280 So, what this is doing effectively is this is trying to match punctuation, 1191 01:07:58,280 --> 01:08:01,080 roughly speaking, not letters and not numbers. 1192 01:08:01,080 --> 01:08:03,280 So, this group will try to trigger for that. 1193 01:08:03,280 --> 01:08:09,480 So, if I do something like this, then these parts here are not letters or numbers, 1194 01:08:09,480 --> 01:08:13,480 but they will actually get caught here. 1195 01:08:13,480 --> 01:08:15,680 And so, they become its own group. 1196 01:08:15,680 --> 01:08:18,380 So, we've separated out the punctuation. 1197 01:08:18,380 --> 01:08:21,580 And finally, this is also a little bit confusing. 
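A quick demonstration of that inconsistency, reusing the gpt2pat pattern from above (the example strings are just illustrative):

    print(re.findall(gpt2pat, "House's lease"))
    # ['House', "'s", ' lease']        the hard-coded 's rule splits off the contraction
    print(re.findall(gpt2pat, "HOUSE'S LEASE"))
    # ['HOUSE', "'", 'S', ' LEASE']    uppercase: the apostrophe becomes its own chunk
    print(re.findall(gpt2pat, "House\u2019s lease"))
    # ['House', '\u2019', 's', ' lease']    a Unicode apostrophe is not covered by the hard-coded rule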
1198 01:08:21,580 --> 01:08:24,080 So, this is matching whitespace, 1199 01:08:24,080 --> 01:08:28,980 but this is using a negative look-ahead assertion in regex. 1200 01:08:28,980 --> 01:08:32,080 So, what this is doing is it's matching whitespace up to, 1201 01:08:32,080 --> 01:08:35,980 but not including, the last whitespace character. 1202 01:08:35,980 --> 01:08:39,380 Why is this important? This is pretty subtle, I think. 1203 01:08:39,380 --> 01:08:43,380 So, you see how the whitespace is always included at the beginning of the word. 1204 01:08:43,380 --> 01:08:47,280 So, space R, space U, et cetera. 1205 01:08:47,280 --> 01:08:50,480 Suppose we have a lot of spaces here. 1206 01:08:50,480 --> 01:08:53,680 What's going to happen here is that these spaces up to 1207 01:08:53,680 --> 01:08:57,980 and not including the last character will get caught by this. 1208 01:08:57,980 --> 01:09:01,480 And what that will do is it will separate out the spaces up to, 1209 01:09:01,480 --> 01:09:03,280 but not including the last character, 1210 01:09:03,280 --> 01:09:08,580 so that the last character can come here and join with the space U. 1211 01:09:08,580 --> 01:09:12,580 And the reason that's nice is because space U is the common token. 1212 01:09:12,580 --> 01:09:16,480 So, if I didn't have these extra spaces here, we just have space U. 1213 01:09:16,480 --> 01:09:20,680 And if I add spaces, we still have a space U, 1214 01:09:20,680 --> 01:09:23,080 but now we have all this extra whitespace. 1215 01:09:23,080 --> 01:09:27,780 So, basically, the GPT-2 tokenizer really likes to have a space before letters or numbers, 1216 01:09:27,780 --> 01:09:30,280 and it prepends these spaces. 1217 01:09:30,280 --> 01:09:32,980 And this is just something that it is consistent about. 1218 01:09:32,980 --> 01:09:34,380 So, that's what that is for. 1219 01:09:34,380 --> 01:09:38,380 And then, finally, the last fallback is whitespace. 1220 01:09:38,580 --> 01:09:45,980 So, that would be just if that doesn't get caught, 1221 01:09:45,980 --> 01:09:49,980 then this thing will catch any trailing spaces and so on. 1222 01:09:49,980 --> 01:09:52,680 I wanted to show one more real-world example here. 1223 01:09:52,680 --> 01:09:55,280 So, if we have this string, which is a piece of Python code, 1224 01:09:55,280 --> 01:09:59,480 and then we try to split it up, then this is the kind of output we get. 1225 01:09:59,480 --> 01:10:01,480 So, you'll notice that the list has many elements here, 1226 01:10:01,480 --> 01:10:08,480 and that's because we are splitting up fairly often, every time sort of a category changes. 1227 01:10:08,480 --> 01:10:11,880 So, there will never be any merges within these elements. 1228 01:10:11,880 --> 01:10:14,780 And that's what you are seeing here. 1229 01:10:14,780 --> 01:10:18,980 Now, you might think that in order to train the tokenizer, 1230 01:10:18,980 --> 01:10:23,180 OpenAI has used this to split up text into chunks 1231 01:10:23,180 --> 01:10:26,780 and then run just a BPE algorithm within all the chunks. 1232 01:10:26,780 --> 01:10:28,580 But that is not exactly what happened. 1233 01:10:28,580 --> 01:10:30,380 And the reason is the following. 1234 01:10:30,380 --> 01:10:33,280 Notice that we have the spaces here. 1235 01:10:33,280 --> 01:10:36,380 Those spaces end up being entire elements, 1236 01:10:38,480 --> 01:10:40,880 but they never end up being merged by OpenAI. 
1237 01:10:40,880 --> 01:10:46,880 And the way you can tell is that if you copy-paste the exact same chunk here into Tiktokenizer, 1238 01:10:46,880 --> 01:10:51,880 you see that all the spaces are kept independent, and they are all token 220. 1239 01:10:51,880 --> 01:10:57,880 So, I think OpenAI at some point enforced some rule that these spaces would never be merged. 1240 01:10:57,880 --> 01:11:05,880 And so, there are some additional rules on top of just chunking and BPE that OpenAI is not clear about. 1241 01:11:05,880 --> 01:11:07,880 Now, the training code for the GPT-2 tokenizer was never released. 1243 01:11:08,380 --> 01:11:12,180 So, all we have is the code that I've already shown you. 1244 01:11:12,180 --> 01:11:17,180 But this code here that they've released is only the inference code for the tokens. 1245 01:11:17,180 --> 01:11:18,480 So, this is not the training code. 1246 01:11:18,480 --> 01:11:21,480 You can't give it a piece of text and train the tokenizer. 1247 01:11:21,480 --> 01:11:26,180 This is just the inference code which takes the merges that we have up above 1248 01:11:26,180 --> 01:11:29,180 and applies them to a new piece of text. 1249 01:11:29,180 --> 01:11:32,880 And so, we don't know exactly how OpenAI trained the tokenizer, 1250 01:11:32,880 --> 01:11:37,780 but it wasn't as simple as chunk it up and BPE it, whatever it was. 1251 01:11:37,780 --> 01:11:41,780 Next, I wanted to introduce you to the tiktoken library from OpenAI, 1252 01:11:41,780 --> 01:11:45,780 which is the official library for tokenization from OpenAI. 1253 01:11:45,780 --> 01:11:53,780 So, this is tiktoken, pip install tiktoken, and then you can do the tokenization inference. 1254 01:11:53,780 --> 01:11:55,780 This is, again, not training code. 1255 01:11:55,780 --> 01:11:57,780 This is only inference code for tokenization. 1256 01:11:57,780 --> 01:12:00,780 I wanted to show you how you would use it. 1257 01:12:00,780 --> 01:12:01,780 Quite simple. 1258 01:12:01,780 --> 01:12:05,780 And running this just gives us the GPT-2 tokens or the GPT-4 tokens. 1259 01:12:05,780 --> 01:12:06,780 So, this is the tokenizer used for GPT-2, 1260 01:12:07,780 --> 01:12:09,780 and this is the tokenizer used for GPT-4. 1261 01:12:09,780 --> 01:12:13,780 And so, in particular, we see that the whitespace in GPT-2 remains unmerged, 1262 01:12:13,780 --> 01:12:18,780 but in GPT-4, these whitespaces merge, as we also saw in this one, 1263 01:12:18,780 --> 01:12:26,780 where here they're all unmerged, but if we go down to GPT-4, they become merged. 1264 01:12:26,780 --> 01:12:34,780 Now, in the GPT-4 tokenizer, they changed the regular expression that they use to chunk up text. 1265 01:12:34,780 --> 01:12:37,780 So, the way to see this is that if you come to the tiktoken library, 1266 01:12:37,780 --> 01:12:43,780 and then you go to this file, tiktoken_ext/openai_public.py, 1267 01:12:43,780 --> 01:12:48,780 this is where sort of like the definition of all these different tokenizers that OpenAI maintains is. 1268 01:12:48,780 --> 01:12:53,780 And so, necessarily to do the inference, they had to publish some of the details about the strings. 1269 01:12:53,780 --> 01:12:56,780 So, this is the string that we already saw for GPT-2. 1270 01:12:56,780 --> 01:13:01,780 It is slightly different, but it is actually equivalent to what we discussed here. 
1271 01:13:01,780 --> 01:13:05,780 So, this pattern that we discussed is equivalent to this pattern. 1272 01:13:05,780 --> 01:13:07,780 This one just executes a little bit faster. 1273 01:13:07,780 --> 01:13:11,780 So, here you see a little bit of a slightly different definition, but otherwise it's the same. 1274 01:13:11,780 --> 01:13:14,780 We're going to go into special tokens in a bit. 1275 01:13:14,780 --> 01:13:19,780 And then if you scroll down to CL100K, this is the GPT-4 tokenizer, 1276 01:13:19,780 --> 01:13:22,780 you see that the pattern has changed. 1277 01:13:22,780 --> 01:13:27,780 And this is kind of like the major change, in addition to a bunch of other special tokens, 1278 01:13:27,780 --> 01:13:29,780 which we'll go into in a bit again. 1279 01:13:29,780 --> 01:13:33,780 Now, I'm not going to actually go into the full detail of the pattern change, 1280 01:13:33,780 --> 01:13:35,780 because honestly, this is mind-numbing. 1281 01:13:35,780 --> 01:13:37,780 I would just advise that you pull out your GPT-4 tokenizer pattern, 1282 01:13:37,780 --> 01:13:41,780 pull out ChatGPT and the regex documentation, and just step through it. 1283 01:13:41,780 --> 01:13:46,780 But really, the major changes are, number one, you see this i here? 1284 01:13:46,780 --> 01:13:52,780 That refers to case sensitivity: this is a case-insensitive match. 1285 01:13:52,780 --> 01:13:57,780 And so, the comment that we saw earlier on, oh, we should have used re.IGNORECASE, 1286 01:13:57,780 --> 01:14:05,780 basically, we're now going to be matching these apostrophe s, apostrophe d, apostrophe m, etc. 1287 01:14:05,780 --> 01:14:07,780 We're going to be matching them both in lowercase 1288 01:14:07,780 --> 01:14:08,780 and in uppercase. 1289 01:14:08,780 --> 01:14:10,780 So, that's fixed. 1290 01:14:10,780 --> 01:14:12,780 There's a bunch of different, like, handling of the whitespace 1291 01:14:12,780 --> 01:14:14,780 that I'm not going to go into the full details of. 1292 01:14:14,780 --> 01:14:19,780 And then, one more thing here is you will notice that when they match the numbers, 1293 01:14:19,780 --> 01:14:22,780 they only match one to three digits. 1294 01:14:22,780 --> 01:14:29,780 So, they will never merge numbers that are more than three digits long. 1295 01:14:29,780 --> 01:14:33,780 Only up to three digits of numbers will ever be merged. 1296 01:14:33,780 --> 01:14:35,780 And that's one change that they made as well, 1297 01:14:35,780 --> 01:14:37,780 to prevent tokens 1298 01:14:37,780 --> 01:14:40,780 that are very, very long number sequences. 1299 01:14:40,780 --> 01:14:43,780 But again, we don't really know why they do any of this stuff 1300 01:14:43,780 --> 01:14:45,780 because none of this is documented. 1301 01:14:45,780 --> 01:14:47,780 And it's just, we just get the pattern. 1302 01:14:47,780 --> 01:14:50,780 So, yeah, it is what it is. 1303 01:14:50,780 --> 01:14:53,780 But those are some of the changes that GPT-4 has made. 1304 01:14:53,780 --> 01:14:58,780 And of course, the vocabulary size went from roughly 50k to roughly 100k. 1305 01:14:58,780 --> 01:15:00,780 The next thing I would like to do very briefly 1306 01:15:00,780 --> 01:15:05,780 is to take you through the GPT-2 encoder.py that OpenAI has released. 1307 01:15:05,780 --> 01:15:06,780 This is the file 1308 01:15:06,780 --> 01:15:08,780 I already mentioned to you briefly. 
1309 01:15:08,780 --> 01:15:11,780 Now, this file is fairly short 1310 01:15:11,780 --> 01:15:14,780 and should be relatively understandable to you at this point. 1311 01:15:14,780 --> 01:15:17,780 Starting at the bottom here, 1312 01:15:17,780 --> 01:15:20,780 they are loading two files, 1313 01:15:20,780 --> 01:15:22,780 encoder.json and vocab.bpe. 1314 01:15:22,780 --> 01:15:24,780 And they do some light processing on it 1315 01:15:24,780 --> 01:15:27,780 and then they call this encoder object, which is the tokenizer. 1316 01:15:27,780 --> 01:15:30,780 Now, if you'd like to inspect these two files, 1317 01:15:30,780 --> 01:15:33,780 which together constitute their saved tokenizer, 1318 01:15:33,780 --> 01:15:35,780 then you can do that with a piece of code like this. 1319 01:15:36,780 --> 01:15:39,780 This is where you can download these two files 1320 01:15:39,780 --> 01:15:41,780 and you can inspect them if you'd like. 1321 01:15:41,780 --> 01:15:43,780 And what you will find is that this encoder, 1322 01:15:43,780 --> 01:15:45,780 as they call it in their code, 1323 01:15:45,780 --> 01:15:47,780 is exactly equivalent to our vocab. 1324 01:15:47,780 --> 01:15:52,780 So remember here where we have this vocab object, 1325 01:15:52,780 --> 01:15:54,780 which allowed us to decode very efficiently 1326 01:15:54,780 --> 01:16:00,780 and basically it took us from the integer to the bytes for that integer. 1327 01:16:00,780 --> 01:16:04,780 So our vocab is exactly their encoder. 1328 01:16:04,780 --> 01:16:06,780 And then their vocab.bpe, 1329 01:16:06,780 --> 01:16:08,780 confusingly, 1330 01:16:08,780 --> 01:16:10,780 is actually our merges. 1331 01:16:10,780 --> 01:16:12,780 So their bpe merges, 1332 01:16:12,780 --> 01:16:15,780 which is based on the data inside vocab.bpe, 1333 01:16:15,780 --> 01:16:18,780 ends up being equivalent to our merges. 1334 01:16:18,780 --> 01:16:22,780 So basically they are saving and loading 1335 01:16:22,780 --> 01:16:25,780 the two variables that for us are also critical, 1336 01:16:25,780 --> 01:16:28,780 the merges variable and the vocab variable. 1337 01:16:28,780 --> 01:16:30,780 Using just these two variables, 1338 01:16:30,780 --> 01:16:32,780 you can represent a tokenizer 1339 01:16:32,780 --> 01:16:34,780 and you can both do encoding and decoding 1340 01:16:34,780 --> 01:16:36,780 once you've trained this tokenizer. 1341 01:16:36,780 --> 01:16:40,780 Now the only thing that is actually slightly confusing 1342 01:16:40,780 --> 01:16:42,780 inside what OpenAI does here 1343 01:16:42,780 --> 01:16:45,780 is that in addition to this encoder and the decoder, 1344 01:16:45,780 --> 01:16:47,780 they also have something called a byte encoder 1345 01:16:47,780 --> 01:16:49,780 and a byte decoder. 1346 01:16:49,780 --> 01:16:51,780 And this is actually, unfortunately, 1347 01:16:51,780 --> 01:16:55,780 just kind of a spurious implementation detail. 1348 01:16:55,780 --> 01:16:57,780 It isn't actually deep or interesting in any way, 1349 01:16:57,780 --> 01:16:59,780 so I'm going to skip the discussion of it. 1350 01:16:59,780 --> 01:17:00,780 But what OpenAI does here, 1351 01:17:00,780 --> 01:17:02,780 for reasons that I don't fully understand, 1352 01:17:02,780 --> 01:17:04,780 is that not only have they this tokenizer, 1353 01:17:04,780 --> 01:17:06,780 which can encode and decode, 1354 01:17:06,780 --> 01:17:09,780 but they have a whole separate layer here in addition 1355 01:17:09,780 --> 01:17:11,780 that is used serially with the tokenizer. 
1356 01:17:11,780 --> 01:17:15,780 And so you first do byte encode and then encode, 1357 01:17:15,780 --> 01:17:18,780 and then you do decode and then byte decode. 1358 01:17:18,780 --> 01:17:19,780 So that's the loop, 1359 01:17:19,780 --> 01:17:22,780 and they are just stacked serial on top of each other. 1360 01:17:22,780 --> 01:17:24,780 And it's not that interesting, so I won't cover it, 1361 01:17:24,780 --> 01:17:26,780 and you can step through it if you'd like. 1362 01:17:26,780 --> 01:17:27,780 Otherwise, this file, 1363 01:17:27,780 --> 01:17:30,780 if you ignore the byte encoder and the byte decoder, 1364 01:17:30,780 --> 01:17:32,780 will be algorithmically very familiar with you. 1365 01:17:32,780 --> 01:17:34,780 And the meat of it here is the, 1366 01:17:34,780 --> 01:17:36,780 what they call BPE function, 1367 01:17:36,780 --> 01:17:39,780 and you should recognize this loop here, 1368 01:17:39,780 --> 01:17:41,780 which is very similar to our own while loop, 1369 01:17:41,780 --> 01:17:44,780 where they're trying to identify the bigram, 1370 01:17:44,780 --> 01:17:45,780 a pair, 1371 01:17:45,780 --> 01:17:47,780 that they should be merging next. 1372 01:17:47,780 --> 01:17:49,780 And then here, just like we had, 1373 01:17:49,780 --> 01:17:51,780 they have a for loop trying to merge this pair. 1374 01:17:51,780 --> 01:17:53,780 So they will go over all of the sequence 1375 01:17:53,780 --> 01:17:56,780 and they will merge the pair whenever they find it. 1376 01:17:56,780 --> 01:17:58,780 And they keep repeating that 1377 01:17:58,780 --> 01:18:01,780 until they run out of possible merges in the text. 1378 01:18:01,780 --> 01:18:03,780 So that's the meat of this file, 1379 01:18:03,780 --> 01:18:05,780 and there's an encode and decode function, 1380 01:18:05,780 --> 01:18:07,780 just like we have implemented it. 1381 01:18:07,780 --> 01:18:08,780 So long story short, 1382 01:18:08,780 --> 01:18:10,780 what I want you to take away at this point is that, 1383 01:18:10,780 --> 01:18:12,780 unfortunately, it's a little bit of a messy code that they have, 1384 01:18:12,780 --> 01:18:14,780 but algorithmically it is identical 1385 01:18:14,780 --> 01:18:16,780 to what we've built up above. 1386 01:18:16,780 --> 01:18:18,780 And what we've built up above, if you understand it, 1387 01:18:18,780 --> 01:18:20,780 is algorithmically what is necessary 1388 01:18:20,780 --> 01:18:23,780 to actually build a BPE tokenizer, 1389 01:18:23,780 --> 01:18:26,780 train it, and then both encode and decode. 1390 01:18:26,780 --> 01:18:27,780 The next topic I would like to turn to 1391 01:18:27,780 --> 01:18:29,780 is that of special tokens. 1392 01:18:29,780 --> 01:18:31,780 So in addition to tokens that are coming from, 1393 01:18:31,780 --> 01:18:34,780 you know, raw bytes and the BPE merges, 1394 01:18:34,780 --> 01:18:37,780 we can insert all kinds of tokens that we are going to use 1395 01:18:37,780 --> 01:18:39,780 to delimit different parts of the data 1396 01:18:39,780 --> 01:18:41,780 or introduce to create a special structure 1397 01:18:41,780 --> 01:18:44,780 of the token streams. 1398 01:18:44,780 --> 01:18:47,780 So if you look at this encoder object 1399 01:18:47,780 --> 01:18:50,780 from OpenAI's GPT-2 right here, 1400 01:18:50,780 --> 01:18:52,780 we mentioned this is very similar to our vocab. 1401 01:18:52,780 --> 01:18:59,780 You'll notice that the length of this is 50,257. 
1402 01:18:59,780 --> 01:19:01,780 As I mentioned, it's mapping, 1403 01:19:01,780 --> 01:19:03,780 and it's inverted from the mapping of our vocab. 1404 01:19:03,780 --> 01:19:06,780 Our vocab goes from integer to string, 1405 01:19:06,780 --> 01:19:10,780 and they go the other way around for no amazing reason. 1406 01:19:10,780 --> 01:19:12,780 But the thing to note here is that 1407 01:19:12,780 --> 01:19:15,780 the mapping table here is 50,257. 1408 01:19:15,780 --> 01:19:17,780 Where does that number come from? 1409 01:19:17,780 --> 01:19:19,780 Where are the tokens? 1410 01:19:19,780 --> 01:19:24,780 As I mentioned, there are 256 raw byte tokens. 1411 01:19:24,780 --> 01:19:28,780 And then OpenAI actually did 50,000 merges. 1412 01:19:28,780 --> 01:19:31,780 So those become the other tokens. 1413 01:19:31,780 --> 01:19:33,780 But this would have been 50,257. 1414 01:19:33,780 --> 01:19:36,780 So what is the 57th token? 1415 01:19:36,780 --> 01:19:40,780 And there is basically one special token. 1416 01:19:40,780 --> 01:19:42,780 And that one special token, 1417 01:19:42,780 --> 01:19:45,780 you can see, is called end of text. 1418 01:19:45,780 --> 01:19:47,780 So this is a special token, 1419 01:19:47,780 --> 01:19:49,780 and it's the very last token. 1420 01:19:49,780 --> 01:19:52,780 And this token is used to delimit documents 1421 01:19:52,780 --> 01:19:54,780 in the training set. 1422 01:19:54,780 --> 01:19:56,780 So when we're creating the training data, 1423 01:19:56,780 --> 01:19:57,780 we have all these documents, 1424 01:19:57,780 --> 01:19:58,780 and we tokenize them, 1425 01:19:58,780 --> 01:20:00,780 and we get a stream of tokens. 1426 01:20:00,780 --> 01:20:02,780 Those tokens only range from 0 1427 01:20:02,780 --> 01:20:05,780 to 50,256. 1428 01:20:05,780 --> 01:20:07,780 And then in between those documents, 1429 01:20:07,780 --> 01:20:10,780 we put special end of text token. 1430 01:20:10,780 --> 01:20:13,780 And we insert that token in between documents. 1431 01:20:13,780 --> 01:20:17,780 And we are using this as a signal to the language model 1432 01:20:17,780 --> 01:20:19,780 that the document has ended, 1433 01:20:19,780 --> 01:20:21,780 and what follows is going to be unrelated 1434 01:20:21,780 --> 01:20:23,780 to the document previously. 1435 01:20:23,780 --> 01:20:26,780 That said, the language model has to learn this from data. 1436 01:20:26,780 --> 01:20:29,780 It needs to learn that this token usually means 1437 01:20:29,780 --> 01:20:31,780 that it should wipe its sort of memory 1438 01:20:31,780 --> 01:20:32,780 of what came before, 1439 01:20:32,780 --> 01:20:34,780 and what came before this token 1440 01:20:34,780 --> 01:20:36,780 is not actually informative to what comes next. 1441 01:20:36,780 --> 01:20:38,780 But we are expecting the language model 1442 01:20:38,780 --> 01:20:39,780 to just like learn this, 1443 01:20:39,780 --> 01:20:41,780 but we're giving it the special sort of delimiter 1444 01:20:41,780 --> 01:20:43,780 of these documents. 1445 01:20:43,780 --> 01:20:45,780 We can go here to tiktokenizer, 1446 01:20:45,780 --> 01:20:48,780 and this is the GPT to tokenizer, 1447 01:20:48,780 --> 01:20:50,780 our code that we've been playing with before. 1448 01:20:50,780 --> 01:20:51,780 So we can add here, right? 1449 01:20:51,780 --> 01:20:53,780 Hello world, how are you? 1450 01:20:53,780 --> 01:20:55,780 And we're getting different tokens. 
1451 01:20:55,780 --> 01:20:57,780 But now you can see what happens 1452 01:20:57,780 --> 01:20:59,780 if I put end of text. 1453 01:20:59,780 --> 01:21:00,780 You see how, 1454 01:21:00,780 --> 01:21:02,780 until I finished it, 1455 01:21:02,780 --> 01:21:04,780 these are all different tokens. 1456 01:21:04,780 --> 01:21:06,780 End of text, 1457 01:21:06,780 --> 01:21:08,780 still tokens, 1458 01:21:08,780 --> 01:21:09,780 and now when I finish it, 1459 01:21:09,780 --> 01:21:13,780 suddenly we get token 50,256. 1460 01:21:13,780 --> 01:21:16,780 And the reason this works is because 1461 01:21:16,780 --> 01:21:19,780 this didn't actually go through the BPE merges. 1462 01:21:19,780 --> 01:21:23,780 Instead, the code that actually outputs the tokens 1463 01:21:23,780 --> 01:21:25,780 has special case instructions 1464 01:21:25,780 --> 01:21:28,780 for handling special tokens. 1465 01:21:28,780 --> 01:21:30,780 We did not see these special instructions 1466 01:21:30,780 --> 01:21:33,780 for handling special tokens in the encoder.py. 1467 01:21:33,780 --> 01:21:35,780 It's absent there. 1468 01:21:35,780 --> 01:21:37,780 But if you go to tiktoken library, 1469 01:21:37,780 --> 01:21:39,780 which is implemented in Rust, 1470 01:21:39,780 --> 01:21:41,780 you will find all kinds of special case handling 1471 01:21:41,780 --> 01:21:43,780 for these special tokens 1472 01:21:43,780 --> 01:21:45,780 that you can register, create, 1473 01:21:45,780 --> 01:21:47,780 add to the vocabulary, 1474 01:21:47,780 --> 01:21:48,780 and then it looks for them. 1475 01:21:48,780 --> 01:21:51,780 And whenever it sees these special tokens like this, 1476 01:21:51,780 --> 01:21:54,780 it will actually come in and swap in that special token. 1477 01:21:54,780 --> 01:21:57,780 So these things are outside of the typical algorithm 1478 01:21:57,780 --> 01:21:59,780 of byte pairing coding. 1479 01:21:59,780 --> 01:22:02,780 So these special tokens are used pervasively, 1480 01:22:02,780 --> 01:22:05,780 not just in basically base language modeling 1481 01:22:05,780 --> 01:22:07,780 of predicting the next token in the sequence, 1482 01:22:07,780 --> 01:22:09,780 but especially when it gets to later 1483 01:22:09,780 --> 01:22:10,780 to the fine-tuning stage 1484 01:22:10,780 --> 01:22:13,780 and all of the chat GPT sort of aspects of it, 1485 01:22:13,780 --> 01:22:15,780 because we don't just want to delimit documents, 1486 01:22:15,780 --> 01:22:17,780 we want to delimit entire conversations 1487 01:22:17,780 --> 01:22:19,780 between an assistant and a user. 1488 01:22:19,780 --> 01:22:22,780 So if I refresh this tiktokenizer page, 1489 01:22:22,780 --> 01:22:24,780 the default example that they have here 1490 01:22:24,780 --> 01:22:28,780 is using not sort of base model encoders, 1491 01:22:28,780 --> 01:22:32,780 but fine-tuned model sort of tokenizers. 1492 01:22:32,780 --> 01:22:35,780 So for example, using the GPT 3.5 Turbo scheme, 1493 01:22:35,780 --> 01:22:38,780 these here are all special tokens, 1494 01:22:38,780 --> 01:22:41,780 IAM start, IAM end, et cetera. 1495 01:22:41,780 --> 01:22:44,780 This is short for imaginary model log 1496 01:22:44,780 --> 01:22:46,780 underscore start, by the way. 
1497 01:22:46,780 --> 01:22:49,780 But you can see here that there's a sort of start 1498 01:22:49,780 --> 01:22:51,780 and end of every single message, 1499 01:22:51,780 --> 01:22:53,780 and there can be many other tokens, 1500 01:22:53,780 --> 01:22:57,780 lots of tokens in use to delimit these conversations 1501 01:22:57,780 --> 01:23:01,780 and kind of keep track of the flow of the messages here. 1502 01:23:01,780 --> 01:23:04,780 Now we can go back to the tiktoken library, 1503 01:23:04,780 --> 01:23:06,780 and here when you scroll to the bottom, 1504 01:23:06,780 --> 01:23:09,780 they talk about how you can extend tiktoken, 1505 01:23:09,780 --> 01:23:12,780 and you can create, basically you can fork 1506 01:23:12,780 --> 01:23:16,780 the CL100K base tokenizer used in GPT-4, 1507 01:23:16,780 --> 01:23:18,780 and for example, you can extend it 1508 01:23:18,780 --> 01:23:19,780 by adding more special tokens, 1509 01:23:19,780 --> 01:23:20,780 and these are totally up to you. 1510 01:23:20,780 --> 01:23:22,780 You can come up with any arbitrary tokens 1511 01:23:22,780 --> 01:23:25,780 and add them with the new ID afterwards, 1512 01:23:25,780 --> 01:23:27,780 and the tiktoken library will correct 1513 01:23:27,780 --> 01:23:29,780 or directly swap them out 1514 01:23:29,780 --> 01:23:32,780 when it sees this in the strings. 1515 01:23:32,780 --> 01:23:34,780 Now we can also go back to this file, 1516 01:23:34,780 --> 01:23:36,780 which we looked at previously, 1517 01:23:36,780 --> 01:23:39,780 and I mentioned that the GPT-2 in tiktoken, 1518 01:23:39,780 --> 01:23:41,780 opening in public.py, 1519 01:23:41,780 --> 01:23:43,780 we have the vocabulary, 1520 01:23:43,780 --> 01:23:45,780 we have the pattern for splitting, 1521 01:23:45,780 --> 01:23:46,780 and then here we are registering 1522 01:23:46,780 --> 01:23:48,780 the single special token in GPT-2, 1523 01:23:48,780 --> 01:23:50,780 which was the end of text token, 1524 01:23:50,780 --> 01:23:52,780 and we saw that it has this ID. 1525 01:23:52,780 --> 01:23:55,780 In GPT-4, when they defined this here, 1526 01:23:55,780 --> 01:23:57,780 you see that the pattern has changed 1527 01:23:57,780 --> 01:23:58,780 as we've discussed, 1528 01:23:58,780 --> 01:24:00,780 but also the special tokens have changed 1529 01:24:00,780 --> 01:24:01,780 in this tokenizer. 1530 01:24:01,780 --> 01:24:03,780 So we of course have the end of text, 1531 01:24:03,780 --> 01:24:04,780 just like in GPT-2, 1532 01:24:04,780 --> 01:24:06,780 but we also see three, 1533 01:24:06,780 --> 01:24:08,780 sorry, four additional tokens here, 1534 01:24:08,780 --> 01:24:10,780 thim prefix, middle, and suffix. 1535 01:24:10,780 --> 01:24:11,780 What is thim? 1536 01:24:11,780 --> 01:24:14,780 Thim is short for fill in the middle, 1537 01:24:14,780 --> 01:24:16,780 and if you'd like to learn more about this idea, 1538 01:24:16,780 --> 01:24:19,780 it comes from this paper, 1539 01:24:19,780 --> 01:24:21,780 and I'm not going to go into detail in this video, 1540 01:24:21,780 --> 01:24:22,780 it's beyond this video, 1541 01:24:22,780 --> 01:24:24,780 and then there's one additional 1542 01:24:24,780 --> 01:24:26,780 sort of token here. 1543 01:24:26,780 --> 01:24:28,780 So that's that encoding as well. 1544 01:24:28,780 --> 01:24:30,780 So it's very common, basically, 1545 01:24:30,780 --> 01:24:32,780 to train a language model, 1546 01:24:32,780 --> 01:24:34,780 and then if you'd like, 1547 01:24:34,780 --> 01:24:36,780 you can add special tokens. 
1548 01:24:36,780 --> 01:24:38,780 Now, when you add special tokens, 1549 01:24:38,780 --> 01:24:40,780 you of course have to do some model surgery 1550 01:24:40,780 --> 01:24:42,780 to the transformer 1551 01:24:42,780 --> 01:24:44,780 and all the parameters involved in that transformer, 1552 01:24:44,780 --> 01:24:46,780 because you are basically adding an integer, 1553 01:24:46,780 --> 01:24:47,780 and you want to make sure that, 1554 01:24:47,780 --> 01:24:49,780 for example, your embedding matrix 1555 01:24:49,780 --> 01:24:51,780 for the vocabulary tokens 1556 01:24:51,780 --> 01:24:53,780 has to be extended by adding a row, 1557 01:24:53,780 --> 01:24:55,780 and typically this row would be initialized 1558 01:24:55,780 --> 01:24:57,780 with small random numbers or something like that, 1559 01:24:57,780 --> 01:24:59,780 because we need to have a vector 1560 01:24:59,780 --> 01:25:01,780 that now stands for that token. 1561 01:25:01,780 --> 01:25:02,780 In addition to that, 1562 01:25:02,780 --> 01:25:04,780 you have to go to the final layer of the transformer, 1563 01:25:04,780 --> 01:25:06,780 and you have to make sure that that projection 1564 01:25:06,780 --> 01:25:08,780 at the very end into the classifier 1565 01:25:08,780 --> 01:25:10,780 is extended by one as well. 1566 01:25:10,780 --> 01:25:12,780 So basically there's some model surgery involved 1567 01:25:12,780 --> 01:25:15,780 that you have to couple with the tokenization changes 1568 01:25:15,780 --> 01:25:18,780 if you are going to add special tokens. 1569 01:25:18,780 --> 01:25:20,780 But this is a very common operation that people do, 1570 01:25:20,780 --> 01:25:22,780 especially if they'd like to fine-tune the model, 1571 01:25:22,780 --> 01:25:24,780 for example, taking it from a base model 1572 01:25:24,780 --> 01:25:28,780 to a chat model like ChatGPT. 1573 01:25:28,780 --> 01:25:29,780 Okay, so at this point, 1574 01:25:29,780 --> 01:25:30,780 you should have everything you need 1575 01:25:30,780 --> 01:25:32,780 in order to build your own GPT-4 tokenizer. 1576 01:25:32,780 --> 01:25:34,780 Now, in the process of developing this lecture, 1577 01:25:34,780 --> 01:25:35,780 I've done that, 1578 01:25:35,780 --> 01:25:39,780 and I've published the code under this repository minBPE. 1579 01:25:39,780 --> 01:25:42,780 So minBPE looks like this right now as I'm recording, 1580 01:25:42,780 --> 01:25:45,780 but the minBPE repository will probably change quite a bit 1581 01:25:45,780 --> 01:25:49,780 because I intend to continue working on it. 1582 01:25:49,780 --> 01:25:51,780 In addition to the minBPE repository, 1583 01:25:51,780 --> 01:25:53,780 I've published this exercise progression 1584 01:25:53,780 --> 01:25:54,780 that you can follow. 1585 01:25:54,780 --> 01:25:56,780 So if you go to exercise.md here, 1586 01:25:56,780 --> 01:26:00,780 this is sort of me breaking up the task ahead of you 1587 01:26:00,780 --> 01:26:03,780 into four steps that sort of build up 1588 01:26:03,780 --> 01:26:05,780 to what can be a GPT-4 tokenizer. 1589 01:26:05,780 --> 01:26:08,780 And so feel free to follow these steps exactly 1590 01:26:08,780 --> 01:26:10,780 and follow a little bit of the guidance 1591 01:26:10,780 --> 01:26:11,780 that I've laid out here. 1592 01:26:11,780 --> 01:26:13,780 And anytime you feel stuck, 1593 01:26:13,780 --> 01:26:16,780 just reference the minBPE repository here. 
1594 01:26:16,780 --> 01:26:18,780 So either the tests could be useful 1595 01:26:18,780 --> 01:26:20,780 or the minBPE repository itself. 1596 01:26:20,780 --> 01:26:22,780 I try to keep the code fairly clean 1597 01:26:22,780 --> 01:26:24,780 and understandable. 1598 01:26:24,780 --> 01:26:29,780 And so feel free to reference it whenever you get stuck. 1599 01:26:29,780 --> 01:26:31,780 In addition to that, basically, 1600 01:26:31,780 --> 01:26:33,780 once you write it, 1601 01:26:33,780 --> 01:26:36,780 you should be able to reproduce this behavior from Tiktoken. 1602 01:26:36,780 --> 01:26:38,780 So getting the GPT-4 tokenizer, 1603 01:26:38,780 --> 01:26:40,780 you can encode this string 1604 01:26:40,780 --> 01:26:42,780 and you should get these tokens. 1605 01:26:42,780 --> 01:26:43,780 And then you can encode and decode 1606 01:26:43,780 --> 01:26:45,780 the exact same string to recover it. 1607 01:26:45,780 --> 01:26:46,780 And in addition to all that, 1608 01:26:46,780 --> 01:26:49,780 you should be able to implement your own train function, 1609 01:26:49,780 --> 01:26:51,780 which Tiktoken library does not provide. 1610 01:26:51,780 --> 01:26:53,780 It's, again, only inference code. 1611 01:26:53,780 --> 01:26:55,780 But you could write your own train. 1612 01:26:55,780 --> 01:26:57,780 minBPE does it as well. 1613 01:26:57,780 --> 01:27:01,780 And that will allow you to train your own token vocabularies. 1614 01:27:01,780 --> 01:27:03,780 So here's some of the code inside minBPE, 1615 01:27:03,780 --> 01:27:06,780 minBPE shows the token vocabularies 1616 01:27:06,780 --> 01:27:08,780 that you might obtain. 1617 01:27:08,780 --> 01:27:10,780 So on the left here, 1618 01:27:10,780 --> 01:27:12,780 we have the GPT-4 merges. 1619 01:27:12,780 --> 01:27:16,780 So the first 256 are raw individual bytes. 1620 01:27:16,780 --> 01:27:18,780 And then here I am visualizing the merges 1621 01:27:18,780 --> 01:27:20,780 that GPT-4 performed during its training. 1622 01:27:20,780 --> 01:27:23,780 So the very first merge that GPT-4 did 1623 01:27:23,780 --> 01:27:26,780 was merge two spaces into a single token 1624 01:27:26,780 --> 01:27:28,780 for, you know, two spaces. 1625 01:27:28,780 --> 01:27:30,780 And that is the token 256. 1626 01:27:30,780 --> 01:27:32,780 And so this is the order in which things merged 1627 01:27:32,780 --> 01:27:33,780 during GPT-4 training. 1628 01:27:33,780 --> 01:27:35,780 And this is the merge order 1629 01:27:35,780 --> 01:27:38,780 that we obtain in minBPE 1630 01:27:38,780 --> 01:27:40,780 by training a tokenizer. 1631 01:27:40,780 --> 01:27:41,780 And in this case, I trained it 1632 01:27:41,780 --> 01:27:43,780 on a Wikipedia page of Taylor Swift. 1633 01:27:43,780 --> 01:27:45,780 Not because I'm a Swifty, 1634 01:27:45,780 --> 01:27:47,780 but because that is one of the longest 1635 01:27:47,780 --> 01:27:49,780 Wikipedia pages apparently that's available. 1636 01:27:49,780 --> 01:27:51,780 But she is pretty cool. 1637 01:27:51,780 --> 01:27:55,780 And what was I going to say? 1638 01:27:55,780 --> 01:27:58,780 Yeah, so you can compare these two vocabularies. 1639 01:27:58,780 --> 01:28:02,780 And so as an example, 1640 01:28:02,780 --> 01:28:05,780 here GPT-4 merged IN to become IN. 1641 01:28:05,780 --> 01:28:07,780 And we've done the exact same thing 1642 01:28:07,780 --> 01:28:09,780 on this token, 259. 1643 01:28:09,780 --> 01:28:11,780 Here, space T becomes space T. 
1644 01:28:11,780 --> 01:28:14,780 And that happened for us a little bit later as well. 1645 01:28:14,780 --> 01:28:16,780 So the difference here is, again, 1646 01:28:16,780 --> 01:28:17,780 to my understanding, 1647 01:28:17,780 --> 01:28:19,780 only a difference of the training set. 1648 01:28:19,780 --> 01:28:21,780 As an example, because I see a lot of white space, 1649 01:28:21,780 --> 01:28:23,780 I expect that GPT-4 probably had a lot of Python code 1650 01:28:23,780 --> 01:28:25,780 in its training set, I'm not sure, 1651 01:28:25,780 --> 01:28:27,780 for the tokenizer. 1652 01:28:27,780 --> 01:28:29,780 And here we see much less of that, 1653 01:28:29,780 --> 01:28:32,780 of course, in the Wikipedia page. 1654 01:28:32,780 --> 01:28:34,780 So roughly speaking, they look the same. 1655 01:28:34,780 --> 01:28:35,780 And they look the same because they're 1656 01:28:35,780 --> 01:28:36,780 running the same algorithm. 1657 01:28:36,780 --> 01:28:38,780 And when you train your own, 1658 01:28:38,780 --> 01:28:40,780 you're probably going to get something similar 1659 01:28:40,780 --> 01:28:41,780 depending on what you train it on. 1660 01:28:41,780 --> 01:28:43,780 Okay, so we are now going to move on 1661 01:28:43,780 --> 01:28:44,780 from TickToken 1662 01:28:44,780 --> 01:28:46,780 and the way that OpenAI tokenizes its strings. 1663 01:28:46,780 --> 01:28:48,780 And we're going to discuss one more 1664 01:28:48,780 --> 01:28:49,780 very commonly used library 1665 01:28:49,780 --> 01:28:51,780 for working with tokenization in LLMs, 1666 01:28:51,780 --> 01:28:53,780 and that is SentencePiece. 1667 01:28:53,780 --> 01:28:56,780 So SentencePiece is very commonly used 1668 01:28:56,780 --> 01:28:59,780 in language models because unlike TickToken, 1669 01:28:59,780 --> 01:29:01,780 it can do both training and inference 1670 01:29:01,780 --> 01:29:03,780 and is quite efficient at both. 1671 01:29:03,780 --> 01:29:05,780 It supports a number of algorithms 1672 01:29:05,780 --> 01:29:07,780 for training vocabularies, 1673 01:29:07,780 --> 01:29:09,780 but one of them is the byte pairing coding algorithm 1674 01:29:09,780 --> 01:29:10,780 that we've been looking at. 1675 01:29:10,780 --> 01:29:12,780 So it supports it. 1676 01:29:12,780 --> 01:29:14,780 Now, SentencePiece is used both by Lama 1677 01:29:14,780 --> 01:29:17,780 and Mistral series and many other models as well. 1678 01:29:17,780 --> 01:29:21,780 It is on GitHub under Google slash SentencePiece. 1679 01:29:21,780 --> 01:29:23,780 And the big difference with SentencePiece, 1680 01:29:23,780 --> 01:29:25,780 and we're going to look at example 1681 01:29:25,780 --> 01:29:28,780 because this is kind of hard and subtle to explain, 1682 01:29:28,780 --> 01:29:30,780 is that they think different 1683 01:29:30,780 --> 01:29:33,780 about the order of operations here. 1684 01:29:33,780 --> 01:29:35,780 So in the case of TickToken, 1685 01:29:35,780 --> 01:29:39,780 we first take our code points in a string. 1686 01:29:39,780 --> 01:29:41,780 We encode them using UTF-82 bytes 1687 01:29:41,780 --> 01:29:43,780 and then we're merging bytes. 1688 01:29:43,780 --> 01:29:45,780 It's fairly straightforward. 1689 01:29:45,780 --> 01:29:46,780 For SentencePiece, 1690 01:29:46,780 --> 01:29:48,780 it works directly on the level 1691 01:29:48,780 --> 01:29:50,780 of the code points themselves. 
1692 01:29:50,780 --> 01:29:52,780 So it looks at whatever code points 1693 01:29:52,780 --> 01:29:54,780 are available in your training set 1694 01:29:54,780 --> 01:29:56,780 and then it starts merging those code points. 1695 01:29:56,780 --> 01:30:01,780 And the BPE is running on the level of code points. 1696 01:30:01,780 --> 01:30:04,780 And if you happen to run out of code points, 1697 01:30:04,780 --> 01:30:06,780 so there are maybe some rare code points 1698 01:30:06,780 --> 01:30:07,780 that just don't come up too often 1699 01:30:07,780 --> 01:30:08,780 and the rarity is determined 1700 01:30:08,780 --> 01:30:11,780 by this character coverage hyperparameter, 1701 01:30:11,780 --> 01:30:14,780 then these code points will either get maps 1702 01:30:14,780 --> 01:30:16,780 to a special unknown token, 1703 01:30:16,780 --> 01:30:17,780 like Ankh, 1704 01:30:17,780 --> 01:30:20,780 or if you have the byte fallback option turned on, 1705 01:30:20,780 --> 01:30:23,780 then that will take those rare code points, 1706 01:30:23,780 --> 01:30:25,780 it will encode them using UTF-8, 1707 01:30:25,780 --> 01:30:27,780 and then the individual bytes of that encoding 1708 01:30:27,780 --> 01:30:29,780 will be translated into tokens. 1709 01:30:29,780 --> 01:30:31,780 And there are these special byte tokens 1710 01:30:31,780 --> 01:30:33,780 that basically get added to the vocabulary. 1711 01:30:33,780 --> 01:30:37,780 So it uses BPE on the code points 1712 01:30:37,780 --> 01:30:39,780 and then it falls back to bytes 1713 01:30:39,780 --> 01:30:42,780 for rare code points. 1714 01:30:42,780 --> 01:30:44,780 And so that's kind of like the difference. 1715 01:30:44,780 --> 01:30:45,780 Personally, I find that TickToken 1716 01:30:45,780 --> 01:30:47,780 is significantly cleaner, 1717 01:30:47,780 --> 01:30:48,780 but it's kind of like a subtle 1718 01:30:48,780 --> 01:30:49,780 but pretty major difference 1719 01:30:49,780 --> 01:30:51,780 between the way they approach tokenization. 1720 01:30:51,780 --> 01:30:52,780 Let's work with a concrete example 1721 01:30:52,780 --> 01:30:54,780 because otherwise this is kind of hard 1722 01:30:54,780 --> 01:30:57,780 to get your head around. 1723 01:30:57,780 --> 01:30:59,780 So let's work with a concrete example. 1724 01:30:59,780 --> 01:31:02,780 This is how we can import sentence piece. 1725 01:31:02,780 --> 01:31:04,780 And then here we're going to take, 1726 01:31:04,780 --> 01:31:06,780 I think I took like the description of sentence piece 1727 01:31:06,780 --> 01:31:08,780 and I just created like a little toy data set. 1728 01:31:08,780 --> 01:31:09,780 It really likes to have a file. 1729 01:31:09,780 --> 01:31:13,780 So I created a toy.txt file with this content. 1730 01:31:13,780 --> 01:31:15,780 Now, what's kind of a little bit crazy 1731 01:31:15,780 --> 01:31:16,780 about sentence piece 1732 01:31:16,780 --> 01:31:19,780 is that there's a ton of options and configurations. 1733 01:31:19,780 --> 01:31:20,780 And the reason this is so 1734 01:31:20,780 --> 01:31:22,780 is because sentence piece has been around, 1735 01:31:22,780 --> 01:31:23,780 I think for a while, 1736 01:31:23,780 --> 01:31:24,780 and it really tries to handle 1737 01:31:24,780 --> 01:31:26,780 a large diversity of things. 1738 01:31:26,780 --> 01:31:28,780 And because it's been around, 1739 01:31:28,780 --> 01:31:30,780 I think it has quite a bit of accumulated 1740 01:31:30,780 --> 01:31:32,780 historical baggage as well. 
1741 01:31:32,780 --> 01:31:33,780 And so in particular, 1742 01:31:33,780 --> 01:31:36,780 there's like a ton of configuration arguments. 1743 01:31:36,780 --> 01:31:38,780 This is not even all of it. 1744 01:31:38,780 --> 01:31:42,780 You can go to here to see all the training options. 1745 01:31:42,780 --> 01:31:45,780 And there's also quite useful documentation 1746 01:31:45,780 --> 01:31:48,780 like the raw protobuf that is used 1747 01:31:48,780 --> 01:31:52,780 to represent the trainer spec and so on. 1748 01:31:52,780 --> 01:31:54,780 Many of these options are irrelevant to us. 1749 01:31:54,780 --> 01:31:56,780 So maybe to point out one example, 1750 01:31:56,780 --> 01:31:58,780 dash dash shrinking factor. 1751 01:31:58,780 --> 01:32:00,780 This shrinking factor is not used 1752 01:32:00,780 --> 01:32:02,780 in the byte pairing coding algorithm. 1753 01:32:02,780 --> 01:32:05,780 So this is just an argument that is irrelevant to us. 1754 01:32:05,780 --> 01:32:10,780 It applies to a different training algorithm. 1755 01:32:10,780 --> 01:32:11,780 Now, what I tried to do here 1756 01:32:11,780 --> 01:32:13,780 is I tried to set up sentence piece 1757 01:32:13,780 --> 01:32:15,780 in a way that is very, very similar. 1758 01:32:15,780 --> 01:32:18,780 So I can tell to maybe identical, hopefully, 1759 01:32:18,780 --> 01:32:21,780 to the way that Lama2 was trained. 1760 01:32:21,780 --> 01:32:25,780 So the way they trained their own tokenizer. 1761 01:32:25,780 --> 01:32:27,780 And the way I did this was basically 1762 01:32:27,780 --> 01:32:29,780 you can take the tokenizer.model file 1763 01:32:29,780 --> 01:32:30,780 that Meta released, 1764 01:32:30,780 --> 01:32:34,780 and you can open it using the protobuf 1765 01:32:34,780 --> 01:32:37,780 sort of file that you can generate. 1766 01:32:37,780 --> 01:32:39,780 And then you can inspect all the options. 1767 01:32:39,780 --> 01:32:41,780 And I tried to copy over all the options 1768 01:32:41,780 --> 01:32:42,780 that looked relevant. 1769 01:32:42,780 --> 01:32:44,780 So here we set up the input. 1770 01:32:44,780 --> 01:32:46,780 This is a raw text in this file. 1771 01:32:46,780 --> 01:32:47,780 Here's going to be the output. 1772 01:32:47,780 --> 01:32:52,780 So it's going to be protoc400.model and .vocap. 1773 01:32:52,780 --> 01:32:54,780 We're saying that we're going to use the BP algorithm, 1774 01:32:54,780 --> 01:32:56,780 and we want a vocab size of 400. 1775 01:32:56,780 --> 01:32:58,780 And there's a ton of configurations here 1776 01:32:58,780 --> 01:33:04,780 for basically preprocessing 1777 01:33:04,780 --> 01:33:06,780 and normalization rules, as they're called. 1778 01:33:06,780 --> 01:33:09,780 Normalization used to be very prevalent, 1779 01:33:09,780 --> 01:33:12,780 I would say, before LLMs in natural language processing. 1780 01:33:12,780 --> 01:33:13,780 So in machine translation 1781 01:33:13,780 --> 01:33:15,780 and text classification and so on, 1782 01:33:15,780 --> 01:33:17,780 you want to normalize and simplify the text, 1783 01:33:17,780 --> 01:33:18,780 and you want to turn it all lowercase, 1784 01:33:18,780 --> 01:33:21,780 and you want to remove all double white space, etc. 1785 01:33:21,780 --> 01:33:22,780 And in language models, 1786 01:33:22,780 --> 01:33:24,780 we prefer not to do any of it, 1787 01:33:24,780 --> 01:33:25,780 or at least that is my preference 1788 01:33:25,780 --> 01:33:26,780 as a deep learning person. 1789 01:33:26,780 --> 01:33:28,780 You want to not touch your data. 
1790 01:33:28,780 --> 01:33:29,780 You want to keep the raw data 1791 01:33:29,780 --> 01:33:33,780 as much as possible in a raw form. 1792 01:33:33,780 --> 01:33:34,780 So you're basically trying to turn off 1793 01:33:34,780 --> 01:33:37,780 a lot of this if you can. 1794 01:33:37,780 --> 01:33:38,780 The other thing that sentence piece does 1795 01:33:38,780 --> 01:33:41,780 is that it has this concept of sentences. 1796 01:33:41,780 --> 01:33:43,780 So sentence piece, 1797 01:33:43,780 --> 01:33:44,780 it's back, 1798 01:33:44,780 --> 01:33:45,780 it kind of like was developed, 1799 01:33:45,780 --> 01:33:46,780 I think, early in the days 1800 01:33:46,780 --> 01:33:49,780 where there was an idea 1801 01:33:49,780 --> 01:33:51,780 that you're training a tokenizer 1802 01:33:51,780 --> 01:33:53,780 on a bunch of independent sentences. 1803 01:33:53,780 --> 01:33:54,780 So it has a lot of like 1804 01:33:54,780 --> 01:33:56,780 how many sentences you're going to train on, 1805 01:33:56,780 --> 01:34:01,780 what is the maximum sentence length, 1806 01:34:01,780 --> 01:34:02,780 shuffling sentences. 1807 01:34:02,780 --> 01:34:03,780 And so for it, 1808 01:34:03,780 --> 01:34:04,780 sentences are kind of like 1809 01:34:04,780 --> 01:34:05,780 the individual training examples. 1810 01:34:05,780 --> 01:34:07,780 But again, in the context of LLMs, 1811 01:34:07,780 --> 01:34:09,780 I find that this is like a very spurious 1812 01:34:09,780 --> 01:34:10,780 and weird distinction. 1813 01:34:10,780 --> 01:34:12,780 Like sentences are 1814 01:34:12,780 --> 01:34:14,780 just like don't touch the raw data. 1815 01:34:14,780 --> 01:34:15,780 Sentences happen to exist. 1816 01:34:15,780 --> 01:34:17,780 But in the raw data sets, 1817 01:34:17,780 --> 01:34:19,780 there are a lot of like in-betweens, 1818 01:34:19,780 --> 01:34:20,780 like what exactly is a sentence? 1819 01:34:20,780 --> 01:34:22,780 What isn't a sentence? 1820 01:34:22,780 --> 01:34:24,780 And so I think like it's really hard to define 1821 01:34:24,780 --> 01:34:26,780 what an actual sentence is 1822 01:34:26,780 --> 01:34:28,780 if you really like dig into it. 1823 01:34:28,780 --> 01:34:30,780 And there could be different concepts of it 1824 01:34:30,780 --> 01:34:31,780 in different languages or something like that. 1825 01:34:31,780 --> 01:34:33,780 So why even introduce the concept? 1826 01:34:33,780 --> 01:34:35,780 It doesn't honestly make sense to me. 1827 01:34:35,780 --> 01:34:37,780 I would just prefer to treat a file 1828 01:34:37,780 --> 01:34:40,780 as a giant stream of bytes. 1829 01:34:40,780 --> 01:34:41,780 It has a lot of treatment around 1830 01:34:41,780 --> 01:34:43,780 the rare word characters. 1831 01:34:43,780 --> 01:34:44,780 And when I say word, 1832 01:34:44,780 --> 01:34:45,780 I mean code points. 1833 01:34:45,780 --> 01:34:47,780 We're going to come back to this in a second. 1834 01:34:47,780 --> 01:34:48,780 And it has a lot of other rules 1835 01:34:48,780 --> 01:34:52,780 for basically splitting digits, 1836 01:34:52,780 --> 01:34:54,780 splitting white space and numbers 1837 01:34:54,780 --> 01:34:55,780 and how you deal with that. 1838 01:34:55,780 --> 01:34:58,780 So these are some kind of like merge rules. 1839 01:34:58,780 --> 01:34:59,780 So I think this is a little bit equivalent 1840 01:34:59,780 --> 01:35:02,780 to TikToken using the regular expression 1841 01:35:02,780 --> 01:35:04,780 to split up categories. 
1842 01:35:04,780 --> 01:35:07,780 There's like kind of equivalence of it 1843 01:35:07,780 --> 01:35:09,780 if you squint at it in sentence piece 1844 01:35:09,780 --> 01:35:10,780 where you can also, for example, 1845 01:35:10,780 --> 01:35:16,780 split up the digits and so on. 1846 01:35:16,780 --> 01:35:17,780 There's a few more things here 1847 01:35:17,780 --> 01:35:18,780 that I'll come back to in a bit. 1848 01:35:18,780 --> 01:35:19,780 And then there are some special tokens 1849 01:35:19,780 --> 01:35:20,780 that you can indicate. 1850 01:35:20,780 --> 01:35:23,780 And it hardcodes the UNK token, 1851 01:35:23,780 --> 01:35:24,780 the beginning of sentence, 1852 01:35:24,780 --> 01:35:25,780 end of sentence, 1853 01:35:25,780 --> 01:35:27,780 and a pad token. 1854 01:35:27,780 --> 01:35:29,780 And the UNK token must exist 1855 01:35:29,780 --> 01:35:31,780 from my understanding. 1856 01:35:31,780 --> 01:35:33,780 And then some systems things. 1857 01:35:33,780 --> 01:35:34,780 So we can train. 1858 01:35:34,780 --> 01:35:36,780 And when I press train, 1859 01:35:36,780 --> 01:35:38,780 it's going to create this file 1860 01:35:38,780 --> 01:35:39,780 talk400.model 1861 01:35:39,780 --> 01:35:41,780 and talk400.vocab. 1862 01:35:41,780 --> 01:35:43,780 I can then load the model file 1863 01:35:43,780 --> 01:35:46,780 and I can inspect the vocabulary of it. 1864 01:35:46,780 --> 01:35:49,780 And so we trained vocab size 400 1865 01:35:49,780 --> 01:35:52,780 on this text here. 1866 01:35:52,780 --> 01:35:54,780 And these are the individual pieces, 1867 01:35:54,780 --> 01:35:55,780 the individual tokens 1868 01:35:55,780 --> 01:35:57,780 that sentence piece will create. 1869 01:35:57,780 --> 01:35:58,780 So in the beginning, 1870 01:35:58,780 --> 01:36:00,780 we see that we have the UNK token 1871 01:36:00,780 --> 01:36:02,780 with the ID 0. 1872 01:36:02,780 --> 01:36:04,780 Then we have the beginning of sequence, 1873 01:36:04,780 --> 01:36:06,780 end of sequence, 1 and 2. 1874 01:36:06,780 --> 01:36:08,780 And then we said that the pad ID 1875 01:36:08,780 --> 01:36:09,780 is negative 1. 1876 01:36:09,780 --> 01:36:11,780 So we chose not to use it. 1877 01:36:11,780 --> 01:36:13,780 So there's no pad ID here. 1878 01:36:13,780 --> 01:36:17,780 Then these are individual byte tokens. 1879 01:36:17,780 --> 01:36:19,780 So here we saw that byte fallback 1880 01:36:19,780 --> 01:36:21,780 in Llama was turned on. 1881 01:36:21,780 --> 01:36:22,780 So it's true. 1882 01:36:22,780 --> 01:36:24,780 So what follows are going to be 1883 01:36:24,780 --> 01:36:27,780 the 256 byte tokens. 1884 01:36:27,780 --> 01:36:32,780 And these are their IDs. 1885 01:36:32,780 --> 01:36:34,780 And then at the bottom, 1886 01:36:34,780 --> 01:36:35,780 after the byte tokens, 1887 01:36:35,780 --> 01:36:38,780 come the merges. 1888 01:36:38,780 --> 01:36:41,780 And these are the parent nodes in the merges. 1889 01:36:41,780 --> 01:36:42,780 So we're not seeing the children. 1890 01:36:42,780 --> 01:36:45,780 We're just seeing the parents and their ID. 1891 01:36:45,780 --> 01:36:47,780 And then after the merges 1892 01:36:47,780 --> 01:36:51,780 comes eventually the individual tokens 1893 01:36:51,780 --> 01:36:52,780 and their IDs. 1894 01:36:52,780 --> 01:36:54,780 And so these are the individual tokens. 1895 01:36:54,780 --> 01:36:57,780 So these are the individual code point tokens, 1896 01:36:57,780 --> 01:36:58,780 if you will, 1897 01:36:58,780 --> 01:36:59,780 and they come at the end. 
1898 01:36:59,780 --> 01:37:00,780 So that is the ordering 1899 01:37:00,780 --> 01:37:01,780 with which sentence piece 1900 01:37:01,780 --> 01:37:03,780 sort of like represents its vocabularies. 1901 01:37:03,780 --> 01:37:05,780 It starts with special tokens, 1902 01:37:05,780 --> 01:37:06,780 then the byte tokens, 1903 01:37:06,780 --> 01:37:07,780 then the merge tokens, 1904 01:37:07,780 --> 01:37:10,780 and then the individual code point tokens. 1905 01:37:10,780 --> 01:37:13,780 And all these raw code point tokens 1906 01:37:13,780 --> 01:37:14,780 are the ones that it encountered 1907 01:37:14,780 --> 01:37:16,780 in the training set. 1908 01:37:16,780 --> 01:37:18,780 So those individual code points 1909 01:37:18,780 --> 01:37:21,780 are all the entire set of code points 1910 01:37:21,780 --> 01:37:24,780 that occurred here. 1911 01:37:24,780 --> 01:37:26,780 So those all get put in there. 1912 01:37:26,780 --> 01:37:28,780 And then those are extremely rare 1913 01:37:28,780 --> 01:37:30,780 as determined by character coverage. 1914 01:37:30,780 --> 01:37:31,780 So if a code point occurred 1915 01:37:31,780 --> 01:37:32,780 only a single time 1916 01:37:32,780 --> 01:37:34,780 out of like a million sentences 1917 01:37:34,780 --> 01:37:35,780 or something like that, 1918 01:37:35,780 --> 01:37:37,780 then it would be ignored. 1919 01:37:37,780 --> 01:37:41,780 And it would not be added to our vocabulary. 1920 01:37:41,780 --> 01:37:42,780 Once we have a vocabulary, 1921 01:37:42,780 --> 01:37:44,780 we can encode into IDs 1922 01:37:44,780 --> 01:37:47,780 and we can sort of get a list. 1923 01:37:47,780 --> 01:37:48,780 And then here, 1924 01:37:48,780 --> 01:37:52,780 I am also decoding the individual tokens 1925 01:37:52,780 --> 01:37:54,780 back into little pieces, 1926 01:37:54,780 --> 01:37:55,780 as they call it. 1927 01:37:55,780 --> 01:37:58,780 So let's take a look at what happened here. 1928 01:37:58,780 --> 01:37:59,780 Hello, space, 1929 01:37:59,780 --> 01:38:01,780 Annyeonghaseyo. 1930 01:38:01,780 --> 01:38:04,780 So these are the token IDs we got back. 1931 01:38:04,780 --> 01:38:06,780 And when we look here, 1932 01:38:06,780 --> 01:38:07,780 a few things 1933 01:38:07,780 --> 01:38:10,780 sort of jump to mind. 1934 01:38:10,780 --> 01:38:11,780 Number one, 1935 01:38:11,780 --> 01:38:13,780 take a look at these characters. 1936 01:38:13,780 --> 01:38:14,780 The Korean characters, of course, 1937 01:38:14,780 --> 01:38:16,780 were not part of the training set. 1938 01:38:16,780 --> 01:38:18,780 So sentence piece is encountering code points 1939 01:38:18,780 --> 01:38:21,780 that it has not seen during training time. 1940 01:38:21,780 --> 01:38:23,780 And those code points do not have 1941 01:38:23,780 --> 01:38:25,780 a token associated with them. 1942 01:38:25,780 --> 01:38:27,780 So suddenly these are unk tokens, 1943 01:38:27,780 --> 01:38:29,780 unknown tokens. 1944 01:38:29,780 --> 01:38:31,780 But because byte fallback is true, 1945 01:38:31,780 --> 01:38:32,780 instead, 1946 01:38:32,780 --> 01:38:35,780 sentence piece falls back to bytes. 1947 01:38:35,780 --> 01:38:36,780 And so it takes this, 1948 01:38:36,780 --> 01:38:38,780 it encodes it with UTF-8, 1949 01:38:38,780 --> 01:38:41,780 and then it uses these tokens 1950 01:38:41,780 --> 01:38:43,780 to represent those bytes. 1951 01:38:43,780 --> 01:38:46,780 And that's what we are getting sort of here. 
1952 01:38:46,780 --> 01:38:49,780 This is the UTF-8 encoding, 1953 01:38:49,780 --> 01:38:51,780 and it is shifted by three 1954 01:38:51,780 --> 01:38:55,780 because of these special tokens here 1955 01:38:55,780 --> 01:38:57,780 that have IDs earlier on. 1956 01:38:57,780 --> 01:38:59,780 So that's what happened here. 1957 01:38:59,780 --> 01:39:01,780 Now, one more thing that, 1958 01:39:01,780 --> 01:39:03,780 well, first before I go on, 1959 01:39:03,780 --> 01:39:05,780 with respect to the byte fallback, 1960 01:39:05,780 --> 01:39:08,780 let me remove byte fallback. 1961 01:39:08,780 --> 01:39:09,780 If this is false, 1962 01:39:09,780 --> 01:39:10,780 what's going to happen? 1963 01:39:10,780 --> 01:39:12,780 Let's retrain. 1964 01:39:12,780 --> 01:39:13,780 So the first thing that happened is 1965 01:39:13,780 --> 01:39:16,780 all of the byte tokens disappeared, right? 1966 01:39:16,780 --> 01:39:17,780 And now we just have the merges, 1967 01:39:17,780 --> 01:39:19,780 and we have a lot more merges now 1968 01:39:19,780 --> 01:39:20,780 because we have a lot more space 1969 01:39:20,780 --> 01:39:22,780 because we're not taking up space 1970 01:39:22,780 --> 01:39:25,780 in the vocab size with all the bytes. 1971 01:39:25,780 --> 01:39:28,780 And now if we encode this, 1972 01:39:28,780 --> 01:39:30,780 we get a zero. 1973 01:39:30,780 --> 01:39:32,780 So this entire string here, 1974 01:39:32,780 --> 01:39:34,780 suddenly there's no byte fallback. 1975 01:39:34,780 --> 01:39:36,780 So this is unknown, 1976 01:39:36,780 --> 01:39:38,780 and unknown is unk. 1977 01:39:38,780 --> 01:39:40,780 And so this is zero 1978 01:39:40,780 --> 01:39:43,780 because the unk token is token zero. 1979 01:39:43,780 --> 01:39:44,780 And you have to keep in mind 1980 01:39:44,780 --> 01:39:47,780 that this would feed into your language model. 1981 01:39:47,780 --> 01:39:48,780 So what is the language model supposed to do 1982 01:39:48,780 --> 01:39:50,780 when all kinds of different things 1983 01:39:50,780 --> 01:39:52,780 that are unrecognized because they're rare 1984 01:39:52,780 --> 01:39:54,780 just end up mapping into unk? 1985 01:39:54,780 --> 01:39:56,780 It's not exactly the property that you want. 1986 01:39:56,780 --> 01:39:58,780 So that's why I think Lama correctly 1987 01:39:58,780 --> 01:40:01,780 used byte fallback true 1988 01:40:01,780 --> 01:40:03,780 because we definitely want to feed these 1989 01:40:03,780 --> 01:40:05,780 unknown or rare code points 1990 01:40:05,780 --> 01:40:07,780 into the model in some manner. 1991 01:40:07,780 --> 01:40:10,780 The next thing I want to show you is the following. 1992 01:40:10,780 --> 01:40:12,780 Notice here when we are decoding 1993 01:40:12,780 --> 01:40:14,780 all the individual tokens. 1994 01:40:14,780 --> 01:40:17,780 You see how spaces, space here, 1995 01:40:17,780 --> 01:40:20,780 ends up being this bold underline. 1996 01:40:20,780 --> 01:40:21,780 I'm not 100% sure, by the way, 1997 01:40:21,780 --> 01:40:23,780 why sentence piece switches white space 1998 01:40:23,780 --> 01:40:26,780 into these bold underscore characters. 1999 01:40:26,780 --> 01:40:27,780 Maybe it's for visualization. 2000 01:40:27,780 --> 01:40:30,780 I'm not 100% sure why that happens. 2001 01:40:30,780 --> 01:40:31,780 But notice this. 2002 01:40:31,780 --> 01:40:34,780 Why do we have an extra space 2003 01:40:34,780 --> 01:40:38,780 in the front of hello? 2004 01:40:38,780 --> 01:40:40,780 Where is this coming from? 
2005 01:40:40,780 --> 01:40:45,780 Well, it's coming from this option here. 2006 01:40:45,780 --> 01:40:47,780 Add dummy prefix is true. 2007 01:40:47,780 --> 01:40:50,780 And when you go to the documentation, 2008 01:40:50,780 --> 01:40:52,780 add dummy white space at the beginning of text 2009 01:40:52,780 --> 01:40:54,780 in order to treat world in world 2010 01:40:54,780 --> 01:40:56,780 and hello world in the exact same way. 2011 01:40:56,780 --> 01:40:59,780 So what this is trying to do is the following. 2012 01:40:59,780 --> 01:41:01,780 If we go back to our tick tokenizer, 2013 01:41:01,780 --> 01:41:05,780 world as a token by itself 2014 01:41:05,780 --> 01:41:09,780 has a different ID than space world. 2015 01:41:09,780 --> 01:41:11,780 So we have this is 1917, 2016 01:41:11,780 --> 01:41:13,780 but this is 14, etc. 2017 01:41:13,780 --> 01:41:15,780 So these are two different tokens 2018 01:41:15,780 --> 01:41:16,780 for the language model. 2019 01:41:16,780 --> 01:41:18,780 And the language model has to learn from data 2020 01:41:18,780 --> 01:41:19,780 that they are actually kind of like 2021 01:41:19,780 --> 01:41:20,780 a very similar concept. 2022 01:41:20,780 --> 01:41:23,780 So to the language model in the tick token world, 2023 01:41:23,780 --> 01:41:26,780 basically words in the beginning of sentences 2024 01:41:26,780 --> 01:41:28,780 and words in the middle of sentences 2025 01:41:28,780 --> 01:41:30,780 actually look completely different. 2026 01:41:30,780 --> 01:41:33,780 And it has learned that they are roughly the same. 2027 01:41:33,780 --> 01:41:35,780 So this add dummy prefix 2028 01:41:35,780 --> 01:41:37,780 is trying to fight that a little bit. 2029 01:41:37,780 --> 01:41:39,780 And the way that works is that 2030 01:41:39,780 --> 01:41:43,780 it basically adds a dummy prefix. 2031 01:41:43,780 --> 01:41:47,780 So as a part of preprocessing, 2032 01:41:47,780 --> 01:41:50,780 it will take the string and it will add a space. 2033 01:41:50,780 --> 01:41:52,780 It will do this. 2034 01:41:52,780 --> 01:41:54,780 And that's done in an effort 2035 01:41:54,780 --> 01:41:56,780 to make this world and that world the same. 2036 01:41:56,780 --> 01:41:58,780 They will both be space world. 2037 01:41:58,780 --> 01:42:00,780 So that's one other 2038 01:42:00,780 --> 01:42:03,780 kind of preprocessing option that is turned on. 2039 01:42:03,780 --> 01:42:06,780 And Lama2 also uses this option. 2040 01:42:06,780 --> 01:42:08,780 And that's I think everything that I want to say 2041 01:42:08,780 --> 01:42:09,780 for my preview of sentence piece 2042 01:42:09,780 --> 01:42:11,780 and how it is different. 2043 01:42:11,780 --> 01:42:13,780 Maybe here what I've done is 2044 01:42:13,780 --> 01:42:17,780 I just put in the raw protocol buffer 2045 01:42:17,780 --> 01:42:20,780 representation basically of the tokenizer 2046 01:42:20,780 --> 01:42:22,780 that Lama2 trained. 2047 01:42:22,780 --> 01:42:24,780 So feel free to sort of step through this. 2048 01:42:24,780 --> 01:42:26,780 And if you would like your tokenization 2049 01:42:26,780 --> 01:42:29,780 to look identical to that of the meta Lama2, 2050 01:42:29,780 --> 01:42:31,780 then you would be copy pasting these settings 2051 01:42:31,780 --> 01:42:33,780 as I've tried to do up above. 2052 01:42:33,780 --> 01:42:36,780 And yeah, I think that's it for this section. 
2053 01:42:36,780 --> 01:42:38,780 I think my summary for sentence piece 2054 01:42:38,780 --> 01:42:40,780 from all this is number one, 2055 01:42:40,780 --> 01:42:42,780 I think that there's a lot of historical baggage 2056 01:42:42,780 --> 01:42:43,780 in sentence piece. 2057 01:42:43,780 --> 01:42:46,780 A lot of concepts that I think are slightly confusing 2058 01:42:46,780 --> 01:42:48,780 and I think potentially contain foot guns 2059 01:42:48,780 --> 01:42:50,780 like this concept of a sentence 2060 01:42:50,780 --> 01:42:52,780 and its maximum length and stuff like that. 2061 01:42:52,780 --> 01:42:56,780 Otherwise, it is fairly commonly used in the industry 2062 01:42:56,780 --> 01:42:58,780 because it is efficient and can do both training and training. 2063 01:42:58,780 --> 01:43:00,780 and can do both training and inference. 2064 01:43:00,780 --> 01:43:01,780 It has a few quirks. 2065 01:43:01,780 --> 01:43:02,780 Like for example, 2066 01:43:02,780 --> 01:43:03,780 unktoken must exist 2067 01:43:03,780 --> 01:43:05,780 and the way the byte fallbacks are done and so on 2068 01:43:05,780 --> 01:43:07,780 I don't find particularly elegant. 2069 01:43:07,780 --> 01:43:08,780 And unfortunately, I have to say 2070 01:43:08,780 --> 01:43:09,780 it's not very well documented. 2071 01:43:09,780 --> 01:43:13,780 So it took me a lot of time working with this myself 2072 01:43:13,780 --> 01:43:15,780 and just visualizing things 2073 01:43:15,780 --> 01:43:17,780 and try to really understand what is happening here 2074 01:43:17,780 --> 01:43:19,780 because the documentation unfortunately 2075 01:43:19,780 --> 01:43:21,780 is in my opinion not super amazing. 2076 01:43:21,780 --> 01:43:23,780 But it is a very nice repo 2077 01:43:23,780 --> 01:43:25,780 that is available to you 2078 01:43:25,780 --> 01:43:27,780 if you'd like to train your own tokenizer right now. 2079 01:43:27,780 --> 01:43:28,780 Okay. 2080 01:43:28,780 --> 01:43:29,780 I'll switch gears again 2081 01:43:29,780 --> 01:43:31,780 as we're starting to slowly wrap up here. 2082 01:43:31,780 --> 01:43:33,780 I want to revisit this issue in a bit more detail 2083 01:43:33,780 --> 01:43:35,780 of how we should set the vocab size 2084 01:43:35,780 --> 01:43:37,780 and what are some of the considerations around it. 2085 01:43:37,780 --> 01:43:39,780 So for this, 2086 01:43:39,780 --> 01:43:41,780 I'd like to go back to the model architecture 2087 01:43:41,780 --> 01:43:43,780 that we developed in the last video 2088 01:43:43,780 --> 01:43:45,780 when we built the GPT from scratch. 2089 01:43:45,780 --> 01:43:48,780 So this here was the file that we built in the previous video 2090 01:43:48,780 --> 01:43:50,780 and we defined the transformer model 2091 01:43:50,780 --> 01:43:52,780 and let's specifically look at vocab size 2092 01:43:52,780 --> 01:43:54,780 and where it appears in this file. 2093 01:43:54,780 --> 01:43:56,780 So here we define the vocab size. 2094 01:43:56,780 --> 01:43:57,780 At this time, 2095 01:43:57,780 --> 01:43:59,780 it was 65 or something like that, 2096 01:43:59,780 --> 01:44:00,780 extremely small number. 2097 01:44:00,780 --> 01:44:02,780 So this will grow much larger. 2098 01:44:02,780 --> 01:44:04,780 You'll see that vocab size doesn't come up too much 2099 01:44:04,780 --> 01:44:05,780 in most of these layers. 2100 01:44:05,780 --> 01:44:07,780 The only place that it comes up to 2101 01:44:07,780 --> 01:44:10,780 is in exactly these two places here. 
2102 01:44:10,780 --> 01:44:12,780 So when we define the language model, 2103 01:44:12,780 --> 01:44:14,780 there's the token embedding table 2104 01:44:14,780 --> 01:44:16,780 which is this two-dimensional array 2105 01:44:16,780 --> 01:44:19,780 where the vocab size is basically the number of rows 2106 01:44:19,780 --> 01:44:22,780 and each vocabulary element, 2107 01:44:22,780 --> 01:44:24,780 each token has a vector 2108 01:44:24,780 --> 01:44:26,780 that we're going to train using backpropagation. 2109 01:44:26,780 --> 01:44:28,780 That vector is of size and embed, 2110 01:44:28,780 --> 01:44:30,780 which is number of channels in the transformer. 2111 01:44:30,780 --> 01:44:31,780 And basically, 2112 01:44:31,780 --> 01:44:32,780 as vocab size increases, 2113 01:44:32,780 --> 01:44:33,780 this embedding table, 2114 01:44:33,780 --> 01:44:34,780 as I mentioned earlier, 2115 01:44:34,780 --> 01:44:35,780 is going to also grow. 2116 01:44:35,780 --> 01:44:37,780 We're going to be adding rows. 2117 01:44:37,780 --> 01:44:38,780 In addition to that, 2118 01:44:38,780 --> 01:44:40,780 at the end of the transformer, 2119 01:44:40,780 --> 01:44:41,780 there's this LM head layer, 2120 01:44:41,780 --> 01:44:43,780 which is a linear layer. 2121 01:44:43,780 --> 01:44:45,780 And you'll notice that that layer is used 2122 01:44:45,780 --> 01:44:47,780 at the very end to produce the logits, 2123 01:44:47,780 --> 01:44:49,780 which become the probabilities 2124 01:44:49,780 --> 01:44:50,780 for the next token in a sequence. 2125 01:44:50,780 --> 01:44:51,780 And so intuitively, 2126 01:44:51,780 --> 01:44:53,780 we're trying to produce a probability 2127 01:44:53,780 --> 01:44:55,780 for every single token 2128 01:44:55,780 --> 01:44:56,780 that might come next 2129 01:44:56,780 --> 01:44:59,780 at every point in time of that transformer. 2130 01:44:59,780 --> 01:45:01,780 And if we have more and more tokens, 2131 01:45:01,780 --> 01:45:03,780 we need to produce more and more probabilities. 2132 01:45:03,780 --> 01:45:04,780 So every single token 2133 01:45:04,780 --> 01:45:06,780 is going to introduce an additional dot product 2134 01:45:06,780 --> 01:45:09,780 that we have to do here in this linear layer 2135 01:45:09,780 --> 01:45:11,780 for this final layer in the transformer. 2136 01:45:11,780 --> 01:45:14,780 So why can't vocab size be infinite? 2137 01:45:14,780 --> 01:45:15,780 Why can't we grow to infinity? 2138 01:45:15,780 --> 01:45:16,780 Well, number one, 2139 01:45:16,780 --> 01:45:19,780 your token embedding table is going to grow. 2140 01:45:19,780 --> 01:45:22,780 Your linear layer is going to grow. 2141 01:45:22,780 --> 01:45:24,780 So we're going to be doing a lot more computation here 2142 01:45:24,780 --> 01:45:25,780 because this LM head layer 2143 01:45:25,780 --> 01:45:27,780 will become more competitionally expensive. 2144 01:45:27,780 --> 01:45:29,780 Number two, because we have more parameters, 2145 01:45:29,780 --> 01:45:31,780 we could be worried that we are going to be 2146 01:45:31,780 --> 01:45:34,780 under-training some of these parameters. 
2147 01:45:34,780 --> 01:45:36,780 So intuitively, 2148 01:45:36,780 --> 01:45:37,780 if you have a very large vocabulary size, 2149 01:45:37,780 --> 01:45:39,780 say we have a million tokens, 2150 01:45:39,780 --> 01:45:41,780 then every one of these tokens 2151 01:45:41,780 --> 01:45:43,780 is going to come up more and more rarely 2152 01:45:43,780 --> 01:45:44,780 in the training data 2153 01:45:44,780 --> 01:45:45,780 because there's a lot more other tokens 2154 01:45:45,780 --> 01:45:46,780 all over the place. 2155 01:45:46,780 --> 01:45:49,780 And so we're going to be seeing fewer and fewer examples 2156 01:45:49,780 --> 01:45:51,780 for each individual token. 2157 01:45:51,780 --> 01:45:53,780 And you might be worried that basically 2158 01:45:53,780 --> 01:45:54,780 the vectors 2159 01:45:54,780 --> 01:45:55,780 associated with every token 2160 01:45:55,780 --> 01:45:57,780 will be under-trained as a result 2161 01:45:57,780 --> 01:45:59,780 because they just don't come up too often 2162 01:45:59,780 --> 01:46:01,780 and they don't participate in the forward-backward pass. 2163 01:46:01,780 --> 01:46:02,780 In addition to that, 2164 01:46:02,780 --> 01:46:04,780 as your vocab size grows, 2165 01:46:04,780 --> 01:46:07,780 you're going to start shrinking your sequences a lot, right? 2166 01:46:07,780 --> 01:46:08,780 And that's really nice because 2167 01:46:08,780 --> 01:46:10,780 that means that we're going to be attending 2168 01:46:10,780 --> 01:46:11,780 to more and more text. 2169 01:46:11,780 --> 01:46:12,780 So that's nice. 2170 01:46:12,780 --> 01:46:13,780 But also you might be worrying 2171 01:46:13,780 --> 01:46:15,780 that too large of chunks 2172 01:46:15,780 --> 01:46:17,780 are being squished into single tokens. 2173 01:46:17,780 --> 01:46:19,780 And so the model just doesn't have 2174 01:46:19,780 --> 01:46:21,780 as much sort of time to think 2175 01:46:21,780 --> 01:46:23,780 per sort of 2176 01:46:23,780 --> 01:46:25,780 some number of characters in a text, 2177 01:46:25,780 --> 01:46:27,780 or you can think about it that way, right? 2178 01:46:27,780 --> 01:46:29,780 So basically we're squishing too much information 2179 01:46:29,780 --> 01:46:30,780 into a single token 2180 01:46:30,780 --> 01:46:32,780 and then the forward pass of the transformer 2181 01:46:32,780 --> 01:46:33,780 is not enough to actually process 2182 01:46:33,780 --> 01:46:35,780 that information appropriately. 2183 01:46:35,780 --> 01:46:36,780 And so these are some of the considerations 2184 01:46:36,780 --> 01:46:37,780 you should be thinking about 2185 01:46:37,780 --> 01:46:39,780 when you're designing the vocab size. 2186 01:46:39,780 --> 01:46:40,780 As I mentioned, this is mostly 2187 01:46:40,780 --> 01:46:41,780 an empirical hyperparameter. 2188 01:46:41,780 --> 01:46:42,780 And it seems like 2189 01:46:42,780 --> 01:46:44,780 in state-of-the-art architectures today, 2190 01:46:44,780 --> 01:46:46,780 this is usually in the high 10,000s 2191 01:46:46,780 --> 01:46:48,780 or somewhere around 100,000 today. 2192 01:46:48,780 --> 01:46:49,780 And the next consideration 2193 01:46:49,780 --> 01:46:51,780 I want to briefly talk about is 2194 01:46:51,780 --> 01:46:53,780 what if we want to take a pre-trained model 2195 01:46:53,780 --> 01:46:55,780 and we want to extend the vocab size? 2196 01:46:55,780 --> 01:46:57,780 And this is done fairly commonly actually.
2197 01:46:57,780 --> 01:46:58,780 So for example, 2198 01:46:58,780 --> 01:47:00,780 when you're doing fine-tuning for ChatGPT, 2199 01:47:00,780 --> 01:47:02,780 a lot more new special tokens 2200 01:47:02,780 --> 01:47:04,780 get introduced on top of the base model 2201 01:47:04,780 --> 01:47:06,780 to maintain the metadata 2202 01:47:06,780 --> 01:47:09,780 and all the structure of conversation objects 2203 01:47:09,780 --> 01:47:10,780 between the user and the system. 2204 01:47:10,780 --> 01:47:12,780 So that takes a lot of special tokens. 2205 01:47:12,780 --> 01:47:15,780 You might also try to throw in more special tokens, 2206 01:47:15,780 --> 01:47:16,780 for example, for using the browser 2207 01:47:16,780 --> 01:47:17,780 or any other tool. 2208 01:47:17,780 --> 01:47:20,780 And so it's very tempting to add a lot of tokens 2209 01:47:20,780 --> 01:47:22,780 for all kinds of special functionality. 2210 01:47:22,780 --> 01:47:24,780 So if you want to be adding a token, 2211 01:47:24,780 --> 01:47:25,780 that's totally possible, right? 2212 01:47:25,780 --> 01:47:28,780 All we have to do is we have to resize this embedding. 2213 01:47:28,780 --> 01:47:30,780 So we have to add rows. 2214 01:47:30,780 --> 01:47:32,780 We would initialize these parameters from scratch, 2215 01:47:32,780 --> 01:47:34,780 which would be small random numbers. 2216 01:47:34,780 --> 01:47:36,780 And then we have to extend the weight 2217 01:47:36,780 --> 01:47:38,780 inside this linear layer. 2218 01:47:38,780 --> 01:47:40,780 So we have to start making dot products 2219 01:47:40,780 --> 01:47:42,780 with the associated parameters as well 2220 01:47:42,780 --> 01:47:44,780 to basically calculate the probabilities 2221 01:47:44,780 --> 01:47:45,780 for these new tokens. 2222 01:47:45,780 --> 01:47:48,780 So both of these are just resizing operations. 2223 01:47:48,780 --> 01:47:50,780 It's a very mild model surgery 2224 01:47:50,780 --> 01:47:51,780 and can be done fairly easily. 2225 01:47:51,780 --> 01:47:53,780 And it's quite common that basically 2226 01:47:53,780 --> 01:47:54,780 you would freeze the base model. 2227 01:47:54,780 --> 01:47:56,780 You introduce these new parameters 2228 01:47:56,780 --> 01:47:58,780 and then you only train these new parameters 2229 01:47:58,780 --> 01:48:00,780 to introduce new tokens into the architecture. 2230 01:48:00,780 --> 01:48:03,780 And so you can freeze arbitrary parts of it 2231 01:48:03,780 --> 01:48:05,780 or you can train arbitrary parts of it. 2232 01:48:05,780 --> 01:48:06,780 And that's totally up to you. 2233 01:48:06,780 --> 01:48:08,780 But basically minor surgery is required 2234 01:48:08,780 --> 01:48:10,780 if you'd like to introduce new tokens. 2235 01:48:10,780 --> 01:48:12,780 And finally, I'd like to mention that actually 2236 01:48:12,780 --> 01:48:14,780 there's an entire design space of applications 2237 01:48:14,780 --> 01:48:17,780 in terms of introducing new tokens into a vocabulary 2238 01:48:17,780 --> 01:48:19,780 that go way beyond just adding special tokens 2239 01:48:19,780 --> 01:48:20,780 and special new functionality. 2240 01:48:20,780 --> 01:48:23,780 So just to give you a sense of the design space, 2241 01:48:23,780 --> 01:48:25,780 but this could be an entire video just by itself, 2242 01:48:25,780 --> 01:48:28,780 this is a paper on learning to compress prompts 2243 01:48:28,780 --> 01:48:30,780 with what they called GIST tokens.
2244 01:48:30,780 --> 01:48:32,780 And the rough idea is, 2245 01:48:32,780 --> 01:48:34,780 suppose that you're using language models 2246 01:48:34,780 --> 01:48:36,780 in a setting that requires very long prompts. 2247 01:48:36,780 --> 01:48:38,780 Well, these long prompts just slow everything down 2248 01:48:38,780 --> 01:48:39,780 because you have to encode them 2249 01:48:39,780 --> 01:48:40,780 and then you have to use them 2250 01:48:40,780 --> 01:48:42,780 and then you're attending over them 2251 01:48:42,780 --> 01:48:45,780 and it's just heavy to have very large prompts. 2252 01:48:45,780 --> 01:48:48,780 So instead, what they do here in this paper 2253 01:48:48,780 --> 01:48:50,780 is they introduce 2254 01:48:50,780 --> 01:48:52,780 new tokens. 2255 01:48:52,780 --> 01:48:55,780 And imagine basically having a few new tokens, 2256 01:48:55,780 --> 01:48:57,780 you put them in a sequence, 2257 01:48:57,780 --> 01:49:00,780 and then you train the model by distillation. 2258 01:49:00,780 --> 01:49:02,780 So you are keeping the entire model frozen 2259 01:49:02,780 --> 01:49:04,780 and you're only training the representations 2260 01:49:04,780 --> 01:49:06,780 of the new tokens, their embeddings, 2261 01:49:06,780 --> 01:49:08,780 and you're optimizing over the new tokens 2262 01:49:08,780 --> 01:49:10,780 such that the behavior of the language model 2263 01:49:10,780 --> 01:49:14,780 is identical to the model 2264 01:49:14,780 --> 01:49:17,780 that has a very long prompt that works for you. 2265 01:49:17,780 --> 01:49:18,780 And so it's a compression technique 2266 01:49:18,780 --> 01:49:20,780 of compressing that very long prompt 2267 01:49:20,780 --> 01:49:22,780 into those few new gist tokens. 2268 01:49:22,780 --> 01:49:24,780 And so you can train this and then at test time 2269 01:49:24,780 --> 01:49:25,780 you can discard your old prompt 2270 01:49:25,780 --> 01:49:27,780 and just swap in those tokens 2271 01:49:27,780 --> 01:49:29,780 and they sort of like stand in 2272 01:49:29,780 --> 01:49:30,780 for that very long prompt 2273 01:49:30,780 --> 01:49:32,780 and have an almost identical performance. 2274 01:49:32,780 --> 01:49:35,780 And so this is one technique 2275 01:49:35,780 --> 01:49:38,780 in a class of parameter-efficient fine-tuning techniques 2276 01:49:38,780 --> 01:49:40,780 where most of the model is basically fixed 2277 01:49:40,780 --> 01:49:42,780 and there's no training of the model weights, 2278 01:49:42,780 --> 01:49:44,780 there's no training of LoRA or anything like that 2279 01:49:44,780 --> 01:49:45,780 in terms of new parameters. 2280 01:49:45,780 --> 01:49:47,780 The parameters that you're training 2281 01:49:47,780 --> 01:49:49,780 are now just the token embeddings. 2282 01:49:49,780 --> 01:49:51,780 So that's just one example, 2283 01:49:51,780 --> 01:49:53,780 but this could again be like an entire video, 2284 01:49:53,780 --> 01:49:54,780 but just to give you a sense 2285 01:49:54,780 --> 01:49:55,780 that there's a whole design space here 2286 01:49:55,780 --> 01:49:57,780 that is potentially worth exploring in the future.
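To make the model surgery and the freeze-everything-else recipe concrete, here is a rough sketch in PyTorch; all of the names and sizes are illustrative and this is not the gist paper's actual code.

```python
# A rough sketch (PyTorch; illustrative only) of extending the vocabulary of a
# "pretrained" embedding/LM head and then training only the newly added rows.
import torch
import torch.nn as nn

old_vocab, n_new, n_embd = 50257, 8, 64
emb = nn.Embedding(old_vocab, n_embd)               # stand-in for pretrained embeddings
lm_head = nn.Linear(n_embd, old_vocab, bias=False)  # stand-in for pretrained output layer

# model surgery: add rows for the new tokens, copy the old weights over,
# and leave the new rows at their small random initialization
new_emb = nn.Embedding(old_vocab + n_new, n_embd)
new_head = nn.Linear(n_embd, old_vocab + n_new, bias=False)
with torch.no_grad():
    new_emb.weight[:old_vocab] = emb.weight
    new_head.weight[:old_vocab] = lm_head.weight

# weight_decay=0 so the frozen rows are not touched by the regularizer
optimizer = torch.optim.AdamW([new_emb.weight, new_head.weight], lr=1e-3, weight_decay=0.0)
idx = torch.randint(old_vocab, old_vocab + n_new, (4, 16))  # toy batch containing the new tokens
loss = new_head(new_emb(idx)).pow(2).mean()                 # stand-in for a real LM loss
loss.backward()
# "freeze" the original rows by zeroing their gradients before the update
new_emb.weight.grad[:old_vocab] = 0
new_head.weight.grad[:old_vocab] = 0
optimizer.step()
```

In a real setup you would of course use the full transformer and a proper language-modeling loss; the point is just that the operation is a resize plus a choice of which rows receive updates.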
2287 01:49:57,780 --> 01:49:59,780 The next thing I want to briefly address 2288 01:49:59,780 --> 01:50:01,780 is that I think recently there's a lot of momentum 2289 01:50:01,780 --> 01:50:04,780 in how you actually could construct transformers 2290 01:50:04,780 --> 01:50:05,780 that can simultaneously process 2291 01:50:05,780 --> 01:50:07,780 not just text as the input modality, 2292 01:50:07,780 --> 01:50:09,780 but a lot of other modalities. 2293 01:50:09,780 --> 01:50:12,780 So be it images, videos, audio, etc. 2294 01:50:12,780 --> 01:50:14,780 And how do you feed in all these modalities 2295 01:50:14,780 --> 01:50:16,780 and potentially predict these modalities 2296 01:50:16,780 --> 01:50:18,780 from a transformer? 2297 01:50:18,780 --> 01:50:19,780 Do you have to change the architecture 2298 01:50:19,780 --> 01:50:20,780 in some fundamental way? 2299 01:50:20,780 --> 01:50:21,780 And I think what a lot of people 2300 01:50:21,780 --> 01:50:22,780 are starting to converge towards 2301 01:50:22,780 --> 01:50:24,780 is that you're not changing the architecture, 2302 01:50:24,780 --> 01:50:25,780 you stick with the transformer, 2303 01:50:25,780 --> 01:50:28,780 you just kind of tokenize your input domains 2304 01:50:28,780 --> 01:50:29,780 and then call it a day 2305 01:50:29,780 --> 01:50:30,780 and pretend it's just text tokens 2306 01:50:30,780 --> 01:50:34,780 and just do everything else in an identical manner. 2307 01:50:34,780 --> 01:50:35,780 So here, for example, 2308 01:50:35,780 --> 01:50:37,780 there was an early paper that has a nice graphic 2309 01:50:37,780 --> 01:50:38,780 for how you can take an image 2310 01:50:38,780 --> 01:50:42,780 and you can chunk it up into integers. 2311 01:50:42,780 --> 01:50:44,780 And these sometimes... 2312 01:50:44,780 --> 01:50:45,780 So these would basically become 2313 01:50:45,780 --> 01:50:48,780 the tokens of images, as an example. 2314 01:50:48,780 --> 01:50:51,780 And these tokens can be hard tokens 2315 01:50:51,780 --> 01:50:53,780 where you force them to be integers. 2316 01:50:53,780 --> 01:50:55,780 They can also be soft tokens 2317 01:50:55,780 --> 01:50:58,780 where you sort of don't require 2318 01:50:58,780 --> 01:51:00,780 these to be discrete, 2319 01:51:00,780 --> 01:51:02,780 but you do force these representations 2320 01:51:02,780 --> 01:51:03,780 to go through bottlenecks, 2321 01:51:03,780 --> 01:51:05,780 like in autoencoders. 2322 01:51:05,780 --> 01:51:07,780 Also in this paper that came out from OpenAI, 2323 01:51:07,780 --> 01:51:10,780 Sora, which I think really 2324 01:51:10,780 --> 01:51:12,780 blew the minds of many people 2325 01:51:12,780 --> 01:51:13,780 and inspired a lot of people 2326 01:51:13,780 --> 01:51:14,780 in terms of what's possible, 2327 01:51:14,780 --> 01:51:15,780 they have a graphic here 2328 01:51:15,780 --> 01:51:17,780 and they talk briefly about how 2329 01:51:17,780 --> 01:51:19,780 LLMs have text tokens, 2330 01:51:19,780 --> 01:51:21,780 Sora has visual patches. 2331 01:51:21,780 --> 01:51:22,780 So again, they came up with a way 2332 01:51:22,780 --> 01:51:25,780 to chunk videos into basically tokens 2333 01:51:25,780 --> 01:51:27,780 with their own vocabularies. 2334 01:51:27,780 --> 01:51:29,780 And then you can either process discrete tokens, 2335 01:51:29,780 --> 01:51:30,780 say, with autoregressive models 2336 01:51:30,780 --> 01:51:33,780 or even soft tokens with diffusion models.
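To give a feel for what those hard tokens are, here is a tiny sketch of the vector-quantization step; it is purely illustrative and not the code of any particular paper.

```python
# A tiny sketch of "hard tokens" for images: snap continuous patch embeddings to
# the nearest entry of a codebook, so every patch becomes an integer token id.
# (Illustrative only; VQ-VAE-style models add an encoder, a decoder, and extra
# training losses around this step.)
import torch

codebook = torch.randn(1024, 64)        # 1024 possible "image tokens", 64 dims each
patches = torch.randn(16, 64)           # continuous embeddings for 16 image patches

dists = torch.cdist(patches, codebook)  # (16, 1024) distances to every codebook entry
image_tokens = dists.argmin(dim=1)      # (16,) integer ids in [0, 1024)
print(image_tokens)                     # these ids can be fed to a transformer just like text tokens
```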
2337 01:51:33,780 --> 01:51:36,780 And all of that is sort of 2338 01:51:36,780 --> 01:51:38,780 being actively worked on, designed on, 2339 01:51:38,780 --> 01:51:39,780 and it's beyond the scope of this video, 2340 01:51:39,780 --> 01:51:41,780 but just something I wanted to mention briefly. 2341 01:51:41,780 --> 01:51:43,780 Okay, now that we have gone quite deep 2342 01:51:43,780 --> 01:51:45,780 into the tokenization algorithm 2343 01:51:45,780 --> 01:51:46,780 and we understand a lot more 2344 01:51:46,780 --> 01:51:47,780 about how it works, 2345 01:51:47,780 --> 01:51:48,780 let's loop back around 2346 01:51:48,780 --> 01:51:49,780 to the beginning of this video 2347 01:51:49,780 --> 01:51:51,780 and go through some of these bullet points 2348 01:51:51,780 --> 01:51:53,780 and really see why they happen. 2349 01:51:53,780 --> 01:51:54,780 So first of all, 2350 01:51:54,780 --> 01:51:57,780 why can't my LLM spell words very well 2351 01:51:57,780 --> 01:52:00,780 or do other spelling-related tasks? 2352 01:52:00,780 --> 01:52:02,780 So fundamentally, this is because, 2353 01:52:02,780 --> 01:52:04,780 as we saw, these characters 2354 01:52:04,780 --> 01:52:06,780 are chunked up into tokens 2355 01:52:06,780 --> 01:52:07,780 and some of these tokens 2356 01:52:07,780 --> 01:52:09,780 are actually fairly long. 2357 01:52:09,780 --> 01:52:10,780 So as an example, 2358 01:52:10,780 --> 01:52:12,780 I went to the GPT-4 vocabulary 2359 01:52:12,780 --> 01:52:14,780 and I looked at one of the longer tokens. 2360 01:52:14,780 --> 01:52:16,780 So .DefaultCellStyle 2361 01:52:16,780 --> 01:52:18,780 turns out to be a single individual token. 2362 01:52:18,780 --> 01:52:19,780 So that's a lot of characters 2363 01:52:19,780 --> 01:52:20,780 for a single token. 2364 01:52:20,780 --> 01:52:22,780 So my suspicion is that 2365 01:52:22,780 --> 01:52:23,780 there's just too much crammed 2366 01:52:23,780 --> 01:52:24,780 into this single token. 2367 01:52:24,780 --> 01:52:26,780 And my suspicion was that 2368 01:52:26,780 --> 01:52:27,780 the model should not be very good 2369 01:52:27,780 --> 01:52:30,780 at tasks related to spelling 2370 01:52:30,780 --> 01:52:33,780 of this single token. 2371 01:52:33,780 --> 01:52:34,780 So I asked, 2372 01:52:34,780 --> 01:52:35,780 how many letters L 2373 01:52:35,780 --> 01:52:38,780 are there in the word .DefaultCellStyle? 2374 01:52:38,780 --> 01:52:39,780 And of course, 2375 01:52:39,780 --> 01:52:43,780 my prompt is intentionally done that way. 2376 01:52:43,780 --> 01:52:44,780 And you see how .DefaultCellStyle 2377 01:52:44,780 --> 01:52:45,780 will be a single token. 2378 01:52:45,780 --> 01:52:47,780 So this is what the model sees. 2379 01:52:47,780 --> 01:52:48,780 So my suspicion is that 2380 01:52:48,780 --> 01:52:49,780 it wouldn't be very good at this. 2381 01:52:49,780 --> 01:52:51,780 And indeed, it is not. 2382 01:52:51,780 --> 01:52:52,780 It doesn't actually know 2383 01:52:52,780 --> 01:52:53,780 how many Ls are in there. 2384 01:52:53,780 --> 01:52:54,780 It thinks there are three 2385 01:52:54,780 --> 01:52:56,780 and actually there are four, 2386 01:52:56,780 --> 01:52:58,780 if I'm not getting this wrong myself. 2387 01:52:58,780 --> 01:53:00,780 So that didn't go extremely well. 2388 01:53:00,780 --> 01:53:02,780 Let's look at another 2389 01:53:02,780 --> 01:53:04,780 kind of character-level task.
2390 01:53:04,780 --> 01:53:05,780 So for example, 2391 01:53:05,780 --> 01:53:07,780 here I asked GPT-4 2392 01:53:07,780 --> 01:53:10,780 to reverse the string .DefaultCellStyle 2393 01:53:10,780 --> 01:53:12,780 and it tried to use a code interpreter. 2394 01:53:12,780 --> 01:53:13,780 And I stopped it 2395 01:53:13,780 --> 01:53:14,780 and I said, just do it. 2396 01:53:14,780 --> 01:53:15,780 Just try it. 2397 01:53:15,780 --> 01:53:17,780 And it gave me a jumble. 2398 01:53:17,780 --> 01:53:20,780 So it doesn't actually really know 2399 01:53:20,780 --> 01:53:21,780 how to reverse this string 2400 01:53:21,780 --> 01:53:23,780 going from right to left. 2401 01:53:23,780 --> 01:53:25,780 So it gave the wrong result. 2402 01:53:25,780 --> 01:53:28,780 So again, like working with this working hypothesis 2403 01:53:28,780 --> 01:53:30,780 that maybe this is due to the tokenization, 2404 01:53:30,780 --> 01:53:31,780 I tried a different approach. 2405 01:53:31,780 --> 01:53:32,780 I said, okay, 2406 01:53:32,780 --> 01:53:34,780 let's reverse the exact same string, 2407 01:53:34,780 --> 01:53:36,780 but take the following approach. 2408 01:53:36,780 --> 01:53:37,780 Step one, 2409 01:53:37,780 --> 01:53:38,780 just print out every single character 2410 01:53:38,780 --> 01:53:39,780 separated by spaces. 2411 01:53:39,780 --> 01:53:40,780 And then as a step two, 2412 01:53:40,780 --> 01:53:42,780 reverse that list. 2413 01:53:42,780 --> 01:53:43,780 And it again tried to use a tool, 2414 01:53:43,780 --> 01:53:44,780 but when I 2415 01:53:44,780 --> 01:53:45,780 stopped it, 2416 01:53:45,780 --> 01:53:47,780 it first produced all the characters 2417 01:53:47,780 --> 01:53:49,780 and that was actually correct. 2418 01:53:49,780 --> 01:53:50,780 And then it reversed them 2419 01:53:50,780 --> 01:53:51,780 and that was correct 2420 01:53:51,780 --> 01:53:52,780 once it had this. 2421 01:53:52,780 --> 01:53:54,780 So somehow it can't reverse it directly. 2422 01:53:54,780 --> 01:53:56,780 But when you go just first, 2423 01:53:56,780 --> 01:53:57,780 you know, 2424 01:53:57,780 --> 01:53:58,780 listing it out in order, 2425 01:53:58,780 --> 01:53:59,780 it can do that somehow. 2426 01:53:59,780 --> 01:54:00,780 And then it can, 2427 01:54:00,780 --> 01:54:02,780 once it's broken up this way, 2428 01:54:02,780 --> 01:54:04,780 this becomes all these individual characters. 2429 01:54:04,780 --> 01:54:06,780 And so now this is much easier 2430 01:54:06,780 --> 01:54:08,780 for it to see these individual tokens 2431 01:54:08,780 --> 01:54:10,780 and reverse them and print them out. 2432 01:54:10,780 --> 01:54:13,780 So that is kind of interesting. 2433 01:54:13,780 --> 01:54:15,780 So let's continue now. 2434 01:54:15,780 --> 01:54:19,780 Why are LLMs worse at non-English languages? 2435 01:54:19,780 --> 01:54:21,780 And I briefly covered this already, 2436 01:54:21,780 --> 01:54:22,780 but basically, 2437 01:54:22,780 --> 01:54:24,780 it's not only that the language model 2438 01:54:24,780 --> 01:54:26,780 sees less non-English data 2439 01:54:26,780 --> 01:54:28,780 during training of the model parameters, 2440 01:54:28,780 --> 01:54:30,780 but also the tokenizer 2441 01:54:30,780 --> 01:54:33,780 is not sufficiently trained 2442 01:54:33,780 --> 01:54:35,780 on non-English data. 2443 01:54:35,780 --> 01:54:36,780 And so here, for example, 2444 01:54:36,780 --> 01:54:39,780 hello, how are you is five tokens 2445 01:54:39,780 --> 01:54:41,780 and its translation is 15 tokens.
2446 01:54:41,780 --> 01:54:42,780 So this is a three times blow-up 2447 01:54:42,780 --> 01:54:44,780 and so, for example, 2448 01:54:44,780 --> 01:54:45,780 안녕하세요 is just hello, 2449 01:54:45,780 --> 01:54:46,780 basically, in Korean. 2450 01:54:46,780 --> 01:54:48,780 And that ends up being three tokens. 2451 01:54:48,780 --> 01:54:49,780 I'm actually kind of surprised by that 2452 01:54:49,780 --> 01:54:51,780 because that is a very common phrase. 2453 01:54:51,780 --> 01:54:52,780 It's just a typical greeting 2454 01:54:52,780 --> 01:54:53,780 of like, hello. 2455 01:54:53,780 --> 01:54:54,780 And that ends up being three tokens, 2456 01:54:54,780 --> 01:54:56,780 whereas our hello is a single token. 2457 01:54:56,780 --> 01:54:57,780 And so basically everything 2458 01:54:57,780 --> 01:54:59,780 is a lot more bloated and diffuse. 2459 01:54:59,780 --> 01:55:00,780 And this is, I think, 2460 01:55:00,780 --> 01:55:02,780 partly the reason that the model works 2461 01:55:02,780 --> 01:55:04,780 worse on other languages. 2462 01:55:04,780 --> 01:55:05,780 Coming back, 2463 01:55:05,780 --> 01:55:08,780 why is the LLM bad at simple arithmetic? 2464 01:55:08,780 --> 01:55:09,780 That has something to do 2465 01:55:09,780 --> 01:55:10,780 with the fact that 2466 01:55:10,780 --> 01:55:11,780 numbers get chunked up 2467 01:55:11,780 --> 01:55:12,780 fairly arbitrarily. 2468 01:55:12,780 --> 01:55:13,780 And so that has to do 2469 01:55:13,780 --> 01:55:16,780 with the tokenization of numbers. 2470 01:55:16,780 --> 01:55:18,780 And so you'll notice that, 2471 01:55:18,780 --> 01:55:19,780 for example, 2472 01:55:19,780 --> 01:55:21,780 addition is very sort of like, 2473 01:55:21,780 --> 01:55:22,780 there's an algorithm 2474 01:55:22,780 --> 01:55:23,780 that is like character level 2475 01:55:23,780 --> 01:55:25,780 for doing addition. 2476 01:55:25,780 --> 01:55:26,780 So for example, 2477 01:55:26,780 --> 01:55:27,780 here we would first add the ones 2478 01:55:27,780 --> 01:55:28,780 and then the tens 2479 01:55:28,780 --> 01:55:29,780 and then the hundreds. 2480 01:55:29,780 --> 01:55:30,780 You have to refer 2481 01:55:30,780 --> 01:55:32,780 to specific parts of these digits. 2482 01:55:32,780 --> 01:55:34,780 But these numbers 2483 01:55:34,780 --> 01:55:36,780 are represented completely arbitrarily 2484 01:55:36,780 --> 01:55:37,780 based on whatever happened 2485 01:55:37,780 --> 01:55:38,780 to merge or not merge 2486 01:55:38,780 --> 01:55:40,780 during the tokenization process. 2487 01:55:40,780 --> 01:55:41,780 There's an entire blog post 2488 01:55:41,780 --> 01:55:42,780 that's been published 2489 01:55:42,780 --> 01:55:43,780 about this 2490 01:55:43,780 --> 01:55:44,780 that I think 2491 01:55:44,780 --> 01:55:45,780 is quite good, 2492 01:55:45,780 --> 01:55:46,780 called Integer tokenization is insane. 2493 01:55:46,780 --> 01:55:47,780 And this person basically 2494 01:55:47,780 --> 01:55:48,780 systematically explores 2495 01:55:48,780 --> 01:55:49,780 the tokenization of numbers 2496 01:55:49,780 --> 01:55:50,780 in, I believe, 2497 01:55:50,780 --> 01:55:51,780 this is GPT-2.
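The kind of probe that post runs looks roughly like this; a sketch assuming the tiktoken package is installed, with the GPT-2 vocabulary available there under the name "gpt2".

```python
# A sketch of probing how numbers get chunked into tokens (assumes tiktoken is
# installed; the exact splits depend entirely on the vocabulary you load).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in ["1000", "2024", "3781", "9999", "123456"]:
    ids = enc.encode(n)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{n}: {len(ids)} token(s) -> {pieces}")
```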
2498 01:55:51,780 --> 01:55:52,780 And so they noticed 2499 01:55:52,780 --> 01:55:53,780 that, for example, 2500 01:55:53,780 --> 01:55:56,780 for four-digit numbers, 2501 01:55:56,780 --> 01:55:57,780 you can take a look at 2502 01:55:57,780 --> 01:55:59,780 whether it is a single token 2503 01:55:59,780 --> 01:56:01,780 or whether it is two tokens 2504 01:56:01,780 --> 01:56:02,780 that is a one-three 2505 01:56:02,780 --> 01:56:03,780 or a two-two 2506 01:56:03,780 --> 01:56:04,780 or a three-one combination. 2507 01:56:04,780 --> 01:56:05,780 And so all the different numbers 2508 01:56:05,780 --> 01:56:07,780 are all the different combinations. 2509 01:56:07,780 --> 01:56:08,780 And you can imagine 2510 01:56:08,780 --> 01:56:10,780 this is all completely arbitrarily so. 2511 01:56:10,780 --> 01:56:11,780 And the model, unfortunately, 2512 01:56:11,780 --> 01:56:14,780 sometimes sees a token 2513 01:56:14,780 --> 01:56:15,780 for all four digits, 2514 01:56:15,780 --> 01:56:16,780 sometimes for three, 2515 01:56:16,780 --> 01:56:17,780 sometimes for two, 2516 01:56:17,780 --> 01:56:18,780 sometimes for one. 2517 01:56:18,780 --> 01:56:21,780 And it's in an arbitrary manner. 2518 01:56:21,780 --> 01:56:22,780 And so this is definitely 2519 01:56:22,780 --> 01:56:24,780 a headwind, if you will, 2520 01:56:24,780 --> 01:56:25,780 for the language model. 2521 01:56:25,780 --> 01:56:26,780 And it's kind of incredible 2522 01:56:26,780 --> 01:56:27,780 that it can kind of do it 2523 01:56:27,780 --> 01:56:28,780 and deal with it. 2524 01:56:28,780 --> 01:56:30,780 But it's also kind of not ideal. 2525 01:56:30,780 --> 01:56:31,780 And so that's why, for example, 2526 01:56:31,780 --> 01:56:32,780 we saw that Meta, 2527 01:56:32,780 --> 01:56:33,780 when they trained 2528 01:56:33,780 --> 01:56:34,780 the Llama 2 tokenizer 2529 01:56:34,780 --> 01:56:35,780 with sentencepiece, 2530 01:56:35,780 --> 01:56:37,780 they made sure to split up 2531 01:56:37,780 --> 01:56:39,780 all the digits 2532 01:56:39,780 --> 01:56:42,780 as an example, for Llama 2. 2533 01:56:42,780 --> 01:56:44,780 And this is partly to improve 2534 01:56:44,780 --> 01:56:45,780 simple arithmetic 2535 01:56:45,780 --> 01:56:47,780 performance. 2536 01:56:47,780 --> 01:56:48,780 And finally, 2537 01:56:48,780 --> 01:56:49,780 why is GPT-2 2538 01:56:49,780 --> 01:56:51,780 not as good at Python? 2539 01:56:51,780 --> 01:56:52,780 Again, this is partly 2540 01:56:52,780 --> 01:56:53,780 a modeling issue 2541 01:56:53,780 --> 01:56:54,780 in the architecture 2542 01:56:54,780 --> 01:56:55,780 and the data set 2543 01:56:55,780 --> 01:56:56,780 and the strength of the model, 2544 01:56:56,780 --> 01:56:58,780 but it's also partly tokenization. 2545 01:56:58,780 --> 01:56:59,780 Because as we saw here 2546 01:56:59,780 --> 01:57:01,780 with the simple Python example, 2547 01:57:01,780 --> 01:57:03,780 the encoding efficiency 2548 01:57:03,780 --> 01:57:04,780 of the tokenizer 2549 01:57:04,780 --> 01:57:05,780 for handling spaces in Python 2550 01:57:05,780 --> 01:57:06,780 is terrible. 2551 01:57:06,780 --> 01:57:07,780 And every single space 2552 01:57:07,780 --> 01:57:08,780 is an individual token. 2553 01:57:08,780 --> 01:57:09,780 And this dramatically 2554 01:57:09,780 --> 01:57:10,780 reduces the context length 2555 01:57:10,780 --> 01:57:12,780 that the model can attend across. 2556 01:57:12,780 --> 01:57:13,780 So that's almost like 2557 01:57:13,780 --> 01:57:15,780 a tokenization bug for GPT-2.
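You can see that whitespace behavior yourself with a quick check; a sketch assuming tiktoken is installed, where the GPT-2 and GPT-4 vocabularies go by the names "gpt2" and "cl100k_base".

```python
# A quick comparison (assumes tiktoken is installed) of how many tokens a small
# indented Python snippet costs under the GPT-2 vs the GPT-4 vocabulary.
import tiktoken

snippet = "def hello():\n        print('hello world')\n"
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(snippet))} tokens")
```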
2558 01:57:15,780 --> 01:57:18,780 And that was later fixed with GPT-4. 2559 01:57:18,780 --> 01:57:20,780 Okay, so here's another fun one. 2560 01:57:20,780 --> 01:57:21,780 My LLM abruptly halts 2561 01:57:21,780 --> 01:57:24,780 when it sees the string end of text. 2562 01:57:24,780 --> 01:57:27,780 So here's a very strange behavior. 2563 01:57:27,780 --> 01:57:28,780 Print a string end of text 2564 01:57:28,780 --> 01:57:30,780 is what I told GPT-4. 2565 01:57:30,780 --> 01:57:31,780 And it says, 2566 01:57:31,780 --> 01:57:33,780 could you please specify the string? 2567 01:57:33,780 --> 01:57:34,780 And I'm telling it, 2568 01:57:34,780 --> 01:57:35,780 give me end of text. 2569 01:57:35,780 --> 01:57:37,780 And it seems like there's an issue. 2570 01:57:37,780 --> 01:57:39,780 It's not seeing end of text. 2571 01:57:39,780 --> 01:57:40,780 And then I give it, 2572 01:57:40,780 --> 01:57:42,780 end of text is the string. 2573 01:57:42,780 --> 01:57:43,780 And then here's the string. 2574 01:57:43,780 --> 01:57:45,780 And then it just doesn't print it. 2575 01:57:45,780 --> 01:57:46,780 So obviously something is breaking here 2576 01:57:46,780 --> 01:57:47,780 with respect to the handling 2577 01:57:47,780 --> 01:57:48,780 of the special token. 2578 01:57:48,780 --> 01:57:49,780 And I didn't actually know 2579 01:57:49,780 --> 01:57:50,780 what OpenAI is doing 2580 01:57:50,780 --> 01:57:52,780 under the hood here 2581 01:57:52,780 --> 01:57:54,780 and whether they are potentially parsing this 2582 01:57:54,780 --> 01:57:58,780 as the actual <|endoftext|> special token 2583 01:57:58,780 --> 01:58:02,780 instead of this just being end of text 2584 01:58:02,780 --> 01:58:04,780 as like individual sort of pieces of it 2585 01:58:04,780 --> 01:58:07,780 without the special token handling logic. 2586 01:58:07,780 --> 01:58:08,780 And so it might be 2587 01:58:08,780 --> 01:58:09,780 that someone, 2588 01:58:09,780 --> 01:58:11,780 when they're calling .encode, 2589 01:58:11,780 --> 01:58:13,780 they are passing in allowed_special 2590 01:58:13,780 --> 01:58:15,780 and they are allowing end of text 2591 01:58:15,780 --> 01:58:18,780 as a special token in the user prompt. 2592 01:58:18,780 --> 01:58:19,780 But the user prompt, 2593 01:58:19,780 --> 01:58:20,780 of course, 2594 01:58:20,780 --> 01:58:22,780 is a sort of attacker controlled text. 2595 01:58:22,780 --> 01:58:25,780 So you would hope that they don't really parse 2596 01:58:25,780 --> 01:58:26,780 or use special tokens 2597 01:58:26,780 --> 01:58:27,780 or, you know, 2598 01:58:27,780 --> 01:58:29,780 from that kind of input. 2599 01:58:29,780 --> 01:58:30,780 But it appears that there's something 2600 01:58:30,780 --> 01:58:31,780 definitely going wrong here. 2601 01:58:31,780 --> 01:58:33,780 And so your knowledge 2602 01:58:33,780 --> 01:58:35,780 of these special tokens 2603 01:58:35,780 --> 01:58:37,780 ends up being an attack surface potentially. 2604 01:58:37,780 --> 01:58:40,780 And so if you'd like to confuse LLMs, 2605 01:58:40,780 --> 01:58:43,780 then just try to give them some special tokens 2606 01:58:43,780 --> 01:58:45,780 and see if you're breaking something by chance. 2607 01:58:45,780 --> 01:58:48,780 Okay, so this next one is a really fun one. 2608 01:58:48,780 --> 01:58:51,780 The trailing whitespace issue. 2609 01:58:51,780 --> 01:58:53,780 So if you come to Playground 2610 01:58:53,780 --> 01:58:57,780 and we come here to GPT-3.5-turbo-instruct. 2611 01:58:57,780 --> 01:58:58,780 So this is not a chat model.
2612 01:58:58,780 --> 01:59:00,780 This is a completion model. 2613 01:59:00,780 --> 01:59:01,780 So think of it more like 2614 01:59:01,780 --> 01:59:03,780 it's a lot closer to a base model. 2615 01:59:03,780 --> 01:59:05,780 It does completion. 2616 01:59:05,780 --> 01:59:07,780 It will continue the token sequence. 2617 01:59:07,780 --> 01:59:09,780 So here's a tagline for ice cream shop 2618 01:59:09,780 --> 01:59:11,780 and we want to continue the sequence. 2619 01:59:11,780 --> 01:59:13,780 And so we can submit 2620 01:59:13,780 --> 01:59:14,780 and get a bunch of tokens. 2621 01:59:14,780 --> 01:59:16,780 Okay, no problem. 2622 01:59:16,780 --> 01:59:18,780 But now suppose I do this, 2623 01:59:18,780 --> 01:59:21,780 but instead of pressing submit here, 2624 01:59:21,780 --> 01:59:24,780 I do here's a tagline for ice cream shop space. 2625 01:59:24,780 --> 01:59:26,780 So I have a space here 2626 01:59:26,780 --> 01:59:28,780 before I click submit. 2627 01:59:28,780 --> 01:59:30,780 We get a warning. 2628 01:59:30,780 --> 01:59:32,780 Your text ends in a trailing space, 2629 01:59:32,780 --> 01:59:33,780 which causes worse performance 2630 01:59:33,780 --> 01:59:36,780 due to how the API splits text into tokens. 2631 01:59:36,780 --> 01:59:37,780 So what's happening here? 2632 01:59:37,780 --> 01:59:40,780 It still gave us a sort of completion here, 2633 01:59:40,780 --> 01:59:43,780 but let's take a look at what's happening. 2634 01:59:43,780 --> 01:59:45,780 So here's a tagline for an ice cream shop. 2635 01:59:45,780 --> 01:59:48,780 And then what does this look like 2636 01:59:48,780 --> 01:59:49,780 in the actual training data? 2637 01:59:49,780 --> 01:59:51,780 Suppose you found the completion 2638 01:59:51,780 --> 01:59:52,780 in the training documents 2639 01:59:52,780 --> 01:59:53,780 somewhere on the internet 2640 01:59:53,780 --> 01:59:55,780 and the LLM trained on this data. 2641 01:59:55,780 --> 01:59:57,780 So maybe it's something like, 2642 01:59:57,780 --> 01:59:59,780 oh yeah, maybe that's the tagline. 2643 01:59:59,780 --> 02:00:00,780 That's a terrible tagline. 2644 02:00:00,780 --> 02:00:03,780 But notice here that for this O, 2645 02:00:03,780 --> 02:00:06,780 you see that 2646 02:00:06,780 --> 02:00:08,780 the space character is always a prefix 2647 02:00:08,780 --> 02:00:10,780 to these tokens in GPT. 2648 02:00:10,780 --> 02:00:12,780 So it's not an O token. 2649 02:00:12,780 --> 02:00:13,780 It's a space O token. 2650 02:00:13,780 --> 02:00:15,780 The space is part of the O 2651 02:00:15,780 --> 02:00:18,780 and together they are token 8840. 2652 02:00:18,780 --> 02:00:20,780 That's space O. 2653 02:00:20,780 --> 02:00:22,780 So what's happening here is that 2654 02:00:22,780 --> 02:00:24,780 when I just have it like this 2655 02:00:24,780 --> 02:00:27,780 and I let it complete the next token, 2656 02:00:27,780 --> 02:00:30,780 it can sample the space O token. 2657 02:00:30,780 --> 02:00:33,780 But instead, if I have this and I add my space, 2658 02:00:33,780 --> 02:00:35,780 then what I'm doing here when I encode this string, 2659 02:00:35,780 --> 02:00:37,780 is I have basically, 2660 02:00:37,780 --> 02:00:39,780 here's the tagline for an ice cream shop 2661 02:00:39,780 --> 02:00:43,780 and this space at the very end becomes a token 220.
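You can reproduce that split with a quick check; a sketch assuming the tiktoken package is installed, using the GPT-2/GPT-3-era vocabulary, so compare the printed ids with the numbers discussed here.

```python
# A quick check (assumes tiktoken is installed) of the trailing-space situation:
# the same prompt with and without a trailing space, and how a leading space
# normally attaches to the following word rather than standing alone.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
without = enc.encode("Here's a tagline for an ice cream shop")
with_space = enc.encode("Here's a tagline for an ice cream shop ")
print(without)
print(with_space)  # same prefix, plus one extra token for the lone trailing space
print(enc.encode(" Oh"), [enc.decode([i]) for i in enc.encode(" Oh")])  # the space does not appear as its own token here
```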
2662 02:00:43,780 --> 02:00:46,780 And so we've added token 220 2663 02:00:46,780 --> 02:00:49,780 and this token otherwise would be part of the tagline 2664 02:00:49,780 --> 02:00:51,780 because if there actually is a tagline here, 2665 02:00:51,780 --> 02:00:54,780 so space O is the token. 2666 02:00:54,780 --> 02:00:57,780 And so this is suddenly out of distribution for the model 2667 02:00:57,780 --> 02:01:00,780 because this space is part of the next token, 2668 02:01:00,780 --> 02:01:02,780 but we're putting it here like this 2669 02:01:02,780 --> 02:01:05,780 and the model has seen very, very little 2670 02:01:05,780 --> 02:01:09,780 data of actual space by itself. 2671 02:01:09,780 --> 02:01:11,780 And we're asking it to complete the sequence, 2672 02:01:11,780 --> 02:01:12,780 like add in more tokens. 2673 02:01:12,780 --> 02:01:15,780 But the problem is that we've sort of begun the first token 2674 02:01:15,780 --> 02:01:17,780 and now it's been split up 2675 02:01:17,780 --> 02:01:19,780 and now we're out of distribution 2676 02:01:19,780 --> 02:01:21,780 and now arbitrary bad things happen. 2677 02:01:21,780 --> 02:01:23,780 And it's just a very rare example 2678 02:01:23,780 --> 02:01:25,780 for it to see something like that. 2679 02:01:25,780 --> 02:01:27,780 And that's why we get the warning. 2680 02:01:27,780 --> 02:01:30,780 So the fundamental issue here is of course that 2681 02:01:30,780 --> 02:01:33,780 the LLM operates on top of these tokens 2682 02:01:33,780 --> 02:01:34,780 and these tokens are text chunks. 2683 02:01:34,780 --> 02:01:37,780 They're not characters in the way you and I would think of them. 2684 02:01:37,780 --> 02:01:40,780 These are the atoms of what the LLM is seeing 2685 02:01:40,780 --> 02:01:42,780 and there's a bunch of weird stuff that comes out of it. 2686 02:01:42,780 --> 02:01:46,780 Let's go back to our .DefaultCellStyle example. 2687 02:01:46,780 --> 02:01:49,780 I bet you that the model has never in its training set 2688 02:01:49,780 --> 02:01:54,780 seen .DefaultCellSty, without the le, in there. 2689 02:01:54,780 --> 02:01:56,780 It's always seen this as a single group 2690 02:01:56,780 --> 02:02:00,780 because this is some kind of a function in... 2691 02:02:00,780 --> 02:02:02,780 I don't actually know what this is part of. 2692 02:02:02,780 --> 02:02:03,780 This is some kind of API, 2693 02:02:03,780 --> 02:02:07,780 but I bet you that it's never seen this combination of tokens 2694 02:02:07,780 --> 02:02:09,780 in its training data 2695 02:02:09,780 --> 02:02:11,780 or I think it would be extremely rare. 2696 02:02:11,780 --> 02:02:13,780 So I took this and I copy pasted it here 2697 02:02:13,780 --> 02:02:16,780 and I tried to complete from it 2698 02:02:16,780 --> 02:02:19,780 and it immediately gave me a big error. 2699 02:02:19,780 --> 02:02:21,780 And it said the model predicted a completion 2700 02:02:21,780 --> 02:02:23,780 that begins with a stop sequence resulting in no output. 2701 02:02:23,780 --> 02:02:25,780 Consider adjusting your prompt or stop sequences. 2702 02:02:25,780 --> 02:02:27,780 So what happened here when I clicked submit 2703 02:02:27,780 --> 02:02:30,780 is that immediately the model emitted 2704 02:02:30,780 --> 02:02:32,780 sort of like an end of text token, I think, 2705 02:02:32,780 --> 02:02:33,780 or something like that 2706 02:02:33,780 --> 02:02:36,780 it basically predicted the stop sequence immediately. 2707 02:02:36,780 --> 02:02:38,780 So it had no completion.
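You can look at the token structure that's causing this with a quick check; a sketch assuming tiktoken, using the GPT-4 vocabulary (cl100k_base) discussed above, where per that discussion the full string is a single token while the truncated one has to be pieced together out of other tokens.

```python
# A quick look (assumes tiktoken is installed) at how the full string vs the
# truncated string break into tokens under the GPT-4 vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in [".DefaultCellStyle", ".DefaultCellSty"]:
    ids = enc.encode(s)
    print(f"{s!r}: {len(ids)} token(s) -> {[enc.decode([i]) for i in ids]}")
```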
2708 02:02:38,780 --> 02:02:40,780 And so this is why I'm getting a warning again 2709 02:02:40,780 --> 02:02:42,780 because we're off the data distribution 2710 02:02:42,780 --> 02:02:45,780 and the model is just predicting 2711 02:02:45,780 --> 02:02:47,780 just totally arbitrary things. 2712 02:02:47,780 --> 02:02:49,780 It's just really confused basically. 2713 02:02:49,780 --> 02:02:50,780 This is giving it brain damage. 2714 02:02:50,780 --> 02:02:51,780 It's never seen this before. 2715 02:02:51,780 --> 02:02:52,780 It's shocked. 2716 02:02:52,780 --> 02:02:54,780 And it's predicting end of text or something. 2717 02:02:54,780 --> 02:02:56,780 I tried it again here 2718 02:02:56,780 --> 02:02:58,780 and in this case it completed it. 2719 02:02:58,780 --> 02:02:59,780 But then for some reason 2720 02:02:59,780 --> 02:03:02,780 it said this request may violate our usage policies. 2721 02:03:02,780 --> 02:03:04,780 This was flagged. 2722 02:03:04,780 --> 02:03:06,780 Basically something just like goes wrong 2723 02:03:06,780 --> 02:03:07,780 and there's something like jank. 2724 02:03:07,780 --> 02:03:08,780 You can just feel the jank 2725 02:03:08,780 --> 02:03:10,780 because the model is like extremely unhappy 2726 02:03:10,780 --> 02:03:11,780 with just this 2727 02:03:11,780 --> 02:03:12,780 and it doesn't know how to complete it 2728 02:03:12,780 --> 02:03:14,780 because it's never occurred in a training set. 2729 02:03:14,780 --> 02:03:17,780 In a training set it always appears like this 2730 02:03:17,780 --> 02:03:19,780 and becomes a single token. 2731 02:03:19,780 --> 02:03:21,780 So these kinds of issues where tokens are 2732 02:03:21,780 --> 02:03:24,780 either you sort of like complete the first character 2733 02:03:24,780 --> 02:03:25,780 of the next token 2734 02:03:25,780 --> 02:03:26,780 or you sort of 2735 02:03:26,780 --> 02:03:27,780 have long tokens 2736 02:03:27,780 --> 02:03:30,780 and you then have just some of their characters. 2737 02:03:30,780 --> 02:03:31,780 All of these are kind of like 2738 02:03:31,780 --> 02:03:34,780 issues with partial tokens 2739 02:03:34,780 --> 02:03:36,780 is how I would describe it. 2740 02:03:36,780 --> 02:03:40,780 And if you actually dig into the tiktoken repository 2741 02:03:40,780 --> 02:03:41,780 go to the Rust code 2742 02:03:41,780 --> 02:03:44,780 and search for unstable 2743 02:03:44,780 --> 02:03:46,780 and you'll see in the code 2744 02:03:46,780 --> 02:03:47,780 unstable native, 2745 02:03:47,780 --> 02:03:48,780 unstable tokens, 2746 02:03:48,780 --> 02:03:50,780 and a lot of like special case handling. 2747 02:03:50,780 --> 02:03:52,780 None of this stuff about unstable tokens 2748 02:03:52,780 --> 02:03:54,780 is documented anywhere 2749 02:03:54,780 --> 02:03:55,780 but there's a ton of code 2750 02:03:55,780 --> 02:03:57,780 dealing with unstable tokens 2751 02:03:57,780 --> 02:03:59,780 and unstable tokens is exactly 2752 02:03:59,780 --> 02:04:01,780 kind of like what I'm describing here. 2753 02:04:01,780 --> 02:04:04,780 What you would like out of a completion API 2754 02:04:04,780 --> 02:04:05,780 is something a lot more fancy. 2755 02:04:05,780 --> 02:04:07,780 Like if we're putting in .DefaultCellSty 2756 02:04:07,780 --> 02:04:09,780 and we're asking for the next token sequence 2757 02:04:09,780 --> 02:04:11,780 we're not actually trying to append the next token 2758 02:04:11,780 --> 02:04:13,780 exactly after this list.
2759 02:04:13,780 --> 02:04:15,780 We're actually trying to 2760 02:04:15,780 --> 02:04:18,780 consider lots of tokens 2761 02:04:18,780 --> 02:04:20,780 that if we were 2762 02:04:20,780 --> 02:04:21,780 I guess like 2763 02:04:21,780 --> 02:04:23,780 we're trying to search over characters 2764 02:04:23,780 --> 02:04:25,780 that if we retokenized 2765 02:04:25,780 --> 02:04:27,780 would be of high probability 2766 02:04:27,780 --> 02:04:29,780 if that makes sense. 2767 02:04:29,780 --> 02:04:30,780 So that we can actually add 2768 02:04:30,780 --> 02:04:32,780 a single individual character 2769 02:04:32,780 --> 02:04:35,780 instead of just like adding the next full token 2770 02:04:35,780 --> 02:04:38,780 that comes after this partial token list. 2771 02:04:38,780 --> 02:04:40,780 So this is very tricky to describe 2772 02:04:40,780 --> 02:04:42,780 and I invite you to maybe like look through this. 2773 02:04:42,780 --> 02:04:43,780 It ends up being extremely gnarly 2774 02:04:43,780 --> 02:04:45,780 and hairy kind of topic 2775 02:04:45,780 --> 02:04:47,780 and it comes from tokenization fundamentally. 2776 02:04:47,780 --> 02:04:50,780 So maybe I can even spend an entire video 2777 02:04:50,780 --> 02:04:51,780 talking about unstable tokens 2778 02:04:51,780 --> 02:04:52,780 sometime in the future. 2779 02:04:52,780 --> 02:04:54,780 Okay and I'm really saving the best for last. 2780 02:04:54,780 --> 02:04:56,780 My favorite one by far 2781 02:04:56,780 --> 02:04:58,780 is this solid gold Magikarp. 2782 02:04:58,780 --> 02:05:00,780 It was just 2783 02:05:00,780 --> 02:05:02,780 okay so this comes from this blog post 2784 02:05:02,780 --> 02:05:03,780 solid gold Magikarp 2785 02:05:03,780 --> 02:05:05,780 and this is 2786 02:05:05,780 --> 02:05:07,780 internet famous now 2787 02:05:07,780 --> 02:05:09,780 for those of us in LLMs. 2788 02:05:09,780 --> 02:05:11,780 And basically I would advise you to 2789 02:05:11,780 --> 02:05:13,780 read this blog post in full. 2790 02:05:13,780 --> 02:05:15,780 But basically what this person was doing is 2791 02:05:15,780 --> 02:05:18,780 this person went to the 2792 02:05:18,780 --> 02:05:20,780 token embedding table 2793 02:05:20,780 --> 02:05:22,780 and clustered the tokens 2794 02:05:22,780 --> 02:05:24,780 based on their embedding representation. 2795 02:05:24,780 --> 02:05:27,780 And this person noticed that there's a cluster 2796 02:05:27,780 --> 02:05:29,780 of tokens that look really strange. 2797 02:05:29,780 --> 02:05:31,780 So there's a cluster here: 2798 02:05:31,780 --> 02:05:32,780 at rot, 2799 02:05:32,780 --> 02:05:33,780 East stream fame, 2800 02:05:33,780 --> 02:05:34,780 solid gold Magikarp, 2801 02:05:34,780 --> 02:05:35,780 sign up message, 2802 02:05:35,780 --> 02:05:37,780 like really weird tokens in 2803 02:05:37,780 --> 02:05:40,780 basically in this embedding cluster. 2804 02:05:40,780 --> 02:05:42,780 And so what are these tokens 2805 02:05:42,780 --> 02:05:43,780 and where do they even come from? 2806 02:05:43,780 --> 02:05:44,780 Like what is solid gold Magikarp? 2807 02:05:44,780 --> 02:05:45,780 It makes no sense.
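For reference, the kind of analysis the post describes looks roughly like this; a sketch with stand-in data only, since the real thing would use the actual embedding matrix and token strings from the model.

```python
# A sketch (stand-in data) of clustering the rows of a token embedding table and
# inspecting which token strings land in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
wte = rng.normal(size=(5000, 64))                    # stand-in for a real (vocab, n_embd) table
token_strings = [f"token_{i}" for i in range(5000)]  # stand-in for the real vocabulary strings

labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(wte)
cluster = labels[0]                                  # pick some cluster and list a few members
members = [token_strings[i] for i in np.flatnonzero(labels == cluster)[:10]]
print(members)  # with a real table, one such cluster is where the strange tokens showed up
```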
2808 02:05:45,780 --> 02:05:49,780 And then they found a bunch of these tokens 2809 02:05:49,780 --> 02:05:51,780 and then they noticed that actually 2810 02:05:51,780 --> 02:05:52,780 the plot thickens here 2811 02:05:52,780 --> 02:05:55,780 because if you ask the model about these tokens 2812 02:05:55,780 --> 02:05:58,780 like you ask it some very benign question 2813 02:05:58,780 --> 02:06:00,780 like please can you repeat back to me 2814 02:06:00,780 --> 02:06:02,780 the string solid gold Magikarp, 2815 02:06:02,780 --> 02:06:04,780 then you get a variety of basically 2816 02:06:04,780 --> 02:06:06,780 totally broken LLM behavior. 2817 02:06:06,780 --> 02:06:08,780 So either you get evasion. 2818 02:06:08,780 --> 02:06:10,780 So I'm sorry, I can't hear you 2819 02:06:10,780 --> 02:06:13,780 or you get a bunch of hallucinations as a response. 2820 02:06:13,780 --> 02:06:15,780 You can even get back like insults. 2821 02:06:15,780 --> 02:06:18,780 So you ask it about streamer bot 2822 02:06:18,780 --> 02:06:19,780 and instead of answering, 2823 02:06:19,780 --> 02:06:22,780 the model actually just calls you names 2824 02:06:22,780 --> 02:06:24,780 or it kind of comes up with like weird humor 2825 02:06:24,780 --> 02:06:26,780 like you're actually breaking the model 2826 02:06:26,780 --> 02:06:27,780 by asking about these 2827 02:06:27,780 --> 02:06:29,780 very simple strings like at Roth 2828 02:06:29,780 --> 02:06:31,780 and solid gold Magikarp. 2829 02:06:31,780 --> 02:06:32,780 So like what the hell is happening? 2830 02:06:32,780 --> 02:06:35,780 And there's a variety of documented behaviors here. 2831 02:06:35,780 --> 02:06:37,780 There's a bunch of tokens 2832 02:06:37,780 --> 02:06:38,780 not just solid gold Magikarp 2833 02:06:38,780 --> 02:06:40,780 that have that kind of a behavior. 2834 02:06:40,780 --> 02:06:43,780 And so basically there's a bunch of like trigger words. 2835 02:06:43,780 --> 02:06:45,780 And if you ask the model about these trigger words 2836 02:06:45,780 --> 02:06:47,780 or you just include them in your prompt 2837 02:06:47,780 --> 02:06:48,780 the model goes haywire 2838 02:06:48,780 --> 02:06:51,780 and has all kinds of really strange behaviors 2839 02:06:51,780 --> 02:06:53,780 including sort of ones that violate 2840 02:06:53,780 --> 02:06:55,780 typical safety guidelines 2841 02:06:55,780 --> 02:06:56,780 and the alignment of the model 2842 02:06:56,780 --> 02:06:57,780 like in this case 2843 02:06:57,780 --> 02:06:58,780 it's swearing back at you. 2844 02:06:58,780 --> 02:07:00,780 So what is happening here 2845 02:07:00,780 --> 02:07:02,780 and how can this possibly be true? 2846 02:07:02,780 --> 02:07:05,780 Well, this again comes down to tokenization. 2847 02:07:05,780 --> 02:07:06,780 So what's happening here 2848 02:07:06,780 --> 02:07:07,780 is that solid gold Magikarp 2849 02:07:07,780 --> 02:07:09,780 if you actually dig into it 2850 02:07:09,780 --> 02:07:10,780 is a Reddit user. 2851 02:07:10,780 --> 02:07:13,780 So there's a u slash solid gold Magikarp 2852 02:07:13,780 --> 02:07:15,780 and probably what happened here 2853 02:07:15,780 --> 02:07:17,780 even though I don't know that this has been 2854 02:07:17,780 --> 02:07:19,780 like really definitively explored 2855 02:07:19,780 --> 02:07:21,780 but what is thought to have happened 2856 02:07:21,780 --> 02:07:24,780 is that the tokenization data set 2857 02:07:24,780 --> 02:07:27,780 was very different from the training data set 2858 02:07:27,780 --> 02:07:29,780 for the actual language model.
2859 02:07:29,780 --> 02:07:30,780 So in the tokenization data set 2860 02:07:30,780 --> 02:07:32,780 there was a ton of Reddit data potentially 2861 02:07:32,780 --> 02:07:35,780 where the user solid gold Magikarp 2862 02:07:35,780 --> 02:07:36,780 was mentioned in the text. 2863 02:07:36,780 --> 02:07:38,780 Because solid gold Magikarp 2864 02:07:38,780 --> 02:07:41,780 was a very common sort of person 2865 02:07:41,780 --> 02:07:42,780 who would post a lot 2866 02:07:42,780 --> 02:07:44,780 this would be a string that occurs many times 2867 02:07:44,780 --> 02:07:46,780 in a tokenization data set. 2868 02:07:46,780 --> 02:07:48,780 Because it occurs many times 2869 02:07:48,780 --> 02:07:49,780 in the tokenization data set 2870 02:07:49,780 --> 02:07:51,780 these tokens would end up getting merged 2871 02:07:51,780 --> 02:07:52,780 to a single individual token 2872 02:07:52,780 --> 02:07:54,780 for that single Reddit user 2873 02:07:54,780 --> 02:07:55,780 solid gold Magikarp. 2874 02:07:55,780 --> 02:07:57,780 So they would have a dedicated token 2875 02:07:57,780 --> 02:07:58,780 in a vocabulary of 2876 02:07:58,780 --> 02:08:00,780 was it 50,000 tokens in GPT-2 2877 02:08:00,780 --> 02:08:03,780 that is devoted to that Reddit user. 2878 02:08:03,780 --> 02:08:04,780 And then what happens is 2879 02:08:04,780 --> 02:08:07,780 the tokenization data set has those strings 2880 02:08:07,780 --> 02:08:10,780 but then later when you train the model 2881 02:08:10,780 --> 02:08:12,780 the language model itself 2882 02:08:12,780 --> 02:08:15,780 this data from Reddit was not present. 2883 02:08:15,780 --> 02:08:16,780 And so therefore 2884 02:08:16,780 --> 02:08:19,780 in the entire training set for the language model 2885 02:08:19,780 --> 02:08:21,780 solid gold Magikarp never occurs. 2886 02:08:21,780 --> 02:08:24,780 That token never appears in the training set 2887 02:08:24,780 --> 02:08:26,780 for the actual language model later. 2888 02:08:26,780 --> 02:08:29,780 So this token never gets activated 2889 02:08:29,780 --> 02:08:30,780 it's initialized at random 2890 02:08:30,780 --> 02:08:32,780 in the beginning of optimization 2891 02:08:32,780 --> 02:08:33,780 then you have forward-backward passes 2892 02:08:33,780 --> 02:08:34,780 and updates to the model 2893 02:08:34,780 --> 02:08:36,780 and this token is just never updated 2894 02:08:36,780 --> 02:08:37,780 in the embedding table. 2895 02:08:37,780 --> 02:08:39,780 That row vector never gets sampled 2896 02:08:39,780 --> 02:08:40,780 it never gets used 2897 02:08:40,780 --> 02:08:41,780 so it never gets trained 2898 02:08:41,780 --> 02:08:42,780 and it's completely untrained. 2899 02:08:42,780 --> 02:08:44,780 It's kind of like unallocated memory 2900 02:08:44,780 --> 02:08:46,780 in a typical binary program 2901 02:08:46,780 --> 02:08:48,780 written in C or something like that. 2902 02:08:48,780 --> 02:08:49,780 So it's unallocated memory 2903 02:08:49,780 --> 02:08:50,780 and then at test time 2904 02:08:50,780 --> 02:08:52,780 if you evoke this token 2905 02:08:52,780 --> 02:08:53,780 then you're basically 2906 02:08:53,780 --> 02:08:55,780 plucking out a row of the embedding table 2907 02:08:55,780 --> 02:08:56,780 that is completely untrained 2908 02:08:56,780 --> 02:08:58,780 and that feeds into a transformer 2909 02:08:58,780 --> 02:09:00,780 and creates undefined behavior. 
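Here's a tiny toy illustration of that; a sketch in PyTorch, just to show that an embedding row that is never indexed never receives a gradient and stays at its random initialization.

```python
# A toy demonstration (PyTorch) that an embedding row which never occurs in the
# training data is never updated -- the "unallocated memory" situation.
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                    # tiny vocabulary of 10 tokens
never_seen = emb.weight[9].detach().clone()  # pretend token 9 never appears in training

opt = torch.optim.SGD(emb.parameters(), lr=0.1)
for _ in range(100):
    idx = torch.randint(0, 9, (32,))         # batches only ever contain tokens 0..8
    loss = emb(idx).pow(2).mean()            # stand-in loss; any loss behaves the same way here
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.equal(never_seen, emb.weight[9].detach()))  # True: row 9 was never trained
```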
2910 02:09:00,780 --> 02:09:01,780 And that's what we're seeing here 2911 02:09:01,780 --> 02:09:02,780 it's completely undefined, 2912 02:09:02,780 --> 02:09:05,780 never-before-seen-in-training behavior. 2913 02:09:05,780 --> 02:09:07,780 And so any of these kind of like weird tokens 2914 02:09:07,780 --> 02:09:08,780 would evoke this behavior 2915 02:09:08,780 --> 02:09:10,780 because fundamentally the model 2916 02:09:10,780 --> 02:09:13,780 is out of sample 2917 02:09:13,780 --> 02:09:15,780 out of distribution. 2918 02:09:15,780 --> 02:09:16,780 Okay and the very last thing 2919 02:09:16,780 --> 02:09:18,780 I wanted to just briefly mention and point out 2920 02:09:18,780 --> 02:09:19,780 although I think a lot of people 2921 02:09:19,780 --> 02:09:20,780 are quite aware of this 2922 02:09:20,780 --> 02:09:22,780 is that different kinds of formats 2923 02:09:22,780 --> 02:09:23,780 and different representations 2924 02:09:23,780 --> 02:09:24,780 and different languages 2925 02:09:24,780 --> 02:09:25,780 and so on 2926 02:09:25,780 --> 02:09:26,780 might be more or less efficient 2927 02:09:26,780 --> 02:09:28,780 with GPT tokenizers 2928 02:09:28,780 --> 02:09:30,780 or any tokenizers for any other LLM 2929 02:09:30,780 --> 02:09:31,780 for that matter. 2930 02:09:31,780 --> 02:09:32,780 So for example JSON 2931 02:09:32,780 --> 02:09:34,780 is actually really dense in tokens 2932 02:09:34,780 --> 02:09:37,780 and YAML is a lot more efficient in tokens. 2933 02:09:37,780 --> 02:09:40,780 So for example this is the same data 2934 02:09:40,780 --> 02:09:42,780 in JSON and in YAML. 2935 02:09:42,780 --> 02:09:44,780 The JSON is 116 tokens 2936 02:09:44,780 --> 02:09:46,780 and the YAML is 99 tokens. 2937 02:09:46,780 --> 02:09:48,780 So quite a bit of an improvement. 2938 02:09:48,780 --> 02:09:51,780 And so in the token economy 2939 02:09:51,780 --> 02:09:53,780 where we are paying per token 2940 02:09:53,780 --> 02:09:54,780 in many ways 2941 02:09:54,780 --> 02:09:55,780 and you are paying in the context length 2942 02:09:55,780 --> 02:09:57,780 and you're paying in dollar amount 2943 02:09:57,780 --> 02:09:59,780 for the cost of processing 2944 02:09:59,780 --> 02:10:00,780 all this kind of structured data 2945 02:10:00,780 --> 02:10:02,780 when you have to. 2946 02:10:02,780 --> 02:10:04,780 So prefer to use YAMLs over JSONs 2947 02:10:04,780 --> 02:10:05,780 and in general 2948 02:10:05,780 --> 02:10:06,780 kind of like the tokenization density 2949 02:10:06,780 --> 02:10:08,780 is something that you have to 2950 02:10:08,780 --> 02:10:09,780 sort of care about 2951 02:10:09,780 --> 02:10:11,780 and worry about at all times 2952 02:10:11,780 --> 02:10:13,780 and try to find efficient encoding schemes 2953 02:10:13,780 --> 02:10:14,780 and spend a lot of time 2954 02:10:14,780 --> 02:10:15,780 in the Tiktokenizer 2955 02:10:15,780 --> 02:10:17,780 and measure the different token efficiencies 2956 02:10:17,780 --> 02:10:18,780 of different formats and settings 2957 02:10:18,780 --> 02:10:19,780 and so on. 2958 02:10:19,780 --> 02:10:20,780 Okay so that concludes 2959 02:10:20,780 --> 02:10:23,780 my fairly long video on tokenization. 2960 02:10:23,780 --> 02:10:25,780 I know it's dry. 2961 02:10:25,780 --> 02:10:26,780 I know it's annoying. 2962 02:10:26,780 --> 02:10:27,780 I know it's irritating. 2963 02:10:27,780 --> 02:10:29,780 I personally really dislike this stage. 2964 02:10:29,780 --> 02:10:31,780 What I do have to say at this point 2965 02:10:31,780 --> 02:10:33,780 is don't brush it off.
2966 02:10:33,780 --> 02:10:34,780 There's a lot of foot guns, 2967 02:10:34,780 --> 02:10:35,780 sharp edges here, 2968 02:10:35,780 --> 02:10:36,780 security issues, 2969 02:10:36,780 --> 02:10:38,780 AI safety issues 2970 02:10:38,780 --> 02:10:40,780 as we saw plugging in unallocated memory 2971 02:10:40,780 --> 02:10:42,780 into language models. 2972 02:10:42,780 --> 02:10:45,780 So it's worth understanding this stage. 2973 02:10:45,780 --> 02:10:47,780 That said I will say that 2974 02:10:47,780 --> 02:10:49,780 eternal glory goes to anyone 2975 02:10:49,780 --> 02:10:50,780 who can get rid of it. 2976 02:10:50,780 --> 02:10:52,780 I showed you one possible paper 2977 02:10:52,780 --> 02:10:54,780 that tried to do that 2978 02:10:54,780 --> 02:10:55,780 and I think I hope 2979 02:10:55,780 --> 02:10:57,780 a lot more can follow over time. 2980 02:10:57,780 --> 02:10:58,780 And my final recommendations 2981 02:10:58,780 --> 02:11:00,780 for the application right now are 2982 02:11:00,780 --> 02:11:02,780 if you can reuse the GPT-4 tokens 2983 02:11:02,780 --> 02:11:03,780 and the vocabulary 2984 02:11:03,780 --> 02:11:04,780 in your application 2985 02:11:04,780 --> 02:11:05,780 then that's something you should consider 2986 02:11:05,780 --> 02:11:06,780 and just use tiktoken 2987 02:11:06,780 --> 02:11:08,780 because it is a very efficient 2988 02:11:08,780 --> 02:11:10,780 and nice library for inference 2989 02:11:10,780 --> 02:11:11,780 for BPE. 2990 02:11:11,780 --> 02:11:13,780 I also really like the byte level BPE 2991 02:11:13,780 --> 02:11:16,780 that tiktoken and OpenAI use. 2992 02:11:16,780 --> 02:11:17,780 If you for some reason 2993 02:11:17,780 --> 02:11:19,780 want to train your own vocabulary 2994 02:11:19,780 --> 02:11:21,780 from scratch 2995 02:11:21,780 --> 02:11:25,780 then I would use the BPE with sentence piece. 2996 02:11:25,780 --> 02:11:26,780 Oops. 2997 02:11:26,780 --> 02:11:27,780 As I mentioned 2998 02:11:27,780 --> 02:11:28,780 I'm not a huge fan of sentence piece. 2999 02:11:28,780 --> 02:11:32,780 I don't like its byte fallback 3000 02:11:32,780 --> 02:11:34,780 and I don't like that it's doing BPE 3001 02:11:34,780 --> 02:11:35,780 on Unicode code points. 3002 02:11:35,780 --> 02:11:36,780 I think it's 3003 02:11:36,780 --> 02:11:37,780 it also has like a million settings 3004 02:11:37,780 --> 02:11:39,780 and I think there's a lot of foot guns here 3005 02:11:39,780 --> 02:11:40,780 and I think it's really easy 3006 02:11:40,780 --> 02:11:41,780 to miscalibrate them 3007 02:11:41,780 --> 02:11:42,780 and you end up cropping your sentences 3008 02:11:42,780 --> 02:11:44,780 or something like that 3009 02:11:44,780 --> 02:11:45,780 because of some type of parameter 3010 02:11:45,780 --> 02:11:47,780 that you don't fully understand. 3011 02:11:47,780 --> 02:11:49,780 So be very careful with the settings. 3012 02:11:49,780 --> 02:11:50,780 Try to copy paste exactly 3013 02:11:50,780 --> 02:11:52,780 maybe what Meta did 3014 02:11:52,780 --> 02:11:54,780 or basically spend a lot of time 3015 02:11:54,780 --> 02:11:56,780 looking at all the hyperparameters 3016 02:11:56,780 --> 02:11:57,780 and go through the code of sentence piece 3017 02:11:57,780 --> 02:11:59,780 and make sure that you have this correct.
3018 02:11:59,780 --> 02:12:02,780 But even if you have all the settings correct 3019 02:12:02,780 --> 02:12:03,780 I still think that the algorithm 3020 02:12:03,780 --> 02:12:04,780 is kind of inferior 3021 02:12:04,780 --> 02:12:06,780 to what's happening here. 3022 02:12:06,780 --> 02:12:08,780 And maybe the best 3023 02:12:08,780 --> 02:12:10,780 if you really need to train your vocabulary 3024 02:12:10,780 --> 02:12:11,780 maybe the best thing is to just wait 3025 02:12:11,780 --> 02:12:13,780 for minBPE to become as efficient 3026 02:12:13,780 --> 02:12:14,780 as possible 3027 02:12:14,780 --> 02:12:16,780 and that's something that 3028 02:12:16,780 --> 02:12:18,780 maybe I hope to work on. 3029 02:12:18,780 --> 02:12:19,780 And at some point 3030 02:12:19,780 --> 02:12:21,780 maybe we can be training basically 3031 02:12:21,780 --> 02:12:22,780 really what we want 3032 02:12:22,780 --> 02:12:23,780 is we want tiktoken 3033 02:12:23,780 --> 02:12:24,780 but training code 3034 02:12:24,780 --> 02:12:26,780 and that is the ideal thing 3035 02:12:26,780 --> 02:12:28,780 that currently does not exist. 3036 02:12:28,780 --> 02:12:31,780 And minBPE is an implementation of it 3037 02:12:31,780 --> 02:12:33,780 but currently it's in Python. 3038 02:12:33,780 --> 02:12:35,780 So that's currently what I have to say 3039 02:12:35,780 --> 02:12:37,780 for tokenization. 3040 02:12:37,780 --> 02:12:38,780 There might be an advanced video 3041 02:12:38,780 --> 02:12:40,780 that is even drier 3042 02:12:40,780 --> 02:12:41,780 and even more detailed in the future. 3043 02:12:41,780 --> 02:12:42,780 But for now I think 3044 02:12:42,780 --> 02:12:44,780 we're going to leave things off here 3045 02:12:44,780 --> 02:12:46,780 and I hope that was helpful. 3046 02:12:46,780 --> 02:12:47,780 Bye. 3047 02:12:49,780 --> 02:12:55,780 And they increased this context size 3048 02:12:55,780 --> 02:12:57,780 from GPT-1 of 512 3049 02:12:57,780 --> 02:13:02,780 to 1,024 in GPT-2. 3050 02:13:02,780 --> 02:13:05,780 The next... 3051 02:13:05,780 --> 02:13:06,780 Okay, next I would like us 3052 02:13:06,780 --> 02:13:07,780 to briefly walk through 3053 02:13:07,780 --> 02:13:09,780 the code from OpenAI 3054 02:13:09,780 --> 02:13:16,780 on the GPT-2 encoder.py. 3055 02:13:16,780 --> 02:13:18,780 I'm sorry, I'm going to sneeze. 3056 02:13:18,780 --> 02:13:19,780 And then what's happening 3057 02:13:19,780 --> 02:13:22,780 here is... 3058 02:13:22,780 --> 02:13:23,780 This is a spurious layer 3059 02:13:23,780 --> 02:13:26,780 that I will explain in a bit. 3060 02:13:26,780 --> 02:13:28,780 What's happening here is...