1 00:00:00,000 --> 00:00:04,320 Hi everyone. So in this video, I'd like us to cover the process of tokenization in large 2 00:00:04,320 --> 00:00:10,040 language models. Now, you see here that I have a sad face, and that's because tokenization is my 3 00:00:10,040 --> 00:00:14,020 least favorite part of working with large language models, but unfortunately it is necessary to 4 00:00:14,020 --> 00:00:18,600 understand in some detail because it is fairly hairy, gnarly, and there's a lot of hidden foot 5 00:00:18,600 --> 00:00:23,560 guns to be aware of, and a lot of oddness with large language models typically traces back 6 00:00:23,560 --> 00:00:29,720 to tokenization. So what is tokenization? Now, in my previous video, Let's Build GPT 7 00:00:29,720 --> 00:00:34,480 from Scratch, we actually already did tokenization, but we did a very naive, 8 00:00:34,780 --> 00:00:39,080 simple version of tokenization. So when you go to the Google Colab for that video, 9 00:00:39,880 --> 00:00:46,120 you see here that we loaded our training set, and our training set was this Shakespeare data set. 10 00:00:47,360 --> 00:00:51,820 Now, in the beginning, the Shakespeare data set is just a large string in Python. It's just text, 11 00:00:52,300 --> 00:00:57,900 and so the question is, how do we plug text into large language models? And in this case here, 12 00:00:58,780 --> 00:00:59,700 we created 13 00:00:59,700 --> 00:01:05,740 a vocabulary of 65 possible characters that we saw occur in this string. These were the possible 14 00:01:05,740 --> 00:01:11,860 characters, and we saw that there are 65 of them, and then we created a lookup table for converting 15 00:01:11,860 --> 00:01:19,280 from every possible character, a little string piece, into a token, an integer. So here, for 16 00:01:19,280 --> 00:01:26,080 example, we tokenized the string, hi there, and we received this sequence of tokens. And here, 17 00:01:26,080 --> 00:01:29,100 we took the first 1,000 characters of our data set, 18 00:01:29,700 --> 00:01:36,440 we encoded it into tokens. And because this is character level, we received 1,000 tokens in 19 00:01:36,440 --> 00:01:45,460 the sequence. So, token 1847, etc. Now, later, we saw that the way we plug these tokens into the 20 00:01:45,460 --> 00:01:52,540 language model is by using an embedding table. And so basically, if we have 65 possible tokens, 21 00:01:52,540 --> 00:01:58,300 then this embedding table is going to have 65 rows. And roughly speaking, we're taking the 22 00:01:58,300 --> 00:01:59,040 integer 23 00:01:59,040 --> 00:02:04,040 with every single token. We're using that as a lookup into this table and we're 24 00:02:04,040 --> 00:02:09,220 plucking out the corresponding row and this row is trainable parameters 25 00:02:09,220 --> 00:02:12,840 that we're going to train using back propagation and this is the vector that 26 00:02:12,840 --> 00:02:16,440 then feeds into the transformer and that's how the transformer sort of 27 00:02:16,440 --> 00:02:22,360 perceives every single token. So here we had a very naive tokenization process 28 00:02:22,360 --> 00:02:27,240 that was a character level tokenizer but in practice in state-of-the-art language 29 00:02:27,240 --> 00:02:31,000 models people use a lot more complicated schemes unfortunately for 30 00:02:31,000 --> 00:02:37,260 constructing these token vocabularies. 
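For reference, here is a minimal sketch of that character-level tokenizer; it assumes the Shakespeare dataset has been read into a single Python string called text, and n_embd is just a placeholder name for the embedding width.

```python
# Minimal sketch of a character-level tokenizer (assumes the Shakespeare
# dataset has been loaded into the string `text`, as in the Colab).
chars = sorted(set(text))                      # the 65 unique characters in that dataset
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(encode("hi there"))           # one integer per character
print(decode(encode("hi there")))   # round-trips back to "hi there"

# each token id then indexes one row of a trainable embedding table,
# e.g. torch.nn.Embedding(len(chars), n_embd), and that row is the vector
# that feeds into the transformer
```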
So we're not dealing on the character level 31 00:02:37,260 --> 00:02:42,660 we're dealing on chunk level, and the way these character chunks are constructed 32 00:02:42,660 --> 00:02:46,840 is using algorithms such as, for example, the byte pair encoding algorithm, which 33 00:02:46,840 --> 00:02:52,680 we're going to go into in detail and cover in this video. I'd like to briefly 34 00:02:52,680 --> 00:02:56,940 show you the paper that introduced byte-level encoding as a mechanism for 35 00:02:56,940 --> 00:02:57,200 tokenization 36 00:02:57,240 --> 00:03:01,140 in the context of large language models, and I would say that that's probably the 37 00:03:01,140 --> 00:03:07,260 GPT-2 paper. And if you scroll down here to the section input representation, this 38 00:03:07,260 --> 00:03:10,600 is where they cover tokenization, the kinds of properties that you'd like the 39 00:03:10,600 --> 00:03:14,760 tokenization to have, and they conclude here that they're going to have a 40 00:03:14,760 --> 00:03:18,840 tokenizer where you have a vocabulary of fifty thousand two hundred and fifty 41 00:03:18,840 --> 00:03:26,460 seven possible tokens, and the context size is going to be 1024 tokens. So in 42 00:03:26,460 --> 00:03:26,940 the attention 43 00:03:26,940 --> 00:03:31,020 layer of the transformer neural network, every single 44 00:03:31,020 --> 00:03:34,500 token is attending to the previous tokens in the sequence, and it's going to 45 00:03:34,500 --> 00:03:42,360 see up to 1024 tokens. So tokens are this like fundamental unit, the atom of large 46 00:03:42,360 --> 00:03:45,240 language models if you will, and everything is in units of tokens, 47 00:03:45,240 --> 00:03:50,160 everything is about tokens, and tokenization is the process for translating strings or 48 00:03:50,160 --> 00:03:56,640 text into sequences of tokens and vice versa. When you go into the Llama 2 paper 49 00:03:56,640 --> 00:04:00,600 as well, I can show you that when you search token, you're going to get 63 hits, 50 00:04:00,600 --> 00:04:05,120 and that's because tokens are again pervasive. So here they mention that 51 00:04:05,120 --> 00:04:10,340 they trained on two trillion tokens of data, and so on. So we're going to build 52 00:04:10,340 --> 00:04:13,860 our own tokenizer. Luckily the byte pair encoding algorithm is not 53 00:04:13,860 --> 00:04:18,200 that super complicated and we can build it from scratch ourselves, and we'll see 54 00:04:18,200 --> 00:04:21,980 exactly how this works. Before we dive into code, I'd like to give you a brief 55 00:04:21,980 --> 00:04:25,760 taste of some of the complexities that come from the tokenization, because I 56 00:04:25,760 --> 00:04:26,460 just want to make sure that we're 57 00:04:26,640 --> 00:04:33,200 motivated sufficiently for why we are doing all this and why this is so gross. So tokenization 58 00:04:33,200 --> 00:04:36,960 is at the heart of a lot of weirdness in large language models and I would advise that you do 59 00:04:36,960 --> 00:04:42,240 not brush it off. A lot of the issues that may look like just issues with the neural network 60 00:04:42,240 --> 00:04:47,680 architecture or the large language model itself are actually issues with the tokenization and 61 00:04:47,680 --> 00:04:53,440 fundamentally trace back to it.
So if you've noticed any issues where large language models 62 00:04:53,440 --> 00:04:58,400 can't, you know, do spelling tasks very easily, that's usually due to tokenization. 63 00:04:58,960 --> 00:05:03,600 Simple string processing can be difficult for the large language model to perform natively. 64 00:05:04,720 --> 00:05:09,360 Non-English languages can work much worse, and to a large extent this is due to tokenization. 65 00:05:10,160 --> 00:05:17,520 Sometimes LLMs are bad at simple arithmetic; that can also be traced to tokenization. GPT-2 specifically 66 00:05:17,520 --> 00:05:22,880 would have had quite a bit more issues with Python than future versions of it due to tokenization. 67 00:05:23,600 --> 00:05:27,040 There's a lot of other issues. Maybe you've seen weird warnings about a trailing whitespace; 68 00:05:27,040 --> 00:05:34,880 this is a tokenization issue. If you had asked GPT earlier about SolidGoldMagikarp and what it is, 69 00:05:34,880 --> 00:05:40,080 you would see the LLM go totally crazy and it would start going off on a completely unrelated 70 00:05:40,080 --> 00:05:44,640 tangent topic. Maybe you've been told to use YAML over JSON for structured data; 71 00:05:44,640 --> 00:05:48,800 all of that has to do with tokenization. So basically tokenization is at the heart 72 00:05:48,800 --> 00:05:52,960 of many issues. I will loop back around to these at the end of the video, 73 00:05:53,040 --> 00:05:58,400 but for now let me just skip over it a little bit and let's go to this web app, 74 00:05:59,520 --> 00:06:04,880 tiktokenizer.vercel.app. So I have it loaded here, and what I like about this web app is that 75 00:06:04,880 --> 00:06:10,560 tokenization is running sort of live in your browser in JavaScript. So you can just type 76 00:06:10,560 --> 00:06:18,720 stuff here, hello world, and the whole string retokenizes. So here what we see on the left 77 00:06:18,720 --> 00:06:22,480 is the string that you put in; on the right we're currently using the GPT-2 tokenizer, 78 00:06:23,120 --> 00:06:28,000 and we see that this string that I pasted here is currently tokenizing into 300 tokens, 79 00:06:28,560 --> 00:06:33,440 and here they are sort of shown explicitly in different colors for every single token. 80 00:06:34,240 --> 00:06:43,680 So for example this word tokenization became two tokens, the tokens 30642 and 1634. 81 00:06:44,880 --> 00:06:52,400 The token " is" is token 318. So be careful: on the bottom you can show whitespace, 82 00:06:53,200 --> 00:06:58,480 and keep in mind that there are spaces and \n newline characters in here, 83 00:06:58,480 --> 00:07:05,440 but you can hide them for clarity. The token " at" is token 379, 84 00:07:06,400 --> 00:07:14,960 the token " the" is 262, etc. So you notice here that the space is part of that token chunk. 85 00:07:16,800 --> 00:07:22,240 Now, so this is kind of like how our English sentence broke up, and that seems all well and good. 86 00:07:23,040 --> 00:07:32,400 Now here I put in some arithmetic. So we see that 127 is a single token, plus is a token, and then 677 breaks into the token space 6 87 00:07:32,400 --> 00:07:37,760 followed by the token 77. So what's happening here is that 127 is feeding in as a single token into 88 00:07:37,760 --> 00:07:44,880 the large language model, but the number 677 will actually feed in as two separate tokens. 89 00:07:45,680 --> 00:07:52,320 And so the large language model has to sort of take account of that and process it correctly in 90 00:07:52,320 --> 00:07:58,240 its network.
And see here 804 will be broken up into two tokens, and it's all completely arbitrary. 91 00:07:58,880 --> 00:08:03,280 And here I have another example of four-digit numbers, and they break up in the way that they 92 00:08:03,280 --> 00:08:08,480 break up, and it's totally arbitrary. Sometimes you have multiple digits in a single token, sometimes 93 00:08:08,480 --> 00:08:13,120 you have individual digits as many tokens, and it's all kind of pretty arbitrary and coming out 94 00:08:13,120 --> 00:08:21,440 of the tokenizer. Here's another example. We have the string Egg, and you see here that this became two tokens. 95 00:08:23,200 --> 00:08:27,680 But for some reason when I say I have an Egg, you see when it's a space Egg 96 00:08:28,640 --> 00:08:34,480 it's a single token. So just Egg by itself in the beginning of a sentence is 97 00:08:34,480 --> 00:08:40,960 two tokens, but here as a space Egg it is suddenly a single token for the exact same string. 98 00:08:41,600 --> 00:08:46,880 Here lowercase egg turns out to be a single token, and in particular notice that 99 00:08:46,880 --> 00:08:51,840 the color is different, so this is a different token, so this is case sensitive. And of course, 100 00:08:52,480 --> 00:08:58,720 if you look at all-uppercase EGG, it would also be different tokens. And again this 101 00:08:58,720 --> 00:09:04,160 would be two tokens arbitrarily. So for the same concept egg, depending on if it's in the beginning 102 00:09:04,160 --> 00:09:09,360 of a sentence, at the end of a sentence, lowercase, uppercase or mixed, all this will be basically 103 00:09:09,360 --> 00:09:14,160 very different tokens and different IDs. And the language model has to learn from raw data, from 104 00:09:14,160 --> 00:09:17,520 all the internet text that it's going to be training on, that these are actually all the 105 00:09:17,520 --> 00:09:21,600 exact same concept. And it has to sort of group them in the parameters of the neural network and understand, just based on data, 106 00:09:21,600 --> 00:09:25,360 just based on the data patterns, that these are all very similar, maybe not 107 00:09:25,360 --> 00:09:32,220 exactly identical, but very, very similar. After the demonstration here I 108 00:09:32,220 --> 00:09:41,160 have an introduction from OpenAI's ChatGPT in Korean, so 만나서 반가워요 (nice to meet you), etc. 109 00:09:41,160 --> 00:09:46,320 So this is in Korean, and the reason I put this here is because you'll notice 110 00:09:46,320 --> 00:09:53,940 that non-English languages work slightly worse in ChatGPT. Part of this is 111 00:09:53,940 --> 00:09:57,720 because of course the training data set for ChatGPT is much larger for 112 00:09:57,720 --> 00:10:01,360 English than for everything else, but the same is true not just for the large 113 00:10:01,360 --> 00:10:05,580 language model itself but also for the tokenizer.
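If you want to reproduce these experiments locally rather than in the web app, here is a rough sketch assuming the tiktoken library is installed; the exact token IDs may differ from the screenshots above, but the behavior is the same.

```python
# Rough local sketch of the experiments above, assuming the tiktoken library
# (pip install tiktoken). The point is the behavior, not the specific IDs.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the GPT-2 tokenizer

# the same word tokenizes differently depending on case and leading space
for s in ["Egg", " Egg", " egg", "EGG"]:
    print(repr(s), "->", enc.encode(s))

# digits get chunked fairly arbitrarily
print(enc.encode("127 + 677 = 804"))

# the same greeting in English vs. Korean uses very different token counts
print(len(enc.encode("Hello, nice to meet you!")))
print(len(enc.encode("안녕하세요, 만나서 반가워요!")))
```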
So when we train the tokenizer 114 00:10:05,580 --> 00:10:08,520 we're going to see that there's a training set as well, and there's a lot 115 00:10:08,520 --> 00:10:12,540 more English than non-English, and what ends up happening is that we're going to 116 00:10:12,540 --> 00:10:16,300 have a lot more long tokens for English. 117 00:10:16,300 --> 00:10:21,400 So how do I put this: if you have a single sentence in English and you 118 00:10:21,400 --> 00:10:25,460 tokenize it, you might see that it's 10 tokens or something like that, but if you 119 00:10:25,460 --> 00:10:29,360 translate that sentence into say Korean or Japanese or something else, you'll 120 00:10:29,360 --> 00:10:33,520 typically see that the number of tokens used is much larger, and that's because the 121 00:10:33,520 --> 00:10:38,440 chunks here are a lot more broken up, so we're using a lot more tokens for the 122 00:10:38,440 --> 00:10:43,720 exact same thing. And what this does is it bloats up the sequence length of all 123 00:10:43,720 --> 00:10:45,920 the documents, so you're using up 124 00:10:45,920 --> 00:10:50,160 more tokens, and then in the attention of the transformer, when these tokens try to attend to 125 00:10:50,160 --> 00:10:56,320 each other, you are running out of context in the maximum context length of that transformer. 126 00:10:56,320 --> 00:11:02,400 And so basically all the non-English text is stretched out from the perspective of the 127 00:11:02,400 --> 00:11:07,600 transformer, and this just has to do with the training set used for the tokenizer and the 128 00:11:07,600 --> 00:11:13,760 tokenization itself. So it will create a lot bigger tokens and a lot larger groups in English, and it 129 00:11:13,760 --> 00:11:17,200 will have a lot of little boundaries for all the other non-English text. 130 00:11:19,280 --> 00:11:23,040 So if we translated this into English, it would be significantly fewer tokens. 131 00:11:24,240 --> 00:11:28,320 The final example I have here is a little snippet of Python for doing FizzBuzz, 132 00:11:29,040 --> 00:11:35,920 and what I'd like you to notice is: look, all these individual spaces are all separate tokens. They are 133 00:11:35,920 --> 00:11:43,680 token 220, so 220 220 220 220, and then space if is a single 134 00:11:43,760 --> 00:11:48,280 token. And so what's going on here is that when the transformer is going to consume or try to 135 00:11:48,280 --> 00:11:55,740 create this text, it needs to handle all these spaces individually. They all feed in one by one 136 00:11:55,740 --> 00:12:01,440 into the entire transformer in the sequence. And so this is being extremely wasteful, tokenizing it 137 00:12:01,440 --> 00:12:07,900 in this way. And so as a result of that, GPT-2 is not very good with Python, and it's not anything 138 00:12:07,900 --> 00:12:12,080 to do with coding or the language model itself. It's just that if you use a lot of indentation 139 00:12:12,080 --> 00:12:18,040 using space in Python, like we usually do, you just end up bloating out all the text, and it's 140 00:12:18,040 --> 00:12:22,020 separated across way too much of the sequence, and we are running out of the context length 141 00:12:22,020 --> 00:12:26,660 in the sequence. That's roughly speaking what's happening. We're being way too wasteful. We're 142 00:12:26,660 --> 00:12:31,000 taking up way too much token space. Now we can also scroll up here, and we can change the tokenizer.
143 00:12:31,640 --> 00:12:38,140 So note here that the GPT-2 tokenizer creates a token count of 300 for this string here. We can change 144 00:12:38,140 --> 00:12:42,020 it to cl100k_base, which is the GPT-4 tokenizer. And we see 145 00:12:42,020 --> 00:12:42,060 that 146 00:12:42,060 --> 00:12:48,440 the token count drops to 185. So for the exact same string, we are now roughly halving the number 147 00:12:48,440 --> 00:12:54,540 of tokens. And roughly speaking, this is because the number of tokens in the GPT-4 tokenizer is 148 00:12:54,540 --> 00:12:59,840 roughly double that of the number of tokens in the GPT-2 tokenizer. So we went from roughly 50K 149 00:12:59,840 --> 00:13:05,920 to roughly 100K. Now you can imagine that this is a good thing, because the same text is now 150 00:13:06,200 --> 00:13:12,000 squished into half as many tokens. So this is a lot denser input to the 151 00:13:12,000 --> 00:13:17,840 transformer. And in the transformer, every single token has a finite number of tokens before it 152 00:13:17,840 --> 00:13:23,240 that it's going to pay attention to. And so what this is doing is we're roughly able to see twice 153 00:13:23,240 --> 00:13:29,800 as much text as context for what token to predict next because of this change. But of course, 154 00:13:29,980 --> 00:13:34,940 just increasing the number of tokens is not strictly better infinitely, because as you 155 00:13:34,940 --> 00:13:40,280 increase the number of tokens, now your embedding table is sort of getting a lot larger. And also 156 00:13:40,280 --> 00:13:41,980 at the output, we are trying to predict 157 00:13:42,000 --> 00:13:45,880 the next token, and there's the softmax there, and that grows as well. We're going to go into more 158 00:13:45,880 --> 00:13:51,400 detail later on this. But there's some kind of a sweet spot somewhere, where you have a just right 159 00:13:51,400 --> 00:13:55,560 number of tokens in your vocabulary, where everything is appropriately dense and still 160 00:13:55,560 --> 00:14:01,080 fairly efficient. Now, one thing I would like you to note specifically for the GPT-4 tokenizer 161 00:14:01,080 --> 00:14:07,480 is that the handling of the whitespace for Python has improved a lot. You see that here, 162 00:14:07,480 --> 00:14:11,960 these four spaces are represented as one single token, and same for the three spaces here. 163 00:14:12,000 --> 00:14:19,820 And here, seven spaces were all grouped into a single token. So we're being a lot more efficient 164 00:14:19,820 --> 00:14:23,900 in how we represent Python. And this was a deliberate choice made by OpenAI when they 165 00:14:23,900 --> 00:14:29,960 designed the GPT-4 tokenizer, and they group a lot more whitespace into a single token. 166 00:14:30,480 --> 00:14:37,720 What this does is it densifies Python, and therefore we can attend to more code before it 167 00:14:37,720 --> 00:14:41,940 when we're trying to predict the next token in the sequence. And so the improvement in 168 00:14:42,000 --> 00:14:47,680 Python coding ability from GPT-2 to GPT-4 is not just a matter of the language model and the 169 00:14:47,680 --> 00:14:52,080 architecture and the details of the optimization, but a lot of the improvement here is also coming 170 00:14:52,080 --> 00:14:57,160 from the design of the tokenizer and how it groups characters into tokens.
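Here is a rough sketch of that comparison, again assuming tiktoken, where cl100k_base is the GPT-4 tokenizer; the FizzBuzz snippet is just an illustrative stand-in for the one in the web app.

```python
# Comparing GPT-2 vs GPT-4 tokenization of whitespace-heavy Python,
# assuming the tiktoken library; the snippet is an illustrative stand-in.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")

code = """for i in range(1, 16):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
"""

print("gpt2:", len(gpt2.encode(code)))         # indentation spaces mostly become separate tokens
print("cl100k_base:", len(gpt4.encode(code)))  # runs of spaces are grouped, so far fewer tokens
```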
Okay, so let's 171 00:14:57,160 --> 00:15:03,200 now start writing some code. So remember what we want to do. We want to take strings and feed them 172 00:15:03,200 --> 00:15:09,080 into language models. For that, we need to somehow tokenize strings into some integers 173 00:15:09,080 --> 00:15:11,320 in some fixed vocabulary. 174 00:15:12,000 --> 00:15:17,040 And then we will use those integers to make a lookup into a lookup table of vectors, and feed 175 00:15:17,040 --> 00:15:22,240 those vectors into the transformer as an input. Now, the reason this gets a little bit tricky, 176 00:15:22,240 --> 00:15:26,320 of course, is that we don't just want to support the simple English alphabet. We want to support 177 00:15:26,320 --> 00:15:31,440 different kinds of languages. So this is annyeonghaseyo in Korean, which is hello. 178 00:15:31,440 --> 00:15:35,520 And we also want to support many kinds of special characters that we might find on the internet, 179 00:15:35,520 --> 00:15:41,920 for example, emoji. So how do we feed this text into a transformer? 180 00:15:42,000 --> 00:15:48,560 Well, what is this text anyway in Python? So if you go to the documentation of a string in 181 00:15:48,560 --> 00:15:55,600 Python, you can see that strings are immutable sequences of Unicode code points. Okay, what are 182 00:15:55,600 --> 00:16:03,040 Unicode code points? We can go to Wikipedia. So Unicode code points are defined by the Unicode 183 00:16:03,040 --> 00:16:09,520 Consortium as part of the Unicode standard. And what this really is, is that it's just a definition 184 00:16:09,520 --> 00:16:11,680 of roughly 150,000 characters 185 00:16:12,000 --> 00:16:18,560 right now, and roughly speaking, what they look like and what integers represent those characters. 186 00:16:18,560 --> 00:16:24,560 So it says 150,000 characters across 161 scripts as of right now. So if you scroll down here, 187 00:16:24,560 --> 00:16:29,760 you can see that the standard is very much alive. The latest standard 15.1 is September 2023. 188 00:16:31,040 --> 00:16:38,800 And basically, this is just a way to define lots of types of characters, like, for example, 189 00:16:38,800 --> 00:16:40,400 all these characters across different scripts. 190 00:16:42,000 --> 00:16:46,480 The way we can access the Unicode code point given a single character is by using the ord 191 00:16:46,480 --> 00:16:52,800 function in Python. So for example, I can pass in ord of h, and I can see that for the single 192 00:16:52,800 --> 00:17:01,280 character h, the Unicode code point is 104. Okay. But this can be arbitrarily complicated. So we 193 00:17:01,280 --> 00:17:06,560 can take, for example, our emoji here, and we can see that the code point for this one is 128,000. 194 00:17:07,520 --> 00:17:08,960 Or we can take the Korean character 안. 195 00:17:12,000 --> 00:17:18,800 Now keep in mind, you can't plug in strings here, because this doesn't have a single code point. 196 00:17:18,800 --> 00:17:25,520 It only takes a single Unicode code point character and tells you its integer. So in this 197 00:17:25,520 --> 00:17:32,640 way, we can look up all the characters of this specific string and their code points. So ord of 198 00:17:32,640 --> 00:17:41,520 x for x in this string. And we get this encoding here. Now see here, we've already turned the raw
So why can't we simply just use these integers and not have any 200 00:17:47,360 --> 00:17:51,520 tokenization at all? Why can't we just use this natively as is and just use the code point? 201 00:17:52,320 --> 00:17:56,160 Well, one reason for that, of course, is that the vocabulary in that case would be quite long. 202 00:17:56,160 --> 00:18:02,880 So in this case, for Unicode, this is a vocabulary of 150,000 different code points. But more 203 00:18:02,880 --> 00:18:08,640 worryingly than that, I think the Unicode standard is very much alive and it keeps changing. And so 204 00:18:08,640 --> 00:18:11,920 it's not kind of a stable representation necessarily that we may want to use. 205 00:18:12,000 --> 00:18:16,720 So for those reasons, we need something a bit better. So to find something better, 206 00:18:16,720 --> 00:18:21,200 we turn to encodings. So if you go to the Wikipedia page here, we see that the Unicode 207 00:18:21,200 --> 00:18:29,440 consortium defines three types of encodings, UTF-8, UTF-16, and UTF-32. These encodings are 208 00:18:29,440 --> 00:18:35,120 the way by which we can take Unicode text and translate it into binary data or byte streams. 209 00:18:36,000 --> 00:18:41,840 UTF-8 is by far the most common. So this is the UTF-8 page. Now, this Wikipedia page is actually 210 00:18:41,840 --> 00:18:46,880 quite long, but what's important for our purposes is that UTF-8 takes every single code point 211 00:18:47,680 --> 00:18:53,440 and it translates it to a byte stream. And this byte stream is between one to four bytes. So it's 212 00:18:53,440 --> 00:18:57,840 a variable length encoding. So depending on the Unicode point according to the schema, 213 00:18:57,840 --> 00:19:03,200 you're going to end up with between one to four bytes for each code point. On top of that, there's 214 00:19:03,200 --> 00:19:10,880 UTF-8, UTF-16, and UTF-32. UTF-32 is nice because it is fixed length instead of variable length, 215 00:19:10,880 --> 00:19:11,840 but it has many other variables. So here's the UTF-8 code, and here's UTF-16 code, and here's UTF-32. 216 00:19:11,840 --> 00:19:18,020 downsides as well. So the full kind of spectrum of pros and cons of all these 217 00:19:18,020 --> 00:19:21,980 different tree encodings are beyond the scope of this video. I just like to point 218 00:19:21,980 --> 00:19:26,600 out that I enjoyed this blog post and this blog post at the end of it also has 219 00:19:26,600 --> 00:19:31,660 a number of references that can be quite useful. One of them is UTF-8 Everywhere 220 00:19:31,660 --> 00:19:37,160 manifesto and this manifesto describes the reason why UTF-8 is significantly 221 00:19:37,160 --> 00:19:42,480 preferred and a lot nicer than the other encodings and why it is used a lot 222 00:19:42,480 --> 00:19:48,700 more prominently on the Internet. One of the major advantages does just give you 223 00:19:48,700 --> 00:19:53,180 a sense is that UTF-8 is the only one of these that is backwards compatible to 224 00:19:53,180 --> 00:19:57,780 the much simpler ASCII encoding of text, but I'm not going to go into the full 225 00:19:57,780 --> 00:20:02,220 detail in this video. So suffice to say that we like the UTF-8 encoding and 226 00:20:02,220 --> 00:20:07,060 let's try to take the string and see what we get if we encode it into UTF-8. 227 00:20:07,060 --> 00:20:07,120 So let's try to take the string and see what we get if we encode it into UTF-8. 
229 00:20:07,140 --> 00:20:13,120 The string class in Python actually has .encode, and you can give it the encoding, which is, 230 00:20:13,120 --> 00:20:20,700 say, UTF-8. Now what we get out of this is not very nice, because this is a bytes object, and 231 00:20:20,700 --> 00:20:25,680 it's not very nice in the way that it's printed. So I personally like to take it through a list, 232 00:20:25,680 --> 00:20:32,760 because then we actually get the raw bytes of this encoding. So these are the raw bytes 233 00:20:32,760 --> 00:20:40,360 that represent this string according to the UTF-8 encoding. We can also look at UTF-16; we get a 234 00:20:40,360 --> 00:20:45,880 slightly different byte stream, and here we start to see one of the disadvantages of UTF-16. You see 235 00:20:45,880 --> 00:20:50,100 how we have zero something, zero something, zero something: we're starting to get a sense 236 00:20:50,100 --> 00:20:55,800 that this is a bit of a wasteful encoding, and indeed, for simple ASCII characters or English 237 00:20:55,800 --> 00:21:01,080 characters here, we just have the structure of zero something, zero something, and it's not exactly 238 00:21:01,080 --> 00:21:02,740 nice. 239 00:21:02,760 --> 00:21:07,320 But if we look at UTF-32, when we expand this, we can start to get a sense of the wastefulness of 240 00:21:07,320 --> 00:21:13,640 this encoding for our purposes: you see a lot of zeros followed by something, and so this is not 241 00:21:13,640 --> 00:21:22,840 desirable. So suffice it to say that we would like to stick with UTF-8 for our purposes. However, if we 242 00:21:22,840 --> 00:21:30,120 just use UTF-8 naively, these are byte streams, so that would imply a vocabulary length of only 256 243 00:21:30,120 --> 00:21:31,080 possible tokens. 244 00:21:32,120 --> 00:21:32,520 But this 245 00:21:32,760 --> 00:21:37,880 vocabulary size is very, very small. What this is going to do, if we just were to use it naively, 246 00:21:37,880 --> 00:21:44,840 is that all of our text would be stretched out over very, very long sequences of bytes. And so 247 00:21:47,320 --> 00:21:51,720 what this does is that, certainly, the embedding table is going to be tiny, and the prediction at 248 00:21:51,720 --> 00:21:56,360 the top of the final layer is going to be very tiny, but our sequences are very long. And remember 249 00:21:56,360 --> 00:22:02,680 that we have pretty finite context lengths in the attention that we can support in a transformer for 250 00:22:02,760 --> 00:22:08,440 computational reasons, and so we only have so much context length, but now we have very, very long 251 00:22:08,440 --> 00:22:12,840 sequences, and this is just inefficient, and it's not going to allow us to attend to sufficiently 252 00:22:12,840 --> 00:22:19,560 long text before us for the purposes of the next-token prediction task. So we don't want to use 253 00:22:20,280 --> 00:22:26,760 the raw bytes of the UTF-8 encoding. We want to be able to support a larger vocabulary size that 254 00:22:26,760 --> 00:22:32,280 we can tune as a hyperparameter, but we want to stick with the UTF-8 encoding of these strings. 255 00:22:32,760 --> 00:22:37,320 So what do we do? Well, the answer of course is we turn to the byte pair encoding algorithm, 256 00:22:37,320 --> 00:22:39,800 which will allow us to compress these byte sequences 257 00:22:41,160 --> 00:22:45,960 to a variable amount. So we'll get to that in a bit, but I just want
to briefly speak to the 258 00:22:45,960 --> 00:22:51,880 fact that I would love nothing more than to be able to feed raw byte sequences into 259 00:22:52,920 --> 00:22:57,800 language models. In fact, there's a paper about how this could potentially be done, from summer 260 00:22:57,800 --> 00:23:02,280 last year. Now the problem is, you actually have to go in and you have to modify the transformer, 261 00:23:03,080 --> 00:23:07,720 because, as I mentioned, you're going to have a problem where the attention will start to become 262 00:23:07,720 --> 00:23:13,720 extremely expensive, because the sequences are so long. And so in this paper they propose kind of a 263 00:23:13,720 --> 00:23:19,880 hierarchical structuring of the transformer that could allow you to just feed in raw bytes. And so 264 00:23:19,880 --> 00:23:24,040 at the end they say, together these results establish the viability of tokenization-free 265 00:23:24,040 --> 00:23:27,960 autoregressive sequence modeling at scale. So tokenization-free would indeed be 266 00:23:27,960 --> 00:23:32,680 amazing; we would just feed byte streams directly into our models. But unfortunately, 267 00:23:32,760 --> 00:23:37,480 I don't know that this has really been proven out yet by sufficiently many groups and at sufficient 268 00:23:37,480 --> 00:23:41,880 scale. But something like this at one point would be amazing, and I hope someone comes up with it. 269 00:23:41,880 --> 00:23:46,840 But for now we have to come back, and we can't feed this directly into language models, and we have to 270 00:23:46,840 --> 00:23:51,320 compress it using the byte pair encoding algorithm. So let's see how that works. So as I mentioned, the 271 00:23:51,320 --> 00:23:55,640 byte pair encoding algorithm is not all that complicated, and the Wikipedia page is actually 272 00:23:55,640 --> 00:24:00,680 quite instructive as far as the basic idea goes. What we're doing is, we have some kind of an input 273 00:24:00,680 --> 00:24:06,600 sequence. Like, for example, here we have only four elements in our vocabulary, a, b, c, and d, and we have 274 00:24:06,600 --> 00:24:13,400 a sequence of them. So instead of bytes, let's say we just had a vocab size of four. The sequence 275 00:24:13,400 --> 00:24:21,320 is too long and we'd like to compress it. So what we do is we iteratively find the pair of tokens that 276 00:24:21,320 --> 00:24:29,320 occurs the most frequently, and then, once we've identified that pair, we replace that pair with 277 00:24:29,320 --> 00:24:30,600 just a single new token 278 00:24:30,680 --> 00:24:37,560 that we append to our vocabulary. So, for example, here the byte pair aa occurs most often, so we 279 00:24:37,560 --> 00:24:43,880 mint a new token, let's call it capital Z, and we replace every single occurrence of aa by Z. 280 00:24:44,680 --> 00:24:52,440 So now we have two Z's here. So here we took a sequence of 11 characters with vocabulary size 281 00:24:52,440 --> 00:25:00,360 four, and we've converted it to a sequence of only nine tokens, but now with a vocabulary of five, 285 00:25:00,660 --> 00:25:02,800 because we have a fifth vocabulary element 286 00:25:02,800 --> 
00:25:03,840 that we just created, 287 00:25:03,840 --> 00:25:07,340 and it's Z standing for concatenation of AA. 288 00:25:07,340 --> 00:25:09,700 And we can, again, repeat this process. 289 00:25:09,700 --> 00:25:12,040 So we, again, look at the sequence 290 00:25:12,040 --> 00:25:16,720 and identify the pair of tokens that are most frequent. 291 00:25:16,720 --> 00:25:19,180 Let's say that that is now AB. 292 00:25:19,180 --> 00:25:20,660 Well, we are going to replace AB 293 00:25:20,660 --> 00:25:23,500 with a new token that we mint, called Y. 294 00:25:23,500 --> 00:25:24,640 So Y becomes AB, 295 00:25:24,640 --> 00:25:26,280 and then every single occurrence of AB 296 00:25:26,280 --> 00:25:28,160 is now replaced with Y. 297 00:25:28,160 --> 00:25:29,820 So we end up with this. 298 00:25:30,660 --> 00:25:32,900 So now we only have one, two, three, 299 00:25:32,900 --> 00:25:36,120 four, five, six, seven characters in our sequence, 300 00:25:36,120 --> 00:25:41,120 but we have not just four vocabulary elements, 301 00:25:41,420 --> 00:25:43,600 or five, but now we have six. 302 00:25:43,600 --> 00:25:45,560 And for the final round, 303 00:25:45,560 --> 00:25:47,320 we, again, look through the sequence, 304 00:25:47,320 --> 00:25:51,520 find that the pair ZY is the most common, 305 00:25:51,520 --> 00:25:55,520 and replace it one more time with another character, 306 00:25:55,520 --> 00:25:56,560 let's say X. 307 00:25:56,560 --> 00:25:59,960 So X is ZY, and we replace all occurrences of ZY, 308 00:25:59,960 --> 00:26:02,020 and we get this following sequence. 309 00:26:02,020 --> 00:26:04,840 So basically, after we have gone through this process, 310 00:26:04,840 --> 00:26:09,840 instead of having a sequence of 11 tokens 311 00:26:12,500 --> 00:26:14,680 with a vocabulary length of four, 312 00:26:14,680 --> 00:26:19,680 we now have a sequence of one, two, three, four, five tokens, 313 00:26:20,780 --> 00:26:24,120 but our vocabulary length now is seven. 314 00:26:24,120 --> 00:26:25,240 And so in this way, 315 00:26:25,240 --> 00:26:27,580 we can iteratively compress our sequence 316 00:26:27,580 --> 00:26:29,280 as we mint new tokens. 317 00:26:29,960 --> 00:26:31,320 In the exact same way, 318 00:26:31,320 --> 00:26:34,260 we start out with byte sequences. 319 00:26:34,260 --> 00:26:37,500 So we have 256 vocabulary size, 320 00:26:37,500 --> 00:26:38,960 but we're now going to go through these 321 00:26:38,960 --> 00:26:42,140 and find the byte pairs that occur the most. 322 00:26:42,140 --> 00:26:44,900 And we're going to iteratively start minting new tokens, 323 00:26:44,900 --> 00:26:48,140 appending them to our vocabulary, and replacing things. 324 00:26:48,140 --> 00:26:48,980 And in this way, 325 00:26:48,980 --> 00:26:51,540 we're going to end up with a compressed training dataset, 326 00:26:51,540 --> 00:26:54,880 and also an algorithm for taking any arbitrary sequence 327 00:26:54,880 --> 00:26:58,200 and encoding it using this vocabulary, 328 00:26:58,200 --> 00:26:59,820 and also decoding it back to strings, 329 00:26:59,820 --> 00:27:02,060 so we can translate between text and tokens. 330 00:27:02,060 --> 00:27:03,960 So let's now implement all that. 331 00:27:03,960 --> 00:27:05,360 So here's what I did. 332 00:27:05,360 --> 00:27:07,560 I went to this blog post that I enjoyed, 333 00:27:07,560 --> 00:27:08,960 and I took the first paragraph 334 00:27:08,960 --> 00:27:11,740 and I copy pasted it here into text.
335 00:27:11,740 --> 00:27:13,900 So this is one very long line here. 336 00:27:15,200 --> 00:27:17,340 Now, to get the tokens, as I mentioned, 337 00:27:17,340 --> 00:27:20,300 we just take our text and we encode it into UTF-8. 338 00:27:20,300 --> 00:27:23,440 The tokens here at this point will be a raw bytes, 339 00:27:23,440 --> 00:27:25,580 single stream of bytes. 340 00:27:25,580 --> 00:27:27,620 And just so that it's easier to work with, 341 00:27:27,620 --> 00:27:29,820 instead of just a bytes object, 342 00:27:29,820 --> 00:27:32,120 we can import all those bytes to integers, 343 00:27:32,120 --> 00:27:33,500 and then create a list of it, 344 00:27:33,500 --> 00:27:35,100 just so it's easier for us to manipulate 345 00:27:35,100 --> 00:27:37,300 and work with in Python and visualize. 346 00:27:37,300 --> 00:27:38,400 And here I'm printing all of that. 347 00:27:38,400 --> 00:27:41,400 So this is the original paragraph, 348 00:27:44,560 --> 00:27:48,800 and its length is 533 code points. 349 00:27:48,800 --> 00:27:52,840 And then here are the bytes encoded in UTF-8. 350 00:27:52,840 --> 00:27:56,240 And we see that this has a length of 616 bytes 351 00:27:56,240 --> 00:27:58,580 at this point, or 616 tokens. 352 00:27:58,580 --> 00:27:59,820 And the reason this is more, 353 00:27:59,820 --> 00:28:03,220 is because a lot of these simple ASCII characters, 354 00:28:03,220 --> 00:28:06,220 or simple characters, they just become a single byte. 355 00:28:06,220 --> 00:28:08,860 But a lot of these Unicode, more complex characters, 356 00:28:08,860 --> 00:28:10,740 become multiple bytes, up to four. 357 00:28:10,740 --> 00:28:12,760 And so we are expanding that size. 358 00:28:13,860 --> 00:28:15,900 So now what we'd like to do as a first step of the algorithm 359 00:28:15,900 --> 00:28:17,700 is we'd like to iterate over here 360 00:28:17,700 --> 00:28:22,040 and find the pair of bytes that occur most frequently, 361 00:28:22,040 --> 00:28:23,840 because we're then going to merge it. 362 00:28:23,840 --> 00:28:26,480 So if you are working along on a notebook on a side, 363 00:28:26,480 --> 00:28:28,740 then I encourage you to basically click on the link, 364 00:28:28,740 --> 00:28:29,680 find this notebook, 365 00:28:29,820 --> 00:28:31,920 and try to write that function yourself. 366 00:28:31,920 --> 00:28:34,060 Otherwise, I'm going to come here and implement first 367 00:28:34,060 --> 00:28:36,300 the function that finds the most common pair. 368 00:28:36,300 --> 00:28:37,660 Okay, so here's what I came up with. 369 00:28:37,660 --> 00:28:39,800 There are many different ways to implement this, 370 00:28:39,800 --> 00:28:41,400 but I'm calling the function getStats. 371 00:28:41,400 --> 00:28:43,300 It expects a list of integers. 372 00:28:43,300 --> 00:28:45,200 I'm using a dictionary to keep track 373 00:28:45,200 --> 00:28:46,800 of basically the counts. 374 00:28:46,800 --> 00:28:48,040 And then this is a Pythonic way 375 00:28:48,040 --> 00:28:51,280 to iterate consecutive elements off this list, 376 00:28:51,280 --> 00:28:53,680 which we covered in a previous video. 377 00:28:53,680 --> 00:28:55,980 And then here, I'm just keeping track of 378 00:28:55,980 --> 00:28:59,320 just incrementing by one for all the pairs. 379 00:28:59,820 --> 00:29:02,160 So if I do this on all the tokens here, 380 00:29:02,160 --> 00:29:04,460 then the stats comes out here. 381 00:29:04,460 --> 00:29:05,860 So this is the dictionary. 
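Here is one possible version of that pair-counting function, matching the description above; tokens is assumed to be the list of UTF-8 byte values from the previous cell.

```python
# One possible version of the pair-counting function described above.
def get_stats(ids):
    """Count how often each consecutive pair of integers occurs in ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):          # iterate consecutive elements
        counts[pair] = counts.get(pair, 0) + 1
    return counts

stats = get_stats(tokens)   # `tokens` is the list of UTF-8 byte values from above
```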
382 00:29:05,860 --> 00:29:09,700 The keys are these tuples of consecutive elements, 383 00:29:09,700 --> 00:29:11,340 and this is the count. 384 00:29:11,340 --> 00:29:14,940 So just to print it in a slightly better way, 385 00:29:14,940 --> 00:29:17,740 this is one way that I like to do that, 386 00:29:17,740 --> 00:29:20,980 where it's a little bit convoluted here, 387 00:29:20,980 --> 00:29:22,380 so you can pause if you like. 388 00:29:22,380 --> 00:29:24,640 But we iterate all the items. 389 00:29:24,640 --> 00:29:29,220 Calling items on the dictionary returns pairs of (key, value); 390 00:29:29,220 --> 00:29:34,100 instead, I create a list here of (value, key), 391 00:29:34,100 --> 00:29:36,260 because if it's a (value, key) list, 392 00:29:36,260 --> 00:29:38,060 then I can call sort on it. 393 00:29:38,060 --> 00:29:42,800 And by default, Python will use the first element, 394 00:29:42,800 --> 00:29:44,100 which in this case will be the value, 395 00:29:44,100 --> 00:29:46,540 to sort by if it's given tuples. 396 00:29:46,540 --> 00:29:49,240 And then reverse, so it's descending, and print that. 397 00:29:49,240 --> 00:29:52,680 So basically, it looks like 101 comma 32 398 00:29:52,680 --> 00:29:55,580 was the most commonly occurring consecutive pair, 399 00:29:55,580 --> 00:29:57,480 and it occurred 20 times. 400 00:29:57,480 --> 00:29:59,180 We can double check that that makes 401 00:29:59,220 --> 00:30:00,220 reasonable sense. 402 00:30:00,220 --> 00:30:03,160 So if I just search 101, 32, 403 00:30:03,160 --> 00:30:07,660 then you see that these are the 20 occurrences of that pair. 404 00:30:09,440 --> 00:30:11,460 And if we'd like to take a look at what exactly 405 00:30:11,460 --> 00:30:13,940 that pair is, we can use chr, 406 00:30:13,940 --> 00:30:16,340 which is the opposite of ord in Python. 407 00:30:16,340 --> 00:30:19,180 So we give it a Unicode code point, 408 00:30:19,180 --> 00:30:24,180 so chr of 101 and chr of 32, and we see that this is e and space. 409 00:30:24,940 --> 00:30:28,220 So basically, there's a lot of e space here, 410 00:30:28,220 --> 00:30:29,220 meaning that a lot of these words 411 00:30:29,220 --> 00:30:30,580 seem to end with e. 412 00:30:30,580 --> 00:30:33,120 So here's e space as an example. 413 00:30:33,120 --> 00:30:34,460 So there's a lot of that going on here, 414 00:30:34,460 --> 00:30:36,720 and this is the most common pair. 415 00:30:36,720 --> 00:30:39,040 So now that we've identified the most common pair, 416 00:30:39,040 --> 00:30:41,860 we would like to iterate over the sequence. 417 00:30:41,860 --> 00:30:46,240 We're going to mint a new token with the ID of 256, right? 418 00:30:46,240 --> 00:30:49,640 Because these tokens currently go from zero to 255. 419 00:30:49,640 --> 00:30:51,080 So when we create a new token, 420 00:30:51,080 --> 00:30:53,680 it will have an ID of 256. 421 00:30:53,680 --> 00:30:58,340 And we're going to iterate over this entire list, 422 00:30:58,340 --> 00:30:59,180 and every 423 00:30:59,220 --> 00:31:02,020 time we see 101 comma 32, 424 00:31:02,020 --> 00:31:04,860 we're going to swap that out for 256. 425 00:31:04,860 --> 00:31:06,960 So let's implement that now, 426 00:31:06,960 --> 00:31:09,800 and feel free to do that yourself as well. 427 00:31:09,800 --> 00:31:11,600 So first I commented this, 428 00:31:11,600 --> 00:31:14,900 just so we don't pollute the notebook too much.
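In code, the sorting and the chr lookup described above might look roughly like this (using the stats dictionary from before):

```python
# Sorting the (count, pair) items to see the most frequent pairs first.
print(sorted(((count, pair) for pair, count in stats.items()), reverse=True)[:5])

# chr is the inverse of ord: it maps a code point back to a character.
print(chr(101), chr(32))   # 'e' and ' ', the most common consecutive pair in this text
```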
429 00:31:14,900 --> 00:31:17,980 This is a nice way of, in Python, 430 00:31:17,980 --> 00:31:20,400 obtaining the highest ranking pair. 431 00:31:20,400 --> 00:31:24,980 So we're basically calling the max on this dictionary stats, 432 00:31:24,980 --> 00:31:28,580 and this will return the maximum key. 433 00:31:28,580 --> 00:31:31,720 And then the question is, how does it rank keys? 434 00:31:31,720 --> 00:31:34,820 So you can provide it with a function that ranks keys, 435 00:31:34,820 --> 00:31:36,860 and that function is just stats.get. 436 00:31:37,720 --> 00:31:41,100 Stats.get would basically return the value. 437 00:31:41,100 --> 00:31:42,720 And so we're ranking by the value 438 00:31:42,720 --> 00:31:44,460 and getting the maximum key. 439 00:31:44,460 --> 00:31:47,500 So it's 101 comma 32, as we saw. 440 00:31:47,500 --> 00:31:49,840 Now to actually merge 101, 32, 441 00:31:51,040 --> 00:31:52,200 this is the function that I wrote, 442 00:31:52,200 --> 00:31:54,800 but again, there are many different versions of it. 443 00:31:54,800 --> 00:31:56,880 So we're going to take a list of IDs 444 00:31:56,880 --> 00:31:57,780 and the pair that we want to replace, and we're going to do that. 445 00:31:57,780 --> 00:31:58,580 And we're going to take a list of IDs, and the pair that we want to replace, 446 00:31:58,580 --> 00:32:02,820 and that pair will be replaced with the new index IDX. 447 00:32:02,820 --> 00:32:04,980 So iterating through IDs, 448 00:32:04,980 --> 00:32:08,220 if we find the pair, swap it out for IDX. 449 00:32:08,220 --> 00:32:11,660 So we create this new list, and then we start at zero, 450 00:32:11,660 --> 00:32:13,820 and then we go through this entire list sequentially 451 00:32:13,820 --> 00:32:14,820 from left to right. 452 00:32:15,760 --> 00:32:18,000 And here we are checking for equality 453 00:32:18,000 --> 00:32:20,400 at the current position with the pair. 454 00:32:22,200 --> 00:32:24,640 So here we are checking that the pair matches. 455 00:32:24,640 --> 00:32:26,340 Now here's a bit of a tricky condition 456 00:32:26,340 --> 00:32:28,580 that you have to append if you're trying to be careful. 457 00:32:28,580 --> 00:32:31,620 And that is that you don't want this here 458 00:32:31,620 --> 00:32:34,220 to be out of bounds at the very last position 459 00:32:34,220 --> 00:32:36,580 when you're on the rightmost element of this list. 460 00:32:36,580 --> 00:32:39,320 Otherwise this would give you an out of bounds error. 461 00:32:39,320 --> 00:32:40,700 So we have to make sure that we're not 462 00:32:40,700 --> 00:32:42,720 at the very, very last element. 463 00:32:42,720 --> 00:32:45,760 So this would be false for that. 464 00:32:45,760 --> 00:32:50,500 So if we find a match, we append to this new list 465 00:32:50,500 --> 00:32:54,000 that replacement index, and we increment the position by two. 466 00:32:54,000 --> 00:32:56,340 So we skip over that entire pair. 467 00:32:56,340 --> 00:32:58,580 But otherwise, if we haven't found a matching pair, 468 00:32:58,580 --> 00:33:02,380 we just sort of copy over the element at that position 469 00:33:02,380 --> 00:33:06,020 and increment by one, and then return this. 470 00:33:06,020 --> 00:33:07,520 So here's a very small toy example. 
471 00:33:07,520 --> 00:33:09,960 If we have a list five, six, six, seven, nine, one, 472 00:33:09,960 --> 00:33:13,300 and we wanna replace the occurrences of 67 with 99, 473 00:33:13,300 --> 00:33:17,000 then calling this on that will give us 474 00:33:17,000 --> 00:33:18,300 what we're asking for. 475 00:33:18,300 --> 00:33:21,140 So here the six, seven is replaced with 99. 476 00:33:22,340 --> 00:33:25,940 So now I'm gonna uncomment this for our actual use case, 477 00:33:25,940 --> 00:33:28,200 where we wanna take our tokens, 479 00:33:28,580 --> 00:33:31,020 we wanna take the top pair here 480 00:33:31,020 --> 00:33:34,080 and replace it with 256 to get tokens2. 481 00:33:34,080 --> 00:33:36,660 If we run this, we get the following. 482 00:33:38,220 --> 00:33:43,220 So recall that previously we had a length 616 in this list. 483 00:33:44,700 --> 00:33:47,540 And now we have a length 596, right? 484 00:33:47,540 --> 00:33:50,200 So this decreased by 20, which makes sense 485 00:33:50,200 --> 00:33:52,400 because there are 20 occurrences. 486 00:33:52,400 --> 00:33:55,480 Moreover, we can try to find 256 here, 487 00:33:55,480 --> 00:33:57,600 and we see plenty of occurrences of it. 488 00:33:58,580 --> 00:33:59,780 And moreover, just to double check, 489 00:33:59,780 --> 00:34:02,340 there should be no occurrence of 101, 32. 490 00:34:02,340 --> 00:34:04,980 So this is the original array, plenty of them. 491 00:34:04,980 --> 00:34:08,320 And in the second array, there are no occurrences of 101, 32. 492 00:34:08,320 --> 00:34:11,500 So we've successfully merged this single pair. 493 00:34:11,500 --> 00:34:13,320 And now we just iterate this. 494 00:34:13,320 --> 00:34:15,440 So we are gonna go over the sequence again, 495 00:34:15,440 --> 00:34:17,740 find the most common pair and replace it. 496 00:34:17,740 --> 00:34:20,180 So let me now write a while loop that uses these functions 497 00:34:20,180 --> 00:34:22,820 to do this sort of iteratively. 498 00:34:22,820 --> 00:34:25,180 And how many times do we do it for? 499 00:34:25,180 --> 00:34:27,460 Well, that's totally up to us as a hyperparameter. 500 00:34:27,460 --> 00:34:28,460 The more, the better. 501 00:34:28,580 --> 00:34:32,380 The more steps we take, the larger will be our vocabulary 502 00:34:32,380 --> 00:34:34,460 and the shorter will be our sequence. 503 00:34:34,460 --> 00:34:36,940 And there is some sweet spot that we usually find 504 00:34:36,940 --> 00:34:38,760 works the best in practice. 505 00:34:38,760 --> 00:34:41,560 And so this is kind of a hyperparameter and we tune it 506 00:34:41,560 --> 00:34:44,000 and we find good vocabulary sizes. 507 00:34:44,000 --> 00:34:47,760 As an example, GPT-4 currently uses roughly 100,000 tokens, 508 00:34:47,760 --> 00:34:51,560 and ballpark, those are reasonable numbers currently 509 00:34:51,560 --> 00:34:53,460 in state-of-the-art language models. 510 00:34:53,460 --> 00:34:56,560 So let me now write, putting it all together 511 00:34:56,560 --> 00:34:58,340 and iterating these steps.
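For reference, here is one way the merge function and the toy example above might look; this is a sketch consistent with the description, not necessarily the exact notebook code.

```python
# One way to write the merge step described above.
def merge(ids, pair, idx):
    """Replace every consecutive occurrence of pair in ids with the new token idx."""
    newids = []
    i = 0
    while i < len(ids):
        # check for a match, being careful not to run past the end of the list
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2                      # skip over both elements of the pair
        else:
            newids.append(ids[i])
            i += 1
    return newids

print(merge([5, 6, 6, 7, 9, 1], (6, 7), 99))   # -> [5, 6, 99, 9, 1]

top_pair = max(stats, key=stats.get)           # (101, 32) for this text
tokens2 = merge(tokens, top_pair, 256)         # length drops from 616 to 596 here
```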
512 00:34:58,580 --> 00:35:00,480 Okay, now, before we dive into the while loop, 513 00:35:00,480 --> 00:35:03,120 I wanted to add one more cell here 514 00:35:03,120 --> 00:35:04,520 where I went to the blog post 515 00:35:04,520 --> 00:35:07,040 and instead of grabbing just the first paragraph or two, 516 00:35:07,040 --> 00:35:08,680 I took the entire blog post 517 00:35:08,680 --> 00:35:10,740 and I stretched it out in a single line. 518 00:35:10,740 --> 00:35:12,400 And basically just using longer text 519 00:35:12,400 --> 00:35:14,680 will allow us to have more representative statistics 520 00:35:14,680 --> 00:35:15,880 for the byte pairs, 521 00:35:15,880 --> 00:35:18,900 and we'll just get more sensible results out of it 522 00:35:18,900 --> 00:35:20,240 because it's longer text. 523 00:35:21,420 --> 00:35:23,140 So here we have the raw text. 524 00:35:23,140 --> 00:35:27,380 We encode it into bytes using the UTF-8 encoding. 525 00:35:27,380 --> 00:35:28,460 And then here, 526 00:35:28,460 --> 00:35:30,600 as before, we are just changing it 527 00:35:30,600 --> 00:35:32,200 into a list of integers in Python, 528 00:35:32,200 --> 00:35:34,020 just so it's easier to work with 529 00:35:34,020 --> 00:35:36,340 instead of the raw bytes objects. 530 00:35:36,340 --> 00:35:40,200 And then this is the code that I came up with 531 00:35:40,200 --> 00:35:43,840 to actually do the merging and loop. 532 00:35:43,840 --> 00:35:45,660 These two functions here are identical 533 00:35:45,660 --> 00:35:46,700 to what we had above. 534 00:35:46,700 --> 00:35:48,080 I only included them here 535 00:35:48,080 --> 00:35:50,740 just so that you have the point of reference here. 536 00:35:51,640 --> 00:35:53,980 So these two are identical, 537 00:35:53,980 --> 00:35:56,500 and then this is the new code that I added. 538 00:35:56,500 --> 00:35:58,340 So the first thing you wanna do is you want to decide on 539 00:35:58,340 --> 00:36:00,140 the final vocabulary size 540 00:36:00,140 --> 00:36:02,440 that we want our tokenizer to have. 541 00:36:02,440 --> 00:36:03,980 And as I mentioned, this is a hyperparameter 542 00:36:03,980 --> 00:36:05,240 and you set it in some way 543 00:36:05,240 --> 00:36:07,480 depending on your best performance. 544 00:36:07,480 --> 00:36:10,100 So let's say for us, we're going to use 276, 545 00:36:10,100 --> 00:36:13,960 because that way we're going to be doing exactly 20 merges. 546 00:36:13,960 --> 00:36:18,380 And 20 merges because we already have 256 tokens 547 00:36:18,380 --> 00:36:20,240 for the raw bytes. 548 00:36:20,240 --> 00:36:23,560 And to reach 276, we have to do 20 merges 549 00:36:23,560 --> 00:36:25,120 to add 20 new tokens. 550 00:36:26,480 --> 00:36:27,560 Here, this is one way in Python 553 00:36:28,620 --> 00:36:30,720 to just create a copy of a list. 554 00:36:31,580 --> 00:36:33,240 So I'm taking the tokens list 555 00:36:33,240 --> 00:36:34,880 and by wrapping it in list, 556 00:36:34,880 --> 00:36:36,900 Python will construct a new list 557 00:36:36,900 --> 00:36:38,060 of all the individual elements. 558 00:36:38,060 --> 00:36:39,660 So this is just a copy operation. 559 00:36:40,960 --> 00:36:44,300 Then here, I'm creating a merges dictionary.
560 00:36:44,300 --> 00:36:47,120 So this merges dictionary is going to maintain basically 561 00:36:47,120 --> 00:36:52,120 the child one, child two mapping to a new token. 562 00:36:52,260 --> 00:36:53,700 And so what we're going to be building up here 563 00:36:53,700 --> 00:36:56,460 is a binary tree of merges. 564 00:36:56,460 --> 00:36:57,840 But actually it's not exactly a tree, 565 00:36:57,840 --> 00:37:00,500 because a tree would have a single root node 566 00:37:00,500 --> 00:37:02,220 with a bunch of leaves. 567 00:37:02,220 --> 00:37:04,640 For us, we're starting with the leaves on the bottom, 568 00:37:04,640 --> 00:37:06,060 which are the individual bytes. 569 00:37:06,060 --> 00:37:08,600 Those are the starting 256 tokens. 570 00:37:08,600 --> 00:37:11,400 And then we're starting to like merge two of them at a time. 571 00:37:11,400 --> 00:37:13,800 And so it's not a tree, it's more like a forest 572 00:37:16,180 --> 00:37:18,360 as we merge these elements. 573 00:37:18,360 --> 00:37:21,560 So for 20 merges, 574 00:37:21,560 --> 00:37:24,840 we're going to find the most commonly occurring pair. 575 00:37:24,840 --> 00:37:27,840 We're going to mint a new token integer for it. 576 00:37:27,840 --> 00:37:29,360 So i here will start at zero, 577 00:37:29,360 --> 00:37:31,840 so the new token IDs are going to start at 256. 578 00:37:31,840 --> 00:37:33,520 We're going to print that we're merging it. 579 00:37:33,520 --> 00:37:36,560 And we're going to replace all the occurrences of that pair 580 00:37:36,560 --> 00:37:39,440 with that newly minted token. 581 00:37:39,440 --> 00:37:42,900 And we're going to record that this pair of integers 582 00:37:42,900 --> 00:37:45,340 merged into this new integer. 583 00:37:46,400 --> 00:37:49,860 So running this gives us the following output. 584 00:37:52,080 --> 00:37:54,060 So we did 20 merges. 585 00:37:54,060 --> 00:37:57,180 And for example, the first merge was exactly as before. 586 00:37:57,180 --> 00:38:02,180 The 101, 32 tokens merging into a new token, 256. 587 00:38:02,900 --> 00:38:06,460 Now keep in mind that the individual tokens 101 and 32 588 00:38:06,460 --> 00:38:09,240 can still occur in the sequence after merging. 589 00:38:09,240 --> 00:38:11,820 It's only when they occur exactly consecutively 590 00:38:11,820 --> 00:38:13,700 that that becomes 256 now. 591 00:38:15,700 --> 00:38:17,640 And in particular, the other thing to notice here 592 00:38:17,640 --> 00:38:20,920 is that the token 256, which is the newly minted token, 593 00:38:20,920 --> 00:38:22,860 is also eligible for merging. 594 00:38:22,860 --> 00:38:25,660 So here on the bottom, the 20th merge was a merge of 256 595 00:38:25,660 --> 00:38:27,180 and 259 596 00:38:27,180 --> 00:38:29,800 becoming 275. 597 00:38:29,800 --> 00:38:32,180 So every time we replace these tokens, 598 00:38:32,180 --> 00:38:33,680 they become eligible for merging 599 00:38:33,680 --> 00:38:35,820 in the next round of the iteration. 600 00:38:35,820 --> 00:38:38,440 So that's why we're building up a small sort of binary forest 601 00:38:38,440 --> 00:38:40,240 instead of a single individual tree. 602 00:38:41,260 --> 00:38:43,020 One thing we can take a look at as well 603 00:38:43,020 --> 00:38:44,740 is we can take a look at the compression ratio 604 00:38:44,740 --> 00:38:46,120 that we've achieved. 605 00:38:46,120 --> 00:38:49,360 So in particular, we started off with this tokens list.
606 00:38:50,240 --> 00:38:53,480 So we started off with 24,000 bytes. 607 00:38:53,480 --> 00:38:56,940 And after merging 20 times, we now have, you know, 608 00:38:56,940 --> 00:39:01,240 we now have only 19,000 tokens. 609 00:39:01,240 --> 00:39:02,700 And so therefore, the compression ratio, 610 00:39:02,700 --> 00:39:06,240 simply just dividing the two, is roughly 1.27. 611 00:39:06,240 --> 00:39:08,380 So that's the amount of compression we were able to achieve 612 00:39:08,380 --> 00:39:11,120 of this text with only 20 merges. 613 00:39:12,180 --> 00:39:15,380 And of course, the more vocabulary elements you add, 614 00:39:15,380 --> 00:39:17,700 the greater the compression ratio here would be. 615 00:39:20,320 --> 00:39:24,160 Finally, so that's kind of like the training 616 00:39:24,160 --> 00:39:25,760 of the tokenizer, if you will. 617 00:39:25,760 --> 00:39:26,940 Now, one point that I wanted 618 00:39:26,940 --> 00:39:29,360 to make is that, and maybe this is a diagram 619 00:39:29,360 --> 00:39:32,320 that can help kind of illustrate, 620 00:39:32,320 --> 00:39:34,780 is that the tokenizer is a completely separate object 621 00:39:34,780 --> 00:39:36,780 from the large language model itself. 622 00:39:36,780 --> 00:39:37,920 So everything in this lecture, 623 00:39:37,920 --> 00:39:40,080 we're not really touching the LLM itself. 624 00:39:40,080 --> 00:39:41,660 We're just training the tokenizer. 625 00:39:41,660 --> 00:39:44,820 This is a completely separate pre-processing stage usually. 626 00:39:44,820 --> 00:39:47,440 So the tokenizer will have its own training set, 627 00:39:47,440 --> 00:39:48,640 just like a large language model 628 00:39:48,640 --> 00:39:51,540 has a potentially different training set. 629 00:39:51,540 --> 00:39:53,300 So the tokenizer has a training set of documents 630 00:39:53,300 --> 00:39:55,720 on which you're going to train the tokenizer. 631 00:39:55,720 --> 00:39:56,880 And then, 632 00:39:56,940 --> 00:39:59,400 we're performing the byte-pair encoding algorithm, 633 00:39:59,400 --> 00:40:00,480 as we saw above, 634 00:40:00,480 --> 00:40:03,620 to train the vocabulary of this tokenizer. 635 00:40:03,620 --> 00:40:05,020 So it has its own training set. 636 00:40:05,020 --> 00:40:06,340 It has a pre-processing stage 637 00:40:06,340 --> 00:40:08,780 that you would run a single time in the beginning. 638 00:40:09,880 --> 00:40:11,700 And the tokenizer is trained 639 00:40:11,700 --> 00:40:13,800 using byte-pair encoding algorithm. 640 00:40:13,800 --> 00:40:15,120 Once you have the tokenizer, 641 00:40:15,120 --> 00:40:17,160 once it's trained and you have the vocabulary 642 00:40:17,160 --> 00:40:19,180 and you have the merges, 643 00:40:19,180 --> 00:40:22,260 we can do both encoding and decoding. 644 00:40:22,260 --> 00:40:24,300 So these two arrows here. 645 00:40:24,300 --> 00:40:26,420 So the tokenizer is a translation layer 646 00:40:26,420 --> 00:40:28,180 between raw text, 647 00:40:28,180 --> 00:40:31,800 which is as we saw the sequence of Unicode code points. 648 00:40:31,800 --> 00:40:35,500 It can take raw text and turn it into a token sequence 649 00:40:35,500 --> 00:40:36,340 and vice versa, 650 00:40:36,340 --> 00:40:38,060 it can take a token sequence 651 00:40:38,060 --> 00:40:40,100 and translate it back into raw text. 
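As a small aside, the compression ratio mentioned a moment ago is just the length of the raw byte sequence divided by the length of the merged sequence (the numbers here are the approximate ones quoted above):

    print("bytes length:", len(tokens))   # the raw UTF-8 byte sequence, roughly 24,000 here
    print("ids length:", len(ids))        # after the 20 merges, roughly 19,000 here
    print(f"compression ratio: {len(tokens) / len(ids):.2f}X")   # roughly 1.27X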
652 00:40:41,860 --> 00:40:44,100 So now that we have trained the tokenizer 653 00:40:44,100 --> 00:40:45,900 and we have these merges, 654 00:40:45,900 --> 00:40:48,080 we are going to turn to how we can do the encoding 655 00:40:48,080 --> 00:40:49,360 and the decoding step. 656 00:40:49,360 --> 00:40:51,980 If you give me text, here are the tokens and vice versa. 657 00:40:51,980 --> 00:40:54,200 If you give me tokens, here's the text. 658 00:40:54,200 --> 00:40:55,040 Once we have that, 659 00:40:55,040 --> 00:40:55,960 we can translate between these two. 660 00:40:55,960 --> 00:40:59,140 And then the language model is going to be trained 661 00:40:59,140 --> 00:41:01,080 as a step two afterwards. 662 00:41:01,080 --> 00:41:05,240 And typically in a sort of a state of the art application, 663 00:41:05,240 --> 00:41:06,640 you might take all of your training data 664 00:41:06,640 --> 00:41:07,760 for the language model 665 00:41:07,760 --> 00:41:09,420 and you might run it through the tokenizer 666 00:41:09,420 --> 00:41:11,260 and sort of translate everything 667 00:41:11,260 --> 00:41:13,020 into a massive token sequence. 668 00:41:13,020 --> 00:41:14,640 And then you can throw away the raw text. 669 00:41:14,640 --> 00:41:17,000 You're just left with the tokens themselves. 670 00:41:17,000 --> 00:41:19,100 And those are stored on disk. 671 00:41:19,100 --> 00:41:21,420 And that is what the large language model is actually reading 672 00:41:21,420 --> 00:41:22,920 when it's training on them. 673 00:41:22,920 --> 00:41:24,140 So that's one approach that you can take 674 00:41:24,140 --> 00:41:25,880 as a single massive pre-processing state. 675 00:41:25,960 --> 00:41:30,220 So yeah, basically, 676 00:41:30,220 --> 00:41:31,800 I think the most important thing I want to get across 677 00:41:31,800 --> 00:41:33,440 is that this is completely separate stage. 678 00:41:33,440 --> 00:41:36,340 It usually has its own entire training set. 679 00:41:36,340 --> 00:41:38,460 You may want to have those training sets be different 680 00:41:38,460 --> 00:41:40,720 between the tokenizer and the large language model. 681 00:41:40,720 --> 00:41:43,160 So for example, when you're training the tokenizer, 682 00:41:43,160 --> 00:41:44,000 as I mentioned, 683 00:41:44,000 --> 00:41:46,420 we don't just care about the performance of English text. 684 00:41:46,420 --> 00:41:49,460 We care about multi, many different languages. 685 00:41:49,460 --> 00:41:51,640 And we also care about code or not code. 686 00:41:51,640 --> 00:41:54,600 So you may want to look into different kinds of mixtures 687 00:41:54,600 --> 00:41:55,960 of different kinds of languages 688 00:41:55,960 --> 00:41:58,760 and different amounts of code and things like that, 689 00:41:58,760 --> 00:42:01,340 because the amount of different language 690 00:42:01,340 --> 00:42:03,720 that you have in your tokenizer training set 691 00:42:03,720 --> 00:42:07,280 will determine how many merges of it there will be. 692 00:42:07,280 --> 00:42:09,420 And therefore that determines the density 693 00:42:09,420 --> 00:42:14,420 with which this type of data is sort of has 694 00:42:14,500 --> 00:42:16,220 in the token space. 
695 00:42:16,220 --> 00:42:18,800 And so roughly speaking, intuitively, 696 00:42:18,800 --> 00:42:20,000 if you add some amount of data, 697 00:42:20,000 --> 00:42:21,720 like say you have a ton of Japanese data 698 00:42:21,720 --> 00:42:24,060 in your tokenizer training set, 701 00:42:24,300 --> 00:42:26,800 then that means that more Japanese tokens will get merged, 702 00:42:26,800 --> 00:42:29,900 and therefore Japanese will have shorter sequences. 703 00:42:29,900 --> 00:42:31,120 And that's going to be beneficial 704 00:42:31,120 --> 00:42:32,340 for the large language model, 705 00:42:32,340 --> 00:42:34,260 which has a finite context length 706 00:42:34,260 --> 00:42:37,780 on which it can work in the token space. 707 00:42:37,780 --> 00:42:39,300 So hopefully that makes sense. 708 00:42:39,300 --> 00:42:42,080 So we're now going to turn to encoding and decoding 709 00:42:42,080 --> 00:42:44,000 now that we have trained a tokenizer. 710 00:42:44,000 --> 00:42:46,120 So we have our merges, 711 00:42:46,120 --> 00:42:48,300 and now how do we do encoding and decoding? 712 00:42:48,300 --> 00:42:50,080 Okay, so let's begin with decoding, 713 00:42:50,080 --> 00:42:51,840 which is this arrow over here. 714 00:42:51,840 --> 00:42:53,780 So given a token sequence, 715 00:42:53,780 --> 00:42:54,980 let's go through the tokenizer 716 00:42:54,980 --> 00:42:57,200 to get back a Python string object, 717 00:42:57,200 --> 00:42:59,100 so the raw text. 718 00:42:59,100 --> 00:43:01,820 So this is the function that we'd like to implement. 719 00:43:01,820 --> 00:43:03,240 We're given the list of integers, 720 00:43:03,240 --> 00:43:04,960 and we want to return a Python string. 721 00:43:04,960 --> 00:43:07,400 If you'd like, try to implement this function yourself. 722 00:43:07,400 --> 00:43:08,560 It's a fun exercise. 723 00:43:08,560 --> 00:43:12,240 Otherwise, I'm going to start pasting in my own solution. 724 00:43:12,240 --> 00:43:15,080 So there are many different ways to do it. 725 00:43:15,080 --> 00:43:16,400 Here's one way. 726 00:43:16,400 --> 00:43:19,440 I will create a kind of preprocessing variable 727 00:43:19,440 --> 00:43:20,760 that I will call vocab. 731 00:43:23,780 --> 00:43:26,020 And vocab is a mapping or dictionary in Python 732 00:43:26,020 --> 00:43:31,020 from the token ID to the bytes object for that token. 733 00:43:31,560 --> 00:43:35,880 So we begin with the raw bytes for tokens from zero to 255. 734 00:43:35,880 --> 00:43:38,760 And then we go in order of all the merges, 735 00:43:38,760 --> 00:43:41,720 and we sort of populate this vocab dictionary 736 00:43:41,720 --> 00:43:43,360 by doing an addition here. 737 00:43:43,360 --> 00:43:46,860 So this is basically the bytes representation 738 00:43:46,860 --> 00:43:49,940 of the first child, followed by the second one. 739 00:43:49,940 --> 00:43:51,740 And remember, these are bytes objects. 740 00:43:51,740 --> 00:43:53,620 So this addition here is an addition 741 00:43:53,620 --> 00:43:57,080 of two bytes objects, just concatenation. 742 00:43:57,080 --> 00:43:58,440 So that's what we get here. 
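A sketch of that vocab construction, following the description here (raw bytes first, then the merges in insertion order):

    # token id -> bytes object for that token
    vocab = {idx: bytes([idx]) for idx in range(256)}
    for (p0, p1), idx in merges.items():
        vocab[idx] = vocab[p0] + vocab[p1]   # bytes concatenation of the two children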
743 00:43:59,700 --> 00:44:01,540 One tricky thing to be careful with, by the way, 744 00:44:01,540 --> 00:44:04,340 is that I'm iterating a dictionary in Python 745 00:44:04,340 --> 00:44:05,900 using a dot items. 746 00:44:05,900 --> 00:44:09,420 And it really matters that this runs in the order 747 00:44:09,420 --> 00:44:13,220 in which we inserted items into the merges dictionary. 748 00:44:13,220 --> 00:44:15,120 Luckily, starting with Python 3.7, 749 00:44:15,120 --> 00:44:16,640 this is guaranteed to be the case. 750 00:44:16,640 --> 00:44:18,280 But before Python 3.7, 751 00:44:18,280 --> 00:44:20,140 this iteration may have been out of order 752 00:44:20,140 --> 00:44:22,880 with respect to how we inserted elements into merges. 753 00:44:22,880 --> 00:44:23,540 And this may not have been the case. 754 00:44:24,440 --> 00:44:27,120 But we are using modern Python, so we're okay. 755 00:44:29,120 --> 00:44:30,400 And then here, given the IDs, 756 00:44:30,400 --> 00:44:33,520 the first thing we're going to do is get the tokens. 757 00:44:33,520 --> 00:44:37,840 So the way I implemented this here is I'm taking, 758 00:44:37,840 --> 00:44:39,880 I'm iterating over all the IDs. 759 00:44:39,880 --> 00:44:42,080 I'm using vocab to look up their bytes. 760 00:44:42,080 --> 00:44:44,660 And then here, this is one way in Python 761 00:44:44,660 --> 00:44:48,940 to concatenate all these bytes together to create our tokens. 762 00:44:48,940 --> 00:44:52,000 And then these tokens here at this point are raw bytes. 763 00:44:52,000 --> 00:44:59,180 bytes so I have to decode using UTF-8 now back into Python strings. So 764 00:44:59,180 --> 00:45:03,240 previously we called dot encode on a string object to get the bytes and now 765 00:45:03,240 --> 00:45:07,680 we're doing it opposite. We're taking the bytes and calling a decode on the bytes 766 00:45:07,680 --> 00:45:15,920 object to get a string in Python and then we can return text. So this is how 767 00:45:15,920 --> 00:45:21,700 we can do it. Now this actually has a issue in the way I implemented it. In 768 00:45:21,700 --> 00:45:25,660 this could actually throw an error. So try to figure out why this code 769 00:45:25,660 --> 00:45:31,660 could actually result in an error if we plug in some sequence of IDs that is 770 00:45:31,660 --> 00:45:36,880 unlucky. So let me demonstrate the issue. When I try to decode just something like 771 00:45:36,880 --> 00:45:43,720 97, I am going to get a letter A here back so nothing too crazy happening. But 772 00:45:43,720 --> 00:45:50,220 when I try to decode 128 as a single element, the token 128 is what in string 773 00:45:50,220 --> 00:45:51,660 or in Python. 774 00:45:51,700 --> 00:45:58,240 So this is an object, Unicode decoder. UTF-8 can't decode byte 0x80 which is 775 00:45:58,240 --> 00:46:03,340 this in hex in position 0 invalid start byte. What does that mean? Well to 776 00:46:03,340 --> 00:46:07,120 understand what this means we have to go back to our UTF-8 page that I briefly 777 00:46:07,120 --> 00:46:12,280 showed earlier and this is Wikipedia UTF-8 and basically there's a specific 778 00:46:12,280 --> 00:46:17,920 schema that UTF-8 bytes take. So in particular if you have a multi byte 779 00:46:17,920 --> 00:46:21,640 object for some of the Unicode characters they have to have the 780 00:46:21,700 --> 00:46:26,500 special sort of envelope in how the encoding works. 
And so what's happening 781 00:46:26,500 --> 00:46:33,220 here is that invalid start byte. That's because 128, the binary representation of 782 00:46:33,220 --> 00:46:39,580 it, is 1 followed by all zeros. So we have 1 and then all 0 and we see here that 783 00:46:39,580 --> 00:46:43,000 that doesn't conform to the format because 1 followed by all 0 just doesn't 784 00:46:43,000 --> 00:46:48,100 fit any of these rules so to speak. So it's an invalid start byte, which is 785 00:46:48,100 --> 00:46:51,600 this one here. This 1 must have a 1 following it 786 00:46:51,700 --> 00:46:57,220 and then a 0 following it, and then the content of your Unicode character goes in the x's here. So 787 00:46:57,220 --> 00:47:01,300 basically we don't exactly follow the UTF-8 standard and this cannot be 788 00:47:01,300 --> 00:47:10,760 decoded. And so the way to fix this is to use the errors= argument of the bytes.decode 789 00:47:10,760 --> 00:47:16,500 function of Python. And by default errors is strict so we will throw an error if 790 00:47:16,500 --> 00:47:21,680 it's not a valid UTF-8 byte encoding. But there are many different things that 791 00:47:21,700 --> 00:47:25,240 you could put here for error handling. This is the full list of all the errors 792 00:47:25,240 --> 00:47:28,820 that you can use. And in particular, instead of strict, let's change it to 793 00:47:28,820 --> 00:47:35,560 replace. And that will replace with this special marker. This is the replacement 794 00:47:35,560 --> 00:47:44,200 character. So errors equals replace. And now we just get that character back. So 795 00:47:44,200 --> 00:47:50,640 basically not every single byte sequence is valid UTF-8. And if it happens that 796 00:47:50,640 --> 00:47:51,680 your large language 797 00:47:51,700 --> 00:47:57,160 model for example predicts your tokens in a bad manner, then they might not fall 798 00:47:57,160 --> 00:48:03,220 into valid UTF-8. And then we won't be able to decode them. So the standard 799 00:48:03,220 --> 00:48:07,720 practice is to basically use errors equals replace. And this is what you will 800 00:48:07,720 --> 00:48:12,700 also find in the OpenAI code that they released as well. But basically 801 00:48:12,700 --> 00:48:15,480 whenever you see this kind of a character in your output, in that case 802 00:48:15,480 --> 00:48:21,080 something went wrong and the LLM output was not a valid sequence of tokens. 803 00:48:22,420 --> 00:48:27,620 OK. And now we're going to go the other way. So we are going to implement this arrow right here, 804 00:48:27,620 --> 00:48:30,960 where we are going to be given a string and we want to encode it into tokens. 805 00:48:32,100 --> 00:48:35,060 So this is the signature of the function that we're interested in 806 00:48:35,060 --> 00:48:41,820 and this should basically return a list of integers, the tokens. So again try 807 00:48:41,820 --> 00:48:45,840 to maybe implement this yourself if you'd like a fun exercise, and pause here. 808 00:48:45,840 --> 00:48:49,920 Otherwise I'm going to start putting in my solution. So again there are many ways 809 00:48:49,920 --> 00:48:50,780 to do this. 810 00:48:50,780 --> 00:48:51,420 So 811 00:48:51,700 --> 00:48:56,820 this is one of the ways that I came up with. 812 00:48:56,820 --> 00:48:59,380 The first thing we're going to do is we are going 813 00:48:59,380 --> 00:49:04,940 to take our text, encode it into UTF-8 to get the raw bytes. 
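Before continuing with the encode implementation, here is a consolidated sketch of the decode function as just described, including the errors="replace" fix:

    def decode(ids):
        # given a list of integer token ids, return a Python string
        tokens = b"".join(vocab[idx] for idx in ids)
        text = tokens.decode("utf-8", errors="replace")  # invalid byte sequences become the replacement character
        return text

    print(decode([97]))    # 'a'
    print(decode([128]))   # the replacement character, instead of a UnicodeDecodeError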
814 00:49:04,940 --> 00:49:06,940 Then as before, we're going to call list on 815 00:49:06,940 --> 00:49:11,480 the bytes object to get a list of integers of those bytes. 816 00:49:11,480 --> 00:49:13,200 Those are the starting tokens. 817 00:49:13,200 --> 00:49:15,300 Those are the raw bytes of our sequence. 818 00:49:15,300 --> 00:49:18,800 But now, of course, according to the merges dictionary above, 819 00:49:18,800 --> 00:49:21,360 and recall this was the merges, 820 00:49:21,360 --> 00:49:25,680 some of the bytes may be merged according to this lookup. 821 00:49:25,680 --> 00:49:27,340 In addition to that, remember that 822 00:49:27,340 --> 00:49:29,280 the merges was built from top to bottom. 823 00:49:29,280 --> 00:49:32,520 This is the order in which we inserted stuff into merges. 824 00:49:32,520 --> 00:49:34,980 We prefer to do all these merges in 825 00:49:34,980 --> 00:49:37,580 the beginning before we do these merges later. 826 00:49:37,580 --> 00:49:39,540 Because for example, 827 00:49:39,540 --> 00:49:43,980 this merge over here relies on the 256 which got merged here. 828 00:49:43,980 --> 00:49:47,140 We have to go in the order from top to bottom 829 00:49:47,140 --> 00:49:48,760 if we are going to be merging anything. 830 00:49:48,760 --> 00:49:52,080 Now, we expect to be doing a few merges, 831 00:49:52,080 --> 00:49:54,720 so we're going to be doing while true. 832 00:49:55,320 --> 00:49:59,160 Now, we want to find a pair of bytes that is 833 00:49:59,160 --> 00:50:03,060 consecutive that we are allowed to merge according to this. 834 00:50:03,060 --> 00:50:06,120 In order to reuse some of the functionality that we've already written, 835 00:50:06,120 --> 00:50:09,540 I'm going to reuse the function getStats. 836 00:50:09,540 --> 00:50:14,780 Recall that getStats will basically count up how many times 837 00:50:14,780 --> 00:50:18,200 every single pair occurs in our sequence of tokens. 838 00:50:18,200 --> 00:50:20,140 Return that as a dictionary. 839 00:50:20,140 --> 00:50:23,640 The dictionary was a mapping from 840 00:50:23,640 --> 00:50:26,260 all the different byte pairs 841 00:50:26,260 --> 00:50:28,700 to the number of times that they occur. 842 00:50:28,700 --> 00:50:33,000 At this point, we don't actually care how many times they occur in the sequence. 843 00:50:33,000 --> 00:50:36,600 We only care what the raw pairs are in that sequence. 844 00:50:36,600 --> 00:50:39,700 I'm only going to be using basically the keys of the dictionary. 845 00:50:39,700 --> 00:50:41,140 I only care about the set of 846 00:50:41,140 --> 00:50:44,200 possible merge candidates, if that makes sense. 847 00:50:44,200 --> 00:50:48,080 Now, we want to identify the pair that we're going to be merging at this stage, 848 00:50:48,080 --> 00:50:49,320 of the loop. 849 00:50:49,320 --> 00:50:50,120 So, what do we want? 850 00:50:50,120 --> 00:50:54,740 We want to find the pair, or like a key inside stats, 851 00:50:54,740 --> 00:50:59,360 that has the lowest index in the merges dictionary, 852 00:50:59,360 --> 00:51:04,040 because we want to do all the early merges before we work our way to the late merges. 853 00:51:04,040 --> 00:51:06,180 So, again, there are many different ways to implement this, 854 00:51:06,180 --> 00:51:12,280 but I'm going to do something a little bit fancy here. 855 00:51:12,280 --> 00:51:15,920 So, I'm going to be using the min over an iterator. 
856 00:51:15,920 --> 00:51:18,080 In Python, when you call min on an iterator, 857 00:51:18,080 --> 00:51:19,880 and stats here is a dictionary, 858 00:51:19,880 --> 00:51:23,440 we're going to be iterating the keys of this dictionary in Python. 859 00:51:23,440 --> 00:51:28,380 So, we're looking at all the pairs inside stats, 860 00:51:28,380 --> 00:51:30,440 which are all the consecutive pairs. 861 00:51:30,440 --> 00:51:34,180 And we're going to be taking the consecutive pair inside tokens 862 00:51:34,180 --> 00:51:37,040 that has the minimum, what. 863 00:51:37,040 --> 00:51:38,920 The min takes a key, 864 00:51:38,920 --> 00:51:41,880 which gives us the function that is going to return a value 865 00:51:41,880 --> 00:51:44,080 over which we're going to do the min. 866 00:51:44,080 --> 00:51:47,320 And the one we care about is we care about taking merges, 867 00:51:47,320 --> 00:51:53,720 and basically getting that pair's index. 868 00:51:53,720 --> 00:51:58,180 So, basically, for any pair inside stats, 869 00:51:58,180 --> 00:52:02,680 we are going to be looking into merges at what index it has, 870 00:52:02,680 --> 00:52:05,760 and we want to get the pair with the min number. 871 00:52:05,760 --> 00:52:08,280 So, as an example, if there's a pair 101 and 32, 872 00:52:08,280 --> 00:52:10,560 we definitely want to get that pair. 873 00:52:10,560 --> 00:52:12,520 We want to identify it here and return it, 874 00:52:12,520 --> 00:52:16,780 and pair would become 101, 32 if it occurs. 875 00:52:16,780 --> 00:52:20,780 And the reason that I'm putting a float inf here as a fallback 876 00:52:20,780 --> 00:52:23,820 is that in the get function, when we call, 877 00:52:23,820 --> 00:52:28,340 when we basically consider a pair that doesn't occur in the merges, 878 00:52:28,340 --> 00:52:30,680 then that pair is not eligible to be merged, right? 879 00:52:30,680 --> 00:52:32,640 So, if in the token sequence, 880 00:52:32,640 --> 00:52:35,120 there's some pair that is not a merging pair, 881 00:52:35,120 --> 00:52:36,380 it cannot be merged, 882 00:52:36,380 --> 00:52:38,580 then it doesn't actually occur here, 883 00:52:38,580 --> 00:52:40,180 and it doesn't have an index, 884 00:52:40,180 --> 00:52:41,780 and it cannot be merged, 885 00:52:41,780 --> 00:52:44,080 which we will denote as float inf. 886 00:52:44,080 --> 00:52:46,480 And the reason infinity is nice here is because for sure, 887 00:52:46,480 --> 00:52:48,720 we're guaranteed that it's not going to participate 888 00:52:48,720 --> 00:52:51,840 in the list of candidates when we do the min. 889 00:52:51,840 --> 00:52:55,080 So, this is one way to do it. 890 00:52:55,080 --> 00:52:56,380 So, basically, long story short, 891 00:52:56,380 --> 00:53:00,720 this returns the most eligible merging candidate pair 892 00:53:00,720 --> 00:53:02,340 that occurs in the tokens. 893 00:53:02,340 --> 00:53:05,140 Now, one thing to be careful with here is 894 00:53:05,140 --> 00:53:09,520 this function here might fail in the following way. 895 00:53:09,520 --> 00:53:11,180 If there is nothing to merge, 896 00:53:11,180 --> 00:53:16,180 then there's nothing in merges that satisfies this function. 897 00:53:16,480 --> 00:53:18,720 If there is nothing that is satisfied anymore, 898 00:53:18,720 --> 00:53:19,820 there's nothing to merge. 899 00:53:19,820 --> 00:53:22,180 Everything just returns float infs, 900 00:53:22,180 --> 00:53:23,980 and then the pair, I think, 901 00:53:23,980 --> 00:53:27,980 will just become the very first element of stats. 
902 00:53:27,980 --> 00:53:29,780 But this pair is not actually a mergeable pair. 903 00:53:29,780 --> 00:53:33,480 It just becomes the first pair inside stats arbitrarily 904 00:53:33,480 --> 00:53:36,980 because all of these pairs evaluate to float inf 905 00:53:36,980 --> 00:53:38,720 for the merging criterion. 906 00:53:38,720 --> 00:53:41,320 So, basically, it could be that this doesn't succeed 907 00:53:41,320 --> 00:53:42,620 because there's no more merging pairs. 908 00:53:42,620 --> 00:53:45,780 So, if this pair is not in merges that was returned, 909 00:53:45,780 --> 00:53:46,320 then this is a failure. 910 00:53:46,480 --> 00:53:48,320 So, this signals for us that, actually, 911 00:53:48,320 --> 00:53:49,720 there was nothing to merge. 912 00:53:49,720 --> 00:53:51,720 No single pair can be merged anymore. 913 00:53:51,720 --> 00:53:55,480 In that case, we will break out. 914 00:53:55,480 --> 00:53:59,320 Nothing else can be merged. 915 00:53:59,320 --> 00:54:00,680 You might come up with a different implementation, 916 00:54:00,680 --> 00:54:00,980 by the way. 917 00:54:00,980 --> 00:54:05,480 This is kind of like really trying hard in Python. 918 00:54:05,480 --> 00:54:07,240 But really, we're just trying to find a pair 919 00:54:07,240 --> 00:54:10,720 that can be merged with a lowest index here. 920 00:54:10,720 --> 00:54:15,520 Now, if we did find a pair that is inside merges 921 00:54:15,520 --> 00:54:15,780 with the lowest index, 922 00:54:15,780 --> 00:54:17,680 then we can merge it. 923 00:54:17,680 --> 00:54:22,420 So, we're going to look into the mergers dictionary 924 00:54:22,420 --> 00:54:25,240 for that pair to look up the index, 925 00:54:25,240 --> 00:54:28,620 and we're going to now merge into that index. 926 00:54:28,620 --> 00:54:30,220 So, we're going to do tokens equals, 927 00:54:30,220 --> 00:54:34,440 and we're going to replace the original tokens. 928 00:54:34,440 --> 00:54:36,620 We're going to be replacing the pair pair, 929 00:54:36,620 --> 00:54:39,020 and we're going to be replacing it with index IDX. 930 00:54:39,020 --> 00:54:41,580 And this returns a new list of tokens 931 00:54:41,580 --> 00:54:44,440 where every occurrence of pair is replaced with IDX. 932 00:54:44,440 --> 00:54:45,620 So, we're doing a merge. 933 00:54:45,780 --> 00:54:47,540 And we're going to be continuing this 934 00:54:47,540 --> 00:54:49,320 until eventually nothing can be merged. 935 00:54:49,320 --> 00:54:51,280 We'll come out here and we'll break out. 936 00:54:51,280 --> 00:54:54,180 And here, we just return tokens. 937 00:54:54,180 --> 00:54:56,780 And so, that's the implementation, I think. 938 00:54:56,780 --> 00:54:58,140 So, hopefully, this runs. 939 00:54:58,140 --> 00:55:00,980 Okay, cool. 940 00:55:00,980 --> 00:55:02,880 Yeah, and this looks reasonable. 941 00:55:02,880 --> 00:55:05,520 So, for example, 32 is a space in ASCII. 942 00:55:05,520 --> 00:55:08,380 So, that's here. 943 00:55:08,380 --> 00:55:09,920 So, this looks like it worked. 944 00:55:09,920 --> 00:55:10,680 Great. 945 00:55:10,680 --> 00:55:13,440 Okay, so let's wrap up this section of the video, at least. 946 00:55:13,440 --> 00:55:15,680 I wanted to point out that this is not quite the right implementation. 947 00:55:15,680 --> 00:55:18,680 Just yet, because we are leaving out a special case. 948 00:55:18,680 --> 00:55:21,480 So, in particular, if we try to do this, 949 00:55:21,480 --> 00:55:23,180 this would give us an error. 
950 00:55:23,180 --> 00:55:26,280 And the issue is that if we only have a single character 951 00:55:26,280 --> 00:55:28,880 or an empty string, then stats is empty. 952 00:55:28,880 --> 00:55:30,920 And that causes an issue inside min. 953 00:55:30,920 --> 00:55:36,080 So, one way to fight this is if len of tokens is at least two. 954 00:55:36,080 --> 00:55:38,720 Because if it's less than two, it's just a single token or no tokens, 955 00:55:38,720 --> 00:55:40,880 then let's just, there's nothing to merge. 956 00:55:40,880 --> 00:55:42,320 So, we just return. 957 00:55:42,320 --> 00:55:45,380 So, that would fix that case. 958 00:55:45,380 --> 00:55:46,280 Okay. 959 00:55:46,280 --> 00:55:49,880 And then second, I have a few test cases here for us as well. 960 00:55:49,880 --> 00:55:55,080 So, first, let's make sure about, or let's note the following. 961 00:55:55,080 --> 00:55:58,680 If we take a string and we try to encode it and then decode it back, 962 00:55:58,680 --> 00:56:00,980 you'd expect to get the same string back, right? 963 00:56:00,980 --> 00:56:05,580 Is that true for all strings? 964 00:56:05,580 --> 00:56:07,380 So, I think, so here it is the case. 965 00:56:07,380 --> 00:56:11,080 And I think in general, this is probably the case. 966 00:56:11,080 --> 00:56:15,180 But notice that going backwards is not, is not, you're not going to have an identity. 967 00:56:15,380 --> 00:56:16,280 Going backwards. 968 00:56:16,280 --> 00:56:24,380 Because as I mentioned, not all token sequences are valid UTF-8 sort of byte streams. 969 00:56:24,380 --> 00:56:28,480 And so, therefore, some of them can't even be decodable. 970 00:56:28,480 --> 00:56:31,080 So, this only goes in one direction. 971 00:56:31,080 --> 00:56:33,980 But for that one direction, we can check here. 972 00:56:33,980 --> 00:56:37,380 If we take the training text, which is the text that we trained the tokenizer on, 973 00:56:37,380 --> 00:56:41,680 we can make sure that when we encode and decode, we get the same thing back, which is true. 974 00:56:41,680 --> 00:56:43,280 And here I took some validation data. 975 00:56:43,280 --> 00:56:45,280 So, I went to, I think, this web page. 976 00:56:45,280 --> 00:56:46,880 And I grabbed some text. 977 00:56:46,880 --> 00:56:49,280 So, this is text that the tokenizer has not seen. 978 00:56:49,280 --> 00:56:52,580 And we can make sure that this also works. 979 00:56:52,580 --> 00:56:55,880 So, that gives us some confidence that this was correctly implemented. 980 00:56:55,880 --> 00:56:59,080 So, those are the basics of the byte pair encoding algorithm. 981 00:56:59,080 --> 00:57:03,580 We saw how we can take some training set, train a tokenizer. 982 00:57:03,580 --> 00:57:07,780 The parameters of this tokenizer really are just this dictionary of merges. 983 00:57:07,780 --> 00:57:12,480 And that basically creates the little binary forest on top of raw bytes. 984 00:57:12,480 --> 00:57:14,580 Once we have this, the merges table, 985 00:57:14,580 --> 00:57:18,780 we can both encode and decode between raw text and token sequences. 986 00:57:18,780 --> 00:57:22,180 So, that's the simplest setting of the tokenizer. 987 00:57:22,180 --> 00:57:25,180 What we're going to do now, though, is we're going to look at some of the state-of-the-art 988 00:57:25,180 --> 00:57:28,380 large language models and the kinds of tokenizers that they use. 989 00:57:28,380 --> 00:57:30,980 And we're going to see that this picture complexifies very quickly. 
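For completeness, here is a consolidated sketch of the encode function described above, with the guard for very short inputs folded into the loop condition:

    def encode(text):
        # given a string, return a list of integer token ids
        tokens = list(text.encode("utf-8"))
        while len(tokens) >= 2:
            stats = get_stats(tokens)
            # of all candidate pairs present, pick the one that was merged earliest during training
            pair = min(stats, key=lambda p: merges.get(p, float("inf")))
            if pair not in merges:
                break  # nothing left to merge
            idx = merges[pair]
            tokens = merge(tokens, pair, idx)
        return tokens

    # the round trip holds in the text -> tokens -> text direction
    print(decode(encode("hello world")) == "hello world")   # True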
990 00:57:30,980 --> 00:57:37,080 So, we're going to go through the details of this complexification one at a time. 991 00:57:37,080 --> 00:57:39,880 So, let's kick things off by looking at the GPT series. 992 00:57:39,880 --> 00:57:43,180 So, in particular, I have the GPT-2 paper here. 993 00:57:43,180 --> 00:57:47,780 And this paper is from 2019 or so, so five years ago. 994 00:57:47,780 --> 00:57:50,980 And let's scroll down to input representation. 995 00:57:50,980 --> 00:57:54,880 This is where they talk about the tokenizer that they're using for GPT-2. 996 00:57:54,880 --> 00:57:59,680 Now, this is all fairly readable, so I encourage you to pause and read this yourself. 997 00:57:59,680 --> 00:58:03,680 But this is where they motivate the use of the byte pair encoding algorithm 998 00:58:03,680 --> 00:58:08,380 on the byte level representation of UTF-8 encoding. 999 00:58:08,380 --> 00:58:12,780 So, this is where they motivate it, and they talk about the vocabulary sizes and everything. 1000 00:58:13,180 --> 00:58:16,080 Now, everything here is exactly as we've covered it so far, 1001 00:58:16,080 --> 00:58:18,780 but things start to depart around here. 1002 00:58:18,780 --> 00:58:23,580 So, what they mention is that they don't just apply the naive algorithm as we have done it. 1003 00:58:23,580 --> 00:58:26,280 And in particular, here's a motivating example. 1004 00:58:26,280 --> 00:58:28,480 Suppose that you have common words like dog. 1005 00:58:28,480 --> 00:58:32,680 What will happen is that dog, of course, occurs very frequently in the text, 1006 00:58:32,680 --> 00:58:36,180 and it occurs right next to all kinds of punctuation, as an example. 1007 00:58:36,180 --> 00:58:40,880 So, dog dot, dog exclamation mark, dog question mark, et cetera. 1008 00:58:40,880 --> 00:58:42,980 And naively, you might imagine that the BP algorithm 1009 00:58:42,980 --> 00:58:45,580 could merge these to be single tokens. 1010 00:58:45,580 --> 00:58:48,080 And then you end up with lots of tokens that are just like dog 1011 00:58:48,080 --> 00:58:50,080 with a slightly different punctuation. 1012 00:58:50,080 --> 00:58:52,480 And so, it feels like you're clustering things that shouldn't be clustered. 1013 00:58:52,480 --> 00:58:56,480 You're combining kind of semantics with punctuation. 1014 00:58:56,480 --> 00:58:58,780 And this feels suboptimal. 1015 00:58:58,780 --> 00:59:01,480 And indeed, they also say that this is suboptimal, 1016 00:59:01,480 --> 00:59:03,380 according to some of the experiments. 1017 00:59:03,380 --> 00:59:06,280 So, what they want to do is they want to top-down, in a manual way, 1018 00:59:06,280 --> 00:59:12,680 enforce that some types of characters should never be merged together. 1019 00:59:12,680 --> 00:59:14,780 So, they want to enforce these merging rules 1020 00:59:14,780 --> 00:59:17,680 on top of the byte pair encoding algorithm. 1021 00:59:17,680 --> 00:59:21,480 So, let's take a look at their code and see how they actually enforce this 1022 00:59:21,480 --> 00:59:24,280 and what kinds of mergers they actually do perform. 1023 00:59:24,280 --> 00:59:29,480 So, I have the tab open here for GPT-2 under OpenAI on GitHub. 1024 00:59:29,480 --> 00:59:33,980 And when we go to source, there is an encoder.py. 1025 00:59:33,980 --> 00:59:36,280 Now, I don't personally love that they call it encoder.py 1026 00:59:36,280 --> 00:59:38,080 because this is the tokenizer. 1027 00:59:38,080 --> 00:59:41,080 And the tokenizer can do both encode and decode. 
1028 00:59:41,080 --> 00:59:42,580 So, it feels kind of awkward to me that it's called that. 1029 00:59:42,680 --> 00:59:45,780 It's called encoder, but that is the tokenizer. 1030 00:59:45,780 --> 00:59:46,880 And there's a lot going on here, 1031 00:59:46,880 --> 00:59:49,580 and we're going to step through it in detail at one point. 1032 00:59:49,580 --> 00:59:53,380 For now, I just want to focus on this part here. 1033 00:59:53,380 --> 00:59:56,480 They create a regex pattern here that looks very complicated, 1034 00:59:56,480 --> 00:59:58,880 and we're going to go through it in a bit. 1035 00:59:58,880 --> 01:00:02,880 But this is the core part that allows them to enforce rules 1036 01:00:02,880 --> 01:00:07,280 for what parts of the text will never be merged for sure. 1037 01:00:07,280 --> 01:00:09,880 Now, notice that re.compile here is a little bit misleading 1038 01:00:09,880 --> 01:00:12,380 because we're not just doing import re, 1039 01:00:12,380 --> 01:00:15,680 we're doing import regex as re, 1040 01:00:15,680 --> 01:00:18,780 and regex is a Python package that you can install, 1041 01:00:18,780 --> 01:00:21,780 pip install regex, and it's basically an extension of re, 1042 01:00:21,780 --> 01:00:26,080 so it's a bit more powerful re. 1043 01:00:26,080 --> 01:00:29,680 So, let's take a look at this pattern and what it's doing 1044 01:00:29,680 --> 01:00:32,280 and why this is actually doing the separation 1045 01:00:32,280 --> 01:00:33,880 that they are looking for. 1046 01:00:33,880 --> 01:00:35,780 Okay, so I've copy pasted the pattern here 1047 01:00:35,780 --> 01:00:38,280 to our Jupyter notebook where we left off, 1048 01:00:38,280 --> 01:00:40,680 and let's take this pattern for a spin. 1049 01:00:40,680 --> 01:00:42,180 So, in the exact same way that 1050 01:00:42,380 --> 01:00:45,780 Jupyter code does, we're going to call an re.findall 1051 01:00:45,780 --> 01:00:48,180 for this pattern on any arbitrary string 1052 01:00:48,180 --> 01:00:49,480 that we are interested in. 1053 01:00:49,480 --> 01:00:53,680 So, this is the string that we want to encode into tokens 1054 01:00:53,680 --> 01:00:56,880 to feed into an LLM like GPT-2. 1055 01:00:56,880 --> 01:00:59,080 So, what exactly is this doing? 1056 01:00:59,080 --> 01:01:01,080 Well, re.findall will take this pattern 1057 01:01:01,080 --> 01:01:04,980 and try to match it against this string. 1058 01:01:04,980 --> 01:01:07,780 The way this works is that you are going from left to right 1059 01:01:07,780 --> 01:01:11,380 in the string, and you're trying to match the pattern. 1060 01:01:11,380 --> 01:01:12,280 And re.fall, 1061 01:01:12,380 --> 01:01:15,080 re.findall will get all the occurrences 1062 01:01:15,080 --> 01:01:17,380 and organize them into a list. 1063 01:01:17,380 --> 01:01:20,480 Now, when you look at this pattern, 1064 01:01:20,480 --> 01:01:23,880 first of all, notice that this is a raw string, 1065 01:01:23,880 --> 01:01:26,180 and then these are three double quotes 1066 01:01:26,180 --> 01:01:27,780 just to start the string. 1067 01:01:27,780 --> 01:01:29,380 So, really, the string itself, 1068 01:01:29,380 --> 01:01:32,380 this is the pattern itself, right? 1069 01:01:32,380 --> 01:01:35,280 And notice that it's made up of a lot of ors. 1070 01:01:35,280 --> 01:01:36,480 So, see these vertical bars? 1071 01:01:36,480 --> 01:01:39,380 Those are ors in regex. 
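This is the pattern in question as it appears in encoder.py, together with a findall call of the kind walked through next (the example string mirrors the walkthrough below):

    import regex as re   # note: the `regex` package (pip install regex), not the built-in `re`

    gpt2pat = re.compile(
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

    print(re.findall(gpt2pat, "Hello world how are you"))
    # ['Hello', ' world', ' how', ' are', ' you']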
1072 01:01:39,380 --> 01:01:41,380 And so, you go from left to right in this pattern 1073 01:01:41,380 --> 01:01:44,480 and try to match it against the string wherever you are. 1074 01:01:44,480 --> 01:01:47,980 So, we have hello, and we're going to try to match it. 1075 01:01:47,980 --> 01:01:49,480 Well, it's not apostrophe s. 1076 01:01:49,480 --> 01:01:52,480 It's not apostrophe t or any of these, 1077 01:01:52,480 --> 01:01:57,180 but it is an optional space followed by dash p of, 1078 01:01:57,180 --> 01:02:00,280 sorry, slash p of l one or more times. 1079 01:02:00,280 --> 01:02:01,880 What is slash p of l? 1080 01:02:01,880 --> 01:02:06,880 It is coming to some documentation that I found. 1081 01:02:06,880 --> 01:02:09,180 There might be other sources as well. 1082 01:02:09,180 --> 01:02:11,180 Slash p of l is a letter. 1083 01:02:11,180 --> 01:02:13,680 Any kind of letter from any language. 1084 01:02:13,680 --> 01:02:16,080 And hello is made up of letters. 1085 01:02:16,080 --> 01:02:18,380 H-E-L-L-O, et cetera. 1086 01:02:18,380 --> 01:02:21,480 So, optional space followed by a bunch of letters, 1087 01:02:21,480 --> 01:02:24,780 one or more letters, is going to match hello, 1088 01:02:24,780 --> 01:02:28,780 but then the match ends because a white space is not a letter. 1089 01:02:28,780 --> 01:02:33,280 So, from there on begins a new sort of attempt 1090 01:02:33,280 --> 01:02:35,880 to match against the string again. 1091 01:02:35,880 --> 01:02:39,180 And starting in here, we're going to skip over all of these again 1092 01:02:39,180 --> 01:02:40,980 until we get to the exact same point again. 1093 01:02:41,180 --> 01:02:43,580 And we see that there's an optional space. 1094 01:02:43,580 --> 01:02:46,180 This is the optional space followed by a bunch of letters, 1095 01:02:46,180 --> 01:02:47,180 one or more of them. 1096 01:02:47,180 --> 01:02:48,680 And so, that matches. 1097 01:02:48,680 --> 01:02:53,080 So, when we run this, we get a list of two elements, hello, 1098 01:02:53,080 --> 01:02:55,680 and then space world. 1099 01:02:55,680 --> 01:02:58,880 So, how are you if we add more letters? 1100 01:02:58,880 --> 01:03:01,180 We would just get them like this. 1101 01:03:01,180 --> 01:03:03,680 Now, what is this doing and why is this important? 1102 01:03:03,680 --> 01:03:08,380 We are taking our string and instead of directly encoding it 1103 01:03:08,380 --> 01:03:11,080 for tokenization, we are first splitting it. 1104 01:03:11,180 --> 01:03:13,980 And when you actually step through the code, 1105 01:03:13,980 --> 01:03:16,180 and we'll do that in a bit more detail, 1106 01:03:16,180 --> 01:03:20,680 what really it's doing on a high level is that it first splits your text 1107 01:03:20,680 --> 01:03:24,480 into a list of texts, just like this one. 1108 01:03:24,480 --> 01:03:27,480 And all these elements of this list are processed independently 1109 01:03:27,480 --> 01:03:29,080 by the tokenizer. 1110 01:03:29,080 --> 01:03:33,080 And all of the results of that processing are simply concatenated. 1111 01:03:33,080 --> 01:03:35,180 So, hello, world. 1112 01:03:35,180 --> 01:03:37,480 Oh, I missed how. 1113 01:03:37,480 --> 01:03:39,380 Hello, world, how are you? 1114 01:03:39,380 --> 01:03:41,180 We have five elements of a list. 1115 01:03:41,180 --> 01:03:48,080 All of these will independently go from text to a token sequence. 1116 01:03:48,080 --> 01:03:50,580 And then that token sequence is going to be concatenated. 
1117 01:03:50,580 --> 01:03:52,680 It's all going to be joined up. 1118 01:03:52,680 --> 01:03:57,380 And roughly speaking, what that does is you're only ever finding merges 1119 01:03:57,380 --> 01:03:59,280 between the elements of this list. 1120 01:03:59,280 --> 01:04:01,480 So, you can only ever consider merges within every one 1121 01:04:01,480 --> 01:04:04,080 of these elements individually. 1122 01:04:04,080 --> 01:04:07,880 And after you've done all the possible merging for all 1123 01:04:07,880 --> 01:04:10,180 of these elements individually, the results of all 1124 01:04:10,180 --> 01:04:11,080 that will be joined up. 1125 01:04:11,180 --> 01:04:19,180 So, basically, what you're doing effectively is you are never going 1126 01:04:19,180 --> 01:04:23,480 to be merging this E with this space because they are now parts 1127 01:04:23,480 --> 01:04:25,880 of the separate elements of this list. 1128 01:04:25,880 --> 01:04:31,080 And so, you are saying we are never going to merge E space 1129 01:04:31,080 --> 01:04:33,580 because we're breaking it up in this way. 1130 01:04:33,580 --> 01:04:36,580 So, basically, using this regex pattern to chunk 1131 01:04:36,580 --> 01:04:41,080 up the text is just one way of enforcing that some merges 1132 01:04:41,180 --> 01:04:42,380 are not to happen. 1133 01:04:42,380 --> 01:04:44,980 And we're going to go into more of this text and we'll see 1134 01:04:44,980 --> 01:04:47,180 that what this is trying to do on a high level is we're trying 1135 01:04:47,180 --> 01:04:50,180 to not merge across letters, across numbers, 1136 01:04:50,180 --> 01:04:52,680 across punctuation, and so on. 1137 01:04:52,680 --> 01:04:54,480 So, let's see in more detail how that works. 1138 01:04:54,480 --> 01:04:55,880 So, let's continue now. 1139 01:04:55,880 --> 01:04:59,680 We have slash P of N. If you go to the documentation, 1140 01:04:59,680 --> 01:05:04,380 slash P of N is any kind of numeric character in any script. 1141 01:05:04,380 --> 01:05:05,880 So, it's numbers. 1142 01:05:05,880 --> 01:05:07,880 So, we have an optional space followed by numbers 1143 01:05:07,880 --> 01:05:09,680 and those would be separated out. 1144 01:05:09,680 --> 01:05:11,180 So, letters and numbers are being separated out. 1145 01:05:11,180 --> 01:05:14,980 So, if I do hello world, one, two, three, how are you? 1146 01:05:14,980 --> 01:05:19,580 Then world will stop matching here because one is not a letter anymore. 1147 01:05:19,580 --> 01:05:22,480 But one is a number, so this group will match for that 1148 01:05:22,480 --> 01:05:26,780 and we'll get it as a separate entity. 1149 01:05:26,780 --> 01:05:28,380 Let's see how these apostrophes work. 1150 01:05:28,380 --> 01:05:36,680 So, here, if we have slash V or, I mean, apostrophe V as an example, 1151 01:05:36,680 --> 01:05:40,580 then apostrophe here is not a letter or a number. 1152 01:05:40,580 --> 01:05:45,880 So, hello will stop matching and then we will exactly match this with that. 1153 01:05:45,880 --> 01:05:49,180 So, that will come out as a separate thing. 1154 01:05:49,180 --> 01:05:51,580 So, why are they doing the apostrophes here? 1155 01:05:51,580 --> 01:05:54,780 Honestly, I think that these are just like very common apostrophes 1156 01:05:54,780 --> 01:05:57,980 that are used typically. 1157 01:05:57,980 --> 01:05:59,580 I don't love that they've done this 1158 01:05:59,580 --> 01:06:06,380 because let me show you what happens when you have some Unicode apostrophes. 
1159 01:06:06,380 --> 01:06:10,080 Like, for example, you can have, if you have house, 1160 01:06:10,080 --> 01:06:13,080 then this will be separated out because of this matching. 1161 01:06:13,080 --> 01:06:17,180 But if you use the Unicode apostrophe like this, 1162 01:06:17,180 --> 01:06:19,880 then suddenly this does not work. 1163 01:06:19,880 --> 01:06:23,680 And so, this apostrophe will actually become its own thing now. 1164 01:06:23,680 --> 01:06:28,180 And so, it's basically hard-coded for this specific kind of apostrophe 1165 01:06:28,180 --> 01:06:33,180 and otherwise they become completely separate tokens. 1166 01:06:33,180 --> 01:06:37,580 In addition to this, you can go to the GPT-2 docs 1167 01:06:37,580 --> 01:06:39,880 and here when they define the pattern, they say, 1168 01:06:40,080 --> 01:06:42,280 should have added re.ignorecase. 1169 01:06:42,280 --> 01:06:45,580 So, BP merges can happen for capitalized versions of contractions. 1170 01:06:45,580 --> 01:06:48,480 So, what they're pointing out is that you see how this is apostrophe 1171 01:06:48,480 --> 01:06:50,780 and then lowercase letters. 1172 01:06:50,780 --> 01:06:53,880 Well, because they didn't do re.ignorecase, 1173 01:06:53,880 --> 01:06:59,880 then these rules will not separate out the apostrophes if it's uppercase. 1174 01:06:59,880 --> 01:07:04,780 So, house would be like this. 1175 01:07:04,780 --> 01:07:09,880 But if I did house from uppercase, then notice, 1176 01:07:10,080 --> 01:07:13,280 suddenly the apostrophe comes by itself. 1177 01:07:13,280 --> 01:07:17,480 So, the tokenization will work differently in uppercase and lowercase, 1178 01:07:17,480 --> 01:07:19,880 inconsistently separating out these apostrophes. 1179 01:07:19,880 --> 01:07:23,880 So, it feels extremely gnarly and slightly gross. 1180 01:07:23,880 --> 01:07:25,780 But that's how that works. 1181 01:07:25,780 --> 01:07:27,280 Okay, so let's come back. 1182 01:07:27,280 --> 01:07:29,880 After trying to match a bunch of apostrophe expressions, 1183 01:07:29,880 --> 01:07:33,280 by the way, the other issue here is that these are quite language-specific probably. 1184 01:07:33,280 --> 01:07:37,080 So, I don't know that all the languages, for example, use or don't use apostrophes, 1185 01:07:37,080 --> 01:07:39,880 but that would be inconsistently tokenized as a result. 1186 01:07:40,080 --> 01:07:44,280 Well, then we try to match letters, then we try to match numbers, 1187 01:07:44,280 --> 01:07:47,480 and then if that doesn't work, we fall back to here. 1188 01:07:47,480 --> 01:07:51,280 And what this is saying is, again, optional space followed by something that is not a letter, 1189 01:07:51,280 --> 01:07:55,080 number, or a space, and one or more of that. 1190 01:07:55,080 --> 01:07:58,280 So, what this is doing effectively is this is trying to match punctuation, 1191 01:07:58,280 --> 01:08:01,080 roughly speaking, not letters and not numbers. 1192 01:08:01,080 --> 01:08:03,280 So, this group will try to trigger for that. 1193 01:08:03,280 --> 01:08:09,480 So, if I do something like this, then these parts here are not letters or numbers, 1194 01:08:09,480 --> 01:08:13,480 but they will actually get caught here. 1195 01:08:13,480 --> 01:08:15,680 And so, they become its own group. 1196 01:08:15,680 --> 01:08:18,380 So, we've separated out the punctuation. 1197 01:08:18,380 --> 01:08:21,580 And finally, this is also a little bit confusing. 
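A quick demonstration of that inconsistency, reusing the gpt2pat pattern from above (the example strings are just illustrative):

    print(re.findall(gpt2pat, "House's lease"))
    # ['House', "'s", ' lease']        the hard-coded 's rule splits off the contraction
    print(re.findall(gpt2pat, "HOUSE'S LEASE"))
    # ['HOUSE', "'", 'S', ' LEASE']    uppercase: the apostrophe becomes its own chunk
    print(re.findall(gpt2pat, "House\u2019s lease"))
    # ['House', '\u2019', 's', ' lease']    a Unicode apostrophe is not covered by the hard-coded rule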
1198 01:08:21,580 --> 01:08:24,080 So, this is matching whitespace, 1199 01:08:24,080 --> 01:08:28,980 but this is using a negative look-ahead assertion in regex. 1200 01:08:28,980 --> 01:08:32,080 So, what this is doing is it's matching whitespace up to, 1201 01:08:32,080 --> 01:08:35,980 but not including, the last whitespace character. 1202 01:08:35,980 --> 01:08:39,380 Why is this important? This is pretty subtle, I think. 1203 01:08:39,380 --> 01:08:43,380 So, you see how the whitespace is always included at the beginning of the word. 1204 01:08:43,380 --> 01:08:47,280 So, space R, space U, et cetera. 1205 01:08:47,280 --> 01:08:50,480 Suppose we have a lot of spaces here. 1206 01:08:50,480 --> 01:08:53,680 What's going to happen here is that these spaces up to 1207 01:08:53,680 --> 01:08:57,980 and not including the last character will get caught by this. 1208 01:08:57,980 --> 01:09:01,480 And what that will do is it will separate out the spaces up to, 1209 01:09:01,480 --> 01:09:03,280 but not including the last character, 1210 01:09:03,280 --> 01:09:08,580 so that the last character can come here and join with the space U. 1211 01:09:08,580 --> 01:09:12,580 And the reason that's nice is because space U is the common token. 1212 01:09:12,580 --> 01:09:16,480 So, if I didn't have these extra spaces here, we just have space U. 1213 01:09:16,480 --> 01:09:20,680 And if I add spaces, we still have a space U, 1214 01:09:20,680 --> 01:09:23,080 but now we have all this extra whitespace. 1215 01:09:23,080 --> 01:09:27,780 So, basically, the GPT-2 tokenizer really likes to have a space before letters or numbers, 1216 01:09:27,780 --> 01:09:30,280 and it prepends these spaces. 1217 01:09:30,280 --> 01:09:32,980 And this is just something that it is consistent about. 1218 01:09:32,980 --> 01:09:34,380 So, that's what that is for. 1219 01:09:34,380 --> 01:09:38,380 And then, finally, the last fallback is whitespace. 1220 01:09:38,580 --> 01:09:45,980 So, that would be just if that doesn't get caught, 1221 01:09:45,980 --> 01:09:49,980 then this thing will catch any trailing spaces and so on. 1222 01:09:49,980 --> 01:09:52,680 I wanted to show one more real-world example here. 1223 01:09:52,680 --> 01:09:55,280 So, if we have this string, which is a piece of Python code, 1224 01:09:55,280 --> 01:09:59,480 and then we try to split it up, then this is the kind of output we get. 1225 01:09:59,480 --> 01:10:01,480 So, you'll notice that the list has many elements here, 1226 01:10:01,480 --> 01:10:08,480 and that's because we are splitting up fairly often, every time sort of a category changes. 1227 01:10:08,480 --> 01:10:11,880 So, there will never be any merges within these elements. 1228 01:10:11,880 --> 01:10:14,780 And that's what you are seeing here. 1229 01:10:14,780 --> 01:10:18,980 Now, you might think that in order to train the tokenizer, 1230 01:10:18,980 --> 01:10:23,180 OpenAI has used this to split up text into chunks 1231 01:10:23,180 --> 01:10:26,780 and then run just a BPE algorithm within all the chunks. 1232 01:10:26,780 --> 01:10:28,580 But that is not exactly what happened. 1233 01:10:28,580 --> 01:10:30,380 And the reason is the following. 1234 01:10:30,380 --> 01:10:33,280 Notice that we have the spaces here. 1235 01:10:33,280 --> 01:10:36,380 Those spaces end up being entire elements, 1236 01:10:38,480 --> 01:10:40,880 but they never end up being merged by OpenAI. 
1237 01:10:40,880 --> 01:10:46,880 And the way you can tell is that if you copy-paste the exact same chunk here into Tiktokenizer, 1238 01:10:46,880 --> 01:10:51,880 you see that all the spaces are kept independent, and they are all token 220. 1239 01:10:51,880 --> 01:10:57,880 So, I think OpenAI at some point enforced some rule that these spaces would never be merged. 1240 01:10:57,880 --> 01:11:05,880 And so, there are some additional rules on top of just chunking and BPE that OpenAI is not clear about. 1241 01:11:05,880 --> 01:11:07,880 Now, the training code for the GPT-2 tokenizer was never released. 1243 01:11:08,380 --> 01:11:12,180 So, all we have is the code that I've already shown you. 1244 01:11:12,180 --> 01:11:17,180 But this code here that they've released is only the inference code for the tokens. 1245 01:11:17,180 --> 01:11:18,480 So, this is not the training code. 1246 01:11:18,480 --> 01:11:21,480 You can't give it a piece of text and train the tokenizer. 1247 01:11:21,480 --> 01:11:26,180 This is just the inference code which takes the merges that we have up above 1248 01:11:26,180 --> 01:11:29,180 and applies them to a new piece of text. 1249 01:11:29,180 --> 01:11:32,880 And so, we don't know exactly how OpenAI trained the tokenizer, 1250 01:11:32,880 --> 01:11:37,780 but it wasn't as simple as chunk it up and BPE it, whatever it was. 1251 01:11:37,780 --> 01:11:41,780 Next, I wanted to introduce you to the tiktoken library from OpenAI, 1252 01:11:41,780 --> 01:11:45,780 which is the official library for tokenization from OpenAI. 1253 01:11:45,780 --> 01:11:53,780 So, this is tiktoken, pip install tiktoken, and then you can do the tokenization inference. 1254 01:11:53,780 --> 01:11:55,780 This is, again, not training code. 1255 01:11:55,780 --> 01:11:57,780 This is only inference code for tokenization. 1256 01:11:57,780 --> 01:12:00,780 I wanted to show you how you would use it. 1257 01:12:00,780 --> 01:12:01,780 Quite simple. 1258 01:12:01,780 --> 01:12:05,780 And running this just gives us the GPT-2 tokens or the GPT-4 tokens. 1259 01:12:05,780 --> 01:12:06,780 So, this is the tokenizer used for GPT-2, 1260 01:12:07,780 --> 01:12:09,780 and this is the tokenizer used for GPT-4. 1261 01:12:09,780 --> 01:12:13,780 And so, in particular, we see that the whitespace in GPT-2 remains unmerged, 1262 01:12:13,780 --> 01:12:18,780 but in GPT-4, these whitespaces merge, as we also saw in this one, 1263 01:12:18,780 --> 01:12:26,780 where here they're all unmerged, but if we go down to GPT-4, they become merged. 1264 01:12:26,780 --> 01:12:34,780 Now, in the GPT-4 tokenizer, they changed the regular expression that they use to chunk up text. 1265 01:12:34,780 --> 01:12:37,780 So, the way to see this is that if you come to the tiktoken library, 1266 01:12:37,780 --> 01:12:43,780 and then you go to this file, tiktoken_ext/openai_public.py, 1267 01:12:43,780 --> 01:12:48,780 this is where sort of like the definition of all these different tokenizers that OpenAI maintains is. 1268 01:12:48,780 --> 01:12:53,780 And so, necessarily to do the inference, they had to publish some of the details about the strings. 1269 01:12:53,780 --> 01:12:56,780 So, this is the string that we already saw for GPT-2. 1270 01:12:56,780 --> 01:13:01,780 It is slightly different, but it is actually equivalent to what we discussed here. 
1271 01:13:01,780 --> 01:13:05,780 So, this pattern that we discussed is equivalent to this pattern. 1272 01:13:05,780 --> 01:13:07,780 This one just executes a little bit faster. 1273 01:13:07,780 --> 01:13:11,780 So, here you see a little bit of a slightly different definition, but otherwise it's the same. 1274 01:13:11,780 --> 01:13:14,780 We're going to go into special tokens in a bit. 1275 01:13:14,780 --> 01:13:19,780 And then if you scroll down to CL100K, this is the GPT-4 tokenizer, 1276 01:13:19,780 --> 01:13:22,780 you see that the pattern has changed. 1277 01:13:22,780 --> 01:13:27,780 And this is kind of like the major change, in addition to a bunch of other special tokens, 1278 01:13:27,780 --> 01:13:29,780 which we'll go into in a bit again. 1279 01:13:29,780 --> 01:13:33,780 Now, I'm not going to actually go into the full detail of the pattern change, 1280 01:13:33,780 --> 01:13:35,780 because honestly, this is mind-numbing. 1281 01:13:35,780 --> 01:13:37,780 I would just advise that you pull out your GPT-4 tokenizer pattern, 1282 01:13:37,780 --> 01:13:41,780 pull out ChatGPT and the regex documentation, and just step through it. 1283 01:13:41,780 --> 01:13:46,780 But really, the major changes are, number one, you see this i here? 1284 01:13:46,780 --> 01:13:52,780 That refers to case sensitivity: this is a case-insensitive match. 1285 01:13:52,780 --> 01:13:57,780 And so, the comment that we saw earlier on, oh, we should have used re.IGNORECASE, 1286 01:13:57,780 --> 01:14:05,780 basically, we're now going to be matching these apostrophe s, apostrophe d, apostrophe m, etc. 1287 01:14:05,780 --> 01:14:07,780 We're going to be matching them both in lowercase 1288 01:14:07,780 --> 01:14:08,780 and in uppercase. 1289 01:14:08,780 --> 01:14:10,780 So, that's fixed. 1290 01:14:10,780 --> 01:14:12,780 There's a bunch of different, like, handling of the whitespace 1291 01:14:12,780 --> 01:14:14,780 that I'm not going to go into the full details of. 1292 01:14:14,780 --> 01:14:19,780 And then, one more thing here is you will notice that when they match the numbers, 1293 01:14:19,780 --> 01:14:22,780 they only match one to three digits. 1294 01:14:22,780 --> 01:14:29,780 So, they will never merge numbers that are more than three digits long. 1295 01:14:29,780 --> 01:14:33,780 Only up to three digits of numbers will ever be merged. 1296 01:14:33,780 --> 01:14:35,780 And that's one change that they made as well, 1297 01:14:35,780 --> 01:14:37,780 to prevent tokens 1298 01:14:37,780 --> 01:14:40,780 that are very, very long number sequences. 1299 01:14:40,780 --> 01:14:43,780 But again, we don't really know why they do any of this stuff 1300 01:14:43,780 --> 01:14:45,780 because none of this is documented. 1301 01:14:45,780 --> 01:14:47,780 And it's just, we just get the pattern. 1302 01:14:47,780 --> 01:14:50,780 So, yeah, it is what it is. 1303 01:14:50,780 --> 01:14:53,780 But those are some of the changes that GPT-4 has made. 1304 01:14:53,780 --> 01:14:58,780 And of course, the vocabulary size went from roughly 50k to roughly 100k. 1305 01:14:58,780 --> 01:15:00,780 The next thing I would like to do very briefly 1306 01:15:00,780 --> 01:15:05,780 is to take you through the GPT-2 encoder.py that OpenAI has released. 1307 01:15:05,780 --> 01:15:06,780 This is the file 1308 01:15:06,780 --> 01:15:08,780 I already mentioned to you briefly. 
1309 01:15:08,780 --> 01:15:11,780 Now, this file is fairly short 1310 01:15:11,780 --> 01:15:14,780 and should be relatively understandable to you at this point. 1311 01:15:14,780 --> 01:15:17,780 Starting at the bottom here, 1312 01:15:17,780 --> 01:15:20,780 they are loading two files, 1313 01:15:20,780 --> 01:15:22,780 encoder.json and vocab.bpe. 1314 01:15:22,780 --> 01:15:24,780 And they do some light processing on it 1315 01:15:24,780 --> 01:15:27,780 and then they call this encoder object, which is the tokenizer. 1316 01:15:27,780 --> 01:15:30,780 Now, if you'd like to inspect these two files, 1317 01:15:30,780 --> 01:15:33,780 which together constitute their saved tokenizer, 1318 01:15:33,780 --> 01:15:35,780 then you can do that with a piece of code like this. 1319 01:15:36,780 --> 01:15:39,780 This is where you can download these two files 1320 01:15:39,780 --> 01:15:41,780 and you can inspect them if you'd like. 1321 01:15:41,780 --> 01:15:43,780 And what you will find is that this encoder, 1322 01:15:43,780 --> 01:15:45,780 as they call it in their code, 1323 01:15:45,780 --> 01:15:47,780 is exactly equivalent to our vocab. 1324 01:15:47,780 --> 01:15:52,780 So remember here where we have this vocab object, 1325 01:15:52,780 --> 01:15:54,780 which allowed us to decode very efficiently 1326 01:15:54,780 --> 01:16:00,780 and basically it took us from the integer to the bytes for that integer. 1327 01:16:00,780 --> 01:16:04,780 So our vocab is exactly their encoder. 1328 01:16:04,780 --> 01:16:06,780 And then their vocab.bpe, 1329 01:16:06,780 --> 01:16:08,780 confusingly, 1330 01:16:08,780 --> 01:16:10,780 is actually our merges. 1331 01:16:10,780 --> 01:16:12,780 So their bpe merges, 1332 01:16:12,780 --> 01:16:15,780 which is based on the data inside vocab.bpe, 1333 01:16:15,780 --> 01:16:18,780 ends up being equivalent to our merges. 1334 01:16:18,780 --> 01:16:22,780 So basically they are saving and loading 1335 01:16:22,780 --> 01:16:25,780 the two variables that for us are also critical, 1336 01:16:25,780 --> 01:16:28,780 the merges variable and the vocab variable. 1337 01:16:28,780 --> 01:16:30,780 Using just these two variables, 1338 01:16:30,780 --> 01:16:32,780 you can represent a tokenizer 1339 01:16:32,780 --> 01:16:34,780 and you can both do encoding and decoding 1340 01:16:34,780 --> 01:16:36,780 once you've trained this tokenizer. 1341 01:16:36,780 --> 01:16:40,780 Now the only thing that is actually slightly confusing 1342 01:16:40,780 --> 01:16:42,780 inside what OpenAI does here 1343 01:16:42,780 --> 01:16:45,780 is that in addition to this encoder and the decoder, 1344 01:16:45,780 --> 01:16:47,780 they also have something called a byte encoder 1345 01:16:47,780 --> 01:16:49,780 and a byte decoder. 1346 01:16:49,780 --> 01:16:51,780 And this is actually, unfortunately, 1347 01:16:51,780 --> 01:16:55,780 just kind of a spurious implementation detail. 1348 01:16:55,780 --> 01:16:57,780 It isn't actually deep or interesting in any way, 1349 01:16:57,780 --> 01:16:59,780 so I'm going to skip the discussion of it. 1350 01:16:59,780 --> 01:17:00,780 But what OpenAI does here, 1351 01:17:00,780 --> 01:17:02,780 for reasons that I don't fully understand, 1352 01:17:02,780 --> 01:17:04,780 is that not only have they this tokenizer, 1353 01:17:04,780 --> 01:17:06,780 which can encode and decode, 1354 01:17:06,780 --> 01:17:09,780 but they have a whole separate layer here in addition 1355 01:17:09,780 --> 01:17:11,780 that is used serially with the tokenizer. 
1356 01:17:11,780 --> 01:17:15,780 And so you first do byte encode and then encode, 1357 01:17:15,780 --> 01:17:18,780 and then you do decode and then byte decode. 1358 01:17:18,780 --> 01:17:19,780 So that's the loop, 1359 01:17:19,780 --> 01:17:22,780 and they are just stacked serial on top of each other. 1360 01:17:22,780 --> 01:17:24,780 And it's not that interesting, so I won't cover it, 1361 01:17:24,780 --> 01:17:26,780 and you can step through it if you'd like. 1362 01:17:26,780 --> 01:17:27,780 Otherwise, this file, 1363 01:17:27,780 --> 01:17:30,780 if you ignore the byte encoder and the byte decoder, 1364 01:17:30,780 --> 01:17:32,780 will be algorithmically very familiar with you. 1365 01:17:32,780 --> 01:17:34,780 And the meat of it here is the, 1366 01:17:34,780 --> 01:17:36,780 what they call BPE function, 1367 01:17:36,780 --> 01:17:39,780 and you should recognize this loop here, 1368 01:17:39,780 --> 01:17:41,780 which is very similar to our own while loop, 1369 01:17:41,780 --> 01:17:44,780 where they're trying to identify the bigram, 1370 01:17:44,780 --> 01:17:45,780 a pair, 1371 01:17:45,780 --> 01:17:47,780 that they should be merging next. 1372 01:17:47,780 --> 01:17:49,780 And then here, just like we had, 1373 01:17:49,780 --> 01:17:51,780 they have a for loop trying to merge this pair. 1374 01:17:51,780 --> 01:17:53,780 So they will go over all of the sequence 1375 01:17:53,780 --> 01:17:56,780 and they will merge the pair whenever they find it. 1376 01:17:56,780 --> 01:17:58,780 And they keep repeating that 1377 01:17:58,780 --> 01:18:01,780 until they run out of possible merges in the text. 1378 01:18:01,780 --> 01:18:03,780 So that's the meat of this file, 1379 01:18:03,780 --> 01:18:05,780 and there's an encode and decode function, 1380 01:18:05,780 --> 01:18:07,780 just like we have implemented it. 1381 01:18:07,780 --> 01:18:08,780 So long story short, 1382 01:18:08,780 --> 01:18:10,780 what I want you to take away at this point is that, 1383 01:18:10,780 --> 01:18:12,780 unfortunately, it's a little bit of a messy code that they have, 1384 01:18:12,780 --> 01:18:14,780 but algorithmically it is identical 1385 01:18:14,780 --> 01:18:16,780 to what we've built up above. 1386 01:18:16,780 --> 01:18:18,780 And what we've built up above, if you understand it, 1387 01:18:18,780 --> 01:18:20,780 is algorithmically what is necessary 1388 01:18:20,780 --> 01:18:23,780 to actually build a BPE tokenizer, 1389 01:18:23,780 --> 01:18:26,780 train it, and then both encode and decode. 1390 01:18:26,780 --> 01:18:27,780 The next topic I would like to turn to 1391 01:18:27,780 --> 01:18:29,780 is that of special tokens. 1392 01:18:29,780 --> 01:18:31,780 So in addition to tokens that are coming from, 1393 01:18:31,780 --> 01:18:34,780 you know, raw bytes and the BPE merges, 1394 01:18:34,780 --> 01:18:37,780 we can insert all kinds of tokens that we are going to use 1395 01:18:37,780 --> 01:18:39,780 to delimit different parts of the data 1396 01:18:39,780 --> 01:18:41,780 or introduce to create a special structure 1397 01:18:41,780 --> 01:18:44,780 of the token streams. 1398 01:18:44,780 --> 01:18:47,780 So if you look at this encoder object 1399 01:18:47,780 --> 01:18:50,780 from OpenAI's GPT-2 right here, 1400 01:18:50,780 --> 01:18:52,780 we mentioned this is very similar to our vocab. 1401 01:18:52,780 --> 01:18:59,780 You'll notice that the length of this is 50,257. 
1402 01:18:59,780 --> 01:19:01,780 As I mentioned, it's mapping, 1403 01:19:01,780 --> 01:19:03,780 and it's inverted from the mapping of our vocab. 1404 01:19:03,780 --> 01:19:06,780 Our vocab goes from integer to string, 1405 01:19:06,780 --> 01:19:10,780 and they go the other way around for no amazing reason. 1406 01:19:10,780 --> 01:19:12,780 But the thing to note here is that 1407 01:19:12,780 --> 01:19:15,780 the mapping table here is 50,257. 1408 01:19:15,780 --> 01:19:17,780 Where does that number come from? 1409 01:19:17,780 --> 01:19:19,780 Where are the tokens? 1410 01:19:19,780 --> 01:19:24,780 As I mentioned, there are 256 raw byte tokens. 1411 01:19:24,780 --> 01:19:28,780 And then OpenAI actually did 50,000 merges. 1412 01:19:28,780 --> 01:19:31,780 So those become the other tokens. 1413 01:19:31,780 --> 01:19:33,780 But this would have been 50,257. 1414 01:19:33,780 --> 01:19:36,780 So what is the 57th token? 1415 01:19:36,780 --> 01:19:40,780 And there is basically one special token. 1416 01:19:40,780 --> 01:19:42,780 And that one special token, 1417 01:19:42,780 --> 01:19:45,780 you can see, is called end of text. 1418 01:19:45,780 --> 01:19:47,780 So this is a special token, 1419 01:19:47,780 --> 01:19:49,780 and it's the very last token. 1420 01:19:49,780 --> 01:19:52,780 And this token is used to delimit documents 1421 01:19:52,780 --> 01:19:54,780 in the training set. 1422 01:19:54,780 --> 01:19:56,780 So when we're creating the training data, 1423 01:19:56,780 --> 01:19:57,780 we have all these documents, 1424 01:19:57,780 --> 01:19:58,780 and we tokenize them, 1425 01:19:58,780 --> 01:20:00,780 and we get a stream of tokens. 1426 01:20:00,780 --> 01:20:02,780 Those tokens only range from 0 1427 01:20:02,780 --> 01:20:05,780 to 50,256. 1428 01:20:05,780 --> 01:20:07,780 And then in between those documents, 1429 01:20:07,780 --> 01:20:10,780 we put special end of text token. 1430 01:20:10,780 --> 01:20:13,780 And we insert that token in between documents. 1431 01:20:13,780 --> 01:20:17,780 And we are using this as a signal to the language model 1432 01:20:17,780 --> 01:20:19,780 that the document has ended, 1433 01:20:19,780 --> 01:20:21,780 and what follows is going to be unrelated 1434 01:20:21,780 --> 01:20:23,780 to the document previously. 1435 01:20:23,780 --> 01:20:26,780 That said, the language model has to learn this from data. 1436 01:20:26,780 --> 01:20:29,780 It needs to learn that this token usually means 1437 01:20:29,780 --> 01:20:31,780 that it should wipe its sort of memory 1438 01:20:31,780 --> 01:20:32,780 of what came before, 1439 01:20:32,780 --> 01:20:34,780 and what came before this token 1440 01:20:34,780 --> 01:20:36,780 is not actually informative to what comes next. 1441 01:20:36,780 --> 01:20:38,780 But we are expecting the language model 1442 01:20:38,780 --> 01:20:39,780 to just like learn this, 1443 01:20:39,780 --> 01:20:41,780 but we're giving it the special sort of delimiter 1444 01:20:41,780 --> 01:20:43,780 of these documents. 1445 01:20:43,780 --> 01:20:45,780 We can go here to tiktokenizer, 1446 01:20:45,780 --> 01:20:48,780 and this is the GPT to tokenizer, 1447 01:20:48,780 --> 01:20:50,780 our code that we've been playing with before. 1448 01:20:50,780 --> 01:20:51,780 So we can add here, right? 1449 01:20:51,780 --> 01:20:53,780 Hello world, how are you? 1450 01:20:53,780 --> 01:20:55,780 And we're getting different tokens. 
1451 01:20:55,780 --> 01:20:57,780 But now you can see what happens 1452 01:20:57,780 --> 01:20:59,780 if I put end of text. 1453 01:20:59,780 --> 01:21:00,780 You see how, 1454 01:21:00,780 --> 01:21:02,780 until I finished it, 1455 01:21:02,780 --> 01:21:04,780 these are all different tokens. 1456 01:21:04,780 --> 01:21:06,780 End of text, 1457 01:21:06,780 --> 01:21:08,780 still tokens, 1458 01:21:08,780 --> 01:21:09,780 and now when I finish it, 1459 01:21:09,780 --> 01:21:13,780 suddenly we get token 50,256. 1460 01:21:13,780 --> 01:21:16,780 And the reason this works is because 1461 01:21:16,780 --> 01:21:19,780 this didn't actually go through the BPE merges. 1462 01:21:19,780 --> 01:21:23,780 Instead, the code that actually outputs the tokens 1463 01:21:23,780 --> 01:21:25,780 has special case instructions 1464 01:21:25,780 --> 01:21:28,780 for handling special tokens. 1465 01:21:28,780 --> 01:21:30,780 We did not see these special instructions 1466 01:21:30,780 --> 01:21:33,780 for handling special tokens in the encoder.py. 1467 01:21:33,780 --> 01:21:35,780 It's absent there. 1468 01:21:35,780 --> 01:21:37,780 But if you go to tiktoken library, 1469 01:21:37,780 --> 01:21:39,780 which is implemented in Rust, 1470 01:21:39,780 --> 01:21:41,780 you will find all kinds of special case handling 1471 01:21:41,780 --> 01:21:43,780 for these special tokens 1472 01:21:43,780 --> 01:21:45,780 that you can register, create, 1473 01:21:45,780 --> 01:21:47,780 add to the vocabulary, 1474 01:21:47,780 --> 01:21:48,780 and then it looks for them. 1475 01:21:48,780 --> 01:21:51,780 And whenever it sees these special tokens like this, 1476 01:21:51,780 --> 01:21:54,780 it will actually come in and swap in that special token. 1477 01:21:54,780 --> 01:21:57,780 So these things are outside of the typical algorithm 1478 01:21:57,780 --> 01:21:59,780 of byte pairing coding. 1479 01:21:59,780 --> 01:22:02,780 So these special tokens are used pervasively, 1480 01:22:02,780 --> 01:22:05,780 not just in basically base language modeling 1481 01:22:05,780 --> 01:22:07,780 of predicting the next token in the sequence, 1482 01:22:07,780 --> 01:22:09,780 but especially when it gets to later 1483 01:22:09,780 --> 01:22:10,780 to the fine-tuning stage 1484 01:22:10,780 --> 01:22:13,780 and all of the chat GPT sort of aspects of it, 1485 01:22:13,780 --> 01:22:15,780 because we don't just want to delimit documents, 1486 01:22:15,780 --> 01:22:17,780 we want to delimit entire conversations 1487 01:22:17,780 --> 01:22:19,780 between an assistant and a user. 1488 01:22:19,780 --> 01:22:22,780 So if I refresh this tiktokenizer page, 1489 01:22:22,780 --> 01:22:24,780 the default example that they have here 1490 01:22:24,780 --> 01:22:28,780 is using not sort of base model encoders, 1491 01:22:28,780 --> 01:22:32,780 but fine-tuned model sort of tokenizers. 1492 01:22:32,780 --> 01:22:35,780 So for example, using the GPT 3.5 Turbo scheme, 1493 01:22:35,780 --> 01:22:38,780 these here are all special tokens, 1494 01:22:38,780 --> 01:22:41,780 IAM start, IAM end, et cetera. 1495 01:22:41,780 --> 01:22:44,780 This is short for imaginary model log 1496 01:22:44,780 --> 01:22:46,780 underscore start, by the way. 
1497 01:22:46,780 --> 01:22:49,780 But you can see here that there's a sort of start 1498 01:22:49,780 --> 01:22:51,780 and end of every single message, 1499 01:22:51,780 --> 01:22:53,780 and there can be many other tokens, 1500 01:22:53,780 --> 01:22:57,780 lots of tokens in use to delimit these conversations 1501 01:22:57,780 --> 01:23:01,780 and kind of keep track of the flow of the messages here. 1502 01:23:01,780 --> 01:23:04,780 Now we can go back to the tiktoken library, 1503 01:23:04,780 --> 01:23:06,780 and here when you scroll to the bottom, 1504 01:23:06,780 --> 01:23:09,780 they talk about how you can extend tiktoken, 1505 01:23:09,780 --> 01:23:12,780 and you can create, basically you can fork 1506 01:23:12,780 --> 01:23:16,780 the CL100K base tokenizer used in GPT-4, 1507 01:23:16,780 --> 01:23:18,780 and for example, you can extend it 1508 01:23:18,780 --> 01:23:19,780 by adding more special tokens, 1509 01:23:19,780 --> 01:23:20,780 and these are totally up to you. 1510 01:23:20,780 --> 01:23:22,780 You can come up with any arbitrary tokens 1511 01:23:22,780 --> 01:23:25,780 and add them with the new ID afterwards, 1512 01:23:25,780 --> 01:23:27,780 and the tiktoken library will correct 1513 01:23:27,780 --> 01:23:29,780 or directly swap them out 1514 01:23:29,780 --> 01:23:32,780 when it sees this in the strings. 1515 01:23:32,780 --> 01:23:34,780 Now we can also go back to this file, 1516 01:23:34,780 --> 01:23:36,780 which we looked at previously, 1517 01:23:36,780 --> 01:23:39,780 and I mentioned that the GPT-2 in tiktoken, 1518 01:23:39,780 --> 01:23:41,780 opening in public.py, 1519 01:23:41,780 --> 01:23:43,780 we have the vocabulary, 1520 01:23:43,780 --> 01:23:45,780 we have the pattern for splitting, 1521 01:23:45,780 --> 01:23:46,780 and then here we are registering 1522 01:23:46,780 --> 01:23:48,780 the single special token in GPT-2, 1523 01:23:48,780 --> 01:23:50,780 which was the end of text token, 1524 01:23:50,780 --> 01:23:52,780 and we saw that it has this ID. 1525 01:23:52,780 --> 01:23:55,780 In GPT-4, when they defined this here, 1526 01:23:55,780 --> 01:23:57,780 you see that the pattern has changed 1527 01:23:57,780 --> 01:23:58,780 as we've discussed, 1528 01:23:58,780 --> 01:24:00,780 but also the special tokens have changed 1529 01:24:00,780 --> 01:24:01,780 in this tokenizer. 1530 01:24:01,780 --> 01:24:03,780 So we of course have the end of text, 1531 01:24:03,780 --> 01:24:04,780 just like in GPT-2, 1532 01:24:04,780 --> 01:24:06,780 but we also see three, 1533 01:24:06,780 --> 01:24:08,780 sorry, four additional tokens here, 1534 01:24:08,780 --> 01:24:10,780 thim prefix, middle, and suffix. 1535 01:24:10,780 --> 01:24:11,780 What is thim? 1536 01:24:11,780 --> 01:24:14,780 Thim is short for fill in the middle, 1537 01:24:14,780 --> 01:24:16,780 and if you'd like to learn more about this idea, 1538 01:24:16,780 --> 01:24:19,780 it comes from this paper, 1539 01:24:19,780 --> 01:24:21,780 and I'm not going to go into detail in this video, 1540 01:24:21,780 --> 01:24:22,780 it's beyond this video, 1541 01:24:22,780 --> 01:24:24,780 and then there's one additional 1542 01:24:24,780 --> 01:24:26,780 sort of token here. 1543 01:24:26,780 --> 01:24:28,780 So that's that encoding as well. 1544 01:24:28,780 --> 01:24:30,780 So it's very common, basically, 1545 01:24:30,780 --> 01:24:32,780 to train a language model, 1546 01:24:32,780 --> 01:24:34,780 and then if you'd like, 1547 01:24:34,780 --> 01:24:36,780 you can add special tokens. 
1548 01:24:36,780 --> 01:24:38,780 Now, when you add special tokens, 1549 01:24:38,780 --> 01:24:40,780 you of course have to do some model surgery 1550 01:24:40,780 --> 01:24:42,780 to the transformer 1551 01:24:42,780 --> 01:24:44,780 and all the parameters involved in that transformer, 1552 01:24:44,780 --> 01:24:46,780 because you are basically adding an integer, 1553 01:24:46,780 --> 01:24:47,780 and you want to make sure that, 1554 01:24:47,780 --> 01:24:49,780 for example, your embedding matrix 1555 01:24:49,780 --> 01:24:51,780 for the vocabulary tokens 1556 01:24:51,780 --> 01:24:53,780 has to be extended by adding a row, 1557 01:24:53,780 --> 01:24:55,780 and typically this row would be initialized 1558 01:24:55,780 --> 01:24:57,780 with small random numbers or something like that, 1559 01:24:57,780 --> 01:24:59,780 because we need to have a vector 1560 01:24:59,780 --> 01:25:01,780 that now stands for that token. 1561 01:25:01,780 --> 01:25:02,780 In addition to that, 1562 01:25:02,780 --> 01:25:04,780 you have to go to the final layer of the transformer, 1563 01:25:04,780 --> 01:25:06,780 and you have to make sure that that projection 1564 01:25:06,780 --> 01:25:08,780 at the very end into the classifier 1565 01:25:08,780 --> 01:25:10,780 is extended by one as well. 1566 01:25:10,780 --> 01:25:12,780 So basically there's some model surgery involved 1567 01:25:12,780 --> 01:25:15,780 that you have to couple with the tokenization changes 1568 01:25:15,780 --> 01:25:18,780 if you are going to add special tokens. 1569 01:25:18,780 --> 01:25:20,780 But this is a very common operation that people do, 1570 01:25:20,780 --> 01:25:22,780 especially if they'd like to fine-tune the model, 1571 01:25:22,780 --> 01:25:24,780 for example, taking it from a base model 1572 01:25:24,780 --> 01:25:28,780 to a chat model like ChatGPT. 1573 01:25:28,780 --> 01:25:29,780 Okay, so at this point, 1574 01:25:29,780 --> 01:25:30,780 you should have everything you need 1575 01:25:30,780 --> 01:25:32,780 in order to build your own GPT-4 tokenizer. 1576 01:25:32,780 --> 01:25:34,780 Now, in the process of developing this lecture, 1577 01:25:34,780 --> 01:25:35,780 I've done that, 1578 01:25:35,780 --> 01:25:39,780 and I've published the code under this repository minBPE. 1579 01:25:39,780 --> 01:25:42,780 So minBPE looks like this right now as I'm recording, 1580 01:25:42,780 --> 01:25:45,780 but the minBPE repository will probably change quite a bit 1581 01:25:45,780 --> 01:25:49,780 because I intend to continue working on it. 1582 01:25:49,780 --> 01:25:51,780 In addition to the minBPE repository, 1583 01:25:51,780 --> 01:25:53,780 I've published this exercise progression 1584 01:25:53,780 --> 01:25:54,780 that you can follow. 1585 01:25:54,780 --> 01:25:56,780 So if you go to exercise.md here, 1586 01:25:56,780 --> 01:26:00,780 this is sort of me breaking up the task ahead of you 1587 01:26:00,780 --> 01:26:03,780 into four steps that sort of build up 1588 01:26:03,780 --> 01:26:05,780 to what can be a GPT-4 tokenizer. 1589 01:26:05,780 --> 01:26:08,780 And so feel free to follow these steps exactly 1590 01:26:08,780 --> 01:26:10,780 and follow a little bit of the guidance 1591 01:26:10,780 --> 01:26:11,780 that I've laid out here. 1592 01:26:11,780 --> 01:26:13,780 And anytime you feel stuck, 1593 01:26:13,780 --> 01:26:16,780 just reference the minBPE repository here. 
1594 01:26:16,780 --> 01:26:18,780 So either the tests could be useful 1595 01:26:18,780 --> 01:26:20,780 or the minBPE repository itself. 1596 01:26:20,780 --> 01:26:22,780 I try to keep the code fairly clean 1597 01:26:22,780 --> 01:26:24,780 and understandable. 1598 01:26:24,780 --> 01:26:29,780 And so feel free to reference it whenever you get stuck. 1599 01:26:29,780 --> 01:26:31,780 In addition to that, basically, 1600 01:26:31,780 --> 01:26:33,780 once you write it, 1601 01:26:33,780 --> 01:26:36,780 you should be able to reproduce this behavior from Tiktoken. 1602 01:26:36,780 --> 01:26:38,780 So getting the GPT-4 tokenizer, 1603 01:26:38,780 --> 01:26:40,780 you can encode this string 1604 01:26:40,780 --> 01:26:42,780 and you should get these tokens. 1605 01:26:42,780 --> 01:26:43,780 And then you can encode and decode 1606 01:26:43,780 --> 01:26:45,780 the exact same string to recover it. 1607 01:26:45,780 --> 01:26:46,780 And in addition to all that, 1608 01:26:46,780 --> 01:26:49,780 you should be able to implement your own train function, 1609 01:26:49,780 --> 01:26:51,780 which Tiktoken library does not provide. 1610 01:26:51,780 --> 01:26:53,780 It's, again, only inference code. 1611 01:26:53,780 --> 01:26:55,780 But you could write your own train. 1612 01:26:55,780 --> 01:26:57,780 minBPE does it as well. 1613 01:26:57,780 --> 01:27:01,780 And that will allow you to train your own token vocabularies. 1614 01:27:01,780 --> 01:27:03,780 So here's some of the code inside minBPE, 1615 01:27:03,780 --> 01:27:06,780 minBPE shows the token vocabularies 1616 01:27:06,780 --> 01:27:08,780 that you might obtain. 1617 01:27:08,780 --> 01:27:10,780 So on the left here, 1618 01:27:10,780 --> 01:27:12,780 we have the GPT-4 merges. 1619 01:27:12,780 --> 01:27:16,780 So the first 256 are raw individual bytes. 1620 01:27:16,780 --> 01:27:18,780 And then here I am visualizing the merges 1621 01:27:18,780 --> 01:27:20,780 that GPT-4 performed during its training. 1622 01:27:20,780 --> 01:27:23,780 So the very first merge that GPT-4 did 1623 01:27:23,780 --> 01:27:26,780 was merge two spaces into a single token 1624 01:27:26,780 --> 01:27:28,780 for, you know, two spaces. 1625 01:27:28,780 --> 01:27:30,780 And that is the token 256. 1626 01:27:30,780 --> 01:27:32,780 And so this is the order in which things merged 1627 01:27:32,780 --> 01:27:33,780 during GPT-4 training. 1628 01:27:33,780 --> 01:27:35,780 And this is the merge order 1629 01:27:35,780 --> 01:27:38,780 that we obtain in minBPE 1630 01:27:38,780 --> 01:27:40,780 by training a tokenizer. 1631 01:27:40,780 --> 01:27:41,780 And in this case, I trained it 1632 01:27:41,780 --> 01:27:43,780 on a Wikipedia page of Taylor Swift. 1633 01:27:43,780 --> 01:27:45,780 Not because I'm a Swifty, 1634 01:27:45,780 --> 01:27:47,780 but because that is one of the longest 1635 01:27:47,780 --> 01:27:49,780 Wikipedia pages apparently that's available. 1636 01:27:49,780 --> 01:27:51,780 But she is pretty cool. 1637 01:27:51,780 --> 01:27:55,780 And what was I going to say? 1638 01:27:55,780 --> 01:27:58,780 Yeah, so you can compare these two vocabularies. 1639 01:27:58,780 --> 01:28:02,780 And so as an example, 1640 01:28:02,780 --> 01:28:05,780 here GPT-4 merged IN to become IN. 1641 01:28:05,780 --> 01:28:07,780 And we've done the exact same thing 1642 01:28:07,780 --> 01:28:09,780 on this token, 259. 1643 01:28:09,780 --> 01:28:11,780 Here, space T becomes space T. 
1644 01:28:11,780 --> 01:28:14,780 And that happened for us a little bit later as well. 1645 01:28:14,780 --> 01:28:16,780 So the difference here is, again, 1646 01:28:16,780 --> 01:28:17,780 to my understanding, 1647 01:28:17,780 --> 01:28:19,780 only a difference of the training set. 1648 01:28:19,780 --> 01:28:21,780 As an example, because I see a lot of white space, 1649 01:28:21,780 --> 01:28:23,780 I expect that GPT-4 probably had a lot of Python code 1650 01:28:23,780 --> 01:28:25,780 in its training set, I'm not sure, 1651 01:28:25,780 --> 01:28:27,780 for the tokenizer. 1652 01:28:27,780 --> 01:28:29,780 And here we see much less of that, 1653 01:28:29,780 --> 01:28:32,780 of course, in the Wikipedia page. 1654 01:28:32,780 --> 01:28:34,780 So roughly speaking, they look the same. 1655 01:28:34,780 --> 01:28:35,780 And they look the same because they're 1656 01:28:35,780 --> 01:28:36,780 running the same algorithm. 1657 01:28:36,780 --> 01:28:38,780 And when you train your own, 1658 01:28:38,780 --> 01:28:40,780 you're probably going to get something similar 1659 01:28:40,780 --> 01:28:41,780 depending on what you train it on. 1660 01:28:41,780 --> 01:28:43,780 Okay, so we are now going to move on 1661 01:28:43,780 --> 01:28:44,780 from TickToken 1662 01:28:44,780 --> 01:28:46,780 and the way that OpenAI tokenizes its strings. 1663 01:28:46,780 --> 01:28:48,780 And we're going to discuss one more 1664 01:28:48,780 --> 01:28:49,780 very commonly used library 1665 01:28:49,780 --> 01:28:51,780 for working with tokenization in LLMs, 1666 01:28:51,780 --> 01:28:53,780 and that is SentencePiece. 1667 01:28:53,780 --> 01:28:56,780 So SentencePiece is very commonly used 1668 01:28:56,780 --> 01:28:59,780 in language models because unlike TickToken, 1669 01:28:59,780 --> 01:29:01,780 it can do both training and inference 1670 01:29:01,780 --> 01:29:03,780 and is quite efficient at both. 1671 01:29:03,780 --> 01:29:05,780 It supports a number of algorithms 1672 01:29:05,780 --> 01:29:07,780 for training vocabularies, 1673 01:29:07,780 --> 01:29:09,780 but one of them is the byte pairing coding algorithm 1674 01:29:09,780 --> 01:29:10,780 that we've been looking at. 1675 01:29:10,780 --> 01:29:12,780 So it supports it. 1676 01:29:12,780 --> 01:29:14,780 Now, SentencePiece is used both by Lama 1677 01:29:14,780 --> 01:29:17,780 and Mistral series and many other models as well. 1678 01:29:17,780 --> 01:29:21,780 It is on GitHub under Google slash SentencePiece. 1679 01:29:21,780 --> 01:29:23,780 And the big difference with SentencePiece, 1680 01:29:23,780 --> 01:29:25,780 and we're going to look at example 1681 01:29:25,780 --> 01:29:28,780 because this is kind of hard and subtle to explain, 1682 01:29:28,780 --> 01:29:30,780 is that they think different 1683 01:29:30,780 --> 01:29:33,780 about the order of operations here. 1684 01:29:33,780 --> 01:29:35,780 So in the case of TickToken, 1685 01:29:35,780 --> 01:29:39,780 we first take our code points in a string. 1686 01:29:39,780 --> 01:29:41,780 We encode them using UTF-82 bytes 1687 01:29:41,780 --> 01:29:43,780 and then we're merging bytes. 1688 01:29:43,780 --> 01:29:45,780 It's fairly straightforward. 1689 01:29:45,780 --> 01:29:46,780 For SentencePiece, 1690 01:29:46,780 --> 01:29:48,780 it works directly on the level 1691 01:29:48,780 --> 01:29:50,780 of the code points themselves. 
1692 01:29:50,780 --> 01:29:52,780 So it looks at whatever code points 1693 01:29:52,780 --> 01:29:54,780 are available in your training set 1694 01:29:54,780 --> 01:29:56,780 and then it starts merging those code points. 1695 01:29:56,780 --> 01:30:01,780 And the BPE is running on the level of code points. 1696 01:30:01,780 --> 01:30:04,780 And if you happen to run out of code points, 1697 01:30:04,780 --> 01:30:06,780 so there are maybe some rare code points 1698 01:30:06,780 --> 01:30:07,780 that just don't come up too often 1699 01:30:07,780 --> 01:30:08,780 and the rarity is determined 1700 01:30:08,780 --> 01:30:11,780 by this character coverage hyperparameter, 1701 01:30:11,780 --> 01:30:14,780 then these code points will either get maps 1702 01:30:14,780 --> 01:30:16,780 to a special unknown token, 1703 01:30:16,780 --> 01:30:17,780 like Ankh, 1704 01:30:17,780 --> 01:30:20,780 or if you have the byte fallback option turned on, 1705 01:30:20,780 --> 01:30:23,780 then that will take those rare code points, 1706 01:30:23,780 --> 01:30:25,780 it will encode them using UTF-8, 1707 01:30:25,780 --> 01:30:27,780 and then the individual bytes of that encoding 1708 01:30:27,780 --> 01:30:29,780 will be translated into tokens. 1709 01:30:29,780 --> 01:30:31,780 And there are these special byte tokens 1710 01:30:31,780 --> 01:30:33,780 that basically get added to the vocabulary. 1711 01:30:33,780 --> 01:30:37,780 So it uses BPE on the code points 1712 01:30:37,780 --> 01:30:39,780 and then it falls back to bytes 1713 01:30:39,780 --> 01:30:42,780 for rare code points. 1714 01:30:42,780 --> 01:30:44,780 And so that's kind of like the difference. 1715 01:30:44,780 --> 01:30:45,780 Personally, I find that TickToken 1716 01:30:45,780 --> 01:30:47,780 is significantly cleaner, 1717 01:30:47,780 --> 01:30:48,780 but it's kind of like a subtle 1718 01:30:48,780 --> 01:30:49,780 but pretty major difference 1719 01:30:49,780 --> 01:30:51,780 between the way they approach tokenization. 1720 01:30:51,780 --> 01:30:52,780 Let's work with a concrete example 1721 01:30:52,780 --> 01:30:54,780 because otherwise this is kind of hard 1722 01:30:54,780 --> 01:30:57,780 to get your head around. 1723 01:30:57,780 --> 01:30:59,780 So let's work with a concrete example. 1724 01:30:59,780 --> 01:31:02,780 This is how we can import sentence piece. 1725 01:31:02,780 --> 01:31:04,780 And then here we're going to take, 1726 01:31:04,780 --> 01:31:06,780 I think I took like the description of sentence piece 1727 01:31:06,780 --> 01:31:08,780 and I just created like a little toy data set. 1728 01:31:08,780 --> 01:31:09,780 It really likes to have a file. 1729 01:31:09,780 --> 01:31:13,780 So I created a toy.txt file with this content. 1730 01:31:13,780 --> 01:31:15,780 Now, what's kind of a little bit crazy 1731 01:31:15,780 --> 01:31:16,780 about sentence piece 1732 01:31:16,780 --> 01:31:19,780 is that there's a ton of options and configurations. 1733 01:31:19,780 --> 01:31:20,780 And the reason this is so 1734 01:31:20,780 --> 01:31:22,780 is because sentence piece has been around, 1735 01:31:22,780 --> 01:31:23,780 I think for a while, 1736 01:31:23,780 --> 01:31:24,780 and it really tries to handle 1737 01:31:24,780 --> 01:31:26,780 a large diversity of things. 1738 01:31:26,780 --> 01:31:28,780 And because it's been around, 1739 01:31:28,780 --> 01:31:30,780 I think it has quite a bit of accumulated 1740 01:31:30,780 --> 01:31:32,780 historical baggage as well. 
1741 01:31:32,780 --> 01:31:33,780 And so in particular, 1742 01:31:33,780 --> 01:31:36,780 there's like a ton of configuration arguments. 1743 01:31:36,780 --> 01:31:38,780 This is not even all of it. 1744 01:31:38,780 --> 01:31:42,780 You can go to here to see all the training options. 1745 01:31:42,780 --> 01:31:45,780 And there's also quite useful documentation 1746 01:31:45,780 --> 01:31:48,780 like the raw protobuf that is used 1747 01:31:48,780 --> 01:31:52,780 to represent the trainer spec and so on. 1748 01:31:52,780 --> 01:31:54,780 Many of these options are irrelevant to us. 1749 01:31:54,780 --> 01:31:56,780 So maybe to point out one example, 1750 01:31:56,780 --> 01:31:58,780 dash dash shrinking factor. 1751 01:31:58,780 --> 01:32:00,780 This shrinking factor is not used 1752 01:32:00,780 --> 01:32:02,780 in the byte pairing coding algorithm. 1753 01:32:02,780 --> 01:32:05,780 So this is just an argument that is irrelevant to us. 1754 01:32:05,780 --> 01:32:10,780 It applies to a different training algorithm. 1755 01:32:10,780 --> 01:32:11,780 Now, what I tried to do here 1756 01:32:11,780 --> 01:32:13,780 is I tried to set up sentence piece 1757 01:32:13,780 --> 01:32:15,780 in a way that is very, very similar. 1758 01:32:15,780 --> 01:32:18,780 So I can tell to maybe identical, hopefully, 1759 01:32:18,780 --> 01:32:21,780 to the way that Lama2 was trained. 1760 01:32:21,780 --> 01:32:25,780 So the way they trained their own tokenizer. 1761 01:32:25,780 --> 01:32:27,780 And the way I did this was basically 1762 01:32:27,780 --> 01:32:29,780 you can take the tokenizer.model file 1763 01:32:29,780 --> 01:32:30,780 that Meta released, 1764 01:32:30,780 --> 01:32:34,780 and you can open it using the protobuf 1765 01:32:34,780 --> 01:32:37,780 sort of file that you can generate. 1766 01:32:37,780 --> 01:32:39,780 And then you can inspect all the options. 1767 01:32:39,780 --> 01:32:41,780 And I tried to copy over all the options 1768 01:32:41,780 --> 01:32:42,780 that looked relevant. 1769 01:32:42,780 --> 01:32:44,780 So here we set up the input. 1770 01:32:44,780 --> 01:32:46,780 This is a raw text in this file. 1771 01:32:46,780 --> 01:32:47,780 Here's going to be the output. 1772 01:32:47,780 --> 01:32:52,780 So it's going to be protoc400.model and .vocap. 1773 01:32:52,780 --> 01:32:54,780 We're saying that we're going to use the BP algorithm, 1774 01:32:54,780 --> 01:32:56,780 and we want a vocab size of 400. 1775 01:32:56,780 --> 01:32:58,780 And there's a ton of configurations here 1776 01:32:58,780 --> 01:33:04,780 for basically preprocessing 1777 01:33:04,780 --> 01:33:06,780 and normalization rules, as they're called. 1778 01:33:06,780 --> 01:33:09,780 Normalization used to be very prevalent, 1779 01:33:09,780 --> 01:33:12,780 I would say, before LLMs in natural language processing. 1780 01:33:12,780 --> 01:33:13,780 So in machine translation 1781 01:33:13,780 --> 01:33:15,780 and text classification and so on, 1782 01:33:15,780 --> 01:33:17,780 you want to normalize and simplify the text, 1783 01:33:17,780 --> 01:33:18,780 and you want to turn it all lowercase, 1784 01:33:18,780 --> 01:33:21,780 and you want to remove all double white space, etc. 1785 01:33:21,780 --> 01:33:22,780 And in language models, 1786 01:33:22,780 --> 01:33:24,780 we prefer not to do any of it, 1787 01:33:24,780 --> 01:33:25,780 or at least that is my preference 1788 01:33:25,780 --> 01:33:26,780 as a deep learning person. 1789 01:33:26,780 --> 01:33:28,780 You want to not touch your data. 
1790 01:33:28,780 --> 01:33:29,780 You want to keep the raw data 1791 01:33:29,780 --> 01:33:33,780 as much as possible in a raw form. 1792 01:33:33,780 --> 01:33:34,780 So you're basically trying to turn off 1793 01:33:34,780 --> 01:33:37,780 a lot of this if you can. 1794 01:33:37,780 --> 01:33:38,780 The other thing that sentence piece does 1795 01:33:38,780 --> 01:33:41,780 is that it has this concept of sentences. 1796 01:33:41,780 --> 01:33:43,780 So sentence piece, 1797 01:33:43,780 --> 01:33:44,780 it's back, 1798 01:33:44,780 --> 01:33:45,780 it kind of like was developed, 1799 01:33:45,780 --> 01:33:46,780 I think, early in the days 1800 01:33:46,780 --> 01:33:49,780 where there was an idea 1801 01:33:49,780 --> 01:33:51,780 that you're training a tokenizer 1802 01:33:51,780 --> 01:33:53,780 on a bunch of independent sentences. 1803 01:33:53,780 --> 01:33:54,780 So it has a lot of like 1804 01:33:54,780 --> 01:33:56,780 how many sentences you're going to train on, 1805 01:33:56,780 --> 01:34:01,780 what is the maximum sentence length, 1806 01:34:01,780 --> 01:34:02,780 shuffling sentences. 1807 01:34:02,780 --> 01:34:03,780 And so for it, 1808 01:34:03,780 --> 01:34:04,780 sentences are kind of like 1809 01:34:04,780 --> 01:34:05,780 the individual training examples. 1810 01:34:05,780 --> 01:34:07,780 But again, in the context of LLMs, 1811 01:34:07,780 --> 01:34:09,780 I find that this is like a very spurious 1812 01:34:09,780 --> 01:34:10,780 and weird distinction. 1813 01:34:10,780 --> 01:34:12,780 Like sentences are 1814 01:34:12,780 --> 01:34:14,780 just like don't touch the raw data. 1815 01:34:14,780 --> 01:34:15,780 Sentences happen to exist. 1816 01:34:15,780 --> 01:34:17,780 But in the raw data sets, 1817 01:34:17,780 --> 01:34:19,780 there are a lot of like in-betweens, 1818 01:34:19,780 --> 01:34:20,780 like what exactly is a sentence? 1819 01:34:20,780 --> 01:34:22,780 What isn't a sentence? 1820 01:34:22,780 --> 01:34:24,780 And so I think like it's really hard to define 1821 01:34:24,780 --> 01:34:26,780 what an actual sentence is 1822 01:34:26,780 --> 01:34:28,780 if you really like dig into it. 1823 01:34:28,780 --> 01:34:30,780 And there could be different concepts of it 1824 01:34:30,780 --> 01:34:31,780 in different languages or something like that. 1825 01:34:31,780 --> 01:34:33,780 So why even introduce the concept? 1826 01:34:33,780 --> 01:34:35,780 It doesn't honestly make sense to me. 1827 01:34:35,780 --> 01:34:37,780 I would just prefer to treat a file 1828 01:34:37,780 --> 01:34:40,780 as a giant stream of bytes. 1829 01:34:40,780 --> 01:34:41,780 It has a lot of treatment around 1830 01:34:41,780 --> 01:34:43,780 the rare word characters. 1831 01:34:43,780 --> 01:34:44,780 And when I say word, 1832 01:34:44,780 --> 01:34:45,780 I mean code points. 1833 01:34:45,780 --> 01:34:47,780 We're going to come back to this in a second. 1834 01:34:47,780 --> 01:34:48,780 And it has a lot of other rules 1835 01:34:48,780 --> 01:34:52,780 for basically splitting digits, 1836 01:34:52,780 --> 01:34:54,780 splitting white space and numbers 1837 01:34:54,780 --> 01:34:55,780 and how you deal with that. 1838 01:34:55,780 --> 01:34:58,780 So these are some kind of like merge rules. 1839 01:34:58,780 --> 01:34:59,780 So I think this is a little bit equivalent 1840 01:34:59,780 --> 01:35:02,780 to TikToken using the regular expression 1841 01:35:02,780 --> 01:35:04,780 to split up categories. 
1842 01:35:04,780 --> 01:35:07,780 There's like kind of equivalence of it 1843 01:35:07,780 --> 01:35:09,780 if you squint at it in sentence piece 1844 01:35:09,780 --> 01:35:10,780 where you can also, for example, 1845 01:35:10,780 --> 01:35:16,780 split up the digits and so on. 1846 01:35:16,780 --> 01:35:17,780 There's a few more things here 1847 01:35:17,780 --> 01:35:18,780 that I'll come back to in a bit. 1848 01:35:18,780 --> 01:35:19,780 And then there are some special tokens 1849 01:35:19,780 --> 01:35:20,780 that you can indicate. 1850 01:35:20,780 --> 01:35:23,780 And it hardcodes the UNK token, 1851 01:35:23,780 --> 01:35:24,780 the beginning of sentence, 1852 01:35:24,780 --> 01:35:25,780 end of sentence, 1853 01:35:25,780 --> 01:35:27,780 and a pad token. 1854 01:35:27,780 --> 01:35:29,780 And the UNK token must exist 1855 01:35:29,780 --> 01:35:31,780 from my understanding. 1856 01:35:31,780 --> 01:35:33,780 And then some systems things. 1857 01:35:33,780 --> 01:35:34,780 So we can train. 1858 01:35:34,780 --> 01:35:36,780 And when I press train, 1859 01:35:36,780 --> 01:35:38,780 it's going to create this file 1860 01:35:38,780 --> 01:35:39,780 talk400.model 1861 01:35:39,780 --> 01:35:41,780 and talk400.vocab. 1862 01:35:41,780 --> 01:35:43,780 I can then load the model file 1863 01:35:43,780 --> 01:35:46,780 and I can inspect the vocabulary of it. 1864 01:35:46,780 --> 01:35:49,780 And so we trained vocab size 400 1865 01:35:49,780 --> 01:35:52,780 on this text here. 1866 01:35:52,780 --> 01:35:54,780 And these are the individual pieces, 1867 01:35:54,780 --> 01:35:55,780 the individual tokens 1868 01:35:55,780 --> 01:35:57,780 that sentence piece will create. 1869 01:35:57,780 --> 01:35:58,780 So in the beginning, 1870 01:35:58,780 --> 01:36:00,780 we see that we have the UNK token 1871 01:36:00,780 --> 01:36:02,780 with the ID 0. 1872 01:36:02,780 --> 01:36:04,780 Then we have the beginning of sequence, 1873 01:36:04,780 --> 01:36:06,780 end of sequence, 1 and 2. 1874 01:36:06,780 --> 01:36:08,780 And then we said that the pad ID 1875 01:36:08,780 --> 01:36:09,780 is negative 1. 1876 01:36:09,780 --> 01:36:11,780 So we chose not to use it. 1877 01:36:11,780 --> 01:36:13,780 So there's no pad ID here. 1878 01:36:13,780 --> 01:36:17,780 Then these are individual byte tokens. 1879 01:36:17,780 --> 01:36:19,780 So here we saw that byte fallback 1880 01:36:19,780 --> 01:36:21,780 in Llama was turned on. 1881 01:36:21,780 --> 01:36:22,780 So it's true. 1882 01:36:22,780 --> 01:36:24,780 So what follows are going to be 1883 01:36:24,780 --> 01:36:27,780 the 256 byte tokens. 1884 01:36:27,780 --> 01:36:32,780 And these are their IDs. 1885 01:36:32,780 --> 01:36:34,780 And then at the bottom, 1886 01:36:34,780 --> 01:36:35,780 after the byte tokens, 1887 01:36:35,780 --> 01:36:38,780 come the merges. 1888 01:36:38,780 --> 01:36:41,780 And these are the parent nodes in the merges. 1889 01:36:41,780 --> 01:36:42,780 So we're not seeing the children. 1890 01:36:42,780 --> 01:36:45,780 We're just seeing the parents and their ID. 1891 01:36:45,780 --> 01:36:47,780 And then after the merges 1892 01:36:47,780 --> 01:36:51,780 comes eventually the individual tokens 1893 01:36:51,780 --> 01:36:52,780 and their IDs. 1894 01:36:52,780 --> 01:36:54,780 And so these are the individual tokens. 1895 01:36:54,780 --> 01:36:57,780 So these are the individual code point tokens, 1896 01:36:57,780 --> 01:36:58,780 if you will, 1897 01:36:58,780 --> 01:36:59,780 and they come at the end. 
1898 01:36:59,780 --> 01:37:00,780 So that is the ordering 1899 01:37:00,780 --> 01:37:01,780 with which sentence piece 1900 01:37:01,780 --> 01:37:03,780 sort of like represents its vocabularies. 1901 01:37:03,780 --> 01:37:05,780 It starts with special tokens, 1902 01:37:05,780 --> 01:37:06,780 then the byte tokens, 1903 01:37:06,780 --> 01:37:07,780 then the merge tokens, 1904 01:37:07,780 --> 01:37:10,780 and then the individual code point tokens. 1905 01:37:10,780 --> 01:37:13,780 And all these raw code point tokens 1906 01:37:13,780 --> 01:37:14,780 are the ones that it encountered 1907 01:37:14,780 --> 01:37:16,780 in the training set. 1908 01:37:16,780 --> 01:37:18,780 So those individual code points 1909 01:37:18,780 --> 01:37:21,780 are all the entire set of code points 1910 01:37:21,780 --> 01:37:24,780 that occurred here. 1911 01:37:24,780 --> 01:37:26,780 So those all get put in there. 1912 01:37:26,780 --> 01:37:28,780 And then those are extremely rare 1913 01:37:28,780 --> 01:37:30,780 as determined by character coverage. 1914 01:37:30,780 --> 01:37:31,780 So if a code point occurred 1915 01:37:31,780 --> 01:37:32,780 only a single time 1916 01:37:32,780 --> 01:37:34,780 out of like a million sentences 1917 01:37:34,780 --> 01:37:35,780 or something like that, 1918 01:37:35,780 --> 01:37:37,780 then it would be ignored. 1919 01:37:37,780 --> 01:37:41,780 And it would not be added to our vocabulary. 1920 01:37:41,780 --> 01:37:42,780 Once we have a vocabulary, 1921 01:37:42,780 --> 01:37:44,780 we can encode into IDs 1922 01:37:44,780 --> 01:37:47,780 and we can sort of get a list. 1923 01:37:47,780 --> 01:37:48,780 And then here, 1924 01:37:48,780 --> 01:37:52,780 I am also decoding the individual tokens 1925 01:37:52,780 --> 01:37:54,780 back into little pieces, 1926 01:37:54,780 --> 01:37:55,780 as they call it. 1927 01:37:55,780 --> 01:37:58,780 So let's take a look at what happened here. 1928 01:37:58,780 --> 01:37:59,780 Hello, space, 1929 01:37:59,780 --> 01:38:01,780 Annyeonghaseyo. 1930 01:38:01,780 --> 01:38:04,780 So these are the token IDs we got back. 1931 01:38:04,780 --> 01:38:06,780 And when we look here, 1932 01:38:06,780 --> 01:38:07,780 a few things 1933 01:38:07,780 --> 01:38:10,780 sort of jump to mind. 1934 01:38:10,780 --> 01:38:11,780 Number one, 1935 01:38:11,780 --> 01:38:13,780 take a look at these characters. 1936 01:38:13,780 --> 01:38:14,780 The Korean characters, of course, 1937 01:38:14,780 --> 01:38:16,780 were not part of the training set. 1938 01:38:16,780 --> 01:38:18,780 So sentence piece is encountering code points 1939 01:38:18,780 --> 01:38:21,780 that it has not seen during training time. 1940 01:38:21,780 --> 01:38:23,780 And those code points do not have 1941 01:38:23,780 --> 01:38:25,780 a token associated with them. 1942 01:38:25,780 --> 01:38:27,780 So suddenly these are unk tokens, 1943 01:38:27,780 --> 01:38:29,780 unknown tokens. 1944 01:38:29,780 --> 01:38:31,780 But because byte fallback is true, 1945 01:38:31,780 --> 01:38:32,780 instead, 1946 01:38:32,780 --> 01:38:35,780 sentence piece falls back to bytes. 1947 01:38:35,780 --> 01:38:36,780 And so it takes this, 1948 01:38:36,780 --> 01:38:38,780 it encodes it with UTF-8, 1949 01:38:38,780 --> 01:38:41,780 and then it uses these tokens 1950 01:38:41,780 --> 01:38:43,780 to represent those bytes. 1951 01:38:43,780 --> 01:38:46,780 And that's what we are getting sort of here. 
1952 01:38:46,780 --> 01:38:49,780 This is the UTF-8 encoding, 1953 01:38:49,780 --> 01:38:51,780 and it is shifted by three 1954 01:38:51,780 --> 01:38:55,780 because of these special tokens here 1955 01:38:55,780 --> 01:38:57,780 that have IDs earlier on. 1956 01:38:57,780 --> 01:38:59,780 So that's what happened here. 1957 01:38:59,780 --> 01:39:01,780 Now, one more thing that, 1958 01:39:01,780 --> 01:39:03,780 well, first before I go on, 1959 01:39:03,780 --> 01:39:05,780 with respect to the byte fallback, 1960 01:39:05,780 --> 01:39:08,780 let me remove byte fallback. 1961 01:39:08,780 --> 01:39:09,780 If this is false, 1962 01:39:09,780 --> 01:39:10,780 what's going to happen? 1963 01:39:10,780 --> 01:39:12,780 Let's retrain. 1964 01:39:12,780 --> 01:39:13,780 So the first thing that happened is 1965 01:39:13,780 --> 01:39:16,780 all of the byte tokens disappeared, right? 1966 01:39:16,780 --> 01:39:17,780 And now we just have the merges, 1967 01:39:17,780 --> 01:39:19,780 and we have a lot more merges now 1968 01:39:19,780 --> 01:39:20,780 because we have a lot more space 1969 01:39:20,780 --> 01:39:22,780 because we're not taking up space 1970 01:39:22,780 --> 01:39:25,780 in the vocab size with all the bytes. 1971 01:39:25,780 --> 01:39:28,780 And now if we encode this, 1972 01:39:28,780 --> 01:39:30,780 we get a zero. 1973 01:39:30,780 --> 01:39:32,780 So this entire string here, 1974 01:39:32,780 --> 01:39:34,780 suddenly there's no byte fallback. 1975 01:39:34,780 --> 01:39:36,780 So this is unknown, 1976 01:39:36,780 --> 01:39:38,780 and unknown is unk. 1977 01:39:38,780 --> 01:39:40,780 And so this is zero 1978 01:39:40,780 --> 01:39:43,780 because the unk token is token zero. 1979 01:39:43,780 --> 01:39:44,780 And you have to keep in mind 1980 01:39:44,780 --> 01:39:47,780 that this would feed into your language model. 1981 01:39:47,780 --> 01:39:48,780 So what is the language model supposed to do 1982 01:39:48,780 --> 01:39:50,780 when all kinds of different things 1983 01:39:50,780 --> 01:39:52,780 that are unrecognized because they're rare 1984 01:39:52,780 --> 01:39:54,780 just end up mapping into unk? 1985 01:39:54,780 --> 01:39:56,780 It's not exactly the property that you want. 1986 01:39:56,780 --> 01:39:58,780 So that's why I think Lama correctly 1987 01:39:58,780 --> 01:40:01,780 used byte fallback true 1988 01:40:01,780 --> 01:40:03,780 because we definitely want to feed these 1989 01:40:03,780 --> 01:40:05,780 unknown or rare code points 1990 01:40:05,780 --> 01:40:07,780 into the model in some manner. 1991 01:40:07,780 --> 01:40:10,780 The next thing I want to show you is the following. 1992 01:40:10,780 --> 01:40:12,780 Notice here when we are decoding 1993 01:40:12,780 --> 01:40:14,780 all the individual tokens. 1994 01:40:14,780 --> 01:40:17,780 You see how spaces, space here, 1995 01:40:17,780 --> 01:40:20,780 ends up being this bold underline. 1996 01:40:20,780 --> 01:40:21,780 I'm not 100% sure, by the way, 1997 01:40:21,780 --> 01:40:23,780 why sentence piece switches white space 1998 01:40:23,780 --> 01:40:26,780 into these bold underscore characters. 1999 01:40:26,780 --> 01:40:27,780 Maybe it's for visualization. 2000 01:40:27,780 --> 01:40:30,780 I'm not 100% sure why that happens. 2001 01:40:30,780 --> 01:40:31,780 But notice this. 2002 01:40:31,780 --> 01:40:34,780 Why do we have an extra space 2003 01:40:34,780 --> 01:40:38,780 in the front of hello? 2004 01:40:38,780 --> 01:40:40,780 Where is this coming from? 
2005 01:40:40,780 --> 01:40:45,780 Well, it's coming from this option here. 2006 01:40:45,780 --> 01:40:47,780 Add dummy prefix is true. 2007 01:40:47,780 --> 01:40:50,780 And when you go to the documentation, 2008 01:40:50,780 --> 01:40:52,780 add dummy white space at the beginning of text 2009 01:40:52,780 --> 01:40:54,780 in order to treat world in world 2010 01:40:54,780 --> 01:40:56,780 and hello world in the exact same way. 2011 01:40:56,780 --> 01:40:59,780 So what this is trying to do is the following. 2012 01:40:59,780 --> 01:41:01,780 If we go back to our tick tokenizer, 2013 01:41:01,780 --> 01:41:05,780 world as a token by itself 2014 01:41:05,780 --> 01:41:09,780 has a different ID than space world. 2015 01:41:09,780 --> 01:41:11,780 So we have this is 1917, 2016 01:41:11,780 --> 01:41:13,780 but this is 14, etc. 2017 01:41:13,780 --> 01:41:15,780 So these are two different tokens 2018 01:41:15,780 --> 01:41:16,780 for the language model. 2019 01:41:16,780 --> 01:41:18,780 And the language model has to learn from data 2020 01:41:18,780 --> 01:41:19,780 that they are actually kind of like 2021 01:41:19,780 --> 01:41:20,780 a very similar concept. 2022 01:41:20,780 --> 01:41:23,780 So to the language model in the tick token world, 2023 01:41:23,780 --> 01:41:26,780 basically words in the beginning of sentences 2024 01:41:26,780 --> 01:41:28,780 and words in the middle of sentences 2025 01:41:28,780 --> 01:41:30,780 actually look completely different. 2026 01:41:30,780 --> 01:41:33,780 And it has learned that they are roughly the same. 2027 01:41:33,780 --> 01:41:35,780 So this add dummy prefix 2028 01:41:35,780 --> 01:41:37,780 is trying to fight that a little bit. 2029 01:41:37,780 --> 01:41:39,780 And the way that works is that 2030 01:41:39,780 --> 01:41:43,780 it basically adds a dummy prefix. 2031 01:41:43,780 --> 01:41:47,780 So as a part of preprocessing, 2032 01:41:47,780 --> 01:41:50,780 it will take the string and it will add a space. 2033 01:41:50,780 --> 01:41:52,780 It will do this. 2034 01:41:52,780 --> 01:41:54,780 And that's done in an effort 2035 01:41:54,780 --> 01:41:56,780 to make this world and that world the same. 2036 01:41:56,780 --> 01:41:58,780 They will both be space world. 2037 01:41:58,780 --> 01:42:00,780 So that's one other 2038 01:42:00,780 --> 01:42:03,780 kind of preprocessing option that is turned on. 2039 01:42:03,780 --> 01:42:06,780 And Lama2 also uses this option. 2040 01:42:06,780 --> 01:42:08,780 And that's I think everything that I want to say 2041 01:42:08,780 --> 01:42:09,780 for my preview of sentence piece 2042 01:42:09,780 --> 01:42:11,780 and how it is different. 2043 01:42:11,780 --> 01:42:13,780 Maybe here what I've done is 2044 01:42:13,780 --> 01:42:17,780 I just put in the raw protocol buffer 2045 01:42:17,780 --> 01:42:20,780 representation basically of the tokenizer 2046 01:42:20,780 --> 01:42:22,780 that Lama2 trained. 2047 01:42:22,780 --> 01:42:24,780 So feel free to sort of step through this. 2048 01:42:24,780 --> 01:42:26,780 And if you would like your tokenization 2049 01:42:26,780 --> 01:42:29,780 to look identical to that of the meta Lama2, 2050 01:42:29,780 --> 01:42:31,780 then you would be copy pasting these settings 2051 01:42:31,780 --> 01:42:33,780 as I've tried to do up above. 2052 01:42:33,780 --> 01:42:36,780 And yeah, I think that's it for this section. 
2053 01:42:36,780 --> 01:42:38,780 I think my summary for sentence piece 2054 01:42:38,780 --> 01:42:40,780 from all this is number one, 2055 01:42:40,780 --> 01:42:42,780 I think that there's a lot of historical baggage 2056 01:42:42,780 --> 01:42:43,780 in sentence piece. 2057 01:42:43,780 --> 01:42:46,780 A lot of concepts that I think are slightly confusing 2058 01:42:46,780 --> 01:42:48,780 and I think potentially contain foot guns 2059 01:42:48,780 --> 01:42:50,780 like this concept of a sentence 2060 01:42:50,780 --> 01:42:52,780 and its maximum length and stuff like that. 2061 01:42:52,780 --> 01:42:56,780 Otherwise, it is fairly commonly used in the industry 2062 01:42:56,780 --> 01:42:58,780 because it is efficient and can do both training and training. 2063 01:42:58,780 --> 01:43:00,780 and can do both training and inference. 2064 01:43:00,780 --> 01:43:01,780 It has a few quirks. 2065 01:43:01,780 --> 01:43:02,780 Like for example, 2066 01:43:02,780 --> 01:43:03,780 unktoken must exist 2067 01:43:03,780 --> 01:43:05,780 and the way the byte fallbacks are done and so on 2068 01:43:05,780 --> 01:43:07,780 I don't find particularly elegant. 2069 01:43:07,780 --> 01:43:08,780 And unfortunately, I have to say 2070 01:43:08,780 --> 01:43:09,780 it's not very well documented. 2071 01:43:09,780 --> 01:43:13,780 So it took me a lot of time working with this myself 2072 01:43:13,780 --> 01:43:15,780 and just visualizing things 2073 01:43:15,780 --> 01:43:17,780 and try to really understand what is happening here 2074 01:43:17,780 --> 01:43:19,780 because the documentation unfortunately 2075 01:43:19,780 --> 01:43:21,780 is in my opinion not super amazing. 2076 01:43:21,780 --> 01:43:23,780 But it is a very nice repo 2077 01:43:23,780 --> 01:43:25,780 that is available to you 2078 01:43:25,780 --> 01:43:27,780 if you'd like to train your own tokenizer right now. 2079 01:43:27,780 --> 01:43:28,780 Okay. 2080 01:43:28,780 --> 01:43:29,780 I'll switch gears again 2081 01:43:29,780 --> 01:43:31,780 as we're starting to slowly wrap up here. 2082 01:43:31,780 --> 01:43:33,780 I want to revisit this issue in a bit more detail 2083 01:43:33,780 --> 01:43:35,780 of how we should set the vocab size 2084 01:43:35,780 --> 01:43:37,780 and what are some of the considerations around it. 2085 01:43:37,780 --> 01:43:39,780 So for this, 2086 01:43:39,780 --> 01:43:41,780 I'd like to go back to the model architecture 2087 01:43:41,780 --> 01:43:43,780 that we developed in the last video 2088 01:43:43,780 --> 01:43:45,780 when we built the GPT from scratch. 2089 01:43:45,780 --> 01:43:48,780 So this here was the file that we built in the previous video 2090 01:43:48,780 --> 01:43:50,780 and we defined the transformer model 2091 01:43:50,780 --> 01:43:52,780 and let's specifically look at vocab size 2092 01:43:52,780 --> 01:43:54,780 and where it appears in this file. 2093 01:43:54,780 --> 01:43:56,780 So here we define the vocab size. 2094 01:43:56,780 --> 01:43:57,780 At this time, 2095 01:43:57,780 --> 01:43:59,780 it was 65 or something like that, 2096 01:43:59,780 --> 01:44:00,780 extremely small number. 2097 01:44:00,780 --> 01:44:02,780 So this will grow much larger. 2098 01:44:02,780 --> 01:44:04,780 You'll see that vocab size doesn't come up too much 2099 01:44:04,780 --> 01:44:05,780 in most of these layers. 2100 01:44:05,780 --> 01:44:07,780 The only place that it comes up to 2101 01:44:07,780 --> 01:44:10,780 is in exactly these two places here. 
2102 01:44:10,780 --> 01:44:12,780 So when we define the language model, 2103 01:44:12,780 --> 01:44:14,780 there's the token embedding table 2104 01:44:14,780 --> 01:44:16,780 which is this two-dimensional array 2105 01:44:16,780 --> 01:44:19,780 where the vocab size is basically the number of rows 2106 01:44:19,780 --> 01:44:22,780 and each vocabulary element, 2107 01:44:22,780 --> 01:44:24,780 each token has a vector 2108 01:44:24,780 --> 01:44:26,780 that we're going to train using backpropagation. 2109 01:44:26,780 --> 01:44:28,780 That vector is of size and embed, 2110 01:44:28,780 --> 01:44:30,780 which is number of channels in the transformer. 2111 01:44:30,780 --> 01:44:31,780 And basically, 2112 01:44:31,780 --> 01:44:32,780 as vocab size increases, 2113 01:44:32,780 --> 01:44:33,780 this embedding table, 2114 01:44:33,780 --> 01:44:34,780 as I mentioned earlier, 2115 01:44:34,780 --> 01:44:35,780 is going to also grow. 2116 01:44:35,780 --> 01:44:37,780 We're going to be adding rows. 2117 01:44:37,780 --> 01:44:38,780 In addition to that, 2118 01:44:38,780 --> 01:44:40,780 at the end of the transformer, 2119 01:44:40,780 --> 01:44:41,780 there's this LM head layer, 2120 01:44:41,780 --> 01:44:43,780 which is a linear layer. 2121 01:44:43,780 --> 01:44:45,780 And you'll notice that that layer is used 2122 01:44:45,780 --> 01:44:47,780 at the very end to produce the logits, 2123 01:44:47,780 --> 01:44:49,780 which become the probabilities 2124 01:44:49,780 --> 01:44:50,780 for the next token in a sequence. 2125 01:44:50,780 --> 01:44:51,780 And so intuitively, 2126 01:44:51,780 --> 01:44:53,780 we're trying to produce a probability 2127 01:44:53,780 --> 01:44:55,780 for every single token 2128 01:44:55,780 --> 01:44:56,780 that might come next 2129 01:44:56,780 --> 01:44:59,780 at every point in time of that transformer. 2130 01:44:59,780 --> 01:45:01,780 And if we have more and more tokens, 2131 01:45:01,780 --> 01:45:03,780 we need to produce more and more probabilities. 2132 01:45:03,780 --> 01:45:04,780 So every single token 2133 01:45:04,780 --> 01:45:06,780 is going to introduce an additional dot product 2134 01:45:06,780 --> 01:45:09,780 that we have to do here in this linear layer 2135 01:45:09,780 --> 01:45:11,780 for this final layer in the transformer. 2136 01:45:11,780 --> 01:45:14,780 So why can't vocab size be infinite? 2137 01:45:14,780 --> 01:45:15,780 Why can't we grow to infinity? 2138 01:45:15,780 --> 01:45:16,780 Well, number one, 2139 01:45:16,780 --> 01:45:19,780 your token embedding table is going to grow. 2140 01:45:19,780 --> 01:45:22,780 Your linear layer is going to grow. 2141 01:45:22,780 --> 01:45:24,780 So we're going to be doing a lot more computation here 2142 01:45:24,780 --> 01:45:25,780 because this LM head layer 2143 01:45:25,780 --> 01:45:27,780 will become more competitionally expensive. 2144 01:45:27,780 --> 01:45:29,780 Number two, because we have more parameters, 2145 01:45:29,780 --> 01:45:31,780 we could be worried that we are going to be 2146 01:45:31,780 --> 01:45:34,780 under-training some of these parameters. 
2147 01:45:34,780 --> 01:45:36,780 So intuitively, 2148 01:45:36,780 --> 01:45:37,780 if you have a very large vocabulary size, 2149 01:45:37,780 --> 01:45:39,780 say we have a million tokens, 2150 01:45:39,780 --> 01:45:41,780 then every one of these tokens 2151 01:45:41,780 --> 01:45:43,780 is going to come up more and more rarely 2152 01:45:43,780 --> 01:45:44,780 in the training data 2153 01:45:44,780 --> 01:45:45,780 because there's a lot more other tokens 2154 01:45:45,780 --> 01:45:46,780 all over the place. 2155 01:45:46,780 --> 01:45:49,780 And so we're going to be seeing fewer and fewer examples 2156 01:45:49,780 --> 01:45:51,780 for each individual token. 2157 01:45:51,780 --> 01:45:53,780 And you might be worried that basically 2158 01:45:53,780 --> 01:45:54,780 the vectors 2159 01:45:54,780 --> 01:45:55,780 associated with every token 2160 01:45:55,780 --> 01:45:57,780 will be under-trained as a result 2161 01:45:57,780 --> 01:45:59,780 because they just don't come up too often 2162 01:45:59,780 --> 01:46:01,780 and they don't participate in the forward-backward pass. 2163 01:46:01,780 --> 01:46:02,780 In addition to that, 2164 01:46:02,780 --> 01:46:04,780 as your vocab size grows, 2165 01:46:04,780 --> 01:46:07,780 you're going to start shrinking your sequences a lot, right? 2166 01:46:07,780 --> 01:46:08,780 And that's really nice because 2167 01:46:08,780 --> 01:46:10,780 that means that we're going to be attending 2168 01:46:10,780 --> 01:46:11,780 to more and more text. 2169 01:46:11,780 --> 01:46:12,780 So that's nice. 2170 01:46:12,780 --> 01:46:13,780 But also you might be worrying 2171 01:46:13,780 --> 01:46:15,780 that too large of chunks 2172 01:46:15,780 --> 01:46:17,780 are being squished into single tokens. 2173 01:46:17,780 --> 01:46:19,780 And so the model just doesn't have 2174 01:46:19,780 --> 01:46:21,780 as much sort of time to think 2175 01:46:21,780 --> 01:46:23,780 per sort of 2176 01:46:23,780 --> 01:46:25,780 some number of characters in a text, 2177 01:46:25,780 --> 01:46:27,780 or you can think about it that way, right? 2178 01:46:27,780 --> 01:46:29,780 So basically we're squishing too much information 2179 01:46:29,780 --> 01:46:30,780 into a single token 2180 01:46:30,780 --> 01:46:32,780 and then the forward pass of the transformer 2181 01:46:32,780 --> 01:46:33,780 is not enough to actually process 2182 01:46:33,780 --> 01:46:35,780 that information appropriately. 2183 01:46:35,780 --> 01:46:36,780 And so these are some of the considerations 2184 01:46:36,780 --> 01:46:37,780 you should be thinking about 2185 01:46:37,780 --> 01:46:39,780 when you're designing the vocab size. 2186 01:46:39,780 --> 01:46:40,780 As I mentioned, this is mostly 2187 01:46:40,780 --> 01:46:41,780 an empirical hyperparameter. 2188 01:46:41,780 --> 01:46:42,780 And it seems like 2189 01:46:42,780 --> 01:46:44,780 in state-of-the-art architectures today, 2190 01:46:44,780 --> 01:46:46,780 this is usually in the high 10,000s 2191 01:46:46,780 --> 01:46:48,780 or somewhere around 100,000 today. 2192 01:46:48,780 --> 01:46:49,780 And the next consideration 2193 01:46:49,780 --> 01:46:51,780 I want to briefly talk about is 2194 01:46:51,780 --> 01:46:53,780 what if we want to take a pre-trained model 2195 01:46:53,780 --> 01:46:55,780 and we want to extend the vocab size? 2196 01:46:55,780 --> 01:46:57,780 And this is done fairly commonly actually.
2197 01:46:57,780 --> 01:46:58,780 So for example, 2198 01:46:58,780 --> 01:47:00,780 when you're doing fine-tuning for ChatGPT, 2199 01:47:00,780 --> 01:47:02,780 a lot more new special tokens 2200 01:47:02,780 --> 01:47:04,780 get introduced on top of the base model 2201 01:47:04,780 --> 01:47:06,780 to maintain the metadata 2202 01:47:06,780 --> 01:47:09,780 and all the structure of conversation objects 2203 01:47:09,780 --> 01:47:10,780 between the user and the system. 2204 01:47:10,780 --> 01:47:12,780 So that takes a lot of special tokens. 2205 01:47:12,780 --> 01:47:15,780 You might also try to throw in more special tokens, 2206 01:47:15,780 --> 01:47:16,780 for example, for using the browser 2207 01:47:16,780 --> 01:47:17,780 or any other tool. 2208 01:47:17,780 --> 01:47:20,780 And so it's very tempting to add a lot of tokens 2209 01:47:20,780 --> 01:47:22,780 for all kinds of special functionality. 2210 01:47:22,780 --> 01:47:24,780 So if you want to be adding a token, 2211 01:47:24,780 --> 01:47:25,780 that's totally possible, right? 2212 01:47:25,780 --> 01:47:28,780 All we have to do is we have to resize this embedding. 2213 01:47:28,780 --> 01:47:30,780 So we have to add rows. 2214 01:47:30,780 --> 01:47:32,780 We would initialize these parameters from scratch, 2215 01:47:32,780 --> 01:47:34,780 which would be small random numbers. 2216 01:47:34,780 --> 01:47:36,780 And then we have to extend the weight 2217 01:47:36,780 --> 01:47:38,780 inside this linear layer. 2218 01:47:38,780 --> 01:47:40,780 So we have to start making dot products 2219 01:47:40,780 --> 01:47:42,780 with the associated parameters as well 2220 01:47:42,780 --> 01:47:44,780 to basically calculate the probabilities 2221 01:47:44,780 --> 01:47:45,780 for these new tokens. 2222 01:47:45,780 --> 01:47:48,780 So both of these are just resizing operations. 2223 01:47:48,780 --> 01:47:50,780 It's a very mild model surgery 2224 01:47:50,780 --> 01:47:51,780 and can be done fairly easily. 2225 01:47:51,780 --> 01:47:53,780 And it's quite common that basically 2226 01:47:53,780 --> 01:47:54,780 you would freeze the base model. 2227 01:47:54,780 --> 01:47:56,780 You introduce these new parameters 2228 01:47:56,780 --> 01:47:58,780 and then you only train these new parameters 2229 01:47:58,780 --> 01:48:00,780 to introduce new tokens into the architecture. 2230 01:48:00,780 --> 01:48:03,780 And so you can freeze arbitrary parts of it 2231 01:48:03,780 --> 01:48:05,780 or you can train arbitrary parts of it. 2232 01:48:05,780 --> 01:48:06,780 And that's totally up to you. 2233 01:48:06,780 --> 01:48:08,780 But basically minor surgery is required 2234 01:48:08,780 --> 01:48:10,780 if you'd like to introduce new tokens. 2235 01:48:10,780 --> 01:48:12,780 And finally, I'd like to mention that actually 2236 01:48:12,780 --> 01:48:14,780 there's an entire design space of applications 2237 01:48:14,780 --> 01:48:17,780 in terms of introducing new tokens into a vocabulary 2238 01:48:17,780 --> 01:48:19,780 that go way beyond just adding special tokens 2239 01:48:19,780 --> 01:48:20,780 and special new functionality. 2240 01:48:20,780 --> 01:48:23,780 So just to give you a sense of the design space, 2241 01:48:23,780 --> 01:48:25,780 but this could be an entire video just by itself, 2242 01:48:25,780 --> 01:48:28,780 this is a paper on learning to compress prompts 2243 01:48:28,780 --> 01:48:30,780 with what they called GIST tokens.
2244 01:48:30,780 --> 01:48:32,780 And the rough idea is, 2245 01:48:32,780 --> 01:48:34,780 suppose that you're using language models 2246 01:48:34,780 --> 01:48:36,780 in a setting that requires very long prompts. 2247 01:48:36,780 --> 01:48:38,780 Well, these long prompts just slow everything down 2248 01:48:38,780 --> 01:48:39,780 because you have to encode them 2249 01:48:39,780 --> 01:48:40,780 and then you have to use them 2250 01:48:40,780 --> 01:48:42,780 and then you're attending over them 2251 01:48:42,780 --> 01:48:45,780 and it's just heavy to have very large prompts. 2252 01:48:45,780 --> 01:48:48,780 So instead, what they do here in this paper 2253 01:48:48,780 --> 01:48:50,780 is they introduce 2254 01:48:50,780 --> 01:48:52,780 new tokens. 2255 01:48:52,780 --> 01:48:55,780 And imagine basically having a few new tokens, 2256 01:48:55,780 --> 01:48:57,780 you put them in a sequence, 2257 01:48:57,780 --> 01:49:00,780 and then you train the model by distillation. 2258 01:49:00,780 --> 01:49:02,780 So you are keeping the entire model frozen 2259 01:49:02,780 --> 01:49:04,780 and you're only training the representations 2260 01:49:04,780 --> 01:49:06,780 of the new tokens, their embeddings, 2261 01:49:06,780 --> 01:49:08,780 and you're optimizing over the new tokens 2262 01:49:08,780 --> 01:49:10,780 such that the behavior of the language model 2263 01:49:10,780 --> 01:49:14,780 is identical to the model 2264 01:49:14,780 --> 01:49:17,780 that has a very long prompt that works for you. 2265 01:49:17,780 --> 01:49:18,780 And so it's a compression technique 2266 01:49:18,780 --> 01:49:20,780 of compressing that very long prompt 2267 01:49:20,780 --> 01:49:22,780 into those few new gist tokens. 2268 01:49:22,780 --> 01:49:24,780 And so you can train this and then at test time 2269 01:49:24,780 --> 01:49:25,780 you can discard your old prompt 2270 01:49:25,780 --> 01:49:27,780 and just swap in those tokens 2271 01:49:27,780 --> 01:49:29,780 and they sort of like stand in 2272 01:49:29,780 --> 01:49:30,780 for that very long prompt 2273 01:49:30,780 --> 01:49:32,780 and have an almost identical performance. 2274 01:49:32,780 --> 01:49:35,780 And so this is one technique 2275 01:49:35,780 --> 01:49:38,780 in a class of parameter-efficient fine-tuning techniques 2276 01:49:38,780 --> 01:49:40,780 where most of the model is basically fixed 2277 01:49:40,780 --> 01:49:42,780 and there's no training of the model weights, 2278 01:49:42,780 --> 01:49:44,780 there's no training of LoRA or anything like that 2279 01:49:44,780 --> 01:49:45,780 in terms of new parameters. 2280 01:49:45,780 --> 01:49:47,780 The parameters that you're training 2281 01:49:47,780 --> 01:49:49,780 are now just the token embeddings. 2282 01:49:49,780 --> 01:49:51,780 So that's just one example, 2283 01:49:51,780 --> 01:49:53,780 but this could again be like an entire video, 2284 01:49:53,780 --> 01:49:54,780 but just to give you a sense 2285 01:49:54,780 --> 01:49:55,780 that there's a whole design space here 2286 01:49:55,780 --> 01:49:57,780 that is potentially worth exploring in the future.
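To make the model surgery and the freeze-everything-else recipe concrete, here is a rough sketch in PyTorch; all of the names and sizes are illustrative and this is not the gist paper's actual code.

```python
# A rough sketch (PyTorch; illustrative only) of extending the vocabulary of a
# "pretrained" embedding/LM head and then training only the newly added rows.
import torch
import torch.nn as nn

old_vocab, n_new, n_embd = 50257, 8, 64
emb = nn.Embedding(old_vocab, n_embd)               # stand-in for pretrained embeddings
lm_head = nn.Linear(n_embd, old_vocab, bias=False)  # stand-in for pretrained output layer

# model surgery: add rows for the new tokens, copy the old weights over,
# and leave the new rows at their small random initialization
new_emb = nn.Embedding(old_vocab + n_new, n_embd)
new_head = nn.Linear(n_embd, old_vocab + n_new, bias=False)
with torch.no_grad():
    new_emb.weight[:old_vocab] = emb.weight
    new_head.weight[:old_vocab] = lm_head.weight

# weight_decay=0 so the frozen rows are not touched by the regularizer
optimizer = torch.optim.AdamW([new_emb.weight, new_head.weight], lr=1e-3, weight_decay=0.0)
idx = torch.randint(old_vocab, old_vocab + n_new, (4, 16))  # toy batch containing the new tokens
loss = new_head(new_emb(idx)).pow(2).mean()                 # stand-in for a real LM loss
loss.backward()
# "freeze" the original rows by zeroing their gradients before the update
new_emb.weight.grad[:old_vocab] = 0
new_head.weight.grad[:old_vocab] = 0
optimizer.step()
```

In a real setup you would of course use the full transformer and a proper language-modeling loss; the point is just that the operation is a resize plus a choice of which rows receive updates.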
2287 01:49:57,780 --> 01:49:59,780 The next thing I want to briefly address 2288 01:49:59,780 --> 01:50:01,780 is that I think recently there's a lot of momentum 2289 01:50:01,780 --> 01:50:04,780 in how you actually could construct transformers 2290 01:50:04,780 --> 01:50:05,780 that can simultaneously process 2291 01:50:05,780 --> 01:50:07,780 not just text as the input modality, 2292 01:50:07,780 --> 01:50:09,780 but a lot of other modalities. 2293 01:50:09,780 --> 01:50:12,780 So be it images, videos, audio, etc. 2294 01:50:12,780 --> 01:50:14,780 And how do you feed in all these modalities 2295 01:50:14,780 --> 01:50:16,780 and potentially predict these modalities 2296 01:50:16,780 --> 01:50:18,780 from a transformer? 2297 01:50:18,780 --> 01:50:19,780 Do you have to change the architecture 2298 01:50:19,780 --> 01:50:20,780 in some fundamental way? 2299 01:50:20,780 --> 01:50:21,780 And I think what a lot of people 2300 01:50:21,780 --> 01:50:22,780 are starting to converge towards 2301 01:50:22,780 --> 01:50:24,780 is that you're not changing the architecture, 2302 01:50:24,780 --> 01:50:25,780 you stick with the transformer, 2303 01:50:25,780 --> 01:50:28,780 you just kind of tokenize your input domains 2304 01:50:28,780 --> 01:50:29,780 and then call it a day 2305 01:50:29,780 --> 01:50:30,780 and pretend it's just text tokens 2306 01:50:30,780 --> 01:50:34,780 and just do everything else in an identical manner. 2307 01:50:34,780 --> 01:50:35,780 So here, for example, 2308 01:50:35,780 --> 01:50:37,780 there was an early paper that has a nice graphic 2309 01:50:37,780 --> 01:50:38,780 for how you can take an image 2310 01:50:38,780 --> 01:50:42,780 and you can chunk it up into integers. 2311 01:50:42,780 --> 01:50:44,780 And these sometimes... 2312 01:50:44,780 --> 01:50:45,780 So these would basically become 2313 01:50:45,780 --> 01:50:48,780 the tokens of images, as an example. 2314 01:50:48,780 --> 01:50:51,780 And these tokens can be hard tokens 2315 01:50:51,780 --> 01:50:53,780 where you force them to be integers. 2316 01:50:53,780 --> 01:50:55,780 They can also be soft tokens 2317 01:50:55,780 --> 01:50:58,780 where you sort of don't require 2318 01:50:58,780 --> 01:51:00,780 these to be discrete, 2319 01:51:00,780 --> 01:51:02,780 but you do force these representations 2320 01:51:02,780 --> 01:51:03,780 to go through bottlenecks, 2321 01:51:03,780 --> 01:51:05,780 like in autoencoders. 2322 01:51:05,780 --> 01:51:07,780 Also in this paper that came out from OpenAI, 2323 01:51:07,780 --> 01:51:10,780 Sora, which I think really 2324 01:51:10,780 --> 01:51:12,780 blew the minds of many people 2325 01:51:12,780 --> 01:51:13,780 and inspired a lot of people 2326 01:51:13,780 --> 01:51:14,780 in terms of what's possible, 2327 01:51:14,780 --> 01:51:15,780 they have a graphic here 2328 01:51:15,780 --> 01:51:17,780 and they talk briefly about how 2329 01:51:17,780 --> 01:51:19,780 LLMs have text tokens, 2330 01:51:19,780 --> 01:51:21,780 Sora has visual patches. 2331 01:51:21,780 --> 01:51:22,780 So again, they came up with a way 2332 01:51:22,780 --> 01:51:25,780 to chunk videos into basically tokens 2333 01:51:25,780 --> 01:51:27,780 with their own vocabularies. 2334 01:51:27,780 --> 01:51:29,780 And then you can either process discrete tokens, 2335 01:51:29,780 --> 01:51:30,780 say, with autoregressive models 2336 01:51:30,780 --> 01:51:33,780 or even soft tokens with diffusion models.
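To give a feel for what those hard tokens are, here is a tiny sketch of the vector-quantization step; it is purely illustrative and not the code of any particular paper.

```python
# A tiny sketch of "hard tokens" for images: snap continuous patch embeddings to
# the nearest entry of a codebook, so every patch becomes an integer token id.
# (Illustrative only; VQ-VAE-style models add an encoder, a decoder, and extra
# training losses around this step.)
import torch

codebook = torch.randn(1024, 64)        # 1024 possible "image tokens", 64 dims each
patches = torch.randn(16, 64)           # continuous embeddings for 16 image patches

dists = torch.cdist(patches, codebook)  # (16, 1024) distances to every codebook entry
image_tokens = dists.argmin(dim=1)      # (16,) integer ids in [0, 1024)
print(image_tokens)                     # these ids can be fed to a transformer just like text tokens
```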
2337 01:51:33,780 --> 01:51:36,780 And all of that is sort of 2338 01:51:36,780 --> 01:51:38,780 being actively worked on, designed on, 2339 01:51:38,780 --> 01:51:39,780 and it's beyond the scope of this video, 2340 01:51:39,780 --> 01:51:41,780 but just something I wanted to mention briefly. 2341 01:51:41,780 --> 01:51:43,780 Okay, now that we have gone quite deep 2342 01:51:43,780 --> 01:51:45,780 into the tokenization algorithm 2343 01:51:45,780 --> 01:51:46,780 and we understand a lot more 2344 01:51:46,780 --> 01:51:47,780 about how it works, 2345 01:51:47,780 --> 01:51:48,780 let's loop back around 2346 01:51:48,780 --> 01:51:49,780 to the beginning of this video 2347 01:51:49,780 --> 01:51:51,780 and go through some of these bullet points 2348 01:51:51,780 --> 01:51:53,780 and really see why they happen. 2349 01:51:53,780 --> 01:51:54,780 So first of all, 2350 01:51:54,780 --> 01:51:57,780 why can't my LLM spell words very well 2351 01:51:57,780 --> 01:52:00,780 or do other spelling-related tasks? 2352 01:52:00,780 --> 01:52:02,780 So fundamentally, this is because, 2353 01:52:02,780 --> 01:52:04,780 as we saw, these characters 2354 01:52:04,780 --> 01:52:06,780 are chunked up into tokens 2355 01:52:06,780 --> 01:52:07,780 and some of these tokens 2356 01:52:07,780 --> 01:52:09,780 are actually fairly long. 2357 01:52:09,780 --> 01:52:10,780 So as an example, 2358 01:52:10,780 --> 01:52:12,780 I went to the GPT-4 vocabulary 2359 01:52:12,780 --> 01:52:14,780 and I looked at one of the longer tokens. 2360 01:52:14,780 --> 01:52:16,780 So .DefaultCellStyle 2361 01:52:16,780 --> 01:52:18,780 turns out to be a single individual token. 2362 01:52:18,780 --> 01:52:19,780 So that's a lot of characters 2363 01:52:19,780 --> 01:52:20,780 for a single token. 2364 01:52:20,780 --> 01:52:22,780 So my suspicion is that 2365 01:52:22,780 --> 01:52:23,780 there's just too much crammed 2366 01:52:23,780 --> 01:52:24,780 into this single token. 2367 01:52:24,780 --> 01:52:26,780 And my suspicion was that 2368 01:52:26,780 --> 01:52:27,780 the model should not be very good 2369 01:52:27,780 --> 01:52:30,780 at tasks related to spelling 2370 01:52:30,780 --> 01:52:33,780 of this single token. 2371 01:52:33,780 --> 01:52:34,780 So I asked, 2372 01:52:34,780 --> 01:52:35,780 how many letters L 2373 01:52:35,780 --> 01:52:38,780 are there in the word .DefaultCellStyle? 2374 01:52:38,780 --> 01:52:39,780 And of course, 2375 01:52:39,780 --> 01:52:43,780 my prompt is intentionally done that way. 2376 01:52:43,780 --> 01:52:44,780 And you see how .DefaultCellStyle 2377 01:52:44,780 --> 01:52:45,780 will be a single token. 2378 01:52:45,780 --> 01:52:47,780 So this is what the model sees. 2379 01:52:47,780 --> 01:52:48,780 So my suspicion is that 2380 01:52:48,780 --> 01:52:49,780 it wouldn't be very good at this. 2381 01:52:49,780 --> 01:52:51,780 And indeed, it is not. 2382 01:52:51,780 --> 01:52:52,780 It doesn't actually know 2383 01:52:52,780 --> 01:52:53,780 how many Ls are in there. 2384 01:52:53,780 --> 01:52:54,780 It thinks there are three 2385 01:52:54,780 --> 01:52:56,780 and actually there are four, 2386 01:52:56,780 --> 01:52:58,780 if I'm not getting this wrong myself. 2387 01:52:58,780 --> 01:53:00,780 So that didn't go extremely well. 2388 01:53:00,780 --> 01:53:02,780 Let's look at another 2389 01:53:02,780 --> 01:53:04,780 kind of character-level task.
2390 01:53:04,780 --> 01:53:05,780 So for example, 2391 01:53:05,780 --> 01:53:07,780 here I asked GPT-4 2392 01:53:07,780 --> 01:53:10,780 to reverse the string .DefaultCellStyle 2393 01:53:10,780 --> 01:53:12,780 and it tried to use a code interpreter. 2394 01:53:12,780 --> 01:53:13,780 And I stopped it 2395 01:53:13,780 --> 01:53:14,780 and I said, just do it. 2396 01:53:14,780 --> 01:53:15,780 Just try it. 2397 01:53:15,780 --> 01:53:17,780 And it gave me a jumble. 2398 01:53:17,780 --> 01:53:20,780 So it doesn't actually really know 2399 01:53:20,780 --> 01:53:21,780 how to reverse this string 2400 01:53:21,780 --> 01:53:23,780 going from right to left. 2401 01:53:23,780 --> 01:53:25,780 So it gave the wrong result. 2402 01:53:25,780 --> 01:53:28,780 So again, like working with this working hypothesis 2403 01:53:28,780 --> 01:53:30,780 that maybe this is due to the tokenization, 2404 01:53:30,780 --> 01:53:31,780 I tried a different approach. 2405 01:53:31,780 --> 01:53:32,780 I said, okay, 2406 01:53:32,780 --> 01:53:34,780 let's reverse the exact same string, 2407 01:53:34,780 --> 01:53:36,780 but take the following approach. 2408 01:53:36,780 --> 01:53:37,780 Step one, 2409 01:53:37,780 --> 01:53:38,780 just print out every single character 2410 01:53:38,780 --> 01:53:39,780 separated by spaces. 2411 01:53:39,780 --> 01:53:40,780 And then as a step two, 2412 01:53:40,780 --> 01:53:42,780 reverse that list. 2413 01:53:42,780 --> 01:53:43,780 And it again tried to use a tool, 2414 01:53:43,780 --> 01:53:44,780 but when I 2415 01:53:44,780 --> 01:53:45,780 stopped it, 2416 01:53:45,780 --> 01:53:47,780 it first produced all the characters 2417 01:53:47,780 --> 01:53:49,780 and that was actually correct. 2418 01:53:49,780 --> 01:53:50,780 And then it reversed them 2419 01:53:50,780 --> 01:53:51,780 and that was correct 2420 01:53:51,780 --> 01:53:52,780 once it had this. 2421 01:53:52,780 --> 01:53:54,780 So somehow it can't reverse it directly. 2422 01:53:54,780 --> 01:53:56,780 But when you go just first, 2423 01:53:56,780 --> 01:53:57,780 you know, 2424 01:53:57,780 --> 01:53:58,780 listing it out in order, 2425 01:53:58,780 --> 01:53:59,780 it can do that somehow. 2426 01:53:59,780 --> 01:54:00,780 And then it can, 2427 01:54:00,780 --> 01:54:02,780 once it's broken up this way, 2428 01:54:02,780 --> 01:54:04,780 this becomes all these individual characters. 2429 01:54:04,780 --> 01:54:06,780 And so now this is much easier 2430 01:54:06,780 --> 01:54:08,780 for it to see these individual tokens 2431 01:54:08,780 --> 01:54:10,780 and reverse them and print them out. 2432 01:54:10,780 --> 01:54:13,780 So that is kind of interesting. 2433 01:54:13,780 --> 01:54:15,780 So let's continue now. 2434 01:54:15,780 --> 01:54:19,780 Why are LLMs worse at non-English languages? 2435 01:54:19,780 --> 01:54:21,780 And I briefly covered this already, 2436 01:54:21,780 --> 01:54:22,780 but basically, 2437 01:54:22,780 --> 01:54:24,780 it's not only that the language model 2438 01:54:24,780 --> 01:54:26,780 sees less non-English data 2439 01:54:26,780 --> 01:54:28,780 during training of the model parameters, 2440 01:54:28,780 --> 01:54:30,780 but also the tokenizer 2441 01:54:30,780 --> 01:54:33,780 is not sufficiently trained 2442 01:54:33,780 --> 01:54:35,780 on non-English data. 2443 01:54:35,780 --> 01:54:36,780 And so here, for example, 2444 01:54:36,780 --> 01:54:39,780 hello, how are you is five tokens 2445 01:54:39,780 --> 01:54:41,780 and its translation is 15 tokens.
2446 01:54:41,780 --> 01:54:42,780 So this is a three times blow-up 2447 01:54:42,780 --> 01:54:44,780 and so, for example, 2448 01:54:44,780 --> 01:54:45,780 안녕하세요 is just hello, 2449 01:54:45,780 --> 01:54:46,780 basically, in Korean. 2450 01:54:46,780 --> 01:54:48,780 And that ends up being three tokens. 2451 01:54:48,780 --> 01:54:49,780 I'm actually kind of surprised by that 2452 01:54:49,780 --> 01:54:51,780 because that is a very common phrase. 2453 01:54:51,780 --> 01:54:52,780 It's just a typical greeting 2454 01:54:52,780 --> 01:54:53,780 of like, hello. 2455 01:54:53,780 --> 01:54:54,780 And that ends up being three tokens, 2456 01:54:54,780 --> 01:54:56,780 whereas our hello is a single token. 2457 01:54:56,780 --> 01:54:57,780 And so basically everything 2458 01:54:57,780 --> 01:54:59,780 is a lot more bloated and diffuse. 2459 01:54:59,780 --> 01:55:00,780 And this is, I think, 2460 01:55:00,780 --> 01:55:02,780 partly the reason that the model works 2461 01:55:02,780 --> 01:55:04,780 worse on other languages. 2462 01:55:04,780 --> 01:55:05,780 Coming back, 2463 01:55:05,780 --> 01:55:08,780 why is the LLM bad at simple arithmetic? 2464 01:55:08,780 --> 01:55:09,780 That has something to do 2465 01:55:09,780 --> 01:55:10,780 with the fact that 2466 01:55:10,780 --> 01:55:11,780 numbers get chunked up 2467 01:55:11,780 --> 01:55:12,780 fairly arbitrarily. 2468 01:55:12,780 --> 01:55:13,780 And so that has to do 2469 01:55:13,780 --> 01:55:16,780 with the tokenization of numbers. 2470 01:55:16,780 --> 01:55:18,780 And so you'll notice that, 2471 01:55:18,780 --> 01:55:19,780 for example, 2472 01:55:19,780 --> 01:55:21,780 addition is very sort of like, 2473 01:55:21,780 --> 01:55:22,780 there's an algorithm 2474 01:55:22,780 --> 01:55:23,780 that is like character level 2475 01:55:23,780 --> 01:55:25,780 for doing addition. 2476 01:55:25,780 --> 01:55:26,780 So for example, 2477 01:55:26,780 --> 01:55:27,780 here we would first add the ones 2478 01:55:27,780 --> 01:55:28,780 and then the tens 2479 01:55:28,780 --> 01:55:29,780 and then the hundreds. 2480 01:55:29,780 --> 01:55:30,780 You have to refer 2481 01:55:30,780 --> 01:55:32,780 to specific parts of these digits. 2482 01:55:32,780 --> 01:55:34,780 But these numbers 2483 01:55:34,780 --> 01:55:36,780 are represented completely arbitrarily 2484 01:55:36,780 --> 01:55:37,780 based on whatever happened 2485 01:55:37,780 --> 01:55:38,780 to merge or not merge 2486 01:55:38,780 --> 01:55:40,780 during the tokenization process. 2487 01:55:40,780 --> 01:55:41,780 There's an entire blog post 2488 01:55:41,780 --> 01:55:42,780 that's been published 2489 01:55:42,780 --> 01:55:43,780 about this 2490 01:55:43,780 --> 01:55:44,780 that I think 2491 01:55:44,780 --> 01:55:45,780 is quite good, 2492 01:55:45,780 --> 01:55:46,780 called Integer tokenization is insane. 2493 01:55:46,780 --> 01:55:47,780 And this person basically 2494 01:55:47,780 --> 01:55:48,780 systematically explores 2495 01:55:48,780 --> 01:55:49,780 the tokenization of numbers 2496 01:55:49,780 --> 01:55:50,780 in, I believe, 2497 01:55:50,780 --> 01:55:51,780 this is GPT-2.
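The kind of probe that post runs looks roughly like this; a sketch assuming the tiktoken package is installed, with the GPT-2 vocabulary available there under the name "gpt2".

```python
# A sketch of probing how numbers get chunked into tokens (assumes tiktoken is
# installed; the exact splits depend entirely on the vocabulary you load).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
for n in ["1000", "2024", "3781", "9999", "123456"]:
    ids = enc.encode(n)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{n}: {len(ids)} token(s) -> {pieces}")
```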
2498 01:55:51,780 --> 01:55:52,780 And so they noticed 2499 01:55:52,780 --> 01:55:53,780 that, for example, 2500 01:55:53,780 --> 01:55:56,780 for four-digit numbers, 2501 01:55:56,780 --> 01:55:57,780 you can take a look at 2502 01:55:57,780 --> 01:55:59,780 whether it is a single token 2503 01:55:59,780 --> 01:56:01,780 or whether it is two tokens 2504 01:56:01,780 --> 01:56:02,780 that is a one-three 2505 01:56:02,780 --> 01:56:03,780 or a two-two 2506 01:56:03,780 --> 01:56:04,780 or a three-one combination. 2507 01:56:04,780 --> 01:56:05,780 And so all the different numbers 2508 01:56:05,780 --> 01:56:07,780 are all the different combinations. 2509 01:56:07,780 --> 01:56:08,780 And you can imagine 2510 01:56:08,780 --> 01:56:10,780 this is all completely arbitrarily so. 2511 01:56:10,780 --> 01:56:11,780 And the model, unfortunately, 2512 01:56:11,780 --> 01:56:14,780 sometimes sees a token 2513 01:56:14,780 --> 01:56:15,780 for all four digits, 2514 01:56:15,780 --> 01:56:16,780 sometimes for three, 2515 01:56:16,780 --> 01:56:17,780 sometimes for two, 2516 01:56:17,780 --> 01:56:18,780 sometimes for one. 2517 01:56:18,780 --> 01:56:21,780 And it's in an arbitrary manner. 2518 01:56:21,780 --> 01:56:22,780 And so this is definitely 2519 01:56:22,780 --> 01:56:24,780 a headwind, if you will, 2520 01:56:24,780 --> 01:56:25,780 for the language model. 2521 01:56:25,780 --> 01:56:26,780 And it's kind of incredible 2522 01:56:26,780 --> 01:56:27,780 that it can kind of do it 2523 01:56:27,780 --> 01:56:28,780 and deal with it. 2524 01:56:28,780 --> 01:56:30,780 But it's also kind of not ideal. 2525 01:56:30,780 --> 01:56:31,780 And so that's why, for example, 2526 01:56:31,780 --> 01:56:32,780 we saw that Meta, 2527 01:56:32,780 --> 01:56:33,780 when they trained 2528 01:56:33,780 --> 01:56:34,780 the Llama 2 tokenizer 2529 01:56:34,780 --> 01:56:35,780 with sentencepiece, 2530 01:56:35,780 --> 01:56:37,780 they made sure to split up 2531 01:56:37,780 --> 01:56:39,780 all the digits 2532 01:56:39,780 --> 01:56:42,780 as an example, for Llama 2. 2533 01:56:42,780 --> 01:56:44,780 And this is partly to improve 2534 01:56:44,780 --> 01:56:45,780 simple arithmetic 2535 01:56:45,780 --> 01:56:47,780 performance. 2536 01:56:47,780 --> 01:56:48,780 And finally, 2537 01:56:48,780 --> 01:56:49,780 why is GPT-2 2538 01:56:49,780 --> 01:56:51,780 not as good at Python? 2539 01:56:51,780 --> 01:56:52,780 Again, this is partly 2540 01:56:52,780 --> 01:56:53,780 a modeling issue 2541 01:56:53,780 --> 01:56:54,780 in the architecture 2542 01:56:54,780 --> 01:56:55,780 and the data set 2543 01:56:55,780 --> 01:56:56,780 and the strength of the model, 2544 01:56:56,780 --> 01:56:58,780 but it's also partly tokenization. 2545 01:56:58,780 --> 01:56:59,780 Because as we saw here 2546 01:56:59,780 --> 01:57:01,780 with the simple Python example, 2547 01:57:01,780 --> 01:57:03,780 the encoding efficiency 2548 01:57:03,780 --> 01:57:04,780 of the tokenizer 2549 01:57:04,780 --> 01:57:05,780 for handling spaces in Python 2550 01:57:05,780 --> 01:57:06,780 is terrible. 2551 01:57:06,780 --> 01:57:07,780 And every single space 2552 01:57:07,780 --> 01:57:08,780 is an individual token. 2553 01:57:08,780 --> 01:57:09,780 And this dramatically 2554 01:57:09,780 --> 01:57:10,780 reduces the context length 2555 01:57:10,780 --> 01:57:12,780 that the model can attend across. 2556 01:57:12,780 --> 01:57:13,780 So that's almost like 2557 01:57:13,780 --> 01:57:15,780 a tokenization bug for GPT-2.
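You can see that whitespace behavior yourself with a quick check; a sketch assuming tiktoken is installed, where the GPT-2 and GPT-4 vocabularies go by the names "gpt2" and "cl100k_base".

```python
# A quick comparison (assumes tiktoken is installed) of how many tokens a small
# indented Python snippet costs under the GPT-2 vs the GPT-4 vocabulary.
import tiktoken

snippet = "def hello():\n        print('hello world')\n"
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(snippet))} tokens")
```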
2558 01:57:15,780 --> 01:57:18,780 And that was later fixed with GPT-4. 2559 01:57:18,780 --> 01:57:20,780 Okay, so here's another fun one. 2560 01:57:20,780 --> 01:57:21,780 My LLM abruptly halts 2561 01:57:21,780 --> 01:57:24,780 when it sees the string end of text. 2562 01:57:24,780 --> 01:57:27,780 So here's a very strange behavior. 2563 01:57:27,780 --> 01:57:28,780 Print a string end of text 2564 01:57:28,780 --> 01:57:30,780 is what I told GPT-4. 2565 01:57:30,780 --> 01:57:31,780 And it says, 2566 01:57:31,780 --> 01:57:33,780 could you please specify the string? 2567 01:57:33,780 --> 01:57:34,780 And I'm telling it, 2568 01:57:34,780 --> 01:57:35,780 give me end of text. 2569 01:57:35,780 --> 01:57:37,780 And it seems like there's an issue. 2570 01:57:37,780 --> 01:57:39,780 It's not seeing end of text. 2571 01:57:39,780 --> 01:57:40,780 And then I give it, 2572 01:57:40,780 --> 01:57:42,780 end of text is the string. 2573 01:57:42,780 --> 01:57:43,780 And then here's the string. 2574 01:57:43,780 --> 01:57:45,780 And then it just doesn't print it. 2575 01:57:45,780 --> 01:57:46,780 So obviously something is breaking here 2576 01:57:46,780 --> 01:57:47,780 with respect to the handling 2577 01:57:47,780 --> 01:57:48,780 of the special token. 2578 01:57:48,780 --> 01:57:49,780 And I didn't actually know 2579 01:57:49,780 --> 01:57:50,780 what OpenAI is doing 2580 01:57:50,780 --> 01:57:52,780 under the hood here 2581 01:57:52,780 --> 01:57:54,780 and whether they are potentially parsing this 2582 01:57:54,780 --> 01:57:58,780 as the actual <|endoftext|> special token 2583 01:57:58,780 --> 01:58:02,780 instead of this just being end of text 2584 01:58:02,780 --> 01:58:04,780 as like individual sort of pieces of it 2585 01:58:04,780 --> 01:58:07,780 without the special token handling logic. 2586 01:58:07,780 --> 01:58:08,780 And so it might be 2587 01:58:08,780 --> 01:58:09,780 that someone, 2588 01:58:09,780 --> 01:58:11,780 when they're calling .encode, 2589 01:58:11,780 --> 01:58:13,780 they are passing in allowed_special 2590 01:58:13,780 --> 01:58:15,780 and they are allowing end of text 2591 01:58:15,780 --> 01:58:18,780 as a special token in the user prompt. 2592 01:58:18,780 --> 01:58:19,780 But the user prompt, 2593 01:58:19,780 --> 01:58:20,780 of course, 2594 01:58:20,780 --> 01:58:22,780 is a sort of attacker controlled text. 2595 01:58:22,780 --> 01:58:25,780 So you would hope that they don't really parse 2596 01:58:25,780 --> 01:58:26,780 or use special tokens 2597 01:58:26,780 --> 01:58:27,780 or, you know, 2598 01:58:27,780 --> 01:58:29,780 from that kind of input. 2599 01:58:29,780 --> 01:58:30,780 But it appears that there's something 2600 01:58:30,780 --> 01:58:31,780 definitely going wrong here. 2601 01:58:31,780 --> 01:58:33,780 And so your knowledge 2602 01:58:33,780 --> 01:58:35,780 of these special tokens 2603 01:58:35,780 --> 01:58:37,780 ends up being an attack surface potentially. 2604 01:58:37,780 --> 01:58:40,780 And so if you'd like to confuse LLMs, 2605 01:58:40,780 --> 01:58:43,780 then just try to give them some special tokens 2606 01:58:43,780 --> 01:58:45,780 and see if you're breaking something by chance. 2607 01:58:45,780 --> 01:58:48,780 Okay, so this next one is a really fun one. 2608 01:58:48,780 --> 01:58:51,780 The trailing whitespace issue. 2609 01:58:51,780 --> 01:58:53,780 So if you come to Playground 2610 01:58:53,780 --> 01:58:57,780 and we come here to GPT-3.5-turbo-instruct. 2611 01:58:57,780 --> 01:58:58,780 So this is not a chat model.
2612 01:58:58,780 --> 01:59:00,780 This is a completion model. 2613 01:59:00,780 --> 01:59:01,780 So think of it more like 2614 01:59:01,780 --> 01:59:03,780 it's a lot closer to a base model. 2615 01:59:03,780 --> 01:59:05,780 It does completion. 2616 01:59:05,780 --> 01:59:07,780 It will continue the token sequence. 2617 01:59:07,780 --> 01:59:09,780 So here's a tagline for ice cream shop 2618 01:59:09,780 --> 01:59:11,780 and we want to continue the sequence. 2619 01:59:11,780 --> 01:59:13,780 And so we can submit 2620 01:59:13,780 --> 01:59:14,780 and get a bunch of tokens. 2621 01:59:14,780 --> 01:59:16,780 Okay, no problem. 2622 01:59:16,780 --> 01:59:18,780 But now suppose I do this, 2623 01:59:18,780 --> 01:59:21,780 but instead of pressing submit here, 2624 01:59:21,780 --> 01:59:24,780 I do here's a tagline for ice cream shop space. 2625 01:59:24,780 --> 01:59:26,780 So I have a space here 2626 01:59:26,780 --> 01:59:28,780 before I click submit. 2627 01:59:28,780 --> 01:59:30,780 We get a warning. 2628 01:59:30,780 --> 01:59:32,780 Your text ends in a trailing space, 2629 01:59:32,780 --> 01:59:33,780 which causes worse performance 2630 01:59:33,780 --> 01:59:36,780 due to how the API splits text into tokens. 2631 01:59:36,780 --> 01:59:37,780 So what's happening here? 2632 01:59:37,780 --> 01:59:40,780 It still gave us a sort of completion here, 2633 01:59:40,780 --> 01:59:43,780 but let's take a look at what's happening. 2634 01:59:43,780 --> 01:59:45,780 So here's a tagline for an ice cream shop. 2635 01:59:45,780 --> 01:59:48,780 And then what does this look like 2636 01:59:48,780 --> 01:59:49,780 in the actual training data? 2637 01:59:49,780 --> 01:59:51,780 Suppose you found the completion 2638 01:59:51,780 --> 01:59:52,780 in the training documents 2639 01:59:52,780 --> 01:59:53,780 somewhere on the internet 2640 01:59:53,780 --> 01:59:55,780 and the LLM trained on this data. 2641 01:59:55,780 --> 01:59:57,780 So maybe it's something like, 2642 01:59:57,780 --> 01:59:59,780 oh yeah, maybe that's the tagline. 2643 01:59:59,780 --> 02:00:00,780 That's a terrible tagline. 2644 02:00:00,780 --> 02:00:03,780 But notice here that for this O, 2645 02:00:03,780 --> 02:00:06,780 you see that 2646 02:00:06,780 --> 02:00:08,780 the space character is always a prefix 2647 02:00:08,780 --> 02:00:10,780 to these tokens in GPT. 2648 02:00:10,780 --> 02:00:12,780 So it's not an O token. 2649 02:00:12,780 --> 02:00:13,780 It's a space O token. 2650 02:00:13,780 --> 02:00:15,780 The space is part of the O 2651 02:00:15,780 --> 02:00:18,780 and together they are token 8840. 2652 02:00:18,780 --> 02:00:20,780 That's space O. 2653 02:00:20,780 --> 02:00:22,780 So what's happening here is that 2654 02:00:22,780 --> 02:00:24,780 when I just have it like this 2655 02:00:24,780 --> 02:00:27,780 and I let it complete the next token, 2656 02:00:27,780 --> 02:00:30,780 it can sample the space O token. 2657 02:00:30,780 --> 02:00:33,780 But instead, if I have this and I add my space, 2658 02:00:33,780 --> 02:00:35,780 then what I'm doing here when I encode this string, 2659 02:00:35,780 --> 02:00:37,780 is I have basically, 2660 02:00:37,780 --> 02:00:39,780 here's the tagline for an ice cream shop 2661 02:00:39,780 --> 02:00:43,780 and this space at the very end becomes a token 220.
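You can reproduce that split with a quick check; a sketch assuming the tiktoken package is installed, using the GPT-2/GPT-3-era vocabulary, so compare the printed ids with the numbers discussed here.

```python
# A quick check (assumes tiktoken is installed) of the trailing-space situation:
# the same prompt with and without a trailing space, and how a leading space
# normally attaches to the following word rather than standing alone.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")
without = enc.encode("Here's a tagline for an ice cream shop")
with_space = enc.encode("Here's a tagline for an ice cream shop ")
print(without)
print(with_space)  # same prefix, plus one extra token for the lone trailing space
print(enc.encode(" Oh"), [enc.decode([i]) for i in enc.encode(" Oh")])  # the space does not appear as its own token here
```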
2662 02:00:43,780 --> 02:00:46,780 And so we've added token 220 2663 02:00:46,780 --> 02:00:49,780 and this token otherwise would be part of the tagline 2664 02:00:49,780 --> 02:00:51,780 because if there actually is a tagline here, 2665 02:00:51,780 --> 02:00:54,780 so space O is the token. 2666 02:00:54,780 --> 02:00:57,780 And so this is suddenly out of distribution for the model 2667 02:00:57,780 --> 02:01:00,780 because this space is part of the next token, 2668 02:01:00,780 --> 02:01:02,780 but we're putting it here like this 2669 02:01:02,780 --> 02:01:05,780 and the model has seen very, very little 2670 02:01:05,780 --> 02:01:09,780 data of actual space by itself. 2671 02:01:09,780 --> 02:01:11,780 And we're asking it to complete the sequence, 2672 02:01:11,780 --> 02:01:12,780 like add in more tokens. 2673 02:01:12,780 --> 02:01:15,780 But the problem is that we've sort of begun the first token 2674 02:01:15,780 --> 02:01:17,780 and now it's been split up 2675 02:01:17,780 --> 02:01:19,780 and now we're out of distribution 2676 02:01:19,780 --> 02:01:21,780 and now arbitrary bad things happen. 2677 02:01:21,780 --> 02:01:23,780 And it's just a very rare example 2678 02:01:23,780 --> 02:01:25,780 for it to see something like that. 2679 02:01:25,780 --> 02:01:27,780 And that's why we get the warning. 2680 02:01:27,780 --> 02:01:30,780 So the fundamental issue here is of course that 2681 02:01:30,780 --> 02:01:33,780 the LLM operates on top of these tokens 2682 02:01:33,780 --> 02:01:34,780 and these tokens are text chunks. 2683 02:01:34,780 --> 02:01:37,780 They're not characters in the way you and I would think of them. 2684 02:01:37,780 --> 02:01:40,780 These are the atoms of what the LLM is seeing 2685 02:01:40,780 --> 02:01:42,780 and there's a bunch of weird stuff that comes out of it. 2686 02:01:42,780 --> 02:01:46,780 Let's go back to our .DefaultCellStyle example. 2687 02:01:46,780 --> 02:01:49,780 I bet you that the model has never in its training set 2688 02:01:49,780 --> 02:01:54,780 seen .DefaultCellSty, without the le, in there. 2689 02:01:54,780 --> 02:01:56,780 It's always seen this as a single group 2690 02:01:56,780 --> 02:02:00,780 because this is some kind of a function in... 2691 02:02:00,780 --> 02:02:02,780 I don't actually know what this is part of. 2692 02:02:02,780 --> 02:02:03,780 This is some kind of API, 2693 02:02:03,780 --> 02:02:07,780 but I bet you that it's never seen this combination of tokens 2694 02:02:07,780 --> 02:02:09,780 in its training data 2695 02:02:09,780 --> 02:02:11,780 or I think it would be extremely rare. 2696 02:02:11,780 --> 02:02:13,780 So I took this and I copy pasted it here 2697 02:02:13,780 --> 02:02:16,780 and I tried to complete from it 2698 02:02:16,780 --> 02:02:19,780 and it immediately gave me a big error. 2699 02:02:19,780 --> 02:02:21,780 And it said the model predicted a completion 2700 02:02:21,780 --> 02:02:23,780 that begins with a stop sequence resulting in no output. 2701 02:02:23,780 --> 02:02:25,780 Consider adjusting your prompt or stop sequences. 2702 02:02:25,780 --> 02:02:27,780 So what happened here when I clicked submit 2703 02:02:27,780 --> 02:02:30,780 is that immediately the model emitted 2704 02:02:30,780 --> 02:02:32,780 sort of like an end of text token, I think, 2705 02:02:32,780 --> 02:02:33,780 or something like that 2706 02:02:33,780 --> 02:02:36,780 it basically predicted the stop sequence immediately. 2707 02:02:36,780 --> 02:02:38,780 So it had no completion.
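You can look at the token structure that's causing this with a quick check; a sketch assuming tiktoken, using the GPT-4 vocabulary (cl100k_base) discussed above, where per that discussion the full string is a single token while the truncated one has to be pieced together out of other tokens.

```python
# A quick look (assumes tiktoken is installed) at how the full string vs the
# truncated string break into tokens under the GPT-4 vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in [".DefaultCellStyle", ".DefaultCellSty"]:
    ids = enc.encode(s)
    print(f"{s!r}: {len(ids)} token(s) -> {[enc.decode([i]) for i in ids]}")
```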
2708 02:02:38,780 --> 02:02:40,780 And so this is why I'm getting a warning again 2709 02:02:40,780 --> 02:02:42,780 because we're off the data distribution 2710 02:02:42,780 --> 02:02:45,780 and the model is just predicting 2711 02:02:45,780 --> 02:02:47,780 just totally arbitrary things. 2712 02:02:47,780 --> 02:02:49,780 It's just really confused basically. 2713 02:02:49,780 --> 02:02:50,780 This is giving it brain damage. 2714 02:02:50,780 --> 02:02:51,780 It's never seen this before. 2715 02:02:51,780 --> 02:02:52,780 It's shocked. 2716 02:02:52,780 --> 02:02:54,780 And it's predicting end of text or something. 2717 02:02:54,780 --> 02:02:56,780 I tried it again here 2718 02:02:56,780 --> 02:02:58,780 and in this case it completed it. 2719 02:02:58,780 --> 02:02:59,780 But then for some reason 2720 02:02:59,780 --> 02:03:02,780 it said this request may violate our usage policies. 2721 02:03:02,780 --> 02:03:04,780 This was flagged. 2722 02:03:04,780 --> 02:03:06,780 Basically something just like goes wrong 2723 02:03:06,780 --> 02:03:07,780 and there's something like jank. 2724 02:03:07,780 --> 02:03:08,780 You can just feel the jank 2725 02:03:08,780 --> 02:03:10,780 because the model is like extremely unhappy 2726 02:03:10,780 --> 02:03:11,780 with just this 2727 02:03:11,780 --> 02:03:12,780 and it doesn't know how to complete it 2728 02:03:12,780 --> 02:03:14,780 because it's never occurred in a training set. 2729 02:03:14,780 --> 02:03:17,780 In a training set it always appears like this 2730 02:03:17,780 --> 02:03:19,780 and becomes a single token. 2731 02:03:19,780 --> 02:03:21,780 So these kinds of issues where tokens are 2732 02:03:21,780 --> 02:03:24,780 either you sort of like complete the first character 2733 02:03:24,780 --> 02:03:25,780 of the next token 2734 02:03:25,780 --> 02:03:26,780 or you sort of 2735 02:03:26,780 --> 02:03:27,780 have long tokens 2736 02:03:27,780 --> 02:03:30,780 and you then have just some of their characters. 2737 02:03:30,780 --> 02:03:31,780 All of these are kind of like 2738 02:03:31,780 --> 02:03:34,780 issues with partial tokens 2739 02:03:34,780 --> 02:03:36,780 is how I would describe it. 2740 02:03:36,780 --> 02:03:40,780 And if you actually dig into the tiktoken repository 2741 02:03:40,780 --> 02:03:41,780 go to the Rust code 2742 02:03:41,780 --> 02:03:44,780 and search for unstable 2743 02:03:44,780 --> 02:03:46,780 and you'll see in the code 2744 02:03:46,780 --> 02:03:47,780 unstable native, 2745 02:03:47,780 --> 02:03:48,780 unstable tokens, 2746 02:03:48,780 --> 02:03:50,780 and a lot of like special case handling. 2747 02:03:50,780 --> 02:03:52,780 None of this stuff about unstable tokens 2748 02:03:52,780 --> 02:03:54,780 is documented anywhere 2749 02:03:54,780 --> 02:03:55,780 but there's a ton of code 2750 02:03:55,780 --> 02:03:57,780 dealing with unstable tokens 2751 02:03:57,780 --> 02:03:59,780 and unstable tokens is exactly 2752 02:03:59,780 --> 02:04:01,780 kind of like what I'm describing here. 2753 02:04:01,780 --> 02:04:04,780 What you would like out of a completion API 2754 02:04:04,780 --> 02:04:05,780 is something a lot more fancy. 2755 02:04:05,780 --> 02:04:07,780 Like if we're putting in .DefaultCellSty 2756 02:04:07,780 --> 02:04:09,780 and we're asking for the next token sequence 2757 02:04:09,780 --> 02:04:11,780 we're not actually trying to append the next token 2758 02:04:11,780 --> 02:04:13,780 exactly after this list.
2759 02:04:13,780 --> 02:04:15,780 We're actually trying to 2760 02:04:15,780 --> 02:04:18,780 consider lots of tokens 2761 02:04:18,780 --> 02:04:20,780 that if we were 2762 02:04:20,780 --> 02:04:21,780 I guess like 2763 02:04:21,780 --> 02:04:23,780 we're trying to search over characters 2764 02:04:23,780 --> 02:04:25,780 that if we retokenized 2765 02:04:25,780 --> 02:04:27,780 would be of high probability 2766 02:04:27,780 --> 02:04:29,780 if that makes sense. 2767 02:04:29,780 --> 02:04:30,780 So that we can actually add 2768 02:04:30,780 --> 02:04:32,780 a single individual character 2769 02:04:32,780 --> 02:04:35,780 instead of just like adding the next full token 2770 02:04:35,780 --> 02:04:38,780 that comes after this partial token list. 2771 02:04:38,780 --> 02:04:40,780 So this is very tricky to describe 2772 02:04:40,780 --> 02:04:42,780 and I invite you to maybe like look through this. 2773 02:04:42,780 --> 02:04:43,780 It ends up being extremely gnarly 2774 02:04:43,780 --> 02:04:45,780 and hairy kind of topic 2775 02:04:45,780 --> 02:04:47,780 and it comes from tokenization fundamentally. 2776 02:04:47,780 --> 02:04:50,780 So maybe I can even spend an entire video 2777 02:04:50,780 --> 02:04:51,780 talking about unstable tokens 2778 02:04:51,780 --> 02:04:52,780 sometime in the future. 2779 02:04:52,780 --> 02:04:54,780 Okay and I'm really saving the best for last. 2780 02:04:54,780 --> 02:04:56,780 My favorite one by far 2781 02:04:56,780 --> 02:04:58,780 is this solid gold Magikarp. 2782 02:04:58,780 --> 02:05:00,780 It was just 2783 02:05:00,780 --> 02:05:02,780 okay so this comes from this blog post 2784 02:05:02,780 --> 02:05:03,780 solid gold Magikarp 2785 02:05:03,780 --> 02:05:05,780 and this is 2786 02:05:05,780 --> 02:05:07,780 internet famous now 2787 02:05:07,780 --> 02:05:09,780 for those of us in LLMs. 2788 02:05:09,780 --> 02:05:11,780 And basically I would advise you to 2789 02:05:11,780 --> 02:05:13,780 read this blog post in full. 2790 02:05:13,780 --> 02:05:15,780 But basically what this person was doing is 2791 02:05:15,780 --> 02:05:18,780 this person went to the 2792 02:05:18,780 --> 02:05:20,780 token embedding table 2793 02:05:20,780 --> 02:05:22,780 and clustered the tokens 2794 02:05:22,780 --> 02:05:24,780 based on their embedding representation. 2795 02:05:24,780 --> 02:05:27,780 And this person noticed that there's a cluster 2796 02:05:27,780 --> 02:05:29,780 of tokens that look really strange. 2797 02:05:29,780 --> 02:05:31,780 So there's a cluster here: 2798 02:05:31,780 --> 02:05:32,780 at rot, 2799 02:05:32,780 --> 02:05:33,780 East stream fame, 2800 02:05:33,780 --> 02:05:34,780 solid gold Magikarp, 2801 02:05:34,780 --> 02:05:35,780 sign up message, 2802 02:05:35,780 --> 02:05:37,780 like really weird tokens in 2803 02:05:37,780 --> 02:05:40,780 basically in this embedding cluster. 2804 02:05:40,780 --> 02:05:42,780 And so what are these tokens 2805 02:05:42,780 --> 02:05:43,780 and where do they even come from? 2806 02:05:43,780 --> 02:05:44,780 Like what is solid gold Magikarp? 2807 02:05:44,780 --> 02:05:45,780 It makes no sense.
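For reference, the kind of analysis the post describes looks roughly like this; a sketch with stand-in data only, since the real thing would use the actual embedding matrix and token strings from the model.

```python
# A sketch (stand-in data) of clustering the rows of a token embedding table and
# inspecting which token strings land in the same cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
wte = rng.normal(size=(5000, 64))                    # stand-in for a real (vocab, n_embd) table
token_strings = [f"token_{i}" for i in range(5000)]  # stand-in for the real vocabulary strings

labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(wte)
cluster = labels[0]                                  # pick some cluster and list a few members
members = [token_strings[i] for i in np.flatnonzero(labels == cluster)[:10]]
print(members)  # with a real table, one such cluster is where the strange tokens showed up
```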
2808 02:05:45,780 --> 02:05:49,780 And then they found a bunch of these tokens 2809 02:05:49,780 --> 02:05:51,780 and then they noticed that actually 2810 02:05:51,780 --> 02:05:52,780 the plot thickens here 2811 02:05:52,780 --> 02:05:55,780 because if you ask the model about these tokens 2812 02:05:55,780 --> 02:05:58,780 like you ask it some very benign question 2813 02:05:58,780 --> 02:06:00,780 like please can you repeat back to me 2814 02:06:00,780 --> 02:06:02,780 the string solid gold Magikarp, 2815 02:06:02,780 --> 02:06:04,780 then you get a variety of basically 2816 02:06:04,780 --> 02:06:06,780 totally broken LLM behavior. 2817 02:06:06,780 --> 02:06:08,780 So either you get evasion. 2818 02:06:08,780 --> 02:06:10,780 So I'm sorry, I can't hear you 2819 02:06:10,780 --> 02:06:13,780 or you get a bunch of hallucinations as a response. 2820 02:06:13,780 --> 02:06:15,780 You can even get back like insults. 2821 02:06:15,780 --> 02:06:18,780 So you ask it about streamer bot 2822 02:06:18,780 --> 02:06:19,780 and instead of answering, 2823 02:06:19,780 --> 02:06:22,780 the model actually just calls you names 2824 02:06:22,780 --> 02:06:24,780 or it kind of comes up with like weird humor 2825 02:06:24,780 --> 02:06:26,780 like you're actually breaking the model 2826 02:06:26,780 --> 02:06:27,780 by asking about these 2827 02:06:27,780 --> 02:06:29,780 very simple strings like at Roth 2828 02:06:29,780 --> 02:06:31,780 and solid gold Magikarp. 2829 02:06:31,780 --> 02:06:32,780 So like what the hell is happening? 2830 02:06:32,780 --> 02:06:35,780 And there's a variety of documented behaviors here. 2831 02:06:35,780 --> 02:06:37,780 There's a bunch of tokens 2832 02:06:37,780 --> 02:06:38,780 not just solid gold Magikarp 2833 02:06:38,780 --> 02:06:40,780 that have that kind of a behavior. 2834 02:06:40,780 --> 02:06:43,780 And so basically there's a bunch of like trigger words. 2835 02:06:43,780 --> 02:06:45,780 And if you ask the model about these trigger words 2836 02:06:45,780 --> 02:06:47,780 or you just include them in your prompt 2837 02:06:47,780 --> 02:06:48,780 the model goes haywire 2838 02:06:48,780 --> 02:06:51,780 and has all kinds of really strange behaviors 2839 02:06:51,780 --> 02:06:53,780 including sort of ones that violate 2840 02:06:53,780 --> 02:06:55,780 typical safety guidelines 2841 02:06:55,780 --> 02:06:56,780 and the alignment of the model 2842 02:06:56,780 --> 02:06:57,780 like in this case 2843 02:06:57,780 --> 02:06:58,780 it's swearing back at you. 2844 02:06:58,780 --> 02:07:00,780 So what is happening here 2845 02:07:00,780 --> 02:07:02,780 and how can this possibly be true? 2846 02:07:02,780 --> 02:07:05,780 Well, this again comes down to tokenization. 2847 02:07:05,780 --> 02:07:06,780 So what's happening here 2848 02:07:06,780 --> 02:07:07,780 is that solid gold Magikarp 2849 02:07:07,780 --> 02:07:09,780 if you actually dig into it 2850 02:07:09,780 --> 02:07:10,780 is a Reddit user. 2851 02:07:10,780 --> 02:07:13,780 So there's a u slash solid gold Magikarp 2852 02:07:13,780 --> 02:07:15,780 and probably what happened here 2853 02:07:15,780 --> 02:07:17,780 even though I don't know that this has been 2854 02:07:17,780 --> 02:07:19,780 like really definitively explored 2855 02:07:19,780 --> 02:07:21,780 but what is thought to have happened 2856 02:07:21,780 --> 02:07:24,780 is that the tokenization data set 2857 02:07:24,780 --> 02:07:27,780 was very different from the training data set 2858 02:07:27,780 --> 02:07:29,780 for the actual language model.
2859 02:07:29,780 --> 02:07:30,780 So in the tokenization data set 2860 02:07:30,780 --> 02:07:32,780 there was a ton of Reddit data potentially 2861 02:07:32,780 --> 02:07:35,780 where the user solid gold Magikarp 2862 02:07:35,780 --> 02:07:36,780 was mentioned in the text. 2863 02:07:36,780 --> 02:07:38,780 Because solid gold Magikarp 2864 02:07:38,780 --> 02:07:41,780 was a very common sort of person 2865 02:07:41,780 --> 02:07:42,780 who would post a lot 2866 02:07:42,780 --> 02:07:44,780 this would be a string that occurs many times 2867 02:07:44,780 --> 02:07:46,780 in a tokenization data set. 2868 02:07:46,780 --> 02:07:48,780 Because it occurs many times 2869 02:07:48,780 --> 02:07:49,780 in the tokenization data set 2870 02:07:49,780 --> 02:07:51,780 these tokens would end up getting merged 2871 02:07:51,780 --> 02:07:52,780 to a single individual token 2872 02:07:52,780 --> 02:07:54,780 for that single Reddit user 2873 02:07:54,780 --> 02:07:55,780 solid gold Magikarp. 2874 02:07:55,780 --> 02:07:57,780 So they would have a dedicated token 2875 02:07:57,780 --> 02:07:58,780 in a vocabulary of 2876 02:07:58,780 --> 02:08:00,780 was it 50,000 tokens in GPT-2 2877 02:08:00,780 --> 02:08:03,780 that is devoted to that Reddit user. 2878 02:08:03,780 --> 02:08:04,780 And then what happens is 2879 02:08:04,780 --> 02:08:07,780 the tokenization data set has those strings 2880 02:08:07,780 --> 02:08:10,780 but then later when you train the model 2881 02:08:10,780 --> 02:08:12,780 the language model itself 2882 02:08:12,780 --> 02:08:15,780 this data from Reddit was not present. 2883 02:08:15,780 --> 02:08:16,780 And so therefore 2884 02:08:16,780 --> 02:08:19,780 in the entire training set for the language model 2885 02:08:19,780 --> 02:08:21,780 solid gold Magikarp never occurs. 2886 02:08:21,780 --> 02:08:24,780 That token never appears in the training set 2887 02:08:24,780 --> 02:08:26,780 for the actual language model later. 2888 02:08:26,780 --> 02:08:29,780 So this token never gets activated 2889 02:08:29,780 --> 02:08:30,780 it's initialized at random 2890 02:08:30,780 --> 02:08:32,780 in the beginning of optimization 2891 02:08:32,780 --> 02:08:33,780 then you have forward-backward passes 2892 02:08:33,780 --> 02:08:34,780 and updates to the model 2893 02:08:34,780 --> 02:08:36,780 and this token is just never updated 2894 02:08:36,780 --> 02:08:37,780 in the embedding table. 2895 02:08:37,780 --> 02:08:39,780 That row vector never gets sampled 2896 02:08:39,780 --> 02:08:40,780 it never gets used 2897 02:08:40,780 --> 02:08:41,780 so it never gets trained 2898 02:08:41,780 --> 02:08:42,780 and it's completely untrained. 2899 02:08:42,780 --> 02:08:44,780 It's kind of like unallocated memory 2900 02:08:44,780 --> 02:08:46,780 in a typical binary program 2901 02:08:46,780 --> 02:08:48,780 written in C or something like that. 2902 02:08:48,780 --> 02:08:49,780 So it's unallocated memory 2903 02:08:49,780 --> 02:08:50,780 and then at test time 2904 02:08:50,780 --> 02:08:52,780 if you evoke this token 2905 02:08:52,780 --> 02:08:53,780 then you're basically 2906 02:08:53,780 --> 02:08:55,780 plucking out a row of the embedding table 2907 02:08:55,780 --> 02:08:56,780 that is completely untrained 2908 02:08:56,780 --> 02:08:58,780 and that feeds into a transformer 2909 02:08:58,780 --> 02:09:00,780 and creates undefined behavior. 
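Here's a tiny toy illustration of that; a sketch in PyTorch, just to show that an embedding row that is never indexed never receives a gradient and stays at its random initialization.

```python
# A toy demonstration (PyTorch) that an embedding row which never occurs in the
# training data is never updated -- the "unallocated memory" situation.
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                    # tiny vocabulary of 10 tokens
never_seen = emb.weight[9].detach().clone()  # pretend token 9 never appears in training

opt = torch.optim.SGD(emb.parameters(), lr=0.1)
for _ in range(100):
    idx = torch.randint(0, 9, (32,))         # batches only ever contain tokens 0..8
    loss = emb(idx).pow(2).mean()            # stand-in loss; any loss behaves the same way here
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.equal(never_seen, emb.weight[9].detach()))  # True: row 9 was never trained
```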
2910 02:09:00,780 --> 02:09:01,780 And that's what we're seeing here 2911 02:09:01,780 --> 02:09:02,780 it's completely undefined, 2912 02:09:02,780 --> 02:09:05,780 never-before-seen-in-training behavior. 2913 02:09:05,780 --> 02:09:07,780 And so any of these kind of like weird tokens 2914 02:09:07,780 --> 02:09:08,780 would evoke this behavior 2915 02:09:08,780 --> 02:09:10,780 because fundamentally the model 2916 02:09:10,780 --> 02:09:13,780 is out of sample 2917 02:09:13,780 --> 02:09:15,780 out of distribution. 2918 02:09:15,780 --> 02:09:16,780 Okay and the very last thing 2919 02:09:16,780 --> 02:09:18,780 I wanted to just briefly mention and point out 2920 02:09:18,780 --> 02:09:19,780 although I think a lot of people 2921 02:09:19,780 --> 02:09:20,780 are quite aware of this 2922 02:09:20,780 --> 02:09:22,780 is that different kinds of formats 2923 02:09:22,780 --> 02:09:23,780 and different representations 2924 02:09:23,780 --> 02:09:24,780 and different languages 2925 02:09:24,780 --> 02:09:25,780 and so on 2926 02:09:25,780 --> 02:09:26,780 might be more or less efficient 2927 02:09:26,780 --> 02:09:28,780 with GPT tokenizers 2928 02:09:28,780 --> 02:09:30,780 or any tokenizers for any other LLM 2929 02:09:30,780 --> 02:09:31,780 for that matter. 2930 02:09:31,780 --> 02:09:32,780 So for example JSON 2931 02:09:32,780 --> 02:09:34,780 is actually really dense in tokens 2932 02:09:34,780 --> 02:09:37,780 and YAML is a lot more efficient in tokens. 2933 02:09:37,780 --> 02:09:40,780 So for example this is the same data 2934 02:09:40,780 --> 02:09:42,780 in JSON and in YAML. 2935 02:09:42,780 --> 02:09:44,780 The JSON is 116 tokens 2936 02:09:44,780 --> 02:09:46,780 and the YAML is 99 tokens. 2937 02:09:46,780 --> 02:09:48,780 So quite a bit of an improvement. 2938 02:09:48,780 --> 02:09:51,780 And so in the token economy 2939 02:09:51,780 --> 02:09:53,780 where we are paying per token 2940 02:09:53,780 --> 02:09:54,780 in many ways 2941 02:09:54,780 --> 02:09:55,780 and you are paying in the context length 2942 02:09:55,780 --> 02:09:57,780 and you're paying in dollar amount 2943 02:09:57,780 --> 02:09:59,780 for the cost of processing 2944 02:09:59,780 --> 02:10:00,780 all this kind of structured data 2945 02:10:00,780 --> 02:10:02,780 when you have to. 2946 02:10:02,780 --> 02:10:04,780 So prefer to use YAMLs over JSONs 2947 02:10:04,780 --> 02:10:05,780 and in general 2948 02:10:05,780 --> 02:10:06,780 kind of like the tokenization density 2949 02:10:06,780 --> 02:10:08,780 is something that you have to 2950 02:10:08,780 --> 02:10:09,780 sort of care about 2951 02:10:09,780 --> 02:10:11,780 and worry about at all times 2952 02:10:11,780 --> 02:10:13,780 and try to find efficient encoding schemes 2953 02:10:13,780 --> 02:10:14,780 and spend a lot of time 2954 02:10:14,780 --> 02:10:15,780 in the Tiktokenizer 2955 02:10:15,780 --> 02:10:17,780 and measure the different token efficiencies 2956 02:10:17,780 --> 02:10:18,780 of different formats and settings 2957 02:10:18,780 --> 02:10:19,780 and so on. 2958 02:10:19,780 --> 02:10:20,780 Okay so that concludes 2959 02:10:20,780 --> 02:10:23,780 my fairly long video on tokenization. 2960 02:10:23,780 --> 02:10:25,780 I know it's dry. 2961 02:10:25,780 --> 02:10:26,780 I know it's annoying. 2962 02:10:26,780 --> 02:10:27,780 I know it's irritating. 2963 02:10:27,780 --> 02:10:29,780 I personally really dislike this stage. 2964 02:10:29,780 --> 02:10:31,780 What I do have to say at this point 2965 02:10:31,780 --> 02:10:33,780 is don't brush it off.
2966 02:10:33,780 --> 02:10:34,780 There's a lot of foot guns, 2967 02:10:34,780 --> 02:10:35,780 sharp edges here, 2968 02:10:35,780 --> 02:10:36,780 security issues, 2969 02:10:36,780 --> 02:10:38,780 AI safety issues 2970 02:10:38,780 --> 02:10:40,780 as we saw plugging in unallocated memory 2971 02:10:40,780 --> 02:10:42,780 into language models. 2972 02:10:42,780 --> 02:10:45,780 So it's worth understanding this stage. 2973 02:10:45,780 --> 02:10:47,780 That said I will say that 2974 02:10:47,780 --> 02:10:49,780 eternal glory goes to anyone 2975 02:10:49,780 --> 02:10:50,780 who can get rid of it. 2976 02:10:50,780 --> 02:10:52,780 I showed you one possible paper 2977 02:10:52,780 --> 02:10:54,780 that tried to do that 2978 02:10:54,780 --> 02:10:55,780 and I think I hope 2979 02:10:55,780 --> 02:10:57,780 a lot more can follow over time. 2980 02:10:57,780 --> 02:10:58,780 And my final recommendations 2981 02:10:58,780 --> 02:11:00,780 for the application right now are 2982 02:11:00,780 --> 02:11:02,780 if you can reuse the GPT-4 tokens 2983 02:11:02,780 --> 02:11:03,780 and the vocabulary 2984 02:11:03,780 --> 02:11:04,780 in your application 2985 02:11:04,780 --> 02:11:05,780 then that's something you should consider 2986 02:11:05,780 --> 02:11:06,780 and just use tiktoken 2987 02:11:06,780 --> 02:11:08,780 because it is a very efficient 2988 02:11:08,780 --> 02:11:10,780 and nice library for inference 2989 02:11:10,780 --> 02:11:11,780 for BPE. 2990 02:11:11,780 --> 02:11:13,780 I also really like the byte level BPE 2991 02:11:13,780 --> 02:11:16,780 that tiktoken and OpenAI use. 2992 02:11:16,780 --> 02:11:17,780 If you for some reason 2993 02:11:17,780 --> 02:11:19,780 want to train your own vocabulary 2994 02:11:19,780 --> 02:11:21,780 from scratch 2995 02:11:21,780 --> 02:11:25,780 then I would use the BPE with sentence piece. 2996 02:11:25,780 --> 02:11:26,780 Oops. 2997 02:11:26,780 --> 02:11:27,780 As I mentioned 2998 02:11:27,780 --> 02:11:28,780 I'm not a huge fan of sentence piece. 2999 02:11:28,780 --> 02:11:32,780 I don't like its byte fallback 3000 02:11:32,780 --> 02:11:34,780 and I don't like that it's doing BPE 3001 02:11:34,780 --> 02:11:35,780 on Unicode code points. 3002 02:11:35,780 --> 02:11:36,780 I think it's 3003 02:11:36,780 --> 02:11:37,780 it also has like a million settings 3004 02:11:37,780 --> 02:11:39,780 and I think there's a lot of foot guns here 3005 02:11:39,780 --> 02:11:40,780 and I think it's really easy 3006 02:11:40,780 --> 02:11:41,780 to miscalibrate them 3007 02:11:41,780 --> 02:11:42,780 and you end up cropping your sentences 3008 02:11:42,780 --> 02:11:44,780 or something like that 3009 02:11:44,780 --> 02:11:45,780 because of some type of parameter 3010 02:11:45,780 --> 02:11:47,780 that you don't fully understand. 3011 02:11:47,780 --> 02:11:49,780 So be very careful with the settings. 3012 02:11:49,780 --> 02:11:50,780 Try to copy paste exactly 3013 02:11:50,780 --> 02:11:52,780 maybe what Meta did 3014 02:11:52,780 --> 02:11:54,780 or basically spend a lot of time 3015 02:11:54,780 --> 02:11:56,780 looking at all the hyperparameters 3016 02:11:56,780 --> 02:11:57,780 and go through the code of sentence piece 3017 02:11:57,780 --> 02:11:59,780 and make sure that you have this correct.
3018 02:11:59,780 --> 02:12:02,780 But even if you have all the settings correct 3019 02:12:02,780 --> 02:12:03,780 I still think that the algorithm 3020 02:12:03,780 --> 02:12:04,780 is kind of inferior 3021 02:12:04,780 --> 02:12:06,780 to what's happening here. 3022 02:12:06,780 --> 02:12:08,780 And maybe the best 3023 02:12:08,780 --> 02:12:10,780 if you really need to train your vocabulary 3024 02:12:10,780 --> 02:12:11,780 maybe the best thing is to just wait 3025 02:12:11,780 --> 02:12:13,780 for minBPE to become as efficient 3026 02:12:13,780 --> 02:12:14,780 as possible 3027 02:12:14,780 --> 02:12:16,780 and that's something that 3028 02:12:16,780 --> 02:12:18,780 maybe I hope to work on. 3029 02:12:18,780 --> 02:12:19,780 And at some point 3030 02:12:19,780 --> 02:12:21,780 maybe we can be training basically 3031 02:12:21,780 --> 02:12:22,780 really what we want 3032 02:12:22,780 --> 02:12:23,780 is we want tiktoken 3033 02:12:23,780 --> 02:12:24,780 but training code 3034 02:12:24,780 --> 02:12:26,780 and that is the ideal thing 3035 02:12:26,780 --> 02:12:28,780 that currently does not exist. 3036 02:12:28,780 --> 02:12:31,780 And minBPE is an implementation of it 3037 02:12:31,780 --> 02:12:33,780 but currently it's in Python. 3038 02:12:33,780 --> 02:12:35,780 So that's currently what I have to say 3039 02:12:35,780 --> 02:12:37,780 for tokenization. 3040 02:12:37,780 --> 02:12:38,780 There might be an advanced video 3041 02:12:38,780 --> 02:12:40,780 that is even drier 3042 02:12:40,780 --> 02:12:41,780 and even more detailed in the future. 3043 02:12:41,780 --> 02:12:42,780 But for now I think 3044 02:12:42,780 --> 02:12:44,780 we're going to leave things off here 3045 02:12:44,780 --> 02:12:46,780 and I hope that was helpful. 3046 02:12:46,780 --> 02:12:47,780 Bye. 3047 02:12:49,780 --> 02:12:55,780 And they increased this context size 3048 02:12:55,780 --> 02:12:57,780 from GPT-1 of 512 3049 02:12:57,780 --> 02:13:02,780 to 1,024 in GPT-2. 3050 02:13:02,780 --> 02:13:05,780 The next... 3051 02:13:05,780 --> 02:13:06,780 Okay, next I would like us 3052 02:13:06,780 --> 02:13:07,780 to briefly walk through 3053 02:13:07,780 --> 02:13:09,780 the code from OpenAI 3054 02:13:09,780 --> 02:13:16,780 on the GPT-2 encoder.py. 3055 02:13:16,780 --> 02:13:18,780 I'm sorry, I'm going to sneeze. 3056 02:13:18,780 --> 02:13:19,780 And then what's happening 3057 02:13:19,780 --> 02:13:22,780 here is... 3058 02:13:22,780 --> 02:13:23,780 This is a spurious layer 3059 02:13:23,780 --> 02:13:26,780 that I will explain in a bit. 3060 02:13:26,780 --> 02:13:28,780 What's happening here is...