Hi everyone. Today we are continuing our implementation of makemore, our favorite character-level language model. Now, you'll notice that the background behind me is different; that's because I am in Kyoto, and it is awesome. So I'm in a hotel room here.

Over the last few lectures, we've built up to this architecture: a multi-layer perceptron character-level language model. It receives three previous characters and tries to predict the fourth character in a sequence using a very simple multi-layer perceptron with one hidden layer of neurons with tanh nonlinearities. What we'd like to do in this lecture is complexify this architecture. In particular, we would like to take more characters in a sequence as input, not just three. And in addition to that, we don't just want to feed them all into a single hidden layer, because that squashes too much information too quickly. Instead, we would like to make a deeper model that progressively fuses this information to make its guess about the next character in the sequence. As we make this architecture more complex, we're actually going to arrive at something that looks very much like a WaveNet. WaveNet is a paper published by DeepMind in 2016, and it is also a language model, basically, but it tries to predict audio sequences instead of character-level or word-level sequences. Fundamentally, the modeling setup is identical: it is an autoregressive model, and it tries to predict the next character in a sequence. The architecture takes an interesting hierarchical approach to predicting the next character in a sequence, with a tree-like structure. This is the architecture, and we're going to implement it in the course of this video.

So let's get started. The starter code for part five is very similar to where we ended up in part three. Recall that part four was the manual backpropagation exercise, which is kind of an aside, so we are coming back to part three, copy-pasting chunks out of it, and that is our starter code for part five. I've changed very few things otherwise. Everything should look familiar if you've gone through part three. In particular, very briefly, we are doing imports, we are reading our dataset of words, and we are processing the dataset of words into individual examples. None of this data generation code has changed. Basically, we have lots and lots of examples: in particular, we have 182,000 examples of three characters trying to predict the fourth one. We've broken up every one of these words into little problems:
given three characters, predict the fourth one. So this is our dataset, and this is what we're trying to get the neural net to do.

Now, in part three we started to develop our code around these layer modules, for example a class Linear. We're doing this because we want to think of these modules as building blocks, like Lego bricks, that we can stack up into neural networks, and we can feed data between these layers and stack them up into graphs. We also developed these layers to have APIs and signatures very similar to those found in PyTorch. PyTorch has torch.nn, which has all of these layer building blocks that you would use in practice, and we were developing all of ours to mimic those APIs. So for example we have Linear; there is also a torch.nn.Linear, and its signature will be very similar to our signature, and the functionality will be quite identical as far as I'm aware.

So we have the Linear layer, the BatchNorm1d layer, and the Tanh layer that we developed previously. Linear just does a matrix multiply in the forward pass of this module. BatchNorm, of course, is this crazy layer that we developed in the previous lecture, and what's crazy about it is, well, there are many things. Number one, it has these running mean and variance that are trained outside of backpropagation; they are trained using an exponential moving average inside this layer when we call the forward pass. In addition to that, there's this training flag, because the behavior of batchnorm is different during train time and evaluation time, and so suddenly we have to be very careful that batchnorm is in its correct state, the evaluation state or the training state. That's something to keep track of now, and something that sometimes introduces bugs, because you forget to put it into the right mode. And finally, we saw that batchnorm couples the statistics, or the activations, across the examples in the batch. Normally we thought of the batch as just an efficiency thing, but now we are coupling the computation across batch elements, and that's done for the purposes of controlling the activation statistics, as we saw in the previous video. So it's a very weird layer and the source of a lot of bugs, partly, for example, because you have to modulate the training and eval phase and so on. In addition, for example, you have to wait for the mean and the variance to settle and actually reach a steady state.
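For reference, here is a condensed sketch of what those three modules look like, roughly in the spirit of the part three code; exact initialization and hyperparameter details may differ from what is on screen. Note that this BatchNorm1d still assumes two-dimensional inputs and reduces only over dimension 0, which is exactly the assumption that gets revisited later in this lecture.

import torch

class Linear:
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5  # scaled init
        self.bias = torch.zeros(fan_out) if bias else None
    def __call__(self, x):
        self.out = x @ self.weight          # matrix multiply in the forward pass
        if self.bias is not None:
            self.out += self.bias
        return self.out
    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps, self.momentum = eps, momentum
        self.training = True                # behavior differs between train and eval
        self.gamma = torch.ones(dim)        # learned scale (trained with backprop)
        self.beta = torch.zeros(dim)        # learned shift (trained with backprop)
        self.running_mean = torch.zeros(dim)  # buffers updated with an EMA, not backprop
        self.running_var = torch.ones(dim)
    def __call__(self, x):
        if self.training:
            xmean = x.mean(0, keepdim=True)   # batch statistics (couples the batch elements)
            xvar = x.var(0, keepdim=True)
        else:
            xmean, xvar = self.running_mean, self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out
    def parameters(self):
        return [self.gamma, self.beta]

class Tanh:
    def __call__(self, x):
        self.out = torch.tanh(x)
        return self.out
    def parameters(self):
        return []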
Then we have a list of layers: a Linear, feeding into a BatchNorm, feeding into a Tanh, and then a Linear output layer, whose weights are scaled down so that we are not confidently wrong at initialization. We see that this is about 12,000 parameters. We're telling PyTorch that the parameters require gradients. The optimization is, as far as I'm aware, identical and should look very, very familiar; nothing changed here.

The loss function plot looks very crazy, and we should probably fix this. That's because 32 batch elements are too few, so you can get very lucky or unlucky in any one of these batches, and it creates a very thick, noisy plot. We're going to fix that soon. Now, once we want to evaluate the trained neural network, we need to remember, because of the BatchNorm layers, to set all the layers to training equals False. This only matters for the BatchNorm layer so far. And then we evaluate. We see that currently we have a validation loss of 2.10, which is fairly good, but there's still a ways to go. But even at 2.10, we see that when we sample from the model, we actually get relatively name-like results that do not exist in the training set. So for example, Yvonne, Kilo, Pros, Alaya, et cetera. Certainly not unreasonable, I would say, but not amazing, and we can still push this validation loss even lower and get much better samples that are even more name-like. So let's improve this model now.

Okay, first, let's fix this graph, because it is daggers in my eyes and I just can't take it anymore. lossi, if you recall, is a Python list of floats; for example, the first 10 elements look like this. What we'd like to do, basically, is average up some of these values to get a more representative value along the way. One way to do this is the following. In PyTorch, if I create, for example, a tensor of the first 10 numbers, then this is currently a one-dimensional array. But recall that I can view this array as two-dimensional. For example, I can view it as a two-by-five array, and this is a 2D tensor now, two by five. You see what PyTorch has done: the first row of this tensor is the first five elements, and the second row is the second five elements.
I can also view it as a five-by-two, as an example. And then recall that I can also use negative one in place of one of these numbers, and PyTorch will calculate what that number must be in order to make the number of elements work out. So this can be this, or like that; both will work. Of course, this would not work.

Okay, so this allows us to spread out some of the consecutive values into rows. That's very helpful, because what we can do now is, first of all, create a torch.tensor out of the list of floats, and then view it, stretched out into rows of 1,000 consecutive elements. So the shape of this now becomes 200 by 1,000, and each row is 1,000 consecutive elements of this list. That's very helpful, because now we can take a mean along the rows, and the shape of that will just be 200; we've basically taken the mean of every row. So plt.plot of that should be something nicer. Much better. We see that we've basically made a lot of progress, and then here, this is the learning rate decay. We see that the learning rate decay subtracted a ton of energy out of the system and allowed us to settle into the local minimum of this optimization. So this is a much nicer plot. Let me come up and delete the monster, and we're going to be using this going forward.
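As a small sketch of that plot-smoothing trick, assuming lossi is the list of per-step losses kept by the training loop as described above (200,000 of them, so 200 rows of 1,000):

import torch
import matplotlib.pyplot as plt

lossi_t = torch.tensor(lossi)             # list of floats -> 1D tensor
means = lossi_t.view(-1, 1000).mean(1)    # rows of 1000 consecutive steps, averaged per row
plt.plot(means)                           # one point per 1000 steps: a much smoother curve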
Next up, what I'm bothered by is that our forward pass is a little bit gnarly and spans too many lines of code. In particular, we see that we've organized some of the layers inside the layers list, but not all of them, for no good reason. We still have the embedding table special-cased outside of the layers, and in addition to that, the viewing operation here is also outside of our layers. Let's create layers for these, and then we can add those layers to our list. In particular, the two things that we need are: here, we have this embedding table, and we are indexing at the integers inside the batch, inside the tensor xb. So that's an embedding table lookup, just done with indexing. And then here, we have this view operation, which, if you recall from the previous video, simply rearranges the character embeddings and stretches them out into a row. Effectively, what that does is a concatenation operation, basically, except it's free, because viewing is very cheap in PyTorch: no memory is being copied, we're just re-representing how we view that tensor.

So let's create modules for both of these operations: the embedding operation and the flattening operation. I actually wrote the code already, just to save some time. We have a module Embedding and a module Flatten, and they simply do the indexing operation and the flattening operation in their forward passes. C now will just become a self.weight inside the Embedding module. I'm calling these layers specifically Embedding and Flatten because it turns out that both of them actually exist in PyTorch. In PyTorch, we have nn.Embedding, and it also takes the number of embeddings and the dimensionality of the embedding, just like we have here, but in addition PyTorch takes a lot of other keyword arguments that we are not using for our purposes yet. Flatten also exists in PyTorch, but it too takes additional keyword arguments that we are not using, so we have a very simple Flatten. Both of them exist in PyTorch; ours are just a bit simpler.
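A minimal sketch of those two modules, following the description above (an indexing lookup and a view; the class names mirror the torch.nn layers they imitate):

import torch

class Embedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))
    def __call__(self, IX):
        self.out = self.weight[IX]          # table lookup by indexing: (..., embedding_dim)
        return self.out
    def parameters(self):
        return [self.weight]

class Flatten:
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)   # stretch everything after the batch dim into one row
        return self.out
    def parameters(self):
        return []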
Now that we have these, we can simply take out some of these special-cased things. So instead of C, we're just going to have an Embedding of vocab size and n_embd, and then after the embedding we are going to Flatten. Let's construct those modules. Now I can take out this C, and here I don't have to special-case it anymore, because C is now the Embedding's weight, and it's inside layers, so this should just work. And then here our forward pass simplifies substantially, because we don't need to do these operations outside of the layers, explicitly; they're now inside layers, so we can delete those. But now, to kick things off, we want this little x, which in the beginning is just xb, the tensor of integers specifying the identities of the characters at the input, and these characters can now directly feed into the first layer. This should just work, so let me come here and insert a break, because I just want to make sure that the first iteration of this runs and that there's no mistake. So that ran properly, and basically we've substantially simplified the forward pass here. Okay, I'm sorry, I changed my microphone, so hopefully the audio is a little bit better.

Now, one more thing that I would like to do in order to PyTorch-ify our code even further: right now we are maintaining all of our modules in a naked list of layers. We can also simplify this, because we can introduce the concept of PyTorch containers. In torch.nn, which we are basically rebuilding from scratch here, there's a concept of containers, and these containers are basically a way of organizing layers into lists or dicts and so on. In particular, there's a Sequential, which maintains a list of layers and is a Module class in PyTorch, and it basically just passes a given input through all the layers sequentially, exactly as we are doing here. So let's write our own Sequential. I've written the code here, and the code for Sequential is quite straightforward: we pass in a list of layers, which we keep here, and then, given any input, in a forward pass we just call all the layers sequentially and return the result. And in terms of the parameters, it's just all the parameters of the child modules. So we can run this.
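Here is a sketch of that Sequential container as just described (the parameter-gathering list comprehension is the one referred to below):

class Sequential:
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)                    # feed the output of each layer into the next
        self.out = x
        return self.out
    def parameters(self):
        # all parameters of the child modules, flattened into one list
        return [p for layer in self.layers for p in layer.parameters()]

With this in place, the training-loop forward pass reduces to roughly logits = model(Xb) followed by F.cross_entropy(logits, Yb), as discussed next.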
Okay, and again this simplifies things substantially, because we don't maintain this naked list of layers anymore. We now have a notion of a model, which is a module and, in particular, is a Sequential of all these layers. Now the parameters are simply just model.parameters, and that list comprehension now lives there. And then here we are doing all the things we used to do. The code here again simplifies substantially, because we don't have to do this forwarding explicitly; instead we just call the model on the input data, and the input data here are the integers inside xb. So the logits, which are the outputs of our model, are simply the model called on xb, and then the cross-entropy takes the logits and the targets. So this simplifies substantially, and this looks good. Let's just make sure this runs. That looks good. Okay.

Now here, we actually still have some work to do, but I'm going to come back to it later. For now, there are no more layers; there's a model, and model.layers. But it's a bit naughty to access attributes of these classes directly, so we'll come back and fix this later. And then here, of course, this simplifies substantially as well, because the logits are the model called on x, and these logits come here. So we can evaluate the train and validation loss, which currently is terrible, because we've just initialized the neural net. And then we can also sample from the model, and this simplifies dramatically as well, because we just want to call the model on the context, and out come the logits. Then these logits go into a softmax to get the probabilities, et cetera, so we can sample from this model.

What did I screw up? Okay, so I fixed the issue, and we now get the result that we expect, which is gibberish, because the model is not trained; we've initialized it from scratch. The problem was that when I fixed this cell to be model.layers instead of just layers, I did not actually run the cell, and so our neural net was in training mode. What caused the issue here is the batchnorm layer, as batchnorm layers often like to do, because batchnorm was in training mode, and here we are passing in an input which is a batch of just a single example, made up of the context. If you try to pass a single example into a batchnorm that is in training mode, you're going to end up estimating the variance using the input, and the variance of a single number is not a number, because variance is a measure of spread. So for example, the variance of just the single number five, you can see, is NaN. And so that's what happened: batchnorm basically caused an issue, and then that polluted all of the further processing. So all that we had to do was make sure that this cell runs. And note that we didn't actually see the issue in the loss: we could have evaluated the loss and we would still have gotten a result, just the wrong result, because batchnorm was in training mode and was using the sample statistics of the batch, whereas we want to use the running mean and running variance inside the batchnorm. So again, an example of introducing a bug because we did not properly maintain the state of what is training or not.
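A small sketch of the two points above, assuming the model built earlier: flipping the layers into inference mode before evaluating or sampling, and why a training-mode batchnorm breaks on a batch of one.

# put every layer into inference mode (only BatchNorm1d cares about this flag so far)
for layer in model.layers:
    layer.training = False

# why training-mode batchnorm fails on a single example:
import torch
print(torch.tensor([5.0]).var())   # tensor(nan): the spread of a single number is undefined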
Okay, so I reran everything, and here's where we are. As a reminder, we have a training loss of 2.05 and a validation loss of 2.10. Because these losses are very similar to each other, we have a sense that we are not overfitting too much on this task, and we can make additional progress in performance by scaling up the size of the neural network, making everything bigger and deeper. Currently, we are using this architecture, where we take some number of characters, go into a single hidden layer, and then go to the prediction of the next character. The problem here is that we don't have a naive way of making this bigger in a productive way. We could, of course, use our layers as building blocks and materials to introduce additional layers here and make the network deeper, but it would still be the case that we are crushing all of the characters into a single layer all the way at the beginning. Even if we make this a bigger layer and add neurons, it's still kind of silly to squash all that information so fast, in a single step.

What we'd like to do instead is have our network look a lot more like the WaveNet case. You see that in the WaveNet, when we are trying to make the prediction for the next character in the sequence, it is a function of the previous characters that feed in, but these characters are not just crushed into a single layer with the rest of the network sandwiched on top; they are crushed slowly. In particular, we take two characters and fuse them into a sort of bigram representation, and we do that for all these characters consecutively. Then we take the bigrams and fuse those into four-character-level chunks, and then we fuse that again. We do this in a tree-like, hierarchical manner: we fuse the information from the previous context slowly into the network as it gets deeper. This is the kind of architecture that we want to implement.

Now, in WaveNet's case, this figure is a visualization of a stack of dilated causal convolution layers, which makes it sound very scary, but actually the idea is very simple, and the fact that it's a dilated causal convolution layer is really just an implementation detail to make everything fast. We're going to see that later; for now, let's just keep the basic idea, which is this progressive fusion. We want to make the network deeper, and at each level we want to fuse only two consecutive elements:
two characters, then two bigrams, then two four-grams, and so on. So let's implement this.

Okay, so first up, let me scroll to where we built the dataset, and let's change the block size from three to eight. So we're going to be taking eight characters of context to predict the ninth character. The dataset now looks like this: we have a lot more context feeding in to predict any next character, and these eight characters are going to be processed in this tree-like structure. Now, if we scroll down, everything here should just be able to work, so we should be able to redefine the network. You see that the number of parameters has increased by 10,000, and that's because the block size has grown, so this first linear layer is much, much bigger: our linear layer now takes eight characters into this middle layer, so there are a lot more parameters there. But this should just run. Let me just break right after the very first iteration; you see that this runs just fine. It's just that this network doesn't make too much sense: we're crushing way too much information way too fast. So let's now come in and see how we could implement the hierarchical scheme.

Before we dive into the details of the re-implementation, I was just curious to actually run it and see where we are in terms of the baseline performance of just lazily scaling up the context length. So I let it run, we get a nice loss curve, and then, evaluating the loss, we actually see quite a bit of improvement just from increasing the context length. I started a little bit of a performance log here: previously we were getting a performance of 2.10 on the validation loss, and now simply scaling up the context length from 3 to 8 gives us a performance of 2.02, so quite a bit of an improvement. And also, when you sample from the model, you see that the names are definitely improving qualitatively as well. We could, of course, spend a lot of time here tuning things, making it even bigger, and scaling up the network further, even with this simple setup, but let's continue and implement the hierarchical model, and treat this as just a rough baseline performance. There's a lot of optimization left on the table in terms of some of the hyperparameters that you're hopefully getting a sense of now.
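For concreteness, a sketch of the dataset construction being referred to, with the block size bumped up to eight. The helper name build_dataset and the stoi character-to-index mapping (with '.' mapping to index 0) follow the convention of the earlier parts of this series and are assumptions here, not shown on screen in this section:

block_size = 8  # context length: how many characters we take to predict the next one

def build_dataset(words):
    X, Y = [], []
    for w in words:
        context = [0] * block_size          # start each word with a padded '.' context
        for ch in w + '.':
            ix = stoi[ch]
            X.append(context)               # the 8 previous characters
            Y.append(ix)                    # the character to predict
            context = context[1:] + [ix]    # slide the window forward by one
    return torch.tensor(X), torch.tensor(Y)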
Okay, so let's scroll back up now. What I've done here is I've created a bit of a scratch space for us to just look at the forward pass of the neural net and inspect the shapes of the tensors along the way as the neural net forwards. Here, just temporarily for debugging, I'm creating a batch of just, say, four examples: four random integers. Then I'm plucking out those rows from our training set, and then I'm passing the input Xb into the model. The shape of Xb here, because we have only four examples, is four by eight, and this eight is now the current block size. Inspecting Xb, we just see that we have four examples, each one of them a row of Xb, and we have eight characters here, and this integer tensor just contains the identities of those characters.

The first layer of our neural net is the embedding layer. Passing Xb, this integer tensor, through the embedding layer creates an output that is four by eight by ten. Our embedding table has, for each character, a ten-dimensional vector that we are trying to learn, and so what the embedding layer does is it plucks out the embedding vector for each one of these integers and organizes it all into a four by eight by ten tensor. All of these integers are translated into ten-dimensional vectors inside this three-dimensional tensor now. Passing that through the flatten layer, as you recall, views this tensor as just a four by eighty tensor, and what that effectively does is that all these ten-dimensional embeddings for all eight characters just end up being stretched out into a long row, which looks kind of like a concatenation operation, basically. So by viewing the tensor differently, we now have a four by eighty, and inside this eighty it's all the ten-dimensional vectors just concatenated next to each other. And the linear layer, of course, takes the 80 and creates 200 channels, just via matrix multiplication. So far, so good.
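The debugging scratch space being described, as a rough sketch; the training-tensor names Xtr and Ytr follow the earlier parts of the series and are assumptions here. The per-layer print relies on each module stashing its output in self.out, as in the sketches above.

ix = torch.randint(0, Xtr.shape[0], (4,))   # 4 random example indices
Xb, Yb = Xtr[ix], Ytr[ix]                   # a tiny batch just for inspection
logits = model(Xb)
print(Xb.shape)                             # torch.Size([4, 8])
for layer in model.layers:
    print(layer.__class__.__name__, ':', tuple(layer.out.shape))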
Now I'd like to show you something surprising. Let's look at the insides of the linear layer and remind ourselves how it works. The linear layer here, in a forward pass, takes the input x, multiplies it with a weight, and then optionally adds a bias. The weight here is two-dimensional, as defined here, and the bias is one-dimensional. So effectively, in terms of the shapes involved, what's happening inside this linear layer looks like this right now. I'm using random numbers here, but I'm just illustrating the shapes and what happens. Basically, a four by 80 input comes into the linear layer, gets multiplied by this 80 by 200 weight matrix inside, and then there's a plus 200 bias, and the shape of the whole thing that comes out of the linear layer is four by 200, as we see here. Notice, by the way, that this will create a four by 200 tensor, and then with the plus 200 there's a broadcasting happening, but four by 200 broadcasts with 200, so everything works here.

Now, the surprising thing I want to show you, which you may not expect, is that this input that is being multiplied doesn't actually have to be two-dimensional. This matrix-multiply operator in PyTorch is quite powerful, and in fact you can pass in higher-dimensional arrays or tensors, and everything works fine. For example, this could be four by five by 80, and the result in that case will become four by five by 200. You can add as many dimensions as you like on the left here. Effectively, what's happening is that the matrix multiplication only operates on the last dimension, and the dimensions before it in the input tensor are left unchanged. So basically, these dimensions on the left are all treated as just batch dimensions: we can have multiple batch dimensions, and then, in parallel over all those dimensions, we are doing the matrix multiplication on the last dimension.
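A quick illustration of that batched matrix multiply, with random numbers standing in for the shapes just discussed:

x = torch.randn(4, 5, 80)    # arbitrary extra "batch" dimensions on the left
w = torch.randn(80, 200)
b = torch.randn(200)
out = x @ w + b              # matmul acts on the last dimension; the bias broadcasts
print(out.shape)             # torch.Size([4, 5, 200])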
This is quite convenient, because we can use that in our network now. Remember that we have these eight characters coming in, and we don't want to flatten all of it out into one large 80-dimensional vector and matrix-multiply that by the weight matrix immediately. Instead, we want to group these: every two consecutive elements, one and two, and three and four, and five and six, and seven and eight, should be flattened out and multiplied by a weight matrix, but all four of these groups we'd like to process in parallel. So it's kind of like a batch dimension that we can introduce, and then we can, in parallel, process all of these bigram groups, both over the four groups inside an individual example and also over the actual batch dimension of the four examples we have here. So let's see how that works.

Effectively, what we want is the following. Right now we take a 4 by 80 and multiply it by an 80 by 200 in the linear layer; this is what happens. But instead, we don't want 80 numbers to come in; we only want two characters to come in on the very first layer, and those two characters should be fused. In other words, we just want 20 numbers to come in. And here we don't want a 4 by 80 to feed into the linear layer; we actually want these groups of two to feed in. So instead of a 4 by 80, we want this to be a 4 by 4 by 20: these are the four groups of two, where each element of a group is a ten-dimensional vector. So what we want now is to change the flatten layer so that it doesn't output a 4 by 80, but a 4 by 4 by 20, where every two consecutive characters are packed in on the very last dimension. Then this first four is the batch dimension, and this second four is a second batch dimension, referring to the four groups inside every one of these examples, and then this will just multiply as we saw. So this is what we want to get to. We're going to have to change the linear layer in terms of how many inputs it expects: it shouldn't expect 80, it should just expect 20 numbers. And we have to change our flatten layer so it doesn't just fully flatten out the entire example; it needs to create a 4 by 4 by 20 instead of a 4 by 80.

So let's see how this could be implemented. Basically, right now we have an input that is a 4 by 8 by 10 that feeds into the flatten layer, and currently the flatten layer just stretches it out. If you remember the implementation of flatten, it takes our x and just views it as whatever the batch dimension is, and then negative one. So effectively what it does right now is e.view(4, -1), and the shape of this, of course, is 4 by 80. That's what currently happens, and we instead want this to be a 4 by 4 by 20, where these consecutive ten-dimensional vectors get concatenated. Now, you know how in Python you can take a list(range(10)), so we have the numbers from 0 to 9, and we can index like [::2] to get all the even parts, and we can also index starting at 1 and going in steps of 2, [1::2], to get all the odd parts. So one way to implement this would be as follows.
We can take e and index into it: all of the batch elements, then just the even elements in this dimension, so at indexes 0, 2, 4, and 6, and then everything from the last dimension. This gives us the even characters, and then this here gives us all the odd characters. And what we want to do is concatenate these two tensors in PyTorch along dimension 2. The shape of that result is 4 by 4 by 20, which is definitely the result we want: we are explicitly grabbing the even parts and the odd parts, and we're arranging those two 4 by 4 by 10s right next to each other and concatenating.

So this works, but it turns out that what also works is to simply use a view again and just request the right shape, and it just so happens that in this case those vectors will again end up being arranged exactly the way we want. In particular, if we take e and we just view it as a 4 by 4 by 20, which is what we want, we can check that this is exactly equal to, let me call it, the explicit concatenation. So explicit.shape is 4 by 4 by 20, and if you just view e as a 4 by 4 by 20 and compare it to explicit, you get a big tensor of booleans from this element-wise comparison, and we can check that all of them evaluate to True. So basically, long story short, we don't need to make an explicit call to concatenate; we can simply take this input tensor to flatten and just view it in whatever way we want. And in particular, we don't want to stretch things out with negative one; we want to actually create a three-dimensional array, and depending on how many consecutive vectors we want to fuse, for example two, we can just ask for this last dimension to be 20 and use a negative one here, and PyTorch will figure out how many groups it needs to pack into this additional batch dimension.
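A small sketch of that equivalence check, with a random e standing in for the 4 by 8 by 10 output of the embedding layer:

e = torch.randn(4, 8, 10)                                   # stand-in for the embedding output
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)  # even positions next to odd positions
print(explicit.shape)                                       # torch.Size([4, 4, 20])
print((e.view(4, 4, 20) == explicit).all())                 # tensor(True): the view is equivalent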
So let's now go into flatten and implement this. Okay, so I scrolled up here to flatten, and what we'd like to do is change it. Let me create a constructor and take in the number of consecutive elements that we would like to concatenate in the last dimension of the output. Here we're just going to remember self.n equals n. And then I want to be careful here, because PyTorch actually has a torch.flatten, and its keyword arguments are different and it functions differently, so our flatten is going to depart from PyTorch's flatten. Let me call it FlattenConsecutive, or something like that, just to make it clear that our API is not quite the same. So this basically flattens only some n consecutive elements and puts them into the last dimension.

Now here, the shape of x is B by T by C, so let me pop those out into variables; recall that in our example down below, B was 4, T was 8, and C was 10. Instead of doing x.view(B, -1), which is what we had before, we want this to be B by something by C times n, since that's how many consecutive elements we want to fuse. And here, instead of negative one, I don't super love the use of negative one, because I like to be very explicit so that you get error messages when things don't go according to your expectations. So what do we expect here? We expect this to become T divided by n, using integer division. That's what I expect to happen.

Then one more thing I want to do here: remember that previously, all the way in the beginning, n was 3, and we were basically concatenating all three characters that existed there, so we concatenated everything. Sometimes that can create a spurious dimension of 1 here. So if it is the case that x.shape[1] is 1, then it's kind of a spurious dimension, and we don't want to return a three-dimensional tensor with a 1 there; we just want to return a two-dimensional tensor, exactly as we did before. In that case, basically, we will just say x equals x.squeeze; that is a PyTorch function, and squeeze takes a dimension: it either squeezes out all the dimensions of a tensor that are 1, or you can specify the exact dimension that you want to be squeezed. Again, I like to be as explicit as possible, always, so I expect to squeeze out only dimension 1 of this three-dimensional tensor, and if that dimension is 1, then I just want to return B by C times n. And so self.out will be x, and then we return self.out. So that's the candidate implementation. And of course, this should be self.n instead of just n.
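Putting the pieces just described together, a sketch of the FlattenConsecutive module:

class FlattenConsecutive:
    def __init__(self, n):
        self.n = n                                   # how many consecutive elements to fuse
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)       # pack n consecutive C-vectors into the last dim
        if x.shape[1] == 1:
            x = x.squeeze(1)                         # drop the spurious dimension when everything was fused
        self.out = x
        return self.out
    def parameters(self):
        return []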
So let's run, and let's come here now and take it for a spin. FlattenConsecutive: in the beginning, let's just use 8, which is the current block size; that should recover the previous behavior, so we should be able to run the model. And here we can inspect: I have a little code snippet where I iterate over all the layers and print the name of the class and the shape of its output, and we see the shapes as we expect them after every single layer.

So now let's try to restructure it using our FlattenConsecutive, and do it hierarchically. In particular, we want to flatten consecutively not the whole block size, but just 2, and then we want to process this with a Linear. Now, the number of inputs to this Linear will not be n_embd times block size; it will only be n_embd times 2, which is 20. This goes through the first layer, and now we can, in principle, just copy-paste this: the next Linear layer should expect n_hidden times 2, and the last piece of it should expect n_hidden times 2 again. So this is sort of the naive version of it. Running this, we now have a much, much bigger model, and we should be able to just forward the model.
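A sketch of that naive hierarchical stack, following the description above, using the module sketches from earlier. At this point in the lecture n_embd is still 10 and n_hidden is 200 (the 68-hidden-unit variant comes a bit later); the bias-free linears in front of the batchnorms follow the convention from part three and are an assumption here.

n_embd, n_hidden = 10, 200
model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2,  n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
# note: at this point BatchNorm1d still reduces only over dimension 0;
# it runs on the three-dimensional activations, but the fix comes later in the lecture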
And now we can inspect the numbers in between. The 4 by 8 by 10 coming out of the embedding was flattened consecutively into a 4 by 4 by 20, and this was projected into a 4 by 4 by 200. Then BatchNorm just worked out of the box; we will have to verify that BatchNorm actually does the correct thing, even though it takes a three-dimensional input instead of a two-dimensional input. Then we have Tanh, which is element-wise. Then we crushed it again: we flattened consecutively and ended up with a 4 by 2 by 400. A Linear brought it back down to 200, then BatchNorm and Tanh, and lastly we get a 4 by 400: we see that the FlattenConsecutive for the last flatten here squeezed out that dimension of 1, so we only ended up with a 4 by 400. Then a Linear, BatchNorm, Tanh, and the last Linear layer to get our logits, and the logits end up in the same shape as they were before. But now we actually have a nice three-layer neural net, and it corresponds exactly to this network, except only to this piece of it, because we only have three layers, whereas here in this example there are four layers, with a total receptive field size of 16 characters instead of just eight. So the block size there is 16, and this piece of it is basically what's implemented here.

Now we just have to figure out some good channel numbers to use here. In particular, I changed the number of hidden units to be 68 in this architecture, because when I use 68, the number of parameters comes out to be 22,000, which is exactly the same as we had before. So we have the same amount of capacity in this neural net in terms of the number of parameters, but the question is whether we are utilizing those parameters in a more efficient architecture. So what I did then is I got rid of a lot of the debugging cells and reran the optimization, and scrolling down to the result, we see that we get roughly identical performance: our validation loss is now 2.029, and previously it was 2.027. So, controlling for the number of parameters, changing from the flat to the hierarchical architecture is not giving us anything yet. That said, there are two things to point out. Number one, we didn't really torture the architecture here very much; this is just my first guess, and there's a bunch of hyperparameter searching that we could do in terms of how we allocate our budget of parameters to which layers. Number two, we may still have a bug inside the BatchNorm1d layer. So let's take a look at that, because it runs, but it doesn't do the right thing. I pulled up the layer inspector that we have here and printed out the shapes along the way, and currently it looks like the BatchNorm is receiving an input that is 32 by 4 by 68. And here on the right, I have the current implementation of BatchNorm that we have right now.
Now, this BatchNorm assumed, in the way we wrote it at the time, that x is two-dimensional: it was N by D, where N was the batch size, and that's why we only reduced the mean and the variance over the zeroth dimension. But now x will basically become three-dimensional. So what's happening inside the BatchNorm layer right now, and how come it's working at all and not giving any errors? The reason is basically that everything broadcasts properly, but the BatchNorm is not doing what we want it to do.

So let's think through what's happening inside the BatchNorm. I have the code here. We're receiving an input of 32 by 4 by 68, and then we are taking the mean (here I have e instead of x) over dimension 0, and that's actually giving us a 1 by 4 by 68. So we're taking the mean only over the very first dimension, and it's giving us a mean and a variance that still maintain this middle dimension; these means are only taken over the 32 numbers in the first dimension. Then, when we perform the normalization, everything still broadcasts correctly, but look at what ends up happening when we also inspect the running mean and its shape. I'm looking at model.layers[3], which is the first BatchNorm layer, and at what its running mean became: the shape of this running mean is now 1 by 4 by 68, instead of just being of size 68. Because we have 68 channels, we expect to maintain 68 means and variances, but actually we have an array of 4 by 68. So what this is telling us is that this BatchNorm is currently working in parallel over 4 times 68 instead of just 68 channels: we are maintaining statistics for every one of these four positions individually and independently. Instead, what we want to do is treat this 4 as a batch dimension, just like the zeroth dimension. As far as the BatchNorm is concerned, we don't want to average over 32 numbers; we want to average over 32 times 4 numbers for every single one of these 68 channels. So let me now remove this.
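A quick way to see the difference, on a random stand-in for that 32 by 4 by 68 activation:

e = torch.randn(32, 4, 68)
print(e.mean(0, keepdim=True).shape)        # torch.Size([1, 4, 68]): separate stats per position (the bug)
print(e.mean((0, 1), keepdim=True).shape)   # torch.Size([1, 1, 68]): one mean per channel (what we want)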
41:49.680 --> 41:53.200 In one of its signatures, when we specify the dimension, 41:53.280 --> 41:55.200 we see that the dimension here is not just an int. 41:55.280 --> 41:58.400 It can be an int or it can also be a tuple of ints. 41:58.480 --> 42:01.880 So we can reduce over multiple dimensions 42:01.960 --> 42:03.760 at the same time. 42:03.840 --> 42:06.840 So instead of just reducing over 0, we can pass in a tuple, 42:06.840 --> 42:10.400 0, 1, and here 0, 1 as well. 42:10.480 --> 42:13.840 And then what's going to happen is the output, of course, is going to be the same. 42:13.920 --> 42:17.240 But now what's going to happen is because we reduce over 0 and 1, 42:17.320 --> 42:22.440 if we look at the shape of this mean, we see that now we've reduced. 42:22.520 --> 42:26.840 We took the mean over both the 0th and the first dimension. 42:26.920 --> 42:30.920 So we're just getting 68 numbers and a bunch of spurious dimensions here. 42:31.000 --> 42:33.640 So now this becomes 1 by 1 by 68. 42:33.720 --> 42:36.160 And the running mean and the running variance, 42:36.160 --> 42:38.800 analogously, will become 1 by 1 by 68. 42:38.880 --> 42:41.040 So even though there are the spurious dimensions, 42:41.120 --> 42:46.200 the correct thing will happen in that we are only maintaining means and variances 42:46.280 --> 42:49.640 for 68 channels. 42:49.720 --> 42:54.200 And we're now calculating the mean and variance across 32 times 4 numbers. 42:54.280 --> 42:56.040 So that's exactly what we want. 42:56.120 --> 42:59.720 And let's change the implementation of BatchNorm1D that we have 42:59.800 --> 43:03.400 so that it can take in two-dimensional or three-dimensional inputs 43:03.480 --> 43:05.240 and perform accordingly. 43:05.320 --> 43:05.960 So at the end of the day, 43:05.960 --> 43:08.000 the fix is relatively straightforward. 43:08.080 --> 43:12.240 Basically, the dimension we want to reduce over is either 0 43:12.320 --> 43:15.400 or the tuple 0 and 1, depending on the dimensionality of x. 43:15.480 --> 43:19.280 So if x.ndim is 2, so it's a two-dimensional tensor, 43:19.360 --> 43:22.520 then the dimension we want to reduce over is just the integer 0. 43:22.600 --> 43:25.840 And if x.ndim is 3, so it's a three-dimensional tensor, 43:25.920 --> 43:31.440 then the dims we want to reduce over are 0 and 1. 43:31.520 --> 43:33.880 And then here, we just pass in dim. 43:33.960 --> 43:35.680 And if the dimensionality of x is anything else, 43:35.680 --> 43:37.840 we're going to get an error, which is good. 43:37.920 --> 43:40.560 So that should be the fix. 43:40.640 --> 43:42.320 Now, I want to point out one more thing. 43:42.400 --> 43:45.720 We're actually departing from the API of PyTorch here a little bit, 43:45.800 --> 43:48.560 because when you come to BatchNorm1D in PyTorch, 43:48.640 --> 43:51.720 you can scroll down and you can see that the input to this layer 43:51.800 --> 43:54.680 can either be n by c, where n is the batch size 43:54.760 --> 43:56.840 and c is the number of features or channels, 43:56.920 --> 43:59.600 or it actually does accept three-dimensional inputs, 43:59.680 --> 44:02.720 but it expects it to be n by c by l, 44:02.800 --> 44:05.520 where l is, say, like the sequence length or something like that. 44:05.680 --> 44:11.040 So this is a problem because you see how c is nested here in the middle.
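Here is a sketch of that fix inside our BatchNorm1d's forward pass, paraphrasing the change just described; the rest of the layer stays exactly as before. Note that, unlike torch.nn.BatchNorm1d, we keep the channels in the last dimension:

    def __call__(self, x):
        if self.training:
            # pick the dimensions to reduce over: everything except the channel dimension (the last one)
            if x.ndim == 2:
                dim = 0
            elif x.ndim == 3:
                dim = (0, 1)
            xmean = x.mean(dim, keepdim=True)  # becomes 1 x 1 x C for three-dimensional inputs
            xvar = x.var(dim, keepdim=True)
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

If x has any other number of dimensions, dim is never assigned and the call errors out, which matches the behavior described above.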
44:11.120 --> 44:14.000 And so when PyTorch's layer gets three-dimensional inputs, 44:14.080 --> 44:19.120 that BatchNorm layer will reduce over 0 and 2 instead of 0 and 1. 44:19.200 --> 44:26.400 So basically, PyTorch's BatchNorm1D layer assumes that c will always be in dimension 1, 44:26.480 --> 44:30.240 whereas we assume here that c is the last dimension, 44:30.320 --> 44:32.560 and there are some number of batch dimensions beforehand. 44:32.640 --> 44:35.440 And so, 44:35.440 --> 44:37.440 it expects n by c or n by c by l. 44:37.520 --> 44:39.440 We expect n by c or n by l by c. 44:39.520 --> 44:41.520 And so, it's a deviation. 44:41.600 --> 44:43.600 I think it's okay. 44:43.680 --> 44:45.680 I prefer it this way, honestly, 44:45.760 --> 44:48.400 so this is the way that we will keep it for our purposes. 44:48.480 --> 44:51.200 So I redefined the layers, reinitialized the neural net, 44:51.280 --> 44:54.480 and did a single forward pass with a break, just for one step. 44:54.560 --> 44:57.600 Looking at the shapes along the way, they're, of course, identical. 44:57.680 --> 44:59.280 All the shapes are the same, 44:59.360 --> 45:06.880 but the way we see that things are actually working as we want them to now 45:06.960 --> 45:08.880 is that when we look at the BatchNorm layer, 45:08.960 --> 45:10.880 the running mean shape is now 1 by 1 by 68. 45:10.960 --> 45:14.800 So we're only maintaining 68 means, one for every one of our channels, 45:14.880 --> 45:18.800 and we're treating both the 0th and the first dimension as a batch dimension, 45:18.880 --> 45:20.800 which is exactly what we want. 45:20.880 --> 45:22.800 So let me retrain the neural net now. 45:22.880 --> 45:24.800 Okay, so I've retrained the neural net with the bug fix. 45:24.880 --> 45:26.800 We get a nice curve. 45:26.880 --> 45:28.800 And when we look at the validation performance, 45:28.880 --> 45:30.800 we do actually see a slight improvement. 45:30.880 --> 45:32.800 So it went from 2.029 to 2.022. 45:32.880 --> 45:34.800 So basically, the bug inside the BatchNorm was holding us back, like, 45:34.880 --> 45:36.800 a little bit, it looks like. 45:36.880 --> 45:38.800 And we are getting a tiny improvement now, 45:38.880 --> 45:42.800 but it's not clear if this is statistically significant. 45:42.880 --> 45:44.800 And the reason we slightly expect an improvement 45:44.880 --> 45:48.800 is because we're not maintaining so many different means and variances 45:48.880 --> 45:50.800 that are only estimated using 32 numbers, effectively. 45:50.880 --> 45:54.800 Now we are estimating them using 32 times 4 numbers. 45:54.880 --> 45:56.800 So you just have a lot more numbers 45:56.880 --> 45:58.800 that go into any one estimate of the mean and variance. 45:58.880 --> 46:02.800 And it allows things to be a bit more stable and less wiggly 46:02.880 --> 46:04.800 inside those estimates of the BatchNorm. 46:04.880 --> 46:06.800 So pretty nice. 46:06.880 --> 46:08.800 With this more general architecture in place, 46:08.880 --> 46:10.800 we are now set up to push the performance further 46:10.880 --> 46:12.800 by increasing the size of the network. 46:12.880 --> 46:14.800 So, for example, 46:14.880 --> 46:16.800 I've bumped up the number of embedding dimensions to 24 instead of 10, 46:16.880 --> 46:18.800 and also increased the number of hidden units.
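As a rough sketch of what that scaled-up configuration might look like, using the module names from our notebook; the hidden size is not stated explicitly here, so n_hidden = 128 is an assumption that lands at roughly the 76,000 parameters mentioned next:

# assumed hyperparameters for the scaled-up run
vocab_size = 27
n_embd = 24     # dimensionality of the character embedding vectors (was 10)
n_hidden = 128  # neurons per hidden layer (an assumption, consistent with ~76K parameters)

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
print(sum(p.nelement() for p in model.parameters()))  # roughly 76,000 with these settings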
46:18.880 --> 46:20.800 But using the exact same architecture, 46:20.880 --> 46:22.800 we now have 76,000 parameters, 46:22.880 --> 46:24.800 and the training takes a lot longer, 46:24.880 --> 46:26.800 but we do get a nice curve. 46:26.880 --> 46:28.800 And then when you actually evaluate the performance, 46:28.880 --> 46:30.800 we are now getting validation performance of 1.993. 46:30.880 --> 46:36.800 So we've crossed over the 2.0 sort of territory. 46:36.880 --> 46:38.800 And we're at about 1.99. 46:38.880 --> 46:42.800 But we are starting to have to wait quite a bit longer. 46:42.880 --> 46:44.800 And we're a little bit in the dark 46:44.880 --> 46:46.800 with respect to the correct setting of the hyperparameters here 46:46.880 --> 46:48.800 and the learning rates and so on, 46:48.880 --> 46:50.800 because the experiments are starting to take longer to train. 46:50.880 --> 46:52.800 And so we are missing sort of like an experimental harness 46:52.880 --> 46:56.800 on which we could run a number of experiments 46:56.880 --> 46:58.800 and really tune this architecture very well. 46:58.880 --> 47:00.800 So I'd like to conclude now with a few notes. 47:00.880 --> 47:02.800 We basically improved our performance 47:02.880 --> 47:04.800 from a starting point of 2.1 47:04.800 --> 47:06.720 down to 1.99. 47:06.800 --> 47:08.720 But I don't want that to be the focus 47:08.800 --> 47:10.720 because honestly we're kind of in the dark. 47:10.800 --> 47:12.720 We have no experimental harness. 47:12.800 --> 47:14.720 We're just guessing and checking. 47:14.800 --> 47:16.720 And this whole thing is terrible. 47:16.800 --> 47:18.720 We're just looking at the training loss. 47:18.800 --> 47:20.720 Normally you want to look at both the training 47:20.800 --> 47:22.720 and the validation loss together. 47:22.800 --> 47:24.720 The whole thing looks different 47:24.800 --> 47:26.720 if you're actually trying to squeeze out numbers. 47:26.800 --> 47:28.720 That said, we did implement this architecture 47:28.800 --> 47:30.720 from the WaveNet paper. 47:30.800 --> 47:32.720 But we did not implement this specific forward pass of it 47:32.800 --> 47:34.720 where you have a more complicated 47:34.720 --> 47:36.640 structure that is this gated 47:36.720 --> 47:38.640 linear layer kind of thing. 47:38.720 --> 47:40.640 And there's residual connections and skip connections 47:40.720 --> 47:42.640 and so on. So we did not implement that. 47:42.720 --> 47:44.640 We just implemented this structure. 47:44.720 --> 47:46.640 I would like to briefly hint at or preview 47:46.720 --> 47:48.640 how what we've done here relates 47:48.720 --> 47:50.640 to convolutional neural networks 47:50.720 --> 47:52.640 as used in the WaveNet paper. 47:52.720 --> 47:54.640 And basically the use of convolutions 47:54.720 --> 47:56.640 is strictly for efficiency. 47:56.720 --> 47:58.640 It doesn't actually change the model we've implemented. 47:58.720 --> 48:00.640 So here for example, 48:00.720 --> 48:02.640 let me look at a specific name 48:02.720 --> 48:04.640 to work with an example. 48:04.640 --> 48:06.560 So we have a name in our training set 48:06.640 --> 48:08.560 and it's D'Andre. 48:08.640 --> 48:10.560 And it has seven letters. 48:10.640 --> 48:12.560 So that is eight independent examples in our model.
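As a reminder of why a seven-letter name turns into eight rows, here is a tiny sketch of the dataset expansion for a single name with our block size of 8. This is illustrative only: the real code from the earlier lectures loops over the whole word list, and the spelling here is just a stand-in for the actual name in the training set:

block_size = 8
chars = '.abcdefghijklmnopqrstuvwxyz'   # '.' is the start/end token, as in our dataset code
stoi = {ch: i for i, ch in enumerate(chars)}

word = 'deandre'                        # an illustrative 7-letter name
context = [0] * block_size              # start from a window of all '.'
for ch in word + '.':                   # 7 letters plus the terminating '.' -> 8 examples
    print(''.join(chars[i] for i in context), '-->', ch)
    context = context[1:] + [stoi[ch]]  # slide the window one character to the left

Each printed row is one independent (context, target) example, and those are exactly the rows referred to next.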
48:12.640 --> 48:14.560 So all these rows here 48:14.640 --> 48:16.560 are independent examples of D'Andre. 48:16.640 --> 48:18.560 Now you can forward, of course, 48:18.640 --> 48:20.560 any one of these rows independently. 48:20.640 --> 48:22.560 So I can take my model 48:22.640 --> 48:24.560 and call it on 48:24.640 --> 48:26.560 any individual index. 48:26.640 --> 48:28.560 Notice by the way here 48:28.640 --> 48:30.560 I'm being a little bit tricky. 48:30.640 --> 48:32.560 The reason for this is that on its own 48:32.560 --> 48:36.480 it's a one-dimensional array of eight. 48:36.560 --> 48:38.480 So you can't actually call the model on it. 48:38.560 --> 48:40.480 You're going to get an error 48:40.560 --> 48:42.480 because there's no batch dimension. 48:42.560 --> 48:44.480 So when you do Xtr at 48:44.560 --> 48:46.480 a list of seven, 48:46.560 --> 48:48.480 then the shape of this becomes one by eight. 48:48.560 --> 48:50.480 So I get an extra batch dimension 48:50.560 --> 48:52.480 of one and then we can forward the model. 48:52.560 --> 48:54.480 So 48:54.560 --> 48:56.480 that forwards a single example 48:56.560 --> 48:58.480 and you might imagine that you actually 48:58.560 --> 49:00.480 may want to forward all of these eight 49:00.560 --> 49:02.480 at the same time. 49:02.480 --> 49:04.400 So pre-allocating some memory 49:04.480 --> 49:06.400 and then doing a for loop 49:06.480 --> 49:08.400 eight times and forwarding all of those 49:08.480 --> 49:10.400 eight here will give us 49:10.480 --> 49:12.400 all the logits in all these different cases. 49:12.480 --> 49:14.400 Now for us with the model 49:14.480 --> 49:16.400 as we've implemented it right now 49:16.480 --> 49:18.400 this is eight independent calls to our model. 49:18.480 --> 49:20.400 But what convolutions allow you to do 49:20.480 --> 49:22.400 is they allow you to basically slide 49:22.480 --> 49:24.400 this model efficiently 49:24.480 --> 49:26.400 over the input sequence. 49:26.480 --> 49:28.400 And so this for loop can be done 49:28.480 --> 49:30.400 not outside in Python 49:30.480 --> 49:32.400 but inside of kernels in CUDA. 49:32.480 --> 49:34.400 And so this for loop gets hidden into the convolution. 49:34.480 --> 49:36.400 So the convolution 49:36.480 --> 49:38.400 basically you can think of it as 49:38.480 --> 49:40.400 a for loop applying a little linear 49:40.480 --> 49:42.400 filter over space 49:42.480 --> 49:44.400 of some input sequence. 49:44.480 --> 49:46.400 And in our case the space we're interested in is one-dimensional 49:46.480 --> 49:48.400 and we're interested in sliding these filters 49:48.480 --> 49:50.400 over the input data. 49:50.480 --> 49:52.400 So this diagram 49:52.480 --> 49:54.400 actually is fairly good as well. 49:54.480 --> 49:56.400 Basically what we've done is 49:56.480 --> 49:58.400 here they are highlighting in black 49:58.480 --> 50:00.400 one single sort of like tree 50:00.480 --> 50:02.400 of this calculation. 50:02.400 --> 50:04.320 So just calculating the single output 50:04.400 --> 50:06.320 example here. 50:06.400 --> 50:08.320 And so this is basically 50:08.400 --> 50:10.320 what we've implemented here. 50:10.400 --> 50:14.320 We've implemented this black structure 50:14.400 --> 50:16.320 and calculated a single output, like a single example.
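A sketch of that naive version, using the model and Xtr from the notebook; the starting index 7 is just illustrative for where this name's rows might sit in Xtr:

import torch

# forward each of the 8 examples independently: eight separate calls to the model
logits = torch.zeros(8, 27)               # pre-allocate space for all eight rows of logits
for i in range(8):
    logits[i] = model(Xtr[[7 + i]])[0]    # the inner list keeps a batch dimension of 1; [0] drops it again
print(logits.shape)                       # torch.Size([8, 27])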
50:16.400 --> 50:18.320 But what convolutions 50:18.400 --> 50:20.320 allow you to do is they allow you to take 50:20.400 --> 50:22.320 this black structure and 50:22.400 --> 50:24.320 kind of like slide it over the input sequence 50:24.400 --> 50:26.320 here and calculate 50:26.400 --> 50:28.320 all of these orange 50:28.400 --> 50:30.320 outputs at the same time. 50:30.400 --> 50:32.320 Or here that corresponds to calculating 50:32.320 --> 50:34.240 all of these outputs 50:34.320 --> 50:36.240 at all the positions of 50:36.320 --> 50:38.240 D'Andre at the same time. 50:38.320 --> 50:40.240 And the reason that 50:40.320 --> 50:42.240 this is much more efficient is because, 50:42.320 --> 50:44.240 number one, as I mentioned, the for loop 50:44.320 --> 50:46.240 is inside the CUDA kernels doing the 50:46.320 --> 50:48.240 sliding. So that makes 50:48.320 --> 50:50.240 it efficient. But number two, notice 50:50.320 --> 50:52.240 the variable reuse here. For example, 50:52.320 --> 50:54.240 if we look at this circle, this node here, 50:54.320 --> 50:56.240 this node here is the right child 50:56.320 --> 50:58.240 of this node, but it's also 50:58.320 --> 51:00.240 the left child of the node here. 51:00.320 --> 51:02.240 And so basically this 51:02.240 --> 51:04.160 node and its value is used 51:04.240 --> 51:06.160 twice. And so 51:06.240 --> 51:08.160 right now, in this naive way, 51:08.240 --> 51:10.160 we'd have to recalculate it. 51:10.240 --> 51:12.160 But here we are allowed to reuse it. 51:12.240 --> 51:14.160 So in the convolutional neural network, 51:14.240 --> 51:16.160 you think of these linear layers that we have 51:16.240 --> 51:18.160 up above as filters. 51:18.240 --> 51:20.160 And we take these filters, 51:20.240 --> 51:22.160 and they're linear filters, and you slide them over 51:22.240 --> 51:24.160 the input sequence, and we calculate 51:24.240 --> 51:26.160 the first layer, and then the second layer, 51:26.240 --> 51:28.160 and then the third layer, and then the output layer 51:28.240 --> 51:30.160 of the sandwich, and it's all done very 51:30.240 --> 51:32.160 efficiently using these convolutions. 51:32.240 --> 51:34.160 So we're going to cover that in a future video. 51:34.240 --> 51:36.160 The second thing I hope you took away from this video 51:36.240 --> 51:38.160 is you've seen me basically implement 51:38.240 --> 51:40.160 all of these layer 51:40.240 --> 51:42.160 Lego building blocks, or module 51:42.240 --> 51:44.160 building blocks. And I'm 51:44.240 --> 51:46.160 implementing them over here, and we've implemented 51:46.240 --> 51:48.160 a number of layers together, and we're also 51:48.240 --> 51:50.160 implementing these containers. 51:50.240 --> 51:52.160 And we've overall 51:52.240 --> 51:54.160 PyTorchified our code quite a bit more. 51:54.240 --> 51:56.160 Now, basically what we're doing 51:56.240 --> 51:58.160 here is we're reimplementing Torch.nn, 51:58.240 --> 52:00.160 which is the neural networks 52:00.240 --> 52:02.160 library on top of 52:02.240 --> 52:04.080 Torch.tensor. And it looks very much 52:04.160 --> 52:06.080 like this, except it is much better 52:06.160 --> 52:08.080 because it's in PyTorch 52:08.160 --> 52:10.080 instead of being janky code in my Jupyter 52:10.160 --> 52:12.080 notebook. So I think going forward 52:12.160 --> 52:14.080 I will consider us as having 52:14.160 --> 52:16.080 unlocked Torch.nn.
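To give a flavor of what using torch.nn directly looks like, here is a sketch, not code from the lecture, of the original flat one-hidden-layer network written out of torch.nn building blocks; the sizes are just illustrative:

import torch
import torch.nn as nn

vocab_size, block_size, n_embd, n_hidden = 27, 8, 10, 200   # illustrative sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, n_embd),                 # (N, 8) -> (N, 8, n_embd)
    nn.Flatten(start_dim=1),                          # (N, 8, n_embd) -> (N, 8 * n_embd)
    nn.Linear(block_size * n_embd, n_hidden, bias=False),
    nn.BatchNorm1d(n_hidden),
    nn.Tanh(),
    nn.Linear(n_hidden, vocab_size),
)

x = torch.randint(0, vocab_size, (32, block_size))    # a fake batch of integer contexts
print(model(x).shape)                                 # torch.Size([32, 27])

The hierarchical version takes a bit more care in torch.nn, because nn.BatchNorm1d wants channels in dimension 1, which is exactly the N by C by L versus N by L by C gymnastics mentioned below.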
52:16.160 --> 52:18.080 We understand roughly what's in there, 52:18.160 --> 52:20.080 how these modules work, how they're nested, 52:20.160 --> 52:22.080 and what they're doing on top of 52:22.160 --> 52:24.080 Torch.tensor. So hopefully we'll just 52:24.160 --> 52:26.080 switch over and continue 52:26.160 --> 52:28.080 and start using Torch.nn directly. 52:28.160 --> 52:30.080 The next thing I hope you got a bit of a sense of 52:30.160 --> 52:32.080 is what the development process 52:32.080 --> 52:34.000 of building deep neural networks looks like. 52:34.080 --> 52:36.000 Which I think was relatively representative 52:36.080 --> 52:38.000 to some extent. So number one, 52:38.080 --> 52:40.000 we are spending a lot of time 52:40.080 --> 52:42.000 in the documentation page of PyTorch. 52:42.080 --> 52:44.000 And we're reading through all the layers, 52:44.080 --> 52:46.000 looking at documentations, 52:46.080 --> 52:48.000 what are the shapes of the inputs, 52:48.080 --> 52:50.000 what can they be, what does the layer do, 52:50.080 --> 52:52.000 and so on. Unfortunately, 52:52.080 --> 52:54.000 I have to say the PyTorch documentation 52:54.080 --> 52:56.000 is not very good. 52:56.080 --> 52:58.000 They spend a ton of time on hardcore 52:58.080 --> 53:00.000 engineering of all kinds of distributed primitives, 53:00.080 --> 53:02.000 etc. But as far as I can tell, 53:02.000 --> 53:03.920 no one is maintaining documentation. 53:04.000 --> 53:05.920 It will lie to you, 53:06.000 --> 53:07.920 it will be wrong, it will be incomplete, 53:08.000 --> 53:09.920 it will be unclear. 53:10.000 --> 53:11.920 So unfortunately, it is what it is 53:12.000 --> 53:13.920 and you just kind of do your best 53:14.000 --> 53:15.920 with what they've 53:16.000 --> 53:17.920 given us. 53:18.000 --> 53:19.920 Number two, 53:20.000 --> 53:21.920 the other thing that I hope you got 53:22.000 --> 53:23.920 a sense of is there's a ton of 53:24.000 --> 53:25.920 trying to make the shapes work. 53:26.000 --> 53:27.920 And there's a lot of gymnastics around these multi-dimensional 53:28.000 --> 53:29.920 arrays. And are they two-dimensional, 53:30.000 --> 53:31.920 three-dimensional, four-dimensional? 53:31.920 --> 53:33.840 Do the layers take what shapes? 53:33.920 --> 53:35.840 Is it NCL or NLC? 53:35.920 --> 53:37.840 And you're permuting and viewing, 53:37.920 --> 53:39.840 and it just gets pretty messy. 53:39.920 --> 53:41.840 And so that brings me to number three. 53:41.920 --> 53:43.840 I very often prototype these layers 53:43.920 --> 53:45.840 and implementations in Jupyter Notebooks 53:45.920 --> 53:47.840 and make sure that all the shapes work out. 53:47.920 --> 53:49.840 And I'm spending a lot of time basically 53:49.920 --> 53:51.840 babysitting the shapes and making sure 53:51.920 --> 53:53.840 everything is correct. And then once I'm 53:53.920 --> 53:55.840 satisfied with the functionality in a Jupyter Notebook, 53:55.920 --> 53:57.840 I will take that code and copy-paste it into 53:57.920 --> 53:59.840 my repository of actual code 53:59.920 --> 54:01.840 that I'm training with. And so 54:01.840 --> 54:03.760 then I'm working with VS Code on the side. 54:03.840 --> 54:05.760 So I usually have Jupyter Notebook and VS Code. 54:05.840 --> 54:07.760 I develop in Jupyter Notebook, I paste 54:07.840 --> 54:09.760 into VS Code, and then I kick off experiments 54:09.840 --> 54:11.760 from the repo, of course, 54:11.840 --> 54:13.760 from the code repository. 
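As a tiny example of that kind of shape gymnastics, illustrative only and not from the lecture: converting between the channels-last layout we use and the channels-first layout that layers like torch.nn.BatchNorm1d or Conv1d expect is a single permute, and mixing up the two layouts is the kind of thing that can fail silently, as the BatchNorm reduction above did:

import torch

x = torch.randn(32, 4, 68)      # (N, L, C): batch, positions, channels -- our convention
x_ncl = x.permute(0, 2, 1)      # (N, C, L): what torch.nn.BatchNorm1d and Conv1d expect
print(x_ncl.shape)              # torch.Size([32, 68, 4])
print(x_ncl.is_contiguous())    # False: a .contiguous() is often needed before .view()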
54:13.840 --> 54:15.760 So that's roughly some notes on the 54:15.840 --> 54:17.760 development process of working with neural nets. 54:17.840 --> 54:19.760 Lastly, I think this lecture unlocks a lot 54:19.840 --> 54:21.760 of potential further lectures 54:21.840 --> 54:23.760 because, number one, we have to convert our 54:23.840 --> 54:25.760 neural network to actually use these dilated 54:25.840 --> 54:27.760 causal convolutional layers, 54:27.840 --> 54:29.760 so implementing the ConvNet. 54:29.840 --> 54:31.760 Number two, I could potentially start 54:31.760 --> 54:33.680 to get into what this means, 54:33.760 --> 54:35.680 what residual connections and 54:35.760 --> 54:37.680 skip connections are and why they are useful. 54:37.760 --> 54:39.680 Number three, 54:39.760 --> 54:41.680 as I mentioned, we don't have any experimental harness. 54:41.760 --> 54:43.680 So right now I'm just guessing and checking 54:43.760 --> 54:45.680 everything. This is not representative of 54:45.760 --> 54:47.680 typical deep learning workflows. You have to 54:47.760 --> 54:49.680 set up your evaluation harness. 54:49.760 --> 54:51.680 You can kick off experiments. You have lots of arguments 54:51.760 --> 54:53.680 that your script can take. 54:53.760 --> 54:55.680 You're kicking off a lot of experimentation. 54:55.760 --> 54:57.680 You're looking at a lot of plots of training and validation 54:57.760 --> 54:59.680 losses, and you're looking at what is working 54:59.760 --> 55:01.680 and what is not working. And you're working on this 55:01.680 --> 55:03.600 like population level, and you're doing 55:03.680 --> 55:05.600 all these hyperparameter searches. 55:05.680 --> 55:07.600 And so we've done none of that so far. 55:07.680 --> 55:09.600 So how to set that up 55:09.680 --> 55:11.600 and how to make it good, I think 55:11.680 --> 55:13.600 is a whole other topic. 55:13.680 --> 55:15.600 And number four, we should probably cover 55:15.680 --> 55:17.600 recurrent neural networks: RNNs, LSTMs, 55:17.680 --> 55:19.600 GRUs, and of course Transformers. 55:19.680 --> 55:21.600 So many 55:21.680 --> 55:23.600 places to go, 55:23.680 --> 55:25.600 and we'll cover that in the future. 55:25.680 --> 55:27.600 For now, bye. Sorry, I forgot to say that 55:27.680 --> 55:29.600 if you are interested, I think 55:29.680 --> 55:31.600 it is kind of interesting to try to beat this number 55:31.600 --> 55:33.520 1.993, because 55:33.600 --> 55:35.520 I really haven't tried a lot of experimentation 55:35.600 --> 55:37.520 here, and there's quite a bit of low-hanging fruit potentially 55:37.600 --> 55:39.520 to still push this further. 55:39.600 --> 55:41.520 So I haven't tried any other 55:41.600 --> 55:43.520 ways of allocating these channels in this 55:43.600 --> 55:45.520 neural net. Maybe the number of 55:45.600 --> 55:47.520 dimensions for the embedding is all 55:47.600 --> 55:49.520 wrong. Maybe it's possible to actually 55:49.600 --> 55:51.520 take the original network with just one hidden layer 55:51.600 --> 55:53.520 and make it big enough and actually 55:53.600 --> 55:55.520 beat my fancy hierarchical 55:55.600 --> 55:57.520 network. It's not obvious. 55:57.600 --> 55:59.520 That would be kind of embarrassing if this 55:59.600 --> 56:01.520 did not do better, even once you torture 56:01.520 --> 56:03.440 it a little bit.
Maybe you can read the 56:03.520 --> 56:05.440 WaveNet paper and try to figure out how some of these 56:05.520 --> 56:07.440 layers work and implement them yourselves using 56:07.520 --> 56:09.440 what we have. And of course 56:09.520 --> 56:11.440 you can always tune some of the initialization 56:11.520 --> 56:13.440 or some of the optimization 56:13.520 --> 56:15.440 and see if you can improve it that way. 56:15.520 --> 56:17.440 So I'd be curious if people can come up with some 56:17.520 --> 56:19.440 ways to beat this. 56:19.520 --> 56:21.440 And yeah, that's it for now. Bye.