WEBVTT 00:00.000 --> 00:04.380 Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade 00:04.380 --> 00:09.120 and in this lecture I'd like to show you what neural network training looks like under the hood. 00:09.840 --> 00:14.080 So in particular we are going to start with a blank Jupyter notebook and by the end of this 00:14.080 --> 00:18.640 lecture we will define and train a neural net and you'll get to see everything that goes on 00:18.640 --> 00:23.680 under the hood and exactly sort of how that works on an intuitive level. Now specifically what I 00:23.680 --> 00:29.240 would like to do is I would like to take you through the building of micrograd. Now micrograd 00:29.240 --> 00:34.080 is this library that I released on GitHub about two years ago but at the time I only uploaded the 00:34.080 --> 00:39.940 source code and you'd have to go in by yourself and really figure out how it works. So in this 00:39.940 --> 00:43.740 lecture I will take you through it step by step and kind of comment on all the pieces of it. 00:44.140 --> 00:51.520 So what is micrograd and why is it interesting? Micrograd is basically an autograd 00:51.520 --> 00:56.580 engine. Autograd is short for automatic gradient and really what it does is it implements back 00:56.580 --> 00:59.240 propagation. Now back propagation is this algorithm
00:59.240 --> 01:05.320 that allows you to efficiently evaluate the gradient of some kind of a loss function with 01:05.320 --> 01:10.160 respect to the weights of a neural network and what that allows us to do then is we can 01:10.160 --> 01:14.100 iteratively tune the weights of that neural network to minimize the loss function and 01:14.100 --> 01:18.820 therefore improve the accuracy of the network. So back propagation would be at the mathematical 01:18.820 --> 01:24.720 core of any modern deep neural network library like say PyTorch or JAX. So the functionality 01:24.720 --> 01:28.780 of micrograd is I think best illustrated by an example. So if we just scroll down here 01:28.780 --> 01:33.460 you'll see that micrograd basically allows you to build out mathematical expressions 01:33.460 --> 01:38.580 and here what we are doing is we have an expression that we're building out where you have two 01:38.580 --> 01:45.780 inputs a and b and you'll see that a and b are negative four and two but we are wrapping those 01:45.780 --> 01:51.820 values into this value object that we are going to build out as part of micrograd. So this value 01:51.820 --> 01:57.160 object will wrap the numbers themselves and then we are going to build out a mathematical expression 01:57.160 --> 01:58.760 here where a and b 01:58.760 --> 02:03.300 are transformed into C, D, and eventually E, F, and G. 02:03.940 --> 02:07.040 And I'm showing some of the functionality of micrograd 02:07.040 --> 02:08.460 and the operations that it supports. 02:08.820 --> 02:10.680 So you can add two value objects. 02:10.880 --> 02:11.780 You can multiply them. 02:12.160 --> 02:14.160 You can raise them to a constant power. 02:14.600 --> 02:17.460 You can offset by one, negate, squash at zero, 02:18.200 --> 02:22.000 square, divide by a constant, divide by it, et cetera.
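As a warm-up, the kinds of operations listed above can be tried on plain Python floats first, with no value objects and no autograd. This is a shortened, made-up expression for illustration, not the exact one from the micrograd README:

```python
# Float-only warm-up mirroring the kinds of ops micrograd supports:
# add, multiply, raise to a constant power, square, divide.
# (Shortened, made-up expression; not the exact README example.)
a, b = -4.0, 2.0
c = a + b            # add two values
d = a * b + b**3     # multiply, and raise to a constant power
e = c - d
f = e**2             # square
g = f / 2.0          # divide by a constant
g += 10.0 / f        # divide by it
print(g)             # a plain float result
```

The difference in micrograd is that every one of these intermediate steps is wrapped and recorded, so the gradient can later be backpropagated through them.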
02:22.820 --> 02:24.680 And so we're building out an expression graph 02:24.680 --> 02:26.980 with these two inputs, A and B, 02:26.980 --> 02:29.520 and we're creating an output value of G. 02:30.580 --> 02:32.420 And micrograd will, in the background, 02:32.820 --> 02:34.640 build out this entire mathematical expression. 02:35.200 --> 02:38.020 So it will, for example, know that C is also a value. 02:38.840 --> 02:40.800 C was a result of an addition operation. 02:41.500 --> 02:45.860 And the child nodes of C are A and B 02:45.860 --> 02:49.620 because it will maintain pointers to A and B value objects. 02:49.940 --> 02:52.580 So we'll basically know exactly how all of this is laid out. 02:53.100 --> 02:56.320 And then not only can we do what we call the forward pass, 02:56.320 --> 02:56.900 where we actually, 02:56.980 --> 02:58.360 if we look at the value of G, of course, 02:58.440 --> 02:59.380 that's pretty straightforward, 02:59.960 --> 03:02.680 we will access that using the dot data attribute. 03:02.920 --> 03:05.320 And so the output of the forward pass, 03:05.760 --> 03:08.580 the value of G, is 24.7, it turns out. 03:08.860 --> 03:12.620 But the big deal is that we can also take this G value object 03:12.620 --> 03:14.220 and we can call dot backward. 03:14.760 --> 03:18.600 And this will basically initialize backpropagation at the node G. 03:19.980 --> 03:21.360 And what backpropagation is going to do 03:21.360 --> 03:22.580 is it's going to start at G 03:22.580 --> 03:26.040 and it's going to go backwards through that expression graph 03:26.040 --> 03:26.880 and it's going to 03:26.980 --> 03:29.640 recursively apply the chain rule from calculus.
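To make the "child nodes" idea concrete, here is a minimal sketch, written from scratch for this note rather than copied from micrograd, of a wrapper that records which nodes produced a result:

```python
class Value:
    """Toy sketch: wrap a number and remember which nodes produced it."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # pointers back to the child value objects
        self._op = _op               # the operation that produced this node

    def __add__(self, other):
        # the new node keeps pointers to self and other, its children
        return Value(self.data + other.data, (self, other), '+')

a = Value(-4.0)
b = Value(2.0)
c = a + b
print(c.data, c._op)      # prints: -2.0 +
print(c._prev == {a, b})  # prints: True  (c's children are exactly a and b)
```

This is exactly the bookkeeping that lets backpropagation later walk from the output back through the graph.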
03:30.280 --> 03:31.920 And what that allows us to do then 03:32.280 --> 03:35.480 is we're going to evaluate basically the derivative of G 03:35.760 --> 03:39.900 with respect to all the internal nodes like E, D, and C, 03:40.260 --> 03:42.760 but also with respect to the inputs A and B. 03:43.400 --> 03:46.900 And then we can actually query this derivative of G 03:46.900 --> 03:49.600 with respect to A, for example, that's A dot grad. 03:49.960 --> 03:51.540 In this case, it happens to be 138. 03:52.040 --> 03:54.040 And the derivative of G with respect to B, 03:54.440 --> 03:56.940 which also happens to be here, 645. 03:57.520 --> 03:59.240 And this derivative, we'll see soon, 03:59.340 --> 04:00.540 is very important information 04:00.540 --> 04:04.660 because it's telling us how A and B are affecting G 04:04.660 --> 04:06.200 through this mathematical expression. 04:06.820 --> 04:09.940 So in particular, A dot grad is 138. 04:10.060 --> 04:13.980 So if we slightly nudge A and make it slightly larger, 04:14.960 --> 04:17.480 138 is telling us that G will grow 04:17.480 --> 04:20.140 and the slope of that growth is going to be 138. 04:20.880 --> 04:23.700 And the slope of growth of B is going to be 645. 04:24.320 --> 04:26.780 So that's going to tell us about how G will respond, 04:26.980 --> 04:29.160 if A and B get tweaked a tiny amount 04:29.160 --> 04:31.400 in a positive direction, okay? 04:33.300 --> 04:36.220 Now, you might be confused about what this expression is 04:36.220 --> 04:37.060 that we built out here. 04:37.200 --> 04:39.360 And this expression, by the way, is completely meaningless. 04:39.640 --> 04:40.720 I just made it up. 04:40.820 --> 04:42.760 I'm just flexing about the kinds of operations 04:42.760 --> 04:44.360 that are supported by micrograd. 04:44.960 --> 04:47.000 What we actually really care about are neural networks. 
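The "nudge an input and watch the output" reading of these grad values can be checked numerically on any expression. The expression in the video is more elaborate, so here is the same idea sketched with a simple made-up expression g(a, b, c) = a*b + c:

```python
# Numerically estimate how g responds to a small nudge of each input.
# g here is a simple made-up expression, not the one from the README.
def g(a, b, c):
    return a * b + c

h = 0.0001
a, b, c = 2.0, -3.0, 10.0
base = g(a, b, c)
slope_a = (g(a + h, b, c) - base) / h   # how g responds to nudging a
slope_b = (g(a, b + h, c) - base) / h   # how g responds to nudging b
slope_c = (g(a, b, c + h) - base) / h   # how g responds to nudging c
print(slope_a, slope_b, slope_c)        # close to b, a, and 1 respectively
```

Backpropagation computes these same sensitivities, but exactly and for all inputs in a single backward pass rather than one nudge at a time.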
04:47.460 --> 04:48.700 But it turns out that neural networks 04:48.700 --> 04:51.520 are just mathematical expressions, just like this one, 04:51.780 --> 04:53.720 but actually even slightly less crazy. 04:54.820 --> 04:56.720 Neural networks are just a mathematical expression: 04:57.100 --> 04:59.500 they take the input data as an input, 04:59.800 --> 05:01.980 and they take the weights of the neural network as an input, 05:02.340 --> 05:03.500 and the output of that mathematical expression 05:03.860 --> 05:06.300 is the predictions of your neural net, 05:06.300 --> 05:07.480 or the loss function. 05:07.620 --> 05:08.400 We'll see this in a bit. 05:09.000 --> 05:11.260 But basically, neural networks just happen to be 05:11.260 --> 05:13.200 a certain class of mathematical expressions. 05:13.860 --> 05:16.540 But backpropagation is actually significantly more general. 05:16.840 --> 05:19.040 It doesn't actually care about neural networks at all. 05:19.180 --> 05:21.620 It only cares about arbitrary mathematical expressions. 05:22.040 --> 05:24.140 And then we happen to use that machinery 05:24.140 --> 05:25.980 for training of neural networks. 05:26.500 --> 05:26.920 Now, one more 05:26.980 --> 05:28.480 note I would like to make at this stage 05:28.480 --> 05:29.420 is that, as you see here, 05:29.540 --> 05:31.920 micrograd is a scalar-valued autograd engine. 05:32.340 --> 05:35.220 So it's working on the level of individual scalars, 05:35.300 --> 05:36.440 like negative four and two. 05:36.820 --> 05:37.740 And we're taking neural nets 05:37.740 --> 05:38.600 and we're breaking them down 05:38.600 --> 05:41.220 all the way to these atoms of individual scalars 05:41.220 --> 05:42.760 and all the little pluses and times, 05:42.920 --> 05:44.200 and it's just excessive. 05:44.900 --> 05:45.540 And so, obviously, 05:45.660 --> 05:47.580 you would never be doing any of this in production.
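As a tiny illustration of "a neural network is just a mathematical expression": a single neuron with two inputs can be written out directly as one expression. The specific numbers below are made up for illustration:

```python
import math

# One neuron as a single plain mathematical expression:
# inputs x1, x2; weights w1, w2; bias b; tanh squashing activation.
# All specific numbers here are made up for illustration.
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432
out = math.tanh(x1*w1 + x2*w2 + b)
print(out)   # roughly 0.7071
```

A full network is just many of these expressions composed together, with a loss function as the final output.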
05:47.920 --> 05:49.840 It's really just done for pedagogical reasons 05:49.840 --> 05:51.740 because it allows us to not have to deal 05:51.740 --> 05:53.640 with these n-dimensional tensors 05:53.640 --> 05:56.400 that you would use in a modern deep neural network library. 05:56.980 --> 05:59.520 So this is really done so that you understand 05:59.520 --> 06:01.300 backpropagation and the chain rule 06:01.300 --> 06:04.280 and get an intuitive understanding of neural network training. 06:04.960 --> 06:07.360 And then, if you actually want to train bigger networks, 06:07.540 --> 06:08.800 you have to be using these tensors, 06:09.080 --> 06:10.240 but none of the math changes. 06:10.380 --> 06:11.760 This is done purely for efficiency. 06:12.360 --> 06:15.200 We are basically taking all the scalar values, 06:15.520 --> 06:16.980 we're packaging them up into tensors, 06:17.260 --> 06:19.020 which are just arrays of these scalars. 06:19.440 --> 06:21.720 And then, because we have these large arrays, 06:21.720 --> 06:23.860 we're doing operations on those large arrays, 06:23.860 --> 06:26.700 and that allows us to take advantage of the parallelism 06:26.700 --> 06:27.200 in a computer. 06:27.840 --> 06:30.000 And all those operations can be done in parallel, 06:30.280 --> 06:31.880 and then the whole thing runs faster. 06:32.320 --> 06:33.580 But really, none of the math changes, 06:33.740 --> 06:34.980 and it's done purely for efficiency. 06:35.400 --> 06:37.220 So I don't think that it's pedagogically useful 06:37.220 --> 06:39.080 to be dealing with tensors from scratch. 06:39.680 --> 06:42.040 And that's fundamentally why I wrote micrograd, 06:42.360 --> 06:43.920 because you can understand how things work 06:43.920 --> 06:45.360 at the fundamental level, 06:45.720 --> 06:47.240 and then you can speed it up later. 06:48.160 --> 06:49.160 Okay, so here's the fun part.
06:49.500 --> 06:51.580 My claim is that micrograd is what you need 06:51.580 --> 06:52.600 to train neural networks, 06:52.600 --> 06:54.300 and everything else is just efficiency. 06:54.920 --> 06:56.280 So you'd think that micrograd would be 06:56.700 --> 06:58.240 a very complex piece of code. 06:58.500 --> 07:00.960 And that turns out to not be the case. 07:01.420 --> 07:03.040 So if we just go to micrograd, 07:03.540 --> 07:07.040 you'll see that there are only two files here in micrograd. 07:07.340 --> 07:08.540 This is the actual engine. 07:08.780 --> 07:10.280 It doesn't know anything about neural nets. 07:10.580 --> 07:12.580 And this is the entire neural nets library 07:13.160 --> 07:14.160 on top of micrograd. 07:14.360 --> 07:17.080 So engine.py and nn.py. 07:17.620 --> 07:20.760 So the actual backpropagation autograd engine 07:21.660 --> 07:23.320 that gives you the power of neural networks 07:23.760 --> 07:24.760 is literally 07:24.840 --> 07:26.400 100 lines of code 07:26.400 --> 07:28.400 of, like, very simple Python, 07:28.400 --> 07:30.400 which we'll understand by the end of this lecture. 07:30.400 --> 07:32.400 And then nn.py, 07:32.400 --> 07:34.400 this neural network library 07:34.400 --> 07:36.400 built on top of the autograd engine, 07:36.400 --> 07:38.400 is like a joke. 07:38.400 --> 07:40.400 It's like, we have to define what is a neuron, 07:40.400 --> 07:42.400 and then we have to define what is a layer of neurons, 07:42.400 --> 07:44.400 and then we define what is a multilayer perceptron, 07:44.400 --> 07:46.400 which is just a sequence of layers of neurons. 07:46.400 --> 07:48.400 And so it's just a total joke. 07:48.400 --> 07:50.400 So basically, 07:50.400 --> 07:52.400 there's a lot of power 07:52.400 --> 07:58.200 that comes from only 150 lines of code.
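To give a flavor of what those roughly 100 lines in engine.py do, here is a heavily condensed sketch written from scratch for this note (addition and multiplication only), not the actual repo code:

```python
class Value:
    """Condensed sketch of a scalar autograd node (add and mul only)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                  # derivative of the output w.r.t. this node
        self._backward = lambda: None    # how to pass the gradient to the children
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # addition routes the gradient to both children unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # multiplication scales the gradient by the other factor
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # sort the graph topologically, then apply the chain rule back to front
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c    # forward pass
d.backward()     # backward pass
print(d.data, a.grad, b.grad, c.grad)   # 4.0 -3.0 2.0 1.0
```

The lecture builds essentially this, piece by piece, plus more operations and nicer ergonomics.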
07:58.200 --> 08:02.200 And that's all you need to understand neural network training. 08:02.200 --> 08:04.200 And everything else is just efficiency. 08:04.200 --> 08:06.200 And of course, there's a lot to efficiency. 08:06.200 --> 08:08.200 But fundamentally, that's all that's happening. 08:08.200 --> 08:10.200 Okay, so now let's dive right in 08:10.200 --> 08:12.200 and implement micrograd step by step. 08:12.200 --> 08:14.200 The first thing I'd like to do is I'd like to make sure 08:14.200 --> 08:16.200 that you have a very good understanding, intuitively, 08:16.200 --> 08:18.200 of what a derivative is 08:18.200 --> 08:20.200 and exactly what information it gives you. 08:20.200 --> 08:22.200 So let's start with some basic imports 08:22.200 --> 08:24.200 that I copy-paste in every Jupyter Notebook, always. 08:24.200 --> 08:28.200 And let's define a function, 08:28.200 --> 08:30.200 a scalar-valued function, 08:30.200 --> 08:32.200 f of x, as follows. 08:32.200 --> 08:34.200 So I just made this up randomly. 08:34.200 --> 08:36.200 I just wanted a scalar-valued function 08:36.200 --> 08:38.200 that takes a single scalar x 08:38.200 --> 08:40.200 and returns a single scalar y. 08:40.200 --> 08:42.200 And we can call this function, of course, 08:42.200 --> 08:44.200 so we can pass in, say, 3.0 08:44.200 --> 08:46.200 and get 20 back. 08:46.200 --> 08:48.200 Now, we can also plot this function 08:48.200 --> 08:50.200 to get a sense of its shape. 08:50.200 --> 08:52.200 You can tell from the mathematical expression 08:52.200 --> 08:54.200 that this is probably a parabola. 08:54.200 --> 08:56.200 It's a quadratic. 08:56.200 --> 08:58.200 So we can make a set of scalar values that we can feed in, 08:58.200 --> 09:00.200 using, for example, a range 09:00.200 --> 09:02.200 from negative 5 to 5 09:02.200 --> 09:04.200 in steps of 0.25.
09:04.200 --> 09:06.200 So x is just 09:06.200 --> 09:08.200 from negative 5 to 5, 09:08.200 --> 09:10.200 not including 5, 09:10.200 --> 09:12.200 in steps of 0.25. 09:12.200 --> 09:14.200 And we can actually call this function 09:14.200 --> 09:16.200 on this numpy array as well. 09:16.200 --> 09:18.200 So we get a set of y's 09:18.200 --> 09:20.200 if we call f on x. 09:20.200 --> 09:22.200 And these y's are basically the result 09:22.200 --> 09:24.200 of applying the function 09:24.200 --> 09:26.200 to every one of these elements independently. 09:26.200 --> 09:28.200 Let's plot this using Matplotlib. 09:28.200 --> 09:30.200 So plt.plot, x's and y's, 09:30.200 --> 09:32.200 and we get a nice parabola. 09:32.200 --> 09:34.200 So previously here we fed in 3.0 09:34.200 --> 09:36.200 somewhere here, and we received 09:36.200 --> 09:38.200 20 back, which is here 09:38.200 --> 09:40.200 the y coordinate. 09:40.200 --> 09:42.200 So now I'd like to think through 09:42.200 --> 09:44.200 what is the derivative of this function 09:44.200 --> 09:46.200 at any single input point x? 09:46.200 --> 09:48.200 So what is the derivative at different points x 09:48.200 --> 09:50.200 of this function? 09:50.200 --> 09:52.200 Now if you remember back to your calculus class 09:52.200 --> 09:54.200 you've probably derived derivatives. 09:54.200 --> 09:56.200 So you would take this mathematical expression, 09:56.200 --> 09:58.200 3x squared minus 4x plus 5, and you would write it out 09:58.200 --> 10:00.200 on a piece of paper and you would 10:00.200 --> 10:02.200 apply the product rule and all the other rules 10:02.200 --> 10:04.200 and derive the mathematical expression 10:04.200 --> 10:06.200 of the derivative of the original function. 10:06.200 --> 10:08.200 And then you could plug in different x's 10:08.200 --> 10:10.200 and see what the derivative is.
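The notebook uses numpy's arange and a matplotlib plot for this; a dependency-free sketch of the same function and grid of x values, for reference:

```python
# The quadratic from the lecture, with a plain Python list standing in
# for np.arange(-5, 5, 0.25) and the matplotlib plot.
def f(x):
    return 3*x**2 - 4*x + 5

xs = [-5 + 0.25*i for i in range(40)]   # -5.0, -4.75, ..., 4.75 (5 excluded)
ys = [f(x) for x in xs]                 # f applied to each element independently
print(f(3.0))   # prints: 20.0
```

Plotting xs against ys (e.g. with plt.plot(xs, ys)) gives the parabola described above.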
10:10.200 --> 10:12.200 We're not going to actually do that 10:12.200 --> 10:14.200 because no one in neural networks 10:14.200 --> 10:16.200 actually writes out the expression for the neural net. 10:16.200 --> 10:18.200 It would be a massive expression. 10:18.200 --> 10:20.200 It would be thousands, tens of thousands of terms. 10:20.200 --> 10:22.200 No one actually derives the derivative, 10:22.200 --> 10:24.200 of course. 10:24.200 --> 10:26.200 And so we're not going to take this kind of symbolic approach. 10:26.200 --> 10:28.200 Instead what I'd like to do is I'd like to look at the 10:28.200 --> 10:30.200 definition of derivative and just make sure 10:30.200 --> 10:32.200 that we really understand what the derivative is measuring, 10:32.200 --> 10:34.200 what it's telling you about the function. 10:34.200 --> 10:36.200 And so if we just look up 10:36.200 --> 10:38.200 derivative 10:42.200 --> 10:44.200 we see that 10:44.200 --> 10:46.200 this is not a very good definition of derivative, 10:46.200 --> 10:48.200 this is a definition of what it means to be differentiable, 10:48.200 --> 10:50.200 but if you remember from your calculus, 10:50.200 --> 10:52.200 it is the limit as h goes to 0 10:52.200 --> 10:54.200 of f of x plus h minus f of x 10:54.200 --> 10:56.200 over h. 10:56.200 --> 10:58.200 And basically what it's saying is, 10:58.200 --> 11:00.200 if at some point x 11:00.200 --> 11:02.200 that you're interested in, or a, 11:02.200 --> 11:04.200 you slightly bump it up, 11:04.200 --> 11:06.200 you slightly increase it by 11:06.200 --> 11:08.200 a small number h, 11:08.200 --> 11:10.200 how does the function respond? 11:10.200 --> 11:12.200 With what sensitivity does it respond? 11:12.200 --> 11:14.200 What is the slope at that point? 11:14.200 --> 11:16.200 Does the function go up or does it go down? 11:16.200 --> 11:18.200 And by how much?
11:18.200 --> 11:20.200 And that's the slope of that function, 11:20.200 --> 11:22.200 the slope of that response at that point. 11:22.200 --> 11:24.200 And so we can basically evaluate 11:24.200 --> 11:26.200 the derivative here numerically 11:26.200 --> 11:28.200 by taking a very small h. 11:28.200 --> 11:30.200 Of course the definition would ask us to take h to 0. 11:30.200 --> 11:32.200 We're just going to pick a very small h, 11:32.200 --> 11:34.200 0.001, 11:34.200 --> 11:36.200 and let's say we're interested in point 3.0, 11:36.200 --> 11:38.200 so we can look at f of x, which is of course 20, 11:38.200 --> 11:40.200 and now f of x plus h. 11:40.200 --> 11:42.200 So if we slightly nudge 11:42.200 --> 11:44.200 x in a positive direction, 11:44.200 --> 11:46.200 how is the function going to respond? 11:46.200 --> 11:48.200 And just looking at this, 11:48.200 --> 11:50.200 do you expect f of x plus h to be slightly greater 11:50.200 --> 11:52.200 than 20? 11:52.200 --> 11:54.200 Or do you expect it to be slightly lower than 20?
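The numerical estimate being set up here, as a short sketch (same f as in the lecture):

```python
# Numerical slope of f at x = 3: rise over run with a small h.
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.001
x = 3.0
slope = (f(x + h) - f(x)) / h
print(slope)   # roughly 14, matching the analytic derivative 6x - 4 at x = 3
```

Note that because h is small but finite, the result is only an approximation; it approaches the exact derivative as h shrinks, up to floating point limits.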
11:54.200 --> 11:56.200 And so since 3 is here 11:56.200 --> 11:58.200 and this is 20, 11:58.200 --> 12:00.200 if we slightly go positively, 12:00.200 --> 12:02.200 the function will respond positively, 12:02.200 --> 12:04.200 so you'd expect this to be slightly greater than 20. 12:04.200 --> 12:06.200 And now by how much 12:06.200 --> 12:08.200 is telling you the 12:08.200 --> 12:10.200 strength of that slope, 12:10.200 --> 12:12.200 the size of that slope. 12:12.200 --> 12:14.200 So f of x plus h minus f of x, 12:14.200 --> 12:16.200 this is how much the function responded 12:16.200 --> 12:18.200 in a positive direction, 12:18.200 --> 12:20.200 and we have to normalize by the run, 12:20.200 --> 12:22.200 so we have the rise over run 12:22.200 --> 12:24.200 to get the slope. 12:24.200 --> 12:26.200 So this of course is just a numerical approximation 12:26.200 --> 12:28.200 of the slope, 12:28.200 --> 12:30.200 because we have to make h very very small 12:30.200 --> 12:32.200 to converge to the exact amount. 12:32.200 --> 12:34.200 Now if I use too many zeros, 12:34.200 --> 12:36.200 at some point 12:36.200 --> 12:38.200 I'm going to get an incorrect answer, 12:38.200 --> 12:40.200 because we're using floating point arithmetic 12:40.200 --> 12:42.200 and the representations of all these numbers 12:42.200 --> 12:44.200 in computer memory is finite, 12:44.200 --> 12:46.200 and at some point we get into trouble. 12:46.200 --> 12:48.200 So we can converge towards the right answer 12:48.200 --> 12:50.200 with this approach. 12:50.200 --> 12:52.200 But basically, 12:52.200 --> 12:54.200 at 3 the slope is 14, 12:54.200 --> 12:56.200 and you can see that by taking 12:56.200 --> 12:58.200 3x squared minus 4x plus 5 12:58.200 --> 13:00.200 and differentiating it in our head. 13:00.200 --> 13:02.200 So the derivative would be 13:02.200 --> 13:04.200 6x minus 4, 13:04.200 --> 13:06.200 and then we plug in x equals 3, 13:06.200 --> 13:08.200 so that's 18 minus 4 is 14, 13:08.200 --> 13:10.200 so this
is correct. 13:10.200 --> 13:12.200 So that's at 3. 13:12.200 --> 13:14.200 Now how about 13:14.200 --> 13:16.200 the slope at, say, negative 3? 13:16.200 --> 13:18.200 What would you expect 13:18.200 --> 13:20.200 for the slope? 13:20.200 --> 13:22.200 Now telling the exact value is really hard, 13:22.200 --> 13:24.200 but what is the sign of that slope? 13:24.200 --> 13:26.200 So at negative 3, 13:26.200 --> 13:28.200 if we slightly go in the positive direction 13:28.200 --> 13:30.200 at x, 13:30.200 --> 13:32.200 the function would actually go down, 13:32.200 --> 13:34.200 and so that tells you that the slope would be negative. 13:34.200 --> 13:36.200 So we'll get a number slightly below 13:36.200 --> 13:38.200 f of negative 3, 13:38.200 --> 13:40.200 and so if we take the slope, 13:40.200 --> 13:42.200 we expect something negative: 13:42.200 --> 13:44.200 negative 22. 13:44.200 --> 13:46.200 And at some point here, of course, 13:46.200 --> 13:48.200 the slope would be 0. 13:48.200 --> 13:50.200 Now for this specific function, 13:50.200 --> 13:52.200 I looked it up previously, 13:52.200 --> 13:54.200 and it's at the point 2 over 3. 13:54.200 --> 13:56.200 So at roughly 2 over 3, 13:56.200 --> 13:58.200 this derivative would be 0. 13:58.200 --> 14:00.200 So basically, 14:00.200 --> 14:06.200 at that precise point, 14:06.200 --> 14:08.200 if we nudge in a positive direction, 14:08.200 --> 14:10.200 the function doesn't respond, 14:10.200 --> 14:12.200 it stays the same, almost, 14:12.200 --> 14:14.200 and so that's why the slope is 0. 14:14.200 --> 14:16.200 Okay, now let's look at a bit more complex case. 14:16.200 --> 14:18.200 So we're going to start complexifying a bit. 14:18.200 --> 14:20.200 So now we have a function 14:20.200 --> 14:22.200 here 14:22.200 --> 14:24.200 with output variable d 14:24.200 --> 14:26.200 that is a function of 3 scalar inputs, 14:26.200 --> 14:28.200 so a, b and c are some specific values, 14:28.200 --> 14:30.200 3
inputs into our expression graph 14:30.200 --> 14:32.200 and a single output d. 14:32.200 --> 14:34.200 And so if we just print d, 14:34.200 --> 14:36.200 we get 4. 14:36.200 --> 14:38.200 And now what I'd like to do is 14:38.200 --> 14:40.200 I'd like to again look at the derivatives of d 14:40.200 --> 14:42.200 with respect to a, b and c, 14:42.200 --> 14:44.200 and think through 14:44.200 --> 14:46.200 again just the intuition of what this derivative 14:46.200 --> 14:48.200 is telling us. 14:48.200 --> 14:50.200 So in order to evaluate this derivative, 14:50.200 --> 14:52.200 we're going to get a bit hacky here. 14:52.200 --> 14:54.200 We're going to again have a very small 14:54.200 --> 14:56.200 value of h, and then we're going to 14:56.200 --> 14:58.200 fix the inputs at some 14:58.200 --> 15:00.200 values that we're interested in. 15:00.200 --> 15:04.200 So this is the point a, b, c at which we're going to be evaluating 15:04.200 --> 15:06.200 the derivative of d 15:06.200 --> 15:08.200 with respect to all of a, b and c 15:08.200 --> 15:10.200 at that point. 15:10.200 --> 15:12.200 So those are the inputs, and now we have d1 15:12.200 --> 15:14.200 as that expression, 15:14.200 --> 15:16.200 and then we're going to, for example, look at the derivative of d 15:16.200 --> 15:18.200 with respect to a. 15:18.200 --> 15:20.200 So we'll take a and we'll bump it by h, 15:20.200 --> 15:22.200 and then we'll get d2 to be the exact same 15:22.200 --> 15:24.200 function. 15:24.200 --> 15:26.200 And now we're going to print, 15:26.200 --> 15:28.200 you know, 15:28.200 --> 15:30.200 d1 is d1, 15:30.200 --> 15:32.200 d2 is d2, 15:32.200 --> 15:34.200 and print the slope. 15:34.200 --> 15:36.200 So the derivative, 15:36.200 --> 15:38.200 or slope here, 15:38.200 --> 15:40.200 will be of course 15:40.200 --> 15:42.200 d2 15:42.200 --> 15:44.200 minus d1 divided by h. 15:44.200 --> 15:46.200 So d2 minus d1 is how 15:46.200 --> 15:48.200 much the function increased
15:48.200 --> 15:50.200 when we bumped 15:50.200 --> 15:52.200 the specific 15:52.200 --> 15:54.200 input that we're interested in 15:54.200 --> 15:56.200 by a tiny amount, 15:56.200 --> 15:58.200 and this is then normalized by 15:58.200 --> 16:00.200 h to get the slope. 16:02.200 --> 16:10.200 So if I just run this, 16:10.200 --> 16:12.200 we're going to print d1, 16:12.200 --> 16:14.200 which we know is 16:14.200 --> 16:16.200 4. 16:16.200 --> 16:18.200 Now d2 will be bumped, 16:18.200 --> 16:20.200 a will be bumped by h. 16:20.200 --> 16:22.200 So let's just think through 16:22.200 --> 16:24.200 a little bit 16:24.200 --> 16:26.200 what d2 will be, 16:26.200 --> 16:28.200 printed out here in particular. 16:28.200 --> 16:30.200 d1 will be 4. 16:30.200 --> 16:32.200 Will d2 be 16:32.200 --> 16:34.200 a number slightly greater than 4 16:34.200 --> 16:36.200 or slightly lower than 4? 16:36.200 --> 16:38.200 And that's going to tell us the 16:38.200 --> 16:40.200 sign of the derivative. 16:40.200 --> 16:42.200 So 16:42.200 --> 16:44.200 we're bumping a by h, 16:44.200 --> 16:46.200 b is minus 3, 16:46.200 --> 16:48.200 c is 10. 16:48.200 --> 16:50.200 So you can just intuitively think through 16:50.200 --> 16:52.200 this derivative and what it's doing. 16:52.200 --> 16:54.200 a will be slightly more positive, 16:54.200 --> 16:56.200 but b is a negative 16:56.200 --> 16:58.200 number, so if a is 16:58.200 --> 17:00.200 slightly more positive, 17:00.200 --> 17:02.200 because b is negative 3, 17:02.200 --> 17:04.200 we're actually going to be 17:04.200 --> 17:06.200 adding less to 17:06.200 --> 17:08.200 d. 17:08.200 --> 17:10.200 So you'd actually expect that 17:10.200 --> 17:12.200 the value of the function will go 17:12.200 --> 17:14.200 down. So let's 17:14.200 --> 17:16.200 just see this. 17:16.200 --> 17:18.200 Yeah, and so we went from 4 17:18.200 --> 17:20.200 to 3.996, 17:20.200 --> 17:22.200 and
that tells you that the slope will 17:22.200 --> 17:24.200 be negative, 17:24.200 --> 17:26.200 it will be a negative number, 17:26.200 --> 17:28.200 because we went down. And 17:28.200 --> 17:30.200 the exact amount 17:30.200 --> 17:32.200 of slope 17:32.200 --> 17:34.200 is negative 3. And you can 17:34.200 --> 17:36.200 also convince yourself that negative 3 is the right 17:36.200 --> 17:38.200 answer mathematically and analytically, 17:38.200 --> 17:40.200 because if you have a times b plus 17:40.200 --> 17:42.200 c and, you know, you have 17:42.200 --> 17:44.200 calculus, then 17:44.200 --> 17:46.200 differentiating a times b plus c with 17:46.200 --> 17:48.200 respect to a gives you just b, 17:48.200 --> 17:50.200 and indeed the value of b 17:50.200 --> 17:52.200 is negative 3, which is the derivative that we have, 17:52.200 --> 17:54.200 so you can tell that that's correct. 17:54.200 --> 17:56.200 So now if we do this 17:56.200 --> 17:58.200 with b, so if we 17:58.200 --> 18:00.200 bump b by a little bit in a positive 18:00.200 --> 18:02.200 direction, we'd get a different 18:02.200 --> 18:04.200 slope. So what is the influence of b 18:04.200 --> 18:06.200 on the output d? 18:06.200 --> 18:08.200 So if we bump b by a tiny amount 18:08.200 --> 18:10.200 in a positive direction, then because a 18:10.200 --> 18:12.200 is positive, we'll be 18:12.200 --> 18:14.200 adding more to d, right? 18:14.200 --> 18:16.200 So and now, 18:16.200 --> 18:18.200 what is the sensitivity, what is the 18:18.200 --> 18:20.200 slope of that addition? And 18:20.200 --> 18:22.200 it might not surprise you that this should be 18:22.200 --> 18:24.200 2. 18:24.200 --> 18:26.200 And why is it 2? Because 18:26.200 --> 18:28.200 d of d by db, 18:28.200 --> 18:30.200 differentiating with respect to b, 18:30.200 --> 18:32.200 would give us a, and 18:32.200 --> 18:34.200 the value of a is 2, so that's also 18:34.200 --> 18:36.200 working well. And then if c
18:36.200 --> 18:38.200 gets bumped by a tiny amount 18:38.200 --> 18:40.200 h, then 18:40.200 --> 18:42.200 of course a times b is unaffected, and 18:42.200 --> 18:44.200 now c becomes slightly higher. 18:44.200 --> 18:46.200 What does that do to the function? It 18:46.200 --> 18:48.200 makes it slightly higher, because we're simply adding 18:48.200 --> 18:50.200 c, and it makes it slightly 18:50.200 --> 18:52.200 higher by the exact same amount that we 18:52.200 --> 18:54.200 added to c, and so that tells you 18:54.200 --> 18:56.200 that the slope is 1. 18:56.200 --> 18:58.200 That will be the 18:58.200 --> 19:00.200 rate at which 19:00.200 --> 19:02.200 d will increase 19:02.200 --> 19:04.200 as we scale 19:04.200 --> 19:06.200 c. Okay, so we now have some 19:06.200 --> 19:08.200 intuitive sense of what this derivative is telling you 19:08.200 --> 19:10.200 about the function, and we'd like to move to 19:10.200 --> 19:12.200 neural networks. Now as I mentioned, neural networks 19:12.200 --> 19:14.200 will be pretty massive mathematical 19:14.200 --> 19:16.200 expressions, so we need some data structures 19:16.200 --> 19:18.200 that maintain these expressions, and that's what 19:18.200 --> 19:20.200 we're going to start to build out now. 19:20.200 --> 19:22.200 So we're going to 19:22.200 --> 19:24.200 build out this value object that I 19:24.200 --> 19:26.200 showed you in the readme page 19:26.200 --> 19:28.200 of micrograd. So let me 19:28.200 --> 19:30.200 copy paste a skeleton 19:30.200 --> 19:32.200 of the first very simple value 19:32.200 --> 19:34.200 object. So class 19:34.200 --> 19:36.200 Value takes a single 19:36.200 --> 19:38.200 scalar value that it wraps and 19:38.200 --> 19:40.200 keeps track of, and that's 19:40.200 --> 19:42.200 it. So we can for example 19:42.200 --> 19:44.200 do Value of 2.0, and then we can 19:44.200 --> 19:48.200 look at its content, and 19:48.200 --> 19:50.200 Python will
internally 19:50.200 --> 19:52.200 use the repr function 19:52.200 --> 19:54.200 to return 19:54.200 --> 19:56.200 this string, 19:56.200 --> 19:58.200 like that. 19:58.200 --> 20:00.200 So this is a value object with 20:00.200 --> 20:02.200 data equals two that we're creating 20:02.200 --> 20:04.200 here. Now what we'd like to do is 20:04.200 --> 20:06.200 we'd like to be able to 20:06.200 --> 20:08.200 have not just, like, 20:08.200 --> 20:10.200 two values, but 20:10.200 --> 20:12.200 we'd like to do a plus b, right? We'd like 20:12.200 --> 20:14.200 to add them. So currently 20:14.200 --> 20:16.200 you would get an error, because Python 20:16.200 --> 20:18.200 doesn't know how to add two value 20:18.200 --> 20:20.200 objects, so we have to tell it. 20:20.200 --> 20:22.200 So here's 20:22.200 --> 20:24.200 addition. 20:24.200 --> 20:26.200 So 20:26.200 --> 20:28.200 you have to basically use these special 20:28.200 --> 20:30.200 double underscore methods in Python to 20:30.200 --> 20:32.200 define these operators for these 20:32.200 --> 20:34.200 objects. So if we 20:34.200 --> 20:38.200 use this plus 20:38.200 --> 20:40.200 operator, Python will internally 20:40.200 --> 20:42.200 call a dot 20:42.200 --> 20:44.200 add of b. That's 20:44.200 --> 20:46.200 what will happen internally, and so 20:46.200 --> 20:48.200 b will be the other 20:48.200 --> 20:50.200 and self will be 20:50.200 --> 20:52.200 a. And so we see that what we're going 20:52.200 --> 20:54.200 to return is a new value object, and 20:54.200 --> 20:56.200 it's going to be wrapping 20:56.200 --> 20:58.200 the plus of 20:58.200 --> 21:00.200 their data. But remember, 21:00.200 --> 21:02.200 because data is the actual, 21:02.200 --> 21:04.200 like, Python number, 21:04.200 --> 21:06.200 this operator here is just the 21:06.200 --> 21:08.200 typical floating point plus 21:08.200 --> 21:10.200 addition. It's not an addition of value 21:10.200 -->
21:12.200 objects and we'll return 21:12.200 --> 21:14.200 a new value so now a 21:14.200 --> 21:16.200 plus b should work and it should print value 21:16.200 --> 21:18.200 of negative one 21:18.200 --> 21:20.200 because that's two plus minus three 21:20.200 --> 21:22.200 there we go okay let's 21:22.200 --> 21:24.200 now implement multiply 21:24.200 --> 21:26.200 just so we can recreate this expression here 21:26.200 --> 21:28.200 so multiply i think it won't 21:28.200 --> 21:30.200 surprise you will be fairly similar 21:30.200 --> 21:32.200 so instead 21:32.200 --> 21:34.200 of add we're going to be using mul 21:34.200 --> 21:36.200 and then here of course we want to do times 21:36.200 --> 21:38.200 and so now we can create a 21:38.200 --> 21:40.200 c value object which will be 10.0 21:40.200 --> 21:42.200 and now we should be able to do 21:42.200 --> 21:44.200 a times b 21:44.200 --> 21:46.200 well let's just do a times b first 21:46.200 --> 21:48.200 um 21:48.200 --> 21:50.200 that's value of negative six now 21:50.200 --> 21:52.200 and by the way i skipped over this a little 21:52.200 --> 21:54.200 bit suppose that i didn't have the repr 21:54.200 --> 21:56.200 function here then 21:56.200 --> 21:58.200 you'll get some kind of an ugly expression 21:58.200 --> 22:00.200 so what repr is doing 22:00.200 --> 22:02.200 is it's providing us a way to 22:02.200 --> 22:04.200 print out a nicer looking expression in 22:04.200 --> 22:06.200 python so we 22:06.200 --> 22:08.200 don't just have something cryptic we 22:08.200 --> 22:10.200 actually see that it's value of 22:10.200 --> 22:12.200 negative six 22:12.200 --> 22:14.200 so this gives us a times b 22:14.200 --> 22:16.200 and then this we should now be able 22:16.200 --> 22:18.200 to add c to it because we've defined and 22:18.200 --> 22:20.200 told python how to do mul and add 22:20.200 --> 22:22.200 and so this will 22:22.200 --> 22:24.200 basically be equivalent to
a dot 22:24.200 --> 22:26.200 mul 22:26.200 --> 22:28.200 of b and then 22:28.200 --> 22:30.200 this new value object will be dot 22:30.200 --> 22:32.200 add of c 22:32.200 --> 22:34.200 and so let's see if that worked 22:34.200 --> 22:36.200 yep so that worked well that gave 22:36.200 --> 22:38.200 us four which is what we expect from before 22:38.200 --> 22:40.200 and i 22:40.200 --> 22:42.200 believe we can just call them manually as well 22:42.200 --> 22:44.200 there we go so 22:44.200 --> 22:46.200 yeah okay so now what we are 22:46.200 --> 22:48.200 missing is the connective tissue of this 22:48.200 --> 22:50.200 expression as i mentioned we want to keep 22:50.200 --> 22:52.200 these expression graphs so we need to 22:52.200 --> 22:54.200 know and keep pointers about 22:54.200 --> 22:56.200 what values produce what other values 22:56.200 --> 22:58.200 so here for example we are 22:58.200 --> 23:00.200 going to introduce a new variable which 23:00.200 --> 23:02.200 we'll call children and by default it 23:02.200 --> 23:04.200 will be an empty tuple and then we're 23:04.200 --> 23:06.200 actually going to keep a slightly 23:06.200 --> 23:08.200 different variable in the class which 23:08.200 --> 23:10.200 we'll call underscore prev which will be 23:10.200 --> 23:12.200 the set of children 23:12.200 --> 23:14.200 this is how i did it in the 23:14.200 --> 23:16.200 original micrograd looking at my code 23:16.200 --> 23:18.200 here i can't remember exactly the reason 23:18.200 --> 23:20.200 i believe it was efficiency but this 23:20.200 --> 23:22.200 underscore children will be a tuple for 23:22.200 --> 23:24.200 convenience but then when we actually 23:24.200 --> 23:26.200 maintain it in the class it will be just 23:26.200 --> 23:28.200 a set 23:28.200 --> 23:30.200 so now when 23:30.200 --> 23:32.200 we are creating a value like this with a 23:32.200 --> 23:34.200 constructor children will be empty and 23:34.200 --> 23:36.200 prev will be the
empty set but when we 23:36.200 --> 23:38.200 are creating a value through addition or 23:38.200 --> 23:40.200 multiplication we're going to feed in 23:40.200 --> 23:42.200 the children of this 23:42.200 --> 23:44.200 value which in this case is self 23:44.200 --> 23:46.200 and other 23:46.200 --> 23:48.200 so those are the children 23:48.200 --> 23:50.200 here 23:50.200 --> 23:52.200 so now we can do d dot 23:52.200 --> 23:54.200 prev and we'll see that 23:54.200 --> 23:56.200 the children of d we 23:56.200 --> 23:58.200 now know are this value of 23:58.200 --> 24:00.200 negative six and value of ten 24:00.200 --> 24:02.200 and this of course is the value resulting 24:02.200 --> 24:04.200 from a times b and the 24:04.200 --> 24:06.200 c value which is ten 24:06.200 --> 24:08.200 now the last piece of information 24:08.200 --> 24:10.200 we don't know is we know now the 24:10.200 --> 24:12.200 children of every single value but we don't know 24:12.200 --> 24:14.200 what operation created this value 24:14.200 --> 24:16.200 so we need one more element 24:16.200 --> 24:18.200 here let's call it underscore op 24:18.200 --> 24:20.200 and by default this 24:20.200 --> 24:22.200 is the empty string for leaves 24:22.200 --> 24:24.200 and then we'll just maintain it here 24:24.200 --> 24:26.200 and now the 24:26.200 --> 24:28.200 operation will be just a simple string 24:28.200 --> 24:30.200 and in the case of addition it's 24:30.200 --> 24:32.200 plus in the case of multiplication 24:32.200 --> 24:34.200 it's times so 24:34.200 --> 24:36.200 now we not just have d dot 24:36.200 --> 24:38.200 prev we also have d dot op 24:38.200 --> 24:40.200 and we know that d was produced by 24:40.200 --> 24:42.200 an addition of those two values 24:42.200 --> 24:44.200 and so now we have the full 24:44.200 --> 24:46.200 mathematical expression and we're 24:46.200 --> 24:48.200 building out this data structure and we know exactly 24:48.200 --> 24:50.200 how each value came to be
24:50.200 --> 24:52.200 by what expression and from what other values 24:54.200 --> 24:56.200 now because these expressions are about 24:56.200 --> 24:58.200 to get quite a bit larger we'd like a 24:58.200 --> 25:00.200 way to nicely visualize 25:00.200 --> 25:02.200 these expressions that we're building out 25:02.200 --> 25:04.200 so for that i'm going to copy paste a bunch of 25:04.200 --> 25:06.200 slightly scary code that's 25:06.200 --> 25:08.200 going to visualize these 25:08.200 --> 25:10.200 expression graphs for us so here's the 25:10.200 --> 25:12.200 code and i'll explain it in a bit 25:12.200 --> 25:14.200 but first let me just show you what this code does 25:14.200 --> 25:16.200 basically what it does is it creates 25:16.200 --> 25:18.200 a new function draw dot 25:18.200 --> 25:20.200 that we can call on some root node 25:20.200 --> 25:22.200 and then it's going to visualize it 25:22.200 --> 25:24.200 so if we call draw dot on d 25:24.200 --> 25:26.200 which is this final value here 25:26.200 --> 25:28.200 that is a times b plus c 25:28.200 --> 25:30.200 it creates 25:30.200 --> 25:32.200 something like this so this is d 25:32.200 --> 25:34.200 and you see that this is a times b 25:34.200 --> 25:36.200 creating an intermediate value 25:36.200 --> 25:38.200 plus c gives us this output 25:38.200 --> 25:40.200 node d 25:40.200 --> 25:42.200 so that's draw dot of d 25:42.200 --> 25:44.200 and i'm not going to go through this 25:44.200 --> 25:46.200 in complete detail you can take a look at 25:46.200 --> 25:48.200 graphviz and its api 25:48.200 --> 25:50.200 graphviz is an open source graph visualization 25:50.200 --> 25:52.200 software and what we're doing here 25:52.200 --> 25:54.200 is we're building out this graph in the graphviz 25:54.200 --> 25:56.200 api and 25:56.200 --> 25:58.200 you can basically see that 25:58.200 --> 26:00.200 trace is this helper function that 26:00.200 --> 26:02.200 enumerates all the nodes and edges in the graph 26:02.200 -->
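Putting together everything described so far, the wrapped scalar, the double underscore add and mul methods, the underscore prev set of children, and the underscore op string, a minimal sketch of the Value class at this point in the lecture might look like this (a rough reconstruction for illustration, not the final micrograd code):

```python
class Value:
    """Wraps a single scalar and records the expression graph that produced it."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # the Values this one was computed from
        self._op = _op               # which operation produced this Value ('' for leaves)

    def __repr__(self):
        # nicer printout than the default <__main__.Value object at 0x...>
        return f"Value(data={self.data})"

    def __add__(self, other):
        # a + b calls a.__add__(b): self is a, other is b
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c   # equivalent to (a.__mul__(b)).__add__(c)
print(d)        # Value(data=4.0)
print(d._op)    # +
```

Here d._prev holds the two values that produced d (the a times b result and c), and d._op records that it came from an addition.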
26:04.200 so that just builds a set of all 26:04.200 --> 26:06.200 the nodes and edges and then we iterate through 26:06.200 --> 26:08.200 all the nodes and we create special node 26:08.200 --> 26:10.200 objects for them in 26:10.200 --> 26:12.200 using dot 26:12.200 --> 26:14.200 node and then we also 26:14.200 --> 26:16.200 create edges using dot dot edge 26:16.200 --> 26:18.200 and the only thing that's like slightly 26:18.200 --> 26:20.200 tricky here is you'll notice that i 26:20.200 --> 26:22.200 basically add these fake nodes 26:22.200 --> 26:24.200 which are these operation nodes 26:24.200 --> 26:26.200 so for example this node here is just 26:26.200 --> 26:28.200 like a plus node and 26:28.200 --> 26:30.200 i create these 26:30.200 --> 26:32.200 special 26:32.200 --> 26:34.200 op nodes here 26:34.200 --> 26:36.200 and i connect them accordingly 26:36.200 --> 26:38.200 so these nodes of course 26:38.200 --> 26:40.200 are not actual nodes 26:40.200 --> 26:42.200 in the original graph they're not 26:42.200 --> 26:44.200 actually a value object the only 26:44.200 --> 26:46.200 value objects here are the things 26:46.200 --> 26:48.200 in squares those are actual value 26:48.200 --> 26:50.200 objects or representations thereof 26:50.200 --> 26:52.200 and these op nodes are just created in 26:52.200 --> 26:54.200 this draw dot routine so that 26:54.200 --> 26:56.200 it looks nice let's also 26:56.200 --> 26:58.200 add labels to these graphs just so we 26:58.200 --> 27:00.200 know what variables are where 27:00.200 --> 27:02.200 so let's create a special underscore 27:02.200 --> 27:04.200 label 27:04.200 --> 27:06.200 or let's just do label equals 27:06.200 --> 27:08.200 empty by default and save it 27:08.200 --> 27:10.200 in each node 27:10.200 --> 27:12.200 and then here 27:12.200 --> 27:14.200 we're going to do label is a 27:14.200 --> 27:16.200 label is b 27:16.200 --> 27:18.200 label is c 27:22.200 --> 27:24.200 and then 27:24.200 --> 27:26.200 let's create a 
special 27:26.200 --> 27:28.200 e equals 27:28.200 --> 27:30.200 a times b 27:30.200 --> 27:32.200 and e dot label will 27:32.200 --> 27:34.200 be e 27:34.200 --> 27:36.200 it's kind of naughty and d 27:36.200 --> 27:38.200 will be e plus c 27:38.200 --> 27:40.200 and d dot label will be 27:40.200 --> 27:42.200 d 27:42.200 --> 27:44.200 okay so nothing really changes i just 27:44.200 --> 27:46.200 added this new 27:46.200 --> 27:48.200 e variable 27:48.200 --> 27:50.200 and then here when we are 27:50.200 --> 27:52.200 printing this i'm going 27:52.200 --> 27:54.200 to print the label here 27:54.200 --> 27:56.200 so this will be a percent s 27:56.200 --> 27:58.200 bar and this will be n dot 27:58.200 --> 28:00.200 label 28:00.200 --> 28:02.200 and so now 28:02.200 --> 28:04.200 we have the label 28:04.200 --> 28:06.200 on the left here so it says a times b 28:06.200 --> 28:08.200 creating e and then e plus c creates 28:08.200 --> 28:10.200 d just like we have it 28:10.200 --> 28:12.200 here and finally let's make this 28:12.200 --> 28:14.200 expression just one layer deeper 28:14.200 --> 28:16.200 so d will not be the final output 28:16.200 --> 28:18.200 node instead 28:18.200 --> 28:20.200 after d we are going to create a 28:20.200 --> 28:22.200 new value object called 28:22.200 --> 28:24.200 f we're going to start running out of 28:24.200 --> 28:26.200 variables soon f will be negative two 28:26.200 --> 28:28.200 point zero and its label 28:28.200 --> 28:30.200 will of course just be f 28:30.200 --> 28:32.200 and then l 28:32.200 --> 28:34.200 capital l will be the output 28:34.200 --> 28:36.200 of our graph and l will be 28:36.200 --> 28:38.200 d times f 28:38.200 --> 28:40.200 okay so l will be negative eight 28:40.200 --> 28:42.200 is the output 28:42.200 --> 28:44.200 so 28:44.200 --> 28:46.200 now we don't just draw 28:46.200 --> 28:48.200 d we draw l 28:50.200 --> 28:52.200 okay 28:52.200 --> 28:54.200 and somehow the label of 28:54.200
--> 28:56.200 l is undefined oops 28:56.200 --> 28:58.200 the label has to be explicitly 28:58.200 --> 29:00.200 given to it 29:00.200 --> 29:02.200 there we go so l is the output 29:02.200 --> 29:04.200 so let's quickly recap what we've done so far 29:04.200 --> 29:06.200 we are able to build out mathematical 29:06.200 --> 29:08.200 expressions using only plus and times 29:08.200 --> 29:10.200 so far they are scalar 29:10.200 --> 29:12.200 valued along the way and we can 29:12.200 --> 29:14.200 do this forward pass 29:14.200 --> 29:16.200 and build out a mathematical expression 29:16.200 --> 29:18.200 so we have multiple inputs here 29:18.200 --> 29:20.200 a b c and f going into 29:20.200 --> 29:22.200 a mathematical expression that produces 29:22.200 --> 29:24.200 a single output l 29:24.200 --> 29:26.200 and this here is visualizing the 29:26.200 --> 29:28.200 forward pass so the output of the 29:28.200 --> 29:30.200 forward pass is negative eight 29:30.200 --> 29:32.200 that's the value now 29:32.200 --> 29:34.200 what we'd like to do next is we'd like to run 29:34.200 --> 29:36.200 back propagation and in back 29:36.200 --> 29:38.200 propagation we are going to start here at the end 29:38.200 --> 29:40.200 and we're going to reverse 29:40.200 --> 29:42.200 and calculate the gradient 29:42.200 --> 29:44.200 along all these intermediate 29:44.200 --> 29:46.200 values and really what we're 29:46.200 --> 29:48.200 computing for every single value here 29:48.200 --> 29:50.200 um we're going to compute 29:50.200 --> 29:52.200 the derivative of that node 29:52.200 --> 29:54.200 with respect to 29:54.200 --> 29:56.200 l so 29:56.200 --> 29:58.200 the derivative of l with respect to l 29:58.200 --> 30:00.200 is just one 30:00.200 --> 30:02.200 and then we're going to derive what is the 30:02.200 --> 30:04.200 derivative of l with respect to f with 30:04.200 --> 30:06.200 respect to d with respect to c 30:06.200 --> 30:08.200 with respect to e with respect 30:08.200 --> 
30:10.200 to b and with respect to a 30:10.200 --> 30:12.200 and in a neural network setting you'd 30:12.200 --> 30:14.200 be very interested in the derivative of basically 30:14.200 --> 30:16.200 this loss function l 30:16.200 --> 30:18.200 with respect to the weights of 30:18.200 --> 30:20.200 a neural network and here of course 30:20.200 --> 30:22.200 we have just these variables a b c and f 30:22.200 --> 30:24.200 but some of these will eventually represent 30:24.200 --> 30:26.200 the weights of a neural net and so 30:26.200 --> 30:28.200 we'll need to know how those weights are impacting 30:28.200 --> 30:30.200 the loss function 30:30.200 --> 30:32.200 so we'll be interested basically in the derivative of 30:32.200 --> 30:34.200 the output with respect to some of its 30:34.200 --> 30:36.200 leaf nodes and those leaf nodes will 30:36.200 --> 30:38.200 be the weights of the neural net 30:38.200 --> 30:40.200 and the other leaf nodes of course will be the data 30:40.200 --> 30:42.200 itself but usually we will not want 30:42.200 --> 30:44.200 or use the derivative of the 30:44.200 --> 30:46.200 loss function with respect to data because 30:46.200 --> 30:48.200 the data is fixed but the weights 30:48.200 --> 30:50.200 will be iterated on 30:50.200 --> 30:52.200 using the gradient information 30:52.200 --> 30:54.200 so next we are going to create a variable inside 30:54.200 --> 30:56.200 the value class that maintains 30:56.200 --> 30:58.200 the derivative of 30:58.200 --> 31:00.200 l with respect to that value 31:00.200 --> 31:02.200 and we will call this variable 31:02.200 --> 31:04.200 grad so there 31:04.200 --> 31:06.200 is a self dot data and there is a self dot grad 31:06.200 --> 31:08.200 and initially 31:08.200 --> 31:10.200 it will be zero and remember that 31:10.200 --> 31:12.200 zero basically means no 31:12.200 --> 31:14.200 effect so at initialization 31:14.200 --> 31:16.200 we are assuming that every value does not 31:16.200 --> 31:18.200 impact does not
affect the 31:18.200 --> 31:20.200 output right because 31:20.200 --> 31:22.200 if the gradient is zero that means that changing 31:22.200 --> 31:24.200 this variable is not changing the 31:24.200 --> 31:26.200 loss function so by 31:26.200 --> 31:28.200 default we assume that the gradient is zero 31:28.200 --> 31:30.200 and then 31:30.200 --> 31:32.200 now that we have grad 31:32.200 --> 31:34.200 and it's zero point zero 31:36.200 --> 31:38.200 we are going to be able to visualize 31:38.200 --> 31:40.200 it here after data so here 31:40.200 --> 31:42.200 grad is percent point four f 31:42.200 --> 31:44.200 and this will be n dot grad 31:44.200 --> 31:46.200 and now 31:46.200 --> 31:48.200 we are going to be showing both the data 31:48.200 --> 31:50.200 and the grad 31:50.200 --> 31:52.200 initialized at zero 31:52.200 --> 31:54.200 and we are 31:54.200 --> 31:56.200 just about getting ready to calculate the 31:56.200 --> 31:58.200 back propagation and of course this 31:58.200 --> 32:00.200 grad again as i mentioned is representing 32:00.200 --> 32:02.200 the derivative of the output in 32:02.200 --> 32:04.200 this case l with respect to this 32:04.200 --> 32:06.200 value so 32:06.200 --> 32:08.200 this is the derivative of l with respect to 32:08.200 --> 32:10.200 f with respect to d and so on 32:10.200 --> 32:12.200 so let's now fill in those gradients 32:12.200 --> 32:14.200 and actually do back propagation manually 32:14.200 --> 32:16.200 so let's start filling in these gradients and 32:16.200 --> 32:18.200 start all the way at the end as i mentioned here 32:18.200 --> 32:20.200 first we are interested to fill in this 32:20.200 --> 32:22.200 gradient here so 32:22.200 --> 32:24.200 what is the derivative of l with respect to 32:24.200 --> 32:26.200 l in other words if i change 32:26.200 --> 32:28.200 l by a tiny amount h 32:28.200 --> 32:30.200 how much does 32:30.200 --> 32:32.200 l change 32:32.200 --> 32:34.200 it changes by h so 32:34.200 -->
32:36.200 it's proportional and therefore the derivative will be 32:36.200 --> 32:38.200 one we can of course 32:38.200 --> 32:40.200 measure or estimate these 32:40.200 --> 32:42.200 gradients numerically just like 32:42.200 --> 32:44.200 we've seen before so if i take this 32:44.200 --> 32:46.200 expression and i create a 32:46.200 --> 32:48.200 def lol function here 32:48.200 --> 32:50.200 and put this here 32:50.200 --> 32:52.200 now the reason i'm creating a function 32:52.200 --> 32:54.200 lol here is because i don't want 32:54.200 --> 32:56.200 to pollute or mess up the global scope 32:56.200 --> 32:58.200 here this is just kind of like a little staging 32:58.200 --> 33:00.200 area and as you know in python all of these 33:00.200 --> 33:02.200 will be local variables to this function 33:02.200 --> 33:04.200 so i'm not changing any of the 33:04.200 --> 33:06.200 global scope here so here 33:06.200 --> 33:08.200 l1 will be l 33:10.200 --> 33:12.200 and then copy pasting this expression 33:12.200 --> 33:14.200 we're going to add a small 33:14.200 --> 33:16.200 amount h 33:16.200 --> 33:18.200 in 33:18.200 --> 33:20.200 for example a 33:20.200 --> 33:22.200 right and this would be measuring 33:22.200 --> 33:24.200 the derivative of l with respect 33:24.200 --> 33:26.200 to a so here 33:26.200 --> 33:28.200 this will be l2 33:28.200 --> 33:30.200 and then we want to print the test derivative 33:30.200 --> 33:32.200 so print l2 minus 33:32.200 --> 33:34.200 l1 which is how much l 33:34.200 --> 33:36.200 changed and then normalize it 33:36.200 --> 33:38.200 by h so this is the rise 33:38.200 --> 33:40.200 over run and we have to be 33:40.200 --> 33:42.200 careful because l is a value node 33:42.200 --> 33:44.200 so we actually want its data 33:46.200 --> 33:48.200 so that these are floats dividing 33:48.200 --> 33:50.200 by h and this should print 33:50.200 --> 33:52.200 the derivative of l with respect to a 33:52.200 --> 33:54.200 because a is
the one that we bumped a 33:54.200 --> 33:56.200 little bit by h so what is 33:56.200 --> 33:58.200 the derivative of l with respect 33:58.200 --> 34:00.200 to a it's six 34:00.200 --> 34:02.200 okay and obviously 34:02.200 --> 34:04.200 if we change 34:04.200 --> 34:06.200 l by h 34:06.200 --> 34:08.200 then that would be 34:08.200 --> 34:10.200 here 34:10.200 --> 34:12.200 effectively 34:12.200 --> 34:14.200 this looks really awkward but 34:14.200 --> 34:16.200 changing l by h 34:16.200 --> 34:18.200 you see the derivative here is one 34:20.200 --> 34:22.200 that's kind of like the base case 34:22.200 --> 34:24.200 of what we are doing here 34:24.200 --> 34:26.200 so basically we can come up here 34:26.200 --> 34:28.200 and we can manually set 34:28.200 --> 34:30.200 l.grad to one this is our 34:30.200 --> 34:32.200 manual backpropagation 34:32.200 --> 34:34.200 l.grad is one and let's redraw 34:34.200 --> 34:36.200 and we'll see 34:36.200 --> 34:38.200 that we filled in grad is one 34:38.200 --> 34:40.200 for l we're now going to continue 34:40.200 --> 34:42.200 the backpropagation so let's here look at 34:42.200 --> 34:44.200 the derivatives of l with respect to 34:44.200 --> 34:46.200 d and f let's do 34:46.200 --> 34:48.200 d first so what 34:48.200 --> 34:50.200 we are interested in if i create a markdown node 34:50.200 --> 34:52.200 here is we'd like to know 34:52.200 --> 34:54.200 basically we have that l is d times f 34:54.200 --> 34:56.200 and we'd like to know what is 34:56.200 --> 34:58.200 d l by 34:58.200 --> 35:00.200 d d 35:00.200 --> 35:02.200 what is that and if you know 35:02.200 --> 35:04.200 your calculus l is d times f 35:04.200 --> 35:06.200 so what is d l by d d 35:06.200 --> 35:08.200 it would be f 35:08.200 --> 35:10.200 and if you don't believe me we can also 35:10.200 --> 35:12.200 just derive it because the proof would be 35:12.200 --> 35:14.200 fairly straightforward we go 35:14.200 --> 35:16.200 to the definition 35:16.200 -->
35:18.200 of the derivative which is 35:18.200 --> 35:20.200 f of x plus h minus f of x 35:20.200 --> 35:22.200 divide h 35:22.200 --> 35:24.200 as a limit of h goes to zero 35:24.200 --> 35:26.200 of this kind of expression so 35:26.200 --> 35:28.200 when we have l is d times f 35:28.200 --> 35:30.200 then increasing 35:30.200 --> 35:32.200 d by h would give us 35:32.200 --> 35:34.200 the output of d plus h times 35:34.200 --> 35:36.200 f that's 35:36.200 --> 35:38.200 basically f of x plus h right 35:38.200 --> 35:40.200 minus d times 35:40.200 --> 35:42.200 f 35:42.200 --> 35:44.200 and then divide h and 35:44.200 --> 35:46.200 symbolically expanding out here we 35:46.200 --> 35:48.200 would have basically d times f 35:48.200 --> 35:50.200 plus h times f minus 35:50.200 --> 35:52.200 d times f divide h 35:52.200 --> 35:54.200 and then you see how the df minus 35:54.200 --> 35:56.200 df cancels so you're left with h times 35:56.200 --> 35:58.200 f divide h 35:58.200 --> 36:00.200 which is f so 36:00.200 --> 36:02.200 in the limit as h goes to zero 36:02.200 --> 36:04.200 of you know 36:04.200 --> 36:06.200 derivative 36:06.200 --> 36:08.200 definition we just 36:08.200 --> 36:10.200 get f in the case of 36:10.200 --> 36:12.200 d times f 36:12.200 --> 36:14.200 so symmetrically 36:14.200 --> 36:16.200 d l by d f 36:16.200 --> 36:18.200 will just be d 36:18.200 --> 36:20.200 so what we have is that 36:20.200 --> 36:22.200 f dot grad we see now 36:22.200 --> 36:24.200 is just the value of d 36:24.200 --> 36:26.200 which is four 36:26.200 --> 36:30.200 and we see that 36:30.200 --> 36:32.200 d dot grad is just 36:32.200 --> 36:34.200 the value of f 36:36.200 --> 36:38.200 and so the value of f 36:38.200 --> 36:40.200 is negative two 36:40.200 --> 36:42.200 so we'll set those 36:42.200 --> 36:44.200 manually 36:44.200 --> 36:46.200 let me erase this markdown 36:46.200 --> 36:48.200 node and then let's redraw what we 36:48.200 --> 36:50.200 have 36:50.200 --> 36:52.200 
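Written out symbolically, the derivation just spoken aloud, for l equals d times f, is:

```latex
\frac{dL}{dd}
  = \lim_{h \to 0} \frac{(d+h)\,f - d\,f}{h}
  = \lim_{h \to 0} \frac{d f + h f - d f}{h}
  = \lim_{h \to 0} \frac{h f}{h}
  = f,
\qquad \text{and by symmetry} \qquad
\frac{dL}{df} = d.
```

Plugging in the values from the graph, dL/dd = f = -2 and dL/df = d = 4, which is exactly what gets set into d.grad and f.grad.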
okay and let's 36:52.200 --> 36:54.200 just make sure that these were correct 36:54.200 --> 36:56.200 so we seem to think that 36:56.200 --> 36:58.200 d l by d d is negative two so let's 36:58.200 --> 37:00.200 double check 37:00.200 --> 37:02.200 let me erase this plus h from before 37:02.200 --> 37:04.200 and now we want the derivative with respect to f 37:04.200 --> 37:06.200 so let's just come here 37:06.200 --> 37:08.200 when i create f and let's do a plus h here 37:08.200 --> 37:10.200 and this should print a derivative of 37:10.200 --> 37:12.200 l with respect to f so we expect 37:12.200 --> 37:14.200 to see four 37:14.200 --> 37:16.200 yeah and this is four up to 37:16.200 --> 37:18.200 floating point funkiness 37:18.200 --> 37:20.200 and then d l 37:20.200 --> 37:22.200 by d d should be 37:22.200 --> 37:24.200 f which is negative two 37:24.200 --> 37:26.200 grad is negative two 37:26.200 --> 37:28.200 so if we again 37:28.200 --> 37:30.200 come here and we change d 37:30.200 --> 37:32.200 d dot 37:32.200 --> 37:34.200 data plus equals h right 37:34.200 --> 37:36.200 here so we expect 37:36.200 --> 37:38.200 so we've added a little h and then we see 37:38.200 --> 37:40.200 how l changed and we 37:40.200 --> 37:42.200 expect to print 37:42.200 --> 37:44.200 negative two 37:44.200 --> 37:46.200 there we go 37:46.200 --> 37:48.200 so we've numerically 37:48.200 --> 37:50.200 verified what we're doing here is 37:50.200 --> 37:52.200 kind of like an inline gradient check 37:52.200 --> 37:54.200 gradient check is when we 37:54.200 --> 37:56.200 are deriving this like back propagation 37:56.200 --> 37:58.200 and getting the derivative with respect to all the 37:58.200 --> 38:00.200 intermediate results and 38:00.200 --> 38:02.200 then numerical gradient is just you know 38:02.200 --> 38:04.200 estimating it using 38:04.200 --> 38:06.200 small step size 38:06.200 --> 38:08.200 now we're getting to the crux of 38:08.200 --> 38:10.200 back propagation so this will be 
the 38:10.200 --> 38:12.200 most important node to understand 38:12.200 --> 38:14.200 because if you understand the gradient for 38:14.200 --> 38:16.200 this node you understand all of back 38:16.200 --> 38:18.200 propagation and all training of neural nets 38:18.200 --> 38:20.200 basically so we need 38:20.200 --> 38:22.200 to derive d l by 38:22.200 --> 38:24.200 d c in other words the derivative 38:24.200 --> 38:26.200 of l with respect to c 38:26.200 --> 38:28.200 because we've computed all these other 38:28.200 --> 38:30.200 gradients already now we're coming 38:30.200 --> 38:32.200 here and we're continuing the back propagation 38:32.200 --> 38:34.200 manually so we want 38:34.200 --> 38:36.200 d l by d c and then we'll also 38:36.200 --> 38:38.200 derive d l by d e 38:38.200 --> 38:40.200 now here's the problem 38:40.200 --> 38:42.200 how do we derive d l by 38:42.200 --> 38:44.200 d c 38:44.200 --> 38:46.200 we actually know the derivative of l 38:46.200 --> 38:48.200 with respect to d so we know how 38:48.200 --> 38:50.200 l is sensitive to d 38:50.200 --> 38:52.200 but how is l sensitive to 38:52.200 --> 38:54.200 c so if we wiggle c how does 38:54.200 --> 38:56.200 that impact l through d 38:56.200 --> 39:00.200 so we know d l by d d 39:00.200 --> 39:02.200 and we 39:02.200 --> 39:04.200 also here know how c impacts d 39:04.200 --> 39:06.200 and so just very intuitively if you 39:06.200 --> 39:08.200 know the impact that c is having 39:08.200 --> 39:10.200 on d and the impact that d is having 39:10.200 --> 39:12.200 on l then you should be able to 39:12.200 --> 39:14.200 somehow put that information together to 39:14.200 --> 39:16.200 figure out how c impacts l 39:16.200 --> 39:18.200 and indeed this is what we can actually 39:18.200 --> 39:20.200 do so in particular we 39:20.200 --> 39:22.200 know just concentrating on d first 39:22.200 --> 39:24.200 let's look at what is the derivative 39:24.200 --> 39:26.200 of d with respect to c 39:26.200
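As a sanity check before deriving it, the same lol-style numerical trick from earlier also estimates dL/dc directly; here is a sketch using plain floats (a = 2, b = -3, c = 10, f = -2, matching the values in the video) rather than Value objects:

```python
def lol():
    # estimate dL/dc numerically, where L = (a*b + c) * f
    h = 1e-6
    a, b, c, f = 2.0, -3.0, 10.0, -2.0
    L1 = (a * b + c) * f          # original output
    L2 = (a * b + (c + h)) * f    # output after bumping c by h
    return (L2 - L1) / h          # rise over run

print(lol())  # ≈ -2.0
```

Numerically dL/dc comes out the same as dL/dd, which is what the local reasoning below will explain: the plus node passes the gradient through unchanged.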
--> 39:28.200 so in other words what is d d by d 39:28.200 --> 39:30.200 c 39:30.200 --> 39:32.200 so here 39:32.200 --> 39:34.200 we know that d is 39:34.200 --> 39:36.200 c plus e that's what we 39:36.200 --> 39:38.200 know and now we're interested in d d 39:38.200 --> 39:40.200 by d c if you 39:40.200 --> 39:42.200 just know your calculus again and you remember 39:42.200 --> 39:44.200 then differentiating c plus e with 39:44.200 --> 39:46.200 respect to c you know that that gives you 39:46.200 --> 39:48.200 1.0 and 39:48.200 --> 39:50.200 we can also go back to the basics and derive 39:50.200 --> 39:52.200 this because again we can go to our 39:52.200 --> 39:54.200 f of x plus h minus f of x 39:54.200 --> 39:56.200 divide by h 39:56.200 --> 39:58.200 that's the definition of a derivative 39:58.200 --> 40:00.200 as h goes to zero and 40:00.200 --> 40:02.200 so here focusing on c 40:02.200 --> 40:04.200 and its effect on d 40:04.200 --> 40:06.200 we can basically do the f of x plus h 40:06.200 --> 40:08.200 will be c 40:08.200 --> 40:10.200 incremented by h plus e 40:10.200 --> 40:12.200 that's the first evaluation of our 40:12.200 --> 40:14.200 function minus 40:14.200 --> 40:16.200 c plus e 40:16.200 --> 40:18.200 and then divide h 40:18.200 --> 40:20.200 and so what is this 40:20.200 --> 40:22.200 just expanding this out this will be c plus 40:22.200 --> 40:24.200 h plus e minus c minus 40:24.200 --> 40:26.200 e divide h 40:26.200 --> 40:28.200 and then you see here how c minus c 40:28.200 --> 40:30.200 cancels e minus e cancels 40:30.200 --> 40:32.200 we're left with h over h which is 1.0 40:32.200 --> 40:34.200 and so 40:34.200 --> 40:36.200 by symmetry also 40:36.200 --> 40:38.200 d d by d 40:38.200 --> 40:40.200 e will be 40:40.200 --> 40:42.200 1.0 as well 40:42.200 --> 40:44.200 so basically the derivative of 40:44.200 --> 40:46.200 a sum expression is very simple 40:46.200 --> 40:48.200 and this is the local derivative 40:48.200 --> 40:50.200
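That limit argument can also be checked numerically in a couple of lines; a tiny sketch using the lecture's concrete values c = 10 and e = -6:

```python
# local derivatives of the sum d = c + e, estimated with a small step h
h = 1e-6
c, e = 10.0, -6.0
dd_dc = (((c + h) + e) - (c + e)) / h  # bump c: ((c+h)+e - (c+e)) / h
dd_de = ((c + (e + h)) - (c + e)) / h  # bump e
print(dd_dc, dd_de)  # both ≈ 1.0
```

Both local derivatives come out as 1.0 up to floating point error, confirming that a plus node simply passes changes through.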
so i call this the local derivative because 40:50.200 --> 40:52.200 we have the final output value all the 40:52.200 --> 40:54.200 way at the end of this graph and we're now 40:54.200 --> 40:56.200 like a small node here and 40:56.200 --> 40:58.200 this is a little plus node and 40:58.200 --> 41:00.200 the little plus node doesn't know 41:00.200 --> 41:02.200 anything about the rest of the graph 41:02.200 --> 41:04.200 that it's embedded in all it knows 41:04.200 --> 41:06.200 is that it did a plus it took a c 41:06.200 --> 41:08.200 and an e added them and created 41:08.200 --> 41:10.200 d and this plus node 41:10.200 --> 41:12.200 also knows the local influence of 41:12.200 --> 41:14.200 c on d or rather 41:14.200 --> 41:16.200 the derivative of d with respect to c 41:16.200 --> 41:18.200 and it also knows the derivative of d 41:18.200 --> 41:20.200 with respect to e but 41:20.200 --> 41:22.200 that's not what we want that's just a local derivative 41:22.200 --> 41:24.200 what we actually want is 41:24.200 --> 41:26.200 dl by dc and 41:26.200 --> 41:28.200 l is here just one 41:28.200 --> 41:30.200 step away but in the general case 41:30.200 --> 41:32.200 this little plus node could be 41:32.200 --> 41:34.200 embedded in like a massive graph 41:34.200 --> 41:36.200 so again 41:36.200 --> 41:38.200 we know how l impacts d and 41:38.200 --> 41:40.200 now we know how c and e impact 41:40.200 --> 41:42.200 d how do we put that information together 41:42.200 --> 41:44.200 to write dl by dc 41:44.200 --> 41:46.200 and the answer of course is the chain rule 41:46.200 --> 41:48.200 in calculus and so 41:50.200 --> 41:52.200 i pulled up chain rule here from wikipedia 41:52.200 --> 41:54.200 and i'm going 41:54.200 --> 41:56.200 to go through this very briefly so chain 41:56.200 --> 41:58.200 rule wikipedia sometimes 41:58.200 --> 42:00.200 can be very confusing and calculus 42:00.200 --> 42:02.200 can be very confusing like 42:02.200 --> 42:04.200 this is
the way i learned 42:04.200 --> 42:06.200 chain rule and it was very 42:06.200 --> 42:08.200 confusing like what is happening 42:08.200 --> 42:10.200 it's just complicated so i like 42:10.200 --> 42:12.200 this expression much better 42:12.200 --> 42:14.200 if a variable z depends 42:14.200 --> 42:16.200 on a variable y which itself depends 42:16.200 --> 42:18.200 on a variable x 42:18.200 --> 42:20.200 then z depends on x as well obviously 42:20.200 --> 42:22.200 through the intermediate variable y 42:22.200 --> 42:24.200 and in this case the chain rule is expressed 42:24.200 --> 42:26.200 as if you want 42:26.200 --> 42:28.200 dz by dx 42:28.200 --> 42:30.200 then you take the dz by dy 42:30.200 --> 42:32.200 and you multiply it by dy 42:32.200 --> 42:34.200 by dx so the chain 42:34.200 --> 42:36.200 rule fundamentally is telling you 42:36.200 --> 42:38.200 how we chain 42:38.200 --> 42:40.200 these derivatives 42:40.200 --> 42:42.200 together correctly 42:42.200 --> 42:44.200 so to differentiate through 42:44.200 --> 42:46.200 a function composition 42:46.200 --> 42:48.200 we have to apply a multiplication 42:48.200 --> 42:50.200 of those derivatives 42:50.200 --> 42:52.200 so that's 42:52.200 --> 42:54.200 really what chain rule is telling us 42:54.200 --> 42:56.200 and there's a nice little 42:56.200 --> 42:58.200 intuitive explanation here which i also think is 42:58.200 --> 43:00.200 kind of cute the chain rule states that 43:00.200 --> 43:02.200 knowing the instantaneous rate of change of z with respect 43:02.200 --> 43:04.200 to y and y relative to x allows 43:04.200 --> 43:06.200 one to calculate the instantaneous rate of change of z 43:06.200 --> 43:08.200 relative to x as a 43:08.200 --> 43:10.200 product of those two rates of change 43:10.200 --> 43:12.200 simply the product of those two 43:12.200 --> 43:14.200 so here's a good one 43:14.200 --> 43:16.200 if a car travels twice as fast as a bicycle 43:16.200 --> 43:18.200 and the bicycle is four times 
So here's a good one: if a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels two times four, that is, eight times as fast as the man. And so this makes it very clear that the correct thing to do, sort of, is to multiply. The car is twice as fast as the bicycle, and the bicycle is four times as fast as the man, so the car will be eight times as fast as the man. And so we can take these intermediate rates of change, if you will, and multiply them together, and that justifies the chain rule intuitively. So have a look at the chain rule.

But here, really, what it means for us is that there's a very simple recipe for deriving what we want, which is dL/dc. What we have so far is that we know dL/dd, the derivative of L with respect to d; we know that's negative two. And now, because of this local reasoning that we've done here, we know dd/dc: how does c impact d? In particular, this is a plus node, so the local derivative is simply 1.0; it's very simple.
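As an aside, this recipe can be sanity-checked numerically on any toy composition. Here is a minimal sketch; the functions `y_of_x` and `z_of_y` are made up for illustration, not from the lecture:

```python
# A toy composition: z depends on y, and y depends on x.
def y_of_x(x):
    return 3.0 * x + 1.0      # dy/dx = 3

def z_of_y(y):
    return y * y              # dz/dy = 2y

x, h = 2.0, 1e-6
y = y_of_x(x)

dz_dy = (z_of_y(y + h) - z_of_y(y)) / h
dy_dx = (y_of_x(x + h) - y_of_x(x)) / h

# derivative of the full composition z(y(x)), measured directly
dz_dx = (z_of_y(y_of_x(x + h)) - z_of_y(y_of_x(x))) / h

# the chain rule says these two agree, up to float noise
print(dz_dx, dz_dy * dy_dx)
```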
And so the chain rule tells us that dL/dc, going through this intermediate variable, will just be simply dL/dd times dd/dc; that's the chain rule. So this is identical to what's happening here, except z is our L, y is our d, and x is our c. So we literally just have to multiply these. And because these local derivatives, like dd/dc, are just one, we basically just copy over dL/dd, because this is just times one. So because dL/dd is negative two, what is dL/dc? Well, it's the local gradient, 1.0, times dL/dd, which is negative two. So literally what a plus node does, you can look at it that way, is it just routes the gradient: because the plus node's local derivatives are just one, in the chain rule one times dL/dd is just dL/dd, and so that derivative gets routed to both c and to e in this case. So basically we have that c.grad, since that's the one we looked at first, is negative two times one, which is negative two.
And in the same way, by symmetry, e.grad will be negative two; that's the claim. So we can set those, and we can redraw, and you see how we just assigned negative two and negative two. So this backpropagating signal, which is carrying the information of what the derivative of L is with respect to all the intermediate nodes, we can imagine almost like flowing backwards through the graph, and a plus node will simply distribute the derivative to all the children nodes of it.

So this is the claim, and now let's verify it. Let me remove the plus h here from before, and now instead what we want to do is increment c: c.data will be incremented by h, and when I run this we expect to see negative two. Negative two. And then of course for e: e.data += h, and we expect to see negative two. Simple. So those are the derivatives of these internal nodes, and now we're going to recurse our way backwards again, and we're again going to apply the chain rule. So here we go, our second application of the chain rule, and we will apply it all the way through the graph; we just happen to only have one more node remaining.
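The nudge check just performed can be reproduced with plain floats. This sketch uses the running values from the lecture's expression (a = 2.0, b = -3.0, c = 10.0, f = -2.0), not the Value objects themselves:

```python
# Forward pass of the lecture's expression: e = a*b, d = e+c, L = d*f.
def forward(a, b, c, f):
    e = a * b
    d = e + c
    return d * f

h = 1e-6
a, b, c, f = 2.0, -3.0, 10.0, -2.0
base = forward(a, b, c, f)            # L = -8.0

# the plus node should route dL/dd = f = -2 unchanged into c
dL_dc = (forward(a, b, c + h, f) - base) / h
print(dL_dc)   # ~ -2.0
```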
We have that the derivative of L with respect to e, as we have just calculated, is negative two. So we know dL/de, and now we want dL/da. And the chain rule is telling us that that's just dL/de, negative two, times de/da, which we have to look at. So I'm a little times node inside a massive graph, and I only know that I did a times b and I produced an e. So now, what is de/da, and what is de/db? That's the only thing that I sort of know about; that's my local gradient. So because we have that e is a times b, we're asking what de/da is, and of course we just did that here, so I'm not going to re-derive it; but if you differentiate this with respect to a, you'll just get b, right, the value of b, which in this case is negative 3.0. So basically we have dL/da; well, let me just do it right here.
We have that a.grad, applying the chain rule here, is dL/de, which we see here is negative two, times de/da, which is the value of b: negative three. That's it. And then we have that b.grad is again dL/de, which is negative two, just the same way, times de/db, which is the value of a, which is 2.0. So these are our claimed derivatives. Let's redraw, and we see here that a.grad turns out to be six, because that is negative two times negative three, and b.grad is negative two times two, which is negative four. So those are our claims; let's delete this and verify them. We have a.data += h, and the claim is that a.grad is six. Let's verify: six. And we have b.data += h, so nudging b by h and looking at what happens: we claim it's negative four, and indeed it's negative four, plus or minus, again, float oddness. And that's it.
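The same finite-difference trick confirms the times-node claims; again a plain-float sketch of the lecture's expression, not the Value objects:

```python
def forward(a, b, c, f):
    e = a * b          # times node: de/da = b, de/db = a
    d = e + c
    return d * f       # so dL/dd = f = -2

h = 1e-6
a, b, c, f = 2.0, -3.0, 10.0, -2.0
base = forward(a, b, c, f)

dL_da = (forward(a + h, b, c, f) - base) / h   # expect f * b = 6
dL_db = (forward(a, b + h, c, f) - base) / h   # expect f * a = -4
print(dL_da, dL_db)
```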
That was the manual backpropagation, all the way from here to all the leaf nodes, and we've done it piece by piece. And really all we've done, as you saw, is we iterated through all the nodes one by one and locally applied the chain rule. We always know what the derivative of L is with respect to this little output, and then we look at how this output was produced: it was produced through some operation, and we have the pointers to the children nodes. So in this little operation we know what the local derivatives are, and we just multiply them onto the derivative, always. So we just go through and recursively multiply on the local derivatives, and that's what backpropagation is: it's just a recursive application of the chain rule backwards through the computation graph.

Let's see this power in action, just very briefly. What we're going to do is nudge our inputs to try to make L go up. So in particular, we're going to take the data and we're going to change it, and if we want L to go up, that means we just have to go in the direction of the gradient. So a should increase in the direction of the gradient by some small step amount; this is the step size.
And we don't just want this for a, but also for b, also for c, also for f. Those are the leaf nodes, which we usually have control over. And if we nudge in the direction of the gradient, we expect a positive influence on L. So we expect L to go up positively, meaning it should become less negative; it should go up to, say, negative six or something like that. It's hard to tell exactly, and we have to rerun the forward pass. So let me just do that here. This would be the forward pass; f would be unchanged. This is effectively the forward pass, and now if we print L.data, we expect, because we nudged all the inputs in the direction of the gradient, a less negative L; we expect it to go up, so maybe it's negative six or so. Let's see what happens. Okay, negative seven. And this is basically one step of an optimization that we'll end up running, and really this gradient just gives us power, because we know how to influence the final outcome. And this will be extremely useful for training neural nets, as we'll soon see.
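That single optimization step can be sketched with plain floats, using the gradients computed above (the 0.01 step size is an arbitrary choice, not the lecture's exact value):

```python
step = 0.01
a, b, c, f = 2.0, -3.0, 10.0, -2.0
# gradients from the manual backprop: dL/da=6, dL/db=-4, dL/dc=-2, dL/df=d=4
grads = {'a': 6.0, 'b': -4.0, 'c': -2.0, 'f': 4.0}

# nudge every leaf in the direction of its gradient
a += step * grads['a']
b += step * grads['b']
c += step * grads['c']
f += step * grads['f']

# rerun the forward pass
e = a * b
d = e + c
L = d * f
print(L)   # less negative than the original -8.0
```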
So now I would like to do one more example of manual backpropagation, using a slightly more complex and useful example: we are going to backpropagate through a neuron. We want to eventually build out neural networks, and in the simplest case these are multilayer perceptrons, as they're called. So this is a two-layer neural net, and it's got these hidden layers made up of neurons, and these neurons are fully connected to each other. Now, biologically, neurons are very complicated devices, but we have very simple mathematical models of them. And so this is a very simple mathematical model of a neuron. You have some inputs, the x's, and then you have these synapses that have weights on them; the w's are the weights. The synapse interacts with the input to this neuron multiplicatively, so what flows to the cell body of this neuron is w times x. But there are multiple inputs, so there's a w times x flowing to the cell body for each of them. The cell body then also has some bias; this is kind of like the innate trigger happiness of this neuron, so the bias can make it a bit more trigger happy or a bit less trigger happy, regardless of the input.
But basically we're taking all the w times x of all the inputs, adding the bias, and then taking it through an activation function, and this activation function is usually some kind of a squashing function, like a sigmoid or a tanh or something like that. As an example, we're going to use the tanh here. Numpy has an np.tanh, so we can call it on a range and we can plot it. This is the tanh function, and you see that the inputs, as they come in, get squashed on the y coordinate. So right at 0 we're going to get exactly 0, and then as you go more positive in the input, you'll see that the activation function will only go up to 1 and then plateau out; so if you pass in very positive inputs, we're going to cap them smoothly at 1, and on the negative side we're going to cap them smoothly at negative 1. So that's the tanh, and that's the squashing function, or an activation function. And what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs.

So let's write one out. I'm going to copy-paste because I don't want to type too much.
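The tanh shape described above can be sampled directly. The lecture plots np.tanh over a range; the same squashing behavior is visible with the standard library's math.tanh on a few points:

```python
import math

# tanh squashes any real input smoothly into the open interval (-1, 1)
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"{x:+.1f} -> {math.tanh(x):+.4f}")
# tanh(0) is exactly 0, and the tails cap smoothly near -1 and +1
```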
Okay, so here we have the inputs x1 and x2; this is a two-dimensional neuron, so two inputs are going to come in. Then we have the weights of this neuron, w1 and w2, and these weights, again, are the synaptic strengths for each input. And this is the bias of the neuron, b. Now, what we want to do, according to this model, is multiply x1 times w1 and x2 times w2, and then add the bias on top of it. It gets a little messy here, but all we are trying to do is x1*w1 + x2*w2 + b, except I'm doing it in small steps so that we actually have pointers to all these intermediate nodes. So we have an x1w1 variable and an x2w2 variable, and I'm also labeling them. And n is now the cell body's raw activation, without the activation function, for now. This should be enough to basically plot it, so draw_dot of n gives us x1 times w1 and x2 times w2 being added, then the bias gets added on top of this, and this n is the sum.
So we are now going to take it through an activation function, and let's say we use the tanh, so that we produce the output. What we'd like to do here is compute the output, and I'll call it o: o is n.tanh(). Okay, but we haven't yet written the tanh. Now, the reason that we need to implement another function here is that tanh is a hyperbolic function, and we've only so far implemented a plus and a times, and you can't make a tanh out of just pluses and times; you also need exponentiation. So tanh is this kind of formula here; you can use either one of these, and you see that there's exponentiation involved, which we have not implemented yet for our little Value node. So we're not going to be able to produce a tanh yet, and we have to go back up and implement something like it.

Now, one option here is that we could actually implement exponentiation, and we could return the exp of a value instead of the tanh of a value, because if we had exp, then we'd have everything else that we need: we know how to add and we know how to multiply, so we'd be able to create a tanh if we knew how to exp. But for the purposes of this example, I specifically wanted to show you that we don't necessarily need to have the most atomic pieces in this Value object. We can actually create functions at arbitrary points of abstraction. They can be complicated functions, but they can also be very, very simple functions, like a plus, and it's totally up to us.
The only thing that matters is that we know how to differentiate through any one function. So we take some inputs and we make an output; the function can be arbitrarily complex, as long as you know how to create the local derivative. If you know the local derivative of how the inputs impact the output, then that's all you need. So we're going to cluster up all of this expression, and we're not going to break it down to its atomic pieces; we're just going to directly implement tanh. So let's do that: def tanh, and then out will be a Value of... and we need this expression here, so let me actually copy-paste. Let's grab x, which is self.data, and then this, I believe, is the tanh: math.exp of two x, minus one, over math.exp of two x, plus one. I'm calling it x just so that it matches the formula exactly. Okay, and now this will be t. Then the children of this node: there's just one child, and I'm wrapping it in a tuple, so this is a tuple of one object, just self. And here the name of this operation will be 'tanh', and we're going to return that. Okay, so now Value should be implementing tanh, and we can scroll all the way down here and actually do n.tanh(), and that's going to return the tanh output of n.
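Putting those pieces together, a minimal sketch of the Value class as described so far might look like this. The attribute and argument names follow micrograd's conventions, but this is a reconstruction rather than the lecture's verbatim code:

```python
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)   # pointers to the children nodes
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        return Value(t, (self,), 'tanh')   # one child, wrapped in a tuple

# the neuron from the lecture
x1, x2 = Value(2.0, label='x1'), Value(0.0, label='x2')
w1, w2 = Value(-3.0, label='w1'), Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
n = x1 * w1 + x2 * w2 + b
o = n.tanh()
print(n.data, o.data)   # ~0.8814 and ~0.7071
```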
And now we should be able to draw_dot of o, not of n, so let's see how that worked. There we go: n went through tanh to produce this output. So now tanh is, sort of, a little micrograd-supported node here as an operation, and as long as we know the derivative of tanh, we'll be able to backpropagate through it.

Now let's see this tanh in action. Currently it's not squashing too much, because the input to it is pretty low. If the bias were increased to, say, 8, then we'd see that what's flowing into the tanh is now 2, and tanh is squashing it to 0.96, so we're already hitting the tail of this tanh, and it will smoothly go up to 1 and then plateau out over there. Okay, so now I'm going to do something slightly strange: I'm going to change this bias from 8 to this number, 6.88 and so on, and I'm doing this for specific reasons. We're about to start backpropagation, and I want to make sure that our numbers come out nice: not very crazy numbers, but nice numbers that we can sort of understand in our head. Let me also add those labels; o is short for output here. So that's o. Okay, so 0.88 flows into the tanh and comes out as 0.7.
So now we're going to do backpropagation, and we're going to fill in all the gradients. What is the derivative of o with respect to all the inputs here? Of course, in a typical neural network setting, what we really care about the most is the derivative of the output with respect to the weights, specifically w2 and w1, because those are the weights that we're going to be changing as part of the optimization. The other thing that we have to remember is that here we have only a single neuron, but in a neural net you typically have many neurons, and they're connected; so this is only one small neuron, a piece of a much bigger puzzle, and eventually there's a loss function that measures the accuracy of the neural net, and we're backpropagating with respect to that accuracy and trying to increase it.

So let's start off backpropagation here at the end. What is the derivative of o with respect to o? The base case, sort of, which we always know, is that the gradient is just 1.0. So let me fill it in, and then let me split out the drawing call into its own cell here and clear this output. Okay. So now when we draw o, we'll see that o.grad is 1. So now we're going to backpropagate through the tanh.
To backpropagate through the tanh, we need to know the local derivative of tanh: if we have that o is tanh of n, then what is do/dn? Now, what you could do is come here, take this expression, and do your calculus derivative-taking, and that would work. But we can also just scroll down Wikipedia here to a section that hopefully tells us the derivative: d/dx of tanh of x is, any of these, and I like this one, 1 minus tanh squared of x. So this is 1 minus tanh(x) squared. Basically what this is saying is that do/dn is 1 minus tanh of n squared. And we already have tanh of n; it's just o. So it's 1 minus o squared. So o is the output here; o.data is this number, and what this is saying is that do/dn is 1 minus that squared. So 1 minus o.data squared is 0.5, conveniently. So the local derivative of this tanh operation here is 0.5, and that would be do/dn. So we can fill in that n.grad is 0.5; we'll just fill it in. So this is exactly 0.5, one half.

So now we're going to continue the backprop. This is 0.5, and this is a plus node. So what is backprop going to do here? If you remember our previous example, a plus is just a distributor of gradient, so this gradient will simply flow to both of these equally, and that's because the local derivative of this operation is 1 for every one of its nodes. So 1 times 0.5 is 0.5.
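The 1 minus o squared shortcut used a moment ago is easy to verify numerically before moving on; a quick sketch with the neuron's actual pre-activation:

```python
import math

n = 0.8813735870195432        # the cell body activation from above
o = math.tanh(n)              # ~0.7071

local = 1 - o ** 2            # the claimed do/dn

# finite-difference check of the same derivative
h = 1e-6
numeric = (math.tanh(n + h) - math.tanh(n)) / h

print(local, numeric)   # both come out to ~0.5
```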
So therefore we know that this node here, the x1w1 + x2w2 node, has grad 0.5, and we know that b.grad is also 0.5. So let's set those and let's draw. So those are 0.5. Continuing, we have another plus; 0.5, again, we'll just distribute, so 0.5 will flow to both of these. So we can set theirs: x1w1.grad and x2w2.grad are both 0.5. And let's redraw. Pluses are my favorite operations to backpropagate through, because it's very simple. So now what's flowing into these expressions is 0.5, and really, again, keep in mind what the derivative is telling us at every point in time along here: this is saying that if we want the output of this neuron to increase, then the influence of these expressions on the output is positive; both of them are positive. So now, backpropagating to x2 and w2 first. This is a times node, so we know that the local derivative is the other term. So if we want to calculate x2.grad, can you think through what it's going to be?
x2.grad will be w2.data times x2w2.grad, and w2.grad will be x2.data times x2w2.grad. Right, so that's the little local piece of chain rule. Let's set them and let's redraw. So here we see that the gradient on our weight w2 is 0, because x2's data was 0, right? But x2 will have the gradient 0.5, because the data on w2 was 1. And what's interesting here, right, is that because the input x2 was 0, then, because of the way the times works, of course this gradient comes out to 0. Think intuitively about why that is: the derivative always tells us the influence of a node on the final output. If I wiggle w2, how is the output changing? It's not changing, because we're multiplying by zero. So because it's not changing, there is no derivative, and 0 is the correct answer, because we're multiplying by that zero.

And let's do it here too: 0.5 should come here and flow through this times, and so we'll have that x1.grad is, can you think through a little bit what this should be? The local derivative of times with respect to x1 is going to be w1, so x1.grad is w1's data times x1w1.grad, and w1.grad will be x1.data times x1w1.grad. Let's see what those came out to be: x1w1.grad is 0.5, so this would be negative 1.5, and this would be 1.
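Those four hand-computed products, as a plain-float sketch:

```python
# values from the neuron: x1=2.0, x2=0.0, w1=-3.0, w2=1.0
x1, x2, w1, w2 = 2.0, 0.0, -3.0, 1.0
upstream = 0.5   # gradient flowing into both x1*w1 and x2*w2

# times node: the local derivative is "the other term"
x2_grad = w2 * upstream   #  1.0 * 0.5 =  0.5
w2_grad = x2 * upstream   #  0.0 * 0.5 =  0.0  (x2 is 0, so wiggling w2 can't matter)
x1_grad = w1 * upstream   # -3.0 * 0.5 = -1.5
w1_grad = x1 * upstream   #  2.0 * 0.5 =  1.0

print(x1_grad, w1_grad, x2_grad, w2_grad)
```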
and we've backpropagated through this expression. These are the actual final derivatives. So if we 01:08:39.400 --> 01:08:40.520 look here, 01:08:40.520 --> 01:08:44.920 this is negative 1.5. So if we 01:08:44.920 --> 01:08:49.560 now want this neuron's output to increase, we know that what's necessary is that 01:08:51.320 --> 01:08:55.400 w2 — we have no gradient, w2 doesn't actually matter to this neuron right now — 01:08:55.400 --> 01:09:00.920 but this weight should go up. So if this weight goes up, then this neuron's output 01:09:00.920 --> 01:09:07.480 would have gone up, and proportionally, because the gradient is 1. Okay, so doing the backpropagation 01:09:07.480 --> 01:09:08.360 manually is obviously ridiculous. So we are now going to put an end to this suffering, and we're going to see how we 01:09:08.980 --> 01:09:13.060 can implement the backward pass a bit more automatically. We're not going to be doing 01:09:13.060 --> 01:09:18.140 all of it manually out here. It's now pretty obvious to us by example how these pluses and 01:09:18.140 --> 01:09:23.540 times are backpropagating gradients. So let's go up to the value object, and we're going to start 01:09:23.540 --> 01:09:31.760 codifying what we've seen in the examples below. So we're going to do this by storing a special 01:09:31.760 --> 01:09:39.640 self._backward, underscore backward. And this will be a function which is going to do that 01:09:39.640 --> 01:09:44.340 little piece of chain rule.
At each little node that took inputs and produced an output, 01:09:45.040 --> 01:09:51.480 we're going to store how we are going to chain the output's gradient into the inputs' gradients. 01:09:52.340 --> 01:10:00.940 So by default, this will be a function that doesn't do anything. And you can also see that 01:10:00.940 --> 01:10:01.740 here in the value object in my example, 01:10:01.760 --> 01:10:08.900 micrograd. So we have this _backward function. By default, it doesn't do anything. This is an 01:10:08.900 --> 01:10:13.080 empty function. And that would be sort of the case, for example, for a leaf node. For a leaf 01:10:13.080 --> 01:10:20.520 node, there's nothing to do. But now when we're creating these out values, these out values are 01:10:20.520 --> 01:10:30.600 an addition of self and other. And so we'll want to set out._backward to be the function that 01:10:30.600 --> 01:10:31.740 propagates the gradient. 01:10:31.760 --> 01:10:42.960 So let's define what should happen, and we're going to store it in a closure. Let's define what 01:10:42.960 --> 01:10:53.180 should happen when we call out's _backward. For addition, our job is to take out's grad and 01:10:53.180 --> 01:10:58.940 propagate it into self's grad and other.grad. So basically, we want to set self.grad to 01:10:58.940 --> 01:11:01.740 something.
And we want to set other's grad 01:11:01.760 --> 01:11:09.240 to something. Okay, and the way we saw below how chain rule works, we 01:11:09.240 --> 01:11:14.300 want to take the local derivative times the, sort of, global derivative, I should 01:11:14.300 --> 01:11:17.900 call it, which is the derivative of the final output of the expression with 01:11:17.900 --> 01:11:27.320 respect to out's data — with respect to out. So the local derivative of self in an 01:11:27.320 --> 01:11:35.420 addition is 1.0, so it's just 1.0 times out's grad. That's the chain rule. And 01:11:35.420 --> 01:11:40.760 other.grad will be 1.0 times out.grad. And what you're basically seeing 01:11:40.760 --> 01:11:46.280 here is that out's grad will simply be copied onto self's grad and other's grad, 01:11:46.280 --> 01:11:51.440 as we saw happens for an addition operation. So we're going to later call 01:11:51.440 --> 01:11:56.420 this function to propagate the gradient. Having done an addition, let's now do 01:11:56.420 --> 01:11:57.040 multiplication. 01:11:57.320 --> 01:12:04.880 We're going to also define out, and we're going to set its _backward to be 01:12:04.880 --> 01:12:15.500 _backward, and we want to chain out.grad into self.grad and other.grad. 01:12:15.500 --> 01:12:21.940 And this will be a little piece of chain rule for multiplication. So we'll have — so 01:12:21.940 --> 01:12:25.940 what should this be? Can you think through 01:12:27.320 --> 01:12:31.340 what the local derivative of a times node is? 01:12:31.340 --> 01:12:33.680 So self.grad will be 01:12:33.680 --> 01:12:35.900 other.data times out.grad, 01:12:35.900 --> 01:12:40.160 and other.grad will be 01:12:40.160 --> 01:12:44.840 self.data times out.grad. For a times node, the local derivative with respect to one input 01:12:44.840 --> 01:12:48.040 is the other input.
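A sketch of how those two closures might look on a pared-down value object (attribute names follow micrograd; note it already uses the += accumulation that the lecture arrives at later — plain = behaves identically as long as every variable is used only once):

```python
class Value:
    """Minimal sketch: wraps a number and stores a _backward closure."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf nodes: nothing to do
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # local derivative of + is 1.0 for both inputs,
            # so out.grad is simply routed into each input's grad
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward       # store the function, don't call it
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # local derivative of * w.r.t. one input is the other input
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
```

Calling c = a * b and then c._backward() (after seeding c.grad) deposits b.data and a.data into a.grad and b.grad respectively.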
And that gets chained through out.grad. 01:12:48.040 --> 01:12:51.400 Okay, and finally, here is the tanh example. 01:12:51.400 --> 01:12:55.540 For tanh, just like before, we want to set the 01:12:55.540 --> 01:12:56.560 output's 01:12:56.560 --> 01:12:57.300 underscore backward 01:12:57.320 --> 01:13:05.600 to be just backward, and here we need to backpropagate. We have out.grad, and we want to 01:13:05.600 --> 01:13:13.780 chain it into self.grad. And self.grad will be the local derivative of this operation 01:13:13.780 --> 01:13:20.360 that we've done here, which is tanh. And so we saw that the local gradient is 1 minus the tanh of x, 01:13:20.360 --> 01:13:27.120 squared, which here is t. That's the local derivative, because t is the output of this tanh. 01:13:27.120 --> 01:13:33.960 So 1 minus t squared is the local derivative, and then the gradient has to be multiplied, because of the 01:13:33.960 --> 01:13:40.100 chain rule. So out.grad is chained through the local gradient into self.grad, and that should 01:13:40.100 --> 01:13:46.240 be basically it. So we're going to redefine our value node, we're going to swing all the way down 01:13:46.240 --> 01:13:56.560 here, and we're going to redefine our expression, make sure that all the grads are zero. Okay, but now we don't have 01:13:56.560 --> 01:13:57.100 to 01:13:57.100 --> 01:14:01.980 do this manually anymore. We are going to basically be calling ._backward 01:14:01.980 --> 01:14:15.080 in the right order. So first we want to call o's _backward. So o was the 01:14:15.080 --> 01:14:23.320 outcome of tanh, right? So calling o's _backward will be this 01:14:23.320 --> 01:14:29.480 function. This is what it will do.
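The tanh chain-rule step in isolation, as a plain-float sketch (the function name is mine, not micrograd's — it just exposes the local derivative 1 - tanh(x)**2 and the chaining):

```python
import math

def tanh_backward(x, out_grad):
    """Return (t, x_grad): forward tanh output and the gradient
    chained through it. Local derivative: d/dx tanh(x) = 1 - tanh(x)**2."""
    t = math.tanh(x)                  # forward output (this is out.data)
    x_grad = (1.0 - t**2) * out_grad  # local derivative times upstream grad
    return t, x_grad
```

At x = 0 the local derivative is 1, so the upstream gradient passes through unchanged; near the lecture's n value (where tanh ≈ 0.7071) it is roughly 0.5, which is exactly where the 0.5 flowing into the plus nodes came from.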
Now we have to be careful, because there's a 01:14:29.480 --> 01:14:39.700 times out.grad, and out.grad, remember, is initialized to 0. So here we see 01:14:39.700 --> 01:14:47.320 grad 0. So as a base case, we need to set o's grad to 1.0, to initialize 01:14:47.320 --> 01:14:49.980 this with 1, 01:14:53.320 --> 01:14:59.000 and then once this is 1, we can call o._backward, and what that should do is it should 01:14:59.000 --> 01:15:05.840 propagate this grad through tanh. So the local derivative times the global derivative, which 01:15:05.840 --> 01:15:08.780 is initialized at 1. So this should — 01:15:08.780 --> 01:15:21.080 so I thought about redoing it, but I figured I should just leave the error in here, because 01:15:21.080 --> 01:15:23.200 it's pretty funny. Why is a NoneType object 01:15:23.320 --> 01:15:30.860 not callable? It's because I screwed up. We're trying to save these functions. So this is 01:15:30.860 --> 01:15:36.200 correct. This here — we don't want to call the function, because that returns None. These 01:15:36.200 --> 01:15:41.000 functions return None. We just want to store the function. So let me redefine the value 01:15:41.000 --> 01:15:47.080 object, and then we're going to come back in, redefine the expression, draw a dot. Everything 01:15:47.080 --> 01:15:48.080 is great. 01:15:48.080 --> 01:15:53.080 o.grad is 1, and now 01:15:53.080 --> 01:15:59.580 this should work, of course. Okay. So after o._backward, this grad should now 01:15:59.580 --> 01:16:06.840 be 0.5 if we redraw, and if everything went correctly, 0.5. Yay. Okay. So now we need to 01:16:06.840 --> 01:16:18.080 call n's — n._backward, sorry, n's _backward. So that seems to have worked. 01:16:18.080 --> 01:16:22.920 So n's _backward routed the gradient to both of these. So this is looking great. So 01:16:22.920 --> 01:16:32.280 now we could, of course, call b's — b._backward, sorry. What's going to happen?
01:16:32.280 --> 01:16:38.860 Well, b doesn't have anything to backpropagate. b's _backward, because b is a leaf node, is 01:16:38.860 --> 01:16:46.200 by initialization the empty function. So nothing would happen. But we can call it on it. But 01:16:46.200 --> 01:16:50.920 when we call this one's _backward, 01:16:57.300 --> 01:17:04.820 then we expect this 0.5 to get further routed. Right? So 01:17:04.820 --> 01:17:18.140 there we go: 0.5, 0.5. And then finally, we want to call it here on x2w2, and on 01:17:18.140 --> 01:17:20.140 x1w1. 01:17:20.140 --> 01:17:22.080 Let's do both of those. And there we go. 01:17:22.920 --> 01:17:29.420 Exactly as we did before, but now we've done it through calling that _backward 01:17:29.420 --> 01:17:36.780 sort of manually. So we have one last piece to get rid of, which is us calling underscore 01:17:36.780 --> 01:17:42.660 backward manually. So let's think through what we are actually doing. We've laid out a mathematical 01:17:42.660 --> 01:17:48.760 expression, and now we're trying to go backwards through that expression. So going backwards through 01:17:48.760 --> 01:17:54.520 the expression just means that we never want to call a ._backward for any node before 01:17:54.520 --> 01:18:02.140 we've done, sort of, everything after it. So we have to do everything after it before we're ever going 01:18:02.140 --> 01:18:06.260 to call ._backward on any one node. We have to get all of its full dependencies — everything that 01:18:06.260 --> 01:18:13.780 it depends on has to propagate to it before we can continue backpropagation. So this ordering 01:18:13.780 --> 01:18:17.220 of graphs can be achieved using something called topological sort. 01:18:17.220 --> 01:18:18.620 So topological 01:18:18.620 --> 01:18:25.520 sort is basically a laying out of a graph such that all the edges go only from left to right,
01:18:25.520 --> 01:18:33.380 basically. So here we have a graph — it's a directed acyclic graph, a DAG — and this is two different 01:18:33.380 --> 01:18:37.860 topological orders of it, I believe, where basically you'll see that it's a laying out of the nodes 01:18:37.860 --> 01:18:44.260 such that all the edges go only one way, from left to right. And implementing topological sort — you 01:18:44.260 --> 01:18:51.000 can look in Wikipedia and so on, I'm not going to go through it in detail — but basically this is what 01:18:51.000 --> 01:19:00.160 builds a topological order. We maintain a set of visited nodes, and then we are going through, 01:19:00.160 --> 01:19:04.960 starting at some root node, which for us is o — that's where I want to start the topological sort. 01:19:04.960 --> 01:19:11.000 And starting at o, we go through all of its children, and we need to lay them out from left to 01:19:11.000 --> 01:19:14.200 right. And basically this starts at o. 01:19:14.260 --> 01:19:17.580 If o is not visited, then it marks it as visited, 01:19:17.880 --> 01:19:23.260 and then it iterates through all of its children and calls build topological on them. 01:19:24.080 --> 01:19:27.680 And then after it's gone through all the children, it adds itself. 01:19:28.240 --> 01:19:32.700 So basically, this node that we're going to call it on, like, say, o, 01:19:33.060 --> 01:19:38.860 is only going to add itself to the topo list after all of the children have been processed. 01:19:38.860 --> 01:19:43.560 And that's how this function is guaranteeing that you're only going to be in the list 01:19:43.560 --> 01:19:45.560 once all of your children are in the list. 01:19:45.820 --> 01:19:47.400 And that's the invariant that is being maintained. 01:19:47.820 --> 01:19:51.340 So if we build topo on o and then inspect this list, 01:19:51.720 --> 01:19:55.740 we're going to see that it ordered our value objects. 01:19:56.500 --> 01:20:00.820 And the last one is the value of 0.707, which is the output.
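The post-order DFS just described might look like this as a standalone sketch (it assumes each node exposes a _prev set of child nodes, as micrograd's value object does; the function name is mine):

```python
def topological_order(root):
    """Return nodes in topological order via post-order DFS.
    Invariant from the lecture: a node is appended to the list only
    after all of its children are already in the list, so the root
    (the expression's output) ends up last."""
    topo, visited = [], set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:   # _prev: set of child nodes
                build_topo(child)
            topo.append(v)          # only after all children are in
    build_topo(root)
    return topo
```

Iterating this list in reverse then visits the output first and the leaves last, which is exactly the order backpropagation needs.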
01:20:01.520 --> 01:20:08.080 So this is o, and then this is n, and then all the other nodes get laid out before it. 01:20:09.500 --> 01:20:11.540 So that builds the topological order. 01:20:12.100 --> 01:20:13.500 And really what we're doing now 01:20:13.560 --> 01:20:19.000 is we're just calling dot underscore backward on all of the nodes in a topological order. 01:20:19.580 --> 01:20:24.180 So if we just reset the gradients — they're all 0 — what did we do? 01:20:24.540 --> 01:20:30.480 We started by setting o.grad to be 1. 01:20:31.160 --> 01:20:32.600 That's the base case. 01:20:33.260 --> 01:20:35.960 Then we built a topological order. 01:20:37.960 --> 01:20:43.240 And then we went: for node in reversed 01:20:43.960 --> 01:20:44.680 of topo. 01:20:46.220 --> 01:20:51.220 Now, in the reverse order, because this list goes from the inputs to the output, 01:20:51.620 --> 01:20:53.080 we need to go through it in reversed order. 01:20:53.960 --> 01:20:57.180 So starting at o: node._backward. 01:20:58.480 --> 01:21:01.580 And this should be it. 01:21:03.180 --> 01:21:03.940 There we go. 01:21:05.380 --> 01:21:06.580 Those are the correct derivatives. 01:21:07.140 --> 01:21:09.480 Finally, we are going to hide this functionality. 01:21:10.020 --> 01:21:12.420 So I'm going to copy this, 01:21:12.740 --> 01:21:13.540 and we're going to hide this functionality 01:21:13.560 --> 01:21:14.840 inside the value class, 01:21:15.000 --> 01:21:17.340 because we don't want to have all that code lying around. 01:21:18.340 --> 01:21:19.720 So instead of an underscore backward, 01:21:19.940 --> 01:21:21.840 we're now going to define an actual backward. 01:21:22.160 --> 01:21:24.200 So that's backward, without the underscore. 01:21:26.120 --> 01:21:28.400 And that's going to do all the stuff that we just derived. 01:21:29.000 --> 01:21:30.700 So let me just clean this up a little bit.
01:21:31.160 --> 01:21:38.360 So we're first going to build a topological graph, 01:21:38.840 --> 01:21:40.340 starting at self. 01:21:41.340 --> 01:21:43.380 So build topo of self 01:21:43.820 --> 01:21:47.380 will populate the topological order into the topo list, 01:21:47.540 --> 01:21:48.500 which is a local variable. 01:21:49.060 --> 01:21:51.520 Then we set self.grad to be one. 01:21:52.780 --> 01:21:55.740 And then for each node in the reversed list — 01:21:56.060 --> 01:21:58.240 so starting at self and going to all the children — we call 01:21:59.240 --> 01:22:00.700 underscore backward. 01:22:02.180 --> 01:22:04.460 And that should be it. 01:22:04.820 --> 01:22:06.460 So save. 01:22:07.700 --> 01:22:08.720 Come down here. 01:22:09.340 --> 01:22:10.000 We redefine. 01:22:11.000 --> 01:22:12.340 Okay, all the grads are zero. 01:22:13.560 --> 01:22:16.520 And now what we can do is o.backward, without the underscore. 01:22:17.480 --> 01:22:22.000 And there we go. 01:22:22.900 --> 01:22:25.040 And that's backpropagation, 01:22:26.420 --> 01:22:27.580 at least for one neuron. 01:22:28.540 --> 01:22:30.700 Now we shouldn't be too happy with ourselves, actually, 01:22:30.700 --> 01:22:32.700 because we have a bad bug. 01:22:33.340 --> 01:22:35.060 And we have not surfaced the bug 01:22:35.060 --> 01:22:38.780 because of some specific conditions that we have to think about right now. 01:22:39.700 --> 01:22:42.440 So here's the simplest case that shows the bug. 01:22:43.560 --> 01:22:45.800 Say I create a single node A, 01:22:46.160 --> 01:22:49.980 and then I create a B that is A plus A, 01:22:51.460 --> 01:22:52.440 and then I call backward. 01:22:54.740 --> 01:22:56.780 So what's going to happen is A is three, 01:22:57.280 --> 01:22:59.120 and then B is A plus A. 01:22:59.340 --> 01:23:01.640 So there's two arrows on top of each other here. 01:23:03.660 --> 01:23:06.320 Then we can see that B is — of course, the forward pass works:
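Putting the pieces together, here is one way the value class might look at this point, restricted to + and * for brevity (a sketch, not micrograd's exact source; it already includes the += gradient accumulation that the lecture motivates next, so the A-plus-A case comes out right):

```python
class Value:
    """Sketch of the value object with the public backward() assembled."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # local derivative of + is 1.0
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # 1) topological order of the whole expression graph
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2) base case, then chain rule applied output-to-inputs
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
```

For example, with d = a * b + c and d.backward(), the leaf gradients come out as b.data, a.data, and 1 respectively.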
01:23:06.780 --> 01:23:09.240 B is just A plus A, which is six. 01:23:09.240 --> 01:23:12.100 But the gradient here is not actually correct — 01:23:12.580 --> 01:23:13.540 the one that we 01:23:13.560 --> 01:23:14.160 calculated automatically. 01:23:15.740 --> 01:23:22.160 And that's because, of course, just doing calculus in your head, 01:23:22.540 --> 01:23:25.940 the derivative of B with respect to A should be two: 01:23:27.420 --> 01:23:28.240 one plus one. 01:23:28.880 --> 01:23:29.580 It's not one. 01:23:30.940 --> 01:23:32.320 Intuitively, what's happening here, right? 01:23:32.380 --> 01:23:35.840 So B is the result of A plus A, and then we call backward on it. 01:23:36.440 --> 01:23:39.200 So let's go up and see what that does. 01:23:43.560 --> 01:23:46.680 B is the result of addition, so out is B. 01:23:48.020 --> 01:23:54.000 And then when we call backward, what happened is self.grad was set to one, 01:23:54.560 --> 01:23:56.540 and then other.grad was set to one. 01:23:57.280 --> 01:24:02.740 But because we're doing A plus A, self and other are actually the exact same object. 01:24:03.420 --> 01:24:05.660 So we are overriding the gradient. 01:24:05.840 --> 01:24:09.160 We are setting it to one, and then we are setting it again to one. 01:24:09.500 --> 01:24:12.320 And that's why it stays at one. 01:24:12.580 --> 01:24:13.540 So that's a problem. 01:24:14.540 --> 01:24:18.000 There's another way to see this in a little bit more complicated expression. 01:24:21.340 --> 01:24:24.660 So here we have A and B, 01:24:25.920 --> 01:24:29.320 and then D will be the multiplication of the two, 01:24:29.620 --> 01:24:31.240 and E will be the addition of the two. 01:24:32.140 --> 01:24:34.800 And then we multiply E times D to get F. 01:24:35.260 --> 01:24:36.540 And then we call F dot backward. 01:24:37.660 --> 01:24:40.040 And these gradients, if you check, will be incorrect.
01:24:40.600 --> 01:24:42.880 So fundamentally what's happening here, again, 01:24:42.880 --> 01:24:48.660 is basically we're going to see an issue any time we use a variable more than once. 01:24:49.180 --> 01:24:53.060 Until now, in these expressions above, every variable is used exactly once, 01:24:53.160 --> 01:24:54.160 so we didn't see the issue. 01:24:54.920 --> 01:24:57.000 But here, if a variable is used more than once, 01:24:57.100 --> 01:24:58.580 what's going to happen during the backward pass? 01:24:59.100 --> 01:25:01.680 We're backpropagating from F to E to D. 01:25:01.860 --> 01:25:02.480 So far, so good. 01:25:02.720 --> 01:25:07.080 But now E calls its _backward, and it deposits its gradients to A and B. 01:25:07.420 --> 01:25:10.020 But then we come back to D and call its _backward, 01:25:10.020 --> 01:25:12.860 and it overwrites those gradients at A and B. 01:25:12.880 --> 01:25:16.100 So that's obviously a problem. 01:25:17.300 --> 01:25:22.260 And the solution here — if you look at the multivariate case of the chain rule 01:25:22.260 --> 01:25:23.420 and its generalization there — 01:25:23.780 --> 01:25:27.880 the solution there is basically that we have to accumulate these gradients. 01:25:28.020 --> 01:25:29.020 These gradients add. 01:25:30.200 --> 01:25:32.820 And so instead of setting those gradients, 01:25:33.780 --> 01:25:36.260 we can simply do plus equals. 01:25:36.680 --> 01:25:38.380 We need to accumulate those gradients: 01:25:38.880 --> 01:25:42.560 plus equals, plus equals, plus equals. 01:25:42.880 --> 01:25:49.880 And this will be okay, remember, because we are initializing them at zero, 01:25:50.040 --> 01:25:58.100 so they start at zero, and then any contribution that flows backwards will simply add. 01:25:58.800 --> 01:26:05.500 So now if we redefine this one, because of the plus equals, this now works.
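The overwrite-versus-accumulate distinction in miniature, with a plain float standing in for a.grad in the b = a + a example:

```python
# b = a + a: each use of a deposits a gradient of 1.0 * b.grad,
# and since self and other are the same object, both hit a's grad.
out_grad = 1.0   # b.grad, the base case

# Overwriting (grad = ...): the second deposit clobbers the first.
a_grad = 0.0
a_grad = 1.0 * out_grad   # deposit as "self"
a_grad = 1.0 * out_grad   # deposit as "other" -- overwrites, stays 1.0
wrong = a_grad

# Accumulating (grad += ...): the deposits add, giving db/da = 2.
a_grad = 0.0
a_grad += 1.0 * out_grad
a_grad += 1.0 * out_grad
correct = a_grad
```

The accumulation is safe precisely because gradients are initialized to zero, so every backward contribution just adds on top.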
01:26:05.880 --> 01:26:09.220 Because A dot grad started at zero, and we called B dot backward, 01:26:09.660 --> 01:26:12.460 we deposit one, and then we deposit one again. 01:26:12.860 --> 01:26:14.280 And now this is two, which is correct. 01:26:14.860 --> 01:26:17.900 And here, this will also work, and we'll get correct gradients, 01:26:18.380 --> 01:26:22.060 because when we call E dot backward, we will deposit the gradients from this branch, 01:26:22.460 --> 01:26:26.480 and then when we get to D dot backward, it will deposit its own gradients. 01:26:26.900 --> 01:26:29.580 And then those gradients simply add on top of each other. 01:26:30.120 --> 01:26:32.820 And so we just accumulate those gradients, and that fixes the issue. 01:26:33.440 --> 01:26:36.320 Okay, now before we move on, let me actually do a bit of cleanup here 01:26:36.320 --> 01:26:40.000 and delete some of this intermediate work. 01:26:40.720 --> 01:26:42.620 So I'm not going to need any of this 01:26:42.620 --> 01:26:44.000 now that we've derived all of it. 01:26:45.460 --> 01:26:48.840 We are going to keep this, because I want to come back to it. 01:26:49.640 --> 01:26:56.640 Delete the tanh, delete our motivating example, delete the step, delete this, 01:26:56.960 --> 01:27:01.220 keep the code that draws, and then delete this example, 01:27:01.840 --> 01:27:04.180 and leave behind only the definition of value. 01:27:05.360 --> 01:27:08.800 And now let's come back to this non-linearity here that we implemented, the tanh. 01:27:09.060 --> 01:27:12.060 Now I told you that we could have broken down tanh 01:27:12.060 --> 01:27:17.200 into its explicit atoms in terms of other expressions if we had the exp function.
01:27:17.880 --> 01:27:19.720 So if you remember, tanh is defined like this, 01:27:20.140 --> 01:27:22.620 and we chose to develop tanh as a single function, 01:27:22.960 --> 01:27:25.080 and we can do that because we know its derivative 01:27:25.280 --> 01:27:26.440 and we can backpropagate through it. 01:27:26.880 --> 01:27:30.740 But we can also break down tanh into an explicit expression, as a function of exp. 01:27:31.160 --> 01:27:33.760 And I would like to do that now, because I want to prove to you 01:27:33.760 --> 01:27:35.760 that you get all the same results and all the same gradients, 01:27:36.300 --> 01:27:39.560 but also because it forces us to implement a few more expressions. 01:27:39.560 --> 01:27:41.940 It forces us to do exponentiation, 01:27:42.060 --> 01:27:45.160 addition, subtraction, division, and things like that. 01:27:45.160 --> 01:27:47.660 And I think it's a good exercise to go through a few more of these. 01:27:48.160 --> 01:27:51.360 Okay, so let's scroll up to the definition of value. 01:27:52.160 --> 01:27:54.560 And here, one thing that we currently can't do is — 01:27:54.560 --> 01:27:57.560 we can do, like, a value of, say, 2.0. 01:27:58.460 --> 01:28:02.360 But we can't do — you know, here, for example, we want to add a constant 1 — 01:28:02.560 --> 01:28:04.260 we can't do something like this. 01:28:05.260 --> 01:28:08.360 And we can't do it because it says int object has no attribute data. 01:28:08.660 --> 01:28:11.660 That's because a plus 1 comes right here to add, 01:28:12.160 --> 01:28:14.460 and then other is the integer 1. 01:28:14.860 --> 01:28:18.360 And then here, Python is trying to access 1.data, and that's not a thing. 01:28:18.760 --> 01:28:21.460 And that's because basically, 1 is not a value object, 01:28:21.460 --> 01:28:23.560 and we only have addition for value objects.
01:28:23.960 --> 01:28:27.760 So as a matter of convenience, so that we can create expressions like this 01:28:27.760 --> 01:28:30.860 and make them make sense, we can simply do something like this. 01:28:32.360 --> 01:28:37.060 Basically, we leave other alone if other is an instance of value. 01:28:37.260 --> 01:28:39.860 But if it's not an instance of value, we're going to assume that it's a number, 01:28:39.860 --> 01:28:41.860 like an integer or a float, and we're going to simply 01:28:41.860 --> 01:28:43.860 wrap it in value. 01:28:44.160 --> 01:28:46.060 And then other will just become value of other, 01:28:46.060 --> 01:28:48.960 and then other will have a data attribute, and this should work. 01:28:49.360 --> 01:28:52.860 So if I just say this, redefine value, then this should work. 01:28:53.360 --> 01:28:53.860 There we go. 01:28:54.360 --> 01:28:56.560 Okay, now let's do the exact same thing for multiply, 01:28:56.660 --> 01:29:01.060 because we can't do something like this, again, for the exact same reason. 01:29:01.360 --> 01:29:05.660 So we just have to go to mul, and if other is not a value, 01:29:05.660 --> 01:29:07.160 then let's wrap it in value. 01:29:07.660 --> 01:29:09.960 Let's redefine value, and now this works. 01:29:10.660 --> 01:29:11.660 Now, here's a kind of unfortunate 01:29:11.660 --> 01:29:15.760 and not obvious part: a times two works, we saw that, 01:29:15.960 --> 01:29:18.560 but two times a — is that going to work? 01:29:19.860 --> 01:29:20.960 You'd expect it to, right? 01:29:21.360 --> 01:29:22.760 But actually, it will not. 01:29:23.160 --> 01:29:25.760 And the reason it won't is because Python doesn't know — 01:29:26.260 --> 01:29:30.860 like, when you do a times two — basically, so a times two, 01:29:30.960 --> 01:29:35.760 Python will go and it will basically do something like a dot mul of two. 01:29:35.860 --> 01:29:37.060 That's basically what it will call.
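The convenience wrapping described here might look like this, on a pared-down value sketch with just data and add (micrograd applies the same one-liner in its other operators too):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # leave other alone if it's already a Value; otherwise assume
        # it's a plain int/float and wrap it, so `a + 1` makes sense
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)
```

With this in place, a + 1 no longer dies trying to access 1.data, because the integer gets promoted to Value(1) first.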
01:29:37.260 --> 01:29:41.460 But to it, two times a is the same as two dot mul of a. 01:29:41.860 --> 01:29:45.460 And it doesn't — two can't multiply a value. 01:29:45.560 --> 01:29:47.060 And so it's really confused about that. 01:29:47.560 --> 01:29:51.460 So instead, what happens is, in Python, the way this works is you are free to define 01:29:51.860 --> 01:29:53.460 something called rmul. 01:29:54.460 --> 01:29:57.260 And rmul is kind of like a fallback. 01:29:57.360 --> 01:30:03.660 So if Python can't do two times a, it will check if, by any chance, 01:30:03.860 --> 01:30:07.660 a knows how to multiply two, and that will be called into rmul. 01:30:08.860 --> 01:30:11.260 So because Python can't do two times a, 01:30:11.660 --> 01:30:13.560 it will check, is there an rmul in value? 01:30:13.860 --> 01:30:16.360 And because there is, it will now call that. 01:30:17.060 --> 01:30:20.360 And what we'll do here is we will swap the order of the operands. 01:30:20.760 --> 01:30:23.360 So basically, two times a will redirect to rmul, 01:30:23.660 --> 01:30:25.760 and rmul will basically call a times two. 01:30:26.360 --> 01:30:27.560 And that's how that will work. 01:30:28.560 --> 01:30:32.360 So redefining that with rmul, two times a becomes four. 01:30:32.860 --> 01:30:35.060 Okay, now looking at the other elements that we still need, 01:30:35.160 --> 01:30:36.960 we need to know how to exponentiate and how to divide. 01:30:37.460 --> 01:30:40.160 So let's first do the exponentiation part. 01:30:40.560 --> 01:30:41.460 We're going to introduce 01:30:41.960 --> 01:30:44.360 a single function exp here. 01:30:45.160 --> 01:30:49.860 And exp is going to mirror tanh in the sense that it's a single function 01:30:49.860 --> 01:30:52.660 that transforms a single scalar value and outputs a single scalar value. 01:30:53.260 --> 01:30:55.260 So we pop out the Python number.
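The __rmul__ fallback just described, in a minimal sketch (Python's name for "rmul" is the dunder method __rmul__; this is standard Python data-model behavior, not anything micrograd-specific):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __rmul__(self, other):
        # fallback: `2 * a` fails as (2).__mul__(a), so Python tries
        # a.__rmul__(2); we just swap the operands and reuse __mul__
        return self * other
```

So a * 2 goes through __mul__ directly, while 2 * a takes the reflected path through __rmul__ and lands in the same code.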
01:30:55.760 --> 01:30:58.860 We use math.exp to exponentiate it, create a new value object, 01:30:59.360 --> 01:31:00.560 everything that we've seen before. 01:31:01.060 --> 01:31:04.160 The tricky part, of course, is how do you backpropagate through e to the x? 01:31:04.860 --> 01:31:10.160 And so here you can potentially pause the video and think about what should go here. 01:31:11.660 --> 01:31:18.260 Okay, so basically, we need to know what is the local derivative of e to the x. 01:31:18.560 --> 01:31:22.060 So d by dx of e to the x is famously just e to the x. 01:31:22.360 --> 01:31:26.360 And we've already just calculated e to the x, and it's inside out.data. 01:31:26.660 --> 01:31:31.260 So we can do out.data times out.grad — that's the chain rule. 01:31:32.160 --> 01:31:34.660 So we're just chaining on to the current running grad. 01:31:35.360 --> 01:31:37.160 And this is what the expression looks like. 01:31:37.360 --> 01:31:39.760 It looks a little confusing, but this is what it is. 01:31:39.760 --> 01:31:41.060 And that's the exponentiation. 01:31:41.660 --> 01:31:44.960 So redefining, we should now be able to call a.exp. 01:31:45.460 --> 01:31:48.160 And hopefully the backward pass works as well. 01:31:48.360 --> 01:31:51.660 Okay, and the last thing we'd like to do, of course, is we'd like to be able to divide. 01:31:52.360 --> 01:31:56.060 Now, I actually will implement something slightly more powerful than division, 01:31:56.060 --> 01:31:59.560 because division is just a special case of something a bit more powerful. 01:32:00.160 --> 01:32:06.960 So in particular, just by rearranging, if we have some kind of a b equals value of 4.0 here, 01:32:07.060 --> 01:32:10.860 we'd like to basically be able to do a divide b, and we'd like this to be able to give us 0.5. 01:32:11.660 --> 01:32:14.960 Now, division actually can be reshuffled as follows. 01:32:15.460 --> 01:32:19.360 If we have a divide b, that's actually the same as a multiplying 1 over b.
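A sketch of exp with its backward closure — the neat part is that out.data already holds e^x, so the forward result doubles as the local derivative:

```python
import math

class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def exp(self):
        out = Value(math.exp(self.data))
        def _backward():
            # d/dx e^x = e^x, which is already sitting in out.data;
            # chain it onto the running gradient
            self.grad += out.data * out.grad
        out._backward = _backward
        return out
```

For a = Value(0.0), a.exp() gives data 1.0, and backpropagating a gradient of 1.0 through it deposits 1.0 into a.grad.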
01:32:20.060 --> 01:32:23.460 And that's the same as a multiplying b to the power of negative 1. 01:32:24.460 --> 01:32:31.460 And so what I'd like to do instead is I basically like to implement the operation of x to the k for some constant k. 01:32:31.660 --> 01:32:33.160 So it's an integer or a float. 01:32:34.160 --> 01:32:36.260 And we would like to be able to differentiate this. 01:32:36.260 --> 01:32:40.160 And then as a special case, negative 1 will be division. 01:32:40.960 --> 01:32:41.560 And so I'm doing that. 01:32:41.560 --> 01:32:46.060 Just because it's more general and you might as well do it that way. 01:32:46.460 --> 01:32:53.560 So basically what I'm saying is we can redefine division, which we will put here somewhere. 01:32:54.660 --> 01:32:55.860 You know, we can put it here somewhere. 01:32:56.360 --> 01:32:58.860 What I'm saying is that we can redefine division. 01:32:59.160 --> 01:33:00.460 So self divide other. 01:33:00.860 --> 01:33:04.960 This can actually be rewritten as self times other to the power of negative 1. 01:33:05.860 --> 01:33:10.860 And now, value raised to the power of negative 1, we have to now define that. 01:33:11.560 --> 01:33:15.660 So here's, so we need to implement the pow function. 01:33:16.160 --> 01:33:17.860 Where am I going to put the pow function? 01:33:17.860 --> 01:33:18.760 Maybe here somewhere. 01:33:20.160 --> 01:33:21.360 This is the skeleton for it. 01:33:22.560 --> 01:33:28.060 So this function will be called when we try to raise a value to some power and other will be that power. 01:33:28.760 --> 01:33:32.060 Now, I'd like to make sure that other is only an int or a float. 01:33:32.260 --> 01:33:35.360 Usually other is some kind of a different value object. 01:33:35.560 --> 01:33:38.460 But here other will be forced to be an int or a float. 01:33:38.760 --> 01:33:41.460 Otherwise, the math won't work. 01:33:41.660 --> 01:33:44.360 For what we're trying to achieve in this specific case. 
01:33:44.760 --> 01:33:48.660 That would be a different derivative expression if we wanted other to be a value. 01:33:49.760 --> 01:33:54.660 So here we create the output value, which is just, you know, this data raised to the power of other. 01:33:54.860 --> 01:33:56.660 And other here could be, for example, negative 1. 01:33:56.760 --> 01:33:58.360 That's what we are hoping to achieve. 01:33:59.460 --> 01:34:01.660 And then this is the backward stub. 01:34:01.960 --> 01:34:10.960 And this is the fun part, which is what is the chain rule expression here for back propagating through 01:34:11.060 --> 01:34:15.160 the power function, where we raise to the power of some kind of a constant. 01:34:15.860 --> 01:34:20.860 So this is the exercise and maybe pause the video here and see if you can figure it out yourself as to what we should put here. 01:34:27.060 --> 01:34:32.460 Okay, so you can actually go here and look at derivative rules as an example. 01:34:32.760 --> 01:34:35.760 And we see lots of derivative rules that you hopefully know from calculus. 01:34:35.960 --> 01:34:40.860 In particular, what we're looking for is the power rule because that's telling us that if we're trying to take 01:34:40.960 --> 01:34:48.860 d by dx of x to the n, which is what we're doing here, then that is just n times x to the n minus 1, right? 01:34:49.660 --> 01:34:55.360 Okay, so that's telling us about the local derivative of this power operation. 01:34:56.060 --> 01:35:03.060 So all we want here: basically n is now other, and self.data is x. 01:35:03.660 --> 01:35:09.560 And so this now becomes other, which is n, times self.data, 01:35:10.460 --> 01:35:10.760 which is now a number, 01:35:10.960 --> 01:35:12.460 a Python int or a float. 01:35:13.260 --> 01:35:14.360 It's not a value object. 01:35:14.360 --> 01:35:20.460 We're accessing the data attribute raised to the power of other minus 1 or n minus 1.
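Here is a minimal stand-alone sketch of the __pow__ piece just described (again only data, grad, and _backward, not the full Value class), showing the power rule chained with out.grad and division falling out as the negative-1 special case:

```python
class Value:
    # minimal sketch: only the __pow__ machinery discussed here
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __pow__(self, other):
        # other is forced to be a plain int/float, not a Value;
        # a Value exponent would need a different derivative expression
        assert isinstance(other, (int, float)), "only supporting int/float powers"
        out = Value(self.data**other, (self,))

        def _backward():
            # power rule: d/dx of x^n is n * x^(n-1), chained with out.grad
            self.grad += (other * self.data**(other - 1)) * out.grad
        out._backward = _backward
        return out

a = Value(4.0)
b = a**-1         # division as the n = -1 special case: 1/4
b.grad = 1.0
b._backward()
print(b.data, a.grad)   # 0.25, and d/da of a^-1 = -1 * a^-2 = -0.0625
```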
01:35:21.360 --> 01:35:27.960 I can put brackets around this, but this doesn't matter because power takes precedence over multiply in Python. 01:35:27.960 --> 01:35:29.060 So that would have been okay. 01:35:29.660 --> 01:35:31.360 And that's the local derivative only. 01:35:31.360 --> 01:35:36.260 But now we have to chain it, and we chain it just simply by multiplying by out.grad; that's the chain rule. 01:35:36.860 --> 01:35:39.460 And this should technically work. 01:35:40.860 --> 01:35:42.060 And we're going to find out soon. 01:35:42.360 --> 01:35:45.960 But now if we do this, this should now work. 01:35:46.860 --> 01:35:47.960 And we get 0.5. 01:35:47.960 --> 01:35:50.860 So the forward pass works, but does the backward pass work? 01:35:51.260 --> 01:35:53.960 And I realized that we actually also have to know how to subtract. 01:35:54.060 --> 01:35:58.460 So right now a minus b will not work. 01:35:58.660 --> 01:36:01.160 To make it work, we need one more piece of code here. 01:36:01.860 --> 01:36:10.760 And basically this is the subtraction, and the way we're going to implement subtraction is we're going to implement it by addition of a negation. 01:36:10.960 --> 01:36:13.460 And then to implement negation, we're going to multiply by negative one. 01:36:13.960 --> 01:36:20.760 So just again using the stuff we've already built and just expressing it in terms of what we have. And now a minus b is working. 01:36:21.260 --> 01:36:24.460 Okay, so now let's scroll again to this expression here for this neuron. 01:36:25.260 --> 01:36:28.460 And let's just compute the backward pass here. 01:36:28.460 --> 01:36:31.260 Once we've defined O and let's draw it. 01:36:32.160 --> 01:36:37.860 So here's the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that we've seen before. 01:36:38.560 --> 01:36:40.760 So now what I'd like to do is I'd like to break up 01:36:40.860 --> 01:36:43.960 this tanh into this expression here.
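A forward-only sketch of the subtraction-by-negation idea just described (the backward machinery is omitted here, since subtraction is expressed entirely in terms of the add and multiply we already have):

```python
class Value:
    # forward-only sketch: subtraction built from existing ops
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __neg__(self):          # -self, implemented as multiply by -1
        return self * -1

    def __sub__(self, other):   # self - other, as addition of a negation
        return self + (-other)

a, b = Value(2.0), Value(4.0)
print((a - b).data)   # -2.0
```

Because __sub__ and __neg__ only route through __add__ and __mul__, no new backward pass needs to be written for them.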
01:36:44.560 --> 01:36:53.060 So let me copy paste this here, and now we will preserve the label and we will change how we define O. 01:36:53.860 --> 01:36:56.460 So in particular we're going to implement this formula here. 01:36:56.860 --> 01:37:00.560 So we need e to the 2x minus 1 over e to the 2x plus 1. 01:37:00.960 --> 01:37:05.960 So e to the 2x, we need to take 2 times n and we need to exponentiate it. 01:37:06.460 --> 01:37:09.560 That's e to the 2x, and then because we're using it twice, 01:37:09.860 --> 01:37:10.760 let's create an intermediate 01:37:10.860 --> 01:37:24.260 variable e, and then define O as e minus 1 over e plus 1, and that should be it. 01:37:24.360 --> 01:37:26.460 And then we should be able to draw dot of O. 01:37:27.160 --> 01:37:30.360 So now before I run this, what do we expect to see? 01:37:31.060 --> 01:37:37.160 Number one, we're expecting to see a much longer graph here because we've broken up tanh into a bunch of other operations. 01:37:37.760 --> 01:37:40.060 But those operations are mathematically equivalent. 01:37:40.360 --> 01:37:40.760 And so what we're expecting 01:37:40.960 --> 01:37:44.460 to see is number one, the same result here. 01:37:44.560 --> 01:37:48.260 So the forward pass works, and number two, because of that mathematical equivalence, 01:37:48.560 --> 01:37:52.460 we expect to see the same backward pass and the same gradients on these leaf nodes. 01:37:52.860 --> 01:37:54.460 So these gradients should be identical. 01:37:55.160 --> 01:37:56.460 So let's run this. 01:37:57.960 --> 01:38:01.260 So number one, let's verify that instead of a single tanh node 01:38:01.360 --> 01:38:06.460 we now have exp, and we have plus, we have times negative one. 01:38:07.060 --> 01:38:10.760 This is the division, and we end up with the same forward pass 01:38:10.960 --> 01:38:12.860 here. And then the gradients.
01:38:12.960 --> 01:38:14.860 We have to be careful because they're in a slightly different order potentially. 01:38:14.960 --> 01:38:26.760 The gradients for W2 and X2 should be 0 and 0.5; W2 and X2 are 0 and 0.5. And W1 and X1 should be 1 and negative 1.5; W1 and X1 are 1 and negative 1.5. 01:38:27.360 --> 01:38:34.960 So that means that both our forward passes and backward passes were correct, because this turned out to be equivalent to tanh before. 01:38:35.960 --> 01:38:38.660 And so the reason I wanted to go through this exercise is number one, 01:38:38.960 --> 01:38:40.760 we got to practice a few more operations 01:38:41.060 --> 01:38:43.960 and writing more backward passes, and number two, 01:38:44.160 --> 01:38:51.260 I wanted to illustrate the point that the level at which you implement your operations is totally up to you. 01:38:51.460 --> 01:38:56.360 You can implement backward passes for tiny expressions like a single individual plus or a single times. 01:38:56.860 --> 01:39:01.560 Or you can implement them for, say, tanh, which is kind of a, 01:39:01.660 --> 01:39:05.960 you can see it as a composite operation, because it's made up of all these more atomic operations. 01:39:06.460 --> 01:39:08.460 But really all of this is kind of like a fake concept. 01:39:08.660 --> 01:39:10.460 All that matters is we have some kind of inputs 01:39:10.460 --> 01:39:13.760 and some kind of an output, and this output is a function of the inputs in some way. 01:39:13.960 --> 01:39:18.060 And as long as you can do the forward pass and the backward pass of that little operation, 01:39:18.460 --> 01:39:22.660 it doesn't matter what that operation is and how composite it is. 01:39:23.160 --> 01:39:27.260 If you can write the local gradients, you can chain the gradient and you can continue back propagation. 01:39:27.460 --> 01:39:30.960 So the design of what those functions are is completely up to you.
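The mathematical equivalence claimed above can be checked numerically with plain floats: the decomposition tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) should match math.tanh in the forward pass, and its derivative 1 - tanh(x)^2 should match a finite-difference estimate.

```python
import math

def tanh_composite(x):
    # the decomposed form used in the lecture: (e^(2x) - 1) / (e^(2x) + 1)
    e = math.exp(2 * x)
    return (e - 1) / (e + 1)

x = 0.8814  # roughly the activation n from the neuron example

# forward passes agree
assert abs(tanh_composite(x) - math.tanh(x)) < 1e-12

# derivative of tanh is 1 - tanh(x)^2; compare against central differences
h = 1e-6
numeric = (tanh_composite(x + h) - tanh_composite(x - h)) / (2 * h)
analytic = 1 - math.tanh(x)**2
print(numeric, analytic)   # both close to 0.5 at this particular x
```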
01:39:32.060 --> 01:39:36.960 So now I would like to show you how you can do the exact same thing but using a modern deep neural network library, 01:39:37.060 --> 01:39:38.460 like for example PyTorch, 01:39:38.860 --> 01:39:40.360 which I've roughly modeled 01:39:40.560 --> 01:39:42.360 micrograd after. 01:39:43.060 --> 01:39:45.760 And so PyTorch is something you would use in production. 01:39:46.160 --> 01:39:49.360 And I'll show you how you can do the exact same thing but in the PyTorch API. 01:39:49.860 --> 01:39:52.660 So I'm just going to copy paste it in and walk you through it a little bit. 01:39:52.860 --> 01:39:53.760 This is what it looks like. 01:39:54.960 --> 01:39:56.560 So we're going to import PyTorch. 01:39:57.060 --> 01:40:01.460 And then we need to define these value objects like we have here. 01:40:01.960 --> 01:40:05.360 Now micrograd is a scalar valued engine, 01:40:05.460 --> 01:40:08.560 so we only have scalar values like 2.0. 01:40:09.160 --> 01:40:09.760 But in PyTorch, 01:40:10.460 --> 01:40:11.760 everything is based around tensors. 01:40:12.060 --> 01:40:15.860 And like I mentioned, tensors are just n dimensional arrays of scalars. 01:40:16.360 --> 01:40:19.260 So that's why things get a little bit more complicated here. 01:40:19.360 --> 01:40:21.760 I just need a scalar valued tensor, 01:40:21.860 --> 01:40:23.460 a tensor with just a single element. 01:40:24.060 --> 01:40:30.760 But by default when you work with PyTorch you would use more complicated tensors like this. 01:40:31.060 --> 01:40:32.360 So if I import PyTorch, 01:40:34.560 --> 01:40:36.360 then I can create tensors like this. 01:40:36.760 --> 01:40:37.960 And this tensor for example 01:40:38.060 --> 01:40:39.960 is a 2x3 array 01:40:39.960 --> 01:40:44.660 of scalars in a single compact representation. 01:40:45.060 --> 01:40:46.060 So we can check its shape. 01:40:46.160 --> 01:40:48.860 We see that it's a 2x3 array and so on.
01:40:49.560 --> 01:40:53.160 So this is usually what you would work with in the actual libraries. 01:40:53.660 --> 01:40:58.960 So here I'm creating a tensor that has only a single element 2.0. 01:41:00.560 --> 01:41:03.160 And then I'm casting it to be double. 01:41:03.660 --> 01:41:07.860 Because Python is by default using double precision for its floating point numbers. 01:41:07.960 --> 01:41:09.760 So I'd like everything to be identical. 01:41:09.960 --> 01:41:14.360 By default the data type of these tensors will be float32. 01:41:14.460 --> 01:41:16.460 So it's only using a single precision float. 01:41:16.560 --> 01:41:18.260 So I'm casting it to double. 01:41:18.960 --> 01:41:21.860 So that we have float64 just like in Python. 01:41:22.660 --> 01:41:23.860 So I'm casting to double. 01:41:24.060 --> 01:41:27.660 And then we get something similar to value of 2. 01:41:28.060 --> 01:41:30.460 The next thing I have to do is because these are leaf nodes. 01:41:30.560 --> 01:41:33.660 By default PyTorch assumes that they do not require gradients. 01:41:33.860 --> 01:41:37.560 So I need to explicitly say that all of these nodes require gradients. 01:41:37.960 --> 01:41:38.460 Okay. 01:41:38.560 --> 01:41:39.660 So this is going to construct. 01:41:40.060 --> 01:41:42.860 Scalar valued one element tensors. 01:41:43.460 --> 01:41:45.660 Make sure that PyTorch knows that they require gradients. 01:41:46.260 --> 01:41:49.960 Now by default these are set to false by the way because of efficiency reasons. 01:41:50.160 --> 01:41:52.960 Because usually you would not want gradients for leaf nodes. 01:41:53.660 --> 01:41:55.460 Like the inputs to the network. 01:41:55.660 --> 01:41:58.460 And this is just trying to be efficient in the most common cases. 01:41:59.360 --> 01:42:02.360 So once we've defined all of our values in PyTorch land. 01:42:02.660 --> 01:42:05.660 We can perform arithmetic just like we can here in micrograd land. 
01:42:05.960 --> 01:42:06.860 So this would just work. 01:42:07.260 --> 01:42:09.060 And then there's a torch.tanh also. 01:42:09.660 --> 01:42:12.060 And what we get back is a tensor again. 01:42:12.660 --> 01:42:14.960 And, just like in micrograd, 01:42:15.060 --> 01:42:17.760 it's got a data attribute and it's got a grad attribute. 01:42:18.360 --> 01:42:22.360 So these tensor objects, just like in micrograd, have a dot data and a dot grad. 01:42:22.860 --> 01:42:26.260 And the only difference here is that we need to call a dot item, 01:42:26.660 --> 01:42:33.960 because dot item basically takes a single tensor of one element 01:42:34.060 --> 01:42:36.860 and it just returns that element, stripping out the tensor. 01:42:37.960 --> 01:42:38.860 So let me just run this. 01:42:38.860 --> 01:42:40.360 And hopefully we are going to get, 01:42:40.460 --> 01:42:44.560 this is going to print the forward pass, which is 0.707. 01:42:45.060 --> 01:42:50.760 And this will be the gradients, which hopefully are 0.50, negative 1.5, and 1. 01:42:51.260 --> 01:42:52.460 So if we just run this. 01:42:54.060 --> 01:42:54.460 There we go. 01:42:55.160 --> 01:42:55.560 0.7. 01:42:55.560 --> 01:42:56.960 So the forward pass agrees. 01:42:57.260 --> 01:42:59.860 And then 0.50, negative 1.5, and 1. 01:43:00.860 --> 01:43:02.260 So PyTorch agrees with us. 01:43:02.860 --> 01:43:04.060 And just to show you here, basically, 01:43:04.060 --> 01:43:07.060 O here is a tensor with a single element, 01:43:07.060 --> 01:43:09.160 and it's a double. 01:43:09.760 --> 01:43:13.660 And we can call dot item on it to just get the single number out. 01:43:14.460 --> 01:43:15.660 So that's what item does. 01:43:16.060 --> 01:43:18.360 And O is a tensor object like I mentioned. 01:43:18.660 --> 01:43:21.160 And it's got a backward function just like we've implemented. 01:43:22.260 --> 01:43:24.260 And then all of these also have a dot grad.
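The gradients that PyTorch (and micrograd) report for this neuron can also be verified by hand-calculus with plain floats: for o = tanh(x1*w1 + x2*w2 + b), the local derivative through tanh is 1 - o^2, and the chain rule pushes that onto each input. The exact input values below are the ones used in the lecture's example.

```python
import math

# the lecture's two-input neuron example
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432   # bias chosen so o comes out to ~0.7071

n = x1*w1 + x2*w2 + b    # raw activation
o = math.tanh(n)         # forward pass: ~0.7071

do_dn = 1 - o**2         # derivative of tanh at n (~0.5)
grads = {
    'x1': w1 * do_dn, 'w1': x1 * do_dn,   # chain rule through the products
    'x2': w2 * do_dn, 'w2': x2 * do_dn,
}
print(round(o, 4), {k: round(v, 4) for k, v in grads.items()})
# o ≈ 0.7071; x2 → 0.5, w2 → 0.0, x1 → -1.5, w1 → 1.0, matching the transcript
```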
01:43:24.260 --> 01:43:26.060 So like X2 for example has a grad. 01:43:26.360 --> 01:43:27.060 And it's a tensor. 01:43:27.360 --> 01:43:30.060 And we can pop out the individual number with dot item. 01:43:31.560 --> 01:43:36.860 So basically Torch can do what we did in micrograd as a special case, 01:43:37.060 --> 01:43:40.060 when your tensors are all single element tensors. 01:43:40.560 --> 01:43:43.860 But the big deal with PyTorch is that everything is significantly more efficient, 01:43:44.160 --> 01:43:46.560 because we are working with these tensor objects 01:43:46.760 --> 01:43:50.060 and we can do lots of operations in parallel on all of these tensors. 01:43:51.660 --> 01:43:55.160 But otherwise what we've built very much agrees with the API of PyTorch. 01:43:55.760 --> 01:43:59.660 Okay, so now that we have some machinery to build out pretty complicated mathematical expressions, 01:43:59.960 --> 01:44:01.860 we can also start building up neural nets. 01:44:02.060 --> 01:44:06.260 And as I mentioned, neural nets are just a specific class of mathematical expressions. 01:44:07.060 --> 01:44:09.460 So we're going to start building out a neural net piece by piece. 01:44:09.460 --> 01:44:13.860 And eventually we'll build out a two-layer multi-layer perceptron, as it's called. 01:44:14.160 --> 01:44:15.560 And I'll show you exactly what that means. 01:44:16.060 --> 01:44:17.660 Let's start with a single individual neuron. 01:44:18.060 --> 01:44:19.260 We've implemented one here. 01:44:19.660 --> 01:44:24.060 But here I'm going to implement one that also subscribes to the PyTorch API 01:44:24.060 --> 01:44:26.860 and how it designs its neural network modules. 01:44:27.460 --> 01:44:32.660 So just like we saw that we can match the API of PyTorch on the autograd side, 01:44:33.160 --> 01:44:35.360 we're going to try to do that on the neural network modules. 01:44:36.060 --> 01:44:36.960 So here's class neuron.
01:44:37.460 --> 01:44:40.660 And just for the sake of efficiency, 01:44:40.960 --> 01:44:44.360 I'm going to copy paste some sections that are relatively straightforward. 01:44:45.660 --> 01:44:49.460 So the constructor will take number of inputs to this neuron, 01:44:49.460 --> 01:44:52.160 which is how many inputs come to a neuron. 01:44:52.560 --> 01:44:54.260 So this one for example has three inputs. 01:44:55.360 --> 01:44:56.860 And then it's going to create a weight 01:44:57.360 --> 01:45:00.860 that is some random number between negative one and one for every one of those inputs, 01:45:01.360 --> 01:45:05.160 and a bias that controls the overall trigger happiness of this neuron. 01:45:05.160 --> 01:45:12.560 And then we're going to implement a def underscore underscore call of self and x, 01:45:12.560 --> 01:45:13.560 some input x. 01:45:13.560 --> 01:45:16.860 And really what we want to do here is w times x plus b, 01:45:16.860 --> 01:45:20.260 where w times x here is a dot product specifically. 01:45:20.260 --> 01:45:23.260 Now, if you haven't seen call, 01:45:23.260 --> 01:45:26.160 let me just return 0.0 here for now. 01:45:26.160 --> 01:45:30.360 The way this works now is we can have an x which is say like 2.0, 3.0. 01:45:30.360 --> 01:45:33.260 Then we can initialize a neuron that is two-dimensional, 01:45:33.260 --> 01:45:35.060 because these are two numbers. 01:45:35.160 --> 01:45:38.560 And then we can feed those two numbers into that neuron to get an output. 01:45:39.560 --> 01:45:42.560 And so when you use this notation n of x, 01:45:42.560 --> 01:45:44.060 Python will use call. 01:45:44.860 --> 01:45:46.960 So currently call just returns 0.0. 01:45:49.960 --> 01:45:53.960 Now we'd like to actually do the forward pass of this neuron instead. 01:45:54.760 --> 01:45:56.360 So what we're going to do here first 01:45:56.560 --> 01:46:00.260 is we need to basically multiply all of the elements of w
01:46:00.260 --> 01:46:02.460 with all of the elements of x pairwise. 01:46:02.460 --> 01:46:03.460 We need to multiply them. 01:46:04.160 --> 01:46:05.060 So the first thing we're going to do 01:46:05.060 --> 01:46:09.060 is we're going to zip up self.w and x. 01:46:09.060 --> 01:46:12.660 And in Python, zip takes two iterators 01:46:12.660 --> 01:46:17.960 and it creates a new iterator that iterates over the tuples of their corresponding entries. 01:46:17.960 --> 01:46:22.060 So for example, just to show you, we can print this list 01:46:22.060 --> 01:46:25.860 and still return 0.0 here. 01:46:34.260 --> 01:46:37.460 So we see that these w's are paired up with the x's, 01:46:37.460 --> 01:46:38.660 w with x. 01:46:41.660 --> 01:46:43.160 And now what we want to do is, 01:46:47.560 --> 01:46:49.860 for wi, xi in this zip, 01:46:50.860 --> 01:46:54.260 we want to multiply wi times xi, 01:46:55.060 --> 01:46:57.260 and then we want to sum all of that together 01:46:57.760 --> 01:46:59.060 to come up with an activation, 01:46:59.760 --> 01:47:01.660 and also add self.b on top. 01:47:02.460 --> 01:47:03.660 So that's the raw activation. 01:47:04.260 --> 01:47:06.860 And then of course we need to pass that through a nonlinearity. 01:47:06.860 --> 01:47:10.260 So what we're going to be returning is act dot tanh. 01:47:10.260 --> 01:47:12.460 And here's out. 01:47:12.460 --> 01:47:15.760 So now we see that we are getting some outputs, 01:47:15.760 --> 01:47:17.560 and we get a different output from a neuron each time, 01:47:17.560 --> 01:47:21.560 because we are initializing different weights and biases. 01:47:21.560 --> 01:47:23.660 And then to be a bit more efficient here actually: 01:47:23.660 --> 01:47:27.560 sum, by the way, takes a second optional parameter, 01:47:27.560 --> 01:47:29.160 which is the start. 01:47:29.160 --> 01:47:31.660 And by default the start is 0.
01:47:31.660 --> 01:47:33.660 So these elements of this sum 01:47:33.660 --> 01:47:35.860 will be added on top of 0 to begin with. 01:47:35.860 --> 01:47:37.660 But actually we can just start with self.b. 01:47:38.560 --> 01:47:40.260 And then we just have an expression like this. 01:47:45.660 --> 01:47:48.760 And then the generator expression here must be parenthesized in Python. 01:47:49.560 --> 01:47:50.060 There we go. 01:47:53.960 --> 01:47:56.260 Yep, so now we can forward a single neuron. 01:47:56.660 --> 01:47:59.060 Next up we're going to define a layer of neurons. 01:47:59.460 --> 01:48:02.060 So here we have a schematic for an MLP. 01:48:02.660 --> 01:48:03.560 So we see that 01:48:03.660 --> 01:48:05.360 in these MLPs, each layer, 01:48:05.360 --> 01:48:06.460 this is one layer, 01:48:06.460 --> 01:48:07.960 has actually a number of neurons, 01:48:07.960 --> 01:48:09.160 and they're not connected to each other, 01:48:09.160 --> 01:48:11.460 but all of them are fully connected to the input. 01:48:11.460 --> 01:48:13.160 So what is a layer of neurons? 01:48:13.160 --> 01:48:16.760 It's just a set of neurons evaluated independently. 01:48:16.760 --> 01:48:19.160 So in the interest of time, 01:48:19.160 --> 01:48:23.160 I'm going to do something fairly straightforward here. 01:48:23.160 --> 01:48:29.160 It's literally, a layer is just a list of neurons. 01:48:29.160 --> 01:48:30.760 And then how many neurons do we have? 01:48:30.760 --> 01:48:32.760 We take that as an input argument here. 01:48:32.760 --> 01:48:35.860 How many neurons do you want in your layer? That's the number of outputs in this layer. 01:48:36.760 --> 01:48:41.160 And so we just initialize completely independent neurons with this given dimensionality. 01:48:41.460 --> 01:48:42.960 And when we call on it, 01:48:42.960 --> 01:48:45.860 we just independently evaluate them. 01:48:46.460 --> 01:48:49.660 So now instead of a neuron we can make a layer of neurons.
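As a sketch of the Neuron class built above, here is a stand-alone version that uses plain floats and math.tanh in place of the lecture's Value objects (so it forwards but does not backpropagate); note the use of sum's start argument to fold the bias in, exactly as discussed:

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        # one random weight in [-1, 1] per input, plus a bias that
        # controls the overall trigger happiness of the neuron
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # w · x + b: zip pairs each wi with its xi, and sum's second
        # argument starts the accumulation at the bias instead of 0
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        # squash the raw activation with the tanh nonlinearity
        return math.tanh(act)

n = Neuron(2)
out = n([2.0, 3.0])   # n of x triggers __call__
print(out)            # some number in (-1, 1), different each run
```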
01:48:49.660 --> 01:48:51.860 They are two dimensional neurons and let's have three of them. 01:48:52.460 --> 01:48:57.660 And now we see that we have three independent evaluations of three different neurons, right? 01:48:58.960 --> 01:48:59.160 Okay. 01:48:59.160 --> 01:49:02.560 And finally, let's complete this picture and define an entire multi-layer 01:49:02.560 --> 01:49:04.060 perceptron, or MLP. 01:49:04.060 --> 01:49:08.460 And as we can see here, in an MLP these layers just feed into each other sequentially. 01:49:08.460 --> 01:49:13.460 So let's come here and I'm just going to copy the code here in the interest of time. 01:49:13.460 --> 01:49:16.060 So an MLP is very similar. 01:49:16.060 --> 01:49:19.260 We're taking the number of inputs as before. 01:49:19.260 --> 01:49:23.360 But now instead of taking a single nout, which is the number of neurons in a single layer, 01:49:23.360 --> 01:49:29.560 we're going to take a list of nouts, and this list defines the sizes of all the layers that we want in our MLP. 01:49:29.560 --> 01:49:32.260 So here we just put them all together and then iterate 01:49:32.560 --> 01:49:37.060 over consecutive pairs of these sizes and create layer objects for them. 01:49:37.760 --> 01:49:40.160 And then in the call function, we are just calling them sequentially. 01:49:40.460 --> 01:49:41.960 So that's an MLP really. 01:49:42.760 --> 01:49:44.460 And let's actually re-implement this picture. 01:49:44.460 --> 01:49:48.460 So we want three input neurons and then two layers of four and an output unit. 01:49:49.560 --> 01:49:53.360 So we want three dimensional input. 01:49:53.460 --> 01:49:54.760 Say this is an example input. 01:49:55.160 --> 01:49:59.960 We want three inputs into two layers of four and one output. 01:50:00.360 --> 01:50:01.960 And this of course is an MLP. 01:50:02.560 --> 01:50:04.560 And there we go. 01:50:04.560 --> 01:50:06.160 That's a forward pass of an MLP.
01:50:06.160 --> 01:50:08.260 To make this a little bit nicer, 01:50:08.260 --> 01:50:13.060 you see how we have just a single element, but it's wrapped in a list because layer always returns lists. 01:50:13.060 --> 01:50:20.060 So for convenience, return outs at zero if len outs is exactly a single element, 01:50:20.060 --> 01:50:22.060 else return the full list, outs. 01:50:22.060 --> 01:50:27.060 And this will allow us to just get a single value out at the last layer that only has a single neuron. 01:50:27.060 --> 01:50:31.060 And finally, we should be able to draw dot of n of x. 01:50:32.560 --> 01:50:37.960 As you might imagine, these expressions are now getting relatively involved. 01:50:38.660 --> 01:50:41.160 So this is an entire MLP that we're defining now, 01:50:45.360 --> 01:50:47.360 all the way until a single output. 01:50:48.360 --> 01:50:53.460 Okay, and so obviously you would never differentiate on pen and paper these expressions. 01:50:53.660 --> 01:51:02.360 But with micrograd, we will be able to back propagate all the way through this and back propagate into these weights of all these neurons. 01:51:02.560 --> 01:51:04.360 So let's see how that works. 01:51:04.360 --> 01:51:08.360 Okay, so let's create ourselves a very simple example data set here. 01:51:08.360 --> 01:51:10.360 So this data set has four examples. 01:51:10.360 --> 01:51:15.360 And so we have four possible inputs into the neural net. 01:51:15.360 --> 01:51:17.360 And we have four desired targets. 01:51:17.360 --> 01:51:24.360 So we'd like the neural net to assign or output 1.0 when it's fed this example, 01:51:24.360 --> 01:51:26.360 negative one when it's fed these examples, 01:51:26.360 --> 01:51:28.360 and one when it's fed this example. 01:51:28.360 --> 01:51:32.360 So it's a very simple binary classifier neural net basically that we would like here. 01:51:32.560 --> 01:51:35.960 Now let's think what the neural net currently thinks about these four examples.
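Here is a float-based sketch of the Layer and MLP classes described above, including the single-element unwrapping convenience just mentioned; as before, plain floats and math.tanh stand in for the Value objects that make the lecture's version differentiable.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        return math.tanh(sum((wi * xi for wi, xi in zip(self.w, x)), self.b))

class Layer:
    def __init__(self, nin, nout):
        # a layer is just nout independent neurons, each fully
        # connected to the same nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # convenience: unwrap the list when the layer has one neuron
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # e.g. MLP(3, [4, 4, 1]): 3 inputs, two layers of 4, 1 output
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        # the layers just feed into each other sequentially
        for layer in self.layers:
            x = layer(x)
        return x

n = MLP(3, [4, 4, 1])
out = n([2.0, 3.0, -1.0])
print(out)   # a single scalar, since the last layer has one neuron
```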
01:51:35.960 --> 01:51:38.260 We can just get their predictions. 01:51:38.260 --> 01:51:42.160 Basically, we can just call n of x for x in xs. 01:51:42.160 --> 01:51:45.160 And then we can print. 01:51:45.160 --> 01:51:48.960 So these are the outputs of the neural net on those four examples. 01:51:48.960 --> 01:51:53.860 So the first one is 0.91, but we'd like it to be one. 01:51:53.860 --> 01:51:55.860 So we should push this one higher. 01:51:55.860 --> 01:51:58.260 This one we want to be higher. 01:51:58.260 --> 01:52:02.360 This one says 0.88, and we want this to be negative one. 01:52:02.560 --> 01:52:05.160 This is 0.88, we want it to be negative one. 01:52:05.160 --> 01:52:08.160 And this one is 0.88, we want it to be one. 01:52:08.160 --> 01:52:10.160 So how do we make the neural net do that? 01:52:10.160 --> 01:52:16.760 How do we tune the weights to better predict the desired targets? 01:52:16.760 --> 01:52:21.860 And the trick used in deep learning to achieve this is to calculate a single number 01:52:21.860 --> 01:52:25.160 that somehow measures the total performance of your neural net. 01:52:25.160 --> 01:52:28.060 And we call this single number the loss. 01:52:28.060 --> 01:52:32.260 So the loss first is a single number 01:52:32.560 --> 01:52:36.260 that we're going to define that basically measures how well the neural net is performing. 01:52:36.260 --> 01:52:38.560 Right now, we have the intuitive sense that it's not performing very well, 01:52:38.560 --> 01:52:41.060 because we're not very close to the targets. 01:52:41.060 --> 01:52:44.860 So the loss will be high, and we'll want to minimize the loss. 01:52:44.860 --> 01:52:49.860 So in particular, in this case, what we're going to do is we're going to implement the mean squared error loss. 01:52:49.860 --> 01:52:54.160 So what this is doing is we're going to basically iterate 01:52:54.160 --> 01:53:01.360 for y ground truth and y output in zip of ys and ypred.
01:53:01.360 --> 01:53:02.360 So we're going to pair up 01:53:02.360 --> 01:53:09.060 the ground truths with the predictions, and the zip iterates over tuples of them. 01:53:09.060 --> 01:53:13.260 And for each Y ground truth and Y output, 01:53:13.260 --> 01:53:18.760 we're going to subtract them and square them. 01:53:18.760 --> 01:53:23.060 So let's first see what these losses are. These are individual loss components. 01:53:23.060 --> 01:53:26.560 And so basically for each one of the four, 01:53:26.560 --> 01:53:29.560 we are taking the prediction and the ground truth. 01:53:29.560 --> 01:53:31.560 We are subtracting them and squaring them. 01:53:32.360 --> 01:53:36.360 So because this one is so close to its target, 01:53:36.360 --> 01:53:42.060 0.91 is almost 1, subtracting them gives a very small number. 01:53:42.060 --> 01:53:44.360 So here we would get like a negative 0.1, 01:53:44.360 --> 01:53:48.960 and then squaring it just makes sure that regardless of 01:53:48.960 --> 01:53:51.060 whether we are more negative or more positive, 01:53:51.060 --> 01:53:54.460 we always get a positive number. 01:53:54.460 --> 01:53:56.160 Instead of squaring, we could also take, 01:53:56.160 --> 01:53:59.560 for example, the absolute value. We need to discard the sign. 01:53:59.560 --> 01:54:02.260 And so you see that the expression is arranged so that you 01:54:02.360 --> 01:54:06.560 only get 0 exactly when Y out is equal to Y ground truth. 01:54:06.560 --> 01:54:07.660 When those two are equal, 01:54:07.660 --> 01:54:09.560 so your prediction is exactly the target, 01:54:09.560 --> 01:54:13.060 you are going to get 0. And if your prediction is not the target, 01:54:13.060 --> 01:54:15.260 you are going to get some other number. 01:54:15.260 --> 01:54:17.160 So here, for example, we are way off. 01:54:17.160 --> 01:54:19.960 And so that's why the loss is quite high. 01:54:19.960 --> 01:54:24.160 And the more off we are, the greater the loss will be.
01:54:24.160 --> 01:54:27.960 So we don't want high loss, we want low loss. 01:54:27.960 --> 01:54:32.160 And so the final loss here will be just the sum of 01:54:32.360 --> 01:54:34.460 all of these numbers. 01:54:34.460 --> 01:54:38.660 So you see that this should be 0 roughly plus 0 roughly, 01:54:38.660 --> 01:54:40.960 but plus 7. 01:54:40.960 --> 01:54:45.060 So loss should be about 7 here. 01:54:45.060 --> 01:54:47.360 And now we want to minimize the loss. 01:54:47.360 --> 01:54:51.660 We want the loss to be low because if loss is low, 01:54:51.660 --> 01:54:56.560 then every one of the predictions is close to its target. 01:54:56.560 --> 01:54:59.060 So the loss, the lowest it can be is 0, 01:54:59.060 --> 01:55:02.160 and the greater it is, the worse off the neural net is, 01:55:02.360 --> 01:55:05.160 and the worse its predictions are. 01:55:05.160 --> 01:55:08.760 So now, of course, if we do loss.backward, 01:55:08.760 --> 01:55:11.560 something magical happened when I hit enter. 01:55:11.560 --> 01:55:14.360 And the magical thing, of course, that happened is that we can look at 01:55:14.360 --> 01:55:19.660 n.layers at, say, the first layer, 01:55:19.660 --> 01:55:23.360 dot neurons at 0, 01:55:23.360 --> 01:55:27.360 because remember that MLP has the layers, which is a list, 01:55:27.360 --> 01:55:29.660 and each layer has neurons, which is a list, 01:55:29.660 --> 01:55:31.560 and that gives us an individual neuron, 01:55:31.560 --> 01:55:33.560 and that gives us some weights. 01:55:33.560 --> 01:55:40.560 And so we can, for example, look at the weights at 0. 01:55:40.560 --> 01:55:44.560 Oops, it's not called weights, it's called w. 01:55:44.560 --> 01:55:48.560 And that's a value, but now this value also has a grad 01:55:48.560 --> 01:55:50.560 because of the backward pass.
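The "about 7" estimate above can be reproduced with plain floats: pair each prediction with its target, square the differences, and sum. The prediction values below are the approximate ones read off the screen in the lecture (0.91 and three 0.88s), so the exact total is illustrative.

```python
# mean squared error on the four-example data set, with plain floats
ys    = [1.0, -1.0, -1.0, 1.0]       # desired targets
ypred = [0.91, 0.88, 0.88, 0.88]     # outputs roughly as shown above

# each component is (prediction - target)^2: zero only at a perfect match,
# positive otherwise, regardless of the sign of the error
components = [(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]
loss = sum(components)
print(components, loss)
# two tiny terms plus two large ones (the 0.88s that should be -1), ≈ 7.09
```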
01:55:50.560 --> 01:55:53.560 And so we see that because this gradient here 01:55:53.560 --> 01:55:55.560 on this particular weight of this particular neuron 01:55:55.560 --> 01:55:58.560 of this particular layer is negative, 01:55:58.560 --> 01:56:01.360 we see that its influence on the loss is also negative. 01:56:01.360 --> 01:56:05.360 So slightly increasing this particular weight of this neuron of this layer 01:56:05.360 --> 01:56:08.360 would make the loss go down. 01:56:08.360 --> 01:56:12.360 And we actually have this information for every single one of our neurons 01:56:12.360 --> 01:56:13.360 and all of their parameters. 01:56:13.360 --> 01:56:17.360 Actually, it's worth looking at also the draw dot of loss, by the way. 01:56:17.360 --> 01:56:21.360 So previously, we looked at the draw dot of a single neuron forward pass, 01:56:21.360 --> 01:56:23.360 and that was already a large expression. 01:56:23.360 --> 01:56:25.360 But what is this expression? 01:56:25.360 --> 01:56:28.360 We actually forwarded every one of those four examples, 01:56:28.360 --> 01:56:30.360 and then we have the loss on top of them, 01:56:30.360 --> 01:56:32.360 with the mean squared error. 01:56:32.360 --> 01:56:35.360 And so this is a really massive graph, 01:56:35.360 --> 01:56:38.360 because this graph that we've built up now, 01:56:38.360 --> 01:56:40.360 oh my gosh, 01:56:42.360 --> 01:56:44.360 is kind of excessive. 01:56:44.360 --> 01:56:47.360 It's excessive because it has four forward passes of a neural net 01:56:47.360 --> 01:56:49.360 for every one of the examples, 01:56:49.360 --> 01:56:51.360 and then it has the loss on top, 01:56:51.360 --> 01:56:54.360 and it ends with the value of the loss, which was 7.12.
01:56:54.360 --> 01:56:58.360 And this loss will now back propagate through all the four forward passes 01:56:58.360 --> 01:56:59.360 all the way through, 01:56:59.360 --> 01:57:02.360 just every single intermediate value of the neural net, 01:57:02.360 --> 01:57:04.360 all the way back to, 01:57:04.360 --> 01:57:06.360 of course, the parameters of the weights, 01:57:06.360 --> 01:57:07.360 which are the input. 01:57:07.360 --> 01:57:11.360 So these weight parameters here are inputs to this neural net, 01:57:11.360 --> 01:57:13.360 and these numbers here, 01:57:13.360 --> 01:57:14.360 these scalars, 01:57:14.360 --> 01:57:16.360 are inputs to the neural net. 01:57:16.360 --> 01:57:18.360 So if we went around here, 01:57:18.360 --> 01:57:21.360 we will probably find some of these examples, 01:57:21.360 --> 01:57:22.360 this 1.0, 01:57:22.360 --> 01:57:24.360 potentially maybe this 1.0, 01:57:24.360 --> 01:57:26.360 or, you know, some of the others. 01:57:26.360 --> 01:57:28.360 And you'll see that they all have gradients as well. 01:57:28.360 --> 01:57:31.360 The thing is these gradients on the input data 01:57:31.360 --> 01:57:33.360 are not that useful to us, 01:57:33.360 --> 01:57:37.360 and that's because the input data seems to be not changeable. 01:57:37.360 --> 01:57:39.360 It's a given to the problem, 01:57:39.360 --> 01:57:40.360 and so it's a fixed input. 01:57:40.360 --> 01:57:42.360 We're not going to be changing it or messing with it, 01:57:42.360 --> 01:57:45.360 even though we do have gradients for it. 01:57:45.360 --> 01:57:48.360 But some of these gradients here 01:57:48.360 --> 01:57:50.360 will be for the neural network parameters, 01:57:50.360 --> 01:57:52.360 the w's and the b's, 01:57:52.360 --> 01:57:55.360 and those we, of course, we want to change. 
01:57:55.360 --> 01:57:57.360 Okay, so now we're going to want 01:57:57.360 --> 01:57:59.360 some convenience code to gather up 01:57:59.360 --> 01:58:01.360 all of the parameters of the neural net 01:58:01.360 --> 01:58:04.360 so that we can operate on all of them simultaneously. 01:58:04.360 --> 01:58:05.360 And every one of them, 01:58:05.360 --> 01:58:08.360 we will nudge a tiny amount 01:58:08.360 --> 01:58:10.360 based on the gradient information. 01:58:10.360 --> 01:58:12.360 So let's collect the parameters of the neural net 01:58:12.360 --> 01:58:14.360 all in one array. 01:58:14.360 --> 01:58:17.360 So let's create a parameters(self) 01:58:17.360 --> 01:58:19.360 that just returns 01:58:19.360 --> 01:58:22.360 self.w, which is a list, 01:58:22.360 --> 01:58:26.360 concatenated with a list of self.b. 01:58:27.360 --> 01:58:29.360 So this will just return a list. 01:58:29.360 --> 01:58:32.360 List plus list just gives you a list. 01:58:32.360 --> 01:58:34.360 So that's parameters of Neuron, 01:58:34.360 --> 01:58:36.360 and I'm calling it this way 01:58:36.360 --> 01:58:38.360 because PyTorch also has parameters 01:58:38.360 --> 01:58:40.360 on every single nn.Module, 01:58:40.360 --> 01:58:42.360 and it does exactly what we're doing here. 01:58:42.360 --> 01:58:45.360 It just returns the parameter tensors. 01:58:45.360 --> 01:58:48.360 For us, it's the parameter scalars. 01:58:48.360 --> 01:58:50.360 Now, Layer is also a module, 01:58:50.360 --> 01:58:54.360 so it will have parameters(self), 01:58:54.360 --> 01:58:56.360 and basically what we want to do here is 01:58:56.360 --> 01:58:59.360 something like this, like 01:58:59.360 --> 01:59:01.360 params is here, 01:59:01.360 --> 01:59:07.360 and then for neuron in self.neurons, 01:59:07.360 --> 01:59:10.360 we want to get neuron.parameters, 01:59:10.360 --> 01:59:14.360 and we want to params.extend.
01:59:14.360 --> 01:59:16.360 So these are the parameters of this neuron, 01:59:16.360 --> 01:59:19.360 and then we want to put them on top of params, 01:59:19.360 --> 01:59:22.360 so params.extend of ps, 01:59:22.360 --> 01:59:25.360 and then we want to return params. 01:59:25.360 --> 01:59:27.360 So this is way too much code, 01:59:27.360 --> 01:59:29.360 so actually there's a way to simplify this, 01:59:29.360 --> 01:59:39.360 which is return p for neuron in self.neurons 01:59:39.360 --> 01:59:45.360 for p in neuron.parameters. 01:59:45.360 --> 01:59:47.360 So it's a single list comprehension. 01:59:47.360 --> 01:59:49.360 In Python, you can sort of nest them like this, 01:59:49.360 --> 01:59:54.360 and you can then create the desired array. 01:59:54.360 --> 01:59:56.360 So these are identical. 01:59:56.360 --> 01:59:59.360 We can take this out. 01:59:59.360 --> 02:00:04.360 And then let's do the same here. 02:00:04.360 --> 02:00:07.360 def parameters(self), 02:00:07.360 --> 02:00:13.360 and return p for layer in self.layers 02:00:13.360 --> 02:00:20.360 for p in layer.parameters. 02:00:20.360 --> 02:00:22.360 And that should be good. 02:00:22.360 --> 02:00:25.360 Now let me pop out this 02:00:25.360 --> 02:00:28.360 so we don't reinitialize our network, 02:00:28.360 --> 02:00:35.360 because we need to reinitialize our... 02:00:35.360 --> 02:00:36.360 Okay, so unfortunately, 02:00:36.360 --> 02:00:38.360 we will have to probably reinitialize the network, 02:00:38.360 --> 02:00:41.360 because we just added functionality. 02:00:41.360 --> 02:00:42.360 Because with this class, of course, 02:00:42.360 --> 02:00:45.360 I want to get all of n.parameters, 02:00:45.360 --> 02:00:46.360 but that's not going to work, 02:00:46.360 --> 02:00:49.360 because this is the old class. 02:00:49.360 --> 02:00:50.360 Okay.
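The parameters plumbing described above can be sketched as follows. These are minimal stand-ins using plain random floats rather than the lecture's Value objects, but the parameters methods have the same shape: a flat list for Neuron, and nested list comprehensions that flatten Layer and MLP.

```python
import random

# Minimal stand-ins for the Neuron/Layer/MLP classes built in the
# lecture; only the parameters() methods are the point here.
class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)
    def parameters(self):
        # list + list concatenates, so this is one flat list of scalars
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def parameters(self):
        # nested comprehension flattens every neuron's parameters
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

n = MLP(3, [4, 4, 1])
print(len(n.parameters()))  # 16 + 20 + 5 = 41, as in the lecture
```

The counts work out because each layer contributes (inputs + 1) scalars per neuron: 4*(3+1) + 4*(4+1) + 1*(4+1) = 41.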
02:00:50.360 --> 02:00:51.360 So unfortunately, 02:00:51.360 --> 02:00:53.360 we do have to reinitialize the network, 02:00:53.360 --> 02:00:55.360 which will change some of the numbers. 02:00:55.360 --> 02:00:58.360 But let me do that so that we pick up the new API. 02:00:58.360 --> 02:01:00.360 We can now do n.parameters. 02:01:00.360 --> 02:01:02.360 And these are all the weights and biases 02:01:02.360 --> 02:01:05.360 inside the entire neural net. 02:01:05.360 --> 02:01:11.360 So in total, this MLP has 41 parameters. 02:01:11.360 --> 02:01:15.360 And now we'll be able to change them. 02:01:15.360 --> 02:01:18.360 If we recalculate the loss here, 02:01:18.360 --> 02:01:19.360 we see that unfortunately, 02:01:19.360 --> 02:01:23.360 we have slightly different predictions 02:01:23.360 --> 02:01:26.360 and a slightly different loss. 02:01:26.360 --> 02:01:28.360 But that's okay. 02:01:28.360 --> 02:01:31.360 Okay, so we see that this neuron's gradient 02:01:31.360 --> 02:01:33.360 is slightly negative. 02:01:33.360 --> 02:01:36.360 We can also look at its data right now, 02:01:36.360 --> 02:01:38.360 which is 0.85. 02:01:38.360 --> 02:01:39.360 So this is the current value of this neuron, 02:01:39.360 --> 02:01:43.360 and this is its gradient on the loss. 02:01:43.360 --> 02:01:44.360 So what we want to do now 02:01:44.360 --> 02:01:48.360 is we want to iterate for every p in n.parameters. 02:01:48.360 --> 02:01:51.360 So for all the 41 parameters in this neural net, 02:01:51.360 --> 02:01:55.360 we actually want to change p.data slightly 02:01:55.360 --> 02:01:58.360 according to the gradient information. 02:01:58.360 --> 02:02:01.360 Okay, so dot dot dot to do here. 02:02:01.360 --> 02:02:04.360 But this will be basically a tiny update 02:02:04.360 --> 02:02:07.360 in this gradient descent scheme.
02:02:07.360 --> 02:02:09.360 And gradient descent, 02:02:09.360 --> 02:02:11.360 we are thinking of the gradient 02:02:11.360 --> 02:02:13.360 as a vector pointing in the direction 02:02:13.360 --> 02:02:17.360 of increased loss. 02:02:17.360 --> 02:02:20.360 And so in gradient descent, 02:02:20.360 --> 02:02:23.360 we are modifying p.data 02:02:23.360 --> 02:02:25.360 by a small step size 02:02:25.360 --> 02:02:27.360 in the direction of the gradient. 02:02:27.360 --> 02:02:28.360 So the step size as an example 02:02:28.360 --> 02:02:29.360 could be like a very small number, 02:02:29.360 --> 02:02:31.360 like 0.01 is the step size, 02:02:31.360 --> 02:02:35.360 times p.grad, right? 02:02:35.360 --> 02:02:37.360 But we have to think through 02:02:37.360 --> 02:02:38.360 some of the signs here. 02:02:38.360 --> 02:02:41.360 So in particular, 02:02:41.360 --> 02:02:44.360 working with this specific example here, 02:02:44.360 --> 02:02:46.360 we see that if we just left it like this, 02:02:46.360 --> 02:02:48.360 then this neuron's value 02:02:48.360 --> 02:02:50.360 would be currently increased 02:02:50.360 --> 02:02:52.360 by a tiny amount of the gradient. 02:02:52.360 --> 02:02:54.360 The gradient is negative, 02:02:54.360 --> 02:02:56.360 so this value of this neuron 02:02:56.360 --> 02:02:58.360 would go slightly down. 02:02:58.360 --> 02:03:00.360 It would become like 0.84 02:03:00.360 --> 02:03:02.360 or something like that. 02:03:02.360 --> 02:03:05.360 But if this neuron's value goes lower, 02:03:05.360 --> 02:03:10.360 that would actually increase the loss. 02:03:10.360 --> 02:03:13.360 That's because the derivative 02:03:13.360 --> 02:03:15.360 of this neuron is negative. 02:03:15.360 --> 02:03:17.360 So increasing this 02:03:17.360 --> 02:03:19.360 makes the loss go down. 02:03:19.360 --> 02:03:21.360 So increasing it is what we want to do 02:03:21.360 --> 02:03:23.360 instead of decreasing it. 
02:03:23.360 --> 02:03:24.360 So basically what we're missing here 02:03:24.360 --> 02:03:26.360 is we're actually missing a negative sign, 02:03:26.360 --> 02:03:31.360 and that's because we want to minimize the loss. 02:03:31.360 --> 02:03:32.360 We don't want to maximize the loss. 02:03:32.360 --> 02:03:34.360 We want to decrease it. 02:03:34.360 --> 02:03:35.360 And the other interpretation, as I mentioned, 02:03:35.360 --> 02:03:37.360 is you can think of the gradient vector, 02:03:37.360 --> 02:03:40.360 so basically just the vector of all the gradients, 02:03:40.360 --> 02:03:42.360 as pointing in the direction 02:03:42.360 --> 02:03:45.360 of increasing the loss. 02:03:45.360 --> 02:03:46.360 But then we want to decrease it. 02:03:46.360 --> 02:03:49.360 So we actually want to go in the opposite direction. 02:03:49.360 --> 02:03:50.360 And so you can convince yourself 02:03:50.360 --> 02:03:52.360 that this does the right thing here with the negative, 02:03:52.360 --> 02:03:55.360 because we want to minimize the loss. 02:03:55.360 --> 02:04:00.360 So if we nudge all the parameters by a tiny amount, 02:04:00.360 --> 02:04:02.360 then we'll see that this data 02:04:02.360 --> 02:04:04.360 will have changed a little bit. 02:04:04.360 --> 02:04:09.360 So now this neuron is a tiny amount greater in value. 02:04:09.360 --> 02:04:13.360 So 0.854 went to 0.857. 02:04:13.360 --> 02:04:14.360 And that's a good thing, 02:04:14.360 --> 02:04:18.360 because slightly increasing this neuron's data 02:04:18.360 --> 02:04:22.360 makes the loss go down, according to the gradient. 02:04:22.360 --> 02:04:25.360 And so the correct thing has happened, sign-wise. 02:04:25.360 --> 02:04:27.360 And so now what we would expect, of course, 02:04:27.360 --> 02:04:30.360 is that because we've changed all these parameters, 02:04:30.360 --> 02:04:34.360 we expect that the loss should have gone down a bit.
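The update with the negative sign can be sketched on stand-in parameter objects (plain data/grad holders rather than the full Value class; the step size 0.01 and these particular numbers are just for illustration):

```python
class Param:
    # stand-in for a micrograd Value: just .data and .grad
    def __init__(self, data, grad):
        self.data, self.grad = data, grad

# a weight with a negative gradient, and one with a positive gradient
params = [Param(0.85, -0.3), Param(-0.2, 0.1)]

for p in params:
    # gradient descent: step *against* the gradient, hence the minus
    p.data += -0.01 * p.grad

# the weight whose gradient was negative went slightly up,
# which is the direction that decreases the loss
print(params[0].data, params[1].data)
```

Without the minus sign this loop would be gradient ascent, nudging every parameter toward higher loss.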
02:04:34.360 --> 02:04:36.360 So we want to reevaluate the loss. 02:04:36.360 --> 02:04:39.360 Let me basically... 02:04:39.360 --> 02:04:42.360 This is just the data definition, which hasn't changed. 02:04:42.360 --> 02:04:43.360 But the forward pass here, 02:04:43.360 --> 02:04:45.360 of the network, 02:04:45.360 --> 02:04:47.360 we can recalculate. 02:04:49.360 --> 02:04:51.360 And actually, let me do it outside here, 02:04:51.360 --> 02:04:54.360 so that we can compare the two loss values. 02:04:54.360 --> 02:04:57.360 So here, if I recalculate the loss, 02:04:57.360 --> 02:04:59.360 we'd expect the new loss now 02:04:59.360 --> 02:05:01.360 to be slightly lower than this number. 02:05:01.360 --> 02:05:03.360 So hopefully, what we're getting now 02:05:03.360 --> 02:05:06.360 is a tiny bit lower than 4.84. 02:05:06.360 --> 02:05:08.360 4.36. 02:05:08.360 --> 02:05:10.360 And remember, the way we've arranged this 02:05:10.360 --> 02:05:12.360 is that low loss means that 02:05:12.360 --> 02:05:14.360 our predictions are matching the targets. 02:05:14.360 --> 02:05:15.360 So our predictions now 02:05:15.360 --> 02:05:19.360 are probably slightly closer to the targets. 02:05:19.360 --> 02:05:21.360 And now all we have to do 02:05:21.360 --> 02:05:23.360 is we have to iterate this process. 02:05:23.360 --> 02:05:26.360 So again, we've done the forward pass, 02:05:26.360 --> 02:05:27.360 and this is the loss. 02:05:27.360 --> 02:05:29.360 Now we can do loss.backward. 02:05:29.360 --> 02:05:31.360 Let me take these out. 02:05:31.360 --> 02:05:33.360 And we can do a step size. 02:05:33.360 --> 02:05:36.360 And now we should have a slightly lower loss. 02:05:36.360 --> 02:05:39.360 4.36 goes to 3.9. 02:05:39.360 --> 02:05:41.360 And okay, so we've done the forward pass. 02:05:41.360 --> 02:05:43.360 Here's the backward pass. 02:05:43.360 --> 02:05:45.360 Nudge. 02:05:45.360 --> 02:05:47.360 And now the loss is 3.66. 02:05:47.360 --> 02:05:51.360 3.47.
02:05:51.360 --> 02:05:53.360 And you get the idea. 02:05:53.360 --> 02:05:55.360 We just continue doing this. 02:05:55.360 --> 02:05:57.360 And this is gradient descent. 02:05:57.360 --> 02:05:59.360 We're just iteratively doing forward pass, 02:05:59.360 --> 02:06:01.360 backward pass, update. 02:06:01.360 --> 02:06:03.360 Forward pass, backward pass, update. 02:06:03.360 --> 02:06:05.360 And the neural net is improving its predictions. 02:06:05.360 --> 02:06:08.360 So here, if we look at ypred now, 02:06:08.360 --> 02:06:10.360 ypred, 02:06:10.360 --> 02:06:16.360 we see that this value should be getting closer to 1. 02:06:16.360 --> 02:06:18.360 So this value should be getting more positive. 02:06:18.360 --> 02:06:19.360 These should be getting more negative. 02:06:19.360 --> 02:06:21.360 And this one should be also getting more positive. 02:06:21.360 --> 02:06:26.360 So if we just iterate this a few more times, 02:06:26.360 --> 02:06:29.360 actually, we may be able to afford to go a bit faster. 02:06:29.360 --> 02:06:34.360 Let's try a slightly higher learning rate. 02:06:34.360 --> 02:06:35.360 Oops, okay, there we go. 02:06:35.360 --> 02:06:39.360 So now we're at 0.31. 02:06:39.360 --> 02:06:41.360 If you go too fast, by the way, 02:06:41.360 --> 02:06:43.360 if you try to make it too big of a step, 02:06:43.360 --> 02:06:47.360 you may actually overstep. 02:06:47.360 --> 02:06:48.360 It's overconfidence. 02:06:48.360 --> 02:06:49.360 Because again, remember, 02:06:49.360 --> 02:06:51.360 we don't actually know exactly about the loss function. 02:06:51.360 --> 02:06:53.360 The loss function has all kinds of structure. 02:06:53.360 --> 02:06:56.360 And we only know about the very local dependence 02:06:56.360 --> 02:06:58.360 of all these parameters on the loss. 02:06:58.360 --> 02:06:59.360 But if we step too far, 02:06:59.360 --> 02:07:01.360 we may step into, you know, 02:07:01.360 --> 02:07:03.360 a part of the loss that is completely different. 
02:07:03.360 --> 02:07:05.360 And that can destabilize training 02:07:05.360 --> 02:07:08.360 and make your loss actually blow up even. 02:07:08.360 --> 02:07:10.360 So the loss is now 0.04. 02:07:10.360 --> 02:07:13.360 So actually, the predictions should be really quite close. 02:07:13.360 --> 02:07:15.360 Let's take a look. 02:07:15.360 --> 02:07:17.360 So you see how this is almost one, 02:07:17.360 --> 02:07:19.360 almost negative one, almost one. 02:07:19.360 --> 02:07:21.360 We can continue going. 02:07:21.360 --> 02:07:25.360 So, yep, backward, update. 02:07:25.360 --> 02:07:26.360 Oops, there we go. 02:07:26.360 --> 02:07:28.360 So we went way too fast. 02:07:28.360 --> 02:07:31.360 And we actually overstepped. 02:07:31.360 --> 02:07:34.360 So we got too eager. 02:07:34.360 --> 02:07:35.360 Where are we now? 02:07:35.360 --> 02:07:36.360 Oops. 02:07:36.360 --> 02:07:37.360 Okay. 02:07:37.360 --> 02:07:38.360 7E-9. 02:07:38.360 --> 02:07:41.360 So this is very, very low loss. 02:07:41.360 --> 02:07:45.360 And the predictions are basically perfect. 02:07:45.360 --> 02:07:47.360 So somehow we... 02:07:47.360 --> 02:07:49.360 Basically, we were doing way too big updates 02:07:49.360 --> 02:07:50.360 and we briefly exploded, 02:07:50.360 --> 02:07:53.360 but then somehow we ended up getting into a really good spot. 02:07:53.360 --> 02:07:55.360 So usually this learning rate 02:07:55.360 --> 02:07:57.360 and the tuning of it is a subtle art. 02:07:57.360 --> 02:07:59.360 You want to set your learning rate. 02:07:59.360 --> 02:08:00.360 If it's too low, 02:08:00.360 --> 02:08:02.360 you're going to take way too long to converge. 02:08:02.360 --> 02:08:03.360 But if it's too high, 02:08:03.360 --> 02:08:04.360 the whole thing gets unstable 02:08:04.360 --> 02:08:06.360 and you might actually even explode the loss, 02:08:06.360 --> 02:08:08.360 depending on your loss function. 
02:08:08.360 --> 02:08:11.360 So finding the step size to be just right, 02:08:11.360 --> 02:08:13.360 it's a pretty subtle art sometimes, 02:08:13.360 --> 02:08:15.360 when you're using sort of vanilla gradient descent. 02:08:15.360 --> 02:08:17.360 But we happened to get into a good spot. 02:08:17.360 --> 02:08:22.360 We can look at n.parameters. 02:08:22.360 --> 02:08:26.360 So this is the setting of weights and biases 02:08:26.360 --> 02:08:28.360 that makes our network 02:08:28.360 --> 02:08:31.360 predict the desired targets 02:08:31.360 --> 02:08:33.360 very, very close. 02:08:33.360 --> 02:08:35.360 And basically, 02:08:35.360 --> 02:08:38.360 we've successfully trained a neural net. 02:08:38.360 --> 02:08:40.360 Okay, let's make this a tiny bit more respectable 02:08:40.360 --> 02:08:42.360 and implement an actual training loop 02:08:42.360 --> 02:08:43.360 and what that looks like. 02:08:43.360 --> 02:08:45.360 So this is the data definition that stays. 02:08:45.360 --> 02:08:47.360 This is the forward pass. 02:08:47.360 --> 02:08:51.360 So for k in range, 02:08:51.360 --> 02:08:57.360 we're going to take a bunch of steps. 02:08:57.360 --> 02:08:59.360 First, we do the forward pass. 02:08:59.360 --> 02:09:03.360 We evaluate the loss. 02:09:03.360 --> 02:09:05.360 Let's reinitialize the neural net from scratch. 02:09:05.360 --> 02:09:08.360 And here's the data. 02:09:08.360 --> 02:09:10.360 And we first do the forward pass. 02:09:10.360 --> 02:09:12.360 Then we do the backward pass. 02:09:19.360 --> 02:09:20.360 And then we do an update. 02:09:20.360 --> 02:09:22.360 That's gradient descent. 02:09:26.360 --> 02:09:27.360 And then we should be able to iterate this, 02:09:27.360 --> 02:09:30.360 and we should be able to print the current step 02:09:30.360 --> 02:09:32.360 and the current loss. 02:09:32.360 --> 02:09:34.360 Let's just print 02:09:34.360 --> 02:09:37.360 the number of the loss. 02:09:37.360 --> 02:09:40.360 And that should be it.
02:09:40.360 --> 02:09:42.360 And then the learning rate, 02:09:42.360 --> 02:09:43.360 0.01 is a little too small. 02:09:43.360 --> 02:09:45.360 0.1 we saw is like a little bit dangerous 02:09:45.360 --> 02:09:46.360 and too high. 02:09:46.360 --> 02:09:48.360 Let's go somewhere in between. 02:09:48.360 --> 02:09:50.360 And we'll optimize this for 02:09:50.360 --> 02:09:51.360 not 10 steps, 02:09:51.360 --> 02:09:54.360 but let's go for say 20 steps. 02:09:54.360 --> 02:09:59.360 Let me erase all of this junk. 02:09:59.360 --> 02:10:02.360 And let's run the optimization. 02:10:02.360 --> 02:10:05.360 And you see how we've actually converged slower 02:10:05.360 --> 02:10:07.360 in a more controlled manner 02:10:07.360 --> 02:10:10.360 and got to a loss that is very low. 02:10:10.360 --> 02:10:14.360 So I expect YPred to be quite good. 02:10:14.360 --> 02:10:16.360 There we go. 02:10:21.360 --> 02:10:23.360 And that's it. 02:10:23.360 --> 02:10:25.360 Okay, so this is kind of embarrassing, 02:10:25.360 --> 02:10:28.360 but we actually have a really terrible bug in here. 02:10:28.360 --> 02:10:30.360 And it's a subtle bug 02:10:30.360 --> 02:10:32.360 and it's a very common bug. 02:10:32.360 --> 02:10:34.360 And I can't believe I've done it 02:10:34.360 --> 02:10:36.360 for the 20th time in my life, 02:10:36.360 --> 02:10:37.360 especially on camera. 02:10:37.360 --> 02:10:39.360 And I could have reshot the whole thing, 02:10:39.360 --> 02:10:41.360 but I think it's pretty funny. 02:10:41.360 --> 02:10:43.360 And you get to appreciate a bit 02:10:43.360 --> 02:10:45.360 what working with neural nets 02:10:45.360 --> 02:10:47.360 maybe is like sometimes. 02:10:47.360 --> 02:10:49.360 We are guilty of 02:10:49.360 --> 02:10:51.360 a common bug. 02:10:51.360 --> 02:10:53.360 I've actually tweeted 02:10:53.360 --> 02:10:55.360 the most common neural net mistakes 02:10:55.360 --> 02:10:57.360 a long time ago now. 
02:10:57.360 --> 02:10:59.360 And I'm not really 02:10:59.360 --> 02:11:01.360 going to explain any of these, 02:11:01.360 --> 02:11:03.360 but remember we are guilty of number three. 02:11:03.360 --> 02:11:05.360 You forgot to zero grad 02:11:05.360 --> 02:11:06.360 before dot backward. 02:11:06.360 --> 02:11:09.360 What is that? 02:11:09.360 --> 02:11:10.360 Basically what's happening, 02:11:10.360 --> 02:11:11.360 and it's a subtle bug 02:11:11.360 --> 02:11:13.360 and I'm not sure if you saw it, 02:11:13.360 --> 02:11:16.360 is that all of these weights here 02:11:16.360 --> 02:11:19.360 have a dot data and a dot grad. 02:11:19.360 --> 02:11:22.360 And dot grad starts at zero. 02:11:22.360 --> 02:11:23.360 And then we do backward 02:11:23.360 --> 02:11:25.360 and we fill in the gradients. 02:11:25.360 --> 02:11:27.360 And then we do an update on the data, 02:11:27.360 --> 02:11:29.360 but we don't flush the grad. 02:11:29.360 --> 02:11:30.360 It stays there. 02:11:30.360 --> 02:11:33.360 So when we do the second forward pass 02:11:33.360 --> 02:11:35.360 and we do backward again, 02:11:35.360 --> 02:11:37.360 remember that all the backward operations 02:11:37.360 --> 02:11:39.360 do a plus equals on the grad. 02:11:39.360 --> 02:11:41.360 And so these gradients just add up 02:11:41.360 --> 02:11:44.360 and they never get reset to zero. 02:11:44.360 --> 02:11:47.360 So basically we didn't zero grad. 02:11:47.360 --> 02:11:48.360 So here's how we zero grad 02:11:48.360 --> 02:11:50.360 before backward. 02:11:50.360 --> 02:11:53.360 We need to iterate over all the parameters. 02:11:53.360 --> 02:11:55.360 And we need to make sure that 02:11:55.360 --> 02:11:58.360 p dot grad is set to zero. 02:11:58.360 --> 02:12:00.360 We need to reset it to zero. 02:12:00.360 --> 02:12:02.360 Just like it is in the constructor. 
02:12:02.360 --> 02:12:04.360 So remember all the way here 02:12:04.360 --> 02:12:05.360 for all these value nodes, 02:12:05.360 --> 02:12:07.360 grad is reset to zero. 02:12:07.360 --> 02:12:09.360 And then all these backward passes 02:12:09.360 --> 02:12:11.360 do a plus equals from that grad. 02:12:11.360 --> 02:12:13.360 But we need to make sure that 02:12:13.360 --> 02:12:15.360 we reset these grads to zero 02:12:15.360 --> 02:12:17.360 so that when we do backward, 02:12:17.360 --> 02:12:18.360 all of them start at zero 02:12:18.360 --> 02:12:19.360 and the actual backward pass 02:12:19.360 --> 02:12:23.360 accumulates the loss derivatives 02:12:23.360 --> 02:12:25.360 into the grads. 02:12:25.360 --> 02:12:28.360 So this is zero grad in PyTorch. 02:12:28.360 --> 02:12:29.360 And 02:12:29.360 --> 02:12:33.360 we will get a slightly different optimization. 02:12:33.360 --> 02:12:35.360 Let's reset the neural net. 02:12:35.360 --> 02:12:36.360 The data is the same. 02:12:36.360 --> 02:12:38.360 This is now, I think, correct. 02:12:38.360 --> 02:12:41.360 And we get a much more 02:12:41.360 --> 02:12:44.360 slower descent. 02:12:44.360 --> 02:12:46.360 We still end up with pretty good results. 02:12:46.360 --> 02:12:48.360 And we can continue this a bit more 02:12:48.360 --> 02:12:50.360 to get down lower 02:12:50.360 --> 02:12:51.360 and lower 02:12:51.360 --> 02:12:54.360 and lower. 02:12:54.360 --> 02:12:56.360 Yeah. 02:12:56.360 --> 02:12:58.360 So the only reason that the previous thing worked, 02:12:58.360 --> 02:13:00.360 it's extremely buggy. 02:13:00.360 --> 02:13:01.360 The only reason that worked 02:13:01.360 --> 02:13:03.360 is that 02:13:03.360 --> 02:13:06.360 this is a very, very simple problem. 02:13:06.360 --> 02:13:08.360 And it's very easy for this neural net 02:13:08.360 --> 02:13:09.360 to fit this data. 
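The corrected loop order (forward pass, zero the grads, backward pass, update) can be sketched on a one-parameter toy problem. The gradient here is written out by hand, since the full Value engine isn't reproduced in this sketch, and the numbers and names are just for illustration:

```python
class Param:
    # stand-in for a micrograd Value: just .data and .grad
    def __init__(self, data):
        self.data, self.grad = data, 0.0

# toy problem: fit w so that w * x matches y; loss = (w*x - y)**2
w, x, y = Param(0.0), 2.0, 3.0

for k in range(20):
    # forward pass: compute the loss
    loss = (w.data * x - y) ** 2
    # zero grad BEFORE backward -- this is the fix for the bug
    w.grad = 0.0
    # "backward pass": hand-derived gradient d(loss)/dw,
    # accumulated with += just like the engine's backward ops
    w.grad += 2 * (w.data * x - y) * x
    # update: gradient descent step
    w.data += -0.05 * w.grad

print(w.data)  # converges toward 1.5, where the loss is 0
```

If the `w.grad = 0.0` line is removed, the stale gradients keep accumulating across iterations, which is exactly the bug being described: effectively a much larger and incorrect step size.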
02:13:09.360 --> 02:13:12.360 And so the grads ended up accumulating 02:13:12.360 --> 02:13:13.360 and it effectively gave us 02:13:13.360 --> 02:13:15.360 a massive step size. 02:13:15.360 --> 02:13:19.360 And it made us converge extremely fast. 02:13:19.360 --> 02:13:21.360 But basically now we have to do more steps 02:13:21.360 --> 02:13:24.360 to get to very low values of loss 02:13:24.360 --> 02:13:26.360 and get YPRED to be really good. 02:13:26.360 --> 02:13:27.360 We can try to 02:13:27.360 --> 02:13:34.360 step a bit greater. 02:13:34.360 --> 02:13:35.360 Yeah. 02:13:35.360 --> 02:13:36.360 We're going to get closer and closer 02:13:36.360 --> 02:13:38.360 to one minus one and one. 02:13:38.360 --> 02:13:39.360 So 02:13:39.360 --> 02:13:41.360 working with neural nets is sometimes 02:13:41.360 --> 02:13:44.360 tricky because 02:13:44.360 --> 02:13:46.360 you may have lots of bugs in the code 02:13:46.360 --> 02:13:48.360 and 02:13:48.360 --> 02:13:49.360 your network might actually work 02:13:49.360 --> 02:13:51.360 just like ours worked. 02:13:51.360 --> 02:13:52.360 But chances are is that 02:13:52.360 --> 02:13:54.360 if we had a more complex problem 02:13:54.360 --> 02:13:56.360 then actually this bug would have 02:13:56.360 --> 02:13:58.360 made us not optimize the loss very well. 02:13:58.360 --> 02:14:00.360 And we were only able to get away with it 02:14:00.360 --> 02:14:01.360 because 02:14:01.360 --> 02:14:03.360 the problem is very simple. 02:14:03.360 --> 02:14:05.360 So let's now bring everything together 02:14:05.360 --> 02:14:07.360 and summarize what we learned. 02:14:07.360 --> 02:14:08.360 What are neural nets? 02:14:08.360 --> 02:14:11.360 Neural nets are these mathematical expressions. 
02:14:11.360 --> 02:14:13.360 Fairly simple mathematical expressions, 02:14:13.360 --> 02:14:15.360 in the case of a multi-layer perceptron, 02:14:15.360 --> 02:14:18.360 that take the data as input, 02:14:18.360 --> 02:14:20.360 and that take as input the weights, 02:14:20.360 --> 02:14:21.360 the parameters of the neural net. 02:14:21.360 --> 02:14:24.360 It's a mathematical expression for the forward pass, 02:14:24.360 --> 02:14:25.360 followed by a loss function. 02:14:25.360 --> 02:14:27.360 And the loss function tries to measure 02:14:27.360 --> 02:14:29.360 the accuracy of the predictions. 02:14:29.360 --> 02:14:31.360 And usually the loss will be low 02:14:31.360 --> 02:14:33.360 when your predictions are matching your targets, 02:14:33.360 --> 02:14:36.360 or when the network is basically behaving well. 02:14:36.360 --> 02:14:38.360 So we manipulate the loss function 02:14:38.360 --> 02:14:40.360 so that when the loss is low, 02:14:40.360 --> 02:14:42.360 the network is doing what you want it to do 02:14:42.360 --> 02:14:43.360 on your problem. 02:14:43.360 --> 02:14:46.360 And then we backward the loss, 02:14:46.360 --> 02:14:48.360 using backpropagation to get the gradient, 02:14:48.360 --> 02:14:50.360 and then we know how to tune all the parameters 02:14:50.360 --> 02:14:52.360 to decrease the loss locally. 02:14:52.360 --> 02:14:54.360 But then we have to iterate that process 02:14:54.360 --> 02:14:57.360 many times in what's called gradient descent. 02:14:57.360 --> 02:14:59.360 So we simply follow the gradient information, 02:14:59.360 --> 02:15:01.360 and that minimizes the loss, 02:15:01.360 --> 02:15:02.360 and the loss is arranged so that 02:15:02.360 --> 02:15:04.360 when the loss is minimized, 02:15:04.360 --> 02:15:06.360 the network is doing what you want it to do. 02:15:06.360 --> 02:15:10.360 And yeah, so we just have a blob of neural stuff, 02:15:10.360 --> 02:15:12.360 and we can make it do arbitrary things.
02:15:12.360 --> 02:15:15.360 And that's what gives neural nets their power. 02:15:15.360 --> 02:15:17.360 It's, you know, this is a very tiny network 02:15:17.360 --> 02:15:19.360 with 41 parameters. 02:15:19.360 --> 02:15:21.360 But you can build significantly more complicated 02:15:21.360 --> 02:15:23.360 neural nets with billions, 02:15:23.360 --> 02:15:26.360 at this point almost trillions, of parameters. 02:15:26.360 --> 02:15:28.360 And it's a massive blob of neural tissue, 02:15:28.360 --> 02:15:30.360 simulated neural tissue, 02:15:30.360 --> 02:15:32.360 roughly speaking. 02:15:32.360 --> 02:15:35.360 And you can make it do extremely complex problems. 02:15:35.360 --> 02:15:37.360 And these neural nets then 02:15:37.360 --> 02:15:39.360 have all kinds of very fascinating emergent properties 02:15:39.360 --> 02:15:43.360 when you try to make them do 02:15:43.360 --> 02:15:45.360 significantly harder problems. 02:15:45.360 --> 02:15:47.360 As in the case of GPT, for example, 02:15:47.360 --> 02:15:50.360 we have massive amounts of text from the internet, 02:15:50.360 --> 02:15:52.360 and we're trying to get a neural net 02:15:52.360 --> 02:15:54.360 to take a few words 02:15:54.360 --> 02:15:56.360 and try to predict the next word in a sequence. 02:15:56.360 --> 02:15:58.360 That's the learning problem. 02:15:58.360 --> 02:15:59.360 And it turns out that when you train this 02:15:59.360 --> 02:16:00.360 on all of the internet, 02:16:00.360 --> 02:16:02.360 the neural net actually has really remarkable 02:16:02.360 --> 02:16:04.360 emergent properties. 02:16:04.360 --> 02:16:05.360 But that neural net would have 02:16:05.360 --> 02:16:07.360 hundreds of billions of parameters. 02:16:07.360 --> 02:16:10.360 But it works on fundamentally the exact same principles. 02:16:10.360 --> 02:16:13.360 The neural net of course will be a bit more complex. 
02:16:13.360 --> 02:16:16.360 But otherwise, evaluating the gradient 02:16:16.360 --> 02:16:19.360 is there and will be identical. 02:16:19.360 --> 02:16:21.360 And the gradient descent would be there 02:16:21.360 --> 02:16:22.360 and basically identical. 02:16:22.360 --> 02:16:24.360 But people usually use slightly different updates. 02:16:24.360 --> 02:16:28.360 This is a very simple stochastic gradient descent update. 02:16:28.360 --> 02:16:31.360 And the loss function would not be a mean squared error. 02:16:31.360 --> 02:16:33.360 They would be using something called the cross-entropy loss 02:16:33.360 --> 02:16:35.360 for predicting the next token. 02:16:35.360 --> 02:16:36.360 So there's a few more details, 02:16:36.360 --> 02:16:38.360 but fundamentally the neural network setup 02:16:38.360 --> 02:16:39.360 and neural network training 02:16:39.360 --> 02:16:41.360 is identical and pervasive. 02:16:41.360 --> 02:16:43.360 And now you understand intuitively 02:16:43.360 --> 02:16:45.360 how that works under the hood. 02:16:45.360 --> 02:16:46.360 In the beginning of this video, 02:16:46.360 --> 02:16:48.360 I told you that by the end of it 02:16:48.360 --> 02:16:50.360 you would understand everything in micrograd, 02:16:50.360 --> 02:16:52.360 and then we'd slowly build it up. 02:16:52.360 --> 02:16:54.360 Let me briefly prove that to you. 02:16:54.360 --> 02:16:55.360 So I'm going to step through all the code 02:16:55.360 --> 02:16:57.360 that is in micrograd as of today. 02:16:57.360 --> 02:16:59.360 Actually, potentially some of the code will change 02:16:59.360 --> 02:17:00.360 by the time you watch this video, 02:17:00.360 --> 02:17:03.360 because I intend to continue developing micrograd. 02:17:03.360 --> 02:17:05.360 But let's look at what we have so far at least. 02:17:05.360 --> 02:17:07.360 init.py is empty. 02:17:07.360 --> 02:17:10.360 When you go to engine.py, that has the Value.
02:17:10.360 --> 02:17:12.360 Everything here you should mostly recognize. 02:17:12.360 --> 02:17:14.360 So we have the data and grad attributes. 02:17:14.360 --> 02:17:16.360 We have the backward function. 02:17:16.360 --> 02:17:17.360 We have the previous set of children 02:17:17.360 --> 02:17:20.360 and the operation that produced this value. 02:17:20.360 --> 02:17:22.360 We have addition, multiplication, 02:17:22.360 --> 02:17:24.360 and raising to a scalar power. 02:17:24.360 --> 02:17:26.360 We have the ReLU non-linearity, 02:17:26.360 --> 02:17:28.360 which is a slightly different type of non-linearity 02:17:28.360 --> 02:17:30.360 than tanh that we used in this video. 02:17:30.360 --> 02:17:32.360 Both of them are non-linearities, 02:17:32.360 --> 02:17:34.360 and notably tanh is not actually present 02:17:34.360 --> 02:17:36.360 in micrograd as of right now, 02:17:36.360 --> 02:17:38.360 but I intend to add it later. 02:17:38.360 --> 02:17:40.360 We have the backward, which is identical, 02:17:40.360 --> 02:17:42.360 and then all of these other operations, 02:17:42.360 --> 02:17:45.360 which are built up on top of the operations here. 02:17:45.360 --> 02:17:47.360 So Value should be very recognizable, 02:17:47.360 --> 02:17:49.360 except for the non-linearity used in this video. 02:17:50.360 --> 02:17:52.360 There's no massive difference between ReLU and tanh 02:17:52.360 --> 02:17:54.360 and sigmoid and these other non-linearities. 02:17:54.360 --> 02:17:56.360 They're all roughly equivalent 02:17:56.360 --> 02:17:58.360 and can be used in MLPs. 02:17:58.360 --> 02:18:00.360 So I used tanh because it's a bit smoother, 02:18:00.360 --> 02:18:02.360 and because it's a little bit more complicated than ReLU 02:18:02.360 --> 02:18:04.360 it therefore stresses a little bit more 02:18:04.360 --> 02:18:06.360 the local gradients 02:18:06.360 --> 02:18:08.360 and working with those derivatives, 02:18:08.360 --> 02:18:10.360 which I thought would be useful. 
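The core of engine.py described here can be condensed into a short sketch, simplified to show only the pieces named in the transcript: the data and grad attributes, the children and op bookkeeping, one operator (addition), the ReLU non-linearity, and the topological-sort backward pass:

```python
class Value:
    """Condensed sketch of micrograd's Value object (addition and ReLU only)."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # closure that applies the local chain rule
        self._prev = set(_children)     # the children that produced this value
        self._op = _op                  # the operation, e.g. '+', 'ReLU'

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # addition routes the gradient unchanged to both children
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'ReLU')
        def _backward():
            # local derivative of ReLU is 1 where the output is positive, else 0
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
```

The real engine.py additionally implements multiplication, scalar powers, and the operations built on top of those, but the bookkeeping pattern is exactly this.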
02:18:10.360 --> 02:18:12.360 nn.py is the neural networks library, 02:18:12.360 --> 02:18:14.360 as I mentioned. 02:18:14.360 --> 02:18:16.360 So you should recognize the identical implementation 02:18:16.360 --> 02:18:18.360 of Neuron, Layer, and MLP. 02:18:18.360 --> 02:18:20.360 Notably, 02:18:20.360 --> 02:18:22.360 we have a class Module here 02:18:22.360 --> 02:18:24.360 that is a parent class of all these modules. 02:18:24.360 --> 02:18:26.360 I did that because there's an nn.Module class 02:18:26.360 --> 02:18:28.360 in PyTorch, 02:18:28.360 --> 02:18:30.360 so this exactly matches that API, 02:18:30.360 --> 02:18:32.360 and nn.Module in PyTorch also has a zero_grad, 02:18:32.360 --> 02:18:34.360 which I refactored out here. 02:18:36.360 --> 02:18:38.360 So that's the end of micrograd, really. 02:18:38.360 --> 02:18:40.360 Then there's a test, 02:18:40.360 --> 02:18:42.360 which you'll see basically creates 02:18:42.360 --> 02:18:44.360 two chunks of code, 02:18:44.360 --> 02:18:46.360 one in micrograd and one in PyTorch, 02:18:46.360 --> 02:18:48.360 and we'll make sure that the forward 02:18:48.360 --> 02:18:50.360 and the backward pass agree identically. 02:18:50.360 --> 02:18:52.360 For a slightly less complicated expression 02:18:52.360 --> 02:18:54.360 and a slightly more complicated expression, 02:18:54.360 --> 02:18:56.360 everything agrees, 02:18:56.360 --> 02:18:58.360 so we agree with PyTorch on all of these operations. 02:18:58.360 --> 02:19:00.360 And finally there's a demo.ipynb 02:19:00.360 --> 02:19:02.360 here, and it's a bit more 02:19:02.360 --> 02:19:04.360 complicated binary classification demo 02:19:04.360 --> 02:19:06.360 than the one I covered in this lecture. 02:19:06.360 --> 02:19:08.360 So we only had a tiny dataset of four examples. 
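The Module pattern described here, a parent class that factors out zero_grad and parameters and mirrors PyTorch's nn.Module API, can be sketched as follows (Param is a hypothetical stand-in for the Value class, carrying just data and grad; the forward logic of Neuron is omitted since the point is the shared parameter bookkeeping):

```python
import random

class Param:
    """Hypothetical stand-in for Value: just a number and its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

class Module:
    """Parent class shared by Neuron, Layer, and MLP, matching nn.Module's API."""
    def zero_grad(self):
        # reset every parameter's gradient before the next backward pass
        for p in self.parameters():
            p.grad = 0.0
    def parameters(self):
        return []

class Neuron(Module):
    def __init__(self, nin):
        # one weight per input, plus a bias, as in micrograd's nn.py
        self.w = [Param(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Param(0.0)
    def parameters(self):
        return self.w + [self.b]
```

Because zero_grad lives on the parent class, Layer and MLP get it for free once they implement parameters().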
02:19:08.360 --> 02:19:10.360 Here we have a bit more 02:19:10.360 --> 02:19:12.360 complicated example with lots of 02:19:12.360 --> 02:19:14.360 blue points and lots of red points, 02:19:14.360 --> 02:19:16.360 and we're trying to again build a binary classifier 02:19:16.360 --> 02:19:18.360 to distinguish two-dimensional 02:19:18.360 --> 02:19:20.360 points as red or blue. 02:19:20.360 --> 02:19:22.360 It's a bit more complicated MLP here, 02:19:22.360 --> 02:19:24.360 a bigger MLP. 02:19:24.360 --> 02:19:26.360 The loss is a bit more complicated 02:19:26.360 --> 02:19:28.360 because it supports batches. 02:19:28.360 --> 02:19:30.360 Because our dataset 02:19:30.360 --> 02:19:32.360 was so tiny, we always did a forward pass 02:19:32.360 --> 02:19:34.360 on the entire dataset of four examples. 02:19:34.360 --> 02:19:36.360 But when your dataset is like a million 02:19:36.360 --> 02:19:38.360 examples, what we usually do in practice 02:19:38.360 --> 02:19:40.360 is we basically 02:19:40.360 --> 02:19:42.360 pick out some random subset, we call that a batch, 02:19:42.360 --> 02:19:44.360 and then we only process the batch: 02:19:44.360 --> 02:19:46.360 forward, backward, and update. 02:19:46.360 --> 02:19:48.360 So we don't have to forward the entire training set. 02:19:48.360 --> 02:19:50.360 So this is 02:19:50.360 --> 02:19:52.360 something that supports batching, 02:19:52.360 --> 02:19:54.360 because there are a lot more examples here. 02:19:54.360 --> 02:19:56.360 We do a forward pass. 02:19:56.360 --> 02:19:58.360 The loss is slightly different. 02:19:58.360 --> 02:20:00.360 This is a max-margin loss that I implement here. 02:20:00.360 --> 02:20:02.360 The one that we used was 02:20:02.360 --> 02:20:04.360 the mean squared error loss, 02:20:04.360 --> 02:20:06.360 because it's the simplest one. 02:20:06.360 --> 02:20:08.360 There's also the binary cross-entropy loss. 
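The two ideas just described, sampling a random batch instead of forwarding the whole dataset, and the max-margin loss, can be sketched like this (function names are mine, and the exact form in the demo notebook may differ):

```python
import random

def get_batch(X, y, batch_size):
    """Minibatching sketch: sample a random subset instead of the full dataset."""
    idx = random.sample(range(len(X)), batch_size)
    return [X[i] for i in idx], [y[i] for i in idx]

def max_margin_loss(scores, labels):
    """Max-margin (hinge) loss: labels are +1/-1, scores are the model outputs.
    Examples on the correct side of the margin (label * score >= 1) cost nothing;
    everything else is penalized linearly."""
    losses = [max(0.0, 1 - yi * si) for yi, si in zip(labels, scores)]
    return sum(losses) / len(losses)
```

Each training step would then call get_batch, forward only that batch, backward, and update, so the cost per step no longer grows with the dataset size.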
02:20:08.360 --> 02:20:10.360 All of them can be used for binary classification 02:20:10.360 --> 02:20:12.360 and don't make too much of a difference 02:20:12.360 --> 02:20:14.360 in the simple examples that we looked at so far. 02:20:14.360 --> 02:20:16.360 There's something called L2 regularization 02:20:16.360 --> 02:20:18.360 used here. 02:20:18.360 --> 02:20:20.360 This has to do with generalization of the neural net 02:20:20.360 --> 02:20:22.360 and controls overfitting in a machine learning setting, 02:20:22.360 --> 02:20:24.360 but I did not cover these concepts 02:20:24.360 --> 02:20:26.360 in this video; potentially later. 02:20:26.360 --> 02:20:28.360 And the training loop you should recognize. 02:20:28.360 --> 02:20:30.360 So: forward, backward, 02:20:30.360 --> 02:20:32.360 zero_grad, 02:20:32.360 --> 02:20:34.360 and update, and so on. 02:20:34.360 --> 02:20:36.360 You'll notice that in the update here, 02:20:36.360 --> 02:20:38.360 the learning rate is scaled as a function of the 02:20:38.360 --> 02:20:40.360 number of iterations, and it 02:20:40.360 --> 02:20:42.360 shrinks. 02:20:42.360 --> 02:20:44.360 And this is something called learning rate decay. 02:20:44.360 --> 02:20:46.360 So in the beginning you have a high learning rate, 02:20:46.360 --> 02:20:48.360 and as the network sort of stabilizes near the end, 02:20:48.360 --> 02:20:50.360 you bring down the learning rate 02:20:50.360 --> 02:20:52.360 to get to some of the fine details in the end. 02:20:52.360 --> 02:20:54.360 And in the end we see 02:20:54.360 --> 02:20:56.360 the decision surface of the neural net, 02:20:56.360 --> 02:20:58.360 and we see that it learned to separate out the red 02:20:58.360 --> 02:21:00.360 and the blue areas based on 02:21:00.360 --> 02:21:02.360 the data points. 02:21:02.360 --> 02:21:04.360 So that's the slightly more complicated example 02:21:04.360 --> 02:21:06.360 in the demo.ipynb 02:21:06.360 --> 02:21:08.360 that you're free to go over. 
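The two techniques mentioned in this passage, learning rate decay and L2 regularization, are each a one-liner; here is a sketch with illustrative constants (not necessarily the ones used in the demo notebook):

```python
def learning_rate(step, total_steps=100, lr_start=1.0, lr_end=0.1):
    """Linear learning rate decay: high at the start, shrinking as training
    stabilizes, so the final steps can settle the fine details."""
    frac = step / total_steps
    return lr_start + frac * (lr_end - lr_start)

def l2_penalty(params, alpha=1e-4):
    """L2 regularization: an extra loss term on the squared weights that
    discourages large weights and thereby helps control overfitting."""
    return alpha * sum(p * p for p in params)
```

In the training loop the penalty is simply added to the data loss before calling backward, and the decayed rate is used in the parameter update for that step.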
02:21:08.360 --> 02:21:10.360 But yeah, as of today, that is micrograd. 02:21:10.360 --> 02:21:12.360 I also wanted to show you a little bit of real stuff, 02:21:12.360 --> 02:21:14.360 so that you get to see how this is actually implemented 02:21:14.360 --> 02:21:16.360 in a production-grade library like PyTorch. 02:21:16.360 --> 02:21:18.360 So in particular, 02:21:18.360 --> 02:21:20.360 I wanted to find and show you 02:21:20.360 --> 02:21:22.360 the backward pass for tanh in PyTorch. 02:21:22.360 --> 02:21:24.360 So here in micrograd 02:21:24.360 --> 02:21:26.360 we see that the backward pass for tanh 02:21:26.360 --> 02:21:28.360 is 1 minus t squared, 02:21:28.360 --> 02:21:30.360 where t is the output of the tanh 02:21:30.360 --> 02:21:32.360 of x, 02:21:32.360 --> 02:21:34.360 times out.grad, 02:21:34.360 --> 02:21:36.360 which is the chain rule. 02:21:36.360 --> 02:21:38.360 So we're looking for something that looks like this. 02:21:38.360 --> 02:21:40.360 Now, I went to PyTorch, 02:21:40.360 --> 02:21:42.360 which has 02:21:42.360 --> 02:21:44.360 an open-source GitHub codebase, 02:21:44.360 --> 02:21:46.360 and I looked through a lot of its code, 02:21:46.360 --> 02:21:48.360 and honestly, 02:21:48.360 --> 02:21:50.360 I spent about 15 minutes 02:21:50.360 --> 02:21:52.360 and I couldn't find tanh. 02:21:52.360 --> 02:21:54.360 And that's because these libraries, unfortunately, 02:21:54.360 --> 02:21:56.360 they grow in size and entropy. 02:21:56.360 --> 02:21:58.360 And if you just search for tanh, 02:21:58.360 --> 02:22:00.360 you get apparently 2,800 results 02:22:00.360 --> 02:22:02.360 in 406 files. 02:22:02.360 --> 02:22:04.360 So I don't know what these files 02:22:04.360 --> 02:22:06.360 are doing, honestly, 02:22:06.360 --> 02:22:08.360 and why there are 02:22:08.360 --> 02:22:10.360 so many mentions of tanh. 
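The expression being searched for, the tanh backward pass, written out as a tiny sketch (the standalone function is mine; micrograd expresses the same thing inside the Value class):

```python
import math

def tanh_backward(x, out_grad):
    """Chain rule for tanh: the local derivative is 1 - tanh(x)**2,
    multiplied by the incoming gradient out_grad."""
    t = math.tanh(x)
    return (1 - t ** 2) * out_grad
```

A quick finite-difference check confirms that (1 - t**2) really is the slope of tanh at x.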
02:22:10.360 --> 02:22:12.360 But unfortunately these libraries are quite complex; 02:22:12.360 --> 02:22:14.360 they're meant to be used, not really inspected. 02:22:14.360 --> 02:22:16.360 Eventually I did 02:22:16.360 --> 02:22:18.360 stumble on someone 02:22:18.360 --> 02:22:20.360 who tries to change 02:22:20.360 --> 02:22:22.360 the tanh backward code for some reason, 02:22:22.360 --> 02:22:24.360 and someone here pointed to the 02:22:24.360 --> 02:22:26.360 CPU kernel and the CUDA kernel for 02:22:26.360 --> 02:22:28.360 tanh backward. 02:22:28.360 --> 02:22:30.360 So basically it depends on if you're using 02:22:30.360 --> 02:22:32.360 PyTorch on a CPU device or on a GPU, 02:22:32.360 --> 02:22:34.360 which are different devices, 02:22:34.360 --> 02:22:36.360 and I haven't covered this. 02:22:36.360 --> 02:22:38.360 But this is the tanh backward kernel 02:22:38.360 --> 02:22:40.360 for CPU, 02:22:40.360 --> 02:22:42.360 and the reason it's so large 02:22:42.360 --> 02:22:44.360 is that, 02:22:44.360 --> 02:22:46.360 number one, this is if you're using a complex type, 02:22:46.360 --> 02:22:48.360 which we haven't even talked about; 02:22:48.360 --> 02:22:50.360 number two, if you're using the specific data type bfloat16, 02:22:50.360 --> 02:22:52.360 which we haven't talked about; 02:22:52.360 --> 02:22:54.360 and then if you're not, 02:22:54.360 --> 02:22:56.360 then this is the kernel, 02:22:56.360 --> 02:22:58.360 and deep here we see something that resembles 02:22:58.360 --> 02:23:00.360 our backward pass. 
02:23:00.360 --> 02:23:02.360 So they have a times one minus 02:23:02.360 --> 02:23:04.360 b squared, 02:23:04.360 --> 02:23:06.360 so this b here 02:23:06.360 --> 02:23:08.360 must be the output of the tanh, 02:23:08.360 --> 02:23:10.360 and this is the out.grad. 02:23:10.360 --> 02:23:12.360 So here we found it, 02:23:12.360 --> 02:23:14.360 deep inside 02:23:14.360 --> 02:23:16.360 PyTorch at this location, 02:23:16.360 --> 02:23:18.360 for some reason inside the binary ops kernel, 02:23:18.360 --> 02:23:20.360 even though tanh is not actually a binary op. 02:23:20.360 --> 02:23:22.360 And then this is the 02:23:22.360 --> 02:23:24.360 GPU kernel: 02:23:24.360 --> 02:23:26.360 if we're not complex, 02:23:26.360 --> 02:23:28.360 we're here, 02:23:28.360 --> 02:23:30.360 and here we go with one line of code. 02:23:30.360 --> 02:23:32.360 So we did find it, 02:23:32.360 --> 02:23:34.360 but basically, unfortunately, 02:23:34.360 --> 02:23:36.360 these codebases are very large, 02:23:36.360 --> 02:23:38.360 and micrograd is very, very simple, 02:23:38.360 --> 02:23:40.360 but if you actually want to use real stuff 02:23:40.360 --> 02:23:42.360 and find the code for it, 02:23:42.360 --> 02:23:44.360 you'll actually find that difficult. 02:23:44.360 --> 02:23:46.360 I also wanted to show you 02:23:46.360 --> 02:23:48.360 a little example here where PyTorch is showing you 02:23:48.360 --> 02:23:50.360 you can register a new type of function 02:23:50.360 --> 02:23:52.360 that you want to add to PyTorch 02:23:52.360 --> 02:23:54.360 as a Lego building block. 02:23:54.360 --> 02:23:56.360 So here, if you want to for example add 02:23:56.360 --> 02:23:58.360 a Legendre polynomial (degree 3), 02:23:58.360 --> 02:24:00.360 here's how you could do it. 02:24:00.360 --> 02:24:02.360 You will register it 02:24:02.360 --> 02:24:04.360 as a class that 02:24:04.360 --> 02:24:06.360 subclasses torch.autograd.Function, 02:24:06.360 --> 02:24:08.360 and then you have to tell PyTorch how to forward 02:24:08.360 --> 02:24:10.360 your new 
function 02:24:10.360 --> 02:24:12.360 and how to backward through it 02:24:12.360 --> 02:24:14.360 so as long as you can do the forward pass 02:24:14.360 --> 02:24:16.360 of this little function piece that you want to add 02:24:16.360 --> 02:24:18.360 and as long as you know the local 02:24:18.360 --> 02:24:20.360 derivative, the local gradients 02:24:20.360 --> 02:24:22.360 which are implemented in the backward 02:24:22.360 --> 02:24:24.360 PyTorch will be able to back propagate through your function 02:24:24.360 --> 02:24:26.360 and then you can use this as a lego block 02:24:26.360 --> 02:24:28.360 in a larger lego castle 02:24:28.360 --> 02:24:30.360 of all the different lego blocks that PyTorch already has 02:24:30.360 --> 02:24:32.360 and so that's the only thing 02:24:32.360 --> 02:24:34.360 you have to tell PyTorch and everything will just work 02:24:34.360 --> 02:24:36.360 and you can register new types of functions 02:24:36.360 --> 02:24:38.360 in this way following this example 02:24:38.360 --> 02:24:40.360 and that is everything that I wanted to cover 02:24:40.360 --> 02:24:42.360 in this lecture 02:24:42.360 --> 02:24:44.360 so I hope you enjoyed building out micrograd with me 02:24:44.360 --> 02:24:46.360 I hope you find it interesting, insightful 02:24:46.360 --> 02:24:48.360 and yeah 02:24:48.360 --> 02:24:50.360 I will post a lot of the links 02:24:50.360 --> 02:24:52.360 that are related to this video 02:24:52.360 --> 02:24:54.360 in the video description below 02:24:54.360 --> 02:24:56.360 I will also probably post a link to a discussion forum 02:24:56.360 --> 02:24:58.360 or discussion group where you can ask 02:24:58.360 --> 02:25:00.360 questions related to this video 02:25:00.360 --> 02:25:02.360 and then I can answer or someone else can answer 02:25:02.360 --> 02:25:04.360 your questions 02:25:04.360 --> 02:25:06.360 and I may also do a follow up video 02:25:06.360 --> 02:25:08.360 that answers some of the most common questions 02:25:08.360 
--> 02:25:10.360 but for now that's it 02:25:10.360 --> 02:25:12.360 I hope you enjoyed it 02:25:12.360 --> 02:25:14.360 if you did, then please like and subscribe 02:25:14.360 --> 02:25:16.360 so that YouTube knows to feature this video to more people 02:25:16.360 --> 02:25:18.360 and that's it for now, I'll see you later 02:25:18.360 --> 02:25:20.360 bye
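The torch.autograd.Function registration pattern discussed near the end of the lecture can be sketched in code; this mirrors PyTorch's own tutorial example of a degree-3 Legendre polynomial and assumes PyTorch is installed:

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    """P3(x) = (5x^3 - 3x) / 2, registered as a new autograd 'Lego block'."""

    @staticmethod
    def forward(ctx, input):
        # do the forward pass, stashing what backward will need
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative P3'(x) = (15x^2 - 3) / 2, times the incoming gradient
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# Usage: call through .apply, and PyTorch backpropagates through it like any
# built-in operation.
x = torch.tensor(2.0, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.backward()
```

As the transcript says, supplying just the forward pass and the local derivative is the whole contract; PyTorch's autograd handles the rest of the chain rule.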