WEBVTT 00:00.000 --> 00:04.380 Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade 00:04.380 --> 00:09.120 and in this lecture I'd like to show you what neural network training looks like under the hood. 00:09.840 --> 00:14.080 So in particular we are going to start with a blank Jupyter notebook and by the end of this 00:14.080 --> 00:18.640 lecture we will define and train a neural net and you'll get to see everything that goes on 00:18.640 --> 00:23.680 under the hood and exactly sort of how that works on an intuitive level. Now specifically what I 00:23.680 --> 00:29.240 would like to do is I would like to take you through the building of micrograd. Now micrograd 00:29.240 --> 00:34.080 is this library that I released on GitHub about two years ago but at the time I only uploaded the 00:34.080 --> 00:39.940 source code and you'd have to go in by yourself and really figure out how it works. So in this 00:39.940 --> 00:43.740 lecture I will take you through it step by step and kind of comment on all the pieces of it. 00:44.140 --> 00:51.520 So what is micrograd and why is it interesting? Micrograd is basically an autograd 00:51.520 --> 00:56.580 engine. Autograd is short for automatic gradient and really what it does is it implements back 00:56.580 --> 00:59.240 propagation. Now back propagation is this algorithm
00:59.240 --> 01:05.320 that allows you to efficiently evaluate the gradient of some kind of a loss function with 01:05.320 --> 01:10.160 respect to the weights of a neural network and what that allows us to do then is we can 01:10.160 --> 01:14.100 iteratively tune the weights of that neural network to minimize the loss function and 01:14.100 --> 01:18.820 therefore improve the accuracy of the network. So back propagation would be at the mathematical 01:18.820 --> 01:24.720 core of any modern deep neural network library like say PyTorch or JAX. So the functionality 01:24.720 --> 01:28.780 of micrograd is I think best illustrated by an example. So if we just scroll down here 01:28.780 --> 01:33.460 you'll see that micrograd basically allows you to build out mathematical expressions 01:33.460 --> 01:38.580 and here what we are doing is we have an expression that we're building out where you have two 01:38.580 --> 01:45.780 inputs a and b and you'll see that a and b are negative four and two but we are wrapping those 01:45.780 --> 01:51.820 values into this value object that we are going to build out as part of micrograd. So this value 01:51.820 --> 01:57.160 object will wrap the numbers themselves and then we are going to build out a mathematical expression 01:57.160 --> 01:58.760 here where a and b 01:58.760 --> 02:03.300 are transformed into C, D, and eventually E, F, and G. 02:03.940 --> 02:07.040 And I'm showing some of the functionality of micrograd 02:07.040 --> 02:08.460 and the operations that it supports. 02:08.820 --> 02:10.680 So you can add two value objects. 02:10.880 --> 02:11.780 You can multiply them. 02:12.160 --> 02:14.160 You can raise them to a constant power. 02:14.600 --> 02:17.460 You can offset by one, negate, squash at zero, 02:18.200 --> 02:22.000 square, divide by a constant, divide by it, et cetera.
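As a warm-up, the kinds of operations listed above can be tried on plain Python floats first, with no value objects and no autograd. This is a shortened, made-up expression for illustration, not the exact one from the micrograd README:

```python
# Float-only warm-up mirroring the kinds of ops micrograd supports:
# add, multiply, raise to a constant power, square, divide.
# (Shortened, made-up expression; not the exact README example.)
a, b = -4.0, 2.0
c = a + b            # add two values
d = a * b + b**3     # multiply, and raise to a constant power
e = c - d
f = e**2             # square
g = f / 2.0          # divide by a constant
g += 10.0 / f        # divide by it
print(g)             # a plain float result
```

The difference in micrograd is that every one of these intermediate steps is wrapped and recorded, so the gradient can later be backpropagated through them.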
02:22.820 --> 02:24.680 And so we're building out an expression graph 02:24.680 --> 02:26.980 with these two inputs, A and B, 02:26.980 --> 02:29.520 and we're creating an output value of G. 02:30.580 --> 02:32.420 And micrograd will, in the background, 02:32.820 --> 02:34.640 build out this entire mathematical expression. 02:35.200 --> 02:38.020 So it will, for example, know that C is also a value. 02:38.840 --> 02:40.800 C was a result of an addition operation. 02:41.500 --> 02:45.860 And the child nodes of C are A and B 02:45.860 --> 02:49.620 because it will maintain pointers to A and B value objects. 02:49.940 --> 02:52.580 So we'll basically know exactly how all of this is laid out. 02:53.100 --> 02:56.320 And then not only can we do what we call the forward pass, 02:56.320 --> 02:56.900 where we actually, 02:56.980 --> 02:58.360 if we look at the value of G, of course, 02:58.440 --> 02:59.380 that's pretty straightforward, 02:59.960 --> 03:02.680 we will access that using the dot data attribute. 03:02.920 --> 03:05.320 And so the output of the forward pass, 03:05.760 --> 03:08.580 the value of G, is 24.7, it turns out. 03:08.860 --> 03:12.620 But the big deal is that we can also take this G value object 03:12.620 --> 03:14.220 and we can call dot backward. 03:14.760 --> 03:18.600 And this will basically initialize backpropagation at the node G. 03:19.980 --> 03:21.360 And what backpropagation is going to do 03:21.360 --> 03:22.580 is it's going to start at G 03:22.580 --> 03:26.040 and it's going to go backwards through that expression graph 03:26.040 --> 03:26.880 and it's going to 03:26.980 --> 03:29.640 recursively apply the chain rule from calculus.
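To make the "child nodes" idea concrete, here is a minimal sketch, written from scratch for this note rather than copied from micrograd, of a wrapper that records which nodes produced a result:

```python
class Value:
    """Toy sketch: wrap a number and remember which nodes produced it."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # pointers back to the child value objects
        self._op = _op               # the operation that produced this node

    def __add__(self, other):
        # the new node keeps pointers to self and other, its children
        return Value(self.data + other.data, (self, other), '+')

a = Value(-4.0)
b = Value(2.0)
c = a + b
print(c.data, c._op)      # prints: -2.0 +
print(c._prev == {a, b})  # prints: True  (c's children are exactly a and b)
```

This is exactly the bookkeeping that lets backpropagation later walk from the output back through the graph.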
03:30.280 --> 03:31.920 And what that allows us to do then 03:32.280 --> 03:35.480 is we're going to evaluate basically the derivative of G 03:35.760 --> 03:39.900 with respect to all the internal nodes like E, D, and C, 03:40.260 --> 03:42.760 but also with respect to the inputs A and B. 03:43.400 --> 03:46.900 And then we can actually query this derivative of G 03:46.900 --> 03:49.600 with respect to A, for example, that's A dot grad. 03:49.960 --> 03:51.540 In this case, it happens to be 138. 03:52.040 --> 03:54.040 And the derivative of G with respect to B, 03:54.440 --> 03:56.940 which also happens to be here, 645. 03:57.520 --> 03:59.240 And this derivative, we'll see soon, 03:59.340 --> 04:00.540 is very important information 04:00.540 --> 04:04.660 because it's telling us how A and B are affecting G 04:04.660 --> 04:06.200 through this mathematical expression. 04:06.820 --> 04:09.940 So in particular, A dot grad is 138. 04:10.060 --> 04:13.980 So if we slightly nudge A and make it slightly larger, 04:14.960 --> 04:17.480 138 is telling us that G will grow 04:17.480 --> 04:20.140 and the slope of that growth is going to be 138. 04:20.880 --> 04:23.700 And the slope of growth of B is going to be 645. 04:24.320 --> 04:26.780 So that's going to tell us about how G will respond, 04:26.980 --> 04:29.160 if A and B get tweaked a tiny amount 04:29.160 --> 04:31.400 in a positive direction, okay? 04:33.300 --> 04:36.220 Now, you might be confused about what this expression is 04:36.220 --> 04:37.060 that we built out here. 04:37.200 --> 04:39.360 And this expression, by the way, is completely meaningless. 04:39.640 --> 04:40.720 I just made it up. 04:40.820 --> 04:42.760 I'm just flexing about the kinds of operations 04:42.760 --> 04:44.360 that are supported by micrograd. 04:44.960 --> 04:47.000 What we actually really care about are neural networks. 
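The "nudge an input and watch the output" reading of these grad values can be checked numerically on any expression. The expression in the video is more elaborate, so here is the same idea sketched with a simple made-up expression g(a, b, c) = a*b + c:

```python
# Numerically estimate how g responds to a small nudge of each input.
# g here is a simple made-up expression, not the one from the README.
def g(a, b, c):
    return a * b + c

h = 0.0001
a, b, c = 2.0, -3.0, 10.0
base = g(a, b, c)
slope_a = (g(a + h, b, c) - base) / h   # how g responds to nudging a
slope_b = (g(a, b + h, c) - base) / h   # how g responds to nudging b
slope_c = (g(a, b, c + h) - base) / h   # how g responds to nudging c
print(slope_a, slope_b, slope_c)        # close to b, a, and 1 respectively
```

Backpropagation computes these same sensitivities, but exactly and for all inputs in a single backward pass rather than one nudge at a time.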
04:47.460 --> 04:48.700 But it turns out that neural networks 04:48.700 --> 04:51.520 are just mathematical expressions, just like this one, 04:51.780 --> 04:53.720 but actually even slightly less crazy. 04:54.820 --> 04:56.720 Neural networks are just a mathematical expression: 04:57.100 --> 04:59.500 they take the input data as an input, 04:59.800 --> 05:01.980 and they take the weights of the neural network as an input, 05:02.340 --> 05:03.500 and the output of that mathematical expression 05:03.860 --> 05:06.300 is the predictions of your neural net, 05:06.300 --> 05:07.480 or the loss function. 05:07.620 --> 05:08.400 We'll see this in a bit. 05:09.000 --> 05:11.260 But basically, neural networks just happen to be 05:11.260 --> 05:13.200 a certain class of mathematical expressions. 05:13.860 --> 05:16.540 But backpropagation is actually significantly more general. 05:16.840 --> 05:19.040 It doesn't actually care about neural networks at all. 05:19.180 --> 05:21.620 It only cares about arbitrary mathematical expressions. 05:22.040 --> 05:24.140 And then we happen to use that machinery 05:24.140 --> 05:25.980 for training of neural networks. 05:26.500 --> 05:26.920 Now, one more 05:26.980 --> 05:28.480 note I would like to make at this stage 05:28.480 --> 05:29.420 is that, as you see here, 05:29.540 --> 05:31.920 micrograd is a scalar-valued autograd engine. 05:32.340 --> 05:35.220 So it's working on the level of individual scalars, 05:35.300 --> 05:36.440 like negative four and two. 05:36.820 --> 05:37.740 And we're taking neural nets 05:37.740 --> 05:38.600 and we're breaking them down 05:38.600 --> 05:41.220 all the way to these atoms of individual scalars 05:41.220 --> 05:42.760 and all the little pluses and times, 05:42.920 --> 05:44.200 and it's just excessive. 05:44.900 --> 05:45.540 And so, obviously, 05:45.660 --> 05:47.580 you would never be doing any of this in production.
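As a tiny illustration of "a neural network is just a mathematical expression": a single neuron with two inputs can be written out directly as one expression. The specific numbers below are made up for illustration:

```python
import math

# One neuron as a single plain mathematical expression:
# inputs x1, x2; weights w1, w2; bias b; tanh squashing activation.
# All specific numbers here are made up for illustration.
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432
out = math.tanh(x1*w1 + x2*w2 + b)
print(out)   # roughly 0.7071
```

A full network is just many of these expressions composed together, with a loss function as the final output.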
05:47.920 --> 05:49.840 It's really just done for pedagogical reasons 05:49.840 --> 05:51.740 because it allows us to not have to deal 05:51.740 --> 05:53.640 with these n-dimensional tensors 05:53.640 --> 05:56.400 that you would use in a modern deep neural network library. 05:56.980 --> 05:59.520 So this is really done so that you understand 05:59.520 --> 06:01.300 backpropagation and the chain rule 06:01.300 --> 06:04.280 and get an intuitive understanding of neural network training. 06:04.960 --> 06:07.360 And then, if you actually want to train bigger networks, 06:07.540 --> 06:08.800 you have to be using these tensors, 06:09.080 --> 06:10.240 but none of the math changes. 06:10.380 --> 06:11.760 This is done purely for efficiency. 06:12.360 --> 06:15.200 We are basically taking all the scalar values, 06:15.520 --> 06:16.980 we're packaging them up into tensors, 06:17.260 --> 06:19.020 which are just arrays of these scalars. 06:19.440 --> 06:21.720 And then, because we have these large arrays, 06:21.720 --> 06:23.860 we're doing operations on those large arrays, 06:23.860 --> 06:26.700 and that allows us to take advantage of the parallelism 06:26.700 --> 06:27.200 in a computer. 06:27.840 --> 06:30.000 And all those operations can be done in parallel, 06:30.280 --> 06:31.880 and then the whole thing runs faster. 06:32.320 --> 06:33.580 But really, none of the math changes, 06:33.740 --> 06:34.980 and it's done purely for efficiency. 06:35.400 --> 06:37.220 So I don't think that it's pedagogically useful 06:37.220 --> 06:39.080 to be dealing with tensors from scratch. 06:39.680 --> 06:42.040 And that's fundamentally why I wrote micrograd, 06:42.360 --> 06:43.920 because you can understand how things work 06:43.920 --> 06:45.360 at the fundamental level, 06:45.720 --> 06:47.240 and then you can speed it up later. 06:48.160 --> 06:49.160 Okay, so here's the fun part.
06:49.500 --> 06:51.580 My claim is that micrograd is what you need 06:51.580 --> 06:52.600 to train neural networks, 06:52.600 --> 06:54.300 and everything else is just efficiency. 06:54.920 --> 06:56.280 So you'd think that micrograd would be 06:56.700 --> 06:58.240 a very complex piece of code. 06:58.500 --> 07:00.960 And that turns out to not be the case. 07:01.420 --> 07:03.040 So if we just go to micrograd, 07:03.540 --> 07:07.040 you'll see that there are only two files here in micrograd. 07:07.340 --> 07:08.540 This is the actual engine. 07:08.780 --> 07:10.280 It doesn't know anything about neural nets. 07:10.580 --> 07:12.580 And this is the entire neural nets library 07:13.160 --> 07:14.160 on top of micrograd. 07:14.360 --> 07:17.080 So engine.py and nn.py. 07:17.620 --> 07:20.760 So the actual backpropagation autograd engine 07:21.660 --> 07:23.320 that gives you the power of neural networks 07:23.760 --> 07:24.760 is literally 07:24.840 --> 07:26.400 100 lines of code 07:26.400 --> 07:28.400 of, like, very simple Python, 07:28.400 --> 07:30.400 which we'll understand by the end of this lecture. 07:30.400 --> 07:32.400 And then nn.py, 07:32.400 --> 07:34.400 this neural network library 07:34.400 --> 07:36.400 built on top of the autograd engine, 07:36.400 --> 07:38.400 is like a joke. 07:38.400 --> 07:40.400 It's like, we have to define what is a neuron, 07:40.400 --> 07:42.400 and then we have to define what is a layer of neurons, 07:42.400 --> 07:44.400 and then we define what is a multilayer perceptron, 07:44.400 --> 07:46.400 which is just a sequence of layers of neurons. 07:46.400 --> 07:48.400 And so it's just a total joke. 07:48.400 --> 07:50.400 So basically, 07:50.400 --> 07:52.400 there's a lot of power 07:52.400 --> 07:58.200 that comes from only 150 lines of code.
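To give a flavor of what those roughly 100 lines in engine.py do, here is a heavily condensed sketch written from scratch for this note (addition and multiplication only), not the actual repo code:

```python
class Value:
    """Condensed sketch of a scalar autograd node (add and mul only)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                  # derivative of the output w.r.t. this node
        self._backward = lambda: None    # how to pass the gradient to the children
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # addition routes the gradient to both children unchanged
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # multiplication scales the gradient by the other factor
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # sort the graph topologically, then apply the chain rule back to front
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b, c = Value(2.0), Value(-3.0), Value(10.0)
d = a * b + c    # forward pass
d.backward()     # backward pass
print(d.data, a.grad, b.grad, c.grad)   # 4.0 -3.0 2.0 1.0
```

The lecture builds essentially this, piece by piece, plus more operations and nicer ergonomics.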
07:58.200 --> 08:02.200 And that's all you need to understand neural network training. 08:02.200 --> 08:04.200 And everything else is just efficiency. 08:04.200 --> 08:06.200 And of course, there's a lot to efficiency. 08:06.200 --> 08:08.200 But fundamentally, that's all that's happening. 08:08.200 --> 08:10.200 Okay, so now let's dive right in 08:10.200 --> 08:12.200 and implement micrograd step by step. 08:12.200 --> 08:14.200 The first thing I'd like to do is I'd like to make sure 08:14.200 --> 08:16.200 that you have a very good understanding, intuitively, 08:16.200 --> 08:18.200 of what a derivative is 08:18.200 --> 08:20.200 and exactly what information it gives you. 08:20.200 --> 08:22.200 So let's start with some basic imports 08:22.200 --> 08:24.200 that I copy-paste in every Jupyter Notebook, always. 08:24.200 --> 08:28.200 And let's define a function, 08:28.200 --> 08:30.200 a scalar-valued function, 08:30.200 --> 08:32.200 f of x, as follows. 08:32.200 --> 08:34.200 So I just made this up randomly. 08:34.200 --> 08:36.200 I just wanted a scalar-valued function 08:36.200 --> 08:38.200 that takes a single scalar x 08:38.200 --> 08:40.200 and returns a single scalar y. 08:40.200 --> 08:42.200 And we can call this function, of course, 08:42.200 --> 08:44.200 so we can pass in, say, 3.0 08:44.200 --> 08:46.200 and get 20 back. 08:46.200 --> 08:48.200 Now, we can also plot this function 08:48.200 --> 08:50.200 to get a sense of its shape. 08:50.200 --> 08:52.200 You can tell from the mathematical expression 08:52.200 --> 08:54.200 that this is probably a parabola. 08:54.200 --> 08:56.200 It's a quadratic. 08:56.200 --> 08:58.200 So we can make a set of scalar values that we can feed in, 08:58.200 --> 09:00.200 using, for example, a range 09:00.200 --> 09:02.200 from negative 5 to 5 09:02.200 --> 09:04.200 in steps of 0.25.
09:04.200 --> 09:06.200 So x is just 09:06.200 --> 09:08.200 from negative 5 to 5, 09:08.200 --> 09:10.200 not including 5, 09:10.200 --> 09:12.200 in steps of 0.25. 09:12.200 --> 09:14.200 And we can actually call this function 09:14.200 --> 09:16.200 on this numpy array as well. 09:16.200 --> 09:18.200 So we get a set of y's 09:18.200 --> 09:20.200 if we call f on x. 09:20.200 --> 09:22.200 And these y's are basically the result 09:22.200 --> 09:24.200 of applying the function 09:24.200 --> 09:26.200 to every one of these elements independently. 09:26.200 --> 09:28.200 Let's plot this using Matplotlib. 09:28.200 --> 09:30.200 So plt.plot, x's and y's, 09:30.200 --> 09:32.200 and we get a nice parabola. 09:32.200 --> 09:34.200 So previously here we fed in 3.0 09:34.200 --> 09:36.200 somewhere here, and we received 09:36.200 --> 09:38.200 20 back, which is here 09:38.200 --> 09:40.200 the y coordinate. 09:40.200 --> 09:42.200 So now I'd like to think through 09:42.200 --> 09:44.200 what is the derivative of this function 09:44.200 --> 09:46.200 at any single input point x? 09:46.200 --> 09:48.200 So what is the derivative at different points x 09:48.200 --> 09:50.200 of this function? 09:50.200 --> 09:52.200 Now if you remember back to your calculus class 09:52.200 --> 09:54.200 you've probably derived derivatives. 09:54.200 --> 09:56.200 So you would take this mathematical expression, 09:56.200 --> 09:58.200 3x squared minus 4x plus 5, and you would write it out 09:58.200 --> 10:00.200 on a piece of paper and you would 10:00.200 --> 10:02.200 apply the product rule and all the other rules 10:02.200 --> 10:04.200 and derive the mathematical expression 10:04.200 --> 10:06.200 of the derivative of the original function. 10:06.200 --> 10:08.200 And then you could plug in different x's 10:08.200 --> 10:10.200 and see what the derivative is.
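The notebook uses numpy's arange and a matplotlib plot for this; a dependency-free sketch of the same function and grid of x values, for reference:

```python
# The quadratic from the lecture, with a plain Python list standing in
# for np.arange(-5, 5, 0.25) and the matplotlib plot.
def f(x):
    return 3*x**2 - 4*x + 5

xs = [-5 + 0.25*i for i in range(40)]   # -5.0, -4.75, ..., 4.75 (5 excluded)
ys = [f(x) for x in xs]                 # f applied to each element independently
print(f(3.0))   # prints: 20.0
```

Plotting xs against ys (e.g. with plt.plot(xs, ys)) gives the parabola described above.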
10:10.200 --> 10:12.200 We're not going to actually do that 10:12.200 --> 10:14.200 because no one in neural networks 10:14.200 --> 10:16.200 actually writes out the expression for the neural net. 10:16.200 --> 10:18.200 It would be a massive expression. 10:18.200 --> 10:20.200 It would be thousands, tens of thousands of terms. 10:20.200 --> 10:22.200 No one actually derives the derivative, 10:22.200 --> 10:24.200 of course. 10:24.200 --> 10:26.200 And so we're not going to take this kind of symbolic approach. 10:26.200 --> 10:28.200 Instead what I'd like to do is I'd like to look at the 10:28.200 --> 10:30.200 definition of derivative and just make sure 10:30.200 --> 10:32.200 that we really understand what the derivative is measuring, 10:32.200 --> 10:34.200 what it's telling you about the function. 10:34.200 --> 10:36.200 And so if we just look up 10:36.200 --> 10:38.200 derivative 10:42.200 --> 10:44.200 we see that 10:44.200 --> 10:46.200 this is not a very good definition of derivative, 10:46.200 --> 10:48.200 this is a definition of what it means to be differentiable, 10:48.200 --> 10:50.200 but if you remember from your calculus, 10:50.200 --> 10:52.200 it is the limit as h goes to 0 10:52.200 --> 10:54.200 of f of x plus h minus f of x 10:54.200 --> 10:56.200 over h. 10:56.200 --> 10:58.200 And basically what it's saying is, 10:58.200 --> 11:00.200 if at some point x 11:00.200 --> 11:02.200 that you're interested in, or a, 11:02.200 --> 11:04.200 you slightly bump it up, 11:04.200 --> 11:06.200 you slightly increase it by 11:06.200 --> 11:08.200 a small number h, 11:08.200 --> 11:10.200 how does the function respond? 11:10.200 --> 11:12.200 With what sensitivity does it respond? 11:12.200 --> 11:14.200 What is the slope at that point? 11:14.200 --> 11:16.200 Does the function go up or does it go down? 11:16.200 --> 11:18.200 And by how much?
11:18.200 --> 11:20.200 And that's the slope of that function, 11:20.200 --> 11:22.200 the slope of that response at that point. 11:22.200 --> 11:24.200 And so we can basically evaluate 11:24.200 --> 11:26.200 the derivative here numerically 11:26.200 --> 11:28.200 by taking a very small h. 11:28.200 --> 11:30.200 Of course the definition would ask us to take h to 0. 11:30.200 --> 11:32.200 We're just going to pick a very small h, 11:32.200 --> 11:34.200 0.001, 11:34.200 --> 11:36.200 and let's say we're interested in point 3.0, 11:36.200 --> 11:38.200 so we can look at f of x, which is of course 20, 11:38.200 --> 11:40.200 and now f of x plus h. 11:40.200 --> 11:42.200 So if we slightly nudge 11:42.200 --> 11:44.200 x in a positive direction, 11:44.200 --> 11:46.200 how is the function going to respond? 11:46.200 --> 11:48.200 And just looking at this, 11:48.200 --> 11:50.200 do you expect f of x plus h to be slightly greater 11:50.200 --> 11:52.200 than 20? 11:52.200 --> 11:54.200 Or do you expect it to be slightly lower than 20?
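The numerical estimate being set up here, as a short sketch (same f as in the lecture):

```python
# Numerical slope of f at x = 3: rise over run with a small h.
def f(x):
    return 3*x**2 - 4*x + 5

h = 0.001
x = 3.0
slope = (f(x + h) - f(x)) / h
print(slope)   # roughly 14, matching the analytic derivative 6x - 4 at x = 3
```

Note that because h is small but finite, the result is only an approximation; it approaches the exact derivative as h shrinks, up to floating point limits.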
11:54.200 --> 11:56.200 And so since 3 is here 11:56.200 --> 11:58.200 and this is 20, 11:58.200 --> 12:00.200 if we slightly go positively, 12:00.200 --> 12:02.200 the function will respond positively, 12:02.200 --> 12:04.200 so you'd expect this to be slightly greater than 20. 12:04.200 --> 12:06.200 And now by how much 12:06.200 --> 12:08.200 is telling you the 12:08.200 --> 12:10.200 strength of that slope, 12:10.200 --> 12:12.200 the size of that slope. 12:12.200 --> 12:14.200 So f of x plus h minus f of x, 12:14.200 --> 12:16.200 this is how much the function responded 12:16.200 --> 12:18.200 in a positive direction, 12:18.200 --> 12:20.200 and we have to normalize by the run, 12:20.200 --> 12:22.200 so we have the rise over run 12:22.200 --> 12:24.200 to get the slope. 12:24.200 --> 12:26.200 So this of course is just a numerical approximation 12:26.200 --> 12:28.200 of the slope, 12:28.200 --> 12:30.200 because we have to make h very very small 12:30.200 --> 12:32.200 to converge to the exact amount. 12:32.200 --> 12:34.200 Now if I use too many zeros, 12:34.200 --> 12:36.200 at some point 12:36.200 --> 12:38.200 I'm going to get an incorrect answer, 12:38.200 --> 12:40.200 because we're using floating point arithmetic 12:40.200 --> 12:42.200 and the representations of all these numbers 12:42.200 --> 12:44.200 in computer memory is finite, 12:44.200 --> 12:46.200 and at some point we get into trouble. 12:46.200 --> 12:48.200 So we can converge towards the right answer 12:48.200 --> 12:50.200 with this approach. 12:50.200 --> 12:52.200 But basically, 12:52.200 --> 12:54.200 at 3 the slope is 14, 12:54.200 --> 12:56.200 and you can see that by taking 12:56.200 --> 12:58.200 3x squared minus 4x plus 5 12:58.200 --> 13:00.200 and differentiating it in our head. 13:00.200 --> 13:02.200 So the derivative would be 13:02.200 --> 13:04.200 6x minus 4, 13:04.200 --> 13:06.200 and then we plug in x equals 3, 13:06.200 --> 13:08.200 so that's 18 minus 4 is 14, 13:08.200 --> 13:10.200 so this
is correct. 13:10.200 --> 13:12.200 So that's at 3. 13:12.200 --> 13:14.200 Now how about 13:14.200 --> 13:16.200 the slope at, say, negative 3? 13:16.200 --> 13:18.200 What would you expect 13:18.200 --> 13:20.200 for the slope? 13:20.200 --> 13:22.200 Now telling the exact value is really hard, 13:22.200 --> 13:24.200 but what is the sign of that slope? 13:24.200 --> 13:26.200 So at negative 3, 13:26.200 --> 13:28.200 if we slightly go in the positive direction 13:28.200 --> 13:30.200 at x, 13:30.200 --> 13:32.200 the function would actually go down, 13:32.200 --> 13:34.200 and so that tells you that the slope would be negative. 13:34.200 --> 13:36.200 So we'll get a number slightly below 13:36.200 --> 13:38.200 f of negative 3, 13:38.200 --> 13:40.200 and so if we take the slope, 13:40.200 --> 13:42.200 we expect something negative: 13:42.200 --> 13:44.200 negative 22. 13:44.200 --> 13:46.200 And at some point here, of course, 13:46.200 --> 13:48.200 the slope would be 0. 13:48.200 --> 13:50.200 Now for this specific function, 13:50.200 --> 13:52.200 I looked it up previously, 13:52.200 --> 13:54.200 and it's at the point 2 over 3. 13:54.200 --> 13:56.200 So at roughly 2 over 3, 13:56.200 --> 13:58.200 this derivative would be 0. 13:58.200 --> 14:00.200 So basically, 14:00.200 --> 14:06.200 at that precise point, 14:06.200 --> 14:08.200 if we nudge in a positive direction, 14:08.200 --> 14:10.200 the function doesn't respond, 14:10.200 --> 14:12.200 it stays the same, almost, 14:12.200 --> 14:14.200 and so that's why the slope is 0. 14:14.200 --> 14:16.200 Okay, now let's look at a bit more complex case. 14:16.200 --> 14:18.200 So we're going to start complexifying a bit. 14:18.200 --> 14:20.200 So now we have a function 14:20.200 --> 14:22.200 here 14:22.200 --> 14:24.200 with output variable d 14:24.200 --> 14:26.200 that is a function of 3 scalar inputs, 14:26.200 --> 14:28.200 so a, b and c are some specific values, 14:28.200 --> 14:30.200 3
inputs into our expression graph 14:30.200 --> 14:32.200 and a single output d. 14:32.200 --> 14:34.200 And so if we just print d, 14:34.200 --> 14:36.200 we get 4. 14:36.200 --> 14:38.200 And now what I'd like to do is 14:38.200 --> 14:40.200 I'd like to again look at the derivatives of d 14:40.200 --> 14:42.200 with respect to a, b and c, 14:42.200 --> 14:44.200 and think through 14:44.200 --> 14:46.200 again just the intuition of what this derivative 14:46.200 --> 14:48.200 is telling us. 14:48.200 --> 14:50.200 So in order to evaluate this derivative, 14:50.200 --> 14:52.200 we're going to get a bit hacky here. 14:52.200 --> 14:54.200 We're going to again have a very small 14:54.200 --> 14:56.200 value of h, and then we're going to 14:56.200 --> 14:58.200 fix the inputs at some 14:58.200 --> 15:00.200 values that we're interested in. 15:00.200 --> 15:04.200 So this is the point a, b, c at which we're going to be evaluating 15:04.200 --> 15:06.200 the derivative of d 15:06.200 --> 15:08.200 with respect to all of a, b and c 15:08.200 --> 15:10.200 at that point. 15:10.200 --> 15:12.200 So those are the inputs, and now we have d1 15:12.200 --> 15:14.200 as that expression, 15:14.200 --> 15:16.200 and then we're going to, for example, look at the derivative of d 15:16.200 --> 15:18.200 with respect to a. 15:18.200 --> 15:20.200 So we'll take a and we'll bump it by h, 15:20.200 --> 15:22.200 and then we'll get d2 to be the exact same 15:22.200 --> 15:24.200 function. 15:24.200 --> 15:26.200 And now we're going to print, 15:26.200 --> 15:28.200 you know, 15:28.200 --> 15:30.200 d1 is d1, 15:30.200 --> 15:32.200 d2 is d2, 15:32.200 --> 15:34.200 and print the slope. 15:34.200 --> 15:36.200 So the derivative, 15:36.200 --> 15:38.200 or slope here, 15:38.200 --> 15:40.200 will be of course 15:40.200 --> 15:42.200 d2 15:42.200 --> 15:44.200 minus d1 divided by h. 15:44.200 --> 15:46.200 So d2 minus d1 is how 15:46.200 --> 15:48.200 much the function increased
15:48.200 --> 15:50.200 when we bumped 15:50.200 --> 15:52.200 the specific 15:52.200 --> 15:54.200 input that we're interested in 15:54.200 --> 15:56.200 by a tiny amount, 15:56.200 --> 15:58.200 and this is then normalized by 15:58.200 --> 16:00.200 h to get the slope. 16:02.200 --> 16:10.200 So if I just run this, 16:10.200 --> 16:12.200 we're going to print d1, 16:12.200 --> 16:14.200 which we know is 16:14.200 --> 16:16.200 4. 16:16.200 --> 16:18.200 Now d2 will be bumped, 16:18.200 --> 16:20.200 a will be bumped by h. 16:20.200 --> 16:22.200 So let's just think through 16:22.200 --> 16:24.200 a little bit 16:24.200 --> 16:26.200 what d2 will be, 16:26.200 --> 16:28.200 printed out here in particular. 16:28.200 --> 16:30.200 d1 will be 4. 16:30.200 --> 16:32.200 Will d2 be 16:32.200 --> 16:34.200 a number slightly greater than 4 16:34.200 --> 16:36.200 or slightly lower than 4? 16:36.200 --> 16:38.200 And that's going to tell us the 16:38.200 --> 16:40.200 sign of the derivative. 16:40.200 --> 16:42.200 So 16:42.200 --> 16:44.200 we're bumping a by h, 16:44.200 --> 16:46.200 b is minus 3, 16:46.200 --> 16:48.200 c is 10. 16:48.200 --> 16:50.200 So you can just intuitively think through 16:50.200 --> 16:52.200 this derivative and what it's doing. 16:52.200 --> 16:54.200 a will be slightly more positive, 16:54.200 --> 16:56.200 but b is a negative 16:56.200 --> 16:58.200 number, so if a is 16:58.200 --> 17:00.200 slightly more positive, 17:00.200 --> 17:02.200 because b is negative 3, 17:02.200 --> 17:04.200 we're actually going to be 17:04.200 --> 17:06.200 adding less to 17:06.200 --> 17:08.200 d. 17:08.200 --> 17:10.200 So you'd actually expect that 17:10.200 --> 17:12.200 the value of the function will go 17:12.200 --> 17:14.200 down. So let's 17:14.200 --> 17:16.200 just see this. 17:16.200 --> 17:18.200 Yeah, and so we went from 4 17:18.200 --> 17:20.200 to 3.996, 17:20.200 --> 17:22.200 and
that tells you that the slope will 17:22.200 --> 17:24.200 be negative, 17:24.200 --> 17:26.200 it will be a negative number, 17:26.200 --> 17:28.200 because we went down. And 17:28.200 --> 17:30.200 the exact amount 17:30.200 --> 17:32.200 of slope 17:32.200 --> 17:34.200 is negative 3. And you can 17:34.200 --> 17:36.200 also convince yourself that negative 3 is the right 17:36.200 --> 17:38.200 answer mathematically and analytically, 17:38.200 --> 17:40.200 because if you have a times b plus 17:40.200 --> 17:42.200 c and, you know, you have 17:42.200 --> 17:44.200 calculus, then 17:44.200 --> 17:46.200 differentiating a times b plus c with 17:46.200 --> 17:48.200 respect to a gives you just b, 17:48.200 --> 17:50.200 and indeed the value of b 17:50.200 --> 17:52.200 is negative 3, which is the derivative that we have, 17:52.200 --> 17:54.200 so you can tell that that's correct. 17:54.200 --> 17:56.200 So now if we do this 17:56.200 --> 17:58.200 with b, so if we 17:58.200 --> 18:00.200 bump b by a little bit in a positive 18:00.200 --> 18:02.200 direction, we'd get a different 18:02.200 --> 18:04.200 slope. So what is the influence of b 18:04.200 --> 18:06.200 on the output d? 18:06.200 --> 18:08.200 So if we bump b by a tiny amount 18:08.200 --> 18:10.200 in a positive direction, then because a 18:10.200 --> 18:12.200 is positive, we'll be 18:12.200 --> 18:14.200 adding more to d, right? 18:14.200 --> 18:16.200 So and now, 18:16.200 --> 18:18.200 what is the sensitivity, what is the 18:18.200 --> 18:20.200 slope of that addition? And 18:20.200 --> 18:22.200 it might not surprise you that this should be 18:22.200 --> 18:24.200 2. 18:24.200 --> 18:26.200 And why is it 2? Because 18:26.200 --> 18:28.200 d of d by db, 18:28.200 --> 18:30.200 differentiating with respect to b, 18:30.200 --> 18:32.200 would give us a, and 18:32.200 --> 18:34.200 the value of a is 2, so that's also 18:34.200 --> 18:36.200 working well. And then if c
18:36.200 --> 18:38.200 gets bumped by a tiny amount 18:38.200 --> 18:40.200 h, then 18:40.200 --> 18:42.200 of course a times b is unaffected, and 18:42.200 --> 18:44.200 now c becomes slightly higher. 18:44.200 --> 18:46.200 What does that do to the function? It 18:46.200 --> 18:48.200 makes it slightly higher, because we're simply adding 18:48.200 --> 18:50.200 c, and it makes it slightly 18:50.200 --> 18:52.200 higher by the exact same amount that we 18:52.200 --> 18:54.200 added to c, and so that tells you 18:54.200 --> 18:56.200 that the slope is 1. 18:56.200 --> 18:58.200 That will be the 18:58.200 --> 19:00.200 rate at which 19:00.200 --> 19:02.200 d will increase 19:02.200 --> 19:04.200 as we scale 19:04.200 --> 19:06.200 c. Okay, so we now have some 19:06.200 --> 19:08.200 intuitive sense of what this derivative is telling you 19:08.200 --> 19:10.200 about the function, and we'd like to move to 19:10.200 --> 19:12.200 neural networks. Now as I mentioned, neural networks 19:12.200 --> 19:14.200 will be pretty massive mathematical 19:14.200 --> 19:16.200 expressions, so we need some data structures 19:16.200 --> 19:18.200 that maintain these expressions, and that's what 19:18.200 --> 19:20.200 we're going to start to build out now. 19:20.200 --> 19:22.200 So we're going to 19:22.200 --> 19:24.200 build out this value object that I 19:24.200 --> 19:26.200 showed you in the readme page 19:26.200 --> 19:28.200 of micrograd. So let me 19:28.200 --> 19:30.200 copy paste a skeleton 19:30.200 --> 19:32.200 of the first very simple value 19:32.200 --> 19:34.200 object. So class 19:34.200 --> 19:36.200 Value takes a single 19:36.200 --> 19:38.200 scalar value that it wraps and 19:38.200 --> 19:40.200 keeps track of, and that's 19:40.200 --> 19:42.200 it. So we can for example 19:42.200 --> 19:44.200 do Value of 2.0, and then we can 19:44.200 --> 19:48.200 look at its content, and 19:48.200 --> 19:50.200 Python will
internally 19:50.200 --> 19:52.200 use the repr function 19:52.200 --> 19:54.200 to return 19:54.200 --> 19:56.200 this string, 19:56.200 --> 19:58.200 like that. 19:58.200 --> 20:00.200 So this is a value object with 20:00.200 --> 20:02.200 data equals two that we're creating 20:02.200 --> 20:04.200 here. Now what we'd like to do is 20:04.200 --> 20:06.200 we'd like to be able to 20:06.200 --> 20:08.200 have not just, like, 20:08.200 --> 20:10.200 two values, but 20:10.200 --> 20:12.200 we'd like to do a plus b, right? We'd like 20:12.200 --> 20:14.200 to add them. So currently 20:14.200 --> 20:16.200 you would get an error, because Python 20:16.200 --> 20:18.200 doesn't know how to add two value 20:18.200 --> 20:20.200 objects, so we have to tell it. 20:20.200 --> 20:22.200 So here's 20:22.200 --> 20:24.200 addition. 20:24.200 --> 20:26.200 So 20:26.200 --> 20:28.200 you have to basically use these special 20:28.200 --> 20:30.200 double underscore methods in Python to 20:30.200 --> 20:32.200 define these operators for these 20:32.200 --> 20:34.200 objects. So if we 20:34.200 --> 20:38.200 use this plus 20:38.200 --> 20:40.200 operator, Python will internally 20:40.200 --> 20:42.200 call a dot 20:42.200 --> 20:44.200 add of b. That's 20:44.200 --> 20:46.200 what will happen internally, and so 20:46.200 --> 20:48.200 b will be the other 20:48.200 --> 20:50.200 and self will be 20:50.200 --> 20:52.200 a. And so we see that what we're going 20:52.200 --> 20:54.200 to return is a new value object, and 20:54.200 --> 20:56.200 it's going to be wrapping 20:56.200 --> 20:58.200 the plus of 20:58.200 --> 21:00.200 their data. But remember, 21:00.200 --> 21:02.200 because data is the actual, 21:02.200 --> 21:04.200 like, Python number, 21:04.200 --> 21:06.200 this operator here is just the 21:06.200 --> 21:08.200 typical floating point plus 21:08.200 --> 21:10.200 addition. It's not an addition of value 21:10.200 -->
21:12.200 objects and we'll return 21:12.200 --> 21:14.200 a new value so now a 21:14.200 --> 21:16.200 plus b should work and it should print value 21:16.200 --> 21:18.200 of negative one 21:18.200 --> 21:20.200 because that's two plus minus three 21:20.200 --> 21:22.200 there we go okay let's 21:22.200 --> 21:24.200 now implement multiply 21:24.200 --> 21:26.200 just so we can recreate this expression here 21:26.200 --> 21:28.200 so multiply i think it won't 21:28.200 --> 21:30.200 surprise you will be fairly similar 21:30.200 --> 21:32.200 so instead 21:32.200 --> 21:34.200 of add we're going to be using mul 21:34.200 --> 21:36.200 and then here of course we want to do times 21:36.200 --> 21:38.200 and so now we can create a 21:38.200 --> 21:40.200 c value object which will be 10.0 21:40.200 --> 21:42.200 and now we should be able to do 21:42.200 --> 21:44.200 a times b 21:44.200 --> 21:46.200 well let's just do a times b first 21:46.200 --> 21:48.200 um 21:48.200 --> 21:50.200 that's value of negative six now 21:50.200 --> 21:52.200 and by the way i skipped over this a little 21:52.200 --> 21:54.200 bit suppose that i didn't have the repr 21:54.200 --> 21:56.200 function here then 21:56.200 --> 21:58.200 you'll get some kind of an ugly expression 21:58.200 --> 22:00.200 so what repr is doing 22:00.200 --> 22:02.200 is it's providing us a way to 22:02.200 --> 22:04.200 print out a nicer looking expression in 22:04.200 --> 22:06.200 python so we 22:06.200 --> 22:08.200 don't just have something cryptic we 22:08.200 --> 22:10.200 actually see that it's value of 22:10.200 --> 22:12.200 negative six 22:12.200 --> 22:14.200 so this gives us a times b 22:14.200 --> 22:16.200 and then this we should now be able 22:16.200 --> 22:18.200 to add c to it because we've defined and 22:18.200 --> 22:20.200 told python how to do mul and add 22:20.200 --> 22:22.200 and so this will 22:22.200 --> 22:24.200 basically be equivalent to
a dot 22:24.200 --> 22:26.200 mul 22:26.200 --> 22:28.200 of b and then 22:28.200 --> 22:30.200 this new value object will be dot 22:30.200 --> 22:32.200 add of c 22:32.200 --> 22:34.200 and so let's see if that worked 22:34.200 --> 22:36.200 yep so that worked well that gave 22:36.200 --> 22:38.200 us four which is what we expect from before 22:38.200 --> 22:40.200 and i 22:40.200 --> 22:42.200 believe we can just call them manually as well 22:42.200 --> 22:44.200 there we go so 22:44.200 --> 22:46.200 yeah okay so now what we are 22:46.200 --> 22:48.200 missing is the connective tissue of this 22:48.200 --> 22:50.200 expression as i mentioned we want to keep 22:50.200 --> 22:52.200 these expression graphs so we need to 22:52.200 --> 22:54.200 know and keep pointers about 22:54.200 --> 22:56.200 what values produce what other values 22:56.200 --> 22:58.200 so here for example we are 22:58.200 --> 23:00.200 going to introduce a new variable which 23:00.200 --> 23:02.200 we'll call children and by default it 23:02.200 --> 23:04.200 will be an empty tuple and then we're 23:04.200 --> 23:06.200 actually going to keep a slightly 23:06.200 --> 23:08.200 different variable in the class which 23:08.200 --> 23:10.200 we'll call underscore prev which will be 23:10.200 --> 23:12.200 the set of children 23:12.200 --> 23:14.200 this is how i did it in the 23:14.200 --> 23:16.200 original micrograd looking at my code 23:16.200 --> 23:18.200 here i can't remember exactly the reason 23:18.200 --> 23:20.200 i believe it was efficiency but this 23:20.200 --> 23:22.200 underscore children will be a tuple for 23:22.200 --> 23:24.200 convenience but then when we actually 23:24.200 --> 23:26.200 maintain it in the class it will be just 23:26.200 --> 23:28.200 a set 23:28.200 --> 23:30.200 so now when 23:30.200 --> 23:32.200 we are creating a value like this with a 23:32.200 --> 23:34.200 constructor children will be empty and 23:34.200 --> 23:36.200 prev will be the
empty set but when we 23:36.200 --> 23:38.200 are creating a value through addition or 23:38.200 --> 23:40.200 multiplication we're going to feed in 23:40.200 --> 23:42.200 the children of this 23:42.200 --> 23:44.200 value which in this case is self 23:44.200 --> 23:46.200 and other 23:46.200 --> 23:48.200 so those are the children 23:48.200 --> 23:50.200 here 23:50.200 --> 23:52.200 so now we can do d dot 23:52.200 --> 23:54.200 prev and we'll see that 23:54.200 --> 23:56.200 the children of d we 23:56.200 --> 23:58.200 now know are this value of 23:58.200 --> 24:00.200 negative six and value of ten 24:00.200 --> 24:02.200 and this of course is the value resulting 24:02.200 --> 24:04.200 from a times b and the 24:04.200 --> 24:06.200 c value which is ten 24:06.200 --> 24:08.200 now the last piece of information 24:08.200 --> 24:10.200 we don't know is we know now the 24:10.200 --> 24:12.200 children of every single value but we don't know 24:12.200 --> 24:14.200 what operation created this value 24:14.200 --> 24:16.200 so we need one more element 24:16.200 --> 24:18.200 here let's call it underscore op 24:18.200 --> 24:20.200 and by default this 24:20.200 --> 24:22.200 is the empty string for leaves 24:22.200 --> 24:24.200 and then we'll just maintain it here 24:24.200 --> 24:26.200 and now the 24:26.200 --> 24:28.200 operation will be just a simple string 24:28.200 --> 24:30.200 and in the case of addition it's 24:30.200 --> 24:32.200 plus in the case of multiplication 24:32.200 --> 24:34.200 it's times so 24:34.200 --> 24:36.200 now we not just have d dot 24:36.200 --> 24:38.200 prev we also have d dot op 24:38.200 --> 24:40.200 and we know that d was produced by 24:40.200 --> 24:42.200 an addition of those two values 24:42.200 --> 24:44.200 and so now we have the full 24:44.200 --> 24:46.200 mathematical expression and we're 24:46.200 --> 24:48.200 building out this data structure and we know exactly 24:48.200 --> 24:50.200 how each value came to be
24:50.200 --> 24:52.200 by what expression and from what other values 24:54.200 --> 24:56.200 now because these expressions are about 24:56.200 --> 24:58.200 to get quite a bit larger we'd like a 24:58.200 --> 25:00.200 way to nicely visualize 25:00.200 --> 25:02.200 these expressions that we're building out 25:02.200 --> 25:04.200 so for that i'm going to copy paste a bunch of 25:04.200 --> 25:06.200 slightly scary code that's 25:06.200 --> 25:08.200 going to visualize these 25:08.200 --> 25:10.200 expression graphs for us so here's the 25:10.200 --> 25:12.200 code and i'll explain it in a bit 25:12.200 --> 25:14.200 but first let me just show you what this code does 25:14.200 --> 25:16.200 basically what it does is it creates 25:16.200 --> 25:18.200 a new function draw dot 25:18.200 --> 25:20.200 that we can call on some root node 25:20.200 --> 25:22.200 and then it's going to visualize it 25:22.200 --> 25:24.200 so if we call draw dot on d 25:24.200 --> 25:26.200 which is this final value here 25:26.200 --> 25:28.200 that is a times b plus c 25:28.200 --> 25:30.200 it creates 25:30.200 --> 25:32.200 something like this so this is d 25:32.200 --> 25:34.200 and you see that this is a times b 25:34.200 --> 25:36.200 creating an intermediate value 25:36.200 --> 25:38.200 plus c gives us this output 25:38.200 --> 25:40.200 node d 25:40.200 --> 25:42.200 so that's draw dot of d 25:42.200 --> 25:44.200 and i'm not going to go through this 25:44.200 --> 25:46.200 in complete detail you can take a look at 25:46.200 --> 25:48.200 graphviz and its api 25:48.200 --> 25:50.200 graphviz is an open source graph visualization 25:50.200 --> 25:52.200 software and what we're doing here 25:52.200 --> 25:54.200 is we're building out this graph in the graphviz 25:54.200 --> 25:56.200 api and 25:56.200 --> 25:58.200 you can basically see that 25:58.200 --> 26:00.200 trace is this helper function that 26:00.200 --> 26:02.200 enumerates all the nodes and edges in the graph 26:02.200 -->
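Putting together everything described so far, the wrapped scalar, the double underscore add and mul methods, the underscore prev set of children, and the underscore op string, a minimal sketch of the Value class at this point in the lecture might look like this (a rough reconstruction for illustration, not the final micrograd code):

```python
class Value:
    """Wraps a single scalar and records the expression graph that produced it."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # the Values this one was computed from
        self._op = _op               # which operation produced this Value ('' for leaves)

    def __repr__(self):
        # nicer printout than the default <__main__.Value object at 0x...>
        return f"Value(data={self.data})"

    def __add__(self, other):
        # a + b calls a.__add__(b): self is a, other is b
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c   # equivalent to (a.__mul__(b)).__add__(c)
print(d)        # Value(data=4.0)
print(d._op)    # +
```

Here d._prev holds the two values that produced d (the a times b result and c), and d._op records that it came from an addition.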
26:04.200 so that just builds a set of all 26:04.200 --> 26:06.200 the nodes and edges and then we iterate through 26:06.200 --> 26:08.200 all the nodes and we create special node 26:08.200 --> 26:10.200 objects for them in 26:10.200 --> 26:12.200 using dot 26:12.200 --> 26:14.200 node and then we also 26:14.200 --> 26:16.200 create edges using dot dot edge 26:16.200 --> 26:18.200 and the only thing that's like slightly 26:18.200 --> 26:20.200 tricky here is you'll notice that i 26:20.200 --> 26:22.200 basically add these fake nodes 26:22.200 --> 26:24.200 which are these operation nodes 26:24.200 --> 26:26.200 so for example this node here is just 26:26.200 --> 26:28.200 like a plus node and 26:28.200 --> 26:30.200 i create these 26:30.200 --> 26:32.200 special 26:32.200 --> 26:34.200 op nodes here 26:34.200 --> 26:36.200 and i connect them accordingly 26:36.200 --> 26:38.200 so these nodes of course 26:38.200 --> 26:40.200 are not actual nodes 26:40.200 --> 26:42.200 in the original graph they're not 26:42.200 --> 26:44.200 actually a value object the only 26:44.200 --> 26:46.200 value objects here are the things 26:46.200 --> 26:48.200 in squares those are actual value 26:48.200 --> 26:50.200 objects or representations thereof 26:50.200 --> 26:52.200 and these op nodes are just created in 26:52.200 --> 26:54.200 this draw dot routine so that 26:54.200 --> 26:56.200 it looks nice let's also 26:56.200 --> 26:58.200 add labels to these graphs just so we 26:58.200 --> 27:00.200 know what variables are where 27:00.200 --> 27:02.200 so let's create a special underscore 27:02.200 --> 27:04.200 label 27:04.200 --> 27:06.200 or let's just do label equals 27:06.200 --> 27:08.200 empty by default and save it 27:08.200 --> 27:10.200 in each node 27:10.200 --> 27:12.200 and then here 27:12.200 --> 27:14.200 we're going to do label is a 27:14.200 --> 27:16.200 label is b 27:16.200 --> 27:18.200 label is c 27:22.200 --> 27:24.200 and then 27:24.200 --> 27:26.200 let's create a 
special 27:26.200 --> 27:28.200 e equals 27:28.200 --> 27:30.200 a times b 27:30.200 --> 27:32.200 and e dot label will 27:32.200 --> 27:34.200 be e 27:34.200 --> 27:36.200 it's kind of naughty and d 27:36.200 --> 27:38.200 will be e plus c 27:38.200 --> 27:40.200 and d dot label will be 27:40.200 --> 27:42.200 d 27:42.200 --> 27:44.200 okay so nothing really changes i just 27:44.200 --> 27:46.200 added this new 27:46.200 --> 27:48.200 e variable 27:48.200 --> 27:50.200 and then here when we are 27:50.200 --> 27:52.200 printing this i'm going 27:52.200 --> 27:54.200 to print the label here 27:54.200 --> 27:56.200 so this will be a percent s 27:56.200 --> 27:58.200 bar and this will be n dot 27:58.200 --> 28:00.200 label 28:00.200 --> 28:02.200 and so now 28:02.200 --> 28:04.200 we have the label 28:04.200 --> 28:06.200 on the left here so it says a times b 28:06.200 --> 28:08.200 creating e and then e plus c creates 28:08.200 --> 28:10.200 d just like we have it 28:10.200 --> 28:12.200 here and finally let's make this 28:12.200 --> 28:14.200 expression just one layer deeper 28:14.200 --> 28:16.200 so d will not be the final output 28:16.200 --> 28:18.200 node instead 28:18.200 --> 28:20.200 after d we are going to create a 28:20.200 --> 28:22.200 new value object called 28:22.200 --> 28:24.200 f we're going to start running out of 28:24.200 --> 28:26.200 variables soon f will be negative two 28:26.200 --> 28:28.200 point zero and its label 28:28.200 --> 28:30.200 will of course just be f 28:30.200 --> 28:32.200 and then l 28:32.200 --> 28:34.200 capital l will be the output 28:34.200 --> 28:36.200 of our graph and l will be 28:36.200 --> 28:38.200 d times f 28:38.200 --> 28:40.200 okay so l will be negative eight 28:40.200 --> 28:42.200 is the output 28:42.200 --> 28:44.200 so 28:44.200 --> 28:46.200 now we don't just draw 28:46.200 --> 28:48.200 d we draw l 28:50.200 --> 28:52.200 okay 28:52.200 --> 28:54.200 and somehow the label of 28:54.200
--> 28:56.200 l is undefined oops 28:56.200 --> 28:58.200 the label has to be explicitly 28:58.200 --> 29:00.200 given to it 29:00.200 --> 29:02.200 there we go so l is the output 29:02.200 --> 29:04.200 so let's quickly recap what we've done so far 29:04.200 --> 29:06.200 we are able to build out mathematical 29:06.200 --> 29:08.200 expressions using only plus and times 29:08.200 --> 29:10.200 so far they are scalar 29:10.200 --> 29:12.200 valued along the way and we can 29:12.200 --> 29:14.200 do this forward pass 29:14.200 --> 29:16.200 and build out a mathematical expression 29:16.200 --> 29:18.200 so we have multiple inputs here 29:18.200 --> 29:20.200 a b c and f going into 29:20.200 --> 29:22.200 a mathematical expression that produces 29:22.200 --> 29:24.200 a single output l 29:24.200 --> 29:26.200 and this here is visualizing the 29:26.200 --> 29:28.200 forward pass so the output of the 29:28.200 --> 29:30.200 forward pass is negative eight 29:30.200 --> 29:32.200 that's the value now 29:32.200 --> 29:34.200 what we'd like to do next is we'd like to run 29:34.200 --> 29:36.200 back propagation and in back 29:36.200 --> 29:38.200 propagation we are going to start here at the end 29:38.200 --> 29:40.200 and we're going to reverse 29:40.200 --> 29:42.200 and calculate the gradient 29:42.200 --> 29:44.200 along all these intermediate 29:44.200 --> 29:46.200 values and really what we're 29:46.200 --> 29:48.200 computing for every single value here 29:48.200 --> 29:50.200 um we're going to compute 29:50.200 --> 29:52.200 the derivative of that node 29:52.200 --> 29:54.200 with respect to 29:54.200 --> 29:56.200 l so 29:56.200 --> 29:58.200 the derivative of l with respect to l 29:58.200 --> 30:00.200 is just one 30:00.200 --> 30:02.200 and then we're going to derive what is the 30:02.200 --> 30:04.200 derivative of l with respect to f with 30:04.200 --> 30:06.200 respect to d with respect to c 30:06.200 --> 30:08.200 with respect to e with respect 30:08.200 --> 
30:10.200 to b and with respect to a 30:10.200 --> 30:12.200 and in a neural network setting you'd 30:12.200 --> 30:14.200 be very interested in the derivative of basically 30:14.200 --> 30:16.200 this loss function l 30:16.200 --> 30:18.200 with respect to the weights of 30:18.200 --> 30:20.200 a neural network and here of course 30:20.200 --> 30:22.200 we have just these variables a b c and f 30:22.200 --> 30:24.200 but some of these will eventually represent 30:24.200 --> 30:26.200 the weights of a neural net and so 30:26.200 --> 30:28.200 we'll need to know how those weights are impacting 30:28.200 --> 30:30.200 the loss function 30:30.200 --> 30:32.200 so we'll be interested basically in the derivative of 30:32.200 --> 30:34.200 the output with respect to some of its 30:34.200 --> 30:36.200 leaf nodes and those leaf nodes will 30:36.200 --> 30:38.200 be the weights of the neural net 30:38.200 --> 30:40.200 and the other leaf nodes of course will be the data 30:40.200 --> 30:42.200 itself but usually we will not want 30:42.200 --> 30:44.200 or use the derivative of the 30:44.200 --> 30:46.200 loss function with respect to data because 30:46.200 --> 30:48.200 the data is fixed but the weights 30:48.200 --> 30:50.200 will be iterated on 30:50.200 --> 30:52.200 using the gradient information 30:52.200 --> 30:54.200 so next we are going to create a variable inside 30:54.200 --> 30:56.200 the value class that maintains 30:56.200 --> 30:58.200 the derivative of 30:58.200 --> 31:00.200 l with respect to that value 31:00.200 --> 31:02.200 and we will call this variable 31:02.200 --> 31:04.200 grad so there 31:04.200 --> 31:06.200 is a self dot data and there is a self dot grad 31:06.200 --> 31:08.200 and initially 31:08.200 --> 31:10.200 it will be zero and remember that 31:10.200 --> 31:12.200 zero basically means no 31:12.200 --> 31:14.200 effect so at initialization 31:14.200 --> 31:16.200 we are assuming that every value does not 31:16.200 --> 31:18.200 impact does not
affect the 31:18.200 --> 31:20.200 output right because 31:20.200 --> 31:22.200 if the gradient is zero that means that changing 31:22.200 --> 31:24.200 this variable is not changing the 31:24.200 --> 31:26.200 loss function so by 31:26.200 --> 31:28.200 default we assume that the gradient is zero 31:28.200 --> 31:30.200 and then 31:30.200 --> 31:32.200 now that we have grad 31:32.200 --> 31:34.200 and it's zero point zero 31:36.200 --> 31:38.200 we are going to be able to visualize 31:38.200 --> 31:40.200 it here after data so here 31:40.200 --> 31:42.200 grad is percent point four f 31:42.200 --> 31:44.200 and this will be n dot grad 31:44.200 --> 31:46.200 and now 31:46.200 --> 31:48.200 we are going to be showing both the data 31:48.200 --> 31:50.200 and the grad 31:50.200 --> 31:52.200 initialized at zero 31:52.200 --> 31:54.200 and we are 31:54.200 --> 31:56.200 just about getting ready to calculate the 31:56.200 --> 31:58.200 back propagation and of course this 31:58.200 --> 32:00.200 grad again as i mentioned is representing 32:00.200 --> 32:02.200 the derivative of the output in 32:02.200 --> 32:04.200 this case l with respect to this 32:04.200 --> 32:06.200 value so 32:06.200 --> 32:08.200 this is the derivative of l with respect to 32:08.200 --> 32:10.200 f with respect to d and so on 32:10.200 --> 32:12.200 so let's now fill in those gradients 32:12.200 --> 32:14.200 and actually do back propagation manually 32:14.200 --> 32:16.200 so let's start filling in these gradients and 32:16.200 --> 32:18.200 start all the way at the end as i mentioned here 32:18.200 --> 32:20.200 first we are interested to fill in this 32:20.200 --> 32:22.200 gradient here so 32:22.200 --> 32:24.200 what is the derivative of l with respect to 32:24.200 --> 32:26.200 l in other words if i change 32:26.200 --> 32:28.200 l by a tiny amount h 32:28.200 --> 32:30.200 how much does 32:30.200 --> 32:32.200 l change 32:32.200 --> 32:34.200 it changes by h so 32:34.200 -->
32:36.200 it's proportional and therefore the derivative will be 32:36.200 --> 32:38.200 one we can of course 32:38.200 --> 32:40.200 measure or estimate these 32:40.200 --> 32:42.200 gradients numerically just like 32:42.200 --> 32:44.200 we've seen before so if i take this 32:44.200 --> 32:46.200 expression and i create a 32:46.200 --> 32:48.200 def lol function here 32:48.200 --> 32:50.200 and put this here 32:50.200 --> 32:52.200 now the reason i'm creating a function 32:52.200 --> 32:54.200 lol here is because i don't want 32:54.200 --> 32:56.200 to pollute or mess up the global scope 32:56.200 --> 32:58.200 here this is just kind of like a little staging 32:58.200 --> 33:00.200 area and as you know in python all of these 33:00.200 --> 33:02.200 will be local variables to this function 33:02.200 --> 33:04.200 so i'm not changing any of the 33:04.200 --> 33:06.200 global scope here so here 33:06.200 --> 33:08.200 l1 will be l 33:10.200 --> 33:12.200 and then copy pasting this expression 33:12.200 --> 33:14.200 we're going to add a small 33:14.200 --> 33:16.200 amount h 33:16.200 --> 33:18.200 in 33:18.200 --> 33:20.200 for example a 33:20.200 --> 33:22.200 right and this would be measuring 33:22.200 --> 33:24.200 the derivative of l with respect 33:24.200 --> 33:26.200 to a so here 33:26.200 --> 33:28.200 this will be l2 33:28.200 --> 33:30.200 and then we want to print the test derivative 33:30.200 --> 33:32.200 so print l2 minus 33:32.200 --> 33:34.200 l1 which is how much l 33:34.200 --> 33:36.200 changed and then normalize it 33:36.200 --> 33:38.200 by h so this is the rise 33:38.200 --> 33:40.200 over run and we have to be 33:40.200 --> 33:42.200 careful because l is a value node 33:42.200 --> 33:44.200 so we actually want its data 33:46.200 --> 33:48.200 so that these are floats dividing 33:48.200 --> 33:50.200 by h and this should print 33:50.200 --> 33:52.200 the derivative of l with respect to a 33:52.200 --> 33:54.200 because a is
the one that we bumped a 33:54.200 --> 33:56.200 little bit by h so what is 33:56.200 --> 33:58.200 the derivative of l with respect 33:58.200 --> 34:00.200 to a it's six 34:00.200 --> 34:02.200 okay and obviously 34:02.200 --> 34:04.200 if we change 34:04.200 --> 34:06.200 l by h 34:06.200 --> 34:08.200 then that would be 34:08.200 --> 34:10.200 here 34:10.200 --> 34:12.200 effectively 34:12.200 --> 34:14.200 this looks really awkward but 34:14.200 --> 34:16.200 changing l by h 34:16.200 --> 34:18.200 you see the derivative here is one 34:20.200 --> 34:22.200 that's kind of like the base case 34:22.200 --> 34:24.200 of what we are doing here 34:24.200 --> 34:26.200 so basically we can come up here 34:26.200 --> 34:28.200 and we can manually set 34:28.200 --> 34:30.200 l.grad to one this is our 34:30.200 --> 34:32.200 manual backpropagation 34:32.200 --> 34:34.200 l.grad is one and let's redraw 34:34.200 --> 34:36.200 and we'll see 34:36.200 --> 34:38.200 that we filled in grad is one 34:38.200 --> 34:40.200 for l we're now going to continue 34:40.200 --> 34:42.200 the backpropagation so let's here look at 34:42.200 --> 34:44.200 the derivatives of l with respect to 34:44.200 --> 34:46.200 d and f let's do 34:46.200 --> 34:48.200 d first so what 34:48.200 --> 34:50.200 we are interested in if i create a markdown node 34:50.200 --> 34:52.200 here is we'd like to know 34:52.200 --> 34:54.200 basically we have that l is d times f 34:54.200 --> 34:56.200 and we'd like to know what is 34:56.200 --> 34:58.200 d l by 34:58.200 --> 35:00.200 d d 35:00.200 --> 35:02.200 what is that and if you know 35:02.200 --> 35:04.200 your calculus l is d times f 35:04.200 --> 35:06.200 so what is d l by d d 35:06.200 --> 35:08.200 it would be f 35:08.200 --> 35:10.200 and if you don't believe me we can also 35:10.200 --> 35:12.200 just derive it because the proof would be 35:12.200 --> 35:14.200 fairly straightforward we go 35:14.200 --> 35:16.200 to the definition 35:16.200 -->
35:18.200 of the derivative which is 35:18.200 --> 35:20.200 f of x plus h minus f of x 35:20.200 --> 35:22.200 divide h 35:22.200 --> 35:24.200 as a limit of h goes to zero 35:24.200 --> 35:26.200 of this kind of expression so 35:26.200 --> 35:28.200 when we have l is d times f 35:28.200 --> 35:30.200 then increasing 35:30.200 --> 35:32.200 d by h would give us 35:32.200 --> 35:34.200 the output of d plus h times 35:34.200 --> 35:36.200 f that's 35:36.200 --> 35:38.200 basically f of x plus h right 35:38.200 --> 35:40.200 minus d times 35:40.200 --> 35:42.200 f 35:42.200 --> 35:44.200 and then divide h and 35:44.200 --> 35:46.200 symbolically expanding out here we 35:46.200 --> 35:48.200 would have basically d times f 35:48.200 --> 35:50.200 plus h times f minus 35:50.200 --> 35:52.200 d times f divide h 35:52.200 --> 35:54.200 and then you see how the df minus 35:54.200 --> 35:56.200 df cancels so you're left with h times 35:56.200 --> 35:58.200 f divide h 35:58.200 --> 36:00.200 which is f so 36:00.200 --> 36:02.200 in the limit as h goes to zero 36:02.200 --> 36:04.200 of you know 36:04.200 --> 36:06.200 derivative 36:06.200 --> 36:08.200 definition we just 36:08.200 --> 36:10.200 get f in the case of 36:10.200 --> 36:12.200 d times f 36:12.200 --> 36:14.200 so symmetrically 36:14.200 --> 36:16.200 d l by d f 36:16.200 --> 36:18.200 will just be d 36:18.200 --> 36:20.200 so what we have is that 36:20.200 --> 36:22.200 f dot grad we see now 36:22.200 --> 36:24.200 is just the value of d 36:24.200 --> 36:26.200 which is four 36:26.200 --> 36:30.200 and we see that 36:30.200 --> 36:32.200 d dot grad is just 36:32.200 --> 36:34.200 the value of f 36:36.200 --> 36:38.200 and so the value of f 36:38.200 --> 36:40.200 is negative two 36:40.200 --> 36:42.200 so we'll set those 36:42.200 --> 36:44.200 manually 36:44.200 --> 36:46.200 let me erase this markdown 36:46.200 --> 36:48.200 node and then let's redraw what we 36:48.200 --> 36:50.200 have 36:50.200 --> 36:52.200 
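Written out symbolically, the derivation just spoken aloud, for l equals d times f, is:

```latex
\frac{dL}{dd}
  = \lim_{h \to 0} \frac{(d+h)\,f - d\,f}{h}
  = \lim_{h \to 0} \frac{d f + h f - d f}{h}
  = \lim_{h \to 0} \frac{h f}{h}
  = f,
\qquad \text{and by symmetry} \qquad
\frac{dL}{df} = d.
```

Plugging in the values from the graph, dL/dd = f = -2 and dL/df = d = 4, which is exactly what gets set into d.grad and f.grad.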
okay and let's 36:52.200 --> 36:54.200 just make sure that these were correct 36:54.200 --> 36:56.200 so we seem to think that 36:56.200 --> 36:58.200 d l by d d is negative two so let's 36:58.200 --> 37:00.200 double check 37:00.200 --> 37:02.200 let me erase this plus h from before 37:02.200 --> 37:04.200 and now we want the derivative with respect to f 37:04.200 --> 37:06.200 so let's just come here 37:06.200 --> 37:08.200 when i create f and let's do a plus h here 37:08.200 --> 37:10.200 and this should print a derivative of 37:10.200 --> 37:12.200 l with respect to f so we expect 37:12.200 --> 37:14.200 to see four 37:14.200 --> 37:16.200 yeah and this is four up to 37:16.200 --> 37:18.200 floating point funkiness 37:18.200 --> 37:20.200 and then d l 37:20.200 --> 37:22.200 by d d should be 37:22.200 --> 37:24.200 f which is negative two 37:24.200 --> 37:26.200 grad is negative two 37:26.200 --> 37:28.200 so if we again 37:28.200 --> 37:30.200 come here and we change d 37:30.200 --> 37:32.200 d dot 37:32.200 --> 37:34.200 data plus equals h right 37:34.200 --> 37:36.200 here so we expect 37:36.200 --> 37:38.200 so we've added a little h and then we see 37:38.200 --> 37:40.200 how l changed and we 37:40.200 --> 37:42.200 expect to print 37:42.200 --> 37:44.200 negative two 37:44.200 --> 37:46.200 there we go 37:46.200 --> 37:48.200 so we've numerically 37:48.200 --> 37:50.200 verified what we're doing here is 37:50.200 --> 37:52.200 kind of like an inline gradient check 37:52.200 --> 37:54.200 gradient check is when we 37:54.200 --> 37:56.200 are deriving this like back propagation 37:56.200 --> 37:58.200 and getting the derivative with respect to all the 37:58.200 --> 38:00.200 intermediate results and 38:00.200 --> 38:02.200 then numerical gradient is just you know 38:02.200 --> 38:04.200 estimating it using 38:04.200 --> 38:06.200 small step size 38:06.200 --> 38:08.200 now we're getting to the crux of 38:08.200 --> 38:10.200 back propagation so this will be 
the 38:10.200 --> 38:12.200 most important node to understand 38:12.200 --> 38:14.200 because if you understand the gradient for 38:14.200 --> 38:16.200 this node you understand all of back 38:16.200 --> 38:18.200 propagation and all training of neural nets 38:18.200 --> 38:20.200 basically so we need 38:20.200 --> 38:22.200 to derive d l by 38:22.200 --> 38:24.200 d c in other words the derivative 38:24.200 --> 38:26.200 of l with respect to c 38:26.200 --> 38:28.200 because we've computed all these other 38:28.200 --> 38:30.200 gradients already now we're coming 38:30.200 --> 38:32.200 here and we're continuing the back propagation 38:32.200 --> 38:34.200 manually so we want 38:34.200 --> 38:36.200 d l by d c and then we'll also 38:36.200 --> 38:38.200 derive d l by d e 38:38.200 --> 38:40.200 now here's the problem 38:40.200 --> 38:42.200 how do we derive d l by 38:42.200 --> 38:44.200 d c 38:44.200 --> 38:46.200 we actually know the derivative of l 38:46.200 --> 38:48.200 with respect to d so we know how 38:48.200 --> 38:50.200 l is sensitive to d 38:50.200 --> 38:52.200 but how is l sensitive to 38:52.200 --> 38:54.200 c so if we wiggle c how does 38:54.200 --> 38:56.200 that impact l through d 38:56.200 --> 39:00.200 so we know d l by d d 39:00.200 --> 39:02.200 and we 39:02.200 --> 39:04.200 also here know how c impacts d 39:04.200 --> 39:06.200 and so just very intuitively if you 39:06.200 --> 39:08.200 know the impact that c is having 39:08.200 --> 39:10.200 on d and the impact that d is having 39:10.200 --> 39:12.200 on l then you should be able to 39:12.200 --> 39:14.200 somehow put that information together to 39:14.200 --> 39:16.200 figure out how c impacts l 39:16.200 --> 39:18.200 and indeed this is what we can actually 39:18.200 --> 39:20.200 do so in particular we 39:20.200 --> 39:22.200 know just concentrating on d first 39:22.200 --> 39:24.200 let's look at what is the derivative 39:24.200 --> 39:26.200 of d with respect to c 39:26.200
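As a sanity check before deriving it, the same lol-style numerical trick from earlier also estimates dL/dc directly; here is a sketch using plain floats (a = 2, b = -3, c = 10, f = -2, matching the values in the video) rather than Value objects:

```python
def lol():
    # estimate dL/dc numerically, where L = (a*b + c) * f
    h = 1e-6
    a, b, c, f = 2.0, -3.0, 10.0, -2.0
    L1 = (a * b + c) * f          # original output
    L2 = (a * b + (c + h)) * f    # output after bumping c by h
    return (L2 - L1) / h          # rise over run

print(lol())  # ≈ -2.0
```

Numerically dL/dc comes out the same as dL/dd, which is what the local reasoning below will explain: the plus node passes the gradient through unchanged.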
--> 39:28.200 so in other words what is d d by d 39:28.200 --> 39:30.200 c 39:30.200 --> 39:32.200 so here 39:32.200 --> 39:34.200 we know that d is 39:34.200 --> 39:36.200 c plus e that's what we 39:36.200 --> 39:38.200 know and now we're interested in d d 39:38.200 --> 39:40.200 by d c if you 39:40.200 --> 39:42.200 just know your calculus again and you remember 39:42.200 --> 39:44.200 then differentiating c plus e with 39:44.200 --> 39:46.200 respect to c you know that that gives you 39:46.200 --> 39:48.200 1.0 and 39:48.200 --> 39:50.200 we can also go back to the basics and derive 39:50.200 --> 39:52.200 this because again we can go to our 39:52.200 --> 39:54.200 f of x plus h minus f of x 39:54.200 --> 39:56.200 divide by h 39:56.200 --> 39:58.200 that's the definition of a derivative 39:58.200 --> 40:00.200 as h goes to zero and 40:00.200 --> 40:02.200 so here focusing on c 40:02.200 --> 40:04.200 and its effect on d 40:04.200 --> 40:06.200 we can basically do the f of x plus h 40:06.200 --> 40:08.200 will be c 40:08.200 --> 40:10.200 incremented by h plus e 40:10.200 --> 40:12.200 that's the first evaluation of our 40:12.200 --> 40:14.200 function minus 40:14.200 --> 40:16.200 c plus e 40:16.200 --> 40:18.200 and then divide h 40:18.200 --> 40:20.200 and so what is this 40:20.200 --> 40:22.200 just expanding this out this will be c plus 40:22.200 --> 40:24.200 h plus e minus c minus 40:24.200 --> 40:26.200 e divide h 40:26.200 --> 40:28.200 and then you see here how c minus c 40:28.200 --> 40:30.200 cancels e minus e cancels 40:30.200 --> 40:32.200 we're left with h over h which is 1.0 40:32.200 --> 40:34.200 and so 40:34.200 --> 40:36.200 by symmetry also 40:36.200 --> 40:38.200 d d by d 40:38.200 --> 40:40.200 e will be 40:40.200 --> 40:42.200 1.0 as well 40:42.200 --> 40:44.200 so basically the derivative of 40:44.200 --> 40:46.200 a sum expression is very simple 40:46.200 --> 40:48.200 and this is the local derivative 40:48.200 --> 40:50.200
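That limit argument can also be checked numerically in a couple of lines; a tiny sketch using the lecture's concrete values c = 10 and e = -6:

```python
# local derivatives of the sum d = c + e, estimated with a small step h
h = 1e-6
c, e = 10.0, -6.0
dd_dc = (((c + h) + e) - (c + e)) / h  # bump c: ((c+h)+e - (c+e)) / h
dd_de = ((c + (e + h)) - (c + e)) / h  # bump e
print(dd_dc, dd_de)  # both ≈ 1.0
```

Both local derivatives come out as 1.0 up to floating point error, confirming that a plus node simply passes changes through.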
so i call this the local derivative because 40:50.200 --> 40:52.200 we have the final output value all the 40:52.200 --> 40:54.200 way at the end of this graph and we're now 40:54.200 --> 40:56.200 like a small node here and 40:56.200 --> 40:58.200 this is a little plus node and 40:58.200 --> 41:00.200 the little plus node doesn't know 41:00.200 --> 41:02.200 anything about the rest of the graph 41:02.200 --> 41:04.200 that it's embedded in all it knows 41:04.200 --> 41:06.200 is that it did a plus it took a c 41:06.200 --> 41:08.200 and an e added them and created 41:08.200 --> 41:10.200 d and this plus node 41:10.200 --> 41:12.200 also knows the local influence of 41:12.200 --> 41:14.200 c on d or rather 41:14.200 --> 41:16.200 the derivative of d with respect to c 41:16.200 --> 41:18.200 and it also knows the derivative of d 41:18.200 --> 41:20.200 with respect to e but 41:20.200 --> 41:22.200 that's not what we want that's just a local derivative 41:22.200 --> 41:24.200 what we actually want is 41:24.200 --> 41:26.200 dl by dc and 41:26.200 --> 41:28.200 l is here just one 41:28.200 --> 41:30.200 step away but in the general case 41:30.200 --> 41:32.200 this little plus node could be 41:32.200 --> 41:34.200 embedded in like a massive graph 41:34.200 --> 41:36.200 so again 41:36.200 --> 41:38.200 we know how l impacts d and 41:38.200 --> 41:40.200 now we know how c and e impact 41:40.200 --> 41:42.200 d how do we put that information together 41:42.200 --> 41:44.200 to write dl by dc 41:44.200 --> 41:46.200 and the answer of course is the chain rule 41:46.200 --> 41:48.200 in calculus and so 41:50.200 --> 41:52.200 i pulled up chain rule here from wikipedia 41:52.200 --> 41:54.200 and i'm going 41:54.200 --> 41:56.200 to go through this very briefly so chain 41:56.200 --> 41:58.200 rule wikipedia sometimes 41:58.200 --> 42:00.200 can be very confusing and calculus 42:00.200 --> 42:02.200 can be very confusing like 42:02.200 --> 42:04.200 this is
the way i learned 42:04.200 --> 42:06.200 chain rule and it was very 42:06.200 --> 42:08.200 confusing like what is happening 42:08.200 --> 42:10.200 it's just complicated so i like 42:10.200 --> 42:12.200 this expression much better 42:12.200 --> 42:14.200 if a variable z depends 42:14.200 --> 42:16.200 on a variable y which itself depends 42:16.200 --> 42:18.200 on a variable x 42:18.200 --> 42:20.200 then z depends on x as well obviously 42:20.200 --> 42:22.200 through the intermediate variable y 42:22.200 --> 42:24.200 and in this case the chain rule is expressed 42:24.200 --> 42:26.200 as if you want 42:26.200 --> 42:28.200 dz by dx 42:28.200 --> 42:30.200 then you take the dz by dy 42:30.200 --> 42:32.200 and you multiply it by dy 42:32.200 --> 42:34.200 by dx so the chain 42:34.200 --> 42:36.200 rule fundamentally is telling you 42:36.200 --> 42:38.200 how we chain 42:38.200 --> 42:40.200 these derivatives 42:40.200 --> 42:42.200 together correctly 42:42.200 --> 42:44.200 so to differentiate through 42:44.200 --> 42:46.200 a function composition 42:46.200 --> 42:48.200 we have to apply a multiplication 42:48.200 --> 42:50.200 of those derivatives 42:50.200 --> 42:52.200 so that's 42:52.200 --> 42:54.200 really what chain rule is telling us 42:54.200 --> 42:56.200 and there's a nice little 42:56.200 --> 42:58.200 intuitive explanation here which i also think is 42:58.200 --> 43:00.200 kind of cute the chain rule states that 43:00.200 --> 43:02.200 knowing the instantaneous rate of change of z with respect 43:02.200 --> 43:04.200 to y and y relative to x allows 43:04.200 --> 43:06.200 one to calculate the instantaneous rate of change of z 43:06.200 --> 43:08.200 relative to x as a 43:08.200 --> 43:10.200 product of those two rates of change 43:10.200 --> 43:12.200 simply the product of those two 43:12.200 --> 43:14.200 so here's a good one 43:14.200 --> 43:16.200 if a car travels twice as fast as a bicycle 43:16.200 --> 43:18.200 and the bicycle is four times 
So here's a good one: if a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels two times four, that is, eight times as fast as the man. And so this makes it very clear that the correct thing to do, sort of, is to multiply. The car is twice as fast as the bicycle, and the bicycle is four times as fast as the man, so the car will be eight times as fast as the man. And so we can take these intermediate rates of change, if you will, and multiply them together, and that justifies the chain rule intuitively. So have a look at the chain rule.

But here, really, what it means for us is that there's a very simple recipe for deriving what we want, which is dL/dc. What we have so far is that we know dL/dd, the derivative of L with respect to d; we know that's negative two. And now, because of this local reasoning that we've done here, we know dd/dc: how does c impact d? In particular, this is a plus node, so the local derivative is simply 1.0; it's very simple.
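As an aside, this recipe can be sanity-checked numerically on any toy composition. Here is a minimal sketch; the functions `y_of_x` and `z_of_y` are made up for illustration, not from the lecture:

```python
# A toy composition: z depends on y, and y depends on x.
def y_of_x(x):
    return 3.0 * x + 1.0      # dy/dx = 3

def z_of_y(y):
    return y * y              # dz/dy = 2y

x, h = 2.0, 1e-6
y = y_of_x(x)

dz_dy = (z_of_y(y + h) - z_of_y(y)) / h
dy_dx = (y_of_x(x + h) - y_of_x(x)) / h

# derivative of the full composition z(y(x)), measured directly
dz_dx = (z_of_y(y_of_x(x + h)) - z_of_y(y_of_x(x))) / h

# the chain rule says these two agree, up to float noise
print(dz_dx, dz_dy * dy_dx)
```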
And so the chain rule tells us that dL/dc, going through this intermediate variable, will just be simply dL/dd times dd/dc; that's the chain rule. So this is identical to what's happening here, except z is our L, y is our d, and x is our c. So we literally just have to multiply these. And because these local derivatives, like dd/dc, are just one, we basically just copy over dL/dd, because this is just times one. So because dL/dd is negative two, what is dL/dc? Well, it's the local gradient, 1.0, times dL/dd, which is negative two. So literally what a plus node does, you can look at it that way, is it just routes the gradient: because the plus node's local derivatives are just one, in the chain rule one times dL/dd is just dL/dd, and so that derivative gets routed to both c and to e in this case. So basically we have that c.grad, since that's the one we looked at first, is negative two times one, which is negative two.
And in the same way, by symmetry, e.grad will be negative two; that's the claim. So we can set those, and we can redraw, and you see how we just assigned negative two and negative two. So this backpropagating signal, which is carrying the information of what the derivative of L is with respect to all the intermediate nodes, we can imagine almost like flowing backwards through the graph, and a plus node will simply distribute the derivative to all the children nodes of it.

So this is the claim, and now let's verify it. Let me remove the plus h here from before, and now instead what we want to do is increment c: c.data will be incremented by h, and when I run this we expect to see negative two. Negative two. And then of course for e: e.data += h, and we expect to see negative two. Simple. So those are the derivatives of these internal nodes, and now we're going to recurse our way backwards again, and we're again going to apply the chain rule. So here we go, our second application of the chain rule, and we will apply it all the way through the graph; we just happen to only have one more node remaining.
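The nudge check just performed can be reproduced with plain floats. This sketch uses the running values from the lecture's expression (a = 2.0, b = -3.0, c = 10.0, f = -2.0), not the Value objects themselves:

```python
# Forward pass of the lecture's expression: e = a*b, d = e+c, L = d*f.
def forward(a, b, c, f):
    e = a * b
    d = e + c
    return d * f

h = 1e-6
a, b, c, f = 2.0, -3.0, 10.0, -2.0
base = forward(a, b, c, f)            # L = -8.0

# the plus node should route dL/dd = f = -2 unchanged into c
dL_dc = (forward(a, b, c + h, f) - base) / h
print(dL_dc)   # ~ -2.0
```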
We have that the derivative of L with respect to e, as we have just calculated, is negative two. So we know dL/de, and now we want dL/da. And the chain rule is telling us that that's just dL/de, negative two, times de/da, which we have to look at. So I'm a little times node inside a massive graph, and I only know that I did a times b and I produced an e. So now, what is de/da, and what is de/db? That's the only thing that I sort of know about; that's my local gradient. So because we have that e is a times b, we're asking what de/da is, and of course we just did that here, so I'm not going to re-derive it; but if you differentiate this with respect to a, you'll just get b, right, the value of b, which in this case is negative 3.0. So basically we have dL/da; well, let me just do it right here.
We have that a.grad, applying the chain rule here, is dL/de, which we see here is negative two, times de/da, which is the value of b: negative three. That's it. And then we have that b.grad is again dL/de, which is negative two, just the same way, times de/db, which is the value of a, which is 2.0. So these are our claimed derivatives. Let's redraw, and we see here that a.grad turns out to be six, because that is negative two times negative three, and b.grad is negative two times two, which is negative four. So those are our claims; let's delete this and verify them. We have a.data += h, and the claim is that a.grad is six. Let's verify: six. And we have b.data += h, so nudging b by h and looking at what happens: we claim it's negative four, and indeed it's negative four, plus or minus, again, float oddness. And that's it.
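The same finite-difference trick confirms the times-node claims; again a plain-float sketch of the lecture's expression, not the Value objects:

```python
def forward(a, b, c, f):
    e = a * b          # times node: de/da = b, de/db = a
    d = e + c
    return d * f       # so dL/dd = f = -2

h = 1e-6
a, b, c, f = 2.0, -3.0, 10.0, -2.0
base = forward(a, b, c, f)

dL_da = (forward(a + h, b, c, f) - base) / h   # expect f * b = 6
dL_db = (forward(a, b + h, c, f) - base) / h   # expect f * a = -4
print(dL_da, dL_db)
```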
That was the manual backpropagation, all the way from here to all the leaf nodes, and we've done it piece by piece. And really all we've done, as you saw, is we iterated through all the nodes one by one and locally applied the chain rule. We always know what the derivative of L is with respect to this little output, and then we look at how this output was produced: it was produced through some operation, and we have the pointers to the children nodes. So in this little operation we know what the local derivatives are, and we just multiply them onto the derivative, always. So we just go through and recursively multiply on the local derivatives, and that's what backpropagation is: it's just a recursive application of the chain rule backwards through the computation graph.

Let's see this power in action, just very briefly. What we're going to do is nudge our inputs to try to make L go up. So in particular, we're going to take the data and we're going to change it, and if we want L to go up, that means we just have to go in the direction of the gradient. So a should increase in the direction of the gradient by some small step amount; this is the step size.
And we don't just want this for a, but also for b, also for c, also for f. Those are the leaf nodes, which we usually have control over. And if we nudge in the direction of the gradient, we expect a positive influence on L. So we expect L to go up positively, meaning it should become less negative; it should go up to, say, negative six or something like that. It's hard to tell exactly, and we have to rerun the forward pass. So let me just do that here. This would be the forward pass; f would be unchanged. This is effectively the forward pass, and now if we print L.data, we expect, because we nudged all the inputs in the direction of the gradient, a less negative L; we expect it to go up, so maybe it's negative six or so. Let's see what happens. Okay, negative seven. And this is basically one step of an optimization that we'll end up running, and really this gradient just gives us power, because we know how to influence the final outcome. And this will be extremely useful for training neural nets, as we'll soon see.
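That single optimization step can be sketched with plain floats, using the gradients computed above (the 0.01 step size is an arbitrary choice, not the lecture's exact value):

```python
step = 0.01
a, b, c, f = 2.0, -3.0, 10.0, -2.0
# gradients from the manual backprop: dL/da=6, dL/db=-4, dL/dc=-2, dL/df=d=4
grads = {'a': 6.0, 'b': -4.0, 'c': -2.0, 'f': 4.0}

# nudge every leaf in the direction of its gradient
a += step * grads['a']
b += step * grads['b']
c += step * grads['c']
f += step * grads['f']

# rerun the forward pass
e = a * b
d = e + c
L = d * f
print(L)   # less negative than the original -8.0
```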
So now I would like to do one more example of manual backpropagation, using a slightly more complex and useful example: we are going to backpropagate through a neuron. We want to eventually build out neural networks, and in the simplest case these are multilayer perceptrons, as they're called. So this is a two-layer neural net, and it's got these hidden layers made up of neurons, and these neurons are fully connected to each other. Now, biologically, neurons are very complicated devices, but we have very simple mathematical models of them. And so this is a very simple mathematical model of a neuron. You have some inputs, the x's, and then you have these synapses that have weights on them; the w's are the weights. The synapse interacts with the input to this neuron multiplicatively, so what flows to the cell body of this neuron is w times x. But there are multiple inputs, so there's a w times x flowing to the cell body for each of them. The cell body then also has some bias; this is kind of like the innate trigger happiness of this neuron, so the bias can make it a bit more trigger happy or a bit less trigger happy, regardless of the input.
But basically we're taking all the w times x of all the inputs, adding the bias, and then taking it through an activation function, and this activation function is usually some kind of a squashing function, like a sigmoid or a tanh or something like that. As an example, we're going to use the tanh here. Numpy has an np.tanh, so we can call it on a range and we can plot it. This is the tanh function, and you see that the inputs, as they come in, get squashed on the y coordinate. So right at 0 we're going to get exactly 0, and then as you go more positive in the input, you'll see that the activation function will only go up to 1 and then plateau out; so if you pass in very positive inputs, we're going to cap them smoothly at 1, and on the negative side we're going to cap them smoothly at negative 1. So that's the tanh, and that's the squashing function, or an activation function. And what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs.

So let's write one out. I'm going to copy-paste because I don't want to type too much.
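The tanh shape described above can be sampled directly. The lecture plots np.tanh over a range; the same squashing behavior is visible with the standard library's math.tanh on a few points:

```python
import math

# tanh squashes any real input smoothly into the open interval (-1, 1)
for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"{x:+.1f} -> {math.tanh(x):+.4f}")
# tanh(0) is exactly 0, and the tails cap smoothly near -1 and +1
```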
Okay, so here we have the inputs x1 and x2; this is a two-dimensional neuron, so two inputs are going to come in. Then we have the weights of this neuron, w1 and w2, and these weights, again, are the synaptic strengths for each input. And this is the bias of the neuron, b. Now, what we want to do, according to this model, is multiply x1 times w1 and x2 times w2, and then add the bias on top of it. It gets a little messy here, but all we are trying to do is x1*w1 + x2*w2 + b, except I'm doing it in small steps so that we actually have pointers to all these intermediate nodes. So we have an x1w1 variable and an x2w2 variable, and I'm also labeling them. And n is now the cell body's raw activation, without the activation function, for now. This should be enough to basically plot it, so draw_dot of n gives us x1 times w1 and x2 times w2 being added, then the bias gets added on top of this, and this n is the sum.
So we are now going to take it through an activation function, and let's say we use the tanh, so that we produce the output. What we'd like to do here is compute the output, and I'll call it o: o is n.tanh(). Okay, but we haven't yet written the tanh. Now, the reason that we need to implement another function here is that tanh is a hyperbolic function, and we've only so far implemented a plus and a times, and you can't make a tanh out of just pluses and times; you also need exponentiation. So tanh is this kind of formula here; you can use either one of these, and you see that there's exponentiation involved, which we have not implemented yet for our little Value node. So we're not going to be able to produce a tanh yet, and we have to go back up and implement something like it.

Now, one option here is that we could actually implement exponentiation, and we could return the exp of a value instead of the tanh of a value, because if we had exp, then we'd have everything else that we need: we know how to add and we know how to multiply, so we'd be able to create a tanh if we knew how to exp. But for the purposes of this example, I specifically wanted to show you that we don't necessarily need to have the most atomic pieces in this Value object. We can actually create functions at arbitrary points of abstraction. They can be complicated functions, but they can also be very, very simple functions, like a plus, and it's totally up to us.
The only thing that matters is that we know how to differentiate through any one function. So we take some inputs and we make an output; the function can be arbitrarily complex, as long as you know how to create the local derivative. If you know the local derivative of how the inputs impact the output, then that's all you need. So we're going to cluster up all of this expression, and we're not going to break it down to its atomic pieces; we're just going to directly implement tanh. So let's do that: def tanh, and then out will be a Value of... and we need this expression here, so let me actually copy-paste. Let's grab x, which is self.data, and then this, I believe, is the tanh: math.exp of two x, minus one, over math.exp of two x, plus one. I'm calling it x just so that it matches the formula exactly. Okay, and now this will be t. Then the children of this node: there's just one child, and I'm wrapping it in a tuple, so this is a tuple of one object, just self. And here the name of this operation will be 'tanh', and we're going to return that. Okay, so now Value should be implementing tanh, and we can scroll all the way down here and actually do n.tanh(), and that's going to return the tanh output of n.
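Putting those pieces together, a minimal sketch of the Value class as described so far might look like this. The attribute and argument names follow micrograd's conventions, but this is a reconstruction rather than the lecture's verbatim code:

```python
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)   # pointers to the children nodes
        self._op = _op
        self.label = label

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

    def tanh(self):
        x = self.data
        t = (math.exp(2 * x) - 1) / (math.exp(2 * x) + 1)
        return Value(t, (self,), 'tanh')   # one child, wrapped in a tuple

# the neuron from the lecture
x1, x2 = Value(2.0, label='x1'), Value(0.0, label='x2')
w1, w2 = Value(-3.0, label='w1'), Value(1.0, label='w2')
b = Value(6.8813735870195432, label='b')
n = x1 * w1 + x2 * w2 + b
o = n.tanh()
print(n.data, o.data)   # ~0.8814 and ~0.7071
```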
And now we should be able to draw_dot of o, not of n, so let's see how that worked. There we go: n went through tanh to produce this output. So now tanh is, sort of, a little micrograd-supported node here as an operation, and as long as we know the derivative of tanh, we'll be able to backpropagate through it.

Now let's see this tanh in action. Currently it's not squashing too much, because the input to it is pretty low. If the bias were increased to, say, 8, then we'd see that what's flowing into the tanh is now 2, and tanh is squashing it to 0.96, so we're already hitting the tail of this tanh, and it will smoothly go up to 1 and then plateau out over there. Okay, so now I'm going to do something slightly strange: I'm going to change this bias from 8 to this number, 6.88 and so on, and I'm doing this for specific reasons. We're about to start backpropagation, and I want to make sure that our numbers come out nice: not very crazy numbers, but nice numbers that we can sort of understand in our head. Let me also add those labels; o is short for output here. So that's o. Okay, so 0.88 flows into the tanh and comes out as 0.7.
So now we're going to do backpropagation, and we're going to fill in all the gradients. What is the derivative of o with respect to all the inputs here? Of course, in a typical neural network setting, what we really care about the most is the derivative of the output with respect to the weights, specifically w2 and w1, because those are the weights that we're going to be changing as part of the optimization. The other thing that we have to remember is that here we have only a single neuron, but in a neural net you typically have many neurons, and they're connected; so this is only one small neuron, a piece of a much bigger puzzle, and eventually there's a loss function that measures the accuracy of the neural net, and we're backpropagating with respect to that accuracy and trying to increase it.

So let's start off backpropagation here at the end. What is the derivative of o with respect to o? The base case, sort of, which we always know, is that the gradient is just 1.0. So let me fill it in, and then let me split out the drawing call into its own cell here and clear this output. Okay. So now when we draw o, we'll see that o.grad is 1. So now we're going to backpropagate through the tanh.
To backpropagate through the tanh, we need to know the local derivative of tanh: if we have that o is tanh of n, then what is do/dn? Now, what you could do is come here, take this expression, and do your calculus derivative-taking, and that would work. But we can also just scroll down Wikipedia here to a section that hopefully tells us the derivative: d/dx of tanh of x is, any of these, and I like this one, 1 minus tanh squared of x. So this is 1 minus tanh(x) squared. Basically what this is saying is that do/dn is 1 minus tanh of n squared. And we already have tanh of n; it's just o. So it's 1 minus o squared. So o is the output here; o.data is this number, and what this is saying is that do/dn is 1 minus that squared. So 1 minus o.data squared is 0.5, conveniently. So the local derivative of this tanh operation here is 0.5, and that would be do/dn. So we can fill in that n.grad is 0.5; we'll just fill it in. So this is exactly 0.5, one half.

So now we're going to continue the backprop. This is 0.5, and this is a plus node. So what is backprop going to do here? If you remember our previous example, a plus is just a distributor of gradient, so this gradient will simply flow to both of these equally, and that's because the local derivative of this operation is 1 for every one of its nodes. So 1 times 0.5 is 0.5.
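The 1 minus o squared shortcut used a moment ago is easy to verify numerically before moving on; a quick sketch with the neuron's actual pre-activation:

```python
import math

n = 0.8813735870195432        # the cell body activation from above
o = math.tanh(n)              # ~0.7071

local = 1 - o ** 2            # the claimed do/dn

# finite-difference check of the same derivative
h = 1e-6
numeric = (math.tanh(n + h) - math.tanh(n)) / h

print(local, numeric)   # both come out to ~0.5
```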
So therefore we know that this node here, the x1w1 + x2w2 node, has grad 0.5, and we know that b.grad is also 0.5. So let's set those and let's draw. So those are 0.5. Continuing, we have another plus; 0.5, again, we'll just distribute, so 0.5 will flow to both of these. So we can set theirs: x1w1.grad and x2w2.grad are both 0.5. And let's redraw. Pluses are my favorite operations to backpropagate through, because it's very simple. So now what's flowing into these expressions is 0.5, and really, again, keep in mind what the derivative is telling us at every point in time along here: this is saying that if we want the output of this neuron to increase, then the influence of these expressions on the output is positive; both of them are positive. So now, backpropagating to x2 and w2 first. This is a times node, so we know that the local derivative is the other term. So if we want to calculate x2.grad, can you think through what it's going to be?
x2.grad will be w2.data times x2w2.grad, and w2.grad will be x2.data times x2w2.grad. Right, so that's the little local piece of chain rule. Let's set them and let's redraw. So here we see that the gradient on our weight w2 is 0, because x2's data was 0, right? But x2 will have the gradient 0.5, because the data on w2 was 1. And what's interesting here, right, is that because the input x2 was 0, then, because of the way the times works, of course this gradient comes out to 0. Think intuitively about why that is: the derivative always tells us the influence of a node on the final output. If I wiggle w2, how is the output changing? It's not changing, because we're multiplying by zero. So because it's not changing, there is no derivative, and 0 is the correct answer, because we're multiplying by that zero.

And let's do it here too: 0.5 should come here and flow through this times, and so we'll have that x1.grad is, can you think through a little bit what this should be? The local derivative of times with respect to x1 is going to be w1, so x1.grad is w1's data times x1w1.grad, and w1.grad will be x1.data times x1w1.grad. Let's see what those came out to be: x1w1.grad is 0.5, so this would be negative 1.5, and this would be 1.
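Those four hand-computed products, as a plain-float sketch:

```python
# values from the neuron: x1=2.0, x2=0.0, w1=-3.0, w2=1.0
x1, x2, w1, w2 = 2.0, 0.0, -3.0, 1.0
upstream = 0.5   # gradient flowing into both x1*w1 and x2*w2

# times node: the local derivative is "the other term"
x2_grad = w2 * upstream   #  1.0 * 0.5 =  0.5
w2_grad = x2 * upstream   #  0.0 * 0.5 =  0.0  (x2 is 0, so wiggling w2 can't matter)
x1_grad = w1 * upstream   # -3.0 * 0.5 = -1.5
w1_grad = x1 * upstream   #  2.0 * 0.5 =  1.0

print(x1_grad, w1_grad, x2_grad, w2_grad)
```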
and we've backpropagated through this expression. These are the actual final derivatives. So if we 01:08:39.400 --> 01:08:40.520 look here, 01:08:40.520 --> 01:08:44.920 this is negative 1.5. So if we 01:08:44.920 --> 01:08:49.560 now want this neuron's output to increase, we know that what's necessary is that 01:08:51.320 --> 01:08:55.400 w2 — we have no gradient, w2 doesn't actually matter to this neuron right now — 01:08:55.400 --> 01:09:00.920 but this weight should go up. So if this weight goes up, then this neuron's output 01:09:00.920 --> 01:09:07.480 would have gone up, and proportionally, because the gradient is 1. Okay, so doing the backpropagation 01:09:07.480 --> 01:09:08.360 manually is obviously ridiculous. So we are now going to put an end to this suffering, and we're going to see how we 01:09:08.980 --> 01:09:13.060 can implement the backward pass a bit more automatically. We're not going to be doing 01:09:13.060 --> 01:09:18.140 all of it manually out here. It's now pretty obvious to us by example how these pluses and 01:09:18.140 --> 01:09:23.540 times are backpropagating gradients. So let's go up to the value object, and we're going to start 01:09:23.540 --> 01:09:31.760 codifying what we've seen in the examples below. So we're going to do this by storing a special 01:09:31.760 --> 01:09:39.640 self._backward, underscore backward. And this will be a function which is going to do that 01:09:39.640 --> 01:09:44.340 little piece of chain rule.
At each little node that took inputs and produced an output, 01:09:45.040 --> 01:09:51.480 we're going to store how we are going to chain the output's gradient into the inputs' gradients. 01:09:52.340 --> 01:10:00.940 So by default, this will be a function that doesn't do anything. And you can also see that 01:10:00.940 --> 01:10:01.740 here in the value object in my example, 01:10:01.760 --> 01:10:08.900 micrograd. So we have this _backward function. By default, it doesn't do anything. This is an 01:10:08.900 --> 01:10:13.080 empty function. And that would be sort of the case, for example, for a leaf node. For a leaf 01:10:13.080 --> 01:10:20.520 node, there's nothing to do. But now when we're creating these out values, these out values are 01:10:20.520 --> 01:10:30.600 an addition of self and other. And so we'll want to set out._backward to be the function that 01:10:30.600 --> 01:10:31.740 propagates the gradient. 01:10:31.760 --> 01:10:42.960 So let's define what should happen, and we're going to store it in a closure. Let's define what 01:10:42.960 --> 01:10:53.180 should happen when we call out's _backward. For addition, our job is to take out's grad and 01:10:53.180 --> 01:10:58.940 propagate it into self's grad and other.grad. So basically, we want to set self.grad to 01:10:58.940 --> 01:11:01.740 something.
And we want to set other's grad 01:11:01.760 --> 01:11:09.240 to something. Okay, and the way we saw below how chain rule works, we 01:11:09.240 --> 01:11:14.300 want to take the local derivative times the, sort of, global derivative, I should 01:11:14.300 --> 01:11:17.900 call it, which is the derivative of the final output of the expression with 01:11:17.900 --> 01:11:27.320 respect to out's data — with respect to out. So the local derivative of self in an 01:11:27.320 --> 01:11:35.420 addition is 1.0, so it's just 1.0 times out's grad. That's the chain rule. And 01:11:35.420 --> 01:11:40.760 other.grad will be 1.0 times out.grad. And what you're basically seeing 01:11:40.760 --> 01:11:46.280 here is that out's grad will simply be copied onto self's grad and other's grad, 01:11:46.280 --> 01:11:51.440 as we saw happens for an addition operation. So we're going to later call 01:11:51.440 --> 01:11:56.420 this function to propagate the gradient. Having done an addition, let's now do 01:11:56.420 --> 01:11:57.040 multiplication. 01:11:57.320 --> 01:12:04.880 We're going to also define out, and we're going to set its _backward to be 01:12:04.880 --> 01:12:15.500 _backward, and we want to chain out.grad into self.grad and other.grad. 01:12:15.500 --> 01:12:21.940 And this will be a little piece of chain rule for multiplication. So we'll have — so 01:12:21.940 --> 01:12:25.940 what should this be? Can you think through 01:12:27.320 --> 01:12:31.340 what the local derivative of a times node is? 01:12:31.340 --> 01:12:33.680 So self.grad will be 01:12:33.680 --> 01:12:35.900 other.data times out.grad, 01:12:35.900 --> 01:12:40.160 and other.grad will be 01:12:40.160 --> 01:12:44.840 self.data times out.grad. For a times node, the local derivative with respect to one input 01:12:44.840 --> 01:12:48.040 is the other input.
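A sketch of how those two closures might look on a pared-down value object (attribute names follow micrograd; note it already uses the += accumulation that the lecture arrives at later — plain = behaves identically as long as every variable is used only once):

```python
class Value:
    """Minimal sketch: wraps a number and stores a _backward closure."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # leaf nodes: nothing to do
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # local derivative of + is 1.0 for both inputs,
            # so out.grad is simply routed into each input's grad
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward       # store the function, don't call it
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # local derivative of * w.r.t. one input is the other input
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
```

Calling c = a * b and then c._backward() (after seeding c.grad) deposits b.data and a.data into a.grad and b.grad respectively.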
And that gets chained through out.grad. 01:12:48.040 --> 01:12:51.400 Okay, and finally, here is the tanh example. 01:12:51.400 --> 01:12:55.540 For tanh, just like before, we want to set the 01:12:55.540 --> 01:12:56.560 output's 01:12:56.560 --> 01:12:57.300 underscore backward 01:12:57.320 --> 01:13:05.600 to be just backward, and here we need to backpropagate. We have out.grad, and we want to 01:13:05.600 --> 01:13:13.780 chain it into self.grad. And self.grad will be the local derivative of this operation 01:13:13.780 --> 01:13:20.360 that we've done here, which is tanh. And so we saw that the local gradient is 1 minus the tanh of x, 01:13:20.360 --> 01:13:27.120 squared, which here is t. That's the local derivative, because t is the output of this tanh. 01:13:27.120 --> 01:13:33.960 So 1 minus t squared is the local derivative, and then the gradient has to be multiplied, because of the 01:13:33.960 --> 01:13:40.100 chain rule. So out.grad is chained through the local gradient into self.grad, and that should 01:13:40.100 --> 01:13:46.240 be basically it. So we're going to redefine our value node, we're going to swing all the way down 01:13:46.240 --> 01:13:56.560 here, and we're going to redefine our expression, make sure that all the grads are zero. Okay, but now we don't have 01:13:56.560 --> 01:13:57.100 to 01:13:57.100 --> 01:14:01.980 do this manually anymore. We are going to basically be calling ._backward 01:14:01.980 --> 01:14:15.080 in the right order. So first we want to call o's _backward. So o was the 01:14:15.080 --> 01:14:23.320 outcome of tanh, right? So calling o's _backward will be this 01:14:23.320 --> 01:14:29.480 function. This is what it will do.
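The tanh chain-rule step in isolation, as a plain-float sketch (the function name is mine, not micrograd's — it just exposes the local derivative 1 - tanh(x)**2 and the chaining):

```python
import math

def tanh_backward(x, out_grad):
    """Return (t, x_grad): forward tanh output and the gradient
    chained through it. Local derivative: d/dx tanh(x) = 1 - tanh(x)**2."""
    t = math.tanh(x)                  # forward output (this is out.data)
    x_grad = (1.0 - t**2) * out_grad  # local derivative times upstream grad
    return t, x_grad
```

At x = 0 the local derivative is 1, so the upstream gradient passes through unchanged; near the lecture's n value (where tanh ≈ 0.7071) it is roughly 0.5, which is exactly where the 0.5 flowing into the plus nodes came from.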
Now we have to be careful, because there's a 01:14:29.480 --> 01:14:39.700 times out.grad, and out.grad, remember, is initialized to 0. So here we see 01:14:39.700 --> 01:14:47.320 grad 0. So as a base case, we need to set o's grad to 1.0, to initialize 01:14:47.320 --> 01:14:49.980 this with 1, 01:14:53.320 --> 01:14:59.000 and then once this is 1, we can call o._backward, and what that should do is it should 01:14:59.000 --> 01:15:05.840 propagate this grad through tanh. So the local derivative times the global derivative, which 01:15:05.840 --> 01:15:08.780 is initialized at 1. So this should — 01:15:08.780 --> 01:15:21.080 so I thought about redoing it, but I figured I should just leave the error in here, because 01:15:21.080 --> 01:15:23.200 it's pretty funny. Why is a NoneType object 01:15:23.320 --> 01:15:30.860 not callable? It's because I screwed up. We're trying to save these functions. So this is 01:15:30.860 --> 01:15:36.200 correct. This here — we don't want to call the function, because that returns None. These 01:15:36.200 --> 01:15:41.000 functions return None. We just want to store the function. So let me redefine the value 01:15:41.000 --> 01:15:47.080 object, and then we're going to come back in, redefine the expression, draw a dot. Everything 01:15:47.080 --> 01:15:48.080 is great. 01:15:48.080 --> 01:15:53.080 o.grad is 1, and now 01:15:53.080 --> 01:15:59.580 this should work, of course. Okay. So after o._backward, this grad should now 01:15:59.580 --> 01:16:06.840 be 0.5 if we redraw, and if everything went correctly, 0.5. Yay. Okay. So now we need to 01:16:06.840 --> 01:16:18.080 call n's — n._backward, sorry, n's _backward. So that seems to have worked. 01:16:18.080 --> 01:16:22.920 So n's _backward routed the gradient to both of these. So this is looking great. So 01:16:22.920 --> 01:16:32.280 now we could, of course, call b's — b._backward, sorry. What's going to happen?
01:16:32.280 --> 01:16:38.860 Well, b doesn't have anything to backpropagate. b's _backward, because b is a leaf node, is 01:16:38.860 --> 01:16:46.200 by initialization the empty function. So nothing would happen. But we can call it on it. But 01:16:46.200 --> 01:16:50.920 when we call this one's _backward, 01:16:57.300 --> 01:17:04.820 then we expect this 0.5 to get further routed. Right? So 01:17:04.820 --> 01:17:18.140 there we go: 0.5, 0.5. And then finally, we want to call it here on x2w2, and on 01:17:18.140 --> 01:17:20.140 x1w1. 01:17:20.140 --> 01:17:22.080 Let's do both of those. And there we go. 01:17:22.920 --> 01:17:29.420 Exactly as we did before, but now we've done it through calling that _backward 01:17:29.420 --> 01:17:36.780 sort of manually. So we have one last piece to get rid of, which is us calling underscore 01:17:36.780 --> 01:17:42.660 backward manually. So let's think through what we are actually doing. We've laid out a mathematical 01:17:42.660 --> 01:17:48.760 expression, and now we're trying to go backwards through that expression. So going backwards through 01:17:48.760 --> 01:17:54.520 the expression just means that we never want to call a ._backward for any node before 01:17:54.520 --> 01:18:02.140 we've done, sort of, everything after it. So we have to do everything after it before we're ever going 01:18:02.140 --> 01:18:06.260 to call ._backward on any one node. We have to get all of its full dependencies — everything that 01:18:06.260 --> 01:18:13.780 it depends on has to propagate to it before we can continue backpropagation. So this ordering 01:18:13.780 --> 01:18:17.220 of graphs can be achieved using something called topological sort. 01:18:17.220 --> 01:18:18.620 So topological 01:18:18.620 --> 01:18:25.520 sort is basically a laying out of a graph such that all the edges go only from left to right,
01:18:25.520 --> 01:18:33.380 basically. So here we have a graph — it's a directed acyclic graph, a DAG — and this is two different 01:18:33.380 --> 01:18:37.860 topological orders of it, I believe, where basically you'll see that it's a laying out of the nodes 01:18:37.860 --> 01:18:44.260 such that all the edges go only one way, from left to right. And implementing topological sort — you 01:18:44.260 --> 01:18:51.000 can look in Wikipedia and so on, I'm not going to go through it in detail — but basically this is what 01:18:51.000 --> 01:19:00.160 builds a topological order. We maintain a set of visited nodes, and then we are going through, 01:19:00.160 --> 01:19:04.960 starting at some root node, which for us is o — that's where I want to start the topological sort. 01:19:04.960 --> 01:19:11.000 And starting at o, we go through all of its children, and we need to lay them out from left to 01:19:11.000 --> 01:19:14.200 right. And basically this starts at o. 01:19:14.260 --> 01:19:17.580 If o is not visited, then it marks it as visited, 01:19:17.880 --> 01:19:23.260 and then it iterates through all of its children and calls build topological on them. 01:19:24.080 --> 01:19:27.680 And then after it's gone through all the children, it adds itself. 01:19:28.240 --> 01:19:32.700 So basically, this node that we're going to call it on, like, say, o, 01:19:33.060 --> 01:19:38.860 is only going to add itself to the topo list after all of the children have been processed. 01:19:38.860 --> 01:19:43.560 And that's how this function is guaranteeing that you're only going to be in the list 01:19:43.560 --> 01:19:45.560 once all of your children are in the list. 01:19:45.820 --> 01:19:47.400 And that's the invariant that is being maintained. 01:19:47.820 --> 01:19:51.340 So if we build topo on o and then inspect this list, 01:19:51.720 --> 01:19:55.740 we're going to see that it ordered our value objects. 01:19:56.500 --> 01:20:00.820 And the last one is the value of 0.707, which is the output.
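The post-order DFS just described might look like this as a standalone sketch (it assumes each node exposes a _prev set of child nodes, as micrograd's value object does; the function name is mine):

```python
def topological_order(root):
    """Return nodes in topological order via post-order DFS.
    Invariant from the lecture: a node is appended to the list only
    after all of its children are already in the list, so the root
    (the expression's output) ends up last."""
    topo, visited = [], set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:   # _prev: set of child nodes
                build_topo(child)
            topo.append(v)          # only after all children are in
    build_topo(root)
    return topo
```

Iterating this list in reverse then visits the output first and the leaves last, which is exactly the order backpropagation needs.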
01:20:01.520 --> 01:20:08.080 So this is o, and then this is n, and then all the other nodes get laid out before it. 01:20:09.500 --> 01:20:11.540 So that builds the topological order. 01:20:12.100 --> 01:20:13.500 And really what we're doing now 01:20:13.560 --> 01:20:19.000 is we're just calling dot underscore backward on all of the nodes in a topological order. 01:20:19.580 --> 01:20:24.180 So if we just reset the gradients — they're all 0 — what did we do? 01:20:24.540 --> 01:20:30.480 We started by setting o.grad to be 1. 01:20:31.160 --> 01:20:32.600 That's the base case. 01:20:33.260 --> 01:20:35.960 Then we built a topological order. 01:20:37.960 --> 01:20:43.240 And then we went: for node in reversed 01:20:43.960 --> 01:20:44.680 of topo. 01:20:46.220 --> 01:20:51.220 Now, in the reverse order, because this list goes from the inputs to the output, 01:20:51.620 --> 01:20:53.080 we need to go through it in reversed order. 01:20:53.960 --> 01:20:57.180 So starting at o: node._backward. 01:20:58.480 --> 01:21:01.580 And this should be it. 01:21:03.180 --> 01:21:03.940 There we go. 01:21:05.380 --> 01:21:06.580 Those are the correct derivatives. 01:21:07.140 --> 01:21:09.480 Finally, we are going to hide this functionality. 01:21:10.020 --> 01:21:12.420 So I'm going to copy this, 01:21:12.740 --> 01:21:13.540 and we're going to hide this functionality 01:21:13.560 --> 01:21:14.840 inside the value class, 01:21:15.000 --> 01:21:17.340 because we don't want to have all that code lying around. 01:21:18.340 --> 01:21:19.720 So instead of an underscore backward, 01:21:19.940 --> 01:21:21.840 we're now going to define an actual backward. 01:21:22.160 --> 01:21:24.200 So that's backward, without the underscore. 01:21:26.120 --> 01:21:28.400 And that's going to do all the stuff that we just derived. 01:21:29.000 --> 01:21:30.700 So let me just clean this up a little bit.
01:21:31.160 --> 01:21:38.360 So we're first going to build a topological graph, 01:21:38.840 --> 01:21:40.340 starting at self. 01:21:41.340 --> 01:21:43.380 So build topo of self 01:21:43.820 --> 01:21:47.380 will populate the topological order into the topo list, 01:21:47.540 --> 01:21:48.500 which is a local variable. 01:21:49.060 --> 01:21:51.520 Then we set self.grad to be one. 01:21:52.780 --> 01:21:55.740 And then for each node in the reversed list — 01:21:56.060 --> 01:21:58.240 so starting at self and going to all the children — we call 01:21:59.240 --> 01:22:00.700 underscore backward. 01:22:02.180 --> 01:22:04.460 And that should be it. 01:22:04.820 --> 01:22:06.460 So save. 01:22:07.700 --> 01:22:08.720 Come down here. 01:22:09.340 --> 01:22:10.000 We redefine. 01:22:11.000 --> 01:22:12.340 Okay, all the grads are zero. 01:22:13.560 --> 01:22:16.520 And now what we can do is o.backward, without the underscore. 01:22:17.480 --> 01:22:22.000 And there we go. 01:22:22.900 --> 01:22:25.040 And that's backpropagation, 01:22:26.420 --> 01:22:27.580 at least for one neuron. 01:22:28.540 --> 01:22:30.700 Now we shouldn't be too happy with ourselves, actually, 01:22:30.700 --> 01:22:32.700 because we have a bad bug. 01:22:33.340 --> 01:22:35.060 And we have not surfaced the bug 01:22:35.060 --> 01:22:38.780 because of some specific conditions that we have to think about right now. 01:22:39.700 --> 01:22:42.440 So here's the simplest case that shows the bug. 01:22:43.560 --> 01:22:45.800 Say I create a single node A, 01:22:46.160 --> 01:22:49.980 and then I create a B that is A plus A, 01:22:51.460 --> 01:22:52.440 and then I call backward. 01:22:54.740 --> 01:22:56.780 So what's going to happen is A is three, 01:22:57.280 --> 01:22:59.120 and then B is A plus A. 01:22:59.340 --> 01:23:01.640 So there's two arrows on top of each other here. 01:23:03.660 --> 01:23:06.320 Then we can see that B is — of course, the forward pass works:
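Putting the pieces together, here is one way the value class might look at this point, restricted to + and * for brevity (a sketch, not micrograd's exact source; it already includes the += gradient accumulation that the lecture motivates next, so the A-plus-A case comes out right):

```python
class Value:
    """Sketch of the value object with the public backward() assembled."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # local derivative of + is 1.0
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # 1) topological order of the whole expression graph
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        # 2) base case, then chain rule applied output-to-inputs
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
```

For example, with d = a * b + c and d.backward(), the leaf gradients come out as b.data, a.data, and 1 respectively.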
01:23:06.780 --> 01:23:09.240 B is just A plus A, which is six. 01:23:09.240 --> 01:23:12.100 But the gradient here is not actually correct — 01:23:12.580 --> 01:23:13.540 the one that we 01:23:13.560 --> 01:23:14.160 calculated automatically. 01:23:15.740 --> 01:23:22.160 And that's because, of course, just doing calculus in your head, 01:23:22.540 --> 01:23:25.940 the derivative of B with respect to A should be two: 01:23:27.420 --> 01:23:28.240 one plus one. 01:23:28.880 --> 01:23:29.580 It's not one. 01:23:30.940 --> 01:23:32.320 Intuitively, what's happening here, right? 01:23:32.380 --> 01:23:35.840 So B is the result of A plus A, and then we call backward on it. 01:23:36.440 --> 01:23:39.200 So let's go up and see what that does. 01:23:43.560 --> 01:23:46.680 B is the result of addition, so out is B. 01:23:48.020 --> 01:23:54.000 And then when we call backward, what happened is self.grad was set to one, 01:23:54.560 --> 01:23:56.540 and then other.grad was set to one. 01:23:57.280 --> 01:24:02.740 But because we're doing A plus A, self and other are actually the exact same object. 01:24:03.420 --> 01:24:05.660 So we are overriding the gradient. 01:24:05.840 --> 01:24:09.160 We are setting it to one, and then we are setting it again to one. 01:24:09.500 --> 01:24:12.320 And that's why it stays at one. 01:24:12.580 --> 01:24:13.540 So that's a problem. 01:24:14.540 --> 01:24:18.000 There's another way to see this in a little bit more complicated expression. 01:24:21.340 --> 01:24:24.660 So here we have A and B, 01:24:25.920 --> 01:24:29.320 and then D will be the multiplication of the two, 01:24:29.620 --> 01:24:31.240 and E will be the addition of the two. 01:24:32.140 --> 01:24:34.800 And then we multiply E times D to get F. 01:24:35.260 --> 01:24:36.540 And then we call F dot backward. 01:24:37.660 --> 01:24:40.040 And these gradients, if you check, will be incorrect.
01:24:40.600 --> 01:24:42.880 So fundamentally what's happening here, again, 01:24:42.880 --> 01:24:48.660 is basically we're going to see an issue any time we use a variable more than once. 01:24:49.180 --> 01:24:53.060 Until now, in these expressions above, every variable is used exactly once, 01:24:53.160 --> 01:24:54.160 so we didn't see the issue. 01:24:54.920 --> 01:24:57.000 But here, if a variable is used more than once, 01:24:57.100 --> 01:24:58.580 what's going to happen during the backward pass? 01:24:59.100 --> 01:25:01.680 We're backpropagating from F to E to D. 01:25:01.860 --> 01:25:02.480 So far, so good. 01:25:02.720 --> 01:25:07.080 But now E calls its _backward, and it deposits its gradients to A and B. 01:25:07.420 --> 01:25:10.020 But then we come back to D and call its _backward, 01:25:10.020 --> 01:25:12.860 and it overwrites those gradients at A and B. 01:25:12.880 --> 01:25:16.100 So that's obviously a problem. 01:25:17.300 --> 01:25:22.260 And the solution here — if you look at the multivariate case of the chain rule 01:25:22.260 --> 01:25:23.420 and its generalization there — 01:25:23.780 --> 01:25:27.880 the solution there is basically that we have to accumulate these gradients. 01:25:28.020 --> 01:25:29.020 These gradients add. 01:25:30.200 --> 01:25:32.820 And so instead of setting those gradients, 01:25:33.780 --> 01:25:36.260 we can simply do plus equals. 01:25:36.680 --> 01:25:38.380 We need to accumulate those gradients: 01:25:38.880 --> 01:25:42.560 plus equals, plus equals, plus equals. 01:25:42.880 --> 01:25:49.880 And this will be okay, remember, because we are initializing them at zero, 01:25:50.040 --> 01:25:58.100 so they start at zero, and then any contribution that flows backwards will simply add. 01:25:58.800 --> 01:26:05.500 So now if we redefine this one, because of the plus equals, this now works.
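The overwrite-versus-accumulate distinction in miniature, with a plain float standing in for a.grad in the b = a + a example:

```python
# b = a + a: each use of a deposits a gradient of 1.0 * b.grad,
# and since self and other are the same object, both hit a's grad.
out_grad = 1.0   # b.grad, the base case

# Overwriting (grad = ...): the second deposit clobbers the first.
a_grad = 0.0
a_grad = 1.0 * out_grad   # deposit as "self"
a_grad = 1.0 * out_grad   # deposit as "other" -- overwrites, stays 1.0
wrong = a_grad

# Accumulating (grad += ...): the deposits add, giving db/da = 2.
a_grad = 0.0
a_grad += 1.0 * out_grad
a_grad += 1.0 * out_grad
correct = a_grad
```

The accumulation is safe precisely because gradients are initialized to zero, so every backward contribution just adds on top.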
01:26:05.880 --> 01:26:09.220 Because A dot grad started at zero, and we called B dot backward, 01:26:09.660 --> 01:26:12.460 we deposit one, and then we deposit one again. 01:26:12.860 --> 01:26:14.280 And now this is two, which is correct. 01:26:14.860 --> 01:26:17.900 And here, this will also work, and we'll get correct gradients, 01:26:18.380 --> 01:26:22.060 because when we call E dot backward, we will deposit the gradients from this branch, 01:26:22.460 --> 01:26:26.480 and then when we get to D dot backward, it will deposit its own gradients. 01:26:26.900 --> 01:26:29.580 And then those gradients simply add on top of each other. 01:26:30.120 --> 01:26:32.820 And so we just accumulate those gradients, and that fixes the issue. 01:26:33.440 --> 01:26:36.320 Okay, now before we move on, let me actually do a bit of cleanup here 01:26:36.320 --> 01:26:40.000 and delete some of this intermediate work. 01:26:40.720 --> 01:26:42.620 So I'm not going to need any of this 01:26:42.620 --> 01:26:44.000 now that we've derived all of it. 01:26:45.460 --> 01:26:48.840 We are going to keep this, because I want to come back to it. 01:26:49.640 --> 01:26:56.640 Delete the tanh, delete our motivating example, delete the step, delete this, 01:26:56.960 --> 01:27:01.220 keep the code that draws, and then delete this example, 01:27:01.840 --> 01:27:04.180 and leave behind only the definition of value. 01:27:05.360 --> 01:27:08.800 And now let's come back to this non-linearity here that we implemented, the tanh. 01:27:09.060 --> 01:27:12.060 Now I told you that we could have broken down tanh 01:27:12.060 --> 01:27:17.200 into its explicit atoms in terms of other expressions if we had the exp function.
01:27:17.880 --> 01:27:19.720 So if you remember, tanh is defined like this, 01:27:20.140 --> 01:27:22.620 and we chose to develop tanh as a single function, 01:27:22.960 --> 01:27:25.080 and we can do that because we know its derivative 01:27:25.280 --> 01:27:26.440 and we can backpropagate through it. 01:27:26.880 --> 01:27:30.740 But we can also break down tanh into an explicit expression, as a function of exp. 01:27:31.160 --> 01:27:33.760 And I would like to do that now, because I want to prove to you 01:27:33.760 --> 01:27:35.760 that you get all the same results and all the same gradients, 01:27:36.300 --> 01:27:39.560 but also because it forces us to implement a few more expressions. 01:27:39.560 --> 01:27:41.940 It forces us to do exponentiation, 01:27:42.060 --> 01:27:45.160 addition, subtraction, division, and things like that. 01:27:45.160 --> 01:27:47.660 And I think it's a good exercise to go through a few more of these. 01:27:48.160 --> 01:27:51.360 Okay, so let's scroll up to the definition of value. 01:27:52.160 --> 01:27:54.560 And here, one thing that we currently can't do is — 01:27:54.560 --> 01:27:57.560 we can do, like, a value of, say, 2.0. 01:27:58.460 --> 01:28:02.360 But we can't do — you know, here, for example, we want to add a constant 1 — 01:28:02.560 --> 01:28:04.260 we can't do something like this. 01:28:05.260 --> 01:28:08.360 And we can't do it because it says int object has no attribute data. 01:28:08.660 --> 01:28:11.660 That's because a plus 1 comes right here to add, 01:28:12.160 --> 01:28:14.460 and then other is the integer 1. 01:28:14.860 --> 01:28:18.360 And then here, Python is trying to access 1.data, and that's not a thing. 01:28:18.760 --> 01:28:21.460 And that's because basically, 1 is not a value object, 01:28:21.460 --> 01:28:23.560 and we only have addition for value objects.
01:28:23.960 --> 01:28:27.760 So as a matter of convenience, so that we can create expressions like this 01:28:27.760 --> 01:28:30.860 and make them make sense, we can simply do something like this. 01:28:32.360 --> 01:28:37.060 Basically, we leave other alone if other is an instance of value. 01:28:37.260 --> 01:28:39.860 But if it's not an instance of value, we're going to assume that it's a number, 01:28:39.860 --> 01:28:41.860 like an integer or a float, and we're going to simply 01:28:41.860 --> 01:28:43.860 wrap it in value. 01:28:44.160 --> 01:28:46.060 And then other will just become value of other, 01:28:46.060 --> 01:28:48.960 and then other will have a data attribute, and this should work. 01:28:49.360 --> 01:28:52.860 So if I just say this, redefine value, then this should work. 01:28:53.360 --> 01:28:53.860 There we go. 01:28:54.360 --> 01:28:56.560 Okay, now let's do the exact same thing for multiply, 01:28:56.660 --> 01:29:01.060 because we can't do something like this, again, for the exact same reason. 01:29:01.360 --> 01:29:05.660 So we just have to go to mul, and if other is not a value, 01:29:05.660 --> 01:29:07.160 then let's wrap it in value. 01:29:07.660 --> 01:29:09.960 Let's redefine value, and now this works. 01:29:10.660 --> 01:29:11.660 Now, here's a kind of unfortunate 01:29:11.660 --> 01:29:15.760 and not obvious part: a times two works, we saw that, 01:29:15.960 --> 01:29:18.560 but two times a — is that going to work? 01:29:19.860 --> 01:29:20.960 You'd expect it to, right? 01:29:21.360 --> 01:29:22.760 But actually, it will not. 01:29:23.160 --> 01:29:25.760 And the reason it won't is because Python doesn't know — 01:29:26.260 --> 01:29:30.860 like, when you do a times two — basically, so a times two, 01:29:30.960 --> 01:29:35.760 Python will go and it will basically do something like a dot mul of two. 01:29:35.860 --> 01:29:37.060 That's basically what it will call.
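The convenience wrapping described here might look like this, on a pared-down value sketch with just data and add (micrograd applies the same one-liner in its other operators too):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # leave other alone if it's already a Value; otherwise assume
        # it's a plain int/float and wrap it, so `a + 1` makes sense
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)
```

With this in place, a + 1 no longer dies trying to access 1.data, because the integer gets promoted to Value(1) first.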
01:29:37.260 --> 01:29:41.460 But to it, two times a is the same as two dot mul of a. 01:29:41.860 --> 01:29:45.460 And it doesn't — two can't multiply a value. 01:29:45.560 --> 01:29:47.060 And so it's really confused about that. 01:29:47.560 --> 01:29:51.460 So instead, what happens is, in Python, the way this works is you are free to define 01:29:51.860 --> 01:29:53.460 something called rmul. 01:29:54.460 --> 01:29:57.260 And rmul is kind of like a fallback. 01:29:57.360 --> 01:30:03.660 So if Python can't do two times a, it will check if, by any chance, 01:30:03.860 --> 01:30:07.660 a knows how to multiply two, and that will be called into rmul. 01:30:08.860 --> 01:30:11.260 So because Python can't do two times a, 01:30:11.660 --> 01:30:13.560 it will check, is there an rmul in value? 01:30:13.860 --> 01:30:16.360 And because there is, it will now call that. 01:30:17.060 --> 01:30:20.360 And what we'll do here is we will swap the order of the operands. 01:30:20.760 --> 01:30:23.360 So basically, two times a will redirect to rmul, 01:30:23.660 --> 01:30:25.760 and rmul will basically call a times two. 01:30:26.360 --> 01:30:27.560 And that's how that will work. 01:30:28.560 --> 01:30:32.360 So redefining that with rmul, two times a becomes four. 01:30:32.860 --> 01:30:35.060 Okay, now looking at the other elements that we still need, 01:30:35.160 --> 01:30:36.960 we need to know how to exponentiate and how to divide. 01:30:37.460 --> 01:30:40.160 So let's first do the exponentiation part. 01:30:40.560 --> 01:30:41.460 We're going to introduce 01:30:41.960 --> 01:30:44.360 a single function exp here. 01:30:45.160 --> 01:30:49.860 And exp is going to mirror tanh in the sense that it's a single function 01:30:49.860 --> 01:30:52.660 that transforms a single scalar value and outputs a single scalar value. 01:30:53.260 --> 01:30:55.260 So we pop out the Python number.
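The __rmul__ fallback just described, in a minimal sketch (Python's name for "rmul" is the dunder method __rmul__; this is standard Python data-model behavior, not anything micrograd-specific):

```python
class Value:
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __rmul__(self, other):
        # fallback: `2 * a` fails as (2).__mul__(a), so Python tries
        # a.__rmul__(2); we just swap the operands and reuse __mul__
        return self * other
```

So a * 2 goes through __mul__ directly, while 2 * a takes the reflected path through __rmul__ and lands in the same code.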
01:30:55.760 --> 01:30:58.860 We use math.exp to exponentiate it, create a new value object, 01:30:59.360 --> 01:31:00.560 everything that we've seen before. 01:31:01.060 --> 01:31:04.160 The tricky part, of course, is how do you backpropagate through e to the x? 01:31:04.860 --> 01:31:10.160 And so here you can potentially pause the video and think about what should go here. 01:31:11.660 --> 01:31:18.260 Okay, so basically, we need to know what is the local derivative of e to the x. 01:31:18.560 --> 01:31:22.060 So d by dx of e to the x is famously just e to the x. 01:31:22.360 --> 01:31:26.360 And we've already just calculated e to the x, and it's inside out.data. 01:31:26.660 --> 01:31:31.260 So we can do out.data times out.grad — that's the chain rule. 01:31:32.160 --> 01:31:34.660 So we're just chaining on to the current running grad. 01:31:35.360 --> 01:31:37.160 And this is what the expression looks like. 01:31:37.360 --> 01:31:39.760 It looks a little confusing, but this is what it is. 01:31:39.760 --> 01:31:41.060 And that's the exponentiation. 01:31:41.660 --> 01:31:44.960 So redefining, we should now be able to call a.exp. 01:31:45.460 --> 01:31:48.160 And hopefully the backward pass works as well. 01:31:48.360 --> 01:31:51.660 Okay, and the last thing we'd like to do, of course, is we'd like to be able to divide. 01:31:52.360 --> 01:31:56.060 Now, I actually will implement something slightly more powerful than division, 01:31:56.060 --> 01:31:59.560 because division is just a special case of something a bit more powerful. 01:32:00.160 --> 01:32:06.960 So in particular, just by rearranging, if we have some kind of a b equals value of 4.0 here, 01:32:07.060 --> 01:32:10.860 we'd like to basically be able to do a divide b, and we'd like this to be able to give us 0.5. 01:32:11.660 --> 01:32:14.960 Now, division actually can be reshuffled as follows. 01:32:15.460 --> 01:32:19.360 If we have a divide b, that's actually the same as a multiplying 1 over b.
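A sketch of exp with its backward closure — the neat part is that out.data already holds e^x, so the forward result doubles as the local derivative:

```python
import math

class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None

    def exp(self):
        out = Value(math.exp(self.data))
        def _backward():
            # d/dx e^x = e^x, which is already sitting in out.data;
            # chain it onto the running gradient
            self.grad += out.data * out.grad
        out._backward = _backward
        return out
```

For a = Value(0.0), a.exp() gives data 1.0, and backpropagating a gradient of 1.0 through it deposits 1.0 into a.grad.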
01:32:20.060 --> 01:32:23.460 And that's the same as a multiplying b to the power of negative 1. 01:32:24.460 --> 01:32:31.460 And so what I'd like to do instead is I basically like to implement the operation of x to the k for some constant k. 01:32:31.660 --> 01:32:33.160 So it's an integer or a float. 01:32:34.160 --> 01:32:36.260 And we would like to be able to differentiate this. 01:32:36.260 --> 01:32:40.160 And then as a special case, negative 1 will be division. 01:32:40.960 --> 01:32:41.560 And so I'm doing that. 01:32:41.560 --> 01:32:46.060 Just because it's more general and you might as well do it that way. 01:32:46.460 --> 01:32:53.560 So basically what I'm saying is we can redefine division, which we will put here somewhere. 01:32:54.660 --> 01:32:55.860 You know, we can put it here somewhere. 01:32:56.360 --> 01:32:58.860 What I'm saying is that we can redefine division. 01:32:59.160 --> 01:33:00.460 So self divide other. 01:33:00.860 --> 01:33:04.960 This can actually be rewritten as self times other to the power of negative 1. 01:33:05.860 --> 01:33:10.860 And now, value raised to the power of negative 1, we have to now define that. 01:33:11.560 --> 01:33:15.660 So here's, so we need to implement the pow function. 01:33:16.160 --> 01:33:17.860 Where am I going to put the pow function? 01:33:17.860 --> 01:33:18.760 Maybe here somewhere. 01:33:20.160 --> 01:33:21.360 This is the skeleton for it. 01:33:22.560 --> 01:33:28.060 So this function will be called when we try to raise a value to some power and other will be that power. 01:33:28.760 --> 01:33:32.060 Now, I'd like to make sure that other is only an int or a float. 01:33:32.260 --> 01:33:35.360 Usually other is some kind of a different value object. 01:33:35.560 --> 01:33:38.460 But here other will be forced to be an int or a float. 01:33:38.760 --> 01:33:41.460 Otherwise, the math won't work. 01:33:41.660 --> 01:33:44.360 For what we're trying to achieve in this specific case. 
01:33:44.760 --> 01:33:48.660 That would be a different derivative expression if we wanted other to be a value. 01:33:49.760 --> 01:33:54.660 So here we create the output value, which is just, you know, this data raised to the power of other. 01:33:54.860 --> 01:33:56.660 And other here could be, for example, negative 1. 01:33:56.760 --> 01:33:58.360 That's what we are hoping to achieve. 01:33:59.460 --> 01:34:01.660 And then this is the backward stub. 01:34:01.960 --> 01:34:10.960 And this is the fun part, which is what is the chain rule expression here for back propagating through 01:34:11.060 --> 01:34:15.160 the power function, where we raise to the power of some kind of a constant. 01:34:15.860 --> 01:34:20.860 So this is the exercise and maybe pause the video here and see if you can figure it out yourself as to what we should put here. 01:34:27.060 --> 01:34:32.460 Okay, so you can actually go here and look at derivative rules as an example. 01:34:32.760 --> 01:34:35.760 And we see lots of derivative rules that you hopefully know from calculus. 01:34:35.960 --> 01:34:40.860 In particular, what we're looking for is the power rule because that's telling us that if we're trying to take 01:34:40.960 --> 01:34:48.860 d by dx of x to the n, which is what we're doing here, then that is just n times x to the n minus 1, right? 01:34:49.660 --> 01:34:55.360 Okay, so that's telling us about the local derivative of this power operation. 01:34:56.060 --> 01:35:03.060 So all we want here: basically n is now other, and self.data is x. 01:35:03.660 --> 01:35:09.560 And so this now becomes other, which is n, times self.data, 01:35:10.460 --> 01:35:10.760 which is now a number, 01:35:10.960 --> 01:35:12.460 a Python int or a float. 01:35:13.260 --> 01:35:14.360 It's not a value object. 01:35:14.360 --> 01:35:20.460 We're accessing the data attribute raised to the power of other minus 1 or n minus 1.
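Here is a minimal stand-alone sketch of the __pow__ piece just described (again only data, grad, and _backward, not the full Value class), showing the power rule chained with out.grad and division falling out as the negative-1 special case:

```python
class Value:
    # minimal sketch: only the __pow__ machinery discussed here
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __pow__(self, other):
        # other is forced to be a plain int/float, not a Value;
        # a Value exponent would need a different derivative expression
        assert isinstance(other, (int, float)), "only supporting int/float powers"
        out = Value(self.data**other, (self,))

        def _backward():
            # power rule: d/dx of x^n is n * x^(n-1), chained with out.grad
            self.grad += (other * self.data**(other - 1)) * out.grad
        out._backward = _backward
        return out

a = Value(4.0)
b = a**-1         # division as the n = -1 special case: 1/4
b.grad = 1.0
b._backward()
print(b.data, a.grad)   # 0.25, and d/da of a^-1 = -1 * a^-2 = -0.0625
```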
01:35:21.360 --> 01:35:27.960 I can put brackets around this, but this doesn't matter because power takes precedence over multiply in Python. 01:35:27.960 --> 01:35:29.060 So that would have been okay. 01:35:29.660 --> 01:35:31.360 And that's the local derivative only. 01:35:31.360 --> 01:35:36.260 But now we have to chain it, and we chain it just simply by multiplying by out.grad; that's the chain rule. 01:35:36.860 --> 01:35:39.460 And this should technically work. 01:35:40.860 --> 01:35:42.060 And we're going to find out soon. 01:35:42.360 --> 01:35:45.960 But now if we do this, this should now work. 01:35:46.860 --> 01:35:47.960 And we get 0.5. 01:35:47.960 --> 01:35:50.860 So the forward pass works, but does the backward pass work? 01:35:51.260 --> 01:35:53.960 And I realized that we actually also have to know how to subtract. 01:35:54.060 --> 01:35:58.460 So right now a minus b will not work. 01:35:58.660 --> 01:36:01.160 To make it work, we need one more piece of code here. 01:36:01.860 --> 01:36:10.760 And basically this is the subtraction, and the way we're going to implement subtraction is we're going to implement it by addition of a negation. 01:36:10.960 --> 01:36:13.460 And then to implement negation, we're going to multiply by negative one. 01:36:13.960 --> 01:36:20.760 So just again using the stuff we've already built and just expressing it in terms of what we have. And now a minus b is working. 01:36:21.260 --> 01:36:24.460 Okay, so now let's scroll again to this expression here for this neuron. 01:36:25.260 --> 01:36:28.460 And let's just compute the backward pass here. 01:36:28.460 --> 01:36:31.260 Once we've defined O and let's draw it. 01:36:32.160 --> 01:36:37.860 So here's the gradients for all these leaf nodes for this two-dimensional neuron that has a tanh that we've seen before. 01:36:38.560 --> 01:36:40.760 So now what I'd like to do is I'd like to break up 01:36:40.860 --> 01:36:43.960 this tanh into this expression here.
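A forward-only sketch of the subtraction-by-negation idea just described (the backward machinery is omitted here, since subtraction is expressed entirely in terms of the add and multiply we already have):

```python
class Value:
    # forward-only sketch: subtraction built from existing ops
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data)

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data)

    def __neg__(self):          # -self, implemented as multiply by -1
        return self * -1

    def __sub__(self, other):   # self - other, as addition of a negation
        return self + (-other)

a, b = Value(2.0), Value(4.0)
print((a - b).data)   # -2.0
```

Because __sub__ and __neg__ only route through __add__ and __mul__, no new backward pass needs to be written for them.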
01:36:44.560 --> 01:36:53.060 So let me copy paste this here, and now we will preserve the label and we will change how we define O. 01:36:53.860 --> 01:36:56.460 So in particular we're going to implement this formula here. 01:36:56.860 --> 01:37:00.560 So we need e to the 2x minus 1 over e to the 2x plus 1. 01:37:00.960 --> 01:37:05.960 So e to the 2x, we need to take 2 times n and we need to exponentiate it. 01:37:06.460 --> 01:37:09.560 That's e to the 2x, and then because we're using it twice, 01:37:09.860 --> 01:37:10.760 let's create an intermediate 01:37:10.860 --> 01:37:24.260 variable e, and then define O as e minus 1 over e plus 1, and that should be it. 01:37:24.360 --> 01:37:26.460 And then we should be able to draw dot of O. 01:37:27.160 --> 01:37:30.360 So now before I run this, what do we expect to see? 01:37:31.060 --> 01:37:37.160 Number one, we're expecting to see a much longer graph here because we've broken up tanh into a bunch of other operations. 01:37:37.760 --> 01:37:40.060 But those operations are mathematically equivalent. 01:37:40.360 --> 01:37:40.760 And so what we're expecting 01:37:40.960 --> 01:37:44.460 to see is number one, the same result here. 01:37:44.560 --> 01:37:48.260 So the forward pass works, and number two, because of that mathematical equivalence, 01:37:48.560 --> 01:37:52.460 we expect to see the same backward pass and the same gradients on these leaf nodes. 01:37:52.860 --> 01:37:54.460 So these gradients should be identical. 01:37:55.160 --> 01:37:56.460 So let's run this. 01:37:57.960 --> 01:38:01.260 So number one, let's verify that instead of a single tanh node 01:38:01.360 --> 01:38:06.460 we now have exp, and we have plus, we have times negative one. 01:38:07.060 --> 01:38:10.760 This is the division, and we end up with the same forward pass 01:38:10.960 --> 01:38:12.860 here. And then the gradients.
01:38:12.960 --> 01:38:14.860 We have to be careful because they're in a slightly different order potentially. 01:38:14.960 --> 01:38:26.760 The gradients for W2 and X2 should be 0 and 0.5; W2 and X2 are 0 and 0.5. And W1 and X1 should be 1 and negative 1.5; W1 and X1 are 1 and negative 1.5. 01:38:27.360 --> 01:38:34.960 So that means that both our forward passes and backward passes were correct, because this turned out to be equivalent to tanh before. 01:38:35.960 --> 01:38:38.660 And so the reason I wanted to go through this exercise is number one, 01:38:38.960 --> 01:38:40.760 we got to practice a few more operations 01:38:41.060 --> 01:38:43.960 and writing more backward passes, and number two, 01:38:44.160 --> 01:38:51.260 I wanted to illustrate the point that the level at which you implement your operations is totally up to you. 01:38:51.460 --> 01:38:56.360 You can implement backward passes for tiny expressions like a single individual plus or a single times. 01:38:56.860 --> 01:39:01.560 Or you can implement them for, say, tanh, which is kind of a, 01:39:01.660 --> 01:39:05.960 you can see it as a composite operation, because it's made up of all these more atomic operations. 01:39:06.460 --> 01:39:08.460 But really all of this is kind of like a fake concept. 01:39:08.660 --> 01:39:10.460 All that matters is we have some kind of inputs 01:39:10.460 --> 01:39:13.760 and some kind of an output, and this output is a function of the inputs in some way. 01:39:13.960 --> 01:39:18.060 And as long as you can do the forward pass and the backward pass of that little operation, 01:39:18.460 --> 01:39:22.660 it doesn't matter what that operation is and how composite it is. 01:39:23.160 --> 01:39:27.260 If you can write the local gradients, you can chain the gradient and you can continue back propagation. 01:39:27.460 --> 01:39:30.960 So the design of what those functions are is completely up to you.
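The mathematical equivalence claimed above can be checked numerically with plain floats: the decomposition tanh(x) = (e^(2x) - 1) / (e^(2x) + 1) should match math.tanh in the forward pass, and its derivative 1 - tanh(x)^2 should match a finite-difference estimate.

```python
import math

def tanh_composite(x):
    # the decomposed form used in the lecture: (e^(2x) - 1) / (e^(2x) + 1)
    e = math.exp(2 * x)
    return (e - 1) / (e + 1)

x = 0.8814  # roughly the activation n from the neuron example

# forward passes agree
assert abs(tanh_composite(x) - math.tanh(x)) < 1e-12

# derivative of tanh is 1 - tanh(x)^2; compare against central differences
h = 1e-6
numeric = (tanh_composite(x + h) - tanh_composite(x - h)) / (2 * h)
analytic = 1 - math.tanh(x)**2
print(numeric, analytic)   # both close to 0.5 at this particular x
```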
01:39:32.060 --> 01:39:36.960 So now I would like to show you how you can do the exact same thing but using a modern deep neural network library, 01:39:37.060 --> 01:39:38.460 like for example PyTorch, 01:39:38.860 --> 01:39:40.360 which I've roughly modeled 01:39:40.560 --> 01:39:42.360 micrograd after. 01:39:43.060 --> 01:39:45.760 And so PyTorch is something you would use in production. 01:39:46.160 --> 01:39:49.360 And I'll show you how you can do the exact same thing but in the PyTorch API. 01:39:49.860 --> 01:39:52.660 So I'm just going to copy paste it in and walk you through it a little bit. 01:39:52.860 --> 01:39:53.760 This is what it looks like. 01:39:54.960 --> 01:39:56.560 So we're going to import PyTorch. 01:39:57.060 --> 01:40:01.460 And then we need to define these value objects like we have here. 01:40:01.960 --> 01:40:05.360 Now micrograd is a scalar valued engine, 01:40:05.460 --> 01:40:08.560 so we only have scalar values like 2.0. 01:40:09.160 --> 01:40:09.760 But in PyTorch, 01:40:10.460 --> 01:40:11.760 everything is based around tensors. 01:40:12.060 --> 01:40:15.860 And like I mentioned, tensors are just n dimensional arrays of scalars. 01:40:16.360 --> 01:40:19.260 So that's why things get a little bit more complicated here. 01:40:19.360 --> 01:40:21.760 I just need a scalar valued tensor, 01:40:21.860 --> 01:40:23.460 a tensor with just a single element. 01:40:24.060 --> 01:40:30.760 But by default when you work with PyTorch you would use more complicated tensors like this. 01:40:31.060 --> 01:40:32.360 So if I import PyTorch, 01:40:34.560 --> 01:40:36.360 then I can create tensors like this. 01:40:36.760 --> 01:40:37.960 And this tensor for example 01:40:38.060 --> 01:40:39.960 is a 2x3 array 01:40:39.960 --> 01:40:44.660 of scalars in a single compact representation. 01:40:45.060 --> 01:40:46.060 So we can check its shape. 01:40:46.160 --> 01:40:48.860 We see that it's a 2x3 array and so on.
01:40:49.560 --> 01:40:53.160 So this is usually what you would work with in the actual libraries. 01:40:53.660 --> 01:40:58.960 So here I'm creating a tensor that has only a single element 2.0. 01:41:00.560 --> 01:41:03.160 And then I'm casting it to be double. 01:41:03.660 --> 01:41:07.860 Because Python is by default using double precision for its floating point numbers. 01:41:07.960 --> 01:41:09.760 So I'd like everything to be identical. 01:41:09.960 --> 01:41:14.360 By default the data type of these tensors will be float32. 01:41:14.460 --> 01:41:16.460 So it's only using a single precision float. 01:41:16.560 --> 01:41:18.260 So I'm casting it to double. 01:41:18.960 --> 01:41:21.860 So that we have float64 just like in Python. 01:41:22.660 --> 01:41:23.860 So I'm casting to double. 01:41:24.060 --> 01:41:27.660 And then we get something similar to value of 2. 01:41:28.060 --> 01:41:30.460 The next thing I have to do is because these are leaf nodes. 01:41:30.560 --> 01:41:33.660 By default PyTorch assumes that they do not require gradients. 01:41:33.860 --> 01:41:37.560 So I need to explicitly say that all of these nodes require gradients. 01:41:37.960 --> 01:41:38.460 Okay. 01:41:38.560 --> 01:41:39.660 So this is going to construct. 01:41:40.060 --> 01:41:42.860 Scalar valued one element tensors. 01:41:43.460 --> 01:41:45.660 Make sure that PyTorch knows that they require gradients. 01:41:46.260 --> 01:41:49.960 Now by default these are set to false by the way because of efficiency reasons. 01:41:50.160 --> 01:41:52.960 Because usually you would not want gradients for leaf nodes. 01:41:53.660 --> 01:41:55.460 Like the inputs to the network. 01:41:55.660 --> 01:41:58.460 And this is just trying to be efficient in the most common cases. 01:41:59.360 --> 01:42:02.360 So once we've defined all of our values in PyTorch land. 01:42:02.660 --> 01:42:05.660 We can perform arithmetic just like we can here in micrograd land. 
01:42:05.960 --> 01:42:06.860 So this would just work. 01:42:07.260 --> 01:42:09.060 And then there's a torch.tanh also. 01:42:09.660 --> 01:42:12.060 And what we get back is a tensor again. 01:42:12.660 --> 01:42:14.960 And, just like in micrograd, 01:42:15.060 --> 01:42:17.760 it's got a data attribute and it's got a grad attribute. 01:42:18.360 --> 01:42:22.360 So these tensor objects, just like in micrograd, have a dot data and a dot grad. 01:42:22.860 --> 01:42:26.260 And the only difference here is that we need to call a dot item, 01:42:26.660 --> 01:42:33.960 because dot item basically takes a single tensor of one element 01:42:34.060 --> 01:42:36.860 and it just returns that element, stripping out the tensor. 01:42:37.960 --> 01:42:38.860 So let me just run this. 01:42:38.860 --> 01:42:40.360 And hopefully we are going to get, 01:42:40.460 --> 01:42:44.560 this is going to print the forward pass, which is 0.707. 01:42:45.060 --> 01:42:50.760 And this will be the gradients, which hopefully are 0.50, negative 1.5, and 1. 01:42:51.260 --> 01:42:52.460 So if we just run this. 01:42:54.060 --> 01:42:54.460 There we go. 01:42:55.160 --> 01:42:55.560 0.7. 01:42:55.560 --> 01:42:56.960 So the forward pass agrees. 01:42:57.260 --> 01:42:59.860 And then 0.50, negative 1.5, and 1. 01:43:00.860 --> 01:43:02.260 So PyTorch agrees with us. 01:43:02.860 --> 01:43:04.060 And just to show you here, basically, 01:43:04.060 --> 01:43:07.060 O here is a tensor with a single element, 01:43:07.060 --> 01:43:09.160 and it's a double. 01:43:09.760 --> 01:43:13.660 And we can call dot item on it to just get the single number out. 01:43:14.460 --> 01:43:15.660 So that's what item does. 01:43:16.060 --> 01:43:18.360 And O is a tensor object like I mentioned. 01:43:18.660 --> 01:43:21.160 And it's got a backward function just like we've implemented. 01:43:22.260 --> 01:43:24.260 And then all of these also have a dot grad.
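The gradients that PyTorch (and micrograd) report for this neuron can also be verified by hand-calculus with plain floats: for o = tanh(x1*w1 + x2*w2 + b), the local derivative through tanh is 1 - o^2, and the chain rule pushes that onto each input. The exact input values below are the ones used in the lecture's example.

```python
import math

# the lecture's two-input neuron example
x1, x2 = 2.0, 0.0
w1, w2 = -3.0, 1.0
b = 6.8813735870195432   # bias chosen so o comes out to ~0.7071

n = x1*w1 + x2*w2 + b    # raw activation
o = math.tanh(n)         # forward pass: ~0.7071

do_dn = 1 - o**2         # derivative of tanh at n (~0.5)
grads = {
    'x1': w1 * do_dn, 'w1': x1 * do_dn,   # chain rule through the products
    'x2': w2 * do_dn, 'w2': x2 * do_dn,
}
print(round(o, 4), {k: round(v, 4) for k, v in grads.items()})
# o ≈ 0.7071; x2 → 0.5, w2 → 0.0, x1 → -1.5, w1 → 1.0, matching the transcript
```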
01:43:24.260 --> 01:43:26.060 So like X2 for example has a grad. 01:43:26.360 --> 01:43:27.060 And it's a tensor. 01:43:27.360 --> 01:43:30.060 And we can pop out the individual number with dot item. 01:43:31.560 --> 01:43:36.860 So basically Torch can do what we did in micrograd as a special case, 01:43:37.060 --> 01:43:40.060 when your tensors are all single element tensors. 01:43:40.560 --> 01:43:43.860 But the big deal with PyTorch is that everything is significantly more efficient, 01:43:44.160 --> 01:43:46.560 because we are working with these tensor objects 01:43:46.760 --> 01:43:50.060 and we can do lots of operations in parallel on all of these tensors. 01:43:51.660 --> 01:43:55.160 But otherwise what we've built very much agrees with the API of PyTorch. 01:43:55.760 --> 01:43:59.660 Okay, so now that we have some machinery to build out pretty complicated mathematical expressions, 01:43:59.960 --> 01:44:01.860 we can also start building up neural nets. 01:44:02.060 --> 01:44:06.260 And as I mentioned, neural nets are just a specific class of mathematical expressions. 01:44:07.060 --> 01:44:09.460 So we're going to start building out a neural net piece by piece. 01:44:09.460 --> 01:44:13.860 And eventually we'll build out a two-layer multi-layer perceptron, as it's called. 01:44:14.160 --> 01:44:15.560 And I'll show you exactly what that means. 01:44:16.060 --> 01:44:17.660 Let's start with a single individual neuron. 01:44:18.060 --> 01:44:19.260 We've implemented one here. 01:44:19.660 --> 01:44:24.060 But here I'm going to implement one that also subscribes to the PyTorch API 01:44:24.060 --> 01:44:26.860 and how it designs its neural network modules. 01:44:27.460 --> 01:44:32.660 So just like we saw that we can match the API of PyTorch on the autograd side, 01:44:33.160 --> 01:44:35.360 we're going to try to do that on the neural network modules. 01:44:36.060 --> 01:44:36.960 So here's class neuron.
01:44:37.460 --> 01:44:40.660 And just for the sake of efficiency, 01:44:40.960 --> 01:44:44.360 I'm going to copy paste some sections that are relatively straightforward. 01:44:45.660 --> 01:44:49.460 So the constructor will take number of inputs to this neuron, 01:44:49.460 --> 01:44:52.160 which is how many inputs come to a neuron. 01:44:52.560 --> 01:44:54.260 So this one for example has three inputs. 01:44:55.360 --> 01:44:56.860 And then it's going to create a weight 01:44:57.360 --> 01:45:00.860 that is some random number between negative one and one for every one of those inputs, 01:45:01.360 --> 01:45:05.160 and a bias that controls the overall trigger happiness of this neuron. 01:45:05.160 --> 01:45:12.560 And then we're going to implement a def underscore underscore call of self and x, 01:45:12.560 --> 01:45:13.560 some input x. 01:45:13.560 --> 01:45:16.860 And really what we want to do here is w times x plus b, 01:45:16.860 --> 01:45:20.260 where w times x here is a dot product specifically. 01:45:20.260 --> 01:45:23.260 Now, if you haven't seen call, 01:45:23.260 --> 01:45:26.160 let me just return 0.0 here for now. 01:45:26.160 --> 01:45:30.360 The way this works now is we can have an x which is say like 2.0, 3.0. 01:45:30.360 --> 01:45:33.260 Then we can initialize a neuron that is two-dimensional, 01:45:33.260 --> 01:45:35.060 because these are two numbers. 01:45:35.160 --> 01:45:38.560 And then we can feed those two numbers into that neuron to get an output. 01:45:39.560 --> 01:45:42.560 And so when you use this notation n of x, 01:45:42.560 --> 01:45:44.060 Python will use call. 01:45:44.860 --> 01:45:46.960 So currently call just returns 0.0. 01:45:49.960 --> 01:45:53.960 Now we'd like to actually do the forward pass of this neuron instead. 01:45:54.760 --> 01:45:56.360 So what we're going to do here first 01:45:56.560 --> 01:46:00.260 is we need to basically multiply all of the elements of w
01:46:00.260 --> 01:46:02.460 with all of the elements of x pairwise. 01:46:02.460 --> 01:46:03.460 We need to multiply them. 01:46:04.160 --> 01:46:05.060 So the first thing we're going to do 01:46:05.060 --> 01:46:09.060 is we're going to zip up self.w and x. 01:46:09.060 --> 01:46:12.660 And in Python, zip takes two iterators 01:46:12.660 --> 01:46:17.960 and it creates a new iterator that iterates over the tuples of their corresponding entries. 01:46:17.960 --> 01:46:22.060 So for example, just to show you, we can print this list 01:46:22.060 --> 01:46:25.860 and still return 0.0 here. 01:46:34.260 --> 01:46:37.460 So we see that these w's are paired up with the x's, 01:46:37.460 --> 01:46:38.660 w with x. 01:46:41.660 --> 01:46:43.160 And now what we want to do is, 01:46:47.560 --> 01:46:49.860 for wi, xi in this zip, 01:46:50.860 --> 01:46:54.260 we want to multiply wi times xi, 01:46:55.060 --> 01:46:57.260 and then we want to sum all of that together 01:46:57.760 --> 01:46:59.060 to come up with an activation, 01:46:59.760 --> 01:47:01.660 and also add self.b on top. 01:47:02.460 --> 01:47:03.660 So that's the raw activation. 01:47:04.260 --> 01:47:06.860 And then of course we need to pass that through a nonlinearity. 01:47:06.860 --> 01:47:10.260 So what we're going to be returning is act dot tanh. 01:47:10.260 --> 01:47:12.460 And here's out. 01:47:12.460 --> 01:47:15.760 So now we see that we are getting some outputs, 01:47:15.760 --> 01:47:17.560 and we get a different output from a neuron each time, 01:47:17.560 --> 01:47:21.560 because we are initializing different weights and biases. 01:47:21.560 --> 01:47:23.660 And then to be a bit more efficient here actually: 01:47:23.660 --> 01:47:27.560 sum, by the way, takes a second optional parameter, 01:47:27.560 --> 01:47:29.160 which is the start. 01:47:29.160 --> 01:47:31.660 And by default the start is 0.
01:47:31.660 --> 01:47:33.660 So these elements of this sum 01:47:33.660 --> 01:47:35.860 will be added on top of 0 to begin with. 01:47:35.860 --> 01:47:37.660 But actually we can just start with self.b. 01:47:38.560 --> 01:47:40.260 And then we just have an expression like this. 01:47:45.660 --> 01:47:48.760 And then the generator expression here must be parenthesized in Python. 01:47:49.560 --> 01:47:50.060 There we go. 01:47:53.960 --> 01:47:56.260 Yep, so now we can forward a single neuron. 01:47:56.660 --> 01:47:59.060 Next up we're going to define a layer of neurons. 01:47:59.460 --> 01:48:02.060 So here we have a schematic for an MLP. 01:48:02.660 --> 01:48:03.560 So we see that 01:48:03.660 --> 01:48:05.360 in these MLPs, each layer, 01:48:05.360 --> 01:48:06.460 this is one layer, 01:48:06.460 --> 01:48:07.960 has actually a number of neurons, 01:48:07.960 --> 01:48:09.160 and they're not connected to each other, 01:48:09.160 --> 01:48:11.460 but all of them are fully connected to the input. 01:48:11.460 --> 01:48:13.160 So what is a layer of neurons? 01:48:13.160 --> 01:48:16.760 It's just a set of neurons evaluated independently. 01:48:16.760 --> 01:48:19.160 So in the interest of time, 01:48:19.160 --> 01:48:23.160 I'm going to do something fairly straightforward here. 01:48:23.160 --> 01:48:29.160 It's literally, a layer is just a list of neurons. 01:48:29.160 --> 01:48:30.760 And then how many neurons do we have? 01:48:30.760 --> 01:48:32.760 We take that as an input argument here. 01:48:32.760 --> 01:48:35.860 How many neurons do you want in your layer? That's the number of outputs in this layer. 01:48:36.760 --> 01:48:41.160 And so we just initialize completely independent neurons with this given dimensionality. 01:48:41.460 --> 01:48:42.960 And when we call on it, 01:48:42.960 --> 01:48:45.860 we just independently evaluate them. 01:48:46.460 --> 01:48:49.660 So now instead of a neuron we can make a layer of neurons.
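As a sketch of the Neuron class built above, here is a stand-alone version that uses plain floats and math.tanh in place of the lecture's Value objects (so it forwards but does not backpropagate); note the use of sum's start argument to fold the bias in, exactly as discussed:

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        # one random weight in [-1, 1] per input, plus a bias that
        # controls the overall trigger happiness of the neuron
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        # w · x + b: zip pairs each wi with its xi, and sum's second
        # argument starts the accumulation at the bias instead of 0
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        # squash the raw activation with the tanh nonlinearity
        return math.tanh(act)

n = Neuron(2)
out = n([2.0, 3.0])   # n of x triggers __call__
print(out)            # some number in (-1, 1), different each run
```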
01:48:49.660 --> 01:48:51.860 They are two dimensional neurons and let's have three of them. 01:48:52.460 --> 01:48:57.660 And now we see that we have three independent evaluations of three different neurons, right? 01:48:58.960 --> 01:48:59.160 Okay. 01:48:59.160 --> 01:49:02.560 And finally, let's complete this picture and define an entire multi-layer 01:49:02.560 --> 01:49:04.060 perceptron, or MLP. 01:49:04.060 --> 01:49:08.460 And as we can see here, in an MLP these layers just feed into each other sequentially. 01:49:08.460 --> 01:49:13.460 So let's come here and I'm just going to copy the code here in the interest of time. 01:49:13.460 --> 01:49:16.060 So an MLP is very similar. 01:49:16.060 --> 01:49:19.260 We're taking the number of inputs as before. 01:49:19.260 --> 01:49:23.360 But now instead of taking a single nout, which is the number of neurons in a single layer, 01:49:23.360 --> 01:49:29.560 we're going to take a list of nouts, and this list defines the sizes of all the layers that we want in our MLP. 01:49:29.560 --> 01:49:32.260 So here we just put them all together and then iterate 01:49:32.560 --> 01:49:37.060 over consecutive pairs of these sizes and create layer objects for them. 01:49:37.760 --> 01:49:40.160 And then in the call function, we are just calling them sequentially. 01:49:40.460 --> 01:49:41.960 So that's an MLP really. 01:49:42.760 --> 01:49:44.460 And let's actually re-implement this picture. 01:49:44.460 --> 01:49:48.460 So we want three input neurons and then two layers of four and an output unit. 01:49:49.560 --> 01:49:53.360 So we want three dimensional input. 01:49:53.460 --> 01:49:54.760 Say this is an example input. 01:49:55.160 --> 01:49:59.960 We want three inputs into two layers of four and one output. 01:50:00.360 --> 01:50:01.960 And this of course is an MLP. 01:50:02.560 --> 01:50:04.560 And there we go. 01:50:04.560 --> 01:50:06.160 That's a forward pass of an MLP.
01:50:06.160 --> 01:50:08.260 To make this a little bit nicer, 01:50:08.260 --> 01:50:13.060 you see how we have just a single element, but it's wrapped in a list because layer always returns lists. 01:50:13.060 --> 01:50:20.060 So for convenience, return outs at zero if len outs is exactly a single element, 01:50:20.060 --> 01:50:22.060 else return the full list, outs. 01:50:22.060 --> 01:50:27.060 And this will allow us to just get a single value out at the last layer that only has a single neuron. 01:50:27.060 --> 01:50:31.060 And finally, we should be able to draw dot of n of x. 01:50:32.560 --> 01:50:37.960 As you might imagine, these expressions are now getting relatively involved. 01:50:38.660 --> 01:50:41.160 So this is an entire MLP that we're defining now, 01:50:45.360 --> 01:50:47.360 all the way until a single output. 01:50:48.360 --> 01:50:53.460 Okay, and so obviously you would never differentiate on pen and paper these expressions. 01:50:53.660 --> 01:51:02.360 But with micrograd, we will be able to back propagate all the way through this and back propagate into these weights of all these neurons. 01:51:02.560 --> 01:51:04.360 So let's see how that works. 01:51:04.360 --> 01:51:08.360 Okay, so let's create ourselves a very simple example data set here. 01:51:08.360 --> 01:51:10.360 So this data set has four examples. 01:51:10.360 --> 01:51:15.360 And so we have four possible inputs into the neural net. 01:51:15.360 --> 01:51:17.360 And we have four desired targets. 01:51:17.360 --> 01:51:24.360 So we'd like the neural net to assign or output 1.0 when it's fed this example, 01:51:24.360 --> 01:51:26.360 negative one when it's fed these examples, 01:51:26.360 --> 01:51:28.360 and one when it's fed this example. 01:51:28.360 --> 01:51:32.360 So it's a very simple binary classifier neural net basically that we would like here. 01:51:32.560 --> 01:51:35.960 Now let's think what the neural net currently thinks about these four examples.
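Here is a float-based sketch of the Layer and MLP classes described above, including the single-element unwrapping convenience just mentioned; as before, plain floats and math.tanh stand in for the Value objects that make the lecture's version differentiable.

```python
import math
import random

class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)

    def __call__(self, x):
        return math.tanh(sum((wi * xi for wi, xi in zip(self.w, x)), self.b))

class Layer:
    def __init__(self, nin, nout):
        # a layer is just nout independent neurons, each fully
        # connected to the same nin inputs
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # convenience: unwrap the list when the layer has one neuron
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        # e.g. MLP(3, [4, 4, 1]): 3 inputs, two layers of 4, 1 output
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        # the layers just feed into each other sequentially
        for layer in self.layers:
            x = layer(x)
        return x

n = MLP(3, [4, 4, 1])
out = n([2.0, 3.0, -1.0])
print(out)   # a single scalar, since the last layer has one neuron
```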
01:51:35.960 --> 01:51:38.260 We can just get their predictions. 01:51:38.260 --> 01:51:42.160 Basically, we can just call n of x for x in xs. 01:51:42.160 --> 01:51:45.160 And then we can print. 01:51:45.160 --> 01:51:48.960 So these are the outputs of the neural net on those four examples. 01:51:48.960 --> 01:51:53.860 So the first one is 0.91, but we'd like it to be one. 01:51:53.860 --> 01:51:55.860 So we should push this one higher. 01:51:55.860 --> 01:51:58.260 This one we want to be higher. 01:51:58.260 --> 01:52:02.360 This one says 0.88, and we want this to be negative one. 01:52:02.560 --> 01:52:05.160 This is 0.88, we want it to be negative one. 01:52:05.160 --> 01:52:08.160 And this one is 0.88, we want it to be one. 01:52:08.160 --> 01:52:10.160 So how do we make the neural net do that? 01:52:10.160 --> 01:52:16.760 How do we tune the weights to better predict the desired targets? 01:52:16.760 --> 01:52:21.860 And the trick used in deep learning to achieve this is to calculate a single number 01:52:21.860 --> 01:52:25.160 that somehow measures the total performance of your neural net. 01:52:25.160 --> 01:52:28.060 And we call this single number the loss. 01:52:28.060 --> 01:52:32.260 So the loss first is a single number 01:52:32.560 --> 01:52:36.260 that we're going to define that basically measures how well the neural net is performing. 01:52:36.260 --> 01:52:38.560 Right now, we have the intuitive sense that it's not performing very well, 01:52:38.560 --> 01:52:41.060 because we're not very close to the targets. 01:52:41.060 --> 01:52:44.860 So the loss will be high, and we'll want to minimize the loss. 01:52:44.860 --> 01:52:49.860 So in particular, in this case, what we're going to do is we're going to implement the mean squared error loss. 01:52:49.860 --> 01:52:54.160 So what this is doing is we're going to basically iterate 01:52:54.160 --> 01:53:01.360 for y ground truth and y output in zip of ys and ypred.
01:53:01.360 --> 01:53:02.360 So we're going to pair up 01:53:02.360 --> 01:53:09.060 the ground truths with the predictions, and the zip iterates over tuples of them. 01:53:09.060 --> 01:53:13.260 And for each Y ground truth and Y output, 01:53:13.260 --> 01:53:18.760 we're going to subtract them and square them. 01:53:18.760 --> 01:53:23.060 So let's first see what these losses are. These are individual loss components. 01:53:23.060 --> 01:53:26.560 And so basically for each one of the four, 01:53:26.560 --> 01:53:29.560 we are taking the prediction and the ground truth. 01:53:29.560 --> 01:53:31.560 We are subtracting them and squaring them. 01:53:32.360 --> 01:53:36.360 So because this one is so close to its target, 01:53:36.360 --> 01:53:42.060 0.91 is almost 1, subtracting them gives a very small number. 01:53:42.060 --> 01:53:44.360 So here we would get like a negative 0.1, 01:53:44.360 --> 01:53:48.960 and then squaring it just makes sure that regardless of 01:53:48.960 --> 01:53:51.060 whether we are more negative or more positive, 01:53:51.060 --> 01:53:54.460 we always get a positive number. 01:53:54.460 --> 01:53:56.160 Instead of squaring, we could also take, 01:53:56.160 --> 01:53:59.560 for example, the absolute value. We need to discard the sign. 01:53:59.560 --> 01:54:02.260 And so you see that the expression is arranged so that you 01:54:02.360 --> 01:54:06.560 only get 0 exactly when Y out is equal to Y ground truth. 01:54:06.560 --> 01:54:07.660 When those two are equal, 01:54:07.660 --> 01:54:09.560 so your prediction is exactly the target, 01:54:09.560 --> 01:54:13.060 you are going to get 0. And if your prediction is not the target, 01:54:13.060 --> 01:54:15.260 you are going to get some other number. 01:54:15.260 --> 01:54:17.160 So here, for example, we are way off. 01:54:17.160 --> 01:54:19.960 And so that's why the loss is quite high. 01:54:19.960 --> 01:54:24.160 And the more off we are, the greater the loss will be.
01:54:24.160 --> 01:54:27.960 So we don't want high loss, we want low loss. 01:54:27.960 --> 01:54:32.160 And so the final loss here will be just the sum of 01:54:32.360 --> 01:54:34.460 all of these numbers. 01:54:34.460 --> 01:54:38.660 So you see that this should be 0 roughly plus 0 roughly, 01:54:38.660 --> 01:54:40.960 but plus 7. 01:54:40.960 --> 01:54:45.060 So loss should be about 7 here. 01:54:45.060 --> 01:54:47.360 And now we want to minimize the loss. 01:54:47.360 --> 01:54:51.660 We want the loss to be low because if loss is low, 01:54:51.660 --> 01:54:56.560 then every one of the predictions is close to its target. 01:54:56.560 --> 01:54:59.060 So the loss, the lowest it can be is 0, 01:54:59.060 --> 01:55:02.160 and the greater it is, the worse off the neural net is, 01:55:02.360 --> 01:55:05.160 and the worse its predictions are. 01:55:05.160 --> 01:55:08.760 So now, of course, if we do loss.backward, 01:55:08.760 --> 01:55:11.560 something magical happened when I hit enter. 01:55:11.560 --> 01:55:14.360 And the magical thing, of course, that happened is that we can look at 01:55:14.360 --> 01:55:19.660 n.layers at, say, the first layer, 01:55:19.660 --> 01:55:23.360 dot neurons at 0, 01:55:23.360 --> 01:55:27.360 because remember that MLP has the layers, which is a list, 01:55:27.360 --> 01:55:29.660 and each layer has neurons, which is a list, 01:55:29.660 --> 01:55:31.560 and that gives us an individual neuron, 01:55:31.560 --> 01:55:33.560 and that gives us some weights. 01:55:33.560 --> 01:55:40.560 And so we can, for example, look at the weights at 0. 01:55:40.560 --> 01:55:44.560 Oops, it's not called weights, it's called w. 01:55:44.560 --> 01:55:48.560 And that's a value, but now this value also has a grad 01:55:48.560 --> 01:55:50.560 because of the backward pass.
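The "about 7" estimate above can be reproduced with plain floats: pair each prediction with its target, square the differences, and sum. The prediction values below are the approximate ones read off the screen in the lecture (0.91 and three 0.88s), so the exact total is illustrative.

```python
# mean squared error on the four-example data set, with plain floats
ys    = [1.0, -1.0, -1.0, 1.0]       # desired targets
ypred = [0.91, 0.88, 0.88, 0.88]     # outputs roughly as shown above

# each component is (prediction - target)^2: zero only at a perfect match,
# positive otherwise, regardless of the sign of the error
components = [(yout - ygt)**2 for ygt, yout in zip(ys, ypred)]
loss = sum(components)
print(components, loss)
# two tiny terms plus two large ones (the 0.88s that should be -1), ≈ 7.09
```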
01:55:50.560 --> 01:55:53.560 And so we see that because this gradient here 01:55:53.560 --> 01:55:55.560 on this particular weight of this particular neuron 01:55:55.560 --> 01:55:58.560 of this particular layer is negative, 01:55:58.560 --> 01:56:01.360 we see that its influence on the loss is also negative. 01:56:01.360 --> 01:56:05.360 So slightly increasing this particular weight of this neuron of this layer 01:56:05.360 --> 01:56:08.360 would make the loss go down. 01:56:08.360 --> 01:56:12.360 And we actually have this information for every single one of our neurons 01:56:12.360 --> 01:56:13.360 and all of their parameters. 01:56:13.360 --> 01:56:17.360 Actually, it's worth looking at also the draw dot of loss, by the way. 01:56:17.360 --> 01:56:21.360 So previously, we looked at the draw dot of a single neuron forward pass, 01:56:21.360 --> 01:56:23.360 and that was already a large expression. 01:56:23.360 --> 01:56:25.360 But what is this expression? 01:56:25.360 --> 01:56:28.360 We actually forwarded every one of those four examples, 01:56:28.360 --> 01:56:30.360 and then we have the loss on top of them, 01:56:30.360 --> 01:56:32.360 with the mean squared error. 01:56:32.360 --> 01:56:35.360 And so this is a really massive graph, 01:56:35.360 --> 01:56:38.360 because this graph that we've built up now, 01:56:38.360 --> 01:56:40.360 oh my gosh, 01:56:42.360 --> 01:56:44.360 is kind of excessive. 01:56:44.360 --> 01:56:47.360 It's excessive because it has four forward passes of a neural net 01:56:47.360 --> 01:56:49.360 for every one of the examples, 01:56:49.360 --> 01:56:51.360 and then it has the loss on top, 01:56:51.360 --> 01:56:54.360 and it ends with the value of the loss, which was 7.12.
01:56:54.360 --> 01:56:58.360 And this loss will now back propagate through all the four forward passes 01:56:58.360 --> 01:56:59.360 all the way through, 01:56:59.360 --> 01:57:02.360 just every single intermediate value of the neural net, 01:57:02.360 --> 01:57:04.360 all the way back to, 01:57:04.360 --> 01:57:06.360 of course, the parameters of the weights, 01:57:06.360 --> 01:57:07.360 which are the input. 01:57:07.360 --> 01:57:11.360 So these weight parameters here are inputs to this neural net, 01:57:11.360 --> 01:57:13.360 and these numbers here, 01:57:13.360 --> 01:57:14.360 these scalars, 01:57:14.360 --> 01:57:16.360 are inputs to the neural net. 01:57:16.360 --> 01:57:18.360 So if we went around here, 01:57:18.360 --> 01:57:21.360 we will probably find some of these examples, 01:57:21.360 --> 01:57:22.360 this 1.0, 01:57:22.360 --> 01:57:24.360 potentially maybe this 1.0, 01:57:24.360 --> 01:57:26.360 or, you know, some of the others. 01:57:26.360 --> 01:57:28.360 And you'll see that they all have gradients as well. 01:57:28.360 --> 01:57:31.360 The thing is these gradients on the input data 01:57:31.360 --> 01:57:33.360 are not that useful to us, 01:57:33.360 --> 01:57:37.360 and that's because the input data seems to be not changeable. 01:57:37.360 --> 01:57:39.360 It's a given to the problem, 01:57:39.360 --> 01:57:40.360 and so it's a fixed input. 01:57:40.360 --> 01:57:42.360 We're not going to be changing it or messing with it, 01:57:42.360 --> 01:57:45.360 even though we do have gradients for it. 01:57:45.360 --> 01:57:48.360 But some of these gradients here 01:57:48.360 --> 01:57:50.360 will be for the neural network parameters, 01:57:50.360 --> 01:57:52.360 the w's and the b's, 01:57:52.360 --> 01:57:55.360 and those we, of course, we want to change. 
01:57:55.360 --> 01:57:57.360 Okay, so now we're going to want 01:57:57.360 --> 01:57:59.360 some convenience code to gather up 01:57:59.360 --> 01:58:01.360 all of the parameters of the neural net 01:58:01.360 --> 01:58:04.360 so that we can operate on all of them simultaneously. 01:58:04.360 --> 01:58:05.360 And every one of them, 01:58:05.360 --> 01:58:08.360 we will nudge a tiny amount 01:58:08.360 --> 01:58:10.360 based on the gradient information. 01:58:10.360 --> 01:58:12.360 So let's collect the parameters of the neural net 01:58:12.360 --> 01:58:14.360 all in one array. 01:58:14.360 --> 01:58:17.360 So let's create a parameters(self) 01:58:17.360 --> 01:58:19.360 that just returns 01:58:19.360 --> 01:58:22.360 self.w, which is a list, 01:58:22.360 --> 01:58:26.360 concatenated with a list of self.b. 01:58:27.360 --> 01:58:29.360 So this will just return a list. 01:58:29.360 --> 01:58:32.360 List plus list just gives you a list. 01:58:32.360 --> 01:58:34.360 So that's parameters of Neuron, 01:58:34.360 --> 01:58:36.360 and I'm calling it this way 01:58:36.360 --> 01:58:38.360 because PyTorch also has parameters 01:58:38.360 --> 01:58:40.360 on every single nn.Module, 01:58:40.360 --> 01:58:42.360 and it does exactly what we're doing here. 01:58:42.360 --> 01:58:45.360 It just returns the parameter tensors. 01:58:45.360 --> 01:58:48.360 For us, it's the parameter scalars. 01:58:48.360 --> 01:58:50.360 Now, Layer is also a module, 01:58:50.360 --> 01:58:54.360 so it will have parameters(self), 01:58:54.360 --> 01:58:56.360 and basically what we want to do here is 01:58:56.360 --> 01:58:59.360 something like this, like 01:58:59.360 --> 01:59:01.360 params is here, 01:59:01.360 --> 01:59:07.360 and then for neuron in self.neurons, 01:59:07.360 --> 01:59:10.360 we want to get neuron.parameters, 01:59:10.360 --> 01:59:14.360 and we want to params.extend.
01:59:14.360 --> 01:59:16.360 So these are the parameters of this neuron, 01:59:16.360 --> 01:59:19.360 and then we want to put them on top of params, 01:59:19.360 --> 01:59:22.360 so params.extend of ps, 01:59:22.360 --> 01:59:25.360 and then we want to return params. 01:59:25.360 --> 01:59:27.360 So this is way too much code, 01:59:27.360 --> 01:59:29.360 so actually there's a way to simplify this, 01:59:29.360 --> 01:59:39.360 which is return p for neuron in self.neurons 01:59:39.360 --> 01:59:45.360 for p in neuron.parameters. 01:59:45.360 --> 01:59:47.360 So it's a single list comprehension. 01:59:47.360 --> 01:59:49.360 In Python, you can sort of nest them like this, 01:59:49.360 --> 01:59:54.360 and you can then create the desired array. 01:59:54.360 --> 01:59:56.360 So these are identical. 01:59:56.360 --> 01:59:59.360 We can take this out. 01:59:59.360 --> 02:00:04.360 And then let's do the same here. 02:00:04.360 --> 02:00:07.360 def parameters(self), 02:00:07.360 --> 02:00:13.360 and return p for layer in self.layers 02:00:13.360 --> 02:00:20.360 for p in layer.parameters. 02:00:20.360 --> 02:00:22.360 And that should be good. 02:00:22.360 --> 02:00:25.360 Now let me pop out this 02:00:25.360 --> 02:00:28.360 so we don't reinitialize our network, 02:00:28.360 --> 02:00:35.360 because we need to reinitialize our... 02:00:35.360 --> 02:00:36.360 Okay, so unfortunately, 02:00:36.360 --> 02:00:38.360 we will have to probably reinitialize the network, 02:00:38.360 --> 02:00:41.360 because we just added functionality. 02:00:41.360 --> 02:00:42.360 Because with this class, of course, 02:00:42.360 --> 02:00:45.360 I want to get all of n.parameters, 02:00:45.360 --> 02:00:46.360 but that's not going to work, 02:00:46.360 --> 02:00:49.360 because this is the old class. 02:00:49.360 --> 02:00:50.360 Okay.
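The parameters plumbing described above can be sketched as follows. These are minimal stand-ins using plain random floats rather than the lecture's Value objects, but the parameters methods have the same shape: a flat list for Neuron, and nested list comprehensions that flatten Layer and MLP.

```python
import random

# Minimal stand-ins for the Neuron/Layer/MLP classes built in the
# lecture; only the parameters() methods are the point here.
class Neuron:
    def __init__(self, nin):
        self.w = [random.uniform(-1, 1) for _ in range(nin)]
        self.b = random.uniform(-1, 1)
    def parameters(self):
        # list + list concatenates, so this is one flat list of scalars
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def parameters(self):
        # nested comprehension flattens every neuron's parameters
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i + 1]) for i in range(len(nouts))]
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

n = MLP(3, [4, 4, 1])
print(len(n.parameters()))  # 16 + 20 + 5 = 41, as in the lecture
```

The counts work out because each layer contributes (inputs + 1) scalars per neuron: 4*(3+1) + 4*(4+1) + 1*(4+1) = 41.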
02:00:50.360 --> 02:00:51.360 So unfortunately, 02:00:51.360 --> 02:00:53.360 we do have to reinitialize the network, 02:00:53.360 --> 02:00:55.360 which will change some of the numbers. 02:00:55.360 --> 02:00:58.360 But let me do that so that we pick up the new API. 02:00:58.360 --> 02:01:00.360 We can now do n.parameters. 02:01:00.360 --> 02:01:02.360 And these are all the weights and biases 02:01:02.360 --> 02:01:05.360 inside the entire neural net. 02:01:05.360 --> 02:01:11.360 So in total, this MLP has 41 parameters. 02:01:11.360 --> 02:01:15.360 And now we'll be able to change them. 02:01:15.360 --> 02:01:18.360 If we recalculate the loss here, 02:01:18.360 --> 02:01:19.360 we see that unfortunately, 02:01:19.360 --> 02:01:23.360 we have slightly different predictions 02:01:23.360 --> 02:01:26.360 and a slightly different loss. 02:01:26.360 --> 02:01:28.360 But that's okay. 02:01:28.360 --> 02:01:31.360 Okay, so we see that this neuron's gradient 02:01:31.360 --> 02:01:33.360 is slightly negative. 02:01:33.360 --> 02:01:36.360 We can also look at its data right now, 02:01:36.360 --> 02:01:38.360 which is 0.85. 02:01:38.360 --> 02:01:39.360 So this is the current value of this neuron, 02:01:39.360 --> 02:01:43.360 and this is its gradient on the loss. 02:01:43.360 --> 02:01:44.360 So what we want to do now 02:01:44.360 --> 02:01:48.360 is we want to iterate for every p in n.parameters. 02:01:48.360 --> 02:01:51.360 So for all the 41 parameters in this neural net, 02:01:51.360 --> 02:01:55.360 we actually want to change p.data slightly 02:01:55.360 --> 02:01:58.360 according to the gradient information. 02:01:58.360 --> 02:02:01.360 Okay, so dot dot dot to do here. 02:02:01.360 --> 02:02:04.360 But this will be basically a tiny update 02:02:04.360 --> 02:02:07.360 in this gradient descent scheme.
02:02:07.360 --> 02:02:09.360 And gradient descent, 02:02:09.360 --> 02:02:11.360 we are thinking of the gradient 02:02:11.360 --> 02:02:13.360 as a vector pointing in the direction 02:02:13.360 --> 02:02:17.360 of increased loss. 02:02:17.360 --> 02:02:20.360 And so in gradient descent, 02:02:20.360 --> 02:02:23.360 we are modifying p.data 02:02:23.360 --> 02:02:25.360 by a small step size 02:02:25.360 --> 02:02:27.360 in the direction of the gradient. 02:02:27.360 --> 02:02:28.360 So the step size as an example 02:02:28.360 --> 02:02:29.360 could be like a very small number, 02:02:29.360 --> 02:02:31.360 like 0.01 is the step size, 02:02:31.360 --> 02:02:35.360 times p.grad, right? 02:02:35.360 --> 02:02:37.360 But we have to think through 02:02:37.360 --> 02:02:38.360 some of the signs here. 02:02:38.360 --> 02:02:41.360 So in particular, 02:02:41.360 --> 02:02:44.360 working with this specific example here, 02:02:44.360 --> 02:02:46.360 we see that if we just left it like this, 02:02:46.360 --> 02:02:48.360 then this neuron's value 02:02:48.360 --> 02:02:50.360 would be currently increased 02:02:50.360 --> 02:02:52.360 by a tiny amount of the gradient. 02:02:52.360 --> 02:02:54.360 The gradient is negative, 02:02:54.360 --> 02:02:56.360 so this value of this neuron 02:02:56.360 --> 02:02:58.360 would go slightly down. 02:02:58.360 --> 02:03:00.360 It would become like 0.84 02:03:00.360 --> 02:03:02.360 or something like that. 02:03:02.360 --> 02:03:05.360 But if this neuron's value goes lower, 02:03:05.360 --> 02:03:10.360 that would actually increase the loss. 02:03:10.360 --> 02:03:13.360 That's because the derivative 02:03:13.360 --> 02:03:15.360 of this neuron is negative. 02:03:15.360 --> 02:03:17.360 So increasing this 02:03:17.360 --> 02:03:19.360 makes the loss go down. 02:03:19.360 --> 02:03:21.360 So increasing it is what we want to do 02:03:21.360 --> 02:03:23.360 instead of decreasing it. 
02:03:23.360 --> 02:03:24.360 So basically what we're missing here 02:03:24.360 --> 02:03:26.360 is we're actually missing a negative sign, 02:03:26.360 --> 02:03:31.360 and that's because we want to minimize the loss. 02:03:31.360 --> 02:03:32.360 We don't want to maximize the loss. 02:03:32.360 --> 02:03:34.360 We want to decrease it. 02:03:34.360 --> 02:03:35.360 And the other interpretation, as I mentioned, 02:03:35.360 --> 02:03:37.360 is you can think of the gradient vector, 02:03:37.360 --> 02:03:40.360 so basically just the vector of all the gradients, 02:03:40.360 --> 02:03:42.360 as pointing in the direction 02:03:42.360 --> 02:03:45.360 of increasing the loss. 02:03:45.360 --> 02:03:46.360 But then we want to decrease it. 02:03:46.360 --> 02:03:49.360 So we actually want to go in the opposite direction. 02:03:49.360 --> 02:03:50.360 And so you can convince yourself 02:03:50.360 --> 02:03:52.360 that this does the right thing here with the negative, 02:03:52.360 --> 02:03:55.360 because we want to minimize the loss. 02:03:55.360 --> 02:04:00.360 So if we nudge all the parameters by a tiny amount, 02:04:00.360 --> 02:04:02.360 then we'll see that this data 02:04:02.360 --> 02:04:04.360 will have changed a little bit. 02:04:04.360 --> 02:04:09.360 So now this neuron is a tiny amount greater in value. 02:04:09.360 --> 02:04:13.360 So 0.854 went to 0.857. 02:04:13.360 --> 02:04:14.360 And that's a good thing, 02:04:14.360 --> 02:04:18.360 because slightly increasing this neuron's data 02:04:18.360 --> 02:04:22.360 makes the loss go down, according to the gradient. 02:04:22.360 --> 02:04:25.360 And so the correct thing has happened, sign-wise. 02:04:25.360 --> 02:04:27.360 And so now what we would expect, of course, 02:04:27.360 --> 02:04:30.360 is that because we've changed all these parameters, 02:04:30.360 --> 02:04:34.360 we expect that the loss should have gone down a bit.
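The update with the negative sign can be sketched on stand-in parameter objects (plain data/grad holders rather than the full Value class; the step size 0.01 and these particular numbers are just for illustration):

```python
class Param:
    # stand-in for a micrograd Value: just .data and .grad
    def __init__(self, data, grad):
        self.data, self.grad = data, grad

# a weight with a negative gradient, and one with a positive gradient
params = [Param(0.85, -0.3), Param(-0.2, 0.1)]

for p in params:
    # gradient descent: step *against* the gradient, hence the minus
    p.data += -0.01 * p.grad

# the weight whose gradient was negative went slightly up,
# which is the direction that decreases the loss
print(params[0].data, params[1].data)
```

Without the minus sign this loop would be gradient ascent, nudging every parameter toward higher loss.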
02:04:34.360 --> 02:04:36.360 So we want to reevaluate the loss. 02:04:36.360 --> 02:04:39.360 Let me basically... 02:04:39.360 --> 02:04:42.360 This is just the data definition, which hasn't changed. 02:04:42.360 --> 02:04:43.360 But the forward pass here, 02:04:43.360 --> 02:04:45.360 of the network, 02:04:45.360 --> 02:04:47.360 we can recalculate. 02:04:49.360 --> 02:04:51.360 And actually, let me do it outside here, 02:04:51.360 --> 02:04:54.360 so that we can compare the two loss values. 02:04:54.360 --> 02:04:57.360 So here, if I recalculate the loss, 02:04:57.360 --> 02:04:59.360 we'd expect the new loss now 02:04:59.360 --> 02:05:01.360 to be slightly lower than this number. 02:05:01.360 --> 02:05:03.360 So hopefully, what we're getting now 02:05:03.360 --> 02:05:06.360 is a tiny bit lower than 4.84. 02:05:06.360 --> 02:05:08.360 4.36. 02:05:08.360 --> 02:05:10.360 And remember, the way we've arranged this 02:05:10.360 --> 02:05:12.360 is that low loss means that 02:05:12.360 --> 02:05:14.360 our predictions are matching the targets. 02:05:14.360 --> 02:05:15.360 So our predictions now 02:05:15.360 --> 02:05:19.360 are probably slightly closer to the targets. 02:05:19.360 --> 02:05:21.360 And now all we have to do 02:05:21.360 --> 02:05:23.360 is we have to iterate this process. 02:05:23.360 --> 02:05:26.360 So again, we've done the forward pass, 02:05:26.360 --> 02:05:27.360 and this is the loss. 02:05:27.360 --> 02:05:29.360 Now we can do loss.backward. 02:05:29.360 --> 02:05:31.360 Let me take these out. 02:05:31.360 --> 02:05:33.360 And we can do a step size. 02:05:33.360 --> 02:05:36.360 And now we should have a slightly lower loss. 02:05:36.360 --> 02:05:39.360 4.36 goes to 3.9. 02:05:39.360 --> 02:05:41.360 And okay, so we've done the forward pass. 02:05:41.360 --> 02:05:43.360 Here's the backward pass. 02:05:43.360 --> 02:05:45.360 Nudge. 02:05:45.360 --> 02:05:47.360 And now the loss is 3.66. 02:05:47.360 --> 02:05:51.360 3.47.
02:05:51.360 --> 02:05:53.360 And you get the idea. 02:05:53.360 --> 02:05:55.360 We just continue doing this. 02:05:55.360 --> 02:05:57.360 And this is gradient descent. 02:05:57.360 --> 02:05:59.360 We're just iteratively doing forward pass, 02:05:59.360 --> 02:06:01.360 backward pass, update. 02:06:01.360 --> 02:06:03.360 Forward pass, backward pass, update. 02:06:03.360 --> 02:06:05.360 And the neural net is improving its predictions. 02:06:05.360 --> 02:06:08.360 So here, if we look at ypred now, 02:06:08.360 --> 02:06:10.360 ypred, 02:06:10.360 --> 02:06:16.360 we see that this value should be getting closer to 1. 02:06:16.360 --> 02:06:18.360 So this value should be getting more positive. 02:06:18.360 --> 02:06:19.360 These should be getting more negative. 02:06:19.360 --> 02:06:21.360 And this one should be also getting more positive. 02:06:21.360 --> 02:06:26.360 So if we just iterate this a few more times, 02:06:26.360 --> 02:06:29.360 actually, we may be able to afford to go a bit faster. 02:06:29.360 --> 02:06:34.360 Let's try a slightly higher learning rate. 02:06:34.360 --> 02:06:35.360 Oops, okay, there we go. 02:06:35.360 --> 02:06:39.360 So now we're at 0.31. 02:06:39.360 --> 02:06:41.360 If you go too fast, by the way, 02:06:41.360 --> 02:06:43.360 if you try to make it too big of a step, 02:06:43.360 --> 02:06:47.360 you may actually overstep. 02:06:47.360 --> 02:06:48.360 It's overconfidence. 02:06:48.360 --> 02:06:49.360 Because again, remember, 02:06:49.360 --> 02:06:51.360 we don't actually know exactly about the loss function. 02:06:51.360 --> 02:06:53.360 The loss function has all kinds of structure. 02:06:53.360 --> 02:06:56.360 And we only know about the very local dependence 02:06:56.360 --> 02:06:58.360 of all these parameters on the loss. 02:06:58.360 --> 02:06:59.360 But if we step too far, 02:06:59.360 --> 02:07:01.360 we may step into, you know, 02:07:01.360 --> 02:07:03.360 a part of the loss that is completely different. 
02:07:03.360 --> 02:07:05.360 And that can destabilize training 02:07:05.360 --> 02:07:08.360 and make your loss actually blow up even. 02:07:08.360 --> 02:07:10.360 So the loss is now 0.04. 02:07:10.360 --> 02:07:13.360 So actually, the predictions should be really quite close. 02:07:13.360 --> 02:07:15.360 Let's take a look. 02:07:15.360 --> 02:07:17.360 So you see how this is almost one, 02:07:17.360 --> 02:07:19.360 almost negative one, almost one. 02:07:19.360 --> 02:07:21.360 We can continue going. 02:07:21.360 --> 02:07:25.360 So, yep, backward, update. 02:07:25.360 --> 02:07:26.360 Oops, there we go. 02:07:26.360 --> 02:07:28.360 So we went way too fast. 02:07:28.360 --> 02:07:31.360 And we actually overstepped. 02:07:31.360 --> 02:07:34.360 So we got too eager. 02:07:34.360 --> 02:07:35.360 Where are we now? 02:07:35.360 --> 02:07:36.360 Oops. 02:07:36.360 --> 02:07:37.360 Okay. 02:07:37.360 --> 02:07:38.360 7E-9. 02:07:38.360 --> 02:07:41.360 So this is very, very low loss. 02:07:41.360 --> 02:07:45.360 And the predictions are basically perfect. 02:07:45.360 --> 02:07:47.360 So somehow we... 02:07:47.360 --> 02:07:49.360 Basically, we were doing way too big updates 02:07:49.360 --> 02:07:50.360 and we briefly exploded, 02:07:50.360 --> 02:07:53.360 but then somehow we ended up getting into a really good spot. 02:07:53.360 --> 02:07:55.360 So usually this learning rate 02:07:55.360 --> 02:07:57.360 and the tuning of it is a subtle art. 02:07:57.360 --> 02:07:59.360 You want to set your learning rate. 02:07:59.360 --> 02:08:00.360 If it's too low, 02:08:00.360 --> 02:08:02.360 you're going to take way too long to converge. 02:08:02.360 --> 02:08:03.360 But if it's too high, 02:08:03.360 --> 02:08:04.360 the whole thing gets unstable 02:08:04.360 --> 02:08:06.360 and you might actually even explode the loss, 02:08:06.360 --> 02:08:08.360 depending on your loss function. 
02:08:08.360 --> 02:08:11.360 So finding the step size to be just right, 02:08:11.360 --> 02:08:13.360 it's a pretty subtle art sometimes, 02:08:13.360 --> 02:08:15.360 when you're using sort of vanilla gradient descent. 02:08:15.360 --> 02:08:17.360 But we happened to get into a good spot. 02:08:17.360 --> 02:08:22.360 We can look at n.parameters. 02:08:22.360 --> 02:08:26.360 So this is the setting of weights and biases 02:08:26.360 --> 02:08:28.360 that makes our network 02:08:28.360 --> 02:08:31.360 predict the desired targets 02:08:31.360 --> 02:08:33.360 very, very close. 02:08:33.360 --> 02:08:35.360 And basically, 02:08:35.360 --> 02:08:38.360 we've successfully trained a neural net. 02:08:38.360 --> 02:08:40.360 Okay, let's make this a tiny bit more respectable 02:08:40.360 --> 02:08:42.360 and implement an actual training loop 02:08:42.360 --> 02:08:43.360 and what that looks like. 02:08:43.360 --> 02:08:45.360 So this is the data definition that stays. 02:08:45.360 --> 02:08:47.360 This is the forward pass. 02:08:47.360 --> 02:08:51.360 So for k in range, 02:08:51.360 --> 02:08:57.360 we're going to take a bunch of steps. 02:08:57.360 --> 02:08:59.360 First, we do the forward pass. 02:08:59.360 --> 02:09:03.360 We evaluate the loss. 02:09:03.360 --> 02:09:05.360 Let's reinitialize the neural net from scratch. 02:09:05.360 --> 02:09:08.360 And here's the data. 02:09:08.360 --> 02:09:10.360 And we first do the forward pass. 02:09:10.360 --> 02:09:12.360 Then we do the backward pass. 02:09:19.360 --> 02:09:20.360 And then we do an update. 02:09:20.360 --> 02:09:22.360 That's gradient descent. 02:09:26.360 --> 02:09:27.360 And then we should be able to iterate this, 02:09:27.360 --> 02:09:30.360 and we should be able to print the current step 02:09:30.360 --> 02:09:32.360 and the current loss. 02:09:32.360 --> 02:09:34.360 Let's just print 02:09:34.360 --> 02:09:37.360 the number of the loss. 02:09:37.360 --> 02:09:40.360 And that should be it.
02:09:40.360 --> 02:09:42.360 And then the learning rate, 02:09:42.360 --> 02:09:43.360 0.01 is a little too small. 02:09:43.360 --> 02:09:45.360 0.1 we saw is like a little bit dangerous 02:09:45.360 --> 02:09:46.360 and too high. 02:09:46.360 --> 02:09:48.360 Let's go somewhere in between. 02:09:48.360 --> 02:09:50.360 And we'll optimize this for 02:09:50.360 --> 02:09:51.360 not 10 steps, 02:09:51.360 --> 02:09:54.360 but let's go for say 20 steps. 02:09:54.360 --> 02:09:59.360 Let me erase all of this junk. 02:09:59.360 --> 02:10:02.360 And let's run the optimization. 02:10:02.360 --> 02:10:05.360 And you see how we've actually converged slower 02:10:05.360 --> 02:10:07.360 in a more controlled manner 02:10:07.360 --> 02:10:10.360 and got to a loss that is very low. 02:10:10.360 --> 02:10:14.360 So I expect YPred to be quite good. 02:10:14.360 --> 02:10:16.360 There we go. 02:10:21.360 --> 02:10:23.360 And that's it. 02:10:23.360 --> 02:10:25.360 Okay, so this is kind of embarrassing, 02:10:25.360 --> 02:10:28.360 but we actually have a really terrible bug in here. 02:10:28.360 --> 02:10:30.360 And it's a subtle bug 02:10:30.360 --> 02:10:32.360 and it's a very common bug. 02:10:32.360 --> 02:10:34.360 And I can't believe I've done it 02:10:34.360 --> 02:10:36.360 for the 20th time in my life, 02:10:36.360 --> 02:10:37.360 especially on camera. 02:10:37.360 --> 02:10:39.360 And I could have reshot the whole thing, 02:10:39.360 --> 02:10:41.360 but I think it's pretty funny. 02:10:41.360 --> 02:10:43.360 And you get to appreciate a bit 02:10:43.360 --> 02:10:45.360 what working with neural nets 02:10:45.360 --> 02:10:47.360 maybe is like sometimes. 02:10:47.360 --> 02:10:49.360 We are guilty of 02:10:49.360 --> 02:10:51.360 a common bug. 02:10:51.360 --> 02:10:53.360 I've actually tweeted 02:10:53.360 --> 02:10:55.360 the most common neural net mistakes 02:10:55.360 --> 02:10:57.360 a long time ago now. 
02:10:57.360 --> 02:10:59.360 And I'm not really 02:10:59.360 --> 02:11:01.360 going to explain any of these, 02:11:01.360 --> 02:11:03.360 but remember we are guilty of number three. 02:11:03.360 --> 02:11:05.360 You forgot to zero grad 02:11:05.360 --> 02:11:06.360 before dot backward. 02:11:06.360 --> 02:11:09.360 What is that? 02:11:09.360 --> 02:11:10.360 Basically what's happening, 02:11:10.360 --> 02:11:11.360 and it's a subtle bug 02:11:11.360 --> 02:11:13.360 and I'm not sure if you saw it, 02:11:13.360 --> 02:11:16.360 is that all of these weights here 02:11:16.360 --> 02:11:19.360 have a dot data and a dot grad. 02:11:19.360 --> 02:11:22.360 And dot grad starts at zero. 02:11:22.360 --> 02:11:23.360 And then we do backward 02:11:23.360 --> 02:11:25.360 and we fill in the gradients. 02:11:25.360 --> 02:11:27.360 And then we do an update on the data, 02:11:27.360 --> 02:11:29.360 but we don't flush the grad. 02:11:29.360 --> 02:11:30.360 It stays there. 02:11:30.360 --> 02:11:33.360 So when we do the second forward pass 02:11:33.360 --> 02:11:35.360 and we do backward again, 02:11:35.360 --> 02:11:37.360 remember that all the backward operations 02:11:37.360 --> 02:11:39.360 do a plus equals on the grad. 02:11:39.360 --> 02:11:41.360 And so these gradients just add up 02:11:41.360 --> 02:11:44.360 and they never get reset to zero. 02:11:44.360 --> 02:11:47.360 So basically we didn't zero grad. 02:11:47.360 --> 02:11:48.360 So here's how we zero grad 02:11:48.360 --> 02:11:50.360 before backward. 02:11:50.360 --> 02:11:53.360 We need to iterate over all the parameters. 02:11:53.360 --> 02:11:55.360 And we need to make sure that 02:11:55.360 --> 02:11:58.360 p dot grad is set to zero. 02:11:58.360 --> 02:12:00.360 We need to reset it to zero. 02:12:00.360 --> 02:12:02.360 Just like it is in the constructor. 
02:12:02.360 --> 02:12:04.360 So remember all the way here 02:12:04.360 --> 02:12:05.360 for all these value nodes, 02:12:05.360 --> 02:12:07.360 grad is reset to zero. 02:12:07.360 --> 02:12:09.360 And then all these backward passes 02:12:09.360 --> 02:12:11.360 do a plus equals from that grad. 02:12:11.360 --> 02:12:13.360 But we need to make sure that 02:12:13.360 --> 02:12:15.360 we reset these grads to zero 02:12:15.360 --> 02:12:17.360 so that when we do backward, 02:12:17.360 --> 02:12:18.360 all of them start at zero 02:12:18.360 --> 02:12:19.360 and the actual backward pass 02:12:19.360 --> 02:12:23.360 accumulates the loss derivatives 02:12:23.360 --> 02:12:25.360 into the grads. 02:12:25.360 --> 02:12:28.360 So this is zero grad in PyTorch. 02:12:28.360 --> 02:12:29.360 And 02:12:29.360 --> 02:12:33.360 we will get a slightly different optimization. 02:12:33.360 --> 02:12:35.360 Let's reset the neural net. 02:12:35.360 --> 02:12:36.360 The data is the same. 02:12:36.360 --> 02:12:38.360 This is now, I think, correct. 02:12:38.360 --> 02:12:41.360 And we get a much more 02:12:41.360 --> 02:12:44.360 slower descent. 02:12:44.360 --> 02:12:46.360 We still end up with pretty good results. 02:12:46.360 --> 02:12:48.360 And we can continue this a bit more 02:12:48.360 --> 02:12:50.360 to get down lower 02:12:50.360 --> 02:12:51.360 and lower 02:12:51.360 --> 02:12:54.360 and lower. 02:12:54.360 --> 02:12:56.360 Yeah. 02:12:56.360 --> 02:12:58.360 So the only reason that the previous thing worked, 02:12:58.360 --> 02:13:00.360 it's extremely buggy. 02:13:00.360 --> 02:13:01.360 The only reason that worked 02:13:01.360 --> 02:13:03.360 is that 02:13:03.360 --> 02:13:06.360 this is a very, very simple problem. 02:13:06.360 --> 02:13:08.360 And it's very easy for this neural net 02:13:08.360 --> 02:13:09.360 to fit this data. 
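The corrected loop order (forward pass, zero the grads, backward pass, update) can be sketched on a one-parameter toy problem. The gradient here is written out by hand, since the full Value engine isn't reproduced in this sketch, and the numbers and names are just for illustration:

```python
class Param:
    # stand-in for a micrograd Value: just .data and .grad
    def __init__(self, data):
        self.data, self.grad = data, 0.0

# toy problem: fit w so that w * x matches y; loss = (w*x - y)**2
w, x, y = Param(0.0), 2.0, 3.0

for k in range(20):
    # forward pass: compute the loss
    loss = (w.data * x - y) ** 2
    # zero grad BEFORE backward -- this is the fix for the bug
    w.grad = 0.0
    # "backward pass": hand-derived gradient d(loss)/dw,
    # accumulated with += just like the engine's backward ops
    w.grad += 2 * (w.data * x - y) * x
    # update: gradient descent step
    w.data += -0.05 * w.grad

print(w.data)  # converges toward 1.5, where the loss is 0
```

If the `w.grad = 0.0` line is removed, the stale gradients keep accumulating across iterations, which is exactly the bug being described: effectively a much larger and incorrect step size.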
02:13:09.360 --> 02:13:12.360 And so the grads ended up accumulating 02:13:12.360 --> 02:13:13.360 and it effectively gave us 02:13:13.360 --> 02:13:15.360 a massive step size. 02:13:15.360 --> 02:13:19.360 And it made us converge extremely fast. 02:13:19.360 --> 02:13:21.360 But basically now we have to do more steps 02:13:21.360 --> 02:13:24.360 to get to very low values of loss 02:13:24.360 --> 02:13:26.360 and get YPRED to be really good. 02:13:26.360 --> 02:13:27.360 We can try to 02:13:27.360 --> 02:13:34.360 step a bit greater. 02:13:34.360 --> 02:13:35.360 Yeah. 02:13:35.360 --> 02:13:36.360 We're going to get closer and closer 02:13:36.360 --> 02:13:38.360 to one minus one and one. 02:13:38.360 --> 02:13:39.360 So 02:13:39.360 --> 02:13:41.360 working with neural nets is sometimes 02:13:41.360 --> 02:13:44.360 tricky because 02:13:44.360 --> 02:13:46.360 you may have lots of bugs in the code 02:13:46.360 --> 02:13:48.360 and 02:13:48.360 --> 02:13:49.360 your network might actually work 02:13:49.360 --> 02:13:51.360 just like ours worked. 02:13:51.360 --> 02:13:52.360 But chances are is that 02:13:52.360 --> 02:13:54.360 if we had a more complex problem 02:13:54.360 --> 02:13:56.360 then actually this bug would have 02:13:56.360 --> 02:13:58.360 made us not optimize the loss very well. 02:13:58.360 --> 02:14:00.360 And we were only able to get away with it 02:14:00.360 --> 02:14:01.360 because 02:14:01.360 --> 02:14:03.360 the problem is very simple. 02:14:03.360 --> 02:14:05.360 So let's now bring everything together 02:14:05.360 --> 02:14:07.360 and summarize what we learned. 02:14:07.360 --> 02:14:08.360 What are neural nets? 02:14:08.360 --> 02:14:11.360 Neural nets are these mathematical expressions. 
02:14:11.360 --> 02:14:13.360 Fairly simple mathematical expressions, 02:14:13.360 --> 02:14:15.360 in the case of a multi-layer perceptron, 02:14:15.360 --> 02:14:18.360 that take the data as input, 02:14:18.360 --> 02:14:20.360 and that take as input the weights, 02:14:20.360 --> 02:14:21.360 the parameters of the neural net. 02:14:21.360 --> 02:14:24.360 It's a mathematical expression for the forward pass, 02:14:24.360 --> 02:14:25.360 followed by a loss function. 02:14:25.360 --> 02:14:27.360 And the loss function tries to measure 02:14:27.360 --> 02:14:29.360 the accuracy of the predictions. 02:14:29.360 --> 02:14:31.360 And usually the loss will be low 02:14:31.360 --> 02:14:33.360 when your predictions are matching your targets, 02:14:33.360 --> 02:14:36.360 or when the network is basically behaving well. 02:14:36.360 --> 02:14:38.360 So we manipulate the loss function 02:14:38.360 --> 02:14:40.360 so that when the loss is low, 02:14:40.360 --> 02:14:42.360 the network is doing what you want it to do 02:14:42.360 --> 02:14:43.360 on your problem. 02:14:43.360 --> 02:14:46.360 And then we backward the loss, 02:14:46.360 --> 02:14:48.360 using backpropagation to get the gradient, 02:14:48.360 --> 02:14:50.360 and then we know how to tune all the parameters 02:14:50.360 --> 02:14:52.360 to decrease the loss locally. 02:14:52.360 --> 02:14:54.360 But then we have to iterate that process 02:14:54.360 --> 02:14:57.360 many times in what's called gradient descent. 02:14:57.360 --> 02:14:59.360 So we simply follow the gradient information, 02:14:59.360 --> 02:15:01.360 and that minimizes the loss, 02:15:01.360 --> 02:15:02.360 and the loss is arranged so that 02:15:02.360 --> 02:15:04.360 when the loss is minimized, 02:15:04.360 --> 02:15:06.360 the network is doing what you want it to do. 02:15:06.360 --> 02:15:10.360 And yeah, so we just have a blob of neural stuff, 02:15:10.360 --> 02:15:12.360 and we can make it do arbitrary things.
02:15:12.360 --> 02:15:15.360 And that's what gives neural nets their power. 02:15:15.360 --> 02:15:17.360 It's, you know, this is a very tiny network 02:15:17.360 --> 02:15:19.360 with 41 parameters. 02:15:19.360 --> 02:15:21.360 But you can build significantly more complicated 02:15:21.360 --> 02:15:23.360 neural nets with billions, 02:15:23.360 --> 02:15:26.360 at this point almost trillions, of parameters. 02:15:26.360 --> 02:15:28.360 And it's a massive blob of neural tissue, 02:15:28.360 --> 02:15:30.360 simulated neural tissue, 02:15:30.360 --> 02:15:32.360 roughly speaking. 02:15:32.360 --> 02:15:35.360 And you can make it do extremely complex problems. 02:15:35.360 --> 02:15:37.360 And these neural nets then 02:15:37.360 --> 02:15:39.360 have all kinds of very fascinating emergent properties 02:15:39.360 --> 02:15:43.360 when you try to make them do 02:15:43.360 --> 02:15:45.360 significantly harder problems. 02:15:45.360 --> 02:15:47.360 As in the case of GPT, for example, 02:15:47.360 --> 02:15:50.360 we have massive amounts of text from the internet, 02:15:50.360 --> 02:15:52.360 and we're trying to get a neural net 02:15:52.360 --> 02:15:54.360 to take a few words 02:15:54.360 --> 02:15:56.360 and try to predict the next word in a sequence. 02:15:56.360 --> 02:15:58.360 That's the learning problem. 02:15:58.360 --> 02:15:59.360 And it turns out that when you train this 02:15:59.360 --> 02:16:00.360 on all of the internet, 02:16:00.360 --> 02:16:02.360 the neural net actually has really remarkable 02:16:02.360 --> 02:16:04.360 emergent properties. 02:16:04.360 --> 02:16:05.360 But that neural net would have 02:16:05.360 --> 02:16:07.360 hundreds of billions of parameters. 02:16:07.360 --> 02:16:10.360 But it works on fundamentally the exact same principles. 02:16:10.360 --> 02:16:13.360 The neural net of course will be a bit more complex. 
02:16:13.360 --> 02:16:16.360 But otherwise, evaluating the gradient 02:16:16.360 --> 02:16:19.360 is there and will be identical. 02:16:19.360 --> 02:16:21.360 And the gradient descent would be there 02:16:21.360 --> 02:16:22.360 and basically identical. 02:16:22.360 --> 02:16:24.360 But people usually use slightly different updates. 02:16:24.360 --> 02:16:28.360 This is a very simple stochastic gradient descent update. 02:16:28.360 --> 02:16:31.360 And the loss function would not be a mean squared error. 02:16:31.360 --> 02:16:33.360 They would be using something called the cross-entropy loss 02:16:33.360 --> 02:16:35.360 for predicting the next token. 02:16:35.360 --> 02:16:36.360 So there's a few more details, 02:16:36.360 --> 02:16:38.360 but fundamentally the neural network setup 02:16:38.360 --> 02:16:39.360 and neural network training 02:16:39.360 --> 02:16:41.360 is identical and pervasive. 02:16:41.360 --> 02:16:43.360 And now you understand intuitively 02:16:43.360 --> 02:16:45.360 how that works under the hood. 02:16:45.360 --> 02:16:46.360 In the beginning of this video, 02:16:46.360 --> 02:16:48.360 I told you that by the end of it 02:16:48.360 --> 02:16:50.360 you would understand everything in micrograd, 02:16:50.360 --> 02:16:52.360 and then we'd slowly build it up. 02:16:52.360 --> 02:16:54.360 Let me briefly prove that to you. 02:16:54.360 --> 02:16:55.360 So I'm going to step through all the code 02:16:55.360 --> 02:16:57.360 that is in micrograd as of today. 02:16:57.360 --> 02:16:59.360 Actually, potentially some of the code will change 02:16:59.360 --> 02:17:00.360 by the time you watch this video, 02:17:00.360 --> 02:17:03.360 because I intend to continue developing micrograd. 02:17:03.360 --> 02:17:05.360 But let's look at what we have so far at least. 02:17:05.360 --> 02:17:07.360 init.py is empty. 02:17:07.360 --> 02:17:10.360 When you go to engine.py, that has the Value.
02:17:10.360 --> 02:17:12.360 Everything here you should mostly recognize. 02:17:12.360 --> 02:17:14.360 So we have the data and grad attributes. 02:17:14.360 --> 02:17:16.360 We have the backward function. 02:17:16.360 --> 02:17:17.360 We have the previous set of children 02:17:17.360 --> 02:17:20.360 and the operation that produced this value. 02:17:20.360 --> 02:17:22.360 We have addition, multiplication, 02:17:22.360 --> 02:17:24.360 and raising to a scalar power. 02:17:24.360 --> 02:17:26.360 We have the ReLU non-linearity, 02:17:26.360 --> 02:17:28.360 which is a slightly different type of non-linearity 02:17:28.360 --> 02:17:30.360 than tanh that we used in this video. 02:17:30.360 --> 02:17:32.360 Both of them are non-linearities, 02:17:32.360 --> 02:17:34.360 and notably tanh is not actually present 02:17:34.360 --> 02:17:36.360 in micrograd as of right now, 02:17:36.360 --> 02:17:38.360 but I intend to add it later. 02:17:38.360 --> 02:17:40.360 We have the backward, which is identical, 02:17:40.360 --> 02:17:42.360 and then all of these other operations, 02:17:42.360 --> 02:17:45.360 which are built up on top of the operations here. 02:17:45.360 --> 02:17:47.360 So Value should be very recognizable, 02:17:47.360 --> 02:17:49.360 except for the non-linearity used in this video. 02:17:50.360 --> 02:17:52.360 There's no massive difference between ReLU and tanh 02:17:52.360 --> 02:17:54.360 and sigmoid and these other non-linearities. 02:17:54.360 --> 02:17:56.360 They're all roughly equivalent 02:17:56.360 --> 02:17:58.360 and can be used in MLPs. 02:17:58.360 --> 02:18:00.360 So I used tanh because it's a bit smoother, 02:18:00.360 --> 02:18:02.360 and because it's a little bit more complicated than ReLU 02:18:02.360 --> 02:18:04.360 it therefore stresses a little bit more 02:18:04.360 --> 02:18:06.360 the local gradients 02:18:06.360 --> 02:18:08.360 and working with those derivatives, 02:18:08.360 --> 02:18:10.360 which I thought would be useful. 
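The core of engine.py described here can be condensed into a short sketch, simplified to show only the pieces named in the transcript: the data and grad attributes, the children and op bookkeeping, one operator (addition), the ReLU non-linearity, and the topological-sort backward pass:

```python
class Value:
    """Condensed sketch of micrograd's Value object (addition and ReLU only)."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # closure that applies the local chain rule
        self._prev = set(_children)     # the children that produced this value
        self._op = _op                  # the operation, e.g. '+', 'ReLU'

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # addition routes the gradient unchanged to both children
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'ReLU')
        def _backward():
            # local derivative of ReLU is 1 where the output is positive, else 0
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
```

The real engine.py additionally implements multiplication, scalar powers, and the operations built on top of those, but the bookkeeping pattern is exactly this.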
02:18:10.360 --> 02:18:12.360 nn.py is the neural networks library, 02:18:12.360 --> 02:18:14.360 as I mentioned. 02:18:14.360 --> 02:18:16.360 So you should recognize the identical implementation 02:18:16.360 --> 02:18:18.360 of Neuron, Layer, and MLP. 02:18:18.360 --> 02:18:20.360 Notably, 02:18:20.360 --> 02:18:22.360 we have a class Module here 02:18:22.360 --> 02:18:24.360 that is a parent class of all these modules. 02:18:24.360 --> 02:18:26.360 I did that because there's an nn.Module class 02:18:26.360 --> 02:18:28.360 in PyTorch, 02:18:28.360 --> 02:18:30.360 so this exactly matches that API, 02:18:30.360 --> 02:18:32.360 and nn.Module in PyTorch also has a zero_grad, 02:18:32.360 --> 02:18:34.360 which I refactored out here. 02:18:36.360 --> 02:18:38.360 So that's the end of micrograd, really. 02:18:38.360 --> 02:18:40.360 Then there's a test, 02:18:40.360 --> 02:18:42.360 which you'll see basically creates 02:18:42.360 --> 02:18:44.360 two chunks of code, 02:18:44.360 --> 02:18:46.360 one in micrograd and one in PyTorch, 02:18:46.360 --> 02:18:48.360 and we'll make sure that the forward 02:18:48.360 --> 02:18:50.360 and the backward pass agree identically. 02:18:50.360 --> 02:18:52.360 For a slightly less complicated expression 02:18:52.360 --> 02:18:54.360 and a slightly more complicated expression, 02:18:54.360 --> 02:18:56.360 everything agrees, 02:18:56.360 --> 02:18:58.360 so we agree with PyTorch on all of these operations. 02:18:58.360 --> 02:19:00.360 And finally there's a demo.ipynb 02:19:00.360 --> 02:19:02.360 here, and it's a bit more 02:19:02.360 --> 02:19:04.360 complicated binary classification demo 02:19:04.360 --> 02:19:06.360 than the one I covered in this lecture. 02:19:06.360 --> 02:19:08.360 So we only had a tiny dataset of four examples. 
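The Module pattern described here, a parent class that factors out zero_grad and parameters and mirrors PyTorch's nn.Module API, can be sketched as follows (Param is a hypothetical stand-in for the Value class, carrying just data and grad; the forward logic of Neuron is omitted since the point is the shared parameter bookkeeping):

```python
import random

class Param:
    """Hypothetical stand-in for Value: just a number and its gradient."""
    def __init__(self, data):
        self.data = data
        self.grad = 0.0

class Module:
    """Parent class shared by Neuron, Layer, and MLP, matching nn.Module's API."""
    def zero_grad(self):
        # reset every parameter's gradient before the next backward pass
        for p in self.parameters():
            p.grad = 0.0
    def parameters(self):
        return []

class Neuron(Module):
    def __init__(self, nin):
        # one weight per input, plus a bias, as in micrograd's nn.py
        self.w = [Param(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Param(0.0)
    def parameters(self):
        return self.w + [self.b]
```

Because zero_grad lives on the parent class, Layer and MLP get it for free once they implement parameters().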
02:19:08.360 --> 02:19:10.360 Here we have a bit more 02:19:10.360 --> 02:19:12.360 complicated example with lots of 02:19:12.360 --> 02:19:14.360 blue points and lots of red points, 02:19:14.360 --> 02:19:16.360 and we're trying to again build a binary classifier 02:19:16.360 --> 02:19:18.360 to distinguish two-dimensional 02:19:18.360 --> 02:19:20.360 points as red or blue. 02:19:20.360 --> 02:19:22.360 It's a bit more complicated MLP here, 02:19:22.360 --> 02:19:24.360 a bigger MLP. 02:19:24.360 --> 02:19:26.360 The loss is a bit more complicated 02:19:26.360 --> 02:19:28.360 because it supports batches. 02:19:28.360 --> 02:19:30.360 Because our dataset 02:19:30.360 --> 02:19:32.360 was so tiny, we always did a forward pass 02:19:32.360 --> 02:19:34.360 on the entire dataset of four examples. 02:19:34.360 --> 02:19:36.360 But when your dataset is like a million 02:19:36.360 --> 02:19:38.360 examples, what we usually do in practice 02:19:38.360 --> 02:19:40.360 is we basically 02:19:40.360 --> 02:19:42.360 pick out some random subset, we call that a batch, 02:19:42.360 --> 02:19:44.360 and then we only process the batch: 02:19:44.360 --> 02:19:46.360 forward, backward, and update. 02:19:46.360 --> 02:19:48.360 So we don't have to forward the entire training set. 02:19:48.360 --> 02:19:50.360 So this is 02:19:50.360 --> 02:19:52.360 something that supports batching, 02:19:52.360 --> 02:19:54.360 because there are a lot more examples here. 02:19:54.360 --> 02:19:56.360 We do a forward pass. 02:19:56.360 --> 02:19:58.360 The loss is slightly different. 02:19:58.360 --> 02:20:00.360 This is a max-margin loss that I implement here. 02:20:00.360 --> 02:20:02.360 The one that we used was 02:20:02.360 --> 02:20:04.360 the mean squared error loss, 02:20:04.360 --> 02:20:06.360 because it's the simplest one. 02:20:06.360 --> 02:20:08.360 There's also the binary cross-entropy loss. 
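The two ideas just described, sampling a random batch instead of forwarding the whole dataset, and the max-margin loss, can be sketched like this (function names are mine, and the exact form in the demo notebook may differ):

```python
import random

def get_batch(X, y, batch_size):
    """Minibatching sketch: sample a random subset instead of the full dataset."""
    idx = random.sample(range(len(X)), batch_size)
    return [X[i] for i in idx], [y[i] for i in idx]

def max_margin_loss(scores, labels):
    """Max-margin (hinge) loss: labels are +1/-1, scores are the model outputs.
    Examples on the correct side of the margin (label * score >= 1) cost nothing;
    everything else is penalized linearly."""
    losses = [max(0.0, 1 - yi * si) for yi, si in zip(labels, scores)]
    return sum(losses) / len(losses)
```

Each training step would then call get_batch, forward only that batch, backward, and update, so the cost per step no longer grows with the dataset size.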
02:20:08.360 --> 02:20:10.360 All of them can be used for binary classification 02:20:10.360 --> 02:20:12.360 and don't make too much of a difference 02:20:12.360 --> 02:20:14.360 in the simple examples that we looked at so far. 02:20:14.360 --> 02:20:16.360 There's something called L2 regularization 02:20:16.360 --> 02:20:18.360 used here. 02:20:18.360 --> 02:20:20.360 This has to do with generalization of the neural net 02:20:20.360 --> 02:20:22.360 and controls overfitting in a machine learning setting, 02:20:22.360 --> 02:20:24.360 but I did not cover these concepts 02:20:24.360 --> 02:20:26.360 in this video; potentially later. 02:20:26.360 --> 02:20:28.360 And the training loop you should recognize. 02:20:28.360 --> 02:20:30.360 So: forward, backward, 02:20:30.360 --> 02:20:32.360 zero_grad, 02:20:32.360 --> 02:20:34.360 and update, and so on. 02:20:34.360 --> 02:20:36.360 You'll notice that in the update here, 02:20:36.360 --> 02:20:38.360 the learning rate is scaled as a function of the 02:20:38.360 --> 02:20:40.360 number of iterations, and it 02:20:40.360 --> 02:20:42.360 shrinks. 02:20:42.360 --> 02:20:44.360 And this is something called learning rate decay. 02:20:44.360 --> 02:20:46.360 So in the beginning you have a high learning rate, 02:20:46.360 --> 02:20:48.360 and as the network sort of stabilizes near the end, 02:20:48.360 --> 02:20:50.360 you bring down the learning rate 02:20:50.360 --> 02:20:52.360 to get to some of the fine details in the end. 02:20:52.360 --> 02:20:54.360 And in the end we see 02:20:54.360 --> 02:20:56.360 the decision surface of the neural net, 02:20:56.360 --> 02:20:58.360 and we see that it learned to separate out the red 02:20:58.360 --> 02:21:00.360 and the blue areas based on 02:21:00.360 --> 02:21:02.360 the data points. 02:21:02.360 --> 02:21:04.360 So that's the slightly more complicated example 02:21:04.360 --> 02:21:06.360 in the demo.ipynb 02:21:06.360 --> 02:21:08.360 that you're free to go over. 
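The two techniques mentioned in this passage, learning rate decay and L2 regularization, are each a one-liner; here is a sketch with illustrative constants (not necessarily the ones used in the demo notebook):

```python
def learning_rate(step, total_steps=100, lr_start=1.0, lr_end=0.1):
    """Linear learning rate decay: high at the start, shrinking as training
    stabilizes, so the final steps can settle the fine details."""
    frac = step / total_steps
    return lr_start + frac * (lr_end - lr_start)

def l2_penalty(params, alpha=1e-4):
    """L2 regularization: an extra loss term on the squared weights that
    discourages large weights and thereby helps control overfitting."""
    return alpha * sum(p * p for p in params)
```

In the training loop the penalty is simply added to the data loss before calling backward, and the decayed rate is used in the parameter update for that step.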
02:21:08.360 --> 02:21:10.360 But yeah, as of today, that is micrograd. 02:21:10.360 --> 02:21:12.360 I also wanted to show you a little bit of real stuff, 02:21:12.360 --> 02:21:14.360 so that you get to see how this is actually implemented 02:21:14.360 --> 02:21:16.360 in a production-grade library like PyTorch. 02:21:16.360 --> 02:21:18.360 So in particular, 02:21:18.360 --> 02:21:20.360 I wanted to find and show you 02:21:20.360 --> 02:21:22.360 the backward pass for tanh in PyTorch. 02:21:22.360 --> 02:21:24.360 So here in micrograd 02:21:24.360 --> 02:21:26.360 we see that the backward pass for tanh 02:21:26.360 --> 02:21:28.360 is 1 minus t squared, 02:21:28.360 --> 02:21:30.360 where t is the output of the tanh 02:21:30.360 --> 02:21:32.360 of x, 02:21:32.360 --> 02:21:34.360 times out.grad, 02:21:34.360 --> 02:21:36.360 which is the chain rule. 02:21:36.360 --> 02:21:38.360 So we're looking for something that looks like this. 02:21:38.360 --> 02:21:40.360 Now, I went to PyTorch, 02:21:40.360 --> 02:21:42.360 which has 02:21:42.360 --> 02:21:44.360 an open-source GitHub codebase, 02:21:44.360 --> 02:21:46.360 and I looked through a lot of its code, 02:21:46.360 --> 02:21:48.360 and honestly, 02:21:48.360 --> 02:21:50.360 I spent about 15 minutes 02:21:50.360 --> 02:21:52.360 and I couldn't find tanh. 02:21:52.360 --> 02:21:54.360 And that's because these libraries, unfortunately, 02:21:54.360 --> 02:21:56.360 they grow in size and entropy. 02:21:56.360 --> 02:21:58.360 And if you just search for tanh, 02:21:58.360 --> 02:22:00.360 you get apparently 2,800 results 02:22:00.360 --> 02:22:02.360 in 406 files. 02:22:02.360 --> 02:22:04.360 So I don't know what these files 02:22:04.360 --> 02:22:06.360 are doing, honestly, 02:22:06.360 --> 02:22:08.360 and why there are 02:22:08.360 --> 02:22:10.360 so many mentions of tanh. 
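The expression being searched for, the tanh backward pass, written out as a tiny sketch (the standalone function is mine; micrograd expresses the same thing inside the Value class):

```python
import math

def tanh_backward(x, out_grad):
    """Chain rule for tanh: the local derivative is 1 - tanh(x)**2,
    multiplied by the incoming gradient out_grad."""
    t = math.tanh(x)
    return (1 - t ** 2) * out_grad
```

A quick finite-difference check confirms that (1 - t**2) really is the slope of tanh at x.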
02:22:10.360 --> 02:22:12.360 But unfortunately these libraries are quite complex; 02:22:12.360 --> 02:22:14.360 they're meant to be used, not really inspected. 02:22:14.360 --> 02:22:16.360 Eventually I did 02:22:16.360 --> 02:22:18.360 stumble on someone 02:22:18.360 --> 02:22:20.360 who tries to change 02:22:20.360 --> 02:22:22.360 the tanh backward code for some reason, 02:22:22.360 --> 02:22:24.360 and someone here pointed to the 02:22:24.360 --> 02:22:26.360 CPU kernel and the CUDA kernel for 02:22:26.360 --> 02:22:28.360 tanh backward. 02:22:28.360 --> 02:22:30.360 So basically it depends on if you're using 02:22:30.360 --> 02:22:32.360 PyTorch on a CPU device or on a GPU, 02:22:32.360 --> 02:22:34.360 which are different devices, 02:22:34.360 --> 02:22:36.360 and I haven't covered this. 02:22:36.360 --> 02:22:38.360 But this is the tanh backward kernel 02:22:38.360 --> 02:22:40.360 for CPU, 02:22:40.360 --> 02:22:42.360 and the reason it's so large 02:22:42.360 --> 02:22:44.360 is that, 02:22:44.360 --> 02:22:46.360 number one, this is if you're using a complex type, 02:22:46.360 --> 02:22:48.360 which we haven't even talked about; 02:22:48.360 --> 02:22:50.360 number two, if you're using the specific data type bfloat16, 02:22:50.360 --> 02:22:52.360 which we haven't talked about; 02:22:52.360 --> 02:22:54.360 and then if you're not, 02:22:54.360 --> 02:22:56.360 then this is the kernel, 02:22:56.360 --> 02:22:58.360 and deep here we see something that resembles 02:22:58.360 --> 02:23:00.360 our backward pass. 
02:23:00.360 --> 02:23:02.360 So they have a times one minus 02:23:02.360 --> 02:23:04.360 b squared, 02:23:04.360 --> 02:23:06.360 so this b here 02:23:06.360 --> 02:23:08.360 must be the output of the tanh, 02:23:08.360 --> 02:23:10.360 and this is the out.grad. 02:23:10.360 --> 02:23:12.360 So here we found it, 02:23:12.360 --> 02:23:14.360 deep inside 02:23:14.360 --> 02:23:16.360 PyTorch at this location, 02:23:16.360 --> 02:23:18.360 for some reason inside the binary ops kernel, 02:23:18.360 --> 02:23:20.360 even though tanh is not actually a binary op. 02:23:20.360 --> 02:23:22.360 And then this is the 02:23:22.360 --> 02:23:24.360 GPU kernel: 02:23:24.360 --> 02:23:26.360 if we're not complex, 02:23:26.360 --> 02:23:28.360 we're here, 02:23:28.360 --> 02:23:30.360 and here we go with one line of code. 02:23:30.360 --> 02:23:32.360 So we did find it, 02:23:32.360 --> 02:23:34.360 but basically, unfortunately, 02:23:34.360 --> 02:23:36.360 these codebases are very large, 02:23:36.360 --> 02:23:38.360 and micrograd is very, very simple, 02:23:38.360 --> 02:23:40.360 but if you actually want to use real stuff 02:23:40.360 --> 02:23:42.360 and find the code for it, 02:23:42.360 --> 02:23:44.360 you'll actually find that difficult. 02:23:44.360 --> 02:23:46.360 I also wanted to show you 02:23:46.360 --> 02:23:48.360 a little example here where PyTorch is showing you 02:23:48.360 --> 02:23:50.360 you can register a new type of function 02:23:50.360 --> 02:23:52.360 that you want to add to PyTorch 02:23:52.360 --> 02:23:54.360 as a Lego building block. 02:23:54.360 --> 02:23:56.360 So here, if you want to for example add 02:23:56.360 --> 02:23:58.360 a Legendre polynomial (degree 3), 02:23:58.360 --> 02:24:00.360 here's how you could do it. 02:24:00.360 --> 02:24:02.360 You will register it 02:24:02.360 --> 02:24:04.360 as a class that 02:24:04.360 --> 02:24:06.360 subclasses torch.autograd.Function, 02:24:06.360 --> 02:24:08.360 and then you have to tell PyTorch how to forward 02:24:08.360 --> 02:24:10.360 your new 
function 02:24:10.360 --> 02:24:12.360 and how to backward through it 02:24:12.360 --> 02:24:14.360 so as long as you can do the forward pass 02:24:14.360 --> 02:24:16.360 of this little function piece that you want to add 02:24:16.360 --> 02:24:18.360 and as long as you know the local 02:24:18.360 --> 02:24:20.360 derivative, the local gradients 02:24:20.360 --> 02:24:22.360 which are implemented in the backward 02:24:22.360 --> 02:24:24.360 PyTorch will be able to back propagate through your function 02:24:24.360 --> 02:24:26.360 and then you can use this as a lego block 02:24:26.360 --> 02:24:28.360 in a larger lego castle 02:24:28.360 --> 02:24:30.360 of all the different lego blocks that PyTorch already has 02:24:30.360 --> 02:24:32.360 and so that's the only thing 02:24:32.360 --> 02:24:34.360 you have to tell PyTorch and everything will just work 02:24:34.360 --> 02:24:36.360 and you can register new types of functions 02:24:36.360 --> 02:24:38.360 in this way following this example 02:24:38.360 --> 02:24:40.360 and that is everything that I wanted to cover 02:24:40.360 --> 02:24:42.360 in this lecture 02:24:42.360 --> 02:24:44.360 so I hope you enjoyed building out micrograd with me 02:24:44.360 --> 02:24:46.360 I hope you find it interesting, insightful 02:24:46.360 --> 02:24:48.360 and yeah 02:24:48.360 --> 02:24:50.360 I will post a lot of the links 02:24:50.360 --> 02:24:52.360 that are related to this video 02:24:52.360 --> 02:24:54.360 in the video description below 02:24:54.360 --> 02:24:56.360 I will also probably post a link to a discussion forum 02:24:56.360 --> 02:24:58.360 or discussion group where you can ask 02:24:58.360 --> 02:25:00.360 questions related to this video 02:25:00.360 --> 02:25:02.360 and then I can answer or someone else can answer 02:25:02.360 --> 02:25:04.360 your questions 02:25:04.360 --> 02:25:06.360 and I may also do a follow up video 02:25:06.360 --> 02:25:08.360 that answers some of the most common questions 02:25:08.360 
--> 02:25:10.360 but for now that's it 02:25:10.360 --> 02:25:12.360 I hope you enjoyed it 02:25:12.360 --> 02:25:14.360 if you did, then please like and subscribe 02:25:14.360 --> 02:25:16.360 so that YouTube knows to feature this video to more people 02:25:16.360 --> 02:25:18.360 and that's it for now, I'll see you later 02:25:18.360 --> 02:25:20.360 bye
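The torch.autograd.Function registration pattern discussed near the end of the lecture can be sketched in code; this mirrors PyTorch's own tutorial example of a degree-3 Legendre polynomial and assumes PyTorch is installed:

```python
import torch

class LegendrePolynomial3(torch.autograd.Function):
    """P3(x) = (5x^3 - 3x) / 2, registered as a new autograd 'Lego block'."""

    @staticmethod
    def forward(ctx, input):
        # do the forward pass, stashing what backward will need
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # local derivative P3'(x) = (15x^2 - 3) / 2, times the incoming gradient
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# Usage: call through .apply, and PyTorch backpropagates through it like any
# built-in operation.
x = torch.tensor(2.0, requires_grad=True)
y = LegendrePolynomial3.apply(x)
y.backward()
```

As the transcript says, supplying just the forward pass and the local derivative is the whole contract; PyTorch's autograd handles the rest of the chain rule.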