[00:00.000 --> 00:15.000] Please welcome AI researcher and founding member of OpenAI, Andrej Karpathy. [00:15.000 --> 00:23.560] Andrej Karpathy: Hi, everyone. [00:23.560 --> 00:28.820] I'm happy to be here to tell you about the state of GPT and more generally about the [00:28.820 --> 00:32.140] rapidly growing ecosystem of large language models. [00:32.140 --> 00:35.400] So I would like to partition the talk into two parts. [00:35.400 --> 00:39.580] In the first part, I would like to tell you about how we train GPT assistants. [00:39.580 --> 00:43.740] And then in the second part, we are going to take a look at how we can use these assistants [00:43.740 --> 00:46.760] effectively for your applications. [00:46.760 --> 00:50.500] So first, let's take a look at the emerging recipe for how to train these assistants. [00:50.500 --> 00:53.200] And keep in mind that this is all very new and still rapidly evolving. [00:53.200 --> 00:55.960] But so far, the recipe looks something like this. [00:55.960 --> 00:58.800] Now this is kind of a complicated slide, so I'm going to go through it piece by piece. [00:58.800 --> 01:05.400] But roughly speaking, we have four major stages: pre-training, supervised fine-tuning, [01:05.400 --> 01:10.040] reward modeling, and reinforcement learning, and they follow each other serially. [01:10.040 --> 01:14.900] Now in each stage, we have a data set that powers that stage.
[01:14.900 --> 01:21.420] We have an algorithm that for our purposes will be an objective for training the neural [01:21.420 --> 01:22.420] network. [01:22.420 --> 01:24.060] And then we have a resulting model. [01:24.060 --> 01:25.980] And then there's some notes on the bottom. [01:25.980 --> 01:28.160] So the first stage we're going to start with is the pre-training stage. [01:28.800 --> 01:33.640] Now this stage is kind of special in this diagram, and this diagram is not to scale. [01:33.640 --> 01:36.460] Because this stage is where all of the computational work basically happens. [01:36.460 --> 01:42.020] This is 99% of the training compute time, and also flops. [01:42.020 --> 01:48.560] And so this is where we are dealing with internet-scale data sets with thousands of GPUs in the supercomputer, [01:48.560 --> 01:51.420] and also months of training, potentially. [01:51.420 --> 01:56.120] The other three stages are fine-tuning stages that are much more along the lines of a small [01:56.120 --> 01:58.160] number of GPUs and hours or days. [01:58.800 --> 02:04.180] So let's take a look at the pre-training stage to achieve a base model. [02:04.180 --> 02:07.860] First we're going to gather a large amount of data. [02:07.860 --> 02:13.100] Here's an example of what we call a data mixture that comes from this paper that was released [02:13.100 --> 02:16.400] by Meta, where they released this LLaMA base model. [02:16.400 --> 02:20.560] Now you can see roughly the kinds of data sets that enter into these collections. [02:20.560 --> 02:25.540] So we have Common Crawl, which is just a web scrape, C4, which is also Common Crawl, and [02:25.540 --> 02:27.360] then some high-quality data sets as well. [02:27.360 --> 02:27.960] So for example, GitHub, Wikipedia, and so on.
[02:29.800 --> 02:36.280] These are all mixed up together, and then they are sampled according to some given proportions, [02:36.280 --> 02:40.040] and that forms the training set for the neural net, for the GPT. [02:40.040 --> 02:45.120] Now before we can actually train on this data, we need to go through one more pre-processing [02:45.120 --> 02:46.920] step, and that is tokenization. [02:46.920 --> 02:50.820] And this is basically a translation of the raw text that we scraped from the internet [02:50.820 --> 02:53.300] into sequences of integers. [02:53.300 --> 02:57.420] Because that's the native representation over which GPTs function. [02:57.960 --> 03:04.180] Now, this is a lossless kind of translation between pieces of text and tokens and integers. And there are a [03:04.180 --> 03:07.960] number of algorithms for this stage. Typically, for example, you could use something like byte [03:07.960 --> 03:14.180] pair encoding, which iteratively merges little text chunks and groups them into tokens. And so [03:14.180 --> 03:18.780] here I'm showing some example chunks of these tokens. And then this is the raw integer sequence [03:18.780 --> 03:26.060] that will actually feed into a transformer. Now, here I'm showing two examples of [03:26.060 --> 03:31.300] hyperparameters that govern this stage.
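[Editor's note] The byte pair encoding idea mentioned above can be sketched in a few lines. This is a toy illustration only, not the actual tokenizer used for GPT models (which operates on bytes with a large learned merge table); the example string and the new token id 256 are made up for demonstration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids in the sequence."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # start from raw bytes (ids 0-255)
pair = most_frequent_pair(ids)     # the most common adjacent byte pair
ids = merge(ids, pair, 256)        # mint a new token id, 256, for that pair
print(len(text), "->", len(ids))   # one merge already shortens the sequence
```

Real tokenizers repeat this merge step thousands of times on a large corpus, building a vocabulary of tens of thousands of tokens.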
So for GPT-4, we did not release too much information about how [03:31.300 --> 03:35.340] it was trained and so on. So I'm using GPT-3's numbers. But GPT-3 is, of course, a little bit [03:35.340 --> 03:41.180] old by now, about three years ago. But LLaMA is a fairly recent model from Meta. So these are [03:41.180 --> 03:44.280] roughly the orders of magnitude that we're dealing with when we're doing pre-training. [03:45.080 --> 03:49.780] The vocabulary size is usually a couple of ten thousand tokens. The context length is usually something [03:49.780 --> 03:55.740] like 2,000, 4,000, or nowadays even 100,000. And this governs the maximum number of integers [03:56.060 --> 04:00.040] that the GPT will look at when it's trying to predict the next integer in a sequence. [04:01.800 --> 04:05.840] You can see that roughly the number of parameters is, say, 65 billion for LLaMA. [04:06.280 --> 04:11.040] Now, even though LLaMA has only 65B parameters compared to GPT-3's 175 billion parameters, [04:11.440 --> 04:16.360] LLaMA is a significantly more powerful model. And intuitively, that's because the model is [04:16.360 --> 04:20.980] trained for significantly longer. In this case, 1.4 trillion tokens instead of just 300 billion [04:20.980 --> 04:25.160] tokens. So you shouldn't judge the power of a model just by the number of parameters that it [04:25.160 --> 04:25.500] contains. [04:26.060 --> 04:32.420] Below, I'm showing some tables of rough hyperparameters that typically go into specifying [04:32.420 --> 04:36.520] the transformer neural network. So the number of heads, the dimension size, number of layers, [04:36.600 --> 04:42.140] and so on. And on the bottom, I'm showing some training hyperparameters.
So for example, [04:42.280 --> 04:50.920] to train the 65B model, Meta used 2,000 GPUs, roughly 21 days of training, and roughly several [04:50.920 --> 04:56.040] million dollars. And so that's the rough orders of magnitude that you should have in mind for the [04:56.060 --> 04:56.940] pre-training stage. [04:59.040 --> 05:03.680] Now, when we're actually pre-training, what happens? Roughly speaking, we are going to take our tokens, [05:03.700 --> 05:08.520] and we're going to lay them out into data batches. So we have these arrays that will feed into the [05:08.520 --> 05:13.720] transformer, and these arrays are B by T: B being the batch size, with independent examples stacked [05:13.720 --> 05:19.400] up in rows, and T being the maximum context length. So in my picture, I only have 10 as [05:20.180 --> 05:26.040] the context length. So this could be 2,000, 4,000, et cetera. So these are extremely long rows. And what we do is we take these [05:26.060 --> 05:32.440] documents, and we pack them into rows, and we delimit them with these special end-of-text tokens, basically telling the [05:32.440 --> 05:39.020] transformer where a new document begins. And so here I have a few examples of documents, and then I've stretched them out [05:39.020 --> 05:48.640] into this input. Now, we're going to feed all of these numbers into the transformer. And let me just focus on a single [05:48.640 --> 05:55.440] particular cell, but the same thing will happen at every cell in this diagram. So let's look at the green cell. The green cell is [05:56.060 --> 06:03.740] going to take a look at all of the tokens before it, so all of the tokens in yellow, and we're going to feed that entire context into the [06:03.740 --> 06:10.820] transformer neural network, and the transformer is going to try to predict the next token in the sequence, in this case, in red.
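[Editor's note] The batching step described above can be sketched as follows. This is an illustrative toy, not Meta's or OpenAI's actual data loader; the token ids are made up, and GPT-2's `<|endoftext|>` id (50256) is borrowed here just to stand in for the end-of-text delimiter.

```python
EOT = 50256  # GPT-2's <|endoftext|> id, reused here as the document delimiter

def pack(docs, B, T, eot=EOT):
    """Concatenate docs separated by `eot`, then carve into B rows of T tokens."""
    stream = []
    for d in docs:
        stream.extend(d)
        stream.append(eot)               # mark where a new document begins
    stream = (stream + [eot] * (B * T))[: B * T]  # pad/truncate to exactly B*T
    return [stream[r * T : (r + 1) * T] for r in range(B)]

batch = pack([[11, 12, 13], [21, 22]], B=2, T=4)
# -> [[11, 12, 13, 50256], [21, 22, 50256, 50256]]
```

Real training pipelines stream billions of tokens this way, with T in the thousands rather than 4.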
Now, the [06:10.820 --> 06:16.860] transformer, I don't have too much time to, unfortunately, go into the full details of this neural network architecture. It's just a large blob of [06:16.860 --> 06:23.120] neural net stuff for our purposes, and it's got several ten billion parameters, typically, or something like that. And, of course, as you [06:23.120 --> 06:26.040] tune these parameters, you're getting slightly different predicted distributions. And so what we're going to do is we're going to take a look at the [06:26.060 --> 06:35.300] distributions for every single one of these cells. And so, for example, if our vocabulary size is 50,257 tokens, then we're going to have that [06:35.300 --> 06:42.120] many numbers, because we need to specify a probability distribution for what comes next. So basically, we have a probability for whatever may [06:42.120 --> 06:50.300] follow. Now, in this specific example, for this specific cell, 513 will come next, and so we can use this as a source of supervision to update our [06:50.300 --> 06:56.020] transformer's weights. And so we're applying this, basically, on every single cell in parallel, and we keep swapping back and forth between the [06:56.060 --> 07:03.000] batches, and we're trying to get the transformer to make the correct predictions over what token comes next in a sequence. So let me show you more [07:03.000 --> 07:10.040] concretely what this looks like when you train one of these models. This is actually coming from the New York Times, and they trained a small GPT on [07:10.040 --> 07:18.140] Shakespeare. And so here's a small snippet of Shakespeare, and they trained a GPT on it. Now, in the beginning, at initialization, the GPT starts with [07:18.140 --> 07:25.740] completely random weights, so you're just getting completely random outputs as well.
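[Editor's note] The per-cell training signal described above is just a softmax over the vocabulary followed by cross-entropy against the true next token. A minimal sketch, with a made-up 4-entry vocabulary standing in for the real ~50,257 entries:

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5, 3.0]   # stand-in for the transformer's output at one cell
target = 3                       # index of the token that actually came next
probs = softmax(logits)
loss = -math.log(probs[target])  # low loss <=> high probability on the true token
```

Training nudges the weights so that `probs[target]` goes up, at every cell, in parallel.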
But over time, as you train the GPT longer and longer, [07:26.060 --> 07:34.920] you are getting more and more coherent and consistent sort of samples from the model. And the way you sample from it, of course, is you predict what comes [07:34.920 --> 07:43.080] next, you sample from that distribution, and you keep feeding that back into the process, and you can basically sample large sequences. And so by the end, you see [07:43.080 --> 07:49.680] that the transformer has learned about words and where to put spaces and where to put commas and so on. And so we're making more and more consistent [07:49.680 --> 07:56.000] predictions over time. These are the kinds of plots that you're looking at when you're doing model pre-training. Effectively, we're looking at [07:56.060 --> 08:05.060] a loss function over time as you train, and low loss means that our transformer is giving a higher probability to the correct next [08:05.060 --> 08:14.900] integer in a sequence. Now, what are we going to do with this model once we've trained it after a month? Well, the first thing that we noticed, we, the field, is that [08:14.900 --> 08:23.540] these models basically, in the process of language modeling, learn very powerful general representations, and it's possible to very efficiently fine-tune them [08:23.540 --> 08:26.040] for any arbitrary downstream task you might be interested in. [08:26.060 --> 08:37.120] So as an example, if you're interested in sentiment classification, the approach used to be that you collect a bunch of positives and negatives, and then you train some kind of an NLP model for that. [08:37.120 --> 08:50.740] But the new approach is: ignore sentiment classification, go off and do large language model pre-training, train the large transformer, and then you may only have a few examples, and you can very efficiently fine-tune your model for that task.
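[Editor's note] The sampling loop described above (predict a distribution, sample from it, feed it back in) can be sketched like this. The "model" here is a made-up two-token bigram table standing in for the transformer's forward pass; it is purely illustrative.

```python
import random

# P(next token | current token), over a toy vocabulary of two tokens {0, 1}.
bigram = {0: [0.1, 0.9], 1: [0.8, 0.2]}

def sample_sequence(start, n, rng):
    """Autoregressive sampling: each sampled token becomes the next context."""
    seq = [start]
    for _ in range(n):
        probs = bigram[seq[-1]]                        # the "forward pass"
        seq.append(rng.choices([0, 1], weights=probs)[0])
    return seq

print(sample_sequence(0, 8, random.Random(42)))
```

A real GPT conditions on the whole context window rather than just the last token, but the loop structure is the same.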
[08:50.740 --> 08:55.940] And so this works very well in practice, and the reason for this is that basically, the transformer is [08:55.940 --> 09:02.260] forced to multitask a huge number of tasks in the language modeling task, because just in terms of [09:02.260 --> 09:06.680] predicting the next token, it's forced to understand a lot about the structure of the text [09:06.680 --> 09:12.760] and all the different concepts therein. So that was GPT-1. Now, around the time of GPT-2, people [09:12.760 --> 09:17.120] noticed that actually even better than fine-tuning, you can actually prompt these models very [09:17.120 --> 09:21.080] effectively. So these are language models, and they want to complete documents. So you can actually [09:21.080 --> 09:26.340] trick them into performing tasks just by arranging these fake documents. So in this example, [09:26.500 --> 09:31.800] for example, we have some passage, and then we sort of like do QA, QA, QA. This is called a few-shot [09:31.800 --> 09:36.040] prompt. And then we do Q. And then as the transformer is trying to complete the document, [09:36.200 --> 09:40.560] it's actually answering our question. And so this is an example of prompt engineering a base model, [09:40.900 --> 09:44.960] making it believe that it's sort of imitating a document and getting it to perform a task. [09:45.680 --> 09:50.600] And so this kicked off, I think, the era of, I would say, prompting over fine-tuning and seeing [09:50.600 --> 09:54.540] that this actually can work extremely well on a lot of problems, even without training any neural [09:54.540 --> 10:00.380] networks, fine-tuning, or so on. Now, since then, we've seen an entire evolutionary tree of base [10:00.380 --> 10:06.880] models that everyone has trained. Not all of these models are available. For example, the GPT-4 base [10:06.880 --> 10:11.480] model was never released.
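[Editor's note] The few-shot "fake document" trick described above amounts to laying out Q/A pairs as if they were part of one document and ending with the real question, so that the base model's document completion supplies the answer. A sketch with made-up examples:

```python
def few_shot_prompt(examples, question):
    """Build a QA, QA, ..., Q prompt that a base model 'completes' with an answer."""
    lines = []
    for q, a in examples:
        lines += [f"Q: {q}", f"A: {a}"]
    lines += [f"Q: {question}", "A:"]   # the model's completion after "A:" is the answer
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("What is the capital of France?", "Paris"),
     ("What is 2 + 2?", "4")],
    "What color is the sky?",
)
print(prompt)
```

The exact delimiters ("Q:"/"A:" here) are a free choice; what matters is that the prompt looks like a plausible document the model would want to continue.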
The GPT-4 model that you might be interacting with over API is not a base [10:11.480 --> 10:16.900] model. It's an assistant model. And we're going to cover how to get those in a bit. The GPT-3 base [10:16.900 --> 10:20.020] model is available via the API under the name davinci. [10:20.020 --> 10:25.300] And the GPT-2 base model is available even as weights on our GitHub repo. But currently, [10:25.420 --> 10:30.660] the best available base model probably is the LLaMA series from Meta, although it is not [10:30.660 --> 10:37.100] commercially licensed. Now, one thing to point out is base models are not assistants. They don't [10:37.100 --> 10:44.200] want to answer your questions. They just want to complete documents. So if you tell them, [10:44.480 --> 10:49.540] write a poem about bread and cheese, it will answer questions with more questions. It's just [10:49.540 --> 10:54.280] completing what it thinks is a document. However, you can prompt them in a specific way for base [10:54.280 --> 11:00.060] models that is more likely to work. So as an example: here's a poem about bread and cheese. And in [11:00.060 --> 11:06.400] that case, it will autocomplete correctly. You can even trick base models into being assistants. And [11:06.400 --> 11:10.860] the way you would do this is you would create like a specific few-shot prompt that makes it look like [11:10.860 --> 11:14.680] there's some kind of a document between a human and an assistant, and they're exchanging sort of [11:14.680 --> 11:19.160] information. And then at the bottom, you sort of put your query at the end. [11:19.540 --> 11:26.140] And the base model will sort of like condition itself into being like a helpful assistant and kind of [11:26.140 --> 11:30.100] answer. But this is not very reliable and doesn't work super well in practice, although it can be [11:30.100 --> 11:36.020] done.
So instead, we have a different path to make actual GPT assistants, not just base model document [11:36.020 --> 11:40.840] completers. And so that takes us into supervised fine-tuning. So in the supervised fine-tuning [11:40.840 --> 11:46.660] stage, we are going to collect small but high-quality data sets. And in this case, we're going to ask [11:49.540 --> 11:54.160] human contractors to collect data of the form: prompt, and ideal response. And we're going to collect lots of these, [11:54.160 --> 11:58.480] typically tens of thousands or something like that. And then we're going to still do language [11:58.480 --> 12:02.320] modeling on this data. So nothing changed algorithmically. We're just swapping out a [12:02.320 --> 12:08.080] training set. So it used to be internet documents, which is high-quantity, low-quality, and we swap it out for [12:08.080 --> 12:14.740] basically QA prompt-response kind of data. And that is low-quantity, high-quality. So we [12:14.740 --> 12:19.060] still do language modeling, and then after training, we get an SFT model. And you can [12:19.540 --> 12:24.100] actually deploy these models, and they are actual assistants, and they work to some extent. Let me show you what [12:24.100 --> 12:27.640] an example demonstration might look like. So here's something that a human contractor might [12:27.640 --> 12:32.380] come up with. Here's some random prompt: can you write a short introduction about the relevance of [12:32.380 --> 12:36.700] the term monopsony, or something like that? And then the contractor also writes out an ideal [12:36.700 --> 12:41.260] response. And when they write out these responses, they are following extensive labeling documentation, [12:41.260 --> 12:47.440] and they are being asked to be helpful, truthful, and harmless. And these are the labeling instructions [12:47.440 --> 12:49.520] here. You probably can't read it. [12:49.540 --> 12:54.100] Neither can I.
But they're long, and this is just people following instructions and trying to [12:54.100 --> 12:58.360] complete these prompts. So that's what the data set looks like, and you can train these models, [12:58.360 --> 13:03.760] and this works to some extent. Now, you can actually continue the pipeline from here on and [13:03.760 --> 13:09.100] go into RLHF, reinforcement learning from human feedback, that consists of both reward modeling [13:09.100 --> 13:13.300] and reinforcement learning. So let me cover that, and then I'll come back to why you may want to go [13:13.300 --> 13:18.040] through the extra steps and how that compares to just SFT models. So in the reward modeling step, [13:19.540 --> 13:23.860] we're now going to shift our data collection to be of the form of comparisons. So here's an example [13:23.860 --> 13:28.180] of what our data set will look like. I have the same prompt, identical prompt on the top, [13:28.180 --> 13:34.240] which is asking the assistant to write a program or a function that checks if a given string is [13:34.240 --> 13:39.340] a palindrome. And then what we do is we take the SFT model, which we've already trained, [13:39.340 --> 13:43.060] and we create multiple completions. So in this case, we have three completions that the model [13:43.060 --> 13:49.120] has created. And then we ask people to rank these completions. So if you stare at this for a while, [13:49.540 --> 13:53.680] it is very difficult to compare some of these completions. And this can take [13:53.680 --> 14:00.400] people even hours for a single prompt-completion pair. But let's say we decided that one of these [14:00.400 --> 14:04.900] is much better than the others and so on. So we rank them. Then we can follow that with something [14:04.900 --> 14:08.680] that looks very much kind of like a binary classification on all the possible pairs [14:08.680 --> 14:14.080] between these completions.
So what we do now is we lay out our prompt in rows, and the prompt [14:14.080 --> 14:18.940] is identical across all three rows here. So it's all the same prompt, but the completion does vary. [14:19.540 --> 14:23.920] So the yellow tokens are coming from the SFT model. Then what we do is we append another [14:23.920 --> 14:30.460] special reward readout token at the end. And we basically only supervise the transformer at this [14:30.460 --> 14:36.640] single green token. And the transformer will predict some reward for how good that completion [14:36.640 --> 14:42.580] is for that prompt. And so basically, it makes a guess about the quality of each completion. And [14:42.580 --> 14:46.720] then once it makes a guess for every one of them, we also have the ground truth, which is telling [14:46.720 --> 14:49.480] us the ranking of them. And so we can actually enforce that [14:49.480 --> 14:53.620] some of these numbers should be much higher than others and so on. We formulate this into a loss [14:53.620 --> 14:58.060] function, and we train our model to make reward predictions that are consistent with the ground [14:58.060 --> 15:02.440] truth coming from the comparisons from all these contractors. So that's how we train our reward [15:02.440 --> 15:08.320] model. And that allows us to score how good a completion is for a prompt. Once we have a reward [15:08.320 --> 15:13.780] model, we can't deploy this because this is not very useful as an assistant by itself, but it's [15:13.780 --> 15:18.400] very useful for the reinforcement learning stage that follows now. Because we have a reward model, [15:18.400 --> 15:19.180] we can score [15:19.480 --> 15:24.280] the quality of any arbitrary completion for any given prompt. So what we do during reinforcement [15:24.280 --> 15:29.080] learning is we basically get, again, a large collection of prompts.
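[Editor's note] The loss function mentioned above is, in comparison-based reward modeling of the kind used for InstructGPT-style training, typically a pairwise ranking loss: for each pair where humans preferred completion i over completion j, minimize -log(sigmoid(r_i - r_j)). A sketch with made-up reward values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(r_preferred, r_rejected):
    """Small when the preferred completion out-scores the rejected one."""
    return -math.log(sigmoid(r_preferred - r_rejected))

# Scalar rewards the model guessed for three ranked completions (best first).
rewards = [1.5, 0.2, -1.2]
loss = sum(pairwise_loss(rewards[i], rewards[j])
           for i in range(len(rewards))
           for j in range(i + 1, len(rewards)))
```

Gradient descent on this loss pushes the reward of preferred completions up relative to rejected ones, which is exactly the "some of these numbers should be much higher than others" constraint.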
And now we do reinforcement [15:29.080 --> 15:33.940] learning with respect to the reward model. So here's what that looks like. We take a single [15:33.940 --> 15:39.520] prompt, we lay it out in rows, and now we use basically the model we'd like to train, [15:39.520 --> 15:44.680] which is initialized at the SFT model, to create some completions in yellow. And then we append [15:44.680 --> 15:49.420] the reward token again, and we read off the reward according to the reward model, which [15:49.480 --> 15:54.820] is now kept fixed. It doesn't change anymore. And now the reward model tells us the quality of every [15:54.820 --> 15:59.740] single completion for these prompts. And so what we can do is we can now just basically apply the [15:59.740 --> 16:05.920] same language modeling loss function, but we're now training on the yellow tokens, and we [16:05.920 --> 16:11.740] are weighing the language modeling objective by the rewards indicated by the reward model. So as [16:11.740 --> 16:16.960] an example, in the first row, the reward model said that this is a fairly high-scoring completion, [16:16.960 --> 16:19.260] and so all of the tokens that we happened to sample [16:19.480 --> 16:23.640] on the first row are going to get reinforced and they're going to get higher probabilities [16:23.640 --> 16:28.700] for the future. Conversely, on the second row, the reward model really did not like this completion, [16:28.920 --> 16:33.940] negative 1.2. And so therefore, every single token that we sampled in that second row is going to [16:33.940 --> 16:38.200] get a slightly lower probability in the future. And we do this over and over on many prompts, [16:38.200 --> 16:44.520] on many batches. And basically, we get a policy which creates yellow tokens here, such that [16:44.720 --> 16:48.540] all of the completions will score high according to the reward model that [16:48.540 --> 16:54.640] we trained in the previous stage. So that's how we train.
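[Editor's note] The "weigh the language modeling objective by the reward" idea above can be sketched in its barest form. Real RLHF uses PPO with extra machinery (KL penalties, clipping, baselines), none of which is modeled here; all numbers below are made up for illustration.

```python
def weighted_lm_loss(token_logprobs, reward):
    """Negative log-likelihood of the sampled tokens, scaled by the reward."""
    return -reward * sum(token_logprobs)

# Log-probs the policy assigned to its own sampled (yellow) tokens:
row1 = [-0.2, -0.1, -0.3]  # reward model scored this completion +1.0
row2 = [-0.5, -0.4]        # reward model scored this completion -1.2

loss = weighted_lm_loss(row1, 1.0) + weighted_lm_loss(row2, -1.2)
```

Minimizing this raises the probability of tokens from high-reward completions and lowers the probability of tokens from negatively scored ones, which is exactly the behavior described for the two rows above.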
That's what the RLHF pipeline is. [16:55.840 --> 16:59.780] Now, at the end, you get a model that you could deploy. And so as an example, [17:00.160 --> 17:05.040] ChatGPT is an RLHF model. But some other models that you might come across, like for example, [17:05.040 --> 17:11.100] Vicuna-13B and so on, these are SFT models. So we have base models, SFT models, and RLHF models. [17:11.900 --> 17:16.500] And that's kind of like the state of things there. Now, why would you want to do RLHF? [17:16.920 --> 17:18.400] So one answer that is kind of [17:18.540 --> 17:22.440] not that exciting is that it just works better. So this comes from the InstructGPT paper. [17:22.840 --> 17:27.760] According to these experiments a while ago now, these PPO models are RLHF models. [17:27.760 --> 17:31.720] And we see that they are basically just preferred in a lot of comparisons [17:32.300 --> 17:34.880] when we give them to humans. So humans basically just prefer [17:35.460 --> 17:41.800] tokens that come from RLHF models compared to SFT models, compared to a base model that is prompted to be an assistant. [17:41.800 --> 17:45.160] And so it just works better. But you might ask why? [17:45.800 --> 17:48.380] Why does it work better? And I don't think that there's a single [17:48.540 --> 17:55.440] amazing answer that the community has really agreed on, but I will just offer one reason, potentially. [17:55.440 --> 18:01.800] And it has to do with the asymmetry between how easy computationally it is to compare versus generate. [18:02.300 --> 18:07.920] So let's take an example of generating a haiku. Suppose I ask a model to write a haiku about paperclips. [18:07.920 --> 18:14.160] If you're a contractor trying to give training data, then imagine being a contractor collecting basically data for the SFT stage. [18:14.160 --> 18:18.380] How are you supposed to create a nice haiku for a paperclip? You might just not be very good at that.
[18:18.540 --> 18:23.660] But if I give you a few examples of haikus, you might be able to appreciate some of these haikus a lot more than others. [18:23.660 --> 18:26.780] And so judging which one of these is good is a much easier task. [18:26.780 --> 18:33.160] And so basically this asymmetry makes it so that comparisons are a better way to potentially [18:33.340 --> 18:37.040] leverage yourself as a human and your judgment to create a slightly better model. [18:37.740 --> 18:43.040] Now, RLHF models are not strictly an improvement on the base models in some cases. [18:43.360 --> 18:46.580] So in particular, we've noticed, for example, that they lose some entropy. [18:46.580 --> 18:48.420] So that means that they give more [18:48.540 --> 18:56.120] peaky results. They can output samples with lower variation than the base model. [18:56.120 --> 19:00.240] So the base model has lots of entropy and will give lots of diverse outputs. [19:00.240 --> 19:13.740] So, for example, one kind of place where I still prefer to use a base model is in a setup where you basically have n things and you want to generate more things like it. [19:13.740 --> 19:16.700] And so here is an example that I just cooked up. [19:16.700 --> 19:18.540] I want to generate cool Pokemon names. [19:18.540 --> 19:24.160] I gave it seven Pokemon names, and I asked the base model to complete the document, and it gave me a lot more Pokemon names. [19:24.460 --> 19:28.540] These are fictitious. I tried to look them up. I don't believe they're actual Pokemon. [19:29.420 --> 19:33.260] And this is the kind of task that I think the base model would be good at, because it still has lots of entropy. [19:33.260 --> 19:38.100] It will give you lots of diverse, cool, kind of more things that look like whatever you give it before.
[19:40.220 --> 19:44.860] Having said all that, these are kind of like the assistant models [19:44.860 --> 19:46.860] that are probably available to you at this point. [19:47.260 --> 19:48.380] There's a team at Berkeley [19:48.380 --> 19:53.260] that ranked a lot of the available assistant models and gave them basically Elo ratings. [19:53.260 --> 19:59.500] So currently some of the best models, of course, are GPT-4, by far, I would say, followed by Claude, GPT-3.5, [19:59.500 --> 20:04.140] and then a number of models, some of these might be available as weights, like Vicuna, Koala, etc. [20:04.700 --> 20:13.100] And the first three rows here, they're all RLHF models, and all of the other models, to my knowledge, are SFT models, I believe. [20:16.060 --> 20:18.300] Okay, so that's how we train these models, [20:18.300 --> 20:19.340] on a high level. [20:19.340 --> 20:25.580] Now I'm going to switch gears, and let's look at how we can best apply a GPT assistant model to your problems. [20:26.220 --> 20:29.900] Now, I would like to work in a setting of a concrete example. [20:29.900 --> 20:33.020] So let's work with a concrete example here. [20:33.020 --> 20:37.980] Let's say that you are working on an article or a blog post, and you're going to write this sentence at the end. [20:38.620 --> 20:41.020] California's population is 53 times that of Alaska. [20:41.020 --> 20:44.060] So for some reason, you want to compare the populations of these two states. [20:45.180 --> 20:48.060] Think about the rich internal monologue and tool use, [20:48.300 --> 20:53.260] and how much work actually goes on computationally in your brain to generate this one final sentence. [20:53.260 --> 20:55.100] So here's maybe what that could look like in your brain. [20:55.740 --> 21:00.380] Okay, for this next step of my blog, let me compare these two populations.
[21:01.020 --> 21:04.540] Okay, first I'm going to obviously need to get both of these populations. [21:05.180 --> 21:08.940] Now, I know that I probably don't know these populations off the top of my head. [21:08.940 --> 21:12.380] So I'm kind of like aware of what I know or don't know of my self-knowledge, right? [21:13.180 --> 21:18.140] So I go, I do some tool use, and I go to Wikipedia, and I look up California's population [21:18.300 --> 21:19.260] and Alaska's population. [21:20.140 --> 21:22.140] Now I know that I should divide the two. [21:22.140 --> 21:26.780] But again, I know that dividing 39.2 by 0.74 is very unlikely to succeed. [21:26.780 --> 21:29.580] That's not the kind of thing that I can do in my head. [21:29.580 --> 21:32.300] And so therefore, I'm going to rely on the calculator. [21:32.300 --> 21:36.140] So I'm going to use a calculator, punch it in, and see that the output is roughly 53. [21:37.180 --> 21:40.700] And then maybe I do some reflection and sanity checks in my brain. [21:40.700 --> 21:42.540] So does 53 make sense? [21:42.540 --> 21:46.220] Well, that's quite a large fraction, but then California is the most populous state. [21:46.220 --> 21:47.260] So maybe that looks okay. [21:47.260 --> 21:48.300] Okay. [21:48.300 --> 21:49.980] So then I have all the information I might need. [21:49.980 --> 21:52.700] And now I get to the sort of creative portion of writing. [21:52.700 --> 21:57.100] So I might start to write something like, California has 53x times greater. [21:57.100 --> 22:00.220] And then I think to myself, that's actually like really awkward phrasing. [22:00.220 --> 22:02.780] So let me actually delete that, and let me try again.
[22:03.420 --> 22:08.300] And so as I'm writing, I have this separate process almost inspecting what I'm writing [22:08.300 --> 22:10.140] and judging whether it looks good or not. [22:10.940 --> 22:15.100] And then maybe I delete, and maybe I reframe it, and then maybe I'm happy with what comes out. [22:15.740 --> 22:18.060] So basically, long story short, a ton happens [22:18.060 --> 22:20.380] under the hood, in terms of your internal monologue, when you [22:20.380 --> 22:21.420] create sentences like this. [22:21.980 --> 22:25.980] But what does a sentence like this look like when we are training a GPT on it? [22:27.340 --> 22:29.900] From GPT's perspective, this is just a sequence of tokens. [22:30.620 --> 22:35.180] So a GPT, when it's reading or generating these tokens, just goes chunk, chunk, [22:35.180 --> 22:36.220] chunk, chunk, chunk. [22:36.220 --> 22:39.820] And each chunk is roughly the same amount of computational work per token. [22:40.380 --> 22:43.260] And these transformers are not very shallow networks. [22:43.260 --> 22:45.340] They have about 80 layers of reasoning. [22:45.340 --> 22:46.940] But 80 is still not too much. [22:47.500 --> 22:51.340] And so this transformer is going to do its best to imitate. [22:51.340 --> 22:55.260] But of course, the process here looks very, very different from the process that you took. [22:56.460 --> 23:01.020] So in particular, in our final artifacts, in the data sets that we create and then eventually feed [23:01.020 --> 23:04.220] to LLMs, all of that internal dialogue is completely stripped. [23:04.780 --> 23:10.380] And unlike you, the GPT will look at every single token and spend the same amount of [23:10.380 --> 23:12.060] compute on every one of them. [23:12.060 --> 23:16.140] And so you can't expect it to [23:16.940 --> 23:18.540] do too much work per token.
[23:19.660 --> 23:23.740] And also in particular, basically these transformers are just token simulators. [23:23.740 --> 23:25.660] So they don't know what they don't know. [23:25.660 --> 23:27.900] They just imitate the next token. [23:27.900 --> 23:29.660] They don't know what they're good at or not good at. [23:29.660 --> 23:31.660] They just try their best to imitate the next token. [23:32.300 --> 23:33.980] They don't reflect in the loop. [23:33.980 --> 23:35.420] They don't sanity check anything. [23:35.420 --> 23:37.900] They don't correct their mistakes along the way by default. [23:37.900 --> 23:39.980] They just sample token sequences. [23:40.860 --> 23:43.660] They don't have separate inner monologue streams in their head [23:43.660 --> 23:44.940] that are evaluating what's happening. [23:45.580 --> 23:46.540] Now, they do have some [23:46.540 --> 23:51.260] sort of cognitive advantages, I would say, and that is that they do actually have a very [23:51.260 --> 23:55.980] large fact-based knowledge across a vast number of areas, because they have, say, several tens of [23:55.980 --> 23:56.860] billions of parameters. [23:56.860 --> 23:59.020] So that's a lot of storage for a lot of facts. [23:59.900 --> 24:04.620] And they also, I think, have a relatively large and perfect working memory. [24:04.620 --> 24:09.260] So whatever fits into the context window is immediately available [24:09.260 --> 24:12.380] to the transformer through its internal self-attention mechanism. [24:12.380 --> 24:16.060] And so it's kind of like perfect memory, but it's got a finite size. [24:16.540 --> 24:20.860] The transformer has very direct access to it, and so it can losslessly remember [24:20.860 --> 24:23.100] anything that is inside its context window. [24:23.980 --> 24:25.820] So that's kind of how I would compare those two.
[24:25.820 --> 24:30.460] And the reason I bring all of this up is because I think to a large extent, prompting is just [24:30.460 --> 24:37.340] making up for this sort of cognitive difference between these two kinds of architectures, like [24:37.340 --> 24:39.500] our brains here and LLM brains. [24:39.500 --> 24:40.780] You can almost look at it that way. [24:41.980 --> 24:45.900] So here's one thing that people found, for example, works pretty well in practice, especially [24:45.900 --> 24:48.140] if your tasks require reasoning. [24:48.140 --> 24:52.220] You can't expect the transformer to do too much reasoning per token. [24:52.220 --> 24:55.900] And so you have to really spread out the reasoning across more and more tokens. [24:55.900 --> 24:59.420] So, for example, you can't give a transformer a very complicated question and expect it [24:59.420 --> 25:00.780] to get the answer in a single token. [25:00.780 --> 25:02.060] There's just not enough time for it. [25:02.700 --> 25:06.300] These transformers need tokens to think, quote unquote, as I like to say sometimes. [25:06.860 --> 25:08.860] And so here are some of the things that work well. [25:08.860 --> 25:12.380] You may, for example, have a few-shot prompt that shows the transformer that it should [25:12.380 --> 25:15.660] show its work when it's answering a question. [25:15.900 --> 25:20.700] And if you give a few examples, the transformer will imitate that template and it will just [25:20.700 --> 25:23.740] end up working out better in terms of its evaluation. [25:24.540 --> 25:28.220] Additionally, you can elicit this kind of behavior from the transformer by saying, let's [25:28.220 --> 25:32.860] think step by step, because this conditions the transformer into sort of showing [25:32.860 --> 25:33.580] its work.
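The few-shot "show your work" idea described here can be sketched as a small prompt builder. This is a minimal illustration, not anything from the talk's slides: the worked example and the helper name `build_cot_prompt` are invented, and the "Let's think step by step" cue is the one mentioned above.

```python
# A minimal sketch of a few-shot chain-of-thought prompt builder.
# Each worked example shows its reasoning before the final answer, so the
# model imitates the template and spreads its reasoning across many tokens.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have?",
        "reasoning": "Roger starts with 5. Two cans of 3 is 6. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt that demonstrates step-by-step work."""
    parts = []
    for ex in examples:
        parts.append(f"Q: {ex['question']}\nA: Let's think step by step. "
                     f"{ex['reasoning']} The answer is {ex['answer']}.")
    # End with the new question and the same cue, so the model continues
    # in "show your work" mode instead of answering in a single token.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)
```

The resulting string would be sent to the model as-is; the model then completes the final answer in the same worked-example format.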
[25:33.580 --> 25:37.900] And because it kind of snaps into a mode of showing its work, it's going to do less [25:37.900 --> 25:39.500] computational work per token. [25:40.060 --> 25:45.180] And so it's more likely to succeed as a result, because it's doing slower reasoning over [25:45.180 --> 25:45.580] time. [25:46.460 --> 25:47.420] Here's another example. [25:47.420 --> 25:48.860] This one is called self-consistency. [25:49.820 --> 25:54.540] We saw that I had the ability to start writing, and then if it didn't work out, I could try [25:54.540 --> 26:00.540] again, and I could try multiple times and maybe select the one that worked best. [26:00.540 --> 26:04.700] So in these kinds of approaches, you may sample not just once, but you may sample multiple [26:04.700 --> 26:09.260] times and then have some process for finding the ones that are good and then keeping just [26:09.260 --> 26:11.900] those samples, or doing a majority vote, or something like that. [26:11.900 --> 26:15.580] So basically, as these transformers predict the next token, [26:15.900 --> 26:19.900] just like you, they can get unlucky; they could sample a not very good [26:19.900 --> 26:23.900] token, and they can go down sort of a blind alley in terms of reasoning. [26:23.900 --> 26:27.180] And so unlike you, they cannot recover from that. [26:27.180 --> 26:30.940] They are stuck with every single token they sample, and so they will continue the sequence [26:30.940 --> 26:34.060] even if they know that this sequence is not going to work out. [26:34.060 --> 26:39.820] So give them the ability to look back, inspect, or basically sample around [26:39.820 --> 26:39.980] it. [26:41.180 --> 26:41.980] Here's one more technique.
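The self-consistency idea above (sample multiple times, then majority-vote) can be sketched as a few lines of glue code. `sample_fn` is a hypothetical stand-in for one stochastic LLM call; any real API call could be plugged in there.

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n=5):
    """Sample the model n times and majority-vote on the final answers.

    A single unlucky sample can go down a blind alley, but the majority
    of several independent samples is usually more reliable.
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # winning answer and its vote fraction
```

In practice each sample would itself be a full chain-of-thought rollout, with only the extracted final answer fed into the vote.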
[26:45.820 --> 26:47.340] It turns out that LLMs actually know when they've screwed up. [26:47.340 --> 26:53.580] So as an example, say you asked the model to generate a poem that does not rhyme, and [26:53.580 --> 26:55.900] it might give you a poem, but it actually rhymes. [26:55.900 --> 26:59.900] But it turns out that especially for the bigger models like GPT-4, you can just ask it, did [26:59.900 --> 27:01.020] you meet the assignment? [27:01.020 --> 27:04.780] And actually, GPT-4 knows very well that it did not meet the assignment. [27:04.780 --> 27:07.100] It just kind of got unlucky in its sampling. [27:07.100 --> 27:09.580] And so it will tell you, no, I didn't actually meet the assignment here. [27:09.580 --> 27:10.300] Let me try again. [27:10.940 --> 27:15.660] But without you prompting it, it doesn't know [27:15.660 --> 27:17.820] to revisit, and so on. [27:17.820 --> 27:19.980] So you have to make up for that in your prompts. [27:19.980 --> 27:21.900] You have to get it to check. [27:21.900 --> 27:24.140] If you don't ask it to check, it's not going to check by itself. [27:24.140 --> 27:25.260] It's just a token simulator. [27:29.020 --> 27:33.420] I think more generally, a lot of these techniques fall into the bucket of what I would call recreating [27:33.420 --> 27:34.540] our System 2. [27:34.540 --> 27:37.820] So you might be familiar with the System 1 / System 2 thinking for humans. [27:37.820 --> 27:41.980] System 1 is a fast, automatic process, and I think it kind of corresponds to an LLM [27:41.980 --> 27:45.500] just sampling tokens, and System 2 is the slower, [27:45.500 --> 27:48.460] deliberate planning sort of part of your brain.
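The "ask it to check" pattern above can be sketched as a generate-then-verify loop. `generate` and `check` are hypothetical callables standing in for two separate LLM calls; the whole point is that the model will not check its own work unless the glue code asks.

```python
def generate_with_check(generate, check, assignment, max_tries=3):
    """Generate an attempt, then explicitly ask whether it met the
    assignment, retrying on a 'no'. Both callables are stand-ins for
    LLM calls: one produces an attempt, the other answers yes/no.
    """
    attempt = None
    for _ in range(max_tries):
        attempt = generate(assignment)
        verdict = check(f"Assignment: {assignment}\n"
                        f"Attempt: {attempt}\n"
                        f"Did you meet the assignment? Answer yes or no.")
        if verdict.strip().lower() == "yes":
            return attempt
    return attempt  # give up and return the last attempt
```

Without this loop the first (possibly unlucky) sample is all you get; with it, the model's own self-knowledge gets a chance to catch the miss.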
[27:49.260 --> 27:53.180] And so this is a paper actually from just last week, because this space is pretty quickly [27:53.180 --> 27:53.740] evolving. [27:53.740 --> 27:58.700] It's called Tree of Thought, and in Tree of Thought, the authors of this paper proposed [27:58.700 --> 28:04.140] maintaining multiple completions for any given prompt, and then they are also scoring them [28:04.140 --> 28:08.060] along the way and keeping the ones that are going well, if that makes sense. [28:08.060 --> 28:13.740] And so a lot of people are really playing around with kind of prompt engineering to [28:14.780 --> 28:15.260] basically [28:15.260 --> 28:19.020] bring back some of these abilities that we sort of have in our brains for LLMs. [28:19.820 --> 28:22.780] Now, one thing I would like to note here is that this is not just a prompt. [28:22.780 --> 28:27.980] This is actually prompts that are used together with some Python glue code, because you [28:27.980 --> 28:31.180] actually have to maintain multiple prompts, and you also have to do some tree [28:31.180 --> 28:35.340] search algorithm here to figure out which prompts to expand, etc. [28:35.340 --> 28:40.220] So it's a symbiosis of Python glue code and individual prompts that are called in a while [28:40.220 --> 28:41.500] loop or in a bigger algorithm. [28:42.380 --> 28:44.540] I also think there's a really cool parallel here to AlphaGo. [28:44.540 --> 28:49.660] AlphaGo has a policy for placing the next stone when it plays Go, and this policy was [28:49.660 --> 28:51.900] trained originally by imitating humans. [28:52.460 --> 28:57.180] But in addition to this policy, it also does Monte Carlo tree search, and basically it [28:57.180 --> 29:00.620] will play out a number of possibilities in its head and evaluate all of them and only [29:00.620 --> 29:01.820] keep the ones that work well. [29:01.820 --> 29:07.020] And so I think this is kind of an equivalent of AlphaGo, but for text, if that makes sense.
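The "maintain multiple completions, score along the way, keep the ones going well" loop can be sketched as a toy beam search. This is an illustration of the glue-code shape only, not the actual Tree of Thought implementation: `expand` and `score` are hypothetical stand-ins for LLM calls that propose continuations of a partial solution and rate one.

```python
def tree_of_thought_search(expand, score, root, beam_width=2, depth=3):
    """Toy beam search over partial 'thoughts': propose continuations,
    score them, and only keep expanding the most promising ones."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for node in frontier for c in expand(node)]
        if not candidates:
            break
        # Keep only the best-scoring partial solutions.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```

This is exactly the "symbiosis" mentioned above: the prompts live inside `expand` and `score`, while plain Python owns the search loop.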
[29:08.780 --> 29:13.100] So just like Tree of Thought, I think more generally people are starting to really explore [29:13.100 --> 29:17.900] more general techniques of not just simple question-answer prompts, but something [29:17.900 --> 29:21.980] that looks a lot more like Python glue code stringing together many prompts. [29:21.980 --> 29:27.500] So on the right, I have an example from this paper called ReAct, where they structure the [29:27.500 --> 29:34.140] answer to a prompt as a sequence of thought, action, observation, thought, action, observation, [29:34.140 --> 29:37.660] and it's a full rollout, a kind of a thinking process to answer the query. [29:38.300 --> 29:41.500] And in these actions, the model is also allowed to use tools. [29:42.220 --> 29:42.860] On the left, [29:43.100 --> 29:45.340] I have an example of AutoGPT. [29:45.340 --> 29:52.460] Now AutoGPT, by the way, is a project that got a lot of hype recently, but [29:52.460 --> 29:54.860] I still find it kind of inspirationally interesting. [29:55.980 --> 30:00.940] It's a project that allows an LLM to sort of keep a task list and continue to recursively [30:00.940 --> 30:02.060] break down tasks. [30:02.060 --> 30:05.420] And I don't think this currently works very well, and I would not advise people to use [30:05.420 --> 30:07.020] it in practical applications. [30:07.020 --> 30:10.060] I just think it's something to generally take inspiration from in terms of where this is [30:10.060 --> 30:11.180] going, I think, over time. [30:11.180 --> 30:16.220] So that's kind of like giving our model System 2 thinking. [30:16.220 --> 30:20.940] The next thing that I find kind of interesting is this following, I would say, almost [30:20.940 --> 30:25.500] psychological quirk of LLMs: LLMs don't want to succeed. [30:26.540 --> 30:27.500] They want to imitate. [30:28.460 --> 30:30.380] You want to succeed, and you should ask for it.
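The thought/action/observation rollout described for ReAct can be sketched as a small driver loop. This is a schematic sketch, not the paper's code: `model` is a hypothetical LLM call that returns the next Thought/Action step (or `Finish[answer]`) given the transcript so far, and the `Action: name[arg]` / `Finish[...]` line formats are assumptions of this sketch.

```python
import re

def react_loop(model, tools, question, max_steps=5):
    """Minimal ReAct-style driver: the model emits Thought/Action lines,
    the glue code runs the named tool and feeds back an Observation."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        done = re.search(r"Finish\[(.*?)\]", step)
        if done:
            return done.group(1)  # the model declared a final answer
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action:
            name, arg = action.groups()
            observation = tools[name](arg)
            transcript += f"Observation: {observation}\n"
    return None  # no answer within the step budget
```

The tools dict maps action names to ordinary Python functions (search, calculator, etc.), which is where tool use enters the rollout.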
[30:31.180 --> 30:36.620] So what I mean by that is, when Transformers are trained, they have training sets, and [30:37.500 --> 30:41.180] there can be an entire spectrum of performance qualities in their training data. [30:41.180 --> 30:44.780] So, for example, there could be some kind of a prompt for some physics question or something [30:44.780 --> 30:48.380] like that, and there could be a student solution that is completely wrong, but there can also [30:48.380 --> 30:50.460] be an expert answer that is extremely right. [30:51.020 --> 30:55.980] And Transformers can't tell the difference; I mean, they know about low- [30:55.980 --> 30:59.740] quality solutions and high-quality solutions, but by default, they want to imitate all of [30:59.740 --> 31:02.300] it, because they're just trained on language modeling. [31:02.300 --> 31:06.060] And so at test time, you actually have to ask for a good performance. [31:06.060 --> 31:10.300] So in this example, in this paper, they tried various prompts. [31:11.180 --> 31:14.700] Let's think step by step was very powerful, because it sort of spread out the reasoning [31:14.700 --> 31:15.660] over many tokens. [31:15.660 --> 31:19.740] But what worked even better is, let's work this out in a step by step way to be sure we [31:19.740 --> 31:20.860] have the right answer. [31:20.860 --> 31:23.740] And so it's kind of like conditioning on getting the right answer. [31:23.740 --> 31:27.180] And this actually makes the Transformer work better, because the Transformer doesn't have [31:27.180 --> 31:31.100] to now hedge its probability mass on low-quality solutions. [31:31.100 --> 31:32.860] As ridiculous as that sounds. [31:32.860 --> 31:37.260] And so basically, feel free to ask for a strong solution. [31:37.260 --> 31:39.740] Say something like, you are a leading expert on this topic. [31:39.740 --> 31:41.020] Pretend you have IQ 120. [31:41.180 --> 31:41.980] Et cetera.
[31:41.980 --> 31:47.020] But don't try to ask for too much IQ, because if you ask for IQ 400, you might be out [31:47.020 --> 31:52.060] of the data distribution, or even worse, you could be in the data distribution for some sci-fi [31:52.060 --> 31:56.380] stuff, and it will start to take on some sci-fi-like role playing or something like [31:56.380 --> 31:56.940] that. [31:56.940 --> 31:58.780] So you have to find the right amount of IQ. [31:59.580 --> 32:01.660] I think it's got some U-shaped curve there. [32:03.100 --> 32:08.860] Next up, as we saw, when we are trying to solve problems, we know what we are good at and [32:08.860 --> 32:09.660] what we're not good at, [32:09.660 --> 32:11.020] and we lean on tools [32:11.020 --> 32:11.980] computationally. [32:11.980 --> 32:14.460] You want to do the same, potentially, with your LLMs. [32:15.100 --> 32:21.420] So in particular, we may want to give them calculators, code interpreters, and so on, [32:21.980 --> 32:23.260] the ability to do search. [32:23.820 --> 32:26.620] And there's a lot of techniques for doing that. [32:27.180 --> 32:31.580] One thing to keep in mind, again, is that these Transformers by default may not know what [32:31.580 --> 32:32.780] they don't know. [32:32.780 --> 32:36.700] So you may even want to tell the Transformer in a prompt, you are not very good at mental [32:36.700 --> 32:37.500] arithmetic. [32:37.500 --> 32:41.020] Whenever you need to do very large number addition, multiplication, or whatever, you're [32:41.020 --> 32:42.460] going to use this calculator. [32:42.460 --> 32:43.740] Here's how you use the calculator. [32:43.740 --> 32:46.380] Use this token combination, et cetera, et cetera. [32:46.380 --> 32:49.580] So you have to actually spell it out, because the model by default doesn't know [32:49.580 --> 32:51.660] what it's good at or not good at, necessarily. [32:51.660 --> 32:53.500] Just like you and I.
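The "use this token combination" calculator idea can be sketched as glue code that scans the model's output for tool-call spans and substitutes the computed results. The `CALC(...)` convention here is invented for this sketch; the prompt would tell the model to emit it whenever arithmetic is needed.

```python
import re

def expand_calculator_calls(text):
    """Replace CALC(expr) spans the model emitted with computed results.

    Only plain arithmetic characters are allowed through before eval'ing,
    as a minimal safety check for this sketch.
    """
    def run(match):
        expr = match.group(1)
        if not re.fullmatch(r"[\d.+\-*/() ]+", expr):
            return match.group(0)  # leave anything suspicious untouched
        return format(eval(expr), ".4g")
    return re.sub(r"CALC\((.*?)\)", run, text)
```

So a completion like "The ratio is CALC(39.2/0.74)." comes back with the division actually done, which the model itself is unlikely to get right in its head.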
[32:55.500 --> 33:00.540] Next up, I think something that is very interesting is we went from a world that was retrieval [33:00.540 --> 33:05.420] only, and the pendulum has swung all the way to the other extreme, where it's memory only in [33:05.420 --> 33:06.060] LLMs. [33:06.060 --> 33:10.380] But actually, there's this entire space in between of these retrieval-augmented models, [33:10.380 --> 33:10.940] and this is very interesting; [33:10.940 --> 33:12.380] this works extremely well in practice. [33:13.180 --> 33:17.100] As I mentioned, the context window of a Transformer is its working memory. [33:17.100 --> 33:21.260] If you can load the working memory with any information that is relevant to the task, [33:21.260 --> 33:25.660] the model will work extremely well, because it can immediately access all that memory. [33:26.380 --> 33:32.060] And so I think a lot of people are really interested in basically retrieval-augmented [33:32.060 --> 33:33.020] generation. [33:33.020 --> 33:37.180] And on the bottom, I have an example of LlamaIndex, which is one sort of data connector [33:37.180 --> 33:38.940] to lots of different types of data. [33:38.940 --> 33:44.220] And you can index all of that data, and you can make it accessible to LLMs. [33:44.220 --> 33:48.460] And the emerging recipe there is: you take relevant documents, you split them up into [33:48.460 --> 33:52.860] chunks, you embed all of them, and you basically get embedding vectors that represent that [33:52.860 --> 33:53.420] data. [33:53.420 --> 33:57.420] You store that in a vector store, and then at test time, you make some kind of a query [33:57.420 --> 34:01.980] to your vector store, and you fetch chunks that might be relevant to your task, and you [34:01.980 --> 34:04.060] stuff them into the prompt, and then you generate. [34:04.060 --> 34:06.300] So this can work quite well in practice.
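The recipe just described (chunk, embed, store, query, fetch, stuff into the prompt) can be sketched end to end. To stay self-contained, this sketch uses a toy bag-of-words "embedding" and an in-memory list as the vector store; a real system would call an embedding model and a proper vector database instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Fetch the k chunks most relevant to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    """Stuff the retrieved chunks into the prompt, then generate from it."""
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The shape is the important part: the same retrieve-then-stuff flow applies unchanged once the toy pieces are swapped for real embeddings and storage.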
[34:06.300 --> 34:10.380] So this is, I think, similar to when you and I solve problems: you can do everything from [34:10.380 --> 34:13.660] your memory, and transformers have a very large and extensive memory. [34:13.660 --> 34:17.580] But also, it really helps to reference some primary documents. [34:17.580 --> 34:21.420] So whenever you find yourself going back to a textbook to find something, or whenever [34:21.420 --> 34:26.140] you find yourself going back to documentation of a library to look something up, the transformers [34:26.140 --> 34:27.660] definitely want to do that too. [34:27.660 --> 34:32.540] You have some memory of how some documentation of a library works, but it's much better to [34:32.540 --> 34:33.180] look it up. [34:33.180 --> 34:34.780] So the same applies here. [34:36.860 --> 34:39.580] Next, I wanted to briefly talk about constraint prompting. [34:39.580 --> 34:40.300] I find this very useful [34:40.300 --> 34:40.860] and very interesting. [34:42.140 --> 34:50.220] This is basically techniques for forcing a certain template in the outputs of LLMs. [34:50.220 --> 34:52.780] So guidance is one example from Microsoft, actually. [34:53.340 --> 34:57.260] And here we are enforcing that the output from the LLM will be JSON. [34:57.820 --> 35:01.980] And this will actually guarantee that the output will take on this form, because they [35:01.980 --> 35:04.860] go in and they mess with the probabilities of all the different tokens that come out [35:04.860 --> 35:07.500] of the transformer, and they clamp those tokens. [35:07.500 --> 35:09.980] And then the transformer is only filling in the blanks here. [35:09.980 --> 35:13.340] And then you can enforce additional restrictions on what could go into those blanks. [35:13.340 --> 35:14.700] So this might be really helpful. [35:14.700 --> 35:17.340] And I think this kind of constrained sampling is also extremely interesting.
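The "model only fills in the blanks" idea can be illustrated without the token-probability clamping that guidance performs: keep the JSON skeleton in program code and only ask the model for the individual blank values. `model` here is a hypothetical per-field LLM call, and this is a simplification of the real technique, which constrains sampling token by token.

```python
import json

def fill_json_template(model, fields):
    """Build valid JSON where the program owns the structure and the
    model only supplies the values for the blanks."""
    obj = {name: model(f"Provide a value for '{name}': ") for name in fields}
    out = json.dumps(obj)
    json.loads(out)  # valid JSON by construction; this is just a sanity check
    return out
```

The output is guaranteed to parse no matter what the model returns for each blank, which is the practical payoff of constrained output.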
[35:20.060 --> 35:22.380] I also wanted to say a few words about fine-tuning. [35:22.380 --> 35:27.260] It is the case that you can get really far with prompt engineering, but it's also possible [35:27.260 --> 35:28.940] to think about fine-tuning your models. [35:29.580 --> 35:33.020] Now, fine-tuning models means that you are actually going to change the weights of the [35:33.020 --> 35:33.340] model. [35:34.220 --> 35:38.780] It is becoming a lot more accessible to do this in practice, and that's because of a [35:38.780 --> 35:39.820] number of techniques [35:39.820 --> 35:43.020] and libraries that have been developed very recently. [35:43.020 --> 35:46.780] So, for example, parameter-efficient fine-tuning techniques like LoRA make sure that [35:47.660 --> 35:50.940] you're only training small, sparse pieces of your model. [35:50.940 --> 35:55.260] So most of the model is kept clamped at the base model, and some pieces of it are allowed [35:55.260 --> 35:55.820] to change. [35:55.820 --> 35:59.900] And this still works pretty well empirically, and makes it much cheaper to sort of tune [35:59.900 --> 36:01.100] only small pieces of your model. [36:03.020 --> 36:07.100] It also means that because most of your model is clamped, you can use very low precision [36:07.100 --> 36:09.740] inference for computing those parts, because they are [36:09.740 --> 36:11.660] not going to be updated by gradient descent. [36:11.660 --> 36:13.580] And so that makes everything a lot more efficient as well. [36:14.220 --> 36:17.420] And in addition, we have a number of open-source, high-quality base models. [36:17.420 --> 36:21.580] Currently, as I mentioned, I think LLaMA is quite nice, although it is not commercially [36:21.580 --> 36:22.700] licensed, I believe, right now. [36:24.300 --> 36:29.580] Something to keep in mind is that basically fine-tuning is a lot more technically involved.
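The LoRA idea just mentioned (freeze the big weight matrix, train only two small low-rank matrices) can be sketched in plain Python. This is a toy illustration of the math, not the actual `peft` library implementation: the effective weight is W + (alpha / r) · B · A, where only A and B would receive gradients.

```python
import random

def matmul(A, B):
    """Plain-Python matrix multiply over row-major nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer.

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    train, so the trainable parameter count scales with r, not d_out*d_in.
    """
    def __init__(self, W, r=2, alpha=4):
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weights
        self.r, self.alpha = r, alpha
        self.A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]  # zeros: adapter starts as a no-op

    def effective_weight(self):
        """W + (alpha / r) * B @ A, the weight actually used at inference."""
        s = self.alpha / self.r
        BA = matmul(self.B, self.A)
        return [[w + s * d for w, d in zip(wrow, drow)]
                for wrow, drow in zip(self.W, BA)]
```

Because B is initialized to zero, the adapted layer starts out exactly equal to the frozen base layer, and training only ever moves the small A and B matrices.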
[36:29.580 --> 36:32.700] It requires a lot more, I think, technical expertise to do right. [36:32.700 --> 36:37.020] It requires human data contractors for data sets and/or synthetic data pipelines that [36:37.020 --> 36:37.980] can be pretty complicated. [36:38.540 --> 36:41.260] This will definitely slow down your iteration cycle by a lot. [36:41.820 --> 36:47.420] And I would say on a high level, SFT is achievable, because you are just continuing the language [36:47.420 --> 36:48.140] modeling task. [36:48.140 --> 36:49.660] It's relatively straightforward. [36:49.660 --> 36:55.020] But RLHF, I would say, is very much research territory, and is even much harder to get [36:55.020 --> 36:55.740] to work. [36:55.740 --> 37:00.620] And so I would probably not advise that someone just tries to roll their own RLHF implementation. [37:00.620 --> 37:04.540] These things are pretty unstable, very difficult to train, not something that is, I think, [37:04.540 --> 37:05.900] very beginner friendly right now. [37:05.900 --> 37:07.900] And it's also likely [37:08.460 --> 37:10.220] to change pretty rapidly still. [37:12.140 --> 37:14.940] So I think these are my sort of default recommendations right now. [37:15.660 --> 37:18.220] I would break up your task into two major parts. [37:18.220 --> 37:20.460] Number one, achieve your top performance. [37:20.460 --> 37:23.260] And number two, optimize your performance, in that order. [37:24.220 --> 37:27.500] Number one, the best performance will currently come from the GPT-4 model. [37:27.500 --> 37:29.020] It is the most capable model by far. [37:29.900 --> 37:31.820] Use prompts that are very detailed, [37:31.820 --> 37:35.660] that have lots of task context, relevant information, and instructions. [37:36.460 --> 37:37.740] Think along the lines of, what [37:37.740 --> 37:40.860] would you tell a task contractor if they can't email you back?
[37:40.860 --> 37:44.860] But then also keep in mind that a task contractor is a human, and they have inner monologue, [37:44.860 --> 37:46.300] and they're very clever, et cetera. [37:46.300 --> 37:48.620] LLMs do not possess those qualities. [37:48.620 --> 37:55.580] So make sure to think through the psychology of the LLM, almost, and cater prompts to that. [37:55.580 --> 38:00.620] Retrieve and add any relevant context and information to these prompts. [38:00.620 --> 38:03.100] Basically, refer to a lot of the prompt engineering techniques. [38:03.100 --> 38:05.100] Some of them I've highlighted in the slides above. [38:05.100 --> 38:07.340] But also, this is a very large space, [38:07.340 --> 38:11.740] and I would just advise you to look for prompt engineering techniques online. [38:11.740 --> 38:12.860] There's a lot to cover there. [38:13.740 --> 38:15.820] Experiment with few-shot examples. [38:15.820 --> 38:17.820] What this refers to is, you don't just want to tell, [38:17.820 --> 38:19.980] you want to show, whenever it's possible. [38:19.980 --> 38:22.060] So give it examples of everything. [38:22.060 --> 38:24.380] That helps it really understand what you mean, if you can. [38:25.820 --> 38:29.660] Experiment with tools and plug-ins to offload tasks that are difficult for LLMs natively. [38:31.180 --> 38:35.820] And then think about not just a single prompt and answer; think about potential chains and reflection, [38:35.820 --> 38:37.020] and how you glue them together, [38:37.340 --> 38:39.420] how you could potentially make multiple samples, and so on. [38:40.700 --> 38:43.420] Finally, if you think you've squeezed out prompt engineering, [38:43.420 --> 38:45.260] which I think you should stick with for a while, [38:45.900 --> 38:51.420] look at potentially fine-tuning a model to your application. [38:51.420 --> 38:54.060] But expect this to be a lot slower and more involved.
[38:54.060 --> 38:58.300] And then there's an expert, fragile research zone here, and I would say that is RLHF, [38:58.300 --> 39:02.220] which currently does work a bit better than SFT, if you can get it to work. [39:02.220 --> 39:04.700] But again, this is pretty involved, I would say. [39:05.260 --> 39:07.340] And to optimize your costs, try to explore [39:07.340 --> 39:11.100] lower-capacity models, or shorter prompts, and so on. [39:13.420 --> 39:15.900] I also wanted to say a few words about the use cases [39:15.900 --> 39:18.940] that I think LLMs are currently well suited for. [39:18.940 --> 39:22.860] So in particular, note that there's a large number of limitations to LLMs today, [39:22.860 --> 39:26.540] and so I would keep that definitely in mind for all your applications. [39:26.540 --> 39:28.780] And this, by the way, could be an entire talk, [39:28.780 --> 39:30.940] so I don't have time to cover it in full detail. [39:30.940 --> 39:34.220] Models may be biased; they may fabricate, hallucinate information. [39:34.220 --> 39:35.340] They may have reasoning errors. [39:35.340 --> 39:37.980] They may struggle in entire classes of applications. [39:38.540 --> 39:39.980] They have knowledge cut-offs, [39:39.980 --> 39:43.820] so they might not know any information beyond, say, September 2021. [39:43.820 --> 39:46.380] They are susceptible to a large range of attacks, [39:46.380 --> 39:48.780] which are sort of coming out on Twitter daily, [39:48.780 --> 39:52.380] including prompt injection, jailbreak attacks, data poisoning attacks, and so on.
[39:52.940 --> 39:57.580] So my recommendation right now is: use LLMs in low-stakes applications, [39:57.580 --> 40:00.220] always combine them with human oversight, [40:00.220 --> 40:02.860] use them as a source of inspiration and suggestions, [40:02.860 --> 40:05.740] and think co-pilots instead of completely autonomous agents [40:05.740 --> 40:07.340] that are just performing a task somewhere. [40:07.900 --> 40:10.300] It's just not clear that the models are there right now. [40:12.300 --> 40:15.020] So I wanted to close by saying that GPT-4 is an amazing artifact. [40:15.020 --> 40:16.300] I'm very thankful that it exists. [40:16.860 --> 40:18.220] And it's beautiful. [40:18.220 --> 40:20.220] It has a ton of knowledge across so many areas. [40:20.220 --> 40:23.020] It can do math, code, and so on. [40:23.020 --> 40:25.980] And in addition, there's this thriving ecosystem of everything else [40:25.980 --> 40:28.700] that is being built and incorporated into the ecosystem. [40:29.260 --> 40:31.340] Some of these things I've talked about. [40:32.060 --> 40:35.180] And all of this power is accessible at your fingertips. [40:35.740 --> 40:38.540] So here's everything that's needed in terms of code [40:38.540 --> 40:41.740] to ask GPT-4 a question, to prompt it, and get a response. [40:42.460 --> 40:43.420] In this case, I said, [40:44.060 --> 40:47.260] can you say something to inspire the audience of Microsoft Build 2023? [40:47.900 --> 40:53.020] And I just punched this into Python, and verbatim, GPT-4 said the following. [40:54.060 --> 40:57.340] And by the way, I did not know that they used this trick in the keynote. [40:57.340 --> 41:02.700] So I thought I was being clever, but it is really good at this. [41:02.700 --> 41:05.660] It says, ladies and gentlemen, innovators and trailblazers, Microsoft [41:05.740 --> 41:07.340] Build 2023, [41:07.340 --> 41:09.980] welcome to the gathering of brilliant minds like no other.
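The "everything that's needed in terms of code" shown on the slide is a single chat-completions call. A sketch of what that looks like, assuming the pre-1.0 `openai` Python package that was current at the time (an API key and the network call itself are kept out of the payload-building step):

```python
def build_request(question, model="gpt-4"):
    """Build the chat-completions payload for a single user question."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }

def ask_gpt4(question):
    """Send the question to GPT-4 and return the text of its reply.

    Requires `pip install openai` and an API key configured; this uses
    the pre-1.0 `openai.ChatCompletion.create` call shape.
    """
    import openai
    response = openai.ChatCompletion.create(**build_request(question))
    return response["choices"][0]["message"]["content"]
```

Calling `ask_gpt4("Can you say something to inspire the audience of Microsoft Build 2023?")` reproduces the prompt used here; the quoted response follows in the transcript.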
[41:09.980 --> 41:11.740] You are the architects of the future, [41:11.740 --> 41:15.580] the visionaries molding the digital realm in which humanity thrives. [41:15.580 --> 41:17.820] Embrace the limitless possibilities of technologies [41:17.820 --> 41:20.380] and let your ideas soar as high as your imagination. [41:20.380 --> 41:22.860] Together, let's create a more connected, remarkable, [41:22.860 --> 41:25.260] and inclusive world for generations to come. [41:25.260 --> 41:27.260] Get ready to unleash your creativity, [41:27.260 --> 41:30.300] canvass the unknown, and turn dreams into reality. [41:30.300 --> 41:32.140] Your journey begins today.