VideoTranscript2Insight
Input: Text transcript with timestamps. Output: Insights, Main Ideas, Keywords, Distilled Summary, Extractive Summary
Transcript Input: The Inside Story of ChatGPT's Astonishing Potential
URL: https://www.youtube.com/watch?v=C_78DM8fG6E
Raw Transcript:
""" We started OpenAI seven years ago because we felt like something really interesting was happening in AI 0:10 and we wanted to help steer it in a positive direction. 0:15 It's honestly just really amazing to see how far this whole field has come since then. 0:20 And it's really gratifying to hear from people like Raymond who are using the technology we are building, and others, 0:26 for so many wonderful things. We hear from people who are excited, we hear from people who are concerned, 0:33 we hear from people who feel both those emotions at once. And honestly, that's how we feel. 0:40 Above all, it feels like we're entering an historic period right now where we as a world are going to define a technology 0:48 that will be so important for our society going forward. And I believe that we can manage this for good. 0:56 So today, I want to show you the current state of that technology and some of the underlying design principles that we hold dear. 1:09 So the first thing I'm going to show you is what it's like to build a tool for an AI rather than building it for a human. Dolly 1:17 So we have a new DALL-E model, which generates images, and we are exposing it as an app for ChatGPT to use on your behalf. 1:25 And you can do things like ask, you know, suggest a nice post-TED meal and draw a picture of it. 1:35 (Laughter) Now you get all of the, sort of, ideation and creative back-and-forth 1:43 and taking care of the details for you that you get out of ChatGPT. And here we go, it's not just the idea for the meal, 1:49 but a very, very detailed spread. So let's see what we're going to get. 1:56 But ChatGPT doesn't just generate images in this case -- sorry, it doesn't generate text, it also generates an image. 2:02 And that is something that really expands the power of what it can do on your behalf in terms of carrying out your intent. 2:08 And I'll point out, this is all a live demo. This is all generated by the AI as we speak. 
So I actually don't even know what we're going to see. 2:16 This looks wonderful. (Applause) 2:22 I'm getting hungry just looking at it. Now we've extended ChatGPT with other tools too, 2:27 for example, memory. You can say "save this for later." 2:33 And the interesting thing about these tools is they're very inspectable. So you get this little pop up here that says "use the DALL-E app." 2:39 And by the way, this is coming to you, all ChatGPT users, over upcoming months. And you can look under the hood and see that what it actually did 2:46 was write a prompt just like a human could. And so you sort of have this ability to inspect how the machine is using these tools, 2:53 which allows us to provide feedback to them. Now it's saved for later, and let me show you what it's like to use that information 2:59 and to integrate with other applications too. You can say, “Now make a shopping list for the tasty thing 3:10 I was suggesting earlier.” And make it a little tricky for the AI. 3:16 "And tweet it out for all the TED viewers out there." (Laughter) 3:22 So if you do make this wonderful, wonderful meal, I definitely want to know how it tastes. 3:28 But you can see that ChatGPT is selecting all these different tools without me having to tell it explicitly which ones to use in any situation. 3:37 And this, I think, shows a new way of thinking about the user interface. Like, we are so used to thinking of, well, we have these apps, 3:44 we click between them, we copy/paste between them, and usually it's a great experience within an app as long as you kind of know the menus and know all the options. 3:52 Yes, I would like you to. Yes, please. Always good to be polite. (Laughter) Live Demo 4:00 And by having this unified language interface on top of tools, 4:05 the AI is able to sort of take away all those details from you. So you don't have to be the one 4:12 who spells out every single sort of little piece of what's supposed to happen. 
And as I said, this is a live demo, 4:18 so sometimes the unexpected will happen to us. But let's take a look at the Instacart shopping list while we're at it. 4:25 And you can see we sent a list of ingredients to Instacart. Here's everything you need. 4:30 And the thing that's really interesting is that the traditional UI is still very valuable, right? If you look at this, 4:37 you still can click through it and sort of modify the actual quantities. And that's something that I think shows 4:43 that they're not going away, traditional UIs. It's just we have a new, augmented way to build them. 4:49 And now we have a tweet that's been drafted for our review, which is also a very important thing. We can click “run,” and there we are, we’re the manager, we’re able to inspect, 4:58 we're able to change the work of the AI if we want to. And so after this talk, you will be able to access this yourself. 5:17 And there we go. Cool. Thank you, everyone. 5:23 (Applause) 5:29 So we’ll cut back to the slides. Now, the important thing about how we build this, Training ChatGPT 5:36 it's not just about building these tools. It's about teaching the AI how to use them. Like, what do we even want it to do 5:42 when we ask these very high-level questions? And to do this, we use an old idea. 5:48 If you go back to Alan Turing's 1950 paper on the Turing test, he says, you'll never program an answer to this. 5:53 Instead, you can learn it. You could build a machine, like a human child, and then teach it through feedback. 5:59 Have a human teacher who provides rewards and punishments as it tries things out and does things that are either good or bad. 6:06 And this is exactly how we train ChatGPT. It's a two-step process. First, we produce what Turing would have called a child machine 6:12 through an unsupervised learning process. 
We just show it the whole world, the whole internet and say, “Predict what comes next in text you’ve never seen before.” 6:20 And this process imbues it with all sorts of wonderful skills. For example, if you're shown a math problem, 6:25 the only way to actually complete that math problem, to say what comes next, that green nine up there, 6:30 is to actually solve the math problem. But we actually have to do a second step, too, 6:36 which is to teach the AI what to do with those skills. And for this, we provide feedback. We have the AI try out multiple things, give us multiple suggestions, 6:44 and then a human rates them, says “This one’s better than that one.” And this reinforces not just the specific thing that the AI said, 6:50 but very importantly, the whole process that the AI used to produce that answer. And this allows it to generalize. 6:55 It allows it to teach, to sort of infer your intent and apply it in scenarios that it hasn't seen before, that it hasn't received feedback. 7:02 Now, sometimes the things we have to teach the AI are not what you'd expect. For example, when we first showed GPT-4 to Khan Academy, 7:09 they said, "Wow, this is so great, We're going to be able to teach students wonderful things. Only one problem, it doesn't double-check students' math. 7:17 If there's some bad math in there, it will happily pretend that one plus one equals three and run with it." 7:23 So we had to collect some feedback data. Sal Khan himself was very kind and offered 20 hours of his own time to provide feedback to the machine 7:30 alongside our team. And over the course of a couple of months we were able to teach the AI that, 7:35 "Hey, you really should push back on humans in this specific kind of scenario." 7:41 And we've actually made lots and lots of improvements to the models this way. 
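The second step described here — have the model propose multiple answers, then have a human say "this one's better than that one" — is commonly implemented by training a reward model on pairwise comparisons. A minimal sketch of that ranking loss (the reward scores are invented for illustration; this is not OpenAI's actual training code):

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    # Bradley-Terry style loss used in reward modeling:
    # -log sigmoid(r_preferred - r_rejected). Minimizing it pushes the
    # reward of the human-preferred answer above the rejected one.
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Invented reward scores for two candidate answers a rater compared.
loss_clear = pairwise_preference_loss(3.0, 0.0)  # clear preference
loss_tied = pairwise_preference_loss(1.0, 0.9)   # nearly tied
# The nearly tied pair yields the larger loss, i.e. a stronger signal
# to separate the two answers.
```

Note that the gradient of this loss flows through the reward model's scoring of the whole answer, which is what lets the feedback reinforce the process, not just the specific output.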
And when you push that thumbs down in ChatGPT, 7:48 that actually is kind of like sending up a bat signal to our team to say, “Here’s an area of weakness where you should gather feedback.” 7:55 And so when you do that, that's one way that we really listen to our users and make sure we're building something that's more useful for everyone. 8:02 Now, providing high-quality feedback is a hard thing. If you think about asking a kid to clean their room, 8:09 if all you're doing is inspecting the floor, you don't know if you're just teaching them to stuff all the toys in the closet. 8:15 This is a nice DALL-E-generated image, by the way. And the same sort of reasoning applies to AI. 8:24 As we move to harder tasks, we will have to scale our ability to provide high-quality feedback. 8:30 But for this, the AI itself is happy to help. It's happy to help us provide even better feedback FactChecking ChatGPT 8:37 and to scale our ability to supervise the machine as time goes on. And let me show you what I mean. 8:42 For example, you can ask GPT-4 a question like this, of how much time passed between these two foundational blogs 8:50 on unsupervised learning and learning from human feedback. And the model says two months passed. 8:57 But is it true? Like, these models are not 100-percent reliable, although they’re getting better every time we provide some feedback. 9:04 But we can actually use the AI to fact-check. And it can actually check its own work. 9:09 You can say, fact-check this for me. Now, in this case, I've actually given the AI a new tool. 9:16 This one is a browsing tool where the model can issue search queries and click into web pages. 9:22 And it actually writes out its whole chain of thought as it does it. It says, I’m just going to search for this and it actually does the search. 9:28 It then finds the publication date in the search results. It then is issuing another search query. 9:33 It's going to click into the blog post. 
And all of this you could do, but it’s a very tedious task. It's not a thing that humans really want to do. 9:40 It's much more fun to be in the driver's seat, to be in this manager's position where you can, if you want, 9:45 triple-check the work. And out come citations so you can actually go and very easily verify any piece of this whole chain of reasoning. 9:53 And it actually turns out two months was wrong. Two months and one week, that was correct. 10:00 (Applause) Human AI Collaboration 10:07 And we'll cut back to the slides. And so the thing that's so interesting to me about this whole process 10:13 is that it’s this many-step collaboration between a human and an AI. Because a human, using this fact-checking tool 10:19 is doing it in order to produce data for another AI to become more useful to a human. 10:25 And I think this really shows the shape of something that we should expect to be much more common in the future, 10:31 where we have humans and machines kind of very carefully and delicately designed in how they fit into a problem 10:37 and how we want to solve that problem. We make sure that the humans are providing the management, the oversight, 10:42 the feedback, and the machines are operating in a way that's inspectable and trustworthy. 10:47 And together we're able to actually create even more trustworthy machines. And I think that over time, if we get this process right, 10:54 we will be able to solve impossible problems. And to give you a sense of just how impossible I'm talking, 11:00 I think we're going to be able to rethink almost every aspect of how we interact with computers. 11:05 For example, think about spreadsheets. They've been around in some form since, we'll say, 40 years ago with VisiCalc. 11:12 I don't think they've really changed that much in that time. And here is a specific spreadsheet of all the AI papers on the arXiv 11:22 for the past 30 years. There's about 167,000 of them. And you can see the data right here. 
11:28 But let me show you the ChatGPT take on how to analyze a data set like this. 11:37 So we can give ChatGPT access to yet another tool, this one a Python interpreter, 11:42 so it’s able to run code, just like a data scientist would. And so you can just literally upload a file 11:48 and ask questions about it. And very helpfully, you know, it knows the name of the file and it's like, "Oh, this is CSV," comma-separated value file, 11:56 "I'll parse it for you." The only information here is the name of the file, the column names like you saw and then the actual data. 12:04 And from that it's able to infer what these columns actually mean. Like, that semantic information wasn't in there. 12:11 It has to sort of, put together its world knowledge of knowing that, “Oh yeah, arXiv is a site that people submit papers 12:16 and therefore that's what these things are and that these are integer values and so therefore it's a number of authors in the paper," 12:23 like all of that, that’s work for a human to do, and the AI is happy to help with it. Now I don't even know what I want to ask. 12:29 So fortunately, you can ask the machine, "Can you make some exploratory graphs?" 12:37 And once again, this is a super high-level instruction with lots of intent behind it. But I don't even know what I want. 12:43 And the AI kind of has to infer what I might be interested in. And so it comes up with some good ideas, I think. 12:48 So a histogram of the number of authors per paper, time series of papers per year, word cloud of the paper titles. 12:53 All of that, I think, will be pretty interesting to see. And the great thing is, it can actually do it. Here we go, a nice bell curve. 13:00 You see that three is kind of the most common. It's going to then make this nice plot of the papers per year. 13:08 Something crazy is happening in 2023, though. Looks like we were on an exponential and it dropped off the cliff. 13:13 What could be going on there? 
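The first steps of this demo — parse the uploaded CSV, then compute a histogram of authors per paper and a count of papers per year — reduce to a few lines of the kind of Python the interpreter tool writes. A sketch over an invented miniature of the file (the column names `title`, `year`, `num_authors` are assumptions, not the real schema of the 167,000-row spreadsheet):

```python
import csv
import io
from collections import Counter

# Hypothetical miniature of the arXiv-papers CSV from the demo; the
# column names and rows are invented, not the actual uploaded file.
raw = """title,year,num_authors
Attention Is All You Need,2017,8
Deep Residual Learning,2015,4
Adam: A Method for Stochastic Optimization,2014,2
GPT-4 Technical Report,2023,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Two of the exploratory summaries the model proposed: a histogram of
# authors per paper, and a time series of papers per year.
authors_hist = Counter(int(r["num_authors"]) for r in rows)
papers_per_year = Counter(int(r["year"]) for r in rows)
```

The semantic step the talk highlights — knowing that `num_authors` holds a count of authors, not an arbitrary integer — is exactly what the column-to-meaning inference supplies before code like this gets written.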
By the way, all this is Python code, you can inspect. And then we'll see word cloud. 13:19 So you can see all these wonderful things that appear in these titles. But I'm pretty unhappy about this 2023 thing. 13:25 It makes this year look really bad. Of course, the problem is that the year is not over. So I'm going to push back on the machine. 13:33 [Waitttt that's not fair!!! 2023 isn't over. 13:38 What percentage of papers in 2022 were even posted by April 13?] 13:44 So April 13 was the cut-off date I believe. Can you use that to make a fair projection? 13:54 So we'll see, this is the kind of ambitious one. (Laughter) 13:59 So you know, again, I feel like there was more I wanted out of the machine here. 14:05 I really wanted it to notice this thing, maybe it's a little bit of an overreach for it 14:10 to have sort of, inferred magically that this is what I wanted. But I inject my intent, 14:15 I provide this additional piece of, you know, guidance. And under the hood, 14:21 the AI is just writing code again, so if you want to inspect what it's doing, it's very possible. And now, it does the correct projection. 14:30 (Applause) Parable 14:35 If you noticed, it even updates the title. I didn't ask for that, but it knew what I wanted. 14:41 Now we'll cut back to the slide again. This slide shows a parable of how I think we ... 14:51 A vision of how we may end up using this technology in the future. A person brought his very sick dog to the vet, 14:58 and the veterinarian made a bad call to say, “Let’s just wait and see.” And the dog would not be here today had he listened. 15:05 In the meanwhile, he provided the blood test, like, the full medical records, to GPT-4, which said, "I am not a vet, you need to talk to a professional, 15:13 here are some hypotheses." He brought that information to a second vet who used it to save the dog's life. 15:21 Now, these systems, they're not perfect. You cannot overly rely on them. 
But this story, I think, shows 15:29 that a human with a medical professional and with ChatGPT as a brainstorming partner 15:35 was able to achieve an outcome that would not have happened otherwise. I think this is something we should all reflect on, 15:40 think about as we consider how to integrate these systems into our world. And one thing I believe really deeply, 15:46 is that getting AI right is going to require participation from everyone. And that's for deciding how we want it to slot in, 15:53 that's for setting the rules of the road, for what an AI will and won't do. And if there's one thing to take away from this talk, 15:59 it's that this technology just looks different. Just different from anything people had anticipated. And so we all have to become literate. 16:06 And that's, honestly, one of the reasons we released ChatGPT. Together, I believe that we can achieve the OpenAI mission 16:12 of ensuring that artificial general intelligence benefits all of humanity. Thank you. 16:18 (Applause) 16:33 (Applause ends) Chris Anderson: Greg. Wow. I mean ... Rethink 16:39 I suspect that within every mind out here there's a feeling of reeling. 16:46 Like, I suspect that a very large number of people viewing this, you look at that and you think, “Oh my goodness, 16:51 pretty much every single thing about the way I work, I need to rethink." Like, there's just new possibilities there. 16:57 Am I right? Who thinks that they're having to rethink the way that we do things? Yeah, I mean, it's amazing, 17:03 but it's also really scary. So let's talk, Greg, let's talk. I mean, I guess my first question actually is just 17:10 how the hell have you done this? (Laughter) OpenAI has a few hundred employees. 17:16 Google has thousands of employees working on artificial intelligence. 17:21 Why is it you who's come up with this technology that shocked the world? Greg Brockman: I mean, the truth is, 17:28 we're all building on shoulders of giants, right, there's no question. 
If you look at the compute progress, 17:33 the algorithmic progress, the data progress, all of those are really industry-wide. But I think within OpenAI, 17:38 we made a lot of very deliberate choices from the early days. And the first one was just to confront reality as it lays. 17:44 And that we just thought really hard about like: What is it going to take to make progress here? We tried a lot of things that didn't work, so you only see the things that did. 17:52 And I think that the most important thing has been to get teams of people who are very different from each other to work together harmoniously. 17:59 CA: Can we have the water, by the way, just brought here? I think we're going to need it, it's a dry-mouth topic. 18:06 But isn't there something also just about the fact that you saw something in these language models 18:14 that meant that if you continue to invest in them and grow them, that something at some point might emerge? 18:21 GB: Yes. And I think that, I mean, honestly, I think the story there is pretty illustrative, right? Emergence 18:28 I think that high level, deep learning, like we always knew that was what we wanted to be, was a deep learning lab, and exactly how to do it? 18:35 I think that in the early days, we didn't know. We tried a lot of things, and one person was working on training a model 18:41 to predict the next character in Amazon reviews, and he got a result where -- this is a syntactic process, 18:48 you expect, you know, the model will predict where the commas go, where the nouns and verbs are. But he actually got a state-of-the-art sentiment analysis classifier out of it. 18:57 This model could tell you if a review was positive or negative. I mean, today we are just like, come on, anyone can do that. 19:04 But this was the first time that you saw this emergence, this sort of semantics that emerged from this underlying syntactic process. 19:12 And there we knew, you've got to scale this thing, you've got to see where it goes. 
CA: So I think this helps explain 19:18 the riddle that baffles everyone looking at this, because these things are described as prediction machines. 19:23 And yet, what we're seeing out of them feels ... it just feels impossible that that could come from a prediction machine. 19:29 Just the stuff you showed us just now. And the key idea of emergence is that when you get more of a thing, 19:35 suddenly different things emerge. It happens all the time, ant colonies, single ants run around, when you bring enough of them together, 19:42 you get these ant colonies that show completely emergent, different behavior. Or a city where a few houses together, it's just houses together. 19:49 But as you grow the number of houses, things emerge, like suburbs and cultural centers and traffic jams. 19:57 Give me one moment for you when you saw just something pop that just blew your mind that you just did not see coming. 20:03 GB: Yeah, well, so you can try this in ChatGPT, if you add 40-digit numbers -- CA: 40-digit? 20:09 GB: 40-digit numbers, the model will do it, which means it's really learned an internal circuit for how to do it. 20:15 And the really interesting thing is actually, if you have it add like a 40-digit number plus a 35-digit number, 20:20 it'll often get it wrong. And so you can see that it's really learning the process, but it hasn't fully generalized, right? 20:27 It's like you can't memorize the 40-digit addition table, that's more atoms than there are in the universe. 20:32 So it had to have learned something general, but that it hasn't really fully yet learned that, Oh, I can sort of generalize this to adding arbitrary numbers 20:39 of arbitrary lengths. CA: So what's happened here is that you've allowed it to scale up and look at an incredible number of pieces of text. 20:46 And it is learning things that you didn't know that it was going to be capable of learning. GB: Well, yeah, and it’s more nuanced, too. 
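The 40-digit versus mixed-length observation is easy to probe systematically: generate prompts in both regimes with their ground-truth sums and compare a model's answers against them. A sketch of such a harness (the prompt wording is an invented format, and querying an actual model is left out):

```python
import random

def random_n_digit(n):
    # Uniform n-digit integer (no leading zero).
    return random.randint(10 ** (n - 1), 10 ** n - 1)

def addition_cases(digits_a, digits_b, trials=100):
    # Build (a, b, prompt, expected) probes for a model's addition,
    # mirroring the 40-digit vs. 40+35-digit comparison from the talk.
    cases = []
    for _ in range(trials):
        a, b = random_n_digit(digits_a), random_n_digit(digits_b)
        cases.append((a, b, f"What is {a} + {b}?", a + b))
    return cases

same_len = addition_cases(40, 40)   # regime the talk says the model handles
mixed_len = addition_cases(40, 35)  # regime where it reportedly often slips
```

Comparing accuracy between the two case lists is what distinguishes "memorized a table" (impossible at this scale) from "learned a circuit that hasn't fully generalized."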
Engineering Quality 20:53 So one science that we’re starting to really get good at is predicting some of these emergent capabilities. 20:58 And to do that actually, one of the things I think is very undersung in this field is sort of engineering quality. 21:04 Like, we had to rebuild our entire stack. When you think about building a rocket, every tolerance has to be incredibly tiny. 21:10 Same is true in machine learning. You have to get every single piece of the stack engineered properly, and then you can start doing these predictions. 21:17 There are all these incredibly smooth scaling curves. They tell you something deeply fundamental about intelligence. 21:23 If you look at our GPT-4 blog post, you can see all of these curves in there. And now we're starting to be able to predict. 21:29 So we were able to predict, for example, the performance on coding problems. We basically look at some models 21:34 that are 10,000 times or 1,000 times smaller. And so there's something about this that is actually smooth scaling, 21:40 even though it's still early days. CA: So here is, one of the big fears then, 21:45 that arises from this. If it’s fundamental to what’s happening here, that as you scale up, things emerge that 21:52 you can maybe predict in some level of confidence, but it's capable of surprising you. 22:00 Why isn't there just a huge risk of something truly terrible emerging? GB: Well, I think all of these are questions of degree 22:07 and scale and timing. And I think one thing people miss, too, is sort of the integration with the world is also this incredibly emergent, 22:14 sort of, very powerful thing too. And so that's one of the reasons that we think it's so important to deploy incrementally. 22:20 And so I think that what we kind of see right now, if you look at this talk, a lot of what I focus on is providing really high-quality feedback. 22:27 Today, the tasks that we do, you can inspect them, right? 
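The prediction Brockman describes — forecasting a big model's performance from runs 1,000x or 10,000x smaller — works because loss tends to follow smooth power laws in compute. A toy fit-and-extrapolate sketch (all numbers are invented for illustration, not OpenAI's actual GPT-4 curves):

```python
import math

# Assume loss follows a power law L = a * C**(-b) in compute C.
compute = [1e18, 1e19, 1e20]
loss = [4.0, 3.2, 2.56]  # here each 10x of compute cuts loss by 20%

# Fit log L = log a - b log C from the two smallest runs...
b = (math.log(loss[0]) - math.log(loss[1])) / (math.log(compute[1]) - math.log(compute[0]))
a = loss[0] * compute[0] ** b

# ...then check against the held-out run and extrapolate 10x further.
check = a * compute[2] ** (-b)   # should recover loss[2]
predicted = a * 1e21 ** (-b)     # forecast for the bigger model
```

The held-out check is the step that gives "some level of confidence" in the extrapolation before the large run is ever launched.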
It's very easy to look at that math problem and be like, no, no, no, 22:33 machine, seven was the correct answer. But even summarizing a book, like, that's a hard thing to supervise. 22:38 Like, how do you know if this book summary is any good? You have to read the whole book. No one wants to do that. 22:43 (Laughter) And so I think that the important thing will be that we take this step by step. 22:49 And that we say, OK, as we move on to book summaries, we have to supervise this task properly. We have to build up a track record with these machines 22:56 that they're able to actually carry out our intent. And I think we're going to have to produce even better, more efficient, 23:02 more reliable ways of scaling this, sort of like making the machine be aligned with you. CA: So we're going to hear later in this session, Confidence 23:09 there are critics who say that, you know, there's no real understanding inside, 23:15 the system is going to always -- we're never going to know that it's not generating errors, that it doesn't have common sense and so forth. 23:22 Is it your belief, Greg, that it is true at any one moment, but that the expansion of the scale and the human feedback 23:30 that you talked about is basically going to take it on that journey 23:35 of actually getting to things like truth and wisdom and so forth, with a high degree of confidence. 23:40 Can you be sure of that? GB: Yeah, well, I think that the OpenAI, I mean, the short answer is yes, I believe that is where we're headed. 23:47 And I think that the OpenAI approach here has always been just like, let reality hit you in the face, right? 23:52 It's like this field is the field of broken promises, of all these experts saying X is going to happen, Y is how it works. 23:58 People have been saying neural nets aren't going to work for 70 years. They haven't been right yet. They might be right maybe 70 years plus one 24:05 or something like that is what you need. 
But I think that our approach has always been, you've got to push to the limits of this technology 24:11 to really see it in action, because that tells you then, oh, here's how we can move on to a new paradigm. And we just haven't exhausted the fruit here. 24:18 CA: I mean, it's quite a controversial stance you've taken, that the right way to do this is to put it out there in public 24:24 and then harness all this, you know, instead of just your team giving feedback, the world is now giving feedback. 24:30 But ... If, you know, bad things are going to emerge, Controversy 24:36 it is out there. So, you know, the original story that I heard on OpenAI when you were founded as a nonprofit, 24:42 well you were there as the great sort of check on the big companies doing their unknown, possibly evil thing with AI. 24:51 And you were going to build models that sort of, you know, somehow held them accountable 24:57 and was capable of slowing the field down, if need be. Or at least that's kind of what I heard. 25:03 And yet, what's happened, arguably, is the opposite. That your release of GPT, especially ChatGPT, 25:12 sent such shockwaves through the tech world that now Google and Meta and so forth are all scrambling to catch up. 25:17 And some of their criticisms have been, you are forcing us to put this out here without proper guardrails or we die. 25:25 You know, how do you, like, make the case that what you have done is responsible here and not reckless. 25:31 GB: Yeah, we think about these questions all the time. Like, seriously all the time. And I don't think we're always going to get it right. 25:38 But one thing I think has been incredibly important, from the very beginning, when we were thinking about how to build artificial general intelligence, 25:45 actually have it benefit all of humanity, like, how are you supposed to do that, right? 
And that default plan of being, well, you build in secret, 25:52 you get this super powerful thing, and then you figure out the safety of it and then you push “go,” and you hope you got it right. 25:59 I don't know how to execute that plan. Maybe someone else does. But for me, that was always terrifying, it didn't feel right. 26:04 And so I think that this alternative approach is the only other path that I see, which is that you do let reality hit you in the face. 26:11 And I think you do give people time to give input. You do have, before these machines are perfect, before they are super powerful, that you actually have the ability 26:19 to see them in action. And we've seen it from GPT-3, right? GPT-3, we really were afraid that the number one thing people were going to do with it 26:26 was generate misinformation, try to tip elections. Instead, the number one thing was generating Viagra spam. 26:31 (Laughter) CA: So Viagra spam is bad, but there are things that are much worse. Pandoras Box 26:39 Here's a thought experiment for you. Suppose you're sitting in a room, there's a box on the table. 26:44 You believe that in that box is something that, there's a very strong chance it's something absolutely glorious 26:50 that's going to give beautiful gifts to your family and to everyone. But there's actually also a one percent thing in the small print there 26:58 that says: “Pandora.” And there's a chance that this actually could unleash unimaginable evils on the world. 27:06 Do you open that box? GB: Well, so, absolutely not. I think you don't do it that way. 27:12 And honestly, like, I'll tell you a story that I haven't actually told before, which is that shortly after we started OpenAI, 27:18 I remember I was in Puerto Rico for an AI conference. I'm sitting in the hotel room just looking out over this wonderful water, 27:24 all these people having a good time. 
And you think about it for a moment, if you could choose for basically that Pandora’s box 27:32 to be five years away or 500 years away, which would you pick, right? 27:38 On the one hand you're like, well, maybe for you personally, it's better to have it be five years away. But if it gets to be 500 years away and people get more time to get it right, 27:47 which do you pick? And you know, I just really felt it in the moment. I was like, of course you do the 500 years. 27:53 My brother was in the military at the time and like, he puts his life on the line in a much more real way than any of us typing things in computers 28:00 and developing this technology at the time. And so, yeah, I'm really sold on the you've got to approach this right. 28:08 But I don't think that's quite playing the field as it truly lies. Like, if you look at the whole history of computing, 28:14 I really mean it when I say that this is an industry-wide or even just almost like 28:20 a human-development- of-technology-wide shift. And the more that you sort of, don't put together the pieces 28:27 that are there, right, we're still making faster computers, we're still improving the algorithms, all of these things, they are happening. 28:34 And if you don't put them together, you get an overhang, which means that if someone does, or the moment that someone does manage to connect to the circuit, 28:42 then you suddenly have this very powerful thing, no one's had any time to adjust, who knows what kind of safety precautions you get. 28:48 And so I think that one thing I take away is like, even you think about development of other sort of technologies, 28:54 think about nuclear weapons, people talk about being like a zero to one, sort of, change in what humans could do. 29:00 But I actually think that if you look at capability, it's been quite smooth over time. 
And so the history, I think, of every technology we've developed 29:07 has been, you've got to do it incrementally and you've got to figure out how to manage it for each moment that you're increasing it. The Model 29:14 CA: So what I'm hearing is that you ... the model you want us to have is that we have birthed this extraordinary child 29:21 that may have superpowers that take humanity to a whole new place. 29:26 It is our collective responsibility to provide the guardrails 29:31 for this child to collectively teach it to be wise and not to tear us all down. 29:37 Is that basically the model? GB: I think it's true. And I think it's also important to say this may shift, right? Conclusion 29:43 We've got to take each step as we encounter it. And I think it's incredibly important today 29:48 that we all do get literate in this technology, figure out how to provide the feedback, decide what we want from it. 29:54 And my hope is that that will continue to be the best path, but it's so good we're honestly having this debate 30:00 because we wouldn't otherwise if it weren't out there. CA: Greg Brockman, thank you so much for coming to TED and blowing our minds. 30:07 (Applause) """
Raw Transcript 2
""" 0:00 [MUSIC] 0:07 ANNOUNCER: Please welcome AI researcher and founding member of OpenAI, Andrej Karpathy. 0:21 ANDREJ KARPATHY: Hi, everyone. I'm happy to be here to tell you about the state of GPT and more generally about 0:28 the rapidly growing ecosystem of large language models. I would like to partition the talk into two parts. 0:35 In the first part, I would like to tell you about how we train GPT Assistance, and then in the second part, 0:40 we're going to take a look at how we can use these assistants effectively for your applications. 0:46 First, let's take a look at the emerging recipe for how to train these assistants and keep in mind that this is all very new and still rapidly evolving, 0:53 but so far, the recipe looks something like this. Now, this is a complicated slide, I'm going to go through it piece by GPT Assistant training pipeline 0:59 piece, but roughly speaking, we have four major stages, pretraining, 1:04 supervised finetuning, reward modeling, reinforcement learning, and they follow each other serially. 1:09 Now, in each stage, we have a dataset that powers that stage. We have an algorithm that for our purposes will be 1:17 a objective and over for training the neural network, and then we have a resulting model, 1:23 and then there are some notes on the bottom. The first stage we're going to start with as the pretraining stage. Now, this stage is special in this diagram, 1:31 and this diagram is not to scale because this stage is where all of the computational work basically happens. This is 99 percent of the training 1:38 compute time and also flops. This is where we are dealing with 1:44 Internet scale datasets with thousands of GPUs in the supercomputer and also months of training potentially. 1:51 The other three stages are finetuning stages that are much more along the lines of small few number of GPUs and hours or days. 1:59 Let's take a look at the pretraining stage to achieve a base model. 
First, we are going to gather a large amount of data. Data collection 2:07 Here's an example of what we call a data mixture that comes from this paper that was released by 2:13 Meta, where they released this LLaMA base model. Now, you can see roughly the datasets that 2:18 enter into these collections. We have CommonCrawl, which is a web scrape, C4, which is also CommonCrawl, 2:25 and then some high-quality datasets as well. For example, GitHub, Wikipedia, Books, arXiv, StackExchange and so on. 2:31 These are all mixed up together, and then they are sampled according to some given proportions, 2:36 and that forms the training set for the GPT. Now before we can actually train on this data, 2:43 we need to go through one more preprocessing step, and that is tokenization. This is basically a translation of 2:48 the raw text that we scrape from the Internet into sequences of integers, because 2:53 that's the native representation over which GPTs function. Now, this is a lossless translation 3:00 between pieces of text and tokens and integers, and there are a number of algorithms for this stage. 3:05 Typically, for example, you could use something like byte pair encoding, which iteratively merges text chunks 3:11 and groups them into tokens. Here, I'm showing some example chunks of these tokens, 3:16 and then this is the raw integer sequence that will actually feed into a transformer. Now, here I'm showing 3:23 two example models and examples of the hyperparameters that govern this stage. 3:28 For GPT-4, we did not release too much information about how it was trained and so on, so I'm using GPT-3's numbers, 3:33 but GPT-3 is of course a little bit old by now, about three years ago. But LLaMA is a fairly recent model from Meta. 3:40 These are roughly the orders of magnitude that we're dealing with when we're doing pretraining. The vocabulary size is usually a couple of tens of thousands of tokens.
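The byte-pair-encoding idea mentioned above can be illustrated in a few lines. This is a toy sketch of the merge step only (a real tokenizer such as GPT-2's learns thousands of merges over bytes, not characters):

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One merge step of a toy byte-pair encoder: find the most frequent
    adjacent pair and fuse every occurrence into a single new token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)   # fuse the pair into one chunk
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and merge iteratively.
toks = list("aaabdaaabac")
for _ in range(3):
    toks = bpe_merge_step(toks)
# after three merges, frequent substrings have become single tokens
```

In a real tokenizer, each learned chunk is then mapped to an integer id, giving the sequences of integers the talk describes.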
3:48 The context length is usually something like 2,000, 4,000, or nowadays even 100,000, 3:53 and this governs the maximum number of integers that the GPT will look at when it's trying to 3:58 predict the next integer in a sequence. You can see that roughly the number of parameters is, say, 4:04 65 billion for LLaMA. Now, even though LLaMA has only 65B parameters compared to GPT-3's 175 billion parameters, 4:11 LLaMA is a significantly more powerful model, and intuitively, that's because the model is trained for significantly longer. 4:17 In this case, 1.4 trillion tokens, instead of 300 billion tokens. You shouldn't judge the power of a model by 4:23 the number of parameters that it contains. Below, I'm showing some tables of rough hyperparameters that typically 4:31 go into specifying the transformer neural network: the number of heads, the dimension size, the number of layers, 4:36 and so on, and on the bottom I'm showing some training hyperparameters. For example, to train the 65B model, 4:44 Meta used 2,000 GPUs, roughly 21 days of training, and roughly several million dollars. 4:52 That's the rough order of magnitude that you should have in mind for the pretraining stage. 4:57 Now, when we're actually pretraining, what happens? Roughly speaking, we are going to take our tokens, 5:03 and we're going to lay them out into data batches. We have these arrays that will feed into the transformer, 5:09 and these arrays are B by T: B, the batch size, with independent examples stacked up in rows, and 5:16 T being the maximum context length. In my picture I only have ten columns, but in practice the context length could be 2,000, 4,000, etc. 5:23 These are extremely long rows. What we do is we take these documents, and we pack them into rows, 5:28 and we delimit them with these special end-of-text tokens, basically telling the transformer where a new document begins. 5:35 Here, I have a few examples of documents and then I stretch them out into this input.
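The packing step described above can be sketched directly. This is a minimal illustration (the `EOT` id 50256 is GPT-2's end-of-text token; real pipelines shard and shuffle much larger streams):

```python
import numpy as np

EOT = 50256  # end-of-text token id (GPT-2's value, used here for illustration)

def pack_batch(docs, B, T):
    """Pack tokenized documents into a B-by-T array of token ids,
    delimiting documents with the end-of-text token, as described above."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOT)        # tell the model a new document begins
    stream = stream[:B * T]       # truncate to fill exactly B rows
    stream += [EOT] * (B * T - len(stream))  # pad if we ran out of data
    return np.array(stream, dtype=np.int64).reshape(B, T)

batch = pack_batch([[1, 2, 3], [4, 5], [6, 7, 8, 9]], B=2, T=5)
# batch.shape == (2, 5); documents flow across rows, separated by EOT
```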
5:41 Now, we're going to feed all of these numbers into the transformer. Let me just focus on a single particular cell, 5:49 but the same thing will happen at every cell in this diagram. Let's look at the green cell. The green cell is going to take 5:56 a look at all of the tokens before it, so all of the tokens in yellow, and we're going to feed that entire context 6:03 into the transformer neural network, and the transformer is going to try to predict the next token in 6:08 the sequence, in this case in red. Now the transformer, I unfortunately don't have too much time to go into the full details of this 6:14 neural network architecture; it is just a large blob of neural net stuff for our purposes, and it's got several 6:20 tens of billions of parameters typically or something like that. Of course, as you tune these parameters, you're getting slightly different predicted distributions 6:26 for every single one of these cells. For example, if our vocabulary size is 50,257 tokens, 6:34 then we're going to have that many numbers, because we need to specify a probability distribution for what comes next. 6:40 Basically, we have a probability for whatever may follow. Now, in this specific example, for this specific cell, 6:45 513 will come next, and so we can use this as a source of supervision to update our transformer's weights. 6:51 We're applying this basically on every single cell in parallel, and we keep swapping batches, and we're trying to get the transformer to make 6:58 the correct predictions over what token comes next in the sequence. Let me show you more concretely what this looks 7:03 like when you train one of these models. This example is actually coming from the New York Times, and they trained a small GPT on Shakespeare. 7:11 Here's a small snippet of Shakespeare, and they trained their GPT on it. Now, in the beginning, at initialization, 7:17 the GPT starts with completely random weights. You're getting completely random outputs as well.
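The supervision signal at one cell, as described above, is ordinary cross-entropy: the model outputs one score per vocabulary entry, and the loss is the negative log-probability it assigned to the token that actually came next. A NumPy sketch:

```python
import numpy as np

def next_token_loss(logits, target):
    """Cross-entropy loss for one cell: `logits` is a vector with one score
    per vocabulary entry, `target` is the id of the token that actually
    came next. Low loss means high probability on the correct token."""
    logits = logits - logits.max()              # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target])

vocab_size = 50257
rng = np.random.default_rng(0)
logits = rng.normal(size=vocab_size)            # an untrained model's scores
loss = next_token_loss(logits, target=513)      # 513, as in the example above
# a random model's loss is near ln(50257), roughly 10.8 nats
```

This is the quantity plotted in the pretraining loss curves the talk mentions, averaged over every cell in every batch.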
But over time, as you train the GPT longer and longer, 7:26 you are getting more and more coherent and consistent samples from the model, 7:31 and the way you sample from it, of course, is you predict what comes next, you sample from that distribution, and 7:36 you keep feeding that back into the process, and you can basically sample large sequences. 7:42 By the end, you see that the transformer has learned about words and where to put spaces and where to put commas and so on. 7:48 We're making more and more consistent predictions over time. These are the plots that you are looking at when you're doing model pretraining. 7:54 Effectively, we're looking at the loss function over time as you train, and low loss means that our transformer 8:00 is giving a higher probability to the next correct integer in the sequence. 8:06 What are we going to do with the model once we've trained it after a month? Well, the first thing that we, the field, noticed Base models learn powerful, general representations 8:14 is that these models, basically in the process of language modeling, learn very powerful general representations, 8:21 and it's possible to very efficiently fine-tune them for any arbitrary downstream task you might be interested in. 8:26 As an example, if you're interested in sentiment classification, the approach used to be that you collect a bunch of positives 8:33 and negatives and then you train some NLP model for that, but the new approach is: 8:38 ignore sentiment classification, go off and do large language model pretraining, 8:43 train a large transformer, and then you may only have a few examples and you can very efficiently fine-tune 8:48 your model for that task. This works very well in practice.
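The sampling procedure described above (predict, sample, feed back in) is a short loop. Here `model` is a stand-in assumption: any function mapping the token sequence so far to a probability distribution over the vocabulary:

```python
import numpy as np

def sample_sequence(model, context, steps, rng):
    """Autoregressive sampling: predict a distribution over the next token,
    sample from it, append the sample, and repeat.
    `model` maps a token list to a probability vector (an assumption here)."""
    tokens = list(context)
    for _ in range(steps):
        probs = model(tokens)                  # distribution over the vocab
        nxt = rng.choice(len(probs), p=probs)  # sample the next token
        tokens.append(int(nxt))                # feed it back into the context
    return tokens

# toy "model": a uniform distribution over a 4-token vocabulary
uniform = lambda toks: np.ones(4) / 4
out = sample_sequence(uniform, [0], steps=5, rng=np.random.default_rng(0))
# out is the original context plus 5 sampled tokens
```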
The reason for this is that basically 8:55 the transformer is forced to multitask across a huge number of tasks in the language modeling task, 9:00 because in terms of predicting the next token, it's forced to understand a lot about the structure of the text and all the different concepts therein. 9:09 That was GPT-1. Now around the time of GPT-2, people noticed that actually even better than finetuning, 9:15 you can actually prompt these models very effectively. These are language models and they want to complete documents; 9:20 you can actually trick them into performing tasks by arranging these fake documents. 9:25 In this example, for example, we have some passage and then we do QA, QA, QA. 9:31 This is called a few-shot prompt, and then we do Q, and as the transformer tries to complete the document, it is actually answering our question. 9:37 This is an example of prompt engineering a base model: making it believe that it's imitating a document and getting it to perform a task. 9:45 This kicked off, I think, the era of, I would say, prompting over finetuning, and seeing that this 9:50 can actually work extremely well on a lot of problems, even without training any neural networks, finetuning or so on. 9:56 Now since then, we've seen an entire evolutionary tree of base models that everyone has trained. 10:02 Not all of these models are available. For example, the GPT-4 base model was never released. 10:08 The GPT-4 model that you might be interacting with over the API is not a base model, it's an assistant model, and we're going to cover how to get those in a bit. 10:15 The GPT-3 base model is available via the API under the name davinci, and the GPT-2 base model 10:21 is available even as weights on our GitHub repo. But currently the best available base model 10:27 is probably the LLaMA series from Meta, although it is not commercially licensed. 10:32 Now, one thing to point out is base models are not assistants.
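The "fake document" arrangement described above is just string assembly. A minimal sketch (the `Q:`/`A:` template is one common convention, not the only one):

```python
def few_shot_prompt(examples, query):
    """Arrange a 'fake document' of Q/A pairs so that completing the
    document answers the final question, as described above."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {query}")
    lines.append("A:")            # the model's completion becomes the answer
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("What is the capital of France?", "Paris"),
     ("What is 2+2?", "4")],
    "What is the capital of Alaska?",
)
```

Fed to a base model, the most likely continuation of this "document" is an answer in the same format.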
They don't want to answer your questions; 10:41 they want to complete documents. If you ask them to write a poem about bread and cheese, 10:46 they may just answer your question with more questions: they're completing what they think is a document. 10:51 However, you can prompt them in a specific way for base models that is more likely to work. 10:57 As an example: here's a poem about bread and cheese, and in that case it will autocomplete correctly. You can even trick base models into being assistants. 11:06 The way you would do this is you would create a specific few-shot prompt that makes it look like there's some document where the human and assistant 11:13 are exchanging information. Then at the bottom, you put your query at the end, and the base model 11:21 will condition itself into being a helpful assistant and answer, 11:26 but this is not very reliable and doesn't work super well in practice, although it can be done. Instead, we have a different path to make 11:32 actual GPT assistants, not base model document completers. That takes us into supervised finetuning. 11:39 In the supervised finetuning stage, we are going to collect small but high-quality datasets, and in this case, 11:45 we're going to ask human contractors to gather data of the form prompt and ideal response. 11:52 We're going to collect lots of these, typically tens of thousands or something like that. Then we're going to still do language 11:58 modeling on this data. Nothing changed algorithmically; we're just swapping out the training set. It used to be Internet documents, 12:04 which is high quantity but low quality; we swap it for basically Q&A prompt-response data 12:11 that is low quantity, high quality. We will still do language modeling and then after training, 12:16 we get an SFT model. You can actually deploy these models and they are actual assistants and they work to some extent. 12:22 Let me show you what an example demonstration might look like.
Here's something that a human contractor might come up with. 12:28 Here's some random prompt: can you write a short introduction about the relevance of the term monopsony, or something like that? 12:34 Then the contractor also writes out an ideal response. When they write out these responses, they are following extensive labeling 12:40 documentation and they are being asked to be helpful, truthful, and harmless. 12:45 These labeling instructions here, you probably can't read them, and neither can I, but they're long, and this is people 12:52 following instructions and trying to complete these prompts. That's what the dataset looks like. You can train these models. This works to some extent. 12:59 Now, you can actually continue the pipeline from here on, and go into RLHF, 13:05 reinforcement learning from human feedback, which consists of both reward modeling and reinforcement learning. 13:10 Let me cover that and then I'll come back to why you may want to go through the extra steps and how that compares to SFT models. 13:16 In the reward modeling step, what we're going to do is we're now going to shift our data collection to be of the form of comparisons. 13:23 Here's an example of what our dataset will look like. I have the same identical prompt on the top, RM Dataset 13:28 which is asking the assistant to write a program or a function that checks if a given string is a palindrome. 13:35 Then what we do is we take the SFT model which we've already trained and we create multiple completions. 13:41 In this case, we have three completions that the model has created, and then we ask people to rank these completions. 13:47 If you stare at this for a while, and by the way, these are very difficult comparisons to make between some of these predictions. 13:52 This can take people even hours for a single prompt-completion pair, 13:57 but let's say we decided that one of these is much better than the others and so on. We rank them.
14:03 Then we can follow that with something that looks very much like a binary classification on all the possible pairs between these completions. RM Training 14:10 What we do now is, we lay out our prompt in rows, and the prompt is identical across all three rows here. 14:16 It's all the same prompt, but the completion of this varies. The yellow tokens are coming from the SFT model. 14:21 Then what we do is we append another special reward readout token at the end and we basically only 14:28 supervise the transformer at this single green token. The transformer will predict some reward 14:34 for how good that completion is for that prompt and basically it makes 14:39 a guess about the quality of each completion. Then once it makes a guess for every one of them, 14:44 we also have the ground truth which is telling us the ranking of them. We can actually enforce that some of 14:50 these numbers should be much higher than others, and so on. We formulate this into a loss function and we train our model to make reward predictions 14:56 that are consistent with the ground truth coming from the comparisons from all these contractors. That's how we train our reward model. 15:02 That allows us to score how good a completion is for a prompt. Once we have a reward model, 15:09 we can't deploy this because this is not very useful as an assistant by itself, but it's very useful for the reinforcement 15:15 learning stage that follows now. Because we have a reward model, we can score the quality of any arbitrary completion for any given prompt. 15:22 What we do during reinforcement learning is we basically get, again, a large collection of prompts and now we do 15:28 reinforcement learning with respect to the reward model. Here's what that looks like. 
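The "binary classification on all the possible pairs" step described above is commonly formulated with a Bradley-Terry-style loss: for each pair, push the predicted reward of the human-preferred completion above the other. A NumPy sketch under that assumption (not necessarily the exact loss used in production):

```python
import numpy as np

def pairwise_rm_loss(r_better, r_worse):
    """For a pair where the human ranked the first completion higher, the
    loss -log sigmoid(r_better - r_worse) pushes the rewards apart."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_better - r_worse))))

def ranking_loss(rewards, ranking):
    """Average the pairwise loss over all pairs implied by a full ranking.
    `ranking` lists completion indices from best to worst."""
    losses = []
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            losses.append(pairwise_rm_loss(rewards[ranking[i]],
                                           rewards[ranking[j]]))
    return float(np.mean(losses))

# three completions; suppose the human says completion 2 > 0 > 1
good = ranking_loss(np.array([0.5, -1.2, 2.0]), ranking=[2, 0, 1])
bad = ranking_loss(np.array([-1.2, 0.5, 2.0]), ranking=[1, 2, 0])
# reward predictions consistent with the ranking give a lower loss
```

Minimizing this over many ranked sets is what trains the reward model to score completions.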
We take a single prompt, 15:34 we lay it out in rows, and now we use basically the model we'd like to train, which 15:39 was initialized from the SFT model, to create some completions in yellow, and then we append the reward token again 15:45 and we read off the reward according to the reward model, which is now kept fixed. It doesn't change any more. Now the reward model 15:53 tells us the quality of every single completion for all these prompts, and so what we can do is we can now just basically apply the same 15:59 language modeling loss function, but we're currently training on the yellow tokens, and we are weighing 16:06 the language modeling objective by the rewards indicated by the reward model. As an example, in the first row, 16:13 the reward model said that this is a fairly high-scoring completion, and so all the tokens that we 16:18 happen to sample on the first row are going to get reinforced and they're going to get higher probabilities for the future. 16:25 Conversely, on the second row, the reward model really did not like this completion, -1.2. Therefore, every single token that we sampled in 16:32 that second row is going to get a slightly lower probability for the future. We do this over and over on many prompts on many batches, and basically, 16:39 we get a policy that creates yellow tokens here, such that basically all the completions 16:46 score high according to the reward model that we trained in the previous stage. 16:51 That's what the RLHF pipeline is. Then at the end, you get a model that you could deploy. 16:58 As an example, ChatGPT is an RLHF model, but some other models that you might come across, for example, 17:05 Vicuna-13B, and so on, these are SFT models. We have base models, SFT models, and RLHF models. 17:12 That's the state of things there. Now why would you want to do RLHF? One answer that's not 17:19 that exciting is that it just works better. This comes from the InstructGPT paper.
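The "weigh the language modeling objective by the reward" step above resembles a bare-bones policy-gradient update. A deliberately simplified sketch (real RLHF uses PPO, with clipping, a value baseline, and a KL penalty against the SFT model, none of which appear here):

```python
import numpy as np

def rl_weighted_loss(token_logprobs, reward):
    """Language-modeling loss on the sampled (yellow) tokens, scaled by the
    reward: minimizing it raises the probability of tokens from
    high-reward completions and lowers it for low-reward ones."""
    return -reward * float(np.sum(token_logprobs))

logp = np.log(np.array([0.5, 0.25, 0.8]))    # log-probs of sampled tokens
high = rl_weighted_loss(logp, reward=+1.0)   # positive: reinforce the tokens
low = rl_weighted_loss(logp, reward=-1.2)    # negative: push probabilities down
```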
According to these experiments from a while ago now, 17:25 these PPO models are RLHF. We see that they are basically preferred in a lot 17:30 of comparisons when we give them to humans. Humans prefer basically tokens 17:36 that come from RLHF models compared to SFT models, compared to a base model that is prompted to be an assistant. It just works better. 17:43 But you might ask, why does it work better? I don't think that there's a single amazing answer 17:49 that the community has really agreed on, but I will offer one reason potentially. 17:55 It has to do with the asymmetry between how computationally easy it is to compare versus generate. 18:02 Let's take an example of generating a haiku. Suppose I ask a model to write a haiku about paper clips. 18:07 Imagine being a contractor collecting data for the SFT stage: 18:14 how are you supposed to create a nice haiku for a paper clip? You might not be very good at that, but if I give you a few examples of 18:20 haikus, you might be able to appreciate some of these haikus a lot more than others. Judging which one of these is good is a much easier task. 18:27 Basically, this asymmetry makes it so that comparisons are a better way to potentially leverage 18:33 yourself as a human and your judgment to create a slightly better model. Now, RLHF models are not 18:40 strictly an improvement on the base models in some cases. In particular, we've noticed, for example, that they lose some entropy. 18:46 That means that they give more peaky results. They can output samples Mode collapse 18:54 with lower variation than the base model. The base model has lots of entropy and will give lots of diverse outputs. 19:00 For example, one place where I still prefer to use a base model is in the setup 19:06 where you basically have n things and you want to generate more things like it. 19:13 Here is an example that I just cooked up. I want to generate cool Pokemon names.
19:18 I gave it seven Pokemon names and I asked the base model to complete the document, and it gave me a lot more Pokemon names. 19:24 These are fictitious. I tried to look them up. I don't believe they're actual Pokemon. This is the task that I think the base model would be 19:31 good at, because it still has lots of entropy. It'll give you lots of diverse, cool things that look like whatever you gave it before. 19:41 Having said all that, these are the assistant models that are probably available to you at this point. 19:47 There was a team at Berkeley that ranked a lot of the available assistant models and gave them basically Elo ratings. 19:53 Currently, some of the best models, of course, are GPT-4, by far, I would say, followed by Claude, GPT-3.5, and then a number of models, 20:00 some of which might be available as weights, like Vicuna, Koala, etc. The first three rows here are 20:07 all RLHF models and all of the other models, to my knowledge, are SFT models, I believe. 20:15 That's how we train these models at the high level. Now I'm going to switch gears and let's look at how we can 20:22 best apply the GPT assistant model to your problems. Now, I would like to work 20:27 in the setting of a concrete example. Let's work with a concrete example here. 20:32 Let's say that you are working on an article or a blog post, and you're going to write this sentence at the end. 20:38 "California's population is 53 times that of Alaska." So for some reason, you want to compare the populations of these two states. 20:44 Think about the rich internal monologue and tool use, and how much work actually goes on computationally in 20:50 your brain to generate this one final sentence. Here's maybe what that could look like in your brain. 20:55 For this next step of my blog, let me compare these two populations. 21:01 First I'm going to obviously need to get both of these populations.
Now, I know that I probably 21:06 don't know these populations off the top of my head, so I'm aware of what I know or don't know in my self-knowledge. 21:12 I go, I do some tool use and I go to Wikipedia and I look up California's population and Alaska's population. 21:19 Now, I know that I should divide the two, but again, I know that dividing 39.2 by 0.74 in my head is very unlikely to succeed. 21:26 That's not the kind of thing that I can do in my head, and so therefore, I'm going to rely on a calculator, so I'm going to use a calculator, 21:33 punch it in and see that the output is roughly 53. Then maybe I do some reflection and sanity checks in 21:40 my brain: does 53 make sense? Well, that's quite a large fraction, but then California is the most 21:45 populous state, so maybe that looks okay. Then I have all the information I might need, and now I get to the creative portion of writing. 21:52 I might start to write something like "California has 53x times greater" and then I think to myself, 21:58 that's actually really awkward phrasing, so let me delete that and let me try again. 22:03 As I'm writing, I have this separate process, almost inspecting what I'm writing and judging whether it looks good 22:09 or not, and then maybe I delete and maybe I reframe it, and then maybe I'm happy with what comes out. 22:15 Basically, long story short, a ton happens under the hood in terms of your internal monologue when you create sentences like this. 22:21 But what does a sentence like this look like when we are training a GPT on it? From GPT's perspective, this 22:28 is just a sequence of tokens. GPT, when it's reading or generating these tokens, 22:34 just goes chunk, chunk, chunk, chunk, and each chunk is roughly the same amount of computational work for each token. 22:40 These transformers are not very shallow networks; they have about 80 layers of reasoning, 22:45 but 80 is still not too much.
This transformer is going to do its best to imitate, 22:51 but of course, the process here looks very different from the process that you took. In particular, in our final artifacts, 22:59 in the datasets that we create and then eventually feed to LLMs, all that internal dialogue was completely stripped, and unlike you, 23:07 the GPT will look at every single token and spend the same amount of compute on every one of them. So, you can't expect it 23:13 to do too much work per token, and also in particular, 23:21 basically these transformers are just token simulators; they don't know what they don't know. 23:26 They just imitate the next token. They don't know what they're good at or not good at. They just try their best to imitate the next token. 23:32 They don't reflect in a loop. They don't sanity check anything. They don't correct their mistakes along the way. 23:37 By default, they just sample token sequences. They don't have a separate inner monologue stream 23:43 in their head, right, that's evaluating what's happening. Now, they do have some cognitive advantages, 23:48 I would say, and that is that they do actually have very large fact-based knowledge across a vast number of areas, because they have, 23:55 say, several tens of billions of parameters. That's a lot of storage for a lot of facts. They also, I think, have 24:02 a relatively large and perfect working memory. Whatever fits into the context window 24:07 is immediately available to the transformer through its internal self-attention mechanism, and so it's perfect memory, 24:14 but it's got a finite size; the transformer has very direct access to it, and so it can losslessly remember anything that 24:22 is inside its context window. This is how I would compare those two, and the reason I bring all of this up is because I 24:27 think to a large extent, prompting is just making up for this cognitive difference between 24:34 these two architectures, like our brains here and LLM brains.
24:39 You can look at it that way almost. Here's one thing that people found, for example, works pretty well in practice. 24:45 Especially if your tasks require reasoning, you can't expect the transformer to do too much reasoning per token. 24:52 You have to really spread out the reasoning across more and more tokens. For example, you can't give a transformer 24:57 a very complicated question and expect it to get the answer in a single token. There's just not enough time for it. "These transformers need tokens to 25:04 think," I like to say sometimes. This is some of the things that work well: you may, for example, have a few-shot prompt that 25:10 shows the transformer that it should show its work when it's answering questions, and if you give a few examples, 25:17 the transformer will imitate that template and it will just end up working out better in terms of its evaluation. 25:24 Additionally, you can elicit this behavior from the transformer by saying, "Let's think step by step." 25:29 Because this conditions the transformer into showing its work, and because 25:34 it snaps into a mode of showing its work, it's going to do less computational work per token. 25:40 It's more likely to succeed as a result, because it's doing slower reasoning over time. 25:46 Here's another example; this one is called self-consistency. We saw that we had the ability Ensemble multiple attempts 25:51 to start writing and then if it didn't work out, I can try again and I can try multiple times 25:56 and maybe select the one that worked best. In these approaches, 26:02 you may sample not just once, but you may sample multiple times and then have some process for finding 26:07 the ones that are good and then keeping just those samples or doing a majority vote or something like that.
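The self-consistency idea above, in its majority-vote form, is a few lines of glue code. Here `sample_fn` is a stand-in assumption for one stochastic model rollout that returns a final answer:

```python
from collections import Counter

def self_consistency(sample_fn, n=10):
    """Sample an answer multiple times and take a majority vote over the
    final answers, as described above. `sample_fn` stands in for one
    stochastic model rollout (an assumption for this sketch)."""
    answers = [sample_fn() for _ in range(n)]
    answer, _ = Counter(answers).most_common(1)[0]
    return answer

# toy stand-in: canned rollouts where most attempts agree on 53
canned = iter([53, 52, 53, 53, 52, 53, 53])
vote = self_consistency(lambda: next(canned), n=7)
# vote is 53: five of the seven sampled answers agree
```

In practice you would also filter out rollouts that failed to produce a parseable final answer before voting.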
Basically, these transformers, in the process as 26:14 they predict the next token, just like you, can get unlucky and sample a not-very-good 26:19 token, and they can go down a blind alley in terms of reasoning. Unlike you, they cannot recover from that. 26:27 They are stuck with every single token they sample, and so they will continue the sequence, even if they know that this sequence is not going to work out. 26:34 We have to give them the ability to look back, inspect, or try to basically sample around it. 26:40 Here's one technique also: it turns out that actually LLMs know when they've screwed up, Ask for reflection 26:47 so as an example, say you ask the model to generate a poem that does not 26:52 rhyme, and it might give you a poem, but it actually rhymes. But it turns out that especially for the bigger models like GPT-4, 26:58 you can just ask it, "Did you meet the assignment?" Actually, GPT-4 knows very well that it did not meet the assignment. 27:04 It just got unlucky in its sampling. It will tell you, "No, I didn't actually meet the assignment here. Let me try again." 27:10 But without you prompting it, it doesn't know to revisit and so on. 27:17 You have to make up for that in your prompts; you have to get it to check. If you don't ask it to check, 27:23 it's not going to check by itself; it's just a token simulator. 27:28 I think more generally, a lot of these techniques fall into the bucket of what I would say is recreating our System 2. 27:34 You might be familiar with System 1 and System 2 thinking for humans. System 1 is a fast, automatic process, and I 27:40 think corresponds to an LLM just sampling tokens. System 2 is the slower, deliberate 27:46 planning part of your brain. This is a paper actually from 27:51 just last week, because this space is pretty quickly evolving; it's called Tree of Thought.
27:56 The authors of this paper proposed maintaining multiple completions for any given prompt, 28:02 and then they are also scoring them along the way and keeping the ones that are going well, if that makes sense. 28:08 A lot of people are really playing around with prompt engineering 28:13 to basically bring back some of these abilities that we have in our brain for LLMs. 28:19 Now, one thing I would like to note here is that this is not just a prompt. This is actually prompts that are 28:25 used together with some Python glue code, because you actually have to maintain multiple prompts and you also have to do 28:30 some tree search algorithm here to figure out which prompts to expand, etc. It's a symbiosis of Python glue code and 28:38 individual prompts that are called in a while loop or in a bigger algorithm. I also think there's a really cool 28:43 parallel here to AlphaGo. AlphaGo has a policy for placing the next stone when it plays Go, 28:48 and its policy was trained originally by imitating humans. But in addition to this policy, 28:54 it also does Monte Carlo Tree Search. Basically, it will play out a number of possibilities in its head and evaluate all of 29:00 them and only keep the ones that work well. I think this is an equivalent of AlphaGo but for text, if that makes sense. 29:08 Just like Tree of Thought, I think more generally people are starting to really explore 29:13 more general techniques of not just simple question-answer prompts, but something that looks a lot more like 29:19 Python glue code stringing together many prompts. On the right, I have an example from this paper called ReAct, where they 29:25 structure the answer to a prompt as a sequence of thought-action-observation, 29:32 thought-action-observation, and it's a full rollout and a thinking process to answer the query. 29:38 In these actions, the model is also allowed to use tools. On the left, I have an example of AutoGPT.
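The "maintain multiple completions, score them, keep the ones going well" loop above is exactly the kind of Python glue code the talk describes. A beam-search-style sketch, where `propose` and `score` are hypothetical hooks (in a real system they would each wrap LLM calls):

```python
def tree_of_thought_search(propose, score, root, width=3, depth=2):
    """Maintain several partial completions, score each one, and keep only
    the best `width` at every step. `propose` and `score` are hypothetical
    hooks: propose(state) -> list of extended states, score(state) -> float."""
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            candidates.extend(propose(state))      # expand each completion
        # keep only the completions that are "going well"
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return max(frontier, key=score)

# toy stand-in problem: build a string, scoring by how many 'a's it contains
best = tree_of_thought_search(
    propose=lambda s: [s + c for c in "ab"],
    score=lambda s: s.count("a"),
    root="",
)
# the search keeps extending the highest-scoring branch
```

This mirrors the AlphaGo analogy in the talk: a policy proposes moves (completions), and a search keeps only the promising branches.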
29:45 Now, AutoGPT, by the way, is a project that I think got a lot of hype recently, 29:51 but I still find it inspirationally interesting. It's a project that allows an LLM to keep 29:58 a task list and continue to recursively break down tasks. I don't think this currently works very well, and I would 30:04 not advise people to use it in practical applications. I just think it's something to generally take inspiration 30:09 from in terms of where this is going over time. That's like giving our model System 2 thinking. 30:16 The next thing I find interesting is the following, I would say almost psychological, quirk of LLMs: 30:23 LLMs don't want to succeed, they want to imitate. You want to succeed, and you should ask for it. 30:31 What I mean by that is, when transformers are trained, they have training sets, and there can be 30:38 an entire spectrum of performance qualities in their training data. For example, there could be some kind of a prompt 30:43 for some physics question or something like that, and there could be a student's solution that is completely wrong, but there can also be an expert 30:49 answer that is completely right. Transformers can't tell the difference between them; 30:54 they know about low-quality solutions and high-quality solutions, but by default they want to imitate all of 30:59 it, because they're just trained on language modeling. At test time, you actually have to ask for a good performance. 31:06 In this example, in this paper, they tried various prompts. "Let's think step by step" was very powerful, 31:13 because it spread out the reasoning over many tokens. But what worked even better is, "Let's work this out in a step-by-step way 31:19 to be sure we have the right answer."
It's like conditioning on getting the right answer, and this actually makes the transformer work 31:25 better, because the transformer doesn't have to hedge its probability mass on low-quality solutions, 31:31 as ridiculous as that sounds. Basically, feel free to ask for a strong solution. 31:37 Say something like, "You are a leading expert on this topic. Pretend you have IQ 120," etc. But don't try to ask for too much IQ, because if 31:44 you ask for IQ 400, you might be out of the data distribution, or even worse, you could be in the data distribution for 31:51 something like sci-fi stuff, and it will start to take on some sci-fi or roleplaying or something like that. 31:56 You have to find the right amount of IQ. I think there's some U-shaped curve there. 32:02 Next up, as we saw, when we are trying to solve problems, we know what we are good at and what we're not good at, 32:09 and we lean on tools computationally. You potentially want to do the same with your LLMs. Tool use / Plugins 32:15 In particular, we may want to give them calculators, code interpreters, 32:20 and so on, the ability to do search, and there's a lot of techniques for doing that. 32:27 One thing to keep in mind, again, is that these transformers by default may not know what they don't know. 32:32 You may even want to tell the transformer in a prompt, "You are not very good at mental arithmetic. Whenever you need to do very large-number addition, 32:40 multiplication, or whatever, instead use this calculator. Here's how you use the calculator: you use this token combination," etc. 32:46 You have to actually spell it out, because the model by default doesn't know what it's good at or not good at, necessarily, just as you and I might not. 32:54 Next up, I think something that is very interesting is that we went from a world that was retrieval only, all the way 33:02 to the other extreme; the pendulum has swung to where it's memory only in LLMs.
But actually, there's this entire space in between of 33:08 these retrieval-augmented models, and this works extremely well in practice. As I mentioned, the context window of 33:14 a transformer is its working memory. If you can load the working memory with any information that is relevant to the task, 33:21 the model will work extremely well, because it can immediately access all that memory. I think a lot of people are really interested 33:28 in basically retrieval-augmented generation. On the bottom, I have an example of LlamaIndex, which is 33:35 a data connector to lots of different types of data. You can index all 33:41 of that data and make it accessible to LLMs. The emerging recipe there is: you take relevant documents, 33:47 you split them up into chunks, you embed all of them, and you basically get embedding vectors that represent that data. 33:53 You store that in a vector store, and then at test time you make some kind of a query to your vector store and fetch the chunks that 34:00 might be relevant to your task, you stuff them into the prompt, and then you generate. This can work quite well in practice. 34:06 This is, I think, similar to how you and I solve problems. You can do everything from your memory, and 34:11 transformers have a very large and extensive memory, but it also really helps to reference some primary documents. 34:17 Whenever you find yourself going back to a textbook to find something, or going back to the documentation of a library to look something up, 34:25 transformers definitely want to do that too. You have some memory of how 34:30 some documentation of a library works, but it's much better to look it up. The same applies here. 34:35 Next, I wanted to briefly talk about constraint prompting. I also find this very interesting. 34:41 This is basically techniques for forcing a certain template in the outputs of LLMs. 34:50 Guidance is one example, from Microsoft actually.
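The retrieval-augmented generation recipe just described (split into chunks, embed, store in a vector store, fetch relevant chunks at query time, stuff them into the prompt) can be sketched end to end. The bag-of-words cosine similarity below is a toy stand-in for a real embedding model, and all names are illustrative.

```python
# Sketch of the emerging retrieval recipe: chunk documents, "embed" them,
# store the vectors, then at query time fetch the nearest chunks and stuff
# them into the prompt. Bag-of-words counts stand in for real embeddings.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def build_store(documents: list[str], chunk_words: int = 8) -> list[tuple[Counter, str]]:
    store = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i : i + chunk_words])
            store.append((embed(chunk), chunk))  # vector plus original text
    return store

def retrieve_and_prompt(store, query: str, k: int = 2) -> str:
    q = embed(query)
    # Fetch the k chunks most relevant to the query...
    top = sorted(store, key=lambda item: cosine(item[0], q), reverse=True)[:k]
    context = "\n".join(chunk for _, chunk in top)
    # ...and stuff them into the prompt before generating.
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

store = build_store(["The context window of a transformer is its working memory."])
prompt = retrieve_and_prompt(store, "What is the working memory of a transformer?")
```

In a real system the embedding and the final generation would both be model calls, and the vector store would be a proper index, but the data flow is exactly this.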
Here we are enforcing that the output from the LLM will be JSON. 34:57 This will actually guarantee that the output takes on this form, because they go in and mess with the probabilities of 35:03 all the different tokens that come out of the transformer, and they clamp those tokens. The transformer is then only filling in the blanks here, 35:09 and you can enforce additional restrictions on what could go into those blanks. This might be really helpful, and I think 35:15 this kind of constraint sampling is also extremely interesting. I also want to say 35:20 a few words about fine-tuning. It is the case that you can get really far with prompt engineering, but it's also possible to 35:27 think about fine-tuning your models. Now, fine-tuning models means that you are actually going to change the weights of the model. 35:33 It is becoming a lot more accessible to do this in practice, and that's because of a number of techniques that have been 35:39 developed, and have libraries, very recently. For example, parameter-efficient fine-tuning techniques like LoRA 35:46 ensure that you're only training small, sparse pieces of your model. Most of the model is kept clamped at 35:53 the base model, and some pieces of it are allowed to change. This still works pretty well empirically and makes 35:58 it much cheaper to tune only small pieces of your model. It also means that, because most of your model is clamped, 36:05 you can use very low-precision inference for computing those parts, because they are not going to be updated by 36:10 gradient descent, and so that makes everything a lot more efficient as well. In addition, we have a number of open-source, high-quality base models. 36:17 Currently, as I mentioned, I think LLaMA is quite nice, although it is not commercially licensed, I believe, right now. 36:23 Something to keep in mind is that fine-tuning is basically a lot more technically involved.
36:29 It requires a lot more, I think, technical expertise to do right. It requires human data contractors for 36:34 datasets and/or synthetic data pipelines that can be pretty complicated. This will definitely slow down 36:40 your iteration cycle by a lot. I would say, on a high level, SFT is achievable, because you're continuing 36:47 the language modeling task. It's relatively straightforward. But RLHF, I would say, is very much research territory 36:53 and is even much harder to get to work, and so I would probably not advise that someone just tries to roll their own RLHF implementation. 37:00 These things are pretty unstable, very difficult to train, not something that is, I think, very beginner-friendly right now, 37:06 and it's also likely to change pretty rapidly still. 37:11 So I think these are my default recommendations right now. I would break up your task into two major parts. Default recommendations 37:18 Number 1, achieve your top performance, and Number 2, optimize your performance, in that order. 37:23 Number 1, the best performance will currently come from the GPT-4 model. It is the most capable of all, by far. 37:29 Use prompts that are very detailed. They have lots of task context, relevant information, and instructions. 37:36 Think along the lines of what you would tell a task contractor if they can't email you back, but then also keep in mind that a task contractor is a 37:43 human, and they have an inner monologue and they're very clever, etc. LLMs do not possess those qualities. 37:48 So make sure to almost think through the psychology of the LLM and cater prompts to that. 37:54 Retrieve and add any relevant context and information to these prompts. Basically, refer to a lot of 38:01 the prompt engineering techniques. Some of them I've highlighted in the slides above, but this is also a very large space, and I would 38:07 just advise you to look for prompt engineering techniques online. There's a lot to cover there.
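A prompt assembled along the lines of these recommendations (detailed instructions spelled out as for a contractor who can't email back, retrieved context spliced in, and a step-by-step nudge) might be built like this; the helper function and all its fields are illustrative, not a fixed recipe.

```python
# Sketch of assembling a detailed prompt from the recommendations above:
# explicit task, retrieved context, itemized instructions, and the
# "work step by step" conditioning. All names here are illustrative.

def build_prompt(task: str, context: str, instructions: list[str]) -> str:
    steps = "\n".join(f"- {line}" for line in instructions)
    return (
        f"You are a careful assistant.\n"
        f"Task: {task}\n"
        f"Relevant context:\n{context}\n"
        f"Instructions:\n{steps}\n"
        f"Work step by step to be sure we have the right answer."
    )

detailed_prompt = build_prompt(
    task="Summarize the quarterly report",
    context="Revenue grew 12% quarter over quarter.",
    instructions=["Quote numbers exactly", "Keep it under 100 words"],
)
```

The point of doing this in code rather than by hand is that the context slot can be filled by a retrieval step, which is how the pieces above compose.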
38:13 Experiment with few-shot examples. What this refers to is: you don't just want to tell, you want to show, whenever it's possible. 38:19 So give it examples of everything that helps it really understand what you mean, if you can. 38:25 Experiment with tools and plugins to offload tasks that are difficult for LLMs natively, 38:30 and then think about not just a single prompt and answer; think about potential chains and reflection, how you glue 38:36 them together, and how you can potentially make multiple samples and so on. Finally, if you think you've squeezed 38:42 out prompt engineering, which I think you should stick with for a while, look at potentially 38:48 fine-tuning a model to your application, but expect this to be a lot slower and more involved. Then 38:54 there's an expert, fragile research zone here, and I would say that is RLHF, which currently does work a bit 39:00 better than SFT if you can get it to work. But again, this is pretty involved, I would say. And to optimize your costs, 39:06 try to explore lower-capacity models or shorter prompts and so on. 39:12 I also wanted to say a few words about the use cases that I think LLMs are currently well suited for. 39:18 In particular, note that there's a large number of limitations to LLMs today, and so I would keep that 39:24 definitely in mind for all of your applications. This, by the way, could be an entire talk, so I don't have time to cover it in full detail. 39:30 Models may be biased, they may fabricate or hallucinate information, they may have reasoning errors, they may struggle in entire classes of applications, 39:38 they have knowledge cutoffs, so they might not know any information after, say, September 2021. 39:43 They are susceptible to a large range of attacks, which are coming out on Twitter daily, 39:48 including prompt injection, jailbreak attacks, data poisoning attacks, and so on. So my recommendation right now is: 39:54 use LLMs in low-stakes applications.
Combine them always with human oversight. 40:00 Use them as a source of inspiration and suggestions, and think co-pilots instead of completely autonomous agents 40:05 that are just performing a task somewhere. It's just not clear that the models are there right now. 40:11 So I wanted to close by saying that GPT-4 is an amazing artifact. I'm very thankful that it exists, and it's beautiful. 40:18 It has a ton of knowledge across so many areas. It can do math, code, and so on. And in addition, there's this 40:24 thriving ecosystem of everything else that is being built and incorporated into the ecosystem. Some of these things I've talked about, 40:31 and all of this power is accessible at your fingertips. So here's everything that's needed in terms of 40:37 code to ask GPT-4 a question, to prompt it, and get a response. In this case, I asked, 40:44 can you say something to inspire the audience of Microsoft Build 2023? And I just punched this into Python, and verbatim 40:50 GPT-4 said the following. And by the way, I did not know that they 40:55 used this trick in the keynote, so I thought I was being clever, but it is really good at this. 41:02 It says: "Ladies and gentlemen, innovators and trailblazers of Microsoft Build 2023, welcome to the gathering of brilliant 41:08 minds like no other. You are the architects of the future, the visionaries molding the digital realm 41:13 in which humanity thrives. Embrace the limitless possibilities of technologies and let your ideas soar as high as your imagination. 41:20 Together, let's create a more connected, remarkable, and inclusive world for generations to come. Get ready to unleash your creativity, 41:27 canvas the unknown, and turn dreams into reality. Your journey begins today!" """
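The closing point of the talk, that everything needed to ask GPT-4 a question is a few lines of code, can be illustrated with a minimal sketch. The function below only assembles the request payload; the actual client call is left as a comment because the OpenAI Python API surface changes over time, so its exact shape here should be treated as an assumption.

```python
# Minimal sketch of prompting a chat model, as at the end of the talk.
# Only the payload assembly runs here; the commented-out client call is
# an assumption about the OpenAI library and may differ by version.

def build_request(question: str) -> dict:
    """Assemble the chat-completion payload: one user message is all you need."""
    return {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": question}],
    }

request = build_request(
    "Can you say something to inspire the audience of Microsoft Build 2023?"
)

# With the OpenAI client installed and an API key configured, the call is
# roughly (exact client API varies by library version, so verify first):
#   import openai
#   response = openai.ChatCompletion.create(**request)
#   print(response["choices"][0]["message"]["content"])
```

That payload, one model name and one user message, is the whole interface the talk is pointing at; everything else in this section is glue built on top of it.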