280 | François Chollet on Deep Learning and the Meaning of Intelligence

Which is more intelligent, ChatGPT or a 3-year-old? Of course this depends on what we mean by "intelligence." A modern LLM is certainly able to answer all sorts of questions that require knowledge far past the capacity of a 3-year-old, and even to perform synthetic tasks that seem remarkable to many human grown-ups. But is that really intelligence? François Chollet argues that it is not, and that LLMs are not ever going to be truly "intelligent" in the usual sense -- although other approaches to AI might get there.


Support Mindscape on Patreon.

François Chollet received his Diplôme d'Ingénieur from École Nationale Supérieure de Techniques Avancées, Paris. He is currently a Senior Staff Engineer at Google. He has been awarded the Global Swiss AI award for breakthroughs in artificial intelligence. He is the author of Deep Learning with Python, and developer of the Keras software library for neural networks. He is the creator of the ARC (Abstraction and Reasoning Corpus) Challenge.

 

0:00:00.0 Sean Carroll: Hello everyone and welcome to the Mindscape podcast. I'm your host, Sean Carroll. You know that artificial intelligence is in the news. We've talked about AI in various different ways here on the podcast, especially over the last couple years where ChatGPT and other large language models have really become an enormous subject of interest to many people, for financial reasons, for intellectual reasons. They're turning up everywhere, right? Google has put them on the first page of its search results. Lots of people are using large language models to write texts. You can write programs using large language models. You can write the syllabus for your college course, et cetera. It's clear that this technology is going to have an enormous impact on how humans behave and live going forward. But there are subtleties. One of the things that I've talked about is the idea that large language models are amazing because they're able to mimic human speech and behavior, right?

0:01:02.7 SC: They're able to sound enormously human without actually thinking in the same way that human beings do. Large language models, in some sense, memorize lots of things. They know a lot of facts about the world, and they're super good at interpolating between things that they know, and that includes interpolating between kinds of things that have never been interpolated before. So they can seem creative, they can do things that have never been done based on the training data of things that have been done before. They're less good at going outside of the range of that training data. And one can argue that the processes by which they come up with their outputs are very different than what a human being does in actually thinking and reasoning about the problem presented to it. And many people, especially people who are experts in AI, understand this attitude perfectly well. It's certainly not new with me.

0:02:00.3 SC: It's well known to many people, but it is denied by other people who are much more impressed with the progress in large language models and think that we're close to AGI, artificial general intelligence. So I thought it would be fun to talk to someone who is on the front lines of developing deep learning models and AI more generally. So today's guest is Francois Chollet. He's a relatively young guy, but just to give you a sense of his accomplishments, he's a deep learning researcher at Google. One thing he's done is to develop a software package called Keras, K-E-R-A-S, which is a software library that can be used to interface with deep learning techniques. So you could download it onto your computer and play with Keras and develop your own large language model or modify someone else's large language model if you want to.

0:02:57.4 SC: It's become incredibly popular, three million something users at last count. So it's had an impact on the field. Francois is also the author of a book called Deep Learning with Python, and I think there's also a version using R, capital R, the computer language. So you could read that and learn about deep learning yourself. And finally, Francois has thought deeply about what it means to say that something is intelligent. And in particular, he strongly denies that modern large language models are intelligent in the conventional sense. He says that what they've done is they've memorized a bunch of things, effectively, and like we said, can interpolate between them, and that gives them a wonderful ability to score well on many current measures of intelligence that we human beings use on each other. Large language models are good at passing tests, right? The bar exam for law school or whatever, large language models are really good at that.

0:04:00.1 SC: Francois makes the case that this is not because they're intelligent, it's just because they've learned a lot of things. And to make that clear, he wrote an influential paper called On the Measure of Intelligence where he makes the case. He will explain it better than I could, but he makes the case that the whole point of intelligence is to go beyond what you've learned, right? To not merely master a skill, which large language models can do. They can learn whatever the particular subject matter is and spit it back at you. But to sort of extract, abstract I should say, from the data that you learn, skills that you're not being explicitly taught. So, as Francois says, he has a 3-year-old kid who's very good at generalizing from just a few examples to, you know, build things with Legos that he's never seen before in a way that modern LLMs are not able to do.

0:04:55.2 SC: So this proposal from Francois has gone on to become a new competition. The, what is it called? The ARCathon. ARC stands for Abstraction and Reasoning Corpus. And the idea is that rather than using questions from typical IQ tests or standardized exams or whatever, they have developed a set of novel logic puzzles. Okay? If you believe that intelligence has something to do with solving logic puzzles, at least here is a set of logic puzzles that are not already out there in the training data that many LLMs have already had access to. And guess what? A human being can easily do very well on this ARC test that has been developed, 80% success rate, et cetera. Large language models don't do so well. Some of them as low as 0%, but you know, typically 20%, 30%, something like that. Evidence for the fact that whatever they're doing, it's not quite intelligence yet, which is not to say we can't get there.

0:05:56.7 SC: So the point of the ARC competition is to incentivize people to go beyond large language models, to develop AI systems that truly are intelligent. So it's not just a sort of skeptical attitude, it is an attempt to push us in a better direction. So we don't know when and if AI is going to become generally intelligent, we know it's not there now. But maybe it'll get there soon. It depends on how clever we human beings are at developing such things. If you visit the show notes page for this episode of the podcast at preposterousuniverse.com/podcast, we'll give you links to all these things, the paper, the books, the competition and so forth. Okay. Occasional reminder that you can support the Mindscape podcast on Patreon, go to patreon.com/seanmcarroll and kick in a buck or two for every episode of Mindscape. In return you get ad-free versions of the podcast, as well as the ability to ask AMA questions once a month. Very, very worthwhile rewards for such a minor contribution. And with that, let's go. Francois Chollet, welcome to the Mindscape Podcast.

0:07:19.4 Francois Chollet: Thanks for having me.

0:07:20.6 SC: So I've talked to people doing AI before on the podcast, and I have this picture in my mind that I just want you to tell me whether I'm on the right track or not. That back in the day there were these arguments about symbolic approaches to AI versus connectionist approaches. And in the symbolic approaches, you would try to define variables that directly correlated to the world in some way, and then hope that the AI would figure out how they all fit together. Whereas in the connectionist approaches, you just put a bunch of little processors in there hooked up in the right way and hope it learns things. And in the early days, the symbolic approach ruled but didn't get very far. And these days we've had amazing progress with deep learning and large language models that are basically in the connectionist tradition. Is that rough picture approximately correct?

0:08:11.1 FC: On a very long time scale? Yeah, that's approximately correct. The big dichotomy here is actually between having programmers hard-code a model, a symbolic program of the task that they wanna do, versus having a system that can actually learn from data how to perform the task. And symbolic approaches, of course, are much more tractable if you don't have a lot of compute. 'Cause if you only have a very small computer but you have a good brain, you can just figure out the right way to describe a task. And then the computer can perform that task, like playing chess, for instance. However, if you want to make learning work, that's where you need some amount of scale. And as computers got better, machine learning started getting really popular.

0:09:10.3 FC: And machine learning did not actually start getting popular with so-called connectionist approaches initially. So after neural networks, one of the first big breakthroughs of machine learning were SVMs. That's a learning algorithm that can do classification, can do regression. After that, random forests got very popular in the late 2000s, 2010, early 2010s. Then gradient-boosted trees also got very popular. And by the way, random forests and gradient-boosted trees are not neural network based. They're not even curve fitting based. And after that, you had the great rebirth of neural networks with the rise of deep learning. So starting around 2011, 2012, some people started training deep neural nets, specifically deep ConvNets. So convolutional neural networks, which is the kind of neural net that does very well with images.

0:10:19.8 FC: It's basically a kind of neural net that knows how to split an image into small patches and look at each patch separately, then merge the information it has seen. And it progresses like this in a sort of like modular, hierarchical fashion, not too differently from what the visual cortex is doing, by the way. And these new GPU-based ConvNets started winning machine learning competitions. So Dan [0:10:52.6] ____ in 2011 won a couple of minor academic competitions with this technique. Then in 2012 we had the big breakthrough with the ImageNet large-scale image classification challenge being solved with GPU-trained ConvNets. And then in the following years, we had this gradual but very, very fast and sort of like unstoppable rise of Deep Learning.
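Before moving on: to make the ConvNet architecture just described a little more concrete, here is a minimal sketch of a small convolutional network written with Chollet's Keras library. It is purely illustrative and not anything discussed in the episode; the input shape, layer sizes, and 10-way output are arbitrary assumptions for a toy image classifier.

```python
# Minimal ConvNet sketch: convolution layers look at small local patches,
# pooling layers merge neighboring information, and the stack builds up a
# modular, hierarchical view of the image. All sizes here are arbitrary.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 1))                # e.g. 28x28 grayscale images
x = layers.Conv2D(32, 3, activation="relu")(inputs)    # local 3x3 patches
x = layers.MaxPooling2D(2)(x)                          # merge neighboring responses
x = layers.Conv2D(64, 3, activation="relu")(x)         # higher-level features
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)    # 10-way classification

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```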

0:11:21.0 FC: Like every year there were more people doing Deep Learning, and Deep Learning could do more and more things. And one thing that has increased quite dramatically is the scale of these neural nets. So around 2016, 2017, we had the arrival of a new kind of architecture that got very popular, which was the transformer architecture, for sequence processing. Before that, sequence processing was done with recurrent neural networks, specifically the LSTM architecture, which dates back to the early '90s. In fact, it's often the case that neural net research is very much grounded in stuff that's from the '80s and '90s. But, one...

0:12:10.1 SC: Which you make sound like ancient history but I was alive then, so it's not that ancient.

0:12:18.8 FC: Yeah. I feel like most people doing Deep Learning today have actually very little knowledge of anything that came before like 2015, to be honest. And everyone is pretty much using the transformer architecture at this time, which was developed in late 2016 and got public in 2017. And it works really well, and it works for sequence data, but pretty much anything can be treated as sequence data. So it actually works for images, it can work for videos, it can work for pretty much whatever you want. And finally we had the rise of GenAI, so even larger-scale transformers trained on as much data as we can cram into them, trained on the entire internet. In fact they're not just trained on the entire internet, they're trained on the entire internet plus a lot of manually annotated data that's collected specifically for these models. Like currently there are thousands of people who are employed full-time to create training data for these models. And they're not paid very well, usually.

[laughter]

0:13:36.8 SC: I think I read that in something you wrote, and it kind of did take me aback a little bit. So maybe can you elaborate on this?

0:13:38.4 FC: Sure.

0:13:40.2 SC: We'll get back to the architectures and so forth, but so there are people, what are they doing, writing texts for large language models to be trained on? Or are they interacting with the models to correct their mistakes?

0:13:51.8 FC: So typically the process, it's more the second one, they're interacting with the model to correct their mistakes. So they're not necessarily interacting with the model, but basically they are receiving a stream of queries that the model does not seem to be very good at. And they write answers for these queries or they correct an existing generated answer. And so this is called data annotation or sometimes data rating. It can also, by the way, take the form of actual ratings, meaning that you get a choice between multiple generated answers and you pick the best one. And every company out there that's training these foundation models is employing typically several thousands of people just doing this full time. And this is, by the way, this is very much what makes these models useful. It's the fact that not only are they trained to predict the next word across very much all the text you can find on the internet, but they're also trained to sort of like prefill the right answer across millions of different manually annotated queries.

0:15:08.4 SC: So as we are recording this in June 2024, many listeners will be familiar with a set of problems that Google was having, having put forward their AI assistant onto search. And sometimes it would give very bad answers. And I guess the hope was that, like you say, individual human beings could go in there and just stamp out the bad answers one by one. But the prospects for that seem to be a little bit gloomier than originally intended.

0:15:36.8 FC: That's right. And it's one of the big challenges and big limitations of LLMs, that you have to apply these pointwise fixes, which are very labor intensive, and they only address one query at a time. It is virtually impossible to fix a general category of issues at once. And the reason why is because these models, they're basically big curves, they're big differentiable parametric curves that are fit to a data distribution. And so you cannot really input into them symbolic programs, for instance, that would be valid for a very large category of problems. You can only input into them data points, and they will fit these individual data points and they will be able to interpolate across them. So that gives them some amount of generalization power, but not that much. And so if you want the LLM to perform well, the only option you have is that you need to densely sample the space of queries in which it's gonna have to operate.

0:16:47.7 FC: And this is kind of the problem that we saw with the weird Google AI answers, is that they tended to be unusual queries. And of course these models, they don't actually understand the queries you're giving them. They're just mapping the query on the curve. So you can sort of like picture the curve as a surface. It's a manifold, right? So it's like, you can picture it, I guess, in 3D. You can imagine a 2D surface inside a 3D space, like a napkin. And that's exactly what it is, except in a space that has thousands of dimensions. And basically, in that space, different dimensions encode different axes of meaning. And they can sort of like interpolate across data points, but they cannot really model, for instance, a situation described in a query, especially not in quantitative terms, which is why they're not reliable.

0:17:58.9 FC: And my advice in general when people start using foundation models is that they're very good at giving you answers that are directionally accurate, that are a step in the right direction, but they're extremely bad at giving you exactly correct answers. So you should pretty much never ask a foundation model, especially if it's a quantitative problem by the way, to give you an exact answer and then just blindly use that answer. It's typically better to use it as a sort of like stepping stone to get you something that's in the right direction and then you refine it yourself. Or perhaps you could also automate that and add a sort of symbolic search system to automatically refine the answer. Because if you have a symbolic search system and you have some way of telling whether your answer is correct or not, then you can just search across a range of answers and verify them.

0:18:55.6 FC: So use the LLM to provide you with a sort of like initial, smaller search space, and then use a symbolic system to find the exactly correct answer within that space. But never basically blindly trust anything that's written by one of these models.
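Here is a toy illustration of the workflow Chollet just described: treat the model's output as a directionally accurate starting point, then use exact symbolic verification and a small search to land on the truly correct answer. Nothing here comes from the episode; the "LLM" is a stand-in function, and the problem (finding an exact square root) is chosen only because it is trivially verifiable.

```python
# Toy sketch of "LLM proposes, symbolic system verifies": the fake LLM guess
# narrows the search space, and an exact check finds the correct answer.

def fake_llm_guess(n: int) -> int:
    """Stand-in for an LLM: returns a 'directionally accurate' square root."""
    return round(n ** 0.5) + 3           # plausible, but slightly off

def verify(n: int, candidate: int) -> bool:
    """Exact symbolic check: is candidate really the square root of n?"""
    return candidate * candidate == n

def solve_sqrt(n: int) -> int | None:
    guess = fake_llm_guess(n)            # LLM output seeds the search...
    for candidate in range(guess - 10, guess + 11):
        if verify(n, candidate):         # ...symbolic search finds the exact answer
            return candidate
    return None                          # never blindly trust the unverified guess

print(solve_sqrt(1764))                  # -> 42
```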

0:19:11.6 SC: I have learned that myself. I'm sure that you have also, but, so to put it back in the original terms, I'm getting the impression that rather than thinking of things as symbolic versus connectionist, maybe it's more helpful to think of models where the programmer tries to build in a structure versus models where the model learns a structure just from an enormous amount of data.

0:19:34.6 FC: That's right. That's right. And one thing that's interesting here is that in the first case, there's no intelligence involved. The only intelligence in the picture is the intelligence of the programmer that understands the task, understands the problem, models it in their head. And then writes down exact instructions, a description of the task that is so precise that there is no uncertainty left. And when you actually run the program, it'll never have to deal with any kind of novelty, anything that it does not know how to handle, because the programmer did a good job. They anticipated everything, right? Every edge case, everything. And the program you get, people are gonna call it an AI. But it actually has no intelligence. It's just a crystallized static program. The intelligence here is in the mind of the programmer that developed that program.

0:20:37.7 FC: Intelligence is this ability to look at a novel problem, something you've not seen before, and come up with the solution, write the program. Right? And when you look at learning systems, clearly they're capable of learning, they're capable of learning how to solve problems on their own, or almost. So clearly they must have some intelligence. But the most popular methods for doing this today are just curve fitting. And curve fitting, I mean, clearly it's a form of learning. A curve trained with gradient descent has non-zero intelligence, right? Because it turns data into solutions at some rate, according to some sort of like conversion ratio, which is not a very good conversion ratio, by the way. It's extremely data inefficient. But it has very, very low intelligence. For this reason, a system that is very intelligent would not be limited to these sort of like pointwise mappings like LLMs are. Instead, if you wanted to fix an issue in an actual intelligent system, you would just explain to it why the answer it gave was wrong.

0:21:56.0 FC: And then they would automatically apply the patch, apply the fix to the entire underlying category of issues. Instead you have to apply these pointwise fixes. And the reason why is really because curve fitting is extremely data inefficient. It's a very, very low intelligence type of learning.

0:22:18.5 SC: And from those descriptions, well, I'm sure we'll get to this more later in the podcast, but you can see why it would be very hard for either approach to give rise to true creativity. The one where the programmer puts in all the structure is kind of limited in that way; curve fitting is kind of limited once you wanna wander outside where the data already is.

0:22:40.7 FC: Yeah. If you adopt a symbolic approach you are entirely limited by the sort of search space that the programmer hardcoded into the system. You're limited by what the programmer can anticipate and imagine. And if you employ curve fitting, then you're limited to basically the convex hull of the latent space representations of your input data points. So basically you're limited to interpolations between data points in your training data. And you cannot really create anything new, anything that you would not have expected if you had seen everything in the training data. And by the way, this is kind of like the reason why foundation models often give you the impression that they're being creative. It's because you haven't seen everything they've been trained on.

0:23:37.6 SC: Right.

0:23:37.6 FC: It's impossible. They've been trained on so much data, so they can surprise you. But if you had seen everything, they would not surprise you. And so that doesn't mean that creativity is something that cannot be achieved by an algorithm.

0:23:51.8 FC: I think it can be, but you have to employ the right set of methods. I think if you look at the history of computer science, when we saw real invention, real creativity initiated by an algorithm, it's been in cases where you had a very open-ended search process operating over a relatively unconstrained search space. Because if the search space is fairly unconstrained then no human can anticipate everything it contains. And the search process might find really interesting and useful and novel points in that space. So for instance, genetic algorithms if implemented the right way, have the potential of demonstrating true creativity and of inventing new things in a way that LLMs cannot, LLMs cannot invent anything 'cause they're limited to interpolations. A genetic algorithm with the right search space and the right fitness function can actually invent entirely new systems that no human could anticipate. And the...

0:25:05.7 SC: Maybe you should explain to the listeners what a genetic algorithm is.

0:25:09.0 FC: Absolutely. So a genetic algorithm, it's basically a discrete search process. So it's inspired by biological evolution. In biological evolution, individuals have a genome and they pass on half of their genome to their offspring. And this is basically...

0:25:40.1 FC: This is driven by natural selection; in order to have offspring, well, you need to survive, you need to reproduce and so on. And so you end up with individuals that are increasingly good, increasingly fit at surviving and reproducing. And so this sort of criterion of survival and reproduction would be called the fitness function. And you can try to implement a computer version of this, where you have points that are described in some way, that's gonna be the genome, and you're gonna code up some sort of fitness function, a way to evaluate how good a certain genome is, and you're gonna generate a bunch of genomes. You're gonna apply the fitness function, select the best ones, top 10% or something, and then you're gonna modify them.

0:26:41.6 FC: And that could be random mutations, that could be crossover, where you take parts of one genome and cross it over with another, because you're not limited by sexual reproduction. You can actually do whatever you want. You can do a crossover between many individuals, for instance. But you have basically some sort of discrete mechanism for generating new combinations or compositions or mutations from existing individuals, and now you have the next generation, and you apply the fitness function again, the selection again, and you repeat. And assuming that your search space, which is basically the space of possible individuals that can be represented using your genome, assuming that it's fairly unconstrained, you may end up with some really interesting findings. The OG genetic algorithms guys, they came up for instance with a very novel design for an antenna using this technique.
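For readers who want to see the loop Chollet describes in code, here is a minimal sketch of a genetic algorithm: a population of genomes, a fitness function, selection of the top 10%, then crossover and mutation to produce the next generation. The bitstring genome and "count the 1s" fitness function are toy choices for illustration only.

```python
# Minimal genetic algorithm sketch: score a population, keep the fittest 10%,
# and breed the next generation via single-point crossover plus random mutation.
import random

GENOME_LEN, POP_SIZE, GENERATIONS = 30, 100, 50

def fitness(genome):
    return sum(genome)                        # toy objective: count the 1-bits

def crossover(a, b):
    cut = random.randrange(1, GENOME_LEN)     # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.02):
    return [1 - g if random.random() < rate else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    elite = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 10]
    population = [mutate(crossover(random.choice(elite), random.choice(elite)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))    # approaches GENOME_LEN over time
```

The interesting behavior comes from the choice of genome encoding and fitness function: with a rich enough search space, the same loop can turn up designs, like the evolved antenna mentioned above, that no one anticipated.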

0:27:45.8 SC: Okay.

0:27:47.2 FC: And this is the kind of design you could never have obtained with an LLM trained on every antenna design out there, 'cause it's actually novel. [laughter] In order to get novelty, you need search. LLMs cannot perform search, they can only perform interpolation.

0:28:05.0 SC: Good. I did want to, at the risk of scaring some listeners off, I did wanna spend just a few minutes digging into how the LLMs work. The LLMs are the things that have gotten so much exposure, so much attention these days, and maybe this is the wrong place to begin, but I'm trying to wrap my head around this idea of thinking of words as vectors. Assigning values to words and saying that they're near to each other or far from each other in a vector space and taking dot products. Can you explain a little bit about how that works?

0:28:40.3 FC: Sure. So the big idea behind LLMs and behind deep learning in general is that the relationship between things can be described in terms of distance between things, like a literal distance. So you're gonna take things, and things could be pixels or image patches, or they could be words or tokens. So a token, you can think of it as a word, it could be a subword as well. A token is basically a word. And the idea is that you're gonna map your things, so your tokens for instance, into some vector space. So a vector space is basically just a geometric space; points have coordinates, and points are things, like points are tokens. Right? And you're gonna try to organize these points so that the distance between points represents how semantically similar they are. All right? And by the way, this is very similar to Hebbian learning. In Hebbian learning, neurons that fire together wire together...

0:29:54.1 SC: In the real brain.

0:29:56.6 FC: In real brains, exactly, and how tightly wired two neurons are could be interpreted as a distance between them. Right? So, you could say that it's more of a topological distance than an actual geometric distance in this case, but the idea is that if neurons encode concepts, then concepts that tend to co-occur together are gonna end up closer in the network. So closer in terms of some distance function. And it's exactly the same with transformers, actually. So the way transformers work is basically, you map these tokens to points in the vector space, and then you're gonna compute pairwise distances, and these are cosine distances, so basically dot products, between words, between tokens. And you're gonna use that to figure out new coordinates for your points.

0:31:00.6 FC: So incrementally updated coordinates for your points. And you're gonna do that by taking into account the pairwise dot products between tokens in a certain window of text. And what you're effectively doing is that when tokens already have a fairly high dot product between each other, they're gonna be pulled closer together, yeah. So the new token representations for the next layer, they're basically obtained by combining, by interpolating effectively between existing tokens. So the representation for one token is going to become an interpolation between the representations of surrounding tokens. And that's basically weighted by how related to each other they already are, how close to each other they already are in this space.
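A stripped-down numerical sketch of that update, written with numpy, may help: compute pairwise dot products between token vectors, turn them into weights, and replace each token's vector with a weighted interpolation of the vectors around it. This is only the skeleton of the idea; a real transformer layer adds learned query, key, and value projections, multiple attention heads, and much more.

```python
# Simplified self-attention: pairwise dot products -> softmax weights ->
# each token's new representation is an interpolation of the other tokens,
# weighted by how "close" they already are in the embedding space.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simplified_self_attention(tokens):
    # tokens: array of shape (sequence_length, dim), one vector per token
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])  # pairwise dot products
    weights = softmax(scores)        # tokens that are already close pull harder
    return weights @ tokens          # each row = weighted mix of all token vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens with 8-dimensional embeddings
print(simplified_self_attention(x).shape)   # (5, 8): updated coordinates
```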

0:32:00.0 FC: So this basically implements a kind of Hebbian learning. So there is some connection with the way the brain learns. But what you end up with once you've done this across many layers in a very high-dimensional space and across a lot of data, what you end up with is a high-dimensional manifold, which is basically just a surface. As I said, you can think of it as a kind of like 2D napkin in a 3D space.

0:32:30.6 FC: And that's exactly what it is, because you know it must be smooth, and it must be continuous, because it needs to be differentiable, right? It needs to be differentiable because the whole process is trained via gradient descent. Gradient descent is basically the only really scalable way, efficient and scalable way that we have to fit curves like this these days.

0:32:56.7 FC: And on this manifold, your tokens, so your information, are organized in a very semantically rich fashion. And things that are semantically similar are going to be embedded very close together, and different axes, different dimensions along the manifold are going to encode interesting transformations of the data, transformations that are semantically meaningful and so on.

0:33:26.8 FC: And what you end up observing is that the way your tokens are organized on this manifold ends up encoding a bunch of useful semantic programs. So basically, patterns of data transformation that occurred frequently in the training data and that the model found useful to encode in order to better compress the semantic relationships between your tokens.

0:34:00.9 FC: And this compression is insane because you need to cram all of these relationships on this manifold, which has very high dimensionality, so we can cram lots of things into it, but it's still not infinite, right? You still have pretty high constraints. So you actually need to compress things. And 'cause you need to compress things, you're going to find these useful reusable programs that help compress the data, express it in a more concise fashion.

0:34:28.3 FC: And that's really, I think, the most effective way of thinking about LLMs, is that they are big stores of programs, millions of programs. And when I say program, they're not like Python programs or C++ programs, which are symbolic programs. Instead, they are more like vector functions, right? And that means that you can actually interpolate between different programs. So a vector function is basically just a mapping between a subset of vector space and another subset. And it can encode a useful, interesting transformation. For instance, transforming the style of a paragraph from one style to, like, poetry, right?

0:35:21.1 FC: And it's not obvious that there exists a vector space in which you can embed words in such a way that you could define a vector function that does something like this. It seems extremely hard to imagine. And in fact, before LLMs actually showed that it was possible, I don't think many people would have believed it.

0:35:41.9 FC: But it works. And that's really the magic of deep learning, is that you express relationships between things as a distance function in vector space, and you do it at scale and magic starts happening. It turns out that you can fit curves to basically anything if you have a large enough space and enough data.

0:36:04.2 SC: I mean, I'll confess I would have been very surprised if you had told me 20 years ago, that this would happen.

0:36:10.0 FC: Anyone would have been very surprised. I don't think anyone anticipated this.

0:36:12.6 SC: But so, for example, an example that you've used, and I've seen elsewhere, thinking of these tokens as elements of a vector space, you can have equations like king minus man plus woman equals queen.

0:36:24.9 FC: Yeah. So that's an example from Word2vec. And Word2vec is only distantly related to LLMs. But I think Word2vec is sort of like a miniature world of the sort of phenomena that you see in LLMs. And in particular, I think Word2vec is good for illustrating what a semantically meaningful vector function is. So in this case, you have words represented as points in the vector space, and you can actually add a certain vector to any point to get a new point, which is a new word, of course, because a point equals a word. And adding this vector will consistently transform your words in one way. For instance, making a word plural or going from a male word to a female word, that sort of thing.
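To show the mechanics of that analogy arithmetic, here is a toy sketch. The vectors below are made up for illustration, not real Word2vec embeddings; the point is only that you add and subtract word vectors and then look for the nearest remaining word by cosine similarity.

```python
# Toy word-vector analogy: king - man + woman should land nearest to "queen".
# The 3-D embeddings are invented for this example; real Word2vec vectors are
# learned from data and have hundreds of dimensions.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "boy":   np.array([0.2, 0.85, 0.15]),
    "girl":  np.array([0.2, 0.15, 0.85]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max((w for w in embeddings if w not in ("king", "man", "woman")),
           key=lambda w: cosine(embeddings[w], target))
print(best)   # -> "queen" with these toy vectors
```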

0:37:21.2 SC: And you can see how, once that starts to work, it's almost as if some understanding is creeping into the model, or at least the appearance of understanding.

0:37:34.2 FC: That's right. So yeah, I guess it kind of depends how you want to define understanding. But what's going on is that having to organize tokens in a constrained space like this kind of forces you to arrange them in such a way that different dimensions in your space start representing transformations that enable compression of your space. You know what I mean?

0:38:10.9 FC: And you see that at scale with LLMs. And because LLMs are extremely nonlinear, the vector transformations that you're going to be looking at are much more complex, much more powerful than just adding vectors; they can be completely arbitrary, actually, completely nonlinear. And LLMs, they're collections of millions of very useful vector programs like this that enable a more concise representation of this token space. And when you're prompting an LLM with some query, what it's basically doing... What a human would do is try to understand the words and sort of like picture them in their mind. Basically create a sort of like model for what's being said.

0:39:00.8 FC: Then you can maybe run some simulation in this model and so on. So basically you have this understanding of what is being described and what is being asked. And what the LLM actually does is that it'll fetch, from its collection of programs, the program it has memorized, or maybe an interpolation across different programs it has memorized. And by the way, LLMs are actually pretty bad at compositionality; they're bad at composing different programs, as opposed to interpolating between programs. You cannot actually chain many programs, [chuckle] like this with LLMs; you are pretty much limited to patterns that have been exactly memorized by the model in its training data. So it's fetching a program and it's reapplying the program to the input you are giving to the model, and when it works, it works.

0:39:51.8 FC: So for anything that the model is familiar with, something that it has seen dozens of times in its training data, it works, right? So it's great. And because it's seen so much data, there are millions of possible queries where it'll give you exactly what you want. So it can be tremendously useful. But anything that is more unfamiliar, it'll not be able to make sense of it; it'll fetch a program, apply it, and it's gonna give you the wrong results. And for the LLM, there is absolutely no way of telling, because it's doing the exact same thing in either case.

0:40:26.9 SC: Right.

0:40:27.7 FC: There's no difference for the LLM between generating something that's correct versus generating something that's completely off. And so, unfamiliarity is one way to trip up LLMs. LLMs really can only give you the right answer for something they've seen before, which is why data annotation, manual data annotation, is so important. But it's not the only failure case of LLMs.

0:40:57.3 FC: You find also the opposite failure case, where when you have a model that is too familiar with a certain pattern, it will be unable to deviate from it. And a common example is, for instance, the logic puzzle, what's heavier, like one kilogram of steel or one kilogram of feathers, for instance. And this is a logic puzzle that occurs tens of thousands of times on the internet. And for this reason, with the early LLMs, like for instance the original ChatGPT-4, if you asked it what's heavier, like one kilogram of steel or two kilograms of feathers, it would be, "Oh, they weigh the same. I know the answer. They weigh the same."

0:41:48.3 FC: So it's not actually trying to read and understand the query. It's just fetching the pattern and reapplying it. And so this has been fixed since, of course. But the way they fixed it, again, it's these pointwise patches, they just explicitly teach the LLM about this new pattern for solving this particular kind of query. And if you teach the LLM the right way, then it will start paying attention to the numbers you're providing.

0:42:17.3 FC: So that's one example. There are many other such examples. And even today, if you take any of these LLMs, like Gemini or GPT-4 or whatever, you can find common logic puzzles like this, where if you provide a small variation, the LLM will break down. Basically, anything that has not been patched by hand will still fail today.

0:42:41.1 FC: And in general, this is also the reason why LLMs are very sensitive to the way you phrase things. They are very brittle in that way. And this is kind of what gave rise to the concept of prompt engineering. So prompt engineering is the idea that if you just ask your query the right way, like there is a right way and there is a wrong way, if you just ask the right way, you get the right result.

0:43:12.4 FC: Another way to interpret it is, any time you find a query where you're getting the right answer, it is most likely possible to modify the query a little bit in a way that would be totally transparent to a human, like it would make total sense to a human, but it will cause the LLM to start failing. And this is true for virtually any query. You can always rephrase in a way that doesn't actually change the query, but it will make the LLM fail. And specifically, the way you find these variations, you just try to make the query slightly more unfamiliar or unexpected compared to what's on the web.

0:43:49.0 SC: So let me see if I understand, because you mentioned before the idea of the convex hull, so you and I know what that means. But the listeners out there should envision a set of points, and we're saying that not only, I think what's being said is that not only can the LLMs or deep learning models interpolate along the set of points, but also sort of the interior that is defined by that set. So if I ask it for a Shakespearean Sonnet that explains spontaneous symmetry breaking in particle physics, maybe no one has ever written such a thing before, but it knows a lot about Shakespearean Sonnets. It knows a lot about particle physics and the vocabulary words so it can sort of interpolate its way into giving you a good example.

0:44:34.8 FC: Yeah, that's right. So for instance, you could ask an LLM to talk like a pirate, but you could also ask it to talk like Shakespeare. And because these transformations are vector programs, you can actually merge them. You can average them, you can interpolate between them. And that means it can start talking like a Shakespearean pirate, for instance.

0:44:58.9 SC: Right. [chuckle]

0:45:00.2 FC: And that works, which is something that you cannot do with an explicit logic program, by the way.

0:45:06.8 SC: Good. Okay. So then the... I guess the question is, does the way that the LLM succeeds at sounding so reasonable and smart happen through implicitly making an accurate symbolic model of the world? Or is it just a set of correspondences between the frequencies of words? Or are those secretly the same thing?

0:45:35.0 FC: So it's significantly more complex. The correct answer is basically somewhere in between. In an LLM, you will not find a symbolic model of the world, but you will find a model of word space, a model of semantic space. And that model has some overlap with the world model that you may have, for instance. But they're different in nature. And the model that LLMs are working with is just not nearly as generalizable as the one you have in general. And any sort of symbolic model that enables simulation is gonna be able to generalize much further away from what it has seen before, because it does not just know about specific situations. It knows about the rules that generated these situations. So it can imagine completely novel situations. The LLM, meanwhile, it's more of a case that it knows about specific situations and can also sort of like average, interpolate across situations, right?

0:46:41.0 SC: Right.

0:46:41.6 FC: But it cannot really move outside of these interpolations and imagine something new, which would only be possible if it knew about the rules generating these situations. And of course, the best way to really develop an intuition about what LLMs do is to extensively play with them in an adversarial fashion, like try to make them fail, try to start developing a feel for what makes them fail. And many people actually never try that. They just stick as much as possible to things that work. And whenever they find something doesn't work, they blame themselves. They're like, oh, I used the wrong prompt.

0:47:27.8 SC: Prompt yeah.

0:47:28.1 FC: And as a result, they tend to have this bias where they're like, hey, LLMs understand everything I'm saying. But of course this is not quite true. It's very difficult to develop correct intuitions about LLMs because they are so counterintuitive due to their sheer scale. Like they have seen, they have memorized more text than you will read in your entire life, by like four orders of magnitude.

0:47:57.2 SC: Yeah. [chuckle]

0:47:58.1 FC: It's kind of hard to imagine that. Yeah.

0:48:03.6 SC: Okay. So are they intelligent?

0:48:08.2 FC: Not really, but they have non-zero intelligence. The way we define intelligence... Most people define intelligence in terms of skills. They're like, if it can do X, Y, Z, it is intelligent, and I'm like, yeah, not quite. This is skill, and being skilled at many things is useful. Obviously it's valuable. So LLMs are valuable in that sense. But when you talk about general intelligence, what makes it general is not the fact that you have many X, Y, Z, right? That it scales to many tasks. It's the fact that it should be able to scale to an arbitrary task. Like you can come up with a new task and teach it to your model. If you cannot do that, then the model is not intelligent.

0:49:01.0 FC: So intelligence, according to me, is the ability to pick up new skills, to adapt to new situations, to things you've not seen before. So for instance, going back to this idea of Symbolic AI, Symbolic AI cannot adapt. It's a static program that does one thing. It cannot adapt to any novelty. It cannot learn anything. It has zero intelligence, like a chess engine has zero intelligence, right? And if you do curve fitting, well, if you just fit your curve and then you have your static curve and you do static inference with it, you also cannot adapt to any sort of novelty. You can only be skillful when you are within your data distribution, your training data distribution, because the curve is static, and this is how deep learning works today. You fit a curve, then it's frozen and you do inference with it. And such a system, again, has no intelligence.

0:50:04.8 FC: And lots of people talk about, oh, LLMs can do in-context learning, but that's actually a total misconception. LLMs do no learning. What they can do is that, given a new problem that is slightly novel but still very similar to something they've seen before, they can fetch the correct program or interpolate across different programs that they've learned and solve this slightly new task. But that's not learning, that's actually fetching. It's not fetching of an answer, it's fetching of the rule set. So it's sort of like one level higher, which is why it kinda seems like learning, but it's not actual learning. So that said, you can actually do active inference with an LLM, you can actually make the LLM learn, genuinely learn new things, and you do so by actually adjusting the curve to new [0:50:54.5] ____. And well, when you do that, the main issue you run into is that curve fitting is very data inefficient, even fine-tuning, doing something like [0:51:05.0] ____, compared to what humans can do. Humans can actually pick up a new task from like a couple of demonstration examples. Like, I have a 3-year-old at home, and it's always fascinating just how quickly he can pick up very new skills. Like a climbing wall, for instance.

0:51:33.0 FC: Or building a car out of Legos. He's seen like five different Lego cars in his life, but he can just imagine his own Lego cars and build them from the pieces he has available. There's no AI system today that can do anything close to this. Right. And it's not like he can do it because he's seen tens of thousands of Lego cars and tens of thousands of other Lego constructions and he has access to unlimited Lego pieces. No, he's seen a handful. Like he's assembled a total of probably fewer than 1,000 Lego bricks in his entire life. But he can actually create new things, really complex new things. So LLMs can definitely not do that.

0:52:20.7 FC: So they have non-zero intelligence, because they can actually adapt to some amount of novelty. They can generalize beyond the exact training data points they've seen, which is what makes them useful. But they can only generalize close to what they've already seen. If you go a little bit too far away, they break down. And they can learn, they can actually do active inference, but in a way that's extremely data inefficient.

0:52:45.8 FC: So they have non-zero intelligence, but it's extremely low. It's definitely not comparable to, like, the intelligence of a three-year-old. My three-year-old is like vastly more efficient than any LLM out there. It just doesn't compare. And I feel sometimes a deep disconnect with some folks in the AI community that claim that, hey, LLMs today, they're like high schooler level. This is absurd. Have they even met a human being before? Have they ever interacted with an LLM before? Like these are completely absurd claims. Anyway.

0:53:23.3 SC: But they're good at certain kinds of test taking which is what makes people think well that's how we measure intelligence.

0:53:29.7 FC: That's right. And this is one of the cognitive fallacies around LLMs is that the school system loves to test humans on memorization problems right? Like school is mostly about memorization. You typically don't even learn rules. You learn factoids. You learn point to point matchings, right? And LLMs are vastly superhuman at memorization.

0:53:58.8 FC: They are memorization machines. They're very, very low intelligence, very, very low generalization power, but extremely high memorization. And when it comes to showing skill at something familiar, then you can always trade off intelligence for memorization.

0:54:18.7 FC: Like let's say for instance you're giving your students a physics exam. And the concepts are pretty challenging. Many students probably haven't fully understood them but what some students could do is just cram a lot of past exams right?

0:54:38.4 FC: And they may not really understand everything, but for each problem they will memorize the pattern. And if you just give them the same problem with different numbers, they just fetch the pattern required, which is exactly what LLMs do, right? And these students, they can end up scoring very high despite having no understanding of the underlying concepts. And this is true for human beings, who have a limited memory and a limited amount of time to study.

0:55:08.6 FC: So they can only memorize like 10 exams or something, but what if you have an LLM that can actually memorize 10,000 exams? It can end up showing very strong... The appearance of skill, the appearance of understanding, with no actual understanding of the concepts. And how do you tell that this is not true understanding?

0:55:30.2 FC: 'Cause after all, it can do your exam. And your exam is what you're using to judge your students. So how do you tell? Well, the way to tell is that instead of just giving your students or the LLM a problem that's derivative, that's just similar to something that you've given before, you come up with something novel. So something that's never been asked before. And in order to approach this, you actually need to think from first principles. You actually need to understand the underlying physics concepts, right?

0:56:04.3 FC: And if you give that to your students that don't understand the material but have studied a lot, they will fail. They will score zero, right? The LLM will score zero as well. But then the smart student from the back of the class, who understood everything but just doesn't care to actually memorize anything, they will do extremely well, because they're smart.

0:56:29.5 SC: But as a professor this sounds like hell if I need to come up with novel problems every single time.

0:56:35.1 FC: If you are looking to test understanding and intelligence, then yes, you do.

0:56:41.9 SC: Yeah gotta do it.

0:56:43.7 FC: If, on the other hand, you're fine with just memorization, then you don't. And the school system as a whole is fine with memorization. And sometimes it's because memorization is the goal, but a lot of the time it's out of laziness. It's using memorization as a proxy for understanding, but memorization is not a good proxy for understanding. Because you can always memorize your way into a high score with no understanding.

0:57:14.5 SC: No argument from me there. It's absolutely true.

0:57:17.6 FC: Yeah. And by the way, just to continue this topic a little bit.

0:57:23.4 SC: Sure.

0:57:24.8 FC: On this idea that if you want to test actual intelligence, you need problems that are novel, problems where the test-taking system or human being cannot have memorized the solution. Right. And I actually released a benchmark of machine intelligence a few years back, in 2019, that's all about this idea. So it's called ARC, ARC-AGI in the long form. So it's the Abstraction and Reasoning Corpus for Artificial General Intelligence.

0:58:08.5 FC: And the idea is that, well, deep learning does really well by just memorizing data points, but it has very low generalization power. How can you tell that something actually has intelligence? Well, you come up with puzzles that are all unique, all original, never seen before, not similar to anything you would find on the internet. So not really similar to existing IQ test puzzles, for instance. And so ARC is basically a collection of such puzzles, and there are public ones, but there are also private ones, which are not more difficult than the public ones. But they're hidden.

0:58:49.5 FC: And this is extremely important of course, because if they were public, then you could just train a model on them, right? And then it'll mean nothing anymore. And as it turns out, deep learning methods and LLMs in particular have scored very poorly on ARC. So we ran a competition on the website Kaggle in 2020 on ARC. And this was back when GPT-3 was available. GPT-3 got released around the same time as we ran the competition. And so people tried GPT-3, and it scored zero, right? And the methods that actually worked were discrete program search methods. So not curve fitting. Curve fitting just doesn't work very well for this type of puzzle. In general, curve fitting works very poorly to handle any kind of novelty. And so later, we also ran two years of a new edition of the competition, which was called the ARCathon. And it remains extremely challenging. It kind of looks like an IQ test, and it's very easy for humans to do, but it's extremely difficult for AI to do.

1:00:00.2 FC: And in particular, it's very difficult for LLMs. And we're actually about to launch a reboot of the competition on a larger scale. So we are relaunching on Kaggle again. We're back on Kaggle after four years. And we're gonna have over $1 million in prizes. And the goal is to solve ARC to pretty much human level. So something like 85%. And because we know LLMs just don't do very well on ARC, the goal here is really to incentivize people to come up with new ideas, to look at these tasks, recognize just how easy it is for them to solve them, and how difficult it is for ChatGPT, for instance, to solve them, and try to nudge people into asking themselves...

1:00:58.7 FC: So what's going on here? Like, why can I do this and the machine cannot? And try to come up with new ideas, like, try to come up with ideas they would not have pursued otherwise if they stayed under the impression that LLMs can do anything, that all they need is enough data. That's definitely not true. Like, even after ingesting every IQ test in the world, they still cannot do ARC, even though ARC looks exactly like an IQ test. And fundamentally, the reason why is because each puzzle in ARC is new. It's something that you cannot have memorized before. It was created for ARC. And LLMs have basically no ability to adapt to novelty in this way. And if you want to solve ARC, if you want the million dollars, you're gonna have to come up with something original, something that's gonna be on the path to AGI, as opposed to LLMs, which are more of an off-ramp on the way to AGI.

1:01:54.2 SC: And sorry, just as a tiny technical detail. So when one enters the competition, you, Francois, do not tell their LLMs the questions. They have to sort of let you give the questions without letting the people who wrote the LLMs know what the questions were.

1:02:12.2 FC: Right. So the way it works is that you submit a program in the form of a notebook, and you have access to some amount of compute, which is 12 hours with one P100 GPU and one multi-core CPU. And within 12 hours you need to solve 100 hidden tasks. And so you're just submitting the program, so you are never directly seeing the hidden tasks. It's only your program that you've uploaded that's gonna see them. And then what you get out of that is a score. How many tasks did your program solve? And then you have to iterate and come up with a better program.
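As a rough illustration of what such a submission looks like structurally: public ARC tasks are distributed as JSON files containing "train" demonstration pairs and "test" inputs, each an integer grid, and the submitted program has to emit a predicted output grid for every test input. The directory name below is hypothetical, and the "solver" is a deliberately trivial baseline that just echoes the input; a real entry would replace it with some form of program synthesis or search.

```python
# Skeleton of an ARC-style solver: load each task's JSON, look at the "train"
# demonstration pairs, and produce a predicted grid for every "test" input.
# The identity baseline below is a placeholder, not a real solution strategy.
import glob
import json

def solve_task(task):
    # task["train"]: list of {"input": grid, "output": grid} demonstrations
    # task["test"]:  list of {"input": grid} whose outputs must be predicted
    return [example["input"] for example in task["test"]]   # identity baseline

predictions = {}
for path in glob.glob("arc_tasks/*.json"):   # hypothetical local task directory
    with open(path) as f:
        task = json.load(f)
    predictions[path] = solve_task(task)
```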

1:02:52.9 SC: How large are these programs?

1:02:56.7 FC: Well, we'll see. But they're computationally constrained, as I mentioned. They can only run for 12 hours and they only have access to one GPU. So we'll see.

[chuckle]

1:03:07.5 SC: But I mean, just as a complete outsider, when I think of an LLM, I kind of, since I don't have an LLM, I think of it as, it must have a huge amount of data that it needs to call up to answer these queries. Is that part of what they're sending you, the whole sort of compressed data set, or is it just the weights of different neurons?

1:03:28.3 FC: So if you do want to use LLMs in the competition, the way you would do it is you would make your pre-trained LLM part of your program. So before submitting your program, you would fine-tune your LLM on ARC data. And by the way, so you're not gonna be able to use an LLM API like the ChatGPT API for instance because that would require... Obviously, it will require kind of showing this third party service the hidden tasks, which is...

1:04:00.8 SC: I mean, again, for the non-experts, that means that your competitors are not allowed to call out to the outside world.

1:04:05.7 FC: No, they don't. Exactly. You actually don't have internet access at all. So anything the program needs access to must be part of the program. So if you want to use an LLM it has to be an open source LLM, and you include it in your program. So beforehand you would fine-tune it on ARC-like data, presumably. And then you would actually use it as part of your program. And so of course it cannot be an LLM that's too large, because you just have one P100 GPU. That said, if you're using float16, that's enough for models that have maybe 8 billion parameters, which is actually pretty good.

1:04:47.3 SC: Okay. And going along with this claim that the LLMs are not really intelligent, I've seen related claims probably from your Twitter account that they can't reason and they can't plan either. Are these... Is that a correct characterization?

1:05:04.0 FC: Yeah, that's correct. And I could talk about it a little bit, but really I think what you want is more than just a vague summary. If you want precise scientific references, I can send you some. So actually let me pull that up. There's this professor from Arizona State University who has a really good...

1:05:34.3 SC: We can put up links once we publish the episode on preposterousuniverse.com so people can get linked to it.

1:05:42.2 FC: Is there any way to [laughter] send you links in here?

1:05:48.0 SC: There's a chat on the right. Oh, yeah. You can just respond to that. Perfect. Yeah.

1:05:52.8 FC: So you can check out this YouTube video. And the guy also has a bunch of papers; really, I could send you a reading list if you want. But if you actually rigorously investigate the ability of LLMs to plan and reason, you find that no, they cannot plan or reason. What they can do is memorize patterns, memorize programs, and they can reapply them. And as long as you are looking at a familiar task where the program is applicable, they will be able to show the appearance of reasoning by fetching the program and deploying it. But that's kind of different from actual planning and reasoning. And the way you can tell it's different is that if you modify the task a little bit so that the existing program is no longer applicable, the LLM will fail.

1:06:46.8 SC: Right.

1:06:46.9 FC: And intelligence would really be the ability to adapt to these changes. So instead of fetching a program, an interpolated program, it would be the ability to synthesize on the fly a new, correct program that matches your novel problem. If you have that, and you can synthesize this program efficiently from just a few examples, then you have AGI, then you have general intelligence. And if you have this ability, you should also be able to solve ARC, by the way, because this is what ARC is all about. For each puzzle, you get a couple of demonstration examples, and then you get a test example. And if you were able to synthesize on the fly a correct program that matches the demonstration examples, then you would be done. LLMs fail at that because all they can do is fetch.

1:07:45.9 FC: And of course, each puzzle is something they've never seen before, right? And, you know, I feel like people who claim that LLMs can reason are really stuck at this first stage where they see examples of something that looks like reasoning, and they don't try to investigate it. They're like, oh, it's working; this would be impossible if the LLM was not reasoning, right? But actually what it's doing is just fetching a program, and that's just memory. The LLM is a program database; that's it. It's an interpolative program database. Intelligence is not being an interpolative program database. Intelligence is being the programmer, having the ability to look at something new and come up with a new program to address it.
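
The distinction between fetching and synthesizing can be illustrated with a toy program-synthesis loop: define a small domain-specific language of grid operations and search for a composition that reproduces every demonstration pair. This is only a sketch of the idea; the primitives and the brute-force search are illustrative assumptions, not how any actual ARC solver or LLM works.

from itertools import product

# A tiny DSL of grid-to-grid primitives.
PRIMITIVES = {
    "identity": lambda g: g,
    "flip_rows": lambda g: g[::-1],
    "flip_cols": lambda g: [row[::-1] for row in g],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def synthesize(train_pairs, max_depth=2):
    # Search over compositions of primitives for one that maps every
    # demonstration input to its demonstration output.
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(grid, names=names):
                for name in names:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(program(inp) == out for inp, out in train_pairs):
                return names, program
    return None, None

# One demonstration pair: the output is the input with its columns flipped.
demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
names, program = synthesize(demos)
print(names)                       # ('flip_cols',)
print(program([[5, 6], [7, 8]]))   # the synthesized program applied to a new grid

A fetch-based system can only succeed if something like the needed program is already stored; a synthesizer builds it on the spot from the few examples given, which is the ability being pointed at here.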

1:08:33.9 SC: Well, you just hinted at this a little bit, but I am certainly hearing a lot of people who are nominally experts in the field make noises about artificial general intelligence and how close we are to it if we're not already there.

1:08:49.5 FC: Yeah. I mean, the claim that we're already there, or that LLMs are, like, high-schooler-level intelligent, that's kind of absurd. I can't even fathom how anyone can make such claims. It makes zero sense to me; I don't even understand how you can be so deluded as to claim that. But if you want to ask seriously, under my definition of intelligence, which is obviously correct, like, my opinions are obviously correct, right?

1:09:23.1 SC: Of course. That's why you're on the podcast.

1:09:25.6 FC: No, but if you want to ask when AGI is coming, it's very difficult to answer, because the situation we're in is that we have no technology today that is on the path to AGI. There is nothing that, if you just scale it, gives you intelligence. Right?

1:09:44.2 SC: Right.

1:09:45.6 FC: But that said, that does not necessarily mean that AGI is very, very far away. Rather, what it means is that you cannot predict when it will arrive, because you need to invent something new. But maybe we'll invent it next year; maybe the ARC competition will actually trigger someone into inventing it, you know? So maybe it arrives next year. It's possible, but it's unpredictable, because it doesn't exist yet. And the claims that people are making are basically founded on the idea that LLMs are on the path to AGI, and that you can predict how their intelligence will scale with compute and data. And the idea is that, well, GPT-3 was like middle-schooler level, GPT-4 is like high-schooler level, GPT-5 is gonna be like postdoc level, [laughter] GPT-6 is gonna be a super genius, and so on. And, I mean, none of it makes any sense, even with a very loose definition of intelligence.

1:11:07.0 SC: And do we understand what is going on inside the large language models? I mean, how much of a black box are they? Or are we still kind of doing the science needed to figure out what is inside the box?

1:11:23.9 FC: We are still in the process of figuring out how to interpret what they are doing, but there's already a lot of work that has been done along the lines of interpreting how LLMs work and visualizing what they're doing. There was a paper from Anthropic a few days or weeks ago that was actually really insightful on that topic.

1:11:45.7 SC: Okay, so that's not an intractable problem; we will get there.

1:11:47.6 FC: No, it's not intractable. It's an active area of research, and we are making progress.

1:11:53.5 SC: Okay.

1:11:53.6 FC: And by the way, every time we get new results, they are along the lines of showing that LLMs are actually just pattern-matching engines. They are not intelligent; they are interpolative databases of programs. Again, the big difference between intelligence and a program database is that the program database is like GitHub, and intelligence is like the programmer. The programmer individually knows dramatically less than what's in the database, but the database cannot adapt.

1:12:29.7 FC: It's only that fixed set of programs. You can maybe recombine some programs, but you have limited ability to recombine them. The programmer can actually invent anything, adapt to anything, because they have general intelligence, right? That's really the difference. And people are like, yeah, so if we just scale GitHub to a thousand times more programs, then it's going to be AGI. But no, it's just a bigger GitHub, a more general GitHub. It is still not a programmer. There is no amount of stored, memorized programs at which you suddenly develop the ability to synthesize your own programs on the fly. It's just not how it works.

1:13:13.1 FC: If it worked this way, we'd already know, because we've already scaled LLMs to literally all the training data that's available out there, which, by the way, is the reason why LLMs have hit a plateau since last year. It's because we've been running out of data. And sure, you can scale compute. You can always keep scaling compute, but it's becoming useless, because the curve needs to be fit to something. The curve is literally just a representation of a training dataset. If you've run out of data, then how do you improve the model? Well, one way is that you can try to better curate your training data. So you don't increase the scale of the training data, but you increase its quality. That's actually one very promising way of improving LLMs. It's actually the way LLMs keep improving today.

1:14:00.8 FC: We've already run out of data. So the next stage is that we better curate the data. We're not training the LLMs on more data, we're actually curating it.

1:14:06.1 FC: Technically, we're still collecting new data from human raters. So there's a bit of an increase, but on balance, it's actually decreasing. But you're not gonna magically find a thousand times more novel, non-redundant data to train these models on. It just doesn't exist. You're not even going to find 2x. And that's the cause of the plateau we've been seeing.

1:14:41.1 FC: And something like GPT-5 is going to be released probably at the end of the year. It's going to be a big disappointment because it's not going to be meaningfully better than GPT-4.
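
As one concrete, if simplistic, example of what curation can mean in practice, here is a sketch of exact-duplicate removal from a corpus. Real curation pipelines are far more involved (near-duplicate detection, quality filtering, filtering by source, and so on); the function below is just an illustration of the quality-over-quantity idea.

import hashlib

def dedupe(documents):
    # Drop exact duplicates by hashing whitespace- and case-normalized text.
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

print(dedupe(["Hello   world", "hello world", "Something new"]))  # keeps two documents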

1:14:53.7 SC: It occurs to me slightly belatedly that we should tell people what GitHub is, 'cause not all of them will know.

1:14:58.9 FC: Right. It's basically just a website that's a collection of many open source programs put there by organizations, by programmers across the world.

1:15:13.4 SC: So that's your analogy for what current generations of large language models are. What we want, in some sense, is something that is more truly creative and has the ability to go outside its training data, to extrapolate.

1:15:25.6 FC: Yeah, that's right. And even if you take a first-year CS student, their knowledge is extremely limited. They know so little; they've seen so few real-world programs. And yet they have a much higher ability to write programs appropriate for a novel problem than a system that has seen every open-source program out there but has very little intelligence.

1:16:00.3 SC: Okay. Very good. So I'm 100% on your side here. I've tried to convince people that the amazing thing about LLMs is how well they can mimic sounding like human intelligence rather than thinking in the same way that human beings do.

1:16:14.7 FC: But I think that's actually quite intuitive, because you also see it in humans. You see in humans that there is this trade-off between memorization and intelligence, and that with enough memorization you can actually reproduce the same outcomes as intelligence. And the way you can tell apart someone who is operating based on memorization from someone who is actually intelligent and operating based on understanding is by presenting them with something new. So it's true for human beings as well. And the reason why our intuitions are off with LLMs is that the scale of memorization is unlike anything that's possible for a human.

1:16:57.0 SC: Well, and maybe also, to give some credit to the other side, maybe more of the problems we're interested in than we think are solvable by memorizing lots of things rather than by thinking originally and creatively.

1:17:12.9 FC: Sure. I mean, memorization is precisely what makes LLMs useful: they've stored lots of patterns for how to perform certain actions and solve certain problems, and they can fetch these solutions and reapply them. And you may not know about these solutions, so they may actually teach you something new.

1:17:34.0 SC: Well, can an LLM or could AI in some broader sense be functioning as a creative scientist?

1:17:44.3 FC: Not an LLM, at least not an LLM in isolation. To actually make these systems capable of invention, capable of developing new theories and so on, well, one option is to have a human in the loop. The human is actually in charge of the intelligence bit; the LLM is in charge of the memory. So you use the LLM as a sort of extension of your own memory, sort of like a brain add-on. That's one way you could create a super scientist, by supercharging an existing scientist with access to all this memorized content. And by the way, I'm not convinced this is actually super effective, [chuckle] to be honest. What we have seen is that LLMs are very good at turning people who have no skill into people who are capable of an average, mediocre outcome.

1:18:39.6 SC: Right.

1:18:41.5 FC: They are extremely bad at helping someone who's already extremely good get better. It basically doesn't work. There are many reasons why, but empirically this is what you see. So this is why I don't think LLMs are gonna have much impact in science. Science is not about producing more mediocre papers; it's about the top ones, and that's what's actually connected to progress. And the other way you could try to make these systems capable of novel discoveries is to add a search component. Like, we talked about genetic algorithms as a way to mine a search space and find unexpected points, unexpected inventions, in it. I think you may be able to create sort of hybrid LLM-plus-symbolic-search systems that would be capable of invention.
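
To give a rough sense of what the search half of such a hybrid could look like, here is a toy genetic search loop. Everything in it is an illustrative assumption: the target string stands in for an unknown good point in a search space, the fitness function stands in for the expensive evaluation step, and random mutation stands in for the proposal step that an LLM or other heuristic might supply in a real hybrid system.

import random

TARGET = "novel idea"   # a stand-in for an unknown good point in the search space
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def fitness(candidate):
    # Count matching characters; in a real system this evaluation is the expensive part.
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate, rate=0.1):
    # Blind random mutation; this is the step an LLM-guided proposer could replace.
    return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in candidate)

population = ["".join(random.choice(ALPHABET) for _ in TARGET) for _ in range(50)]
for generation in range(500):
    population.sort(key=fitness, reverse=True)
    best = population[0]
    if best == TARGET:
        break
    parents = population[:10]   # keep the fittest candidates (elitism)
    population = parents + [mutate(random.choice(parents)) for _ in range(40)]

print(generation, best)

The point of the hybrid idea is to replace the blind mutate step with informed proposals, so that far fewer expensive evaluations are wasted on unpromising candidates.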

1:19:40.8 SC: I definitely notice when I ask physics questions of LLMs, if it's a fairly straightforward question, they're pretty good, but as soon as it becomes subtle, they're no longer good. I mean, it's exactly in the places where you don't get a lot of coverage out there in the training data that they can't figure it out. And as you say, why would we ever expect them to?

1:20:02.3 FC: Yeah. If they've seen many instances of the problem you're asking, they have memorized the solution template. And they can just fetch that solution template, reapply it to give you the right answer.

1:20:15.9 SC: Right.

1:20:17.3 FC: If it's something that's slightly different, or similar but with maybe one word changed that actually changes the meaning, something like that, they will still fetch pretty much the same template, but now it's gonna be wrong. And they have no way of telling. They don't actually understand the words that you're putting in. They don't understand your query. They're just directly mapping it to the solution that they think they know.

1:20:42.5 SC: So there's this famous thing where you ask an LLM a question, it gives you the wrong answer, and then you say, "No, that sounds wrong," and it corrects itself. Is that because it actually is correcting itself or is it just trying another possible answer from its storage of possibilities?

1:21:00.6 FC: It's adapting its solution based on patterns of program modification that it has seen before. So if, in the training data, a solution is followed by, oh, by the way, this is wrong, here's the correct solution, and you see this many, many times, the model learns a sort of modification function that goes from an incorrect solution to a fixed solution. And if you tell it, oh, by the way, there's an error, please give me the right answer, what it's gonna do is apply this modification function to the output it previously produced and give you a new answer. And you may be like, hey, so why don't we do that preemptively? But the thing is that, in the absence of human feedback, there is no particular reason for the modified program to be more or less correct than the initial program. Only the human can tell.

1:22:04.1 SC: Yeah. [chuckle] So I think I know what your answer to this is going to be because we've talked about intelligence before. But what about the oft-proclaimed dream of letting the AI program a smarter AI and therefore sort of bootstrapping our way up into greater and greater intelligence?

1:22:23.7 FC: Well, right now, if you want to use an LLM to do the programming, it's gonna be constrained by its training data. It can only give you things that are simple interpolations of programs, of code snippets, that it has seen before, which is why LLMs work great as a Stack Overflow replacement but not so well as actual software engineers capable of novel problem solving. And the average senior software engineer is a tremendously capable novel problem solver, but they're also completely unable to invent AGI. [chuckle] So you're not gonna get an LLM, which has no novel problem-solving ability, to invent AGI. It cannot even invent the solution to an ARC puzzle.

1:23:15.9 SC: Right.

1:23:16.6 FC: Which is pretty trivial; a four-year-old can do it. So no, that's not gonna work. But you could ask, hey, why just use LLMs? Why couldn't we use something else, like genetic program search, since I mentioned that genetic algorithms could actually invent new things? Well, in practice, I think this is kind of a bad idea. It is viable in theory, because if you think about it, humans were developed by an evolutionary algorithm. Intelligence is the answer to a question posed by nature. Could we not just get the same answer by asking the question again and letting a search algorithm run its course? In theory, yes; in practice, it's a bad idea, because the scale at which you would need to run it is excessive. And I'll tell you...

1:24:12.2 FC: We already have general intelligence; we are general intelligence. And general intelligence gives you an extremely effective ability to predict which idea should be tried next. If you try to delegate this ideation bit to an algorithm, you are wasting resources, right? Because what's gonna be computationally intensive is actually evaluating the solution: trying to implement it, figuring out whether it's actually on the right path or not, and so on. The ideation bit is not expensive. So what you're doing is effectively outsourcing the thing you're really good at, and that costs you very little, to a machine that's really bad at it.

[chuckle]

1:25:00.5 FC: And meanwhile, the things that are actually automatable and very expensive, the machine still has to do. So it's just an extremely ineffective idea, this idea that, hey, we can just brute-force our way to the right AGI architecture. It doesn't work. In fact, it doesn't even work on a much smaller scale. And by the way, another issue that you're gonna run into with this brute-force search idea is that your search is only gonna find you points in your initial search space. As the human programmer, you start by defining the space you want to search over, like the space of possible genomes for your genetic search algorithm, for instance. And what if the correct solution was not in your search space? If you don't know what the correct solution is in advance, you have no way to tell.

1:26:22.7 FC: So maybe you're gonna be expending an extraordinary amount of compute resources to mine a search space that does not even contain the right solution. And this whole idea doesn't even work on a much smaller scale. For instance, neural architecture search was a thing in deep learning for a long time. The idea was that, hey, researchers have come up with a number of architectures that perform really well, like LSTM, like Transformers, and so on. Could we not just make a machine that tries a bunch of different architectures and finds a better one? It has never worked. There's literally nothing popular out there that was developed by an algorithm, despite tremendous amounts of compute dedicated to this idea. Everything out there, like Transformers, for instance, or even the more novel and recent architectures like Mamba or xLSTM, all of these were invented by humans, because humans are really good at inventing. AI is not idea-constrained today. So trying to outsource ideation is just a bad idea.

1:27:12.6 SC: But it's not because we're magical. I mean, it seems like we should in principle be able to write computer programs that are as smart as us.

1:27:22.7 FC: Yeah. In principle, sure. It's just not a straightforward problem. It's just not an easy little riddle. The human brain is tremendously complex.

1:27:34.5 SC: Yeah, fair enough.

1:27:35.8 FC: And no one really understands how it works today.

1:27:38.9 SC: So I guess that you're not going to assign a large probability to the existential threat of AI taking over the world.

1:27:48.7 FC: No. So, no. To start with, because AGI is not a technology that exists today, and we have nothing today that would lead to it. We need to invent it; we need new ideas. And this is the entire point of the ARC-AGI competition: to get people to come up with new ideas, because currently we are stuck. We are on an off ramp, so we need a reset. But even if we had a promising avenue to create AGI, I think the whole idea that AGI is gonna end humanity is based on several deep misconceptions about intelligence. Intelligence is pretty much just a conversion ratio between the information you have and the ability to operate in novel situations in the future. Your intelligence is you turning your past experience, and also the knowledge that you're born with...

1:28:56.5 FC: Because you're actually not born knowing nothing about the world. You know some things about the world; some things are hardcoded into your genes. And so you turn that, mostly your experience, into the ability to approach each new day in your life and actually behave appropriately throughout your day, accomplish your goals, and so on. And this ability to sort of chart a path through situation space does not entail that the system should have goals of its own, or values of its own, that we need to align with human values. It is just an ability; it's just a pathfinding ability. And in order to make something like Skynet or Terminators, well, you need more than just intelligence. You need intelligence plus autonomous goal setting. But why would you want to give machines, potentially very capable machines, autonomous goals?

[chuckle]

1:30:08.5 FC: Sounds like a bad idea. And of course, if you have goal setting, these goals need to be grounded in some value system, so you're gonna have to give machines their own values. And you're gonna have to give machines autonomy, because intelligence does not imply autonomy, by the way. Autonomy in the sense of the ability to perceive the world and act in the world without mediation by humans. There is no machine out there today that is unmediated by humans, if only because they need a power supply. There's no machine out there that can just recharge itself and maintain itself in perpetuity. No machine today has autonomy. So in order to create a danger, you would need to engineer the danger very deliberately, like Skynet, to be honest.

1:30:56.9 SC: Yeah.

1:30:58.0 FC: The whole thing with Skynet is, hey, we have this very intelligent thing, and we've given it the ability to make its own autonomous decisions based on its own value system. Hey, let's hook it up to our nuclear arsenal. It sounds like a bad plan. So to make something dangerous, you literally have to create an agent, give it autonomous sensing, give it autonomous acting, give it its own value system, give it its own autonomous black-box ability to set goals with no human supervision, and then you give it superintelligence. Well, to be honest, this whole thing already starts being dangerous even before you add intelligence. And intelligence in itself is just a tool; it's just a way to accomplish goals. If you don't hook it up to autonomous goal setting, then it is pretty much harmless. Well, not quite harmless, because it's gonna be in the hands of humans, and humans are dangerous. So it's dangerous in the sense that people are potentially gonna use it for bad purposes, but it's not dangerous in the sense that it competes with the human species.

1:32:08.3 FC: It's no more dangerous than any other tool that we have. It's like fission energy: it's not dangerous on its own, it's just a tool. You can use it to create clean power, or you can use it to make a bomb. But if it's gonna be threatening, it needs to be deliberately engineered to be threatening. And I think AGI is gonna be similar, if I had to imagine something. I also think it's kind of pointless to try to plan for risks in something that is completely unknown. We don't know what AGI really looks like, so how are you gonna plan for how to handle it? I think how to handle AGI is something we're gonna start making meaningful progress on when we start having it. And again, AGI on its own is not a threat; it's just a tool. To make it threatening, you need to engineer it into either something completely autonomous, which sounds like a really bad idea, or a weapon in the hands of humans.

1:33:18.2 SC: Well, you've done a very good job of sort of deflating some of the misconceptions about intelligence and large language models and so forth. But I wanted to maybe wind up by giving you a chance to talk about your day job, because you work on these things. You actually have a lot of positive things to say about deep learning models, et cetera. So let's open the door a little bit on what it means to be developing these things. I mean, you have a very successful software package that 3 million people use. Is it something that... Should more people out there be developing and training their own large language models?

1:33:58.3 FC: Absolutely. And not just large language models, but deep learning models in general.

1:34:03.9 SC: Deep learning models.

1:34:05.2 FC: I think it would be a sad world, if there were only a fixed set of companies training models, and just giving those models to other people to be consumers of those models. I think we want this technology to be a tool in the hands of everyone. I would like every software developer out there to be able to tackle their own problems using these tools, using deep learning, using large language models, using Keras. And that's basically the reason why I tried to make Keras as accessible as possible, as approachable as possible.

1:34:41.7 SC: So what is Keras? What does that mean?

1:34:44.1 FC: So Keras is a deep learning library. So it's a software library for building and training your own deep learning models on your own data. And it's not necessarily building models from scratch, you can also adapt an existing model, like an existing large language model, for instance.

1:35:02.8 SC: So could you take an existing large language model and then feed it all the transcripts of the Mindscape podcast and sort of elevate their importance in the model so that it would mimic some average scholarly Mindscape guest kind of point of view?

1:35:23.9 FC: That's right. So if you want, for instance, to generate new episodes, that's something you can do. You can take the Gemma 8B model, for instance, which is an open source LLM released by Google; it's available in Keras. You fine-tune it to predict the next word on your transcripts. You can use a technique called LoRA fine-tuning, which is basically compute-efficient fine-tuning. And now you can generate new transcripts. It's probably not gonna be very good, but it's probably gonna sound like your podcast. If you start listening closely, you're probably gonna be raising eyebrows quite a bit. But at first glance, it's gonna sound like your podcast, yeah.
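
For readers who want to see roughly what that looks like in code, here is a minimal sketch along the lines of the publicly documented KerasNLP recipe for LoRA fine-tuning of Gemma. The preset name, hyperparameters, and toy training strings are illustrative assumptions, and exact API details may differ across library versions.

import keras
import keras_nlp

# Load a pretrained open-weights Gemma model (preset name is illustrative; smaller
# presets are easier to fit on a single GPU or a free Colab instance).
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Enable LoRA: only small low-rank adapter weights get trained, which keeps
# fine-tuning cheap in memory and compute.
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.preprocessor.sequence_length = 512

# Toy training data; in practice this would be the podcast transcripts,
# one string per training example.
transcripts = [
    "Host: Welcome to the show. Guest: Thanks for having me.",
    "Host: Today we're talking about abstraction and reasoning.",
]

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(learning_rate=5e-5),
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(transcripts, epochs=1, batch_size=1)

# Generate text that superficially sounds like the fine-tuning data.
print(gemma_lm.generate("Host: Welcome back to the show.", max_length=128))

With LoRA the base weights stay frozen, so the result is a small adapter specialized to the transcripts rather than a whole new model, which is what makes this feasible on modest hardware.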

1:36:08.8 SC: I mean, have you done this with the equivalent? I know that...

1:36:11.5 FC: I have not, but people have done generative podcasts like this, yes.

1:36:18.0 SC: Yeah. Are you aware of the experiment that was done with the works of Daniel Dennett, the philosopher?

1:36:25.5 FC: No.

1:36:26.0 SC: So Eric Schwitzgebel, who's another philosopher and was also a guest on the podcast, he and some collaborators trained an LLM on everything ever written by Daniel Dennett. And then they asked it some questions, and they asked Dan Dennett the same questions. Dan passed away recently, but this was before that happened. And then they asked some philosophers who are familiar with Dennett's work which of these answers was the real Dennett. And they did better than chance, but in some cases not a lot better, depending on the question.

1:36:58.1 FC: Yeah. Honestly, if you're just looking at a short text snippet in isolation, it's very hard to tell.

1:37:03.4 SC: Right. Exactly.

1:37:05.1 FC: And especially when you're reading the output of an LLM, it's meaningful because you're interpreting it, you know?

1:37:15.4 SC: Yeah. You're giving it more credit, maybe, than it deserves.

1:37:21.2 FC: Yes. LLMs are all about mimicking humans, so they're very good at hacking your theory of mind, because you have this bias towards interpreting anything that superficially acts like you as being like you.

1:37:40.9 SC: Yep. The intentional stance. Dennett talked about this, actually. Yes. So, I mean, how realistic is it for the typical listener with a relatively late model MacBook Pro to open up Python on their computer and download some of your libraries and start going to town?

1:37:58.1 FC: It's very easy. You're going to want a GPU, and if you have a MacBook Pro, one of the recent ones, you actually do have a GPU. And if you're using the TensorFlow backend of Keras, you can do GPU-accelerated computation on your MacBook Pro. So you can do that, or you could use a free GPU notebook service like Colab from Google, for instance. And it's actually extremely easy to just get started and do a LoRA fine-tune of the Gemma model with Keras on your own data. If you already know Python, it's really easy. You'll be done in no time.

1:38:38.3 SC: And if you don't, you've written a book.

1:38:41.2 FC: I have written a book. That's right. So the first edition was in 2017. Then there was a second edition in 2021. Now I'm actually writing the third edition.

1:38:51.5 SC: Okay.

1:38:51.7 FC: And that one will have a lot more content on generative AI.

1:38:54.8 SC: I would get it.

1:38:55.6 FC: Both LLMs and image generation as well.

1:38:58.0 SC: And besides the fact that it sounds like a lot of fun and also educational to do this, are there use cases for people training their own LLMs to do their own specific tasks?

1:39:09.9 FC: Absolutely. If you're a business and you have a specific business problem, like, hey, I have this spreadsheet with this information and I want to turn it into a set of emails, for instance, you could just prompt the LLM into doing it, and maybe it will work. But if you want better results, you can actually adapt the LLM to your problem, so that it does not just fetch the right program but maybe fits the right program to the data you provide. In fact, I would say that if, as a business, you want to make extensive use of LLMs, you should be fine-tuning your own LLMs, because this gives you an advantage. Instead of just reusing the same program database as everyone else, which is the sort of public-access LLM, you are starting to develop your own repository of private programs, trained on your own data, specific to your needs. And that's very powerful.
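
As a hedged illustration of the spreadsheet-to-emails idea, here is one way to turn rows of a spreadsheet into prompt/completion pairs that could then be used to fine-tune (or few-shot prompt) a model. The file name, column names, and email template are all hypothetical, chosen only for the sake of the example.

import csv

def row_to_example(row):
    # Turn one spreadsheet row into a prompt/completion pair for fine-tuning.
    prompt = (
        f"Write a short follow-up email to {row['name']} at {row['company']} "
        f"about their order of {row['product']}."
    )
    completion = (
        f"Hi {row['name']},\n\nThanks for ordering {row['product']}. "
        f"Let me know if {row['company']} needs anything else.\n\nBest regards"
    )
    return {"prompt": prompt, "completion": completion}

# Hypothetical input file with columns: name, company, product.
with open("orders.csv", newline="") as f:
    examples = [row_to_example(row) for row in csv.DictReader(f)]

print(examples[0]["prompt"])

The more such domain-specific examples you can generate from your own data, the more the adapted model behaves like a private, specialized tool rather than a generic public one.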

1:40:11.3 SC: Is there an app out there that will answer my emails for me?

1:40:16.9 FC: There probably is; there are just so many generative AI startups.

1:40:23.6 SC: Yeah. I hope that I don't end up starting Skynet by sending an email that was generated by an LLM. But we shouldn't leave people with the impression that there aren't many, many transformative ways that even LLMs can affect our lives going forward.

1:40:41.7 FC: Yeah. So many people try to do things with GenAI that it may not necessarily be suited for. My advice, in general, is do not try to delegate any sort of decision-making to the LLM. The LLM is there to give you a shortcut towards the general area that you're looking for. Do not delegate your decisions. Do not let the LLM generate your emails for you, but maybe it can help you write emails faster, for instance, or fix the typos in your emails.

1:41:16.3 SC: Anything that would make me go faster is very good. So I'll take that as useful advice. Francois Chollet, thanks very much for being on the Mindscape podcast.

1:41:22.9 FC: Thanks for having me.

2 thoughts on “280 | François Chollet on Deep Learning and the Meaning of Intelligence”

  1. Chollet seems quite correct here. The hype over LLMs has obscured the fact that they don’t do creative or human-like reasoning. They predict words based on statistical correlations of what they have been trained on.
    Chollet is also quite correct that AGI does not exist and no one is even working on a path to getting there, as we don't even know how to find such a path. You certainly can't get there by scaling up LLMs. AGI is still just a myth and a chimera. There is no basis for predicting when, if ever, it will be achieved. As companies begin to realize the limitations of LLMs, it is possible that the generative AI stock market bubble will pop or deflate. This may have already started with the 13% pullback in NVDA over the three days ended June 24. And McDonald's has just cancelled its generative AI joint venture after it started ordering hundreds of drinks for individual customers and massively misunderstanding simple orders. Bah humbug.

  2. Henrik Bodenstab

    This was a fascinating conversation, especially around defining the differentiation between humans/AGI and the current state of AI models in terms of fetching/memorizing and intelligence. One aspect where I feel the conversation fell short was diving into what exactly intelligence stands for. We learn a language, solve problems, and create new ideas—all of which include the ability to fetch memorized data and then use that to create. But where exactly does creation start? The examples of students understanding vs. memorizing and the description of the ARC challenge are descriptive, but I still feel that a discussion around the liminal space where intelligence transitions into the mechanical process of data retrieval and memory is needed.
