Unlike your other posts in this series, the combination of repeatedly quoting GPT-4 and the nuclear astrophysicist's example has me feeling that your assertion that insight is a long way off for AI has a "whistling past the graveyard" feel to it. Logically, reading your post, I see that isn't the case-- I understand your arguments and why GPT-5 or its equivalents will very likely lack what is needed to generate any real insights (long-term memory, etc.)-- but somehow the whole post makes me feel like I'm reading what the Romans said to each other in explaining why the Huns were certainly not powerful or organized enough to sack Rome anytime soon. An irrational response, I grant you, but there it is.
Yeah this is a very real possibility. :-)
I feel like the entire question of future increases in AI capabilities has a sort of "irresistible force vs. immovable object" flavor to it. The irresistible force is obviously the relentless, eye-popping increase in capabilities with each new scaling-up of LLMs. The immovable object is that building a complicated system that has to function in the real world *always* takes much longer than you thought it would, even when you take this into account. (See: self-driving cars.)
I will freely admit that I feel less confident about this "insight" business than I do regarding memory and iterative processes (doing things that can't be written in a single perfect first draft). Regarding memory, for example: it's simply a physical fact that LLMs do not have long-term memory, and there are many things that can't be done without long-term memory. The only way that memory doesn't become a big thing we need to somehow figure out how to build into the models is if we can get by with the approach people are starting to use today, "retrieval-augmented generation" (RAG), where we pull information out of some relatively conventional database and stuff it into the token buffer. The only handwavy part of my argument regarding memory is the claim that proper long-term memory is a subtle thing, that it needs to be deeply intertwined with the rest of the thought process, and that RAG is going to be too clumsy to serve as a substitute. I feel relatively confident regarding that handwave (and if it turns out to be wrong, then probably this is all going to come at us too fast for there to have been any hope of controlling it in the first place; I'll see you at the paperclip factory in about five years).
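For concreteness, here's a minimal sketch of what I mean by RAG. Everything in it is a toy stand-in (the keyword-overlap scoring, the hard-coded "memory" list, the absence of an actual model call), but the shape of the workaround is the point: the memory lives in an external store, and we just paste the relevant bits into the prompt.

```python
# Toy sketch of retrieval-augmented generation: find the stored notes most
# relevant to the question and stuff them into the prompt (the token buffer).
# The scoring and the "memory" list are placeholders; a real system would use
# embeddings, a vector database, and an actual model API.

def score(query: str, note: str) -> int:
    # Crude relevance: count of shared words (a stand-in for vector similarity).
    return len(set(query.lower().split()) & set(note.lower().split()))

def retrieve(query: str, notes: list[str], k: int = 2) -> list[str]:
    # Pick the k notes that best match the query.
    return sorted(notes, key=lambda n: score(query, n), reverse=True)[:k]

def build_prompt(query: str, notes: list[str]) -> str:
    # Paste the retrieved notes into the prompt ahead of the actual question.
    context = "\n".join(f"- {n}" for n in retrieve(query, notes))
    return f"Relevant notes:\n{context}\n\nQuestion: {query}"

# Stand-in for long-term memory: in practice, a growing store of past
# conversations, documents, observations, etc.
memory = [
    "Last month's experiment failed because the reagent had expired.",
    "The user prefers concise, high-level answers.",
    "We previously discussed why LLMs lack long-term memory.",
]

print(build_prompt("Why did last month's experiment fail?", memory))
# Note that the model's weights never change; the "memory" lives entirely
# outside the model, which is exactly why I suspect it will prove too clumsy.
```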
Insight is such a poorly-understood concept (by me, at least) that I can't point to any hard fact equivalent to the basic fact of memory, i.e. that there literally is no physical mechanism by which current LLMs could store long-term information. I just have an intuitive sense that insight requires taking a bunch of facts, sifting through them, and looking at them from different angles, and there's nothing really in the LLM training process that would push the model toward being able to do that. Maybe insight will just emerge magically from, like, GPT-6, and I will be saying "but how, but how", and it will explain how, because it's insightful.
Excellent points, all. I can't disagree with any of it. But my intuition for what is possible for an AI has been so broken by LLMs that I just can't get out of my head how shocked all of us would have been 7 years ago to see the "non-insightful" things they do right now. For example, did you see this 5-minute video, just some low-key tech guy demoing the new GPT w/Vision? He just asks it to look at random photos and interpret them, and it has no problem:
https://www.youtube.com/watch?v=FXXF5P9at_I
On the one hand, I can look at each example and say to myself "but it's looked at trillions of photos and there's nothing all that new under the sun and this guy doesn't even address how often it hallucinates". But ultimately, I feel like I am just rationalizing away how amazing it is to see an AI do this. Even if current LLMs are just an incredibly sophisticated version of Clever Hans (where the internet replaces the horse trainer), I am still amazed to see the horse add and multiply.
I find it fascinating that AI people like Altman seem to be quite split on whether we are on the cusp of qualities like "understanding" and "insight". While they pay lip service to the relevant caveats-- "we may need another breakthrough or two like transformers"-- they basically keep saying "look at the curves of improvement on the various metrics we have; we were able to predict how good GPT-4 was going to be before we made it". So half of them seem convinced by an extrapolation argument, and think all the papers that have come out showing GPT-4 can't currently really reason will seem irrelevant once they just give it more data and compute. But the other half of AI people seem to think, like you do, that these are fundamental problems that may well be overcome in the medium term, but that there is no current indication that LLMs will be able to reason in the next few years. Extrapolation has its dangers.
Also, I love the last paragraph of your response, and especially the last sentence, where I laughed out loud :)
Thanks as always for the great comments! You keep pushing me in interesting directions.
The GPT vision stuff is legitimately amazing. At the same time, all of the examples I've seen so far (this video you linked, something from a New York Times reporter I think, and one or two others) are utterly banal and useless. For instance, consider the three invocations shown in this video:
1. It provides a verbose description of the cleaning fluid thing, certainly impressive, but probably way too detailed and literal for the suggested purpose of providing alt-text for the image as used in a blog post. What you'd actually want is a brief high-level description of the image, focusing on whatever aspect is relevant to the blog post.
2. Identifying items in a spice rack, and then suggesting a recipe. Note that the recipe is an utterly generic pasta dish, whose only connection to the original image is that it suggests using a single one of the pictured spices as a seasoning. Remember how appliance manufacturers have spent the last, like, 25 years telling us that our refrigerators would be connected to the Internet so that they could suggest recipes? Remember how (a) no one in the history of civilization has ever wanted their refrigerator to suggest a recipe, and (b) no one has ever come up with any other use case for connecting a refrigerator to the Internet? (I exaggerate of course – but not by much!) "It could suggest recipes" seems to be the default use case that people suggest for a thing when the honest truth is that they have no actual idea how the thing might ever be useful.
3. Making a bunch of obvious, overly verbose, and honestly patronizing observations about the information in a screenshot of the guy's SEO console.
Again, I agree that strictly in terms of the technical capabilities on display, this is amazing. But I am reminded of the old saying about a dancing bear – "The marvel is not that the bear dances well, but that the bear dances at all." One implication is that *if you need good dancing, the bear is useless to you*. A thing can simultaneously be amazing, and of little practical import.
This is related to what I was trying to get at in The AI Progress Paradox (https://amistrongeryet.substack.com/p/the-ai-progress-paradox) – amazingness is not necessarily a good indicator that AI is on the cusp of being able to accomplish meaningful things in the real world. I'm sure people will find some practical use cases for these new vision capabilities. But most of the examples people are waving around are notably shallow / pointless on closer inspection.
Speaking of the centaur-like human-LLM combo... I wonder if hallucinations by themselves are sufficient to guide the human to an insight.
Personally, I also start to self-doubt here and question what an insight actually is. Is it just correlating previously uncorrelated points? LLMs might already be able to do that. What if you fed the output of a prompt back into the prompt repeatedly? Would the LLM be able to generate an insight then?
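Something like the following loop is what I have in mind. The ask_model stub is just a placeholder for a real API call, so this is only a sketch of the structure, not a claim that it would actually produce insight:

```python
# Rough sketch of the loop I'm imagining: hand the model its own previous
# answer and ask it to look at the question again from a different angle.
# ask_model is a stub; a real version would call an actual LLM API.

def ask_model(prompt: str) -> str:
    # Placeholder for a real model call.
    return f"(the model's reflection on: {prompt[:60]}...)"

def iterate(question: str, rounds: int = 3) -> str:
    # Feed each answer back into the next prompt, a few rounds in a row.
    answer = ask_model(question)
    for _ in range(rounds):
        prompt = (
            f"Original question: {question}\n"
            f"Your previous answer: {answer}\n"
            "Re-examine this from a different angle and point out any "
            "connection you haven't mentioned yet."
        )
        answer = ask_model(prompt)
    return answer

print(iterate("Is there a link between long-term memory and insight?"))
```

Whether anything genuinely new would fall out of a loop like this, rather than the model just paraphrasing itself, is exactly the question.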
All of this is extremely fascinating, and having the ability to play with this stuff and experiment with it in Colab or on home hardware is amazing. What a time to be alive!
I read this as 'the mythical man moth' and was very excited to learn how he explained programming delays.