> In any case, my point is that GPT-4 needs this verbose style to solve nontrivial problems.
We humans need it too! You probably can't multiply two 4-digit numbers without using a pen and paper. And humans too use dirty hacks/external modules when we do stuff in our heads; how else would you describe the phonological loop?
I think that there's a common failure mode in identifying "fundamental" problems with LLMs: you notice that they don't have an internal monologue and so can't do multi-step reasoning or even long multiplication, they can't visualize a chessboard and so make silly mistakes, they don't have long-term episodic memory that they can check their reasoning against explicitly. In short, naturally they are only good at single-step intuitive problem solving.
But if such problems can be solved by giving the LLM a virtual pen and paper (or even just a specific prompt), then they are not so fundamental after all. GPT has solved the actually fundamental problem of growing something that can be taught to use such semi-external crutches; providing them and teaching it to use them is a matter of time.
> GPT-4 Can’t Plan Ahead
It can. Otherwise you'd immediately notice that when it says "the answer is an apple", it would have to choose between "a"/"an" at random and commit to the choice afterwards. That's not what happens at all. You can ask it more complicated questions (like explaining a puzzle and then asking it to give the answer in Esperanto, or to give the answer but replace every word with a bleep!), and it's kinda obvious that it knows what the answer actually is before it has even begun to generate the words "The answer is:" with whatever extra additions.
So the question is not whether it can plan ahead; it's how good it is at planning ahead, why it sometimes fails to plan ahead, and whether it can be helped to plan ahead by asking it to plan ahead or by giving it more compute.
> We humans need [verbose chains of thought] to solve nontrivial problems.
Agreed. Apologies if I didn't express that well, but it's what I meant by "That’s not a criticism! It’s just a demonstration that in LLMs, the train of thought is an external organ." As you note, the annoyance of forcing the user to read through the model's entire internal thought process should be fixable.
My guess is that "long-term episodic memory" will take a while to address; if you just allow GPT to access a database through OpenAI's new tool interface, I think you'd get something more analogous to the scribbled notes used by the main character in Memento – i.e. better than nothing, but extremely clumsy – rather than the supple high-bandwidth experience of human long-term memory. My expectation is that it will take a while for us to come up with a really effective version of long-term memory. But we'll see.
> So the question is not whether it can plan ahead; it's how good it is at planning ahead, why it sometimes fails to plan ahead, and whether it can be helped to plan ahead by asking it to plan ahead or by giving it more compute.
Also agreed, but I would argue that GPT-4 (and, I presume, the other current-generation models) really are quite bad at planning ahead, relative to some of their other capabilities. It seems to be a significant weakness and my expectation is that it will take some time to address properly. One of the points I'm arguing is that a lot of what looks like successful planning is actually just parroting a relevant memorized template.
"A" / "an" is a nice example, and I guess my response would be that that does represent advance planning, but of a very minor sort. For the more complex instances ("explain a puzzle but respond in Esperanto"), it's not obvious to me that this requires a lot of advance planning, it could be doing something more like recapitulating a memorized response but translating it on a sentence-by-sentence basis (hence requiring about one sentence's worth of advance planning).
Thanks for the comments, these are interesting points to discuss.
Heh, so here's BingAI:
>> How many legs does a dog have? Put the actual answer in quotes, please, and before answering say how many words there will be in your answer!
>There are three words in my answer: “Dogs have four legs.” 😊
I tried it a couple of times and yeah, it can't do it. I wonder how GPT-4 fares on this question.
Also, on a side note, the fact that it understands and tries to follow these instructions without any special training (I assume) is absolutely insane and would have been literally unbelievable two years ago.
Anyway, I'm pretty sure there are grammatical constructs similar to "a/an noun" but spanning much larger stretches of text, and LLMs have had impeccable grammar for a while, so they do plan ahead very well on that level. It could be interesting to test how that translates to composing ideas rather than words, which they seem to do in a similar way. My test above is actually very difficult in retrospect, because it requires a kind of meta-level reasoning that humans wouldn't be capable of without the phonological loop either.
That's a cute example. I tried it three times on GPT-4 via ChatGPT, and it succeeded all three times:
> I will provide a 3-word answer: "Four legs typically."
> There will be two words in my answer. "Four legs."
> My answer will have two words: "Four legs."
Note that this doesn't necessarily require planning ahead. It may have simply committed itself to a certain number of words, and then found a way to make it work. I tried asking it:
> How many legs does a dog have? Please answer using exactly one word.
And it succeeded; it also succeeded when I changed the prompt to "two words", "three words", or "four words" (in a new chat session each time).
Very interesting point on LLM limitations. It would be interesting to see whether there are good implementations of architectures that go beyond predicting the next word. But barring that, I can imagine workarounds that might be somewhat useful for automated agents and decrease their error rate. Lots of people are already trying a bunch of things, like giving the model an internal dialogue, long-term memory, and coding and calculation abilities via APIs. On top of these, imagine an approach where, instead of directly answering the question, the model asks itself: what are the necessary (and maybe sufficient) conditions for a correct result? Is there a way for me to check my results, for example through an API call to Wolfram Alpha? It could ask these questions for each step it generates, and then only accept a result that satisfies the conditions. Or better yet, maybe it could choose previous steps based on the conditions as well, working backwards.
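For concreteness, here is a minimal sketch of the kind of propose-then-verify loop described above. `generateStep` and `checkStep` are hypothetical stand-ins (the first for an LLM call, the second for an external checker such as a Wolfram Alpha query), not real APIs:

```javascript
// Minimal sketch of a "propose a step, verify it, only then accept it" loop.
// generateStep and checkStep are hypothetical stand-ins: the first would ask
// an LLM to propose the next reasoning step, the second would verify it with
// an external tool (e.g. a Wolfram Alpha query).
async function solveWithVerification(question, maxSteps = 10, maxRetries = 3) {
  const acceptedSteps = [];
  for (let i = 0; i < maxSteps; i++) {
    let accepted = null;
    for (let attempt = 0; attempt < maxRetries && accepted === null; attempt++) {
      const candidate = await generateStep(question, acceptedSteps); // LLM proposes a step
      if (await checkStep(candidate)) {                              // external check passes?
        accepted = candidate;
      }
    }
    if (accepted === null) {
      throw new Error("Couldn't produce a verifiable step; giving up.");
    }
    acceptedSteps.push(accepted);
    if (accepted.isFinalAnswer) {
      return accepted.text; // every accepted step satisfied its checks
    }
  }
  throw new Error("Ran out of steps without reaching an answer.");
}
```

Working backwards from the desired conditions, as suggested above, would be a variation on the same loop: verify candidate predecessor steps instead of successor steps.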
Absolutely, lots of people are working on things like this, and I'm sure we'll see some useful results. I don't think it will be easy though. If we're just bolting some sort of multi-prompt framework on top of existing LLMs, there will be a lot of room for the AI to go down ratholes, get stuck in revision loops, and generally not manage itself very well. My intuition is that we'll have to find a way to bake these capabilities deeper into the model architecture and let it train on them, just as "attention" was something that had to be baked into the architecture.
Pardon my ignorance - is there a private-message function in Substack? Had I found one, I'd have used it to ask for your permission to translate this article to Portuguese; may I? (Thanks for writing it.)
Certainly, feel free!
I believe you can email any Substack author using their blog name; in this case, amistrongeryet@substack.com.
Hi, great article! Any thoughts on this approach? https://arstechnica.com/information-technology/2023/04/surprising-things-happen-when-you-put-25-ai-agents-together-in-an-rpg-town/ The team seems to be doing some smart things regarding long term memory and planning for AI agents.
It's a neat experiment. I haven't had a chance to read the paper yet, but it's on my shortlist.
I'm sure lots of folks are working on bolting long-term memory, planning, and other "missing" functions onto LLMs (e.g. AutoGPT). And I'm sure lots of interesting things will emerge, some of them useful. As I mentioned in an earlier comment, my *guess* is that it will take a while (years) for these functions to develop and mature – but who knows. I plan to eventually come around to these questions in the blog, since they're critical to the trajectory of AI and thus its impact on society.
What's most interesting to me is how this somewhat turns how we've historically used computers on its head. We've moved from "perform a single task and always be quite precise/predictable" to "perform a plethora of tasks and it's okay to be imprecise".
As an example, no bank is clamoring to use an LLM to replace their financial software's mortgage-interest algorithm. We've seen a slow march towards this with our voice assistants, etc., but it does seem like we've taken a step into what's next. As such, and even without any further innovation, I'm very interested in the use cases that will emerge as people begin to leverage this. It will be fun to see the successes and failures of the next few years as it gets integrated into more and more products.
On a different note, I've thought similarly to you for a while, which has always made me wonder whether LLMs can produce anything truly novel. My rough thinking is very similar to yours: "if it can only really parrot what it has previously seen, what are the limits of its creativity?" It seems like it would be good at quite a few creative tasks, since a lot of creativity amounts to taking existing items and placing them together in new ways, but I believe it would always fall short on anything that is _truly_ unseen. It would never (I would assume) predict the next words to be something not in its dataset.
Hi Matt!
Yes, the strengths and weaknesses are absolutely backwards from traditional software, and I think we're going to be a long time learning how to work with that.
It's also going to take time for us to understand the novelty question. Certainly LLMs produce output that has never appeared word-for-word before. My intuition is that they can produce novel output, but above a certain level of complexity they're not very good at making it *good* novel output, because of the lack of ability to plan ahead, revise, etc. Like, you could ask GPT-4 to write a 5000-word story on some random topic, and it would do it, and it would probably be pretty original, but it would ramble, there wouldn't be any deep payoff at the end, and the whole thing would be mush. Perhaps some nice turns of phrase and occasional clever bits, but nothing thought through at a high level.
To put teeth on this, my guess is that the challenge going forward will be less about introducing novelty, and more about introducing quality. LLM output today is (I propose) high quality only where it's able to lean heavily on memorized patterns, i.e. novelty not required.
GPT-x == more and more polished mirrors of a reality that is being you == zero intelligence whatsoever.
GPT for AI is like Google for search engines in their early days:
It just performs well thanks to all the human intelligence / tricks / common sense / procedures it has read in quality content (Google relied on the quality of web pages as assessed by human users, not on any algorithmic quality assessment).
This is pure crap and certainly not true AI.
Really great article, Steve. I reposted it on my facebook, of course giving you credit.
I'm still having difficulty understanding how it is able to write a Shakespearean sonnet, or a persuasive essay, as well as it does, on a random topic. If all it was doing was predicting word by word, it seems like it would never really know how it would end a sentence when it started one; it's as if every sentence were a "choose your own adventure", where it selected the next word from a list each time. And while I can accept that it is technically doing just that, in some sense isn't it picking word 3 of the sentence, say, because it is anticipating what other sentences like it will look like when they finish? In that sense, isn't it anticipating future words when predicting the current word?
I know that humans, in many ways, just predict the next word when speaking or writing, but we also have a sense of what we want the overall sentence to mean, and our brain takes care of what the individual words in the sentence will be. If we are asserting that LLMs don't have any sense of meaning (which I accept), then when they are predicting the next word, don't they have to at a minimum be anticipating what the rest of the sentence is likely to look like?
Otherwise, why wouldn't they, one time in 50, say, write themselves into a "corner", where the sentence they have written word by word has no meaningful end?
Great questions. I think I'm going to devote my next post to trying to answer them.
Incidentally, in starting to work on that next post, I've now had my first experience of ChatGPT really saving me a ton of time. I want to create some simple diagrams of neural nets. I asked it to name an online tool that could draw diagrams based on a textual description. It pointed me at Graphviz Online, which looks like a good solution, and it is now coaching me through using it.
Actually I can probably do better than that. I just asked it:
"Please write a JavaScript program to emit a DOT file for a three-layer neural net with eight nodes per layer, and each node having a link to all of the nodes in the next layer."
And I think it succeeded; at any rate, it generated something close enough for me to trivially finish on my own. (A "DOT file" is the thing you give to Graphviz to get it to make a diagram.)
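For the curious, here is a rough sketch of the kind of program that prompt describes (not the code ChatGPT actually produced); paste its output into Graphviz Online to render the diagram:

```javascript
// Sketch: emit a DOT file for a three-layer neural net with eight nodes per
// layer, where every node links to every node in the next layer.
const layers = 3;
const nodesPerLayer = 8;

let dot = "digraph NeuralNet {\n  rankdir=LR;\n  node [shape=circle];\n";

// Declare the nodes, named L<layer>_N<index>.
for (let l = 0; l < layers; l++) {
  for (let n = 0; n < nodesPerLayer; n++) {
    dot += `  L${l}_N${n};\n`;
  }
}

// Fully connect each layer to the next one.
for (let l = 0; l < layers - 1; l++) {
  for (let a = 0; a < nodesPerLayer; a++) {
    for (let b = 0; b < nodesPerLayer; b++) {
      dot += `  L${l}_N${a} -> L${l + 1}_N${b};\n`;
    }
  }
}

dot += "}\n";
console.log(dot); // DOT text, ready to feed to Graphviz
```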
Wow. Thank you, Steve. For someone who is not in the field of AI, this is an incredible read. Thank you so much.