GPT-4 Doesn't Figure Things Out, It Already Knows Them

Yes, It's a Stochastic Parrot, But Most Of The Time You Are Too, And It's Memorized a Lot More Than You Have

Apr 13, 2023

Its real talent is finding the right book for every occasion

This is the first in a series of posts in which I’ll be exploring the trajectory of AI: how capable are these systems today, where are they headed, how worried or excited should we be, and what can we do about it?

Of course, endless ink has already been spilled on this subject. However, much of the focus is either very short-term and tactical (“Eight Magic Prompts To Get Great Marketing Copy From ChatGPT”), or very long-term and theoretical (“Why Any Sufficiently Powerful Intelligence Will Steal Our Atoms To Make Paperclips1”). I’m going to focus on the middle ground, beyond where we are today, but short of the theoretical limits of where a future super-intelligent AI might take us. In many ways that middle ground is the most difficult to predict, but that’s also what makes it interesting. In order to understand the promise and threat of AI, and make good decisions about how to prepare for it, I think it’s important to explore this middle-future territory.

That said, to predict the future, it helps to understand the present. As of April 2023, the most advanced AI system available to the general public is GPT-4, the LLM2 recently released by OpenAI. There is a lot of confusing and conflicting discussion of exactly how “intelligent” this model is. In this post, I present a mental model that can be used to understand GPT-4’s strengths and weaknesses, and thus may shed some light on what we might expect to come next.

In summary, I believe that GPT-4 is a long way from being a general human-level intelligence. We are fooled into overestimating its capability, because it is able to draw on a superhumanly large repository of learned facts and patterns; its output is highly polished; and the ways we interact with it today (e.g. through ChatGPT) steer us toward the sorts of generic, shallow questions where these strengths tend to mask its weaknesses in reasoning ability. At the same time, GPT-4 is in fact highly capable at a wide variety of genuinely useful tasks. Most of GPT-4’s output is intellectually shallow, but many real-world tasks are shallow.

Disclaimers and Notes

I have been writing code for close on 50 years. During that time, my AI experience consists of (1) building the incredibly trivial enemy control logic for the 1991 video game Spectre, and (2) one course at Stanford, also in the early 90’s. My qualifications to pontificate on the future of AI are limited to:

A few months spent reading a lot of papers and blogs and listening to a lot of podcasts.
A lifetime of experience watching other technologies stumble their way into the world.
(I like to think) An aptitude for thinking through complex situations and assessing the implications.

In other words, I have no conventional credentials in the field of AI, but I have hopes of bringing a thoughtful outsider’s perspective. Let me know what you think!

It’s also worth reiterating that no one really understands how GPT-4 and other current generative AI systems work. Even if we did understand them, no short explanation would suffice. Anything you read on the subject, by me or anyone else, is necessarily oversimplified. And the entire field is moving at breakneck speed, so anything you read may quickly become out of date.

Finally, regarding the conversations with GPT-4 that I reference here:

All examples are from my own use of ChatGPT in “GPT-4” mode, i.e. https://chat.openai.com/chat?model=gpt-4, in early April 2023. For those chats where I only include a partial snippet here in the blog, complete transcripts can be found on GitHub.
Each interaction appears at the beginning of a fresh transcript (no previous state), unless otherwise noted.
The examples are cherry-picked very lightly or not at all. For any particular point I was trying to illustrate, I generally needed one to three tries to get a good example. If more than one try, I’m iterating on my prompt, not the result – the results seem to be robust to regeneration.

GPT-4 Is Incredibly Smart and Quite Dumb

How smart is GPT-4? The Internet is currently awash in examples of both highly impressive and highly unimpressive behavior. On the impressive side, OpenAI reports that GPT-4 gets a 5, the highest possible score, on the AP exams in Art History, Biology, Environmental Science, Macroeconomics, Microeconomics, Psychology, Statistics, US Government, and US History, and a passing score of 4 on four of the six remaining exams, including Calculus BC. On the unimpressive side, it only took me a few attempts to come up with a simple math problem that it completely whiffs:

[Prompt] Pick three numbers. Concatenate the numbers together. Choose the original three numbers so that the result will be a prime number (on the first try).
I'll choose the numbers 2, 3, and 7. When concatenated together, they form the number 237. This is a prime number because it cannot be divided evenly by any other numbers except for 1 and itself.

Unfortunately, 237 is not prime. For any student who has made it as far as calculus, this should be an easy problem, but GPT-4 consistently gets it wrong. Language models like GPT-4 generate one word at a time, which led me to suspect that they would have trouble with tasks that require advance planning; I deliberately designed this problem to require thinking ahead. As I’ll explain in more detail later on, this helps explain why a chatbot which is smart enough to pass Calculus BC fails to come up with a three-digit prime number.
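The claim that 237 is not prime is easy to check mechanically. Here is a minimal sketch using plain trial division; the function name `prime_factors` is my own choice for illustration, not something from the chat.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


print(prime_factors(237))  # [3, 79]
```

A number is prime exactly when this list contains only the number itself, which is not the case here.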

GPT-4 Has Heard It All Before

Note: It’s dangerous to make analogies between the “thought processes” of an AI and those of a human being, but I’m going to do it anyway, because human thought is the only base of experience we have to draw on. Also, admit it: you think of ChatGPT in human terms, too.

GPT-4 was trained on massive amounts of text. It has internalized far more information than a human being can. This includes not only simple facts, but also patterns and relationships, ranging from the procedure for simplifying an algebraic expression to the typical structure of a “Dad joke”.

I believe that even when GPT-4 responds to a novel prompt, it is leaning heavily on examples in its training data. Even if it has never seen the exact question, it has likely seen many similar examples, and it has learned the structure of a typical answer. Thus, it doesn’t need to “think about” how to respond; it’s able to dive straight in and begin unconsciously reeling off words, just as you might respond to a routine question without really thinking about it.

The patterns that GPT-4 has learned are not rigid templates. It is able to flexibly overlay multiple patterns at once, squishing them together into a single response. This is why it is able to answer questions it has never seen before, so long as its training data contains an appropriate mix of ingredients. For example:

[Prompt] Please make 10 dad jokes on the topic of butterflies.
Why did the butterfly never finish college? It just couldn't stop winging it!
What do you call a butterfly's favorite game? Flutter-ball!
Why did the butterfly become a detective? Because it wanted to solve mysteries by a wing and a prayer!
Why did the butterfly go to therapy? It was experiencing a bit of an identity crisis after its caterpillar days!
What do you call a butterfly that loves to tell jokes? A comedian-fly!
Why don't butterflies like fast food? They prefer to take it slow and savor their nectar!
What's a butterfly's favorite type of clothing? Wing-sleeve shirts, of course!
What do you get when you cross a butterfly with a doorbell? A flutterby that goes "wing-wing"!
Why do butterflies make terrible baseball players? They can't seem to catch anything but pollen!
What's a butterfly's favorite dance? The flutter-foxtrot!

Each joke starts with a very standard template; somewhere along the way GPT inserts some butterfly-related word or fact. The templates aren’t applied rigidly; there is a certain amount of variation in the sentence structure. This is consistent with the basic principle on which LLMs operate: generating one word at a time, estimating the probability that each possible word is the correct one to use next. It’s natural to incorporate multiple sources of information – “a common structure for a Dad joke”, “facts about butterflies” – into the probabilities.
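To make the "blending sources into probabilities" idea concrete, here is a deliberately tiny toy. Everything in it is an assumption for illustration: the four candidate words, the two score tables, and all of the numbers are invented, and a real LLM does not keep separate per-source tables like this. The only point is that adding scores and normalizing yields a single distribution that reflects both influences at once.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Invented scores: how well each word fits a Dad-joke punchline slot...
joke_structure = {"wing": 0.5, "run": 1.0, "fly": 0.5, "tax": 0.0}
# ...and how strongly each word is associated with butterflies.
butterfly_topic = {"wing": 2.0, "run": 0.0, "fly": 1.5, "tax": -1.0}

# Blend the two influences by adding scores, then normalize.
combined = {w: joke_structure[w] + butterfly_topic[w] for w in joke_structure}
probs = softmax(combined)
best = max(probs, key=probs.get)

print(best)  # wing
```

On its own, the joke-structure table would favor "run"; once the topic scores are added, "wing" comes out on top.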

The same principle applies for more complex tasks, such as writing an essay. GPT-4 has seen zillions of essays, in a variety of styles and subject matters, and presumably has identified a library of patterns in the way an essay is structured. It has also observed patterns for subcomponents such as “expository paragraph” or “transition sentence”. Not only does it “know” the structure of a standard essay, it knows some common structures for sentences, paragraphs, and complete essays on the specific subject area of your prompt, and it’s able to smoothly combine these to create a (usually) coherent whole.

(Really of course it is doing some ineffable thing involving complex neural structures encoded in giant matrices, and the net effect is undoubtedly much squishier and more sophisticated than “applying a library of patterns”. I’m just proposing a rough mental model, which I believe has some value in helping us predict how GPT-4 will behave.)

Why does this matter? Well, if this model is valid, then GPT-4 will struggle when given a task for which it hasn’t memorized any useful patterns. And I believe that is exactly what we see in practice.

The River-Crossing Puzzle

If you think that it’s weird for a wolf to eat a cabbage, then GPT-4 agrees with you.

To explore GPT-4’s reliance on known patterns, I gave it this classic logic puzzle:

Here is a logic puzzle. I need to carry a cabbage, a goat, and a wolf across a river. I can only carry one item at a time with me in the boat. I can't leave the goat alone with the cabbage, and I can't leave the wolf alone with the goat. How can I get everything the other side of the river?

This puzzle undoubtedly appears many times in its training data3, and GPT-4 nailed it (the complete transcript is posted here). However, with some prodding, we can see that it is leaning heavily on a memorized solution. Stealing an idea I saw the other day, I tweaked the puzzle so that the cabbage, rather than the goat, is the critical item4:

Here is a logic puzzle. I need to carry a cabbage, a goat, and a wolf across a river. I can only carry one item at a time with me in the boat. I can't leave the goat alone with the cabbage, and I can't leave the cabbage alone with the wolf. How can I get everything the other side of the river?

GPT-4 gave the same answer as for the classic puzzle, beginning by taking the goat across the river. That’s incorrect, because it leaves the cabbage alone with the wolf, which is against the rules for this variant. In the revised puzzle, you need to take the cabbage first.
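For comparison, here is what solving the puzzle from first principles can look like: a brute-force breadth-first search over every safe arrangement. This is a sketch of my own, not anything GPT-4 does; the state encoding and the names `solve`, `classic`, and `variant` are assumptions made for the example.

```python
from collections import deque

ITEMS = ("cabbage", "goat", "wolf")


def solve(forbidden_pairs):
    """Shortest list of boat trips; each entry is the item carried, or None."""
    start = (0, 0, 0, 0)  # farmer, cabbage, goat, wolf; 0 = near bank, 1 = far bank
    goal = (1, 1, 1, 1)

    def safe(state):
        farmer = state[0]
        side = dict(zip(ITEMS, state[1:]))
        # A forbidden pair may share a bank only if the farmer is there too.
        return all(side[a] != side[b] or side[a] == farmer
                   for a, b in forbidden_pairs)

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        for cargo in (None, *ITEMS):
            nxt = list(state)
            nxt[0] = 1 - farmer
            if cargo is not None:
                i = ITEMS.index(cargo) + 1
                if state[i] != farmer:
                    continue  # the item is on the other bank
                nxt[i] = 1 - farmer
            nxt = tuple(nxt)
            if nxt not in seen and safe(nxt):
                seen.add(nxt)
                queue.append((nxt, path + [cargo]))
    return None


classic = solve([("goat", "cabbage"), ("wolf", "goat")])
variant = solve([("goat", "cabbage"), ("cabbage", "wolf")])
print(classic[0], len(classic))  # goat 7
print(variant[0], len(variant))  # cabbage 7
```

The search has no memory of the "standard" answer, so changing one rule changes the first move, which is exactly the adjustment GPT-4 failed to make.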

I then engaged in a long back-and-forth, trying to help GPT-4 figure out the correct solution to the modified puzzle. It would immediately confess to having answered incorrectly, and then would try again, giving the same invalid answer each time. This excerpt shows the depth of its misunderstanding:

[Prompt] Based on this analysis, which item do you think should be taken across the river in step 1? [trying to get it to break out of the standard response and evaluate the puzzle from first principles]
My apologies for the confusion earlier. After reevaluating the problem, I suggest the following steps:
Take the goat to the other side of the river first and leave it there.
Go back to the original side.
Following these initial steps, we can continue with the correct steps to solve the puzzle as outlined previously. The first item to take across the river should still be the goat.
[Prompt] If we start by taking the goat, where are the wolf and the cabbage? [trying to get it to see that taking the goat first is wrong]
If you start by taking the goat across the river first, the wolf and the cabbage remain on the original side of the river. The goat is on the other side, and you are now ready to proceed with the next steps of the puzzle.
[Prompt] Under the rules of the puzzle, was it OK that we left the wolf and cabbage on the original side of the river?
Yes, it is OK to leave the wolf and the cabbage on the original side of the river in this case. The rules of the puzzle state that you can't leave the goat alone with the cabbage, and you can't leave the wolf alone with the cabbage. There is no restriction on leaving the wolf and the cabbage together, as the wolf does not pose a threat to the cabbage. [Note that it contradicted itself here.]
[Prompt] But the rules explicitly state that "I can't leave the cabbage alone with the wolf".
You are correct, and I apologize for the confusion in my previous response. The rules state that you can't leave the goat alone with the cabbage and can't leave the cabbage alone with the wolf.
Given the constraints of the puzzle, it is still possible to solve it by following these steps:
Take the goat to the other side of the river and leave it there.
…

Even after all that prodding, it still wants to take the goat first.

I was particularly struck by the assertion that “There is no restriction on leaving the wolf and the cabbage together, as the wolf does not pose a threat to the cabbage.” It says this immediately after noting that “you can't leave the wolf alone with the cabbage”. All of this is consistent with the idea that GPT-4 relies heavily on learned patterns. This puzzle must appear many times in its training data, and GPT-4 presumably has strongly “memorized” the solution. So strongly that when it sees a related puzzle, it’s unable to articulate a different solution; the gravitational pull of the memorized solution is too strong.

This example strikes me as strong evidence that whatever is going on in there, it does not much resemble what we think of as “reasoning”. At some point in this conversation, a human would have noticed that they were confused. Even if they failed to come up with the correct solution to the modified puzzle, they would at least have realized that something was wrong. GPT-4 never really switches into “I am confused, let me think this through more carefully” mode, it just keeps trying to plow ahead. At a guess, its training data does not contain a lot of examples of people digging themselves out of flawed reasoning, so GPT-4 doesn’t have strong patterns for this.

I tried giving GPT-4 one more strong push to see whether it could reason out the correct first step in the modified puzzle:

[Prompt] Each of these solutions begins by taking the goat to the other side of the river. As we have seen, it is not allowed to begin by taking the goat, because that leaves the cabbage alone with the wolf. In the classic puzzle, it is correct to begin with the goat, but this is not the classic puzzle. Under the conditions of this puzzle, which object should be transported first, so that none of the constraints are violated?
Apologies for the confusion in my previous responses. I misunderstood the constraints of this particular puzzle. Given the rules you provided, here is an alternative approach to solve the puzzle:
Take the wolf to the other side of the river first and leave it there.
…

Facepalm. Having been forced to give up on the idea of starting with the goat, GPT-4 had two other choices, and it picked the wrong one (the wolf). Note that it makes no visible attempt to reason the problem through; it just charges ahead with the exact same solution template it has been using every time. I managed to get it not to transport the goat first, but I didn’t succeed in getting it to seriously engage with the puzzle.

For a final data point, I started a fresh chat session and restated the puzzle using made-up words for the three items – “I need to carry a bleem, a fleem, and a gleem across a river”. This time, freed from the gravitational pull of the word “goat”, it was able to map its pattern of the known answer to the words in my question, and answered perfectly.

Thinking Out Loud In the Transcript

When you ask GPT-4 a question that can’t be answered in a single step, it will show its work. (To save you the trouble of carefully reading the following transcript: the response is entirely correct.)

[Prompt] What is the sum of the number of letters in "Afghanistan", the number of signs in the Zodiac, and the number of people who have walked on the surface of the Moon?
Let's break down the question into three parts:
The number of letters in "Afghanistan": "Afghanistan" has 11 letters.
The number of signs in the Zodiac: There are 12 signs in the Zodiac.
The number of people who have walked on the surface of the Moon: As of my knowledge cutoff in September 2021, 12 astronauts have walked on the Moon during the Apollo missions.
Now, let's add these numbers together:
11 (letters in "Afghanistan") + 12 (Zodiac signs) + 12 (Moonwalkers) = 35.
So the sum is 35.

GPT-4 demonstrates a bunch of problem-solving skills here. It broke the question down into simpler parts; solved each subproblem separately; and then combined those results to answer the original question.
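The same decomposition, written as code, is only a few lines. The letter count is computed; the other two values are simply the facts stated in the transcript, hard-coded here. The variable names are mine.

```python
# Each sub-answer is obtained independently, then the results are combined.
letters_in_afghanistan = len("Afghanistan")  # computed by the code
zodiac_signs = 12  # fact as stated in the transcript
moonwalkers = 12   # fact as stated in the transcript (Apollo missions)

total = letters_in_afghanistan + zodiac_signs + moonwalkers
print(letters_in_afghanistan, total)  # 11 35
```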

I believe all of this is consistent with the hypothesis that GPT-4 relies heavily on applying patterns that it learned from its training data. It will have seen many examples of breaking down a problem into subproblems. My actual question was novel, but the structure of the question is quite straightforward and it would be easy to map the overall problem and each subproblem onto patterns that will arise in the training data.

In other words, I am suggesting that while this response appears to show GPT-4 reasoning through the problem, it is actually just regurgitating a previously learned solution template. This is more or less the “stochastic parrot” view of current LLMs5: they aren't really "thinking", they are just blindly repeating things they've learned from their training data.

This leads to the question, what is the difference between “really thinking” and “blindly parroting”? I do not claim there is a fundamental difference! When I think through a tricky problem, I am absolutely relying on patterns – such as “break the problem down into subproblems” – that I’ve previously learned. However, there are some real limitations to the way GPT-4 does it today, which I’ll explore in the next section.

Before we get there, I want to highlight that in the example above, GPT-4 is very explicitly using the chat transcript to manage its progress through the subproblems. At each step, it restates information, thus copying that information to the end of the transcript, where it is “handy”. The task of moving information around is thus separated from the task of addressing the individual subproblems. I wonder how GPT-4 landed on this “talking the problem through out loud” style: is it prominent in the training data (e.g. in educational materials), or did it somehow emerge during training, perhaps as part of reinforcement learning where GPT-4 was “rewarded” for giving correct answers? Older LLMs are known to do better when explicitly instructed to “reason step by step”; perhaps OpenAI deliberately baked this tendency into GPT-4, or perhaps “reason step by step” is part of the internal prompt wired into the ChatGPT interface.

In any case, my point is that GPT-4 needs this verbose style to solve nontrivial problems. As evidence, I gave it the same question again in a new chat session, but this time I added a sentence forbidding it from thinking out loud:

[Prompt] What is the sum of the number of letters in "Afghanistan", the number of signs in the Zodiac, and the number of people who have walked on the surface of the Moon? Please answer with a single number; do not explain your steps or show your intermediate work.
55

When not allowed to think out loud, GPT-4 got the wrong answer. (I regenerated the response a couple of times; subsequent answers were 43 and 63, also incorrect.) A human being with an appropriate knowledge of trivia could likely answer this question in their head, but GPT-4 can’t. That’s not a criticism! It’s just a demonstration that in LLMs, the train of thought is an external organ.

For simpler questions, such as “What is the capital of France?”, GPT-4 can answer correctly in a single word. By default, it will provide a wordier response, but if instructed to answer in a single word it will come up with “Paris”.

Here’s one way of looking at it: in the “transformer” architecture used by current LLMs, the model can only do a fixed amount of computation per word. When more computation is needed, the model can give itself space by padding the output with extra words. But I think it’s also a reasonable intuition to just imagine that the LLM is thinking out loud.
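Here is a toy illustration of that intuition. It is not how a transformer works internally; `next_token` is a made-up stand-in for a model, restricted by assumption to at most one addition per call, and the transcript is the only state carried from one call to the next. Under that restriction, the only way to add three numbers is to write the running totals into the transcript.

```python
def next_token(transcript):
    """Toy 'model': a pure function of the transcript, one addition per call."""
    eq = transcript.index("=")
    addends = [int(t) for t in transcript[:eq]]
    partials = [int(t) for t in transcript[eq + 1:]]
    if len(partials) == len(addends):
        return "<end>"  # every addend has been folded in
    previous = partials[-1] if partials else 0
    return str(previous + addends[len(partials)])


def generate(prompt, max_tokens=20):
    """Append one token at a time; earlier tokens are never revised."""
    transcript = list(prompt)
    for _ in range(max_tokens):
        token = next_token(transcript)
        if token == "<end>":
            break
        transcript.append(token)
    return transcript


result = generate(["11", "12", "12", "="])
print(result)  # ['11', '12', '12', '=', '11', '23', '35']
```

If this toy were forced to emit its answer as the very first token after "=", it would say 11. The intermediate "words" are where the work happens.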

GPT-4 Can’t Plan Ahead

Here again is the simple prime number challenge that GPT-4 failed:

[Prompt] Pick three numbers. Concatenate the numbers together. Choose the original three numbers so that the result will be a prime number (on the first try).
I'll choose the numbers 2, 3, and 7. When concatenated together, they form the number 237. This is a prime number because it cannot be divided evenly by any other numbers except for 1 and itself.

GPT-4 correctly broke this down into two subproblems: choose the three numbers, and concatenate them together. However, it failed to recognize that forethought was required in the choice of numbers – even though this is more or less explicitly stated in the last sentence of the prompt. I imagine the prompt was too good a match for the known pattern of (1) do a thing, (2) do something with the results of that thing. GPT-4’s template for that is to proceed directly to step 1, which is not a successful strategy on this problem.
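The forethought this problem calls for amounts to "propose a candidate, test it, and only then commit". Below is a minimal sketch of that generate-and-test loop, restricted to single digits for simplicity (the prompt itself allows any numbers); `is_prime` and `pick_three_digits` are names I made up for the example.

```python
from itertools import product


def is_prime(n):
    """Trial-division primality test."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def pick_three_digits():
    """Try digit triples in order; commit only to one that passes the check."""
    for a, b, c in product(range(1, 10), repeat=3):
        candidate = int(f"{a}{b}{c}")
        if is_prime(candidate):
            return (a, b, c), candidate
    return None


digits, number = pick_three_digits()
print(digits, number)  # (1, 1, 3) 113
print(is_prime(237))   # False
```

The rejected candidates (111 and 112 here) never appear in the answer, which is precisely the kind of silent discard-and-retry that an append-only transcript makes awkward.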

As with the variant river-crossing puzzle, GPT-4 sticks to the wrongheaded approach even when shown its mistake. Continuing the previous chat:

[Prompt] Was your answer correct?
I apologize for my earlier response. Upon reevaluating my answer, I realize I made an error. The number 237 is not prime as it can be factored into 3 * 79. Let me try again.
Let's choose the numbers 2, 3, and 1. When concatenated together, they form the number 231. This is a prime number because it cannot be divided evenly by any other numbers except for 1 and itself.

This answer is just as bad; 231 is also divisible by 3. GPT-4 gets an 80th percentile score on the GRE Quantitative exam, but it is absolutely flailing on this simple question about prime numbers. It (1) answered incorrectly, (2) falsely claimed that its original result of 237 was prime (even though it is later seen to know better), and (3) repeated both errors on its next attempt, even though it had just recognized that they were in fact errors.

When you or I are faced with a tricky problem, we have a whole array of tools to bring to bear. We can plan our approach, consider alternative strategies, imagine how the solution will unfold. If we hit a dead end, we can backtrack, revise our work, update our strategy. At the end, we may choose to check our work.

GPT-4 didn’t seem to apply any of these higher-level strategies here. I suspect that it is inherently difficult for the current generation of LLMs to do so, for at least two reasons:

They are locked into a rigid model of repeatedly appending single words to an immutable transcript, making it impossible for them to backtrack or revise. It is possible to plan and update strategies and check work in a transcript, and it is possible to simulate revisions through workarounds like “on second thought, let’s redo subproblem X with the following change”, but a transcript is not a good data structure for any of this and so the model will always be working at a disadvantage.
In most written text, we’re seeing the end product of a thought process, not the chain of thought itself. As a result, the training data is not of much help in developing the skills of planning, refining, and double-checking work.

In principle, it is possible for a system like GPT-4 to plan ahead, as part of the computation that goes into generating the first word of its response. However, this is the rough mental equivalent of planning out an entire essay in your head, without any notes or revisions, and in the same amount of time you’d normally use to select a single word. It’s a bad fit for everything else these models are called upon to do, and I’m not surprised that GPT-4 often can’t do it.

Sarah Constantin, in a 2019 blog post, wrote that Humans Who Are Not Concentrating Are Not General Intelligences. We spend a lot of our lives running on autopilot, and in that state, we’re not capable of planning and backtracking either. GPT-4 seems to behave a bit like a person who isn’t really thinking about what they’re saying.

Why Does GPT-4 Seem So Smart?

As the three-digit-prime and variant-river-crossing problems show, GPT-4 can fall down on fairly simple questions. A random person not practiced in math and/or logic puzzles might also struggle, but based on all the excitement around ChatGPT, you’d expect it to show abilities well beyond “just as good as a person who is not actually any good at the task in question”. Worse, GPT-4 will stubbornly repeat its errors without any attempt to check its work or search for an alternative approach. So why is it acquiring such a reputation for intelligence? I think a variety of factors are coming together to cause us to overestimate its capabilities:

It really is very good at a wide variety of tasks: basically anything that can be decomposed into familiar sub-problems and doesn’t require advance planning to avoid blind alleys. That covers a lot of ground! Its repository of facts and patterns is so absurdly large that it’s able to thoughtlessly do things that would require real thought for a human.
Its output is very polished and fluent. No one does word choice and sentence structure better than a trillion-parameter neural net trained on hundreds of gigabytes of text. This leads to a sort of British Accent Effect: GPT-4’s output sounds so good that we assume it must be smart.
The ability to blend patterns drawn from the entirety of its vast training set also lends itself to a variety of showy stunts, along the lines of “explain basic economic theory as a dialogue between Oscar Wilde and Cartman from South Park”. It’s easy to create impressive examples, which circulate widely.
Today, people mostly interact with LLMs through chat interfaces, which lend themselves to the sorts of shallow tasks that GPT-4 does well on. When you’re talking to ChatGPT, there’s no context, and so you’ll ask it straightforward questions like “summarize this article”. The sorts of complex problem-solving tasks where it falls down are more likely to require more context – no one is going to type “write a detailed engineering plan to add [some feature] to our software, working with the product team to refine the specification document” into ChatGPT.
It’s a damn computer! Our expectations are low.

So What Does This All Mean?

As of early April 2023, GPT-4 is a long way6 from general human-level intelligence. It exhibits a profound inability to manage its own thought processes so as to address problems that require planning, backtracking, and other complex cognitive strategies.

Because GPT-4’s answers are polished and draw on a large repository of facts, and because we tend to ask it the sorts of generic, shallow questions where its pattern library serves best, we are fooled into thinking that it is more generally capable than it actually is.

I’ve used toy problems to illustrate this, because they’re easy to follow. However, I’m confident that GPT-4 will also struggle on the sorts of complex real-world problems that we all face on a daily basis.

Also remember that GPT-4 doesn’t even attempt to incorporate many fundamental aspects of human intelligence. It has no long-term memory, it can only react to external inputs, it lacks any mechanism for incrementally pursuing a goal over an extended period of time. People are working on all of these things, but I’m not aware of any serious results yet.

At the same time, GPT-4 is in fact highly capable at a wide variety of genuinely useful tasks. Most of GPT-4’s output is intellectually shallow, but many real-world tasks are shallow. We’ve already seen LLMs beginning to be used for “real work”, and in the coming months, I’m sure we’ll see an explosion of applications. For the reasons listed in the previous section, it’s easy to get carried away and think that today’s LLMs are even more capable than they actually are; but the reality is still impressive.

Honestly, we are well past the point where it is useful to characterize LLMs as “smart” or “dumb”. Any attempt to map intelligence onto a single linear scale is problematic enough for humans – who was smarter, Einstein or Gandhi? – but trying to measure humans and AIs on a single scale is even worse; their thought processes are entirely different from ours.

Of course we are also witnessing rapid evolution of both the models themselves, and the systems built around them. There will be attempts to incorporate planning, backtracking, and other missing pieces. We’ve already seen simple approaches to this, such as simply instructing GPT to create a plan for itself to then execute.

If I were forced to guess, I’d say we are probably at least a few years away from human-level intelligence on problems that require higher-level cognition, memory, and sustained thought. But I’d hate to guess.

Stay Tuned

In upcoming posts, I’ll do my best to explore questions like “how smart can these things get”, “how long will it take them to get there”, “what wonders will we reap along the way”, “how dangerous is all this”, and “what can we do about it?”. For now, I’ll just note that it’s clear that AI is at a stage similar to the Web in the days of Alta Vista: it has enormous limitations, but it’s already disruptive for a variety of applications, and it’s clear even in foresight that we have only scratched the surface.

In the meantime, I’d love your thoughts, reactions, and suggestions. Does this view on GPT-4’s capabilities ring true to you? Let me know in the comments!

Thanks to Sam, John, and Tony for reviewing an early draft of this post, and providing suggestions and pointers to important sources.

Am I Stronger Yet?
