Great article Steve. 40 years as a software engineer is super impressive and beats me by around 5 years. I have lived the experiences in your essay many times over (and continue to do so).

I use GPT (4) daily as a coding, writing and research assistant (and yes, it's impressive) but it has a long way to go before I'd promote it to Senior Software Engineer.

I think your article demonstrates well the hoops software engineers jump through on a daily basis and that the devil is always in the detail!

Thanks!

Also quite interesting article. I really liked all the specific examples at Writely, not too technical but specific enough so I could follow it all.

I'm eager to hear you talk more, in a future post, about the pattern where no one thought that LLMs would be able to do "X" three years ago because it would supposedly require a level of understanding/reasoning/real-world experience/sophistication/creativity/etc. that was impossible for LLMs, and then three years later it turns out that they can do "X", and everyone now explains, with 20/20 hindsight, why in fact it didn't require anything special to do "X" after all. So could many be making the same mistake now? For example, looking at how LLMs currently can't really reason about novel problems and assuming that is an intrinsic limitation, when in three years' time it might be fixed just through scale.

I'm looking forward to your future posts.

This is the fifty trillion dollar question. If predictions that "AI won't be able to do X in the next Y years" keep being wrong, does that mean that all such predictions are wrong? Probably not, but then, how do we figure out which predictions are credible?

I am going to try to think this through as best I can, but, like, don't get your hopes too high.

> One moment, it’s giving what appear to be thoughtful answers; then you point out a mistake, it responds by making the exact same mistake again...Last time, we saw GPT-4 blatantly hallucinating prime numbers. Today, in the “diff” coding task, it repeatedly failed to recognize that it was violating the constraint regarding examining a document one character at a time. I don’t know how well I can articulate the thing that is missing here, but it’s some very deep piece of intellectual mojo.

Given that GPT-4 is so good at fixing its own mistakes when they are pointed out, and that it is so good at memorizing things like lists of prime numbers, one should indeed be suspicious, and I would point out that both of these sound exactly like BPEs (https://gwern.net/gpt-3#bpes) and sparsity errors (https://old.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdjsx43/), neither of which are deep or at all interesting. (You asked GPT-4 to *concatenate* numbers? Or to edit words *letter by letter*? Seriously? At least include a disclaimer about BPEs!)

Thanks for the feedback! My understanding of deep learning in general, and the underpinnings of current LLMs in particular, is fairly shallow, so pointers like this are helpful.

That said, having read the links, I'm not sure these issues explain the particular mistakes I saw GPT-4 making. Let me explain briefly, and I'd love your thoughts.

Editing words letter by letter: I did not ask GPT-4 to do this directly, but rather to write JavaScript code to do it. I tried this twice, the chat transcripts are posted at https://github.com/steve31415/amistrongeryet/blob/main/chat-transcripts/writely/string-comparison-transcript-1.md and https://github.com/steve31415/amistrongeryet/blob/main/chat-transcripts/writely/string-comparison-transcript-2.md, but I'll summarize the key points here. Here is the initial prompt I used:

> You are an experienced and talented software engineer, with a thorough knowledge of published algorithms, as well as a knack for finding out-of-the-box solutions to problems that have not been solved previously. Please write a JavaScript program to compare two strings, find the difference between them, and produce a compact representation of that difference. The representation must contain enough information to allow the second string to be reconstructed from the first string. This code needs to be able to run on old browsers such as IE 6, which had very inefficient JavaScript engines, so the code can't do much work; it must not iterate through each character individually. It is not necessary that the representation be optimally minimal, we can use an approach that creates a larger-than-optimal representation of the difference if this allows for fewer execution steps in the code. Also, don't rely on any libraries, you can only use the basic functions built into JavaScript.

The key point here is that it's necessary to find an algorithm where the number of JavaScript execution steps is less than O(n), by finding a way to push the actual O(n) complexity of the problem down into the primitive operations of the JS engine (e.g. use substring operations to compare multiple characters at once). GPT-4 kept proposing algorithms which violated that constraint. It did, at one point, come up with the good idea of using a binary search, but its first attempt at using a binary search was flawed, and when I pointed out the flaws, it dropped the idea. As I summarize at the end of the first chat transcript:

> GPT-4 successfully happened upon all of the important elements for a solution: working from both ends, comparing multiple characters at once, and using a binary search to efficiently manage its steps. However, it showed no sign of being able to combine these elements into a single solution; instead, it seemed to be flailing around somewhat randomly.
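For what it's worth, here's a sketch (my own illustration, not anything GPT-4 produced) of how those elements combine: binary-search for the longest common prefix and suffix using whole-substring comparisons, so the script-level loop runs O(log n) steps while the engine's native string primitives do the actual O(n) scanning.

```javascript
// Sketch: diff two strings without examining characters one at a time.
// Substring equality is evaluated natively by the JS engine.

// Length of the common prefix, found by binary search.
function commonPrefix(a, b) {
  var lo = 0, hi = Math.min(a.length, b.length), mid;
  while (lo < hi) {
    mid = Math.ceil((lo + hi) / 2);
    if (a.substring(lo, mid) === b.substring(lo, mid)) {
      lo = mid;      // prefix extends at least to mid
    } else {
      hi = mid - 1;  // first mismatch lies before mid
    }
  }
  return lo;
}

// Length of the common suffix: same idea, working from the ends.
function commonSuffix(a, b) {
  var lo = 0, hi = Math.min(a.length, b.length), mid;
  while (lo < hi) {
    mid = Math.ceil((lo + hi) / 2);
    if (a.substring(a.length - mid, a.length - lo) ===
        b.substring(b.length - mid, b.length - lo)) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}

// Compact diff: everything between the common prefix and suffix.
function makeDiff(oldStr, newStr) {
  var prefix = commonPrefix(oldStr, newStr);
  var suffix = commonSuffix(oldStr.substring(prefix),
                            newStr.substring(prefix));
  return {
    start: prefix,
    removed: oldStr.length - prefix - suffix,
    insert: newStr.substring(prefix, newStr.length - suffix)
  };
}

function applyDiff(oldStr, d) {
  return oldStr.substring(0, d.start) + d.insert +
         oldStr.substring(d.start + d.removed);
}
```

(A diff of this shape is larger than optimal when there are multiple separated edits, but the prompt explicitly allowed that trade-off in exchange for fewer execution steps.)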

As for concatenating digits to produce a prime number: FWIW, I've run this little exercise many times and GPT-4 never has the slightest problem concatenating the digits. The problem is that after naming and concatenating the digits, it then makes an incorrect statement regarding the concatenated number it had just named. For instance:

> I'll choose the numbers 2, 3, and 7. When concatenated together, they form the number 237. This is a prime number because it cannot be divided evenly by any other numbers except for 1 and itself.

Of course 237 is not prime, it is 3 * 79, which GPT-4 knows very well in other contexts. There's a short chat transcript at https://github.com/steve31415/amistrongeryet/blob/main/chat-transcripts/prime-number.md which shows repeated mistakes along similar lines.

I could certainly imagine that difficulties with digit concatenation make it hard for GPT-4 to start out on the right foot on this prompt. But would that explain why it goes on to say that 237 is prime? (Later in the transcript, it repeats the same error for 231 and 235.)

Interestingly, I happened to try this prompt again yesterday. For whatever reason, it always seems to go for 237 out of the gate. But on two out of three attempts yesterday (regenerating the response each time, not appending into one long chat session), it successfully noticed that 237 was not prime. The third time, it went through a long process of checking all potential prime factors below sqrt(237), which of course is a good idea, except that the output included "237 ÷ 3 = 79 (not divisible)" – i.e. it divided by 3, got a whole-number result, and then incorrectly stated that 237 is *not* divisible by 3.
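For the record, the mechanical check GPT-4 was narrating is trivial; a few lines of trial division (my sketch, in JavaScript since that's the language elsewhere in this thread) immediately surface the factorizations it kept missing:

```javascript
// Trial division up to sqrt(n): the procedure GPT-4 attempted in prose.
// Returns the smallest factor >= 2, which is n itself when n is prime.
function smallestFactor(n) {
  for (var d = 2; d * d <= n; d++) {
    if (n % d === 0) return d;
  }
  return n;
}

smallestFactor(237); // 3, because 237 = 3 * 79: not prime
smallestFactor(231); // 3
smallestFactor(235); // 5
```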

The downstream effects of these problems are not always easy to predict without a very deep understanding. For example, would you predict that BPE tokenization would mean that ChatGPT or GPT-4† always tries to write rhyming poems, and that if you explicitly ask it, "Write a poem that doesn't rhyme", it will write a rhyming poem anyway? No, of course not. But nevertheless, that is what happens as you go from base davinci to ChatGPT; and with the benefit of hindsight, knowing how RLHF works and observing how base davinci writes rhyming poetry using solely memorized rhyme pairs, you can begin to see why it would happen.

Rhyming and explaining puns and counting characters are good examples because they are such simple objective properties, which are so clearly far below all GPT models' average level of capability (even GPT-2 ought to be able to rhyme or count characters), and so clearly broken in all GPTs. However, there is no reason to think that erasing so much of the semantics of 100% of the training data would cause *only* those problems. Those are just what's under the lamp post: so blatant that many GPT users will stumble over them on their own. Obviously, it should cause much more profound problems, it's just that outside OA it's quite hard to investigate: I suspect that BPEs cripple humor understanding, for example, because it's forced to try to understand a large amount of humorous writing which is inherently nonsensical when BPEs erase the phonetics and training on inherently nonsensical data is *usually* a bad thing to do to a model - but how would I prove that in the absence of a character-tokenized GPT-3 or GPT-4?

Ditto for the sparsity problem. Clearly it works well on average and the loss is nearly the same... but there is no free lunch, and what is the qualitative cost? None of the sparsity papers attempt to characterize that at all, much less consider users adversarially poking at it to see what fails. What does it do to the model to train it in such a way that critical data like individual tokens just... disappear during the forward pass and are 100% inaccessible to later layers?

Well, you will probably get bizarre behaviors: always writing rhyming poems even when explicitly instructed not to, repeating errors no matter how many times they are pointed out, cycling between errors, and confabulating 'explanations' for blind spots, sampling justifications like '231 and 235 and 237 are all prime, totally'.

My advice is to just avoid making too much hay of anything that looks like it is vulnerable to either BPEs or sparsity. They may be something fundamental and interesting... but they probably aren't, and if you have to mention them, you ought to include a big asterisk.

> The third time, it went through a long process of checking all potential prime factors below sqrt(237), which of course is a good idea, except that the output included "237 ÷ 3 = 79 (not divisible)" – i.e. it divided by 3, got a whole-number result, and then incorrectly stated that 237 is *not* divisible by 3.

Yeah, I mentioned that inner-monologue approaches like that ought to be able to partially bypass sparsity problems (but not BPEs) simply by spreading out & repeating so much. But then you have the issue that an error anywhere in it can wreck it, and then you need to run inner-monologues in parallel or use self-criticizing inner-monologues, which is a lot more prompting/sampling than most people want to do.

† I just checked GPT-4 to see if the RLHF did the same thing, and [sigh] it does:

> **Write a poem which doesn't rhyme.**

>

> In the twilight of a somber eve,

> A quiet hush begins to weave

> Through the gnarled arms of ancient trees,

>As shadows grow long and whispers cease

> ...

My original goal was to find a prompt that was simple and yet required a bit of planning ahead to reach a correct answer. To get away from BPE issues, just now I tried the following alternative prompt:

> Pick three two-digit numbers. Add them up. Choose the original three numbers so that the sum will be a prime number (on the first try).

I sampled five runs, and it never hit on the idea of first choosing a small three-digit prime number, and then working backwards to find three two-digit numbers that sum to it. This is probably not surprising, and no doubt I could elicit that behavior if I tried a few different prompts along the lines of "plan ahead" or "come up with a plan for success" or something; the point I was originally trying to make is simply that GPT-4 doesn't automatically reach for such tools. (Not a novel observation, of course.)

For some reason, GPT-4 seems to interpret this prompt as requiring the initial three numbers to all be prime, as well as the final sum being prime. On two of the five runs, it claimed that the task is impossible; here's the more succinct (and bizarre) of those two responses:

> It is impossible to pick three two-digit numbers that will add up to a prime number on the first try. This is because the sum of three integers will always be an even number if at least two of them are odd, and at least one of them must be even since there are only 45 two-digit odd numbers (between 11 and 99).

>

> Since prime numbers greater than 2 are always odd, it is impossible to obtain a prime number as the sum of three two-digit numbers.

On one run, it generated 29 + 31 + 41 = 101, which works. On one run, it went through three different attempts, yielding 99, 125, and 119; it noticed that none of these are prime, and gave up. "The process may take several tries, but eventually, a combination that satisfies the condition could be found." And on the last run, it also went through three different attempts, ending again with 119, but this time it claimed 119 was prime.
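The planning-ahead strategy I was fishing for is simple enough to write down explicitly (my own sketch, not GPT-4 output): fix a prime target first, then split it into three two-digit numbers, so the sum is prime by construction rather than by luck.

```javascript
function isPrime(n) {
  if (n < 2) return false;
  for (var d = 2; d * d <= n; d++) {
    if (n % d === 0) return false;
  }
  return true;
}

// Work backwards: given a prime target in the reachable range
// [30, 297], split it into three two-digit numbers a + b + c.
function pickThreeSummingTo(target) {
  for (var a = 10; a <= 99; a++) {
    var rest = target - a;            // b + c must equal rest
    var b = Math.max(10, rest - 99);  // push b as low as c allows
    var c = rest - b;
    if (b <= 99 && c >= 10 && c <= 99) return [a, b, c];
  }
  return null; // target not reachable from three two-digit numbers
}

var target = 101;                         // any prime in range works
var triple = pickThreeSummingTo(target);  // e.g. [10, 10, 81]
// triple sums to a prime on the first try, by construction
```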

It sounds like I need to read up on sparsity and the issues it can cause. I'll pursue the links in your original reference (https://old.reddit.com/r/slatestarcodex/comments/1201v68/10word_quote_a_short_and_simple_failure_mode_of/jdjsx43/). Thanks again!

Yeah, it has so many strange blindspots like that which seem far below what it should be capable of. It's not a passing error: it repeats it and then conditions on the generated confabulations to lock it into place even more firmly, so it can't recover. Unfortunately, there's no easy way to show that that's due to BPEs or sparsity, because they've screwed up the model's learning & fundamental representations - it's not a matter of simply padding with spaces and hey presto, now it works.

This is why I'm a little hopeful OA may turn to fixing BPEs in the next iterations. With a context window of 32k+, you don't *need* BPEs, but you do need reliability and quality more (those 32k-context calls are damned expensive), and you need to get rid of these unfixable, undebuggable, unpredictable bizarre errors which keep biting unsuspecting users. I predict that a whole swathe of apparently-unrelated errors will just go away with character tokenization and the output will be subtly but unmistakably better in general.