My personal journey down the AI rabbit hole was instigated by the phenomenal capabilities of GPT-4. Now there are rumors of GPT-5:
They started training in Jan...training ended in late March. But now [they] will do safety testing for 3-6 months before release.
This isn’t necessarily a reliable timeline[1]. But it seems likely that we’ll see a major new model from OpenAI in the second half of this year, or possibly early next year. I thought it would be fun to get out ahead with some predictions for GPT-5. That way, when the actual model is revealed, we can all enjoy a good laugh at my expense.
It doesn’t take a crystal ball to predict that the GPT-5 launch will make a big splash. But what will it tell us about the path to transcendent AI? I’ll present some things to watch out for.
Three Models For AI
Expectations for the current wave of AI vary wildly, but can be placed into three buckets:
1. Snake Oil: large language models (LLMs) are unreliable, mostly suited for low-value work such as churning out crappy news articles and cheating on high school essays. As people realize this, the hype cycle will fizzle.
2. Merely Important: LLMs and other generative AIs are an important new technology. Like, say, microprocessors or the Internet, they will obsolete some industries, create many others, and generate enormous economic value.
3. Transcendent: AI isn’t just another new technology, it is the ultimate technology. It won’t just replace some jobs, it will replace all jobs, forever. It will lead to ever-accelerating progress, because at some point it will build itself.
I believe (2) and (3) are both correct, but on different time scales.
Nothing Has Actually Happened Yet
For perspective, it’s important to note that the current AI wave has had approximately no impact on the world so far.
I mean, sure. ChatGPT inspired a zillion AI startups. Nvidia’s stock price is through the roof. Book publishers have had to deal with a flood of spammy machine-written manuscripts. There was a report of a company replacing a few hundred customer support agents. The US is restricting chip sales to China. But all of this amounts to either:
Things done in anticipation of future impact, or
Penny-ante shit.
AI certainly feels big, especially if (like me) you spend a lot of time following developments. But consider that in 2023, The Super Mario Bros. Movie took in more revenue than OpenAI. Most of what we’re getting caught up in is demos, early trends, future projections, and people’s reactions thereto. The actual impact on the economy has been negligible. What we have seen so far is merely the early glimmerings of what will eventually be an impactful technology.
It’s Mostly Hype, Until It Isn’t
In the late '90s it was obvious that the Internet was going to have an enormous impact, but it wasn't entirely obvious what that impact would be or how quickly it would unfold. This period saw the beginnings of giant businesses such as Amazon and Google; it also saw a lot of irrational exuberance, and massive flops like Excite and WebVan.
We’re now at a similar point in the development of AI. Are there economically meaningful applications of generative AI yet? Zvi Mowshowitz’s summary of a recent Twitter thread nicely captures the mixed signals:
Timothy Lee: The last year has been a lot of cognitive dissonance for me. Inside the AI world, there's non-stop talk about the unprecedented pace of AI improvement. But when I look at the broader economy, I struggle to find examples of transformative change I can write about.
AI has disrupted professional translators and has probably automated some low-end customer service jobs. AI makes programmers and lawyers more productive. But on the flip side, Amazon just scaled back Just Walk Out because it wasn't working well enough.
Nick McGreivy: Seeing the same thing in science: non-stop talk about how AI is accelerating science, tons of papers reporting positive results, but I struggle to find examples of it being useful for unsolved problems or applications. A few exceptions (Alphafold, weather models), but not many.
Ethan Mollick: I am talking to lots of companies doing impressive things internally (most keep it secret).
It has only been 16 months and social systems change [much] slower than technology. We could have AGI and most people’s jobs won’t change that fast.
Timothy Lee: Impressive like “wow that’s a great demo” or impressive like “wow we just boosted profits $100 million?”
Ethan Mollick: Haven’t seen $100M boost. That would be a pretty big change. But also beyond demos. Actual use at scale to solve tricky issues. It is really hard for large companies to move on major projects in a few months. I suspect you will see a lot of public stuff soon.
Today we see some genuinely useful applications, along with a slew of use cases that aren’t quite ready for prime time, many of which are being massively over-hyped. Some people look at this and see a hype bubble, akin to NFTs. I think a better analogy is the late-90s Internet: the impact isn’t really here yet, but it’s coming. Some high-profile projects are going to fizzle, but the Amazons and Googles of the AI era are being created today. (It is worth noting that while the Internet revolution mostly created new winners, the “just another technology” phase of the AI revolution may substantially enrich existing tech giants.)
If AI is currently “just” an important, emerging new technology, when does it cross over into unprecedented territory?
Waiting For Agency
The Transcendent idea is based on the anticipation of agentic AI: systems that can flexibly pursue long-term goals, operating independently in the real world. If you want a computer to book your weekend getaway, run a marketing campaign, or take over your business while you hang out on the beach for the rest of your life, you need agentic AI.
We don’t have anything like that yet. The systems available today – chatbots, image generators, and so forth – are still just tools. A new kind of tool: unusually flexible, capable of handling tasks that were previously inaccessible to software, programmable in plain English; but still just tools. (They’re also unpredictable, insecure, and unreliable.)
This “just a tool” description applies to the current generation of language models and chatbots. It also applies to vision models, text-to-speech, speech-to-text, image synthesis, video synthesis, music synthesis, voice cloning, deepfakes, timeseries prediction, DNA models, and chemistry models. All of these things are useful; all are, to some degree, revolutionary. But not the Transcendent kind of revolutionary.
When AIs can successfully pursue long-term goals – planning, reacting, adapting, interacting, solving problems, asking for help, coordinating with people and other AIs – that’s when the world begins to change. That’s AGI. That’s when AI stops being a tool that can be used for tasks and starts being an independent entity that can hold down a job. That’s when we need to start asking questions like whether AI will subsume all jobs instead of just obsoleting some and creating others.
(I’m glossing over robotics here. I suspect that by the time we have agentic AI, we’ll also have decent robot bodies and the means to operate them.)
A key question, then, is what the path to agentic AI looks like.
What Are We Missing To Build Agentic AI?
People often say that the reason we don’t have AI agents today is that current models (like GPT-4) aren’t sufficiently reliable. If a job takes 30 steps, and a model is 95% reliable at each step, it has only a 21% chance of completing the job successfully. To create agentic AI, people say, all we need is to make language models more “reliable”.
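For anyone who wants to check that arithmetic, here is a minimal sketch in Python (assuming, as above, that each step succeeds or fails independently with the same probability):

```python
# Quick check of the compounding-reliability arithmetic above: the chance of
# completing a multi-step job, assuming each step succeeds independently.
def job_success_probability(per_step_reliability: float, num_steps: int) -> float:
    return per_step_reliability ** num_steps

print(f"{job_success_probability(0.95, 30):.1%}")  # ~21.5%, matching the figure above
print(f"{job_success_probability(0.99, 30):.1%}")  # ~74.0%; even 99% per step leaves roughly a one-in-four failure rate
```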
I don’t think this is a helpful framing. Agency requires the ability to define a plan, carry it out, maintain context, adapt, self-correct, and generally operate independently in the real world. Lumping all this under “reliability” is like saying that all I’d need to start playing in the NBA is to be “better at basketball”: in reality, I would have to acquire a wide range of skills, many of which are not feasible given my current architecture. Similarly, I think agentic AI will require fundamental advances in the following areas:
Memory: Human memory is a complex and supple thing, and current language models, even with ever-growing context windows and “retrieval-augmented generation”, aren’t anywhere close to our level.
Exploration: current models are trained to produce a final product in a single pass; they aren’t really taught to iterate, revise, or replan. People are attempting to work around this, with techniques such as “chain of thought”, “tree of thought”, and asking models to critique their own output, but as with memory, this is being imposed from outside the model, rather than being natively integrated into the training process, and I don’t think that will work.
Creative insight: AI output is notoriously bland, workmanlike, uninspiring. LLMs are capable of remarkable stylistic feats, but never yield deep insights or fundamentally new ideas.
Robust reasoning: language models are prone to reasoning errors and hallucinations. They are vulnerable to manipulation, such as prompt injection and jailbreaks.
Recent progress in language models has primarily been a product of increasing scale: creating larger models (more “neurons” and connections), and training them on more data. There have been some changes in architecture[2], but so far these have not resulted in any fundamental changes in the way the models operate; they are merely a way of increasing scale, efficiency, and memory space (token buffer size). In other words, a 2024 language model is primarily just a bigger, better version of a 2018 language model.
I don’t think mere scale will get us to AI agents. Scaling up an LLM won’t allow it to form new memories any more than it will allow it to pop out of the screen and walk across the room[3]. Exploration also seems like something that will need to be taught explicitly, and I’m not sure we have appropriate training data[4]. Insight and reasoning may also require more than mere scale; here are some thoughts on why gullibility is so endemic in LLMs.
We won’t have agentic AI until these gaps have been addressed, and new breakthroughs will be required. I predict that it will take a long time – say, at least 5 and probably 10 to 40 years – for robust (human-level or above) agentic AI to emerge. Agentic AI is the sort of complex problem on which it’s difficult to make very rapid progress, because there are a lot of things to address and it's hard for progress on any one dimension to get far ahead of the others[5].
We’re Going To Keep Scaling
AI labs keep building larger and larger models, squeezing new capabilities out of essentially the same architecture. Things seem likely to continue in a similar vein for at least the next few years. I haven’t seen any radical new architectures on the horizon that promise to break the pattern[6]. It seems telling that all of the important models of the last few years appear to be climbing the exact same hill in very similar ways and at roughly the same pace[7].
It does seem likely that the current scaling trend has some ways to run. It is being driven by three factors: better algorithms[8], more cost-efficient chips, and companies spending ever-larger amounts to train their models. None of those trends seems ready to top out. People speculate as to whether we can keep finding more training data, but the big AI labs all seem confident they have a handle on the problem[9], and for the moment I’m inclined to believe them.
GPT-3.5 and GPT-4 were, basically, larger versions of GPT-3. They surprised everyone with a wide array of new capabilities. I’m sure the next wave of even-larger models will be even more capable and reliable. However, scale alone won’t fundamentally change the nature of the models, and won’t address missing capabilities such as memory, exploration, or (probably) insight[10].
What To Expect From GPT-5
The three models for AI tend to align with three different expectations for GPT-5:
1. Plateau: it’s been over a year since GPT-4 was launched, and neither OpenAI nor any competitor has managed to significantly improve on its capabilities. The rapid progress from GPT-1 through GPT-4 has come to an end, perhaps because we have exhausted the potential of the available training data. Things will move much more slowly from this point forward.
2. Steady progress: GPT-5 will represent as much of an advance over GPT-4 as GPT-4 was over GPT-3 (i.e. enormous). Sam Altman is on the record as expecting this[11].
3. Incipient AGI: cutting-edge labs like OpenAI are beginning to use LLMs to accelerate their own work. Progress will only speed up from here, and the singularity may not be far off.
I expect that, with regard to the core “strength” of the model, reality will fall somewhere between (1) and (2). I can’t offer much in the way of coherent arguments; I just get the sense that we may be approaching diminishing returns from simply shoveling in more data[12].
Many people have had that intuition in the past, and so far, they’ve always been wrong. Maybe I’m wrong too. Up to this point, language models, vision models, speech models, and many others have improved dramatically whenever we train them on more data. If that trend continues, then we’ll be in scenario 2: steady (rapid) progress.
Either way, we should expect substantial improvements in the sorts of core model capabilities that seem to naturally improve with scale: breadth and depth of knowledge, reasoning ability, hallucination rate, accuracy at tasks such as summarization. It’s just a question of how quickly this will progress. Probably some surprising new capabilities will emerge; I couldn’t guess what they might be.
I’m more confident that GPT-5 will not push us into scenario 3. I think that meaningful “recursive self-improvement” – AIs accelerating the development of AI – will require something more like agentic AI, and as I explained above, I don’t think we’re on the cusp of that yet. Even if we were, there are other likely bottlenecks on progress, such as data, chips, and power for data centers.
In any case, whatever GPT-5 offers in terms of core strength, I strongly expect that we will see a lot of peripheral improvements – improvements to OpenAI’s overall product offerings that fall outside of the core language model. Sam Altman may have this in mind when he claims GPT-5 will represent as big a leap as GPT-4 did.
Peripheral Improvements in GPT-5
OpenAI may not have made any fundamental improvements in model strength since March 2023, but they’ve found ways to move forward without a fundamentally new model. They’ve added image analysis, the ability to customize ChatGPT, a simple form of long-term memory, and many, many other features. There is ample room for more, and we may see some of these peripheral improvements bundled into the GPT-5 launch. This could make GPT-5 look like a major advance even if the model itself is not drastically more capable than GPT-4.
I expect OpenAI to offer most of the following in the next year (though not necessarily bundled into the GPT-5 announcement):
Multiple sizes of the same model (a la Claude Haiku, Sonnet, Opus). The smaller variants would likely offer better price/performance than GPT-4.
Increased multi-modality, allowing the model to analyze and generate most or all of text, voice, non-voice audio, images, and video, and possibly other data types (rich-text documents? structured data? timeseries data? medical data?).
New options for fine-tuning (or otherwise customizing) models.
New ways of integrating with the model (including advances in tool use) and getting data in / out. This may include support for asynchronous use cases – allowing the model to work for an extended period of time and notify the user / application when it finishes[13].
New pricing models, including additional options for applications that have higher or lower speed requirements (the recently announced “batch mode” is the first example of this; see the sketch after this list).
Advances and additional options at the product level (ChatGPT, and possibly other new applications provided by OpenAI) – custom GPTs being an early example of this sort of progress.
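To make the “asynchronous” and “batch mode” items above concrete, here is a minimal sketch of the batch workflow as described in OpenAI’s initial Batch API announcement; parameter names reflect the API as announced and may have changed since, so treat this as illustrative rather than definitive:

```python
# Minimal sketch of the asynchronous "batch mode" workflow: requests go into a
# JSONL file, the batch runs offline within a completion window, and the caller
# checks back later instead of waiting on a synchronous response.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Upload a JSONL file of chat-completion requests (one request per line;
#    see the Batch API docs for the exact per-line schema).
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# 2. Submit the batch; results arrive within the completion window.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Later: poll for status and fetch the results file once it is done.
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    results = client.files.content(status.output_file_id)
    print(results.text)
```

Whether GPT-5 layers richer notification hooks or longer-running jobs on top of this is, of course, speculation.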
Such peripheral improvements will enable a substantial increase in the range and richness of applications built on GPT-5. Meanwhile, application developers won’t be standing still: they’ll be introducing new product designs, improving integration into the workplace, and generally innovating in a thousand different ways. Whatever the rate of improvement of core model strength, we can expect to be dazzled – and distracted – by rapidly maturing applications.
Will GPT-5 Represent Meaningful Progress Toward Agentic AI?
This is the big question.
OpenAI has openly discussed plans for developing “agents”, and I’m sure we’ll see developments here, whether as part of the GPT-5 launch or separately. But there are many different forms this could take, with different implications.
There are systems billed as “AI agents” available today, but they are limited to simple, low-stakes tasks[14]. If OpenAI releases something along these lines, it may come with some flashy demos, but it won’t represent a fundamental advance. Two things I’ll be watching for:
Specific training on agentic tasks. The training data for existing models doesn’t seem to include much in the way of transcripts of agents (human or otherwise) carrying out extended tasks. The training data – books, web pages, etc. – consists of final products, not the path for getting there. If OpenAI can find a substantial source of “task transcript” data (perhaps from videos; or perhaps artificially generated data), that might show up as an ability to course-correct and recover from mistakes in extended tasks.
Improved memory. I’m sure that by the GPT-5 release, OpenAI will have further increased the maximum token buffer size, possibly even making it unlimited. They will also probably offer more options for bringing data into the token buffer. The question will be whether it can record new memories, and move information in and out of its short-term memory in a sophisticated fashion, as I discuss in an article linked earlier.
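For contrast, here is a toy sketch of the kind of bolt-on external memory that exists today: retrieval stapled onto the prompt from outside the model, the approach that the footnote on RAG argues is not sufficient. Everything in it (the ExternalMemory class, the word-overlap scoring) is illustrative rather than anything OpenAI has described:

```python
# Toy "external memory": store notes, retrieve the most relevant ones by crude
# word overlap, and paste them back into the prompt. The retrieval lives
# entirely outside the model; the model itself forms no new memories.
from collections import Counter

def _bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())

class ExternalMemory:
    def __init__(self):
        self.entries: list[str] = []

    def remember(self, text: str) -> None:
        self.entries.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        query_words = _bag_of_words(query)
        # Rank stored entries by how many words they share with the query.
        ranked = sorted(
            self.entries,
            key=lambda entry: sum((query_words & _bag_of_words(entry)).values()),
            reverse=True,
        )
        return ranked[:k]

memory = ExternalMemory()
memory.remember("The user prefers window seats on morning flights.")
memory.remember("The Q3 marketing budget is capped at $50k.")

# Retrieved notes get prepended to the prompt; nothing inside the model changes.
notes = "\n".join(memory.recall("book a weekend flight for the user"))
print(f"Relevant notes:\n{notes}\n\nTask: book the user's weekend getaway.")
```

The open question for GPT-5 is whether any of this moves inside the model, rather than remaining a bolt-on.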
Significant progress on either front would be a departure from past models, and an important milestone on the path to agentic AI. Conversely, lack of such progress would suggest that agentic AI may be a long way off.
Two Tracks of AI Progress
GPT-5, whenever it appears, will trigger a huge hype cycle. OpenAI will likely bundle their largest model to date with a plethora of new modalities, integrations, APIs, ChatGPT features, and more. Third-party application developers will use the opportunity to hype their own work. In short, it will be an event full of sound and fury signifying… something, but exactly what may not be obvious at first glance.
AI is currently a Merely Important technology, and still in its infancy, having had little impact on the real world so far. To gauge progress, remember the late-90s Internet, and watch for demos, hype, and early-adopter products transitioning into profitable businesses that are growing rapidly[15].
At some point, AI will become a Transcendent technology, upending all of our expectations from past waves of innovation. I believe this is still years away, but forecasting AI is notoriously difficult. Watch for the emergence of robust AI agents, and for the underlying capabilities such agents will require: memory, exploration, creative insight, and robust reasoning.
[1] This isn’t an official schedule. The release of any software product often takes longer than anticipated. And it could be that the model referred to here is not the one that will eventually receive the label “GPT-5”.
[2] Such as mixture-of-experts, ring attention, and state space models.
[3] That is, increasing the number of “parameters” in a model won’t change the physical nature of the model. A model can’t walk because it isn’t connected to any motors, and it can’t form new memories because it isn’t connected to any storage. Systems like RAG can add long-term storage, but those are outside of the model itself, and in the linked post I explain why I don’t think this is sufficient.
[4] OpenAI’s hyped “Q*” project may be intended to add exploration abilities, but my guess at this point is that it will not represent a fast path to agentic AI.
[5] For example, we couldn’t appreciate the power of the transformer architecture until it was trained at large scale, and we needed transformers to help recognize the value in training large-scale models.
[6] State-space models do not break the pattern. They manage token history in a different way, but they have fundamentally the same input, output, and processing capabilities as transformers.
[7] From https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/:
It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but the interstitial frequencies between distributions that don’t matter, like what photos humans are likely to take or words humans commonly write down.
What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.
See also this discussion of the recently released DBRX model:
People sometimes have this impression that big LLMs are like this space race where we stand around at whiteboards having breakthroughs and the company with the most talent or insights gets the best model.
The reality is much less sexy than that. We’re basically all getting the same power law scaling behavior, and it’s just a matter of:
Parameter count
Training data quantity / epoch count
Training data quality
Whether it’s an MoE model or not
Hyperparameter tuning
The real innovation is in dataset construction and the systems you build to scale efficiently.
It also takes immense skill to debug the training, with all sorts of subtleties arising from all over the stack. E.g., a lot of weird errors turn out to be symptoms of expert load imbalance and the resulting variations in memory consumption across devices and time steps. Another fun lesson was that, if your job restarts at exactly the same rate across too many nodes, you can DDOS your object store—and not only that, but do so in such a way that the vendor’s client library silently eats the error. I could list war stories like this for an hour (and I have).
Another non-obvious point about building huge LLMs is that there are a few open design/hparam choices we all pay attention to when another LLM shop releases details about their model. This is because we all assume they ablated the choice, so more labs making a given decision is evidence that it’s a good idea. E.g., seeing that MegaScale used parallel attention increased our suspicion that we could get away with it too (although so far we’ve always found it’s too large a quality hit).
…
It’s also an interesting datapoint around AI progress. DBRX would have been the world’s best LLM 15 months ago—and it could be much better if we had just chosen to throw more money at it. Since OpenAI had GPT-3 before we even existed as a company, this indicates that the gap between the leaders and others is narrowing (as measured in time, not necessarily model capabilities). My read is that everyone is hitting the same power law scaling, and there was just a big ramp-up time for people to build good training and serving infra initially. That’s not to say that no meaningful innovations will happen—just that they’re extremely rare (e.g., MoE) or incremental (e.g., the never-ending grind of data quality).
[8] There is a constant stream of papers describing improvements in the training and use of large language models. That doesn’t mean the trend is about to curve upwards; it’s already accounted for in the existing “better algorithms” trend.
[9] From the Cognitive Revolution podcast episode “Data, data, everywhere - enough for AGI?”: none of the AI lab leaders express concern about running out of training data.
There’s also the possibility of introducing synthetic data; see the first section of https://importai.substack.com/p/import-ai-369-conscious-machines.
[10] For further argument that mere scale won’t get us to AGI, see The AI Progress Paradox.
[11] See, for instance, https://twitter.com/atroyn/status/1780704786826031225.
[12] For a similar take, see https://twitter.com/Dan_Jeffries1/status/1781567863595180090?t=V1bucKAR4dWbhXM1KnOC-w&s=03.
[13] I took this idea from Nathan Labenz’s interview with the founders of Elicit on the Cognitive Revolution podcast.
[14] It’s notable that some of the greatest excitement around AI agents centers on a demo of a coding agent called Devin – a demo which appears to have recently been unmasked as fake.
[15] Profitable, that is, in the sense of selling a product for more than the incremental cost of producing it. Amazon has famously been “non-profitable” for much of its history, but that was because the company was constantly plowing cash into expansion, not because they were losing money on every sale.
Comments
Great blog title! I sort of expected what you were going to say in this post, as it mirrors what I have noticed following AI for the last year— lots of flashy announcements and noise, but nothing that seems to change things RIGHT NOW. I am surprised, though, that you think it is 5-40 years for Agentic AI— more than anything else, that prediction makes me think you perceive the current approach as fundamentally limited, even if they can milk out some superficially impressive improvements for GPT-5 and 6.
Excellent article, I largely agree with your cautious take.
I'm feeling somewhat optimistic, though, on your two things to watch for:
1. Specific training on agentic tasks: It seems certain that they are doing this for math and programming step-by-step reasoning via Process-supervised Reward Models. Nathan Lambert theorizes how this could be done at https://www.interconnects.ai/p/q-star.
But will better math and code reasoning translate to just better reasoning in general? I think so and it should naturally lead to better planning and execution in all multi-step agent frameworks.
2. Improved memory: Isn't fine-tuning on a small dataset or task essentially developing new long-term memories? It seems to me at least conceivable right now to fine-tune a GPT-like model on a few hours' worth of a partial task scored with PRM (here, fine-tuning is analogous to the memory consolidation of a good night's sleep) and have it resume afterwards, better able to grasp the context and problem complexity.
The computing expense of this setup would likely be prohibitive for most at this time. But as computing gets cheaper, maybe?