Great blog title! I sort of expected what you were going to say in this post, as it mirrors what I have noticed following AI for the last year— lots of flashy announcements and noise, but nothing that seems to change things RIGHT NOW. I am surprised, though, that you think it is 5-40 years for Agentic AI— more than anything else, that prediction makes me think you perceive the current approach as fundamentally limited, even if they can milk out some superficially impressive improvements for GPT-5 and 6.
That's exactly right: I do think the current approach is fundamentally limited. The "What Are We Missing To Build Agentic AI?" section of the post has some links to past posts where I've talked about what I see as limitations. I also expect that there are limitations we simply can't see yet, because we haven't yet run up against them (it will be difficult to perceive some of the limitations of the current approach until we reach them).
To be clear (and as I think you understand, but some other reader might not): I do expect further impressive results from LLMs, for at least a few more years; I just don't expect the further results to go very far in the direction of general-purpose real-world agents. And one of the reasons I'm flagging this is that if there *is* rapid movement toward robust agents (contrary to my prediction), we should treat that as a blazing red alert that transcendent AI is indeed coming fast and drastic action is likely necessary to prepare for that (or slow it down).
Excellent article; I largely agree with your cautious take.
I'm feeling somewhat optimistic, though, on your two things to watch for:
1. Specific training on agentic tasks: It seems all but certain that the labs are already doing this for step-by-step math and programming reasoning via process-supervised reward models (PRMs). Nathan Lambert theorizes about how this could be done at https://www.interconnects.ai/p/q-star.
But will better math and code reasoning translate to just better reasoning in general? I think so and it should naturally lead to better planning and execution in all multi-step agent frameworks.
2. Improved memory: Isn't fine-tuning on a small dataset or task essentially developing new long-term memories? It seems to me at least conceivable right now to fine-tune a GPT-like model on a few hours' worth of a partial task scored with a PRM (here, fine-tuning is analogous to the memory consolidation of a good night's sleep) and have it resume afterwards, better able to grasp the context and problem complexity.
The computing expense of this setup would likely be prohibitive for most at this time. But as computing gets cheaper, maybe? A rough sketch of the loop I'm imagining is below.
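To make point 2 concrete, here is the kind of consolidate-and-resume loop I have in mind. This is only a sketch under my own assumptions: every helper here (run_agent_session, prm_score, finetune_lora) is a hypothetical placeholder standing in for whatever agent harness, process-supervised reward model, and fine-tuning setup one would actually use.

```python
# Hypothetical sketch only -- none of these helpers are real library calls.

def consolidate_and_resume(model, task, sessions=3, keep_threshold=0.8):
    """Alternate between working on a task and fine-tuning on the good steps,
    the way a night's sleep consolidates the day's experience into memory."""
    for _ in range(sessions):
        # Work on the task for a stretch, recording every intermediate step.
        trajectory = run_agent_session(model, task, hours=2)      # hypothetical

        # Score each step with a process-supervised reward model (PRM),
        # rather than scoring only the final outcome.
        scored = [(step, prm_score(task, step)) for step in trajectory.steps]

        # Keep the steps the PRM rates as good strategy and fold them back
        # into the weights (e.g. via a LoRA adapter) before resuming.
        good_steps = [step for step, score in scored if score >= keep_threshold]
        model = finetune_lora(model, good_steps)                  # hypothetical

        if trajectory.task_complete:
            break
    return model
```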
> But will better math and code reasoning translate to just better reasoning in general? I think so and it should naturally lead to better planning and execution in all multi-step agent frameworks.
Yup, this is a critical question. My intuition is that (a) doing this even just for math and programming may be difficult, and (b) if it turns out not to be difficult, that will be because math and programming are comparatively easy domains for this approach and therefore it will *not* generalize to messy real-world tasks like "run a marketing campaign".
One of the most impressive results in this direction so far has been the success on International Mathematical Olympiad geometry problems, but the consensus from what I've read is that IMO geometry problems are much much much more amenable to a relatively simple tree search than most problems in math (or programming), because it's such a limited domain. (I'll throw in that I'm a two-time IMO silver medalist, so have some perspective on this, but I have not gone deep into the literature on AI math solvers. Also I could never do the geometry problems, that was my Achilles' heel.)
It also seems to me that "reasoning" is not quite the same thing as "planning and execution", especially in a messy real-world environment. Most of the decision making we do as we navigate multi-step tasks through our workday is done intuitively (System 1), and where will we get the data to train AI models to develop a similar intuition?
> Improved memory: Isn't fine-tuning on a small dataset or task essentially developing new long-term memories?
Yes, but we are continuously creating new memories as we go about our day, and I'm not aware of any suggestions that it would be practical to continuously fine-tune a model with everything that passes through its token buffer. Here's the post where I've said more about how I view the memory challenge: https://amistrongeryet.substack.com/p/memory-in-llms.
> One of the most impressive results in this direction so far has been the success on International Mathematical Olympiad geometry problems, but the consensus from what I've read is that IMO geometry problems are much much much more amenable to a relatively simple tree search than most problems in math (or programming), because it's such a limited domain. (I'll throw in that I'm a two-time IMO silver medalist, so have some perspective on this, but I have not gone deep into the literature on AI math solvers. Also I could never do the geometry problems, that was my Achilles' heel.)
Props to you! That stuff is way out of my league.
I'm noticing that the big scores on IMO geometry are coming from training on brute-force methods that get a subset of the problems right (most recently Wu's Method: https://arxiv.org/abs/2404.06405). What gives me some hope is that apparently we can train successfully on automated methods that fail to find the right answer a significant part of the time and YET still impart valuable insight and experience to the model, such that it can surpass its teacher.
Reasoning and planning seem to be, in the big-picture sense, about formalizing a goal down to some kind of mindless executable specification. As you note, we do much of it intuitively, relying on experience and knowledge (that is, training), which doesn't come for free.
Automated methods that formalize a problem into a solution provide vast amounts of crude knowledge about a domain (i.e., synthetic data). Math and programming seem well suited to this, and once you have models that exceed their tutors (AlphaGeometry, AlphaCode), they can become the automated methods for the next generation, leading to a kind of virtuous cycle.
But for the real world and its innumerable social and physical constraints, automated methods for formalizing crude solutions are few and low-quality. Worse, without extreme over-simplification we can't rapidly execute and test would-be solutions for correctness the way we can with math and programming. I agree this is a big obstacle.
Maybe Artificial Superintelligent Math Coders will save the day by giving us the tools to practically simulate the real world, and thus eventually train the true general intelligence?
> https://amistrongeryet.substack.com/p/memory-in-llms
Well-written and argued. I get what you mean about memory feeling like a distinct component, but I also think the existing weights/parameters are right there for the taking. Meta did some hard work to conclude that language models store at most about 2 bits of knowledge per parameter, and that reaching that capacity requires roughly 1,000 exposures to each piece of knowledge (https://arxiv.org/pdf/2404.05405). This sets the stage for over-parameterized models to have plenty of room for new memories for daily tasks. (Advances such as LoRA make the process somewhat rapid: a few hours on consumer GPUs for an 8B Llama, if I recall.)
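If that 2-bits-per-parameter figure is even roughly right, the back-of-envelope arithmetic (my own numbers, purely for scale, using an 8B Llama-class model) looks like this:

```python
# Back-of-envelope only: take ~2 bits of knowledge per parameter at face value.
params = 8e9          # Llama-class 8B model
bits_per_param = 2    # capacity figure from the Meta paper
capacity_bytes = params * bits_per_param / 8
print(f"~{capacity_bytes / 1e9:.0f} GB of raw knowledge capacity")  # ~2 GB
```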
Currently I think there's little point in dynamic training (online learning) via fine-tuning unless a task prompt can be scored automatically in an agentic flow (i.e., is it good strategy toward the goal, or bad strategy?). My best guess (hope!) is that GPT-5 shows us the first steps.
I stumbled across your Substack looking for thoughtful reactions to Leopold Aschenbrenner’s Situational Awareness document, and now I’m reading older posts. I apologize for this somewhat untimely comment. (I say “somewhat” because we’re still waiting for GPT-5!) In general, I agree with much of what you say here and how you’re framing the issues.
One thought on current models: Outside of the world of education, it’s hard to appreciate how significantly and quickly LLM chatbots, as flawed and limited as they currently are, have changed academic writing and grading. Take-home essays and papers, which used to be assigned by the tens of millions from middle school through graduate school, have been essentially destroyed as a means of evaluating student progress and achievement. Teachers, professors, and administrators are still working through this paradigm shift, but the world has forever changed.

Less acknowledged, but just as real, will be the effects on the “publish or perish” regime of academia, especially in the Humanities and “social sciences.” As less tech-savvy but stressed and highly motivated graduate students and junior faculty realize that, with only a few prompts, they can load published journal articles and even entire books into a context window and get almost instantaneous draft articles from an LLM analyzing those texts in the style of academic writing, the floodgates will open. The economic impact of that might be hard to quantify, but it’s a profound shift in the (giant!) Educational-Industrial Complex.
A bigger and more general issue is hallucination. I haven’t been able to determine whether and how future models will avoid hallucinations. Although the problem is well documented and discussed for current models, the discussions I’ve seen about AIs of the future don’t address it. It’s a huge problem! Right now, even the best LLMs will make up (literally invent) facts, papers, biographies, even people. I’m most familiar with how lawyers using LLMs get briefs that cite fake cases and statutes. I can’t overemphasize how unacceptable it is to submit an argument with false, made-up authority to a judge or panel of judges! Similarly, a colleague in academia has given me numerous examples of LLMs referring to fake journal articles and books. According to him, current models are “worse than useless” for research assistance because the output is so untrustworthy.

Will more compute and more training data eliminate the problem? I know that AI researchers are well aware of it, but I hear surprisingly little about potential solutions even as they spin grand visions of “drop-in knowledge workers” performing human-level work. A drop-in knowledge worker who provides fake legal authority (or fake medical studies) is dangerous, not useful, even if the error rate is low. (In many cases now, the error rate is high!) To be clear, I imagine (indeed, think it’s likely) that there are solutions, but I’m surprised at how under-discussed this issue is. If you have knowledge or thoughts about this, it would be great to hear them!
Thanks for the perspective on education! Interesting to hear that this has already had such a large effect. I'd heard of such things, of course, but wasn't aware that the changes were so widespread.
I don't have a lot to say about the (important) issue of hallucinations. I would certainly expect improvements in this area, but I don't have a clear idea of what it will take to push the problem down to manageable (i.e. very rare) levels.
Part of the solution will likely involve LLMs capable of fact-checking themselves in cases, like legal briefs, where errors cannot be tolerated.
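To sketch the kind of self-fact-checking I mean (purely illustrative; the LLM client, citation parser, and authority lookup below are hypothetical placeholders, not real APIs):

```python
# Hypothetical draft-then-verify loop for citation-heavy output like briefs.

def draft_with_verified_citations(llm, prompt, max_rounds=3):
    """Draft, check every cited authority against an external source of
    truth, and redraft until nothing unverifiable remains."""
    draft = llm.generate(prompt)                          # hypothetical client
    for _ in range(max_rounds):
        citations = extract_citations(draft)              # hypothetical parser
        unverified = [c for c in citations
                      if not lookup_authority(c)]         # hypothetical lookup
        if not unverified:
            return draft  # every citation was independently confirmed
        draft = llm.generate(
            prompt + "\n\nThe following citations could not be verified and "
            "must be removed or replaced with real authority:\n"
            + "\n".join(unverified)
        )
    raise RuntimeError("could not produce a draft with fully verified citations")
```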
If you're interested in a deep dive, here's a paper from November on the subject of hallucinations that I bookmarked for later review: https://arxiv.org/pdf/2311.05232.pdf.
Thanks for the pointer. There’s a new paper in Nature that’s also on point: https://www.nature.com/articles/s41586-024-07421-0