5 Comments

Great blog title! I sort of expected what you were going to say in this post, as it mirrors what I have noticed following AI for the last year— lots of flashy announcements and noise, but nothing that seems to change things RIGHT NOW. I am surprised, though, that you think it is 5-40 years for Agentic AI— more than anything else, that prediction makes me think you perceive the current approach as fundamentally limited, even if they can milk out some superficially impressive improvements for GPT-5 and 6.

author

That's exactly right, I do think the current approach is fundamentally limited. The "What Are We Missing To Build Agentic AI?" section of the post has some links to past posts where I've talked about what I see as limitations. I also expect that there are limitations we simply can't see yet, because we haven't yet run up against them (it will be difficult to perceive some of the limitations of the current approach until we reach them).

To be clear (and as I think you understand, but some other reader might not): I do expect further impressive results from LLMs, for at least a few more years; I just don't expect the further results to go very far in the direction of general-purpose real-world agents. And one of the reasons I'm flagging this is that if there *is* rapid movement toward robust agents (contrary to my prediction), we should treat that as a blazing red alert that transcendent AI is indeed coming fast and drastic action is likely necessary to prepare for that (or slow it down).

Apr 27 · edited Apr 27

Excellent article; I largely agree with your cautious take.

I'm feeling somewhat optimistic, though, on your two things to watch for:

1. Specific training on agentic tasks: It seems certain that the labs are already doing this for step-by-step math and programming reasoning via process-supervised reward models (PRMs); a rough sketch of step-level scoring follows after this list. Nathan Lambert theorizes about how this could be done at https://www.interconnects.ai/p/q-star.

But will better math and code reasoning translate to just better reasoning in general? I think so and it should naturally lead to better planning and execution in all multi-step agent frameworks.

2. Improved memory: Isn't fine-tuning on a small dataset or task essentially developing new long-term memories? It seems at least conceivable right now to fine-tune a GPT-like model on a few hours' worth of a partial task scored with a PRM (here, fine-tuning is analogous to the memory consolidation of a good night's sleep) and have it resume afterwards, better able to grasp the context and problem complexity.

The computing expense of this setup would likely be prohibitive for most at this time. But as computing gets cheaper, maybe?
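
For concreteness, here is a minimal sketch of what scoring step-by-step reasoning with a PRM might look like; the reward_model and the toy scorer are illustrative assumptions on my part, not any real API:

```python
# Illustrative sketch of process supervision: score each intermediate
# reasoning step rather than only the final answer. `reward_model` is a
# hypothetical stand-in for a trained process-supervised reward model
# (PRM); all names and signatures here are assumptions, not a real API.

def score_solution(steps, reward_model):
    """Return per-step scores and an aggregate score for a reasoning chain."""
    step_scores = []
    context = ""
    for step in steps:
        context += step + "\n"
        # A PRM estimates how likely the chain is still on track
        # after this step, given everything produced so far.
        step_scores.append(reward_model(context))
    # One simple aggregation: the chain is only as good as its weakest step.
    return step_scores, min(step_scores)

# Toy usage with a stand-in "PRM" that just penalizes a flagged phrase.
toy_prm = lambda ctx: 0.2 if "divide by zero" in ctx else 0.9
scores, overall = score_solution(
    ["Let x = 4.", "Then 1/x = 0.25.", "So the answer is 0.25."], toy_prm
)
print(scores, overall)  # [0.9, 0.9, 0.9] 0.9
```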

author

> But will better math and code reasoning translate to just better reasoning in general? I think so and it should naturally lead to better planning and execution in all multi-step agent frameworks.

Yup, this is a critical question. My intuition is that (a) doing this even just for math and programming may be difficult, and (b) if it turns out not to be difficult, that will be because math and programming are comparatively easy domains for this approach and therefore it will *not* generalize to messy real-world tasks like "run a marketing campaign".

One of the most impressive results in this direction so far has been the success on International Mathematical Olympiad geometry problems, but the consensus from what I've read is that IMO geometry problems are much much much more amenable to a relatively simple tree search than most problems in math (or programming), because it's such a limited domain. (I'll throw in that I'm a two-time IMO silver medalist, so have some perspective on this, but I have not gone deep into the literature on AI math solvers. Also I could never do the geometry problems, that was my Achilles' heel.)

It also seems to me that "reasoning" is not particularly the same thing as "planning and execution", especially in a messy real-world environment. Most of the decision making we do as we navigate multi-step tasks through our workday is done intuitively (System 1), and where will we get the data to train AI models to develop a similar intuition?

> Improved memory: Isn't fine-tuning on a small dataset or task essentially developing new long-term memories?

Yes, but we are continuously creating new memories as we go about our day, and I'm not aware of any suggestions that it would be practical to continuously fine-tune a model with everything that passes through its token buffer. Here's the post where I've said more about how I view the memory challenge: https://amistrongeryet.substack.com/p/memory-in-llms.

May 2 · Liked by Steve

> One of the most impressive results in this direction so far has been the success on International Mathematical Olympiad geometry problems, but the consensus from what I've read is that IMO geometry problems are much much much more amenable to a relatively simple tree search than most problems in math (or programming), because it's such a limited domain. (I'll throw in that I'm a two-time IMO silver medalist, so have some perspective on this, but I have not gone deep into the literature on AI math solvers. Also I could never do the geometry problems, that was my Achilles' heel.)

Props to you! That stuff is way out of my league.

I'm noticing that the big scores on the IMO are coming from training on brute-force methods that get a subset of the problems right (most recently Wu's Method -- https://arxiv.org/abs/2404.06405). What gives me some hope is that apparently we can train successfully on automated methods that fail to find the right answer a significant part of the time and YET still impart valuable insight and experience to the model, such that it can surpass its teacher.

Reasoning and planning seem to be, in the big picture sense, formalizing a goal down to some kind of mindless executable specification. As you note, we do much of it intuitively, relying on experience and knowledge -- training -- which doesn't come for free.

Automated methods that formalize a problem into a solution provide vast amounts of crude knowledge about a domain (i.e. synthetic data). Learning math and programming seems well suited to this, and once you have models that exceed their tutors (AlphaGeometry, AlphaCode), they can become the automated methods for the next generation, creating a kind of virtuous cycle.

But for the real world and its innumerable social and physical constraints, automated methods for formalizing crude solutions are few and low-quality. Worse, unlike with math/programming, we can't rapidly execute and test would-be solutions for correctness without extreme over-simplification. I agree this is a big obstacle.

Maybe Artificial Superintelligent Math Coders will save the day by giving us the tools to practically simulate the real world, and thus eventually train the true general intelligence?

> https://amistrongeryet.substack.com/p/memory-in-llms

Well-written and argued. I get what you mean about memory feeling like a distinct component, but I also think the existing weights/parameters are right there for the taking. Meta did some hard work to conclude that language models store at most 2 bits of knowledge per parameter, and that reaching that capacity requires on the order of 1,000 exposures to each fact (https://arxiv.org/pdf/2404.05405). This sets the stage for over-parameterized models to have plenty of room for new memories for daily tasks. (Advances such as LoRA make the process somewhat rapid -- a few hours on consumer GPUs for an 8B Llama, if I recall.)
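
To make the "plenty of room" intuition concrete, here is a back-of-the-envelope calculation of what roughly 2 bits per parameter implies; the model size and per-day text volume are my own illustrative assumptions:

```python
# Back-of-the-envelope estimate implied by the ~2 bits/parameter figure
# in arxiv.org/pdf/2404.05405. The model size and the size of a "day's
# worth" of task text are illustrative assumptions.

params = 8e9                  # e.g. an 8B-parameter Llama-class model
bits_per_param = 2            # reported upper bound on knowledge storage
capacity_bytes = params * bits_per_param / 8

days_task_text_bytes = 5e6    # assume a few MB of task-relevant text per day

print(f"total knowledge capacity: ~{capacity_bytes / 1e9:.0f} GB")               # ~2 GB
print(f"one day's task text: {days_task_text_bytes / capacity_bytes:.2%} of it")  # ~0.25%
```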

Currently I think there's little point in dynamic training (online learning) via fine-tuning unless a task prompt can be scored automatically in an agentic flow (i.e. is a given step good strategy toward the goal, or bad?). My best guess (hope!) is that GPT-5 shows us the first steps.
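
As a rough sketch of the loop I have in mind, assuming hypothetical agent, scorer, and fine_tune components (placeholders for this argument, not real APIs):

```python
# Hypothetical online-learning loop: run the agent, score each proposed
# step automatically, then consolidate the well-scored steps into the
# weights via fine-tuning ("sleeping on it"). `agent`, `scorer`, and
# `fine_tune` are placeholders, not real APIs.

def run_and_consolidate(agent, scorer, fine_tune, task, max_steps=20, threshold=0.7):
    trace = []
    for _ in range(max_steps):
        action = agent.step(task, trace)       # propose the next step
        score = scorer(task, trace, action)    # automatic "good/bad strategy" score
        trace.append((action, score))
        if action == "DONE":
            break
    # Keep only the steps judged to be good strategy, then fine-tune on them,
    # roughly analogous to the memory consolidation mentioned above.
    good_steps = [action for action, score in trace if score >= threshold]
    if good_steps:
        fine_tune(agent, task, good_steps)
    return trace
```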
