Great post! I was looking for a balanced view of what o3 achieved on those benchmarks. I find it quite confusing to assess how impressive each new model is, even with some effort and research; glad to see that even you find it hard.
For sure! All we have is narrow data points and anecdotal reports. I'm looking forward to the day when things slow down. [Narrator: things did not slow down.]
Very informative post, providing much-needed context for a better perspective on the model's performance on these benchmarks.
One thing I am very curious to find out is what made o3 so much better at the FrontierMath problems (even assuming the ones it solved correctly are no more difficult than average IMO-level problems). We know, for example, that AlphaProof, which (together with AlphaGeometry 2) solved 4 of the 6 IMO problems of this (last?) year, used the RL framework that had been developed in the context of AlphaGo and used the formal language Lean to check many (thousands, I guess) candidate solutions. OpenAI has mentioned using reasoning chains, but beyond that, is there anything we know?
Great job outlining some of the key limitations of current models/approaches! I have maintained a reasonable understanding of LLMs, next-token-prediction, yadda-yadda, but I wasn't really aware of exactly how these "reasoning models" in particular worked and were limited. Passing an image as a series of undifferentiated values seems quite challenging and it's a wonder it works as well as it does!
What this whole article makes me think of, yet again, is my sense that we humans may be far closer to "next-token-prediction"/simulation machines than we'd like to think. As an example, although it may not be literally/exactly the case, how can we say with confidence that we are not *also* doing "1000 attempts in our head and picking the most likely-seeming one" and calling that "creativity"? And are we saying that reasoning models lack it only because we can *see* that they're doing this and it doesn't seem "creative"? I do think we are tremendously efficient prediction machines, and in my view that way of seeing our behaviors and capabilities potentially explains a lot of things.
So now if we consider the extension of current SOTA AI into having fuller access to knowledge-in-the-world, it feels more reasonable to imagine human-like levels of capability across a broader spectrum of challenges. This one "simple" (to understand, not to resolve) limitation could be responsible for a huge part of the deficit between AI and human capabilities. Even something as fundamental as having a body in the world that can move around, sense, and interact could unlock major breakthroughs that look like "intelligence" but would arguably be, more simply, "similar intelligence given greater context/information access". That's a pretty interesting thing to contemplate...