6 Comments

Great post! I was actually hoping you'd write a post like this, detailing what strategies high-level problem solvers use and how they use them. (I know you said all of us do, and that's true to a degree, but c'mon now...)

I'm also very eager to see your analysis of the strengths and limitations of o1 mini and o1 preview. I've found o1 mini to be much more capable and sophisticated at math/physics problems, and o1 preview has been able to solve the last 3 NYTimes Connections puzzles I've given it, whereas the previous 4o model couldn't solve these at all. That's especially notable because Connections seemed like the kind of puzzle that required too much creativity/originality to be solved from previous training data alone.

Anyway, I'd be interested in a hype-free analysis of the current new models, if that seems useful to you.


I haven't had / made time to play with o1 yet but am very much looking forward to it. I've also been accumulating a huge pile of bookmarks of other people's reactions. At some point my plan is to read through all of the early takes and see whether I can find anything to add to the conversation.


How did the US team do at the IMO in 1983?


We squeaked through in second place, with 171 points, just ahead of Hungary at 170 and the USSR at 169. (West) Germany blew everyone away with 212. At the time, Beating The Russians felt like a big deal.

The next year (when I was also on the team) we slipped to a tie for 4th place with Hungary, and the USSR crushed it (235 points; second place was Bulgaria with 203, and we had 195).

I remember being told at the time that the US had never placed below 5th. I don't think anyone ever said "so don't blow it" but it definitely felt like a lot of pressure...

https://www.imo-official.org/year_country_r.aspx?year=1983


This is a terrific post!! One question for you. This is the third problem on the first day of the 2024 IMO (P3). AlphaProof didn't solve this one, but did solve the third problem on the second day (P6). These are both very hard problems for humans: each was solved by only about 2% of contestants. So it's just one data point, but what do you think might account for the difference in AlphaProof's performance?

My hunch, FWIW, is that the re-representation of the problem in the grid format makes it trickier for AlphaProof. I've looked up a few solutions to P3, and they all make use of a similar technique at some point. P6 requires a similarly long and intricate build-up of observations and intermediate results, but I would say it doesn't really require seeing the problem in a new way. In other words, both problems are deep, but P3 requires more creativity. Then again, it's also possible that the reason is more technical: maybe combinatorics is just clunkier to represent in Lean, making the solution necessarily longer, and thus yielding a bigger search space; I don't know anything about that.

Curious for your take, though!


Great questions, to which I have no great answers.

I'll begin by noting that, over the course of a few scattered hours of thought last year, I did not manage to solve problem 6. I made progress, but I'm not certain I was on the right path. I haven't peeked at a solution, but I probably should, as it's unlikely I'll ever put in the time to solve it (and it's by no means certain that I could, although it feels like the kind of problem I would have had a good chance at back when I was in fighting trim).

And as you note, it's just one data point, so it could essentially be happenstance that AlphaProof solved one and not the other. As you say, perhaps it's the need to switch to a new representation to solve P3. Maybe it's something about Lean (I know nothing about Lean). Maybe AlphaProof's trick of fine-tuning on simplified versions of the problem worked better for P6 than P3. (I wish we knew more about how this works. I have no idea how you'd automatically generate simplified variations on either of those problems.)

I was really looking forward to the public API release of o3, so that Epoch AI could reproduce o3's score on FrontierMath and share some information about which sorts of problems o3 did well on and how it accomplished that. But now it sounds like there won't be an API release of o3, and we'll have to wait for GPT-5 (or perhaps strong reasoning models from Anthropic or Google in the meantime).
