Whew! Next time I’m going to have a tweet go viral, remind me to clear my calendar first.
A couple of days ago, I came across this from Ethan Mollick:
If you’ve seen news articles telling you to throw away your black plastic kitchen implements, the paper Ethan is referring to is the cause. Apparently, black plastic often comes from sources that contain fire retardant chemicals, which can then wind up in your body. The authors compute the likely exposure, compare that with a safe baseline, and reach a worrisome result. In the process, they made a simple math error: multiplying 7000 by 60 and getting 42,000. The correct figure is 420,000, which substantially reduces the health implications of the findings (emphasis added):
…we obtained an estimated daily intake of 34,700 ng/day from the use of contaminated utensils (see SI for methods). This compares to a ∑BDE intake in the U.S. of about 250 ng/day from home dust ingestion and about 50 ng/day from food (Besis and Samara, 2012) and would approach the U.S. BDE-209 reference dose of 7000 ng/kg bw/day (42,000 ng/day for a 60 kg adult).
Ethan Mollick tried the simplest possible experiment to test whether OpenAI’s “o1” model could spot the problem: he uploaded the PDF and asked it to "carefully check the math in this paper". It successfully identified the mistake (details).
(Incidentally, it turns out that Nick Gibb noticed the same thing four days earlier.)
Reading all this, I dashed off a fairly casual tweet:
I was thinking this would be a fun little project, try a couple of prompts on a thousand papers and see what fell out. The Internet had other ideas:
Including this:
Yesterday I posted a followup tweet inviting people to join a discussion, and the result was a flood of energy. There’s already a Github repo, a WhatsApp group with 160 members, and a very active Discord.
Now what?
Motivations
This has turned into a community project, and the community gets to pick the goals. But I propose that the high-level goal should be to help advance science – that is, the grand project of increasing our understanding of the world through a set of carefully evolved practices and norms. Nick and Ethan’s discovery suggests that recent advances in AI capabilities may provide an opportunity to further the scientific project by:
Uncovering flaws in past papers, to filter out incorrect information.
Uncovering patterns of flaws in past papers, to shed light on areas where practices and norms could use an update.
Providing tools that researchers can use to spot problems in their future papers before they are published.
I’d like to emphasize that I view this as a positive project: the goal is not to point fingers at mistakes, it’s to learn how to do better and provide tools that make that job easier.
As a side note, we may learn some interesting things about AI capabilities and how to get value from AI. In particular, if this project bears fruit, it will highlight one AI capability that I think is under-appreciated: the ability to grind through an enormous amount of work, such as reviewing a million scientific papers.
Destinations
There has already been extensive discussion on Twitter, WhatsApp, and Discord. These conversations have touched on a wide range of potential projects. We could check papers for math errors, logic errors, incorrect scientific facts, flawed assumptions, and inappropriate use of statistical methods. Those are all errors that (in principle) could be accomplished just by looking at a paper in isolation. Philipp M suggested going farther, by retrieving a paper’s citations to check whether they are consistent with the statements in the paper itself.
This could be deployed as a tool that researchers can employ to check their work, one paper at a time. There could be a “check this paper” website that anyone can use to validate a paper they’re reading. We could scan existing papers en masse, and try to get errors corrected, or publish an analysis highlighting which sorts of issues need more attention in certain fields.
Farther afield, you could imagine extending this to other forms of publication: blog posts, news articles, even podcasts and videos. In addition to factual errors, we could also scan for things like misleading rhetoric.
This is just a sampling of the ideas that have emerged in the first couple of days (really less than 24 hours since the discussion started to go properly viral). The possibilities are wide. It’s not yet clear how practical any of this will be; though it will only become more practical over time.
There are still a lot of basic questions we don’t know the answer to. What percentage of published papers have flaws that can, in principle, be discovered simply through analysis of the paper itself? How many of those can current models detect? What prompts work best for this? What is the false positive rate?
Until we start to answer these questions, I don’t think it’s worth investing too much energy into planning grand projects. The immediate focus should be on sketching out the landscape of practical possibilities.
Next Steps
We are currently in the “fuck around and see what works” stage. A couple of paragraphs back, I rattled off the questions I think we should try to answer first. These are questions that should initially be answered at small scale, experimenting with 1 to 100 papers at a time. To support that work, here are some things that would be helpful:
Sources for repositories of publicly available papers in various fields (submit here).
Sources for papers that are known to contain flaws, specifically flaws of the sort that I’ve described, that in principle could be noticed simply by reading the paper. This is perhaps the most important thing, as it will allow people to test models and prompts to see if they can spot the flaws. (Some folks have worried about contamination – if a paper is known to be flawed, a model’s training data may have included a discussion of the flaw, which could make it easier for the model to then “uncover” that flaw. I suspect this will be a weak effect that we don’t need to worry about during this early experimentation phase, with the possible exception of high-profile papers whose flaws were widely publicized.) Submit here.
Experimental results – if you’re able to get a model to successfully identify a flaw in a paper, what model, prompt, and paper did you use? Let’s share success stories and try to hone in on some high-quality prompts. (For now, join the Discord to share results.)
Coding up basic tools – for instance, for running a prompt on multiple papers and collating the results. (Join the Discord to participate.)
Expert verification – when a model reports a flaw in a paper, we may need expert help to verify whether the flaw is real. In the example that kicked off this whole conversation, no special expertise was needed. But that won’t always be the case. Domain experts who can evaluate reported flaws in papers from various fields will be needed. Register here.
Funding. I’ll fund the early small-scale projects, but Nick’s back-of-the-envelope calculation suggests that running a single paper through o1 might cost $0.25… which can add up fast1 when it’s time to scale up. If you’d like to help, contact me at amistrongeryet@substack.com.
Why Hasn’t This Already Been Done?
In part, the answer is: it has! People have pointed to a variety of related projects, some quite sophisticated. I’ll try to collate these for a followup post.
But, while people are already working on things like this, no one has (to my knowledge) yet produced an easy-to-use tool for thoroughly checking a paper on demand, nor systematically reviewed the corpus of existing papers (aside from some narrow projects looking for specific types of errors in specific fields).
It may be that this only just now become possible, with the release of o1-preview (three months ago) or even o1 and o1-pro (very recent!). It may be that it’s still not possible, in a practical sense; we haven’t yet determined how many errors can be detected, or what the false positive rate will be. Or it may be the the cost ($0.25 / paper?) is too much of a hurdle – in which case, stay tuned, because we’re nowhere near the end of rapid cost declines.
But it’s also possible that the only reason this hasn’t been done is because no one has gotten around to doing it. AI is driving a period of rapidly expanding possibilities. There are hundred-dollar bills lying all over the pavement, waiting to be picked up. You just need to keep your eyes open.
Get Involved
If you’re interested in helping out, follow one of the links in the Next Steps section, join the WhatsApp group (for high-level discussion), and/or the Discord (more focused on concrete implementation work). You might also check out the growing GitHub repo. If you’d like to reach out to me personally, the most reliable channel is amistrongeryet@substack.com. (If you replied to me on Twitter, or posted on WhatsApp or Discord, I may have missed it.)
Follow Along
The project home page is the-black-spatula-project.github.io. If you’d like to follow progress in detail, feel free to lurk in WhatsApp and/or Discord. For occasional updates, follow @theblackspatula on Twitter. And I’ll write meatier updates from time to time in my blog here.
Thanks to everyone who jumped in to contribute to this project!
As fans of Father Guido Sarducci will recognize (skip to 5:55 if you’re impatient).
I joined the Whatsapp and I'm submitting our 'repo' (more of a curation database above). I’m the director of The Unjournal (unjournal.org, “we commission research for evaluation and rating, to make impactful research more rigorous & vice-versa”)
I’d like to help the initiative and try to get the work in our pipeline “Black Spatula’d”.
You can see all the papers in our pipeline that we prioritized as ‘potentially high-impact’ at https://coda.io/d/Public-Database-of-Research_d7VdSLeCrpi/Unjournal-Research-with-potential-for-impact-database_suCwXMPL#_luqRnzt6
(Our evaluation output: unjournal.pubpub.org)
Happy if anyone wants to engage — please reach out. We may be able to provide a small amount of funding as well.
I think this is as good as it gets at this point. Well done.