Things move quickly in the AI world these days:
A lot has happened in the (as I first sat down to write this) five days since a viral tweet sparked The Black Spatula project. For those just joining in, this is a community project to leverage AI to help weed out errors in scientific papers. It’s named for the kerfuffle over kitchen implements made from black plastic, which were reported to contain dangerous levels of fire retardant chemicals, based on a publication that turned out to contain a simple math error. (I’m still not sure that black plastic is the best thing to be using around your food, but it’s 10x less bad than claimed by the paper.)
The community that came together spontaneously is doing all of the actual work. My role at this point is primarily cheerleader.
So, Some Cheerleading
The progress in a few short days has been amazing! Our WhatsApp group is now a WhatsApp community with over 250 members; there are 242 people on the Discord. There’s a GitHub repository, a website, the beginnings of a database of known flawed papers, active collaboration on prompting techniques, and more.
We are already uncovering errors. For instance, a peer-reviewed paper on dengue fever was found to contain an error in the way cases of mild and severe dengue were grouped – we’ve reported this to the author. Many more apparent issues have been identified in other papers, and are still being verified.
The big pushes right now are to find papers with known flaws and come up with prompts that can detect those flaws – along with experimentation as to which AI models work best. This is very much the-more-the-merrier work, ideal for our self-organizing community.
If you’d like to make your own contribution – however small – to improving science, this is a great (and fun!) opportunity. See the How To Help section at the end of this post. Two immediate needs are volunteers with expertise in some academic field to vet reported errors, and pointers to papers with known flaws (for testing). And everyone is invited to join the WhatsApp and Discord!
The Project is More Complicated Than I Thought
There’s a vigorous discussion taking place on the project Discord and WhatsApp group, exploring a range of use cases and implementation techniques.
One thing that has quickly become apparent is that this will be a much more complex project than I’d first contemplated. That’s not a surprise; this is always how it goes.
(In fairness to me, I’d originally contemplated something deliberately unambitious – just testing a few AI prompts on 1000 papers to satisfy my curiosity as to what sort of results could be easily achieved. The Black Spatula community is pursuing much more ambitious and valuable goals.)
Some of the complications are mundane technical issues. Scientific papers are generally available as PDF or .docx files; OpenAI’s o1 model does not yet accept these formats. Should we just convert the file to text? Then the diagrams will be lost, and tables will be garbled. Should we present the files to the model as images instead of text? Perhaps that will introduce other issues.
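To make the trade-off concrete, here is a minimal sketch of the two extraction paths, assuming the pypdf and pdf2image packages – the library choices and file name are illustrative, not a decision the project has made:

```python
# Minimal sketch of two ways to hand a paper to a model (illustrative only).
# Assumes pypdf and pdf2image are installed; pdf2image also needs the poppler utilities.
from pypdf import PdfReader               # plain-text extraction: cheap, but tables/figures degrade
from pdf2image import convert_from_path   # page images: preserves layout, but uses far more tokens

def extract_text(path: str) -> str:
    """Concatenate the extracted text of every page."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_page_images(path: str, dpi: int = 150):
    """Render each page as an image, suitable for a vision-capable model."""
    return convert_from_path(path, dpi=dpi)

text = extract_text("paper.pdf")           # diagrams lost, tables often garbled
images = extract_page_images("paper.pdf")  # layout preserved, at higher cost
```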
When we started, it turned out that o1 wasn’t even available via API (meaning, you could use it manually via ChatGPT but it couldn’t be invoked by a computer program). Fortunately, that was in the ancient times of 5 days ago; OpenAI has subsequently added o1 to their API.
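For the curious, calling the model from code now looks roughly like this – a sketch using the OpenAI Python SDK, where the prompt wording is purely illustrative and the model name is a placeholder for whichever reasoning model you have access to:

```python
# Rough sketch of asking a reasoning model to review a paper (illustrative only).
# Assumes the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def review_paper(paper_text: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # placeholder; substitute the reasoning model you have access to
        messages=[{
            "role": "user",
            "content": (
                "You are checking a scientific paper for errors. List any mathematical, "
                "statistical, or logical mistakes you can verify from the text alone, "
                "citing the relevant passage for each.\n\n" + paper_text
            ),
        }],
    )
    return response.choices[0].message.content
```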
What’s the best way to find downloadable copies of scientific papers? Where are papers for a given field hosted? Is there a paywall? A limit on the rate at which we can download papers? Legal issues?
Where can we find “known bad” papers to test on? Are papers which have been retracted a good test case? Does it depend on the reason they were retracted? How can we find experts in relevant fields to review any errors the AI reports?
Meanwhile, some of the questions that have arisen are almost philosophical in nature.
If We Build It, Who Will Come?
Suppose we indeed develop a system that can detect errors in scientific papers. (Not all errors, but a meaningful percentage.) What should we do with it?
There is a lot of support for the idea that the first / best use would be to help weed out errors in new papers before they are published. On a per-paper basis, this provides the maximum possible benefit (no one will be misled by reading the uncorrected paper) with the minimum effort (pre-publication is the easiest time to correct errors) and minimum side effects (no one’s reputation will be damaged).
But it’s not obvious how to get there. On the one hand, it’s not clear how you could limit such a tool to only being used by authors on their own papers. On the other hand, it’s not clear how many authors would bother to use the tool. As Marius Terblanche said in the WhatsApp discussion, “What's the pain point that will make a health researcher add another step in an already tedious submission process?”
At the other end of the spectrum, it would be fascinating to scan existing papers en masse. Conceivably, this could flag large numbers of issues, helping to weed erroneous conclusions from the scientific literature. It would also yield a fascinating set of data for mining: what sorts of errors are most prevalent? Which fields have more errors? Do certain researchers or institutions commit more mistakes? (Not to mention: which AI models and prompts are best at finding errors?)
While I expect some effort of this nature will eventually happen, and perhaps sooner rather than later, it is fraught with pitfalls. Our results will inevitably include “false positives” (incorrectly claiming to have spotted a mistake). There may be systematic sources of bias in the tool’s performance – it might be better at spotting errors in some fields than others, or a particular researcher’s writing style might somehow trigger a lot of false positives.
(Tools that claim to detect when a student has used AI to write an essay are well known to have false positives, and there are many cases of students being harmed by false accusations that were not properly vetted. We should seek to prevent automated error detection from being misused / misconstrued in the same fashion as automated AI-authorship detection. GJ Hagenaars notes, “we have a duty, as we create such tools, that the DISCLAIMER is very loud that there is no substitute for verification of what the AI claims to have found” – which is absolutely correct, and still leaves us with the problem of managing how people perceive and act on our reports.)
Even correct reports of errors could easily be subject to misinterpretation or lead to inappropriate consequences. Minor errors could be seized on as talking points to discredit a politically inconvenient result. An innocent error could be misconstrued as deliberate fraud.
I am sure that I am just scratching the surface here. In any case, it’s clear that achieving the maximum benefit to society from the use of AI to spot errors in scientific papers is not just a technical question, but also a social one.
How To Pay For It?
It’s not yet clear how much it will cost to check a single paper, but one early estimate came in at roughly $0.25. Of course AI is rapidly getting cheaper (for a given level of capability), but that may be at least partially counterbalanced by a desire to construct increasingly complex analyses – posing multiple questions to the model, using multiple models, etc. (If you follow the AI discourse at all, you’ll have seen that the hot new scaling law centers on “inference-time compute” – meaning that you can get better results by asking an AI model to do more work, resulting in rapidly escalating costs, hence the new $200 / month “Pro” tier for ChatGPT and a rumored $2,000 / month follow-up.)
There appear to be several million scientific papers published each year (estimates vary). Scanning each one once might cost $1M / year. A one-time effort to back-check the last 20 years might run to $10M (not $20M, because the number of papers per year used to be lower – it has been rising rapidly). These are not eye-watering sums in the scope of the tech industry, but for what I presume will be a philanthropic effort, they are not trivial either.
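For transparency, here is the back-of-the-envelope arithmetic, with every input a rough guess rather than a measurement:

```python
# Back-of-the-envelope estimate; both inputs are rough guesses, not measurements.
cost_per_paper = 0.25        # dollars, from the early estimate above
papers_per_year = 4_000_000  # "several million"; published estimates vary

print(f"~${cost_per_paper * papers_per_year:,.0f} per year")  # ~$1,000,000
```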
In any case, it’s a bit early to be thinking about how to pay for scanning millions of papers until we can demonstrate reliable results on one paper at a time.
Assessing the Opportunity
When we ask an AI to find errors in a scientific paper, it will make mistakes. When it identifies a real error, we call that a “true positive”. But sometimes it will incorrectly claim to have found an error; that’s a “false positive”. It may also fail to notice some actual errors (“false negative”).
We don’t need to worry too much about false negatives, so long as we set the expectation that a clean AI report is not a guarantee that a paper is correct in all respects. The interesting numbers are the true positive rate (how many genuine mistakes does the AI find?) and the false positive rate (how many false accusations does it make?).
As an example, I’m going to make up some numbers; they may be completely off base. Suppose that 5% of all peer-reviewed papers have at least one important error that can be spotted simply by reading the paper. Suppose that our tool has a true-positive rate of 50% (it catches half of the errors) and a false-positive rate of 10%. If we feed it 1000 papers, it will report errors in about 120 of them, but 95 of those 120 will be false accusations.
Even with that high false-positive rate, this might be useful as a tool for researchers to check their own work! On average, they’d have to wade through 5 error reports to find one true error. That might be an excellent return for their time. It could be even more valuable to use the tool on a first draft of a paper, which is likely to contain more errors.
However, that same tool might not be very useful for scanning a large corpus of published papers: most of the detected “errors” would be bogus, and it would be an enormous project to manually separate the wheat from the chaff.
To address this, we might prompt the AI to be conservative, and only report errors that it’s certain of. Suppose that the true-positive rate drops to 30%, but the false-positive rate drops to 0.1%. Now, on a corpus of 1000 papers, we should expect to get about 16 error reports, of which 15 will be legitimate. That would be much more useful. We’d still want to treat all reported errors as tentative, but investigating those reports would be a much better use of time.
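Here is the same arithmetic in code, so you can plug in your own guesses – the rates below are just the made-up numbers from above:

```python
# Worked version of the made-up numbers above; every rate is a guess.
def expected_reports(n_papers, base_rate, true_positive_rate, false_positive_rate):
    """Return (true reports, false reports) for a corpus of n_papers."""
    flawed = n_papers * base_rate
    clean = n_papers - flawed
    return flawed * true_positive_rate, clean * false_positive_rate

# Scenario 1: 5% of papers flawed, 50% of errors caught, 10% false-positive rate.
print(expected_reports(1000, 0.05, 0.50, 0.10))   # (25.0, 95.0) -> ~120 reports, 95 bogus
# Scenario 2: a more conservative prompt.
print(expected_reports(1000, 0.05, 0.30, 0.001))  # (15.0, 0.95) -> ~16 reports, ~15 legitimate
```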
Remember, though, that I made up all of these numbers. Until we find out what actual numbers we can achieve, we won’t know what we have to work with.
More Ideas
I was chatting with my friend (and Writely / Google Docs co-founder) Sam Schillace, and he had some interesting ideas.
To incentivize authors to check their own work, we could announce that starting in six months, we will be publishing a report card on every paper published after that date. Meanwhile, we’d make the analysis tool available for authors. This would allow authors to avoid embarrassment (they just need to get in the habit of checking for AI-detectable errors before publication), while still providing an incentive not to ignore the tool.
This assumes we can achieve a reasonably low false-positive rate. On that note, he had another suggestion: don’t worry about cost just yet. If there is one constant in the world of AI these days, it is that the cost of operation is plummeting. If it costs $0.25 to run a paper through o1 today, let’s find a way to spend $10. That’ll come down to a dime soon enough. You can get better results from an AI by asking it a question multiple times and somehow identifying the “best” answer, or even simply the most frequent answer.
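A minimal sketch of that last idea – sample the model several times and keep the most common verdict. The OpenAI SDK usage and helper names are my own assumptions, not an agreed-upon design:

```python
# Sketch of majority voting over repeated queries (illustrative only).
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask_once(question: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # placeholder model name
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()

def majority_answer(question: str, samples: int = 5) -> str:
    """Ask the same question several times and return the most frequent answer."""
    answers = [ask_once(question) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]
```

In practice, you’d want the model to return a short, structured verdict (say, “error in Table 2” rather than free-form prose) so that answers can actually be compared and counted.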
More excellent ideas are emerging in the community discussions. Here are just a few examples:
Trying newer reasoning models. Google’s Gemini 2.0 Flash Thinking will likely be much cheaper than o1. OpenAI’s freshly announced o3 model appears to be even more capable than o1, by a wide margin.
Joaquin Gulloso suggests asking the LLM to read a paper and produce a customized set of instructions for what to review, which would then be fed back into the model (see the sketch after this list).
GJ Hagenaars notes the possibility of using specialized (fine-tuned?) LLMs to look for different kinds of errors. Dominikus Brian adds, “With Post-Training pipeline we can switch freely between LLM engines and retain the domain expertise/memory/experienced gathered by the AI Agent.”
GJ also suggested cross-checking to see whether the information that one paper cites from another is in fact consistent with the cited paper.
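As one illustration of Joaquin’s two-pass idea, here is a sketch in which the model first drafts its own review checklist and then works through it; the prompt wording, model name, and function names are all placeholders:

```python
# Sketch of a two-pass review: ask the model what to check, then feed its own
# checklist back in alongside the paper. All prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def two_pass_review(paper_text: str) -> str:
    # Pass 1: have the model draft a review plan tailored to this paper.
    checklist = ask(
        "Read the following paper and list the specific claims, calculations, "
        "and tables that most deserve careful verification:\n\n" + paper_text
    )
    # Pass 2: have the model work through its own checklist.
    return ask(
        "Review the following paper using this checklist, reporting only errors "
        "you can verify from the text itself.\n\nChecklist:\n" + checklist
        + "\n\nPaper:\n" + paper_text
    )
```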
Related Work
Many people and organizations have been working to raise the quality of published papers, including through the use of AI tools. Here I will briefly mention just a few that happen to have come to my attention, mostly via the Black Spatula forum.
ResearchHub is “A modern day pre-print server and platform for open science where users can review, publish, and collaborate on scientific research”.
PubPeer is another platform for (mostly biomedical) researchers to post comments on published papers.
ERROR – A Bug Bounty Program for Science – is a website that pays peer reviewers to check scientific papers for errors, with rewards ranging from 250 to 2,500 CHF ($280-1,100), depending on the severity of the errors discovered.
FutureHouse is “a non-profit building AI agents to automate research in biology and other complex sciences”. As part of their work, they have been building tools which use AI to answer questions regarding published papers – though not the question of “does this paper contain any errors”.
Abhishaike Mahajan recently conducted an experiment, using o1-preview to look for errors in 59 recent papers from NeurIPS (a machine learning conference).
In a recent arXiv preprint, Tianmai M. Zhang highlighted the problem of referencing errors in scientific papers and showed that OpenAI’s language models can detect erroneous citations even with limited context.
Way back in 2016, James Heathers proposed the GRIM test, a simple approach to checking for arithmetic errors (or falsified data) in certain types of statistical data, followed by a similar test called SPRITE.
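For the curious, the core of the GRIM test fits in a few lines. This is a sketch based on the published idea – it applies only to means of integer-valued data (such as Likert responses or counts), where the mean times the sample size must land on a whole number:

```python
# Minimal sketch of the GRIM test: a mean of n integer-valued observations,
# multiplied by n, must round to a whole number.
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """True if the reported mean is achievable with n integer-valued data points."""
    nearest_total = round(reported_mean * n)              # closest whole-number sum
    achievable_mean = round(nearest_total / n, decimals)  # what that sum would report as
    return achievable_mean == round(reported_mean, decimals)

print(grim_consistent(3.48, 25))  # True:  87 / 25 = 3.48 exactly
print(grim_consistent(3.49, 25))  # False: no sum of 25 integers yields a mean that rounds to 3.49
```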
Stuart Buck reports:
FYI I just talked with a statistician at Wisconsin named Karl Rohe. He and his team have been developing an LLM approach (using Claude Sonnet) to check medical papers for whether they comply with the CONSORT guidelines (official standards for how clinical trials are supposed to be reported).
As well, other folks are working on a tool to identify problematic clinical trials used in systematic reviews, i.e., the kind that are used to develop medical guidelines. https://pmc.ncbi.nlm.nih.gov/articles/PMC10593010/
Sayash Kapoor (of the excellent AI Snake Oil blog) mentioned that “Over the last year, I've been working on several projects to improve science using AI. In September, we released a benchmark to evaluate if AI can automatically reproduce scientific papers when given access to the code and data released alongside the paper. I plan to expand this effort to focus on error detection in AI papers. In particular, I plan to create a benchmark to evaluate how well AI can find and fix common errors in AI research papers.”
Elisabeth Bik mentioned a 2016 study in which she systematically looked for inappropriately duplicated “Western blot” images in 20,621 research papers.
Manjari Narayan notes:
There are a pretty diverse group of researchers who work on AI for scientific claim verification too. Allen institute for intelligence works on this and there is a nice review of papers at this upcoming workshop here https://sites.google.com/view/ai4research2024/resources
How To Help
To follow along with the project, join our WhatsApp group (for announcements and high-level discussion). To get involved with the day-to-day work, or just watch it happen and perhaps contribute an occasional thought, join our active Discord. Right now, just playing around with papers and prompts is a great way to contribute – please record your results in our spreadsheet! See the project home page for more information.
There will be a big need for domain experts to verify reported mistakes in papers. If you have experience in some academic field and are open to helping judge whether an AI-reported error in a paper is in fact an error, sign up here and we’ll get in touch when we start generating early results.
We could also really use papers that are known to contain flaws, specifically flaws that in principle could be noticed simply by reading the paper. Submit examples here. For instance, if you’ve written a paper and can submit an early draft with known errors, the final draft, and a description of the errors, that would be a big help!
GJ Hagenaars notes:
[We could use] folks with some spare time on their hands to write the history and the documentation of what is going on and what's being attempted. New folks are joining every day, and while the discussion channels on discord and whatsapp are full of useful information, it's not necessarily in a perfect format for consumption.
Finally, at some point we will need funding. We’ll publish more on this topic when we’re farther along, but if you’re interested in contributing to help us detect errors at scale, get in touch – join the community or drop me a line at amistrongeryet@substack.com.
Thanks to David Macius, Dominikus Brian, GJ Hagenaars, Michael J Jabbour, and Tianmai M. Zhang for specific contributions to this post, and to everyone who has been working to make The Black Spatula Project a reality!
I'm guessing your team has already thought of this, but you could take existing papers and use AI to purposely introduce an error, and then see if your new AI detector can find the error. For example, change numbers or operations in a math problem, or change conclusion language like "the chart shows X increasing when Y" to "the chart shows X decreasing when Y". Here are more ideas from ChatGPT:
1. Numerical or Logical Errors
Data inconsistencies: Change numerical values in tables or charts to conflict with reported statistics in the text.
Calculation mistakes: Introduce errors in mathematical derivations or results, such as adding where multiplication is required.
Unit mismatches: Change units (e.g., "10 cm" to "10 m") without adjusting the numbers appropriately.
Rounding issues: Alter significant digits or rounding in reported results.
2. Graph and Table Discrepancies
Graph mislabeling: Swap X and Y axis labels or change graph legends to introduce inconsistencies.
Mismatch with narrative: Alter graphs or tables to conflict with the description in the text.
Formatting errors: Introduce issues like missing axis labels, misaligned data points, or inconsistent scale.
3. Language and Writing Errors
Ambiguous phrasing: Change precise scientific language to something vague or misleading.
Contradictions: Add statements that contradict earlier claims in the paper.
Grammar changes: Introduce errors in sentence structure, missing articles, or subject-verb disagreement.
Tone shifts: Alter conclusions to sound less confident, or modify claims to seem exaggerated.
4. Citations and References
Mismatched citations: Replace a correct citation with an unrelated or invalid one.
Missing citations: Remove citations for claims that require supporting evidence.
Reference typos: Alter author names, years, or journal titles in references.
5. Methodology Problems
Inconsistent methods: Change details of the methodology to conflict with results (e.g., claim to have used one algorithm but show results from another).
Parameter mismatches: Modify key experimental parameters so they no longer align with results.
Misrepresentation of procedures: Change experimental details to make them illogical or infeasible.
6. Ethical and Compliance Errors
Fabrication: Insert made-up data or results that do not follow from the described experiment.
Plagiarism: Introduce text copied from other sources without citation.
7. Domain-Specific Errors
Biological papers: Introduce errors in species names, anatomical terms, or physiological processes.
Physics papers: Modify constants, assumptions, or units in equations.
Social sciences: Alter the interpretation of qualitative data, such as changing survey results or demographic descriptions.
8. Structural and Organizational Errors
Section misplacement: Swap sections like methods and results or conclusions and abstract.
Incomplete sections: Remove critical parts of a section, such as missing details in the methodology.
Duplications: Repeat sections or tables unnecessarily.
9. Logical Fallacies
Non sequiturs: Add conclusions that do not logically follow from the results.
Correlation vs. causation errors: Change phrasing to imply causation where there is only correlation.
10. Formatting and Style Errors
Inconsistent formatting: Change figure numbering or table referencing inconsistently throughout the paper.
Style guide violations: Alter fonts, headings, or other style elements to deviate from the journal’s formatting requirements.
Additional Ideas
To enhance testing, you could also create "graded mistakes," where some errors are more obvious (e.g., a missing table entirely) and others are subtle (e.g., minor rounding issues). Combining multiple error types in a single paper could test the robustness of your "problem finder" AI in identifying multiple issues simultaneously.
The problem you’re attempting to solve is very tricky as you’ve mentioned. We spent 18 months building out a peer review platform to do just this. We’ve solved nearly all of the problems you’ve encountered. We’ve made it free for everyone to use at paper-wizard.com