9 Comments

I'm guessing your team has already thought of this, but you could take existing papers and use AI to purposely introduce an error, then see if your new AI detector can find it. For example, change numbers or operations in a math problem, or change conclusion language like "the chart shows X increasing when Y" to "the chart shows X decreasing when Y". Here are more ideas from ChatGPT:

1. Numerical or Logical Errors

Data inconsistencies: Change numerical values in tables or charts to conflict with reported statistics in the text.

Calculation mistakes: Introduce errors in mathematical derivations or results, such as adding where multiplication is required.

Unit mismatches: Change units (e.g., "10 cm" to "10 m") without adjusting the numbers appropriately.

Rounding issues: Alter significant digits or rounding in reported results.

2. Graph and Table Discrepancies

Graph mislabeling: Swap X and Y axis labels or change graph legends to introduce inconsistencies.

Mismatch with narrative: Alter graphs or tables to conflict with the description in the text.

Formatting errors: Introduce issues like missing axis labels, misaligned data points, or inconsistent scale.

3. Language and Writing Errors

Ambiguous phrasing: Change precise scientific language to something vague or misleading.

Contradictions: Add statements that contradict earlier claims in the paper.

Grammar changes: Introduce errors in sentence structure, missing articles, or subject-verb disagreement.

Tone shifts: Alter conclusions to sound less confident, or modify claims to seem exaggerated.

4. Citations and References

Mismatched citations: Replace a correct citation with an unrelated or invalid one.

Missing citations: Remove citations for claims that require supporting evidence.

Reference typos: Alter author names, years, or journal titles in references.

5. Methodology Problems

Inconsistent methods: Change details of the methodology to conflict with results (e.g., claim to have used one algorithm but show results from another).

Parameter mismatches: Modify key experimental parameters so they no longer align with results.

Misrepresentation of procedures: Change experimental details to make them illogical or infeasible.

6. Ethical and Compliance Errors

Fabrication: Insert made-up data or results that do not follow from the described experiment.

Plagiarism: Introduce text copied from other sources without citation.

7. Domain-Specific Errors

Biological papers: Introduce errors in species names, anatomical terms, or physiological processes.

Physics papers: Modify constants, assumptions, or units in equations.

Social sciences: Alter the interpretation of qualitative data, such as changing survey results or demographic descriptions.

8. Structural and Organizational Errors

Section misplacement: Swap sections like methods and results or conclusions and abstract.

Incomplete sections: Remove critical parts of a section, such as missing details in the methodology.

Duplications: Repeat sections or tables unnecessarily.

9. Logical Fallacies

Non sequiturs: Add conclusions that do not logically follow from the results.

Correlation vs. causation errors: Change phrasing to imply causation where there is only correlation.

10. Formatting and Style Errors

Inconsistent formatting: Change figure numbering or table referencing inconsistently throughout the paper.

Style guide violations: Alter fonts, headings, or other style elements to deviate from the journal’s formatting requirements.

Additional Ideas

To enhance testing, you could also create "graded mistakes," where some errors are more obvious (e.g., a missing table entirely) and others are subtle (e.g., minor rounding issues). Combining multiple error types in a single paper could test the robustness of your "problem finder" AI in identifying multiple issues simultaneously.
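
A minimal sketch of what this kind of error-injection harness could look like, assuming the paper is available as plain text or LaTeX (the function names and the specific perturbations are illustrative, not anything the post prescribes):

```python
import random
import re

def flip_trend_words(text):
    """Swap 'increasing'/'decreasing' at one randomly chosen spot."""
    matches = list(re.finditer(r"\b(increasing|decreasing)\b", text))
    if not matches:
        return text, None
    m = random.choice(matches)
    flipped = "decreasing" if m.group(1) == "increasing" else "increasing"
    new_text = text[:m.start()] + flipped + text[m.end():]
    return new_text, f"trend word flipped at char {m.start()}"

def perturb_number(text):
    """Change one numeric value so it no longer matches the rest of the paper."""
    matches = list(re.finditer(r"\b\d+(\.\d+)?\b", text))
    if not matches:
        return text, None
    m = random.choice(matches)
    value = float(m.group(0))
    new_value = value * random.choice([0.1, 2, 10]) + random.choice([0, 1])
    new_text = text[:m.start()] + f"{new_value:g}" + text[m.end():]
    return new_text, f"number {m.group(0)} changed to {new_value:g} at char {m.start()}"

def corrupt_paper(text, n_errors=1):
    """Inject n_errors perturbations and return the corrupted text plus ground truth."""
    injected = []
    perturbations = [flip_trend_words, perturb_number]
    for _ in range(n_errors):
        text, description = random.choice(perturbations)(text)
        if description:
            injected.append(description)
    return text, injected
```

Each corrupted copy then comes with a ground-truth list of injected errors, so you can score recall on the corrupted versions and false positives on the untouched originals, and make the perturbations larger or subtler to build the "graded mistakes" described above.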

The problem you’re attempting to solve is very tricky, as you’ve mentioned. We spent 18 months building out a peer review platform to do just this, and we’ve solved nearly all of the problems you’ve encountered. We’ve made it free for everyone to use at paper-wizard.com.

We've run some tests with paper-wizard.com, and the agents do a very thorough review indeed. The structured and detailed feedback is amazing, but unfortunately that is also what makes it very different from (not necessarily inferior to) human feedback.

Humans have creativity and often get carried away with one or two particular issues at a time. When a human expert does so in peer review, they expand on and explain that particular suggestion in more depth. At times this is what authors and editors find most helpful: the "convincing others" part of the review.

An AI that is programmed to assess everything will give structured but very predictable responses. On the other hand, an AI that is prompted to be creative often introduces unnecessary things that are not relevant and border on hallucination.

Eighteen months is a ton of experience for a niche like this, which is nascent and evolving at a breakneck pace. Looking forward to connecting after the holidays. Visit us at ResearchHub.com

> What’s the best way to find downloadable copies of scientific papers?

I'm sure you've discussed this, but arXiv source files might be a good place to start. For papers with only tables and formulas, they are a perfect format for o1, although a single paper's source can sometimes be spread across multiple files.
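
On that suggestion, a minimal sketch of pulling a paper's LaTeX source from arXiv's e-print endpoint (the endpoint is real; rate limiting, retries, and the occasional single-file gzip source are glossed over here):

```python
import io
import tarfile

import requests

def fetch_arxiv_source(arxiv_id: str) -> dict:
    """Download the e-print source for an arXiv paper and return its .tex files.

    Most sources are gzipped tarballs; a few older papers ship a single
    gzipped .tex file, which this sketch does not handle.
    """
    resp = requests.get(f"https://arxiv.org/e-print/{arxiv_id}", timeout=60)
    resp.raise_for_status()

    tex_files = {}
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.name.endswith(".tex"):
                data = tar.extractfile(member).read()
                tex_files[member.name] = data.decode("utf-8", errors="replace")
    return tex_files

# Example: sources = fetch_arxiv_source("2301.00001")
```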

"To incentivize authors to check their own work, we could announce that starting in six months, we will be publishing a report card on every paper published after that date. Meanwhile, we’d make the analysis tool available for authors. This would allow authors to avoid embarrassment (they just need to get in the habit of checking for AI-detectable errors before publication), while still providing an incentive not to ignore the tool." Really depends what the intended purposes here are -- if you're operating under the assumption that all authors are acting in good faith, this seems like the right approach. If you're trying to catch fraudulent claims/results, however, this seems like it would invite a disastrous kind of Goodhearting.

Agreed, this would be a bad way to deploy a test for, say, signs that data has been manipulated.

But much of what the nascent community has been discussing seems not to be vulnerable to Goodharting (I think?). For instance, checking for math or logic errors, or checking whether the information that the paper derives from a cited source in fact matches the content of that source. If authors repeatedly modify their paper until it passes those tests, then hopefully that is net beneficial (https://xkcd.com/810/).

Basically Goodhart's Law results from a gap between the thing you are measuring and the thing you actually care about (and reflects the observation that such a gap tends to exist even when at first glance you wouldn't think so). I... think?... that the gap is pretty small for most of the error categories we're discussing.

(One way in which Goodhart's Law would still apply: "there is an obvious math error in this paper" might be a useful signal that the paper is also likely to contain less obvious errors. And we'd be erasing that signal. But this seems likely to be a worthwhile tradeoff in practice?)
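
As a concrete illustration of the source-matching check mentioned above, a rough sketch; ask_model is a hypothetical placeholder for whatever model API you're using, not a real library call:

```python
def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM of choice and return its reply."""
    raise NotImplementedError

def check_citation(claim: str, cited_source_text: str) -> str:
    """Ask the model whether the cited source actually supports the claim."""
    prompt = (
        "A paper makes the following claim and cites a source for it.\n\n"
        f"Claim: {claim}\n\n"
        f"Excerpt from the cited source:\n{cited_source_text}\n\n"
        "Does the excerpt support the claim as stated? Answer 'supported', "
        "'contradicted', or 'not addressed', then explain briefly."
    )
    return ask_model(prompt)
```

The Goodhart-resistant property is that "gaming" this check mostly means fixing the citation or softening the claim, which is the behavior you want anyway.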

Awesome initiative to stumble upon. I’m going to be looking more closely and may DM you as part of a small series on AI for science.

Working on pretraining, it’s super obvious how hard parsing PDFs is. It's wild that this is a bottleneck.

That is crazy; so much work invested in formatting the published artifact in a way that subtracts value, at least from a machine-readability perspective. (It also makes papers ~impossible to read on a phone; I would love something that could do a sophisticated job of reformatting PDFs for display on a narrow screen.)
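
A crude starting point for that kind of reflow, assuming PyMuPDF (imported as fitz); a real tool would also need to handle two-column layouts, footnotes, hyphenation, and math:

```python
import fitz  # PyMuPDF

def extract_reflowable_text(pdf_path: str) -> str:
    """Pull text blocks out of a PDF in rough reading order so they can be reflowed."""
    doc = fitz.open(pdf_path)
    paragraphs = []
    for page in doc:
        # get_text("blocks") returns (x0, y0, x1, y1, text, block_no, block_type) tuples
        blocks = page.get_text("blocks")
        # Naive reading order: top-to-bottom, then left-to-right
        for block in sorted(blocks, key=lambda b: (round(b[1]), b[0])):
            if block[6] != 0:  # skip image blocks
                continue
            text = block[4].strip()
            if text:
                paragraphs.append(" ".join(text.split()))  # collapse hard line breaks
    doc.close()
    return "\n\n".join(paragraphs)
```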
