Terence Tao, one of the most accomplished mathematicians alive, recently told Quanta Magazine something that should make anyone building AI research tools pause:
“AI without validation is too unreliable for serious work.”
He’s right. And as the AI research space explodes with new tools — OpenAI’s GPT-Rosalind, Sakana’s AI Scientist, Google’s PaperOrchestra — this observation becomes more urgent, not less.
The Generation Problem Is Solved. The Verification Problem Isn’t.
GPT-Rosalind can generate research hypotheses. AI Scientist can produce papers for $15 each. These are genuine achievements. But they share a common assumption: that generating more research output is the bottleneck.
It isn’t.
The real bottleneck is knowing which outputs to trust.
The Hallucination Problem at Scale
Here’s a number that should concern anyone relying on AI for research: large language models fabricate citations approximately 5-15% of the time. That sounds manageable — until you scale it.
At paper-generation scale (thousands of papers, each carrying its own reference list), that’s at least hundreds of false citations entering the literature. Each one is a landmine for researchers who cite it, reviewers who miss it, and readers who believe it.
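To make the scale concrete, here is a back-of-the-envelope sketch. Only the 5-15% range comes from the figure above; the paper count and the citations-per-paper figure are assumptions chosen purely for illustration, so treat the output as an order-of-magnitude estimate rather than a measurement.

```python
def expected_fabricated_citations(papers: int,
                                  citations_per_paper: int,
                                  fabrication_rate: float) -> float:
    """Expected number of fabricated citations across a batch of papers."""
    return papers * citations_per_paper * fabrication_rate


# Assumed inputs (hypothetical, not measured):
PAPERS = 1_000             # "thousands of papers", taken at the low end
CITATIONS_PER_PAPER = 30   # assumed reference-list length

# The 5-15% range quoted above.
low = expected_fabricated_citations(PAPERS, CITATIONS_PER_PAPER, 0.05)
high = expected_fabricated_citations(PAPERS, CITATIONS_PER_PAPER, 0.15)

print(f"Expected fabricated citations: {round(low)} to {round(high)}")
```

Even with one citation per paper, the same function gives 50 to 150 fabricated citations per thousand papers, so the total grows quickly with both batch size and reference-list length.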
We built verification into our pipeline from day one. The result: a 53% catch rate on fabricated DOIs in adversarial review, checked against the CrossRef API. Not because our models are smarter, but because we treat verification as a first-class requirement rather than an afterthought.
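For readers who want a feel for what a DOI check involves, here is a minimal sketch, not our production code. It assumes the public Crossref REST API, where a lookup at `https://api.crossref.org/works/{doi}` returns HTTP 404 for a DOI Crossref has no record of. The function names, the `User-Agent` string, and the DOI format regex are hypothetical choices for this example.

```python
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

# Loose syntactic check for a DOI (assumed pattern, not an official grammar).
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def crossref_status(doi: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code Crossref gives for a DOI lookup."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    request = urllib.request.Request(
        url, headers={"User-Agent": "doi-check-sketch/0.1"}  # hypothetical agent
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when Crossref has no record


def classify_doi(doi: str,
                 fetch_status: Callable[[str], int] = crossref_status) -> str:
    """Classify a DOI as malformed, registered, not_found, or unknown."""
    if not DOI_PATTERN.match(doi):
        return "malformed"
    status = fetch_status(doi)
    if status == 200:
        return "registered"
    if status == 404:
        return "not_found"
    return "unknown"  # rate limits, outages: do not guess


# Offline usage with a stubbed lookup, so the example runs without network access.
def stub_status(doi: str) -> int:
    return 404

print(classify_doi("10.1234/placeholder", fetch_status=stub_status))
print(classify_doi("not-a-doi", fetch_status=stub_status))
```

A `not_found` result is a flag for review, not proof of fabrication: some real DOIs are registered with agencies other than Crossref, such as DataCite. A `registered` result only shows the DOI exists, not that the cited paper says what the text claims.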
What Adversarial Review Looks Like
Our approach is simple in concept, demanding in practice: generate hypotheses, then try to destroy them.
We pass each hypothesis between models with opposing incentives. One model generates it; another critiques it on five axes: mechanistic plausibility, prior art, confounders, falsifiability, and safety flags.
The result? 57% of generated hypotheses fail our adversarial pipeline.
This is a feature, not a bug. It means what survives is worth a researcher’s time.
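The generate-then-critique loop can be sketched roughly as follows. The five axis names come from the description above. Everything else is an assumption made for illustration: the 0-to-1 scoring scale, the 0.6 threshold, the rule that a hypothesis must pass every axis, and all function and class names. The critic here is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five critique axes named in the text.
AXES = (
    "mechanistic_plausibility",
    "prior_art",
    "confounders",
    "falsifiability",
    "safety_flags",
)

# A critic maps a hypothesis to one score per axis.
# Assumed convention: 0.0 is worst, 1.0 is best (for safety_flags, 1.0 means no concerns).
Critic = Callable[[str], Dict[str, float]]


@dataclass
class Review:
    hypothesis: str
    scores: Dict[str, float]
    threshold: float = 0.6  # assumed cutoff, not a published value

    @property
    def failed_axes(self) -> List[str]:
        # A missing score counts as a failure on that axis.
        return [a for a in AXES if self.scores.get(a, 0.0) < self.threshold]

    @property
    def survives(self) -> bool:
        return not self.failed_axes


def adversarial_review(hypotheses: List[str], critic: Critic,
                       threshold: float = 0.6) -> List[Review]:
    """Score every hypothesis with the critic and record the verdicts."""
    return [Review(h, critic(h), threshold) for h in hypotheses]


def rejection_rate(reviews: List[Review]) -> float:
    """Fraction of reviewed hypotheses that failed at least one axis."""
    if not reviews:
        return 0.0
    return sum(not r.survives for r in reviews) / len(reviews)


def stub_critic(hypothesis: str) -> Dict[str, float]:
    """Placeholder critic; a real one would call a second model."""
    scores = {axis: 0.9 for axis in AXES}
    if "untestable" in hypothesis:
        scores["falsifiability"] = 0.2
    return scores


reviews = adversarial_review(
    ["Placeholder hypothesis A", "An untestable placeholder hypothesis"],
    stub_critic,
)
for review in reviews:
    verdict = "survives" if review.survives else f"rejected on {review.failed_axes}"
    print(f"{review.hypothesis!r}: {verdict}")
print(f"rejection rate: {rejection_rate(reviews):.0%}")
```

Requiring a pass on every axis is the strictest possible rule. A weighted score would reject fewer hypotheses, at the cost of letting a strong prior-art score mask a weak falsifiability score.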
The Human-AI Division of Labor
The pattern that works isn’t “AI replaces researcher.” It’s “AI does breadth, human does depth.”
AI can read 1,000 papers in the time it takes a human to read 10. But AI cannot judge clinical plausibility, assess resource constraints, or weigh ethical considerations with the nuance a domain expert brings.
The most interesting discoveries happen at this interface. We call them “Move 37 moments” — after AlphaGo’s famous unconventional move that surprised expert commentators. These are insights hiding in plain sight, missed by humans constrained by field conventions, surfaced by AI that doesn’t know the conventions.
In one of our projects, the AI identified an mGlu2/3 receptor timing optimization that had been “sitting there” in the published literature, unnoticed because researchers focused on single-agent studies rather than timing windows.
The Coming Verification Market
As autonomous research scales, the verification bottleneck will only tighten. “Who verified this?” will matter as much as “Who wrote this?”
Trust isn’t a feature to add later. It’s architecture that has to be there from the start.
The race to autonomous research is missing this component. The organizations that build verification infrastructure now will define the category.
Proof Points
- 53% DOI fabrication catch rate — verified against CrossRef API
- 57% hypothesis rejection rate — adversarial quality filter
- 100% CFR verification — eCFR API integration for regulatory compliance
- Model-agnostic — not locked to any single AI vendor