The 110,000-Paper Problem: Why Citation Verification Is Research Infrastructure

A 2025 report in Nature estimated that 110,000 scholarly publications from that year alone contain invalid AI-generated references. Not typos. Not formatting errors. Invalid references: citations to papers that do not exist, to studies that say the opposite of what the citing paper claims, to DOIs that resolve to unrelated work.

One hundred and ten thousand papers. In one year. Across every discipline where AI-assisted writing has entered the workflow.

This is not a detection problem. It is an infrastructure problem, and no amount of reviewer vigilance fixes it.

The Numbers Behind the Scale

The problem is not anecdotal. It is measurable, growing, and accelerating.

110,000+ publications from 2025 estimated to contain invalid AI-generated references (Nature, via Grounded AI analysis). This is the headline number, and it is conservative. It counts only papers where invalid references were detected, not the papers where they were not.

40% of AI-generated references contain errors. Only 26.5% are entirely correct (Enago Academy). The remaining 33.5% are partial failures: real papers attached to claims they do not fully support, contested evidence presented as settled, null findings cited as positive.

100+ hallucinated citations found in NeurIPS 2025 papers alone (GPTZero, reported in Fortune). NeurIPS is one conference, one year, one field. Those 100+ citations sit in papers that passed peer review with fabricated or misaligned references embedded in the published record.

NIH will terminate grants and refer cases to the Office of Research Integrity if AI-generated content is detected in grant applications (NOT-OD-25-132). This is not a warning. It is a policy with enforcement teeth, and it makes citation verification a compliance requirement, not a quality preference.

The distribution across failure modes matters. Fabricated citations (DOIs that do not resolve) are the most visible failure mode. They are also the easiest to catch. A single API call to CrossRef confirms or rejects a DOI in under 200 milliseconds.
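As a rough illustration of how small that existence check is, here is a minimal Python sketch against the public Crossref REST API, which returns HTTP 200 for a DOI it has on record and 404 for one it does not. The function names, the User-Agent string, and the injectable fetch_status parameter are assumptions made for this example, not part of any particular product.

```python
import urllib.error
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "citation-check-sketch/0.1"}  # placeholder agent
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 and other error statuses arrive as exceptions


def doi_exists(doi: str, fetch_status=http_status) -> bool:
    """True if Crossref has a record for the DOI, False if it answers 404."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi.strip(), safe="/")
    status = fetch_status(url)
    if status == 200:
        return True
    if status == 404:
        return False
    # Rate limits or outages are not evidence either way, so fail loudly.
    raise RuntimeError(f"Unexpected status {status} for DOI {doi!r}")


# Usage (performs a network request):
#   doi_exists("10.1234/placeholder-doi")
```

A 404 from Crossref alone does not prove fabrication, because some DOIs are registered with other agencies such as DataCite. A fuller check would query those registries before flagging the citation.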

The harder failures are the ones that pass existence checks. Backwards citations: real papers, inverted claims. Contested papers cited as settled. Structural misuse: a methods paper cited as evidence of a clinical outcome. These failures live in the gap between “the paper exists” and “the paper supports this claim.”

That gap is where verification infrastructure lives.

Why Detection After Publication Is Too Late

The damage is done when the paper is published. Retraction is slow, costly, and incomplete.

When a paper with invalid citations reaches publication, the harm has already occurred:

The paper enters the literature with bad references that future papers will cite, amplifying the error.
The author’s credibility is damaged: one bad citation triggers reviewer doubt across the entire reference list.
If the error is discovered, retraction takes months to years, during which the paper continues to be cited.
For grant-funded research, invalid citations can trigger compliance reviews and funding clawbacks.

Detection after publication is an autopsy. It identifies the cause of death. It does not prevent it.

Infrastructure that catches invalid citations before submission (systematic, automated verification) is fundamentally different from the tools that flag them after the fact. One prevents the error from entering the literature. The other documents the error for the retraction record.

Why Manual Verification Does Not Scale

Authors cannot check every citation by hand. Reviewers certainly cannot.

The average empirical paper in the social sciences carries 40-80 references. A systematic review can carry 200+. Asking the author to manually verify every citation (checking the DOI, reading the abstract, confirming alignment with the claim) adds days of work to an already lengthy submission process.

And it is still incomplete. Manual verification catches fabricated citations (if the author checks every DOI) and obvious misalignments (if the author reads every abstract against every claim). It misses backwards citations where the paper’s conclusion is nuanced enough that a quick abstract read does not reveal the directional mismatch. It misses contested papers where the author does not know to check the broader literature’s treatment of the source.

Reviewers are in a worse position. They are volunteers reading a dense paper alongside their own workload. They check citations selectively, usually the ones attached to the paper’s central claims. A bad citation in the background section or the literature review will pass unnoticed unless a particularly thorough reviewer happens to look.

The system depends on human verification at both the author and reviewer ends. Neither end has the time, tools, or incentive to do it thoroughly. The 110,000 papers are the result.

What Verification Infrastructure Looks Like

Systematic. Automated. Pre-submission. Producing an audit trail that proves the check was done.

Verification infrastructure for research is not a tool the author runs once before submission. It is a system integrated into the research workflow that checks every citation automatically, flags every anomaly, and produces a structured record that the check was performed.

The minimum components:

Existence verification. Every DOI checked against CrossRef, DataCite, and PubMed. Fabricated citations flagged immediately. Sub-200ms per citation.

Polarity analysis. Every cited paper checked against Scite.ai’s database of 1.6 billion citation statements. Contested papers flagged with contradiction ratios.

Claim-source alignment. Each citation checked against the specific claim it supports. Backwards citations, structural misalignments, and out-of-context use flagged with rationale.

Audit trail. A complete verification log (what was checked, what was found, what was flagged, what was corrected) that the author can produce if a reviewer, editor, or compliance officer asks.
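One way to picture how these four components fit together is a single record per citation that carries the result of each layer, plus a log that serializes those records. The sketch below is illustrative only: the schema, the field names, the 0.25 threshold, and the definition of contradiction ratio (contrasting statements divided by supporting plus contrasting) are assumptions for this example, not a description of Scite.ai's data model or of any specific product.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class CitationCheck:
    """One audit-trail entry for one citation (hypothetical schema)."""
    doi: str
    claim: str
    exists: bool                  # existence layer
    supporting: int = 0           # polarity layer: citing statements that support the paper
    contrasting: int = 0          # polarity layer: citing statements that dispute it
    alignment: str = "unchecked"  # alignment layer: "supports", "contradicts", "off_topic"
    rationale: str = ""           # why the alignment verdict was reached
    resolution: str = ""          # what the author corrected, if anything
    checked_at: str = field(default_factory=_utc_now)

    def contradiction_ratio(self) -> float:
        """Share of polar citing statements that dispute the paper (assumed definition)."""
        polar = self.supporting + self.contrasting
        return self.contrasting / polar if polar else 0.0

    def flags(self, contested_above: float = 0.25) -> list[str]:
        if not self.exists:
            return ["fabricated"]  # later layers are moot if the source does not exist
        found = []
        if self.contradiction_ratio() > contested_above:
            found.append("contested")
        if self.alignment in ("contradicts", "off_topic"):
            found.append("misaligned")
        return found


def audit_log(checks: list[CitationCheck]) -> str:
    """Serialize checks as JSON Lines, one record per citation, flags included."""
    rows = []
    for check in checks:
        row = asdict(check)
        row["contradiction_ratio"] = round(check.contradiction_ratio(), 3)
        row["flags"] = check.flags()
        rows.append(json.dumps(row, sort_keys=True))
    return "\n".join(rows)
```

Writing one line per citation keeps the log append-only and easy to hand to an editor or compliance officer as-is.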

The system does not replace the author’s judgment. It routes the author’s attention to the citations that require judgment, instead of asking the author to check every citation from scratch.

The Gate is the threshold that determines whether the paper’s citation set is ready for external scrutiny. A paper that passes The Gate has every citation verified across all three layers. A paper that fails has specific, named flags that must be addressed before submission.
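The pass/fail rule described here can be sketched in a few lines. This is a hypothetical illustration of the logic only: the layer names, the input shape (a mapping from citation key to per-layer outcomes), and the message format are assumptions, and the citation keys in the usage note are placeholders.

```python
LAYERS = ("existence", "polarity", "alignment")


def gate(results: dict[str, dict[str, bool]]) -> tuple[bool, list[str]]:
    """Pass only if every citation has a passing outcome on all three layers.

    results maps a citation key to {layer name: True/False}. A layer that is
    absent counts as not checked and blocks the gate like a failure does.
    """
    blocking = []
    for key, layers in sorted(results.items()):
        for layer in LAYERS:
            outcome = layers.get(layer)
            if outcome is None:
                blocking.append(f"{key}: {layer} not checked")
            elif not outcome:
                blocking.append(f"{key}: {layer} failed")
    return (not blocking, blocking)
```

Called with {"smith2021": {"existence": True, "polarity": True, "alignment": True}, "lee2019": {"existence": True, "polarity": False}}, the function returns False along with two named flags for lee2019: one failed polarity check and one alignment check that was never run. Treating a missing check as blocking is a deliberate choice in this sketch, since an unverified citation is not a verified one.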

The Question Every Research Organization Should Ask

Not whether your papers have bad citations. Whether you have a system that catches them.

110,000 papers in one year is not evidence that researchers are careless. It is evidence that the verification burden has been placed on the wrong layer of the system. Individual authors, working manually, cannot verify citations at the scale and speed that modern research demands. Reviewers cannot catch what authors miss. Editors cannot police a literature that is growing faster than their review capacity.

The infrastructure that catches bad citations before submission (systematic, automated, auditable) is the same infrastructure that protects the author’s credibility, the journal’s integrity, and the literature’s reliability.

The question is not whether your next paper has a bad citation. It is whether you have a system that would catch it before it reaches a reviewer.

If your research workflow depends on authors and reviewers catching bad citations by hand, the 110,000-paper problem is your problem too. Request a research intake at axion.activewizards.com/research-pilot or reach us at axion@arizenai.com.