A researcher submits a grant application. The AI assistant produced twelve citations. Eight resolve to real papers. Three return 404. One resolves to a real paper, a 2019 RCT, that the AI used to support a claim the paper’s own conclusion section explicitly rejects.
That last failure is the expensive one. A dead URL fails a basic sanity check. The Backwards Citation, a real DOI cited against its own findings, survives every check except reading the paper.
Both failures share a common cause: the language model was asked to source and reason in the same step. It was never designed to do the first task.
Why LLMs Hallucinate Citations
LLMs are next-token predictors. When asked for a citation, they pattern-complete toward what a citation looks like: correct journal formatting, plausible author names, believable volume numbers. Nothing in that process requires the source to exist.
The mechanism is not malice or laziness. It’s architecture. A model trained on billions of documents has seen enough citation strings to generate citation-shaped text with high confidence. That confidence is calibrated to form, not to existence.
Ask GPT-4 to cite three papers supporting a niche claim in immunogenomics. It is likely to return three citations that look exactly right, along the lines of: Smith et al. (2021). Journal of Immunology, 206(4), 891-903. The DOI will be formatted correctly. The author name will be plausible. The journal will be real. The paper will not exist.
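A small sketch of the gap between form and existence: a shape check on a DOI accepts an invented identifier without complaint. The regular expression follows the pattern CrossRef recommends for modern DOIs; the identifier itself is made up for this example.

```python
import re

# Shape-only pattern for modern DOIs. It validates form, never existence.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(text: str) -> bool:
    """Return True when the string is DOI-shaped."""
    return DOI_SHAPE.match(text) is not None


# Hypothetical identifier, invented for illustration only.
fabricated = "10.9999/example.2021.0891"
print(looks_like_doi(fabricated))  # True: well formed, yet nothing guarantees a record behind it
```

Passing this check says only that the string is citation-shaped, which is the same property the model optimizes for.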
This is the default behavior when nothing verified is in context. The model fills the gap with the most statistically likely continuation, which, for a citation request, is a citation-shaped string.
The failure mode scales with specificity. Ask for a general claim about CRISPR and the model may surface something real from its training data. Ask for a 2023 meta-analysis on a narrow subpopulation and the hallucination rate climbs sharply, because the training distribution thins and the model has less real material to surface and more pressure to complete.
LLMs don’t search or retrieve; they pattern-complete toward plausible-looking sources. Source-Then-Think eliminates citation hallucinations by ensuring only verified material enters the producer’s context.
The fix is not a better prompt. It’s a different architecture.
The Community-Discovered Fix
In March 2026, an r/aiagents thread asked which AI research tool handled citations best. Sixty-plus upvotes later, the community had independently converged on a workflow that separates sourcing from reasoning, manually building the same architecture Axion automates.
The top comment, from a finance PhD student testing tools for a literature review: “claude is genuinely the best thinking partner when you feed it good sources. the gap is the sourcing step itself.”
The replies built out the workflow unprompted:
- Use Scira or Perplexity for initial discovery; they hit live indexes and return real URLs.
- Use Research Rabbit to expand from seed papers through citation graphs.
- Feed the verified results to Claude for synthesis, argument construction, and gap analysis.
No single commenter called this an architecture. But that’s what it is: a retrieval phase and a reasoning phase with a human acting as the verification boundary between them.
The workflow works because it respects what each system is actually good at. Scira and Perplexity are search tools: they retrieve from live indexes. Claude is a reasoning engine: it synthesizes, compares, and constructs arguments from material in context. Combining them in sequence, rather than asking one system to do both, eliminates the hallucination surface.
The problem is that the human in the middle is doing verification work without verification tooling. They’re checking that URLs resolve. They’re not checking citation polarity. They’re not catching The Backwards Citation. And the workflow breaks the moment a collaborator skips a step or a deadline compresses the process.
The r/aiagents community independently converged on “source elsewhere, think in Claude”, manually building the same architecture Axion automates: Semantic Scholar + CrossRef + Scite.ai for sourcing, Claude for production.
Manual Source-Then-Think (STT) is better than hallucinating in a single unified step. It’s not good enough for grant submissions, systematic reviews, or expert witness reports.
Source-Then-Think as Architecture
Source-Then-Think (STT) makes the separation the r/aiagents community discovered into an explicit, enforced architectural boundary. The retrieval system and the reasoning system are separate processes. The producer never searches. The retrieval system never reasons.
Here’s the pipeline:
direction: right
retrieval: Retrieval Phase {
  s2: Semantic Scholar\n214M papers
  crossref: CrossRef\nDOI verification
  scite: Scite.ai\nPolarity scoring
  unpaywall: Unpaywall\nFull-text access
  s2 -> crossref -> scite -> unpaywall
}
boundary: Verification Boundary {
  style.stroke: "#15AABF"
  style.stroke-width: 3
  trust: The Trust Stack\nDOI ✓ · Polarity ✓ · Alignment ✓
}
reasoning: Reasoning Phase {
  producer: Claude Producer\n(reasoning only)
  output: Verified Output\nwith traceable citations
  producer -> output
}
retrieval -> boundary -> reasoning
Retrieval phase. Axion queries Semantic Scholar’s index of 214 million papers using the research question as a structured query, not a natural language prompt to an LLM. Candidate papers return with DOIs, abstracts, and citation counts.
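As a rough illustration of a structured query, the sketch below builds a request URL for the public Semantic Scholar Graph API paper search and keeps only results that carry a DOI. How Axion actually constructs its queries is not described here, so the function names, the chosen fields, and the sample payload are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public Semantic Scholar Graph API search endpoint.
S2_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(terms, limit=20):
    """Build a search URL from explicit terms (hypothetical helper)."""
    params = {
        "query": " ".join(terms),
        "fields": "title,abstract,externalIds,citationCount",
        "limit": limit,
    }
    return S2_SEARCH_URL + "?" + urlencode(params)


def extract_candidates(payload):
    """Keep only results with a DOI, so later checks have something to verify."""
    candidates = []
    for paper in payload.get("data", []):
        doi = (paper.get("externalIds") or {}).get("DOI")
        if doi:
            candidates.append({
                "doi": doi.lower(),
                "title": paper.get("title"),
                "abstract": paper.get("abstract"),
                "citation_count": paper.get("citationCount", 0),
            })
    return candidates


# Sample payload shaped like a search response; the values are invented.
sample = {
    "data": [
        {"title": "Paper A", "abstract": "...", "citationCount": 12,
         "externalIds": {"DOI": "10.0000/EXAMPLE.A"}},
        {"title": "Paper B", "abstract": None, "citationCount": 0,
         "externalIds": None},
    ]
}

print(build_search_url(["crispr", "off-target", "meta-analysis"], limit=5))
print(extract_candidates(sample))
```

Dropping results without a DOI at this stage is a design choice for the sketch: a candidate with no identifier cannot be checked downstream.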
Each candidate then passes through The Trust Stack:
- DOI existence: CrossRef REST API confirms the DOI resolves to a real record. Papers that fail here are dropped, not flagged for human review. There is no human review queue in a pipeline running at research scale.
- Citation polarity: Scite.ai’s resolve_polarity endpoint returns the ratio of supporting, neutral, and contradicting citations from subsequent literature. A paper with 40% contradicting citations is a different epistemic object than a paper with 4% contradicting citations, even if both exist and both are topically relevant. Polarity below threshold fails The Trust Stack.
- Full-text access: Unpaywall resolves open-access versions where available. The producer reasons from full text where possible, not just abstracts.
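The first two checks above can be sketched as a sequence of gates. This is a minimal illustration, not Axion's implementation: the `Candidate` type, the reason strings, and the 0.25 threshold are hypothetical, and the DOI lookup is injected as a function so the example runs offline. In practice that lookup could be a GET against CrossRef's `https://api.crossref.org/works/{doi}`, which returns 404 for an unknown DOI.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    doi: str
    supporting: int      # citation tallies, assumed to come from a polarity service
    neutral: int
    contradicting: int


def contradicting_share(c: Candidate) -> float:
    """Fraction of classified citations that contradict the paper."""
    total = c.supporting + c.neutral + c.contradicting
    # Assumption: a paper with no classified citations gets a share of 0.0.
    return c.contradicting / total if total else 0.0


def run_trust_stack(
    candidates: Iterable[Candidate],
    doi_exists: Callable[[str], bool],
    max_contradicting: float = 0.25,   # hypothetical threshold
) -> Tuple[List[Candidate], List[Tuple[str, str]]]:
    """Apply the gates in order; failures are dropped with a reason, never queued."""
    passed, dropped = [], []
    for c in candidates:
        if not doi_exists(c.doi):
            dropped.append((c.doi, "doi_not_found"))
        elif contradicting_share(c) > max_contradicting:
            dropped.append((c.doi, "polarity_below_threshold"))
        else:
            passed.append(c)
    return passed, dropped


# Invented data mirroring the 40% versus 4% contrast described above.
known_dois = {"10.0000/a", "10.0000/b"}
candidates = [
    Candidate("10.0000/a", supporting=50, neutral=10, contradicting=40),
    Candidate("10.0000/b", supporting=86, neutral=10, contradicting=4),
    Candidate("10.0000/c", supporting=10, neutral=0, contradicting=0),
]
passed, dropped = run_trust_stack(candidates, lambda doi: doi in known_dois)
print([c.doi for c in passed])
print(dropped)
```

Running the existence gate first keeps polarity lookups from being spent on records that do not exist.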
Reasoning phase. Claude receives a context window containing verified papers, their polarity scores, and their full text or structured abstracts. The system prompt enforces a hard constraint: the producer cites only material present in context. It does not search. It does not complete citations from training memory. If a claim requires a source not in the verified set, the producer flags the gap rather than filling it.
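A system prompt states the constraint, and a post-hoc check can confirm it held. The sketch below assumes the producer cites by DOI and compares every DOI found in a draft against the verified set. The function name and the sample text are hypothetical, and the pattern deliberately stops at a closing parenthesis, so DOIs that contain parentheses would need a more careful parser.

```python
import re

# Finds DOI-shaped substrings in free text. Stops at whitespace, quotes,
# angle brackets, and closing brackets, which is a simplification.
DOI_IN_TEXT = re.compile(r'10\.\d{4,9}/[^\s"<>)\]]+', re.IGNORECASE)


def find_unverified_citations(draft, verified_dois):
    """Return DOIs cited in the draft that are absent from the verified set."""
    verified = {d.lower() for d in verified_dois}
    cited = {
        m.group(0).rstrip(".,;").lower()   # trim trailing sentence punctuation
        for m in DOI_IN_TEXT.finditer(draft)
    }
    return sorted(cited - verified)


# Invented draft and identifiers, for illustration only.
draft = (
    "Editing efficiency improved in the treated cohort (doi:10.0000/verified.1). "
    "A later trial reported the same effect (doi:10.0000/unknown.9)."
)
print(find_unverified_citations(draft, {"10.0000/verified.1"}))
```

Any DOI this returns is either a citation completed from training memory or a gap the producer should have flagged.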
This is why The Backwards Citation fails at polarity, not at existence. CrossRef confirms the paper is real. Scite.ai reports that 31 of the 47 papers citing this RCT contradict it: the field moved on. The paper is real. The support it appears to provide is not.
The verification boundary is the architectural contribution. Without it, you have a fast pipeline that produces confident hallucinations. With it, you have a slower pipeline that produces traceable claims.
What This Means for Your Workflow
If you’re manually chaining Perplexity, Research Rabbit, and Claude, you’re building STT by hand. The architectural question is whether the verification boundary between those phases is enforced or optional.
For a one-off literature summary, optional is probably fine. For an NIH R01 Significance section, optional is a career risk. For an expert witness report, optional is a liability exposure.
The distinction between manual STT and automated STT is not speed, though automated runs the full retrieval-verification-reasoning loop in minutes rather than hours. The distinction is the verification gate. In manual STT, the gate is a human who may be tired, under deadline, or unfamiliar with Scite.ai’s polarity data. In automated STT, the gate is a programmatic check that runs identically on every paper, every time.
Three workflow patterns where this matters:
Systematic reviews. PRISMA-compliant synthesis requires documented inclusion/exclusion criteria applied consistently across hundreds of papers. Manual STT cannot produce an audit trail. Automated STT logs every DOI check, every polarity score, and every inclusion decision: the scite.py run produces a structured JSON record of every source that passed or failed The Trust Stack.
Grant applications. NIH study sections read citations. A fabricated citation in a Significance section is not a minor error; it’s a signal about the applicant’s rigor. Our grants vertical runs STT on every citation in the proposal before submission, with a polarity threshold calibrated to --rigor publication.
Legal expert reports. Opposing counsel will check every citation. A paper that exists but is contradicted by subsequent literature is a deposition liability. STT with polarity scoring catches this class of failure before the report leaves the building. See our legal vertical for the specific threshold configuration we use for expert witness work.
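To make the audit trail mentioned for systematic reviews concrete, here is one possible shape for a per-source record. The schema produced by scite.py is not shown in this article, so every field name below is an assumption.

```python
import json
from datetime import datetime, timezone

ALLOWED_DECISIONS = {"included", "excluded"}


def make_audit_entry(doi, checks, decision):
    """Build one audit record (hypothetical schema) for a single source."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError("decision must be 'included' or 'excluded'")
    return {
        "doi": doi,
        "checks": checks,        # e.g. {"doi_found": True, "contradicting_share": 0.04}
        "decision": decision,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }


def to_audit_json(entries):
    """Serialize the trail deterministically so runs can be diffed."""
    return json.dumps(entries, indent=2, sort_keys=True)


# Invented identifiers and scores, for illustration only.
entries = [
    make_audit_entry("10.0000/a", {"doi_found": True, "contradicting_share": 0.04}, "included"),
    make_audit_entry("10.0000/b", {"doi_found": False}, "excluded"),
]
print(to_audit_json(entries))
```

Recording the inputs to each decision, not only the decision, is what lets a reviewer later reconstruct why a source was excluded.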
The open question STT doesn’t answer: what to do when the verified literature is thin. If Semantic Scholar returns twelve papers on a narrow topic and eight fail The Trust Stack, the right output is “insufficient verified evidence”, not a synthesis padded with lower-confidence sources to fill the word count. We surface that gap. What the researcher does with it is a judgment call STT is not designed to make.
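The thin-literature case can be expressed as a simple guard that runs before synthesis. The minimum of five verified sources is a placeholder chosen for this sketch, not a documented setting.

```python
def evidence_status(retrieved, failed, min_verified=5):
    """Report whether enough sources survived verification to synthesize.

    min_verified is a hypothetical cutoff; the right value depends on the field.
    """
    verified = retrieved - failed
    if verified < min_verified:
        return "insufficient verified evidence"
    return "sufficient verified evidence"


# The scenario from the text: twelve papers returned, eight fail verification.
print(evidence_status(retrieved=12, failed=8))
```

With four survivors and a cutoff of five, the guard reports the gap instead of passing a thin set to the producer.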
Run STT on your next literature review. Request an architectural audit at axion.activewizards.com/pilot or reach out directly at axion@arizenai.com. We’ll review the research question, scope the right engagement, and return the retrieval-verification-reasoning audit trail for the bounded work unit.
The question worth sitting with: if your current workflow doesn’t produce an audit trail, how do you know which citations passed?