The Backwards Citation Problem: Why AI Cites Real Papers to Support Claims They Contradict

axion engine
bottom line
  • AI citation failure has three modes: fabricated (DOI doesn't exist), hallucinated (DOI exists, wrong paper), and backwards (real DOI, real paper, inverted claim).
  • CrossRef DOI verification catches Modes 1 and 2. It cannot catch Mode 3.
  • Scite.ai classifies 1.6 billion citation statements as supporting, contradicting, or mentioning, enabling automated detection of backwards citations that pass DOI verification.
  • The model that generated a backwards citation will not reliably catch it in self-review. Cross-vendor adversarial review (different model, different vendor) is the structural fix.
  • Axion's Trust Stack runs three sequential gates (existence, polarity, alignment), each catching what the previous one cannot.

A finance student’s professor failed them for citing a fabricated McKinsey report. The student, apparently thorough, had tested every major AI tool first and documented the results on r/aiagents: “Claude generates perfect formatting, plausible author, real-looking URL, completely made up.” The post collected 60+ upvotes in March 2026 because everyone recognized the problem.

That problem, fabricated citations, is the one the industry talks about. It is not the worst one.

There is a failure mode that passes DOI verification. That passes title matching. That passes casual human review by a subject-matter expert who recognizes the cited author’s name. A failure mode where the DOI is real, the paper exists, the journal is legitimate, and the citation still destroys the argument it was meant to support, because the paper’s findings run directly opposite to the claim being made.

We call this The Backwards Citation. It is harder to catch than fabrication, more consequential when it reaches publication, and structurally invisible to single-model review pipelines.


The Three Failure Modes of AI Citations

AI citation errors exist on a spectrum of detectability. Most verification pipelines are built to catch the easiest failure mode. The hardest one requires a different architecture entirely.

Mode 1: Fabricated. The DOI doesn’t resolve. CrossRef returns nothing. The paper, the authors, the journal volume: none of it exists. This is the failure mode that gets Reddit threads. It is also the easiest to catch: a single API call to api.crossref.org with the DOI string returns a 404 or an empty result set. Our crossref_verify.py module flags these in under 200ms per citation.
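A minimal sketch of the Mode 1 check, assuming the public CrossRef REST endpoint at api.crossref.org/works/{doi}. The function names fetch_crossref and doi_exists are hypothetical and are not taken from crossref_verify.py; the lookup is split from the decision so the decision can be exercised without network access.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref(doi: str, timeout: float = 10.0):
    """Look up a DOI and return (http_status, parsed_json_or_None)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Unknown DOIs come back as an HTTP error (typically 404).
        return err.code, None


def doi_exists(status: int, payload) -> bool:
    """Mode 1 gate: False for a 404 or an empty record, True otherwise."""
    if status != 200 or not payload:
        return False
    # CrossRef wraps the work's metadata in a "message" object.
    return bool(payload.get("message"))
```

Typical use would be `doi_exists(*fetch_crossref(doi))`. A True result only establishes that the identifier resolves; it says nothing about whether the metadata matches the claimed paper.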

Mode 2: Hallucinated. The DOI resolves, but to the wrong paper. The model retrieved a real identifier and attached it to a fabricated or misremembered title. CrossRef returns metadata, but the title, authors, or publication year don’t match what the AI claimed. A fuzzy string match against the returned title field catches this. Still automated, still fast, still a solved problem.
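The fuzzy title match can be sketched with the standard library alone. This is an illustration, not our production matcher: the helper names and the 0.85 threshold are assumptions, and the input is assumed to be the list of strings CrossRef returns in a work's title field.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so formatting noise is ignored."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(claimed: str, returned: str) -> float:
    """Similarity in [0, 1] between the claimed and the returned title."""
    return SequenceMatcher(
        None, normalize_title(claimed), normalize_title(returned)
    ).ratio()


def is_hallucinated(claimed_title: str, crossref_titles, threshold: float = 0.85) -> bool:
    """Mode 2 gate: True when no returned title is close to the claimed one.

    The threshold is an illustrative default and should be tuned on real data.
    """
    best = max(
        (title_similarity(claimed_title, t) for t in crossref_titles),
        default=0.0,  # no title returned at all counts as a mismatch
    )
    return best < threshold
```

The same comparison extends to authors and publication year; the title is usually the most discriminating field.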

Mode 3: Backwards. The DOI resolves. The metadata matches perfectly. The paper is exactly what the citation claims it is: Smith et al., 2023, Journal of Clinical Outcomes, the right volume, the right page numbers. And Smith et al. found that the intervention worsened outcomes rather than improving them. The citation is real. The claim is inverted. No DOI check catches this. No metadata comparison catches this. The only thing that catches this is knowing what the paper actually says, and whether the scientific community, across hundreds of subsequent citations, has treated it as evidence for or against the proposition.

Our DOI gate caught a 53% fabrication rate in a clinical article that survived three adversarial review iterations, but fabrication is the easy problem. Backwards citations are harder.

The distribution matters. In our production audits across research and grant writing workflows, Mode 1 fabrications are common but detectable. Mode 3 backwards citations are less frequent but survive longer, because every automated check they encounter returns a passing result.


What a Backwards Citation Looks Like in Practice

The backwards citation is structurally indistinguishable from a correct citation at every verification layer except polarity. That’s what makes it dangerous.

The pattern appears in generated text like this:

“As demonstrated by Smith et al. [14], prophylactic administration of X significantly reduces post-operative complication rates in high-risk cardiac patients.”

Smith et al. [14] is a real paper. DOI: 10.1016/j.example.2023.04.012. Published in a peer-reviewed journal. Smith is a recognized name in the field. The citation appears in the reference list with correct formatting.

Smith et al. actually found that prophylactic administration of X increased complication rates by 12% in the high-risk subgroup, leading the authors to recommend against its use. The paper’s abstract says this. The conclusion says this. The paper has since been cited 47 times; 31 of those citations use it as a cautionary example.

The AI model, during generation, retrieved Smith et al. as topically relevant to “prophylactic X in cardiac patients.” It was. The model then used it to support a claim directionally opposite to the paper’s findings. This is not a retrieval failure. It is a reasoning failure that retrieval-layer verification cannot detect.

This pattern is particularly acute in domains where the AI has partial familiarity: enough to retrieve plausible papers, not enough to represent their conclusions accurately. Clinical pharmacology. Policy economics. Contested empirical questions in social science. Anywhere the training data contains the paper’s title and abstract but not a reliable encoding of its directional finding.

The legal vertical is where we see the highest-stakes version: an expert witness report citing a meta-analysis to support a causation claim, when the meta-analysis explicitly found insufficient evidence for causation. The DOI is real. The expert recognized the paper. The error reached draft stage.


Why Single-Model Review Can’t Catch This

The model that generated a backwards citation shares the same training biases as the model reviewing it. Structural independence (a different vendor, a different training corpus) is the only reliable fix.

The instinct when an AI makes an error is to ask the AI to check its work. For Mode 1 and Mode 2 errors, this sometimes works: the model can verify its own DOI against CrossRef. For Mode 3, it fails structurally.

Consider what a self-review loop actually is: the same parametric weights, trained on the same corpus snapshot, evaluating a claim that those weights generated. If the model’s representation of Smith et al. is that it supports prophylactic X, whether because the abstract was ambiguous, the training data was imbalanced, or the paper appeared in a context where it was cited approvingly, then asking that model to “verify” the Smith et al. citation will return a confident confirmation.

We tested this directly. A backwards citation in a generated oncology methods section (real DOI, inverted finding) was submitted to the same model (GPT-4o) for verification with the prompt: “Confirm that each citation supports the claim it is attached to.” The model confirmed all citations, including the backwards one, with a confidence statement.

Cross-vendor adversarial review (Claude produces, Gemini audits) catches backwards citations because different models have different training biases; they don’t share the same blind spots.

This is the core logic of The Adversarial Trinity, our three-agent cross-vendor verification loop. Different models fail differently. Claude and Gemini have different training corpora, different RLHF pipelines, different parametric representations of the same papers. When Gemini audits a Claude-generated citation, it brings structurally independent priors to the evaluation. It doesn’t know what Claude “thought” Smith et al. said. It has its own representation, and when that representation conflicts with the cited claim, the conflict surfaces as a flag.

This is also why Source-Then-Think (STT) matters at the retrieval layer: separating the act of retrieving a source from the act of reasoning about it prevents the model from constructing a claim and then selecting a citation to fit it. But STT is a generation-time control. For content that has already been generated, or for auditing pipelines that receive documents from external sources, adversarial cross-vendor review is the structural fix.


The Polarity Gate: How Scite.ai Detection Works

Automated polarity analysis transforms citation verification from a binary check (does this DOI exist?) into a directional one (does the literature treat this paper as supporting or contradicting its central claim?).

Scite.ai has classified 1.6 billion citation statements (the actual sentences in papers that cite other papers) as supporting, contradicting, or mentioning the cited work. This is not metadata. This is the accumulated directional signal of the scientific community.

Our scite.py module queries the Scite.ai API with a DOI and returns a polarity profile: the count of supporting citations, contradicting citations, and neutral mentions, along with the ratio. A paper with a 70% supporting ratio is one the field treats as confirmatory evidence. A paper with a 35% contradiction ratio is contested, and any claim that presents it as settled evidence is misrepresenting its standing.
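A polarity profile of this kind reduces to three counts and the ratios derived from them. The sketch below assumes the counts have already been fetched; the PolarityProfile class is a hypothetical stand-in, not the structure scite.py returns, and the Scite.ai request itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolarityProfile:
    """Counts of citation statements about one paper, grouped by polarity."""
    supporting: int
    contradicting: int
    mentioning: int

    @property
    def total(self) -> int:
        return self.supporting + self.contradicting + self.mentioning

    def _share(self, count: int) -> float:
        # A paper with no classified statements carries no polarity signal.
        return count / self.total if self.total else 0.0

    @property
    def supporting_ratio(self) -> float:
        return self._share(self.supporting)

    @property
    def contradiction_ratio(self) -> float:
        return self._share(self.contradicting)
```

Both ratios use all citation statements as the denominator, neutral mentions included, so a heavily mentioned paper needs proportionally more contradicting statements before it reads as contested.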

The resolve_polarity function applies a configurable threshold. Our default flag condition: any cited paper where contradicting citations exceed 20% of total citation statements triggers a CTO-level audit flag. The audit output reads:

Citation [14], Smith et al. (2023), DOI: 10.1016/j.example.2023.04.012
Polarity profile: 31 supporting / 47 contradicting / 19 mentioning
Contradiction ratio: 48.5% (FLAGGED)
Claim alignment: "significantly reduces complication rates"
Recommendation: Manual verification required. Paper's directional finding may be inverted in current usage.
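The flag condition itself is a one-line comparison. The sketch below is an illustration of the threshold rule described above, not the body of resolve_polarity; the function name polarity_flag and the returned dictionary shape are assumptions.

```python
def polarity_flag(supporting: int, contradicting: int, mentioning: int,
                  threshold: float = 0.20) -> dict:
    """Flag a cited paper when contradicting statements exceed the threshold.

    The ratio is taken over all citation statements, mentions included.
    """
    total = supporting + contradicting + mentioning
    ratio = contradicting / total if total else 0.0
    return {
        "contradiction_ratio": ratio,
        # Strictly greater than: a paper sitting exactly at the threshold passes.
        "flagged": ratio > threshold,
    }


# The Smith et al. profile from the audit output above.
smith = polarity_flag(supporting=31, contradicting=47, mentioning=19)
```

A paper with no classified citation statements returns an unflagged result here. In practice that case probably deserves its own status, since the absence of signal is not the same as a pass.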

This is not a rejection. It is a gate. The human reviewer (the researcher, the grant writer, the expert witness) receives a specific, actionable flag rather than a clean bill of health that conceals a structural problem.

The polarity gate does not replace human judgment. A 48% contradiction ratio on Smith et al. might mean the paper’s findings are genuinely contested in the literature, which is itself important context for any claim that presents them as established. Or it might mean the model has inverted the finding. Either way, the flag is the right output. Passing the citation silently is not.


The Trust Stack: Three Layers, Three Failure Modes

No single verification mechanism catches all three citation failure modes. The Trust Stack sequences them so each layer catches what the previous one cannot.

Figure: The Trust Stack, a three-layer citation verification architecture.

The Trust Stack is the architecture we built after discovering that DOI verification alone was giving research teams false confidence. Three layers, each with a defined scope and a defined failure mode it catches:

Layer 1: EXISTENCE
Tool: CrossRef API
Catches: Mode 1 (Fabricated: DOI doesn't resolve) and Mode 2 (Hallucinated: returned metadata doesn't match)
Output: PASS / FAIL + metadata match score
What it misses: Mode 3

Layer 2: POLARITY
Tool: Scite.ai API → resolve_polarity
Catches: Mode 3 (Backwards: real paper, inverted claim)
Output: Polarity profile + contradiction ratio + flag threshold
What it misses: Claim-level alignment nuance

Layer 3: ALIGNMENT
Tool: Cross-vendor adversarial review (--rigor publication)
Catches: Subtle misrepresentation, selective quotation, out-of-context use
Output: Adversarial audit report with specific claim-citation conflicts
What it misses: Ideally nothing a human expert reading the full paper would catch, though this layer only approximates that read

The layers are sequential. A citation that fails Layer 1 doesn’t proceed to Layer 2. A citation that passes Layers 1 and 2 still proceeds to Layer 3 if the document is run with --rigor publication, our highest verification tier, which we apply to NIH R01 submissions, expert witness reports, and systematic reviews.
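The gating order can be sketched as a small orchestration function. Everything here is illustrative: run_trust_stack and its callable parameters are hypothetical names, each layer is reduced to a function returning True on pass, and the choice to let a polarity-flagged citation continue to Layer 3 is an assumption rather than documented behavior.

```python
from typing import Callable, List, Tuple

Check = Callable[[dict], bool]  # takes a citation record, returns True on pass


def run_trust_stack(citation: dict, existence: Check, polarity: Check,
                    alignment: Check, rigor: str = "standard") -> List[Tuple[str, str]]:
    """Run the three gates in order and return a (layer, result) trace."""
    trace: List[Tuple[str, str]] = []

    # Layer 1: a citation that does not exist never reaches the later gates.
    if not existence(citation):
        trace.append(("existence", "FAIL"))
        return trace
    trace.append(("existence", "PASS"))

    # Layer 2: polarity produces a flag for human review, not a rejection.
    trace.append(("polarity", "PASS" if polarity(citation) else "FLAG"))

    # Layer 3: only at the publication rigor tier.
    if rigor == "publication":
        trace.append(("alignment", "PASS" if alignment(citation) else "FLAG"))
    return trace
```

Returning a trace rather than a single verdict keeps visible which layers actually ran for a given citation.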

The --rigor publication flag activates the full Adversarial Trinity: the generating model’s output is passed to two independent auditing models from different vendors, each instructed to find claim-citation conflicts. Disagreements between auditors are escalated to human review. Agreements are logged as high-confidence flags.
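The reconciliation rule for two auditors fits in a few lines. This is a sketch under assumptions: each auditor's output is reduced to a single boolean per citation (conflict found or not), the names Outcome and reconcile are hypothetical, and treating "both auditors found nothing" as a clear result is our reading of the rule rather than a stated behavior.

```python
from enum import Enum


class Outcome(Enum):
    CLEAR = "clear"
    HIGH_CONFIDENCE_FLAG = "high_confidence_flag"
    ESCALATE_TO_HUMAN = "escalate_to_human"


def reconcile(auditor_a_conflict: bool, auditor_b_conflict: bool) -> Outcome:
    """Combine two independent audit verdicts on one claim-citation pair."""
    if auditor_a_conflict != auditor_b_conflict:
        # The auditors disagree, so a human reviewer decides.
        return Outcome.ESCALATE_TO_HUMAN
    if auditor_a_conflict:
        # Both auditors independently found a conflict.
        return Outcome.HIGH_CONFIDENCE_FLAG
    return Outcome.CLEAR
```

Real audit reports carry more than a boolean (the conflicting passage, a rationale), but the escalation logic only needs agreement or disagreement.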

This is not a slow process. Layers 1 and 2 run automatically after generation, completing in under 3 seconds per citation for typical document lengths. Layer 3 adds 90-180 seconds for a full document audit. For a grant application where a backwards citation could trigger reviewer rejection or, worse, post-award scrutiny, that latency is not a cost. It is the work.

The research verification workflow runs all three layers by default, with the academic case study showing the proof object behind that route. The grants workflow runs Layers 1 and 2 automatically and triggers Layer 3 on any citation attached to a primary efficacy claim.


What The Trust Stack Doesn’t Catch

We built the Trust Stack to catch what we could measure. There are failure modes it doesn’t address.

Scite.ai’s polarity classifications reflect how papers are cited, not whether those citing papers are themselves correct. A paper that has been widely cited approvingly in a field where the consensus is wrong will show a high supporting ratio. The Trust Stack will pass it. The consensus error is not a citation integrity problem; it is a domain epistemology problem, and no automated system resolves it.

Layer 3 adversarial review approximates a careful expert read. It does not replace one. For claims where the entire argument rests on a single paper’s findings, we recommend human expert verification of that paper regardless of Trust Stack output.

And the Trust Stack operates on citations that exist. It has no mechanism for detecting missing citations: papers that should have been cited and weren’t because they contradict the argument. That absence problem is a different architecture question, one we’re working on.

The question worth sitting with: if your current pipeline returns a clean citation report, do you know which of these three layers it actually ran?


If your research pipeline can’t tell you whether a citation supports or contradicts your claim, you’re flying blind. Request an architectural audit at axion.activewizards.com/pilot or reach us directly at axion@arizenai.com.
