
DOI Verification Catches 40% of Bad Citations. Here's What Catches the Other 60%.

axion engine
bottom line
  • DOI existence checks (CrossRef/DataCite/NCBI) catch fabricated references but miss real papers cited incorrectly, covering roughly 40% of citation failure modes.
  • Scite.ai polarity analysis across 1.6B citation statements flags papers with >20% contradiction ratios before they enter a proposal or report.
  • A cross-vendor adversarial audit (Gemini reviewing GPT-4o output) catches alignment failures: correct papers attached to claims they don't actually support.
  • The full stack costs approximately $0.11 per DOI at production volume. A single backwards citation in an NIH R01 can cost the application.
  • No layer is sufficient alone. Each covers the failure modes the previous layer cannot detect.

Most AI writing tools verify citations by checking whether the DOI resolves. That is the right instinct and the wrong stopping point. A DOI that resolves confirms one thing: the paper exists. It says nothing about whether the paper supports the claim it’s attached to, whether the broader literature has since contradicted it, or whether the model silently reversed the paper’s conclusion to fit its argument.

We call that last failure mode The Backwards Citation: a real paper with a valid DOI, cited to support a claim it directly contradicts. It passes every single-layer check. It fails only when you read the abstract against the claim. In production runs across grant proposals and systematic reviews, we find this class of error in roughly 23% of AI-generated citation sets that passed DOI verification alone.

The architecture that catches all three failure modes (fabrication, contested evidence, and misalignment) is The Trust Stack: three sequential gates, each covering the failure modes the previous gate cannot detect.


Layer 1: The Existence Gate (CrossRef / DataCite / NCBI)

The existence gate answers one binary question: does this DOI resolve to a real, retrievable paper? It catches fabricated references and hallucinated identifiers. It cannot evaluate what the paper says or how the literature has received it.

We run every DOI through a resolver cascade: CrossRef first (journal articles, conference proceedings), DataCite second (datasets, preprints, grey literature), NCBI PubMed third (biomedical corpus where CrossRef coverage is incomplete). The cascade matters: a DOI that fails CrossRef may resolve cleanly through PubMed, and calling it fabricated without the fallback produces false positives that erode researcher trust in the tool.

Production numbers: Across 14,200 DOIs processed in Q1 2026, the existence gate resolved 13,106 (92.3%) on the first CrossRef call. An additional 487 (3.4%) resolved through DataCite or NCBI fallback. The remaining 607 (4.3%) were confirmed fabrications or off-by-one hallucinations: transposed digits, plausible-but-nonexistent journal volume numbers, author names attached to real DOIs that belong to different authors entirely.

Each existence check runs in approximately 0.1 seconds, and the resolver APIs charge nothing per call. At production volume, this gate is effectively free.

What it misses is the more important specification: every paper in the resolved set (13,593 DOIs, 95.7% of the total) is a real paper. Some of them are wrong for the claim. Some of them are contested. Some of them say the opposite of what the model claims they say. The existence gate cannot tell you which.

CrossRef API → 200 OK → DOI confirmed
DataCite API → 200 OK → DOI confirmed (fallback)
NCBI PubMed → 200 OK → DOI confirmed (fallback)
All three → 404/timeout → FABRICATION FLAG
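
The cascade above can be sketched in a few lines of Python. This is a minimal illustration, not our production module: `existence_gate`, the `Resolver` type, and the stub resolvers are hypothetical names, and a real deployment would wrap HTTP calls to CrossRef, DataCite, and NCBI behind the same interface. The sketch also records which resolvers were inconclusive (for example, timed out), which is an assumption added here so that a reviewer can tell a timeout apart from a definite miss.

```python
from typing import Callable, List, Optional, Tuple

# A resolver takes a DOI and returns True (found), False (definitively not
# found, e.g. HTTP 404), or None (inconclusive, e.g. a timeout).
Resolver = Callable[[str], Optional[bool]]


def existence_gate(doi: str, resolvers: List[Tuple[str, Resolver]]) -> dict:
    """Try each resolver in order and stop at the first one that confirms the DOI."""
    inconclusive = []
    for name, resolve in resolvers:
        result = resolve(doi)
        if result is True:
            return {"doi": doi, "status": "RESOLVED", "source": name}
        if result is None:
            inconclusive.append(name)
    # No resolver confirmed the DOI: raise the flag only after the full cascade.
    return {
        "doi": doi,
        "status": "FABRICATION_FLAG",
        "source": None,
        "inconclusive": inconclusive,
    }


# Stub resolvers standing in for real API clients (hypothetical data).
known_crossref = {"10.1000/example.1"}
known_datacite = {"10.5555/dataset.2"}
cascade = [
    ("crossref", lambda d: d in known_crossref),
    ("datacite", lambda d: d in known_datacite),
    ("ncbi", lambda d: None),  # simulate a timeout
]

for doi in ("10.1000/example.1", "10.5555/dataset.2", "10.9999/fake"):
    print(existence_gate(doi, cascade))
```

The point of the ordering is that a flag is never raised on a single resolver's failure, which is what keeps fallback-resolvable DOIs out of the fabrication bucket.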

Layer 2: The Polarity Gate (Scite.ai, 1.6B Citation Statements)

The polarity gate asks whether the scientific community has supported or contradicted a paper since publication. A paper with a valid DOI and a 34% contradiction ratio is a different evidentiary asset than one with 94% support, and treating them identically is a material error in any high-stakes document.

Citation polarity analysis via Scite.ai’s free tallies endpoint processes 500 DOIs per batch call at zero cost, making polarity verification economically viable for every production run, not just high-stakes publications.

Scite.ai classifies over 1.6 billion citation statements as supporting, contradicting, or mentioning. The distinction between “mentioning” and “supporting” is critical and frequently collapsed by researchers relying on raw citation counts. A paper cited 400 times, 360 of which are neutral mentions and 40 of which are direct contradictions, looks authoritative by citation count. Polarity analysis exposes the actual evidentiary weight.

Our scite.py module queries the tallies endpoint and applies a two-threshold flag system:

  • Yellow flag: contradiction ratio between 15% and 20%, paper is contested, requires contextual judgment
  • Red flag: contradiction ratio above 20%, paper should not be cited as settled evidence without explicit qualification
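
A minimal sketch of the two-threshold logic follows. The function names and the tally field names (`supporting`, `contradicting`, `mentioning`) are illustrative and simply mirror the three classes named above. The ratio definition is also an assumption: the sketch divides contradicting statements by supporting plus contradicting statements and ignores neutral mentions. A different denominator would shift where papers land relative to the thresholds.

```python
from typing import Optional

YELLOW_THRESHOLD = 0.15  # lower bound of the "contested" band
RED_THRESHOLD = 0.20     # above this, do not cite as settled evidence


def contradiction_ratio(tally: dict) -> Optional[float]:
    """Assumed definition: contradicting / (supporting + contradicting)."""
    supporting = tally.get("supporting", 0)
    contradicting = tally.get("contradicting", 0)
    classified = supporting + contradicting
    if classified == 0:
        return None  # only neutral mentions, or no data at all
    return contradicting / classified


def polarity_flag(tally: dict) -> str:
    """Map a tally to RED, YELLOW, CLEAR, or NO_DATA."""
    ratio = contradiction_ratio(tally)
    if ratio is None:
        return "NO_DATA"
    if ratio > RED_THRESHOLD:
        return "RED"
    if ratio >= YELLOW_THRESHOLD:
        return "YELLOW"
    return "CLEAR"


print(polarity_flag({"supporting": 66, "contradicting": 34, "mentioning": 120}))
print(polarity_flag({"supporting": 83, "contradicting": 17, "mentioning": 40}))
print(polarity_flag({"supporting": 94, "contradicting": 6, "mentioning": 300}))
```

Returning `NO_DATA` rather than `CLEAR` when nothing is classified keeps papers without polarity coverage from being mistaken for well-supported ones.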

Production numbers: Across the 13,593 DOIs that passed the existence gate in Q1 2026, polarity data was available for 11,847 (87.2%). Of those, 1,203 (10.2%) triggered yellow flags and 334 (2.8%) triggered red flags. In 89 cases, a red-flagged paper was the primary citation supporting a causal claim in a grant proposal, the type of claim where evidentiary weight is directly evaluated by study section reviewers.

The polarity gate has a known ceiling: it operates at the paper level, not the claim level. A paper with a 5% contradiction ratio can still be misapplied. A paper with a 30% contradiction ratio might be legitimately cited if the author is specifically addressing the contested finding. Polarity is a prior, not a verdict. The alignment gate handles the rest.

What the polarity gate cannot catch: a well-supported paper cited to support a claim it doesn’t make. That requires reading the paper against the claim, which is what Layer 3 does.


Layer 3: The Alignment Gate (Cross-Vendor Adversarial Audit)

The alignment gate is where The Adversarial Trinity intersects with citation integrity. A model from a different vendor than the one that produced the document reads each citation against the claim it supports, with polarity flags as input, and returns a structured alignment verdict.

The mechanism: our producer model (GPT-4o) generates the document and its citations. The CTO model (Gemini 1.5 Pro, different vendor, no shared inference infrastructure) receives three inputs for each cited claim:

  1. The verbatim claim as written in the document
  2. The cited paper’s title, abstract, and DOI
  3. The polarity flag from Layer 2 (if any)

The CTO model returns a structured verdict: ALIGNED, PARTIAL, or MISALIGNED, with a one-sentence rationale. PARTIAL verdicts trigger a human-review flag rather than automatic rejection, because the model is good at detecting structural misuse but appropriately uncertain about domain-specific nuance.
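
The three inputs and the structured verdict can be sketched as plain data plus a parser. Everything named here (`AuditRequest`, `build_prompt`, `parse_verdict`, the `VERDICT: rationale` reply format) is hypothetical and chosen for illustration; the article does not specify the actual prompt or wire format. The fallback to PARTIAL on an unparseable reply is likewise an assumption, picked so that a malformed answer lands in human review instead of being silently approved.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

VERDICTS = {"ALIGNED", "PARTIAL", "MISALIGNED"}


@dataclass
class AuditRequest:
    claim: str                           # verbatim claim from the document
    title: str                           # cited paper's title
    abstract: str                        # cited paper's abstract
    doi: str                             # cited paper's DOI
    polarity_flag: Optional[str] = None  # Layer 2 flag, if any


def build_prompt(req: AuditRequest) -> str:
    """Assemble the auditor's input from the three pieces listed above."""
    return (
        f"Claim: {req.claim}\n"
        f"Paper: {req.title} (DOI {req.doi})\n"
        f"Abstract: {req.abstract}\n"
        f"Polarity flag: {req.polarity_flag or 'none'}\n"
        "Reply as 'VERDICT: one-sentence rationale' where VERDICT is "
        "ALIGNED, PARTIAL, or MISALIGNED."
    )


def parse_verdict(raw: str) -> Tuple[str, str]:
    """Split 'VERDICT: rationale'; unknown verdicts fall back to PARTIAL."""
    head, _, rationale = raw.strip().partition(":")
    verdict = head.strip().upper()
    if verdict not in VERDICTS:
        return "PARTIAL", "unparseable auditor response; routed to human review"
    return verdict, rationale.strip()


req = AuditRequest(
    claim="Drug X reduces relapse rates.",
    title="A trial of Drug X",
    abstract="We found no effect of Drug X on relapse.",
    doi="10.1000/example.1",
    polarity_flag="YELLOW",
)
print(build_prompt(req))
print(parse_verdict("MISALIGNED: The abstract reports no effect on relapse."))
```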

The Trust Stack verifies citations across three dimensions: existence (CrossRef/DataCite/NCBI), polarity (Scite.ai, 1.6B statements), and alignment (cross-vendor adversarial audit). Each layer catches failure modes the previous layer cannot.

Production numbers: Across 4,400 citation-claim pairs reviewed by the alignment gate in Q1 2026 (a subset of runs where the full stack was enabled), the CTO model returned MISALIGNED on 312 pairs (7.1%) and PARTIAL on 509 pairs (11.6%). Of the 312 MISALIGNED verdicts, 71 were confirmed Backwards Citations: papers whose abstracts directly contradicted the claims they were attached to. These had all passed the existence gate. Sixty-four of the 71 had polarity ratios below the red-flag threshold, meaning they would have passed a two-layer stack as well.

The alignment gate costs approximately $0.10 per CTO review pass at current Gemini API pricing. It is the most expensive layer by an order of magnitude. It is also the only layer that catches the backwards citation class.

What the alignment gate cannot catch: domain-specific misinterpretation where the paper is technically relevant but the nuance is wrong for the specific subfield. A methods paper correctly cited for its statistical approach but misapplied to a different population type requires a human expert with domain knowledge. We document this limit explicitly in every output. The gate is not a substitute for peer review; it is a filter that ensures peer review is applied to the right problems.


The Stack Is Stronger Than Any Layer

Each layer in The Trust Stack has a defined failure mode. The architecture is designed so that the next layer’s strength covers the previous layer’s weakness.

┌─────────────────────────────────────────────────────────────┐
│                      THE TRUST STACK                        │
│                                                             │
│  INPUT: AI-generated document with citations                │
│                          │                                  │
│                          ▼                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  LAYER 1: EXISTENCE GATE                              │  │
│  │  CrossRef → DataCite → NCBI                           │  │
│  │  Catches: fabricated DOIs, hallucinated identifiers   │  │
│  │  Misses:  real papers cited incorrectly               │  │
│  └───────────────────────┬───────────────────────────────┘  │
│                          │ FABRICATION FLAG → reject        │
│                          ▼ RESOLVED → continue              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  LAYER 2: POLARITY GATE                               │  │
│  │  Scite.ai tallies endpoint (1.6B statements)          │  │
│  │  Catches: contested papers used as settled evidence   │  │
│  │  Misses:  well-supported papers misapplied to claims  │  │
│  └───────────────────────┬───────────────────────────────┘  │
│                          │ RED FLAG → escalate + continue   │
│                          ▼ YELLOW/CLEAR → continue          │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  LAYER 3: ALIGNMENT GATE                              │  │
│  │  Cross-vendor CTO model (Gemini audits GPT-4o output) │  │
│  │  Catches: backwards citations, claim misalignment     │  │
│  │  Misses:  domain-specific nuance (→ human expert)     │  │
│  └───────────────────────┬───────────────────────────────┘  │
│                          │ MISALIGNED → reject + flag       │
│                          │ PARTIAL → human review queue     │
│                          ▼ ALIGNED → approved               │
│  OUTPUT: Verified citation set with audit trail             │
└─────────────────────────────────────────────────────────────┘
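
The control flow in the diagram can be sketched as one function that chains the three gates and records an audit trail. This is an illustrative composition only: `run_trust_stack` and its callable parameters are hypothetical names, and each gate is passed in as a plain function so the sketch stays self-contained.

```python
from typing import Callable, Dict, List

# Decision for each Layer 3 verdict, as drawn in the diagram.
VERDICT_TO_DECISION = {
    "ALIGNED": "APPROVED",
    "PARTIAL": "HUMAN_REVIEW",
    "MISALIGNED": "REJECTED",
}


def run_trust_stack(
    doi: str,
    claim: str,
    exists: Callable[[str], bool],
    polarity: Callable[[str], str],
    audit: Callable[[str, str, str], str],
) -> Dict[str, object]:
    """Run one citation-claim pair through the three gates in order."""
    trail: List[str] = []

    # Layer 1: an unresolved DOI is rejected before any further work.
    if not exists(doi):
        trail.append("L1 FABRICATION_FLAG")
        return {"decision": "REJECTED", "escalated": False, "trail": trail}
    trail.append("L1 RESOLVED")

    # Layer 2: a RED flag escalates but does not stop the pipeline.
    flag = polarity(doi)
    trail.append(f"L2 {flag}")
    escalated = flag == "RED"

    # Layer 3: the auditor sees the claim, the DOI, and the polarity flag.
    verdict = audit(claim, doi, flag)
    trail.append(f"L3 {verdict}")
    return {
        "decision": VERDICT_TO_DECISION[verdict],
        "escalated": escalated,
        "trail": trail,
    }


result = run_trust_stack(
    "10.1000/example.1",
    "Drug X reduces relapse rates.",
    exists=lambda d: True,
    polarity=lambda d: "RED",
    audit=lambda c, d, f: "PARTIAL",
)
print(result)
```

Keeping the trail as an ordered list is what makes the output exportable as the audit artifact the diagram ends on.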

The failure modes are not random; they follow a predictable pattern. Fabrication is caught at Layer 1. Contested evidence is caught at Layer 2. Structural misuse is caught at Layer 3. Domain-specific misinterpretation is not caught by any automated layer, which is why every Axion output includes an explicit human-review queue for PARTIAL verdicts and a documented scope-of-verification statement.

A single-layer system (DOI existence only) catches the failures that are easiest to catch and misses the failures that are most damaging. Misalignment is not a rare edge case. In our production data, 7.1% of citation-claim pairs that passed existence verification came back MISALIGNED, and 1.6% (71 of 4,400) were confirmed Backwards Citations. For a 40-citation NIH R01 proposal, that works out to an expected 2-3 misaligned citations, and an average of about two Backwards Citations for every three proposals, in an application that study section reviewers will read carefully.
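
The per-proposal arithmetic can be checked directly from the alignment-gate counts reported above (312 MISALIGNED and 71 confirmed Backwards Citations out of 4,400 pairs). Note that 7.1% is the MISALIGNED rate; confirmed Backwards Citations are a subset of it. The sketch below is a back-of-envelope estimate: it assumes every citation in a proposal carries the same error rate, and the at-least-one probability additionally assumes errors are independent of one another.

```python
pairs_reviewed = 4400
misaligned = 312
backwards = 71
citations_in_proposal = 40

# Per-pair rates observed at the alignment gate.
misaligned_rate = misaligned / pairs_reviewed  # about 7.1%
backwards_rate = backwards / pairs_reviewed    # about 1.6%

# Expected counts in a proposal of the given size.
expected_misaligned = citations_in_proposal * misaligned_rate
expected_backwards = citations_in_proposal * backwards_rate

# Chance of at least one Backwards Citation, assuming independent errors.
p_at_least_one_backwards = 1 - (1 - backwards_rate) ** citations_in_proposal

print(f"expected misaligned: {expected_misaligned:.2f}")
print(f"expected backwards:  {expected_backwards:.2f}")
print(f"P(>=1 backwards):    {p_at_least_one_backwards:.2f}")
```

Under these assumptions, a 40-citation proposal carries roughly 2.8 expected misaligned citations, about 0.65 expected Backwards Citations, and a little under even odds of containing at least one Backwards Citation.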


What This Costs vs. What It Prevents

The full stack costs $0.11 per DOI at current API pricing: approximately $0.01 for existence verification (infrastructure cost, API calls are free), $0.00 for polarity (Scite.ai free tier, batch endpoint), and $0.10 for the alignment gate CTO pass.

A 40-citation grant proposal costs $4.40 to run through the full stack.
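
For budgeting other document sizes, the same arithmetic can be written as a small helper. The per-layer figures are the ones quoted above; `stack_cost_usd` is an illustrative name, and the costs are held in whole cents to avoid floating-point rounding.

```python
# Per-DOI cost of each layer in cents, taken from the figures quoted above.
COST_PER_DOI_CENTS = {"existence": 1, "polarity": 0, "alignment": 10}


def stack_cost_usd(n_citations: int) -> float:
    """Total full-stack cost in dollars for a document with n_citations DOIs."""
    total_cents = n_citations * sum(COST_PER_DOI_CENTS.values())
    return total_cents / 100


for n in (40, 150, 1000):
    print(f"{n} citations: ${stack_cost_usd(n):.2f}")
```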

The NIH R01 success rate in 2025 was 19.7%. A bad citation in the background section (for example, a paper whose findings have since been contradicted, cited as if it were settled science) is the kind of error that a knowledgeable study section reviewer notices and that scores the Approach section down. We cannot quantify the exact cost of a single backwards citation to a proposal score. We can quantify that the full Trust Stack costs less than a single hour of a grants manager’s time and produces a documented audit trail that demonstrates due diligence.

For legal expert witness reports, the calculus is different but the direction is the same. A fabricated or misaligned citation in a Daubert-scrutinized report is not a scoring penalty; it is grounds for disqualification. The legal vertical on Axion applies the full stack to every citation in every expert report, with the audit trail exported as a reviewable artifact.

For dissertations and systematic reviews, the reputational cost of a retracted citation is asymmetric: the effort to correct it exceeds the effort to verify it by two orders of magnitude.

The open question the stack doesn’t answer: how do you handle the PARTIAL verdict queue at scale? When 11.6% of citations require human judgment, the bottleneck moves from AI generation to human review capacity. That is the right bottleneck to have, but it is a bottleneck. Our current approach routes PARTIAL verdicts to domain-specific review queues based on MeSH terms and journal classification. Whether that routing is accurate enough to be trusted without a secondary check is something we’re still measuring.


Run your next proposal or report through the full Trust Stack. Request an architectural audit at axion.activewizards.com/pilot or write to us at axion@arizenai.com.
