
Why cited answers are the minimum bar for regulatory AI

Author: RegAid Team

For regulatory teams, a fast answer without a verifiable source chain is not useful. It is risk. This article explains why citations are the minimum bar for regulatory AI, and what serious teams should demand from any system they evaluate.

Regulatory teams do not need AI that sounds informed. They need AI that keeps the source chain intact. If an answer cannot be traced back to the exact clause, passage, or guidance span that supports it, the system has not solved the hard part of the work. It has only moved the verification burden downstream.

Citations are not a nice extra in regulatory AI. They are the minimum bar. A model that gives a polished answer without showing exactly where the answer came from is still asking the RA professional to do the most expensive work manually: reopen the source, confirm the clause, test the interpretation, and decide whether the wording is safe enough to reuse in a draft, response, or submission.

The public discussion around FDA's internal AI tool, Elsa, made that gap visible. Staff told CNN that the system fabricated studies and misrepresented research. Employees said outputs had to be checked before use. That is not a strange edge case. It is what happens when fluent generation outruns the evidence chain.

The real problem is not hallucination in the abstract

The word "hallucination" is useful, but it can hide the operational problem.

The issue for regulatory work is not simply that a model might be wrong. Humans can also be wrong. The issue is that a model can be fluently wrong while removing the friction that normally forces verification. It can produce the shape of a regulatory answer, complete with regulatory-sounding language and reference-like phrasing, while leaving the user unsure whether any part of it is actually traceable to a primary source.

That changes the review dynamic in a way many teams underestimate.

Once a polished answer exists on the screen, the reviewer is no longer starting from a blank page. They are reacting to a proposed interpretation. If the system has not preserved the source chain, the reviewer has to reconstruct it manually while resisting the anchoring effect of the generated wording. That is slower than many teams assume, and riskier than many vendors admit.

For regulatory work, the standard is simple: if a claim matters enough to include in a draft, it matters enough to verify at the source.

A citation is only useful if it proves the wording

Many products now say they provide citations. That is not enough.

A citation is useful only when it does three things:

  1. It links to a primary source, not a secondary summary.
  2. It points to the exact passage or clause, not just the document title.
  3. It reflects what the system retrieved before generating, not a reference added after the answer was written.

That third point is the one most buyers do not test carefully enough.

An answer can look cited while still being structurally weak. The model can draft the paragraph first and attach a plausible-looking reference afterward. The reference may exist. It may even be related. But that is not the same as showing that the wording on the page was grounded in the cited passage.

For regulated work, that distinction is not academic. If the answer is later questioned by a reviewer, notified body, competent authority, or internal approver, the team needs to show not only which source is relevant, but which source the system actually used to support the wording.
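
The three conditions above can be written down as a check that a buyer or reviewer could run against any cited answer. The following is a minimal sketch, not a description of any vendor's implementation: the `Citation` fields, the `PRIMARY_SOURCE_TYPES` set, and the passage IDs are all hypothetical names chosen for illustration, and it assumes the system exposes the IDs of the passages it retrieved before generating.

```python
from dataclasses import dataclass

# Hypothetical source categories; a real system would define its own taxonomy.
PRIMARY_SOURCE_TYPES = {"regulation", "guidance", "standard"}


@dataclass(frozen=True)
class Citation:
    source_type: str  # e.g. "regulation", or "blog_summary" for a secondary source
    document: str     # document title
    locator: str      # clause or passage reference; empty if only the title is known
    passage_id: str   # hypothetical ID of the passage in the retrieval index


def citation_problems(citation: Citation, retrieved_ids: set[str]) -> list[str]:
    """Return the reasons a citation fails the three checks (empty list means it passes)."""
    problems = []
    if citation.source_type not in PRIMARY_SOURCE_TYPES:
        problems.append("not a primary source")
    if not citation.locator.strip():
        problems.append("no exact passage or clause")
    if citation.passage_id not in retrieved_ids:
        problems.append("passage was not retrieved before generation")
    return problems


# IDs the system logged as retrieved before it wrote the answer (illustrative).
retrieved = {"mdr-art-61-10"}

grounded = Citation("regulation", "MDR", "Article 61(10)", "mdr-art-61-10")
title_only = Citation("regulation", "MDR", "", "mdr-unknown")

print(citation_problems(grounded, retrieved))
print(citation_problems(title_only, retrieved))
```

The third check is the one that separates a grounded citation from a reference attached afterward: it can only be run if the system records what it retrieved, which is itself a useful thing to ask a vendor.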

FDA already describes the standard serious teams should use

The irony in the Elsa discussion is that FDA had already described the concepts needed to judge whether a system is fit for this kind of work.

In January 2025, FDA issued draft guidance titled "Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products." Two ideas from that document matter directly here.

The first is context of use. A system is not credible in the abstract. It is credible only for a defined use: what the model is supposed to do, what inputs it relies on, and what decision its output supports.

The second is credibility assessment. If a model is being used in a context where the output could shape regulatory judgment, its performance, transparency, and evidence base need to be assessed accordingly.

Applied to regulatory AI tools, the implication is straightforward. A system meant to support drafting, interpretation, or regulatory analysis must make its basis inspectable. Otherwise the user cannot assess whether the output is credible for the context in which it is being used.

The joint EMA/FDA guiding principles on AI in medicines development reinforced the same idea in January 2026. Data governance, transparency, clear context of use, and lifecycle oversight are not optional extras. They are the operating conditions for credible AI in regulated environments.

Retrieval-first systems are different in the way that matters

The useful distinction is not "good AI" versus "bad AI." It is whether the system is retrieval-first or generation-first.

A generation-first system produces an answer from model memory and pattern completion. If it cites, it often cites after the fact. That can still be useful for brainstorming. It is weak for regulated work.

A retrieval-first system starts by pulling passages from a defined corpus of primary materials, then generates against that evidence. The answer is constrained by what was actually retrieved. The user can inspect the same material the system used.

That architecture does not eliminate all error. No serious team should pretend it does. What it does is move the work into a form the reviewer can audit. The answer becomes inspectable instead of merely persuasive.

That is the threshold that matters for regulatory use.
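
To make the retrieval-first order of operations concrete, here is a toy sketch under stated assumptions. The corpus entries are placeholder text, not quotations from any real guidance. The word-overlap retrieval stands in for whatever search a real system uses, and the "generation" step simply quotes the retrieved evidence instead of calling a model. All names (`Passage`, `retrieve`, `answer`) are hypothetical.

```python
import re
from dataclasses import dataclass

STOPWORDS = {"a", "an", "the", "of", "is", "what", "for", "to", "in", "and",
             "does", "say", "about"}


@dataclass(frozen=True)
class Passage:
    passage_id: str
    source: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercase word set with a few stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, corpus: list[Passage], min_overlap: int = 2) -> list[Passage]:
    """Toy retrieval: keep passages sharing at least `min_overlap` words with the question."""
    q = tokens(question)
    hits = [(len(q & tokens(p.text)), p) for p in corpus]
    hits = [(score, p) for score, p in hits if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in hits]


def answer(question: str, corpus: list[Passage]) -> dict:
    """Retrieve first; only then compose an answer from the retrieved evidence."""
    evidence = retrieve(question, corpus)
    if not evidence:
        # Nothing in the corpus supports an answer: say so instead of guessing.
        return {"status": "not_in_corpus", "text": "", "citations": []}
    # Placeholder for generation: quote the evidence with its passage ID.
    text = " ".join(f'"{p.text}" [{p.passage_id}]' for p in evidence)
    return {"status": "answered", "text": text,
            "citations": [p.passage_id for p in evidence]}


# Placeholder corpus; the texts are invented for this example.
corpus = [
    Passage("doc-a-s2", "Hypothetical guidance A, section 2",
            "Placeholder passage: context of use defines the role and scope of the model."),
    Passage("doc-b-s5", "Hypothetical guideline B, section 5",
            "Placeholder passage: stability data requirements for storage conditions."),
]

in_corpus = answer("What does the guidance say about context of use?", corpus)
out_of_corpus = answer("What is the fee for a pediatric waiver?", corpus)
print(in_corpus["status"], in_corpus["citations"])
print(out_of_corpus["status"], out_of_corpus["citations"])
```

The point of the sketch is the ordering, not the retrieval quality: the citations list is produced before any answer text exists, so the reviewer sees the same passages the answer was built from, and an empty retrieval yields an explicit limitation rather than a guess.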

What serious buyers should demand from any regulatory AI system

If you are evaluating an AI tool for RA, QA, clinical, or medical writing workflows, ask six questions.

1. What is the source corpus?
The vendor should be able to name the primary materials: FDA guidance, EMA scientific guidelines, MDCG documents, eCFR, EUR-Lex, ICH guidelines, Swissmedic notices, and so on. If they cannot describe the corpus clearly, the answer layer is already suspect.

2. Does the answer cite the exact clause or passage?
"See MDR" is not a citation. "MDR Article 61(10)" or a linked passage in an FDA guidance is.

3. Can I inspect the source without leaving the workflow?
If verification requires opening a new search, copying the title, and hunting the passage manually, the system has failed at the point where trust matters.

4. What happens when the answer is not in the corpus?
A credible system should say so. It should narrow the claim or state the limitation clearly. It should not guess.

5. Does the citation chain survive into drafting?
If the answer is usable only in the search surface but the citations disappear when the text moves into drafting or review, the workflow is still fragmented.

6. Is the output suitable for audit or review?
The real question is not whether the system is fast. The real question is whether a colleague can inspect the chain and defend the wording later.
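
Questions 5 and 6 are easier to test if you picture how a draft would have to be stored for the chain to survive. One possible shape, sketched here purely as an illustration, keeps the supporting passage IDs attached to each sentence so that a reviewer can list the sentences with no support at all. The `DraftSentence` structure, the passage ID, and the `[UNSUPPORTED]` marker are hypothetical, not features of any specific product.

```python
from dataclasses import dataclass, field


@dataclass
class DraftSentence:
    text: str
    passage_ids: list[str] = field(default_factory=list)  # hypothetical passage IDs


def unsupported(draft: list[DraftSentence]) -> list[str]:
    """Sentences that carry no citation and so need provenance rebuilt by hand."""
    return [s.text for s in draft if not s.passage_ids]


def render(draft: list[DraftSentence]) -> str:
    """Export the draft with citations inline so they survive copy and review."""
    lines = []
    for s in draft:
        marker = " ".join(f"[{pid}]" for pid in s.passage_ids) or "[UNSUPPORTED]"
        lines.append(f"{s.text} {marker}")
    return "\n".join(lines)


draft = [
    DraftSentence("The claim relies on the cited clause.", ["mdr-art-61-10"]),
    DraftSentence("This sentence was added without a source."),
]

print(render(draft))
print(unsupported(draft))
```

If a tool cannot produce something equivalent to the `unsupported` list for a draft, the reviewer is back to rebuilding provenance sentence by sentence.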


The standard should be higher than "good enough for a first draft"

One of the most common defenses of weak AI output is that humans will check the final draft anyway.

That is not a serious standard for regulatory work.

If the first draft already contains unsupported claims, invented references, or ambiguous interpretations, the reviewer is not just checking accuracy. They are undoing a flawed starting point. That wastes time and increases the chance that something subtle survives into the final output because the initial structure looked convincing.

The better standard is this: the first draft should already preserve the evidence chain well enough that review becomes faster, narrower, and more reliable. The reviewer should be validating judgment, not rebuilding provenance.

That is the difference between AI as assistance and AI as cleanup work.

Regulatory AI should make the source chain easier to keep intact

This is the broader product point.

The value of regulatory AI is not that it can produce text quickly. The value is that it can reduce the number of places where the source chain gets broken. Search, answer, draft, compare, and monitor should all make that chain more visible, not less.

That is why citations matter more than generic intelligence claims. They are the part the team can inspect. They are what let one user trust another user's draft. They are what let a reviewer reopen the exact span rather than re-run the entire question. They are what make a system useful beyond the first prompt.

For regulated work, that is what maturity looks like.

Key takeaways

  • Regulatory teams do not need plausible AI text; they need answers with a verifiable source chain
  • A citation is only meaningful if it points to the exact passage in a primary source and reflects what the system retrieved before generating
  • FDA's January 2025 AI credibility draft guidance and the January 2026 EMA/FDA AI principles both reinforce the need for inspectable, context-appropriate systems
  • Retrieval-first systems are better suited to regulatory work because they make the evidence base reviewable
  • The real buying test is not speed alone; it is whether the citation chain survives into drafting, review, and audit

How RegAid helps

RegAid is built retrieval-first, so every answer is grounded in primary regulatory sources and linked back to the exact passage used. The same citation chain can carry into drafting, review, monitoring, and comparison workflows instead of breaking after the first answer. Start with your own live regulatory question at regaid.ch.