regulatory-intelligence · ai · pharma · medtech · compliance

Why AI-generated regulatory answers need citations (and why FDA's own AI doesn't)

RegAid Team · 8 min read

Regulatory affairs cannot run on AI answers you cannot verify. That principle turned from best practice to public news when the FDA's own in-house AI tool, Elsa, was reported to be fabricating studies that do not exist and confidently misrepresenting research. Three current FDA employees went on record with CNN. FDA officials now tell staff that anything Elsa says must be double-checked before it is used. For RA teams evaluating AI tools, the lesson is direct: an AI that cannot ground every claim in a verifiable primary source is a liability, not an asset. This post covers what went wrong with Elsa, what "cited AI" actually means, what the FDA's own January 2025 framework already requires, and what to demand from any tool you are considering.

What went wrong with Elsa

Elsa is a generative AI copilot launched by the FDA in June 2025 (FDA announcement). It was built to help agency reviewers draft documents, search internal records, and summarise submissions. The product ambition is reasonable. The execution exposed a fundamental architectural problem.

FDA employees told CNN that Elsa "makes up nonexistent studies" and misrepresents research. Staff reviewers who queried Elsa for drug safety data received references to studies that could not be found in any database. One FDA official said on the record: "Anything that you don't have time to double-check is unreliable. It hallucinates confidently."

The root cause is not Elsa-specific. It is a property shared by every general-purpose large language model: without a retrieval layer grounding generation in verified source documents, the model produces text that is statistically plausible but not factually reliable. When asked for a reference, the model generates something that looks like a reference. When asked for a study conclusion, it generates something that looks like a conclusion. Whether those outputs correspond to actual documents is, from the model's perspective, beside the point.

Elsa lacks what matters in regulatory use: a transparent chain from each generated claim to a primary source a reviewer can open and verify.

The core problem: ungrounded LLM versus RAG

Generative AI tools in regulatory work fall into two architectural classes.

Ungrounded LLM: the model generates answers from parametric memory (what it learned during training). It can cite fluently but cannot prove any citation corresponds to a real source. This is how Elsa operates in practice.

Retrieval-augmented generation (RAG): the model is forced to retrieve passages from a trusted corpus (primary agency documents, statutes, guidances) before generating. Each claim is linked to the retrieved passage it came from. The user can click through to the original document.
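The retrieve-first, generate-second flow can be sketched in a few lines. This is a toy illustration under stated assumptions: keyword overlap stands in for a real vector search, and the generation step is stubbed out; `Passage`, `retrieve`, and `answer` are hypothetical names, not any tool's actual API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str   # clause-level identifier, e.g. "21 CFR 314.80(c)(1)(i)"
    url: str      # deep link the reviewer can open in one click
    text: str     # verbatim text from the primary source

def retrieve(question: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Toy retrieval: rank passages by keyword overlap with the question.
    A production system would use vector search over an indexed corpus."""
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(p.text.lower().split())), p) for p in corpus]
    return [p for score, p in sorted(scored, key=lambda s: -s[0]) if score > 0][:k]

def answer(question: str, corpus: list[Passage]) -> dict:
    """Retrieve first, generate second: if nothing is retrieved,
    refuse rather than generate from parametric memory."""
    passages = retrieve(question, corpus)
    if not passages:
        return {"answer": None, "citations": []}
    # Generation is stubbed here: a grounded system would pass only these
    # passages to the LLM and require every claim to cite one of them.
    return {
        "answer": " ".join(p.text for p in passages),
        "citations": [{"source": p.doc_id, "url": p.url} for p in passages],
    }

corpus = [
    Passage("21 CFR 314.80(c)(1)(i)",
            "https://www.ecfr.gov/current/title-21/section-314.80",
            "The applicant shall report each adverse drug experience "
            "that is both serious and unexpected"),
]
print(answer("When must serious unexpected adverse drug experiences be reported?", corpus))
```

The refusal branch is the point: an ungrounded LLM has no equivalent of "nothing was retrieved," so it generates something citation-shaped regardless.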

A published 2026 study evaluating RAG for regulatory compliance of drug information found answer relevancy at 100 percent and faithfulness at 95 percent, meaning nearly every generated claim was traceable to the retrieved source documents. An ungrounded LLM performing the same task fabricates at rates high enough that the FDA's own deployment could not clear internal quality bars.

The architectural difference matters more than model scale or brand. A smaller model with RAG will outperform a larger model without, because the bottleneck for regulatory work is not creativity, it is verifiability.

What "cited" actually means in an AI tool

"Cited" is used loosely in AI marketing. For regulatory work, a citation is only meaningful if it meets four criteria.

1. Links to the exact passage, not the document. "See FDA guidance" is not a citation. "21 CFR 314.80(c)(1)(i) at eCFR" with a direct URL to the cited paragraph is a citation.

2. Links to a primary source, not a secondary summary. A citation to a commentary blog post is not equivalent to a citation to the Federal Register notice the blog post describes.

3. The AI retrieved it before generating the answer. If the AI generates the answer and post-hoc adds a plausible-looking citation, the citation may not correspond to what the answer claims. RAG retrieves first, generates second.

4. The user can open and verify it in one click. If the citation requires searching, the verification step is lost at the human layer. A clickable deep-link is non-negotiable for RA workflows.

Tools that fail any of these four tests are closer to Elsa than to a trustworthy RA assistant.
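The four tests can be expressed as a simple due-diligence check on a citation record. The field names below are illustrative, not any vendor's real schema:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    url: str                 # link the reviewer actually clicks
    clause: str              # e.g. "21 CFR 314.80(c)(1)(i)"; "" if document-level only
    is_primary_source: bool  # agency document, statute, guidance (not a summary)
    retrieved_before_generation: bool  # RAG order, not a post-hoc attachment

def passes_four_tests(c: Citation) -> bool:
    return all([
        bool(c.clause),                 # 1. exact passage, not just the document
        c.is_primary_source,            # 2. primary source, not commentary
        c.retrieved_before_generation,  # 3. retrieve first, generate second
        c.url.startswith("https://"),   # 4. one-click verification via deep link
    ])

good = Citation("https://www.ecfr.gov/current/title-21/section-314.80",
                "21 CFR 314.80(c)(1)(i)", True, True)
bad = Citation("", "", False, False)  # "see FDA guidance" with no link
print(passes_four_tests(good), passes_four_tests(bad))
```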

The FDA's own framework already requires this

The irony of Elsa is that the FDA itself published the framework that would have caught the problem. On January 6, 2025, the FDA released a draft guidance, Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products. The guidance establishes a 7-step risk-based credibility framework and introduces two concepts directly relevant to this post.

Context of use (COU): a written statement of exactly what the AI model is for, what inputs it takes, and what decisions its output supports. Without a defined COU, credibility cannot be assessed at all.

Credibility assessment: a structured evaluation of whether the model's performance is adequate for the claimed COU, including data governance, model design, and transparency to users.
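The two concepts can be captured as structured metadata alongside a model. This is a paraphrase of the draft guidance's concepts as a sketch, not a mandated schema; all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ContextOfUse:
    model_purpose: str              # exactly what the AI model is for
    inputs: list[str]               # what inputs it takes
    decisions_supported: list[str]  # what decisions its output supports

@dataclass
class CredibilityAssessment:
    cou: ContextOfUse
    data_governance_reviewed: bool
    model_design_documented: bool
    transparent_to_users: bool

    def adequate(self) -> bool:
        # Credibility can only be judged against a defined COU:
        # an empty purpose statement fails before anything else is checked.
        return all([
            bool(self.cou.model_purpose),
            self.data_governance_reviewed,
            self.model_design_documented,
            self.transparent_to_users,
        ])

search_assistant = ContextOfUse(
    "document search assistant",
    ["internal records", "submission documents"],
    ["locating documents for human review"],
)
print(CredibilityAssessment(search_assistant, True, True, True).adequate())
```

The structure makes COU drift visible: if the tool starts being used as a primary research source, that is a different COU, and the assessment performed for the search-assistant COU no longer applies.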

Applied to Elsa: its context of use drifted from "document search assistant" to "primary research source," and the credibility assessment for the latter was never performed. The EMA and FDA joint guiding principles released in January 2026 doubled down with a principle on "Data governance and documentation" and another on "Clear context of use." The principles describe, in regulatory language, the exact properties a cited-first RAG system provides natively.

What to demand from any AI RA tool: a 7-point checklist

When evaluating an AI tool for regulatory work, a short due-diligence checklist separates products that behave like Elsa from products that meet regulator expectations.

  1. Ask for a cited answer to a specific regulatory question, then open every citation. If any citation does not correspond to the claim made, stop evaluating.
  2. Ask what the source corpus is. A tool that cannot name its primary sources (FDA guidances, EMA scientific guidelines, MDCG documents, eCFR, EUR-Lex, ICH guidelines, ISO standards) is generating from parametric memory, not retrieval.
  3. Verify the retrieval is real, not post-hoc. Ask whether the model retrieves passages before generating or attaches citations after. Only the first is RAG.
  4. Check citation granularity. Are citations at the document level ("see FDA guidance X") or at the clause level ("21 CFR 314.80(c)(1)(i)")? Clause-level is required.
  5. Test for fabrication. Ask a question you know has no good source answer. A well-grounded tool will say so. A brittle tool will invent one.
  6. Ask about updating the corpus. A tool whose index was snapshotted a year ago is not suitable for regulatory work, where the source material changes weekly.
  7. Review the tool's own documentation against the FDA draft guidance 7-step credibility framework. If the tool vendor cannot map their product onto COU and credibility assessment, the product is not ready for RA use.
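Point 5, the fabrication probe, is easy to automate as a smoke test. The `ask` parameter below is a hypothetical wrapper around whatever tool is under evaluation; the two stubs only illustrate the expected behaviours:

```python
def fabrication_probe(ask, no_answer_questions: list[str]) -> list[str]:
    """Send questions known to have no good source answer.
    A grounded tool should return no citations (or an explicit refusal);
    an ungrounded tool invents references. Returns the failing questions."""
    failures = []
    for q in no_answer_questions:
        response = ask(q)
        if response.get("citations"):  # any citation here is fabricated
            failures.append(q)
    return failures

probe_questions = ["Cite the FDA guidance on teleporting biologics"]

# A well-grounded stub refuses when it has no source:
grounded_stub = lambda q: {"answer": None, "citations": []}
# An ungrounded stub always produces something citation-shaped:
ungrounded_stub = lambda q: {"answer": "...", "citations": [{"source": "made up"}]}

print(fabrication_probe(grounded_stub, probe_questions))    # expect no failures
print(fabrication_probe(ungrounded_stub, probe_questions))  # expect failures
```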

Try this in RegAid: What is the FDA guidance on AI credibility for drug submissions?

Common misconceptions

"All the big AI tools cite now": many add citations as a UI veneer while still generating primarily from parametric memory. The citation may not correspond to the claim. This is the Elsa pattern dressed up. Test every tool before you trust it.

"If the LLM is fine-tuned on regulations, it's safe": fine-tuning adjusts model weights but does not guarantee retrieval. A fine-tuned model can still fabricate citations. The safety property comes from architecture (retrieval + verifiable links), not from training data alone.

"FDA Elsa is a government product, surely it's reliable": Elsa is an internal agency tool subject to the same limitations as any ungrounded LLM. The FDA's own staff have publicly raised concerns. Origin does not confer reliability.

"Citations slow down the AI": retrieval adds a step, but a well-engineered RAG system returns in seconds, not minutes. The latency cost is trivial against the verification cost of an ungrounded answer.

"It's good enough for first drafts, humans check the final": this is the most dangerous argument. An AI that generates plausible-but-false first drafts anchors reviewers toward wrong conclusions. Cognitive psychology research on anchoring effects shows that even reviewers aware of AI limitations are influenced by the initial draft. If the AI cannot be trusted for a first draft, it cannot be used for a first draft.

Key takeaways

  • FDA's own AI tool Elsa is publicly reported to fabricate studies, illustrating the risk of ungrounded LLMs in regulatory work (CNN, July 2025)
  • The architectural distinction is ungrounded LLM versus retrieval-augmented generation (RAG); RAG grounds every claim in a retrieved source
  • A citation is only meaningful if it links to the exact passage in a primary source and the AI retrieved it before generating
  • FDA's January 2025 draft AI guidance establishes a 7-step credibility framework around "context of use" and "credibility assessment"
  • The EMA and FDA joint principles (January 2026) formalise the same requirements internationally
  • Evaluate any AI RA tool against the 7-point checklist; tools that cannot pass are not suitable for regulatory work

How RegAid helps

RegAid is built as a retrieval-first regulatory intelligence platform. Every answer is grounded in retrieved passages from the primary source corpus: FDA guidances, EMA scientific guidelines, MDCG documents, Swissmedic guidance, eCFR, EUR-Lex, ICH guidelines, ISO standards, and agency Federal Register notices. Every citation deep-links to the exact clause. Ask "What does FDA's January 2025 draft AI guidance say about context of use?" or "What is the acceptable intake for N-nitroso-rivaroxaban?" and click the citation to open the primary document in one step. No fabricated studies. No post-hoc citations. No parametric-memory guessing.