Engineering Notes

What 1.2 Million ChatGPT Responses Actually Reveal About LLM Citation Patterns

Thiago Victorino

A new industry is forming around the question of how to get cited by AI. It calls itself Generative Engine Optimization. It has conferences, tools, consultants, and a growing body of claims about “the science of how AI pays attention.”

Most of it is noise.

Not all of it. Buried under the marketing, there is real signal — backed by academic research and large-scale data. The problem is that the field conflates three distinct technical processes, overstates statistical findings, and is largely funded by companies selling optimization tools. If you make decisions based on the marketing layer, you will optimize for the wrong things. If you dig to the research layer, there are genuinely useful findings about how LLMs select what to cite.

This piece separates the two.

What the Data Actually Shows

Kevin Indig published an analysis of 1.2 million ChatGPT responses, examining which web content gets cited and what structural features those cited passages share. The dataset is large enough to surface real patterns, and several of the findings align with independent academic research. Here is what holds up.

Content position matters — a lot. 44.2% of citations come from the first 30% of a page. This is not surprising if you know the literature. Liu et al. at Stanford demonstrated in 2023 that LLMs systematically underweight information placed in the middle of long contexts — a phenomenon they named “Lost in the Middle” (published in TACL 2024). The model’s attention is U-shaped: strong at the beginning, strong at the end, weak in the middle. Content that appears early in a document has a structural advantage in retrieval-augmented generation pipelines, where chunks from the top of a page are both more likely to be retrieved and more likely to be attended to once retrieved.
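
As a quick self-check, the position of a claim within a page is easy to measure. A minimal sketch (the helper function and sample page are illustrative, not part of Indig's pipeline):

```python
def relative_position(page_text: str, passage: str) -> float:
    """Start offset of a passage as a fraction of total page length (0.0 = top)."""
    idx = page_text.find(passage)
    if idx < 0:
        raise ValueError("passage not found in page")
    return idx / len(page_text)

# An invented page: short preamble, key claim, long appendix.
page = ("background paragraph " * 5) + "The key claim. " + ("appendix material " * 40)
in_top_30 = relative_position(page, "The key claim.") <= 0.30
```

Running this over your own pages shows at a glance which claims sit inside the favored first 30% and which are buried.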

Entity density correlates with citation. Cited content has roughly 20.6% entity density — proper nouns, brand names, tools, people — compared to 5-8% in standard English text (established baselines from the Brown Corpus and Penn Treebank). Content rich in specific, named entities gives the model concrete anchors. This makes mechanical sense: entities are high-information tokens that help the model ground its output in verifiable claims.
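
A rough way to estimate entity density on your own content is to count capitalized tokens that do not begin a sentence, as a proxy for named entities. This heuristic is my assumption, not the study's method; a real pipeline would use an NER model:

```python
import re

def entity_density(text: str) -> float:
    """Rough entity-density proxy: share of tokens that are capitalized but
    not sentence-initial. A real pipeline would use an NER model instead."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)
    if not tokens:
        return 0.0
    # Tokens that start a sentence are capitalized for grammatical reasons,
    # so discount them.
    starts = {m.group(1) for m in
              re.finditer(r"(?:^|[.!?]\s+)([A-Za-z][A-Za-z0-9-]*)", text)}
    entities = [t for t in tokens if t[0].isupper() and t not in starts]
    return len(entities) / len(tokens)

# 4 entity-like tokens (Amazon, EKS, Terraform, Helm) out of 11.
density = entity_density("We migrated the cluster to Amazon EKS using Terraform and Helm.")
```

Even this crude proxy separates entity-rich technical prose from generic text, which is what the 20.6% versus 5-8% comparison is measuring.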

Definitional language gets cited more than hedged language. Passages using definitive phrasing (“X is defined as,” “The standard approach is”) are cited at 36.2% versus 20.2% for vague constructions. This may seem trivially obvious — definitions get cited for definitional queries — but the magnitude of the gap is worth noting for anyone structuring technical content.

Question-format headings double citation rates. H2 headings framed as questions correlate with approximately 2x more citations. The likely mechanism is straightforward: RAG systems chunk content by headings, and users ask questions. A heading that mirrors the user’s query structure increases the semantic similarity between the chunk and the prompt.
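
That chunking mechanism can be sketched in a few lines. The heading-based splitter below is a simplified stand-in for what production RAG pipelines do, and the sample document is invented:

```python
import re

def chunk_by_headings(markdown: str) -> dict[str, str]:
    """Split markdown into {heading: body} chunks at H2 boundaries,
    mirroring how many RAG pipelines segment pages before embedding."""
    parts = re.split(r"^## +(.+)$", markdown, flags=re.M)
    # parts = [preamble, heading1, body1, heading2, body2, ...]
    return {parts[i].strip(): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

doc = """Intro text.

## How do I rotate an API key?

Call the rotation endpoint with the key ID.

## What is a service account?

A service account is a non-human identity.
"""
chunks = chunk_by_headings(doc)
```

When the heading is itself a user question, the chunk that gets embedded already contains the query's phrasing, which is the similarity boost the finding describes.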

Readability has a sweet spot. Content at Flesch-Kincaid grade 16 (college-level) outperforms content at grade 19.1 (PhD-level) for citations. Overly dense academic prose gets retrieved less often, presumably because the embedding similarity with conversational queries is lower.
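
Flesch-Kincaid grade is simple to compute yourself. The sketch below uses a crude vowel-group syllable counter, so scores will differ slightly from polished implementations such as the textstat library:

```python
import re

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level with a crude vowel-group syllable counter."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

plain = fk_grade("The report is short. Each claim is stated first.")
dense = fk_grade("Epistemological heterogeneity necessitates comprehensive "
                 "interdisciplinary contextualization of multifactorial phenomena.")
```

Short sentences with few syllables per word score low; long, polysyllabic academic prose scores far higher, which is the gap between grade 16 and grade 19.1 content.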

These findings are independently corroborated. Aggarwal et al. at Princeton, Georgia Tech, and Allen AI published “GEO: Generative Engine Optimization” at ACM KDD 2024, finding that authoritative language boosts AI engine visibility by up to 40%. The RANLP 2025 paper on exploiting primacy effects in LLMs confirms that positional bias is a real and measurable phenomenon in model behavior.

What the Data Does Not Show

Here is where the field’s marketing diverges from its evidence.

Indig’s analysis claims to reveal “the science of how AI pays attention.” It does not. It measures citation patterns in ChatGPT’s output. These patterns are the result of at least three distinct technical processes that the analysis does not distinguish:

Retrieval — which documents and chunks the RAG pipeline surfaces in response to a query. This is a search problem governed by embedding similarity, chunk boundaries, and index construction. It happens before the model sees anything.

Attention — the transformer mechanism that determines how the model weights different tokens in its context window. This is the actual “how AI pays attention” part, and the study does not measure it at all. Attention patterns are internal to the model and are not observable from output analysis.

Generation — how the model constructs its response and selects what to cite from the retrieved context. This is influenced by the model’s training, RLHF alignment, and the specific prompt structure.

Conflating these three processes leads to wrong conclusions. When the article claims that “the word ‘is’ acts as a bridge in vector databases,” this is technically incorrect. Sentence transformers encode entire sentences into embedding vectors. Individual words do not have independent roles in embedding space. The word “is” does not bridge anything — the sentence containing it has a particular embedding, and that embedding’s similarity to the query determines retrieval.
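
A toy example makes the sentence-level mechanics concrete. The 3-d vectors below are hand-picked stand-ins for real sentence embeddings (which would come from a model like a sentence transformer); only the cosine math is faithful:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Hand-picked stand-ins for embeddings of whole sentences.
query = [0.9, 0.1, 0.3]  # pretend: embed("what is a service mesh?")
chunks = {
    "definitional passage": [0.8, 0.2, 0.4],
    "release changelog":    [0.1, 0.9, 0.2],
}
# Retrieval selects the chunk whose whole-sentence vector is closest to the query.
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
```

The unit of comparison is the whole sentence vector against the whole query vector; no individual word, "is" included, has a separate role in that computation.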

The statistical claims also deserve scrutiny. The analysis reports a “p-value of 0.0” for several findings. P-values cannot be exactly zero — this is a display artifact, not a statistical result. More importantly, with a sample size of 18,012 cited passages, statistical significance is trivially guaranteed for almost any measurable difference. The question that matters is effect size: how large is the difference, and is it practically meaningful? No effect sizes are reported.
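
For proportions like the 36.2% versus 20.2% citation rates above, a standard effect-size measure is Cohen's h. This calculation is mine, not the study's:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# The article's definitive-vs-hedged citation rates.
h = cohens_h(0.362, 0.202)
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), this lands between small and medium. That single number conveys more than any p-value at this sample size.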

The 0.55 cosine similarity threshold used to match citations to source content is lower than the 0.65+ threshold common in academic information retrieval work. A lower threshold inflates match counts by including weaker associations. No sensitivity analysis is reported — we do not know how the findings change if the threshold moves to 0.60 or 0.65.
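
A sensitivity analysis of this kind is cheap to run. A sketch, with made-up similarity scores standing in for the study's citation matches:

```python
def match_counts(similarities: list[float],
                 thresholds: tuple[float, ...] = (0.55, 0.60, 0.65)) -> dict[float, int]:
    """Count how many citation/source pairs survive each similarity cutoff."""
    return {t: sum(s >= t for s in similarities) for t in thresholds}

# Invented similarity scores for six citation/source pairs.
counts = match_counts([0.52, 0.57, 0.61, 0.63, 0.68, 0.72])
```

If the headline findings hold across all three cutoffs, they are robust; if they only appear at 0.55, they are an artifact of the threshold choice. Reporting that table costs almost nothing.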

The Source Credibility Problem

The data comes from Gauge, an AI visibility platform that sells optimization tools to marketers. The article offers 75% off Gauge’s product and promotes a companion tool with a promo code. This is industry marketing research, not independent academic work.

This does not automatically invalidate the findings. Industry research can be rigorous. But it does mean the incentive structure favors findings that make optimization seem important, measurable, and actionable — because that is what sells tools.

The Tow Center for Digital Journalism at Columbia University published research in March 2025 finding that AI search engines produce incorrect citations more than 60% of the time. If citations are wrong that often, optimizing for citation patterns may be optimizing for noise. The citation you earn may not even point to you accurately.

This is a fundamental problem the GEO field has not addressed: the reliability of the outcome variable. If you optimize your content to be cited more by systems that cite inaccurately most of the time, what exactly have you optimized?

The Governance Angle Nobody Is Discussing

Here is where this gets interesting for anyone building or deploying AI systems.

If citation patterns are predictable and exploitable — and the data suggests they are, at least partially — then LLM-powered search and research tools have a systematic bias that can be gamed. Content that appears early on the page, uses definitive language, packs in named entities, and structures headings as questions will be disproportionately cited, regardless of whether it is the most accurate or relevant source.

This is not a hypothetical concern. It is the same dynamic that created the SEO manipulation industry for traditional search, except the manipulation surface is different. Instead of optimizing for PageRank and keyword density, you optimize for entity density and positional placement.

For organizations that use AI tools for research, procurement decisions, competitive analysis, or any knowledge work where citations influence conclusions, this creates a governance gap. The AI tool’s citations are not a neutral sample of available knowledge. They are structurally biased toward content that matches specific formatting patterns — patterns that sophisticated content producers will increasingly optimize for.

Governed AI systems should treat citation provenance the same way they treat any other input: with verification layers. If your team uses ChatGPT or Perplexity or any retrieval-augmented system to inform decisions, the citation is a starting point, not an endpoint. The positional bias alone — 44.2% of citations from the first 30% of content — means the model is systematically undersampling information that appears later in documents.

What Practitioners Should Actually Do

Strip away the marketing, keep the verified findings, and the actionable guidance is straightforward.

Front-load your key claims. The positional bias is real and well-documented across multiple studies. If your content makes an important claim on page three that it buries under two pages of context-setting, that claim is structurally disadvantaged in any retrieval-augmented system. Put the insight first. Explain it after.

Increase entity density. Name the tools, the people, the companies, the standards. “A major cloud provider’s container orchestration platform” is invisible to retrieval. “Amazon EKS” is a concrete entity that anchors the content in embedding space. Specificity is not just good writing — it is mechanically advantageous for retrieval.

Use question-format headings for reference content. If you are writing content that answers specific questions — documentation, guides, FAQs — structure headings as the questions users actually ask. This is not a trick. It is alignment between your content structure and the query structure of the systems retrieving it.

Write at a college reading level, not a PhD level. Flesch-Kincaid 16, not 19. This is also just good writing advice independent of any AI consideration. Clear prose communicates more effectively to both humans and machines.

Be definitive. When you know something, state it. “The standard approach is X” gets cited. “It could potentially be argued that X might sometimes apply” does not. Hedging is not rigor. It is ambiguity. State what you know. Qualify what you do not. Do not hedge what you know.

Do not optimize for citation as a primary goal. The Tow Center’s finding that AI citations are incorrect 60%+ of the time means the entire citation economy is unreliable. Write to be useful and accurate. If the AI cites you correctly, good. If it cites you incorrectly, no amount of optimization helps. If it does not cite you, your content can still reach people through every other channel that exists.

The Uncomfortable Summary

The GEO field contains real findings wrapped in commercial incentive. The positional bias is real (Liu et al., Stanford). The entity density effect is real and mechanically grounded. The readability sweet spot is real and unsurprising.

But the field overstates what it has proven, conflates retrieval with attention with generation, reports statistics without effect sizes, and is funded by companies selling the solution to the problem they are defining. The “science of how AI pays attention” is not yet science. It is pattern observation on output data, with commercial interpretation layered on top.

For practitioners: use the actionable findings. Front-load, be specific, be definitive, write clearly. These are good practices regardless of whether you care about AI citations.

For leaders deploying AI tools: understand that citation bias is a real, measurable, and exploitable property of retrieval-augmented systems. Your teams’ AI-assisted research inherits these biases. Govern accordingly.

For the GEO industry: separate your findings from your products. Publish effect sizes. Distinguish retrieval from attention from generation. Acknowledge the 60% citation error rate. The signal is there. Stop burying it under the noise.


Sources

  • Kevin Indig. Analysis of 1.2M ChatGPT responses and citation patterns. withgauge.com, 2026.
  • Liu et al. “Lost in the Middle: How Language Models Use Long Contexts.” Stanford University. arxiv.org/abs/2307.03172. TACL, 2024.
  • Aggarwal et al. “GEO: Generative Engine Optimization.” Princeton, Georgia Tech, Allen AI. arxiv.org/abs/2311.09735. ACM KDD, 2024.
  • “Exploiting Primacy Effect to Improve LLMs.” RANLP, 2025.
  • Tow Center for Digital Journalism, Columbia University. AI search engine citation accuracy study. March 2025.
  • Brown Corpus and Penn Treebank entity density baselines. Established computational linguistics references.

Victorino Group helps organizations build AI systems that are governed by design — including how those systems consume, cite, and surface information. If your team needs to understand the biases in your AI toolchain, reach out at contact@victorinollc.com or visit www.victorinollc.com.
