Citation URL Extractor — Parse Sources From Perplexity,...

Part of the AEO / GEO / AI-search audit tool stack. See the pillar post for the full catalog of sibling audits and where this one fits in the lineup.

Perplexity cites 5-10 sources per answer. Copilot cites 3-7. Gemini cites 2-5 in AI Overview mode. ChatGPT with browsing cites whatever it feels like.

The important question isn't "how many sources?" It's "which sources?" If Perplexity cites the same three Reddit threads and one Wikipedia article for every query in your category, the source bucket is saturated and your odds of breaking in are low. If it cites a scattered mix of blogs, news sites, and forum posts, the bucket is open and you can win a slot.

The Citation URL Extractor parses the source list from any pasted AI response, classifies each domain, and scores diversity. Low diversity = concentrated citation market. High diversity = competitive citation market.

The diversity index

The score is the Shannon diversity index from ecology, normalized to 0-100%:

0-30%. One or two sources dominate. The AI engine trusts them as the canonical voice for this category. To break in you need to match or exceed their authority: E-E-A-T signals, Wikipedia presence, citation density on high-authority neighbors.
30-60%. Moderate spread, 4-6 sources in rotation. Most competitive categories land here. You can win a slot with consistent publishing + on-page AEO signals.
60-100%. High diversity, many sources cited across queries. The engine is uncertain which source to anchor on. Easy to break in; easy to be pushed out.
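For readers who want to see the arithmetic behind the bands, here is a minimal sketch of a normalized Shannon index over a list of cited domains. The function name normalized_shannon is hypothetical, and the normalization (dividing by the maximum possible entropy for the number of distinct domains seen) is an assumption; the tool may normalize differently, so treat the exact percentages as illustrative.

```python
import math
from collections import Counter


def normalized_shannon(domains):
    """Return the Shannon diversity of cited domains as a 0-100 percentage.

    Hypothetical helper: `domains` holds one entry per citation,
    e.g. ["reddit.com", "reddit.com", "wikipedia.org"].
    """
    counts = Counter(domains)
    total = sum(counts.values())
    distinct = len(counts)
    if distinct < 2:
        # Zero or one distinct source means there is no diversity to measure.
        return 0.0
    # H = -sum(p * ln p), where p is each domain's share of all citations.
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Assumption: normalize by ln(S), the entropy of a perfectly even split
    # across S distinct domains, so the result lands between 0 and 100.
    return 100.0 * entropy / math.log(distinct)


concentrated = ["reddit.com"] * 8 + ["wikipedia.org", "nicheblog.example"]
even_split = ["a.example", "b.example", "c.example", "d.example"]
print(round(normalized_shannon(concentrated), 1))
print(round(normalized_shannon(even_split), 1))
```

A sample where one domain takes most of the citations scores well below an even split, which is the behavior the three bands above rely on.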

Pair the diversity score with the own-domain citation share: what percentage of the citations (in the one response you pasted, or across many) come from your own domain? Zero is the baseline. Above 10% across a broad sample means you've won the category.
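The share itself is simple division. A minimal sketch, assuming every cited URL carries an http or https scheme and that subdomains count as your own domain (the function name own_domain_share is hypothetical, not taken from the tool):

```python
from urllib.parse import urlparse


def own_domain_share(urls, own_domain):
    """Return the percentage of cited URLs hosted on own_domain or a subdomain."""
    # urlparse only fills in hostname when the URL includes a scheme.
    hosts = [(urlparse(u).hostname or "").lower() for u in urls]
    hosts = [h for h in hosts if h]
    if not hosts:
        return 0.0
    own = own_domain.lower()
    # Match the exact host or any subdomain; "notexample.com" must not match.
    hits = sum(1 for h in hosts if h == own or h.endswith("." + own))
    return 100.0 * hits / len(hosts)


cited = [
    "https://blog.example.com/a",
    "https://example.com/b",
    "https://notexample.com/c",
    "https://www.reddit.com/r/x",
]
print(own_domain_share(cited, "example.com"))  # 2 of 4 citations -> 50.0
```

Run it over one pasted response for a spot check, or over the combined URL list from many responses for the broad-sample number.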

The seven source buckets

The extractor classifies domains into seven categories:

Own. Your domain (plus subdomains). Provided as a form input.
Competitor. Explicitly-listed competitor domains.
Wikipedia / Wikidata. The encyclopedic anchor. Frequently cited because it's been audited.
Community / forum. Reddit, Stack Overflow, Hacker News, Quora, Medium, Substack.
Video. YouTube, Vimeo, TikTok, TED.
News / media. NYT, WaPo, Bloomberg, Reuters, WSJ, BBC, Forbes, TechCrunch, etc.
Government / research. .gov, .edu, .ac.uk, NIH, CDC, NIST, arxiv.org, nature.com, science.org.

Everything else lands in "Other," which is usually the long-tail of niche blogs, vendor docs, and smaller sites. If a category you care about (say fintech-specific publications) isn't well-represented, extend the classifier lexicon in the tool source.
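To make the bucket logic concrete, here is a minimal sketch of a lexicon-based classifier. The LEXICON contents, the bucket keys, and the classify function are illustrative assumptions built from the examples listed above; the tool's real lexicon and matching rules may differ.

```python
from urllib.parse import urlparse

# Assumed lexicon, seeded from the example sites named in the bucket list.
LEXICON = {
    "wikipedia": ["wikipedia.org", "wikidata.org"],
    "community": ["reddit.com", "stackoverflow.com", "news.ycombinator.com",
                  "quora.com", "medium.com", "substack.com"],
    "video": ["youtube.com", "vimeo.com", "tiktok.com", "ted.com"],
    "news": ["nytimes.com", "washingtonpost.com", "bloomberg.com", "reuters.com",
             "wsj.com", "bbc.com", "forbes.com", "techcrunch.com"],
    "gov_research": ["arxiv.org", "nature.com", "science.org"],
}
# Suffix rules cover .gov hosts such as NIH, CDC, and NIST without listing each.
GOV_RESEARCH_SUFFIXES = (".gov", ".edu", ".ac.uk")


def matches(host, domain):
    """True when host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url, own_domain, competitors=()):
    """Return the bucket name for one cited URL (assumes an http/https scheme)."""
    host = (urlparse(url).hostname or "").lower()
    if matches(host, own_domain):
        return "own"
    if any(matches(host, c) for c in competitors):
        return "competitor"
    for bucket, domains in LEXICON.items():
        if any(matches(host, d) for d in domains):
            return bucket
    if host.endswith(GOV_RESEARCH_SUFFIXES):
        return "gov_research"
    return "other"


print(classify("https://en.wikipedia.org/wiki/Index_fund", "example.com"))
print(classify("https://www.cdc.gov/flu", "example.com"))
print(classify("https://smallblog.example/post", "example.com"))
```

In a sketch like this, extending the lexicon for a vertical is one line, for instance LEXICON["fintech"] = ["fintechweekly.example"], where the domain is a placeholder for whichever publications matter in your category.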

Why Perplexity responses are ideal input

Perplexity shows its source list. Copy the whole answer including the source footer and paste. The extractor finds every URL including the inline-citation form [1] that Perplexity uses.

Copilot, Gemini, and ChatGPT-with-browsing also cite, but with less structure. The extractor still pulls any URL that appears anywhere in the pasted text, so it works on all four. You'll miss the occasional source that is named in prose but never appears as a URL in the pasted text; not much you can do about that without semantic parsing.
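The extraction step can be sketched with two regular expressions. This is an illustration under stated assumptions, not the tool's actual parser: it assumes URLs carry an http or https scheme, that a Perplexity-style footer lists sources as "[n] URL" one per line, and the names extract_urls and numbered_sources are hypothetical. The URL pattern stops at a closing parenthesis, so URLs that legitimately contain one will be truncated.

```python
import re

# Any http(s) URL, stopping at whitespace, quotes, angle brackets, ")" or "]".
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")
# Assumed footer format: a line that starts with "[n]" followed by a URL.
FOOTER_PATTERN = re.compile(r"^\s*\[(\d+)\]\s+(https?://\S+)", re.MULTILINE)
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return every distinct URL in the pasted text, in order of appearance."""
    seen, urls = set(), []
    for raw in URL_PATTERN.findall(text):
        url = raw.rstrip(TRAILING_PUNCTUATION)  # drop sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def numbered_sources(text):
    """Map inline citation numbers such as [1] to the URLs listed in the footer."""
    return {int(n): url.rstrip(TRAILING_PUNCTUATION)
            for n, url in FOOTER_PATTERN.findall(text)}


sample = """Index funds track a market index [1][2].
Sources:
[1] https://en.wikipedia.org/wiki/Index_fund
[2] https://www.reddit.com/r/investing/comments/abc.
See also (https://example.com/guide)."""

print(extract_urls(sample))
print(numbered_sources(sample))
```

The first function works on any of the four engines because it ignores structure entirely; the second only helps when the pasted text keeps a numbered footer.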

What to do with low own-domain citation share

Typical first run: own-domain citations are zero or one. The fix is upstream — AEO signals, passage retrievability per article, entity consistency across pages, llms.txt structure.

The extractor is a measurement tool, not a fix. Use it to set a baseline, then rerun monthly to see whether your upstream work moved the number.

Fact-check notes and sources

Shannon diversity index: en.wikipedia.org/wiki/Diversity_index#Shannon_index
Perplexity source-citation mechanism: perplexity.ai
Google AI Overview citation format: blog.google/products/search/generative-ai-search

The $100 Network covers building citation-worthy content across site networks so that own-domain citation share grows as a function of network size. The extractor is how you verify each network is contributing to the citation pool.
