I was looking at why a site had Google-indexed URLs returning 404. The 404s were not from external links. They were from the site's own sitemap.xml, which was listing pages that had been renamed or deleted months ago. The sitemap had drifted from reality, nobody noticed, and the crawlers kept trying to fetch dead URLs.
Running a sitemap delta — a diff between URLs the site actually links to internally and URLs listed in the sitemap — surfaced 34 stale entries in two minutes. Cleaning them up stopped the 404s, freed up crawl budget, and made the sitemap honest again.
What a Sitemap Delta Is
A sitemap delta is a set difference. You build two lists:
- Crawl set: every URL reachable by following internal links starting from the homepage.
- Sitemap set: every URL listed in /sitemap.xml (or the sitemap index and any referenced children).
Then you compute one intersection and two set differences, which give you three buckets:
- In both: URLs the site links to that are also in the sitemap. These are healthy.
- Sitemap-only: URLs in the sitemap that nothing on the site currently links to. Probably stale, probably 404, probably 301-chained.
- Crawl-only: URLs the site links to that are missing from the sitemap. These are orphaned from a crawler's perspective — reachable via internal navigation but not advertised for indexing.
Each bucket has a specific meaning and a specific fix.
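The bucketing itself is plain set arithmetic. A minimal sketch, using hypothetical example URLs (in practice the two sets come from an internal-link crawl and a parsed sitemap.xml):

```python
# Compute the three sitemap-delta buckets from two URL sets.
crawl_set = {
    "https://example.com/",
    "https://example.com/about/",
    "https://example.com/blog/new-post/",
}
sitemap_set = {
    "https://example.com/",
    "https://example.com/about/",
    "https://example.com/blog/deleted-post/",
}

in_both = crawl_set & sitemap_set        # healthy: linked and declared
sitemap_only = sitemap_set - crawl_set   # probably stale
crawl_only = crawl_set - sitemap_set     # missing from the sitemap

print(sorted(sitemap_only))
```

The only prerequisite is that both sides use identically normalized URLs; otherwise a trailing-slash mismatch shows up as a false stale entry.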
The In-Both Bucket (Healthy)
The first bucket is the good one. URLs here appear in both the crawl and the sitemap. The site links to them, and the sitemap confirms they should be indexed. No action needed.
This should be the largest bucket on a healthy site. If it is not — if sitemap-only or crawl-only are the majority — the sitemap is out of sync with the site and both buckets need work.
The Sitemap-Only Bucket (Stale)
URLs in the sitemap that nothing on the site links to. The common causes:
- 404s: page was deleted, sitemap was not updated. Crawlers hit the sitemap entry, follow it, get a 404, log the error, move on. After enough of these, crawl confidence drops.
- 301s: page was moved, sitemap still lists the old URL. Crawlers follow the redirect and eventually update, but the redirect hop is waste.
- Orphaned legitimate pages: pages that still exist and work but are no longer linked from anywhere on the site. Archive pages, old campaign landing pages, legacy docs. Technically fine for a crawler to find via sitemap, but suspicious — if the site operator no longer thinks the page is worth linking to, the crawler may not either.
- Build-system leftovers: templates or drafts that got permalinks emitted into sitemap.xml but never got linked from production pages.
The fix per sub-case:
- 404 → remove from sitemap.
- 301 → update sitemap to the new URL.
- Orphan that should stay → add a link to it from a relevant page on the site. If nothing links to it, the sitemap entry is not enough.
- Build artifact → exclude it from sitemap generation.
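The sub-case dispatch above can be automated by fetching each sitemap-only URL and mapping the response to its fix path. A sketch using only the standard library; the function names are my own, and the exact fix strings are illustrative:

```python
import urllib.request
import urllib.error

def fix_for(requested: str, status: int, final: str) -> str:
    # Map an observed response to the fix path for a sitemap-only URL.
    if status in (404, 410):
        return "remove from sitemap"
    if final.rstrip("/") != requested.rstrip("/"):
        # We were redirected: the sitemap lists a moved URL.
        return "update sitemap to " + final
    if status == 200:
        return "orphan: link it from a relevant page, or drop the entry"
    return f"investigate (HTTP {status})"

def check(url: str, timeout: float = 10.0) -> str:
    # urlopen follows redirects, so resp.url is the final URL after
    # any 301 chain and resp.status is the final status.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return fix_for(url, resp.status, resp.url)
    except urllib.error.HTTPError as e:
        return fix_for(url, e.code, url)
```

Run `check` over the sitemap-only bucket and you get a batch-cleanup worklist instead of a bare URL dump.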
The Crawl-Only Bucket (Missing)
URLs the site links to that are not in the sitemap. These are the inverse problem:
- New pages that have not been rebuilt into the sitemap yet. One-off publishing, forgot to regenerate.
- Dynamically generated URLs that the sitemap generator does not know about. Filter pages, search result pages, parameterized URLs.
- Tag and category pages that the theme links to from posts but excludes from the sitemap generator by default.
- Pages intentionally excluded from the sitemap — staging pages, legal disclaimers, thank-you pages — that are nonetheless linked from the main navigation or footer.
Crawl-only is the more interesting bucket because it reveals what the site is actually structured around versus what it is advertising for indexing. If every post on a blog links to its tag archive but no tag archive is in the sitemap, the tag archives are discoverable but not declared. That is a missed signal — crawlers will find them, but they will treat them as lower-priority than sitemap-listed URLs.
The fix per sub-case:
- New page missed → rebuild the sitemap.
- Dynamic URL that should be indexable → add a rule to the sitemap generator.
- Tag/category excluded by default → reconsider whether exclusion is right. On a blog with meaningful tags, the tag archives often deserve to be in the sitemap.
- Intentionally excluded → add `<meta name="robots" content="noindex">` on the page itself to confirm the exclusion, and add `rel="nofollow"` on the links pointing to it if you really want to prevent discovery. A page that is linked without either is ambiguous to a crawler.
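Checking whether a crawl-only page already carries the noindex meta tag is easy to script. A sketch with the standard-library HTML parser (class and function names are mine):

```python
from html.parser import HTMLParser

class NoindexMeta(HTMLParser):
    # Flags <meta name="robots" content="... noindex ..."> in a page.
    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if (a.get("name", "").lower() == "robots"
                and "noindex" in (a.get("content") or "").lower()):
            self.noindex = True

def has_noindex(html: str) -> bool:
    parser = NoindexMeta()
    parser.feed(html)
    return parser.noindex
```

A crawl-only URL where `has_noindex` returns True is intentionally excluded and can be left alone; one where it returns False is the ambiguous case that needs a decision.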
Why Bing, Yandex, Kagi, and AI Crawlers Trust Sitemaps More
Google is very good at crawling sites through link discovery. It finds most pages without needing a sitemap to point the way. As a result, Google treats the sitemap as a secondary hint rather than a primary source of truth.
Bing, Yandex, Kagi, and the AI crawlers (Perplexity's, OpenAI's, Anthropic's) are different. They have less infrastructure, less crawl budget per domain, and less patience for link-graph exploration. They lean on the sitemap as the canonical list of what to index. If your sitemap is clean and complete, these crawlers build accurate models of your site. If your sitemap has drift, they build wrong models.
The AI citation engines are especially sensitive to this. A sitemap that lists 200 URLs, 30 of which 404, is a sitemap that trained the engine's retrieval layer on a 15% failure rate. The engine responds by discounting the whole domain's sitemap as unreliable, and falls back to on-the-fly crawling — which is slower, more expensive, and results in fewer citations.
A clean sitemap is the cheapest reliability signal you can send to these engines.
How to Run a Sitemap Delta
The Link Graph tool at /tools/link-graph/ has a Sitemap Delta mode. Input the site URL, it crawls the internal link graph, fetches the sitemap (and any index children), and outputs the three buckets.
Manually, the process is:
- Run a crawl starting from the homepage with a depth limit high enough to hit every internal page (depth 5 covers most sites).
- Fetch and parse /sitemap.xml (and any `<sitemap>` children if it is an index).
- Normalize URLs on both sides: trailing slashes, query strings, fragment removal, case.
- Compute the set intersection and the two differences.
- For each URL in the sitemap-only bucket, fetch it and check the response: 200, 301, 404, 410. That classifies each entry into its fix path.
- For each URL in the crawl-only bucket, inspect the page: is it linked from navigation (probably should be in the sitemap) or from a deeper context (maybe intentionally excluded)?
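The sitemap-parsing and normalization steps above can be sketched with the standard library. Assumptions: the sitemap uses the standard sitemaps.org namespace, and normalization here drops query strings and fragments entirely, which is one defensible policy among several (keep query strings if your site has indexable parameterized pages):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> tuple[list[str], list[str]]:
    # Returns (page_urls, child_sitemap_urls). A <sitemapindex> lists
    # child sitemaps to fetch recursively; a <urlset> lists pages.
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        children = [e.text.strip() for e in root.findall("sm:sitemap/sm:loc", NS)]
        return [], children
    pages = [e.text.strip() for e in root.findall("sm:url/sm:loc", NS)]
    return pages, []

def normalize(url: str) -> str:
    # Lowercase scheme and host, drop query and fragment,
    # collapse the trailing slash.
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))
```

Feed every URL from both the crawl and the sitemap through `normalize` before computing the set differences, or slash and case variants will pollute both delta buckets.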
What to Do With the Output
The usual result on a site that has never audited the sitemap:
- Crawl-only: 5-20 URLs, mostly tag/category pages or new posts. Fix the sitemap generator.
- Sitemap-only: 10-50 URLs, mostly old posts that got deleted or renamed without updating the sitemap. Batch-clean them.
- In-both: the rest, and it should be the biggest by far.
I run this audit on new client sites as a matter of course. Half the time it turns up a sitemap that has never been regenerated since the original build, and the cleanup takes an afternoon. The other half it turns up a sitemap that is close but has 5-15 drifted entries, and the cleanup takes an hour.
After the cleanup, Bing and the AI crawlers notice within two to four weeks. Google notices eventually. None of them publish a "thank you for the clean sitemap" event, but the 404 log stops showing sitemap-sourced hits, and citation frequency on the AI engines tends to tick up.
The Short Version
- A sitemap delta diffs your internal link graph against your sitemap.xml.
- Three buckets: in-both (healthy), sitemap-only (stale, 404, or unlinked), crawl-only (missing from sitemap).
- Fix stale sitemap entries by removing 404s, updating 301s, and either linking or excluding orphans.
- Fix crawl-only entries by adding them to the sitemap generator unless they are intentionally excluded.
- Bing, Yandex, Kagi, and AI crawlers weight sitemap cleanliness more than Google does. A drifted sitemap is a direct signal of low operator attention.
- Run the audit at /tools/link-graph/ with Sitemap Delta mode enabled.