TL;DR. Indexing bugs compound silently. One stray <meta name="robots" content="noindex"> left in a template after staging can deindex the whole site, and Search Console may only flag it weeks after it starts.
The Sitemap Audit is the tool to reach for when you already suspect a sitemap or indexing problem and need a fast, copy-paste-able fix list. It reuses the same chrome as every other jwatte.com tool (deep links from the mega analyzers, AI-prompt export, CSV/PDF/HTML download), but the checks it runs are narrow and specific to sitemaps and indexing.
It validates sitemap.xml, sitemap-image.xml, sitemap-news.xml, and the sitemap index. It probes every URL for its status and finds stale lastmod values, dead URLs, and missing image references.
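As a rough illustration of the stale-lastmod check, the sketch below parses a urlset document with Python's standard library and lists URLs whose lastmod is older than a cutoff. The 365-day threshold, the function names, and the example.com URLs are assumptions made for illustration; this is not the tool's actual implementation, and it leaves out the per-URL status probe.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2024-06-01T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return (loc, lastmod-or-None) pairs from a <urlset> document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc, lastmod.strip() if lastmod else None))
    return entries


def stale_entries(entries, now, max_age_days=365):
    """Return the locs whose lastmod is older than max_age_days (assumed threshold)."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for loc, lastmod in entries:
        if lastmod is None:
            continue  # no lastmod declared, so nothing to compare
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        parsed = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # treat date-only values as UTC
        if parsed < cutoff:
            stale.append(loc)
    return stale
```

Calling `stale_entries(parse_sitemap(SAMPLE_SITEMAP), datetime.now(timezone.utc))` returns the URLs to review; entries without a lastmod are skipped rather than flagged.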
Why this dimension matters
Indexing issues compound silently. A single <meta name="robots" content="noindex"> left in a template after staging can deindex the entire site; a sitemap that omits pagination URLs can leave half the catalog uncrawled; a Disallow: that overlaps with a Sitemap: entry creates a per-bot disagreement (Google may index the URL; Bing may not). These are the slow-leak failures that Search Console flags weeks after they start.
Common failure patterns
- Canonical tag pointing at a 404 or a redirect chain: the audit verifies that every canonical URL resolves with HTTP 200 and doesn't redirect. A canonical that points at /404 or that 301s to another URL sends a conflicting signal, and Google may ignore it and pick its own canonical.
- Mismatched hreflang cluster: locale A links to locale B with hreflang=es, but locale B does not link back. Google ignores annotations that are not reciprocated, so the pair silently drops out of international targeting. The audit checks bidirectionality.
- Sitemap declaring URLs that are noindex via meta or X-Robots-Tag: sitemap entries are suggestions; noindex is authoritative. If the same URL says "index me" in the sitemap and "don't index me" in the HTML, Google follows the noindex. Flag and resolve.
- Soft-404s on category/tag pages with zero items: the page returns HTTP 200 but has no substantive content. Google treats these as low-quality pages and may stop indexing them. Return a 404 response for empty tag/category pages.
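The sitemap-versus-noindex conflict can be sketched in a few lines. The example below assumes the pages have already been fetched into a dictionary of `url -> (html, headers)`; the names `RobotsMetaParser`, `is_noindex`, and `sitemap_noindex_conflicts` are hypothetical, and the header check ignores bot-prefixed X-Robots-Tag values such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = {key: (value or "") for key, value in attrs}
        if tag == "meta" and attributes.get("name", "").lower() in ("robots", "googlebot"):
            for token in attributes.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def is_noindex(html, headers):
    """True if the HTML or the X-Robots-Tag header carries noindex (or none)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    normalized = {key.lower(): value for key, value in headers.items()}
    header_tokens = {t.strip().lower() for t in normalized.get("x-robots-tag", "").split(",")}
    blocking = {"noindex", "none"}
    return bool(blocking & parser.directives) or bool(blocking & header_tokens)


def sitemap_noindex_conflicts(sitemap_urls, pages):
    """Return sitemap URLs whose fetched page says noindex.

    pages maps url -> (html, headers) and is assumed to be pre-fetched.
    """
    return [url for url in sitemap_urls if url in pages and is_noindex(*pages[url])]
```

Any URL this returns should either lose its noindex or be removed from the sitemap.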
How to fix it at the source
Treat Search Console as the source of truth for what Google actually thinks of your site; resubmit sitemaps there after each significant change. For hreflang, run a link-graph audit on every sitemap regeneration to verify bidirectional coverage. For indexing conflicts, the audit's per-bot simulation (Googlebot vs Bingbot vs per-LLM bot) catches directives that pass one crawler and fail another.
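A bidirectional hreflang check reduces to a small graph lookup. The sketch below assumes the annotations have already been extracted into a mapping of `page URL -> {hreflang: target URL}`; the function name and the sample URLs are hypothetical.

```python
def missing_return_links(hreflang_map):
    """Return (page, target, hreflang) triples where target does not link back to page.

    hreflang_map maps each page URL to its {hreflang value: alternate URL} annotations.
    """
    missing = []
    for page, alternates in hreflang_map.items():
        for lang, target in alternates.items():
            if target == page:
                continue  # self-reference, nothing to reciprocate
            back_links = hreflang_map.get(target, {})
            if page not in back_links.values():
                missing.append((page, target, lang))
    return sorted(missing)


# Hypothetical cluster: the English page points at the Spanish page,
# but the Spanish page only references itself.
cluster = {
    "https://example.com/en/": {
        "en": "https://example.com/en/",
        "es": "https://example.com/es/",
    },
    "https://example.com/es/": {
        "es": "https://example.com/es/",
    },
}
print(missing_return_links(cluster))
```

An empty result means every pair in the cluster reciprocates. A target that was never crawled is also reported, because its annotations are unknown.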
Thresholds that matter
| Signal | Target |
|---|---|
| Sitemap URL cap per file | 50,000 URLs or 50 MB uncompressed — split via sitemap index above that. |
| Canonical target | Must return HTTP 200 and self-reference; no redirect chain. |
| hreflang bidirectionality | 100% — every pair must reciprocate. |
| Crawl depth to any indexable page | ≤ 3 clicks from the home page for priority content. |
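Two of these thresholds are easy to check mechanically. The sketch below splits a URL list into files of at most 50,000 entries and computes click depth from the home page with a breadth-first search. The function names and the `links` mapping (page to the pages it links to) are assumptions for illustration, and the 50 MB size limit is not checked here.

```python
from collections import deque

MAX_URLS_PER_SITEMAP = 50_000  # per-file cap from the sitemaps.org protocol


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def click_depths(links, home):
    """Breadth-first search: minimum number of clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, home, max_depth=3):
    """Return the reachable pages that sit more than max_depth clicks from home."""
    depths = click_depths(links, home)
    return sorted(page for page, depth in depths.items() if depth > max_depth)
```

Each chunk from `chunk_urls` would become one sitemap file listed in the sitemap index. Pages that never appear in `click_depths` are unreachable by links and deserve a separate look.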
Example fix
robots.txt + sitemap reference + per-bot AI block:
User-agent: *
Allow: /
Disallow: /admin
Disallow: /search?
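# Per-bot AI rules: block CCBot and explicitly allow the rest.
# GPTBot, ClaudeBot and Google-Extended are training-related tokens; switch them to Disallow: / to opt out of training.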
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Allow: /
User-agent: GPTBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-images.xml
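To sanity-check a file like this before deploying it, a small per-bot evaluator helps. The sketch below is a simplified reading of the longest-match rule in RFC 9309: it matches user-agent tokens exactly, treats rule paths as plain prefixes, and does not support `*` or `$` wildcards. The function names and the sample file are hypothetical, and this is not the simulator's actual code.

```python
# Hypothetical, shortened sample file.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /
Disallow: /admin
Disallow: /search?

# Block one AI crawler, allow another
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Allow: /
"""


def parse_robots(text):
    """Parse robots.txt into {lowercase user-agent token: [(is_allow, path), ...]}."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            last_was_agent = False
            if value:  # an empty Disallow imposes no restriction
                for agent in current_agents:
                    groups[agent].append((field == "allow", value))
    return groups


def is_allowed(groups, user_agent, path):
    """Longest matching prefix wins; on a tie, Allow wins; no match means allowed."""
    rules = groups.get(user_agent.lower(), groups.get("*", []))
    best = (-1, True)
    for is_allow, prefix in rules:
        if path.startswith(prefix):
            best = max(best, (len(prefix), is_allow))
    return best[1]


groups = parse_robots(SAMPLE_ROBOTS)
print(is_allowed(groups, "Googlebot", "/admin/users"))
print(is_allowed(groups, "GPTBot", "/admin/users"))
```

One thing this makes visible: a crawler with its own group does not inherit the rules under `User-agent: *`, so a bot-specific `Allow: /` group also opens /admin to that bot unless the Disallow lines are repeated in its group.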
When to run the audit
- After a major site change — redesign, CMS migration, DNS change, hosting platform swap.
- Quarterly as part of routine technical hygiene; the checks are cheap to run repeatedly.
- Before an investor / client review, a PCI scan, a SOC 2 audit, or an accessibility-compliance review.
- When a downstream metric drops (rankings, conversion, AI citations) and you need to rule out this dimension as the cause.
Reading the output
Every finding is severity-classified. The playbook is the same across tools:
- Critical / red — same-week fixes. These block the primary signal and cascade into downstream dimensions.
- Warning / amber — same-month fixes. Drag the score, usually don't block.
- Info / blue — context only. Often what a PR reviewer would flag but that doesn't block merge.
- Pass / green — confirmation. Keep the control in place.
Every audit also emits an "AI fix prompt" — paste into ChatGPT / Claude / Gemini for exact copy-paste code patches tied to your specific stack.
Related tools in this family
- Mega Analyzer: single-URL orchestrator that catches indexing issues alongside everything else.
- IndexNow Submission Audit: verifies that the IndexNow integration pings Bing / Yandex / Seznam correctly.
- robots.txt Simulator: per-bot simulation that shows what Googlebot vs Bingbot vs GPTBot actually see.
- noindex / X-Robots-Tag Conflict Audit: flags disagreements between meta robots / X-Robots-Tag / robots.txt / sitemap.
- Link-Graph Depth Audit: counts how many clicks it takes to reach every indexable page; anything deeper than 3 clicks risks being crawled rarely or dropped from the index.
Fact-check notes and sources
- Google Search Central: Robots.txt introduction
- Sitemaps.org: Protocol spec
- IndexNow: Protocol spec
- Google: hreflang annotations for localized pages
This post is informational and not a substitute for professional consulting. Mentions of third-party platforms in the tool itself are nominative fair use. No affiliation is implied.