A Sitemap Audit Tool — Catching Stale lastmod, Dead URLs, and Sitemap-Index Loops

TL;DR. Indexing bugs compound silently. One stray <meta name="robots" content="noindex"> left in a template after staging can deindex the whole site, and Search Console may not flag it until weeks after it starts.

The Sitemap Audit is the tool to reach for when you already suspect a sitemap or indexing problem and need a fast, copy-paste-able fix list. It shares its chrome with every other jwatte.com tool (deep links from the mega analyzers, AI-prompt export, CSV/PDF/HTML download), but the checks it runs are narrow and specific to sitemaps and indexing.

It validates sitemap.xml, sitemap-image.xml, sitemap-news.xml, and the sitemap index. It then probes every listed URL for its HTTP status and reports stale lastmod values, dead URLs, and missing image references.
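As a rough illustration of the stale-lastmod check, here is a minimal Python sketch. It is not the tool's actual implementation: the function names, the 365-day threshold, and the "stale" / "missing" / "invalid" labels are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value):
    """Parse a lastmod value into an aware UTC datetime, or None if unparseable."""
    text = value.strip()
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"  # older Python versions reject a bare "Z"
    try:
        parsed = datetime.fromisoformat(text)
    except ValueError:
        return None
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed.astimezone(timezone.utc)


def find_stale_entries(sitemap_xml, now, max_age_days=365):
    """Return (url, reason) pairs for entries with missing, invalid, or old lastmod."""
    root = ET.fromstring(sitemap_xml)
    cutoff = now - timedelta(days=max_age_days)
    findings = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if lastmod is None:
            findings.append((loc, "missing"))
            continue
        parsed = parse_lastmod(lastmod)
        if parsed is None:
            findings.append((loc, "invalid"))
        elif parsed < cutoff:
            findings.append((loc, "stale"))
    return findings
```

Passing `now` in explicitly keeps the check deterministic, which makes it easier to rerun against the same sitemap and compare results.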

Why this dimension matters

Indexing issues compound silently. A single <meta name="robots" content="noindex"> left in a template after staging can deindex the entire site; a sitemap that omits pagination URLs can leave half the catalog uncrawled; a Disallow: that overlaps with a Sitemap: entry creates a per-bot disagreement (Google may index the URL; Bing may not). These are the slow-leak failures that Search Console flags weeks after they start.

Common failure patterns

  • Canonical tag pointing at a 404 or a redirect chain: the audit verifies that every canonical URL resolves 200-OK and doesn't redirect. Google treats rel=canonical as a hint, so a canonical that lands on a 404 or 301s to another URL sends a conflicting signal, and Google may ignore it and choose its own canonical.
  • Mismatched hreflang cluster: locale A links to locale B with hreflang=es, but locale B does not reciprocate. Google ignores hreflang annotations that are not reciprocated, so the pair silently drops out of international targeting. The audit checks bidirectionality.
  • Sitemap declaring URLs that noindex via meta or X-Robots-Tag: sitemap entries are suggestions; noindex is authoritative. If the same URL says "index me" in the sitemap and "don't index me" in the HTML, Google follows the HTML. Flag and resolve.
  • Soft-404s on category/tag pages with zero items: the page returns HTTP 200 but has no substantive content. Google can classify these as soft 404s and exclude them from the index. Return a real 404 (or 410) response for empty tag/category pages.
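The sitemap-versus-noindex conflict in the list above can be sketched as a pure function over pages that have already been fetched. This is an illustrative sketch, not the tool's code: `RobotsMetaParser`, `is_noindexed`, and `sitemap_noindex_conflicts` are hypothetical names, and the sketch ignores bot-specific directives such as `googlebot: noindex` in X-Robots-Tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives.extend(part.strip().lower() for part in content.split(","))


def is_noindexed(html, headers):
    """True if the HTML meta tag or the X-Robots-Tag header says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    header_value = ""
    for key, value in headers.items():
        if key.lower() == "x-robots-tag":
            header_value = value
    header_directives = [part.strip().lower() for part in header_value.split(",")]
    blocking = {"noindex", "none"}  # "none" is shorthand for noindex, nofollow
    return bool(blocking.intersection(parser.directives + header_directives))


def sitemap_noindex_conflicts(pages):
    """pages maps each sitemap URL to (html, headers); return the conflicting URLs."""
    return [url for url, (html, headers) in pages.items() if is_noindexed(html, headers)]
```

Keeping the fetch step separate from the check means the same function can run against live responses or saved fixtures.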

How to fix it at the source

Treat Search Console as the source of truth for what Google actually thinks of your site, and submit every sitemap update there. For hreflang, run a link-graph audit on every sitemap regeneration to verify bidirectional coverage. For indexing conflicts, the audit's per-bot simulation (Googlebot vs Bingbot vs per-LLM bot) catches directives that pass one crawler and fail another.
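A link-graph check for hreflang reciprocity can be quite small. The sketch below assumes the annotations have already been extracted into a dictionary mapping each page URL to its `{hreflang code: alternate URL}` pairs; the function name and data shape are hypothetical, not part of the tool.

```python
def find_unreciprocated_hreflang(annotations):
    """Return (source, target) pairs where target does not link back to source.

    annotations maps page URL -> {hreflang code: alternate URL}.
    """
    missing = []
    for source, alternates in annotations.items():
        for target in alternates.values():
            if target == source:
                continue  # a self-referencing entry needs no return link
            back_links = annotations.get(target, {})
            if source not in back_links.values():
                missing.append((source, target))
    return missing


# Locale A links to locale B, but B only references itself.
example = {
    "https://example.com/en/": {
        "en": "https://example.com/en/",
        "es": "https://example.com/es/",
    },
    "https://example.com/es/": {"es": "https://example.com/es/"},
}
print(find_unreciprocated_hreflang(example))
```

A target that was never crawled counts as unreciprocated here, which is usually the safer default for an audit.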

Thresholds that matter

Signal | Target
Sitemap URL cap per file | 50,000 URLs or 50 MB uncompressed; split via a sitemap index above that.
Canonical target | Must return HTTP 200 and self-reference; no redirect chain.
hreflang bidirectionality | 100%; every pair must reciprocate.
Crawl depth to any indexable page | ≤ 3 clicks from the home page for priority content.
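For the per-file cap, a hedged sketch of the split might look like the following. The helper names are hypothetical; the limits are the 50,000-URL and 50 MB figures from the thresholds above, and the child sitemap URLs passed to the index builder are placeholders.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50_000
MAX_BYTES_PER_FILE = 50 * 1024 * 1024  # 50 MB, measured uncompressed


def exceeds_cap(url_count, uncompressed_bytes):
    """True if a single sitemap file is over either limit."""
    return url_count > MAX_URLS_PER_FILE or uncompressed_bytes > MAX_BYTES_PER_FILE


def chunk_urls(urls, max_urls=MAX_URLS_PER_FILE):
    """Split a URL list into consecutive chunks of at most max_urls each."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]


def build_sitemap_index(sitemap_urls):
    """Render a sitemap index document that points at the child sitemap files."""
    entries = "".join(
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n" for url in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )
```

Chunking by URL count alone does not guarantee the byte limit, so a file with very long URLs or many image entries should still be checked with `exceeds_cap` after it is rendered.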

Example fix

robots.txt + sitemap reference + per-bot AI block:

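User-agent: *
Allow: /
Disallow: /admin
Disallow: /search?

# Per-bot AI rules. A crawler that matches a named group obeys only that
# group and ignores the * group, so the shared Disallow lines are repeated.
# Block Common Crawl's crawler entirely.
User-agent: CCBot
Disallow: /

# Explicitly allow these AI crawlers. GPTBot, ClaudeBot, and Google-Extended
# are training-related tokens; use "Disallow: /" for any you want to opt out of.
User-agent: ClaudeBot
User-agent: GPTBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
Disallow: /admin
Disallow: /search?

Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-images.xml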
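To sanity-check per-bot rules like the ones above before deploying them, Python's standard-library `urllib.robotparser` can simulate a few crawlers. This is a sketch under assumptions: the trimmed `ROBOTS_TXT` below is a simplified stand-in, `simulate` is a hypothetical helper, and the standard-library parser evaluates rules in file order, which may differ from Google's longest-match precedence when Allow and Disallow overlap.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in for a real robots.txt (assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
"""


def simulate(robots_txt, bots, url):
    """Return {bot name: whether that bot may fetch url} for the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


bots = ["Bingbot", "CCBot", "GPTBot"]
print(simulate(ROBOTS_TXT, bots, "https://yoursite.com/pricing"))
print(simulate(ROBOTS_TXT, bots, "https://yoursite.com/admin"))
```

In this simplified file GPTBot is allowed to fetch /admin, because a bot that matches its own group never consults the `*` group. That is the kind of per-bot disagreement worth catching before it ships.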

When to run the audit

  • After a major site change — redesign, CMS migration, DNS change, hosting platform swap.
  • Quarterly as part of routine technical hygiene; the checks are cheap to run repeatedly.
  • Before an investor / client review, a PCI scan, a SOC 2 audit, or an accessibility-compliance review.
  • When a downstream metric drops (rankings, conversion, AI citations) and you need to rule out sitemap and indexing problems as the cause.

Reading the output

Every finding is severity-classified. The playbook is the same across tools:

  • Critical / red — same-week fixes. These block the primary signal and cascade into downstream dimensions.
  • Warning / amber — same-month fixes. Drag the score, usually don't block.
  • Info / blue — context only. Often what a PR reviewer would flag but that doesn't block merge.
  • Pass / green — confirmation. Keep the control in place.

Every audit also emits an "AI fix prompt" — paste into ChatGPT / Claude / Gemini for exact copy-paste code patches tied to your specific stack.

Fact-check notes and sources

This post is informational and not a substitute for professional consulting. Mentions of third-party platforms in the tool itself are nominative fair use. No affiliation is implied.

Last updated: April 2026