One canonical product page can spray thousands of URLs once you layer filtering, sorting, pagination, UTMs, affiliate codes, and A/B-test parameters. Googlebot crawls them. AI crawlers often give up entirely, or they crawl a subset and treat the remainder as off-budget.
The URL Parameter Crawl-Waste Audit clusters URLs by parameter fingerprint, estimates the duplicate-crawl percentage, and emits the robots.txt + canonical-tag config to stop the bleed.
Parameter categories (and what to do about each)
- Tracking (`utm_*`, `fbclid`, `gclid`, `ref`, `affiliate`): block in robots.txt. Never useful to index.
- Sort / order (`sort`, `order`, `direction`): block. The unsorted canonical is what should rank.
- Facets (`filter`, `category`, `tag`, `color`, `size`, `price_min`): canonical to the unfiltered page unless you're intentionally targeting facet-specific keywords.
- Pagination (`page`, `p`, `offset`): keep, with a self-referencing canonical on each page. (Google stopped using rel=next/prev as an indexing signal in 2019, so don't rely on it.)
- Locale (`lang`, `locale`, `hl`): keep; use hreflang for international variants.
- Session (`session`, `sid`, `token`): block. Session URLs should never be indexed.
- View mode (`print`, `view`, `format`): canonical to the default view.
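The category list above boils down to a small lookup. A minimal sketch in Python — the names and groupings are exactly the examples listed, not an exhaustive map, and `classify_param` is an illustrative name:

```python
# Illustrative category map built from the 7 categories above.
# Real sites will have their own parameters to add.
PARAM_CATEGORIES = {
    "tracking": {"fbclid", "gclid", "ref", "affiliate"},
    "sort": {"sort", "order", "direction"},
    "facet": {"filter", "category", "tag", "color", "size", "price_min"},
    "pagination": {"page", "p", "offset"},
    "locale": {"lang", "locale", "hl"},
    "session": {"session", "sid", "token"},
    "view": {"print", "view", "format"},
}

def classify_param(name: str) -> str:
    """Map a query-parameter name to one of the 7 categories, or 'unknown'."""
    if name.startswith("utm_"):          # utm_* is a prefix family, not a fixed name
        return "tracking"
    for category, names in PARAM_CATEGORIES.items():
        if name in names:
            return category
    return "unknown"
```

Anything classified `unknown` is worth a manual look before blocking — an unfamiliar parameter may be load-bearing.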
What the auditor does
- Takes a sitemap URL or a pasted list (e.g. from GSC Coverage CSV export).
- Parses every URL, extracts the path + sorted parameter-key fingerprint.
- Clusters by path. Paths with 3+ distinct fingerprints are marked as duplicate clusters.
- Counts parameter frequency across all URLs.
- Classifies each parameter using the 7 categories above.
- Estimates crawl-waste % = duplicates / total URLs.
- Emits a ready-to-paste robots.txt block (Disallow for all BLOCK-classified params) + canonical-tag recipe.
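The fingerprint-and-cluster steps above can be sketched in a few lines of Python. `fingerprint` and `crawl_waste` are illustrative names, and counting "all URLs in a duplicate cluster beyond one canonical per path" is one plausible reading of the duplicates / total formula:

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def fingerprint(url: str):
    """Return (path, sorted tuple of query-parameter keys); values are ignored."""
    parts = urlsplit(url)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return parts.path, tuple(keys)

def crawl_waste(urls: list) -> float:
    """Cluster URLs by path; a path with 3+ distinct parameter fingerprints
    is a duplicate cluster. Waste = URLs in duplicate clusters beyond one
    canonical per path, divided by total URLs."""
    clusters = defaultdict(list)              # path -> fingerprints seen
    for url in urls:
        path, keys = fingerprint(url)
        clusters[path].append(keys)
    duplicates = sum(
        len(fps) - 1                          # keep one canonical per path
        for fps in clusters.values()
        if len(set(fps)) >= 3
    )
    return duplicates / len(urls) if urls else 0.0
```

For example, a product path seen bare, with `?sort=`, and with `?color=` has three distinct fingerprints and counts as a duplicate cluster; two of its three URLs are waste.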
Typical first-run output
- Clean small site: 3-5% waste. Usually UTM parameters from inbound traffic. Low priority to block.
- Medium blog / marketing site: 10-15% waste. Mix of tracking + old A/B test params. Moderate priority.
- E-commerce / large catalog: 30-60% waste. Facet + sort + pagination combinatorial explosion. Critical priority.
The robots.txt block alone takes 5 minutes to ship and can recover 20-40% of crawl budget on a typical e-commerce site. The canonical tags take a deploy but close the loop for facets you want crawlable (for inventory freshness) but not indexable (for dedup).
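As a sketch, the emitted robots.txt block might look like the following — the parameter names are the examples from the category list above, not output from any real audit, and the `*` wildcard matching is per RFC 9309:

```txt
User-agent: *
# Tracking parameters -- never useful to index
Disallow: /*?*utm_
Disallow: /*?*fbclid=
Disallow: /*?*gclid=
# Sort / order -- the unsorted canonical should rank
Disallow: /*?*sort=
Disallow: /*?*order=
# Session identifiers
Disallow: /*?*session=
Disallow: /*?*sid=
```

Note that robots.txt blocking stops crawling, not indexing of already-known URLs; the canonical tags handle the dedup side.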
Related reading
- Sitemap Audit — validates XML sitemap
- AI Bot Policy Generator — emits robots.txt for AI bots
- Internal Link Auditor — finds 404s in your site's links
- Index Coverage Delta — diffs live crawl vs sitemap
Fact-check notes and sources
- Google URL parameter handling (post-2022 deprecation): developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls
- Faceted navigation best practices: developers.google.com/search/docs/specialty/ecommerce/faceted-navigation
- robots.txt specification (RFC 9309): www.rfc-editor.org/rfc/rfc9309
The $100 Network covers keeping crawl budgets lean across site networks. The auditor is how you catch parameter creep as categories and filters expand over time.