One canonical product page can spray thousands of URLs once you layer filtering, sorting, pagination, UTMs, affiliate codes, and A/B-test parameters. Googlebot crawls them. AI crawlers often give up entirely, or they crawl a subset and treat the remainder as off-budget.
The URL Parameter Crawl-Waste Audit clusters URLs by parameter fingerprint, estimates the duplicate-crawl percentage, and emits the robots.txt + canonical-tag config to stop the bleed.
Parameter categories (and what to do about each)
- Tracking (`utm_*`, `fbclid`, `gclid`, `ref`, `affiliate`): block in robots.txt. Never useful to index.
- Sort / order (`sort`, `order`, `direction`): block. The unsorted canonical is what should rank.
- Facets (`filter`, `category`, `tag`, `color`, `size`, `price_min`): canonical to the unfiltered page unless you're intentionally targeting facet-specific keywords.
- Pagination (`page`, `p`, `offset`): keep, with a self-referencing canonical on each page. (Google stopped using rel=next/prev as an indexing signal in 2019, so don't rely on it.)
- Locale (`lang`, `locale`, `hl`): keep; use hreflang for international variants.
- Session (`session`, `sid`, `token`): block. Session URLs should never be indexed.
- View mode (`print`, `view`, `format`): canonical to the default view.
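The category list above boils down to a small lookup. A minimal sketch in Python — the names and groupings are exactly the examples listed, not an exhaustive map, and `classify_param` is an illustrative name:

```python
# Illustrative category map built from the 7 categories above.
# Real sites will have their own parameters to add.
PARAM_CATEGORIES = {
    "tracking": {"fbclid", "gclid", "ref", "affiliate"},
    "sort": {"sort", "order", "direction"},
    "facet": {"filter", "category", "tag", "color", "size", "price_min"},
    "pagination": {"page", "p", "offset"},
    "locale": {"lang", "locale", "hl"},
    "session": {"session", "sid", "token"},
    "view": {"print", "view", "format"},
}

def classify_param(name: str) -> str:
    """Map a query-parameter name to one of the 7 categories, or 'unknown'."""
    if name.startswith("utm_"):          # utm_* is a prefix family, not a fixed name
        return "tracking"
    for category, names in PARAM_CATEGORIES.items():
        if name in names:
            return category
    return "unknown"
```

Anything classified `unknown` is worth a manual look before blocking — an unfamiliar parameter may be load-bearing.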
What the auditor does
- Takes a sitemap URL or a pasted list (e.g. from GSC Coverage CSV export).
- Parses every URL, extracts the path + sorted parameter-key fingerprint.
- Clusters by path. Paths with 3+ distinct fingerprints are marked as duplicate clusters.
- Counts parameter frequency across all URLs.
- Classifies each parameter using the 7 categories above.
- Estimates crawl-waste % = duplicates / total URLs.
- Emits a ready-to-paste robots.txt block (Disallow for all BLOCK-classified params) + canonical-tag recipe.
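The fingerprint-and-cluster steps above can be sketched in a few lines of Python. `fingerprint` and `crawl_waste` are illustrative names, and counting "all URLs in a duplicate cluster beyond one canonical per path" is one plausible reading of the duplicates / total formula:

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def fingerprint(url: str):
    """Return (path, sorted tuple of query-parameter keys); values are ignored."""
    parts = urlsplit(url)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return parts.path, tuple(keys)

def crawl_waste(urls: list) -> float:
    """Cluster URLs by path; a path with 3+ distinct parameter fingerprints
    is a duplicate cluster. Waste = URLs in duplicate clusters beyond one
    canonical per path, divided by total URLs."""
    clusters = defaultdict(list)              # path -> fingerprints seen
    for url in urls:
        path, keys = fingerprint(url)
        clusters[path].append(keys)
    duplicates = sum(
        len(fps) - 1                          # keep one canonical per path
        for fps in clusters.values()
        if len(set(fps)) >= 3
    )
    return duplicates / len(urls) if urls else 0.0
```

For example, a product path seen bare, with `?sort=`, and with `?color=` has three distinct fingerprints and counts as a duplicate cluster; two of its three URLs are waste.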
Typical first-run output
- Clean small site: 3-5% waste. Usually UTM parameters from inbound traffic. Low priority to block.
- Medium blog / marketing site: 10-15% waste. Mix of tracking + old A/B test params. Moderate priority.
- E-commerce / large catalog: 30-60% waste. Facet + sort + pagination combinatorial explosion. Critical priority.
The robots.txt block alone takes 5 minutes to ship and can recover 20-40% of crawl budget on a typical e-commerce site. The canonical tags take a deploy but close the loop for facets you want crawlable (for inventory freshness) but not indexable (for dedup).
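As a sketch, the emitted robots.txt block might look like the following — the parameter names are the examples from the category list above, not output from any real audit, and the `*` wildcard matching is per RFC 9309:

```txt
User-agent: *
# Tracking parameters -- never useful to index
Disallow: /*?*utm_
Disallow: /*?*fbclid=
Disallow: /*?*gclid=
# Sort / order -- the unsorted canonical should rank
Disallow: /*?*sort=
Disallow: /*?*order=
# Session identifiers
Disallow: /*?*session=
Disallow: /*?*sid=
```

Note that robots.txt blocking stops crawling, not indexing of already-known URLs; the canonical tags handle the dedup side.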
Related reading
- Sitemap Audit — validates XML sitemap
- AI Bot Policy Generator — emits robots.txt for AI bots
- Internal Link Auditor — finds 404s in your site's links
- Index Coverage Delta — diffs live crawl vs sitemap
Fact-check notes and sources
- Google URL parameter handling (post-2022 deprecation): developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls
- Faceted navigation best practices: developers.google.com/search/docs/specialty/ecommerce/faceted-navigation
- robots.txt specification (RFC 9309): www.rfc-editor.org/rfc/rfc9309
The $100 Network covers keeping crawl budgets lean across site networks. The auditor is how you catch parameter creep as categories and filters expand over time.