When 80% of Your Crawl Budget Goes to ?sort=price Duplicates
One canonical product page can spray thousands of URLs once you layer filtering, sorting, pagination, UTMs, affiliate codes, and A/B-test parameters. Googlebot crawls them. AI crawlers often give up entirely, or they crawl a subset and treat the remainder as off-budget.

The URL Parameter Crawl-Waste Audit clusters URLs by parameter fingerprint, estimates the duplicate-crawl percentage, and emits the robots.txt + canonical-tag config to stop the bleed.

Parameter categories (and what to do about each)

  • Tracking (utm_*, fbclid, gclid, ref, affiliate): block in robots.txt. Never useful to index.
  • Sort / order (sort, order, direction): block. The unsorted canonical is what should rank.
  • Facets (filter, category, tag, color, size, price_min): canonical to unfiltered page unless you're intentionally targeting facet-specific keywords.
  • Pagination (page, p, offset): keep crawlable. Note that Google retired rel=next/prev as an indexing signal in 2019, so rely on self-referencing canonicals and clear internal links rather than head hints alone; other crawlers may still read rel=next/prev.
  • Locale (lang, locale, hl): keep, use hreflang for international.
  • Session (session, sid, token): block. Session URLs should never be indexed.
  • View mode (print, view, format): canonical to default view.
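The seven categories above map naturally to a lookup table. Here is a minimal sketch of such a classifier; the category names, parameter lists, and function names are illustrative assumptions, not the auditor's actual internals.

```python
# Hypothetical parameter classifier based on the 7 categories above.
# Each category maps a set of known keys to a recommended action.
CATEGORIES = {
    "tracking":   ({"fbclid", "gclid", "ref", "affiliate"}, "block"),
    "sort":       ({"sort", "order", "direction"}, "block"),
    "facet":      ({"filter", "category", "tag", "color", "size", "price_min"}, "canonical"),
    "pagination": ({"page", "p", "offset"}, "keep"),
    "locale":     ({"lang", "locale", "hl"}, "keep"),
    "session":    ({"session", "sid", "token"}, "block"),
    "view":       ({"print", "view", "format"}, "canonical"),
}

def classify(param):
    """Map a parameter key to (category, action); unknown keys get flagged for review."""
    if param.startswith("utm_"):  # utm_* is a prefix family, not a fixed key
        return "tracking", "block"
    for category, (keys, action) in CATEGORIES.items():
        if param in keys:
            return category, action
    return "unknown", "review"
```

Anything that falls through to "unknown" is exactly the parameter creep worth investigating by hand.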

What the auditor does

  1. Takes a sitemap URL or a pasted list (e.g. from GSC Coverage CSV export).
  2. Parses every URL, extracts the path + sorted parameter-key fingerprint.
  3. Clusters by path. Paths with 3+ distinct fingerprints are marked as duplicate clusters.
  4. Counts parameter frequency across all URLs.
  5. Classifies each parameter using the 7 categories above.
  6. Estimates crawl-waste % = duplicate URLs (everything beyond one canonical URL per clustered path) divided by total URLs.
  7. Emits a ready-to-paste robots.txt block (Disallow for all BLOCK-classified params) + canonical-tag recipe.
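Steps 2 through 6 can be sketched in a few lines of Python. This is an assumed reconstruction of the fingerprint-and-cluster logic described above, not the auditor's published source; function names and the 3-fingerprint threshold come from the steps listed here.

```python
from urllib.parse import urlparse, parse_qsl
from collections import Counter, defaultdict

def fingerprint(url):
    """Step 2: extract (path, sorted tuple of parameter keys) from a URL."""
    parsed = urlparse(url)
    keys = tuple(sorted({k for k, _ in parse_qsl(parsed.query)}))
    return parsed.path, keys

def crawl_waste(urls, cluster_threshold=3):
    """Steps 3-6: cluster by path, count parameter frequency, estimate waste %."""
    by_path = defaultdict(set)    # path -> set of distinct parameter fingerprints
    url_count = defaultdict(int)  # path -> number of URLs seen for that path
    param_freq = Counter()        # parameter key -> occurrences across all URLs
    for url in urls:
        path, keys = fingerprint(url)
        by_path[path].add(keys)
        url_count[path] += 1
        param_freq.update(keys)
    # A path with 3+ distinct fingerprints is a duplicate cluster; every URL
    # beyond one canonical per clustered path counts as wasted crawl.
    duplicates = sum(
        url_count[path] - 1
        for path, fps in by_path.items()
        if len(fps) >= cluster_threshold
    )
    waste_pct = 100 * duplicates / len(urls) if urls else 0.0
    return waste_pct, param_freq
```

Feeding this a GSC export of five URLs where four are parameter variants of one product page reports 60% waste, which is why large catalogs land in the 30-60% band below.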

Typical first-run output

  • Clean small site: 3-5% waste. Usually UTM parameters from inbound traffic. Low priority to block.
  • Medium blog / marketing site: 10-15% waste. Mix of tracking + old A/B test params. Moderate priority.
  • E-commerce / large catalog: 30-60% waste. Facet + sort + pagination combinatorial explosion. Critical priority.

The robots.txt block alone takes 5 minutes to ship and can recover 20-40% of crawl budget on a typical e-commerce site. The canonical tags take a deploy but close the loop for facets you want crawlable (for inventory freshness) but not indexable (for dedup).
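For context, the emitted robots.txt block looks roughly like the following. The exact parameter list depends on your audit results; this fragment is illustrative (note each parameter needs both a `?` and an `&` pattern, since it can appear in either position):

```
User-agent: *
# Tracking — never useful to index
Disallow: /*?utm_
Disallow: /*&utm_
Disallow: /*?gclid=
Disallow: /*&gclid=
Disallow: /*?fbclid=
Disallow: /*&fbclid=
# Sort — unsorted canonical should rank
Disallow: /*?sort=
Disallow: /*&sort=
# Session — never indexable
Disallow: /*?sid=
Disallow: /*&sid=
```

The canonical-tag half of the recipe is a self-referencing `<link rel="canonical" href="...">` on each clean page, with faceted variants pointing their canonical at the unfiltered URL; that keeps facets crawlable for freshness while deduplicating the index.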

Related reading

The $100 Network covers keeping crawl budgets lean across site networks. The auditor is how you catch parameter creep as categories and filters expand over time.
