# When 80% of Your Crawl Budget Goes to ?sort=price Duplicates

E-commerce and large blog sites routinely spray URLs like `/products?sort=price&filter=red&page=2&utm_source=email`. Googlebot can crawl every combination, and on a badly affected site as much as 80% of crawl budget goes to near-duplicates of one canonical page. The auditor clusters URLs by parameter fingerprint and emits the robots.txt + canonical config to stop the waste.

Author: J.A. Watte
Published: May 1, 2026
Source: https://jwatte.com/blog/blog-tool-param-crawl-waste/

---

One canonical product page can spray thousands of URLs once you layer filtering, sorting, pagination, UTMs, affiliate codes, and A/B-test parameters. Googlebot crawls them. AI crawlers often give up entirely, or they crawl a subset and treat the remainder as off-budget.

The [URL Parameter Crawl-Waste Audit](/tools/param-crawl-waste/) clusters URLs by parameter fingerprint, estimates the duplicate-crawl percentage, and emits the robots.txt + canonical-tag config to stop the bleed.

## Parameter categories (and what to do about each)

- **Tracking** (`utm_*`, `fbclid`, `gclid`, `ref`, `affiliate`): **block in robots.txt.** Never useful to index.
- **Sort / order** (`sort`, `order`, `direction`): **block.** The unsorted canonical is what should rank.
- **Facets** (`filter`, `category`, `tag`, `color`, `size`, `price_min`): **canonical to unfiltered page** unless you're intentionally targeting facet-specific keywords.
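- **Pagination** (`page`, `p`, `offset`): **keep crawlable.** Adding `rel=next/prev` in the head is optional: Google has not used it as an indexing signal since 2019, though other crawlers may still read it.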
- **Locale** (`lang`, `locale`, `hl`): **keep, use hreflang** for international.
- **Session** (`session`, `sid`, `token`): **block.** Session URLs should never be indexed.
- **View mode** (`print`, `view`, `format`): **canonical to default view.**
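
The category list above maps naturally onto a lookup table. The following is a minimal Python sketch, not the auditor's actual code: the `CATEGORIES` table and `classify_param` function are hypothetical names, and the `("unknown", "review")` fallback for unlisted parameters is an assumption added for illustration.

```python
# Category -> (parameter names, recommended action), mirroring the list above.
# Names and structure are illustrative, not taken from the auditor itself.
CATEGORIES = {
    "tracking": ({"fbclid", "gclid", "ref", "affiliate"}, "block"),
    "sort": ({"sort", "order", "direction"}, "block"),
    "facet": ({"filter", "category", "tag", "color", "size", "price_min"}, "canonical"),
    "pagination": ({"page", "p", "offset"}, "keep"),
    "locale": ({"lang", "locale", "hl"}, "keep"),
    "session": ({"session", "sid", "token"}, "block"),
    "view": ({"print", "view", "format"}, "canonical"),
}


def classify_param(name):
    """Return (category, action) for a query-parameter name."""
    key = name.lower()
    # utm_* is a prefix family, so it cannot live in a plain set lookup.
    if key.startswith("utm_"):
        return ("tracking", "block")
    for category, (names, action) in CATEGORIES.items():
        if key in names:
            return (category, action)
    # Assumption: anything unrecognised is flagged for manual review.
    return ("unknown", "review")
```

For example, `classify_param("utm_source")` returns `("tracking", "block")`, while a facet such as `color` comes back as `("facet", "canonical")`.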

## What the auditor does

1. Takes a sitemap URL or a pasted list of URLs (e.g. from a Google Search Console coverage CSV export).
2. Parses every URL, extracts the path + sorted parameter-key fingerprint.
3. Clusters by path. Paths with 3+ distinct fingerprints are marked as duplicate clusters.
4. Counts parameter frequency across all URLs.
5. Classifies each parameter using the 7 categories above.
6. Estimates crawl-waste % as duplicate URLs divided by total URLs.
7. Emits a ready-to-paste robots.txt block (Disallow for all BLOCK-classified params) + canonical-tag recipe.
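
Steps 2, 3, and 6 can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the auditor's implementation: `fingerprint` and `estimate_waste` are hypothetical names, all URLs are assumed to share one host, and "duplicates" is assumed to mean every URL in a flagged cluster except one canonical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def fingerprint(url):
    """Return (path, sorted tuple of distinct parameter keys) for a URL."""
    parts = urlsplit(url)
    keys = {k for k, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return parts.path or "/", tuple(sorted(keys))


def estimate_waste(urls, min_fingerprints=3):
    """Cluster URLs by path and estimate the duplicate-crawl fraction.

    Returns (waste_fraction, flagged), where flagged maps each duplicate
    cluster's path to its sorted list of fingerprints.
    """
    clusters = defaultdict(lambda: defaultdict(list))
    for url in urls:
        path, fp = fingerprint(url)
        clusters[path][fp].append(url)

    duplicates = 0
    flagged = {}
    for path, by_fp in clusters.items():
        # A path with 3+ distinct fingerprints is a duplicate cluster.
        if len(by_fp) >= min_fingerprints:
            url_count = sum(len(group) for group in by_fp.values())
            # Assumption: one URL per cluster is the canonical, the rest are waste.
            duplicates += url_count - 1
            flagged[path] = sorted(by_fp)

    total = len(urls)
    return (duplicates / total if total else 0.0), flagged
```

Sorting the keys makes `?sort=price&filter=red` and `?filter=red&sort=price` collapse to the same fingerprint, which is what lets the clustering ignore parameter order.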

## Typical first-run output

- **Clean small site:** 3-5% waste. Usually UTM parameters from inbound traffic. Low priority to block.
- **Medium blog / marketing site:** 10-15% waste. Mix of tracking + old A/B test params. Moderate priority.
- **E-commerce / large catalog:** 30-60% waste. Facet + sort + pagination combinatorial explosion. Critical priority.

The robots.txt block alone takes 5 minutes to ship and can recover 20-40% of crawl budget on a typical e-commerce site. The canonical tags take a deploy but close the loop for facets you want crawlable (for inventory freshness) but not indexable (for dedup).
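
To make the two outputs concrete, here is a hedged Python sketch that builds a robots.txt block and a canonical URL. The function names, the `keep` default, and the exact `Disallow` patterns are assumptions for illustration; the auditor's real output may be formatted differently. The patterns rely on the `*` wildcard described in RFC 9309.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def robots_block(blocked_params, user_agent="*"):
    """Build a robots.txt group that disallows URLs carrying the given params.

    A name ending in "*" (e.g. "utm_*") is treated as a prefix family.
    """
    lines = [f"User-agent: {user_agent}"]
    for name in sorted(blocked_params):
        token = name[:-1] if name.endswith("*") else name + "="
        # Two rules per parameter: first in the query string, or after "&".
        lines.append(f"Disallow: /*?{token}")
        lines.append(f"Disallow: /*&{token}")
    return "\n".join(lines)


def canonical_url(url, keep=("page", "lang", "locale", "hl")):
    """Strip every query parameter not in `keep` and drop the fragment."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k in keep
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With these definitions, `canonical_url("https://example.com/products?sort=price&filter=red&page=2&utm_source=email")` yields `https://example.com/products?page=2`, which is the value a page template would place in its `<link rel="canonical">` tag.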

## Related reading

- [Sitemap Audit](/tools/sitemap-audit/) — validates XML sitemap
- [AI Bot Policy Generator](/tools/ai-bot-policy-gen/) — emits robots.txt for AI bots
- [Internal Link Auditor](/tools/internal-link-auditor/) — finds 404s in your site's links
- [Index Coverage Delta](/tools/index-coverage-delta/) — diffs live crawl vs sitemap

## Fact-check notes and sources

- Google's guidance on consolidating duplicate URLs (the Search Console URL Parameters tool was retired in 2022): [developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls](https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls)
- Faceted navigation best practices: [developers.google.com/search/docs/specialty/ecommerce/faceted-navigation](https://developers.google.com/search/docs/specialty/ecommerce/faceted-navigation)
- robots.txt specification (RFC 9309): [www.rfc-editor.org/rfc/rfc9309](https://www.rfc-editor.org/rfc/rfc9309)

---

*The $100 Network covers keeping crawl budgets lean across site networks. The auditor is how you catch parameter creep as categories and filters expand over time.*


