Robots.txt as a Platform Fingerprint — Catching Hidden ...

Platform fingerprinting is one of those audit features you don't think about until it silently fails. The Mega Analyzer keeps a list of about 50 SaaS website builders (WordPress, Squarespace, Wix, Webflow, Shopify, Pixieset, SmugMug, Showit, and so on), each matched by a URL regex against the rendered HTML. If the page references cdn.shopify.com, it's Shopify. If parastorage.com, it's Wix. If pixieset.com, it's Pixieset.
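To make the URL pass concrete, here is a minimal Python sketch of that kind of lookup. It is not the Mega Analyzer's code: the table name, the function name, and the exact regexes are my assumptions, and only the three hostnames come from the examples above.

```python
import re
from typing import Optional

# Hypothetical signature table: platform name -> regex matched against the
# rendered HTML. Only these three hostnames come from the article; a real
# table would hold roughly 50 entries.
URL_SIGNATURES = {
    "Shopify": re.compile(r"cdn\.shopify\.com", re.IGNORECASE),
    "Wix": re.compile(r"parastorage\.com", re.IGNORECASE),
    "Pixieset": re.compile(r"pixieset\.com", re.IGNORECASE),
}


def detect_platform_by_url(html: str) -> Optional[str]:
    """Return the first platform whose signature appears in the HTML, else None."""
    for platform, pattern in URL_SIGNATURES.items():
        if pattern.search(html):
            return platform
    return None
```

A page whose assets all load from the tenant's own domain returns None here, and that result is what the report shows as "unknown CMS".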

That works the vast majority of the time. It quietly breaks when a platform lets tenants serve assets from their own domain.

The custom-domain problem

Pixieset Website is the example that surfaced this gap. Pixieset is a Canadian SaaS built for photographers — galleries, marketing pages, store, CRM, the whole stack. When a photographer uploads images, Pixieset's default CDN serves them from pixieset.com paths, which the URL regex catches cleanly. But if the photographer has a paid plan and uploads custom-resolution exports that load from their own domain, the pixieset.com strings never appear in the HTML.

The site is still 100% Pixieset. The bot-protection layer is still Pixieset's managed Cloudflare. The platform constraints (no root-file uploads, limited Custom Code, JSON-LD only on Premium plans) all still apply. The analyzer just doesn't know.

I caught this during an audit of a wedding-photographer client. The site fingerprint came back as "unknown CMS" because the rendered HTML had no platform-specific URLs. The only way to confirm Pixieset was to probe a known Pixieset internal endpoint (/api/widgets/all) and see Pixieset's own HTML stub return. That's not a check you want to run blindly against every audit URL.

What's already on every site anyway

Robots.txt. Nearly every site has one, and most hosted platforms ship a default that's identical across all tenants. Those defaults are surprisingly fingerprintable.

Pixieset's default robots.txt, for example, includes thirteen specific bots — AhrefsBot, bingbot, BLEXBot, BUbiNG, dotbot, msnbot, MJ12bot, PetalBot, four SemrushBot variants, SiteAuditBot, SMTBot, Yandex — each on its own User-agent line followed by Crawl-delay: 10. Then User-agent: * with Crawl-delay: 1. The exact bot list, in that order, with that delay configuration, is Pixieset.

Squarespace has a different signature. Showit has another. WordPress installs without an SEO plugin have a near-empty default. Wix has its own pattern.

What the new check does

Round 8 of the Mega Analyzer (shipped 2026-05-20) adds a robots.txt-fingerprint fallback to the platform detector. If the URL-regex pass returns nothing AND the robots.txt matches a known platform's default signature, the detector promotes the site to that platform.

For Pixieset specifically, the check matches when six or more of the named bots are listed with Crawl-delay: 10 and the wildcard * agent gets Crawl-delay: 1. The threshold of six (rather than all thirteen) is intentional: tenants occasionally edit robots.txt to add or remove a bot, and the fingerprint should survive that.
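The rule above is small enough to sketch in Python. Treat this as an illustration under assumptions rather than the shipped check: the function names are mine, the bot names are copied from the list earlier in this post, and collapsing every SemrushBot variant into one family through a prefix match is a simplification I chose for the sketch.

```python
from typing import Dict, List

# Bot names taken from the article's list, lowercased. Treating all SemrushBot
# variants as one family via a prefix match is an assumption of this sketch.
PIXIESET_BOTS = {
    "ahrefsbot", "bingbot", "blexbot", "bubing", "dotbot", "msnbot",
    "mj12bot", "petalbot", "semrushbot", "siteauditbot", "smtbot", "yandex",
}
MIN_BOT_MATCHES = 6  # threshold stated in the article


def parse_crawl_delays(robots_txt: str) -> Dict[str, float]:
    """Map each lowercased user-agent to its Crawl-delay, if it has one."""
    delays: Dict[str, float] = {}
    group: List[str] = []  # user-agents of the group currently being read
    in_rules = False       # True once the group has seen a non-User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                group, in_rules = [], False
            group.append(value.lower())
        else:
            in_rules = True
            if field == "crawl-delay":
                try:
                    delay = float(value)
                except ValueError:
                    continue  # ignore malformed delays
                for agent in group:
                    delays[agent] = delay
    return delays


def looks_like_pixieset(robots_txt: str) -> bool:
    """True when the wildcard delay is 1 and enough named bots have delay 10."""
    delays = parse_crawl_delays(robots_txt)
    if delays.get("*") != 1:
        return False
    families = set()
    for agent, delay in delays.items():
        if delay != 10:
            continue
        family = "semrushbot" if agent.startswith("semrushbot") else agent
        if family in PIXIESET_BOTS:
            families.add(family)
    return len(families) >= MIN_BOT_MATCHES
```

Counting distinct bot families instead of comparing the whole file is what lets the match survive a tenant adding or removing a bot or two.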

When the fallback fires, the analyzer sets d.vendorPlatformDetectedVia so the report shows how it was identified ("Pixieset — detected via robots.txt fingerprint"), and it also flips d.platformCantHostRootFiles to true. That second flag matters because it gates the AGENTS.md, security.txt, llms-full.txt, and humans.txt checks. Without the platform flag set, those checks fire as full warnings against a tenant who literally cannot publish a root file no matter how much they want to. With the flag set, they downgrade to info-severity with a "your platform doesn't support this" note.
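Here is a rough Python sketch of how those two flags could drive the severity downgrade. The snake_case fields stand in for d.vendorPlatformDetectedVia and d.platformCantHostRootFiles; the Detection class, the function names, and the severity strings are all hypothetical, not taken from the analyzer.

```python
from dataclasses import dataclass
from typing import Optional

# The four checks the article says are gated by the root-file flag.
ROOT_FILE_CHECKS = {"AGENTS.md", "security.txt", "llms-full.txt", "humans.txt"}


@dataclass
class Detection:
    # Snake_case stand-ins for d.vendorPlatformDetectedVia and
    # d.platformCantHostRootFiles; the platform field is hypothetical.
    platform: Optional[str] = None
    vendor_platform_detected_via: Optional[str] = None
    platform_cant_host_root_files: bool = False


def apply_robots_fallback(d: Detection, robots_match: Optional[str]) -> Detection:
    """Promote the site only when the URL pass found nothing and robots.txt matched."""
    if d.platform is None and robots_match is not None:
        d.platform = robots_match
        d.vendor_platform_detected_via = "robots.txt fingerprint"
        d.platform_cant_host_root_files = True
    return d


def severity_for(check: str, d: Detection) -> str:
    """Downgrade root-file checks to info when the platform cannot host them."""
    if check in ROOT_FILE_CHECKS and d.platform_cant_host_root_files:
        return "info"
    return "warning"
```

With this shape, a humans.txt finding on a fingerprinted Pixieset site comes back as "info", while unrelated checks keep their normal severity.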

The general pattern

The lesson generalizes beyond Pixieset. Any SaaS website builder that:

Ships a default robots.txt that's identical across all tenants
Lets tenants serve assets from custom domains (so URL fingerprinting misses)
Has platform-level constraints that should gate certain checks (root files, custom code restrictions, schema-injection limitations)

…is a candidate for the same fallback. The Mega Analyzer round-8 commit already lists its extension hints: Format, Showit, Pic-Time, ShootProof, and Zenfolio are all photographer-focused SaaS where a robots.txt fingerprint would be more reliable than URL-based detection. Future rounds will add them.

What this means for SEO auditors

If you run a site through any audit tool — Screaming Frog, Ahrefs, Semrush, Sitebulb, or a homegrown crawler — and the platform comes back as "unknown CMS," pull the robots.txt yourself and read it. The default robots.txt of a hosted platform is one of the most stable fingerprints you can ask for. It survives custom domains, theme changes, and content rewrites that wreck most other signals.
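If you would rather script that manual check than open a browser, a few lines of standard-library Python will do. This is only a sketch: the function names and the User-Agent string are placeholders I made up, and error handling is left to the caller.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the origin that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url: str, timeout: float = 10.0) -> str:
    """Download robots.txt as text. Raises urllib.error.URLError on failure."""
    # The User-Agent value is a placeholder; use one that identifies your tool.
    request = Request(robots_url(page_url), headers={"User-Agent": "manual-audit-check"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_robots on any page URL of the site returns the file as text, ready to be read by eye or compared against known platform defaults.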

For tenants on these platforms, the actionable takeaway is the inverse: if you don't want auditors (or competitors) easily identifying your platform from robots.txt alone, customize the file. Most platforms let you, even if it's a Premium-plan feature or a support ticket. Adding a Sitemap: line and a few custom Disallow paths is enough to break an exact match against the default file; a tolerant check like the Pixieset one above keys on the bot list and Crawl-delay values, so those have to change as well.

The fix this surfaces, gating platform-locked checks on the detected platform, is one of the highest-leverage refinements you can make to any audit tool. False-positive fatigue is the single biggest reason teams stop trusting an audit. Every warning that comes back saying "publish /humans.txt" on a platform that physically cannot host /humans.txt erodes trust in every other finding the tool produces.
