# How to Build a Self-Healing Lead Pipeline with the Google Places API

Step-by-step build of an automated lead generation pipeline using Claude Code, Google Places API, Node.js, and Instantly.ai. Includes the prompt, the code, and the full import walkthrough.

Author: J.A. Watte
Published: April 29, 2026
Source: https://jwatte.com/blog/wcag-lead-scraper-google-places-api/

---

I needed a lead list. Not a thousand names from some reseller database that's been emailed to death. I needed fresh, structured contact data for small businesses across the US, sorted by industry and city, with real phone numbers and working websites.

I built the entire pipeline in one sitting with Claude Code and the Google Places API. It pulled 67,000+ unique businesses in under an hour. Cost me less than $15 in API calls.

Here's exactly how it works and how to build your own.

## What you get out of this

The pipeline queries Google's Places API for businesses by category and city. For each result you get back a name, website, phone number, full address, and category type. No scraping HTML. No fighting CAPTCHAs. No getting blocked by Cloudflare.

Then a separate script visits each website to pull contact emails off their /contact and /about pages. Then a CSV builder formats everything for import into whatever outreach tool you use (I use Instantly.ai).

You end up with a CSV that has: email, first name, last name, company name, website, phone, address, city, state, and category. Ready to load and send.
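
A minimal sketch of what that CSV step could look like. The column order comes from the list above; the function names `csvField` and `toCsv` and the RFC 4180-style quoting are my assumptions, not the production `build-csv.mjs`.

```javascript
// Column order from the post. Everything else here is illustrative.
const COLUMNS = ['email', 'first_name', 'last_name', 'company_name',
  'website', 'phone', 'address', 'city', 'state', 'category'];

// Quote a field only when it contains a comma, a quote, or a line break.
function csvField(value) {
  const s = value == null ? '' : String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Build the whole file as one string: header row first, then one row per lead.
function toCsv(rows) {
  const lines = [COLUMNS.join(',')];
  for (const row of rows) {
    lines.push(COLUMNS.map(col => csvField(row[col])).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}
```

Business names are the field most likely to contain commas or quotes ("Smith, Jones & Co."), which is why the quoting is worth getting right before an import tool silently shifts your columns.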

## The stack

- **Google Places API (New):** Text Search endpoint. You send a natural language query like "dentist in Boise Idaho" and get back up to 20 structured results.
- **Node.js:** Plain scripts. One file per step. No framework.
- **Puppeteer:** Headless browser that visits each business website looking for email addresses.
- **Claude Code:** Wrote every file, tested it, hit dead ends with other approaches, pivoted to Places API, and ran the whole thing.

## What you need before starting

A Google Cloud account with the Places API enabled and an API key. Google gives every project $200/month in free credits. The Text Search endpoint costs a few cents per call depending on which fields you request. Each call returns up to 20 businesses.

At those rates, 3,500 API calls (roughly what this pipeline makes across all cities and categories) cost well under $15.

## The prompt

This is the exact prompt I gave Claude Code:

```
Build me a Node.js pipeline that uses the Google Places API Text Search
endpoint to find businesses across US cities and export them as a CSV
for Instantly.ai cold email campaigns.

Requirements:
- Search 50+ US cities across CA, NY, FL, TX, ID, OR, WA, CO, AZ, NV,
  UT, TN, GA, NC and others
- Categories: dentists, chiropractors, plumbers, electricians, HVAC,
  roofers, landscapers, law firms, accountants, restaurants, salons,
  auto repair, pet groomers, gyms, daycares, and any other high-value
  SMB segments
- Exclude: digital marketing agencies and realtors
- Store results in a JSON database with deduplication by website URL
- Include a Puppeteer-based email finder that visits each site and
  scrapes contact/about pages for email addresses
- Output CSV with columns: email, first_name, last_name, company_name,
  website, phone, address, city, state, category
- Use env var GOOGLE_PLACES_API_KEY (never hardcode)
- Rate limit at 150-200ms between API calls
- Save progress every 25 queries for resumability
```

Claude Code tried web scraping first (Google, Bing, YellowPages, DuckDuckGo). Every one got blocked because the session was running from a cloud IP. It pivoted to the Places API on its own and had a working pipeline within minutes. That's the useful part of AI-assisted development here. It hit walls and routed around them without me needing to debug each failure.

## The API call

```javascript
const res = await fetch('https://places.googleapis.com/v1/places:searchText', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Goog-Api-Key': process.env.GOOGLE_PLACES_API_KEY,
    'X-Goog-FieldMask': 'places.displayName,places.formattedAddress,places.websiteUri,places.nationalPhoneNumber,places.types'
  },
  body: JSON.stringify({
    textQuery: 'dentist in Boise Idaho',
    pageSize: 20,
    languageCode: 'en'
  })
});
```

The `X-Goog-FieldMask` header controls what you get back and what you pay for. Only request the fields you actually need. The response is clean JSON with structured data. No HTML parsing required.
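
To turn that response into lead records, something like the following would do. This is a sketch, not the production scraper. The input field names match the field mask in the request above; the output keys (`name`, `website`, and so on) and the function names are my own.

```javascript
// Maps one entry of the response's `places` array to a flat record.
// Output key names are hypothetical, not the author's schema.
function toBusiness(place, category) {
  return {
    name: place.displayName?.text ?? '',       // displayName is an object with a `text` property
    address: place.formattedAddress ?? '',
    website: place.websiteUri ?? null,         // many listings have no website
    phone: place.nationalPhoneNumber ?? null,
    types: place.types ?? [],
    category,                                  // the search category, carried through for the CSV
  };
}

// `json` is the parsed body of the searchText response, e.g. `await res.json()`.
function parseSearchResponse(json, category) {
  // Guard against a response that has no `places` key at all.
  return (json.places ?? []).map(place => toBusiness(place, category));
}
```

Records with `website: null` are worth keeping out of the email finder's queue, since there is nothing for Puppeteer to visit.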

## Categories I used

I started with the obvious (dentists, plumbers, lawyers) and kept expanding until I'd covered every SMB segment I could think of that has a public-facing website.

Medical: dermatology, orthopedic, mental health, physical therapy, oral surgeons, allergists, cardiologists, sports medicine, acupuncture, massage therapists.

Trades: pool service, fence companies, tree service, carpet cleaning, pressure washing, concrete, painting, drywall, flooring, cabinet makers, solar installers, foundation repair, handymen.

Food and beverage: breweries, wineries, food trucks, juice bars, catering, butcher shops, every type of restaurant.

Fitness: CrossFit, pilates, personal trainers, dance studios, boxing, rock climbing, tennis clubs, trampoline parks.

Professional: architects, interior designers, IT support, bookkeepers, tax preparers, notaries, translators, every type of lawyer.

Education: driving schools, language schools, coding bootcamps, music teachers, pottery studios, summer camps.

Auto: body shops, towing, detailing, RV dealers, motorcycle dealers, transmission repair, custom car shops.

Entertainment: escape rooms, bowling, arcades, comedy clubs, laser tag, go-karts, paintball, museums, live music venues.

I excluded digital marketing agencies (they already know what you're selling) and realtors (totally different buying cycle). Everyone else is fair game.

## What I actually got

After running all segments across 50+ cities in 14 states:

- 67,000+ unique businesses with websites
- 97% have phone numbers from Google directly
- 80+ business categories
- About 15-18 new unique leads per API call after deduplication

The 14 states I focused on: CA, NY, FL, TX, ID, OR, WA, CO, AZ, NV, UT, TN, GA, NC. You can add any state or city by editing one array in the code.
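
Here is roughly how that array can drive the whole run. It's a sketch under assumptions: the two-entry `CITIES` and `CATEGORIES` lists stand in for the real ones, and `buildQueries` and `pendingQueries` are hypothetical names. The query text follows the "dentist in Boise Idaho" shape from the API call above.

```javascript
// Stand-ins for the real lists. Add a city or category by adding an entry.
const CITIES = [
  { city: 'Boise', state: 'Idaho' },
  { city: 'Austin', state: 'Texas' },
];
const CATEGORIES = ['dentist', 'plumber'];

// One Text Search query per category and city pair.
function buildQueries(categories, cities) {
  const queries = [];
  for (const category of categories) {
    for (const { city, state } of cities) {
      queries.push({ category, city, state, textQuery: `${category} in ${city} ${state}` });
    }
  }
  return queries;
}

// Resumability: skip any query a previous run already recorded as done.
function pendingQueries(queries, doneTextQueries) {
  const done = new Set(doneTextQueries);
  return queries.filter(q => !done.has(q.textQuery));
}
```

If you save the list of finished query strings alongside the leads every 25 queries, a restart only has to call `pendingQueries` to pick up where it stopped.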

## The email problem

Google gives you names, websites, and phones. Not emails. If you're doing email outreach, you still need to get those somewhere.

The pipeline includes a Puppeteer script that visits each site and looks for email addresses on /contact, /about, and the homepage. It finds them for maybe 30-40% of leads. For the rest, you can run them through Apollo.io (10K free lookups per month) or Hunter.io ($49/month) to append emails.
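
The extraction itself is a regex over the page source. A minimal sketch, assuming you already have the HTML as a string (in Puppeteer, `await page.content()`); the pattern and the name `extractEmails` are mine, not the production finder's.

```javascript
// Deliberately loose: anything shaped like name@domain.tld.
const EMAIL_RE = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;

// Returns the unique, lowercased matches in the order they first appear.
function extractEmails(html) {
  const found = new Set();
  for (const match of html.match(EMAIL_RE) ?? []) {
    found.add(match.toLowerCase());
  }
  return [...found];
}

// The second result shows why raw matches still need filtering:
// an image filename looks exactly like an address to the regex.
const sampleHtml = '<a href="mailto:Info@JoesPlumbing.com">Email us</a> <img src="hero@2x.webp">';
console.log(extractEmails(sampleHtml));
```

A pattern this loose is the point. It is cheap to over-match on the page and throw the junk away afterwards.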

The phone numbers are still useful on their own if you're doing calls or SMS alongside email.

## Deduplication

You'll hit duplicates. A business in Meridian, Idaho shows up when you query both "dentist in Meridian ID" and "dentist in Boise ID" because Google returns results within a radius, not strictly within city limits.

The database layer catches these by normalizing every URL to its origin and checking against a Set:

```javascript
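function normalizeUrl(u) {
  if (!u) return null;
  u = u.trim().toLowerCase();
  if (!/^https?:\/\//.test(u)) u = 'https://' + u;
  try {
    // Collapse http/https and www/non-www so both variants share one key.
    return 'https://' + new URL(u).hostname.replace(/^www\./, '');
  } catch { return null; }
}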

export function addBusinesses(db, list) {
  const existing = new Set(db.businesses.map(b => b.website));
  let added = 0;
  for (const biz of list) {
    const url = normalizeUrl(biz.website);
    if (!url || existing.has(url)) continue;
    if (isExcluded(biz)) continue;
    db.businesses.push({ ...biz, website: url });
    existing.add(url);
    added++;
  }
  return added;
}
```

Three things happening here:

1. Every URL gets stripped to origin only. `https://example.com/contact` and `http://www.example.com` both become `https://example.com`. That catches most of the near-duplicates.
2. The Set gives you O(1) lookups instead of scanning the full array for each new lead.
3. New URLs get added to the Set immediately, so duplicates within the same batch also get caught.

The exclusion filter checks the business name and category against a short keyword list. If something matches "digital marketing" or "keller williams" it gets dropped before it ever hits the database.
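
A sketch of what that `isExcluded` check might look like. Only "digital marketing" and "keller williams" come from the real list; the other keywords, and the `name`, `category`, and `types` field names, are assumptions for illustration.

```javascript
// First two keywords are from the post. The rest are placeholders.
const EXCLUDE_KEYWORDS = ['digital marketing', 'keller williams', 'realtor', 'real estate'];

function isExcluded(biz) {
  // Fold name, category, and Google's type tags into one lowercase string.
  // Type tags use underscores ("real_estate_agency"), so swap those for spaces.
  const haystack = [biz.name, biz.category, ...(biz.types ?? [])]
    .filter(Boolean)
    .join(' ')
    .toLowerCase()
    .replace(/_/g, ' ');
  return EXCLUDE_KEYWORDS.some(keyword => haystack.includes(keyword));
}
```

Substring matching is blunt. It will also drop a plumber named "Realtor's Choice Plumbing", which is usually an acceptable loss at this volume.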

## Importing into Instantly.ai

Once you've got your CSV, here's how to get it loaded and sending.

**Upload:** Go to Lead Management, click Upload Leads, drop your CSV.

**Map columns:** Instantly asks you to match your CSV headers to its fields. Here's the mapping:

| CSV Column | Instantly Field |
|---|---|
| email | Email |
| first_name | First Name |
| last_name | Last Name |
| company_name | Company Name |
| website | Website |
| phone | Phone |
| city | Custom Variable 1 |
| state | Custom Variable 2 |
| category | Custom Variable 3 |

If you have phone-only leads (no email yet), you can still upload them. Use Instantly's built-in enrichment or run them through Apollo first to append emails before activating the campaign.

**Assign to a campaign:** Pick which campaign these leads belong to. Instantly deduplicates against leads already in your workspace, so if you upload in batches you won't double-send.

**Write your email:** The custom variables (city, state, category) let you personalize without sounding templated. You can reference their specific industry, their city, or their website directly in the email body using {% raw %}`{{company_name}}`, `{{website}}`, `{{category}}`{% endraw %}, etc.

Whatever you're selling, the first email should give them something useful before asking for anything. A free audit result, a specific observation about their site, a relevant stat about their industry. Make it about them, not about you.

**Warm up first:** If you're sending from new domains, warm them for at least 14 days before scaling. Instantly handles this automatically. Start at 20-30 sends per mailbox per day and scale to 50-80 once warmup is done. Don't skip this step or you'll land in spam on day one.

## The parts the simple version leaves out

The pipeline I described above works. But when you're visiting tens of thousands of websites with Puppeteer, three things will break that a naive implementation doesn't handle: the email finder will stall on bad sites, your email list will be full of junk addresses, and the whole process will die at 3am with nobody watching. Here's how the production version solves each one.

## The watchdog: self-healing process manager

Puppeteer visits tens of thousands of websites. Some of those sites hang forever. The browser tab opens, starts loading, and never fires the load event. Without a watchdog, your entire pipeline stops on one bad site and you don't know until you check hours later.

The per-page timeout helps but doesn't catch everything. Puppeteer itself can hang at the protocol level. The Node process can leak memory after thousands of pages and slow to a crawl. The watchdog sits outside all of that and asks one simple question: is progress still being made?

```javascript
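// email-finder-watchdog.mjs (simplified structure, Windows kill command)
import { spawn, execSync } from 'child_process';
import fs from 'fs';

let child, lastCount = -1, lastProgress = Date.now();

function respawn() {
  // The production version also calls this right away if the child exits on its own.
  child = spawn('node', ['find-emails.mjs'], { stdio: 'inherit' });
  lastProgress = Date.now(); // give the new process a full stall window
}
respawn();

setInterval(() => {
  let db;
  try { db = JSON.parse(fs.readFileSync('leads.json', 'utf8')); }
  catch { return; } // torn read mid-write: skip this tick, it is not a stall
  const emailCount = db.businesses.filter(b => b.emails?.length).length;

  if (emailCount > lastCount) {
    lastCount = emailCount;
    lastProgress = Date.now(); // progress was made: reset the stall clock
  } else if (Date.now() - lastProgress > 120_000) {
    // Stalled for 2 minutes. Kill the full process tree and restart.
    try { execSync(`taskkill /PID ${child.pid} /T /F`); } catch { /* already exited */ }
    respawn();
  }
}, 60_000);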
```

Every 60 seconds: is the child process alive? Has the email count gone up in the last two minutes? If the process died, restart it immediately. If it stalled, kill the process tree and restart.

The process tree kill is critical on Windows. Puppeteer spawns Chromium as a child process. If you only kill the Node parent, the Chromium processes become orphans eating RAM. `taskkill /T /F` kills the parent and every descendant. On Linux or Mac, use process group kill instead.
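
For the Linux or Mac case, a hedged sketch of a process group kill. It assumes the finder was started with `detached: true` so it leads its own group. `killProcessGroup` is a hypothetical helper, and the injectable `kill` argument is only there so the function can be exercised without a real process.

```javascript
// Assumes the child was started in its own process group, for example:
//   spawn('node', ['find-emails.mjs'], { detached: true, stdio: 'inherit' })
// On Linux and macOS, `detached: true` makes the child a group leader.
function killProcessGroup(pid, kill = process.kill.bind(process)) {
  try {
    // A negative pid signals every process in the group, not just the leader.
    kill(-pid, 'SIGKILL');
    return true;
  } catch (err) {
    if (err.code === 'ESRCH') return false; // the group is already gone
    throw err;
  }
}
```

One caveat worth checking on your own machine: a descendant that starts its own process group will not be reached this way, and some browser launchers do exactly that. Run `ps -o pid,pgid,command` while the finder is working to confirm Chromium shares the group before you rely on it.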

The database has all the progress saved, so the new process picks up exactly where the old one died. You lose at most a couple minutes of work per restart.

Two things the production watchdog learned the hard way. First, the database is a single JSON file the finder rewrites constantly, so the watchdog will sometimes read it mid-write and get a parse error. Treat that as "no reading this tick" and skip it, not as a stall. If you count a torn read as a stall, you restart a perfectly healthy process. Second, base the stall check on a counter that moves on every kind of progress, not just on emails found. If you only count emails, a long run of sites that have no email looks identical to a hang, and the watchdog kills a process that was doing fine. There is no backoff and no restart cap here, on purpose. The database makes every restart cheap. But it does mean a bad stall signal can quietly stack up hundreds of needless restarts before you notice.
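
Both lessons fit in a small stall detector. This is a sketch, not the production watchdog. `emails` is the field the watchdog above already reads; `skipReason` is a hypothetical field name for "visited, nothing usable"; `createStallDetector` and `progressCount` are names I made up.

```javascript
// Counts every site the finder has finished with, whether or not it found an email.
function progressCount(db) {
  return db.businesses.filter(b => Array.isArray(b.emails) || Boolean(b.skipReason)).length;
}

function createStallDetector(stallMs) {
  let lastCount = -1;
  let lastProgressAt = null;
  // rawJson: file contents as a string. now: current time in ms.
  return function check(rawJson, now) {
    let db;
    try {
      db = JSON.parse(rawJson);
    } catch {
      return 'skip'; // torn read: no reading this tick, and not a stall
    }
    const count = progressCount(db);
    if (lastProgressAt === null || count > lastCount) {
      lastCount = count;
      lastProgressAt = now;
      return 'ok';
    }
    return now - lastProgressAt > stallMs ? 'stalled' : 'ok';
  };
}
```

Passing the time in as an argument keeps the detector easy to test, and the three return values map directly onto what the watchdog does next: nothing, nothing, or kill and restart.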

## Browser recycling

After 5 consecutive site errors (timeouts, navigation failures, crashed tabs), the email finder kills the entire Puppeteer browser instance and launches a fresh one. This clears any corrupted browser state, leaked memory, or stuck connections that accumulated over thousands of page visits.

Each site also gets a hard 60-second timeout across all pages combined. If a site's homepage, /contact, and /about pages together take longer than 60 seconds, skip the site and record the skip reason in the database so it's never retried.
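
The bookkeeping for both rules is small. A sketch with hypothetical names (`createErrorStreak`, `createSiteBudget`); the thresholds are the 5 errors and 60 seconds described above.

```javascript
// Tracks consecutive site failures. Any success resets the streak.
function createErrorStreak(limit = 5) {
  let streak = 0;
  return {
    // Call once per site. Returns true when the browser should be relaunched.
    record(ok) {
      streak = ok ? 0 : streak + 1;
      if (streak >= limit) {
        streak = 0; // the fresh browser starts with a clean slate
        return true;
      }
      return false;
    },
  };
}

// One shared deadline for a site's homepage, /contact, and /about together.
function createSiteBudget(budgetMs = 60_000, now = Date.now) {
  const deadline = now() + budgetMs;
  return {
    remaining: () => Math.max(0, deadline - now()),
    expired: () => now() >= deadline,
  };
}
```

In the finder you would check `budget.expired()` before each page and pass `budget.remaining()` as the `timeout` option to `page.goto`. Do the expiry check first, because a timeout of 0 disables the limit in Puppeteer rather than failing immediately.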

## MX validation: why regex email matches aren't enough

A regex that finds `user@something.tld` on a web page doesn't mean that address receives mail. Common false positives that show up in scraped data:

- **Template placeholders** left in by web developers: `user@domain.com`, `info@mysite.com`
- **Font license strings** parsed as email: `name@latofonts.com`, `designer@indiantypefoundry.com`
- **Image filenames** that match the email pattern: `hero@2x.webp`
- **Parked or expired domains** that used to have a website but no longer have a mail server
- **Typos** on the business's own contact page: `info@bussiness.com`

An MX lookup takes milliseconds and tells you whether the domain has a mail server configured. If it doesn't, the message hard bounces every time. Removing these before they reach your sending platform is what protects your sender reputation.

Here is the part worth getting right, because the first version of this got it wrong. Check MX records only. Do not fall back to an A record.

```javascript
import dns from 'dns/promises';

const mxCache = new Map();

async function hasMX(domain) {
  if (mxCache.has(domain)) return mxCache.get(domain);
  try {
    const records = await dns.resolveMx(domain);
    const valid = records.length > 0;
    mxCache.set(domain, valid);
    return valid;
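  } catch (err) {
    // No A-record fallback. A domain that resolves but has no MX
    // cannot receive mail, and accepting it produces hard bounces.
    // Cache only definitive "no such record" answers, so a DNS
    // timeout is retried instead of condemning the domain for the run.
    if (err.code === 'ENOTFOUND' || err.code === 'ENODATA') mxCache.set(domain, false);
    return false;
  }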
}
```

RFC 5321 does say a domain with no MX record can receive mail at its A record, so an A-record fallback looks technically correct. In practice it is a trap for cold sending. The early version of this pipeline fell back to A records, which removed almost nothing from a twenty-thousand-row export and let through a wave of addresses that bounced anyway. A campaign got paused for the bounce rate. MX-only validation is stricter, and it is the version that keeps your reputation intact.

Caching MX results matters because many emails share a domain. If you find fifty leads at gmail.com, you look up gmail.com once.

## The domain blocklist

The email regex matches anything that looks like `name@domain.tld` in the HTML source. That includes CSS `@font-face` declarations referencing foundry URLs, JavaScript analytics libraries with contact strings in their source, JSON-LD structured data with schema.org references, SaaS widget configuration objects, and WordPress theme attribution comments.

Without a blocklist, 5-10% of your "emails" are scraped from code, not from contact pages. These will all bounce.

The blocklist I use rejects emails from 50+ known junk domains, organized by category:

- **Template placeholders:** domain.com, email.com, mysite.com, company.com, yoursite.com
- **Font/CSS files parsed as emails:** latofonts.com, indiantypefoundry.com, sansoxygen.com, typekit.net, fonts.com
- **Platform domains:** wix.com, squarespace.com, shopify.com, wordpress.com, godaddy.com, weebly.com
- **SaaS widgets:** keen.io, intercom.io, hubspot.com, zendesk.com, crisp.chat, mailchimp.com
- **Social/big tech:** facebook.com, google.com, youtube.com
- **Payment:** stripe.com, paypal.com

It also rejects emails where the domain ends in an asset extension (.png, .jpg, .svg, .woff, .woff2, .css, .js), emails with 5+ consecutive digits in the local part, and any address with a local part over 40 characters or a domain over 40 characters.
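
Put together, the rules above come out to something like this. It's a sketch: the domain set is a small sample of the list above rather than all 50+, the subdomain matching is my assumption, and `isJunkEmail` is a hypothetical name.

```javascript
// A sample of the blocklist from the post, not the full list.
const BLOCKED_DOMAINS = ['domain.com', 'email.com', 'mysite.com', 'latofonts.com',
  'typekit.net', 'wix.com', 'intercom.io', 'facebook.com', 'stripe.com'];
const ASSET_EXT = /\.(png|jpg|svg|woff2?|css|js)$/;

function isJunkEmail(email) {
  const at = email.lastIndexOf('@');
  if (at < 1 || at === email.length - 1) return true; // no local part or no domain
  const local = email.slice(0, at).toLowerCase();
  const domain = email.slice(at + 1).toLowerCase();
  if (local.length > 40 || domain.length > 40) return true;
  if (/\d{5,}/.test(local)) return true;              // 5+ consecutive digits
  if (ASSET_EXT.test(domain)) return true;            // filename mistaken for an address
  // Assumption: a subdomain of a blocked domain is blocked too.
  return BLOCKED_DOMAINS.some(d => domain === d || domain.endsWith('.' + d));
}
```

The checks run cheapest first, so most junk is rejected before the blocklist scan.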

## Clean the list in stages before it ever sends

The blocklist runs while you scrape. After the scrape, run the export through a short pipeline that catches what extraction cannot see on a single page. Three stages, each a small script, each writing a new CSV.

**Stage one, raw export.** Pull every record that has a validated email out of the database into a CSV. No filtering yet, just the columns your sending tool wants.

**Stage two, the bounce-prevention pass.** This is where most of the junk goes. It drops:

- **Free webmail.** Gmail, Yahoo, Outlook, iCloud, and the rest. A cold message to a personal webmail address is both a deliverability risk and the wrong first touch for a business.
- **Cross-domain contaminants.** An email whose domain is not the site you scraped it from. If you crawl `joesplumbing.com` and pull `support@some-booking-widget.com` off an embedded scheduler, that address belongs to the vendor, not the business. Match the email domain against the site's own domain and drop anything that does not line up.
- **Compliance-only roles.** Addresses like `privacy@`, `postmaster@`, `abuse@`, `jobs@`, and `press@` exist for reasons that are not your outreach. They go.
- **Placeholders and template leftovers.** `john.doe@`, `firstname@`, `name@example.com`, and friends.
- **Duplicates**, by lowercased email.

It also blanks the first name on generic inboxes like `info@` and `sales@` rather than dropping the row, so your greeting falls back to something neutral instead of "Hi Info."
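
A compact sketch of that pass, assuming each row carries `email`, `first_name`, and `website`. The lists are short samples drawn from the bullets above, and the strict same-domain comparison is an assumption: the production `clean-csv.mjs` may be more lenient about what "lines up".

```javascript
const FREE_WEBMAIL = new Set(['gmail.com', 'yahoo.com', 'outlook.com', 'icloud.com']);
const COMPLIANCE_ROLES = new Set(['privacy', 'postmaster', 'abuse', 'jobs', 'press']);
const PLACEHOLDER_LOCALS = new Set(['john.doe', 'firstname', 'name']);
const GENERIC_INBOXES = new Set(['info', 'sales']);

// Hostname of the scraped site without a leading "www.".
function bareHost(website) {
  try {
    return new URL(website).hostname.replace(/^www\./, '');
  } catch {
    return null; // unparseable site URL: nothing can match it
  }
}

function cleanRows(rows) {
  const seen = new Set();
  const kept = [];
  for (const row of rows) {
    const email = row.email.trim().toLowerCase();
    const [local, domain] = email.split('@');
    if (FREE_WEBMAIL.has(domain)) continue;
    if (domain !== bareHost(row.website)) continue; // cross-domain contaminant
    if (COMPLIANCE_ROLES.has(local)) continue;
    if (PLACEHOLDER_LOCALS.has(local)) continue;
    if (seen.has(email)) continue;                  // duplicate by lowercased email
    seen.add(email);
    // Keep generic inboxes, but blank the name so the greeting stays neutral.
    kept.push({ ...row, email, first_name: GENERIC_INBOXES.has(local) ? '' : row.first_name });
  }
  return kept;
}
```

Because the function returns a new array and never mutates its input, you can diff the before and after files to see exactly what each rule removed.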

**Stage three, the live check.** Before the list goes anywhere near a campaign, run the survivors through a parallel DNS pass: confirm each domain still has an MX record right now, drop disposable-mailbox providers, and drop the throwaway TLDs that spammers churn through. This stage also flags any domain that shows up more than five times, which usually means a franchise or a multi-location chain you want to throttle rather than blast.
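
The concentration flag is the easiest part of that stage to show. A sketch with a hypothetical function name; the threshold is the "more than five" from above.

```javascript
// Returns every domain that appears more than `threshold` times in the list.
function concentratedDomains(emails, threshold = 5) {
  const counts = new Map();
  for (const email of emails) {
    const domain = email.slice(email.lastIndexOf('@') + 1).toLowerCase();
    counts.set(domain, (counts.get(domain) ?? 0) + 1);
  }
  // "More than five times", so strictly greater than the threshold.
  return [...counts]
    .filter(([, count]) => count > threshold)
    .map(([domain, count]) => ({ domain, count }));
}
```

The output is a warning list, not a drop list. What you do with a flagged chain (throttle it, pick one location, or skip it) is a judgment call per campaign.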

The reason for the stages is that each one is cheap and each one catches a different class of bad address. Extraction filters what is on the page. The clean pass filters what is structurally wrong. The live check filters what changed since you scraped. Skip any one of them and the bounces show up in the stage you skipped.

## Writing the first email

Your lead list is only as useful as what you send. The pattern that works for cold outreach to small businesses is the same no matter what you sell: lead with one specific, real thing you found on their site, give it away, and ask for nothing on the first touch.

The thing you found is where a free audit earns its keep. Run each prospect's site through the [Mega Analyzer](/tools/mega-analyzer/) (free, no signup) and it hands you a concrete finding to open with. It checks the things a small business actually cares about and rarely looks at:

- **Google Business Profile and local presence:** reviews, hours, category, map signals
- **Page speed and Core Web Vitals:** how fast the site loads on a phone
- **SEO and structured data:** titles, meta, schema, and what shows up in search
- **AI-search readiness:** whether ChatGPT, Perplexity, and Google's AI answers can actually read the site
- **Accessibility:** the issues that also trigger ADA demand letters
- **Security headers and basic site health**

Pick the one finding that lands hardest for that business and build the email around it. Here is the shape, with a Google Business Profile angle as the lead:

{% raw %}
```
Subject: Found something on {{company_name}}'s website you should know about

Hey {{first_name}},

I ran {{website}} through a free site checker and one thing jumped out.
Your Google Business Profile is showing 3 reviews, and the nearest
{{category}} business to you in {{city}} is sitting at 47. That gap is
usually the difference between showing up in the local map results and
getting skipped.

A few smaller things came up too, nothing scary, mostly quick fixes.

I put the results in a short report. Happy to send it over if you want
to look. No cost, no follow-up from me unless you reply.
```
{% endraw %}

Six things make this work:

1. **Specific, not generic.** It references their company name, website, category, and city. Merge fields make it feel written for them even at scale.
2. **Value first.** The whole email hands them something useful, a real finding about their own site, before asking for anything. The offer is a free report, not a sales call.
3. **No tracking.** No pixel, no link shortener. The email says plainly "no follow-up from me." That builds trust and improves deliverability, because Gmail penalizes tracked mail.
4. **Single send.** One email, not a four-step drip. If they don't reply, they don't hear from you again. That respects their time and keeps your complaint rate low.
5. **Role-based first touch.** Send to info@, contact@, hello@, and office@ first. Those are public inboxes the business chose to publish. Don't cold-email a personal work address on the first touch.
6. **Real numbers.** "3 reviews against 47" is concrete and checkable. Specific numbers get opened. Round claims like "you're losing customers" get deleted.

Swap the finding to match the business and what you offer. The structure does not change, only the first paragraph:

{% raw %}
```
Subject: Quick question about {{company_name}}'s website

Hey {{first_name}},

I was looking at {{website}} and noticed a couple of things that might
be costing you customers. [INSERT 1-2 SPECIFIC FINDINGS FROM THE AUDIT,
e.g. "your site takes about 8 seconds to load on a phone, so roughly
half your visitors leave before it finishes" or "your pages have no
structured data, so search engines and AI answers can't tell what you
do or where you are."]

Most {{category}} businesses in {{city}} have the same gaps. Nobody
thinks about it until it starts showing up in the numbers.

If you want, I can send over a breakdown of what I found. Takes five
minutes to read and might save you some headaches. No strings.
```
{% endraw %}

Accessibility is one more finding you can lead with, and it carries real legal weight. The same scan flags the missing image descriptions, low contrast, and keyboard traps behind ADA demand letters, and there were more than 4,000 federal website accessibility lawsuits filed in 2023 with the count still climbing. Use whichever finding the business will feel first.

One more move: point them at the tool directly. Plenty of small businesses will run the [Mega Analyzer](/tools/mega-analyzer/) on their own site once they know it exists, which is exactly why it is free and asks for no signup.

## Run it yourself

Ten files in the production version:

- `scrape-businesses.mjs`: Google Places API scraper (main, all categories)
- `segment-runner.mjs`: Run one industry segment at a time
- `find-emails.mjs`: Puppeteer email scraper with MX validation
- `email-finder-watchdog.mjs`: Self-healing process manager
- `validate-mx.mjs`: Standalone MX revalidation for existing data
- `build-csv.mjs`: CSV exporter (supports `--emails-only` and `--output=filename.csv`)
- `clean-csv.mjs`: The bounce-prevention pass (free webmail, cross-domain contaminants, role addresses, dupes)
- `validate-for-instantly.mjs`: The live DNS check (parallel MX, disposable filter, domain-concentration warning)
- `db.mjs`: JSON database with deduplication
- `leads.json`: All data (businesses + emails + audit state)

Set your `GOOGLE_PLACES_API_KEY` environment variable, run `node scrape-businesses.mjs`, wait about 10 minutes, then run `node email-finder-watchdog.mjs` to let the email finder work through every site with automatic recovery. When it's done, run `node build-csv.mjs --emails-only` to export the validated leads, then `node clean-csv.mjs` for the bounce-prevention pass, then `node validate-for-instantly.mjs` for the live DNS check. The CSV that last step writes is the one you actually load into a campaign. Run `node build-csv.mjs` without the flag any time you want the full list with phone numbers for calls or SMS.

If you want to run one industry at a time (useful for testing or if you only care about certain verticals), use the segment runner: `node segment-runner.mjs medical` or `node segment-runner.mjs trades`.

For a comparison of other data sources you can plug into this same pipeline structure, see [7 Ways to Build a Local Business Lead List Without Buying One](/blog/lead-generation-api-methods-compared/). Once you have the list, [How to Send Email That Actually Gets Delivered](/blog/blog-email-infrastructure-small-business/) covers the domain and warmup setup, and [How to Send Your Own Email Digest From Serverless Functions](/blog/self-hosted-email-digest-serverless/) shows how to run the sending side yourself instead of renting it.

---

*This post describes a technical process, not legal or compliance advice. Every platform you touch (Google Cloud, Instantly, Apollo, your email provider) has its own terms of service and acceptable use policies. Read them. Adapt your approach to stay within those terms. Anti-spam laws (CAN-SPAM, GDPR, CASL, state privacy statutes) vary by jurisdiction and change regularly. What's described here worked for my circumstances. Yours may differ. Do your own due diligence before sending anything.*

---

## Fact-check notes and sources

- Google Places API pricing: $200/month free credit confirmed as of April 2026. Text Search (Pro tier) is billed per request, at a rate that depends on the fields requested. [Google Maps Platform Pricing](https://developers.google.com/maps/billing-and-pricing/pricing)
- Google Places API returns a maximum of 20 results per Text Search request as documented in the [Places API reference](https://developers.google.com/maps/documentation/places/web-service/text-search).
- ADA Title III lawsuit data referenced for context: 4,605 federal website accessibility lawsuits filed in 2023 per UsableNet's annual report.
- [RFC 5321 Section 5.1](https://www.rfc-editor.org/rfc/rfc5321#section-5.1) does define implicit MX: a domain with no MX record may receive mail at its A record. This post argues against relying on that for cold sending, because those domains bounce at high rates in practice even though they are valid under the spec.
- CAN-SPAM physical address requirement: [FTC CAN-SPAM Act compliance guide](https://www.ftc.gov/business-guidance/resources/can-spam-act-compliance-guide-business).

