What AI Coding Agents Leave Out (and How to Put It in Your Prompt)

In late 2025 a security firm pointed five of the most popular AI coding tools at the same job and let them build. Claude Code, OpenAI Codex, Cursor, Replit, and Devin produced 15 working applications. Across all 15, the count of apps that shipped basic security headers was zero. The count with CSRF protection was zero. Exactly one attempted rate limiting. The total vulnerability count came to 69.

That is the thing to sit with before you hand an agent a real build. It is not that the agents are bad at code. The apps worked. They did what was asked. The problem is everything that wasn't asked, because an agent treats your prompt as the whole specification, and "build me a tenant portal with payments" does not contain the words "rate limiting" or "Content-Security-Policy."

I recently wrote a long, deliberately prescriptive build prompt for a billing and payments app, and the more I worked on it the more I realized the prompt was mostly a list of things the agent would skip if I stayed quiet. Here is the generalized version of that list, the considerations worth putting in writing before any agent starts typing.

It builds the feature and stops

This is the core behavior, and it shows up in study after study. One analysis of 470 real pull requests found that AI-generated code carried roughly 2.74 times more security vulnerabilities than human-written code. A study by NYU researchers found that about 40% of the Copilot-generated programs it tested contained a vulnerability. The agent implements the requested feature competently and then stops, because from its point of view the feature is the deliverable.

So you have to make the hardening part of the feature. If you want rate limiting on the login route, say so. If you want security headers, name them: HSTS, X-Content-Type-Options, X-Frame-Options or a CSP frame-ancestors directive, Referrer-Policy, a real Content-Security-Policy. If you want object-level authorization, the kind where changing an ID in a URL can't expose someone else's record, write that requirement out and tell the agent to enforce it on the server on every request, not by hiding a button in the UI. None of this appears on its own.
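To make the header list concrete, here is a minimal sketch in Python. It is framework-agnostic on purpose: responses are modeled as plain dicts, `with_security_headers` and `missing_security_headers` are hypothetical helper names, and the header values are illustrative starting points I am assuming, not values from the research. The CSP in particular needs tuning to your own app.

```python
# Illustrative defaults (assumptions, not a standard). Tune per application.
SECURITY_HEADERS = {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Content-Security-Policy": "default-src 'self'; frame-ancestors 'none'",
}


def with_security_headers(headers: dict) -> dict:
    """Return a copy of the response headers with the hardening headers set.

    Hardening values overwrite any same-named header already present.
    """
    merged = dict(headers)
    merged.update(SECURITY_HEADERS)
    return merged


def missing_security_headers(headers: dict) -> list:
    """List the expected hardening headers absent from a response.

    Header names are compared case-insensitively, as HTTP requires.
    """
    present = {name.lower() for name in headers}
    return [name for name in SECURITY_HEADERS if name.lower() not in present]
```

The second helper is the more useful one in practice: pointed at the headers of a real response, it turns "did the agent add the headers?" into a check that either returns an empty list or names what is missing.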

A clean static scan does not mean it's secure

Here is the trap that catches people who do try to check their agent's work. You run a static scanner, it comes back clean, and you ship. In the same research where the apps lacked headers and rate limiting, static analysis tools reported nothing wrong. SAST is good at spotting an injection pattern in a line of code. It is blind to the absence of a control. A wide-open CORS policy, an exposed API docs page, a missing rate limiter, an authorization check that was never written: these are things that aren't there, and you cannot grep for a thing that isn't there.

The fix is to test behavior, not just code. Ask the agent to build a small exploit suite with concrete pass or fail conditions, the kind you can run with curl: a protected route without a token returns 401, one user requesting another user's record returns 403, a disallowed file upload is rejected, a webhook with a bad signature is rejected, a disallowed origin gets no CORS headers back. Make that suite a gate that has to pass before anything merges. A green scanner is necessary. It is not the bar.
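A real suite would run over HTTP against a deployed build. As a sketch of the pass or fail shape only, the Python below runs the same kind of checks in-process against a stand-in handler. Everything named here (`handle_get_record`, `run_exploit_suite`, the tokens, record ids, and origins) is hypothetical and exists purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)


# Hypothetical fixtures standing in for the app's data and config.
RECORD_OWNERS = {"rec-1": "alice", "rec-2": "bob"}
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
ALLOWED_ORIGINS = {"https://app.example.com"}


def handle_get_record(record_id, token=None, origin=None):
    """Stand-in for a protected route with server-side ownership checks."""
    headers = {}
    if origin in ALLOWED_ORIGINS:
        headers["Access-Control-Allow-Origin"] = origin
    user = TOKENS.get(token)
    if user is None:
        return Response(401, headers)
    if RECORD_OWNERS.get(record_id) != user:
        return Response(403, headers)
    return Response(200, headers)


def run_exploit_suite(handler):
    """Run each check and return the names of the ones that failed.

    An empty list means the gate passes.
    """
    evil = handler("rec-1", token="token-alice", origin="https://evil.example")
    checks = {
        "no token -> 401": handler("rec-1").status == 401,
        "other user's record -> 403":
            handler("rec-2", token="token-alice").status == 403,
        "owner's own record -> 200":
            handler("rec-1", token="token-alice").status == 200,
        "disallowed origin gets no CORS header":
            "Access-Control-Allow-Origin" not in evil.headers,
    }
    return [name for name, passed in checks.items() if not passed]
```

The property worth copying is that every check is a concrete comparison with no judgment call in it. A handler that returns 200 to everyone with a wildcard CORS header fails three of the four checks, even though a scanner reading its source might find nothing to flag.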

The bugs the exploit suite still won't catch

The exploit suite from the last section catches the controls that are missing. It will not catch the bug that only happens when a webhook arrives twice, a cron fires while a customer is mid-payment, or two people try to rent the last unit in the same second. Those are timing and concurrency bugs, and an agent's happy-path code is full of them precisely because the happy path is the only path it tested.

This is where the deterministic-testing playbook earns its place, and it is worth reading the piece I wrote on Antithesis and deterministic simulation testing alongside this one. The short version: you make the test environment control every source of non-determinism (the clock, the scheduler, retries, random ids), then you inject the faults that actually take systems down and assert your invariants hold anyway. Two ideas from it port to almost any new project without buying a thing.

The first is fault injection as a first-class test input. Take the idempotency rule from the short list below, the one that says a retry or a double-fired cron must never double-bill. You don't prove that by hoping. You prove it by deliberately delivering the same webhook twice, killing the process between the charge and the ledger write, and re-running the cron mid-flight, then asserting the customer was charged exactly once. The storage-billing prompt that started this article carries a whole reconciliation watchdog for the same reason: a cron can miss a run, a webhook can fail to deliver, an ACH debit can sit pending for days and then bounce. Each of those is a fault you can inject in a test today, and each maps to a specific assertion: the late ACH return reverses the ledger entry, the dropped cron gets caught and healed, the two racing move-ins resolve to exactly one winner on the last unit.
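Here is a small Python sketch of that test shape, under loud assumptions: `BillingService` is a hypothetical in-memory stand-in, its `charges` dict imitates a payment processor that deduplicates on an idempotency key, and `crash_after_charge` is a test-only switch that simulates the process dying between the charge and the ledger write. A real system would use a durable ledger and a real processor.

```python
class InjectedCrash(Exception):
    """Raised by the harness to simulate the process dying mid-handler."""


class BillingService:
    def __init__(self):
        self.charges = {}   # idempotency key -> cents (stand-in processor)
        self.ledger = {}    # event id -> cents (stand-in durable ledger)
        self.crash_after_charge = False  # fault-injection switch

    def _charge(self, idempotency_key, amount_cents):
        # A repeated key is a no-op, like a processor that dedupes on the key.
        self.charges.setdefault(idempotency_key, amount_cents)

    def handle_webhook(self, event_id, amount_cents):
        if event_id in self.ledger:
            return "duplicate"
        self._charge(event_id, amount_cents)
        if self.crash_after_charge:
            self.crash_after_charge = False  # crash once, then allow retries
            raise InjectedCrash(event_id)
        self.ledger[event_id] = amount_cents
        return "processed"


def run_fault_scenario():
    """Deliver one event three times, with a crash injected on the first."""
    svc = BillingService()
    svc.crash_after_charge = True
    outcomes = []
    for _ in range(3):
        try:
            outcomes.append(svc.handle_webhook("evt_1", 4500))
        except InjectedCrash:
            outcomes.append("crashed")
    # The invariants: one charge and one ledger entry, whatever the faults.
    assert sum(svc.charges.values()) == 4500, "charged more than once"
    assert svc.ledger == {"evt_1": 4500}, "ledger out of sync with charge"
    return outcomes
```

The faults are inputs to the test, and the assertions are about money instead of about which code path ran. If the idempotency key is removed from `_charge`, the crash followed by a redelivery produces a second charge and the scenario fails.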

The second idea is the "sometimes assertion," and it is the quiet upgrade to "make the test suite the bar." A normal assertion says something is always true. A sometimes assertion says a condition is at least sometimes reached, which is how you discover your test harness never actually exercised the retry path or the race you thought you were covering. Line coverage tells you a line ran. A sometimes assertion tells you the meaningful situation (the double charge, the concurrent rental, the failed-then-recovered payment) was actually produced and survived. For anything touching money or concurrency, that is the difference between a green suite and a suite that means something.
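A sometimes assertion can be approximated without any platform. The Python sketch below is my own minimal stand-in, not the API of Antithesis or any other tool: `Sometimes` is a hypothetical tracker, and `pay_with_retry` takes an injected fault schedule (an iterator of booleans) so the run is fully deterministic.

```python
class Sometimes:
    """Track whether named situations were reached at least once."""

    def __init__(self):
        self.hits = {}

    def expect(self, name):
        # Declare a situation the harness is supposed to produce.
        self.hits.setdefault(name, 0)

    def reached(self, name):
        self.hits[name] = self.hits.get(name, 0) + 1

    def unreached(self):
        # Declared situations that no test run ever produced.
        return sorted(name for name, count in self.hits.items() if count == 0)


def pay_with_retry(attempt_fails, sometimes, max_attempts=3):
    """Try a payment up to max_attempts times.

    attempt_fails is the injected fault schedule: each next() says whether
    that attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if next(attempt_fails):
            sometimes.reached("payment attempt failed")
            continue
        if attempt > 1:
            sometimes.reached("failed-then-recovered payment")
        return True
    sometimes.reached("all retries exhausted")
    return False
```

Declare the three situations with `expect`, run the harness, and check `unreached()` at the end. A harness whose schedule never fails leaves all three names unreached, which is exactly the signal that the retry path was never exercised no matter how green the run looked.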

You don't need the full platform on day one. A new site is better served by borrowing the mindset: seed your tests deterministically, inject the faults that match your real failure modes, and write sometimes-assertions for the invariants that would cost you money if they broke. As the project grows real concurrency and distributed moving parts, that is exactly the point where deterministic testing stops being overkill and starts being the thing that lets you sleep.

Pin the stack, or it picks its own

When you don't specify a stack, the agent substitutes its defaults. That sounds harmless until it reaches for a hosting provider you don't use, an ORM you didn't choose, or a hand-rolled auth system in place of a vetted one. Worse, agents will sometimes try to provision new infrastructure on their own, spinning up an account or a database because that's their habit, while ignoring the setup you already documented two paragraphs up.

So pin it. Name the exact framework, the exact host, the exact auth provider, and add an explicit instruction: do not substitute your defaults, do not create accounts or provision services, and if anything here seems to conflict with the existing setup, ask before swapping it. That last clause matters more than it looks. The default failure mode is a silent swap, and a silent swap to the agent's favorite tool is how you end up debugging infrastructure you never agreed to.

Tell it what not to build

The opposite problem is just as real. Give an agent a modest app and it will reach for microservices, a message broker, a container orchestrator, full event-sourcing, the whole distributed-systems toolbox, because that machinery shows up constantly in its training data next to the word "production." For a single-facility app that needs to be correct and maintainable, all of that is cost and risk with no payoff.

A short "deliberately out of scope" section earns its place in every build prompt. Name the things you do not want: no microservices, no Kubernetes, no Kafka-class broker, no CQRS, no enterprise SSO you'll never use. Constraints are not just permission to keep it simple. They actively stop the agent from over-building.

Know the traps in your runtime

Some of the worst agent code is code that works in the demo and silently does nothing in production. The classic example is an in-memory rate limiter on a stateless serverless platform. The agent writes a perfectly reasonable counter in a variable, it passes a quick test, and then in production, where each request can hit a fresh instance with no shared memory, the counter never accumulates and the limiter quietly protects nothing. The code isn't wrong in isolation. It is wrong for where it runs.

You are the one who knows your runtime's constraints, so you have to encode them. If state has to be durable, say it has to be durable and name the primitive (a managed store, a Durable Object, a real cache), not an in-memory variable. The agent will not infer the deployment model from the framework name.
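The failure is easy to reproduce in a few lines. In this Python sketch, `simulate_stateless` models a platform where every request lands on a fresh instance, and a plain dict stands in for the durable store. Both limiter classes and the dict are illustrative assumptions: a real store needs an atomic increment and an expiring window, neither of which is modeled here.

```python
class InMemoryLimiter:
    """Counter lives in the instance, so it dies with the instance."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit


class StoreBackedLimiter:
    """Counter lives in a store shared by every instance."""

    def __init__(self, limit, store):
        self.limit = limit
        self.store = store  # stand-in for a durable, shared store

    def allow(self, key):
        # A real store would do this increment atomically, with a TTL.
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key] <= self.limit


def simulate_stateless(make_limiter, requests, key="login:203.0.113.7"):
    """Count allowed requests when each request gets a fresh instance."""
    return sum(1 for _ in range(requests) if make_limiter().allow(key))
```

With a limit of 5 and 20 requests, the in-memory version allows all 20 because each fresh instance starts counting from zero, while the store-backed version allows 5. Both classes pass the same single-instance unit test, which is why the quick test never catches it.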

The domain rules it can't know

An agent knows general programming. It does not know your domain's landmines, and it will confidently build something that is technically clean and legally or financially wrong. A few generic examples from regulated workflows: you cannot add a surcharge to a debit card, so a "card costs extra" feature has to be modeled as two posted prices, not a fee tacked on at the end. A customer deposit is a refundable liability, not revenue, so a report that counts it as income is wrong even though the math adds up. A collections process often has statutory notice windows and a right to cure that the software has to respect to the day.
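The deposit example fits in a few lines of Python. This is a deliberately tiny sketch with hypothetical account names and amounts, not a real chart of accounts: the point is only that "money that came in" and "income" are different queries over the same postings.

```python
from decimal import Decimal

# (account, amount) postings. Account names and amounts are made up.
POSTINGS = [
    ("rent_revenue", Decimal("120.00")),
    ("deposit_liability", Decimal("50.00")),  # refundable, owed back
    ("rent_revenue", Decimal("120.00")),
]


def cash_collected(postings):
    """Everything the customer paid, regardless of what it was for."""
    return sum((amount for _, amount in postings), Decimal("0"))


def income(postings):
    """Only postings to revenue accounts; deposits are excluded."""
    return sum(
        (amount for account, amount in postings
         if account.endswith("_revenue")),
        Decimal("0"),
    )
```

Here `cash_collected(POSTINGS)` is 290.00 and `income(POSTINGS)` is 240.00. An agent that was never told the deposit is a liability has no reason to write the second function, and a report built on the first will add up perfectly while being wrong.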

The agent will not surface any of this. It is on you to hand it the domain rules as hard requirements, and to have a human who knows the domain review every route that touches money, authorization, or anything a regulator cares about. This is the part you cannot delegate.

The rest of the short list

A few more that belong in any serious build prompt, each because it does not appear by default:

  • Don't hand-roll auth, sessions, JWT, or password hashing. Use a vetted provider or library. If any token verification exists, validate the algorithm explicitly and use constant-time comparison.
  • Centralize the security-critical logic. CORS, headers, input validation, and auth checks scattered across many route handlers get implemented inconsistently. Put them in one guarded middleware layer so they apply uniformly and can be tested in one place.
  • Idempotency on anything that moves money or retries. A retried request or a double-fired job must not double-charge. Idempotency keys are not optional here.
  • Secrets in environment variables or a vault, never in the repo or the client bundle. Fail fast at startup if a required secret is missing.
  • Pin dependency versions and commit the lockfile. Recent package-registry compromises make this real. Minimize the dependency count and review anything new before it lands.
  • Validate file uploads by extension, content type, and magic bytes. Agents have been caught accepting executable uploads when the framework didn't force the check.
  • Build in phases, each ending with its own hardening and tests. Don't defer security to a final pass that never comes.
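Three of those items are small enough to sketch with the Python standard library alone. The helper names (`require_secret`, `verify_signature`, `upload_allowed`) and the allowlist are hypothetical, and the signature scheme shown, a hex HMAC-SHA256 of the raw body, is an assumption: real payment processors define their own schemes, often with timestamps, so follow their documentation for the real thing.

```python
import hashlib
import hmac
import os


def require_secret(name, environ=os.environ):
    """Fail fast at startup if a required secret is missing or empty."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check a hex HMAC-SHA256 signature using constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Compare as bytes so odd characters in the header cannot raise.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


# Hypothetical allowlist: extension -> required content type.
ALLOWED_EXTENSIONS = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".pdf": "application/pdf",
}

# Leading bytes that files of each allowed type start with.
MAGIC_BYTES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/pdf": b"%PDF-",
}


def upload_allowed(filename: str, declared_type: str, data: bytes) -> bool:
    """Accept an upload only if extension, content type, and magic bytes agree."""
    extension = os.path.splitext(filename.lower())[1]
    expected_type = ALLOWED_EXTENSIONS.get(extension)
    if expected_type is None or declared_type != expected_type:
        return False
    return data.startswith(MAGIC_BYTES[expected_type])
```

The upload check is the one agents most often write in half: an executable renamed to `.png` with an `image/png` content type passes the first two tests and only fails on the magic bytes.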

How to actually write the prompt

Put all of this together and the shape of a good build prompt is clear. Be prescriptive about the stack and tell it not to substitute. Make the security controls explicit requirements, not hopes. Demand a behavioral test suite and make it a merge gate. Say what not to build. Encode your runtime's constraints and your domain's rules as hard requirements. Then review the money, authorization, and destructive routes by hand, because those are the ones where a quiet mistake is expensive.

The agent is a fast, capable builder that will do exactly what the spec says and nothing the spec leaves out. The whole job, the part that is actually yours, is writing a spec that leaves nothing dangerous out. This is the same muscle I lean on all through The $20 Dollar Agency: you can absolutely build the thing yourself with cheap AI tools instead of hiring it out, as long as you know which guardrails the tools will never add on their own.

A build-prompt template you can copy

Here is the whole thing as a skeleton. Select it, paste it into a file, fill in the angle-bracket placeholders, and you have a build prompt that closes the gaps above. The point isn't the exact wording. It's that every section names something the agent would otherwise skip.

# <Project name>: Build Prompt

## How to use this
Build in the phases at the end, not all at once. Treat Stack and Security as
non-negotiable. Personally review every route that touches authorization,
money, or anything a regulator cares about.

## Stack (do not substitute)
- Use exactly: <framework>, <host>, <database>, <auth provider>.
- Do NOT reach for your defaults. Do NOT hand-roll auth/sessions/JWT; use <auth provider>.
- Do NOT create accounts or provision services. If a service is needed, list it
  for me to set up and read it from an environment variable.
- If anything here conflicts with the existing setup, ASK before swapping it.

## What you're building
<one paragraph + the core entities and flows>

## Security (build these; they will not appear on their own)
- Security headers on every response: HSTS, X-Content-Type-Options: nosniff,
  X-Frame-Options / CSP frame-ancestors, Referrer-Policy, a restrictive CSP.
- Strict CORS allowlist (only <origins>); never reflect arbitrary origins.
- Object-level authorization on the server on every request. Changing an id in a
  URL must never expose another user's data. Enforce in middleware, not the UI.
- Rate-limit auth and any sensitive endpoint, using durable storage (not an
  in-memory counter).
- Validate every request body; parameterize every query; encode output.
- No card/bank data on the server; store only processor tokens. Verify webhook
  signatures; use idempotency keys on anything that moves money or retries.
- Secrets in env/secret store, never in the repo or client bundle. Fail fast on a
  missing required secret at startup.
- No production API docs / schema introspection.
- Centralize CORS, headers, auth, and validation in ONE middleware layer.

## Tests are the bar (a clean static scan is not)
Ship a behavioral exploit suite with concrete pass/fail (curl-based, no model
judge) that must pass before merge: protected route without a token -> 401; one
user reading another's record -> 403; disallowed file upload -> rejected; bad
webhook signature -> rejected; weak password -> rejected; rapid logins ->
throttled; security headers present; a disallowed CORS origin gets no headers; no
docs reachable in prod; errors return generic JSON.

## Do NOT build (stay right-sized)
No microservices, no Kubernetes, no message broker, no CQRS/event-sourcing, no
enterprise SSO. This is <scale>, not a hyperscale platform.

## Runtime constraints it can't infer
<e.g. the runtime is stateless per request, so any rate limiter or cache MUST use
durable storage, never an in-memory variable.>

## Domain rules I'm handing you (you can't infer these)
<the legal / financial / regulatory constraints the code must encode, e.g. a
specific charge is prohibited, a deposit is a liability not revenue, notice X must
precede action Y by N days. Flag these for human / counsel review.>

## Build order (each phase ends with its hardening + tests)
1. <the core loop first>
2. <next>
3. <next>

Review every route touching authorization, money, or the regulated process by
hand before it ships.

This post is informational, not legal, security, or financial advice. The regulatory and accounting examples are generic illustrations; verify the specifics for your situation with a qualified professional. Tool and company names are referenced as nominative fair use. No affiliation is implied.
