In May 2026 Microsoft published a result that got written up as a horse race: its security system, MDASH, topped a hard public benchmark for finding software vulnerabilities, scoring 88.45 percent on UC Berkeley's CyberGym, ahead of the next entry, Anthropic's restricted Claude Mythos model, at 83.1 percent (Microsoft Security Blog, GeekWire).
The "Microsoft beat Anthropic" framing is the least interesting part, and it deserves a caveat: those leaderboard scores are self-reported, on different setups and dates, so read it as "topped the published leaderboard," not a refereed knockout. The interesting part is how MDASH did it. It is not one enormous model. It is a pipeline of more than 100 specialized agents working in stages, an ensemble of cheaper and frontier models, with agents that argue against each other to validate findings and domain knowledge fed in at each step (Microsoft Security Blog, InfoQ).
That is the lesson worth carrying into a small business: for hard, specific work, the structure around the model can matter more than which model you buy.
Why this is the opposite of the usual advice
The reflex in 2026 is to reach for the biggest, newest, priciest model and assume it will be the smartest. Sometimes it is. But MDASH is a public, documented case where a structured system of smaller parts outscored a single restricted frontier model on its own turf, at least on the published leaderboard. CyberGym itself is real and serious: a UC Berkeley benchmark of 1,507 real-world vulnerability tasks across 188 open-source projects (UC Berkeley, MarkTechPost). And some of the people who built MDASH came from Team Atlanta, the Georgia Tech-led group that won DARPA's AI Cyber Challenge and its $4 million grand prize, announced at DEF CON in August 2025 (Georgia Tech, The Record).
You will never run anything at that scale, and you do not need to. The shape of the idea is what scales down.
How to copy the idea for about the price of a coffee
You can mimic MDASH's core move with whatever AI you already pay for. The trick is to stop asking one model to do the whole job in one shot, and instead split the job into roles:
- Use a finder pass. One prompt does the first draft of the work: the analysis, the copy, the code, the list.
- Use an independent checker pass. A second, separate prompt reviews the first one's output and argues against it. Ask it to find what is wrong, not to agree. This is the cheap version of MDASH's debating agents, and it catches the confident mistakes a single pass misses.
- Feed in your own domain context. The model never trained on your prices, your policies, your customer, your last incident. Paste that in. MDASH's edge came partly from injected domain knowledge, and yours will too.
- Use cheap models for the high-volume passes. Reserve the expensive model for the one genuinely hard step, and let a cheaper or distilled model handle the bulk. That is the whole economics of the approach.
- Add a dedup or cleanup pass. A final cheap pass that removes repeats and tidies the output is often what turns "rough" into "shippable."
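The roles above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MDASH's design: `Model`, `run_pipeline`, `dedupe_lines`, and the two stub models are hypothetical names invented for this example, the revision step is my addition to connect the checker back to the draft, and the cleanup pass is plain Python rather than a model call. In real use you would replace the stubs with calls to whatever AI service you already pay for.

```python
from typing import Callable, Dict

# Hypothetical interface: a "model" is any function that takes a prompt and returns text.
Model = Callable[[str], str]


def dedupe_lines(text: str) -> str:
    """Cleanup pass: drop blank lines and case-insensitive repeats, keep first-seen order."""
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(line.strip())
    return "\n".join(kept)


def run_pipeline(task: str, context: str, cheap: Model, strong: Model) -> Dict[str, str]:
    """Finder -> checker -> revision -> cleanup, with domain context in every prompt."""
    header = f"Business context:\n{context}\n\nTask:\n{task}\n\n"

    # Finder pass: the cheap model writes the first draft.
    draft = cheap(header + "Write a first draft.")

    # Checker pass: a separate prompt whose job is to disagree.
    # Assumption: this is the one genuinely hard step, so it gets the strong model.
    critique = strong(
        header
        + "Review the draft below. List what is wrong or unsupported. "
        + "Do not simply agree.\n\nDraft:\n"
        + draft
    )

    # Revision pass (an added step, not from the list above): apply the critique cheaply.
    revised = cheap(
        header
        + "Rewrite the draft to fix every issue in the critique.\n\nDraft:\n"
        + draft
        + "\n\nCritique:\n"
        + critique
    )

    # Cleanup pass: plain code here, standing in for a final cheap model pass.
    return {"draft": draft, "critique": critique, "final": dedupe_lines(revised)}


# Stub models so the sketch runs offline; swap in real API calls in practice.
def stub_cheap(prompt: str) -> str:
    if "Critique:" in prompt:
        return "Refunds within 30 days.\nrefunds within 30 days.\n\nEmail support to start."
    return "Refunds within 60 days."


def stub_strong(prompt: str) -> str:
    return "The draft says 60 days; the policy in the context says 30."


result = run_pipeline(
    task="Draft a one-line refund policy reply.",
    context="Refund window is 30 days. Support email handles refunds.",
    cheap=stub_cheap,
    strong=stub_strong,
)
print(result["final"])
```

The design choice that matters is that the checker gets its own prompt and is told to find faults, and that the same business context is pasted into every pass instead of only the first.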
None of that requires a platform or a budget. It requires deciding that a workflow of small, checkable steps beats one big hopeful prompt. That is the same under-100-dollars-a-month thesis behind my book The $20 Dollar Agency (search the title on Amazon Kindle): you get expert-grade output from structure, not from spending.
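To see why routing the bulk passes to a cheaper model matters, here is a back-of-the-envelope sketch. Every number in it is a placeholder made up for illustration: the prices are in abstract "credits," not any vendor's real rates, and the token counts are guesses. The names `PRICE_PER_1K_TOKENS` and `run_cost` are hypothetical. Plug in your own figures before drawing conclusions.

```python
# Placeholder prices in abstract credits per 1,000 tokens (NOT real vendor pricing).
PRICE_PER_1K_TOKENS = {"cheap": 1, "strong": 20}


def run_cost(passes):
    """Total credits for a list of (model_tier, tokens_used) passes."""
    return sum(tokens * PRICE_PER_1K_TOKENS[tier] // 1000 for tier, tokens in passes)


# Three passes (finder, checker, cleanup), each assumed to use 4,000 tokens.
all_strong = [("strong", 4000), ("strong", 4000), ("strong", 4000)]
routed = [("cheap", 4000), ("strong", 4000), ("cheap", 4000)]

all_strong_cost = run_cost(all_strong)
routed_cost = run_cost(routed)
print(f"all strong: {all_strong_cost} credits, routed: {routed_cost} credits")
```

With these made-up numbers the routed workflow costs 88 credits against 240. The exact ratio depends entirely on the placeholders, but the shape holds whenever the expensive model is used for only one pass.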
The one security takeaway you should act on this week
MDASH is a security story, so here is the security point that actually applies to you, and it is not "buy more tools." Mandiant's M-Trends 2026 reports that the time from a vulnerability being disclosed to being exploited has gone negative, meaning working exploits increasingly show up before the patch does (Help Net Security).
When exploits beat patches, the highest-return move for a small business is boring: turn on automatic updates everywhere, on every laptop, phone, router, and server you own, and let them install without you. No new product will protect you faster than patching the moment a fix exists.
Related reading
- Stop Prompting, Start Designing Loops: the mindset shift behind splitting work into checkable steps.
- How AI Agents Coordinate and How They Remember: the architecture choices behind any multi-step agent setup.
- Learn to Secure AI, Then Build With It: the security-training side of all this, plus a visual canvas for building workflows.
- How a Small Business Runs AI Agents Without a Surprise Bill: keeping a multi-pass workflow cheap.
- I Cut a Recurring AI Bill by More Than Half in an Afternoon: routing the bulk work to cheaper models.
Fact-check notes and sources
- MDASH and the benchmark: Microsoft's multi-model agentic security system scored 88.45 percent on CyberGym, ahead of the next entry at 83.1 percent (Claude Mythos), per Microsoft's Security Blog and GeekWire. Scores are self-reported, so I describe MDASH as topping the published leaderboard rather than as a controlled head-to-head win. A third-place "GPT-5.5 at 81.8 percent" figure appeared in the source article, but I could not independently confirm it, so I have left it out.
- CyberGym: a UC Berkeley benchmark of 1,507 real-world tasks across 188 open-source projects (UC Berkeley, MarkTechPost).
- Claude Mythos: a restricted Anthropic security model released to partners under Project Glasswing (Decrypt).
- Team Atlanta: the Georgia Tech-led team won DARPA's AI Cyber Challenge and its $4 million grand prize, announced at DEF CON 33 in August 2025 (Georgia Tech, The Record). The source article's "$29.5 million in 2024" was incorrect, so I have not used it.
- Time-to-exploit: Mandiant's M-Trends 2026 reports time-to-exploit has gone negative (Help Net Security). A specific "28.3 percent within 24 hours" figure from the source article was uncorroborated and is not used here.
This post is informational, not security-consulting advice. Benchmark scores are self-reported by their publishers. Mentions of Microsoft, Anthropic, UC Berkeley, Mandiant, and other third parties are nominative fair use. No affiliation is implied.