Two open-weight models shipped within four days of each other in April 2026. DeepSeek V4 Pro landed on April 24 with 1.6 trillion parameters. Kimi K2.6 from Moonshot AI dropped on April 20 with 1 trillion parameters and full open weights under a Modified MIT license. Both are free to self-host. Both score within a point of each other on the coding benchmarks that matter.
The question that actually matters for working developers: does either of these replace Claude Opus 4.7 or GPT-5.4 for daily coding work? The honest answer is complicated. On raw benchmarks, they're close. On cost, they're dramatically cheaper. On the specific things that make a coding CLI useful day to day, the gaps show up fast.
The benchmarks, side by side
| Model | SWE-bench Verified | SWE-bench Pro | Codeforces | Cost per 1M tokens (input / output) | Weights |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 87.6% | 64.3% | N/A | $5.00 / $25.00 | Closed |
| GPT-5.4 | ~85.0% | 57.7% | 3,168 | ~$5.00 / $15.00 | Closed |
| DeepSeek V4 Pro | 80.6% | 55.4% | 3,206 | $0.30 / $1.20 | Open |
| Kimi K2.6 | 80.2% | 58.6% | N/A | $0.60 / $2.50 | Open (Modified MIT) |
| DeepSeek V4 Flash | 79.0% | N/A | N/A | $0.07 / $0.28 | Open |
A few things jump out.
Opus 4.7 leads SWE-bench Verified by 7 full points. That's not a rounding error. On the harder Pro variant, the gap is smaller but still real: 64.3% vs 58.6% (Kimi) vs 55.4% (DeepSeek). For complex multi-file refactors and real-world GitHub issues, the closed models still win.
But per input token, DeepSeek V4 Pro costs roughly a seventeenth of what Opus 4.7 does, and Kimi K2.6 about an eighth. For batch workflows, automated testing, or tasks where 80% accuracy is good enough, the cost math changes everything.
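To make that concrete, here's a minimal sketch of the monthly bill under an assumed workload of 10 million input and 2 million output tokens. The prices come straight from the table; the workload is illustrative, not from any benchmark.

```python
# Prices per 1M tokens (input, output), taken from the table above.
PRICES = {
    "Claude Opus 4.7":   (5.00, 25.00),
    "Kimi K2.6":         (0.60, 2.50),
    "DeepSeek V4 Pro":   (0.30, 1.20),
    "DeepSeek V4 Flash": (0.07, 0.28),
}

# Assumed monthly workload: 10M input tokens, 2M output tokens.
input_m, output_m = 10, 2

for model, (p_in, p_out) in PRICES.items():
    cost = input_m * p_in + output_m * p_out
    print(f"{model:<18} ${cost:>7.2f}/month")
# Opus 4.7: $100.00 | Kimi: $11.00 | V4 Pro: $5.40 | V4 Flash: $1.26
```

On that blended workload, Opus runs $100 to DeepSeek V4 Pro's $5.40, a gap of roughly 18x once output pricing is folded in.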
DeepSeek V4 Pro holds the highest Codeforces rating of any model at 3,206, above GPT-5.4's 3,168. For competitive programming tasks, it's the strongest option available.
Where Kimi K2.6 stands out
Kimi's real advantage isn't the benchmark scores. It's the agent swarm architecture. K2.6 can scale to 300 concurrent sub-agents across 4,000 coordinated steps. That's 3x more agents and 2.7x more steps than the previous K2.5 version.
For tasks like "analyze these 50 repositories and report commonalities" or "research 8 competitor products in parallel," Kimi's swarm capability does something that sequential models can't match. It parallelizes the work at the model level, not just the orchestration level.
The pricing makes this practical. Running 300 parallel agent calls through Opus 4.7 would cost hundreds of dollars. Through Kimi K2.6 at $0.60 per million input tokens, the same swarm run costs a small fraction of that.
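The swarm itself is coordinated on the provider's side, but the cost argument is easy to sanity-check from the client. Here's a minimal sketch of fanning research tasks out in parallel through an OpenAI-compatible endpoint; the base URL, the `kimi-k2.6` model id, and the repo list are all assumptions to verify against Moonshot's docs.

```python
import asyncio
from openai import AsyncOpenAI

# Assumption: Moonshot exposes an OpenAI-compatible endpoint and a
# "kimi-k2.6" model id. Verify both against the current provider docs.
client = AsyncOpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_KEY")
sem = asyncio.Semaphore(50)  # client-side cap so you don't trip rate limits

async def run_agent(task: str) -> str:
    async with sem:
        resp = await client.chat.completions.create(
            model="kimi-k2.6",
            messages=[{"role": "user", "content": task}],
        )
        return resp.choices[0].message.content

async def main() -> None:
    repos = [f"example-org/repo-{i:02d}" for i in range(50)]  # placeholders
    reports = await asyncio.gather(
        *(run_agent(f"Summarize the build system of {r}") for r in repos)
    )
    for report in reports:
        print(report[:120])

asyncio.run(main())
```

At Kimi's prices, a 50-way fan-out like this costs cents rather than the dollars the same run would cost through a closed flagship.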
K2.6 also scores 92.5% F1 on DeepSearchQA and 66.7% on Terminal-Bench 2.0, which measures command-line and systems-level task performance. For DevOps-adjacent work, those numbers matter.
The open weights under Modified MIT mean you can self-host with no API costs at all, assuming you have the hardware. The mixture-of-experts architecture (1T total parameters, 32B active per token) means inference is more tractable than the parameter count suggests.
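If you do self-host, a common pattern is to stand up an OpenAI-compatible server over the released weights and point a standard client at it. A minimal sketch, assuming a vLLM-style server on localhost; the port and model id are assumptions, and even with only 32B parameters active per token, holding 1T total weights in memory is a serious hardware commitment.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server is already running over the open
# weights on localhost -- e.g. launched with `vllm serve moonshotai/Kimi-K2.6`.
# The port and model id must match whatever your server actually registers.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
)
print(resp.choices[0].message.content)
```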
Where DeepSeek V4 stands out
DeepSeek's strength is breadth across cost tiers. V4 Pro is the flagship at 1.6 trillion parameters. V4 Flash is the cost-optimized variant at 284 billion parameters, scoring 79.0% on SWE-bench Verified at $0.07 per million input tokens. That's roughly a seventieth of Opus 4.7's input price.
For production inference at scale, V4 Flash is the model that changes the economics. If you're running thousands of API calls per day for code review, test generation, or automated PR analysis, the difference between $0.07 and $5.00 per million tokens compounds into real money.
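A rough annualized comparison shows how it compounds. The workload below (5,000 automated calls a day at about 2,000 tokens each) is an assumption for illustration, not a measured figure.

```python
# Assumed workload: 5,000 automated calls/day, ~1,500 input + 500 output
# tokens per call. Purely illustrative numbers.
CALLS_PER_DAY = 5_000
IN_TOK, OUT_TOK = 1_500, 500

def annual_cost(price_in: float, price_out: float) -> float:
    per_call = (IN_TOK * price_in + OUT_TOK * price_out) / 1_000_000
    return per_call * CALLS_PER_DAY * 365

print(f"Claude Opus 4.7:   ${annual_cost(5.00, 25.00):>9,.0f}/yr")  # ~$36,500
print(f"DeepSeek V4 Flash: ${annual_cost(0.07, 0.28):>9,.0f}/yr")   # ~$447
```

Same workload, same year: about $36,500 through Opus 4.7 versus about $447 through V4 Flash.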
DeepSeek V4 Pro also scored 67.9% on Terminal-Bench 2.0, ahead of Claude's 65.4%. For systems-level tasks, shell scripting, and infrastructure work, it holds its own against the closed models.
The NVIDIA developer community has been testing V4 extensively, with benchmark threads on the NVIDIA forums tracking performance across different hardware configurations. The model runs well on consumer-grade GPUs for smaller inference tasks, making self-hosting accessible.
Where the closed models still win
Claude Opus 4.7's 87.6% on SWE-bench Verified isn't just a number. It represents a 7-point gap that maps directly to "does the model get the hard problems right." In practice, the gap shows up on:
- Multi-file refactors where the model needs to trace dependencies across a large codebase
- Subtle bug fixes where the model needs to understand the business logic, not just the code
- Instruction-following on constrained tasks where you need the model to respect your CLAUDE.md rules
GPT-5.4's advantage is similar. On problems that require deep reasoning over many files, the closed models are measurably better at following through. The open-weight models do well on contained tasks (single-file fixes, function generation, competitive programming) but drop off faster on the kind of multi-step, context-heavy work that Claude Code and Codex are designed for.
The practical setup
If you're running a coding CLI as your primary tool, the closed models are still the better default for complex work. But the open-weight models are now good enough to serve as:
- Cost-efficient batch runners. Pipe your audit prompts through Kimi K2.6 or DeepSeek V4 Flash for tasks where volume matters more than perfection.
- Fallback models. When your primary provider ships a regression (see the Opus 4.7 case study), having a self-hosted open-weight model means you're never stuck waiting for a fix. A minimal routing sketch follows this list.
- Parallel research agents. Kimi's swarm architecture or DeepSeek's batch API for tasks that benefit from parallelization.
- Local inference for sensitive codebases. Self-hosted means no data leaves your infrastructure.
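As a concrete example of the fallback pattern, here's a minimal two-tier router: a hosted closed model first, a self-hosted open-weight endpoint second. Every endpoint and model id below is a placeholder, not a documented value.

```python
from openai import OpenAI

# Hypothetical two-tier setup: a hosted closed model as the default, a
# self-hosted open-weight server (e.g. behind vLLM on localhost) as backup.
PRIMARY = OpenAI()  # reads OPENAI_API_KEY and the default base URL
FALLBACK = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

BACKENDS = [(PRIMARY, "gpt-5.4"), (FALLBACK, "deepseek-v4-flash")]

def complete(prompt: str) -> str:
    for client, model in BACKENDS:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=60,  # seconds; fail over instead of hanging
            )
            return resp.choices[0].message.content
        except Exception:
            continue  # outage, regression, or rate limit: try the next tier
    raise RuntimeError("all model backends failed")
```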
The AI CLIs guide covers how to pipe prompts into different model backends. The tool aichat supports DeepSeek and Kimi natively. Simon Willison's llm tool supports them via plugins.
If you're comparing AI tools for running a small business, The $20 Dollar Agency covers how to pick the right model for each task without overspending on tokens you don't need. Search "The $20 Dollar Agency" on Amazon Kindle.
Related reading
- How to validate an AI coding model before you trust it — the 10-step checklist applies to open-weight models too
- Where Opus 4.7 actually ranks and what early adopters learned — the closed-model benchmarks in detail
- Top AI CLIs and how to use them with our generators — how to route prompts to different model backends
- A Markdown file is the best memory layer for your AI coding tool — project context that works across model providers
- Two CLIs, one workflow: Codex alongside Claude Code — the multi-tool routine
Fact-check notes and sources
- DeepSeek V4 Pro SWE-bench Verified (80.6%), Codeforces (3,206), Terminal-Bench (67.9%): nxcode.io and buildfastwithai.com.
- DeepSeek V4 Flash SWE-bench Verified (79.0%), pricing ($0.07/$0.28): llm-stats.com.
- Kimi K2.6 SWE-bench Verified (80.2%), SWE-bench Pro (58.6%), pricing ($0.60/$2.50): llm-stats.com and artificialanalysis.ai.
- Kimi K2.6 agent swarm (300 concurrent sub-agents, 4,000 steps), architecture (1T total, 32B active, Modified MIT): huggingface.co/moonshotai/Kimi-K2.6 and nerdleveltech.com.
- Claude Opus 4.7 SWE-bench Verified (87.6%), GPT-5.4 SWE-bench (~85.0%): swebench.com and tokenmix.ai.
This post is informational, not consulting or financial advice. Mentions of DeepSeek, Moonshot AI, Anthropic, OpenAI, and NVIDIA are nominative fair use. No affiliation is implied.