
DeepSeek V4 vs Kimi K2.6 vs Claude vs GPT: Where Open-Weight Models Actually Stand


Two open-weight models shipped within four days of each other in April 2026. DeepSeek V4 Pro landed on April 24 with 1.6 trillion parameters. Kimi K2.6 from Moonshot AI dropped on April 20 with 1 trillion parameters and full open weights under a Modified MIT license. Both are free to self-host. Both score within a point of each other on the coding benchmarks that matter.

The question that actually matters for working developers: does either of these replace Claude Opus 4.7 or GPT-5.4 for daily coding work? The honest answer is complicated. On raw benchmarks, they're close. On cost, they're dramatically cheaper. On the specific things that make a coding CLI useful day to day, the gaps show up fast.

The benchmarks, side by side

| Model | SWE-bench Verified | SWE-bench Pro | Codeforces | Cost (input/output per 1M) | Weights |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 87.6% | 64.3% | N/A | $5.00 / $25.00 | Closed |
| GPT-5.4 | ~85.0% | 57.7% | 3,168 | ~$5.00 / $15.00 | Closed |
| DeepSeek V4 Pro | 80.6% | 55.4% | 3,206 | $0.30 / $1.20 | Open |
| Kimi K2.6 | 80.2% | 58.6% | N/A | $0.60 / $2.50 | Open (Modified MIT) |
| DeepSeek V4 Flash | 79.0% | N/A | N/A | $0.07 / $0.28 | Open |

A few things jump out.

Opus 4.7 leads SWE-bench Verified by 7 full points. That's not a rounding error. On the harder Pro variant, the gap is smaller but still real: 64.3% vs 58.6% (Kimi) vs 55.4% (DeepSeek). For complex multi-file refactors and real-world GitHub issues, the closed models still win.

But DeepSeek V4 Pro costs roughly 17 times less than Opus 4.7 per input token. Kimi K2.6 costs about 8 times less. For batch workflows, automated testing, or tasks where 80% accuracy is good enough, the cost math changes everything.
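That cost math can be sketched directly from the table's prices. The 200k-in / 20k-out token counts below stand in for a typical agentic coding task and are illustrative assumptions, not measured values:

```python
# Per-job cost comparison using the per-million-token prices from the
# table above. Prices are (input, output) in USD per 1M tokens.
PRICES = {
    "Claude Opus 4.7": (5.00, 25.00),
    "Kimi K2.6": (0.60, 2.50),
    "DeepSeek V4 Pro": (0.30, 1.20),
    "DeepSeek V4 Flash": (0.07, 0.28),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one job at the listed API prices."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Assumed task size: 200k tokens of context in, 20k tokens of code out.
opus = job_cost("Claude Opus 4.7", 200_000, 20_000)
v4 = job_cost("DeepSeek V4 Pro", 200_000, 20_000)
print(f"Opus 4.7: ${opus:.2f}, V4 Pro: ${v4:.2f}, ratio: {opus / v4:.0f}x")
# → Opus 4.7: $1.50, V4 Pro: $0.08, ratio: 18x
```

Because output tokens carry a steeper multiplier on the closed models, the blended ratio lands near the headline input-price ratio but shifts with how verbose the task is.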

DeepSeek V4 Pro holds the highest Codeforces rating of any model at 3,206, above GPT-5.4's 3,168. For competitive programming tasks, it's the strongest option available.

Where Kimi K2.6 stands out

Kimi's real advantage isn't the benchmark scores. It's the agent swarm architecture. K2.6 can scale to 300 concurrent sub-agents across 4,000 coordinated steps. That's 3x more agents and 2.7x more steps than the previous K2.5 version.

For tasks like "analyze these 50 repositories and report commonalities" or "research 8 competitor products in parallel," Kimi's swarm capability does something that sequential models can't match. It parallelizes the work at the model level, not just the orchestration level.

The pricing makes this practical. Running 300 parallel agent calls through Opus 4.7 would cost hundreds of dollars. Through Kimi K2.6 at $0.60 per million input tokens, the same swarm run costs a small fraction of that.
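A back-of-envelope version of that comparison, using the table's prices; the per-agent token counts are assumptions for illustration, not measured swarm traffic:

```python
# Rough cost of a 300-agent swarm run at per-1M-token prices.
AGENTS = 300
INPUT_TOKENS_PER_AGENT = 100_000   # assumed context per sub-agent
OUTPUT_TOKENS_PER_AGENT = 10_000   # assumed output per sub-agent

def swarm_cost(input_price: float, output_price: float) -> float:
    """Total USD for the whole swarm at the given per-1M-token prices."""
    inp = AGENTS * INPUT_TOKENS_PER_AGENT * input_price / 1_000_000
    out = AGENTS * OUTPUT_TOKENS_PER_AGENT * output_price / 1_000_000
    return inp + out

print(f"Opus 4.7:  ${swarm_cost(5.00, 25.00):.2f}")  # → Opus 4.7:  $225.00
print(f"Kimi K2.6: ${swarm_cost(0.60, 2.50):.2f}")   # → Kimi K2.6: $25.50
```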

K2.6 also scores 92.5% F1 on DeepSearchQA and 66.7% on Terminal-Bench 2.0, which measures command-line and systems-level task performance. For DevOps-adjacent work, those numbers matter.

The open weights under Modified MIT mean you can self-host with no API costs at all, assuming you have the hardware. The mixture-of-experts architecture (1T total parameters, 32B active per token) means inference is more tractable than the parameter count suggests.
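To make that tractability claim concrete, here is a rough weight-memory estimate at common precisions. It ignores KV cache and activation memory, and the key point is that all 1T parameters must be resident while only the 32B active parameters govern per-token compute:

```python
# Rough weight footprint of a 1T-parameter MoE model. Memory is set by
# total parameters; per-token compute scales with the active subset.
TOTAL_PARAMS = 1_000_000_000_000   # 1T total
ACTIVE_PARAMS = 32_000_000_000     # 32B active per token

def weights_gb(params: int, bytes_per_param: float) -> float:
    """Gigabytes needed to hold the weights at a given precision."""
    return params * bytes_per_param / 1e9

for name, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: {weights_gb(TOTAL_PARAMS, bpp):.0f} GB total weights, "
          f"{weights_gb(ACTIVE_PARAMS, bpp):.0f} GB active per token")
```

Even at INT4 the full weight set is a multi-node proposition, but the compute bill per token looks like a 32B dense model, which is what makes serving economics workable.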

Where DeepSeek V4 stands out

DeepSeek's strength is breadth across cost tiers. V4 Pro is the flagship at 1.6 trillion parameters. V4 Flash is the cost-optimized variant at 284 billion parameters, scoring 79.0% on SWE-bench Verified at $0.07 per million input tokens. That's roughly seventy times cheaper than Opus 4.7's input price.

For production inference at scale, V4 Flash is the model that changes the economics. If you're running thousands of API calls per day for code review, test generation, or automated PR analysis, the difference between $0.07 and $5.00 per million tokens compounds into real money.
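How that compounds can be sketched with a monthly-spend estimate; the call volume and per-call token counts below are illustrative assumptions about such a pipeline, not measured numbers:

```python
# Monthly API spend for an automated pipeline (code review, test
# generation, PR analysis). Volume figures are assumptions.
CALLS_PER_DAY = 5_000
IN_TOK, OUT_TOK = 4_000, 1_000    # assumed tokens per call
DAYS = 30

def monthly_cost(in_price: float, out_price: float) -> float:
    """USD per month at the given per-1M-token prices."""
    tokens_in = CALLS_PER_DAY * DAYS * IN_TOK
    tokens_out = CALLS_PER_DAY * DAYS * OUT_TOK
    return (tokens_in * in_price + tokens_out * out_price) / 1_000_000

print(f"Opus 4.7: ${monthly_cost(5.00, 25.00):,.0f}/month")   # → $6,750/month
print(f"V4 Flash: ${monthly_cost(0.07, 0.28):,.0f}/month")    # → $84/month
```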

DeepSeek V4 Pro also scored 67.9% on Terminal-Bench 2.0, ahead of Claude's 65.4%. For systems-level tasks, shell scripting, and infrastructure work, it holds its own against the closed models.

The NVIDIA developer community has been testing V4 extensively, with benchmark threads on the NVIDIA forums tracking performance across different hardware configurations. The model runs well on consumer-grade GPUs for smaller inference tasks, making self-hosting accessible.

Where the closed models still win

Claude Opus 4.7's 87.6% on SWE-bench Verified isn't just a number. It represents a 7-point gap that maps directly to "does the model get the hard problems right." In practice, the gap shows up on:

  • Multi-file refactors where the model needs to trace dependencies across a large codebase
  • Subtle bug fixes where the model needs to understand the business logic, not just the code
  • Instruction-following on constrained tasks where you need the model to respect your CLAUDE.md rules

GPT-5.4's advantage is similar. On problems that require deep reasoning over many files, the closed models are measurably better at following through. The open-weight models do well on contained tasks (single-file fixes, function generation, competitive programming) but drop off faster on the kind of multi-step, context-heavy work that Claude Code and Codex are designed for.

The practical setup

If you're running a coding CLI as your primary tool, the closed models are still the better default for complex work. But the open-weight models are now good enough to serve as:

  • Cost-efficient batch runners. Pipe your audit prompts through Kimi K2.6 or DeepSeek V4 Flash for tasks where volume matters more than perfection.
  • Fallback models. When your primary provider ships a regression (see the Opus 4.7 case study), having a self-hosted open-weight model means you're never stuck waiting for a fix.
  • Parallel research agents. Kimi's swarm architecture or DeepSeek's batch API for tasks that benefit from parallelization.
  • Local inference for sensitive codebases. Self-hosted means no data leaves your infrastructure.
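The fallback pattern above can be sketched as a tiny router that prefers the closed-model provider and drops to a self-hosted open-weight endpoint when the primary is unhealthy. The endpoint URLs and model ids here are placeholders, not real identifiers:

```python
# Minimal fallback routing: first healthy backend wins, in priority order.
# URLs and model ids are placeholders for illustration only.
BACKENDS = [
    ("primary", "https://api.example-closed.com/v1", "closed-model-latest"),
    ("fallback", "http://localhost:8000/v1", "local-open-model"),
]

def pick_backend(healthy: set[str]) -> tuple[str, str, str]:
    """Return (name, base_url, model) for the first healthy backend."""
    for name, url, model in BACKENDS:
        if name in healthy:
            return name, url, model
    raise RuntimeError("no healthy backend available")

# When the primary provider is down, traffic routes to the local model.
print(pick_backend({"fallback"}))
```

In practice the health set would come from a probe or from catching API errors, but the routing decision itself stays this simple.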

The AI CLIs guide covers how to pipe prompts into different model backends. The tool aichat supports DeepSeek and Kimi natively. Simon Willison's llm tool supports them via plugins.

If you're comparing AI tools for running a small business, The $20 Dollar Agency covers how to pick the right model for each task without overspending on tokens you don't need. Search "The $20 Dollar Agency" on Amazon Kindle.


Fact-check notes and sources

This post is informational, not consulting or financial advice. Mentions of DeepSeek, Moonshot AI, Anthropic, OpenAI, and NVIDIA are nominative fair use. No affiliation is implied.
