A small team running an internal coding agent on Claude Opus 4.7 watched their monthly Anthropic bill cross $1,200 in April 2026. The agent was doing useful work: code review, refactor proposals, doc generation. But maybe 70% of the calls were not the kind of work that needed a frontier model. Classification, extraction, summarization, structured generation. Things that Opus 4.6 handled fine in 2025, before everybody upgraded.
On April 24, 2026, DeepSeek released V4. V4-Pro scores 80.6 on SWE-bench Verified, statistically tied with Opus 4.6's 80.8 [1]. It is not a frontier-killer. Opus 4.7 still leads at 87.6, and on hard agentic tasks the gap shows [2]. But V4-Pro hits last year's frontier quality at roughly 10× the discount: 1.74permillioninputtokensversusOpus4.7′s15, 3.48outputversus75 [1][3].
For solo devs and small teams that put everything on Opus 4.7 because it was simpler, this is the moment to split the workload. Keep the frontier model for the 20% of calls that actually need frontier reasoning. Move the other 80% to V4. The math gets you most of a 10× cost reduction without rewriting your stack from scratch.
This guide walks the migration: what to keep, what to move, what breaks in the prompts, and how to audit your own traffic to find the candidates.
DeepSeek V4 migration featured
What V4 Actually Is
DeepSeek V4 ships in two variants, both with 1M token context windows and MIT-licensed open weights [1]:
V4-Pro. 1.6T total parameters in a Mixture-of-Experts architecture, 49B active per token. The flagship. Hosted at 1.74input/3.48 output per million tokens on DeepSeek's own API [3]. Self-hostable for teams with the GPU budget.
V4-Flash. 284B total / 13B active. The efficient-tier variant. 0.14input/0.28 output per million tokens. Roughly 50× cheaper than Opus 4.7 input, 270× cheaper output [3]. Designed for high-volume classification and extraction.
Both variants share three reasoning modes the Anthropic ecosystem does not have a direct analog for [1]:
Non-Think. Fast, intuitive responses. Equivalent to a low-effort Anthropic call. For routine tasks where you want completion speed and do not need deliberation.
Think High. Conscious logical analysis. The default for non-trivial work. Roughly equivalent to Anthropic's effort: high.
Think Max. Full reasoning capability. Comparable to Anthropic's effort: xhigh. Requires a minimum 384K context window allocation.
The benchmark comparison that matters for migration:
Benchmark
V4-Pro-Max
Opus 4.6
Opus 4.7
SWE-bench Verified
80.6
80.8
87.6
LiveCodeBench
93.5
n/a
n/a
Codeforces Rating
3,206
n/a
n/a
GPQA Diamond
90.1
n/a
n/a
MMLU-Pro
87.5
n/a
n/a
V4-Pro essentially ties Opus 4.6 on SWE-bench, the most workload-relevant benchmark for coding teams, and beats GPT-5.5 on Codeforces [1]. It is 7 points behind Opus 4.7 on SWE-bench, which matters for hard agentic loops but not for the bulk of day-to-day work.
The Economics
The headline number is "10× cheaper" but the breakdown is more useful. For a typical mid-volume workload where output is roughly 30% of input by token count:
Opus 4.7: 15×0.7+75 × 0.3 = $33 blended per million tokens
V4-Pro: 1.74×0.7+3.48 × 0.3 = $2.26 blended per million tokens
V4-Flash: 0.14×0.7+0.28 × 0.3 = $0.18 blended per million tokens
Want to know how effective your prompts are? Prompt Score analyzes them on 6 criteria.
That is a 14.6× reduction moving from Opus 4.7 to V4-Pro, and a 183× reduction moving to V4-Flash for the workloads that fit Flash's capability ceiling.
For a small team paying $1,200/month on Opus 4.7, with 80% of traffic moving to V4-Pro and 20% staying on Opus 4.7, the new bill is roughly:
20% on Opus 4.7: $240
80% on V4-Pro: $66
New total: ~$306/month
A 75% reduction without losing quality on the work that matters. The savings compound when high-volume classification or extraction is offloaded to Flash, where the 183× difference makes batch processing economically transformational.
DeepSeek V4 workload split
What Stays on Opus 4.7 (the 20%)
Three categories of work justify keeping the frontier model:
Hard agentic coding loops. Multi-step refactors, cross-file architecture changes, debugging in unfamiliar codebases. The 7-point SWE-bench gap is concentrated in tasks where the agent has to plan, retry, and reason about state. Opus 4.7's literal instruction following [2] also matters here because agentic prompts encode tight contracts.
Decisions with high downside risk. Code that ships to production, contracts that go to clients, anything where a wrong answer is more expensive than the model cost difference. The marginal $30 saved is not worth the chance of a regression.
Long-horizon reasoning. Tasks that span many steps and require the model to maintain a coherent mental model across them. V4 is good but Opus 4.7's reasoning depth on complex chains is still the state of the art.
If your work is in these buckets, do not migrate. The cost is real but the quality penalty is also real.
What Moves to V4 Cleanly (the 80%)
Most production traffic in small-team workflows falls into categories where V4-Pro or V4-Flash performs at or near frontier quality:
Structured extraction. Pull JSON from documents, normalize entities, parse semi-structured input. V4-Flash handles this well at $0.14/MTok, often 100× cheaper than Opus 4.7 doing the same job. Structure your prompt with strict schema and examples and the gap to frontier closes.
Classification and routing. Topic classification, sentiment, intent detection, support ticket triage. These tasks were a poor use of Opus 4.7 to begin with. V4-Flash is the right tool, and the prompts barely need to change.
Summarization and templating. Email drafts, doc summarization, report generation from structured input. V4-Pro handles long-form templating at near-frontier quality. The 1M context window is the same.
Mid-context coding tasks. Generating tests for a function, writing a single-file refactor, explaining unfamiliar code, pulling out interfaces. V4-Pro on Think-High mode produces output that scores comparably to Opus 4.7 on these scoped tasks.
Tool-use orchestration where each step is simple. If your agent loop is "call a tool, parse the result, decide next tool", and each individual decision is shallow, V4-Pro handles the orchestration without paying the frontier premium.
The 80/20 is approximate. Some teams will find their split is 70/30. A few will find it is 90/10. The audit recipe below tells you how to measure your own.
The techniques you're reading about work. Test your prompts now with Prompt Score and see your score in real time.
Five specific things change when you swap Anthropic for DeepSeek V4. None are dealbreakers but all need handling.
1. Different Chat Template
V4 uses a custom encoding format documented in the encoding_dsv4 module shipped with the model card [1]. It is not OpenAI-compatible by default, and the tool-call format diverges from Anthropic's. If you use the official DeepSeek API, the wrapper handles this. If you self-host with vLLM or SGLang, you call the encoding helpers directly:
from encoding_dsv4 import encode_messages
messages = [
{"role": "user", "content": "Refactor this function..."},
]
prompt = encode_messages(messages, thinking_mode="thinking")
Prompts that hardcode Anthropic-specific syntax (e.g., <thinking> tags, cache_control blocks) need to be cleaned before they pass to V4.
2. Reasoning Modes Replace the Effort Parameter
Anthropic's effort parameter does not exist on V4. Instead you choose a reasoning mode at request time. The mapping that works in practice:
Anthropic effort: low → V4 Non-Think
Anthropic effort: high → V4 Think High
Anthropic effort: xhigh → V4 Think Max
A prompt that worked on effort: high will likely work on Think High. Migrate the mode tag in lockstep with the prompt.
Note that Think Max requires a minimum 384K context window allocation [1], which has cost implications even for short prompts. Use Think High by default and reach for Max only on hard reasoning.
3. Sampling Settings Differ
Opus 4.7 rejects temperature, top_p, and top_k outright [2]. V4 expects them, with the recommended setting temperature=1.0, top_p=1.0 [1]. If you stripped sampling parameters from your Anthropic client, you need to add them back when calling V4.
Do not assume Opus 4.6 sampling defaults port over. The training is different.
4. No Native Cache-Control
Anthropic's cache_control field is Anthropic-specific. V4 has its own caching behavior at the inference layer (the Hybrid Attention architecture reduces KV cache to 10% of V3.2 for 1M context [1]), but there is no equivalent prompt-level directive for declaring cache prefixes.
If your workflow depended on Anthropic's prompt caching for cost optimization, the migration math changes. Run the numbers on V4-Pro without caching against Opus 4.7 with caching. V4-Pro often still wins on absolute cost, but the gap narrows for high-cache-hit workloads.
5. Refusal Patterns Differ
V4's RLHF training produces a different refusal profile than Opus 4.7. Some prompts that Anthropic flags get answered cleanly on V4. Some that V4 flags get answered on Anthropic. Most production prompts pass both, but if you have content-sensitive workflows, regression-test the migration carefully.
The Audit Recipe (5 Steps)
This is a practical workflow to identify migration candidates without manual case-by-case triage.
Step 1: Tag every prompt by category. Classify each production prompt as one of: extraction, classification, summarization, templating, mid-context coding, hard reasoning, or agentic loop. Tag in your prompt management tool or in a spreadsheet.
Step 2: Score each prompt on Opus 4.7 with KMP. Run a structural quality score. Note the score before any migration changes. This is your baseline.
Step 3: Migrate the top tag in volume to V4. Pick the category with the most calls per month (usually classification or extraction for small teams). Rewrite the prompt for V4: drop Anthropic-specific syntax, swap the mode tag, add sampling settings, use the V4 chat template.
Step 4: Score the V4 version and run a side-by-side eval. Use 50 representative inputs and check the V4 output against the Opus 4.7 output. If the score difference is within 0.3 points and the eval shows acceptable quality, the migration sticks.
Step 5: Repeat for the next category. Once one category is migrated, move to the next-highest-volume one. Stop when the remaining categories are clearly in the "stays on frontier" bucket.
Most teams converge on their final split within 2-3 weeks of this loop.
Worked Example: A Classification Prompt Migration
Here is a typical classification prompt before and after migration. The task: classify support tickets by urgency.
Before (Claude Opus 4.7):
client.messages.create(
model="claude-opus-4-7",
max_tokens=200,
system=[{
"type": "text",
"text": "You are a support ticket classifier. Output JSON.",
"cache_control": {"type": "ephemeral", "ttl": "1h"}
}],
messages=[{"role": "user", "content": ticket}]
)
After (DeepSeek V4-Flash):
from encoding_dsv4 import encode_messages
messages = [
{"role": "system", "content": "You are a support ticket classifier. Output JSON."},
{"role": "user", "content": ticket}
]
prompt = encode_messages(messages, thinking_mode="non-thinking")
response = deepseek_client.completions.create(
model="deepseek-v4-flash",
prompt=prompt,
max_tokens=200,
temperature=1.0,
top_p=1.0
)
Five changes: dropped cache_control, switched chat template, set thinking mode to non-thinking (classification does not need reasoning), added sampling parameters, and changed the API client. The prompt content is identical.
For 100,000 tickets/month at roughly 200 input tokens and 20 output tokens each, the cost goes from about 450onOpus4.7to3.50 on V4-Flash. A 130× reduction on a category that does not benefit from frontier reasoning anyway.
What This Means If You Use Keep My Prompts
The migration is mostly mechanical once you know which prompts to migrate. The hard part is the audit: knowing which prompts are doing real reasoning work versus which are doing low-skill classification dressed up in a frontier-model wrapper.
A prompt that scores above 4.0 on structural quality is almost always portable to V4-Pro. The structure carries the model's behavior, not the model itself. A prompt that scores below 3.0 is fragile on any model. Fix the structure first, then migrate.
Score your prompts at keepmyprompts.com, tag them by target model, and watch the migration become a checklist instead of a project.