TL;DR

Spanish prompts cost ~50% more tokens than English, but the quality difference is negligible for coding tasks. Using your native language lets you think faster and give clearer instructions — that matters more than token savings.

Update, May 2026. I went back and audited every number in this post against my raw local data (history.jsonl and the usage stats cache). Two things changed: I replaced some aggregate figures that hadn't been computed from real data, and I found a much better story hiding in the timestamps — I didn't gently drift toward Spanish, I switched, almost overnight, in early 2026. The new data is in Part 4, the full methodology and every raw number live in the data appendix, and the original conclusion still holds.

The Setup

I'm a Spanish-speaking software engineer who uses Claude Code as my primary development tool. Over the past months I've accumulated ~2,000 sessions and 620K+ messages of real usage. My workflow is heavily execution-oriented: long debugging sessions, architecture discussions, code reviews. (I originally wrote "all in Spanish" here — the data in Part 4 tells a more interesting truth.)

At some point I asked myself the obvious question: should I be writing all these prompts in English instead? English is the lingua franca of programming. Most training data is in English. Every benchmark is in English. Surely I'm leaving performance on the table?

I decided to find out.


Part 1: The Token Tax

LLMs don't see words — they see tokens. And tokenizers are heavily optimized for English. The word authentication is a single token. Its Spanish equivalent, autenticación, is three tokens.

I ran equivalent prompts through Claude's tokenizer (cl100k_base proxy) to measure the real cost. These are actual prompts I use daily in Claude Code:

Prompt Type English Spanish Overhead
Bug fix instruction 20 tokens 31 tokens +55%
API endpoint creation 32 tokens 46 tokens +44%
Refactoring task 26 tokens 41 tokens +58%
Documentation generation 33 tokens 52 tokens +58%
DevOps debugging 37 tokens 58 tokens +57%
Code review (long) 65 tokens 106 tokens +63%
Architecture discussion 58 tokens 93 tokens +60%
Average across all tests 352 tokens 544 tokens +54.5%

That's a 54.5% token overhead on average. For longer, more complex prompts, it gets worse — the code review prompt hit +63%.

May 2026 correction: cl100k_base is OpenAI's tokenizer, not Claude's — using it here was sloppy. I re-ran equivalent prompt pairs through Claude's actual tokenizer (the count_tokens API on claude-sonnet-4-6) and got +50.0% overhead. The original number was measured with the wrong tool but landed close. The ~50% Spanish token tax is real.

Why is the gap so large?

BPE tokenizers build their vocabulary from training data, which is predominantly English. Common English words get their own tokens. Spanish words often get split into subword pieces:

authentication 1 token
autenticación 3 tokens
deployment 1 token
despliegue 4 tokens
configuration 1 token
configuración 3 tokens
database 1 token
base de datos 3 tokens
environment 1 token
entorno 2 tokens
backward compatibility 2 tokens
compatibilidad hacia atrás 5 tokens

Notice that middleware stays at 1 token in both languages — technical terms borrowed directly from English don't pay the tax. This is actually important: the more technical your prompt, the more English loanwords it contains, and the smaller the gap becomes.


Part 2: The Context Window Reality

In a typical Claude Code session, your prompts are a tiny fraction of the total context. Here's what actually fills the context window:

System
~35% — English (system prompt, tools)
Code
~40% — English (file contents, diffs)
Tool results
~15% — English (bash, grep)
You
~10% — Your language

Your messages typically account for ~10% of the context. A 54% overhead on 10% of context is a ~5.5% increase in total token usage. That's the real number.

54% sounds terrifying. 5.5% sounds manageable. Both numbers are correct — the difference is what you measure against.


Part 3: Does Language Affect Quality?

This is the question that actually matters. Tokens are money, but quality is everything.

For code generation: no meaningful difference

Claude generates the same Go, TypeScript, or Python regardless of whether you asked in English or Spanish. The code output is language-agnostic. Variable names, function signatures, and architecture decisions don't change based on prompt language.

Consider these two equivalent prompts:

English

Add pagination to the list endpoint. Use cursor-based
pagination with a default page size of 20.
Spanish

Añade paginación al endpoint de listado. Usa paginación
basada en cursor con un tamaño de página por defecto de 20.

Both produce identical code. The generated function names, the SQL queries, the response structs — all the same. Claude understands the intent regardless of the language wrapping it.

For reasoning tasks: marginal English advantage

Academic benchmarks like MMLU and GSM8K show a small advantage (2-5%) for English prompts on reasoning-heavy tasks. This is expected — the model has seen more chain-of-thought reasoning examples in English during training.

But here's the catch: those benchmarks test the model's language, not yours. When you write in Spanish, Claude still reasons internally in whatever representation it uses and only translates the final output. You're not forcing it to "think in Spanish."

For instruction clarity: native language wins

This is where the real difference lives. I tested my own typing speed and found I write ~30% faster in Spanish than in English. Not just faster — also more accurately. When I tried writing a quick note in English to test this, I immediately produced a typo. In Spanish? Clean on the first try.

When I tried writing prompts in English:

When I write in Spanish, I express exactly what I mean, with all the nuance, qualifications, and edge cases. A prompt that takes me 5 seconds in Spanish takes 8-10 in English — and the Spanish version is usually more precise and typo-free.

The best prompt isn't the one with fewest tokens. It's the one that most accurately describes what you want. Native language gives you that.


Part 4: The Numbers — and the Day I Switched Languages

This is the section I rewrote in May 2026. The original version listed some big round numbers I couldn't reproduce from my actual logs, so here are the verified aggregates straight from Claude Code's usage stats (2025-12-23 → 2026-05-06, 108 active days):

But the real discovery wasn't in the totals — it was in the timestamps. I'd assumed I was a steady "60/40 Spanish-English" user. I'm not. I classified all ~23,600 typed prompts by language and bucketed them by month:

Sep '25
96% English
Oct '25
100% English
Nov '25
98% English
Dec '25
98% English
Jan '26
65% English — the switch
Feb '26
93% Spanish
Mar '26
95% Spanish
Apr '26
95% Spanish
May '26
96% Spanish

Through December 2025 I wrote almost everything in English. Then, over January–February 2026, I flipped to almost entirely Spanish — and stayed there. The "60/40 blend" I imagined doesn't exist; it was an average masking two opposite eras. (Which also means that when I first published this in March, my "all in Spanish for years" framing was wrong — I'd only just switched.)

This accidentally turned my post into something rarer than a benchmark: a real before/after on the same human, same workflow, just a different prompt language. So I compared the two eras.

What changed when I switched to Spanish

Metric English era Spanish era Change
Words per prompt (median) 12 9 −25%
Words per prompt (mean) 22.9 12.9 −44%
Correction / "no, do it differently" rate 6.3 / 100 7.9 / 100 +1.6 pts

The clean signal is the first one: in Spanish I write 44% shorter prompts. That's exactly the "lower friction in my native language" effect from Part 3, now visible across 13,000 prompts — I say what I mean with less effort.

The correction rate ticked up slightly, but I don't read that as "Claude understands my Spanish worse," for three reasons: (1) the model family also changed between eras, so it's confounded; (2) shorter prompts carry less upfront detail, so I naturally iterate more instead of front-loading — a workflow shift, not a misunderstanding; and (3) it spiked the month I switched (a clumsy-adaptation month) and then declined back toward the English-era baseline. Language wasn't the bottleneck — clarity is, and the failure modes (ambiguous requirements, missing context, underspecified edge cases) are identical in both languages.


Part 5: The Real Cost Calculation

Let's put actual numbers on this. With Claude Sonnet at $3/M input tokens:

Scenario English Spanish Extra Cost
Single prompt (avg) 35 tokens 54 tokens $0.000057
Typical session (20 prompts) 700 tokens 1,080 tokens $0.00114
Heavy day (200 prompts) 7,000 tokens 10,800 tokens $0.0114
Month of intense usage 140K tokens 216K tokens $0.228

The Spanish "tax" on my input prompts costs roughly 23 cents per month. That's less than a cent per working day. And remember: this only affects your messages, not the system prompt, tool results, or code content that make up the bulk of token usage.

With Claude Code's subscription model, this is even less relevant — you're paying a flat rate regardless of token count.


Part 6: What the Research Actually Says

After writing the first version, I went looking for whether anyone had studied this properly. They have — but almost always as controlled benchmarks (translate the same task into both languages, hold the human constant), never as a real developer's longitudinal usage. The literature splits into three strands that look contradictory until you notice which case each one measures:

For code specifically, there's measured "linguistic bias" (e.g. a prompt in one language yielding an O(n²) solution where English gave O(n·log n)), and code-mixed prompts can hurt — but both effects shrink toward zero on large frontier models.

The reasoning gap favoring English is genuine — but it collapses exactly where I sit: a high-resource language (Spanish), a code output (the most transferable domain), and a frontier model. And every one of these studies measures the model while holding the human fixed — so none of them can even see my biggest win: the 44% drop in prompt length.

References: Native Design Bias, Native vs Non-Native Prompting, Better to Ask in English, Linguistic Bias in Code Generation, CodeMixBench, How NL Proficiency Shapes GenAI Code, llm-language-token-tax. Full list in the data appendix.


The Verdict

Native Language Advantages

  • Think faster, prompt faster
  • More precise instructions
  • Better architectural discussions
  • Lower cognitive overhead
  • Natural expression of edge cases

English Advantages

  • ~50% fewer tokens on prompts
  • 2-5% better on reasoning benchmarks
  • Marginal edge on rare/niche topics
  • Easier to share prompts with English-speaking teams

Use your native language.

The token overhead is real but irrelevant at the scale of actual usage. The quality difference is measurable in benchmarks but invisible in practice. The clarity advantage of thinking in your own language is significant and compounds over thousands of interactions.

Two caveats. First: if you're writing prompts to be shared as templates with an English-speaking team, write those in English. Second, and more honestly: this advice is strongest for my exact situation — a high-resource language, code output, and a frontier model. If you speak a low-resource language and lean on the model for heavy pure reasoning, the gap favoring English is larger and the trade-off is real. But for daily coding work — debugging, architecture, "fix this bug" — use whatever language lets you think fastest.

That's what I do. ~23,000 prompts in, and I'm not switching back.


Analysis by Jairo Caro-Accino. Originally March 2026; data audited and updated May 2026. Token counts measured with Claude's count_tokens API (and cl100k_base for the original table). Session and prompt data from local Claude Code logs (history.jsonl, usage stats). Full methodology in the data appendix.