Should You Talk to Claude in English?

TL;DR

Spanish prompts cost ~50% more tokens than English, but the quality difference is negligible for coding tasks. Using your native language lets you think faster and give clearer instructions — that matters more than token savings.

Update, May 2026. I went back and audited every number in this post against my raw local data (history.jsonl and the usage stats cache). Two things changed: I replaced some aggregate figures that hadn't been computed from real data, and I found a much better story hiding in the timestamps — I didn't gently drift toward Spanish, I switched, almost overnight, in early 2026. The new data is in Part 4, the full methodology and every raw number live in the data appendix, and the original conclusion still holds.

The Setup

I'm a Spanish-speaking software engineer who uses Claude Code as my primary development tool. Over the past months I've accumulated ~2,000 sessions and 620K+ messages of real usage. My workflow is heavily execution-oriented: long debugging sessions, architecture discussions, code reviews. (I originally wrote "all in Spanish" here — the data in Part 4 tells a more interesting truth.)

At some point I asked myself the obvious question: should I be writing all these prompts in English instead? English is the lingua franca of programming. Most training data is in English. Every benchmark is in English. Surely I'm leaving performance on the table?

I decided to find out.

Part 1: The Token Tax

LLMs don't see words — they see tokens. And tokenizers are heavily optimized for English. The word authentication is a single token. Its Spanish equivalent, autenticación, is three tokens.

I ran equivalent prompts through Claude's tokenizer (cl100k_base proxy) to measure the real cost. These are actual prompts I use daily in Claude Code:

Prompt Type	English	Spanish	Overhead
Bug fix instruction	20 tokens	31 tokens	+55%
API endpoint creation	32 tokens	46 tokens	+44%
Refactoring task	26 tokens	41 tokens	+58%
Documentation generation	33 tokens	52 tokens	+58%
DevOps debugging	37 tokens	58 tokens	+57%
Code review (long)	65 tokens	106 tokens	+63%
Architecture discussion	58 tokens	93 tokens	+60%
Average across all tests	352 tokens	544 tokens	+54.5%

That's a 54.5% token overhead on average. For longer, more complex prompts, it gets worse — the code review prompt hit +63%.

May 2026 correction: cl100k_base is OpenAI's tokenizer, not Claude's — using it here was sloppy. I re-ran equivalent prompt pairs through Claude's actual tokenizer (the count_tokens API on claude-sonnet-4-6) and got +50.0% overhead. The original number was measured with the wrong tool but landed close. The ~50% Spanish token tax is real.

Why is the gap so large?

BPE tokenizers build their vocabulary from training data, which is predominantly English. Common English words get their own tokens. Spanish words often get split into subword pieces:

authentication 1 token

autenticación 3 tokens

deployment 1 token

despliegue 4 tokens

configuration 1 token

configuración 3 tokens

database 1 token

base de datos 3 tokens

environment 1 token

entorno 2 tokens

backward compatibility 2 tokens

compatibilidad hacia atrás 5 tokens

Notice that middleware stays at 1 token in both languages — technical terms borrowed directly from English don't pay the tax. This is actually important: the more technical your prompt, the more English loanwords it contains, and the smaller the gap becomes.

Part 2: The Context Window Reality

In a typical Claude Code session, your prompts are a tiny fraction of the total context. Here's what actually fills the context window:

System

~35% — English (system prompt, tools)

Code

~40% — English (file contents, diffs)

Tool results

~15% — English (bash, grep)

You

~10% — Your language

Your messages typically account for ~10% of the context. A 54% overhead on 10% of context is a ~5.5% increase in total token usage. That's the real number.

54% sounds terrifying. 5.5% sounds manageable. Both numbers are correct — the difference is what you measure against.

Part 3: Does Language Affect Quality?

This is the question that actually matters. Tokens are money, but quality is everything.

For code generation: no meaningful difference

Claude generates the same Go, TypeScript, or Python regardless of whether you asked in English or Spanish. The code output is language-agnostic. Variable names, function signatures, and architecture decisions don't change based on prompt language.

Consider these two equivalent prompts:

English

Add pagination to the list endpoint. Use cursor-based
pagination with a default page size of 20.

Spanish

Añade paginación al endpoint de listado. Usa paginación
basada en cursor con un tamaño de página por defecto de 20.

Both produce identical code. The generated function names, the SQL queries, the response structs — all the same. Claude understands the intent regardless of the language wrapping it.

For reasoning tasks: marginal English advantage

Academic benchmarks like MMLU and GSM8K show a small advantage (2-5%) for English prompts on reasoning-heavy tasks. This is expected — the model has seen more chain-of-thought reasoning examples in English during training.

But here's the catch: those benchmarks test the model's language, not yours. When you write in Spanish, Claude still reasons internally in whatever representation it uses and only translates the final output. You're not forcing it to "think in Spanish."

For instruction clarity: native language wins

This is where the real difference lives. I tested my own typing speed and found I write ~30% faster in Spanish than in English. Not just faster — also more accurately. When I tried writing a quick note in English to test this, I immediately produced a typo. In Spanish? Clean on the first try.

When I tried writing prompts in English:

I spent more time crafting the prompt instead of describing what I wanted
I occasionally used ambiguous phrasing that I wouldn't use in Spanish
Complex architectural discussions lost nuance
I defaulted to simpler descriptions to avoid language friction
More typos, which sometimes confused the model

When I write in Spanish, I express exactly what I mean, with all the nuance, qualifications, and edge cases. A prompt that takes me 5 seconds in Spanish takes 8-10 in English — and the Spanish version is usually more precise and typo-free.

The best prompt isn't the one with fewest tokens. It's the one that most accurately describes what you want. Native language gives you that.

Part 4: The Numbers — and the Day I Switched Languages

This is the section I rewrote in May 2026. The original version listed some big round numbers I couldn't reproduce from my actual logs, so here are the verified aggregates straight from Claude Code's usage stats (2025-12-23 → 2026-05-06, 108 active days):

621,987 messages exchanged
1,980 sessions
141,279 tool calls (across all tools, not just bash)
~23,600 prompts I personally typed, recorded in my prompt history since September 2025

But the real discovery wasn't in the totals — it was in the timestamps. I'd assumed I was a steady "60/40 Spanish-English" user. I'm not. I classified all ~23,600 typed prompts by language and bucketed them by month:

Sep '25

96% English

Oct '25

100% English

Nov '25

98% English

Dec '25

98% English

Jan '26

65% English — the switch

Feb '26

93% Spanish

Mar '26

95% Spanish

Apr '26

95% Spanish

May '26

96% Spanish

Through December 2025 I wrote almost everything in English. Then, over January–February 2026, I flipped to almost entirely Spanish — and stayed there. The "60/40 blend" I imagined doesn't exist; it was an average masking two opposite eras. (Which also means that when I first published this in March, my "all in Spanish for years" framing was wrong — I'd only just switched.)

This accidentally turned my post into something rarer than a benchmark: a real before/after on the same human, same workflow, just a different prompt language. So I compared the two eras.

What changed when I switched to Spanish

Metric	English era	Spanish era	Change
Words per prompt (median)	12	9	−25%
Words per prompt (mean)	22.9	12.9	−44%
Correction / "no, do it differently" rate	6.3 / 100	7.9 / 100	+1.6 pts

The clean signal is the first one: in Spanish I write 44% shorter prompts. That's exactly the "lower friction in my native language" effect from Part 3, now visible across 13,000 prompts — I say what I mean with less effort.

The correction rate ticked up slightly, but I don't read that as "Claude understands my Spanish worse," for three reasons: (1) the model family also changed between eras, so it's confounded; (2) shorter prompts carry less upfront detail, so I naturally iterate more instead of front-loading — a workflow shift, not a misunderstanding; and (3) it spiked the month I switched (a clumsy-adaptation month) and then declined back toward the English-era baseline. Language wasn't the bottleneck — clarity is, and the failure modes (ambiguous requirements, missing context, underspecified edge cases) are identical in both languages.

Part 5: The Real Cost Calculation

Let's put actual numbers on this. With Claude Sonnet at $3/M input tokens:

Scenario	English	Spanish	Extra Cost
Single prompt (avg)	35 tokens	54 tokens	$0.000057
Typical session (20 prompts)	700 tokens	1,080 tokens	$0.00114
Heavy day (200 prompts)	7,000 tokens	10,800 tokens	$0.0114
Month of intense usage	140K tokens	216K tokens	$0.228

The Spanish "tax" on my input prompts costs roughly 23 cents per month. That's less than a cent per working day. And remember: this only affects your messages, not the system prompt, tool results, or code content that make up the bulk of token usage.

With Claude Code's subscription model, this is even less relevant — you're paying a flat rate regardless of token count.

Part 6: What the Research Actually Says

After writing the first version, I went looking for whether anyone had studied this properly. They have — but almost always as controlled benchmarks (translate the same task into both languages, hold the human constant), never as a real developer's longitudinal usage. The literature splits into three strands that look contradictory until you notice which case each one measures:

English wins on reasoning benchmarks. A study on Arabic vs English prompting found English won every task (7–9%), even on an Arabic-native model. Big multilingual suites show the same, with gaps up to +20 points — but concentrated in low-resource languages and pure-reasoning tasks. Crucially, code knowledge is the dimension that transfers best across languages.
Native language wins for native-language output. "Match the prompt to the content" studies report up to +50% on extraction/localization — but only when you're generating text in that language. Not my case: my output is code.
The token tax is real. Independent measurements put Spanish at ~25–60% more tokens (Arabic/Japanese 3×+). That matches my +50%.

For code specifically, there's measured "linguistic bias" (e.g. a prompt in one language yielding an O(n²) solution where English gave O(n·log n)), and code-mixed prompts can hurt — but both effects shrink toward zero on large frontier models.

The reasoning gap favoring English is genuine — but it collapses exactly where I sit: a high-resource language (Spanish), a code output (the most transferable domain), and a frontier model. And every one of these studies measures the model while holding the human fixed — so none of them can even see my biggest win: the 44% drop in prompt length.

References: Native Design Bias, Native vs Non-Native Prompting, Better to Ask in English, Linguistic Bias in Code Generation, CodeMixBench, How NL Proficiency Shapes GenAI Code, llm-language-token-tax. Full list in the data appendix.

The Verdict

Native Language Advantages

Think faster, prompt faster
More precise instructions
Better architectural discussions
Lower cognitive overhead
Natural expression of edge cases

English Advantages

~50% fewer tokens on prompts
2-5% better on reasoning benchmarks
Marginal edge on rare/niche topics
Easier to share prompts with English-speaking teams

Use your native language.

The token overhead is real but irrelevant at the scale of actual usage. The quality difference is measurable in benchmarks but invisible in practice. The clarity advantage of thinking in your own language is significant and compounds over thousands of interactions.

Two caveats. First: if you're writing prompts to be shared as templates with an English-speaking team, write those in English. Second, and more honestly: this advice is strongest for my exact situation — a high-resource language, code output, and a frontier model. If you speak a low-resource language and lean on the model for heavy pure reasoning, the gap favoring English is larger and the trade-off is real. But for daily coding work — debugging, architecture, "fix this bug" — use whatever language lets you think fastest.

That's what I do. ~23,000 prompts in, and I'm not switching back.

Analysis by Jairo Caro-Accino. Originally March 2026; data audited and updated May 2026. Token counts measured with Claude's count_tokens API (and cl100k_base for the original table). Session and prompt data from local Claude Code logs (history.jsonl, usage stats). Full methodology in the data appendix.