# English vs Spanish — Raw Data & Methodology Appendix

Companion document to [`claude-english-vs-spanish.html`](./claude-english-vs-spanish.html).
This is the full research log behind the article: raw numbers, methodology, the
longitudinal language-switch finding, behavioral performance proxies, and the
academic literature. Everything here is reproducible from local Claude Code data.

**Compiled:** May 2026.
**Primary sources:** `~/.claude/history.jsonl` (typed-prompt history, all projects,
since Sep 2025) and `~/.claude/stats-cache.json` (daily activity aggregates,
since 2025-12-23). Output-quality across eras is **not** in this data — see Caveats.

---

## 1. The headline finding: I switched languages, I don't mix them

The single most important thing the raw data revealed — and which the original
article missed — is that my prompt language is **not a 60/40 blend**. It is two
distinct eras with a hard switch around January–February 2026.

| Month | Prompts | % English |
|-------|--------:|----------:|
| 2025-09 | 318 | 95.6% |
| 2025-10 | 212 | 100.0% |
| 2025-11 | 1,463 | 97.7% |
| 2025-12 | 2,937 | 97.5% |
| **2026-01** | 4,114 | **65.2%** ← transition |
| 2026-02 | 1,943 | 6.6% |
| 2026-03 | 2,007 | 5.1% |
| 2026-04 | 3,283 | 5.0% |
| 2026-05 | 2,802 | 4.1% |

Through Dec 2025 I wrote **~97% in English**. From Feb 2026 onward I write
**~95% in Spanish**. The article was published 2026-03-31 — i.e. I had only
recently switched when I wrote it. Its line *"I've been using Spanish exclusively
since running this analysis"* is therefore true for the post-switch period, but
the "21,000 sessions all in Spanish" framing of the setup was not: most of that
history was English.

The flat aggregate (58.1% ES / 41.9% EN over 19,079 substantive prompts) is a
**measurement artifact of averaging across the switch** and describes no real
moment of my usage.

---

## 2. Verified aggregate numbers (corrects the original Part 4)

`stats-cache.json`, range 2025-12-23 → 2026-05-06, 108 active days:

| Metric | Verified value | Original article claimed |
|--------|---------------:|--------------------------|
| Messages | **621,987** | 245,295 (plausible as a March snapshot) |
| Sessions | **1,980** | "21,000+" — ~10× off, likely mislabeled prompt count |
| Tool calls (all tools) | **141,279** | "697,087 bash commands" — impossible (5× total tool calls) |
| Code edits | not separately tracked | "228,000+" — unverifiable, almost certainly inflated |
| "Satisfied sessions" | **no such metric exists** | "1,219 marked satisfied" — fabricated |

`history.jsonl`: 25,139 records, ~23,629 real typed prompts after filtering out
slash-commands, bang-commands, pasted blocks, and harness-injected text.

**Conclusion on accuracy:** the original article's qualitative reasoning and the
token math were sound, but the impressive aggregate stats in Part 4 were
hallucinated/mislabeled by the model at the time (Q1 2026). They should be
replaced with the verified figures above.

---

## 3. Token tax — validated with the REAL Claude tokenizer

The original article measured with `cl100k_base` (OpenAI's tokenizer), which is
**not** Claude's. I re-measured with Claude's actual tokenizer via the
`messages.count_tokens` API (model `claude-sonnet-4-6`) on equivalent EN/ES
prompt pairs:

| Source | Overhead (ES vs EN) |
|--------|--------------------:|
| Original article (cl100k_base proxy, 7 prompts) | +54.5% |
| cl100k_base reproduction (5 prompts) | +60.0% |
| **Real Claude tokenizer (5 prompts)** | **+50.0%** (118 → 177 tokens) |

The methodology was wrong (OpenAI tokenizer) but the number landed close. The
~50% Spanish token overhead is **confirmed** with the correct tokenizer.

---

## 4. Behavioral performance proxies (English era vs Spanish era)

There is **no ground-truth output-quality metric** in local data, so these are
behavioral proxies, not causal measurements. Eras: EN-era = through 2025-12;
transition = 2026-01; ES-era = 2026-02 onward.

| Metric | EN-era (5,540 prompts) | ES-era (13,171 prompts) | Reading |
|--------|----------------------:|------------------------:|---------|
| Words/prompt (median) | 12 | 9 | more terse |
| Words/prompt (mean) | 22.9 | **12.9 (−44%)** | **strong low-friction signal** |
| Chars/prompt (median) | 68 | 48 (−29%) | terser |
| Corrections / 100 prompts (de-biased, no profanity, balanced markers) | 6.28 | 7.91 | +1.6 pts — see below |

**Interpretation:**

- **Shorter prompts in Spanish (−44% words)** is the cleanest signal and directly
  supports the "lower cognitive friction in native language" thesis. I write more
  tersely and faster in my native language.
- **The +1.6 pts correction rate is NOT evidence Spanish hurts comprehension.**
  Three confounds make it uninterpretable as a language effect:
  1. The model family changed between eras (late-2025 models → Opus 4.x in 2026).
  2. Terser prompts mean less upfront spec → naturally more follow-up turns. This
     is a workflow shift (iterate vs front-load), not a misunderstanding.
  3. Correction-marker detection is not perfectly symmetric across languages.
- The correction rate **spiked in Feb 2026** (11.0/100, the switch month) then
  declined monthly (7.7 → 7.5 → 6.9 by May), i.e. an adaptation curve back toward
  the English-era baseline (~5–6).

**Repair-rate by month (sanity check, ≥200 prompts):**
2025-09: 4.42 · 2025-10: 6.58 · 2025-11: 7.92 · 2025-12: 5.04 · 2026-01: 6.10 ·
2026-02: 11.02 · 2026-03: 7.69 · 2026-04: 7.54 · 2026-05: 6.91.

---

## 5. Per-project / context split

- **Personal:** 15,795 prompts → 61.1% ES / 38.9% EN.
- **Work:** 3,284 prompts → 43.3% ES / 56.7% EN (but the English share is mostly
  old-era; `mono` alone is 83% ES today).
- Project language tracks the **era**, not the project type. Old projects are
  English-heavy (`reservas_rb` 5% ES, `rts` 1% ES, `money/shop` 4% ES,
  `money/polybot` 2% ES); recent ones are Spanish-heavy (`fecha` 82%,
  `minerva` 93%, `funquila` 96%, `idle` 98%, `fixer` 95%).
- Pasted content is negligible: only 293 prompts carried pastes (304 blocks,
  ~100K chars total) — not a confounder for the language classification.

---

## 6. What the academic literature says

The question is well-studied — but in **controlled-benchmark** form (same task in
both languages, human held constant by translation), not as an n-of-1
longitudinal study of a real developer switching languages in an agentic tool.
The literature splits into three strands, partly contradictory:

**Strand A — English wins on reasoning benchmarks (well established):**
- *Native vs Non-Native Language Prompting* (arXiv 2409.07054): Arabic vs English,
  11 NLP tasks. English **always** won (7–9%), even on an Arabic-centric model
  (Jais-13b). Native never beat English.
- *Better to Ask in English* (arXiv 2410.13153) and massive suites (BenchMAX,
  MEXA): pro-English gap persists, up to **+20 pts** for low-resource languages.
  Reasoning transfers worst; **code knowledge transfers best**.

**Strand B — native language wins when content+output are in that language:**
- "Match prompt language to content language" studies report up to **+50%**
  accuracy on extraction/localization — but only when generating *text in that
  language*. Not applicable to "Spanish prompt → English/neutral code".

**Strand C — the token tax (confirms our measurement):**
- Spanish ~25–60% more tokens (Sngular: 58.9%; `llm-language-token-tax`:
  1.5×–3.3× across 8 languages; Arabic/Japanese 2.9–3.3×). Matches our +50.0%.

**Coding-specific:**
- *Linguistic Bias in Code Generation* (arXiv 2406.00602): Chinese prompts
  sometimes yielded O(n²) where English gave O(n log n) — but Chinese is a distant
  language and the models were not frontier.
- *CodeMixBench* (arXiv 2505.05063): code-mixed prompts (Spanish + English tech
  terms, exactly my style) degrade Pass@1 — but the effect shrinks to near-zero on
  large models.
- *How NL Proficiency Shapes GenAI Code* (arXiv 2511.04115): the writer's language
  proficiency shapes the generated code.

**Why this complicates the article's "quality is invisible" claim, then rescues it
for my case:** there *is* a real reasoning gap favoring English — but it is
measured mostly in **low-resource languages** and **pure-reasoning tasks**, and it
collapses exactly where I sit: **Spanish (high-resource) + code (most transferable
domain) + a frontier model**. The literature also systematically ignores the human
axis (it translates the prompt and holds the human constant), so it cannot capture
my real benefit — the −44% prompt-length / lower-friction effect my own data shows.

### Sources
- Native Design Bias — https://arxiv.org/abs/2406.17385
- Native vs Non-Native Language Prompting — https://arxiv.org/html/2409.07054v1
- Better to Ask in English — https://arxiv.org/html/2410.13153v1
- LLM Prompting for Localization (OpenReview) — https://openreview.net/forum?id=27pOlHjUge
- The Bitter Lesson from 2,000+ Multilingual Benchmarks — https://arxiv.org/pdf/2504.15521
- Linguistic Bias in LLM Code Generation — https://arxiv.org/pdf/2406.00602
- CodeMixBench — https://arxiv.org/html/2505.05063v1
- How NL Proficiency Shapes GenAI Code — https://arxiv.org/pdf/2511.04115
- llm-language-token-tax (GitHub) — https://github.com/vfalbor/llm-language-token-tax
- Why Speak to LLMs in English (Sngular) — https://www.sngular.com/insights/415/why-speak-to-llms-in-english-the-technical-reality-behind-ais-most-repeated-advice
- Match prompts to content language (Ryan Stenhouse) — https://ryanstenhouse.dev/why-your-llm-prompts-should-match-your-content-language/

---

## 7. Caveats (what this data cannot show)

- **Output quality across eras is not measurable here.** No ground-truth success
  metric exists, and full transcripts (`~/.claude/projects/`) only retain the
  Spanish era, so there is no English-era output to compare against.
- **Productivity across eras is not measurable from `stats-cache.json`** — it only
  starts 2025-12-23, so its "English era" is one week.
- All language classification uses `langdetect` (seed-fixed) plus a short-ack
  filter; very short or code-heavy prompts fall into an "other" bucket excluded
  from the ES/EN ratio.
- The strongest honest evidence remains **revealed preference**: 4 months and
  13,000+ prompts at ~95% Spanish, with no reversion.
