feat: benchmark /translate skill — 200 keys regenerated across 9 languages #12028

Draft
premiumjibles wants to merge 9 commits into develop from benchmark/translate-200-keys

@premiumjibles
Collaborator

Description

Benchmark of the /translate skill pipeline quality. Removes 200 existing translated keys from all 9 non-English locales and regenerates them using the automated translate-review-refine pipeline. The diff shows original (human) translations vs. AI-generated translations for expert comparison.

Key Selection (200 keys, stratified sampling)

| Category | Count | Criteria |
|---|---|---|
| Short strings | 40 | 1-3 words, single UI labels |
| Multiple placeholders | 50 | 2+ %{variable} placeholders |
| Tagged (HTML markup) | 7 | Contains <span>, <strong>, <link> etc. (only 7 exist) |
| Long/complex | 30 | 15+ words, full sentences |
| Crypto domain | 30 | Staking, liquidity, vault, swap, slippage, yield, etc. |
| General (random) | 43 | Random selection for broad coverage |

Keys distributed across 31 top-level namespaces for realistic coverage.
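
The category criteria in the table above can be sketched as a small classifier. This is only an illustration: `classifyKey`, the term list, and the precedence order for keys that fit several categories are assumptions, since the PR does not say how overlapping keys were assigned.

```javascript
// Hypothetical classifier for the sampling categories listed above.
// Assumption: tagged > multiple placeholders > crypto > long > short > general.
const CRYPTO_TERMS = ['staking', 'liquidity', 'vault', 'swap', 'slippage', 'yield'];

function classifyKey(english) {
  const words = english.trim().split(/\s+/).filter(Boolean);
  const placeholders = english.match(/%\{\w+\}/g) || [];
  const lower = english.toLowerCase();

  // HTML-like markup such as <strong>…</strong>
  if (/<\/?[a-zA-Z][^>]*>/.test(english)) return 'tagged';
  // Two or more %{variable} placeholders
  if (placeholders.length >= 2) return 'multiple-placeholders';
  // Any crypto-domain term from the (illustrative) list
  if (CRYPTO_TERMS.some(term => lower.includes(term))) return 'crypto-domain';
  if (words.length >= 15) return 'long';
  if (words.length <= 3) return 'short';
  return 'general';
}
```

A stratified sample would then draw the target count from each bucket, with the "General" bucket filled by random selection.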

Pipeline Results

All 1800 translations (200 × 9 locales) passed validation:

| Locale | Status | Translated | Manual Review |
|---|---|---|---|
| de (German) | Success | 200 | 0 |
| es (Spanish) | Success | 200 | 0 |
| fr (French) | Success | 200 | 0 |
| ja (Japanese) | Success | 200 | 0 |
| pt (Portuguese BR) | Success | 200 | 0 |
| ru (Russian) | Success | 200 | 0 |
| tr (Turkish) | Success | 200 | 0 |
| uk (Ukrainian) | Success | 200 | 0 |
| zh (Chinese Simplified) | Success | 200 | 0 |

How to Review

Each locale's diff shows the removed original translations (red) vs. regenerated translations (green). Language experts should evaluate:

  1. Accuracy — Does the translation convey the same meaning?
  2. Naturalness — Does it sound native?
  3. Terminology — Are crypto terms translated consistently?
  4. Register — Is formal/informal address used correctly?
  5. Placeholders/Tags — Are %{variables} and HTML tags preserved?
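
Item 5 is the one criterion that lends itself to a mechanical check. A minimal sketch, assuming placeholders always have the form `%{name}` and that translations may reorder them freely (`extractTokens` and `tokensPreserved` are illustrative names, not part of the skill's scripts):

```javascript
// Collect %{name} placeholders and HTML-like tags. The result is sorted so
// that a translation which reorders tokens still compares equal.
function extractTokens(text) {
  const placeholders = text.match(/%\{\w+\}/g) || [];
  const tags = text.match(/<\/?[a-zA-Z][^>]*>/g) || [];
  return [...placeholders, ...tags].sort();
}

// True when the translation carries exactly the same tokens as the source.
function tokensPreserved(source, translation) {
  const expected = extractTokens(source);
  const actual = extractTokens(translation);
  return (
    expected.length === actual.length &&
    expected.every((token, i) => token === actual[i])
  );
}
```

This only confirms that the tokens are present; whether they sit in a grammatical position still needs a human reviewer.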

Ground truth (original translations) saved at /tmp/benchmark-ground-truth.json for automated comparison.
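
One possible automated comparison is a per-locale exact-match rate. The sketch below assumes both inputs have the shape `{ locale: { key: translation } }` with flat keys; the actual layout of the ground-truth file is not described in this PR, so treat that shape and the name `exactMatchRate` as hypothetical.

```javascript
// Fraction of keys per locale whose regenerated translation is identical to
// the original. Assumed input shape: { de: { 'some.key': 'text', ... }, ... }
function exactMatchRate(groundTruth, regenerated) {
  const rates = {};
  for (const [locale, originals] of Object.entries(groundTruth)) {
    const keys = Object.keys(originals);
    const candidate = regenerated[locale] || {};
    const same = keys.filter(key => candidate[key] === originals[key]).length;
    rates[locale] = keys.length === 0 ? 0 : same / keys.length;
  }
  return rates;
}
```

Exact match is a crude signal: a low rate does not mean the new translations are worse, only that they differ and need expert review.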

Issue (if applicable)

N/A — internal benchmark

Risk

Zero risk. This is a draft PR for evaluation only, not intended for merge. Translation-only changes, no code modifications.

No protocols, transactions, wallets or contracts affected.

Testing

Engineering

  1. Verify all locale JSON files are valid: node .claude/skills/translate/scripts/validate-file.js {locale} for each of de, es, fr, ja, pt, ru, tr, uk, zh
  2. Compare diff line counts — should be ~1235 insertions and ~1235 deletions (1:1 replacement)

Operations

  • 🏁 My feature is behind a flag and doesn't require operations testing (yet)

This is a translation-only benchmark PR. No functional changes to test.

Screenshots (if applicable)

N/A

premiumjibles and others added 9 commits February 24, 2026 07:34
Refactor translation pipeline so each per-language sub-agent owns its
full lifecycle (translate → validate → retry → review → refine → merge
→ verify) instead of the orchestrator managing all steps across 9
languages. Reduces orchestrator to a lightweight coordinator that spawns
agents and reads status files.

- Extract shared script-detection utilities into script-utils.js
- Refactor validate.js to import from script-utils.js (no behavior change)
- Add validate-file.js for post-merge full-file validation (JSON validity,
  key completeness, aggregate script ratio, regression detection)
- Simplify merge.js: remove duplicate script-validation, add pre-merge
  backup for rollback support
- Rewrite SKILL.md Steps 5-8 for self-contained language agent architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
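
Two of the checks attributed to validate-file.js above, JSON validity and key completeness, could look roughly like this. It is a sketch under assumptions, not the script itself: `flattenKeys` and `checkLocale` are illustrative names, and the script-ratio and regression checks are left out.

```javascript
// Flatten nested translation objects into dotted keys,
// e.g. { a: { b: 'x' } } -> ['a.b'].
function flattenKeys(obj, prefix = '') {
  return Object.entries(obj).flatMap(([key, value]) => {
    const path = prefix ? `${prefix}.${key}` : key;
    return value !== null && typeof value === 'object'
      ? flattenKeys(value, path)
      : [path];
  });
}

// Parse the locale file text and list English keys it does not contain.
// Returns { valid, missing } plus an `error` message when parsing fails.
function checkLocale(english, localeText) {
  let locale;
  try {
    locale = JSON.parse(localeText);
  } catch (err) {
    return { valid: false, missing: [], error: err.message };
  }
  if (locale === null || typeof locale !== 'object') {
    return { valid: false, missing: [], error: 'Top-level value is not an object' };
  }
  const localeKeys = new Set(flattenKeys(locale));
  const missing = flattenKeys(english).filter(key => !localeKeys.has(key));
  return { valid: true, missing };
}
```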
Translate 11 missing English strings into de, es, fr, ja, pt, ru, tr,
uk, zh using the new /translate Claude Code skill. Covers RFOX FAQ
entries, action center failure messages, and yield cooldown notices.

Also fixes merge.js to only add new keys by default, never overwriting
existing translations. A --force flag is available for intentional
re-translation of changed English strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
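
The add-only default with an opt-in overwrite described above can be sketched as a recursive merge. `mergeTranslations` and its `force` option are illustrative stand-ins for the behavior of merge.js and its --force flag, not its actual code.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Merge `incoming` into `existing` without mutating either input.
// Existing leaf values are kept unless `force` is true.
function mergeTranslations(existing, incoming, { force = false } = {}) {
  const result = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    const current = result[key];
    if (isPlainObject(value) && isPlainObject(current)) {
      // Recurse into shared namespaces so sibling keys survive.
      result[key] = mergeTranslations(current, value, { force });
    } else if (current === undefined || force) {
      result[key] = value;
    }
  }
  return result;
}
```

With the default options, a key that already has a translation is left alone even if the incoming batch contains a different value for it.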
- Fix glossary key mismatch in compile-report.js (disambiguated keys
  didn't match actual glossary.json keys, silently skipping 4 checks)
- Fix mixed Latin/Cyrillic in ru.md locale guide (vы → вы)
- Fix fragile file-path detection in merge.js (use fs.existsSync instead
  of includes('/'), add missing-arg guard and JSON.parse try/catch)
- Add try/catch in missing-keys.js for corrupt/missing locale files
- Add French elision rule to fr.md: use "de" when numeric %{amount}
  buffers the symbol, use "en" when symbol placeholder is directly
  after the preposition (avoids runtime elision ambiguity)
- Retranslate French yield/unstake strings applying the new rule:
  "déstaking de %{symbol}" → "déstake en %{symbol}"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
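
The merge.js hardening in this commit (existence check, missing-argument guard, guarded parse) might be sketched as follows. The assumption here is that the argument is either a path to a JSON file or an inline JSON string; `loadJsonArg` is a hypothetical name.

```javascript
const fs = require('fs');

// Accept either a path to a JSON file or an inline JSON string.
// fs.existsSync avoids guessing from the presence of '/' in the argument,
// which misfires on JSON values that happen to contain a slash.
function loadJsonArg(arg) {
  if (!arg) {
    throw new Error('Missing argument: expected a file path or a JSON string');
  }
  const text = fs.existsSync(arg) ? fs.readFileSync(arg, 'utf8') : arg;
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new Error(`Invalid JSON input: ${err.message}`);
  }
}
```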
Ukrainian: в/у and з/із/зі preposition alternation rules for dynamic
placeholders where runtime values are unknown at translation time.

Turkish: vowel harmony rules for dynamic placeholders — prefer
postpositions over direct suffixes on placeholders since crypto symbols
span all vowel classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
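
The Turkish rule above (no suffixes directly on placeholders) can be approximated with a lint-style check. This is a heuristic sketch, not part of the locale guides: the pattern and the name `findAttachedSuffixes` are assumptions, and it cannot judge whether a flagged suffix would actually violate vowel harmony.

```javascript
// Matches a %{placeholder} followed immediately, or via an apostrophe, by
// lowercase letters: a sign that a case suffix is fused to a runtime value.
const ATTACHED_SUFFIX = /%\{\w+\}['’]?\p{Ll}+/gu;

function findAttachedSuffixes(text) {
  return text.match(ATTACHED_SUFFIX) || [];
}
```

A string such as `%{symbol}'yi gönder` would be flagged, while a postposition form such as `%{symbol} için` would pass.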
- Add register examples to all 9 locale files (de, es, fr, ja, pt, ru, tr, uk, zh) with correct/incorrect pairs for non-pronoun register markers
- Add register consistency as 6th reviewer focus in SKILL.md
- Add "Multichain Snap" and "Snap" to glossary never-translate list
- Fix 2 broken community translations across all 9 locales (stale multiChain.body, missing %{symbol} in getAssets.about)
- Update compile-report.js to use stemMatch instead of raw .includes() for glossary metrics
- Improve stemMatch with language-aware morphological matching (suffix stripping, Levenshtein distance, CJK character overlap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
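
To illustrate why fuzzy matching beats a raw .includes() for inflected glossary terms, here is a minimal Levenshtein-based sketch. It is not the stemMatch implementation: `fuzzyTermMatch` and the distance threshold of 2 are assumptions, and the suffix-stripping and CJK-overlap strategies mentioned above are not covered.

```javascript
// Classic dynamic-programming Levenshtein distance using one rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // value at (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value at (i-1, j)
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// True if any word in `text` is within `maxDistance` edits of `term`.
function fuzzyTermMatch(term, text, maxDistance = 2) {
  const target = term.toLowerCase();
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean)
    .some(word => levenshtein(word, target) <= maxDistance);
}
```

A fixed threshold is too permissive for very short terms, which is one reason a production matcher would likely scale it with term length.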
Remove 200 existing translated keys from all 9 non-English locales
and regenerate them using the /translate skill pipeline. This creates
a diff where language experts can compare original vs. AI-generated
translations for quality benchmarking.

Key selection: 40 short, 50 multi-placeholder, 7 tagged, 30 long,
30 crypto-domain, 43 general — distributed across 31 namespaces.

All 1800 translations (200 × 9 locales) passed validation with
0 rejections and 0 manual review items.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Base automatically changed from agent-translations to develop February 25, 2026 22:29