fix(openmemory): prevent multilingual simhash collisions across JS and Python #153

Open
zfaustk wants to merge 6 commits into CaviraOSS:main from zfaustk:fix-multilingual-dedup-tokenization
zfaustk commented Mar 31, 2026

Summary

This fixes the multilingual simhash collision path behind #147 across both OpenMemory runtimes.

  • extend tokenization to retain CJK runs instead of dropping them as non-ASCII noise
  • expand CJK tokens into bigrams so distinct Chinese inputs no longer collapse into the same empty token set
  • fall back to a stable text hash when tokenization yields nothing, instead of returning the same all-zero simhash
  • add matching JS and Python regression coverage for Chinese inputs and punctuation-only inputs
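
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: the names `tokenize`, `simhash`, and `_hash64`, the CJK range (`U+4E00` to `U+9FFF`), the 64-bit width, and the use of SHA-256 as the "stable text hash" are all assumptions made for the example.

```python
import hashlib
import re

# Assumed token pattern: ASCII alphanumeric runs, or runs of CJK Unified Ideographs.
_TOKEN_RE = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]+")
_CJK_RE = re.compile(r"[\u4e00-\u9fff]+")


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: keep ASCII words, expand CJK runs into bigrams."""
    tokens: list[str] = []
    for run in _TOKEN_RE.findall(text.lower()):
        if _CJK_RE.fullmatch(run) and len(run) > 1:
            # Overlapping character bigrams, so distinct CJK inputs
            # produce distinct token sets instead of an empty one.
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # ASCII words and single-character CJK runs are kept whole.
            tokens.append(run)
    return tokens


def _hash64(value: str) -> int:
    """Stable 64-bit hash (assumed here: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Hypothetical 64-bit simhash with a fallback for empty token sets."""
    tokens = tokenize(text)
    if not tokens:
        # Fallback: hash the raw text so inputs with no tokens
        # (e.g. punctuation only) do not all share one all-zero fingerprint.
        return _hash64(text)
    weights = [0] * 64
    for token in tokens:
        h = _hash64(token)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    result = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            result |= 1 << bit
    return result
```

Under these assumptions, `tokenize("今天天气")` yields the three bigrams `今天`, `天天`, and `天气`, while `tokenize("!!!")` yields an empty list and `simhash` falls back to hashing the raw string.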

Fixes #147.

Testing

  • cd packages/openmemory-py && pytest -q tests/test_multilingual_dedup.py
  • cd packages/openmemory-js && timeout 20s npx --yes tsx tests/test_multilingual_dedup.ts

Notes

  • The JS regression assertions completed and printed "test_multilingual_dedup.ts passed", but lingering open handles kept the process alive afterward, so I wrapped the check in timeout 20s for this run.
  • I intentionally left packages/openmemory-js/package-lock.json out of this patch because the only change was incidental install churn, not source behavior.

Development

Successfully merging this pull request may close these issues.

[BUG] Chinese text is incorrectly deduplicated in openmemory_store due to ASCII-only tokenization