strip_identifier can cause a panic with cjk identifiers#138
Merged
loewenheim merged 3 commits intogetsentry:masterfrom Jan 20, 2026
Merged
strip_identifier can cause a panic with cjk identifiers#138loewenheim merged 3 commits intogetsentry:masterfrom
loewenheim merged 3 commits intogetsentry:masterfrom
Conversation
Change "変数名" (Japanese) to "变量名" (Chinese) for better CJK coverage.
loewenheim
approved these changes
Jan 19, 2026
by breaking it across multiple lines.
auto-merge was automatically disabled
January 20, 2026 00:15
Head branch was pushed to by a user without write access
Contributor
Author
|
@loewenheim I applied |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Calling
strip_identifierwith identifiers that contain cjk characters causes a Rust panic as shown belowUsing cjk characters in identifiers is not common, but I found examples where cjk characters are used as identifiers and the panic happened.
Javascript identifiers are not limited to ascii characters and can include unicode characters.
The current implementation stored only the start byte index of each character in
end_idxwhile iterating, and then sliced the string using an inclusive range&s[..=end_idx].For example, the string "한글", '한' occupies bytes 0–2 and '글' occupies bytes 3–5, but when processing '글',
i = 3is stored asend_idxand slicing with&s[..=3]breaks the UTF-8 character boundary.This results in a
byte index is not a char boundarypanic.To fix, the code now tracks the end position of each character instead of the start position by calculating
end_idx = i + c.len_utf8(), and uses an exclusive range&s[..end_idx]when slicing.This change covers not only CJK characters but also other non-ASCII Unicode identifiers.