Description
I’m exporting a DeBERTa-v2 tokenizer to ONNX and seeing mismatches between the Hugging Face tokenizers output and the ONNX tokenizer output, especially around special Unicode characters. After digging into the ORTX code path that loads the tokenizer normalizer, it looks like only the Precompiled normalizer’s precompiled_charsmap is extracted and applied, while the other normalization steps in a Sequence normalizer are ignored.
In the tokenizer.json, the normalizer is a Sequence containing multiple steps:
"normalizer": {
  "type": "Sequence",
  "normalizers": [
    {
      "type": "Strip",
      "strip_left": true,
      "strip_right": true
    },
    {
      "type": "Precompiled",
      "precompiled_charsmap": ....
    },
    {
      "type": "Replace",
      "pattern": {
        "Regex": " {2,}"
      },
      "content": " "
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": ""
    },
    {
      "type": "NFKC"
    },
    {
      "type": "Replace",
      "pattern": {
        "String": " "
      },
      "content": ""
    },
    {
      "type": "NFKC"
    }
  ]
}

However, in operators/tokenizer/ugm_kernels.hpp, the normalizer parsing code appears to only look for and load the first type == "Precompiled" charsmap, and does not interpret/execute the remaining Sequence steps (e.g., Strip, Replace, NFKC):
OrtxStatus LoadCharsMap(const json& j_vocab) {
  auto normalizer = j_vocab.find("normalizer");
  std::string charsmap;
  if (normalizer != j_vocab.end()) {
    auto iter = normalizer->find("precompiled_charsmap");
    if (iter != normalizer->end()) {
      charsmap = iter->get<std::string>();
    } else {
      auto iter = normalizer->find("normalizers");  // v2 schema
      if (iter != normalizer->end()) {
        for (const auto& normalizer : iter->items()) {
          if (normalizer.value().contains("type")) {
            auto type = normalizer.value()["type"].get<std::string>();
            if (type == "Precompiled") {
              charsmap = normalizer.value()["precompiled_charsmap"].get<std::string>();
              break;
            }
          }
        }
      }
    }
  }

This seems likely to cause mismatches for tokenizers that depend on other normalizers in the sequence for Unicode/whitespace handling.
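For reference, here is a minimal pure-Python sketch of what the skipped Sequence steps do, in the order they appear in the tokenizer.json above. This is illustrative only: the Precompiled charsmap step is omitted (it needs the compiled table), and the String-Replace steps are omitted because the exact character they target is ambiguous in the pasted JSON. The function name is mine, not from the ORTX codebase:

```python
import re
import unicodedata

def normalize_like_sequence(text: str) -> str:
    """Illustrative sketch of the Sequence normalizer steps ORTX skips.

    Omits the Precompiled charsmap step and the String-Replace steps
    (the replaced character does not render unambiguously above).
    """
    text = text.strip()                 # "Strip" with strip_left/strip_right
    text = re.sub(r" {2,}", " ", text)  # "Replace" {"Regex": " {2,}"} -> " "
    text = unicodedata.normalize("NFKC", text)  # "NFKC"
    return text

print(normalize_like_sequence("  Emoji 🙂  and symbols © ™ — …  "))
# Emoji 🙂 and symbols © TM — ...
```

Running only the Precompiled step, as the ORTX loader does, never performs the strip, whitespace collapse, or NFKC folding shown here.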
Test case
Mismatch for: 'Emoji 🙂 and symbols © ™ — …'
Field: input_ids
REF : [[ 1 416 95267 8692 1964 123024 264 5188 2418 2015
303 2 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0]]
ONNX: [[ 1 416 95267 260 8692 306 23967 264 4334 260 3 662
260 3 2 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0]]
REF pieces: ['[CLS]', '▁E', 'moji', '🙂', 'and', 'symbol', 's', '©', 'TM', '—', '...', '[SEP]']
ONNX pieces: ['[CLS]', '▁E', 'moji', '▁', '🙂', '▁and', '▁symbol', 's', '▁©', '▁', '[UNK]', '▁—', '▁', '[UNK]', '[SEP]']
Notably, the ONNX/ORTX output introduces extra whitespace pieces (▁), and maps some symbols (e.g., ™, …) to [UNK], whereas the reference tokenizer produces TM and ....
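The TM and ... pieces in the reference output are consistent with NFKC compatibility folding, which the Sequence includes but the ORTX path never runs. This can be checked directly with Python's stdlib unicodedata:

```python
import unicodedata

# NFKC folds compatibility characters to their ASCII equivalents,
# matching the 'TM' and '...' pieces in the HF reference output.
print(unicodedata.normalize("NFKC", "™"))  # TM
print(unicodedata.normalize("NFKC", "…"))  # ...
```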
As a workaround, I added a StringRegexReplace node to the ONNX graph to handle whitespace normalization. This helped with some of the whitespace-related differences, but it still does not resolve the mismatches for special Unicode characters, which continue to tokenize differently from the HF reference (often surfacing as [UNK] on the ORTX side).
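A quick stdlib check of why a regex-only workaround is partial: collapsing whitespace with the same pattern the tokenizer.json uses leaves the compatibility characters untouched, so they still reach the vocab lookup un-normalized (the string below is just the test sentence from this report):

```python
import re

text = "Emoji 🙂  and symbols © ™ — …"
collapsed = re.sub(r" {2,}", " ", text)  # whitespace workaround only

# '™' and '…' survive the whitespace pass unchanged; without an NFKC
# step they can still end up as [UNK] on the ORTX side.
print("™" in collapsed, "…" in collapsed)  # True True
```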
Could you confirm whether this is intended/known?