
DeBERTa-v2 ONNX tokenization mismatch: normalizer sequence ignored, Unicode breaks #1031

@yaoluxun

Description


I’m exporting a DeBERTa-v2 tokenizer to ONNX and seeing mismatches between Hugging Face tokenizers output and the ONNX tokenizer output, especially around special Unicode characters. After digging into the ORTX code path for loading the tokenizer normalizer, it looks like only the Precompiled normalizer’s precompiled_charsmap is extracted/applied, while other normalization steps in a Sequence normalizer are ignored.

In the tokenizer.json, the normalizer is a Sequence containing multiple steps:

"normalizer": {
    "type": "Sequence",
    "normalizers": [
      {
        "type": "Strip",
        "strip_left": true,
        "strip_right": true
      },
      {
        "type": "Precompiled",
        "precompiled_charsmap": ....
      },
      {
        "type": "Replace",
        "pattern": {
          "Regex": " {2,}"
        },
        "content": " "
      },
      {
        "type": "Replace",
        "pattern": {
          "String": " "
        },
        "content": ""
      },
      {
        "type": "NFKC"
      },
      {
        "type": "Replace",
        "pattern": {
          "String": " "
        },
        "content": ""
      },
      {
        "type": "NFKC"
      }
    ]
  }
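As a quick sanity check on this config, the step types in the Sequence can be enumerated with plain `json` (the `precompiled_charsmap` payload is elided here since only the step types matter). This is a minimal sketch illustrating what gets dropped if, as described below, only the `Precompiled` entry is consumed:

```python
import json

# The normalizer block from tokenizer.json, with the precompiled_charsmap
# payload elided; only the step types matter for this check.
normalizer = json.loads("""
{
  "type": "Sequence",
  "normalizers": [
    {"type": "Strip", "strip_left": true, "strip_right": true},
    {"type": "Precompiled", "precompiled_charsmap": ""},
    {"type": "Replace", "pattern": {"Regex": " {2,}"}, "content": " "},
    {"type": "Replace", "pattern": {"String": " "}, "content": ""},
    {"type": "NFKC"},
    {"type": "Replace", "pattern": {"String": " "}, "content": ""},
    {"type": "NFKC"}
  ]
}
""")

# If only the single "Precompiled" entry is loaded, every other step
# type listed in the Sequence is silently dropped.
kept = [s["type"] for s in normalizer["normalizers"] if s["type"] == "Precompiled"]
dropped = [s["type"] for s in normalizer["normalizers"] if s["type"] != "Precompiled"]
print("kept:", kept)      # ['Precompiled']
print("dropped:", dropped)  # ['Strip', 'Replace', 'Replace', 'NFKC', 'Replace', 'NFKC']
```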

However, in operators/tokenizer/ugm_kernels.hpp, the normalizer parsing code appears to look for and load only the first `type == "Precompiled"` charsmap, and never interprets or executes the remaining steps of the Sequence (e.g., Strip, Replace, NFKC):

OrtxStatus LoadCharsMap(const json& j_vocab) {
    auto normalizer = j_vocab.find("normalizer");
    std::string charsmap;
    if (normalizer != j_vocab.end()) {
      auto iter = normalizer->find("precompiled_charsmap");
      if (iter != normalizer->end()) {
        charsmap = iter->get<std::string>();
      } else {
        auto iter = normalizer->find("normalizers");  // v2 schema
        if (iter != normalizer->end()) {
          for (const auto& normalizer : iter->items()) {
            if (normalizer.value().contains("type")) {
              auto type = normalizer.value()["type"].get<std::string>();
              if (type == "Precompiled") {
                charsmap = normalizer.value()["precompiled_charsmap"].get<std::string>();
                break;
              }
            }
          }
        }
      }
    }
    // ... remainder elided; only `charsmap` appears to be carried
    // forward -- the other Sequence step types are never represented
  }

This seems likely to cause mismatches for tokenizers that depend on other normalizers in the sequence for Unicode/whitespace handling.

Test case

Mismatch for: 'Emoji 🙂 and symbols © ™ — …'
Field: input_ids
REF : [[     1    416  95267   8692   1964 123024    264   5188   2418   2015
     303      2      0      0      0      0      0      0      0      0
       0      0      0      0      0      0      0      0      0      0
       0      0]]
ONNX: [[    1   416 95267   260  8692   306 23967   264  4334   260     3   662
    260     3     2     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0]]
REF pieces: ['[CLS]', '▁E', 'moji', '🙂', 'and', 'symbol', 's', '©', 'TM', '—', '...', '[SEP]']
ONNX pieces: ['[CLS]', '▁E', 'moji', '▁', '🙂', '▁and', '▁symbol', 's', '▁©', '▁', '[UNK]', '▁—', '▁', '[UNK]', '[SEP]']

Notably, the ONNX/ORTX output introduces extra whitespace pieces (▁), and maps some symbols (e.g., ™, …) to [UNK], whereas the reference tokenizer produces TM and ....
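The `TM` / `...` pieces on the reference side come directly from NFKC compatibility decomposition, which the skipped `NFKC` steps would have applied. A minimal stdlib check of the symbols from the test string:

```python
import unicodedata

# NFKC compatibility-decomposes TRADE MARK SIGN and HORIZONTAL ELLIPSIS,
# which is why the HF reference emits 'TM' and '...'. A pipeline that
# skips the NFKC steps instead sees the raw characters (and maps them
# to [UNK] if they are out of vocabulary). COPYRIGHT SIGN has no
# compatibility decomposition, matching the fact that it tokenizes.
for ch in ["™", "…", "©"]:
    print(repr(ch), "->", repr(unicodedata.normalize("NFKC", ch)))
# '™' -> 'TM'
# '…' -> '...'
# '©' -> '©'
```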

As a workaround, I added a StringRegexReplace node in the ONNX graph to handle whitespace normalization. This helped somewhat for whitespace-related differences, but it still does not resolve the mismatches for special Unicode characters, which continue to tokenize differently vs the HF reference (often showing up as [UNK] on the ORTX side).
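For reference, the whitespace part of that workaround can be expressed as a host-side pre-pass before the text reaches the ONNX tokenizer; this is a sketch mirroring only the Strip and `" {2,}" -> " "` steps from the config above (the function name is hypothetical), and, consistent with the observation above, it cannot address the NFKC-dependent mismatches:

```python
import re

def prenormalize(text: str) -> str:
    """Host-side pre-pass mirroring the whitespace steps of the
    Sequence: Strip (left+right), then collapse runs of spaces.
    The Unicode steps (NFKC, Precompiled charsmap) are NOT handled
    here, so symbols like ™ and … still mismatch."""
    text = text.strip()                 # Strip step
    return re.sub(r" {2,}", " ", text)  # Replace: Regex " {2,}" -> " "

print(prenormalize("  Emoji  🙂   and symbols ™  "))
# Emoji 🙂 and symbols ™
```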

Could you confirm whether this is intended/known?
