Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP from the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main

I tried the same approach given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta

However, the export_vocab script expects a `Ġ` prefix in the vocabulary. CLIP's vocabulary uses `</w>` as a **suffix** and not a **prefix**. I tried to modify the script to detect a trailing `</w>` instead of `Ġ` and append `0x2581`: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91 but this gives slightly different results than the Hugging Face tokenizer when dealing with punctuation:

Input string: `"a photo of a really, functistaner big cat."`

```
Hugging Face: [49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:    320 1125 539 320 1414 11 1499 66 555 2203 517 1205 2368 13
```

Is there some way to make BlingFire support the CLIP version of the tokenizer?

My current scripts and reproduction steps: https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip
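For illustration, the suffix handling described above might look roughly like the minimal sketch below. This is not the actual modified export_vocab.py: the helper name `convert_clip_token` and the sample tokens are hypothetical, chosen only to show a trailing `</w>` being swapped for U+2581 while tokens without the suffix are left unchanged.

```python
# Hypothetical helper for illustration only; not part of BlingFire's export_vocab.py.
CLIP_END_OF_WORD = "</w>"   # end-of-word suffix used in CLIP's vocab.json
WORD_MARKER = "\u2581"      # U+2581, the marker the modified script appends


def convert_clip_token(token: str) -> str:
    """Replace a trailing '</w>' with a trailing U+2581.

    Tokens without the suffix (word-internal pieces) are returned unchanged.
    """
    if token.endswith(CLIP_END_OF_WORD):
        return token[: -len(CLIP_END_OF_WORD)] + WORD_MARKER
    return token


# Illustrative tokens (assumed, not looked up in the real vocabulary).
examples = ["photo</w>", ",</w>", "fun"]
converted = [convert_clip_token(t) for t in examples]
print(converted)
```

Note that in this sketch punctuation with the suffix (such as `,</w>`) and without it stay distinct entries, which is the area where the two tokenizers' outputs above differ.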