When I build a blingfire model that mirrors the settings of Hugging Face's BertTokenizer, it produces the wrong output in certain cases.
Run against the same vocab.txt (about 30,000 entries), (HF) BertTokenizerFast and (TF) tf_text.FastBertTokenizer each give the expected output for more than 99% of inputs, but blingfire does so for only 93%.
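For reference, the agreement figures above can be measured with something like the following. This is only a minimal sketch: `compare_tokenizers` and the two lookup-table "tokenizers" are hypothetical stand-ins (not part of blingfire or Hugging Face), and the ids are the ones from the 아ㅋ / 아 ㅋ examples in this issue. In a real run, the two callables would wrap the actual tokenizers.

```python
def compare_tokenizers(reference, candidate, texts):
    """Return (agreement_rate, mismatches) for two text -> ids callables.

    `reference` and `candidate` are hypothetical placeholders for the real
    tokenizers; anything that maps a string to a list of ids will work.
    """
    mismatches = []
    for text in texts:
        ref_ids = list(reference(text))
        cand_ids = list(candidate(text))
        if ref_ids != cand_ids:
            # Keep the failing input so the mismatches can be inspected later.
            mismatches.append((text, ref_ids, cand_ids))
    rate = 1.0 - len(mismatches) / len(texts) if texts else 1.0
    return rate, mismatches


# Toy stand-ins built from the ids reported in this issue.
reference = {
    "아ㅋ": [31998, 21, 29981, 31999],
    "아 ㅋ": [31998, 21, 31997, 31999],
}.get
candidate = {
    "아ㅋ": [31998, 21, 29981, 31999],
    "아 ㅋ": [31998, 21, 29981, 31999],
}.get

rate, mismatches = compare_tokenizers(reference, candidate, ["아ㅋ", "아 ㅋ"])
print(rate)        # 0.5
print(mismatches)  # only the "아 ㅋ" case differs
```

Collecting the mismatching inputs, rather than only the rate, makes it easier to see whether the failures share a pattern such as a leading space before a continuation-only token.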
In the example below, the actual vocab.txt contains ##ㅋ but not ㅋ.
Because ##ㅋ can only be attached to a preceding character, all four tokenizers agree on 아ㅋ:
| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
On the other hand, when there is a space in the middle, as in 아 ㅋ, only blingfire produces a different result: it emits ##ㅋ (29981) where the other three emit [UNK] (31997).
| Tokenizer Framework | text | ids | decode |
|---|---|---|---|
| (HF) BertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (HF) BertTokenizerFast | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (BF) bert_custom.bin | 아 ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
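To make the expected behavior concrete, here is a minimal sketch of whitespace splitting followed by greedy longest-match WordPiece, which is the general scheme BERT tokenizers follow. It is an illustration under assumptions, not the Hugging Face or blingfire implementation: the functions `wordpiece` and `tokenize` and the five-entry toy vocabulary are made up for this example, and normalization, punctuation splitting, and the [CLS]/[SEP] wrapping are left out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece for one whitespace-delimited word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                # Only pieces that continue a word carry the ## prefix.
                cand = "##" + cand
            if cand in vocab:
                match = cand
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Split on whitespace first, then apply WordPiece to each word."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Toy vocabulary (an assumption): it has ##ㅋ but no standalone ㅋ.
vocab = {"[UNK]", "[CLS]", "[SEP]", "아", "##ㅋ"}

print(tokenize("아ㅋ", vocab))   # ['아', '##ㅋ']
print(tokenize("아 ㅋ", vocab))  # ['아', '[UNK]']
```

In this sketch the space makes ㅋ a word of its own, so the ## form is never tried at its first position and the result is [UNK]. The blingfire output above looks as if the continuation piece were matched at the start of a word, or as if the space were not treated as a word boundary, though that is only a guess from the ids.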
The blingfire settings are shown below.
ldb.conf.small
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
options.small
OUTPUT = bert_custom.bin
opt_build_wbd = --dict-root=. --full-unicode
opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap
resources = \
$(tmpdir)/wbd.fsa.$(mode).dump \
$(tmpdir)/wbd.mmap.$(mode).dump \
wdb.lex.utf8
_include common/bert.common.def.txt
_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]
< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord
#
# BERT specific
#
< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord
_function FnTokWord
_include bert_custom/vocab.falex
_end
Other than that, we set up vocab.falex, wdb.target.txt, ldb.conf.i2w, and options.small exactly as described in the guide.
How can I find out which part is causing the problem?