In some cases, blingfire models created with the new vocab.txt produce different results. #181

When I build a blingfire model that mirrors the settings of Hugging Face's `BertTokenizer`, it produces incorrect output in certain cases.

On the same `vocab.txt`, `(HF)BertTokenizerFast` and `(TF)tf_text.FastBertTokenizer` give the correct answer more than 99% of the time, but `blingfire` is correct only 93% of the time.

(The `vocab.txt` has about 30,000 entries.)
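For context, percentages like these can be measured with a small harness along the following lines. This is only a sketch: `compare` and the `Tokenize` alias are hypothetical names, and each real tokenizer would have to be wrapped in a function that takes a string and returns a list of ids.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical alias: any function mapping a text to a list of token ids.
Tokenize = Callable[[str], List[int]]


def compare(reference: Tokenize, candidate: Tokenize,
            texts: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (agreement rate, texts on which the two tokenizers disagree)."""
    total = 0
    mismatches: List[str] = []
    for text in texts:
        total += 1
        if reference(text) != candidate(text):
            mismatches.append(text)
    # With no input texts there is nothing to disagree on.
    rate = (total - len(mismatches)) / total if total else 1.0
    return rate, mismatches
```

The list of mismatching texts is usually more useful than the rate itself, because it shows which inputs to inspect by hand.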

In the example below, the actual `vocab.txt` contains `##ㅋ` but not `ㅋ`:

```
--vocab.txt--
##ㅋ
```

In this case, `##ㅋ` has to attach to the preceding character, so all four tokenizers agree on `아ㅋ`:

| Tokenizer Framework    | text | ids                       | decode           |
| ---------------------- | ---- | ------------------------- | ---------------- |
| (HF) BertTokenizer     | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (HF) BertTokenizerFast | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (BF) bert_custom.bin   | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |
| (TF) FastBertTokenizer | 아ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP] |

However, when there is a space in the middle, as in `아 ㅋ`, only blingfire produces a different result: it still emits `##ㅋ` (id 29981), while the others emit `[UNK]` (id 31997).

| Tokenizer Framework    | text  | ids                       | decode               |
| ---------------------- | ----- | ------------------------- | -------------------- |
| (HF) BertTokenizer     | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (HF) BertTokenizerFast | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
| (BF) bert_custom.bin   | 아 ㅋ | [31998, 21, 29981, 31999] | [CLS] 아ㅋ [SEP]     |
| (TF) FastBertTokenizer | 아 ㅋ | [31998, 21, 31997, 31999] | [CLS] 아 [UNK] [SEP] |
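The behaviour of the three agreeing tokenizers follows from greedy longest-match-first WordPiece, where a piece with the `##` prefix can only match in a non-initial position of a word. The sketch below is a simplified stand-in, not the code of any of the libraries above: it skips normalization and punctuation splitting, and `TOY_VOCAB` is a made-up five-entry vocabulary.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first WordPiece for one whitespace-delimited word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces carry the ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split on whitespace, then apply WordPiece to each word."""
    out = ["[CLS]"]
    for word in text.split():
        out.extend(wordpiece(word, vocab))
    out.append("[SEP]")
    return out


# Toy vocabulary (assumption): has "##ㅋ" but no word-initial "ㅋ".
TOY_VOCAB = {"[UNK]", "[CLS]", "[SEP]", "아", "##ㅋ"}

print(tokenize("아ㅋ", TOY_VOCAB))   # ['[CLS]', '아', '##ㅋ', '[SEP]']
print(tokenize("아 ㅋ", TOY_VOCAB))  # ['[CLS]', '아', '[UNK]', '[SEP]']
```

Under this model, a standalone `ㅋ` is a word of its own, so only the unprefixed entry could match it. The blingfire output in the second table looks as if `##ㅋ` were matched at a word start, or as if the space were not treated as a word boundary.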

## The blingfire settings are shown below.

### ldb.conf.small

```
[wbd]
max-depth 4
xword 2
seg 3
ignore 4
fsm 1
multi-map-mode triv-dump
multi-map 2
```

### options.small

```
OUTPUT = bert_custom.bin

opt_build_wbd = --dict-root=. --full-unicode

opt_pack_wbd_fsa = --alg=triv --type=moore-dfa --remap-iws --use-iwia
opt_pack_wbd_mmap = --alg=triv --type=mmap
#opt_pack_charmap = --alg=fixed --type=mmap --imp-mmap

resources = \
	$(tmpdir)/wbd.fsa.$(mode).dump \
	$(tmpdir)/wbd.mmap.$(mode).dump \

```

### wbd.lex.utf8

```
_include common/bert.common.def.txt

_define LetterFromVocab [\x0030-\x0039\x0041-\x005a\x0061-...]

< (ChineseChars)|(BertPunctuation) > --> WORD _call FnTokWord
< (AllLettersWithoutToLower|LetterFromVocab)+ > --> WORD _call FnTokWord

#
# BERT specific
#

< [\[] UNK [\]] > --> WORD _call FnTokWord
< [\[] CLS [\]] > --> WORD _call FnTokWord
< [\[] SEP [\]] > --> WORD _call FnTokWord
< [\[] MASK [\]] > --> WORD _call FnTokWord

_function FnTokWord
_include bert_custom/vocab.falex
_end

```

Other than that, `vocab.falex`, `wbd.tagset.txt`, `ldb.conf.i2w`, and `options.small` are set up exactly as described in the guide.

How can I find out which part is causing the problem?
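To help narrow it down, the vocabulary can be scanned for every piece that, like `ㅋ`, exists only in its `##` form. This is a rough sketch; `continuation_only_pieces` is a hypothetical helper, and it assumes `vocab.txt` holds one entry per line.

```python
from typing import Iterable, List


def continuation_only_pieces(vocab_lines: Iterable[str]) -> List[str]:
    """Return pieces that exist as '##x' but not as word-initial 'x'."""
    entries = {line.rstrip("\n") for line in vocab_lines}
    entries.discard("")  # ignore blank lines
    return sorted(
        e[2:] for e in entries
        if e.startswith("##") and len(e) > 2 and e[2:] not in entries
    )


# Usage with a real file (path is an assumption):
# with open("vocab.txt", encoding="utf-8") as f:
#     print(continuation_only_pieces(f))
```

Feeding each returned piece to the tokenizers as a standalone word, as in `아 ㅋ`, should show whether this one pattern accounts for the whole gap or whether other mismatches remain.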
