How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? #1545

LuoKaiGSW · 2024-06-04T07:37:39Z

I have a model that uses BloomTokenizerFast, which does not have properties like byte_decoder and sp_model, so I can't figure out how it implements the mapping between byte values and Unicode characters. I've looked through the source code and only found that the pre_tokenize_str function can convert input text characters into Unicode characters, but I didn't see the mapping relationship it depends on. So I want to ask, how can I find this mapping relationship? Or is the mapping relationship used by the fast tokenizer the same as that of gpt2?
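
For reference, here is a minimal sketch of the GPT-2 mapping I am comparing against, rebuilt in plain Python along the lines of GPT-2's bytes_to_unicode helper. Whether BloomTokenizerFast uses exactly this table is the open question, so treat that as an assumption to verify; the helper names to_byte_level and from_byte_level are my own, not part of any library.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each byte value (0-255) to one Unicode character."""
    # Bytes that are already printable, non-space characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, etc.) is shifted
    # to an unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()                        # byte value -> character
byte_decoder = {ch: b for b, ch in byte_encoder.items()}  # character -> byte value


def to_byte_level(text):
    """Hypothetical helper: encode text as UTF-8, then map each byte through the table."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def from_byte_level(mapped):
    """Hypothetical helper: invert to_byte_level."""
    return bytes(byte_decoder[ch] for ch in mapped).decode("utf-8")


print(to_byte_level("hello world"))  # the space (byte 0x20) becomes "Ġ"
```

If the fast tokenizer shares this table, then joining the pieces returned by tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text) should give the same string as to_byte_level(text) for the same input. I have not confirmed this for Bloom.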

ArthurZucker · 2024-06-05T07:29:04Z

Hey! I suppose you are using python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

LuoKaiGSW · 2024-06-05T08:06:29Z

> Hey! I suppose you are using python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

Thank you for your reply, but I didn't fully understand what you meant. After using tokenizer._tokenizer.model, I got a BPE object, but I didn't see the attribute I wanted in it - that is, the mapping from byte values to Unicode. Could you explain it a bit more clearly, please?

ArthurZucker · 2024-06-11T13:32:50Z

You cannot see any attributes because both __repr__ and __str__ are not implemented

LuoKaiGSW · 2024-06-11T13:47:13Z

> You cannot see any attributes because both __repr__ and __str__ are not implemented

So, is it impossible to read this mapping relationship from the fast tokenizer?

ArthurZucker · 2024-06-11T16:42:41Z

It is coming with the PR that I linked 😉
