
How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? #1545

Open
LuoKaiGSW opened this issue Jun 4, 2024 · 5 comments

Comments

@LuoKaiGSW

I have a model that uses BloomTokenizerFast, which does not have properties like byte_decoder and sp_model, so I can't figure out how it implements the mapping between byte values and Unicode characters. I've looked through the source code and only found that the pre_tokenize_str function can convert input text characters into Unicode characters, but I didn't see the mapping relationship it depends on. So I want to ask, how can I find this mapping relationship? Or is the mapping relationship used by the fast tokenizer the same as that of gpt2?
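For reference, the GPT-2 mapping mentioned in the question can be reproduced in a few lines of plain Python. This is a minimal sketch of the GPT-2 byte-to-Unicode scheme only. The thread does not confirm that BloomTokenizerFast uses the same table, so treat that as an assumption to check against your own tokenizer's vocabulary. The helper names `token_to_bytes` and `text_to_token_chars` are hypothetical and exist only for illustration.

```python
def bytes_to_unicode():
    """Map each byte value 0..255 to a printable Unicode character (GPT-2 scheme)."""
    # Bytes that are already printable, non-space characters map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (control characters, space, etc.) is shifted
    # to an unused code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Forward table (byte -> character) and its inverse (character -> byte).
byte_encoder = bytes_to_unicode()
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def text_to_token_chars(text):
    """Hypothetical helper: show how raw text looks after byte-level mapping."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def token_to_bytes(token):
    """Hypothetical helper: recover the raw bytes behind a vocabulary token."""
    return bytes(byte_decoder[ch] for ch in token)
```

With this table, a leading space becomes `Ġ`, so `token_to_bytes("Ġhello")` returns `b" hello"`. One way to test the assumption is to pass a few tokens from your tokenizer's vocabulary through `token_to_bytes` and check that the bytes decode to the expected text.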

@ArthurZucker
Collaborator

Hey! I suppose you are using Python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

@LuoKaiGSW
Author

> Hey! I suppose you are using Python and can't see what's inside your tokenizer! #1542 should help you with this 🤗

Thank you for your reply, but I didn't fully understand what you meant. After using tokenizer._tokenizer.model, I got a BPE object, but I didn't see the attribute I wanted in it - that is, the mapping from byte values to Unicode. Could you explain it a bit more clearly, please?

@ArthurZucker
Collaborator

You cannot see any attributes because neither __repr__ nor __str__ is implemented.

@LuoKaiGSW
Author

> You cannot see any attributes because neither __repr__ nor __str__ is implemented.

So, is it impossible to read this mapping relationship from the fast tokenizer?

@ArthurZucker
Collaborator

It is coming with the PR that I linked 😉
