
Implement gpt2 (BPE) GGUF tokenizer conversion #397

Merged
merged 24 commits into from
Jun 10, 2024

Conversation

EricLBuehler
Owner

No description provided.

github-actions bot commented Jun 5, 2024
Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 27          995          848           29          118
 TOML                   16          430          390            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1091            0          809          282
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1393          279          812          302
-------------------------------------------------------------------------------
 Rust                  109        33197        30074          549         2574
 |- Markdown            55          627           13          581           33
 (Total)                          33824        30087         1130         2607
===============================================================================
 Total                 181        36210        31727         1388         3095
===============================================================================
  

Comment on lines 139 to 140
tokenizer.add_special_tokens(&[AddedToken::from(tokens[bos as usize].clone(), true)]);
tokenizer.add_special_tokens(&[AddedToken::from(tokens[eos as usize].clone(), true)]);
Contributor
Just curious: BPE supports setting `unk` in its builder variant. Is it not relevant here for some reason? (I know very little about these things.)

Owner Author
@EricLBuehler EricLBuehler Jun 8, 2024
@polarathene the GGUF file I am testing with (QuantFactory/Meta-Llama-3-8B-Instruct-GGUF) does not have a unk token in the metadata, so I left it out here.

Contributor
Sure, but what about when they do provide one? I assume that's possible, since the tokenizer builder for BPE does support setting `unk`. There is no check for it here, so if a file did supply an unk token, it would be silently ignored and introduce a bug?

Owner Author
Yes, I wasn't sure whether it is guaranteed to be absent, so just in case I added 8d4dba5 and 763241e.
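A hedged sketch of the pattern those commits introduce (not the PR's actual code; `BpeBuilder` here is a self-contained stand-in for the real builder in the `tokenizers` crate): set `unk` on the builder only when the GGUF metadata actually provides an unknown-token id.

```rust
// Stand-in for tokenizers' BPE builder, to keep this sketch self-contained.
#[derive(Default, Debug, PartialEq)]
struct BpeBuilder {
    unk_token: Option<String>,
}

impl BpeBuilder {
    fn unk_token(mut self, tok: String) -> Self {
        self.unk_token = Some(tok);
        self
    }
}

// `unk_id` models the optional `tokenizer.ggml.unknown_token_id` metadata key:
// when it is present, look up the token text and set it on the builder;
// when it is absent, leave the builder untouched.
fn build_with_optional_unk(tokens: &[String], unk_id: Option<u32>) -> BpeBuilder {
    let mut builder = BpeBuilder::default();
    if let Some(id) = unk_id {
        builder = builder.unk_token(tokens[id as usize].clone());
    }
    builder
}
```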

mistralrs-core/src/gguf/gguf_tokenizer.rs (outdated; resolved)
Comment on lines +197 to +202
.map(|merge| {
let split: (&str, &str) = merge
.splitn(2, ' ')
.collect_tuple()
.expect("Failed to convert split into 2-tuple");
(split.0.to_string(), split.1.to_string())
Contributor
This splits a string like "I like rust" into ("I", "like rust")?

I haven't seen any examples where you'd have a space in the input, but I assume this is referencing an existing implementation somewhere already 😅 (not doubting your work, just curious)

I came across this article that mentions whitespace splitting at the end, noting it is not suitable for languages like Chinese.

Owner Author
No problem, happy to explain. In every example I have looked at, each entry in the merges field encodes a pair of tokens joined by a single space; because the BPE tokenizer inserts the spaces itself, there is no space token, so the space is unambiguous as a delimiter. The key point is that each element of the table is a merge pair: this code splits the merge representation "he llo" into the pair ("he", "llo").

I looked at the article you linked, and it also mentioned the pair construction of merges.
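The split discussed above can be sketched with just the standard library (a hedged illustration of the same `splitn(2, ' ')` idea, without the PR's `itertools::collect_tuple` dependency; `parse_merge` is a hypothetical name):

```rust
// Parse one merges entry: split on the FIRST space only, so "he llo"
// becomes ("he", "llo"). With splitn(2, ..), any later spaces stay in
// the right-hand side, which is why "I like rust" gives ("I", "like rust").
fn parse_merge(merge: &str) -> (String, String) {
    let mut parts = merge.splitn(2, ' ');
    let left = parts.next().expect("merge entry is empty").to_string();
    let right = parts.next().expect("merge entry has no space").to_string();
    (left, right)
}
```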

mistralrs-core/src/gguf/gguf_tokenizer.rs (outdated; resolved)
@EricLBuehler EricLBuehler merged commit 46b0364 into master Jun 10, 2024
4 checks passed
@EricLBuehler EricLBuehler deleted the gpt2_gguf_tokenizer branch June 10, 2024 11:59