
Implement gpt2 (BPE) GGUF tokenizer conversion #397

Merged
merged 24 commits into from
Jun 10, 2024

Conversation

EricLBuehler
Owner

No description provided.

github-actions bot commented Jun 5, 2024
Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 27          995          848           29          118
 TOML                   16          430          390            1           39
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1091            0          809          282
 |- BASH                 5          100           97            0            3
 |- Python               6          122          110            0           12
 |- Rust                 2           80           72            3            5
 (Total)                           1393          279          812          302
-------------------------------------------------------------------------------
 Rust                  109        33197        30074          549         2574
 |- Markdown            55          627           13          581           33
 (Total)                          33824        30087         1130         2607
===============================================================================
 Total                 181        36210        31727         1388         3095
===============================================================================
  

Comment on lines 139 to 140
tokenizer.add_special_tokens(&[AddedToken::from(tokens[bos as usize].clone(), true)]);
tokenizer.add_special_tokens(&[AddedToken::from(tokens[eos as usize].clone(), true)]);
Contributor
Just curious: BPE supports setting `unk` in its builder variant. Is it not relevant here for some reason? (I know very little about these things.)

Owner Author
@EricLBuehler EricLBuehler Jun 8, 2024
@polarathene the GGUF file I am testing with (QuantFactory/Meta-Llama-3-8B-Instruct-GGUF) does not have a unk token in the metadata, so I left it out here.

Contributor
Sure, but what about when they do provide one? I assume that's possible, since the tokenizer builder for BPE does support setting `unk`. There is no check for it here, so if a file did supply an unk token, it would be silently ignored and introduce a bug?

Owner Author
Yes, I wasn't sure whether it is guaranteed to be absent, so just in case I added 8d4dba5 and 763241e.
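A hedged sketch of the pattern those commits introduce (not the PR's actual code; `BpeBuilder` here is a self-contained stand-in for the real builder in the `tokenizers` crate): set `unk` on the builder only when the GGUF metadata actually provides an unknown-token id.

```rust
// Stand-in for tokenizers' BPE builder, to keep this sketch self-contained.
#[derive(Default, Debug, PartialEq)]
struct BpeBuilder {
    unk_token: Option<String>,
}

impl BpeBuilder {
    fn unk_token(mut self, tok: String) -> Self {
        self.unk_token = Some(tok);
        self
    }
}

// `unk_id` models the optional `tokenizer.ggml.unknown_token_id` metadata key:
// when it is present, look up the token text and set it on the builder;
// when it is absent, leave the builder untouched.
fn build_with_optional_unk(tokens: &[String], unk_id: Option<u32>) -> BpeBuilder {
    let mut builder = BpeBuilder::default();
    if let Some(id) = unk_id {
        builder = builder.unk_token(tokens[id as usize].clone());
    }
    builder
}
```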

mistralrs-core/src/gguf/gguf_tokenizer.rs (outdated; resolved)
Comment on lines +197 to +202
.map(|merge| {
let split: (&str, &str) = merge
.splitn(2, ' ')
.collect_tuple()
.expect("Failed to convert split into 2-tuple");
(split.0.to_string(), split.1.to_string())
Contributor
This splits a string like "I like rust" into ("I", "like rust")?

I haven't seen any examples where you'd have a space in the input, but I assume this is referencing an existing implementation somewhere already 😅 (not doubting your work, just curious)

I came across this article that mentions whitespace splitting at the end, noting it is not suitable for languages like Chinese.

Owner Author
No problem, happy to explain. In every example I have looked at, each entry in the merges field encodes a pair of tokens joined by a single space; because the BPE tokenizer inserts the spaces itself, there is no space token, so the space is unambiguous as a delimiter. The key point is that each element of the table is a merge pair: this code splits the merge representation "he llo" into the pair ("he", "llo").

I looked at the article you linked, and it also mentioned the pair construction of merges.
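The split discussed above can be sketched with just the standard library (a hedged illustration of the same `splitn(2, ' ')` idea, without the PR's `itertools::collect_tuple` dependency; `parse_merge` is a hypothetical name):

```rust
// Parse one merges entry: split on the FIRST space only, so "he llo"
// becomes ("he", "llo"). With splitn(2, ..), any later spaces stay in
// the right-hand side, which is why "I like rust" gives ("I", "like rust").
fn parse_merge(merge: &str) -> (String, String) {
    let mut parts = merge.splitn(2, ' ');
    let left = parts.next().expect("merge entry is empty").to_string();
    let right = parts.next().expect("merge entry has no space").to_string();
    (left, right)
}
```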

mistralrs-core/src/gguf/gguf_tokenizer.rs (outdated; resolved)
@EricLBuehler EricLBuehler merged commit 46b0364 into master Jun 10, 2024
4 checks passed
@EricLBuehler EricLBuehler deleted the gpt2_gguf_tokenizer branch June 10, 2024 11:59