| Name: Sentencepiece |
| Short Name: sentencepiece |
| URL: https://github.com/google/sentencepiece |
| Version: 8cbdf13794284c30877936f91c6f31e2c1d5aef7 |
| Date: 2023-08-16 |
| License: Apache 2.0 |
| License File: LICENSE |
| Security Critical: Yes |
| Shipped: yes |
| CPEPrefix: unknown |
| |
| Description: |
| SentencePiece is an unsupervised text tokenizer and detokenizer mainly for |
| Neural Network-based text generation systems where the vocabulary size is |
| predetermined prior to the neural model training. SentencePiece implements |
| subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram |
| language model [Kudo.]) with the extension of direct training from raw |
| sentences. SentencePiece allows us to make a purely end-to-end system that does |
| not depend on language-specific pre/postprocessing. |
| |
| Making new patches: |
| 1. git commit changes, be sure to write a helpful commit message |
| 2. Run `git format-patch -o third_party/sentencepiece/patches/ origin/main third_party/sentencepiece/src/` |
| |
| Patches: |
| * 0001-Remove-config-include.patch - Removes a header for version information |
| that is not used and requires cmake. |
| * 0002-Fix-absl-includes.patch - Rewrites absl includes from third_party/absl to |
| just absl. |
| * 0003-Remove-absl-flags.patch - Removes some usage of absl flags which was not |
| needed for chromium. |
| * 0004-Remove-util-Status.patch - Swaps out the custom util::Status for |
| absl::Status. |
| * 0005-Fix-utf8towide.patch - Fixes file loading on Windows. |
| * 0006-Fix-gn-check-in-sentencepiece.patch - Removes more usage of absl flags |
| which was not needed and was giving "gn check" errors. |
| * 0007-SentencePiece-introduce-no-exceptions-mode.patch - Introduces a new build |
| variant that allows us to build SentencePiece and its dependencies without |
| having to enable exceptions. Candidate for upstreaming. |
| |
| In addition the python, third_party/absl, and third_party/protobuf-lite |
| directories were removed to avoid adding a bunch of unused code which could |
| cause strange compile errors if include paths were incorrect. |