[go: nahoru, domu]

blob: 27092a921b3fcf0ca1ac1f818e3e8b260e417291 [file] [log] [blame]
Name: Sentencepiece
Short Name: sentencepiece
URL: https://github.com/google/sentencepiece
Version: 8cbdf13794284c30877936f91c6f31e2c1d5aef7
Date: 2023-08-16
License: Apache 2.0
License File: LICENSE
Security Critical: Yes
Shipped: yes
CPEPrefix: unknown
Description:
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size is
predetermined prior to the neural model training. SentencePiece implements
subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram
language model [Kudo.]) with the extension of direct training from raw
sentences. SentencePiece allows us to make a purely end-to-end system that does
not depend on language-specific pre/postprocessing.
Making new patches:
1. git commit changes, be sure to write a helpful commit message
2. Run `git format-patch -o third_party/sentencepiece/patches/ origin/main third_party/sentencepiece/src/`
Patches:
* 0001-Remove-config-include.patch - Removes a header for version information
that is not used and requires cmake.
* 0002-Fix-absl-includes.patch - Rewrites absl includes from third_party/absl to
just absl.
* 0003-Remove-absl-flags.patch - Removes some usage of absl flags which was not
needed for chromium.
* 0004-Remove-util-Status.patch - Swaps out the custom util::Status for
absl::Status.
* 0005-Fix-utf8towide.patch - Fixes file loading on Windows.
* 0006-Fix-gn-check-in-sentencepiece.patch - Removes more usage of absl flags
which was not needed and was giving "gn check" errors.
* 0007-SentencePiece-introduce-no-exceptions-mode.patch - Introduces a new build
variant that allows us to build SentencePiece and its dependencies without
having to enable exceptions. Candidate for upstreaming.
In addition the python, third_party/absl, and third_party/protobuf-lite
directories were removed to avoid adding a bunch of unused code which could
cause strange compile errors if include paths were incorrect.