Releases: openvinotoolkit/openvino_tokenizers
2024.5.0.0
What's Changed
New and Reimplemented Operations
- Add Skip Tokens Node by @apaniukov in #264
- Optimize CombineSegments by @pavel-esir in #265
- Add Charsmap Operation by @apaniukov in #267
- Improve BPE by @pavel-esir in #281
- Reimplement WordPiece tokenization by @pavel-esir in #298
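For readers unfamiliar with WordPiece, the following is a minimal pure-Python sketch of the classic greedy longest-match-first algorithm. It is an illustration only and not the reimplementation from #298; the function name `wordpiece_tokenize`, the `##` continuation prefix, and the `[UNK]` token are assumptions borrowed from common BERT-style vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word into vocabulary pieces.

    Hypothetical helper for illustration; not part of openvino_tokenizers.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that do not start the word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```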
Improvements and Compatibility
- Store tokenizer conversion params in rt_info / refactor passing params by @pavel-esir in #268
- add packages versions to rt_info by @pavel-esir in #292
- Fix GLM4 Tokenization by @apaniukov in #280
Build Changes
- Dynamic linking with msvc runtime by @mryzhov in #260
- Linking with sentencepiece_train by @mryzhov in #272
Full Changelog: 2024.4.1.0...2024.5.0.0
2024.4.1.0
2024.4.0.0
What's Changed
- Reduce icud.dll by @mryzhov in #196
- Split implementation without FastTokenizer by @pavel-esir in #208
- Align Sentencepiece Model Vocab by @apaniukov in #205
- Ops Optimization by @apaniukov in #219
- [TF FE][Tokenizers] Avoid dependency from TF FE in tokenizers by @rkazants in #227
- Add Truncation To Sentencepiece by @apaniukov in #225
- reimplement BPE tokenizer by @pavel-esir in #220
- [TF FE][Tokenizers] Optimize TF FE extensions by @rkazants in #232
- Enabled build w/o FastTokenizers by @ilya-lavrenov in #237
- Win debug build by @mryzhov in #218
- Switch To BPE Backend by @apaniukov in #235
- Add UTF-8 validation by @pavel-esir in #242
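As a rough illustration of what UTF-8 validation means for tokenizer input and output, the sketch below checks a byte string and, when asked, replaces invalid sequences with U+FFFD. It uses only the Python standard library; how #242 handles invalid bytes inside the library is not stated in these notes, and the helper names `is_valid_utf8` and `sanitize_utf8` are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize_utf8(data: bytes, replace_invalid: bool = True) -> str:
    """Decode bytes, replacing invalid sequences with U+FFFD or dropping them."""
    errors = "replace" if replace_invalid else "ignore"
    return data.decode("utf-8", errors=errors)


print(is_valid_utf8("héllo".encode("utf-8")))  # True
print(is_valid_utf8(b"\xe2\x82"))              # False: truncated multi-byte sequence
print(sanitize_utf8(b"ok\xff"))                # 'ok' followed by U+FFFD
```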
Full Changelog: 2024.3.0.0...2024.4.0.0
2024.3.0.0
What's Changed
Improvements
- Fix Tokenization of Special Tokens in Sentencepiece by @apaniukov in #173
- Add Left Padding and Padding to Max Length by @apaniukov in #152
- Sentencepiece Tokenization Improvements by @apaniukov in #176
- BPE Fallback for Sentencepiece by @apaniukov in #181
- Update Sentencepiece Parsing by @apaniukov in #185
- Fix Decoding For Long Tokens by @apaniukov in #187
- Sentencepiece Left Padding by @apaniukov in #186
- Update Remaining Inputs Detection During Model Connection by @apaniukov in #190
- Update rt_info by @apaniukov in #191
- Truncate Left Side When Left Padding Is Used by @apaniukov in #192
- Add Separate Special Token Handling To Sentencepiece by @apaniukov in #198
- Support GLM-4 Tokenizer by @apaniukov in #202
- use PCRE2 fallback for RegexNormalization by @pavel-esir in #203
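The padding and truncation items above (#152, #186, #192) can be pictured with a small pure-Python sketch. It assumes that truncation drops tokens from the same side that padding is added to, which is what the title of #192 suggests; the function `pad_or_truncate` and its parameters are hypothetical and do not mirror the library's API.

```python
def pad_or_truncate(ids, max_length, pad_id=0, padding_side="right"):
    """Bring a token id list to exactly max_length and build its attention mask.

    Hypothetical helper for illustration; not part of openvino_tokenizers.
    """
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")

    if len(ids) > max_length:
        if padding_side == "left":
            # Left padding keeps the end of the sequence, so drop tokens on the left.
            ids = ids[len(ids) - max_length:]
        else:
            ids = ids[:max_length]

    pad_count = max_length - len(ids)
    if padding_side == "left":
        padded = [pad_id] * pad_count + list(ids)
        mask = [0] * pad_count + [1] * len(ids)
    else:
        padded = list(ids) + [pad_id] * pad_count
        mask = [1] * len(ids) + [0] * pad_count
    return padded, mask


print(pad_or_truncate([1, 2, 3], 5, padding_side="left"))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4, padding_side="left"))
```

Left padding is typically preferred for decoder-only generation, where the last real token must sit at the end of every row in a batch.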
Changes
Build, Packaging and CI
- Package into correct dirs by @Wovchena in #148
- Set cmake policies by @ilya-lavrenov in #157
- Fix usage of protobuf_MODULE_COMPATIBLE by @ilya-lavrenov in #158
- Build release by default (#162) by @ilya-lavrenov in #163
- Fixed conda-forge on Windows (#164) by @ilya-lavrenov in #165
- Package into correct dirs (#155) by @Wovchena in #167
- New python build scheme by @ilya-lavrenov in #166
- Support build from OpenVINO wheel only by @mryzhov in #178
- Configure cmake similar to GenAI (#175) by @Wovchena in #180
- Patch icu external project by @mryzhov in #184
- [CI] Build from OV wheel by @mryzhov in #183
- [GHA] Set permissions read-all by @mryzhov in #189
- [CI] Fixed Jenkins artifacts by @mryzhov in #195
- [MERGE] reduced icudt.dll (#196) by @mryzhov in #201
Full Changelog: 2024.2.0.0...2024.3.0.0
2024.2.0.0
What's Changed
- Add support for left padding in Wordpiece, BPE and tiktoken-based tokenizers
- Enhanced handling of special tokens
- Add support for padding to a particular length
- New option to add or skip special tokens during tokenization
- Support Punctuation Pretokenizer
- Enhance tokenizer postprocessing parser for better model coverage
- Add StringToHashBucketFast TensorFlow Translator
- Optimize EqualStr and VocabEncoder Operations
- Add Benchmarking Script
Full Changelog: 2024.1.0.2...2024.2.0.0
2024.1.0.2
What's Changed
- Fixed prebuild tokenizers on Windows by @ilya-lavrenov in #141
Full Changelog: 2024.1.0.1...2024.1.0.2
2024.1.0.1
What's Changed
- Llama3 Tokenizer Support
- Add not-add-special-tokens flag to CLI conversion tool
Full Changelog: 2024.1.0.0...2024.1.0.1
2024.1.0.0
What's Changed
- New operations:
  - TrieTokenizer
  - VocabEncoder
  - EqualStr
  - RaggedToSparse
  - RaggedToRagged
  - FuzeRagged
- Update existing operations:
  - Add max_splits argument to RegexSplit
  - Add encoding argument to CaseFold
- Add new and update existing TensorFlow translators for TextVectorization layer partial support.
- RWKV tokenizer support.
- New way to get OpenVINO Tokenizers - build from files. Supports RWKV tokenizer.
- Update tokenizer operation caching mechanism for OpenVINO model caching support
- SentencePiece tokenizer changes and fixes:
  - Update to 0.2.0 version
  - Use constant 0 as mask hide token by @as-suvorov in #90
  - Sentencepiece BOS Token Detection
- Fix multi-input model merging by @yas-sim in #53
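To give a sense of what a split limit such as the max_splits argument added to RegexSplit does, here is a sketch built on Python's standard `re.split`. The exact semantics of the RegexSplit operation are not described in these notes, so the helper `split_with_limit` and its convention that a negative value means "no limit" are assumptions made for illustration.

```python
import re


def split_with_limit(text, pattern, max_splits=-1):
    """Split text on a regex, performing at most max_splits splits.

    Hypothetical helper; a negative max_splits means "no limit" here.
    """
    if max_splits < 0:
        return re.split(pattern, text)
    if max_splits == 0:
        # re.split treats maxsplit=0 as unlimited, so handle "no splits" explicitly.
        return [text]
    return re.split(pattern, text, maxsplit=max_splits)


print(split_with_limit("a b c d", r"\s+"))     # ['a', 'b', 'c', 'd']
print(split_with_limit("a b c d", r"\s+", 2))  # ['a', 'b', 'c d']
```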
New Contributors
- @dependabot made their first contribution in #30
- @yas-sim made their first contribution in #53
- @as-suvorov made their first contribution in #90
- @akladiev made their first contribution in #102
Full Changelog: 2024.0.0.0...2024.1.0.0