Releases · openvinotoolkit/openvino_tokenizers

20 Nov 11:07

apaniukov

2024.5.0.0

e30c99f

2024.5.0.0 Latest

What's Changed

New and Reimplemented Operations

Add Skip Tokens Node by @apaniukov in #264
Optimize CombineSegments by @pavel-esir in #265
Add Charsmap Operation by @apaniukov in #267
Improve BPE by @pavel-esir in #281
Reimplement WordPiece tokenization by @pavel-esir in #298

Improvements and Compatibility

Store tokenizer conversion params in rt_info / refactor passing params by @pavel-esir in #268
add packages versions to rt_info by @pavel-esir in #292
Fix GLM4 Tokenization by @apaniukov in #280

Build Changes

Dynamic linking with msvc runtime by @mryzhov in #260
Linking with sentencepiece_train by @mryzhov in #272

Full Changelog: 2024.4.1.0...2024.5.0.0

Contributors

pavel-esir, mryzhov, and apaniukov

30 Sep 13:20

apaniukov

2024.4.1.0

74186cd

2024.4.1.0 Pre-release

OpenVINO patch release.

What's Changed

Bump OV version to 2024.4.1 by @akladiev in #266

Full Changelog: 2024.4.0.0...2024.4.1.0

Contributors

akladiev

19 Sep 09:49

apaniukov

2024.4.0.0

3dde884

2024.4.0.0

What's Changed

Reduce icud.dll by @mryzhov in #196
Split implementation without FastTokenizer by @pavel-esir in #208
Align Sentencepiece Model Vocab by @apaniukov in #205
Ops Optimization by @apaniukov in #219
[TF FE][Tokenizers] Avoid dependency from TF FE in tokenizers by @rkazants in #227
Add Truncation To Sentencepiece by @apaniukov in #225
reimplement BPE tokenizer by @pavel-esir in #220
[TF FE][Tokenizers] Optimize TF FE extensions by @rkazants in #232
Enabled build w/o FastTokenizers by @ilya-lavrenov in #237
Win debug build by @mryzhov in #218
Switch To BPE Backend by @apaniukov in #235
Add UTF-8 validation by @pavel-esir in #242
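
As a rough illustration of what UTF-8 validation of tokenizer input can mean, the sketch below checks whether a byte string is well-formed UTF-8 and, if not, replaces the offending bytes with U+FFFD. This is only a standard-library Python sketch: the helper names `is_valid_utf8` and `sanitize_utf8` are hypothetical, and the strategy shown here is an assumption, not the behavior implemented in #242.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize_utf8(data: bytes) -> str:
    """Decode `data`, substituting U+FFFD for bytes that are not valid UTF-8."""
    # errors="replace" is one possible policy; dropping the bytes or
    # rejecting the input are equally plausible alternatives.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(is_valid_utf8("héllo".encode("utf-8")))
    print(is_valid_utf8(b"\xff\xfe"))
    print(sanitize_utf8(b"ok\xff"))
```

Replacing invalid bytes keeps the input length roughly stable and lets tokenization proceed, at the cost of hiding the original bytes.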

Full Changelog: 2024.3.0.0...2024.4.0.0

Contributors

ilya-lavrenov, pavel-esir, and 3 other contributors

31 Jul 10:47

apaniukov

2024.3.0.0

fb0157c

2024.3.0.0

What's Changed

Improvements

Fix Tokenization of Special Tokens in Sentencepiece by @apaniukov in #173
Add Left Padding and Padding to Max Length by @apaniukov in #152
Sentencepiece Tokenization Improvements by @apaniukov in #176
BPE Fallback for Sentencepiece by @apaniukov in #181
Update Sentencepiece Parsing by @apaniukov in #185
Fix Decoding For Long Tokens by @apaniukov in #187
Sentencepiece Left Padding by @apaniukov in #186
Update Remaining Inputs Detection During Model Connection by @apaniukov in #190
Update rt_info by @apaniukov in #191
Truncate Left Side When Left Padding Is Used by @apaniukov in #192
Add Separate Special Token Handling To Sentencepiece by @apaniukov in #198
Support GLM-4 Tokenizer by @apaniukov in #202
Use PCRE2 fallback for RegexNormalization by @pavel-esir in #203

Changes

Switch default skip tokens flag behavior by @slyalin in #160

Build, Packaging and CI

Package into correct dirs by @Wovchena in #148
Set cmake policies by @ilya-lavrenov in #157
Fix usage of protobuf_MODULE_COMPATIBLE by @ilya-lavrenov in #158
Build release by default (#162) by @ilya-lavrenov in #163
Fixed conda-forge on Windows (#164) by @ilya-lavrenov in #165
Package into correct dirs (#155) by @Wovchena in #167
New python build scheme by @ilya-lavrenov in #166
Support build from OpenVINO wheel only by @mryzhov in #178
Configure cmake similar to GenAI (#175) by @Wovchena in #180
Patch icu external project by @mryzhov in #184
[CI] Build from OV wheel by @mryzhov in #183
[GHA] Set permissions read-all by @mryzhov in #189
[CI] Fixed Jenkins artifacts by @mryzhov in #195
[MERGE] reduced icudt.dll (#196) by @mryzhov in #201

Full Changelog: 2024.2.0.0...2024.3.0.0

Contributors

ilya-lavrenov, pavel-esir, and 4 other contributors

17 Jun 13:19

apaniukov

2024.2.0.0

c615ec5

2024.2.0.0

What's Changed

Add support for left padding in Wordpiece, BPE and tiktoken-based tokenizers
Enhanced handling of special tokens
Add support for padding to a particular length
New option to control whether special tokens are added during tokenization
Support Punctuation Pretokenizer
Enhance tokenizer postprocessing parser for better model coverage
Add StringToHashBucketFast Tensorflow Translator
Optimize EqualStr and VocabEncoder Operations
Add Benchmarking Script
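
To make the left-padding and pad-to-length items above concrete, here is a minimal pure-Python sketch of how a batch of token id sequences can be padded on either side, optionally up to a requested length. The function `pad_batch` and its parameters (`pad_id`, `side`, `pad_to`) are hypothetical names chosen for illustration; they are not part of the OpenVINO Tokenizers API, and truncation is deliberately left out.

```python
from typing import List, Optional, Sequence, Tuple


def pad_batch(
    batch: Sequence[Sequence[int]],
    pad_id: int = 0,
    side: str = "right",
    pad_to: Optional[int] = None,
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to a common length; return (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")

    longest = max((len(seq) for seq in batch), default=0)
    # Pad to the longest sequence, or further if pad_to asks for more.
    target = max(longest, pad_to or 0)

    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in batch:
        fill = target - len(seq)
        if side == "left":
            input_ids.append([pad_id] * fill + list(seq))
            attention_mask.append([0] * fill + [1] * len(seq))
        else:
            input_ids.append(list(seq) + [pad_id] * fill)
            attention_mask.append([1] * len(seq) + [0] * fill)
    return input_ids, attention_mask


if __name__ == "__main__":
    ids, mask = pad_batch([[1, 2, 3], [4]], pad_id=0, side="left")
    print(ids)
    print(mask)
```

Left padding is commonly preferred for decoder-only generation, because it keeps the last real token of every sequence in the final position of the batch.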

Full Changelog: 2024.1.0.2...2024.2.0.0

10 May 09:42

apaniukov

2024.1.0.2

c754503

2024.1.0.2

What's Changed

Fixed prebuild tokenizers on Windows by @ilya-lavrenov in #141

Full Changelog: 2024.1.0.1...2024.1.0.2

Contributors

ilya-lavrenov

08 May 15:59

apaniukov

2024.1.0.1

37d20ce

2024.1.0.1

What's Changed

Llama3 Tokenizer Support
Add not-add-special-tokens flag to CLI conversion tool

Full Changelog: 2024.1.0.0...2024.1.0.1

25 Apr 13:04

apaniukov

2024.1.0.0

ad37623

2024.1.0.0

What's Changed

New operations:
- TrieTokenizer
- VocabEncoder
- EqualStr
- RaggedToSparse
- RaggedToRagged
- FuzeRagged
Update existing operations:
- Add max_splits argument to RegexSplit
- Add encoding argument to CaseFold
Add new and update existing TensorFlow translators for partial TextVectorization layer support.
RWKV tokenizer support.
New way to get OpenVINO Tokenizers - build from files. Supports RWKV tokenizer.
Update tokenizer operation caching mechanism for OpenVINO model caching support
SentencePiece tokenizer changes and fixes:
- Update to 0.2.0 version
- Use constant 0 as mask hide token by @as-suvorov in #90
- Sentencepiece BOS Token Detection
Fix multi-input model merging by @yas-sim in #53

New Contributors

@dependabot made their first contribution in #30
@yas-sim made their first contribution in #53
@as-suvorov made their first contribution in #90
@akladiev made their first contribution in #102

Full Changelog: 2024.0.0.0...2024.1.0.0

Contributors

akladiev, dependabot, and 2 other contributors

21 Mar 14:38

apaniukov

2024.0.0.0

aa0587d

2024.0.0.0

What's Changed

Improve Regex Support - filter lookarounds, unsupported by re2
Improve model coverage - T5 tokenizers, QWEN2
Add tokenizer metadata to rt_info - EOS token id
Support TensorFlow Text MUSE model conversion and inference

New Contributors

@Wovchena made their first contribution in #20

Contributors

Wovchena
