Releases: gbenson/dom-tokenizers
0.0.17
0.0.16
- Handle more apostrophe surrogates
- Fix a tokenizer crash
0.0.15
Consolidation
0.0.13
Don't lowercase special tokens
0.0.12
- Change tokenization to be more like HTML
- Switch back to uncased base model
0.0.11
Major refactor
0.0.10
Introduce `DOMSnapshotPreTokenizer.hook_into()`
0.0.9
Transliterate non-ASCII input texts
0.0.8
Tokenizer comparison script
0.0.7
Whole dataset tokenizer, for comparisons