Releases: WorksApplications/sudachi.rs
v0.6.9
Highlights
- Support Python 3.13
  - py3.13t is not supported yet
- Remove Python 3.7 and 3.8 support, as both have reached end of life (https://devguide.python.org/versions/) (#249, #281)
rust
Added
- FreeBSD support (#222 by @KonstantinDjairo, #251)
- Add option for embedded config and fallback resources (#262 by @Kuuuube)
Changed
- `fetch_dictionary.sh` targets latest dictionary by default (#240)
- Migrate from structopt to clap (#248 by @tkhshtsh0917)
python
Added
- Allow string literals as `SplitMode` (#245)
- Add `sudachipy.Config` and `sudachipy.errors.SudachiError` to default import (#260)
Changed
- `-s` (system dictionary path) of `sudachi ubuild` command is now required (#239)
- `-d` option of sudachi cli (which is no-op) now warns (#278)
- Update the output of `sudachi dump` subcommand (#277)
See the Rust changelog and the Python changelog for more.
v0.6.8
Highlights
- Produce builds for Python 3.12 (#236)
- Add a simple configuration API
- Add surface projections (#230)
Surface Projections
- For chiTra compatibility, SudachiPy can now directly produce different tokens in the surface field.
- The original surface is accessible via the `Morpheme.raw_surface()` method.
- It is possible to customize projection dictionary-wise via a Config object, by passing it on dictionary creation, or for a single pre-tokenizer.
0.6.7
Highlights
- Provide binary wheels for Python 3.11
- Add `Dictionary.lookup()` method which allows you to enumerate morphemes from the dictionary without performing analysis.
0.6.6
Highlights
- Add boundary matching mode to regex oov handler
- macOS binary builds are now universal2 (arm+x64)
macOS
- Binary builds are universal2
- Caveat: we don't run tests on arm because there are no public arm instances, so builds may be broken without any warning
0.6.5
Highlights
- Fixed invalid POS tags which appeared when using user-defined POS tags both in user dictionaries and OOV handlers. You are not affected by this bug if you did not use user-defined POS in OOV handlers.
Version 0.6.4
Highlights
- Remove Python 3.6 support which reached end-of-life status on 2021-12-23
- OOV handler plugins support user-defined POS, similar to Java version
- Added Regex OOV handler
Regex OOV Handler
- For details, see Java version changelog
- In Rust/Python, regexes do not support backtracking and backreferences
- `maxLength` setting defines maximum length in Unicode codepoints, not in UTF-8 bytes as in Java (will be changed to codepoints later)
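The distinction between codepoints and UTF-8 bytes matters for any non-ASCII input. A minimal sketch in plain Python (no Sudachi API is involved, and `length_units` is a hypothetical helper name) that measures the same strings in both units:

```python
def length_units(text: str) -> dict:
    """Return the length of `text` in Unicode codepoints and in UTF-8 bytes."""
    return {
        "codepoints": len(text),                 # Python str length counts codepoints
        "utf8_bytes": len(text.encode("utf-8")),  # size of the UTF-8 encoding
    }


# ASCII text has the same length in both units; Japanese text does not.
samples = ("abc", "東京都", "\u0130stanbul")
lengths = {sample: length_units(sample) for sample in samples}
```

A limit such as `maxLength = 3` would therefore admit all of 東京都 when counted in codepoints, but only its first character when counted in UTF-8 bytes.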
0.6.3
Highlights
- Fixed path resolution algorithm for resources. They are now resolved in the following order (first existing file wins):
  - Absolute paths stay as they are
  - Relative to "path" value of the config file
  - Relative to "resource_dir" parameter of the config object during creation
    - For SudachiPy it is the parameter of `Dictionary` constructor
  - Relative to the location of the configuration file
  - Relative to the current directory
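The resolution order can be sketched in a few lines of Python. This is an illustration of the "first existing file wins" rule only, not the code Sudachi.rs uses: the function name `resolve_resource` and its parameter names are hypothetical, and returning `None` when nothing matches is an assumption.

```python
from pathlib import Path


def resolve_resource(name, path_value=None, resource_dir=None,
                     config_file=None, cwd=None):
    """Resolve `name` against the candidate directories, in priority order."""
    target = Path(name)
    if target.is_absolute():
        return target  # absolute paths stay as they are

    bases = []
    if path_value is not None:    # "path" value of the config file
        bases.append(Path(path_value))
    if resource_dir is not None:  # "resource_dir" parameter of the config object
        bases.append(Path(resource_dir))
    if config_file is not None:   # directory containing the configuration file
        bases.append(Path(config_file).parent)
    bases.append(Path(cwd) if cwd is not None else Path.cwd())

    for base in bases:
        candidate = base / target
        if candidate.is_file():
            return candidate      # first existing file wins
    return None                   # assumption: nothing found is reported as None
```

For example, `resolve_resource("system.dic", resource_dir="/opt/sudachi")` checks `/opt/sudachi/system.dic` first and falls back to the current directory.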
Python
- `Dictionary` now has `__repr__()` function which displays absolute paths to dictionaries in use.
- `Dictionary` now has `pos_of()` function which returns a POS tuple for a given POS id.
- `PosMatcher` supports set operations
  - union (`m1 | m2`)
  - intersection (`m1 & m2`)
  - difference (`m1 - m2`)
  - negation (`~m1`)
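The semantics of these four operators can be illustrated with a toy model. `ToyPosMatcher` below is a hypothetical stand-in that treats a matcher as a set of POS ids; it is not SudachiPy's `PosMatcher`, and the explicit `universe` argument is an assumption needed to make negation well defined in the toy.

```python
class ToyPosMatcher:
    """Toy model of a POS matcher as a set of POS ids (not SudachiPy's class)."""

    def __init__(self, ids, universe):
        self.universe = frozenset(universe)          # every known POS id
        self.ids = frozenset(ids) & self.universe    # ids this matcher accepts

    def __call__(self, pos_id):
        return pos_id in self.ids

    def __or__(self, other):       # union: accepted by either matcher
        return ToyPosMatcher(self.ids | other.ids, self.universe)

    def __and__(self, other):      # intersection: accepted by both matchers
        return ToyPosMatcher(self.ids & other.ids, self.universe)

    def __sub__(self, other):      # difference: accepted by self but not other
        return ToyPosMatcher(self.ids - other.ids, self.universe)

    def __invert__(self):          # negation: everything self rejects
        return ToyPosMatcher(self.universe - self.ids, self.universe)


all_ids = range(6)
m1 = ToyPosMatcher({0, 1, 2}, all_ids)
m2 = ToyPosMatcher({2, 3}, all_ids)
```

With these definitions, `(m1 - m2)(2)` is `False` because id 2 belongs to both matchers, while `(~m1)(4)` is `True`.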
0.6.2
Highlights
- Fixed analysis differences from 0.5.4
- Central dot ・ is handled correctly
- Catch-all OOV handler was used even if other OOV handlers could produce some results
0.6.1
Highlights
- Added fuzzing (see the `sudachi-fuzz` subdirectory). Sudachi.rs seems to be pretty robust towards arbitrary inputs (no crashes or panics)
  - Issues like #182 should no longer occur
- ~5% analysis speed improvement over 0.6.0
- Added support for Unicode combining symbols, now Sudachi.rs/py should be much better with emoji (🎅🏾) and more complex Unicode (İstanbul)
Rust
- Added partial dictionary read functionality: it is now possible to skip reading certain fields if they are not needed
- Improved startup times, especially for debug builds
Python
- `Morpheme.part_of_speech` method now returns a tuple of POS components instead of a list.
- Partial Dictionary Read
  - It is possible to ask for a subset of morpheme fields instead of all fields
  - Supported API: `Dictionary.create()`, `Dictionary.pre_tokenizer()`
- HuggingFace PreTokenizer support
  - We provide a built-in HuggingFace-compatible pre-tokenizer
  - API: `Dictionary.pre_tokenizer()`
  - It is multithreading-compatible and supports customization
- Memory allocation reuse
  - It is possible to reduce re-allocation overhead by using `out` parameters which accept `MorphemeList`s
  - Supported API: `Tokenizer.tokenize()`, `Morpheme.split()`
  - It is now the recommended way to use both of those APIs
- PosMatcher
  - New API for checking if a morpheme has a POS tag from a set
  - Strongly prefer using it instead of string comparison of POS components
- Performance
  - Greatly decreased cost of accessing POS components
- `len(Morpheme)` now returns the length of the morpheme in Unicode codepoints. Use it instead of `len(m.surface())`
- `Morpheme.split()` has a new `add_single` parameter, which can be used to check whether the split has produced anything
  - E.g. with `if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)`
  - `add_single=True`, returning the list with the current morpheme, is the current behavior
- `Morpheme`/`MorphemeList` now have readable `__repr__` and `__str__`
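The `out` and `add_single` parameters combine into a simple loop pattern: keep one output buffer, refill it on every `split()` call, and use the truthiness of the result to tell whether anything was split. The sketch below shows only that pattern. `FakeMorpheme` is a hypothetical stand-in so the code runs without a dictionary installed; it assumes that `split()` clears and refills `out` and returns it, and it uses a plain Python list where SudachiPy expects a `MorphemeList`.

```python
class FakeMorpheme:
    """Hypothetical stand-in for a SudachiPy Morpheme (not the real class)."""

    def __init__(self, surface, parts=None):
        self._surface = surface
        self._parts = parts or []   # surfaces of the sub-units, if any

    def surface(self):
        return self._surface

    def split(self, mode, out=None, add_single=True):
        # Assumption: `out` is cleared, refilled and returned.
        # The toy ignores `mode`.
        result = out if out is not None else []
        result.clear()
        result.extend(FakeMorpheme(part) for part in self._parts)
        if not result and add_single:
            result.append(self)     # nothing to split: return the morpheme itself
        return result


def finest_units(morphemes, mode):
    """Collect surfaces, replacing each morpheme by its sub-units when it splits."""
    buffer = []                     # one buffer, reused for every split() call
    surfaces = []
    for m in morphemes:
        if m.split(mode, out=buffer, add_single=False):
            surfaces.extend(part.surface() for part in buffer)
        else:
            surfaces.append(m.surface())
    return surfaces
```

With `add_single=False`, an empty (falsy) result means "no split happened", so the caller can keep the original morpheme without comparing list lengths.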
0.6.0
Highlights
- Full feature parity with Java version
- ~15% analysis speed improvement over 0.6.0-rc1
- SudachiPy compatible Python bindings
- ~30x speed improvement over original SudachiPy
Rust
- No public API at the moment (contact us if you want to use Rust version directly, internals will significantly change and names are not finalized)
- Added dictionary build functionality
- Added an option to perform analysis without sentence splitting
  - Use it with `--split-sentences=no`
Python
- Added bindings for dictionary build (undocumented and not supported as API).
  - See #157
- `sudachipy build` and `sudachipy ubuild` should work once more
  - Report on build times and dictionary part sizes can differ from the original SudachiPy