Changelog
Added
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of by row. When this is enabled, each worker no longer reads every fragment while keeping only 1 in n rows; instead it reads all rows of 1/n of the fragments, which should be faster.
- Accept no validation data in the `edsnlp.train` script
- Log the training config at the beginning of the trainings
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`)
- Specify whether to log the validation results or not (`logger=False`)
- Added support for the CoNLL format with `edsnlp.data.read_conll` and with a specific `eds.conll_dict2doc` converter
- Added a Trainable Biaffine Dependency Parser (`eds.biaffine_dep_parser`) component and metrics
- New `eds.extractive_qa` component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`.
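The difference between the row and fragment work units can be illustrated with a small, library-free sketch. It does not use edsnlp or pyarrow: the nested lists stand in for parquet fragments, the helper names `split_by_row` and `split_by_fragment` are hypothetical, and the round-robin assignment is an assumption rather than a description of edsnlp's actual scheduling.

```python
# Toy dataset: 4 "fragments" (stand-ins for parquet fragments), 3 rows each.
fragments = [[f"f{i}r{j}" for j in range(3)] for i in range(4)]


def split_by_row(fragments, worker, n_workers):
    """Row-wise split: the worker opens every fragment and keeps 1 in n rows."""
    opened, rows, idx = 0, [], 0
    for fragment in fragments:
        opened += 1  # every fragment has to be read by every worker
        for row in fragment:
            if idx % n_workers == worker:
                rows.append(row)
            idx += 1
    return opened, rows


def split_by_fragment(fragments, worker, n_workers):
    """Fragment-wise split: the worker opens 1 in n fragments and keeps all rows."""
    opened, rows = 0, []
    for i, fragment in enumerate(fragments):
        if i % n_workers != worker:
            continue  # this fragment is never opened by this worker
        opened += 1
        rows.extend(fragment)
    return opened, rows


n_workers = 2
by_row = [split_by_row(fragments, w, n_workers) for w in range(n_workers)]
by_fragment = [split_by_fragment(fragments, w, n_workers) for w in range(n_workers)]

for w in range(n_workers):
    print(
        f"worker {w}: row split opens {by_row[w][0]} fragments, "
        f"fragment split opens {by_fragment[w][0]}"
    )
```

In both modes every row is processed exactly once across the workers. Only the number of fragments each worker has to open changes, which is where the expected speed-up comes from.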
Fixed
- Fix `join_thread` missing attribute in `SimpleQueue` when cleaning a multiprocessing executor
- Support huggingface transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings)
- Fix changing scorers dict size issue when evaluating during training
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers. This is important for:
  - reproducibility
  - in multiprocessing mode, ensuring that the same data is shuffled in the same way in all workers
- Bubble BaseComponent instantiation errors correctly
- Improved support for multi-gpu gradient accumulation (only sync the gradients at the end of the accumulation), now controlled by the optional `sub_batch_size` argument of `TrainingData`.
- Support again edsnlp without pytorch installed
- We now test that edsnlp works without pytorch installed
- Fix units and scales, i.e. 1 l = 1 dm3, 1 ml = 1 cm3
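The seeding fix in the list above matters because workers that shuffle with unseeded generators each see a different order. The following is a minimal sketch using only Python's standard `random` module, not edsnlp's data readers. The helper `shuffled_shard`, the seed value, and the strided sharding are hypothetical choices made for illustration.

```python
import random


def shuffled_shard(items, seed, worker, n_workers):
    """Shuffle with a seeded generator, then keep this worker's share."""
    rng = random.Random(seed)  # same seed -> same permutation in every worker
    order = list(items)
    rng.shuffle(order)
    return order[worker::n_workers]  # disjoint strided slice per worker


data = list(range(10))
shards = [shuffled_shard(data, seed=42, worker=w, n_workers=3) for w in range(3)]
```

Because every worker derives the same permutation from the shared seed, the shards are disjoint, together cover the whole dataset, and are identical from one run to the next.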
Pull Requests
- fix: check join_thread attribute in queue when cleaning mp exec by @percevalw in #345
- fix: support hf transformers with cls_token_id and sep_token_id set to None by @percevalw in #346
- fix: changing scorers dict size issue when evaluating during training by @percevalw in #347
- Fix streams by @percevalw in #350
- Various trainer fixes by @percevalw in #352
- Trainable biaffine dependency parser by @percevalw in #353
- feat: new eds.extractive_qa component by @percevalw in #351
- Fix training and multiprocessing by @percevalw in #354
- fix: correct conversions for volumes, areas by @etienneguevel in #349
- chore: bump version to 0.15.0 by @percevalw in #355
Full Changelog: v0.14.0...v0.15.0