Release v0.15.0 · aphp/edsnlp

Changelog

edsnlp.data.read_parquet now accepts a work_unit="fragment" option to split tasks between workers by parquet fragment instead of by row. When this is enabled, each worker no longer reads every fragment while keeping only 1 in n rows; instead it reads all rows of 1/n of the fragments, which should be faster.
Accept no validation data in edsnlp.train script
Log the training config at the beginning of training
Support a specific model output dir path for trainings (output_model_dir), and whether to save the model or not (save_model)
Specify whether to log the validation results or not (logger=False)
Added support for the CoNLL format with edsnlp.data.read_conll and with a specific eds.conll_dict2doc converter
Added a Trainable Biaffine Dependency Parser (eds.biaffine_dep_parser) component and metrics
New eds.extractive_qa component to perform extractive question answering using questions as prompts to tag entities instead of a list of predefined labels as in eds.ner_crf.
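
To illustrate the work_unit="fragment" entry above, here is a minimal sketch in plain Python of the two ways of splitting work between workers. It does not call edsnlp: the lists stand in for parquet fragments, the function names are hypothetical, and the round-robin assignment of fragments to workers is an assumption made for the example, not a description of the library's internals.

```python
# Illustrative only: plain lists stand in for parquet fragments.
fragments = [
    ["a0", "a1", "a2"],
    ["b0", "b1", "b2"],
    ["c0", "c1", "c2"],
    ["d0", "d1", "d2"],
]
n_workers = 2


def rows_by_record(worker_idx):
    """Row-level split: every worker scans ALL fragments, keeps 1 row in n."""
    all_rows = [row for fragment in fragments for row in fragment]
    return all_rows[worker_idx::n_workers]


def rows_by_fragment(worker_idx):
    """Fragment-level split: each worker reads all rows of 1/n of the fragments.

    Round-robin assignment is an assumption of this sketch.
    """
    assigned = fragments[worker_idx::n_workers]
    return [row for fragment in assigned for row in fragment]


for idx in range(n_workers):
    print(idx, rows_by_record(idx), rows_by_fragment(idx))
```

In both modes every row is processed exactly once, but in the fragment-level split each worker only has to open its own fragments, which is where the expected speed-up comes from.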

Fix join_thread missing attribute in SimpleQueue when cleaning a multiprocessing executor
Support huggingface transformers that do not set cls_token_id and sep_token_id (we now also look for these tokens in the special_tokens_map and vocab mappings)
Fix changing scorers dict size issue when evaluating during training
Seed random states (instead of using random.RandomState()) when shuffling in data readers. This is important for:
1. reproducibility
2. ensuring, in multiprocessing mode, that the same data is shuffled in the same way in all workers
Bubble BaseComponent instantiation errors correctly
Improved support for multi-gpu gradient accumulation (gradients are only synced at the end of the accumulation), now controlled by the optional sub_batch_size argument of TrainingData.
Support edsnlp without pytorch installed again
We now test that edsnlp works without pytorch installed
Fix units and scales, i.e. 1 l = 1 dm3, 1 ml = 1 cm3
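
The seeded-shuffling fix listed above can be pictured with a small standard-library sketch. It is not edsnlp's reader code: worker_view, the seed value and the stride-based split are hypothetical names and choices made for the example. The point is that when every worker shuffles with the same seed, they all see the same order, so taking 1 item in n per worker covers the data exactly once.

```python
import random

docs = [f"doc-{i}" for i in range(10)]


def worker_view(seed, worker_idx, n_workers):
    """Hypothetical helper: shuffle with a seeded RNG, then keep 1 item in n."""
    rng = random.Random(seed)  # same seed gives the same order in every worker
    shuffled = docs.copy()
    rng.shuffle(shuffled)
    return shuffled[worker_idx::n_workers]


# Each "worker" rebuilds the same shuffled order independently.
views = [worker_view(42, idx, 2) for idx in range(2)]

# Together the workers cover every document exactly once.
print(sorted(views[0] + views[1]) == sorted(docs))
```

With an unseeded generator, each worker would shuffle differently, so some documents could be processed twice and others skipped, and the run could not be reproduced.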

fix: check join_thread attribute in queue when cleaning mp exec by @percevalw in #345
fix: support hf transformers with cls_token_id and sep_token_id set to None by @percevalw in #346
fix: changing scorers dict size issue when evaluating during training by @percevalw in #347
Fix streams by @percevalw in #350
Various trainer fixes by @percevalw in #352
Trainable biaffine dependency parser by @percevalw in #353
feat: new eds.extractive_qa component by @percevalw in #351
Fix training and multiprocessing by @percevalw in #354
fix: correct conversions for volumes, areas by @etienneguevel in #349
chore: bump version to 0.15.0 by @percevalw in #355

Full Changelog: v0.14.0...v0.15.0