Releases · aphp/edsnlp
v0.15.0
Changelog
Added
- `edsnlp.data.read_parquet` now accepts a `work_unit="fragment"` option to split tasks between workers by parquet fragment instead of by row. When this is enabled, workers no longer read every fragment while skipping 1 in n rows, but read all rows of 1/n fragments, which should be faster (see the sketch after this list)
- Accept no validation data in the `edsnlp.train` script
- Log the training config at the beginning of the trainings
- Support a specific model output dir path for trainings (`output_model_dir`), and whether to save the model or not (`save_model`)
- Specify whether to log the validation results or not (`logger=False`)
- Added support for the CoNLL format with `edsnlp.data.read_conll` and a specific `eds.conll_dict2doc` converter
- Added a Trainable Biaffine Dependency Parser (`eds.biaffine_dep_parser`) component and metrics
- New `eds.extractive_qa` component to perform extractive question answering, using questions as prompts to tag entities instead of a list of predefined labels as in `eds.ner_crf`
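A minimal sketch of the fragment-based work splitting, assuming an OMOP-shaped parquet dataset and the built-in `"omop"` converter (paths, pipes and worker count are illustrative):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
nlp.add_pipe(eds.sentences())

# Split work between workers by whole parquet fragments instead of by rows:
# each worker reads all rows of 1/n fragments rather than 1 in n rows of every fragment.
docs = edsnlp.data.read_parquet(
    "data/notes/",          # hypothetical path to a parquet dataset
    converter="omop",       # assumed to match the dataset schema
    work_unit="fragment",
)
docs = docs.map_pipeline(nlp)
docs = docs.set_processing(backend="multiprocessing", num_cpu_workers=4)

for doc in docs:
    ...
```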
Fixed
- Fix `join_thread` missing attribute in `SimpleQueue` when cleaning a multiprocessing executor
- Support huggingface transformers that do not set `cls_token_id` and `sep_token_id` (we now also look for these tokens in the `special_tokens_map` and `vocab` mappings)
- Fix changing scorers dict size issue when evaluating during training
- Seed random states (instead of using `random.RandomState()`) when shuffling in data readers; this is important for:
  - reproducibility
  - ensuring, in multiprocessing mode, that the same data is shuffled in the same way in all workers
- Bubble BaseComponent instantiation errors correctly
- Improved support for multi-GPU gradient accumulation (only sync the gradients at the end of the accumulation), now controlled by the optional `sub_batch_size` argument of `TrainingData` (see the sketch after this list)
- Support again edsnlp without pytorch installed
- We now test that edsnlp works without pytorch installed
- Fix units and scales, i.e. 1 l = 1 dm³, 1 ml = 1 cm³
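A sketch of the new gradient-accumulation control, assuming `TrainingData` is importable from `edsnlp.training` and accepts the string batch sizes used elsewhere in the library (the import path, data path, shuffle value and sizes are assumptions):

```python
import edsnlp
from edsnlp.training import TrainingData  # import path assumed

train_data = TrainingData(
    data=edsnlp.data.read_parquet("data/train/", converter="omop", shuffle="dataset", loop=True),
    batch_size="8000 words",      # effective batch consumed per optimizer step
    sub_batch_size="2000 words",  # gradients accumulated over ~4 sub-batches,
                                  # synchronized across GPUs only at the end
)
```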
Pull Requests
- fix: check join_thread attribute in queue when cleaning mp exec by @percevalw in #345
- fix: support hf transformers with cls_token_id and sep_token_id set to None by @percevalw in #346
- fix: changing scorers dict size issue when evaluating during training by @percevalw in #347
- Fix streams by @percevalw in #350
- Various trainer fixes by @percevalw in #352
- Trainable biaffine dependency parser by @percevalw in #353
- feat: new eds.extractive_qa component by @percevalw in #351
- Fix training and multiprocessing by @percevalw in #354
- fix: correct conversions for volumes, areas by @etienneguevel in #349
- chore: bump version to 0.15.0 by @percevalw in #355
Full Changelog: v0.14.0...v0.15.0
v0.14.0
Changelog
Added
- Support for setuptools based projects in the `edsnlp.package` command
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the `@core = "pipeline"` (or `"load"`) field in the pipeline section
- `edsnlp.load` now correctly takes `disable`, `enable` and `exclude` parameters into account
- Pipeline now has a basic repr showing its base language (mostly useful to know its tokenizer) and its pipes
- New `python -m edsnlp.evaluate` script to evaluate a model on a dataset
- Sentence detection can now be configured to change the minimum number of newlines to consider a newline-triggered sentence, and to disable capitalization checking
- New `eds.split` pipe to split a document into multiple documents based on a splitting pattern (useful for training)
- Allow the `converter` argument of `edsnlp.data.read/from_...` to be a list of converters instead of a single converter
- New revamped and documented `edsnlp.train` script and API
- Support YAML config files (only CFG/INI files were supported before)
- Most EDS-NLP functions are now clickable in the documentation
- `ScheduledOptimizer` now accepts schedules directly in place of parameters, and easy parameter selection:
```python
ScheduledOptimizer(
    optim="adamw",
    module=nlp,
    total_steps=2000,
    groups={
        "^transformer": {
            # lr will go from 0 to 5e-5 then back to 0 for params matching "transformer"
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
        },
        "": {
            # lr will stay at 3e-4 during the first 200 steps then decay to 0 for other params
            "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
        },
    },
)
```
Changed
- `eds.span_context_getter`'s parameter `context_sents` is no longer optional and must be explicitly set to 0 to disable sentence context
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU workers, since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters
- The `batch_size` argument of `Pipeline` is deprecated and is not used anymore. Use the `batch_size` argument of `stream.map_pipeline` instead (see the sketch after this list).
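A sketch of the replacement for the deprecated pipeline-level batch size, with the batch size set on the stream instead (the model path, data path, converter and batch size are illustrative):

```python
import edsnlp

nlp = edsnlp.load("path/to/model")  # hypothetical model

docs = edsnlp.data.read_json("data/notes/", converter="omop")
# The batch size now lives on the stream, not on the Pipeline object
docs = docs.map_pipeline(nlp, batch_size=32)

for doc in docs:
    ...
```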
Fixed
- Sort files before iterating over a standoff or json folder to ensure reproducibility
- Sentence detection now correctly matches capitalized letters + apostrophe
- We now ensure that the workers pool is properly closed whatever happens (exception, garbage collection, data ending) in the `multiprocessing` backend. This prevents some executions from hanging indefinitely at the end of the processing.
- Propagate the torch sharing strategy to other workers in the `multiprocessing` backend. This is useful when the system is running out of file descriptors and `ulimit -n` is not an option. The torch sharing strategy can also be set via the `TORCH_SHARING_STRATEGY` environment variable (default is `file_descriptor`; consider using `file_system` if you encounter issues, as in the sketch after this list).
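A sketch of switching the sharing strategy through the environment variable; it must be set before the multiprocessing workers are spawned (the model and data paths are illustrative):

```python
import os

# Must be set before the stream is executed so that spawned workers inherit it
os.environ["TORCH_SHARING_STRATEGY"] = "file_system"

import edsnlp

nlp = edsnlp.load("path/to/model")  # hypothetical model
docs = edsnlp.data.read_parquet("data/notes/", converter="omop")
docs = docs.map_pipeline(nlp).set_processing(backend="multiprocessing")
results = list(docs)
```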
Data API changes
- `LazyCollection` objects are now called `Stream` objects
- By default, the `multiprocessing` backend now preserves the order of the input data. To disable this and improve performance, use `deterministic=False` in the `set_processing` method
- 🚀 Parallelized GPU inference throughput improvements!
  - For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order)
  - For multitask pipelines, GPU inference can be up to twice as fast (measured on a two-task BERT+NER+Qualif pipeline on T4 and A100 GPUs)
- The `.map_batches`, `.map_pipeline` and `.map_gpu` methods now support a specific `batch_size` and batching function, instead of having a single batch size for all pipes
- Readers now have a `loop` parameter to cycle over the data indefinitely (useful for training)
- Readers now have a `shuffle` parameter to shuffle the data before iterating over it
- In `multiprocessing` mode, file based readers now read the data in the workers (this was an option before)
- We now support two new special batch sizes:
  - "fragment", in the case of parquet datasets: rows of a full parquet file fragment per batch
  - "dataset", which is mostly useful during training, for instance to shuffle the dataset at each epoch.
  These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
- 💥 Breaking change: a `map` function returning a list or a generator won't be automatically flattened anymore. Use `flatten()` to flatten the output if needed. This shouldn't change the behavior for most users since most writers (to_pandas, to_polars, to_parquet, ...) still flatten the output
- 💥 Breaking change: the `chunk_size` and `sort_chunks` arguments are now deprecated: to sort data before applying a transformation, use `.map_batches(custom_sort_fn, batch_size=...)` (see the sketch after this list)
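A sketch tying several of these Stream features together (reader shuffling, a custom batch sort in place of the deprecated `chunk_size`/`sort_chunks`, per-pipe batch sizes, non-deterministic multiprocessing); the model path, data path, shuffle value and sizes are illustrative assumptions:

```python
import edsnlp

nlp = edsnlp.load("path/to/model")  # hypothetical model

stream = edsnlp.data.read_parquet("data/notes/", converter="omop", shuffle="dataset")

def sort_by_length(batch):
    # Replaces the deprecated chunk_size / sort_chunks options
    return sorted(batch, key=len)

stream = stream.map_batches(sort_by_length, batch_size=256)
stream = stream.map_pipeline(nlp, batch_size=32)
stream = stream.set_processing(backend="multiprocessing", deterministic=False)

for doc in stream:
    ...
```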
Training API changes
- We now provide a training script `python -m edsnlp.train --config config.cfg` that should fit many use cases. Check out the docs!
- In particular, we do not require pytorch's Dataloader for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time we apply a noised preprocessing op on a sample)
- Each trainable component can now provide a `stats` field in its `preprocess` output to log info about the sample (number of words, tokens, spans, ...). These stats are used:
  - for batching (e.g., make batches of no more than "25000 tokens")
  - for logging
  - for computing correct loss means when accumulating gradients over multiple mini-mini-batches
  - for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs
- Support multi-GPU training via huggingface `accelerate` and the EDS-NLP `Stream` API's consideration of the `WORLD_SIZE` and `LOCAL_RANK` environment variables
Pull Requests
- Improve training tutorials by @percevalw in #331
- Various fixes by @percevalw in #332
- Multiprocessing related fixes by @percevalw in #333
- chore: bump version to 0.14.0 by @percevalw in #334
Full Changelog: v0.13.1...v0.14.0
v0.13.1
Changelog
Added
- `eds.tables` accepts a `minimum_table_size` (default 2) argument to reduce pollution (see the sketch after this list)
- `RuleBasedQualifier` now exposes a `process` method that only returns qualified entities and tokens without actually tagging them, deferring this task to the `__call__` method
- Added new patterns for metastasis detection developed on CT-Scan reports
- Added citation of articles
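A sketch of the new threshold, assuming the pipe is added via `edsnlp.pipes` and that detected tables land in a span group (the span group name and threshold value are assumptions):

```python
import edsnlp
import edsnlp.pipes as eds

nlp = edsnlp.blank("eds")
# Tables smaller than the threshold are ignored, reducing false positives on pollution
nlp.add_pipe(eds.tables(minimum_table_size=3))

doc = nlp("col1 | col2\n1 | 2\n3 | 4\n")
tables = doc.spans.get("tables", [])  # span group name assumed
```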
Fixed
- Disorder and Behavior pipes don't use a "PRESENT" or "ABSENT" `status` anymore. Instead, `status=None` by default, and `ent._.negation` is set to True instead of setting `status` to "ABSENT". To this end, the tobacco and alcohol components now use the `NegationQualifier` internally.
- Numbers are now detected without trying to remove the pollution in between digits, i.e. `55 @ 77777` could be detected as a full number before, but not anymore.
- Fix fsspec open file encoding to "utf-8".
Changed
- Rename `eds.measurements` to `eds.quantities`
- scikit-learn (used in `eds.endlines`) is no longer installed by default when installing `edsnlp[ml]`
Pull Requests
- Remove pollution exclusion during numbers matching by @percevalw in #316
- Rename eds.measurements by @svittoz in #313
- Adding minimum_table_size argument to eds.tables by @svittoz in #318
- Fs encoding fix by @Aremaki in #320
- chore(deps): bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in #319
- fix: skip spacy 3.8.0 due to numpy build dep by @percevalw in #321
- Fix behavior, disorder and qualifier pipes by @Thomzoy in #322
- Metastatic status by @aricohen93 in #308
- chore: bump version to 0.13.1 by @percevalw in #327
- Test 3.12 by @percevalw in #328
New Contributors
- @dependabot made their first contribution in #319
Full Changelog: v0.13.0...v0.13.1
v0.13.0
Changelog
Added
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of tensors during the processing. Autocasting should result in a slight speedup, but may lead to numerical instability (see the sketch after this list).
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference
- Added a new NER pipeline for suicide attempt detection
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`
- Added tables processing in `eds.measurement`
- Added 'all' as a possible input in the `eds.measurement` measurements config
- Added new units in `eds.measurement`
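A sketch of disabling autocasting when mixed-precision inference proves numerically unstable (the model path and text are illustrative; per the entry above, `autocast` could also be tweaked rather than turned off):

```python
import edsnlp

nlp = edsnlp.load("path/to/model")  # hypothetical model with torch components

texts = ["Le patient est admis pour une pneumopathie."]
# Disable automatic mixed-precision casting if it leads to numerical instability
docs = nlp.pipe(texts).set_processing(autocast=False)
results = list(docs)
```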
Changed
- Default to mixed precision inference
Fixed
edsnlp.load("your/huggingface-model", install_dependencies=True)
now correctly resolves the python pip
(especially on Colab) to auto-install the model dependencies- We now better handle empty documents in the
eds.transformer
,eds.text_cnn
andeds.ner_crf
components - Support mixed precision in
eds.text_cnn
andeds.ner_crf
components - Support pre-quantization (<4.30) transformers versions
- Verify that all batches are non empty
- Fix
span_context_getter
forcontext_words
= 0,context_sents
> 2 and support assymetric contexts - Don't split sentences on rare unicode symbols
- Better detect abbreviations, like
E.coli
, now split as [E.
,coli
] and not [E
,.
,coli
]
What's Changed
- Various ml fixes by @percevalw in #303
- TS by @aricohen93 in #269
- date cues by @cvinot in #265
- Fix fast inference by @percevalw in #305
- Fix typo in diabetes patterns by @isabelbt in #306
- Fix span context getter by @aricohen93 in #307
- Fix sentences by @percevalw in #310
- chore: bump version to 0.13.0 by @percevalw in #312
New Contributors
Full Changelog: v0.12.3...v0.13.0
v0.12.3
v0.12.2
Changelog
Changed
Packages:
- Pip-installable models are now built with `hatch` instead of poetry, which allows us to expose `artifacts` (weights) at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts
- Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
- Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
- Huggingface models are now installed with pip editable installations, which is faster since it doesn't copy around the weights
What's Changed
- Better packages by @percevalw in #302
Full Changelog: v0.12.1...v0.12.2
v0.12.1
Changelog
Added
- Added binary distribution for linux aarch64 (Streamlit's environment)
- Added a new separator option in `eds.tables` and a new input check
Fixed
- Make catalogue & entrypoints compatible with py37-py312
- Check that a data item has a doc before trying to use the document's `note_datetime`
Pull Requests
- Fix catalogue entrypoints by @percevalw in #297
- Adding sep_pattern in eds.tables docstring by @svittoz in #286
- chore: bump version to 0.12.1 by @percevalw in #300
Full Changelog: v0.12.0...v0.12.1
v0.12.0
Changelog
Added
- The `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method, see breaking change below) to add before each window of text to embed
- `LazyCollection.map` / `map_batches` now support generator functions as arguments
- Window stride can now be disabled (i.e., stride = window) during training in the `eds.transformer` component by setting `training_stride = False`
- Added a new `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting true when the dice overlap is above a given threshold
- `edsnlp.load` now accepts EDS-NLP models from the huggingface hub 🤗! (see the sketch after this list)
- New `python -m edsnlp.package` command to package a model for the huggingface hub or pypi-like registries
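A sketch of loading a model from the huggingface hub, assuming it was packaged beforehand with `python -m edsnlp.package`; the model identifier and text are hypothetical:

```python
import edsnlp

# Hypothetical hub identifier of an EDS-NLP model packaged with `edsnlp.package`
nlp = edsnlp.load("my-org/my-eds-model")
doc = nlp("Le patient présente une anémie.")
```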
Changed
- Trainable embedding components now all use `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor
- 💥 `TorchComponent.__call__` no longer applies the end-to-end method, and instead calls the `forward` method directly, like all torch modules
- The trainable `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span, using its context or not)
- The `omop` converter now takes the `note_datetime` field into account by default when building a document
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
- 💥 Major breaking change in trainable components, moving towards a more "task-centric" design:
  - the `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
  - similarly, the `eds.span_pooler` is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the `preprocess` method.
Consequently, the `eds.transformer` and `eds.span_pooler` components no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:

```diff
- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter
- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter
```

and as an example for the `eds.span_linker` component:
```diff
 nlp.add_pipe(
     eds.span_linker(
         metric="cosine",
         probability_mode="sigmoid",
+        span_getter="ents",
+        # context_getter="ents",  -> by default, same as span_getter
         embedding=eds.span_pooler(
             hidden_size=128,
-            span_getter="ents",
             embedding=eds.transformer(
-                span_getter="ents",
                 model="prajjwal1/bert-tiny",
                 window=128,
                 stride=96,
             ),
         ),
     ),
     name="linker",
 )
```
Fixed
- `edsnlp.data.read_json` now correctly reads the files from the directory passed as an argument, and not from the parent directory
- Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, `span._.date.doc`)
- Removed the pendulum dependency, solving various pickling, multiprocessing and missing-attribute errors
Pull Requests
- Drop codecov by @percevalw in #292
- Fix dates by @percevalw in #288
- Loading models from the hf hub by @percevalw in #293
- Fix: only reinstall hf model when cache files are changed by @percevalw in #295
- feat: expose package script to cli by @percevalw in #294
- chore: bump version to 0.12.0 by @percevalw in #296
Full Changelog: v0.11.2...v0.12.0
v0.11.2
Changelog
Fixed
- Fix `edsnlp.utils.file_system.normalize_fs_path` file system detection not working correctly
- Improved performance of `edsnlp.data` methods over a filesystem (`fs` parameter; see the sketch after this list)
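A sketch of passing an fsspec filesystem explicitly through the `fs` parameter; the S3 filesystem, bucket path and converter are illustrative and assume `s3fs` is installed:

```python
import fsspec
import edsnlp

fs = fsspec.filesystem("s3")  # requires s3fs; credentials taken from the environment
docs = edsnlp.data.read_json("my-bucket/notes/", converter="omop", fs=fs)
```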
Pull Requests
- Fix normalize fs path by @svittoz in #283
- Faster fs io by @percevalw in #285
New Contributors
Full Changelog: v0.11.1...v0.11.2
v0.11.1
Changelog
Added
- Automatic estimation of cpu count when using multiprocessing
- `optim.initialize()` method to create optim state before the first backward pass
Changed
- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)
Fixed
- Corrected inconsistencies in `eds.span_linker`
Pull Requests
- Fix span linking by @percevalw in #282
Full Changelog: v0.11.0...v0.11.1