CLN/ENH: Rename and refactor datapipes, add datasets; fix #574 #724 #754 #755

NickleDave · 2024-05-11T12:48:36Z

This PR:

- Renames the current datasets -> datapipes
- Renames the datapipe classes to datapipes.frame_classification.TrainDatapipe and datapipes.frame_classification.InferDatapipe (for frame classification models), or just datapipes.parametric_umap.Datapipe (for parametric UMAP models)
- Has the datapipes use a default set of transforms, and adds parameters to the classes for those transforms. This lets us remove the spaghetti code that calls transforms.get with parameters that sometimes also appear in a dataset class, and it means a user does not need to know the nuances of transform vs. dataset until they really need to, as discussed in #574 (Refactor frame classification models to use single WindowedFramesDatapipe). For now I'm not using a single datapipe for frame classification, to avoid having either a gigantic __getitem__ method or another function call inside __getitem__ that depends on what "mode" (train/infer) we are in
- Adds a new datasets module with an initial implementation of the class for the BioSoundSegBench dataset
  - Rewrites FrameClassificationModel to make it possible to run different models (train/eval/predict) on the different targets of this dataset
  - This is not at all tested right now in the unit tests 😕
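The idea of a datapipe that always builds its own default item transform can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `TrainDatapipeSketch` and `TrainItemTransformSketch` are hypothetical names, frames are plain Python lists rather than arrays, and the only transform parameter shown is `frames_standardizer`.

```python
from typing import Callable, Optional, Sequence


class TrainItemTransformSketch:
    """Hypothetical stand-in for a default item transform."""

    def __init__(self, frames_standardizer: Optional[Callable] = None):
        self.frames_standardizer = frames_standardizer

    def __call__(self, frames: Sequence[float], frame_labels: Sequence[int]) -> dict:
        frames = list(frames)
        if self.frames_standardizer is not None:
            frames = self.frames_standardizer(frames)
        return {"frames": frames, "frame_labels": list(frame_labels)}


class TrainDatapipeSketch:
    """Hypothetical datapipe that always builds its own item transform."""

    def __init__(self, frames, frame_labels, window_size, frames_standardizer=None):
        if len(frames) != len(frame_labels):
            raise ValueError("frames and frame_labels must have the same length")
        if not 0 < window_size <= len(frames):
            raise ValueError("window_size must be between 1 and len(frames)")
        self.frames = frames
        self.frame_labels = frame_labels
        self.window_size = window_size
        # Transform parameters are arguments of the datapipe itself,
        # so callers never fetch or configure a transform separately.
        self.item_transform = TrainItemTransformSketch(frames_standardizer)

    def __len__(self):
        # Number of windows of length window_size that fit in the frames.
        return len(self.frames) - self.window_size + 1

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        window = slice(idx, idx + self.window_size)
        return self.item_transform(self.frames[window], self.frame_labels[window])
```

With this shape, the calling code only passes keyword arguments to the datapipe and never needs a separate transform lookup.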

(#755)

* Rename vak/datasets -> vak/datapipes
* Rename frame_classification.window_dataset.WindowDataset -> TrainDatapipe
* Rename frame_classification/window_dataset.py -> train_datapipe.py
* Fix WindowDataset -> TrainDatapipe in docstrings
* Rename frame_classification.frames_dataset.FramesDataset -> infer_datapipe.InferDatapipe
* Rename transforms.StandardizeSpect -> FramesStandardizer
* Import FramesStandardizer in datapipes/frame_classification/infer_datapipe.py
* Add module-level docstring in vak/datapipes/__init__.py
* Rewrite transforms.defaults.frame_classification.EvalItemTransform and PredictItemTransform as a single class, InferItemTransform, and rename spect_standardizer -> frames_standardizer in that module
* Fix bug in view_as_window_batch so it works on 1-D arrays, add type hinting in src/vak/transforms/functional.py
* Change frame_labels_transform in InferItemTransform to be a torchvision.transforms.Compose, so we get back a windowed batch
* Remove TODO in src/vak/models/frame_classification_model.py
* Rewrite TrainDatapipe to always use TrainItemTransform, add parameters that get passed to TrainItemTransform when instantiating it inside TrainDatapipe.__init__
* Rewrite frame_classification.InferDatapipe to always use transforms.defaults.frame_classification.InferItemTransform, add parameters that get passed to InferItemTransform when instantiating it inside InferDatapipe.__init__
* Rewrite train.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite eval.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rename 'spect_scaler_path' -> 'frames_standardizer_path'
* Rename 'normalize_spectrogram' -> 'standardize_frames'
* Fix 'SpectScaler' -> 'FramesStandardizer', 'normalize spectrogram' -> 'standardize (normalize) frames'
* Fix 'SpectScaler' -> 'FramesStandardizer' in tests/
* Fix key names in doc/toml
* Add missing comma in src/vak/train/frame_classification.py
* Rename config/valid-version-1.1.toml -> valid-version-1.2.toml
* Fix normalize spectrograms -> standardize frames in more places in docs
* Fix datapipes.frame_classification.InferDatapipe to have needed parameters for item transform
* Fix datapipes.frame_classification.TrainDatapipe to have needed parameters for item transform
* Fix arg name 'spect_standardizer' -> 'frames_standardizer' in src/vak/train/frame_classification.py
* fixup fix TrainDatapipe parameters
* Fix variable name in src/vak/datapipes/frame_classification/train_datapipe.py
* Add missing arg return_padding_mask in src/vak/train/frame_classification.py
* Fix transforms.defaults.frame_classification.InferItemTransform to not window frame labels, just convert them to LongTensor
* Revise docstring in eval/frame_classification
* Remove item_transform from docstring in datapipes/frame_classification/train_datapipe.py
* Add return_padding_mask arg in vak/predict/frame_classification.py
* Remove src/vak/transforms/defaults/parametric_umap.py
* Rename/rewrite Datapipe class for ParametricUMAP, hard-code in transform
* Remove transforms/defaults/get.py, remove related imports in transforms/defaults/__init__.py
* Finish removing transform fetching for ParametricUMAP
* Fix typo in src/vak/eval/frame_classification.py
* Fix "StandardizeSpect" -> "FramesStandardizer" in src/vak/learncurve/frame_classification.py
* Apply changes from nox lint session
* Make flake8 fixes, remove unused function get_default_frame_classification_transform
* Fix "StandardizeSpect" -> "FramesStandardizer" in tests/scripts/vaktestdata/configs.py
* WIP: Add datasets/ with biosoundsegbench
* Rename tests/test_datasets -> test_datapipes, fix tests
* Fix 'StandardizeSpect' -> 'FramesStandardizer' in two tests
* Remove two uses of vak.transforms.defaults.get_default_transform from tests
* Fix datapipe used in tests/test_models/test_parametric_umap_model.py
* Use TYPE_CHECKING to avoid circular import in src/vak/datapipes/frame_classification/infer_datapipe.py
* Add method 'fit_inputs_targets_csv_path' to FramesStandardizer, rewrite 'fit_dataset_path' method to just call this new method
* fixup add method
* Add unit test for FramesStandardizer.fit_inputs_targets_csv_path
* Remove unused import from src/vak/transforms/transforms.py
* Remove unused import in src/vak/transforms/defaults/frame_classification.py
* Pep8 fix in src/vak/datasets/__init__.py
* Apply linting to src/vak/transforms/transforms.py
* Correct docstring in src/vak/transforms/defaults/frame_classification.py
* Import datasets in src/vak/__init__.py
* Rename datapipes/frame_classification/constants.FRAME_LABELS_EXT -> MULTI_FRAME_LABELS_EXT, and change value to 'multi-frame-labels.npy', and change value of FRAME_LABELS_NPY_PATH_COL_NAME to 'multi_frame_labels_npy_path'
* Rename vak.datapipes.frame_classification.constants.FRAME_LABELS_NPY_PATH_COL_NAME -> MULTI_FRAME_LABELS_PATH_COL_NAME
* Rename key in item returned by frame_classification.TrainItemTransform and InferItemTransform; 'frame_labels' -> 'multi_frame_labels'
* WIP: Get BioSoundSegBench class working
* Rewrite FrameClassificationModel to handle different target types
* Add VALID_SPLITS to common.constants
* In datasets/biosoundsegbench.py: change VALID_TARGET_TYPES to be the ones we're using for experiments right now, fix TrainItemTransform to handle target types, clean up __init__ method validation
* Add initial unit tests for BioSoundSegBench dataset
* Add helper function vak.datasets.get
* Clean up how we validate target_type in datasets.BioSoundSegBench.__init__
* Add tests/test_datasets/__init__.py (to make a sub-package)
* Add initial unit tests for vak.datasets.get
* Modify BioSoundSegBench.__init__ so we can write splits_path as just the filename
* Use expanded_user_path converter on path and splits_path attributes of DatasetConfig
* Rename BOUNDARY_ONEHOT_PATH_COL_NAME -> BOUNDARY_FRAME_LABELS_PATH_COL_NAME in datasets/biosoundsegbench.py
* Modify datasets.BioSoundSegBench to compute metadata from splits_json path
* Fix mock_biosoundsegbench_dataset fixture so mocked files follow naming conventions of dataset
* Modify mock_biosoundsegbench_dataset fixture to save labelmaps.json
* Change BioSoundSegBench.__init__ so we have training_replicate_metadata attribute, frame_dur attribute, and labelmap attribute
* Add DATASETS dict in datasets/__init__.py, used by vak.datasets.get to look up class (value) by name (key)
* Use vak.datasets.DATASETS in vak.datasets.get to get class
* Rewrite BioSoundSegBench.__init__ so we can either pass in a FramesStandardizer instance or tell it to fit a new one to the specified split, that then gets added to the transform
* Import DATASETS inside vak.datasets.get to avoid circular import
* Make fixes in datasets/biosoundsegbench.py: import FramesStandardizer inside TrainItemTransform.__init__, fix tmp_splits_path -> splits-jsons (plural), add needed __len__ method to class
* Rename BioSoundSegBench property 'input_shape' -> 'shape' for consistency with frame_classification datapipes
* Get vak/train/frame_classification.py to the point where it runs
* Add missing self in BioSoundSegBench._getitemval
* Rewrite src/vak/eval/frame_classification.py to work with built-in datasets, and remove 'split' parameter from eval_frame_classification_model function -- check if 'split' is in dataset_config and if not, default to 'test'
* Remove split argument in call to eval_frame_classification_model inside src/vak/learncurve/frame_classification.py
* Remove split parameter from eval._eval.eval -- it's not an attribute of EvalConfig and we can now pass in a 'split' through dataset_config
* Remove 'split' parameter from eval_parametric_umap_model, check if 'split' in dataset_config and if not default to 'test'
* Rewrite src/vak/predict/frame_classification.py to work with built-in datasets; check if 'split' is in dataset_config and if not, default to 'predict'
* Add comments to structure src/vak/train/frame_classification.py
* Fix how we check for key in src/vak/predict/frame_classification.py
* Fix how we check for key in dict in src/vak/eval/parametric_umap.py
* Fix how we check for key in dict in src/vak/eval/frame_classification.py
* Fix unit tests in test_dataset.py: assert that path attributes are vak.converters.expanded_user_path(value from config), not pathlib.Path
* Fix how we parametrize tests/test_dataset/test_get.py
* In BioSoundSegBench.__init__, fix how we calculate frame_dur and how we set labelmap attribute for binary/boundary frame labels
* In FrameClassificationModel.validation_step, convert Levenshtein distance to float to squelch warning from Lightning
* Fix FrameClassificationModel so train/val with multi-class + boundary labels works
* Fix vak.cli.predict to not assume that config has a prep attribute
* Fix how we override default split with a split from dataset_config['params'] in predict/frame_classification and eval/frame_classification
* Change BioSoundSegBench so __getitem__ can return 'frames_path' in 'item' for eval/predict
* In predict.frame_classification, set 'return_frames_path' to True in dataset_config['params'] since we need this for predictions
* Add constant DEFAULT_SPECT_FORMAT in common.constants
* Fix SPECT_KEY -> TIMEBINS_KEY in cli.prep
* Fix how we determine input_type and spect_format for built-in datasets in predict/frame_classification
* Add nn/loss/crossentropy.py, wraps torch.nn.CrossEntropyLoss, but converts weight arg as list to tensor
* Fixup add loss
* Use nn.loss.CrossEntropy with TweetyNet model
* Clean up prediction_step in FrameClassificationModel
* Get predict working for multi_frame_labels and boundary_frame_labels, still need to test binary_frame_labels and (boundary, multi)
* Rename 'unlabeled_label' -> 'background_label' in transforms/frame_labels
* Rename 'unlabeled_label' -> 'background_label' in tests/test_transforms/test_frame_labels
* Rewrite transforms/frame_labels/functional.py to handle boundary labels
  - Add `boundary_labels_to_segment_inds_list` that finds segment indexing arrays from a list of boundary labels
  - Rename `to_segment_inds` -> `frame_labels_to_segment_inds_list`
  - Have `preprocess` optionally take `boundary_labels` and use it to find segments, instead of frame labels
  - Fix type annotations to use npt.NDArray instead of np.ndarray
* Change how FrameClassificationModel calls loss for multi-class + boundary targets -- assume we pass to an instance of a loss function, and get back either a scalar loss or a dict mapping loss names to scalar values
* Change arg name 'unlabeled_label' -> 'background_label' in prep/frame_classification/make_splits.py
* Fix predict.frame_classification for multi-class, and add logic for multi-class frame labels with boundary frame labels
* Add DEFAULT_BACKGROUND_LABEL to common.constants
* Use DEFAULT_BACKGROUND_LABEL in transforms.frame_labels.functional
* Rename unlabeled -> background_label in common.labels
* Add background_label in docstring in common/labels.py
* Add 'background_label' to FrameClassificationModel, defaults to common.constants.DEFAULT_BACKGROUND_LABEL, used to validate length of string labels in labelmap
* Fix 'unlabeled' -> common.constants.DEFAULT_BACKGROUND_LABEL in another place in common/labels.py
* Fix unlabeled -> background label in docstrings in transforms
* Use 'background_label' argument in place of magic string 'unlabeled' in prep/frame_classification/learncurve.py
* Fix unlabeled -> background label in docstrings in transforms/frame_labels/functional.py
* Add background_label to docstring in src/vak/prep/frame_classification/learncurve.py
* Add background_label to function in src/vak/prep/frame_classification/make_splits.py
* Add background_label parameter to src/vak/predict/frame_classification.py and add type annotations to function signature
* Fix unlabeled -> background / vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests
* Fix 'map_unlabeled' -> 'map_background' in tests/
* Fix 'constants' -> 'common' in src/vak/models/frame_classification_model.py
* Fix arg name map_unlabeled -> map_background
* Fix arg name map_unlabeled -> map_background in prep/parametric_umap
* Fix 'unlabeled' -> vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests/
* Fix name `to_inds_list` -> `segment_inds_list_from_class_labels` in test_transforms/test_frame_labels/test_functional.py
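Two of the changes listed above, looking up a dataset class by name in a `DATASETS` dict and falling back to a default split when `dataset_config['params']` has no `'split'` key, can be sketched together as below. This is only an illustration under assumptions: `BioSoundSegBenchSketch`, the `"name"` and `"path"` keys, and the `default_split` argument are hypothetical and are not taken from the vak source.

```python
class BioSoundSegBenchSketch:
    """Hypothetical stand-in for a built-in dataset class."""

    def __init__(self, dataset_path, split, **params):
        self.dataset_path = dataset_path
        self.split = split
        self.params = params


# Registry mapping dataset name (key) to dataset class (value).
DATASETS = {"BioSoundSegBench": BioSoundSegBenchSketch}


def get(dataset_config: dict, default_split: str = "test"):
    """Instantiate a built-in dataset from a config dict (a sketch)."""
    name = dataset_config["name"]
    try:
        dataset_class = DATASETS[name]
    except KeyError as e:
        raise ValueError(
            f"Invalid dataset name: {name!r}. Valid names are: {sorted(DATASETS)}"
        ) from e
    # Copy so that popping 'split' does not mutate the caller's config.
    params = dict(dataset_config.get("params", {}))
    # A 'split' given in the params overrides the default for this command.
    split = params.pop("split", default_split)
    return dataset_class(dataset_config["path"], split=split, **params)
```

Under this sketch an eval-style caller would keep the default of `"test"`, while a predict-style caller would pass `default_split="predict"`.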

NickleDave added 30 commits May 11, 2024 08:50

Rename vak/datasets -> vak/datapipes

ac074a1

Rename frame_classification.window_dataset.WindowDataset -> TrainDatapipe

005217f

Rename frame_classification/window_dataset.py -> train_datapipe.py

e13e1b2

Fix WindowDataset -> TrainDatapipe in docstrings

d5dc232

Rename frame_classification.frames_dataset.FramesDataset -> infer_datapipe.InferDatapipe

853b9d9

Rename transforms.StandardizeSpect -> FramesStandardizer

84ffa6b

Import FramesStandardizer in datapipes/frame_classification/infer_datapipe.py

4958e7a

Add module-level docstring in vak/datapipes/__init__.py

0ef4ad8

Rewrite transforms.defaults.frame_classification.EvalItemTransform and PredictItemTransform as a single class, InferItemTransform, and rename spect_standardizer -> frames_standardizer in that module

855e0cf

Fix bug in view_as_window_batch so it works on 1-D arrays, add type hinting in src/vak/transforms/functional.py

2550f43

Change frame_labels_transform in InferItemTransform to be a torchvision.transforms.Compose, so we get back a windowed batch

9416320

Remove TODO in src/vak/models/frame_classification_model.py

51fdbb0

Rewrite TrainDatapipe to always use TrainItemTransform, add parameters that get passed to TrainItemTransform when instantiating it inside TrainDatapipe.__init__

40c2b96

Rewrite frame_classification.InferDatapipe to always use transforms.defaults.frame_classification.InferItemTransform, add parameters that get passed to InferItemTransform when instantiating it inside InferDatapipe.__init__

f63661b

Rewrite train.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get

6893a1c

Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get

5a56f6e

Rewrite eval.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get

b1ecb78

Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get

433ffbf

Rename 'spect_scaler_path' -> 'frames_standardizer_path'

232aee4

Rename 'normalize_spectrogram' -> 'standardize_frames'

ccf0659

Fix 'SpectScaler' -> 'FramesStandardizer', 'normalize spectrogram' -> 'standardize (normalize) frames'

a98dbfa

Fix 'SpectScaler' -> 'FramesStandardizer' in tests/

40b8712

Fix key names in doc/toml

862abc1

Add missing comma in src/vak/train/frame_classification.py

0159edb

Rename config/valid-version-1.1.toml -> valid-version-1.2.toml

bbc475e

Fix normalize spectrograms -> standardize frames in more places in docs

d92a539

Fix datapipes.frame_classification.InferDatapipe to have needed parameters for item transform

7b9dc57

Fix datapipes.frame_classification.TrainDatapipe to have needed parameters for item transform

c0c139b

Fix arg name 'spect_standardizer' -> 'frames_standardizer' in src/vak/train/frame_classification.py

7c19d55

fixup fix TrainDatapipe parameters

1248a40

NickleDave added 21 commits May 11, 2024 08:50

Change arg name 'unlabeled_label' -> 'background_label' in prep/frame_classification/make_splits.py

baad820

Fix predict.frame_classification for multi-class, and add logic for multi-class frame labels with boundary frame labels

2b5ebc1

Add DEFAULT_BACKGROUND_LABEL to common.constants

180ae92

Use DEFAULT_BACKGROUND_LABEL in transforms.frame_labels.functional

e1616bf

Rename unlabeled -> background_label in common.labels

1176230

Add background_label in docstring in common/labels.py

9e4ab1d

Add 'background_label' to FrameClassificationModel, defaults to common.constants.DEFAULT_BACKGROUND_LABEL, used to validate length of string labels in labelmap

c97639d

Fix 'unlabeled' -> common.constants.DEFAULT_BACKGROUND_LABEL in another place in common/labels.py

956dfce

Fix unlabeled -> background label in docstrings in transforms

2b22b3f

Use 'background_label' argument in place of magic string 'unlabeled' in prep/frame_classification/learncurve.py

e242a94

Fix unlabeled -> background label in docstrings in transforms/frame_labels/functional.py

bd232c0

Add background_label to docstring in src/vak/prep/frame_classification/learncurve.py

efa438e

Add background_label to function in src/vak/prep/frame_classification/make_splits.py

c4470e1

Add background_label parameter to src/vak/predict/frame_classification.py and add type annotations to function signature

cc6abf2

Fix unlabeled -> background / vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests

04c8f60

Fix 'map_unlabeled' -> 'map_background' in tests/

33e3782

Fix 'constants' -> 'common' in src/vak/models/frame_classification_model.py

02420e7

Fix arg name map_unlabeled -> map_background

ecd53a3

Fix arg name map_unlabeled -> map_background in prep/parametric_umap

d308e91

Fix 'unlabeled' -> vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests/

98576c0

Fix name `to_inds_list` -> `segment_inds_list_from_class_labels` in test_transforms/test_frame_labels/test_functional.py

e54f70b
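For readers unfamiliar with boundary frame labels, finding segments from them might look roughly like the sketch below. It assumes that a boundary label of 1 marks the first frame of a segment and 0 marks every other frame, and it returns plain lists of indices instead of NumPy indexing arrays. The function name carries a `_sketch` suffix because this is not the implementation in transforms/frame_labels/functional.py.

```python
from typing import List, Sequence


def boundary_labels_to_segment_inds_list_sketch(
    boundary_labels: Sequence[int],
) -> List[List[int]]:
    """Group frame indices into segments, starting a new segment at each boundary.

    Assumes 1 marks the first frame of a segment and 0 marks all other frames.
    Frames before the first boundary are grouped into their own segment.
    """
    segments: List[List[int]] = []
    for ind, is_boundary in enumerate(boundary_labels):
        if is_boundary or not segments:
            # Start a new segment at a boundary, or at the very first frame.
            segments.append([ind])
        else:
            segments[-1].append(ind)
    return segments
```

Each returned list can then be used to index into per-frame predictions, for example to apply a majority vote within a segment.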

NickleDave force-pushed the rename-refactor-datapipes-add-datasets branch from d015a83 to e54f70b Compare May 11, 2024 12:50

NickleDave merged commit 5003113 into main May 11, 2024
0 of 4 checks passed

NickleDave deleted the rename-refactor-datapipes-add-datasets branch May 11, 2024 12:51

NickleDave changed the title ~~CLN/ENH: Rename and refactor datapipes, add datasets; fix 574 724 754~~ CLN/ENH: Rename and refactor datapipes, add datasets; fix #574 #724 #754 May 11, 2024

This was referenced May 11, 2024

ENH: Rename datasets to pipes, that have built-in transforms #724

Closed

ENH: Add datasets module with BioSoundSegBench #754

Closed

NickleDave added a commit that referenced this pull request May 11, 2024

DOC: Update CHANGELOG after merging #755 [skip ci]

4642126

NickleDave mentioned this pull request May 11, 2024

rename 'unlabeled_label' -> 'unlabeled_class'; define as constant and use constant for default args #408

Closed

3 tasks
