ENH: Add datasets module with BioSoundSegBench #754

NickleDave · 2024-05-11T12:34:22Z

Requires fixing #724 first

(#755)

* Rename vak/datasets -> vak/datapipes
* Rename frame_classification.window_dataset.WindowDataset -> TrainDatapipe
* Rename frame_classification/window_dataset.py -> train_datapipe.py
* Fix WindowDataset -> TrainDatapipe in docstrings
* Rename frame_classification.frames_dataset.FramesDataset -> infer_datapipe.InferDatapipe
* Rename transforms.StandardizeSpect -> FramesStandardizer
* Import FramesStandardizer in datapipes/frame_classification/infer_datapipe.py
* Add module-level docstring in vak/datapipes/__init__.py
* Rewrite transforms.defaults.frame_classification.EvalItemTransform and PredictItemTransform as a single class, InferItemTransform, and rename spect_standardizer -> frames_standardizer in that module
* Fix bug in view_as_window_batch so it works on 1-D arrays, add type hinting in src/vak/transforms/functional.py
* Change frame_labels_transform in InferItemTransform to be a torchvision.transforms.Compose, so we get back a windowed batch
* Remove TODO in src/vak/models/frame_classification_model.py
* Rewrite TrainDatapipe to always use TrainItemTransform, add parameters that get passed to TrainItemTransform when instantiating it inside TrainDatapipe.__init__
* Rewrite frame_classification.InferDatapipe to always use transforms.defaults.frame_classification.InferItemTransform, add parameters that get passed to InferItemTransform when instantiating it inside InferDatapipe.__init__
* Rewrite train.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite eval.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rewrite predict.frame_classification to pass kwargs into datapipes that now use default transforms, and no longer call transforms.defaults.get
* Rename 'spect_scaler_path' -> 'frames_standardizer_path'
* Rename 'normalize_spectrogram' -> 'standardize_frames'
* Fix 'SpectScaler' -> 'FramesStandardizer', 'normalize spectrogram' -> 'standardize (normalize) frames'
* Fix 'SpectScaler' -> 'FramesStandardizer' in tests/
* Fix key names in doc/toml
* Add missing comma in src/vak/train/frame_classification.py
* Rename config/valid-version-1.1.toml -> valid-version-1.2.toml
* Fix normalize spectrograms -> standardize frames in more places in docs
* Fix datapipes.frame_classification.InferDatapipe to have needed parameters for item transform
* Fix datapipes.frame_classification.TrainDatapipe to have needed parameters for item transform
* Fix arg name 'spect_standardizer' -> 'frames_standardizer' in src/vak/train/frame_classification.py
* fixup fix TrainDatapipe parameters
* Fix variable name in src/vak/datapipes/frame_classification/train_datapipe.py
* Add missing arg return_padding_mask in src/vak/train/frame_classification.py
* Fix transforms.defaults.frame_classification.InferItemTransform to not window frame labels, just convert them to LongTensor
* Revise docstring in eval/frame_classification
* Remove item_transform from docstring in datapipes/frame_classification/train_datapipe.py
* Add return_padding_mask arg in vak/predict/frame_classification.py
* Remove src/vak/transforms/defaults/parametric_umap.py
* Rename/rewrite Datapipe class for ParametricUMAP, hard-code in transform
* Remove transforms/defaults/get.py, remove related imports in transforms/defaults/__init__.py
* Finish removing transform fetching for ParametricUMAP
* Fix typo in src/vak/eval/frame_classification.py
* Fix "StandardizeSpect" -> "FramesStandardizer" in src/vak/learncurve/frame_classification.py
* Apply changes from nox lint session
* Make flake8 fixes, remove unused function get_default_frame_classification_transform
* Fix "StandardizeSpect" -> "FramesStandardizer" in tests/scripts/vaktestdata/configs.py
* WIP: Add datasets/ with biosoundsegbench
* Rename tests/test_datasets -> test_datapipes, fix tests
* Fix 'StandardizeSpect' -> 'FramesStandardizer' in two tests
* Remove two uses of vak.transforms.defaults.get_default_transform from tests
* Fix datapipe used in tests/test_models/test_parametric_umap_model.py
* Use TYPE_CHECKING to avoid circular import in src/vak/datapipes/frame_classification/infer_datapipe.py
* Add method 'fit_inputs_targets_csv_path' to FramesStandardizer, rewrite 'fit_dataset_path' method to just call this new method
* fixup add method
* Add unit test for FramesStandardizer.fit_inputs_targets_csv_path
* Remove unused import from src/vak/transforms/transforms.py
* Remove unused import in src/vak/transforms/defaults/frame_classification.py
* Pep8 fix in src/vak/datasets/__init__.py
* Apply linting to src/vak/transforms/transforms.py
* Correct docstring in src/vak/transforms/defaults/frame_classification.py
* Import datasets in src/vak/__init__.py
* Rename datapipes/frame_classification/constants.FRAME_LABELS_EXT -> MULTI_FRAME_LABELS_EXT, and change value to 'multi-frame-labels.npy', and change value of FRAME_LABELS_NPY_PATH_COL_NAME to 'multi_frame_labels_npy_path'
* Rename vak.datapipes.frame_classification.constants.FRAME_LABELS_NPY_PATH_COL_NAME -> MULTI_FRAME_LABELS_PATH_COL_NAME
* Rename key in item returned by frame_classification.TrainItemTransform and InferItemTransform; 'frame_labels' -> 'multi_frame_labels'
* WIP: Get BioSoundSegBench class working
* Rewrite FrameClassificationModel to handle different target types
* Add VALID_SPLITS to common.constants
* In datasets/biosoundsegbench.py: change VALID_TARGET_TYPES to be the ones we're using for experiments right now, fix TrainItemTransform to handle target types, clean up __init__ method validation
* Add initial unit tests for BioSoundSegBench dataset
* Add helper function vak.datasets.get
* Clean up how we validate target_type in datasets.BioSoundSegBench.__init__
* Add tests/test_datasets/__init__.py (to make a sub-package)
* Add initial unit tests for vak.datasets.get
* Modify BioSoundSegBench.__init__ so we can write splits_path as just the filename
* Use expanded_user_path converter on path and splits_path attributes of DatasetConfig
* Rename BOUNDARY_ONEHOT_PATH_COL_NAME -> BOUNDARY_FRAME_LABELS_PATH_COL_NAME in datasets/biosoundsegbench.py
* Modify datasets.BioSoundSegBench to compute metadata from splits_json path
* Fix mock_biosoundsegbench_dataset fixture so mocked files follow naming conventions of dataset
* Modify mock_biosoundsegbench_dataset fixture to save labelmaps.json
* Change BioSoundSegBench.__init__ so we have training_replicate_metadata attribute, frame_dur attribute, and labelmap attribute
* Add DATASETS dict in datasets/__init__.py, used by vak.datasets.get to look up class (value) by name (key)
* Use vak.datasets.DATASETS in vak.datasets.get to get class
* Rewrite BioSoundSegBench.__init__ so we can either pass in a FramesStandardizer instance or tell it to fit a new one to the specified split, which then gets added to the transform
* Import DATASETS inside vak.datasets.get to avoid circular import
* Make fixes in datasets/biosoundsegbench.py: import FramesStandardizer inside TrainItemTransform.__init__, fix tmp_splits_path -> splits-jsons (plural), add needed __len__ method to class
* Rename BioSoundSegBench property 'input_shape' -> 'shape' for consistency with frame_classification datapipes
* Get vak/train/frame_classification.py to the point where it runs
* Add missing self in BioSoundSegBench._getitemval
* Rewrite src/vak/eval/frame_classification.py to work with built-in datasets, and remove 'split' parameter from eval_frame_classification_model function -- check if 'split' is in dataset_config and if not, default to 'test'
* Remove split argument in call to eval_frame_classification_model inside src/vak/learncurve/frame_classification.py
* Remove split parameter from eval._eval.eval -- it's not an attribute of EvalConfig and we can now pass in a 'split' through dataset_config
* Remove 'split' parameter from eval_parametric_umap_model, check if 'split' in dataset_config and if not default to 'test'
* Rewrite src/vak/predict/frame_classification.py to work with built-in datasets; check if 'split' is in dataset_config and if not, default to 'predict'
* Add comments to structure src/vak/train/frame_classification.py
* Fix how we check for key in src/vak/predict/frame_classification.py
* Fix how we check for key in dict in src/vak/eval/parametric_umap.py
* Fix how we check for key in dict in src/vak/eval/frame_classification.py
* Fix unit tests in test_dataset.py: assert that path attributes are vak.converters.expanded_user_path(value from config), not pathlib.Path
* Fix how we parametrize tests/test_dataset/test_get.py
* In BioSoundSegBench.__init__, fix how we calculate frame_dur and how we set labelmap attribute for binary/boundary frame labels
* In FrameClassificationModel.validation_step, convert Levenshtein distance to float to squelch warning from Lightning
* Fix FrameClassificationModel so train/val with multi-class + boundary labels works
* Fix vak.cli.predict to not assume that config has a prep attribute
* Fix how we override default split with a split from dataset_config['params'] in predict/frame_classification and eval/frame_classification
* Change BioSoundSegBench so __getitem__ can return 'frames_path' in 'item' for eval/predict
* In predict.frame_classification, set 'return_frames_path' to True in dataset_config['params'] since we need this for predictions
* Add constant DEFAULT_SPECT_FORMAT in common.constants
* Fix SPECT_KEY -> TIMEBINS_KEY in cli.prep
* Fix how we determine input_type and spect_format for built-in datasets in predict/frame_classification
* Add nn/loss/crossentropy.py, wraps torch.nn.CrossEntropyLoss, but converts a weight arg given as a list to a tensor
* Fixup add loss
* Use nn.loss.CrossEntropy with TweetyNet model
* Clean up prediction_step in FrameClassificationModel
* Get predict working for multi_frame_labels and boundary_frame_labels, still need to test binary_frame_labels and (boundary, multi)
* Rename 'unlabeled_label' -> 'background_label' in transforms/frame_labels
* Rename 'unlabeled_label' -> 'background_label' in tests/test_transforms/test_frame_labels
* Rewrite transforms/frame_labels/functional.py to handle boundary labels
  - Add `boundary_labels_to_segment_inds_list` that finds segment indexing arrays from a list of boundary labels
  - Rename `to_segment_inds` -> `frame_labels_to_segment_inds_list`
  - Have `preprocess` optionally take `boundary_labels` and use it to find segments, instead of frame labels
  - Fix type annotations to use npt.NDArray instead of np.ndarray
* Change how FrameClassificationModel calls loss for multi-class + boundary targets -- assume we pass to an instance of a loss function, and get back either a scalar loss or a dict mapping loss names to scalar values
* Change arg name 'unlabeled_label' -> 'background_label' in prep/frame_classification/make_splits.py
* Fix predict.frame_classification for multi-class, and add logic for multi-class frame labels with boundary frame labels
* Add DEFAULT_BACKGROUND_LABEL to common.constants
* Use DEFAULT_BACKGROUND_LABEL in transforms.frame_labels.functional
* Rename unlabeled -> background_label in common.labels
* Add background_label in docstring in common/labels.py
* Add 'background_label' to FrameClassificationModel, defaults to common.constants.DEFAULT_BACKGROUND_LABEL, used to validate length of string labels in labelmap
* Fix 'unlabeled' -> common.constants.DEFAULT_BACKGROUND_LABEL in another place in common/labels.py
* Fix unlabeled -> background label in docstrings in transforms
* Use 'background_label' argument in place of magic string 'unlabeled' in prep/frame_classification/learncurve.py
* Fix unlabeled -> background label in docstrings in transforms/frame_labels/functional.py
* Add background_label to docstring in src/vak/prep/frame_classification/learncurve.py
* Add background_label to function in src/vak/prep/frame_classification/make_splits.py
* Add background_label parameter to src/vak/predict/frame_classification.py and add type annotations to function signature
* Fix unlabeled -> background / vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests
* Fix 'map_unlabeled' -> 'map_background' in tests/
* Fix 'constants' -> 'common' in src/vak/models/frame_classification_model.py
* Fix arg name map_unlabeled -> map_background
* Fix arg name map_unlabeled -> map_background in prep/parametric_umap
* Fix 'unlabeled' -> vak.common.constants.DEFAULT_BACKGROUND_LABEL in tests/
* Fix name `to_inds_list` -> `segment_inds_list_from_class_labels` in test_transforms/test_frame_labels/test_functional.py
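
For anyone skimming the log, two of the changes above fit together: the `DATASETS` dict that `vak.datasets.get` uses to look up a class by name, and the rule that a 'split' in `dataset_config['params']` overrides the default ('test' for eval, 'predict' for predict). The sketch below shows roughly how that could work. It is an illustration under assumptions, not vak's actual code: the stand-in `BioSoundSegBench` class, the `"name"`/`"path"`/`"params"` keys of the config dict, and the `default_split` parameter are all hypothetical.

```python
from typing import Any, Dict


class BioSoundSegBench:
    """Hypothetical stand-in for the real dataset class (illustration only)."""

    def __init__(self, dataset_path: str, split: str = "train", **params: Any) -> None:
        self.dataset_path = dataset_path
        self.split = split
        self.params = params


# Registry: dataset name (key) -> dataset class (value).
DATASETS: Dict[str, type] = {"BioSoundSegBench": BioSoundSegBench}


def get(dataset_config: Dict[str, Any], default_split: str = "test") -> Any:
    """Look up a dataset class by name and instantiate it.

    The config layout ("name", "path", "params") is an assumption
    made for this sketch.
    """
    name = dataset_config["name"]
    try:
        dataset_class = DATASETS[name]
    except KeyError as e:
        raise ValueError(
            f"Invalid dataset name: {name!r}. Valid names are: {sorted(DATASETS)}"
        ) from e

    # Copy params so the caller's config is not mutated.
    params = dict(dataset_config.get("params", {}))
    # A 'split' in params overrides the default for this command.
    split = params.pop("split", default_split)
    return dataset_class(dataset_config["path"], split=split, **params)
```

With this shape, eval would call `get(dataset_config)` and fall back to 'test', while predict would call `get(dataset_config, default_split="predict")`; in both cases a user-supplied split in the params wins.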

NickleDave · 2024-05-11T12:55:13Z

Closed by #755

NickleDave added the "Datasets" label (Issue related to datasets) May 11, 2024

NickleDave self-assigned this May 11, 2024

NickleDave closed this as completed May 11, 2024
