config_utils.py: Selene now saves additional information on each run. Specifically, it records the version of Selene that the latest run used and saves a copy of the input configuration file, along with the model architecture file, in the output directory. This adds a new dependency to Selene: the package ruamel.yaml.
H5Dataloader and _H5Dataset: Previously, H5Dataloader took a number of arguments that it then used to initialize _H5Dataset internally. One major change in this version is that users must now initialize _H5Dataset explicitly and pass it to H5Dataloader as an argument. This makes the two classes consistent with the PyTorch specifications for Dataset and DataLoader classes, so they are compatible with the data parallelization configurations supported by PyTorch and the PyTorch Lightning framework.
_H5Dataset class initialization optional arguments:
unpackbits can now be specified separately for sequences and targets via unpackbits_seq and unpackbits_tgt.
use_seq_len subsets each sequence in the dataset to its central use_seq_len bases.
shift (particularly when paired with use_seq_len) retrieves sequences offset from the center position by shift bases. Note that shift currently only shifts in one direction, determined by whether you pass in a positive or negative integer.
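To illustrate how use_seq_len and shift interact, here is a minimal sketch of the window arithmetic. It is not Selene's implementation: the helper name center_window is hypothetical, and treating a positive shift as a move toward higher coordinates is an assumption.

```python
def center_window(seq_len, use_seq_len, shift=0):
    """Return (start, end) for a window of `use_seq_len` bases.

    Hypothetical helper: the window is centered in a stored sequence of
    `seq_len` bases and then offset by `shift` bases. A positive `shift`
    is assumed to move the window toward higher coordinates.
    """
    if use_seq_len > seq_len:
        raise ValueError("use_seq_len cannot exceed the stored sequence length")
    start = (seq_len - use_seq_len) // 2 + shift
    end = start + use_seq_len
    if start < 0 or end > seq_len:
        raise ValueError("shifted window falls outside the stored sequence")
    return start, end


sequence = "AACCGGTTACGT"  # toy 12-base sequence

start, end = center_window(len(sequence), use_seq_len=4)
centered = sequence[start:end]

start, end = center_window(len(sequence), use_seq_len=4, shift=2)
shifted_right = sequence[start:end]

start, end = center_window(len(sequence), use_seq_len=4, shift=-2)
shifted_left = sequence[start:end]
```

In this sketch the largest usable shift is bounded by the bases left over on either side of the centered window, which is why shift is most useful together with use_seq_len.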
GenomicFeaturesH5: This is a new targets class for continuous-valued targets, stored in an HDF5 file, that can be retrieved by genomic coordinate. As before, genomic regions are stored in a tabix-indexed .bed file. The main change is that the BED file now specifies, for each genomic region, the index of the row in the HDF5 matrix that contains all the target values to predict. If multiple target rows are returned for a query region, the average of those rows is returned.
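The lookup-then-average behavior described above can be sketched in plain Python. This is an illustration only, not the GenomicFeaturesH5 code: the in-memory bed_records list stands in for the tabix-indexed BED file, target_matrix stands in for the HDF5 matrix, and the function names and the overlap rule are assumptions.

```python
# Each record: (chrom, start, end, row index into the target matrix).
# Stand-in for the tabix-indexed BED file (hypothetical data).
bed_records = [
    ("chr1", 100, 200, 0),
    ("chr1", 150, 250, 1),
    ("chr2", 0, 100, 2),
]

# Stand-in for the HDF5 matrix: one row of target values per region.
target_matrix = [
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 4.0],
    [10.0, 10.0, 10.0],
]


def rows_overlapping(records, chrom, start, end):
    """Return row indices of regions that overlap [start, end) on chrom.

    Any overlap counts here; the actual overlap criterion is an assumption.
    """
    return [row for (c, s, e, row) in records
            if c == chrom and s < end and start < e]


def average_target_rows(matrix, row_indices):
    """Average the selected rows column by column."""
    rows = [matrix[i] for i in row_indices]
    if not rows:
        raise ValueError("no target rows found for the query region")
    return [sum(column) / len(rows) for column in zip(*rows)]


hits = rows_overlapping(bed_records, "chr1", 160, 180)
targets = average_target_rows(target_matrix, hits)
```

Here the query region overlaps two BED entries, so the returned targets are the element-wise mean of rows 0 and 1.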
RandomPositionsSampler:
exclude_chrs: A new optional argument that takes a list of chromosomes or substrings to exclude. The default, exclude_chrs=['_'], excludes all nonstandard chromosomes by ignoring every chromosome with an underscore in its name. When loading possible sampling positions, the class now iterates through the exclude_chrs list, checks for each substring s whether s in chrom, and if so skips that chromosome entirely.
The internal function _retrieve now takes an optional argument strand (default None) to enable explicit retrieval of a sequence at chrom, position for a specific strand. The default behavior of the RandomPositionsSampler class remains the same, with the strand randomly selected for each genomic position sampled.
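The substring check that exclude_chrs performs can be sketched as follows. The helper name keep_chromosome and the chromosome names are made up for illustration; only the "s in chrom" test comes from the description above.

```python
def keep_chromosome(chrom, exclude_chrs=("_",)):
    """Return True unless any entry of exclude_chrs is a substring of chrom."""
    return not any(s in chrom for s in exclude_chrs)


# Hypothetical chromosome names, including two nonstandard contigs.
chroms = ["chr1", "chr2", "chrX", "chrUn_gl000220", "chr6_apd_hap1", "chrM"]

# Default: drop every name containing an underscore.
standard_only = [c for c in chroms if keep_chromosome(c)]

# Also drop the mitochondrial chromosome.
no_mito = [c for c in chroms if keep_chromosome(c, exclude_chrs=["_", "chrM"])]
```

Because the match is by substring rather than exact name, an entry such as 'chr1' would also match 'chr10' through 'chr19', so entries should be chosen with that in mind.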
PerformanceMetrics:
Now supports spearmanr and pearsonr from scipy.stats. There is room to generalize this class in the future.
The update function now has an optional argument, scores, which takes a list(str) naming the subset of metrics to compute.
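As a rough sketch of how a scores argument can restrict which metrics are computed, the example below uses a standalone function rather than the PerformanceMetrics class. Everything here is an assumption for illustration: compute_metrics and METRICS are hypothetical names, and the correlation functions are simple standard-library stand-ins for scipy.stats.pearsonr and scipy.stats.spearmanr (the Spearman stand-in assumes no tied values).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def _ranks(values):
    """Rank values from 1..n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = float(rank)
    return ranks


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(_ranks(x), _ranks(y))


# Hypothetical registry mapping metric names to functions.
METRICS = {"pearsonr": pearson, "spearmanr": spearman}


def compute_metrics(prediction, target, scores=None):
    """Compute all registered metrics, or only those named in `scores`."""
    names = list(METRICS) if scores is None else scores
    return {name: METRICS[name](prediction, target) for name in names}


prediction = [0.1, 0.4, 0.35, 0.8]
target = [1.0, 3.0, 2.0, 10.0]

all_scores = compute_metrics(prediction, target)
only_spearman = compute_metrics(prediction, target, scores=["spearmanr"])
```

In this sketch an unknown metric name raises a KeyError; how the real class handles that case is not stated in these notes.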
TrainModel:
self.step now starts from self._start_step, which is non-zero if the model was loaded from a Selene-saved checkpoint.
Removed the call to self._test_metrics.visualize in evaluate, since the visualize method does not generalize well.
NonStrandSpecific: Can now handle a model that returns two outputs from forward. Each of the two outputs is combined separately by taking either the mean or the max of its forward-strand and reverse-strand predictions. A custom NonStrandSpecific class is recommended for more specific cases.
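The two-output case can be sketched with plain lists in place of tensors. This is an illustration of the mean/max combination only, not the NonStrandSpecific implementation: combine_strands is a hypothetical name, and the inputs are assumed to be the model's outputs for the forward sequence and for its reverse complement.

```python
def combine_strands(forward_outputs, reverse_outputs, mode="mean"):
    """Combine per-strand predictions for each model output separately.

    forward_outputs / reverse_outputs: tuples with one list of predictions
    per model output (two outputs in the case described above).
    """
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    combined = []
    for fwd, rev in zip(forward_outputs, reverse_outputs):
        if mode == "mean":
            combined.append([(f + r) / 2 for f, r in zip(fwd, rev)])
        else:
            combined.append([max(f, r) for f, r in zip(fwd, rev)])
    return tuple(combined)


# Toy predictions: the first output has two values, the second has one.
forward = ([0.25, 1.0], [1.0])
reverse = ([0.75, 0.5], [3.0])

mean_outputs = combine_strands(forward, reverse, mode="mean")
max_outputs = combine_strands(forward, reverse, mode="max")
```

The two outputs are never mixed with each other; each is reduced across strands on its own, so the result keeps the same two-output structure as the model.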