Data Management
ChEBai accesses the ChEBI ontology data from the following URL: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo.
You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi
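The {version} placeholder in the URL above is filled with the ChEBI release number. A minimal sketch of that substitution follows; the function name chebi_obo_url is hypothetical and is not part of ChEBai's API.

```python
def chebi_obo_url(version: int) -> str:
    """Fill the {version} placeholder of the ChEBI download URL."""
    return f"http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


# Version 200 is the default mentioned in this page.
print(chebi_obo_url(200))
```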
Change the ChEBI version used for all sets (default: 200):
--data.init_args.chebi_version=VERSION
To change the version of only the train and validation sets, independently of the test set, use:
--data.init_args.chebi_version_train=VERSION
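How the two options interact can be sketched as follows. This is an illustration only, not ChEBai's code: the function name resolve_versions and the returned dictionary are hypothetical, and the sketch assumes the train and validation sets fall back to chebi_version when chebi_version_train is not given.

```python
from typing import Optional


def resolve_versions(chebi_version: int = 200,
                     chebi_version_train: Optional[int] = None) -> dict:
    """Return the ChEBI version used by each data split.

    Assumption: without chebi_version_train, every split uses chebi_version.
    """
    train_version = chebi_version if chebi_version_train is None else chebi_version_train
    return {
        "train": train_version,
        "validation": train_version,
        "test": chebi_version,
    }


print(resolve_versions())                                  # same version everywhere
print(resolve_versions(chebi_version=231, chebi_version_train=200))
```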
After loading the ontology data, ChEBai preprocesses it: it extracts the class hierarchy and divides the data into train, validation, and test sets. During preprocessing, a filter keeps only chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
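The subclass filter described above can be illustrated on a toy hierarchy. This is a sketch under stated assumptions, not ChEBai's implementation: it assumes subclasses are counted transitively and that a class does not count itself, and the names descendants, select_classes, toy_parents, and toy_smiles are hypothetical.

```python
def descendants(node, children):
    """Collect all transitive subclasses of `node`."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def select_classes(parents, has_smiles, threshold):
    """Keep classes with at least `threshold` SMILES-annotated subclasses."""
    # Invert the child -> parents mapping into parent -> children.
    children = {}
    for child, class_parents in parents.items():
        for parent in class_parents:
            children.setdefault(parent, set()).add(child)
    nodes = set(parents) | set(children)
    return sorted(
        node for node in nodes
        if len(descendants(node, children) & has_smiles) >= threshold
    )


# Toy hierarchy: B and C are subclasses of A; D and E are subclasses of B.
toy_parents = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"]}
toy_smiles = {"C", "D", "E"}  # classes that carry a SMILES annotation
print(select_classes(toy_parents, toy_smiles, threshold=2))  # ['A', 'B']
```

With a threshold of 50 or 100, the same idea yields the label sets for the corresponding datasets; the toy threshold of 2 is only there to keep the example small.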
ChEBI data is organized within the following directory structure:
Contains the raw ChEBI data (chebi.obo), which is downloaded from the ChEBI website.
data/${chebi_version}/${dataset_name}/raw/
Contains the processed data (data.pkl) with SMILES strings and class columns with boolean values, along with a classes.txt file listing the classes for the data.
data/${chebi_version}/${dataset_name}/processed/
Contains the encoded data (data.pt) in a format compatible with the torch library.
data/${chebi_version}/${dataset_name}/processed/${reader_name}/
- ${dataset_name} represents the _name attribute of the DataModule used.
- ${chebi_version} refers to the ChEBI version.
- ${reader_name} denotes the name attribute of the associated Reader class.
For cross-validation, the folds are stored as cv_${n_folds}_fold/fold_{fold_index}_train.pkl and cv_${n_folds}_fold/fold_{fold_index}_validation.pkl in the raw directory.
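The ChEBI directory layout and fold file names can be assembled from their placeholders as in the sketch below. The helper names chebi_dirs and fold_files are hypothetical, and the example values "ChEBI50" and "smiles_token" are placeholders chosen for illustration, not confirmed dataset or reader names.

```python
from pathlib import Path


def chebi_dirs(chebi_version, dataset_name, reader_name, root="data"):
    """Build the raw, processed, and encoded directories for one dataset."""
    base = Path(root) / str(chebi_version) / dataset_name
    processed = base / "processed"
    return {
        "raw": base / "raw",                 # chebi.obo
        "processed": processed,              # data.pkl, classes.txt
        "encoded": processed / reader_name,  # data.pt
    }


def fold_files(raw_dir, n_folds, fold_index):
    """Return the train and validation file paths of one cross-validation fold."""
    fold_dir = Path(raw_dir) / f"cv_{n_folds}_fold"
    return (
        fold_dir / f"fold_{fold_index}_train.pkl",
        fold_dir / f"fold_{fold_index}_validation.pkl",
    )


dirs = chebi_dirs(200, "ChEBI50", "smiles_token")  # hypothetical names
train_file, validation_file = fold_files(dirs["raw"], n_folds=5, fold_index=0)
print(dirs["encoded"].as_posix())
print(train_file.as_posix())
```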
GO_UniProt data is organized within the following directory structure:
Contains the raw GO ontology (.obo) and Swiss UniProt data (.dat) files. As there are no version-specific files for this dataset, the same raw files are used across all data subsets.
data/GO_UniProt/raw/
Includes the processed data file (data.pkl) generated by the prepare_data method. This file contains a dataframe with protein IDs, data representations (protein amino acid sequences), and class columns with boolean values. The directory also has a classes.txt file listing the selected GO classes and may contain a splits.csv file with saved data splits from previous runs.
data/GO_UniProt/${dataset_name}/processed/
Contains the encoded data file (data.pt), generated by the setup method for compatibility with the PyTorch library. It includes keys such as ident, features, labels, and group, preparing the data for model input.
data/GO_UniProt/${dataset_name}/processed/${reader_name}
Note: If go_branch is specified, the dataset_name will include the branch name in the format ${dataset_name}_${go_branch}. Otherwise, it will simply be ${dataset_name}.
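The naming rule in the note, together with the GO_UniProt directory layout, can be sketched as follows. The helper names go_dataset_name and go_dirs are hypothetical, and the example values "GO250", "BP", and "protein_token" are assumptions used only for illustration.

```python
from pathlib import Path
from typing import Optional


def go_dataset_name(dataset_name: str, go_branch: Optional[str] = None) -> str:
    """Append the GO branch to the dataset name when one is specified."""
    return f"{dataset_name}_{go_branch}" if go_branch else dataset_name


def go_dirs(dataset_name, reader_name, go_branch=None, root="data"):
    """Build the raw, processed, and encoded directories for GO_UniProt data."""
    base = Path(root) / "GO_UniProt"
    processed = base / go_dataset_name(dataset_name, go_branch) / "processed"
    return {
        "raw": base / "raw",                 # shared by all data subsets
        "processed": processed,              # data.pkl, classes.txt, splits.csv
        "encoded": processed / reader_name,  # data.pt
    }


go_paths = go_dirs("GO250", "protein_token", go_branch="BP")  # hypothetical names
print(go_paths["encoded"].as_posix())
```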