
Data Management

Aditya Khedekar edited this page Aug 28, 2024 · 5 revisions

Loading ChEBI Ontology Data

ChEBai accesses the ChEBI ontology data from the following URL: http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo.

You can find more information on the ChEBI ontology here: https://www.ebi.ac.uk/chebi
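As a minimal sketch, the chebi.obo download URL can be assembled by substituting a version number into the URL template. The helper name `chebi_obo_url` is hypothetical and is not part of ChEBai; only the URL pattern itself comes from this page.

```python
# Hypothetical helper for illustration; not part of the ChEBai API.
CHEBI_URL_TEMPLATE = "http://purl.obolibrary.org/obo/chebi/{version}/chebi.obo"


def chebi_obo_url(version: int) -> str:
    """Fill a ChEBI release number into the URL template."""
    return CHEBI_URL_TEMPLATE.format(version=version)


if __name__ == "__main__":
    # 200 is the default version mentioned on this page.
    print(chebi_obo_url(200))
```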

ChEBI versions

Set the ChEBI version used for all sets (default: 200):

--data.init_args.chebi_version=VERSION

To change the version of the train and validation sets independently of the test set, use:

--data.init_args.chebi_version_train=VERSION

Data Preprocessing

After loading the ontology data, ChEBai preprocesses it: the class hierarchy is extracted and the data is divided into train, validation, and test sets. During preprocessing, a filter keeps only those chemical entities that have at least a minimum number of subclasses (e.g., 50 or 100) annotated with SMILES (Simplified Molecular Input Line Entry System) strings.
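The filtering step can be illustrated with a small sketch on a toy hierarchy. This is not ChEBai's implementation: the function names, the toy data, and the choice to count transitive (rather than only direct) subclasses are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def descendants(cls: str, children: Dict[str, List[str]]) -> Set[str]:
    """Collect all transitive subclasses of `cls` (assumption: transitive counting)."""
    seen: Set[str] = set()
    stack = list(children.get(cls, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return seen


def select_classes(
    children: Dict[str, List[str]],
    smiles: Dict[str, Optional[str]],
    threshold: int,
) -> List[str]:
    """Keep classes with at least `threshold` SMILES-annotated subclasses."""
    selected = []
    for cls in children:
        annotated = sum(1 for d in descendants(cls, children) if smiles.get(d))
        if annotated >= threshold:
            selected.append(cls)
    return sorted(selected)


# Toy hierarchy: A has subclasses B and C; B has subclasses D and E.
toy_children = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}
# Only C and D carry a SMILES string in this toy example.
toy_smiles = {"C": "CO", "D": "CCO", "E": None}

if __name__ == "__main__":
    print(select_classes(toy_children, toy_smiles, threshold=2))
```

With real data the threshold would be 50 or 100, as stated above; the toy example uses 2 only to keep the hierarchy small.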

Data folder structure

Data is organized within the following directory structure:

Contains the raw ChEBI data (chebi.obo), downloaded from the ChEBI website:

data/${chebi_version}/${dataset_name}/raw/

Contains the processed data (data.pkl) with SMILES strings and boolean-valued class columns, along with a classes.txt file listing the classes in the data:

data/${chebi_version}/${dataset_name}/processed/

Contains the encoded data (data.pt) in a format compatible with the torch library:

data/${chebi_version}/${dataset_name}/processed/${reader_name}/
  • ${dataset_name} represents the _name attribute of the DataModule used.
  • ${chebi_version} refers to the ChEBI version.
  • ${reader_name} denotes the name attribute of the associated Reader class.
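The directory layout above can be sketched with `pathlib`. The function `dataset_dirs` and the example values `ChEBI50` and `smiles_token` are hypothetical placeholders chosen for illustration; the actual names come from the DataModule's `_name` attribute and the Reader's name attribute.

```python
from pathlib import Path
from typing import Dict


def dataset_dirs(
    chebi_version: int, dataset_name: str, reader_name: str, base: str = "data"
) -> Dict[str, Path]:
    """Build the raw, processed, and encoded directories for one dataset."""
    root = Path(base) / str(chebi_version) / dataset_name
    return {
        "raw": root / "raw",                          # chebi.obo
        "processed": root / "processed",              # data.pkl, classes.txt
        "encoded": root / "processed" / reader_name,  # data.pt
    }


if __name__ == "__main__":
    # Example values are placeholders, not confirmed ChEBai names.
    for kind, path in dataset_dirs(200, "ChEBI50", "smiles_token").items():
        print(kind, path.as_posix())
```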

For cross-validation, the folds are stored as cv_${n_folds}_fold/fold_{fold_index}_train.pkl and cv_${n_folds}_fold/fold_{fold_index}_validation.pkl in the raw directory.
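A short sketch of the fold file naming follows. The helper `fold_files` is hypothetical, and the assumption that fold indices start at 0 is not confirmed by this page.

```python
from typing import List, Tuple


def fold_files(n_folds: int) -> List[Tuple[str, str]]:
    """Return (train, validation) file paths for each fold, relative to the raw directory."""
    folder = f"cv_{n_folds}_fold"
    return [
        (
            f"{folder}/fold_{i}_train.pkl",
            f"{folder}/fold_{i}_validation.pkl",
        )
        for i in range(n_folds)  # assumption: zero-based fold indices
    ]


if __name__ == "__main__":
    for train_file, validation_file in fold_files(3):
        print(train_file, validation_file)
```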

GOUniProt data folder structure

Data is organized within the following directory structure:

Contains the raw GO ontology (.obo) and Swiss UniProt data (.dat) files. As there are no version-specific files for this dataset, the same raw files are used across all data subsets.

data/GO_UniProt/raw/

Contains the processed data file (data.pkl) generated by the prepare_data method. This file holds a dataframe with protein IDs, data representations (protein amino acid sequences), and class columns with boolean values. The directory also contains a classes.txt file listing the selected GO classes and may contain a splits.csv file with saved data splits from previous runs.

data/GO_UniProt/${dataset_name}/processed/

Contains the encoded data file (data.pt), generated by the setup method in a format compatible with the PyTorch library. It includes keys such as ident, features, labels, and group, which prepare the data for model input.

data/GO_UniProt/${dataset_name}/processed/${reader_name}/

Note: If go_branch is specified, the dataset_name will include the branch name in the format ${dataset_name}_${go_branch}. Otherwise, it will be simply ${dataset_name}.
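The naming rule in the note can be sketched as follows. The function `effective_dataset_name` and the example values `GO50` and `BP` are hypothetical and only illustrate the `${dataset_name}_${go_branch}` pattern.

```python
from typing import Optional


def effective_dataset_name(dataset_name: str, go_branch: Optional[str] = None) -> str:
    """Append the GO branch to the dataset name when one is specified."""
    if go_branch:
        return f"{dataset_name}_{go_branch}"
    return dataset_name


if __name__ == "__main__":
    # Placeholder values; actual dataset and branch names may differ.
    print(effective_dataset_name("GO50", "BP"))
    print(effective_dataset_name("GO50"))
```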
