How to get scaffold labels in molecular datasets? #159
wzhang2022
started this conversation in
General
Replies: 1 comment 2 replies
-
Hi! You can get the scaffold with RDKit:

    from rdkit import Chem
    from rdkit.Chem.Scaffolds import MurckoScaffold

    def generate_scaffold(smiles, include_chirality=False):
        """Return the Murcko scaffold SMILES string of the target molecule."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # MolFromSmiles returns None when the SMILES cannot be parsed
            raise ValueError(f"invalid SMILES: {smiles!r}")
        return MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=include_chirality)
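Once every molecule has a scaffold string from `generate_scaffold`, one way to turn those strings into scaffold labels is to map each distinct string to an integer. This is only a minimal sketch: `scaffold_labels` is a hypothetical helper, not part of OGB or RDKit, and the example strings are made-up stand-ins for real scaffold SMILES.

```python
def scaffold_labels(scaffolds):
    """Map each distinct scaffold string to an integer label.

    Labels are assigned in order of first appearance, so the result is
    deterministic for a fixed input order.
    """
    label_of = {}  # scaffold string -> integer label
    labels = []    # one label per input molecule
    for scaffold in scaffolds:
        if scaffold not in label_of:
            label_of[scaffold] = len(label_of)
        labels.append(label_of[scaffold])
    return labels, label_of


# Made-up scaffold strings standing in for generate_scaffold outputs.
# The empty string is treated as its own group here (an assumption of this sketch).
scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1", ""]
labels, label_of = scaffold_labels(scaffolds)
print(labels)  # [0, 1, 0, 2]
```

The labels are only meaningful within one dataset, since they depend on the order in which scaffolds are first seen.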
-
In the ogbg-molhiv and ogbg-molpcba datasets, the train/validation/test split is done via scaffold splitting rather than randomly. I haven't worked with molecular data before, but my understanding is that since we split based on the structural frameworks of the molecules, we aren't guaranteed i.i.d. data, and we get distribution shift between the train, validation, and test sets. If I'm interested in extracting features of a molecule that are invariant to which scaffold the molecule belongs to, I might want to know the scaffold labels. Does OGB support this feature?
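For readers new to the idea, a scaffold split can be sketched as follows. This is a common greedy scheme and is not claimed to be OGB's exact procedure. The function name `scaffold_split`, the 80/10/10 fractions, and the placeholder scaffold names are all assumptions made for illustration. Molecules are grouped by scaffold string and each whole group goes to a single subset, so no scaffold is shared between train and test.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedily assign whole scaffold groups to train, valid, or test."""
    groups = defaultdict(list)  # scaffold string -> molecule indices
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Placeholder scaffold names: ten molecules spread over four scaffolds.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
print(train, valid, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8] [9]
```

Because the rarer scaffolds end up in the validation and test sets, the evaluation molecules are structurally different from the training ones, which is the source of the distribution shift described above.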