dataset selection #29
Thank you for starting this issue! In particular, I am interested in the following few details.
I would like to be extremely specific here so we can plan an experiment section for a V0 of a tech report and keep our current explorations closely tied to that loop. Many thanks, @yuanqing-wang!
ESOL has 1128 molecular graphs with average number of nodes around 20.
We can use ZINC or [Enamine Real](https://enamine.net/library-synthesis/real-compounds/real-database), or a subset thereof, to represent the synthesizable space of (druglike) organic small molecules.
They used: FreeSolv, Melting, ESOL, CatS, Malaria, p450. Shouldn't be hard to add APIs to grab and import these.
To keep comparisons fair and simple, I'd suggest partitioning a single dataset into foreground and background. Does that sound reasonable?
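For concreteness, the partition suggested above could look roughly like the sketch below. It assumes that "foreground" means the subset whose labels are used and "background" the subset treated as unlabeled, which this thread does not spell out. The function name, the 20% fraction, and the toy records are hypothetical stand-ins, not anything from the project code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition_foreground_background(
    records: Sequence[T],
    foreground_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Randomly split one dataset into disjoint (foreground, background) lists."""
    if not 0.0 < foreground_fraction < 1.0:
        raise ValueError("foreground_fraction must be strictly between 0 and 1")
    indices = list(range(len(records)))
    # A seeded generator keeps the partition reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_foreground = round(foreground_fraction * len(records))
    foreground = [records[i] for i in indices[:n_foreground]]
    background = [records[i] for i in indices[n_foreground:]]
    return foreground, background


# Toy stand-in with the same size as ESOL: (molecule_id, placeholder_label) pairs.
toy = [(f"mol_{i}", float(i)) for i in range(1128)]
fg, bg = partition_foreground_background(toy, foreground_fraction=0.2, seed=0)

# Assumption: background labels are hidden from the model during training.
bg_unlabeled = [mol_id for mol_id, _ in bg]
```

Because both parts come from the same dataset, this keeps the comparison simple but does not give any out-of-distribution signal.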
Is it that small? 1128 samples is not nothing, but it is not much either.
As asked above, how big would this be?
Great. Sizes? Relevance to tasks we may want to solve? Ideally we would build towards a pipeline that has some relevance for the COVID tasks John suggested in Slack a while ago.
The problem is, none of this will be 'out of distribution', and we don't really know if this is fair 'background data', as the Cambridge paper discussed that they needed to stratify the unsupervised data to get good representations. This is going to be a major part of the experimental design here. For now we can set up an experimental loop that has flags for what all these objects are and passes the related dataloaders accordingly, so we can point them at other datasets later. But I would really like to create a real, full example of datasets that we would consider publishable material and run things on it now.
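One way to read "flags for what all these objects are" is a small config object that names a dataset per role and a registry that resolves names to loaders. The sketch below is only an illustration under that assumption: `DATASET_LOADERS`, `ExperimentConfig`, `build_datasets`, and the `toy_esol` loader are hypothetical names, and the records are placeholders rather than real measurements.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, float]  # (SMILES string, measured value)

# Hypothetical registry mapping a dataset name to a function that loads it.
DATASET_LOADERS: Dict[str, Callable[[], List[Record]]] = {}


def register(name: str):
    """Decorator that adds a loader to the registry under the given name."""
    def decorator(fn: Callable[[], List[Record]]):
        DATASET_LOADERS[name] = fn
        return fn
    return decorator


@register("toy_esol")
def load_toy_esol() -> List[Record]:
    # Placeholder values only; a real loader would read the actual ESOL file.
    return [("CCO", 0.0), ("c1ccccc1", 0.0), ("CC(=O)O", 0.0)]


@dataclass(frozen=True)
class ExperimentConfig:
    """One flag per role, so datasets can be swapped without touching the loop."""
    foreground: str
    background: str
    out_of_distribution: str


def build_datasets(config: ExperimentConfig) -> Dict[str, List[Record]]:
    """Resolve each role's flag to loaded records, failing early on unknown names."""
    roles = {
        "foreground": config.foreground,
        "background": config.background,
        "out_of_distribution": config.out_of_distribution,
    }
    unknown = sorted({name for name in roles.values() if name not in DATASET_LOADERS})
    if unknown:
        raise KeyError(f"unregistered dataset(s): {unknown}")
    return {role: DATASET_LOADERS[name]() for role, name in roles.items()}


config = ExperimentConfig("toy_esol", "toy_esol", "toy_esol")
datasets = build_datasets(config)
```

Swapping the background set later would then mean registering another loader and changing one string in the config.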
Welcome to the world of molecular machine learning. The rest of the datasets that they used are not dramatically larger either: FreeSolv has roughly 650 data points. The rest look like property names rather than specific dataset names, and the size may vary depending on where you get them. But this is generally true: (data, measurement) pair datasets in molecular ML are either small or unreliable. Each entry costs money and time. If you have enough money and time you're probably a pharma company and therefore wouldn't be excited about the idea of sharing data. The exceptions are the QM9 dataset and friends, which are quantum physical data, but they depend (to varying extents) on the geometry of the graph rather than the topology alone.
Ways to provide out-of-distribution data: we can partition the datasets by the time the compound was developed, the scaffold it contains, etc., like they did here: PotentialNet for Molecular Property Prediction
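As a rough illustration of a scaffold-style split, the sketch below groups molecules by a precomputed key and assigns whole groups to either side, so no scaffold appears in both. It is a generic group split, not necessarily the exact procedure used in the PotentialNet paper. The function name `group_split` and the scaffold labels are hypothetical; in practice the keys might come from something like RDKit's `MurckoScaffold.MurckoScaffoldSmiles`, which is left out here to keep the example dependency-free.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def group_split(
    group_keys: Sequence[Hashable],
    train_fraction: float = 0.8,
) -> Tuple[List[int], List[int]]:
    """Split indices so that every group (e.g. scaffold) lands on one side only."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, key in enumerate(group_keys):
        groups[key].append(index)
    # Largest groups first, so common scaffolds go to train and rare ones to test.
    # Ties are broken by first occurrence to keep the result deterministic.
    ordered = sorted(groups.values(), key=lambda members: (-len(members), members[0]))
    train_cutoff = train_fraction * len(group_keys)
    train: List[int] = []
    test: List[int] = []
    for members in ordered:
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical scaffold labels for ten molecules (illustrative strings only).
scaffolds = [
    "benzene", "benzene", "benzene", "pyridine", "pyridine",
    "indole", "acyclic", "benzene", "indole", "furan",
]
train_idx, test_idx = group_split(scaffolds, train_fraction=0.8)
train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
```

A time-based split follows the same idea with a different rule: sort the records by date and hold out the most recent ones.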
Cool. I am sure John can add more color here for variants we should care about, but I think this provides enough background information to get started (or keep working) on the experiments with ESOL.
Speaking of datasets @jchodera may like... I guess it would be cool, or at least topical, to use the data harvested in the COVID Moonshot project. https://postera.ai/covid/activity_data It's nice that
but not all molecules have the same type of measurements.
Great, this could be useful.
We're using the ESOL dataset to start with. Let's discuss which datasets to use here.
ideally, we would want our dataset to be