
dataset selection #29
Open · yuanqing-wang opened this issue May 13, 2020 · 8 comments

@yuanqing-wang (Member) commented May 13, 2020

We're using the ESOL dataset to start with. Let's discuss what datasets to use here.

Ideally, we would want our dataset to:

  • involve regression tasks (for now?)
  • depend solely on the graph (therefore I would vote against something like QM9 and friends; geometry complicates things)
  • enable us to have out-of-distribution data
@karalets (Collaborator)

Thank you for starting this issue!

In particular, I am interested in the following details.

  1. We are currently using ESOL, I assume, to do purely supervised learning on measurements for particular graphs. Can we discuss training/test set sizes here, etc.?
  2. What 'background' datasets would we consider using if we were to try semi-supervised learning that partially informs the graph beyond the ESOL training set?
  3. In the Cambridge paper they also have foreground tasks and background datasets; can we discuss their datasets as well?
  4. In the task we primarily care about down the line, we anticipate being in a regime where we have few measurements and want to trade off the cost of generating a measurement against information-theoretic quantities. Can we make a specific pitch for a first loop here: a dataset where we start with some given measurements and do active learning to pick molecules, so as to get better at held-out data?
    What background data would be relevant here?

I would like to be extremely specific here so that we can plan an experiment section for a V0 of a tech report and guide our current explorations closely toward that loop.

Many thanks, @yuanqing-wang !

@yuanqing-wang (Member, Author) commented May 14, 2020

> We are currently using ESOL, I assume, to do purely supervised learning on measurements for particular graphs. Can we discuss training/test set sizes here, etc.?

ESOL has 1128 molecular graphs, with an average number of nodes around 20.
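
For reference, a minimal sketch of how one might check those numbers, assuming a local `esol.csv` export with a `smiles` column (the filename and column name are assumptions; adjust to whatever ESOL/Delaney export is actually used):

```python
# Minimal sketch: count molecules and the average heavy-atom (node) count in ESOL.
# Assumes a local esol.csv with a "smiles" column; adjust path/column as needed.
import csv
from rdkit import Chem

with open("esol.csv") as f:
    smiles = [row["smiles"] for row in csv.DictReader(f)]

mols = [Chem.MolFromSmiles(s) for s in smiles]
mols = [m for m in mols if m is not None]  # drop unparsable entries

n_nodes = [m.GetNumAtoms() for m in mols]  # heavy atoms = graph nodes
print(f"{len(mols)} molecules, average nodes ~ {sum(n_nodes) / len(n_nodes):.1f}")
```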

> What 'background' datasets would we consider using if we were to try semi-supervised learning that partially informs the graph beyond the ESOL training set?

We can use ZINC or [Enamine REAL](https://enamine.net/library-synthesis/real-compounds/real-database), or a subset thereof, to represent the synthesizable space of (drug-like) organic small molecules.
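
A sketch of how such a source could become an unlabeled background pool, assuming a plain-text SMILES file (one molecule per line) downloaded from a ZINC or Enamine REAL subset; the filename and pool size are placeholders:

```python
# Minimal sketch: subsample an unlabeled "background" pool of molecules
# from a large SMILES file (e.g., a ZINC or Enamine REAL subset).
# "zinc_subset.smi" is a placeholder filename.
import random

random.seed(0)

with open("zinc_subset.smi") as f:
    background_smiles = [line.split()[0] for line in f if line.strip()]

# Keep a manageable pool for semi-supervised experiments.
pool = random.sample(background_smiles, k=min(100_000, len(background_smiles)))
print(f"background pool size: {len(pool)}")
```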

> In the Cambridge paper they also have foreground tasks and background datasets; can we discuss their datasets as well?

They used FreeSolv, Melting, ESOL, CatS, Malaria, and p450. It shouldn't be hard to add APIs to grab and import these.
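
A hedged sketch of what such an API could look like: a small registry mapping dataset names to download URLs plus a common reader. The URLs and column names below are placeholders, since they depend on where each dataset is actually hosted and how it is formatted:

```python
# Sketch of a dataset-loading API: name -> URL, plus a shared CSV reader.
# URLs and column names are placeholders; swap in the real hosting locations.
import csv
import io
import urllib.request

DATASETS = {
    "esol": "https://example.org/esol.csv",          # placeholder URL
    "freesolv": "https://example.org/freesolv.csv",  # placeholder URL
    # "melting", "cats", "malaria", "p450", ...
}

def load_dataset(name):
    """Download a named dataset and return a list of (smiles, measurement) pairs."""
    url = DATASETS[name]
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(text)))
    # Assumes columns named "smiles" and "measurement"; adjust per dataset.
    return [(r["smiles"], float(r["measurement"])) for r in rows]
```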

> In the task we primarily care about down the line, we anticipate being in a regime where we have few measurements and want to trade off the cost of generating a measurement against information-theoretic quantities. Can we make a specific pitch for a first loop here: a dataset where we start with some given measurements and do active learning to pick molecules, so as to get better at held-out data?
> What background data would be relevant here?

To keep comparisons fair and simple, I'd suggest partitioning a single dataset into foreground and background. Does that sound reasonable?
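
A minimal sketch of that within-dataset partition, with arbitrary split fractions: a small labeled foreground set, a larger background pool whose labels we pretend not to have, and a held-out test set for the active-learning loop to be evaluated against.

```python
# Sketch: partition one dataset into foreground (labeled), background
# (treated as unlabeled), and held-out test. Fractions are arbitrary.
import random

def partition(records, frac_foreground=0.1, frac_test=0.2, seed=0):
    """records: list of (smiles, measurement) pairs."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(frac_test * len(shuffled))
    n_fg = int(frac_foreground * len(shuffled))
    test = shuffled[:n_test]
    foreground = shuffled[n_test:n_test + n_fg]
    background = [smiles for smiles, _ in shuffled[n_test + n_fg:]]  # drop labels
    return foreground, background, test
```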

@karalets (Collaborator) commented May 14, 2020

> We are currently using ESOL, I assume, to do purely supervised learning on measurements for particular graphs. Can we discuss training/test set sizes here, etc.?

> ESOL has 1128 molecular graphs, with an average number of nodes around 20.

Is it that small? 1128 samples is not nothing, but it is not much.
How much data for just the graph structures can we get from other related datasets?

> What 'background' datasets would we consider using if we were to try semi-supervised learning that partially informs the graph beyond the ESOL training set?

> We can use ZINC or [Enamine REAL](https://enamine.net/library-synthesis/real-compounds/real-database), or a subset thereof, to represent the synthesizable space of (drug-like) organic small molecules.

As asked above, how big would this be?

> In the Cambridge paper they also have foreground tasks and background datasets; can we discuss their datasets as well?

> They used FreeSolv, Melting, ESOL, CatS, Malaria, and p450. It shouldn't be hard to add APIs to grab and import these.

Great. Sizes? Relevance to tasks we may want to solve? Ideally we would build toward a pipeline that has some relevance for the COVID tasks John suggested in Slack a while ago.

> In the task we primarily care about down the line, we anticipate being in a regime where we have few measurements and want to trade off the cost of generating a measurement against information-theoretic quantities. Can we make a specific pitch for a first loop here: a dataset where we start with some given measurements and do active learning to pick molecules, so as to get better at held-out data?
> What background data would be relevant here?

> To keep comparisons fair and simple, I'd suggest partitioning a single dataset into foreground and background. Does that sound reasonable?

The problem is that none of this will be 'out of distribution', and we don't really know whether this is fair 'background data'; the Cambridge paper discussed needing to stratify the unsupervised data to get good representations.

This is going to be a major part of the experimental design here, but for now we can set up an experimental loop that has flags for what all these objects are and passes the related dataloaders accordingly; we can point them at other datasets later.

But I would really like us to put together a concrete, full set of datasets, of a quality we would consider publishable, that we run things on now.
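
A hedged sketch of that flag-driven setup: command-line flags name the foreground/background datasets and the split strategy, and the downstream code only ever sees the resulting collections, so datasets can be swapped later without touching the experiment code. It reuses the `load_dataset` and `partition` sketches from earlier in this thread, which are assumptions rather than existing project APIs.

```python
# Sketch: experiment entry point where the datasets are chosen via flags,
# so foreground/background/test can be swapped later without code changes.
# `load_dataset` and `partition` refer to the sketches earlier in this thread.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--foreground", default="esol")
    parser.add_argument("--background", default="esol")  # same dataset for now
    parser.add_argument("--split", default="random", choices=["random", "scaffold", "time"])
    parser.add_argument("--frac-foreground", type=float, default=0.1)
    args = parser.parse_args()

    records = load_dataset(args.foreground)  # loader sketch above
    foreground, background, test = partition(records, frac_foreground=args.frac_foreground)

    # Downstream code (model, active-learning loop, evaluation) only ever sees
    # these three collections / their dataloaders, so swapping datasets later
    # is purely a flag change.
    print(len(foreground), len(background), len(test))

if __name__ == "__main__":
    main()
```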

@yuanqing-wang (Member, Author)

> Is it that small?

Welcome to the world of molecular machine learning. The other datasets they used are not dramatically larger either: FreeSolv has 650 data points. The rest look like property names rather than specific dataset names, and depending on where you get them, the sizes may vary.

But this is generally true: (data, measurement)-pair datasets in molecular ML are either small or unreliable. Each entry costs money and time. If you have enough money and time, you're probably a pharma company and therefore wouldn't be excited about the idea of sharing data.

The exceptions are the QM9 dataset and friends, which contain quantum-physical data but depend (to varying extents) on the geometry of the molecule rather than the topology alone.

@yuanqing-wang (Member, Author)

Ways to provide out-of-distribution data: we can partition the datasets by the time the compound was developed, the scaffold it contains, etc.

Like they did here:

PotentialNet for Molecular Property Prediction
https://doi.org/10.1021/acscentsci.8b00507
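
For the scaffold route, a minimal sketch using RDKit's Bemis-Murcko scaffolds, assuming a list of (smiles, measurement) records as in the earlier sketches; holding out the long tail of rare scaffolds gives an out-of-distribution test set:

```python
# Sketch: out-of-distribution split by Bemis-Murcko scaffold.
# Molecules sharing a scaffold land on the same side of the split.
from collections import defaultdict
from rdkit.Chem.Scaffolds.MurckoScaffold import MurckoScaffoldSmiles

def scaffold_split(records, frac_train=0.8):
    """records: list of (smiles, measurement) pairs."""
    groups = defaultdict(list)
    for smiles, y in records:
        scaffold = MurckoScaffoldSmiles(smiles=smiles)
        groups[scaffold].append((smiles, y))

    # Fill training with the largest scaffold groups; the long tail of rare
    # scaffolds becomes the (out-of-distribution) test set.
    train, test = [], []
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        (train if len(train) < frac_train * len(records) else test).extend(members)
    return train, test
```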

@karalets (Collaborator)

Cool. I am sure John can add more color here for variants we should care about, but I think this provides enough background information to get started (or keep working) on the experiments with ESOL.

@yuanqing-wang (Member, Author)

Speaking of datasets @jchodera may like...

I guess it would be cool, or at least topical, to use the data harvested in the COVID Moonshot project.

https://postera.ai/covid/activity_data

It's nice that:

  • it's 370 molecules now and counting
  • the measurements come with error bars

but not all molecules have the same types of measurements.
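
A sketch of how one might represent that heterogeneity, assuming the activity data comes down as a CSV with a SMILES column plus per-assay columns that may be empty; the column and assay names below are placeholders and should be checked against the actual export:

```python
# Sketch: represent per-molecule measurements that may be missing for some
# assay types, keeping reported error bars alongside each value.
# Column names ("SMILES", "IC50", "IC50_err", ...) are placeholders.
import csv
import math

def load_activity(path, assays=("IC50", "pIC50")):  # placeholder assay names
    records = []
    with open(path) as f:
        for row in csv.DictReader(f):
            measurements = {}
            for assay in assays:
                value, err = row.get(assay), row.get(f"{assay}_err")
                if value not in (None, ""):
                    measurements[assay] = (float(value), float(err) if err else math.nan)
            records.append({"smiles": row["SMILES"], "measurements": measurements})
    return records

# Training on one assay then just means filtering to the molecules that have it:
# ic50_records = [r for r in load_activity("activity.csv") if "IC50" in r["measurements"]]
```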

@karalets (Collaborator)

Great, this could be useful.
