Memory usage during featurization #44

jaimergp · 2021-05-18T20:46:32Z

Recent changes to the featurization pipeline altered how the featurizers go through the different systems in a dataset.

Previously, a single system would go all the way through the pipeline (composed of N featurizers) before the next system started. This made it difficult to auto-detect global properties of the dataset (e.g. the maximum length to pad bit-vectors to), so we refactored the pipeline to traverse featurizers first.

# before
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now
for featurizer in featurizers:
    featurizer.featurize(systems)
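
To illustrate why the featurizer-first order helps with global properties: a padding step has to see every system before it can pick the target length. A minimal sketch, assuming each system has already been reduced to a list of bits; `pad_to_longest` is a hypothetical name for illustration, not the project's actual API.

```python
from typing import List, Sequence


def pad_to_longest(vectors: Sequence[Sequence[int]], fill: int = 0) -> List[List[int]]:
    """Pad every vector to the length of the longest one in the dataset.

    Hypothetical helper: the target length is a global property, so the
    whole dataset must be available before any vector can be padded.
    """
    # Global property: only known once all vectors have been produced.
    max_len = max((len(v) for v in vectors), default=0)
    return [list(v) + [fill] * (max_len - len(v)) for v in vectors]


if __name__ == "__main__":
    bit_vectors = [[1], [1, 0, 1], []]
    for padded in pad_to_longest(bit_vectors):
        print(padded)
```

With the old system-first loop, this step would need the maximum length supplied by hand, because no single system knows how long the others are.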

This, however, implies that ALL the artifacts created by each featurizer exist at the same time for the full dataset, which means higher peak memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mostly from the RDKit molecules created from the SMILES strings. We do clear the featurizations dictionary after each pass by default (recent change), but I am writing this down as an issue because memory might become a bottleneck for more complex schemes. Those might require featurizing datasets in batches.
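
One possible shape for batched featurization, as a rough sketch only: the featurizer-first loop runs once per batch, so intermediate artifacts exist only for the systems in the current batch. Everything here is an assumption for illustration: `batched`, `featurize_in_batches`, and the convention that a featurizer is a plain callable taking and returning a list are hypothetical, not the pipeline's actual API.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Sequence


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def featurize_in_batches(
    systems: Iterable[Any],
    featurizers: Sequence[Callable[[List[Any]], List[Any]]],
    batch_size: int = 1000,
) -> List[Any]:
    """Hypothetical helper: apply all featurizers to one batch at a time.

    Only the final features are accumulated; the intermediates of a batch
    become unreachable once the next batch starts.
    """
    results: List[Any] = []
    for batch in batched(systems, batch_size):
        current = batch
        for featurize in featurizers:  # featurizer-first, but per batch
            current = featurize(current)
        results.extend(current)
    return results


if __name__ == "__main__":
    # Toy stand-ins for real featurizers (assumed, for illustration only).
    smiles = ["C", "CC", "CCC", "CCCC", "CCCCC"]
    steps = [lambda batch: [len(s) for s in batch],
             lambda batch: [n * 2 for n in batch]]
    print(featurize_in_batches(smiles, steps, batch_size=2))
```

The trade-off is that properties computed inside a batch are no longer global: something like the padding length would have to be fixed up front or gathered in a cheap preliminary pass over the dataset.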

schallerdavid added low-priority enhancement New feature or request high-effort labels Aug 20, 2021
