Adding Analysis Module #168

ALescoulie · 2021-07-27T20:17:49Z

MDPOW-Analysis Module

Features

I would like to add an analysis module to MDPOW that takes the files in a FEP directory and runs different analysis functions in a modular framework, and saves results to csv files to an ANALYSIS directory in the molecule directory. The eventual goal being to use the data in a decision tree to predict convergence.

The goal is setting this up in such a way that an analysis can be easily run with a script or with in interactive just by passing in some kwargs like the example below.

run_analysis(directory='Benzene_mdpow', molecule='benzene', 
             solvents=['water', 'octanol'], H_bond=False, dihedral_selections=[])

I plan to run dihedral analysis, hydrogen bonding analysis, and solvation shell analysis. The analyses run are defiantly expandable.

Implementation

To do this I plan to implement an analysis class that takes either a simulation path or GSolv object and loads the systems at each lambda into MDAnalysis universes. The hope is to assemble universes in an ensemble similar to this pull request in MDAnalysis. I don't think my implementation has to be as detailed since mostly it will be used for storing universes in an organized way and getting atom group selections of a set of similar universes in a more concise way. With the ensemble the systems for each lambda and each solvent can be more easily analyzed as a group.

As for the analysis most it probably is simplest to set it up based on the AnalysisBase class. Please leave any suggestions you have for how better to implement this. I'm not yet committed to any one idea for implementation.

To Do

Design tests and an implementation of an object for storing several related systems with options to add, remove, and return a specific system accessed with a given key.
Write an AnalysisBase for the different analyses I want to run
Build an overall framework for accepting test parameters, loading systems, running tests, and saving results.

I'll update this issue as I have progress on implementing these features.

The text was updated successfully, but these errors were encountered:

VOD555 · 2021-07-28T20:04:43Z

We also can separate the analysis input keywords from the fep section of the .yml file, and add a new analysis section.

orbeckst · 2021-07-28T22:02:10Z

I would not have run_analysis perform multiple analysis tasks. Keep it simple along the lines of

class EnsembleAnalysis:
      """base class to perform an analysis tasks on all simulations of a FEP calculation"""

Then you can derive your different analyses from EnsembleAnalysis.

This will then easily allow you to run multiple EnsembleAnalysis over different molecules or analysis tasks. Think of coding building blocks instead of complete solutions.

orbeckst · 2021-07-28T22:11:49Z

To add: think of the hierarchy of simulations. I would consider the "ensemble" the smallest set of simulations that is necessary to calculate a solvation free energy, i.e., for a given solvent, all the Coulomb and VDW windows. That's the base unit to work with for an ensemble. So you need to first write analysis code to analyze a single simulation (ideally, code based on MDAnalysis.analysis.base.AnalysisBase for new code or using existing analysis classes (eg for dihedrals you can derive your own class from analysis.dihedrals.Dihedral.) Then the EnsembleAnalysis runs the analysis class over the individual trajectories. Initially that can just be in serial. Once you have serial code, we can think about parallelizing.

orbeckst · 2021-07-28T22:15:07Z

Finally: If you start with small units then it's a lot easier to write tests for them. In particular, I'd start with one (or two) AnalysisBase-based analysis class(es) because it/they will be immediately useful and can be easily tested.

Don't slow yourself down too much with thinking about tests – get to a prototype quickly.

orbeckst · 2021-07-28T22:37:21Z

It would be nice if the following would eventually work:

from mdpow.analysis.solvation import NearestSolvent

# run analysis to get number of water molecules in first solvation shell
nearest_solvent = NearestSolvent("benzene/water.pickle", selection="name OH", r1=3.0).run()

# plot the number of OH within 3 Å of the solute as violinplots split by interaction and for each lambda
# from the tidy dataframe (contains: interaction, lambda, time, n) for all simulations
seaborn.catplot(data=nearest_solvent.results.timeseries, 
                           x="lambda", y="n", col="interaction",
                           kind="violin", ...)

The EnsembleAnalysis would assemble all the individual time series (stored as CVS files with each simulation in the analysis directory) as a tidy (but huge) pandas.DataFrame . It is then easy to analyze the data in many different ways and also concatenate these dataframes for higher order analysis (e.g., add a solvent column , a molecule_id column, and a forcefield column and then one can aggregate over all the data in SAMPL7).

ALescoulie · 2021-09-09T16:45:25Z

Analysis Modules

Now that the Ensemble objects in PR #179 are just about finished, I thought I'd come back to this issue to layout next steps. To start with I'd like to implement three analyses.

Dihedral
Hydrogen Bonds
Solivation Shell

For the analysis modules some things need to be worked out.

DataFrame structure for storing results
Method for inputing parameters, possibly with a yaml?

I'd appreciate some input on how to build the DataFrames in a way that the data is ready for use immediately, and on how the user should interface with this module.

orbeckst · 2021-09-09T16:59:14Z

For data structures I recommend tidy data. One observation - one row. If you have N columns then N-1 columns are "tags" such as solvent, interaction, lambda, timestep, molecule, dihedral, .... and one column is the actual observation. This makes it easy to aggregate data using pandas groupby and similar tricks or directly plot with seaborn's categorical plots.

You might think that this wastes space by repeating information in the dataframe but experience shows that this is generally a no-issue compared to the ease of processing the data.

(There's more to read about tidy data, including Hadley Wickham's article https://www.jstatsoft.org/article/view/v059i10/ .)

orbeckst · 2021-09-09T17:02:59Z

For hydrogen bonds I would just try to wrap the MDAnalysis analysis class. Look at the datastructure produced from the HydrogenBondAnalysis class and see if you want to keep it like it is or make it part of a tidy dataframe.

As a guiding principle you want to do as little as data gymnastics as possible for analysis, i.e., avoid having custom data wrangling code like "average_by_lambdas" which then pulls data out of custom dictionaries. Instead leverage what's present in pandas. Learn about groupby and friends.

orbeckst · 2021-09-09T17:06:50Z

Initially, don't worry about specifying analysis parameters. Focus on the building blocks, which take parameters as args or kwargs and nothing else.

Once we work on the user interface (in the form of the CLI commands) then I would go with @VOD555 's suggestion #168 (comment) to add a analysis section to the runinput.yml.

* new Ensemble framework for aggregate analysis in mdpow.analysis * part of #168 (adding analysis to MDPOW) * new submodule mdpow.analysis * new Ensemble, EnsembleAtomGroup, and EnsembleAnalysis classes * add docs (new section on Analysis) * add tests including ensemble test data (water simulations, octanol to be added later) * update CHANGES

orbeckst · 2021-09-09T17:53:35Z

@ALescoulie I suggest you open a new issue for new analysis modules by transferring your #168 (comment) . I would then open individual issues for the analysis tools that you want to build. Keep things modular and separate.

Although I didn't intent to close this issue with PR #179, it make sense. I re-read the issue and PR #179 addresses pretty much all points. I will add the issue number to CHANGES.

Added issue #168 number to the Ensemble entries.

* new Ensemble framework for aggregate analysis in mdpow.analysis * part of Becksteinlab#168 (adding analysis to MDPOW) * new submodule mdpow.analysis * new Ensemble, EnsembleAtomGroup, and EnsembleAnalysis classes * add docs (new section on Analysis) * add tests including ensemble test data (water simulations, octanol to be added later) * update CHANGES

orbeckst assigned ALescoulie Jul 28, 2021

orbeckst linked a pull request Jul 30, 2021 that will close this issue

Analysis Draft #169

Closed

ALescoulie linked a pull request Aug 3, 2021 that will close this issue

Add Ensemble Objects #179

Merged

orbeckst added this to the 0.8.0 milestone Aug 4, 2021

orbeckst closed this as completed in #179 Sep 9, 2021

orbeckst added a commit that referenced this issue Sep 9, 2021

update CHANGES

d4767ee

Added issue #168 number to the Ensemble entries.

orbeckst mentioned this issue Dec 27, 2021

release 0.8.0 with EnsembleAnalysis and Python 3 only #200

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Analysis Module #168

Adding Analysis Module #168

ALescoulie commented Jul 27, 2021 •

edited

Loading

VOD555 commented Jul 28, 2021

orbeckst commented Jul 28, 2021 •

edited

Loading

orbeckst commented Jul 28, 2021

orbeckst commented Jul 28, 2021

orbeckst commented Jul 28, 2021

ALescoulie commented Sep 9, 2021

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

Adding Analysis Module #168

Adding Analysis Module #168

Comments

ALescoulie commented Jul 27, 2021 • edited Loading

MDPOW-Analysis Module

Features

Implementation

To Do

VOD555 commented Jul 28, 2021

orbeckst commented Jul 28, 2021 • edited Loading

orbeckst commented Jul 28, 2021

orbeckst commented Jul 28, 2021

orbeckst commented Jul 28, 2021

ALescoulie commented Sep 9, 2021

Analysis Modules

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

orbeckst commented Sep 9, 2021

ALescoulie commented Jul 27, 2021 •

edited

Loading

orbeckst commented Jul 28, 2021 •

edited

Loading