Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Analysis Module #168

Closed
3 tasks done
ALescoulie opened this issue Jul 27, 2021 · 10 comments · Fixed by #179
Closed
3 tasks done

Adding Analysis Module #168

ALescoulie opened this issue Jul 27, 2021 · 10 comments · Fixed by #179
Assignees
Milestone

Comments

@ALescoulie
Copy link
Contributor

ALescoulie commented Jul 27, 2021

MDPOW-Analysis Module

Features

I would like to add an analysis module to MDPOW that takes the files in a FEP directory and runs different analysis functions in a modular framework, and saves results to csv files to an ANALYSIS directory in the molecule directory. The eventual goal being to use the data in a decision tree to predict convergence.

The goal is setting this up in such a way that an analysis can be easily run with a script or with in interactive just by passing in some kwargs like the example below.

run_analysis(directory='Benzene_mdpow', molecule='benzene', 
             solvents=['water', 'octanol'], H_bond=False, dihedral_selections=[])

I plan to run dihedral analysis, hydrogen bonding analysis, and solvation shell analysis. The analyses run are defiantly expandable.

Implementation

To do this I plan to implement an analysis class that takes either a simulation path or GSolv object and loads the systems at each lambda into MDAnalysis universes. The hope is to assemble universes in an ensemble similar to this pull request in MDAnalysis. I don't think my implementation has to be as detailed since mostly it will be used for storing universes in an organized way and getting atom group selections of a set of similar universes in a more concise way. With the ensemble the systems for each lambda and each solvent can be more easily analyzed as a group.

As for the analysis most it probably is simplest to set it up based on the AnalysisBase class. Please leave any suggestions you have for how better to implement this. I'm not yet committed to any one idea for implementation.

To Do

  • Design tests and an implementation of an object for storing several related systems with options to add, remove, and return a specific system accessed with a given key.
  • Write an AnalysisBase for the different analyses I want to run
  • Build an overall framework for accepting test parameters, loading systems, running tests, and saving results.

I'll update this issue as I have progress on implementing these features.

@VOD555
Copy link
Contributor

VOD555 commented Jul 28, 2021

We also can separate the analysis input keywords from the fep section of the .yml file, and add a new analysis section.

@orbeckst
Copy link
Member

orbeckst commented Jul 28, 2021

I would not have run_analysis perform multiple analysis tasks. Keep it simple along the lines of

class EnsembleAnalysis:
      """base class to perform an analysis tasks on all simulations of a FEP calculation"""

Then you can derive your different analyses from EnsembleAnalysis.

This will then easily allow you to run multiple EnsembleAnalysis over different molecules or analysis tasks. Think of coding building blocks instead of complete solutions.

@orbeckst
Copy link
Member

To add: think of the hierarchy of simulations. I would consider the "ensemble" the smallest set of simulations that is necessary to calculate a solvation free energy, i.e., for a given solvent, all the Coulomb and VDW windows. That's the base unit to work with for an ensemble. So you need to first write analysis code to analyze a single simulation (ideally, code based on MDAnalysis.analysis.base.AnalysisBase for new code or using existing analysis classes (eg for dihedrals you can derive your own class from analysis.dihedrals.Dihedral.) Then the EnsembleAnalysis runs the analysis class over the individual trajectories. Initially that can just be in serial. Once you have serial code, we can think about parallelizing.

@orbeckst
Copy link
Member

Finally: If you start with small units then it's a lot easier to write tests for them. In particular, I'd start with one (or two) AnalysisBase-based analysis class(es) because it/they will be immediately useful and can be easily tested.

Don't slow yourself down too much with thinking about tests – get to a prototype quickly.

@orbeckst
Copy link
Member

It would be nice if the following would eventually work:

from mdpow.analysis.solvation import NearestSolvent

# run analysis to get number of water molecules in first solvation shell
nearest_solvent = NearestSolvent("benzene/water.pickle", selection="name OH", r1=3.0).run()

# plot the number of OH within 3 Å of the solute as violinplots split by interaction and for each lambda
# from the tidy dataframe (contains: interaction, lambda, time, n) for all simulations
seaborn.catplot(data=nearest_solvent.results.timeseries, 
                           x="lambda", y="n", col="interaction",
                           kind="violin", ...) 

The EnsembleAnalysis would assemble all the individual time series (stored as CVS files with each simulation in the analysis directory) as a tidy (but huge) pandas.DataFrame . It is then easy to analyze the data in many different ways and also concatenate these dataframes for higher order analysis (e.g., add a solvent column , a molecule_id column, and a forcefield column and then one can aggregate over all the data in SAMPL7).

@orbeckst orbeckst linked a pull request Jul 30, 2021 that will close this issue
@ALescoulie ALescoulie linked a pull request Aug 3, 2021 that will close this issue
@orbeckst orbeckst added this to the 0.8.0 milestone Aug 4, 2021
@ALescoulie
Copy link
Contributor Author

Analysis Modules

Now that the Ensemble objects in PR #179 are just about finished, I thought I'd come back to this issue to layout next steps. To start with I'd like to implement three analyses.

  • Dihedral
  • Hydrogen Bonds
  • Solivation Shell

For the analysis modules some things need to be worked out.

  • DataFrame structure for storing results
  • Method for inputing parameters, possibly with a yaml?

I'd appreciate some input on how to build the DataFrames in a way that the data is ready for use immediately, and on how the user should interface with this module.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2021

For data structures I recommend tidy data. One observation - one row. If you have N columns then N-1 columns are "tags" such as solvent, interaction, lambda, timestep, molecule, dihedral, .... and one column is the actual observation. This makes it easy to aggregate data using pandas groupby and similar tricks or directly plot with seaborn's categorical plots.

You might think that this wastes space by repeating information in the dataframe but experience shows that this is generally a no-issue compared to the ease of processing the data.

(There's more to read about tidy data, including Hadley Wickham's article https://www.jstatsoft.org/article/view/v059i10/ .)

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2021

For hydrogen bonds I would just try to wrap the MDAnalysis analysis class. Look at the datastructure produced from the HydrogenBondAnalysis class and see if you want to keep it like it is or make it part of a tidy dataframe.

As a guiding principle you want to do as little as data gymnastics as possible for analysis, i.e., avoid having custom data wrangling code like "average_by_lambdas" which then pulls data out of custom dictionaries. Instead leverage what's present in pandas. Learn about groupby and friends.

@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2021

Initially, don't worry about specifying analysis parameters. Focus on the building blocks, which take parameters as args or kwargs and nothing else.

Once we work on the user interface (in the form of the CLI commands) then I would go with @VOD555 's suggestion #168 (comment) to add a analysis section to the runinput.yml.

orbeckst pushed a commit that referenced this issue Sep 9, 2021
* new Ensemble framework for aggregate analysis in mdpow.analysis
* part of #168 (adding analysis to MDPOW)
* new submodule mdpow.analysis
* new Ensemble, EnsembleAtomGroup, and EnsembleAnalysis classes
* add docs (new section on Analysis)
* add tests including ensemble test data (water simulations, octanol to be added later)
* update CHANGES
@orbeckst
Copy link
Member

orbeckst commented Sep 9, 2021

@ALescoulie I suggest you open a new issue for new analysis modules by transferring your #168 (comment) . I would then open individual issues for the analysis tools that you want to build. Keep things modular and separate.

Although I didn't intent to close this issue with PR #179, it make sense. I re-read the issue and PR #179 addresses pretty much all points. I will add the issue number to CHANGES.

orbeckst added a commit that referenced this issue Sep 9, 2021
Added issue #168 number to the Ensemble entries.
ALescoulie added a commit to ALescoulie/MDPOW that referenced this issue Sep 9, 2021
* new Ensemble framework for aggregate analysis in mdpow.analysis
* part of Becksteinlab#168 (adding analysis to MDPOW)
* new submodule mdpow.analysis
* new Ensemble, EnsembleAtomGroup, and EnsembleAnalysis classes
* add docs (new section on Analysis)
* add tests including ensemble test data (water simulations, octanol to be added later)
* update CHANGES
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants