Adding Analysis Module #168
Comments
We also can separate the analysis input keywords from the fep section of the |
I would not have
class EnsembleAnalysis:
    """base class to perform an analysis task on all simulations of a FEP calculation"""
Then you can derive your different analyses from This will then easily allow you to run multiple EnsembleAnalysis over different molecules or analysis tasks. Think of coding building blocks instead of complete solutions.
To add: think of the hierarchy of simulations. I would consider the "ensemble" the smallest set of simulations that is necessary to calculate a solvation free energy, i.e., for a given solvent, all the Coulomb and VDW windows. That's the base unit to work with for an ensemble. So you first need to write analysis code to analyze a single simulation: ideally, code based on MDAnalysis.analysis.base.AnalysisBase for new code, or using existing analysis classes (e.g., for dihedrals you can derive your own class from analysis.dihedrals.Dihedral). Then the EnsembleAnalysis runs the analysis class over the individual trajectories. Initially that can just be in serial. Once you have serial code, we can think about parallelizing.
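The hierarchy described above (a per-simulation analysis, then an ensemble runner that applies it serially) could look roughly like the sketch below. It is deliberately written without MDAnalysis so that it runs anywhere: `SingleAnalysis`, `CountWithin`, and `SerialEnsembleRunner` are hypothetical names, and a "trajectory" here is only a list of frames holding toy distances. Real code would derive the per-simulation class from MDAnalysis.analysis.base.AnalysisBase instead.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]          # toy frame: solute-solvent distances in Å (assumption)
Trajectory = List[Frame]     # toy stand-in for one simulation's trajectory
Key = Tuple[str, float]      # (interaction, lambda) identifies one simulation


class SingleAnalysis:
    """Hypothetical base class: analyze ONE trajectory frame by frame."""

    def __init__(self, trajectory: Trajectory) -> None:
        self.trajectory = trajectory
        self.results: List[int] = []

    def _single_frame(self, frame: Frame) -> int:
        raise NotImplementedError

    def run(self) -> "SingleAnalysis":
        # One observation per frame, collected in order.
        self.results = [self._single_frame(frame) for frame in self.trajectory]
        return self


class CountWithin(SingleAnalysis):
    """Count how many distances in each frame are below a cutoff."""

    def __init__(self, trajectory: Trajectory, cutoff: float) -> None:
        super().__init__(trajectory)
        self.cutoff = cutoff

    def _single_frame(self, frame: Frame) -> int:
        return sum(1 for distance in frame if distance < self.cutoff)


class SerialEnsembleRunner:
    """Run one analysis over every simulation of an ensemble, in serial."""

    def __init__(
        self,
        ensemble: Dict[Key, Trajectory],
        make_analysis: Callable[[Trajectory], SingleAnalysis],
    ) -> None:
        self.ensemble = ensemble
        self.make_analysis = make_analysis
        self.results: Dict[Key, List[int]] = {}

    def run(self) -> "SerialEnsembleRunner":
        for key, trajectory in self.ensemble.items():
            self.results[key] = self.make_analysis(trajectory).run().results
        return self


# Toy ensemble: two windows, two frames each.
ensemble = {
    ("Coulomb", 0.0): [[2.5, 3.5, 2.9], [4.0, 2.0]],
    ("VDW", 0.5): [[3.1], [2.9, 2.8, 5.0]],
}
runner = SerialEnsembleRunner(
    ensemble, lambda trajectory: CountWithin(trajectory, cutoff=3.0)
).run()
```

Because the runner only needs a factory that builds an analysis from a trajectory, the same runner can be reused for other analyses, and the per-simulation class can be tested on its own.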
Finally: If you start with small units then it's a lot easier to write tests for them. In particular, I'd start with one (or two) AnalysisBase-based analysis class(es) because it/they will be immediately useful and can be easily tested. Don't slow yourself down too much with thinking about tests – get to a prototype quickly. |
It would be nice if the following would eventually work:
import seaborn
from mdpow.analysis.solvation import NearestSolvent
# run analysis to get the number of water molecules in the first solvation shell
nearest_solvent = NearestSolvent("benzene/water.pickle", selection="name OH", r1=3.0).run()
# plot the number of OH within 3 Å of the solute as violin plots, split by interaction and for each lambda,
# from the tidy dataframe (contains: interaction, lambda, time, n) for all simulations
seaborn.catplot(data=nearest_solvent.results.timeseries,
                x="lambda", y="n", col="interaction",
                kind="violin", ...)
The |
Analysis Modules
Now that the Ensemble objects in PR #179 are just about finished, I thought I'd come back to this issue to lay out next steps. To start with, I'd like to implement three analyses.
For the analysis modules some things need to be worked out.
I'd appreciate some input on how to build the DataFrames in a way that the data is ready for use immediately, and on how the user should interface with this module. |
For data structures I recommend tidy data: one observation, one row. If you have N columns, then N-1 columns are "tags" such as solvent, interaction, lambda, timestep, molecule, dihedral, ..., and one column is the actual observation. This makes it easy to aggregate data using pandas groupby and similar tricks, or to plot directly with seaborn's categorical plots. You might think that this wastes space by repeating information in the dataframe, but experience shows that this is generally a non-issue compared to the ease of processing the data. (There's more to read about tidy data, including Hadley Wickham's article https://www.jstatsoft.org/article/view/v059i10/ .)
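As a minimal illustration of the tidy layout, the sketch below flattens per-simulation time series into one row per observation. The function name `to_tidy`, the input shape (a dict keyed by interaction and lambda), and the use of the frame index as "time" are assumptions made for the example. The resulting list of dicts can be passed to pandas.DataFrame, after which groupby and seaborn's categorical plots apply directly.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, float, int]]


def to_tidy(
    counts: Dict[Tuple[str, float], List[int]], solvent: str
) -> List[Row]:
    """Flatten {(interaction, lambda): [n per frame]} into tidy rows.

    Every row carries its tags (solvent, interaction, lambda, time)
    plus exactly one observation column (n).
    """
    rows: List[Row] = []
    for (interaction, lam), series in counts.items():
        for time, n in enumerate(series):  # frame index used as time (assumption)
            rows.append(
                {
                    "solvent": solvent,
                    "interaction": interaction,
                    "lambda": lam,
                    "time": time,
                    "n": n,
                }
            )
    return rows


# Toy input: two simulations with two frames each.
rows = to_tidy({("Coulomb", 0.0): [2, 1], ("VDW", 0.5): [0, 2]}, solvent="water")
```

The tags are repeated on every row, which is the redundancy mentioned above. In exchange, no custom lookup code is needed to know which simulation an observation came from.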
For hydrogen bonds I would just try to wrap the MDAnalysis analysis class. Look at the data structure produced by the HydrogenBondAnalysis class and see if you want to keep it as it is or make it part of a tidy dataframe. As a guiding principle, you want to do as little data gymnastics as possible for analysis, i.e., avoid custom data-wrangling code like "average_by_lambdas" that then pulls data out of custom dictionaries. Instead, leverage what's present in pandas. Learn about groupby and friends. |
Initially, don't worry about specifying analysis parameters. Focus on the building blocks, which take parameters as args or kwargs and nothing else. Once we work on the user interface (in the form of the CLI commands), I would go with @VOD555's suggestion #168 (comment) to add an analysis section to the |
* new Ensemble framework for aggregate analysis in mdpow.analysis
* part of #168 (adding analysis to MDPOW)
* new submodule mdpow.analysis
* new Ensemble, EnsembleAtomGroup, and EnsembleAnalysis classes
* add docs (new section on Analysis)
* add tests including ensemble test data (water simulations, octanol to be added later)
* update CHANGES
@ALescoulie I suggest you open a new issue for new analysis modules by transferring your #168 (comment). I would then open individual issues for the analysis tools that you want to build. Keep things modular and separate. Although I didn't intend to close this issue with PR #179, it makes sense. I re-read the issue, and PR #179 addresses pretty much all points. I will add the issue number to CHANGES. |
Added issue #168 number to the Ensemble entries.
MDPOW-Analysis Module
Features
I would like to add an analysis module to MDPOW that takes the files in a FEP directory, runs different analysis functions in a modular framework, and saves the results as CSV files in an ANALYSIS directory inside the molecule directory. The eventual goal is to use the data in a decision tree to predict convergence.
The goal is to set this up in such a way that an analysis can easily be run from a script or in an interactive session just by passing in some kwargs, like the example below.
I plan to run dihedral analysis, hydrogen bonding analysis, and solvation shell analysis. The set of analyses is definitely expandable.
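The CSV output described above could be sketched as follows, using only the standard library. The helper name `save_analysis_csv`, the one-file-per-analysis naming scheme, and the row format (a list of dicts with identical keys) are assumptions made for illustration; only the ANALYSIS directory inside the molecule directory comes from the proposal.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union


def save_analysis_csv(
    molecule_dir: Union[str, Path],
    name: str,
    rows: List[Dict[str, object]],
) -> Path:
    """Write rows to <molecule_dir>/ANALYSIS/<name>.csv and return the path."""
    if not rows:
        raise ValueError("no rows to write")

    analysis_dir = Path(molecule_dir) / "ANALYSIS"
    analysis_dir.mkdir(parents=True, exist_ok=True)  # create on first use

    path = analysis_dir / f"{name}.csv"
    with path.open("w", newline="") as handle:
        # Column order follows the keys of the first row.
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

A call such as `save_analysis_csv("benzene", "nearest_solvent", rows)` would then leave one CSV per analysis next to the FEP data, ready to be read back later.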
Implementation
To do this I plan to implement an analysis class that takes either a simulation path or GSolv object and loads the systems at each lambda into MDAnalysis universes. The hope is to assemble universes in an ensemble similar to this pull request in MDAnalysis. I don't think my implementation has to be as detailed since mostly it will be used for storing universes in an organized way and getting atom group selections of a set of similar universes in a more concise way. With the ensemble the systems for each lambda and each solvent can be more easily analyzed as a group.
As for the analysis, it is probably simplest to set it up based on the AnalysisBase class. Please leave any suggestions you have for how better to implement this. I'm not yet committed to any one idea for implementation.
To Do
I'll update this issue as I have progress on implementing these features.