
Add sensitivity analysis #108

Merged: 15 commits merged into main on Dec 22, 2023

Conversation

@luigibonati (Owner) commented Dec 18, 2023

Description

Add a sensitivity analysis as done in the DeepLDA/DeepTICA papers, as requested in #105.

Todos

  • sensitivity_analysis in utils.explain
  • regtests
  • plot_sensitivity_analysis in utils.plot
  • add it to the DeepLDA example notebook
  • create a tutorial to explain it
  • automatically retrieve feature names from the dataset (Add feature_names attribute to DictDataset #106)

For now I created a new module inside utils called explain, although we might also move it outside utils and create an explain module that contains this together with other things, such as the sparse classification with Lasso (@pietronvll). A minimal usage sketch follows below.
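
For reference, a minimal sketch of the intended usage, based on the function names in the todo list above and the keyword arguments that appear later in this thread; model and dataset stand for an already trained CV model and a DictDataset, and the defaults shown here are illustrative rather than definitive:

from mlcolvar.utils.explain import sensitivity_analysis
from mlcolvar.utils.plot import plot_sensitivity

# compute a per-feature sensitivity score for a trained CV model on a dataset
results = sensitivity_analysis(model,
                               dataset,
                               metric="mean_abs_val",  # statistic used to aggregate the sensitivity per feature
                               per_class=False,        # whether to compute per-class statistics
                               plot_mode=None)         # skip the automatic plot, use plot_sensitivity below

# visualize the ranked sensitivities as violin plots, limited to the most relevant features
plot_sensitivity(results, mode="violin", max_features=50)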


codecov bot commented Dec 18, 2023

Codecov Report

Merging #108 (d0399d3) into main (4fc7301) will increase coverage by 0.36%.
Report is 2 commits behind head on main.
The diff coverage is 92.35%.

Additional details and impacted files: code scanning alerts
  • mlcolvar/tests/test_utils_explain.py: 1 alert dismissed
  • mlcolvar/utils/plot.py: 2 alerts fixed
  • mlcolvar/utils/explain.py: 3 alerts fixed
@luigibonati (Owner, Author) commented:

In order to automatically retrieve the names of the input descriptors, I added a feature_names attribute to DictDataset (both in the constructor and as a property) and changed the function create_dataset_from_files to set it automatically when creating a dataset from a dataframe. Let me know if you have comments.
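
For illustration, a short sketch of how the new attribute could be set, assuming the dictionary-based DictDataset constructor and that the constructor keyword matches the attribute name; the column names are made up:

import torch
from mlcolvar.data import DictDataset

X = torch.rand(100, 3)

# set the names explicitly in the constructor ...
dataset = DictDataset({"data": X}, feature_names=["d_1", "d_2", "d_3"])
print(dataset.feature_names)  # the names are now attached to the dataset

# ... or rely on create_dataset_from_files, which now fills feature_names
# automatically from the columns of the dataframe it reads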

Review annotation on mlcolvar/utils/plot.py:

return_ax = False

# define utils functions
def _set_violin_attributes(violin_parts, color, alpha=0.5, label=None, zorder=None):

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
(A similar alert on mlcolvar/utils/explain.py was dismissed.)
@EnricoTrizio (Collaborator) commented:

I would prefer to have the plots horizontally rather than vertically, please fix this @luigibonati 🤗

@andrrizzi (Collaborator) left a comment:

Looks good to me!

@jinmei-1 commented:

Hello! Thank you for providing the sensitivity analysis code. We have tested it and found its performance to be excellent. However, we encountered an issue while simplifying our system's 4536 descriptors: the resulting image is too large. In particular, we received the following error message:
ValueError: Image size of 500x113400 pixels is too large. It must be less than 2^16 in each direction.
We would like to seek your guidance on how to modify the code to address this issue. Additionally, while reading your article "Deep learning the slow modes for rare events sampling," we became interested in the method you mention for reducing the descriptor set in the Chignolin folding case. In your research, you reduced the number of descriptors from 4278 to 210 by selecting the most relevant ones through sensitivity analysis of the primary CVs. We would like to learn more about the criteria you used to select these 210 descriptors. Why 210 instead of 220 or 200?
We appreciate your assistance and look forward to your guidance. Meanwhile, we are sharing our test code with you for a better understanding of our issues.
Thank you for your time and support. We eagerly await your response.
Attachment: 12-21-code and COVARL.zip

@luigibonati (Owner, Author) replied:


Hi, thanks for the feedback. I added a tutorial to the documentation showing how to customize the analysis. It should work now, let me know!

  • By default it now plots only the first 50 features, but this can be changed through the plot_sensitivity options.
  • I have also added an example of how to create a new dataset using only the first N features (see the sketch below). How to choose N depends on (1) the distribution of the feature sensitivities and (2) the cost of evaluating the resulting CV. I remember that for the DeepTICA paper we were getting qualitatively similar results using e.g. 100, 200, or 300 features.
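
As a concrete illustration, a minimal sketch of this kind of feature selection, assuming results comes from sensitivity_analysis, that its feature names are sorted in ascending order of sensitivity, and that the original descriptors live in a pandas dataframe called colvar (both variable names are illustrative):

import torch
from mlcolvar.data import DictDataset

n_features = 200  # chosen by inspecting the sensitivity distribution and the cost of the resulting CV

# the feature names are sorted by increasing sensitivity, so the most relevant ones are at the end
relevant_features = list(results["feature_names"][-n_features:])

# build a reduced dataset containing only the selected descriptors
X = torch.tensor(colvar[relevant_features].values, dtype=torch.float32)
reduced_dataset = DictDataset({"data": X})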

Review annotation on the diff (@@ -256,6 +256,160 @@):

    else:
        return None


def plot_sensitivity(results, mode="violin", per_class=None, max_features=100, ax=None):

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
@luigibonati merged commit 6f7a6df into main on Dec 22, 2023 (12 checks passed)
@luigibonati (Owner, Author) commented:

This also closes #105 and #106.

@jinmei-1 commented:

Hello! Thank you for providing your tutorial and the new code. We've adapted and run DeepTICA (for our protein-ligand system) based on your DeepLDA tutorial. We successfully plotted the sensitivity, but encountered an issue when attempting to create a new dataset using only the first N features, as indicated below:
KeyError: "None of [Index(['2003', '2094', '2007', '2099', '2005', '2008', '2046', '2004', '2100', '2006'], dtype='object')] are in the [columns]"


Could you please guide us on how to modify it to suit the DeepTICA data?
Additionally, we have a minor question regarding this line of code: relevant_features = results['feature_names'][-n_features:]. Why are the last n_features elements selected from the end of the feature_names list, rather than the first n_features in the order of the sensitivity ranking?
We've attached our test code for your review to provide better insight into our challenges. Your time and support are greatly appreciated. We look forward to your response.
12-27-code and COVARL.zip

@luigibonati (Owner, Author) replied:

The features are sorted in ascending order; this is why we need to slice the last N features instead of the first ones.

Regarding the error in creating the new dataset: in the case of a time-lagged dataset (unlike those created with create_dataset_from_files) there is no information about the feature names, so they have to be specified by hand. Indeed, in the sensitivity plot the features are displayed as numbers rather than with the correct names.

Let me know if this works:


feature_names = colvar.filter(regex='d_').columns   # take the descriptor columns (here those matching 'd_') from the COLVAR dataframe

results = sensitivity_analysis(model,               
                               dataset, 
                               metric="mean_abs_val",       # metric to use to compute the sensitivity per feature (e.g. mean absolute value or root mean square)
                               feature_names=feature_names, # by default, they will be taken from `dataset.feature_names` 
                               per_class=False,             # whether to do per-class statistics
                               plot_mode=None)              # plot mode (see below) 

I will add a warning when the feature names cannot be retrieved from the dataset.

Luigi

@jinmei-1 commented:

Thank you for providing the new code; it works and I get the new Deep-TICA (.tpc) model.
I have successfully set up GROMACS_2022.5-plumed_2.9.0 on our server. Additionally, I have configured LibTorch and the PyTorch module, with both "plumed config has libtorch" and "plumed config module pytorch" returning on.
However, when attempting enhanced sampling simulations, I encountered an error: GROMACS displayed "free(): invalid pointer". This issue persisted when I tried to run the Deep-TICA example for alanine, as outlined in your paper "Deep learning the slow modes for rare events sampling" (Luigi Bonati, GiovanniMaria Piccini, and Michele Parrinello, Proceedings of the National Academy of Sciences, 118(44), 2021).

I find this error perplexing, and I was hoping you could assist me in troubleshooting or provide guidance on resolving this matter. I have attached the relevant files (plumed.dat, .tpr, .tpc, descriptors.dat, and a log) for your reference.
Thank you very much for your time and consideration. I appreciate any insights or assistance you can provide.

Best regards,
jinmei
ref.zip
