Commit

Merge pull request #66 from bioshape-analysis/qiyu
modified blog after review
clementsoubrier authored Jan 1, 2025
2 parents 7750ff3 + 6c29835 commit aee0918
Showing 3 changed files with 23 additions and 9 deletions.
12 changes: 6 additions & 6 deletions posts/RECOVAR/index.qmd
@@ -10,11 +10,11 @@ categories: [cryo-EM] # [biology, bioinformatics, theory, etc.]
## Background
Cryogenic electron microscopy (cryo-EM), a cryomicroscopy technique applied to samples embedded in ice, has, together with recent advances in hardware and software, achieved great success in determining biomolecular structures at near-atomic resolution. Cryo-EM takes snapshots of thousands or millions of particles frozen in different poses in the sample, and thus allows the 3D structure to be reconstructed from those 2D projections.

Early algorithms and software for processing cryo-EM data focused on resolving a single homogeneous structure of a biomolecule. However, many biomolecules are highly dynamic in conformation, composition, or both. For example, ribosomes comprise many subunits, and their composition may vary within a sample, which is itself of research interest. The spike protein is an example of conformational heterogeneity: its receptor-binding domain (RBD) keeps switching between closed and open states in order to bind to receptors while resisting antibody binding. When studying an antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.
Early algorithms and software for processing cryo-EM data focused on resolving a single homogeneous structure of a biomolecule. However, many biomolecules are highly dynamic in conformation, composition, or both. For example, ribosomes comprise many subunits, and their composition may vary within a sample, which is itself of research interest. The spike protein is an example of conformational heterogeneity: its receptor-binding domain (RBD) keeps switching between closed and open states in order to bind to receptors while resisting antibody binding [@PPR:PPR217218]. When studying an antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.

![A simple illustration of the conformational heterogeneity of the spike protein, which displays two kinds of conformations: closed RBD and open RBD of one chain (colored in blue) [@Wang2020]. The spike protein is a trimer, so in reality all three chains move, possibly in different ways, and the motion is much more complex than what is shown here.](img/spike.png){ width=65% style="display: block; margin-left: auto; margin-right: auto;" }

Heterogeneity analysis of 3D structures reconstructed from cryo-EM data started with relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done with expectation-maximization (EM) algorithms, in which 2D particle images are iteratively assigned to classes and used to reconstruct the volume of each class. However, this approach has two problems. First, classification reduces the number of images used to reconstruct each volume, and thus lowers the achievable resolution. Second, the motion of a biomolecule is in reality continuous, so discrete classes may describe the heterogeneity poorly and miss transient states.
Heterogeneity analysis of 3D structures reconstructed from cryo-EM data started with relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done with expectation-maximization (EM) algorithms, in which 2D particle images are iteratively assigned to classes and used to reconstruct the volume of each class [@scheres_2012]. However, this approach has two problems. First, classification reduces the number of images used to reconstruct each volume, and thus lowers the achievable resolution. Second, the motion of a biomolecule is in reality continuous, so discrete classes may describe the heterogeneity poorly and miss transient states.

Therefore, attention has shifted to methods that model continuous heterogeneity without any classification step. Most of these methods share a similar structure: 2D particle images are mapped to latent embeddings, clusters or trajectories are estimated in latent space, and volumes are finally reconstructed from the latent embeddings. Early methods use a linear mapping (e.g. 3DVA), but with the adoption of deep learning in cryo-EM data processing, methods adapted from the variational autoencoder (VAE) have been found to achieve better performance (e.g. cryoDRGN, 3DFlex). Nevertheless, the latent space obtained from VAEs and other deep learning methods is hard to interpret and does not preserve distances or densities, which makes it difficult to reconstruct the motions and trajectories that most structural biologists ultimately want.

@@ -65,7 +65,7 @@ $$d_{s,v}(h) = \lVert S_sV^{-1/2}(M_v(\hat{x}_{CV}-\hat{x}(h)))\rVert_2^2$$

where $S_s$ is a matrix that extracts shell $s$; $M_v$ is a matrix extracting subvolume $v$; and $V$ is a diagonal matrix containing the variance of the template. For each $s$ and $v$, the minimizer over $h$ of the cross-validation score is selected, and the final volume is obtained by first recombining frequency shells for each subvolume and then recombining all the subvolumes.
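
The shell-wise selection described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not RECOVAR's implementation: it handles a single subvolume (so $M_v$ is taken as already applied), stores each volume as a flat array of coefficients, and the function name `select_bandwidth_per_shell` and its arguments are hypothetical.

```python
import numpy as np

def select_bandwidth_per_shell(candidates, x_cv, variance, shell_index):
    """Pick, for every frequency shell, the candidate that minimizes d_s(h).

    candidates  : (n_h, n_coeff) array, one estimated volume per bandwidth h
    x_cv        : (n_coeff,) cross-validation volume
    variance    : (n_coeff,) diagonal of V (assumed strictly positive)
    shell_index : (n_coeff,) integer shell label of each coefficient
    """
    candidates = np.asarray(candidates)
    n_shells = int(shell_index.max()) + 1
    # Whitened squared residual |V^{-1/2} (x_cv - x(h))|^2 per coefficient.
    resid = np.abs(candidates - x_cv[None, :]) ** 2 / variance[None, :]
    scores = np.zeros((candidates.shape[0], n_shells))
    for s in range(n_shells):
        # S_s keeps only the coefficients that belong to shell s.
        scores[:, s] = resid[:, shell_index == s].sum(axis=1)
    best = scores.argmin(axis=0)  # index of the selected h for each shell
    # Recombine shells: each coefficient comes from its shell's best candidate.
    combined = candidates[best[shell_index], np.arange(candidates.shape[1])]
    return best, combined
```

For several subvolumes, one would presumably repeat this per subvolume and then stitch the results together, as the text describes; that step is omitted here.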

![Volumes are reconstructed from the embedding by adaptive kernel regression.](img/3D_reconstruct.png){ width=100% style="display: block; margin-left: auto; margin-right: auto;" }
![Volumes are reconstructed from the embedding by adaptive kernel regression [@Gilles2023].](img/3D_reconstruct.png){ width=100% style="display: block; margin-left: auto; margin-right: auto;" }

### Estimation of state density
Since motion is what structural biologists ultimately want, we need a way to sample from latent space to form a trajectory representing the motion of the molecule. According to Boltzmann statistics, the density of a state is a measure of its free energy, so a path that maximizes conformational density is equivalent to a path that minimizes free energy. Taking advantage of the linear mapping, we can easily relate embedding density to conformational density. The embedding density estimator is given by:
@@ -101,15 +101,15 @@ The original paper of RECOVAR presents results on precatalytic spliceosome datas

Results on EMPIAR-10180 focus on conformational heterogeneity. Three local maxima in conformational density were identified, and a path between two of them shows the arm regions moving down, followed by the head regions moving up.

![Latent space and volume view of precatalytic spliceosome conformational heterogeneity. Latent view of the path is projected on the plane formed by different combinations of two principal components.](img/EMPIAR-10180.png){ width=75% style="display: block; margin-left: auto; margin-right: auto;" }
![Latent space and volume view of precatalytic spliceosome conformational heterogeneity. Latent view of the path is projected on the plane formed by different combinations of two principal components [@Gilles2023].](img/EMPIAR-10180.png){ width=75% style="display: block; margin-left: auto; margin-right: auto;" }

EMPIAR-10345 contains both conformational and compositional heterogeneity. Two local maxima were found, the smaller of which corresponds to a different composition not reported by previous studies. A motion of the arm was also found along the path.

![RECOVAR finds both conformational and compositional heterogeneity within integrin](img/EMPIAR-10345.png){ width=45% style="display: block; margin-left: auto; margin-right: auto;" }
![RECOVAR finds both conformational and compositional heterogeneity within integrin [@Gilles2023].](img/EMPIAR-10345.png){ width=45% style="display: block; margin-left: auto; margin-right: auto;" }

EMPIAR-10076 is used to show the ability of RECOVAR to find stable states. RECOVAR finds two stable states of the 70S ribosomes.

![The volumes of the two stable states are reconstructed, corresponding to two peaks in density](img/EMPIAR-10076.png){ width=65% style="display: block; margin-left: auto; margin-right: auto;" }
![The volumes of the two stable states are reconstructed, corresponding to two peaks in density [@Gilles2023].](img/EMPIAR-10076.png){ width=65% style="display: block; margin-left: auto; margin-right: auto;" }

### Results of SARS-CoV2 datasets

16 changes: 15 additions & 1 deletion posts/RECOVAR/references.bib
@@ -14,4 +14,18 @@ @article {Gilles2023
elocation-id = {2023.10.28.564422},
year = {2024},
doi = {10.1101/2023.10.28.564422},
publisher = {Cold Spring Harbor Laboratory}}
publisher = {Cold Spring Harbor Laboratory}}

@article {PPR:PPR217218,
Title = {Critical Interactions Between the SARS-CoV-2 Spike Glycoprotein and the Human ACE2 Receptor},
Author = {Taka, Elhan and Yilmaz, Sema and Golcuk, Mert and Kilinc, Ceren and Aktas, Umut and Yildiz, Ahmet and Gur, Mert},
DOI = {10.1101/2020.09.21.305490},
Abstract = {Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infects human cells upon binding of its spike (S) glycoproteins to ACE2 receptors and causes the coronavirus disease 2019 (COVID-19). Therapeutic approaches to prevent SARS-CoV-2 infection are mostly focused on blocking S-ACE2 binding, but critical residues that stabilize this interaction are not well understood. By performing all-atom molecular dynamics (MD) simulations, we identified an extended network of salt bridges, hydrophobic and electrostatic interactions, and hydrogen bonding between the receptor-binding domain (RBD) of the S protein and ACE2. Mutagenesis of these residues on the RBD was not sufficient to destabilize binding but reduced the average work to unbind the S protein from ACE2. In particular, the hydrophobic end of RBD serves as the main anchor site and unbinds last from ACE2 under force. We propose that blocking the hydrophobic surface of RBD via neutralizing antibodies could prove an effective strategy to inhibit S-ACE2 interactions.},
Journal = {bioRxiv},
Year = {2020},
URL = {https://europepmc.org/article/PPR/PPR217218},
}

@article{scheres_2012, title={RELION: Implementation of a Bayesian approach to cryo-EM structure determination}, volume={180}, url={https://pmc.ncbi.nlm.nih.gov/articles/PMC3690530/}, DOI={https://doi.org/10.1016/j.jsb.2012.09.006}, number={3}, journal={Journal of Structural Biology}, publisher={Elsevier BV}, author={Scheres, Sjors H.W.}, year={2012}, month={Dec}, pages={519–530} }

4 changes: 2 additions & 2 deletions posts/extension_to_RECOVAR/index.qmd
@@ -10,7 +10,7 @@ categories: [cryo-EM] # [biology, bioinformatics, theory, etc.]
## Background
In the previous post *Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods*, we reviewed the pipeline of RECOVAR [@Gilles2023] to generate movies showing the heterogeneity of proteins, and discussed its pros, cons and some improvements we could make. RECOVAR is a linear method which borrows the idea from principal component analysis to project complex structure information within cryo-EM data corresponding to each particle onto a lower dimensional space, where a trajectory is computed to illustrate the conformational and compositional changes (see previous post for details).

Compared with other methods, mostly based on deep learning, RECOVAR has several advantages, include but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to a series of atomic models, which is usually the final product structural biologists want in order to obtain atomic interpretations. In this post, we focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.
Compared with other methods, mostly based on deep learning, RECOVAR has several advantages, including but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to a series of atomic models, which is usually the final product structural biologists want in order to obtain atomic interpretations. In this post, we focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.

Before getting to the Methods, I would like to provide some background on the SARS-CoV2 spike protein. The spike protein is a trimer bound to the surface of the SARS-CoV2 virus. It has a so-called receptor-binding domain (RBD) capable of switching between "closed" and "open" states. In the open state, the spike is able to recognize and bind to angiotensin-converting enzyme 2 (ACE2), an enzyme widely present on the membranes of cells in the respiratory system, heart, intestines, testis and kidney [@hikmet_2020]. Binding to ACE2 helps the virus dock on target cells and initiate the invasion and infection of those cells. The spike is therefore often the major target for antibody development. Previous research mainly focused on developing drugs that neutralize the RBD in the open state. However, as mentioned above, the spike can switch to the closed state, in which an antibody targeting the open RBD can no longer access it, making such drugs less effective. Motivated by recent progress in the heterogeneity analysis of proteins, researchers now focus on conformational changes rather than a single homogeneous state. Developing drugs that block the shape change of the spike is considered a potentially more efficient way to neutralize viruses. This is why it is important to have a reliable pipeline for generating movies that show the conformational changes of spike proteins.

@@ -101,7 +101,7 @@ Most conformational changes in this series occur in the RBD region, with 7V7S un
As in the previous test, Algorithm1 performs better than Algorithm2 in fitting all the maps in the series. Compared with fitting to maps generated from 7V7Q starting from the "true" 7V7R, initializing the model with the fitted 7V7R from the previous step does not lead to a significant increase in the RMSD of the fitted 7V7Q. Some white regions with medium RMSD are shared by the three fitted models, but the RMSD of these regions does not increase. There is a part with high RMSD in the left region of 7V7S, the last structure in the series, but the error does not seem to have accumulated from the previous fitting, since the RMSD of this region in the previous fit is very low.

## Discussion
In this project we proposed MPPC as an alternative approach for computing paths. Although this a method can find paths in higher dimensions very quickly, it is more sensitive to outliers. One way to address this issue is to iteratively remove points that are far from the curve and then refit the curve. Another feature of MPPC is that it does not take starting and ending points as input. This can be either an advantage or a disadvantage, depending on the objective. MPPC works well if the goal is to study the conformational change trajectory in the entire space. However, if we are more interested in how proteins transition between two specific states, MPPC may output a path that does not even pass through these two states. On the other hand, the movies generated from trajectories found by MPPC in higher dimensions do capture more changes in shape, which helps discover rare conformations.
In this project we proposed MPPC as an alternative approach for computing paths. Although this method can find paths in higher dimensions very quickly, it is more sensitive to outliers. One way to address this issue is to iteratively remove points that are far from the curve and then refit the curve. Another feature of MPPC is that it does not take starting and ending points as input. This can be either an advantage or a disadvantage, depending on the objective. MPPC works well if the goal is to study the conformational change trajectory in the entire space. However, if we are more interested in how proteins transition between two specific states, MPPC may output a path that does not even pass through these two states. On the other hand, the movies generated from trajectories found by MPPC in higher dimensions do capture more changes in shape, which helps discover rare conformations.
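
The outlier-trimming idea mentioned above (drop points far from the curve, then refit) can be sketched as a simple loop. This is a hedged illustration, not our MPPC code: `fit_line_curve` is a stand-in that samples the first principal axis instead of running MPPC, and the names `trim_and_refit`, `n_rounds` and `max_dist` are hypothetical. Any function that maps an `(n, d)` point array to an `(m, d)` array of curve samples could be passed as `fit_curve`.

```python
import numpy as np

def fit_line_curve(points, n_samples=50):
    """Stand-in curve fit (NOT MPPC): sample the first principal axis."""
    center = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - center, full_matrices=False)
    direction = vt[0]
    t = (points - center) @ direction
    ts = np.linspace(t.min(), t.max(), n_samples)
    return center + ts[:, None] * direction[None, :]

def trim_and_refit(points, fit_curve=fit_line_curve, n_rounds=3, max_dist=1.0):
    """Alternate between fitting a curve and discarding far-away points.

    Returns the final curve samples and a boolean mask of the kept points.
    """
    kept = np.ones(len(points), dtype=bool)
    curve = fit_curve(points)
    for _ in range(n_rounds):
        # Distance from every point to its nearest curve sample.
        diffs = points[:, None, :] - curve[None, :, :]
        dist = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
        new_kept = dist <= max_dist
        # Stop when nothing changes or too few points would remain.
        if new_kept.sum() < 2 or np.array_equal(new_kept, kept):
            break
        kept = new_kept
        curve = fit_curve(points[kept])
    return curve, kept
```

The fixed distance threshold is the simplest possible choice; a quantile-based cutoff may be more robust when the scale of the embedding is unknown. The pairwise distance array grows with the number of points times the number of curve samples, so large particle sets would need batching.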

One problem that arises in many datasets like the one we tested is that the output path contains both conformational and compositional heterogeneity. In the movies of the spike we can see ACE2 suddenly appear or disappear at the top of the lifted RBD region. This behavior is essential, since we also want the algorithm to discover compositional heterogeneity, but it causes trouble for atomic model fitting. In the conventional pipeline, this problem is addressed with discrete 3D classification to separate particles with different compositions, which may not be very accurate on complex datasets with both compositional and conformational heterogeneity. In fact, 3D classification in cryoSPARC fails to distinguish particles with and without ACE2 on our spike protein dataset when no templates are provided. Here we may instead want to leverage RECOVAR and classify particles directly in the continuous latent space. One potential approach is to segment latent space based on the mass of the volume associated with each embedding. This approach may not work when the compositional difference does not change the mass, but as long as the compositional heterogeneity leads to a mass difference larger than the noise (as with SARS-CoV2 spike + ACE2 in our case), it should work. We checked the feasibility of this approach by computing the mass of the density maps over time in a movie output by RECOVAR from our SARS-CoV2 data, as follows:
