adding notebooks for day3 (#4)

* [day2] feature selection with VAMP using the pentapeptide * [day2] TPT notebooks * [day2] added MSM theory slides * [day2] added MSM analysis notebook * [day2] remove junk * [day3] HMM material added * [day3] thermo notebooks * [day3] thermo data * remove obsolete material * moving/removing HMM material * minor changes * reorganizing MEMM material * fixing file permissions * amended MEMM_0[1-3] notebooks * removing obsolete files * adding MEMM exercse notebook
markovmodel · Feb 21, 2018 · 2dcdfde · 2dcdfde
1 parent 0ad255b
commit 2dcdfde
Show file tree

Hide file tree

Showing 7 changed files with 2,550 additions and 0 deletions.
diff --git a/notebooks/HMM_01_exercise.ipynb b/notebooks/HMM_01_exercise.ipynb
@@ -0,0 +1,228 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# MSMs versus HMMs\n",
+    "\n",
+    "In this notebook we will study how to select the lag time of Markov state models or Hidden Markov models, and how the models can be validated in order to decide whether they can be used in order to make reliable predictions about the long-term kinetics of the molecular system studied.\n",
+    "\n",
+    "We start with a few general imports and settings"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%matplotlib inline\n",
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib as mpl\n",
+    "from pyemma import msm\n",
+    "import numpy as np\n",
+    "import pyemma.plots as mplt\n",
+    "mpl.rcParams.update({'font.size': 14})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load double well data\n",
+    "import pyemma.datasets\n",
+    "double_well_data = pyemma.datasets.load_2well_discrete()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Draw double well and discretization\n",
+    "-------\n",
+    "We will use simulation data generated for a double-well potential. This data is part of the standard datasets in PyEMMA. In this notebook, we compare two discretizations: An excellent discretization with six states ('good discretization'), and a poor two-state discretization where the dividing surface is far away from the transition state ('bad discretization').\n",
+    "\n",
+    "Let's look at the double well potential and these two discretizations:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mpl.rcParams.update({'font.size': 10})\n",
+    "fig, axis = plt.subplots(1, 2, figsize=(10, 2))\n",
+    "P = double_well_data.transition_matrix\n",
+    "mu = msm.markov_model(P).stationary_distribution\n",
+    "E = -np.log(mu)\n",
+    "# plot 1\n",
+    "i = 0\n",
+    "axis[i].set_title('good discretization')\n",
+    "axis[i].plot(E - 2.0, linewidth=2, color='black')\n",
+    "axis[i].fill_between(range(len(E)), np.zeros(len(E)), E - 2.0, color='lightgrey')\n",
+    "axis[i].set_xlim(20, 80); axis[i].set_xlabel('x'); axis[i].set_ylim(0,8); axis[i].set_ylabel('free energy / kT')\n",
+    "for b in [40, 45, 50, 55, 60]:\n",
+    "    axis[i].plot([b, b], [0, 8], linewidth=2, linestyle='dashed', color='black')\n",
+    "# plot 2\n",
+    "i = 1\n",
+    "axis[i].set_title('bad discretization')\n",
+    "axis[i].plot(E - 2.0, linewidth=2, color='black')\n",
+    "axis[i].fill_between(range(len(E)), np.zeros(len(E)), E - 2.0, color='lightgrey')\n",
+    "axis[i].set_xlim(20, 80); axis[i].set_xlabel('x'); axis[i].set_ylim(0, 8); axis[i].set_ylabel('free energy / kT')\n",
+    "axis[i].set_ylabel(''); axis[i].yaxis.set_ticklabels([]); \n",
+    "axis[i].plot([40, 40], [0, 8], linewidth=2, linestyle='dashed', color='black')\n",
+    "#\n",
+    "#savefig('figs/fig_selval_ab.png', bbox_inches='tight')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Timescales: MSM\n",
+    "-----"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First we want to build Markov state models. Since the discretization is given, the only parameter that still needs to be selected is the lag time. Both the discretization and the lag time are crucial for the approximation quality of a Markov state model [1,2].\n",
+    "\n",
+    "One common approach to select a lagtime is to compute the relaxation timescales implied by the estimated Markov state model [3], shortly called implied timescales: \n",
+    "\n",
+    "$    t_i = -\\tau / \\ln | \\lambda_i (\\tau)| $\n",
+    "\n",
+    "where $\\lambda_i(\\tau)$ are the eigenvalues of the Markov state model estimated at lag time $\\tau$. As the relaxation timescales are physical properties of the system the implied timescales should be independent of the lag time tau. In practice, this will not be fulfilled for short lag times due to discretization errors [2]. It will neither be fulfilled for very long lag times exceeding the implied timescales as the Markov model eigenvalues are then essentially resulting from the projection on random vectors [4]."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "its_good_bmsm = msm.timescales_msm([double_well_data.dtraj_T100K_dt10_n6good], lags = 100, errors='bayes')\n",
+    "its_bad_bmsm = msm.timescales_msm([double_well_data.dtraj_T100K_dt10_n2bad], lags = 100, errors='bayes')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Plotting these implied timescales already show that the good discretization converges at a lag time of about 20-40 steps to a timescale somewhat about 300, while the bad discretization doesn't appear to converge. According to the variational principle of conformation dynamics [5], we know in principle that the model with longer timescales are preferable, so at this point we might select the first discretization and a lag time of 20-40 steps..."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that the uncertainties (by default two sigma's = 95% confidence are shown) are small compared to the tau-dependent changes of the timescales in the bad discretization and we can reject that discretization even without knowing about the good discretization. The good discretization has a timescale that is indistinguishable from a constant af lag times of about 30 steps or longer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mpl.rcParams.update({'font.size': 10})\n",
+    "fig, axis = plt.subplots(1, 2, figsize=(10, 2))\n",
+    "#\n",
+    "mplt.plot_implied_timescales(its_good_bmsm, ax=axis[0], ylog=False)\n",
+    "axis[0].set_ylim(0, 500); \n",
+    "axis[0].set_title('good discretization')\n",
+    "#\n",
+    "mplt.plot_implied_timescales(its_bad_bmsm, ax=axis[1], ylog=False)\n",
+    "axis[1].set_ylim(0, 500); axis[1].set_ylabel(''); axis[1].yaxis.set_ticklabels([]);\n",
+    "axis[1].set_title('bad discretization')\n",
+    "#\n",
+    "#savefig('figs/fig_selval_cd.png', bbox_inches='tight')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Hidden Markov state models\n",
+    "------"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's play with Hidden Markov state models. Discrete Hidden Markov models (HMMs) are describe in [7]. For metastable systems such as this one, discrete HMMs can be shown to be excellent approximations to PMMs, that are exact descriptions of the non-Markovian dynamics on discretized state spaces for metastable dynamics [8]. We estimate the HMM as described in [8], which involves to first estimate a coarse-grained Markov state model and then running a HMM estimation on top of it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "1. Compute the implied timescales for the same lag times as above, but using HMSMs. Check out the documentation of the msm package.\n",
+    "2. Compute the implied timescales with Bayesian errors. We do this by using the algorithms described in [9].\n",
+    "3. Estimate a Bayesian HMSM at lag time of 5 steps for both discretizations\n",
+    "4. Conduct a Chapman-Kolmogorov Test for both models. What do you conclude?\n",
+    "5. Analyze the transition matrix, the timescales and the observation probabilities of both models. Try to explain your observations\n",
+    "6. Adapt the lag time in the bad discretization so as to get a successful HMM model. What can be said about this model in comparison to an MSM?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "References\n",
+    "---------\n",
+    "[1] Prinz, J.-H., H. Wu, M. Sarich, B. G. Keller, M. Senne, M. Held, J. D. Chodera, Ch. Schütte and F. Noé: Markov models of molecular kinetics: Generation and Validation. J. Chem. Phys. 134, 174105 (2011)\n",
+    "\n",
+    "[2] Sarich, M., F. Noé, Ch. Schütte: On the Approximation Quality of Markov State Models. Multiscale Model. Simul. 8, 1154-1177 (2010)\n",
+    "\n",
+    "[3] Swope, W. C., J. W. Pitera and F. Suits: Describing protein folding kinetics by molecular dynamics simulations: 1. Theory, J. Phys. Chem. B. 108, 6571-6581 (2004)\n",
+    "\n",
+    "[4] Beauchamp, K. A., R. McGibbon, Y. S. Lin and V. S. Pande: Simple few-state models reveal hidden complexity in protein folding. Proc. Natl. Acad. Sci. USA 109, 17807-17813 (2012)\n",
+    "\n",
+    "[5] Noé, F. and F. Nüske: A variational approach to modeling slow processes in stochastic dynamical systems. SIAM Multiscale Model. Simul. 11. 635-655 (2013).\n",
+    "\n",
+    "[6] Trendelkamp-Schroer, B., H. Wu, F. Paul and F. Noé: Estimation and uncertainty of reversible Markov models. arxiv.org/pdf/1507.05990 (2015)\n",
+    "\n",
+    "[7] Rabiner, L. R.: A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77, 257--286 (1989)\n",
+    "\n",
+    "[8] Noé, F., H. Wu, J.-H. Prinz and N. Plattner, N.: Projected and Hidden Markov Models for calculating kinetics and metastable states of complex molecules. J. Chem. Phys. 139, 184114 (2013)\n",
+    "\n",
+    "[9] Chodera, J. D., P. Elms, F. Noé, B. Keller, C. M. Kaiser, A. Ewall-Wice, S. Marqusee, C. Bustamante, N. Singhal Hinrichs: Bayesian hidden Markov model analysis of single-molecule force spectroscopy: Characterizing kinetics under measurement uncertainty. http://arxiv.org/abs/1108.1430."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}