From 4b344d1c5b7182988f7a839bda30903a9e18c399 Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Wed, 18 Dec 2024 04:44:30 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- index.html | 102 +- listings.json | 1 + posts/MATH-612/index.html | 4 +- posts/contour-analysis/data/data-binary.png | Bin 0 -> 47230 bytes .../data/data-closed-contours.png | Bin 0 -> 2673830 bytes posts/contour-analysis/data/data-contours.png | Bin 0 -> 2678954 bytes posts/contour-analysis/data/data-denoised.png | Bin 0 -> 2073121 bytes .../contour-analysis/data/data-grayscale.png | Bin 0 -> 776228 bytes posts/contour-analysis/data/data.png | Bin 0 -> 2089240 bytes posts/contour-analysis/data/segmentation.gif | Bin 0 -> 112347 bytes posts/contour-analysis/final-post.html | 861 ++++++++++ search.json | 1443 +++++++++-------- sitemap.xml | 132 +- 14 files changed, 1748 insertions(+), 797 deletions(-) create mode 100644 posts/contour-analysis/data/data-binary.png create mode 100644 posts/contour-analysis/data/data-closed-contours.png create mode 100644 posts/contour-analysis/data/data-contours.png create mode 100644 posts/contour-analysis/data/data-denoised.png create mode 100644 posts/contour-analysis/data/data-grayscale.png create mode 100644 posts/contour-analysis/data/data.png create mode 100644 posts/contour-analysis/data/segmentation.gif create mode 100644 posts/contour-analysis/final-post.html diff --git a/.nojekyll b/.nojekyll index 6456917..b9840fe 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -2f4e286b \ No newline at end of file +a12db627 \ No newline at end of file diff --git a/index.html b/index.html index ae6a964..0deb59b 100644 --- a/index.html +++ b/index.html @@ -193,7 +193,7 @@

Biological shape analysis (under construction)

+
Categories
All (32)
AFM (2)
Benamou-Brenier's Formulation (2)
Cell Migration (1)
Cell Morphology (1)
Differential Geometry (1)
Graph theory (1)
Kantorovich's Formulation (1)
MATH 612 (1)
MDS (1)
Math 612D (1)
Monge's Problem (1)
Vascular Networks (1)
agriculture (1)
automatic differentiation (1)
bioinformatics (10)
biology (12)
biomedical engineering (1)
cell morphology (1)
cryo-EM (5)
cryo-ET (1)
cryo-em (2)
example (1)
landscape-analysis (1)
mathematics (1)
neuroscience (1)
optimal transport (2)
pytorch (1)
ribosome (3)
shape morphing (2)
theory (4)
@@ -207,7 +207,7 @@
Categories
-
+
+
+
+

+

+

+
+ +
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
@@ -1118,7 +1154,7 @@

-
+
-
+
-
+
-
+
-
+
-
+

diff --git a/listings.json b/listings.json index 6d28f55..e85623b 100644 --- a/listings.json +++ b/listings.json @@ -3,6 +3,7 @@ "listing": "/index.html", "items": [ "/posts/MATH-612/index.html", + "/posts/contour-analysis/final-post.html", "/posts/ImageMorphing/OT4DiseaseProgression2.html", "/posts/Embryonic-Shape/index.html", "/posts/Farm-Shape-Analysis/index.html", diff --git a/posts/MATH-612/index.html b/posts/MATH-612/index.html index 7e31b94..9cb2ac4 100644 --- a/posts/MATH-612/index.html +++ b/posts/MATH-612/index.html @@ -6,7 +6,7 @@ - + Welcome to MATH 612 – bioshape-analysis + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

An analysis and segmentation of contours in AFM imaging data

+
+
biology
+
AFM
+
+
+
+ + +
+ +
+
Author
+ +
+ +
+
Published
+
+

December 17, 2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + +
+

Context and motivation

+

Segmenting the individual pieces in an AFM image lets us gather information about their shape, which can be a defining characteristic of certain biological objects. Analyzing an image piece by piece is also usually easier, and it lets us iterate over the pieces when we want to analyze properties that are not necessarily related to shape.

+

Although the work done in this project is applicable to any AFM image, one of my main goals is to detect R-loops in such images. Further information about this topic can be found in this previous blog post. The unedited AFM images in this blog post were captured by the Pyne Lab.

+
+
+

Preparations before analysis

+

For image denoising and binarization we will use the OpenCV library. Images will be loaded into NumPy arrays.

+
+
import cv2
+import numpy as np
+
+

Background noise in images is problematic for edge detection algorithms, many of which rely on counting pixels in a neighbourhood with similar color values. When noise is present, we are more likely to get disconnected edges. The most common workaround is Gaussian blurring, which replaces each pixel with a weighted average of a square neighbourhood of predetermined size. This smooths the image at the cost of some detail and precision.
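To make the averaging idea concrete, here is a minimal NumPy sketch of a Gaussian blur; OpenCV's cv2.GaussianBlur does the same job much faster, and the kernel size and sigma below are illustrative choices, not values used elsewhere in this post:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # Build a normalized 2-D Gaussian kernel.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def blur(img, size=5, sigma=1.0):
    # Each output pixel is a weighted average of the square
    # neighbourhood around it (edge pixels are replicated).
    k = gaussian_kernel(size, sigma)
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (padded[i:i + size, j:j + size] * k).sum()
    return out
```

A larger sigma smooths more aggressively, which is exactly the detail-versus-noise trade-off described above.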

+

We will use an improved version of this idea called non-local means denoising (Buades, Coll, and Morel 2011). Instead of looking only at a pixel's immediate surroundings, non-local means finds similar patches across the entire image and averages over all of them.

+
+
+
+

+
Figure 1, An AFM image of DNA fragments (picture by the Pyne lab)
+
+
+
+
+
src = cv2.imread("data/data.png", cv2.IMREAD_COLOR)
+
+# filter strength for luminance component = 10
+# filter strength for color components = 10
+# templateWindowSize = 7 (for computing weights)
+# searchWindowSize = 21 (for computing averages)
+src = cv2.fastNlMeansDenoisingColored(src,None,10,10,7,21)
+
+cv2.imwrite("data/data-denoised.png", src)
+
+
+
+
+

+
Figure 2, The image after denoising
+
+
+
+

Since we are only interested in finding contours in the image, RGB color is not important; in fact, it makes the image harder to analyze. We start by converting the image to grayscale.

+
+
src = cv2.imread("data/data-denoised.png", cv2.IMREAD_COLOR)
+
+src_gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
+
+cv2.imwrite("data/data-grayscale.png", src_gray)
+
+
+
+
+

+
Figure 3, Grayscale version of the image
+
+
+
+

The final step is to binarize the image completely. We are only interested in the parts of the image considered to be DNA matter, which appear brighter (have a higher pixel value) than the background. We apply a threshold to the image: any pixel with a value above 80 is treated as DNA matter and mapped to a white pixel, and everything else is mapped to a black pixel.

+
+
src_gray = cv2.imread("data/data-grayscale.png", cv2.IMREAD_GRAYSCALE)
+
+# threshold: 80
+# max_value: 255
+# method: THRESH_BINARY
+ret,src_binary = cv2.threshold(src_gray,80,255,cv2.THRESH_BINARY)
+
+cv2.imwrite("data/data-binary.png", src_binary)
+
+
+
+
+

+
Figure 4, The binarized image after the thresholding
+
+
+
+
+
+

Finding contours

+

We will make use of the findContours function in OpenCV with the retrieval mode RETR_TREE, which builds a full tree of nested contours. For our purposes, a contour is just a continuous set of points, but its position is also important: a shape can be located inside another shape, or it might be connected to some other shape, and both are useful information.

+

We consider an outer contour a parent and an inner one a child. findContours returns a multi-dimensional array encoding these parent and child relations for every contour in the image.
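Concretely, each row of the hierarchy array returned with RETR_TREE has the form [next, previous, first_child, parent], where -1 means "none". The following hand-built example (the hierarchy values are synthetic, not taken from the AFM image) shows how the sibling links chain the top-level contours together:

```python
# Each row: [next_sibling, prev_sibling, first_child, parent]; -1 = none.
# Synthetic example: contours 0 and 2 are outermost, contour 1 sits inside 0.
hierarchy = [
    [2, -1, 1, -1],   # contour 0: next sibling is 2, first child is 1
    [-1, -1, -1, 0],  # contour 1: child of contour 0
    [-1, 0, -1, -1],  # contour 2: last outermost contour, no children
]

def outermost(h):
    # Start at the top-level contour with no previous sibling,
    # then follow the next-sibling links until they run out.
    i = next(k for k, row in enumerate(h) if row[1] == -1 and row[3] == -1)
    found = [i]
    while h[i][0] != -1:
        i = h[i][0]
        found.append(i)
    return found

print(outermost(hierarchy))  # [0, 2]
```

The same sibling walk is what the segmentation code below performs on the real hierarchy returned by findContours.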

+

After finding contours from the binarized image, we draw them on top of the original AFM image we initially started with.

+
+
from random import randint

src = cv2.imread("data/data.png", cv2.IMREAD_COLOR)
+src_binary = cv2.imread("data/data-binary.png", cv2.IMREAD_UNCHANGED)
+
+contours, hierarchy = cv2.findContours(src_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
+hierarchy = hierarchy[0]
+
+for i,c in enumerate(contours):
+    # omit very small contours on the background
+    if (cv2.arcLength(c, True) < 75):
+        continue
+    color = (randint(0,255), randint(0,255), randint(0,255))
+    cv2.drawContours(src, contours, i, color, 2)
+
+cv2.imwrite("data/data-contours.png", src)
+
+
+
+
+

+
Figure 5, Each contour is highlighted with a different color
+
+
+
+
+
+

Segmentation

+

The hierarchy tree returned by findContours lets us iterate over any piece or any specific nesting level. The following code draws only the outermost contours.

+
+
contours, hierarchy = cv2.findContours(src_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
+hierarchy = hierarchy[0]
+
+# create an all-black image the same size as the input
+background = np.zeros((src_binary.shape[0], src_binary.shape[1], 3), dtype=np.uint8)
+
+for i,c in enumerate(hierarchy):
+    # find the first outermost contour (no previous sibling and no parent)
+    if(hierarchy[i][1] == -1 and hierarchy[i][3] == -1):
+        current = i
+    else:
+        continue
+
+    # after we find it, draw all the other outermost contours in the same level
+    while(current != -1):
+        # omit very small contours on the background
+        if (cv2.arcLength(contours[current], True) >= 75):
+            cv2.drawContours(background, contours, current, (255,255,255), 2)
+        # point to the next sibling
+        current = hierarchy[current][0]
+
+    # after outermost contours are drawn, exit
+    break
+
+
+
+
+

+
Figure 6, Pieces in the binarized AFM image
+
+
+
+
+
+

Detecting closed contours

+

If a shape closes on itself, it encloses a patch of background, so a closed shape will have an outer contour and at least one inner (child) contour. By looking at the child field in the returned hierarchy tree, we can therefore determine whether a contour is open or closed.

+
+
src = cv2.imread("data/data.png", cv2.IMREAD_COLOR)
+src_binary = cv2.imread("data/data-binary.png", cv2.IMREAD_UNCHANGED)
+
+contours, hierarchy = cv2.findContours(src_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
+hierarchy = hierarchy[0]
+
+for i,c in enumerate(hierarchy):
+    # find the first outermost contour (no previous sibling and no parent)
+    if(hierarchy[i][1] == -1 and hierarchy[i][3] == -1):
+        current = i
+    else:
+        continue
+
+    # after we find it, classify all the outermost contours in the same level
+    while(current != -1):
+        # omit very small contours on the background
+        if (cv2.arcLength(contours[current], True) >= 75):
+            # check whether the contour has a child
+            if hierarchy[current][2] >= 0:
+                cv2.drawContours(src, contours, current, (0, 255, 0), 2)
+            else:
+                cv2.drawContours(src, contours, current, (255, 0, 150), 2)
+        # point to the next sibling
+        current = hierarchy[current][0]
+
+    # after outermost contours are drawn, exit
+    break
+
+cv2.imwrite("data/data-closed-contours.png", src)
+
+
+
+
+

+
Figure 7, Green contours are closed while the magenta ones are not
+
+
+
+
+
+

Future goals

+

This program sometimes gives false positives when there are artificial holes inside the DNA strand: if such a hole is detected as an inner contour, the program considers the contour closed even though it is not.

+

Depending on how bright the picture is, the threshold value needs to be adjusted manually; otherwise, some parts of the DNA will not appear in the binarized image. Automatic threshold selection would be preferable.
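One option for automatic selection (my suggestion, not something implemented in this post) is Otsu's method, which picks the threshold that maximizes the variance between the two pixel classes. A minimal NumPy sketch:

```python
import numpy as np

def otsu_threshold(gray):
    # gray: 2-D uint8 array. Try every threshold t and keep the one
    # that maximizes the between-class variance of the two groups.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0 = hist[:t].sum() / total      # weight of the darker class
        w1 = 1.0 - w0                    # weight of the brighter class
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (np.arange(t) * hist[:t]).sum() / (w0 * total)
        mu1 = (np.arange(t, 256) * hist[t:]).sum() / (w1 * total)
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

OpenCV exposes the same idea through the cv2.THRESH_OTSU flag of cv2.threshold, which would avoid the explicit loop.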

+

Currently, I am using the Python bindings of OpenCV, which is itself written in C++. Heavy operations take a considerable amount of time in Python, so one of my plans is to rewrite this in C++.

+ + + +
+ +

References

+
+Buades, Antoni, Bartomeu Coll, and Jean-Michel Morel. 2011. “Non-Local Means Denoising.” Image Processing On Line 1: 208–12. +
+
+ +
+ + + + + \ No newline at end of file diff --git a/search.json b/search.json index 9d91ac8..7f4810e 100644 --- a/search.json +++ b/search.json @@ -7,536 +7,438 @@ "text": "Introduction\nIn this post we briefly explained Centroidal Voronoi Tessellation (CVT) as a point sampling method and how it rises from the Optimal Transport (OT) theory. Here we will explain the theory behing this method and prove the relation between CVT and OT.\n\n\nVoronoi cells and Centroidal Voronoi Tessellation\nAssume \\((X,d)\\) is a metric space. Given a set of \\(n\\) points \\(a_1,\\dots,a_n \\in X\\), the Voronoi diagram (Aurenhammer 1991) is formed by \\(n\\) cells \\(V_1,\\dots,V_n \\subset X\\) where \\(V_i\\) is defined as \\[V_i = \\{x \\in X | d(x, a_j) \\ge d(x,a_i) \\quad \\forall 1 \\le j \\le n \\},\\] and \\(a_i\\) is called the generator of \\(V_i\\). In addition, for all \\(x \\in X\\), let \\(i(x)\\) denote the index such that \\(x \\in V_{i(x)}\\). Bellow is an example of Voronoi diagrams formed by 20 points in \\([0, 1]^2\\) with \\(l_2\\) (right) and \\(l_1\\) (left) norm.\n\n\n\nAn example of Voronoi diagram formed by \\(l_1\\) (right) and \\(l_2\\) (left) norms.\n\n\nFor the rest of this post we assume \\(X\\) is \\(\\mathbb{R}^2\\) or \\(\\mathbb{R}^3\\) and \\(d\\) is the eclidean distance. For a given probability distribution \\(\\mathbf{p}\\) over \\(X\\), we say \\(a_1,\\dots,a_n \\in X\\) form a Centroidal Voronoi Tessellation if considering the Voronoi cells \\(V_1,\\dots,V_n\\) generated by them, we have \\[a_i = \\int_{V_i}x\\mathbf{p}(x)dx (\\forall 1 \\le i \\le n),\\] i.e., \\(a_i\\) is both the generator and the centroid of \\(V_i\\). 
Below an example of such a tessellation over a square with uniform distribution is illustrated.\n\n\n\nAn example of Centroidal Voronoi Tessellation formed by 5 points on a square with uniform distribution.\n\n\n\n\nSemidiscrete Wasserstein distance\nSemidescrete Wasserstein distance is a variant of Optimal Transport problem, specifically designed for comparing a discrete and a continuous probability distribution. Assume \\((X, d)\\) is a metric space, given a set of weighted points \\(\\mathbf{A} = \\{a_1,\\dots,a_n\\}\\) and weights \\(w_1 + \\dots + w_n = 1\\) we define the distribution \\(\\mathbf{p_A} = \\sum_i w_i \\delta_{a_i}\\), where \\(\\delta_{a_i}\\) is the Dirac delta function located at \\(a_i\\). Like the previouse section we assume \\(\\mathbf{p}\\) is a given distribution over \\(X\\). With this setting the non-regularized semidiscrete Wasserstein distance between \\(\\mathbf{p_A}\\) and \\(\\mathbf{p}\\), denoted \\(\\mathcal{W}(\\mathbf{p},\\mathbf{p_A})\\), is defined as\n\\[\n\\mathcal{W}(\\mathbf{p},\\mathbf{p_A}) = [\\min_{P : X \\times [n] \\rightarrow \\mathbb{R}_+} \\quad \\ \\int_{X} \\sum_{i=1}^n d(x,a_i)P(x,i)dx]^{1/2} \\tag{1}\\] \\[\n\\textrm{s.t.} \\quad \\int_X P(x,i)dx = w_i, \\sum_{i=1}^n P(x,i) = \\mathbf{p}(x).\n\\] Just like the previouse section we assume \\(X = \\mathbb{R}^2\\) or \\(\\mathbb{R}^3\\) and \\(d\\) is the equclidean distance for the rest of this post.\n\n\nThe relation of CVT and Semidiscrete Wasserstein distance\nAssume we are given \\(\\mathbf{p}\\) and we want to find a weighted point set \\(\\mathbf{A}\\) that minimizes \\[\\min_{w_i \\in \\mathbb{R}, \\mathbf{A} \\subset \\mathbb{R}^3} \\mathcal{W}(\\mathbf{p}, \\mathbf{p_A}). \\tag{2}\\] To solve this optimization problem we will prove the following theorem.\nTheorem 1. 
An optimal solution of Equation 2 is a set of points \\(\\mathbf{A} = \\{a_1,\\dots, a_n\\}\\) that forms a CVT over \\(\\mathbf{p}\\), and \\(w_i = \\int_{V_i} \\mathbf{p}(x)dx\\).\nWe will prove this theorem by splitting it into two lemmas.\nLemma 1. Given a probability distribution \\(\\mathbf{p}\\), a set of points \\(\\mathbf{A} = \\{ a_1,\\dots, a_n\\} \\subset \\mathbb{R}^3\\) and the Voronoi diagram associated with \\(\\mathbf{A}\\), the weights defined as \\(w_i = \\int_{V_i} \\mathbf{p}(x)dx\\) and the transport plan \\(P^*\\) defined as \\(P^*(x,i(x)) = \\mathbf{p}(x), P^*(x,j) = 0 \\text{ (for any }j\\ne i(x)\\text{)}\\) solve Equation 1 and Equation 2.\nProof. For a fixed set of points we can combine Equation 1 and Equation 2 and write \\[\\min_{w_i \\in \\mathbb{R}, \\mathbf{A} \\subset \\mathbb{R}^3} \\mathcal{W}(\\mathbf{p}, \\mathbf{p_A}) = \\min_{P : \\mathbb{R}^3 \\times [n] \\rightarrow \\mathbb{R}_+,w_i \\in \\mathbb{R}, \\mathbf{A} \\subset \\mathbb{R}^3} \\quad \\ \\int_{\\mathbb{R}^3} \\sum_{i=1}^n \\lVert x - a_i\\rVert^2 P(x,i)dx]^{1/2} \\tag{3}\\] \\[\\textrm{s.t.} \\quad \\int_{\\mathbb{R}^3} P(x,i)dx = w_i, \\sum_{i=1}^n P(x_0,i) = \\mathbf{p}(x_0)\\] As \\(w_1,\\dots,w_n\\) only appear in the condition and parameters of the optimization problem Equation 3 we can ignore them and assume \\(w_i = \\int_{\\mathbb{R}^3} P(x,i)dx\\) by default. 
This simplifies the problem to \\[\\min_{w_i \\in \\mathbb{R}, \\mathbf{A} \\subset \\mathbb{R}^3} \\mathcal{W}(\\mathbf{p}, \\mathbf{p_A}) =\n[\\min_{P : \\mathbb{R}^3 \\times [n] \\rightarrow \\mathbb{R}_+, \\mathbf{A} \\subset \\mathbb{R}^3} \\quad \\ \\int_{\\mathbb{R}^3} \\sum_i \\lVert x - a_i\\rVert^2 P(x,i)dx]^{1/2} \\tag{4}\\] \\[\\textrm{s.t.} \\quad \\sum_{i=1}^n P(x_0,i) = \\mathbf{p}(x_0)\\] \\[ \\ge [\\min_{\\mathbf{A} \\subset \\mathbb{R}^3} \\int_{\\mathbb{R}^3}\\lVert x - a_{i(x)}\\rVert^2\\mathbf{p}(x)dx]^{1/2}.\\] In the last line, we use the fact that \\(x\\) is in the Voronoi cell of \\(a_{i(x)}\\), i.e., for all \\(i \\in [n]\\) we have \\(\\lVert x-a_i\\rVert \\ge \\lVert x - a_{i(x)}\\rVert\\). Take note that the last line is itself expressing the Optimal Transport problem between \\(\\mathbf{p}\\) and \\(\\mathbf{p}_\\mathbf{A}\\) with a specific choice of \\(P\\) that assigns every point \\(x\\) to \\(i(x)\\). Hence in our optimal solution, \\(P^*\\) assigns all points in \\(V_i\\) to \\(a_i\\) for all \\(i \\in [n]\\), i.e., \\(P^*(x,i(x)) = \\mathbf{p}(x)\\) and \\(P^*(x,j) = 0\\) for any \\(j \\ne i(x)\\).\nLemma 2. Given a region \\(V_i\\) and a fixed transportation plan \\(P\\) with the property that \\(P(x,i)\\) is equal to $ (x)$ for all \\(x \\in V_i\\) and \\(0\\) otherwise, \\(a_i = \\int_{V_i}x\\mathbf{p}(x)dx / \\int_{V_i}\\mathbf{p}(x)dx\\) solves Equation 2.\nProof. Assume we want to minimize the integral \\(\\int_{\\mathbb{R}^3}\\lVert x-a_i\\rVert^2P(x,i)dx\\) by choosing \\(a_i\\). 
First, using the assumption on \\(P\\) we can simplify it as \\[\\int_{\\mathbb{R}^3}\\lVert x-a_i\\rVert^2P(x,i)dx = \\int_{V_i}\\lVert x-a_i \\rVert^2\\mathbf{p}(x)dx\\] \\[= \\int_{V_i}[\\lVert x\\rVert^2 + \\lVert a_i \\rVert^2 - 2\\langle a_i, x \\rangle]\\mathbf{p}(x)dx\\] \\[= \\int_{V_i}\\lVert x \\rVert^2 \\mathbf{p}(x)dx + \\lVert a_i\\rVert^2\\int_{V_i}\\mathbf{p}(x)dx\\] \\[\\quad\\quad-2 \\langle a_i , \\int_{V_i} x\\mathbf{p}(x)dx\\rangle.\\] Also, as this integral is invariant to rigid body transformations, we can assume \\(\\int x\\mathbf{p}(x)dx = 0\\) (after applying an appropriate translation to \\(\\mathbf{A}\\) and \\(\\mathbf{p}\\)). This assumption yields \\[\\int_{\\mathbb{R}^3}\\lVert x-a_i \\rVert^2P(x,i)dx =\\int_{V_i}\\lVert x\\rVert^2 \\mathbf{p}(x)dx + \\lVert a_i \\rVert^2\\int_{V_i}\\mathbf{p}(x)dx.\\] The minimum of this equation is \\(a_i = 0\\), so we conclude that the optimal choice for \\(a_i\\) is the centroid of \\(V_i\\). In other words, \\(a_i = \\int_{V_i}x\\mathbf{p}(x)dx / \\int_{V_i}\\mathbf{p}(x)dx\\).\n\n\n\n\n\nReferences\n\nAurenhammer, Franz. 1991. “Voronoi Diagrams—a Survey of a Fundamental Geometric Data Structure.” ACM Computing Surveys (CSUR) 23 (3): 345–405." 
}, { - "objectID": "posts/sy mds tunnel/index.html", - "href": "posts/sy mds tunnel/index.html", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", + "objectID": "posts/ET/ey.html", + "href": "posts/ET/ey.html", + "title": "Analysis of Eye Tracking Data", "section": "", - "text": "The ribosome exit tunnel is a sub-compartment of the ribosome whose geometry varies significantly across species, potentially affecting the translational dynamics and co-translational folding of nascent polypeptide1.\nAs the recent advances in imaging technologies result in a surge of high-resolution ribosome structures, we are now able to study the tunnel geometric heterogeneity comprehensively across three domains of life: bacteria, archaea and eukaryotes.\nHere, we present some methods for large-scale analysis and comparison of tunnel structures." + "text": "Eye Tracking\n\nEye tracking (ET) is a process by which a device measures the gaze of a participant – with a number of variables that can be captured, such as duration of fixation, re-fixation (go-backs), saccades, blinking, pupillary response. The ‘strong eye-mind hypothesis’ provides the theoretical ground where the underlying assumption is that duration of fixation is a reflection of preference, and that information is processed with immediacy. 
ET also is a non-invasive technique that has recently garnered attention in autism research as a method to elucidate or gather more information about the supposed central cognitive deficit (Flack-Ytter et al., 2013, Senju et al., 2009).\n\nExperimental set up\n\n22 youth (13-17) with high functioning autism and without autism will be recruited into this study.Students will be brought into a quiet room and asked to read a manga comic displayed on a monitor connected to the eye tracking device (Tobii pro eye tracker, provided by Professor Conati’s lab)" }, { - "objectID": "posts/sy mds tunnel/index.html#summary-and-background", - "href": "posts/sy mds tunnel/index.html#summary-and-background", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", + "objectID": "posts/ET/ey.html#eye-tracking-backagroud", + "href": "posts/ET/ey.html#eye-tracking-backagroud", + "title": "Analysis of Eye Tracking Data", "section": "", - "text": "The ribosome exit tunnel is a sub-compartment of the ribosome whose geometry varies significantly across species, potentially affecting the translational dynamics and co-translational folding of nascent polypeptide1.\nAs the recent advances in imaging technologies result in a surge of high-resolution ribosome structures, we are now able to study the tunnel geometric heterogeneity comprehensively across three domains of life: bacteria, archaea and eukaryotes.\nHere, we present some methods for large-scale analysis and comparison of tunnel structures." 
- }, - { - "objectID": "posts/sy mds tunnel/index.html#tunnel-shape", - "href": "posts/sy mds tunnel/index.html#tunnel-shape", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", - "section": "Tunnel Shape", - "text": "Tunnel Shape\nThe ribosome exit tunnel spans from the peptidyl-transferase center (PTC), where amino acids are polymerized onto the growing nascent chain, to the surface of the ribosome.\nTypically, it measures 80-100 Å in length and 10-20 Å in diameter. While the eukaryotic tunnels are, on average, shorter and substantially narrower than prokaryote ones1.\nIn all domains of life, the tunnel features a universally conserved narrow region downstream of the PTC, so-called constriction site. However, the eukaryotic exit tunnel exhibit an additional (second) constriction site due to the modified structure of the surrounding ribosomal proteins.\n\n\n\nIllustration of the tunnel structure of H.sapiens." + "text": "Eye Tracking\n\nEye tracking (ET) is a process by which a device measures the gaze of a participant – with a number of variables that can be captured, such as duration of fixation, re-fixation (go-backs), saccades, blinking, pupillary response. The ‘strong eye-mind hypothesis’ provides the theoretical ground where the underlying assumption is that duration of fixation is a reflection of preference, and that information is processed with immediacy. 
ET also is a non-invasive technique that has recently garnered attention in autism research as a method to elucidate or gather more information about the supposed central cognitive deficit (Flack-Ytter et al., 2013, Senju et al., 2009).\n\nExperimental set up\n\n22 youth (13-17) with high functioning autism and without autism will be recruited into this study.Students will be brought into a quiet room and asked to read a manga comic displayed on a monitor connected to the eye tracking device (Tobii pro eye tracker, provided by Professor Conati’s lab)" }, { - "objectID": "posts/sy mds tunnel/index.html#ribosome-dataset", - "href": "posts/sy mds tunnel/index.html#ribosome-dataset", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", - "section": "Ribosome Dataset", - "text": "Ribosome Dataset\nCryo-EM reconstructions and X-ray crystallography structures of ribosomes were retrived from the Protein Data Bank (https://www.rcsb.org) including 762 structures across 34 species domain.\nThe exit tunnels were extracted from the ribosomes using our developed tunnel-searching pipeline based on the MOLE cavity extraction algorithm developed by Sehnal et al.2." + "objectID": "posts/ET/ey.html#visualisation", + "href": "posts/ET/ey.html#visualisation", + "title": "Analysis of Eye Tracking Data", + "section": "2 Visualisation", + "text": "2 Visualisation\nOne way of visualizing your data in Tobii Pro Lab is by creating Heat maps. Heat maps visualize where a participant’s (or a group of participants’) fixations or gaze data samples were distributed on a still image or a video frame. The distribution of the data is represented with colors.Each sample corresponds to a gaze point from the eye tracker, consistently sampled every 1.6 to 33 milliseconds (depending on the sampling data rate of the eye tracker). When using an I-VT Filter, it will group the raw eye tracking samples into fixations. 
The duration of each fixation depends on the gaze filter used to identify the fixations.\n\n\n\nHeatmap" }, { - "objectID": "posts/sy mds tunnel/index.html#pairwise-distance", - "href": "posts/sy mds tunnel/index.html#pairwise-distance", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", - "section": "Pairwise Distance", - "text": "Pairwise Distance\nTo simplify the geomertic comparisons, we first reduced the tunnel structure into a coordinate set that describes both the centerline trajectory and the tunnel radius at each centerline position,\nWe then applied the pairwise distance metrics developed by Dao Duc et al.1 to compute the geometric similarity between tunnels. More details can be found in the previous work1.\n\n\n\nPairwise comparison of radial varaition plots between H.sapiens and E.coli" + "objectID": "posts/ET/ey.html#features", + "href": "posts/ET/ey.html#features", + "title": "Analysis of Eye Tracking Data", + "section": "3 Features", + "text": "3 Features\n\nData processing of eye tracking recordings\n\nTo run a statistical study on the data recorded, we carried out in two stages data processing. First using Tobio Pro Lab, then the EMADAT package. Following the experiments, the files are processed using Tobii Pro Lab software. We delimited the AOI for each page, manually pointed the gazes points for the 22 participants on the 12 selected pages. Then exported the data for each participant in a tsv format.\nThen EMDAT was used to generate the datasets. Indeed, to extract the gaze features we used EMDAT python 2.7. EMDAT stands for Eye Movement Data Analysis Toolkit, it is an open-source toolkit developed by our group. EMDAT receives three types of input folder: a folder containing the recordings from Tobii in a tsv format, a Segment folder containing the timestamp for the start and end of page reading for each participant, and an AOI folder containing the coordinates and the time spent per participant of each AOI per page. 
We have also automated the writing of the Segments and AOIs folders. Then we run the EMDAT script for each page. EMDAT also validates the quality of the recordings per page, here the parameter has been set to VALIDITY_METHOD = 1 (see documentation). In particular, we found that the quality of the data did not diminish over the course of the recordings.\n\nEye tracking features\n\nUpon following the data processing protocol, we extracted the following features:\n\nnumber of fixation (quantitative feature): The number of fixations denoted by is defined as the total number of fixations recorded over the total duration spent on a page by a participant.\nmean fixation duration (duration feature): The mean fixation duration denoted by is defined as as the average fixation duration during page reading.\nstandard deviation of the relative path angle (spatial feature): The standard deviation of the relative path angle denoted by is defined as as the average fixation duration during page reading.the standard deviation of the relative angle between two successive saccades. This component enables us to capture the consistency of a participant’s gaze pattern. The greater the standard deviation, the more likely the participant is to look across the different areas of a page." }, { - "objectID": "posts/sy mds tunnel/index.html#mds", - "href": "posts/sy mds tunnel/index.html#mds", - "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", - "section": "MDS", - "text": "MDS\nThe Multidimensional Scaling (MDS) method developed by Li et al.3 was applied on the pairwise distance matrix to visualize the geometric similarity of tunnels. 
Each data point represents a single tunnel structure, and the Euclidean distance between data points represents the similarity.\n\n\n\nMDS plot of tunnel structures across prokaryotes and eukaryotes" + "objectID": "posts/ET/ey.html#t-test", + "href": "posts/ET/ey.html#t-test", + "title": "Analysis of Eye Tracking Data", + "section": "4 T-test", + "text": "4 T-test\nFirst, we wondered whether there were any major differences in the way the two groups read. To do this, we compared the two populations along the three axes - quantitative, duration and spatial - defined in the previous section. To quantify these differences, we used a t-test to compare the means of the distributions, and a Kolmogorov-Smirnov test to compare the distributions. Concerning the total number of fixations per page, the two populations seem to have the same characteristics (p-value>0.1 and Cohen’s d=0.2) and to be from the same distribution (two sided K-s test p-value>0.1). However, on the other two criteria, the autistic adolescents had a shorter mean fixation time and a lower standard deviation (p-value<0.05, Cohen’s d > 0.5), and their associated distribution was lower than that of the control population (less K-S test p-value>0.1).\n\n\n\n\n\n\n\n\n\nT-test\nK-S test\n\n\n\n\nNum fixations\nNo statistically significant differences in the mean number of fixation (small effect size, two-sided p-value > 0.1)\nThe distributions of the number of fixations per page look similar across the two populations (KS two-sided p-value > 0.1)\n\n\nMean fixation duration\nND seems to have a shorter mean duration fixation (Negative medium effect size, two-sided p-value < 0.01)\nThe ND mean fixation duration distribution is smaller than the NT mean fixation duration distribution (KS less p-value > 0.1)\n\n\nStandard deviation relative path angle\nND seems to have on average a smaller std (Negative medium effect size, two-sided p-value < 0.01)\nThe ND std relative path angle distribution is smaller than the 
NT std relative path angle distribution (KS less p-value > 0.1)" }, { - "objectID": "posts/point-cloud/pointcloud.html", - "href": "posts/point-cloud/pointcloud.html", - "title": "Point cloud representation of 3D volumes", + "objectID": "posts/HDM/index.html", + "href": "posts/HDM/index.html", + "title": "Horizontal Diffusion Map", "section": "", - "text": "In the context of cryo-EM, many computationally exhaustive methods rely on simpler representations of cryo-EM density maps to overcome their scalability challenges. There are many choices for the form of the simpler representation, such as vectors (Han et al. 2021) or a mixture of Gaussians (Kawabata 2008). In this post, we discuss a format that is probably the simplest and uses a set of points (called a point cloud).\nThis problem can be formulated in a much more general sense rather than cryo-EM. In this sense, we are given a probability distribution over \\(\\mathbb{R}^3\\) and we want to generate a set of 3D points that represent this distribution. The naive approach for finding such a point cloud is to just sample points from the distribution. Although this approach is guaranteed to find a good representation, it needs many points to cover the distribution evenly. Since methods used in this field can be computationally intensive with cubic or higher time complexity, generating a point cloud that covers the given distribution with a smaller point-cloud size leads to a significant improvement in their runtime.\nIn this approach, we present two methods for generating a point cloud from a cryo-EM density map or a distribution in general. The first one is based on the Topological Representing Network (TRN) (Martinetz and Schulten 1994) and the second one combines the usage of the Optimal Transport (OT) (Peyré, Cuturi, et al. 
2019) theory and a computational geometry object named Centroidal Voronoi Tessellation (CVT).\n\n\nFor the sake of simplicity in this post, we assume we are given a primal distribution over \\(\\mathbb{R}^2\\). As an example, we will work on a multivariate Gaussian distribution whose domain is restricted to \\([-1, 1]^2\\). The following code prepares and illustrates the pdf of the example distribution.\n\nimport numpy as np\nimport scipy as scp\nimport matplotlib\nimport matplotlib.pyplot as plt\n\nplt.rcParams[\"figure.figsize\"] = (20,20)\n\n\n\nmean = np.array([0,0])\ncov = np.array([[0.5, 0.25], [0.25, 0.5]])\ndistr = scp.stats.multivariate_normal(cov = cov, mean = mean, seed = 1)\n\n\nfig, ax = plt.subplots(figsize=(8,8))\nim = ax.imshow([[distr.pdf([i/100,j/100]) for i in range(100,-100,-1)] for j in range(-100,100)], extent=[-1, 1, -1, 1])\ncbar = ax.figure.colorbar(im, ax=ax)\nplt.title(\"The pdf of our primal distribution\")\nplt.show()\n\n\n\n\n\n\n\n\nBoth of the methods that we are going to cover are iterative methods relying on an initial sample of points. For generating a point cloud with size \\(n\\), they begin by randomly sampling \\(n\\) points and refining them over iterations. 
We use \\(n=200\\) in our examples.\n\ndef sampler(rvs):\n while True:\n sample = rvs(1)\n if abs(sample[0]) > 1 or abs(sample[1]) > 1:\n continue\n return sample\n\ninitial_samples = []\nwhile len(initial_samples) < 200:\n sample = sampler(distr.rvs)\n initial_samples.append(list(sample))\ninitial_samples = np.array(initial_samples)\n\nl = list(zip(*initial_samples))\nx = list(l[0])\ny = list(l[1])\n\nfig, ax = plt.subplots(figsize=(8,8))\nax.scatter(x, y)\nax.plot((-1,-1), (-1,1), 'k-')\nax.plot((-1,1), (-1,-1), 'k-')\nax.plot((1,1), (1,-1), 'k-')\nax.plot((-1,1), (1,1), 'k-')\nplt.ylim(-1.1,1.1)\nplt.xlim(-1.1,1.1)\nplt.xticks([])\nplt.yticks([])\nplt.show()" }, { "objectID": "posts/point-cloud/pointcloud.html#data", "href": "posts/point-cloud/pointcloud.html#data", "title": "Point cloud representation of 3D volumes", "section": "", "text": "For the sake of simplicity in this post, we assume we are given a primal distribution over \\(\\mathbb{R}^2\\). As an example, we will work on a multivariate Gaussian distribution whose domain is restricted to \\([-1, 1]^2\\). 
The following code prepares and illustrates the pdf of the example distribution.\n\nimport numpy as np\nimport scipy as scp\nimport matplotlib\nimport matplotlib.pyplot as plt\n\nplt.rcParams[\"figure.figsize\"] = (20,20)\n\n\n\nmean = np.array([0,0])\ncov = np.array([[0.5, 0.25], [0.25, 0.5]])\ndistr = scp.stats.multivariate_normal(cov = cov, mean = mean, seed = 1)\n\n\nfig, ax = plt.subplots(figsize=(8,8))\nim = ax.imshow([[distr.pdf([i/100,j/100]) for i in range(100,-100,-1)] for j in range(-100,100)], extent=[-1, 1, -1, 1])\ncbar = ax.figure.colorbar(im, ax=ax)\nplt.title(\"The pdf of our primal distribution\")\nplt.show()\n\n\n\n\n\n\n\n\nBoth of the methods that we are going to cover are iterative methods relying on an initial sample of points. For generating a point cloud with size \\(n\\), they begin by randomly sampling \\(n\\) points and refining it over iterations. We use \\(n=200\\) in our examples.\n\ndef sampler(rvs):\n while True:\n sample = rvs(1)\n if abs(sample[0]) > 1 or abs(sample[1]) > 1:\n continue\n return sample\n\ninitial_samples = []\nwhile len(initial_samples) < 200:\n sample = sampler(distr.rvs)\n initial_samples.append(list(sample))\ninitial_samples = np.array(initial_samples)\n\nl = list(zip(*initial_samples))\nx = list(l[0])\ny = list(l[1])\n\nfig, ax = plt.subplots(figsize=(8,8))\nax.scatter(x, y)\nax.plot((-1,-1), (-1,1), 'k-')\nax.plot((-1,1), (-1,-1), 'k-')\nax.plot((1,1), (1,-1), 'k-')\nax.plot((-1,1), (1,1), 'k-')\nplt.ylim(-1.1,1.1)\nplt.xlim(-1.1,1.1)\nplt.xticks([])\nplt.yticks([])\nplt.show()" + "text": "This post is based on the following references:\n\nShan Shan, Probabilistic Models on Fibre Bundles (https://dukespace.lib.duke.edu/server/api/core/bitstreams/21bc2e06-ee66-4331-83af-115fe9518e80/content)\nTingran Gao, The Diffusion Geometry of Fibre Bundles: Horizontal Diffusion Maps (https://arxiv.org/pdf/1602.02330)" }, { - "objectID": "posts/AlphaShape/index.html", - "href": "posts/AlphaShape/index.html", - "title": 
"Alpha Shapes in 2D and 3D", "section": "", "text": "Alpha shapes are a generalization of the convex hull used in computational geometry. They are particularly useful for understanding the shape of a point cloud in both 2D and 3D spaces. In this document, we will explore alpha shapes in both dimensions using Python.\nWhat is an \\(\\alpha\\) shape? My favorite analogy (reference https://doc.cgal.org/latest/Alpha_shapes_2/index.html):\nImagine you have a huge mass of ice cream in either 2D or 3D, and the points are “hard” chocolate pieces which we would like to avoid. Using one of these round-shaped ice-cream spoons with radius \\(1/\\alpha\\), we carve out all the ice cream without bumping into any of the chocolate pieces. Finally we straighten the round boundaries to obtain the so-called \\(\\alpha\\) shape.\nWhat is the \\(\\alpha\\) parameter? \\(1/\\alpha\\) is the radius of your “carving spoon” and controls the roughness of your boundary. If the radius of the spoon is too small (\\(\\alpha\\to \\infty\\)), all the ice cream can be carved out except the chocolate chips themselves, so eventually all data points become singletons and no information regarding the shape can be revealed. However, choosing a big radius (\\(\\alpha \\approx 0\\)) may not be ideal either because it does not allow carving out anything, so we end up with the convex hull of all data points." }, { "objectID": "posts/HDM/index.html", "href": "posts/HDM/index.html", "title": "Horizontal Diffusion Map", "section": "", "text": "This post is based on the following references:\n\nShan Shan, Probabilistic Models on Fibre Bundles (https://dukespace.lib.duke.edu/server/api/core/bitstreams/21bc2e06-ee66-4331-83af-115fe9518e80/content)\nTingran Gao, The Diffusion Geometry of Fibre Bundles: Horizontal Diffusion Maps (https://arxiv.org/pdf/1602.02330)" }, { "objectID": "posts/HDM/index.html#references", "href": "posts/HDM/index.html#references", "title": "Horizontal Diffusion Map", "section": "", "text": "This post is based on the following references:\n\nShan Shan, Probabilistic Models on Fibre Bundles (https://dukespace.lib.duke.edu/server/api/core/bitstreams/21bc2e06-ee66-4331-83af-115fe9518e80/content)\nTingran Gao, The Diffusion Geometry of Fibre Bundles: Horizontal Diffusion Maps (https://arxiv.org/pdf/1602.02330)" }, { "objectID": "posts/HDM/index.html#introduction", "href": "posts/HDM/index.html#introduction", "title": "Horizontal Diffusion Map", "section": "Introduction", "text": "Introduction\nHorizontal Diffusion Maps are a variant of diffusion maps used in dimensionality reduction and data analysis. They focus on preserving the local structure of data points in a lower-dimensional space by leveraging diffusion processes. 
Here’s a simple overview:\n\nDiffusion Maps Overview\n\nDiffusion Maps: These are a powerful technique in machine learning and data analysis for reducing dimensionality and capturing intrinsic data structures. They are based on the concept of diffusion processes over a graph or data manifold.\nConcept: Imagine a diffusion process where particles spread out over a data set according to some probability distribution. The diffusion map captures the way these particles spread and organizes the data into a lower-dimensional space that retains the local and global structure.\n\nHorizontal Diffusion Maps\n\nPurpose: Horizontal Diffusion Maps specifically aim to capture and visualize the horizontal or local structure of the data manifold. This can be particularly useful when you want to emphasize local relationships while reducing dimensionality.\nDifference from Standard Diffusion Maps: While standard diffusion maps focus on capturing both local and global structures, horizontal diffusion maps emphasize local, horizontal connections among data points. This means they preserve local neighborhoods and horizontal relationships more explicitly." }, { - "objectID": "posts/AlphaShape/index.html#introduction", - "href": "posts/AlphaShape/index.html#introduction", - "title": "Alpha Shapes in 2D and 3D", - "section": "", - "text": "Alpha shapes are a generalization of the convex hull used in computational geometry. They are particularly useful for understanding the shape of a point cloud in both 2D and 3D spaces. In this document, we will explore alpha shapes in both dimensions using Python.\nWhat is \\(\\alpha\\) shape? My favorite analogy (reference https://doc.cgal.org/latest/Alpha_shapes_2/index.html):\nImagine you have a huge mass of ice cream in either 2D or 3D, and the points are “hard” chocolate pieces which we would like to avoid. 
Using one of these round-shaped ice-cream spoons with radius \\(1/\\alpha\\), we carve out all the ice cream without bumping into any of the chocolate pieces. Finally we straighten the round boundaries to obtain the so-called \\(\\alpha\\) shape.\nWhat is the \\(\\alpha\\) parameter? \\(1/\\alpha\\) is the radius of your “carving spoon” and controls the roughness of your boundary. If the radius of the spoon is too small (\\(\\alpha\\to \\infty\\)), all the ice cream can be carved out except the chocolate chips themselves, so eventually all data points become singletons and no information regarding the shape can be revealed. However, choosing a big radius (\\(\\alpha \\approx 0\\)) may not be ideal either because it does not allow carving out anything, so we end up with the convex hull of all data points." }, { "objectID": "posts/HDM/index.html#example-möbius-strip", "href": "posts/HDM/index.html#example-möbius-strip", "title": "Horizontal Diffusion Map", "section": "Example: Möbius Strip", "text": "Example: Möbius Strip\nIn this section, we show how the horizontal diffusion map works on the Möbius strip parameterized by:\n\\[\nx = (1 + v\\cos(\\frac{u}{2}))\\cos(u),\\quad y = (1 + v\\cos(\\frac{u}{2}))\\sin(u),\\quad z = v\\sin(\\frac{u}{2}),\n\\] for \\(u\\in [0,2\\pi)\\) and \\(v \\in [-1,1]\\).\nIt is one of the simplest nontrivial fibre bundles. 
See below for a visualization:\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom mpl_toolkits.mplot3d import Axes3D\n\ndef mobius_strip(u, v):\n \"\"\"\n Generate coordinates for a Möbius strip.\n \n Parameters:\n - u: Parameter that varies from 0 to 2*pi\n - v: Parameter that varies from -1 to 1\n \n Returns:\n - x, y, z: Coordinates of the Möbius strip\n \"\"\"\n # Parameters for the Möbius strip\n radius = 1.0\n width = 1.0\n \n # Compute coordinates\n x = (radius + width * v * np.cos(u / 2)) * np.cos(u)\n y = (radius + width * v * np.cos(u / 2)) * np.sin(u)\n z = width * v * np.sin(u / 2)\n \n return x, y, z\n\ndef plot_mobius_strip():\n u = np.linspace(0, 2 * np.pi, 100)\n v = np.linspace(-1, 1, 10)\n \n u, v = np.meshgrid(u, v)\n x, y, z = mobius_strip(u, v)\n \n fig = plt.figure(figsize=(10, 7))\n ax = fig.add_subplot(111, projection='3d')\n \n # Plot the Möbius strip\n ax.plot_surface(x, y, z, cmap='inferno', edgecolor='none')\n \n # Set labels and title\n ax.set_xlabel('X')\n ax.set_ylabel('Y')\n ax.set_zlabel('Z')\n ax.set_title('Möbius Strip')\n \n plt.show()\n\n# Run the function to plot the Möbius strip\nplot_mobius_strip()\n\n\n\n\n\n\n\n\nNow we generate samples from the surface uniformly by first sampling \\(N_{base}\\) points on the `base manifold’, parameterized by the \\(u\\) component. 
Then we sample \\(N_{fibre}\\) points along each fibre:\n\nN_fibre = 20\nv = np.linspace(-1,1,N_fibre,endpoint=False) #samples on each fibre\nN_base = 50\nu = np.linspace(0,2*np.pi,N_base,endpoint=False) #different objects\n# Here we concatenate all fibres to create the overall object\nV = np.tile(v,len(u))\nU= np.array([num for num in u for _ in range(len(v)) ])\nN = U.shape[0]\n\nHere we visualize the points to see how they are distributed on the manifold:\n\nu, v = np.meshgrid(U,V)\nx, y, z = mobius_strip(u, v)\n \nfig = plt.figure(figsize=(10, 7))\nax = fig.add_subplot(111, projection='3d')\n \n# Plot the Möbius strip\nax.scatter(x, y, z, c=v, s=1)\n \n# Set labels and title\nax.set_xlabel('X')\nax.set_ylabel('Y')\nax.set_zlabel('Z')\nax.set_title('Möbius Strip')\n \nplt.show()\n\n\n\n\n\n\n\n\nLater on, we will go over the horizontal diffusion map and apply it to the data we just created!" }, { - "objectID": "posts/AlphaShape/index.html#d-alpha-shape", - "href": "posts/AlphaShape/index.html#d-alpha-shape", - "title": "Alpha Shapes in 2D and 3D", - "section": "2D Alpha Shape", - "text": "2D Alpha Shape\nTo illustrate alpha shapes in 2D, we’ll use the alphashape library. 
Let’s start by generating a set of random points and computing their alpha shape.\nFirst we create a point cloud:\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport alphashape\nfrom matplotlib.path import Path\nfrom scipy.spatial import ConvexHull\n\ndef generate_flower_shape(num_petals, num_points_per_petal):\n angles = np.linspace(0, 2 * np.pi, num_points_per_petal, endpoint=False)\n r = 1 + 0.5 * np.sin(num_petals * angles)\n \n x = r* np.cos(angles)\n \n y = r * np.sin(angles)\n \n return np.column_stack((x, y))\n\ndef generate_random_points_within_polygon(polygon, num_points):\n \"\"\"Generate random points inside a given polygon.\"\"\"\n min_x, max_x = polygon[:, 0].min(), polygon[:, 0].max()\n min_y, max_y = polygon[:, 1].min(), polygon[:, 1].max()\n \n points = []\n while len(points) < num_points:\n x = np.random.uniform(min_x, max_x)\n y = np.random.uniform(min_y, max_y)\n if Path(polygon).contains_point((x, y)):\n points.append((x, y))\n \n return np.array(points)\n\nplt.figure(figsize=(8, 6))\npoints = generate_flower_shape(num_petals=6, num_points_per_petal=100)\npoints = generate_random_points_within_polygon(points, 1000)\nplt.scatter(points[:, 0], points[:, 1], s=10, color='blue', label='Points')\n\n/Users/wenjunzhao/opt/anaconda3/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning:\n\nA NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3\n\n\n\n\n\n\n\n\nTry running this with \\(\\alpha=0.1\\):\n\n# Create alpha shape\nalpha = 0.1\nalpha_shape = alphashape.alphashape(points, alpha)\n\n# Plot points and alpha shape\nplt.figure(figsize=(8, 6))\nplt.scatter(points[:, 0], points[:, 1], s=10, color='blue', label='Points')\nplt.plot(*alpha_shape.exterior.xy, color='red', lw=2, label='Alpha Shape')\nplt.title('2D Alpha Shape')\nplt.xlabel('X')\nplt.ylabel('Y')\nplt.legend()\nplt.grid(True)\nplt.show()\n\n\n\n\n\n\n\n\nOops, it seems the radius we picked is too big! 
Let’s try a few other choices.\n\nalpha_values = [0.1, 5.0, 10.0, 15.0]\n# Plot the flower shape and alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n # Compute alpha shape\n alpha_shape = alphashape.alphashape(points, alpha)\n \n # Plot the points and the alpha shape\n ax = axes[i]\n #print(alpha_shape.type)\n if alpha_shape.type == 'Polygon':\n ax.plot(*alpha_shape.exterior.xy, color='red', lw=2, label='Alpha Shape')\n ax.scatter(points[:, 0], points[:, 1], color='orange', s=10, label='Point Cloud')\n \n \n \n ax.set_title(f'Alpha Shape with alpha={alpha}')\n ax.legend()\n ax.grid(True)\n\nplt.tight_layout()\nplt.show()\n\n/var/folders/k7/s0t_zwg11h56xb5xp339s5pm0000gp/T/ipykernel_29951/885549844.py:13: ShapelyDeprecationWarning:\n\nThe 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead." + "objectID": "posts/HDM/index.html#horizontal-diffusion-map-hdm", + "href": "posts/HDM/index.html#horizontal-diffusion-map-hdm", + "title": "Horizontal Diffusion Map", + "section": "Horizontal diffusion map (HDM)", + "text": "Horizontal diffusion map (HDM)\nThe first step is to create a kernel matrix. As outlined by the references, two common approaches are:\nHorizontal diffusion kernel: For two data points \\(e=(u,v)\\) and \\(e' = (u',v')\\): \\[\nK_{\\epsilon}(e, e') = \\exp( -(u - u')^2/\\epsilon) \\text{ if }v' = P_{uu'}v,\n\\] and zero otherwise. 
Here \\(P_{uu'}\\) is the map which connects every point from \\(v\\) to its image \\(v'\\), which, for our case, maps \\(v\\) to itself.\n\ndef horizontal_diffusion_kernel(U,V,eps):\n \n N = U.shape[0]\n K = np.zeros((N,N))\n for i in range(N):\n for j in range(N):\n if V[i] == V[j]:# and U[i] != U[j]:\n #print('match')\n K[i,j] = np.exp(-(U[i]-U[j])**2/eps)\n return K\n\neps = 0.2\nK = horizontal_diffusion_kernel(U,V,0.2)\nplt.imshow(K)\nplt.show()\n\n\n\n\n\n\n\n\nAn alternative, soft version of the kernel above is the coupled diffusion kernel: \n\\[\nK_{\\epsilon, \\delta}(e,e') = \\exp( -(u - u')^2/\\epsilon) \\exp( -(v-v')^2/\\delta ).\n\\]\n\ndef coupled_diffusion_kernel(U,V,eps,delta):\n N = U.shape[0]\n K_c = np.zeros((N,N))\n for i in range(N):\n for j in range(N):\n if True:#U[i] != U[j]:\n #print('match')\n K_c[i,j] = np.exp(-(U[i]-U[j])**2/eps) * np.exp( -(V[i]-V[j])**2/delta )\n return K_c\n\neps = .2\ndelta = .01 \nK_c = coupled_diffusion_kernel(U,V,eps,delta) \nplt.imshow(K_c)\nplt.show()\n\n\n\n\n\n\n\n\nAfter we created the kernel matrix, we can then proceed with the regular diffusion map by (1) Create the diffusion operator by normalizing the kernel matrix and computing its eigendecomposition, and (2) extract the diffusion coordinates by using the eigenvectors corresponding to the largest eigenvalues (excluding the trivial eigenvalue) to form the diffusion coordinates.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.linalg import eigh\nfrom sklearn.preprocessing import normalize\n\ndef compute_diffusion_map(kernel_matrix, num_components=2):\n \"\"\"\n Compute the diffusion map from a kernel matrix.\n\n Parameters:\n - kernel_matrix: The kernel matrix (e.g., RBF kernel matrix).\n - num_components: Number of diffusion map dimensions to compute.\n\n Returns:\n - diffusion_coordinates: The 2D diffusion map coordinates.\n \"\"\"\n # Compute the degree matrix\n degree_matrix = np.diag(np.sum(kernel_matrix, axis=1))\n \n # Compute the 
normalized Laplacian matrix\n laplacian = np.linalg.inv(degree_matrix) @ kernel_matrix\n \n # Compute eigenvalues and eigenvectors\n eigvals, eigvecs = eigh(laplacian)\n \n # Sort eigenvalues and eigenvectors\n sorted_indices = np.argsort(eigvals)[::-1]\n eigvals = eigvals[sorted_indices]\n #print(eigvals)\n eigvecs = eigvecs[:, sorted_indices]\n \n # Take the first `num_components` eigenvectors (excluding the first one which is trivial)\n diffusion_coordinates = eigvecs[:, 1:num_components+1] @ np.diag(np.sqrt(eigvals[1:num_components+1]))\n \n return diffusion_coordinates\n\n\ndef plot_diffusion_map(diffusion_coordinates,color):\n \"\"\"\n Plot the 2D diffusion map.\n\n Parameters:\n - diffusion_coordinates: The 2D diffusion map coordinates.\n \"\"\"\n plt.figure(figsize=(8, 6))\n plt.scatter(diffusion_coordinates[:, 0], diffusion_coordinates[:, 1], c=color, s=10, alpha=0.7)\n plt.title('2D Diffusion Map')\n plt.xlabel('Dimension 1')\n plt.ylabel('Dimension 2')\n plt.grid(True)\n plt.show()\n\nNow project the data points into a lower-dimensional space defined by the significant diffusion coordinates. This projection helps in visualizing and analyzing the local structure of the data.\n\n# Compute the diffusion map\neps = 0.2\nK = horizontal_diffusion_kernel(U,V,eps)\ndiffusion_coordinates = compute_diffusion_map( K, num_components=2)\n#print(diffusion_coordinates)\n# Plot the 2D diffusion map, where color represents where they were on the fibre. 
Points that are mapped close together correspond to each other across nearby fibres.\nplot_diffusion_map(diffusion_coordinates,V)\n\n\n\n\n\n\n\n\nSimilarly, we perform the same procedure for the coupled diffusion matrix:\n\n# Compute the diffusion map\neps = 0.2\ndelta = 0.01\nK_c = coupled_diffusion_kernel(U,V,eps,delta)\n\ndiffusion_coordinates = compute_diffusion_map( K_c, num_components=2)\n#print(diffusion_coordinates)\n# Plot the 2D diffusion map\nplot_diffusion_map(diffusion_coordinates,V)\n#plot_diffusion_map(diffusion_coordinates,U)\n\n\n\n\n\n\n\n\nThe points are colored according to their correspondence on all the fibres through component \\(v\\). If two points correspond to each other across different but nearby fibres, they are likely to be neighbors in the visualization above." }, { "objectID": "posts/AlphaShape/index.html#application-of-2d-alpha-shapes-on-reaction-diffusion-equation", "href": "posts/AlphaShape/index.html#application-of-2d-alpha-shapes-on-reaction-diffusion-equation", "title": "Alpha Shapes in 2D and 3D", "section": "Application of 2D alpha shapes on reaction-diffusion equation", "text": "Application of 2D alpha shapes on reaction-diffusion equation\nNow we discuss an application of 2D alpha shapes to quantifying the patterns that arise in reaction-diffusion equations modeling morphogenesis.\nReference: Zhao, Maffa, Sandstede. http://bjornsandstede.com/papers/Data_Driven_Continuation.pdf\nAs an example, let’s consider the Brusselator model in 2D, and below is a simple simulator that generates a snapshot of its solution over the spatial domain. 
The initial condition is random, and patterns start to arise after we evolve the system forward for a short time.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef brusselator_2d_simulation(A, B, Lx=100, Ly=100, Nx=100, Ny=100, dt=0.005, D_u=4, D_v=32, T=20):\n \"\"\"\n Simulate the 2D Brusselator model and return the concentration field u at time T.\n \n Parameters:\n - A: Reaction parameter A\n - B: Reaction parameter B\n - Lx: Domain size in x direction\n - Ly: Domain size in y direction\n - Nx: Number of grid points in x direction\n - Ny: Number of grid points in y direction\n - dt: Time step\n - D_u: Diffusion coefficient for u\n - D_v: Diffusion coefficient for v\n - T: Total simulation time\n \n Returns:\n - u: Concentration field u at time T\n \"\"\"\n \n # Generate random points\n np.random.seed(0) # For reproducibility\n\n # Initialize variables\n dx, dy = Lx / Nx, Ly / Ny\n u = np.random.uniform(size=(Nx, Ny))\n v = np.zeros((Nx, Ny))\n \n \n # Prepare the grid\n x = np.linspace(0, Lx, Nx)\n y = np.linspace(0, Ly, Ny)\n \n # Compute Laplacian\n def laplacian(field):\n return (np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0) +\n np.roll(field, 1, axis=1) + np.roll(field, -1, axis=1) -\n 4 * field) / (dx * dy)\n \n # Time-stepping loop\n num_steps = int(T / dt)\n for _ in range(num_steps):\n # Compute Laplacian\n lap_u = laplacian(u)\n lap_v = laplacian(v)\n \n # Brusselator model equations\n du = D_u * lap_u + A - (B + 1) * u + u**2 * v\n dv = D_v * lap_v + B * u - u**2 * v\n \n # Update fields\n u += du * dt\n v += dv * dt\n \n return u, x, y\n\n# Example usage\nA = 4.75\nB = 11.0\nu_at_T, x, y = brusselator_2d_simulation(A, B)\n\n# Plot the result\nplt.figure(figsize=(8, 8))\nplt.imshow(u_at_T, cmap='viridis', interpolation='bilinear', origin='lower')\nplt.colorbar(label='Concentration of u')\nplt.title(f'Concentration of u at T=100 with A={A}, B={B}')\nplt.xlabel('x')\nplt.ylabel('y')\nplt.grid(True)\nplt.show()\n\n\n\n\n\n\n\n\nNow 
we create a point cloud by thresholding the solution:\n\ndef get_threshold_points(u, threshold=0.7):\n \"\"\"\n Get grid points where the concentration field u exceeds the specified threshold.\n \n Parameters:\n - u: Concentration field\n - threshold: The threshold value as a percentage of the maximum value in u\n \n Returns:\n - coords: Array of grid points where u exceeds the threshold\n \"\"\"\n max_u = np.max(u)\n threshold_value = threshold * max_u\n coords = np.argwhere(u > threshold_value)\n return coords\n\n# Get grid points above 70% of the maximum value\ncoords = get_threshold_points(u_at_T, threshold=0.7)\n# Highlight points above threshold\nx_coords, y_coords = coords[:, 1], coords[:, 0]\nplt.scatter(x_coords, y_coords, color='red', s=20, marker='o', edgecolor='w')\n\n\n\n\n\n\n\n\nAfter we obtain the point cloud, we can now run the alpha shape algorithm on it. As mentioned before, picking a good alpha can be tricky, so let’s try a few alpha values to see which one identifies the boundary best.\n\nalpha_values = [.3, 0.35, 0.5, 1.]\n# Plot the flower shape and alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n # Scatter the plot\n \n # Compute alpha shape\n alpha_shape = alphashape.alphashape(coords, alpha)\n #print(alpha_shape.type)\n # Plot the points and the alpha shape\n plt.subplot(2,2,i+1)\n #ax = axes[i]\n \n if alpha_shape.geom_type == 'GeometryCollection':\n print(alpha_shape)\n for geom in list( alpha_shape.geoms ):\n \n if geom.type == 'Polygon':\n x, y = geom.exterior.xy\n plt.plot(x, y, 'r-')\n elif alpha_shape.geom_type == 'Polygon':\n x, y = alpha_shape.exterior.xy\n plt.plot(x, y, 'r-')\n elif alpha_shape.geom_type == 'MultiPolygon':\n \n alpha_shape = list( alpha_shape.geoms )\n for polygon in alpha_shape:\n x, y = polygon.exterior.xy\n plt.plot(x, y, 'r-')#, label='Alpha Shape')\n plt.scatter(coords[:, 0], coords[:, 1], color='orange', 
s=10, label='Point Cloud')\n \n \n \n plt.title(f'alpha={alpha}')\n #plt.legend()\n #plt.grid(True)\n\nplt.tight_layout()\nplt.show()\n\n\n\n\n\n\n\n\nNow we can study different pattern statistics for these clusters! For example, the roundness of a cluster is defined as \\(4\\pi Area/Perimeter^2\\), which is bounded between zero (stripe) and one (spot). For each cluster, a roundness score value can be computed. The resulting histogram of roundness scores of all clusters will follow a bimodal distribution, with its two peaks corresponding to spots and stripes, respectively.\n\nalpha_values = [.3, 0.4, 0.6, 1.]\n# Plot the flower shape and alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n plt.subplot(2,2,i+1)\n # Compute alpha shape\n alpha_shape = alphashape.alphashape(coords, alpha)\n if alpha_shape.geom_type == 'MultiPolygon':\n # Extract and print the area of each polygon\n areas = [polygon.area for polygon in list(alpha_shape.geoms)]\n perimeters = [polygon.length for polygon in list(alpha_shape.geoms)]\n roundness = [4*np.pi*areas[i]/perimeters[i]**2 for i in range(len(list(alpha_shape.geoms))) ]\n else:\n areas = [ alpha_shape.area ]\n perimeters = [alpha_shape.length]\n roundness = [areas[0]*4*np.pi/perimeters[0]**2]\n plt.hist(roundness,density=True, range=[0,1])\n plt.xlim([0,1])\n plt.title(f'Roundness with alpha={alpha}')\n \n\nplt.tight_layout()\nplt.show()" }, { "objectID": "posts/HDM/index.html#horizontal-base-diffusion-map-hbdm", "href": "posts/HDM/index.html#horizontal-base-diffusion-map-hbdm", "title": "Horizontal Diffusion Map", "section": "Horizontal base diffusion map (HBDM)", "text": "Horizontal base diffusion map (HBDM)\nIn addition to embedding all the data points, the framework also allows for embedding different objects (fibres). 
The new kernel is defined as the Frobenius norm of all entries in the previous kernel matrix that correspond to the two fibres:\n\neps = .2\nK = horizontal_diffusion_kernel(U,V,eps)\nK_base = np.zeros( (N_base,N_base) )\nfor i in range(N_base):\n for j in range(N_base):\n #print( np.ix_( range(N_fibre*(i),N_fibre*(i+1)), range(N_fibre*(j),N_fibre*(j+1)) ) )\n K_base[i,j] = np.linalg.norm( K[ np.ix_( range(N_fibre*(i),N_fibre*(i+1)), range(N_fibre*(j),N_fibre*(j+1)) ) ] ,'fro')\n#plt.imshow(K_base)\n#plt.show()\n\n\n# Compute the diffusion map\ndiffusion_coordinates = compute_diffusion_map( K_base, num_components=2)\n\n# Plot the 2D diffusion map\n\nplot_diffusion_map(diffusion_coordinates, np.sort(list(set(list(U)) ) ) )\n\n\n\n\n\n\n\n\nThe embedded points are colored according to the `ground truth’ \\(u\\). The smooth color transition shows that the embedding uncovers the information of all fibres on the base manifold." }, { - "objectID": "posts/AlphaShape/index.html#d-alpha-shapes", - "href": "posts/AlphaShape/index.html#d-alpha-shapes", - "title": "Alpha Shapes in 2D and 3D", - "section": "3D Alpha shapes", - "text": "3D Alpha shapes\n\nfrom mpl_toolkits.mplot3d import Axes3D\n\ndef plot_torus_with_random_points(R1=1.0, r1=0.3, R2=0.8, r2=0.3, num_points=1000):\n \"\"\"\n Plots a torus with random points filling its volume.\n\n Parameters:\n R (float): Major radius of the torus.\n r (float): Minor radius of the torus.\n num_points (int): Number of random points to generate inside the torus.\n \"\"\"\n \n # Generate random points\n np.random.seed(0) # For reproducibility\n theta = np.random.uniform(0, 2 * np.pi, num_points) # Angle around the major circle\n phi = np.random.uniform(0, 2 * np.pi, num_points) # Angle around the minor circle\n u = np.random.uniform(0, 1, num_points) # Random uniform distribution for radial distance\n \n # Convert uniform distribution to proper volume inside the torus\n u = np.sqrt(u) # To spread points more evenly\n\n # Parametric 
equations for the double torus\n # First torus\n x1 = .5*(R1 + r1 * np.cos(phi)) * np.cos(theta)\n y1 = (R1 + r1 * np.cos(phi)) * np.sin(theta)\n z1 = r1 * np.sin(phi)\n \n # Second torus\n x2 = -1 + .5*(R2 + r2 * np.cos(phi)) * np.cos(theta)\n y2 = (R2 + r2 * np.cos(phi)) * np.sin(theta)\n z2 = r2 * np.sin(phi)# + 2 * (R2 + r2 * np.cos(phi)) * np.sin(theta) # Shifted in z-direction for double torus effect\n\n # Combine points from both tori\n x = np.concatenate([x1, x2])\n y = np.concatenate([y1, y2])\n z = np.concatenate([z1, z2])\n\n \n\n # Plot the torus and the random points\n fig = plt.figure()\n ax = fig.add_subplot(111, projection='3d')\n\n # Plot the random points\n ax.scatter(x, y, z, c='red', s=1, label='Random Points') # Using a small point size for clarity\n\n\n # Add titles and labels\n ax.set_title('Torus with Random Points')\n ax.set_xlabel('X axis')\n ax.set_ylabel('Y axis')\n ax.set_zlabel('Z axis')\n #ax.set_xlim([-1.5,0.5])\n #ax.set_ylim([-0.5,1.5])\n ax.set_zlim([-1.5,1.5])\n ax.legend()\n plt.show()\n return x,y,z\n\n# Example usage\nx, y, z = plot_torus_with_random_points(num_points=2000)\n\n\n\n\n\n\n\n\nThe intuition on picking alpha still holds! 
Let’s first try a big alpha (small radius and refined boundaries) and then a small one (big radius and rough boundaries)\n\nimport alphashape\n\n\nalpha_shape = alphashape.alphashape(np.column_stack((x,y,z)), 5.0)\nalpha_shape.show()\n\n\n\n\n\nalpha_shape = alphashape.alphashape(np.column_stack((x,y,z)), 3.0)\nalpha_shape.show()" + "objectID": "posts/HDM/index.html#applications-in-shape-data", + "href": "posts/HDM/index.html#applications-in-shape-data", + "title": "Horizontal Diffusion Map", + "section": "Applications in shape data", + "text": "Applications in shape data\nThe horizontal diffusion map framework is particularly useful in the two following espects, both demonstrated in Gao et al.:\n\nHorizontal diffusion map (embedding all data points): The embedding automatically suggests a global registration for all fibres that respects a mutual similarity measure.\nHorizontal base diffusion map (embedding all data objects/fibres): Compared to the classical diffusion map without correspondences, the horizontal base diffusion map is more robust to noises and often demonstrate a clearer pattern of clusters." }, { - "objectID": "posts/AlphaShape/index.html#application-of-3d-alpha-shape-protein-structure", - "href": "posts/AlphaShape/index.html#application-of-3d-alpha-shape-protein-structure", - "title": "Alpha Shapes in 2D and 3D", - "section": "Application of 3D alpha shape: protein structure", - "text": "Application of 3D alpha shape: protein structure\nIt would be ideal to find some good data and put them here. To be continued." + "objectID": "posts/principal-curves/principal-curves.html", + "href": "posts/principal-curves/principal-curves.html", + "title": "Trajectory Inference for cryo-EM data using Principal Curves", + "section": "", + "text": "Suppose you run an experiment that involves collecting data points \\(\\{\\omega_1, \\ldots, \\omega_M\\} \\subseteq \\Omega \\subseteq \\mathbb R^d\\). 
As an example, suppose that \\(\\Omega\\) is the hexagonal domain below, and the \\(\\omega_i\\) represent positions of \\(M\\) independent, non-interacting particles in \\(\\Omega\\) (all collected simultaneously).\n\n\n\nsome sample points\n\n\nThe question is: Just from the position data \\(\\{\\omega_1, \\ldots, \\omega_M\\}\\) we have collected, can we determine 1) Whether the particles are all evolving according to the same dynamics, and 2) If so, what those dynamics are? As a sanity check, we can first try superimposing all of the data in one plot.\n\n\n\nsome sample points\n\n\nFrom the image above, there appears to be no discernable structure. But as we increase our number of samples \\(M\\), a picture starts to emerge.\n\n\n\nsome sample points\n\n\nand again:\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\nIn the limit as \\(M \\to \\infty\\), we might obtain a picture like the following:\n\n\n\nsome sample points\n\n\nWe see that once \\(M\\) is large, it becomes (visually) clear that the particles are indeed evolving according to the same time-dependent function \\(f : \\mathbb R \\to \\Omega\\), but with 1) Small noise in the initial conditions, and 2) Different initial “offsets” \\(t_i\\) along \\(f(t)\\).\nTo expand on (1) a bit more: Note that in the figure above, there’s a fairly-clear “starting” point where the dark grey lines are all clumped together. Let’s say that this represents \\(f(0)\\). Then we see that the trajectories we observe (call them \\(f_i\\)) appear to look like they’re governed by the same principles, but with \\[f_i(0) = f(0) + \\text{ noise} \\qquad \\text{and} \\qquad f_i'(0) = f'(0) + \\text{ noise}.\\] Together with (2), we see that our observations \\(\\omega_i\\) are really samples from \\(f_i(t_i)\\). 
The question is how we may use these samples to recover \\(f(t)\\).\nLet us summarize the information so far.\n\n\n\nSuppose you have a time-dependent process modeled by some function \\(f : [0,T] \\to \\Omega\\), where \\(\\Omega \\subseteq \\mathbb R^d\\) (or, more generally, an abstract metric space). Then, given observations \\[\\omega_i = f_i(t_i)\\] where the \\((f_i, t_i)\\) are hidden, how can we estimate \\(f(t)\\)?\n\n\nNote that the problem above might at first look very similar to a regression problem, where one attempts to use data points \\((X_i, Y_i)\\) to determine a hidden model \\(f\\) (subject to some noise \\(\\varepsilon_i\\)) giving \\[Y_i = f(X_i) + \\varepsilon_i.\\] If we let \\(f_i(X) = f(X) + \\varepsilon_i\\), then we have an almost-identical setup \\[Y_i = f_i(X_i).\\] The key distinction is that in regression, we assume our data-collection procedure gives us pairs \\((X_i, Y_i)\\), whereas in the trajectory inference problem our data consists of only the \\(Y_i\\) and we must infer the \\(X_i\\) on our own. Note in particular that we have continuum many choices for \\(X_i\\). This ends up massively complicating the problem: If we try the trajectory-inference analogue of regularized least squares, the lack of an a priori coupling between \\(X_i\\) and \\(Y_i\\) means we lose the convexity structure and must use both different theoretical analysis and different numerical algorithms.\nNevertheless, on a cosmetic level, we may formulate the problems with similar-looking equations. This brings us to regularized principal curves." 
}, { - "objectID": "posts/ribosome-landmarks/index.html", - "href": "posts/ribosome-landmarks/index.html", - "title": "Defining landmarks for the ribosome exit tunnel", + "objectID": "posts/principal-curves/principal-curves.html#example-a-hexagonal-billiards-table", + "href": "posts/principal-curves/principal-curves.html#example-a-hexagonal-billiards-table", + "title": "Trajectory Inference for cryo-EM data using Principal Curves", "section": "", - "text": "The ribosome is present in all domains of life, though exhibits varying conservation across phylogeny. It has been found that, as translation proceeds, the nascent polypeptide chain interacts with the tunnel, and as such, tunnel geometry plays a role in translation dynamics and resulting protein structures1. With advances in imaging of ribosome structure with Cryo-EM, there is ample data on which geometric analysis of the tunnel may be applied and therefore a need for more computational tools to do so2." + "text": "Suppose you run an experiment that involves collecting data points \\(\\{\\omega_1, \\ldots, \\omega_M\\} \\subseteq \\Omega \\subseteq \\mathbb R^d\\). As an example, suppose that \\(\\Omega\\) is the hexagonal domain below, and the \\(\\omega_i\\) represent positions of \\(M\\) independent, non-interacting particles in \\(\\Omega\\) (all collected simultaneously).\n\n\n\nsome sample points\n\n\nThe question is: Just from the position data \\(\\{\\omega_1, \\ldots, \\omega_M\\}\\) we have collected, can we determine 1) Whether the particles are all evolving according to the same dynamics, and 2) If so, what those dynamics are? As a sanity check, we can first try superimposing all of the data in one plot.\n\n\n\nsome sample points\n\n\nFrom the image above, there appears to be no discernable structure. 
But as we increase our number of samples \\(M\\), a picture starts to emerge.\n\n\n\nsome sample points\n\n\nand again:\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\nIn the limit as \\(M \\to \\infty\\), we might obtain a picture like the following:\n\n\n\nsome sample points\n\n\nWe see that once \\(M\\) is large, it becomes (visually) clear that the particles are indeed evolving according to the same time-dependent function \\(f : \\mathbb R \\to \\Omega\\), but with 1) Small noise in the initial conditions, and 2) Different initial “offsets” \\(t_i\\) along \\(f(t)\\).\nTo expand on (1) a bit more: Note that in the figure above, there’s a fairly-clear “starting” point where the dark grey lines are all clumped together. Let’s say that this represents \\(f(0)\\). Then we see that the trajectories we observe (call them \\(f_i\\)) appear to look like they’re governed by the same principles, but with \\[f_i(0) = f(0) + \\text{ noise} \\qquad \\text{and} \\qquad f_i'(0) = f'(0) + \\text{ noise}.\\] Together with (2), we see that our observations \\(\\omega_i\\) are really samples from \\(f_i(t_i)\\). The question is how we may use these samples to recover \\(f(t)\\).\nLet us summarize the information so far." }, { - "objectID": "posts/ribosome-landmarks/index.html#introduction", - "href": "posts/ribosome-landmarks/index.html#introduction", - "title": "Defining landmarks for the ribosome exit tunnel", + "objectID": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem", + "href": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem", + "title": "Trajectory Inference for cryo-EM data using Principal Curves", "section": "", - "text": "The ribosome is present in all domains of life, though exhibits varying conservation across phylogeny. 
It has been found that, as translation proceeds, the nascent polypeptide chain interacts with the tunnel, and as such, tunnel geometry plays a role in translation dynamics and resulting protein structures1. With advances in imaging of ribosome structure with Cryo-EM, there is ample data on which geometric analysis of the tunnel may be applied and therefore a need for more computational tools to do so2."
  },
  {
    "objectID": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem",
    "href": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem",
    "title": "Trajectory Inference for cryo-EM data using Principal Curves",
    "section": "",
    "text": "Suppose you have a time-dependent process modeled by some function \\(f : [0,T] \\to \\Omega\\), where \\(\\Omega \\subseteq \\mathbb R^d\\) (or, more generally, an abstract metric space). Then, given observations \\[\\omega_i = f_i(t_i)\\] where the \\((f_i, t_i)\\) are hidden, how can we estimate \\(f(t)\\)?\n\n\nNote that the problem above might at first look very similar to a regression problem, where one attempts to use data points \\((X_i, Y_i)\\) to determine a hidden model \\(f\\) (subject to some noise \\(\\varepsilon_i\\)) giving \\[Y_i = f(X_i) + \\varepsilon_i.\\] If we let \\(f_i(X) = f(X) + \\varepsilon_i\\), then we have an almost-identical setup \\[Y_i = f_i(X_i).\\] The key distinction is that in regression, we assume our data-collection procedure gives us pairs \\((X_i, Y_i)\\), whereas in the trajectory inference problem our data consists of only the \\(Y_i\\) and we must infer the \\(X_i\\) on our own. Note in particular that we have continuum many choices for \\(X_i\\). This ends up massively complicating the problem: If we try the trajectory-inference analogue of regularized least squares, the lack of an a priori coupling between \\(X_i\\) and \\(Y_i\\) means we lose the convexity structure and must use both different theoretical analysis and different numerical algorithms.\nNevertheless, on a cosmetic level, we may formulate the problems with similar-looking equations. This brings us to regularized principal curves." 
  },
  {
    "objectID": "posts/ribosome-landmarks/index.html#background",
    "href": "posts/ribosome-landmarks/index.html#background",
    "title": "Defining landmarks for the ribosome exit tunnel",
    "section": "Background",
    "text": "Background\nIn order to perform geometric shape analysis on the ribosome, we must first superimpose mathematical definitions onto this biological context. Among others, one way of defining shape mathematically is with a set of landmarks. A landmark is a labelled point on some structure, which, biologically speaking, has some meaning. After removing the effects of translation, scaling, and rotation, sets of landmarks form a shape space, on which statistical analysis may be applied.\nAssigning landmarks to biological shapes is not a new idea; many examples involve defining landmarks as joins between bones or muscles, or as points along observed curves3. However, there has been little work in assigning landmarks to biological molecules, and none specifically to the ribosome exit tunnel. The challenge is that any one landmark must have comparable instances across shapes in the shape space, meaning that we cannot arbitrarily pick residues which we know to be near to the tunnel. Such residues must be conserved, and therefore present in each specimen, to be considered useful." 
+ "objectID": "posts/principal-curves/principal-curves.html#special-case-empirical-distributions", + "href": "posts/principal-curves/principal-curves.html#special-case-empirical-distributions", + "title": "Trajectory Inference for cryo-EM data using Principal Curves", + "section": "Special Case: Empirical Distributions", + "text": "Special Case: Empirical Distributions\nNote that when \\(\\mu\\) is an empirical distribution on observed data points \\(\\omega_1, \\ldots, \\omega_M\\), this becomes \\[\\min_{f} \\frac{1}{M} \\sum_{i=1}^M (d(\\omega_i, f))^p+ \\lambda \\mathscr C(f).\\] Further taking \\(p=2\\) and denoting \\(y_i = \\mathrm{argmin}_{y \\in \\mathrm{image}(f)} d(\\omega_i, y)\\), we can write it as \\[\\min_{f} \\frac{1}{M} \\sum_{i=1}^M \\lvert \\omega_i - y_i\\rvert^2+ \\lambda \\mathscr C(f),\\] whence we recover the relationship with regularized least squares." }, { - "objectID": "posts/ribosome-landmarks/index.html#protocol", - "href": "posts/ribosome-landmarks/index.html#protocol", - "title": "Defining landmarks for the ribosome exit tunnel", - "section": "Protocol", - "text": "Protocol\nBelow, I present a preliminary protocol for assigning landmarks to eukaryotic ribosome tunnels. The goal is to extrapolate to bacteria and archaea, as well as produce a combined dataset of landmarks which spans the kingdoms for inter-kingdom comparison. For now, I begin with eukaryota, taking advantage of the high degree of conservation between intra-kingdom ribosomes, as conserved sequences form the basis for this protocol.\nAs the goal for this dataset is to obtain landmarks that line the ribosome exit tunnel, I begin by selecting proteins and rRNA which interact with the tunnel: uL4, uL22, eL39, and 25/28S rRNA for Eukaryota1.\n\n\n\nFigure from Dao Duc et al. (2019) showing proteins affecting tunnel shape in E. coli and H. sapiens.\n\n\nThe full protocol is available here.\n\n1. 
Sequence Alignment\nIn order to assign landmarks which are comparable across ribosome specimens, I consider only the residues which are mostly conserved across our dataset of approximately 400 eukaryotes. To do so, I run Multiple Sequence Alignment (MSA) using MAFFT4 on the dataset for each of the chosen four polymer types and select residues from the MSA which are at least 90% conserved across samples.\n\n\n\nA visualization of a subsection of the MSA showing a highly conserved region of uL4.\n\n\nSelecting the most conserved residue at each position in the alignment:\n\n# Given an MSA column, return the most common element if it is at least as frequent as threshold\ndef find_conserved(column, threshold):\n counter = Counter(column)\n mode = counter.most_common(1)[0]\n \n if (mode[0] != '-' and mode[1] / len(column) >= threshold):\n return mode[0]\n \n return None\n\n\n\n2. Locating Residues\nTo locate the conserved residues, I first map the chosen loci from the MSA back to the corresponding loci in the original sequences:\n\nimport Bio\nfrom Bio.Seq import Seq\n\ndef map_to_original(sequence: Seq, position: int) -> int:\n '''\n Map conserved residue position to original sequence positions.\n 'sequence' is the aligned sequence from MSA.\n '''\n # Initialize pointer to position in original sequence\n ungapped_position = 0\n \n # Iterate through each position in the aligned sequence\n for i, residue in enumerate(sequence):\n # Ignore any gaps '-'\n if residue != \"-\":\n # If we have arrived at the aligned position, return pointer to position in original sequence\n if i == position:\n return ungapped_position\n # Every time we pass a 'non-gap' before arriving at position, we increase pointer by 1\n ungapped_position += 1\n\n # Return None if the position is at a gap \n return None\n\nThen, using PyMol5, I retrieve the atomic coordinates of the residue from the CIF file. 
To obtain a single landmark per residue, I take the mean of the atomic coordinates for each residue as the landmark.\nBelow is example code for retrieving the atomic coordinates of W66 on 4UG0 uL4:\n\nfrom pymol import cmd\nimport numpy as np\nfrom Bio.SeqUtils import seq3\n\n# Specify the residue to locate\nparent = '4UG0'\nchain = 'LC'\nresidue = 'W'\nposition = 66\n\nif f'{parent}_{chain}' not in cmd.get_names():\n cmd.load(f'data/{parent}.cif', object=f'{parent}_{chain}')\n cmd.remove(f'not chain {chain}')\n \nselect = f\"resi {position + 1}\"\n \natom_coords = []\ncmd.iterate_state(1, select, 'atom_coords.append((chain, resn, x, y, z))', space={'atom_coords': atom_coords})\n \nif (len(atom_coords) != 0 and atom_coords[0][1] == seq3(residue).upper()): \n \n vec = np.zeros(3)\n for coord in atom_coords:\n tmp_arr = np.array([coord[2], coord[3], coord[4]])\n vec += tmp_arr\n\n vec = vec / len(atom_coords)\n vec = vec.astype(np.int32)\n \n print(f\"Coordinates: x: {vec[0]}, y: {vec[1]}, z: {vec[2]}\")\n\n\n\n3. Filtering landmarks by distance\nAmong the conserved residues on the selected polymers, many will be relatively far from the exit tunnel and not have any influence on tunnel geometry. Thus, I select only those residues which are close enough to the tunnel. In this protocol, a threshold of \\(7.5 \\mathring{A}\\) is applied.\nThis process is done by using MOLE 2.06, which is a biomolecular channel construction algorithm. The output is a list of points in \\(\\mathbb{R}^3\\) which form the centerline of the tunnel, and, for each point on the centerline, a tunnel radius.\nUsing the MSA, I locate the coordinates of the conserved residues (see Section 3.2). For each of the residues, find the closest tunnel centerline point in Euclidean space, and compute the distance from the residue to the sphere given by the radius at that centerline point. 
If this distance is less than the threshold, this conserved residue is close enough to the tunnel to be considered a landmark.\nFor efficiency, I only run the MOLE algorithm on one ‘prototype’ eukaryote to filter the landmarks, then use this filtered list as the list of landmarks to find on subsequent specimens.\nBelow is the code which checks landmark location against the tunnel points:\n\nimport numpy as np\n\ndef get_tunnel_coordinates(instance: str) -> dict[int,list[float]]:\n \n if instance not in get_tunnel_coordinates.cache:\n xyz = open(f\"data/tunnel_coordinates_{instance}.txt\", mode='r')\n xyz_lines = xyz.readlines()\n xyz.close()\n \n r = open(f\"data/tunnel_radius_{instance}.txt\", mode='r')\n r_lines = r.readlines()\n r.close()\n \n coords = {}\n \n for i, line in enumerate(xyz_lines):\n if (i >= len(r_lines)): break\n \n content = line.split(\" \")\n content.append(r_lines[i])\n \n cleaned = []\n for str in content:\n str.strip()\n try:\n val = float(str)\n cleaned.append(val)\n except:\n None\n \n coords[i] = cleaned\n get_tunnel_coordinates.cache[instance] = coords\n \n # Each value in coords is of the form [x, y, z, r]\n return get_tunnel_coordinates.cache[instance]\n\nget_tunnel_coordinates.cache = {}\n\n# p is a list [x,y,z]\n# instance is RCSB_ID code\ndef find_closest_point(p, instance):\n coords = get_tunnel_coordinates(instance)\n dist = np.inf\n r = 0\n p = np.array(p)\n \n for coord in coords.values():\n xyz = np.array(coord[0:3])\n euc_dist = np.sqrt(np.sum(np.square(xyz - p))) - coord[3]\n if euc_dist < dist:\n dist = euc_dist\n \n return dist\n\nFinally, plotting the results using PyMol:\n\n\n\nLandmarks shown in blue on a mesh representation of the 4UG0 tunnel, with proteins shown for reference (uL4 in pink, uL22 in green, and eL39 in yellow).\n\n\nFor information on the mesh representation of the tunnel used in the figure above, see ‘3D tesellation of biomolecular cavities’.\n\n\nNotes\n\nThe code in the post uses a package 
(pymol-open-source) which cannot be installed into a virtual environment. I have instead included a .yml file specifying my conda environment that is used to compile this code.\nThe code used to retrieve atomic coordinates from PyMol is not robust to inconsistencies in CIF file sequence numbering present in the PDB. My next steps for improving this protocol will be to improve the handling of these edge cases." 
  },
  {
    "objectID": "posts/Embryonic-Shape/index.html",
    "href": "posts/Embryonic-Shape/index.html",
    "title": "Shape analysis of C. elegans E cell",
    "section": "",
    "text": "Some more background information in the blog post link.\nDuring embryonic development of Caenorhabditis elegans, an endomesodermal precursor EMS cell develops into a mesoderm precursor MS cell and an endoderm precursor E cell (Sulston et al. 1983). The asymmetry of this division depends on signals coming from the neighbour of EMS cell, P2 (Jan and Jan 1998). When the signals coming from the neighbouring cell are lost, EMS cell divides symmetrically and both daughters adopt MS cell fate (Goldstein 1992). Since cell signalling can be modulated, C. elegans EMS cell is a good system to use when investigating asymmetric cell divisions. Indeed, preliminary studies show that the volume of the daughter closest to the P2 (signal-sending cell) becomes larger when the signal is abolished. We do not know, however, how the cell shape changes, and whether the daughter cell fate is mediated by the EMS and the daughter cell shape. One way to investigate this is to do direct volume analysis of the EMS cell before division; however, this approach is limited since volume does not account for changes in the cell shape. With this project, I hope to develop a framework to investigate EMS, MS, and E cell shapes and use this framework to analyze cell shapes upon signal perturbations.\nA paper published in 2024 claims to have developed a framework to analyze cell shape in C. 
elegans embryonic cells (Van Bavel, Thiels, and Jelier 2023). To confirm the viability of this framework, the authors compared the shape of a wild type E cell versus an E cell that does not receive a signal from P2 cell (dsh-2/mig-5 knockdown). To analyze the shapes, the authors used conformal mapping to map the cell shapes onto a sphere. They then extracted spherical harmonics which can describe the features of the cell in decreasing importance order from the ones that have the greatest contribution to the cell shape. In this project, my aim was to reproduce their results and to use the Flowshape framework on my own samples.\n\n\nThe pipeline of this framework begins with segmentation. In the article, SDT-PICS method (Thiels et al. 2021) was used to generate 3D meshes. The method was installed using Docker, but it required substantial version control to make it work, as the installation depended on Linux, some dependencies were not compatible with their recommended Python version, and others were not compatible with a different Python version. I hope to contact the authors of the paper and submit the fixes for installing SDT-PICS. Additionally, the segmentation pipeline did not work very well with my microscopy images (Figure 1). This could be due to different cell shape markers or microscopy differences.\n\n\n\nFigure 1: six-cell stage C. elegans embryo stained with a membrane dye.\n\n\nAfter trying numerous segmentation techniques I have settled for a semi-automatic segmentation of specific cells using ImageJ. This was done using automatic interpolation of selected cells, creating binary masks (Figure 2). These were used as sample cells for further analysis.\n\n\n\nFigure 2: Mask samples for an E cell, including the first mask, the 30th mask and the last mask.\n\n\n\n\n\n\n\nThe main Flowshape algorithm uses 3D meshes as input for conformal mapping. 
However, they do provide a method to build meshes from image files using a marching cubes algorithm (Lorensen and Cline 1987). The marching cubes algorithm leads to a cylindrical 3D representation of a cell (Figure 3).\n\n\n\nFigure 3: Cell reconstruction from masks shown in Figure 2 using Marching Cubes algorithm.\n\n\nTo remove any gaps in the shape, we employ a remeshing algorithm in the pyvista package. This leads to an expected triangular mesh (Figure 4). The holes produced by the marching cubes algorithm are filled and the shape is ready to be analyzed.\n\n\n\nFigure 4: Cell reconstruction from the marching cubes shown in Figure 3.\n\n\nSpherical harmonics can then be calculated using the following code.\n```{python}\n# perform reconstruction with 24 SH\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\n\nrho = Y_mat.dot(weights)\n\nreconstruct = fs.reconstruct_shape(sv, f, rho )\nmeshplot.plot(reconstruct, f, c = rho)\n```\nThis results in a reconstructed cell shape (Figure 5). The colors here represent the curvature of the shape.\n\n\n\nFigure 5: Cell reconstruction using spherical harmonics (the first 24)\n\n\nSpherical harmonics can also be used to map the shape directly onto the sphere. Similar to Figure 5, high curvature areas are represented in brighter colors.\n\n\n\nFigure 6: Cell reconstruction onto a sphere using conformal mapping\n\n\n\n\n\nTo compare two shapes, it is essential to first align them. In this workflow, alignment is calculated by estimating a rotation matrix that maximizes the correlation between the spherical harmonics of two shapes. 
This is then used to align the shapes and refine the alignment (Figure 7)\n```{python}\nrot2 = fs.compute_max_correlation(weights3, weights2, l_max = 24)\nrot2 = rot2.as_matrix()\n\np = mp.plot(v, f)\np.add_points(v2 @ rot2, shading={\"point_size\": 0.2})\n\nfinal = v2 @ rot2\n\nfor i in range(10):\n # Project points onto surface of original mesh\n sqrD, I, proj = igl.point_mesh_squared_distance(final, v, f)\n # Print error (RMSE)\n print(np.sqrt(np.average(sqrD)))\n \n # igl's procrustes complains if you don't give the mesh in Fortran index order \n final = final.copy(order='f')\n proj = proj.copy(order='f')\n \n # Align points to their projection\n s, R, t = igl.procrustes(final, proj, include_scaling = True, include_reflections = False)\n\n # Apply the transformation\n final = (final * s).dot(R) + t\n```\nIn this image, the yellow shape and red dots represent two separate E cells. Fewer red dots mean that cells are better aligned.\n\n\n\nFigure 7: Alignment of two cells.\n\n\n\n\n\nTo find a mean shape between the two shapes, I found a mean spherical harmonics decomposition:\n```{python}\nweights, Y_mat, vs = fs.do_mapping(v,f, l_max = 24)\nweights2, Y_mat2, vs2 = fs.do_mapping(v2, f2, l_max = 24)\n\nmean_weights = (weights + weights2) / 2\nmean_Ymat = (Y_mat + Y_mat2)/2\n\nsv = fs.sphere_map(v, f)\nrho3 = mean_Ymat.dot(mean_weights)\nmp.plot(sv, f, c = rho3)\n\nrec2 = fs.reconstruct_shape(sv, f, rho3)\nmp.plot(rec2,f, c = rho3)\n```\nFrom this, I built a mean shape on the sphere, followed by a reconstruction (Figure 8)\n\n\n\nFigure 8: Mean shape reconstruction. 
Left - mean shape mapped onto a sphere, right - reconstructed mean shape using Spherical Harmonics.\n\n\nThis reconstruction was then used to re-align the original shapes and map them onto the average shape (Figure 9)\n\n\n\nFigure 9: Alignment of cells to a mean shape.\n\n\nTo further analyze the differences between the two cells, I have calculated pointwise differences between vertices, and the combined deviation of each vertex from the average vertex. I then mapped these onto the average cell shape (Figure 10)\n```{python}\npointwise_diff = np.linalg.norm(final - final2, axis=1) # Difference between aligned shapes\n\n# Point-wise difference from the mean shape\ndiff_from_mean_v = np.linalg.norm(final - v3, axis=1)\ndiff_from_mean_final = np.linalg.norm(final2 - v3, axis=1)\n```\n\n\n\nFigure 10: Estimation of deviations from the mean shape using pointwise differences (left) and filtering only the highest differences (right).\n\n\nTo numerically estimate the shape differences, I have calculated the RMSE between shapes (1.37) and surface area difference between the cells (557.5µm2). These numbers might make more sense after sufficient data to compare between different samples.\nI also tried using K-means clustering to see if there were any significant clusters (Figure 11)\n\n\n\nFigure 11: K-wise differences clustering onto the mean cell shape. Most of the clusters are evenly distributed but upon rotation there is a larger cluster (right).\n\n\n\n\n\nThis project proposes using a modified Flowshape analysis pipeline to investigate similarities and differences between C. elegans embryonic cells. Mean shape can be easily estimated using spherical harmonics which can then be used to compare different shapes and find outliers of interest. I would like to extend this project by automating and improving the segmentation pipeline (either via SDT-pics or machine learning algorithms), and finding ways to extract more data points from shape comparisons." 
  },
  {
    "objectID": "posts/Embryonic-Shape/index.html#segmentation",
    "href": "posts/Embryonic-Shape/index.html#segmentation",
    "title": "Shape analysis of C. elegans E cell",
    "section": "",
    "text": "The pipeline of this framework begins with segmentation. In the article, SDT-PICS method (Thiels et al. 2021) was used to generate 3D meshes. The method was installed using Docker, but it required substantial version control to make it work, as the installation depended on Linux, some dependencies were not compatible with their recommended Python version, and others were not compatible with a different Python version. I hope to contact the authors of the paper and submit the fixes for installing SDT-PICS. Additionally, the segmentation pipeline did not work very well with my microscopy images (Figure 1). This could be due to different cell shape markers or microscopy differences.\n\n\n\nFigure 1: six-cell stage C. elegans embryo stained with a membrane dye.\n\n\nAfter trying numerous segmentation techniques I have settled for a semi-automatic segmentation of specific cells using ImageJ. This was done using automatic interpolation of selected cells, creating binary masks (Figure 2). 
These were used as sample cells for further analysis.\n\n\n\nFigure 2: Mask samples for an E cell, including the first mask, the 30th mask and the last mask." }, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#multidimensional-scaling-mds", - "href": "posts/outlier-detection/DeCOr-MDS.html#multidimensional-scaling-mds", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Multidimensional scaling (MDS)", - "text": "Multidimensional scaling (MDS)\nMDS is a statistical technique used for visualizing data points in a low-dimensional space, typically two or three dimensions. It is particularly useful when the data is represented in the form of a distance matrix, where each entry indicates the distance between pairs of items. MDS aims to place each item in this lower-dimensional space in such a way that the distances between the items are preserved as faithfully as possible. This allows complex, high-dimensional data to be more easily interpreted, as the visual representation can reveal patterns, clusters, or relationships among the data points that might not be immediately apparent in the original high-dimensional space. MDS is widely used in fields such as psychology, market research, and bioinformatics for tasks like visualizing similarities among stimuli, products, or genetic sequences (Carroll and Arabie 1998; Hout, Papesh, and Goldinger 2013)." + "objectID": "posts/Embryonic-Shape/index.html#flowshape-algorithm", + "href": "posts/Embryonic-Shape/index.html#flowshape-algorithm", + "title": "Shape analysis of C. elegans E cell", + "section": "", + "text": "The main Flowshape algorithm uses 3D meshes as input for conformal mapping. However, they do provide a method to build meshes from image files using a marching cubes algorithm (Lorensen and Cline 1987). 
The marching cubes algorithm leads to a cylindrical 3D representation of a cell (Figure 3).\n\n\n\nFigure 3: Cell reconstruction from masks shown in Figure 2 using Marching Cubes algorithm.\n\n\nTo remove any gaps in the shape, we employ a remeshing algorithm in the pyvista package. This leads to an expected triangular mesh (Figure 4). The holes produced by the marching cubes algorithm are filled and the shape is ready to be analyzed.\n\n\n\nFigure 4: Cell reconstruction from the marching cubes shown in Figure 3.\n\n\nSpherical harmonics can then be calculated using the following code.\n```{python}\n# perform reconstruction with 24 SH\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\n\nrho = Y_mat.dot(weights)\n\nreconstruct = fs.reconstruct_shape(sv, f, rho )\nmeshplot.plot(reconstruct, f, c = rho)\n```\nThis results in a reconstructed cell shape (Figure 5). The colors here represent the curvature of the shape.\n\n\n\nFigure 5: Cell reconstruction using spherical harmonics (the first 24)\n\n\nSpherical harmonics can also be used to map the shape directly onto the sphere. Similar to Figure 5, high curvature areas are represented in brighter colors.\n\n\n\nFigure 6: Cell reconstruction onto a sphere using conformal mapping\n\n\n\n\n\nTo compare two shapes, it is essential to first align them. In this workflow, alignment is calculated by estimating a rotation matrix that maximizes the correlation between the spherical harmonics of two shapes. 
This is then used to align the shapes and refine the alignment (Figure 7)\n```{python}\nrot2 = fs.compute_max_correlation(weights3, weights2, l_max = 24)\nrot2 = rot2.as_matrix()\n\np = mp.plot(v, f)\np.add_points(v2 @ rot2, shading={\"point_size\": 0.2})\n\nfinal = v2 @ rot2\n\nfor i in range(10):\n # Project points onto surface of original mesh\n sqrD, I, proj = igl.point_mesh_squared_distance(final, v, f)\n # Print error (RMSE)\n print(np.sqrt(np.average(sqrD)))\n \n # igl's procrustes complains if you don't give the mesh in Fortran index order \n final = final.copy(order='f')\n proj = proj.copy(order='f')\n \n # Align points to their projection\n s, R, t = igl.procrustes(final, proj, include_scaling = True, include_reflections = False)\n\n # Apply the transformation\n final = (final * s).dot(R) + t\n```\nIn this image, the yellow shape and red dots represent two separate E cells. Fewer red dots mean that cells are better aligned.\n\n\n\nFigure 7: Alignment of two cells.\n\n\n\n\n\nTo find a mean shape between the two shapes, I found a mean spherical harmonics decomposition:\n```{python}\nweights, Y_mat, vs = fs.do_mapping(v,f, l_max = 24)\nweights2, Y_mat2, vs2 = fs.do_mapping(v2, f2, l_max = 24)\n\nmean_weights = (weights + weights2) / 2\nmean_Ymat = (Y_mat + Y_mat2)/2\n\nsv = fs.sphere_map(v, f)\nrho3 = mean_Ymat.dot(mean_weights)\nmp.plot(sv, f, c = rho3)\n\nrec2 = fs.reconstruct_shape(sv, f, rho3)\nmp.plot(rec2,f, c = rho3)\n```\nFrom this, I built a mean shape on the sphere, followed by a reconstruction (Figure 8)\n\n\n\nFigure 8: Mean shape reconstruction. 
Left - mean shape mapped onto a sphere, right - reconstructed mean shape using Spherical Harmonics.\n\n\nThis reconstruction was then used to re-align the original shapes and map them onto the average shape (Figure 9)\n\n\n\nFigure 9: Alignment of cells to a mean shape.\n\n\nTo further analyze the differences between the two cells, I have calculated pointwise differences between vertices, and the combined deviation of each vertex from the average vertex. I then mapped these onto the average cell shape (Figure 10)\n```{python}\npointwise_diff = np.linalg.norm(final - final2, axis=1) # Difference between aligned shapes\n\n# Point-wise difference from the mean shape\ndiff_from_mean_v = np.linalg.norm(final - v3, axis=1)\ndiff_from_mean_final = np.linalg.norm(final2 - v3, axis=1)\n```\n\n\n\nFigure 10: Estimation of deviations from the mean shape using pointwise differences (left) and filtering only the highest differences (right).\n\n\nTo numerically estimate the shape differences, I have calculated the RMSE between shapes (1.37) and surface area difference between the cells (557.5µm2). These numbers might make more sense after sufficient data to compare between different samples.\nI also tried using K-means clustering to see if there were any significant clusters (Figure 11)\n\n\n\nFigure 11: K-wise differences clustering onto the mean cell shape. Most of the clusters are evenly distributed but upon rotation there is a larger cluster (right).\n\n\n\n\n\nThis project proposes using a modified Flowshape analysis pipeline to investigate similarities and differences between C. elegans embryonic cells. Mean shape can be easily estimated using spherical harmonics which can then be used to compare different shapes and find outliers of interest. I would like to extend this project by automating and improving the segmentation pipeline (either via SDT-pics or machine learning algorithms), and finding ways to extract more data points from shape comparisons." 
}, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#orthogonal-outliers", - "href": "posts/outlier-detection/DeCOr-MDS.html#orthogonal-outliers", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Orthogonal outliers", - "text": "Orthogonal outliers\nOutlier detection has been widely used in biological data. Sheih and Yeung proposed a method using principal component analysis (PCA) and robust estimation of Mahalanobis distances to detect outlier samples in microarray data (Shieh and Hung 2009). Chen et al. reported the use of two PCA methods to uncover outlier samples in multiple simulated and real RNA-seq data (Oh, Gao, and Rosenblatt 2008). Outlier influence can be mitigated depending on the specific type of outlier. In-plane outliers and bad leverage points can be harnessed using \\(\\ell_1\\)-norm Forero and Giannakis (2012), correntropy or M-estimators (Mandanas and Kotropoulos 2017). Outliers which violate the triangular inequality can be detected and corrected based on their pairwise distances (Blouvshtein and Cohen-Or 2019). Orthogonal outliers are another particular case, where outliers have an important component, orthogonal to the hyperspace where most data is located. These outliers often do not violate the triangular inequality, and thus require an alternative approach." + "objectID": "posts/landmarks-final/index.html", + "href": "posts/landmarks-final/index.html", + "title": "Landmarking the ribosome exit tunnel", + "section": "", + "text": "I present a complete Python protocol for assigning landmarks to the ribosome exit tunnel surface based on conservation and distance. The motivation and background for this topic can be found in my previous post. This blog post outlines implementation details and usage instructions for a more robust version of the protocol, available in full on GitHub." 
}, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#height-and-volume-of-n-simplices", - "href": "posts/outlier-detection/DeCOr-MDS.html#height-and-volume-of-n-simplices", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Height and Volume of n-simplices", - "text": "Height and Volume of n-simplices\nWe recall some geometric properties of simplices, which our method is based on. For a set of \\(n\\) points \\((x_1,\\ldots, x_n)\\), the associated \\(n\\)-simplex is the polytope of vertices \\((x_1,\\ldots, x_n)\\) (a 3-simplex is a triangle, a 4-simplex is a tetrahedron and so on). The height \\(h(V_{n},x)\\) of a point \\(x\\) belonging to a \\(n\\)-simplex \\(V_{n}\\) can be obtained as (Sommerville 1929), \\[\n h(V_{n},x) = n \\frac{V_n}{V_{n-1}},\n\\tag{1}\\] where \\(V_{n}\\) is the volume of the \\(n\\)-simplex, and \\(V_{n-1}\\) is the volume of the \\((n-1)\\)-simplex obtained by removing the point \\(x\\). \\(V_{n}\\) and \\(V_{n-1}\\) can be computed using the pairwise distances only, with the Cayley-Menger formula (Sommerville 1929):\n\\[\\begin{equation}\n\\label{eq:Vn}\nV_n = \\sqrt{\\frac{\\vert det(CM_n)\\vert}{2^n \\cdot (n!)^2}},\n\\end{equation}\\]\nwhere \\(det(CM_n)\\) is the determinant of the Cayley-Menger matrix \\(CM_n\\), that contains the pairwise distances \\(d_{i,j}=\\left\\lVert x_i -x_j \\right\\rVert\\), as \\[\\begin{equation}\n CM_n = \\left[ \\begin{array}{cccccc} 0 & 1 & 1 & ... & 1 & 1 \\\\\n\n 1 & 0 & d_{1,2}^2 & ... & d_{1,n}^2 & d_{1,n+1}^2 \\\\\n 1 & d_{2,1}^2 & 0 & ... & d_{2,n}^2 & d_{2,n+1}^2 \\\\\n ... & ... & ... & ... & ... & ... \\\\\n 1 & d_{n,1}^2 & d_{n,2}^2 & ... & 0 & d_{n,n+1}^2 \\\\\n 1 & d_{n+1,1}^2 & d_{n+1,2}^2 & ... 
& d_{n+1,n}^2 & 0 \\\\\n \\end{array}\\right].\n\\end{equation}\\]" + "objectID": "posts/landmarks-final/index.html#introduction", + "href": "posts/landmarks-final/index.html#introduction", + "title": "Landmarking the ribosome exit tunnel", + "section": "", + "text": "I present a complete Python protocol for assigning landmarks to the ribosome exit tunnel surface based on conservation and distance. The motivation and background for this topic can be found in my previous post. This blog post outlines implementation details and usage instructions for a more robust version of the protocol, available in full on GitHub." }, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#sec-part1", - "href": "posts/outlier-detection/DeCOr-MDS.html#sec-part1", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Orthogonal outlier detection and dimensionality estimation", - "text": "Orthogonal outlier detection and dimensionality estimation\nWe now consider a dataset \\(\\mathbf{X}\\) of size \\(N\\times d\\), where \\(N\\) is the sample size and \\(d\\) the dimension of the data. We associate with \\(\\mathbf{X}\\) a matrix \\(\\mathbf{D}\\) of size \\(N\\times N\\), which represents all the pairwise distances between observations of \\(\\mathbf{X}\\). We also assume that the data points can be mapped into a vector space with regular observations that form a main subspace of unknown dimension \\(d^*\\) with some small noise, and additional orthogonal outliers of relatively large orthogonal distance to the main subspace (see Figure 1.A). Our proposed method aims to infer from \\(\\mathbf{D}\\) the dimension of the main data subspace \\(d^*\\), using the geometric properties of simplices with respect to their number of vertices: Consider a \\((n+2)\\)-simplex containing a data point \\(x_i\\) and its associated height, that can be computed using equation Equation 1. 
When \\(n<d^*\\) and for \\(S\\) large enough, the distribution of heights obtained from different simplices containing \\(x_i\\) remains similar, whether \\(x_i\\) is an orthogonal outlier or a regular observation (see Figure 1.B). In contrast, when \\(n\\geq d^*\\), the median of these heights approximately yields the distance of \\(x_i\\) to the main subspace (see Figure 1.C). This distance should be significantly larger when \\(x_i\\) is an orthogonal outlier, compared with regular points, for which these distances are tantamount to the noise.\n\n\n\n\n\n\nFigure 1: Example of a dataset with orthogonal outliers and n-simplices. Representation of a dataset with regular data points (blue) belonging to a main subspace of dimension 2 with some noise, and orthogonal outliers (red triangle symbols) in the third dimension. View of two instances of 3-simplices (triangles), one with only regular points (left) and the other one containing one outlier (right). The height drawn from the outlier is close to the height of the regular triangle. Upon adding other regular points to obtain tetrahedrons (4-simplices), the height drawn from the outlier (right) becomes significantly larger than the height drawn from the same point (left) as in .\n\n\n\nTo estimate \\(d^*\\) and for a given dimension \\(n\\) tested, we thus randomly sample, for every \\(x_i\\) in \\(\\mathbf{X}\\), \\(S(n+2)\\)-simplices containing \\(x_i\\), and compute the median of the heights \\(h_i^n\\) associated with these \\(S\\) simplices. Upon considering, as a function of the dimension \\(n\\) tested, the distribution of median heights \\((h_1^{n},...,h_N^{n})\\) (with \\(1\\leq i \\leq N\\)), we then identify \\(d^*\\) as the dimension at which this function presents a sharp transition towards a highly peaked distribution at zero. 
To do so, we compute \\(\\tilde{h}_n\\), as the mean of \\((h_1^{n},...,h_N^{n})\\), and estimate \\(d^*\\) as\n\\[\\begin{equation}\n \\bar{n}=\\underset{n}{\\operatorname{argmax}} \\frac{\\tilde{h}_{n-1}}{\\tilde{h}_{n}}.\n \\label{Eq:Dim}\n\\end{equation}\\]\nFurthermore, we detect orthogonal outliers using the distribution obtained in \\(\\bar{n}\\), as the points for which \\(h_i^{\\bar{n}}\\) largely stands out from \\(\\tilde{h}_{\\bar{n}}\\). To do so, we compute \\(\\sigma_{\\bar{n}}\\) the standard deviation observed for the distribution \\((h_1^{\\bar{n}},...,h_N^{\\bar{n}})\\), and obtain the set of orthogonal outliers \\(\\mathbf{O}\\) as\n\\[\n \\mathbf{O}= \\left\\{ i\\;|\\;h_i^{\\bar{n}}> \\tilde{h}_{\\bar{n}} + c \\times \\sigma_{\\bar{n}} \\right\\},\n\\tag{2}\\]\nwhere \\(c>0\\) is a parameter set to achieve a reasonable trade-off between outlier detection and false detection of noisy observations." + "objectID": "posts/landmarks-final/index.html#protocol-overview", + "href": "posts/landmarks-final/index.html#protocol-overview", + "title": "Landmarking the ribosome exit tunnel", + "section": "Protocol Overview", + "text": "Protocol Overview\nLandmarks assigned on the surface of the tunnel are defined as the mean atomic coordinates of conserved residues that are close to the tunnel surface. The general steps in the protocol are:\n\nRun Multiple Sequence Alignment (MSA) on the relevant polymers and select residues that are above a conservation threshold.\nOf the conserved residues, select only the residues that are within a distance threshold of the tunnel as represented by the Mole model1.\nExtract the 3D coordinates of the selected residues.\n\n\n\n\n\n\n\nFigure 1: Landmarks shown in blue on a mesh representation of the 4UG0 tunnel, with proteins shown for reference (uL4 in pink, uL22 in green, and eL39 in yellow)." 
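The three steps of the landmarking protocol can be exercised end-to-end on toy data. Everything below — the mini-alignment, residue coordinates, centerline, and thresholds — is invented for illustration; the real protocol works on MAFFT output and Mole centerline/radius data:

```python
import numpy as np
from collections import Counter

# toy MSA: 4 sequences x 6 columns ('-' is a gap)
msa = ["MKV-LT", "MKV-LT", "MKVALT", "MRV-LT"]
# toy 3D coordinate for the residue at each column of the prototype
residue_coords = {i: np.array([float(i), 0.0, 0.0]) for i in range(6)}
# toy tunnel centerline: rows of (x, y, z, radius)
centerline = np.array([[2.0, 0.0, 0.0, 1.0]])

landmarks = []
for col in range(len(msa[0])):
    column = [seq[col] for seq in msa]
    mode, count = Counter(column).most_common(1)[0]
    # step 1: keep only conserved, non-gap columns
    if mode == '-' or count / len(column) < 0.9:
        continue
    # step 2: keep only residues close to the tunnel surface
    p = residue_coords[col]
    d = np.min(np.linalg.norm(centerline[:, :3] - p, axis=1) - centerline[:, 3])
    if d <= 2.0:
        # step 3: record the landmark's alignment column, residue, and coordinates
        landmarks.append((col, mode, p))
```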
}, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#correcting-the-dimensionality-estimation-for-a-large-outlier-fraction", - "href": "posts/outlier-detection/DeCOr-MDS.html#correcting-the-dimensionality-estimation-for-a-large-outlier-fraction", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Correcting the dimensionality estimation for a large outlier fraction", - "text": "Correcting the dimensionality estimation for a large outlier fraction\nThe method presented in the previous section assumes that at dimension \\(d^*\\), the median height calculated for each point reflects the distance to the main subspace. This assumption is valid when the fraction of orthogonal outliers is small enough, so that the sampled \\(n\\)-simplex likely contains regular observations only, aside from the evaluated point. However, if the number of outliers gets large enough so that a significant fraction of \\(n\\)-simplices %drawn to compute a height also contains outliers, then the calculated heights would yield the distance between \\(x_i\\) and an outlier-containing hyperplane, whose dimension is larger than a hyperplane containing only regular observations. The apparent dimensionality of the main subspace would thus increase and generates a positive bias on the estimate of \\(d^*\\).\nSpecifically, if \\(\\mathbf{X}\\) contains a fraction of \\(p\\) outliers, and if we consider \\(o_{n,p,N}\\) the number of outliers drawn after uniformly sampling \\(n+1\\) points (to test the dimension \\(n\\)), then \\(o_{n,p,N}\\) follows a hypergeometric law, with parameters \\(n+1\\), the fraction of outliers \\(p=N_o/N\\), and \\(N\\). Thus, the expected number of outliers drawn from a sampled simplex is \\((n+1) \\times p\\). 
After estimating \\(\\bar{n}\\) (from Section 3.1), and finding a proportion of outliers \\(\\bar p = |\\mathbf{O}|/N\\) using Equation 2, we hence correct \\(\\bar{n}\\) by substracting the estimated bias \\(\\delta\\), as the integer part of the expectation of \\(o_{n,p,N}\\), so the debiased dimensionality estimate \\(n^*\\) is\n\\[\\begin{equation}\n n^* =\\bar{n} - \\lfloor (\\bar{n}+1) \\times p \\rfloor.\n \\label{eq:corrected_n}\n\\end{equation}\\]" + "objectID": "posts/landmarks-final/index.html#implementation-details", + "href": "posts/landmarks-final/index.html#implementation-details", + "title": "Landmarking the ribosome exit tunnel", + "section": "Implementation Details", + "text": "Implementation Details\n\nSeparation by Kingdom\nThe protocol has two main entry points: main.py and main_universal.py. The main file assigns intra-kingdom landmarks; conserved residues are chosen based only on sequences from the given kingdom, meaning that the landmarks are specific to one of the three biological super-kingdoms (eukaryota, bacteria, and archaea). Using main, landmarks for one kingdom do not directly correspond to landmarks for another kingdom. While this separation prevents direct inter-kingdom comparison, it allows for a higher number of landmarks to be assigned to each specimen, due to higher degrees of conservation within kingdoms. The alternative is to use main_universal, which chooses conserved residues based on all sequences. This provides fewer landmarks per ribosome, but allows for inter-kingdom comparison, as each landmark will have correspondence across all specimens.\n\n\nData\nThe protocol uses data from RibosomeXYZ2 and the Protein Data Bank (PDB) via API access. For each ribosome structure, the protocol requests sequences and metadata (chain names, taxonomic information, etc.) from RibosomeXYZ for selected proteins and RNA and the full mmcif structural file from the PDB. 
This data is stored locally to facilitate repeated access during runtime.\n\n\nAlignments\nThe program uses MAFFT3 to perform Multiple Sequence Alignment (MSA) on all of the available sequences for each of the relevant polymers. It accesses sequence data from RibosomeXYZ polymer files. When the program is run on new specimens, if the sequences are not already in the input fasta files, they are automatically added and the alignments are re-run to include the new specimens.\n\n\n\n\n\n\nFigure 2: A visualization of a subsection of the MSA showing a highly conserved region of uL4.\n\n\n\n\n\nSelecting Landmarks\nLandmarks are selected using a prototype ribosome and based on conservation and distance. The program searches for landmarks only on polymers which are known to be close to the tunnel4.\n\n\n\nKingdom\nPrototype\nSelected Polymers\n\n\n\n\nEukaryota\n4UG0\nuL4, uL22, eL39, 25/28S rRNA\n\n\nBacteria\n3J7Z\nuL4, uL22, uL23, 23S rRNA\n\n\nArchaea\n4V6U\nuL4, uL22, eL39, 23S rRNA\n\n\nUniversal\n3J7Z\nuL4, uL22, 23/25/28S rRNA\n\n\n\nThe prototype IDs and polymers used in the protocol\n\nConservation\nTo be chosen as a landmark, residues must be at least 90% conserved. This threshold is a tuneable parameter. For each of the relevant polymers, the program iterates through each position in the MSA alignment file for that polymer and selects alignment positions for which at least 90% of specimens share the same residue. This excludes positions where gaps are the most common element. 
The program calls the below method on every column of the MSA for each of the relevant polymers to obtain a short-list of alignment positions to be considered for landmarks.\n\ndef find_conserved(column, threshold):\n counter = Counter(column)\n mode = counter.most_common(1)[0]\n \n if (mode[0] != '-' and mode[1] / len(column) >= threshold):\n return mode[0]\n \n return None\n\n\n\nDistances\nFor each candidate conserved position, the program first locates the residue’s coordinates on the prototype specimen (see Section 3.5 for more detail). For each prototype, I have run the Mole tunnel search algorithm to extract the centerline coordinates of the tunnel and the radius at each point. Then for each candidate landmark \\(p_l\\), I find the nearest centerline point \\(p_c\\) by Euclidean distance, and compute the distance from \\(p_l\\) to the sphere centered at \\(p_c\\) with the given radius \\(r_c\\): \\[ d = ||p_l - p_c|| - r_c\\] If \\(d\\) is less than the distance threshold, the candidate position is considered a landmark. See the code below for reference:\n\ndef find_closest_point(p, instance):\n coords = get_tunnel_coordinates(instance)\n dist = np.inf\n p = np.array(p)\n \n # each coord holds (x, y, z, radius); track the smallest sphere-surface distance\n for coord in coords.values():\n xyz = np.array(coord[0:3])\n euc_dist = np.sqrt(np.sum(np.square(xyz - p))) - coord[3]\n if euc_dist < dist:\n dist = euc_dist\n \n return dist\n\nEach selected landmark’s residue type and alignment location are saved to file, so that new ribosome specimens can use the list as a guideline.\n\n\n\nLocating landmarks\nLocating the chosen landmarks in the structural file for a given ribosome specimen is the most involved step of the protocol. Often, a ribosome mmcif file contains some gaps, due to experimental/imaging conditions. For this reason, I take an approach using methods from RibosomeXYZ’s backend2 to keep track of residue locations as sequences are manipulated (aligned, flattened to remove gaps, etc.). 
We have access to two copies of the sequence for each polymer: the sequence from the RibosomeXYZ polymer data, which is well formed, and the mmcif PDB sequence that is tied to the 3D structure, which often has gaps. The protocol makes use of both versions.\nThe PDB sequence is loaded into memory as an object using BioPython. This object holds all of the structural and hierarchical information present in the original file. This is more useful than working with sequences as strings. For example, indexing a protein sequence gives a unique residue object which holds structural information, rather than just a symbolic letter.\nI use the SequenceMappingContainer class taken from RibosomeXYZ. The purpose of this class is to facilitate working with the PDB structural sequences with gaps. Initializing the class with a polymer sequence as a BioPython Chain object gives a ‘primary’ unchanged version of the sequence and a ‘flattened’ version with all gaps removed, as well as mappings for indices between the two. Given an index in the flattened sequence, we can use the maps to find the index in the primary sequence and therefore the author-assigned residue IDs and structural information, and vice versa. 
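The primary/flattened index mapping can be illustrated in a few lines. This is a simplified sketch of the idea only — the real container maps flat indices to BioPython Residue objects and author-assigned IDs rather than to plain string positions:

```python
def build_flat_maps(primary_seq):
    """Build a gap-free ('flat') copy of a gapped ('primary') sequence,
    plus index maps in both directions ('-' marks a gap)."""
    flat_chars = []
    flat_to_primary = {}
    for i, ch in enumerate(primary_seq):
        if ch != '-':
            flat_to_primary[len(flat_chars)] = i
            flat_chars.append(ch)
    primary_to_flat = {v: k for k, v in flat_to_primary.items()}
    return "".join(flat_chars), flat_to_primary, primary_to_flat

# a gapped structural sequence and its flattened version
flat, flat_to_primary, primary_to_flat = build_flat_maps("MK--VAL-T")
```

Here index 2 of the flattened sequence ('V') maps back to index 4 of the gapped one, and vice versa.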
This is the backbone of locating residues by sequence numbers from the landmark list on potentially gappy polymer sequences.\nThe algorithm for locating a landmark is as follows:\n\nAccess the aligned sequence from the MSA, and map the landmark from the location in the alignment to the location in the original RibosomeXYZ sequence for this polymer instance.\nPerform a pairwise sequence alignment on the original RibosomeXYZ sequence and the flattened PDB sequence.\nUse this pairwise alignment to map the landmark location in the original RibosomeXYZ sequence to the location in the flattened PDB sequence.\nFrom the flattened PDB sequence, use SequenceMappingContainer mapping to find the residue ID in the primary PDB sequence, and use this ID to index the Residue object.\nEnsure that the residue type matches the landmark type (i.e. amino acids / nucleotides match), and return the mean coordinates of the atoms in the residue as the landmark coordinates.\n\nSee the following code:\n\ndef locate_residues(landmark: Landmark, \n polymer: str, \n polymer_id: str, \n rcsb_id: str, \n chain: Structure, \n flat_seq,\n kingdom: str = None) -> dict:\n \n '''\n This method takes a landmark centered on the alignment, and finds this residue on the given rcsb_id's polymer.\n Returns the residue's position and coordinates.\n \n landmark: the landmark to be located\n polymer: the polymer on which this landmark lies\n polymer_id: the polymer id specific to this rcsb_id\n rcsb_id: the id of the ribosome instance\n chain: the biopython Chain object holding the sequence\n flat_seq: from SequenceMappingContainer, tuple holding (seq, flat_index_to_residue_map, auth_seq_id_to_flat_index_map)\n kingdom: kingdom to which this rcsb_id belongs, or none if being called from main_universal.py\n '''\n \n # access aligned sequence from alignment files\n if kingdom is None:\n path = f\"data/output/fasta/aligned_sequences_{polymer}.fasta\"\n else:\n path = 
f\"data/output/fasta/aligned_sequences_{kingdom}_{polymer}.fasta\"\n alignment = AlignIO.read(path, \"fasta\")\n aligned_seq = get_rcsb_in_alignment(alignment, rcsb_id)\n \n # find the position of the landmark on the original riboXYZ seq\n alignment_position = map_to_original(aligned_seq, landmark.position) \n \n # access riboXYZ sequence (pre alignment)\n orig_seq = check_fasta_for_rcsb_id(rcsb_id, polymer, kingdom)\n\n if orig_seq is None:\n print(\"Cannot access sequence\")\n return\n \n # run pairwise alignment on the riboXYZ sequence and the flattened PDB sequence\n alignment = run_pairwise_alignment(rcsb_id, polymer_id, orig_seq, flat_seq[0])\n \n if alignment is None:\n return None\n \n # map the alignment_position from the original riboXYZ sequence to the pairwise-aligned flattened PDB sequence\n flattened_seq_aligned = alignment[1]\n flat_aligned_position = None\n if alignment_position is not None: \n flat_aligned_position = map_to_original(flattened_seq_aligned, alignment_position)\n \n if flat_aligned_position is None:\n print(f\"Cannot find {landmark} on {rcsb_id} {polymer}\")\n return None \n \n # use the SequenceMappingContainer flat_index_to_residue_map to access the residue in the PDB sequence\n resi_id = flat_seq[1][flat_aligned_position].get_id()\n residue = chain[resi_id]\n \n # check that the located residue is the same as the landmark\n landmark_1_letter = landmark.residue.upper()\n landmark_3_letter = ResidueSummary.one_letter_code_to_three(landmark_1_letter)\n if (residue.get_resname() != landmark_1_letter and residue.get_resname() != landmark_3_letter):\n return None\n \n # find atomic coordinates for the selected residue\n atom_coords = [atom.coord for atom in residue]\n if (len(atom_coords) == 0): \n return None\n \n # take the mean coordinate for the atoms in residue\n vec = np.zeros(3)\n for coord in atom_coords:\n tmp_arr = np.array([coord[0], coord[1], coord[2]])\n vec += tmp_arr\n vec = vec / len(atom_coords)\n vec = 
vec.astype(np.int32)\n \n return { \n \"parent_id\": rcsb_id, \n \"landmark\": landmark.name, \n \"residue\": landmark.residue, \n \"position\": resi_id[1],\n \"x\": vec[0], \"y\": vec[1], \"z\": vec[2]\n }\n\nSee the full algorithm here.\n\n\n\n\n\n\nFigure 3: Atoms versus landmarks near the tunnel, shown with the Mole model (dark blue) and the mesh model (light blue) for reference. a) All atoms within 10 Å of the tunnel centerline. b) Mean atomic coordinates of conserved residues within 7.5 Å of the spherical tunnel." }, { - "objectID": "posts/outlier-detection/DeCOr-MDS.html#outlier-distance-correction", - "href": "posts/outlier-detection/DeCOr-MDS.html#outlier-distance-correction", - "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", - "section": "Outlier distance correction", - "text": "Outlier distance correction\nUpon identifying the main subspace containing regular points, our procedure finally corrects the pairwise distances that contain outliers in the matrix \\(\\mathbf{D}\\), in order to apply a MDS that projects the outliers in the main subspace. In the case where the original coordinates cannot be used (e.g, as a result of some transformation or if the distance is non Euclidean), we perform the two following steps: (i) We first apply a MDS on \\(\\mathbf{D}\\) to place the points in a euclidean space of dimension \\(d\\), as a new matrix of coordinates \\(\\tilde{X}\\). (ii) We run a PCA on the full coordinates of the estimated set of regular data points (i.e. \\(\\tilde{X}\\setminus O\\)), and project the outliers along the first \\(\\bar{n}^*\\) principal components of the PCA, since these components are sufficient to generate the main subspace. Using the projected outliers, we accordingly update the pairwise distances in \\(\\mathbf{D}\\) to obtain the corrected distance matrix \\(\\mathbf{D^*}\\). 
Note that in the case where \\(\\mathbf{D}\\) derives from a euclidean distance between the original coordinates, we can skip step (i), and directly run step (ii) on the full coordinates of the estimated set of regular data points." + "objectID": "posts/landmarks-final/index.html#usage-instructions", + "href": "posts/landmarks-final/index.html#usage-instructions", + "title": "Landmarking the ribosome exit tunnel", + "section": "Usage Instructions", + "text": "Usage Instructions\nThe full protocol and datasets are available on GitHub. At the time of writing, the protocol has been run on all 1348 ribosomes currently available on RibosomeXYZ. Landmark coordinates (kingdom-specific and universal) can be found in data/output/landmarks.\nTo assign these initial landmarks, I compiled the sequences for the relevant polymers for all 1348 specimens into polymer-specific fasta files and ran MAFFT sequence alignment on each file. Then, I ran the code to select landmarks based on the full aligned files; therefore, conservation ratios for residues were based on all (currently) available data.\n\n\n\nKingdom\nNumber of specimens\nLandmarks per specimen\n\n\n\n\nEukaryota\n424\n83\n\n\nBacteria\n842\n60\n\n\nArchaea\n82\n47\n\n\nUniversal\n1348\n42\n\n\n\nDistribution of assigned landmarks across currently available ribosomes\nTo obtain landmarks on a ribosome specimen, first check if they have already been assigned. If not, the protocol can be run on new specimens as follows:\n\nUse main_universal.py to assign universal landmarks or main.py to assign kingdom-specific landmarks\nCreate a conda environment based on requirements.txt and activate it\nWith the activated environment, run the following command: python -m protocol file rcsb_id where file is one of main_universal.py or main.py, and rcsb_id is the structure ID.\n\nThe protocol can be run on multiple instances simply by adding more rcsb_id’s to the command. 
For example: python -m protocol file rcsb_id1 rcsb_id2\nNote that running multiple instances in the same command is more efficient when these are new instances, as the alignment will run only once after all new sequences have been added to the fasta files, rather than after each new instance.\n\n\nAs mentioned above, the program will automatically update the fasta files and rerun the alignments to include new instances from the command. This should not change the conserved residues when small numbers of new ribosomes are added, but if you are adding many new ribosomes, you may consider changing the reselect_landmarks boolean flag to True, to ensure that the assigned landmarks reflect the conservation present in the entirety of the data. This flag can also be used to apply changes to conservation and distance threshold parameters. It is important to note, however, that re-selecting landmarks disrupts the correspondence of newly assigned landmarks to previously assigned landmarks." }, { - "objectID": "posts/quasiconformalmap/index.html#theorem", - "href": "posts/quasiconformalmap/index.html#theorem", - "title": "Quasiconformal mapping for shape representation", - "section": "Theorem", - "text": "Theorem" + "objectID": "posts/landmarks-final/index.html#limitations", + "href": "posts/landmarks-final/index.html#limitations", + "title": "Landmarking the ribosome exit tunnel", + "section": "Limitations", + "text": "Limitations\n\nAlignment Efficiency\nThe protocol automatically runs MAFFT sequence alignment from the command line when the input fasta files are updated. However, running MAFFT online can be much faster. 
To maximize efficiency when running the protocol on many ribosomes, I suggest running the input fasta files through MAFFT online, and uploading the resulting alignments into the protocol directory (ensuring to match the location and naming of the original files).\n\n\nMissing Landmarks\nThere remains missing landmarks on many ribosome specimens in the data, due to gaps in the experimental data or unusual specimens (e.g. imaged mid biogenesis). Filtering out these instances would be beneficial prior to analysis.\n\n\nDistribution of Species\nThe available data from RibosomeXYZ is not uniformally distributed across species. There is a heavy skew towards a few model species (E. coli, T. thermophilus, etc.) as shown in Figure 4. This biases the residue conservation calculations. Analysis done on the resulting landmark data should subset appropriately to obtain a more even spread of species.\n\n\n\n\n\n\nFigure 4: Counts of species present in the data from RibosomeXYZ" }, { - "objectID": "posts/RECOVAR/index.html", - "href": "posts/RECOVAR/index.html", - "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods", - "section": "", - "text": "Cryogenic electron microscopy (cryo-EM), a cryomicroscopy technique applied on samples embedding in ice, along with recent development of powerful hardwares and softwares, have achieved huge success in the determination of biomolecular structures at near-atomic level. Cryo-EM takes screenshots of thousands or millions of particles in different poses frozen in the sample, and thus allows the reconstruction of the 3D structure from those 2D projections.\nEarly algorithms and softwares of processing cryo-EM data focus on resolving homogeneous structure of biomolecules. However, many biomolecules are very dynamic in conformations, compositions, or both. 
For example, ribosomes comprise many subunits, and their compositions may vary within the sample and are of research interest. Spike protein is an example of conformational heterogeneity, where the receptor-binding domain (RBD) keeps switching between closed and open states in order to bind to receptors while resisting antibody binding. When studying the antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.\n\n\n\nA simple illustration of the conformational heterogeneity of spike protein, where it displays two kinds of conformations: closed RBD and open RBD of one chain (colored in blue) (Wang et al. 2020). Spike protein is a trimer, so in reality all three chains will move, possibly in different ways, and the motion of spike protein is much more complex than what’s shown here.\n\n\nThe initial heterogeneity analysis of 3D structures reconstructed from cryo-EM data started from relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done by expectation-maximization (EM) algorithms, where 2D particle stacks are iteratively assigned to classes and used to reconstruct the volume of each class. However, such an approach has two problems: first, the classification decreases the number of images used to reconstruct each volume, and thus lowers the achievable resolution; second, the motion of a biomolecule is continuous in reality, so discrete classification may not describe the heterogeneity well, and we may miss some transient states.\nTherefore, recent work focuses on methods that model continuous heterogeneity without any classification step, avoiding the above issues. Most methods adopt a similar structure, where 2D particle stacks are mapped to latent embeddings, clusters/trajectories are estimated in latent space, and finally volumes are mapped and reconstructed from latent embeddings. Early methods use linear mappings (e.g. 
3DVA), but with the application of deep learning techniques to cryo-EM data processing, methods adapted from the variational autoencoder (VAE) were found to achieve better performance (e.g. cryoDRGN, 3DFlex). Nevertheless, the latent spaces obtained from VAEs and other deep learning methods are hard to interpret and do not conserve distances and densities, making it difficult to reconstruct motions/trajectories, which are what most structural biologists ultimately desire.\nThe recently developed software RECOVAR (Gilles and Singer 2024), using a linear mapping like 3DVA, was shown to achieve performance comparable to or even better than deep learning methods, while having high interpretability and allowing easy recovery of motions/trajectories from latent space. For this project, I will review the pipeline of RECOVAR, discuss improvements and extensions we made to this pipeline, and present heterogeneity analysis results from the original paper and our SARS-CoV-2 spike protein dataset."
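The linear-subspace idea shared by 3DVA and RECOVAR can be caricatured as PCA: represent each volume as the mean plus a combination of a few principal components, and use the projection coefficients as the latent embedding. A toy sketch on synthetic vectors — not the actual RECOVAR pipeline, which estimates the subspace from noisy 2D projections:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "volumes": 200 samples lying near a 2D subspace of a 50-dim space
basis = rng.normal(size=(2, 50))
latent_true = rng.normal(size=(200, 2))
volumes = latent_true @ basis + 0.01 * rng.normal(size=(200, 50))

# linear subspace estimate: mean + top principal components (via SVD)
mean = volumes.mean(axis=0)
U, S, Vt = np.linalg.svd(volumes - mean, full_matrices=False)
components = Vt[:2]                       # estimated 2D basis of the subspace
latent = (volumes - mean) @ components.T  # latent embedding of each volume

# any latent point maps back to a volume by a linear reconstruction
recon = mean + latent @ components
```

Because the mapping is linear, distances and densities in this latent space directly reflect distances between projected volumes, which is what makes trajectory recovery straightforward compared with VAE latent spaces.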
}, { - "objectID": "posts/RECOVAR/index.html#background", - "href": "posts/RECOVAR/index.html#background", - "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods", + "objectID": "posts/Neural-Manifold/index.html", + "href": "posts/Neural-Manifold/index.html", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", "section": "", - "text": "Cryogenic electron microscopy (cryo-EM), a cryomicroscopy technique applied on samples embedding in ice, along with recent development of powerful hardwares and softwares, have achieved huge success in the determination of biomolecular structures at near-atomic level. Cryo-EM takes screenshots of thousands or millions of particles in different poses frozen in the sample, and thus allows the reconstruction of the 3D structure from those 2D projections.\nEarly algorithms and softwares of processing cryo-EM data focus on resolving homogeneous structure of biomolecules. However, many biomolecules are very dynamic in conformations, compositions, or both. For example, ribosomes comprise of many sub-units, and their compositions may vary within the sample and are of research interest. Spike protein is an example of conformational heterogeneity, where the receptor-binding domain (RBD) keeps switching between close and open states in order to bind to receptors and meanwhile resist the binding of antibody. When studying the antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.\n\n\n\nA simple illustration of the conformational heterogeneity of spike protein, where it displays two kinds of conformations: closed RBD and open RBD of one chain (colored in blue) (Wang et al. 2020). 
The spike protein is a trimer, so in reality all three chains may move in different ways and the motion of the spike protein is much more complex than what’s shown here.\n\n\nThe initial heterogeneity analysis of 3D structures reconstructed from cryo-EM data started from relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done by expectation-maximization (EM) algorithms, where 2D particle stacks are iteratively assigned to classes and used to reconstruct the volume of each class. However, such an approach has two problems: first, the classification decreases the number of images used to reconstruct each volume, and thus lowers the resolution we are able to achieve; second, the motion of a biomolecule is continuous in reality, so discrete classification may not describe the heterogeneity well, and we may miss transient states.\nTherefore, current methods focus on modeling continuous heterogeneity without any classification step to avoid these issues. Most methods adopt a similar structure: 2D particle stacks are mapped to latent embeddings, clusters/trajectories are estimated in latent space, and finally volumes are reconstructed from the latent embeddings. Early methods use linear mapping (e.g. 3DVA), but with the application of deep learning techniques to cryo-EM data processing, methods adapted from the variational autoencoder (VAE) were found to achieve better performance (e.g. cryoDRGN, 3DFlex). 
Nevertheless, the latent space obtained from VAEs and other deep learning methods is hard to interpret and does not conserve distances and densities, imposing difficulties in reconstructing motions/trajectories, which are what structural biologists ultimately want.\nThe recently developed software RECOVAR (Gilles and Singer 2024), using a linear mapping like 3DVA, was shown to achieve performance comparable to or even better than deep learning methods, while offering high interpretability and easy recovery of motions/trajectories from the latent space. For this project, I will review the pipeline of RECOVAR, discuss improvements and extensions we made to this pipeline, and present heterogeneity analysis results from the original paper and our SARS-CoV2 spike protein dataset." + "objectID": "posts/Neural-Manifold/index.html", + "href": "posts/Neural-Manifold/index.html", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", "section": "", + "text": "Seeing, hearing, touching – every moment, our brain receives numerous sensory inputs. How does it organize this wealth of data and extract relevant information? We know that the brain forms a coherent neural representation of the external world called the cognitive map (Tolman (1948)), formed by the combined firing activity of neurons in the hippocampal formation. For example, place cells are neurons that fire when a rat is at a particular location (Moser, Kropff, and Moser (2008)). Together, the activity of hundreds of these place cells can be modeled as a continuous surface - a ‘manifold’ - the location on which is analogous to the rat’s location in physical space; the rat is indeed creating a cognitive map. Specifically, the hippocampus plays a key role in this process by using path integration to keep track of an animal’s position through the integration of various idiothetic cues (self-motion signals), such as optic flow, vestibular inputs, and proprioception. Manifold learning has emerged as a powerful technique for mapping complex, high-dimensional neural data onto lower-dimensional geometric representations (Mitchell-Heggs et al. 
(2023), Schneider, Lee, and Mathis (2023), Chaudhuri et al. (2019)). To date, it has not been feasible to learn manifolds ‘online’, i.e. while the experiment is in progress. Doing so would allow ‘closed-loop’ experiments, where we can provide feedback to the animal based on its internal representation, and thereby examine how these representations are created and maintained in the brain.\nThe question then arises: Can we decode important navigational behavioural variables during an experiment through manifold learning? And further, can we learn these manifolds online? This blog will focus on experiments conducted in “Control and recalibration of path integration in place cells using optic flow” (Madhav et al. (2024)) and “Recalibration of path integration in hippocampal place cells” (Jayakumar et al. (2019))." }, { - "objectID": "posts/RECOVAR/index.html#methods", - "href": "posts/RECOVAR/index.html#methods", - "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods", - "section": "Methods", - "text": "Methods\n\nRegularized covariance estimation\nLet \\(N\\) be the dimension of the grid and \\(n\\) be the number of images. We start by formulating the formation process of each cryo-EM image in Fourier space \\(y_i\\in\\mathbb{C}^{N^2}\\) from its corresponding conformation \\(x_i\\in\\mathbb{C}^{N^3}\\) as: \\[y_i = C_i\\hat{P}(\\phi_i)x_i + \\epsilon_i, \\epsilon_i\\sim N(0, \\Lambda_i) \\]\nwhere \\(\\hat{P}(\\phi_i)\\) is the projection from 3D to 2D after rigid body motion with pose \\(\\phi_i\\), \\(C_i\\) is the contrast transfer function (CTF), and \\(\\epsilon_i\\) represents the Gaussian noise. RECOVAR assumes that \\(C_i\\) and \\(\\phi_i\\) are given. This can be done via many existing ab-initio methods. 
Hence in the following analysis, we will simply fix the linear map \\(P_i:=C_i\\hat{P}(\\phi_i)\\).\nWhen poses are known, the mean \\(\\mu\\in\\mathbb{C}^{N^3}\\) of the distribution of conformations can be estimated by solving:\n\\[\\hat{\\mu}:=\\underset{\\mu}{\\mathrm{argmin}}\\sum_{i=1}^{n}\\lVert y_i-P_i\\mu\\rVert_{\\Lambda^{-1}}^2+\\lVert\\mu\\rVert_w^2\\]\nwhere \\(\\lVert v\\rVert_{\\Lambda^{-1}}^2=v^*\\Lambda^{-1}v\\) and \\(\\lVert v\\rVert_w^2=\\sum_i|v_i|^2w_i\\). \\(w\\in \\mathbb{R}^{N^3}\\) is the optional Wiener filter. Similarly, the covariance can be estimated as the solution to the linear system corresponding to the following:\n\\[\\hat{\\Sigma}:=\\underset{\\Sigma}{\\mathrm{argmin}}\\sum_{i=1}^n\\lVert(y_i-P_i\\hat{\\mu})(y_i-P_i\\hat{\\mu})^*-(P_i\\Sigma P_i^*+\\Lambda_i)\\rVert_F^2+\\lVert\\Sigma\\rVert_R^2\\]\nwhere \\(\\lVert A\\rVert_F^2=\\sum_{i,j}A_{i,j}^2\\) and \\(\\lVert A\\rVert_R^2=\\sum_{i,j}A_{i,j}^2R_{i,j}\\). \\(R\\) is the regularization weight.\nOur goal at this step is to compute principal components (PCs) from \\(\\hat{\\mu}\\) and \\(\\hat{\\Sigma}\\). Nevertheless, computing the entire matrix \\(\\hat{\\Sigma}\\) is infeasible, since it has \\(N^6\\) entries. Fortunately, for a low-rank covariance matrix only a subset of the columns is required to estimate the entire matrix and its leading eigenvectors, which are exactly the PCs. \\(d\\) PCs can be computed in \\(O(d(N^3+nN^2))\\), much faster than the \\(O(N^6)\\) required to compute the entire covariance matrix. Here a heuristic scheme is used to choose which columns are used to compute the eigenvectors. First, all columns are added to the considered set. Then the column corresponding to the pixel with the highest SNR in the considered set is iteratively added to the chosen set, and nearby pixels are removed from the considered set, until there is a desired number of columns \\(d\\) in the chosen set. 
We estimate the entries of the chosen columns and their complex conjugates and let them form \\(\\hat{\\Sigma}_{col}\\). Let \\(\\tilde{U}\\in\\mathbb{C}^{N^3\\times d}\\) be the orthogonalization of \\(\\hat{\\Sigma}_{col}\\). It follows that we can compute the reduced covariance matrix \\(\\hat{\\Sigma}_{\\tilde{U}}\\) by:\n\\[\\hat{\\Sigma}_{\\tilde{U}}:=\\underset{\\Sigma_{\\tilde{U}}}{\\mathrm{argmin}}\\sum_{i=1}^n\\lVert(y_i-P_i\\hat{\\mu})(y_i-P_i\\hat{\\mu})^*-(P_i\\tilde{U}\\Sigma_{\\tilde{U}}\\tilde{U}^* P_i^*+\\Lambda_i)\\rVert_F^2\\]\nBasically, we just replace \\(\\Sigma\\) in the formula for estimating the entire covariance matrix shown before with \\(\\tilde{U}\\Sigma_{\\tilde{U}}\\tilde{U}^*\\). Finally, we just need to perform an eigendecomposition of \\(\\hat{\\Sigma}_{\\tilde{U}}\\) and obtain \\(\\hat{\\Sigma}_{\\tilde{U}}=V\\Gamma V^*\\). The eigenvectors (which are the PCs we want) and eigenvalues are \\(U:=\\tilde{U}V\\) and \\(\\Gamma\\) respectively.\n\n\nLatent space embedding\nWith the PCs computed in the last step, denoted by \\(U\\in\\mathbb{C}^{N^3\\times d}\\), we can project \\(x_i\\) onto the lower-dimensional latent space by \\(z_i = U^*(x_i-\\hat{\\mu})\\in\\mathbb{R}^d\\). Assuming \\(z_i\\sim N(0,\\Gamma)\\), the MAP estimate of \\(P(z_i|y_i)\\) can be obtained by solving:\n\\[\\hat{a}_i, \\hat{z}_i = \\underset{a_i\\in\\mathbb{R}^+, z_i\\in\\mathbb{R}^d}{\\mathrm{argmin}}\\lVert a_iP_i(Uz_i+\\hat{\\mu})-y_i\\rVert_{\\Lambda_i^{-1}}^2+\\lVert z_i\\rVert_{\\Gamma^{-1}}^2\\]\nwhere \\(a_i\\) is a scaling factor used to capture the effect of variations in image contrast.\n\n\nConformation reconstruction\nAfter computing the latent embeddings, the next question is naturally how to reconstruct conformations from embeddings. The most intuitive way is to do reprojection, i.e. \\(\\hat{x}\\leftarrow Uz+\\hat{\\mu}\\). 
Nevertheless, reprojection only works well when all the relevant PCs can be computed, which is almost impossible considering the low signal-to-noise ratio (SNR) in practice. Therefore, an alternative scheme based on adaptive kernel regression is used here. Given a fixed latent position \\(z^*\\) and the frequency \\(\\xi^k\\in\\mathbb{R}^3\\) in the 3D Fourier space of the volume whose value we would like to estimate, the kernel regression estimates of this form are computed as:\n\\[x(h;\\xi^k) = \\underset{x_k}{\\mathrm{argmin}}\\sum_{i,j}\\frac{1}{\\sigma_{i,j}^2}|C_{i,j}x_k-y_{i,j}|^2K(\\xi^k,\\xi_{i,j})K_i^h(z^*,z_i)\\]\nwhere \\(h\\) is the bandwidth; \\(\\sigma_{i,j}\\) is the variance of \\(\\epsilon_{i,j}\\), which is the noise of frequency \\(j\\) of the \\(i\\)-th observation; \\(y_{i,j}\\) is the value of frequency \\(j\\) of the \\(i\\)-th observation; \\(\\xi_{i,j}\\) is the frequency \\(j\\) of the \\(i\\)-th observation in 3D adjusted by \\(\\phi_i\\). We have two kernel functions in this formulation. \\(K(\\xi^k,\\xi_{i,j})\\) is the triangular kernel, measuring the distance in frequencies. \\(K_i^h(z^*, z_i)=E(\\frac{1}{h}\\lVert z^* - z_i\\rVert_{\\Sigma_{z_i}^{-1}})\\), where \\(\\Sigma_{z_i}\\) is the covariance matrix of \\(z_i\\), which can be computed from the formulation for the latent embedding, and \\(E\\) is a piecewise constant approximation of the Epanechnikov kernel. \\(K_i^h(z^*, z_i)\\) measures the distance between latent embeddings.\nHere lies a trade-off at the heart of every heterogeneous reconstruction algorithm: averaging images is necessary to overcome noise, but it also degrades heterogeneity since the averaged images may come from different conformations. Hence, we have to choose \\(h\\) carefully. A cross-validation strategy is applied to find the optimal \\(h\\) for each frequency shell of each subvolume. 
For a given \\(z^*\\), the dataset is split into two: from one halfset, the 50 estimates \\(\\hat{x}(h_1), ..., \\hat{x}(h_{50})\\) with varying \\(h\\) are computed, and from the other halfset a single low-bias, high-variance template \\(\\hat{x}_{CV}\\) is reconstructed using a small number of images that are closest to \\(z^*\\). Each of the 50 kernel estimates is then subdivided into small subvolumes by real-space masking, and each subvolume is again decomposed into frequency shells after a Fourier transform. We use the following cross-validation metric for subvolume \\(v\\) and frequency shell \\(s\\):\n\\[d_{s,v}(h) = \\lVert S_sV^{-1/2}(M_v(\\hat{x}_{CV}-\\hat{x}(h)))\\rVert_2^2\\]\nwhere \\(S_s\\) is a matrix that extracts shell \\(s\\); \\(M_v\\) is a matrix extracting subvolume \\(v\\); and \\(V\\) is a diagonal matrix containing the variance of the template. For each \\(s\\) and \\(v\\), the minimizer over \\(h\\) of the cross-validation score is selected, and the final volume is obtained by first recombining the frequency shells for each subvolume and then recombining all the subvolumes.\n\n\n\nVolumes are reconstructed from the embedding by adaptive kernel regression.\n\n\n\n\nEstimation of state density\nSince motion is what structural biologists ultimately want, we have to figure out a method to sample from latent space to form a trajectory representing the motion of the molecule. According to Boltzmann statistics, the density of a particular state is a measure of the free energy of that state, which means a path that maximizes conformational density is equivalent to the path minimizing the free energy. Taking advantage of the linear mapping, we can easily relate embedding density to conformational density. 
The embedding density estimator is given by:\n\\[\\hat{E}(z) = \\frac{1}{n}\\sum_{i=1}^nK_G(\\hat{z_i}, \\Sigma_s;z)\\]\nwhere \\(K_G(\\mu, \\Sigma;z)\\) is the probability density function of the multivariate Gaussian with mean \\(\\mu\\) and covariance \\(\\Sigma\\), evaluated at \\(z\\), and \\(\\Sigma_s\\) is set using the Silverman rule. The conformational density is related as follows:\n\\[\\overline{E}(z)=\\overline{G}(z)*d(z)\\]\nwhere \\(\\overline{E}(z)\\) is the expectation of the embedding density \\(\\hat{E}(z)\\); \\(\\overline{G}(z)\\) is the expectation of \\(\\hat{G}(z)=\\frac{1}{n}\\sum_{i=1}^nK_G(0,\\Sigma_{z_i}+\\Sigma_s;z)\\), which is called the embedding uncertainty; \\(d(z)\\) is the conformational density corresponding to \\(z\\); \\(*\\) is the convolution operation.\n\n\nMotion recovery\nGiven the conformational density estimated in the last step, denoted by \\(\\hat{d}(z)\\), a start state \\(z_{st}\\) and an end state \\(z_{end}\\), we can find a trajectory \\(Z(t):\\mathbb{R}^+\\rightarrow\\mathbb{R}^d\\) in latent space by computing the value function:\n\\[v(z):=\\underset{Z(t)}{\\mathrm{inf}}\\int_{t=0}^{t=T_a}\\hat{d}(Z(t))^{-1}dt\\]\nsubject to \\[Z(0)=z, Z(T_a)=z_{end}, \\lVert \\frac{d}{dt}Z(t)\\rVert=1; T_a = \\min\\{t|Z(t)=z_{end}\\}\\]\nIn simple words, \\(v(z)\\) computes the minimal accumulated inverse density required to reach \\(z_{end}\\) starting from \\(z\\). \\(v(z)\\) is the viscosity solution of the Eikonal equation:\n\\[\\hat{d}(z)|\\nabla v(z)|=1, \\forall z\\in B\\setminus \\{z_{end}\\}; v(z_{end})=0\\]\nwhere \\(B\\) is the domain of interest, and \\(v(z)\\) can be obtained by solving this partial differential equation. 
Once \\(v(z)\\) is solved, the optimal trajectory can be obtained by finding the path orthogonal to the level curves of \\(v(z)\\), which can be computed numerically using steepest gradient descent on \\(v(z)\\) starting from \\(z_{st}\\).\n\n\n\nVisualization of the steepest gradient descent on the level curves of v(z)" + "objectID": "posts/Neural-Manifold/index.html#problem-description", + "href": "posts/Neural-Manifold/index.html#problem-description", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", "section": "", + "text": "Seeing, hearing, touching – every moment, our brain receives numerous sensory inputs. How does it organize this wealth of data and extract relevant information? We know that the brain forms a coherent neural representation of the external world called the cognitive map (Tolman (1948)), formed by the combined firing activity of neurons in the hippocampal formation. For example, place cells are neurons that fire when a rat is at a particular location (Moser, Kropff, and Moser (2008)). Together, the activity of hundreds of these place cells can be modeled as a continuous surface - a ‘manifold’ - the location on which is analogous to the rat’s location in physical space; the rat is indeed creating a cognitive map. Specifically, the hippocampus plays a key role in this process by using path integration to keep track of an animal’s position through the integration of various idiothetic cues (self-motion signals), such as optic flow, vestibular inputs, and proprioception. Manifold learning has emerged as a powerful technique for mapping complex, high-dimensional neural data onto lower-dimensional geometric representations (Mitchell-Heggs et al. (2023), Schneider, Lee, and Mathis (2023), Chaudhuri et al. (2019)). To date, it has not been feasible to learn manifolds ‘online’, i.e. while the experiment is in progress. 
Doing so would allow ‘closed-loop’ experiments, where we can provide feedback to the animal based on its internal representation, and thereby examine how these representations are created and maintained in the brain.\nThe question then arises: Can we decode important navigational behavioural variables during an experiment through manifold learning? And further, can we learn these manifolds online? This blog will focus on experiments conducted in “Control and recalibration of path integration in place cells using optic flow” (Madhav et al. (2024)) and “Recalibration of path integration in hippocampal place cells” (Jayakumar et al. (2019))." }, { - "objectID": "posts/RECOVAR/index.html#results", - "href": "posts/RECOVAR/index.html#results", - "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods", - "section": "Results", - "text": "Results\n\nResults of public datasets\nThe original paper of RECOVAR presents results on the precatalytic spliceosome dataset (EMPIAR-10180), the integrin dataset (EMPIAR-10345) and the ribosomal subunit dataset (EMPIAR-10076), all of which are public datasets and can be accessed from https://www.ebi.ac.uk/empiar/.\nResults on EMPIAR-10180 focus on conformational heterogeneity. Three local maxima in conformational density were identified; a path between two of them shows arm regions moving down followed by head regions moving up.\n\n\n\nLatent space and volume view of precatalytic spliceosome conformational heterogeneity. The latent view of the path is projected onto planes formed by different pairs of principal components.\n\n\nEMPIAR-10345 contains both conformational and compositional heterogeneity. Two local maxima were found, with the smaller one corresponding to a composition never reported by previous studies. 
A motion of the arm was also found along the path.\n\n\n\nRECOVAR finds both conformational and compositional heterogeneity within integrin\n\n\nEMPIAR-10076 is used to show the ability of RECOVAR to find stable states. RECOVAR finds two stable states of the 70S ribosomes.\n\n\n\nThe volumes of the two stable states are reconstructed, corresponding to two peaks in density\n\n\n\n\nResults of SARS-CoV2 datasets\nWe also tested RECOVAR on our own dataset, which contains 271,448 SARS-CoV2 spike protein particles extracted using CryoSparc. Some of these particles are bound to human angiotensin-converting enzyme 2 (ACE2), an enzyme on the human cell membrane targeted by the SARS-CoV2 spike protein. Therefore, this dataset has both compositional and conformational heterogeneity.\nAfter obtaining an ab-initio model from CryoSparc, we ran RECOVAR with a dimension of 4 and a relatively small grid size of 128. K-Means clustering was performed to find 5 cluster centers among the embeddings.\nHere we present two volumes reconstructed from center 0 and center 1, showing very obvious compositional heterogeneity, where ACE2 is clearly present in center 0 and missing in center 1.\n\n\n\nCompositional heterogeneity in the spike protein dataset. The spot where ACE2 is present/absent is highlighted by the red circle.\n\n\nA path between centers 0 and 1 was analyzed to study the conformational changes adopted by the spike protein to bind to ACE2. We can see the arm in the RBD region lift in order to bind to ACE2.\n\n\n\nConformational changes along the path between center 0 and 1, highlighted by the yellow circle" + "objectID": "posts/Neural-Manifold/index.html#experimental-setup", + "href": "posts/Neural-Manifold/index.html#experimental-setup", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Experimental Setup", + "text": "Experimental Setup\nIn (Madhav et al. (2024) and Jayakumar et al. (2019)), Dr. 
Madhav and colleagues designed an experimental setup to investigate how optic flow cues influence hippocampal place cells in freely moving rats. Place cells are neurons that fire when an animal is in a specific location.\nLet’s take an example to better understand: imagine a rat moving along a horizontal linear track. For simplicity, let’s say the rat has only 3 place cell neurons. In this case, Neuron 1 would fire when the rat is at the very left of the track, Neuron 2 would fire when the rat is in the middle of the track, and Neuron 3 would fire at the very right of the track. As the rat moves along the track, the specific place cells corresponding to each location become activated, helping the rat to construct an internal cognitive map of its environment.\n\nThe Dome Apparatus\nIn the experiment, rats ran on a circular platform surrounded by a hemispherical projection surface called the Dome.\n\n\n\n\nFig. 1 - Virtual reality Dome apparatus. Rats ran on a circular table surrounded by a hemispherical shell. A projector image reflects off a hemispherical mirror onto the inner surface of the shell.\n\n\n\nThe dome projected moving stripes that provided controlled optic flow cues. The movement of the stripes was tied to the rats’ movement, with the stripe gain (\\(\\mathcal{S}\\)) determining the relationship between the rat’s speed and the stripes’ speed.\n\n\\(\\mathcal{S}\\) = 1: Stripes are stationary relative to the lab frame, meaning the rat is not receiving conflicting cues.\n\\(\\mathcal{S}\\) > 1: Stripes move opposite to the rat’s direction, causing the rat to perceive itself as moving faster than it is.\n\\(\\mathcal{S}\\) < 1: Stripes move in the same direction but slower than the rat, causing the rat to perceive itself as moving slower than it is.\n\nElectrodes were inserted into the CA1 of the hippocampus of male Long-Evans rats and spike rate neural activity was recorded during the experiment. Dr. 
Madhav and colleagues introduce a value \\(\\mathcal{H}\\), called the Hippocampal Gain. It is defined as the relationship between the rat’s physical movement and the updating of its position on the internal hippocampal map. At a high level, we can think of it as the rate at which the rat “perceives” itself to be moving because of the conflicting visual cues. Specifically,\n\\[\n \\mathcal{H} = \\frac{\\text{distance travelled in hippocampal reference frame}}{\\text{distance travelled in lab reference frame}}.\n\\]\nIn this equation, distance travelled in the hippocampal frame refers to the distance that the rat “thinks” it’s moving.\n\n\\(\\mathcal{H} = 1\\): The rat perceives itself as moving the “correct” speed.\n\\(\\mathcal{H} > 1\\): The rat perceives itself as moving faster than it actually is with respect to the lab frame.\n\\(\\mathcal{H} < 1\\): The rat perceives itself as moving slower than it actually is with respect to the lab frame.\n\n\\(\\mathcal{H}\\) gives valuable insights into how these visual cues such as the moving stripes affect the rats’ internal cognitive map during the task. It gives an understanding of how the rats update their perceived position in the environment.\nFor example, an \\(\\mathcal{H}\\) value of 2, would mean that the rat perceives itself as moving twice as fast as it actually is. Consequently each place cell corresponding to a specific location in the maze will fire twice per lap rather than once.\n\n\nDescription of the problem\nMethod of Determining \\(\\mathcal{H}\\): Traditionally, \\(\\mathcal{H}\\) is determined by analyzing the spatial periodicity of place cell firing over multiple laps using Fourier transforms, as seen in (Jayakumar et al. (2019),Madhav et al. (2024)). Below is a figure displaying how the traditional method is used to determine the \\(\\mathcal{H}\\) value.\n\n\n\n\n\nFigure 2 - Spectral decoding algorithm. 
In the dome, as visual landmarks are presented and moved at an experimental gain G, the rat encounters a particular landmark every 1/G laps (the spatial period). If the place fields fire at the same location in the landmark reference frame, the firing rate of the cell exhibits a spatial frequency of G fields per lap. a, Illustration of place-field firing for three values of hippocampal gain, H\n\n\n\n\nThe frequency of firing for each place cell effectively decodes the \\(\\mathcal{H}\\) value for that specific neuron and the mean \\(\\mathcal{H}\\) value over all neurons gives the estimated \\(\\mathcal{H}\\) value over the neuronal population. This method lacks temporal precision within individual laps since it uses a Fourier Transform over 6 laps.\nA more precise, within-lap decoding of Hippocampal Gain (\\(\\mathcal{H}\\)) could provide a deeper understanding of how path integration occurs with finer temporal resolution. This could lead to new insights into how the brain updates its cognitive map when receiving conflicting visual cues.\nAlso, note how the decoding of \\(\\mathcal{H}\\) is directly tied to the neural data, which makes the traditional method less flexible. It cannot easily be applied to experiments involving two varying neural representations (e.g., a spatial gain \\(\\mathcal{H}\\) and an auditory gain \\(\\mathcal{A}\\)). In such cases, the two representations are coupled in the neural data, making it impossible to separate them.\nHowever, neural manifold learning offers a promising approach to decouple these representations. For instance, consider the hypothetical scenario below, where the data forms a torus:\n\n\n\n\n\n\n\n\n\nFigure 3 - Left: varying spatial representation, Right: varying audio representation.\n\nIn our current dataset, we only have a single varying neural representation and therefore expect a simpler 1D ring topology. However, in the above scenario, the data might lie on a torus. 
On this structure, the spatial representation (\\(\\mathcal{H}\\)) could vary along the major circle of the torus, while the auditory representation (\\(\\mathcal{A}\\)) varies along the minor circle. This structure enables us to disentangle and decode the two neural representations independently. This could prove useful in the future when using this method on experiments of this type. We wish to validate this method for single varying representations, and then move on to two varying representations." }, { - "objectID": "posts/RECOVAR/index.html#discussion", - "href": "posts/RECOVAR/index.html#discussion", - "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods", - "section": "Discussion", - "text": "Discussion\nRECOVAR has several advantages over other heterogeneity analysis methods. Besides the high interpretability we mentioned before, RECOVAR has been shown to discover compositional heterogeneity, which some popular deep learning methods like 3DFlex cannot handle. Moreover, RECOVAR has far fewer hyper-parameters to tune compared with deep learning models. The main hyper-parameter the user needs to specify is the number of principal components to use, which is a trade-off between the amount of heterogeneity to capture and computational cost.\nHowever, one problem RECOVAR and many other heterogeneity analysis algorithms share is that they require a homogeneous model/image poses as input. However, the estimation of the consensus model is often biased by heterogeneity, while the heterogeneity analysis assumes the input consensus model is correct (a circular dependency!). Nevertheless, we would expect this issue to be addressed by an EM-style algorithm that iteratively constructs the consensus model and performs heterogeneity analysis. 
In the future, we may also be interested in benchmarking on pose estimation errors, and on other parameters such as the number of principal components, grid size, and particle number, which was not done in the original paper.\nThe other drawback of RECOVAR is that the density-based path recovery approach is computationally expensive. The cost increases exponentially with dimension. In practice, our NVIDIA 24GB GPU could handle at most a dimension of 4, which is usually insufficient to capture enough heterogeneity in cryo-EM datasets with low SNR. We have to look at cheaper ways of finding paths without estimating densities. We are also interested in methods to quantify the compositional heterogeneity along the path, e.g. the probability of SARS-CoV2 spike proteins binding to ACE2 in a certain conformation.\nLast but not least, it will be much easier for structural biologists to study the heterogeneity if we could extend the movie of density maps to a movie of atomic models. This requires fitting atomic models to density maps. Since the density maps in the movies are very similar, we don’t want to fit from scratch every time. Instead, a better approach would be fitting an initial model and then locally refining it for each density map." + "objectID": "posts/Neural-Manifold/index.html#main-goal", + "href": "posts/Neural-Manifold/index.html#main-goal", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Main Goal", + "text": "Main Goal\nOur main goal is therefore to determine this \\(\\mathcal{H}\\) value without using a Fourier Transform and instead find a temporally finer, within-lap estimate of \\(\\mathcal{H}\\) using manifold learning. Some key questions that motivate this research include:\n\nHow does the velocity of the rat affect the \\(\\mathcal{H}\\) value?\nWhat patterns does the \\(\\mathcal{H}\\) value exhibit over the course of a lap? 
Does it relate to other behavioural variables?\n\nSome more important goals of this research include a method of decoding the “hippocampal gain” online and feeding these values back into the dome apparatus to control the \\(\\mathcal{H}\\) value to the desired value for the experiment.\nWe turn to CEBRA Schneider, Lee, and Mathis (2023) as our method of manifold learning. In the next section, we will see how CEBRA can help decode \\(\\mathcal{H}\\) reliably.\nThe basic idea is as follows: First, we aim to project the neural data into some latent space. In this space, we want the points to map out the topology of the task - specifically, to encode hippocampal position/angle (the rat’s position in the hippocampal reference frame). We assume that this task forms a 1D ring topology, given the cyclic nature of the dome setup and the periodic firing of place cells. Then we want to validate and construct a latent parametrization of this manifold, specifically designed to directly reflect the hippocampal position. With an accurate hippocampal position parametrization, we could then decode \\(\\mathcal{H}\\), giving us a more temporally fine estimation of \\(\\mathcal{H}\\).\nNext, we move on to what CEBRA is and how it can help us achieve our goal." }, { - "objectID": "posts/elastic-metric/elastic_metric.html", - "href": "posts/elastic-metric/elastic_metric.html", - "title": "Riemannian elastic metric for curves", - "section": "", - "text": "This page introduces basic concepts of elastic metric, square root velocity metric, geodesic distance and Fréchet mean associated with it." 
+ "objectID": "posts/Neural-Manifold/index.html#what-is-cebra", + "href": "posts/Neural-Manifold/index.html#what-is-cebra", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "What is CEBRA?", + "text": "What is CEBRA?\nCEBRA, introduced in Schneider, Lee, and Mathis (2023), is a powerful self-supervised learning algorithm designed to create consistent, interpretable embeddings of high-dimensional neural recordings using auxiliary variables such as behavior or time. CEBRA generates consistent embeddings across trials, animals, and even different recording modalities​.\nIn our analysis, we will use the discovery mode of CEBRA, with only time as our auxiliary variable. CEBRA is implemented in python.\n\nThe Need for CEBRA\nIn neuroscience, understanding how neural populations encode behavior is a large challenge. Traditional linear methods like PCA, or even non-linear approaches like UMAP and t-SNE, fail in this context because they fail to capture temporal dynamics and lack consistency across different sessions or animals. CEBRA gets past these limitations by both considering temporal dynamics and providing consistency across different sessions or animals.\n\n\nHow Does CEBRA Work?\nCEBRA uses a convolutional neural network (CNN) encoder trained with contrastive learning to produce a latent embedding of the neural data. The algorithm identifies positive and negative pairs of data points, using temporal proximity to structure the embedding space.\n\n\nCEBRA Architecture\n\nContrastive Learning\nThe CEBRA model is trained using a contrastive learning loss function. 
In CEBRA, this is achieved through InfoNCE (Noise Contrastive Estimation), which encourages the model to distinguish between similar (positive) and dissimilar (negative) samples.\nThe loss function is defined as: \\[\n\\mathcal{L} = - \\log \\frac{e^{\\text{sim}(f(x), f(y^+)) / \\tau}}{e^{\\text{sim}(f(x), f(y^+)) / \\tau} + \\sum_{i=1}^{K} e^{\\text{sim}(f(x), f(y_i^-)) / \\tau}}\n\\]\nWhere \\(f(x)\\) and \\(f(y)\\) are the encoded representations of the neural data after passing through the CNN, \\(\\text{sim}(f(x), f(y))\\) represents a similarity measure between the two embeddings, implemented as cosine similarity. Here, \\(y^{+}\\) denotes the positive pair (similar to \\(x\\) in time), \\(y_{i}^{-}\\) denotes the negative pairs (dissimilar to \\(x\\) in time), and \\(\\tau\\) is a temperature parameter that controls the sharpness of the distribution.\nNote that the similarity measure depends on the CEBRA mode used, and we have used time as our similarity measure. The contrastive loss encourages the encoder to map temporally close data points (positive pairs) to close points in the latent space, while mapping temporally distant data points (negative pairs) further apart. This way, the embeddings reflect the temporal structure of the data. The final output is then the embedding value in the latent space. Below is a schematic taken from ({Schneider, Lee, and Mathis (2023)}), showing the CEBRA architecture.\n\n\n\n\n\nFigure 4 - CEBRA Architecture. Input: Neural spike data in the shape (time points, neuron #). Output: Low dimensional embedding\n\n\n\n\nOnce we obtain the neural embeddings from CEBRA, the next step is to determine the underlying manifold that describes the structure of the resulting point cloud. 
For example, let’s consider the output of a CEBRA embedding from one experimental session.\n\n\n\n\n\nFigure 5 - Cebra Embedding for an experiment with Hippocampal Position Annotated as a Color Map\n\n\n\n\nThe embedding appears to form a 1D circle in 3D space. We can also see that the hippocampal position correctly traces the rat’s hippocampal position throughout the experiment. This observation aligns with our expectations, since we predict that the neural activity encodes the hippocampal reference frame position, not the lab frame position. To validate the 1D ring topology, we apply a technique known as Persistent Homology." }, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "", - "text": "In the context of biomedics, understanding disease progression is critical in developing effective diagnostic and therapeutic strategies. Medical imaging provides us with invaluable data, capturing the spatial and structural changes in the human body over time. Yet, analyzing these changes quantitatively and consistently remains challenging. Here, we explore how optimal transport (OT) can be applied to model disease progression in a geometrically meaningful way, providing a tool to predict deformations and shape changes in diseases like neurodegeneration, cancer, and respiratory diseases.\n\n\nOptimal transport is a mathematical framework originally developed to solve the problem of transporting resources in a way that minimizes cost. The problem was formalized by the French mathematician Gaspard Monge in 1781. In the 1920s A.N. Tolstoi was among the first to study the transportation problem mathematically. However, the major advances were made in the field during World War II by the Soviet mathematician and economist Leonid Kantorovich. However, OT is a tough optimization problem. 
In 2000, Benamou and Brenier propose a convex formulation. Villani explains the history and mathematics behind OT in great detail in his book (Villani 2021), which is in fact very popular and well appreciated.\nMathematically, OT finds the most efficient way to “move” one distribution to match another, which is useful in medical imaging where changes in structure and morphology need to be quantitatively mapped over time. OT computes a transport map (or “flow”) that transforms one spatial distribution into another with minimal “work” (measured by the Wasserstein distance). This idea has strong applications in medical imaging, particularly for analyzing disease progression, as it provides a way to track changes in anatomical structures over time.\n\n\n\nState of neurodegeneration in a kid at different ages. (Bastos et al. (2020)) OT can learn the progression or the transformation (T) of brain deformation from the state at 5 year age (\\(\\rho_0\\)) to the final state at 7 year age (\\(\\rho_1\\)) or 9 year age (\\(\\rho_2\\)).\n\n\n\n\n\nThe OT framework is uniquely suited for disease progression modeling because it allows us to:\n\nCapture spatial and structural changes: OT computes a smooth, meaningful transformation, preserving the continuity of shapes, making it ideal for medical images that track evolving structures.\nQuantify changes robustly: By calculating the minimal transport cost, OT provides a quantitative measure of how much a structure (e.g., brain tissue) changes, which can correlate with disease severity.\nCompare across patients and populations: OT-based metrics can be standardized across subjects, enabling comparisons between different patient groups or disease stages.\n\n\n\n\n\nNeurodegeneration (e.g., Alzheimer’s Disease): OT maps brain atrophy across time points in MRI scans, quantifying volume and cortical thickness changes crucial for staging and monitoring Alzheimer’s.\nCancer: OT tracks tumor morphology changes, helping assess treatment 
response by measuring growth, shrinkage, or shape shifts, even aiding relapse predictions.\nRespiratory Diseases (e.g., COPD): OT compares longitudinal lung CTs to quantify tissue loss distribution, providing spatial insights for monitoring COPD progression and treatment adjustment." + "objectID": "posts/Neural-Manifold/index.html#persistent-homology", + "href": "posts/Neural-Manifold/index.html#persistent-homology", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Persistent Homology", + "text": "Persistent Homology\nPersistent homology allows us to quantify and verify the topological features of our embedded space. Specifically, we want to validate the assumption that the neural representation forms a 1D ring manifold, which corresponds to the rat’s navigation behavior within the environment. The idea of persistent homology is to create spheres of varying radii around each point in the point cloud, and from those spheres, track how the topological features of the shape change as the radius grows. By systematically increasing the radius, we can observe when distinct clusters merge, when loops (1D holes) appear, and when higher-dimensional voids form. These features persist across different radius sizes, and their persistence provides a measure of their significance. In the context of neural data, this allows us to detect the underlying topological structure of the manifold. Below is a figure illustrating this method Schneider, Lee, and Mathis (2023):\n\n\n\n\n\nFigure 6 - Persistent Homology\n\n\n\n\n\nValidating a 1D Ring Manifold\nTo confirm the circular nature of the embedding, we analyze the Betti numbers derived from the point cloud. Betti numbers describe the topological features of a space, with the \\(k\\)-th Betti number counting the number of \\(k\\)-dimensional “holes” in the manifold. 
Below is a figure showing a few basic topological spaces and their corresponding Betti numbers Walker (2008):\n\n\n\n\n\nFigure 7 - Some simple topological spaces and their Betti numbers \n\n\n\n\nFor a 1D ring, the expected Betti numbers are: \\[\n\\beta_0 = 1 : \\text{One connected component.}\n\\] \\[\n\\beta_1 = 1 : \\text{One 1D hole (i.e., the circular loop).}\n\\] \\[\n\\beta_2 = 0 : \\text{No 2D voids.}\n\\]\nThus, the expected Betti numbers for our manifold are (1, 1, 0). If the Betti numbers extracted from the persistent homology analysis align with these values, it confirms that the neural dynamics trace a 1D circular trajectory, supporting our hypothesis that the hippocampal representation forms a ring corresponding to the rat’s navigation path." }, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#what-is-optimal-transport", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#what-is-optimal-transport", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "", - "text": "Optimal transport is a mathematical framework originally developed to solve the problem of transporting resources in a way that minimizes cost. The problem was formalized by the French mathematician Gaspard Monge in 1781. In the 1920s A.N. Tolstoi was among the first to study the transportation problem mathematically. However, the major advances were made in the field during World War II by the Soviet mathematician and economist Leonid Kantorovich. However, OT is a tough optimization problem. In 2000, Benamou and Brenier propose a convex formulation. Villani explains the history and mathematics behind OT in great detail in his book (Villani 2021), which is in fact very popular and well appreciated.\nMathematically, OT finds the most efficient way to “move” one distribution to match another, which is useful in medical imaging where changes in structure and morphology need to be quantitatively mapped over time. 
OT computes a transport map (or “flow”) that transforms one spatial distribution into another with minimal “work” (measured by the Wasserstein distance). This idea has strong applications in medical imaging, particularly for analyzing disease progression, as it provides a way to track changes in anatomical structures over time.\n\n\n\nState of neurodegeneration in a kid at different ages. (Bastos et al. (2020)) OT can learn the progression or the transformation (T) of brain deformation from the state at 5 year age (\\(\\rho_0\\)) to the final state at 7 year age (\\(\\rho_1\\)) or 9 year age (\\(\\rho_2\\))." + "objectID": "posts/Neural-Manifold/index.html#spud-method", + "href": "posts/Neural-Manifold/index.html#spud-method", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "SPUD Method", + "text": "SPUD Method\nOnce we’ve validated the assumption that our data forms a 1D ring manifold, we can proceed to fitting a spline to the data. We do this so that we can parametrize our behavioural variable \\(\\mathcal{hippocampal angle}\\) along the point cloud. There are many different methods, but the one chosen for this purpose was taken from Chaudhuri et al. (2019). The spline is defined by a set of points, or knots, which I decided to initialize using kmedoids clustering Jin and Han (2011). The knots are then fit to the data further by minimizing a loss function defined as follows:\n\\[\n\\text{cost} = \\text{dist} + \\text{curvature} + \\text{length} - \\text{log(density)}\n\\]\nwhere dist is the distance of each point to the spline, curvature is the total curvature of the spline, length is the total length of the spline, and density is the point cloud density of each knot.\n\nOverview of the SPUD Method\nSpline Parameterization for Unsupervised Decoding (SPUD) Chaudhuri et al. (2019) is a multi-step method designed to parametrize a neural manifold. 
The goal of SPUD is to provide an on-manifold local parameterization using a local coordinate system rather than a global one. This method is particularly useful when dealing with topologically non-trivial variables that have a circular structure.\nSpline Parameterization: SPUD parameterizes the manifold by first fitting a spline to the underlying structure. Chaudhuri et al. (2019) demonstrated that this works for head direction cells in mice to accurately parametrize, i.e. decode the head direction. Our goal is to have the parametrization accurately decode our latent variable of interest, the Hippocampal Gain (\\(\\mathcal{H}\\)).\n\n\nDeciding the Parameterization of the Latent Variable\n\nNatural Parametrization\nA natural parameterization would mean that equal distances in the embedding space correspond to equal changes in the latent variable. The natural parameterization comes from the assumption that neural systems allocate resources based on the significance or frequency of stimuli. For example, in systems like the visual cortex, stimuli that occur frequently (e.g., vertical or horizontal orientations) might be encoded with higher resolution. However, for systems like place cell firing, where all angles are spaces are equally probable in the dome, the natural parameterization reflects this uniform encoding strategy, with no overrepresentation of certain places (Chaudhuri et al. (2019)).\n\n\nAlternative Parameterization and its Limitations\nAn alternative parameterization method was considered, in which intervals between consecutive knots in the spline were set to represent equal changes in the latent variable. This approach was designed to counteract any potential biases in the data due to over- or under-sampling in certain regions of the manifold.\nHowever, this alternative was not determined to be effective in practice by Chaudhuri et al. (2019). 
Given sufficient data, the natural parameterization performed better, supporting the conclusion that it better reflects how neural systems encode variables. This is also the case for our experiment. Look to the following figure, in which a spline is fit to the data and a color map is applied to the natural parametrization. We can see that it aligns almost perfectly with the hippocampal angle. Great, that’s exactly what we wanted!\n\n\n\n\n\nFigure 8 - Spline fit on CEBRA embedding\n\n\n\n\nSo, what do we do now that we have an accurate parametrization of the \\(\\mathcal{hippocampal \\. angle}\\)?" }, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#why-optimal-transport-for-disease-progression", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#why-optimal-transport-for-disease-progression", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "", - "text": "The OT framework is uniquely suited for disease progression modeling because it allows us to:\n\nCapture spatial and structural changes: OT computes a smooth, meaningful transformation, preserving the continuity of shapes, making it ideal for medical images that track evolving structures.\nQuantify changes robustly: By calculating the minimal transport cost, OT provides a quantitative measure of how much a structure (e.g., brain tissue) changes, which can correlate with disease severity.\nCompare across patients and populations: OT-based metrics can be standardized across subjects, enabling comparisons between different patient groups or disease stages." 
+ "objectID": "posts/Neural-Manifold/index.html#decoding-hippocampal-gain-mathcalh", + "href": "posts/Neural-Manifold/index.html#decoding-hippocampal-gain-mathcalh", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Decoding Hippocampal Gain (\\(\\mathcal{H}\\))", + "text": "Decoding Hippocampal Gain (\\(\\mathcal{H}\\))\n\nFinal Step\nThe final step is to decode \\(\\mathcal{H}\\) from the parametrization. The method to do this is straightforward. Once we have parametrized the spline accurately to the neural data, we calculate the hippocampal gain by comparing the distance/angle traveled in the neural manifold (derived from our spline) to the distance/angle in the lab frame (actual movement of the rat).\nThe idea is that:\n\\[\n\\mathcal{H} = \\frac{d\\theta_\\mathcal{H}}{d\\theta_\\mathcal{L}}\n\\]\nwhere \\(\\theta_H\\) is the change in angle in the hippocampal reference frame, decoded from our spline parametrization of the neural manifold, and \\(\\theta_L\\) is the physical angle traveled by the rat in the lab frame.\nNote that this is actually just the original definition of \\(\\mathcal{H}\\), but now \\(\\theta_H\\) is determined by our spline parameter, not the Fourier Transform method.\nFor example, let’s take a time interval, say 1–2 seconds. To determine the hippocampal gain within that frame, we observe where the neural activity at times 1 and 2 maps in our manifold, calling these \\(\\theta_{H1}\\) and \\(\\theta_{H2}\\), respectively. Then, using the lab frame angles at times 1 and 2, which we’ll call \\(\\theta_{L1}\\) and \\(\\theta_{L2}\\), we find that:\n\\[\n \\mathcal{H}(\\text{between } t=1 \\text{ and } t=2) = \\frac{\\theta_{\\mathcal{H2}} - \\theta_{\\mathcal{H1}}}{\\theta_{\\mathcal{L2}} - \\theta_{\\mathcal{L1}}}\n\\]\nWe extend the above example to all consecutive time points in the experiment to compute hippocampal gain (\\(\\mathcal{H}\\)) dynamically. 
The following Python code demonstrates how this is implemented:\n\ndef differentiate_and_smooth(data=None, window_size=3):\n #Compute finite differences.\n diffs = np.diff(data)\n \n # Compute the moving average of differences.\n kernel = np.ones(window_size) / window_size\n avg_diffs = np.convolve(diffs, kernel, mode='valid') \n \n return avg_diffs\n\nderivative_decoded_angle_rad_unwrap = differentiate_and_smooth(data=filtered_decoded_angles_unwrap, window_size=60) #hippocampal angle from manifold parametrization.\nderivative_true_angle_rad_unwrap = differentiate_and_smooth(data=binned_true_angle_rad_unwrap, window_size=60) #true angle from session recordings.\nderivative_hipp_angle_rad_unwrap = differentiate_and_smooth(data=binned_hipp_angle_rad_unwrap, window_size=60) #hippocampal angle from Fourier Transform (traditional method, can be thought of as ground truth).\n\n\ndecode_H = (derivative_decoded_angle_rad_unwrap) / (derivative_true_angle_rad_unwrap) #take the \"derivative\" of hippocampal angle at each time point and divide by \"derivative\" of true angle at each time point.\n\n#Now, plot H from manifold optimization vs H from traditional method (shown in results).\nThis code calculates the hippocampal gain, \\(\\mathcal{H}\\), by dividing the derivative of the hippocampal angle (obtained from the manifold parameterization) by the derivative of the true angle (obtained from session recordings). The result can be compared to \\(\\mathcal{H}\\) from the traditional Fourier-based method, as shown in the results section." 
}, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#popular-applications-of-ot-to-study-disease-progression", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#popular-applications-of-ot-to-study-disease-progression", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "", - "text": "Neurodegeneration (e.g., Alzheimer’s Disease): OT maps brain atrophy across time points in MRI scans, quantifying volume and cortical thickness changes crucial for staging and monitoring Alzheimer’s.\nCancer: OT tracks tumor morphology changes, helping assess treatment response by measuring growth, shrinkage, or shape shifts, even aiding relapse predictions.\nRespiratory Diseases (e.g., COPD): OT compares longitudinal lung CTs to quantify tissue loss distribution, providing spatial insights for monitoring COPD progression and treatment adjustment." + "objectID": "posts/Neural-Manifold/index.html#results", + "href": "posts/Neural-Manifold/index.html#results", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Results", + "text": "Results\nWe now display and discuss the results. Below are a few results from applying this method to real experimental data from “Control and recalibration of path integration in place cells” (Madhav et al. (2024)). We first show two “good” trials (session 50 and 36), and two “bad” trials (session 26 and 29). We had trials where our data did not trace out a 1D ring topology in the pointcloud as can be clearly seen from the spline parametrization (and which can be easily quanitatively assessed using persistent homology). I will explain more clearly below what we mean by “good” and “bad”.\n\nPoint clouds and parametrization\n\n\n\n\nFigure 9 - Embeddings for both successful and unsuccessful trials: (a) Session 50 (top) and Session 36 (bottom) show embeddings with and without the spline fit (in red), representing successful trials. 
(b) Session 26 (top) and Session 29 (bottom) show embeddings for unsuccessful trials, where the manifold does not form a clear 1D ring topology.\n\n\n\nNow we plot our H value decoded from the manifold versus the H value decoded from the Fourier Transform method and compare for “good” trials and “bad” trials.\n\n\nH values\n\n\na \n\n\nb \n\n\nc \n\n\nd \n\n\n\nFigure 5 - Plot of manifold-decoded gain (red) vs. gain from the traditional method (blue) for different sessions: (a) Session 50, (b) Session 26, (c) Session 36, and (d) Session 29.\n\nAfter observing both successful and unsuccessful trials, I asked: what distinguishes “good” results from “bad” ones?\nIt became evident that the quality of results was strongly influenced by the number of neurons in the experimental recording. To quantify the quality of an embedding, I used the Structure Index (SI) score (Sebastian, Esparza, and Prida (2022)). The SI score measures how well the hippocampal angle is distributed across the point cloud.\n\nSI ranges from 0 to 1:\n\n0: The hippocampal angle is randomly distributed within the point cloud.\n1: The hippocampal angle is perfectly distributed, indicating a clear and accurate representation.\n\n\nThus, a higher SI score corresponds to a better alignment between the hippocampal angle and the manifold parameterization.\n\n\nResults\nConsider the trials discussed earlier:\n\nSuccessful trials (Sessions 50 and 36): SI scores were 0.89 and 0.9, respectively.\nUnsuccessful trials (Sessions 26 and 29): SI scores were 0.34 and 0.67, respectively.\n\nThe plot below illustrates the relationship between the number of neurons (or clusters) and the SI score. 
This highlights what I refer to as the “curse of clusters”: A minimum number of clusters (neurons) is required to achieve a successful trial.\n\n\n\n\nFigure 10 - Relationship between number of clusters (neurons) and SI score.\n\n\n\nThis shows that trials with fewer neurons (<35 clusters) are more likely to fail, while those with more neurons (>35 clusters) generally produce high-quality embeddings with accurate parameterization.\nIf the number of neurons was less than 35, we got “bad” results, and if the number of neurons was greater than 35, we got “good” results. We determined that in order to get an accurate \\(\\mathcal{H}\\) decoding, we need at least 35 neurons in the recording. Look at the plot below, where we look at the relationship between number of clusters and \\(\\mathcal{H}\\) decode error. The \\(\\mathcal{H}\\) decode error is calculated as, \\[\n\\text{mean} \\, \\mathcal{H} \\, \\text{decode error} = \\frac{1}{n} \\sum_{i=1}^{n} \\left( H_{\\text{decode}}[i] - H_{\\text{traditional}}[i] \\right),\n\\]\nwhere the sum is taken over all time indices in each array, and ( n ) is the number of time points.\n\n\n\n\nFigure 11 - Plot of number of clusters (neurons) vs mean \\(\\mathcal{H}\\) error.\n\n\n\nThe majority of trials with more than 35 clusters (neurons) have a mean \\(\\mathcal{H}\\) decode error of less than 0.01. However, some trials with more than 35 clusters exhibit a higher decode error.\nThe reason for this discrepancy lies in the topology of the manifold produced by CEBRA. Even when the trial appears “good” based on the SI metric, CEBRA does not always produce a 1D ring topology, which is crucial for accurate \\(\\mathcal{H}\\) decoding.\nAddressing this limitation will be part of the next steps in our methodology." 
}, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#monge-formulation", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#monge-formulation", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "Monge Formulation", - "text": "Monge Formulation\nThe Monge formulation of optimal transport, introduced in 1781, addresses the problem of moving mass efficiently from one distribution to another. Given two distributions:\n\nSource Distribution: \\(\\mu\\) on \\(X\\)\nTarget Distribution: \\(\\nu\\) on \\(Y\\)\n\nwe seek a transport map \\(T\\): \\(X\\) to \\(Y\\) that minimizes the transport cost, typically \\(c(x, T(x)) = \\|x - T(x)\\|^p\\).\nThe Monge problem can be written as:\n\\[\n\\min_T \\int_X c(x, T(x)) \\, d\\mu(x)\n\\]\nsubject to \\(T_\\# \\mu = \\nu\\), meaning that the map \\(T\\) must push \\(\\mu\\) to \\(\\nu\\), ensuring all mass is preserved without splitting.\nKey Points:\n\nTransport Map \\(T\\): Monge’s formulation requires a direct mapping of mass from \\(\\mu\\) to \\(\\nu\\).\nNo Mass Splitting: Unlike relaxed formulations, the Monge problem doesn’t allow fractional mass transport, making it challenging to solve in complex cases.\nCost Function: The choice of \\(c(x, y)\\) affects the solution—common choices include distance \\(\\|x - y\\|\\) and squared distance \\(\\|x - y\\|^2\\).\n\n\nShortcoming\nThe Monge formulation lacks flexibility due to its one-to-one mapping constraint, which led to the Kantorovich relaxation, allowing more general solutions by enabling mass splitting. The Monge formulation captures the essence of spatial mass transport with minimal cost, inspiring modern approaches in diverse fields." 
+ "objectID": "posts/Neural-Manifold/index.html#next-steps", + "href": "posts/Neural-Manifold/index.html#next-steps", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Next steps", + "text": "Next steps\n\nApply the Method to Raw, Unfiltered Spike Data\nInstead of relying on manual, ad hoc clustering to identify neurons and spike trains, we propose applying CEBRA directly to the raw recorded neural data. This approach could help with issues related to the “curse of clusters,” as it eliminates the dependency on clustering quality and the number of detected clusters.\nSimulate an Online Environment\nTest whether this method can be applied in a simulated “online” experimental environment. This would involve decoding neural representations in real time during an experiment, enabling closed-loop feedback and dynamic manipulation of experimental variables.\nModify the CEBRA Loss Function\nAdapt the CEBRA loss function to incorporate constraints that bias the resulting point cloud to lie on a desired topology. For instance, by guiding the embedding toward a 1D ring or a higher-dimensional structure, we could improve the consistency and interpretability of the manifold representation." }, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#kantorovich-formulation", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#kantorovich-formulation", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "Kantorovich formulation", - "text": "Kantorovich formulation\nThe Kantorovich formulation, introduced by Leonid Kantorovich in 1942 (Kantorovich (2006)), generalizes the Monge problem by allowing “mass splitting,” where mass from one source point can be distributed to multiple target points. 
This flexibility makes it possible to solve a broader range of transport problems.\nKantorovich’s Problem:\nInstead of finding a single transport map \\(T\\), the Kantorovich formulation seeks a transport plan \\(\\gamma\\), a joint probability distribution on \\(X \\times Y\\), such that:\n\\[\n\\min_\\gamma \\int_{X \\times Y} c(x, y) \\, d\\gamma(x, y)\n\\]\nwhere \\(c(x, y)\\) represents the cost of transporting mass from \\(x \\in X\\) to \\(y \\in Y\\). The transport plan \\(\\gamma\\) must satisfy marginal constraints:\n\\[\n\\int_Y d\\gamma(x, y) = d\\mu(x) \\quad \\text{and} \\quad \\int_X d\\gamma(x, y) = d\\nu(y),\n\\]\nensuring that \\(\\gamma\\) moves all mass from \\(\\mu\\) to \\(\\nu\\).\nKey Points:\n\nTransport Plan \\(\\gamma\\): A probability measure over \\(X \\times Y\\) that allows fractional mass movement, broadening the solution space.\nMarginal Constraints: These ensure \\(\\gamma\\) aligns with source \\(\\mu\\) and target \\(\\nu\\) distributions, preserving total mass.\nCost Function: Commonly, \\(c(x, y) = \\|x - y\\|\\) or \\(c(x, y) = \\|x - y\\|^2\\), chosen based on the desired penalty for transport distance.\n\nAdvantages:\n\nFlexibility: Mass splitting allows for a solution even when \\(\\mu\\) and \\(\\nu\\) have different structures (e.g., continuous to discrete).\nComputational Feasibility: The problem can be solved via linear programming or faster algorithms using entropic regularization.\n\nHence, the Kantorovich formulation provides a robust framework for optimal transport problems, enabling applications across fields where flexibility and computational efficiency are essential." 
+ "objectID": "posts/Neural-Manifold/index.html#conclusion", + "href": "posts/Neural-Manifold/index.html#conclusion", + "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", + "section": "Conclusion", + "text": "Conclusion\nIn this work, we demonstrated the power of CEBRA to decode hippocampal gain (\\(\\mathcal{H}\\)) at finer temporal resolutions without relying on traditional Fourier transform-based approaches. By embedding neural population activity into a low-dimensional latent space that captures the underlying topological structure of the experimental task, we successfully reconstructed a 1D ring manifold corresponding to the rat’s hippocampal reference frame. Persistent homology validated the circular topology, and the SPUD method was used to parametrize the manifold, enabling the decoding of hippocampal gain.\nWe found that at least 35 well-isolated clusters (neurons) were needed for robust manifold estimation. Below this threshold, we had poor quality and topology of the embeddings, leading to inaccurate \\(\\mathcal{H}\\) decoding. Despite these issues, the results demonstrate the potential of manifold learning for experimental tasks of this type. This work will enable new experiments for causal modeling of the neural circuits responsible for cognitive representations." }, { - "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#benamou-brenier-formulation-convex-ot", - "href": "posts/ImageMorphing/OT4DiseaseProgression.html#benamou-brenier-formulation-convex-ot", - "title": "Optimal Mass Transport and its Convex Formulation", - "section": "Benamou-Brenier Formulation (Convex OT)", - "text": "Benamou-Brenier Formulation (Convex OT)\nThe Benamou-Brenier formulation (Benamou and Brenier (2000)) provides a dynamic perspective on optimal transport, interpreting it as a fluid flow problem. 
Instead of transporting mass directly between two distributions, this approach finds the path of minimal “kinetic energy” needed to continuously transform one distribution into another over time.\nThe Benamou-Brenier formulation considers a probability density \\(\\rho(x, t)\\) evolving over time \\(t \\in [0, 1]\\) from an initial distribution \\(\\rho_0\\) to a final distribution \\(\\rho_1\\). The goal is to find a velocity field \\(v(x, t)\\) that minimizes the action, or “kinetic energy” cost:\n\\[\n\\min_{\\rho, v} \\int_0^1 \\int_X \\frac{1}{2} \\|v(x, t)\\|^2 \\rho(x, t) \\, dx \\, dt,\n\\]\nsubject to the continuity equation:\n\\[\n\\frac{\\partial \\rho}{\\partial t} + \\nabla \\cdot (\\rho v) = 0,\n\\]\nwhich ensures mass conservation from \\(\\rho_0\\) to \\(\\rho_1\\).\nKey Points:\n\nDynamic Interpretation: Unlike Monge and Kantorovich, the Benamou-Brenier formulation finds a time-dependent transformation, representing a continuous flow of mass.\nVelocity Field \\(v(x, t)\\): Defines the “direction” and “speed” of mass movement, yielding a smooth, physical path of minimal kinetic energy.\nContinuity Equation: Ensures mass conservation over time, maintaining that mass neither appears nor disappears.\n\nAdvantages:\n\nSmoothness: Provides a continuous path for evolving distributions, well-suited for dynamic processes.\nComputational Benefits: The problem is formulated as a convex optimization over a flow field, often solved with efficient numerical methods.\n\nThe Benamou-Brenier formulation expands optimal transport by introducing a dynamic flow approach, making it especially useful for applications requiring continuous transformations. Its physical interpretation has brought valuable insights to fields that rely on time-evolving processes." 
+ "objectID": "posts/ribosome-tunnel-new/index.html", + "href": "posts/ribosome-tunnel-new/index.html", + "title": "3D tessellation of biomolecular cavities", + "section": "", + "text": "We present a protocol to extract the surface of a biomolecular cavity for shape analysis and molecular simulations.\nWe apply and illustrate the protocol on the ribosome structure, which contains a subcompartment known as the ribosome exit tunnel or “nascent polypeptide exit tunnel” (NPET). More details on the tunnel features and biological importance can be found in our previous works1,2.\nThe protocol was designed to refine the output obtained from MOLE software3, but can be applied to reconstruct a mesh on any general point cloud. Hence, we take the point-cloud of atom positions surrounding the tunnel as a point of departure.\n\n\n\nIllustration of the ribosome exit tunnel (from Dao Duc et al., NAR 2019)\n\n\n\n\n\n\n\n\n\n\nSchematic representation of the protocol" }, { - "objectID": "posts/AFM-data_2/index.html", - "href": "posts/AFM-data_2/index.html", - "title": "Extracting cell geometry from Atomic Force Microscopy", + "objectID": "posts/ribosome-tunnel-new/index.html#summary-and-background", + "href": "posts/ribosome-tunnel-new/index.html#summary-and-background", + "title": "3D tessellation of biomolecular cavities", "section": "", - "text": "Intro\n\n\nReferences" + "text": "We present a protocol to extract the surface of a biomolecular cavity for shape analysis and molecular simulations.\nWe apply and illustrate the protocol on the ribosome structure, which contains a subcompartment known as the ribosome exit tunnel or “nascent polypeptide exit tunnel” (NPET). More details on the tunnel features and biological importance can be found in our previous works1,2.\nThe protocol was designed to refine the output obtained from MOLE software3, but can be applied to reconstruct a mesh on any general point cloud. 
Hence, we take the point-cloud of atom positions surrounding the tunnel as a point of departure.\n\n\n\nIllustration of the ribosome exit tunnel (from Dao Duc et al., NAR 2019)\n\n\n\n\n\n\n\n\n\n\nSchematic representation of the protocol" }, { - "objectID": "posts/biology/index.html", - "href": "posts/biology/index.html", - "title": "Embryonic cell size asymmetry analysis", - "section": "", - "text": "Introduction and motivation\nCells propagate via cell division. In multicellular organisms, certain cells divide asymmetrically, which results in generating cell diversity (Jan and Jan 1998). There are several cues for asymmetric cell division, including cell polarity establishment, spindle positioning, division site specification (Li 2013), and signals from neighboring cells (Horvitz and Herskowitz 1992). These cues allow multicellular organisms develop correctly, and their misregulation can lead to disorders from developmental defects to cancer.\nIn Caenorhabditis elegans four cell stage embryos, the endomesodermal precursor (EMS) cell gives rise to mesoderm and endoderm cells. For this asymmetric division, the EMS cell receives signals from a neighboring P2 cell (Rocheleau et al. 1997). In response to them, the daughter cell closest to the P2 cell develops into endoderm, and its sister develops into mesoderm (Goldstein 1992) (Figure 1). In situations where the signal is absent, both EMS daughters develop into mesoderm, and the embryo is non-viable.\n\n\n\nFigure 1: EMS cell division. EMS cell division in a four-cell C. elegans embryo. The EMS cell receives signals from P2 cell to develop into endoderm (gut) and mesoderm (muscle) precursors. Adapted from (Goldstein 1992)\n\n\nIn the EMS division, daughter cells appear to adopt different shapes (Caroti et al. 2021). Additionally, the shape of daughter cells changes if Wnt signaling is absent. It is possible that these differences are correlated to the fate of the daughter cells. 
If gradual, these differences could also be used to identify the strength of cell response to external cues, such as Wnt signaling. It is also possible that before the birth of the E and MS cells, the shape of the EMS cell changes in response to external cues. Analysis of the EMS cell shape in different contexts could therefore prove to be a useful tool to understanding differentiation and development. Finally, developing quantitative size and shape analysis tools can reduce human bias and help speed up the experimental procedures in both understanding EMS and its daughters’ fates.\n\n\n\nFigure 2, E and MS cells adopt different shapes. Top row, EMS (parent) cell in three embryos. Bottom row, purple: MS cell, blue: E cell. Taken from (Caroti et al. 2021) - Figure 2 E.\n\n\nThere are several different methods to analyze shapes and sizes. Given that different analysis tools yield different results (Dryden and Mardia 2016, 37), it is helpful to consider a few before finding the best tool for subsequent use.\n\n\nCentroid size analysis\nCentroid analysis is a tool to measure the size of a shape on Cartesian coordinates and is defined by (Dryden and Mardia 2016, 34)as:\n\\[\nS(X) = \\sqrt{\\sum_{i=1}^{k} \\sum_{j=1}^{m}\\left( X_{ij} - \\bar{X}_j \\right)^2}, \\quad X \\in \\mathbb{R}^{k \\times m}\n\\]\nWhere \\(X_{ij}\\) is a matrix entry, and \\(\\bar{X}_j\\) is a mean of the j’th dimension of the matrix.\nCentroid is a simple tool to estimate size of a shape and could help to easily quantify differences between different cell groups in a sample, for example, EMS or E cell with and without a signal from the P2 cell.\n\n\nEuclidean distance matrix analysis (EDMA)\nEDMA is a version of multidimensional scale analysis that accounts for a bias in landmark distribution (Dryden and Mardia 2016, 357–60). This analysis focuses on distances between landmarks and can handle missed landmarks (Lele 1993). 
This method corrects for landmark distribution biases and can be used to test for shape differences (EDMA-I and EDMA-II).\nWhile centroids are useful in estimating size of a shape, EDMA can be helpful in finding differences in shape itself. There are a number of other tools to estimate shape differences, including square root velocity (SRV) function - a landmark-independent tool for analysing differences in shape and curvature (Srivastava et al. 2011). Independence from landmarks might result in more precise shape comparisons, however, it renders analysis computationally intensive. Analysis of the dividing EMS and E/MS cells can be performed using any of these methods, the easiest being centroid size estimation, which does not account for shape differences. Incorporating more complex analysis tools would allow for more understanding in how the cell shape changes. Additionally, it could be extrapolated to more complex analyses, such as time series or 3D images. These tools could help further understand what affects EMS daughter cells and whether their shape is linked to their fate.\n\n\n\n\n\nReferences\n\nCaroti, Francesca, Wim Thiels, Michiel Vanslambrouck, and Rob Jelier. 2021. “Wnt Signaling Induces Asymmetric Dynamics in the Actomyosin Cortex of the C. Elegans Endomesodermal Precursor Cell.” Front Cell Dev Biol 9 (September): 702741. https://doi.org/10.3389/fcell.2021.702741.\n\n\nDryden, Ian L., and Kanti V. Mardia. 2016. Statistical Shape Analysis, with Applications in R. 1st ed. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9781119072492.\n\n\nGoldstein, Bob. 1992. “Induction of Gut in Caenorhabditis Elegans Embryos.” Nature 357 (6375): 255–57. https://doi.org/10.1038/357255a0.\n\n\nHorvitz, H.Robert, and Ira Herskowitz. 1992. “Mechanisms of Asymmetric Cell Division: Two Bs or Not Two Bs, That Is the Question.” Cell 68 (2): 237–55. https://doi.org/10.1016/0092-8674(92)90468-R.\n\n\nJan, Yuh Nung, and Lily Yeh Jan. 1998. 
“Asymmetric Cell Division.” Nature 392 (6678): 775–78. https://doi.org/10.1038/33854.\n\n\nLele, Subhash. 1993. “Euclidean Distance Matrix Analysis (EDMA): Estimation of Mean Form and Mean Form Difference.” Math Geol 25 (5): 573–602. https://doi.org/10.1007/BF00890247.\n\n\nLi, Rong. 2013. “The Art of Choreographing Asymmetric Cell Division.” Developmental Cell 25 (5): 439–50. https://doi.org/10.1016/j.devcel.2013.05.003.\n\n\nRocheleau, Christian E, William D Downs, Rueyling Lin, Claudia Wittmann, Yanxia Bei, Yoon-Hee Cha, Mussa Ali, James R Priess, and Craig C Mello. 1997. “Wnt Signaling and an APC-Related Gene Specify Endoderm in Early C. Elegans Embryos.” Cell 90 (4): 707–16. https://doi.org/10.1016/S0092-8674(00)80531-0.\n\n\nSrivastava, A, E Klassen, S H Joshi, and I H Jermyn. 2011. “Shape Analysis of Elastic Curves in Euclidean Spaces.” IEEE Trans. Pattern Anal. Mach. Intell. 33 (7): 1415–28. https://doi.org/10.1109/TPAMI.2010.184." + "objectID": "posts/ribosome-tunnel-new/index.html#pointcloud-preparation-bounding-box-and-voxelization", + "href": "posts/ribosome-tunnel-new/index.html#pointcloud-preparation-bounding-box-and-voxelization", + "title": "3D tessellation of biomolecular cavities", + "section": "1. Pointcloud Preparation: Bounding Box and Voxelization", + "text": "1. 
Pointcloud Preparation: Bounding Box and Voxelization\n\n\n\n\n\n\natompos_to_voxel_sphere: convert a 3D coordinate into a voxelized sphere\n\n\n\n\n\n\ndef atompos_to_voxelized_sphere(center: np.ndarray, radius: int):\n \"\"\"Make sure radius reflects the size of the underlying voxel grid\"\"\"\n x0, y0, z0 = center\n\n #!------ Generate indices of a voxel cube of side 2r around the centerpoint\n x_range = slice(\n int(np.floor(x0 - radius)), \n int(np.ceil(x0 + radius)))\n y_range = slice(\n int(np.floor(y0 - radius)), \n int(np.ceil(y0 + radius)))\n z_range = slice(\n int(np.floor(z0 - radius)), \n int(np.ceil(z0 + radius)))\n\n indices = np.indices(\n (\n x_range.stop - x_range.start,\n y_range.stop - y_range.start,\n z_range.stop - z_range.start,\n )\n )\n\n indices += np.array([x_range.start,\n y_range.start,\n z_range.start])[:, np.newaxis, np.newaxis, np.newaxis ]\n indices = indices.transpose(1, 2, 3, 0)\n indices_list = list(map(tuple, indices.reshape(-1, 3)))\n\n #!------ Generate indices of a voxel cube of side 2r+2 around the centerpoint\n sphere_active_ix = []\n\n for ind in indices_list:\n x_ = ind[0]\n y_ = ind[1]\n z_ = ind[2]\n if (x_ - x0) ** 2 + (y_ - y0) ** 2 + (z_ - z0) ** 2 <= radius**2:\n sphere_active_ix.append([x_, y_, z_])\n\n return np.array(sphere_active_ix)\n\n\n\n\n\n\n\n\n\n\nindex_grid: populate a voxel grid (with sphered atoms)\n\n\n\n\n\n\ndef index_grid(expanded_sphere_voxels: np.ndarray) :\n\n def normalize_atom_coordinates(coordinates: np.ndarray)->tuple[ np.ndarray, np.ndarray ]:\n \"\"\"@param coordinates: numpy array of shape (N,3)\"\"\"\n\n C = coordinates\n mean_x = np.mean(C[:, 0])\n mean_y = np.mean(C[:, 1])\n mean_z = np.mean(C[:, 2])\n\n Cx = C[:, 0] - mean_x\n Cy = C[:, 1] - mean_y\n Cz = C[:, 2] - mean_z\n \n\n [dev_x, dev_y, dev_z] = [np.min(Cx), np.min(Cy), np.min(Cz)]\n\n #! 
shift to positive quadrant\n Cx = Cx + abs(dev_x)\n Cy = Cy + abs(dev_y)\n Cz = Cz + abs(dev_z)\n\n rescaled_coords = np.array(list(zip(Cx, Cy, Cz)))\n\n return rescaled_coords, np.array([[mean_x,mean_y,mean_z], [abs( dev_x ), abs( dev_y ), abs( dev_z )]])\n\n normalized_sphere_cords, mean_abs_vectors = normalize_atom_coordinates(expanded_sphere_voxels)\n voxel_size = 1\n\n sphere_cords_quantized = np.round(np.array(normalized_sphere_cords / voxel_size) ).astype(int)\n max_values = np.max(sphere_cords_quantized, axis=0)\n grid_dimensions = max_values + 1\n vox_grid = np.zeros(grid_dimensions)\n\n print(\"Dimension of the voxel grid is \", vox_grid.shape)\n\n vox_grid[\n sphere_cords_quantized[:, 0],\n sphere_cords_quantized[:, 1],\n sphere_cords_quantized[:, 2] ] = 1\n\n\n return ( vox_grid, grid_dimensions, mean_abs_vectors )\n\n\n\n\nBbox: There are many ways to extract a point cloud from a larger biological structure – in this case we settle for a bounding box that bounds the space between the PTC and the NPET vestibule.\n\n# \"bounding_box_atoms.npy\" is a N,3 array of atom coordinates\n\natom_centers = np.load(\"bounding_box_atoms.npy\") \n\nSphering: To make the representation of atoms slightly more physically-plausible we replace each atom-center coordinate with positions of voxels that fall within a sphere of radius \\(R\\) around the atom’s position. This is meant to represent the atom’s van der Waals radius.\nOne could model different types of atoms (\\(N\\),\\(C\\),\\(O\\),\\(H\\) etc.) with separate radii, but taking \\(R=2\\) proves a good enough compromise. The units are Angstrom and correspond to the coordinate system in which the structure of the ribosome is recorded.\n\nvoxel_spheres = np.array([ atompos_to_voxel_sphere(atom, 2) for atom in atom_centers ])\n\nVoxelization & Inversion: Since we are interested in the “empty space” between the atoms, we need a way to capture it. 
To make this possible we discretize the space by projecting the (sphered) point cloud into a voxel grid and invert the grid.\n\n# the grid is a binary 3D-array \n# with 1s where a normalized 3D-coordinate of an atom corresponds to the cell index and 0s elsewhere\n\n# by \"normalized\" i mean that the atom coordinates are\n# temporarily moved to the origin to decrease the size of the grid (see `index_grid` method further).\ninitial_grid, grid_dims, _ = index_grid(voxel_spheres)\n\n# The grid is inverted by changing 0->1 and 1->0\n# Now the atom locations are the null voxels and the empty space is active voxels\ninverted_grid = np.asarray(np.where(initial_grid != 1)).T\n\nCompare the following representation (Inverted Point Cloud) to the first point cloud: notice that where there previously was an active voxel is now an empty voxel and vice versa. The tubular constellation of active voxels in the center of the bounding box on this inverted grid is the tunnel “space” we are interested in.\n\n\n\n\n\n\n\n\n\n\n\n(a) Initial bounding-box point cloud\n\n\n\n\n\n\n\n\n\n\n\n(b) Inverted point cloud\n\n\n\n\n\n\n\nFigure 1: Pointcloud inversion via a voxel grid." }, { - "objectID": "posts/morphology/proposal.html", - "href": "posts/morphology/proposal.html", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "", - "text": "Cell morphology is an emerging field of biological research that examines the shape, size, and internal structure of cells to describe their state and the processes occurring within them. Today, more and more scientist across the world are investigating visible cellular transformations to predict cellular phenotypes. 
This research has significant practical implications: understanding specific cellular features characteristic of certain diseases, such as cancer, could lead to new approaches for early detection and classification.\nIn this work, we will explore aspects of cell motility by analyzing the changing shapes of migrating cells. As a cell moves through space, it reorganizes its membrane, cytosol, and cytoskeletal structures (Mogilner and Oster 1996). According to current understanding, actin polymerization causes protrusions at the leading edge of a cell, forming specific structures known as lamellipodia and filopodia. Elongation of cells in the direction of movement is also reported. These changes can be observed during experiments." + "objectID": "posts/ribosome-tunnel-new/index.html#subcloud-extraction", + "href": "posts/ribosome-tunnel-new/index.html#subcloud-extraction", + "title": "3D tessellation of biomolecular cavities", + "section": "2. Subcloud Extraction", + "text": "2. Subcloud Extraction\n\n\n\n\n\n\nDBSCAN_capture\n\n\n\n\n\n\nfrom sklearn.cluster import DBSCAN\ndef DBSCAN_capture(\n ptcloud: np.ndarray,\n eps ,\n min_samples ,\n metric : str = \"euclidean\",\n): \n\n u_EPSILON = eps\n u_MIN_SAMPLES = min_samples\n u_METRIC = metric\n\n print(\"Running DBSCAN on {} points. 
eps={}, min_samples={}, distance_metric={}\"\n .format( len(ptcloud), u_EPSILON, u_MIN_SAMPLES, u_METRIC ) ) \n\n db = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit(ptcloud) # <-- this is all you need\n\n labels = db.labels_\n\n CLUSTERS_CONTAINER = {}\n for point, label in zip(ptcloud, labels):\n if label not in CLUSTERS_CONTAINER:\n CLUSTERS_CONTAINER[label] = []\n CLUSTERS_CONTAINER[label].append(point)\n\n CLUSTERS_CONTAINER = dict(sorted(CLUSTERS_CONTAINER.items()))\n return db, CLUSTERS_CONTAINER\n\n\n\n\n\n\n\n\n\n\nDBSCAN_pick_largest_cluster\n\n\n\n\n\n\nfrom sklearn.cluster import DBSCAN\ndef DBSCAN_pick_largest_cluster(clusters_container:dict[int,list])->np.ndarray:\n DBSCAN_CLUSTER_ID = 0\n for k, v in clusters_container.items():\n if int(k) == -1:\n continue\n elif len(v) > len(clusters_container[DBSCAN_CLUSTER_ID]):\n DBSCAN_CLUSTER_ID = int(k)\n return np.array(clusters_container[DBSCAN_CLUSTER_ID])\n\n\n\n\nClustering: Having obtained a voxelized representation of the interatomic spaces inside and around the NPET our task is now to extract only the space that corresponds to the NPET. 
We use DBSCAN.\nscikit-learn’s implementation of DBSCAN conveniently lets us retrieve the points from the largest cluster only, which corresponds to the active voxels of NPET space (if we eyeballed our DBSCAN parameters well).\n\nfrom sklearn.cluster import DBSCAN\n\n_u_EPSILON, _u_MIN_SAMPLES, _u_METRIC = 5.5, 600, 'euclidean'\n\n_, clusters_container = DBSCAN_capture(inverted_grid, _u_EPSILON, _u_MIN_SAMPLES, _u_METRIC ) \nlargest_cluster = DBSCAN_pick_largest_cluster(clusters_container)\n\n\n\n\n\n\n\nDBSCAN Parameters and grid size.\n\n\n\n\n\nOur 1Å-side grid just happens to be granular enough to accommodate a “correct” separation of clusters for some empirically established values of min_nbrs and epsilon (DBSCAN parameters), where the largest cluster captures the tunnel space.\nA possible issue here is “extraneous” clusters merging into the cluster of interest and thereby corrupting its shape. In general this occurs when there are clusters of density that are close enough to the main one (within epsilon, warranting a merge) and simultaneously large enough that they fulfill the min_nbrs parameter. 
Hence it might be challenging to find the combination of min_nbrs and epsilon that is sensitive enough to capture the main cluster completely and yet discriminating enough to not subsume any adjacent clusters.\nIn theory, a finer voxel grid (finer – relative to the initial coordinates of the general point cloud; sub-angstrom in our case) would make finding the combination of parameters specific to the dataset easier: given that each atom-sphere would be represented by a proportionally larger number of voxels, the Euclidean distance calculation between two voxels would be less sensitive to the change in epsilon.\nPartitioning the voxel grid further would come at a cost:\n\nyou would need to rewrite the sphering method for atoms (to account for the new voxel size)\nthe computational cost would increase dramatically; the dataset could conceivably stop fitting into memory altogether.\n\n\n\n\n\n\n\nClusters identified by DBSCAN on the inverted index grid. The largest cluster corresponds to the tunnel space.\n\n\n\n\n\n\n\n\nSubcloud refinement\n\n\n\n\n\nI found that this first pass of DBSCAN (eps=\\(5.5\\), min_nbrs=\\(600\\)) successfully identifies the largest cluster with the tunnel but generally happens to be conservative in the number of points that are merged into it. That is, there are still redundant points in this cluster that would make the eventual surface reconstruction spatially overlap with the rRNA and proteins. To “sharpen” this cluster we apply DBSCAN only to its sub-pointcloud and push the eps distance down to \\(3\\) and min_nbrs to \\(123\\) (again, “empirically established” values), which happens to be about the lowest parameter values at which any clusters form. 
This sharpened cluster is what the tesselation (surface reconstruction) will be performed on.\n\n\n\n\n\n\n\n\n\n\n\n(a) Largest DBSCAN cluster (trimmed from the vestibule side).\n\n\n\n\n\n\n\n\n\n\n\n(b) Cluster refinement: DBSCAN{e=3,mn=123} result (marine blue) on the largest cluster of DBSCAN{e=5.5,mn=600} (gray)\n\n\n\n\n\n\n\nFigure 2: Second pass of DBSCAN sharpens the cluster to peel off the outer layer of redundant points." }, { - "objectID": "posts/morphology/proposal.html#background", - "href": "posts/morphology/proposal.html#background", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "", - "text": "Cell morphology is an emerging field of biological research that examines the shape, size, and internal structure of cells to describe their state and the processes occurring within them. Today, more and more scientist across the world are investigating visible cellular transformations to predict cellular phenotypes. This research has significant practical implications: understanding specific cellular features characteristic of certain diseases, such as cancer, could lead to new approaches for early detection and classification.\nIn this work, we will explore aspects of cell motility by analyzing the changing shapes of migrating cells. As a cell moves through space, it reorganizes its membrane, cytosol, and cytoskeletal structures (Mogilner and Oster 1996). According to current understanding, actin polymerization causes protrusions at the leading edge of a cell, forming specific structures known as lamellipodia and filopodia. Elongation of cells in the direction of movement is also reported. These changes can be observed during experiments." + "objectID": "posts/ribosome-tunnel-new/index.html#tessellation", + "href": "posts/ribosome-tunnel-new/index.html#tessellation", + "title": "3D tessellation of biomolecular cavities", + "section": "3. Tessellation", + "text": "3. 
Tessellation\n\n\n\n\n\n\nptcloud_convex_hull_points\n\n\n\n\n\nSurface points can be extracted by creating an alpha shape over the point cloud and taking only the points that belong to the alpha surface.\n\nimport pyvista as pv\nimport open3d as o3d\nimport numpy as np\n\ndef ptcloud_convex_hull_points(pointcloud: np.ndarray, ALPHA:float, TOLERANCE:float) -> np.ndarray:\n    assert pointcloud is not None\n    cloud = pv.PolyData(pointcloud)\n    grid = cloud.delaunay_3d(alpha=ALPHA, tol=TOLERANCE, offset=2, progress_bar=True)\n    convex_hull = grid.extract_surface().cast_to_pointset()\n    return convex_hull.points\n\nOne could be content with the alpha shape representation of the NPET geometry and stop here, but it’s easy to notice that the vertices of the polygon (red dots) are distributed unevenly over the surface. This is likely to introduce artifacts and instabilities into further simulations.\n\n\n\n\n\n\n\n\n\n\n\n(a) Alpha-shape over the pointcloud\n\n\n\n\n\n\n\n\n\n\n\n(b) Surface points of the point cloud\n\n\n\n\n\n\n\nFigure 3: Alpha shape provides a way to identify surface points.\n\n\n\n\n\n\n\n\n\n\n\n\nestimate_normals\n\n\n\n\n\nNormal estimation is done via rolling a tangent plane over the surface points.\n\nimport pyvista as pv\nimport open3d as o3d\nimport numpy as np\n\ndef estimate_normals(convex_hull_surface_pts: np.ndarray, kdtree_radius=None, kdtree_max_nn=None, correction_tangent_planes_n=None):\n    pcd = o3d.geometry.PointCloud()\n    pcd.points = o3d.utility.Vector3dVector(convex_hull_surface_pts)\n\n    pcd.estimate_normals(search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=kdtree_radius, max_nn=kdtree_max_nn) )\n    pcd.orient_normals_consistent_tangent_plane(k=correction_tangent_planes_n)\n\n    return pcd\n\n\n\n\nNormals’ orientations are depicted as vectors (black) on each datapoint.\n\n\n\n\n\n\n\n\n\n\n\napply_poisson_recon\n\n\n\n\n\nThe source is available at https://github.com/mkazhdan/PoissonRecon. 
For programmability we connect the binary to the pipeline by wrapping it in a Python subprocess, but one can of course use the binary directly.\nThe output of the binary is a binary .ply (Stanford Triangle Format) file. For purposes of distribution we also produce an ascii-encoded version of this .ply file side-by-side: some geometry packages are only able to parse the ascii version.\n\ndef apply_poisson_reconstruction(surf_estimated_ptcloud_path: str, recon_depth:int=6, recon_pt_weight:int=3):\n    import subprocess\n    import plyfile\n    # The documentation for the \"PoissonRecon\" binary can be found at https://www.cs.jhu.edu/~misha/Code/PoissonRecon/Version16.04/\n    # derive an output path next to the input file\n    output_path = surf_estimated_ptcloud_path.split(\".\")[0] + \"_poisson_recon.ply\"\n    command = [\n        POISSON_RECON_BIN,\n        \"--in\",\n        surf_estimated_ptcloud_path,\n        \"--out\",\n        output_path,\n        \"--depth\",\n        str(recon_depth),\n        \"--pointWeight\",\n        str(recon_pt_weight),\n        \"--threads\",\n        \"8\"\n    ]\n    process = subprocess.run(command, capture_output=True, text=True)\n    if process.returncode == 0:\n        print(\">>PoissonRecon executed successfully.\")\n        print(\">>Wrote {}\".format(output_path))\n        # Convert the .ply file to ascii\n        data = plyfile.PlyData.read(output_path)\n        data.text = True\n        ascii_duplicate = output_path.split(\".\")[0] + \"_ascii.ply\"\n        data.write(ascii_duplicate)\n        print(\">>Wrote {}\".format(ascii_duplicate))\n    else:\n        print(\">>Error:\", process.stderr)\n\n\n\n\nThe final NPET surface reconstruction\n\n\n\n\n\nNow, having refined the largest DBSCAN cluster, we have a pointcloud which faithfully represents the tunnel geometry. 
To create a watertight mesh from this point cloud we need to prepare the dataset:\n\nretrieve only the “surface” points from the pointcloud\nestimate normals on the surface points (establish data orientation)\n\n\nd3d_alpha, d3d_tol = 2, 1\n\nsurface_pts = ptcloud_convex_hull_points(coordinates_in_the_original_frame, d3d_alpha,d3d_tol)\npointcloud = estimate_normals(surface_pts, kdtree_radius=10, kdtree_max_nn=15, correction_tangent_planes_n=10)\n\nThe dataset is now ready for surface reconstruction. We reach for Poisson surface reconstruction4 by Kazhdan and Hoppe, a de facto standard in the field.\n\nPR_depth , PR_ptweight = 6, 3\napply_poisson_recon(pointcloud, recon_depth=PR_depth, recon_pt_weight=PR_ptweight)" }, { - "objectID": "posts/morphology/proposal.html#goals", - "href": "posts/morphology/proposal.html#goals", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "Goals", - "text": "Goals\nOur goal is to perform a differential geometry analysis of cellular shape curves to explore the correlation between shape differences and spatial displacement. Using the Riemann Elastic Metric(Li et al. 2023):\n\\[\ng_c^{a, b}(h, k) = a^2 \\int_{[0,1]} \\langle D_s h, N \\rangle \\langle D_s k, N \\rangle \\, ds\n+ b^2 \\int_{[0,1]} \\langle D_s h, T \\rangle \\langle D_s k, T \\rangle \\, ds\n\\]\nwe can estimate the geodesic distance between two cellular boundary curves to mathematically describe how the cell shape changes over time. To implement this algorithm, we will use the Python Geomstats package." + "objectID": "posts/ribosome-tunnel-new/index.html#result", + "href": "posts/ribosome-tunnel-new/index.html#result", + "title": "3D tessellation of biomolecular cavities", + "section": "Result", + "text": "Result\nWhat you are left with is a smooth polygonal mesh in the .ply format. Below is the illustration of the fidelity of the representation. 
Folds and depressions can clearly be seen engendered by three proteins surrounding parts of the tunnel (uL22 yellow, uL4 light blue and eL39 magenta). rRNA is not shown.6\n\n\n\nThe NPET mesh surrounded by by three ribosome proteins" }, { - "objectID": "posts/morphology/proposal.html#dataset", - "href": "posts/morphology/proposal.html#dataset", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "Dataset", - "text": "Dataset\nThis dataset contains real cell contours obtained via fluorescent microscopy in Professor Prasad’s lab, segmented by Clément Soubrier.\n\n204 directories:\nEach directory is named cell_*, representing an individual cell.\nFrames:\nSubdirectories inside each cell are named frame_*, capturing different time points for that cell.\n\n\nNumPy Array Objects in Each Frame\n\ncentroid.npy: Stores the coordinates of the cell’s centroid.\n\noutline.npy: Contains segmented points as Cartesian coordinates.\n\ntime.npy: Timestamp of the frame.\n\n\n\nStructure\n├── cell_i\n│ ├── frame_j\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ ├── frame_k\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ └── ...\n├── cell_l\n│ ├── frame_m\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ └── ...\n└── ..." + "objectID": "posts/Farm-Shape-Analysis/index.html", + "href": "posts/Farm-Shape-Analysis/index.html", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "", + "text": "In modern agriculture, the geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.\nThis research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. 
Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.\nIn this blog, we focus on identifying fake farms by analyzing field shapes to detect unrealistic entries. Our goal is to enhance data accuracy, providing a stronger foundation for future agriculture-related research.\n\n\n\nLitefarm Interface" }, { - "objectID": "posts/morphology/proposal.html#single-cell-dynamics", - "href": "posts/morphology/proposal.html#single-cell-dynamics", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "Single cell dynamics", - "text": "Single cell dynamics\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport os\n\nfig, ax = plt.subplots(figsize=(10, 10), layout='constrained')\n\nN = 15\n\nnumber_of_frames = sum(os.path.isdir(os.path.join(f\"cells/cell_{N}\", entry)) for entry in os.listdir(f\"cells/cell_{N}\"))\ncolors = plt.cm.tab20(np.linspace(0, 1, number_of_frames))\nfor i in range(1,number_of_frames+1):\n time = np.load(f'cells/cell_{N}/frame_{i}/time.npy')\n border = np.load(f'cells/cell_{N}/frame_{i}/outline.npy')\n centroid = np.load(f'cells/cell_{N}/frame_{i}/centroid.npy')\n\n \n color = colors[i - 1]\n\n ax.plot(border[:, 0], border[:, 1], label=time, color=color)\n ax.scatter(centroid[0], centroid[1], color=color)\nplt.legend() \n\nplt.savefig(f\"single_cell_{N}.png\", dpi=300, bbox_inches='tight')\n\n\n\nThe cell form in different time moments" + "objectID": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", + "href": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "", + "text": "In modern agriculture, the 
geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.\nThis research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.\nIn this blog, we focus on identifying fake farms by analyzing field shapes to detect unrealistic entries. Our goal is to enhance data accuracy, providing a stronger foundation for future agriculture-related research.\n\n\n\nLitefarm Interface" }, { - "objectID": "posts/morphology/proposal.html#references", - "href": "posts/morphology/proposal.html#references", - "title": "Exploring cell shape dynamics dependency on the cell migration", - "section": "References", - "text": "References\n\n\nLi, Wanxin, Ashok Prasad, Nina Miolane, and Khanh Dao Duc. 2023. “Using a Riemannian Elastic Metric for Statistical Analysis of Tumor Cell Shape Heterogeneity.” In Geometric Science of Information, edited by Frank Nielsen and Frédéric Barbaresco, 583–92. Cham: Springer Nature Switzerland.\n\n\nMogilner, A., and G. Oster. 1996. “Cell Motility Driven by Actin Polymerization.” Biophysical Journal 71 (6): 3030–45. https://doi.org/10.1016/s0006-3495(96)79496-1." + "objectID": "posts/Farm-Shape-Analysis/index.html#dataset-overview-and-preparation", + "href": "posts/Farm-Shape-Analysis/index.html#dataset-overview-and-preparation", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "2. Dataset Overview and Preparation", + "text": "2. 
Dataset Overview and Preparation\n\nData Source\nThe data for this study was extracted from Litefarm’s database, which contains detailed information about farm geometries, locations, and user associations. The dataset included the following key attributes:\n\nFarm-Level Information:\nEach farm is uniquely identified by a farm_ID, representing an individual farm within the Litefarm database.\nPolygon-Level Information:\nEach farm consists of multiple polygons, corresponding to distinct areas such as fields, gardens, or barns. Each polygon is uniquely identified by a location_ID, ensuring that every area within a farm is individually traceable.\nGeometric Attributes:\n\nArea: The total surface area of the polygon.\n\nPerimeter: The boundary length of the polygon.\n\nVertex Coordinates:\nThe geographic shape of each polygon is defined by a list of vertex coordinates in latitude and longitude format, represented as: [(lat1, lon1), (lat2, lon2), ..., (latN, lonN)].\nPolygon Types:\nThe polygons in each farm are categorized into various types:\n\nFields\n\nFarm site boundaries\n\nResidences\n\nBarns\n\nGardens\n\nSurface water\n\nNatural areas\n\nGreenhouses\n\nCeremonial areas\n\n\nThis rich dataset captures farm structures and geometries comprehensively, enabling the analysis of relationships between polygon features and agricultural outcomes.\nThis study focuses specifically on productive areas—gardens, greenhouses, and fields—as these contribute directly to agricultural output. Since different polygon types possess unique geometric characteristics, we focused on a single type to maintain analytical consistency.\nAs the Litefarm database is dynamic and continuously updated, the data captured as of November 28th showed that 36.4% of farms included garden areas, 20.7% had greenhouse areas, and nearly 70% contained fields. 
To ensure a robust and representative analysis, we focused on field polygons, which had the highest proportion within the dataset.\n\n\nRefined Litefarm Dataset\nTo ensure that only valid and realistic farm data was included in the analysis, we applied rigorous SQL filters to the Litefarm database. These filters excluded:\n\nPlaceholder farms and internal test accounts.\n\nDeleted records.\n\nFarms located in countries with insufficient representation (fewer than 10 farms).\n\nThe table below summarizes the results of the filtering process and the composition of the cleaned dataset:\n\n\n\nDescription\nCount\n\n\n\n\nInitial number of farms in Litefarm\n3,559\n\n\nFarms after SQL filtering\n2,919\n\n\nFarms with field areas\n2,022\n\n\nFarms with garden areas\n1,063\n\n\nFarms with greenhouse areas\n607\n\n\nTotal number of field polygons\n6,340\n\n\n\nBy narrowing the focus to field polygons, we ensured that the dataset was both robust and suitable for exploring the relationship between geometric features and agricultural outcomes."
  },
  {
    "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html",
    "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html",
    "title": "Shape Analysis of Contractile Cells",
    "section": "",
    "text": "Capsular contracture (CC) is a distressing complication that commonly arises amongst breast cancer patients after reconstructive breast implant surgery. CC patients suffer from aesthetic deformation and pain, and in rare cases they may develop anaplastic large cell lymphoma (ALCL), a type of cancer of the immune system. The mechanism of CC is unknown, and there are few objective assessments of CC based on histology.\n\n\n\n\nFigure 1: Baker grade\n\nBaker grade is a subjective, clinical evaluation of the extent of CC (see Fig 1). Many researchers have measured histological properties in CC tissue samples and correlated these findings to their assigned Baker grade. 
It has been found that a high density of immune cells is associated with higher Baker grade.\nThese cells include fibroblasts and myofibroblasts, which can distort surrounding tissues by contracting and pulling on them. The transition from the fibroblast to the myofibroblast phenotype is an important driving step in many fibrotic processes, including capsular contracture. In wound healing, the contractility of myofibroblasts is essential in facilitating tissue remodelling; however, an excess of contractile force creates a positive feedback loop, leading to the formation of pathological capsules with high density and extensive deformation.\nMyofibroblasts, considered an “activated” form of fibroblasts, are identified by the expression of alpha-smooth muscle actin (\\(\\alpha\\)-SMA). However, this binary classification does not capture the full complexity of the transition between these two phenotypes. Therefore, it is beneficial to develop a finer classification of myofibroblasts to explain the various levels of force they can generate. One recent work uses pre-defined morphological features of cells, including perimeter and circularity, to create a continuous spectrum of myofibroblast activation (Hillsley et al. 2022).\nResearch suggests that mechanical strain induces changes in cell morphology, turning round cells lacking stress fibers into broader, elongated shapes. We hypothesize that cell shapes influence the cells’ ability to generate forces via mechanisms of cell-matrix adhesion and cell traction. Further, we hypothesize that cell shape is directly correlated with the severity of CC through increased contractile forces.\nTo test these hypotheses, we will take a 2-step approach. The first step involves statistical analysis of the correlation between cell shapes and their associated Baker grade. 
To do this, we collect cell images from CC samples with various Baker grades; using Geomstats, we can compute a characteristic mean cell shape for each sample. Then, we cluster these characteristic cell shapes into 4 groups and observe the extent of overlap between this classification and the Baker grade. We choose the elastic metric, with its associated geodesic distances, since it allows us not only to classify cell shapes, but also to study how they deform. If we can find a correlation, the second step is to go back to in-vitro studies of fibroblasts and answer the question: can the shapes of cells predict their disposition to develop into a highly contractile phenotype (linked to more severe CC)? I don’t have a concrete plan for this second step yet; however, it motivates this project, as it may suggest a way to predict clinical outcomes based on pre-operative patient assessment."
  },
  {
    "objectID": "posts/Farm-Shape-Analysis/index.html#shape-analysis",
    "href": "posts/Farm-Shape-Analysis/index.html#shape-analysis",
    "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity",
    "section": "3. Shape Analysis",
    "text": "3. Shape Analysis\nThis study focuses on the geometric properties of field polygons, as these are essential for understanding farm structures and ensuring data reliability. Each field polygon is represented by a series of vertices in latitude-longitude pairs, which outline its geometric boundaries. These vertices are the foundation for calculating key metrics such as area, perimeter, and more complex shape properties.\nTo perform a robust analysis, we systematically processed and evaluated the field polygon data through the following steps:\n\n1. Vertex Distribution Analysis\nThe first step in our analysis was to examine the vertex distribution of the field polygons to understand their general characteristics and ensure data quality. 
A box plot was created to visualize the distribution of vertex counts: \nThe results revealed a wide range of vertex counts, spanning from 3 to 189 vertices. This variability required filtering to address potential outliers. Using the z-score method, we identified and excluded extreme values, capping the maximum vertex count at 34.\nAfter filtering, we analyzed the revised vertex distribution using a histogram, which revealed that 47.4% of field polygons had exactly four vertices:\n\n\n\nhistogram of number of vertices\n\n\n\n\n2. Validation of Area and Perimeter Metrics\n\nRecalculation Process:\n\nVertex coordinates, initially in latitude-longitude format, were transformed into a planar coordinate system (EPSG:6933) to enable precise calculations.\nArea and perimeter were computed directly from the transformed vertex data.\n\nScatter plots comparing the user-provided values with the recalculated metrics showed strong alignment, with most points clustering around the diagonal (dashed line), confirming the accuracy of the recalculated values:\n\nPerimeter Comparison\n\nArea Comparison\n\n\nThis validation step provided confidence in the accuracy of the recalculated metrics, allowing us to proceed with subsequent shape analysis using reliable data."
  },
  {
    "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#sort-labelling-data",
    "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#sort-labelling-data",
    "title": "Shape Analysis of Contractile Cells",
    "section": "Sort labelling data",
    "text": "Sort labelling data\nThe segmentation data can be exported as a file containing 2D coordinates of all pixels that are marked as borders. First, we need to identify individual cells from this data. We may view pixels as nodes in a graph; the problem then becomes splitting an unconnected graph into connected components. A tricky part is to process cells with overlapping/connected borders. 
> TO ADD: details on this algorithm.\nFrom here, a few simple bash commands allow us to import the resulting data files as a numpy array of 2D coordinates, as an acceptable input for GeomStats.\n# replace delimiters with sed\nsed -i 's/],/\\n/g' *\nsed -i 's/,/ /g' *\n\n# remove [ with sed\nsed -i 's|[[]||g' * \n\nimport sys\nfrom pathlib import Path\nimport numpy as np\nfrom decimal import Decimal\nimport matplotlib.pyplot as plt\n\n# sys.prefix = '/home/uki/Desktop/blog/posts/capsular-contracture/.venv'\n# sys.executable = '/home/uki/Desktop/blog/posts/capsular-contracture/.venv/bin/python'\nsys.path=['', '/opt/petsc/linux-c-opt/lib', '/home/uki/Desktop/blog/posts/capsular-contracture', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/home/uki/Desktop/blog/posts/capsular-contracture/.venv/lib/python3.12/site-packages']\n\ndirectory = Path('/home/uki/Desktop/blog/posts/capsular-contracture/cells')\nfile_iterator = directory.iterdir()\ncells = []\n\nfor filename in file_iterator:\n with open(filename) as file:\n cell = np.loadtxt(file, dtype=int)\n cells.append(cell)\n\nprint(f\"Total number of cells : {len(cells)}\")\n\nTotal number of cells : 3\n\n\nSince the data is unordered, we need to sort the coordinates in order to visualize cell shapes.\n\ndef sort_coordinates(list_of_xy_coords):\n cx, cy = list_of_xy_coords.mean(0)\n x, y = list_of_xy_coords.T\n angles = np.arctan2(x-cx, y-cy)\n indices = np.argsort(angles)\n return list_of_xy_coords[indices]\n\n\nsorted_cells = []\n\nfor cell in cells:\n sorted_cells.append(sort_coordinates(cell))\n\n\nindex = 1\ncell_rand = cells[index]\ncell_sorted = sorted_cells[index]\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(121)\nplt.scatter(cell_rand[:, 0], cell_rand[:, 1], color='black', s=4)\n\nplt.plot(cell_rand[:, 0], cell_rand[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Original coordinates\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nplt.scatter(cell_sorted[:, 0], 
cell_sorted[:, 1], color='black', s=4)\n\nplt.plot(cell_sorted[:, 0], cell_sorted[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Sorted coordinates\")\nplt.axis(\"off\")\n\n\n\n\n\n\n\n\n\n\nOriginal work ends around here; what follows is a proof-of-concept mock pipeline performed on 3 cells that needs to be adapted. _______________________"
  },
  {
    "objectID": "posts/Farm-Shape-Analysis/index.html#field-polygon-standardization-and-preparation",
    "href": "posts/Farm-Shape-Analysis/index.html#field-polygon-standardization-and-preparation",
    "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity",
    "section": "Field Polygon Standardization and Preparation",
    "text": "Field Polygon Standardization and Preparation\nTo focus on the geometric properties of field polygons, we projected all polygons into a size-and-shape space. This transformation isolates the shape and scale of the polygons while removing variations caused by rotation and translation. The size-and-shape space ensures consistent and meaningful comparisons of the underlying geometric features.\nWhile this study emphasizes polygon shapes, we recognize that area is a critical feature in agricultural studies due to its relationship with factors like regional regulations and agricultural policies. Thus, we preserved the size (scaling) component in our analysis to maintain the relevance of area.\nTo ensure uniformity and consistency in the dataset, we performed the following preprocessing steps:\n\nStandardizing Landmark Points:\n\nTo enable meaningful comparisons in the size-and-shape space, each polygon was resampled to have exactly 34 evenly spaced points along its boundary. 
The following Python function illustrates this process:\n\n\nCode\nimport folium\nimport json\nfrom shapely.geometry import shape, Polygon, Point, MultiPoint, MultiPolygon, LineString,LinearRing, MultiLineString\nfrom shapely.ops import unary_union, transform, nearest_points\nfrom collections import defaultdict\nimport geopy.distance\nimport pandas as pd\nimport math\nimport numpy as np\nfrom itertools import combinations\nimport itertools\nimport pyproj\nfrom functools import partial\nfrom collections import defaultdict\nimport altair as alt\nimport matplotlib.pyplot as plt\nimport plotly.graph_objs as go\nfrom pyproj import Transformer, CRS \nimport seaborn as sns\nimport plotly.express as px\nimport logging\nfrom shapely.validation import explain_validity\nimport geopandas as gpd\nimport ast\nfrom geographiclib.geodesic import Geodesic\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.decomposition import PCA\nfrom geopy.distance import geodesic\nfrom geomstats.geometry.pre_shape import PreShapeSpace\nfrom geomstats.visualization import KendallDisk, KendallSphere\n\n\n\n\nCode\ndef resample_polygon(projected_coords, num_points=34):\n \"\"\"\n Resample a polygon's boundary to have a specified number of evenly spaced points.\n\n Parameters:\n - projected_coords: List of coordinates defining the polygon's boundary.\n - num_points: The number of evenly spaced points to resample (default is 34).\n\n Returns:\n - new_coords: List of resampled coordinates.\n \"\"\"\n ring = LinearRing(projected_coords)\n \n total_length = ring.length\n\n distances = np.linspace(0, total_length, num_points, endpoint=False)\n \n new_coords = [ring.interpolate(distance).coords[0] for distance in distances]\n \n return new_coords\n\n\n\nEnsuring Consistent Vertex Direction:\n\nAll polygons were standardized to have vertices drawn in the same direction (clockwise or counterclockwise). 
This step ensures that the orientation of the vertices does not introduce inconsistencies in the analysis.\n\n\nCode\ndef is_clockwise(coords):\n \"\"\"\n Check if the polygon vertices are in a clockwise direction.\n\n Parameters:\n - coords: List of coordinates defining the polygon's boundary.\n\n Returns:\n - True if the polygon is clockwise; False otherwise.\n \"\"\"\n ring = LinearRing(coords)\n return not ring.is_ccw\n\ndef make_clockwise(coords):\n \"\"\"\n Convert the polygon's vertices to a clockwise direction if they are not already.\n\n Parameters:\n - coords: List of coordinates defining the polygon's boundary.\n\n Returns:\n - List of coordinates ordered in a clockwise direction.\n \"\"\"\n if not is_clockwise(coords): \n return [coords[0]] + coords[:0:-1] # Reverse the vertex order, keeping the start point\n return coords\n\n\nThe image illustrates four polygons that have been standardized by resampling them to have 34 evenly spaced points, with all vertices aligned in a clockwise direction.\n\n\n\nThe standardized polygon\n\n\n\nValidation of Standardization\nTo confirm the accuracy of these transformations, we compared the areas and perimeters of the resampled polygons with the original values. The results demonstrated minimal deviation, indicating the transformations preserved the integrity of the shapes.\n\nPerimeter Comparison\n\n\n\n\nperimeter comparison\n\n\n\nArea Comparison\n\n\n\n\narea comparison\n\n\nBy meeting these preprocessing requirements, we ensured that the polygons were accurately prepared for subsequent shape analysis."
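The validation step above (checking that resampling preserves area and perimeter) can be sketched without the geospatial stack. The `shoelace_area`, `perimeter`, and `resample_ring` helpers below are hypothetical, dependency-free stand-ins that mirror the behaviour of the shapely-based `resample_polygon` shown earlier; they are not the code used in the actual analysis.

```python
import math

def shoelace_area(coords):
    """Unsigned polygon area via the shoelace formula."""
    n = len(coords)
    s = 0.0
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def perimeter(coords):
    """Total boundary length of a closed polygon."""
    n = len(coords)
    return sum(math.dist(coords[i], coords[(i + 1) % n]) for i in range(n))

def resample_ring(coords, num_points=34):
    """Resample a closed ring to num_points evenly spaced boundary points
    (pure-Python stand-in for the shapely-based resample_polygon)."""
    n = len(coords)
    edges = [math.dist(coords[i], coords[(i + 1) % n]) for i in range(n)]
    total = sum(edges)
    new_coords, i, acc = [], 0, 0.0
    for k in range(num_points):
        target = total * k / num_points  # arc-length position of sample k
        while acc + edges[i] < target:   # advance to the edge containing it
            acc += edges[i]
            i += 1
        t = (target - acc) / edges[i]    # fraction along the current edge
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        new_coords.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return new_coords

square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
resampled = resample_ring(square, 34)

# Resampling cuts any corner that does not land exactly on a sample
# point, so area and perimeter agree with the originals only up to a
# small deviation (well under a few percent here) -- the behaviour the
# validation scatter plots confirm at scale.
assert abs(shoelace_area(resampled) - shoelace_area(square)) / 100.0 < 0.05
assert abs(perimeter(resampled) - perimeter(square)) / 40.0 < 0.05
```

The corner-cutting effect is why the validation compares resampled metrics against the originals rather than expecting exact equality.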
}, { - "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#interpolation-and-removing-duplicate-sample-points", - "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#interpolation-and-removing-duplicate-sample-points", - "title": "Shape Analysis of Contractile Cells", - "section": "Interpolation and removing duplicate sample points", - "text": "Interpolation and removing duplicate sample points\n\nimport geomstats.backend as gs\nfrom common import *\nimport random\nimport os\nimport scipy.stats as stats\nfrom sklearn import manifold\n\ngs.random.seed(2024)\n\n\ndef interpolate(curve, nb_points):\n \"\"\"Interpolate a discrete curve with nb_points from a discrete curve.\n\n Returns\n -------\n interpolation : discrete curve with nb_points points\n \"\"\"\n old_length = curve.shape[0]\n interpolation = gs.zeros((nb_points, 2))\n incr = old_length / nb_points\n pos = 0\n for i in range(nb_points):\n index = int(gs.floor(pos))\n interpolation[i] = curve[index] + (pos - index) * (\n curve[(index + 1) % old_length] - curve[index]\n )\n pos += incr\n return interpolation\n\n\nk_sampling_points = 2000\n\n\nindex = 2\ncell_rand = sorted_cells[index]\ncell_interpolation = interpolate(cell_rand, k_sampling_points)\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(121)\nplt.scatter(cell_rand[:, 0], cell_rand[:, 1], color='black', s=4)\n\nplt.plot(cell_rand[:, 0], cell_rand[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Original curve ({len(cell_rand)} points)\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nplt.scatter(cell_interpolation[:, 0], cell_interpolation[:, 1], color='black', s=4)\n\nplt.plot(cell_interpolation[:, 0], cell_interpolation[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Interpolated curve ({k_sampling_points} points)\")\nplt.axis(\"off\")\n\n(np.float64(810.1893750000002),\n np.float64(850.848125),\n np.float64(18.650075000000008),\n np.float64(48.34842499999986))\n\n\n\n\n\n\n\n\n\n\ndef preprocess(curve, tol=1e-10):\n 
\"\"\"Preprocess curve to ensure that there are no consecutive duplicate points.\n\n Returns\n -------\n curve : discrete curve\n \"\"\"\n\n dist = curve[1:] - curve[:-1]\n dist_norm = np.sqrt(np.sum(np.square(dist), axis=1))\n\n if np.any( dist_norm < tol ):\n for i in range(len(curve)-1):\n if np.sqrt(np.sum(np.square(curve[i+1] - curve[i]), axis=0)) < tol:\n curve[i+1] = (curve[i] + curve[i+2]) / 2\n\n return curve\n\n\ninterpolated_cells = []\n\nfor cell in sorted_cells:\n interpolated_cells.append(preprocess(interpolate(cell, k_sampling_points)))" + "objectID": "posts/Farm-Shape-Analysis/index.html#shape-alignment-and-fréchet-mean-analysis", + "href": "posts/Farm-Shape-Analysis/index.html#shape-alignment-and-fréchet-mean-analysis", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "Shape Alignment and Fréchet Mean Analysis", + "text": "Shape Alignment and Fréchet Mean Analysis\nWith data preparation complete, the polygons were ready for analysis in the size-and-shape space. This specialized framework enables consistent comparison of shapes by accounting for geometric differences, including scaling, translation, and rotation. It provides a robust foundation for meaningful geometric analysis.\nThe polygons were aligned using Procrustes analysis(Dryden and Mardia 2016), and their Fréchet Mean was iteratively computed in Euclidean space. This process standardizes the shapes, ensuring variations caused by translation and rotation are removed, allowing for accurate and meaningful comparisons.\nThe Fréchet Mean(Dryden and Mardia 2016) represents the “average” shape in a geometric space (manifold), minimizing the average squared distance to all sample shapes. 
It serves as a standardized and central representation of the dataset.\n\n\nStep-by-Step Overview\n\nShape Alignment:\n\nThe align_shape function performs Procrustes alignment through the following steps:\n\nRemoving Translation:\n\nThe centroid (average position of all points) of each shape is computed. The shape is then centered by subtracting its centroid from all points, ensuring the shape is position-independent.\n\nRemoving Rotation:\n\nUsing Singular Value Decomposition (SVD), the optimal rotation matrix is calculated to align the target shape with the reference shape. This step removes rotation differences while preserving the relative positions of the points.\n\n\n\nMeasuring Shape Differences:\n\nThe riemannian_distance function computes the Riemannian distance between two shapes in size-and-shape space. This metric quantifies geometric differences between shapes, considering both size and shape once translation and rotation have been removed.\n\n\n\n\nRiemannian Distance in Size-and-Shape Space\nGiven two \\(k\\)-point configurations in \\(m\\) dimensions, \\(X_1^o, X_2^o \\in \\mathbb{R}^{k \\times m}\\), the Riemannian distance (Dryden and Mardia 2016) in size-and-shape space is defined as:\n\\[\nd_S(X_1^o, X_2^o) = \\sqrt{S_1^2 + S_2^2 - 2 S_1 S_2 \\cos \\rho(X_1^o, X_2^o)}\n\\]\nwhere:\n\n\\(S_1, S_2\\): Centroid sizes of \\(X_1^o\\) and \\(X_2^o\\), representing the Frobenius norms of the centered shapes.\n\\(\\rho(X_1^o, X_2^o)\\): Riemannian shape distance.\n\nThis formula ensures that the distance captures both shape similarity and scaling differences, making it a robust tool for geometric analysis.\n\nIterative Fréchet Mean Calculation:\n\nThe algorithm begins with an initial reference shape and aligns all other shapes to it using Procrustes alignment.\nThe Fréchet Mean is then calculated as the average shape in Euclidean space.\nThe shapes are iteratively re-aligned to the updated Fréchet Mean, refining the alignment and mean calculation until convergence is achieved.\n\n\n\n\n\n\nPython 
Implementation\nThe following Python code implements the entire process of shape alignment, Riemannian distance computation, and iterative Fréchet Mean calculation.\n\n\nCode\ndef align_shape(reference_shape, target_shape):\n \"\"\"\n Align the target shape to the reference shape using Procrustes alignment.\n\n Parameters:\n - reference_shape: The reference shape to align to.\n - target_shape: The shape to be aligned.\n\n Returns:\n - aligned_shape: The aligned target shape.\n \"\"\"\n reference_shape = np.array(reference_shape)\n target_shape = np.array(target_shape)\n\n # Step 1: Remove the translation\n centroid_reference = np.mean(reference_shape, axis=0)\n centroid_target = np.mean(target_shape, axis=0)\n centered_reference = reference_shape - centroid_reference\n centered_target = target_shape - centroid_target\n\n # Step 2: Remove the rotation\n u, s, vh = np.linalg.svd(np.matmul(np.transpose(centered_target), centered_reference))\n r = np.matmul(u, vh)\n aligned_shape = np.matmul(centered_target, r)\n\n return aligned_shape\n\ndef riemannian_distance(reference_shape, target_shape):\n \"\"\"\n Compute the Riemannian distance between two shapes.\n\n Parameters:\n - reference_shape: The reference shape.\n - target_shape: The target shape.\n\n Returns:\n - distance: The Riemannian distance between the shapes.\n \"\"\"\n reference_shape = np.array(reference_shape)\n target_shape = np.array(target_shape)\n\n # Step 1: Compute centroid sizes\n S1 = np.linalg.norm(reference_shape) \n S2 = np.linalg.norm(target_shape)\n\n # Step 2: Remove translation by centering the shapes\n centered_reference = reference_shape - np.mean(reference_shape, axis=0)\n centered_target = target_shape - np.mean(target_shape, axis=0)\n\n # Step 3: Compute optimal rotation using SVD\n H = np.dot(centered_target.T, centered_reference)\n U, _, Vt = np.linalg.svd(H)\n R = np.dot(U, Vt)\n\n # Step 4: Align target shape\n aligned_target = np.dot(centered_target, R)\n\n # Step 5: Compute the 
Riemannian distance\n cosine_rho = np.trace(np.dot(aligned_target.T, centered_reference)) / (S1 * S2)\n cosine_rho = np.clip(cosine_rho, -1, 1)\n distance = np.sqrt(S1**2 + S2**2 - 2 * S1 * S2 * cosine_rho)\n\n return distance\n\n# Iterative Fréchet Mean Calculation\nepsilon = 1e-6 \nmax_iterations = 100 \nreference_shape = field_data['resampled_point'].iloc[0] \naligned_shapes = []\n\n# Align all shapes to the initial reference shape\nfor target_shape in field_data['resampled_point']:\n aligned_shape = align_shape(reference_shape, target_shape)\n aligned_shapes.append(aligned_shape)\n\n# Initialize Euclidean space and calculate initial Fréchet Mean\neuclidean_space = Euclidean(dim=aligned_shapes[0].shape[1])\nfrechet_mean = FrechetMean(euclidean_space)\nprevious_frechet_mean_shape = frechet_mean.fit(aligned_shapes).estimate_\nconverged = False\niteration = 0\nfrechet_means = [previous_frechet_mean_shape]\n\nwhile not converged and iteration < max_iterations:\n iteration += 1\n aligned_shapes2 = []\n for target_shape in field_data['resampled_point']:\n aligned_shape = align_shape(previous_frechet_mean_shape, target_shape)\n aligned_shapes2.append(aligned_shape)\n\n # Calculate new Fréchet Mean\n frechet_mean = FrechetMean(euclidean_space)\n current_frechet_mean_shape = frechet_mean.fit(aligned_shapes2).estimate_\n frechet_means.append(current_frechet_mean_shape)\n \n # Check convergence\n difference = riemannian_distance(previous_frechet_mean_shape, current_frechet_mean_shape)\n if difference < epsilon:\n converged = True\n else:\n previous_frechet_mean_shape = current_frechet_mean_shape" }, { - "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#alignment", - "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#alignment", - "title": "Shape Analysis of Contractile Cells", - "section": "Alignment", - "text": "Alignment\n\nfrom geomstats.geometry.pre_shape import PreShapeSpace\n\nAMBIENT_DIM = 2\n\nPRESHAPE_SPACE = 
PreShapeSpace(ambient_dim=AMBIENT_DIM, k_landmarks=k_sampling_points)\n\nPRESHAPE_SPACE.equip_with_group_action(\"rotations\")\nPRESHAPE_SPACE.equip_with_quotient()\n\n\ndef exhaustive_align(curve, base_curve):\n \"\"\"Align curve to base_curve to minimize the L² distance.\n\n Returns\n -------\n aligned_curve : discrete curve\n \"\"\"\n nb_sampling = len(curve)\n distances = gs.zeros(nb_sampling)\n base_curve = gs.array(base_curve)\n for shift in range(nb_sampling):\n reparametrized = [curve[(i + shift) % nb_sampling] for i in range(nb_sampling)]\n aligned = PRESHAPE_SPACE.fiber_bundle.align(\n point=gs.array(reparametrized), base_point=base_curve\n )\n distances[shift] = PRESHAPE_SPACE.embedding_space.metric.norm(\n gs.array(aligned) - gs.array(base_curve)\n )\n shift_min = gs.argmin(distances)\n reparametrized_min = [\n curve[(i + shift_min) % nb_sampling] for i in range(nb_sampling)\n ]\n aligned_curve = PRESHAPE_SPACE.fiber_bundle.align(\n point=gs.array(reparametrized_min), base_point=base_curve\n )\n return aligned_curve\n\n\naligned_cells = []\nBASE_CURVE = interpolated_cells[0]\n\nfor cell in interpolated_cells:\n aligned_cells.append(exhaustive_align(cell, BASE_CURVE))\n\n\nindex = 1\nunaligned_cell = interpolated_cells[index]\naligned_cell = exhaustive_align(unaligned_cell, BASE_CURVE)\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(131)\nplt.plot(BASE_CURVE[:, 0], BASE_CURVE[:, 1])\nplt.plot(BASE_CURVE[0, 0], BASE_CURVE[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Reference curve\")\n\nfig.add_subplot(132)\nplt.plot(unaligned_cell[:, 0], unaligned_cell[:, 1])\nplt.plot(unaligned_cell[0, 0], unaligned_cell[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Unaligned curve\")\n\nfig.add_subplot(133)\nplt.plot(aligned_cell[:, 0], aligned_cell[:, 1])\nplt.plot(aligned_cell[0, 0], aligned_cell[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Aligned curve\")\n\nText(0.5, 1.0, 'Aligned curve')" + "objectID": 
"posts/Farm-Shape-Analysis/index.html#global-fréchet-mean-and-outlier-detection", + "href": "posts/Farm-Shape-Analysis/index.html#global-fréchet-mean-and-outlier-detection", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "Global Fréchet Mean and Outlier Detection", + "text": "Global Fréchet Mean and Outlier Detection\nHere is the global Fréchet mean calculated from all field polygons:\n\n\n\nThe global mean shape\n\n\nThe following image illustrates the original polygon and its alignment with the Fréchet mean:\n\n\n\nAligned Shape\n\n\nAfter aligning all shapes to the Fréchet mean, the riemannian_distance function was used to calculate the distances between the mean shape and each aligned shape. To identify potential outliers, the z-score method was applied to these distance values.\nBelow are the four field polygons detected as outliers using the global Fréchet mean:\n\n\nCode\nimport pandas as pd\n\n# Load the CSV file\nfour_potiential_fake_farm = pd.read_csv(\"data/potiential_fake_field.csv\")\n\n# Display the table\nfour_potiential_fake_farm # Or use `data` to show the entire table\n\n\n\n\n\n\n\n\n\nFarm Number\ncountry_name\ntype\ncalculated_perimeter_m\ncalulated_area_ha\nnumber of vertices\ndistance_to_frechet_mean\nz_score\n\n\n\n\n0\nFarm 310\nUnited States\nfield\n744797.7117\n2.600780e+06\n3\n591590.0609\n48.784896\n\n\n1\nFarm 71\nCanada\nfield\n864206.5248\n4.251124e+06\n5\n709800.2531\n58.580655\n\n\n2\nFarm 45\nCanada\nfield\n341370.9916\n8.453115e+04\n5\n177371.8498\n14.459753\n\n\n3\nFarm 2792\nIndia\nfield\n200958.9993\n2.170029e+05\n4\n166440.3554\n13.553890" }, { - "objectID": "posts/rloop-analysis/rloop-analysis.html", - "href": "posts/rloop-analysis/rloop-analysis.html", - "title": "Identifying R-loops in AFM imaging data", - "section": "", - "text": "R-loops are three-stranded nucleic acid structures containing a DNA:RNA hybrid and an associated single DNA strand. 
They are normally created when DNA and RNA interact throughout the lifespan of a cell. Although their existence can be beneficial to a cell, an excessive formation of these objects is commonly associated with instability phenotypes.\nThe role of R-loop structures on genome stability is still not completely determined. The determining characteristics of harmful R-loops still remain to be defined. Their architecture is not very well-known either, and they are normally classified manually.\nIn this blog post, we will carry AFM data to the Kendall shape space and try to develop a method to detect and classify these objects using geomstats (Miolane et al. 2024). We will also talk about a rather simple method that works reasonably well.\n\n\n\n\nFig.1 Pictures of DNA fragments at the gene Airn in vitro. One of them was treated with RNase H and the other was not. The image on the bottom highlights the R-loops that were formed. (Carrasco-Salas et al. 2019)" + "objectID": "posts/Farm-Shape-Analysis/index.html#fréchet-mean-shape-by-country", + "href": "posts/Farm-Shape-Analysis/index.html#fréchet-mean-shape-by-country", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "Fréchet Mean Shape by Country", + "text": "Fréchet Mean Shape by Country\nThe shape of field polygons varies significantly across different countries. To capture this variation, we calculated the Fréchet mean shape* for each country based on the fields located within that specific country.\nThe plot below summarizes the Fréchet mean shapes for all countries in the dataset.\nIn this visualization, different colors represent different continents. 
It is evident that both the shapes and areas of the field polygons differ substantially across regions, highlighting the diversity in field geometry across countries.\n\n\n\nSummary of Countries’ Mean Shapes\n\n\n\nAssessing Mean Shape Representation in Countries with Limited Data\nTo evaluate the representativeness of the mean shape, we specifically selected countries with fewer than 10 polygons. The small number of polygons in these cases allows for easier visualization, helping us assess whether the mean shape effectively captures the overall geometric characteristics of these datasets.\n\nZambia\n\n\n\nField polygons and Fréchet mean for Zambia\n\n\n\n\nChile\n\n\n\nField polygons and Fréchet mean for Chile\n\n\nFrom the above plots, we can draw the following conclusions:\n\nEffective Representation with Similar Shapes:\nWhen the field polygons within a country have similar shapes, the calculated Fréchet mean serves as an effective representation of the general shape trend.\nLimitations with Diverse Shapes:\nIf the field polygons within a country show significant variation in their shapes, the Fréchet mean becomes less representative and may fail to adequately capture the geometric diversity of the dataset.\n\n\n\n\nDetecting Potential Fake Field Polygons\nBuilding on the country-level mean shape analysis, we applied the same methodology to detect potential fake field polygons. For each country, field polygons were aligned to their corresponding Fréchet mean, and the z-score technique was used to identify anomalies based on the distances between each polygon and the mean shape.\nThrough this analysis, we identified 51 potential fake field polygons. To verify their validity, we visualized each field polygon on satellite imagery. 
The results are summarized in the plot below:\n\nGray markers: Fake fields\n\nPink markers: True fields\n\nOrange markers: Potential fake fields\n\n\n\n\nSatellite plot for all 51 potential fake fields\n\n\nAfter visualizing all 51 potential fake field polygons, the findings were as follows:\n\n45.1% were confirmed as fake fields.\n\n29.4% were ambiguous, meaning they could potentially be either fake or real fields, requiring further investigation.\n25.5% were determined to be true fields.\n\nBelow are examples of confirmed fake fields. These polygons often exhibit:\n\nUnusual geometric shapes\nSizes that are disproportionately large compared to neighboring field polygons\n\n\n\n\nfake field polygons\n\n\n\n\nFuture Work\nOur analysis successfully identified a significant number of potential fake field polygons, with nearly half of these cases being validated as genuinely fake. While this demonstrates the effectiveness of our approach, there is still room to improve the accuracy and reliability of the detection process. To further refine our results, future efforts will focus on:\n\nIncorporate Geographic Information:\nEnrich the dataset with geographic features such as proximity to natural landmarks (e.g., mountains, rivers) or man-made structures (e.g., urban areas, roads). These features could provide valuable context for improving the calculation of the Fréchet mean and detecting anomalies more effectively.\nImprove Outlier Detection Methods:\nLeverage advanced machine learning models, such as clustering algorithms or ensemble methods, to identify subtle patterns and relationships that may indicate fake fields. Techniques like unsupervised learning or deep anomaly detection could also be explored to improve performance." 
}, { - "objectID": "posts/vascularNetworks/VascularNetworks.html", - "href": "posts/vascularNetworks/VascularNetworks.html", - "title": "Vascular Networks", - "section": "", - "text": "I introduce some basic concepts of micro-circulation and vascular networks, and how they are created (angiogenesis) in health and disease. I then discuss some angiogenesis models (Anderson-Chaplain as well as BARW) and use the tools of geomstats to analyze the loopy structure of these networks. I explain the characteristics of the loopy structures in the networks in terms of the parameters of the model. 
Furthermore, I consider the time evolution of the graphs created by these networks and how the characterization of the loopy structures changes through time in these networks." }, { - "objectID": "posts/rloop-analysis/rloop-analysis.html#context-and-motivation", - "href": "posts/rloop-analysis/rloop-analysis.html#context-and-motivation", - "title": "Identifying R-loops in AFM imaging data", - "section": "", - "text": "R-loops are three-stranded nucleic acid structures containing a DNA:RNA hybrid and an associated single DNA strand. They are normally created when DNA and RNA interact throughout the lifespan of a cell. Although their existence can be beneficial to a cell, an excessive formation of these objects is commonly associated with instability phenotypes.\nThe role of R-loop structures on genome stability is still not completely determined. The determining characteristics of harmful R-loops still remain to be defined. Their architecture is not very well-known either, and they are normally classified manually.\nIn this blog post, we will carry AFM data to the Kendall shape space and try to develop a method to detect and classify these objects using geomstats (Miolane et al. 2024). We will also talk about a rather simple method that works reasonably well.\n\n\n\n\nFig.1 Pictures of DNA fragments at the gene Airn in vitro. One of them was treated with RNase H and the other was not. The image on the bottom highlights the R-loops that were formed. (Carrasco-Salas et al. 2019)" }, { - "objectID": "posts/rloop-analysis/rloop-analysis.html#preparations-before-data-analysis", - "href": "posts/rloop-analysis/rloop-analysis.html#preparations-before-data-analysis", - "title": "Identifying R-loops in AFM imaging data", - "section": "Preparations before data analysis", - "text": "Preparations before data analysis\nOriginal images will be edited to remove background noise. The figure below from the reference article tries to do that while maintaining some colors. This is useful to track the height of a particular spot.\n\n\n\n\nFig.2 A demonstration of background noise removal (Carrasco-Salas et al. 2019)\n\n\n\nI went a step further and turned these images into binary images. In other words, the images we will use here will consist of black and white pixels, which correspond to 0 and 1 respectively. This makes coding a bit easier, but the height data (or the \(z\) coordinate) will need to be stored in a different matrix.\n\n\n\n\nFig.3 Binarized images of R-loops, for the original image see Fig. 
1\n\n\n\nWe first import the necessary libraries.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport geomstats.backend as gs\ngs.random.seed(2024)\n\nWe process our data and put it into matrices.\n\ndata_original = plt.imread(\"original-data.png\")\ndata = plt.imread(\"edited-data.png\")\n\nx_values = []\ny_values = []\nz_values = []\ndata_points = []\n\nfor i,rows in enumerate(data_original):\n for j,rgb in enumerate(rows):\n if not (rgb[0]*255 < 166 and rgb[0]*255 > 162):\n continue\n if not (rgb[1]*255 < 167 and rgb[1]*255 > 162):\n continue\n if not (rgb[2]*255 < 66 and rgb[2]*255 > 61):\n continue\n # store useful height data\n z_values.append((i,j,rgb[0], rgb[1], rgb[2]))\n\nfor i,rows in enumerate(data):\n for j,entry in enumerate(rows):\n # take white pixels only (entry is a numpy array)\n if (entry.all() == 1):\n y_values.append(j+1)\n x_values.append(i+1)\n data_points.append([i,j])" }, { - "objectID": "posts/rloop-analysis/rloop-analysis.html#a-primitive-approach-that-surprisingly-works", - "href": "posts/rloop-analysis/rloop-analysis.html#a-primitive-approach-that-surprisingly-works", - "title": "Identifying R-loops in AFM imaging data", - "section": "A primitive approach that surprisingly works", - "text": "A primitive approach that surprisingly works\nA way to distinguish lines from loops is to count the number of white pixels in each column. This heavily depends on the orientation. To get a meaningful result, it is required to do this at least \(2\) times, once for columns and once for rows. This is not bulletproof and will sometimes give false positives. 
However, it still gives us a good idea of possible places where there is an R-loop.\n\nwhite_pixel_counts = [0 for _ in range(500)]\n\ndata = plt.imread(\"data-1.png\")\n\nfor i,rows in enumerate(data):\n for j,entry in enumerate(rows):\n # count white pixels only\n if (entry.all() == 1):\n white_pixel_counts[j] += 1\n\nplt.plot(range(500), white_pixel_counts, linewidth=1, color=\"g\")\nplt.xlabel(\"columns\")\nplt.ylabel(\"white pixels\")\n\nplt.legend([\"Amount of white pixels\"])\nplt.show()\n\n\n\n\nFig.4\n\n\n\nWe can see that in Figure \(1\), the R-loops are mainly accumulated on the left side. There is a considerable number of them on the right side as well. There are some of them around the middle, but their numbers are lower. We can see that this is clearly represented in Figure \(4\).\nWith this approach, \(2\) different white pixels in the same column will always be counted even if they are not connected at all, which gives us some false positives. To avoid this issue, we can define the following function taking the position of a white pixel as its input.\n\[ f(x,y) = \left\lbrace \begin{array}{r l}1, & \text{if} ~~ \exists\, c_1,c_2,\dots,c_{\gamma} \in [y-\epsilon, y+\epsilon] ~~ \text{such that the pixel } (x,c_j) \text{ is white for each } j \\0, & \text{otherwise}\end{array} \right.\]\n\(\epsilon\) and \(\gamma\) can be adjusted depending on the data at hand. This gives us a more precise prediction about likely places for an R-loop. In this case, choosing \(\gamma = 8\) and \(\epsilon = 10\) gives us the following graph.\n\n\n\n\nFig.5\n\n\n\nWe can see that Figures \(4\) and \(5\) are quite similar. The columns where the graph peaks are still the same, but we see a decrease in the values between these peaks, which is the expected result. This figure has fewer false positives compared to the previous one, so it is a step in the right direction."
- }, { - "objectID": "posts/rloop-analysis/rloop-analysis.html#an-analysis-using-the-kendall-pre-shape-space", - "href": "posts/rloop-analysis/rloop-analysis.html#an-analysis-using-the-kendall-pre-shape-space", - "title": "Identifying R-loops in AFM imaging data", - "section": "An analysis using the Kendall pre-shape space", - "text": "An analysis using the Kendall pre-shape space\nInitialize the space and the metric on it. Create a Kendall sphere using geomstats.\n\nfrom geomstats.geometry.pre_shape import PreShapeSpace, PreShapeMetric\nfrom geomstats.visualization.pre_shape import KendallSphere\n\nS_32 = PreShapeSpace(3,2)\nS_32.equip_with_group_action(\"rotations\")\nS_32.equip_with_quotient()\nmetric = PreShapeMetric(space=S_32)\nS_32.metric = metric\n\nprojected_points = S_32.projection(gs.array(data_points))\nS = KendallSphere()\nS.draw()\nS.add_points(projected_points)\nS.draw_points(alpha=0.1, color=\"green\", label=\"DNA matter\")\nS.ax.legend()\nplt.show()\n\n\n\n\nFig.6 White pixels projected onto the pre-shape space\n\n\n\nTaking a close look at it will reveal more details about where the points lie in the space.\n\n\n\n\nFig.7 White pixels projected onto the pre-shape space\n\n\n\nThe upper part of the curve consists of points that are on the left side of the image, while the ones below are closer to the middle. We see an inverse relationship between the number of R-loops and the density of these points. This is an expected result when we consider how the Kendall pre-shape space is defined.\nA pre-shape space is a hypersphere. In our case, it has dimension \(3\). Hypothetically, if all of our points were placed at the vertices of similar triangles, their projections to the Kendall pre-shape space would be approximately a single point. In the case of circular objects, there will be more pairs of points that are the same distance away from each other than we would see if the object were a straight line. 
Therefore, we expect points forming a loop (which is a deformed circle for our purposes) to be separated from the other points. In other words, the lower-density areas in the hypersphere correspond to areas with a higher likelihood of R-loop presence.\nThe presence of more R-loops does not indicate that there will be fewer points in the corresponding area of the pre-shape space. It just means that they are further apart and more uniformly spread.\n\n\n\n\nFig.8 A zoomed-in and rotated version of Figure 7. The left side has the lowest density followed by the right side. The middle part has a higher density of points, as expected.\n\n\n\nPoints in the pre-shape space give us possible regions where we may find R-loops. However, they do not guarantee that there will be one in that location. This is evident when we look at the right end of this curve. It has a lower density of points than the left side, which is a result we did not want to see.\n\n\n\n\nFig.9 The right end of the curve in Figure 6\n\n\n\nThis happens because there are more DNA fragments on the right side with a shape similar to a half circle. 
Most of them are not loops, but they are distinct enough that the corresponding projection in the pre-shape space has a low density of points, which are separated from the rest.\nWe can also take a look at the Fréchet mean of the projected points in the pre-shape space.\n\nfrom geomstats.learning.frechet_mean import FrechetMean\n\nprojected_points = S_32.projection(gs.array(data_points))\nS = KendallSphere(coords_type=\"extrinsic\")\nS.draw()\nS.add_points(projected_points)\nS.draw_points(alpha=0.1, color=\"green\", label=\"DNA matter\")\n\nS.clear_points()\nestimator = FrechetMean(S_32)\nestimator.fit(projected_points)\nS.add_points(estimator.estimate_)\nS.draw_points(color=\"orange\", label=\"Fréchet mean\", s=150)\nS.add_points(gs.array(S.pole))\nS.draw_curve(color=\"orange\", label=\"curve from the Fréchet mean to the north pole\")\n\nS.ax.legend()\nplt.show()\n\n\n\n\nFig.10 Fréchet mean of the projected points\n\n\n\nThe point we find is located around the left side of the green curve, which is a result we already expected." - }, { - "objectID": "about.html", - "href": "about.html", - "title": "About", - "section": "", - "text": "About this blog" }, { - "objectID": "index.html", - "href": "index.html", - "title": "Biological shape analysis (under construction)", - "section": "", - "text": "Welcome to MATH 612\n\n\nInstructions and tips for MATH 612 students\n\n\n\nMATH 612\n\n\n\n\n\n\n\n\n\nDec 17, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nOptimal Mass Transport for Shape Progression Study\n\n\nPyTorch Implementation of Benamou-Brenier Formulation\n\n\n\noptimal transport\n\n\nshape morphing\n\n\nBenamou-Brenier’s Formulation\n\n\npytorch\n\n\nautomatic differentiation\n\n\n\n\n\n\n\n\n\nDec 16, 2024\n\n\nSiddharth Rout\n\n\n\n\n\n\n\n\n\n\n\n\nShape analysis of C. 
elegans E cell\n\n\n\n\n\n\nbiology\n\n\n\n\n\n\n\n\n\nDec 16, 2024\n\n\nViktorija Juciute\n\n\n\n\n\n\n\n\n\n\n\n\nFarm Shape Analysis: Linking Geometry with Crop Yield and Diversity\n\n\n\n\n\n\nlandscape-analysis\n\n\nagriculture\n\n\n\n\n\n\n\n\n\nDec 15, 2024\n\n\nMo Wang\n\n\n\n\n\n\n\n\n\n\n\n\nLandmarking the ribosome exit tunnel\n\n\n\n\n\n\nribosome\n\n\ncryo-em\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nDec 15, 2024\n\n\nElla Teasell\n\n\n\n\n\n\n\n\n\n\n\n\nExtensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data\n\n\n\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nDec 11, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nIdentifying R-loops in AFM imaging data\n\n\n\n\n\n\nbiology\n\n\nAFM\n\n\n\n\n\n\n\n\n\nNov 10, 2024\n\n\nBerkant Cunnuk\n\n\n\n\n\n\n\n\n\n\n\n\nVascular Networks\n\n\n\n\n\n\nGraph theory\n\n\nVascular Networks\n\n\n\n\n\n\n\n\n\nNov 5, 2024\n\n\nAli Fele Paranj\n\n\n\n\n\n\n\n\n\n\n\n\nShape Analysis of Contractile Cells\n\n\n\n\n\n\nbiology\n\n\ncell morphology\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nYuqi Xiao\n\n\n\n\n\n\n\n\n\n\n\n\nExploring cell shape dynamics dependency on the cell migration\n\n\n\n\n\n\nCell Morphology\n\n\nCell Migration\n\n\nDifferential Geometry\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nPavel Bukleomishev\n\n\n\n\n\n\n\n\n\n\n\n\nEmbryonic cell size asymmetry analysis\n\n\n\n\n\n\nbiology\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nViktorija Juciute\n\n\n\n\n\n\n\n\n\n\n\n\nTrajectory Inference for cryo-EM data using Principal Curves\n\n\n\n\n\n\nMath 612D\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nForest Kobayashi\n\n\n\n\n\n\n\n\n\n\n\n\nDefining landmarks for the ribosome exit tunnel\n\n\n\n\n\n\nribosome\n\n\ncryo-em\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nOct 25, 2024\n\n\nElla Teasell\n\n\n\n\n\n\n\n\n\n\n\n\nOptimal Mass Transport and its Convex Formulation\n\n\n\n\n\n\noptimal transport\n\n\nshape morphing\n\n\nMonge’s Problem\n\n\nKantorovich’s Formulation\n\n\nBenamou-Brenier’s 
Formulation\n\n\n\n\n\n\n\n\n\nOct 24, 2024\n\n\nSiddharth Rout\n\n\n\n\n\n\n\n\n\n\n\n\nHeterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods\n\n\n\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nSep 18, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nUnderstanding Animal Navigation using Neural Manifolds With CEBRA\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\nmathematics\n\n\nbiomedical engineering\n\n\nneuroscience\n\n\n\n\n\n\n\n\n\nSep 18, 2024\n\n\nDeven Shidfar\n\n\n\n\n\n\n\n\n\n\n\n\nExtracting cell geometry from Atomic Force Microscopy\n\n\nPart 2: Temporal and morphological analysis\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nSep 17, 2024\n\n\nClément Soubrier, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nHorizontal Diffusion Map\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 30, 2024\n\n\nWenjun Zhao\n\n\n\n\n\n\n\n\n\n\n\n\nOrthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 29, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nCentroidal Voronoi Tessellation\n\n\nRelations with Semidiscrete Wasserstein distance\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 26, 2024\n\n\nAryan Tajmir Riahi\n\n\n\n\n\n\n\n\n\n\n\n\nSimulation of tomograms of membrane-embedded spike proteins\n\n\n\n\n\n\ncryo-ET\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nShape Analysis of Cancer Cells\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nRiemannian elastic metric for curves\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nPoint cloud representation of 3D volumes\n\n\nApplication to cryoEM density maps\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nAryan Tajmir Riahi, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nMulti Dimensional Scaling of ribosome exit 
tunnel shapes\n\n\nAnalyze and compare the geometry of the ribosome exit tunnel\n\n\n\ncryo-EM\n\n\nribosome\n\n\nMDS\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nShiqi Yu, Artem Kushner, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAlpha Shapes in 2D and 3D\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 14, 2024\n\n\nWenjun Zhao\n\n\n\n\n\n\n\n\n\n\n\n\nQuasiconformal mapping for shape representation\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 9, 2024\n\n\nClément Soubrier\n\n\n\n\n\n\n\n\n\n\n\n\n3D tessellation of biomolecular cavities\n\n\nProtocol for analyzing the ribosome exit tunnel\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nAug 4, 2024\n\n\nArtem Kushner, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAlignment of 3D volumes with Optimal Transport\n\n\nApplication to cryoEM density maps\n\n\n\nexample\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nAug 4, 2024\n\n\nAryan Tajmir Riahi, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nExtracting cell geometry from Atomic Force Microscopy\n\n\nPart 1: Static analysis\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nJul 31, 2024\n\n\nClément Soubrier, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAnalysis of Eye Tracking Data\n\n\n\n\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nJul 31, 2024\n\n\nLisa\n\n\n\n\n\n\nNo matching items" - }, - { - "objectID": "posts/extension_to_RECOVAR/index.html", - "href": "posts/extension_to_RECOVAR/index.html", - "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", - "section": "", - "text": "In the previous post Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods, we reviewed the pipeline of RECOVAR (Gilles and Singer 2024) to generate movies showing the heterogeneity of proteins, and discussed its pros, cons and some improvements we could make. 
RECOVAR is a linear method which borrows the idea from principal component analysis to project complex structure information within cryo-EM data corresponding to each particle onto a lower dimensional space, where a trajectory is computed to illustrate the conformational and compositional changes (see previous post for details).\nCompared with other methods, mostly based on deep learning, RECOVAR has several advantages, including but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we’ve noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to a series of atomic models, which is usually the final product that structure biologists desire in order to obtain atomic interpretations. In this post, we will focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.\nBefore getting to the Methods, I would like to provide background information about the SARS-CoV2 spike protein. The SARS-CoV2 spike protein is a trimer binding to the surface of the SARS-CoV2 virus. It has a so-called receptor-binding domain (RBD) capable of switching between “closed” and “open” states. When in the open state, the spike is able to recognize and bind to angiotensin-converting enzyme 2 (ACE2), an omnipresent enzyme on the membrane of the cells of the organs in the respiratory system, heart, intestines, testis and kidney (Hikmet et al. 2020). The binding to ACE2 helps the virus dock on the target cells and initialize the invasion and infection of the cells. Therefore, the spike is often the major target for antibody development. 
Previous research has mainly focused on developing drugs neutralizing the RBD regions in the open state. However, as I mentioned before, the spike can switch to the closed state, in which an antibody targeting the open RBD will no longer be able to access it, making the drugs less effective. Motivated by recent progress in the heterogeneity analysis of proteins, researchers now focus on the conformational changes instead of a single homogeneous state. Developing drugs to block the shape change of the spike is considered a potentially more efficient way to neutralize viruses. This is why it is important to have a reliable pipeline to generate movies showing the conformational changes in spike proteins.\n\n\n\nAn illustration of how shape changes in the RBD of the SARS-CoV2 spike lead to the binding of ACE2. Spike is a trimer with three chains (in grey, purple and green). The RBD is located in the part of the spike away from the virus membrane. In this figure, the RBD of one chain (in green) is open and binds to ACE2. (Taka et al. 2020)\n\n\nThe dataset we used comprises 271,448 SARS-CoV2 spike protein particles, with some binding to ACE2. Therefore we would expect the algorithm to be able to deal with both conformational and compositional heterogeneity." }, { - "objectID": "posts/extension_to_RECOVAR/index.html#background", - "href": "posts/extension_to_RECOVAR/index.html#background", - "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", - "section": "", - "text": "In the previous post Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods, we reviewed the pipeline of RECOVAR (Gilles and Singer 2024) to generate movies showing the heterogeneity of proteins, and discussed its pros, cons and some improvements we could make. 
RECOVAR is a linear method which borrows the idea from principal component analysis to project complex structure information within cryo-EM data corresponding to each particle onto a lower dimensional space, where a trajectory is computed to illustrate the conformational and compositional changes (see previous post for details).\nCompared with other methods, mostly based on deep learning, RECOVAR has several advantages, including but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we’ve noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to a series of atomic models, which is usually the final product that structure biologists desire in order to obtain atomic interpretations. In this post, we will focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.\nBefore getting to the Methods, I would like to provide background information about the SARS-CoV2 spike protein. The SARS-CoV2 spike protein is a trimer binding to the surface of the SARS-CoV2 virus. It has a so-called receptor-binding domain (RBD) capable of switching between “closed” and “open” states. When in the open state, the spike is able to recognize and bind to angiotensin-converting enzyme 2 (ACE2), an omnipresent enzyme on the membrane of the cells of the organs in the respiratory system, heart, intestines, testis and kidney (Hikmet et al. 2020). The binding to ACE2 helps the virus dock on the target cells and initialize the invasion and infection of the cells. Therefore, the spike is often the major target for antibody development. 
Previous research has mainly focused on developing drugs neutralizing the RBD regions in the open state. However, as I mentioned before, the spike can switch to the closed state, in which an antibody targeting the open RBD will no longer be able to access it, making the drugs less effective. Motivated by recent progress in the heterogeneity analysis of proteins, researchers now focus on the conformational changes instead of a single homogeneous state. Developing drugs to block the shape change of the spike is considered a potentially more efficient way to neutralize viruses. This is why it is important to have a reliable pipeline to generate movies showing the conformational changes in spike proteins.\n\n\n\nAn illustration of how shape changes in the RBD of the SARS-CoV2 spike lead to the binding of ACE2. Spike is a trimer with three chains (in grey, purple and green). The RBD is located in the part of the spike away from the virus membrane. In this figure, the RBD of one chain (in green) is open and binds to ACE2. (Taka et al. 2020)\n\n\nThe dataset we used comprises 271,448 SARS-CoV2 spike protein particles, with some binding to ACE2. Therefore we would expect the algorithm to be able to deal with both conformational and compositional heterogeneity." }, { - "objectID": "posts/extension_to_RECOVAR/index.html#methods", - "href": "posts/extension_to_RECOVAR/index.html#methods", - "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", - "section": "Methods", - "text": "Methods\n\nReview of the original RECOVAR pipeline\nIn this section I will briefly review RECOVAR. You can refer to the previous post for a more formal and detailed formulation of the problem.\nRECOVAR starts with estimating the mean \(\hat{\mu}\) and covariance matrix \(\hat{\Sigma}\) of the conformations by solving least-squares problems between the projection of the mean conformation and the particle images in the dataset. 
Next, principal components (PCs) can be computed from \(\hat{\mu}\) and \(\hat{\Sigma}\), and we obtain embeddings by projecting conformations onto the latent space formed by those PCs. In order to generate a movie, the authors compute conformational densities by deconvolving densities in the latent space with the embedding uncertainty, and find a path between two specified states maximizing the accumulated density along the path. Then each embedding is converted into a density map via kernel regression.\n\n\nExtensions to RECOVAR: MPPC for path discovery\nThe density-based path discovery algorithm used by RECOVAR is based on the physical consideration that molecules prefer to take the path with the lowest free energy, which is the path with the highest conformational density, and it is robust against outliers. Nevertheless, the time to deconvolve the density is exponential in the number of PCs, and the deconvolution requires large memory. Our 24GB GPU can deconvolve the density in at most 4 dimensions, but 4 PCs are usually not enough to capture enough heterogeneity, as shown in the figure below, which indicates how the eigenvalues change with the number of PCs when applying RECOVAR to the SARS-CoV2 spike dataset. There are still quite large drops in the eigenvalues after 4 PCs.\n\n\n\nEigenvalues (y-axis) of PCs (indexed by x-axis) of the SARS-CoV2 spike dataset applied with RECOVAR\n\n\nTherefore, we proposed an alternative method to discover paths by computing multiple penalized principal curves (MPPC) (Kirov and Slepčev 2017). The basic idea of MPPC is to find one or multiple curves that fit all the given points as closely as possible, with constraints on the number and lengths of the curves. In order to be solved numerically, the curves are usually discretized. Let \(y^1 = (y_1, y_2, ..., y_{m_1}), y^2 = (y_{m_1+1}, y_{m_1+2}, ..., y_{m_1+m_2}),...,y^k = (y_{m-m_k+1}, y_{m-m_k+2},...,y_{m})\) be \(k\) curves represented by \(m=m_1+m_2+...+m_k\) points. 
Let \\(s_c = \\sum_{j=1}^{c}m_j\\) be the index of the last point of curve \\(c\\), with \\(s_0 = 0\\). Each data point \\(x_i\\) is assigned to the closest point on the curves, and we let \\(I_j\\) denote the set of indices of data points assigned to curve point \\(y_j\\). The goal is to minimize: \\[\\sum_{j=1}^m\\sum_{i\\in I_j}w_i|x_i-y_j|^2+\\lambda_1\\sum_{c=0}^{k-1}\\sum_{j=1}^{m_{c+1}-1}|y_{s_c+j+1}-y_{s_c+j}|+\\lambda_1 \\lambda_2 (k-1)\\]\nwhere \\(w_i\\) is the weight assigned to the \\(i\\)th data point, and \\(\\lambda_1\\) and \\(\\lambda_2\\) regularize the lengths and number of the curves. \\(\\sum_{j=1}^m\\sum_{i\\in I_j}w_i|x_i-y_j|^2\\) penalizes the distance of the curves to the data points, \\(\\lambda_1\\sum_{c=0}^{k-1}\\sum_{j=1}^{m_{c+1}-1}|y_{s_c+j+1}-y_{s_c+j}|\\) is the total length of all the curves, and \\(\\lambda_1 \\lambda_2 (k-1)\\) controls the number of curves. In our case, we set \\(w_i\\) to the inverse of the trace of the covariance matrix of the embedding, to make the curves fit better to the embeddings with high confidence.\n\n\nExtensions to RECOVAR: atomic model fitting\nWhen resolving homogeneous structures of proteins, atomic models, rather than density maps, are usually the final product, as they contain more structural information. Atomic models are fitted into density maps either manually or automatically, but most approaches start from scratch, which is inefficient for a series of density maps because the difference between neighboring maps should be relatively small. We can take advantage of this property by updating the coordinates of the model fitted to the previous frame to obtain the model fitted to the current density map. Hence, we proposed two algorithms to fit atomic models, both based on gradient descent.\nLet \\(R_{t-1}\\in \\mathbb{R}^{N_a\\times 3}\\) be the fitted atomic model of the \\((t-1)\\)th density map, where \\(N_a\\) is the number of atoms in the protein. 
We can use a deposited protein structure, or a model predicted from sequence by algorithms like AlphaFold, as \\(R_0\\). Let \\(V_t\\in\\mathbb{R}^{N\\times N\\times N}\\) be the \\(t\\)th density map we want to fit into, where \\(N\\) is the grid size. We cannot directly minimize the “distance” between \\(R_{t-1}\\) and \\(V_t\\), because atomic coordinates cannot be compared with volume maps. A natural way to solve this issue is to map the atomic coordinates to a density map with a function \\(f: \\mathbb{R}^{N_a\\times 3}\\rightarrow \\mathbb{R}^{N\\times N\\times N}\\), for example by summing Gaussian kernels centered at each coordinate, i.e. \\[V_t({\\bf r}=(x,y,z)^T) = \\sum_{k=1}^{N_a} \\exp\\left(-\\frac{\\|{\\bf r} - R_t[k]\\|_2^2}{2\\sigma_k^2}\\right)\\]\nHowever, the computational time for one mapping is \\(O(N^3N_a)\\), which is already very slow, even before considering that we have to map coordinates to densities over many update iterations. Hence, in practice we used truncated Gaussian kernels.\nNow we have all the tools needed for algorithms that fit an atomic model \\(R_{t-1}\\) into a density map \\(V_t\\). Our first algorithm, based purely on gradient descent, is:\n\nWhen computing the loss used for gradient descent, we included not only the difference between \\(V_t\\) and the density mapped from the coordinates, but also the difference between the starting and current bond lengths/angles, to preserve the original structure. Specifically, we computed intra-residue bond lengths (i.e. the bond lengths of \\(N-CA, CA-C \\text{ and } C-O\\)) and inter-residue bond lengths \\(C_i-N_{i+1}\\). Since proteins can have multiple chains (like the SARS-CoV2 spike, which has three chains), we set the inter-residue bond lengths at the end points of the chains to \\(0\\). We used the dihedral angles \\(\\phi\\) (i.e. angles formed by \\(C_i-N_{i+1}-CA_{i+1}-C_{i+1}\\)) and \\(\\psi\\) (i.e. 
angles formed by \\(N_i-CA_i-C_i-N_{i+1}\\)) as bond angles, and similarly set the dihedral angles across chains to \\(0\\).\nOne weakness of gradient descent is that it can easily get stuck in local optima. Leveraging recent progress in diffusion models for protein generation, we proposed the second algorithm as follows:\n\nThe inner for loop is the same as in Algorithm1, where the coordinates are updated through gradient descent. The difference is that an outer loop is added, which diffuses and then denoises the fitted coordinates from the previous round of gradient descent and uses the denoised coordinates as the starting model for the current round of gradient descent. We adapted the diffusion model and the graph neural network (GNN) based denoiser from Chroma (Ingraham et al. 2023)."
  },
  {
    "objectID": "posts/extension_to_RECOVAR/index.html#results",
    "href": "posts/extension_to_RECOVAR/index.html#results",
    "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data",
    "section": "Results",
    "text": "Results\n\nResults on the SARS-CoV2 dataset\nAfter obtaining an ab-initio model containing pose information from CryoSPARC (Punjani et al. 2017), we ran RECOVAR with a dimension of 4 on our SARS-CoV2 spike dataset after downsampling particles to 128. Note that in practice a grid size of 256 or higher is recommended to construct density maps with decent resolution, but we used 128 here for fast testing of the original pipeline and of the extensions later. K-Means clustering was performed to find centers among the embeddings. For comparison with the modified algorithms in the later sections, we show the complete movie of one RBD transitioning from the open state with ACE2 to the closed state below: \nThe original RECOVAR pipeline is able to capture the motion of the RBD between the open and closed states, as well as the compositional changes in ACE2.\n\n\nComparison of paths discovered by density vs. 
MPPC\nTo compare the paths generated by MPPC with the original density-based approach, we ran MPPC on the embeddings with a dimension of 4. The figure below shows the paths generated by the density-based method from state 0 to 1 and from 2 to 3, and the path output by MPPC:\n\n\n\nPaths in 4D space output by the density-based method and by MPPC. In each sub-figure, the path is visualized on the plane formed by one of 6 pairs of principal components.\n\n\nWe can see that the path between states 0 and 1 is completely missing from the MPPC path. The path from 2 to 3 is present in the MPPC path between the orange node and the purple node, but is slightly pulled towards outliers.\nAs mentioned in Methods, one advantage of MPPC is that its low computational cost allows us to fit data in higher dimensions, so we also fit MPPC to the data in 10D. The results are presented in the figure below:\n\nWe can see that the spike in the 10D movie is more flexible and there are more changes in shape than in 4D.\n\n\nResults of atomic model fitting\nWe first tested the two atomic model fitting algorithms on the simplest case, where we start from an atomic model and fit into a density map that is close to the starting model. We took the deposited SARS-CoV2 spike protein structure 7V7R as the initial model and generated the target density map with truncated Gaussian kernels from another protein, 7V7Q.\n\n\n\nStarting (7V7R in blue) and target (7V7Q in brown) SARS-CoV2 spike used to test the atomic model fitting algorithms\n\n\nWe ran 12000 iterations for Algorithm1. To make a fair comparison, the same total number of loops, composed of 60 outer diffusion loops and 200 inner gradient descent loops, was run with Algorithm2. We kept the regularization parameters the same for both algorithms. Both algorithms took about 950 seconds to complete. In UCSF Chimera (Pettersen et al. 
2004), we aligned the initial model 7V7R, the fitted model from Algorithm1, and the fitted model from Algorithm2 with 7V7Q, computed the root mean square deviation (RMSD) between aligned coordinates, and annotated the structures with a red-blue color scale, where red denotes high RMSD (large difference) and blue denotes low RMSD (small difference). The results are shown below:\n\n\n\nLeft: initial model (7V7R) aligned with target model (7V7Q); Middle: fitted model from Algorithm1 aligned with target model (7V7Q); Right: fitted model from Algorithm2 aligned with target model (7V7Q)\n\n\nSurprisingly, Algorithm1 performs better than Algorithm2, with more regions in deep blue indicating low RMSD, though its design is relatively simple. Overall, both algorithms make significant progress from the initial model in fitting the target density map. There are certain white regions with medium RMSD, but the most important motions in the RBD regions are successfully captured.\nTo test whether errors accumulate significantly if we keep using the fitted model from the last frame as the initial model for the current frame, we used our algorithms to fit a series of three density maps, generated from proteins 7V7R, 7V7Q, and 7V7S, starting from 7V7P. Consecutive proteins in the series were aligned and the local RMSD was computed to visualize the degree of conformational change in different regions of different frames more intuitively.\n\n\n\nStarting from 7V7P (brown), we fit the density maps simulated from 7V7R (pink), 7V7Q (blue), and 7V7S (green) sequentially\n\n\n\n\n\nAlignment of consecutive proteins in the test series\n\n\nMost conformational changes in this series occur in the RBD region, with 7V7S undergoing the most significant changes and expected to be the hardest model to fit. 
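The RMSD measure used in these comparisons can be sketched as follows. This is a minimal illustration, not the exact implementation used here (alignment is assumed already done, e.g. in Chimera, and the array names are hypothetical):

```python
import numpy as np

def rmsd(coords_a, coords_b):
    # Root mean square deviation between two aligned (n_atoms, 3) arrays.
    diff = coords_a - coords_b
    return float(np.sqrt((diff * diff).sum(axis=1).mean()))

def per_atom_deviation(coords_a, coords_b):
    # Per-atom deviations, usable for a red-blue coloring of the structure.
    return np.linalg.norm(coords_a - coords_b, axis=1)
```
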
We used the same parameters as before to fit each model with both algorithms, and followed the same procedure to compute and visualize the local RMSD for each frame in the series.\n\n\n\nTest results on the series from Algorithm1\n\n\n\n\n\nTest results on the series from Algorithm2\n\n\nAs in the previous test, Algorithm1 performs better than Algorithm2 in fitting all the maps in the series. Compared with fitting the map generated from 7V7Q starting from the “true” 7V7R, initializing the model with the fitted 7V7R from the previous step does not lead to a significant increase in the RMSD of the fitted 7V7Q. There are some white regions with medium RMSD shared by the three fitted models, but the RMSD of these regions does not increase. There is a part with high RMSD in the left region of the last structure in the series, 7V7S, but the error does not seem to be accumulated from the previous fitting, as the RMSD of this region in the previous fitting is very low."
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#anderson-chaplain-model-of-angiogenesis",
    "href": "posts/vascularNetworks/VascularNetworks.html#anderson-chaplain-model-of-angiogenesis",
    "title": "Vascular Networks",
    "section": "Anderson-Chaplain Model of Angiogenesis",
    "text": "Anderson-Chaplain Model of Angiogenesis\nThe Anderson-Chaplain model of angiogenesis describes the angiogenesis process, taking into account factors such as TAF and fibronectin. 
This model contains three variables \\(\\newcommand{\\R}{\\mathbb{R}}\\) \\(\\newcommand{\\abs}[1]{|#1|}\\)\n\n\\(n = n(X,t): \\Omega \\times \\R \\to \\R\\): the endothelial-cell density (per unit area).\n\\(c = c(X,t): \\Omega \\times \\R \\to \\R\\): the tumor angiogenic factor (TAF) concentration (nmol per unit area).\n\\(f = f(X,t): \\Omega \\times \\R \\to \\R\\): the fibronectin concentration (nmol per unit area).\n\nand the time evolution is governed by the following system of PDEs\n\\[\\begin{align*}\n &\\frac{\\partial n}{\\partial t} = D_n\\nabla^2 n - \\nabla\\cdot(\\chi n\\nabla c) - \\nabla\\cdot(\\rho n \\nabla f), \\\\\n &\\frac{\\partial c}{\\partial t} = -\\lambda n c, \\\\\n &\\frac{\\partial f}{\\partial t} = \\omega n - \\mu n f,\n \\end{align*}\\]\nwhere \\(D_n\\) is a diffusion constant accounting for the random movement of tip cells, and \\(\\chi, \\rho\\) reflect the strength of the chemotaxis of tip cells due to the gradients of TAF and fibronectin, respectively. Furthermore, \\(\\lambda, \\mu\\) are the rates at which tip cells consume TAF and fibronectin, respectively, and \\(\\omega\\) denotes the production of fibronectin by the tip cells. Note that we assume that, at the start of the angiogenesis process, fibronectin and TAF have steady-state distributions and are not diffusing. This assumption is not entirely accurate and can be refined.\nIn this report, we will be using the discrete, stochastic variant of this model. For more detail see (Anderson and Chaplain 1998). See the figure below for some example outputs of the model.\n\n\n\nSome example outputs of the Anderson-Chaplain model of angiogenesis using the implementation of the model shared by (Nardini et al. 2021). We have assumed the source of TAF molecules is located at the right edge of the domain, while the pre-existing parent vessel is located at the left edge of the domain. 
The strengths of the chemotactic and haptotactic (due to fibronectin) signaling are set to \\(\\chi = 0.4\\) and \\(\\rho = 0.4\\)."
  },
  {
    "objectID": "posts/extension_to_RECOVAR/index.html#discussion",
    "href": "posts/extension_to_RECOVAR/index.html#discussion",
    "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data",
    "section": "Discussion",
    "text": "Discussion\nIn this project we proposed MPPC as an alternative approach to computing paths. Although this method can find paths in higher dimensions very quickly, it is more sensitive to outliers. One way to address this issue is to iteratively remove points that are far from the curves and then refit the curves. Another feature of MPPC is that it does not take starting and ending points as input. This can be either an advantage or a disadvantage, depending on the objective. MPPC works well if the goal is to study the conformational change trajectories in the entire space. Nevertheless, if we are more interested in how proteins transition between two specific states, MPPC may output a path that does not even pass through these two states. On the other hand, the movie produced from the trajectories found by MPPC in higher dimensions indeed captures more changes in shape, which helps discover rare conformations.\nOne problem occurring in many datasets, like the one we tested, is that the output path contains both conformational and compositional heterogeneity. In the movies of the spike we can see ACE2 suddenly appear or disappear at the top of the lifted RBD region. This is desirable, as we want the algorithm to discover compositional heterogeneity as well, but it causes trouble for atomic model fitting. In the conventional pipeline, this problem is addressed via discrete 3D classification to separate particles with different compositions, which may not be very accurate when applied to complex datasets with both compositional and conformational heterogeneity. 
In fact, cryoSPARC’s 3D classification fails to distinguish particles with and without ACE2 on our spike protein dataset when run without templates. Instead, we may want to leverage the powerful tools of RECOVAR and directly classify particles in the continuous latent space. One potential approach would be to segment the latent space based on the mass of the volume associated with each embedding. This approach may not work when the compositional difference does not lead to a change in mass, but as long as the compositional heterogeneity leads to a difference in mass that is more significant than the noise (like SARS-CoV2 spike + ACE2 in our case), this method should work. We checked the feasibility of this approach by computing the mass of the density maps over time in a movie output by RECOVAR on our SARS-CoV2 data, as follows:\n\n\n\nIllustration of how the mass of the density map changes in the movie of the SARS-CoV2 spike, some particles binding to ACE2\n\n\nThis movie demonstrates relatively complex changes in the spike protein, which undergoes the following transitions: one RBD up + one ACE2 -> one RBD up -> all RBDs down -> 1 RBD up -> 2 RBDs up + 1 ACE2 -> 2 RBDs up + 2 ACE2’s. There is a clear cutoff at a mass of around 900,000, above which ACE2 is present. The difference in mass between states with 1 ACE2 and with 2 ACE2’s is not very obvious, but separating spikes with and without ACE2 is enough for the purpose of fitting atomic models to the density maps from the closed states up to the moment where the RBD completely lifts but without ACE2.\nRegarding our atomic model fitting algorithms, Algorithm1, which is based purely on gradient descent, works surprisingly well, even better than Algorithm2, whose design is more complex. Both algorithms update the changes in the RBD region with high accuracy. Although some regions with medium fitting quality in the first frame are inherited by later fittings, the RMSD does not rise further. 
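The mass-based separation discussed earlier in this section can be sketched as follows. This is a toy illustration of the thresholding idea only: the function names are invented, the cutoff of 900,000 is in the arbitrary density units used above, and voxel values are assumed non-negative:

```python
import numpy as np

def density_mass(volume):
    # 'Mass' of a density map: the sum of voxel values
    # (optionally multiplied by the voxel volume).
    return float(volume.sum())

def has_ace2(volume, cutoff=900_000.0):
    # Classify a frame as spike + ACE2 if its mass exceeds the cutoff.
    return density_mass(volume) > cutoff
```
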
One improvement we can make to our current algorithms is to turn the constant sigma of the Gaussian kernel used to map coordinates to density maps into an annealing parameter. Initially we make sigma large to enable the model to undergo large conformational changes; later we shrink sigma for better fitting in the local region."
  },
  {
    "objectID": "posts/AFM-data/index.html",
    "href": "posts/AFM-data/index.html",
    "title": "Extracting cell geometry from Atomic Force Microscopy",
    "section": "",
    "text": "We present here the protocol to process biological images such as bacterial atomic force microscopy data. We want to study the bacterial cell shape and extract its main geometrical features."
},
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#branching-annihilating-random-walker",
    "href": "posts/vascularNetworks/VascularNetworks.html#branching-annihilating-random-walker",
    "title": "Vascular Networks",
    "section": "Branching-Annihilating Random Walker",
    "text": "Branching-Annihilating Random Walker\nThe Anderson-Chaplain model of angiogenesis is not the only formulation of this phenomenon. A popular alternative formulation uses the notion of branching-annihilating random walkers to explain the branching morphogenesis of vascular networks. A very detailed discussion of this formulation can be found in Uçar et al. (2021). This formulation has also been successful in modeling a vast variety of tip-driven morphogenesis processes, in mammary glands, prostate, and kidney (Hannezo et al. 2017), the lymphatic system (Uçar et al. 2023), neural branching (Uçar et al. 2021), etc.\nThe core idea behind this formulation is to assume that the tip cells undergo a branching-annihilating random walk, i.e. they move randomly in space, split into pairs at random times (branching), produce new cells (stalk) behind their trails as they move, and finally annihilate if they encounter any of the stalk cells. See the figure below:\n\n\n\nThe network generated by the branching-annihilating process, where the tip cells (orange circles) perform a random walk (not necessarily an unbiased one) and each generates two random walkers at random times (branching). The tip cells create the stalk cells (the blue lines) along their way, and annihilate when they encounter any of the stalk cells."
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#time-evolution-of-networks",
    "href": "posts/vascularNetworks/VascularNetworks.html#time-evolution-of-networks",
    "title": "Vascular Networks",
    "section": "Time Evolution Of Networks",
    "text": "Time Evolution Of Networks\nVascular networks are not static structures; rather, they evolve in time in response to the changing metabolic demand of the underlying tissue, as well as the metabolic cost of the network itself and the overall energy required to pump the fluid through the network (see Pries and Secomb (2014) for more discussion). To put this differently, the role of a vascular network is to deliver nutrients to the tissue and remove the wastes. To do this, it needs to have a space-filling configuration with lots of branches. However, by Poiseuille’s law for the flow of fluids in a tube, the power needed to pump the fluid through a tube scales with \\(r^{-4}\\), where \\(r\\) is the radius of the tube; i.e. smaller vessel segments need a huge power to pump the blood through them. Thus a massively branched structure is not an optimal solution. On the other hand, the vascular network consists of cells which require maintenance as well, so the optimized vascular network should also have a low volume. Because of these competing dynamics, in the angiogenesis process a mesh of new blood vessels first forms, which later evolves into a more ordered and hierarchical structure in a self-organization process.\n\n\n\nRemodeling of the vascular network of the chick chorioallantoic membrane. 
Initially (sub-figure 1) a mesh of vascular networks forms. Then (sub-figures 2, 3, 4), through the remodeling dynamics, a more ordered and hierarchical structure emerges. Images are taken from (Richard et al. 2018).\n\n\nTo determine the time evolution of the vascular network, we first need to formulate the problem in an appropriate way. First, we represent a given vascular network with a weighted graph \\(G=(\\mathcal{V},\\mathcal{E})\\), where \\(\\mathcal{V}\\) is the set of vertices and \\(\\mathcal{E}\\) is the edge set. We define the pressure \\(\\mathbf{P}\\) on the nodes and the flow \\(\\mathbf{Q}\\) on the edges, and let \\(C_{i,j}\\) denote the conductivity of an edge and \\(L_{i,j}\\) the length of the same edge. Given the source and sink terms \\(\\mathbf{q}\\) on the nodes, the pressures can be determined from \\[\\mathcal{L} \\mathbf{P} = \\mathbf{q},\\] where \\(\\mathcal{L}\\) is the Laplacian matrix of the graph. For more details on this see . Once we know the pressures on the nodes, we can easily calculate the flow through the edges by \\[\\mathbf{Q} = \\mathbf{C} \\mathbf{L}^{-1} \\mathbf{\\Delta} \\mathbf{P}, \\tag{2}\\] where \\(\\mathbf{C}\\) is the diagonal matrix of the conductances of the edges, \\(\\mathbf{L}\\) is the diagonal matrix of the lengths of the edges, \\(\\mathbf{\\Delta}\\) is the transpose of the incidence matrix, \\(\\mathbf{P}\\) is the vector of pressures on the nodes, and \\(\\mathbf{Q}\\) is the vector of flows on the edges. Once we know the flows in the edges, we can design an evolution law to describe the time evolution of the weights of the edges (which, by Poiseuille’s law, are functions of the radii of the vessel segments). The evolution law can be derived by defining an energy functional and moving down its gradient to minimize it, or we can take an ad-hoc approach and write a mechanistic ODE for the time evolution of the conductances. 
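The pressure solve and flow computation described above can be sketched as follows. This is a minimal sketch on a made-up toy network (the conductances, lengths, and sources are invented for illustration):

```python
import numpy as np

# Toy network: 3 nodes, 2 edges (0-1, 1-2), unit source at node 0, sink at node 2.
edges = [(0, 1), (1, 2)]
C = np.array([1.0, 2.0])        # edge conductances
L = np.array([1.0, 1.0])        # edge lengths
q = np.array([1.0, 0.0, -1.0])  # node sources/sinks (must sum to zero)

n_nodes = 3
# Incidence matrix D (edges x nodes): +1 at the tail node, -1 at the head node.
D = np.zeros((len(edges), n_nodes))
for e, (i, j) in enumerate(edges):
    D[e, i], D[e, j] = 1.0, -1.0

W = np.diag(C / L)    # edge weights C_e / L_e
Lap = D.T @ W @ D     # weighted graph Laplacian

# The Laplacian is singular (constant vectors are in its nullspace),
# so pin the last node's pressure to zero and solve the reduced system.
P = np.zeros(n_nodes)
P[:-1] = np.linalg.solve(Lap[:-1, :-1], q[:-1])

Q = W @ D @ P         # edge flows, i.e. Q = C L^{-1} Delta P
```

The computed flows satisfy conservation at every node (net outflow equals the source term), which is a useful sanity check for larger networks.
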
For the energy functional one can write \\[ E(\\mathbf{C}) = \\frac{1}{2} \\sum_{e\\in \\mathcal{E}}(\\frac{Q_e^2}{C_e} + \\nu C_e^\\gamma), \\] where \\(\\mathcal{E}\\) is the edge set of the graph, \\(Q_e, C_e\\) are the flow and conductance of the edge \\(e\\), and \\(\\nu, \\gamma\\) are parameters. The first term in the sum is of the form “power = current \\(\\times\\) potential” and reflects the power required to pump the flow, and the second term can be shown to reflect the total volume of the network. We can set \\[ \\frac{d \\mathbf{C}}{dt} = -\\nabla E, \\] which determines the time evolution of the weights in a direction that reduces the total energy. The steady-state solution of this ODE system is precisely the Euler-Lagrange formulation of the least action principle. Alternatively, one can come up with carefully designed ODEs for the time evolution of the conductances that represent certain biological facts. In particular \\[ \\frac{d C_e}{dt} = \\alpha |Q_e|^{2\\sigma} - b C_e + g \\] proposed by , and \\[ \\frac{d}{dt} \\sqrt{C_e} = F(Q_e) - c\\sqrt{C_e}, \\] proposed by , have been popular choices. See for more details. It is important to note that in the simulations shown here, the initial network is a toy network. This can be improved by using a vascular network generated by any of the angiogenesis models discussed before.\n\n\n\nTime evolution of an optimal transport network. A triangulation of a 2D domain is taken to be the graph over which we optimize the flow. The sink term is represented by the green dot, while the sources are represented by yellow dots. Different sub-figures show the flow network at different time steps as it converges towards the optimal configuration."
},
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#enhanced-loop-detection-algorithm",
    "href": "posts/vascularNetworks/VascularNetworks.html#enhanced-loop-detection-algorithm",
    "title": "Vascular Networks",
    "section": "Enhanced Loop Detection Algorithm",
    "text": "Enhanced Loop Detection Algorithm\nPreviously, we generated .png images of the simulation results (see figures above) and then performed image analysis to detect loops; for instance, we convolved the image with 4-connectivity and 8-connectivity matrices to extract the graph structures present in the images. In the new approach, instead, we record the structure of the network in a NetworkX data structure. This is not an easy task to perform without smart usage of object-oriented programming, so we organized our code into the following classes.\n\nUsing this structure, we can record the graph structure of the generated networks as a NetworkX graph. Then we can use some of the built-in functions of this library to get the loops (cycles) of the network. However, since the generated networks are large, finding all of the loops (at all scales) is computationally very costly. Instead, we first found a minimal set of cycles in the graph that forms a basis for the cycle space, i.e. we found the loops that can be combined (by symmetric difference) to generate new loops. The following figure shows the basis loops highlighted on the graph.\n\nAs mentioned above, the detected cycles are the basis cycles. The set of all cycles in a graph forms a vector space, and the basis cycles form a basis for that space. In other words, these cycles are all the cycles necessary to generate all of the cycles in the graph. The addition operation between two cycles is the symmetric difference of their edge sets (or the XOR of their edges). 
We can combine the basis cycles to generate higher-level (and lower-level) structures, as shown below.\n\nWe can also extract and scale all of the loops for further analysis. The following figure shows all the loops in the network.\n\nThe following figures show some of the loop structures that we can get by combining the loops above."
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#statistical-analysis-of-loops",
    "href": "posts/vascularNetworks/VascularNetworks.html#statistical-analysis-of-loops",
    "title": "Vascular Networks",
    "section": "Statistical Analysis of Loops",
    "text": "Statistical Analysis of Loops\nThe mechanism that generates the vascular networks is a stochastic process (branching process + simple random walk + local interactions (annihilation)). So we need to use statistical notions to make some observations. In the figure below, the histogram of the cycle lengths is plotted. The interesting observation is that the number of cycles is exponentially distributed (with respect to the cycle length). The slope of this line (on a log-log plot) can reveal some very important facts about the universality class that our model belongs to. 
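The slope estimate described above can be sketched as follows. This is a toy illustration: the cycle lengths are invented, and a simple least-squares line through the log-log histogram stands in for whatever fitting procedure is actually used:

```python
import numpy as np

# Hypothetical cycle lengths (made up for illustration only).
lengths = np.array([4, 4, 4, 6, 6, 8, 8, 10, 12, 16])

# Histogram of cycle lengths, then a least-squares line through the
# log-log points as one simple way to estimate the slope discussed above.
values, counts = np.unique(lengths, return_counts=True)
slope, intercept = np.polyfit(np.log(values), np.log(counts), 1)
```
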
Not only is this very interesting and important from a theoretical point of view, but it can also have very useful practical applications. For instance, when comparing the simulated network with real vascular networks, this slope can be one of the components of the comparison.\n\nFurthermore, it is instructive to study the correlation matrix between some of the features of the loops."
  },
  {
    "objectID": "posts/AFM-data/index.html#biological-context",
    "href": "posts/AFM-data/index.html#biological-context",
    "title": "Extracting cell geometry from Atomic Force Microscopy",
    "section": "Biological context",
    "text": "Biological context\nMycobacterium smegmatis is a Gram-positive rod-shaped bacterium. It is 3 to 5 \\(\\mu m\\) long and around 500 \\(nm\\) wide. This non-pathogenic species is often used as a biological model to study the pathogenic Mycobacteria such as M.tuberculosis (responsible for tuberculosis) or M.abscessus, with which it shares the same cell wall structure (Tyagi and Sharma 2002). In particular, M.smegmatis grows fast (3-4 hours doubling time, compared to 24h for M. tuberculosis), allowing for faster experimental protocols.\nHere are some known properties of M.smegmatis bacteria:\n\nThey present variations of cell diameter along their longitudinal axis (Eskandarian et al. 2017). The cell diameter is represented as a height profile along the cell centerline. We respectively name peaks and troughs the local maxima and minima of this profile.\n\n\n\n\n3D image of M.smegmatis. The orange line represents the height profile.\n\n\n\nThey grow following biphasic and asymmetrical polar dynamics (Hannebelle et al. 2020). The cells elongate from the poles, where material is added. After division, the pre-existing pole (OP) elongates at a high rate, whereas the newly created pole (NP) first grows slowly, and then switches to fast growth after the New End Take Off (NETO).\n\n\n\n\nGrowth dynamics."
  },
  {
    "objectID": "posts/AFM-data/index.html#raw-image-pre-processing",
    "href": "posts/AFM-data/index.html#raw-image-pre-processing",
    "title": "Extracting cell geometry from Atomic Force Microscopy",
    "section": "Raw image pre-processing",
    "text": "Raw image pre-processing\n\nData\nSeveral data acquisitions were conducted with wild types and different mutant strains. The raw data is composed of time series of AFM log files for each experiment. Each log file contains several images, each one representing a physical channel such as height, stiffness, adhesion, etc. After extraction of the data, forward and backward cells are aligned, and artefacts such as image scars are detected and corrected.\n\n\n\nAt each time step, images representing different physical variables are produced by the AFM"
  },
  {
    "objectID": "posts/AFM-data/index.html#segmentation",
    "href": "posts/AFM-data/index.html#segmentation",
    "title": "Extracting cell geometry from Atomic Force Microscopy",
    "section": "Segmentation",
    "text": "Segmentation\nAt each time step, images are segmented to detect each cell using the cellpose package (Stringer et al. 2021). If available, different physical channels are combined to improve the segmentation. Forward and backward images are also combined.\n\n\n\nImages are combined to improve the segmentation\n\n\nHere is an example of how to use cellpose on an image. Different models are available (via the seg_mod variable), depending on the training datasets. 
With cellpose 3, different denoising models are also available (via the denoise_mod variable).\n\n\nCode\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom cellpose import io, denoise, plot\nfrom PIL import Image\n\n\n'''\nParameters\n'''\n\nimage_path = 'raw_img.png'\npath_to_save = 'segmented_img'\n# Segmentation model type\nseg_mod = 'cyto' \n# Denoising model\ndenoise_mod = \"denoise_cyto3\" \n# Expected cell diameter (pixels)\ndia = 40\n# Type of segmentation (with / without nuclei, different color channels or not)\nchan = [0,0] \n# Segmentation sensitivity parameters\nthres = 0.8\ncelp = 0.4\n\n'''\nComputing segmentation\n'''\n\n\n# Opening the image to segment\nimg=np.array(Image.open(image_path))[:,:,1]\n\n# Choosing a model type\nmodel = denoise.CellposeDenoiseModel(gpu=False, model_type=seg_mod, restore_type=denoise_mod)\n\n# Computing the segmentation\nmasks, flows, st, diams = model.eval(img, diameter = dia, channels=chan, flow_threshold = thres, cellprob_threshold=celp)\n\n\n# Saving the results into a numpy file\nio.masks_flows_to_seg(img, masks, flows, path_to_save, channels=chan, diams=diams)\n\n\nWe plot the final results:\n\n\nCode\nplt.imshow(img,cmap='gray')\nplt.show()\n\n\n\n\n\nRaw image\n\n\n\n\n\n\nCode\nmask_RGB = plot.mask_overlay(img,masks)\nplt.imshow(mask_RGB)\nplt.show()\n\n\n\n\n\nImage with segmented masks overlaid"
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#geometric-shape-analysis-fréchet-and-hausdorff-distances",
    "href": "posts/vascularNetworks/VascularNetworks.html#geometric-shape-analysis-fréchet-and-hausdorff-distances",
    "title": "Vascular Networks",
    "section": "Geometric Shape Analysis: Fréchet and Hausdorff Distances",
    "text": "Geometric Shape Analysis: Fréchet and Hausdorff Distances\nIn geometric shape analysis, comparing cycles involves quantifying their similarity based on the spatial arrangement of points in each cycle. 
Two widely used measures for such comparisons are the Fréchet Distance and the Hausdorff Distance. These metrics provide different insights into the relationship between cycles, and their results can be visualized as heatmaps of pairwise distances.\n\nFréchet Distance\nThe Fréchet Distance between two curves $ A = \\{a(t) \\mid t \\in [0,1]\\} $ and $ B = \\{b(t) \\mid t \\in [0,1]\\} $ is defined as:\n\\[\nd_F(A, B) = \\inf_{\\alpha, \\beta} \\max_{t \\in [0,1]} \\| a(\\alpha(t)) - b(\\beta(t)) \\|,\n\\]\nwhere:\n\n$ \\alpha(t) $ and $ \\beta(t) $ are continuous, non-decreasing reparameterizations of the curves $ A $ and $ B $.\n$ \\| \\cdot \\| $ denotes the Euclidean norm.\nThe infimum is taken over all possible parameterizations $ \\alpha $ and $ \\beta $.\n\n\nInterpretation of Heatmap\nThe heatmap for the Fréchet distance shows the pairwise distances between all cycles. Each entry $ (i, j) $ in the heatmap represents $ d_F(C_i, C_j) $, the Fréchet distance between cycle $ C_i $ and cycle $ C_j $. Key insights include:\n\nSmall Values: Cycles with low Fréchet distances are geometrically similar in terms of overall shape and trajectory.\nLarge Values: Larger distances indicate significant differences in the geometry or shape of the cycles.\n\nThe heatmap highlights clusters of similar cycles and outliers with unique geometries.\n\n\n\n\nHausdorff Distance\nThe Hausdorff Distance between two sets of points $ A $ and $ B $ is defined as:\n\\[\nd_H(A, B) = \\max \\{ \\sup_{a \\in A} \\inf_{b \\in B} \\| a - b \\|, \\sup_{b \\in B} \\inf_{a \\in A} \\| b - a \\| \\}.\n\\]\nThis can be broken down into:\n\n$ \\sup_{a \\in A} \\inf_{b \\in B} \\| a - b \\| $: The maximum distance from a point in $ A $ to the closest point in $ B $.\n$ \\sup_{b \\in B} \\inf_{a \\in A} \\| b - a \\| $: The maximum distance from a point in $ B $ to the closest point in $ A $.\n\nThe Hausdorff distance quantifies the greatest deviation between the two sets of points, considering how well one set covers the other.\n\n\nInterpretation of Heatmap\nThe heatmap for the Hausdorff distance shows pairwise distances between 
cycles. Each entry $ (i, j) $ represents $ d_H(C_i, C_j) $, the Hausdorff distance between cycle $ C_i $ and cycle $ C_j $. Key insights include:\n\nSmall Values: Indicate that the points of one cycle are closely aligned with the points of another cycle.\nLarge Values: Reflect that one cycle has points significantly farther away from the other, suggesting geometric dissimilarity.\n\nThe heatmap highlights cycles that are well-aligned (small distances) and those that are far apart in terms of shape.\n\n\n\nComparison of Metrics\n\nFréchet Distance: Sensitive to the ordering of points along the curves, making it suitable for comparing trajectories or continuous shapes.\nHausdorff Distance: Ignores the order of points and focuses on the maximum deviation between sets, making it useful for analyzing shape coverage.\n\nBoth metrics complement each other in analyzing the geometric properties of cycles. While the Fréchet distance emphasizes trajectory similarity, the Hausdorff distance focuses on the extent of shape overlap."
  },
  {
    "objectID": "posts/AFM-data/index.html#centerline",
    "href": "posts/AFM-data/index.html#centerline",
    "title": "Extracting cell geometry from Atomic Force Microscopy",
    "section": "Centerline",
    "text": "Centerline\nSince we are interested in studying the variations of the cell diameter, we define the height profile as the value of the cell height along the cell centerline. The cell centerlines are computed using a skeletonization algorithm Lee, Kashyap, and Chu (1994). 
Here is an example of skeletonization:\n\n\nCode\nfrom skimage.morphology import skeletonize\n\n# Selecting first mask\nfirst_mask = masks == 1\n\nskel_img = skeletonize(first_mask, method='lee') \nskel = np.argwhere(skel_img)\nplt.imshow(first_mask, cmap='gray')\n\nplt.scatter(skel[:,1], skel[:,0], 0.5*np.ones(np.shape(skel[:,0])), color='r', marker='.')\nplt.show()\n\n\n\n\n\n\n\nDepending on the mask shapes, centerlines may have branches:\n\n\nCode\nfrom skimage.morphology import skeletonize\n\n# Selecting first mask\nfirst_mask = masks == 3\n\nskel_img = skeletonize(first_mask) #, method='lee'\nskel = np.argwhere(skel_img)\nplt.imshow(first_mask, cmap='gray')\n\nplt.scatter(skel[:,1], skel[:,0], 0.5*np.ones(np.shape(skel[:,0])), color='r', marker='.')\nplt.show()\n\n\n\n\n\n\n\nIn practice, centerlines are pruned and extended to the cell poles, in order to capture the cell length. Other geometrical properties such as mask centroids or outlines are computed as well.\n\n\n\nFinal static processing results on real-life data. White masks are excluded from the cell tracking algorithm (see part 2). Black dots are cell centroids. The yellow boxes represent artefact cleaning."
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#dimensionality-reduction",
    "href": "posts/vascularNetworks/VascularNetworks.html#dimensionality-reduction",
    "title": "Vascular Networks",
    "section": "Dimensionality Reduction",
    "text": "Dimensionality Reduction\nNonlinear dimensionality reduction methods project high-dimensional data into a lower-dimensional space while preserving specific structural properties.\n\nt-SNE (t-Distributed Stochastic Neighbor Embedding)\nt-SNE minimizes the divergence between probability distributions over pairwise distances in high-dimensional and low-dimensional spaces. It focuses on preserving local structures (relationships between nearby points) and is particularly effective at uncovering clusters. 
The key parameters are Perplexity, which controls the balance between local and global structure (default: 30), and Output Dimension, reduced to 2D for visualization.\n\n\n\nSome notes to interpret the plot: cycles forming tight clusters share strong similarities in features such as length, area, or compactness. Isolated points (outliers) indicate rare or unique geometries. t-SNE emphasizes local structures, making it ideal for detecting smaller, tightly-knit groups.\n\n\n\n\nUMAP (Uniform Manifold Approximation and Projection)\nUMAP approximates the high-dimensional data manifold and optimally preserves both local and global structures. It provides more interpretable embeddings with smooth transitions between clusters. The key parameters are Number of Neighbors, which defines the size of the local neighborhood considered for embedding (default: 15), and Output Dimension, reduced to 2D for visualization.\n\n\n\nSome notes to interpret the plot: UMAP preserves both local and global structures, making it suitable for analyzing large-scale patterns. Transitions between clusters indicate gradual changes in feature space, useful for understanding progression or hierarchy in cycle characteristics. Dense clusters suggest strong feature alignment, while sparse areas highlight feature variability.\n\n\n\n\nConclusion\nWe used a stochastic process (a Branching Annihilating Random Walker) to generate random networks that resemble vascular networks. Then we translated this structure into a NetworkX graph for easier processing. We extracted a cycle basis for the cycle space of the graph and, using the symmetric difference operation, generated new cycles (of different scales). We then performed different statistical and geometrical analyses on the shapes of the loops in the graph. We also calculated different features of the graph and used dimensionality reduction methods to see if we can observe any structures (clusters) in low dimension."
},
  {
    "objectID": "posts/cryo_ET/demo.html",
    "href": "posts/cryo_ET/demo.html",
    "title": "Simulation of tomograms of membrane-embedded spike proteins",
    "section": "",
    "text": "Cryogenic electron tomography (cryo-ET) is an imaging technique to reconstruct high-resolution 3D structures, usually of biological macromolecules. Samples (usually small cells like bacteria and viruses) are prepared in a standard aqueous medium (unlike cryo-EM, where samples are frozen) and are imaged in a transmission electron microscope (TEM). The samples are tilted to different angles (e.g. from \\(-60^\\circ\\) to \\(+60^\\circ\\)), and images are obtained at every incremented degree (usually every \\(1^\\circ\\) or \\(2^\\circ\\)).\nThe main advantage of cryo-ET is that it allows the cells and macromolecules to be imaged in an undisturbed state. This is crucial in many applications such as drug discovery, where we need to know the in-situ binding state of the target of interest (e.g. a viral spike protein) with the drug.\n\n\n\nTomographic slices of SARS-CoV-2 virions, with spike proteins embedded in the membrane (Shi et al. 2023)\n\n\nIn order to reconstruct macromolecules, tomographic slices need to be processed through a pipeline. A typical cryo-ET data processing pipeline includes: tilt series alignment, CTF estimation, tomogram reconstruction, particle picking, iterative subtomogram alignment and averaging, and heterogeneity analysis. Unlike cryo-EM, many algorithms for cryo-ET processing are still under development. Therefore, a large database of cryo-ET data to test and tune algorithms is important. Unfortunately, collecting cryo-ET data is both time- and money-consuming, and the current database of cryo-ET is not large enough, especially for deep learning training, which requires a large amount of data. Therefore, simulation becomes a substitute to generate a large amount of data in a short time and at low expense. 
In this post, we will focus on the simulation of membrane-embedded proteins."
  },
  {
    "objectID": "posts/vascularNetworks/VascularNetworks.html#appendix",
    "href": "posts/vascularNetworks/VascularNetworks.html#appendix",
    "title": "Vascular Networks",
    "section": "Appendix",
    "text": "Appendix\nFor a graph, the Laplacian matrix contains the information on the in/out flow of stuff into the nodes.\n\n\n\nThen the degree matrix is given by \\[ D = \\begin{pmatrix}\n 2 & 0 & 0 & 0 & 0 \\\\\n 0 & 4 & 0 & 0 & 0 \\\\\n 0 & 0 & 2 & 0 & 0 \\\\\n 0 & 0 & 0 & 2 & 0 \\\\\n 0 & 0 & 0 & 0 & 2\n \\end{pmatrix}, \\] the adjacency matrix is given by \\[ A = \\begin{pmatrix}\n 0 & 1 & 1 & 0 & 0 \\\\\n 1 & 0 & 1 & 1 & 1 \\\\\n 1 & 1 & 0 & 0 & 0 \\\\\n 0 & 1 & 0 & 0 & 1 \\\\\n 0 & 1 & 0 & 1 & 0\n \\end{pmatrix}, \\] and the Laplacian matrix is given by \\[ L = D - A =\n \\begin{pmatrix}\n 2 & -1 & -1 & 0 & 0 \\\\\n -1 & 4 & -1 & -1 & -1 \\\\\n -1 & -1 & 2 & 0 & 0 \\\\\n 0 & -1 & 0 & 2 & -1 \\\\\n 0 & -1 & 0 & -1 & 2\n \\end{pmatrix}.\n \\] It is straightforward to generalize the notion of the Laplacian matrix to weighted graphs: in the degree matrix $ D $, the diagonal entries will be the sums of the weights of the edges connected to each node, and in the adjacency matrix, instead of zeros and ones, we will have the weights of the connections.\nThere is also another way of finding the Laplacian matrix, by using the notion of the incidence matrix. To do so, we first need to make our graph directed. Any assignment of directions to the edges will do the job and will yield a correct answer. 
For instance, consider the following directed graph.\n\n\n\nIts incidence matrix will be \\[\n M = \\begin{pmatrix}\n -1 & 1 & 0 & 0 & 0 & 0 \\\\\n 0 & -1 & 1 & -1 & 0 & -1 \\\\\n 1 & 0 & -1 & 0 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 & 1 & 0 \\\\\n 0 & 0 & 0 & 0 & -1 & 1 \\\\\n \\end{pmatrix}\n \\] The Laplacian matrix can be written as \\[ \\mathcal{L} = M M^T. \\] Note that in the case of weighted graphs, we will have \\[ \\mathcal{L} = M W M^T \\tag{1}\\] where $ W $ is a diagonal matrix containing the weights. These computations can be done easily in NetworkX.\nThe incidence matrix is also very useful in calculating the pressure difference between the nodes of a particular edge. Let \\(\\Delta = M^T\\). Given the vector \\(P\\) that contains the pressures on the vertices, the pressure difference on the edges will be given by \\(\\Delta P\\), where \\(\\Delta\\) is the transpose of the incidence matrix. This comes in handy when we want to calculate the flow on the edges, which will be given by \\[ \\bf{Q} = \\bf{C} L^{-1} \\bf{\\Delta} \\bf{P}, \\tag{2} \\] where $ C $ is a diagonal matrix of the conductances of the edges, \\(L\\) is the diagonal matrix of the ``length'' of each edge, \\(\\Delta\\) is the transpose of the incidence matrix, and \\(P\\) is the pressure on the nodes. \\(Q\\) is the flow of the edges. In this particular example we are assuming that the relation between the flow and the pressure difference is \\(Q_e = C_e (p_i - p_j)/L\\), but we can have many other choices.\nKnowing the sources and sinks on the nodes, the pressure can be determined by the Kirchhoff law \\[ \\mathcal{L} \\bf{P} = \\bf{q}, \\] where the vector $ q $ contains the source and sink values for each node. This is the same as solving the Poisson equation on the graph. This can also be written in terms of the flow, i.e. \\[ \\Delta^T \\bf{Q} = \\bf{q}. 
\\] By $ (2) $ we can write \\[ (\\bf{\\Delta}^T \\bf{C}\\bf{L}^{-1}\\Delta) \\bf{P} = \\bf{q}. \\] Since $ \\Delta = M^T $, the expression inside the parentheses is clearly Equation (1).\nSimilar to the Poisson equation on the graph, which is equivalent to Kirchhoff’s law, we can solve other types of heat and wave equations on the graph as well. The Laplacian matrix plays a key role: \\[ \\frac{\\partial p}{\\partial t} = - \\mathcal{L} p + q, \\] for the heat equation, and \\[ \\frac{\\partial^2 p}{\\partial t^2} = -\\mathcal{L}p + q, \\] for the wave equation."
  },
  {
    "objectID": "posts/cryo_ET/demo.html#background",
    "href": "posts/cryo_ET/demo.html#background",
    "title": "Simulation of tomograms of membrane-embedded spike proteins",
    "section": "",
    "text": "Cryogenic electron tomography (cryo-ET) is an imaging technique to reconstruct high-resolution 3D structures, usually of biological macromolecules. Samples (usually small cells like bacteria and viruses) are prepared in a standard aqueous medium (unlike cryo-EM, where samples are frozen) and are imaged in a transmission electron microscope (TEM). The samples are tilted to different angles (e.g. from \\(-60^\\circ\\) to \\(+60^\\circ\\)), and images are obtained at every incremented degree (usually every \\(1^\\circ\\) or \\(2^\\circ\\)).\nThe main advantage of cryo-ET is that it allows the cells and macromolecules to be imaged in an undisturbed state. This is crucial in many applications such as drug discovery, where we need to know the in-situ binding state of the target of interest (e.g. a viral spike protein) with the drug.\n\n\n\nTomographic slices of SARS-CoV-2 virions, with spike proteins embedded in the membrane (Shi et al. 2023)\n\n\nIn order to reconstruct macromolecules, tomographic slices need to be processed through a pipeline. 
A typical cryo-ET data processing pipeline includes: tilt series alignment, CTF estimation, tomogram reconstruction, particle picking, iterative subtomogram alignment and averaging, and heterogeneity analysis. Unlike cryo-EM, many algorithms for cryo-ET processing are still under development. Therefore, a large database of cryo-ET data to test and tune algorithms is important. Unfortunately, collecting cryo-ET data is both time- and money-consuming, and the current database of cryo-ET is not large enough, especially for deep learning training, which requires a large amount of data. Therefore, simulation becomes a substitute to generate a large amount of data in a short time and at low expense. In this post, we will focus on the simulation of membrane-embedded proteins."
  },
  {
    "objectID": "posts/cryo_ET/demo.html#workflow",
    "href": "posts/cryo_ET/demo.html#workflow",
    "title": "Simulation of tomograms of membrane-embedded spike proteins",
    "section": "Workflow",
    "text": "Workflow\nWe will use the Membrane Embedded Proteins Simulator (MEPSi), a tool incorporated in PyCoAn, to simulate the SARS-CoV-2 spike protein (Rodríguez de Francisco et al. 2022). Here, I will briefly go through the workflow of MEPSi.\n\n1. Density modeling\nIn the density modeling, atom coordinate lists of the macromolecules of interest are given, and a “ground-truth” volume representation is simulated by placing the given macromolecules on the membrane with specified geometry. The algorithm uses a 3D Archimedean spiral to place the molecules at approximately equidistant points along the membrane. Random translations with a bounding box defined by the equidistance and the maximum XY radius of the molecules will then be applied. This ensures there is no overlap between macromolecules on the surface. The volume is generated using direct generation of membrane density and Gaussian convolution of the atom positions.\nOptionally, a solvent model can be generated and added to the density. 
In order to keep the computational cost low, a continuum solvent model with an adjustable contrast tuning parameter is used. A 3D version of Laplacian pyramid blending is used to account for displacements of one object from another, to mitigate edge effects, and to emulate the existence of a hydration layer around the molecules.\n\n\n2. Basis tilt series generation\nIn this step, an unperturbed basis tilt series is generated from the simulated volume. The individual tilt images are obtained by rotating the volume around the Y axis and projecting the density along the Z axis. The reason that a basis tilt series is generated before the final tomogram simulation is to reduce computational cost. It can speed up the process quite a lot if a perturbation-free basis tilt series is first generated to allow the user to explore perturbation parameters (e.g. contrast transfer function and noise) before generating final tomograms from the perturbed basis tilt series.\n\n\n3. CTF\nOne possible perturbation we can add to the basis tilt series is the contrast transfer function (CTF), which models the effect of the microscope optics. One major determinant of the CTF is the defocus value at the scattering event, which changes while the electrons traverse the specimen. In order to simplify the problem, we treat the simulated specimen as an infinitely thin slice, so only focus changes caused by tilting need to be considered. Projected tilted specimen images are subjected to a CTF model in strips parallel to the tilt axis, with the defocus value modulated according to the position of the strip center.\n\n\n4. Noise\nThe noise model is expressed as a mixture of Gaussian and Laplacian noise, in contrast to the additive white Gaussian noise usually used in many other simulation applications. 
The noise in the low-dose images contributing to a tilt series tends to have statistically significant non-zero skewness, which cannot be modeled by a Gaussian error model alone.\n\n\n\nOverlay of an experimental intensity histogram (blue) with noise modeling by Gaussian only (red) vs. with a mix of Gaussian and Laplacian noise (green)\n\n\n\n\n5. Tomogram generation\nFinally, tomograms are simulated from the perturbed basis tilt series with a user-specified tilt range and increment."
  },
  {
    "objectID": "posts/cryo_ET/demo.html#results",
    "href": "posts/cryo_ET/demo.html#results",
    "title": "Simulation of tomograms of membrane-embedded spike proteins",
    "section": "Results",
    "text": "Results\nIn order to fully demonstrate the capacity of MEPSi, tomograms were simulated from a sample containing three different conformations of the SARS-CoV-2 spike protein: 6VXX, 6VYB and 6X2B, with ratio 1:1:2. Protein coordinate files in .pdb format were obtained from the RCSB PDB, and preprocessed in ChimeraX to align with the z-axis, in order to be modeled in the correct direction in the density simulation.\n\n\n\nThree conformations of the prefusion trimer of the SARS-CoV-2 spike protein: all RBDs in the closed position (left, 6VXX); one RBD in the open position (center, 6VYB); two RBDs in the open position (right, 6X2B)\n\n\nSolvent and CTF were added. An SNR of 0.5 was used. Finally, we generated tomograms every \\(1^\\circ\\) from \\(-60^\\circ\\) to \\(+60^\\circ\\). Below are four simulated tomograms with different tilt angles."
  },
  {
    "objectID": "posts/MATH-612/index.html#preliminaries",
    "href": "posts/MATH-612/index.html#preliminaries",
    "title": "Welcome to MATH 612",
    "section": "Preliminaries",
    "text": "Preliminaries\n\nJupyter: Use Jupyter Notebooks for interactive coding and documentation. Great for running small code snippets and visualizing data. Learn more in the Jupyter Notebook Documentation.\nVS Code: A powerful IDE for writing and debugging code. 
Download it here, and install relevant extensions for Python and LaTeX.\nEnvironments: Use virtual environments like venv or conda to manage dependencies and ensure consistent results across different setups.\nQuarto: Use Quarto for creating high-quality documents, reports, and presentations from your code. It supports markdown and integrates seamlessly with Jupyter and VS Code for reproducible analysis and publication. Check out the Quarto Guide for more information. To get started quickly, you can refer to this GitHub Repository." - }, - { - "objectID": "posts/MATH-612/index.html#using-github", - "href": "posts/MATH-612/index.html#using-github", - "title": "Welcome to MATH 612", - "section": "Using GitHub", - "text": "Using GitHub\n\nCreate a GitHub Account: Sign up at GitHub.com.\nRepositories: Start by creating a repository to host your project files. Learn how in GitHub’s guide to repositories. Use a .gitignore file to exclude unnecessary files.\nBranches: Work on separate branches (main, dev, feature branches) to manage different versions of your project. More details in GitHub’s guide on branching.\nMerges: Merge changes into the main branch only after thorough review and testing. Learn about merging branches.\nCommit Messages: Write clear, descriptive commit messages to document changes effectively. Follow the best practices for commit messages." 
- }, - { - "objectID": "posts/MATH-612/index.html#using-quarto-to-create-blog-posts", - "href": "posts/MATH-612/index.html#using-quarto-to-create-blog-posts", - "title": "Welcome to MATH 612", - "section": "Using Quarto to create blog posts", - "text": "Using Quarto to create blog posts\n\nLog into GitHub: Make sure you have an account and are logged in.\nSend your account username/email to kdd@math.ubc.ca: This is needed to be added to the organization.\nClone the repository: After being added to the organization, clone the repository: https://github.com/bioshape-analysis/blog.\ngit clone https://github.com/bioshape-analysis/blog\nCreate a new branch: To contribute to the blog, create a new branch using:\ngit checkout -b <branch_name>\n\nVerify your branch and repository location: Use the following command to check if you are in the correct branch and repository:\ngit status\nThis command will show you the current branch you are on and the status of your working directory, ensuring you are working in the right place\n\nNavigate to posts: Go into the posts directory (found here). Create a new folder with a name that represents the content of your blog post.\nCreate or upload your content:\n\nIf using Jupyter Notebooks, upload your .ipynb file.\nIf preferred, create a new notebook for your post. Once done, convert it into Quarto using the command:\nquarto convert your_jupyter_notebook.ipynb -o output_file.qmd\n\nEdit the YAML in your .qmd file: Ensure your YAML is consistent with the main template. For example:\n\n---\ntitle: \"Title of your blog post\"\ndate: \"Date\" # Format example: August 9 2024\nauthor:\n - name: \"Your Name\" \njupyter: python3\ncategories: [] # [biology, bioinformatics, theory, etc.]\nbibliography: references.bib # If referencing anything\nexecute:\n freeze: auto\n---\nFeel free to add further formatting, but ensure it remains consistent with the main template. 8. 
Delete your Jupyter notebook: After converting it to a .qmd file, delete the original .ipynb file to prevent duplication in the blog post. 9. Commit and push your changes: After completing your .qmd file, push your branch to GitHub. A pull request will be automatically created, and once reviewed, it will be merged into the main branch.\nAnatomy of a Quarto Document: If you are running code, please do not forget the execute: freeze: auto, so that the website can be built without re-running your code each time.\n\nAdditional Information for Quarto:\n\nAdd Images: You can add images to your Quarto document using markdown syntax:\n![Image Description](path/to/image.png)\nTo add images from a URL:\n![Image Description](https://example.com/image.png)\nAdd References: Manage references by creating a bibliography.bib file with your references in BibTeX format. Link the bibliography file in your Quarto document header (YAML). Cite references in your text using the following syntax:\nThis is a citation [@citation_key].\nOther Edits: Add headers, footnotes, and other markdown features as needed. Customize the layout by editing the YAML header." + "text": "This notebook is adapted from this notebook (Lead author: Nina Miolane).\nThis notebook studies Osteosarcoma (bone cancer) cells and the impact of drug treatment on their morphological shapes, by analyzing cell images obtained from fluorescence microscopy.\nThis analysis relies on the elastic metric between discrete curves from Geomstats. We will study to which extent this metric can detect how the cell shape is associated with the response to treatment.\nThe full papers analyzing this dataset are available at Li et al. (2023), Li et al. (2024).\nFigure 1: Representative images of the cell lines using fluorescence microscopy, studied in this notebook (Image credit : Ashok Prasad). The cells nuclei (blue), the actin cytoskeleton (green) and the lipid membrane (red) of each cell are stained and colored. 
We only focus on the cell shape in our analysis."
  },
  {
    "objectID": "posts/MATH-612/index.html#multiple-environments-in-the-same-quarto-project",
    "href": "posts/MATH-612/index.html#multiple-environments-in-the-same-quarto-project",
    "title": "Welcome to MATH 612",
    "section": "Multiple environments in the same Quarto project",
    "text": "Multiple environments in the same Quarto project\nIn your blog post, you may want to use specific Python packages, which may conflict with packages used in other posts. To avoid this problem, you need to use a virtual environment. For simplicity, please name your environment .venv.\n\nCreating the virtual environment: Go to your post folder (e.g. blog/posts/my_post) and run:\npython -m venv .venv\nThe folder .venv is created and contains the environment.\nInstalling packages: First activate the environment,\nsource .venv/bin/activate\nand then install the packages you need:\npip install package1_name package2_name\nTo run code in Quarto, you need at least the package jupyter. Deactivate the environment with deactivate.\nUsing the environment in VS Code: Link the virtual environment to VS Code using the command palette, with the command Python: Select Interpreter and entering the path to your interpreter ending with .venv/bin/python.\nExport your package requirements: If you installed non-standard packages, other than jupyter, numpy, matplotlib, pandas, or plotly for example, you can export your package requirements, so that others can reproduce your environment. First go to your post directory and activate your environment. 
Then run:\npip freeze > requirements.txt" + "objectID": "posts/elastic-metric/osteosarcoma_analysis.html#compute-mean-cell-shape-of-the-whole-dataset-global-mean-shape", + "href": "posts/elastic-metric/osteosarcoma_analysis.html#compute-mean-cell-shape-of-the-whole-dataset-global-mean-shape", + "title": "Shape Analysis of Cancer Cells", + "section": "Compute Mean Cell Shape of the Whole Dataset: “Global” Mean Shape", + "text": "Compute Mean Cell Shape of the Whole Dataset: “Global” Mean Shape\nWe want to compute the mean cell shape of the whole dataset. Thus, we first combine all the cell shape data into a single array.\n\nCURVES_SPACE_SRV = DiscreteCurvesStartingAtOrigin(ambient_dim=2, k_sampling_points=k_sampling_points)\n\n\ncell_shapes_list = {}\nfor metric in METRICS:\n cell_shapes_list[metric] = []\n for treatment in TREATMENTS:\n for line in LINES:\n cell_shapes_list[metric].extend(ds_align[metric][treatment][line])\n\ncell_shapes = {}\nfor metric in METRICS:\n cell_shapes[metric] = gs.array(cell_shapes_list[metric])\nprint(cell_shapes['SRV'].shape)\n\n(625, 1999, 2)\n\n\nRemove outliers using DeCOr-MDS, together for DUNN and DLM8 cell lines.\n\ndef linear_dist(cell1, cell2):\n return gs.linalg.norm(cell1 - cell2)\n\ndef srv_dist(cell1, cell2):\n CURVES_SPACE_SRV.equip_with_metric(SRVMetric)\n return CURVES_SPACE_SRV.metric.dist(cell1, cell2)\n \n# compute pairwise distances, we only need to compute it once and save the results \npairwise_dists = {}\n\nif first_time:\n metric = 'SRV'\n pairwise_dists[metric] = parallel_dist(cell_shapes[metric], srv_dist, k_sampling_points)\n\n metric = 'Linear' \n pairwise_dists[metric] = parallel_dist(cell_shapes[metric], linear_dist, k_sampling_points)\n\n for metric in METRICS:\n np.savetxt(os.path.join(data_path, dataset_name, \"distance_matrix\", f\"{metric}_matrix.txt\"), pairwise_dists[metric])\nelse:\n for metric in METRICS:\n pairwise_dists[metric] = np.loadtxt(os.path.join(data_path, dataset_name, 
\"distance_matrix\", f\"{metric}_matrix.txt\"))\n\n\n# to remove 132 and 199\none_cell = cell_shapes['Linear'][199]\nplt.plot(one_cell[:, 0], one_cell[:, 1], c=f\"gray\")\n\n\n\n\n\n\n\n\n\n# run DeCOr-MDS\nmetric = 'SRV'\ndim_start = 2 # we know the subspace dimension is 3, we set start and end to 3 to reduce runtime \ndim_end = 10\n# dim_start = 3\n# dim_end = 3\nstd_multi = 1\nif first_time:\n subspace_dim, outlier_indices = find_subspace_dim(pairwise_dists[metric], dim_start, dim_end, std_multi)\n print(f\"subspace dimension is: {subspace_dim}\")\n print(f\"outlier_indices are: {outlier_indices}\")\n\nVisualize outlier cells to see if they are artifacts\n\nif first_time:\n fig, axes = plt.subplots(\n nrows= 1,\n ncols=len(outlier_indices),\n figsize=(2*len(outlier_indices), 2),\n )\n\n for i, outlier_index in enumerate(outlier_indices):\n one_cell = cell_shapes[metric][outlier_index]\n ax = axes[i]\n ax.plot(one_cell[:, 0], one_cell[:, 1], c=f\"C{j}\")\n ax.set_title(f\"{outlier_index}\", fontsize=14)\n # Turn off tick labels\n ax.set_yticklabels([])\n ax.set_xticklabels([])\n ax.set_xticks([])\n ax.set_yticks([])\n ax.spines[\"top\"].set_visible(False)\n ax.spines[\"right\"].set_visible(False)\n ax.spines[\"bottom\"].set_visible(False)\n ax.spines[\"left\"].set_visible(False)\n\n plt.tight_layout()\n plt.suptitle(f\"\", y=-0.01, fontsize=24)\n # plt.savefig(os.path.join(figs_dir, \"outlier.svg\"))\n\n\ndelete_indices = [132, 199]\n\n\nfig, axes = plt.subplots(\n nrows= 1,\n ncols=len(delete_indices),\n figsize=(2*len(delete_indices), 2),\n)\n\n\nfor i, outlier_index in enumerate(delete_indices):\n one_cell = cell_shapes[metric][outlier_index]\n ax = axes[i]\n ax.plot(one_cell[:, 0], one_cell[:, 1], c=f\"gray\")\n ax.set_title(f\"{outlier_index}\", fontsize=14)\n # ax.axis(\"off\")\n # Turn off tick labels\n ax.set_yticklabels([])\n ax.set_xticklabels([])\n ax.set_xticks([])\n ax.set_yticks([])\n ax.spines[\"top\"].set_visible(False)\n 
ax.spines[\"right\"].set_visible(False)\n ax.spines[\"bottom\"].set_visible(False)\n ax.spines[\"left\"].set_visible(False)\n\nplt.tight_layout()\nplt.suptitle(f\"\", y=-0.01, fontsize=24)\n\nif savefig:\n plt.savefig(os.path.join(figs_dir, \"delete_outlier.svg\"))\n plt.savefig(os.path.join(figs_dir, \"delete_outlier.pdf\"))\n\n\n\n\n\n\n\n\nAfter visual inspection, we decide to remove the outlier cells\n\ndef remove_ds_two_layer(ds, delete_indices):\n global_i = sum(len(v) for values in ds.values() for v in values.values())-1\n\n for treatment in reversed(list(ds.keys())):\n treatment_values = ds[treatment]\n for line in reversed(list(treatment_values.keys())):\n line_cells = treatment_values[line]\n for i, _ in reversed(list(enumerate(line_cells))):\n if global_i in delete_indices:\n print(np.array(ds[treatment][line][:i]).shape, np.array(ds[treatment][line][i+1:]).shape)\n if len(np.array(ds[treatment][line][:i]).shape) == 1:\n ds[treatment][line] = np.array(ds[treatment][line][i+1:])\n elif len(np.array(ds[treatment][line][i+1:]).shape) == 1:\n ds[treatment][line] = np.array(ds[treatment][line][:i])\n else:\n ds[treatment][line] = np.concatenate((np.array(ds[treatment][line][:i]), np.array(ds[treatment][line][i+1:])), axis=0) \n global_i -= 1\n return ds\n\n\n\ndef remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices):\n \"\"\" \n Remove cells of control group from cells, cell_shapes, ds,\n the parameters returned from load_treated_osteosarcoma_cells\n Also update n_cells\n\n :param list[int] delete_indices: the indices to delete\n \"\"\"\n delete_indices = sorted(delete_indices, reverse=True) # to prevent change in index when deleting elements\n \n # Delete elements\n cells = del_arr_elements(cells, delete_indices) \n lines = list(np.delete(np.array(lines), delete_indices, axis=0))\n treatments = list(np.delete(np.array(treatments), delete_indices, axis=0))\n ds_proc = remove_ds_two_layer(ds_proc, 
delete_indices)\n \n for metric in METRICS:\n cell_shapes[metric] = np.delete(np.array(cell_shapes[metric]), delete_indices, axis=0)\n ds_align[metric] = remove_ds_two_layer(ds_align[metric], delete_indices)\n pairwise_dists[metric] = np.delete(pairwise_dists[metric], delete_indices, axis=0)\n pairwise_dists[metric] = np.delete(pairwise_dists[metric], delete_indices, axis=1)\n\n\n return cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align\n\n\ncells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align = remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices)\n\n(85, 2000, 2) (118, 2000, 2)\n(18, 2000, 2) (184, 2000, 2)\n(86, 1999, 2) (112, 1999, 2)\n(19, 1999, 2) (178, 1999, 2)\n(86, 1999, 2) (112, 1999, 2)\n(19, 1999, 2) (178, 1999, 2)\n\n\nCheck we did not lose any other cells after the removal\n\ndef check_num(cell_shapes, treatments, lines, pairwise_dists, ds_align):\n \n print(f\"treatments number is: {len(treatments)}, lines number is: {len(lines)}\")\n for metric in METRICS:\n print(f\"pairwise_dists for {metric} shape is: {pairwise_dists[metric].shape}\")\n print(f\"cell_shapes for {metric} number is : {len(cell_shapes[metric])}\")\n \n for line in LINES:\n for treatment in TREATMENTS:\n print(f\"ds_align {treatment} {line} using {metric}: {len(ds_align[metric][treatment][line])}\")\n\n\ncheck_num(cell_shapes, treatments, lines, pairwise_dists, ds_align)\n\ntreatments number is: 623, lines number is: 623\npairwise_dists for SRV shape is: (623, 623)\ncell_shapes for SRV number is : 623\nds_align control dlm8 using SRV: 113\nds_align cytd dlm8 using SRV: 74\nds_align jasp dlm8 using SRV: 56\nds_align control dunn using SRV: 197\nds_align cytd dunn using SRV: 92\nds_align jasp dunn using SRV: 91\npairwise_dists for Linear shape is: (623, 623)\ncell_shapes for Linear number is : 623\nds_align control dlm8 using Linear: 113\nds_align cytd dlm8 using Linear: 74\nds_align 
jasp dlm8 using Linear: 56\nds_align control dunn using Linear: 197\nds_align cytd dunn using Linear: 92\nds_align jasp dunn using Linear: 91\n\n\nWe compute the mean cell shape by using the SRV metric defined on the space of curves’ shapes. The space of curves’ shapes is a manifold: we use the Frechet mean, associated with the SRV metric, to get the mean cell shape.\nDo not include cells with duplicate points when calculating the mean shapes\n\ndef check_duplicate(cell):\n \"\"\" \n Return true if there are duplicate points in the cell\n \"\"\"\n for i in range(cell.shape[0]-1):\n cur_coord = cell[i]\n next_coord = cell[i+1]\n if np.linalg.norm(cur_coord-next_coord) == 0:\n return True\n \n # Checking the last point vs the first point\n if np.linalg.norm(cell[-1]-cell[0]) == 0:\n return True\n \n return False\n\n\ndelete_indices = []\nfor metric in METRICS:\n for i, cell in reversed(list(enumerate(cell_shapes[metric]))):\n if check_duplicate(cell):\n if i not in delete_indices:\n delete_indices.append(i)\n\n\ncells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align = \\\n remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices)\n\nRecheck cell number after removing cells with duplicated points\n\ncheck_num(cell_shapes, treatments, lines, pairwise_dists, ds_align)\n\ntreatments number is: 623, lines number is: 623\npairwise_dists for SRV shape is: (623, 623)\ncell_shapes for SRV number is : 623\nds_align control dlm8 using SRV: 113\nds_align cytd dlm8 using SRV: 74\nds_align jasp dlm8 using SRV: 56\nds_align control dunn using SRV: 197\nds_align cytd dunn using SRV: 92\nds_align jasp dunn using SRV: 91\npairwise_dists for Linear shape is: (623, 623)\ncell_shapes for Linear number is : 623\nds_align control dlm8 using Linear: 113\nds_align cytd dlm8 using Linear: 74\nds_align jasp dlm8 using Linear: 56\nds_align control dunn using Linear: 197\nds_align cytd dunn using Linear: 92\nds_align jasp dunn 
using Linear: 91\n\n\n\nfrom geomstats.learning.frechet_mean import FrechetMean\n\nmetric = 'SRV'\nCURVES_SPACE_SRV = DiscreteCurvesStartingAtOrigin(ambient_dim=2, k_sampling_points=k_sampling_points)\nmean = FrechetMean(CURVES_SPACE_SRV)\nprint(cell_shapes[metric].shape)\ncells = cell_shapes[metric]\nmean.fit(cells)\n\nmean_estimate = mean.estimate_\n\n(623, 1999, 2)\n\n\n\nmean_estimate_aligned = {}\n\nmean_estimate_clean = mean_estimate[~gs.isnan(gs.sum(mean_estimate, axis=1)), :]\nmean_estimate_aligned[metric] = (\n mean_estimate_clean - gs.mean(mean_estimate_clean, axis=0)\n)\n\nAlso we compute the linear mean\n\nmetric = 'Linear'\nlinear_mean_estimate = gs.mean(cell_shapes[metric], axis=0)\nlinear_mean_estimate_clean = linear_mean_estimate[~gs.isnan(gs.sum(linear_mean_estimate, axis=1)), :]\n\nmean_estimate_aligned[metric] = (\n linear_mean_estimate_clean - gs.mean(linear_mean_estimate_clean, axis=0)\n)\n\nPlot SRV mean cell versus linear mean cell\n\nfig = plt.figure(figsize=(6, 3))\n\nfig.add_subplot(121)\nmetric = 'SRV'\nplt.plot(mean_estimate_aligned[metric][:, 0], mean_estimate_aligned[metric][:, 1])\nplt.axis(\"equal\")\nplt.title(\"SRV\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nmetric = 'Linear'\nplt.plot(mean_estimate_aligned[metric][:, 0], mean_estimate_aligned[metric][:, 1])\nplt.axis(\"equal\")\nplt.title(\"Linear\")\nplt.axis(\"off\")\n\nif savefig:\n plt.savefig(os.path.join(figs_dir, \"global_mean.svg\"))\n plt.savefig(os.path.join(figs_dir, \"global_mean.pdf\"))" }, { "objectID": "posts/ImageMorphing/OT4DiseaseProgression2.html", @@ -588,437 +490,584 @@ "text": "Visualization\n\n\nCode\n# shape of Interpolation = [|B|,|C|,|X|,|Y|,|T|]\nfig, ax = plt.subplots(nrows=2, ncols=OMTInterpolation.shape[-1], figsize=(50, 5))\n\nfor i in range(OMTInterpolation.shape[-1]):\n ax[0, i].imshow(LinearInt[:, :, i])\n ax[0, i].axis('off')\n ax[1, i].imshow(OMTInterpolation[:, :, i])\n ax[1, i].axis('off')\n\nax[0,0].set_title('Linear Interpolation', 
loc='left')\nax[1,0].set_title('OMT Interpolation', loc='left')\n\nplt.show()" }, { - "objectID": "posts/elastic-metric/osteosarcoma_analysis.html", - "href": "posts/elastic-metric/osteosarcoma_analysis.html", - "title": "Shape Analysis of Cancer Cells", - "section": "", - "text": "This notebook is adapted from this notebook (Lead author: Nina Miolane).\nThis notebook studies Osteosarcoma (bone cancer) cells and the impact of drug treatment on their morphological shapes, by analyzing cell images obtained from fluorescence microscopy.\nThis analysis relies on the elastic metric between discrete curves from Geomstats. We will study to which extent this metric can detect how the cell shape is associated with the response to treatment.\nThe full papers analyzing this dataset are available at Li et al. (2023), Li et al. (2024).\nFigure 1: Representative images of the cell lines using fluorescence microscopy, studied in this notebook (Image credit : Ashok Prasad). The cells nuclei (blue), the actin cytoskeleton (green) and the lipid membrane (red) of each cell are stained and colored. We only focus on the cell shape in our analysis." + "objectID": "posts/MATH-612/index.html#preliminaries", + "href": "posts/MATH-612/index.html#preliminaries", + "title": "Welcome to MATH 612", + "section": "Preliminaries", + "text": "Preliminaries\n\nJupyter: Use Jupyter Notebooks for interactive coding and documentation. Great for running small code snippets and visualizing data. Learn more in the Jupyter Notebook Documentation.\nVS Code: A powerful IDE for writing and debugging code. Download it here, and install relevant extensions for Python and LaTeX.\nEnvironments: Use virtual environments like venv or conda to manage dependencies and ensure consistent results across different setups.\nQuarto: Use Quarto for creating high-quality documents, reports, and presentations from your code. 
It supports markdown and integrates seamlessly with Jupyter and VS Code for reproducible analysis and publication. Check out the Quarto Guide for more information. To get started quickly, you can refer to this GitHub Repository." }, { - "objectID": "posts/elastic-metric/osteosarcoma_analysis.html#compute-mean-cell-shape-of-the-whole-dataset-global-mean-shape", - "href": "posts/elastic-metric/osteosarcoma_analysis.html#compute-mean-cell-shape-of-the-whole-dataset-global-mean-shape", - "title": "Shape Analysis of Cancer Cells", - "section": "Compute Mean Cell Shape of the Whole Dataset: “Global” Mean Shape", - "text": "Compute Mean Cell Shape of the Whole Dataset: “Global” Mean Shape\nWe want to compute the mean cell shape of the whole dataset. Thus, we first combine all the cell shape data into a single array.\n\nCURVES_SPACE_SRV = DiscreteCurvesStartingAtOrigin(ambient_dim=2, k_sampling_points=k_sampling_points)\n\n\ncell_shapes_list = {}\nfor metric in METRICS:\n cell_shapes_list[metric] = []\n for treatment in TREATMENTS:\n for line in LINES:\n cell_shapes_list[metric].extend(ds_align[metric][treatment][line])\n\ncell_shapes = {}\nfor metric in METRICS:\n cell_shapes[metric] = gs.array(cell_shapes_list[metric])\nprint(cell_shapes['SRV'].shape)\n\n(625, 1999, 2)\n\n\nRemove outliers using DeCOr-MDS, together for DUNN and DLM8 cell lines.\n\ndef linear_dist(cell1, cell2):\n return gs.linalg.norm(cell1 - cell2)\n\ndef srv_dist(cell1, cell2):\n CURVES_SPACE_SRV.equip_with_metric(SRVMetric)\n return CURVES_SPACE_SRV.metric.dist(cell1, cell2)\n \n# compute pairwise distances, we only need to compute it once and save the results \npairwise_dists = {}\n\nif first_time:\n metric = 'SRV'\n pairwise_dists[metric] = parallel_dist(cell_shapes[metric], srv_dist, k_sampling_points)\n\n metric = 'Linear' \n pairwise_dists[metric] = parallel_dist(cell_shapes[metric], linear_dist, k_sampling_points)\n\n for metric in METRICS:\n np.savetxt(os.path.join(data_path, dataset_name, 
\"distance_matrix\", f\"{metric}_matrix.txt\"), pairwise_dists[metric])\nelse:\n for metric in METRICS:\n pairwise_dists[metric] = np.loadtxt(os.path.join(data_path, dataset_name, \"distance_matrix\", f\"{metric}_matrix.txt\"))\n\n\n# to remove 132 and 199\none_cell = cell_shapes['Linear'][199]\nplt.plot(one_cell[:, 0], one_cell[:, 1], c=f\"gray\")\n\n\n\n\n\n\n\n\n\n# run DeCOr-MDS\nmetric = 'SRV'\ndim_start = 2 # we know the subspace dimension is 3, we set start and end to 3 to reduce runtime \ndim_end = 10\n# dim_start = 3\n# dim_end = 3\nstd_multi = 1\nif first_time:\n subspace_dim, outlier_indices = find_subspace_dim(pairwise_dists[metric], dim_start, dim_end, std_multi)\n print(f\"subspace dimension is: {subspace_dim}\")\n print(f\"outlier_indices are: {outlier_indices}\")\n\nVisualize outlier cells to see if they are artifacts\n\nif first_time:\n fig, axes = plt.subplots(\n nrows= 1,\n ncols=len(outlier_indices),\n figsize=(2*len(outlier_indices), 2),\n )\n\n for i, outlier_index in enumerate(outlier_indices):\n one_cell = cell_shapes[metric][outlier_index]\n ax = axes[i]\n ax.plot(one_cell[:, 0], one_cell[:, 1], c=f\"C{j}\")\n ax.set_title(f\"{outlier_index}\", fontsize=14)\n # Turn off tick labels\n ax.set_yticklabels([])\n ax.set_xticklabels([])\n ax.set_xticks([])\n ax.set_yticks([])\n ax.spines[\"top\"].set_visible(False)\n ax.spines[\"right\"].set_visible(False)\n ax.spines[\"bottom\"].set_visible(False)\n ax.spines[\"left\"].set_visible(False)\n\n plt.tight_layout()\n plt.suptitle(f\"\", y=-0.01, fontsize=24)\n # plt.savefig(os.path.join(figs_dir, \"outlier.svg\"))\n\n\ndelete_indices = [132, 199]\n\n\nfig, axes = plt.subplots(\n nrows= 1,\n ncols=len(delete_indices),\n figsize=(2*len(delete_indices), 2),\n)\n\n\nfor i, outlier_index in enumerate(delete_indices):\n one_cell = cell_shapes[metric][outlier_index]\n ax = axes[i]\n ax.plot(one_cell[:, 0], one_cell[:, 1], c=f\"gray\")\n ax.set_title(f\"{outlier_index}\", fontsize=14)\n # ax.axis(\"off\")\n 
# Turn off tick labels\n ax.set_yticklabels([])\n ax.set_xticklabels([])\n ax.set_xticks([])\n ax.set_yticks([])\n ax.spines[\"top\"].set_visible(False)\n ax.spines[\"right\"].set_visible(False)\n ax.spines[\"bottom\"].set_visible(False)\n ax.spines[\"left\"].set_visible(False)\n\nplt.tight_layout()\nplt.suptitle(f\"\", y=-0.01, fontsize=24)\n\nif savefig:\n plt.savefig(os.path.join(figs_dir, \"delete_outlier.svg\"))\n plt.savefig(os.path.join(figs_dir, \"delete_outlier.pdf\"))\n\n\n\n\n\n\n\n\nAfter visual inspection, we decide to remove the outlier cells\n\ndef remove_ds_two_layer(ds, delete_indices):\n global_i = sum(len(v) for values in ds.values() for v in values.values())-1\n\n for treatment in reversed(list(ds.keys())):\n treatment_values = ds[treatment]\n for line in reversed(list(treatment_values.keys())):\n line_cells = treatment_values[line]\n for i, _ in reversed(list(enumerate(line_cells))):\n if global_i in delete_indices:\n print(np.array(ds[treatment][line][:i]).shape, np.array(ds[treatment][line][i+1:]).shape)\n if len(np.array(ds[treatment][line][:i]).shape) == 1:\n ds[treatment][line] = np.array(ds[treatment][line][i+1:])\n elif len(np.array(ds[treatment][line][i+1:]).shape) == 1:\n ds[treatment][line] = np.array(ds[treatment][line][:i])\n else:\n ds[treatment][line] = np.concatenate((np.array(ds[treatment][line][:i]), np.array(ds[treatment][line][i+1:])), axis=0) \n global_i -= 1\n return ds\n\n\n\ndef remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices):\n \"\"\" \n Remove cells of control group from cells, cell_shapes, ds,\n the parameters returned from load_treated_osteosarcoma_cells\n Also update n_cells\n\n :param list[int] delete_indices: the indices to delete\n \"\"\"\n delete_indices = sorted(delete_indices, reverse=True) # to prevent change in index when deleting elements\n \n # Delete elements\n cells = del_arr_elements(cells, delete_indices) \n lines = 
list(np.delete(np.array(lines), delete_indices, axis=0))\n treatments = list(np.delete(np.array(treatments), delete_indices, axis=0))\n ds_proc = remove_ds_two_layer(ds_proc, delete_indices)\n \n for metric in METRICS:\n cell_shapes[metric] = np.delete(np.array(cell_shapes[metric]), delete_indices, axis=0)\n ds_align[metric] = remove_ds_two_layer(ds_align[metric], delete_indices)\n pairwise_dists[metric] = np.delete(pairwise_dists[metric], delete_indices, axis=0)\n pairwise_dists[metric] = np.delete(pairwise_dists[metric], delete_indices, axis=1)\n\n\n return cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align\n\n\ncells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align = remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices)\n\n(85, 2000, 2) (118, 2000, 2)\n(18, 2000, 2) (184, 2000, 2)\n(86, 1999, 2) (112, 1999, 2)\n(19, 1999, 2) (178, 1999, 2)\n(86, 1999, 2) (112, 1999, 2)\n(19, 1999, 2) (178, 1999, 2)\n\n\nCheck we did not lose any other cells after the removal\n\ndef check_num(cell_shapes, treatments, lines, pairwise_dists, ds_align):\n \n print(f\"treatments number is: {len(treatments)}, lines number is: {len(lines)}\")\n for metric in METRICS:\n print(f\"pairwise_dists for {metric} shape is: {pairwise_dists[metric].shape}\")\n print(f\"cell_shapes for {metric} number is : {len(cell_shapes[metric])}\")\n \n for line in LINES:\n for treatment in TREATMENTS:\n print(f\"ds_align {treatment} {line} using {metric}: {len(ds_align[metric][treatment][line])}\")\n\n\ncheck_num(cell_shapes, treatments, lines, pairwise_dists, ds_align)\n\ntreatments number is: 623, lines number is: 623\npairwise_dists for SRV shape is: (623, 623)\ncell_shapes for SRV number is : 623\nds_align control dlm8 using SRV: 113\nds_align cytd dlm8 using SRV: 74\nds_align jasp dlm8 using SRV: 56\nds_align control dunn using SRV: 197\nds_align cytd dunn using SRV: 92\nds_align jasp dunn using SRV: 
91\npairwise_dists for Linear shape is: (623, 623)\ncell_shapes for Linear number is : 623\nds_align control dlm8 using Linear: 113\nds_align cytd dlm8 using Linear: 74\nds_align jasp dlm8 using Linear: 56\nds_align control dunn using Linear: 197\nds_align cytd dunn using Linear: 92\nds_align jasp dunn using Linear: 91\n\n\nWe compute the mean cell shape by using the SRV metric defined on the space of curves’ shapes. The space of curves’ shapes is a manifold: we use the Frechet mean, associated with the SRV metric, to get the mean cell shape.\nDo not include cells with duplicate points when calculating the mean shapes\n\ndef check_duplicate(cell):\n \"\"\" \n Return true if there are duplicate points in the cell\n \"\"\"\n for i in range(cell.shape[0]-1):\n cur_coord = cell[i]\n next_coord = cell[i+1]\n if np.linalg.norm(cur_coord-next_coord) == 0:\n return True\n \n # Checking the last point vs the first point\n if np.linalg.norm(cell[-1]-cell[0]) == 0:\n return True\n \n return False\n\n\ndelete_indices = []\nfor metric in METRICS:\n for i, cell in reversed(list(enumerate(cell_shapes[metric]))):\n if check_duplicate(cell):\n if i not in delete_indices:\n delete_indices.append(i)\n\n\ncells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align = \\\n remove_cells_two_layer(cells, cell_shapes, lines, treatments, pairwise_dists, ds_proc, ds_align, delete_indices)\n\nRecheck cell number after removing cells with duplicated points\n\ncheck_num(cell_shapes, treatments, lines, pairwise_dists, ds_align)\n\ntreatments number is: 623, lines number is: 623\npairwise_dists for SRV shape is: (623, 623)\ncell_shapes for SRV number is : 623\nds_align control dlm8 using SRV: 113\nds_align cytd dlm8 using SRV: 74\nds_align jasp dlm8 using SRV: 56\nds_align control dunn using SRV: 197\nds_align cytd dunn using SRV: 92\nds_align jasp dunn using SRV: 91\npairwise_dists for Linear shape is: (623, 623)\ncell_shapes for Linear number is : 623\nds_align control dlm8 using 
Linear: 113\nds_align cytd dlm8 using Linear: 74\nds_align jasp dlm8 using Linear: 56\nds_align control dunn using Linear: 197\nds_align cytd dunn using Linear: 92\nds_align jasp dunn using Linear: 91\n\n\n\nfrom geomstats.learning.frechet_mean import FrechetMean\n\nmetric = 'SRV'\nCURVES_SPACE_SRV = DiscreteCurvesStartingAtOrigin(ambient_dim=2, k_sampling_points=k_sampling_points)\nmean = FrechetMean(CURVES_SPACE_SRV)\nprint(cell_shapes[metric].shape)\ncells = cell_shapes[metric]\nmean.fit(cells)\n\nmean_estimate = mean.estimate_\n\n(623, 1999, 2)\n\n\n\nmean_estimate_aligned = {}\n\nmean_estimate_clean = mean_estimate[~gs.isnan(gs.sum(mean_estimate, axis=1)), :]\nmean_estimate_aligned[metric] = (\n mean_estimate_clean - gs.mean(mean_estimate_clean, axis=0)\n)\n\nAlso we compute the linear mean\n\nmetric = 'Linear'\nlinear_mean_estimate = gs.mean(cell_shapes[metric], axis=0)\nlinear_mean_estimate_clean = linear_mean_estimate[~gs.isnan(gs.sum(linear_mean_estimate, axis=1)), :]\n\nmean_estimate_aligned[metric] = (\n linear_mean_estimate_clean - gs.mean(linear_mean_estimate_clean, axis=0)\n)\n\nPlot SRV mean cell versus linear mean cell\n\nfig = plt.figure(figsize=(6, 3))\n\nfig.add_subplot(121)\nmetric = 'SRV'\nplt.plot(mean_estimate_aligned[metric][:, 0], mean_estimate_aligned[metric][:, 1])\nplt.axis(\"equal\")\nplt.title(\"SRV\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nmetric = 'Linear'\nplt.plot(mean_estimate_aligned[metric][:, 0], mean_estimate_aligned[metric][:, 1])\nplt.axis(\"equal\")\nplt.title(\"Linear\")\nplt.axis(\"off\")\n\nif savefig:\n plt.savefig(os.path.join(figs_dir, \"global_mean.svg\"))\n plt.savefig(os.path.join(figs_dir, \"global_mean.pdf\"))" + "objectID": "posts/MATH-612/index.html#using-github", + "href": "posts/MATH-612/index.html#using-github", + "title": "Welcome to MATH 612", + "section": "Using GitHub", + "text": "Using GitHub\n\nCreate a GitHub Account: Sign up at GitHub.com.\nRepositories: Start by creating a repository to host 
your project files. Learn how in GitHub’s guide to repositories. Use a .gitignore file to exclude unnecessary files.\nBranches: Work on separate branches (main, dev, feature branches) to manage different versions of your project. More details in GitHub’s guide on branching.\nMerges: Merge changes into the main branch only after thorough review and testing. Learn about merging branches.\nCommit Messages: Write clear, descriptive commit messages to document changes effectively. Follow the best practices for commit messages." }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html", - "href": "posts/vascularNetworks/VascularNetworks.html", - "title": "Vascular Networks", - "section": "", - "text": "I have introduced some basic concepts of micro-circulation and the vascular networks and how they get created (angiogenesis) in health and disease. Then I discuss some angiogenesis models (Anderson-Chaplain as well as BARW) and use the tools of the geomstats to analyze the loopy structure in these networks. I explained the characteristics of the loopy structures in the networks in terms of the parameters of the model. Furthermore, I consider the time evolution of the graphs created by these networks and how the characterization of the loopy structures change through time in these networks." 
+ "objectID": "posts/MATH-612/index.html#using-quarto-to-create-blog-posts", + "href": "posts/MATH-612/index.html#using-quarto-to-create-blog-posts", + "title": "Welcome to MATH 612", + "section": "Using Quarto to create blog posts", + "text": "Using Quarto to create blog posts\n\nLog into GitHub: Make sure you have an account and are logged in.\nSend your account username/email to kdd@math.ubc.ca: This is needed to be added to the organization.\nClone the repository: After being added to the organization, clone the repository: https://github.com/bioshape-analysis/blog.\ngit clone https://github.com/bioshape-analysis/blog\nCreate a new branch: To contribute to the blog, create a new branch using:\ngit checkout -b <branch_name>\n\nVerify your branch and repository location: Use the following command to check if you are in the correct branch and repository:\ngit status\nThis command will show you the current branch you are on and the status of your working directory, ensuring you are working in the right place\n\nNavigate to posts: Go into the posts directory (found here). Create a new folder with a name that represents the content of your blog post.\nCreate or upload your content:\n\nIf using Jupyter Notebooks, upload your .ipynb file.\nIf preferred, create a new notebook for your post. Once done, convert it into Quarto using the command:\nquarto convert your_jupyter_notebook.ipynb -o output_file.qmd\n\nEdit the YAML in your .qmd file: Ensure your YAML is consistent with the main template. For example:\n\n---\ntitle: \"Title of your blog post\"\ndate: \"Date\" # Format example: August 9 2024\nauthor:\n - name: \"Your Name\" \njupyter: python3\ncategories: [] # [biology, bioinformatics, theory, etc.]\nbibliography: references.bib # If referencing anything\nexecute:\n freeze: auto\n---\nFeel free to add further formatting, but ensure it remains consistent with the main template. 8. 
Delete your Jupyter notebook: After converting it to a .qmd file, delete the original .ipynb file to prevent duplication in the blog post. 9. Commit and push your changes: After completing your .qmd file, push your branch to GitHub. A pull request will be automatically created, and once reviewed, it will be merged into the main branch.\nAnatomy of a Quarto Document: If you are running code, please do not forget the execute: freeze: auto, so that the website can be built without re-running your code each time.\n\nAdditional Information for Quarto:\n\nAdd Images: You can add images to your Quarto document using markdown syntax:\n![Image Description](path/to/image.png)\nTo add images from a URL:\n![Image Description](https://example.com/image.png)\nAdd References: Manage references by creating a bibliography.bib file with your references in BibTeX format. Link the bibliography file in your Quarto document header (YAML). Cite references in your text using the following syntax:\nThis is a citation [@citation_key].\nOther Edits: Add headers, footnotes, and other markdown features as needed. Customize the layout by editing the YAML header." }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#anderson-chaplain-model-of-angiogenesis", - "href": "posts/vascularNetworks/VascularNetworks.html#anderson-chaplain-model-of-angiogenesis", - "title": "Vascular Networks", - "section": "Anderson-Chaplain Model of Angiogenesis", - "text": "Anderson-Chaplain Model of Angiogenesis\nAnderson-Chaplain model of angiogenesis describes the angiogenesis process considering the factors like TAF and fibronectin. 
This model contains three variables \\(\\newcommand{\\R}{\\mathbb{R}}\\) \\(\\newcommand{\\abs}[1]{|#1|}\\)\n\n\\(n = n(X,t): \\Omega \\times \\R \\to \\R\\): the endothelial-cell density (per unit area).\n\\(c = c(X,t): \\Omega \\times \\R \\to \\R\\): the tumor angiogenic factor (TAF) concentration (nmol per unit area).\n\\(f = f(X,t): \\Omega \\times \\R \\to \\R\\): the fibronectin concentration (nmol per unit area).\n\nand the time evolution is governed by the following system of PDEs\n\\[\\begin{align*}\n &\\frac{\\partial n}{\\partial t} = D_n\\nabla^2 n - \\nabla\\cdot(\\chi n\\nabla c) - \\nabla\\cdot(\\rho n \\nabla f), \\\\\n &\\frac{\\partial c}{\\partial t} = -\\lambda n c, \\\\\n &\\frac{\\partial f}{\\partial t} = \\omega n - \\mu n f,\n \\end{align*}\\]\nwhere \\(D_n\\) is a diffusion constant taking the random movement of tip cells into account, and \\(\\chi, \\rho\\) reflect the strength of the chemotaxis of tip cells due to the gradients of TAF and fibronectin, respectively. Furthermore, \\(\\lambda, \\mu\\) are the rates at which tip cells consume TAF and fibronectin, respectively, and \\(\\omega\\) denotes the production of fibronectin by the tip cells. Note that we assume that, at the start of the angiogenesis process, fibronectin and TAF have steady-state distributions and do not diffuse. This assumption is not entirely accurate and can be refined.\nHere in this report, we will be using the discrete and stochastic variant of this model. For more detail see (Anderson and Chaplain 1998). See the figure below for some example outputs of the model.\n\n\n\nSome example outputs of the Anderson-Chaplain model of angiogenesis using the implementation of the model shared by (Nardini et al. 2021). We have assumed the source of TAF molecules is located at the right edge of the domain, while the pre-existing parent vessels are located at the left edge of the domain. 
The strength of the chemotaxis and haptotactic (due to fibronectin) signaling is set to be \\(\\chi = 0.4\\), and \\(\\rho = 0.4\\)."
  },
  {
    "objectID": "posts/MATH-612/index.html#multiple-environments-in-the-same-quarto-project",
    "href": "posts/MATH-612/index.html#multiple-environments-in-the-same-quarto-project",
    "title": "Welcome to MATH 612",
    "section": "Multiple environments in the same Quarto project",
    "text": "Multiple environments in the same Quarto project\nIn your blog post, you may want to use specific Python packages, which may conflict with packages used in other posts. To avoid this problem, you need to use a virtual environment. For simplicity please name your environment .venv.\n\nCreating the virtual environment: Go to your post folder (e.g. blog/posts/my_post) and run:\npython -m venv .venv\nThe folder .venv is created and contains the environment.\nInstalling packages: First activate the environment,\nsource .venv/bin/activate\nand then install the packages you need:\npip install package1_name package2_name\nTo run code in Quarto, you need at least the package jupyter. Deactivate the environment with deactivate.\nUsing environment in VS Code: Link the virtual environment to VS Code using the command palette, with the command Python: Select Interpreter and entering the path to your interpreter ending with .venv/bin/python.\nExport your package requirements: If you installed a non-standard package (other than jupyter, numpy, matplotlib, pandas, or plotly, for example), you can export your package requirements so that others can reproduce your environment. First go to your post directory and activate your environment. 
Then run:\npip freeze > requirements.txt" }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#branching-annihilating-random-walker", - "href": "posts/vascularNetworks/VascularNetworks.html#branching-annihilating-random-walker", - "title": "Vascular Networks", - "section": "Branching-Annihilating Random Walker", - "text": "Branching-Annihilating Random Walker\nThe Anderson-Chaplain model of angiogenesis is not the only formulation of this phenomena. A popular alternative formulation is using the notion of branching annihilating random walkers for the to explain the branching morphogenesis of vascular networks. A very detailed discussion on this formulation can be found in Uçar et al. (2021). This formulation has been also successful to models a vast variety of tip-driven morphogenesis in mammary-glands, prostate, kidney (Hannezo et al. 2017), lymphatic system (Uçar et al. 2023), neural branching (Uçar et al. 2021), and etc.\nThe core idea behind this formulation is to assume that the tip cells undergo a branching-annihilating random walk, i.e. they move randomly in the space, turn into pairs randomly (branching), and as they move they produce new cells (stalk) behind their trails, and finally annihilate if they encounter any of the stalk cells. See figure below:\n\n\n\nThe network generated by branching-annihilating process, where the tip cells (orange circles) are doing random walk (not necessarily unbiased random walk) and each generate two random walkers at random times (branching). The tip cells make the stalk cells (the blue lines) along their way and the tip cells annihilate when encounter any of the stalk cells." + "objectID": "posts/cryo_ET/demo.html", + "href": "posts/cryo_ET/demo.html", + "title": "Simulation of tomograms of membrane-embedded spike proteins", + "section": "", + "text": "Cryogenic electron tomography (cryo-ET) is an imaging technique to reconstruct high-resolution 3d structure, usually of biological macromolecules. 
Samples (usually small cells like bacteria and viruses) are prepared in a standard aqueous medium (unlike cryo-EM, where samples are frozen) and imaged in a transmission electron microscope (TEM). The samples are tilted to different angles (e.g. from \\(-60^\\circ\\) to \\(+60^\\circ\\)), and images are obtained at every tilt increment (usually every \\(1^\\circ\\) or \\(2^\\circ\\)).\nThe main advantage of cryo-ET is that it allows cells and macromolecules to be imaged in an undisturbed state. This is crucial in many applications such as drug discovery, where we need to know the in-situ binding state of the target of interest (e.g. a viral spike protein) with the drug.\n\n\n\nTomographic slices of SARS-CoV-2 virions, with spike proteins embedded in the membrane (Shi et al. 2023)\n\n\nIn order to reconstruct macromolecules, tomographic slices need to be processed through a pipeline. A typical cryo-ET data processing pipeline includes tilt series alignment, CTF estimation, tomogram reconstruction, particle picking, iterative subtomogram alignment and averaging, and heterogeneity analysis. Unlike cryo-EM, many algorithms for cryo-ET processing are still under development. Therefore, a large database of cryo-ET data to test and tune algorithms is important. Unfortunately, collecting cryo-ET data is both time- and money-consuming, and the current database of cryo-ET is not large enough, especially for deep learning training, which requires a large amount of data. Therefore, simulation becomes a substitute to generate a large amount of data in a short time and at low expense. In this post, we will focus on the simulation of membrane-embedded proteins."
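The tilt-series geometry described above can be illustrated with a toy forward projector: to first order a TEM image integrates density along the beam, so tilting a synthetic volume and summing along one axis yields a crude simulated tilt series. This is only a minimal sketch, not the simulation pipeline used in the post; the volume, angles, and function names are made up for illustration.

```python
# Toy tilt-series simulator: tilt a synthetic 3D density about the y-axis
# and integrate along the beam direction (z). Purely illustrative.
import numpy as np
from scipy.ndimage import rotate

def simulate_tilt_series(volume, angles_deg):
    """Return one 2D projection of `volume` (z, y, x) per tilt angle."""
    images = []
    for angle in angles_deg:
        # rotation in the (z, x) plane corresponds to a tilt about the y-axis
        tilted = rotate(volume, angle, axes=(0, 2), reshape=False, order=1)
        images.append(tilted.sum(axis=0))  # integrate along the beam
    return np.stack(images)

# Synthetic sample: a dense ball (a "protein") sitting in a flat slab (a "membrane").
vol = np.zeros((32, 32, 32))
zz, yy, xx = np.mgrid[0:32, 0:32, 0:32]
vol[(zz - 16) ** 2 + (yy - 16) ** 2 + (xx - 16) ** 2 < 25] = 1.0
vol[14:18, :, :] += 0.2

tilts = np.arange(-60, 61, 2)  # -60 to +60 degrees, every 2 degrees, as in the text
series = simulate_tilt_series(vol, tilts)
print(series.shape)  # (61, 32, 32): one image per tilt angle
```

A real simulator would additionally model the contrast transfer function and noise, which is why CTF estimation appears in the processing pipeline listed above.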
}, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#time-evolution-of-networks", - "href": "posts/vascularNetworks/VascularNetworks.html#time-evolution-of-networks", - "title": "Vascular Networks", - "section": "Time Evolution Of Networks", - "text": "Time Evolution Of Networks\nVascular networks are not static structure, but rather the evolve in time in response to the changing metabolic demand of the underlying tissue, as well as the metabolic cost of the network itself, and the overall energy required to pump the fluid through the network (See Pries and Secomb (2014) for more discussion). To put this in different words, the role of vascular networks is to deliver nutrients to the tissue and remove the wastes. To do this, it needs to have a space filling configuration with lots of branches. However, due to the Poiseuille law for the flow of fluids in a tube, the power needed to pump the fluid through the tube scales with \\(r^{-4}\\) where \\(r\\) is the radius of the tube. I.e. smaller vessel segments needs a huge power to pump the blood through them. Thus have a massively branched structure is not an optimal solution. On the other hand, the vascular network consists of cells which requires maintenance as well. Thus the optimized vascular network should have a low volume as well. Because of these dynamics in action, in the angiogenesis process first a mesh of new blood vessels form which later evolve to a more ordered and hierarchical structure in a self-organization process.\n\n\n\nRemodeling of vascular network of chick chorioallantoic membrane. Initially (sub-figure 1) a mesh of vascular networks form. Then (sub-figures 2,3,4), through the remodeling dynamics, a more ordered and hierarchical structure emerges. Images are taken from (Richard et al. 2018).\n\n\nTo determine the time evolution of the vascular network we first need to formulate the problem in an appropriate way. 
First, we represent a given vascular network with a multi-weighted graph \\(G=(\\mathcal{V},\\mathcal{E})\\), where \\(\\mathcal{V}\\) is the set of vertices and \\(\\mathcal{E}\\) is the edge set. We define the pressure \\(\\mathbf{P}\\) on the nodes and the flow \\(\\mathbf{Q}\\) on the edges, and let \\(C_{i,j}\\) denote the conductivity of the edge \\((i,j)\\) and \\(L_{i,j}\\) its length. Given the source and sink terms \\(\\mathbf{q}\\) on the nodes, the pressures can be determined by \\[\\mathcal{L} \\mathbf{P} = \\mathbf{q},\\] where \\(\\mathcal{L}\\) is the Laplacian matrix of the graph (see the Appendix for more details). Once we know the pressures on the nodes, we can easily calculate the flow through the edges by \\[\\bf{Q} = \\bf{C} L^{-1} \\bf{\\Delta} \\bf{P}, \\tag{2}\\] where \\(C\\) is the diagonal matrix of edge conductances, \\(L\\) is the diagonal matrix of edge lengths, \\(\\Delta\\) is the transpose of the incidence matrix, \\(\\mathbf{P}\\) is the pressure on the nodes, and \\(\\mathbf{Q}\\) is the flow on the edges. Once we know the flow in the edges, we can design an evolution law for the weights of the edges (which, by Poiseuille’s law, are a function of the radius of the vessel segment). The evolution law can be derived by defining an energy functional and moving down its gradient to minimize it, or we can take an ad-hoc approach and write a mechanistic ODE for the time evolution of the conductances. For the energy functional one can write \\[ E(\\mathbf{C}) = \\frac{1}{2} \\sum_{e\\in \\mathcal{E}}\\left(\\frac{Q_e^2}{C_e} + \\nu C_e^\\gamma\\right), \\] where \\(\\mathcal{E}\\) is the edge set of the graph, \\(Q_e\\) and \\(C_e\\) are the flow and conductance of the edge \\(e\\), and \\(\\nu, \\gamma\\) are parameters. The first term in the sum is of the form ``power = current \\(\\times\\) potential’’ and reflects the power required to pump the flow, and the second term can be shown to reflect the total volume of the network. 
We can set \\[ \\frac{d \\mathbf{C}}{dt} = -\\nabla E, \\] which determines the time evolution of the weights in a direction that reduces the total energy. The steady-state solution of this ODE system is precisely the Euler-Lagrange formulation of the least action principle. Alternatively, one can come up with carefully designed ODEs for the time evolution of the conductances that represent certain biological facts. In particular \\[ \\frac{d C_e}{dt} = \\alpha |Q_e|^{2\\sigma} - b C_e + g \\] proposed by , and \\[ \\frac{d}{dt} \\sqrt{C_e} = F(Q_e) - c\\sqrt{C_e}, \\] proposed by , have been popular choices. See for more details. It is important to note that in the simulations shown here, the initial network is a toy network. This can be improved by using any of the vascular networks generated by the angiogenesis models discussed before.\n\n\n\nTime evolution of an optimal transport network. A triangulation of a 2D domain is considered to be the graph over which we optimize the flow. The sink term is represented by a green dot, while the sources are represented by yellow dots. Different sub-figures show the flow network at different time steps as it converges to the optimal configuration." }, { + "objectID": "posts/cryo_ET/demo.html#background", + "href": "posts/cryo_ET/demo.html#background", + "title": "Simulation of tomograms of membrane-embedded spike proteins", + "section": "", + "text": "Cryogenic electron tomography (cryo-ET) is an imaging technique to reconstruct high-resolution 3D structures, usually of biological macromolecules. Samples (usually small cells like bacteria and viruses) are prepared in a standard aqueous medium (unlike cryo-EM, where samples are frozen) and imaged in a transmission electron microscope (TEM). The samples are tilted to different angles (e.g. 
from \\(-60^\\circ\\) to \\(+60^\\circ\\)), and images are obtained at each tilt increment (usually every \\(1^\\circ\\) or \\(2^\\circ\\)).\nThe main advantage of cryo-ET is that it allows cells and macromolecules to be imaged in an undisturbed state. This is crucial in many applications, such as drug discovery, where we need to know the in-situ binding state of the target of interest (e.g. a viral spike protein) with the drug.\n\n\n\nTomographic slices of SARS-CoV-2 virions, with spike proteins embedded in the membrane (Shi et al. 2023)\n\n\nIn order to reconstruct macromolecules, tomographic slices need to be processed through a pipeline. A typical cryo-ET data processing pipeline includes: tilt series alignment, CTF estimation, tomogram reconstruction, particle picking, iterative subtomogram alignment and averaging, and heterogeneity analysis. Unlike cryo-EM, many algorithms for cryo-ET processing are still under development. Therefore, a large database of cryo-ET data on which to test and tune algorithms is important. Unfortunately, collecting cryo-ET data is both time-consuming and expensive, and the current database of cryo-ET is not large enough, especially for deep learning training, which requires a large amount of data. Therefore, simulation becomes a substitute to generate a large amount of data in a short time and at low expense. In this post, we will focus on the simulation of membrane-embedded proteins." }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#enhanced-loop-detection-algorithm", - "href": "posts/vascularNetworks/VascularNetworks.html#enhanced-loop-detection-algorithm", - "title": "Vascular Networks", - "section": "Enhanced Loop Detection Algorithm", - "text": "Enhanced Loop Detection Algorithm\nBefore, we used to generate .png images of the simulation result (see figures above) and then perform image analysis to detect loops. 
For instance, we convolved the image with 4-connectivity and 8-connectivity matrices to extract the graph structures present in the images. In the new approach, instead, we record the structure of the network directly in a NetworkX data structure. This is not an easy task to perform without smart usage of an object-oriented structure for the code. We organized our code into the following classes\n\nUsing this structure, we can record the graph structure of the generated networks as a NetworkX graph object. Then we can use some of the built-in functions of this library to get the loops (cycles) of the network. However, since the generated networks are large, finding all of the loops (of all scales) is computationally very costly. Instead, we first found a minimal set of cycles in the graph that forms a basis for the cycle space, i.e. we found the loops that can be combined (by symmetric difference) to generate new loops. The following figure shows the basis loops highlighted on the graph.\n\nAs mentioned above, the detected cycles are the basis cycles. The space of all cycles in a graph forms a vector space, and the basis cycles form a basis for that space. In other words, these cycles are all the cycles necessary to generate all of the cycles in the graph. The addition operation between two cycles is the symmetric difference of their edge sets (or XOR of their edges). We can combine the basis cycles to generate higher-level (and lower-level) structures as shown below.\n\nWe can also extract and scale all of the loops for further analysis. The following figure shows all the loops in the network\n\nThe following figures show some of the loop structures that we can get by combining the loops above." 
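The cycle-combination step described above can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions, not the post's actual code: cycles are represented as sets of undirected edges, and combining two cycles is just the XOR of those sets (in the real pipeline the basis cycles would come from NetworkX's `cycle_basis`).

```python
def cycle_edges(nodes):
    """Edge set of a cycle given as an ordered list of nodes."""
    return {frozenset((nodes[i], nodes[(i + 1) % len(nodes)]))
            for i in range(len(nodes))}

def combine(c1, c2):
    """Combine two cycles by the symmetric difference (XOR) of their edge sets."""
    return c1 ^ c2

# Two basis triangles of a square 1-2-3-4 that share the diagonal (1, 3)
tri1 = cycle_edges([1, 2, 3])
tri2 = cycle_edges([1, 3, 4])

# The shared diagonal cancels, leaving the outer 4-cycle
square = combine(tri1, tri2)
```

Here the shared diagonal cancels and the combination is the larger 4-cycle, which is exactly how lower-level basis loops generate the higher-level loop structures shown in the figures.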
+ "objectID": "posts/cryo_ET/demo.html#workflow", + "href": "posts/cryo_ET/demo.html#workflow", + "title": "Simulation of tomograms of membrane-embedded spike proteins", + "section": "Workflow", + "text": "Workflow\nWe will use the Membrane Embedded Proteins Simulator (MEPSi), a tool incorporated in PyCoAn, to simulate the SARS-CoV-2 spike protein (Rodríguez de Francisco et al. 2022). Here, I will briefly go through the workflow of MEPSi.\n\n1. Density modeling\nIn the density modeling step, atom coordinate lists of the macromolecules of interest are given, and a “ground-truth” volume representation is simulated by placing the given macromolecules on a membrane with specified geometry. The algorithm uses a 3D Archimedean spiral to place the molecules at approximately equidistant points along the membrane. Random translations within a bounding box defined by the equidistance and the maximum XY radius of the molecules are then applied. This ensures there is no overlap between macromolecules on the surface. The volume is generated using direct generation of membrane density and Gaussian convolution of the atom positions.\nOptionally, a solvent model can be generated and added to the density. In order to keep the computational cost low, a continuum solvent model with an adjustable contrast tuning parameter is used. A 3D version of Laplacian pyramid blending is used to account for displacements of one object from another to mitigate edge effects and emulate the existence of a hydration layer around the molecules.\n\n\n2. Basis tilt series generation\nIn this step, an unperturbed basis tilt series is generated from the simulated volume. The individual tilt images are obtained by rotating the volume around the Y axis and projecting the density along the Z axis. The reason that a basis tilt series is generated before the final tomogram simulation is to reduce computational cost. 
It can speed up the process considerably if a perturbation-free basis tilt series is first generated, allowing the user to explore perturbation parameters (e.g. contrast transfer function and noise) before generating final tomograms from the perturbed basis tilt series.\n\n\n3. CTF\nOne possible perturbation we can add to the basis tilt series is the contrast transfer function (CTF), which models the effect of the microscope optics. One major determinant of the CTF is the defocus value at the scattering event, which changes while the electrons traverse the specimen. In order to simplify the problem, we treat the simulated specimen as an infinitely thin slice, so only focus changes caused by tilting need to be considered. Projected tilted specimen images are subjected to a CTF model in strips parallel to the tilt axis, with the defocus value modulated according to the position of the strip center.\n\n\n4. Noise\nThe noise model is expressed as a mixture of Gaussian and Laplacian noise, in contrast to the additive white Gaussian noise used in many other simulation applications. The noise in the low-dose images contributing to a tilt series tends to have statistically significant non-zero skewness, which cannot be modeled by a Gaussian error model alone.\n\n\n\nOverlay of an experimental intensity histogram (blue) with noise modeling by Gaussian only (red) vs. with a mix of Gaussian and Laplacian noise (green)\n\n\n\n\n5. Tomogram generation\nFinally, tomograms are simulated from the perturbed basis tilt series with user-specified tilt range and increment." 
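The Gaussian–Laplacian noise mixture from step 4 can be sketched as follows. This is a minimal illustration, not MEPSi's implementation: the mixture weight `w` and the scales `sigma` and `b` are illustrative assumptions, and per-pixel random selection between the two distributions is just one plausible way to form such a mixture.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.7       # fraction of Gaussian noise (assumed, not MEPSi's value)
sigma = 1.0   # Gaussian standard deviation (assumed)
b = 1.0       # Laplacian scale (assumed)

def mixed_noise(shape):
    """Per-pixel mixture: Gaussian with probability w, Laplacian otherwise."""
    gauss = rng.normal(0.0, sigma, size=shape)
    lap = rng.laplace(0.0, b, size=shape)
    mask = rng.random(shape) < w
    return np.where(mask, gauss, lap)

# Add mixed noise to a flat 64x64 "projection" standing in for a tilt image
noisy_tilt = 0.5 + mixed_noise((64, 64))
```

The Laplacian component puts more mass in the tails than a Gaussian of the same scale, which is the kind of heavy-tailed behavior the experimental histograms in the post exhibit.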
}, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#statistical-analysis-of-loops", - "href": "posts/vascularNetworks/VascularNetworks.html#statistical-analysis-of-loops", - "title": "Vascular Networks", - "section": "Statistical Analysis of Loops", - "text": "Statistical Analysis of Loops\nThe mechanism that generated the vascular networks is a stochastic process (Branching process + Simple Random Walk process + local interactions (annihilation)), so we need to use statistical notions to make some observations. In the figure below, the histogram of the cycle length is plotted. The interesting observation is the fact that the number of cycles is exponentially distributed (with respect to the cycle length). The slope of this line (on a log-log plot) can reveal some very important facts about the universality class that our model belongs to. Not only is this very interesting and important from a theoretical point of view, but it can also have very useful practical applications. For instance, in comparing the simulated network with real vascular networks, this slope can be one of the components of comparison.\n\nFurthermore, it is instructive to study the correlation matrix between some of the features of the loops." }, { + "objectID": "posts/cryo_ET/demo.html#results", + "href": "posts/cryo_ET/demo.html#results", + "title": "Simulation of tomograms of membrane-embedded spike proteins", + "section": "Results", + "text": "Results\nIn order to fully demonstrate the capacity of MEPSi, tomograms were simulated from a sample containing three different conformations of the SARS-Cov2 spike protein: 6VXX, 6VYB and 6X2B, with ratio 1:1:2. 
Protein coordinate files in .pdb format were obtained from the RCSB PDB and preprocessed in ChimeraX to align with the z-axis, in order to be modeled in the correct orientation in the density simulation.\n\n\n\nThree conformations of the prefusion trimer of the SARS-Cov2 spike protein: all RBDs in the closed position (left, 6VXX); one RBD in the open position (center, 6VYB); two RBDs in the open position (right, 6X2B)\n\n\nSolvent and CTF were added. An SNR of 0.5 was used. Finally, we generated tomograms every \\(1^\\circ\\) from \\(-60^\\circ\\) to \\(+60^\\circ\\). Below are four tomograms simulated at different tilt angles." }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#geometric-shape-analysis-fréchet-and-hausdorff-distances", - "href": "posts/vascularNetworks/VascularNetworks.html#geometric-shape-analysis-fréchet-and-hausdorff-distances", - "title": "Vascular Networks", - "section": "Geometric Shape Analysis: Fréchet and Hausdorff Distances", - "text": "Geometric Shape Analysis: Fréchet and Hausdorff Distances\nIn geometric shape analysis, comparing cycles involves quantifying their similarity based on the spatial arrangement of points in each cycle. Two widely used measures for such comparisons are the Fréchet Distance and the Hausdorff Distance. These metrics provide different insights into the relationship between cycles, and their results can be visualized as heatmaps of pairwise distances.\n\nFréchet Distance\nThe Fréchet Distance between two curves \\(A = \\{a(t) \\mid t \\in [0,1]\\}\\) and \\(B = \\{b(t) \\mid t \\in [0,1]\\}\\) is defined as:\n\\[\nd_F(A, B) = \\inf_{\\alpha, \\beta} \\max_{t \\in [0,1]} \\| a(\\alpha(t)) - b(\\beta(t)) \\|,\n\\]\nwhere:\n\n\\(\\alpha(t)\\) and \\(\\beta(t)\\) are continuous, non-decreasing reparameterizations of the curves \\(A\\) and \\(B\\).\n\\(\\|\\cdot\\|\\) denotes the Euclidean norm.\nThe infimum is taken over all possible parameterizations \\(\\alpha\\) and \\(\\beta\\).\n\n\nInterpretation of Heatmap\nThe heatmap for the Fréchet distance shows the pairwise distances between all cycles. 
Each entry \\((i, j)\\) in the heatmap represents \\(d_F(C_i, C_j)\\), the Fréchet distance between cycle \\(C_i\\) and cycle \\(C_j\\). Key insights include:\n\nSmall Values: Cycles with low Fréchet distances are geometrically similar in terms of overall shape and trajectory.\nLarge Values: Larger distances indicate significant differences in the geometry or shape of the cycles.\n\nThe heatmap highlights clusters of similar cycles and outliers with unique geometries.\n\n\n\n\nHausdorff Distance\nThe Hausdorff Distance between two sets of points \\(A\\) and \\(B\\) is defined as:\n\\[\nd_H(A, B) = \\max \\{ \\sup_{a \\in A} \\inf_{b \\in B} \\| a - b \\|, \\sup_{b \\in B} \\inf_{a \\in A} \\| b - a \\| \\}.\n\\]\nThis can be broken down into:\n\n\\(\\sup_{a \\in A} \\inf_{b \\in B} \\| a - b \\|\\): The maximum distance from a point in \\(A\\) to the closest point in \\(B\\).\n\\(\\sup_{b \\in B} \\inf_{a \\in A} \\| b - a \\|\\): The maximum distance from a point in \\(B\\) to the closest point in \\(A\\).\n\nThe Hausdorff distance quantifies the greatest deviation between the two sets of points, considering how well one set covers the other.\n\n\nInterpretation of Heatmap\nThe heatmap for the Hausdorff distance shows pairwise distances between cycles. Each entry \\((i, j)\\) represents \\(d_H(C_i, C_j)\\), the Hausdorff distance between cycle \\(C_i\\) and cycle \\(C_j\\). 
Key insights include:\n\nSmall Values: Indicates that the points of one cycle are closely aligned with the points of another cycle.\nLarge Values: Reflects that one cycle has points significantly farther away from the other, suggesting geometric dissimilarity.\n\nThe heatmap highlights cycles that are well-aligned (small distances) and those that are far apart in terms of shape.\n\n\n\nComparison of Metrics\n\nFréchet Distance: Sensitive to the ordering of points along the curves, making it suitable for comparing trajectories or continuous shapes.\nHausdorff Distance: Ignores the order of points and focuses on the maximum deviation between sets, making it useful for analyzing shape coverage.\n\nBoth metrics complement each other in analyzing the geometric properties of cycles. While the Fréchet distance emphasizes trajectory similarity, the Hausdorff distance focuses on the extent of shape overlap." + "objectID": "posts/AFM-data/index.html", + "href": "posts/AFM-data/index.html", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "", + "text": "We present here the protocole to process biological images such as bacteria atomic force miroscopy data. We want to study the bacteria cell shape and extract the main geometrical feature." }, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#dimensionality-reduction", - "href": "posts/vascularNetworks/VascularNetworks.html#dimensionality-reduction", - "title": "Vascular Networks", - "section": "Dimensionality Reduction", - "text": "Dimensionality Reduction\nNonlinear dimensionality reduction methods project high-dimensional data into a lower-dimensional space while preserving specific structural properties.\n\nt-SNE (t-Distributed Stochastic Neighbor Embedding)\nt-SNE minimizes the divergence between probability distributions over pairwise distances in high-dimensional and low-dimensional spaces. 
It focuses on preserving local structures (relationships between nearby points) and is particularly effective at uncovering clusters. The key parameters are Perplexity: controls the balance between local and global structure (default: 30), and Output Dimension: reduced to 2D for visualization.\n\n\n\nSome notes to interpret the plot: cycles forming tight clusters share strong similarities in features such as length, area, or compactness. Isolated points (outliers) indicate rare or unique geometries. t-SNE emphasizes local structures, making it ideal for detecting smaller, tightly-knit groups.\n\n\n\n\nUMAP (Uniform Manifold Approximation and Projection)\nUMAP approximates the high-dimensional data manifold and optimally preserves both local and global structures. It provides more interpretable embeddings with smooth transitions between clusters. The key parameters are Number of Neighbors: defines the size of the local neighborhood considered for embedding (default: 15), and Output Dimension: reduced to 2D for visualization.\n\n\n\nSome notes to interpret the plot: UMAP preserves both local and global structures, making it suitable for analyzing large-scale patterns. Transitions between clusters indicate gradual changes in feature space, useful for understanding progression or hierarchy in cycle characteristics. Dense clusters suggest strong feature alignment, while sparse areas highlight feature variability.\n\n\n\n\nConclusion\nWe used a stochastic process (Branching Annihilating Random Walker) to generate random networks (that resemble vascular networks). Then we translated this structure into a NetworkX graph for easier processing. We extracted a cycle basis for the cycle space of the graph and, using the symmetric difference operation, generated new cycles (of different scales). Then we performed different statistical and geometrical analyses on the shapes of the loops in the graph. 
Also we calculated different features for the graph and used dimensionality reduction methods to see if we can observe any structures (clusters) in low dimension." }, { + "objectID": "posts/AFM-data/index.html", + "href": "posts/AFM-data/index.html", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "", + "text": "We present here the protocol to process biological images such as bacterial atomic force microscopy data. We want to study the bacterial cell shape and extract its main geometrical features." }, { + "objectID": "posts/AFM-data/index.html#biological-context", + "href": "posts/AFM-data/index.html#biological-context", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "Biological context", + "text": "Biological context\nMycobacterium smegmatis is a Gram-positive, rod-shaped bacterium. It is 3 to 5 \\(\\mu m\\) long and around 500 \\(nm\\) wide. This non-pathogenic species is often used as a biological model to study pathogenic Mycobacteria such as M.tuberculosis (responsible for tuberculosis) or M.abscessus, with which it shares the same cell wall structure (Tyagi and Sharma 2002). In particular, M.smegmatis grows fast (3-4 hours doubling time compared to 24h for M. tuberculosis), allowing for faster experimental protocols.\nHere are some known properties of M.smegmatis bacteria:\n\nThey present variations of cell diameter along their longitudinal axis (Eskandarian et al. 2017). The cell diameter is represented as a height profile along the cell centerline. We respectively name peaks and troughs the local maxima and minima of this profile.\n\n\n\n\n3D image of M.smegmatis. The orange line represents the height profile.\n\n\n\nThey grow following biphasic and asymmetrical polar dynamics (Hannebelle et al. 2020). The cells elongate from the poles, where material is added. After division, the pre-existing pole (OP) elongates at a high rate, whereas the newly created pole (NP) first grows slowly, and then switches to fast growth after the New End Take Off (NETO).\n\n\n\n\nGrowth dynamics." 
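The peaks and troughs of a height profile, as defined above, can be located with a simple local-extrema test. This is an illustrative sketch on a made-up toy profile, not the pipeline's actual code (which works on real AFM height data along the extracted centerline).

```python
import numpy as np

def peaks_and_troughs(h):
    """Indices of local maxima (peaks) and minima (troughs) of a 1D height profile."""
    interior = np.arange(1, len(h) - 1)
    peaks = interior[(h[interior] > h[interior - 1]) & (h[interior] > h[interior + 1])]
    troughs = interior[(h[interior] < h[interior - 1]) & (h[interior] < h[interior + 1])]
    return peaks, troughs

# Toy height profile (in micrometers): two peaks with a trough in between
h = np.array([0.40, 0.45, 0.50, 0.46, 0.42, 0.47, 0.52, 0.48])
p, t = peaks_and_troughs(h)
```

On real data one would typically smooth the profile first (or use a prominence threshold, e.g. `scipy.signal.find_peaks`) so that measurement noise does not create spurious extrema.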
}, { - "objectID": "posts/vascularNetworks/VascularNetworks.html#appendix", - "href": "posts/vascularNetworks/VascularNetworks.html#appendix", - "title": "Vascular Networks", - "section": "Appendix", - "text": "Appendix\nFor a graph, the Laplacian matrix contains the information on the in/out flow of stuff into the nodes.\n\n\n\nThen the degree matrix is given by \\[ D = \\begin{pmatrix}\n 2 & 0 & 0 & 0 & 0 \\\\\n 0 & 4 & 0 & 0 & 0 \\\\\n 0 & 0 & 2 & 0 & 0 \\\\\n 0 & 0 & 0 & 2 & 0 \\\\\n 0 & 0 & 0 & 0 & 2\n \\end{pmatrix}, \\] the adjacency matrix is given by \\[ A = \\begin{pmatrix}\n 0 & 1 & 1 & 0 & 0 \\\\\n 1 & 0 & 1 & 1 & 1 \\\\\n 1 & 1 & 0 & 0 & 0 \\\\\n 0 & 1 & 0 & 0 & 1 \\\\\n 0 & 1 & 0 & 1 & 0\n \\end{pmatrix}, \\] and the Laplacian matrix is given by \\[ L = D -A =\n \\begin{pmatrix}\n 2 & -1 & -1 & 0 & 0 \\\\\n -1 & 4 & -1 & -1 & -1 \\\\\n -1 & -1 & 2 & 0 & 0 \\\\\n 0 & -1 & 0 & 2 & -1 \\\\\n 0 & -1 & 0 & -1 & 2\n \\end{pmatrix}.\n \\] It is straightforward to generalize the notion of the Laplacian matrix to weighted graphs: in the degree matrix \\(D\\), the diagonal entries will be the sum of the weights of all edges connected to that node, and in the adjacency matrix, instead of zeros and ones, we will have the weights of the connections.\nThere is also another way of finding the Laplacian matrix, by using the notion of the incidence matrix. To do so, we first need to make our graph directed. Any choice of directions on the edges will do the job and will yield a correct answer. For instance, consider the following directed graph\n\n\n\nIts incidence matrix will be \\[\n M = \\begin{pmatrix}\n -1 & 1 & 0 & 0 & 0 & 0 \\\\\n 0 & -1 & 1 & -1 & 0 & -1 \\\\\n 1 & 0 & -1 & 0 & 0 & 0 \\\\\n 0 & 0 & 0 & 1 & 1 & 0 \\\\\n 0 & 0 & 0 & 0 & -1 & 1 \\\\\n \\end{pmatrix}\n \\] The Laplacian matrix can be written as \\[ \\mathcal{L} = M M^T. 
\\] Note that in the case of weighted graphs, we will have \\[ \\mathcal{L} = M W M^T \\tag{1}\\] where \\(W\\) is a diagonal matrix containing the weights. These computations can be done easily in NetworkX.\nThe incidence matrix is also very useful in calculating the pressure difference between the nodes of a particular edge. Let \\(\\Delta = M^T\\). Then, given the vector \\(P\\) that contains the pressures on the vertices, the pressure differences on the edges will be given by \\(\\Delta P\\), where \\(\\Delta\\) is the transpose of the incidence matrix. This comes in handy when we want to calculate the flow on the edges, which will be given by \\[ \\bf{Q} = \\bf{C} L^{-1} \\bf{\\Delta} \\bf{P}, \\tag{2} \\] where \\(C\\) is a diagonal matrix of the conductances of the edges, \\(L\\) is the diagonal matrix of the ``length’’ of each edge, \\(\\Delta\\) is the transpose of the incidence matrix, and \\(P\\) is the pressure on the nodes. \\(Q\\) is the flow on the edges. In this particular example we are assuming that the relation between flow and the pressure difference is \\(Q_e = C_e (p_i - p_j)/L\\). But we can have many other choices.\nKnowing the sources and sinks on the nodes, the pressure can be determined by the Kirchhoff law \\[ \\mathcal{L} \\bf{P} = \\bf{q}, \\] where the vector \\(\\mathbf{q}\\) contains the source and sink values for each node. This is the same as solving the Poisson equation on the graph. This can also be written in terms of the flow, i.e. \\[ \\Delta^T \\bf{Q} = \\bf{q}. \\] By Equation (2) we can write \\[ (\\bf{\\Delta}^T \\bf{C}\\bf{L}^{-1}\\Delta) \\bf{P} = \\bf{q}. \\] Since \\(\\Delta = M^T\\), the expression inside the parentheses is clearly Equation (1).\nSimilar to the Poisson equation on the graph, which is equivalent to Kirchhoff’s law, we can solve other types of heat and wave equations on the graph as well. The Laplacian matrix plays a key role: 
\\[ \\frac{\\partial p}{\\partial t} = - \\mathcal{L} p + q, \\] for the heat equation, and \\[ \\frac{\\partial^2 p}{\\partial t^2} = -\\mathcal{L}p + q, \\] for the wave equation." }, { + "objectID": "posts/AFM-data/index.html#raw-image-pre-processing", + "href": "posts/AFM-data/index.html#raw-image-pre-processing", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "Raw image pre-processing", + "text": "Raw image pre-processing\n\nData\nSeveral data acquisitions were conducted with wild types and different mutant strains. The raw data is composed of time series of AFM log files for each experiment. Each log file contains several images, each one representing a physical channel such as height, stiffness, adhesion, etc. After extraction of the data, forward and backward cells are aligned, and artefacts such as image scars are detected and corrected.\n\n\n\nAt each time step, images representing different physical variables are produced by the AFM" }, { + "objectID": "posts/AFM-data/index.html#segmentation", + "href": "posts/AFM-data/index.html#segmentation", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "Segmentation", + "text": "Segmentation\nAt each time step, images are segmented to detect each cell using the cellpose package (Stringer et al. 2021). If available, different physical channels are combined to improve the segmentation. Forward and backward images are also combined.\n\n\n\nImages are combined to improve the segmentation\n\n\nHere is an example of how to use cellpose on an image. Different models are available (with the seg_mod variable), depending on the training datasets. 
With cellpose 3, different denoising models are also available (with the denoise_mod variable).\n\n\nCode\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom cellpose import io, denoise, plot\nfrom PIL import Image\n\n\n'''\nParameters\n'''\n\nimage_path = 'raw_img.png'\npath_to_save = 'segmented_img'\n# Segmentation model type\nseg_mod = 'cyto' \n# Denoising model\ndenoise_mod = \"denoise_cyto3\" \n# Expected cell diameter (pixels)\ndia = 40\n# Type of segmentation (with / without nuclei, different color channels or not)\nchan = [0,0] \n# Segmentation sensitivity parameters\nthres = 0.8\ncelp = 0.4\n\n'''\nComputing segmentation\n'''\n\n\n# Opening the image to segment\nimg = np.array(Image.open(image_path))[:,:,1]\n\n# Choosing a model type\nmodel = denoise.CellposeDenoiseModel(gpu=False, model_type=seg_mod, restore_type=denoise_mod)\n\n# Computing the segmentation\nmasks, flows, st, diams = model.eval(img, diameter = dia, channels=chan, flow_threshold = thres, cellprob_threshold=celp)\n\n\n# Saving the results into a numpy file\nio.masks_flows_to_seg(img, masks, flows, path_to_save, channels=chan, diams=diams)\n\n\nWe plot the final results:\n\n\nCode\nplt.imshow(img, cmap='gray')\nplt.show()\n\n\n\n\n\nRaw image\n\n\n\n\n\n\nCode\nmask_RGB = plot.mask_overlay(img, masks)\nplt.imshow(mask_RGB)\nplt.show()\n\n\n\n\n\nImage with segmented masks overlaid" }, { + "objectID": "posts/AFM-data/index.html#centerline", + "href": "posts/AFM-data/index.html#centerline", + "title": "Extracting cell geometry from Atomic Force Microscopy", + "section": "Centerline", + "text": "Centerline\nSince we are interested in studying the variations of the cell diameter, we define the height profile as the value of the cell height along the cell centerline. The cell centerlines are computed using a skeletonization algorithm Lee, Kashyap, and Chu (1994). 
Here is an example of skeletonization\n\n\nCode\nfrom skimage.morphology import skeletonize\n\n# Selecting the first mask\nfirst_mask = masks == 1\n\nskel_img = skeletonize(first_mask, method='lee') \nskel = np.argwhere(skel_img)\nplt.imshow(first_mask, cmap='gray')\n\nplt.scatter(skel[:,1], skel[:,0], 0.5*np.ones(np.shape(skel[:,0])), color='r', marker='.')\nplt.show()\n\n\n\n\n\n\n\n\n\nDepending on the mask shapes, centerlines may have branches:\n\n\nCode\nfrom skimage.morphology import skeletonize\n\n# Selecting another mask\nfirst_mask = masks == 3\n\nskel_img = skeletonize(first_mask) #, method='lee'\nskel = np.argwhere(skel_img)\nplt.imshow(first_mask, cmap='gray')\n\nplt.scatter(skel[:,1], skel[:,0], 0.5*np.ones(np.shape(skel[:,0])), color='r', marker='.')\nplt.show()\n\n\n\n\n\n\n\n\n\nIn practice, centerlines are pruned and extended to the cell poles, in order to capture the cell length. Other geometrical properties such as mask centroids or outlines are computed as well.\n\n\n\nFinal static processing results on real-life data. White masks are excluded from the cell tracking algorithm (see part 2). Black dots are cell centroids. The yellow boxes represent artefact cleaning." }, { + "objectID": "posts/Farm-Shape-Analysis/index.html", + "href": "posts/Farm-Shape-Analysis/index.html", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "", + "text": "In modern agriculture, the geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.\nThis research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. 
However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.\nIn this blog, we focus on identifying fake farms by analyzing field shapes to detect unrealistic entries. Our goal is to enhance data accuracy, providing a stronger foundation for future agriculture-related research.\n\n\n\nLitefarm Interface" }, { + "objectID": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", + "href": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", + "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "section": "", + "text": "In modern agriculture, the geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.\nThis research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.\nIn this blog, we focus on identifying fake farms by analyzing field shapes to detect unrealistic entries. Our goal is to enhance data accuracy, providing a stronger foundation for future agriculture-related research.\n\n\n\nLitefarm Interface" }, { + "objectID": "posts/extension_to_RECOVAR/index.html", + "href": "posts/extension_to_RECOVAR/index.html", + "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", + "section": "", + "text": "In the previous post Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods, we reviewed the pipeline of RECOVAR (Gilles and Singer 2024) to generate movies showing the heterogeneity of proteins, and discussed its pros, cons and some improvements we could make. RECOVAR is a linear method which borrows the idea from principal component analysis to project the complex structure information within cryo-EM data corresponding to each particle onto a lower-dimensional space, where a trajectory is computed to illustrate the conformational and compositional changes (see the previous post for details).\nCompared with other methods, mostly based on deep learning, RECOVAR has several advantages, including but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we’ve noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to a series of atomic models, which is usually the final product structure biologists desire in order to obtain atomic interpretations. 
In this post, we will focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.\nBefore getting to the Methods, I would like to provide background information about SARS-CoV2 spike protein. SARS-CoV2 spike protein is a trimer binding to the surface of SARS-CoV2 virus. It has a so-called receptor-binding domain (RBD) capable of switching between “close” and “open” states. When in the open state, the spike is able to recognize and bind to angiotensin-converting enzyme 2 (ACE2), an omnipresent enzyme on the membrane of the cells of the organs in the respiratory system, heart, intestines, testis and kidney (Hikmet et al. 2020). The binding to ACE2 helps the virus dock on the target cells and initialize the invasion and infection of the cells. Therefore, spike is often the major target for antibody development. Previous research has mainly focused on developing drugs neutralizing the RBD regions in the open state. However, as I mentioned before, the spike can switch to the close state, in which an antibody targeting the open RBD will no longer be able to access it, making the drugs less effective. Motivated by recent progress in the heterogeneity analysis of proteins, researchers now focus on the conformational changes instead of a homogeneous state. Developing drugs to block the shape change of the spike is considered a potentially more efficient way to neutralize viruses. This is why it is important to have a reliable pipeline to generate movies showing the conformational changes in spike proteins.\n\n\n\nAn illustration of how shape changes in the RBD of SARS-CoV2 spike lead to the binding of ACE2. Spike is a trimer with three chains (in grey, purple and green). The RBD is located in the part of the spike away from the virus membrane. In this figure, the RBD of one chain (in green) is open and binds to ACE2. (Taka et al. 2020)\n\n\nThe dataset we used comprises 271,448 SARS-CoV2 spike protein particles, with some binding to ACE2. 
Therefore we would expect the algorithm to be able to deal with both conformational and compositional heterogeneity." }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", - "href": "posts/Farm-Shape-Analysis/index.html#introduction-and-motivation", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", + "objectID": "posts/extension_to_RECOVAR/index.html#background", + "href": "posts/extension_to_RECOVAR/index.html#background", + "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", "section": "", - "text": "In modern agriculture, the geometric features of farmland play a crucial role in farm management and planning. Understanding these characteristics enables farmers to make informed decisions, manage resources more efficiently, and promote sustainable agricultural practices.\nThis research leverages data from Litefarm, an open-source agri-tech application designed to support sustainable agriculture. Litefarm provides detailed information about farmland, including field shapes, offering valuable insights for analysis. However, as an open platform, Litefarm’s database may include unrealistic or inaccurate data entries, such as “fake farms.” Cleaning and validating this data is essential for ensuring the reliability of agricultural analyses.\nIn this blog, we focus on identifying fake farms by analyzing field shapes to detect unrealistic entries. Our goal is to enhance data accuracy, providing a stronger foundation for future agriculture-related research.\n\n\n\nLitefarm Interface" + "text": "In the previous post Heterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods, we reviewed the pipeline of RECOVAR (Gilles and Singer 2024) to generate movies showing the heterogeneity of proteins, and discussed its pros, cons and some improvements we could make. 
RECOVAR is a linear method which borrows the idea from principal component analysis to project complex structure information within cryo-EM data corresponding to each particle onto a lower dimensional space, where a trajectory is computed to illustrate the conformational and compositional changes (see previous post for details).\nCompared with other methods, mostly based on deep learning, RECOVAR has several advantages, including but not limited to fast computation of embeddings, easy trajectory discovery in latent space and fewer hyperparameters to tune. Nevertheless, we’ve noticed several problems when we tested RECOVAR on our SARS-CoV2 datasets. One shortcoming is that the density-based trajectory discovery algorithm used by RECOVAR involves a deconvolution operation between two large matrices, which is extremely expensive. The other improvement we would like to make is to extend the series of density maps output by RECOVAR to the series of atomic models, which is usually the final product structure biologists desire in order to obtain atomic interpretations. In this post, we will focus on how we address these two problems, and present and interpret results from our SARS-CoV2 dataset.\nBefore getting to the Methods, I would like to provide background information about SARS-CoV2 spike protein. SARS-CoV2 spike protein is a trimer binding to the surface of SARS-CoV2 virus. It has a so-called receptor-binding domain (RBD) capable of switching between “close” and “open” states. When in the open state, the spike is able to recognize and bind to angiotensin-converting enzyme 2 (ACE2), an omnipresent enzyme on the membrane of the cells of the organs in the respiratory system, heart, intestines, testis and kidney (Hikmet et al. 2020). The binding to ACE2 helps the virus dock on the target cells and initialize the invasion and infection of the cells. Therefore, spike is often the major target for antibody development. 
Previous researches mainly focus on developing drugs neutralizing the RBD regions in the open state. However as I mentioned before, spike can switch to the close state, in which the antibody targeting open RBD will not longer be able to access it, making the drugs less effective. Motivated by recent progress in the heterogeneity analysis of proteins, researchers now focus on the conformational changes instead of a homogeneous state. Developing drugs to block the shape change of spike is considered an potentially more efficient way to neutralize viruses. This is why it is important to have a reliable pipeline to generate movies showing the conformational changes in spike proteins.\n\n\n\nAn illustration of how shape changes in RBD of SARS-CoV2 spike lead to the binding of ACE2. Spike is a trimer with three chains (in grey, purple and green). The RBD is located in the part of the spike away from virus membrane. In this figure, the RBD of one chain (in green) is open and binds to ACE2.(Taka et al. 2020)\n\n\nThe dataset we used comprises of 271,448 SARS-CoV2 spike protein particles, with some binding to ACE2. Therefore we would expect the algorithm to be able to deal with both conformational and compositional heterogeneity." }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#dataset-overview-and-preparation", - "href": "posts/Farm-Shape-Analysis/index.html#dataset-overview-and-preparation", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", - "section": "2. Dataset Overview and Preparation", - "text": "2. Dataset Overview and Preparation\n\nData Source\nThe data for this study was extracted from Litefarm’s database, which contains detailed information about farm geometries, locations, and user associations. 
The dataset included the following key attributes:\n\nFarm-Level Information:\nEach farm is uniquely identified by a farm_ID, representing an individual farm within the Litefarm database.\nPolygon-Level Information:\nEach farm consists of multiple polygons, corresponding to distinct areas such as fields, gardens, or barns. Each polygon is uniquely identified by a location_ID, ensuring that every area within a farm is individually traceable.\nGeometric Attributes:\n\nArea: The total surface area of the polygon.\n\nPerimeter: The boundary length of the polygon.\n\nVertex Coordinates:\nThe geographic shape of each polygon is defined by a list of vertex coordinates in latitude and longitude format, represented as: [(lat1, lon1), (lat2, lon2), ..., (latN, lonN)].\nPolygon Types:\nThe polygons in each farm are categorized into various types:\n\nFields\n\nFarm site boundaries\n\nResidences\n\nBarns\n\nGardens\n\nSurface water\n\nNatural areas\n\nGreenhouses\n\nCeremonial areas\n\n\nThis rich dataset captures farm structures and geometries comprehensively, enabling the analysis of relationships between polygon features and agricultural outcomes.\nThis study focuses specifically on productive areas—gardens, greenhouses, and fields—as these contribute directly to agricultural output. Since different polygon types possess unique geometric characteristics, we focused on a single type to maintain analytical consistency.\nAs the Litefarm database is dynamic and continuously updated, the data captured as of November 28th showed that 36.4% of farms included garden areas, 20.7% had greenhouse areas, and nearly 70% contained fields. To ensure a robust and representative analysis, we focused on field polygons, which had the highest proportion within the dataset.\n\n\nRefined Litefarm Dataset\nTo ensure that only valid and realistic farm data was included in the analysis, we applied rigorous SQL filters to the Litefarm database. 
These filters excluded:\n\nPlaceholder farms and internal test accounts.\n\nDeleted records.\n\nFarms located in countries with insufficient representation (fewer than 10 farms).\n\nThe table below summarizes the results of the filtering process and the composition of the cleaned dataset:\n\n\n\nDescription\nCount\n\n\n\n\nInitial number of farms in Litefarm\n3,559\n\n\nFarms after SQL filtering\n2,919\n\n\nFarms with field areas\n2,022\n\n\nFarms with garden areas\n1,063\n\n\nFarms with greenhouse areas\n607\n\n\nTotal number of field polygons\n6,340\n\n\n\nBy narrowing the focus to field polygons, we ensured that the dataset was both robust and suitable for exploring the relationship between geometric features and agricultural outcomes." + "objectID": "posts/extension_to_RECOVAR/index.html#methods", + "href": "posts/extension_to_RECOVAR/index.html#methods", + "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", + "section": "Methods", + "text": "Methods\n\nReview of the original RECOVAR pipeline\nIn this section I will briefly review RECOVAR. You can refer to the previous blog for more formal and detailed formulation of the problem.\nRECOVAR starts with estimating the mean \\(\\hat{\\mu}\\) and covariance matrix \\(\\hat{\\Sigma}\\) of the conformations by solving the least square problems between the projection of the mean conformation and the particle images in the dataset. Next, principal components (PCs) can be computed from \\(\\hat{\\mu}\\) and \\(\\hat{\\Sigma}\\), and we obtained embeddings projected from conformations on the latent space formed by those (PCs). In order to generate a movie, the authors compute conformational densities by deconvolving densities in the latent space with embedding uncertainty, and finds a path between two specified states maximizing the accumulated densities along the path. 
Then each embedding is converted into a density map via kernel regression.\n\n\nExtensions to RECOVAR: MPPC for path discovery\nThe density-based path discovery algorithm used by RECOVAR is based on the physical consideration that molecules prefer to take the path with the lowest free energy, which is the path with the highest conformational density, and is robust against outliers. Nevertheless, the time to deconvolve the density is exponential in the number of PCs, and the deconvolution requires large memory. Our 24GB GPU can deconvolve the density at a dimension of at most 4, but 4 PCs are usually not enough to capture the heterogeneity, as shown in the figure below, which indicates how the eigenvalues change with the number of PCs when applying RECOVAR to the SARS-CoV2 spike dataset. There are still quite large drops in the eigenvalues after 4 PCs.\n\n\n\nEigenvalues (y-axis) of PCs (indexed by x-axis) of the SARS-CoV2 spike dataset applied with RECOVAR\n\n\nTherefore, we proposed an alternative method to discover paths by computing multiple penalized principal curves (MPPC) (Kirov and Slepčev 2017). The basic idea of MPPC is to find one or multiple curves that fit all the given points as closely as possible, with constraints on the number and lengths of the curves. In order to be solved numerically, the curves are usually discretized. Let \\(y^1 = (y_1, y_2, ..., y_{m_1}), y^2 = (y_{m_1+1}, y_{m_1+2}, ..., y_{m_1+m_2}),...,y^k = (y_{m-m_k+1}, y_{m-m_k+2},...,y_{m})\\) be \\(k\\) curves represented by \\(m=m_1+m_2+...+m_k\\) points. Let \\(s_c = \\sum_{j=1}^{c}m_j\\) be the indices of the end points of curve \\(c\\). Each point \\(x_i\\) in the data to fit is assigned to the closest point on the curves, and we denote by \\(I_j\\) the group of indices of data points that are assigned to curve point \\(y_j\\). 
The goal is to minimize: \\[\\sum_{j=1}^m\\sum_{i\\in I_j}w_i|x_i-y_j|^2+\\lambda_1\\sum_{c=0}^{k-1}\\sum_{j=1}^{m_{c+1}-1}|y_{s_c+j+1}-y_{s_c+j}|+\\lambda_1 \\lambda_2 (k-1)\\]\nwhere \\(w_i\\) is the weight assigned to the \\(i\\)th data point, and \\(\\lambda_1\\) and \\(\\lambda_2\\) regularize the lengths and number of the curves. \\(\\sum_{j=1}^m\\sum_{i\\in I_j}w_i|x_i-y_j|^2\\) penalizes the distance of the curves to the data points, \\(\\lambda_1\\sum_{c=0}^{k-1}\\sum_{j=1}^{m_{c+1}-1}|y_{s_c+j+1}-y_{s_c+j}|\\) is the total length of all the curves, and \\(\\lambda_1 \\lambda_2 (k-1)\\) controls the number of curves. Applied to our case, we set \\(w_i\\) to be the inverse of the trace of the covariance matrix of the embedding, to make the curves fit better to those embeddings with high confidence.\n\n\nExtensions to RECOVAR: atomic model fitting\nWhen resolving homogeneous structures of proteins, atomic models are usually the final product instead of density maps, as they contain more structural information. Atomic models are fitted into density maps either manually or automatically, but most approaches start from scratch, which is very inefficient when applied to a density map series, because the difference between neighboring maps should be relatively small. We can take advantage of this property by updating the coordinates of the fitted model of the previous frame to get the model fitted to the current density map. Hence, we proposed two algorithms to fit atomic models, both based on gradient descent.\nLet \\(R_{t-1}\\in \\mathbb{R}^{N_a\\times 3}\\) be the fitted atomic model of the \\((t-1)\\)th density map, where \\(N_a\\) is the number of atoms in the protein. We can use a deposited protein structure or a model predicted from sequence using algorithms like AlphaFold as \\(R_0\\). Let \\(V_t\\in\\mathbb{R}^{N\\times N\\times N}\\) be the \\(t\\)th density map we want to fit in, where \\(N\\) is the grid size. 
We cannot directly minimize the “distance” between \\(R_{t-1}\\) and \\(V_t\\), because atomic coordinates cannot be compared with volume maps. A natural way to solve this issue is to map the atomic coordinates to a density map with a function \\(f: \\mathbb{R}^{N_a\\times 3}\\rightarrow \\mathbb{R}^{N\\times N\\times N}\\), for example, by summing up gaussian kernels centered at each coordinate, i.e. \\[V_t({\\bf r}=(x,y,z)^T) = \\sum_{k=1}^{N_a} \\exp\\left(-\\frac{\\|{\\bf r} - R_t[k]\\|_2^2}{2\\sigma_k^2}\\right)\\]\nHowever, the computational time for one mapping is \\(O(N^3N_a)\\), which is already very slow, even without considering the fact that we have to map coordinates to densities in many update iterations. Hence, in practice we used truncated gaussian kernels.\nNow we have all the tools needed to have algorithms fit the atomic model \\(R_{t-1}\\) into the density map \\(V_t\\). Our first algorithm, purely based on gradient descent, is as follows:\n\nWhen computing the loss used for gradient descent, we included not only the difference between \\(V_t\\) and the density mapped from the coordinates, but also the difference between the starting and current bond lengths/angles, to preserve the original structure. In practice, we computed intra-residue bond lengths, i.e. the bond lengths of \\(N-CA, CA-C \\text{ and } C-O\\), and inter-residue bond lengths \\(C_i-N_{i+1}\\). Proteins can have multiple chains (like the SARS-CoV2 spike, which has three chains), so we set the inter-residue bond lengths at the end points of the chains to \\(0\\). We used the dihedral angles \\(\\phi\\) (i.e. angles formed by \\(C_i-N_{i+1}-CA_{i+1}-C_{i+1}\\)) and \\(\\psi\\) (i.e. angles formed by \\(N_i-CA_i-C_i-N_{i+1}\\)) as bond angles, and similarly set the dihedral angles across the chains to \\(0\\).\nOne weakness of gradient descent is that it can easily get stuck in local optima. 
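As a concrete illustration of the truncated-kernel idea above, here is a minimal NumPy sketch of mapping coordinates to a density grid; the grid spacing, kernel width and truncation radius used here are illustrative assumptions, not the values from our implementation:

```python
import numpy as np

def coords_to_density(coords, N=32, voxel=1.0, sigma=1.5, trunc=3.0):
    """Sum Gaussian kernels centered at each atom, truncated at
    `trunc` standard deviations, onto an N x N x N grid."""
    vol = np.zeros((N, N, N))
    r = int(np.ceil(trunc * sigma / voxel))  # truncation radius in voxels
    cutoff2 = (trunc * sigma) ** 2
    for x, y, z in coords:
        # voxel indices of the atom center
        ci = int(round(float(x) / voxel))
        cj = int(round(float(y) / voxel))
        ck = int(round(float(z) / voxel))
        # visit only voxels inside the truncation window: O(r^3) per atom
        for i in range(max(0, ci - r), min(N, ci + r + 1)):
            for j in range(max(0, cj - r), min(N, cj + r + 1)):
                for k in range(max(0, ck - r), min(N, ck + r + 1)):
                    d2 = ((i * voxel - x) ** 2 + (j * voxel - y) ** 2
                          + (k * voxel - z) ** 2)
                    if d2 <= cutoff2:
                        vol[i, j, k] += np.exp(-d2 / (2.0 * sigma ** 2))
    return vol
```

Truncation brings the cost of one mapping from \\(O(N^3N_a)\\) down to roughly \\(O(r^3N_a)\\) with \\(r \\ll N\\), which is what makes repeated mapping inside the gradient descent loop affordable.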
Leveraging the recent progress in diffusion models for protein generation, we proposed the second algorithm as following:\n\nThe inner for loop is the same as Algorithm1, where the coordinates are updated through gradient descent. The difference is that an outer loop is added which diffuses and then denoises the fitted coordinates from previous round of gradient descent and uses the denoised coordinates as the starting model of the current round of gradient descent. We adapted the diffusion model and graph neural network (GNN) based denoiser from Chroma (Ingraham et al. 2023)." }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#shape-analysis", - "href": "posts/Farm-Shape-Analysis/index.html#shape-analysis", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", - "section": "3. Shape Analysis", - "text": "3. Shape Analysis\nThis study focuses on the geometric properties of field polygons, as these are essential for understanding farm structures and ensuring data reliability. Each field polygon is represented by a series of vertices in latitude-longitude pairs, which outline its geometric boundaries. These vertices are the foundation for calculating key metrics such as area, perimeter, and more complex shape properties.\nTo perform a robust analysis, we systematically processed and evaluated the field polygon data through the following steps:\n\n1. Vertex Distribution Analysis\nThe first step in our analysis was to examine the vertex distribution of the field polygons to understand their general characteristics and ensure data quality. A box plot was created to visualize the distribution of vertex counts: \nThe results revealed a wide range of vertex counts, spanning from 3 to 189 vertices. This variability required filtering to address potential outliers. 
Using the z-score method, we identified and excluded extreme values, capping the maximum vertex count at 34.\nAfter filtering, we analyzed the revised vertex distribution using a histogram, which revealed that 47.4% of field polygons had exactly four vertices:\n\n\n\nhistogram of number of veritces\n\n\n\n\n2. Validation of Area and Perimeter Metrics\n\nRecalculation Process:\n\nVertex coordinates, initially in latitude-longitude format, were transformed into a planar coordinate system (EPSG:6933) to enable precise calculations.\nArea and perimeter were computed directly from the transformed vertex data.\n\nScatter plots comparing the user-provided values with the recalculated metrics showed strong alignment, with most points clustering around the diagonal (dashed line), confirming the accuracy of the recalculated values:\n\nPerimeter Comparison\n\nArea Comparison\n\n\nThis validation step provided confidence in the accuracy of the recalculated metrics, allowing us to proceed with subsequent shape analysis using reliable data." + "objectID": "posts/extension_to_RECOVAR/index.html#results", + "href": "posts/extension_to_RECOVAR/index.html#results", + "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", + "section": "Results", + "text": "Results\n\nResults of SARS-CoV2 datasets\nAfter obtaining an ab-initio model containing pose information from CryoSPARC (Punjani et al. 2017), we ran RECOVAR with a dimension of 4 on our SARS-CoV2 spike dataset after downsampling particles to 128. Notice that in practice a grid size of 256 or higher is recommended to construct density maps with decent resolution, but we used 128 here for fast test of the original pipeline and extensions later. K-Means clustering was performed to find centers among the embeddings. 
Here, for comparison with the modified algorithms in the later sections, we show the complete movie of one RBD transitioning from the open state with ACE2 to the close state below: \nThe original pipeline of RECOVAR is able to capture the motion of the RBD between the open and close states, as well as compositional changes in ACE2.\n\n\nComparison of paths discovered by density vs. MPPC\nTo compare the paths generated by MPPC to the original density-based approaches, we ran MPPC on the embeddings with a dimension of 4. The figure below shows the path generated by the density-based methods from state 0 to 1 and 2 to 3, and the path output by MPPC:\n\n\n\nPaths in 4D space output by density-based methods and MPPC. In each sub-figure, the path is visualized on the plane formed by 6 pairs of principal components.\n\n\nWe can see that the path between 0 and 1 is completely missing in the MPPC path. The path from 2 to 3 is present in the MPPC path between the orange node and the purple node, but is slightly pulled towards outliers.\nIt is mentioned in Methods that one advantage of using MPPC is that its low computational cost allows us to fit data in higher dimensions, so we also fit MPPC to the data in 10D. The results are presented in the figure below:\n\nWe can see that the spike in the 10D movie is more flexible and there are more changes in the shape than in 4D.\n\n\nResults of atomic model fitting\nWe first tested the two atomic model fitting algorithms on the simplest case, where we started from an atomic model and fit into a density map that is close to the starting model. We took the deposited SARS-CoV2 spike protein structure 7V7R as the initial model and generated the target density map with truncated gaussian kernels from another protein, 7V7Q.\n\n\n\nStarting (7V7R in blue) and target (7V7Q in brown) SARS-CoV2 spike used to test the atomic model fitting algorithms\n\n\nWe ran 12000 iterations for Algorithm1. 
To make a fair comparison, the same total number of loops, composed of 60 outer diffusion loops and 200 inner gradient descent loops, was run with Algorithm2. We kept the regularization parameters the same for both algorithms. Both algorithms took about 950 seconds to complete. In UCSF Chimera (Pettersen et al. 2004), we aligned the initial model 7V7R, the fitted model from Algorithm1 and the fitted model output by Algorithm2 with 7V7Q, computed the root mean square deviation (RMSD) between aligned coordinates, and annotated the structures with a red-blue color scale, where red denotes high RMSD (large difference) and blue means low RMSD (small difference). The results are shown below:\n\n\n\nLeft: initial model (7V7R) aligned with target model (7V7Q); Middle: fitted model from Algorithm1 aligned with target model (7V7Q); Right: fitted model from Algorithm2 aligned with target model (7V7Q)\n\n\nSurprisingly, Algorithm1 performs better than Algorithm2, with more regions in deep blue color indicating low RMSD, though its design is relatively simple. Overall, both algorithms make significant progress from the initial model to fit into the target density map. There are certain white regions with medium RMSD, but the most important motions in the RBD regions are successfully captured.\nTo test whether there will be a significant accumulation of errors if we keep using the fitted model from the last frame as the initial model to fit the current frame, we used our algorithms to fit into a series of three density maps, generated from proteins 7V7R, 7V7Q, and 7V7S, starting from 7V7P. 
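As a sketch of the metric used throughout these comparisons (global RMSD over already-aligned coordinate pairs; the alignment itself is done in Chimera, and the function name here is illustrative):

```python
import numpy as np

def rmsd(a, b):
    """Root mean square deviation between two aligned (N_a x 3) coordinate sets."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    assert a.shape == b.shape, "coordinate sets must be aligned and equal-sized"
    # per-atom squared deviation, averaged over atoms, then square-rooted
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))
```

A per-atom variant, `np.linalg.norm(a - b, axis=1)`, gives the local deviations of the kind used for the red-blue coloring.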
Consecutive proteins in the series were aligned and local RMSD was computed to visualize the degree of conformational changes at different regions of different frames more intuitively.\n\n\n\nStarting from 7V7P (brown), we fit the density map simulated from 7V7R (pink), 7V7Q (blue), and 7V7S (green) sequentially\n\n\n\n\n\nAligment of consecutive proteins in the series for test\n\n\nMost conformational changes in this series occur in the RBD region, with 7V7S undergoing the most significant changes, and expected to be the hardest model to fit. We used the same parameters as before to fit each model with both algorithms, and followed the same procedure to compute and visualize local RMSD for each frame in the series.\n\n\n\nTest results on series from Algorithm1\n\n\n\n\n\nTest results on series from Algorithm2\n\n\nSame as the previous test, Algorithm1 performs better than Algorithm2 in fitting all the maps in the series. Compared with fitting to maps generated from 7V7Q starting with “true” 7V7R, initializing model with fitted 7V7R from previous step does not lead to siginificant increase in RMSD in fitted 7V7Q. There are some white regions with medium RMSD shared by three fitted models, but the RMSD of these regions does not increase. There is a part with high RMSD in the left region of the last structure 7V7S in the series, but it seems that the error is not accumulated from previous fitting as the RMSD of this region of the privious fitting is very low." }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#field-polygon-standardization-and-preparation", - "href": "posts/Farm-Shape-Analysis/index.html#field-polygon-standardization-and-preparation", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", - "section": "Field Polygon Standardization and Preparation", - "text": "Field Polygon Standardization and Preparation\nTo focus on the geometric properties of field polygons, we projected all polygons into a size-and-shape space. 
This transformation isolates the shape and scale of the polygons while removing variations caused by rotation and translation. The size-and-shape space ensures consistent and meaningful comparisons of the underlying geometric features.\nWhile this study emphasizes polygon shapes, we recognize that area is a critical feature in agricultural studies due to its relationship with factors like regional regulations and agricultural policies. Thus, we preserved the size (scaling) component in our analysis to maintain the relevance of area.\nTo ensure uniformity and consistency in the dataset, we performed the following preprocessing steps:\n\nStandardizing Landmark Points:\n\nTo enable meaningful comparisons in the size-and-shape space, each polygon was resampled to have exactly 34 evenly spaced points along its boundary. The following Python function illustrates this process:\n\n\nCode\nimport folium\nimport json\nfrom shapely.geometry import shape, Polygon, Point, MultiPoint, MultiPolygon, LineString,LinearRing, MultiLineString\nfrom shapely.ops import unary_union, transform, nearest_points\nfrom collections import defaultdict\nimport geopy.distance\nimport pandas as pd\nimport math\nimport numpy as np\nfrom itertools import combinations\nimport itertools\nimport pyproj\nfrom functools import partial\nfrom collections import defaultdict\nimport altair as alt\nimport matplotlib.pyplot as plt\nimport plotly.graph_objs as go\nfrom pyproj import Transformer, CRS \nimport seaborn as sns\nimport plotly.express as px\nimport logging\nfrom shapely.validation import explain_validity\nimport geopandas as gpd\nimport ast\nfrom geographiclib.geodesic import Geodesic\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.decomposition import PCA\nfrom geopy.distance import geodesic\nfrom geomstats.geometry.pre_shape import PreShapeSpace\nfrom geomstats.visualization import KendallDisk, KendallSphere\n\n\n\n\nCode\ndef resample_polygon(projected_coords, num_points=34):\n 
\"\"\"\n Resample a polygon's boundary to have a specified number of evenly spaced points.\n\n Parameters:\n - projected_coords: List of coordinates defining the polygon's boundary.\n - num_points: The number of evenly spaced points to resample (default is 34).\n\n Returns:\n - new_coords: List of resampled coordinates.\n \"\"\"\n ring = LinearRing(projected_coords)\n \n total_length = ring.length\n\n distances = np.linspace(0, total_length, num_points, endpoint=False)\n \n new_coords = [ring.interpolate(distance).coords[0] for distance in distances]\n \n return new_coords\n\n\n\nEnsuring Consistent Vertex Direction:\n\nAll polygons were standardized to have vertices drawn in the same direction (clockwise or counterclockwise). This step ensures that the orientation of the vertices does not introduce inconsistencies in the analysis.\n\n\nCode\ndef is_clockwise(coords):\n \"\"\"\n Check if the polygon vertices are in a clockwise direction.\n\n Parameters:\n - coords: List of coordinates defining the polygon's boundary.\n\n Returns:\n - True if the polygon is clockwise; False otherwise.\n \"\"\"\n ring = LinearRing(coords)\n return ring.is_ccw == False \n\ndef make_clockwise(coords):\n \"\"\"\n Convert the polygon's vertices to a clockwise direction, if it is not \n\n Parameters:\n - coords: List of coordinates defining the polygon's boundary.\n\n Returns:\n - List of coordinates ordered in a clockwise direction.\n \"\"\"\n if not is_clockwise(coords): \n return [coords[0]] + coords[:0:-1] # Reverse the vertex order, keeping the start point\n return coords\n\n\nThe image illustrates four polygons that have been standardized by resampling them to have 34 evenly spaced points, with all vertices aligned in a clockwise direction.\n\n\n\nThe standardized polygon\n\n\n\nValidation of Standardization\nTo confirm the accuracy of these transformations, we compared the areas and perimeters of the resampled polygons with the original values. 
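This check can be sketched in pure NumPy (a self-contained equivalent of the shapely-based `resample_polygon` above; the helper names and the tolerance thresholds below are illustrative assumptions):

```python
import numpy as np

def resample_closed(coords, num_points=34):
    """Evenly spaced points along a closed polygon boundary."""
    pts = np.asarray(coords, dtype=float)
    ring = np.vstack([pts, pts[:1]])               # close the ring
    seg = np.diff(ring, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    targets = np.linspace(0.0, cum[-1], num_points, endpoint=False)
    out = []
    for t in targets:
        i = np.searchsorted(cum, t, side="right") - 1
        frac = (t - cum[i]) / seg_len[i]           # position within segment i
        out.append(ring[i] + frac * seg[i])
    return np.array(out)

def shoelace_area(pts):
    """Polygon area from vertex coordinates (shoelace formula)."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def perimeter(pts):
    """Total boundary length of the closed polygon."""
    ring = np.vstack([pts, pts[:1]])
    return float(np.linalg.norm(np.diff(ring, axis=0), axis=1).sum())
```

Comparing `shoelace_area`/`perimeter` before and after `resample_closed` quantifies how much the 34-point resampling distorts each polygon; for simple shapes the relative drift stays within a few percent.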
The results demonstrated minimal deviation, indicating the transformations preserved the integrity of the shapes.\n\nPerimeter Comparison\n\n\n\n\nperimeter comparison\n\n\n\nArea Comparison\n\n\n\n\narea comparison\n\n\nBy meeting these preprocessing requirements, we ensured that the polygons were accurately prepared for subsequent shape analysis." + "objectID": "posts/extension_to_RECOVAR/index.html#discussion", + "href": "posts/extension_to_RECOVAR/index.html#discussion", + "title": "Extensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data", + "section": "Discussion", + "text": "Discussion\nIn this project we proposed MPPC as an alternative approach to compute path. Although this a method can be used to find paths in higher dimension with very fast speed, it is more sensitive to outliers. One way to address this issue is to iteratively remove points that are far away from the curves and then fit the curve. Another feature of MPPC is that it does not take the starting and ending points. This can be either an advantage or disadvantage, depending on the objective. MPPC works if the goal is to study the conformational change trajectory in the entire space. Nevertheless, if we are more interested in how proteins transit between two specific states, MPPC may output path even not passing these two states. On the other hand, the movie output from trajectories found by MPPC in higher dimension indeed captures more changes in shape, which helps discover rare conformations.\nOne problem occurring to lots of datasets like the one we tested is that the output path contains both conformational and compositional heterogeneity. From the movies of the spike we can see ACE2 suddenly appear or disappear at the top of the lifted RBD region. This is essential as we want the algorithm to discover compositional heterogeneity as well, but it causes trouble to atomic fitting. 
In the conventional pipeline, this problem is addressed via discrete 3D classification to separate particles with different compositions, which may not be very accurate when applied to complex datasets with both compositional and conformational heterogeneity. In fact, 3D classification in cryoSPARC fails to distinguish particles with and without ACE2 on our spike protein dataset when no templates are provided. Here instead we may want to leverage the powerful tools of RECOVAR and directly classify particles in the continuous latent space. One potential approach would be segmenting the latent space based on the mass of the volume associated with the embeddings. This approach may not work in cases where the compositional difference does not lead to a change in mass, but as long as the compositional heterogeneity leads to a difference in mass that is more significant than noise (like SARS-CoV2 spike + ACE2 in our case), this method should work. We checked the feasibility of this approach by computing the mass of the density maps over time in a movie output by RECOVAR using our SARS-CoV2 data, as follows:\n\n\n\nIllustrations of how the mass of the density map changes in the movie of SARS-CoV2 spike, some binding to ACE2\n\n\nThis movie demonstrates a relatively complex series of changes, where the spike undergoes the following transitions: one RBD up + one ACE2 -> one RBD up -> all RBDs down -> 1 RBD up -> 2 RBDs up + 1 ACE2 -> 2 RBDs up + 2 ACE2’s. There is a clear cutoff at a mass of around 900,000, above which ACE2 is present. 
The difference in mass between volumes with 1 ACE2 and with 2 ACE2’s is less obvious, but separating spikes with and without ACE2 is enough for the purpose of fitting atomic models to the density maps from the closed states up to the moment where the RBD completely lifts but ACE2 is still absent.\nRegarding our atomic model fitting algorithms, Algorithm1, which is purely based on gradient descent, works surprisingly well, even better than Algorithm2, whose design is more complex. Both algorithms update the changes in the RBD region with high accuracy. Although some regions with medium fitting quality in the first frame are inherited by later fittings, the RMSD does not rise further. One improvement we can make to our current algorithms is to change the constant sigma in the Gaussian kernel used to map coordinates to density maps into an annealing parameter. Initially we make sigma large to enable the model to undergo large conformational changes. Later we shrink sigma for better fitting in the local region."
+ "objectID": "posts/Farm-Shape-Analysis/index.html#shape-alignment-and-fréchet-mean-analysis",
+ "href": "posts/Farm-Shape-Analysis/index.html#shape-alignment-and-fréchet-mean-analysis",
+ "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity",
+ "section": "Shape Alignment and Fréchet Mean Analysis",
+ "text": "Shape Alignment and Fréchet Mean Analysis\nWith data preparation complete, the polygons were ready for analysis in the size-and-shape space. This specialized framework enables consistent comparison of shapes by accounting for geometric differences, including scaling, translation, and rotation. It provides a robust foundation for meaningful geometric analysis.\nThe polygons were aligned using Procrustes analysis(Dryden and Mardia 2016), and their Fréchet Mean was iteratively computed in Euclidean space. 
This process standardizes the shapes, ensuring variations caused by translation and rotation are removed, allowing for accurate and meaningful comparisons.\nThe Fréchet Mean(Dryden and Mardia 2016) represents the “average” shape in a geometric space (manifold), minimizing the average squared distance to all sample shapes. It serves as a standardized and central representation of the dataset.\n\n\nStep-by-Step Overview\n\nShape Alignment:\n\nThe align_shape function performs Procrustes alignment through the following steps:\n\nRemoving Translation:\n\nThe centroid (average position of all points) of each shape is computed. The shape is then centered by subtracting its centroid from all points, ensuring the shape is position-independent.\n\nRemoving Rotation:\n\nUsing Singular Value Decomposition (SVD), the optimal rotation matrix is calculated to align the target shape with the reference shape. This step removes rotation differences while preserving the relative positions of the points.\n\n\n\nMeasuring Shape Differences:\n\nThe riemannian_distance function computes the Riemannian distance between two shapes in size-and-shape space. 
This metric quantifies geometric differences between shapes, considering both size and rotation.\n\n\n\n\nRiemannian Distance in Size-and-Shape Space\nGiven two \\(k\\)-point configurations in \\(m\\)-dimensions, \\(X_1^o, X_2^o \\in \\mathbb{R}^{k \\times m}\\), the Riemannian distance(Dryden and Mardia 2016) in size-and-shape space is defined as:\n\\[\nd_S(X_1^o, X_2^o) = \\sqrt{S_1^2 + S_2^2 - 2 S_1 S_2 \\cos \\rho(X_1^o, X_2^o)}\n\\]\nwhere:\n\n\\(S_1, S_2\\): Centroid sizes of \\(X_1^o\\) and \\(X_2^o\\), representing the Frobenius norms of the centered shapes.\n\\(\\rho(X_1^o, X_2^o)\\): Riemannian shape distance.\n\nThis formula ensures that the distance captures both shape similarity and scaling differences, making it a robust tool for geometric analysis.\n\nIterative Fréchet Mean Calculation:\n\nThe algorithm begins with an initial reference shape and aligns all other shapes to it using Procrustes alignment.\nThe Fréchet Mean is then calculated as the average shape in Euclidean space.\nThe shapes are iteratively re-aligned to the updated Fréchet Mean, refining the alignment and mean calculation until convergence is achieved.\n\n\n\n\n\n\nPython Implementation\nThe following Python code implements the entire process of shape alignment, Riemannian distance computation, and iterative Fréchet Mean calculation.\n\n\nCode\ndef align_shape(reference_shape, target_shape):\n \"\"\"\n Align the target shape to the reference shape using Procrustes alignment.\n\n Parameters:\n - reference_shape: The reference shape to align to.\n - target_shape: The shape to be aligned.\n\n Returns:\n - aligned_shape: The aligned target shape.\n \"\"\"\n reference_shape = np.array(reference_shape)\n target_shape = np.array(target_shape)\n\n # Step 1: Remove the translation\n centroid_reference = np.mean(reference_shape, axis=0)\n centroid_target = np.mean(target_shape, axis=0)\n centered_reference = reference_shape - centroid_reference\n centered_target = target_shape - 
centroid_target\n\n # Step 2: Remove the rotation\n u, s, vh = np.linalg.svd(np.matmul(np.transpose(centered_target), centered_reference))\n r = np.matmul(u, vh)\n aligned_shape = np.matmul(centered_target, r)\n\n return aligned_shape\n\ndef riemannian_distance(reference_shape, target_shape):\n \"\"\"\n Compute the Riemannian distance between two shapes.\n\n Parameters:\n - reference_shape: The reference shape.\n - target_shape: The target shape.\n\n Returns:\n - distance: The Riemannian distance between the shapes.\n \"\"\"\n reference_shape = np.array(reference_shape)\n target_shape = np.array(target_shape)\n\n # Step 1: Compute centroid sizes\n S1 = np.linalg.norm(reference_shape) \n S2 = np.linalg.norm(target_shape)\n\n # Step 2: Remove translation by centering the shapes\n centered_reference = reference_shape - np.mean(reference_shape, axis=0)\n centered_target = target_shape - np.mean(target_shape, axis=0)\n\n # Step 3: Compute optimal rotation using SVD\n H = np.dot(centered_target.T, centered_reference)\n U, _, Vt = np.linalg.svd(H)\n R = np.dot(U, Vt)\n\n # Step 4: Align target shape\n aligned_target = np.dot(centered_target, R)\n\n # Step 5: Compute the Riemannian distance\n cosine_rho = np.trace(np.dot(aligned_target.T, centered_reference)) / (S1 * S2)\n cosine_rho = np.clip(cosine_rho, -1, 1)\n distance = np.sqrt(S1**2 + S2**2 - 2 * S1 * S2 * cosine_rho)\n\n return distance\n\n# Iterative Fréchet Mean Calculation\nepsilon = 1e-6 \nmax_iterations = 100 \nreference_shape = field_data['resampled_point'].iloc[0] \naligned_shapes = []\n\n# Align all shapes to the initial reference shape\nfor target_shape in field_data['resampled_point']:\n aligned_shape = align_shape(reference_shape, target_shape)\n aligned_shapes.append(aligned_shape)\n\n# Initialize Euclidean space and calculate initial Fréchet Mean\neuclidean_space = Euclidean(dim=aligned_shapes[0].shape[1])\nfrechet_mean = FrechetMean(euclidean_space)\nprevious_frechet_mean_shape = 
frechet_mean.fit(aligned_shapes).estimate_\nconverged = False\niteration = 0\nfrechet_means = [previous_frechet_mean_shape]\n\nwhile not converged and iteration < max_iterations:\n iteration += 1\n aligned_shapes2 = []\n for target_shape in field_data['resampled_point']:\n aligned_shape = align_shape(previous_frechet_mean_shape, target_shape)\n aligned_shapes2.append(aligned_shape)\n\n # Calculate new Fréchet Mean\n frechet_mean = FrechetMean(euclidean_space)\n current_frechet_mean_shape = frechet_mean.fit(aligned_shapes2).estimate_\n frechet_means.append(current_frechet_mean_shape)\n \n # Check convergence\n difference = riemannian_distance(previous_frechet_mean_shape, current_frechet_mean_shape)\n if difference < epsilon:\n converged = True\n else:\n previous_frechet_mean_shape = current_frechet_mean_shape" + "objectID": "index.html", + "href": "index.html", + "title": "Biological shape analysis (under construction)", + "section": "", + "text": "Welcome to MATH 612\n\n\nInstructions and tips for MATH 612 students\n\n\n\nMATH 612\n\n\n\n\n\n\n\n\n\nDec 18, 2024\n\n\n\n\n\n\n\n\n\n\n\n\nAn analysis and segmentation of contours in AFM imaging data\n\n\n\n\n\n\nbiology\n\n\nAFM\n\n\n\n\n\n\n\n\n\nDec 17, 2024\n\n\nBerkant Cunnuk\n\n\n\n\n\n\n\n\n\n\n\n\nOptimal Mass Transport for Shape Progression Study\n\n\nPyTorch Implementation of Benamou-Brenier Formulation\n\n\n\noptimal transport\n\n\nshape morphing\n\n\nBenamou-Brenier’s Formulation\n\n\npytorch\n\n\nautomatic differentiation\n\n\n\n\n\n\n\n\n\nDec 16, 2024\n\n\nSiddharth Rout\n\n\n\n\n\n\n\n\n\n\n\n\nShape analysis of C. 
elegans E cell\n\n\n\n\n\n\nbiology\n\n\n\n\n\n\n\n\n\nDec 16, 2024\n\n\nViktorija Juciute\n\n\n\n\n\n\n\n\n\n\n\n\nFarm Shape Analysis: Linking Geometry with Crop Yield and Diversity\n\n\n\n\n\n\nlandscape-analysis\n\n\nagriculture\n\n\n\n\n\n\n\n\n\nDec 15, 2024\n\n\nMo Wang\n\n\n\n\n\n\n\n\n\n\n\n\nLandmarking the ribosome exit tunnel\n\n\n\n\n\n\nribosome\n\n\ncryo-em\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nDec 15, 2024\n\n\nElla Teasell\n\n\n\n\n\n\n\n\n\n\n\n\nExtensions to RECOVAR for heterogeneity analysis of SARS-CoV2 spike protein from cryo-EM data\n\n\n\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nDec 11, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nIdentifying R-loops in AFM imaging data\n\n\n\n\n\n\nbiology\n\n\nAFM\n\n\n\n\n\n\n\n\n\nNov 10, 2024\n\n\nBerkant Cunnuk\n\n\n\n\n\n\n\n\n\n\n\n\nVascular Networks\n\n\n\n\n\n\nGraph theory\n\n\nVascular Networks\n\n\n\n\n\n\n\n\n\nNov 5, 2024\n\n\nAli Fele Paranj\n\n\n\n\n\n\n\n\n\n\n\n\nShape Analysis of Contractile Cells\n\n\n\n\n\n\nbiology\n\n\ncell morphology\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nYuqi Xiao\n\n\n\n\n\n\n\n\n\n\n\n\nExploring cell shape dynamics dependency on the cell migration\n\n\n\n\n\n\nCell Morphology\n\n\nCell Migration\n\n\nDifferential Geometry\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nPavel Bukleomishev\n\n\n\n\n\n\n\n\n\n\n\n\nEmbryonic cell size asymmetry analysis\n\n\n\n\n\n\nbiology\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nViktorija Juciute\n\n\n\n\n\n\n\n\n\n\n\n\nTrajectory Inference for cryo-EM data using Principal Curves\n\n\n\n\n\n\nMath 612D\n\n\n\n\n\n\n\n\n\nOct 28, 2024\n\n\nForest Kobayashi\n\n\n\n\n\n\n\n\n\n\n\n\nDefining landmarks for the ribosome exit tunnel\n\n\n\n\n\n\nribosome\n\n\ncryo-em\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nOct 25, 2024\n\n\nElla Teasell\n\n\n\n\n\n\n\n\n\n\n\n\nOptimal Mass Transport and its Convex Formulation\n\n\n\n\n\n\noptimal transport\n\n\nshape morphing\n\n\nMonge’s Problem\n\n\nKantorovich’s Formulation\n\n\nBenamou-Brenier’s 
Formulation\n\n\n\n\n\n\n\n\n\nOct 24, 2024\n\n\nSiddharth Rout\n\n\n\n\n\n\n\n\n\n\n\n\nHeterogeneity analysis of cryo-EM data of proteins dynamic in comformation and composition using linear subspace methods\n\n\n\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nSep 18, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nUnderstanding Animal Navigation using Neural Manifolds With CEBRA\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\nmathematics\n\n\nbiomedical engineering\n\n\nneuroscience\n\n\n\n\n\n\n\n\n\nSep 18, 2024\n\n\nDeven Shidfar\n\n\n\n\n\n\n\n\n\n\n\n\nExtracting cell geometry from Atomic Force Microscopy\n\n\nPart 2: Temporal amd morphological analysis\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nSep 17, 2024\n\n\nClément Soubrier, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nHorizontal Diffusion Map\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 30, 2024\n\n\nWenjun Zhao\n\n\n\n\n\n\n\n\n\n\n\n\nOrthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 29, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nCentroidal Voronoi Tessellation\n\n\nRelations with Semidiscrete Wasserstein distance\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 26, 2024\n\n\nAryan Tajmir Riahi\n\n\n\n\n\n\n\n\n\n\n\n\nSimulation of tomograms of membrane-embedded spike proteins\n\n\n\n\n\n\ncryo-ET\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nQiyu Wang\n\n\n\n\n\n\n\n\n\n\n\n\nShape Analysis of Cancer Cells\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nRiemannian elastic metric for curves\n\n\n\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nWanxin Li\n\n\n\n\n\n\n\n\n\n\n\n\nPoint cloud representation of 3D volumes\n\n\nApplication to cryoEM density maps\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nAryan Tajmir Riahi, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nMulti Dimensional Scaling of ribosome exit 
tunnel shapes\n\n\nAnalyze and compare the geometry of the ribosome exit tunnel\n\n\n\ncryo-EM\n\n\nribosome\n\n\nMDS\n\n\n\n\n\n\n\n\n\nAug 15, 2024\n\n\nShiqi Yu, Artem Kushner, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAlpha Shapes in 2D and 3D\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 14, 2024\n\n\nWenjun Zhao\n\n\n\n\n\n\n\n\n\n\n\n\nQuasiconformal mapping for shape representation\n\n\n\n\n\n\ntheory\n\n\n\n\n\n\n\n\n\nAug 9, 2024\n\n\nClément Soubrier\n\n\n\n\n\n\n\n\n\n\n\n\n3D tessellation of biomolecular cavities\n\n\nProtocol for analyzing the ribosome exit tunnel\n\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nAug 4, 2024\n\n\nArtem Kushner, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAlignment of 3D volumes with Optimal Transport\n\n\nApplication to cryoEM density maps\n\n\n\nexample\n\n\ncryo-EM\n\n\n\n\n\n\n\n\n\nAug 4, 2024\n\n\nAryan Tajmir Riahi, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nExtracting cell geometry from Atomic Force Microscopy\n\n\nPart 1: Static analysis\n\n\n\nbiology\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nJul 31, 2024\n\n\nClément Soubrier, Khanh Dao Duc\n\n\n\n\n\n\n\n\n\n\n\n\nAnalysis of Eye Tracking Data\n\n\n\n\n\n\nbioinformatics\n\n\n\n\n\n\n\n\n\nJul 31, 2024\n\n\nLisa\n\n\n\n\n\n\nNo matching items" }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#global-fréchet-mean-and-outlier-detection", - "href": "posts/Farm-Shape-Analysis/index.html#global-fréchet-mean-and-outlier-detection", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", - "section": "Global Fréchet Mean and Outlier Detection", - "text": "Global Fréchet Mean and Outlier Detection\nHere is the global Fréchet mean calculated from all field polygons:\n\n\n\nThe global mean shape\n\n\nThe following image illustrates the original polygon and its alignment with the Fréchet mean:\n\n\n\nAligned Shape\n\n\nAfter aligning all shapes to the Fréchet mean, the riemannian_distance function was used to calculate the distances between the mean shape and each 
aligned shape. To identify potential outliers, the z-score method was applied to these distance values.\nBelow are the four field polygons detected as outliers using the global Fréchet mean:\n\n\nCode\nimport pandas as pd\n\n# Load the CSV file\nfour_potiential_fake_farm = pd.read_csv(\"data/potiential_fake_field.csv\")\n\n# Display the table\nfour_potiential_fake_farm # Or use `data` to show the entire table\n\n\n\n\n\n\n\n\n\nFarm Number\ncountry_name\ntype\ncalculated_perimeter_m\ncalulated_area_ha\nnumber of vertices\ndistance_to_frechet_mean\nz_score\n\n\n\n\n0\nFarm 310\nUnited States\nfield\n744797.7117\n2.600780e+06\n3\n591590.0609\n48.784896\n\n\n1\nFarm 71\nCanada\nfield\n864206.5248\n4.251124e+06\n5\n709800.2531\n58.580655\n\n\n2\nFarm 45\nCanada\nfield\n341370.9916\n8.453115e+04\n5\n177371.8498\n14.459753\n\n\n3\nFarm 2792\nIndia\nfield\n200958.9993\n2.170029e+05\n4\n166440.3554\n13.553890" + "objectID": "about.html", + "href": "about.html", + "title": "About", + "section": "", + "text": "About this blog" }, { - "objectID": "posts/Farm-Shape-Analysis/index.html#fréchet-mean-shape-by-country", - "href": "posts/Farm-Shape-Analysis/index.html#fréchet-mean-shape-by-country", - "title": "Farm Shape Analysis: Linking Geometry with Crop Yield and Diversity", - "section": "Fréchet Mean Shape by Country", - "text": "Fréchet Mean Shape by Country\nThe shape of field polygons varies significantly across different countries. To capture this variation, we calculated the Fréchet mean shape* for each country based on the fields located within that specific country.\nThe plot below summarizes the Fréchet mean shapes for all countries in the dataset.\nIn this visualization, different colors represent different continents. 
It is evident that both the shapes and areas of the field polygons differ substantially across regions, highlighting the diversity in field geometry across countries.\n\n\n\nSummary of Countries’ Mean Shapes\n\n\n\nAssessing Mean Shape Representation in Countries with Limited Data\nTo evaluate the representativeness of the mean shape, we specifically selected countries with fewer than 10 polygons. The small number of polygons in these cases allows for easier visualization, helping us assess whether the mean shape effectively captures the overall geometric characteristics of these datasets.\n\nZambia\n\n\n\nField polygons and Fréchet mean for Zambia\n\n\n\n\nChile\n\n\n\nField polygons and Fréchet mean for Chile\n\n\nFrom the above plots, we can draw the following conclusions:\n\nEffective Representation with Similar Shapes:\nWhen the field polygons within a country have similar shapes, the calculated Fréchet mean serves as an effective representation of the general shape trend.\nLimitations with Diverse Shapes:\nIf the field polygons within a country show significant variation in their shapes, the Fréchet mean becomes less representative and may fail to adequately capture the geometric diversity of the dataset.\n\n\n\n\nDetecting Potential Fake Field Polygons\nBuilding on the country-level mean shape analysis, we applied the same methodology to detect potential fake field polygons. For each country, field polygons were aligned to their corresponding Fréchet mean, and the z-score technique was used to identify anomalies based on the distances between each polygon and the mean shape.\nThrough this analysis, we identified 51 potential fake field polygons. To verify their validity, we visualized each field polygon on satellite imagery. 
The results are summarized in the plot below:\n\nGray markers: Fake fields\n\nPink markers: True fields\n\nOrange markers: Potential fake fields\n\n\n\n\nSatellite plot for all 51 potential fake fields\n\n\nAfter visualizing all 51 potential fake field polygons, the findings were as follows:\n\n45.1% were confirmed as fake fields.\n\n29.4% were ambiguous, meaning they could potentially be either fake or real fields, requiring further investigation.\n25.5% were determined to be true fields.\n\nBelow are examples of confirmed fake fields. These polygons often exhibit:\n\nUnusual geometric shapes\nSizes that are disproportionately large compared to neighboring field polygons\n\n\n\n\nfake field polygons\n\n\n\n\nFuture Work\nOur analysis successfully identified a significant number of potential fake field polygons, with nearly half of these cases being validated as genuinely fake. While this demonstrates the effectiveness of our approach, there is still room to improve the accuracy and reliability of the detection process. To further refine our results, future efforts will focus on:\n\nIncorporate Geographic Information:\nEnrich the dataset with geographic features such as proximity to natural landmarks (e.g., mountains, rivers) or man-made structures (e.g., urban areas, roads). These features could provide valuable context for improving the calculation of the Fréchet mean and detecting anomalies more effectively.\nImprove Outlier Detection Methods:\nLeverage advanced machine learning models, such as clustering algorithms or ensemble methods, to identify subtle patterns and relationships that may indicate fake fields. Techniques like unsupervised learning or deep anomaly detection could also be explored to improve performance." 
+ "objectID": "posts/rloop-analysis/rloop-analysis.html", + "href": "posts/rloop-analysis/rloop-analysis.html", + "title": "Identifying R-loops in AFM imaging data", + "section": "", + "text": "R-loops are three-stranded nucleic acid structures containing a DNA:RNA hybrid and an associated single DNA strand. They are normally created when DNA and RNA interact throughout the lifespan of a cell. Although their existence can be beneficial to a cell, an excessive formation of these objects is commonly associated with instability phenotypes.\nThe role of R-loop structures on genome stability is still not completely determined. The determining characteristics of harmful R-loops still remain to be defined. Their architecture is not very well-known either, and they are normally classified manually.\nIn this blog post, we will carry AFM data to the Kernell shape space and try to develop a method to detect and classify these objects using geomstats (Miolane et al. 2024). We will also talk about a rather simple method that works reasonably well.\n\n\n\n\nFig.1 Pictures of DNA fragments at the gene Airn in vitro. One of them was treated with RNase H and the other was not. The image on the bottom highlights the R-loops that were formed. (Carrasco-Salas et al. 2019)" }, { - "objectID": "posts/ribosome-tunnel-new/index.html", - "href": "posts/ribosome-tunnel-new/index.html", - "title": "3D tessellation of biomolecular cavities", + "objectID": "posts/rloop-analysis/rloop-analysis.html#context-and-motivation", + "href": "posts/rloop-analysis/rloop-analysis.html#context-and-motivation", + "title": "Identifying R-loops in AFM imaging data", "section": "", - "text": "We present a protocol to extract the surface of a biomolecular cavity for shape analysis and molecular simulations.\nWe apply and illustrate the protocol on the ribosome structure, which contains a subcompartment known as the ribosome exit tunnel or “nascent polypeptide exit tunnel” (NPET). 
More details on the tunnel features and biological importance can be found in our previous works [1,2].\nThe protocol was designed to refine the output obtained from MOLE software [3], but can be applied to reconstruct a mesh on any general point cloud. Hence, we take the point-cloud of atom positions surrounding the tunnel as a point of departure.\n\n\n\nIllustration of the ribosome exit tunnel (from Dao Duc et al., NAR 2019)\n\n\n\n\n\n\n\n\n\n\nSchematic representation of the protocol"
+ "objectID": "posts/ribosome-tunnel-new/index.html#summary-and-background",
+ "href": "posts/ribosome-tunnel-new/index.html#summary-and-background",
+ "title": "3D tessellation of biomolecular cavities",
+ "section": "Summary and background",
+ "text": "We present a protocol to extract the surface of a biomolecular cavity for shape analysis and molecular simulations.\nWe apply and illustrate the protocol on the ribosome structure, which contains a subcompartment known as the ribosome exit tunnel or “nascent polypeptide exit tunnel” (NPET). More details on the tunnel features and biological importance can be found in our previous works [1,2].\nThe protocol was designed to refine the output obtained from MOLE software [3], but can be applied to reconstruct a mesh on any general point cloud. 
2019)" }, { - "objectID": "posts/ribosome-tunnel-new/index.html#summary-and-background", - "href": "posts/ribosome-tunnel-new/index.html#summary-and-background", - "title": "3D tessellation of biomolecular cavities", + "objectID": "posts/rloop-analysis/rloop-analysis.html#preparations-before-data-analysis", + "href": "posts/rloop-analysis/rloop-analysis.html#preparations-before-data-analysis", + "title": "Identifying R-loops in AFM imaging data", + "section": "Preparations before data analysis", + "text": "Preparations before data analysis\nOriginal images will be edited to remove background noise. The figure below from the reference article tries to do that while maintaining some colors. This is useful to track the height of a particular spot.\n\n\n\n\nFig.2 A demonstration of background noise removal (Carrasco-Salas et al. 2019)\n\n\n\nI went a step further and turned these images into binary images. In other words, images we will use here will consist of black and white pixels, which correspond to 0 and 1 respectively. This makes coding a bit easier, but the height data (or the \\(z\\) coordinate) will need to be stored in a different matrix.\n\n\n\n\nFig.3 Binarized images of R-loops, for the original image see Fig. 
1\n\n\n\nWe first import the necessary libraries.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport geomstats.backend as gs\ngs.random.seed(2024)\n\nWe process our data and put it into matrices.\n\ndata_original = plt.imread(\"original-data.png\")\ndata = plt.imread(\"edited-data.png\")\n\nx_values = []\ny_values = []\nz_values = []\ndata_points = []\n\nfor i,rows in enumerate(data_original):\n for j,rgb in enumerate(rows):\n if not (rgb[0]*255 < 166 and rgb[0]*255 > 162):\n continue\n if not (rgb[1]*255 < 162 and rgb[1]*255 > 167):\n continue\n if not (rgb[2]*255 < 66 and rgb[1]*255 > 61):\n continue\n # store useful height data\n z_values.append((i,j,rgb[0], rgb[1], rgb[2]))\n\nfor i,rows in enumerate(data):\n for j,entry in enumerate(rows):\n # take white pixels only (entry is a numpy array)\n if (entry.all() == 1):\n y_values.append(j+1)\n x_values.append(i+1)\n data_points.append([i,j])" + }, + { + "objectID": "posts/rloop-analysis/rloop-analysis.html#a-primitive-approach-that-surprisingly-works", + "href": "posts/rloop-analysis/rloop-analysis.html#a-primitive-approach-that-surprisingly-works", + "title": "Identifying R-loops in AFM imaging data", + "section": "A primitive approach that surprisingly works", + "text": "A primitive approach that surprisingly works\nA way to distinguish lines from loops is to count the amount of white pixels in each column. This heavily depends on the orientation. To get a meaningful result, it is required to do this at least \\(2\\) times, one for columns and one for rows. This is not bulletproof and will sometimes give false positives. 
However, it still gives us a good idea of possible places where there is an R-loop.\n\nwhite_pixel_counts = [0 for i in range(500)]\n\ndata = plt.imread(\"data-1.png\")\n\nfor i,rows in enumerate(data):\n for j,entry in enumerate(rows):\n # count white pixels only\n if (entry.all() == 1):\n white_pixel_counts[j] += 1\n\nplt.plot(range(500), white_pixel_counts, linewidth=1, color=\"g\")\nplt.xlabel(\"columns\")\nplt.ylabel(\"white pixels\")\n\nplt.legend([\"Amount of white pixels\"])\nplt.show()\n\n\n\n\nFig.4\n\n\n\nWe can see that in Figure \(1\), the R-loops are mainly accumulated on the left side. There is a considerable amount of them on the right side as well. There are some of them around the middle, but their numbers are lower. We can see that this is clearly represented in Figure \(4\).\nWith this approach, \(2\) different white pixels in the same column will always be counted even if they are not connected at all, which gives us some false positives. To avoid this issue, we can define the following function taking the position of a white pixel as its input.\n\[ f((x,y)) = \left\lbrace \begin{array}{r l}1, & \text{if} ~~ \exists c_1,c_2,c_3,\dots c_{\gamma} \in [y-\epsilon, y+\epsilon] ~~ \ni f(x,y) = 1 \\0, & \text{otherwise}\end{array} \right.\]\n\(\epsilon\) and \(\gamma\) can be adjusted depending on the data at hand. This gives us a more precise prediction about likely places for an R-loop. In this case, choosing \(\gamma = 8\) and \(\epsilon = 10\) gives us the following graph.\n\n\n\n\nFig.5\n\n\n\nWe can see that Figures \(5\) and \(4\) are quite similar. The columns where the graph peaks are still the same, but we see a decrease in the values between these peaks, which is the expected result. This figure has fewer false positives compared to the previous one, so it is a step in the right direction."
+ }, + { + "objectID": "posts/rloop-analysis/rloop-analysis.html#an-analysis-using-the-kendall-pre-shape-space", + "href": "posts/rloop-analysis/rloop-analysis.html#an-analysis-using-the-kendall-pre-shape-space", + "title": "Identifying R-loops in AFM imaging data", + "section": "An analysis using the Kendall pre-shape space", + "text": "An analysis using the Kendall pre-shape space\nInitialize the space and the metric on it. Create a Kendall sphere using geomstats.\n\nfrom geomstats.geometry.pre_shape import PreShapeSpace, PreShapeMetric\nfrom geomstats.visualization.pre_shape import KendallSphere\n\nS_32 = PreShapeSpace(3,2)\nS_32.equip_with_group_action(\"rotations\")\nS_32.equip_with_quotient()\nmetric = PreShapeMetric(space=S_32)\nS_32.metric = metric\n\nprojected_points = S_32.projection(gs.array(data_points))\nS = KendallSphere()\nS.draw()\nS.add_points(projected_points)\nS.draw_points(alpha=0.1, color=\"green\", label=\"DNA matter\")\nS.ax.legend()\nplt.show()\n\n\n\n\n\nFig.6 White pixels projected onto the pre-shape space\n\n\n\nTaking a close look at it will reveal more details about where the points lie in the space.\n\n\n\n\nFig.7 White pixels projected onto the pre-shape space\n\n\n\nThe upper part of the curve consist of points that are in the left side of the image while the one below are closer to the middle. We see a reverse relationship between the amount of R-loops and the density of these points. This is an expected result when we consider how the Kendall pre-shape space is defined.\nA pre-shape space is a hypersphere. In our case, it has dimension \\(3\\). Hypothetically, if all of our points were placed at the vertices of a triangle of similar length, their projection to the Kendall pre-shape space would be approximately a single point. In the case of circular objects, there will be multiple pairs of points that are the same distance away from each other more than we would see if the object was a straight line. 
Therefore, we expect points forming a loop (which is a deformed circle for our purposes) to be separated from the other points. In other words, the lower-density areas in the hypersphere correspond to areas with a higher likelihood of R-loop presence.\nThe presence of more R-loops does not indicate that there will be fewer points in the corresponding area of the pre-shape space. It just means that they are further apart and more uniformly spread.\n\n\n\n\nFig.8 A zoomed-in and rotated version of Figure 7. The left side has the lowest density followed by the right side. The middle part has a higher density of points, as expected.\n\n\n\nPoints in the pre-shape space give us possible regions where we may find R-loops. However, they do not guarantee that there will be one in that location. This is evident when we look at the right end of this curve. It has a lower density of points than the left side, which is a result we did not want to see.\n\n\n\n\nFig.9 The right end of the curve in Figure 6\n\n\n\nThis happens because there are more DNA fragments on the right side with a shape similar to a half circle. 
Most of them are not loops, but they are distinct enough from the rest that the corresponding projection in the pre-shape space has a low density of points, which are separated from the rest.\nWe can also take a look at the Fréchet mean of the projected points in the pre-shape space.\n\nprojected_points = S_32.projection(gs.array(data_points))\nS = KendallSphere(coords_type=\"extrinsic\")\nS.draw()\nS.add_points(projected_points)\nS.draw_points(alpha=0.1, color=\"green\", label=\"DNA matter\")\n\nS.clear_points()\nestimator = FrechetMean(S_32)\nestimator.fit(projected_points)\nS.add_points(estimator.estimate_)\nS.draw_points(color=\"orange\", label=\"Fréchet mean\", s=150)\nS.add_points(gs.array(S.pole))\nS.draw_curve(color=\"orange\", label=\"curve from the Fréchet mean to the north pole\")\n\nS.ax.legend()\nplt.show()\n\n\n\n\n\nFig.10 Fréchet mean of the projected points\n\n\n\nThe point we find is located around the left side of the green curve, which is a result we already expected." + }, + { + "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html", + "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html", + "title": "Shape Analysis of Contractile Cells", "section": "", - "text": "We present a protocol to extract the surface of a biomolecular cavity for shape analysis and molecular simulations.\nWe apply and illustrate the protocol on the ribosome structure, which contains a subcompartment known as the ribosome exit tunnel or “nascent polypeptide exit tunnel” (NPET). More details on the tunnel features and biological importance can be found in our previous works1,2.\nThe protocol was designed to refine the output obtained from MOLE software3, but can be applied to reconstruct a mesh on any general point cloud. 
Hence, we take the point-cloud of atom positions surrounding the tunnel as a point of departure.\n\n\n\nIllustration of the ribosome exit tunnel (from Dao Duc et al., NAR 2019)\n\n\n\n\n\n\n\n\n\n\nSchematic representation of the protocol" + "text": "Capsular contracture (CC) is a debilitating complication that arises commonly amongst breast cancer patients after reconstructive breast implant surgery. CC patients suffer from aesthetic deformation, pain, and in rare cases, they may develop anaplastic large cell lymphoma (ALCL), a type of cancer of the immune system. The mechanism of CC is unknown, and there are few objective assessments of CC based on histology.\n\n\n\n\nFigure 1: Baker grade\n\nBaker grade is a subjective, clinical evaluation of the extent of CC (see Fig 1). Many researchers have measured histological properties in CC tissue samples and correlated these findings to their assigned Baker grade. It has been found that a high density of immune cells is associated with a higher Baker grade.\nThese immune cells include fibroblasts and myofibroblasts, which can distort surrounding tissues by contracting and pulling on them. The transition from the fibroblast to the myofibroblast phenotype is an important driving step in many fibrotic processes, including capsular contracture. In wound healing, the contractility of myofibroblasts is essential in facilitating tissue remodelling; however, an excess of contractile force creates a positive feedback loop, leading to the formation of pathological capsules with high density and extent of deformation.\nMyofibroblasts, considered an “activated” form of fibroblasts, are identified by the expression of alpha-smooth muscle actin (\(\alpha\)-SMA). However, this binary classification system does not capture the full range of complexities involved in the transition between these two phenotypes. 
Therefore, it is beneficial to develop a finer classification system of myofibroblasts to explain the various levels of force they can generate. One recent work uses pre-defined morphological features of cells, including perimeter and circularity, to create a continuous spectrum of myofibroblast activation (Hillsley et al. 2022).\nResearch suggests that mechanical strain induces changes in cell morphology, transforming round cells lacking stress fibers into broader, elongated shapes. We hypothesize that cell shapes influence their ability to generate forces via mechanisms of cell-matrix adhesion and cell traction. Further, we hypothesize that cell shape is directly correlated with the severity of CC by increasing contractile forces.\nIn order to test these hypotheses, we will take a 2-step approach. The first step involves statistical analysis of the correlation between cell shapes and their associated Baker grade. To do this, we collect cell images from CC samples with various Baker grades; using Geomstats, we can compute a characteristic mean cell shape for each sample. Then, we cluster these characteristic cell shapes into 4 groups and observe the extent of overlap between this classification and the Baker grade. We choose the elastic metric, with its associated geodesic distances, since it allows us not only to classify, but also to study how cell shapes deform. If we can find a correlation, the second step is then to go back to in-vitro studies of fibroblasts and answer the question: can the shapes of cells predict their disposition to develop into a highly contractile phenotype (linked to more severe CC)? I don’t have a concrete plan for this second step yet; however, it motivates this project, as it may suggest a way to predict clinical outcomes based on pre-operative patient assessment."
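The cluster-versus-Baker-grade overlap check described above can be sketched with scikit-learn. Everything here is a hypothetical stand-in for the real pipeline: the feature vectors play the role of flattened characteristic mean shapes, and k-means plus the adjusted Rand index stand in for whichever clustering and agreement measure is ultimately used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Hypothetical inputs: one feature vector per patient sample (e.g. a flattened
# characteristic mean shape), plus the Baker grade assigned to that sample.
features = rng.normal(size=(40, 6))
baker_grade = rng.integers(1, 5, size=40)  # grades 1-4

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

# Adjusted Rand index: 1.0 means the 4 shape clusters reproduce the grades exactly,
# ~0.0 means no agreement beyond chance.
overlap = adjusted_rand_score(baker_grade, labels)
print(round(overlap, 3))
```

With purely random features, as here, the score should hover near zero; a genuine shape-grade correlation would push it toward 1.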
}, { - "objectID": "posts/ribosome-tunnel-new/index.html#pointcloud-preparation-bounding-box-and-voxelization", - "href": "posts/ribosome-tunnel-new/index.html#pointcloud-preparation-bounding-box-and-voxelization", - "title": "3D tessellation of biomolecular cavities", - "section": "1. Pointcloud Preparation: Bounding Box and Voxelization", - "text": "1. Pointcloud Preparation: Bounding Box and Voxelization\n\n\n\n\n\n\natompos_to_voxel_sphere: convert a 3D coordinate into a voxelized sphere\n\n\n\n\n\n\ndef atompos_to_voxelized_sphere(center: np.ndarray, radius: int):\n \"\"\"Make sure radius reflects the size of the underlying voxel grid\"\"\"\n x0, y0, z0 = center\n\n #!------ Generate indices of a voxel cube of side 2r around the centerpoint\n x_range = slice(\n int(np.floor(x0 - radius)), \n int(np.ceil(x0 + radius)))\n y_range = slice(\n int(np.floor(y0 - radius)), \n int(np.ceil(y0 + radius)))\n z_range = slice(\n int(np.floor(z0 - radius)), \n int(np.ceil(z0 + radius)))\n\n indices = np.indices(\n (\n x_range.stop - x_range.start,\n y_range.stop - y_range.start,\n z_range.stop - z_range.start,\n )\n )\n\n indices += np.array([x_range.start,\n y_range.start,\n z_range.start])[:, np.newaxis, np.newaxis, np.newaxis ]\n indices = indices.transpose(1, 2, 3, 0)\n indices_list = list(map(tuple, indices.reshape(-1, 3)))\n\n #!------ Generate indices of a voxel cube of side 2r+2 around the centerpoint\n sphere_active_ix = []\n\n for ind in indices_list:\n x_ = ind[0]\n y_ = ind[1]\n z_ = ind[2]\n if (x_ - x0) ** 2 + (y_ - y0) ** 2 + (z_ - z0) ** 2 <= radius**2:\n sphere_active_ix.append([x_, y_, z_])\n\n return np.array(sphere_active_ix)\n\n\n\n\n\n\n\n\n\n\nindex_grid: populate a voxel grid (with sphered atoms)\n\n\n\n\n\n\ndef index_grid(expanded_sphere_voxels: np.ndarray) :\n\n def normalize_atom_coordinates(coordinates: np.ndarray)->tuple[ np.ndarray, np.ndarray ]:\n \"\"\"@param coordinates: numpy array of shape (N,3)\"\"\"\n\n C = coordinates\n mean_x = 
np.mean(C[:, 0])\n mean_y = np.mean(C[:, 1])\n mean_z = np.mean(C[:, 2])\n\n Cx = C[:, 0] - mean_x\n Cy = C[:, 1] - mean_y\n Cz = C[:, 2] - mean_z\n \n\n [dev_x, dev_y, dev_z] = [np.min(Cx), np.min(Cy), np.min(Cz)]\n\n #! shift to positive quadrant\n Cx = Cx + abs(dev_x)\n Cy = Cy + abs(dev_y)\n Cz = Cz + abs(dev_z)\n\n rescaled_coords = np.array(list(zip(Cx, Cy, Cz)))\n\n return rescaled_coords, np.array([[mean_x,mean_y,mean_z], [abs( dev_x ), abs( dev_y ), abs( dev_z )]])\n\n normalized_sphere_cords, mean_abs_vectors = normalize_atom_coordinates(expanded_sphere_voxels)\n voxel_size = 1\n\n sphere_cords_quantized = np.round(np.array(normalized_sphere_cords / voxel_size) ).astype(int)\n max_values = np.max(sphere_cords_quantized, axis=0)\n grid_dimensions = max_values + 1\n vox_grid = np.zeros(grid_dimensions)\n\n print(\"Dimension of the voxel grid is \", vox_grid.shape)\n\n vox_grid[\n sphere_cords_quantized[:, 0],\n sphere_cords_quantized[:, 1],\n sphere_cords_quantized[:, 2] ] = 1\n\n\n return ( vox_grid, grid_dimensions, mean_abs_vectors )\n\n\n\n\nBbox: There are many ways to extract a point cloud from a larger biological structure – in this case we settle for a bounding box that bounds the space between the PTC and the NPET vestibule.\n\n# \"bounding_box_atoms.npy\" is a N,3 array of atom coordinates\n\natom_centers = np.load(\"bounding_box_atoms.npy\") \n\nSphering: To make the representation of atoms slightly more physically-plausible we replace each atom-center coordinate with positions of voxels that fall within a sphere of radius \\(R\\) around the atom’s position. This is meant to represent the atom’s van der Waals radius.\nOne could model different types of atoms (\\(N\\),\\(C\\),\\(O\\),\\(H\\) etc.) with separate radii, but taking \\(R=2\\) proves a good enough compromise. 
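As a sanity check on the sphering step, here is a compact, self-contained variant of the voxel-sphere idea (a simplified stand-in for the atompos_to_voxelized_sphere listing above, assuming a unit voxel grid): the number of voxels returned should sit near the continuous sphere volume 4/3·π·2³ ≈ 33.5 for R = 2.

```python
import numpy as np

def voxel_sphere(center, radius):
    # All integer voxel indices within `radius` of `center`
    # (a simplified stand-in for the atompos_to_voxelized_sphere routine above).
    c = np.asarray(center, dtype=float)
    lo = np.floor(c - radius).astype(int)
    hi = np.ceil(c + radius).astype(int) + 1
    axes = [np.arange(l, h) for l, h in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    return grid[np.sum((grid - c) ** 2, axis=1) <= radius ** 2]

vox = voxel_sphere([10.3, 7.9, 5.1], 2)
print(len(vox))  # close to the continuous-sphere volume 4/3*pi*2**3 ~ 33.5
```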
The units are Angstrom and correspond to the coordinate system in which the structure of the ribosome is recorded.\n\nvoxel_spheres = np.array([ atompos_to_voxelized_sphere(atom, 2) for atom in atom_centers ])\n\nVoxelization & Inversion: Since we are interested in the “empty space” between the atoms, we need a way to capture it. To make this possible we discretize the space by projecting the (sphered) point cloud into a voxel grid and invert the grid.\n\n# the grid is a binary 3D-array \n# with 1s where a normalized 3D-coordinate of an atom corresponds to the cell index and 0s elsewhere\n\n# by \"normalized\" I mean that the atom coordinates are\n# temporarily moved to the origin to decrease the size of the grid (see `index_grid` method further).\ninitial_grid, grid_dims, _ = index_grid(voxel_spheres)\n\n# The grid is inverted by changing 0->1 and 1->0\n# Now the atom locations are the null voxels and the empty space is active voxels\ninverted_grid = np.asarray(np.where(initial_grid != 1)).T\n\nCompare the following representation (Inverted Point Cloud) to the first point cloud: notice that where there previously was an active voxel there is now an empty voxel, and vice versa. The tubular constellation of active voxels in the center of the bounding box on this inverted grid is the tunnel “space” we are interested in.\n\n\n\n\n\n\n\n\n\n\n\n(a) Initial bounding-box point cloud\n\n\n\n\n\n\n\n\n\n\n\n(b) Inverted point cloud\n\n\n\n\n\n\n\nFigure 1: Pointcloud inversion via a voxel grid." }, { "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#sort-labelling-data", "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#sort-labelling-data", "title": "Shape Analysis of Contractile Cells", "section": "Sort labelling data", "text": "Sort labelling data\nThe segmentation data can be exported as a file containing 2D coordinates of all pixels that are marked as borders. First, we need to identify individual cells from this data. 
We may view pixels as nodes in a graph, the problem then becomes splitting an unconnected graph into connected components. A tricky part is to process cells with overlapping/connected borders. > TO ADD: details on this algorithm.\nFrom here, a few simple bash commands allow us to import the resulting data files as a numpy array of 2D coordinates, as an acceptable input for GeomStats.\n# replace delimiters with sed\nsed -i 's/],/\\n/g' *\nsed -i 's/,/ /g' *\n\n# remove [ with sed\nsed -i 's|[[]||g' * \n\nimport sys\nfrom pathlib import Path\nimport numpy as np\nfrom decimal import Decimal\nimport matplotlib.pyplot as plt\n\n# sys.prefix = '/home/uki/Desktop/blog/posts/capsular-contracture/.venv'\n# sys.executable = '/home/uki/Desktop/blog/posts/capsular-contracture/.venv/bin/python'\nsys.path=['', '/opt/petsc/linux-c-opt/lib', '/home/uki/Desktop/blog/posts/capsular-contracture', '/usr/lib/python312.zip', '/usr/lib/python3.12', '/usr/lib/python3.12/lib-dynload', '/home/uki/Desktop/blog/posts/capsular-contracture/.venv/lib/python3.12/site-packages']\n\ndirectory = Path('/home/uki/Desktop/blog/posts/capsular-contracture/cells')\nfile_iterator = directory.iterdir()\ncells = []\n\nfor filename in file_iterator:\n with open(filename) as file:\n cell = np.loadtxt(file, dtype=int)\n cells.append(cell)\n\nprint(f\"Total number of cells : {len(cells)}\")\n\nTotal number of cells : 3\n\n\nSince the data is unordered, we need to sort the coordinates in order to visualize cell shapes.\n\ndef sort_coordinates(list_of_xy_coords):\n cx, cy = list_of_xy_coords.mean(0)\n x, y = list_of_xy_coords.T\n angles = np.arctan2(x-cx, y-cy)\n indices = np.argsort(angles)\n return list_of_xy_coords[indices]\n\n\nsorted_cells = []\n\nfor cell in cells:\n sorted_cells.append(sort_coordinates(cell))\n\n\nindex = 1\ncell_rand = cells[index]\ncell_sorted = sorted_cells[index]\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(121)\nplt.scatter(cell_rand[:, 0], cell_rand[:, 1], color='black', 
s=4)\n\nplt.plot(cell_rand[:, 0], cell_rand[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Original coordinates\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nplt.scatter(cell_sorted[:, 0], cell_sorted[:, 1], color='black', s=4)\n\nplt.plot(cell_sorted[:, 0], cell_sorted[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"sorted coordinates\")\nplt.axis(\"off\")\n\n\n\n\n\n\n\n\n\n\nOriginal work ends around here, the below is a proof of concept mock pipeline performed on 3 cells, that needs to be adapted. _______________________" }, { - "objectID": "posts/ribosome-tunnel-new/index.html#subcloud-extraction", - "href": "posts/ribosome-tunnel-new/index.html#subcloud-extraction", - "title": "3D tessellation of biomolecular cavities", - "section": "2. Subcloud Extraction", - "text": "2. Subcloud Extraction\n\n\n\n\n\n\nDBSCAN_capture\n\n\n\n\n\n\nfrom sklearn.cluster import DBSCAN\ndef DBSCAN_capture(\n ptcloud: np.ndarray,\n eps ,\n min_samples ,\n metric : str = \"euclidean\",\n): \n\n u_EPSILON = eps\n u_MIN_SAMPLES = min_samples\n u_METRIC = metric\n\n print(\"Running DBSCAN on {} points. 
eps={}, min_samples={}, distance_metric={}\"\n .format( len(ptcloud), u_EPSILON, u_MIN_SAMPLES, u_METRIC ) ) \n\n db = DBSCAN(eps=eps, min_samples=min_samples, metric=metric).fit(ptcloud) # <-- this is all you need\n\n labels = db.labels_\n\n CLUSTERS_CONTAINER = {}\n for point, label in zip(ptcloud, labels):\n if label not in CLUSTERS_CONTAINER:\n CLUSTERS_CONTAINER[label] = []\n CLUSTERS_CONTAINER[label].append(point)\n\n CLUSTERS_CONTAINER = dict(sorted(CLUSTERS_CONTAINER.items()))\n return db, CLUSTERS_CONTAINER\n\n\n\n\n\n\n\n\n\n\nDBSCAN_pick_largest_cluster\n\n\n\n\n\n\nfrom sklearn.cluster import DBSCAN\ndef DBSCAN_pick_largest_cluster(clusters_container:dict[int,list])->np.ndarray:\n DBSCAN_CLUSTER_ID = 0\n for k, v in clusters_container.items():\n if int(k) == -1:\n continue\n elif len(v) > len(clusters_container[DBSCAN_CLUSTER_ID]):\n DBSCAN_CLUSTER_ID = int(k)\n return np.array(clusters_container[DBSCAN_CLUSTER_ID])\n\n\n\n\nClustering: Having obtained a voxelized representation of the interatomic spaces inside and around the NPET our task is now to extract only the space that corresponds to the NPET. 
We use DBSCAN.\nscikit-learn’s implementation of DBSCAN conveniently lets us retrieve the points from the largest cluster only, which corresponds to the active voxels of the NPET space (if we eyeballed our DBSCAN parameters well).\n\nfrom sklearn.cluster import DBSCAN\n\n_u_EPSILON, _u_MIN_SAMPLES, _u_METRIC = 5.5, 600, 'euclidean'\n\n_, clusters_container = DBSCAN_capture(inverted_grid, _u_EPSILON, _u_MIN_SAMPLES, _u_METRIC ) \nlargest_cluster = DBSCAN_pick_largest_cluster(clusters_container)\n\n\n\n\n\n\n\nDBSCAN Parameters and grid size.\n\n\n\n\n\nOur 1Å-side grid just happens to be granular enough to accommodate a “correct” separation of clusters for some empirically established values of min_nbrs and epsilon (DBSCAN parameters), where the largest cluster captures the tunnel space.\nA possible issue here is “extraneous” clusters merging into the cluster of interest and thereby corrupting its shape. In general this occurs when there are clusters that are close enough to the main one (within epsilon, warranting a merge) and simultaneously dense enough that they fulfill the min_nbrs parameter. 
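The capture-then-pick-largest pattern can be exercised end to end on synthetic data. The blob below is a toy stand-in for the inverted voxel grid, and the eps/min_samples values are illustrative, not the tuned values used in the post:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy stand-in for the inverted voxel grid: one dense "tunnel" blob plus sparse noise
tunnel = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
noise = rng.uniform(-20.0, 20.0, size=(60, 3))
points = np.vstack([tunnel, noise])

labels = DBSCAN(eps=1.0, min_samples=10).fit(points).labels_

# Keep the largest non-noise cluster (DBSCAN marks noise with label -1)
ids, counts = np.unique(labels[labels != -1], return_counts=True)
largest = points[labels == ids[np.argmax(counts)]]
print(len(largest), "points in the largest cluster")
```

The sparse uniform points end up labeled -1, so the largest cluster recovers essentially the whole blob, mirroring how the tunnel space is isolated from the inverted grid.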
Hence it might be challenging to find the combination of min_nbrs and epsilon that is sensitive enough to capture the main cluster completely and yet discriminating enough to not subsume any adjacent clusters.\nIn theory, a finer voxel grid (finer relative to the initial coordinates of the general point cloud; sub-angstrom in our case) would make finding the combination of parameters specific to the dataset easier: given that the atom-sphere would be represented by a proportionally larger number of voxels, the Euclidean distance calculation between two voxels would be less sensitive to the change in epsilon.\nPartitioning the voxel grid further would come at a cost:\n\nyou would need to rewrite the sphering method for atoms (to account for the new voxel size)\nthe computational cost would increase dramatically, and the dataset could conceivably stop fitting into memory altogether.\n\n\n\n\n\n\n\nClusters identified by DBSCAN on the inverted index grid. The largest cluster corresponds to the tunnel space.\n\n\n\n\n\n\n\n\nSubcloud refinement\n\n\n\n\n\nI found that this first pass of DBSCAN (eps=\(5.5\), min_nbrs=\(600\)) successfully identifies the largest cluster with the tunnel but generally happens to be conservative in the number of points that are merged into it. That is, there are still redundant points in this cluster that would make the eventual surface reconstruction spatially overlap with the rRNA and proteins. To “sharpen” this cluster we apply DBSCAN only to its sub-pointcloud and push the eps distance down to \(3\) and min_nbrs to \(123\) (again, “empirically established” values), which happens to be about the lowest parameter values at which any clusters form. 
This sharpened cluster is what the tesselation (surface reconstruction) will be performed on.\n\n\n\n\n\n\n\n\n\n\n\n(a) Largest DBSCAN cluster (trimmed from the vestibule side).\n\n\n\n\n\n\n\n\n\n\n\n(b) Cluster refinement: DBSCAN{e=3,mn=123} result (marine blue) on the largest cluster of DBSCAN{e=5.5,mn=600} (gray)\n\n\n\n\n\n\n\nFigure 2: Second pass of DBSCAN sharpens the cluster to peel off the outer layer of redundant points." + "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#interpolation-and-removing-duplicate-sample-points", + "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#interpolation-and-removing-duplicate-sample-points", + "title": "Shape Analysis of Contractile Cells", + "section": "Interpolation and removing duplicate sample points", + "text": "Interpolation and removing duplicate sample points\n\nimport geomstats.backend as gs\nfrom common import *\nimport random\nimport os\nimport scipy.stats as stats\nfrom sklearn import manifold\n\ngs.random.seed(2024)\n\n\ndef interpolate(curve, nb_points):\n \"\"\"Interpolate a discrete curve with nb_points from a discrete curve.\n\n Returns\n -------\n interpolation : discrete curve with nb_points points\n \"\"\"\n old_length = curve.shape[0]\n interpolation = gs.zeros((nb_points, 2))\n incr = old_length / nb_points\n pos = 0\n for i in range(nb_points):\n index = int(gs.floor(pos))\n interpolation[i] = curve[index] + (pos - index) * (\n curve[(index + 1) % old_length] - curve[index]\n )\n pos += incr\n return interpolation\n\n\nk_sampling_points = 2000\n\n\nindex = 2\ncell_rand = sorted_cells[index]\ncell_interpolation = interpolate(cell_rand, k_sampling_points)\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(121)\nplt.scatter(cell_rand[:, 0], cell_rand[:, 1], color='black', s=4)\n\nplt.plot(cell_rand[:, 0], cell_rand[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Original curve ({len(cell_rand)} 
points)\")\nplt.axis(\"off\")\n\nfig.add_subplot(122)\nplt.scatter(cell_interpolation[:, 0], cell_interpolation[:, 1], color='black', s=4)\n\nplt.plot(cell_interpolation[:, 0], cell_interpolation[:, 1])\nplt.axis(\"equal\")\nplt.title(f\"Interpolated curve ({k_sampling_points} points)\")\nplt.axis(\"off\")\n\n(np.float64(810.1893750000002),\n np.float64(850.848125),\n np.float64(18.650075000000008),\n np.float64(48.34842499999986))\n\n\n\n\n\n\n\n\n\n\ndef preprocess(curve, tol=1e-10):\n \"\"\"Preprocess curve to ensure that there are no consecutive duplicate points.\n\n Returns\n -------\n curve : discrete curve\n \"\"\"\n\n dist = curve[1:] - curve[:-1]\n dist_norm = np.sqrt(np.sum(np.square(dist), axis=1))\n\n if np.any( dist_norm < tol ):\n for i in range(len(curve)-1):\n if np.sqrt(np.sum(np.square(curve[i+1] - curve[i]), axis=0)) < tol:\n curve[i+1] = (curve[i] + curve[i+2]) / 2\n\n return curve\n\n\ninterpolated_cells = []\n\nfor cell in sorted_cells:\n interpolated_cells.append(preprocess(interpolate(cell, k_sampling_points)))" }, { - "objectID": "posts/ribosome-tunnel-new/index.html#tessellation", - "href": "posts/ribosome-tunnel-new/index.html#tessellation", - "title": "3D tessellation of biomolecular cavities", - "section": "3. Tessellation", - "text": "3. 
Tessellation\n\n\n\n\n\n\nptcloud_convex_hull_points\n\n\n\n\n\nSurface points can be extracted by creating an alpha shape over the point cloud and taking only the points that belong to the alpha surface.\n\nimport pyvista as pv\nimport open3d as o3d\nimport numpy as np\n\ndef ptcloud_convex_hull_points(pointcloud: np.ndarray, ALPHA:float, TOLERANCE:float) -> np.ndarray:\n assert pointcloud is not None\n cloud = pv.PolyData(pointcloud)\n grid = cloud.delaunay_3d(alpha=ALPHA, tol=TOLERANCE, offset=2, progress_bar=True)\n convex_hull = grid.extract_surface().cast_to_pointset()\n return convex_hull.points\n\nOne could be content with the alpha shape representation of the NPET geometry and stop here, but it’s easy to notice that the vertices of the polygon (red dots) are distributed unevenly over the surface. This is likely to introduce artifacts and instabilities into further simulations.\n\n\n\n\n\n\n\n\n\n\n\n(a) Alpha-shape over the pointcloud\n\n\n\n\n\n\n\n\n\n\n\n(b) Surface points of the point cloud\n\n\n\n\n\n\n\nFigure 3: Alpha shape provides a way to identify surface points.\n\n\n\n\n\n\n\n\n\n\n\n\nestimate_normals\n\n\n\n\n\nNormal estimation is done via rolling a tangent plane over the surface points.\n\nimport pyvista as pv\nimport open3d as o3d\nimport numpy as np\n\ndef estimate_normals(convex_hull_surface_pts: np.ndarray, kdtree_radius=None, kdtree_max_nn=None, correction_tangent_planes_n=None): \n pcd = o3d.geometry.PointCloud()\n pcd.points = o3d.utility.Vector3dVector(convex_hull_surface_pts)\n\n pcd.estimate_normals(search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=kdtree_radius, max_nn=kdtree_max_nn) )\n pcd.orient_normals_consistent_tangent_plane(k=correction_tangent_planes_n)\n\n return pcd\n\n\n\n\nNormals’ orientations are depicted as vectors (black) on each datapoint.\n\n\n\n\n\n\n\n\n\n\n\napply_poisson_recon\n\n\n\n\n\nThe source is available at https://github.com/mkazhdan/PoissonRecon. 
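For readers without pyvista or open3d at hand, the surface-point extraction step can be approximated with SciPy's convex hull. This is only a rough stand-in: a convex hull behaves like the alpha shape in the limit of large alpha, so it ignores the concavities that matter for the real tunnel, but it demonstrates the idea of keeping only the boundary points of a cloud.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
# Toy point cloud filling a unit ball: only a thin shell should count as "surface"
direction = rng.normal(size=(2000, 3))
direction /= np.linalg.norm(direction, axis=1, keepdims=True)
cloud = direction * rng.uniform(0.0, 1.0, size=(2000, 1)) ** (1.0 / 3.0)

hull = ConvexHull(cloud)
surface_pts = cloud[hull.vertices]  # analogous to keeping only alpha-surface points
print(len(surface_pts), "surface points out of", len(cloud))
```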
For programmability we connect the binary to the pipeline by wrapping it in a Python subprocess, but one can of course use the binary directly.\nThe output of the binary is a binary .ply (Stanford Triangle Format) file. For purposes of distribution we also produce an ascii-encoded version of this .ply file side-by-side: some geometry packages are only able to parse the ascii version.\n\ndef apply_poisson_reconstruction(surf_estimated_ptcloud_path: str, recon_depth:int=6, recon_pt_weight:int=3):\n import subprocess\n import plyfile\n # The documentation can be found at https://www.cs.jhu.edu/~misha/Code/PoissonRecon/Version16.04/ in \"PoissonRecon\" binary\n # Write the reconstruction next to the input file\n output_path = surf_estimated_ptcloud_path.split(\".\")[0] + \"_poisson_recon.ply\"\n command = [\n POISSON_RECON_BIN,\n \"--in\",\n surf_estimated_ptcloud_path,\n \"--out\",\n output_path,\n \"--depth\",\n str(recon_depth),\n \"--pointWeight\",\n str(recon_pt_weight),\n \"--threads\",\n \"8\"\n ]\n process = subprocess.run(command, capture_output=True, text=True)\n if process.returncode == 0:\n print(\">>PoissonRecon executed successfully.\")\n print(\">>Wrote {}\".format(output_path))\n # Convert the plyfile to ascii\n data = plyfile.PlyData.read(output_path)\n data.text = True\n ascii_duplicate = output_path.split(\".\")[0] + \"_ascii.ply\"\n data.write(ascii_duplicate)\n print(\">>Wrote {}\".format(ascii_duplicate))\n else:\n print(\">>Error:\", process.stderr)\n\n\n\n\nThe final NPET surface reconstruction\n\n\n\n\n\nNow, having refined the largest DBSCAN cluster, we have a pointcloud which faithfully represents the tunnel geometry. 
To create a watertight mesh from this point cloud we need to prepare the dataset:\n\nretrieve only the “surface” points from the pointcloud\nestimate normals on the surface points (establish data orientation)\n\n\nd3d_alpha, d3d_tol = 2, 1\n\nsurface_pts = ptcloud_convex_hull_points(coordinates_in_the_original_frame, d3d_alpha,d3d_tol)\npointcloud = estimate_normals(surface_pts, kdtree_radius=10, kdtree_max_nn=15, correction_tangent_planes_n=10)\n\nThe dataset is now ready for surface reconstruction. We reach for Poisson surface reconstruction4 by Kazhdan and Hoppe, a de facto standard in the field.\n\nPR_depth , PR_ptweight = 6, 3\napply_poisson_recon(pointcloud, recon_depth=PR_depth, recon_pt_weight=PR_ptweight)" + "objectID": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#alignment", + "href": "posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html#alignment", + "title": "Shape Analysis of Contractile Cells", + "section": "Alignment", + "text": "Alignment\n\nfrom geomstats.geometry.pre_shape import PreShapeSpace\n\nAMBIENT_DIM = 2\n\nPRESHAPE_SPACE = PreShapeSpace(ambient_dim=AMBIENT_DIM, k_landmarks=k_sampling_points)\n\nPRESHAPE_SPACE.equip_with_group_action(\"rotations\")\nPRESHAPE_SPACE.equip_with_quotient()\n\n\ndef exhaustive_align(curve, base_curve):\n \"\"\"Align curve to base_curve to minimize the L² distance.\n\n Returns\n -------\n aligned_curve : discrete curve\n \"\"\"\n nb_sampling = len(curve)\n distances = gs.zeros(nb_sampling)\n base_curve = gs.array(base_curve)\n for shift in range(nb_sampling):\n reparametrized = [curve[(i + shift) % nb_sampling] for i in range(nb_sampling)]\n aligned = PRESHAPE_SPACE.fiber_bundle.align(\n point=gs.array(reparametrized), base_point=base_curve\n )\n distances[shift] = PRESHAPE_SPACE.embedding_space.metric.norm(\n gs.array(aligned) - gs.array(base_curve)\n )\n shift_min = gs.argmin(distances)\n reparametrized_min = [\n curve[(i + shift_min) % nb_sampling] for i in range(nb_sampling)\n ]\n 
aligned_curve = PRESHAPE_SPACE.fiber_bundle.align(\n point=gs.array(reparametrized_min), base_point=base_curve\n )\n return aligned_curve\n\n\naligned_cells = []\nBASE_CURVE = interpolated_cells[0]\n\nfor cell in interpolated_cells:\n aligned_cells.append(exhaustive_align(cell, BASE_CURVE))\n\n\nindex = 1\nunaligned_cell = interpolated_cells[index]\naligned_cell = exhaustive_align(unaligned_cell, BASE_CURVE)\n\nfig = plt.figure(figsize=(15, 5))\n\nfig.add_subplot(131)\nplt.plot(BASE_CURVE[:, 0], BASE_CURVE[:, 1])\nplt.plot(BASE_CURVE[0, 0], BASE_CURVE[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Reference curve\")\n\nfig.add_subplot(132)\nplt.plot(unaligned_cell[:, 0], unaligned_cell[:, 1])\nplt.plot(unaligned_cell[0, 0], unaligned_cell[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Unaligned curve\")\n\nfig.add_subplot(133)\nplt.plot(aligned_cell[:, 0], aligned_cell[:, 1])\nplt.plot(aligned_cell[0, 0], aligned_cell[0, 1], \"ro\")\nplt.axis(\"equal\")\nplt.title(\"Aligned curve\")\n\nText(0.5, 1.0, 'Aligned curve')" }, { - "objectID": "posts/ribosome-tunnel-new/index.html#result", - "href": "posts/ribosome-tunnel-new/index.html#result", - "title": "3D tessellation of biomolecular cavities", - "section": "Result", - "text": "Result\nWhat you are left with is a smooth polygonal mesh in the .ply format. Below is the illustration of the fidelity of the representation. Folds and depressions can clearly be seen engendered by three proteins surrounding parts of the tunnel (uL22 yellow, uL4 light blue and eL39 magenta). 
rRNA is not shown.6\n\n\n\nThe NPET mesh surrounded by three ribosome proteins" }, { "objectID": "posts/Neural-Manifold/index.html", "href": "posts/Neural-Manifold/index.html", "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", "section": "", "text": "Seeing, hearing, touching – every moment, our brain receives numerous sensory inputs. How does it organize this wealth of data and extract relevant information? 
We know that the brain forms a coherent neural representation of the external world called the cognitive map (Tolman (1948)), formed by the combined firing activity of neurons in the hippocampal formation. For example, place cells are neurons that fire when a rat is at a particular location (Moser, Kropff, and Moser (2008)). Together, the activity of hundreds of these place cells can be modeled as a continuous surface - a ‘manifold’ - the location on which is analogous to the rat’s location in physical space; the rat is indeed creating a cognitive map. Specifically, the hippocampus plays a key role in this process by using path integration to keep track of an animal’s position through the integration of various idiothetic cues (self-motion signals), such as optic flow, vestibular inputs, and proprioception. Manifold learning has emerged as a powerful technique for mapping complex, high-dimensional neural data onto lower-dimensional geometric representations (Mitchell-Heggs et al. (2023), Schneider, Lee, and Mathis (2023), Chaudhuri et al. (2019)). To date, it has not been feasible to learn manifolds ‘online’, i.e. while the experiment is in progress. Doing so would allow ‘closed-loop’ experiments, where we can provide feedback to the animal based on its internal representation, and thereby examine how these representations are created and maintained in the brain.\nThe question then arises: Can we decode important navigational behavioural variables during an experiment through manifold learning? And further, can we learn these manifolds online? This blog will focus on experiments conducted in “Control and recalibration of path integration in place cells using optic flow” (Madhav et al. (2024)) and “Recalibration of path integration in hippocampal place cells” (Jayakumar et al. (2019))."
+
  },
  {
    "objectID": "posts/morphology/proposal.html#background",
    "href": "posts/morphology/proposal.html#background",
    "title": "Exploring cell shape dynamics dependency on the cell migration",
    "section": "",
    "text": "Cell morphology is an emerging field of biological research that examines the shape, size, and internal structure of cells to describe their state and the processes occurring within them. Today, more and more scientists across the world are investigating visible cellular transformations to predict cellular phenotypes. This research has significant practical implications: understanding specific cellular features characteristic of certain diseases, such as cancer, could lead to new approaches for early detection and classification.\nIn this work, we will explore aspects of cell motility by analyzing the changing shapes of migrating cells. As a cell moves through space, it reorganizes its membrane, cytosol, and cytoskeletal structures (Mogilner and Oster 1996). According to current understanding, actin polymerization causes protrusions at the leading edge of a cell, forming specific structures known as lamellipodia and filopodia. Elongation of cells in the direction of movement is also reported. These changes can be observed during experiments."
  },
  {
    "objectID": "posts/morphology/proposal.html#goals",
    "href": "posts/morphology/proposal.html#goals",
    "title": "Exploring cell shape dynamics dependency on the cell migration",
    "section": "Goals",
    "text": "Goals\nOur goal is to perform a differential geometry analysis of cellular shape curves to explore the correlation between shape differences and spatial displacement. Using the Riemann elastic metric (Li et al. 
2023):\n\\[\ng_c^{a, b}(h, k) = a^2 \\int_{[0,1]} \\langle D_s h, N \\rangle \\langle D_s k, N \\rangle \\, ds\n+ b^2 \\int_{[0,1]} \\langle D_s h, T \\rangle \\langle D_s k, T \\rangle \\, ds\n\\]\nwe can estimate the geodesic distance between two cellular boundary curves to mathematically describe how the cell shape changes over time. To implement this algorithm, we will use the Python Geomstats package." + }, + { + "objectID": "posts/morphology/proposal.html#dataset", + "href": "posts/morphology/proposal.html#dataset", + "title": "Exploring cell shape dynamics dependency on the cell migration", + "section": "Dataset", + "text": "Dataset\nThis dataset contains real cell contours obtained via fluorescent microscopy in Professor Prasad’s lab, segmented by Clément Soubrier.\n\n204 directories:\nEach directory is named cell_*, representing an individual cell.\nFrames:\nSubdirectories inside each cell are named frame_*, capturing different time points for that cell.\n\n\nNumPy Array Objects in Each Frame\n\ncentroid.npy: Stores the coordinates of the cell’s centroid.\n\noutline.npy: Contains segmented points as Cartesian coordinates.\n\ntime.npy: Timestamp of the frame.\n\n\n\nStructure\n├── cell_i\n│ ├── frame_j\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ ├── frame_k\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ └── ...\n├── cell_l\n│ ├── frame_m\n│ │ ├── centroid.npy\n│ │ ├── outline.npy\n│ │ └── time.npy\n│ └── ...\n└── ..." 
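The per-frame layout described above can be read with a short helper. This is a minimal sketch: the function name `load_frame` and the `cells/` root are illustrative, not part of the original post.

```python
from pathlib import Path

import numpy as np


def load_frame(root, cell, frame):
    """Load one frame of one cell from the cell_*/frame_* layout described above."""
    d = Path(root) / f"cell_{cell}" / f"frame_{frame}"
    return {
        "time": np.load(d / "time.npy"),          # timestamp of the frame
        "outline": np.load(d / "outline.npy"),    # (n_points, 2) boundary coordinates
        "centroid": np.load(d / "centroid.npy"),  # (2,) centroid coordinates
    }
```

Returning a dictionary keeps the three arrays of a frame together, which makes iterating over `frame_*` subdirectories straightforward.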
+
  },
  {
    "objectID": "posts/morphology/proposal.html#single-cell-dynamics",
    "href": "posts/morphology/proposal.html#single-cell-dynamics",
    "title": "Exploring cell shape dynamics dependency on the cell migration",
    "section": "Single cell dynamics",
    "text": "Single cell dynamics\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport os\n\nfig, ax = plt.subplots(figsize=(10, 10), layout='constrained')\n\nN = 15\n\nnumber_of_frames = sum(os.path.isdir(os.path.join(f\"cells/cell_{N}\", entry)) for entry in os.listdir(f\"cells/cell_{N}\"))\ncolors = plt.cm.tab20(np.linspace(0, 1, number_of_frames))\nfor i in range(1, number_of_frames + 1):\n    time = np.load(f'cells/cell_{N}/frame_{i}/time.npy')\n    border = np.load(f'cells/cell_{N}/frame_{i}/outline.npy')\n    centroid = np.load(f'cells/cell_{N}/frame_{i}/centroid.npy')\n\n    color = colors[i - 1]\n\n    ax.plot(border[:, 0], border[:, 1], label=time, color=color)\n    ax.scatter(centroid[0], centroid[1], color=color)\nax.legend()\n\nplt.savefig(f\"single_cell_{N}.png\", dpi=300, bbox_inches='tight')\n\n\n\nThe cell shape at different time points"
  },
  {
    "objectID": "posts/morphology/proposal.html#references",
    "href": "posts/morphology/proposal.html#references",
    "title": "Exploring cell shape dynamics dependency on the cell migration",
    "section": "References",
    "text": "References\n\n\nLi, Wanxin, Ashok Prasad, Nina Miolane, and Khanh Dao Duc. 2023. “Using a Riemannian Elastic Metric for Statistical Analysis of Tumor Cell Shape Heterogeneity.” In Geometric Science of Information, edited by Frank Nielsen and Frédéric Barbaresco, 583–92. Cham: Springer Nature Switzerland.\n\n\nMogilner, A., and G. Oster. 1996. “Cell Motility Driven by Actin Polymerization.” Biophysical Journal 71 (6): 3030–45. https://doi.org/10.1016/s0006-3495(96)79496-1."
+
  },
  {
    "objectID": "posts/Neural-Manifold/index.html",
    "href": "posts/Neural-Manifold/index.html",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "",
    "text": "Seeing, hearing, touching – every moment, our brain receives numerous sensory inputs. How does it organize this wealth of data and extract relevant information? We know that the brain forms a coherent neural representation of the external world called the cognitive map (Tolman (1948)), formed by the combined firing activity of neurons in the hippocampal formation. For example, place cells are neurons that fire when a rat is at a particular location (Moser, Kropff, and Moser (2008)). Together, the activity of hundreds of these place cells can be modeled as a continuous surface - a ‘manifold’ - the location on which is analogous to the rat’s location in physical space; the rat is indeed creating a cognitive map. Specifically, the hippocampus plays a key role in this process by using path integration to keep track of an animal’s position through the integration of various idiothetic cues (self-motion signals), such as optic flow, vestibular inputs, and proprioception. Manifold learning has emerged as a powerful technique for mapping complex, high-dimensional neural data onto lower-dimensional geometric representations (Mitchell-Heggs et al. (2023), Schneider, Lee, and Mathis (2023), Chaudhuri et al. (2019)). To date, it has not been feasible to learn manifolds ‘online’, i.e. while the experiment is in progress. Doing so would allow ‘closed-loop’ experiments, where we can provide feedback to the animal based on its internal representation, and thereby examine how these representations are created and maintained in the brain.\nThe question then arises: Can we decode important navigational behavioural variables during an experiment through manifold learning? And further, can we learn these manifolds online? 
This blog will focus on experiments conducted in “Control and recalibration of path integration in place cells using optic flow” (Madhav et al. (2024)) and “Recalibration of path integration in hippocampal place cells” (Jayakumar et al. (2019))."
  },
  {
    "objectID": "posts/biology/index.html",
    "href": "posts/biology/index.html",
    "title": "Embryonic cell size asymmetry analysis",
    "section": "",
    "text": "Introduction and motivation\nCells propagate via cell division. In multicellular organisms, certain cells divide asymmetrically, which results in generating cell diversity (Jan and Jan 1998). There are several cues for asymmetric cell division, including cell polarity establishment, spindle positioning, division site specification (Li 2013), and signals from neighboring cells (Horvitz and Herskowitz 1992). These cues allow multicellular organisms to develop correctly, and their misregulation can lead to disorders from developmental defects to cancer.\nIn Caenorhabditis elegans four-cell stage embryos, the endomesodermal precursor (EMS) cell gives rise to mesoderm and endoderm cells. For this asymmetric division, the EMS cell receives signals from a neighboring P2 cell (Rocheleau et al. 1997). In response to them, the daughter cell closest to the P2 cell develops into endoderm, and its sister develops into mesoderm (Goldstein 1992) (Figure 1). In situations where the signal is absent, both EMS daughters develop into mesoderm, and the embryo is non-viable.\n\n\n\nFigure 1: EMS cell division. EMS cell division in a four-cell C. elegans embryo. The EMS cell receives signals from P2 cell to develop into endoderm (gut) and mesoderm (muscle) precursors. Adapted from (Goldstein 1992)\n\n\nIn the EMS division, daughter cells appear to adopt different shapes (Caroti et al. 2021). Additionally, the shape of daughter cells changes if Wnt signaling is absent. It is possible that these differences are correlated with the fate of the daughter cells. If gradual, these differences could also be used to identify the strength of cell response to external cues, such as Wnt signaling. 
It is also possible that before the birth of the E and MS cells, the shape of the EMS cell changes in response to external cues. Analysis of the EMS cell shape in different contexts could therefore prove to be a useful tool for understanding differentiation and development. Finally, developing quantitative size and shape analysis tools can reduce human bias and help speed up the experimental procedures in both understanding EMS and its daughters’ fates.\n\n\n\nFigure 2: E and MS cells adopt different shapes. Top row, EMS (parent) cell in three embryos. Bottom row, purple: MS cell, blue: E cell. Taken from (Caroti et al. 2021) - Figure 2 E.\n\n\nThere are several different methods to analyze shapes and sizes. Given that different analysis tools yield different results (Dryden and Mardia 2016, 37), it is helpful to consider a few before finding the best tool for subsequent use.\n\n\nCentroid size analysis\nCentroid size is a measure of the size of a shape on Cartesian coordinates and is defined (Dryden and Mardia 2016, 34) as:\n\\[\nS(X) = \\sqrt{\\sum_{i=1}^{k} \\sum_{j=1}^{m}\\left( X_{ij} - \\bar{X}_j \\right)^2}, \\quad X \\in \\mathbb{R}^{k \\times m}\n\\]\nwhere \\(X_{ij}\\) is a matrix entry and \\(\\bar{X}_j\\) is the mean of the j-th dimension of the matrix.\nCentroid size is a simple way to estimate the size of a shape and could help to easily quantify differences between different cell groups in a sample, for example, EMS or E cell with and without a signal from the P2 cell.\n\n\nEuclidean distance matrix analysis (EDMA)\nEDMA is a version of multidimensional scaling analysis that accounts for a bias in landmark distribution (Dryden and Mardia 2016, 357–60). This analysis focuses on distances between landmarks and can handle missing landmarks (Lele 1993). 
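The centroid size \(S(X)\) defined above is straightforward to compute. A minimal NumPy sketch, using a toy landmark matrix rather than embryo data:

```python
import numpy as np


def centroid_size(X):
    """Centroid size S(X): square root of the summed squared deviations of the
    k landmark rows of X from their centroid (column-wise mean)."""
    X = np.asarray(X, dtype=float)
    return float(np.sqrt(((X - X.mean(axis=0)) ** 2).sum()))


# Four landmarks at unit distance from their centroid (0, 0) -> S(X) = sqrt(4) = 2
square = [[1, 0], [-1, 0], [0, 1], [0, -1]]
print(centroid_size(square))  # 2.0
```

Applied to two `outline.npy`-style arrays, the same function would give a first, shape-blind comparison of cell sizes.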
This method corrects for landmark distribution biases and can be used to test for shape differences (EDMA-I and EDMA-II).\nWhile centroids are useful in estimating the size of a shape, EDMA can be helpful in finding differences in shape itself. There are a number of other tools to estimate shape differences, including the square root velocity (SRV) function - a landmark-independent tool for analysing differences in shape and curvature (Srivastava et al. 2011). Independence from landmarks might result in more precise shape comparisons; however, it renders the analysis computationally intensive. Analysis of the dividing EMS and E/MS cells can be performed using any of these methods, the easiest being centroid size estimation, which does not account for shape differences. Incorporating more complex analysis tools would allow a better understanding of how the cell shape changes. Additionally, it could be extrapolated to more complex analyses, such as time series or 3D images. These tools could help further understand what affects EMS daughter cells and whether their shape is linked to their fate.\n\n\n\n\n\nReferences\n\nCaroti, Francesca, Wim Thiels, Michiel Vanslambrouck, and Rob Jelier. 2021. “Wnt Signaling Induces Asymmetric Dynamics in the Actomyosin Cortex of the C. Elegans Endomesodermal Precursor Cell.” Front Cell Dev Biol 9 (September): 702741. https://doi.org/10.3389/fcell.2021.702741.\n\n\nDryden, Ian L., and Kanti V. Mardia. 2016. Statistical Shape Analysis, with Applications in R. 1st ed. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9781119072492.\n\n\nGoldstein, Bob. 1992. “Induction of Gut in Caenorhabditis Elegans Embryos.” Nature 357 (6375): 255–57. https://doi.org/10.1038/357255a0.\n\n\nHorvitz, H. Robert, and Ira Herskowitz. 1992. “Mechanisms of Asymmetric Cell Division: Two Bs or Not Two Bs, That Is the Question.” Cell 68 (2): 237–55. https://doi.org/10.1016/0092-8674(92)90468-R.\n\n\nJan, Yuh Nung, and Lily Yeh Jan. 1998. 
“Asymmetric Cell Division.” Nature 392 (6678): 775–78. https://doi.org/10.1038/33854.\n\n\nLele, Subhash. 1993. “Euclidean Distance Matrix Analysis (EDMA): Estimation of Mean Form and Mean Form Difference.” Math Geol 25 (5): 573–602. https://doi.org/10.1007/BF00890247.\n\n\nLi, Rong. 2013. “The Art of Choreographing Asymmetric Cell Division.” Developmental Cell 25 (5): 439–50. https://doi.org/10.1016/j.devcel.2013.05.003.\n\n\nRocheleau, Christian E, William D Downs, Rueyling Lin, Claudia Wittmann, Yanxia Bei, Yoon-Hee Cha, Mussa Ali, James R Priess, and Craig C Mello. 1997. “Wnt Signaling and an APC-Related Gene Specify Endoderm in Early C. Elegans Embryos.” Cell 90 (4): 707–16. https://doi.org/10.1016/S0092-8674(00)80531-0.\n\n\nSrivastava, A, E Klassen, S H Joshi, and I H Jermyn. 2011. “Shape Analysis of Elastic Curves in Euclidean Spaces.” IEEE Trans. Pattern Anal. Mach. Intell. 33 (7): 1415–28. https://doi.org/10.1109/TPAMI.2010.184." }, { - "objectID": "posts/Neural-Manifold/index.html#experimental-setup", - "href": "posts/Neural-Manifold/index.html#experimental-setup", - "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA", - "section": "Experimental Setup", - "text": "Experimental Setup\nIn (Madhav et al. (2024) and Jayakumar et al. (2019)), Dr. Madhav and colleagues designed an experimental setup to investigate how optic flow cues influence hippocampal place cells in freely moving rats. Place cells are neurons that fire when an animal is in a specific location.\nLet’s take an example to better understand: imagine a rat moving along a horizontal linear track. For simplicity let’s say the rat has only 3 place cell neurons. In this case, Neuron 1 would fire when the rat is at the very left of the track, Neuron 2 would fire when the rat is in the middle of the track, and Neuron 3 would fire at the very right of the track. 
As the rat moves along the track, the specific place cells corresponding to each location become activated, helping the rat to construct an internal cognitive map of its environment.\n\nThe Dome Apparatus\nIn the experiment, rats ran on a circular platform surrounded by a hemispherical projection surface called the Dome.\n\n\n\n\nFig. 1 - Virtual reality Dome apparatus. Rats ran on a circular table surrounded by a hemispherical shell. A projector image reflects off a hemispherical mirror onto the inner surface of the shell.\n\n\n\nThe Dome projected moving stripes that provided controlled optic flow cues. The movement of the stripes was tied to the rats’ movement, with the stripe gain (\\(\\mathcal{S}\\)) determining the relationship between the rat’s speed and the stripes’ speed.\n\n\\(\\mathcal{S}\\) = 1: Stripes are stationary relative to the lab frame, meaning the rat is not receiving conflicting cues.\n\\(\\mathcal{S}\\) > 1: Stripes move opposite to the rat’s direction, causing the rat to perceive itself as moving faster than it is.\n\\(\\mathcal{S}\\) < 1: Stripes move in the same direction but slower than the rat, causing the rat to perceive itself as moving slower than it is.\n\nElectrodes were inserted into the CA1 of the hippocampus of male Long-Evans rats and spike rate neural activity was recorded during the experiment. Dr. Madhav and colleagues introduce a value \\(\\mathcal{H}\\), called the Hippocampal Gain. It is defined as the relationship between the rat’s physical movement and the updating of its position on the internal hippocampal map. At a high level, we can think of it as the rate at which the rat “perceives” itself to be moving because of the conflicting visual cues. 
Specifically,\n\\[\n \\mathcal{H} = \\frac{\\text{distance travelled in hippocampal reference frame}}{\\text{distance travelled in lab reference frame}}.\n\\]\nIn this equation, distance travelled in the hippocampal frame refers to the distance that the rat “thinks” it’s moving.\n\n\\(\\mathcal{H} = 1\\): The rat perceives itself as moving at the “correct” speed.\n\\(\\mathcal{H} > 1\\): The rat perceives itself as moving faster than it actually is with respect to the lab frame.\n\\(\\mathcal{H} < 1\\): The rat perceives itself as moving slower than it actually is with respect to the lab frame.\n\n\\(\\mathcal{H}\\) gives valuable insights into how visual cues such as the moving stripes affect the rats’ internal cognitive map during the task. It gives an understanding of how the rats update their perceived position in the environment.\nFor example, an \\(\\mathcal{H}\\) value of 2 would mean that the rat perceives itself as moving twice as fast as it actually is. Consequently, each place cell corresponding to a specific location in the maze will fire twice per lap rather than once.\n\n\nDescription of the problem\nMethod of Determining \\(\\mathcal{H}\\): Traditionally, \\(\\mathcal{H}\\) is determined by analyzing the spatial periodicity of place cell firing over multiple laps using Fourier transforms, as seen in (Jayakumar et al. (2019), Madhav et al. (2024)). Below is a figure displaying how the traditional method is used to determine the \\(\\mathcal{H}\\) value.\n\n\n\n\n\nFigure 2 - Spectral decoding algorithm. In the dome, as visual landmarks are presented and moved at an experimental gain G, the rat encounters a particular landmark every 1/G laps (the spatial period). If the place fields fire at the same location in the landmark reference frame, the firing rate of the cell exhibits a spatial frequency of G fields per lap. 
a, Illustration of place-field firing for three values of hippocampal gain, H\n\n\n\n\nThe frequency of firing for each place cell effectively decodes the \\(\\mathcal{H}\\) value for that specific neuron and the mean \\(\\mathcal{H}\\) value over all neurons gives the estimated \\(\\mathcal{H}\\) value over the neuronal population. This method lacks temporal precision within individual laps since it uses a Fourier Transform over 6 laps.\nA more precise, within-lap decoding of Hippocampal Gain (\\(\\mathcal{H}\\)) could provide a deeper understanding of how path integration occurs with finer temporal resolution. This could lead to new insights into how the brain updates its cognitive map when receiving conflicting visual cues.\nAlso, note how the decoding of \\(\\mathcal{H}\\) is directly tied to the neural data, which makes the traditional method less flexible. It cannot easily be applied to experiments involving two varying neural representations (e.g., a spatial gain \\(\\mathcal{H}\\) and an auditory gain \\(\\mathcal{A}\\)). In such cases, the two representations are coupled in the neural data, making it impossible to separate them.\nHowever, neural manifold learning offers a promising approach to decouple these representations. For instance, consider the hypothetical scenario below, where the data forms a torus:\n\n\n\n\n\n\n\n\n\nFigure 3 - Left: varying spatial representation, Right: varying audio representation.\n\nIn our current dataset, we only have a single varying neural representation and therefore expect a simpler 1D ring topology. However, in the above scenario, the data might lie on a torus. On this structure, the spatial representation (\\(\\mathcal{H}\\)) could vary along the major circle of the torus, while the auditory representation (\\(\\mathcal{A}\\)) varies along the minor circle. This structure enables us to disentangle and decode the two neural representations independently. 
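The gain definition above reduces to a ratio of distances, which makes the per-lap firing consequence easy to check numerically. A toy illustration (not the authors' decoder; the variable names are made up):

```python
# Hippocampal gain H = distance travelled in the hippocampal frame
#                      / distance travelled in the lab frame.
lab_laps = 3.0          # laps physically run in the lab frame
hippocampal_laps = 6.0  # laps "perceived" in the hippocampal frame

H = hippocampal_laps / lab_laps
fields_per_physical_lap = H  # a place field recurs H times per physical lap

print(H)  # 2.0 -> each place cell fires twice per physical lap
```

This is exactly the \(\mathcal{H} = 2\) example from the text: the rat perceives itself as moving twice as fast, so each place field is encountered twice per lap.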
This could prove useful for future experiments of this type. We wish to validate this method for single varying representations, and then move on to two varying representations."
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#main-goal",
    "href": "posts/Neural-Manifold/index.html#main-goal",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Main Goal",
    "text": "Main Goal\nOur main goal is therefore to determine this \\(\\mathcal{H}\\) value without using a Fourier Transform and instead obtain a temporally finer, within-lap estimation of \\(\\mathcal{H}\\) using manifold learning. Some key questions that motivate this research include:\n\nHow does the velocity of the rat affect the \\(\\mathcal{H}\\) value?\nWhat patterns does the \\(\\mathcal{H}\\) value exhibit over the course of a lap? Does it relate to other behavioural variables?\n\nFurther goals of this research include a method of decoding the “hippocampal gain” online and feeding these values back into the dome apparatus to drive \\(\\mathcal{H}\\) to the desired value for the experiment.\nWe turn to CEBRA Schneider, Lee, and Mathis (2023) as our method of manifold learning. In the next section, we will see how CEBRA can help decode \\(\\mathcal{H}\\) reliably.\nThe basic idea is as follows: First, we aim to project the neural data into some latent space. In this space, we want the points to map out the topology of the task - specifically, to encode hippocampal position/angle (the rat’s position in the hippocampal reference frame). We assume that this task forms a 1D ring topology, given the cyclic nature of the dome setup and the periodic firing of place cells. 
Then we want to validate and construct a latent parametrization of this manifold, specifically designed to directly reflect the hippocampal position. With an accurate hippocampal position parametrization, we could then decode \\(\\mathcal{H}\\), giving us a temporally finer estimation of \\(\\mathcal{H}\\).\nNext, we move on to what CEBRA is and how it can help us achieve our goal."
  },
  {
    "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html",
    "href": "posts/ImageMorphing/OT4DiseaseProgression.html",
    "title": "Optimal Mass Transport and its Convex Formulation",
    "section": "",
    "text": "In the context of biomedicine, understanding disease progression is critical in developing effective diagnostic and therapeutic strategies. Medical imaging provides us with invaluable data, capturing the spatial and structural changes in the human body over time. Yet, analyzing these changes quantitatively and consistently remains challenging. Here, we explore how optimal transport (OT) can be applied to model disease progression in a geometrically meaningful way, providing a tool to predict deformations and shape changes in diseases like neurodegeneration, cancer, and respiratory diseases.\n\n\nOptimal transport is a mathematical framework originally developed to solve the problem of transporting resources in a way that minimizes cost. The problem was formalized by the French mathematician Gaspard Monge in 1781. In the 1920s, A.N. Tolstoi was among the first to study the transportation problem mathematically. However, the major advances were made in the field during World War II by the Soviet mathematician and economist Leonid Kantorovich. OT is nevertheless a difficult optimization problem; in 2000, Benamou and Brenier proposed a convex formulation. 
Villani explains the history and mathematics behind OT in great detail in his book (Villani 2021), which is widely read and well regarded.\nMathematically, OT finds the most efficient way to “move” one distribution to match another, which is useful in medical imaging where changes in structure and morphology need to be quantitatively mapped over time. OT computes a transport map (or “flow”) that transforms one spatial distribution into another with minimal “work” (measured by the Wasserstein distance). This idea has strong applications in medical imaging, particularly for analyzing disease progression, as it provides a way to track changes in anatomical structures over time.\n\n\n\nState of neurodegeneration in a child at different ages. (Bastos et al. (2020)) OT can learn the progression or the transformation (T) of brain deformation from the state at age 5 (\\(\\rho_0\\)) to the state at age 7 (\\(\\rho_1\\)) or age 9 (\\(\\rho_2\\)).\n\n\n\n\n\nThe OT framework is uniquely suited for disease progression modeling because it allows us to:\n\nCapture spatial and structural changes: OT computes a smooth, meaningful transformation, preserving the continuity of shapes, making it ideal for medical images that track evolving structures.\nQuantify changes robustly: By calculating the minimal transport cost, OT provides a quantitative measure of how much a structure (e.g., brain tissue) changes, which can correlate with disease severity.\nCompare across patients and populations: OT-based metrics can be standardized across subjects, enabling comparisons between different patient groups or disease stages.\n\n\n\n\n\nNeurodegeneration (e.g., Alzheimer’s Disease): OT maps brain atrophy across time points in MRI scans, quantifying volume and cortical thickness changes crucial for staging and monitoring Alzheimer’s.\nCancer: OT tracks tumor morphology changes, helping assess treatment response by measuring growth, shrinkage, or shape shifts, even 
aiding relapse predictions.\nRespiratory Diseases (e.g., COPD): OT compares longitudinal lung CTs to quantify tissue loss distribution, providing spatial insights for monitoring COPD progression and treatment adjustment."
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#what-is-cebra",
    "href": "posts/Neural-Manifold/index.html#what-is-cebra",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "What is CEBRA?",
    "text": "What is CEBRA?\nCEBRA, introduced in Schneider, Lee, and Mathis (2023), is a powerful self-supervised learning algorithm designed to create consistent, interpretable embeddings of high-dimensional neural recordings using auxiliary variables such as behavior or time. CEBRA generates consistent embeddings across trials, animals, and even different recording modalities.\nIn our analysis, we will use the discovery mode of CEBRA, with only time as our auxiliary variable. CEBRA is implemented in Python.\n\nThe Need for CEBRA\nIn neuroscience, understanding how neural populations encode behavior is a major challenge. Traditional linear methods like PCA, or even non-linear approaches like UMAP and t-SNE, fall short in this context because they fail to capture temporal dynamics and lack consistency across different sessions or animals. CEBRA overcomes these limitations by both considering temporal dynamics and providing consistency across different sessions or animals.\n\n\nHow Does CEBRA Work?\nCEBRA uses a convolutional neural network (CNN) encoder trained with contrastive learning to produce a latent embedding of the neural data. The algorithm identifies positive and negative pairs of data points, using temporal proximity to structure the embedding space.\n\n\nCEBRA Architecture\n\nContrastive Learning\nThe CEBRA model is trained using a contrastive learning loss function. 
In CEBRA, this is achieved through InfoNCE (Noise Contrastive Estimation), which encourages the model to distinguish between similar (positive) and dissimilar (negative) samples.\nThe loss function is defined as: \\[\n\\mathcal{L} = - \\log \\frac{e^{\\text{sim}(f(x), f(y^+)) / \\tau}}{e^{\\text{sim}(f(x), f(y^+)) / \\tau} + \\sum_{i=1}^{K} e^{\\text{sim}(f(x), f(y_i^-)) / \\tau}}\n\\]\nwhere \\(f(x)\\) and \\(f(y)\\) are the encoded representations of the neural data after passing through the CNN, and \\(\\text{sim}(f(x), f(y))\\) represents a similarity measure between the two embeddings, implemented as cosine similarity. Here, \\(y^{+}\\) denotes the positive pair (similar to \\(x\\) in time), \\(y_{i}^{-}\\) denotes the negative pairs (dissimilar to \\(x\\) in time), and \\(\\tau\\) is a temperature parameter that controls the sharpness of the distribution.\nNote that the similarity measure depends on the CEBRA mode used, and we have used time as our similarity measure. The contrastive loss encourages the encoder to map temporally close data points (positive pairs) to close points in the latent space, while mapping temporally distant data points (negative pairs) further apart. This way, the embeddings reflect the temporal structure of the data. The final output is then the embedding value in the latent space. Below is a schematic taken from (Schneider, Lee, and Mathis (2023)), showing the CEBRA architecture.\n\n\n\n\n\nFigure 4 - CEBRA Architecture. Input: Neural spike data in the shape (time points, neuron #). Output: Low dimensional embedding\n\n\n\n\nOnce we obtain the neural embeddings from CEBRA, the next step is to determine the underlying manifold that describes the structure of the resulting point cloud. 
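The InfoNCE objective above can be written out directly for a single anchor. A minimal hand-rolled NumPy sketch, not CEBRA's actual implementation:

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity, the sim(., .) used in the loss above."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def info_nce(anchor, positive, negatives, tau=1.0):
    """InfoNCE loss for one anchor: -log of the positive pair's softmax weight
    against the K negative pairs, with temperature tau."""
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))


rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
loss_easy = info_nce(anchor, anchor, [-anchor])   # positive aligned, negative opposed
loss_hard = info_nce(anchor, -anchor, [anchor])   # pairs swapped
print(loss_easy < loss_hard)  # True: aligned positives give a lower loss
```

Minimizing this loss over many anchors is what pulls temporally close samples together and pushes temporally distant ones apart in the latent space.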
For example, let’s consider the output of a CEBRA embedding from one experimental session.\n\n\n\n\n\nFigure 5 - CEBRA Embedding for an experiment with Hippocampal Position Annotated as a Color Map\n\n\n\n\nThe embedding appears to form a 1D circle in 3D space. We can also see that the annotated hippocampal position correctly traces the rat’s trajectory in the hippocampal frame throughout the experiment. This observation aligns with our expectations, since we predict that the neural activity encodes the hippocampal reference frame position, not the lab frame position. To validate the 1D ring topology, we apply a technique known as Persistent Homology."
  },
  {
    "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#what-is-optimal-transport",
    "href": "posts/ImageMorphing/OT4DiseaseProgression.html#what-is-optimal-transport",
    "title": "Optimal Mass Transport and its Convex Formulation",
    "section": "",
    "text": "Optimal transport is a mathematical framework originally developed to solve the problem of transporting resources in a way that minimizes cost. The problem was formalized by the French mathematician Gaspard Monge in 1781. In the 1920s, A.N. Tolstoi was among the first to study the transportation problem mathematically. However, the major advances were made in the field during World War II by the Soviet mathematician and economist Leonid Kantorovich. OT is nevertheless a difficult optimization problem; in 2000, Benamou and Brenier proposed a convex formulation. Villani explains the history and mathematics behind OT in great detail in his book (Villani 2021), which is widely read and well regarded.\nMathematically, OT finds the most efficient way to “move” one distribution to match another, which is useful in medical imaging where changes in structure and morphology need to be quantitatively mapped over time. OT computes a transport map (or “flow”) that transforms one spatial distribution into another with minimal “work” (measured by the Wasserstein distance). 
This idea has strong applications in medical imaging, particularly for analyzing disease progression, as it provides a way to track changes in anatomical structures over time.\n\n\n\nState of neurodegeneration in a child at different ages. (Bastos et al. (2020)) OT can learn the progression or the transformation (T) of brain deformation from the state at age 5 (\\(\\rho_0\\)) to the state at age 7 (\\(\\rho_1\\)) or age 9 (\\(\\rho_2\\))."
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#persistent-homology",
    "href": "posts/Neural-Manifold/index.html#persistent-homology",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Persistent Homology",
    "text": "Persistent Homology\nPersistent homology allows us to quantify and verify the topological features of our embedded space. Specifically, we want to validate the assumption that the neural representation forms a 1D ring manifold, which corresponds to the rat’s navigation behavior within the environment. The idea of persistent homology is to create spheres of varying radii around each point in the point cloud, and from those spheres, track how the topological features of the shape change as the radius grows. By systematically increasing the radius, we can observe when distinct clusters merge, when loops (1D holes) appear, and when higher-dimensional voids form. These features persist across different radius sizes, and their persistence provides a measure of their significance. In the context of neural data, this allows us to detect the underlying topological structure of the manifold. Below is a figure illustrating this method Schneider, Lee, and Mathis (2023):\n\n\n\n\n\nFigure 6 - Persistent Homology\n\n\n\n\n\nValidating a 1D Ring Manifold\nTo confirm the circular nature of the embedding, we analyze the Betti numbers derived from the point cloud. 
Betti numbers describe the topological features of a space, with the \\(k\\)-th Betti number counting the number of \\(k\\)-dimensional “holes” in the manifold. Below is a figure showing a few basic topological spaces and their corresponding Betti numbers Walker (2008):\n\n\n\n\n\nFigure 7 - Some simple topological spaces and their Betti numbers \n\n\n\n\nFor a 1D ring, the expected Betti numbers are: \\[\n\\beta_0 = 1 : \\text{One connected component.}\n\\] \\[\n\\beta_1 = 1 : \\text{One 1D hole (i.e., the circular loop).}\n\\] \\[\n\\beta_2 = 0 : \\text{No 2D voids.}\n\\]\nThus, the expected Betti numbers for our manifold are (1, 1, 0). If the Betti numbers extracted from the persistent homology analysis align with these values, it confirms that the neural dynamics trace a 1D circular trajectory, supporting our hypothesis that the hippocampal representation forms a ring corresponding to the rat’s navigation path." + "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#why-optimal-transport-for-disease-progression", + "href": "posts/ImageMorphing/OT4DiseaseProgression.html#why-optimal-transport-for-disease-progression", + "title": "Optimal Mass Transport and its Convex Formulation", + "section": "", + "text": "The OT framework is uniquely suited for disease progression modeling because it allows us to:\n\nCapture spatial and structural changes: OT computes a smooth, meaningful transformation, preserving the continuity of shapes, making it ideal for medical images that track evolving structures.\nQuantify changes robustly: By calculating the minimal transport cost, OT provides a quantitative measure of how much a structure (e.g., brain tissue) changes, which can correlate with disease severity.\nCompare across patients and populations: OT-based metrics can be standardized across subjects, enabling comparisons between different patient groups or disease stages." 
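The \(\beta_0\) part of the Betti-number check above can be illustrated with a toy computation, independent of any persistent-homology library: sample points on a circle, grow balls around them, and count the connected components of the resulting Vietoris-Rips graph with a union-find structure. This is a minimal sketch under our own assumptions (a real analysis would use a package such as ripser, which also returns \(\beta_1\)):

```python
import math
from itertools import combinations

def connected_components(points, radius):
    """beta_0 of the Vietoris-Rips graph: join points whose balls of the
    given radius overlap (distance <= 2 * radius), then count components."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= 2 * radius:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})

# 20 points on a unit circle: beta_0 drops from 20 to 1 as the radius grows
circle = [(math.cos(2 * math.pi * k / 20), math.sin(2 * math.pi * k / 20))
          for k in range(20)]
print(connected_components(circle, 0.01))  # 20 isolated points
print(connected_components(circle, 0.5))   # 1 connected component
```

The adjacent-point spacing on this circle is about 0.31, so a radius below 0.155 leaves all points isolated and anything above it merges them into a single loop, mirroring how features "persist" over a range of radii.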
},
  {
    "objectID": "posts/Neural-Manifold/index.html#spud-method",
    "href": "posts/Neural-Manifold/index.html#spud-method",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "SPUD Method",
    "text": "SPUD Method\nOnce we’ve validated the assumption that our data forms a 1D ring manifold, we can proceed to fitting a spline to the data. We do this so that we can parametrize our behavioural variable, the \\(\\text{hippocampal angle}\\), along the point cloud. There are many different methods, but the one chosen for this purpose was taken from Chaudhuri et al. (2019). The spline is defined by a set of points, or knots, which I decided to initialize using k-medoids clustering Jin and Han (2011). The knots are then fit further to the data by minimizing a loss function defined as follows:\n\\[\n\\text{cost} = \\text{dist} + \\text{curvature} + \\text{length} - \\text{log(density)}\n\\]\nwhere dist is the distance of each point to the spline, curvature is the total curvature of the spline, length is the total length of the spline, and density is the point cloud density at each knot.\n\nOverview of the SPUD Method\nSpline Parameterization for Unsupervised Decoding (SPUD) Chaudhuri et al. (2019) is a multi-step method designed to parametrize a neural manifold. The goal of SPUD is to provide an on-manifold local parameterization using a local coordinate system rather than a global one. This method is particularly useful when dealing with topologically non-trivial variables that have a circular structure.\nSpline Parameterization: SPUD parameterizes the manifold by first fitting a spline to the underlying structure. Chaudhuri et al. (2019) demonstrated that this works for head direction cells in mice, accurately parametrizing, i.e. decoding, the head direction. 
Our goal is to have the parametrization accurately decode our latent variable of interest, the Hippocampal Gain (\\(\\mathcal{H}\\)).\n\n\nDeciding the Parameterization of the Latent Variable\n\nNatural Parametrization\nA natural parameterization would mean that equal distances in the embedding space correspond to equal changes in the latent variable. The natural parameterization comes from the assumption that neural systems allocate resources based on the significance or frequency of stimuli. For example, in systems like the visual cortex, stimuli that occur frequently (e.g., vertical or horizontal orientations) might be encoded with higher resolution. However, for systems like place cell firing, where all angles and places are equally probable in the dome, the natural parameterization reflects this uniform encoding strategy, with no overrepresentation of certain places (Chaudhuri et al. (2019)).\n\n\nAlternative Parameterization and its Limitations\nAn alternative parameterization method was considered, in which intervals between consecutive knots in the spline were set to represent equal changes in the latent variable. This approach was designed to counteract any potential biases in the data due to over- or under-sampling in certain regions of the manifold.\nHowever, this alternative was not found to be effective in practice by Chaudhuri et al. (2019). Given sufficient data, the natural parameterization performed better, supporting the conclusion that it better reflects how neural systems encode variables. This is also the case for our experiment. Consider the following figure, in which a spline is fit to the data and a color map is applied to the natural parametrization. We can see that it aligns almost perfectly with the hippocampal angle. Great, that’s exactly what we wanted!\n\n\n\n\n\nFigure 8 - Spline fit on CEBRA embedding\n\n\n\n\nSo, what do we do now that we have an accurate parametrization of the \\(\\text{hippocampal angle}\\)?"
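The SPUD-style loss described above can be made concrete with a toy implementation. This is our own simplification, not the authors' code: the point-to-spline distance is replaced by the distance to the nearest knot, curvature by the discrete turning angle of the closed knot polygon, and the knot densities are supplied by the caller:

```python
import math

def spud_cost(knots, points, density):
    """Toy version of the knot-fitting loss:
    dist + curvature + length - log(density), all terms simplified."""
    # dist: each point's distance to its nearest knot (stand-in for the spline)
    dist = sum(min(math.dist(p, k) for k in knots) for p in points)
    # length of the closed knot loop
    n = len(knots)
    length = sum(math.dist(knots[i], knots[(i + 1) % n]) for i in range(n))
    # curvature: total absolute turning angle at each knot
    curvature = 0.0
    for i in range(n):
        a, b, c = knots[i - 1], knots[i], knots[(i + 1) % n]
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        ang = math.atan2(v1[0] * v2[1] - v1[1] * v2[0],
                         v1[0] * v2[0] + v1[1] * v2[1])
        curvature += abs(ang)
    # density: point-cloud density at each knot, supplied by the caller
    log_density = sum(math.log(d) for d in density)
    return dist + curvature + length - log_density

# Four knots on the unit square, points sitting exactly on the knots,
# unit densities: the cost reduces to length + total turning = 4 + 2*pi
knots = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
cost = spud_cost(knots, knots, [1.0] * 4)
print(cost)
```

With unit densities the log term vanishes, so the example isolates the geometric terms; in practice the density term rewards placing knots where the point cloud is dense.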
+ "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#popular-applications-of-ot-to-study-disease-progression", + "href": "posts/ImageMorphing/OT4DiseaseProgression.html#popular-applications-of-ot-to-study-disease-progression", + "title": "Optimal Mass Transport and its Convex Formulation", + "section": "", + "text": "Neurodegeneration (e.g., Alzheimer’s Disease): OT maps brain atrophy across time points in MRI scans, quantifying volume and cortical thickness changes crucial for staging and monitoring Alzheimer’s.\nCancer: OT tracks tumor morphology changes, helping assess treatment response by measuring growth, shrinkage, or shape shifts, even aiding relapse predictions.\nRespiratory Diseases (e.g., COPD): OT compares longitudinal lung CTs to quantify tissue loss distribution, providing spatial insights for monitoring COPD progression and treatment adjustment." + }, + { + "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#monge-formulation", + "href": "posts/ImageMorphing/OT4DiseaseProgression.html#monge-formulation", + "title": "Optimal Mass Transport and its Convex Formulation", + "section": "Monge Formulation", + "text": "Monge Formulation\nThe Monge formulation of optimal transport, introduced in 1781, addresses the problem of moving mass efficiently from one distribution to another. 
Given two distributions:\n\nSource Distribution: \\(\\mu\\) on \\(X\\)\nTarget Distribution: \\(\\nu\\) on \\(Y\\)\n\nwe seek a transport map \\(T: X \\to Y\\) that minimizes the transport cost, typically \\(c(x, T(x)) = \\|x - T(x)\\|^p\\).\nThe Monge problem can be written as:\n\\[\n\\min_T \\int_X c(x, T(x)) \\, d\\mu(x)\n\\]\nsubject to \\(T_\\# \\mu = \\nu\\), meaning that the map \\(T\\) must push \\(\\mu\\) to \\(\\nu\\), ensuring all mass is preserved without splitting.\nKey Points:\n\nTransport Map \\(T\\): Monge’s formulation requires a direct mapping of mass from \\(\\mu\\) to \\(\\nu\\).\nNo Mass Splitting: Unlike relaxed formulations, the Monge problem doesn’t allow fractional mass transport, making it challenging to solve in complex cases.\nCost Function: The choice of \\(c(x, y)\\) affects the solution—common choices include distance \\(\\|x - y\\|\\) and squared distance \\(\\|x - y\\|^2\\).\n\n\nShortcoming\nThe Monge formulation lacks flexibility due to its one-to-one mapping constraint, which led to the Kantorovich relaxation, allowing more general solutions by enabling mass splitting. The Monge formulation captures the essence of spatial mass transport with minimal cost, inspiring modern approaches in diverse fields."
  },
  {
    "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#kantorovich-formulation",
    "href": "posts/ImageMorphing/OT4DiseaseProgression.html#kantorovich-formulation",
    "title": "Optimal Mass Transport and its Convex Formulation",
    "section": "Kantorovich formulation",
    "text": "Kantorovich formulation\nThe Kantorovich formulation, introduced by Leonid Kantorovich in 1942 (Kantorovich (2006)), generalizes the Monge problem by allowing “mass splitting,” where mass from one source point can be distributed to multiple target points. 
This flexibility makes it possible to solve a broader range of transport problems.\nKantorovich’s Problem:\nInstead of finding a single transport map \\(T\\), the Kantorovich formulation seeks a transport plan \\(\\gamma\\), a joint probability distribution on \\(X \\times Y\\), such that:\n\\[\n\\min_\\gamma \\int_{X \\times Y} c(x, y) \\, d\\gamma(x, y)\n\\]\nwhere \\(c(x, y)\\) represents the cost of transporting mass from \\(x \\in X\\) to \\(y \\in Y\\). The transport plan \\(\\gamma\\) must satisfy marginal constraints:\n\\[\n\\int_Y d\\gamma(x, y) = d\\mu(x) \\quad \\text{and} \\quad \\int_X d\\gamma(x, y) = d\\nu(y),\n\\]\nensuring that \\(\\gamma\\) moves all mass from \\(\\mu\\) to \\(\\nu\\).\nKey Points:\n\nTransport Plan \\(\\gamma\\): A probability measure over \\(X \\times Y\\) that allows fractional mass movement, broadening the solution space.\nMarginal Constraints: These ensure \\(\\gamma\\) aligns with source \\(\\mu\\) and target \\(\\nu\\) distributions, preserving total mass.\nCost Function: Commonly, \\(c(x, y) = \\|x - y\\|\\) or \\(c(x, y) = \\|x - y\\|^2\\), chosen based on the desired penalty for transport distance.\n\nAdvantages:\n\nFlexibility: Mass splitting allows for a solution even when \\(\\mu\\) and \\(\\nu\\) have different structures (e.g., continuous to discrete).\nComputational Feasibility: The problem can be solved via linear programming or faster algorithms using entropic regularization.\n\nHence, the Kantorovich formulation provides a robust framework for optimal transport problems, enabling applications across fields where flexibility and computational efficiency are essential." 
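In one dimension the optimal Kantorovich plan for a convex cost can be written down directly: sort both supports and match mass monotonically (the north-west corner rule applied to the sorted points). The sketch below is illustrative, not from the post, and shows exactly the mass splitting that Monge's formulation forbids:

```python
def kantorovich_1d(mu, nu):
    """Optimal transport plan between two 1D discrete measures on sorted
    support points, built by the monotone (north-west corner) rule, which
    is optimal in 1D for convex costs such as |x - y|**2."""
    plan = {}  # (i, j) -> mass moved from source point i to target point j
    mu, nu = list(mu), list(nu)
    i = j = 0
    while i < len(mu) and j < len(nu):
        m = min(mu[i], nu[j])
        if m > 0:
            plan[(i, j)] = plan.get((i, j), 0.0) + m
        mu[i] -= m
        nu[j] -= m
        if mu[i] <= 1e-12:  # source point exhausted
            i += 1
        if nu[j] <= 1e-12:  # target point filled
            j += 1
    return plan

# Mass splitting in action: the source point at x = 1 carries mass 0.6,
# but each target only needs 0.5, so its mass is split across both.
x, y = [0.0, 1.0], [0.0, 1.0]
mu, nu = [0.4, 0.6], [0.5, 0.5]
plan = kantorovich_1d(mu, nu)
cost = sum(m * (x[i] - y[j]) ** 2 for (i, j), m in plan.items())
print(plan)   # approx {(0, 0): 0.4, (1, 0): 0.1, (1, 1): 0.5}
print(cost)   # approx 0.1
```

No Monge map on these two source points could achieve this: a map must send all of the mass at x = 1 to a single target, which is exactly the rigidity the relaxation removes.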
+ }, + { + "objectID": "posts/ImageMorphing/OT4DiseaseProgression.html#benamou-brenier-formulation-convex-ot", + "href": "posts/ImageMorphing/OT4DiseaseProgression.html#benamou-brenier-formulation-convex-ot", + "title": "Optimal Mass Transport and its Convex Formulation", + "section": "Benamou-Brenier Formulation (Convex OT)", + "text": "Benamou-Brenier Formulation (Convex OT)\nThe Benamou-Brenier formulation (Benamou and Brenier (2000)) provides a dynamic perspective on optimal transport, interpreting it as a fluid flow problem. Instead of transporting mass directly between two distributions, this approach finds the path of minimal “kinetic energy” needed to continuously transform one distribution into another over time.\nThe Benamou-Brenier formulation considers a probability density \\(\\rho(x, t)\\) evolving over time \\(t \\in [0, 1]\\) from an initial distribution \\(\\rho_0\\) to a final distribution \\(\\rho_1\\). The goal is to find a velocity field \\(v(x, t)\\) that minimizes the action, or “kinetic energy” cost:\n\\[\n\\min_{\\rho, v} \\int_0^1 \\int_X \\frac{1}{2} \\|v(x, t)\\|^2 \\rho(x, t) \\, dx \\, dt,\n\\]\nsubject to the continuity equation:\n\\[\n\\frac{\\partial \\rho}{\\partial t} + \\nabla \\cdot (\\rho v) = 0,\n\\]\nwhich ensures mass conservation from \\(\\rho_0\\) to \\(\\rho_1\\).\nKey Points:\n\nDynamic Interpretation: Unlike Monge and Kantorovich, the Benamou-Brenier formulation finds a time-dependent transformation, representing a continuous flow of mass.\nVelocity Field \\(v(x, t)\\): Defines the “direction” and “speed” of mass movement, yielding a smooth, physical path of minimal kinetic energy.\nContinuity Equation: Ensures mass conservation over time, maintaining that mass neither appears nor disappears.\n\nAdvantages:\n\nSmoothness: Provides a continuous path for evolving distributions, well-suited for dynamic processes.\nComputational Benefits: The problem is formulated as a convex optimization over a flow field, often solved 
with efficient numerical methods.\n\nThe Benamou-Brenier formulation expands optimal transport by introducing a dynamic flow approach, making it especially useful for applications requiring continuous transformations. Its physical interpretation has brought valuable insights to fields that rely on time-evolving processes."
  },
  {
    "objectID": "posts/elastic-metric/elastic_metric.html",
    "href": "posts/elastic-metric/elastic_metric.html",
    "title": "Riemannian elastic metric for curves",
    "section": "",
    "text": "This page introduces basic concepts of the elastic metric, the square root velocity metric, and the geodesic distance and Fréchet mean associated with them."
  },
  {
    "objectID": "posts/RECOVAR/index.html",
    "href": "posts/RECOVAR/index.html",
    "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods",
    "section": "",
    "text": "Cryogenic electron microscopy (cryo-EM), a cryomicroscopy technique applied to samples embedded in ice, along with recent developments in powerful hardware and software, has achieved huge success in the determination of biomolecular structures at near-atomic level. Cryo-EM takes snapshots of thousands or millions of particles in different poses frozen in the sample, and thus allows the reconstruction of the 3D structure from those 2D projections.\nEarly algorithms and software for processing cryo-EM data focused on resolving homogeneous structures of biomolecules. However, many biomolecules are very dynamic in conformations, compositions, or both. For example, ribosomes comprise many subunits, and their compositions may vary within the sample and are of research interest. Spike protein is an example of conformational heterogeneity, where the receptor-binding domain (RBD) keeps switching between closed and open states in order to bind to receptors while resisting antibody binding. 
When studying the antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.\n\n\n\nA simple illustration of the conformational heterogeneity of spike protein, where it displays two kinds of conformations: closed RBD and open RBD of one chain (colored in blue) (Wang et al. 2020). The spike protein is a trimer, so in reality all three chains move, possibly in different ways, and the motion of the spike protein is much more complex than what’s shown here.\n\n\nThe initial heterogeneity analysis of 3D structures reconstructed from cryo-EM data started with relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done by expectation-maximization (EM) algorithms, where 2D particle stacks are iteratively assigned to classes and used to reconstruct the volume of each class. However, such an approach has two problems: first, the classification decreases the number of images used to reconstruct the volume, and thus lowers the resolution we are able to achieve; second, the motion of a biomolecule is continuous in reality, so discrete classification may not describe the heterogeneity very well, and we may miss some transient states.\nTherefore, recent work focuses on methods modeling continuous heterogeneity without any classification step to avoid the above issues. Most methods adopt a similar structure, where 2D particle stacks are mapped to latent embeddings, clusters/trajectories are estimated in latent space, and finally volumes are mapped and reconstructed from latent embeddings. Early methods use a linear mapping (e.g. 3DVA), but with the application of deep learning techniques to cryo-EM data processing, methods adapted from the variational autoencoder (VAE) have achieved better performance (e.g. cryoDRGN, 3DFlex). 
Nevertheless, the latent space obtained from VAE and other deep learning methods is hard to interpret and does not conserve distances and densities, imposing difficulties in reconstructing motions/trajectories, which are ultimately what most structural biologists want.\nThe recently developed software RECOVAR (Gilles and Singer 2024), using a linear mapping like 3DVA, was shown to achieve comparable or even better performance than deep learning methods, while offering high interpretability and easy recovery of motions/trajectories from latent space. For this project, I will review the pipeline of RECOVAR, discuss improvements and extensions we made to this pipeline, and present heterogeneity analysis results from the original paper and our SARS-CoV2 spike protein dataset."
  },
  {
    "objectID": "posts/RECOVAR/index.html#background",
    "href": "posts/RECOVAR/index.html#background",
    "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods",
    "section": "",
    "text": "Cryogenic electron microscopy (cryo-EM), a cryomicroscopy technique applied to samples embedded in ice, along with recent developments in powerful hardware and software, has achieved huge success in the determination of biomolecular structures at near-atomic level. Cryo-EM takes snapshots of thousands or millions of particles in different poses frozen in the sample, and thus allows the reconstruction of the 3D structure from those 2D projections.\nEarly algorithms and software for processing cryo-EM data focused on resolving homogeneous structures of biomolecules. However, many biomolecules are very dynamic in conformations, compositions, or both. For example, ribosomes comprise many subunits, and their compositions may vary within the sample and are of research interest. 
Spike protein is an example of conformational heterogeneity, where the receptor-binding domain (RBD) keeps switching between closed and open states in order to bind to receptors while resisting antibody binding. When studying the antigen-antibody complex, both compositional and conformational heterogeneity need to be considered.\n\n\n\nA simple illustration of the conformational heterogeneity of spike protein, where it displays two kinds of conformations: closed RBD and open RBD of one chain (colored in blue) (Wang et al. 2020). The spike protein is a trimer, so in reality all three chains move, possibly in different ways, and the motion of the spike protein is much more complex than what’s shown here.\n\n\nThe initial heterogeneity analysis of 3D structures reconstructed from cryo-EM data started with relatively simple 3D classification, which outputs discrete classes of different conformations. This is usually done by expectation-maximization (EM) algorithms, where 2D particle stacks are iteratively assigned to classes and used to reconstruct the volume of each class. However, such an approach has two problems: first, the classification decreases the number of images used to reconstruct the volume, and thus lowers the resolution we are able to achieve; second, the motion of a biomolecule is continuous in reality, so discrete classification may not describe the heterogeneity very well, and we may miss some transient states.\nTherefore, recent work focuses on methods modeling continuous heterogeneity without any classification step to avoid the above issues. Most methods adopt a similar structure, where 2D particle stacks are mapped to latent embeddings, clusters/trajectories are estimated in latent space, and finally volumes are mapped and reconstructed from latent embeddings. Early methods use a linear mapping (e.g. 
3DVA), but with the application of deep learning techniques to cryo-EM data processing, methods adapted from the variational autoencoder (VAE) have achieved better performance (e.g. cryoDRGN, 3DFlex). Nevertheless, the latent space obtained from VAE and other deep learning methods is hard to interpret and does not conserve distances and densities, imposing difficulties in reconstructing motions/trajectories, which are ultimately what most structural biologists want.\nThe recently developed software RECOVAR (Gilles and Singer 2024), using a linear mapping like 3DVA, was shown to achieve comparable or even better performance than deep learning methods, while offering high interpretability and easy recovery of motions/trajectories from latent space. For this project, I will review the pipeline of RECOVAR, discuss improvements and extensions we made to this pipeline, and present heterogeneity analysis results from the original paper and our SARS-CoV2 spike protein dataset."
  },
  {
    "objectID": "posts/RECOVAR/index.html#methods",
    "href": "posts/RECOVAR/index.html#methods",
    "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods",
    "section": "Methods",
    "text": "Methods\n\nRegularized covariance estimation\nLet \\(N\\) be the dimension of the grid and \\(n\\) be the number of images. We start by formulating the formation process of each cryo-EM image in the Fourier space \\(y_i\\in\\mathbb{C}^{N^2}\\) from its corresponding conformation \\(x_i\\in\\mathbb{C}^{N^3}\\) as: \\[y_i = C_i\\hat{P}(\\phi_i)x_i + \\epsilon_i, \\epsilon_i\\sim N(0, \\Lambda_i) \\]\nwhere \\(\\hat{P}(\\phi_i)\\) is the projection from 3D to 2D after rigid body motion with pose \\(\\phi_i\\), \\(C_i\\) is the contrast transfer function (CTF), and \\(\\epsilon_i\\) represents the Gaussian noise. RECOVAR assumes that \\(C_i\\) and \\(\\phi_i\\) are given. 
This can be done via many existing ab-initio methods. Hence in the following analysis, we will simply fix the linear map \\(P_i:=C_i\\hat{P}(\\phi_i)\\).\nWhen poses are known, the mean \\(\\mu\\in\\mathbb{C}^{N^3}\\) of the distribution of conformations can be estimated by solving:\n\\[\\hat{\\mu}:=\\underset{\\mu}{\\mathrm{argmin}}\\sum_{i=1}^{n}\\lVert y_i-P_i\\mu\\rVert_{\\Lambda^{-1}}^2+\\lVert\\mu\\rVert_w^2\\]\nwhere \\(\\lVert v\\rVert_{\\Lambda^{-1}}^2=v^*\\Lambda^{-1}v\\) and \\(\\lVert v\\rVert_w^2=\\sum_i|v_i|^2w_i\\). \\(w\\in \\mathbb{R}^{N^3}\\) is the optional Wiener filter. Similarly, covariance can be estimated as the solution to the linear system corresponding to the following:\n\\[\\hat{\\Sigma}:=\\underset{\\Sigma}{\\mathrm{argmin}}\\sum_{i=1}^n\\lVert(y_i-P_i\\hat{\\mu})(y_i-P_i\\hat{\\mu})^*-(P_i\\Sigma P_i^*+\\Lambda_i)\\rVert_F^2+\\lVert\\Sigma\\rVert_R^2\\]\nwhere \\(\\lVert A\\rVert_F^2=\\sum_{i,j}A_{i,j}^2\\) and \\(\\lVert A\\rVert_R^2=\\sum_{i,j}A_{i,j}^2R_{i,j}\\). \\(R\\) is the regularization weight.\nOur goal at this step is to compute principal components (PCs) from \\(\\hat{\\mu}\\) and \\(\\hat{\\Sigma}\\). Nevertheless, computing the entire matrix \\(\\hat{\\Sigma}\\) is impossible considering that we have to compute \\(N^6\\) entries. Fortunately, for a low-rank covariance matrix only a subset of the columns is required to estimate the entire matrix and its leading eigenvectors, which are just the PCs. \\(d\\) PCs can be computed in \\(O(d(N^3+nN^2))\\), much faster than the \\(O(N^6)\\) required to compute the entire covariance matrix. Here a heuristic scheme is used to choose which columns are used to compute the eigenvectors. First, all columns are added to the considered set. Then the column corresponding to the pixel with the highest SNR in the considered set is iteratively added to the chosen set, and nearby pixels are removed from the considered set, until there is a desired number of columns \\(d\\) in the chosen set. 
We estimate the entries of the chosen columns and their complex conjugates and let them form \\(\\hat{\\Sigma}_{col}\\). Let \\(\\tilde{U}\\in\\mathbb{C}^{N^3\\times d}\\) be the orthogonalization of \\(\\hat{\\Sigma}_{col}\\). It follows that we can compute the reduced covariance matrix \\(\\hat{\\Sigma}_{\\tilde{U}}\\) by:\n\\[\\hat{\\Sigma}_{\\tilde{U}}:=\\underset{\\Sigma_{\\tilde{U}}}{\\mathrm{argmin}}\\sum_{i=1}^n\\lVert(y_i-P_i\\hat{\\mu})(y_i-P_i\\hat{\\mu})^*-(P_i\\tilde{U}\\Sigma_{\\tilde{U}}\\tilde{U}^* P_i^*+\\Lambda_i)\\rVert_F^2\\]\nBasically, we just replace \\(\\Sigma\\) in the formula to estimate the entire covariance matrix shown before with \\(\\tilde{U}\\Sigma_{\\tilde{U}}\\tilde{U}^*\\). Finally, we just need to perform an eigendecomposition on \\(\\hat{\\Sigma}_{\\tilde{U}}\\) and obtain \\(\\hat{\\Sigma}_{\\tilde{U}}=V\\Gamma V^*\\). The eigenvectors (which are the PCs we want) and eigenvalues would be \\(U:=\\tilde{U}V\\) and \\(\\Gamma\\) respectively.\n\n\nLatent space embedding\nWith PCs computed from the last step, denoted by \\(U\\in\\mathbb{C}^{N^3\\times d}\\), we can project \\(x_i\\) onto the lower-dimensional latent space by \\(z_i = U^*(x_i-\\hat{\\mu})\\in\\mathbb{R}^d\\). Assuming \\(z_i\\sim N(0,\\Gamma)\\), the MAP estimation of \\(P(z_i|y_i)\\) can be obtained by solving:\n\\[\\hat{a}_i, \\hat{z}_i = \\underset{a_i\\in\\mathbb{R}^+, z_i\\in\\mathbb{R}^d}{\\mathrm{argmin}}\\lVert a_iP_i(Uz_i+\\hat{\\mu})-y_i\\rVert_{\\Lambda_i^{-1}}^2+\\lVert z_i\\rVert_{\\Gamma^{-1}}^2\\]\nwhere \\(a_i\\) is a scaling factor used to capture variations in image contrast.\n\n\nConformation reconstruction\nAfter computing the latent embeddings, the next question would naturally be how to reconstruct conformations from embeddings. The most intuitive way is reprojection, i.e. \\(\\hat{x}\\leftarrow Uz+\\hat{\\mu}\\). 
Nevertheless, reprojection only works well when all the relevant PCs can be computed, which is almost impossible considering the low signal-to-noise ratio (SNR) in practice. Therefore, an alternative scheme based on adaptive kernel regression is used here. Given a fixed latent position \\(z^*\\) and the frequency \\(\\xi^k\\in\\mathbb{R}^3\\) in the 3D Fourier space of the volume whose value we would like to estimate, the kernel regression estimates of this form are computed as:\n\\[x(h;\\xi^k) = \\underset{x_k}{\\mathrm{argmin}}\\sum_{i,j}\\frac{1}{\\sigma_{i,j}^2}|C_{i,j}x_k-y_{i,j}|^2K(\\xi^k,\\xi_{i,j})K_i^h(z^*,z_i)\\]\nwhere \\(h\\) is the bandwidth; \\(\\sigma_{i,j}\\) is the variance of \\(\\epsilon_{i,j}\\), which is the noise of frequency \\(j\\) of the \\(i\\)-th observation; \\(y_{i,j}\\) is the value of frequency \\(j\\) of the \\(i\\)-th observation; \\(\\xi_{i,j}\\) is the frequency \\(j\\) of the \\(i\\)-th observation in 3D adjusted by \\(\\phi_i\\). We have two kernel functions in this formulation. \\(K(\\xi^k,\\xi_{i,j})\\) is the triangular kernel, measuring the distance in frequencies. \\(K_i^h(z^*, z_i)=E(\\frac{1}{h}\\lVert z^* - z_i\\rVert_{\\Sigma_{z_i}^{-1}})\\) where \\(\\Sigma_{z_i}\\) is the covariance matrix of \\(z_i\\), which can be computed from the formulation for the latent embedding, and \\(E\\) is a piecewise constant approximation of the Epanechnikov kernel. \\(K_i^h(z^*, z_i)\\) measures the distance between latent embeddings.\nHere lies a trade-off at the heart of every heterogeneous reconstruction algorithm: averaging images is necessary to overcome noise, but it also degrades heterogeneity since the images averaged may come from different conformations. Hence, we have to choose \\(h\\) carefully. A cross-validation strategy is applied to find the optimal \\(h\\) for each frequency shell of each subvolume. 
For a given \\(z^*\\), the dataset is split into two: from one halfset, the 50 estimates \\(\\hat{x}(h_1), ..., \\hat{x}(h_{50})\\) with varying \\(h\\) are computed, and from the other halfset a single low-bias, high-variance template \\(\\hat{x}_{CV}\\) is reconstructed using a small number of images which are closest to \\(z^*\\). Each of the 50 kernel estimates is then subdivided into small subvolumes by real-space masking, and each subvolume is again decomposed into frequency shells after a Fourier transform. We use the following cross-validation metric for subvolume \\(v\\) and frequency shell \\(s\\):\n\\[d_{s,v}(h) = \\lVert S_sV^{-1/2}(M_v(\\hat{x}_{CV}-\\hat{x}(h)))\\rVert_2^2\\]\nwhere \\(S_s\\) is a matrix that extracts shell \\(s\\); \\(M_v\\) is a matrix extracting subvolume \\(v\\); and \\(V\\) is a diagonal matrix containing the variance of the template. For each \\(s\\) and \\(v\\), the minimizer over \\(h\\) of the cross-validation score is selected, and the final volume is obtained by first recombining frequency shells for each subvolume and then recombining all the subvolumes.\n\n\n\nVolumes are reconstructed from the embedding by adaptive kernel regression.\n\n\n\n\nEstimation of state density\nSince motion is what structural biologists ultimately want, we have to figure out a method to sample from latent space to form a trajectory representing the motion of the molecule. According to Boltzmann statistics, the density of a particular state is a measure of the free energy of that state, which means a path which maximizes conformational density is equivalent to the path minimizing the free energy. Taking advantage of the linear mapping, we can easily relate embedding density to conformational density. 
The embedding density estimator is given by:\n\\[\\hat{E}(z) = \\frac{1}{n}\\sum_{i=1}^nK_G(\\hat{z_i}, \\Sigma_s;z)\\]\nwhere \\(K_G(\\mu, \\Sigma;z)\\) is the probability density function of the multivariate Gaussian with mean \\(\\mu\\) and covariance \\(\\Sigma\\), evaluated at \\(z\\), and \\(\\Sigma_s\\) is set using the Silverman rule. The conformational density can be related as follows:\n\\[\\overline{E}(z)=\\overline{G}(z)*d(z)\\]\nwhere \\(\\overline{E}(z)\\) is the expectation of the embedding density \\(\\hat{E}(z)\\); \\(\\overline{G}(z)\\) is the expectation of \\(\\hat{G}(z)=\\frac{1}{n}\\sum_{i=1}^nK_G(0,\\Sigma_{z_i}+\\Sigma_s;z)\\), which is called the embedding uncertainty; \\(d(z)\\) is the conformational density corresponding to \\(z\\); \\(*\\) is the convolution operation.\n\n\nMotion recovery\nGiven the conformational density estimated in the last step, denoted by \\(\\hat{d}(z)\\), a start state \\(z_{st}\\) and an end state \\(z_{end}\\), we can find a trajectory \\(Z(t):\\mathbb{R}^+\\rightarrow\\mathbb{R}^d\\) in latent space by computing the value function:\n\\[v(z):=\\underset{Z(t)}{\\mathrm{inf}}\\int_{t=0}^{t=T_a}\\hat{d}(Z(t))^{-1}dt\\]\nsubject to \\[Z(0)=z, Z(T_a)=z_{end}, \\lVert \\frac{d}{dt}Z(t)\\rVert=1; T_a = \\min\\{t \\mid Z(t)=z_{end}\\}\\]\nIn simple words, \\(v(z)\\) computes the minimum accumulated inverse density required to reach \\(z_{end}\\) starting from \\(z\\). \\(v(z)\\) is the viscosity solution of the Eikonal equation:\n\\[\\hat{d}(z)|\\nabla v(z)|=1, \\forall z\\in B\\setminus \\{z_{end}\\}; v(z_{end})=0\\]\nwhere \\(B\\) is the domain of interest, and \\(v(z)\\) can be obtained by solving this partial differential equation. 
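The embedding density estimator above is a Gaussian kernel density estimate. A minimal 1D sketch under our own simplification (a single scalar Silverman bandwidth standing in for the per-point covariances):

```python
import math

def embedding_density(z, embeddings, bandwidth=None):
    """Gaussian KDE estimate of the embedding density at z (1D sketch)."""
    n = len(embeddings)
    if bandwidth is None:
        # Silverman's rule of thumb for 1D data (assumes nonzero spread)
        mean = sum(embeddings) / n
        std = math.sqrt(sum((e - mean) ** 2 for e in embeddings) / n)
        bandwidth = 1.06 * std * n ** (-1 / 5)
    # average of Gaussian kernels centered at each embedding
    return sum(
        math.exp(-0.5 * ((z - e) / bandwidth) ** 2)
        / (bandwidth * math.sqrt(2 * math.pi))
        for e in embeddings
    ) / n

# Two clusters of embeddings: density is high at the clusters,
# low in the gap between them
emb = [0.0, 0.1, -0.1, 2.0, 2.1, 1.9]
print(embedding_density(0.0, emb) > embedding_density(1.0, emb))
```

The path-finding step exploits exactly this contrast: regions of high estimated density correspond to low free energy, so the recovered trajectory prefers them over the sparse gap.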
Once \\(v(z)\\) is solved, the optimal trajectory can be obtained by finding the path orthogonal to the level curves of \\(v(z)\\), which can be computed numerically using the steepest gradient descent on \\(v(z)\\) starting from \\(z_{st}\\).\n\n\n\nVisualization of the steepest gradient descent on the level curve of v(z)"
  },
  {
    "objectID": "posts/RECOVAR/index.html#results",
    "href": "posts/RECOVAR/index.html#results",
    "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods",
    "section": "Results",
    "text": "Results\n\nResults of public datasets\nThe original paper of RECOVAR presents results on the precatalytic spliceosome dataset (EMPIAR-10180), the integrin dataset (EMPIAR-10345) and the ribosomal subunit dataset (EMPIAR-10076), all of which are public datasets and can be accessed from https://www.ebi.ac.uk/empiar/.\nResults on EMPIAR-10180 focus on conformational heterogeneity. Three local maxima in conformational density were identified, and a path between two of them shows the arm regions moving down followed by the head regions moving up.\n\n\n\nLatent space and volume view of precatalytic spliceosome conformational heterogeneity. The latent view of the path is projected on the plane formed by different combinations of two principal components.\n\n\nEMPIAR-10345 contains both conformational and compositional heterogeneity. Two local maxima were found, with the smaller one corresponding to a composition not reported by previous studies. A motion of the arm was also found along the path.\n\n\n\nRECOVAR finds both conformational and compositional heterogeneity within integrin\n\n\nEMPIAR-10076 is used to show the ability of RECOVAR to find stable states. 
RECOVAR finds two stable states of the 70S ribosome.\n\n\n\nThe volumes of the two stable states are reconstructed, corresponding to two peaks in the density\n\n\n\n\nResults of SARS-CoV2 datasets\nWe also tested RECOVAR on our own dataset, which contains 271,448 SARS-CoV2 spike protein particles, extracted using CryoSparc. Some of these particles are bound to human angiotensin-converting enzyme 2 (ACE2), an enzyme on the human cell membrane targeted by the SARS-CoV2 spike protein. Therefore, this dataset has both compositional and conformational heterogeneity.\nAfter obtaining an ab-initio model from CryoSparc, we ran RECOVAR with a dimension of 4 and a relatively small grid size of 128. K-Means clustering was performed to find 5 cluster centers among the embeddings.\nHere we present two volumes reconstructed from center 0 and center 1, showing clear compositional heterogeneity, where ACE2 is clearly present in center 0 and missing in center 1.\n\n\n\nCompositional heterogeneity in the spike protein dataset. The spot where ACE2 is present/absent is highlighted by the red circle.\n\n\nA path between center 0 and 1 was analyzed to study the conformational changes adopted by the spike protein to bind to ACE2. We can see that the arm in the RBD region lifts in order to bind to ACE2.\n\n\n\nConformational changes along the path between center 0 and 1, highlighted by the yellow circle"
  },
  {
    "objectID": "posts/RECOVAR/index.html#discussion",
    "href": "posts/RECOVAR/index.html#discussion",
    "title": "Heterogeneity analysis of cryo-EM data of proteins dynamic in conformation and composition using linear subspace methods",
    "section": "Discussion",
    "text": "Discussion\nRECOVAR has several advantages over other heterogeneity analysis methods. Besides the high interpretability we mentioned before, RECOVAR has been shown to discover compositional heterogeneity, which cannot be handled by some popular deep learning methods like 3DFlex. 
Moreover, RECOVAR has far fewer hyper-parameters to tune than deep learning models. The main hyper-parameter the user needs to specify is the number of principal components to use, which trades off the amount of heterogeneity captured against computational cost.\nOne problem RECOVAR shares with many other heterogeneity analysis algorithms is that it requires a homogeneous consensus model and image poses as input. The estimation of the consensus model is often biased by heterogeneity, while the heterogeneity analysis assumes the input consensus model is correct (a circular dependency!). Nevertheless, we would expect this issue to be solvable by an EM-style algorithm that iteratively reconstructs the consensus model and performs heterogeneity analysis. In the future we may also be interested in benchmarking against pose estimation errors and other parameters such as the number of principal components, grid size, and particle number, which was not done in the original paper.\nAnother drawback of RECOVAR is that the density-based path recovery approach is computationally expensive: the cost increases exponentially with dimension. In practice, our NVIDIA 24GB GPU could handle at most a dimension of 4, which is usually insufficient to capture enough heterogeneity in cryo-EM datasets with low SNR. We need cheaper ways of finding paths that do not require estimating densities. We are also interested in methods to quantify the compositional heterogeneity along the path, e.g. the probability of SARS-CoV2 spike proteins binding to ACE2 in a certain conformation.\nLast but not least, it would be much easier for structural biologists to study the heterogeneity if we could extend movies of density maps to movies of atomic models. This requires fitting atomic models to density maps. Since the density maps in a movie are very similar, we don’t want to fit from scratch every time. 
Instead, a better approach would be to fit an initial model and then locally update it for each density map."
  },
  {
    "objectID": "posts/quasiconformalmap/index.html#theorem",
    "href": "posts/quasiconformalmap/index.html#theorem",
    "title": "Quasiconformal mapping for shape representation",
    "section": "Theorem",
    "text": "Theorem"
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html",
    "href": "posts/outlier-detection/DeCOr-MDS.html",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "",
    "text": "Multidimensional scaling (MDS) is known to be sensitive to orthogonal outliers. We present here a robust MDS method, called DeCOr-MDS, short for Detection and Correction of Orthogonal outliers using MDS. DeCOr-MDS takes advantage of geometrical characteristics of the data to reduce the influence of orthogonal outliers and to estimate the dimension of the dataset. The full paper is available at Li et al. (2023)."
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html#multidimensional-scaling-mds",
    "href": "posts/outlier-detection/DeCOr-MDS.html#multidimensional-scaling-mds",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "Multidimensional scaling (MDS)",
    "text": "Multidimensional scaling (MDS)\nMDS is a statistical technique used for visualizing data points in a low-dimensional space, typically two or three dimensions. It is particularly useful when the data is represented in the form of a distance matrix, where each entry indicates the distance between pairs of items. MDS aims to place each item in this lower-dimensional space in such a way that the distances between the items are preserved as faithfully as possible. 
This allows complex, high-dimensional data to be more easily interpreted, as the visual representation can reveal patterns, clusters, or relationships among the data points that might not be immediately apparent in the original high-dimensional space. MDS is widely used in fields such as psychology, market research, and bioinformatics for tasks like visualizing similarities among stimuli, products, or genetic sequences (Carroll and Arabie 1998; Hout, Papesh, and Goldinger 2013)."
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html#orthogonal-outliers",
    "href": "posts/outlier-detection/DeCOr-MDS.html#orthogonal-outliers",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "Orthogonal outliers",
    "text": "Orthogonal outliers\nOutlier detection has been widely used in biological data. Shieh and Hung proposed a method using principal component analysis (PCA) and robust estimation of Mahalanobis distances to detect outlier samples in microarray data (Shieh and Hung 2009). Chen et al. reported the use of two PCA methods to uncover outlier samples in multiple simulated and real RNA-seq data (Oh, Gao, and Rosenblatt 2008). Outlier influence can be mitigated depending on the specific type of outlier. In-plane outliers and bad leverage points can be handled using the \\(\\ell_1\\)-norm (Forero and Giannakis 2012), correntropy, or M-estimators (Mandanas and Kotropoulos 2017). Outliers which violate the triangle inequality can be detected and corrected based on their pairwise distances (Blouvshtein and Cohen-Or 2019). Orthogonal outliers are another particular case, where outliers have a large component orthogonal to the subspace where most of the data is located. These outliers often do not violate the triangle inequality, and thus require an alternative approach."
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#decoding-hippocampal-gain-mathcalh",
    "href": "posts/Neural-Manifold/index.html#decoding-hippocampal-gain-mathcalh",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Decoding Hippocampal Gain (\\(\\mathcal{H}\\))",
    "text": "Decoding Hippocampal Gain (\\(\\mathcal{H}\\))\n\nFinal Step\nThe final step is to decode \\(\\mathcal{H}\\) from the parametrization. The method to do this is straightforward. Once we have parametrized the spline accurately to the neural data, we calculate the hippocampal gain by comparing the distance/angle traveled in the neural manifold (derived from our spline) to the distance/angle in the lab frame (actual movement of the rat).\nThe idea is that:\n\\[\n\\mathcal{H} = \\frac{d\\theta_\\mathcal{H}}{d\\theta_\\mathcal{L}}\n\\]\nwhere \\(\\theta_\\mathcal{H}\\) is the angle in the hippocampal reference frame, decoded from our spline parametrization of the neural manifold, and \\(\\theta_\\mathcal{L}\\) is the physical angle traveled by the rat in the lab frame.\nNote that this is actually just the original definition of \\(\\mathcal{H}\\), but now \\(\\theta_\\mathcal{H}\\) is determined by our spline parameter, not the Fourier Transform method.\nFor example, let’s take a time interval, say 1–2 seconds. To determine the hippocampal gain within that frame, we observe where the neural activity at times 1 and 2 maps in our manifold, calling these \\(\\theta_{\\mathcal{H}1}\\) and \\(\\theta_{\\mathcal{H}2}\\), respectively. Then, using the lab frame angles at times 1 and 2, which we’ll call \\(\\theta_{\\mathcal{L}1}\\) and \\(\\theta_{\\mathcal{L}2}\\), we find that:\n\\[\n \\mathcal{H}(\\text{between } t=1 \\text{ and } t=2) = \\frac{\\theta_{\\mathcal{H}2} - \\theta_{\\mathcal{H}1}}{\\theta_{\\mathcal{L}2} - \\theta_{\\mathcal{L}1}}\n\\]\nWe extend the above example to all consecutive time points in the experiment to compute hippocampal gain (\\(\\mathcal{H}\\)) dynamically. 
The following Python code demonstrates how this is implemented:\n\ndef differentiate_and_smooth(data=None, window_size=3):\n #Compute finite differences.\n diffs = np.diff(data)\n \n # Compute the moving average of differences.\n kernel = np.ones(window_size) / window_size\n avg_diffs = np.convolve(diffs, kernel, mode='valid') \n \n return avg_diffs\n\nderivative_decoded_angle_rad_unwrap = differentiate_and_smooth(data=filtered_decoded_angles_unwrap, window_size=60) #hippocampal angle from manifold parametrization.\nderivative_true_angle_rad_unwrap = differentiate_and_smooth(data=binned_true_angle_rad_unwrap, window_size=60) #true angle from session recordings.\nderivative_hipp_angle_rad_unwrap = differentiate_and_smooth(data=binned_hipp_angle_rad_unwrap, window_size=60) #hippocampal angle from Fourier Transform (traditional method, can be thought of as ground truth).\n\n\ndecode_H = (derivative_decoded_angle_rad_unwrap) / (derivative_true_angle_rad_unwrap) #take the \"derivative\" of hippocampal angle at each time point and divide by \"derivative\" of true angle at each time point.\n\n#Now, plot H from manifold optimization vs H from traditional method (shown in results).\nThis code calculates the hippocampal gain, \\(\\mathcal{H}\\), by dividing the derivative of the hippocampal angle (obtained from the manifold parameterization) by the derivative of the true angle (obtained from session recordings). The result can be compared to \\(\\mathcal{H}\\) from the traditional Fourier-based method, as shown in the results section." 
+ "objectID": "posts/outlier-detection/DeCOr-MDS.html#height-and-volume-of-n-simplices", + "href": "posts/outlier-detection/DeCOr-MDS.html#height-and-volume-of-n-simplices", + "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets", + "section": "Height and Volume of n-simplices", + "text": "Height and Volume of n-simplices\nWe recall some geometric properties of simplices, which our method is based on. For a set of \\(n\\) points \\((x_1,\\ldots, x_n)\\), the associated \\(n\\)-simplex is the polytope of vertices \\((x_1,\\ldots, x_n)\\) (a 3-simplex is a triangle, a 4-simplex is a tetrahedron and so on). The height \\(h(V_{n},x)\\) of a point \\(x\\) belonging to a \\(n\\)-simplex \\(V_{n}\\) can be obtained as (Sommerville 1929), \\[\n h(V_{n},x) = n \\frac{V_n}{V_{n-1}},\n\\tag{1}\\] where \\(V_{n}\\) is the volume of the \\(n\\)-simplex, and \\(V_{n-1}\\) is the volume of the \\((n-1)\\)-simplex obtained by removing the point \\(x\\). \\(V_{n}\\) and \\(V_{n-1}\\) can be computed using the pairwise distances only, with the Cayley-Menger formula (Sommerville 1929):\n\\[\\begin{equation}\n\\label{eq:Vn}\nV_n = \\sqrt{\\frac{\\vert det(CM_n)\\vert}{2^n \\cdot (n!)^2}},\n\\end{equation}\\]\nwhere \\(det(CM_n)\\) is the determinant of the Cayley-Menger matrix \\(CM_n\\), that contains the pairwise distances \\(d_{i,j}=\\left\\lVert x_i -x_j \\right\\rVert\\), as \\[\\begin{equation}\n CM_n = \\left[ \\begin{array}{cccccc} 0 & 1 & 1 & ... & 1 & 1 \\\\\n\n 1 & 0 & d_{1,2}^2 & ... & d_{1,n}^2 & d_{1,n+1}^2 \\\\\n 1 & d_{2,1}^2 & 0 & ... & d_{2,n}^2 & d_{2,n+1}^2 \\\\\n ... & ... & ... & ... & ... & ... \\\\\n 1 & d_{n,1}^2 & d_{n,2}^2 & ... & 0 & d_{n,n+1}^2 \\\\\n 1 & d_{n+1,1}^2 & d_{n+1,2}^2 & ... 
& d_{n+1,n}^2 & 0 \\\\\n \\end{array}\\right].\n\\end{equation}\\]"
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#results",
    "href": "posts/Neural-Manifold/index.html#results",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Results",
    "text": "Results\nWe now display and discuss the results. Below are a few results from applying this method to real experimental data from “Control and recalibration of path integration in place cells” (Madhav et al. (2024)). We first show two “good” trials (sessions 50 and 36), and two “bad” trials (sessions 26 and 29). We had trials where our data did not trace out a 1D ring topology in the point cloud, as can be clearly seen from the spline parametrization (and which can be easily assessed quantitatively using persistent homology). I will explain more clearly below what we mean by “good” and “bad”.\n\nPoint clouds and parametrization\n\n\n\n\nFigure 9 - Embeddings for both successful and unsuccessful trials: (a) Session 50 (top) and Session 36 (bottom) show embeddings with and without the spline fit (in red), representing successful trials. (b) Session 26 (top) and Session 29 (bottom) show embeddings for unsuccessful trials, where the manifold does not form a clear 1D ring topology.\n\n\n\nNow we plot our H value decoded from the manifold versus the H value decoded from the Fourier Transform method and compare for “good” trials and “bad” trials.\n\n\nH values\n\n\nFigure 5 - Plot of manifold-decoded gain (red) vs. gain from the traditional method (blue) for different sessions: (a) Session 50, (b) Session 26, (c) Session 36, and (d) Session 29.\n\nAfter observing both successful and unsuccessful trials, I asked: what distinguishes “good” results from “bad” ones?\nIt became evident that the quality of results was strongly influenced by the number of neurons in the experimental recording. 
To quantify the quality of an embedding, I used the Structure Index (SI) score (Sebastian, Esparza, and Prida (2022)). The SI score measures how well the hippocampal angle is distributed across the point cloud.\n\nSI ranges from 0 to 1:\n\n0: The hippocampal angle is randomly distributed within the point cloud.\n1: The hippocampal angle is perfectly distributed, indicating a clear and accurate representation.\n\n\nThus, a higher SI score corresponds to a better alignment between the hippocampal angle and the manifold parameterization.\n\n\nResults\nConsider the trials discussed earlier:\n\nSuccessful trials (Sessions 50 and 36): SI scores were 0.89 and 0.9, respectively.\nUnsuccessful trials (Sessions 26 and 29): SI scores were 0.34 and 0.67, respectively.\n\nThe plot below illustrates the relationship between the number of neurons (or clusters) and the SI score. This highlights what I refer to as the “curse of clusters”: a minimum number of clusters (neurons) is required to achieve a successful trial.\n\n\n\n\nFigure 10 - Relationship between number of clusters (neurons) and SI score.\n\n\n\nThis shows that trials with fewer than 35 clusters (neurons) are more likely to fail, while those with more than 35 clusters generally produce high-quality embeddings with accurate parameterization. We determined that accurate \\(\\mathcal{H}\\) decoding requires at least 35 neurons in the recording. The plot below shows the relationship between the number of clusters and the \\(\\mathcal{H}\\) decode error. 
The \\(\\mathcal{H}\\) decode error is calculated as \\[\n\\text{mean} \\, \\mathcal{H} \\, \\text{decode error} = \\frac{1}{n} \\sum_{i=1}^{n} \\left( H_{\\text{decode}}[i] - H_{\\text{traditional}}[i] \\right),\n\\]\nwhere the sum is taken over all time indices in each array, and \\(n\\) is the number of time points.\n\n\n\nFigure 11 - Plot of number of clusters (neurons) vs mean \\(\\mathcal{H}\\) error.\n\n\n\nThe majority of trials with more than 35 clusters (neurons) have a mean \\(\\mathcal{H}\\) decode error of less than 0.01. However, some trials with more than 35 clusters exhibit a higher decode error.\nThe reason for this discrepancy lies in the topology of the manifold produced by CEBRA. Even when the trial appears “good” based on the SI metric, CEBRA does not always produce a 1D ring topology, which is crucial for accurate \\(\\mathcal{H}\\) decoding.\nAddressing this limitation will be part of the next steps in our methodology."
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html#sec-part1",
    "href": "posts/outlier-detection/DeCOr-MDS.html#sec-part1",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "Orthogonal outlier detection and dimensionality estimation",
    "text": "Orthogonal outlier detection and dimensionality estimation\nWe now consider a dataset \\(\\mathbf{X}\\) of size \\(N\\times d\\), where \\(N\\) is the sample size and \\(d\\) the dimension of the data. We associate with \\(\\mathbf{X}\\) a matrix \\(\\mathbf{D}\\) of size \\(N\\times N\\), which represents all the pairwise distances between observations of \\(\\mathbf{X}\\). We also assume that the data points can be mapped into a vector space with regular observations that form a main subspace of unknown dimension \\(d^*\\) with some small noise, and additional orthogonal outliers of relatively large orthogonal distance to the main subspace (see Figure 1.A). 
Our proposed method aims to infer from \\(\\mathbf{D}\\) the dimension of the main data subspace \\(d^*\\), using the geometric properties of simplices with respect to their number of vertices: Consider an \\((n+2)\\)-simplex containing a data point \\(x_i\\) and its associated height, which can be computed using Equation 1. When \\(n<d^*\\) and for a large enough number \\(S\\) of sampled simplices, the distribution of heights obtained from different simplices containing \\(x_i\\) remains similar, whether \\(x_i\\) is an orthogonal outlier or a regular observation (see Figure 1.B). In contrast, when \\(n\\geq d^*\\), the median of these heights approximately yields the distance of \\(x_i\\) to the main subspace (see Figure 1.C). This distance should be significantly larger when \\(x_i\\) is an orthogonal outlier, compared with regular points, for which these distances are comparable to the noise.\n\n\n\n\n\n\nFigure 1: Example of a dataset with orthogonal outliers and n-simplices. Representation of a dataset with regular data points (blue) belonging to a main subspace of dimension 2 with some noise, and orthogonal outliers (red triangle symbols) in the third dimension. View of two instances of 3-simplices (triangles), one with only regular points (left) and the other one containing one outlier (right). The height drawn from the outlier is close to the height of the regular triangle. Upon adding other regular points to obtain tetrahedrons (4-simplices), the height drawn from the outlier (right) becomes significantly larger than the height drawn from the same point (left).\n\n\n\nTo estimate \\(d^*\\) and for a given dimension \\(n\\) tested, we thus randomly sample, for every \\(x_i\\) in \\(\\mathbf{X}\\), \\(S\\) \\((n+2)\\)-simplices containing \\(x_i\\), and compute the median of the heights \\(h_i^n\\) associated with these \\(S\\) simplices. 
Upon considering, as a function of the dimension \\(n\\) tested, the distribution of median heights \\((h_1^{n},...,h_N^{n})\\), we then identify \\(d^*\\) as the dimension at which this function presents a sharp transition towards a highly peaked distribution at zero. To do so, we compute \\(\\tilde{h}_n\\) as the mean of \\((h_1^{n},...,h_N^{n})\\), and estimate \\(d^*\\) as\n\\[\\begin{equation}\n \\bar{n}=\\underset{n}{\\operatorname{argmax}} \\frac{\\tilde{h}_{n-1}}{\\tilde{h}_{n}}.\n \\label{Eq:Dim}\n\\end{equation}\\]\nFurthermore, we detect orthogonal outliers using the distribution obtained at \\(\\bar{n}\\), as the points for which \\(h_i^{\\bar{n}}\\) largely stands out from \\(\\tilde{h}_{\\bar{n}}\\). To do so, we compute \\(\\sigma_{\\bar{n}}\\), the standard deviation observed for the distribution \\((h_1^{\\bar{n}},...,h_N^{\\bar{n}})\\), and obtain the set of orthogonal outliers \\(\\mathbf{O}\\) as\n\\[\n \\mathbf{O}= \\left\\{ i\\;|\\;h_i^{\\bar{n}}> \\tilde{h}_{\\bar{n}} + c \\times \\sigma_{\\bar{n}} \\right\\},\n\\tag{2}\\]\nwhere \\(c>0\\) is a parameter set to achieve a reasonable trade-off between outlier detection and false detection of noisy observations."
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#next-steps",
    "href": "posts/Neural-Manifold/index.html#next-steps",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Next steps",
    "text": "Next steps\n\nApply the Method to Raw, Unfiltered Spike Data\nInstead of relying on manual, ad hoc clustering to identify neurons and spike trains, we propose applying CEBRA directly to the raw recorded neural data. This approach could help with issues related to the “curse of clusters,” as it eliminates the dependency on clustering quality and the number of detected clusters.\nSimulate an Online Environment\nTest whether this method can be applied in a simulated “online” experimental environment. 
This would involve decoding neural representations in real time during an experiment, enabling closed-loop feedback and dynamic manipulation of experimental variables.\nModify the CEBRA Loss Function\nAdapt the CEBRA loss function to incorporate constraints that bias the resulting point cloud to lie on a desired topology. For instance, by guiding the embedding toward a 1D ring or a higher-dimensional structure, we could improve the consistency and interpretability of the manifold representation."
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html#correcting-the-dimensionality-estimation-for-a-large-outlier-fraction",
    "href": "posts/outlier-detection/DeCOr-MDS.html#correcting-the-dimensionality-estimation-for-a-large-outlier-fraction",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "Correcting the dimensionality estimation for a large outlier fraction",
    "text": "Correcting the dimensionality estimation for a large outlier fraction\nThe method presented in the previous section assumes that at dimension \\(d^*\\), the median height calculated for each point reflects the distance to the main subspace. This assumption is valid when the fraction of orthogonal outliers is small enough, so that the sampled \\(n\\)-simplex likely contains regular observations only, aside from the evaluated point. However, if the number of outliers gets large enough so that a significant fraction of \\(n\\)-simplices drawn to compute a height also contain outliers, then the calculated heights would yield the distance between \\(x_i\\) and an outlier-containing hyperplane, whose dimension is larger than that of a hyperplane containing only regular observations. 
The apparent dimensionality of the main subspace would thus increase and generate a positive bias on the estimate of \\(d^*\\).\nSpecifically, if \\(\\mathbf{X}\\) contains a fraction \\(p\\) of outliers, and if we consider \\(o_{n,p,N}\\), the number of outliers drawn after uniformly sampling \\(n+1\\) points (to test the dimension \\(n\\)), then \\(o_{n,p,N}\\) follows a hypergeometric law, with parameters \\(n+1\\), the fraction of outliers \\(p=N_o/N\\), and \\(N\\). Thus, the expected number of outliers drawn from a sampled simplex is \\((n+1) \\times p\\). After estimating \\(\\bar{n}\\) (from Section 3.1), and finding a proportion of outliers \\(\\bar p = |\\mathbf{O}|/N\\) using Equation 2, we hence correct \\(\\bar{n}\\) by subtracting the estimated bias \\(\\delta = \\lfloor (\\bar{n}+1) \\times \\bar p \\rfloor\\), the integer part of the expectation of \\(o_{\\bar{n},\\bar{p},N}\\), so the debiased dimensionality estimate \\(n^*\\) is\n\\[\\begin{equation}\n n^* =\\bar{n} - \\lfloor (\\bar{n}+1) \\times \\bar p \\rfloor.\n \\label{eq:corrected_n}\n\\end{equation}\\]"
  },
  {
    "objectID": "posts/Neural-Manifold/index.html#conclusion",
    "href": "posts/Neural-Manifold/index.html#conclusion",
    "title": "Understanding Animal Navigation using Neural Manifolds With CEBRA",
    "section": "Conclusion",
    "text": "Conclusion\nIn this work, we demonstrated the power of CEBRA to decode hippocampal gain (\\(\\mathcal{H}\\)) at finer temporal resolutions without relying on traditional Fourier transform-based approaches. By embedding neural population activity into a low-dimensional latent space that captures the underlying topological structure of the experimental task, we successfully reconstructed a 1D ring manifold corresponding to the rat’s hippocampal reference frame. Persistent homology validated the circular topology, and the SPUD method was used to parametrize the manifold, enabling the decoding of hippocampal gain.\nWe found that at least 35 well-isolated clusters (neurons) were needed for robust manifold estimation. 
Below this threshold, the quality and topology of the embeddings degraded, leading to inaccurate \\(\\mathcal{H}\\) decoding. Despite these issues, the results demonstrate the potential of manifold learning for experimental tasks of this type. This work will enable new experiments for causal modeling of the neural circuits responsible for cognitive representations."
  },
  {
    "objectID": "posts/outlier-detection/DeCOr-MDS.html#outlier-distance-correction",
    "href": "posts/outlier-detection/DeCOr-MDS.html#outlier-distance-correction",
    "title": "Orthogonal outlier detection and dimension estimation for improved MDS embedding of biological datasets",
    "section": "Outlier distance correction",
    "text": "Outlier distance correction\nUpon identifying the main subspace containing regular points, our procedure finally corrects the pairwise distances that contain outliers in the matrix \\(\\mathbf{D}\\), in order to apply an MDS that projects the outliers in the main subspace. In the case where the original coordinates cannot be used (e.g., as a result of some transformation or if the distance is non-Euclidean), we perform the following two steps: (i) We first apply an MDS on \\(\\mathbf{D}\\) to place the points in a Euclidean space of dimension \\(d\\), as a new matrix of coordinates \\(\\tilde{X}\\). (ii) We run a PCA on the full coordinates of the estimated set of regular data points (i.e. \\(\\tilde{X}\\setminus \\mathbf{O}\\)), and project the outliers along the first \\(n^*\\) principal components of the PCA, since these components are sufficient to generate the main subspace. Using the projected outliers, we accordingly update the pairwise distances in \\(\\mathbf{D}\\) to obtain the corrected distance matrix \\(\\mathbf{D^*}\\). Note that in the case where \\(\\mathbf{D}\\) derives from a Euclidean distance between the original coordinates, we can skip step (i), and directly run step (ii) on the full coordinates of the estimated set of regular data points."
  },
  {
    "objectID": "posts/ribosome-landmarks/index.html",
    "href": "posts/ribosome-landmarks/index.html",
    "title": "Defining landmarks for the ribosome exit tunnel",
    "section": "",
    "text": "The ribosome is present in all domains of life, though it exhibits varying conservation across phylogeny. It has been found that, as translation proceeds, the nascent polypeptide chain interacts with the tunnel, and as such, tunnel geometry plays a role in translation dynamics and resulting protein structures1. With advances in imaging of ribosome structure with Cryo-EM, there is ample data on which geometric analysis of the tunnel may be applied, and therefore a need for more computational tools to do so2."
  },
  {
    "objectID": "posts/ribosome-landmarks/index.html#introduction",
    "href": "posts/ribosome-landmarks/index.html#introduction",
    "title": "Defining landmarks for the ribosome exit tunnel",
    "section": "",
    "text": "The ribosome is present in all domains of life, though it exhibits varying conservation across phylogeny. It has been found that, as translation proceeds, the nascent polypeptide chain interacts with the tunnel, and as such, tunnel geometry plays a role in translation dynamics and resulting protein structures1. With advances in imaging of ribosome structure with Cryo-EM, there is ample data on which geometric analysis of the tunnel may be applied, and therefore a need for more computational tools to do so2."
+ "text": "The ribosome is present in all domains of life, though exhibits varying conservation across phylogeny. It has been found that, as translation proceeds, the nascent polypeptide chain interacts with the tunnel, and as such, tunnel geometry plays a role in translation dynamics and resulting protein structures1. With advances in imaging of ribosome structure with Cryo-EM, there is ample data on which geometric analysis of the tunnel may be applied and therefore a need for more computational tools to do so2." }, { - "objectID": "posts/landmarks-final/index.html#protocol-overview", - "href": "posts/landmarks-final/index.html#protocol-overview", - "title": "Landmarking the ribosome exit tunnel", - "section": "Protocol Overview", - "text": "Protocol Overview\nLandmarks assigned on the surface of the tunnel are defined as the mean atomic coordinates of conserved residues that are close to the tunnel surface. The general steps in the protocol are:\n\nRun Multiple Sequence Alignment (MSA) on the relevant polymers and select residues that are above a conservation threshold.\nOf the conserved residues, select only the residues that are within a distance threshold of the tunnel as represented by the Mole model1.\nExtract the 3D coordinates of the selected residues.\n\n\n\n\n\n\n\nFigure 1: Landmarks shown in blue on a mesh representation of the 4UG0 tunnel, with proteins shown for reference (uL4 in pink, uL22 in green, and eL39 in yellow)." + "objectID": "posts/ribosome-landmarks/index.html#background", + "href": "posts/ribosome-landmarks/index.html#background", + "title": "Defining landmarks for the ribosome exit tunnel", + "section": "Background", + "text": "Background\nIn order to preform geometric shape analysis on the ribosome, we must first superimpose mathematical definitions onto this biological context. Among others, one way of defining shape mathematically is with a set of landmarks. 
A landmark is a labelled point on some structure, which, biologically speaking, has some meaning. After removing the effects of translation, scaling, and rotation, sets of landmarks form a shape space, on which statistical analysis may be applied.\nAssigning landmarks to biological shapes is not a new idea; many examples involve defining landmarks as joins between bones or muscles, or as points along observed curves3. However, there has been little work in assigning landmarks to biological molecules, and none specifically to the ribosome exit tunnel. The challenge is that any one landmark must have comparable instances across shapes in the shape space, meaning that we cannot arbitrarily pick residues which we know to be near to the tunnel. Such residues must be conserved, and therefore present in each specimen, to be considered useful." }, { - "objectID": "posts/landmarks-final/index.html#implementation-details", - "href": "posts/landmarks-final/index.html#implementation-details", - "title": "Landmarking the ribosome exit tunnel", - "section": "Implementation Details", - "text": "Implementation Details\n\nSeparation by Kingdom\nThe protocol has two main entry points: main.py and main_universal.py. The main file assigns intra-kingdom landmarks; conserved residues are chosen only based on sequences from the given kingdom, meaning that the landmarks are specific to one of the three biological super-kingdoms (eukaryota, bacteria, and archaea). Using main, landmarks for one kingdom do not directly correspond to landmarks for another kingdom. While this separation prevents inter-kingdom comparison directly, it allows for a higher number of landmarks to be assigned to each specimen, due to higher degrees of conservation within kingdoms. The alternative is to use main_universal, which chooses conserved residues based on all sequences. 
This provides fewer landmarks per ribosome, but allows for inter-kingdom comparison, as each landmark will have correspondence across all specimens.\n\n\nData\nThe protocol uses data from RibosomeXYZ2 and the Protein Data Bank (PDB) via API access. For each ribosome structure, the protocol requests sequences and metadata (chain names, taxonomic information, etc.) from RibosomeXYZ for selected proteins and RNA and the full mmcif structural file from the PDB. This data is stored locally to facilitate repeated access during runtime.\n\n\nAlignments\nThe program uses MAFFT3 to perform Multiple Sequence Alignment (MSA) on all of the available sequences for each of the relevant polymers. It accesses sequence data from RibosomeXYZ polymer files. When the program is run on new specimens, if the sequences are not already in the input fasta files, they are automatically added and the alignments are re-run to include the new specimens.\n\n\n\n\n\n\nFigure 2: A visualization of a subsection of the MSA showing a highly conserved region of uL4.\n\n\n\n\n\nSelecting Landmarks\nLandmarks are selected using a prototype ribosome and based on conservation and distance. The program searches for landmarks only on polymers which are known to be close to the tunnel4.\n\n\n\nKingdom\nPrototype\nSelected Polymers\n\n\n\n\nEukaryota\n4UG0\nuL4, uL22, eL39, 25/28S rRNA\n\n\nBacteria\n3J7Z\nuL4, uL22, uL23, 23S rRNA\n\n\nArchaea\n4V6U\nuL4, uL22, eL39, 23S rRNA\n\n\nUniversal\n3J7Z\nuL4, uL22, 23/25/28S rRNA\n\n\n\nThe prototype IDs and polymers used in the protocol\n\nConservation\nTo be chosen as a landmark, residues must be at least 90% conserved. This threshold is a tuneable parameter. For each of the relevant polymers, the program iterates through each position in the MSA alignment file for that polymer and selects alignment positions for which at least 90% of specimens share the same residue. This excludes positions where gaps are the most common element. 
The program calls the below method on every column of the MSA for each of the relevant polymers to obtain a short-list of alignment positions to be considered for landmarks.\n\ndef find_conserved(column, threshold):\n counter = Counter(column)\n mode = counter.most_common(1)[0]\n \n if (mode[0] != '-' and mode[1] / len(column) >= threshold):\n return mode[0]\n \n return None\n\n\n\nDistances\nFor each candidate conserved position, the program first locates the residue’s coordinates on the prototype specimen (see Section 3.5 for more detail). For each prototype, I have run the Mole tunnel search algorithm to extract the centerline coordinates of the tunnel and the radius at each point. Then for each candidate landmark \\(p_l\\), I find the nearest centerline point \\(p_c\\) by euclidean distance, and compute the distance from \\(p_l\\) to the sphere centered at \\(p_c\\) with the given radius \\(r_c\\): \\[ d = ||p_l - p_c|| - r_c\\] If \\(d\\) is less than the distance threshold, the candidate position is considered a landmark. See the code below for reference:\n\ndef find_closest_point(p, instance):\n coords = get_tunnel_coordinates(instance)\n dist = np.inf\n r = 0\n p = np.array(p)\n \n for coord in coords.values():\n xyz = np.array(coord[0:3])\n euc_dist = np.sqrt(np.sum(np.square(xyz - p))) - coord[3]\n if euc_dist < dist:\n dist = euc_dist\n \n return dist\n\nEach selected landmark’s residue type and alignment location are saved to file, so that new ribosome specimens can use the list as a guideline.\n\n\n\nLocating landmarks\nLocating the chosen landmarks in the structural file for a given ribosome specimen is the most involved step of the protocol. Often, a ribosome mmcif file contains some gaps, due to experimental/imaging conditions. For this reason, I take an approach using methods from RibosomeXYZ’s backend2 to keep track of residue locations as sequences are manipulated (aligned, flattened to remove gaps, etc.). 
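(As an aside on the distance filter described above: the per-point loop in find_closest_point can also be written in vectorized NumPy form. This is an illustrative sketch only, not part of the protocol; min_surface_distance is a hypothetical name, and it assumes the centerline has been packed into an (N, 4) array whose rows are [x, y, z, r].)

```python
import numpy as np

def min_surface_distance(p, centerline):
    """Distance from point p to the nearest Mole sphere surface.

    centerline: (N, 4) array whose rows are [x, y, z, r].
    Computes d_i = ||p - c_i|| - r_i for all spheres at once and
    returns the minimum (negative means p is inside a sphere).
    """
    p = np.asarray(p, dtype=float)
    xyz = centerline[:, :3]
    r = centerline[:, 3]
    d = np.linalg.norm(xyz - p, axis=1) - r
    return float(d.min())

# Toy example: two spheres on the z-axis
centerline = np.array([[0.0, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 5.0, 2.0]])
print(min_surface_distance([0.0, 0.0, 3.0], centerline))  # 0.0 (on the r=2 sphere's surface)
```

A candidate residue would then be kept as a landmark when this value falls below the distance threshold.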
We have access to two copies of the sequence for each polymer: the sequence from the RibosomeXYZ polymer data, which is well formed, and the mmcif PDB sequence that is tied to the 3D structure, which often has gaps. The protocol makes use of both versions.\nThe PDB sequence is loaded into memory as an object using BioPython. This object holds all of the structural and hierarchical information present in the original file. This is more useful than working with sequences as strings. For example, indexing a protein sequence gives a unique residue object which holds structural information, rather than just a symbolic letter.\nI use the SeqenceMappingContainer class taken from RibosomeXYZ. The purpose of this class is to facilitate working with the PDB structural sequences with gaps. Initializing the class with a polymer sequence as a BioPython Chain object gives a ‘primary’ unchanged version of the sequence and a ‘flattened’ version with all gaps removed, as well as mappings for indices between the two. Given an index in the flattened sequence, we can use the maps to find the index in the primary sequence and therefore the author assigned residue IDs and structural information, and vice versa. 
This is the backbone of locating residues by sequence numbers from the landmark list on potentially gappy polymer sequences.\nThe algorithm for locating a landmark is as follows:\n\nAccess the aligned sequence from the MSA, and map the landmark from the location in the alignment to the location in the original RibosomeXYZ sequence for this polymer instance.\nPerform a pairwise sequence alignment on the original RibosomeXYZ sequence and the flattened PDB sequence.\nUse this pairwise alignment to map the landmark location in the original RibosomeXYZ sequence to the location in the flattened PDB sequence.\nFrom the flattened PDB sequence, use SeqenceMappingContainer mapping to find the residue ID in the primary PDB sequence, and use this ID to index the Residue object.\nEnsure that the residue type matches the landmark type (i.e. amino acids / nucleotides match), and return the mean coordinates of the atoms in the residue as the landmark coordinates.\n\nSee the following code:\n\ndef locate_residues(landmark: Landmark, \n polymer: str, \n polymer_id: str, \n rcsb_id: str, \n chain: Structure, \n flat_seq,\n kingdom: str = None) -> dict:\n \n '''\n This method takes a landmark centered on the alignment, and finds this residue on the given rcsb_id's polymer.\n Returns the residue's position and coordinates.\n \n landmark: the landmark to be located\n polymer: the polymer on which this landmark lies\n polymer_id: the polymer id specific to this rcsb_id\n rcsb_id: the id of the ribosome instance\n chain: the biopython Chain object holding the sequence\n flat_seq: from SequenceMappingContainer, tuple holding (seq, flat_index_to_residue_map, auth_seq_id_to_flat_index_map)\n kingdom: kingdom to which this rcsb_id belongs, or none if being called from main_universal.py\n '''\n \n # access aligned sequence from alignment files\n if kingdom is None:\n path = f\"data/output/fasta/aligned_sequences_{polymer}.fasta\"\n else:\n path = 
f\"data/output/fasta/aligned_sequences_{kingdom}_{polymer}.fasta\"\n alignment = AlignIO.read(path, \"fasta\")\n aligned_seq = get_rcsb_in_alignment(alignment, rcsb_id)\n \n # find the position of the landmark on the original riboXYZ seq\n alignment_position = map_to_original(aligned_seq, landmark.position) \n \n # access riboXYZ sequence (pre alignment)\n orig_seq = check_fasta_for_rcsb_id(rcsb_id, polymer, kingdom)\n\n if orig_seq is None:\n print(\"Cannot access sequence\")\n return\n \n # run pairwise alignment on the riboXYZ sequence and the flattened PDB sequence\n alignment = run_pairwise_alignment(rcsb_id, polymer_id, orig_seq, flat_seq[0])\n \n if alignment is None:\n return None\n \n # map the alignment_position from the original riboXYZ sequence to the pairwise-aligned flattened PDB sequence\n flattened_seq_aligned = alignment[1]\n flat_aligned_position = None\n if alignment_position is not None: \n flat_aligned_position = map_to_original(flattened_seq_aligned, alignment_position)\n \n if flat_aligned_position is None:\n print(f\"Cannot find {landmark} on {rcsb_id} {polymer}\")\n return None \n \n # use the MappingSequenceContainer flat_index_to_residue_map to access to residue in the PDB sequence\n resi_id = flat_seq[1][flat_aligned_position].get_id()\n residue = chain[resi_id]\n \n # check that the located residue is the same as the landmark\n landmark_1_letter = landmark.residue.upper()\n landmark_3_letter = ResidueSummary.one_letter_code_to_three(landmark_1_letter)\n if (residue.get_resname() != landmark_1_letter and residue.get_resname() != landmark_3_letter):\n return None\n \n # find atomic coordinates for the selected residue\n atom_coords = [atom.coord for atom in residue]\n if (len(atom_coords) == 0): \n return None\n \n # take the mean coordinate for the atoms in residue\n vec = np.zeros(3)\n for coord in atom_coords:\n tmp_arr = np.array([coord[0], coord[1], coord[2]])\n vec += tmp_arr\n vec = vec / len(atom_coords)\n vec = 
vec.astype(np.int32)\n \n return { \n \"parent_id\": rcsb_id, \n \"landmark\": landmark.name, \n \"residue\": landmark.residue, \n \"position\": resi_id[1],\n \"x\": vec[0], \"y\": vec[1], \"z\": vec[2]\n }\n\nSee the full algorithm here.\n\n\n\n\n\n\nFigure 3: Atoms versus landmarks near the tunnel, shown with the Mole model (dark blue) and the mesh model (light blue) for reference. a) All atoms within 10 Å of the tunnel centerline. b) Mean atomic coordinates of conserved residues within 7.5 Å of the spherical tunnel." + "objectID": "posts/ribosome-landmarks/index.html#protocol", + "href": "posts/ribosome-landmarks/index.html#protocol", + "title": "Defining landmarks for the ribosome exit tunnel", + "section": "Protocol", + "text": "Protocol\nBelow, I present a preliminary protocol for assigning landmarks to eukaryotic ribosome tunnels. The goal is to extrapolate to bacteria and archaea, as well as produce a combined dataset of landmarks which spans the kingdoms for inter-kingdom comparison. For now, I begin with eukaryota, taking advantage of the high degree of conservation between intra-kingdom ribosomes, as conserved sequences form the basis for this protocol.\nAs the goal for this dataset is to obtain landmarks that line the ribosome exit tunnel, I begin by selecting proteins and rRNA which interact with the tunnel: uL4, uL22, eL39, and 25/28S rRNA for Eukaryota1.\n\n\n\nFigure from Dao Duc et al. (2019) showing proteins affecting tunnel shape in E. coli and H. sapiens.\n\n\nThe full protocol is available here.\n\n1. Sequence Alignment\nIn order to assign landmarks which are comparable across ribosome specimens, I consider only the residues which are mostly conserved across our dataset of approximately 400 eukaryotes. 
To do so, I run Multiple Sequence Alignment (MSA) using MAFFT4 on the dataset for each of the chosen four polymer types and select residues from the MSA which are at least 90% conserved across samples.\n\n\n\nA visualization of a subsection of the MSA showing a highly conserved region of uL4.\n\n\nSelecting the most conserved residue at each position in the alignment:\n\n# Given an MSA column, return the most common element if it is at least as frequent as threshold\ndef find_conserved(column, threshold):\n counter = Counter(column)\n mode = counter.most_common(1)[0]\n \n if (mode[0] != '-' and mode[1] / len(column) >= threshold):\n return mode[0]\n \n return None\n\n\n\n2. Locating Residues\nTo locate the conserved residues, I first map the chosen loci from the MSA back to the corresponding loci in the original sequences:\n\nimport Bio\nfrom Bio.Seq import Seq\n\ndef map_to_original(sequence: Seq, position: int) -> int:\n '''\n Map a conserved residue position to the original sequence position.\n 'sequence' is the aligned sequence from MSA.\n '''\n # Initialize pointer to position in original sequence\n ungapped_position = 0\n \n # Iterate through each position in the aligned sequence\n for i, residue in enumerate(sequence):\n # Ignore any gaps '-'\n if residue != \"-\":\n # If we have arrived at the aligned position, return pointer to position in original sequence\n if i == position:\n return ungapped_position\n # Every time we pass a 'non-gap' before arriving at position, we increase pointer by 1\n ungapped_position += 1\n\n # Return None if the position is at a gap \n return None\n\nThen, using PyMol5, I retrieve the atomic coordinates of the residue from the CIF file. 
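As a quick, self-contained sanity check of the index mapping above (restating map_to_original without the BioPython type hints; the toy aligned sequence is hypothetical):

```python
def map_to_original(sequence, position):
    """Map an MSA column index back to the ungapped-sequence index (as above)."""
    ungapped_position = 0
    for i, residue in enumerate(sequence):
        if residue != "-":
            if i == position:
                return ungapped_position
            ungapped_position += 1
    # Falls through when 'position' points at a gap column
    return None

aligned = "AC--GT"                   # original ungapped sequence is "ACGT"
print(map_to_original(aligned, 4))   # column 4 holds 'G' -> index 2 in "ACGT"
print(map_to_original(aligned, 2))   # column 2 is a gap -> None
```

Note that a gap column maps to None, which is why downstream code has to handle missing landmarks explicitly.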
To obtain a single landmark per residue, I take the mean of the atomic coordinates for each residue as the landmark.\nBelow is example code for retrieving the atomic coordinates of W66 on 4UG0 uL4:\n\nfrom pymol import cmd\nimport numpy as np\nfrom Bio.SeqUtils import seq3\n\n# Specify the residue to locate\nparent = '4UG0'\nchain = 'LC'\nresidue = 'W'\nposition = 66\n\nif f'{parent}_{chain}' not in cmd.get_names():\n cmd.load(f'data/{parent}.cif', object=f'{parent}_{chain}')\n cmd.remove(f'not chain {chain}')\n \nselect = f\"resi {position + 1}\"\n \natom_coords = []\ncmd.iterate_state(1, select, 'atom_coords.append((chain, resn, x, y, z))', space={'atom_coords': atom_coords})\n \nif (len(atom_coords) != 0 and atom_coords[0][1] == seq3(residue).upper()): \n \n vec = np.zeros(3)\n for coord in atom_coords:\n tmp_arr = np.array([coord[2], coord[3], coord[4]])\n vec += tmp_arr\n\n vec = vec / len(atom_coords)\n vec = vec.astype(np.int32)\n \n print(f\"Coordinates: x: {vec[0]}, y: {vec[1]}, z: {vec[2]}\")\n\n\n\n3. Filtering landmarks by distance\nAmong the conserved residues on the selected polymers, many will be relatively far from the exit tunnel and not have any influence on tunnel geometry. Thus, I select only those residues which are close enough to the tunnel. In this protocol, a threshold of \\(7.5 \\mathring{A}\\) is applied.\nThis process is done by using MOLE 2.06, which is a biomolecular channel construction algorithm. The output is a list of points in \\(\\mathbb{R}^3\\) which form the centerline of the tunnel, and, for each point on the centerline, a tunnel radius.\nUsing the MSA, I locate the coordinates of the conserved residues (see Section 3.2). For each of the residues, find the closest tunnel centerline point in Euclidean space, and compute the distance from the residue to the sphere given by the radius at that centerline point. 
If this distance is less than the threshold, this conserved residue is close enough to the tunnel to be considered a landmark.\nFor efficiency, I only run the MOLE algorithm on one ‘prototype’ eukaryote to filter the landmarks, then use this filtered list as the list of landmarks to find on subsequent specimens.\nBelow is the code which checks landmark location against the tunnel points:\n\nimport numpy as np\n\ndef get_tunnel_coordinates(instance: str) -> dict[int,list[float]]:\n \n if instance not in get_tunnel_coordinates.cache:\n xyz = open(f\"data/tunnel_coordinates_{instance}.txt\", mode='r')\n xyz_lines = xyz.readlines()\n xyz.close()\n \n r = open(f\"data/tunnel_radius_{instance}.txt\", mode='r')\n r_lines = r.readlines()\n r.close()\n \n coords = {}\n \n for i, line in enumerate(xyz_lines):\n if (i >= len(r_lines)): break\n \n content = line.split(\" \")\n content.append(r_lines[i])\n \n cleaned = []\n for token in content:\n # Keep only the entries that parse as numbers\n try:\n val = float(token.strip())\n cleaned.append(val)\n except ValueError:\n pass\n \n coords[i] = cleaned\n get_tunnel_coordinates.cache[instance] = coords\n \n # Each value in coords is of the form [x, y, z, r]\n return get_tunnel_coordinates.cache[instance]\n\nget_tunnel_coordinates.cache = {}\n\n# p is a list [x,y,z]\n# instance is RCSB_ID code\ndef find_closest_point(p, instance):\n coords = get_tunnel_coordinates(instance)\n dist = np.inf\n p = np.array(p)\n \n for coord in coords.values():\n xyz = np.array(coord[0:3])\n euc_dist = np.sqrt(np.sum(np.square(xyz - p))) - coord[3]\n if euc_dist < dist:\n dist = euc_dist\n \n return dist\n\nFinally, plotting the results using PyMol:\n\n\n\nLandmarks shown in blue on a mesh representation of the 4UG0 tunnel, with proteins shown for reference (uL4 in pink, uL22 in green, and eL39 in yellow).\n\n\nFor information on the mesh representation of the tunnel used in the figure above, see ‘3D tessellation of biomolecular cavities’.\n\n\nNotes\n\nThe code in the post uses a package 
(pymol-open-source) which cannot be installed into a virtual environment. I have instead included a .yml file specifying my conda environment, which is used to run this code.\nThe code used to retrieve atomic coordinates from PyMol is not robust to inconsistencies in CIF file sequence numbering present in the PDB. My next step for improving this protocol will be to improve the handling of these edge cases." }, { - "objectID": "posts/landmarks-final/index.html#usage-instructions", - "href": "posts/landmarks-final/index.html#usage-instructions", - "title": "Landmarking the ribosome exit tunnel", - "section": "Usage Instructions", - "text": "Usage Instructions\nThe full protocol and datasets are available on GitHub. At the time of writing, the protocol has been run on all 1348 ribosomes currently available on RibosomeXYZ. Landmark coordinates (kingdom-specific and universal) can be found in data/output/landmarks.\nTo assign these initial landmarks, I compiled the sequences for the relevant polymers for all 1348 specimens into polymer-specific fasta files and ran MAFFT sequence alignment on each file. Then, I ran the code to select landmarks based on the full aligned files; therefore, conservation ratios for residues were based on all (currently) available data.\n\n\n\nKingdom\nNumber of specimens\nLandmarks per specimen\n\n\n\n\nEukaryota\n424\n83\n\n\nBacteria\n842\n60\n\n\nArchaea\n82\n47\n\n\nUniversal\n1348\n42\n\n\n\nDistribution of assigned landmarks across currently available ribosomes\nTo obtain landmarks on a ribosome specimen, first check if they have already been assigned. 
If not, the protocol can be run on new specimens as follows:\n\nUse main_universal.py to assign universal landmarks or main.py to assign kingdom-specific landmarks\nCreate a conda environment based on requirements.txt and activate it\nWith the activated environment, run the following command: python -m protocol file rcsb_id where file is one of main_universal.py or main.py, and rcsb_id is the structure ID.\n\nThe protocol can be run on multiple instances simply by adding more rcsb_ids to the command. For example: python -m protocol file rcsb_id1 rcsb_id2\nNote that running multiple instances in the same command is more efficient when these are new instances, as the alignment will run only once after all new sequences have been added to the fasta files, rather than after each new instance.\n\n\nAs mentioned above, the program will automatically update the fasta files and rerun the alignments to include new instances from the command. This should not change the conserved residues when small numbers of new ribosomes are added, but if you are adding many new ribosomes, you may consider changing the reselect_landmarks boolean flag to True, to ensure that the assigned landmarks reflect the conservation present in the entirety of the data. This flag can also be used to apply changes to conservation and distance threshold parameters. It is important to note, however, that re-selecting landmarks disrupts the correspondence of newly assigned landmarks to previously assigned landmarks." }, { - "objectID": "posts/AlphaShape/index.html", - "href": "posts/AlphaShape/index.html", - "title": "Alpha Shapes in 2D and 3D", - "section": "", - "text": "Alpha shapes are a generalization of the convex hull used in computational geometry. They are particularly useful for understanding the shape of a point cloud in both 2D and 3D spaces. In this document, we will explore alpha shapes in both dimensions using Python.\nWhat is an \\(\\alpha\\) shape? 
My favorite analogy (reference https://doc.cgal.org/latest/Alpha_shapes_2/index.html):\nImagine you have a huge mass of ice cream in either 2D or 3D, and the points are “hard” chocolate pieces which we would like to avoid. Using one of these round-shaped ice-cream spoons with radius \\(1/\\alpha\\), we carve out all the ice cream without bumping into any of the chocolate pieces. Finally, we straighten the round boundaries to obtain the so-called \\(\\alpha\\) shape.\nWhat is the \\(\\alpha\\) parameter? \\(1/\\alpha\\) is the radius of your “carving spoon” and controls the roughness of your boundary. If the radius of the spoon is too small (\\(\\alpha\\to \\infty\\)), all the ice cream can be carved out except the chocolate chips themselves, so eventually all data points become singletons and no information regarding the shape can be revealed. However, choosing a big radius (\\(\\alpha \\approx 0\\)) may not be ideal either because it does not allow carving out anything, so we end up with the convex hull of all data points." }, { - "objectID": "posts/landmarks-final/index.html#limitations", - "href": "posts/landmarks-final/index.html#limitations", - "title": "Landmarking the ribosome exit tunnel", - "section": "Limitations", - "text": "Limitations\n\nAlignment Efficiency\nThe protocol automatically runs MAFFT sequence alignment from the command line when the input fasta files are updated. However, running MAFFT online can be much faster. To maximize efficiency when running the protocol on many ribosomes, I suggest running the input fasta files through MAFFT online, and uploading the resulting alignments into the protocol directory (matching the location and naming of the original files).\n\n\nMissing Landmarks\nThere remain missing landmarks on many ribosome specimens in the data, due to gaps in the experimental data or unusual specimens (e.g. imaged mid-biogenesis). 
Filtering out these instances would be beneficial prior to analysis.\n\n\nDistribution of Species\nThe available data from RibosomeXYZ is not uniformly distributed across species. There is a heavy skew towards a few model species (E. coli, T. thermophilus, etc.) as shown in Figure 4. This biases the residue conservation calculations. Analysis done on the resulting landmark data should subset appropriately to obtain a more even spread of species.\n\n\n\n\n\n\nFigure 4: Counts of species present in the data from RibosomeXYZ" }, { - "objectID": "posts/AlphaShape/index.html#introduction", - "href": "posts/AlphaShape/index.html#introduction", - "title": "Alpha Shapes in 2D and 3D", - "section": "", - "text": "Alpha shapes are a generalization of the convex hull used in computational geometry. They are particularly useful for understanding the shape of a point cloud in both 2D and 3D spaces. In this document, we will explore alpha shapes in both dimensions using Python.\nWhat is an \\(\\alpha\\) shape? My favorite analogy (reference https://doc.cgal.org/latest/Alpha_shapes_2/index.html):\nImagine you have a huge mass of ice cream in either 2D or 3D, and the points are “hard” chocolate pieces which we would like to avoid. Using one of these round-shaped ice-cream spoons with radius \\(1/\\alpha\\), we carve out all the ice cream without bumping into any of the chocolate pieces. Finally, we straighten the round boundaries to obtain the so-called \\(\\alpha\\) shape.\nWhat is the \\(\\alpha\\) parameter? \\(1/\\alpha\\) is the radius of your “carving spoon” and controls the roughness of your boundary. If the radius of the spoon is too small (\\(\\alpha\\to \\infty\\)), all the ice cream can be carved out except the chocolate chips themselves, so eventually all data points become singletons and no information regarding the shape can be revealed. 
However, choosing a big radius (\\(\\alpha \\approx 0\\)) may not be ideal either because it does not allow carving out anything, so we end up with the convex hull of all data points." }, { - "objectID": "posts/landmarks-final/index.html#future-directions", - "href": "posts/landmarks-final/index.html#future-directions", - "title": "Landmarking the ribosome exit tunnel", - "section": "Future Directions", - "text": "Future Directions\nWhen choosing landmarks, the thresholding by distance is done by comparison to the Mole tunnel model. However, this model is too simplistic to capture the complex shape of the tunnel. A more accurate model is the mesh model as described in ‘3D Tessellation of Biomolecular Cavities’.\nThere are gaps in the landmarks where the mesh model shows protrusions that the Mole model does not capture (see Figure 3 for a visualization). Future improvements to the protocol should be made to measure distance to the tunnel using the mesh representation as a benchmark instead of the centerline and radii.\nVisualizations produced using PyVista5 and PyMol6." }, { - "objectID": "posts/AlphaShape/index.html#d-alpha-shape", - "href": "posts/AlphaShape/index.html#d-alpha-shape", - "title": "Alpha Shapes in 2D and 3D", - "section": "2D Alpha Shape", - "text": "2D Alpha Shape\nTo illustrate alpha shapes in 2D, we’ll use the alphashape library. 
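Before reaching for the library, the α criterion itself can be illustrated with plain Python: in the common Delaunay-based construction, a triangle survives the "carving" when its circumradius is at most \(1/\alpha\). A minimal sketch (circumradius and keep_triangle are illustrative helpers, not part of the alphashape API):

```python
import math

def circumradius(a, b, c):
    """Circumradius of triangle abc via R = (|ab| |bc| |ca|) / (4 * area)."""
    ab = math.dist(a, b)
    bc = math.dist(b, c)
    ca = math.dist(c, a)
    # Twice the unsigned area from the 2D cross product
    area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    return (ab * bc * ca) / (2 * area2)

def keep_triangle(a, b, c, alpha):
    """Alpha test: keep the triangle when its circumradius fits the 1/alpha spoon."""
    return circumradius(a, b, c) <= 1.0 / alpha

tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # circumradius = sqrt(2)/2 ~ 0.707
print(keep_triangle(*tri, alpha=1.0))   # spoon radius 1.0 -> kept (True)
print(keep_triangle(*tri, alpha=2.0))   # spoon radius 0.5 -> carved away (False)
```

This matches the analogy above: as α grows the spoon shrinks, so fewer and fewer triangles pass the test.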
Let’s start by generating a set of random points and computing their alpha shape.\nFirst we create a point cloud:\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport alphashape\nfrom matplotlib.path import Path\nfrom scipy.spatial import ConvexHull\n\ndef generate_flower_shape(num_petals, num_points_per_petal):\n angles = np.linspace(0, 2 * np.pi, num_points_per_petal, endpoint=False)\n r = 1 + 0.5 * np.sin(num_petals * angles)\n \n x = r * np.cos(angles)\n \n y = r * np.sin(angles)\n \n return np.column_stack((x, y))\n\ndef generate_random_points_within_polygon(polygon, num_points):\n \"\"\"Generate random points inside a given polygon.\"\"\"\n min_x, max_x = polygon[:, 0].min(), polygon[:, 0].max()\n min_y, max_y = polygon[:, 1].min(), polygon[:, 1].max()\n \n points = []\n while len(points) < num_points:\n x = np.random.uniform(min_x, max_x)\n y = np.random.uniform(min_y, max_y)\n if Path(polygon).contains_point((x, y)):\n points.append((x, y))\n \n return np.array(points)\n\nplt.figure(figsize=(8, 6))\npoints = generate_flower_shape(num_petals=6, num_points_per_petal=100)\npoints = generate_random_points_within_polygon(points, 1000)\nplt.scatter(points[:, 0], points[:, 1], s=10, color='blue', label='Points')\n\n/Users/wenjunzhao/opt/anaconda3/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning:\n\nA NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.3\n\n\n\n\n\n\n\n\n\n\nTry running this with \\(\\alpha=0.1\\):\n\n# Create alpha shape\nalpha = 0.1\nalpha_shape = alphashape.alphashape(points, alpha)\n\n# Plot points and alpha shape\nplt.figure(figsize=(8, 6))\nplt.scatter(points[:, 0], points[:, 1], s=10, color='blue', label='Points')\nplt.plot(*alpha_shape.exterior.xy, color='red', lw=2, label='Alpha Shape')\nplt.title('2D Alpha Shape')\nplt.xlabel('X')\nplt.ylabel('Y')\nplt.legend()\nplt.grid(True)\nplt.show()\n\n\n\n\n\n\n\n\nOops, it seems the radius we picked is too big! 
Let’s try a few other choices.\n\nalpha_values = [0.1, 5.0, 10.0, 15.0]\n# Plot the flower shape and alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n # Compute alpha shape\n alpha_shape = alphashape.alphashape(points, alpha)\n \n # Plot the points and the alpha shape\n ax = axes[i]\n #print(alpha_shape.type)\n if alpha_shape.type == 'Polygon':\n ax.plot(*alpha_shape.exterior.xy, color='red', lw=2, label='Alpha Shape')\n ax.scatter(points[:, 0], points[:, 1], color='orange', s=10, label='Point Cloud')\n \n \n \n ax.set_title(f'Alpha Shape with alpha={alpha}')\n ax.legend()\n ax.grid(True)\n\nplt.tight_layout()\nplt.show()\n\n/var/folders/k7/s0t_zwg11h56xb5xp339s5pm0000gp/T/ipykernel_29951/885549844.py:13: ShapelyDeprecationWarning:\n\nThe 'type' attribute is deprecated, and will be removed in the future. You can use the 'geom_type' attribute instead." }, { - "objectID": "posts/Embryonic-Shape/index.html", - "href": "posts/Embryonic-Shape/index.html", - "title": "Shape analysis of C. elegans E cell", - "section": "", - "text": "Some more background information in the blog post link.\nDuring embryonic development of Caenorhabditis elegans, an endomesodermal precursor EMS cell develops into a mesoderm precursor MS cell and an endoderm precursor E cell (Sulston et al. 1983). The asymmetry of this division depends on signals coming from the neighbour of the EMS cell, P2 (Jan and Jan 1998). When the signals coming from the neighbouring cell are lost, the EMS cell divides symmetrically and both daughters adopt the MS cell fate (Goldstein 1992). Since cell signalling can be modulated, the C. elegans EMS cell is a good system to use when investigating asymmetric cell divisions. Indeed, preliminary studies show that the volume of the daughter closest to P2 (the signal-sending cell) becomes larger when the signal is abolished. 
We do not know, however, how the cell shape changes, and whether the daughter cell fate is mediated by the EMS and daughter cell shapes. One way to investigate this is to do direct volume analysis of the EMS cell before division; however, this approach is limiting since volume does not account for changes in the cell shape. With this project, I hope to develop a framework to investigate EMS, MS, and E cell shapes and use this framework to analyze cell shapes upon signal perturbations.\nA recently published paper claims to have developed a framework to analyze cell shape in C. elegans embryonic cells (Van Bavel, Thiels, and Jelier 2023). To confirm the viability of this framework, the authors compared the shape of a wild type E cell versus an E cell that does not receive a signal from the P2 cell (dsh-2/mig-5 knockdown). To analyze the shapes, the authors used conformal mapping to map the cell shapes onto a sphere. They then extracted spherical harmonics, which describe the features of the cell in decreasing order of importance, starting from the ones that have the greatest contribution to the cell shape. In this project, my aim was to reproduce their results and to use the Flowshape framework on my own samples.\n\n\nThe pipeline of this framework begins with segmentation. In the article, the SDT-PICS method (Thiels et al. 2021) was used to generate 3D meshes. The method was installed using Docker, but it required substantial version control to make it work, as the installation depended on Linux, some dependencies were not compatible with their recommended Python version, and others were not compatible with a different Python version. I hope to contact the authors of the paper and submit the fixes for installing SDT-PICS. Additionally, the segmentation pipeline did not work very well with my microscopy images (Figure 1). This could be due to different cell shape markers or microscopy differences.\n\n\n\nFigure 1: six-cell stage C. 
elegans embryo stained with a membrane dye.\n\n\nAfter trying numerous segmentation techniques, I settled on a semi-automatic segmentation of specific cells using ImageJ. This was done using automatic interpolation of selected cells, creating binary masks (Figure 2). These were used as sample cells for further analysis.\n\n\n\nFigure 2: Mask samples for an E cell, including the first mask, the 30th mask and the last mask.\n\n\n\n\n\n\n\nThe main Flowshape algorithm uses 3D meshes as input for conformal mapping. However, the authors do provide a method to build meshes from image files using a marching cubes algorithm (Lorensen and Cline 1987). The marching cubes algorithm leads to a cylindrical 3D representation of a cell (Figure 3).\n\n\n\nFigure 3: Cell reconstruction from masks shown in Figure 2 using the marching cubes algorithm.\n\n\nTo remove any gaps in the shape, we employ a remeshing algorithm from the pyvista package. This leads to the expected triangular mesh (Figure 4). The holes produced by the marching cubes algorithm are filled and the shape is ready to be analyzed.\n\n\n\nFigure 4: Cell reconstruction from the marching cubes shown in Figure 3.\n\n\nSpherical harmonics can then be calculated using the following code.\n```{python}\n# perform reconstruction with 24 SH\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\n\nrho = Y_mat.dot(weights)\n\nreconstruct = fs.reconstruct_shape(sv, f, rho)\nmeshplot.plot(reconstruct, f, c = rho)\n```\nThis results in a reconstructed cell shape (Figure 5). The colors here represent the curvature of the shape.\n\n\n\nFigure 5: Cell reconstruction using spherical harmonics (the first 24)\n\n\nSpherical harmonics can also be used to map the shape directly onto the sphere. Similar to Figure 5, high curvature areas are represented in brighter colors.\n\n\n\nFigure 6: Cell reconstruction onto a sphere using conformal mapping\n\n\n\n\n\nTo compare two shapes, it is essential to first align them. 
In this workflow, alignment is calculated by estimating a rotation matrix that maximizes the correlation between the spherical harmonics of two shapes. This is then used to align the shapes and refine the alignment (Figure 7).\n```{python}\nrot2 = fs.compute_max_correlation(weights3, weights2, l_max = 24)\nrot2 = rot2.as_matrix()\n\np = mp.plot(v, f)\np.add_points(v2 @ rot2, shading={\"point_size\": 0.2})\n\nfinal = v2 @ rot2\n\nfor i in range(10):\n    # Project points onto the surface of the original mesh\n    sqrD, I, proj = igl.point_mesh_squared_distance(final, v, f)\n    # Print error (RMSE)\n    print(np.sqrt(np.average(sqrD)))\n\n    # igl's procrustes complains if you don't give the mesh in Fortran index order\n    final = final.copy(order='f')\n    proj = proj.copy(order='f')\n\n    # Align points to their projection\n    s, R, t = igl.procrustes(final, proj, include_scaling = True, include_reflections = False)\n\n    # Apply the transformation\n    final = (final * s).dot(R) + t\n```\nIn this image, the yellow shape and red dots represent two separate E cells. Fewer visible red dots mean that the cells are better aligned.\n\n\n\nFigure 7: Alignment of two cells.\n\n\n\n\n\nTo find a mean shape between the two shapes, I computed a mean spherical harmonics decomposition:\n```{python}\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\nweights2, Y_mat2, vs2 = fs.do_mapping(v2, f2, l_max = 24)\n\nmean_weights = (weights + weights2) / 2\nmean_Ymat = (Y_mat + Y_mat2) / 2\n\nsv = fs.sphere_map(v, f)\nrho3 = mean_Ymat.dot(mean_weights)\nmp.plot(sv, f, c = rho3)\n\nrec2 = fs.reconstruct_shape(sv, f, rho3)\nmp.plot(rec2, f, c = rho3)\n```\nFrom this, I built a mean shape on the sphere, followed by a reconstruction (Figure 8).\n\n\n\nFigure 8: Mean shape reconstruction. 
Left - mean shape mapped onto a sphere, right - reconstructed mean shape using spherical harmonics.\n\n\nThis reconstruction was then used to re-align the original shapes and map them onto the average shape (Figure 9).\n\n\n\nFigure 9: Alignment of cells to a mean shape.\n\n\nTo further analyze the differences between the two cells, I calculated pointwise differences between vertices, and the combined deviation of each vertex from the average vertex. I then mapped these onto the average cell shape (Figure 10).\n```{python}\npointwise_diff = np.linalg.norm(final - final2, axis=1)  # Difference between aligned shapes\n\n# Point-wise difference from the mean shape\ndiff_from_mean_v = np.linalg.norm(final - v3, axis=1)\ndiff_from_mean_final = np.linalg.norm(final2 - v3, axis=1)\n```\n\n\n\nFigure 10: Estimation of deviations from the mean shape using pointwise differences (left) and filtering only the highest differences (right).\n\n\nTo numerically estimate the shape differences, I calculated the RMSE between shapes (1.37) and the surface area difference between the cells (557.5 µm²). These numbers will become more meaningful once there is sufficient data to compare across different samples.\nI also tried using K-means clustering to see if there were any significant clusters (Figure 11).\n\n\n\nFigure 11: K-means clustering of pointwise differences mapped onto the mean cell shape. Most of the clusters are evenly distributed, but upon rotation there is a larger cluster (right).\n\n\n\n\n\nThis project proposes using a modified Flowshape analysis pipeline to investigate similarities and differences between C. elegans embryonic cells. A mean shape can be easily estimated using spherical harmonics, which can then be used to compare different shapes and find outliers of interest. I would like to extend this project by automating and improving the segmentation pipeline (either via SDT-PICS or machine learning algorithms), and finding ways to extract more data points from shape comparisons." 
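The Procrustes refinement loop described above (project onto the reference, then re-fit a similarity transform) can be sketched without igl or Flowshape. Below is a minimal, self-contained numpy version on hypothetical toy points — an illustration of the idea, not the post's actual code:

```python
import numpy as np

def procrustes_align(source, target):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    minimizing ||s * source @ R + t - target||, with reflections excluded."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    A, B = source - mu_s, target - mu_t
    # Optimal rotation via SVD of the cross-covariance matrix (Kabsch/Umeyama)
    U, S, Vt = np.linalg.svd(A.T @ B)
    d = np.ones(len(S))
    if np.linalg.det(U @ Vt) < 0:
        d[-1] = -1.0                        # flip one axis to rule out reflections
    R = U @ np.diag(d) @ Vt
    s = (S * d).sum() / (A ** 2).sum()      # optimal uniform scale
    t = mu_t - s * mu_s @ R
    return s, R, t

# Toy check: a rotated, scaled, shifted copy of a point set is re-aligned exactly
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Y = 1.5 * X @ Q + np.array([2.0, -1.0])

s, R, t = procrustes_align(Y, X)
aligned = s * Y @ R + t
rmse = np.sqrt(((aligned - X) ** 2).mean())
```

Here the residual RMSE drops to numerical zero because the toy target is a noiseless similarity transform of the source; on real meshes this residual plays the role of the error printed inside the igl loop above.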
+ "objectID": "posts/AlphaShape/index.html#application-of-2d-alpha-shapes-on-reaction-diffusion-equation", + "href": "posts/AlphaShape/index.html#application-of-2d-alpha-shapes-on-reaction-diffusion-equation", + "title": "Alpha Shapes in 2D and 3D", + "section": "Application of 2D alpha shapes on reaction-diffusion equation", + "text": "Application of 2D alpha shapes on reaction-diffusion equation\nNow we discuss an application of 2D alpha shape on quantifying the patterns that arise in reaction-diffusion equations modeling morphogenesis.\nReference: Zhao, Maffa, Sandstede. http://bjornsandstede.com/papers/Data_Driven_Continuation.pdf\nAs an example, let’s consider the Brusselator model in 2D, and below is a simple simulator that generates the snapshot of its solution over the spatial domain. The initial condition is random, and patterns start to arise after we evolve the system forward for a short time.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef brusselator_2d_simulation(A, B, Lx=100, Ly=100, Nx=100, Ny=100, dt=0.005, D_u=4, D_v=32, T=20):\n \"\"\"\n Simulate the 2D Brusselator model and return the concentration field u at time T.\n \n Parameters:\n - A: Reaction parameter A\n - B: Reaction parameter B\n - Lx: Domain size in x direction\n - Ly: Domain size in y direction\n - Nx: Number of grid points in x direction\n - Ny: Number of grid points in y direction\n - dt: Time step\n - D_u: Diffusion coefficient for u\n - D_v: Diffusion coefficient for v\n - T: Total simulation time\n \n Returns:\n - u: Concentration field u at time T\n \"\"\"\n \n # Generate random points\n np.random.seed(0) # For reproducibility\n\n # Initialize variables\n dx, dy = Lx / Nx, Ly / Ny\n u = np.random.uniform(size=(Nx, Ny))\n v = np.zeros((Nx, Ny))\n \n \n # Prepare the grid\n x = np.linspace(0, Lx, Nx)\n y = np.linspace(0, Ly, Ny)\n \n # Compute Laplacian\n def laplacian(field):\n return (np.roll(field, 1, axis=0) + np.roll(field, -1, axis=0) +\n np.roll(field, 1, 
axis=1) + np.roll(field, -1, axis=1) -\n                4 * field) / (dx * dy)\n\n    # Time-stepping loop\n    num_steps = int(T / dt)\n    for _ in range(num_steps):\n        # Compute Laplacian\n        lap_u = laplacian(u)\n        lap_v = laplacian(v)\n\n        # Brusselator model equations\n        du = D_u * lap_u + A - (B + 1) * u + u**2 * v\n        dv = D_v * lap_v + B * u - u**2 * v\n\n        # Update fields\n        u += du * dt\n        v += dv * dt\n\n    return u, x, y\n\n# Example usage\nA = 4.75\nB = 11.0\nu_at_T, x, y = brusselator_2d_simulation(A, B)\n\n# Plot the result\nplt.figure(figsize=(8, 8))\nplt.imshow(u_at_T, cmap='viridis', interpolation='bilinear', origin='lower')\nplt.colorbar(label='Concentration of u')\nplt.title(f'Concentration of u at T=20 with A={A}, B={B}')\nplt.xlabel('x')\nplt.ylabel('y')\nplt.grid(True)\nplt.show()\n\n\n\n\n\n\n\n\nNow we create a point cloud by thresholding the solution:\n\ndef get_threshold_points(u, threshold=0.7):\n    \"\"\"\n    Get grid points where the concentration field u exceeds the specified threshold.\n\n    Parameters:\n    - u: Concentration field\n    - threshold: The threshold value as a fraction of the maximum value in u\n\n    Returns:\n    - coords: Array of grid points where u exceeds the threshold\n    \"\"\"\n    max_u = np.max(u)\n    threshold_value = threshold * max_u\n    coords = np.argwhere(u > threshold_value)\n    return coords\n\n# Get grid points above 70% of the maximum value\ncoords = get_threshold_points(u_at_T, threshold=0.7)\n# Highlight points above threshold\nx_coords, y_coords = coords[:, 1], coords[:, 0]\nplt.scatter(x_coords, y_coords, color='red', s=20, marker='o', edgecolor='w')\n\n\n\n\n\n\n\n\nAfter obtaining the point cloud, we can run the alpha shape algorithm on it. 
As mentioned before, picking a good alpha can be tricky, so let’s try a few alpha values to see which one identifies the boundary in an ideal way.\n\nalpha_values = [0.3, 0.35, 0.5, 1.]\n# Plot the point cloud and alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n    # Compute alpha shape\n    alpha_shape = alphashape.alphashape(coords, alpha)\n    # Plot the points and the alpha shape\n    plt.subplot(2, 2, i+1)\n\n    if alpha_shape.geom_type == 'GeometryCollection':\n        for geom in alpha_shape.geoms:\n            if geom.geom_type == 'Polygon':\n                x, y = geom.exterior.xy\n                plt.plot(x, y, 'r-')\n    elif alpha_shape.geom_type == 'Polygon':\n        x, y = alpha_shape.exterior.xy\n        plt.plot(x, y, 'r-')\n    elif alpha_shape.geom_type == 'MultiPolygon':\n        for polygon in alpha_shape.geoms:\n            x, y = polygon.exterior.xy\n            plt.plot(x, y, 'r-')\n    plt.scatter(coords[:, 0], coords[:, 1], color='orange', s=10, label='Point Cloud')\n\n    plt.title(f'alpha={alpha}')\n\nplt.tight_layout()\nplt.show()\n\n\n\n\n\n\n\n\nNow we can study different pattern statistics for these clusters! For example, the roundness of a cluster is defined as \(4\pi Area/Perimeter^2\), which is bounded between zero (stripe) and one (spot). For each cluster, a roundness score can be computed. 
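The roundness score is easy to sanity-check on shapes whose answer we already know. A small self-contained example, using the shoelace formula for polygon area in pure numpy (independent of the alpha-shape code above):

```python
import numpy as np

def roundness(xy):
    """4*pi*Area/Perimeter^2 for a closed polygon given as an (n, 2) vertex array."""
    x, y = xy[:, 0], xy[:, 1]
    # Shoelace area; the polygon is implicitly closed by np.roll wrapping around
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    # Perimeter: sum of consecutive edge lengths, again wrapping around
    perim = np.sqrt(((xy - np.roll(xy, -1, axis=0)) ** 2).sum(axis=1)).sum()
    return 4 * np.pi * area / perim ** 2

# A "spot" (regular 200-gon approximating a circle) vs a "stripe" (thin rectangle)
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
spot = np.column_stack((np.cos(t), np.sin(t)))
stripe = np.array([[0, 0], [10, 0], [10, 0.5], [0, 0.5]], dtype=float)

r_spot, r_stripe = roundness(spot), roundness(stripe)
```

By the isoperimetric inequality the score never exceeds one, and it approaches one only for circle-like clusters, which is why the histogram's two peaks separate spots from stripes.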
The resulting histogram of roundness scores of all clusters will follow a bimodal distribution, with its two peaks corresponding to spots and stripes, respectively.\n\nalpha_values = [0.3, 0.4, 0.6, 1.]\n# Plot roundness histograms for alpha shapes with varying alpha values\nfig, axes = plt.subplots(2, 2, figsize=(6,6))\naxes = axes.flatten()\n\nfor i, alpha in enumerate(alpha_values):\n    plt.subplot(2, 2, i+1)\n    # Compute alpha shape\n    alpha_shape = alphashape.alphashape(coords, alpha)\n    if alpha_shape.geom_type == 'MultiPolygon':\n        # Extract the area and perimeter of each polygon\n        areas = [polygon.area for polygon in alpha_shape.geoms]\n        perimeters = [polygon.length for polygon in alpha_shape.geoms]\n        roundness = [4*np.pi*areas[j]/perimeters[j]**2 for j in range(len(areas))]\n    else:\n        areas = [alpha_shape.area]\n        perimeters = [alpha_shape.length]\n        roundness = [4*np.pi*areas[0]/perimeters[0]**2]\n    plt.hist(roundness, density=True, range=[0,1])\n    plt.xlim([0,1])\n    plt.title(f'Roundness with alpha={alpha}')\n\nplt.tight_layout()\nplt.show()" }, { - "objectID": "posts/Embryonic-Shape/index.html#segmentation", - "href": "posts/Embryonic-Shape/index.html#segmentation", - "title": "Shape analysis of C. elegans E cell", - "section": "", - "text": "The pipeline of this framework begins with segmentation. In the article, the SDT-PICS method (Thiels et al. 2021) was used to generate 3D meshes. The method was installed using Docker, but it required substantial version troubleshooting to make it work, as the installation depended on Linux, some dependencies were not compatible with their recommended Python version, and others were not compatible with a different Python version. I hope to contact the authors of the paper and submit the fixes for installing SDT-PICS. Additionally, the segmentation pipeline did not work very well with my microscopy images (Figure 1). 
This could be due to different cell shape markers or microscopy differences.\n\n\n\nFigure 1: six-cell stage C. elegans embryo stained with a membrane dye.\n\n\nAfter trying numerous segmentation techniques I have settled for a semi-automatic segmentation of specific cells using ImageJ. This was done using automatic interpolation of selected cells, creating binary masks (Figure 2). These were used as sample cells for further analysis.\n\n\n\nFigure 2: Mask samples for an E cell, including the first mask, the 30th mask and the last mask." + "objectID": "posts/AlphaShape/index.html#d-alpha-shapes", + "href": "posts/AlphaShape/index.html#d-alpha-shapes", + "title": "Alpha Shapes in 2D and 3D", + "section": "3D Alpha shapes", + "text": "3D Alpha shapes\n\nfrom mpl_toolkits.mplot3d import Axes3D\n\ndef plot_torus_with_random_points(R1=1.0, r1=0.3, R2=0.8, r2=0.3, num_points=1000):\n \"\"\"\n Plots a torus with random points filling its volume.\n\n Parameters:\n R (float): Major radius of the torus.\n r (float): Minor radius of the torus.\n num_points (int): Number of random points to generate inside the torus.\n \"\"\"\n \n # Generate random points\n np.random.seed(0) # For reproducibility\n theta = np.random.uniform(0, 2 * np.pi, num_points) # Angle around the major circle\n phi = np.random.uniform(0, 2 * np.pi, num_points) # Angle around the minor circle\n u = np.random.uniform(0, 1, num_points) # Random uniform distribution for radial distance\n \n # Convert uniform distribution to proper volume inside the torus\n u = np.sqrt(u) # To spread points more evenly\n\n # Parametric equations for the double torus\n # First torus\n x1 = .5*(R1 + r1 * np.cos(phi)) * np.cos(theta)\n y1 = (R1 + r1 * np.cos(phi)) * np.sin(theta)\n z1 = r1 * np.sin(phi)\n \n # Second torus\n x2 = -1 + .5*(R2 + r2 * np.cos(phi)) * np.cos(theta)\n y2 = (R2 + r2 * np.cos(phi)) * np.sin(theta)\n z2 = r2 * np.sin(phi)# + 2 * (R2 + r2 * np.cos(phi)) * np.sin(theta) # Shifted in z-direction for double 
torus effect\n\n # Combine points from both tori\n x = np.concatenate([x1, x2])\n y = np.concatenate([y1, y2])\n z = np.concatenate([z1, z2])\n\n \n\n # Plot the torus and the random points\n fig = plt.figure()\n ax = fig.add_subplot(111, projection='3d')\n\n # Plot the random points\n ax.scatter(x, y, z, c='red', s=1, label='Random Points') # Using a small point size for clarity\n\n\n # Add titles and labels\n ax.set_title('Torus with Random Points')\n ax.set_xlabel('X axis')\n ax.set_ylabel('Y axis')\n ax.set_zlabel('Z axis')\n #ax.set_xlim([-1.5,0.5])\n #ax.set_ylim([-0.5,1.5])\n ax.set_zlim([-1.5,1.5])\n ax.legend()\n plt.show()\n return x,y,z\n\n# Example usage\nx, y, z = plot_torus_with_random_points(num_points=2000)\n\n\n\n\n\n\n\n\nThe intuition on picking alpha still holds! Let’s first try a big alpha (small radius and refined boundaries) and then a small one (big radius and rough boundaries)\n\nimport alphashape\n\n\nalpha_shape = alphashape.alphashape(np.column_stack((x,y,z)), 5.0)\nalpha_shape.show()\n\n\n\n\n\nalpha_shape = alphashape.alphashape(np.column_stack((x,y,z)), 3.0)\nalpha_shape.show()" }, { - "objectID": "posts/Embryonic-Shape/index.html#flowshape-algorithm", - "href": "posts/Embryonic-Shape/index.html#flowshape-algorithm", - "title": "Shape analysis of C. elegans E cell", - "section": "", - "text": "The main Flowshape algorithm uses 3D meshes as input for conformal mapping. However, they do provide a method to build meshes from image files using a marching cubes algorithm (Lorensen and Cline 1987). Marching cubes algorithm leads to a cylindrical 3D representation of a cell (Figure 3).\n\n\n\nFigure 3: Cell reconstruction from masks shown in Figure 2 using Marching Cubes algorithm.\n\n\nTo remove any gaps in the shape, we employ a remeshing algorithm in pyvista package. This leads to an expected triangular mesh (Figure 4). 
The holes produced by the marching cubes algorithm are filled and the shape is ready to be analyzed.\n\n\n\nFigure 4: Cell reconstruction from the marching cubes shown in Figure 3.\n\n\nSpherical harmonics can then be calculated using the following code.\n```{python}\n# perform reconstruction with the first 24 spherical harmonics\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\n\nrho = Y_mat.dot(weights)\n\nreconstruct = fs.reconstruct_shape(sv, f, rho)\nmeshplot.plot(reconstruct, f, c = rho)\n```\nThis results in a reconstructed cell shape (Figure 5). The colors here represent the curvature of the shape.\n\n\n\nFigure 5: Cell reconstruction using spherical harmonics (the first 24).\n\n\nSpherical harmonics can also be used to map the shape directly onto the sphere. Similar to Figure 5, high-curvature areas are represented in brighter colors.\n\n\n\nFigure 6: Cell reconstruction onto a sphere using conformal mapping.\n\n\n\n\n\nTo compare two shapes, it is essential to first align them. In this workflow, alignment is calculated by estimating a rotation matrix that maximizes the correlation between the spherical harmonics of two shapes. This is then used to align the shapes and refine the alignment (Figure 7).\n```{python}\nrot2 = fs.compute_max_correlation(weights3, weights2, l_max = 24)\nrot2 = rot2.as_matrix()\n\np = mp.plot(v, f)\np.add_points(v2 @ rot2, shading={\"point_size\": 0.2})\n\nfinal = v2 @ rot2\n\nfor i in range(10):\n    # Project points onto the surface of the original mesh\n    sqrD, I, proj = igl.point_mesh_squared_distance(final, v, f)\n    # Print error (RMSE)\n    print(np.sqrt(np.average(sqrD)))\n\n    # igl's procrustes complains if you don't give the mesh in Fortran index order\n    final = final.copy(order='f')\n    proj = proj.copy(order='f')\n\n    # Align points to their projection\n    s, R, t = igl.procrustes(final, proj, include_scaling = True, include_reflections = False)\n\n    # Apply the transformation\n    final = (final * s).dot(R) + t\n```\nIn this image, the yellow shape and red dots represent two separate E cells. Fewer visible red dots mean that the cells are better aligned.\n\n\n\nFigure 7: Alignment of two cells.\n\n\n\n\n\nTo find a mean shape between the two shapes, I computed a mean spherical harmonics decomposition:\n```{python}\nweights, Y_mat, vs = fs.do_mapping(v, f, l_max = 24)\nweights2, Y_mat2, vs2 = fs.do_mapping(v2, f2, l_max = 24)\n\nmean_weights = (weights + weights2) / 2\nmean_Ymat = (Y_mat + Y_mat2) / 2\n\nsv = fs.sphere_map(v, f)\nrho3 = mean_Ymat.dot(mean_weights)\nmp.plot(sv, f, c = rho3)\n\nrec2 = fs.reconstruct_shape(sv, f, rho3)\nmp.plot(rec2, f, c = rho3)\n```\nFrom this, I built a mean shape on the sphere, followed by a reconstruction (Figure 8).\n\n\n\nFigure 8: Mean shape reconstruction. 
Left - mean shape mapped onto a sphere, right - reconstructed mean shape using spherical harmonics.\n\n\nThis reconstruction was then used to re-align the original shapes and map them onto the average shape (Figure 9).\n\n\n\nFigure 9: Alignment of cells to a mean shape.\n\n\nTo further analyze the differences between the two cells, I calculated pointwise differences between vertices, and the combined deviation of each vertex from the average vertex. I then mapped these onto the average cell shape (Figure 10).\n```{python}\npointwise_diff = np.linalg.norm(final - final2, axis=1)  # Difference between aligned shapes\n\n# Point-wise difference from the mean shape\ndiff_from_mean_v = np.linalg.norm(final - v3, axis=1)\ndiff_from_mean_final = np.linalg.norm(final2 - v3, axis=1)\n```\n\n\n\nFigure 10: Estimation of deviations from the mean shape using pointwise differences (left) and filtering only the highest differences (right).\n\n\nTo numerically estimate the shape differences, I calculated the RMSE between shapes (1.37) and the surface area difference between the cells (557.5 µm²). These numbers will become more meaningful once there is sufficient data to compare across different samples.\nI also tried using K-means clustering to see if there were any significant clusters (Figure 11).\n\n\n\nFigure 11: K-means clustering of pointwise differences mapped onto the mean cell shape. Most of the clusters are evenly distributed, but upon rotation there is a larger cluster (right).\n\n\n\n\n\nThis project proposes using a modified Flowshape analysis pipeline to investigate similarities and differences between C. elegans embryonic cells. A mean shape can be easily estimated using spherical harmonics, which can then be used to compare different shapes and find outliers of interest. I would like to extend this project by automating and improving the segmentation pipeline (either via SDT-PICS or machine learning algorithms), and finding ways to extract more data points from shape comparisons." 
+ "objectID": "posts/AlphaShape/index.html#application-of-3d-alpha-shape-protein-structure", + "href": "posts/AlphaShape/index.html#application-of-3d-alpha-shape-protein-structure", + "title": "Alpha Shapes in 2D and 3D", + "section": "Application of 3D alpha shape: protein structure", + "text": "Application of 3D alpha shape: protein structure\nIt would be ideal to find some good data and put them here. To be continued." }, { - "objectID": "posts/principal-curves/principal-curves.html", - "href": "posts/principal-curves/principal-curves.html", - "title": "Trajectory Inference for cryo-EM data using Principal Curves", + "objectID": "posts/point-cloud/pointcloud.html", + "href": "posts/point-cloud/pointcloud.html", + "title": "Point cloud representation of 3D volumes", "section": "", - "text": "Suppose you run an experiment that involves collecting data points \\(\\{\\omega_1, \\ldots, \\omega_M\\} \\subseteq \\Omega \\subseteq \\mathbb R^d\\). As an example, suppose that \\(\\Omega\\) is the hexagonal domain below, and the \\(\\omega_i\\) represent positions of \\(M\\) independent, non-interacting particles in \\(\\Omega\\) (all collected simultaneously).\n\n\n\nsome sample points\n\n\nThe question is: Just from the position data \\(\\{\\omega_1, \\ldots, \\omega_M\\}\\) we have collected, can we determine 1) Whether the particles are all evolving according to the same dynamics, and 2) If so, what those dynamics are? As a sanity check, we can first try superimposing all of the data in one plot.\n\n\n\nsome sample points\n\n\nFrom the image above, there appears to be no discernable structure. 
But as we increase our number of samples \\(M\\), a picture starts to emerge.\n\n\n\nsome sample points\n\n\nand again:\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\nIn the limit as \\(M \\to \\infty\\), we might obtain a picture like the following:\n\n\n\nsome sample points\n\n\nWe see that once \\(M\\) is large, it becomes (visually) clear that the particles are indeed evolving according to the same time-dependent function \\(f : \\mathbb R \\to \\Omega\\), but with 1) Small noise in the initial conditions, and 2) Different initial “offsets” \\(t_i\\) along \\(f(t)\\).\nTo expand on (1) a bit more: Note that in the figure above, there’s a fairly-clear “starting” point where the dark grey lines are all clumped together. Let’s say that this represents \\(f(0)\\). Then we see that the trajectories we observe (call them \\(f_i\\)) appear to look like they’re governed by the same principles, but with \\[f_i(0) = f(0) + \\text{ noise} \\qquad \\text{and} \\qquad f_i'(0) = f'(0) + \\text{ noise}.\\] Together with (2), we see that our observations \\(\\omega_i\\) are really samples from \\(f_i(t_i)\\). The question is how we may use these samples to recover \\(f(t)\\).\nLet us summarize the information so far.\n\n\n\nSuppose you have a time-dependent process modeled by some function \\(f : [0,T] \\to \\Omega\\), where \\(\\Omega \\subseteq \\mathbb R^d\\) (or, more generally, an abstract metric space). 
Then, given observations \[\omega_i = f_i(t_i)\] where the \((f_i, t_i)\) are hidden, how can we estimate \(f(t)\)?\n\nNote that the problem above might at first look very similar to a regression problem, where one attempts to use data points \((X_i, Y_i)\) to determine a hidden model \(f\) (subject to some noise \(\varepsilon_i\)) giving \[Y_i = f(X_i) + \varepsilon_i.\] If we let \(f_i(X) = f(X) + \varepsilon_i\), then we have an almost-identical setup \[Y_i = f_i(X_i).\] The key distinction is that in regression, we assume our data-collection procedure gives us pairs \((X_i, Y_i)\), whereas in the trajectory inference problem our data consists of only the \(Y_i\) and we must infer the \(X_i\) on our own. Note in particular that we have continuum many choices for \(X_i\). This ends up massively complicating the problem: If we try the trajectory-inference analogue of regularized least squares, the lack of an a priori coupling between \(X_i\) and \(Y_i\) means we lose the convexity structure and must use both different theoretical analysis and different numerical algorithms.\nNevertheless, on a cosmetic level, we may formulate the problems with similar-looking equations. This brings us to regularized principal curves." + "objectID": "posts/point-cloud/pointcloud.html", + "href": "posts/point-cloud/pointcloud.html", + "title": "Point cloud representation of 3D volumes", + "section": "", + "text": "In the context of cryo-EM, many computationally expensive methods rely on simpler representations of cryo-EM density maps to overcome their scalability challenges. There are many choices for the form of the simpler representation, such as vectors (Han et al. 2021) or a mixture of Gaussians (Kawabata 2008). In this post, we discuss a format that is probably the simplest and uses a set of points (called a point cloud).\nThis problem can be formulated in a much more general setting than cryo-EM: we are given a probability distribution over \(\mathbb{R}^3\) and we want to generate a set of 3D points that represents this distribution. 
The naive approach for finding such a point cloud is to just sample points from the distribution. Although this approach is guaranteed to find a good representation, it needs many points to cover the distribution evenly. Since methods used in this field can be computationally intensive, with cubic or higher time complexity, generating a point cloud that covers the given distribution with a smaller point-cloud size leads to a significant improvement in their runtime.\nIn this post, we present two methods for generating a point cloud from a cryo-EM density map or a distribution in general. The first one is based on the Topology Representing Network (TRN) (Martinetz and Schulten 1994) and the second one combines Optimal Transport (OT) theory (Peyré, Cuturi, et al. 2019) with a computational geometry object named the Centroidal Voronoi Tessellation (CVT).\n\n\nFor the sake of simplicity in this post, we assume we are given a primal distribution over \(\mathbb{R}^2\). As an example, we will work on a multivariate Gaussian distribution whose domain is restricted to \([-1, 1]^2\). The following code prepares and illustrates the pdf of the example distribution.\n\nimport numpy as np\nimport scipy as scp\nimport matplotlib\nimport matplotlib.pyplot as plt\n\nplt.rcParams[\"figure.figsize\"] = (20,20)\n\n\n\nmean = np.array([0,0])\ncov = np.array([[0.5, 0.25], [0.25, 0.5]])\ndistr = scp.stats.multivariate_normal(cov = cov, mean = mean, seed = 1)\n\n\nfig, ax = plt.subplots(figsize=(8,8))\nim = ax.imshow([[distr.pdf([i/100,j/100]) for i in range(100,-100,-1)] for j in range(-100,100)], extent=[-1, 1, -1, 1])\ncbar = ax.figure.colorbar(im, ax=ax)\nplt.title(\"The pdf of our primal distribution\")\nplt.show()\n\n\n\n\n\n\n\n\nBoth of the methods that we are going to cover are iterative methods relying on an initial sample of points. 
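As a point of reference for the CVT-based method: a centroidal Voronoi tessellation is commonly computed with Lloyd-style iterations, which assign dense samples to their nearest point and then move each point to the centroid of its assigned samples. The following is only a minimal numpy sketch of that idea on the same kind of clipped Gaussian, not the implementation used in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense samples standing in for the target density, clipped to [-1, 1]^2
samples = rng.multivariate_normal([0, 0], [[0.5, 0.25], [0.25, 0.5]], size=5000)
samples = samples[np.all(np.abs(samples) <= 1, axis=1)]

# Initial point cloud: a small random subset of the samples
n = 50
points = samples[rng.choice(len(samples), size=n, replace=False)].copy()

def quantization_error(points, samples):
    d2 = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

err_before = quantization_error(points, samples)
for _ in range(20):
    # Assign each sample to its nearest point (its Voronoi region)
    d2 = ((samples[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    owner = d2.argmin(axis=1)
    # Move each point to the centroid of its region (skip empty regions)
    for k in range(n):
        mask = owner == k
        if mask.any():
            points[k] = samples[mask].mean(axis=0)
err_after = quantization_error(points, samples)
```

Each Lloyd step is non-increasing in the quantization error, so the small point cloud ends up covering the distribution far more evenly than a naive 50-point sample would.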
For generating a point cloud with size \\(n\\), they begin by randomly sampling \\(n\\) points and refining it over iterations. We use \\(n=200\\) in our examples.\n\ndef sampler(rvs):\n while True:\n sample = rvs(1)\n if abs(sample[0]) > 1 or abs(sample[1]) > 1:\n continue\n return sample\n\ninitial_samples = []\nwhile len(initial_samples) < 200:\n sample = sampler(distr.rvs)\n initial_samples.append(list(sample))\ninitial_samples = np.array(initial_samples)\n\nl = list(zip(*initial_samples))\nx = list(l[0])\ny = list(l[1])\n\nfig, ax = plt.subplots(figsize=(8,8))\nax.scatter(x, y)\nax.plot((-1,-1), (-1,1), 'k-')\nax.plot((-1,1), (-1,-1), 'k-')\nax.plot((1,1), (1,-1), 'k-')\nax.plot((-1,1), (1,1), 'k-')\nplt.ylim(-1.1,1.1)\nplt.xlim(-1.1,1.1)\nplt.xticks([])\nplt.yticks([])\nplt.show()" }, { - "objectID": "posts/principal-curves/principal-curves.html#example-a-hexagonal-billiards-table", - "href": "posts/principal-curves/principal-curves.html#example-a-hexagonal-billiards-table", - "title": "Trajectory Inference for cryo-EM data using Principal Curves", + "objectID": "posts/point-cloud/pointcloud.html#data", + "href": "posts/point-cloud/pointcloud.html#data", + "title": "Point cloud representation of 3D volumes", "section": "", - "text": "Suppose you run an experiment that involves collecting data points \\(\\{\\omega_1, \\ldots, \\omega_M\\} \\subseteq \\Omega \\subseteq \\mathbb R^d\\). As an example, suppose that \\(\\Omega\\) is the hexagonal domain below, and the \\(\\omega_i\\) represent positions of \\(M\\) independent, non-interacting particles in \\(\\Omega\\) (all collected simultaneously).\n\n\n\nsome sample points\n\n\nThe question is: Just from the position data \\(\\{\\omega_1, \\ldots, \\omega_M\\}\\) we have collected, can we determine 1) Whether the particles are all evolving according to the same dynamics, and 2) If so, what those dynamics are? 
As a sanity check, we can first try superimposing all of the data in one plot.\n\n\n\nsome sample points\n\n\nFrom the image above, there appears to be no discernible structure. But as we increase our number of samples \(M\), a picture starts to emerge.\n\n\n\nsome sample points\n\n\nand again:\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\n\n\n\nsome sample points\n\n\nIn the limit as \(M \to \infty\), we might obtain a picture like the following:\n\n\n\nsome sample points\n\n\nWe see that once \(M\) is large, it becomes (visually) clear that the particles are indeed evolving according to the same time-dependent function \(f : \mathbb R \to \Omega\), but with 1) Small noise in the initial conditions, and 2) Different initial “offsets” \(t_i\) along \(f(t)\).\nTo expand on (1) a bit more: Note that in the figure above, there’s a fairly-clear “starting” point where the dark grey lines are all clumped together. Let’s say that this represents \(f(0)\). Then we see that the trajectories we observe (call them \(f_i\)) appear to look like they’re governed by the same principles, but with \[f_i(0) = f(0) + \text{ noise} \qquad \text{and} \qquad f_i'(0) = f'(0) + \text{ noise}.\] Together with (2), we see that our observations \(\omega_i\) are really samples from \(f_i(t_i)\). The question is how we may use these samples to recover \(f(t)\).\nLet us summarize the information so far." + "objectID": "posts/point-cloud/pointcloud.html#data", + "href": "posts/point-cloud/pointcloud.html#data", + "title": "Point cloud representation of 3D volumes", + "section": "", + "text": "For the sake of simplicity in this post, we assume we are given a primal distribution over \(\mathbb{R}^2\). As an example, we will work on a multivariate Gaussian distribution whose domain is restricted to \([-1, 1]^2\). 
The following code prepares and illustrates the pdf of the example distribution.\n\nimport numpy as np\nimport scipy as scp\nimport matplotlib\nimport matplotlib.pyplot as plt\n\nplt.rcParams[\"figure.figsize\"] = (20,20)\n\n\n\nmean = np.array([0,0])\ncov = np.array([[0.5, 0.25], [0.25, 0.5]])\ndistr = scp.stats.multivariate_normal(cov = cov, mean = mean, seed = 1)\n\n\nfig, ax = plt.subplots(figsize=(8,8))\nim = ax.imshow([[distr.pdf([i/100,j/100]) for i in range(100,-100,-1)] for j in range(-100,100)], extent=[-1, 1, -1, 1])\ncbar = ax.figure.colorbar(im, ax=ax)\nplt.title(\"The pdf of our primal distribution\")\nplt.show()\n\n\n\n\n\n\n\n\nBoth of the methods that we are going to cover are iterative methods relying on an initial sample of points. For generating a point cloud with size \\(n\\), they begin by randomly sampling \\(n\\) points and refining it over iterations. We use \\(n=200\\) in our examples.\n\ndef sampler(rvs):\n while True:\n sample = rvs(1)\n if abs(sample[0]) > 1 or abs(sample[1]) > 1:\n continue\n return sample\n\ninitial_samples = []\nwhile len(initial_samples) < 200:\n sample = sampler(distr.rvs)\n initial_samples.append(list(sample))\ninitial_samples = np.array(initial_samples)\n\nl = list(zip(*initial_samples))\nx = list(l[0])\ny = list(l[1])\n\nfig, ax = plt.subplots(figsize=(8,8))\nax.scatter(x, y)\nax.plot((-1,-1), (-1,1), 'k-')\nax.plot((-1,1), (-1,-1), 'k-')\nax.plot((1,1), (1,-1), 'k-')\nax.plot((-1,1), (1,1), 'k-')\nplt.ylim(-1.1,1.1)\nplt.xlim(-1.1,1.1)\nplt.xticks([])\nplt.yticks([])\nplt.show()" }, { - "objectID": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem", - "href": "posts/principal-curves/principal-curves.html#summary-the-trajectory-inference-problem", - "title": "Trajectory Inference for cryo-EM data using Principal Curves", + "objectID": "posts/sy mds tunnel/index.html", + "href": "posts/sy mds tunnel/index.html", + "title": "Multi Dimensional Scaling of ribosome exit tunnel 
shapes", "section": "", - "text": "Suppose you have a time-dependent process modeled by some function \\(f : [0,T] \\to \\Omega\\), where \\(\\Omega \\subseteq \\mathbb R^d\\) (or, more generally, an abstract metric space). Then, given observations \\[\\omega_i = f_i(t_i)\\] where the \\((f_i, t_i)\\) are hidden, how can we estimate \\(f(t)\\)?\n\n\nNote that the problem above might at first look very similar to a regression problem, where one attempts to use data points \\((X_i, Y_i)\\) to determine a hidden model \\(f\\) (subject to some noise \\(\\varepsilon_i\\)) giving \\[Y_i = f(X_i) + \\varepsilon_i.\\] If we let \\(f_i(X) = f(X) + \\varepsilon_i\\), then we an almost-identical setup \\[Y_i = f_i(X_i).\\] The key distinction is that in regression, we assume our data-collection procedure gives us pairs \\((X_i, Y_i)\\), whereas in the trajectory inference problem our data consists of only the \\(Y_i\\) and we must infer the \\(X_i\\) on our own. Note in particular that we have continuum many choices for \\(X_i\\). This ends up massively complicating the the problem: If we try the trajectory-inference analogue of regularized least squares, the lack of an a priori coupling between \\(X_i\\) and \\(Y_i\\) means we lose the convexity structure and must use both different theoretical analysis and different numerical algorithms.\nNevertheless, on a cosmetic level, we may formulate the problems with similar-looking equations. This brings us to regularized principal curves." 
- }, - { - "objectID": "posts/principal-curves/principal-curves.html#special-case-empirical-distributions", - "href": "posts/principal-curves/principal-curves.html#special-case-empirical-distributions", - "title": "Trajectory Inference for cryo-EM data using Principal Curves", - "section": "Special Case: Empirical Distributions", - "text": "Special Case: Empirical Distributions\nNote that when \\(\\mu\\) is an empirical distribution on observed data points \\(\\omega_1, \\ldots, \\omega_M\\), this becomes \\[\\min_{f} \\frac{1}{M} \\sum_{i=1}^M (d(\\omega_i, f))^p+ \\lambda \\mathscr C(f).\\] Further taking \\(p=2\\) and denoting \\(y_i = \\mathrm{argmin}_{y \\in \\mathrm{image}(f)} d(\\omega_i, y)\\), we can write it as \\[\\min_{f} \\frac{1}{M} \\sum_{i=1}^M \\lvert \\omega_i - y_i\\rvert^2+ \\lambda \\mathscr C(f),\\] whence we recover the relationship with regularized least squares." + "text": "The ribosome exit tunnel is a sub-compartment of the ribosome whose geometry varies significantly across species, potentially affecting the translational dynamics and co-translational folding of nascent polypeptide1.\nAs the recent advances in imaging technologies result in a surge of high-resolution ribosome structures, we are now able to study the tunnel geometric heterogeneity comprehensively across three domains of life: bacteria, archaea and eukaryotes.\nHere, we present some methods for large-scale analysis and comparison of tunnel structures." 
},
  {
    "objectID": "posts/sy mds tunnel/index.html#tunnel-shape",
    "href": "posts/sy mds tunnel/index.html#tunnel-shape",
    "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes",
    "section": "Tunnel Shape",
    "text": "Tunnel Shape\nThe ribosome exit tunnel spans from the peptidyl-transferase center (PTC), where amino acids are polymerized onto the growing nascent chain, to the surface of the ribosome.\nTypically, it measures 80-100 Å in length and 10-20 Å in diameter, although eukaryotic tunnels are, on average, shorter and substantially narrower than prokaryotic ones1.\nIn all domains of life, the tunnel features a universally conserved narrow region downstream of the PTC, the so-called constriction site. The eukaryotic exit tunnel, however, exhibits an additional (second) constriction site due to the modified structure of the surrounding ribosomal proteins.\n\n\n\nIllustration of the tunnel structure of H.sapiens."
  },
  {
    "objectID": "posts/sy mds tunnel/index.html#ribosome-dataset",
    "href": "posts/sy mds tunnel/index.html#ribosome-dataset",
    "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes",
    "section": "Ribosome Dataset",
    "text": "Ribosome Dataset\nCryo-EM reconstructions and X-ray crystallography structures of ribosomes were retrieved from the Protein Data Bank (https://www.rcsb.org), including 762 structures across 34 species.\nThe exit tunnels were extracted from the ribosomes using our tunnel-searching pipeline, built on the MOLE cavity extraction algorithm developed by Sehnal et al.2." 
}, { - "objectID": "posts/HDM/index.html#references", - "href": "posts/HDM/index.html#references", - "title": "Horizontal Diffusion Map", - "section": "", - "text": "This post is based on the following references:\n\nShan Shan, Probabilistic Models on Fibre Bundles (https://dukespace.lib.duke.edu/server/api/core/bitstreams/21bc2e06-ee66-4331-83af-115fe9518e80/content)\nTingran Gao, The Diffusion Geometry of Fibre Bundles: Horizontal Diffusion Maps (https://arxiv.org/pdf/1602.02330)" + "objectID": "posts/sy mds tunnel/index.html#tunnel-shape", + "href": "posts/sy mds tunnel/index.html#tunnel-shape", + "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", + "section": "Tunnel Shape", + "text": "Tunnel Shape\nThe ribosome exit tunnel spans from the peptidyl-transferase center (PTC), where amino acids are polymerized onto the growing nascent chain, to the surface of the ribosome.\nTypically, it measures 80-100 Å in length and 10-20 Å in diameter. While the eukaryotic tunnels are, on average, shorter and substantially narrower than prokaryote ones1.\nIn all domains of life, the tunnel features a universally conserved narrow region downstream of the PTC, so-called constriction site. However, the eukaryotic exit tunnel exhibit an additional (second) constriction site due to the modified structure of the surrounding ribosomal proteins.\n\n\n\nIllustration of the tunnel structure of H.sapiens." }, { - "objectID": "posts/HDM/index.html#introduction", - "href": "posts/HDM/index.html#introduction", - "title": "Horizontal Diffusion Map", - "section": "Introduction", - "text": "Introduction\nHorizontal Diffusion Maps are a variant of diffusion maps used in dimensionality reduction and data analysis. They focus on preserving the local structure of data points in a lower-dimensional space by leveraging diffusion processes. 
Here’s a simple overview:\n\nDiffusion Maps Overview\n\nDiffusion Maps: These are a powerful technique in machine learning and data analysis for reducing dimensionality and capturing intrinsic data structures. They are based on the concept of diffusion processes over a graph or data manifold.\nConcept: Imagine a diffusion process where particles spread out over a data set according to some probability distribution. The diffusion map captures the way these particles spread and organizes the data into a lower-dimensional space that retains the local and global structure.\n\nHorizontal Diffusion Maps\n\nPurpose: Horizontal Diffusion Maps specifically aim to capture and visualize the horizontal or local structure of the data manifold. This can be particularly useful when you want to emphasize local relationships while reducing dimensionality.\nDifference from Standard Diffusion Maps: While standard diffusion maps focus on capturing both local and global structures, horizontal diffusion maps emphasize local, horizontal connections among data points. This means they preserve local neighborhoods and horizontal relationships more explicitly." + "objectID": "posts/sy mds tunnel/index.html#ribosome-dataset", + "href": "posts/sy mds tunnel/index.html#ribosome-dataset", + "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", + "section": "Ribosome Dataset", + "text": "Ribosome Dataset\nCryo-EM reconstructions and X-ray crystallography structures of ribosomes were retrived from the Protein Data Bank (https://www.rcsb.org) including 762 structures across 34 species domain.\nThe exit tunnels were extracted from the ribosomes using our developed tunnel-searching pipeline based on the MOLE cavity extraction algorithm developed by Sehnal et al.2." 
},
  {
    "objectID": "posts/HDM/index.html#example-möbius-strip",
    "href": "posts/HDM/index.html#example-möbius-strip",
    "title": "Horizontal Diffusion Map",
    "section": "Example: Möbius Strip",
    "text": "Example: Möbius Strip\nIn this section, we show how the horizontal diffusion map works on the Möbius strip parameterized by:\n\[\nx = (1 + v\cos(\frac{u}{2}))\cos(u),\quad y = (1 + v\cos(\frac{u}{2}))\sin(u),\quad z = v\sin(\frac{u}{2}),\n\]\nfor \(u\in [0,2\pi)\) and \(v \in [-1,1]\).\nIt is one of the simplest nontrivial fibre bundles. See below for a visualization:\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom mpl_toolkits.mplot3d import Axes3D\n\ndef mobius_strip(u, v):\n    \"\"\"\n    Generate coordinates for a Möbius strip.\n    \n    Parameters:\n    - u: Parameter that varies from 0 to 2*pi\n    - v: Parameter that varies from -1 to 1\n    \n    Returns:\n    - x, y, z: Coordinates of the Möbius strip\n    \"\"\"\n    # Parameters for the Möbius strip\n    radius = 1.0\n    width = 1.0\n    \n    # Compute coordinates\n    x = (radius + width * v * np.cos(u / 2)) * np.cos(u)\n    y = (radius + width * v * np.cos(u / 2)) * np.sin(u)\n    z = width * v * np.sin(u / 2)\n    \n    return x, y, z\n\ndef plot_mobius_strip():\n    u = np.linspace(0, 2 * np.pi, 100)\n    v = np.linspace(-1, 1, 10)\n    \n    u, v = np.meshgrid(u, v)\n    x, y, z = mobius_strip(u, v)\n    \n    fig = plt.figure(figsize=(10, 7))\n    ax = fig.add_subplot(111, projection='3d')\n    \n    # Plot the Möbius strip\n    ax.plot_surface(x, y, z, cmap='inferno', edgecolor='none')\n    \n    # Set labels and title\n    ax.set_xlabel('X')\n    ax.set_ylabel('Y')\n    ax.set_zlabel('Z')\n    ax.set_title('Möbius Strip')\n    \n    plt.show()\n\n# Run the function to plot the Möbius strip\nplot_mobius_strip()\n\n\n\n\n\n\n\n\nNow we generate samples from the surface uniformly by first sampling \(N_{base}\) points on the `base manifold’, parameterized by the \(u\) component. 
Then we sample \\(N_{fibre}\\) points along each fibre:\n\nN_fibre = 20\nv = np.linspace(-1,1,N_fibre,endpoint=False) #samples on each fibre\nN_base = 50\nu = np.linspace(0,2*np.pi,N_base,endpoint=False) #different objects\n# Here we concatenate all fibres to create the overall object\nV = np.tile(v,len(u))\nU= np.array([num for num in u for _ in range(len(v)) ])\nN = U.shape[0]\n\nHere we visualize the points to see how they are distributed on the manifold:\n\nu, v = np.meshgrid(U,V)\nx, y, z = mobius_strip(u, v)\n \nfig = plt.figure(figsize=(10, 7))\nax = fig.add_subplot(111, projection='3d')\n \n# Plot the Möbius strip\nax.scatter(x, y, z, c=v, s=1)\n \n# Set labels and title\nax.set_xlabel('X')\nax.set_ylabel('Y')\nax.set_zlabel('Z')\nax.set_title('Möbius Strip')\n \nplt.show()\n\n\n\n\n\n\n\n\nLater on, we will go over the horizontal diffusion map and apply it to the data we just created!" + "objectID": "posts/sy mds tunnel/index.html#pairwise-distance", + "href": "posts/sy mds tunnel/index.html#pairwise-distance", + "title": "Multi Dimensional Scaling of ribosome exit tunnel shapes", + "section": "Pairwise Distance", + "text": "Pairwise Distance\nTo simplify the geomertic comparisons, we first reduced the tunnel structure into a coordinate set that describes both the centerline trajectory and the tunnel radius at each centerline position,\nWe then applied the pairwise distance metrics developed by Dao Duc et al.1 to compute the geometric similarity between tunnels. More details can be found in the previous work1.\n\n\n\nPairwise comparison of radial varaition plots between H.sapiens and E.coli" }, { - "objectID": "posts/HDM/index.html#horizontal-diffusion-map-hdm", - "href": "posts/HDM/index.html#horizontal-diffusion-map-hdm", - "title": "Horizontal Diffusion Map", - "section": "Horizontal diffusion map (HDM)", - "text": "Horizontal diffusion map (HDM)\nThe first step is to create a kernel matrix. 
As outlined by the references, two common approaches are:\nHorizontal diffusion kernel: For two data points \\(e=(u,v)\\) and \\(e' = (u',v')\\): \\[\nK_{\\epsilon}(e, e') = \\exp( -(u - u')^2/\\epsilon) \\text{ if }v' = P_{uu'}v,\n\\] and zero otherwise. Here \\(P_{uu'}\\) is the map which connects every point from \\(v\\) to its image \\(v'\\), which, for our case, maps \\(v\\) to itself.\n\ndef horizontal_diffusion_kernel(U,V,eps):\n \n N = U.shape[0]\n K = np.zeros((N,N))\n for i in range(N):\n for j in range(N):\n if V[i] == V[j]:# and U[i] != U[j]:\n #print('match')\n K[i,j] = np.exp(-(U[i]-U[j])**2/eps)\n return K\n\neps = 0.2\nK = horizontal_diffusion_kernel(U,V,0.2)\nplt.imshow(K)\nplt.show()\n\n\n\n\n\n\n\n\nAn alternative, soft version of the kernel above is the coupled diffusion kernel: \n\\[\nK_{\\epsilon, \\delta}(e,e') = \\exp( -(u - u')^2/\\epsilon) \\exp( -(v-v')^2/\\delta ).\n\\]\n\ndef coupled_diffusion_kernel(U,V,eps,delta):\n N = U.shape[0]\n K_c = np.zeros((N,N))\n for i in range(N):\n for j in range(N):\n if True:#U[i] != U[j]:\n #print('match')\n K_c[i,j] = np.exp(-(U[i]-U[j])**2/eps) * np.exp( -(V[i]-V[j])**2/delta )\n return K_c\n\neps = .2\ndelta = .01 \nK_c = coupled_diffusion_kernel(U,V,eps,delta) \nplt.imshow(K_c)\nplt.show()\n\n\n\n\n\n\n\n\nAfter we created the kernel matrix, we can then proceed with the regular diffusion map by (1) Create the diffusion operator by normalizing the kernel matrix and computing its eigendecomposition, and (2) extract the diffusion coordinates by using the eigenvectors corresponding to the largest eigenvalues (excluding the trivial eigenvalue) to form the diffusion coordinates.\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.linalg import eigh\nfrom sklearn.preprocessing import normalize\n\ndef compute_diffusion_map(kernel_matrix, num_components=2):\n \"\"\"\n Compute the diffusion map from a kernel matrix.\n\n Parameters:\n - kernel_matrix: The kernel matrix (e.g., RBF kernel 
matrix).\n    - num_components: Number of diffusion map dimensions to compute.\n\n    Returns:\n    - diffusion_coordinates: The 2D diffusion map coordinates.\n    \"\"\"\n    # Degree of each point (row sums of the kernel)\n    degrees = np.sum(kernel_matrix, axis=1)\n    \n    # The random-walk operator P = D^{-1} K is not symmetric, so rather than\n    # passing it to eigh directly, we diagonalize the similar symmetric matrix\n    # S = D^{-1/2} K D^{-1/2} and map its eigenvectors back to those of P\n    d_inv_sqrt = 1.0 / np.sqrt(degrees)\n    S = kernel_matrix * np.outer(d_inv_sqrt, d_inv_sqrt)\n    \n    # Compute eigenvalues and eigenvectors\n    eigvals, eigvecs = eigh(S)\n    eigvecs = eigvecs * d_inv_sqrt[:, None]\n    \n    # Sort eigenvalues and eigenvectors in decreasing order\n    sorted_indices = np.argsort(eigvals)[::-1]\n    eigvals = eigvals[sorted_indices]\n    eigvecs = eigvecs[:, sorted_indices]\n    \n    # Take the first `num_components` eigenvectors (excluding the first one which is trivial)\n    diffusion_coordinates = eigvecs[:, 1:num_components+1] @ np.diag(np.sqrt(eigvals[1:num_components+1]))\n    \n    return diffusion_coordinates\n\n\ndef plot_diffusion_map(diffusion_coordinates, color):\n    \"\"\"\n    Plot the 2D diffusion map.\n\n    Parameters:\n    - diffusion_coordinates: The 2D diffusion map coordinates.\n    \"\"\"\n    plt.figure(figsize=(8, 6))\n    plt.scatter(diffusion_coordinates[:, 0], diffusion_coordinates[:, 1], c=color, s=10, alpha=0.7)\n    plt.title('2D Diffusion Map')\n    plt.xlabel('Dimension 1')\n    plt.ylabel('Dimension 2')\n    plt.grid(True)\n    plt.show()\n\nNow we project the data points into the lower-dimensional space defined by the significant diffusion coordinates. This projection helps in visualizing and analyzing the local structure of the data.\n\n# Compute the diffusion map\neps = 0.2\nK = horizontal_diffusion_kernel(U,V,eps)\ndiffusion_coordinates = compute_diffusion_map(K, num_components=2)\n# Plot the 2D diffusion map, where color represents where they were on the fibre. 
Points that are mapped close together correspond to each other across nearby fibres.\nplot_diffusion_map(diffusion_coordinates,V)\n\n\n\n\n\n\n\nSimilarly, we perform the same procedure for the coupled diffusion kernel:\n\n# Compute the diffusion map\neps = 0.2\ndelta = 0.01\nK_c = coupled_diffusion_kernel(U,V,eps,delta)\n\ndiffusion_coordinates = compute_diffusion_map(K_c, num_components=2)\n# Plot the 2D diffusion map\nplot_diffusion_map(diffusion_coordinates,V)\n#plot_diffusion_map(diffusion_coordinates,U)\n\n\n\n\n\n\n\n\nThe points are colored according to their correspondence on all the fibres through component \(v\). If two points correspond to each other across different but nearby fibres, they are likely to be neighbors in the visualization above."
  },
  {
    "objectID": "posts/HDM/index.html#horizontal-base-diffusion-map-hbdm",
    "href": "posts/HDM/index.html#horizontal-base-diffusion-map-hbdm",
    "title": "Horizontal Diffusion Map",
    "section": "Horizontal base diffusion map (HBDM)",
    "text": "Horizontal base diffusion map (HBDM)\nIn addition to embedding all the data points, the framework also allows for embedding different objects (fibres). 
The new kernel is defined as the Frobenius norm of all entries in the previous kernel matrix that correspond to the two fibres:\n\neps = .2\nK = horizontal_diffusion_kernel(U,V,eps)\nK_base = np.zeros((N_base,N_base))\nfor i in range(N_base):\n    for j in range(N_base):\n        # Frobenius norm of the block of K relating fibre i to fibre j\n        K_base[i,j] = np.linalg.norm(K[np.ix_(range(N_fibre*i,N_fibre*(i+1)), range(N_fibre*j,N_fibre*(j+1)))], 'fro')\n\n\n# Compute the diffusion map\ndiffusion_coordinates = compute_diffusion_map(K_base, num_components=2)\n\n# Plot the 2D diffusion map\nplot_diffusion_map(diffusion_coordinates, np.sort(list(set(list(U)))))\n\n\n\n\n\n\n\n\nThe embedded points are colored according to the `ground truth’ \(u\). The smooth color transition shows that the embedding uncovers the information of all fibres on the base manifold."
  },
  {
    "objectID": "posts/HDM/index.html#applications-in-shape-data",
    "href": "posts/HDM/index.html#applications-in-shape-data",
    "title": "Horizontal Diffusion Map",
    "section": "Applications in shape data",
    "text": "Applications in shape data\nThe horizontal diffusion map framework is particularly useful in the two following aspects, both demonstrated in Gao et al.:\n\nHorizontal diffusion map (embedding all data points): The embedding automatically suggests a global registration for all fibres that respects a mutual similarity measure.\nHorizontal base diffusion map (embedding all data objects/fibres): Compared to the classical diffusion map without correspondences, the horizontal base diffusion map is more robust to noise and often demonstrates a clearer pattern of clusters."
  },
  {
    "objectID": "posts/contour-analysis/final-post.html",
    "href": "posts/contour-analysis/final-post.html",
    "title": "An analysis and segmentation of contours in AFM imaging data",
    "section": "",
    "text": "The segmentation of pieces in AFM images gives us a chance to gather information about their shape. This can very well be a determining characteristic for certain biological objects. Analyzing an image piece by piece is usually easier. It also allows us to iterate through pieces of an image if we wish to analyze something different that is not necessarily related to its shape.\nAlthough the work done in this project is applicable to any AFM image, one of my main goals is to detect R-loops in those images. Further information about this topic can be found in this previous blog post. Unedited AFM images in this blog post were captured by the Pyne Lab."
},
  {
    "objectID": "posts/contour-analysis/final-post.html#context-and-motivation",
    "href": "posts/contour-analysis/final-post.html#context-and-motivation",
    "title": "An analysis and segmentation of contours in AFM imaging data",
    "section": "",
    "text": "The segmentation of pieces in AFM images gives us a chance to gather information about their shape. This can very well be a determining characteristic for certain biological objects. Analyzing an image piece by piece is usually easier. It also allows us to iterate through pieces of an image if we wish to analyze something different that is not necessarily related to its shape.\nAlthough the work done in this project is applicable to any AFM image, one of my main goals is to detect R-loops in those images. Further information about this topic can be found in this previous blog post. Unedited AFM images in this blog post were captured by the Pyne Lab."
},
  {
    "objectID": "posts/contour-analysis/final-post.html#preparations-before-analysis",
    "href": "posts/contour-analysis/final-post.html#preparations-before-analysis",
    "title": "An analysis and segmentation of contours in AFM imaging data",
    "section": "Preparations before analysis",
    "text": "Preparations before analysis\nFor image denoising and binarization we will use the OpenCV library. Images will be loaded into NumPy arrays.\n\nimport cv2\nimport numpy as np\n\nBackground noise in images is problematic for edge detection algorithms. Many of them rely on counting pixels in a neighbourhood with similar color values. When noise is present, we are more likely to get disconnected edges. 
The most common way to get around this is to use Gaussian blurring, which replaces each pixel with the average of the pixels in a square of pre-determined size. This process makes the image smoother at the cost of some detail and precision.\nWe will use a better version of this algorithm called non-local means denoising (Buades, Coll, and Morel 2011). Instead of just looking at the immediate surroundings of a pixel, non-local denoising takes into account similar portions in the entire image and calculates the average of all those pixels.\n\n\n\n\nFigure 1, An AFM image of DNA fragments (picture by the Pyne lab)\n\n\n\nsrc = cv2.imread(\"data/data.png\", cv2.IMREAD_COLOR)\n\n# filter strength for luminance component = 10\n# filter strength for color components = 10\n# templateWindowSize = 7 (for computing weights)\n# searchWindowSize = 21 (for computing averages)\nsrc = cv2.fastNlMeansDenoisingColored(src,None,10,10,7,21)\n\ncv2.imwrite(\"data/data-denoised.png\", src)\n\n\n\n\nFigure 2, The image after denoising\n\n\n\nAs we are only interested in finding contours in the image, RGB colors will not be important; in fact, they make the image harder to analyze. We start by converting the color coding of the image to grayscale.\n\nsrc = cv2.imread(\"data/data-denoised.png\", cv2.IMREAD_COLOR)\n\nsrc_gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)\n\ncv2.imwrite(\"data/data-grayscale.png\", src_gray)\n\n\n\n\nFigure 3, Grayscale version of the image\n\n\n\nThe final step is to completely binarize the image. We are only interested in parts of the image that are considered to be DNA matter, which has a higher color value than the background. We will apply a threshold to the image. Any pixel with a color value above 80 is considered to be DNA matter and is mapped to a white pixel. 
Everything else is mapped to a black pixel.\n\nsrc_gray = cv2.imread(\"data/data-grayscale.png\", cv2.IMREAD_GRAYSCALE)\n\n# threshold: 80\n# max_value: 255\n# method: THRESH_BINARY\nret,src_binary = cv2.threshold(src_gray,80,255,cv2.THRESH_BINARY)\n\ncv2.imwrite(\"data/data-binary.png\", src_binary)\n\n\n\n\nFigure 4, The binarized image after the thresholding"
  },
  {
    "objectID": "posts/contour-analysis/final-post.html#finding-contours",
    "href": "posts/contour-analysis/final-post.html#finding-contours",
    "title": "An analysis and segmentation of contours in AFM imaging data",
    "section": "Finding contours",
    "text": "Finding contours\nWe will make use of the findContours function in OpenCV with the additional parameter RETR_TREE, which stands for contour retrieval tree. 
For our purposes, a contour is just a continuous set of points, but its position is also important. A shape can be located inside another shape or it might be connected to some other shape, which is useful information.\nWe consider the outer contour a parent, and the inner one a child. findContours returns a multi-dimensional array that contains the parent and child relation for any contour in an image.\nAfter finding contours from the binarized image, we draw them on top of the original AFM image we initially started with.\n\nsrc = cv2.imread(\"data/data.png\", cv2.IMREAD_COLOR)\nsrc_binary = cv2.imread(\"data/data-binary.png\", cv2.IMREAD_UNCHANGED)\n\ncontours, hierarchy = cv2.findContours(src_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)\nhierarchy = hierarchy[0]\n\nfor i,c in enumerate(contours):\n # omit very small contours on the background\n if (cv2.arcLength(c, True) < 75):\n continue\n color = (randint(0,255), randint(0,255), randint(0,255))\n cv2.drawContours(src, contours, i, color, 2)\n\ncv2.imwrite(\"data/data-contours.png\", src)\n\n\n\n\n\nFigure 5, Each contour is highlighted with a different color" }, { - "objectID": "posts/ET/ey.html#visualisation", - "href": "posts/ET/ey.html#visualisation", - "title": "Analysis of Eye Tracking Data", - "section": "2 Visualisation", - "text": "2 Visualisation\nOne way of visualizing your data in Tobii Pro Lab is by creating Heat maps. Heat maps visualize where a participant’s (or a group of participants’) fixations or gaze data samples were distributed on a still image or a video frame. The distribution of the data is represented with colors.Each sample corresponds to a gaze point from the eye tracker, consistently sampled every 1.6 to 33 milliseconds (depending on the sampling data rate of the eye tracker). When using an I-VT Filter, it will group the raw eye tracking samples into fixations. 
The duration of each fixation depends on the gaze filter used to identify the fixations.\n\n\n\nHeatmap"
  },
  {
    "objectID": "posts/ET/ey.html#features",
    "href": "posts/ET/ey.html#features",
    "title": "Analysis of Eye Tracking Data",
    "section": "3 Features",
    "text": "3 Features\n\nData processing of eye tracking recordings\n\nTo run a statistical study on the recorded data, we carried out data processing in two stages: first using Tobii Pro Lab, then the EMDAT package. Following the experiments, the files are processed using Tobii Pro Lab software. 
We delimited the AOI for each page, manually pointed the gaze points for the 22 participants on the 12 selected pages, and then exported the data for each participant in a tsv format.\nEMDAT was then used to generate the datasets. To extract the gaze features we used EMDAT with Python 2.7. EMDAT stands for Eye Movement Data Analysis Toolkit; it is an open-source toolkit developed by our group. EMDAT receives three types of input folders: a folder containing the recordings from Tobii in a tsv format, a Segment folder containing the timestamps for the start and end of page reading for each participant, and an AOI folder containing the coordinates and the time spent per participant on each AOI per page. We also automated the writing of the Segments and AOIs folders, and then ran the EMDAT script for each page. EMDAT also validates the quality of the recordings per page; here the parameter was set to VALIDITY_METHOD = 1 (see documentation). In particular, we found that the quality of the data did not diminish over the course of the recordings.\n\nEye tracking features\n\nUpon following the data processing protocol, we extracted the following features:\n\nnumber of fixations (quantitative feature): The number of fixations is defined as the total number of fixations recorded over the total duration spent on a page by a participant.\nmean fixation duration (duration feature): The mean fixation duration is defined as the average fixation duration during page reading.\nstandard deviation of the relative path angle (spatial feature): The standard deviation of the relative path angle is defined as the standard deviation of the relative angle between two successive saccades. This component enables us to capture the consistency of a participant’s gaze pattern. The greater the standard deviation, the more likely the participant is to look across the different areas of a page."
+ "objectID": "posts/contour-analysis/final-post.html#detecting-closed-contours", + "href": "posts/contour-analysis/final-post.html#detecting-closed-contours", + "title": "An analysis and segmentation of contours in AFM imaging data", + "section": "Detecting closed contours", + "text": "Detecting closed contours\nIf a contour passes through one pixel more than once, we expect it to have a child contour inside. A closed shape will have an outer contour and at least one inner contour. By looking at the values in the returned tree hierarchy, we can determine whether a contour is open or closed.\n\nsrc = cv2.imread(\"data/data.png\", cv2.IMREAD_COLOR)\nsrc_binary = cv2.imread(\"data/data-binary.png\", cv2.IMREAD_UNCHANGED)\n\ncontours, hierarchy = cv2.findContours(src_binary, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)\nhierarchy = hierarchy[0]\n\nfor i,c in enumerate(hierarchy):\n # find the first outermost contour\n if(hierarchy[i][1] == -1 and hierarchy[i][3] == -1):\n current = hierarchy[i]\n else:\n continue\n\n # after we find it, draw all the other outermost contours in the same level\n while(current[0] != -1):\n # omit very small contours on the background\n if (cv2.arcLength(contours[i], True) < 75):\n # point to the next element\n current = hierarchy[current[0]]\n i = current[0]\n continue\n\n # check whether the contour has a child\n if hierarchy[i][2] >= 0:\n cv2.drawContours(src, contours, i, (0, 255, 0), 2)\n else:\n cv2.drawContours(src, contours, i, (255, 0, 150), 2)\n\n # point to the next element\n current = hierarchy[current[0]]\n i = current[0]\n\n # after outermost contours are drawn, exit\n break\n\ncv2.imwrite(\"data/data-closed-contours.png\", src)\n\n\n\n\n\nFigure 7, Green contours are closed while the magenta ones are not" }, { - "objectID": "posts/ET/ey.html#t-test", - "href": "posts/ET/ey.html#t-test", - "title": "Analysis of Eye Tracking Data", - "section": "4 T-test", - "text": "4 T-test\nFirst, we wondered whether there were any major 
differences in the way the two groups read. To do this, we compared the two populations along the three axes (quantitative, duration, and spatial) defined in the previous section. To quantify these differences, we used a t-test to compare the means and a Kolmogorov-Smirnov (K-S) test to compare the distributions. Concerning the total number of fixations per page, the two populations seem to have the same characteristics (p-value > 0.1 and Cohen’s d = 0.2) and to come from the same distribution (two-sided K-S test p-value > 0.1). However, on the other two criteria, the autistic adolescents had a shorter mean fixation duration and a lower standard deviation (p-value < 0.05, Cohen’s d > 0.5), and their associated distributions lay below those of the control population (one-sided 'less' K-S test p-value > 0.1).\n\n\n\n\n\n\n\n\n\nT-test\nK-S test\n\n\n\n\nNum fixations\nNo statistically significant difference in the mean number of fixations (small effect size, two-sided p-value > 0.1)\nThe distributions of the number of fixations per page look similar across the two populations (KS two-sided p-value > 0.1)\n\n\nMean fixation duration\nND seems to have a shorter mean fixation duration (negative medium effect size, two-sided p-value < 0.01)\nThe ND mean fixation duration distribution is smaller than the NT mean fixation duration distribution (KS less p-value > 0.1)\n\n\nStandard deviation relative path angle\nND seems to have on average a smaller standard deviation (negative medium effect size, two-sided p-value < 0.01)\nThe ND std relative path angle distribution is smaller than the NT std relative path angle distribution (KS less p-value > 0.1)" + "objectID": "posts/contour-analysis/final-post.html#future-goals", + "href": "posts/contour-analysis/final-post.html#future-goals", + "title": "An analysis and segmentation of contours in AFM imaging data", + "section": "Future goals", + "text": "Future goals\nThis program sometimes gives false positives if there are artificial holes inside the DNA 
strand. If such a hole is detected as an inner loop, the program treats the contour as closed even though it is not.\nDepending on how bright the picture is, the color threshold value needs to be adjusted manually; otherwise, some parts of the DNA will not appear in the binarized image. Automatic threshold detection would be preferable.\nCurrently, I am using the Python bindings of OpenCV, which was originally written in C++. Heavy operations take a considerable amount of time in Python, so one of my plans is to rewrite this in C++." } ] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index bd06138..66917d5 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,134 +2,138 @@ https://bioshape-analysis.github.io/blog/posts/cvt/index.html - 2024-12-17T22:14:09.592Z + 2024-12-18T04:43:31.515Z - https://bioshape-analysis.github.io/blog/posts/sy mds tunnel/index.html - 2024-12-17T22:14:10.908Z + https://bioshape-analysis.github.io/blog/posts/ET/ey.html + 2024-12-18T04:43:31.047Z - https://bioshape-analysis.github.io/blog/posts/OTalignment/index.html - 2024-12-17T22:14:09.548Z + https://bioshape-analysis.github.io/blog/posts/HDM/index.html + 2024-12-18T04:43:31.107Z - https://bioshape-analysis.github.io/blog/posts/point-cloud/pointcloud.html - 2024-12-17T22:14:09.800Z + https://bioshape-analysis.github.io/blog/posts/principal-curves/principal-curves.html + 2024-12-18T04:43:31.727Z - https://bioshape-analysis.github.io/blog/posts/AlphaShape/index.html - 2024-12-17T22:14:09.172Z + https://bioshape-analysis.github.io/blog/posts/Embryonic-Shape/index.html + 2024-12-18T04:43:31.075Z - https://bioshape-analysis.github.io/blog/posts/ribosome-landmarks/index.html - 2024-12-17T22:14:09.948Z + https://bioshape-analysis.github.io/blog/posts/landmarks-final/index.html + 2024-12-18T04:43:31.631Z - https://bioshape-analysis.github.io/blog/posts/outlier-detection/DeCOr-MDS.html - 2024-12-17T22:14:09.720Z + https://bioshape-analysis.github.io/blog/posts/Neural-Manifold/index.html + 
2024-12-18T04:43:31.415Z - https://bioshape-analysis.github.io/blog/posts/quasiconformalmap/index.html - 2024-12-17T22:14:09.804Z + https://bioshape-analysis.github.io/blog/posts/ribosome-tunnel-new/index.html + 2024-12-18T04:43:32.815Z - https://bioshape-analysis.github.io/blog/posts/RECOVAR/index.html - 2024-12-17T22:14:09.580Z + https://bioshape-analysis.github.io/blog/posts/Farm-Shape-Analysis/index.html + 2024-12-18T04:43:31.107Z - https://bioshape-analysis.github.io/blog/posts/elastic-metric/elastic_metric.html - 2024-12-17T22:14:09.592Z + https://bioshape-analysis.github.io/blog/posts/vascularNetworks/VascularNetworks.html + 2024-12-18T04:43:32.831Z - https://bioshape-analysis.github.io/blog/posts/ImageMorphing/OT4DiseaseProgression.html - 2024-12-17T22:14:09.236Z + https://bioshape-analysis.github.io/blog/posts/elastic-metric/osteosarcoma_analysis.html + 2024-12-18T04:43:31.543Z - https://bioshape-analysis.github.io/blog/posts/AFM-data_2/index.html - 2024-12-17T22:14:09.172Z + https://bioshape-analysis.github.io/blog/posts/ImageMorphing/OT4DiseaseProgression2.html + 2024-12-18T04:43:31.107Z - https://bioshape-analysis.github.io/blog/posts/biology/index.html - 2024-12-17T22:14:09.580Z + https://bioshape-analysis.github.io/blog/posts/MATH-612/index.html + 2024-12-18T04:43:31.279Z - https://bioshape-analysis.github.io/blog/posts/morphology/proposal.html - 2024-12-17T22:14:09.720Z + https://bioshape-analysis.github.io/blog/posts/cryo_ET/demo.html + 2024-12-18T04:43:31.503Z - https://bioshape-analysis.github.io/blog/posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html - 2024-12-17T22:14:09.172Z + https://bioshape-analysis.github.io/blog/posts/AFM-data/index.html + 2024-12-18T04:43:31.027Z - https://bioshape-analysis.github.io/blog/posts/rloop-analysis/rloop-analysis.html - 2024-12-17T22:14:10.900Z + https://bioshape-analysis.github.io/blog/posts/extension_to_RECOVAR/index.html + 2024-12-18T04:43:31.575Z + + + 
https://bioshape-analysis.github.io/blog/index.html + 2024-12-18T04:43:31.019Z https://bioshape-analysis.github.io/blog/about.html - 2024-12-17T22:14:09.148Z + 2024-12-18T04:43:31.019Z - https://bioshape-analysis.github.io/blog/index.html - 2024-12-17T22:14:09.148Z + https://bioshape-analysis.github.io/blog/posts/rloop-analysis/rloop-analysis.html + 2024-12-18T04:43:32.823Z - https://bioshape-analysis.github.io/blog/posts/extension_to_RECOVAR/index.html - 2024-12-17T22:14:09.652Z + https://bioshape-analysis.github.io/blog/posts/CC-cells/Shape_Analysis_of_Contractile_Cells.html + 2024-12-18T04:43:31.047Z - https://bioshape-analysis.github.io/blog/posts/AFM-data/index.html - 2024-12-17T22:14:09.152Z + https://bioshape-analysis.github.io/blog/posts/morphology/proposal.html + 2024-12-18T04:43:31.639Z - https://bioshape-analysis.github.io/blog/posts/cryo_ET/demo.html - 2024-12-17T22:14:09.584Z + https://bioshape-analysis.github.io/blog/posts/biology/index.html + 2024-12-18T04:43:31.451Z - https://bioshape-analysis.github.io/blog/posts/MATH-612/index.html - 2024-12-17T22:14:09.408Z + https://bioshape-analysis.github.io/blog/posts/AFM-data_2/index.html + 2024-12-18T04:43:31.047Z - https://bioshape-analysis.github.io/blog/posts/ImageMorphing/OT4DiseaseProgression2.html - 2024-12-17T22:14:09.236Z + https://bioshape-analysis.github.io/blog/posts/ImageMorphing/OT4DiseaseProgression.html + 2024-12-18T04:43:31.107Z - https://bioshape-analysis.github.io/blog/posts/elastic-metric/osteosarcoma_analysis.html - 2024-12-17T22:14:09.620Z + https://bioshape-analysis.github.io/blog/posts/elastic-metric/elastic_metric.html + 2024-12-18T04:43:31.515Z - https://bioshape-analysis.github.io/blog/posts/vascularNetworks/VascularNetworks.html - 2024-12-17T22:14:10.908Z + https://bioshape-analysis.github.io/blog/posts/RECOVAR/index.html + 2024-12-18T04:43:31.451Z - https://bioshape-analysis.github.io/blog/posts/Farm-Shape-Analysis/index.html - 2024-12-17T22:14:09.232Z + 
https://bioshape-analysis.github.io/blog/posts/quasiconformalmap/index.html + 2024-12-18T04:43:31.727Z - https://bioshape-analysis.github.io/blog/posts/ribosome-tunnel-new/index.html - 2024-12-17T22:14:10.892Z + https://bioshape-analysis.github.io/blog/posts/outlier-detection/DeCOr-MDS.html + 2024-12-18T04:43:31.639Z - https://bioshape-analysis.github.io/blog/posts/Neural-Manifold/index.html - 2024-12-17T22:14:09.544Z + https://bioshape-analysis.github.io/blog/posts/ribosome-landmarks/index.html + 2024-12-18T04:43:31.867Z - https://bioshape-analysis.github.io/blog/posts/landmarks-final/index.html - 2024-12-17T22:14:09.708Z + https://bioshape-analysis.github.io/blog/posts/AlphaShape/index.html + 2024-12-18T04:43:31.047Z - https://bioshape-analysis.github.io/blog/posts/Embryonic-Shape/index.html - 2024-12-17T22:14:09.200Z + https://bioshape-analysis.github.io/blog/posts/point-cloud/pointcloud.html + 2024-12-18T04:43:31.719Z - https://bioshape-analysis.github.io/blog/posts/principal-curves/principal-curves.html - 2024-12-17T22:14:09.804Z + https://bioshape-analysis.github.io/blog/posts/OTalignment/index.html + 2024-12-18T04:43:31.419Z - https://bioshape-analysis.github.io/blog/posts/HDM/index.html - 2024-12-17T22:14:09.232Z + https://bioshape-analysis.github.io/blog/posts/sy mds tunnel/index.html + 2024-12-18T04:43:32.827Z - https://bioshape-analysis.github.io/blog/posts/ET/ey.html - 2024-12-17T22:14:09.172Z + https://bioshape-analysis.github.io/blog/posts/contour-analysis/final-post.html + 2024-12-18T04:43:31.503Z