Algorithm and new-style of figure references #1

Merged
merged 1 commit into from
Oct 3, 2024
58 changes: 26 additions & 32 deletions 02_Methodology.md
@@ -1,11 +1,9 @@
---
title: Methodology
numbering:
enumerator: 2.%s
enumerator: 2.%s
---



## Algorithms from Computer Vision

The development of neural networks originated from the artificial neuron model introduced by {cite:t}`mccullochLogicalCalculusIdeas1943`, on which the perceptron was later built. A neural network is a computational model that mimics the neurons of the brain and consists of multiple connected nodes (neurons) distributed across input, hidden, and output layers {cite:p}`bishopPatternRecognitionMachine2006`. It transmits and processes information through weights and biases to achieve a complex nonlinear mapping of the input data. Trained by back-propagation, the network iteratively adjusts its weights and biases to learn from the data and make predictions or classifications {cite:p}`vapnikNatureStatisticalLearning1995`. Whereas traditional image processing methods require manually designed feature extraction algorithms, neural networks can learn and extract features from images automatically, reducing manual intervention and design complexity {cite:p}`lecunBackpropagationAppliedHandwritten1989`. As machine learning research has progressed, various neural network architectures have been developed, including Convolutional Neural Networks (CNN) {cite:p}`lecunGradientbasedLearningApplied1998`, Autoencoders {cite:p}`hintonReducingDimensionalityData2006`, Transformers {cite:p}`vaswaniAttentionAllYou2017`, and structured state space models (SSM) {cite:p}`guMambaLinearTimeSequence2023`.
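As a minimal sketch of the "weights and biases" mapping described above, the following pure-Python example runs one forward pass through a tiny fully connected network. The layer sizes, weight values, and function names (`dense`, `forward`) are hypothetical and chosen only for illustration; training by back-propagation is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ x + b), computed row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x):
    """Forward pass of a 2-2-1 network (input -> hidden -> output)."""
    # Hypothetical parameters; a trained network would learn these values.
    hidden = dense(x, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], math.tanh)
    return dense(hidden, [[1.0, -1.0]], [0.5], sigmoid)


output = forward([1.0, 2.0])
print(output)
```

The nonlinear activations (`tanh`, `sigmoid`) are what make the overall mapping nonlinear; without them, stacked layers would collapse into a single linear map.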
@@ -31,34 +29,31 @@ The DeAOT model is a video tracking model, which gives image segmentation (spati
Model pipeline diagram. Each frame is processed by a segmenter to generate a segmentation ID mask, followed by a tracker to update object IDs and produce a tracking ID mask.
```

{numref}`Figure %s <oxlKu6dMaY>` shows the segmentation and tracking process used in microscopic video analysis. Each frame is first processed by the Segmenter to generate a “Mask with Track ID”, in which each segmented region is shown in a different color in the schematic. For the initial frame, the segmentation result is added as a reference for the Tracker to use in the next frame. From the second frame onward, the Segmenter continues to process each frame to generate the corresponding “Track-ID”. The Tracker then receives this “Track-ID” and updates the object IDs based on the information from the previous frame to generate the “Mask with Track ID (Track-ID)”. This process is repeated for every frame of the video: the segmentation and tracking results of each frame serve as the reference for tracking in the next frame, which keeps object identities consistent over time. By combining the segmentation and tracking results in each frame, the pipeline supports efficient microscopic video analysis. The segmentation information provided by the Segmenter enables the Tracker to identify and track dynamically changing targets more accurately.

```{math}
:label: psudocode
\begin{array}{ll}
\hline
\hline
&{\textbf{Algorithm 1} \text{ Microscopic Video Tracking}} \\
\hline
1: & \text{Initialize Segmenter} \\
2: & \text{Initialize Tracker} \\
3: & \textbf{for} \text{ each frame } f_i \text{ in video } \textbf{do}\\
4: & \quad \textbf{if } i = 0 \textbf{ then} \\
5: & \quad\quad \textit{pred\_mask} \gets \text{Segmenter.segment}(f_i) \\
6: & \quad\quad \text{Tracker.add\_reference(}\textit{pred\_mask}\text{)} \\
7: & \quad \textbf{else} \\
8: & \quad\quad \textit{seg\_mask} \gets \text{Segmenter.segment}(f_i) \\
9: & \quad\quad \textit{track\_mask} \gets \text{Tracker.track(}\textit{seg\_mask}\text{)} \\
10: & \quad\quad \textit{new\_obj\_mask} \gets \text{Tracker.detect\_new\_objects(}\textit{track\_mask, seg\_mask}\text{)} \\
11: & \quad\quad \textit{pred\_mask} \gets \textit{track\_mask} + \textit{new\_obj\_mask} \\
12: & \quad\quad \text{Tracker.add\_reference(}\textit{pred\_mask}\text{)} \\
13: & \quad \textbf{end if} \\
14: & \textbf{end for} \\
\hline
\end{array}
@oxlKu6dMaY shows the segmentation and tracking process used in microscopic video analysis. Each frame is first processed by the Segmenter to generate a “Mask with Track ID”, in which each segmented region is shown in a different color in the schematic. For the initial frame, the segmentation result is added as a reference for the Tracker to use in the next frame. From the second frame onward, the Segmenter continues to process each frame to generate the corresponding “Track-ID”. The Tracker then receives this “Track-ID” and updates the object IDs based on the information from the previous frame to generate the “Mask with Track ID (Track-ID)”. This process is repeated for every frame of the video: the segmentation and tracking results of each frame serve as the reference for tracking in the next frame, which keeps object identities consistent over time. By combining the segmentation and tracking results in each frame, the pipeline supports efficient microscopic video analysis. The segmentation information provided by the Segmenter enables the Tracker to identify and track dynamically changing targets more accurately.

```{raw} latex
\begin{algorithm}
\caption{Microscopic Video Tracking}\label{psudocode}
\begin{algorithmic}[1]
\State Initialize Segmenter
\State Initialize Tracker
\For{each frame $f_i$ in video}
\If{$i = 0$}
\State \textit{pred\_mask} $\gets$ Segmenter.segment($f_i$)
\State Tracker.add\_reference(\textit{pred\_mask})
\Else
\State \textit{seg\_mask} $\gets$ Segmenter.segment($f_i$)
\State \textit{track\_mask} $\gets$ Tracker.track(\textit{seg\_mask})
\State \textit{new\_obj\_mask} $\gets$ Tracker.detect\_new\_objects(\textit{track\_mask, seg\_mask})
\State \textit{pred\_mask} $\gets$ \textit{track\_mask} + \textit{new\_obj\_mask}
\State Tracker.add\_reference(\textit{pred\_mask})
\EndIf
\EndFor
\end{algorithmic}
\end{algorithm}
```

Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s <oxlKu6dMaY>`. First, the segmenter and tracker are initialized. For the first frame of the video, the segmenter generates a prediction mask, which is added as a reference for the tracker. For each subsequent frame, the segmenter generates a segmentation mask, and the tracker tracks it to produce a tracking mask. The tracker then compares the tracking mask with the segmentation mask to detect a new object mask. The tracking mask and the new object mask are merged into a prediction mask, which is added as the next reference for the tracker. In this way, every frame is both segmented and tracked, so target objects keep their identities across frames and newly appearing objects are picked up. This makes the approach suitable for long-duration tracking and analysis of microstructures.
@psudocode is the pseudocode for the flow in @oxlKu6dMaY. First, the segmenter and tracker are initialized. For the first frame of the video, the segmenter generates a prediction mask, which is added as a reference for the tracker. For each subsequent frame, the segmenter generates a segmentation mask, and the tracker tracks it to produce a tracking mask. The tracker then compares the tracking mask with the segmentation mask to detect a new object mask. The tracking mask and the new object mask are merged into a prediction mask, which is added as the next reference for the tracker. In this way, every frame is both segmented and tracked, so target objects keep their identities across frames and newly appearing objects are picked up. This makes the approach suitable for long-duration tracking and analysis of microstructures.
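The control flow of the pseudocode can be sketched in plain Python as follows. The `Segmenter` and `Tracker` classes here are toy stand-ins written for this example, not the models used in the paper: the segmenter labels 4-connected regions of a binary frame, and the tracker propagates IDs by pixel overlap with the reference mask. Only the method names (`segment`, `track`, `detect_new_objects`, `add_reference`) follow the pseudocode; everything inside them is an assumption.

```python
from collections import Counter, deque
from typing import List

Mask = List[List[int]]  # 0 = background, positive integers = object IDs


class Segmenter:
    """Toy stand-in: labels 4-connected foreground regions of a binary frame."""

    def segment(self, frame: Mask) -> Mask:
        h, w = len(frame), len(frame[0])
        labels = [[0] * w for _ in range(h)]
        next_label = 0
        for r in range(h):
            for c in range(w):
                if frame[r][c] and not labels[r][c]:
                    next_label += 1
                    labels[r][c] = next_label
                    queue = deque([(r, c)])
                    while queue:  # breadth-first flood fill of one region
                        y, x = queue.popleft()
                        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                            if 0 <= ny < h and 0 <= nx < w and frame[ny][nx] and not labels[ny][nx]:
                                labels[ny][nx] = next_label
                                queue.append((ny, nx))
        return labels


class Tracker:
    """Toy stand-in: carries IDs forward by overlap with the reference mask."""

    def __init__(self) -> None:
        self.reference = None
        self.next_id = 1

    def add_reference(self, mask: Mask) -> None:
        self.reference = mask
        self.next_id = max(self.next_id, max(max(row) for row in mask) + 1)

    @staticmethod
    def _regions(seg_mask: Mask) -> dict:
        regions = {}
        for r, row in enumerate(seg_mask):
            for c, label in enumerate(row):
                if label:
                    regions.setdefault(label, []).append((r, c))
        return regions

    def track(self, seg_mask: Mask) -> Mask:
        out = [[0] * len(row) for row in seg_mask]
        for pixels in self._regions(seg_mask).values():
            votes = Counter(self.reference[r][c] for r, c in pixels if self.reference[r][c])
            if votes:  # region overlaps a known object: reuse its ID
                track_id = votes.most_common(1)[0][0]
                for r, c in pixels:
                    out[r][c] = track_id
        return out

    def detect_new_objects(self, track_mask: Mask, seg_mask: Mask) -> Mask:
        out = [[0] * len(row) for row in seg_mask]
        for pixels in self._regions(seg_mask).values():
            if all(track_mask[r][c] == 0 for r, c in pixels):  # untracked region
                for r, c in pixels:
                    out[r][c] = self.next_id
                self.next_id += 1
        return out


def run_pipeline(frames: List[Mask]) -> List[Mask]:
    segmenter, tracker = Segmenter(), Tracker()
    results = []
    for i, frame in enumerate(frames):
        if i == 0:
            pred_mask = segmenter.segment(frame)
        else:
            seg_mask = segmenter.segment(frame)
            track_mask = tracker.track(seg_mask)
            new_obj_mask = tracker.detect_new_objects(track_mask, seg_mask)
            # The two masks never overlap, so element-wise addition merges them.
            pred_mask = [[t + n for t, n in zip(tr, nr)]
                         for tr, nr in zip(track_mask, new_obj_mask)]
        tracker.add_reference(pred_mask)  # done in both branches of the pseudocode
        results.append(pred_mask)
    return results


frames = [
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0]],  # one object at the bottom
    [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1]],  # it moves right; a new one appears
]
masks = run_pipeline(frames)
```

In the second frame the moved object keeps ID 1 even though the toy segmenter labels it second in scan order, while the newly appearing object receives the fresh ID 2. That ID consistency is the purpose of adding each prediction mask back as a reference.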

## Model Information and Training Data

@@ -98,8 +93,7 @@ Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s <o

```

{numref}`Table %s <vwwdoR5dj5>` lists the names and parameter counts of the four models used in this paper. The YOLOv8n-seg model has 3,409,968 parameters, the Swin-UNet model has 3,645,600 parameters, and the VMamba model has 3,145,179 parameters. A distilled version of SAM with 10,185,380 parameters is also used. The high computational requirements of SAM limit its wider application, so many studies focus on distilling SAM, such as FastSAM {cite:p}`zhaoFastSegmentAnything2023`, TinySAM {cite:p}`shuTinySAMPushingEnvelope2024`, MobileSAM {cite:p}`zhangFasterSegmentAnything2023`, EfficientSAM {cite:p}`xiongEfficientSAMLeveragedMasked2023`, and SAM-Lightening {cite:p}`songSAMLighteningLightweightSegment2024`. EfficientSAM utilizes the SAM-leveraged masked image pre-training (SAMI) strategy. It employs an MAE-based pre-training method combined with SAM models to obtain high-quality pre-trained ViT encoders. This method is suited to extracting knowledge from large self-supervised pre-trained SAM models, yielding models that are both lightweight and efficient. This knowledge distillation strategy significantly optimizes the transfer of knowledge from the teacher model to the student model {cite:p}`heMaskedAutoencodersAre2021; baiMaskedAutoencodersEnable2022`.

@vwwdoR5dj5 lists the names and parameter counts of the four models used in this paper. The YOLOv8n-seg model has 3,409,968 parameters, the Swin-UNet model has 3,645,600 parameters, and the VMamba model has 3,145,179 parameters. A distilled version of SAM with 10,185,380 parameters is also used. The high computational requirements of SAM limit its wider application, so many studies focus on distilling SAM, such as FastSAM {cite:p}`zhaoFastSegmentAnything2023`, TinySAM {cite:p}`shuTinySAMPushingEnvelope2024`, MobileSAM {cite:p}`zhangFasterSegmentAnything2023`, EfficientSAM {cite:p}`xiongEfficientSAMLeveragedMasked2023`, and SAM-Lightening {cite:p}`songSAMLighteningLightweightSegment2024`. EfficientSAM utilizes the SAM-leveraged masked image pre-training (SAMI) strategy. It employs an MAE-based pre-training method combined with SAM models to obtain high-quality pre-trained ViT encoders. This method is suited to extracting knowledge from large self-supervised pre-trained SAM models, yielding models that are both lightweight and efficient. This knowledge distillation strategy significantly optimizes the transfer of knowledge from the teacher model to the student model {cite:p}`heMaskedAutoencodersAre2021; baiMaskedAutoencodersEnable2022`.
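To put the parameter counts quoted above side by side, a short script can express each model's size relative to the smallest one. The counts are taken from the text; the dictionary keys and variable names are labels chosen for this example.

```python
# Parameter counts as reported in the paragraph above.
param_counts = {
    "YOLOv8n-seg": 3_409_968,
    "Swin-UNet": 3_645_600,
    "VMamba": 3_145_179,
    "Distilled SAM": 10_185_380,
}

smallest = min(param_counts.values())

# Relative size of each model compared with the smallest one.
relative_size = {name: count / smallest for name, count in param_counts.items()}

for name, count in sorted(param_counts.items(), key=lambda kv: kv[1]):
    print(f"{name:<14} {count / 1e6:6.2f} M  ({relative_size[name]:.2f}x smallest)")
```

The three task-specific models are all within roughly 16% of each other in size, while the distilled SAM is a little over three times larger than the smallest model.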

```{figure} #app:fig2
:name: cEwjf0su9Q
@@ -110,6 +104,6 @@ Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s <o
512×512 TEM images and their corresponding ground truth mask images.
```

{numref}`Figure %s <cEwjf0su9Q>` illustrates the dataset used in this paper, which consists of transmission electron microscopy (TEM) images and their corresponding ground truth masks. The raw images are sliced into subimages of 512 × 512 pixels for model training and evaluation. The entire dataset consists of 2000 images, of which 1400 are used for training and 600 for testing. The ground truth mask of each image is labeled manually to ensure accurate and reliable labels. Each TEM image is paired with a mask that shows the target nanoparticles as white areas and the non-target areas as black background. These masks are used for supervised training, and pairing high-resolution TEM images with accurately labeled ground truth masks allows the model to learn to distinguish nanoparticles with high accuracy.
@cEwjf0su9Q illustrates the dataset used in this paper, which consists of transmission electron microscopy (TEM) images and their corresponding ground truth masks. The raw images are sliced into subimages of 512 × 512 pixels for model training and evaluation. The entire dataset consists of 2000 images, of which 1400 are used for training and 600 for testing. The ground truth mask of each image is labeled manually to ensure accurate and reliable labels. Each TEM image is paired with a mask that shows the target nanoparticles as white areas and the non-target areas as black background. These masks are used for supervised training, and pairing high-resolution TEM images with accurately labeled ground truth masks allows the model to learn to distinguish nanoparticles with high accuracy.
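The slicing and splitting steps described above might look like the following sketch. The text gives only the tile size (512 × 512) and the split sizes (1400 / 600), so the rest is assumed for illustration: non-overlapping tiles, a seeded random split, the 2048 × 2048 raw image size, and the helper names `tile_origins` and `split_dataset`.

```python
import random

TILE = 512  # tile edge length in pixels, as stated in the text


def tile_origins(height: int, width: int, tile: int = TILE):
    """Top-left (row, col) corners of non-overlapping tiles that fit inside the image."""
    return [
        (y, x)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def split_dataset(items, n_train: int, seed: int = 0):
    """Shuffle reproducibly, then cut into a training part and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical raw image size; the actual raw TEM resolution is not given here.
origins = tile_origins(2048, 2048)

# 2000 tile indices split into 1400 for training and 600 for testing.
train_ids, test_ids = split_dataset(range(2000), n_train=1400)
```

Each origin `(y, x)` corresponds to the crop `image[y:y + 512, x:x + 512]`, and the same crop would be applied to the ground truth mask so that image and mask stay aligned.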

For the YOLOv8n-seg model and the VMamba model, data with a resolution of 512×512 were used directly for training. The EfficientSAM model, however, is a pre-trained model that requires the input image and output mask to have a fixed size of 1024×1024. The Swin-UNet model uses images scaled to 448×448, because the “Shifted Windows” operation in the Swin Transformer restricts the input and output resolution to a multiple of 224. Therefore, the training and test data were resized to match each model's input requirements, without re-cropping the raw images.
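A minimal sketch of the resizing step is shown below. The target sizes come from the text, while the interpolation method is not specified there: nearest-neighbour sampling is assumed here because it keeps binary mask values intact, and `resize_nearest` is a hypothetical helper for square images.

```python
# Input sizes taken from the text; the keys are labels for illustration.
MODEL_INPUT_SIZE = {
    "YOLOv8n-seg": 512,
    "VMamba": 512,
    "EfficientSAM": 1024,
    "Swin-UNet": 448,
}


def resize_nearest(image, size: int):
    """Resize a square 2D grid to size x size with nearest-neighbour sampling."""
    src = len(image)
    return [
        [image[y * src // size][x * src // size] for x in range(size)]
        for y in range(size)
    ]


# Tiny 2x2 "mask" upscaled to 4x4: every source pixel becomes a 2x2 block.
mask = [[0, 1],
        [1, 0]]
upscaled = resize_nearest(mask, 4)

# A 512x512 tile resized for Swin-UNet keeps only 0/1 values.
tile = [[(x // 64 + y // 64) % 2 for x in range(512)] for y in range(512)]
swin_input = resize_nearest(tile, MODEL_INPUT_SIZE["Swin-UNet"])
```

For the images themselves a smoother interpolation (for example bilinear) is a common choice, but masks are usually resized with nearest-neighbour sampling so that no intermediate gray values appear along particle boundaries.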