diff --git a/02_Methodology.md b/02_Methodology.md index 2f3d791..6272215 100644 --- a/02_Methodology.md +++ b/02_Methodology.md @@ -1,11 +1,9 @@ --- title: Methodology numbering: - enumerator: 2.%s + enumerator: 2.%s --- - - ## Algorithms from Computer Vision The development of neural networks originated from the perceptron, introduced by {cite:t}`mccullochLogicalCalculusIdeas1943`. They are a computational model that mimics the neurons of the brain and consists of multiple connected nodes (neurons) distributed in the input, hidden, and output layers {cite:p}`bishopPatternRecognitionMachine2006`. A neural network transmits and processes information through weight and bias to achieve complex nonlinear mapping of input data. Trained by back-propagation, the neural network continuously adjusts weight and bias to learn from the data and make predictions or classifications {cite:p}`vapnikNatureStatisticalLearning1995`. While traditional image processing methods require manual design of feature extraction algorithms, neural networks, on the other hand, can automatically learn and extract features from images, thus reducing manual intervention and design complexity {cite:p}`lecunBackpropagationAppliedHandwritten1989`. With the development of machine learning research, various neural network architectures have been developed, including Convolutional Neural Networks (CNN) {cite:p}`lecunGradientbasedLearningApplied1998` Autoencoders {cite:p}`hintonReducingDimensionalityData2006`, Transformer {cite:p}`vaswaniAttentionAllYou2017`, and structured state space models (SSM) {cite:p}`guMambaLinearTimeSequence2023`. @@ -31,34 +29,31 @@ The DeAOT model is a video tracking model, which gives image segmentation (spati Model pipeline diagram. Each frame is processed by a segmenter to generate a segmentation ID mask, followed by a tracker to update object IDs and produce a tracking ID mask. ``` -{numref}`Figure %s ` shows the segmentation and tracking process used in microscopic video analysis. For each frame, the image is first processed through the Segmenter to generate a “Mask with Track ID”, where each segmented region in the schematic is identified with a different color. In the initial frame, the segmentation result is added as a reference for the Tracker in the next frame. Starting from the second frame, the Segmenter continues to process each frame to generate the corresponding “Track-ID”. The Tracker then receives these “Track-ID” and updates the object IDs based on the information from the previous frame to generate the “Mask with Track ID (Track-ID)”. This process is repeated in each frame of the video. The segmentation and tracking results of each frame are used as a reference for the tracking of the next frame, thus ensuring continuous tracking and accurate identification of the object. This pipeline enables efficient microscopic video analysis by combining the segmentation and tracking results in each frame. The segmentation information provided by the Tracker enables the tracker to more accurately identify and track dynamically changing targets. 
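+
+For readers who prefer code to pseudocode, the segment-then-track loop described here (and formalized in Algorithm 1 below) can be sketched in Python. `Segmenter` and `Tracker` are hypothetical wrappers (e.g., around EfficientSAM and DeAOT); only the control flow is taken from this paper, not any specific library API.
+
+```python
+def track_video(frames, segmenter, tracker):
+    """Return one integer ID mask per frame (0 = background)."""
+    pred_masks = []
+    for i, frame in enumerate(frames):
+        if i == 0:
+            pred_mask = segmenter.segment(frame)       # initial segmentation
+            tracker.add_reference(pred_mask)           # seed the tracker
+        else:
+            seg_mask = segmenter.segment(frame)        # per-frame segmentation
+            track_mask = tracker.track(seg_mask)       # propagate existing object IDs
+            new_obj_mask = tracker.detect_new_objects(track_mask, seg_mask)
+            # new-object IDs are assumed to occupy only background pixels of track_mask
+            pred_mask = track_mask + new_obj_mask
+            tracker.add_reference(pred_mask)           # reference for the next frame
+        pred_masks.append(pred_mask)
+    return pred_masks
+```
+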
- -```{math} -:label: psudocode -\begin{array}{ll} -\hline -\hline -&{\textbf{Algorithm 1} \text{ Microscopic Video Tracking}} \\ -\hline -1: & \text{Initialize Segmenter} \\ -2: & \text{Initialize Tracker} \\ -3: & \textbf{for} \text{ each frame } f_i \text{ in video } \textbf{do}\\ -4: & \quad \textbf{if } i = 0 \textbf{ then} \\ -5: & \quad\quad \textit{pred\_mask} \gets \text{Segmenter.segment}(f_i) \\ -6: & \quad\quad \text{Tracker.add\_reference(}\textit{pred\_mask}\text{)} \\ -7: & \quad \textbf{else} \\ -8: & \quad\quad \textit{seg\_mask} \gets \text{Segmenter.segment}(f_i) \\ -9: & \quad\quad \textit{track\_mask} \gets \text{Tracker.track(}\textit{seg\_mask}\text{)} \\ -10: & \quad\quad \textit{new\_obj\_mask} \gets \text{Tracker.detect\_new\_objects(}\textit{track\_mask, seg\_mask}\text{)} \\ -11: & \quad\quad \textit{pred\_mask} \gets \textit{track\_mask} + \textit{new\_obj\_mask} \\ -12: & \quad\quad \text{Tracker.add\_reference(}\textit{pred\_mask}\text{)} \\ -13: & \quad \textbf{end if} \\ -14: & \textbf{end for} \\ -\hline -\end{array} +@oxlKu6dMaY shows the segmentation and tracking process used in microscopic video analysis. For each frame, the image is first processed through the Segmenter to generate a “Mask with Track ID”, where each segmented region in the schematic is identified with a different color. In the initial frame, the segmentation result is added as a reference for the Tracker in the next frame. Starting from the second frame, the Segmenter continues to process each frame to generate the corresponding “Track-ID”. The Tracker then receives these “Track-ID” and updates the object IDs based on the information from the previous frame to generate the “Mask with Track ID (Track-ID)”. This process is repeated in each frame of the video. The segmentation and tracking results of each frame are used as a reference for the tracking of the next frame, thus ensuring continuous tracking and accurate identification of the object. This pipeline enables efficient microscopic video analysis by combining the segmentation and tracking results in each frame. The segmentation information provided by the Segmenter enables the tracker to more accurately identify and track dynamically changing targets. + +```{raw} latex +\begin{algorithm} +\caption{Microscopic Video Tracking}\label{psudocode} +\begin{algorithmic}[1] +\State Initialize Segmenter +\State Initialize Tracker +\For{each frame $f_i$ in video} + \If{$i = 0$} + \State \textit{pred\_mask} $\gets$ Segmenter.segment($f_i$) + \State Tracker.add\_reference(\textit{pred\_mask}) + \Else + \State \textit{seg\_mask} $\gets$ Segmenter.segment($f_i$) + \State \textit{track\_mask} $\gets$ Tracker.track(\textit{seg\_mask}) + \State \textit{new\_obj\_mask} $\gets$ Tracker.detect\_new\_objects(\textit{track\_mask, seg\_mask}) + \State \textit{pred\_mask} $\gets$ \textit{track\_mask} + \textit{new\_obj\_mask} + \State Tracker.add\_reference(\textit{pred\_mask}) + \EndIf +\EndFor +\end{algorithmic} +\end{algorithm} ``` -Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s `. First, the segmenter and tracker are initialized. For each frame in the video, if it is the first frame, a prediction mask is generated by the segmenter and this mask is added as a reference for the tracker. For subsequent frames, a segmentation mask is generated by the segmenter and the segmentation mask is tracked by the tracker to generate a tracking mask. At the same time, a new object mask is detected.
The tracking mask and the new object mask are merged to generate a prediction mask, and this prediction mask is added as a reference for the tracker. Through the above steps, each frame is segmented and tracked to ensure continuous tracking and accurate recognition of the target object. This approach improves the robustness and reliability of the system and is suitable for long-time tracking and analysis of microstructures. +@psudocode is the pseudocode for the flow in @oxlKu6dMaY. First, the segmenter and tracker are initialized. For each frame in the video, if it is the first frame, a prediction mask is generated by the segmenter and this mask is added as a reference for the tracker. For subsequent frames, a segmentation mask is generated by the segmenter and the segmentation mask is tracked by the tracker to generate a tracking mask. At the same time, a new object mask is detected. The tracking mask and the new object mask are merged to generate a prediction mask, and this prediction mask is added as a reference for the tracker. Through the above steps, each frame is segmented and tracked to ensure continuous tracking and accurate recognition of the target object. This approach improves the robustness and reliability of the system and is suitable for long-time tracking and analysis of microstructures. ## Model Information and Training Data @@ -98,8 +93,7 @@ Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s ` shows the names and number of parameters of the four different models in this paper. Specifically, the YOLOv8n-seg model has 3,409,968 parameters; the Swin-UNet model has 3,645,600 parameters; and the VMamba model has 3, 145,179 parameters. In this paper, a distilled version of SAM is used, with 10,185,380 parameters. The high computational requirements of SAM limit wider applications, so there are many studies focusing on the distillation of SAM, such as FastSAM {cite:p}`zhaoFastSegmentAnything2023`, TinySAM {cite:p}`shuTinySAMPushingEnvelope2024`, MobileSAM {cite:p}`zhangFasterSegmentAnything2023`, EfficientSAM {cite:p}`xiongEfficientSAMLeveragedMasked2023`, SAM-Lightening {cite:p}`songSAMLighteningLightweightSegment2024`. EfficientSAM utilizes SAM’s masked image pre-training (SAMI) strategy. It employs a MAE-based pre-training method combined with SAM models as a way to obtain high-quality pre-trained ViT encoders. This method is suitable for extracting knowledge from large self-supervised pre-trained SAM models, which in turn generates models that are both lightweight and highly efficient. The described knowledge distillation strategy significantly optimizes the knowledge transfer process from the teacher model to the student model {cite:p}`heMaskedAutoencodersAre2021; baiMaskedAutoencodersEnable2022`. - +@vwwdoR5dj5 shows the names and number of parameters of the four different models in this paper. Specifically, the YOLOv8n-seg model has 3,409,968 parameters; the Swin-UNet model has 3,645,600 parameters; and the VMamba model has 3, 145,179 parameters. In this paper, a distilled version of SAM is used, with 10,185,380 parameters. The high computational requirements of SAM limit wider applications, so there are many studies focusing on the distillation of SAM, such as FastSAM {cite:p}`zhaoFastSegmentAnything2023`, TinySAM {cite:p}`shuTinySAMPushingEnvelope2024`, MobileSAM {cite:p}`zhangFasterSegmentAnything2023`, EfficientSAM {cite:p}`xiongEfficientSAMLeveragedMasked2023`, SAM-Lightening {cite:p}`songSAMLighteningLightweightSegment2024`. 
EfficientSAM utilizes SAM’s masked image pre-training (SAMI) strategy. It employs a MAE-based pre-training method combined with SAM models as a way to obtain high-quality pre-trained ViT encoders. This method is suitable for extracting knowledge from large self-supervised pre-trained SAM models, which in turn generates models that are both lightweight and highly efficient. The described knowledge distillation strategy significantly optimizes the knowledge transfer process from the teacher model to the student model {cite:p}`heMaskedAutoencodersAre2021; baiMaskedAutoencodersEnable2022`. ```{figure} #app:fig2 :name: cEwjf0su9Q @@ -110,6 +104,6 @@ Algorithm [](#psudocode) is the pseudocode for the flow in {numref}`Figure %s ` illustrates that the dataset used in this paper. It consists of transmission electron microscopy (TEM) images and their corresponding ground truth masks. The raw images are sliced into subimages of 512 × 512 pixels for model training and evaluation. The entire dataset consists of 2000 images, including 1400 for training and 600 for testing. The ground truth mask of each image is manually labeled by hand to ensure the accuracy and reliability of the labeling. Each TEM image is equipped with a corresponding mask, which shows the target nanoparticles as white areas and the non-target areas as black background. These mask images are used for model training during the supervised learning process, and the pairing of high pixel resolution TEM images with accurately labeled true-label masks ensures that the model can learn to distinguish nanoparticles with high accuracy. +@cEwjf0su9Q illustrates that the dataset used in this paper. It consists of transmission electron microscopy (TEM) images and their corresponding ground truth masks. The raw images are sliced into subimages of 512 × 512 pixels for model training and evaluation. The entire dataset consists of 2000 images, including 1400 for training and 600 for testing. The ground truth mask of each image is manually labeled by hand to ensure the accuracy and reliability of the labeling. Each TEM image is equipped with a corresponding mask, which shows the target nanoparticles as white areas and the non-target areas as black background. These mask images are used for model training during the supervised learning process, and the pairing of high pixel resolution TEM images with accurately labeled true-label masks ensures that the model can learn to distinguish nanoparticles with high accuracy. For the YOLOv8n-seg model and the VMamba model, data with a resolution of 512x512 was used directly for training. However, the EfficientSAM model is a pre-trained model that requires the size of the input image and output mask to be fixed at 1024x1024. The Swin-UNet model uses images scaled to 448x448, and due to the “Shift Windows” operation in Swin Transformer, there is a certain limitation on the resolution of the input and output images, which needs to be 224. Therefore, during the training process, the training and test data were adjusted to match the input requirements of the model by adjusting resolution without re-croping the raw images. diff --git a/03_Results.md b/03_Results.md index 7573e67..87e5906 100644 --- a/03_Results.md +++ b/03_Results.md @@ -1,7 +1,7 @@ --- title: Results numbering: - enumerator: 3.%s + enumerator: 3.%s --- ## TEM Image Segmentation @@ -84,7 +84,7 @@ numbering: This paper compares four models, YOLOv8n-seg, EfficientSAM-tiny, Swin-UNet, and VMamba. 
The comparison is analyzed by comparing the accuracy, throughput, number of parameters, and video memory usage. -- YOLOv8n-seg provides a high throughput (80.19/s) with a relatively small number of parameters (3.41M) and video memory of 1513MB. This model, while performing well in terms of accuracy, can be seen in {numref}`Figure %s ` where overfitting occurs during the training of YOLO. Early stopping was used to mitigate this problem. +- YOLOv8n-seg provides a high throughput (80.19/s) with a relatively small number of parameters (3.41M) and video memory of 1513MB. This model, while performing well in terms of accuracy, can be seen in @Qy3XvUUvyI where overfitting occurs during the training of YOLO. Early stopping was used to mitigate this problem. - Swin-UNet has the sliding window attention mechanism, the number of parameters (3.65M) and video memory usage 1793MB are moderate and the throughput (64.89/s) is high. Swin-UNet based on the sliding window mechanism has a significant advantage in training and inference speed. - VMamba is based on a new architecture based on Mamba. It has a relatively small number of parameters (3.15M) and a video memory of 1823MB. However, its throughput (16.41/s) is low and its inferencing is slow. It is worth noting that Mamba, being a new architecture, is currently not able to train with multiple cards, unlike the other models. - EfficientSAM-tiny has a high number of parameters (10.19M) and video memory usage 1827MB, relatively low throughput (17.94/s), but has a significant advantage in accuracy. Despite its high number of parameters, it was the final model chosen due to its excellent accuracy. @@ -98,7 +98,7 @@ This paper compares four models, YOLOv8n-seg, EfficientSAM-tiny, Swin-UNet, and The top row of graphs shows detailed zoomed sections of the full range loss curves displayed in the bottom row for various segmentation models for comparison: YOLOv8n-seg, Swin-UNet, VMamba, and EfficientSAM. ``` -The training and testing losses of four different segmentation models (YOLOv8n-seg, Swin-UNet, VMamba, and EfficientSAM) are comparatively analyzed in {numref}`Figure %s `. {numref}`Figure %s ` shows how the loss of each model varies over 1000 epochs. First, the training loss of the YOLOv8n-seg model gradually decreases and stabilizes, while the testing loss significantly increases after the initial fluctuation, indicating a certain degree of overfitting in this model. Second, the Swin-UNet model shows a more consistent downward trend in training and testing losses, and although the testing loss is slightly higher than the training loss, the overall curve tends to be stable, showing good generalization ability. Thirdly, the training and testing loss curves of the VMamba model are very close to each other and drop rapidly in a short period of time, after which they remain at a low level, indicating that it has a significant advantage in convergence speed and stability. Finally, the EfficientSAM model performs particularly well, as its training and testing losses almost completely overlap and are maintained at a very low level throughout the training process, showing extremely high training efficiency and excellent generalization performance. +The training and testing losses of four different segmentation models (YOLOv8n-seg, Swin-UNet, VMamba, and EfficientSAM) are comparatively analyzed in @Qy3XvUUvyI. @Qy3XvUUvyI shows how the loss of each model varies over 1000 epochs. 
First, the training loss of the YOLOv8n-seg model gradually decreases and stabilizes, while the testing loss significantly increases after the initial fluctuation, indicating a certain degree of overfitting in this model. Second, the Swin-UNet model shows a more consistent downward trend in training and testing losses, and although the testing loss is slightly higher than the training loss, the overall curve tends to be stable, showing good generalization ability. Third, the training and testing loss curves of the VMamba model are very close to each other and drop rapidly in a short period of time, after which they remain at a low level, indicating that it has a significant advantage in convergence speed and stability. Finally, the EfficientSAM model performs particularly well, as its training and testing losses almost completely overlap and are maintained at a very low level throughout the training process, showing extremely high training efficiency and excellent generalization performance. ```{figure} #app:fig4-3 :name: FnM0Z3oOIl @@ -109,7 +109,7 @@ The training and testing losses of four different segmentation models (YOLOv8n-s Comparison of training and test IoU and Dice coefficients for different segmentation models: Swin-UNet, VMamba, and EfficientSAM. ``` -{numref}`Figure %s ` shows the trend of training and testing Intersection over Union (IoU) and Dice-Sørensen coefficient (Dice Coefficient) with 1000 training epochs for three different segmentation models (Swin-UNet, VMamba, EfficientSAM). The training and testing IoU and Dice Coefficient curves for the YOLO model are not available here, but the final IoU and Dice Coefficient are shown in {numref}`Table %s `. For the Swin-UNet model, the training and testing IoU rises rapidly at the beginning of the training rounds, slows down at about round 400. 1000th round to reach an IoU value of about 0.965. The training and testing Dice Coefficient shows a similar trend and eventually stabilizes at about 0.982. Both the training and testing IoU and Dice Coefficient of the VMamba model show a rapid increase with slight fluctuations in the early stages of training. However, after about 300 rounds, these metrics stabilize rapidly and eventually reach about 0.994 and 0.997. This indicates that the VMamba model performs well both in terms of convergence speed and final performance. Notably, the EfficientSAM model performs significantly better than the other two models. Its training and testing IoUs as well as Dice Coefficient rapidly approach 1.0 early in training, which may be due to the use of a pre-trained model. These metrics did not fluctuate significantly during subsequent training, and quickly reached higher accuracy, eventually stabilizing at about 0.997 and 0.998. +@FnM0Z3oOIl shows the trend of training and testing Intersection over Union (IoU) and Dice-Sørensen coefficient (Dice Coefficient) with 1000 training epochs for three different segmentation models (Swin-UNet, VMamba, EfficientSAM). The training and testing IoU and Dice Coefficient curves for the YOLO model are not available here, but the final IoU and Dice Coefficient are shown in @eRmHsk1Lat. For the Swin-UNet model, the training and testing IoU rises rapidly at the beginning of the training rounds, slows down after about round 400, and reaches an IoU value of about 0.965 by round 1000. The training and testing Dice Coefficient shows a similar trend and eventually stabilizes at about 0.982.
Both the training and testing IoU and Dice Coefficient of the VMamba model show a rapid increase with slight fluctuations in the early stages of training. However, after about 300 rounds, these metrics stabilize rapidly and eventually reach about 0.994 and 0.997. This indicates that the VMamba model performs well both in terms of convergence speed and final performance. Notably, the EfficientSAM model performs significantly better than the other two models. Its training and testing IoUs as well as Dice Coefficient rapidly approach 1.0 early in training, which may be due to the use of a pre-trained model. These metrics did not fluctuate significantly during subsequent training, and quickly reached higher accuracy, eventually stabilizing at about 0.997 and 0.998. ```{figure} #app:fig5 :name: eTzB6lohnz @@ -120,7 +120,7 @@ Comparison of training and test IoU and Dice coefficients for different segmenta Comparison of different segmentation methods. Left: input image and zoomed-in area. Then, segmentation results of Ground Truth, Swin-UNet, VMamba, YOLOv8, and EfficientSAM. ``` -The results of Mask comparison of different segmentation methods in detail are shown in {numref}`Figure %s `. Firstly, the input image and its magnified region are shown, followed by the segmentation results of “Ground Truth Mask”, Swin-UNet, VMamba, YOLOv8, and EfficientSAM in order. The “Ground Truth Mask” provides the ideal reference mask for comparing the performance of other methods. +The results of Mask comparison of different segmentation methods in detail are shown in @eTzB6lohnz. Firstly, the input image and its magnified region are shown, followed by the segmentation results of “Ground Truth Mask”, Swin-UNet, VMamba, YOLOv8, and EfficientSAM in order. The “Ground Truth Mask” provides the ideal reference mask for comparing the performance of other methods. The segmentation results of Swin-UNet show a slight lack of edge detail, with some regions failing to be segmented correctly, which is often unacceptable for scientific research. VMamba performs similarly to Swin-UNet, but with a smoother boundary treatment. YOLOv8’s segmentation results have multiple targets boxed and labeled, however, it does not perform as well as the previous two in terms of fine-grained segmentation. It is worth noting that YOLOv8 has a significant advantage in real-time detection and processing speed, which is especially suitable for application scenarios that require a fast response, but the accuracy of the recognition is more important than the speed in scientific research. EfficientSAM performs excellently in preserving the integrity of the target and has a clearer boundary. The EfficientSAM and YOLOv8-seg models segmentation can provide IDs (different colors correspond to different IDs). EfficientSAM can generate masks with IDs by calling the mask decoder multiple times with an array of point prompts, and YOLOv8-seg can also provide IDs because it is a target detection model that performs segmentation after single term of detection. @@ -135,4 +135,4 @@ The segmentation results of Swin-UNet show a slight lack of edge detail, with so Comparison of key video frames. The first row shows original frames, the second row shows segmentation results, and the third row is a magnified view of the red box area, showing segmentation and tracking of objects. ``` -In this paper, a video analysis of the sintering process of the material at a high temperature of 800℃ is carried out. 
Figure {numref}`Figure %s ` shows the original images of the key frames and the segmentation results. The first row shows the original frames, showing the continuous image of some key frames 1-115. The second row shows the segmentation result from EfficientSAM, which shows the change of objects on the time axis by color marking different objects. The third row shows a magnified view of the red-framed area, in order to show the segmentation and tracking process more clearly. The images from frame 77-115 show the sintering phenomenon of multiple nano-particles, especially between frame 111-115, where the three nano-particles are gradually fused and show obvious morphological changes. This indicates that a significant sintering process occurred in the material at a high temperature of 800℃. Comparison of the segmentation results with the original frames shows that the method in this paper is able to accurately identify and label the objects, and track them effectively even when they become sintered. The color-marked segmentation results clearly show the dynamic changes of different particles during the sintering process, and track the evolution of the microstructure of the material under the high-temperature environment very well. +In this paper, a video analysis of the sintering process of the material at a high temperature of 800℃ is carried out. @BJo1hhWS1b shows the original images of the key frames and the segmentation results. The first row shows the original frames, showing the continuous image of some key frames 1-115. The second row shows the segmentation result from EfficientSAM, which shows the change of objects on the time axis by color marking different objects. The third row shows a magnified view of the red-framed area, in order to show the segmentation and tracking process more clearly. The images from frame 77-115 show the sintering phenomenon of multiple nano-particles, especially between frame 111-115, where the three nano-particles are gradually fused and show obvious morphological changes. This indicates that a significant sintering process occurred in the material at a high temperature of 800℃. Comparison of the segmentation results with the original frames shows that the method in this paper is able to accurately identify and label the objects, and track them effectively even when they become sintered. The color-marked segmentation results clearly show the dynamic changes of different particles during the sintering process, and track the evolution of the microstructure of the material under the high-temperature environment very well. diff --git a/04_Discussion.md b/04_Discussion.md index 6d64fd5..7b8bf49 100644 --- a/04_Discussion.md +++ b/04_Discussion.md @@ -1,20 +1,20 @@ --- title: Discussion numbering: - enumerator: 4.%s + enumerator: 4.%s --- # Discussion ## Segmentation and Tracking -The models released for computer vision are often based on realistic images, which are often ineffective when applied directly to microscope data for materials research. This paper takes microscope datasets and trains or fine-tunes them on different models. Deep learning shows the superior performance of the models in microscope image segmentation. All four segmentation models perform well on both the training and test sets, especially the EfficientSAM model, which shows the highest stability and accuracy on all evaluation metrics, indicating its strong generalization ability and robustness in the segmentation task. 
EfficientSAM demonstrates better segmentation performance in the application scenario {numref}`Figure %s `, and the edges are closer to the ground truth, according to the results in {numref}`Table %s `, EfficientSAM-tiny does have an obvious advantage in terms of accuracy. Although the other models have their own characteristics in terms of throughput, number of parameters and memory usage, EfficientSAM-tiny outperforms the other models in terms of accuracy, with an IoU of 0.99672 and a Dice Coefficient of 0.99836. At the same time, the performance of EfficientSAM model is better than the other models in terms of loss in both the training and the testing curves, showing optimal segmentation performance and generalization ability. These results suggest that the EfficientSAM model may be a potentially superior choice for handling segmentation tasks with greater efficiency and effectiveness. This provides an important reference for the selection and optimization of the tracking model. +The models released for computer vision are typically trained on natural, real-world images and are therefore often ineffective when applied directly to microscope data for materials research. This paper takes microscope datasets and trains or fine-tunes different models on them. The results show the strong performance of deep learning models in microscope image segmentation. All four segmentation models perform well on both the training and test sets, especially the EfficientSAM model, which shows the highest stability and accuracy on all evaluation metrics, indicating its strong generalization ability and robustness in the segmentation task. EfficientSAM demonstrates better segmentation performance in the application scenario shown in @eTzB6lohnz, and its edges are closer to the ground truth. According to the results in @eRmHsk1Lat, EfficientSAM-tiny has a clear advantage in terms of accuracy. Although the other models have their own characteristics in terms of throughput, number of parameters and memory usage, EfficientSAM-tiny outperforms the other models in terms of accuracy, with an IoU of 0.99672 and a Dice Coefficient of 0.99836. At the same time, the EfficientSAM model performs better than the other models in terms of loss in both the training and the testing curves, showing optimal segmentation performance and generalization ability. These results suggest that the EfficientSAM model may be a superior choice for handling segmentation tasks with greater efficiency and effectiveness. This provides an important reference for the selection and optimization of the tracking model. -For the autoencoder architecture, the encoder part of the network drastically reduces the resolution of the feature maps through the pooling layer, which is not conducive to generating accurate segmentation masks. Skip-connection in the Swin-UNet can introduce high-resolution features from the shallow convolutional layers, which contain rich low-level information to help generate better segmentation masks. However, the edges of the mask after Swin-UNet segmentation are still not clean enough in {numref}`Figure %s `. The algorithms in the tracking part are able to maintain high performance when dealing with the kinematic complexities such as particle growth, and motion during sintering, demonstrating the dynamics of the particles. This indicates the tracking method of this paper has significant application potential.
In particular, the outstanding EfficientSAM and the efficiency of the DeAOT model demonstrate the potential of deep learning techniques in microscopy image/video analysis. +For the autoencoder architecture, the encoder part of the network drastically reduces the resolution of the feature maps through the pooling layer, which is not conducive to generating accurate segmentation masks. Skip-connection in the Swin-UNet can introduce high-resolution features from the shallow convolutional layers, which contain rich low-level information to help generate better segmentation masks. However, the edges of the mask after Swin-UNet segmentation are still not clean enough in @eTzB6lohnz. The algorithms in the tracking part are able to maintain high performance when dealing with kinematic complexities such as particle growth and motion during sintering, demonstrating the dynamics of the particles. This indicates that the tracking method of this paper has significant application potential. In particular, the outstanding accuracy of EfficientSAM and the efficiency of the DeAOT model demonstrate the potential of deep learning techniques in microscopy image/video analysis. ## Future Directions -In {numref}`Figure %s `, the training loss of the YOLO model decreases initially, but the test loss picks up after reaching its lowest point in the 580 epoch and shows a U-shaped curve. This indicates that the model is overfitting. This may be due to the fact that a model of this size is not able to capture the visual information in the image well. However, this does not mean that the model is not useful. The advantage of YOLO is the speed of inference. In future research, YOLO can be used to perform initial segmentation at high speed to obtain positional information, and then SAM could be used to re-segment the critical parts. This will increase the speed of inference while still maintaining high accuracy. In addition, in higher pixel resolution microscope videos, it may be necessary to cut the image into small regions for tracking, which may significantly slow down the inference speed of the model. In this case, YOLO can be used to perform overall fast target detection on the scaled resolution image, and then the high pixel resolution target region can be cropped, based on the detection result. Thereafter, SAM can be used to obtain more accurate segmentation. +In @Qy3XvUUvyI, the training loss of the YOLO model decreases initially, but the test loss rises again after reaching its lowest point at epoch 580, showing a U-shaped curve. This indicates that the model is overfitting. This may be because a model of this size is not able to capture the visual information in the image well. However, this does not mean that the model is not useful. The advantage of YOLO is the speed of inference. In future research, YOLO can be used to perform initial segmentation at high speed to obtain positional information, and then SAM could be used to re-segment the critical parts. This will increase the speed of inference while still maintaining high accuracy. In addition, in higher pixel resolution microscope videos, it may be necessary to cut the image into small regions for tracking, which may significantly slow down the inference speed of the model. In this case, YOLO can be used to perform overall fast target detection on the scaled resolution image, and then the high pixel resolution target region can be cropped based on the detection result. Thereafter, SAM can be used to obtain more accurate segmentation.
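+
+A minimal sketch of this two-stage idea (assumptions: the Ultralytics YOLO interface is used for detection, and `sam_segment` is a hypothetical callable standing in for any prompt-based SAM-style segmenter such as EfficientSAM):
+
+```python
+import cv2
+from ultralytics import YOLO  # assumed API: model(image) -> results with .boxes.xyxy
+
+def two_stage_segment(frame_hr, detector, sam_segment, scale=4):
+    """Detect on a downscaled copy, then segment full-resolution crops."""
+    h, w = frame_hr.shape[:2]
+    small = cv2.resize(frame_hr, (w // scale, h // scale))
+    boxes = detector(small)[0].boxes.xyxy.tolist()     # fast, low-resolution detection
+    results = []
+    for x1, y1, x2, y2 in boxes:
+        # rescale each box to the original resolution and clamp it to the image
+        x1, y1 = max(int(x1 * scale), 0), max(int(y1 * scale), 0)
+        x2, y2 = min(int(x2 * scale), w), min(int(y2 * scale), h)
+        crop = frame_hr[y1:y2, x1:x2]
+        results.append(((x1, y1, x2, y2), sam_segment(crop)))  # accurate segmentation
+    return results
+
+# usage sketch: detector = YOLO("yolov8n.pt"); two_stage_segment(frame, detector, sam_segment)
+```
+
+In practice, the detector and the segmenter would be loaded once and reused across all frames of a video.
+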
```{figure} #app:fig7 :name: KhF5puVr6V @@ -25,4 +25,4 @@ In {numref}`Figure %s `, the training loss of the YOLO model decreas The original image and the segmentation results from LISA model obtained by progressively optimized prompts. ``` -In the future, there are potential applications of large language modeling (LLM) in microscopy image analysis. For example, the "Large Language Instructed Segmentation Assistant" (LISA) model is fine-tuned with a multimodal large language model to reason and generate “tokens” with approximate location information and combined with the SAM model for accurate segmentation {cite:p}`laiLISAReasoningSegmentation2024`. However, the LISA model still has some limitations when processing microscopy images. Although the segmentation level can be improved by optimizing the prompt, as shown in top of {numref}`Figure %s `, although most of the nanoparticles are correctly labeled, there are still some particles that are missed or incorrectly labeled. The main reason for this is that the training data is mainly from real-world scenarios rather than specialized microscope images. Perhaps incorporating microscope images into the training data could help improve the model’s performance in microscope images. In another study {cite:p}`zhangSamGuidedEnhancedFineGrained2023` on sam-guided multimodal LLM, the SAM visual coder was simultaneously introduced into a multimodal large language model. By adding a SAM visual encoder (for detail information) to the CLIP visual encoder (for global information) more detailed characterization can be achieved. Using multiple visual encoders with different functions can significantly improve the accuracy and detail of the generated image descriptions {cite:p}`zhangSamGuidedEnhancedFineGrained2023`. Visual encoders with different functions are similar to the compound eyes of an insect and can provide information with different details: the CLIP visual encoder provides the overall category and information of the image, while the SAM visual encoder provides information about the edges and shapes of the objects. In the future, by combining the multimodal LLM model with the SAM model, it is expected to further enhance the ability to analyze and reason about microscope videos. +In the future, there are potential applications of large language modeling (LLM) in microscopy image analysis. For example, the "Large Language Instructed Segmentation Assistant" (LISA) model is fine-tuned with a multimodal large language model to reason and generate “tokens” with approximate location information and combined with the SAM model for accurate segmentation {cite:p}`laiLISAReasoningSegmentation2024`. However, the LISA model still has some limitations when processing microscopy images. Although the segmentation level can be improved by optimizing the prompt, as shown in top of @KhF5puVr6V, although most of the nanoparticles are correctly labeled, there are still some particles that are missed or incorrectly labeled. The main reason for this is that the training data is mainly from real-world scenarios rather than specialized microscope images. Perhaps incorporating microscope images into the training data could help improve the model’s performance in microscope images. In another study {cite:p}`zhangSamGuidedEnhancedFineGrained2023` on sam-guided multimodal LLM, the SAM visual coder was simultaneously introduced into a multimodal large language model. 
By adding a SAM visual encoder (for detail information) to the CLIP visual encoder (for global information) more detailed characterization can be achieved. Using multiple visual encoders with different functions can significantly improve the accuracy and detail of the generated image descriptions {cite:p}`zhangSamGuidedEnhancedFineGrained2023`. Visual encoders with different functions are similar to the compound eyes of an insect and can provide information with different details: the CLIP visual encoder provides the overall category and information of the image, while the SAM visual encoder provides information about the edges and shapes of the objects. In the future, by combining the multimodal LLM model with the SAM model, it is expected to further enhance the ability to analyze and reason about microscope videos. +In the future, there are potential applications of large language modeling (LLM) in microscopy image analysis. For example, the "Large Language Instructed Segmentation Assistant" (LISA) model is fine-tuned with a multimodal large language model to reason and generate “tokens” with approximate location information and combined with the SAM model for accurate segmentation {cite:p}`laiLISAReasoningSegmentation2024`. However, the LISA model still has some limitations when processing microscopy images. Although the segmentation quality can be improved by optimizing the prompt, as shown at the top of @KhF5puVr6V, and most of the nanoparticles are correctly labeled, some particles are still missed or incorrectly labeled. The main reason for this is that the training data is mainly from real-world scenarios rather than specialized microscope images. Perhaps incorporating microscope images into the training data could help improve the model’s performance on microscope images. In another study {cite:p}`zhangSamGuidedEnhancedFineGrained2023` on SAM-guided multimodal LLM, the SAM visual encoder was simultaneously introduced into a multimodal large language model. By adding a SAM visual encoder (for detail information) to the CLIP visual encoder (for global information), more detailed characterization can be achieved. Using multiple visual encoders with different functions can significantly improve the accuracy and detail of the generated image descriptions {cite:p}`zhangSamGuidedEnhancedFineGrained2023`. Visual encoders with different functions are similar to the compound eyes of an insect and can provide information with different details: the CLIP visual encoder provides the overall category and information of the image, while the SAM visual encoder provides information about the edges and shapes of the objects. In the future, by combining the multimodal LLM model with the SAM model, it is expected to further enhance the ability to analyze and reason about microscope videos. diff --git a/notebooks/fig2.ipynb b/notebooks/fig2.ipynb index 4292d92..86f4c19 100644 --- a/notebooks/fig2.ipynb +++ b/notebooks/fig2.ipynb @@ -7,7 +7,7 @@ "---\n", "title: Select 'Train' or 'Test' to interact with dataset \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] }, diff --git a/notebooks/fig3.ipynb b/notebooks/fig3.ipynb index 7e59307..d262387 100644 --- a/notebooks/fig3.ipynb +++ b/notebooks/fig3.ipynb @@ -7,7 +7,7 @@ "---\n", "title: Interactive train and test loss curves \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] }, diff --git a/notebooks/fig4.ipynb b/notebooks/fig4.ipynb index 9a2ff7f..759f27e 100644 --- a/notebooks/fig4.ipynb +++ b/notebooks/fig4.ipynb @@ -7,7 +7,7 @@ "---\n", "title: Interactive IoU and Dice coefficients \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] }, diff --git a/notebooks/fig5.ipynb b/notebooks/fig5.ipynb index 9f45f38..c8d5844 100644 --- a/notebooks/fig5.ipynb +++ b/notebooks/fig5.ipynb @@ -7,7 +7,7 @@ "---\n", "title: Interactive comparison of different model segmentation results \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] }, diff --git a/notebooks/fig6-3.ipynb b/notebooks/fig6-3.ipynb index c950e99..c2641c1 100644 --- a/notebooks/fig6-3.ipynb +++ b/notebooks/fig6-3.ipynb @@ -8,7 +8,7 @@ "---\n", "title: Interactive Tracked TEM Video with a Slider \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] }, diff --git a/notebooks/fig7.ipynb b/notebooks/fig7.ipynb index 4ed2b51..daf18bb 100644 --- a/notebooks/fig7.ipynb +++ b/notebooks/fig7.ipynb @@ -7,7 +7,7 @@ "---\n", "title: Interactive LISA Segmentation through optimized model prompts \n", "author: Yifei Duan, Yifan Duan \n", - "date: 2024/10/02 \n", + "date: 2024-10-02\n", "---" ] },