
A few images have incomplete annotations. #21

Open
ivan-ea opened this issue Oct 13, 2022 · 7 comments

Comments

@ivan-ea

ivan-ea commented Oct 13, 2022

Hi, thank you for making such an interesting dataset publicly available!

If I'm not mistaken, there are a few images with incomplete annotations.
In other words, the JSON entry only contains the segmentation coordinates
for 1 or 2 cells, while the image clearly shows many more cells. An example
is shown in the image below (id = 150535); the affected files are listed in the table.

[image: 150535_1_annotations]

| id | file_name | set | segmented cells |
|---|---|---|---|
| 205798 | A172_Phase_D7_1_01d20h00m_1.png | train | 1 |
| 10517 | BT474_Phase_B3_1_03d00h00m_3.png | train | 1 |
| 150535 | A172_Phase_A7_1_01d04h00m_3.png | train | 1 |
| 1494964 | SHSY5Y_Phase_D10_1_01d16h00m_4.png | train | 1 |
| 718286 | BV2_Phase_D4_1_00d12h00m_2.png | train | 9 |
| 628256 | BV2_Phase_C4_1_01d16h00m_3.png | train | 4 |
| 1248961 | SkBr3_Phase_H3_2_00d00h00m_1.png | val | 1 |
| 976048 | BV2_Phase_A4_2_00d00h00m_1.png | test | 2 |
| 1007442 | BV2_Phase_A4_2_02d04h00m_3.png | test | 2 |

The list may not be complete, since the threshold I used was 10 segments per image,
but I'm fairly confident that the images with around 20+ segmented cells are all okay.
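For anyone who wants to reproduce this check, counting the annotations per image can be done directly on the raw COCO JSON. A minimal sketch, using a tiny hypothetical two-image dict in place of the real annotation file (in practice you would `json.load()` the LIVECell JSON):

```python
from collections import Counter

# Hypothetical stand-in for the real COCO annotation file;
# only the fields needed for counting are included.
coco = {
    "images": [
        {"id": 150535, "file_name": "A172_Phase_A7_1_01d04h00m_3.png"},
        {"id": 42, "file_name": "example_ok.png"},
    ],
    "annotations": [
        {"id": 1, "image_id": 150535, "category_id": 1},
        {"id": 2, "image_id": 42, "category_id": 1},
        {"id": 3, "image_id": 42, "category_id": 1},
    ],
}

# Count segmented cells per image and flag images below a threshold.
counts = Counter(a["image_id"] for a in coco["annotations"])
THRESHOLD = 10
suspect = {img["file_name"]: counts.get(img["id"], 0)
           for img in coco["images"]
           if counts.get(img["id"], 0) < THRESHOLD}
print(suspect)
```

With the real file, `suspect` would contain candidates like the ones in the table above, which then still need a visual check.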

The number of images affected seems to be very small, so excluding these images
should solve any issue. Nevertheless, I think this information could be relevant
for people that are interested in using the dataset for their DL models.

@StefanBaar

It seems that most images (I only checked SHSY5Y) have missing annotations:

For example:

  • SHSY5Y_Phase_D10_2_02d16h00m_2.tif: missing annotations are indicated with green ellipses.

[image: SHSY5Y_Phase_D10_2_02d16h00m_2.tif]

Further, many annotations appear fragmented and inaccurate. I wonder if that is due to rasterization (vector → pixel) during the annotation process?

Example:
Annotations 403, 257, 117 and 162 of SHSY5Y_Phase_D10_2_02d16h00m_2

[image: fragmented annotation renders]

Are the vector masks available somewhere?

@RickardSjogren
Contributor

Hi @ivan-ea,
Thank you for looking into LIVECell! It certainly looks like some images are missing annotations and have simply slipped through QA when they shouldn't have. Excluding these images from training is probably the best option. It is also great for us to be aware of these images so that we can fix them in a potential follow-up release.

@RickardSjogren
Contributor

Hi @StefanBaar,
The missing cells that you are pointing to are described in the paper and are due to the design choice of not trying to segment cells in locations where cell boundaries are not clearly visible. We made this choice to limit the risk of introducing bias on how to split ambiguous locations like the ones you marked. To help us make the call, we worked with an experienced cell biologist to minimize the risk of doing it incorrectly. It is not always possible to delineate single cells in these types of images, and this choice is in line with other published datasets, such as EVICAN.

I believe that the broken visualizations are due to the software you use to make the masks. LIVECell is annotated using polygon annotations and stored in COCO-annotation format, so these fragmented masks are not coming from the raw annotations. When converting polygons to masks, the main challenge will be thin structures like the ones you show and your rendering looks like it needs some tweaking. Cell 117 seems to be missing a protrusion on the bottom left though.

@StefanBaar

StefanBaar commented Nov 2, 2022

Hi @RickardSjogren,
Thank you very much for the detailed response and explanation.
In general, I agree with your first paragraph. However, in regards to the second paragraph

I believe that the broken visualizations are due to the software you use to make the masks

I apologize, but I don't think this is correct. It appears that in your dataset the annotations are stored as RLE, not as polygons. This means each annotation is stored as a pixel mask, not as a list of coordinates. The renders I produced above are as true to the data (provided in the dataset) as possible.

This is how I did it in python:

import scipy.ndimage as ndi
from tqdm import tqdm
import torchvision.datasets as dset

im_path = "LIVECell_single_cells/shsy5y/images"  # adjust to your local image directory
an_path = "LIVECell_single_cells/shsy5y/train.json"

coco    = dset.CocoDetection(root=im_path, annFile=an_path)

### get list of image ids
im_ids  = list(sorted(coco.coco.imgs.keys()))
### choose the image at index 100
imid0   = im_ids[100]

### load the annotation data for each annotation
annos   = coco.coco.loadAnns(coco.coco.getAnnIds(imid0))

### collect fractured annotations
fracids = []
for i in tqdm(range(len(annos))):
    ### converts the segmentation to a binary mask (2D)
    ma = coco.coco.annToMask(annos[i])
    ### a fractured mask splits into more than one connected component
    if ndi.label(ma)[1] > 1:
        fracids.append(i)

When looking at the data of the first annotation (annos[0]), we get the following output:

{'segmentation': [[696.92,
   0.5,
   697.48,
   3.89,
   697.48,
   10.68,
   696.35,
   19.16,
   694.09,
   22.55,
   694.09,
   25.94,
   698.61,
   33.29,
   703.14,
   41.77,
   704.0,
   40.08,
   704.0,
   0.0]],
 'area': 274.53349999999045,
 'iscrowd': 0,
 'image_id': 1457250,
 'bbox': [694.09, 0.0, 9.909999999999968, 41.77],
 'category_id': 1,
 'id': 1457251}

which looks like RLE and not polygon. I also confirmed the content of the raw json file, in which I could not find any polygon data.

Am I doing something wrong here? If the polygon data is contained somewhere within the json file and if you have some time to spare, could you please elaborate on how to retrieve the polygon data?

Could you provide the polygon data?

Thank you very much for your time.

@audreyeternal

audreyeternal commented Dec 9, 2022

@StefanBaar I think the annotation you provided is a polygon. You can refer to the COCO data format explanation: when iscrowd=0, the annotations are in polygon format. Also, decimals don't seem to appear in the RLE format.
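The distinction can be checked programmatically. A minimal sketch following the COCO convention (iscrowd=0 → list of flat `[x0, y0, x1, y1, ...]` polygons; iscrowd=1 → RLE dict with `counts` and `size`); the helper name is my own:

```python
def segmentation_kind(ann):
    """Classify a COCO annotation's segmentation field.

    Per the COCO format: iscrowd=0 -> list of polygons, each a flat
    [x0, y0, x1, y1, ...] list; iscrowd=1 -> RLE dict with
    'counts' and 'size' keys.
    """
    seg = ann["segmentation"]
    if ann.get("iscrowd", 0) == 0 and isinstance(seg, list):
        return "polygon"
    if isinstance(seg, dict) and "counts" in seg:
        return "rle"
    return "unknown"

# The annotation printed above, abbreviated:
ann = {"segmentation": [[696.92, 0.5, 697.48, 3.89, 704.0, 0.0]],
       "iscrowd": 0, "image_id": 1457250}
print(segmentation_kind(ann))  # -> polygon
```

Applied to annos[0] above, this classifies it as a polygon, not RLE.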

@StefanBaar

StefanBaar commented Dec 15, 2022

OK, I got it.
The problem is not the format (RLE or polygon), but what the COCO-internal function coco.coco.annToMask does with the polygon data. This can be seen in the following image: on the left are the RLE and polygon coordinates provided by the COCO dataset, and on the right is the pixel mask produced by coco.coco.annToMask from the polygon data.

[image: polygon coordinates (left) vs. pixel mask produced by annToMask (right)]

In the case of this dataset, coco.coco.annToMask is not an optimal solution for producing pixel masks, because the polygon points are often not sufficiently densely spaced.
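One partial workaround would be to resample the polygon outline more densely before rasterizing it, so that thin protrusions are traced by many short segments. A minimal sketch with a hypothetical `densify` helper (not part of pycocotools or the dataset):

```python
import numpy as np

def densify(poly, max_step=0.5):
    """Resample a flat COCO polygon [x0, y0, x1, y1, ...] so that
    consecutive vertices are at most `max_step` pixels apart.
    Hypothetical helper: a denser outline can reduce rasterization
    artifacts on thin structures."""
    pts = np.asarray(poly, dtype=float).reshape(-1, 2)
    pts = np.vstack([pts, pts[:1]])  # close the ring
    out = []
    for p, q in zip(pts[:-1], pts[1:]):
        n = max(1, int(np.ceil(np.linalg.norm(q - p) / max_step)))
        # linearly interpolate between p and q, excluding q itself
        out.append(p + (q - p) * (np.arange(n) / n)[:, None])
    return np.vstack(out).ravel().tolist()

# a thin 10 x 1 rectangle, similar in spirit to a cell protrusion
poly = [0.0, 0.0, 10.0, 0.0, 10.0, 1.0, 0.0, 1.0]
dense = densify(poly)
```

The densified polygon can then be rasterized with the usual tooling; whether this fully fixes the fragmented masks would need testing on the SHSY5Y annotations.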

I am curious ... what is the intended method to convert the polygon data into pixel masks?

Further, I don't really understand why one would use polygons (basically a set of straight lines) to annotate images of round objects.
I think for cell images, it would be better to produce pixel annotations (which is usually faster and more precise).

@RickardSjogren
Copy link
Contributor

Thanks for looking into this @StefanBaar . It sure seems that annToMask is not optimal for cells with protrusions like the SH-SY5Y cells in LIVECell. We observed that those cells are particularly difficult to segment, and a better decoding method may be a partial solution.

For the models trained in the paper we used Detectron2, which has its own parsers for COCO datasets. Under the hood, they use pycocotools.mask.decode to convert the encoding to NumPy masks, which is the same as annToMask.
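For reference, the core of what that decode step does for an uncompressed RLE can be sketched in a few lines. This is a simplified illustration, not pycocotools' actual implementation (the real library also handles compressed string counts and is written in C):

```python
import numpy as np

def rle_decode(counts, size):
    """Decode an uncompressed COCO RLE into a binary mask.

    `counts` alternates run lengths of 0s and 1s, starting with 0s,
    laid out in column-major (Fortran) order; `size` is (height, width).
    Simplified sketch of pycocotools.mask.decode for uncompressed RLEs.
    """
    h, w = size
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for run in counts:
        flat[pos:pos + run] = val
        pos += run
        val = 1 - val
    # undo the column-major flattening: (w, h) rows are columns of the mask
    return flat.reshape((w, h)).T

mask = rle_decode([2, 3, 4], (3, 3))
```

Polygon segmentations go through an extra rasterization step (pycocotools' `frPyObjects`) before this decoding, which is where the thin-structure issues discussed above come in.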

Regarding the choice of polygons: this is the standard way of annotating instances in most fields. There are certainly some drawbacks depending on the vertex density you use and so on. Even though pixel masks are more precise, they are far more time-consuming to annotate. This is something we have experimented with quite a bit, and in most cases polygons provide sufficient precision while being much more budget-friendly.
