1703.06870 - Mask R-CNN
♥ DL , CV , RCNN, Object Detection
-
Extend Faster RCNN by adding a branch for predicting segmentation masks for each ROI, in parallel with existing branch for classification and bbox regression
-
Review of Faster RCNN
- Two stages
- Get candidate object bboxes from RPN
- Extract features using RolPool from each candidate box
- Finally perform classification and bbox regression.
- The features used by both stages can be shared for faster inference
- Faster-RCNN is not designed for pixel-to-pixel alignment
- RoiPool performs coarse spatial quantization for feature extraction
- How to fix: change it to RoiAlign which is quantization-free and can faithfully preserves exact spatial locations
- Two stages
-
Mask R-CNN
-
Two stages
- Similar to faster RCNN in the first stage(RPN)
- When predicting the class and box offset, Mask R-CNN also outputs a binary mask(third branch) for each RoI
-
$L = L_{cls} + L_{box} + L_{mask}$ -
$L_{mask}$ uses per-pixel sigmoid instead of softmax - For an ROI associated with ground-truch class k,
$L_{mask}$ is only defined on k-th mask - Other mask outputs are irrelevent
$\rightarrow$ we can generate masks for each class without competition among classes - Decouple mask and class prediction
-
-
Mask branch:
- Use a FCN(full collected network) to predics m*m mask from each ROI
- Need ROI features to be well aligned
$\rightarrow$ RoIAlign
-
RoIAlign:
- In RoIPool, quantization leads to misalignments
- Does not affect classification
- Affects pixel-accurate prediction
- In RoIAlign, we avoid quantization: round(x / 16)
$\rightarrow$ x / 16
- In RoIPool, quantization leads to misalignments
-
Archetecture:
- Extend two kinds of backbone heads:
- ResNet
- FPN(feature pyramid network)
- Extend two kinds of backbone heads:
-
-
Mask RCNN is easy to generalize
- Instance segmentation
- Bbox object detection
- Person keypoint detection
- Model a keypoint’s location as a one-hot m * m mask, where onlu a single pixel is labeled as foreground
- Adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow).