1703.06870 - Mask R-CNN

♥ DL, CV, RCNN, Object Detection

  • Extends Faster R-CNN by adding a branch that predicts a segmentation mask for each RoI, in parallel with the existing branch for classification and bbox regression

  • Review of Faster RCNN

    • Two stages
      • Stage 1: a Region Proposal Network (RPN) proposes candidate object bboxes
      • Stage 2: RoIPool extracts features from each candidate box, then classification and bbox regression are performed
    • The features used by both stages can be shared for faster inference
    • Faster R-CNN is not designed for pixel-to-pixel alignment
      • RoIPool performs coarse spatial quantization for feature extraction
      • How to fix: replace it with RoIAlign, which is quantization-free and faithfully preserves exact spatial locations (see the sketch below)
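
The quantization and its fix can be illustrated with torchvision's pooling ops. The sketch below is a minimal, stand-alone example: the feature map, the stride-16 assumption, and the proposal coordinates are made up for illustration.

```python
import torch
from torchvision.ops import roi_align, roi_pool

features = torch.randn(1, 256, 50, 50)                     # stride-16 features of an 800x800 image (made up)
proposals = [torch.tensor([[23.7, 41.2, 312.9, 280.4]])]   # one RPN candidate box (x1, y1, x2, y2)

# RoIPool: box coordinates are scaled by 1/16 and rounded onto the feature grid,
# i.e. round(x / 16); fine for classification, but it shifts features by up to
# several input pixels, which hurts pixel-accurate masks.
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)

# RoIAlign: keeps x / 16 as a continuous coordinate and bilinearly interpolates
# the sampled points, so the extracted features stay aligned with the input pixels.
aligned = roi_align(features, proposals, output_size=(7, 7),
                    spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
```
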
  • Mask R-CNN

    • Two stages

      • The first stage (RPN) is the same as in Faster R-CNN
      • In the second stage, when predicting the class and box offset, Mask R-CNN also outputs a binary mask (third branch) for each RoI
    • $L = L_{cls} + L_{box} + L_{mask}$

      • $L_{mask}$ uses a per-pixel sigmoid instead of a softmax
      • For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask
      • Other mask outputs are irrelevant $\rightarrow$ we can generate masks for each class without competition among classes
      • This decouples mask and class prediction (see the mask-head sketch after this list)
    • Mask branch:

      • Use an FCN (fully convolutional network) to predict an $m \times m$ mask from each RoI
      • Needs the RoI features to be well aligned $\rightarrow$ RoIAlign
    • RoIAlign:

      • In RoIPool, quantization leads to misalignments
        • Does not affect classification
        • Affects pixel-accurate mask prediction
      • In RoIAlign, quantization is avoided: round(x / 16) $\rightarrow$ x / 16, with feature values computed by bilinear interpolation at sampled points
    • Architecture:

      • The head is extended for two kinds of backbones:
        • ResNet
        • FPN (feature pyramid network)
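
The mask branch and $L_{mask}$ can be sketched as below. This is a minimal PyTorch stand-in, not the paper's exact head: the layer sizes, the 14×14 RoI feature resolution, and the tensor names are assumptions; the point is the [N, K, m, m] output, the per-pixel sigmoid, and selecting only the ground-truth class's mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskHead(nn.Module):
    """Simplified FCN mask head: conv stack + upsample + 1x1 conv to K class masks."""
    def __init__(self, in_channels=256, num_classes=80, m=28):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.deconv = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14x14 -> 28x28
        self.predictor = nn.Conv2d(256, num_classes, 1)          # one m*m mask per class

    def forward(self, roi_features):
        x = F.relu(self.deconv(self.convs(roi_features)))
        return self.predictor(x)  # [N, K, m, m] logits; no softmax across classes

def mask_loss(mask_logits, gt_classes, gt_masks):
    """L_mask: per-pixel binary cross-entropy on the ground-truth class channel only."""
    n = mask_logits.shape[0]
    # Select the k-th mask for an RoI whose ground-truth class is k;
    # the other K-1 masks do not contribute (no inter-class competition).
    picked = mask_logits[torch.arange(n), gt_classes]            # [N, m, m]
    return F.binary_cross_entropy_with_logits(picked, gt_masks)

# Hypothetical usage on 4 positive RoIs with 28x28 binary mask targets
head = MaskHead()
logits = head(torch.randn(4, 256, 14, 14))
loss = mask_loss(logits, torch.tensor([3, 0, 17, 5]), torch.rand(4, 28, 28).round())
```

The total loss per sampled RoI is then $L = L_{cls} + L_{box} + L_{mask}$, with $L_{cls}$ and $L_{box}$ unchanged from Faster R-CNN.
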
  • Mask R-CNN is easy to generalize to other tasks

    • Instance segmentation
    • Bbox object detection
    • Person keypoint detection
      • Model a keypoint’s location as a one-hot $m \times m$ mask, where only a single pixel is labeled as foreground (see the sketch below)
      • Adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow).
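
A minimal sketch of the keypoint formulation, assuming keypoint locations already mapped into RoI-relative coordinates in [0, 1). The one-hot $m \times m$ target trained with cross-entropy over an $m^2$-way softmax follows the paper's description; the tensor names, m = 56, and K = 17 are illustrative, and visibility handling is omitted.

```python
import torch
import torch.nn.functional as F

def keypoint_loss(kp_logits, kp_xy, m=56):
    """kp_logits: [N, K, m, m] predicted keypoint masks for N RoIs and K keypoint types.
    kp_xy: [N, K, 2] keypoint locations in RoI-relative coordinates in [0, 1).
    Each (RoI, keypoint) pair has a one-hot m*m target with a single foreground pixel,
    trained with cross-entropy over the m^2-way softmax."""
    n, k = kp_logits.shape[:2]
    col = (kp_xy[..., 0] * m).long().clamp(0, m - 1)   # x -> column index
    row = (kp_xy[..., 1] * m).long().clamp(0, m - 1)   # y -> row index
    target = (row * m + col).reshape(n * k)            # flat index of the one-hot pixel
    logits = kp_logits.reshape(n * k, m * m)           # softmax over all m*m locations
    return F.cross_entropy(logits, target)

# Hypothetical usage: 2 RoIs, 17 COCO keypoint types, 56x56 keypoint masks
loss = keypoint_loss(torch.randn(2, 17, 56, 56), torch.rand(2, 17, 2), m=56)
```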