[paper reading] FCOS

| topic | motivation | technique | key element | math | use yourself |
| :-: | :-: | :-: | :-: | :-: | :-: |
| [FCOS](./[paper reading] FCOS.md) | [Idea](#Idea)<br>[Contribution](#Contribution) | [FCOS Architecture](#FCOS Architecture)<br>[Center-ness](#Center-ness)<br>[Multi-Level FPN Prediction](#Multi-Level FPN Prediction) | [Prediction Head](#Prediction Head)<br>[Training Sample & Label](#Training Sample & Label)<br>[Model Output](#Model Output)<br>[Feature Pyramid](#Feature Pyramid)<br>[Inference](#Inference)<br>[Ablation Study](#Ablation Study)<br>[FCN & Detection](#FCN & Detection)<br>[FCOS $vs.$ YOLO v1](#FCOS $vs.$ YOLO v1) | [Symbol Definition](#Symbol Definition)<br>[Loss Function](#Loss Function)<br>[Center-ness](#Center-ness)<br>[Remap of Feature & Image](#Remap of Feature & Image) | [Physical & Prior Knowledge](#Physical & Prior Knowledge)<br>[Design or Adaptive](#Design or Adaptive)<br>[Sample & Filter Strategy](#Sample & Filter Strategy)<br>[Generalized Keypoint-based](#Generalized Keypoint-based) |

Motivation

Idea

Performs object detection via per-pixel prediction (implemented with fully convolutional networks).

Contribution

  1. Reformulates detection as per-pixel prediction

  2. Uses multi-level prediction

    • improves recall
    • resolves the ambiguity caused by overlapping bounding boxes
  3. Adds a center-ness branch

    to suppress low-quality predictions among the bounding boxes

Techniques

FCOS Architecture

The default choice of backbone is ResNet-50.

![image-20201106155009575]([paper reading] FCOS.assets/image-20201106155009575.png)

Advantage

See [Drawbacks of Anchor](#Drawbacks of Anchor) for details.

  • Unifies detection with FCN-solvable tasks (e.g. semantic segmentation)

    so ideas from those other tasks can also be transferred to detection (idea re-use)

  • Anchor-free & proposal-free

  • Eliminates the **complex computation** associated with anchors (e.g. IoU)

    yielding faster training & testing and a smaller training memory footprint

  • Achieves SOTA among one-stage detectors and can be used to replace the RPN

  • Transfers quickly to other vision tasks (e.g. instance segmentation, keypoint detection)

Center-ness

Center-ness is predicted for every location

and greatly improves performance.

Idea

Locations far from the center produce a large number of low-quality predicted bounding boxes.

FCOS introduces center-ness to suppress (i.e. down-weight) the low-quality bounding boxes produced by locations far from the center.

![image-20201106173628910]([paper reading] FCOS.assets/image-20201106173628910.png)

Implement

A center-ness branch is introduced to predict the center-ness of each location.

![image-20201106172522706]([paper reading] FCOS.assets/image-20201106172522706.png)

At test time:

  1. Compute the score as: $$ \text{Final Score} = \text{Classification Score} × \text{Center-ness} $$

  2. Filter out the suppressed bounding boxes with NMS
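The two test-time steps above can be sketched as follows (a minimal illustration with hypothetical function and variable names, not the paper's implementation):

```python
# Down-weight each detection's classification score by its predicted center-ness,
# so that the subsequent NMS prefers center-like predictions.
def final_scores(cls_scores, centerness):
    """Final Score = Classification Score x Center-ness, per detection."""
    return [c * ctr for c, ctr in zip(cls_scores, centerness)]

# Two boxes with equal classification score: the off-center one (low center-ness)
# ends up with a much lower final score and is then removed by NMS.
scores = final_scores([0.9, 0.9], [0.95, 0.2])
```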

Multi-Level FPN Prediction

Multi-Level FPN Prediction solves two problems:

  • Best Possible Recall

    FCOS raises the Best Possible Recall to SOTA

  • Ambiguity of Ground-Truth Box Overlap

    It resolves the ambiguity caused by overlapping ground-truth boxes, reaching the level of anchor-based methods

    Reason: in the vast majority of cases, overlapping objects differ greatly in scale

The idea: distribute the locations to be regressed to different feature levels according to their regress distance.

Concretely:

  1. Compute the regress targets

    ![image-20201106153423939]([paper reading] FCOS.assets/image-20201106153423939.png)
  2. Filter out positive samples according to each feature level's maximum regress distance: $$ m_{i-1} < \text{max}( l^*, t^*,r^*,b^* ) < m_i $$ where $m_i$ is the maximum distance that feature level $i$ needs to regress: $$ \{m_2, m_3, m_4, m_5, m_6, m_7\} = \{ 0,64,128,256,512,\infty \} $$

    Compared with the original FPN usage (e.g. SSD), FCOS "implicitly" assigns objects of different scales to different feature levels (small objects to shallow levels, large ones to deep levels)

    I think this can be seen as a more refined form of manual design

  3. If one location falls into two ground-truth boxes (i.e. ambiguity), the smaller box is chosen for regression (biasing toward small objects)
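The level-assignment rule in step 2 can be sketched as below (a hedged illustration; the thresholds follow the values quoted above, with $m_7=\infty$, and the function name is my own):

```python
import math

# Boundaries m_2 .. m_7 for feature levels P3 .. P7; m_7 is unbounded.
M = [0, 64, 128, 256, 512, math.inf]

def assign_level(l, t, r, b):
    """Return the feature level (3..7) whose distance range contains this
    location's maximum regress distance, i.e. m_{i-1} < max(...) <= m_i."""
    d = max(l, t, r, b)
    for i in range(1, len(M)):
        if M[i - 1] < d <= M[i]:
            return i + 2  # the interval (M[0], M[1]] belongs to P3
    return None  # d == 0: degenerate location, treated as negative
```

For example, a location with targets (10, 20, 30, 40) has a maximum distance of 40 and is handled by P3, while one with a 600-pixel distance goes to P7.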

Key Elements

Prediction Head

Classification Branch

![image-20201106155159833]([paper reading] FCOS.assets/image-20201106155159833.png)

Regression Branch

![image-20201106155213687]([paper reading] FCOS.assets/image-20201106155213687.png)

Since the regression targets are always positive, $\text{exp}(s_ix)$ is appended on top of the regression branch (see [Shared Head](#Shared Head)).

Shared Head

The head is shared across different feature levels.

Advantage

  • parameter efficient
  • improve performance

Drawback

Because of [Multi-Level FPN Prediction](#Multi-Level FPN Prediction), the output ranges differ across feature levels (e.g. [0, 64] for $P_3$, [64, 128] for $P_4$).

To allow identical heads to be used on different feature levels: $$ \text{exp}(x) \rightarrow \text{exp}(s_ix) $$

  • $s_i$ is a trainable scalar that automatically adjusts the base of the exponential
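A toy numerical sketch (my own names, not the paper's code) of why the per-level scalar works: the same raw head output can be mapped to different distance ranges by different learned $s_i$:

```python
import math

def regress_distance(x, s_i):
    """Map an unconstrained head output x to a positive distance via exp(s_i * x)."""
    return math.exp(s_i * x)

# With shared head weights, the same raw output x covers a smaller distance range
# on a shallow level and a larger one on a deeper level, depending on s_i.
d_shallow = regress_distance(2.0, s_i=1.0)
d_deep = regress_distance(2.0, s_i=2.0)
```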

Training Sample & Label

Training Sample

Locations are used directly as training samples (similar to FCN for semantic segmentation).

Label Pos/Neg

A location $(x,y)$ is a positive sample if:

  1. location $(x,y)$ falls inside a ground-truth box
  2. the class label of location $(x,y)$ == the class of that ground-truth box

FCOS trains with as many foreground samples as possible (i.e. all locations inside ground-truth boxes),

unlike anchor-based methods, which take only the anchors with high IoU with a ground-truth box as positives,

and unlike [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which uses only the geometric center as a positive.

Model Output

Each location on each feature level's feature map has the following outputs:

4D Vector $\pmb t^*$

$$ \pmb t^* = (l^*,t^*,r^*,b^*) $$

It describes the relative offsets of the bounding box's four sides with respect to that location.

Concretely:

![image-20201106153423939]([paper reading] FCOS.assets/image-20201106153423939.png)

![1604590952221]([paper reading] FCOS.assets/1604590952221.png)

Note:

FCOS computes targets for every location inside a ground-truth box (not merely the geometric center), so four quantities are needed to recover the boundary.

By contrast, [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md) predicts only at the geometric center, so two quantities suffice.

![1604591595855]([paper reading] FCOS.assets/1604591595855.png)

Note: the problem of overlapping objects is largely solved by [Multi-Level FPN Prediction](#Multi-Level FPN Prediction). If overlap still occurs, the small object is preferred (the bounding box with the minimum area is chosen).
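The target computation and the minimum-area tie-break can be sketched as follows (hypothetical helper names; boxes given as (x0, y0, x1, y1)):

```python
def regression_target(x, y, box):
    """(l*, t*, r*, b*): distances from location (x, y) to the four box sides."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)

def pick_box(x, y, boxes):
    """Among the ground-truth boxes containing (x, y), choose the minimum-area one."""
    containing = [b for b in boxes if b[0] <= x <= b[2] and b[1] <= y <= b[3]]
    if not containing:
        return None  # (x, y) is a negative sample
    return min(containing, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

# A location inside both a large and a small box regresses toward the small one.
boxes = [(0, 0, 100, 100), (40, 40, 60, 60)]
chosen = pick_box(50, 50, boxes)
target = regression_target(50, 50, chosen)
```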

C-Dimensional Vector $\pmb p$

The experiments use $C$ binary classifiers rather than a single C-class classifier.

Feature Pyramid

Five levels of feature maps are defined: ${P_3, P_4, P_5, P_6, P_7}$ (with strides ${8,16,32,64,128}$)

![image-20201106164204571]([paper reading] FCOS.assets/image-20201106164204571.png)

  • ${P_3,P_4,P_5}$ : backbone feature maps ${ C_3, C_4,C_5}$ + 1×1 convolution
  • ${P_6,P_7}$ : obtained from $P_5$ & $P_6$ via a convolution layer with stride=2

Inference

  1. Feed the image through the network and obtain, at each location of feature map $F_i$:

    • classification score $\pmb p_{x,y}$
    • regression prediction $\pmb t_{x,y}$
  2. Select the locations with $\pmb p_{x,y} > 0.05$ as positive samples

  3. Decode them into bounding-box coordinates

    ![image-20201106153423939]([paper reading] FCOS.assets/image-20201106153423939.png)
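Step 3 (decoding) simply inverts the regression targets; a minimal sketch with an assumed function name:

```python
def decode_box(x, y, l, t, r, b):
    """Recover corner coordinates from the per-side distances at location (x, y)."""
    return (x - l, y - t, x + r, y + b)  # (x0, y0, x1, y1)

box = decode_box(50, 50, 10, 10, 10, 10)
```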

Ablation Study

Multi-Level FPN Prediction

![image-20201106173855040]([paper reading] FCOS.assets/image-20201106173855040.png)

Conclusions:

  • Best Possible Recall is not a problem for FCOS
  • Multi-Level FPN Prediction improves the Best Possible Recall

Ambiguity Samples

![image-20201106174013100]([paper reading] FCOS.assets/image-20201106174013100.png)

Conclusion:

  • Multi-Level FPN Prediction solves the problem of ambiguous samples

    i.e. most of the overlap ambiguity is separated into different feature levels, and only very few ambiguous locations remain

With or Without Center-ness

![image-20201106174249597]([paper reading] FCOS.assets/image-20201106174249597.png)

  • Center-ness suppresses low-quality bounding boxes from locations far from the center, greatly improving AP

  • Center-ness must have its own separate branch

    ![image-20201106174524679]([paper reading] FCOS.assets/image-20201106174524679.png)

FCN & Detection

FCN is mainly used for dense prediction.

In fact, the fundamental vision tasks can all be unified into one single framework.

The use of anchors actually makes the detection task deviate from the neat fully convolutional per-pixel prediction framework.

FCOS $vs.$ YOLO v1

Whereas YOLO v1 predicts only from points near the center, FCOS predicts from all points inside the ground-truth box.

The resulting low-quality bounding boxes are then suppressed by center-ness.

This lets FCOS reach a recall comparable to anchor-based detectors.

Math

Symbol Definition

  • $F_i \in \mathbb{R} ^{H×W×C}$ : the feature map at layer $i$ of the backbone

  • $s$ : the total stride up to that layer

  • ${B_i}$ : ground-truth boxes $$ B_i = (x_0^{(i)}, y_0^{(i)},x_1^{(i)},y_1^{(i)},c^{(i)}) \in \mathbb{R}^4 × \{ 1,2...C\} $$

    • $(x_0^{(i)}, y_0^{(i)})$ : top-left corner coordinate
    • $(x_1^{(i)}, y_1^{(i)})$ : bottom-right corner coordinate
    • $c^{(i)}$ : the class of the object in the bounding box
    • $C$ : the number of classes

Loss Function

![image-20201106155556758]([paper reading] FCOS.assets/image-20201106155556758.png)

  • $\lambda = 1$

A Center-ness Loss is still missing above; it is a binary cross-entropy loss.

The loss is computed over all locations on the feature maps. Specifically:

  • the Classification Loss is computed over all locations (positive & negative)

  • the Regression Loss is computed only over positive locations

    $\mathbb{1}_{\{c_{x,y}^* > 0\}} = 1$ if $c_{x,y}^*>0$, and $0$ otherwise
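For reference, the total loss shown in the figure above is, as given in the FCOS paper:

$$ L\big(\{\pmb p_{x,y}\}, \{\pmb t_{x,y}\}\big) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\big(\pmb p_{x,y}, c_{x,y}^*\big) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c_{x,y}^* > 0\}} L_{reg}\big(\pmb t_{x,y}, \pmb t_{x,y}^*\big) $$

where $L_{cls}$ is the focal loss, $L_{reg}$ is the IoU loss, and $N_{pos}$ is the number of positive samples.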

Center-ness

$$ \text { centerness}^{*}=\sqrt{\frac{\min \left(l^{*}, r^{*}\right)}{\max \left(l^{*}, r^{*}\right)} \times \frac{\min \left(t^{*}, b^{*}\right)}{\max \left(t^{*}, b^{*}\right)}} $$

  • center-ness reflects the normalized distance from a location to the center of the object it is responsible for
  • the square root is used to slow down the decay of center-ness
  • center-ness lies in the range [0, 1]
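The formula above can be sketched directly (assumed function name):

```python
import math

def centerness(l, t, r, b):
    """Center-ness target in [0, 1], from the regression targets (l*, t*, r*, b*)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

c_center = centerness(10, 10, 10, 10)  # at the geometric center
c_edge = centerness(1, 10, 19, 10)     # near the left side of the box
```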

Remap of Feature & Image

A location $(x,y)$ on the feature map maps back to the original image as: $$ \big( \lfloor \frac s2 \rfloor + xs , \lfloor \frac s2 \rfloor + ys \big) $$ This position lies near the center of the receptive field of location $(x,y)$.
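A quick sketch of the remap (integer stride assumed; function name is my own):

```python
def remap(x, y, s):
    """Map feature-map location (x, y) at total stride s back to image coordinates."""
    return (s // 2 + x * s, s // 2 + y * s)

pt = remap(3, 5, 8)  # a location on the stride-8 map P3
```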

Use Yourself

Physical & Prior Knowledge

  • From the perspective of physical meaning:

    center-ness encodes the physical meaning of "center", thereby filtering out the center-like locations; that is, the locations that are not down-weighted all approximately carry the physical meaning of a center

  • From the perspective of prior knowledge:

    center-ness actually encodes prior knowledge about bounding boxes, namely that locations near the boundary struggle to produce high-quality bounding boxes

Takeaway: if we can find a reasonable way to embed prior knowledge into a model so that it carries some physical meaning, we should be able to gain some performance.

Design or Adaptive

On how multi-size objects are detected across multi-scale feature levels:

Earlier networks (e.g. RetinaNet's FPN) do not specify which level detects which size; this is an adaptive approach.

FCOS's Multi-Level FPN Prediction assigns objects of a specific size to a specific feature level; this is a manually designed approach.

Whether to choose design or adaptive depends on the problem at hand. My current view:

  • Adaptive approach

    requires the current method to work, with only slight adaptive adjustment (staying within the existing framework) and a nearly consistent positive gain

  • Manual design

    suits cases where the current method's assignment is messy (a one-to-many correspondence) with a low degree of specialization, and manual design can convert it into a one-to-one correspondence

Sample & Filter Strategy

In general, sampling and filtering form a trade-off.

Usually, the more points sampled the better, but the space and time cost increases, as does the demand on post-processing.

It is best to use as many candidate sample points as possible and then filter out the redundant ones in the simplest possible way (e.g. FCOS's classification score × center-ness).

Generalized Keypoint-based

FCOS = CenterNet (Objects as Points) + Center-ness

CenterNet (Objects as Points) effectively **uses only the geometric center of the bounding box** to predict the box (e.g. its $W,H$), so the distances from its detected center to the sides are $(\frac{W}{2}, \frac{H}{2})$.

FCOS uses all locations inside the ground-truth bounding box for prediction (i.e. $l^*,t^*,r^*,b^*$), but efficiently suppresses the locations far from the center with center-ness.

FCOS can thus also be seen as generalized keypoint-based. Unlike traditional keypoints (e.g. in CornerNet and CenterNet), which have concrete physical meaning (e.g. centers and corners), FCOS's "keypoints" are all locations inside the ground-truth bounding box, and center-ness is what endows them with physical meaning.

In short, the locations FCOS ultimately uses to produce predicted bounding boxes are center-like, i.e.:

  • spatially, they are not necessarily the geometric center
  • but they carry center-like physical meaning and prior knowledge

Related Work

Drawbacks of Anchor

  1. Detection performance is sensitive to hyperparameters such as the size, aspect ratio, and number of anchors

    so anchors require careful manual design

  2. A large number of anchors is needed to obtain a high recall rate

    which leads to an extreme positive/negative sample imbalance during training

  3. Anchors come with complex computation

    e.g. computing IoU

  4. Anchor sizes and aspect ratios are pre-defined, so they cannot handle shape variations (especially for small objects)

    Moreover, the "pre-defined" nature of anchors also hurts the model's generalization ability; in other words, the designed anchors are task-specific

DenseBox-Based

  • Crops and resizes the image to handle bounding boxes of different sizes

    so DenseBox must run detection on an image pyramid

    which contradicts the FCN idea of computing all convolutions only once

  • Applies only to specific domains and struggles with overlapping objects

    because it cannot determine which object a given pixel should regress to

    ![1604591595855]([paper reading] FCOS.assets/1604591595855.png)
  • Relatively low recall

Anchor-Based Detector

  • Origin

    sliding-window and proposal-based detectors

  • Essence of anchors

    pre-defined sliding windows (proposals) + offset regression

  • Role of anchors

    serving as training data for the detector

  • Typical models

    • Faster-RCNN
    • SSD
    • YOLO v2

YOLO v1

YOLO v1 is a typical Anchor-Free Detector.

Idea

YOLO v1 uses the points near the center to predict bounding boxes.

That is: whichever grid cell an object's center falls into is responsible for predicting that object's bounding box.

The reason: points near the center can generate higher-quality detections.

Drawbacks of Points near Center

Using only the points near the center leads to low recall.

This is exactly why YOLO v2 brought anchors back.

CornerNet

CornerNet is a typical Anchor-Free Detector.

Steps

  1. corner detection
  2. corner grouping
  3. post-processing

Drawbacks of Corner

The post-processing is complex and requires an additional distance metric.