[paper reading] FCOS

| topic | motivation | technique | key element | math | use yourself |
| --- | --- | --- | --- | --- | --- |
| [FCOS](./[paper reading] FCOS.md) | Idea, Contribution | [FCOS Architecture](#FCOS Architecture), Center-ness, [Multi-Level FPN Prediction](#Multi-Level FPN Prediction) | [Prediction Head](#Prediction Head), [Training Sample & Label](#Training Sample & Label), [Model Output](#Model Output), [Feature Pyramid](#Feature Pyramid), Inference, [Ablation Study](#Ablation Study), [FCN & Detection](#FCN & Detection), [FCOS $vs.$ YOLO v1](#FCOS $vs.$ YOLO v1) | [Symbol Definition](#Symbol Definition), [Loss Function](#Loss Function), Center-ness, [Remap of Feature & Image](#Remap of Feature & Image) | [Physical & Prior Knowledge](#Physical & Prior Knowledge), [Design or Adaptive](#Design or Adaptive), [Sample & Filter Strategy](#Sample & Filter Strategy), [Generalized Keypoint-based](#Generalized Keypoint-based) |

Motivation

Idea

Perform object detection by per-pixel prediction, implemented with a fully convolutional network

Contribution

Reformulates detection as per-pixel prediction
Uses multi-level prediction to:
- improve recall
- resolve the ambiguity caused by overlapping bounding boxes
Adds a center-ness branch

It suppresses the low-quality predictions among the bounding boxes

Techniques

FCOS Architecture

The default backbone is ResNet-50

![image-20201106155009575]([paper reading] FCOS.assets/image-20201106155009575.png)

Advantage

See [Drawbacks of Anchor](#Drawbacks of Anchor) for details

Unifies detection with other FCN-solvable tasks (e.g. semantic segmentation)

Ideas from those tasks can therefore be re-used for detection
Anchor-free & proposal-free
Eliminates the **complicated computation** related to anchors (e.g. IoU)

This gives faster training and testing and a smaller training memory footprint
Achieves SOTA among one-stage detectors and can be used to replace the RPN
Can be quickly extended to other vision tasks (e.g. instance segmentation, key-point detection)

Center-ness

Center-ness is predicted for every location

It improves performance considerably

Idea

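Locations far from the center of an object produce many low-quality predicted bounding boxes

FCOS introduces center-ness to suppress (i.e. down-weight) the low-quality bounding boxes that come from locations far from the center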

Implement

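A center-ness branch is added to predict the center-ness of each location

At test time:

Compute the final score as: $$ \text{Final Score} = \text{Classification Score} \times \text{Center-ness} $$
Use NMS to filter out the bounding boxes whose scores were down-weighted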
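
A minimal sketch of this test-time step in plain Python. The helper names (`iou`, `final_scores`, `nms`) and the IoU threshold of 0.6 are assumptions made for illustration, not taken from these notes; the NMS shown is the common greedy variant.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def final_scores(cls_scores: List[float], centerness: List[float]) -> List[float]:
    """Final score = classification score x center-ness, per location."""
    return [c * q for c, q in zip(cls_scores, centerness)]


def nms(boxes: List[Box], scores: List[float], iou_thresh: float = 0.6) -> List[int]:
    """Greedy NMS (threshold is an assumed value); returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

In a toy case where an off-center location has a higher classification score (0.9) but low center-ness (0.3), its final score drops below that of a near-center location (0.8 × 0.95), so NMS keeps the near-center box and discards the off-center one.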

Multi-Level FPN Prediction

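Multi-Level FPN Prediction solves two problems:

Best Possible Recall

It raises the Best Possible Recall of FCOS to the SOTA level

Ambiguity of Ground-Truth Box Overlap

It resolves the ambiguity caused by overlapping ground-truth boxes, bringing FCOS to the level of anchor-based detectors

Reason: in the vast majority of cases, overlapping objects differ greatly in scale

The idea: dispatch the locations to be regressed to different feature levels according to their regression distance

Specifically:

Compute the regression targets $(l^*, t^*, r^*, b^*)$
Use the maximum regression distance of each feature level to select its positive samples $$ m_{i-1} < \max( l^*, t^*, r^*, b^* ) < m_i $$ where $m_i$ is the maximum distance that feature level $i$ needs to regress $$ \{m_2, m_3, m_4, m_5, m_6, m_7\} = \{ 0, 64, 128, 256, 512, \infty \} $$

Compared with the original FPN (e.g. SSD), FCOS "implicitly" assigns objects of different scales to different feature levels (small objects to shallow levels, large objects to deep levels)

I think this can be seen as a more refined form of hand-crafted design
If a location falls into two ground-truth boxes (i.e. it is ambiguous), the smaller box is chosen as its regression target (favoring small objects)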
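
The level assignment can be sketched as follows. This is only an illustration: the function name `assign_level` is made up, the last bound is taken as infinity so that $P_7$ covers everything above 512, and the treatment of values that fall exactly on a bound is an assumption.

```python
import math
from typing import Optional, Sequence

# m_2 .. m_7; level P_i covers the range (m_{i-1}, m_i].
M = [0.0, 64.0, 128.0, 256.0, 512.0, math.inf]
LEVELS = ["P3", "P4", "P5", "P6", "P7"]


def assign_level(target: Sequence[float]) -> Optional[str]:
    """Return the feature level responsible for a location, or None.

    `target` is (l*, t*, r*, b*) in input-image pixels. A location outside
    the box (any distance <= 0) is a negative sample on every level.
    """
    if min(target) <= 0:
        return None
    m = max(target)
    for name, lo, hi in zip(LEVELS, M[:-1], M[1:]):
        # Open on the left, closed on the right: an assumed convention.
        if lo < m <= hi:
            return name
    return None
```

For example, a location whose largest distance to a box side is 40 px is handled by `P3`, while one whose largest distance is 100 px is handled by `P4`. Two overlapping objects of very different size therefore usually end up on different levels.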

Key Elements

Prediction Head

Classification Branch

![image-20201106155159833]([paper reading] FCOS.assets/image-20201106155159833.png)

Regression Branch

![image-20201106155213687]([paper reading] FCOS.assets/image-20201106155213687.png)

Because the regression targets are always positive, exp($s_i x$) is applied on top of the regression branch (see [Shared Head](#Shared Head))

Shared Head

The head is shared across the different feature levels

Advantage：

parameter efficient
improve performance

Drawback：

Because of [Multi-Level FPN Prediction](#Multi-Level FPN Prediction), different feature levels regress different size ranges (e.g. [0, 64] for $P_3$, [64, 128] for $P_4$)

To make identical heads usable on different feature levels: $$ \exp(x) \rightarrow \exp(s_i x) $$

$s_i$: a trainable scalar that automatically adjusts the base of the exponential for feature level $P_i$
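
A small sketch of the per-level scaled exponential. The class name `ScaledExp` and the initial value 1.0 for every $s_i$ are assumptions; in a real model each $s_i$ would be a learnable parameter updated by the optimizer, which this plain-Python version does not do.

```python
import math
from typing import Dict, Sequence


class ScaledExp:
    """Maps the raw output x of a shared head to exp(s_i * x) per level."""

    def __init__(self, levels: Sequence[str] = ("P3", "P4", "P5", "P6", "P7"),
                 init: float = 1.0) -> None:
        # One scalar per feature level (assumed to start at `init`).
        self.scale: Dict[str, float] = {name: init for name in levels}

    def __call__(self, level: str, x: float) -> float:
        # exp() keeps the predicted distance strictly positive.
        return math.exp(self.scale[level] * x)


head = ScaledExp()
head.scale["P4"] = 1.5          # pretend training increased s_4
same_raw_output = 2.0
d3 = head("P3", same_raw_output)  # exp(1.0 * 2.0)
d4 = head("P4", same_raw_output)  # exp(1.5 * 2.0), a larger distance
```

The same raw head output thus decodes to a larger distance on the level with the larger $s_i$, which is what lets identical head weights serve levels with different regression ranges.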

Training Sample & Label

Training Sample

Locations are used directly as training samples (similar to FCN for semantic segmentation)

Label Pos/Neg

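A location $(x,y)$ is a positive sample if:

it falls inside a ground-truth box
the class label of location $(x,y)$ is the class of that ground-truth box

FCOS trains with as many foreground samples as possible (i.e. all locations inside a ground-truth box)

This differs from anchor-based detectors, which only take anchors with a high IoU with a ground-truth box as positive samples

It also differs from [CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which only treats the geometric center as a positive sample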

Model Output

For every location on the feature map of each level, the outputs are:

**4D Vector $\pmb t^*$**

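$$ \pmb t^* = (l^*, t^*, r^*, b^*) $$

It describes the relative offsets from the location to the four sides of the bounding box

Specifically, for a location $(x, y)$ associated with ground-truth box $B_i$:

$$ l^* = x - x_0^{(i)}, \quad t^* = y - y_0^{(i)}, \quad r^* = x_1^{(i)} - x, \quad b^* = y_1^{(i)} - y $$

Note:

FCOS computes targets for every location inside the ground-truth box (not only the geometric center), so four quantities must be predicted to recover the boundary

[CenterNet (Object as Points)](./[paper reading] CenterNet (Object as Points).md), which predicts only at the geometric center, needs just two quantities

Note: overlapping objects are mostly handled by [Multi-Level FPN Prediction](#Multi-Level FPN Prediction). If an overlap still remains, the smaller object takes priority (the bounding box with the smallest area is chosen)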
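
The labeling of a single location can be sketched like this, assuming the definition from the FCOS paper ($l^* = x - x_0$, $t^* = y - y_0$, $r^* = x_1 - x$, $b^* = y_1 - y$). The function names are illustrative, and treating a location exactly on a box edge as negative is an assumption.

```python
from typing import List, Optional, Tuple

GTBox = Tuple[float, float, float, float, int]   # (x0, y0, x1, y1, class)
Target = Tuple[float, float, float, float]       # (l*, t*, r*, b*)


def regression_target(x: float, y: float, box: GTBox) -> Target:
    """Distances from location (x, y) to the four sides of `box`."""
    x0, y0, x1, y1, _ = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def label_location(x: float, y: float,
                   boxes: List[GTBox]) -> Optional[Tuple[int, Target]]:
    """Return (class, target) for a positive location, None for a negative one.

    If the location lies inside several boxes, the one with the smallest
    area is chosen, following the rule stated above.
    """
    inside = [b for b in boxes if min(regression_target(x, y, b)) > 0]
    if not inside:
        return None
    best = min(inside, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    return best[4], regression_target(x, y, best)
```

With a large box of class 1 and a small box of class 2 nested inside it, a location in the overlap is labeled with class 2 and regresses to the small box, while a location that is only inside the large box is labeled with class 1.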

C-Dimension Vector $\pmb p$

The experiments use C binary classifiers rather than a single C-class classifier

Feature Pyramid

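Five levels of feature maps are defined: $\{P_3, P_4, P_5, P_6, P_7\}$ (with strides $\{8,16,32,64,128\}$)

$\{P_3,P_4,P_5\}$: the backbone feature maps $\{C_3, C_4, C_5\}$ followed by a 1×1 convolution
$\{P_6,P_7\}$: obtained by applying a stride-2 convolutional layer to $P_5$ and $P_6$, respectively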

Inference

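Feed the image into the network and, at every location of each feature map $F_i$, obtain:
- classification score $\pmb p_{x,y}$
- regression prediction $\pmb t_{x,y}$
Take the locations with $\pmb p_{x,y} > 0.05$ as positive samples
Decode the regression prediction at those locations to obtain the bounding-box coordinates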
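
A sketch of the thresholding and decoding step. It assumes the locations are already expressed in input-image coordinates and the predicted distances $(l, t, r, b)$ are in pixels; the function names are made up, and the score threshold is left as an explicit argument.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def decode_box(x: float, y: float, l: float, t: float, r: float, b: float) -> Box:
    """Invert the regression targets: location + distances -> box corners."""
    return (x - l, y - t, x + r, y + b)


def select_and_decode(locations: Sequence[Tuple[float, float]],
                      scores: Sequence[float],
                      distances: Sequence[Tuple[float, float, float, float]],
                      score_thresh: float) -> List[Tuple[Box, float]]:
    """Keep locations whose score exceeds the threshold and decode their boxes."""
    detections: List[Tuple[Box, float]] = []
    for (x, y), p, (l, t, r, b) in zip(locations, scores, distances):
        if p > score_thresh:
            detections.append((decode_box(x, y, l, t, r, b), p))
    return detections
```

The decoded boxes would then go through the score re-weighting and NMS described under Center-ness.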

Ablation Study

Multi-Level FPN Prediction

Conclusions:

Best Possible Recall is not a problem for FCOS
Multi-Level FPN Prediction further improves the Best Possible Recall

Ambiguity Samples

Conclusion:

Multi-Level FPN Prediction resolves the problem of ambiguous samples

That is, most overlap ambiguity is separated onto different feature levels, and only very few ambiguous locations remain

With or Without Center-ness

Center-ness suppresses the low-quality bounding boxes far from the center and thereby improves AP considerably
Center-ness must be predicted by its own separate branch

FCN & Detection

FCN is mainly used for dense prediction

In fact, the fundamental vision tasks can all be unified into one single framework

The use of anchors, however, makes the detection task deviate from the neat fully convolutional per-pixel prediction framework

FCOS $vs.$ YOLO v1

Whereas YOLO v1 only uses points near the center for prediction, FCOS uses all points inside the ground-truth box

The resulting low-quality bounding boxes are suppressed by center-ness

This lets FCOS reach a recall close to that of anchor-based detectors

Math

Symbol Definition

$F_i \in \mathbb{R}^{H \times W \times C}$: the feature map at layer $i$ of the backbone
$s$: the total stride up to that layer
$\{B_i\}$: the ground-truth boxes $$ B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{ 1, 2, \dots, C \} $$
- $(x_0^{(i)}, y_0^{(i)})$: top-left corner coordinate
- $(x_1^{(i)}, y_1^{(i)})$: bottom-right corner coordinate
- $c^{(i)}$: class of the object in the bounding box
- $C$: number of classes

Loss Function

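$$ L(\{\pmb p_{x,y}\}, \{\pmb t_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(\pmb p_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{reg}(\pmb t_{x,y}, \pmb t^*_{x,y}) $$

$\lambda = 1$, and $N_{pos}$ is the number of positive samples

The formula omits the center-ness loss, which is a binary cross entropy loss

The loss is computed over all locations of the feature maps. Specifically:

Classification Loss is computed on all locations (positive & negative)
Regression Loss is computed on positive locations only

$\mathbb{1}_{\{c^*_{x,y} > 0\}} = 1$ if $c^*_{x,y} > 0$, and 0 otherwise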
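
How the two terms are combined can be sketched as below. The per-location loss values are taken as given inputs, the function name `fcos_loss` is illustrative, the normalization by the number of positive locations follows the FCOS paper, and clamping that count to at least 1 is an assumption to avoid division by zero.

```python
from typing import Sequence


def fcos_loss(labels: Sequence[int],
              cls_losses: Sequence[float],
              reg_losses: Sequence[float],
              lam: float = 1.0) -> float:
    """Combine per-location losses.

    labels[k]     : class label c* of location k (0 means background)
    cls_losses[k] : classification loss at location k
    reg_losses[k] : regression loss at location k (ignored where c* == 0)
    """
    n_pos = max(1, sum(1 for c in labels if c > 0))
    # Classification term: summed over ALL locations.
    cls_term = sum(cls_losses) / n_pos
    # Regression term: the indicator 1{c* > 0} keeps positive locations only.
    reg_term = sum(r for c, r in zip(labels, reg_losses) if c > 0) / n_pos
    return cls_term + lam * reg_term
```

Regression losses at background locations are simply never read, which is why their values do not matter.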

Center-ness

$$ \text{centerness}^* = \sqrt{\frac{\min \left(l^*, r^*\right)}{\max \left(l^*, r^*\right)} \times \frac{\min \left(t^*, b^*\right)}{\max \left(t^*, b^*\right)}} $$

Center-ness reflects the normalized distance from a location to the center of the object it is responsible for
The square root slows down the decay of center-ness
Center-ness ranges over [0,1]
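
The center-ness target is easy to check numerically. A minimal sketch, assuming all four distances are positive (i.e. the location lies strictly inside the box):

```python
import math


def centerness(l: float, t: float, r: float, b: float) -> float:
    """Center-ness target from the four regression distances (all > 0)."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


at_center = centerness(10, 10, 10, 10)   # -> 1.0
near_edge = centerness(1, 10, 19, 10)    # much smaller than 1
```

For the near-edge location the value under the square root is 1/19, so taking the root raises the target compared with the raw ratio product, which is the slower decay mentioned above.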

Remap of Feature & Image

A location $(x,y)$ on the feature map maps back to the input image as: $$ \big( \lfloor \frac s2 \rfloor + xs , \lfloor \frac s2 \rfloor + ys \big) $$ This position is near the center of the receptive field of location $(x,y)$
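
The mapping is a one-liner in code. A sketch with an illustrative function name, using integer division for the floor:

```python
from typing import Tuple


def feature_to_image(x: int, y: int, s: int) -> Tuple[int, int]:
    """Map feature-map location (x, y) at total stride s to image coordinates."""
    return (s // 2 + x * s, s // 2 + y * s)


origin_p3 = feature_to_image(0, 0, 8)    # -> (4, 4)
point_p4 = feature_to_image(3, 5, 16)    # -> (56, 88)
```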

Use Yourself

Physical & Prior Knowledge

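From the perspective of physical meaning:

Center-ness can encode the physical meaning of a center and thus pick out center-like locations; that is, the locations that are not down-weighted approximately carry the physical meaning of a center
From the perspective of prior knowledge:

Center-ness in fact encodes prior knowledge about bounding boxes: locations near the boundary rarely produce high-quality bounding boxes

Takeaway: if prior knowledge can be embedded into a model in a reasonable way, so that the model carries some physical meaning, performance should improve somewhat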

Design or Adaptive

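On how objects of multiple sizes are detected on feature levels of multiple scales:

Earlier networks (e.g. the FPN in RetinaNet) do not specify which level detects which size; this is the adaptive approach

FCOS's Multi-level FPN Prediction assigns objects of a given size to a given feature level; this is the hand-crafted approach

Whether to use Design or Adaptive depends on the problem at hand. My current view:

Adaptive:

Suitable when the current method already works, the adaptive adjustment is mild (it stays within the same framework), and the gain is almost always positive
Hand-crafted design:

Suitable when the current assignment is messy (a one-to-many correspondence) and poorly specialized, and hand-crafted design can turn it into a one-to-one correspondence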

Sample & Filter Strategy

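Overall, sampling and filtering form a trade-off

Generally, the more points are sampled the better, but the space and time costs grow, and so does the need for post-processing

The best option is still to use as many candidate sample points as possible and then filter out the redundant ones as simply as possible (e.g. FCOS's classification score × center-ness)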

Generalized Keypoint-based

FCOS = CenterNet (object as points) + Center-ness

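CenterNet (object as points) effectively **uses only the geometric center of the bounding box** to predict the bounding box (e.g. its $W, H$), so the distances from the detected center to the sides are $(\frac{W}{2}, \frac{H}{2})$

FCOS instead predicts from all locations inside the ground-truth bounding box (e.g. $l^*, t^*, r^*, b^*$), and uses center-ness to efficiently suppress the locations far from the center

FCOS also follows a generalized keypoint-based idea. In traditional keypoint methods (e.g. CornerNet, CenterNet) the keypoints have a concrete physical meaning (e.g. center, corner), whereas FCOS's keypoints are all the locations inside the ground-truth bounding box, and center-ness is what gives them a physical meaning

In short, the locations that FCOS ultimately uses to produce predicted bounding boxes are center-like, that is:

Spatially, they are not necessarily the geometric center
But they carry center-like physical meaning and prior knowledge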

Related Work

Drawbacks of Anchor

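Detection performance is sensitive to hyper-parameters such as the sizes, aspect ratios, and number of anchors

That is, anchors require careful hand-crafted design
A large number of anchors is needed to obtain a high recall rate

This causes an extreme imbalance between positive and negative samples during training
Anchors come with complicated computation

For example, computing IoU
Anchor sizes and aspect ratios are predefined, so anchors cannot cope with shape variations (especially for small objects)

In addition, this "predefined" form hurts the generalization ability of the model. In other words, the designed anchors are task-specific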

DenseBox-Based

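Crops and resizes the image to handle bounding boxes of different sizes

As a result, DenseBox must run detection on an image pyramid

This contradicts FCN's philosophy of computing all convolutions once
Only used in specific domains, and hard to apply to overlapping objects

Because it is unclear which object a given pixel should regress to
Recall is relatively low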

Anchor-Based Detector

Origin:

sliding window and proposal based detectors
The essence of anchors

Predefined sliding windows (proposals) + offset regression
The role of anchors

They serve as the training samples of the detector
Typical models
- Faster-RCNN
- SSD
- YOLO v2

YOLO v1

YOLO v1 is a typical Anchor-Free Detector

Idea

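YOLO v1 uses points near the center to predict bounding boxes

That is, the grid cell that an object's center falls into is responsible for predicting that object's bounding box

This is because points near the center produce higher-quality detections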

Drawbacks of Points near Center

Using only the points near the center leads to low recall

For this reason, YOLO v2 went back to using anchors

CornerNet

CornerNet is a typical Anchor-Free Detector

Steps

corner detection
corner grouping
post-processing

Drawbacks of Corner

Post-processing is complicated and requires an additional distance metric
