- 8 NVIDIA Tesla V100 GPUs
- Intel Xeon 4114 CPU @ 2.20GHz
- Python 3.6 / 3.7
- PyTorch 0.4.1
- CUDA 9.0.176
- CUDNN 7.0.4
- NCCL 2.1.15
- All baselines were trained using 8 GPUs with a batch size of 16 (2 images per GPU).
- All models were trained on `coco_2017_train` and tested on `coco_2017_val`.
- We use distributed training and BN layer stats are fixed.
- We adopt the same training schedules as Detectron. 1x indicates 12 epochs and 2x indicates 24 epochs, which corresponds to slightly fewer iterations than Detectron; the difference is negligible.
- All pytorch-style backbones pretrained on ImageNet are from the PyTorch model zoo.
- We report the training GPU memory as the maximum value of `torch.cuda.max_memory_cached()` for all 8 GPUs. Note that this value is usually less than what `nvidia-smi` shows, but closer to the actual requirements.
- We report the inference time as the overall time including data loading, network forwarding and post processing.
- The training memory and time of the 2x schedule are simply copied from 1x. They should be very close to the actual memory and time.
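As a rough check of the schedule note above, the sketch below converts epochs into iterations for the 8 GPU, 2 images per GPU setup. The training-set size (117,266 images) and the 90k-iteration Detectron 1x schedule are assumptions that are not stated in this document, and the function names are illustrative only.

```python
import math


def iterations_per_epoch(num_images, num_gpus, imgs_per_gpu):
    """Number of optimizer steps needed to see every image once."""
    batch_size = num_gpus * imgs_per_gpu
    return math.ceil(num_images / batch_size)


def schedule_iterations(epochs, num_images, num_gpus=8, imgs_per_gpu=2):
    """Total iterations of an epoch-based schedule."""
    return epochs * iterations_per_epoch(num_images, num_gpus, imgs_per_gpu)


# Assumed values, not taken from this document.
ASSUMED_TRAIN_IMAGES = 117266
ASSUMED_DETECTRON_1X_ITERS = 90000

iters_1x = schedule_iterations(12, ASSUMED_TRAIN_IMAGES)
iters_2x = schedule_iterations(24, ASSUMED_TRAIN_IMAGES)
print("1x:", iters_1x, "iterations; 2x:", iters_2x, "iterations")
print("fewer than assumed Detectron 1x:", iters_1x < ASSUMED_DETECTRON_1X_ITERS)
```

Under these assumptions the epoch-based 1x schedule ends a little earlier than an iteration-based one, which is consistent with the "slightly fewer iterations" remark.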
We released RPN, Faster R-CNN and Mask R-CNN models in the first version. More models with different backbones will be added to the model zoo.
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | AR1000 | Download |
---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 4.5 | 0.379 | 14.4 | 58.2 | - |
R-50-FPN | pytorch | 1x | 4.8 | 0.407 | 14.5 | 57.1 | model |
R-50-FPN | pytorch | 2x | 4.8 | 0.407 | 14.5 | 57.6 | model |
R-101-FPN | caffe | 1x | 7.4 | 0.513 | 11.1 | 59.4 | - |
R-101-FPN | pytorch | 1x | 8.0 | 0.552 | 11.1 | 58.6 | model |
R-101-FPN | pytorch | 2x | 8.0 | 0.552 | 11.1 | 59.1 | model |
X-101-32x4d-FPN | pytorch | 1x | 9.9 | 0.691 | 8.3 | 59.4 | model |
X-101-32x4d-FPN | pytorch | 2x | 9.9 | 0.691 | 8.3 | 59.9 | model |
X-101-64x4d-FPN | pytorch | 1x | 14.6 | 1.032 | 6.2 | 59.8 | model |
X-101-64x4d-FPN | pytorch | 2x | 14.6 | 1.032 | 6.2 | 60.0 | model |
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | Download |
---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 4.9 | 0.525 | 10.0 | 36.7 | - |
R-50-FPN | pytorch | 1x | 5.1 | 0.554 | 9.9 | 36.4 | model |
R-50-FPN | pytorch | 2x | 5.1 | 0.554 | 9.9 | 37.7 | model |
R-101-FPN | caffe | 1x | 7.4 | 0.663 | 8.4 | 38.8 | - |
R-101-FPN | pytorch | 1x | 8.0 | 0.698 | 8.3 | 38.6 | model |
R-101-FPN | pytorch | 2x | 8.0 | 0.698 | 8.3 | 39.4 | model |
X-101-32x4d-FPN | pytorch | 1x | 9.9 | 0.842 | 7.0 | 40.2 | model |
X-101-32x4d-FPN | pytorch | 2x | 9.9 | 0.842 | 7.0 | 40.5 | model |
X-101-64x4d-FPN | pytorch | 1x | 14.1 | 1.181 | 5.2 | 41.3 | model |
X-101-64x4d-FPN | pytorch | 2x | 14.1 | 1.181 | 5.2 | 40.7 | model |
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | mask AP | Download |
---|---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 5.9 | 0.658 | 7.7 | 37.5 | 34.4 | - |
R-50-FPN | pytorch | 1x | 5.8 | 0.690 | 7.7 | 37.3 | 34.2 | model |
R-50-FPN | pytorch | 2x | 5.8 | 0.690 | 7.7 | 38.6 | 35.1 | model |
R-101-FPN | caffe | 1x | 8.8 | 0.791 | 7.0 | 39.9 | 36.1 | - |
R-101-FPN | pytorch | 1x | 9.1 | 0.825 | 6.7 | 39.4 | 35.9 | model |
R-101-FPN | pytorch | 2x | 9.1 | 0.825 | 6.7 | 40.4 | 36.6 | model |
X-101-32x4d-FPN | pytorch | 1x | 10.9 | 0.972 | 5.8 | 41.2 | 37.2 | model |
X-101-32x4d-FPN | pytorch | 2x | 10.9 | 0.972 | 5.8 | 41.4 | 37.1 | model |
X-101-64x4d-FPN | pytorch | 1x | 14.1 | 1.302 | 4.7 | 42.2 | 38.1 | model |
X-101-64x4d-FPN | pytorch | 2x | 14.1 | 1.302 | 4.7 | 42.0 | 37.8 | model |
Backbone | Style | Type | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | mask AP | Download |
---|---|---|---|---|---|---|---|---|---|
R-50-FPN | caffe | Faster | 1x | 3.5 | 0.348 | 14.6 | 36.6 | - | - |
R-50-FPN | pytorch | Faster | 1x | 4.0 | 0.375 | 14.5 | 35.8 | - | model |
R-50-FPN | pytorch | Faster | 2x | 4.0 | 0.375 | 14.5 | 37.1 | - | model |
R-101-FPN | caffe | Faster | 1x | 7.1 | 0.484 | 11.9 | 38.4 | - | - |
R-101-FPN | pytorch | Faster | 1x | 7.6 | 0.540 | 11.8 | 38.1 | - | model |
R-101-FPN | pytorch | Faster | 2x | 7.6 | 0.540 | 11.8 | 38.8 | - | model |
R-50-FPN | caffe | Mask | 1x | 5.4 | 0.473 | 10.7 | 37.3 | 34.5 | - |
R-50-FPN | pytorch | Mask | 1x | 5.3 | 0.504 | 10.6 | 36.8 | 34.1 | model |
R-50-FPN | pytorch | Mask | 2x | 5.3 | 0.504 | 10.6 | 37.9 | 34.8 | model |
R-101-FPN | caffe | Mask | 1x | 8.6 | 0.607 | 9.5 | 39.4 | 36.1 | - |
R-101-FPN | pytorch | Mask | 1x | 9.0 | 0.656 | 9.3 | 38.9 | 35.8 | model |
R-101-FPN | pytorch | Mask | 2x | 9.0 | 0.656 | 9.3 | 39.9 | 36.4 | model |
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | Download |
---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 6.7 | 0.468 | 9.4 | 35.8 | - |
R-50-FPN | pytorch | 1x | 6.9 | 0.496 | 9.1 | 35.6 | model |
R-50-FPN | pytorch | 2x | 6.9 | 0.496 | 9.1 | 36.5 | model |
R-101-FPN | caffe | 1x | 9.2 | 0.614 | 8.2 | 37.8 | - |
R-101-FPN | pytorch | 1x | 9.6 | 0.643 | 8.1 | 37.7 | model |
R-101-FPN | pytorch | 2x | 9.6 | 0.643 | 8.1 | 38.1 | model |
X-101-32x4d-FPN | pytorch | 1x | 10.8 | 0.792 | 6.7 | 38.7 | model |
X-101-32x4d-FPN | pytorch | 2x | 10.8 | 0.792 | 6.7 | 39.3 | model |
X-101-64x4d-FPN | pytorch | 1x | 14.6 | 1.128 | 5.3 | 40.0 | model |
X-101-64x4d-FPN | pytorch | 2x | 14.6 | 1.128 | 5.3 | 39.6 | model |
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | Download |
---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 5.0 | 0.592 | 8.1 | 40.3 | - |
R-50-FPN | pytorch | 1x | 5.5 | 0.622 | 8.0 | 40.3 | model |
R-50-FPN | pytorch | 20e | 5.5 | 0.622 | 8.0 | 41.1 | model |
R-101-FPN | caffe | 1x | 8.5 | 0.731 | 7.0 | 42.2 | - |
R-101-FPN | pytorch | 1x | 8.7 | 0.766 | 6.9 | 42.1 | model |
R-101-FPN | pytorch | 20e | 8.7 | 0.766 | 6.9 | 42.6 | model |
X-101-32x4d-FPN | pytorch | 1x | 10.6 | 0.902 | 5.7 | 43.5 | model |
X-101-32x4d-FPN | pytorch | 20e | 10.6 | 0.902 | 5.7 | 44.1 | model |
X-101-64x4d-FPN | pytorch | 1x | 14.1 | 1.251 | 4.6 | 44.6 | model |
X-101-64x4d-FPN | pytorch | 20e | 14.1 | 1.251 | 4.6 | 44.8 | model |
Backbone | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | mask AP | Download |
---|---|---|---|---|---|---|---|---|
R-50-FPN | caffe | 1x | 7.5 | 0.880 | 5.8 | 41.0 | 35.6 | - |
R-50-FPN | pytorch | 1x | 7.6 | 0.910 | 5.7 | 41.3 | 35.7 | model |
R-50-FPN | pytorch | 20e | 7.6 | 0.910 | 5.7 | 42.4 | 36.6 | model |
R-101-FPN | caffe | 1x | 10.5 | 1.024 | 5.3 | 43.1 | 37.3 | - |
R-101-FPN | pytorch | 1x | 10.9 | 1.055 | 5.2 | 42.7 | 37.1 | model |
R-101-FPN | pytorch | 20e | 10.9 | 1.055 | 5.2 | 43.4 | 37.6 | model |
X-101-32x4d-FPN | pytorch | 1x | 12.67 | 1.181 | 4.2 | 44.4 | 38.3 | model |
X-101-32x4d-FPN | pytorch | 20e | 12.67 | 1.181 | 4.2 | 44.9 | 38.7 | model |
X-101-64x4d-FPN | pytorch | 1x | 10.87 | 1.125 | 3.6 | 45.5 | 39.2 | model |
X-101-64x4d-FPN | pytorch | 20e | 10.87 | 1.125 | 3.6 | 45.8 | 39.5 | model |
Notes:
- The `20e` schedule in Cascade (Mask) R-CNN indicates decreasing the lr at 16 and 19 epochs, with a total of 20 epochs.
- Cascade Mask R-CNN with X-101-64x4d-FPN was trained using 16 GPUs with a batch size of 16 (1 image per GPU).
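The `20e` schedule described above can be sketched as a plain step-decay function. The base learning rate (0.02), the decay factor (0.1) and the 0-indexed epoch convention are assumptions made for illustration; only the milestones (16, 19) and the total of 20 epochs come from the note.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate for a 0-indexed epoch under step decay.

    The rate is multiplied by `gamma` once for every milestone
    that has already been reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Milestones and length of the 20e schedule, as stated in the note.
MILESTONES_20E = (16, 19)
TOTAL_EPOCHS_20E = 20

# Assumed base learning rate, not taken from this document.
ASSUMED_BASE_LR = 0.02

for epoch in range(TOTAL_EPOCHS_20E):
    lr = step_lr(epoch, ASSUMED_BASE_LR, MILESTONES_20E)
    print("epoch", epoch, "lr", lr)
```

With these settings the rate stays at the base value for the first 16 epochs, drops by a factor of 10 for the next 3, and drops again for the final epoch.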
Backbone | Size | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | Download |
---|---|---|---|---|---|---|---|---|
VGG16 | 300 | caffe | 120e | 3.5 | 0.286 | 22.9 / 29.2 | 25.7 | model |
VGG16 | 512 | caffe | 120e | 6.3 | 0.458 | 17.3 / 21.2 | 29.3 | model |
Backbone | Size | Style | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | Download |
---|---|---|---|---|---|---|---|---|
VGG16 | 300 | caffe | 240e | 1.2 | 0.189 | 40.1 / 58.0 | 77.8 | model |
VGG16 | 512 | caffe | 240e | 2.9 | 0.261 | 28.1 / 36.2 | 80.4 | model |
Notes:
- `cudnn.benchmark` is set to `True` for SSD training and testing.
- Inference time is reported for batch size = 1 and batch size = 8.
- The speed difference between VOC and COCO is caused by model parameters and NMS.
Backbone | model | Lr schd | Mem (GB) | Train time (s/iter) | Inf time (fps) | box AP | mask AP | Download |
---|---|---|---|---|---|---|---|---|
R-50-FPN (d) | Mask R-CNN | 2x | 7.2 | 0.806 | 5.4 | 39.9 | 36.1 | model |
R-50-FPN (d) | Mask R-CNN | 3x | 7.2 | 0.806 | 5.4 | 40.2 | 36.5 | model |
R-101-FPN (d) | Mask R-CNN | 2x | 9.9 | 0.970 | 4.8 | 41.6 | 37.1 | model |
R-101-FPN (d) | Mask R-CNN | 3x | 9.9 | 0.970 | 4.8 | 41.7 | 37.3 | model |
R-50-FPN (c) | Mask R-CNN | 2x | 7.2 | 0.806 | 5.4 | 39.7 | 35.9 | model |
R-50-FPN (c) | Mask R-CNN | 3x | 7.2 | 0.806 | 5.4 | 40.1 | 36.2 | model |
Notes:
- (d) means pretrained model converted from Detectron, and (c) means the contributed model pretrained by @thangvubk.
- The `3x` schedule is epoch [28, 34, 36].
- The memory is measured with `torch.cuda.max_memory_allocated()` instead of `torch.cuda.max_memory_cached()`. We will update the memory usage of other models in the future.
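To illustrate how a single "Mem (GB)" figure could be derived from per-GPU readings, here is a minimal sketch. The helper names are hypothetical, the conversion to GB by dividing by 1024^3 is an assumption, and gathering the readings from all training processes is left out. The two `torch.cuda` calls are the ones named in this document; newer PyTorch releases deprecate `max_memory_cached()` in favor of `max_memory_reserved()`.

```python
def peak_memory_gb(per_gpu_peak_bytes):
    """Reduce per-GPU peak readings (in bytes) to one number in GB.

    Takes the maximum over all GPUs, matching the reporting
    convention of taking the largest value across devices.
    """
    if not per_gpu_peak_bytes:
        raise ValueError("expected at least one reading")
    return max(per_gpu_peak_bytes) / 1024 ** 3


def read_local_peak_bytes(use_allocated=False):
    """Peak memory of the current process's GPU (needs PyTorch with CUDA).

    use_allocated=False reads the caching allocator's peak,
    use_allocated=True reads the peak of memory occupied by tensors.
    """
    import torch  # imported lazily so the reduction above works without it

    if use_allocated:
        return torch.cuda.max_memory_allocated()
    return torch.cuda.max_memory_cached()


# Hypothetical readings from 4 GPUs, in bytes.
readings = [5 * 1024 ** 3, 6 * 1024 ** 3, 4 * 1024 ** 3, 6 * 1024 ** 3]
print(round(peak_memory_gb(readings), 1), "GB")
```

Because the allocated figure is never larger than the cached one for the same run, numbers obtained with the two calls are not directly comparable.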
We compare mmdetection with Detectron and Detectron.pytorch, a third-party port of Detectron to PyTorch. The backbone used is R-50-FPN.
In general, mmdetection has 3 advantages over Detectron.
- Higher performance (especially in terms of mask AP)
- Faster training speed
- Better memory efficiency
Detectron and Detectron.pytorch use caffe-style ResNet as the backbone. In order to utilize the PyTorch model zoo, we use pytorch-style ResNet in our experiments.
Meanwhile, we also train models with caffe-style ResNet in 1x experiments for comparison. We find that pytorch-style ResNet usually converges more slowly than caffe-style ResNet, leading to slightly lower results with the 1x schedule, but the final results with the 2x schedule are higher.
We report results using both caffe-style (weights converted from here) and pytorch-style (weights from the official model zoo) ResNet backbone, indicated as pytorch-style results / caffe-style results.
Type | Lr schd | Detectron | Detectron.pytorch | mmdetection |
---|---|---|---|---|
RPN | 1x | 57.2 | - | 57.1 / 58.2 |
RPN | 2x | - | - | 57.6 / - |
Faster R-CNN | 1x | 36.7 | 37.1 | 36.4 / 36.7 |
Faster R-CNN | 2x | 37.9 | - | 37.7 / - |
Mask R-CNN | 1x | 37.7 & 33.9 | 37.7 & 33.7 | 37.3 & 34.2 / 37.5 & 34.4 |
Mask R-CNN | 2x | 38.6 & 34.5 | - | 38.6 & 35.1 / - |
Fast R-CNN | 1x | 36.4 | - | 35.8 / 36.6 |
Fast R-CNN | 2x | 36.8 | - | 37.1 / - |
Fast R-CNN (w/mask) | 1x | 37.3 & 33.7 | - | 36.8 & 34.1 / 37.3 & 34.5 |
Fast R-CNN (w/mask) | 2x | 37.7 & 34.0 | - | 37.9 & 34.8 / - |
The training speed is measured in s/iter. Lower is better.
Type | Detectron (P100<sup>1</sup>) | Detectron.pytorch (XP<sup>2</sup>) | mmdetection<sup>3</sup> (V100<sup>4</sup> / XP) |
---|---|---|---|
RPN | 0.416 | - | 0.407 / 0.413 |
Faster R-CNN | 0.544 | 1.015 | 0.554 / 0.579 |
Mask R-CNN | 0.889 | 1.435 | 0.690 / 0.732 |
Fast R-CNN | 0.285 | - | 0.375 / 0.398 |
Fast R-CNN (w/mask) | 0.377 | - | 0.504 / 0.574 |
*1. Detectron reports the speed on Facebook's Big Basin servers (P100). On our V100 servers it is slower, so we use the officially reported values.
*2. Detectron.pytorch does not report the runtime and we encountered some issues running it on V100, so we report the speed on TITAN XP.
*3. The speed of pytorch-style ResNet is approximately 5% slower than caffe-style, and we report the pytorch-style results here.
*4. We also run the models on a DGX-1 server (P100) and the speed is almost the same as our V100 servers.
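The per-iteration times in the table above can be turned into relative speeds with a few lines of Python. This is only a sketch: the numbers are copied from the Detectron (P100) column and the mmdetection V100 column, and the function and variable names are illustrative.

```python
# (Detectron on P100, mmdetection on V100) training time in s/iter,
# copied from the training speed table.
TRAIN_S_PER_ITER = {
    "RPN": (0.416, 0.407),
    "Faster R-CNN": (0.544, 0.554),
    "Mask R-CNN": (0.889, 0.690),
    "Fast R-CNN": (0.285, 0.375),
    "Fast R-CNN (w/mask)": (0.377, 0.504),
}


def relative_speed(reference_s_per_iter, candidate_s_per_iter):
    """Ratio above 1 means the candidate needs less time per iteration."""
    return reference_s_per_iter / candidate_s_per_iter


ratios = {
    name: relative_speed(detectron, mmdet)
    for name, (detectron, mmdet) in TRAIN_S_PER_ITER.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The two columns were measured on different hardware, so these ratios are indicative rather than exact. A ratio below 1 marks a row where the reported Detectron time is the lower one.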
The inference speed is measured in fps (img/s) on a single GPU. Higher is better.
Type | Detectron (P100) | Detectron.pytorch (XP) | mmdetection (V100 / XP) |
---|---|---|---|
RPN | 12.5 | - | 14.5 / 15.4 |
Faster R-CNN | 10.3 | - | 9.9 / 9.8 |
Mask R-CNN | 8.5 | - | 7.7 / 7.4 |
Fast R-CNN | 12.5 | - | 14.5 / 14.1 |
Fast R-CNN (w/mask) | 9.9 | - | 10.6 / 10.3 |
We performed various tests and found that mmdetection is more memory efficient than Detectron. The main cause is the deep learning framework itself, not our efforts. Besides, Caffe2 and PyTorch have different APIs for obtaining memory usage, and their implementations are not exactly the same.
`nvidia-smi` shows a larger memory usage for both Detectron and mmdetection. For example, when training Mask R-CNN with 2 images per GPU, we observe a much higher memory usage with Detectron (10.6G) and mmdetection (9.3G), which is obviously more than actually required.
With mmdetection, we can train R-50 FPN Mask R-CNN with 4 images per GPU (TITAN XP, 12G), which is a promising result.