This repository includes:
- Training pipeline for DETR on a custom dataset
- WIDER FACE dataset annotations and images
- Evaluation on the test dataset
- Trained weights for the WIDER FACE dataset on the release page
- Metrics visualization
DETR (DEtection TRansformer) is Facebook's deep learning-based object detection model. Put simply, it uses the Transformer architecture to predict the objects in an image and their positions. DETR combines a Convolutional Neural Network (CNN) with a Transformer, followed by a feed-forward network as the prediction head. The multi-head attention mechanism of the Transformer operates on the features extracted by the CNN, which lets the network reason about relations between objects in the image.
I used the WIDER FACE dataset, a publicly available face detection benchmark consisting of 32,203 images with 393,703 labeled faces that vary widely in scale, pose and occlusion, as depicted in the sample images. The dataset is organized into 61 event classes. For each event class, the original dataset was split 40%/10%/50% into training, validation and testing sets.
Running the provided code downloads the dataset automatically, but you can also download it manually from the official website or from my GitHub release page.
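To get a feel for the dataset figures quoted above, the following sketch applies the 40%/10%/50% split to the total image count and computes the average number of faces per image. Because the real split is done per event class, the official set sizes may differ slightly from these rounded estimates; the function name is hypothetical.

```python
TOTAL_IMAGES = 32_203
TOTAL_FACES = 393_703
SPLIT_FRACTIONS = {"train": 0.40, "val": 0.10, "test": 0.50}


def approximate_split_sizes(total, fractions):
    """Estimate split sizes by applying each fraction to the whole dataset.

    This is only an approximation: WIDER FACE is split per event class,
    so the official counts can deviate by a few images.
    """
    return {name: round(total * frac) for name, frac in fractions.items()}


split_sizes = approximate_split_sizes(TOTAL_IMAGES, SPLIT_FRACTIONS)
faces_per_image = TOTAL_FACES / TOTAL_IMAGES

for name, size in split_sizes.items():
    print(f"{name}: about {size} images")
print(f"average faces per image: {faces_per_image:.1f}")
```

The high average face count per image is one reason the dataset contains so many small faces.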
In dataloader/face.py, I set the maximum width of images in the random transform to 800 pixels. This should allow for training on most GPUs, but it is advisable to change back to the original 1333 if your GPU can handle it.
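To see what the size cap does to image dimensions, here is a minimal pure-Python sketch of a shortest-side resize with a longest-side cap. The function name `capped_resize_shape` is hypothetical and the logic is a simplification for illustration, not necessarily the exact code in dataloader/face.py.

```python
def capped_resize_shape(width, height, target_short, max_size):
    """Return (new_width, new_height) for a shortest-side resize.

    The shortest side is scaled to target_short. If that would push the
    longest side past max_size, the target is lowered so the longest side
    lands on max_size instead. Aspect ratio is kept up to integer rounding.
    """
    short, long = min(width, height), max(width, height)
    if long / short * target_short > max_size:
        target_short = int(round(max_size * short / long))
    # Integer arithmetic avoids floating-point surprises when truncating.
    return width * target_short // short, height * target_short // short


# A 1024x768 image resized with a target shortest side of 800 pixels.
capped = capped_resize_shape(1024, 768, target_short=800, max_size=800)
original = capped_resize_shape(1024, 768, target_short=800, max_size=1333)
print(capped)    # (800, 600)
print(original)  # (1066, 800)

# Pixel count is a rough proxy for activation memory.
ratio = (original[0] * original[1]) / (capped[0] * capped[1])
print(f"max_size=1333 processes about {ratio:.2f}x more pixels")
```

In this example the lower cap reduces the number of pixels per image by a bit less than half, which is where the memory saving comes from.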
We're going to use DETR with a ResNet-50 backbone, pretrained on the COCO 2017 dataset. In the model zoo below, AP is computed on COCO 2017 val5k, and inference time is measured over the first 100 val5k COCO images with a TorchScript transformer. If you want to use other DETR models, you can find them in the model zoo below.
Model Zoo
name | backbone | schedule (epochs) | inf_time (s) | box AP | url | size |
---|---|---|---|---|---|---|
DETR | R50 | 500 | 0.036 | 42.0 | model, logs | 159 MB |
DETR-DC5 | R50 | 500 | 0.083 | 43.3 | model, logs | 159 MB |
DETR | R101 | 500 | 0.050 | 43.5 | model, logs | 232 MB |
DETR-DC5 | R101 | 500 | 0.097 | 44.9 | model, logs | 232 MB |
To train your model, run all the cells of detr_custom_dataset.ipynb in Google Colaboratory, where the notebook should run without errors.
Follow this readme to understand the training pipeline of DETR and evaluation on test images.
Training for 15 epochs with batch_size=16 took me 4:59:45 (h:mm:ss) on a Tesla P100-PCIE. If you want better accuracy, you can train for more epochs.
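As a rough guide to how longer training would scale, the sketch below derives the per-epoch time from the figures above and extrapolates linearly. The linear scaling and the 50-epoch target are illustrative assumptions; actual time depends on the GPU, data loading and Colab session limits.

```python
def parse_hms(text):
    """Convert an 'h:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Convert seconds back to an 'h:mm:ss' string."""
    hours, remainder = divmod(int(round(total_seconds)), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


MEASURED_TIME = "4:59:45"   # reported above for 15 epochs
MEASURED_EPOCHS = 15
PLANNED_EPOCHS = 50         # hypothetical target, not from the original run

seconds_per_epoch = parse_hms(MEASURED_TIME) / MEASURED_EPOCHS
estimate = format_hms(seconds_per_epoch * PLANNED_EPOCHS)

print("time per epoch:", format_hms(seconds_per_epoch))   # 0:19:59
print(f"estimated time for {PLANNED_EPOCHS} epochs:", estimate)  # 16:39:10
```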
    IoU metric: bbox
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.393
     Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.766
     Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.370
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.055
     Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
     Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.615
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.201
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.448
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.500
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.194
     Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.519
     Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.706
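To help read the metrics above, here is a minimal sketch of the two quantities they are built on: the IoU between a predicted and a ground-truth box, and the COCO size buckets (small, medium, large). The function names and example boxes are hypothetical, and box area stands in for the annotation area as a simplification; the real evaluation is done by the COCO evaluation tools.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def size_bucket(box):
    """COCO size bucket: small < 32^2 <= medium < 96^2 <= large (in pixels)."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# IoU=0.50:0.95 means AP is averaged over these ten thresholds.
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]

predicted = (0, 0, 100, 100)      # hypothetical detection
ground_truth = (10, 10, 100, 100)  # hypothetical annotation

iou = box_iou(predicted, ground_truth)
passed = [t for t in IOU_THRESHOLDS if iou >= t]
print(f"IoU = {iou:.2f}")                                  # IoU = 0.81
print(f"counts as a match at {len(passed)} of 10 thresholds")  # 7 of 10
print("ground-truth size bucket:", size_bucket(ground_truth))  # medium
```

A detection that is only slightly misaligned already fails the strictest thresholds, which is why AP at IoU=0.50:0.95 is much lower than AP at IoU=0.50.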
For training images:

    T.RandomHorizontalFlip(),
    T.RandomSelect(
        # Option 1: resize only
        T.RandomResize(scales, max_size=800),
        # Option 2: resize, take a random crop, then resize again
        T.Compose([
            T.RandomResize([400, 500, 600]),
            T.RandomSizeCrop(384, 600),
            T.RandomResize(scales, max_size=800),
        ])
    ),

For validation images:

    T.RandomResize([800], max_size=800)
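The training pipeline above picks one of two augmentation branches for each image. A minimal sketch of that selection logic follows, using plain Python stand-ins instead of the real `T.*` transforms. `RandomSelectSketch` and the two branch functions are hypothetical names, and the default probability of 0.5 is an assumption.

```python
import random


class RandomSelectSketch:
    """Apply `first` with probability p, otherwise apply `second`."""

    def __init__(self, first, second, p=0.5):
        self.first = first
        self.second = second
        self.p = p

    def __call__(self, steps):
        if random.random() < self.p:
            return self.first(steps)
        return self.second(steps)


def resize_only(steps):
    """Stand-in for the single RandomResize branch."""
    return steps + ["resize"]


def crop_then_resize(steps):
    """Stand-in for the resize, random crop, resize branch."""
    return steps + ["resize", "crop", "resize"]


augment = RandomSelectSketch(resize_only, crop_then_resize)

# Count how often each branch is taken over many simulated images.
random.seed(0)
branch_counts = {"resize_only": 0, "crop_then_resize": 0}
for _ in range(1000):
    applied = augment([])
    key = "resize_only" if len(applied) == 1 else "crop_then_resize"
    branch_counts[key] += 1
print(branch_counts)
```

With equal probabilities, roughly half of the training images see the extra random crop, which exposes the model to faces at more scales.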
DETR Tutorial by thedeepreader
Training DETR on your own dataset by Oliver Gyldenberg Hjermitslev