This BVLC Caffe fork contains code for deployment of action detection and action recognition models for the smart classroom use case. You can define your own list of possible actions (see the annotation file format and the model training steps below for how to change the list of actions), but this repository shows an example with 3 action classes: standing, sitting and raising hand.
We assume the following data structure:
|-- data_root
    |-- images
        |-- video_1
            frame_000000.png
            frame_000001.png
        |-- video_2
            frame_000000.png
            frame_000001.png
    |-- annotation
        annotation_file_1.xml
        annotation_file_2.xml
    train_tasks.txt
Each annotation file (see the annotation format below) describes a single source of images (see the image source format below).
For annotation it is better to use the CVAT utility, so we assume that the annotation file is stored in the corresponding .xml format. In the annotation file there is a single independent track for each person in the video, which consists of a bounding box description for each frame. General structure of the annotation file:
|-- root
    |-- track_0
        bounding_box_0
        bounding_box_1
    |-- track_1
        bounding_box_0
        bounding_box_1
A toy example of an annotation file:
<?xml version="1.0" encoding="utf-8"?>
<annotations count="1">
  <track id="0" label="person">
    <box frame="0" xtl="1.0" ytl="1.0" xbr="0.0" ybr="0.0" occluded="0">
      <attribute name="action">action_name</attribute>
    </box>
  </track>
</annotations>
where the fields have the following meaning:
- `count` - number of tracks
- `id` - unique ID of the track in the file
- `label` - label of the track (the data loader skips all labels except `person`)
- `frame` - unique ID of the frame in the track
- `xtl`, `ytl`, `xbr`, `ybr` - bounding box coordinates of the top-left and bottom-right corners
- `occluded` - marker to highlight heavily occluded bounding boxes (can be skipped during training)
- `name` - name of the bounding box attribute (the data loader is sensitive to the `action` attribute only)
- `action_name` - valid name of an action (you can define your own list of actions)
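For illustration only, a minimal sketch of how such a file could be read with Python's standard `xml.etree.ElementTree` is shown below; this is not the repository's data loader, just a demonstration of the format described above.

```python
import xml.etree.ElementTree as ET

def parse_annotation(xml_path):
    """Collect per-track bounding boxes and action labels from a CVAT-style file."""
    tracks = []
    root = ET.parse(xml_path).getroot()
    for track in root.iter('track'):
        if track.get('label') != 'person':
            continue  # only 'person' tracks are of interest
        boxes = []
        for box in track.iter('box'):
            action = None
            for attr in box.iter('attribute'):
                if attr.get('name') == 'action':
                    action = attr.text  # e.g. 'standing', 'sitting', 'raising_hand'
            boxes.append({
                'frame': int(box.get('frame')),
                'xtl': float(box.get('xtl')),
                'ytl': float(box.get('ytl')),
                'xbr': float(box.get('xbr')),
                'ybr': float(box.get('ybr')),
                'occluded': box.get('occluded') == '1',
                'action': action,
            })
        tracks.append({'id': int(track.get('id')), 'boxes': boxes})
    return tracks
```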
Our data loader implementation works with independent images stored on disk. Each image should be named in the format `frame_xxxxxx.png` or `frame_xxxxxx.jpg` (where `xxxxxx` is a unique image number).
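The exact tool used to extract frames is up to you; as an illustration of this naming scheme only, a hypothetical helper based on the `opencv-python` package could look like this:

```python
import os
import cv2  # assumes the opencv-python package is installed

def dump_frames(video_path, out_dir):
    """Save video frames as frame_000000.png, frame_000001.png, ..."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of video
        cv2.imwrite(os.path.join(out_dir, 'frame_{:06d}.png'.format(index)), frame)
        index += 1
    capture.release()
```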
For more robust control of image sources we have created a separate task file where each row represents a single source in the following format: `annotation_file_path.xml image_height,image_width images_directory_path`. We assume that all images from the same source are resized to the `image_height,image_width` size (this is needed to decode the annotations properly).
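For example, a task file describing the two toy sources above might look like this (the image sizes and paths here are placeholders, not values required by the repository):

```
data_root/annotation/annotation_file_1.xml 1080,1920 data_root/images/video_1
data_root/annotation/annotation_file_2.xml 1080,1920 data_root/images/video_2
```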
The training procedure for the action detection & recognition model consists of the following consecutive stages:
- Person Detection training (PD)
- Action Recognition training (AR)
- Action Recognition model evaluation
- Conversion to MO-compatible format
Note 1: To significantly speed up training you can initialize your model from our distributed `.caffemodel` snapshots:
- `$CAFFE_ROOT/models/person_detection_action_recognition/stage1__person_detector/person_detection_0022.caffemodel` - for training the Person Detection model
- `$CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/action_detection_0005.caffemodel` - for training the Action Detection model
Note 2: if you want to change the list of supported actions, follow the next steps:
- Change the `ACTION_NAMES_MAP` dictionary in the data layer `$CAFFE_ROOT/python/custom_layers/actions_data_layer.py` according to your list of actions (see the sketch after this list).
- Replace the fields `valid_action_ids` and `valid_class_ids` with the list of your valid action IDs for the following layers in the `$CAFFE_ROOT/models/person_detection_action_recognition/stage1__person_detector/train.prototxt` and `$CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/train.prototxt` files:
  - `data`
  - `detection_matcher`
  - `action/split_loss`
  - `action/glob_push_loss`
  - `action/local_push_loss`
  - `action/push_loss`
  - `action/center_loss`
  - `samples/center_loss`
  - `samples/extractor`
- Set a proper number of classes for each row marked with `# num_actions` in the following files:
  - `$CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/train.prototxt`
  - `$CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/deploy.prototxt`
  - `$CAFFE_ROOT/models/person_detection_action_recognition/stage3__convert_for_inference/ie_conversion.prototxt`
  - `$CAFFE_ROOT/models/person_detection_action_recognition/stage3__convert_for_inference/inference.prototxt`
- If you have annotated bounding boxes which must be ignored during training, set a new `class_id` for them equal to `num_valid_classes + 1` and add this `class_id` to `valid_action_ids` in the following layers:
  - `data`
  - `detection_matcher`
  - `action/glob_push_loss`
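As an illustration of the first step above, a three-action `ACTION_NAMES_MAP` might look like the sketch below; the exact keys and ID assignment expected by `actions_data_layer.py` may differ, so treat this only as an example of the mapping shape:

```python
# Hypothetical example: maps the 'action' attribute value from the CVAT
# annotation to the integer class ID consumed by the data layer.
ACTION_NAMES_MAP = {
    'standing': 0,
    'sitting': 1,
    'raising_hand': 2,
}
```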
At the first stage you should train the SSD-based person (two-class) detector. To do this, run the single-GPU training procedure (the Python layers do not allow running on multiple GPUs), specifying `GPU_ID`:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python $CAFFE_ROOT/build/tools/caffe train --gpu=GPU_ID --solver=$CAFFE_ROOT/models/person_detection_action_recognition/stage1__person_detector/solver.prototxt
If needed, the model evaluation can be performed with the default pipeline from the original SSD repository. Moreover, the training of the PD model can be carried out in the original SSD environment without any changes, and afterwards the weights of the trained model can be used as an initialization point for the next stage.
Note: to get a more accurate model it is recommended to pre-train the backbone on standard classification or detection datasets.
At the next stage we train the Action Recognition (AR) model, which reuses detections from the Person Detection (PD) part of the model and assigns an action label to each of them. To do this, follow these steps:
- To initialize the AR part of the model, copy the weights from the twin PD branch:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python python2 $CAFFE_ROOT/python/rename_layers.py -i pd_weights_path.caffemodel -o ar_init_weights_path.caffemodel -p "cl/"
where `pd_weights_path.caffemodel` is the weights file of the trained PD model (see the previous section) and `ar_init_weights_path.caffemodel` is the output path for the AR model initialization weights.
- Run the single-GPU training procedure:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python $CAFFE_ROOT/build/tools/caffe train --gpu=GPU_ID --solver=$CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/solver.prototxt --weights="pd_weights_path.caffemodel,ar_init_weights_path.caffemodel"
To evaluate the quality of the trained Action Recognition model on your test data you can use the provided scripts. To do this you need a file with testing tasks in the same format as for the training stage (see the section above). The model can be evaluated in two modes:
- Frame-independent evaluation:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python python2 $CAFFE_ROOT/python/action_metrics.py -t path_to_testing_tasks.txt -p $CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/deploy.prototxt -w action_detection_best_snapshot_path.caffemodel
- Event-based evaluation:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python python2 $CAFFE_ROOT/python/action_event_metrics.py -t path_to_testing_tasks.txt -p $CAFFE_ROOT/models/person_detection_action_recognition/stage2__action_regognition/deploy.prototxt -w action_detection_best_snapshot_path.caffemodel
- Run the script to convert the weights into the MO-compatible format:
PYTHONPATH=$PYTHONPATH:$CAFFE_ROOT/python python2 $CAFFE_ROOT/python/convert_to_ie_compatible.py -m $CAFFE_ROOT/models/person_detection_action_recognition/stage3__convert_for_inference/ie_conversion.prototxt -w "action_detection_snapshot_path.caffemodel" -o "action_detection_inference_path.caffemodel"
where `action_detection_snapshot_path.caffemodel` is a selected snapshot from the AR training procedure and `action_detection_inference_path.caffemodel` is the path to save the final model parameters.
- Run the Model Optimizer:
python3 mo_caffe.py -m $CAFFE_ROOT/models/person_detection_action_recognition/stage3__convert_for_inference/inference.prototxt -w action_detection_inference_path.caffemodel -n RMNet_Action_Detection