TSN

Introduction

@inproceedings{wang2016temporal,
  title={Temporal segment networks: Towards good practices for deep action recognition},
  author={Wang, Limin and Xiong, Yuanjun and Wang, Zhe and Qiao, Yu and Lin, Dahua and Tang, Xiaoou and Van Gool, Luc},
  booktitle={European conference on computer vision},
  pages={20--36},
  year={2016},
  organization={Springer}
}

Model Zoo

UCF-101

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_75e_ucf101_rgb [1]	8	ResNet50	ImageNet	83.03	96.78	8332	ckpt	log	json

[1] The results reported here are for split 1 of UCF-101.

Diving48

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_diving48_rgb	8	ResNet50	ImageNet	71.27	95.74	5699	ckpt	log	json
tsn_r50_video_1x1x16_100e_diving48_rgb	8	ResNet50	ImageNet	76.75	96.95	5705	ckpt	log	json

HMDB51

Config	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_imagenet_rgb	8	ResNet50	ImageNet	48.95	80.19	21535	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_kinetics400_rgb	8	ResNet50	Kinetics400	56.08	84.31	21535	ckpt	log	json
tsn_r50_1x1x8_50e_hmdb51_mit_rgb	8	ResNet50	Moments	54.25	83.86	21535	ckpt	log	json

Kinetics-400

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	Inference speed (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	70.60	89.26	x	x	4.3 (25x10 frames)	8344	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.42	89.03	x	x	x	8343	ckpt	log	json
tsn_r50_dense_1x1x5_50e_kinetics400_rgb	340x256	8x3	ResNet50	ImageNet	70.18	89.10	69.15	88.56	12.7 (8x10 frames)	7028	ckpt	log	json
tsn_r50_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	ResNet50	ImageNet	70.91	89.51	x	x	10.7 (25x3 frames)	8344	ckpt	log	json
tsn_r50_320p_1x1x3_110e_kinetics400_flow	short-side 320	8x2	ResNet50	ImageNet	55.70	79.85	x	x	x	8471	ckpt	log	json
tsn_r50_320p_1x1x3_kinetics400_twostream [1: 1]*	x	x	ResNet50	ImageNet	72.76	90.52	x	x	x	x	x	x	x
tsn_r50_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	71.80	90.17	x	x	x	8343	ckpt	log	json
tsn_r50_320p_1x1x8_100e_kinetics400_rgb	short-side 320	8x3	ResNet50	ImageNet	72.41	90.55	x	x	11.1 (25x3 frames)	8344	ckpt	log	json
tsn_r50_320p_1x1x8_110e_kinetics400_flow	short-side 320	8x4	ResNet50	ImageNet	57.76	80.99	x	x	x	8473	ckpt	log	json
tsn_r50_320p_1x1x8_kinetics400_twostream [1: 1]*	x	x	ResNet50	ImageNet	74.64	91.77	x	x	x	x	x	x	x
tsn_r50_video_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8	ResNet50	ImageNet	71.11	90.04	x	x	x	8343	ckpt	log	json
tsn_r50_dense_1x1x8_100e_kinetics400_rgb	340x256	8	ResNet50	ImageNet	70.77	89.3	68.75	88.42	12.2 (8x10 frames)	8344	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	71.79	90.25	x	x	x	21558	ckpt	log	json
tsn_r50_video_dense_1x1x8_100e_kinetics400_rgb	short-side 256	8	ResNet50	ImageNet	70.40	89.12	x	x	x	21553	ckpt	log	json

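* Here, [1: 1] means that the scores of the RGB and optical-flow branches are fused with a 1:1 ratio to produce the two-stream prediction (softmax is not applied before fusion).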
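The fusion described in the note above can be sketched as a weighted sum of raw class scores. This is a minimal illustration, not MMAction2's own fusion code: the function name `fuse_two_stream`, its parameters, and the toy score arrays are all assumptions made for the example.

```python
import numpy as np


def fuse_two_stream(rgb_scores, flow_scores, rgb_weight=1.0, flow_weight=1.0):
    """Weighted sum of raw (pre-softmax) class scores from the two branches.

    Both inputs are arrays of shape (num_videos, num_classes).
    The default weights correspond to the [1: 1] setting.
    """
    rgb = np.asarray(rgb_scores, dtype=np.float64)
    flow = np.asarray(flow_scores, dtype=np.float64)
    if rgb.shape != flow.shape:
        raise ValueError("score arrays must have the same shape")
    return rgb_weight * rgb + flow_weight * flow


# Toy scores for 2 videos and 3 classes (made-up numbers).
rgb = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])
flow = np.array([[0.5, 1.0, 0.0], [0.0, 0.1, 0.9]])

fused = fuse_two_stream(rgb, flow)
pred = fused.argmax(axis=1)  # predicted class index per video
print(pred.tolist())
```

In the second toy video the RGB branch alone would pick class 1, while the fused scores pick class 2, which shows how the flow branch can change the final prediction.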

Using third-party backbones in TSN

Users can train TSN within the MMAction2 framework using third-party backbones, for example:

Backbones from MMClassification
Backbones from TorchVision
Backbones from pytorch-image-models (timm)

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	ckpt	log	json
tsn_rn101_32x4d_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	ResNeXt101-32x4d [MMCls]	ImageNet	73.43	91.01	ckpt	log	json
tsn_dense161_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8x2	Densenet-161 [TorchVision]	ImageNet	72.78	90.75	ckpt	log	json
tsn_swin_transformer_video_320p_1x1x3_100e_kinetics400_rgb	short-side 320	8	Swin Transformer Base [timm]	ImageNet	77.51	92.92	ckpt	log	json

For various reasons, some models in timm are not supported. See PR #880 for details.
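As a rough idea of what such a config might look like, the fragment below sketches a TSN recognizer with a TorchVision DenseNet-161 backbone. It is an assumption-laden sketch, not a copy of the released config: the `torchvision.` type prefix, the head settings, and the dropout value are written from memory of MMAction2's conventions and should be checked against the actual config file named in the table. Only the 2208-channel feature width of DenseNet-161 is a property of the architecture itself.

```python
# Hypothetical config fragment (plain Python dicts, as in MMAction2 configs).
# Key names and values are assumptions for illustration only.
model = dict(
    type='Recognizer2D',
    # Assumed convention: a library prefix selects the third-party backbone.
    backbone=dict(type='torchvision.densenet161', pretrained=True),
    cls_head=dict(
        type='TSNHead',
        num_classes=400,      # Kinetics-400
        in_channels=2208,     # feature width of DenseNet-161
        spatial_type='avg',
        consensus=dict(type='AvgConsensus', dim=1),
        dropout_ratio=0.4,    # assumed value
        init_std=0.01),       # assumed value
)
```

The main thing to adapt when swapping backbones is `in_channels` in the head, which has to match the feature width of the chosen network.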

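Kinetics-400 Data Benchmark (8 GPUs, ResNet50, ImageNet pretrain; 3 segments)

The data benchmark compares:

Different data preprocessing methods: (1) videos resized to 340x256, (2) videos resized to short-side 320px, (3) videos resized to short-side 256px;
Different data augmentation methods: (1) MultiScaleCrop, (2) RandomResizedCrop;
Different test protocols: (1) 25 frames x 10 crops, (2) 25 frames x 3 crops.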

Config	Resolution	Training augmentation	Test protocol	Top-1 acc	Top-5 acc	ckpt	log	json
tsn_r50_multiscalecrop_340x256_1x1x3_100e_kinetics400_rgb	340x256	MultiScaleCrop	25x10 frames	70.60	89.26	ckpt	log	json
x	340x256	MultiScaleCrop	25x3 frames	70.52	89.39	x	x	x
tsn_r50_randomresizedcrop_340x256_1x1x3_100e_kinetics400_rgb	340x256	RandomResizedCrop	25x10 frames	70.11	89.01	ckpt	log	json
x	340x256	RandomResizedCrop	25x3 frames	69.95	89.02	x	x	x
tsn_r50_multiscalecrop_320p_1x1x3_100e_kinetics400_rgb	short-side 320	MultiScaleCrop	25x10 frames	70.32	89.25	ckpt	log	json
x	short-side 320	MultiScaleCrop	25x3 frames	70.54	89.39	x	x	x
tsn_r50_randomresizedcrop_320p_1x1x3_100e_kinetics400_rgb	short-side 320	RandomResizedCrop	25x10 frames	70.44	89.23	ckpt	log	json
x	short-side 320	RandomResizedCrop	25x3 frames	70.91	89.51	x	x	x
tsn_r50_multiscalecrop_256p_1x1x3_100e_kinetics400_rgb	short-side 256	MultiScaleCrop	25x10 frames	70.42	89.03	ckpt	log	json
x	short-side 256	MultiScaleCrop	25x3 frames	70.79	89.42	x	x	x
tsn_r50_randomresizedcrop_256p_1x1x3_100e_kinetics400_rgb	short-side 256	RandomResizedCrop	25x10 frames	69.80	89.06	ckpt	log	json
x	short-side 256	RandomResizedCrop	25x3 frames	70.48	89.89	x	x	x
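A test protocol such as 25 frames x 3 crops produces 75 views per video, and the per-view predictions have to be combined into one prediction per video. The sketch below assumes the views are combined by simple averaging of class scores, which is a common choice but is not stated in this document. The function name `aggregate_views` and the random input are hypothetical.

```python
import numpy as np


def aggregate_views(view_scores, num_frames, num_crops):
    """Average class scores over all (frame, crop) views of each video.

    view_scores has shape (num_videos * num_frames * num_crops, num_classes),
    with all views of one video stored contiguously.
    """
    scores = np.asarray(view_scores, dtype=np.float64)
    views_per_video = num_frames * num_crops
    if scores.shape[0] % views_per_video != 0:
        raise ValueError("number of rows is not a multiple of views per video")
    num_classes = scores.shape[1]
    return scores.reshape(-1, views_per_video, num_classes).mean(axis=1)


# 2 videos, 25 frames x 3 crops, 4 classes (random stand-in scores).
rng = np.random.default_rng(0)
views = rng.normal(size=(2 * 25 * 3, 4))
video_scores = aggregate_views(views, num_frames=25, num_crops=3)
print(video_scores.shape)
```

With 25 frames x 10 crops the same function would average 250 views per video, which is why that protocol costs more at inference time.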

Kinetics-400 OmniSource Experiments

Config	Resolution	Backbone	Pretrain	w. OmniSource	Top-1 acc	Top-5 acc	Inference speed (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x3_100e_kinetics400_rgb	340x256	ResNet50	ImageNet	❌	70.6	89.3	4.3 (25x10 frames)	8344	ckpt	log	json
x	340x256	ResNet50	ImageNet	✔️	73.6	91.0	x	8344	ckpt	x	x
x	short-side 320	ResNet50	IG-1B [1]	❌	73.1	90.4	x	8344	ckpt	x	x
x	short-side 320	ResNet50	IG-1B [1]	✔️	75.7	91.9	x	8344	ckpt	x	x

[1] MMAction2 uses the resnet50_swsl pretrained model provided by torch-hub.

Kinetics-600

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference speed (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics600_rgb	short-side 256	8x2	ResNet50	ImageNet	74.8	92.3	11.1 (25x3 frames)	8344	ckpt	log	json

Kinetics-700

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Inference speed (video/s)	GPU memory (M)	ckpt	log	json
tsn_r50_video_1x1x8_100e_kinetics700_rgb	short-side 256	8x2	ResNet50	ImageNet	61.7	83.6	11.1 (25x3 frames)	8344	ckpt	log	json

Something-Something V1

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	18.55	44.80	17.53	44.29	10978	ckpt	log	json
tsn_r50_1x1x16_50e_sthv1_rgb	height 100	8	ResNet50	ImageNet	15.77	39.85	13.33	35.58	5691	ckpt	log	json

Something-Something V2

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	Reference top-1 acc	Reference top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x8_50e_sthv2_rgb	height 240	8	ResNet50	ImageNet	32.97	63.62	30.56	58.49	10966	ckpt	log	json
tsn_r50_1x1x16_50e_sthv2_rgb	height 240	8	ResNet50	ImageNet	27.21	55.84	21.91	46.87	8337	ckpt	log	json

Moments in Time

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_1x1x6_100e_mit_rgb	short-side 256	8x2	ResNet50	ImageNet	26.84	51.6	8339	ckpt	log	json

Multi-Moments in Time

Config	Resolution	GPUs	Backbone	Pretrain	mAP	GPU memory (M)	ckpt	log	json
tsn_r101_1x1x5_50e_mmit_rgb	short-side 256	8x2	ResNet101	ImageNet	61.09	10467	ckpt	log	json

ActivityNet v1.3

Config	Resolution	GPUs	Backbone	Pretrain	Top-1 acc	Top-5 acc	GPU memory (M)	ckpt	log	json
tsn_r50_320p_1x1x8_50e_activitynet_video_rgb	short-side 320	8x1	ResNet50	Kinetics400	73.93	93.44	5692	ckpt	log	json
tsn_r50_320p_1x1x8_50e_activitynet_clip_rgb	short-side 320	8x1	ResNet50	Kinetics400	76.90	94.47	5692	ckpt	log	json
tsn_r50_320p_1x1x8_150e_activitynet_video_flow	340x256	8x2	ResNet50	Kinetics400	57.51	83.02	5780	ckpt	log	json
tsn_r50_320p_1x1x8_150e_activitynet_clip_flow	340x256	8x2	ResNet50	Kinetics400	59.51	82.69	5780	ckpt	log	json

HVU

Config[1]	Tag category	Resolution	GPUs	Backbone	Pretrain	mAP	HATNet[2]	HATNet-multi[2]	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_action_rgb	action	short-side 256	8x2	ResNet18	ImageNet	57.5	51.8	53.5	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_scene_rgb	scene	short-side 256	8	ResNet18	ImageNet	55.2	55.8	57.2	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_object_rgb	object	short-side 256	8	ResNet18	ImageNet	45.7	34.2	35.1	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_event_rgb	event	short-side 256	8	ResNet18	ImageNet	63.7	38.5	39.8	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_concept_rgb	concept	short-side 256	8	ResNet18	ImageNet	47.5	26.1	27.3	ckpt	log	json
tsn_r18_1x1x8_100e_hvu_attribute_rgb	attribute	short-side 256	8	ResNet18	ImageNet	46.1	33.6	34.9	ckpt	log	json
-	all tags	short-side 256	-	ResNet18	ImageNet	52.6	40.0	41.3	-	-	-

[1] For simplicity, MMAction2 trains a separate model for each tag category as the HVU baseline.

[2] The HATNet and HATNet-multi results are taken from the paper Large Scale Holistic Video Understanding. HATNet is a two-branch convolutional network (one 2D branch, one 3D branch) and uses the same backbone as MMAction2 (ResNet18). Its inputs are video clips of 16 or 32 frames, which are longer than the clips MMAction2 uses, at a coarser input resolution (112px instead of 224px). HATNet is trained on each task separately (one task per tag category), while HATNet-multi is trained on multiple tasks jointly. Since no code or models have been released for HATNet, only the accuracy reported in the original paper is listed here.

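Notes:

GPUs refers to the number of GPUs used to obtain the checkpoint. By default, the configs provided by MMAction2 are written for training with 8 GPUs. Following the linear scaling rule, if you use a different number of GPUs or a different number of videos per GPU, scale the learning rate in proportion to the total batch size, e.g., lr=0.01 for 4 GPUs x 2 video/gpu and lr=0.08 for 16 GPUs x 4 video/gpu.
Inference speed is measured with the benchmark script using the test-time frame sampling strategy. It covers model inference only and excludes IO and preprocessing time. For each config, MMAction2 measures it on 1 GPU with a batch size (videos per GPU) of 1.
The reference results were obtained by training on the original codebase with the same model settings.
The Kinetics400 validation set used here contains 19796 videos, which can be downloaded from the validation-set video link. The corresponding data list (each line has the format: video ID, number of frames, class index) and label map (class index to class name) are also provided.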
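The linear scaling rule from the first note can be written out as a small helper. This is only a sketch of the arithmetic. The function name `scale_lr` and its parameter names are hypothetical and are not part of MMAction2.

```python
def scale_lr(base_lr, base_gpus, base_videos_per_gpu, gpus, videos_per_gpu):
    """Scale the learning rate in proportion to the total batch size."""
    base_batch = base_gpus * base_videos_per_gpu
    new_batch = gpus * videos_per_gpu
    return base_lr * new_batch / base_batch


# The pair of settings given in the note:
# lr=0.01 at 4 GPUs x 2 video/gpu (batch 8) -> 16 GPUs x 4 video/gpu (batch 64)
lr = scale_lr(0.01, base_gpus=4, base_videos_per_gpu=2, gpus=16, videos_per_gpu=4)
print(round(lr, 6))
```

The batch size grows by a factor of 8 between the two settings, so the learning rate grows by the same factor, matching the lr=0.08 quoted in the note.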

For details on dataset preparation, see:

Preparing ucf101
Preparing kinetics
Preparing sthv1
Preparing sthv2
Preparing mit
Preparing mmit
Preparing hvu
Preparing hmdb51

Train

Use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the TSN model on Kinetics-400 deterministically, with periodic validation.

python tools/train.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
    --work-dir work_dirs/tsn_r50_1x1x3_100e_kinetics400_rgb \
    --validate --seed 0 --deterministic

For more training details, see the Training setting section of the getting-started guide.

Test

Use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the TSN model on Kinetics-400 and dump the results to a json file.

python tools/test.py configs/recognition/tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --eval top_k_accuracy mean_class_accuracy \
    --out result.json

For more testing details, see the Test a dataset section of the getting-started guide.
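For readers who want to see what the top-1 and top-5 numbers in the tables measure, here is a minimal sketch of top-k accuracy computed from per-video class scores. It is not the implementation behind `--eval top_k_accuracy`, and it assumes the scores and ground-truth labels have already been loaded into arrays (the layout of the dumped result file is not described in this document). The function name and toy data are hypothetical.

```python
import numpy as np


def top_k_accuracy(scores, labels, k=1):
    """Fraction of videos whose true label is among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the k is irrelevant).
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: 3 videos, 3 classes (made-up scores).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.2, 0.3],
          [0.2, 0.3, 0.5]]
labels = [1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```

In the toy data the second video is misclassified at top-1 but its true class is the second-highest score, so it counts as correct at k=2.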
