Performance

Benchmark

Feb 17, 2016: ThinkPad X240 (Intel i7-4600U); Debian stretch, kernel 4.4.1, gcc 5.3.1
Configuration	Depth (min, 5%, median, 95%, max, mean, std)	RGB (min, 5%, median, 95%, max, mean, std)	Thread per core usage
CPU/TurboJPEG	211.717 222.087 233.171 256.851 304.558 mean=234.497 std=12.0616	15.7237 15.8093 16.5118 20.6042 37.9908 mean=17.2682 std=1.97223	CPU:95% TurboJPEG:50% USB:10% Reg:3%
OpenGL/TurboJPEG	14.2609 14.8663 21.6813 23.0952 37.1771 mean=20.4175 std=2.95671	15.2525 16.8032 19.4003 22.8167 41.6874 mean=19.4453 std=2.10631	OpenGL:17% TurboJPEG:60% USB:20% Reg:16%
Intel-OpenCL/VAAPI	12.9236 13.5946 14.1522 16.4632 29.1926 mean=14.4144 std=1.05776	4.81327 4.8892 4.99418 5.45149 11.5202 mean=5.08095 std=0.298308	OpenCL:6% VAAPI:3% USB:15% Reg:15%
Feb 17, 2016: Jetson TK1 (ARMs, Tegra K1); Ubuntu 14.04, kernel 3.10.40, gcc 4.8.4, CUDA 6.5
Configuration	Depth (min, 5%, median, 95%, max, mean, std)	RGB (min, 5%, median, 95%, max, mean, std)	Thread per core usage
CPU/TurboJPEG	1196.93 1225.1 1232.61 1319.89 1356.19 mean=1242.61 std=30.2808	31.6025 38.3982 38.6873 43.1643 55.0731 mean=39.4813 std=2.26751	CPU:98% TurboJPEG:60% USB:36% Reg:3%
OpenGL/TurboJPEG	17.2772 20.1502 21.9076 23.6485 49.9671 mean=21.702 std=1.41032	41.5806 46.2578 47.2497 50.1534 59.9174 mean=47.5074 std=1.48139	OpenGL:47% TurboJPEG:65% USB:60% Reg:64%
CUDA/TegraJPEG	9.59201 10.1711 10.7408 11.5411 20.2425 mean=10.8238 std=0.529091	11.8931 11.962 12.1092 12.3543 20.1383 mean=12.256 std=0.846912	CUDA:4% TegraJPEG:4% USB:59% Reg:76%
Feb 17, 2016: ThinkPad W540 (Intel i7-4800MQ, Nvidia Quadro K2100M); Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.4, CUDA 7.5
Configuration	Depth (min, 5%, median, 95%, max, mean, std)	RGB (min, 5%, median, 95%, max, mean, std)	Thread per core usage
CPU/TurboJPEG	177.368 178.725 184.239 232.401 237.202 mean=192.605 std=17.612	13.5816 13.9677 14.6258 22.746 24.4688 mean=15.3834 std=2.26277	CPU:91% TurboJPEG:45% USB:7% Reg:2%
OpenGL/TurboJPEG	8.55666 13.9514 16.1583 18.1522 26.7371 mean=16.1974 std=1.87477	13.583 13.6906 14.8395 16.6675 24.4041 mean=14.887 std=1.2095	OpenGL:9% TurboJPEG:45% USB:9% Reg:12%
Intel-OpenCL/VAAPI	9.70148 10.455 11.9606 16.42 23.1066 mean=12.6258 std=2.05783	4.03962 4.10536 4.64751 6.73393 10.8338 mean=4.99849 std=0.907321	OpenCL:4% VAAPI:2% USB:9% Reg:13%
CUDA/VAAPI	3.81637 4.03557 4.06498 4.10873 7.7775 mean=4.07313 std=0.101962	4.04017 4.09998 4.5888 8.64204 16.0589 mean=5.15824 std=1.53683	CUDA:15% VAAPI:2% USB:9% Reg:15%
Feb 17, 2016: ThinkPad W540 (Intel i7-4800MQ, Nvidia Quadro K2100M); Windows 8.1, Visual Studio 2015, Intel OpenCL SDK 2016
Configuration	Depth (min, 5%, median, 95%, max, mean, std)	RGB (min, 5%, median, 95%, max, mean, std)	Thread per core usage
Intel-OpenGL/TurboJPEG	12.2028 12.5374 12.9779 15.0185 172.046 mean=13.3322 std=3.61579	14.6684 16.0605 16.3927 18.4994 31.9244 mean=16.8039 std=1.25351	(No usage stats)
Nvidia-OpenGL/TurboJPEG	5.16722 5.38771 11.4718 11.7794 163.966 mean=9.50424 std=3.86527	14.4331 14.5297 14.6954 15.0413 26.9313 mean=14.7962 std=0.723257	N/A
CUDA/TurboJPEG	(VS2015 not supported by CUDA 7.5)	N/A	N/A
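
The seven figures reported per stream in the tables above (min, 5%, median, 95%, max, mean, std) can be reproduced from a list of raw per-frame timings. The following is a minimal sketch, not necessarily how the numbers above were produced: the function names `percentile` and `summarize` are hypothetical, the percentiles use the nearest-rank method, and the standard deviation is the population form. The page does not say which variants were used.

```python
import statistics

def percentile(sorted_samples, pct):
    """Nearest-rank percentile (integer pct) of a sorted, non-empty list."""
    n = len(sorted_samples)
    # Integer ceiling of pct * n / 100, clamped so the rank is at least 1.
    rank = max(1, (pct * n + 99) // 100)
    return sorted_samples[rank - 1]

def summarize(samples):
    """Return the seven summary figures used in the benchmark tables."""
    s = sorted(samples)
    return {
        "min": s[0],
        "5%": percentile(s, 5),
        "median": statistics.median(s),
        "95%": percentile(s, 95),
        "max": s[-1],
        "mean": statistics.fmean(s),
        # Population standard deviation; an assumption, see the note above.
        "std": statistics.pstdev(s),
    }
```

Comparing the 95% and max columns is a quick way to spot outliers such as the 172.046 maximum in the Windows Intel-OpenGL row, which barely moves the median.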

Benchmarking setup:

Report the CPU and GPU models.
Report the OS version (include the kernel version on Linux), the compiler version, and the API versions (OpenGL/CUDA/OpenCL, if you can find them).
Report the date of testing.
Build with -DENABLE_CXX11=ON
Set the environment variable: export LIBFREENECT2_DETAILED_TIMING=1
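
To keep result headers consistent with the ones in the tables above, the reported facts can be assembled by a small helper. This is only a sketch: `report_header` is a hypothetical name, the machine, distribution, and compiler strings are supplied by hand, and `platform.release()` is assumed to give a usable kernel version on Linux (it may include a distribution suffix).

```python
import datetime
import platform

def report_header(machine, distro, compiler, date=None, kernel=None):
    """Format a header line in the style used above the benchmark tables."""
    date = date or datetime.date.today()
    # On Linux, platform.release() returns the running kernel release.
    kernel = kernel or platform.release()
    stamp = f"{date:%b} {date.day}, {date.year}"
    return f"{stamp}: {machine}; {distro}, kernel {kernel}, {compiler}"
```

Calling `report_header("ThinkPad X240 (Intel i7-4600U)", "Debian stretch", "gcc 5.3.1")` on the test machine fills in today's date and the running kernel; API versions such as CUDA can be appended by hand.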

Benchmarking cases:

Linux

Also run top -d1 and read off per-thread usage by eye (press H for thread view, V for tree view, and I to toggle Irix mode so that usage is reported per core).

CPU/TurboJPEG LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer cpu - The pure software pipeline
OpenGL/TurboJPEG LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer gl - The OpenGL compatibility pipeline
Intel-OpenCL/VAAPI ./bin/Protonect -noviewer cl - The full Intel pipeline
CUDA-OpenCL/VAAPI ./bin/Protonect -noviewer cl - The how-poorly-does-Nvidia-support-OpenCL pipeline
CUDA/VAAPI ./bin/Protonect -noviewer cuda - The Nvidia and VAAPI mixed pipeline
CUDA/TegraJPEG ./bin/Protonect -noviewer cuda - The Jetson TK1 pipeline
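
The Linux cases above differ only in the pipeline argument and in whether VA-API is disabled through LIBVA_DRIVER_NAME=none, so they can be generated from a small table. A sketch under those assumptions follows; `LINUX_CASES` and `benchmark_command` are hypothetical names, and only four of the listed cases are included for brevity.

```python
import os

# Pipeline argument and "disable VA-API" flag for some of the Linux cases.
LINUX_CASES = {
    "CPU/TurboJPEG": ("cpu", True),
    "OpenGL/TurboJPEG": ("gl", True),
    "Intel-OpenCL/VAAPI": ("cl", False),
    "CUDA/VAAPI": ("cuda", False),
}

def benchmark_command(case, protonect="./bin/Protonect", base_env=None):
    """Return (argv, env) for one benchmark case, without running it."""
    pipeline, disable_vaapi = LINUX_CASES[case]
    env = dict(os.environ if base_env is None else base_env)
    # Detailed timing is required by the benchmarking setup.
    env["LIBFREENECT2_DETAILED_TIMING"] = "1"
    if disable_vaapi:
        # Forces the TurboJPEG path by making VA-API unavailable.
        env["LIBVA_DRIVER_NAME"] = "none"
    return [protonect, "-noviewer", pipeline], env
```

The returned pair can be passed straight to `subprocess.run(argv, env=env)` from the build directory. Keeping command construction separate from execution makes it easy to print the commands for the report before running anything.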

Windows

CPU/TurboJPEG .\install\bin\Protonect -noviewer cpu - The pure software pipeline
OpenGL/TurboJPEG .\install\bin\Protonect -noviewer gl - The OpenGL compatibility pipeline
Intel-OpenCL/TurboJPEG .\install\bin\Protonect -noviewer cl - The full Intel pipeline
CUDA/TurboJPEG .\install\bin\Protonect -noviewer cuda - The Nvidia and TurboJPEG mixed pipeline

Mac OS X

CPU/TurboJPEG ./bin/Protonect -noviewer cpu - The pure software pipeline
OpenGL/VT ./bin/Protonect -noviewer gl - The OpenGL compatibility pipeline
OpenCL/VT ./bin/Protonect -noviewer cl - The OpenCL pipeline

If a particular configuration is tested but fails:

If the failure is a known unsolved issue, report it.
If the failure is a solved issue that can be fixed by the user, do not report it.

TODO: Maybe use -gpu=0 or -gpu=1 to select GPU?

This page is a work in progress.

Platform acceleration of JPEG decoding

VA-API (Intel, Linux): Good
Intel Media SDK (Intel, Windows): Possible to implement. The relevant component is mfx_mft_mjpgvd_64.dll (GUID 91CD2D6E-897B-4FA1-B0D7-51DC88010E0A, "Intel Hardware M-JPEG decoder MFT"); it is probably an abstraction over DXVA/D3D11.
VDPAU (Nvidia): No. Does not support JPEG at all.
Tegra: Among all of Nvidia's products, only Tegra has a hardware JPEG decoder. (A separate Tegra libjpeg decoder is being worked on.)
AMD implements its JPEG decoder with OpenCL, but we don't want it to compete with depth decoding for resources. (I evaluated GPUJPEG, and it was not good.)
Samsung's Exynos4 provides a JPEG codec via V4L2, but this is for mobile devices.
I looked at mpv and ffmpeg. They have no hardware acceleration for JPEG at all.
Chromium uses VAAPI and V4L2.
On Mac, a new decoder is provided by @fran6co. (@fran6co: The Mac decoder is not hardware accelerated yet; if they ever decide to do it, my implementation is going to have it.)
