Lingzhu Xiang edited this page Feb 18, 2016 · 48 revisions

Benchmark

Feb 17, 2016: ThinkPad X240 (Intel i7-4600U); Debian stretch, kernel 4.4.1, gcc 5.3.1
Configuration | Depth (min, 5%, median, 95%, max, mean, std) | RGB (min, 5%, median, 95%, max, mean, std) | Thread per core usage
CPU/TurboJPEG | 211.717 222.087 233.171 256.851 304.558 mean=234.497 std=12.0616 | 15.7237 15.8093 16.5118 20.6042 37.9908 mean=17.2682 std=1.97223 | CPU:95% TurboJPEG:50% USB:10% Reg:3%
OpenGL/TurboJPEG | 14.2609 14.8663 21.6813 23.0952 37.1771 mean=20.4175 std=2.95671 | 15.2525 16.8032 19.4003 22.8167 41.6874 mean=19.4453 std=2.10631 | OpenGL:17% TurboJPEG:60% USB:20% Reg:16%
Intel-OpenCL/VAAPI | 12.9236 13.5946 14.1522 16.4632 29.1926 mean=14.4144 std=1.05776 | 4.81327 4.8892 4.99418 5.45149 11.5202 mean=5.08095 std=0.298308 | OpenCL:6% VAAPI:3% USB:15% Reg:15%
Feb 17, 2016: Jetson TK1 (ARMs, Tegra K1); Ubuntu 14.04, kernel 3.10.40, gcc 4.8.4, CUDA 6.5
Configuration | Depth (min, 5%, median, 95%, max, mean, std) | RGB (min, 5%, median, 95%, max, mean, std) | Thread per core usage
CPU/TurboJPEG | 1196.93 1225.1 1232.61 1319.89 1356.19 mean=1242.61 std=30.2808 | 31.6025 38.3982 38.6873 43.1643 55.0731 mean=39.4813 std=2.26751 | CPU:98% TurboJPEG:60% USB:36% Reg:3%
OpenGL/TurboJPEG | 17.2772 20.1502 21.9076 23.6485 49.9671 mean=21.702 std=1.41032 | 41.5806 46.2578 47.2497 50.1534 59.9174 mean=47.5074 std=1.48139 | OpenGL:47% TurboJPEG:65% USB:60% Reg:64%
CUDA/TegraJPEG | 9.59201 10.1711 10.7408 11.5411 20.2425 mean=10.8238 std=0.529091 | 11.8931 11.962 12.1092 12.3543 20.1383 mean=12.256 std=0.846912 | CUDA:4% TegraJPEG:4% USB:59% Reg:76%
Feb 17, 2016: ThinkPad W540 (Intel i7-4800MQ, Nvidia Quadro K2100M); Ubuntu 14.04, kernel 4.2.0-29, gcc 4.8.4, CUDA 7.5
Configuration | Depth (min, 5%, median, 95%, max, mean, std) | RGB (min, 5%, median, 95%, max, mean, std) | Thread per core usage
CPU/TurboJPEG | 177.368 178.725 184.239 232.401 237.202 mean=192.605 std=17.612 | 13.5816 13.9677 14.6258 22.746 24.4688 mean=15.3834 std=2.26277 | CPU:91% TurboJPEG:45% USB:7% Reg:2%
OpenGL/TurboJPEG | 8.55666 13.9514 16.1583 18.1522 26.7371 mean=16.1974 std=1.87477 | 13.583 13.6906 14.8395 16.6675 24.4041 mean=14.887 std=1.2095 | OpenGL:9% TurboJPEG:45% USB:9% Reg:12%
Intel-OpenCL/VAAPI | 9.70148 10.455 11.9606 16.42 23.1066 mean=12.6258 std=2.05783 | 4.03962 4.10536 4.64751 6.73393 10.8338 mean=4.99849 std=0.907321 | OpenCL:4% VAAPI:2% USB:9% Reg:13%
CUDA/VAAPI | 3.81637 4.03557 4.06498 4.10873 7.7775 mean=4.07313 std=0.101962 | 4.04017 4.09998 4.5888 8.64204 16.0589 mean=5.15824 std=1.53683 | CUDA:15% VAAPI:2% USB:9% Reg:15%
Feb 17, 2016: ThinkPad W540 (Intel i7-4800MQ, Nvidia Quadro K2100M); Windows 8.1, Visual Studio 2015, Intel OpenCL SDK 2016
Configuration | Depth (min, 5%, median, 95%, max, mean, std) | RGB (min, 5%, median, 95%, max, mean, std) | Thread per core usage
Intel-OpenGL/TurboJPEG | 12.2028 12.5374 12.9779 15.0185 172.046 mean=13.3322 std=3.61579 | 14.6684 16.0605 16.3927 18.4994 31.9244 mean=16.8039 std=1.25351 | (No usage stats)
Nvidia-OpenGL/TurboJPEG | 5.16722 5.38771 11.4718 11.7794 163.966 mean=9.50424 std=3.86527 | 14.4331 14.5297 14.6954 15.0413 26.9313 mean=14.7962 std=0.723257 | N/A
CUDA/TurboJPEG | (VS2015 not supported by CUDA 7.5) | N/A | N/A

Benchmarking setup:

  1. Report the CPU and GPU models.
  2. Report the OS version (including the kernel version on Linux), the compiler version, and the API versions (OpenGL/CUDA/OpenCL, if you can find them).
  3. Report the date of testing.
  4. Build with -DENABLE_CXX11=ON.
  5. Set the environment variable: export LIBFREENECT2_DETAILED_TIMING=1
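
The summary columns used in the tables (min, 5%, median, 95%, max, mean, std) can be reproduced from a list of per-frame timings roughly as follows. This is a minimal sketch, not the script used for the numbers on this page: the function names are hypothetical, the input is assumed to be a plain list of timing samples already extracted from the log, and the percentile interpolation and the choice of population standard deviation are assumptions.

```python
import statistics


def percentile(sorted_samples, pct):
    """Percentile by linear interpolation over an already sorted, non-empty list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = (len(sorted_samples) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_samples) - 1)
    frac = rank - lo
    return sorted_samples[lo] + (sorted_samples[hi] - sorted_samples[lo]) * frac


def summarize(samples):
    """Return the seven summary statistics used in the benchmark tables."""
    s = sorted(samples)
    return {
        "min": s[0],
        "p5": percentile(s, 5),
        "median": percentile(s, 50),
        "p95": percentile(s, 95),
        "max": s[-1],
        "mean": statistics.mean(s),
        # Assumption: population standard deviation; the page does not say which one was used.
        "std": statistics.pstdev(s),
    }


def format_row(stats):
    """Format the statistics in the same order as the table columns (6 significant digits)."""
    return (
        f"{stats['min']:g} {stats['p5']:g} {stats['median']:g} "
        f"{stats['p95']:g} {stats['max']:g} "
        f"mean={stats['mean']:g} std={stats['std']:g}"
    )
```

Calling format_row(summarize(samples)) once for the depth timings and once for the RGB timings gives the two statistic cells of a table row.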

Benchmarking cases:

  • Linux

Also run top -d1 and read per-thread usage off the display (press H for thread view, V for tree view, and I for Irix mode, which reports usage per core).

  1. CPU/TurboJPEG LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer cpu - The pure software pipeline
  2. OpenGL/TurboJPEG LIBVA_DRIVER_NAME=none ./bin/Protonect -noviewer gl - The OpenGL compatibility pipeline
  3. Intel-OpenCL/VAAPI ./bin/Protonect -noviewer cl - The full Intel pipeline
  4. CUDA-OpenCL/VAAPI ./bin/Protonect -noviewer cl - The how-poorly-does-Nvidia-support-OpenCL pipeline
  5. CUDA/VAAPI ./bin/Protonect -noviewer cuda - The Nvidia and VAAPI mixed pipeline
  6. CUDA/TegraJPEG ./bin/Protonect -noviewer cuda - The Jetson TK1 pipeline
  • Windows
  1. CPU/TurboJPEG .\install\bin\Protonect -noviewer cpu - The pure software pipeline
  2. OpenGL/TurboJPEG .\install\bin\Protonect -noviewer gl - The OpenGL compatibility pipeline
  3. Intel-OpenCL/TurboJPEG .\install\bin\Protonect -noviewer cl - The Intel OpenCL pipeline
  4. CUDA/TurboJPEG .\install\bin\Protonect -noviewer cuda - The Nvidia and TurboJPEG mixed pipeline
  • Mac OS X
  1. CPU/TurboJPEG ./bin/Protonect -noviewer cpu - The pure software pipeline
  2. OpenGL/VT ./bin/Protonect -noviewer gl - The OpenGL compatibility pipeline
  3. OpenCL/VT ./bin/Protonect -noviewer cl - The OpenCL pipeline
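
The Linux cases above differ only in the environment and the pipeline argument, so they can be driven from a small table. The following is a sketch under assumptions: LINUX_CASES, build_command and run_case are hypothetical names, the binary is assumed to be at ./bin/Protonect, and which GPU the cl and cuda pipelines end up on still depends on the machine.

```python
import os
import subprocess

# Configuration name -> (extra environment, pipeline argument), copied from the Linux list.
LINUX_CASES = {
    "CPU/TurboJPEG": ({"LIBVA_DRIVER_NAME": "none"}, "cpu"),
    "OpenGL/TurboJPEG": ({"LIBVA_DRIVER_NAME": "none"}, "gl"),
    "Intel-OpenCL/VAAPI": ({}, "cl"),
    "CUDA-OpenCL/VAAPI": ({}, "cl"),
    "CUDA/VAAPI": ({}, "cuda"),
    "CUDA/TegraJPEG": ({}, "cuda"),
}


def build_command(case, protonect="./bin/Protonect", base_env=None):
    """Return (argv, env) for one benchmark case without running anything."""
    extra_env, pipeline = LINUX_CASES[case]
    env = dict(os.environ if base_env is None else base_env)
    # Detailed timing is required by the benchmarking setup.
    env["LIBFREENECT2_DETAILED_TIMING"] = "1"
    env.update(extra_env)
    return [protonect, "-noviewer", pipeline], env


def run_case(case):
    """Run one case; Protonect keeps running until it is interrupted."""
    argv, env = build_command(case)
    return subprocess.run(argv, env=env, check=False)
```

For example, run_case("CPU/TurboJPEG") starts the pure software pipeline with VA-API disabled, which matches the first Linux case.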

If a particular configuration is tested but fails:

  1. If the failure is a known unsolved issue, report it.
  2. If the failure is a solved issue that can be fixed by the user, do not report it.

TODO: Maybe use -gpu=0 or -gpu=1 to select GPU?

This page is a work in progress.

Platform acceleration of JPEG decoding

  • VA-API (Intel, Linux): Good
  • Intel Media SDK (Intel, Windows): Possible to implement. The Intel Hardware M-JPEG decoder MFT (mfx_mft_mjpgvd_64.dll, GUID 91CD2D6E-897B-4FA1-B0D7-51DC88010E0A) is probably an abstraction over DXVA/D3D11.
  • VDPAU (Nvidia): No. Does not support JPEG at all.
  • Tegra: Of all Nvidia's products, only Tegra has a hardware JPEG decoder. (A separate Tegra libjpeg decoder is being worked on.)
  • AMD: Implements a JPEG decoder with OpenCL, but we don't want it to compete with depth decoding for resources. (I evaluated GPUJPEG, and it was not good.)
  • Samsung's Exynos4 provides a JPEG codec via V4L2, but this is for mobile devices.
  • I looked at mpv and ffmpeg. They have no hardware acceleration for JPEG at all.
  • Chromium uses VAAPI and V4L2.
  • On Mac, a new decoder is provided by @fran6co. (@fran6co: The Mac decoder is not hardware accelerated yet. If they ever decide to do it, my implementation is going to have it.)