Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Unsafe Undefined Behavior - Zero Sized PlaneData #23

Closed
wants to merge 3 commits into from

Conversation

Kab1r
Copy link

@Kab1r Kab1r commented Feb 3, 2024

Calls to alloc with zero sized layouts is undefined behavior.
This is particularly problematic when using non-default allocators.
Jemallocator (using dev profile only) panics when a zero sized layout is used.
rav1e creates zero sized Planes in multiple places to avoid the cost of allocation when the field is not used internally.

Calls to `alloc` with zero sized layouts is undefined behavior.
This is particularly problematic when using non-default allocators.
Jemalloc panics when a zero sized layout used.
rav1e creates zero sized `Planes` in multiple places to avoid the cost of allocation when the field is not used internally.
@codecov-commenter
Copy link

codecov-commenter commented Feb 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7184cb8) 65.58% compared to head (de51833) 68.93%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #23      +/-   ##
==========================================
+ Coverage   65.58%   68.93%   +3.35%     
==========================================
  Files           4        4              
  Lines         985      998      +13     
==========================================
+ Hits          646      688      +42     
+ Misses        339      310      -29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Kab1r Kab1r changed the title Fix: Unsafe Undefined Behavior - Zero Sized `PlaneData Fix: Unsafe Undefined Behavior - Zero Sized PlaneData Feb 3, 2024
@shssoichiro
Copy link
Collaborator

Come to think of it, was there a reason we needed to use unsafe allocation here? @lu-zero @barrbrain any chance you remember? To be honest, I'd prefer to rework this into safe code if possible, as long as it doesn't have significant performance implications.

@lu-zero
Copy link
Member

lu-zero commented Feb 4, 2024

We need to use aligned memory

@lu-zero
Copy link
Member

lu-zero commented Feb 4, 2024

It needs a benchmark to be sure that we aren't paying more with those branches than just doing a min-sized allocation and keep len 0

@Kab1r
Copy link
Author

Kab1r commented Feb 5, 2024

I don't think safe code can work here easily while having control over alignment.

Initially, I thought that branches almost always be cheaper than doing min-sized allocations, but I found that Jemallocator only performs the checks on the dev profile, so there might be a real performance penalty.

What would a benchmark here look like? Would we just compare encode times, or would something that exclusively tested Planes be more suitable?

@shssoichiro
Copy link
Collaborator

shssoichiro commented Feb 5, 2024

There are some existing benches in this crate that cover Frame::new_with_padding and Plane::new_with_padding which should cover this change. Probably wouldn't hurt if we just ran the whole bench suite in v_frame, it's not that large and would make sure the changes to deref etc are covered.

@lu-zero
Copy link
Member

lu-zero commented Feb 5, 2024

deref is the hottest path among those changed, so it needs a benchmark. Running an encode with and w/out the change with hyperfine can give us some hints in that regard.

@FreezyLemon
Copy link
Contributor

FreezyLemon commented Feb 10, 2024

I'm pretty sure the approach here can be changed a bit to solve most of these issues. Can't we just use NonNull::dangling() instead of an Option::None? That's what std uses in most cases AFAIK and it should play nice with slice::from_raw_parts (see the comments on safety invariants).

That way, we'd just need a special case during construction (and drop) and probably not need to worry about it in deref? Haven't tested it myself, but I think it's possible

@Kab1r
Copy link
Author

Kab1r commented Feb 11, 2024

Benchmark Results

Important

Tested on AMD EPYC 7452 (32 cores, AVX2)
Kernel: Linux 5.15.0-86-generic x86_64 GNU/Linux
Compiler: rustc 1.76.0 (07dca489a 2024-02-04)

rav1e

Note

rav1e (master | p20240206 | 4b5c8f77b73fd58c7b5cec2302e3892660dd2bbb)
Patched with: [patch.crates-io]
Build Command: RUSTFLAGS="-C target-cpu=native" cargo build --release
Reference File: park_joy_420_720p50.y4m

Main

Benchmark 1: ./target/release/rav1e --benchmark --speed 10 --threads 512 -o park_joy_420_720p50_unpatched.ivf park_joy_420_720p50.y4m -y --tiles 512
  Time (mean ± σ):     72.374 s ±  3.134 s    [User: 3279.587 s, System: 46.993 s]
  Range (min … max):   69.432 s … 79.481 s    10 runs

Dangling

Benchmark 1: ./target/release/rav1e --benchmark --speed 10 --threads 512 -o park_joy_420_720p50_dangling.ivf park_joy_420_720p50.y4m -y --tiles 512
  Time (mean ± σ):     79.984 s ±  2.775 s    [User: 3751.362 s, System: 46.819 s]
  Range (min … max):   76.706 s … 83.858 s    10 runs

Option

Benchmark 1: ./target/release/rav1e --benchmark --speed 10 --threads 512 -o park_joy_420_720p50_options.ivf park_joy_420_720p50.y4m -y --tiles 512
  Time (mean ± σ):     87.499 s ±  1.769 s    [User: 4185.019 s, System: 45.935 s]
  Range (min … max):   83.604 s … 89.096 s    10 runs

v_frame

Note

v_frame benchmark suite
Bench Command: RUSTFLAGS="-C target-cpu=native" cargo bench

Main

Expand
frame new_with_padding padding=0
                        time:   [5.5458 µs 5.5481 µs 5.5508 µs]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

frame new_with_padding padding!=0
                        time:   [8.5383 µs 8.5482 µs 8.5597 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

plane new padding=0     time:   [3.4554 µs 3.4568 µs 3.4584 µs]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

plane new padding!=0    time:   [4.6187 µs 4.6210 µs 4.6237 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

plane clone             time:   [6.8910 µs 6.8973 µs 6.9044 µs]
Found 18 outliers among 100 measurements (18.00%)
  15 (15.00%) high mild
  3 (3.00%) high severe

plane pad               time:   [3.9786 µs 3.9822 µs 3.9861 µs]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

plane copy_from_raw_u8 8-bit
                        time:   [7.3728 µs 7.3771 µs 7.3817 µs]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

plane copy_from_raw_u8 10-bit
                        time:   [8.3578 µs 8.3989 µs 8.4338 µs]

plane downsampled       time:   [13.487 µs 13.492 µs 13.498 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

Option (vs Main)

Expand
frame new_with_padding padding=0
                        time:   [5.6476 µs 5.6486 µs 5.6495 µs]
                        change: [+1.5166% +1.6587% +1.7744%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

frame new_with_padding padding!=0
                        time:   [8.5360 µs 8.5387 µs 8.5422 µs]
                        change: [-0.4582% -0.2924% -0.1211%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

plane new padding=0     time:   [3.2918 µs 3.2950 µs 3.2989 µs]
                        change: [-4.8978% -4.7580% -4.6412%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

plane new padding!=0    time:   [4.6465 µs 4.6478 µs 4.6493 µs]
                        change: [+0.3847% +0.5299% +0.6492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

plane clone             time:   [7.4861 µs 7.4906 µs 7.4959 µs]
                        change: [+7.9969% +8.2529% +8.4897%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  7 (7.00%) high mild
  12 (12.00%) high severe

plane pad               time:   [3.8507 µs 3.8541 µs 3.8581 µs]
                        change: [-3.6760% -3.4901% -3.3192%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 33 outliers among 100 measurements (33.00%)
  10 (10.00%) low severe
  8 (8.00%) low mild
  4 (4.00%) high mild
  11 (11.00%) high severe

plane copy_from_raw_u8 8-bit
                        time:   [6.7926 µs 6.8812 µs 6.9702 µs]
                        change: [-8.4359% -7.9193% -7.4218%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

plane copy_from_raw_u8 10-bit
                        time:   [7.8379 µs 7.8743 µs 7.9137 µs]
                        change: [-5.8788% -5.5102% -5.1328%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

plane downsampled       time:   [69.591 µs 69.626 µs 69.664 µs]
                        change: [+415.02% +415.87% +416.77%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

plane downscale         time:   [71.090 µs 71.094 µs 71.098 µs]
                        change: [+500.50% +501.21% +501.68%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

plane rows_iter         time:   [14.619 µs 14.621 µs 14.624 µs]
                        change: [-0.3328% -0.2295% -0.1348%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

Dangling (vs Main)

Expand
frame new_with_padding padding=0
                        time:   [5.5834 µs 5.5852 µs 5.5870 µs]
                        change: [+1.5181% +1.6370% +1.7391%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe

frame new_with_padding padding!=0
                        time:   [8.3606 µs 8.3625 µs 8.3647 µs]
                        change: [-0.6897% -0.5402% -0.4254%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

plane new padding=0     time:   [3.1663 µs 3.1672 µs 3.1684 µs]
                        change: [+0.1292% +0.2762% +0.3923%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

plane new padding!=0    time:   [4.6428 µs 4.6446 µs 4.6463 µs]
                        change: [+2.0101% +2.1166% +2.2101%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

plane clone             time:   [6.4055 µs 6.4060 µs 6.4065 µs]
                        change: [-15.251% -15.148% -15.073%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high severe

plane pad               time:   [4.0703 µs 4.0708 µs 4.0713 µs]
                        change: [+0.9177% +1.0594% +1.1705%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  4 (4.00%) high mild
  7 (7.00%) high severe

plane copy_from_raw_u8 8-bit
                        time:   [6.6586 µs 6.6602 µs 6.6620 µs]
                        change: [-10.209% -9.8462% -9.5142%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

plane copy_from_raw_u8 10-bit
                        time:   [8.1811 µs 8.1887 µs 8.1941 µs]
                        change: [+1.3277% +1.9080% +2.4623%] (p = 0.00 < 0.05)
                        Performance has regressed.

plane downsampled       time:   [13.135 µs 13.147 µs 13.160 µs]
                        change: [-0.2355% -0.0659% +0.0786%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

plane downscale         time:   [11.551 µs 11.554 µs 11.557 µs]
                        change: [-0.5335% -0.3642% -0.2149%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

plane rows_iter         time:   [14.917 µs 14.920 µs 14.923 µs]
                        change: [+1.9106% +2.0829% +2.2188%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

Dangling (vs Option)

Expand
frame new_with_padding padding=0
                        time:   [5.6026 µs 5.6037 µs 5.6048 µs]
                        change: [-0.8154% -0.7411% -0.6475%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

frame new_with_padding padding!=0
                        time:   [8.6580 µs 8.6588 µs 8.6598 µs]
                        change: [+1.2420% +1.3651% +1.4547%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

plane new padding=0     time:   [3.1576 µs 3.1586 µs 3.1596 µs]
                        change: [-4.2208% -4.1403% -4.0633%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

plane new padding!=0    time:   [4.4064 µs 4.4083 µs 4.4104 µs]
                        change: [-5.3059% -5.2506% -5.1973%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

plane clone             time:   [6.4131 µs 6.4206 µs 6.4290 µs]
                        change: [-14.541% -14.427% -14.311%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

plane pad               time:   [3.9366 µs 3.9409 µs 3.9451 µs]
                        change: [+2.3583% +2.4852% +2.6144%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  5 (5.00%) high mild
  9 (9.00%) high severe

plane copy_from_raw_u8 8-bit
                        time:   [6.7173 µs 6.7181 µs 6.7190 µs]
                        change: [-1.5480% -0.9748% -0.4779%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

plane copy_from_raw_u8 10-bit
                        time:   [8.1901 µs 8.1917 µs 8.1937 µs]
                        change: [+4.2488% +4.4765% +4.6728%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

plane downsampled       time:   [13.179 µs 13.187 µs 13.197 µs]
                        change: [-81.114% -81.088% -81.065%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  6 (6.00%) low severe
  4 (4.00%) high mild
  7 (7.00%) high severe

plane downscale         time:   [11.412 µs 11.416 µs 11.421 µs]
                        change: [-83.949% -83.941% -83.930%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

plane rows_iter         time:   [14.566 µs 14.573 µs 14.582 µs]
                        change: [-0.4033% -0.3578% -0.3154%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

@FreezyLemon
Copy link
Contributor

FreezyLemon commented Feb 11, 2024

I can't currently access my desktop machine, but I tested the branches dangling vs. main on my notebook and the results are noticeably different. This might need more investigation? Not sure how I feel about delaying a soundness fix, but the change in behavior doesn't seem 100% consistent.

Dangling vs Main
   Running benches/bench.rs (target/release/deps/bench-77e6c2e86cb854dc)
frame new_with_padding padding=0
                      time:   [4.8044 µs 4.8139 µs 4.8242 µs]
                      change: [-18.398% -17.548% -16.674%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe

Benchmarking frame new_with_padding padding!=0: Collecting 100 samples in estimated 5.0397 s (626k iterations
frame new_with_padding padding!=0
                      time:   [7.9278 µs 8.0211 µs 8.1120 µs]
                      change: [-2.5854% -1.7537% -0.8989%] (p = 0.00 < 0.05)
                      Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) low severe
3 (3.00%) high mild
2 (2.00%) high severe

plane new padding=0     time:   [2.8500 µs 2.8578 µs 2.8661 µs]
                      change: [+1.4054% +2.1370% +3.1526%] (p = 0.00 < 0.05)
                      Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe

plane new padding!=0    time:   [4.4487 µs 4.4620 µs 4.4746 µs]
                      change: [-4.0047% -3.1560% -2.3405%] (p = 0.00 < 0.05)
                      Performance has improved.

plane clone             time:   [6.8967 µs 6.9171 µs 6.9412 µs]
                      change: [+1.0552% +1.4883% +1.9725%] (p = 0.00 < 0.05)
                      Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe

plane pad               time:   [3.3016 µs 3.3085 µs 3.3166 µs]
                      change: [-7.4507% -6.8170% -6.1368%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
6 (6.00%) high mild

plane copy_from_raw_u8 8-bit
                      time:   [6.9379 µs 6.9516 µs 6.9657 µs]
                      change: [-8.2592% -7.6483% -7.0711%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
12 (12.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild

plane copy_from_raw_u8 10-bit
                      time:   [7.8394 µs 7.8589 µs 7.8804 µs]
                      change: [-1.1914% -0.4869% +0.1826%] (p = 0.19 > 0.05)
                      No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe

plane downsampled       time:   [10.931 µs 11.006 µs 11.083 µs]
                      change: [-3.7809% -2.8298% -1.8916%] (p = 0.00 < 0.05)
                      Performance has improved.

plane downscale         time:   [9.7859 µs 9.8030 µs 9.8221 µs]
                      change: [-10.635% -9.8731% -9.1971%] (p = 0.00 < 0.05)
                      Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild

plane rows_iter         time:   [12.750 µs 12.852 µs 12.974 µs]
                      change: [-0.1309% +0.5361% +1.1598%] (p = 0.12 > 0.05)
                      No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
4 (4.00%) high mild
2 (2.00%) high severe

@FreezyLemon
Copy link
Contributor

FreezyLemon commented Feb 11, 2024

I just noticed that impl PlaneData has more soundness bugs. The loop in new is unsound and copy_from_slice in from_slice is most likely unsound. It might make sense to rewrite some of this instead of just patching single bugs.

This might also explain the weird performance behavior.

There might be value in using MaybeUninit<T>, for example. Or using a len: NonZeroUsize in some code paths... Maybe out of scope for this PR though

@shssoichiro
Copy link
Collaborator

I opened #24 to see if using an aligned vec crate would be a viable option. It seems to have an acceptable performance impact, which is positive in some cases and negative in others, but below 10% in all cases. My thought process was that that would be the fastest way to get rid of all soundness bugs that might be present as a result of our manual initialization code, by just not doing it ourselves.

@shssoichiro
Copy link
Collaborator

Closed in favor of the 100% safe approach in #24

@Kab1r Kab1r deleted the fix/zero-size-planedata-alloc-ub branch February 20, 2024 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants