-
-
Notifications
You must be signed in to change notification settings - Fork 851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize (AVX2) JPEG Color Converter #1411
Vectorize (AVX2) JPEG Color Converter #1411
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EPIC!
Unsafe.Add(ref destination, 2) = Avx.Shuffle(cmHi, yoHi, 0b01_00_01_00); | ||
Unsafe.Add(ref destination, 3) = Avx.Shuffle(cmHi, yoHi, 0b11_10_11_10); | ||
} | ||
#else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't have chance to react, but I'm not entirely happy with the practice established in #1402. I would try putting the different paths into different classes (FromCmykVector8
VS FromCmykVectorAvx2
).
FromCmykVectorAvx2
can derive from FromCmykVector8
for convenience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that was a stopgap. I’d like to revisit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't shy away from inheritance here, I would suggest a hierarchy like this:
abstract class JpegColorConverter
{
public abstract bool IsAvailable { get; } // new on this class
}
abstract class BasicJpegColorConverter : JpegColorConverter // new base class for all non-vectorized color converters
{
public override bool IsAvailable => true;
}
abstract class VectorizedJpegColorConverter : JpegColorConverter // new base class for all vectorized color converters
{
private readonly int vectorSize;
protected VectorizedJpegColorConverter(JpegColorSpace colorSpace, int precision, int vectorSize)
: base(colorSpace, precision)
{
this.vectorSize = vectorSize;
}
public override void ConvertToRgba(in ComponentValues values, Span<Vector4> result)
{
int remainder = result.Length % vectorSize;
int simdCount = result.Length - remainder;
if (simdCount > 0)
{
ConvertCoreVectorized(values.Slice(0, simdCount), result.Slice(0, simdCount));
}
ConvertCore(values.Slice(simdCount, remainder), result.Slice(simdCount, remainder));
}
protected abstract void ConvertCoreVectorized(in ComponentValues values, Span<Vector4> result);
protected abstract void ConvertCore(in ComponentValues values, Span<Vector4> result);
}
abstract class Avx2JpegColorConverter : VectorizedJpegColorConverter // new base class for all AVX2-based converters
{
protected Avx2JpegColorConverter(JpegColorSpace colorSpace, int precision)
: base(colorSpace, precision, 8)
{
}
public override bool IsAvailable
{
get
{
#if SUPPORTS_RUNTIME_INTRINSICS
return Avx2.IsSupported;
#else
return false;
#endif
}
}
}
// another base class for Vector<float>-based converters
We could then instantiate all converters initially into a static array and determine which one to use based on ColorSpace
, Precision
and IsAvailable
. This would address your point and maximize code re-use at the cost of some virtual method calls.
Let me know what you think. I didn't want to refactor to much for the initial draft-PR and just add the vectorized logic into the existing code structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tkp1n I like the concept, go for it! Hopefully the runtime costs of the virtual dispatches are be still negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored as discussed above, PTAL...
Next up: Benchmarks 🚀
Codecov Report
@@ Coverage Diff @@
## master #1411 +/- ##
==========================================
- Coverage 83.14% 83.08% -0.06%
==========================================
Files 695 707 +12
Lines 31484 31839 +355
Branches 3586 3590 +4
==========================================
+ Hits 26176 26454 +278
- Misses 4585 4668 +83
+ Partials 723 717 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
The numbers are in 💯 Looks like some nice improvements across the board. 👍 YCCK is the big winner. BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
[Host] : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
Job-ZMVTJV : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
Job-VQWAVA : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
Job-ETAOPA : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
IterationCount=3 LaunchCount=1 WarmupCount=3
CMYK
YCCK
RGB
Grayscale
YCbCr
|
Looking good @tkp1n !! I'll do a thorough review as soon as possible! |
So guys, here's my "offer": I want to put together a PR for #1410 that also alters the color converters to pack into a 3-channel RGB float buffer, but I want to "go lazy" and spare solving AVX2 shuffling riddles, because every hour I'd spend there would delay SixLabors/ImageSharp.Drawing#96 further. So the question is: @tkp1n do you have any willingness and chance to spend a bit more time here, and finish such a refactor with @JimBobSquarePants 's help? If we can make it, this trick would deliver superior performance compared to any alternative, since we could spare a completely unnecessary alpha padding. Let me know what you think. If you agree we'd figure out the "what PR follows what" details. Cheers! |
@antonfirsov Isn't this essentially what #1242 should be doing? |
That corresponds to #1121 ("Optimized pipeline"), which is a more expensive refactor, probably not realistic for 1.1. The late night idea was that after all the "low hanging" refactors that happened, with a single extra step we could profit a lot now, and It's probably easier to do while this shuffling thing is fresh in your heads. (This would remove the need of implementing a one-step However I'm not sure now, since what I'm asking might be tricky and time-consuming because the way 3-component RGB-s overlap |
I don't think it's that much effort since the existing converters are all doing most of the work already. It's a case of moving most of it and filling out some gaps using I'm happy to look at it after this is merged for simplicities sake. |
Starting next Monday, I'll be away from "work" like this for 4 weeks with very limited access to the web. 😢 I've removed the "[WIP]" and marked this PR as ready for review. Feel free to let me know what you need to be fixed before this can be merged. If the timing of this PR doesn't suit you due to other planned or ongoing refactoring, feel free to set this one aside, and we'll reconsider it when the timing is right. |
@tkp1n no worries, enjoy your vacation, and thanks for the contribution! Your PR should be the first to be merged, we'll review it ASAP. |
I can't see any issues with this. It all looks fantastic and the performance is consistent with master for the existing optimized conversion. 👍 🚀 If we drop |
Thanks a lot 😃
Done, along with some final polishing of the benchmarks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One major concern about FromYCbCrVector
, and a few simplification ideas.
Test coverage seems to prove that everything works 100% fine, however it might worth for @JimBobSquarePants to do a quick sanity check on the shuffling bits.
src/ImageSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.cs
Outdated
Show resolved
Hide resolved
{ | ||
ConvertCore(values.Slice(0, simdCount), result.Slice(0, simdCount), this.MaximumValue, this.HalfValue); | ||
} | ||
protected override bool IsAvailable => true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected override bool IsAvailable => true; | |
protected override bool IsAvailable => Vector.IsHardwareAccelerated && Vector<float>.Count == 4; |
This is essentially an SSE path based on Vector4
. The converter is not safe to run if the above conditions are not met, because of the RoundAndDownscalePreVector8
call.
@tannergooding is there an environment variable to enforce a (Vector.IsHardwareAccelerated && Vector<float>.Count == 4) == true
configuration for testing?
If not, I would suggest to remove this converter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set COMPlus_EnableAVX2=0
it will force Vector<T>
to be 16-bytes. We don't have a switch (at least not one that ships in release builds) that explicitly controls the size of Vector<T>
today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...eSharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrVector.cs
Outdated
Show resolved
Hide resolved
|
||
protected override void ConvertCoreVectorized(in ComponentValues values, Span<Vector4> result) | ||
{ | ||
#if SUPPORTS_RUNTIME_INTRINSICS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we hide the whole file behind the condition instead, and add this guard to GetYCbCrConverters
:
#if SUPPORTS_RUNTIME_INTRINSICS
yield return new FromYCbCrAvx2(precision);
#endif
Same for the other color spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add the conditionals to the files we have to refactor the tests as I don't want to pepper the tests with conditionals and we don't have cross platform remote executor available to test everything on Core2.1 Win yet.
Adding to the yield though is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok lets skip it for now. Comment could be useful.
...p/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromGrayScaleVector8.cs
Outdated
Show resolved
Hide resolved
...harp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromGrayScaleAvx2.cs
Outdated
Show resolved
Hide resolved
Vector256<float> g = HwIntrinsics.MultiplyAdd(HwIntrinsics.MultiplyAdd(y, cb, gCbMult), cr, gCrMult); | ||
Vector256<float> b = HwIntrinsics.MultiplyAdd(y, cb, bCbMult); | ||
|
||
// TODO: We should be saving to RGBA not Vector4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JimBobSquarePants we should either do this or #1121, but let's not keep around contradicting plans in our notes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote that a while back and forgot about it. I still believe there's merit in cutting out the FromVector4
call though. I need to make some diagrams and think things through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe there's merit in cutting out the
FromVector4
I think #1121 is the most efficient way to do that with the float pipelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I wasn't clear I don't mean change the pipeline what I mean is to pack triplets so we can avoid Vector4
=> Rgba32
=> Rgb24
. That comment should have said RGB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private static void ValidateYCbCr(in JpegColorConverter.ComponentValues values, Vector4[] result, int i) | ||
[Theory] | ||
[MemberData(nameof(CommonConversionData))] | ||
public void FromYCbCrVector(int inputBufferLength, int resultBufferLength, int seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't have to happen in this PR, but for 1.1 we need a variant of this test that works with:
FeatureTestRunner.RunWithHwIntrinsicsFeature(
RunTest,
args,
HwIntrinsics.DisableAVX2);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I would like to do is a single test which I can then pass through the runner. That will test all the versions then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parameterize it to also exercise the Vector8 the Vector4 and the scalar path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, since the call change the environmental paths like COMPlus_EnableAVX2=0
we can do the following.
FeatureTestRunner.RunWithHwIntrinsicsFeature(
RunTest,
args,
HwIntrinsics.AllowAll | HwIntrinsics.DisableAVX | HwIntrinsics.DisableSSE );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to get Vector.IsHardwareAccelerated == false
? Is it enough to HwIntrinsics.DisableSSE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COMPlus_FeatureSIMD=0
should do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that's it. We have HwIntrinsics.DisableSIMD
for that.
…/ImageSharp into tkp1n/avx2-color-converter
…/ImageSharp into tkp1n/avx2-color-converter
/// Gets a value indicating whether <see cref="Vector{T}"/> code is being JIT-ed to SSE instructions | ||
/// where float and integer registers are of size 128 byte. | ||
/// </summary> | ||
public static bool HasVector4 { get; } = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this name misleading, one may think it has to do something with Vector4
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just inline it, there would be only one usage, if we'd address my other suggestion.
...Sharp/Formats/Jpeg/Components/Decoder/ColorConverters/JpegColorConverter.FromYCbCrVector4.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Will open an issue later for the test gap on FromYCbCrVector4
.
Thank you @JimBobSquarePants and @antonfirsov for your inputs and hands-on time to get this merged so quickly 🚀 |
Vectorize (AVX2) JPEG Color Converter
This is an initial implementation of the remaining (not yet vectorized) JPEG color converters using AVX(2) runtime intrinsics as well as
Vector<float>
.I'm opening this as a draft PR to get initial feedback before investing additional time in polishing, tests and benchmarks.
See also #809