[X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions #2111
Conversation
Force-pushed from c316c45 to 163380c.
@ajtulloch you don't need another symbol. You shouldn't need to change anything in nnvm to enable winograd for x86.
@masahi how can I avoid adding new ops? I agree that we could probably remove some of them, but just as NCHW Winograd convolution required new NNVM ops, and NCHWc convolution required new NNVM ops, we need new ops for NCHWc Winograd AFAICT.
Force-pushed from 91a968e to 327c7b3.
Have you tried it? As long as you have a correct setup in _alter_conv2d_layout, I think it should work. For the filter shape, the current solution in nnvm is not to do a weight shape check during infer shape (see here). So you can use the filter transform with any layout.
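(For context, a rough sketch of what such an _alter_conv2d_layout setup could look like against the NNVM-era API. This is not code from this PR; the registration target, attribute handling, and tile size are illustrative assumptions:)

# Illustrative sketch only (not this PR's code): swap an eligible 3x3 conv2d
# for the pre-transformed Winograd op during alter_op_layout, so no extra
# NNVM symbol is strictly required. Symbol and attribute names follow the
# NNVM-era API and should be treated as assumptions.
import nnvm.symbol as sym
from topi.nn.conv2d import conv2d_alter_layout

@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
    data, kernel = inputs
    if attrs.get_int_tuple("kernel_size") != (3, 3):
        return None  # leave other kernel sizes to the default lowering

    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["tile_size"] = 4
    # Transform the weight once; nnvm skips the weight shape check during
    # infer shape, so the transformed layout needs no dedicated symbol.
    weight = sym.contrib.conv2d_winograd_weight_transform(kernel, tile_size=4)
    return sym.contrib.conv2d_winograd_without_weight_transform(
        data, weight, **new_attrs)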
Force-pushed from 06d6c5d to a9bdc2e.
@masahi doesn't that logic apply to having duplicated conv2d/conv2d_NCHWc ops in NNVM as well? I don't see why it's reasonable to have separate ops in one case but not the other.
Force-pushed from a9bdc2e to 37cabe4.
I see. Honestly, I don't know why a new symbol was introduced in the first place. My opinion is that since we already have everything on the nnvm side to enable NCHWc winograd, we don't need another duplicated symbol.
@masahi I'd tend to agree, but I can also understand the appeal of being able to say 'conv2d and conv2d_winograd ops take a 4D NCHW input and produce 5D NCHWc output, conv2d_NCHWc and conv2d_NCHWc_winograd ops take a 5D NCHWc input and produce 5D NCHWc output', instead of essentially 'conv2d nodes take any dimension input and produce any dimension output'. If there's a desire from maintainers to consolidate all the convolution node types in NNVM/Relay, that's reasonable, but I'd argue it's orthogonal to this patch.
FYI, the recently added cuda int8 convolution uses the NCHW4c layout with the existing conv2d symbol, so things are not consistent already. I think it's a good time to discuss this issue as we transition to Relay. @tqchen @merrymercy
Force-pushed from 37cabe4 to 7f4b4af.
Force-pushed from 71aaacf to e0087df.
This looks great, thanks for the contribution! @ajtulloch, we will test it out from our side.
@yidawang the parallelism part isn't well optimized (we need to conditionally enable parallelism depending on the granularity at which we compute the previous stages), but I think everything else should be good to go. Note the caveats mentioned in https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/1017/16?u=ajtulloch

BTW, I don't have good heuristics for avoiding these cases right now, so maybe just brute-force disabling it for CIn x COut >= 256 * 256 or something would be reasonable.
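(A minimal sketch of the brute-force guard suggested above; the 256 * 256 threshold comes from the comment, while the helper name and everything else is illustrative:)

# Illustrative only: a crude predicate for falling back to the direct NCHWc
# schedule when the reduction is large. The 256 * 256 threshold is the one
# suggested above; the rest is an assumption, not the PR's code.
WINOGRAD_CHANNEL_LIMIT = 256 * 256

def use_nchwc_winograd(cin, cout, kh, kw):
    if (kh, kw) != (3, 3):
        return False  # this Winograd path targets 3x3 kernels
    if cin * cout >= WINOGRAD_CHANNEL_LIMIT:
        return False  # large CIn x COut: prefer the direct NCHWc schedule
    return True

# e.g. use_nchwc_winograd(64, 64, 3, 3) -> True
#      use_nchwc_winograd(512, 512, 3, 3) -> False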
@ajtulloch perhaps this paper has some insight: Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs.
@ajtulloch is this PR ready for review? Please request reviewers.
Force-pushed from e0087df to 7c33f51.
Just rebased. @merrymercy, @yidawang, @yzhliu, @masahi, would you be interested in reviewing these changes?
This is the implementation alluded to in https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/

It is a pretty standard Winograd implementation, modified for NCHWc layout. It achieves reasonable speedups (up to 2x vs current implementation) on a number of ResNet 3x3 layers on SKL and AVX.

TODO: Parallelization
TODO: Benchmarking suite results on full ResNet suite.
TODO: Demonstration in `tune_nnvm_x86.py`
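(Background for readers unfamiliar with the layout: NCHWc packs the channel dimension into fixed-size blocks, turning a 4-D NCHW tensor into a 5-D (N, C//c, H, W, c) tensor whose inner block typically matches the vector width. A small NumPy sketch of the packing, purely for illustration and not part of the PR:)

# Illustration only: pack an NCHW tensor into NCHWc with block size c_block.
import numpy as np

def nchw_to_nchwc(x, c_block=8):
    n, c, h, w = x.shape
    assert c % c_block == 0, "channels must be divisible by the block size"
    # (N, C, H, W) -> (N, C//c, c, H, W) -> (N, C//c, H, W, c)
    return x.reshape(n, c // c_block, c_block, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.randn(1, 64, 56, 56).astype("float32")
print(nchw_to_nchwc(x).shape)  # (1, 8, 56, 56, 8)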
Force-pushed from 7c33f51 to c565256.
def div_round_up(a, b):
    return (a + b - 1) // b

# assert all(k == 3 for k in (KH, KW))
enable this check?
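(A sketch of what the enabled guard could look like; the wrapper function and message text are illustrative, only the assert expression comes from the snippet above:)

def check_winograd_kernel_size(KH, KW):
    # Sketch: the NCHWc Winograd kernels here target 3x3 filters, so reject
    # anything else early instead of leaving the assert commented out.
    assert all(k == 3 for k in (KH, KW)), \
        "NCHWc Winograd convolution currently supports only 3x3 kernels"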
policy='candidate', candidate=[
    [n, coo, cii, oh_m, ow_m, eps, ciii, nu, vc],
    [n, coo, oh_m, ow_m, eps, cii, ciii, nu, vc],
    # [n, coo, cii, oh_m, ow_m, ciii, nu, eps, vc],
delete these candidates?
    s[U].pragma(eps, 'debug_skip_region')
else:
    pass
    # r_kh, r_kw = s[U].op.reduce_axis
add some basic schedule here
cfg.define_knob('input_tile_compute_location', [0, 1, 2, 3])
if cfg['input_tile_compute_location'].val == 0:
This pattern appears many times. It will be good if we can wrap it into something like

cfg.define_annotate('input_tile_compute_location', [n, cii, oh_m, ow_m], policy='parallel_location')
axes = cfg['input_tile_compute_location'].apply(s, V, [n, cii, oh_m, ow_m], producer=input_tile)
I spent some time looking at how to replace the knob with something more elegant. The knobs are used in more places and the operations involved differ, so I don't see how to generalize this. The knob directs not only compute_at & vectorize, but also fuse & compute_at. If you have ideas please elaborate; otherwise, can we leave it as a knob?
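(To make the pattern concrete, a condensed, self-contained toy illustration of such a compute-location knob, written against the NNVM-era tvm/autotvm API; the compute definition, shapes, and knob name are invented and only mirror the style of the schedule above, not its actual code:)

# Toy example of the "compute-location knob" pattern debated above
# (assumes the pre-tvm.te, NNVM-era API; names and shapes are invented).
import tvm
from tvm import autotvm

@autotvm.template
def toy_schedule(N, C, H, W):
    data = tvm.placeholder((N, C, H, W), name="data")
    tile = tvm.compute((N, C, H, W), lambda *i: data(*i), name="input_tile")
    out = tvm.compute((N, C, H, W), lambda *i: tile(*i) + 1, name="out")
    s = tvm.create_schedule(out.op)

    cfg = autotvm.get_config()
    n, c, h, w = s[out].op.axis
    # The knob picks where the intermediate stage is computed; in the real
    # schedule the same knob also drives fuse/vectorize decisions, which is
    # why it is hard to wrap in a single generic annotation.
    cfg.define_knob("tile_compute_location", [0, 1, 2])
    loc = cfg["tile_compute_location"].val
    if loc == 0:
        s[tile].compute_at(s[out], w)
    elif loc == 1:
        s[tile].compute_at(s[out], h)
    else:
        s[tile].compute_root()
    return s, [data, out]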
parallel_axis = s[M].fuse(n, coo)
s[V].compute_at(s[M], ow_m)

# s[M].parallel(parallel_axis)
Uncomment this. I know that for the single-thread case this parallel annotation will produce redundant code and harm performance. I think we should always annotate parallel in the schedule, but add some checks in
https://github.com/dmlc/tvm/blob/9473dca266e307cf1f9faece219af111686ca946/python/tvm/schedule.py#L548-L552
to skip annotating parallel when targeting the single-thread case.
I believe here we can just return in the single-thread case. Could you hint at how to check this right here? From where can I extract thread_num or similar info?
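(One possible shape for such a guard, purely as a sketch: it keys off the TVM_NUM_THREADS environment variable that the runtime thread pool already honours; where an authoritative thread count should come from is exactly the open question above.)

# Sketch only: skip the parallel annotation when the runtime is pinned to a
# single thread. TVM_NUM_THREADS is read by TVM's thread pool; a real check
# inside Stage.parallel() would need an agreed-upon source for thread_num,
# which this thread leaves open.
import os

def single_threaded():
    return os.environ.get("TVM_NUM_THREADS", "") == "1"

# e.g. inside the schedule:
#     if not single_threaded():
#         s[M].parallel(parallel_axis)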
@@ -414,6 +414,83 @@ NNVM_REGISTER_OP(_contrib_conv2d_winograd_without_weight_transform)

DMLC_REGISTER_PARAMETER(WinogradConv2DParam);

NNVM_REGISTER_OP(_contrib_conv2d_NCHWc_winograd_weight_transform)
I think it is better to support the NCHWc layout in the original symbol rather than creating a new symbol, as discussed by @masahi. Registering and maintaining new symbols is harder than adding an "if" in the old symbol.
The todo item
@ajtulloch can you please act on @merrymercy's comment?
@@ -0,0 +1,202 @@

"""Example code to do convolution."""
We might also want to update the autotvm x86 tutorial to add a winograd section.
Ping @ajtulloch.
Ah sorry folks, I missed the notifications. Will absolutely address this by the end of the weekend at the latest, my apologies.
@masahi regarding the |
@ajtulloch would you like to follow up on the changes?
Ping @ajtulloch, can we land this PR for the 0.5 release?
@ajtulloch can you try to get this in? :)
@ajtulloch could you update this work and port it to Relay? I also find it gives a better performance improvement on ARM compared to the default winograd schedule. I suggest we also add it on ARM CPU, for example named 'winograd_nchwc'.
@ajtulloch ,
Ping @ajtulloch. @cbalint13, let us wait for two days; if there is no update, you are more than welcome to take over this PR.
@cbalint13 please feel free to take over this PR, as long as you acknowledge Andrew in the new one.
Superseded by #3553, thanks to @cbalint13.
@ajtulloch, @tqchen, #3553 is only a part of the integration; I will continue with @ajtulloch's NCHWc rebase.