[X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions #2111
Conversation
Force-pushed from c316c45 to 163380c.
@ajtulloch you don't need another symbol. You shouldn't need to change anything in nnvm to enable winograd for x86.
@masahi how can I avoid adding new ops? I agree that we could probably remove some of them, but just as NCHW Winograd convolution required new NNVM ops, and NCHWc convolution required new NNVM ops, we need new ops for NCHWc Winograd AFAICT.
Force-pushed from 91a968e to 327c7b3.
Have you tried it? As long as you have a correct setup in _alter_conv2d_layout, I think it should work. For the filter shape, the current solution in nnvm is not to do a weight shape check during infer shape (see here). So you can use the filter transform with any layout.
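(For context, a rough sketch of what such an _alter_conv2d_layout setup could look like against the NNVM-era API. This is not code from this PR; the registration target, attribute handling, and tile size are illustrative assumptions:)

# Illustrative sketch only (not this PR's code): swap an eligible 3x3 conv2d
# for the pre-transformed Winograd op during alter_op_layout, so no extra
# NNVM symbol is strictly required. Symbol and attribute names follow the
# NNVM-era API and should be treated as assumptions.
import nnvm.symbol as sym
from topi.nn.conv2d import conv2d_alter_layout

@conv2d_alter_layout.register("cpu")
def _alter_conv2d_layout(attrs, inputs, tinfos):
    data, kernel = inputs
    if attrs.get_int_tuple("kernel_size") != (3, 3):
        return None  # leave other kernel sizes to the default lowering

    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["tile_size"] = 4
    # Transform the weight once; nnvm skips the weight shape check during
    # infer shape, so the transformed layout needs no dedicated symbol.
    weight = sym.contrib.conv2d_winograd_weight_transform(kernel, tile_size=4)
    return sym.contrib.conv2d_winograd_without_weight_transform(
        data, weight, **new_attrs)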
Force-pushed from 06d6c5d to a9bdc2e.
@masahi doesn't that logic apply to having duplicated conv2d/conv2d_NCHWc ops in NNVM as well? I don't see why it's reasonable to have separate ops in one case but not the other.
Force-pushed from a9bdc2e to 37cabe4.
I see. Honestly, I don't know why a new symbol was introduced in the first place. My opinion is that since we already have everything on the nnvm side to enable NCHWc winograd, we don't need another duplicated symbol.
@masahi I'd tend to agree, but I can also understand the appeal of being able to say 'conv2d and conv2d_winograd ops take a 4D NCHW input and produce 5D NCHWc output, conv2d_NCHWc and conv2d_NCHWc_winograd ops take a 5D NCHWc input and produce 5D NCHWc output', instead of essentially 'conv2d nodes take any dimension input and produce any dimension output'. If there's a desire from maintainers to consolidate all the convolution node types in NNVM/Relay, that's reasonable, but I'd argue it's orthogonal to this patch.
FYI, the recently added cuda int8 convolution uses the NCHW4c layout with the existing conv2d symbol, so things are not consistent already. I think it's a good time to discuss this issue as we transition to Relay. @tqchen @merrymercy
Force-pushed from 37cabe4 to 7f4b4af.
Force-pushed from 71aaacf to e0087df.
This looks great, thanks for the contribution! @ajtulloch, we will test it out from our side.
@yidawang the parallelism part isn't well optimized (we need to conditionally enable parallelism depending on the granularity at which we compute the previous stages), but I think everything else should be good to go. Note the caveats mentioned in https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/1017/16?u=ajtulloch

BTW, I don't have good heuristics for avoiding these cases right now, so maybe just brute-force disabling it for CIn x COut >= 256 * 256 or something would be reasonable.
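(A minimal sketch of the brute-force guard suggested above; the 256 * 256 threshold comes from the comment, while the helper name and everything else is illustrative:)

# Illustrative only: a crude predicate for falling back to the direct NCHWc
# schedule when the reduction is large. The 256 * 256 threshold is the one
# suggested above; the rest is an assumption, not the PR's code.
WINOGRAD_CHANNEL_LIMIT = 256 * 256

def use_nchwc_winograd(cin, cout, kh, kw):
    if (kh, kw) != (3, 3):
        return False  # this Winograd path targets 3x3 kernels
    if cin * cout >= WINOGRAD_CHANNEL_LIMIT:
        return False  # large CIn x COut: prefer the direct NCHWc schedule
    return True

# e.g. use_nchwc_winograd(64, 64, 3, 3) -> True
#      use_nchwc_winograd(512, 512, 3, 3) -> False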
@ajtulloch perhaps this paper has some insight: Optimizing N-Dimensional, Winograd-Based Convolution for Manycore CPUs.
@ajtulloch is this PR ready for review? Please request reviewers.
Force-pushed from e0087df to 7c33f51.
Just rebased. @merrymercy, @yidawang, @yzhliu, @masahi, would you be interested in reviewing these changes?
This is the implementation alluded to in https://discuss.tvm.ai/t/improved-direct-winograd-nchwc-cpu-implementation-with-resnet-50-results/

It is a pretty standard Winograd implementation, modified for NCHWc layout. It achieves reasonable speedups (up to 2x vs current implementation) on a number of ResNet 3x3 layers on SKL and AVX.

TODO: Parallelization
TODO: Benchmarking suite results on full ResNet suite.
TODO: Demonstration in `tune_nnvm_x86.py`
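(Background for readers unfamiliar with the layout: NCHWc packs the channel dimension into fixed-size blocks, turning a 4-D NCHW tensor into a 5-D (N, C//c, H, W, c) tensor whose inner block typically matches the vector width. A small NumPy sketch of the packing, purely for illustration and not part of the PR:)

# Illustration only: pack an NCHW tensor into NCHWc with block size c_block.
import numpy as np

def nchw_to_nchwc(x, c_block=8):
    n, c, h, w = x.shape
    assert c % c_block == 0, "channels must be divisible by the block size"
    # (N, C, H, W) -> (N, C//c, c, H, W) -> (N, C//c, H, W, c)
    return x.reshape(n, c // c_block, c_block, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.randn(1, 64, 56, 56).astype("float32")
print(nchw_to_nchwc(x).shape)  # (1, 8, 56, 56, 8)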
Force-pushed from 7c33f51 to c565256.
def div_round_up(a, b):
    return (a + b - 1) // b

# assert all(k == 3 for k in (KH, KW))
enable this check?
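(A sketch of what the enabled guard could look like; the wrapper function and message text are illustrative, only the assert expression comes from the snippet above:)

def check_winograd_kernel_size(KH, KW):
    # Sketch: the NCHWc Winograd kernels here target 3x3 filters, so reject
    # anything else early instead of leaving the assert commented out.
    assert all(k == 3 for k in (KH, KW)), \
        "NCHWc Winograd convolution currently supports only 3x3 kernels"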
policy='candidate', candidate=[
    [n, coo, cii, oh_m, ow_m, eps, ciii, nu, vc],
    [n, coo, oh_m, ow_m, eps, cii, ciii, nu, vc],
    # [n, coo, cii, oh_m, ow_m, ciii, nu, eps, vc],
delete these candidates?
    s[U].pragma(eps, 'debug_skip_region')
else:
    pass
    # r_kh, r_kw = s[U].op.reduce_axis
add some basic schedule here
cfg.define_knob('input_tile_compute_location', [0, 1, 2, 3])
if cfg['input_tile_compute_location'].val == 0:
This pattern appears many times. It will be good if we can wrap it into something like

cfg.define_annotate('input_tile_compute_location', [n, cii, oh_m, ow_m], policy='parallel_location')
axes = cfg['input_tile_compute_location'].apply(s, V, [n, cii, oh_m, ow_m], producer=input_tile)
I spent some time looking at how to replace the knob with something more elegant. The knobs are used in more places and the operations involved differ, so I don't see how to generalize this. The knob directs not only compute_at & vectorize, but also fuse & compute_at. If you have ideas please elaborate; otherwise, can we leave it as a knob?
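(To make the pattern concrete, a condensed, self-contained toy illustration of such a compute-location knob, written against the NNVM-era tvm/autotvm API; the compute definition, shapes, and knob name are invented and only mirror the style of the schedule above, not its actual code:)

# Toy example of the "compute-location knob" pattern debated above
# (assumes the pre-tvm.te, NNVM-era API; names and shapes are invented).
import tvm
from tvm import autotvm

@autotvm.template
def toy_schedule(N, C, H, W):
    data = tvm.placeholder((N, C, H, W), name="data")
    tile = tvm.compute((N, C, H, W), lambda *i: data(*i), name="input_tile")
    out = tvm.compute((N, C, H, W), lambda *i: tile(*i) + 1, name="out")
    s = tvm.create_schedule(out.op)

    cfg = autotvm.get_config()
    n, c, h, w = s[out].op.axis
    # The knob picks where the intermediate stage is computed; in the real
    # schedule the same knob also drives fuse/vectorize decisions, which is
    # why it is hard to wrap in a single generic annotation.
    cfg.define_knob("tile_compute_location", [0, 1, 2])
    loc = cfg["tile_compute_location"].val
    if loc == 0:
        s[tile].compute_at(s[out], w)
    elif loc == 1:
        s[tile].compute_at(s[out], h)
    else:
        s[tile].compute_root()
    return s, [data, out]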
parallel_axis = s[M].fuse(n, coo)
s[V].compute_at(s[M], ow_m)

# s[M].parallel(parallel_axis)
Uncomment this. I know that for the single-thread case this parallel annotation will produce redundant code and harm performance. I think we should always annotate parallel in the schedule, but add some checks in
https://github.com/dmlc/tvm/blob/9473dca266e307cf1f9faece219af111686ca946/python/tvm/schedule.py#L548-L552
to skip annotating parallel when targeting the single-thread case.
I believe here we can just return in the single-thread case. Could you hint at how to check this right here? From where can I extract thread_num or similar info?
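(One possible shape for such a guard, purely as a sketch: it keys off the TVM_NUM_THREADS environment variable that the runtime thread pool already honours; where an authoritative thread count should come from is exactly the open question above.)

# Sketch only: skip the parallel annotation when the runtime is pinned to a
# single thread. TVM_NUM_THREADS is read by TVM's thread pool; a real check
# inside Stage.parallel() would need an agreed-upon source for thread_num,
# which this thread leaves open.
import os

def single_threaded():
    return os.environ.get("TVM_NUM_THREADS", "") == "1"

# e.g. inside the schedule:
#     if not single_threaded():
#         s[M].parallel(parallel_axis)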
@@ -414,6 +414,83 @@ NNVM_REGISTER_OP(_contrib_conv2d_winograd_without_weight_transform)

DMLC_REGISTER_PARAMETER(WinogradConv2DParam);

NNVM_REGISTER_OP(_contrib_conv2d_NCHWc_winograd_weight_transform)
I think it is better to support the NCHWc layout in the original symbol rather than creating a new symbol, as discussed by @masahi. Registering and maintaining new symbols is harder than adding an "if" in the old symbol.
The todo item
@ajtulloch can you please act on @merrymercy's comment?
@@ -0,0 +1,202 @@

"""Example code to do convolution."""
We might also want to update the autotvm x86 tutorial to add a winograd section.
Ping @ajtulloch.
Ah sorry folks, I missed the notifications. Will absolutely address this by the end of the weekend at the latest, my apologies.
@masahi regarding the |
@ajtulloch would you like to follow up on the changes?
Ping @ajtulloch, can we land this PR for the 0.5 release?
@ajtulloch can you try to get this in? :)
@ajtulloch could you update this work and port it to Relay? I also find it gives a better performance improvement on ARM compared to the default winograd schedule. I suggest we also add it on ARM CPU, for example named 'winograd_nchwc'.
@ajtulloch ,
Ping @ajtulloch. @cbalint13, let us wait for two days; if there is no update, you are more than welcome to take over this PR.
@cbalint13 please feel free to take over this PR, as long as you acknowledge Andrew in the new one.
Superseded by #3553, thanks to @cbalint13.
@ajtulloch, @tqchen, #3553 is only a part of the integration; I will continue with @ajtulloch's NCHWc rebase.