Add Winograd matrices computation. #3553
@merrymercy @ZihengJiang @masahi @Laurawly
topi/python/topi/arm_cpu/conv2d.py
Outdated
@@ -333,54 +332,11 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
Do we still check that the kernel size is 3?
In fact, HPAD / WPAD shouldn't be checked either. Consider the case where we have no padding.
- The HPAD / WPAD asserts are removed; we assert KW == KH instead (in all places).
- The {KW, KH} == {3, 3} assert is removed; we assert kernel_size > 1 instead (in winograd_utils.py).
Can we move the assertions regarding whether winograd is valid to the same place?
- Winograd-matrix-related (in)validity is guarded in one single place, which is already done here. Taking it out of there would mean maintaining it in multiple unrelated places (each conv2d file).
- Other (in)validity checks that are purely conv related (padding, stride, dilation, etc.) are fine where they are now (the separate, dedicated conv2d.py files for the targeted arches: arm_cpu, mali, cuda, etc.).
I would keep this PR simple: the intention for now is only to replace the winograd matrices, so let's not touch functionality or other existing limitations too much.
I'll keep the suggestions here in mind. The next PR will be about enhancements around cfg.space (especially arm/mali), which will be a good time to revisit the limitations.
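To make the split discussed above concrete, here is a minimal sketch of how the two kinds of checks could be separated. The function names are hypothetical and are not the ones used in this PR; only the asserted conditions (kernel_size > 1, KH == KW, unit strides) come from the thread.

```python
def check_winograd_matrix_args(kernel_size):
    """Matrix-related validity, kept in one shared place (hypothetical helper)."""
    assert kernel_size > 1, "winograd needs kernel_size > 1"


def check_conv2d_winograd_args(KH, KW, HSTR, WSTR):
    """Conv-related validity, kept in each target's conv2d.py (hypothetical helper)."""
    assert KH == KW, "only square kernels are handled"
    assert HSTR == 1 and WSTR == 1, "only unit strides are handled"
    # Delegate everything that concerns the transform matrices themselves.
    check_winograd_matrix_args(KH)
```

With this layout each target file only repeats the conv-specific conditions, while the matrix-specific condition lives next to the matrix generator.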
cc @ajtulloch
The specific choice of interpolation points substantially changes the approximations - have you validated that the proposed routines don't affect accuracy vs the tuned ones we currently have? cc @zlateski who was looking at this stuff recently.
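For readers who want to probe this question themselves, here is a rough 1D sketch. It is not the wincnn-based routine from this PR: it builds unscaled Toom-Cook matrices directly from Vandermonde evaluation matrices (with the point at infinity appended), then measures the float32 error of F(m, r) against a float64 direct correlation. All function names are made up for illustration, and the random Gaussian inputs are an assumption.

```python
import numpy as np


def eval_matrix(points, ncols):
    """Evaluate a polynomial with `ncols` coefficients at each finite point;
    the last row stands for the point at infinity (leading coefficient)."""
    finite = np.vander(np.asarray(points, dtype=np.float64), ncols, increasing=True)
    inf_row = np.zeros((1, ncols))
    inf_row[0, -1] = 1.0
    return np.vstack([finite, inf_row])


def toom_cook_matrices(m, r, points):
    """Return (A_T, G, B_T) for F(m, r) from n - 1 distinct finite points."""
    n = m + r - 1
    assert len(points) == n - 1 and len(set(points)) == n - 1
    A_T = eval_matrix(points, m).T                  # m x n
    G = eval_matrix(points, r)                      # n x r
    B_T = np.linalg.inv(eval_matrix(points, n)).T   # n x n
    return A_T, G, B_T


def winograd_1d(d, g, A_T, G, B_T, dtype):
    """y = A_T [(G g) * (B_T d)], computed entirely in `dtype`."""
    A_T, G, B_T = (x.astype(dtype) for x in (A_T, G, B_T))
    return A_T @ ((G @ g.astype(dtype)) * (B_T @ d.astype(dtype)))


def max_error(m, r, points, trials=1000, seed=0):
    """Worst absolute float32 error over random Gaussian inputs."""
    rng = np.random.default_rng(seed)
    A_T, G, B_T = toom_cook_matrices(m, r, points)
    worst = 0.0
    for _ in range(trials):
        d = rng.standard_normal(m + r - 1)
        g = rng.standard_normal(r)
        ref = np.correlate(d, g, mode="valid")      # float64 reference
        out = winograd_1d(d, g, A_T, G, B_T, np.float32)
        worst = max(worst, float(np.max(np.abs(out - ref))))
    return worst
```

Calling `max_error(4, 3, [0, 1, -1, 2, -2])` and then again with another candidate point set gives a quick, crude comparison. It says nothing about accuracy on real network data, so treat it as a sanity check only.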
I think some hard-coding will be unavoidable :)
Some clarifications: SHORT: We replace the current hard-coded matrices with the same ones, but this time they are generated. LONG:
FUTURE: There will be more fun when we want winograd over int8, or even sparse winograd with int8. I am trying to invite @andravin (not sure he will see this) to get his opinion on choosing our anchor points (maybe even auto-tuning them in the future). Looking forward to more suggestions.
Hi @cbalint13, the error analysis in the paper that @zlateski pointed to looks pretty rigorous, I would start there. I am a little bit curious what you plan to use the large tiles for. Today's devices have high compute / memory bandwidth ratios and low numeric precision, neither of which is conducive to a large tile size. I think the large tiles only pay off when you have an excess of DRAM memory bandwidth, or a large fast cache.
Hi @andravin, thanks for joining the thread!
The most interesting part would be targeting int8 or other quantization variants, so I think a matrix generator (maybe with some post "precision tuner") would be a good ally for winograd-class schedules done in the TVM fashion. I would also remind everyone of a more exotic approach that even uses complex numbers as interpolation points, here. Even such an approach could be accommodated in the future. (Apologies for possible redundancy and the length of this info.)
To avoid confusion, here is a summary of what we have in this PR up to this point:
Plus, some doors are open from here on.
This looks good to me - I forgot that this is only for One thing I'd personally like is to add a test that verifies the explicit matrix values for m == 2/4, so they're explicitly specified somewhere? BTW, for the m=6 case, I got them from NNPACK via Maratyszcza/NNPACK#8. |
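One possible shape for such a test, sketched for the m == 2 case only: pin the widely published F(2, 3) matrices from Lavin & Gray as explicit constants, compare whatever the generator returns against them, and also check that they reproduce a direct correlation. The `check_f23` helper and the constant names are hypothetical, and I am assuming the generator returns (A_T, G, B_T) in this orientation, which may not match the actual API.

```python
import numpy as np

# F(2, 3) transform matrices as published by Lavin & Gray, pinned explicitly.
B_T_REF = np.array([[1, 0, -1, 0],
                    [0, 1, 1, 0],
                    [0, -1, 1, 0],
                    [0, 1, 0, -1]], dtype=np.float64)
G_REF = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0, 0.0, 1.0]], dtype=np.float64)
A_T_REF = np.array([[1, 1, 1, 0],
                    [0, 1, -1, -1]], dtype=np.float64)


def check_f23(A_T, G, B_T, atol=1e-6):
    """Fail if generated F(2, 3) matrices differ from the pinned values."""
    np.testing.assert_allclose(A_T, A_T_REF, atol=atol)
    np.testing.assert_allclose(G, G_REF, atol=atol)
    np.testing.assert_allclose(B_T, B_T_REF, atol=atol)


def f23(d, g):
    """1D F(2, 3): two outputs from a 4-element tile and a 3-tap filter."""
    return A_T_REF @ ((G_REF @ g) * (B_T_REF @ d))
```

Note that a generator can return matrices that differ from these by a per-row scaling shared between G and B_T and still be correct, so a functional check like `f23` versus `np.correlate(d, g, mode="valid")` is a useful complement to pinning values.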
@cbalint13 and @ajtulloch you might want to try the points that Barabasz et al. say give the best accuracy:
Choices that are a bit unusual are marked with *.
I remember searching for F(4x4, 3x3) output points that give the best accuracy a while ago, and did see improvements with non-obvious choices, but the differences were not huge: NervanaSystems/neon#224
I am not sure I agree, though, with how the paper modeled the input and filters with a uniform distribution. In practice, input data is Gaussian or sub-Gaussian and highly correlated. Anyway, this observation made me dubious about modifying algorithms to make them more accurate on average for a random distribution that might not reflect the data they will be used with in practice. I think a proper error analysis would be measured in ResNet-34 or ResNet-50 using actual weights and real images.
Also, here are some optimizations that are often overlooked.
Sorry this is so long. I did not have the time to make it shorter. ;-)
Thanks for the great discussions. @cbalint13 it would be great if we can push forward and act on this PR by summarizing the discussions and proposing a list of changes. Then we can get everyone to review and approve.
@cbalint13 please request reviews by tagging people (likely those who are in this conversation) when it is ready.
@ajtulloch, @zlateski, I ran tests comparing the old matrix set with the newly proposed scheme:
@merrymercy, @kevinthesun, @FrozenGene, @tqchen, to sum up:
Immediate future step (not intended for this PR):
lgtm
Just one comment, otherwise lgtm.
    return np.array(in_pts[degree-1], dtype=np.float64)

def winograd_transform_matrices(tile_size, kernel_size, out_dtype):
I'm not that familiar with the whole codebase, so the following comment might be irrelevant.
This function performs non-trivial compute, which should probably be memoized on some level.
@zlateski, I am not sure how to do the memoization. We have a @memoize decorator, but it is used only in test scripts so far, and as far as I can see it can only memoize functions that take no arguments.
@zlateski, also, even if we use @memoize I can't see much gain (it is backed by a pickle file, after all); I think the IO overhead would be greater than the compute for such small data.
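As a possible alternative to a pickle-backed cache, an in-process cache keyed on the arguments avoids the IO entirely. Below is a minimal sketch using the standard library's `functools.lru_cache`; this is not the existing @memoize decorator, and `cached_matrices` with its Vandermonde-inverse body is only a hypothetical stand-in for the real generator. It assumes all arguments are hashable (ints and a dtype string).

```python
import functools

import numpy as np


@functools.lru_cache(maxsize=None)
def cached_matrices(tile_size, kernel_size, out_dtype):
    """Stand-in for an expensive matrix generator, cached per argument tuple."""
    degree = tile_size + kernel_size - 1
    pts = np.arange(degree, dtype=np.float64)
    # Placeholder compute: invert a small Vandermonde matrix.
    mat = np.linalg.inv(np.vander(pts, degree, increasing=True)).astype(out_dtype)
    # The cached array is shared between callers, so make it read-only.
    mat.setflags(write=False)
    return mat
```

Repeated calls such as `cached_matrices(4, 3, "float32")` then return the same array object without recomputing. The cache lives only for the lifetime of the process, which seems acceptable for data this small.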
At least, from multiple boxed rectangular matrices we ended up with a nice single triangular lookup table :)
topi/python/topi/arm_cpu/conv2d.py
Outdated
@@ -330,57 +329,14 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
 HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)

 assert layout == 'NCHW'
-assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
+assert KH == KW and HSTR == 1 and WSTR == 1
Add test case for this change?
Other parts lgtm.
@ajtulloch Thank you all for the help!
This patch adds routines that compute the Winograd (Cook-Toom) matrices as described by A. Lavin & S. Gray, based on Andrew Lavin's original wincnn implementation. It also removes the existing hard-coded, redundant matrices.
It opens the door to:
Once this PR is merged, I will follow up with more winograd enhancements.