Fix kernel_size != 3 on winograd arm/mali. #3642
Is there a reason why we are disabling cuda tests for now?
(kernel_size > 3) is new for our winograd. Since it's a larger kernel, the accuracy may also drop. Interesting that this wasn't caught before (I think the memoization data was changed).
@merrymercy , @kevinthesun , @FrozenGene Will update on tophub.
sizes = [2,]
# 2 always present
for s in range(3, 9):
    # overlap less then half
@merrymercy , @FrozenGene ,
What is your opinion on this approach to construct tile_size?
I have no strong preference on which way we should go. If we think it is a bit tricky, we could add a comment explaining what happens here; I think that is enough.
I don't have experience tuning all these tile sizes.
To keep the space small, you can try to prune more candidates that, according to your tuning results, are very unlikely to be optimal.
- Inappropriate candidates will be dropped early by xgb; they don't add overhead or lengthen the search time by much. There can be surprises, especially xgb sometimes picking e.g. tile_size = 3.
- The actual code covers the 2 & 4 cases well (tests show that 3 also gives good results). But we can be more precise only if we narrow down the space of P in accordance with tile_size, as you mentioned, or let xgb choose its way.
It's done, review ready at tlc-pack/tophub#8.
Aha... I remember why I don't include
There are some knobs for tiling in the search space. However, they depend on the tile size. For example, in the arm cpu's schedule, https://github.com/dmlc/tvm/blob/36702a7678573fa4ff9ea04c3920e60d590496c9/topi/python/topi/arm_cpu/conv2d.py#L390-L392. Currently, AutoTVM does not support dependent relationships between tuning knobs. What it does is pick the first candidate of tile size (which is 2 according to your construction) and initialize the other knobs from it. In the above case, it will enumerate all split schemes for P = 784. But some split schemes for P = 784 are invalid for P = 196, which causes AutoTVM to hang forever during feature extraction.
Ideal solution
Support hierarchical space definition in AutoTVM
Workaround
You don't hit this problem when tuning for CUDA and mali because I added some restrictions to their schedule templates, and luckily they work around it. But for arm cpu we will hit it. Let me know if you want to try to fix it or want more information from me. Otherwise, I can fix it later.
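To make the dependency between tile size and the split knobs concrete, here is a minimal pure-Python sketch. It does not use any AutoTVM API. It assumes P is the number of Winograd tiles, batch * ceil(H / tile_size) * ceil(W / tile_size), and uses a hypothetical 56x56 output feature map, which happens to reproduce the P = 784 and P = 196 values mentioned above. The helper names `num_tiles` and `split_schemes` are made up for illustration, and "invalid" is modeled simply as an inner factor that does not divide the new P.

```python
from math import ceil


def num_tiles(batch, height, width, tile_size):
    """Assumed definition: P = batch * ceil(H / m) * ceil(W / m)."""
    return batch * ceil(height / tile_size) * ceil(width / tile_size)


def split_schemes(extent):
    """All two-way splits (outer, inner) with outer * inner == extent."""
    return [(extent // f, f) for f in range(1, extent + 1) if extent % f == 0]


# Hypothetical workload: batch 1, 56x56 output feature map.
p_by_tile = {m: num_tiles(1, 56, 56, m) for m in (2, 4)}

# Splits enumerated once for the first tile size candidate (tile_size = 2).
schemes_784 = split_schemes(p_by_tile[2])

# Inner factors that no longer divide P once tile_size = 4 is selected.
invalid_inner = sorted(
    inner for _, inner in schemes_784 if p_by_tile[4] % inner != 0
)

print(p_by_tile)         # {2: 784, 4: 196}
print(len(schemes_784))  # 15
print(invalid_inner)     # [8, 16, 56, 112, 392, 784]
```

Under these assumptions, 6 of the 15 splits enumerated for tile_size = 2 cannot be applied when tile_size = 4 is chosen, which is the kind of mismatch a hierarchical space definition would avoid.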
Yes, I'll give it a try and update you on progress. To sum up, let's settle the order for now:
Arguing by point 0, the PR would be mergeable; otherwise we turn this PR into a long multi-PR effort. What do you think?
Opened a discuss topic and started first with
Currently, the spaces of
Note that #3717 is also proposing to update the CUDA TOPHUB version.
ping @cbalint13 can you update the PR?
ping @cbalint13
This PR fixes/enables cases for kernel_size other than 3x3.
@tmoreau89 , @merrymercy , @kevinthesun , @FrozenGene
Please help with the review.
Main benefit:
We have a flexible tile_size and any kernel_size for winograd.
Examples (yolov3-tiny) of having tuned tile_size:
original (only direct)
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {53.46 GFLOPS /direct}
[Task 14/20 (1, 512, 13, 13)|(1024, 512, 3, 3)] (conv2d) {31.17 GFLOPS /direct}
[Task 15/20 (1, 256, 13, 13)|(512, 256, 3, 3)] (conv2d) {28.77 GFLOPS /direct}
original (no tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {93.78 GFLOPS /winograd}
tile_size=2
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {93.31 GFLOPS /winograd}
tile_size=2
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {81.34 GFLOPS /winograd}
tile_size=2
proposed (with tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {98.48 GFLOPS /winograd}
tile_size=4
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {106.15 GFLOPS /winograd}
tile_size=4
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {84.54 GFLOPS /winograd}
tile_size=4
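As a quick way to read the numbers above, the following sketch computes the relative gain of the autotuned tile_size over the fixed tile_size=2 winograd runs. The GFLOPS values are copied from the task logs; the dictionary layout and variable names are only for illustration.

```python
# GFLOPS from the logs above: (winograd with tile_size=2, winograd with tuned tile_size=4)
results = {
    "Task 13": (93.78, 98.48),
    "Task 14": (93.31, 106.15),
    "Task 15": (81.34, 84.54),
}

# Relative gain in percent of the tuned tile_size over the fixed one.
gains = {
    task: (tuned / fixed - 1.0) * 100.0
    for task, (fixed, tuned) in results.items()
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.1f}%")
```

For these three tasks the gains come out to roughly 5%, 14% and 4% respectively.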
It is quite tedious to expose more benefits of a tuned tile_size (WIP on my side); these 3 were found after running for more than a week on mali. That's all I have personally found so far. I am sure there are more; I will soon try 5x5 & 7x7 kernels (Unet, SuperRes) and post the exceptional ones on tophub.