Fix kernel_size != 3 on winograd arm/mali. #3642

cbalint13 · 2019-07-28T00:50:30Z

This PR fix/enable cases for kernel_size other than 3x3.

It is a continuation on Add Winograd matrices computation. #3553 work.
Fix testcases for arm / mali and also [TEST][FLAKY] test_topi_conv2d_winograd.test_conv2d_nchw #3641 .

@tmoreau89 , @merrymercy , @kevinthesun , @FrozenGene
Please help with the review.

Main benefit:
We have flexible tile_size and any kernel_size for winograd.

Examples (yolov3-tiny) of having tuned tile_size:

original (only direct)
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {53.46 GFLOPS /direct}
[Task 14/20 (1, 512, 13, 13)|(1024, 512, 3, 3)] (conv2d) {31.17 GFLOPS /direct}
[Task 15/20 (1, 256, 13, 13)|(512, 256, 3, 3)] (conv2d) {28.77 GFLOPS /direct}

original (no tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {93.78 GFLOPS /winograd} tile_size=2
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {93.31 GFLOPS /winograd} tile_size=2
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {81.34 GFLOPS /winograd} tile_size=2

proposed (with tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {98.48 GFLOPS /winograd} tile_size=4
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {106.15 GFLOPS /winograd} tile_size=4
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {84.54 GFLOPS /winograd} tile_size=4

It is quite tedious to expose more benefits (WiP on my side) for tuned tile_size, these 3 was found running >1 week on mali. That's all what I personally found up to this day, i am sure there are more, will try soon 5x5 & 7x7 kernels (Unet, SuperRes), and post on tophub exceptional ones.

tmoreau89 · 2019-07-28T00:52:29Z

Is there a reason why we are disabling cuda tests for now?

cbalint13 · 2019-07-28T00:56:54Z

Is there a reason why we are disabling cuda tests for now?

(kernel_size > 3) was running on cuda only.
now with this runs on all targeted arch.

(kernel_size > 3) is new for our winograd, since its a larger kernel the accuracy may also drop, interesting that wasn't catched before (i think memoization data was changed).

cbalint13 · 2019-07-28T05:05:37Z

@merrymercy , @kevinthesun , @FrozenGene

Will update on tophub .
Allow a little time for tophub update (need to fill tile_size), until then this PR is WiP.

topi/python/topi/arm_cpu/conv2d.py

topi/python/topi/mali/conv2d.py

cbalint13 · 2019-07-29T14:27:25Z

topi/python/topi/nn/winograd_util.py

+    sizes = [2,]
+    # 2 always present
+    for s in range(3, 9):
+    # overlap less then half


@merrymercy , @FrozenGene ,
What is your opinion on this approach to construct tile_size ?

I have no strong preference what way should we do. If we think it is a bit trick, we could add comment explaining what happened here, then I think it is enough.

I don't have experiences tuning all these tile sizes.
To keep the space small, you can try to prune more candidates that are very likely not to be the optimal ones according to your tuning results.

@merrymercy ,

Inappropriate candidates will be dropped out early by xgb, they don't add overhead or lengthen the search time by much. There can be surprises especially picking e.g. tile_size = 3 sometimes.

Actual code cover well 2 & 4 cases (tests shows also 3 have good results). But we can be more precise only if narrow down space of P in accord with tile_size as you mentioned, or let xgb choose its way.

cbalint13 · 2019-07-29T14:41:38Z

@merrymercy , @kevinthesun , @FrozenGene

Will update on tophub .
Allow a little time for tophub update (need to fill tile_size), until then this PR is WiP.

It's done, review ready at tlc-pack/tophub#8 .

python/tvm/autotvm/tophub.py

merrymercy · 2019-07-31T14:06:26Z

Aha... I remember why I don't include tile_size as a tunable knob. This is not the xgb vs xgb_knob problem. It is because the search spaces for different tile sizes are different. But AutoTVM does not support this kind of space definition.

There are some knobs for tiling in the search space. However, they are dependent on tile size. For example, in the arm cpu's schedule, https://github.com/dmlc/tvm/blob/36702a7678573fa4ff9ea04c3920e60d590496c9/topi/python/topi/arm_cpu/conv2d.py#L390-L392. P is the number of winograd tiles. In one case, P is 196 if tile size is 4 while is 784 if tile size is 2. We construct the split knob according to the lengths of P, so different tile sizes will have different spaces.

Currently, AutoTVM does not support dependent relationships between tuning knobs. What it does is picking the first candidate of tile size (which is 2 according to your construction) and initializing other knobs. In the above case, it will enumerate all split schemes for P = 784. But some split schemes for P = 784 are invalid for P = 196, which will cause the AutoTVM hangs forever during feature extraction.

Ideal solution

Support hierarchical space definition in AutoTVM

Workaround

Fix the feature extraction. We hang on this line https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L311 because we get exceptions when running this function https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L328
Use feature_type = knob in the XGBTuner. I won't merge this solution to master.

You don't meet the problem when tuning for CUDA and mali because I added some restrictions to their schedule templates and luckily they work around this problem. But for arm cpu we will meet this problem.

Let me know if you want to try to fix it or want more information from me. Otherwise, I can fix it later.

cbalint13 · 2019-07-31T14:56:52Z

@merrymercy ,

Aha... I remember why I don't include tile_size as a tunable knob. This is not the xgb vs xgb_knob problem. It is because the search spaces for different tile sizes are different. But AutoTVM does not support this kind of space definition.

I was driven by the idea that too simple formula like (tile_size = 4 if H % 4 == 0 else 2) can be rather a simple knob with a small list. Now we afford it with flexible winograd matrix computation.
As result fact, I posted in the header here, knob found better solution overriding mentioned static formula (2 vs 4 tile_size).

There are some knobs for tiling in the search space. However, they are dependent on tile size. For example, in the arm cpu's schedule,

https://github.com/dmlc/tvm/blob/36702a7678573fa4ff9ea04c3920e60d590496c9/topi/python/topi/arm_cpu/conv2d.py#L390-L392

. P is the number of winograd tiles. In one case, P is 196 if tile size is 4 while is 784 if tile size is 2. We construct the split knob according to the lengths of P, so different tile sizes will have different spaces.

I also found this correlation and dependency to P in case of our arm/mali.
Yes i agree, that is also my finding during tests regarding P. Would be great help me elaborate an exact relationship tile_size -> P . Otherwise, if we don't narrow down the possible lists, we just can enumerate bulk range(0,256) for some of knob's there and let xgb find it out but it wold have to search in 10e+7 cfg.space than.
Again, as fact, even ignoring this dependency for now we are still better than existing formula.
In fact, for now, solutions with this knob will be 2 or 4 (very rare 3 hits) in most cases.

Currently, AutoTVM does not support dependent relationships between tuning knobs. What it does is picking the first candidate of tile size (which is 2 according to your construction) and initializing other knobs. In the above case, it will enumerate all split schemes for P = 784. But some split schemes for P = 784 are invalid for P = 196, which will cause the AutoTVM hangs forever during feature extraction.

So, could help me find the relation of P given tile_size ?

Ideal solution

Support hierarchical space definition in AutoTVM

Workaround

Fix the feature extraction. We hang on this line https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L311
because we get exceptions when running this function https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L328

Use feature_type = knob in the XGBTuner. I won't merge this solution to master.

OK, i put xgb issue on higher pripority, it is trigerred 100% on such scenarios like [X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions #2111

You don't meet the problem when tuning for CUDA and mali because I added some restrictions to their schedule templates and luckily they work around this problem. But for arm cpu we will meet this problem.

Where are those restrictions exactly ?

Let me know if you want to try to fix it or want more information from me. Otherwise, I can fix it later.

Yes i give it a try to fix it, I'll update you on progress.

To sum up lets settle for now the order :

For now, as mentioned & tested, our PR is still better than existing two cased conditional.
Try find xgb fea issue and repair it.
Lets see a smarter P given tile_size (please help me elaborate a generating formula).
If get tile_size->P we can try introduce hierarchical knob mode (i have an simple idea for that).

Arguing by point 0. , the PR would be mergeable, otherwise we transform this PR in a long multi PR.
Agree 1 + 2 + 3 are mandatory, i would help tackle each in upcoming another 3 PR (there will be too much debate if we keep in a single PR like this one).

What do you think ?

cbalint13 · 2019-07-31T15:30:02Z

Opened a disscuss topic and started first with xgb & hierarchical knob issue.

merrymercy · 2019-08-01T13:44:31Z

#3687 #3689 are necessary to merge this PR. Otherwise this PR will make tuning on arm cpu fails.

I can merge this after the above two are merged.

cbalint13 · 2019-08-01T14:12:38Z

#3687 #3689 are necessary to merge this PR. Otherwise this PR will make tuning on arm cpu fails.

I can merge this after the above two are merged.

@merrymercy ,

May for maliexpand other four vars in this PR , namelyyt, t1, t2 and unroll see here ?
Or lets do in other PR where I can show up with retuned/enhanced TOPI entries with comparing in numbers ?

merrymercy · 2019-08-04T08:49:56Z

Currently, the spaces of P and other knobs that are dependent on tile_size are created with tile_size=2
Should we modify the spaces of these knobs to be the union of all valid values or wait until you implement the conditional knob?

cbalint13 · 2019-08-06T06:27:53Z

Currently, the spaces of P and other knobs that are dependent on tile_size are created with tile_size=2
Should we modify the spaces of these knobs to be the union of all valid values or wait until you implement the conditional knob?

I would implement conditional/hierarchical knob, partially done, need to validate/test. Allow a little more time, will post and reference it here.

tmoreau89 · 2019-08-06T21:07:58Z

Note that #3717 is also proposing to update CUDA TOPHUB version

tqchen · 2019-09-13T20:45:39Z

ping @cbalint13 can you update the PR?

tqchen · 2019-10-10T20:08:07Z

ping @cbalint13

cbalint13 · 2019-10-27T09:12:54Z

I close this for now, in order to complete the following is needed
1. Conditinal-Knob , a knob that hierarchicaly depend on other knob
  - implies that list's combinatoric space will be not fixed (not an easy way, triky to implement)
2. Once such a knob is implement revise winograd schedule schema for <>3, in this very PR.

cbalint13 mentioned this pull request Jul 28, 2019

Hotfix for issue #3641. #3644

Merged

cbalint13 mentioned this pull request Jul 28, 2019

Add tile_size knob for winograd #3642. tlc-pack/tophub#8

Closed

cbalint13 closed this Jul 28, 2019

cbalint13 reopened this Jul 28, 2019

FrozenGene reviewed Jul 29, 2019

View reviewed changes

topi/python/topi/arm_cpu/conv2d.py Outdated Show resolved Hide resolved

FrozenGene reviewed Jul 29, 2019

View reviewed changes

topi/python/topi/mali/conv2d.py Show resolved Hide resolved

merrymercy self-assigned this Jul 29, 2019

cbalint13 commented Jul 29, 2019

View reviewed changes

tmoreau89 reviewed Jul 31, 2019

View reviewed changes

python/tvm/autotvm/tophub.py Show resolved Hide resolved

merrymercy requested review from merrymercy and removed request for merrymercy July 31, 2019 09:09

Fix kernel_size != 3 on winograd arm/mali.

cdcd0e1

jroesch added the status: need update need update based on feedbacks label Sep 1, 2019

cbalint13 closed this Oct 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix kernel_size != 3 on winograd arm/mali. #3642

Fix kernel_size != 3 on winograd arm/mali. #3642

cbalint13 commented Jul 28, 2019 •

edited

Loading

tmoreau89 commented Jul 28, 2019

cbalint13 commented Jul 28, 2019

cbalint13 commented Jul 28, 2019

cbalint13 Jul 29, 2019

FrozenGene Jul 30, 2019

merrymercy Jul 31, 2019 •

edited

Loading

cbalint13 Jul 31, 2019

cbalint13 commented Jul 29, 2019

merrymercy commented Jul 31, 2019 •

edited

Loading

cbalint13 commented Jul 31, 2019 •

edited

Loading

Ideal solution

Workaround

cbalint13 commented Jul 31, 2019

merrymercy commented Aug 1, 2019 •

edited

Loading

cbalint13 commented Aug 1, 2019 •

edited

Loading

merrymercy commented Aug 4, 2019 •

edited

Loading

cbalint13 commented Aug 6, 2019

tmoreau89 commented Aug 6, 2019

tqchen commented Sep 13, 2019

tqchen commented Oct 10, 2019

cbalint13 commented Oct 27, 2019

Fix kernel_size != 3 on winograd arm/mali. #3642

Fix kernel_size != 3 on winograd arm/mali. #3642

Conversation

cbalint13 commented Jul 28, 2019 • edited Loading

tmoreau89 commented Jul 28, 2019

cbalint13 commented Jul 28, 2019

cbalint13 commented Jul 28, 2019

cbalint13 Jul 29, 2019

Choose a reason for hiding this comment

FrozenGene Jul 30, 2019

Choose a reason for hiding this comment

merrymercy Jul 31, 2019 • edited Loading

Choose a reason for hiding this comment

cbalint13 Jul 31, 2019

Choose a reason for hiding this comment

cbalint13 commented Jul 29, 2019

merrymercy commented Jul 31, 2019 • edited Loading

Ideal solution

Workaround

cbalint13 commented Jul 31, 2019 • edited Loading

Ideal solution

Workaround

cbalint13 commented Jul 31, 2019

merrymercy commented Aug 1, 2019 • edited Loading

cbalint13 commented Aug 1, 2019 • edited Loading

merrymercy commented Aug 4, 2019 • edited Loading

cbalint13 commented Aug 6, 2019

tmoreau89 commented Aug 6, 2019

tqchen commented Sep 13, 2019

tqchen commented Oct 10, 2019

cbalint13 commented Oct 27, 2019

cbalint13 commented Jul 28, 2019 •

edited

Loading

merrymercy Jul 31, 2019 •

edited

Loading

merrymercy commented Jul 31, 2019 •

edited

Loading

cbalint13 commented Jul 31, 2019 •

edited

Loading

merrymercy commented Aug 1, 2019 •

edited

Loading

cbalint13 commented Aug 1, 2019 •

edited

Loading

merrymercy commented Aug 4, 2019 •

edited

Loading