Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kernel_size != 3 on winograd arm/mali. #3642

Closed
wants to merge 1 commit into from
Closed

Fix kernel_size != 3 on winograd arm/mali. #3642

wants to merge 1 commit into from

Conversation

cbalint13
Copy link
Contributor

@cbalint13 cbalint13 commented Jul 28, 2019

This PR fix/enable cases for kernel_size other than 3x3.

@tmoreau89 , @merrymercy , @kevinthesun , @FrozenGene
Please help with the review.


Main benefit:
We have flexible tile_size and any kernel_size for winograd.


Examples (yolov3-tiny) of having tuned tile_size:

original (only direct)
[Task 13/20 (1, 384, 26, 26)|(256, 384, 3, 3)] (conv2d) {53.46 GFLOPS /direct}
[Task 14/20 (1, 512, 13, 13)|(1024, 512, 3, 3)] (conv2d) {31.17 GFLOPS /direct}
[Task 15/20 (1, 256, 13, 13)|(512, 256, 3, 3)] (conv2d) {28.77 GFLOPS /direct}

original (no tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {93.78 GFLOPS /winograd} tile_size=2
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {93.31 GFLOPS /winograd} tile_size=2
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {81.34 GFLOPS /winograd} tile_size=2

proposed (with tile_size autotune)
[Task 13/20 (1, 384, 26, 26, 'float16')] (conv2d) {98.48 GFLOPS /winograd} tile_size=4
[Task 14/20 (1, 512, 13, 13, 'float16')] (conv2d) {106.15 GFLOPS /winograd} tile_size=4
[Task 15/20 (1, 256, 13, 13, 'float16')] (conv2d) {84.54 GFLOPS /winograd} tile_size=4

It is quite tedious to expose more benefits (WiP on my side) for tuned tile_size, these 3 was found running >1 week on mali. That's all what I personally found up to this day, i am sure there are more, will try soon 5x5 & 7x7 kernels (Unet, SuperRes), and post on tophub exceptional ones.

@tmoreau89
Copy link
Contributor

Is there a reason why we are disabling cuda tests for now?

@cbalint13
Copy link
Contributor Author

Is there a reason why we are disabling cuda tests for now?

  • (kernel_size > 3) was running on cuda only.
  • now with this runs on all targeted arch.

(kernel_size > 3) is new for our winograd, since its a larger kernel the accuracy may also drop, interesting that wasn't catched before (i think memoization data was changed).

@cbalint13
Copy link
Contributor Author

@merrymercy , @kevinthesun , @FrozenGene

Will update on tophub .
Allow a little time for tophub update (need to fill tile_size), until then this PR is WiP.

@merrymercy merrymercy self-assigned this Jul 29, 2019
sizes = [2,]
# 2 always present
for s in range(3, 9):
# overlap less then half
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merrymercy , @FrozenGene ,
What is your opinion on this approach to construct tile_size ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong preference what way should we do. If we think it is a bit trick, we could add comment explaining what happened here, then I think it is enough.

Copy link
Member

@merrymercy merrymercy Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have experiences tuning all these tile sizes.
To keep the space small, you can try to prune more candidates that are very likely not to be the optimal ones according to your tuning results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merrymercy ,

  • Inappropriate candidates will be dropped out early by xgb, they don't add overhead or lengthen the search time by much. There can be surprises especially picking e.g. tile_size = 3 sometimes.
  • Actual code cover well 2 & 4 cases (tests shows also 3 have good results). But we can be more precise only if narrow down space of P in accord with tile_size as you mentioned, or let xgb choose its way.

@cbalint13
Copy link
Contributor Author

@merrymercy , @kevinthesun , @FrozenGene

Will update on tophub .
Allow a little time for tophub update (need to fill tile_size), until then this PR is WiP.

It's done, review ready at tlc-pack/tophub#8 .

@merrymercy merrymercy requested review from merrymercy and removed request for merrymercy July 31, 2019 09:09
@merrymercy
Copy link
Member

merrymercy commented Jul 31, 2019

Aha... I remember why I don't include tile_size as a tunable knob. This is not the xgb vs xgb_knob problem. It is because the search spaces for different tile sizes are different. But AutoTVM does not support this kind of space definition.

There are some knobs for tiling in the search space. However, they are dependent on tile size. For example, in the arm cpu's schedule, https://github.com/dmlc/tvm/blob/36702a7678573fa4ff9ea04c3920e60d590496c9/topi/python/topi/arm_cpu/conv2d.py#L390-L392. P is the number of winograd tiles. In one case, P is 196 if tile size is 4 while is 784 if tile size is 2. We construct the split knob according to the lengths of P, so different tile sizes will have different spaces.

Currently, AutoTVM does not support dependent relationships between tuning knobs. What it does is picking the first candidate of tile size (which is 2 according to your construction) and initializing other knobs. In the above case, it will enumerate all split schemes for P = 784. But some split schemes for P = 784 are invalid for P = 196, which will cause the AutoTVM hangs forever during feature extraction.

Ideal solution

Support hierarchical space definition in AutoTVM

Workaround

  1. Fix the feature extraction. We hang on this line https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L311 because we get exceptions when running this function https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L328
  2. Use feature_type = knob in the XGBTuner. I won't merge this solution to master.

You don't meet the problem when tuning for CUDA and mali because I added some restrictions to their schedule templates and luckily they work around this problem. But for arm cpu we will meet this problem.

Let me know if you want to try to fix it or want more information from me. Otherwise, I can fix it later.

@cbalint13
Copy link
Contributor Author

cbalint13 commented Jul 31, 2019

@merrymercy ,

Aha... I remember why I don't include tile_size as a tunable knob. This is not the xgb vs xgb_knob problem. It is because the search spaces for different tile sizes are different. But AutoTVM does not support this kind of space definition.

  1. I was driven by the idea that too simple formula like (tile_size = 4 if H % 4 == 0 else 2) can be rather a simple knob with a small list. Now we afford it with flexible winograd matrix computation.
  2. As result fact, I posted in the header here, knob found better solution overriding mentioned static formula (2 vs 4 tile_size).

There are some knobs for tiling in the search space. However, they are dependent on tile size. For example, in the arm cpu's schedule,

https://github.com/dmlc/tvm/blob/36702a7678573fa4ff9ea04c3920e60d590496c9/topi/python/topi/arm_cpu/conv2d.py#L390-L392

. P is the number of winograd tiles. In one case, P is 196 if tile size is 4 while is 784 if tile size is 2. We construct the split knob according to the lengths of P, so different tile sizes will have different spaces.

  • I also found this correlation and dependency to P in case of our arm/mali.

  • Yes i agree, that is also my finding during tests regarding P. Would be great help me elaborate an exact relationship tile_size -> P . Otherwise, if we don't narrow down the possible lists, we just can enumerate bulk range(0,256) for some of knob's there and let xgb find it out but it wold have to search in 10e+7 cfg.space than.

  • Again, as fact, even ignoring this dependency for now we are still better than existing formula.

  • In fact, for now, solutions with this knob will be 2 or 4 (very rare 3 hits) in most cases.

Currently, AutoTVM does not support dependent relationships between tuning knobs. What it does is picking the first candidate of tile size (which is 2 according to your construction) and initializing other knobs. In the above case, it will enumerate all split schemes for P = 784. But some split schemes for P = 784 are invalid for P = 196, which will cause the AutoTVM hangs forever during feature extraction.

  • So, could help me find the relation of P given tile_size ?

Ideal solution

Support hierarchical space definition in AutoTVM

Workaround

  1. Fix the feature extraction. We hang on this line https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L311
    because we get exceptions when running this function https://github.com/dmlc/tvm/blob/11da1ca31b8121e9a40c70f33ade27670d22e2d0/python/tvm/autotvm/tuner/xgboost_cost_model.py#L328
  2. Use feature_type = knob in the XGBTuner. I won't merge this solution to master.

You don't meet the problem when tuning for CUDA and mali because I added some restrictions to their schedule templates and luckily they work around this problem. But for arm cpu we will meet this problem.

  • Where are those restrictions exactly ?

Let me know if you want to try to fix it or want more information from me. Otherwise, I can fix it later.

Yes i give it a try to fix it, I'll update you on progress.


To sum up lets settle for now the order :

  1. For now, as mentioned & tested, our PR is still better than existing two cased conditional.
  2. Try find xgb fea issue and repair it.
  3. Lets see a smarter P given tile_size (please help me elaborate a generating formula).
  4. If get tile_size->P we can try introduce hierarchical knob mode (i have an simple idea for that).

Arguing by point 0. , the PR would be mergeable, otherwise we transform this PR in a long multi PR.
Agree 1 + 2 + 3 are mandatory, i would help tackle each in upcoming another 3 PR (there will be too much debate if we keep in a single PR like this one).

What do you think ?


@cbalint13
Copy link
Contributor Author

Opened a disscuss topic and started first with xgb & hierarchical knob issue.

@merrymercy
Copy link
Member

merrymercy commented Aug 1, 2019

#3687 #3689 are necessary to merge this PR. Otherwise this PR will make tuning on arm cpu fails.

I can merge this after the above two are merged.

@cbalint13
Copy link
Contributor Author

cbalint13 commented Aug 1, 2019

#3687 #3689 are necessary to merge this PR. Otherwise this PR will make tuning on arm cpu fails.

I can merge this after the above two are merged.

@merrymercy ,

May for maliexpand other four vars in this PR , namelyyt, t1, t2 and unroll see here ?
Or lets do in other PR where I can show up with retuned/enhanced TOPI entries with comparing in numbers ?

@merrymercy
Copy link
Member

merrymercy commented Aug 4, 2019

Currently, the spaces of P and other knobs that are dependent on tile_size are created with tile_size=2
Should we modify the spaces of these knobs to be the union of all valid values or wait until you implement the conditional knob?

@cbalint13
Copy link
Contributor Author

Currently, the spaces of P and other knobs that are dependent on tile_size are created with tile_size=2
Should we modify the spaces of these knobs to be the union of all valid values or wait until you implement the conditional knob?

  • I would implement conditional/hierarchical knob, partially done, need to validate/test. Allow a little more time, will post and reference it here.

@tmoreau89
Copy link
Contributor

Note that #3717 is also proposing to update CUDA TOPHUB version

@jroesch jroesch added the status: need update need update based on feedbacks label Sep 1, 2019
@tqchen
Copy link
Member

tqchen commented Sep 13, 2019

ping @cbalint13 can you update the PR?

@tqchen
Copy link
Member

tqchen commented Oct 10, 2019

ping @cbalint13

@cbalint13
Copy link
Contributor Author

  • I close this for now, in order to complete the following is needed

    1. Conditinal-Knob , a knob that hierarchicaly depend on other knob
      • implies that list's combinatoric space will be not fixed (not an easy way, triky to implement)
    2. Once such a knob is implement revise winograd schedule schema for <>3, in this very PR.

@cbalint13 cbalint13 closed this Oct 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: need update need update based on feedbacks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants