Add Winograd matrices computation. #3553

cbalint13 · 2019-07-16T22:28:41Z

This patch adds Winograd (Cook-Toom) matrices computation routines as described by A. Lavin & S. Gray based on original wincnn implementation of Andrew Lavin. Also removes actual hard coded redundant matrices.

It opens the door to:

Other kernel_sizes than (3x3), also more tile_sizes (1 < any < 9)
Allow adjusting interpolation points for better precision see discussion 1
Allow custom range scaling for better precision see discussion 1 and 2
Scaling ranges for quantized ops (int8), see: Efficient Winograd Convolution via Integer Arithmetic

Once this PR pass, will attempt with more winograd enhancements.

cbalint13 · 2019-07-17T02:45:41Z

@merrymercy @ZihengJiang @masahi @Laurawly ,
Please help with the review.

kevinthesun · 2019-07-17T06:33:20Z

topi/python/topi/arm_cpu/conv2d.py

@@ -333,54 +332,11 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
    assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1


We still check kernel size=3？

In fact, HPAD / WPAD shouldn't be checked too. Consider, we have no padding.

@kevinthesun , @FrozenGene ,

HPAD,WPAD asserts removed but assert KW == KH instead (in all places)

{KW, KH} == {3,3} assert removed but assert (kernel_size > 1) (in winograd_utils.py)

Can we move the assertion regrading to whether winograd is valid to the same place?

@kevinthesun ,

In one/same we guard winograd matrix related (in)validity, already done here
Taking it out from there means to maintain it in multiple unrelated places (each conv2d files).

For other (in)valid things (padding, stride, dilation etc) pure conv related things is good where they are already now (separate dedicated conv2d.py files for targeted arches: arm_cpu, mali, cuda, etc).

I would keep it simple with this PR, intention for now is to only replace the winograd matrices, let's not touch functionalities or other actual limitations too much.

I'll keep in mind the suggestions here, next PR will be about enhancement around cfg.space (especialy arm/mali), then will be good time to revisit limitations.

tqchen · 2019-07-17T20:08:12Z

cc @ajtulloch

ajtulloch · 2019-07-18T20:17:26Z

The specific choices of the interpolation points substantially changes the approximations - have you validated that the proposed routines don't affect accuracy vs the tuned ones we currently have? cc @zlateski who was looking at this stuff recently.

zlateski

I think some hard-coding will be unavoidable :)

topi/python/topi/nn/winograd_util.py

cbalint13 · 2019-07-18T20:59:30Z

The specific choices of the interpolation points substantially changes the approximations - have you validated that the proposed routines don't affect accuracy vs the tuned ones we currently have? cc @zlateski who was looking at this stuff recently.

@ajtulloch ,

Some clarifications:

SHORT: We replace actual hardcoded with same ones but this time are generated.

LONG:

Don't know from where original hard coded ones comes exactly (folks, please inform me), but for booth of them with actual tile_size = {2, 4} we replace with the very same except few irrelevant sign inversions.
Also, for these two hardcoded ones, used points to re-generate them are clear {0, 1, -1, 2}, i believe they are used de facto by all implementations out there (please correct me if I am wrong).
For tile_size=6 (i've seen in your PR, don't know either from where it come, the generated ones contains less smaller fractions, (interpolation points suggested by L. Andrew in one discussion, mentioned in header of PR).
For rest of tile_size > 6, I follow practice from wincnn as mentioned earlier, however for such big tile values i think there is not much room to improve precision.
I left the so called anchor_points visible in the code, they are from best practice, but maybe in future they can be tuned smartly (lets leave for future tasks).

FUTURE: There will be more fun when we want winograd over int8, or even sparse-winograd with int8.

I try to invite @andravin (not sure he will be aware) to see it's opinion on choosing our anchor points (maybe even auto-tune them in future).

Looking forward for more suggestions.

andravin · 2019-07-18T22:22:24Z

Hi @cbalint13, the error analysis in the paper that @zlateski pointed to looks pretty rigorous, I would start there.

I am a little bit curious what you plan to use the large tiles for. Today's devices have high compute / memory bandwidth ratios and low numeric precision, neither of which is conducive to a large tile size. I think the large tiles only payoff when you have an excess of DRAM memory bandwidth, or a large fast cache.

cbalint13 · 2019-07-18T23:13:26Z

Hi @cbalint13, the error analysis in the paper that @zlateski pointed to looks pretty rigorous, I would start there.

I am a little bit curious what you plan to use the large tiles for. Today's devices have high compute / memory bandwidth ratios and low numeric precision, neither of which is conducive to a large tile size. I think the large tiles only payoff when you have an excess of DRAM memory bandwidth, or a large fast cache.

Hi @andravin ,

Thanks for joining the thread !

As short intro TVM is about optimizing routines (not only NN related) to various targets (and there are a lot of devices out there) from high/low end GPU to small CPU or even FPGAs, by exploiting combinatoric space of the operators itself. Of course TVM is even more than just optimizing, it holds IR language for compiling/doing such and lots of cool other stuff.
I try propose here a replacement for some hardcoded winograd matrices (the classical tile_size 2 & 4 for conv2d case) but also leave room for any size up to 8.
In practice TVM's autotuning (mostly based on xgboost over collected compute statistics) will find tile size of 2, 3, 4 or rare case even 6 (conducted lots of test on opencl/mali-gpu in winograd scenario), haven't seen to pick up higher values.
I know it may sound absurd > 6 but that's the cool part of TVM it will find the proper way to do it (this is the goal to find it auto-magically).
As far i know TVM project is unique among libraries it can "autotune" and try find best computational approach for a specific operator over a specific hardware target, it quite does well even if combinatoric space covers millions of possibilities (folks here can tell lot more about)

Most interesting part would be targeting int8 or other quantization variants, so i think by having the matrix generator (maybe some post "precision tuner") would be good allies for winograd class schedules in TVM fashion of doing it.

I would remind you all of some more exotic approach that even uses complex numbers as interpolation points here. Even such approach can be accounted in future.

(apologies for possible redundancy and length of the info)

cbalint13 · 2019-07-18T23:33:09Z

For no confusions, to sum up what we have in this PR up to this point:

We replace two hardcoded winograd matrices with generated same ones.
We still have only tile_size = {2, 4} allowed, and this PR will not change this.
We now allow other kernels than {3x3} where K >= 2.

Plus, some doors are open from here on.

ajtulloch · 2019-07-19T18:02:25Z

This looks good to me - I forgot that this is only for m \in {2, 4}.

One thing I'd personally like is to add a test that verifies the explicit matrix values for m == 2/4, so they're explicitly specified somewhere?

BTW, for the m=6 case, I got them from NNPACK via Maratyszcza/NNPACK#8.

andravin · 2019-07-19T20:37:53Z

@cbalint13 and @ajtulloch you might want to try the points that Barabasz et al. say give the best accuracy:

4x4 output: { 0, -1, 1, 1/2, -2 } *
6x6 output: {0, -1, 1, 1/2, -1/2, 2, -2 }
4x1 or 1x4 output (ie 1-D convolution): {0, -1, 1, 1/2, -3} *

Choices that are a bit unusual are marked with *
Paper linked by, @zlateski, above.

I remember searching for F(4x4, 3x3) output points that give the best accuracy a while ago, and did see improvements with non-obvious choices, but the differences were not huge: NervanaSystems/neon#224

Not sure I agree though with how the paper modeled the input and filters with a uniform distribution. In practice, input data is gaussian or sub-gaussian and highly correlated. Anyway this observation made me dubious about modifying algorithms to make them on average more accurate with a random distribution that might not reflect the data that they will be used with in practice. I think a proper error analysis would be measured in Resnet-34 or Resnet-50 using actual weights and real images.

Also here are some optimizations that are often overlooked.

It is possible to use different algorithms for the horizontal and vertical dimensions of a 2D convolution. So you could nest F(4,3) with F(2,3) for example (call it F(4x2, 3x3)), or even F(6,3) with direct convolution, F(6x1, 3x3).
Loop partitioning and reordering can make a big difference in the efficiency of any dense computation, and Winograd/Cook/Toom conv2d is no different. Look to fit a super-tile of transform -> batch matrix multiply -> inverse transform in the innermost loops, so that the transform data stays in the fastest memory (eg shared memory on GPU, L1/L2 cache on CPU). This is important, because the fast convolution algorithms decrease computational density by simultaneously expanding the input data and reducing the number of arithmetic operations.
In order to make the inner loop of 2) efficient, you probably need to use a small tile algorithm for the inner loop, otherwise computational density will be too low and you will be memory bound. Consider using 1) to nest a 1D fast algorithm with a 1D direct convolution (1D tiles are the smallest!), or move a large tile 1D algorithm to the outer loop (so that it goes to DRAM on GPU or L3 cache on CPU), and do a small tile 1D algorithm in the inner loop.

Sorry this is so long. I did not have the time to make it shorter. ;-)

tqchen · 2019-07-23T03:01:08Z

Thanks for the great discussions. @cbalint13 it would be great if we can push forward to act on this PR, by summarizing the discussions, and propose list of changes. Then we can get everyone to review and approve

tqchen · 2019-07-24T18:32:08Z

@cbalint13 please request reviews by tagging people(likely those who are in this conversation) when it is ready

cbalint13 · 2019-07-24T20:57:41Z

@ajtulloch , @zlateski ,

Done tests using old matrix set versus new proposed schema:

all _2_4 tests confirm that old F(2,3), F(4,3) matrices are less accurate than new.
all _2_4_6 tests confirm F(6,3) one by @ajtulloch (NNPACK) is also less accurate than new one.
used table provided by B. Barabasz et al., cited in the code, anyone can update in future.
a shallow test that check to the 1e-5 limit already exist.

@merrymercy , @kevinthesun , @FrozenGene , @tqchen ,

To sum up:

we replace hardcoded matrix schema (the old 2,4) with a flexible computational one.
interpolation points provenance are cited, they can be overrided (if one find better)
we still have only tile_size = {2, 4} allowed and this PR will not change this.
we now allow other kernels than {3x3} where K >= 2 (works for cuda not arm/mali for now).

Immediate future step (not intented for this PR):

get rid of such static things (add a tile_size knob for autotunig)
try integrate/finish @ajtulloch NCHWc push request (in respect with actual API).
fix conv2d for kernel_size > 3 on arm & mali (they fail now, on cuda works).
enhance topi winograd test to support more cases.

ajtulloch

lgtm

zlateski

Just one comment, otherwise lgtm.

zlateski · 2019-07-25T00:12:23Z

topi/python/topi/nn/winograd_util.py

+
+    return np.array(in_pts[degree-1], dtype=np.float64)
+
+def winograd_transform_matrices(tile_size, kernel_size, out_dtype):


I'm not that familiar with the whole codebase, so the following comment might be irrelevant.

This function performs non-trivial compute, which should probably be memoized on some level.

@zlateski ,
Not sure how to do the memoization. We have a @memoize decorator, it is used only in test scripts so far and i see it can memoize only functions with no arguments on them.

@zlateski,
Also, if use @memoize can't see too much yield , (it's a pickle file at all) i think the overhead of IO would be greater for such small data.

@zlateski Very good point! Actually, memoize only write to file when the program exists. Follow up on #3687

cbalint13 · 2019-07-25T00:53:15Z

@zlateski ,

I think some hard-coding will be unavoidable :)

At least, from multiple boxed rectangular matrices we ended up in a nice single triangle lookup table :)

tqchen · 2019-07-25T04:13:53Z

@kevinthesun please https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly

kevinthesun · 2019-07-25T06:09:41Z

topi/python/topi/arm_cpu/conv2d.py

@@ -330,57 +329,14 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
    HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)

    assert layout == 'NCHW'
-    assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
+    assert KH == KW and HSTR == 1 and WSTR == 1


Add test case for this change?

Other parts lgtm.

@kevinthesun ,

Added 5x5, 7x7 testcases (cuda only)

Constrainted arm/mali to 3x3 only (until fix it in other PR)

cbalint13 · 2019-07-26T02:33:10Z

@ajtulloch
@andravin
@FrozenGene
@kevinthesun
@merrymercy
@tqchen
@zlateski

Thank you all for the help !

kevinthesun reviewed Jul 17, 2019

View reviewed changes

tqchen added the status: need review label Jul 17, 2019

tqchen assigned merrymercy Jul 17, 2019

zlateski reviewed Jul 18, 2019

View reviewed changes

topi/python/topi/nn/winograd_util.py Show resolved Hide resolved

tqchen mentioned this pull request Jul 23, 2019

[X86] [NNVM] [TOPI] Implement NCHWc Winograd convolutions #2111

Closed

4 tasks

cbalint13 changed the title ~~Add Winograd matrices computation.~~ [WIP] Add Winograd matrices computation. Jul 23, 2019

cbalint13 changed the title ~~[WIP] Add Winograd matrices computation.~~ Add Winograd matrices computation. Jul 24, 2019

ajtulloch approved these changes Jul 24, 2019

View reviewed changes

zlateski approved these changes Jul 25, 2019

View reviewed changes

FrozenGene approved these changes Jul 25, 2019

View reviewed changes

kevinthesun reviewed Jul 25, 2019

View reviewed changes

Add Winograd matrices computation.

e7fc473

merrymercy approved these changes Jul 25, 2019

View reviewed changes

merrymercy merged commit 97e333c into apache:master Jul 25, 2019

tmoreau89 mentioned this pull request Jul 28, 2019

[TEST][FLAKY] test_topi_conv2d_winograd.test_conv2d_nchw #3641

Closed

cbalint13 mentioned this pull request Jul 28, 2019

Fix kernel_size != 3 on winograd arm/mali. #3642

Closed

merrymercy mentioned this pull request Aug 1, 2019

[TOPI] Memoize winograd matrix #3687

Merged

wweic pushed a commit to wweic/tvm that referenced this pull request Aug 9, 2019

Add Winograd matrices computation. (apache#3553)

77521b9

wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 6, 2019

Add Winograd matrices computation. (apache#3553)

107daa0

andravin mentioned this pull request Sep 6, 2019

More accurate Winograd/Cook/Toom F(4x4, 3x3) transforms NervanaSystems/neon#224

Closed

tqchen unassigned merrymercy Nov 4, 2019

comaniac mentioned this pull request Nov 7, 2019

[TOPI][CUDA] Fix Winograd Kernel Size Support #4276

Merged

tqchen mentioned this pull request Nov 8, 2019

[RELEASE][DRAFT] TVM v0.6 Release candidate #4259

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Winograd matrices computation. #3553

Add Winograd matrices computation. #3553

cbalint13 commented Jul 16, 2019 •

edited

Loading

cbalint13 commented Jul 17, 2019 •

edited

Loading

kevinthesun Jul 17, 2019

FrozenGene Jul 17, 2019 •

edited

Loading

cbalint13 Jul 17, 2019 •

edited

Loading

kevinthesun Jul 17, 2019 •

edited

Loading

cbalint13 Jul 17, 2019

tqchen commented Jul 17, 2019

ajtulloch commented Jul 18, 2019

zlateski left a comment

cbalint13 commented Jul 18, 2019 •

edited

Loading

andravin commented Jul 18, 2019

cbalint13 commented Jul 18, 2019

cbalint13 commented Jul 18, 2019

ajtulloch commented Jul 19, 2019

andravin commented Jul 19, 2019 •

edited

Loading

tqchen commented Jul 23, 2019

tqchen commented Jul 24, 2019

cbalint13 commented Jul 24, 2019 •

edited

Loading

ajtulloch left a comment

zlateski left a comment

zlateski Jul 25, 2019

cbalint13 Jul 25, 2019

cbalint13 Jul 25, 2019

merrymercy Aug 1, 2019

cbalint13 commented Jul 25, 2019

tqchen commented Jul 25, 2019

kevinthesun Jul 25, 2019

kevinthesun Jul 25, 2019

cbalint13 Jul 25, 2019

cbalint13 commented Jul 26, 2019 •

edited

Loading

		@@ -333,54 +332,11 @@ def _decl_winograd(cfg, data, kernel, strides, padding, dilation, layout, out_dt
		assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1


		return np.array(in_pts[degree-1], dtype=np.float64)

		def winograd_transform_matrices(tile_size, kernel_size, out_dtype):

Add Winograd matrices computation. #3553

Add Winograd matrices computation. #3553

Conversation

cbalint13 commented Jul 16, 2019 • edited Loading

cbalint13 commented Jul 17, 2019 • edited Loading

Choose a reason for hiding this comment

FrozenGene Jul 17, 2019 • edited Loading

Choose a reason for hiding this comment

cbalint13 Jul 17, 2019 • edited Loading

Choose a reason for hiding this comment

kevinthesun Jul 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tqchen commented Jul 17, 2019

ajtulloch commented Jul 18, 2019

zlateski left a comment

Choose a reason for hiding this comment

cbalint13 commented Jul 18, 2019 • edited Loading

andravin commented Jul 18, 2019

cbalint13 commented Jul 18, 2019

cbalint13 commented Jul 18, 2019

ajtulloch commented Jul 19, 2019

andravin commented Jul 19, 2019 • edited Loading

tqchen commented Jul 23, 2019

tqchen commented Jul 24, 2019

cbalint13 commented Jul 24, 2019 • edited Loading

ajtulloch left a comment

Choose a reason for hiding this comment

zlateski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbalint13 commented Jul 25, 2019

tqchen commented Jul 25, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbalint13 commented Jul 26, 2019 • edited Loading

cbalint13 commented Jul 16, 2019 •

edited

Loading

cbalint13 commented Jul 17, 2019 •

edited

Loading

FrozenGene Jul 17, 2019 •

edited

Loading

cbalint13 Jul 17, 2019 •

edited

Loading

kevinthesun Jul 17, 2019 •

edited

Loading

cbalint13 commented Jul 18, 2019 •

edited

Loading

andravin commented Jul 19, 2019 •

edited

Loading

cbalint13 commented Jul 24, 2019 •

edited

Loading

cbalint13 commented Jul 26, 2019 •

edited

Loading