Partially lower fusion to optimize deserialization performance. #558

rdspring1 · 2023-07-04T17:44:36Z

This PR reduces deserialization time by running only the analysis step in GpuLower. All other lowering passes are skipped, while the information needed to run a kernel is retained. Deserialization is estimated to be ~75% faster than rerunning GpuLower completely.

After kernel compilation, we only need the information in GpuLower::KernelSummary to create a new ExecutorEntry at runtime. The kir::Allocate nodes are generated directly without the rest of the lowering process.

Our approach for creating kir::Allocate nodes is based on the ExpressionEvaluator. During serialization, we store a set of base IterDomains and a series of operations to create the TensorView domains. The kir::Allocate nodes are not inserted into the kir::Kernel because they are only used to calculate the size of buffers at kernel runtime.
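The idea of storing base extents plus a list of operations, then replaying them to compute a buffer size, can be sketched as below. This is only an illustration under stated assumptions: `OpKind`, `DomainOp`, `SerializedDomain`, and `allocationNumel` are hypothetical names, not nvFuser types, and the real serialized form lives in the FlatBuffers schema rather than in plain structs.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the serialized data; none of these names
// are taken from nvFuser.
enum class OpKind { Split, Merge };

struct DomainOp {
  OpKind kind;
  int lhs;         // index of the first input extent
  int rhs;         // Merge: index of the second input extent; Split: unused
  int64_t factor;  // Split: extent of the inner domain; Merge: unused
};

struct SerializedDomain {
  std::size_t num_base;        // number of base extents bound at runtime
  std::vector<DomainOp> ops;   // replayed in order; each op appends extents
  std::vector<int> alloc_ids;  // indices of the extents forming the allocation
};

inline int64_t ceilDiv(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Replays the recorded operations on the runtime base extents and returns
// the number of elements the allocation needs.
inline int64_t allocationNumel(
    const SerializedDomain& domain,
    const std::vector<int64_t>& base_extents) {
  if (base_extents.size() != domain.num_base) {
    throw std::invalid_argument("unexpected number of base extents");
  }
  std::vector<int64_t> extents = base_extents;
  for (const DomainOp& op : domain.ops) {
    switch (op.kind) {
      case OpKind::Split:
        // A split yields an outer extent of ceilDiv(extent, factor)
        // and an inner extent of factor.
        extents.push_back(ceilDiv(extents.at(op.lhs), op.factor));
        extents.push_back(op.factor);
        break;
      case OpKind::Merge:
        // A merge yields the product of its two input extents.
        extents.push_back(extents.at(op.lhs) * extents.at(op.rhs));
        break;
    }
  }
  int64_t numel = 1;
  for (int id : domain.alloc_ids) {
    numel *= extents.at(id);
  }
  return numel;
}
```

For example, with base extents {100, 8}, a merge of the two followed by a split of the result by 128 produces extents 7 and 128, so an allocation over those two domains needs 896 elements. Nothing here touches a kernel IR, which mirrors the point that the allocation nodes are only needed to size buffers at runtime.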

TODOs:

ExpressionSerializer and ExpressionBuilder build TensorView and kir::Allocate nodes for global intermediate buffers and dynamic shared memory
Store all information needed to run the fusion in KernelSummary, which is stored in FusionExecutor
Partially recreate and then deserialize KernelSummary
Add support for LoadStoreOp
Create separate paths for RNGOps and TMA kir nodes

rdspring1 · 2023-12-05T23:23:13Z

!build

rdspring1 · 2024-12-13T18:47:29Z

Need to update for new fusion executor dispatch system.

rdspring1 and others added 30 commits June 20, 2023 22:27

Update fusion_cache.fbs

2c870ea

Initial recompile

767efcb

add skip_serde_test option

261e196

add skip_serde_checks

3387f3a

fix issue with has_dynamic_transform_info

a76f7df

Fix group order in FusionKernelRuntime::deserialize

e5c8f8d

Store pointer address directly in TensorArgAbstract

ed58cde

Fix GlobalBufferInfo for intermediate tensor

99fd897

return true if kernel already exists in KernelDb

3f1255a

update maxregcount

a10ea22

add fb to benchmark

9634395

typo

71cedf5

use pytorch third_party flatbuffers

f10b906

Merge branch 'main' into serde_fec_recompile

000b193

lint

58e7d74

Merge branch 'main' into serde_fec_recompile

2d025b9

Merge branch 'main' into serde_fec_recompile

92288d8

Merge branch 'main' into serde_fec_recompile

a64a42f

GlobalBufferInfo requires lowered kernel before deserialization

8d97682

create copy of KernelSummary

f9fc2a4

Update fusion_cache schema

3464914

update executor

56039c6

create fast lower path

e376395

summary

ed550b8

save

24f4730

serialize max_rng_offsets

a101e77

create kernel summary table

e1c8467

clean-up

8ebf3aa

serialize static_smem_allocations, dynamic_smem_allocations, dynamic_…

806e0a8

…lmem_allocations

add comments

9ed84f5

rdspring1 and others added 22 commits October 26, 2023 18:36

Expand IterDomain table support

e0da8d2

Split expr_evaluator into two files

e86b0f4

remove _serde suffix

42c3700

create insertUniqueItem

d852c06

Create internal state

4f72828

refactor binding inputs

383d879

Refactor iterdomain binding

bdb37ff

Update constraint

496a49d

All allocations come from KernelSummary

91e7c05

Refactor

3801f68

Create gatherSymbolicValues and processAllocations

d122b21

Remove Domain table

cc88c61

Create DerivedExpressionSerializer

3e45fc1

Make processAllocations functional

984dcaf

Make ExpressionBuilder into Factory subclass

ae0ea12

Add retrieve to ExpressionBuilder

6790bbf

Create exists and retrieve functions

a147a2e

Add NVRTC tags

b716aec

Merge branch 'main' into serde_fec

a627def

clang-tidy

c62ed67

Merge branch 'main' of github.com:NVIDIA/Fuser into serde_fec

ee194e5

Add dry run to get RNG seed and offset

af55e63

rdspring1 force-pushed the serde_fec branch from 035f766 to af55e63 Compare December 5, 2023 05:24

rdspring1 added 4 commits December 5, 2023 11:03

merge dryRun

4fe5604

Remove castEnumToUnderlyingType

ec1cbf6

cleanup

26936c5

Merge branch 'main' of github.com:NVIDIA/Fuser into serde_fec

9be13e3

rdspring1 closed this Dec 13, 2024
