
Create dispatch system for executors #3263

Merged: 42 commits from executor_dispatch into main on Nov 13, 2024

Conversation

@csarofeen (Collaborator) commented Oct 24, 2024

Separate out ExprEvalExecutor and HostIrExecutor from what's now called KernelExecutor, and create a dispatch system for them, since compile and run are simpler for the former two.

Also renamed variable instances of FusionExecutorCache to executor_cache, KernelExecutor to ke, ExprEvalExecutor to eee, and HostIrExecutor to hire. This makes the PR large, but it was critical to refactor all uses of these classes.

For review focus on the following files:
csrc/host_ir/executor.[cpp,h]
csrc/runtime/executor.[cpp,h]
csrc/runtime/executor_abstract.h
csrc/runtime/executor_dispatch.[cpp,h]
csrc/runtime/fusion_executor_cache.cpp
csrc/runtime/fusion_kernel_runtime.[cpp,h]

The remaining files are just renames. I would break this into multiple PRs, but that would be difficult to do at this point.
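To make the dispatch idea concrete, here is a minimal, hypothetical sketch: a shared abstract base with the three executors and a dispatch that prefers the simpler executors. The real `ExecutorAbstract`/`ExprEvalExecutor`/`HostIrExecutor`/`KernelExecutor` classes have richer compile/run signatures, and the selection predicates here (`expr_evaluable`, `host_ir`) are stand-ins, not the actual checks.

```cpp
#include <memory>
#include <string>

// Stand-in for the information the dispatch would inspect on a fusion.
struct FusionInfo {
  bool expr_evaluable = false;  // can be computed by expression evaluation
  bool host_ir = false;         // can be run as host IR
};

struct ExecutorAbstract {
  virtual ~ExecutorAbstract() = default;
  virtual std::string name() const = 0;
};

struct ExprEvalExecutor : ExecutorAbstract {
  std::string name() const override { return "ExprEvalExecutor"; }
};

struct HostIrExecutor : ExecutorAbstract {
  std::string name() const override { return "HostIrExecutor"; }
};

struct KernelExecutor : ExecutorAbstract {
  std::string name() const override { return "KernelExecutor"; }
};

// Dispatch: try the simpler executors first, fall back to code generation.
std::unique_ptr<ExecutorAbstract> makeExecutor(const FusionInfo& info) {
  if (info.expr_evaluable) {
    return std::make_unique<ExprEvalExecutor>();
  }
  if (info.host_ir) {
    return std::make_unique<HostIrExecutor>();
  }
  return std::make_unique<KernelExecutor>();
}
```

Callers would then interact with the base class only, so the cache and runtime code never need to know which executor they hold.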

@@ -845,13 +845,6 @@ bool Fusion::hasDynamicTransform() {
return !ir_utils::getTVsWithDynamicTransform(this).empty();
}

@csarofeen (Collaborator, Author):

Just moved this function to executor.cpp as it wasn't used anywhere else.

@@ -326,17 +326,15 @@ SegmentProfiler::SegmentProfiler(uint32_t id, bool cupti_disabled)
output_bytes_(0),
kernel_profile_state_(ProfilerState::Ready) {}

void SegmentProfiler::startCompile(int device) {
device_ = device;
void SegmentProfiler::startCompile() {
@csarofeen (Collaborator, Author):

Separated setting the device into its own function. KernelExecutor knows the device at compile time since runtime information is needed for compilation; the other executors set it in run.
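A hedged sketch of the refactor in the diff above: `device_` is no longer an argument of `startCompile()`; a separate `setDevice()` lets KernelExecutor set it at compile time while the other executors set it at run time. The state machine is reduced to a single flag for illustration; the real profiler tracks `ProfilerState` transitions.

```cpp
// Simplified model of SegmentProfiler after the device/compile split.
class SegmentProfiler {
 public:
  void setDevice(int device) { device_ = device; }
  void startCompile() { compiling_ = true; }
  void stopCompile() { compiling_ = false; }
  int device() const { return device_; }
  bool compiling() const { return compiling_; }

 private:
  int device_ = -1;  // unset until some executor knows the device
  bool compiling_ = false;
};
```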

@@ -22,6 +22,7 @@ namespace nvfuser {
//! \enum ProfilerState
//! \brief An enum used to represent the state of a profiling state machine
enum class ProfilerState {
None,
@csarofeen (Collaborator, Author):

Just added this to initialize the state on construction.

Another collaborator:

I doubt this is needed. ProfilerState::Ready seems to be a good initial state already -- all reset* functions set the state to that. cc @kevinstephano

Resolved review thread: csrc/instrumentation.h
inputs,
user_sched.fusion_id_,
user_sched.device_id_);
user_sched.scheduled_fusion.get(), inputs
@csarofeen (Collaborator, Author):

Need Ryan's advice here.

@csarofeen (Collaborator, Author):

@rdspring1 another place I could use your help, please see the comment below.

Another collaborator:

@rdspring1 Could you take a look here?

Resolved review threads: csrc/runtime/executor.cpp (8 threads), tests/cpp/test_alias.cpp
…idevice/executor.[cpp,h] and rename to HostIrExecutor.
@samnordmann
Collaborator

> I think that the HostUnit alternative will not achieve anything more than what add_out and sum_out can achieve.

I think the HostUnit alternative gives fewer global reads/writes when the addition can be fused to the preceding kernel.

c = sum_H(a*b)  # H means along a host-parallel dimension

With the HostUnit alternative, each iteration reads a, reads b, reads c, computes c+a*b and writes that as the updated c.

With sum_out or add_out, each iteration reads a, reads b, writes a*b, reads a*b, reads c, computes c+a*b, and writes the updated c.

However, here the accumulation across iterations is still done in a globally allocated buffer (c).

I think what you are describing here corresponds to fusing operations within the host for-loop's body, say into one HostUnit, which gives the classical benefit of fusing kernels. However, my point is that fusing across iterations is not possible, and that the I/O of the for-loop's body (here a, b, and c, but indeed not a*b) must live in global memory.
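The read/write accounting in this thread can be written down as a toy model. The per-iteration counts below come straight from the two descriptions of c = sum_H(a*b); this is an illustration of the argument, not a measurement, and the names are invented.

```cpp
#include <cstddef>

// Global-memory traffic per scheme, over n host iterations.
struct GlobalIo {
  std::size_t reads;
  std::size_t writes;
};

// HostUnit-style fused body: each iteration reads a, b, c and writes the
// updated c.
constexpr GlobalIo fusedBody(std::size_t n) {
  return {3 * n, 1 * n};
}

// sum_out/add_out: a*b is materialized in global memory, so each iteration
// reads a, b, writes a*b, then reads a*b and c, and writes the updated c.
constexpr GlobalIo unfusedBody(std::size_t n) {
  return {4 * n, 2 * n};
}
```

Either way the accumulator c stays in global memory, which is the point about fusing across iterations being impossible.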

@wujingyue
Collaborator

> However, my point is that fusing across iterations is not possible

Agreed! I wasn't trying to argue about that.

@csarofeen
Collaborator Author

@samnordmann @wujingyue this conversation seems pretty great; it'd be wonderful if you could capture it in a design doc.

@naoyam
Collaborator

naoyam commented Nov 8, 2024

@Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there are a lot of code changes, we should make sure everything works as expected.

@Priya2698
Collaborator

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there are a lot of code changes, we should make sure everything works as expected.

Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI.
CC: @xwang233

@naoyam
Collaborator

naoyam commented Nov 8, 2024

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there are a lot of code changes, we should make sure everything works as expected.

> Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI. CC: @xwang233

Please do a complete sweep just in case. Either A100 or H100. Not necessary to check both.

@Priya2698
Collaborator

> @Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there are a lot of code changes, we should make sure everything works as expected.

> Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI. CC: @xwang233

> Please do a complete sweep just in case. Either A100 or H100. Not necessary to check both.

Got it, we will need to use CI resources then, since the runs time out due to dlcluster time limits.
@xwang233 will be able to help. We can preferably run on A100 due to higher availability.

@xwang233
Collaborator

xwang233 commented Nov 8, 2024

!test --pybench-full

…r user scheduling. (#3357)

The goal is to set `fusion_id` and `device_id` when creating
`KernelExecutor` for a `UserSchedule`. Previously, they were set during
`FusionExecutor::compileFusion`. This PR is stacked on
`executor_dispatch`.

**Changes to the `UserSchedule` cache system:**

**Current:** The map key is the integer value of the input arguments; the value is a vector indexed by device id.
`std::unordered_map<size_t, std::vector<UserSchedule>> user_def_schedules;`

**New:** The key of the first map is the integer value of the input arguments; the key of the second map is the device.

**Why?** So we can set the `fusion_id` and `device_id` in the constructors of `UserSchedule` and `KernelExecutor`.
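The cache change above can be sketched as follows. The exact key and id types are assumptions; the commit message only states that the outer key is the integer value of the input arguments and the inner key is the device.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the cached entry.
struct UserSchedule {
  int fusion_id = -1;
  int device_id = -1;
};

// Current: the per-input value is a vector indexed by device id, so it has
// to be sized up front to cover the largest device id in use.
using OldCache = std::unordered_map<std::size_t, std::vector<UserSchedule>>;

// New: a nested map keyed by device, so an entry is created on demand and
// fusion_id/device_id can be set when the UserSchedule is constructed.
using NewCache = std::unordered_map<
    std::size_t, std::unordered_map<std::size_t, UserSchedule>>;
```

With the nested map there is no need to resize anything when a new device appears; lookup and insertion both go through `cache[input_key][device_id]`.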
@naoyam
Collaborator

naoyam commented Nov 12, 2024

@Priya2698 @xwang233 Could you please let us know how the benchmarking went?

@Priya2698
Collaborator

> @Priya2698 @xwang233 Could you please let us know how the benchmarking went?

Comparing the pipeline results to the latest nvfuser nightly results does not show any major difference: http://nv/eno
Please verify against the nightly results, since the weekly run has additional inputs.

@naoyam
Collaborator

naoyam commented Nov 12, 2024

Thanks. I don't see anything suspicious either.

Where can we find host benchmark results?

@xwang233
Collaborator

> Thanks. I don't see anything suspicious either.

> Where can we find host benchmark results?

Host latency benchmarks for this PR were executed with kernel_reuse disabled, so the dynamic results might be inaccurate.

@Priya2698
Collaborator

Priya2698 commented Nov 13, 2024

I am seeing similar results for all cases except test_many_segments_host_benchmark for dynamic. This PR does not contain the changes from PR #3388, so this is expected. Apart from that, I do not see any major deviations.

On A100-asus (screenshots omitted; each benchmark compared on main vs. this PR):

test_adaptive_layernorm_fwd
test_many_segment_host_benchmark.py
test_many_pointwise_ops.py

@naoyam
Collaborator

naoyam commented Nov 13, 2024

@csarofeen Everything looks good. I'll run the final CI just in case.

@naoyam
Collaborator

naoyam commented Nov 13, 2024

!test

@naoyam merged commit 0f2f99e into main on Nov 13, 2024
48 checks passed
@naoyam deleted the executor_dispatch branch on November 13, 2024 at 22:44