Create dispatch system for executors #3263
…ionExecutorCache instances are consistently named executor_cache.
@@ -845,13 +845,6 @@ bool Fusion::hasDynamicTransform() {
return !ir_utils::getTVsWithDynamicTransform(this).empty();
}
Just moved this function to executor.cpp as it wasn't used anywhere else.
@@ -326,17 +326,15 @@ SegmentProfiler::SegmentProfiler(uint32_t id, bool cupti_disabled)
output_bytes_(0),
kernel_profile_state_(ProfilerState::Ready) {}

- void SegmentProfiler::startCompile(int device) {
- device_ = device;
+ void SegmentProfiler::startCompile() {
Split setting the device into its own function. KernelExecutor knows the device at compile time, since compilation needs runtime information; the other executors set it in run.
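To illustrate the split described in this comment, here is a minimal sketch, not the actual `SegmentProfiler`. `ToySegmentProfiler`, `ToyProfilerState`, its `Running` and `Finished` states, and `stopCompile` are hypothetical names introduced for illustration; only the idea of `startCompile()` taking no device argument, a separate device setter, and `Ready` as the constructed state comes from the diff above.

```cpp
#include <stdexcept>

// Hypothetical state machine; only Ready is taken from the diff above.
enum class ToyProfilerState { Ready, Running, Finished };

class ToySegmentProfiler {
 public:
  // The device is set independently of compilation, so an executor that only
  // learns its device in run() can call this later.
  void setDevice(int device) {
    device_ = device;
  }

  // No device parameter: starting the compile timer does not need it.
  void startCompile() {
    if (state_ != ToyProfilerState::Ready) {
      throw std::logic_error("startCompile called while not Ready");
    }
    state_ = ToyProfilerState::Running;
  }

  void stopCompile() {
    if (state_ != ToyProfilerState::Running) {
      throw std::logic_error("stopCompile called while not Running");
    }
    state_ = ToyProfilerState::Finished;
  }

  int device() const {
    return device_;
  }

  ToyProfilerState state() const {
    return state_;
  }

 private:
  int device_ = -1; // assumption: -1 means "not set yet"
  ToyProfilerState state_ = ToyProfilerState::Ready;
};
```

With this shape, an executor that knows its device at compile time can call `setDevice` before `startCompile`, while the others can call it from their run path.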
csrc/fusion_profiler.h
@@ -22,6 +22,7 @@ namespace nvfuser {
//! \enum ProfilerState
//! \brief An enum used to represent the state of a profiling state machine
enum class ProfilerState {
+ None,
Just added this to initialize the state on construction.
I doubt this is needed. ProfilerState::Ready seems to be a good initial state already -- all reset* functions set the state to that. cc @kevinstephano
inputs,
user_sched.fusion_id_,
user_sched.device_id_);
user_sched.scheduled_fusion.get(), inputs
Need Ryan's advice here.
@rdspring1 another place I could use your help, please see the comment below.
@rdspring1 Could you take a look here?
…idevice/executor.[cpp,h] and rename to HostIrExecutor.
However here the accumulation across iteration is still done on a globally allocated buffer ( I think what you are describing here corresponds to fusing operations within the host for-loop's body, say into one HostUnit, which gives the classical benefit of fusing kernels. However, my point is that fusing across iterations is not possible, and the I/O of the for-loop's body (here,
Agreed! I wasn't trying to argue about that.
@samnordmann @wujingyue this conversation seems pretty great, it'd be wonderful if you could capture it in a design doc.
@Priya2698 Could you check the benchmark profiling with this PR? There should be no performance change, but since there's a lot of code changes, we should make sure everything works as expected.
Are you interested in a complete sweep or only the host benchmarking? We can run a complete sweep on the CI.
Please do a complete sweep just in case. Either A100 or H100. Not necessary to check both.
Got it, we will need to use CI resources then since the runs time out due to dlcluster time limits.
!test --pybench-full
…r user scheduling. (#3357)
The goal is to set `fusion_id` and `device_id` when creating `KernelExecutor` for `UserSchedule`. Previously, they were set during `FusionExecutor::compileFusion`. This PR is stacked on `executor_dispatch`.
**Changes to `UserSchedule` cache system:**
**Current:** The map key is the integer value of the input arguments. The vector is of size `device id`.
`std::unordered_map<size_t, std::vector<UserSchedule>> user_def_schedules;`
**New:** The key to the first map is the integer value of the input arguments. The key to the second map is the `device`.
**Why?** We can set the `fusion_id` and `device_id` in the constructor of `UserSchedule` and `KernelExecutor`.
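The nested-map layout described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the code from #3357: `ToyUserSchedule`, `ToyUserScheduleCache`, `getOrCreate`, and `numDevices` are hypothetical names, and the device key is assumed to be an `int`. The point it shows is that with a map keyed by device, each entry can be constructed with its `fusion_id` and `device_id` already known.

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for UserSchedule: both ids are fixed at construction.
struct ToyUserSchedule {
  int fusion_id;
  int device_id;
  ToyUserSchedule(int fusion, int device)
      : fusion_id(fusion), device_id(device) {}
};

class ToyUserScheduleCache {
 public:
  explicit ToyUserScheduleCache(int fusion_id) : fusion_id_(fusion_id) {}

  // Outer key: integer id of the input arguments. Inner key: device
  // (assumed to be an int here). Creates the entry on first use.
  ToyUserSchedule& getOrCreate(std::size_t input_id, int device_id) {
    auto& per_device = schedules_[input_id];
    auto it = per_device.find(device_id);
    if (it == per_device.end()) {
      it = per_device
               .emplace(device_id, ToyUserSchedule(fusion_id_, device_id))
               .first;
    }
    return it->second;
  }

  // Number of devices that have a schedule for the given input id.
  std::size_t numDevices(std::size_t input_id) const {
    auto it = schedules_.find(input_id);
    return it == schedules_.end() ? 0 : it->second.size();
  }

 private:
  int fusion_id_;
  std::unordered_map<std::size_t, std::unordered_map<int, ToyUserSchedule>>
      schedules_;
};
```

Compared with a vector holding one slot per device, the inner map only holds entries for devices that were actually used, so no entry has to exist before its device is known.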
@Priya2698 @xwang233 Could you please let us know how the benchmarking went?
Comparing the pipeline results to the latest nvfuser nightly results does not show any major difference: http://nv/eno
Thanks. I don't see anything suspicious either. Where can we find host benchmark results?
host latency benchmarks for this PR were executed with kernel_reuse disabled, so dynamic results might be inaccurate
I am seeing similar results for all cases except On
@csarofeen Everything looks good. I'll run the final CI just in case.
!test
Separate out `ExprEvalExecutor` and `HostIrExecutor` from what's now called `KernelExecutor`. Create a dispatch system for them as compile and run are simpler for the former two.
Also renamed instances of `FusionExecutorCache` to `executor_cache`, `KernelExecutor` to `ke`, `ExprEvalExecutor` to `eee`, and `HostIrExecutor` to `hire`. It makes this PR large, but it was critical to refactor all the instances of these classes.
For review, focus on the following files:
csrc/host_ir/executor.[cpp,h]
csrc/runtime/executor.[cpp,h]
csrc/runtime/executor_abstract.h
csrc/runtime/executor_dispatch.[cpp,h]
csrc/runtime/fusion_executor_cache.cpp
csrc/runtime/fusion_kernel_runtime.[cpp,h]
Remaining files are just renaming. I would break this into multiple PRs, but it would be difficult to do at this point.
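For readers unfamiliar with the pattern, the dispatch idea in this description can be sketched as below. This is a simplified illustration under assumptions, not the contents of `csrc/runtime/executor_dispatch.[cpp,h]`: every `Toy*` type, the two flags on `ToyFusion`, and `makeExecutor` are hypothetical, and the order in which the executors are tried is a guess.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a fusion; both flags are assumptions.
struct ToyFusion {
  bool is_host_ir = false;
  bool is_expr_eval_only = false;
};

// Common base so callers can hold any executor behind one pointer type.
class ToyExecutor {
 public:
  virtual ~ToyExecutor() = default;
  virtual std::string name() const = 0;

  // The simple executors have nothing to build; a kernel executor would
  // generate and compile code here.
  virtual void compile(const ToyFusion&) {
    compiled_ = true;
  }

  std::vector<int> run(const std::vector<int>& inputs) {
    if (!compiled_) {
      throw std::runtime_error(name() + " has not been compiled");
    }
    return runImpl(inputs);
  }

 protected:
  virtual std::vector<int> runImpl(const std::vector<int>& inputs) = 0;

 private:
  bool compiled_ = false;
};

class ToyHostIrExecutor : public ToyExecutor {
 public:
  std::string name() const override { return "ToyHostIrExecutor"; }
 protected:
  std::vector<int> runImpl(const std::vector<int>& in) override { return in; }
};

class ToyExprEvalExecutor : public ToyExecutor {
 public:
  std::string name() const override { return "ToyExprEvalExecutor"; }
 protected:
  std::vector<int> runImpl(const std::vector<int>& in) override { return in; }
};

class ToyKernelExecutor : public ToyExecutor {
 public:
  std::string name() const override { return "ToyKernelExecutor"; }
 protected:
  std::vector<int> runImpl(const std::vector<int>& in) override { return in; }
};

// Dispatch: try the simpler executors first, fall back to the kernel executor.
std::unique_ptr<ToyExecutor> makeExecutor(const ToyFusion& fusion) {
  if (fusion.is_host_ir) {
    return std::unique_ptr<ToyExecutor>(new ToyHostIrExecutor());
  }
  if (fusion.is_expr_eval_only) {
    return std::unique_ptr<ToyExecutor>(new ToyExprEvalExecutor());
  }
  return std::unique_ptr<ToyExecutor>(new ToyKernelExecutor());
}
```

A caller would obtain an executor from `makeExecutor(fusion)`, call `compile(fusion)` once, and then call `run(inputs)` without needing to know which concrete executor was chosen.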