`OperationProfiler` and `PerseusOptimizer` server and client #21
Labels: enhancement (New feature or request)
Perseus is an energy scheduler for large model training (although we're looking into applying it to large model inference, too).
Perseus requires the time and energy consumption profiling results of each forward and backward computation in each pipeline stage in order to schedule energy with `lowtime`. That's what `OperationProfiler` will do.
The `PerseusOptimizer` server will, for now, receive a Python file that lists GPU frequencies (produced by `lowtime`) and instruct the `PerseusOptimizer` client (integrated into the user's training framework) to change GPU frequencies. The server-client split keeps Perseus agnostic to the training framework. Otherwise, energy scheduling (which requires a holistic view of all computations that happen across all ranks, i.e., the "policy") and the method of realizing the energy schedule in a distributed fashion (i.e., the "mechanism") end up being coupled.
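To make the policy/mechanism split concrete, here is a rough sketch of how the two sides could interact. Everything in it is an assumption for illustration: the class names `ScheduleServer` and `ScheduleClient`, the schedule layout (one list of frequencies in MHz per pipeline stage, in execution order), and the `set_frequency` callback are hypothetical, and the server is called in-process rather than over the network. It is not the planned `PerseusOptimizer` API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ScheduleServer:
    """Policy side (hypothetical): holds the frequency schedule for all stages."""

    # Assumed layout: stage index -> GPU frequencies (MHz), one per
    # forward/backward computation of that stage, in execution order.
    schedule: Dict[int, List[int]]

    def frequency_for(self, stage: int, step: int) -> int:
        """Return the frequency for the step-th computation of a stage."""
        freqs = self.schedule[stage]
        # Wrap around so the same schedule repeats every training iteration.
        return freqs[step % len(freqs)]


@dataclass
class ScheduleClient:
    """Mechanism side (hypothetical): lives inside the training framework."""

    stage: int
    server: ScheduleServer
    # How a frequency is actually applied is injected, so the client does
    # not depend on any particular GPU management library.
    set_frequency: Callable[[int], None]
    _step: int = 0

    def on_computation_start(self) -> int:
        """Ask the server for the next frequency and apply it."""
        freq = self.server.frequency_for(self.stage, self._step)
        self.set_frequency(freq)
        self._step += 1
        return freq


# Usage: a two-stage schedule; the client for stage 1 records what it applies.
server = ScheduleServer({0: [1410, 1200], 1: [1005, 1410]})
applied: List[int] = []
client = ScheduleClient(stage=1, server=server, set_frequency=applied.append)
for _ in range(3):
    client.on_computation_start()
```

In this sketch the training framework only ever calls `on_computation_start()`, so changing the scheduling policy means changing the server's schedule without touching the framework integration.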