
Some questions about FlexFlow and examples/cpp/moe #1392

Open
yjsunn opened this issue May 16, 2024 · 5 comments
Assignees: lockshaw
Labels: question (Further information is requested)

Comments

yjsunn commented May 16, 2024

Thanks for the great work.

Sorry to disturb you. I have several questions about the MoE model at ./examples/cpp/mixture-of-expert:

1. How does the current version of FlexFlow get the running time of an operator?

From the original version of FlexFlow (Beyond Data and Model Parallelism for Deep Neural Networks), it seems that before the simulation FlexFlow needs to profile the performance of each operator (forward time and backward time, measured by Op::measure_operator_cost). However, FFModel::compile never seems to call the measure_operator_cost function. May I ask how the current version of FlexFlow measures operators, or does it not need to measure at all?

2. Another related question is about the MCMC optimization.

I know another issue (#831) mentioned it. May I ask about the current progress on maintaining this functionality? It seems that the repo-refactor branch is integrating this functionality into the compiler, but when I check out that branch, the MoE example cannot run. Or do I need to reset the repo-refactor branch to a specific commit to run this code?

3. How does FlexFlow consider the network topology?

I find that the Unity paper mentions that the search algorithm can handle custom network topologies and heterogeneous compute devices. I am confused about how FlexFlow implements this. The C++ interface doesn't require the user to supply any topology file or node information. There are some functions related to topology-aware logic in simulator.h (class NetworkedMachineModel, class FCTopologyGenerator, class NetworkTopologyGenerator), but I did not find any code in the graph optimization that creates such classes or calls the related functions.

4. How does FlexFlow determine the number of nodes?

After I provide the node number, FlexFlow reminds me that "FlexFlow will automatically detect the number of nodes". May I ask how this works? It seems related to the previous question about topology.

Thanks for your time and looking forward to your reply!

@yjsunn yjsunn closed this as completed May 20, 2024
@yjsunn yjsunn reopened this May 20, 2024

yjsunn commented May 20, 2024

Hi there, I am aware that most of these problems are related to the code implementation of Unity and the code updates afterwards. Could you please take a look for me? @lockshaw

Thanks for your time and patience!

@yjsunn yjsunn closed this as not planned May 20, 2024
@yjsunn yjsunn reopened this May 20, 2024

lockshaw commented Jun 3, 2024

1. How does the current version of FlexFlow get the running time of an operator?

From the original version of FlexFlow (Beyond Data and Model Parallelism for Deep Neural Networks), it seems that before the simulation FlexFlow needs to profile the performance of each operator (forward time and backward time, measured by Op::measure_operator_cost). However, FFModel::compile never seems to call the measure_operator_cost function. May I ask how the current version of FlexFlow measures operators, or does it not need to measure at all?

The current search algorithm (the one from OSDI 2022) is at https://github.com/flexflow/FlexFlow/blob/288a1af4e731192b634a786b10dd4b89d2f0bd3c/src/runtime/substitution.cc#L579, which (a bunch of function calls down) does still call measure_operator_cost.
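
Conceptually, that measurement boils down to replaying the operator's forward/backward kernels between CUDA events and averaging; a minimal sketch of the pattern (illustrative only, not the actual FlexFlow code):

```cpp
#include <functional>
#include <cuda_runtime.h>

// Time a GPU workload by replaying it between CUDA events -- roughly the
// pattern Op::inner_measure_operator_cost follows. `run` stands in for an
// operator's forward or backward pass.
float measure_avg_ms(const std::function<void()> &run, int repetitions = 100) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  run(); // warm-up, so one-time initialization doesn't pollute the timing
  cudaEventRecord(start);
  for (int i = 0; i < repetitions; i++) {
    run();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop); // wait until all replays have finished
  float total_ms = 0.0f;
  cudaEventElapsedTime(&total_ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return total_ms / repetitions;
}
```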

2. Another related question is about the MCMC optimization.

I know another issue (#831) mentioned it. May I ask about the current progress on maintaining this functionality? It seems that the repo-refactor branch is integrating this functionality into the compiler, but when I check out that branch, the MoE example cannot run. Or do I need to reset the repo-refactor branch to a specific commit to run this code?

Currently the original FlexFlow MCMC optimizer is not functioning, as it was superseded by Unity's dynamic programming-based search. We're currently working on resurrecting it (there's been some additional progress in #1365), but it may take a while before it's functioning again, as it's currently not critical for any of our projects.
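
(For intuition, the original search is a Metropolis-style loop: propose a random mutation of the current parallelization strategy, simulate its cost, and accept it probabilistically so the search can escape local minima. A sketch under those assumptions; Strategy, random_mutation, and simulated_cost are stand-ins, not the real FlexFlow types:)

```cpp
#include <cmath>
#include <random>

// Illustrative MCMC search loop over parallelization strategies, in the
// spirit of the original FlexFlow paper.
template <typename Strategy, typename Mutate, typename Cost>
Strategy mcmc_search(Strategy current, Mutate random_mutation,
                     Cost simulated_cost, int budget, double temperature) {
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  Strategy best = current;
  double current_cost = simulated_cost(current);
  double best_cost = current_cost;
  for (int i = 0; i < budget; i++) {
    Strategy proposal = random_mutation(current);
    double proposal_cost = simulated_cost(proposal);
    // Metropolis criterion: always accept improvements, sometimes accept
    // regressions so the search does not get stuck in a local minimum.
    double accept_prob =
        std::exp((current_cost - proposal_cost) / temperature);
    if (proposal_cost < current_cost || unif(rng) < accept_prob) {
      current = proposal;
      current_cost = proposal_cost;
      if (current_cost < best_cost) {
        best = current;
        best_cost = current_cost;
      }
    }
  }
  return best;
}
```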

3. How does FlexFlow consider the network topology?

I find that the Unity paper mentions that the search algorithm can handle custom network topologies and heterogeneous compute devices. I am confused about how FlexFlow implements this. The C++ interface doesn't require the user to supply any topology file or node information. There are some functions related to topology-aware logic in simulator.h (class NetworkedMachineModel, class FCTopologyGenerator, class NetworkTopologyGenerator), but I did not find any code in the graph optimization that creates such classes or calls the related functions.

The algorithm conceptually can be extended to support more complex topologies and heterogeneous devices, but as we do not have a heterogeneous cluster that we need to run on, we haven't gone through the effort of implementing these extensions.
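
(To make the extension point concrete: a machine model ultimately just has to answer "how long does moving N bytes between device A and device B take". A hypothetical alpha-beta-style sketch, not the actual simulator.h interface:)

```cpp
#include <cstddef>

// Hypothetical alpha-beta machine model: transfer time between two devices
// is a fixed latency plus bytes divided by link bandwidth. A topology-aware
// model (like the NetworkedMachineModel classes mentioned above) would
// derive these numbers from a network graph instead of two constants.
struct SimpleMachineModel {
  double intra_node_bw;  // bytes/sec between GPUs on the same node
  double inter_node_bw;  // bytes/sec across nodes
  double latency;        // seconds per message
  int gpus_per_node;

  double transfer_time(int src_gpu, int dst_gpu, size_t bytes) const {
    bool same_node = (src_gpu / gpus_per_node) == (dst_gpu / gpus_per_node);
    double bw = same_node ? intra_node_bw : inter_node_bw;
    return latency + static_cast<double>(bytes) / bw;
  }
};
```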

4. How does FlexFlow determine the number of nodes?

After I provide the node number, FlexFlow reminds me that "FlexFlow will automatically detect the number of nodes". May I ask how this works? It seems related to the previous question about topology.

Where is that quote from? Implementation-wise, FlexFlow uses the value from Legion, which I think is just passed through the CLI, though fancier things may be done with certain job launchers. In general, FlexFlow doesn't try to auto-detect the underlying hardware setup, as the user is usually aware of their hardware setup and can provide the proper parameters to FlexFlow.
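
(A sketch of reading that value back from the runtime, assuming the standard Legion/Realm machine query API:)

```cpp
#include "legion.h"

// Realm already knows how many address spaces (processes, typically one
// per node) the job was launched with, so the node count can simply be
// read back from the machine model rather than auto-detected.
size_t query_num_nodes() {
  Legion::Machine machine = Legion::Machine::get_machine();
  return machine.get_address_space_count();
}
```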


@lockshaw lockshaw self-assigned this Jun 3, 2024
@lockshaw lockshaw added the question label Jun 3, 2024

yjsunn commented Jun 5, 2024

Thanks so much for your reply @lockshaw! It really helps! But I still have some confusion about the FlexFlow project.

For the first question, related to measure_operator_cost: I am aware that measure_operator_cost is never called because the default config sets const static bool onlyDataParallel = true, so graph_optimize is never called from try_one_lambda. However, when I set onlyDataParallel = false and run the code at examples/cpp/DLRM using the command ./build/examples/cpp/DLRM/dlrm -ll:gpu 2 -ll:fsize 14000 -ll:zsize 14000 --budget 20, the example hits an assertion failure.
Here is the output:

[0 - 7f651c803000] 0.238115 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238204 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238254 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238436 {4}{runtime}: [warning 1117] LEGION WARNING: Mapper FlexFlow Mapper requested to both replicate and origin map task top_level (UID 2) in 'select_task_options'. Replication of origin-mapped tasks is not currently supported and the request to replicate the task will be ignored. (from file /usr/wkspace/FlexFlow/deps/legion/runtime/legion/legion_tasks.cc:803)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1117

[0 - 7f6510e03000] 0.271274 {3}{DLRM}: batchSize(64) workersPerNodes(2) numNodes(1)
[0 - 7f6510e03000] 0.271314 {3}{DLRM}: EmbeddingBagSize(1)
[0 - 7f6510e03000] 0.271350 {3}{DLRM}: Embedding Vocab Sizes: 1000000 1000000 1000000 1000000
[0 - 7f6510e03000] 0.271361 {3}{DLRM}: MLP Top: 64 64 2
[0 - 7f6510e03000] 0.271374 {3}{DLRM}: MLP Bot: 4 64 64
workSpaceSize (128 MB)
workSpaceSize (128 MB)
computation mode is training
config.profiling0
create_operators_from_layers layer_size:18
enter graph_optimize_task
enter try_one_lambda
num_nodes = 1 num_gpus_per_node = 2
enter graph_optimize
ready to enter generic_sequence_optimize
Optimizing graph with 18 nodes
Applying recursive case on bottleneck
Optimizing graph with 3 nodes
!bottleneck.has_value() 1
linear op profiling: 255
dlrm: /usr/wkspace/FlexFlow/src/ops/linear.cc:1135: virtual bool FlexFlow::Linear::measure_operator_cost(FlexFlow::Simulator*, const FlexFlow::MachineView&, FlexFlow::CostMetrics&) const: Assertion `m->profiling == false' failed.
Aborted (core dumped)

It seems that some illegal memory access may happen in the example code when the ops try to call measure_operator_cost (the "linear op profiling: 255" line suggests the profiling flag is read from uninitialized memory). Could you give me some hints to solve it?

For the last question, about the node number:

I get that message in the console output when I try to set --nodes as in the previous scripts in examples/cpp/dlrm; FFConfig::parse_args gives me this warning (you may refer to this line). I am confused about how FlexFlow determines the node number and generates the device mapping.


yjsunn commented Jun 7, 2024

Besides, when I check out the master branch and try to run the MoE example, I find that the MoE-related measure_operator_cost does not run properly. When I call measure_operator_cost on the Group_by operator, it hits CUDA failure 700. Here is some info:

moe: /usr/wkspace/fortest/FlexFlow/src/runtime/model.cu:53: void FlexFlow::Op::inner_measure_operator_cost(FlexFlow::Simulator*, const std::function<void()>&, const std::function<void()>&, FlexFlow::CostMetrics&) const: Assertion `false' failed.

It seems that cudaEventSynchronize(sim->end_event) does not run properly. Could you give me some hints to solve this? @lockshaw
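
From what I can tell, CUDA failure 700 is cudaErrorIllegalAddress, and such errors are reported asynchronously: the faulting kernel launch itself returns success, and the error only surfaces at the next synchronization point, which here is the cudaEventSynchronize inside inner_measure_operator_cost. A minimal checking pattern that can localize the faulting kernel (illustrative, not the FlexFlow code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Error 700 (cudaErrorIllegalAddress) is raised by a kernel but usually
// only observed at the next sync point, so cudaEventSynchronize is where
// it gets reported, not where it occurred. Syncing right after each
// suspect launch narrows down which kernel actually faulted.
#define CHECK_CUDA(call)                                                \
  do {                                                                  \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: CUDA error: %s\n", __FILE__, __LINE__,    \
              cudaGetErrorString(err));                                 \
    }                                                                   \
  } while (0)

void check_after_launch() {
  CHECK_CUDA(cudaGetLastError());      // catches launch-configuration errors
  CHECK_CUDA(cudaDeviceSynchronize()); // surfaces async faults like 700
}
```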

@Resurgence27

@yjsunn Hi, have you solved this problem? I have failed with it many times and found no solution.
