
Some questions about FlexFlow and examples/cpp/moe #1392

Open
yjsunn opened this issue May 16, 2024 · 5 comments
Assignees: lockshaw
Labels: question (Further information is requested)

Comments

yjsunn commented May 16, 2024

Thanks for the great work.

Sorry to disturb you. I have several questions about the MoE model at ./examples/cpp/mixture-of-expert:

1. How does the current version of FlexFlow get the running time of an operator?

From the original version of FlexFlow (Beyond Data and Model Parallelism for Deep Neural Networks), it seems that before the simulation FlexFlow needs to profile the performance of each operator (forward time and backward time, measured by Op::measure_operator_cost). However, FFModel::compile never seems to call the measure_operator_cost function. May I ask how the current version of FlexFlow measures operators, or does it not need to measure at all?

2. Another related question is about the MCMC optimization.

I know another issue (#831) mentioned it. May I ask about the current progress on maintaining this functionality? It seems that the repo-refactor branch is integrating this functionality into the compiler, but when I check out that branch, the MoE example cannot run. Or do I need to reset the repo-refactor branch to a specific commit to run this code?

3. How does FlexFlow consider the network topology?

I find that the Unity paper mentions that the search algorithm can handle custom network topologies and heterogeneous compute devices. I am confused about how FlexFlow implements this. The C++ interface doesn't require the user to supply any topology file or node information. There are some functions related to topology-aware logic in simulator.h (class NetworkedMachineModel, class FCTopologyGenerator, class NetworkTopologyGenerator), but I did not find any code in the graph optimization that creates such classes or calls the related functions.

4. How does FlexFlow determine the number of nodes?

After I provide the node number, FlexFlow reminds me that "FlexFlow will automatically detect the number of nodes". May I ask how this works? It seems related to the previous question about topology.

Thanks for your time and looking forward to your reply!

@yjsunn yjsunn closed this as completed May 20, 2024
@yjsunn yjsunn reopened this May 20, 2024

yjsunn commented May 20, 2024

Hi there, I am aware that most of these problems are related to the code implementation of Unity and the code updates afterwards. Could you please take a look for me? @lockshaw

Thanks for your time and patience!

@yjsunn yjsunn closed this as not planned May 20, 2024
@yjsunn yjsunn reopened this May 20, 2024

lockshaw commented Jun 3, 2024

1. How does the current version of FlexFlow get the running time of an operator?

From the original version of FlexFlow (Beyond Data and Model Parallelism for Deep Neural Networks), it seems that before the simulation FlexFlow needs to profile the performance of each operator (forward time and backward time, measured by Op::measure_operator_cost). However, FFModel::compile never seems to call the measure_operator_cost function. May I ask how the current version of FlexFlow measures operators, or does it not need to measure at all?

The current search algorithm (the one from OSDI 2022) is at https://github.com/flexflow/FlexFlow/blob/288a1af4e731192b634a786b10dd4b89d2f0bd3c/src/runtime/substitution.cc#L579, which (a bunch of function calls down) does still call measure_operator_cost.
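
Conceptually, that measurement boils down to replaying the operator's forward/backward kernels between CUDA events and averaging; a minimal sketch of the pattern (illustrative only, not the actual FlexFlow code):

```cpp
#include <functional>
#include <cuda_runtime.h>

// Time a GPU workload by replaying it between CUDA events -- roughly the
// pattern Op::inner_measure_operator_cost follows. `run` stands in for an
// operator's forward or backward pass.
float measure_avg_ms(const std::function<void()> &run, int repetitions = 100) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  run(); // warm-up, so one-time initialization doesn't pollute the timing
  cudaEventRecord(start);
  for (int i = 0; i < repetitions; i++) {
    run();
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop); // wait until all replays have finished
  float total_ms = 0.0f;
  cudaEventElapsedTime(&total_ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return total_ms / repetitions;
}
```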

2. Another related question is about the MCMC optimization.

I know another issue (#831) mentioned it. May I ask about the current progress on maintaining this functionality? It seems that the repo-refactor branch is integrating this functionality into the compiler, but when I check out that branch, the MoE example cannot run. Or do I need to reset the repo-refactor branch to a specific commit to run this code?

Currently the original FlexFlow MCMC optimizer is not functioning, as it was superseded by Unity's dynamic programming-based search. We're currently working on resurrecting it (there's been some additional progress in #1365), but it may take a while before it's functioning again, as it's currently not critical for any of our projects.
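
(For intuition, the original search is a Metropolis-style loop: propose a random mutation of the current parallelization strategy, simulate its cost, and accept it probabilistically so the search can escape local minima. A sketch under those assumptions; Strategy, random_mutation, and simulated_cost are stand-ins, not the real FlexFlow types:)

```cpp
#include <cmath>
#include <random>

// Illustrative MCMC search loop over parallelization strategies, in the
// spirit of the original FlexFlow paper.
template <typename Strategy, typename Mutate, typename Cost>
Strategy mcmc_search(Strategy current, Mutate random_mutation,
                     Cost simulated_cost, int budget, double temperature) {
  std::mt19937 rng(0);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  Strategy best = current;
  double current_cost = simulated_cost(current);
  double best_cost = current_cost;
  for (int i = 0; i < budget; i++) {
    Strategy proposal = random_mutation(current);
    double proposal_cost = simulated_cost(proposal);
    // Metropolis criterion: always accept improvements, sometimes accept
    // regressions so the search does not get stuck in a local minimum.
    double accept_prob =
        std::exp((current_cost - proposal_cost) / temperature);
    if (proposal_cost < current_cost || unif(rng) < accept_prob) {
      current = proposal;
      current_cost = proposal_cost;
      if (current_cost < best_cost) {
        best = current;
        best_cost = current_cost;
      }
    }
  }
  return best;
}
```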

3. How does FlexFlow consider the network topology?

I find that the Unity paper mentions that the search algorithm can handle custom network topologies and heterogeneous compute devices. I am confused about how FlexFlow implements this. The C++ interface doesn't require the user to supply any topology file or node information. There are some functions related to topology-aware logic in simulator.h (class NetworkedMachineModel, class FCTopologyGenerator, class NetworkTopologyGenerator), but I did not find any code in the graph optimization that creates such classes or calls the related functions.

The algorithm conceptually can be extended to support more complex topologies and heterogeneous devices, but as we do not have a heterogeneous cluster that we need to run on, we haven't gone through the effort of implementing these extensions.
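
(To make the extension point concrete: a machine model ultimately just has to answer "how long does moving N bytes between device A and device B take". A hypothetical alpha-beta-style sketch, not the actual simulator.h interface:)

```cpp
#include <cstddef>

// Hypothetical alpha-beta machine model: transfer time between two devices
// is a fixed latency plus bytes divided by link bandwidth. A topology-aware
// model (like the NetworkedMachineModel classes mentioned above) would
// derive these numbers from a network graph instead of two constants.
struct SimpleMachineModel {
  double intra_node_bw;  // bytes/sec between GPUs on the same node
  double inter_node_bw;  // bytes/sec across nodes
  double latency;        // seconds per message
  int gpus_per_node;

  double transfer_time(int src_gpu, int dst_gpu, size_t bytes) const {
    bool same_node = (src_gpu / gpus_per_node) == (dst_gpu / gpus_per_node);
    double bw = same_node ? intra_node_bw : inter_node_bw;
    return latency + static_cast<double>(bytes) / bw;
  }
};
```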

4. How does FlexFlow determine the number of nodes?

After I provide the node number, FlexFlow reminds me that "FlexFlow will automatically detect the number of nodes". May I ask how this works? It seems related to the previous question about topology.

Where is that quote from? Implementation-wise, FlexFlow uses the value from Legion, which I think is just passed through the CLI, though fancier things may be done with certain job launchers. In general, FlexFlow doesn't try to auto-detect the underlying hardware setup, as the user is usually aware of their hardware setup and can provide the proper parameters to FlexFlow.
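
(A sketch of reading that value back from the runtime, assuming the standard Legion/Realm machine query API:)

```cpp
#include "legion.h"

// Realm already knows how many address spaces (processes, typically one
// per node) the job was launched with, so the node count can simply be
// read back from the machine model rather than auto-detected.
size_t query_num_nodes() {
  Legion::Machine machine = Legion::Machine::get_machine();
  return machine.get_address_space_count();
}
```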


@lockshaw lockshaw self-assigned this Jun 3, 2024
@lockshaw lockshaw added the question label Jun 3, 2024

yjsunn commented Jun 5, 2024

Thanks so much for your reply @lockshaw! It really helps! But I still have some confusion about the FlexFlow project.

For the first question, related to measure_operator_cost: I am aware that measure_operator_cost is never called because the default config sets const static bool onlyDataParallel = true, so graph_optimize is never called from try_one_lambda. However, when I set onlyDataParallel = false and run the code at examples/cpp/DLRM using the command ./build/examples/cpp/DLRM/dlrm -ll:gpu 2 -ll:fsize 14000 -ll:zsize 14000 --budget 20, the example hits an assertion failure.
Here is the output:

[0 - 7f651c803000] 0.238115 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238204 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238254 {3}{Mapper}: Enabled Control Replication Optimizations.
[0 - 7f651c803000] 0.238436 {4}{runtime}: [warning 1117] LEGION WARNING: Mapper FlexFlow Mapper requested to both replicate and origin map task top_level (UID 2) in 'select_task_options'. Replication of origin-mapped tasks is not currently supported and the request to replicate the task will be ignored. (from file /usr/wkspace/FlexFlow/deps/legion/runtime/legion/legion_tasks.cc:803)
For more information see:
http://legion.stanford.edu/messages/warning_code.html#warning_code_1117

[0 - 7f6510e03000] 0.271274 {3}{DLRM}: batchSize(64) workersPerNodes(2) numNodes(1)
[0 - 7f6510e03000] 0.271314 {3}{DLRM}: EmbeddingBagSize(1)
[0 - 7f6510e03000] 0.271350 {3}{DLRM}: Embedding Vocab Sizes: 1000000 1000000 1000000 1000000
[0 - 7f6510e03000] 0.271361 {3}{DLRM}: MLP Top: 64 64 2
[0 - 7f6510e03000] 0.271374 {3}{DLRM}: MLP Bot: 4 64 64
workSpaceSize (128 MB)
workSpaceSize (128 MB)
computation mode is training
config.profiling0
create_operators_from_layers layer_size:18
enter graph_optimize_task
enter try_one_lambda
num_nodes = 1 num_gpus_per_node = 2
enter graph_optimize
ready to enter generic_sequence_optimize
Optimizing graph with 18 nodes
Applying recursive case on bottleneck
Optimizing graph with 3 nodes
!bottleneck.has_value() 1
linear op profiling: 255
dlrm: /usr/wkspace/FlexFlow/src/ops/linear.cc:1135: virtual bool FlexFlow::Linear::measure_operator_cost(FlexFlow::Simulator*, const FlexFlow::MachineView&, FlexFlow::CostMetrics&) const: Assertion `m->profiling == false' failed.
Aborted (core dumped)

It seems that some illegal memory access may happen in the example code when the ops try to call measure_operator_cost (the "linear op profiling: 255" line suggests the profiling flag is read from uninitialized memory). Could you give me some hints to solve it?

For the last question, about the node number:

I get that message in the console output when I try to set --nodes as in the previous scripts in examples/cpp/dlrm; FFConfig::parse_args gives me this warning (you may refer to this line). I am confused about how FlexFlow determines the node number and generates the device mapping.


yjsunn commented Jun 7, 2024

Besides, when I check out the master branch and try to run the MoE example, I find that the MoE-related measure_operator_cost does not run properly. When I call measure_operator_cost on the Group_by operator, it hits CUDA failure 700. Here is some info:

moe: /usr/wkspace/fortest/FlexFlow/src/runtime/model.cu:53: void FlexFlow::Op::inner_measure_operator_cost(FlexFlow::Simulator*, const std::function<void()>&, const std::function<void()>&, FlexFlow::CostMetrics&) const: Assertion `false' failed.

It seems that cudaEventSynchronize(sim->end_event) does not run properly. Could you give me some hints to solve this? @lockshaw
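
From what I can tell, CUDA failure 700 is cudaErrorIllegalAddress, and such errors are reported asynchronously: the faulting kernel launch itself returns success, and the error only surfaces at the next synchronization point, which here is the cudaEventSynchronize inside inner_measure_operator_cost. A minimal checking pattern that can localize the faulting kernel (illustrative, not the FlexFlow code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Error 700 (cudaErrorIllegalAddress) is raised by a kernel but usually
// only observed at the next sync point, so cudaEventSynchronize is where
// it gets reported, not where it occurred. Syncing right after each
// suspect launch narrows down which kernel actually faulted.
#define CHECK_CUDA(call)                                                \
  do {                                                                  \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "%s:%d: CUDA error: %s\n", __FILE__, __LINE__,    \
              cudaGetErrorString(err));                                 \
    }                                                                   \
  } while (0)

void check_after_launch() {
  CHECK_CUDA(cudaGetLastError());      // catches launch-configuration errors
  CHECK_CUDA(cudaDeviceSynchronize()); // surfaces async faults like 700
}
```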

@Resurgence27

@yjsunn Hi, have you solved this problem? I have failed with it many times and found no solution.
