Some questions about FlexFlow and example/cpp/moe #1392
Hi there, I am aware that most of the problem is related to the code implementation of Unity and the code updates afterwards. Could you please take a look for me? @lockshaw Thanks for your time and patience!
The current search algorithm (the one from OSDI 2022) is at https://github.com/flexflow/FlexFlow/blob/288a1af4e731192b634a786b10dd4b89d2f0bd3c/src/runtime/substitution.cc#L579, which (a bunch of function calls down) does still call
Currently the original FlexFlow MCMC optimizer is not functioning, as it was superseded by Unity's dynamic programming-based search. We're working on resurrecting it (there's been some additional progress in #1365), but it may take a while before it's functioning again, as it's not critical for any of our projects at the moment.
The algorithm can conceptually be extended to support more complex topologies and heterogeneous devices, but as we do not have a heterogeneous cluster that we need to run on, we haven't gone through the effort to implement these extensions.
Where is that quote from? Implementation-wise, FlexFlow uses the value from Legion, which I think is just passed through the CLI, though fancier things may be done with certain job launchers. In general FlexFlow doesn't try to auto-detect the underlying hardware setup, as the user is usually aware of their hardware setup and can provide the proper parameters to FlexFlow.
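For readers unfamiliar with the MCMC search mentioned in the reply above, here is a minimal sketch of a Metropolis-style search over parallelization choices. It only illustrates the general idea and is not FlexFlow's implementation: the `Strategy` type, the `mcmc_search` function, and all of its parameters are hypothetical names introduced for this example, and the cost function is supplied by the caller rather than by a simulator.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Hypothetical representation (not FlexFlow's data structure):
// strategy[i] is the index of the parallelization option chosen for operator i.
using Strategy = std::vector<int>;

// Metropolis-style search: propose a random change to one operator and
// accept it depending on how the estimated cost changes.
// Returns the lowest-cost strategy seen during the walk.
Strategy mcmc_search(Strategy current,
                     int num_options,
                     std::function<double(Strategy const &)> const &cost,
                     int iterations,
                     double beta,
                     unsigned seed) {
  assert(!current.empty() && num_options > 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> pick_op(0, current.size() - 1);
  std::uniform_int_distribution<int> pick_option(0, num_options - 1);
  std::uniform_real_distribution<double> unit(0.0, 1.0);

  double current_cost = cost(current);
  Strategy best = current;
  double best_cost = current_cost;

  for (int i = 0; i < iterations; ++i) {
    // Symmetric proposal: re-draw the option of one randomly chosen operator.
    Strategy proposal = current;
    proposal[pick_op(rng)] = pick_option(rng);
    double proposal_cost = cost(proposal);

    // Always accept improvements; accept a worse proposal with
    // probability exp(-beta * delta) so the walk can leave local minima.
    double delta = proposal_cost - current_cost;
    if (delta <= 0.0 || unit(rng) < std::exp(-beta * delta)) {
      current = std::move(proposal);
      current_cost = proposal_cost;
      if (current_cost < best_cost) {
        best = current;
        best_cost = current_cost;
      }
    }
  }
  return best;
}
```

A caller would pass an initial strategy (for example, all zeros), the number of options per operator, and a cost function that estimates iteration time. The returned strategy never costs more than the initial one, because the best strategy seen so far is tracked separately from the current state of the walk.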
Thanks so much for your reply @lockshaw! It really helps! But I still have some confusion about the FlexFlow project. For the first question, related to measure_operator_cost: I am aware that measure_operator_cost is never called because the defaultconfig sets the
It seems like some illegal memory access may happen in the example code when the ops try to call measure_operator_cost. Could you give me some hints to solve it? For the last question, related to the node number:
I get the output in the console when I try to set
Besides, when I check out the master branch and try to run the moe example, I find that the moe-related measure_operator_cost does not run properly. When I call measure_operator_cost of the Group_by operator, it hits Cuda failure: 700. Here is some info:
It seems like cudaEventSynchronize(sim->end_event) does not run properly. Could you give me some hints to solve this? @lockshaw
@yjsunn Hi, have you solved this problem? I have failed with it many times and found no solution.
Thanks for the great work.
Sorry for disturbing. I have several questions about the moe model at ./examples/cpp/mixture-of-expert
1. How does the current version of FlexFlow get the running time of an operator?
From the original version of FlexFlow (Beyond Data and Model Parallelism for Deep Neural Networks), it seems that before the simulation FlexFlow needs to profile the performance of each operator (forward time and backward time, measured by Op::measure_operator_cost). However, FFModel::compile never seems to call the measure_operator_cost function. May I ask how the current version of FlexFlow measures operators, or whether it doesn't need to measure them at all?
2. Another related question is about the MCMC optimization.
I know another issue (#831) mentioned it. May I ask about the current progress on maintaining this function? It seems like the refactor branch is integrating this function into the compiler, but I checked out this branch and the moe example could not run in the Repo Refactor branch. Or do I need to reset the repo refactor branch to a specific commit to run this code?
3. How does FlexFlow consider the network topology?
I find that the Unity paper mentions that the search algorithm can handle custom network topologies and heterogeneous compute devices. I am confused about how FlexFlow implements this. The cpp interface doesn't require the user to supply any topology file or the node. There are some functions related to topology-aware logic in simulator.h (class NetworkedMachineModel, class FCTopologyGenerator, class NetworkTopologyGenerator), but I did not find any code in the graph optimization that creates such classes or calls the related functions.
4. How does FlexFlow determine the node number?
After providing the node number, FlexFlow reminds me that "FlexFlow will automatically detect the number of nodes". May I ask how this works? It seems like it is related to the previous question about topology.
Thanks for your time and looking forward to your reply!
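As a rough illustration of the profiling step described in question 1 (measuring forward and backward time per operator before simulation), the sketch below times a callable with warm-up runs followed by averaged repetitions. It is an assumption-laden stand-in, not FlexFlow code: `OpTiming`, `measure_avg_ms`, and `profile_operator` are hypothetical names, and host wall-clock time is used here in place of the CUDA events that the cudaEventSynchronize(sim->end_event) call mentioned in this thread suggests the real code uses.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical result type for this sketch (not FlexFlow's own struct).
struct OpTiming {
  double forward_ms;
  double backward_ms;
};

// Average wall-clock time of `op` in milliseconds.
// Warm-up runs are executed first and excluded from the measurement.
double measure_avg_ms(std::function<void()> const &op, int warmup, int repeats) {
  assert(warmup >= 0 && repeats > 0);
  for (int i = 0; i < warmup; ++i) {
    op();
  }
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) {
    op();
  }
  auto end = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::milli> elapsed = end - start;
  return elapsed.count() / repeats;
}

// Profile the forward and backward passes of one operator separately.
OpTiming profile_operator(std::function<void()> const &forward,
                          std::function<void()> const &backward,
                          int warmup = 2,
                          int repeats = 10) {
  OpTiming timing;
  timing.forward_ms = measure_avg_ms(forward, warmup, repeats);
  timing.backward_ms = measure_avg_ms(backward, warmup, repeats);
  return timing;
}
```

For GPU kernels, host-side timing like this is only meaningful if each call blocks until the device work has finished, which is one reason event-based timing is the usual choice on CUDA.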