Adding task-based simulator for PCGs #1365

Marsella8 · 2024-04-12T20:07:45Z

Description of changes:

Related Issues:

Linked Issues:

Issue #

Issues closed by this PR:

Closes Add primitive for estimating cost of a fully-mapped PCG #1323

This change is

…mpiler

…ompiler

lockshaw

Looks great! Stylistically it matches the existing code, which is much appreciated 🙂

Obviously tests need to get finished off and there are probably some minor edge cases to be handled, but as an initial draft this looks really good

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @Marsella8 and @wmdi)

lib/compiler/src/machine_mapping.cc line 123 at r1 (raw file):

  std::unordered_set<UpwardOpenMultiDiEdge> incoming_edges =
      get_incoming_edges(g, node);
  std::vector<ParallelTensorShape> inputs = transform(

If the ordering here doesn't matter why is this a vector?

lib/compiler/src/machine_mapping.cc line 124 at r1 (raw file):

      get_incoming_edges(g, node);
  std::vector<ParallelTensorShape> inputs = transform(
      as_vector(incoming_edges), [&](UpwardOpenMultiDiEdge const &input_edge) {

Suggestion:

  as_vector(get_incoming_edges(g, node)), [&](UpwardOpenMultiDiEdge const &input_edge) {

lib/compiler/src/machine_mapping.cc line 161 at r1 (raw file):

  // Filling the frontier
  for (auto &[edge, _] : frontier_machine_views) {

Suggestion:

  for (auto const &[edge, _] : frontier_machine_views) {

lib/compiler/src/machine_mapping.cc line 161 at r1 (raw file):

  // Filling the frontier
  for (auto &[edge, _] : frontier_machine_views) {

Maybe cleaner to just get the set of nodes with no incoming closed edges?

lib/compiler/src/machine_mapping.cc line 176 at r1 (raw file):

      if (std::all_of(devices.begin(), devices.end(), [&occupied](int d) {
            return occupied[d] == false
          })) {

See all_of in containers.h

Code quote:

      if (std::all_of(devices.begin(), devices.end(), [&occupied](int d) {
            return occupied[d] == false
          })) {

lib/compiler/test/src/test_parallel_cost_estimator.cpp line 10 at r1 (raw file):

  TEST_CASE("parallel_estimate_cost") {
    // Test graph structure
    //     /-- 1 --\

Add a test case that starts with the simpler case of a straight line? Might be an easier starting point and I think should have a clean analytical answer you can check against

lib/compiler/test/src/test_parallel_cost_estimator.cpp line 11 at r1 (raw file):

    // Test graph structure
    //     /-- 1 --\
        //  0 -         - 3

Suggestion:

  //  0 -         - 3

lib/utils/src/graph/algorithms.cc line 380 at r1 (raw file):

std::unordered_map<Node, std::unordered_set<Node>>
    get_successors(DiGraphView const &g,

Alternatively, implement this as get_predecessors on the flipped graph

…cost-estimator

wmdi

Reviewed all commit messages.
Reviewable status: 4 of 5 files reviewed, 8 unresolved discussions (waiting on @lockshaw and @Marsella8)

…cost-estimator

… and related objects

Marsella8

Reviewable status: 2 of 7 files reviewed, 8 unresolved discussions (waiting on @lockshaw)

lib/compiler/src/machine_mapping.cc line 124 at r1 (raw file):

      get_incoming_edges(g, node);
  std::vector<ParallelTensorShape> inputs = transform(
      as_vector(incoming_edges), [&](UpwardOpenMultiDiEdge const &input_edge) {

Done.

lib/compiler/src/machine_mapping.cc line 161 at r1 (raw file):

  // Filling the frontier
  for (auto &[edge, _] : frontier_machine_views) {

Done.

lib/compiler/src/machine_mapping.cc line 161 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Maybe cleaner to just get the set of nodes with no incoming closed edges?

Done.

lib/compiler/src/machine_mapping.cc line 176 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See all_of in containers.h

Done.

lib/compiler/test/src/test_parallel_cost_estimator.cpp line 10 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add a test case that starts with the simpler case of a straight line? Might be an easier starting point and I think should have a clean analytical answer you can check against

Done.

lib/compiler/test/src/test_parallel_cost_estimator.cpp line 11 at r1 (raw file):

    // Test graph structure
    //     /-- 1 --\
        //  0 -         - 3

Done.

wmdi

Reviewed 2 of 4 files at r3, 8 of 8 files at r4, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @lockshaw and @Marsella8)

Marsella8

Reviewable status: 4 of 13 files reviewed, 8 unresolved discussions (waiting on @lockshaw, @Marsella8, and @wmdi)

lib/compiler/src/machine_mapping.cc line 123 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If the ordering here doesn't matter why is this a vector?

Done.

lib/utils/src/graph/algorithms.cc line 380 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Alternatively, implement this as get_predecessors on the flipped graph

Done.

lockshaw

Reviewed 2 of 8 files at r4, 5 of 9 files at r5, all commit messages.
Reviewable status: 9 of 13 files reviewed, 6 unresolved discussions (waiting on @Marsella8 and @wmdi)

lib/compiler/include/compiler/cost_estimator.h line 12 at r5 (raw file):

using SubParallelComputationGraphView =
    OutputLabelledOpenMultiDiGraphView<Operator, ParallelTensor>;

Prefer dtgen-wrapped type (see current SubParallelComputationGraph, which honestly should probably be a View anyway so feel free to just change it to be a View)

lib/compiler/src/cost_estimator.cc line 29 at r5 (raw file):

}

struct TimedNode { // Node and associated finishing time

Prefer dtgen over visitable

lib/compiler/src/cost_estimator.cc line 68 at r5 (raw file):

    std::unordered_set<Node> copy(frontier);
    for (Node const &node : copy) {
      auto views = device_mapping.machine_views.at(node);

Prefer actual type over auto

Code quote:

auto

lib/compiler/src/cost_estimator.cc line 73 at r5 (raw file):

              devices.begin(), devices.end(), [&occupied](device_id_t d) {
                return occupied[d] == false;
              })) {

Prefer functions from containers.h

Code quote:

      if (std::all_of(
              devices.begin(), devices.end(), [&occupied](device_id_t d) {
                return occupied[d] == false;
              })) {

lib/compiler/src/cost_estimator.cc line 75 at r5 (raw file):

              })) {
        float cost = node_estimate_cost(node, g, estimator, device_mapping);
        processing.push({node, current_time + cost});

Pulling some of this out into separate lambdas might improve readability (say, a lambda named start_node_processing)

lib/pcg/src/device_id.cc line 19 at r5 (raw file):

// Most likely not the best way to do it.
device_id_t operator+(device_id_t device, size_t increment) {
  if (std::holds_alternative<gpu_id_t>(device)) {

Refactor to use dtgen (but otherwise this isn't bad, just replace the static_casts with actual field accesses)

Is this an operation we actually want to support for device_id_t, or should this just be done in the context of other functions?

lib/utils/src/graph/algorithms.cc line 380 at r1 (raw file):

Previously, Marsella8 wrote…

Done.

Not seeing it done? (though it doesn't necessarily need to be, just saying I don't see a change)

Marsella8

Reviewed 2 of 5 files at r1, 2 of 8 files at r4, 9 of 9 files at r5.
Reviewable status: 11 of 14 files reviewed, 6 unresolved discussions (waiting on @lockshaw)

lib/compiler/include/compiler/cost_estimator.h line 12 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer dtgen-wrapped type (see current SubParallelComputationGraph, which honestly should probably be a View anyway so feel free to just change it to be a View)

Tried to sync my branch so as to dtgen and wasn't able to get it working, best if i get this PR merged and then finish this off in a separate branch (e.g. the machine-related one)?

lib/compiler/src/cost_estimator.cc line 29 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer dtgen over visitable

See comment above

Also, we will always use dtgen going forward for all structs, including the ones that are only for private use / limited to a single module right?

lib/compiler/src/cost_estimator.cc line 68 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer actual type over auto

Done.

lib/compiler/src/cost_estimator.cc line 73 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer functions from containers.h

Done.

lib/compiler/src/cost_estimator.cc line 75 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pulling some of this out into separate lambdas might improve readability (say, a lambda named start_node_processing)

Done, let me know if this is ok.

As a general question, when do we use lambdas inside the function vs just having a separate function do the same thing? Only when the function is only applicable tto the function we are trying to write?

lib/pcg/src/device_id.cc line 19 at r5 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Refactor to use dtgen (but otherwise this isn't bad, just replace the static_casts with actual field accesses)

Is this an operation we actually want to support for device_id_t, or should this just be done in the context of other functions?

Due to problems above with dtgen, will clean up everything into the machine-related branch.

I don't think we'll ever intend for the user to directly increment device_id_t types, but it is useful for the functions that are part of machine_views.cc and associated modules (e.g. if we want to iterate over the range of IDs to generate all possible IDs)

lib/utils/src/graph/algorithms.cc line 380 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not seeing it done? (though it doesn't necessarily need to be, just saying I don't see a change)

Now it's changed.

lockshaw

Reviewed 3 of 9 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @Marsella8)

lib/compiler/include/compiler/cost_estimator.h line 12 at r5 (raw file):

Previously, Marsella8 wrote…

Tried to sync my branch so as to dtgen and wasn't able to get it working, best if i get this PR merged and then finish this off in a separate branch (e.g. the machine-related one)?

You're going to need to merge with repo-refactor anyway, so if you can't get dtgen to work I'm guessing the branch won't be mergeable. That said, if you want I'm already making the change to this specific struct in a different PR, so you can skip this one (though it would still be good to update the other structs such as TimedNode in this PR)

lib/compiler/src/cost_estimator.cc line 29 at r5 (raw file):

Previously, Marsella8 wrote…

See comment above

Also, we will always use dtgen going forward for all structs, including the ones that are only for private use / limited to a single module right?

I'd say use it when you can--it only captures a limited subset of what C++ structs can do (e.g., no methods, etc.), but it also removes a lot of difficulty of making your structs support expected behaviors (printability, equality, hashing, etc.) yourself, so if you struct fits in its semantic model I'd say use it

lib/compiler/src/cost_estimator.cc line 75 at r5 (raw file):

Previously, Marsella8 wrote…

Done, let me know if this is ok.

As a general question, when do we use lambdas inside the function vs just having a separate function do the same thing? Only when the function is only applicable tto the function we are trying to write?

We use lambdas for functions that aren't reusable across the codebase. They're especially useful when you want to manipulate local variables, like occupied here, as the lambdas can manipulate them directly. When in doubt I'd say start with lambdas

lib/compiler/src/cost_estimator.cc line 41 at r6 (raw file):

};

bool predecessors_have_been_processed(

This should be a lambda

lib/compiler/src/cost_estimator.cc line 44 at r6 (raw file):

    std::unordered_set<Node> const &predecessors,
    std::unordered_set<TimedNode> processed) {
  std::unordered_set<Node> simple_processed =

processed_nodes?

Code quote:

 simple_processed

lib/compiler/src/cost_estimator.cc line 47 at r6 (raw file):

      transform(processed, [](TimedNode const &tn) { return tn.node; });

  return all_of(predecessors, [&simple_processed](Node p) {

Suggestion:

  return all_of(predecessors, [&](Node p) {

lib/compiler/src/cost_estimator.cc line 48 at r6 (raw file):

  return all_of(predecessors, [&simple_processed](Node p) {
    return simple_processed.find(p) != simple_processed.end();

contains from utils/containers.h

lib/compiler/src/cost_estimator.cc line 55 at r6 (raw file):

                                     MachineMapping const &device_mapping) {
  return device_mapping.machine_views.at(node).device_ids();
}

Should be moved to the corresponding header for MachineMapping

Code quote:

std::vector<device_id_t> get_devices(Node const &node,
                                     MachineMapping const &device_mapping) {
  return device_mapping.machine_views.at(node).device_ids();
}

lib/compiler/src/cost_estimator.cc line 106 at r6 (raw file):

      std::vector<device_id_t> devices = get_devices(node, device_mapping);
      if (all_of(devices,
                 [&occupied](device_id_t d) { return occupied[d] == false; })) {

Don't worry about explicitly managing lambda references unless there's a good reason (e.g., they're going to live outside of local scope)

Suggestion:

                 [&](device_id_t d) { return occupied[d] == false; })) {

lib/compiler/src/cost_estimator.cc line 117 at r6 (raw file):

      finish_node_processing(finished);

      // Adding candidates to the frontier

FYI these comments aren't really adding a huge amount here and can probably be removed. With the lambdas added the code is readable enough that I don't think they're contributing much over the code itself

lib/compiler/src/cost_estimator.cc line 121 at r6 (raw file):

        std::unordered_set<Node> predecessors = get_predecessors(g, successor);

        if (predecessors_have_been_processed(predecessors, processed)) {

Why not just have predecessors_have_been_processed call get_predecessors directly?

Code quote:

predecessors_have_been_processed(predecessors, processed)

lib/compiler/test/CMakeLists.txt line 13 at r6 (raw file):

    doctest
    utils-test-common
    pcg

Shouldn't be necessary, pcg is already pulled in by compiler (but this isn't a huge deal)

lib/pcg/src/device_id.cc line 19 at r5 (raw file):

Previously, Marsella8 wrote…

Due to problems above with dtgen, will clean up everything into the machine-related branch.

I don't think we'll ever intend for the user to directly increment device_id_t types, but it is useful for the functions that are part of machine_views.cc and associated modules (e.g. if we want to iterate over the range of IDs to generate all possible IDs)

In that case I'd just do the unwrapping manually in machine_views.cc, etc. Only expose an add operator if it's actually a valid operation to add device ids in general, which I'm not sure is the case (though I could be convinced of)

lib/pcg/src/device_id.cc line 17 at r6 (raw file):

}

// Most likely not the best way to do it.

?

lib/pcg/src/machine_view.cc line 30 at r6 (raw file):

    StridedRectangleSide side = this->rect.at(ff_dim_t{0});
    for (device_id_t id = this->start; id < this->start + side.get_num_points();
         id = id + 1) {

The stride of the side doesn't appear to be used?

lib/pcg/src/machine_view.cc line 30 at r6 (raw file):

    StridedRectangleSide side = this->rect.at(ff_dim_t{0});
    for (device_id_t id = this->start; id < this->start + side.get_num_points();
         id = id + 1) {

Suggestion:

      id++) {

lib/pcg/src/machine_view.cc line 33 at r6 (raw file):

      ids.push_back(id);
    }
    return ids;

Either move the declaration of ids in a scope or the return out a scope, having them in different scopes is weird and confusing

lib/pcg/src/strided_rectangle.cc line 21 at r6 (raw file):

  StridedRectangleSide side = this->sides.at(dim);
  return side;
}

Suggestion:

StridedRectangleSide StridedRectangle::at(ff_dim_t const &dim) const {
  return this->sides.at(dim);
}

wmdi and others added 30 commits October 16, 2023 16:02

compiler build

d269b40

Merge branch 'test-substitution' into test-compiler

f46fd11

Merge branch 'test-substitution' into test-compiler

af67e9e

unity dp works

c015efb

format

6211b84

Merge remote-tracking branch 'upstream/repo-refactor' into test-compiler

d9f1302

fmt

fb58a99

fix

02937e1

add substitutions, compiler, and their unit tests to CI

6402ed0

disable runtime unit test

0c45f61

minor fix

95fa427

(not compilable) visitable issue for OptimalCostState

1f7e2b6

fix machine mapping hash & refactor dp algorithm

a9a6402

minor fix

d8bbcb8

fix variant issue

09d3152

fmt

a150d3a

fix

2eb3fdf

fmt

7598a92

fix

05c8336

Merge remote-tracking branch 'upstream/repo-refactor' into test-compiler

71aeddb

add more unit tests

9345400

fmt

c0015df

Merge remote-tracking branch 'origin/repo-refactor' into compiler

6d28697

Fix post-merge

102f5fb

Add shell hook for sapling development

d6e10bb

changed from nullopt to std::nullopt

95fb4cc

fix cast issue

c091479

Merge branch 'test-compiler' of github.com:wmdi/FlexFlow into test-co…

57bd35f

…mpiler

Fix spdlog cmake issue

54c604a

Merge remote-tracking branch 'refs/remotes/wmdi/test-compiler' into c…

a09e528

…ompiler

Changes file

9533099

Marsella8 requested review from lockshaw and wmdi April 12, 2024 20:18

Formatting

9c03409

lockshaw reviewed Apr 17, 2024

View reviewed changes

Pietro Max Marsella added 2 commits April 26, 2024 15:52

Saving

2af6003

Merge branch 'repo-refactor' of github.com:flexflow/FlexFlow into ff-…

3a9850f

…cost-estimator

wmdi reviewed May 9, 2024

View reviewed changes

Pietro Max Marsella added 5 commits May 14, 2024 16:18

Merge branch 'repo-refactor' of github.com:flexflow/FlexFlow into ff-…

2793cc6

…cost-estimator

Moved cost estimator to separate file

798f492

Working on tests

ffa8158

Minor changes

f134e8a

Updates to cost_estimator

a8e8551

lockshaw assigned Marsella8 May 31, 2024

lockshaw mentioned this pull request Jun 3, 2024

Some question about the Flexflow and example/cpp/moe #1392

Open

Updates to parallel_cost_estimator + implementations for machineviews…

3a8b995

… and related objects

Marsella8 commented Jun 3, 2024

View reviewed changes

wmdi reviewed Jun 12, 2024

View reviewed changes

Pietro Max Marsella added 2 commits June 25, 2024 12:07

Tests + fixes for parallel cost estimator

cd76042

Tests + fixes for parallel_cost_estimator

94e6285

Marsella8 commented Jul 1, 2024

View reviewed changes

Marsella8 marked this pull request as ready for review July 1, 2024 23:03

lockshaw requested changes Jul 6, 2024

View reviewed changes

Pietro Max Marsella added 2 commits July 15, 2024 00:15

PR fixes

b469d88

Formatting

44ed29e

Marsella8 commented Jul 15, 2024

View reviewed changes

lockshaw requested changes Jul 17, 2024

View reviewed changes

lockshaw changed the title ~~Adding cost_estimator implementation considering parallelization~~ Adding task-based simulator for PCGs Sep 18, 2024

lockshaw removed the repo-refactor label Dec 16, 2024

lockshaw changed the base branch from repo-refactor to master December 16, 2024 08:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding task-based simulator for PCGs #1365

Adding task-based simulator for PCGs #1365

Marsella8 commented Apr 12, 2024 •

edited

Loading

lockshaw left a comment

wmdi left a comment

Marsella8 left a comment

wmdi left a comment

Marsella8 left a comment

lockshaw left a comment

Marsella8 left a comment

lockshaw left a comment

Adding task-based simulator for PCGs #1365

Are you sure you want to change the base?

Adding task-based simulator for PCGs #1365

Conversation

Marsella8 commented Apr 12, 2024 • edited Loading

lockshaw left a comment

Choose a reason for hiding this comment

wmdi left a comment

Choose a reason for hiding this comment

Marsella8 left a comment

Choose a reason for hiding this comment

wmdi left a comment

Choose a reason for hiding this comment

Marsella8 left a comment

Choose a reason for hiding this comment

lockshaw left a comment

Choose a reason for hiding this comment

Marsella8 left a comment

Choose a reason for hiding this comment

lockshaw left a comment

Choose a reason for hiding this comment

Marsella8 commented Apr 12, 2024 •

edited

Loading