Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New MachineView representation #1458

Merged
merged 40 commits into from
Oct 9, 2024

Conversation

Marsella8
Copy link
Contributor

@Marsella8 Marsella8 commented Aug 7, 2024

Description of changes:

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

This change is Reviewable

@Marsella8 Marsella8 requested review from lockshaw and wmdi August 7, 2024 03:03
Copy link

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 94.33805% with 68 lines in your changes missing coverage. Please review.

Project coverage is 76.83%. Comparing base (a9d10d7) to head (20f6a75).
Report is 1 commits behind head on repo-refactor.

Files with missing lines Patch % Lines
lib/op-attrs/src/op-attrs/parallel_tensor_shape.cc 0.00% 18 Missing ⚠️
lib/pcg/src/pcg/device_id.cc 23.52% 13 Missing ⚠️
lib/pcg/src/pcg/machine_view.cc 87.50% 8 Missing ⚠️
lib/pcg/src/pcg/start_invariant_machine_view.cc 80.48% 8 Missing ⚠️
lib/op-attrs/src/op-attrs/parallel_dim.cc 0.00% 6 Missing ⚠️
lib/pcg/src/pcg/machine_space_offset.cc 63.63% 4 Missing ⚠️
lib/pcg/src/pcg/machine_specification.cc 87.09% 4 Missing ⚠️
lib/compiler/src/compiler/allowed_machine_views.cc 95.23% 3 Missing ⚠️
lib/pcg/src/pcg/operator_task_space.cc 88.23% 2 Missing ⚠️
lib/utils/include/utils/containers/foldl.h 92.85% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1458      +/-   ##
=================================================
+ Coverage          76.10%   76.83%   +0.72%     
=================================================
  Files                778      792      +14     
  Lines              25611    26559     +948     
  Branches             720      743      +23     
=================================================
+ Hits               19492    20406     +914     
- Misses              6119     6153      +34     
Flag Coverage Δ
unittests 76.83% <94.33%> (+0.72%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
lib/compiler/test/src/allowed_machine_views.cc 100.00% <100.00%> (ø)
...ler/machine_mapping/get_optimal_machine_mapping.cc 100.00% <100.00%> (ø)
...ne_mapping/get_tensor_set_movement_across_split.cc 99.36% <100.00%> (+0.16%) ⬆️
...st/src/compiler/machine_mapping/machine_mapping.cc 100.00% <100.00%> (ø)
...compiler/machine_mapping/machine_mapping_result.cc 100.00% <100.00%> (ø)
...-execution/include/local-execution/cost_estimate.h 0.00% <ø> (ø)
lib/local-execution/src/local_cost_estimator.cc 0.00% <ø> (ø)
lib/pcg/test/src/pcg/computation_graph_builder.cc 90.90% <ø> (ø)
lib/pcg/test/src/pcg/machine_specification.cc 100.00% <100.00%> (ø)
lib/pcg/test/src/pcg/machine_view.cc 100.00% <100.00%> (ø)
... and 30 more

... and 1 file with indirect coverage changes

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 35 of 41 files at r1, all commit messages.
Reviewable status: 35 of 41 files reviewed, 14 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/pcg/include/pcg/device_coordinates.struct.toml line 1 at r1 (raw file):

# DeviceCoordinates is exclusive to machine_view.cc, must not be used outside of it.

I don't think this requirement is necessary anymore with the new design?


lib/pcg/include/pcg/strided_rectangle.struct.toml line 17 at r1 (raw file):

  "utils/hash/unordered_multiset.h",
  "utils/fmt/unordered_multiset.h",
]

utils/hash and utils/fmt are only needed for the implementation, not the declaration

Suggestion:

includes = [
  "pcg/strided_rectangle_side.dtg.h",
  "<unordered_set>",
]

src_includes = [
  "utils/hash/unordered_multiset.h",
  "utils/fmt/unordered_multiset.h",
]

lib/pcg/src/pcg/device_id.cc line 29 at r1 (raw file):

int get_raw_id(device_id_t device_id) {
  if (get_device_type(device_id) == DeviceType::GPU) {

switch might be cleaner? You can also use overload if you'd prefer

Code quote:

  if (get_device_type(device_id) == DeviceType::GPU) {

lib/pcg/src/pcg/device_id.cc line 35 at r1 (raw file):

  } else {
    assert(false && "Unsupported DeviceType");
    return -1;

asserts will get removed in non-debug builds

Suggestion:

    throw mk_runtime_error(fmt::format("Unsupported DeviceType {} for device_id_t {}", get_device_type(device_id), device_id));

lib/pcg/test/src/pcg/machine_view.cc line 66 at r1 (raw file):

        CHECK(expected == result);
      }
      SUBCASE("get_last_device_id") {

Test functions separately (i.e., group by get_last_device_id rather than by MachineView tested on) except for particularly trivial utility functions


lib/pcg/test/src/pcg/machine_view.cc line 70 at r1 (raw file):

      }
    }
    SUBCASE("MachineView #2") {

Use better test names


lib/pcg/test/src/pcg/machine_view.cc line 93 at r1 (raw file):

  }

  TEST_CASE("MachineView make_1d_machine_view - GPU") {

The return type here isn't really necessary and is a bit confusing

Suggestion:

  TEST_CASE("make_1d_machine_view - GPU") {

lib/utils/include/utils/containers/foldl.h line 27 at r1 (raw file):

  C remaining(it, c.end());
  return foldl(remaining, init, func);
}

Suggestion:

template <typename C, typename F, typename E = typename C::value_type>
auto foldl1(C const &c, F func) -> typename C::value_type {
  if (c.empty()) {
     throw mk_runtime_error(fmt::format("foldl1 received empty container: {}", c)); 
  }
  
  std::optional<E> result;
  
  for (E const &e : c) { 
    if (!result.has_value()) {
      result = e; 
    } else {
      result = f(result.value(), e); 
    }
  }
  
  return result.value();
}

lib/utils/include/utils/containers/range.h line 11 at r1 (raw file):

template <typename T>
std::vector<T> range(T start, T end, T step = 1) {

C++ doesn't do generics super well so I'd lean toward concrete types unless there's a strong need for something more flexible

Suggestion:

std::vector<int> range(int start, int end, int step = 1) {

lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

template <typename T>
std::unordered_multiset<T> replicate(std::size_t n, T const &element) {

Prefer int--size_t intuitively seems like it should be good (it represents non-negative values!) but actually what it does is just remove the failure case by overflowing so you can't even assert that the value is okay. At some point we may add a non-negative integer type with the proper semantics, but we don't have one yet

Code quote:

std::size_t

lib/utils/include/utils/containers/scanl.h line 9 at r1 (raw file):

template <typename C, typename F, typename T>
std::vector<T> scanl(C const &c, T init, F const &op) {

Can you add a docstring with a reference to https://hackage.haskell.org/package/base-4.20.0.1/docs/Prelude.html#v:scanl? Not everyone on the FlexFlow team does a bunch of Haskell 🙂


lib/utils/include/utils/containers/scanl.h line 24 at r1 (raw file):

template <typename C, typename F>
auto scanl1(C const &c, F op) {
  using T = typename C::value_type;

See implementation of fold1 for a more readable version


lib/utils/test/src/utils/containers/foldl.cc line 9 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("foldl") {

I can see you're enjoying Haskell 😛


lib/utils/test/src/utils/containers/range.cc line 45 at r1 (raw file):

      std::vector<int> correct = {};
      CHECK(result == correct);
    }

What's the expected behavior of range(0, 10, -1)?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 24 of 46 files reviewed, 14 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/pcg/include/pcg/device_coordinates.struct.toml line 1 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think this requirement is necessary anymore with the new design?

Not a requirement anymore, but I think the user will generally reason about the position of the devices using device_id_t, so it probably won't be used.


lib/pcg/include/pcg/strided_rectangle.struct.toml line 17 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

utils/hash and utils/fmt are only needed for the implementation, not the declaration

Done.


lib/pcg/src/pcg/device_id.cc line 29 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

switch might be cleaner? You can also use overload if you'd prefer

Done.


lib/pcg/src/pcg/device_id.cc line 35 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

asserts will get removed in non-debug builds

Done.


lib/pcg/test/src/pcg/machine_view.cc line 66 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Test functions separately (i.e., group by get_last_device_id rather than by MachineView tested on) except for particularly trivial utility functions

Done.


lib/pcg/test/src/pcg/machine_view.cc line 70 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use better test names

Done.


lib/pcg/test/src/pcg/machine_view.cc line 93 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The return type here isn't really necessary and is a bit confusing

Done.


lib/utils/include/utils/containers/foldl.h line 27 at r1 (raw file):

  C remaining(it, c.end());
  return foldl(remaining, init, func);
}

cool pattern !


lib/utils/include/utils/containers/range.h line 11 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

C++ doesn't do generics super well so I'd lean toward concrete types unless there's a strong need for something more flexible

Done.


lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer int--size_t intuitively seems like it should be good (it represents non-negative values!) but actually what it does is just remove the failure case by overflowing so you can't even assert that the value is okay. At some point we may add a non-negative integer type with the proper semantics, but we don't have one yet

Ahh makes sense thanks.


lib/utils/include/utils/containers/scanl.h line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can you add a docstring with a reference to https://hackage.haskell.org/package/base-4.20.0.1/docs/Prelude.html#v:scanl? Not everyone on the FlexFlow team does a bunch of Haskell 🙂

Done.


lib/utils/include/utils/containers/scanl.h line 24 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See implementation of fold1 for a more readable version

Done.


lib/utils/test/src/utils/containers/foldl.cc line 9 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I can see you're enjoying Haskell 😛

hahaha yes, I'm starting to see why you're so frustrated with C++. Functional code is so clean, wish the C++ syntax was a bit more accomodating...


lib/utils/test/src/utils/containers/range.cc line 45 at r1 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's the expected behavior of range(0, 10, -1)?

empty range (this range is modelled after python's ranges, so I'd like to keep it consistent, but we can also change it to throw an error).

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 41 files at r1, 19 of 24 files at r2, all commit messages.
Reviewable status: 44 of 49 files reviewed, 32 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/include/compiler/machine_mapping.h line 63 at r2 (raw file):

std::unordered_set<StartInvariantMachineView>
    get_allowed_start_invariant_machine_views(
        MachineSpecification const &machinespec,

Suggestion:

        MachineSpecification const &machine_spec,

lib/compiler/src/machine_mapping.cc line 378 at r2 (raw file):

}

std::vector<int> get_tensor_parallel_degrees(ParallelTensorShape const &shape) {

Should not return a vector as that imposes an order on ParallelTensorShape's dims, should return either an unordered_set or a custom dtgen struct


lib/compiler/src/machine_mapping.cc line 385 at r2 (raw file):

}

bool is_valid_machine_view(MachineView const &mv,

A blank line here or there would also help readability--you can use them to break up flows of steps into logical pieces


lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

  tensor_degrees =
      filter(tensor_degrees, [](int degree) { return degree != 1; });
  return without_order(mv_degrees) == without_order(tensor_degrees);

This returns a multiset, right?

Code quote:

without_order(mv_degrees)

lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

Suggestion:

std::vector<int> const &tensor_dims

lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

I think a return type annotation here would really improve readability

Code quote:

[](std::vector<int> tensor_dims, int total_devices) {

lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

    int max_stride_upper_bound =
        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));

This is a bit complicated, I feel like a one-line comment would help me figure out what this dense piece of arithmetic is actually doing (or break up the nested expressions and give them each names. While composition of a bunch of functions can be concise, having intermediate variables can really help make code more readable as you can describe each intermediate value/step along the way. Conciseness is good when it benefits readability, but sometimes it gets in the way more than it helps)

Code quote:

    int max_stride_upper_bound =
        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));

lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

        (total_devices + 1) /
        product(transform(tensor_dims, [](int degree) { return degree - 1; }));
    std::unordered_multiset<std::vector<int>> strides = cartesian_product(

Should this be a stride_size_t?

Code quote:

int

lib/compiler/src/machine_mapping.cc line 410 at r2 (raw file):

  };

  std::vector<int> tensor_dims = filter(get_tensor_parallel_degrees(shape),

Seems like a common pattern, probably good to pull out a separate function?


lib/compiler/src/machine_mapping.cc line 413 at r2 (raw file):

                                        [](int degree) { return degree != 1; });
  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;

Why not just have a get_num_devices or something function for MachineSpecification?

Code quote:

  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;

lib/compiler/src/machine_mapping.cc line 414 at r2 (raw file):

  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;
  for (std::vector<int> stride :

Suggestion:

  for (std::vector<int> const &stride :

lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

  for (std::vector<int> stride :
       candidate_strides(tensor_dims, total_devices)) {
    for (int start_id = 0; start_id < total_devices; start_id++) {

Seems like this loop is intuitively creating StartInvariantMachineViews and then looping over possible start degrees? If so, could this be written to do that? Using more descriptive data structures instead of things like std::vector<int> or std::vector<std::unordered_set<int>> is super helpful for readability. This would also let you break up this function into more intuitive pieces (generating all start-invariant machine views, and then generating all possible start dims for them)


lib/compiler/src/machine_mapping.cc line 437 at r2 (raw file):

      get_candidate_machine_views(machinespec, shape);
  views = filter(views, [&](MachineView const &view) {
    return is_valid_machine_view(view, shape);

Suggestion:

return is_valid_machine_view(view, shape) && is_valid_machine_view(view, machinespec);

lib/compiler/test/src/machine_mapping.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  TEST_CASE("get_allowed_machine_view") {

Suggestion:

  TEST_CASE("get_allowed_machine_views") {

lib/compiler/test/src/machine_mapping.cc line 12 at r2 (raw file):

    SUBCASE("1 degree of parallelism") {
      MachineSpecification ms = MachineSpecification(5, 1, 1, 0, 0);

Suggestion:

      MachineSpecification ms = MachineSpecification{5, 1, 1, 0, 0};

lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

[[fields]]
name = "coords"
type = "::FlexFlow::FFOrdered<int>"

I don't think FFOrdered should be used here, since there's no correlation between MachineView dim ordering and tensor dimension ordering. FFOrdered represents things that are in the same order as the dims in a TensorShape (or the shard dims in a ParallelTensorShape)


lib/pcg/include/pcg/machine_view.h line 20 at r2 (raw file):

size_t num_dims(MachineView const &mv);
size_t num_devices(MachineView const &mv);
size_t get_size(MachineView const &mv);

What is get_size here? Name is a bit unclear, might be good to either clarify the name or just add a quick doxygen docstring explaining what this does (I think I know what it is and I'm not sure I can think of a better name, so that would imply a doxygen docstring. That said, I'm not sure this function is actually useful anywhere?)


lib/pcg/include/pcg/start_invariant_machine_view.h line 9 at r2 (raw file):

namespace FlexFlow {

MachineView to_start_dependent(StartInvariantMachineView const &mv,

Suggestion:

MachineView machine_view_from_start_invariant(StartInvariantMachineView const &mv,

lib/pcg/include/pcg/start_invariant_machine_view.h line 11 at r2 (raw file):

MachineView to_start_dependent(StartInvariantMachineView const &mv,
                               device_id_t const &start_id);
StartInvariantMachineView to_start_invariant(MachineView const &mv);

Suggestion:

StartInvariantMachineView start_invariant_from_machine_view(MachineView const &mv);

lib/pcg/include/pcg/strided_rectangle.h line 15 at r2 (raw file):

private:
  std::vector<StridedRectangleSide> _sides;

I don't remember if there are reserved id issues with underscore prefixes, but convention-wise we don't usually use them

Suggestion:

  std::vector<StridedRectangleSide> sides;

lib/pcg/include/pcg/strided_rectangle.h line 28 at r2 (raw file):

  bool operator>=(StridedRectangle const &) const;

  std::vector<StridedRectangleSide> get_sides() const;

For pure getters (things where the result is actually a field) it's safe to return a const &. That said, if you're ever in doubt just return a value type, as that's always safe 🙂

Suggestion:

  std::vector<StridedRectangleSide> const &get_sides() const;

lib/pcg/include/pcg/strided_rectangle.h line 29 at r2 (raw file):

  std::vector<StridedRectangleSide> get_sides() const;
};

Suggestion:

private:
  std::tuple<decltype(sides) const &> tie() const;
  
  friend struct std::hash<StridedRectangle>;
};

lib/pcg/src/pcg/strided_rectangle.cc line 21 at r2 (raw file):

bool StridedRectangle::operator==(StridedRectangle const &other) const {
  return std::tie(this->_sides) == std::tie(other._sides);

Suggestion:

  return this->tie() == other.tie();

lib/pcg/src/pcg/strided_rectangle.cc line 85 at r2 (raw file):

  result ^= std::hash<std::vector<::FlexFlow::StridedRectangleSide>>{}(
                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);

FYI this is actually a function in hash-utils called hash_combine, we just inline it in dtgen (for not really any good reason, we just do at the moment)


lib/pcg/src/pcg/strided_rectangle.cc line 86 at r2 (raw file):

                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);
  return result;

Suggestion:

  return get_std_hash(x.tie());

lib/utils/include/utils/containers/cartesian_product.h line 13 at r2 (raw file):

template <typename Container>
auto cartesian_product(Container const &containers) {

Seems like this is always going to be called over a std::vector<SomeContainer>, so maybe do that specialization to simplify this code a bit?

Code quote:

auto cartesian_product(Container const &containers) {

lib/utils/include/utils/containers/cartesian_product.h line 26 at r2 (raw file):

    }

    for (const auto &item : ordered[depth]) {

Suggestion:

ordered.at(depth)

lib/utils/include/utils/containers/cartesian_product.h line 28 at r2 (raw file):

    for (const auto &item : ordered[depth]) {
      current.push_back(item);
      recurse(current, depth + 1);

Would this be simpler as a non-recursive loop? Not a big issue as this is pretty readable, just wondering


lib/utils/src/utils/containers/any_of.h line 0 at r2 (raw file):
Rename to any_of.cc?


lib/utils/test/src/utils/containers/cartesian_product.cc line 58 at r2 (raw file):

          {1, 2}, {1, 3}, {1, 3}, {1, 2}};
      CHECK(result == correct);
    }

What if one container is empty and the others aren't? Might be a good test case to add

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 63 files reviewed, 32 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/include/compiler/machine_mapping.h line 63 at r2 (raw file):

std::unordered_set<StartInvariantMachineView>
    get_allowed_start_invariant_machine_views(
        MachineSpecification const &machinespec,

Done.


lib/compiler/src/machine_mapping.cc line 378 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should not return a vector as that imposes an order on ParallelTensorShape's dims, should return either an unordered_set or a custom dtgen struct

changed to unordered_multiset (but have to sort the multiset in get_candidate_machine_views, since I need the ordering to be coherent between stride and dimension).


lib/compiler/src/machine_mapping.cc line 385 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

A blank line here or there would also help readability--you can use them to break up flows of steps into logical pieces

Done.


lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This returns a multiset, right?

yes, since different dimensions might have the same degree (we want to see if there exists a mapping between the 2 (multi)sets of dimensions, so order does not matter)


lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

                                ParallelTensorShape const &shape) {

  auto candidate_strides = [](std::vector<int> tensor_dims, int total_devices) {

Done.


lib/compiler/src/machine_mapping.cc line 401 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think a return type annotation here would really improve readability

Done.


lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This is a bit complicated, I feel like a one-line comment would help me figure out what this dense piece of arithmetic is actually doing (or break up the nested expressions and give them each names. While composition of a bunch of functions can be concise, having intermediate variables can really help make code more readable as you can describe each intermediate value/step along the way. Conciseness is good when it benefits readability, but sometimes it gets in the way more than it helps)

I've simplified, but it's still a bit tricky to explain:
Naively, we could think that, given a (2,3) stride for example, it would result in 3*2=6 as many tiles occupied with respect to having all strides be 1, so we could say max_stride_product = num_total_devices/num_devices_used_by_tensor (where num_devices_used_by_tensor is the product of the parallel dims) and so we could simply say that the max stride across any dimension is max_stride_product, so we do the cartesian product and then filter downstream.
This however, doesn't quite work: consider, for example, a 2D MachineView with 2x2 points, and stride 2 across each dimension, and suppose there are 9 total devices. While the "volume" of the MachineView is technically 4x4, it can really fit into a 3x3 and thus we could fit it with the existing devices (since the outer volume can be cut without messing with any of the devices). To address this, I find not the number of total devices used by the tensor, but, the total number of "inner" devices, essentially the ones such that they have associated with them a full stride "volume". So I find the max stride for these using the naive procedure (which works since they all have full stride volume) and I know that if a given stride is too large for them then surely it's too large for the full set of devices, which essentially contains them. (Note that this solutions is incredibly underoptimized, if the generation turns out to be to slow I'll fix it).


lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Should this be a stride_size_t?

fixed. is there a way to cast a container containing type X to the same container but with type Y (so not using transform and casting within the lambda)?


lib/compiler/src/machine_mapping.cc line 410 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like a common pattern, probably good to pull out a separate function?

Done.


lib/compiler/src/machine_mapping.cc line 413 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just have a get_num_devices or something function for MachineSpecification?

Done.


lib/compiler/src/machine_mapping.cc line 414 at r2 (raw file):

  std::unordered_set<MachineView> machine_views;
  int total_devices = machinespec.num_nodes * machinespec.num_gpus_per_node;
  for (std::vector<int> stride :

Done.


lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like this loop is intuitively creating StartInvariantMachineViews and then looping over possible start degrees? If so, could this be written to do that? Using more descriptive data structures instead of things like std::vector<int> or std::vector<std::unordered_set<int>> is super helpful for readability. This would also let you break up this function into more intuitive pieces (generating all start-invariant machine views, and then generating all possible start dims for them)

fixed it up a bit, is this enough? I'm a bit hesitant using dtgen since I don't want to give too much visibility to components that are only used within a very narrow piece of the codebase. I could do something like using StrideVec = std::vector<stride_t> to make it more readable, though I'm not sure it would make it more understandable.


lib/compiler/src/machine_mapping.cc line 437 at r2 (raw file):

      get_candidate_machine_views(machinespec, shape);
  views = filter(views, [&](MachineView const &view) {
    return is_valid_machine_view(view, shape);

Done.


lib/compiler/test/src/machine_mapping.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  TEST_CASE("get_allowed_machine_view") {

Done.


lib/compiler/test/src/machine_mapping.cc line 12 at r2 (raw file):

    SUBCASE("1 degree of parallelism") {
      MachineSpecification ms = MachineSpecification(5, 1, 1, 0, 0);

Done.


lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think FFOrdered should be used here, since there's no correlation between MachineView dim ordering and tensor dimension ordering. FFOrdered represents things that are in the same order as the dims in a TensorShape (or the shard dims in a ParallelTensorShape)

Got it, so FFOrdered is only for Tensor dims?


lib/pcg/include/pcg/machine_view.h line 20 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is get_size here? Name is a bit unclear, might be good to either clarify the name or just add a quick doxygen docstring explaining what this does (I think I know what it is and I'm not sure I can think of a better name, so that would imply a doxygen docstring. That said, I'm not sure this function is actually useful anywhere?)

Obsolete and removed (originally it was the "volume", so product of all the sizes of the sides, which I thought I could use for doing bounds check, but it's now useless since the devices themselves can occupy a space smaller than that (e.g. imagine a 2D machine view with stride 2 and 2 devices per side, it technically occupies a 4x4 area but can also really fit into a 3x3)


lib/pcg/include/pcg/start_invariant_machine_view.h line 9 at r2 (raw file):

namespace FlexFlow {

MachineView to_start_dependent(StartInvariantMachineView const &mv,

Done.


lib/pcg/include/pcg/start_invariant_machine_view.h line 11 at r2 (raw file):

MachineView to_start_dependent(StartInvariantMachineView const &mv,
                               device_id_t const &start_id);
StartInvariantMachineView to_start_invariant(MachineView const &mv);

Done.


lib/pcg/include/pcg/strided_rectangle.h line 15 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't remember if there are reserved id issues with underscore prefixes, but convention-wise we don't usually use them

Done.


lib/pcg/include/pcg/strided_rectangle.h line 28 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

For pure getters (things where the result is actually a field) it's safe to return a const &. That said, if you're ever in doubt just return a value type, as that's always safe 🙂

Done.


lib/pcg/include/pcg/strided_rectangle.h line 29 at r2 (raw file):

  std::vector<StridedRectangleSide> get_sides() const;
};

Done.


lib/pcg/src/pcg/strided_rectangle.cc line 21 at r2 (raw file):

bool StridedRectangle::operator==(StridedRectangle const &other) const {
  return std::tie(this->_sides) == std::tie(other._sides);

Done.


lib/pcg/src/pcg/strided_rectangle.cc line 86 at r2 (raw file):

                x.get_sides()) +
            0x9e3779b9 + (result << 6) + (result >> 2);
  return result;

Done.


lib/utils/include/utils/containers/cartesian_product.h line 13 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems like this is always going to be called over a std::vector<SomeContainer>, so maybe do that specialization to simplify this code a bit?

Yes you're right, fixed (not it's vector<Containers> -> unordered_multiset<vector<...>>


lib/utils/include/utils/containers/cartesian_product.h line 26 at r2 (raw file):

    }

    for (const auto &item : ordered[depth]) {

Done.


lib/utils/include/utils/containers/cartesian_product.h line 28 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would this be simpler as a non-recursive loop? Not a big issue as this is pretty readable, just wondering

I like the recursive version and found it simpler to implement.


lib/utils/test/src/utils/containers/cartesian_product.cc line 58 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What if one container is empty and the others aren't? Might be a good test case to add

Done.


lib/utils/src/utils/containers/any_of.h line at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rename to any_of.cc?

Done.

Pietro Max Marsella added 2 commits August 13, 2024 12:33
Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 66 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/src/compiler/allowed_machine_views.cc line 124 at r3 (raw file):

                   start_invariant_from_machine_view);
}

Unsure where to put these functions, will we be using parallel_tensor_dim_idx as the main way of indexing the ParallelTensorShape going forwards or is it just for doing the machine view to tensor mapping?

wmdi
wmdi previously approved these changes Aug 21, 2024
Copy link
Collaborator

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 36 of 36 files at r3, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @lockshaw)

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 36 of 36 files at r3, all commit messages.
Reviewable status: all files reviewed, 50 unresolved discussions (waiting on @Marsella8)


lib/compiler/include/compiler/tensor_to_machine_view_injection.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "TensorToMachineViewInjection"

Why is this not a surjection? Seems like all of the MachineView dims need to be mapped to?


lib/compiler/include/compiler/tensor_to_machine_view_injection.struct.toml line 14 at r3 (raw file):

  "utils/bidict/bidict.h",
  "utils/hash/unordered_map.h"
]

Suggestion:

includes = [
  "pcg/machine_view_dim_idx.dtg.h",
  "op-attrs/parallel_tensor_dim_idx.dtg.h",
  "utils/bidict/bidict.h",
]

lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r3 (raw file):

}

bool is_valid_machine_view(MachineView const &mv,

I don't see this function declared in allowed_machine_view.h?


lib/compiler/src/compiler/allowed_machine_views.cc line 124 at r3 (raw file):

Previously, Marsella8 wrote…

Unsure where to put these functions, will we be using parallel_tensor_dim_idx as the main way of indexing the ParallelTensorShape going forwards or is it just for doing the machine view to tensor mapping?

For now let's put get_parallel_dim_at_idx and get_parallel_tensor_indices into parallel_tensor_dim_idx.h


lib/compiler/src/compiler/allowed_machine_views.cc line 152 at r3 (raw file):

std::unordered_set<machine_view_dim_idx>
    get_machine_view_indices(MachineView const &mv) {

Move to machine_view_dim_idx.h


lib/compiler/src/compiler/allowed_machine_views.cc line 157 at r3 (raw file):

}

bool is_valid_injection(TensorToMachineViewInjection const &injection,

Probably better to move this and the functions below to an tensor_to_machine_view_injection.h file


lib/compiler/src/compiler/allowed_machine_views.cc line 175 at r3 (raw file):

  std::unordered_set<parallel_tensor_dim_idx> shape_indices =
      get_parallel_tensor_indices(shape);
  shape_indices = filter(shape_indices, [&](auto const idx) {

Suggestion:

(parallel_tensor_dim_idx const &idx)

lib/compiler/src/compiler/allowed_machine_views.cc line 183 at r3 (raw file):

       permutations(shape_indices)) {
    TensorToMachineViewInjection injection =
        TensorToMachineViewInjection(bidict(zip(sorted(mv_indices), p)));

Suggestion:

  std::vector<machine_view_dim_idx> machine_view_dim_ordering = sorted(mv_indices);
  std::unordered_set<TensorToMachineViewInjection> result;
  for (std::vector<parallel_tensor_dim_idx> const &tensor_dim_ordering :
       permutations(shape_indices)) {
    TensorToMachineViewInjection injection =
        TensorToMachineViewInjection(bidict(zip(machine_view_dim_ordering, tensor_dim_ordering)));

lib/compiler/test/src/allowed_machine_ views.cc line 62 at r3 (raw file):

          views.insert(mv);
        }
        return views;

Suggestion:

        return unordered_set_of(transform(count(num_starts), 
                                          [&](int start) {
                                             return MachineView{
                                               device_id_t{gpu_id_t{start}},
                                               StridedRectangle{{
                                                 StridedRectangleSide{num_points_t{2}, stride_t{stride1}},
                                                 StridedRectangleSide{num_points_t{3}, stride_t{stride2}}
                                               }},
                                             };
                                           }));

lib/compiler/test/src/allowed_machine_ views.cc line 64 at r3 (raw file):

        return views;
      };
      std::unordered_set<MachineView> correct;

Is it possible to instead consider a smaller MachineSpecification where the set of MachineViews is small enough that you don't need all this infrastructure to generate it? This testcase is a bit complicated and might be rather annoying to debug


lib/compiler/test/src/allowed_machine_ views.cc line 136 at r3 (raw file):

          make_2d_view(/*stride1*/ 3, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 3),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 4)};

Suggestion:

      std::unordered_set<StartInvariantMachineView> correct = {
          make_2d_view(/*stride1*/ 1, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 2, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 2),
          make_2d_view(/*stride1*/ 3, /*stride2*/ 1),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 3),
          make_2d_view(/*stride1*/ 1, /*stride2*/ 4),
      };

lib/compiler/test/src/allowed_machine_ views.cc line 140 at r3 (raw file):

      std::unordered_set<StartInvariantMachineView> result =
          get_allowed_start_invariant_machine_views(ms, shape);
      CHECK(result == correct);

Suggestion:

      std::unordered_set<StartInvariantMachineView> result =
          get_allowed_start_invariant_machine_views(ms, shape);
          
      CHECK(result == correct);

lib/compiler/test/src/allowed_machine_ views.cc line 163 at r3 (raw file):

                         StridedRectangleSide{num_points_t(2), stride_t(4)},
                         StridedRectangleSide{num_points_t(3), stride_t(1)}},
                    }};

Suggestion:

  TEST_CASE("get_all_tensor_to_machine_view_injections") {
    ParallelTensorShape shape = ParallelTensorShape{
        ParallelTensorDims{
            FFOrdered<ShardParallelDim>{
                ShardParallelDim{10, 3},
            },
            ReplicaParallelDimSet{
                SumDegree{2},
                DiscardCopyDegree{2},
            },
        },
        DataType::FLOAT,
    };
    
    MachineView view =
        MachineView{
          device_id_from_index(0, DeviceType::GPU),
          StridedRectangle{{
            StridedRectangleSide{num_points_t(2), stride_t(1)},
            StridedRectangleSide{num_points_t(2), stride_t(4)},
            StridedRectangleSide{num_points_t(3), stride_t(1)},
          }},
        };

lib/compiler/test/src/allowed_machine_ views.cc line 178 at r3 (raw file):

    std::unordered_set<TensorToMachineViewInjection> result =
        get_all_tensor_to_machine_view_injections(view, shape);
    CHECK(correct == result);

Suggestion:

    std::unordered_set<TensorToMachineViewInjection> correct = {
        TensorToMachineViewInjection{b1}, TensorToMachineViewInjection{b2}};
    std::unordered_set<TensorToMachineViewInjection> result =
        get_all_tensor_to_machine_view_injections(view, shape);
        
    CHECK(correct == result);

lib/compiler/test/src/machine_mapping.cc line 7 at r3 (raw file):

TEST_SUITE(FF_TEST_SUITE) {

  // TEST_CASE("MachineMapping::combine") {

What's happening with this file?


lib/op-attrs/include/op-attrs/parallel_tensor_dim_idx.variant.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "parallel_tensor_dim_idx"

Naming convention-wise we usually either do ParallelTensorDimIdx or parallel_tensor_dim_idx_t, generally to allow parallel_tensor_dim_idx to be a variable name

Suggestion:

name = "parallel_tensor_dim_idx_t"

lib/op-attrs/include/op-attrs/parallel_tensor_dim_idx.variant.toml line 10 at r3 (raw file):

  "fmt",
]
explicit_constructors = false

Prefer explicit_constructors = true unless there's a good reason. What's the reason here?


lib/pcg/include/pcg/device_coordinates.struct.toml line 18 at r2 (raw file):

Previously, Marsella8 wrote…

Got it, so FFOrdered is only for Tensor dims?

Yes, to distinguish between the FlexFlow/pytorch ordering of dimensions and the flipped legion ordering of dimensions


lib/pcg/include/pcg/device_coordinates.struct.toml line 17 at r3 (raw file):

  "utils/fmt/vector.h",
  
]

Suggestion:

includes = [
  "<vector>",
]

src_includes = [
  "utils/hash/vector.h",
  "utils/fmt/vector.h",
]

lib/pcg/include/pcg/device_coordinates.struct.toml line 20 at r3 (raw file):

[[fields]]
name = "coords"

Suggestion:

name = "raw_coords"

lib/pcg/include/pcg/machine_view_dim_idx.struct.toml line 2 at r3 (raw file):

namespace = "FlexFlow"
name = "machine_view_dim_idx"

For convention we usually either use MachineViewDimIdx or machine_view_dim_idx_t so that machine_view_dim_idx can be a variable name

Suggestion:

name = "machine_view_dim_idx_t"

lib/pcg/include/pcg/strided_rectangle.h line 17 at r3 (raw file):

  std::vector<StridedRectangleSide> sides;
  std::tuple<std::vector<StridedRectangleSide> const &> tie() const;
  friend struct std::hash<StridedRectangle>;

Usually we put member variables (i.e., sides) into their own private: or public: section for readability

Suggestion:

private: 
  std::tuple<std::vector<StridedRectangleSide> const &> tie() const;
  friend struct std::hash<StridedRectangle>;

lib/pcg/src/pcg/device_id.cc line 36 at r3 (raw file):

    default:
      throw mk_runtime_error(
          fmt::format("Unsupported DeviceType {}", get_device_type(device_id)));

Why remove logging of the device_id_t?


lib/pcg/src/pcg/device_id.cc line 43 at r3 (raw file):

  switch (device_type) {
    case DeviceType::GPU:
      return device_id_t(gpu_id_t(idx));

Prefer {...} initialization for consistency

Suggestion:

      return device_id_t{gpu_id_t{idx}};

lib/pcg/src/pcg/machine_view.cc line 33 at r3 (raw file):

      sum(transform(zip(coefficients, as_vector(point.coords)),
                    [](auto const pair) { return pair.first * pair.second; })) +
      get_raw_id(mv.start);

Suggestion:

  size_t coord_offset = sum(transform(zip(coefficients, as_vector(point.coords)),
                    [](auto const pair) { return pair.first * pair.second; }));
  size_t raw_id = get_raw_id(mv.start) + coord_offset;

lib/pcg/src/pcg/machine_view.cc line 37 at r3 (raw file):

  return ((get_device_type(mv) == DeviceType::CPU)
              ? device_id_t(cpu_id_t(raw_id))
              : device_id_t(gpu_id_t(raw_id)));

Suggestion:

  return device_id_from_index(raw_id, get_device_type(mv));

lib/pcg/src/pcg/machine_view.cc line 43 at r3 (raw file):

  std::vector<std::vector<int>> ranges =
      transform(mv.rect.get_sides(), [](StridedRectangleSide const &side) {
        return range(0, get_side_size(side).unwrapped, side.stride.unwrapped);

Pull the lambda out into a separate function (get_points or something) in strided_rectangle_side.h

Probably also worth pulling out a get_coords or something in strided_rectangle.h

Code quote:

      transform(mv.rect.get_sides(), [](StridedRectangleSide const &side) {
        return range(0, get_side_size(side).unwrapped, side.stride.unwrapped);

lib/pcg/src/pcg/machine_view.cc line 47 at r3 (raw file):

  std::unordered_set<DeviceCoordinates> devices_as_points = unordered_set_of(
      transform(cartesian_product(ranges),
                [](auto const &point) { return DeviceCoordinates(point); }));

Prefer concrete type

Suggestion:

(std::vector<int> const &point) 

lib/pcg/src/pcg/machine_view.cc line 55 at r3 (raw file):

}

device_id_t get_last_device_id(MachineView const &mv) {

Suggestion:

device_id_t get_maximum_device_id(MachineView const &mv) {

lib/pcg/src/pcg/machine_view.cc line 59 at r3 (raw file):

  //     transform(mv.rect.get_sides(), [](StridedRectangleSide const &s) {
  //       return s.stride.unwrapped;
  //     }));

Delete comment

Code quote:

  // DeviceCoordinates last_device = DeviceCoordinates(
  //     transform(mv.rect.get_sides(), [](StridedRectangleSide const &s) {
  //       return s.stride.unwrapped;
  //     }));

lib/pcg/src/pcg/machine_view.cc line 67 at r3 (raw file):

}

std::vector<num_points_t> get_num_devices_per_dim(MachineView const &mv) {

Do MachineViews have ordered sides?

Code quote:

std::vector<num_points_t>

lib/pcg/src/pcg/machine_view.cc line 73 at r3 (raw file):

}

std::vector<side_size_t> get_side_size_per_dim(MachineView const &mv) {

Do MachineViews have ordered sides?

Code quote:

std::vector<side_size_t>

lib/pcg/src/pcg/strided_rectangle.cc line 24 at r3 (raw file):

std::tuple<std::vector<StridedRectangleSide> const &>
    StridedRectangle::tie() const {
  return std::tie(sides);

Suggestion:

  return std::tie(this->sides);

lib/pcg/src/pcg/strided_rectangle.cc line 52 at r3 (raw file):

std::vector<StridedRectangleSide> const &StridedRectangle::get_sides() const {
  return sides;

Always use explicit this to reduce ambiguity

Suggestion:

  return this->sides;

lib/pcg/test/src/pcg/machine_view.cc line 56 at r3 (raw file):

      }};
      gpu_id_t start(0);
      MachineView mv{device_id_t{start}, rect};

Suggestion:

      MachineView mv = MachineView{
        gpu_id_t{0},
        StridedRectangle{{
          StridedRectangleSide(num_points_t(2), stride_t{3}),
          StridedRectangleSide(num_points_t(2), stride_t{2}),
        }},
      };

lib/pcg/test/src/pcg/machine_view.cc line 59 at r3 (raw file):

      SUBCASE("get_device_ids") {
        std::unordered_set<device_id_t> expected =
            make_gpu_device_ids({0, 2, 12, 14});

Why aren't the device ids here {0, 2, 3, 5}? Or at the very least not {0, 2, 6, 8}? I feel like I don't understand the calculation here


lib/pcg/test/src/pcg/start_invariant_machine_view.cc line 16 at r3 (raw file):

    MachineView input = MachineView{start, rect};

    MachineView result = machine_view_from_start_invariant(

This is a good property for a rapidcheck test, but it would be good to also have a concrete test of each of the directions that isn't just testing invertibility


lib/pcg/test/src/pcg/strided_rectangle.cc line 17 at r3 (raw file):

      StridedRectangle r1 = StridedRectangle{{s1, s0}};
      CHECK(r0 == r1);
      CHECK(r1.get_sides() == std::vector<StridedRectangleSide>{s0, s1});

How much do we care that the sides are sorted vs just caring that there is a canonical order?


lib/utils/include/utils/bidict/bidict.h line 25 at r3 (raw file):

  }

  bidict(std::vector<std::pair<L, R>> init)

Any particular reason for adding this over the existing constructors?


lib/utils/include/utils/bidict/bidict.h line 26 at r3 (raw file):

  bidict(std::vector<std::pair<L, R>> init)
      : bidict(init.begin(), init.end()) {}

Suggestion:

  explicit bidict(std::vector<std::pair<L, R>> init)
      : bidict(init.begin(), init.end()) {}

lib/utils/include/utils/containers/cartesian_product.h line 17 at r3 (raw file):

  std::function<void(V &, size_t)> recurse = [&](V &current, size_t depth) {
    if (depth == containers.size()) {

Suggestion:

template <typename C, typename Elem = typename C::value_type>
std::unordered_multiset<std::vector<Elem>> 
  cartesian_product(std::vector<C> const &containers) {
  
  std::unordered_multiset<std::vector<Elem>> result;

  std::function<void(std::vector<Elem> &, size_t)> recurse = [&](std::vector<Elem> &current, size_t depth) {
    if (depth == containers.size()) {

lib/utils/include/utils/containers/cartesian_product.h line 22 at r3 (raw file):

    }

    for (const auto &item : containers.at(depth)) {

Suggestion:

    for (Elem const &item : containers.at(depth)) {

lib/utils/include/utils/containers/filter.h line 53 at r3 (raw file):

  std::copy_if(m.cbegin(), m.cend(), std::inserter(result, result.begin()), f);
  return result;
}

Suggestion:

template <typename Elem, typename F>
std::unordered_multiset<Elem> filter(std::unordered_multiset<Elem> const &m,
                                     F const &f) {
  std::unordered_multiset<Elem> result;
  std::copy_if(m.cbegin(), m.cend(), std::inserter(result, result.begin()), f);
  return result;
}

lib/utils/include/utils/containers/permutations.h line 13 at r3 (raw file):

template <typename C, typename V = std::vector<typename C::value_type>>
auto permutations(C const &container) {

FYI there's also a lazy implementation of permutations on the substitutions-fix branch at https://github.com/lockshaw/FlexFlow/blob/substitutions-fix/lib/utils/include/utils/containers/get_all_permutations.h (note that having this version too is fine, just wanted to make you aware)


lib/utils/include/utils/containers/permutations.h line 13 at r3 (raw file):

template <typename C, typename V = std::vector<typename C::value_type>>
auto permutations(C const &container) {

Suggestion:

template <typename C, typename Elem = typename C::value_type>
std::unordered_set<std::vector<Elem>> permutations(C const &container) {

lib/utils/include/utils/containers/replicate.h line 9 at r1 (raw file):

Previously, Marsella8 wrote…

Ahh makes sense thanks.

Not seeing this fix?


lib/utils/test/src/utils/containers/cartesian_product.cc line 10 at r1 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("cartesian_product") {
    SUBCASE("empty") {

Why remove this test case?


lib/utils/test/src/utils/containers/permutations.cc line 5 at r3 (raw file):

#include <doctest/doctest.h>
#include <unordered_set>
#include <vector>

Suggestion:

#include <unordered_set>
#include <vector>
#include "utils/fmt/vector.h"
#include "utils/fmt/unordered_set.h"

lib/compiler/src/machine_mapping.cc line 393 at r2 (raw file):

Previously, Marsella8 wrote…

yes, since different dimensions might have the same degree (we want to see if there exists a mapping between the 2 (multi)sets of dimensions, so order does not matter)

Might be good to rename that function to multiset_of to remove ambiguity


lib/compiler/src/machine_mapping.cc line 404 at r2 (raw file):

Previously, Marsella8 wrote…

I've simplified, but it's still a bit tricky to explain:
Naively, we could think that, given a (2,3) stride for example, it would result in 3*2=6 as many tiles occupied with respect to having all strides be 1, so we could say max_stride_product = num_total_devices/num_devices_used_by_tensor (where num_devices_used_by_tensor is the product of the parallel dims) and so we could simply say that the max stride across any dimension is max_stride_product, so we do the cartesian product and then filter downstream.
This however, doesn't quite work: consider, for example, a 2D MachineView with 2x2 points, and stride 2 across each dimension, and suppose there are 9 total devices. While the "volume" of the MachineView is technically 4x4, it can really fit into a 3x3 and thus we could fit it with the existing devices (since the outer volume can be cut without messing with any of the devices). To address this, I find not the number of total devices used by the tensor, but, the total number of "inner" devices, essentially the ones such that they have associated with them a full stride "volume". So I find the max stride for these using the naive procedure (which works since they all have full stride volume) and I know that if a given stride is too large for them then surely it's too large for the full set of devices, which essentially contains them. (Note that this solutions is incredibly underoptimized, if the generation turns out to be to slow I'll fix it).

Ah okay, yeah I'd add that explanation in as a comment. The code makes a lot more sense with that bit of context


lib/compiler/src/machine_mapping.cc line 405 at r2 (raw file):

Previously, Marsella8 wrote…

fixed. is there a way to cast a container containing type X to the same container but with type Y (so not using transform and casting within the lambda)?

I don't think so, and generally we avoid casting anyway in favor of explicit construction


lib/compiler/src/machine_mapping.cc line 416 at r2 (raw file):

I'm a bit hesitant using dtgen since I don't want to give too much visibility to components that are only used within a very narrow piece of the codebase

That makes sense, but I think having a MultiDimensionalStride or something like that would be fine, especially as it's a pretty simple/obvious concept and could be used elsewhere e.g., to move get_strided_rectangle out into its own function that constructs a StridedRectangle from a starting point, a number of repetitions in each dim, and a MultiDimensionalStride

fixed it up a bit, is this enough?

It's definitely a big improvement! I do think breaking it up into stages that produce different named data structures (i.e., first generating the candidate MultiDimensionalStrides, then using them to generate the StartInvariantMachineViews, and then using them to generate the MachineViews) would help, along with a brief comment explaining the stride generation logic which is inherently a bit complicated. It might also be a good idea to add a brief comment to the candidate generation functions explaining the guarantees (if any) they provide on their outputs: for example, are each of the outputs guaranteed to have at least one valid MachineView, or can there be outputs for which no valid MachineViews exist?

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 21 files at r9.
Reviewable status: 73 of 96 files reviewed, 30 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/include/compiler/allowed_machine_views.h line 15 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

These two overloads seem nontrivially different(?), so giving them different (and clearer) names might help

second one was redundant.


lib/compiler/src/compiler/allowed_machine_views.cc line 67 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can any of this get pulled out into separate functions that make logical sense by themselves? Each statement is okay to read but the sheer number of statements is a bit intense and doesn't provide a lot of walking the reader through. If it really is this complicated that's fine (there are places where stuff just fundamentally needs a bunch of code that can't get decomposed), but I'd like to be convinced of that 🙂

I've tried to split it up a bit more, but it's probably not enough. Not super sure how to express the candidate_strides computation more cleanly otherwise.


lib/local-execution/test/src/test_local_cost_estimator.cc line 11 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's going on here?

the interface for MachineView has changed so there is not really an equivalent for make_1d_machine_view anymore.
It would have to be something like a OperatorTaskSpace which encodes the number of points (so degree of the only dimension) and then a MachineView which tells us the stride and the StartCoordinate within the MachineSpecification space.


lib/op-attrs/src/op-attrs/parallel_tensor_dim_idx_t.cc line 10 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't know how widely parallel_tensor_dim_idx_t will be used, but it's such an obvious thing to have conceptually that I think it makes sense to integrate into parallel_tensor_shape.h, etc.

Done. note that since we don't have the bijective mapping stuff anymore, parallel_tensor_dim_idx is not needed anymore, but I've moved it to 'parallel_tensor_shape.cc` for future use.


lib/pcg/include/pcg/machine_specification.h line 19 at r8 (raw file):

bool is_valid_machine_space_coordinates(MachineSpecification const &ms,
                                        MachineSpaceCoordinate const &coord);

Done.


lib/pcg/include/pcg/machine_view.struct.toml line 29 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why are these separate vectors rather than a single vector?

we can have them as a single vector of pairs (or some custom other data type) but i think it would make it a bit more clunky to iterate over either vector. I can change it if you think it's preferrable.


lib/pcg/include/pcg/machine_view_dim_idx_t.h line 9 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why a vector here?

old residue from the previous interface, now it's deleted


lib/pcg/include/pcg/task_space_operator.h line 12 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just get_task_space_coordinates?

Done.


lib/pcg/include/pcg/task_space_operator.h line 15 at r8 (raw file):

TaskSpaceCoordinate
    get_maximum_fragment_coordinate(TaskSpaceOperator const &task);

Done.


lib/pcg/include/pcg/task_space_operator.h line 18 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Clearer and doesn't require new terminology (i.e., "fragment")--not that I mind new terminology when necessary, just don't think it's necessary here

Done.


lib/pcg/src/pcg/delete.c line at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What's going on here?

sorry old file where I dumped some stuff, now deleted


lib/pcg/src/pcg/machine_specification.cc line 43 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be better to throw an exception so that the out-of-bounds case can be tested? We can't really test that an assert is fired

Done + added some tests


lib/pcg/src/pcg/machine_view.cc line 47 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This needs simplifying (pull out some lambdas and/or functions--for example, a function that adds DeviceCoordinates together would make a lot of sense, or maybe a function that gives you a number based on a stride in a StridedRectangleDim?). When you're dealing with wrapper types, instead of just unwrapping them in your function it's often better to write small wrapper functions that do the unwrapping for you (where those functions logically stand by themselves) and then use those, e.g., the functions in computation_graph.h

changed, lmk if this is better (old implementation was crap, I apologize)


lib/pcg/test/src/pcg/machine_view.cc line 35 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add argument name comments?

Done.


lib/utils/include/utils/containers/get_all_permutations.h line 105 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd move this to a separate get_all_permutations_with_repetitions.h file unless there's a strong reason it should be here

Done.


lib/utils/include/utils/containers/transform.h line 50 at r8 (raw file):

template <typename F,
          typename In,
          typename Out = decltype(std::declval<F>()(std::declval<In>()))>

Done.


lib/utils/include/utils/containers/transform.h line 53 at r8 (raw file):

std::set<Out> transform(std::set<In> const &v, F const &f) {
  std::set<Out> result;
  for (auto const &e : v) {

Done.


lib/utils/test/src/utils/containers/get_all_permutations.cc line 56 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Come up with better testcase descriptions. The goal of these should be a high level discussion of semantic properties and/or the failure cases they're trying to check--"container size = 3, n = 1" is something I can trivially see from the code and isn't really (at least in the current wording) a high-level idea. If there's a reason this testcase is here (say, testing the limiting case that we have only one element), then the description should be clarified. If there's not a good high-level reason for that specific testcase, it should probably be removed 🙂

great point, yeah tests weren't super targeted, slightly retooled them to be more informative.


lib/utils/test/src/utils/containers/get_all_permutations.cc line 60 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm guessing this terminology is from https://en.wikipedia.org/wiki/Permutation#Permutations_with_repetition--that said, the naming is somewhat confusing (as even wikipedia notes), so either come up with a slightly clearer name for this (isn't this just essentially cartesian product in the case that the tuple size is >= the list size?) or add an explicit docstring with links to that wikipedia page

afaik, permutations_with_repetition is the most standard name for this kind of behaviour. I've added a docstring


lib/utils/test/src/utils/containers/range.cc line 2 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Don't these have to be test/utils/doctest/fmt/vector.h now?

Done.


lib/utils/include/utils/fmt/unordered_multiset.h line 27 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete?

Done.


lib/pcg/include/pcg/machine_specification_coordinate.struct.toml line 17 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

"inter" by itself is unclear. "inter"-what?

Done.


lib/pcg/include/pcg/machine_specification_coordinate.struct.toml line 21 at r7 (raw file):

[[fields]]
name = "intra"

Done.


lib/pcg/include/pcg/task_space_operator.struct.toml line 2 at r9 (raw file):

namespace = "FlexFlow"
name = "TaskSpaceOperator"

Done.


lib/pcg/include/pcg/task_space_operator.struct.toml line 24 at r9 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is there a value to using num_points now? It doesn't seem like there's really an alternative for OperatorTaskSpace (like, there's no way for it to be strided, so the extra safety doesn't seem as worth it now?)

agree, only thing i can think of is that num_points_t makes clear what the degree of an operator means. changed back to int


lib/utils/include/utils/containers/unordered_multiset_of.h line 10 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It's not really a big deal either way, but technically std::cbegin and std::cend work on raw arrays. This was really more of a "C++ interesting FYI" than a particularly important comment tbh

got it

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 21 files at r9, 34 of 36 files at r10, all commit messages.
Reviewable status: 99 of 101 files reviewed, 38 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 18 at r10 (raw file):

    SUBCASE("1 degree of parallelism") {

      MachineSpecification ms = MachineSpecification{1, 5, 5, 0, 0};

Suggestion:

      MachineSpecification ms = MachineSpecification{
        /*num_nodes=*/1, 
        /*num_cpus_per_node=*/5, 
        /*num_gpus_per_node=*/5, 
        /*inter_node_bandwidth=*/0, 
        /*intra_node_bandwidth=*/0,
      };

lib/compiler/test/src/allowed_machine_views.cc line 24 at r10 (raw file):

          MachineView{{{stride_t{1}}},
                      {{MachineSpecificationDimension::INTRA_NODE}},
                      MachineSpaceCoordinate{0, 0, DeviceType::GPU}},

Suggestion:

                      MachineSpaceCoordinate{
                        /*node=idx=*/0, 
                        /*device_idx=*/0, 
                        DeviceType::GPU,
                      }},

lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

    SUBCASE("2 degrees of parallelism") {

      MachineSpecification ms = MachineSpecification{3, 3, 3, 0, 0};

Suggestion:

      MachineSpecification ms = MachineSpecification{
        /*num_nodes=*/3, 
        /*num_cpus_per_node=*/3, 
        /*num_gpus_per_node=*/3, 
        /*inter_node_bandwidth=*/0, 
        /*intra_node_bandwidth=*/0,
      };

lib/compiler/test/src/allowed_machine_views.cc line 48 at r10 (raw file):

      OperatorTaskSpace task = OperatorTaskSpace{{2, 3}};

      auto make_2d_views = [&](int start_x,

Suggestion:

      auto make_2d_view

lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

      auto make_2d_views = [&](int start_x,
                               int start_y,

Aren't these in machine space? What does y mean in a machine space?

Code quote:

                               int start_y,

lib/compiler/test/src/allowed_machine_views.cc line 63 at r10 (raw file):

      auto inter = MachineSpecificationDimension::INTER_NODE;
      std::unordered_set<MachineView> correct = {
          make_2d_views(0, 0, /*stride1*/ 1, /*stride2*/ 1, inter, intra),

Suggestion:

          make_2d_views(0, 0, /*stride1=*/1, /*stride2=*/1, inter, intra),

lib/local-execution/test/src/test_local_cost_estimator.cc line 11 at r8 (raw file):

Previously, Marsella8 wrote…

the interface for MachineView has changed so there is not really an equivalent for make_1d_machine_view anymore.
It would have to be something like a OperatorTaskSpace which encodes the number of points (so degree of the only dimension) and then a MachineView which tells us the stride and the StartCoordinate within the MachineSpecification space.

Got it, so we likely need to (at some point) add a primitive in op-attrs that lets us take an UnmappedCostEstimateKey and return the OperatorTaskSpace for the operator?


lib/pcg/include/pcg/machine_view.struct.toml line 29 at r8 (raw file):

Previously, Marsella8 wrote…

we can have them as a single vector of pairs (or some custom other data type) but i think it would make it a bit more clunky to iterate over either vector. I can change it if you think it's preferrable.

I'd go with one vector due to the "make invalid states unrepresentable" reasoning (one vector prevents the two from having different lengths). I'd add a custom well-named (maybe MachineDimensionStride? or StridedMachineDimension? or even something as basic as MachineDimensionWithStride? Feel free to use something else if you can come up with something, I'm not exactly in love with any of the above names) .struct.toml datatype to represent the combination of the two


lib/pcg/test/src/pcg/machine_specification.cc line 11 at r10 (raw file):

  TEST_CASE("MachineSpecification") {

    MachineSpecification ms = MachineSpecification{4, 16, 8, 100.0f, 200.0f};

Suggestion:

    MachineSpecification ms = MachineSpecification{
      /*num_nodes=*/4, 
      /*num_cpus_per_node=*/16, 
      /*num_gpus_per_node=*/8, 
      /*inter_node_bandwidth=*/100.0f, 
      /*intra_node_bandwidth=*/200.0f,
    };

lib/pcg/test/src/pcg/machine_specification.cc line 29 at r10 (raw file):

      SUBCASE("valid MachineSpaceCoordinate") {
        MachineSpaceCoordinate coord =
            MachineSpaceCoordinate{2, 12, DeviceType::CPU};

Suggestion:

           MachineSpaceCoordinate{
             /*node_idx=*/2, 
             /*device_idx=*/12, 
             DeviceType::CPU,
           };

lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

        CHECK(correct == result);
      }
      SUBCASE("invalid MachineSpaceCoordinate") {

Suggestion:

      SUBCASE("MachineSpaceCoordinate out of bounds of machine spec") {

lib/pcg/test/src/pcg/machine_specification.cc line 37 at r10 (raw file):

      SUBCASE("invalid MachineSpaceCoordinate") {
        MachineSpaceCoordinate coord =
            MachineSpaceCoordinate{2, 18, DeviceType::CPU};

Suggestion:

            MachineSpaceCoordinate{
              /*node_idx=*/2, 
              /*device_idx=*/18, 
              DeviceType::CPU,
            };

lib/pcg/test/src/pcg/machine_view.cc line 52 at r6 (raw file):

Previously, Marsella8 wrote…

Currently the indexing is done "jumping along with the stride" (e.g 3x3 machine view, 2x2 points, 2x2 stride, we have (0,0) be in position (0, 0), (1,0) be in the real position (2, 0), etc...). Depends on whether we want to index the machineview directly (in which case it's better to have the current indexing) or not (in which case you'd prefer the second one). I personally like the second one best, but when we were discussing it a few weeks ago you had used the current one and had assumed you preferred it.

I think this was an out-of-date comment from when I reviewed pre-our-few-weeks-ago-discussion, sorry for the mixup


lib/pcg/test/src/pcg/machine_view.cc line 38 at r10 (raw file):

                                                     /*num_gpus_per_node*/ 6,
                                                     0,
                                                     0};

Suggestion:

      MachineSpecification ms = MachineSpecification{
        /*num_nodes=*/1,
        /*num_cpus_per_node=*/6,
        /*num_gpus_per_node=*/6,
        /*inter_node_bandwidth=*/0,
        /*intra_node_bandwidth=*/0,
      };

lib/pcg/test/src/pcg/machine_view.cc line 40 at r10 (raw file):

                                                     0};

      SUBCASE("Fragment 0") {

Didn't we get rid of "fragment"?

Code quote:

      SUBCASE("Fragment 0") {

lib/pcg/test/src/pcg/machine_view.cc line 49 at r10 (raw file):

      }

      SUBCASE("Fragment 1") {

Come up with better case names. It appears you have three subcases here for the 2d thing, so I'm assuming you're checking the start point and then checking the stepping in each of the two dimensions?

Actually looks like not as there's only one dim, so I'm not sure why Fragment 2 is here at all. Try to think about your subcase names as explaining to me why they're they're, not just what they are (you can add near-infinite testcases, so your names should give me some idea of why this particular set of cases was chosen).

Code quote:

      SUBCASE("Fragment 1") {

lib/pcg/test/src/pcg/machine_view.cc line 58 at r10 (raw file):

      }

      SUBCASE("Fragment 2") {

What happens if the TaskSpaceCoordinate is out of bounds?


lib/pcg/test/src/pcg/machine_view.cc line 74 at r10 (raw file):

                      {{MachineSpecificationDimension::INTER_NODE,
                        MachineSpecificationDimension::INTRA_NODE}},
                      MachineSpaceCoordinate{1, 2, DeviceType::GPU}};

Feels a bit weird having the start field last? 😂

Code quote:

                      MachineSpaceCoordinate{1, 2, DeviceType::GPU}}

lib/pcg/test/src/pcg/machine_view.cc line 80 at r10 (raw file):

                                                     /*num_gpus_per_node*/ 5,
                                                     0,
                                                     0};

Suggestion:

      MachineSpecification ms = MachineSpecification{
        /*num_nodes=*/3,
        /*num_cpus_per_node=*/5,
        /*num_gpus_per_node=*/5,
        /*inter_node_bandwidth=*/0,
        /*intra_node_bandwidth=*/0,
      };

lib/pcg/test/src/pcg/machine_view.cc line 82 at r10 (raw file):

                                                     0};

      SUBCASE("Fragment (0,0)") {

See above about better testcase naming

Code quote:

     SUBCASE("Fragment (0,0)") {

lib/pcg/test/src/pcg/machine_view.cc line 85 at r10 (raw file):

        TaskSpaceCoordinate coord = TaskSpaceCoordinate{{0, 0}};
        MachineSpaceCoordinate correct =
            MachineSpaceCoordinate{1, 2, DeviceType::GPU};

Suggestion:

            MachineSpaceCoordinate{/*node_idx=*/1, /*device_idx=*/2, DeviceType::GPU};

lib/pcg/test/src/pcg/machine_view.cc line 124 at r10 (raw file):

                      {{MachineSpecificationDimension::INTER_NODE,
                        MachineSpecificationDimension::INTRA_NODE,
                        MachineSpecificationDimension::INTRA_NODE}},

Rather than having duplicate INTER_NODE/INTRA_NODE values for the first time in the 3d case, why not have a separate 2d case that is just duplicates?

Code quote:

      OperatorTaskSpace task = OperatorTaskSpace{{2, 2, 2}};
      MachineView mv =
          MachineView{{{stride_t{1}, stride_t{2}, stride_t{1}}},
                      {{MachineSpecificationDimension::INTER_NODE,
                        MachineSpecificationDimension::INTRA_NODE,
                        MachineSpecificationDimension::INTRA_NODE}},

lib/pcg/test/src/pcg/machine_view.cc line 128 at r10 (raw file):

      MachineSpecification ms = MachineSpecification{/*num_nodes*/ 2,
                                                     /*num_cpus_per_node*/ 8,

Fix parameter naming comments


lib/pcg/test/src/pcg/machine_view.cc line 133 at r10 (raw file):

                                                     0};

      SUBCASE("Fragment (0,0,1)") {

Better testcase names (see above)


lib/utils/include/utils/containers/get_all_permutations_with_repetition.h line 16 at r10 (raw file):

 * @details
 *https://en.wikipedia.org/wiki/Permutation#Permutations_with_repetition
 **/

Suggestion:

/**
 * @brief For a given container `c` and integer `n`, return all possible vectors
 * of size `n` that only contain (possibly duplicated) elements of `c`.
 * @details
 * https://en.wikipedia.org/wiki/Permutation#Permutations_with_repetition
 **/

lib/utils/include/utils/containers/get_all_permutations_with_repetition.h line 18 at r10 (raw file):

 **/
template <typename T>
struct permutations_with_repetition_container {

Would probably be better to just do this as a strict function rather than a lazy iterator until we actually have a concrete need to not materialize the whole container at once (I probably should have taken my own advice when writing get_permutations tbh). I'm not against the container implementation, but it's usually going to be much more complex and harder to read so we should only do it if we have a real reason


lib/utils/test/src/utils/containers/cartesian_product.cc line 4 at r10 (raw file):

#include "test/utils/doctest/fmt/unordered_multiset.h"
#include "utils/fmt/unordered_multiset.h"
#include "utils/fmt/vector.h"

Are these still necessary?

Code quote:

#include "utils/fmt/unordered_multiset.h"
#include "utils/fmt/vector.h"

lib/utils/test/src/utils/containers/replicate.cc line 4 at r10 (raw file):

#include "test/utils/doctest/fmt/vector.h"
#include "utils/fmt/unordered_set.h"
#include "utils/fmt/vector.h"

Are these still necessary?

Code quote:

#include "utils/fmt/unordered_set.h"
#include "utils/fmt/vector.h"

lib/utils/test/src/utils/containers/scanl.cc line 3 at r10 (raw file):

#include "utils/containers/scanl.h"
#include "test/utils/doctest/fmt/vector.h"
#include "utils/fmt/vector.h"

Is this still necessary?

Code quote:

#include "utils/fmt/vector.h"

lib/utils/test/src/utils/containers/scanl.cc line 40 at r10 (raw file):

  }

  TEST_CASE("scanl1") {

What happens if the input to scanl1 is empty?

Code quote:

 TEST_CASE("scanl1") {

lib/pcg/include/pcg/operator_task_space.h line 1 at r10 (raw file):

#ifndef _FLEXFLOW_PCG_INCLUDE_operator_task_space_H

Capitalization?


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("operator_task_space functions") {
    SUBCASE("get_task_space_coordinates") {

What if the task space dim list is empty?


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("operator_task_space functions") {
    SUBCASE("get_task_space_coordinates") {

Suggestion:

  TEST_CASE("get_task_space_coordinates") {

lib/pcg/test/src/pcg/operator_task_space.cc line 11 at r10 (raw file):

    SUBCASE("get_task_space_coordinates") {

      SUBCASE("2D Task") {

Suggestion:

      SUBCASE("2D Task Space") {

lib/pcg/test/src/pcg/operator_task_space.cc line 19 at r10 (raw file):

             TaskSpaceCoordinate{{0, 1}},
             TaskSpaceCoordinate{{1, 0}},
             TaskSpaceCoordinate{{1, 1}}}};

Suggestion:

        std::unordered_set<TaskSpaceCoordinate> correct = {
          TaskSpaceCoordinate{{0, 0}},
          TaskSpaceCoordinate{{0, 1}},
          TaskSpaceCoordinate{{1, 0}},
          TaskSpaceCoordinate{{1, 1}}.
        };

lib/pcg/test/src/pcg/operator_task_space.cc line 32 at r10 (raw file):

             TaskSpaceCoordinate{{0, 0, 1}},
             TaskSpaceCoordinate{{0, 1, 0}},
             TaskSpaceCoordinate{{0, 1, 1}}}};

Suggestion:

        std::unordered_set<TaskSpaceCoordinate> correct = {
          TaskSpaceCoordinate{{0, 0, 0}},
          TaskSpaceCoordinate{{0, 0, 1}},
          TaskSpaceCoordinate{{0, 1, 0}},
          TaskSpaceCoordinate{{0, 1, 1}},
        };

lib/pcg/test/src/pcg/operator_task_space.cc line 38 at r10 (raw file):

      }
    }
    SUBCASE("get_task_space_maximum_coordinate") {

Suggestion:

    TEST_CASE("get_task_space_maximum_coordinate") {

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 99 of 101 files reviewed, 38 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 18 at r10 (raw file):

    SUBCASE("1 degree of parallelism") {

      MachineSpecification ms = MachineSpecification{1, 5, 5, 0, 0};

Done.


lib/compiler/test/src/allowed_machine_views.cc line 24 at r10 (raw file):

          MachineView{{{stride_t{1}}},
                      {{MachineSpecificationDimension::INTRA_NODE}},
                      MachineSpaceCoordinate{0, 0, DeviceType::GPU}},

Done.


lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

    SUBCASE("2 degrees of parallelism") {

      MachineSpecification ms = MachineSpecification{3, 3, 3, 0, 0};

Done.


lib/compiler/test/src/allowed_machine_views.cc line 48 at r10 (raw file):

      OperatorTaskSpace task = OperatorTaskSpace{{2, 3}};

      auto make_2d_views = [&](int start_x,

Done.


lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Aren't these in machine space? What does y mean in a machine space?

Machine space is the 2D space described by machine specification. (!= Task Space, which is N-dimensional, with N dimension of the parallel tensor). So we have a vector of strides, a vector of dimensions to which to map each dimension of the task, and then a 2D coordinate (start_x, start_y) which is the node and device indices.


lib/compiler/test/src/allowed_machine_views.cc line 63 at r10 (raw file):

      auto inter = MachineSpecificationDimension::INTER_NODE;
      std::unordered_set<MachineView> correct = {
          make_2d_views(0, 0, /*stride1*/ 1, /*stride2*/ 1, inter, intra),

Done.


lib/local-execution/test/src/test_local_cost_estimator.cc line 11 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Got it, so we likely need to (at some point) add a primitive in op-attrs that lets us take an UnmappedCostEstimateKey and return the OperatorTaskSpace for the operator?

yes pretty much (we don't really need OperatorTaskSpace as long as we have a datatype from which we can get a vector of the parallel degrees, but obviously OperatorTaskSpace makes things more clean)


lib/pcg/test/src/pcg/machine_specification.cc line 11 at r10 (raw file):

  TEST_CASE("MachineSpecification") {

    MachineSpecification ms = MachineSpecification{4, 16, 8, 100.0f, 200.0f};

Done.


lib/pcg/test/src/pcg/machine_specification.cc line 29 at r10 (raw file):

      SUBCASE("valid MachineSpaceCoordinate") {
        MachineSpaceCoordinate coord =
            MachineSpaceCoordinate{2, 12, DeviceType::CPU};

Done.


lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

        CHECK(correct == result);
      }
      SUBCASE("invalid MachineSpaceCoordinate") {

Done.


lib/pcg/test/src/pcg/machine_specification.cc line 37 at r10 (raw file):

      SUBCASE("invalid MachineSpaceCoordinate") {
        MachineSpaceCoordinate coord =
            MachineSpaceCoordinate{2, 18, DeviceType::CPU};

Done.


lib/pcg/test/src/pcg/machine_view.cc line 38 at r10 (raw file):

                                                     /*num_gpus_per_node*/ 6,
                                                     0,
                                                     0};

Done.


lib/pcg/test/src/pcg/machine_view.cc line 40 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Didn't we get rid of "fragment"?

Done.


lib/pcg/test/src/pcg/machine_view.cc line 49 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Come up with better case names. It appears you have three subcases here for the 2d thing, so I'm assuming you're checking the start point and then checking the stepping in each of the two dimensions?

Actually looks like not as there's only one dim, so I'm not sure why Fragment 2 is here at all. Try to think about your subcase names as explaining to me why they're they're, not just what they are (you can add near-infinite testcases, so your names should give me some idea of why this particular set of cases was chosen).

made a docstring for each case that explains it pretty extensively.


lib/pcg/test/src/pcg/machine_view.cc line 58 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What happens if the TaskSpaceCoordinate is out of bounds?

now returns an optional if out of bound.
(also added testing)


lib/pcg/test/src/pcg/machine_view.cc line 74 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Feels a bit weird having the start field last? 😂

lol fair enough, changed :-)


lib/pcg/test/src/pcg/machine_view.cc line 80 at r10 (raw file):

                                                     /*num_gpus_per_node*/ 5,
                                                     0,
                                                     0};

Done.


lib/pcg/test/src/pcg/machine_view.cc line 82 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See above about better testcase naming

Done.


lib/pcg/test/src/pcg/machine_view.cc line 85 at r10 (raw file):

        TaskSpaceCoordinate coord = TaskSpaceCoordinate{{0, 0}};
        MachineSpaceCoordinate correct =
            MachineSpaceCoordinate{1, 2, DeviceType::GPU};

Done.


lib/pcg/test/src/pcg/machine_view.cc line 124 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rather than having duplicate INTER_NODE/INTRA_NODE values for the first time in the 3d case, why not have a separate 2d case that is just duplicates?

good point, added new test case


lib/pcg/test/src/pcg/machine_view.cc line 128 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Fix parameter naming comments

Done.


lib/pcg/test/src/pcg/machine_view.cc line 133 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Better testcase names (see above)

Done.


lib/utils/include/utils/containers/get_all_permutations_with_repetition.h line 16 at r10 (raw file):

 * @details
 *https://en.wikipedia.org/wiki/Permutation#Permutations_with_repetition
 **/

Done.


lib/utils/include/utils/containers/get_all_permutations_with_repetition.h line 18 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would probably be better to just do this as a strict function rather than a lazy iterator until we actually have a concrete need to not materialize the whole container at once (I probably should have taken my own advice when writing get_permutations tbh). I'm not against the container implementation, but it's usually going to be much more complex and harder to read so we should only do it if we have a real reason

Done.


lib/utils/test/src/utils/containers/cartesian_product.cc line 4 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Are these still necessary?

Done.


lib/utils/test/src/utils/containers/replicate.cc line 4 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Are these still necessary?

Done.


lib/utils/test/src/utils/containers/scanl.cc line 3 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this still necessary?

Done.


lib/utils/test/src/utils/containers/scanl.cc line 40 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What happens if the input to scanl1 is empty?

returns an empty list (added test case)


lib/pcg/include/pcg/operator_task_space.h line 1 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Capitalization?

Done.


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("operator_task_space functions") {
    SUBCASE("get_task_space_coordinates") {

Done.


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What if the task space dim list is empty?

Returns a single empty coordinate (vacuously makes sense?). Not sure what a zero dimensional TaskSpaceOperator would represent, whether something with no parallelism or a straight up wrong data-type (so something with no parallelism would be a TaskSpaceOperator with 1 or more ones. Let me know what you think is best.


lib/pcg/test/src/pcg/operator_task_space.cc line 11 at r10 (raw file):

    SUBCASE("get_task_space_coordinates") {

      SUBCASE("2D Task") {

Done.


lib/pcg/test/src/pcg/operator_task_space.cc line 19 at r10 (raw file):

             TaskSpaceCoordinate{{0, 1}},
             TaskSpaceCoordinate{{1, 0}},
             TaskSpaceCoordinate{{1, 1}}}};

Done.


lib/pcg/test/src/pcg/operator_task_space.cc line 32 at r10 (raw file):

             TaskSpaceCoordinate{{0, 0, 1}},
             TaskSpaceCoordinate{{0, 1, 0}},
             TaskSpaceCoordinate{{0, 1, 1}}}};

Done.


lib/pcg/test/src/pcg/operator_task_space.cc line 38 at r10 (raw file):

      }
    }
    SUBCASE("get_task_space_maximum_coordinate") {

Done.

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 87 of 102 files reviewed, 38 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/pcg/include/pcg/machine_view.struct.toml line 29 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd go with one vector due to the "make invalid states unrepresentable" reasoning (one vector prevents the two from having different lengths). I'd add a custom well-named (maybe MachineDimensionStride? or StridedMachineDimension? or even something as basic as MachineDimensionWithStride? Feel free to use something else if you can come up with something, I'm not exactly in love with any of the above names) .struct.toml datatype to represent the combination of the two

done

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 15 files at r11, all commit messages.
Reviewable status: 100 of 102 files reviewed, 7 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

Doesn't appear to have changed?


lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

Previously, Marsella8 wrote…

Machine space is the 2D space described by machine specification. (!= Task Space, which is N-dimensional, with N dimension of the parallel tensor). So we have a vector of strides, a vector of dimensions to which to map each dimension of the task, and then a 2D coordinate (start_x, start_y) which is the node and device indices.

Then why not just call it start_node_idx and start_device_idx?


lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

Previously, Marsella8 wrote…

Done.

I think you're missing "out of bounds" in the new name?


lib/pcg/test/src/pcg/machine_view.cc line 49 at r10 (raw file):

Previously, Marsella8 wrote…

made a docstring for each case that explains it pretty extensively.

My concern was more that the names weren't really justifying why that point was included for testing, but as it appears that all of the possible points are tested exhaustively I guess this is fine


lib/pcg/test/src/pcg/machine_view.cc line 245 at r11 (raw file):

        /**
         * The tasks will thus be distributed like this:

proj format appears to have butchered your nice diagram--there should be a clang-format comment you can insert to disable clang-format for these lines


lib/pcg/test/src/pcg/operator_task_space.cc line 9 at r10 (raw file):

Previously, Marsella8 wrote…

Returns a single empty coordinate (vacuously makes sense?). Not sure what a zero dimensional TaskSpaceOperator would represent, whether something with no parallelism or a straight up wrong data-type (so something with no parallelism would be a TaskSpaceOperator with 1 or more ones. Let me know what you think is best.

Yeah this is fine, I was mostly just curious and it's a bit nice to have edgecases like this documented

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 100 of 102 files reviewed, 7 unresolved discussions (waiting on @lockshaw and @wmdi)


lib/compiler/test/src/allowed_machine_views.cc line 45 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Doesn't appear to have changed?

Done.


lib/compiler/test/src/allowed_machine_views.cc line 49 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then why not just call it start_node_idx and start_device_idx?

Done.


lib/pcg/test/src/pcg/machine_specification.cc line 35 at r10 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think you're missing "out of bounds" in the new name?

Done.


lib/pcg/test/src/pcg/machine_view.cc line 245 at r11 (raw file):

Previously, lockshaw (Colin Unger) wrote…

proj format appears to have butchered your nice diagram--there should be a clang-format comment you can insert to disable clang-format for these lines

fixed

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r11, 3 of 3 files at r12, all commit messages.
Reviewable status: 101 of 102 files reviewed, 9 unresolved discussions (waiting on @Marsella8 and @wmdi)


lib/pcg/include/pcg/machine_view_dimension.struct.toml line 20 at r12 (raw file):

[[fields]]
name = "stride"
type = "::FlexFlow::stride_t"

Any particular reason to keep this as a stride_t rather than just dropping it to an int? Not a big issue, just curious


lib/pcg/src/pcg/machine_view.cc line 33 at r12 (raw file):

static std::vector<int>
    get_projection_indices(MachineView const &mv,

Suggestion:

    get_projection_indices_for_dimension(MachineView const &mv,

lib/pcg/src/pcg/machine_view.cc line 39 at r12 (raw file):

  std::vector<MachineSpecificationDimension> projections = get_projections(mv);
  for (size_t i = 0; i < projections.size(); ++i) {
    if (projections[i] == dimension) {

Suggestion:

projections.at(i)

lib/pcg/src/pcg/machine_view.cc line 44 at r12 (raw file):

  }
  return projection_indices;
}

Suggestion:

  std::vector<int> projection_indices
    = filter(count(projections.size()),
             [&](int idx) {
               return projections.at(idx) == dimension;
             });

  return projection_indices;
}

lib/pcg/src/pcg/machine_view.cc line 46 at r12 (raw file):

}

static int compute_index(int start_idx,

Try to aim for clearer function names even if they're static, especially if they take this number of parameters (in fact, arguably the sheer number of shared parameters imply that this should probably just be a lambda in get_machine_space_coordinate as it doesn't really seem to stand alone


lib/pcg/src/pcg/machine_view.cc line 62 at r12 (raw file):

    sizes.push_back(dim_size);
    coord_points.push_back(coord.raw_coord[i]);
    strides.push_back(mv_strides[i].unwrapped);

Code-style wise probably cleaner to generate sizes, coord_points, and strides as three separate transform expressions rather than one big combined loop

Code quote:

  for (int i : projection_indices) {
    int dim_size = task.degrees[i] * mv_strides[i].unwrapped;
    sizes.push_back(dim_size);
    coord_points.push_back(coord.raw_coord[i]);
    strides.push_back(mv_strides[i].unwrapped);

lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

                                 MachineView const &mv,
                                 TaskSpaceCoordinate const &coord,
                                 MachineSpecification const &ms) {

What about splitting this into a function that takes a MachineView and generates a DeduplicatedMachineView (which has all of the strides multiplied so you can't produce redundant points), and then a corresponding function that takes a DeduplicatedMachineView which then can do a much simpler computation to find the MachineSpaceCoordinate. Thoughts?

Overall one of the best way you can make code more readable is to introduce (sane and conceptually clear) intermediate types, and then break a complicated function that does A -> D into A -> B -> C -> D

Code quote:

std::optional<MachineSpaceCoordinate>
    get_machine_space_coordinate(OperatorTaskSpace const &task,
                                 MachineView const &mv,
                                 TaskSpaceCoordinate const &coord,
                                 MachineSpecification const &ms) {

lib/pcg/src/pcg/machine_view.cc line 96 at r12 (raw file):

  return ms_coord;
}
std::unordered_set<MachineSpaceCoordinate>

Is it possible that this should be an std::unordered_multiset, or is MachineView defined such as to make this impossible?


lib/pcg/src/pcg/machine_view.cc line 103 at r12 (raw file):

  return transform(
      get_task_space_coordinates(task), [&](TaskSpaceCoordinate const &c) {
        return get_machine_space_coordinate(task, mv, c, ms).value();

Try not to abbreviate these variables too much or the code starts to become rather hard to skim as I keep having to chase down definitions unnecessarily

Suggestion:

       return get_machine_space_coordinate(task, machine_view, coord, machine_spec).value();

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 15 files at r11.
Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @Marsella8)


lib/compiler/src/compiler/allowed_machine_views.cc line 67 at r6 (raw file):

Previously, Marsella8 wrote…

I've tried to split it up a bit more, but it's probably not enough. Not super sure how to express the candidate_strides computation more cleanly otherwise.

No worries, the key thing is that the high-level flow is clear. As long as you name get_candidate_strides clearly so I know what it does I can just ignore the details of it and still get a good idea of how the overall code works. Breaking things into separate lambdas, naming things clearly, having solid intermediate types, and having a clear logical flow (generate candidate values, generate an overapproximation of the space, then filter) like you did here is great and much more important than low-level style.


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r12 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  std::optional<MachineSpaceCoordinate> maximum_device_coords =

Suggestion:

  std::optional<MachineSpaceCoordinate> maximum_device_coord 

lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r12 (raw file):

 * The returned set includes all valid machine views, and might contain
 invalid ones. This function should never be used externally (see
 * `get_allowed_partial_machine_view_mappings` instead). There is no

Is this the right name?

Code quote:

`get_allowed_partial_machine_view_mappings

lib/compiler/src/compiler/allowed_machine_views.cc line 40 at r12 (raw file):

 its possible for all
 * `MachineView`s to be invalid)
 */

Suggestion:

/* Generates a set of candidate `MachineView`s
 * The returned set includes all valid machine views, and might contain
 * invalid ones. This function should not be used externally (see
 * `get_allowed_partial_machine_view_mappings` instead). There is no
 * guarantee that a non-empty returned set contains a valid machine view (i.e.
 * it's possible for all the returned `MachineView`s to be invalid)
 */

lib/compiler/src/compiler/allowed_machine_views.cc line 61 at r12 (raw file):

    std::vector<stride_t> single_stride_range =
        transform(range(1, max_stride_upper_bound + 1),
                  [](int stride) { return stride_t(stride); });

for consistency

Suggestion:

                  [](int stride) { return stride_t{stride}; });

lib/compiler/src/compiler/allowed_machine_views.cc line 75 at r12 (raw file):

    std::unordered_set<MachineSpaceCoordinate> result;
    for (int i : range(ms.num_nodes)) {
      for (int j : range(get_num_devices_per_node(ms, device_type))) {

Suggestion:

    for (int node_idx : range(ms.num_nodes)) {
      for (int device_idx : range(get_num_devices_per_node(ms, device_type))) {

lib/compiler/src/compiler/allowed_machine_views.cc line 99 at r12 (raw file):

         candidate_starts(machine_spec, device_type)) {
      for (std::vector<MachineSpecificationDimension> const &proj :
           candidate_projections(task)) {

FYI the whole projection vs machine-spec-dim is naming-wise a bit inconsistent and weird. Not a big issue as at least neither term is overloaded to also mean something else (you just have two words for one concept), but it can be a bit confusing at times

Code quote:

      for (std::vector<MachineSpecificationDimension> const &proj :
           candidate_projections(task)) {

lib/compiler/src/compiler/allowed_machine_views.cc line 103 at r12 (raw file):

            transform(zip(strides.raw_strides, proj), [&](auto const &p) {
              return MachineViewDimension{p.first, p.second};
            });

I would add a function to MachineView for constructing a MachineView from a start and separate lists of strides and projections and then use it here to simplify this code a bit

Code quote:

        std::vector<MachineViewDimension> dimensions =
            transform(zip(strides.raw_strides, proj), [&](auto const &p) {
              return MachineViewDimension{p.first, p.second};
            });

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 12 unresolved discussions (waiting on @lockshaw)


lib/compiler/src/compiler/allowed_machine_views.cc line 67 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

No worries, the key thing is that the high-level flow is clear. As long as you name get_candidate_strides clearly so I know what it does I can just ignore the details of it and still get a good idea of how the overall code works. Breaking things into separate lambdas, naming things clearly, having solid intermediate types, and having a clear logical flow (generate candidate values, generate an overapproximation of the space, then filter) like you did here is great and much more important than low-level style.

great tip, will keep it in mind in the future


lib/compiler/src/compiler/allowed_machine_views.cc line 27 at r12 (raw file):

                           OperatorTaskSpace const &task,
                           MachineSpecification const &ms) {
  std::optional<MachineSpaceCoordinate> maximum_device_coords =

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 36 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this the right name?

changed to the correct one


lib/compiler/src/compiler/allowed_machine_views.cc line 40 at r12 (raw file):

 its possible for all
 * `MachineView`s to be invalid)
 */

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 75 at r12 (raw file):

    std::unordered_set<MachineSpaceCoordinate> result;
    for (int i : range(ms.num_nodes)) {
      for (int j : range(get_num_devices_per_node(ms, device_type))) {

Done.


lib/compiler/src/compiler/allowed_machine_views.cc line 99 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI the whole projection vs machine-spec-dim is naming-wise a bit inconsistent and weird. Not a big issue as at least neither term is overloaded to also mean something else (you just have two words for one concept), but it can be a bit confusing at times

agreed, replaced projection since it's not only unclear but a bit of a misnomer, since the operation we are doing is not really projection as much as it is flattening an N-dim space into a 2D one.


lib/compiler/src/compiler/allowed_machine_views.cc line 103 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I would add a function to MachineView for constructing a MachineView from a start and separate lists of strides and projections and then use it here to simplify this code a bit

done


lib/pcg/include/pcg/machine_view_dimension.struct.toml line 20 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any particular reason to keep this as a stride_t rather than just dropping it to an int? Not a big issue, just curious

no strong reason / preference, I don't mind it since it makes things a bit more explicit / clear. I can change it if needed


lib/pcg/src/pcg/machine_view.cc line 33 at r12 (raw file):

static std::vector<int>
    get_projection_indices(MachineView const &mv,

Done.


lib/pcg/src/pcg/machine_view.cc line 39 at r12 (raw file):

  std::vector<MachineSpecificationDimension> projections = get_projections(mv);
  for (size_t i = 0; i < projections.size(); ++i) {
    if (projections[i] == dimension) {

Done.


lib/pcg/src/pcg/machine_view.cc line 44 at r12 (raw file):

  }
  return projection_indices;
}

Done.


lib/pcg/src/pcg/machine_view.cc line 46 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Try to aim for clearer function names even if they're static, especially if they take this number of parameters (in fact, arguably the sheer number of shared parameters imply that this should probably just be a lambda in get_machine_space_coordinate as it doesn't really seem to stand alone

yeah, put into a lambda, much clearer now


lib/pcg/src/pcg/machine_view.cc line 62 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Code-style wise probably cleaner to generate sizes, coord_points, and strides as three separate transform expressions rather than one big combined loop

done


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What about splitting this into a function that takes a MachineView and generates a DeduplicatedMachineView (which has all of the strides multiplied so you can't produce redundant points), and then a corresponding function that takes a DeduplicatedMachineView which then can do a much simpler computation to find the MachineSpaceCoordinate. Thoughts?

Overall one of the best way you can make code more readable is to introduce (sane and conceptually clear) intermediate types, and then break a complicated function that does A -> D into A -> B -> C -> D

not sure I fully agree, what would DeduplicatedMachineView represent exactly (the partial strides /degrees for each dimension)? I think the current logic is decently clear, at least at the top level. Cleaned up a bit, let me know if this is better


lib/pcg/src/pcg/machine_view.cc line 96 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is it possible that this should be an std::unordered_multiset, or is MachineView defined such as to make this impossible?

each TaskSpace coordinate maps to a unique MachineSpaceCoordinate, so it's good.


lib/pcg/src/pcg/machine_view.cc line 103 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Try not to abbreviate these variables too much or the code starts to become rather hard to skim as I keep having to chase down definitions unnecessarily

Done.

lockshaw
lockshaw previously approved these changes Oct 9, 2024
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r13, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)


lib/pcg/include/pcg/start_invariant_machine_view.h line 39 at r13 (raw file):

                                 MachineSpecification const &ms);

std::unordered_set<MachineSpaceCoordinate>

Seems a bit odd to return a MachineSpaceCoordinate here rather than some other type like MachineSpaceOffset or something to represent the fact that StartInvariantMachineView doesn't actually have a start


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, Marsella8 wrote…

not sure I fully agree, what would DeduplicatedMachineView represent exactly (the partial strides /degrees for each dimension)? I think the current logic is decently clear, at least at the top level. Cleaned up a bit, let me know if this is better

My understanding of how TaskSpaceCoordinate -> MachineSpaceCoordinate works is that for the most part it's a pretty simple calculation: given a machine view, you have machine_space_coord_x = task_space_coord_a * stride_a + start_x. The complexity comes in when you start to have multiple task space dimensions projected to the same machine space dimension, so you end up with, e.g., machine_space_coord_x = task_space_coord_a * stride_a + task_space_coord_b * stride_b + start_x, and then you have an issue where you can have two non-equal pairs (task_space_coord_a, task_space_coord_b) and (task_space_coord_a', task_space_coord_b') such that task_space_coord_a * stride_a + task_space_coord_b * stride_b == task_space_coord_a' * stride_a + task_space_coord_b' * stride_b, leading to multiple TaskSpaceCoordinates mapping to the same device. Correct me if I'm wrong, but the way we fix this is by introducing an order within the task space dimensions that map to a single machine space dimension (e.g., [a, b] here), and then we change the computation to task_space_coord_a * stride_a * (task_space_size_b * stride_b) + task_space_coord_b * stride_b + start_x, which makes the mapping one-to-one. As the logic for computing this is more complex, one way to simplify things would be to argue that stride_a * task_space_size_b * stride_b simply functions as a "normalized"/"deduplicated" stride stride_a', and then you could divide the computation into MachineView -> NormalizedMachineView, and then use the simpler computation for transforming a TaskSpaceCoordinate into a MachineSpaceCoordinatethrough aNormalizedMachineView`

The current code is reasonably nice so this isn't critical, just throwing it out there as a potential improvement as I don't think you'll be able to get the existing code much nicer via only style changes

Copy link
Contributor Author

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)


lib/pcg/include/pcg/start_invariant_machine_view.h line 39 at r13 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Seems a bit odd to return a MachineSpaceCoordinate here rather than some other type like MachineSpaceOffset or something to represent the fact that StartInvariantMachineView doesn't actually have a start

yeah agree, added new dtgen type.


lib/pcg/src/pcg/machine_view.cc line 78 at r12 (raw file):

Previously, lockshaw (Colin Unger) wrote…

My understanding of how TaskSpaceCoordinate -> MachineSpaceCoordinate works is that for the most part it's a pretty simple calculation: given a machine view, you have machine_space_coord_x = task_space_coord_a * stride_a + start_x. The complexity comes in when you start to have multiple task space dimensions projected to the same machine space dimension, so you end up with, e.g., machine_space_coord_x = task_space_coord_a * stride_a + task_space_coord_b * stride_b + start_x, and then you have an issue where you can have two non-equal pairs (task_space_coord_a, task_space_coord_b) and (task_space_coord_a', task_space_coord_b') such that task_space_coord_a * stride_a + task_space_coord_b * stride_b == task_space_coord_a' * stride_a + task_space_coord_b' * stride_b, leading to multiple TaskSpaceCoordinates mapping to the same device. Correct me if I'm wrong, but the way we fix this is by introducing an order within the task space dimensions that map to a single machine space dimension (e.g., [a, b] here), and then we change the computation to task_space_coord_a * stride_a * (task_space_size_b * stride_b) + task_space_coord_b * stride_b + start_x, which makes the mapping one-to-one. As the logic for computing this is more complex, one way to simplify things would be to argue that stride_a * task_space_size_b * stride_b simply functions as a "normalized"/"deduplicated" stride stride_a', and then you could divide the computation into MachineView -> NormalizedMachineView, and then use the simpler computation for transforming a TaskSpaceCoordinate into a MachineSpaceCoordinatethrough aNormalizedMachineView`

The current code is reasonably nice so this isn't critical, just throwing it out there as a potential improvement as I don't think you'll be able to get the existing code much nicer via only style changes

aah ok got it know thanks! I think I'll leave it like this for now, but will take this into consideration when going over that CPU kernel tests PR you tagged me in to see if it makes sense to have a centralized set of functions for working with these indices transformation.

lockshaw
lockshaw previously approved these changes Oct 9, 2024
Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r14, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @Marsella8)

@lockshaw lockshaw self-requested a review October 9, 2024 23:15
@lockshaw lockshaw enabled auto-merge (squash) October 9, 2024 23:15
@lockshaw lockshaw changed the title Unordered StridedRectangle, get_allowed_machine_views New MachineView representation Oct 9, 2024
@lockshaw lockshaw merged commit 65c3911 into flexflow:repo-refactor Oct 9, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants