Local execution e2e training #1472
base: master
Conversation
Reviewed 1 of 2 files at r1, 49 of 50 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
LIST_DIRECTORIES False src/*.cc src/cuda/cuda_helper.cu
Why is this not just src/cuda/*.cu
?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
std::optional<std::size_t> at_maybe(ff_dim_t) const; ArrayShape sub_shape(legion_dim_t start, ff_dim_t end) const;
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
ArrayShape ArrayShape::sub_shape(std::optional<ff_dim_t> start, std::optional<ff_dim_t> end) const { std::vector<size_t> new_shape;
Can probably be replaced by slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
lib/kernels/src/device.h
line 74 at r3 (raw file):
__global__ void scale_kernel(float *ptr, size_t size, float a, float b); __global__ void scale_kernel(float *ptr, unsigned long size, float a, float b);
Why?
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
TensorBackingMap const &, RuntimeArgConfig const &, std::optional<ModelTrainingInstance> &);
Why is this passed as mutable?
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "ModelTrainingInstance"
Why isn't the model part of ModelTrainingInstance
? The members seem to have changed non-trivially since the old version:
struct ModelTrainingInstance {
ComputationGraph computation_graph;
Optimizer optimizer;
EnableProfiling enable_profiling;
TrainingPCG training_pcg;
TensorMapping tensor_map;
LegionBacking legion_backing;
};
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Suggestion:
includes = [
"op-attrs/ops/loss_attrs.dtg.h",
"pcg/tensor_guid_t.dtg.h",
"pcg/optimizer_attrs.dtg.h",
]
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
TaskInvocation get_update_invocation(OptimizerAttrs const &, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &);
Some parameter names would be helpful here
Code quote:
std::vector<tensor_guid_t> const &);
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
TaskInvocation sgd_update(SGDOptimizerAttrs const &, tensor_guid_t const &weight, tensor_guid_t const &);
Some parameter names would be helpful here
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
tensor_guid_t const &weight, tensor_guid_t const &, tensor_guid_t const &);
Some parameter names would be helpful here
Code quote:
tensor_guid_t const &,
tensor_guid_t const &);
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
[[values]] type = "::FlexFlow::FwdBwdTaskImplFunction" key = "fwd_bwd_task_impl_function"
Would these be better renamed InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Code quote:
[[values]]
type = "::FlexFlow::InitTaskImplFunction"
key = "init_task_impl_function"
[[values]]
type = "::FlexFlow::FwdBwdTaskImplFunction"
key = "fwd_bwd_task_impl_function"
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
task_id_t task_id; TaskBinding binding; };
Can be moved over to dtgen
Code quote:
struct TaskInvocation {
public:
TaskInvocation() = delete;
TaskInvocation(task_id_t task_id, TaskBinding const &binding)
: task_id(task_id), binding(binding) {}
public:
task_id_t task_id;
TaskBinding binding;
};
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
#define _FLEXFLOW_LOCAL_EXECUTION_TASK_SIGNATURE_H // #include "local-execution/tensor_guid_slot_spec.dtg.h"
Delete?
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
} // adds arg_slot without checking is_serializable, used for arguments that are
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
// adds arg_slot without checking is_serializable, used for arguments that are // deviceSpecific
Change to proper doxygen docstring?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
features = [ "eq", "fmt",
Why not "hash"
?
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
Suggestion:
"<typeindex>",
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
Suggestion:
includes = [
"local-execution/tensor_guid_slot_spec.dtg.h",
"utils/type_index.h",
"<optional>",
]
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
[[fields]] name = "tensor_guid_slots" type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
Why not a map of slot_id_t -> something
?
Code quote:
type = "std::unordered_set<::FlexFlow::TensorGuidSlotSpec>"
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "TensorGuidSpec"
This name seems a bit ambiguous? Maybe something clearer?
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
tensor_backing_map, this->runtime_arg_config, model_training_instance);
Would it make more sense to disaggregate LocalTrainingBacking
rather than having an optional
field for the training instance?
Code quote:
std::optional<ModelTrainingInstance> model_training_instance = std::nullopt;
LocalTrainingBacking local_backing(allocator,
cg_builder.computation_graph,
tensor_backing_map,
this->runtime_arg_config,
model_training_instance);
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
sig.tensor_guid_slots.size() - 2; // ignore 2 (weight and weight_grad) std::vector<tensor_guid_t> buffer_tensors = get_new_tensor_guids_for_layer_without_graph_insertion(
Not a fan of this, come up with some new tensor id type
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Suggestion:
for (tensor_guid_t const &tensor_guid : buffer_tensors)
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
OpTaskBinding const &binding, layer_guid_t const &op_guid) const { TensorSlotsBacking mapping; int num_inputs = 0;
Either use filter
or add a count_if
function to containers
or something like it
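To illustrate, a minimal sketch of the count_if-style helper being suggested (the helper itself, the get_tensor_bindings accessor, and the is_input_slot predicate in the usage comment are assumptions for illustration, not the existing containers API):

#include <algorithm>

// count_if over an arbitrary container, in the spirit of the utils/containers helpers
template <typename C, typename F>
int count_if(C const &c, F const &f) {
  return std::count_if(c.cbegin(), c.cend(), f);
}

// hypothetical usage, replacing the manually-incremented num_inputs counter:
// int num_inputs = count_if(binding.get_tensor_bindings(),
//                           [](auto const &kv) { return is_input_slot(kv.first); });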
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
default: throw mk_runtime_error( fmt::format("Invalid TensorRole")); // inserting role yields
It shouldn't, can you try again?
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
IsGrad is_grad = slot_grad_id.is_grad; GenericTensorAccessorW tensor_backing = this->get_tensor_backing( tensor_guids.at(weight_adjusted_idx + tensor_spec.idx), is_grad);
What's going on here with weight_adjusted_idx
?
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
}, [](ConcreteArgSpec const &s) { return s; }, })});
Isn't this just map_values
?
Code quote:
for (auto const &arg_binding : binding.get_arg_bindings()) {
slot_id_t arg_slot = arg_binding.first;
TaskArgSpec task_arg_spec = arg_binding.second;
mapping.insert({arg_slot,
task_arg_spec.visit<ConcreteArgSpec>(overload{
[&](RuntimeArgRefSpec const &s) {
return this->resolve_runtime_arg_ref_spec(s);
},
[](ConcreteArgSpec const &s) { return s; },
})});
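For reference, a rough sketch of what the map_values version might look like (assuming a map_values(map, fn) utility that applies fn to each value while keeping the keys; the types are taken from the quoted code and may not match the final signature):

mapping = map_values(binding.get_arg_bindings(),
                     [&](TaskArgSpec const &task_arg_spec) {
                       return task_arg_spec.visit<ConcreteArgSpec>(overload{
                           [&](RuntimeArgRefSpec const &s) {
                             return this->resolve_runtime_arg_ref_spec(s);
                           },
                           [](ConcreteArgSpec const &s) { return s; },
                       });
                     });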
lib/local-execution/src/local_training_backing.cc
line 26 at r3 (raw file):
for (layer_guid_t const &node : topological_ordering(this->computation_graph)) {
Much appreciated 🙂
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
std::vector<tensor_guid_t> buffer_tensors = this->local_slots_backing.weight_optimizer_tensor_guids.at( weight_tensor);
Why not key this by the layer_guid_t
of the WeightAttrs
?
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
} this->training_instance = next(this->training_instance.value());
I'm not convinced that the next()
abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance
changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance
stays the same
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
add_slot(sig, LOGIT, IsGrad::NO); add_slot(sig, LABEL, IsGrad::NO); add_slot(sig, LOGIT, IsGrad::YES);
Why is LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::NO}); b.bind(LABEL, TensorGuidSpec{label, IsGrad::NO}); b.bind(LOGIT, TensorGuidSpec{logit, IsGrad::YES});
Same here, why is LOGIT
bound twice?
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
if (loss_type == LossFunction::SPARSE_CATEGORICAL_CROSSENTROPY) { // label shape is [parallel dim(?), batch dim, 1]
Label shape is an ArrayShape
, so it shouldn't have any parallel dims, right?
Code quote:
// label shape is [parallel dim(?), batch dim, 1]
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
AdamOptimizerAttrs old = old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>(); double new_beta1_t = old.beta_t * old.beta1;
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the ModelTrainingInstance
or would we want a separate optimizer state object/field to hold these mutable values?
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
new_attrs}; } return old_training_instance;
Might be shorter to just copy the old_training_instance
?
Suggestion:
if (old_training_instance.optimizer_attrs.has<AdamOptimizerAttrs>()) {
  old_training_instance.optimizer_attrs.get<AdamOptimizerAttrs>() = AdamOptimizerAttrs{
    ...
  };
}
return old_training_instance;
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b}; } return {task_id_t::SGD_UPD_PS_TASK_ID, b};
Bit more readable
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::SGD_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Suggestion:
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
b.bind_arg(HANDLE, ff_handle());
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b};
} else {
return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
}
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Suggestion:
TaskSignature get_update_signature(OptimizerAttrs const &attrs) {
return attrs.visit<TaskSignature>(overload{
[&](SGDOptimizerAttrs const &) { return get_sgd_update_signature(); },
[&](AdamOptimizerAttrs const &) {
return get_adam_update_signature();
}});
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
get_update_invocation(OptimizerAttrs const &attrs, tensor_guid_t const &weight, std::vector<tensor_guid_t> const &buffer_tensors) {
Is buffer
usually the word for these?
Code quote:
&buffer_tensors)
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Suggestion:
TaskImplFunction get_update_task_impl(OptimizerAttrs const &attrs) {
return attrs.visit<TaskImplFunction>(overload{
[&](SGDOptimizerAttrs const &) { return get_sgd_update_task_impl(); },
[&](AdamOptimizerAttrs const &) {
return get_adam_update_task_impl();
}});
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Suggestion:
NOT_IMPLEMENTED();
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
tensor_backing_map.insert({input_tensor, input_backing}); SUBCASE("SparseCategoricalCrossEntropyLossAttrs") {
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves, and then all that is needed is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
lib/local-execution/test/src/test_update_e2e.cc
line 0 at r3 (raw file):
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Suggestion:
#include "op-attrs/ops/core.h"
#include "op-attrs/ops/loss_attrs.dtg.h"
#include "op-attrs/ops/loss_function.dtg.h"
#include "op-attrs/ops/other_loss_attrs.dtg.h"
#include "op-attrs/ops/sparse_categorical_ce_loss_attrs.dtg.h"
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
namespace = "FlexFlow" name = "OtherLossAttrs"
Maybe NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Code quote:
name = "OtherLossAttrs"
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
std::vector<tensor_guid_t> get_new_tensor_guids_for_layer_without_graph_insertion(
Remove
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
namespace FlexFlow { OptimizerAttrs make_empty_sgd_attrs();
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
TaskSignature get_loss_bwd_signature(); TaskInvocation backward(LossAttrs const &, tensor_guid_t logit, tensor_guid_t label);
Why change the signature of backward
?
Reviewable status: all files reviewed, 47 unresolved discussions (waiting on @lockshaw)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this not just
src/cuda/*.cu
?
There are a couple of files in that folder that haven't been refactored/don't build: metrics_kernels.cu
and embedding_kernels.cu
. I think maybe Dylan was going to tackle those?
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
legion_dim_t
for start and ff_dim_t
for end? Is this signature intended? It's not illogical, just a bit weird and wanted to check it's not a typo 🙂
It's called like this in concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
where axis
is an ff_dim_t
I've left it as not implemented for now because I thought it was weird too.
lib/kernels/src/device.h
line 74 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
Was getting a linker error, but I realized it was for a different reason. This is removed and fixed in cuda_helper.cu
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 15 at r3 (raw file):
"pcg/tensor_guid_t.dtg.h", "pcg/optimizer_attrs.dtg.h", ]
Done.
lib/local-execution/include/local-execution/optimizer.h
line 16 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/optimizer.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Some parameter names would be helpful here
Done.
lib/local-execution/include/local-execution/task_impl_function.variant.toml
line 22 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would these be better renamed
InitOpTaskImplFunction
and FwdBwdOpTaskImplFunction
, or do they make sense outside of an op context?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 4 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 41 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.h
line 49 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to proper doxygen docstring?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 5 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not
"hash"
?
Done.
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
includes = [ "local-execution/tensor_guid_slot_spec.dtg.h", "utils/type_index.h",
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/include/local-execution/task_signature.struct.toml
line 12 at r3 (raw file):
"utils/type_index.h", "utils/optional.h" ]
This is fine, but I have to move the deleted part to the src_includes
since the fmt is defined in that file
lib/local-execution/src/local_slots_backing.cc
line 63 at r3 (raw file):
get_new_tensor_guids_for_layer_without_graph_insertion( cg, weight_layer, num_buffer_tensors); for (auto const &tensor_guid : buffer_tensors) {
Done.
lib/local-execution/src/local_slots_backing.cc
line 126 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It shouldn't, can you try again?
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's going on here with
weight_adjusted_idx
?
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Isn't this just
map_values
?
That's right. Fixed. Didn't even know that existed.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is
LOGIT
added twice? Does this mean pass both the gradient and the non-gradient into the function?
Yes.
lib/local-execution/src/loss_functions.cc
line 40 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Same here, why is
LOGIT
bound twice?
Same as above, the function needs both gradient and non-gradient tensor. It's the same as when we use bind_grad
for op tasks, but I didn't feel like that specification was necessary for non-op tasks.
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Might be shorter to just copy the
old_training_instance
?
Will be superseded by changes to the next
and ModelTrainingInstance
abstraction
lib/local-execution/src/optimizer.cc
line 39 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Bit more readable
Done.
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
return {task_id_t::ADAM_UPD_NCCL_TASK_ID, b}; } return {task_id_t::ADAM_UPD_PS_TASK_ID, b};
Done.
lib/local-execution/src/optimizer.cc
line 189 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_signature(); }});
Done.
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is
buffer
usually the word for these?
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
lib/local-execution/src/optimizer.cc
line 211 at r2 (raw file):
[&](AdamOptimizerAttrs const &s) { return get_adam_update_task_impl(); }});
Done.
lib/local-execution/src/task_invocation.cc
line 46 at r2 (raw file):
bool is_invocation_valid(TaskSignature const &sig, TaskInvocation const &inv) { // TODO: implement signature checking return true;
Done.
lib/op-attrs/include/op-attrs/ops/loss_functions.h
line 8 at r3 (raw file):
#include "loss_function.dtg.h" #include "other_loss_attrs.dtg.h" #include "sparse_categorical_ce_loss_attrs.dtg.h"
Done.
lib/op-attrs/include/op-attrs/ops/other_loss_attrs.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Maybe
NonconfigurableLossAttrs
or NonparameterizedLossAttrs
? OtherLossAttrs
is just a bit ambiguous
Done.
lib/pcg/include/pcg/optimizer_attrs.h
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What is an "empty" SGD? I'm not entirely convinced these are values that make sense to have their own functions.
That's fair. Removed.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why change the signature of
backward
?
I thought local execution won't deal with tensor_guid_t
Reviewed 49 of 49 files at r4, all commit messages.
Reviewable status: all files reviewed, 24 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
There are a couple of files in that folder that haven't been refactored/don't build:
metrics_kernels.cu
and embedding_kernels.cu
. I think maybe Dylan was going to tackle those?
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's called like this in
concat_kernels.cu
:
blk_size = shape.sub_shape(legion_dim_t{0}, axis).num_elements();
where axis
is an ff_dim_t
I've left it as not implemented for now because I thought it was weird too.
Probably better to instead use the function that converts between an ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
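Concretely, the call site in concat_kernels.cu would then look something like this (sketch only; it assumes the legion_dim_from_ff_dim(ff_dim_t, int num_dimensions) helper discussed elsewhere in this review, and that the shape's dimension count is available as num_dims):

legion_dim_t legion_axis = legion_dim_from_ff_dim(axis, num_dims);
blk_size = shape.sub_shape(legion_dim_t{0}, legion_axis).num_elements();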
lib/local-execution/include/local-execution/task_signature.struct.toml
line 10 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is fine, but I have to move the deleted part to the
src_includes
since the fmt is defined in that file
That's good, the more things we can put in src_includes
instead of includes
the better for build times as things in src_includes
won't get propagated through #include
s as they're only in the .dtg.cc
file and not the .dtg.h
file
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This is the support I had previously for getting the right weight tensor. But this will be superseded by the new weights/inputs reps PR, so this will get removed.
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
lib/local-execution/src/local_slots_backing.cc
line 189 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
That's right. Fixed. Didn't even know that existed.
🙂
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( OpTaskBinding const &binding, layer_guid_t const &op_guid) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
ArgSlotsBacking LocalSlotsBacking::construct_arg_slots_backing( TaskBinding const &binding) const { return map_values(binding.get_arg_bindings(), [&](auto const &arg_binding) {
Prefer explicit type names over auto
Suggestion:
return map_values(binding.get_arg_bindings(), [&](TaskArgSpec const &task_arg_spec) {
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Yes.
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
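As a purely hypothetical illustration of the wrapper idea (the names and signatures below are not part of the PR):

// registers/binds both the grad and non-grad variants of a slot in one call
void add_slot_with_grad(TaskSignature &sig, slot_id_t slot) {
  add_slot(sig, slot, IsGrad::NO);
  add_slot(sig, slot, IsGrad::YES);
}

void bind_with_grad(TaskBinding &b, slot_id_t slot, tensor_guid_t t) {
  b.bind(slot, TensorGuidSpec{t, IsGrad::NO});
  b.bind(slot, TensorGuidSpec{t, IsGrad::YES});
}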
lib/local-execution/src/optimizer.cc
line 127 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
Thanks--I usually tend to reserve the if (...) { ...; return ...; } ...; return ...;
pattern for cases like error handling etc. where the behavior is more "early termination" or (probably equivalently) where the else
case is very large
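For concreteness, the two patterns being contrasted (illustrative snippets only; the config check in the first one is made up):

// guard-clause / early-termination style: error handling or a very large else
if (!is_valid(config)) {
  throw mk_runtime_error(fmt::format("Invalid config"));
}
return run(config);

// symmetric if/else when both branches are equally "normal" outcomes
if (CHOSEN_SYNC_TYPE == ParamSync::NCCL) {
  return {task_id_t::SGD_UPD_NCCL_TASK_ID, b};
} else {
  return {task_id_t::SGD_UPD_PS_TASK_ID, b};
}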
lib/local-execution/src/optimizer.cc
line 195 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I think so? https://pytorch.org/docs/stable/generated/torch.optim.SGD.html uses
momentum_buffer
though I guess buffer itself is broad. I can make this grad_buffer_tensors
for more clarity.
Sure, that sounds good. I have no issue with buffer
as long as it's standard terminology, which it sounds like it is
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I thought local execution won't deal with
tensor_guid_t
You mean parallel_tensor_guid_t
?
Reviewable status: 65 of 82 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/kernels/CMakeLists.txt
line 10 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sounds good, tracked in #1502 and hopefully will get addressed in a week or two once @oOTigger gets back
Done.
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Probably better to instead use the function that converts between an
ff_dim_t
and a legion_dim_t
using the number of tensor dims (which iirc was defined...somewhere, maybe in kernels where the LegionDimOrdered
alias is defined?) and then just use the sub_shape(legion_dim_t, legion_dim_t)
overload
Done.
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can probably be replaced by
slice(DimOrdered, ...)
: https://github.com/flexflow/FlexFlow/blob/repo-refactor/lib/op-attrs/include/op-attrs/dim_ordered/slice.h#L9
Done.
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can be moved over to
dtgen
Done. Just added eq
for now, created an issue for hash
and fmt
#1503
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Either use
filter
or add a count_if
function to containers
or something like it
This will be superseded by pcg input/weight differentiation, since then we don't need to count inputs.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
So is the plan to merge this and then have a separate PR with the fixed representation, or to replace this PR with a new fixed PR, or something else entirely? Just so I know the context in which to review this PR
Now that PCG input/weight is merged, I can fix in this PR itself.
lib/local-execution/src/local_slots_backing.cc
line 157 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done. (It's OpArgSpec
here btw)
lib/local-execution/src/local_slots_backing.cc
line 171 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit type names over auto
Done.
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Hmm, fair enough. Feels a bit unintuitive to allow binding to a slot twice, might be better to have a way to define that slot as taking both grad and non-grad? Or have separate slots for the grad and non-grad? Doesn't need to be fixed immediately and I'll leave proposals for an exact solution (or the absence of one) up to you as you're more familiar with the current overall state of this interface than I am, just wanted to bring this up because naively this feels a bit odd (like in a function you can't bind an argument twice--the answer could be that SLOTs aren't exactly equivalent to arguments, or maybe just a wrapper function that adds both the grad and the non-grad to the slot?). Not sure, let me know what you think (and either way probably shouldn't be fixed in this PR)
Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.
lib/local-execution/src/loss_functions.cc
line 64 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Label shape is an
ArrayShape
, so it shouldn't have any parallel dims, right?
I think you're right, yes.
lib/local-execution/include/local-execution/loss_functions.h
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
You mean
parallel_tensor_guid_t
?
Yes.
Reviewable status: 54 of 83 files reviewed, 24 unresolved discussions (waiting on @lockshaw and @oOTigger)
lib/local-execution/include/local-execution/local_training_backing.h
line 18 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is this passed as mutable?
Done.
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why isn't the model part of
ModelTrainingInstance
? The members seem to have changed non-trivially since the old version:
struct ModelTrainingInstance {
  ComputationGraph computation_graph;
  Optimizer optimizer;
  EnableProfiling enable_profiling;
  TrainingPCG training_pcg;
  TensorMapping tensor_map;
  LegionBacking legion_backing;
};
It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.
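In other words, roughly this shape (an illustrative C++ rendering; the field names are inferred from this thread rather than copied from the .struct.toml):

struct ModelTrainingInstance {
  LossAttrs loss_attrs;           // which loss to apply
  tensor_guid_t logit_tensor;     // model output fed into the loss
  tensor_guid_t label_tensor;     // ground-truth labels
  OptimizerAttrs optimizer_attrs; // optimizer configuration
};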
lib/local-execution/include/local-execution/task_signature.struct.toml
line 29 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not a map of
slot_id_t -> something
?
Done.
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This name seems a bit ambiguous? Maybe something clearer?
Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it make more sense to disaggregate
LocalTrainingBacking
rather than having an optional
field for the training instance?
I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.
lib/local-execution/src/local_slots_backing.cc
line 61 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Not a fan of this, come up with some new tensor id type
Done. It's now a non_graph_tensor_guid_t
and not handled by the computation graph
lib/local-execution/src/local_slots_backing.cc
line 100 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
This will be superseded by pcg input/weight differentiation, since then we don't need to count inputs.
Done.
lib/local-execution/src/local_slots_backing.cc
line 132 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Now that PCG input/weight is merged, I can fix in this PR itself.
Done.
lib/local-execution/src/local_training_backing.cc
line 165 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not key this by the
layer_guid_t
of the WeightAttrs
?
Done.
lib/local-execution/src/local_training_backing.cc
line 179 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not convinced that the
next()
abstraction is the cleanest way to handle this? Idk, the ModelTrainingInstance
changing during training is a bit conceptually weird. I don't have a super clear alternative idea, but I suppose it would be something along the lines of moving the mutable aspects (I think in this case some optimizer state?) into some other data structure so that ModelTrainingInstance
stays the same
Done.
lib/local-execution/src/model_training_instance.cc
line 9 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Should probably get pulled out of the function at least. Also, does it make sense to actually modify the
ModelTrainingInstance
or would we want a separate optimizer state object/field to hold these mutable values?
Done. Now, next
only changes the optimizer attrs.
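Roughly (an illustrative sketch; beta_t/beta1 come from the snippet quoted earlier, and the beta2_t update is an assumed analogous field):

OptimizerAttrs next(OptimizerAttrs const &old_attrs) {
  if (old_attrs.has<AdamOptimizerAttrs>()) {
    AdamOptimizerAttrs updated = old_attrs.get<AdamOptimizerAttrs>();
    updated.beta_t = updated.beta_t * updated.beta1;    // advance the running beta1^t term
    updated.beta2_t = updated.beta2_t * updated.beta2;  // assumed analogous beta2^t field
    return OptimizerAttrs{updated};
  }
  return old_attrs; // SGD carries no per-iteration state to advance
}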
lib/local-execution/src/model_training_instance.cc
line 26 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Will be superseded by changes to the
next
and ModelTrainingInstance
abstraction
Done. Now, next only changes the optimizer attrs.
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The scoping of these tests seems a bit off--I get having one test to check that local execution works with a loss function, but it seems like the code should be decomposed such that the loss functions can be tested by themselves, and then all that is needed is the one integration test to make sure loss functions in general work with local execution.
It's fine if these are just temporary placeholders, but in the final draft it would be nice to have a slightly cleaner approach to testing than "spin up the whole local execution training backend and make it run full passes". If this is in fact the best way to test this then I'm happy to be convinced as some tests do just need a whole bunch of surrounding state to make sense, but more often than not that surrounding state can be minimized by having cleaner abstraction boundaries
I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).
lib/local-execution/test/src/test_update_e2e.cc
line at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tZe9t2K_AvSQBCzOY applies to this file as well
Same principle here but there's a lot more interaction with the mechanism of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now
lib/pcg/include/pcg/computation_graph.h
line 36 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove
Done.
Reviewed 7 of 17 files at r5, 7 of 16 files at r6, 31 of 33 files at r7, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
The signature (sub_shape(legion_dim_t, ff_dim_t)
) still appears the same?
lib/kernels/include/kernels/array_shape.h
line 20 at r7 (raw file):
ArrayShape(TensorShape const &shape); ArrayShape(std::vector<std::size_t> const &); ArrayShape(LegionTensorDims const &);
Ideally these should be explicit
: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-explicit
Code quote:
ArrayShape(size_t *dims, size_t num_dims);
ArrayShape(TensorShape const &shape);
ArrayShape(std::vector<std::size_t> const &);
ArrayShape(LegionTensorDims const &);
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done.
Would it make more sense to just convert start
and end
to legion_dim_t
?
lib/local-execution/include/local-execution/local_training_backing.h
line 43 at r7 (raw file):
LocalSlotsBacking local_slots_backing; std::optional<ModelTrainingInstance> training_instance; std::optional<OptimizerAttrs> optimizer_attrs;
Having so many optional parameters that affect execution semantics makes me wonder if LocalTrainingBacking
(or related abstractions) should be broken up--you could have a user-facing interface that's still along the lines of execute_forward
, execute_backward
, execute_update
, but then have each of those conditionally execute lower-level functions execute_loss
, etc. that don't use optional
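Something like this, just to sketch the shape of the split (the execute_loss helper and its argument are hypothetical):

void LocalTrainingBacking::execute_backward() {
  if (this->training_instance.has_value()) {
    // lower-level helper that takes exactly what it needs, no optional inside
    this->execute_loss(this->training_instance.value());
  }
  // ... per-layer backward tasks ...
}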
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
It's been kind of reversed and disaggregated. You can think of the backing itself as the driver and the training instance as just the fields necessary for training (logit, label, loss, optimizer). This makes it a lot easier to test.
That makes sense. In that case what's the point of having ModelTrainingInstance
at all? I'm not against having it, I'm just not sure semantically what it represents in the current code.
lib/local-execution/include/local-execution/task_binding.h
line 13 at r7 (raw file):
namespace FlexFlow { struct TaskBinding {
Add support for hashing and fmt
ing?
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Done. Just added
eq
for now, created an issue for hash
and fmt
#1503
What's the issue with them? Isn't it as simple as just changing from features = ["eq"]
to features = ["eq", "hash", "fmt"]
?
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 2 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Hm, but it is literally a specification for a tensor. Like the tensor_guid and the grad will tell you exactly which tensor to access.
I guess the concern is "tensor" is a bit overloaded--like "tensor_guid_t" represents just a forward tensor in the computation graph, but then other uses of "tensor" elsewhere are much broader. It might be nice to come up with some naming convention for these different concepts of what a "tensor" is to reduce ambiguity
lib/local-execution/include/local-execution/tensor_guid_spec.struct.toml
line 18 at r7 (raw file):
[[fields]] name = "tensor_guid" type = "::FlexFlow::UnifiedTensorGuid"
Does it make sense for non_graph_tensor_guid_t
s to have a gradient tensor?
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I was thinking about this, but to me it makes sense to have the training backing be the driver and then the optional instance provides fields that are only necessary for training. For example, when I'm testing the local cost estimator, I don't need the loss/optimizer stuff.
I'm not really arguing for making them non-optional
fields, but potentially disaggregating LocalTrainingBackend
(or its composite fields) so that the semantics of LocalTrainingBackend
are clearer (note the high-level interface could be the same, but the core logic would be in separate pieces that are tested separately). Similar things were done in old-FlexFlow (e.g., tensors had a whole bunch of fields, and for most computations only a subset were really needed. The logic was that if you were testing a certain feature you could initialize only the fields that were necessary, but that made it really hard to reason about the code as it meant that given a tensor you were never sure what you were allowed to do with it so there had to be a bunch of checking code and all the hidden invariants made the code really hard to get started with). Obviously LocalTrainingBackend
doesn't have that degree of optional fields yet, and it's not that optional fields are always bad, but it feels like LocalTrainingBackend
is slowly getting closer and closer to this
lib/local-execution/src/local_training_backing.cc
line 43 at r7 (raw file):
if (attrs.has<WeightAttrs>() && this->training_instance.has_value()) { assert(this->optimizer_attrs.has_value()); TaskSignature sig = get_update_signature(this->optimizer_attrs.value());
FYI .value()
will already throw an exception if there isn't a value, so the assert
isn't strictly necessary
lib/local-execution/src/local_training_backing.cc
line 130 at r7 (raw file):
unwrapped_training_instance.logit_tensor, unwrapped_training_instance.label_tensor); // assert(is_invocation_valid(get_loss_bwd_signature(), loss_invocation));
Remove or uncomment?
lib/local-execution/src/local_training_backing.cc
line 172 at r7 (raw file):
TaskInvocation invocation = get_update_invocation( this->optimizer_attrs.value(), weight_tensor, grad_buffer_tensors); // assert(is_invocation_valid(get_update_signature(attrs), invocation));
Remove or uncomment?
lib/local-execution/src/loss_functions.cc
line 29 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Functionally, a slot = a tensor, so I think it's ok to refer to multiple slots with the same name (the name is just a concept) as long as they have different specifications (e.g. grad and non-grad). Sometimes you don't want the grad version, so I don't think the wrapper function would be too helpful. I actually think the current interface is the most clear.
I think my core confusion here is https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O5tR4aD7i5LFASdtM13: "tensor" is overloaded, so what "slot = tensor" means isn't particularly clear to me. For example, if we take the runtime-definition of a tensor as a buffer of data, then what does it mean for me to pass a non-gradient tensor as a gradient, or pass something in as the gradient of a optimizer buffer? I think it would help to split out the different means of "tensor" into different types and then update the interface to use the correct types
lib/local-execution/test/src/test_local_cost_estimator.cc
line 34 at r7 (raw file):
/*vdim=*/embed_dim, /*dropout=*/0.0, /*bias=*/false,
Any reason for this change?
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
I agree generally, but in this case (for loss function), the real test is just the CPU kernels one. It doesn't really make sense to test the backing without just running a full forward+backward pass (otherwise it's just like testing that an operator works).
I disagree that "the real test is just the CPU kernels one"--there's plenty of code between .execute_update() and the actual kernel call that has to work properly (all of the task spec abstractions, looking up all the right tensors, filling them all in properly, etc.). It feels like the lack of testability here is due to the interface provided by LocalTrainingBackend--since it's essentially end-to-end (I call .execute_update() and all the state magically updates based on that) I agree it's difficult to test more fine-grained properties, but if the interface were changed (or a new internal interface was pulled out) such that .execute_update() returned a data structure containing, say, a list of low-level tasks to execute (or something like that) then it would be testable. Obviously at some point there does need to be a high-level object that provides the user with the necessary functions to call and is responsible for just trivially tying all the internal components together, but LocalTrainingBackend feels like it currently has arguably a bit too much code to really be that object (though I think in its current state it's a bit borderline whether it's too large or not), and so it should either be made higher-level and the bulk of the logic it contains refactored out, or it should be made lower-level and an additional interface added on top of it.
On a more specific level though, I'm not really sure what this file is supposed to be testing? Like, what properties of the code is it supposed to be checking? If all of the logic is loss-function-agnostic, why are there tests for different loss functions? Should I view this as just an integration test and that it's really just here to sanity check that things actually compose?
lib/local-execution/test/src/test_update_e2e.cc
line at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Same principle here but there's a lot more interaction with the mechanism of the local backing for update (different tensor storage location, access, etc.) that I can add tests for in the respective local slots backing / local task arg accessor tests. Working on that now
I'm a bit confused--what is this file actually testing?
lib/pcg/include/pcg/optimizer_attrs.h
line 9 at r7 (raw file):
namespace FlexFlow { OptimizerAttrs next(OptimizerAttrs const &old);
Any way we could get a more specific name for this? I know it makes sense in local-execution
, but now that it's in pcg
it might be nice to have it be a bit clearer what it does
Code quote:
next
lib/local-execution/include/local-execution/non_graph_tensor_guid_t.struct.toml
line 2 at r7 (raw file):
namespace = "FlexFlow" name = "non_graph_tensor_guid_t"
As mentioned in our meeting, might be nice to build this in a bit more of an "iterative refinement" style, where rather than simultaneously using tensor_guid_t
(relatively semantically rich id, can be used for lookups in graphs, etc.) and non_graph_tensor_guid_t
(relatively semantically weak, essentially just an opaque handle to an arbitrary buffer somewhere), you instead have some intermediate types (say, for representing gradients of tensors, optimization buffers of tensors, etc.), and then have functions that refine objects over tensor_guid_t
-> (intermediate types) -> non_graph_tensor_guid_t
, and then all of the LocalBackend
code can simply reason over non_graph_tensor_guid_t
. It might lead to more code LOC-wise, but might be a nice way to disaggregate LocalBackend
and friends (which are getting quite large) and make them easier to test
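To make that concrete, a rough sketch of the kind of intermediate types and refinement functions being described (every name below is hypothetical and only illustrates the tensor_guid_t -> intermediate -> non_graph_tensor_guid_t lowering):

// intermediate, semantically-meaningful ids
struct gradient_tensor_t { tensor_guid_t of; };
struct optimizer_buffer_t { tensor_guid_t weight; int buffer_idx; };

// refinement steps down to the opaque handle that the local backing reasons over
non_graph_tensor_guid_t lower(gradient_tensor_t const &);
non_graph_tensor_guid_t lower(optimizer_buffer_t const &);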
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The signature (
sub_shape(legion_dim_t, ff_dim_t)
) still appears the same?
Oh, did you mean we should not have this at all and instead the calling code should call sub_shape(legion_dim_t, legion_dim_t)
?
lib/kernels/include/kernels/array_shape.h
line 20 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Ideally these should be
explicit
: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-explicit
Done.
lib/kernels/src/array_shape.cc
line 59 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it make more sense to just convert
start
and end
to legion_dim_t
?
That works too.
lib/local-execution/include/local-execution/local_training_backing.h
line 43 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Having so many optional parameters that affect execution semantics makes me wonder if
LocalTrainingBacking
(or related abstractions) should be broken up--you could have a user-facing interface that's still along the lines of execute_forward
, execute_backward
, execute_update
, but then have each of those conditionally execute lower-level functions execute_loss
, etc. that don't use optional
Refactored. Now ModelTrainingInstance
is the driver of the training loop and local backing executes code on a per-layer basis.
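At a high level that looks something like the following (illustrative only; the method names on ModelTrainingInstance are assumptions about the refactored interface):

ModelTrainingInstance instance = ...;
for (int iter = 0; iter < num_iters; iter++) {
  instance.forward();
  instance.backward();
  instance.update();
}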
lib/local-execution/include/local-execution/model_training_instance.struct.toml
line 2 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
That makes sense. In that case what's the point of having
ModelTrainingInstance
at all? I'm not against having it, I'm just not sure semantically what it represents in the current code.
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W
lib/local-execution/include/local-execution/task_binding.h
line 13 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add support for hashing and
fmt
ing?
Done.
lib/local-execution/include/local-execution/task_invocation.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
What's the issue with them? Isn't it as simple as just changing from
features = ["eq"]
to features = ["eq", "hash", "fmt"]
?
Done. Had to actually add the hash/fmt functions for some underlying types.
lib/local-execution/src/local_cost_estimator.cc
line 80 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm not really arguing for making them non-
optional
fields, but potentially disaggregating LocalTrainingBackend
(or its composite fields) so that the semantics of LocalTrainingBackend
are clearer (note the high-level interface could be the same, but the core logic would be in separate pieces that are tested separately). Similar things were done in old-FlexFlow (e.g., tensors had a whole bunch of fields, and for most computations only a subset were really needed. The logic was that if you were testing a certain feature you could initialize only the fields that were necessary, but that made it really hard to reason about the code as it meant that given a tensor you were never sure what you were allowed to do with it so there had to be a bunch of checking code and all the hidden invariants made the code really hard to get started with). Obviously LocalTrainingBackend
doesn't have that degree of optional fields yet, and it's not that optional fields are always bad, but it feels like LocalTrainingBackend
is slowly getting closer and closer to this
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W
lib/local-execution/src/local_training_backing.cc
line 43 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
FYI
.value()
will already throw an exception if there isn't a value, so the assert
isn't strictly necessary
Done.
lib/local-execution/src/local_training_backing.cc
line 130 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove or uncomment?
Currently a TODO (#1442). It should be uncommented, but the underlying function is NOT_IMPLEMENTED()
lib/local-execution/src/local_training_backing.cc
line 172 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Remove or uncomment?
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8nGxOQBQbphnmK4wmJ
lib/local-execution/test/src/test_local_cost_estimator.cc
line 34 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any reason for this change?
The PCG interface added the bias tensors, which broke the test because I only had one weight tensor. I quickly wanted to just check if the actual test worked but forgot to switch it back and actually add the bias tensors. This is now corrected.
lib/pcg/include/pcg/optimizer_attrs.h
line 9 at r7 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any way we could get a more specific name for this? I know it makes sense in
local-execution
, but now that it's innext
it might be nice to have it be a bit clearer what it does
How about get_next_iteration_optimizer_attrs
?
Reviewable status: 68 of 102 files reviewed, 14 unresolved discussions (waiting on @lockshaw)
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I disagree that "the real test is just the CPU kernels one"--there's plenty of code between .execute_update() and the actual kernel call that has to work properly (all of the task spec abstractions, looking up all the right tensors, filling them all in properly, etc.). It feels like the lack of testability here is due to the interface provided by LocalTrainingBackend--since it's essentially end-to-end (I call .execute_update() and all the state magically updates based on that) I agree it's difficult to test more fine-grained properties, but if the interface were changed (or a new internal interface was pulled out) such that .execute_update() returned a data structure containing, say, a list of low-level tasks to execute (or something like that) then it would be testable. Obviously at some point there does need to be a high-level object that provides the user with the necessary functions to call and is responsible for just trivially tying all the internal components together, but LocalTrainingBackend feels like it currently has arguably a bit too much code to really be that object (though I think in its current state it's a bit borderline whether it's too large or not), and so it should either be made higher-level and the bulk of the logic it contains refactored out, or it should be made lower-level and an additional interface added on top of it.

On a more specific level though, I'm not really sure what this file is supposed to be testing? Like, what properties of the code is it supposed to be checking? If all of the logic is loss-function-agnostic, why are there tests for different loss functions? Should I view this as just an integration test and that it's really just here to sanity check that things actually compose?
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W for the splitting out. For loss functions, it really is just a sanity test to make sure things run. The reason there are different subcases is that the actual loss function op code/kernel has conditional logic based on which loss attrs we're dealing with, but the backing is agnostic to this. So yes, we could have just one test for loss functions in general, but I just wanted to sanity check.
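A rough, self-contained sketch of the kind of inspectable intermediate being discussed above; all names here are hypothetical stand-ins rather than the PR's actual types. The idea is that the update step returns the planned tasks instead of executing them, so a test can assert on the plan without running any kernels:

#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a lowered task record.
struct PlannedTask {
  std::string task_name;
};

// Stand-in for a lower-level planning step that returns the tasks to run
// instead of running them.
std::vector<PlannedTask> plan_update_tasks() {
  return {{"sgd_update"}, {"apply_gradients"}};
}

int main() {
  std::vector<PlannedTask> plan = plan_update_tasks();
  // A unit test can now check fine-grained properties of the plan directly,
  // without any GPU or kernel dependencies.
  assert(plan.size() == 2);
  assert(plan.front().task_name == "sgd_update");
}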
lib/local-execution/test/src/test_update_e2e.cc
line at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'm a bit confused--what is this file actually testing?
Still need to actually make this file check something useful, but it's easier now because of the refactor.
Reviewed 21 of 36 files at r8, 81 of 92 files at r9, all commit messages.
Reviewable status: 134 of 149 files reviewed, 36 unresolved discussions (waiting on @reyna-abhyankar)
lib/kernels/include/kernels/array_shape.h
line 45 at r2 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Oh, did you mean we should not have this at all and instead the calling code should call sub_shape(legion_dim_t, legion_dim_t)?
Yes
lib/local-execution/src/local_training_backing.cc
line 130 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
Currently a TODO (#1442). It should be uncommented, but the underlying function is NOT_IMPLEMENTED()
Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?
lib/local-execution/src/local_training_backing.cc
line 172 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8nGxOQBQbphnmK4wmJ
Can you add a link to that TODO in the code, just in case anyone else is unsure when they come across this?
lib/local-execution/test/src/test_loss_e2e.cc
line 45 at r3 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
See https://reviewable.io/reviews/flexflow/FlexFlow/1472#-O8ma9DJBp9ejqaQkx_W for the splitting out. For loss functions, it really is just a sanity test to make sure things run. The reason there are different subcases is that the actual loss function op code/kernel has conditional logic based on which loss attrs we're dealing with, but the backing is agnostic to this. So yes, we could have just one test for loss functions in general, but I just wanted to sanity check.
Can you hold this as "Working" while we resolve https://reviewable.io/reviews/flexflow/flexflow-train/1472#-OESUtvlCsQMPCJ492-X?
lib/pcg/include/pcg/optimizer_attrs.h
line 9 at r7 (raw file):
Previously, reyna-abhyankar (Reyna Abhyankar) wrote…
How about get_next_iteration_optimizer_attrs?
Sounds great. Could also do get_optimizer_attrs_for_next_iteration if you prefer; either is fine with me.
lib/kernels/include/kernels/legion_dim.h
line 13 at r9 (raw file):
legion_dim_t legion_dim_from_ff_dim(ff_dim_t, int num_dimensions);
std::optional<legion_dim_t> legion_dim_from_ff_dim(std::optional<ff_dim_t>,
Probably should just use transform from containers rather than having a whole extra specialization for this.
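A minimal sketch of what that could look like, assuming a generic transform helper over std::optional along the lines of the one in utils; the helper below is a local stand-in, and the commented call site is hypothetical:

#include <optional>

// Local stand-in for a containers-style transform over std::optional.
template <typename T, typename F>
auto transform(std::optional<T> const &o, F &&f)
    -> std::optional<decltype(f(o.value()))> {
  if (o.has_value()) {
    return f(o.value());
  }
  return std::nullopt;
}

// Hypothetical call site: lift the single-value overload through the optional,
// so no std::optional-specific legion_dim_from_ff_dim overload is needed.
// std::optional<legion_dim_t> maybe_legion = transform(
//     maybe_ff_dim,
//     [&](ff_dim_t d) { return legion_dim_from_ff_dim(d, num_dimensions); });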
lib/local-execution/include/local-execution/local_slots_backing.h
line 25 at r9 (raw file):
std::unordered_map<reduced_tensor_t, GenericTensorAccessorW>;
struct LocalSlotsBacking {
This object is getting rather large in terms of methods, it would be nice to factor some of them out into separate functions, even if that means some explicit argument passing is necessary (might even be an advantage, because right now it's hard to tell what these methods do because they access so much)--wasn't this going to be solved by adding an explicit lowering phase so that LocalSlotsBacking
isn't responsible for anything OpTask*
and thus only has to handle Task*
objects?
lib/local-execution/include/local-execution/local_slots_backing.h
line 36 at r9 (raw file):
ComputationGraph const &, Allocator &);
void allocate_tensors_by_role(TensorRole const &,
Why is this new API necessary?
lib/local-execution/include/local-execution/local_slots_backing.h
line 49 at r9 (raw file):
TensorSlotsBacking construct_tensor_slots_backing(TaskBinding const &, std::optional<layer_guid_t> const &) const;
Why does this need a layer_guid_t?
lib/local-execution/include/local-execution/local_slots_backing.h
line 59 at r9 (raw file):
GenericTensorAccessorW const & get_tensor_backing(TensorType const &,
Why does this function care about the TensorType--shouldn't the reduced_tensor_t by itself be unambiguous?
Code quote:
get_tensor_backing(TensorType const &,
lib/local-execution/include/local-execution/local_slots_backing.h
line 61 at r9 (raw file):
get_tensor_backing(TensorType const &, reduced_tensor_t const &, std::optional<layer_guid_t> const &) const;
Why is a layer_guid_t
necessary here?
lib/local-execution/include/local-execution/layer_tensor_key.struct.toml
line 2 at r9 (raw file):
namespace = "FlexFlow" name = "LayerTensorKey"
What is the purpose of LayerTensorKey? I see that it's used in LocalSlotsBacking, but I'm not clear on why every entry needs to be keyed by both layer_guid_t and reduced_tensor_t rather than just by reduced_tensor_t--why is the layer_guid_t necessary?
lib/local-execution/include/local-execution/op_task_invocation.h
line 87 at r9 (raw file):
bool operator!=(OpTaskBinding const &other) const;
std::unordered_map<SlotTensorTypeId, OpTensorSpec> const &
Is it necessary for OpTaskInvocation to deal with anything other than grad and non-grad tensors? SlotGradId seems like the more accurate datatype here, but maybe there's a case I'm not thinking about.
lib/local-execution/include/local-execution/arg_ref.h
line 72 at r9 (raw file):
size_t result = 0;
hash_combine(s.type_idx);
hash_combine(s.ref_type);
Why remove ref_type? That seems like an important field?
lib/local-execution/include/local-execution/concrete_arg.h
line 49 at r9 (raw file):
std::shared_ptr<void const> ptr;
std::tuple<decltype(type_idx) const &> tie() const;
Why does tie not include the ptr?
lib/local-execution/include/local-execution/concrete_arg.h
line 63 at r9 (raw file):
size_t operator()(::FlexFlow::ConcreteArgSpec const &s) const {
  size_t result = 0;
  ::FlexFlow::hash_combine(result, s.get_type_index());
Why not include ptr in the hash?
lib/local-execution/include/local-execution/model_training_instance.h
line 10 at r9 (raw file):
using PerLayerElapsedTime = std::unordered_map<layer_guid_t, std::optional<float>>;
Might be nice to eventually convert this float to a wrapper type with, e.g., non-negativity checks. It could be named something like milliseconds_t, or whatever the unit is (also to make the unit clear), along the lines of #1533.
Code quote:
std::optional<float>
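A rough sketch of the kind of wrapper being suggested; the name milliseconds_t is taken from the comment above, and the check shown is only illustrative:

#include <cassert>

// Makes the unit explicit in the type and rejects negative elapsed times.
struct milliseconds_t {
  explicit milliseconds_t(float raw) : raw(raw) {
    assert(raw >= 0.0f);
  }
  float raw;
};

// e.g., the map could then hold std::optional<milliseconds_t> instead of
// std::optional<float>.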
lib/local-execution/include/local-execution/task_registry.h
line 17 at r9 (raw file):
ComputationGraphOpAttrs const &attrs);
bool registry_contains_op_task(TaskRegistry const &,
Suggestion:
bool registry_contains_task_for_layer(
lib/local-execution/include/local-execution/runtime_arg_ref.h
line 13 at r8 (raw file):
namespace FlexFlow {
enum class RuntimeArgRefType {
Why not just make this dtgen so you don't have to implement to_string, etc. yourself?
lib/local-execution/include/local-execution/runtime_arg_ref.h
line 30 at r9 (raw file):
RuntimeArgRef<FFIterationConfig> iteration_config();
// std::string format_as(RuntimeArgRefSpec const & x) {
Delete
lib/local-execution/include/local-execution/local_task_argument_accessor.h
line 26 at r9 (raw file):
ConcreteArgSpec const &get_concrete_arg(slot_id_t) const override;
GenericTensorAccessor get_tensor(slot_id_t slot,
I know I've mentioned this before (so sorry for bringing it up again, but I appear to have forgotten the previous response) but why have a slot represent any of a number of tensors (forward, gradient, optimizer, etc.) rather than just having a slot represent a single tensor? Is it that it makes backward inference more difficult? Conceptually feels weird, as it makes it a bit confusing who actually determines what gets communicated to a task
Code quote:
GenericTensorAccessor get_tensor(slot_id_t slot,
lib/local-execution/include/local-execution/task_binding.h
line 10 at r9 (raw file):
#include "local-execution/task_id_t.dtg.h" #include "local-execution/task_signature.dtg.h" #include "utils/hash/unordered_map.h"
Move to .cc file along with hash definition
lib/local-execution/include/local-execution/task_binding.h
line 66 at r9 (raw file):
template <>
struct hash<::FlexFlow::TaskBinding> {
  size_t operator()(::FlexFlow::TaskBinding const &s) const {
Move definition to .cc file
lib/local-execution/include/local-execution/task_binding.h
line 70 at r9 (raw file):
hash_combine(result, s.get_tensor_bindings());
hash_combine(result, s.get_arg_bindings());
return result;
FYI you can just return get_std_hash(result.tie()); (as long as you make std::hash<TaskBinding> a friend of TaskBinding).
size_t result = 0;
hash_combine(result, s.get_tensor_bindings());
hash_combine(result, s.get_arg_bindings());
return result;
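Roughly what that pattern could look like, assuming TaskBinding exposes a private tie() over its members and declares std::hash<TaskBinding> as a friend, and that the call above is meant to hash s rather than result; this is a sketch, not the PR's code:

namespace std {
template <>
struct hash<::FlexFlow::TaskBinding> {
  size_t operator()(::FlexFlow::TaskBinding const &s) const {
    // Hashing the tie()'d tuple keeps the hash in sync with the members.
    return ::FlexFlow::get_std_hash(s.tie());
  }
};
} // namespace std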
lib/local-execution/include/local-execution/local_training_backing.h
line 37 at r9 (raw file):
std::optional<layer_guid_t> const &) const;
TaskArgumentAccessor get_op_task_arg_accessor(OpTaskInvocation const &,
                                              layer_guid_t const &) const;
Why are these on LocalTrainingBacking instead of LocalSlotsBacking? They feel lower-level than the other methods present here (i.e., execute_init, execute_forward, etc. all require arg accessors to function, so it feels odd that it's at the same level of API).
Code quote:
get_task_arg_accessor(TaskInvocation const &,
std::optional<layer_guid_t> const &) const;
TaskArgumentAccessor get_op_task_arg_accessor(OpTaskInvocation const &,
layer_guid_t const &) const;
lib/local-execution/src/ops/attention.cc
line 0 at r9 (raw file):
FYI these should be moved to lib/local-execution/src/local-execution/ops/attention.cc
to match the standard directory layout
lib/local-execution/include/local-execution/reduced_tensor_t.struct.toml
line 2 at r9 (raw file):
namespace = "FlexFlow" name = "reduced_tensor_t"
What does this represent? It probably needs a new name as "reduced" is a bit overloaded (i.e., the reduction operation)
lib/local-execution/src/tensor_reduction.cc
line 7 at r9 (raw file):
reduced_tensor_t lower(tensor_guid_t const &tensor_guid) {
  return reduced_tensor_t{tensor_guid.raw_graph_output.idx};
This seems rather error-prone: how are you going to avoid assigning a non-tensor_guid_t tensor the same index? What about instead using some kind of numbering source (like NodeSource) to avoid duplicate numberings?
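A minimal sketch of the numbering-source idea, modeled loosely on NodeSource; the class name and the integer type used for the index are assumptions, not code from the PR:

// Hands out fresh indices, so two different tensors can never end up with the
// same reduced_tensor_t value.
struct ReducedTensorSource {
  reduced_tensor_t new_reduced_tensor() {
    return reduced_tensor_t{this->next_idx++};
  }

private:
  int next_idx = 0;
};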
lib/local-execution/src/tensor_reduction.cc
line 11 at r9 (raw file):
std::vector<reduced_tensor_t> lower(std::vector<tensor_guid_t> const &tensor_guids) {
Just use transform(..., lower) when you need this functionality (I try to avoid having functions that are just the normal function lifted through transform, as having them would significantly increase the number of functions to maintain for little value)
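For illustration, the call site would then look something like this (assuming the usual containers transform over std::vector; the lambda is only needed because lower is overloaded):

std::vector<reduced_tensor_t> reduced = transform(
    tensor_guids, [](tensor_guid_t const &t) { return lower(t); });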
lib/local-execution/src/local_cost_estimator.cc
line 24 at r9 (raw file):
  }
  return total_elapsed_time;
}
Suggestion:
static float get_total_elapsed_time(PerLayerElapsedTime const &fwd,
PerLayerElapsedTime const &bwd) {
return sum(values(fwd)) + sum(values(bwd));
}
lib/local-execution/src/local_cost_estimator.cc
line 29 at r9 (raw file):
: runtime_arg_config(config) {} CostDetails LocalCostEstimator::estimate_cost(
Any way we could factor out pieces of this function? A bunch of these blocks seem conceptually self-contained and this function is getting rather large
lib/local-execution/src/local_cost_estimator.cc
line 51 at r9 (raw file):
std::vector<tensor_guid_t> input_tensor_ids;
ComputationGraphBuilder cg_builder;
I feel like you probably just want the raw ComputationGraph interface (i.e., computation_graph.h) here; ComputationGraphBuilder is really intended to provide a lots-of-stuff-auto-inferred pytorch-style interface for constructing models, which you're not really using here.
lib/local-execution/src/local_cost_estimator.cc
line 84 at r9 (raw file):
tensor_backing_map, this->runtime_arg_config);
local_backing.register_and_allocate_layer(layer_added_result.layer);
Why do we have a separate API for registering layers? Is there a time when we want a LocalTrainingBacking
that doesn't have its layers registered?
lib/pcg/include/pcg/computation_graph_builder.h
line 260 at r9 (raw file):
tensor_guid_t get_output(LayerAttrs const &, int idx) const;
LayerAddedResult add_layer_and_get_layer_added_result(
Why not just use add_layer from computation_graph.h instead of adding this to ComputationGraphBuilder?
flexflow-train/lib/pcg/include/pcg/computation_graph.h, lines 18 to 21 in 93298ed:
LayerAddedResult add_layer(ComputationGraph &computation_graph,
                           LayerAttrs const &attrs,
                           std::vector<tensor_guid_t> const &inputs,
                           std::vector<TensorAttrs> const &outputs);
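For illustration, a hedged usage sketch of the quoted free function as the suggested alternative; the variable names here are only placeholders, not code from the PR:

// Add the layer through the computation_graph.h free function rather than a
// new ComputationGraphBuilder method.
LayerAddedResult added = add_layer(computation_graph,
                                   layer_attrs,
                                   /*inputs=*/{input_tensor},
                                   /*outputs=*/{output_attrs});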
lib/local-execution/include/local-execution/tensor_type.enum.toml
line 11 at r9 (raw file):
[[values]]
name = "NON_GRAPH"
Instead of having a miscellaneous NON_GRAPH category, specify the categories that it could be (e.g., LOSS, etc.).
lib/op-attrs/include/op-attrs/operator_attrs.h
line 4 at r9 (raw file):
#define _OPERATOR_PARAMS_H
#include "local-execution/ops/attention.h"
These shouldn't be in op-attrs--typo?
lib/pcg/include/pcg/parallel_computation_graph/parallel_computation_graph_builder.h
line 183 at r9 (raw file):
ParallelTensorAttrs make_weight_attrs(ParallelTensorShape const &shape,
Why was this changed to be non-static? If it has to be non-static it shouldn't be here (it should probably be in either parallel_tensor_attrs.h or parallel_tensor_shape.h).
Description of changes:
Implement end-to-end training loop in local execution.
Related Issues:
Linked Issues:
Issues closed by this PR:
hash and fmt for TaskBinding in local execution #1503