feat: Add client interface. #16
Review comment: Our naming convention is to use kebab-case for markdown files, except for the main readme in a repo, which is called `README.md`.

Review comment: I will add a linter to do this later, but treat Markdown as code and wrap lines to 100 characters. This makes it easier to review and read (not everyone has a good Markdown renderer on their local machine).

@@ -0,0 +1,204 @@
# Spider Quick Start Guide
Review comment: Our convention is to use sentence case for headings rather than capitalizing every word, so this should be written as "Spider quick start guide".

Author reply: I've changed all the titles, but I treat Spider as special terminology and keep it capitalized.
## Set Up Spider
To get started, first start a database supported by Spider, e.g. MySQL. Second, start a scheduler and connect it to the database by running `spider start --scheduler --db <db_url> --port <scheduler_port>`. Third, start some workers and connect them to the database by running `spider start --worker --db <db_url>`.
Review comment: Try to use lists rather than a block of text. Lists are faster to read than blocks of text.

Review comment: You have a caveat for starting the worker below. You should note that here; otherwise someone will try to start the worker, it won't work, and they will give up.
## Start a Client
A client first creates a Spider client driver and connects it to the database. Spider automatically cleans up the resources in the driver's destructor, but you can close the driver to release the resources early.
Review comment: We should have an intro section where we describe the actors and architecture in the system (client, scheduler, worker, database, etc.). It doesn't need to be too detailed (that can be in another doc), but it should provide enough context for the user to visualize what we're talking about.
```c++
#include <spider/Spider.hpp>

auto main(int argc, char** argv) -> int {
    spider::Driver driver{};
    driver.connect("db_url");

    driver.close();
}
```

Review comment (on `driver.connect("db_url")`): Any reason we can't or don't want to do this in the constructor?

Review comment (on `driver.close()`): We don't want to allow this if possible. If we allow the user to close the driver before the destructor is called, then most if not all public methods of the driver class need to have a check at the beginning.

Review comment: Also, let's not try to make the library "too" easy to use. Our goal right now is to get a robust end-to-end implementation. Since we will be the first users of this framework, it's okay if we need to write a bit of ugly code to get it to work, as long as it's robust. Through the process of using the library, I'm sure we will figure out which important features we should try to simplify.
## Create a Task
In Spider, a task is a non-member function whose first argument is a `spider::Context` object. It can then take any number of arguments of POD types.
A task can return any POD type. If a task needs to return more than one result, use `std::tuple`.
The `Context` object represents the context of a running task. It provides methods to get task metadata, such as the task id. It also supports creating tasks inside a task, which we will cover later.
Review comment: I think it's clearer if you defer this point until later. In this section, you can just say "the first argument needs to be a Context. Contexts will be described in section XX."

Author reply: I'm not sure if I should add a section about Context. Context provides functions to get the task id and run new tasks. The latter is covered in the "Run Task inside Task" section. The former is only useful to get a random number or to distinguish two instances to de-duplicate their output.
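As a minimal sketch of the point discussed above, a task can read its own id from the context; the `task_id()` accessor assumed here is the one used later in the "Data on External Storage" example.

```c++
// Sketch: use the context only to fetch task metadata (here, the task id),
// e.g. to seed a random number generator or to tag this task's output so
// duplicate instances can be de-duplicated.
auto report_id(spider::Context& context) -> int {
    auto const id = context.task_id();
    // ... use `id` when naming or tagging this task's output ...
    return 0;
}
```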
```c++
auto sum(spider::Context& context, int x, int y) -> int {
    return x + y;
}

auto sort(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    if (x <= y) {
        return { x, y };
    }
    return { y, x };
}
```

Review comment: Although I can infer these are tasks, it would still be helpful to have some description of these examples (could be as simple as "Examples of task methods."). This is especially true in your later examples, which are more complicated.
## Run a Task
Spider enables users to run a task on the cluster. First, register the function statically so it is known to Spider. Then simply call `Driver::run` and provide the arguments of the task. `Driver::run` returns a `spider::Future` object, which represents a result that will become available in the future. You can call `Future::ready` to check whether the value is available yet, and `Future::get` to block until the value is available and retrieve it.
```c++
spider::register_task(sum);
spider::register_task(sort);

auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    spider::Future<int> sum_future = driver.run(sum, 2, 2);
    assert(4 == sum_future.get());

    spider::Future<std::tuple<int, int>> sort_future = driver.run(sort, 4, 3);
    assert(std::tuple{3, 4} == sort_future.get());
}
```
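The paragraph above also mentions `Future::ready`. A minimal sketch of non-blocking polling, assuming `ready()` simply reports whether the result has been stored yet:

```c++
// driver initialization and task registration skipped
spider::Future<int> future = driver.run(sum, 2, 2);
while (!future.ready()) {
    // Do other useful work while the task runs on the cluster.
}
assert(4 == future.get());  // get() no longer blocks once ready() is true
```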
## Group Tasks Together
In the real world, running a single task is often too simple to be useful. Spider lets you bind the outputs of tasks to the inputs of another task, similar to `std::bind`. Binding tasks together forms dependencies among tasks, which are represented by a `spider::TaskGraph`. A `TaskGraph` can be further bound into a more complicated `TaskGraph` by serving as input to another task. You can run the task graph using `Driver::run` in the same way as running a single task.
```c++
auto square(spider::Context& context, int x) -> int {
    return x * x;
}

auto square_root(spider::Context& context, int x) -> int {
    return static_cast<int>(std::sqrt(x));
}

// task registration skipped
auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    spider::TaskGraph<int(int, int)> sum_of_square = spider::bind(sum, square, square);
    spider::TaskGraph<int(int, int)> rss = spider::bind(square_root, sum_of_square);
    spider::Future<int> future = driver.run(rss, 3, 4);
    assert(5 == future.get());
}
```

Review comment: I don't really understand how arguments are passed between tasks. You mention that this interface is similar to `std::bind` […]

Author reply: I've updated the doc with more explanation, but I don't think my description is clear enough. I need some advice on this.
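To make the data flow in this example concrete, here is the equivalent plain C++ computation. This is an interpretation inferred from the example above rather than a statement of the interface: the graph inputs (3 and 4) feed the two `square` tasks, their outputs become the two inputs of `sum`, and `sum`'s output feeds `square_root`.

```c++
#include <cassert>
#include <cmath>

auto main() -> int {
    int const x = 3;
    int const y = 4;
    int const sum_of_square = (x * x) + (y * y);                 // square, square -> sum
    int const rss = static_cast<int>(std::sqrt(sum_of_square));  // sum -> square_root
    assert(5 == rss);
    return 0;
}
```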
## Run Task inside Task
A static task graph is enough to solve a lot of real-world problems, but dynamically adding tasks on the fly can be handy. As mentioned before, Spider allows you to add another task as a child of the running task by calling `Context::add_child`.
```c++
auto gcd(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    if (x == y) {
        std::cout << "gcd is: " << x << std::endl;
        return { x, y };
    }
    if (x > y) {
        context.add_child(gcd);
        return { x - y, y };
    }
    context.add_child(gcd);
    return { x, y - x };
}
```
However, it is impossible to get the return value of this task graph from a client. One solution is to share data using the key-value store, which will be discussed later. Another solution is to run a task or task graph inside a task and wait for its value, just like a client does. This solution is closer to conventional function call semantics.

Review comment: This sentence doesn't make sense, since the previous section has an example where the client is retrieving the result of a task graph. Reading the later parts of this guide, I guess you mean that you can't return the value of a dynamically created task to the client (which is intuitive since the interface doesn't provide a mechanism to do so).
```c++
auto gcd_impl(spider::Context& context, int x, int y) -> std::tuple<int, int>;

auto gcd(spider::Context& context, int x, int y) -> int {
    if (x < y) {
        std::swap(x, y);
    }
    while (y != 0) {
        spider::Future<std::tuple<int, int>> future = context.run(gcd_impl, x, y);
        x = std::get<0>(future.get());
        y = std::get<1>(future.get());
    }
    return x;
}

auto gcd_impl(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    return { y, x % y };
}
```
## Data on External Storage
Often, simple POD data is not enough. However, passing large amounts of data around is expensive. Usually, such data is stored on disk or in a distributed storage system. For example, an ETL workload usually reads data from external storage, writes temporary data to external storage, and writes its final data to external storage.
Spider lets users pass the metadata of this data around in `spider::Data` objects. `Data` stores the metadata of the external data and provides crucial information to Spider for correct and efficient scheduling and failure recovery. `Data` stores a list of nodes that have locality of the external data, and the user can specify whether locality is a hard requirement, i.e. the task can only run on the nodes in the locality list. `Data` can include a `cleanup` function, which runs when the `Data` object is no longer referenced by any task or client. `Data` has a persist flag to indicate that the external data is persisted and does not need to be cleaned up.
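Before the full example below, here is a minimal sketch of building a `Data` object. It only uses calls that appear elsewhere in this guide and in the `Data` implementation (`Builder`, `cleanup`, `build`, `set_locality`); the `File` type and `remove_file` helper are hypothetical.

```c++
struct File {
    std::string path;
};

// Sketch: wrap an external file's metadata in a Data object, attach a cleanup
// function, and hint (softly) at which nodes hold the data locally.
spider::Data<File> data = spider::Data<File>::Builder()
        .cleanup([](File const& file) { remove_file(file); })
        .build(File { "/path/to/file" });
data.set_locality({ "node-1", "node-2" }, false);  // not a hard locality requirement
```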
Review comment: This example is complicated and should definitely have a description. Not having a description means that we're slowing down the user, since they first need to infer what the goal of the example's code is before they can focus on the details of using the framework.

Review comment: This example is also a good reason why our internal guidelines say that we should organize things from public to private. If you apply that thinking here, we would put […]

```c++
struct HdfsFile {
    std::string url;
};

auto filter(spider::Context& context, spider::Data<HdfsFile> input) -> spider::Data<HdfsFile> {
    std::string const output_path = std::format("/path/{}", context.task_id());
    std::string const input_path = input.get().url;
    // Create the HdfsFile Data first so that Spider can clean up the data if the task fails.
    spider::Data<HdfsFile> output = spider::Data<HdfsFile>::Builder()
            .cleanup([](HdfsFile const& file) { delete_hdfs_file(file); })
            .build(HdfsFile { output_path });
    auto file = hdfs_create(output_path);
    std::vector<std::string> nodes = hdfs_get_nodes(file);
    output.set_locality(nodes, false);  // not hard locality

    run_filter(input_path, file);

    return output;
}

auto map(spider::Context& context, spider::Data<HdfsFile> input) -> spider::Data<HdfsFile> {
    std::string const output_path = "/path/to/output";
    std::string const input_path = input.get().url;

    spider::Data<HdfsFile> output = spider::Data<HdfsFile>::Builder()
            .cleanup([](HdfsFile const& file) { delete_hdfs_file(file); })
            .build(HdfsFile { output_path });

    run_map(input_path, output_path);

    // Now that map has finished, the file is persisted on HDFS as the output of the job.
    output.mark_persist();
    return output;
}

auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    spider::Data<HdfsFile> input = spider::Data<HdfsFile>::Builder()
            .mark_persist(true)
            .build(HdfsFile { "/path/to/input" });
    spider::Future<spider::Data<HdfsFile>> future = driver.run(
            spider::bind(map, filter),
            input);

    std::string const output_path = future.get().get().url;
    std::cout << "Result is stored in " << output_path << std::endl;
}
```
## Data as Key-Value Store
`Data` can also be used as a key-value store. The user can specify a key when creating the data, and the data can be accessed later by its key. Note that a task can only access the `Data` it created itself or that was passed to it; a client can access any data by its key. Using the key-value store, we can solve the dynamic task result problem.
```c++
auto gcd(spider::Context& context, int x, int y, std::string key)
        -> std::tuple<int, int, std::string> {
    if (x == y) {
        spider::Data<int>::Builder()
                .set_key(key)
                .build(x);
        return { x, y, key };
    }
    if (x > y) {
        context.add_child(gcd);
        return { x - y, y, key };
    }
    context.add_child(gcd);
    return { x, y - x, key };
}

auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    std::string const key = "random_key";
    driver.run(gcd, 48, 18, key);
    while (!driver.get_data_by_key(key)) {
        // wait until the final gcd task has stored its result under the key
    }
    int value = driver.get_data_by_key(key).get();
    std::cout << "gcd of 48 and 18 is " << value << std::endl;
}
```
## Straggler Mitigation
`Driver::register_task` can take a second argument specifying a timeout in milliseconds. If a task executes for longer than the specified timeout, Spider spawns another task instance running the same function. The task instance that finishes first wins. The other running task instances are cancelled, and their associated data is cleaned up.
The new task instance has a different task id, and it is the responsibility of the user to avoid any data races and to de-duplicate the output if necessary.
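A minimal sketch of both points, assuming the free-function registration form used earlier and the timeout overload described above, and using the task id (as in the external-storage example) to keep duplicate instances from colliding:

```c++
// Register `filter` with a 10-second straggler-mitigation timeout (assumed
// overload: the second argument is the timeout in milliseconds).
spider::register_task(filter, 10'000);

// Inside a task, write to a path derived from the task id so that a duplicate
// instance spawned by straggler mitigation writes to a different location.
std::string const output_path = std::format("/path/{}", context.task_id());
```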
## Note on Worker Setup
The setup section said that we can start a worker by running `spider start --worker --db <db_url>`. This is oversimplified: the worker has to know the functions it will run.
When the user compiles the client code, an executable and a library are generated. The executable runs the client code as expected. The library contains all the functions registered by the user, and the worker needs to run with a copy of this library. The actual command to start a worker is `spider start --worker --db <db_url> --libs [client_libraries]`.
Review comment: This makes sense, so in the guide above, wouldn't it be easier to just say that we need to create a task library with all the tasks the user wants to run? Then we can reference those tasks in the client code. Right now, the guide says we need to register the tasks by calling `spider::register_task` […]

Author reply: The user is responsible for creating a task library and the client executable. Actually, a good practice for the user is to put the […]
@@ -0,0 +1,40 @@
#include "Data.hpp" | ||
|
||
#include <functional> | ||
#include <string> | ||
#include <vector> | ||
|
||
namespace spider { | ||
|
||
class DataImpl {}; | ||
|
||
template <class T> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You should document template parameters as well. |
||
auto Data<T>::get() -> T { | ||
return T(); | ||
} | ||
|
||
template <class T> | ||
void Data<T>::set_locality(std::vector<std::string> const& /*nodes*/, bool /*hard*/) {} | ||
|
||
template <class T> | ||
auto Data<T>::Builder::set_key(std::string const& /*key*/) -> Data<T>::Builder& { | ||
return this; | ||
} | ||
|
||
template <class T> | ||
auto Data<T>::Builder::set_locality(std::vector<std::string> const& /*nodes*/, bool /*hard*/) | ||
-> Data<T>::Builder& { | ||
return this; | ||
} | ||
|
||
template <class T> | ||
auto Data<T>::Builder::set_cleanup(std::function<T const&()> const& /*f*/) -> Data<T>::Builder& { | ||
return this; | ||
} | ||
|
||
template <class T> | ||
auto Data<T>::Builder::build(T const& /*t*/) -> Data<T> { | ||
return Data<T>(); | ||
} | ||
|
||
} // namespace spider |
Review comment: Do we need this anymore, considering we've dropped support for building on macOS?