
add related work

amakelov committed Jul 7, 2024
1 parent 747642d commit e0b4e96
Showing 44 changed files with 4,766 additions and 5,727 deletions.
docs/docs/blog/cf.md → docs/docs/blog/01_cf.md (110 changes: 82 additions & 28 deletions)
# Tidy Computations
<a href="https://colab.research.google.com/github/amakelov/mandala/blob/master/docs_source/blog/01_cf.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

In data-driven fields, such as machine learning, a lot of effort is spent
organizing *computational data* &mdash; results of running programs &mdash; so [...]

```python
print(cf.df().to_markdown())
```



![svg](01_cf_files/01_cf_2_0.svg)



| | x | increment | y | add | w |
|---:|----:|:--------------------------------------------|----:|:--------------------------------------|----:|
| 0 | 1 | Call(increment, cid='948...', hid='6e2...') | 2 | | nan |
| 1 | 0 | Call(increment, cid='d47...', hid='230...') | 1 | Call(add, cid='89c...', hid='247...') | 1 |
| 2 | 2 | Call(increment, cid='bfb...', hid='5dd...') | 3 | Call(add, cid='a81...', hid='626...') | 5 |
| 3 | 4 | Call(increment, cid='928...', hid='adf...') | 5 | Call(add, cid='a54...', hid='deb...') | 9 |
| 4 | 3 | Call(increment, cid='9b4...', hid='df2...') | 4 | | nan |


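For reference, here is a minimal sketch of code that could produce a graph and
table like the ones above. It assumes mandala's `@op` decorator, `Storage`
object, and `Storage.cf` constructor as described in the project docs; the
setup code the post actually ran is not shown in this diff, so treat the exact
method names (e.g. `expand_forward`) as an approximation:

```python
from mandala.imports import Storage, op

@op
def increment(x):
    return x + 1

@op
def add(x, y):
    return x + y

storage = Storage()  # in-memory storage of values and calls

with storage:  # every @op call inside this block is memoized
    for x in range(5):
        y = increment(x)
        if x % 2 == 0:  # only even inputs get a downstream `add` call
            add(x, y)

# collect all calls to `increment` into a CF, then grow it forward to `add`
cf = storage.cf(increment).expand_forward()
cf.draw(verbose=True)
print(cf.df().to_markdown())
```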
This small example illustrates the main components of the CF workflow:
[...]

```python
cf.draw(verbose=True, orientation='TB')
```



![svg](01_cf_files/01_cf_10_0.svg)



[...]

```python
cf.draw(verbose=True)
```



![svg](01_cf_files/01_cf_14_0.svg)



Expand All @@ -247,10 +249,10 @@ print(cf.df()[['accuracy', 'scale_data', 'train_svc', 'train_random_forest']].so

| | accuracy | scale_data | train_svc | train_random_forest |
|---:|-----------:|:---------------------------------------------|:--------------------------------------------|:------------------------------------------------------|
| 1 | 0.915 | Call(scale_data, cid='09f...', hid='d6b...') | | Call(train_random_forest, cid='e26...', hid='c42...') |
| 3 | 0.885 | | | Call(train_random_forest, cid='519...', hid='997...') |
| 0 | 0.82 | Call(scale_data, cid='09f...', hid='d6b...') | Call(train_svc, cid='6f4...', hid='7d9...') | |
| 2 | 0.82 | | Call(train_svc, cid='ddf...', hid='6a0...') | |

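The ops behind this table are defined in a part of the post not shown in this
diff. As a rough, hypothetical reconstruction (only the names `scale_data`,
`train_svc`, and `train_random_forest` and the hyperparameters come from the
post; the signatures, defaults, and the `eval_model` helper are guesses):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from mandala.imports import op

@op
def scale_data(X):
    return StandardScaler().fit_transform(X)

@op
def train_svc(X_train, y_train, kernel='rbf'):
    return SVC(kernel=kernel).fit(X_train, y_train)

@op
def train_random_forest(X_train, y_train, n_estimators=10):
    return RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)

@op
def eval_model(model, X_test, y_test):
    return model.score(X_test, y_test)  # the `accuracy` variable in the table
```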

So is this the full story of this dataset? We might want to investigate further [...]

```python
cf.draw(verbose=True)
```



![svg](01_cf_files/01_cf_21_0.svg)



Expand All @@ -332,7 +334,7 @@ cf = cf.expand_back(recursive=True); cf.draw(verbose=True)



![svg](01_cf_files/01_cf_24_0.svg)



[...]

```python
print(cf.df()[['n_estimators', 'kernel', 'accuracy']].sort_values('accuracy', ascending=False).to_markdown())
```

| | n_estimators | kernel | accuracy |
|---:|:-----------------------------|:-------------------------------------------|-----------:|
| 1 | | rbf | 0.915 |
| 5 | 5 | | 0.915 |
| 9 | | rbf | 0.91 |
| 2 | 20 | | 0.9 |
| 4 | 10 | | 0.9 |
| 6 | 10 | | 0.9 |
| 7 | 20 | | 0.9 |
| 0 | ValueCollection([20, 10, 5]) | ValueCollection(['linear', 'rbf', 'poly']) | 0.895 |
| 11 | ValueCollection([20, 10, 5]) | ValueCollection(['linear', 'rbf', 'poly']) | 0.895 |
| 13 | 5 | | 0.885 |
| 10 | | poly | 0.835 |
| 3 | | linear | 0.82 |
| 8 | | linear | 0.82 |
| 12 | | poly | 0.82 |


Rows where `n_estimators` is `None` correspond to the SVC models, and rows
where `kernel` is `None` correspond to the random forest models.
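For instance, assuming `cf.df()` returns an ordinary pandas dataframe (as the
outputs above suggest), the two model families can be pulled apart directly:

```python
df = cf.df()[['n_estimators', 'kernel', 'accuracy']]
svc_runs = df[df['n_estimators'].isna()]  # SVC calls have no `n_estimators`
rf_runs = df[df['kernel'].isna()]         # random forest calls have no `kernel`
```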
[...] efficient re-computation of only the parts of the computation that have changed.

If you're interested in learning more, check out the [mandala
documentation](https://amakelov.github.io/mandala/), and feel free to reach out
on [Twitter](https://x.com/AMakelov).

## Why "tidy"?
## Why "tidy"? And some other related work

### Tidy data vs tidy computations
In many ways, the ideas here are a re-imagining of Hadley Wickham's
[Tidy Data](https://www.jstatsoft.org/article/view/v059i10) in the context of
computational data management. In particular, the focus is on computations built
from repeated calls to the same set of functions composed in various ways, which
is a common pattern in machine learning and scientific computing. As with the
tidy data philosophy, the goal is to eliminate the code and effort required to
organize computations beyond just running the computation itself.

Despite the different setups &mdash; data cleaning versus computation tracking
&mdash; there are many similarities between the two approaches. This is because
in an abstract sense you can think of "data" as a kind of "computation" that
nature has performed via some "data-generating process". The difference stems
from this process typically being unknown, hidden or otherwise hard to model.
Perhaps this is also why the tidy data paper spends some time on notions of
functional dependencies and normal forms, which are also relevant to
computations. In fact, tidy data is in Codd's third normal form, which is in
turn a more relaxed version of the Boyce-Codd normal form (BCNF). BCNF is
automatically satisfied by operation nodes in a computation frame when viewed
as relations: since calls are memoized, each call is uniquely determined by its
inputs, so the inputs form a key on which every other column depends.

On the one hand, the explicit knowledge of the data-generating process makes the
job of computation tracking easier in an ontological sense. Wickham remarks
that, while easy to disambiguate in concrete examples, the concepts of
"variable" and "observation" are actually hard to define abstractly. Not so for
computations: variables are inputs/outputs of functions, and observations are
function calls.

But on the other hand, this detailed knowledge also gives us more complex
situations to handle, such as feedback loops, branching pipelines, and
aggregation/decomposition. This calls for more expressive tools and grammars.
Furthermore, even if functions
impose a notion of variables and observations, this does not prevent one from
designing function interfaces poorly, which can in turn lead to messy
computations.

### Graphs, databases, categories
There's a rich history of work in relational databases and graph databases, and
CFs share some similarities with both:

- A CF can be seen as a relational database, with a table for each operation and
each variable. The operation tables have columns labeled by the inputs/outputs
of the operation, and the values in these columns are pointers to the
corresponding input/output values, which are stored in the (single-column)
variable tables.
- The `.df()` method works by performing an outer join operation (in a specific
  order) on the tables containing the full computational history of all final
  values in the CF; a toy pandas version is sketched after this list.
- Similarly, a CF can be seen as a graph database, where the nodes are
calls and values, and the variables and operations serve as "node types".
- Correspondingly, some operations, such as expansion or finding the
  dependencies/dependents of some nodes/values, can be seen as graph traversal
  operations.
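To make the relational picture concrete, here is a toy pandas version of the
outer-join behavior of `.df()`, using the small `increment`/`add` example from
the beginning. The table layout is illustrative, not mandala's internal
schema, and plain values stand in for the content/history IDs a real CF stores:

```python
import pandas as pd

# one table per operation, with a column per input/output variable
increment_calls = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'y': [1, 2, 3, 4, 5]})
add_calls = pd.DataFrame({'x': [0, 2, 4], 'y': [1, 3, 5], 'w': [1, 5, 9]})

# outer-join along shared variables: inputs with no downstream `add`
# call keep their rows, with NaN in the `w` column
history = increment_calls.merge(add_calls, on=['x', 'y'], how='outer')
print(history)
```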

Finally, the CF data structure can be seen as a kind of "functor" from a finite
category (roughly speaking, the call graph) to the category of sets. Some
consequences of this perspective &mdash; which combines graphs and databases
&mdash; are presented e.g. in [this
paper](https://compositionality-journal.org/papers/compositionality-4-5/).
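One hedged way to make the functor picture precise (the notation below is
mine, not the post's): take the free category $\mathcal{C}$ on the CF's
underlying graph, with a node per variable and per operation, and an edge from
an operation to each variable it reads or writes. A CF then assigns sets and
functions to this shape:

```latex
\[
  F : \mathcal{C} \longrightarrow \mathbf{Set}
\]
% F sends each variable node v to its set of stored values F(v),
% each operation node f to its set of recorded calls F(f), and each
% edge f -> v to the function F(f) -> F(v) taking a call to the value
% it consumed or produced in that input/output slot.
```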
