
add related work

amakelov committed Jul 7, 2024
1 parent 747642d commit e0b4e96
Showing 44 changed files with 4,766 additions and 5,727 deletions.
docs/docs/blog/cf.md → docs/docs/blog/01_cf.md (110 changes: 82 additions & 28 deletions)
# Tidy Computations
<a href="https://colab.research.google.com/github/amakelov/mandala/blob/master/docs_source/blog/01_cf.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

In data-driven fields, such as machine learning, a lot of effort is spent
organizing *computational data* &mdash; results of running programs &mdash; so [...]

```python
print(cf.df().to_markdown())
```



![svg](01_cf_files/01_cf_2_0.svg)



| | x | increment | y | add | w |
|---:|----:|:--------------------------------------------|----:|:--------------------------------------|----:|
| 0 | 1 | Call(increment, cid='948...', hid='6e2...') | 2 | | nan |
| 1 | 0 | Call(increment, cid='d47...', hid='230...') | 1 | Call(add, cid='89c...', hid='247...') | 1 |
| 2 | 2 | Call(increment, cid='bfb...', hid='5dd...') | 3 | Call(add, cid='a81...', hid='626...') | 5 |
| 3 | 4 | Call(increment, cid='928...', hid='adf...') | 5 | Call(add, cid='a54...', hid='deb...') | 9 |
| 4 | 3 | Call(increment, cid='9b4...', hid='df2...') | 4 | | nan |


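For reference, here is a minimal sketch of code that could produce a graph and
table like the ones above. It assumes mandala's `@op` decorator, `Storage`
object, and `Storage.cf` constructor as described in the project docs; the
setup code the post actually ran is not shown in this diff, so treat the exact
method names (e.g. `expand_forward`) as an approximation:

```python
from mandala.imports import Storage, op

@op
def increment(x):
    return x + 1

@op
def add(x, y):
    return x + y

storage = Storage()  # in-memory storage of values and calls

with storage:  # every @op call inside this block is memoized
    for x in range(5):
        y = increment(x)
        if x % 2 == 0:  # only even inputs get a downstream `add` call
            add(x, y)

# collect all calls to `increment` into a CF, then grow it forward to `add`
cf = storage.cf(increment).expand_forward()
cf.draw(verbose=True)
print(cf.df().to_markdown())
```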
This small example illustrates the main components of the CF workflow:
[...]

```python
cf.draw(verbose=True, orientation='TB')
```



![svg](01_cf_files/01_cf_10_0.svg)



[...]

```python
cf.draw(verbose=True)
```



![svg](01_cf_files/01_cf_14_0.svg)



Expand All @@ -247,10 +249,10 @@ print(cf.df()[['accuracy', 'scale_data', 'train_svc', 'train_random_forest']].so

| | accuracy | scale_data | train_svc | train_random_forest |
|---:|-----------:|:---------------------------------------------|:--------------------------------------------|:------------------------------------------------------|
| 1 | 0.915 | Call(scale_data, cid='09f...', hid='d6b...') | | Call(train_random_forest, cid='e26...', hid='c42...') |
| 3 | 0.885 | | | Call(train_random_forest, cid='519...', hid='997...') |
| 0 | 0.82 | Call(scale_data, cid='09f...', hid='d6b...') | Call(train_svc, cid='6f4...', hid='7d9...') | |
| 2 | 0.82 | | Call(train_svc, cid='ddf...', hid='6a0...') | |

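The ops behind this table are defined in a part of the post not shown in this
diff. As a rough, hypothetical reconstruction (only the names `scale_data`,
`train_svc`, and `train_random_forest` and the hyperparameters come from the
post; the signatures, defaults, and the `eval_model` helper are guesses):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from mandala.imports import op

@op
def scale_data(X):
    return StandardScaler().fit_transform(X)

@op
def train_svc(X_train, y_train, kernel='rbf'):
    return SVC(kernel=kernel).fit(X_train, y_train)

@op
def train_random_forest(X_train, y_train, n_estimators=10):
    return RandomForestClassifier(n_estimators=n_estimators).fit(X_train, y_train)

@op
def eval_model(model, X_test, y_test):
    return model.score(X_test, y_test)  # the `accuracy` variable in the table
```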

So is this the full story of this dataset? We might want to investigate further [...]

```python
cf.draw(verbose=True)
```



![svg](01_cf_files/01_cf_21_0.svg)



Expand All @@ -332,7 +334,7 @@ cf = cf.expand_back(recursive=True); cf.draw(verbose=True)



![svg](01_cf_files/01_cf_24_0.svg)



[...]

```python
print(cf.df()[['n_estimators', 'kernel', 'accuracy']].sort_values('accuracy', ascending=False).to_markdown())
```

| | n_estimators | kernel | accuracy |
|---:|:-----------------------------|:-------------------------------------------|-----------:|
| 1 | | rbf | 0.915 |
| 5 | 5 | | 0.915 |
| 9 | | rbf | 0.91 |
| 2 | 20 | | 0.9 |
| 4 | 10 | | 0.9 |
| 6 | 10 | | 0.9 |
| 7 | 20 | | 0.9 |
| 0 | ValueCollection([20, 10, 5]) | ValueCollection(['linear', 'rbf', 'poly']) | 0.895 |
| 11 | ValueCollection([20, 10, 5]) | ValueCollection(['linear', 'rbf', 'poly']) | 0.895 |
| 13 | 5 | | 0.885 |
| 10 | | poly | 0.835 |
| 3 | | linear | 0.82 |
| 8 | | linear | 0.82 |
| 12 | | poly | 0.82 |


Rows where `n_estimators` is `None` correspond to the SVC models, and rows
where `kernel` is `None` correspond to the random forest models.
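For instance, assuming `cf.df()` returns an ordinary pandas dataframe (as the
outputs above suggest), the two model families can be pulled apart directly:

```python
df = cf.df()[['n_estimators', 'kernel', 'accuracy']]
svc_runs = df[df['n_estimators'].isna()]  # SVC calls have no `n_estimators`
rf_runs = df[df['kernel'].isna()]         # random forest calls have no `kernel`
```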
[...] efficient re-computation of only the parts of the computation that have changed.

If you're interested in learning more, check out the [mandala
documentation](https://amakelov.github.io/mandala/), and feel free to reach out
on [Twitter](https://x.com/AMakelov).

## Why "tidy"?
## Why "tidy"? And some other related work

### Tidy data vs tidy computations
In many ways, the ideas here are a re-imagining of Hadley Wickham's
[Tidy Data](https://www.jstatsoft.org/article/view/v059i10) in the context of
computational data management. In particular, the focus is on computations built
from repeated calls to the same set of functions composed in various ways, which
is a common pattern in machine learning and scientific computing. As with the
tidy data philosophy, the goal is to eliminate the code and effort required to
organize computations beyond just running the computation itself.

Despite the different setups &mdash; data cleaning versus computation tracking
&mdash; there are many similarities between the two approaches. This is because
in an abstract sense you can think of "data" as a kind of "computation" that
nature has performed via some "data-generating process". The difference stems
from this process typically being unknown, hidden or otherwise hard to model.
Perhaps this is also why the tidy data paper spends some time on notions of
functional dependencies and normal forms, which are also relevant to
computations. In fact, tidy data is in Codd's third normal form, which is in
turn a more relaxed version of the Boyce-Codd normal form (BCNF). BCNF is
automatically satisfied by operation nodes in a computation frame when viewed
as relations: since calls are memoized, each call is uniquely determined by its
inputs, so the inputs form a key on which every other column depends.

On the one hand, the explicit knowledge of the data-generating process makes the
job of computation tracking easier in an ontological sense. Wickham remarks
that, while easy to disambiguate in concrete examples, the concepts of
"variable" and "observation" are actually hard to define abstractly. Not so for
computations: variables are inputs/outputs of functions, and observations are
function calls.

But on the other hand, this detailed knowledge also gives us more complex
situations to handle, such as feedback loops, branching pipelines, and
aggregation/decomposition. This calls for more expressive tools and grammars.
Furthermore, even if functions
impose a notion of variables and observations, this does not prevent one from
designing function interfaces poorly, which can in turn lead to messy
computations.

### Graphs, databases, categories
There's a rich history of work in relational databases and graph databases, and
CFs share some similarities with both:

- A CF can be seen as a relational database, with a table for each operation and
each variable. The operation tables have columns labeled by the inputs/outputs
of the operation, and the values in these columns are pointers to the
corresponding input/output values, which are stored in the (single-column)
variable tables.
- The `.df()` method works by performing an outer join operation (in a specific
  order) on the tables containing the full computational history of all final
  values in the CF; a toy pandas version is sketched after this list.
- Similarly, a CF can be seen as a graph database, where the nodes are
calls and values, and the variables and operations serve as "node types".
- Correspondingly, some operations, such as expansion or finding the
  dependencies/dependents of some nodes/values, can be seen as graph traversal
  operations.
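To make the relational picture concrete, here is a toy pandas version of the
outer-join behavior of `.df()`, using the small `increment`/`add` example from
the beginning. The table layout is illustrative, not mandala's internal
schema, and plain values stand in for the content/history IDs a real CF stores:

```python
import pandas as pd

# one table per operation, with a column per input/output variable
increment_calls = pd.DataFrame({'x': [0, 1, 2, 3, 4], 'y': [1, 2, 3, 4, 5]})
add_calls = pd.DataFrame({'x': [0, 2, 4], 'y': [1, 3, 5], 'w': [1, 5, 9]})

# outer-join along shared variables: inputs with no downstream `add`
# call keep their rows, with NaN in the `w` column
history = increment_calls.merge(add_calls, on=['x', 'y'], how='outer')
print(history)
```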

Finally, the CF data structure can be seen as a kind of "functor" from a finite
category (roughly speaking, the call graph) to the category of sets. Some
consequences of this perspective &mdash; which combines graphs and databases
&mdash; are presented e.g. in [this
paper](https://compositionality-journal.org/papers/compositionality-4-5/).
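One hedged way to make the functor picture precise (the notation below is
mine, not the post's): take the free category $\mathcal{C}$ on the CF's
underlying graph, with a node per variable and per operation, and an edge from
an operation to each variable it reads or writes. A CF then assigns sets and
functions to this shape:

```latex
\[
  F : \mathcal{C} \longrightarrow \mathbf{Set}
\]
% F sends each variable node v to its set of stored values F(v),
% each operation node f to its set of recorded calls F(f), and each
% edge f -> v to the function F(f) -> F(v) taking a call to the value
% it consumed or produced in that input/output slot.
```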
