-
I'm going to convert this into a discussion, and I will respond further there.
-
Many good points raised. I'll try to respond briefly to each, and then perhaps spin them off into separate discussions.
I never got to the end of Out of the Tar Pit. It seemed to define an architecture (and philosophy) more than a programming language, and it seemed to deal with the problems of transactions (distributed mutable state), whereas I am more concerned with queries. What Morel has in common with OotTP is a feeling that relational is too good an idea to be left out of general-purpose languages.
I've intentionally deferred many decisions (e.g. eager vs lazy, pure vs impure, ordered vs unordered) until there was a good reason to decide. I wanted to get to a point where we have a working prototype language and people interested in kicking the tires, and then have those discussions. And here we are. I wanted the language to be a superset of Standard ML because that is a nice, simple language, and powerful enough to solve real-world problems. (It is currently a subset of SML, mainly because I haven't had time to implement features like modules. But a few things might be thrown out, like references, if we really need the language to be pure.) I also want the language to be as powerful as core SQL.
Currently the core relational operators are scan, filter, project, cartesian product, aggregate, sort, union, intersect and except. Semi-join is available via `in` and `exists`. Join (including outer join) is coming shortly. All relational operators are higher-order functions, so if you want a new relational operator (say Bernoulli sampling) you can write it yourself, albeit without syntactic sugar. I don't worry that extended relational algebra is poorly defined. Academic relational algebra works over sets and uses Greek symbols for operators, whereas industrial relational algebra works over multisets, allows nested collections and uses English words for operators. While different, they are similar enough to carry results and intuition from one to the other.
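For instance, here is a minimal sketch in Standard ML (Morel's base language) of Bernoulli sampling as an ordinary higher-order function. The `rand` argument is an assumption, a stand-in for whatever PRNG the host environment provides:

```sml
(* Bernoulli sampling over a list-valued relation: keep each row
   independently with probability p. *)
fun bernoulliSample (rand: unit -> real) (p: real) (relation: 'a list) : 'a list =
  List.filter (fn _ => rand () < p) relation
```

With `emps` bound to a list of rows, `bernoulliSample rand 0.1 emps` keeps each row with probability 0.1; no new syntax required, just a function.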
I think partial orders are going to be useful for recursive/incremental queries (computing fixed points). However, I would not describe a query as partially ordered. This is what I was thinking: some queries have an expectation of a particular order (say because there was an `order by`), and some languages have an […]. Why do I make the distinction between ordered and unordered queries? Because I would like people to be able to use the […]
I think that, like SQL, Morel should allow both sets and bags. Some operators will tend to make collections unique, just as SQL's `union` does. Will Morel ever know that a list cannot contain duplicate elements, or is sorted? I don't know. SQL occasionally requires constraints (e.g. […]). Perreaux (see #69) makes the point that monad comprehensions always give you the same monad (apply a comprehension to a set and you get a set; apply it to a list and you get a list), whereas monoid comprehensions allow you to choose which monoid (e.g. you could read a list and produce an ordered list, or a set). I like that idea. I think that Morel users should be able to say which collection type they want their results to be.
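A minimal sketch of that idea, assuming nothing about Morel's actual API: the caller supplies the output monoid (an identity value and a merge function), so the same traversal can produce a list, a sum, or anything else:

```sml
(* A comprehension parameterized by its output monoid. *)
fun comprehend (empty: 'b, merge: 'b * 'b -> 'b) (inject: 'a -> 'b) (xs: 'a list) : 'b =
  List.foldl (fn (x, acc) => merge (acc, inject x)) empty xs

fun asList xs = comprehend ([], op @) (fn x => [x]) xs  (* list monoid: keeps order and duplicates *)
fun asSum xs = comprehend (0, op +) (fn x => x) xs      (* int-sum monoid: order is irrelevant *)
```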
We inherit from ML the idea that fields are ordered alphabetically. (Or numerically, in the case of the integer-named fields that make up tuples.) This causes some implementation problems when we map to SQL but I don't think it's a problem for language design. When you say 'unordered' do you mean that the order is undefined and therefore nondeterministic? I think alphabetically ordered is preferable to that.
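For example (standard SML behavior, which Morel inherits):

```sml
val r1 = {b = 2, a = 1};     (* printed as {a=1,b=2}: field order is immaterial *)
val r2 = {a = 1, b = 2};     (* same type and same value as r1 *)
val t = {1 = "x", 2 = "y"};  (* numeric labels: identical to the tuple ("x", "y") *)
```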
Currently a relation is a list. By which I mean, lists are the only collection type that you can use in a `from` expression. It doesn't have to be a list of records; it can be a list of any type, including tuples, primitives and datatypes. Eventually I would like other types (ordered sets, unordered sets, unordered multisets, vectors, dictionaries) to be used as relations. I don't want Morel to get as monad-crazy as Haskell, but it makes sense that any monad type (e.g. […]) could eventually be used as a relation.
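For example, a hedged sketch of the flavor (exact clause syntax may differ between Morel versions; the field access shown is plain SML record selection):

```sml
(* A relation is just a list; here, a list of records. *)
val emps =
  [{name = "Shaggy", deptno = 10},
   {name = "Scooby", deptno = 20}];

(* Query it with a from expression. *)
from e in emps
  where #deptno e = 10
  yield #name e;
```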
Those are reasonable things to want. In a SQL system they would probably manifest as hints. I don't know whether Standard ML has an annotation facility.
Yes, the availability of statistics is traditionally a difference between PL optimization and query optimization. I gather that the gap is narrowing, and some PL optimizers are able to ingest statistics. If your Morel program references external relations (and it probably should, if you are writing something that is more like a "query" than a traditional "program") then large parts of your program will be translated to Calcite relational algebra (while your Morel program is being compiled, not at run time), and Calcite has access to statistics.
I've not implemented all of Standard ML's type system (I've done […]). Hopefully the rest won't make the syntax more verbose, won't require lots of type annotations, and won't lead to lots of mysterious implicit code. One use of nominal types, and probably type classes, will be specifying algebraic properties. Say I define my own aggregate function. I want to be able to say that it rolls up as a commutative monoid, order doesn't matter, duplicate values don't matter, and here are the functions to do those operations. Again, I don't want things to be too weird. But if a user can write simply […].
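A hedged sketch of what that might look like, using plain SML and illustrative names rather than Morel's actual API: the aggregate supplies commutative-monoid operations, so an engine may partition the input arbitrarily and merge partial results in any order:

```sml
type 'a monoid = {zero: 'a, plus: 'a * 'a -> 'a}

val sumOfInts : int monoid = {zero = 0, plus = op +}

(* Aggregate each chunk, then merge the partial results; correctness
   does not depend on how the input was partitioned. *)
fun rollUp ({zero, plus}: 'a monoid) (chunks: 'a list list) : 'a =
  List.foldl plus zero (List.map (List.foldl plus zero) chunks)
```

`rollUp sumOfInts [[1, 2], [3], [4, 5]]` evaluates to 15 however the input was split.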
Indexes definitely don't belong in Morel. They aren't in SQL's query language. The optimizer knows about the indexes (and statistics gleaned from them) but the existence of indexes doesn't affect whether a program is valid. Also like SQL, you can't refer to indexes from your program.
I know much more about relational algebra than linear algebra. But the basic patterns are the same: higher-order functions; syntactic sugar to make it easier to write programs that use those functions; and optimization based on rewrites, statistics and knowledge of materialized views. Someone could apply those same patterns to linear algebra, and also to graph operations and (since you mention XQuery) document operations.
Regarding using and maintaining materialized views, I don't see Morel doing any better than a DB could do. I'm working on streaming extensions to SQL; that work has connections to incremental query execution, and also to temporal database support. But I want to see how the streaming extensions land in SQL before trying to add streaming support to Morel.
That's another area where we would defer to the database. See e.g. Tempura in Calcite CALCITE-4568.
Especially for "big data" queries (massively parallel) there are bigger wins to be had by restructuring the query (e.g. using a materialized view, or changing join order) than by clever code generation. So Morel's goal is to use a DB optimizer whenever appropriate for the program.
We've laid the groundwork already. Programs are pure, so we know from the algebra when it is safe to divide a program into parallel streams. We convert to Calcite relational algebra, and work is under way on Calcite adapters for Apache Arrow, which can compile queries for GPUs (see CALCITE-2040). We just need to put the pieces together.
I agree. Thanks for starting the discussion. This was a very long thread, even though I have tried to be brief in each answer. Many of the questions/answers above (e.g. consideration of the […]) deserve discussions of their own.
-
The XQuery and LINQ analogies miss a key point, which is that this is an implementation of functional relational programming, most famously advocated in Out of the Tar Pit.
One concern is that the language so far seems to reflect its sole researcher's individual contribution, and locking in what the language is about too early risks settling into a specific groove, making it difficult to generalize or accommodate desirable things later. For example, the syntax is already proposed without a robust discussion of semantics, or of the goals that drive the syntax design (at least none that I could find; I might have missed it).
I wonder, in the diverse space of relational algebra and (ill-defined) extended relational algebra, what's included and what's not, or whether it's somehow kept malleable or even deferred this time.
Though more about implementation and optimization, I'm curious about the stance on the ongoing debates in query optimization, and how the language at least permits these: vectorization vs pipelining (and mixes of the two); placement of pipeline breakers and elective materialization; support for quick but approximate queries; and prioritizing latency of initial response (partial query results) over overall throughput.
Still operational: I'm super interested in the point that the optimizer needs to be cost-based rather than (just) rule-based, as that requires coupling data, or at least metadata and statistics, to the static code/query. And how is the language expected to excel at arbitrary layers of the memory hierarchy, from optimal register and stack allocation, through the L1, L2 and L3 CPU caches and main memory, to fast and slow magnetic and serial storage, down to things like AWS Glacier?
The type system is also interesting in detail, for example nominal vs structural typing (should I be able to add an angle in radians to one in degrees?).
Types of indices, or things like clustered indices, would be premature to lock in, yet it'd be interesting to know what the language designer would say about indices in general, and what the influences are (concepts, example paper links, etc.).
Any thoughts about integrating relational algebra with elements of linear algebra? Or including solvers for the user, so they can pose problems that are not just directly executed as RA operators, but are partly driven in reverse, where hill climbing and more sophisticated algorithms yield answers?
Any support for reactivity, streaming and incremental, efficient recalculation? As in, can the user have maybe tens of levels of "views" that are then updated (via materialization or pipelining), or is it expected to be like most DBs, for which this is an edge use case that they handle horribly, efficiency-wise?
How about multi-query optimization where, due to complex querying or multiple users querying simultaneously, it's the overall optimum and constraints like fairness and SLA adherence that matter?
Any list of papers or terms that are meant to be considered for inclusion? E.g. the current literature is full of optimizing compilation to machine code for individual queries (which has pros and cons).
Any consideration for GPU based computation and visualization?
There are so many facets of what makes a good language for FRP, considering use cases, and a healthy discussion of all these seems valuable.