-
I'm going to convert this into a discussion, and I will respond further there.
-
Many good points raised. I'll try to respond briefly to each, and then perhaps spin them off into separate discussions.
I never got to the end of Out of the Tar Pit. It seemed to define an architecture (and philosophy) more than a programming language, and it seemed to deal with the problems of transactions (distributed mutable state), whereas I am more concerned with queries. What Morel has in common with OotTP is a feeling that relational is too good an idea to be left out of general-purpose languages.
I've intentionally deferred many decisions (e.g. eager vs lazy, pure vs impure, ordered vs unordered) until there was a good reason to decide. I wanted to get to a point where we have a working prototype language and people interested in kicking the tires, and then have those discussions. And here we are. I wanted the language to be a superset of Standard ML because that is a nice, simple language, and powerful enough to solve real-world problems. (It is currently a subset of SML, mainly because I haven't had time to implement features like modules. But a few things might be thrown out, like references, if we really need the language to be pure.) I also want the language to be as powerful as core SQL.
Currently the core relational operators are scan, filter, project, cartesian product, aggregate, sort, union, intersect and except. Semi-join is available via `in` and `exists`. Join (including outer join) is coming shortly. All relational operators are higher-order functions, so if you want a new relational operator (say Bernoulli sampling) you can write it yourself, albeit without syntactic sugar. I don't worry that extended relational algebra is poorly defined. Academic relational algebra works over sets and uses Greek symbols for operators, whereas industrial relational algebra works over multisets, allows nested collections and uses English words for operators. While different, they are similar enough to carry results and intuition from one to the other.
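For instance, here is a minimal sketch in Standard ML (Morel's base language) of Bernoulli sampling as an ordinary higher-order function. The `rand` argument is an assumption, a stand-in for whatever PRNG the host environment provides:

```sml
(* Bernoulli sampling over a list-valued relation: keep each row
   independently with probability p. *)
fun bernoulliSample (rand: unit -> real) (p: real) (relation: 'a list) : 'a list =
  List.filter (fn _ => rand () < p) relation
```

With `emps` bound to a list of rows, `bernoulliSample rand 0.1 emps` keeps each row with probability 0.1; no new syntax required, just a function.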
I think partial orders are going to be useful for recursive/incremental queries (computing fixed points). However, I would not describe a query as partially ordered. This is what I was thinking: some queries have an expectation of a particular order (say because there was an `order by`), and some languages have an […]. Why do I make the distinction between ordered and unordered queries? Because I would like people to be able to use the […]
I think that, like SQL, Morel should allow both sets and bags. Some operators will tend to make collections unique, just as SQL's `union` does. Will Morel ever know that a list cannot contain duplicate elements, or is sorted? I don't know. SQL occasionally requires constraints (e.g. […]). Perreaux (see #69) makes the point that monad comprehensions always give you the same monad (apply a comprehension to a set and you get a set; apply it to a list and you get a list), whereas monoid comprehensions allow you to choose which monoid (e.g. you could read a list and produce an ordered list, or a set). I like that idea. I think that Morel users should be able to say which collection type they want their results to be.
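A minimal sketch of that idea, assuming nothing about Morel's actual API: the caller supplies the output monoid (an identity value and a merge function), so the same traversal can produce a list, a sum, or anything else:

```sml
(* A comprehension parameterized by its output monoid. *)
fun comprehend (empty: 'b, merge: 'b * 'b -> 'b) (inject: 'a -> 'b) (xs: 'a list) : 'b =
  List.foldl (fn (x, acc) => merge (acc, inject x)) empty xs

fun asList xs = comprehend ([], op @) (fn x => [x]) xs  (* list monoid: keeps order and duplicates *)
fun asSum xs = comprehend (0, op +) (fn x => x) xs      (* int-sum monoid: order is irrelevant *)
```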
We inherit from ML the idea that fields are ordered alphabetically. (Or numerically, in the case of the integer-named fields that make up tuples.) This causes some implementation problems when we map to SQL but I don't think it's a problem for language design. When you say 'unordered' do you mean that the order is undefined and therefore nondeterministic? I think alphabetically ordered is preferable to that.
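For example (standard SML behavior, which Morel inherits):

```sml
val r1 = {b = 2, a = 1};     (* printed as {a=1,b=2}: field order is immaterial *)
val r2 = {a = 1, b = 2};     (* same type and same value as r1 *)
val t = {1 = "x", 2 = "y"};  (* numeric labels: identical to the tuple ("x", "y") *)
```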
Currently a relation is a list. By which I mean, lists are the only collection type that you can use in a `from` expression. It doesn't have to be a list of records; it can be a list of any type, including tuples, primitives and datatypes. Eventually I would like other types (ordered sets, unordered sets, unordered multisets, vectors, dictionaries) to be used as relations. I don't want Morel to get as monad-crazy as Haskell, but it makes sense that any monad type (e.g. […]) could eventually be used as a relation.
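For example, a hedged sketch of the flavor (exact clause syntax may differ between Morel versions; the field access shown is plain SML record selection):

```sml
(* A relation is just a list; here, a list of records. *)
val emps =
  [{name = "Shaggy", deptno = 10},
   {name = "Scooby", deptno = 20}];

(* Query it with a from expression. *)
from e in emps
  where #deptno e = 10
  yield #name e;
```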
Those are reasonable things to want. In a SQL system they would probably manifest as hints. I don't know whether Standard ML has an annotation facility.
Yes, the availability of statistics is traditionally a difference between PL optimization and query optimization. I gather that the gap is narrowing, and some PL optimizers are able to ingest statistics. If your Morel program references external relations (and it probably should, if you are writing something that is more like a "query" than a traditional "program") then large parts of your program will be translated to Calcite relational algebra (while your Morel program is being compiled, not at run time), and Calcite has access to statistics.
I've not implemented all of Standard ML's type system (I've done […]). Hopefully the rest won't make the syntax more verbose, won't require lots of type annotations, and won't lead to lots of mysterious implicit code. One use of nominal types, and probably type classes, will be specifying algebraic properties. Say I define my own aggregate function. I want to be able to say that it rolls up as a commutative monoid, order doesn't matter, duplicate values don't matter, and here are the functions to do those operations. Again, I don't want things to be too weird. But if a user can write simply […].
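A hedged sketch of what that might look like, using plain SML and illustrative names rather than Morel's actual API: the aggregate supplies commutative-monoid operations, so an engine may partition the input arbitrarily and merge partial results in any order:

```sml
type 'a monoid = {zero: 'a, plus: 'a * 'a -> 'a}

val sumOfInts : int monoid = {zero = 0, plus = op +}

(* Aggregate each chunk, then merge the partial results; correctness
   does not depend on how the input was partitioned. *)
fun rollUp ({zero, plus}: 'a monoid) (chunks: 'a list list) : 'a =
  List.foldl plus zero (List.map (List.foldl plus zero) chunks)
```

`rollUp sumOfInts [[1, 2], [3], [4, 5]]` evaluates to 15 however the input was split.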
Indexes definitely don't belong in Morel. They aren't in SQL's query language. The optimizer knows about the indexes (and statistics gleaned from them) but the existence of indexes doesn't affect whether a program is valid. Also like SQL, you can't refer to indexes from your program.
I know much more about relational algebra than linear algebra. But the basic patterns are the same: higher-order functions; syntactic sugar to make it easier to write programs that use those functions; and optimization based on rewrites, statistics and knowledge of materialized views. Someone could apply those same patterns to linear algebra, and also to graph operations and (since you mention XQuery) document operations.
Regarding using and maintaining materialized views, I don't see Morel doing any better than a DB could do. I'm working on streaming extensions to SQL; that work has connections to incremental query execution, and also to temporal database support. But I want to see how the streaming extensions land in SQL before trying to add streaming support to Morel.
That's another area where we would defer to the database. See e.g. Tempura in Calcite CALCITE-4568.
Especially for "big data" queries (massively parallel) there are bigger wins to be had by restructuring the query (e.g. using a materialized view, or changing join order) than by clever code generation. So Morel's goal is to use a DB optimizer whenever appropriate for the program.
We've laid the groundwork already. Programs are pure, so we know from the algebra when it is safe to divide a program into parallel streams. We convert to Calcite relational algebra, and work is under way on Calcite adapters for Apache Arrow, which can compile queries for GPUs (see CALCITE-2040). We just need to put the pieces together.
I agree. Thanks for starting the discussion. This was a very long thread, even though I have tried to be brief in each answer. Many of the questions/answers above (e.g. consideration of the […]) deserve discussions of their own.
-
The XQuery and LINQ analogies miss a key point, which is that this is an implementation of functional relational programming, most famously advocated in Out of the Tar Pit.
One concern is that the language so far seems to reflect its sole researcher's individual contribution, and locking in what the language is about too early risks settling into a specific groove, making it difficult to generalize or accommodate desirable things later. For example, the syntax is already proposed without a robust discussion of semantics, or of the goals that drive the syntax design (at least none that I could find; I might have missed it).
I wonder, in the diverse space of relational algebra and (ill-defined) extended relational algebra, what's included and what's not, or whether it's somehow kept malleable or even deferred this time.
Though more about implementation and optimization, I'm curious about the stance on the ongoing debates in query optimization, and how the language at least permits these: vectorization vs pipelining (and mixes of the two); placement of pipeline breakers and elective materialization; support for quick but approximate queries; and prioritizing latency of initial response (partial query results) over overall throughput.
Still operational: I'm super interested in the point that the optimizer needs to be cost-based rather than (just) rule-based, as that requires coupling data, or at least metadata and statistics, to the static code/query. And how is the language expected to excel at arbitrary layers of the memory hierarchy, from optimal register and stack allocation, through the L1, L2 and L3 CPU caches and main memory, to fast and slow magnetic and serial storage, down to things like AWS Glacier?
The type system is also interesting in detail, for example nominal vs structural typing (should I be able to add an angle in radians to one in degrees?).
Types of indices, or things like clustered indices, would be premature to lock in, yet it'd be interesting to know what the language designer would say about indices in general, and what the influences are (concepts, example paper links, etc.).
Any thoughts about integrating relational algebra with elements of linear algebra? Or including solvers for the user, so they can pose problems that are not just directly executed as RA operators, but are partly driven in reverse, where hill climbing and more sophisticated algorithms yield answers?
Any support for reactivity, streaming and incremental, efficient recalculation? As in, can the user have maybe tens of levels of "views" that are then updated (via materialization or pipelining), or is it expected to be like most DBs, for which this is an edge use case that they handle horribly, efficiency-wise?
How about multi-query optimization where, due to complex querying or multiple users querying simultaneously, it's the overall optimum and constraints like fairness and SLA adherence that matter?
Any list of papers or terms that are meant to be considered for inclusion? E.g. the current literature is full of optimizing compilation to machine code for individual queries (which has pros and cons).
Any consideration for GPU based computation and visualization?
There are so many facets of what makes a good language for FRP, considering use cases, and a healthy discussion of all these seems valuable.