[MOREL-209] File reader, and progressive types
Adds a type-safe system to browse directories, subdirectories,
and read files as lists of records.

If Morel is started with the argument
`--directory=src/test/resources/data`, you can write

  from d in file.scott.dept;

because `data` has a subdirectory `scott` that contains a
file `dept.csv`.

This change introduces *progressive types* to solve the
problem that, when browsing a large file system, we would
have to traverse every directory, and parse every file,
in order to report the type of the `file` value. The type of
`file` is progressive; it is at first a partial record (with
missing fields shown as `, ...`), but over time, more fields
in that record are discovered. Fields are never lost, so
once a program has been typed, the type remains valid.

Add sample files under `src/test/resources/data`;
migrate `wordle.smli` to use the Wordle sample data.

In this version we support `.csv` and `.csv.gz` files;
later, we could support other file types, or just delegate to
Calcite's file adapter.

Add `class TypedValue`, to represent a value that knows its
own type, so that the value's type can change during the
session.

The new built-in `file` value is not global, but belongs to
a session, to prevent thread-safety issues, and to prevent
concurrent tests from interfering with each other. Add
`BuiltIn.sessionValue`.

During type deduction (unification) we add a label, `z$dummy`,
that distinguishes progressive record types from regular
record types. It is removed before the record type is created.

Add article based on screencast https://youtu.be/uybUjCYsBKI.

Fixes #209
julianhyde committed Jan 4, 2024
1 parent 8f5fc92 commit 561ea6f
Showing 42 changed files with 16,808 additions and 1,964 deletions.
237 changes: 237 additions & 0 deletions docs/2023-12-31-file-reader-and-progressive-types.md
@@ -0,0 +1,237 @@
<!--
{% comment %}
Licensed to Julian Hyde under one or more contributor license
agreements. See the NOTICE file distributed with this work
for additional information regarding copyright ownership.
Julian Hyde licenses this file to you under the Apache
License, Version 2.0 (the "License"); you may not use this
file except in compliance with the License. You may obtain a
copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the License for the specific
language governing permissions and limitations under the
License.
{% endcomment %}
-->

<!--
This post started as a script for a screencast.
Setup for recording:
* In 'morel' script, add '2>/dev/null' to the last java command.
* In bash, export PS1='$ '
* So that the screencast starts with title and author,
create a title file, /tmp/title.txt, and 'cat /tmp/title.txt'
before pressing 'record'.
-->

# File reader and progressive types in Morel version 0.4

## Abstract

In Morel 0.4 we have added a file reader, to make it easy to browse
your file system for data sets in CSV files. A directory appears to
Morel as a record value, and the fields of that record are the files or
subdirectories. If a file is a list of records, such as a CSV file,
its type in Morel will be a list of records.

To make this work, we extended Morel's type system with a feature we
call *progressive types*. Progressive types give you the benefits of
static typing when it's not possible (or is inefficient or
inconvenient) to gather all the type information up front.

There is also a
[screencast](https://www.youtube.com/watch?v=uybUjCYsBKI&t=1s)
based on this article;
[[MOREL-209](https://github.com/hydromatic/morel/issues/209)]
is the feature description.

## File reader

I wanted to give a quick demo of a feature we've just added to Morel.

We've added a file reader so that you can easily load and work on data
sets such as CSV files. We'll be using a data set in a directory
called "data".

```bash
$ ls -lR src/test/resources/data

src/test/resources/data/:
total 8
drwxrwxr-x 2 jhyde jhyde 4096 Dec 31 14:23 scott
drwxrwxr-x 2 jhyde jhyde 4096 Dec 31 14:38 wordle

src/test/resources/data/scott:
total 16
-rw-rw-r-- 1 jhyde jhyde 50 Dec 24 19:10 bonus.csv
-rw-rw-r-- 1 jhyde jhyde 131 Dec 31 14:23 dept.csv
-rw-rw-r-- 1 jhyde jhyde 420 Dec 24 19:10 emp.csv.gz
-rw-rw-r-- 1 jhyde jhyde 127 Dec 24 19:10 salgrade.csv

src/test/resources/data/wordle:
total 92
-rw-rw-r-- 1 jhyde jhyde 11944 Dec 24 19:10 answers.csv
-rw-rw-r-- 1 jhyde jhyde 77844 Dec 24 19:10 words.csv
```

That directory has subdirectories `scott` and `wordle`. Each of
those has some CSV files, and there's one compressed CSV file.

Now let's start the Morel shell:

```bash
$ ./morel --directory=src/test/resources/data
morel version 0.4.0 (java version "21.0.1", JLine terminal, xterm-256color)
-
```

Morel is a functional programming language that is also a query
language. It is statically typed, and its main type constructors are
lists and records.

In the following, we create two record values, a list of records,
and write a simple query.

```
- val fred = {name="Fred", age=27};
val fred = {age=27,name="Fred"} : {age:int, name:string}
- val velma = {name="Velma", age=20};
val velma = {age=20,name="Velma"} : {age:int, name:string}
- val employees = [fred, velma];
val employees = [{age=27,name="Fred"},{age=20,name="Velma"}]
: {age:int, name:string} list
- from e in employees yield e.age;
val it = [27,20] : int list
```

We wanted to make the file reader interactive. You shouldn't have to
leave the Morel shell to see what files are available.

So you can browse the whole file system as if you are looking at the
fields of a record. The `file` object is where you start.

```
- file;
val it = {scott={},wordle={}} : {scott:{...}, wordle:{...}, ...}
```

As you can see, it is a record with fields `scott` and `wordle`. In
the file reader, every directory is a record, and the fields are the
files or subdirectories.
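
The mapping from one directory level to record fields can be sketched in
Java. This is a minimal illustration, not Morel's implementation: the
class `DirectoryRecordSketch` and its methods are hypothetical, and the
choice to skip files that are not `.csv` or `.csv.gz` is an assumption.
Only the given directory is listed, so subdirectories stay unopened
until someone asks for them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not Morel's code): one directory level as a record. */
final class DirectoryRecordSketch {
  private DirectoryRecordSketch() {}

  /** Returns field names mapped to a kind, "directory" or "relation".
   * Only the given directory is listed; subdirectories are not opened. */
  static Map<String, String> fields(Path dir) {
    final Map<String, String> fields = new TreeMap<>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
      for (Path child : children) {
        final String name = child.getFileName().toString();
        if (Files.isDirectory(child)) {
          fields.put(name, "directory");
        } else if (name.endsWith(".csv.gz")) {
          fields.put(stripSuffix(name, ".csv.gz"), "relation");
        } else if (name.endsWith(".csv")) {
          fields.put(stripSuffix(name, ".csv"), "relation");
        }
        // Assumption: any other file is ignored.
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return fields;
  }

  private static String stripSuffix(String s, String suffix) {
    return s.substring(0, s.length() - suffix.length());
  }

  /** Builds a small temporary directory and returns its record view. */
  static Map<String, String> demo() {
    try {
      final Path dir = Files.createTempDirectory("data");
      Files.createDirectory(dir.resolve("wordle"));
      Files.createFile(dir.resolve("dept.csv"));
      Files.createFile(dir.resolve("emp.csv.gz"));
      Files.createFile(dir.resolve("notes.txt")); // skipped: not a CSV file
      return fields(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Prints {dept=relation, emp=relation, wordle=directory}
    System.out.println(demo());
  }
}
```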

Now let's look at `file.scott`:

```
- file.scott;
val it = {bonus=<relation>,dept=<relation>,emp=<relation>,salgrade=<relation>}
: {bonus:{...} list, dept:{...} list, emp:{...} list, salgrade:{...} list,
...}
```

It, too, is a record, but fields such as `dept` and `emp` are listed
as relations, because they are CSV files. We can run queries on those
data sets. Here is a query to compute the total salary budget for each
department. You could write a similar query in SQL using `JOIN` and
`GROUP BY`:

```
- from d in file.scott.dept
= join e in file.scott.emp on d.deptno = e.deptno
= group d.dname compute sum of e.sal;
val it =
[{dname="RESEARCH",sum=10875.0},{dname="SALES",sum=9400.0},
{dname="ACCOUNTING",sum=8750.0}] : {dname:string, sum:real} list
```

After we have traversed into `scott` and `dept`, the type of the
`file` value has changed:

```
- file;
val it =
{scott={bonus=<relation>,dept=<relation>,emp=<relation>,salgrade=<relation>},
wordle={}}
: {
scott:{bonus:{...} list, dept:{deptno:int, dname:string, loc:string} list,
emp:
{comm:real, deptno:int, empno:int, ename:string,
hiredate:string, job:string, mgrno:int, sal:real} list,
salgrade:{...} list, ...}, wordle:{...}, ...}
```

Note that the `scott` field has been expanded, and so have the `dept`
and `emp` fields. This is called *progressive typing*. What is it, and
why did we add it?

## Static, dynamic and progressive typing

Progressive typing defers collecting the type of certain record fields
until they are referenced in a program. Adding it to Morel was the
hardest part of building the file reader.
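
To make the idea concrete, here is a minimal Java sketch of a record type
that discovers its fields on first reference. It is an illustration under
stated assumptions, not Morel's implementation: the class
`ProgressiveRecordSketch`, the `lookup` function (which stands in for
opening a file or directory), and the use of strings to describe types
are all hypothetical. The method name `discoverField` merely echoes the
idea of discovering a field when it is referenced.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical sketch of a progressive record type (not Morel's code). */
final class ProgressiveRecordSketch {
  /** Fields discovered so far: field name to a description of its type. */
  private final Map<String, String> known = new TreeMap<>();
  /** Stands in for the costly step of opening a file or directory. */
  private final Function<String, String> lookup;
  /** Counts how often the costly lookup ran. */
  int lookups = 0;

  ProgressiveRecordSketch(Function<String, String> lookup) {
    this.lookup = lookup;
  }

  /** Returns the type of a field, looking it up on first reference only.
   * Fields are added but never removed, so the type only grows. */
  String discoverField(String name) {
    String type = known.get(name);
    if (type == null) {
      ++lookups;
      type = lookup.apply(name);
      if (type == null) {
        throw new IllegalArgumentException("no field '" + name + "'");
      }
      known.put(name, type);
    }
    return type;
  }

  /** Prints known fields followed by "..." for those not yet discovered. */
  @Override public String toString() {
    final StringBuilder b = new StringBuilder("{");
    known.forEach((k, v) -> b.append(k).append(':').append(v).append(", "));
    return b.append("...}").toString();
  }

  public static void main(String[] args) {
    final ProgressiveRecordSketch scott = new ProgressiveRecordSketch(
        name -> name.equals("dept")
            ? "{deptno:int, dname:string, loc:string} list"
            : null);
    System.out.println(scott); // nothing discovered yet
    scott.discoverField("dept");
    System.out.println(scott); // dept is now part of the type
  }
}
```

The sketch pays the lookup cost once per field and only for fields that a
program actually touches, which is the property that keeps start-up cheap.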

Why is it necessary? Imagine if that `data` directory had contained a
thousand subdirectories and a million files. The type of the `file`
object would be so large that it would fill many screens. More to the
point, the Morel system would take an age to start up, because it would
have to open all those files and directories to find out their types.

Static and dynamic typing each have their strengths (and legions of
passionate fans). Static typing improves performance, code quality
and maintenance, and helps auto-suggestion in IDEs, but requires a
'closed world' where everything is known. Dynamic typing is better
for interacting with the 'open world' of loosely coupled systems,
where not everything is known, and things are forever
changing. Reading a file system is definitely in its sweet spot.

But using dynamic typing is not an option in a strongly, statically
typed language like Morel. By deferring the collection of types,
progressive typing can handle the 'open world' of the file system. By
only ever expanding types, progressive typing retains the guarantees
of static typing. Say I compile my program and `file` has a
particular type. Later, the type of `file` expands due to
progressive typing. My program will still be valid, because all the
fields and sub-fields it needs are still there.
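
The argument reduces to a subset check: a program stays valid as long as
the fields it needs are among the fields the type currently knows, and a
type that only grows can never break that. The Java sketch below shows
this for a single level of field names. The class `ExpansionCheckSketch`
and its methods are hypothetical, and a full check would also recurse
into nested records, which this sketch leaves out.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch (not Morel's code): expansion keeps programs valid. */
final class ExpansionCheckSketch {
  private ExpansionCheckSketch() {}

  /** Returns whether a program that needs the given fields is valid against
   * a record type that currently knows the given fields. */
  static boolean stillValid(Set<String> needed, Set<String> known) {
    return known.containsAll(needed);
  }

  /** Convenience constructor for a set of field names. */
  static Set<String> fields(String... names) {
    return new TreeSet<>(Arrays.asList(names));
  }

  public static void main(String[] args) {
    final Set<String> needed = fields("dept", "emp");
    final Set<String> atCompileTime = fields("dept", "emp");
    final Set<String> later = fields("bonus", "dept", "emp", "salgrade");
    System.out.println(stillValid(needed, atCompileTime)); // true
    System.out.println(stillValid(needed, later)); // true: fields only added
  }
}
```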

Morel's type system remains static. Progressive types can be injected
into programs at particular points (to date, the `file` value is the
only injection point) and do not affect how the rest of the program is
typed.

## Variables

You don't lose the benefits of progressive typing if you use
variables.

For example, this query replaces the expression `file.scott` in the
previous query with a variable `s` to make the query more concise. It
gives the same results as the previous query.

```
- val s = file.scott;
val s = {bonus=<relation>,dept=<relation>,emp=<relation>,salgrade=<relation>}
: {bonus:{...} list, dept:{...} list, emp:{...} list, salgrade:{...} list,
...}
- from d in s.dept
= join e in s.emp on d.deptno = e.deptno
= group d.dname compute sum of e.sal;
val it =
[{dname="RESEARCH",sum=10875.0},{dname="SALES",sum=9400.0},
{dname="ACCOUNTING",sum=8750.0}] : {dname:string, sum:real} list
```

## Conclusion

The file reader lets you browse the file system starting
from a single variable called `file`, and load data sets from CSV
files. Progressive types give you the benefits of static typing but
without filling your screen with useless type information.
26 changes: 14 additions & 12 deletions src/main/java/net/hydromatic/morel/Main.java
@@ -61,8 +61,6 @@

 import static net.hydromatic.morel.util.Static.str;

-import static java.util.Objects.requireNonNull;
-
 /** Standard ML REPL. */
 public class Main {
   private final BufferedReader in;
@@ -77,9 +75,12 @@ public class Main {
    *
    * @param args Command-line arguments */
   public static void main(String[] args) {
+    final List<String> argList = ImmutableList.copyOf(args);
+    final Map<String, ForeignValue> valueMap = ImmutableMap.of();
+    final Map<Prop, Object> propMap = new LinkedHashMap<>();
+    Prop.DIRECTORY.set(propMap, new File(System.getProperty("user.dir")));
     final Main main =
-        new Main(ImmutableList.copyOf(args), System.in, System.out,
-            ImmutableMap.of(), new File(System.getProperty("user.dir")), false);
+        new Main(argList, System.in, System.out, valueMap, propMap, false);
     try {
       main.run();
     } catch (Throwable e) {
@@ -90,21 +91,21 @@ public static void main(String[] args) {

   /** Creates a Main. */
   public Main(List<String> args, InputStream in, PrintStream out,
-      Map<String, ForeignValue> valueMap, File directory, boolean idempotent) {
+      Map<String, ForeignValue> valueMap, Map<Prop, Object> propMap,
+      boolean idempotent) {
     this(args, new InputStreamReader(in), new OutputStreamWriter(out),
-        valueMap, directory, idempotent);
+        valueMap, propMap, idempotent);
   }

   /** Creates a Main. */
   public Main(List<String> argList, Reader in, Writer out,
-      Map<String, ForeignValue> valueMap, File directory, boolean idempotent) {
+      Map<String, ForeignValue> valueMap, Map<Prop, Object> propMap,
+      boolean idempotent) {
     this.in = buffer(idempotent ? stripOutLines(in) : in);
     this.out = buffer(out);
     this.echo = argList.contains("--echo");
     this.valueMap = ImmutableMap.copyOf(valueMap);
-    final Map<Prop, Object> map = new LinkedHashMap<>();
-    Prop.DIRECTORY.set(map, requireNonNull(directory, "directory"));
-    this.session = new Session(map);
+    this.session = new Session(propMap);
     this.idempotent = idempotent;
   }

@@ -203,7 +204,7 @@ private static BufferedReader buffer(Reader in) {
   }

   public void run() {
-    Environment env = Environments.env(typeSystem, valueMap);
+    Environment env = Environments.env(typeSystem, session, valueMap);
     final Consumer<String> echoLines = out::println;
     final Consumer<String> outLines =
         idempotent
@@ -313,7 +314,8 @@ static class SubShell extends Shell {
       outLines.accept("[opening " + fileName + "]");
       File file = new File(fileName);
       if (!file.isAbsolute()) {
-        final File directory = Prop.DIRECTORY.fileValue(main.session.map);
+        final File directory =
+            Prop.SCRIPT_DIRECTORY.fileValue(main.session.map);
         file = new File(directory, fileName);
       }
       if (!file.exists()) {
3 changes: 2 additions & 1 deletion src/main/java/net/hydromatic/morel/Shell.java
@@ -262,8 +262,9 @@ public void run() {
     final TypeSystem typeSystem = new TypeSystem();
     final Map<Prop, Object> map = new LinkedHashMap<>();
     Prop.DIRECTORY.set(map, config.directory);
+    Prop.SCRIPT_DIRECTORY.set(map, config.directory);
     final Session session = new Session(map);
-    Environment env = Environments.env(typeSystem, config.valueMap);
+    Environment env = Environments.env(typeSystem, session, config.valueMap);
     final LineFn lineFn =
         new TerminalLineFn(minusPrompt, equalsPrompt, lineReader);
     final SubShell subShell =
5 changes: 5 additions & 0 deletions src/main/java/net/hydromatic/morel/ast/Core.java
@@ -31,6 +31,7 @@
 import net.hydromatic.morel.type.RecordType;
 import net.hydromatic.morel.type.Type;
 import net.hydromatic.morel.type.TypeSystem;
+import net.hydromatic.morel.type.TypedValue;
 import net.hydromatic.morel.util.Pair;

 import com.google.common.collect.ImmutableList;
@@ -624,6 +625,10 @@ static Comparable wrap(Exp exp, Object value) {
      * {@link Comparable}, the value will be in a wrapper. */
     public <C> C unwrap(Class<C> clazz) {
       Object v;
+      if (value instanceof Wrapper
+          && ((Wrapper) value).o instanceof TypedValue) {
+        return ((TypedValue) ((Wrapper) value).o).valueAs(clazz);
+      }
       if (clazz.isInstance(value) && clazz != Object.class) {
         v = value;
       } else if (Number.class.isAssignableFrom(clazz)
12 changes: 10 additions & 2 deletions src/main/java/net/hydromatic/morel/ast/CoreBuilder.java
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
import net.hydromatic.morel.type.TupleType;
import net.hydromatic.morel.type.Type;
import net.hydromatic.morel.type.TypeSystem;
import net.hydromatic.morel.type.TypedValue;
import net.hydromatic.morel.util.Pair;

import com.google.common.collect.ImmutableList;
Expand Down Expand Up @@ -170,6 +171,12 @@ public Core.Id id(Core.NamedPat idPat) {

public Core.RecordSelector recordSelector(TypeSystem typeSystem,
RecordLikeType recordType, String fieldName) {
final @Nullable TypedValue typedValue = recordType.asTypedValue();
if (typedValue != null) {
TypedValue typedValue2 =
typedValue.discoverField(typeSystem, fieldName);
recordType = (RecordLikeType) typedValue2.typeKey().toType(typeSystem);
}
int slot = 0;
for (Map.Entry<String, Type> pair : recordType.argNameTypes().entrySet()) {
if (pair.getKey().equals(fieldName)) {
Expand All @@ -179,8 +186,9 @@ public Core.RecordSelector recordSelector(TypeSystem typeSystem,
}
++slot;
}
throw new IllegalArgumentException("no field '" + fieldName + "' in type '"
+ recordType + "'");

throw new IllegalArgumentException("no field '" + fieldName
+ "' in type '" + recordType + "'");
}

public Core.RecordSelector recordSelector(TypeSystem typeSystem,
Expand Down
1 change: 1 addition & 0 deletions src/main/java/net/hydromatic/morel/ast/Op.java
@@ -82,6 +82,7 @@ public enum Op {
   // types
   TY_VAR(true),
   RECORD_TYPE(true),
+  PROGRESSIVE_RECORD_TYPE(true),
   DATA_TYPE(" ", 8),
   /** Used internally, as the 'type' of a type constructor that does not contain
    * data. */