
This is a practical guide, so let's immediately start using the EVF in its simplest form: the Row Set framework used to create, read and compare record batches in unit tests. The row set framework makes a number of simplifying assumptions to make tests as simple as possible:

  • When creating a batch, we define the schema up front.
  • No projection, null columns, or type conversion is needed.
  • Batches are "small"; the framework does not enforce memory limits.

While the row set framework is specific to tests, the column accessor mechanism is used throughout the EVF. The easiest way to understand the column accessors is to start with the row set framework. (The EVF uses the term "accessor" to mean either a vector reader or a vector writer.)

In this example, we refer to this documentation and the ExampleTest class.

This page focuses on the overall process. See the next page for the details for each vector type.

The examples here also appear in the ExampleOperatorTest file.

Create a Test File

Since the row set framework is typically used in a test, our example will work in that context. Find a handy spot within Drill to define your temporary test file. (Unfortunately, Drill is not designed to allow you to create these files as a separate project outside of Drill.) You can create the file in the test package if you like.

public class ExampleTest extends SubOperatorTest {

  @Test
  public void rowSetExample() {
  }
}

The SubOperatorTest base class takes care of configuring and launching an in-memory version of Drill so that you can focus on the specific test case at hand.

Define Your Schema

Next, define your schema using the SchemaBuilder class. Careful, there are two such classes in Drill: you want the one in the `metadata` package. Let's define a simple schema with two columns: a non-nullable `Int` and a nullable `Varchar`.

import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.common.types.TypeProtos.MinorType;
...

  @Test
  public void rowSetExample() {
    final TupleMetadata schema = new SchemaBuilder()
      .add("id", MinorType.INT)
      .addNullable("name", MinorType.VARCHAR)
      .buildSchema();
  }

Some things to note:

  • The schema builder provides a fluent notation which is very handy in tests. Production code is never this easy since the schema is not known at compile time.
  • The add() methods add a column with no options; such a column is non-nullable (AKA Required).
  • The addNullable() methods add a nullable (Optional) column.
  • Drill defines two classes called `MinorType`. Use the import shown above to get the correct one.
  • The result of buildSchema() is a TupleMetadata (the concrete class is TupleSchema).
  • The schema builder can also build a BatchSchema by calling build(). BatchSchema is used by the VectorContainer class, but TupleMetadata holds a more complete set of metadata and can properly define extended types that BatchSchema cannot. (See the sketch after this list.)
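
To make the distinction concrete, here is a minimal sketch of the two build methods, using the same columns as above:

    // Rich metadata: the form used throughout the EVF.
    TupleMetadata tupleSchema = new SchemaBuilder()
        .add("id", MinorType.INT)
        .addNullable("name", MinorType.VARCHAR)
        .buildSchema();

    // Older form, used by the VectorContainer class.
    BatchSchema batchSchema = new SchemaBuilder()
        .add("id", MinorType.INT)
        .addNullable("name", MinorType.VARCHAR)
        .build();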

The TupleMetadata class describes both a record and (as we'll see later) a Drill "Map" (really a struct.) Each tuple is made up of columns, defined by the ColumnMetadata interface. ColumnMetadata provides a rich set of information about each column. Combined, the metadata classes drive much of the EVF as we'll see.

Now that you are familiar with the schema classes, we'll leave it as an exercise for the reader to explore them and learn all that they have to offer.
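
As a starting point for that exploration, here is a small sketch (using JUnit's static asserts); it assumes the metadata() lookup methods on TupleMetadata and the name(), type(), and isNullable() accessors on ColumnMetadata:

    // Look up a column's metadata by name.
    ColumnMetadata idCol = schema.metadata("id");
    assertEquals("id", idCol.name());
    assertEquals(MinorType.INT, idCol.type());
    assertFalse(idCol.isNullable());

    // Or look it up by position within the tuple.
    ColumnMetadata nameCol = schema.metadata(1);
    assertTrue(nameCol.isNullable());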

Create a Record Batch the Easy Way

The next step is to create a record batch using the schema. In tests, the easy way to do this is with the RowSetBuilder class:

      final RowSet rowSet = new RowSetBuilder(fixture.allocator(), schema)
        .addRow(1, "kiwi")
        .addRow(2, "watermelon")
        .build();

Things to notice here:

  • The RowSet interface provides an easy-to-use wrapper around the actual record batch.
  • For more advanced tests, you may need to use one of the subclasses of RowSet.
  • The record batch itself is available via the container() method.
  • The RowSetBuilder class provides a fluent way to create, populate, and return a row set.
  • The addRow() method takes a list of Java objects. The code uses the type of the Java object to figure out which set method to call. (We'll discuss those methods shortly.)
  • If you want to create a row with a single column, use addSingleCol() instead; otherwise, Java sometimes gets confused about the type of the single argument. (See the sketch below.)
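
For example, here is a sketch of addSingleCol() using a hypothetical one-column schema built the same way as above:

    TupleMetadata nameSchema = new SchemaBuilder()
        .addNullable("name", MinorType.VARCHAR)
        .buildSchema();

    RowSet oneColSet = new RowSetBuilder(fixture.allocator(), nameSchema)
        .addSingleCol("kiwi")
        .addSingleCol(null)  // allowed because the column is nullable
        .build();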

Create a Record Batch using Column Writers

The above technique is often all you need when writing tests to verify some operation. (You will write such unit tests for your present work, right? I thought so.)

If you are creating an operator that works in production code, you won't know the data at compile time. Instead, you must work with each column one-by-one using the column writer classes.

    DirectRowSet rs = DirectRowSet.fromSchema(fixture.allocator(), schema);
    RowSetWriter writer = rs.writer();
    writer.scalar("id").setInt(1);
    writer.scalar("name").setString("kiwi");
    writer.save();
    ...
    final SingleRowSet rowSet = writer.done();

Some things to note:

  • Here we saw a number of row set subclasses. DirectRowSet holds a writeable row set.
  • SingleRowSet holds a readable row set which may or may not have a single-batch (SV2) selection vector. In our case, it has no selection vector.
  • The RowSetWriter is a kind of TupleWriter that provides extra methods to work with entire rows, such as the save() method that says that the row is complete. (TupleWriter is also used to write to Map vectors.)
  • The row set writer is always ready to write a row, so there is no "start row" method here. (Note that there is such a method in the result set loader as we'll see later.)
  • If you omit the call to save(), the row set writer will happily overwrite any existing value in the current row. This is done deliberately to handle advanced use cases.
  • The scalar(name) method looks up a ColumnWriter by name.
  • The returned column writer has many different set methods. We use setInt() and setString() here.
  • The setString() method is a convenience method: it converts a Java string into the byte array required by the vector. If you already have a byte array, you can call the setBytes() method instead. (See the sketch after this list.)
  • Every scalar writer supports all the set methods. This avoids the need for casting to the correct writer type. Also, as we'll see later, it allows automatic type conversions when configured to do so.
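
To illustrate the last two points, here is a hedged sketch that writes the same Varchar value both ways. It assumes a setBytes(byte[], int) signature and a java.nio.charset.StandardCharsets import:

    // Convenience form: the writer converts the Java String to bytes.
    writer.scalar("name").setString("kiwi");

    // Lower-level form for when you already hold a byte array.
    // (Overwriting the same slot before save() is fine, as noted above.)
    byte[] bytes = "kiwi".getBytes(StandardCharsets.UTF_8);
    writer.scalar("name").setBytes(bytes, bytes.length);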

Caching Column Writers

The above used the "get by name" methods to keep the code simple. In production code, you'll want to optimize. One option is to reference columns by position (as defined by the schema):

    writer.scalar(0).setInt(1);
    writer.scalar(1).setString("kiwi");

Or, you can cache the column writers:

    RowSetWriter writer = rs.writer();
    ScalarWriter idWriter = writer.scalar("id");
    ScalarWriter nameWriter = writer.scalar("name");
    idWriter.setInt(1);
    nameWriter.setString("kiwi");
    writer.save();
    ...

Note that the set() methods themselves are heavily optimized: they do the absolute minimum work to write your value into the underlying value vector. This consists of a couple of checks (for empty slots and to detect when the vector is full). Using the column writers has been shown to be at least as efficient as using the value vector Mutator classes (and, for non-nullable and array values, much faster.)

Reading a Row Set

Now that you have a record batch, the next step is to do something with it. The simplest thing you can do (in a test) is to print the record batch so you can see what you have:

    rowSet.print();

Output:

#: id, name
0: 1, "kiwi"
1: 2, "watermelon"

Once you are done with the row set, you must clear it to release the memory held by value vectors:

    rowSet.clear();

You can also compare row sets. Suppose we want to verify that the two forms of writing shown above produce the same record batch:

   RowSet expected = ...;  // Build using RowSetBuilder
   RowSet actual = ...;    // Build using column writers

   RowSetUtilities.verify(expected, actual);

The above takes the first argument as the "expected" value and the second as the "actual", then compares the schemas and values. This is how we use the row set framework to verify the result of some operation on a record batch (including the result of an entire query.) No need to clear the row sets: the verify() function clears both batches for us.

If we want to work with individual values, we can use the column readers which work much like the column writers. Let's assume we've created a print() method that will print a value.

   final RowSetReader reader = rowSet.reader();
   while (reader.next()) {
     print(reader.scalar("id").getInt());
     print(reader.scalar("name").getString());
   }

Notes:

  • The RowSetReader is a specialized TupleReader that iterates over records in a batch by calling the next() method.
  • The reader starts positioned before the first record, so you must call next() to move to the first record.
  • Access to column readers works very much like the writer example. You can cache the column readers for performance, or access them by column index. (See the sketch after this list.)
  • For reading, you call get() methods of the type appropriate for your column.
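
To make the last two notes concrete, here is a sketch that caches the column readers; it assumes the ScalarReader interface offers isNull() alongside the get methods:

    RowSetReader reader = rowSet.reader();
    ScalarReader idReader = reader.scalar("id");
    ScalarReader nameReader = reader.scalar("name");
    while (reader.next()) {
      int id = idReader.getInt();
      String name = nameReader.isNull() ? null : nameReader.getString();
      // ... use id and name ...
    }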

Advanced Row Sets

When reading values, Drill can work with a single batch or multiple batches using one of three addressing modes:

  • Direct: Read values from the record batch in order.
  • Single-batch indirection: Uses an "SV2" selection vector to reorder records within a batch, such as the result of sorting the batch.
  • Multiple-batch indirection: Uses an "SV4" selection vector to reorder records across multiple batches, again perhaps as the result of a multi-batch sort.

Here is how single-batch indirection works (courtesy of the Drill documentation):

[Figure: selection vector indirection diagram]

We've not discussed the use of selection vectors in more detail since they are not used in the scan operator or in record readers.
