Value Vector Implementation
Value vectors are the heart and soul of Drill: readers build them, operators transform them, network operations ship them around, and clients consume them. The core concepts of value vectors appear here. The implementation, however, is quite complex.
The term "value vector" is overloaded: it can mean different things depending on which layer of the implementation stack is of interest. The following is a typical stack:
- Optional, Repeated or Variable vector (see below)
- Required vector (see below)
- DrillBuf (a kind of byte buffer)
- Array of bytes, typically allocated from direct memory
A value vector Java class maps to/from the underlying byte buffer. The Optional, Repeated and Variable vectors provide additional levels of abstraction on top of the byte buffers, as we'll see below.
Value vectors work with Java primitive types (bits, ints and so on). In Java, each primitive requires distinct code. That is, a "get" method returns either a short or an int; a single method can't do both. Many applications provide uniform access by converting primitives to objects ("boxing"), but doing so creates garbage. In Drill's case, high-speed processing of vectors would produce vast amounts of garbage, resulting in large GC overhead. Drill's solution is to provide type-specific vectors with the corresponding primitive get/set methods.
Each primitive also has a distinct width, which influences how the primitive elements map onto a byte vector. So, each vector type encodes its width information into its implementation methods.
Since Drill provides around 38 types ("minor types" in Drill terminology), Drill provides 38 different value vector types.
Further, Drill provides three forms of cardinality ("mode" in Drill terminology): Required (non-nullable), Optional (nullable) and Repeated (array).
The combination of (minor type, mode) gives rise to a "major type" in Drill, of which about 108 exist.
Drill provides a separate value vector class for each major type. Because it is not practical to maintain this many classes (with virtually identical code) by hand, Drill generates the classes. You can find them in the vector project in org.apache.drill.exec.vector.
Since each major type has its own vector class, code that uses vectors must also be written specific to each major type. This means writing 100+ different implementations. Because this would be impossible to do by hand, all code anywhere in Drill that works with vectors is also generated. Thus, when working with vectors, one must think in terms of meta-programming: writing code that generates code that works with vectors.
Every value vector consists of:
- A "payload" vector with the actual data values.
- An optional set of offset vectors that map into the payload vector. (See below.)
- An accessor class to retrieve values from the vector.
- A mutator class to write values into the vector.
- A field reader that ((more info needed...))
Interestingly, value vectors do not know how many values they contain; you cannot query a value vector to find out the row count. Presumably, this count is maintained elsewhere. ((Where?))
Value vector classes are generated in the vector project using a Maven plug-in (Mojo) that uses fmpp to do the actual code generation. Fmpp in turn uses FreeMarker as the template engine. Fmpp reads a set of templates and processes them to an output using a set of data. For vectors, the templates reside in the vector project in src/main/codegen/templates. The data is in src/main/codegen/data/ValueVectorTypes.tdd, a JSON file. The data defines the set of major types (defined by length) and minor types. (Note that the terms "major" and "minor" here are not the same as used above or as defined by the Drill MajorType and MinorType classes...)
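The idea of generating one class per type from a template plus a table of type data can be sketched in plain Java. This is only an illustration of the technique: it uses String.format as a stand-in for FreeMarker, and TypeSpec, VectorGeneratorSketch and the three sample entries are hypothetical names, not the contents of ValueVectorTypes.tdd.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one entry of the type-description data.
final class TypeSpec {
  final String name;      // class name prefix, e.g. "Int" gives IntVector
  final String javaType;  // primitive type returned by get()
  final String getter;    // buffer read method to call, e.g. "getInt"
  final int width;        // bytes per value

  TypeSpec(String name, String javaType, String getter, int width) {
    this.name = name;
    this.javaType = javaType;
    this.getter = getter;
    this.width = width;
  }
}

final class VectorGeneratorSketch {
  // A format string plays the role of the template. The generated text is a
  // fragment: the "data" field it refers to is not declared here.
  private static final String TEMPLATE =
      "public final class %1$sVector {\n"
    + "  public %2$s get(int index) { return data.%3$s(index * %4$d); }\n"
    + "}\n";

  // Expands the template for a single type.
  static String generate(TypeSpec t) {
    return String.format(TEMPLATE, t.name, t.javaType, t.getter, t.width);
  }

  // Sample data, assumed for illustration only.
  static List<TypeSpec> sampleTypes() {
    return Arrays.asList(
        new TypeSpec("SmallInt", "short", "getShort", 2),
        new TypeSpec("Int", "int", "getInt", 4),
        new TypeSpec("BigInt", "long", "getLong", 8));
  }
}
```

Calling generate() for each entry of sampleTypes() yields three near-identical classes that differ only in return type, read method and width, which is the kind of repetition the templates exist to avoid.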
We start with vectors for the required mode since they are the simplest. Consider the IntVector class which holds signed, 32-bit integers. Required-mode vectors work directly with the byte buffer that backs the vector: read and write operations are translated into operations on the byte buffer. (All other vectors are coded in terms of the required-mode vectors.)
Client code does not work with vectors directly. Instead, clients work via (type-specific) accessor classes:
class Accessor extends BaseDataValueVector.BaseAccessor {
public int getValueCount() ...
public boolean isNull(int index) ...
public int get(int index) ...
...
}
The getValueCount() and isNull() methods are inherited (and hence generic to all vectors). But the get() method is type specific. Both isNull() and get() take an index, which is the record index relative to the start of the record (vector) batch. In this case, since the vector stores only required values, isNull() always returns false. Internally, the get() method converts the record index to a byte buffer index. Since ints are a constant 4 bytes, the conversion is simple:
public int get(int index) { return data.getInt(index * 4); }
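To make the index arithmetic concrete, here is a minimal sketch of a required-mode int vector. It assumes java.nio.ByteBuffer as a stand-in for DrillBuf, and RequiredIntVectorSketch is a hypothetical class, not Drill's IntVector.

```java
import java.nio.ByteBuffer;

// Toy stand-in for a required-mode int vector; not Drill's IntVector.
final class RequiredIntVectorSketch {
  private static final int WIDTH = 4;  // bytes per int value
  private final ByteBuffer data;       // stand-in for DrillBuf (direct memory)

  RequiredIntVectorSketch(int valueCapacity) {
    // Direct buffers start out zero-filled.
    this.data = ByteBuffer.allocateDirect(valueCapacity * WIDTH);
  }

  // Record index -> byte offset -> primitive value; no boxing involved.
  int get(int index) {
    return data.getInt(index * WIDTH);
  }

  // Writes the value at the byte offset that corresponds to the record index.
  void set(int index, int value) {
    data.putInt(index * WIDTH, value);
  }

  // Required mode: a value is never null.
  boolean isNull(int index) {
    return false;
  }
}
```

Because the width is a compile-time constant baked into the class, a vector for a different primitive needs its own class with its own width and its own get/set signatures.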
For variable-length vectors, such as VarCharVector, the vector has a level of indirection through an offset vector.
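The indirection can be sketched as follows. The layout is an assumption made for illustration: the offset buffer holds one more entry than there are values, and value i occupies the bytes from offset i up to (but not including) offset i + 1. VarCharVectorSketch and its add() method are hypothetical, and plain heap ByteBuffers stand in for DrillBuf.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy variable-width vector: an offset buffer indexes into a data buffer.
final class VarCharVectorSketch {
  private static final int OFFSET_WIDTH = 4;  // each offset is a 4-byte int
  private final ByteBuffer offsets;           // valueCapacity + 1 entries
  private final ByteBuffer data;              // concatenated value bytes
  private int valueCount;

  VarCharVectorSketch(int valueCapacity, int byteCapacity) {
    offsets = ByteBuffer.allocate((valueCapacity + 1) * OFFSET_WIDTH);
    data = ByteBuffer.allocate(byteCapacity);
  }

  // Appends the next value; values must be written in record order.
  void add(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    int start = offsets.getInt(valueCount * OFFSET_WIDTH);
    for (int i = 0; i < bytes.length; i++) {
      data.put(start + i, bytes[i]);
    }
    // The end of this value is also the start of the next one.
    offsets.putInt((valueCount + 1) * OFFSET_WIDTH, start + bytes.length);
    valueCount++;
  }

  // Two offset lookups bound the value's bytes in the data buffer.
  String get(int index) {
    int start = offsets.getInt(index * OFFSET_WIDTH);
    int end = offsets.getInt((index + 1) * OFFSET_WIDTH);
    byte[] bytes = new byte[end - start];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = data.get(start + i);
    }
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

With this layout an empty value costs no data bytes: its start and end offsets are simply equal.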
To work with an IntVector, say, (generated) client code:
- Retrieves the vector generically from a vector bundle. (The vector corresponds to a column.)
- Casts the generic ValueVector to the specific type, in this case, IntVector.
- Calls IntVector.getAccessor() to get the accessor.
- Calls isNull(index) to check if the value is null (if the code is common for Required and Optional modes).
- Calls get(index) to retrieve the int value (assuming the code wants the value in primitive form).
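The steps above can be sketched with toy stand-ins. ToyValueVector, ToyIntVector and ClientSketch are hypothetical; the "bundle" is assumed to be a simple map from column name to vector, and an int array replaces the byte buffer so the example stays short.

```java
import java.util.Map;

// Hypothetical generic vector type, standing in for ValueVector.
interface ToyValueVector { }

// Hypothetical required-mode int vector with a type-specific accessor.
final class ToyIntVector implements ToyValueVector {
  private final int[] values;

  ToyIntVector(int... values) {
    this.values = values.clone();
  }

  final class Accessor {
    int getValueCount() { return values.length; }
    boolean isNull(int index) { return false; }   // required mode
    int get(int index) { return values[index]; }  // primitive, no boxing
  }

  Accessor getAccessor() {
    return new Accessor();
  }
}

final class ClientSketch {
  // Sums one int column, following the steps listed above.
  static long sumColumn(Map<String, ToyValueVector> batch, String column) {
    ToyValueVector generic = batch.get(column);             // retrieve generically
    ToyIntVector vector = (ToyIntVector) generic;           // cast to the specific type
    ToyIntVector.Accessor accessor = vector.getAccessor();  // get the accessor
    long sum = 0;
    for (int i = 0; i < accessor.getValueCount(); i++) {
      if (!accessor.isNull(i)) {                            // null check
        sum += accessor.get(i);                             // primitive read
      }
    }
    return sum;
  }
}
```

The cast is what ties this code to one major type: a column of a different type or mode needs a different cast and a different accessor class, hence the generated code.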
Accessors provide other ways to retrieve values:
public Integer getObject(int index)...
public int getPrimitiveObject(int index)...
public void get(int index, IntHolder holder)
public void get(int index, NullableIntHolder holder)
The other forms are used where convenient. For example, the "Holder" forms are used when calling UDFs ((verify)).
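The holder forms can be illustrated with a small sketch. The holder classes below are hypothetical stand-ins; their fields (value, plus isSet for the nullable form) are assumptions made for this example. The point is that the caller supplies a reusable object, so reading a value allocates nothing, whereas getObject() boxes.

```java
// Hypothetical holders: plain mutable structs that the caller reuses.
final class ToyIntHolder {
  int value;
}

final class ToyNullableIntHolder {
  int isSet;  // 1 when a value is present, 0 when null
  int value;
}

// Toy accessor over a required-mode int column.
final class HolderAccessSketch {
  private final int[] values;

  HolderAccessSketch(int... values) {
    this.values = values.clone();
  }

  // Fills a caller-provided holder; no allocation per call.
  void get(int index, ToyIntHolder holder) {
    holder.value = values[index];
  }

  // Same read into a nullable holder; a required value is always set.
  void get(int index, ToyNullableIntHolder holder) {
    holder.isSet = 1;
    holder.value = values[index];
  }

  // Object form: convenient, but boxes the primitive.
  Integer getObject(int index) {
    return values[index];
  }
}
```

A loop can allocate one holder up front and pass it to get() for every row.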
Vectors also provide a Mutator to write the vector. Drill vector semantics are write-once: once a vector position is written to, that position becomes (logically) immutable.
public class Mutator extends BaseDataValueVector.BaseMutator {
public void set(int index, int value) ...
public void setSafe(int index, int value) ...
public void setSafe(int index, IntHolder holder) ...
public void setSafe(int index, NullableIntHolder holder) ...
}
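The listing does not say how set() and setSafe() differ. The sketch below assumes that set() writes without checking capacity, while setSafe() first grows the backing buffer until the index fits. GrowableIntVectorSketch, its reAlloc() method and the doubling policy are illustrative assumptions, with a heap ByteBuffer standing in for DrillBuf.

```java
import java.nio.ByteBuffer;

// Toy int vector that contrasts an unchecked write with a growing write.
final class GrowableIntVectorSketch {
  private static final int WIDTH = 4;  // bytes per int value
  private ByteBuffer data;

  GrowableIntVectorSketch(int initialValueCapacity) {
    data = ByteBuffer.allocate(initialValueCapacity * WIDTH);
  }

  int getValueCapacity() {
    return data.capacity() / WIDTH;
  }

  // Unchecked form: the caller must know the index is within capacity.
  // ByteBuffer throws IndexOutOfBoundsException otherwise.
  void set(int index, int value) {
    data.putInt(index * WIDTH, value);
  }

  // "Safe" form (assumed semantics): grow until the index fits, then write.
  void setSafe(int index, int value) {
    while (index >= getValueCapacity()) {
      reAlloc();
    }
    set(index, value);
  }

  int get(int index) {
    return data.getInt(index * WIDTH);
  }

  // Doubles the buffer and copies the existing bytes across.
  private void reAlloc() {
    ByteBuffer bigger = ByteBuffer.allocate(Math.max(WIDTH, data.capacity() * 2));
    for (int i = 0; i < data.capacity(); i++) {
      bigger.put(i, data.get(i));
    }
    data = bigger;
  }
}
```

Under this assumption, set() suits code that has already sized the vector, and setSafe() suits code that does not know the row count in advance.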
Optional mode vectors represent values that can be null. As a result, the vector introduces a parallel bit vector to track whether a value is null. The space for the value is allocated the same whether the value is null or not. But, for null values, the storage location for that row is left unused.
Let's consider the simplest example: NullableIntVector, which includes two vectors:
private final UInt1Vector bits = new UInt1Vector(bitsField, allocator);
private final IntVector values = new IntVector(field, allocator);
Note that the optional-mode vector does not directly write to a byte buffer. Instead, there is a level of indirection: both the null flags and data values are stored in separate (required-mode) vectors. This means all read and write operations delegate to the underlying required-mode vectors.
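A minimal sketch of this delegation follows. NullableIntVectorSketch is hypothetical: Java arrays stand in for the two underlying required-mode vectors, a flag value of 1 is assumed to mean "set", and throwing on a null read is an assumption made for this example.

```java
// Toy optional-mode vector built from two parallel "required" stores.
final class NullableIntVectorSketch {
  private final byte[] bits;   // stand-in for the UInt1Vector: 1 = set, 0 = null
  private final int[] values;  // stand-in for the IntVector

  NullableIntVectorSketch(int valueCapacity) {
    bits = new byte[valueCapacity];    // all positions start out null
    values = new int[valueCapacity];   // space is allocated even for nulls
  }

  // Writing a value touches both underlying stores.
  void set(int index, int value) {
    bits[index] = 1;
    values[index] = value;
  }

  // Marking a null only clears the flag; the value slot stays unused.
  void setNull(int index) {
    bits[index] = 0;
  }

  boolean isNull(int index) {
    return bits[index] == 0;
  }

  // Reading checks the flag first, then delegates to the value store.
  int get(int index) {
    if (isNull(index)) {
      throw new IllegalStateException("Value at index " + index + " is null");
    }
    return values[index];
  }
}
```

Note the cost model: a null row still occupies a full value slot plus a flag byte.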
As with the required-mode vector, optional-mode vectors provide Accessor and Mutator classes. In fact, the signatures of the Accessor methods are identical for IntVector (required-mode) and NullableIntVector (optional-mode). Unfortunately, the two Accessor classes are separate Java classes that do not share a common ancestor, so client code must be generated specifically for each case. The same is true of the Mutator class.
Repeated-mode vectors make use of two required-mode vectors: one stores the (combined) list of values, the other stores the offsets into that vector for each record. Consider the RepeatedIntVector class:
protected final UInt4Vector offsets;
private IntVector values;
As in other cases, access to the vector is through an Accessor. Since this is an array, the Accessor has additional array-specific methods:
public final class Accessor extends BaseRepeatedValueVector.BaseRepeatedAccessor {
public int getInnerValueCountAt(int index) ...
public boolean isNull(int index) ...
public int get(int index, int positionIndex) ...
}
Here, getInnerValueCountAt() gets the number of array entries for a record, with the record indexed as described earlier. isNull() always returns false since Drill semantics don't allow a null array (only an array with no values). This is important: although the Repeated mode implies a cardinality of (0..*), the array itself is required, though it may be empty.
Finally, get(index, positionIndex) takes both a record index and an index into the array for that record.
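The read path can be sketched as below. The layout is an assumption for illustration: the offsets store has one more entry than there are records, and record i owns the values from offset i up to (but not including) offset i + 1. RepeatedIntVectorSketch is hypothetical and uses int arrays in place of the two required-mode vectors.

```java
// Toy repeated-mode reader: offsets index into a combined value list.
final class RepeatedIntVectorSketch {
  private final int[] offsets;  // stand-in for the UInt4Vector; recordCount + 1 entries
  private final int[] values;   // stand-in for the IntVector; all arrays concatenated

  RepeatedIntVectorSketch(int[] offsets, int[] values) {
    this.offsets = offsets.clone();
    this.values = values.clone();
  }

  // Array length for a record is the gap between adjacent offsets.
  int getInnerValueCountAt(int index) {
    return offsets[index + 1] - offsets[index];
  }

  // An array is never null, only possibly empty.
  boolean isNull(int index) {
    return false;
  }

  // Record index selects the array; positionIndex selects the element.
  int get(int index, int positionIndex) {
    return values[offsets[index] + positionIndex];
  }
}
```

For example, the three records [10, 20], [] and [30] would be stored as offsets {0, 2, 2, 3} and values {10, 20, 30}; the empty array shows up as two equal adjacent offsets.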
The Mutator is similar:
public final class Mutator extends BaseRepeatedValueVector.BaseRepeatedMutator implements RepeatedMutator {
public void add(int index, int value) ...
public void addSafe(int index, int srcValue) ...
}
Since Drill vectors are write-once, the way the client writes an array is to write each value one at a time by calling add to extend the array with the new value.
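A sketch of this append-style write follows. It assumes that records are written in order and that a record's array is opened by copying the previous end offset before any add() call; the startNewValue() name and the int-array storage are assumptions of this example, and RepeatedIntWriterSketch is hypothetical.

```java
// Toy repeated-mode writer: each add() extends the current record's array.
final class RepeatedIntWriterSketch {
  private final int[] offsets;  // recordCapacity + 1 entries
  private final int[] values;   // combined value list

  RepeatedIntWriterSketch(int recordCapacity, int valueCapacity) {
    offsets = new int[recordCapacity + 1];
    values = new int[valueCapacity];
  }

  // Opens record "index" as an empty array that starts where the previous
  // record ended. Records must be started in increasing index order.
  void startNewValue(int index) {
    offsets[index + 1] = offsets[index];
  }

  // Appends one value to record "index" and advances its end offset.
  void add(int index, int value) {
    int next = offsets[index + 1];
    values[next] = value;
    offsets[index + 1] = next + 1;
  }

  // Read-side helpers, included only to check the result.
  int getInnerValueCountAt(int index) {
    return offsets[index + 1] - offsets[index];
  }

  int get(int index, int positionIndex) {
    return values[offsets[index] + positionIndex];
  }
}
```

Each value slot is written exactly once, which is consistent with the write-once rule: the array grows by moving its end offset, never by rewriting earlier positions.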
Both the Accessor and Mutator provide the usual additional methods that work with objects and holders. And, as always, the Accessor and Mutator are type-specific.