Skip to content

Working with VARCHAR Data in UDFs

Paul Rogers edited this page Apr 29, 2018 · 11 revisions

As described in the previous section, most Drill types provide a holder that contains all the data you need. But, the variable-width types (VARCHAR and VARBINARY) are more complex as they provide direct access to the underlying data stored in direct memory. This page explains how to work effectively with these types.

One thing to keep in mind: direct memory is exactly that: memory outside of the Java heap. Java can provide no protection. If your code writes to the wrong memory location, you can crash the Drillbit process with no logging to help you figure out your error.

VARCHAR and UTF-8

Drill stores VARCHAR as bytes encoded in UTF-8. To properly handle such data, you must write code that is aware of the UTF-8 encoding. Doing so is quite complex, so one choice is to convert the UTF-8 data to Java String, then convert the result back to UTF-8. This conversion does, however, consume time. So, Drill often takes a shortcut: it works with the single-byte subset of UTF-8 (also known as ASCII.) While this can produce incorrect results with non-ASCII data, it seems to work well for many cases. (Yes, this does seem more like a bug than a feature...)

So, before you begin writing code, consider if your input data will always be ASCII. If so, you can work with the UTF-8 data a buffer of ASCII characters. But, if your data includes non-ASCII characters (or you wish to contribute your code to the Apache Drill project), you must either handled multi-byte characters, or do the conversion to String.

Structure of a VARCHAR Vector

The VarCharHolder provides a view into the underlying VarCharVector. Before we look at the holder, let's understand the vector.

The VarCharVector is comprised of two parts, as described in the Drill documentation:

  • A data vector which is a large buffer that contains all the column values concatenated together.
  • An offset vector which provides the start and end position of the value for each row.

So, to work with the VARCHAR value for a specific row, we need:

  • The buffer that holds all the values
  • The start position of the current value
  • The end position of the current value (from which we can compute the length as start - end)

VARCHAR Holder

Unsurprisingly, the above is exactly what the VarCharHolder provides:

public final class VarCharHolder implements ValueHolder{
    public int start;
    public int end;
    public DrillBuf buffer;

DrillBuf

The first two fields make sense, but what is a DrillBuf? Here is where Drill assumes that UDF writers are already familiar with Drill internals. The DrillBuf is an interface to a direct memory buffer. If you have downloaded Drill sources, and have set up a development environment, you can take a look at the DrillBuf source code since Drill provides no Javadoc for any of its APIs.

For our purposes, we will need the following methods when working with parameters:

  public byte getByte(int index);
  public ByteBuf getBytes(int index, byte[] dst);
  public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length);

The first methods allows us to work byte-by-byte. The second allows us to work with the entire value, such as when we wish to convert to strings (the length of the dst buffer says how many bytes to copy.) The last version is general purpose for when we need slightly more control.

Example of a VARCHAR Parameter

Let's write two functions: one which counts letters using the single-byte subset of UTF-8 (AKA ASCII), the other which uses Unicode. These illustrate the two ways to work with VARCHAR data, and the pitfalls of working only with ASCII.

ASCII Version

Here is the ASCII (byte-oriented) version:

public class CountLettersFunctions {

  @FunctionTemplate(
      name = "countLettersAscii",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class CountLettersAsciiFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output IntHolder output;

    @Override
    public void setup() { }

    @Override
    public void eval() {
      int len = input.end - input.start;
      output.value = 0;
      for (int i = 0; i < len; i++) {
        int c = (input.buffer.getByte(input.start + i) & 0xFF);
        if (Character.isAlphabetic(c)) {
          output.value++;
        }
      }
    }
  }
}

The code above AND's (&& operator) the return value of getByte() with 0xFF. Why? getByte() returns a signed 1-byte value. For values with the high-bit set, the value is sign-extended to the int field, causing negative numbers. The AND ensures that the returned values are in the range 0-255.

Note that we are using a handy trick: defining more than one function in a single Java file. Drill uses this technique frequently.

Let's try some examples. Let's create letterInput.json:

{ word: "aA 12 zZ!" }
{ word: "Αθήνα?" }
{ word: "-Mосква-" }
{ word: "海." }

Then run a query and look at the results:

SELECT word, countLettersAscii(word) FROM `cp`.`letterInput.json`
aA 12 zZ!,4
Αθήνα?,5
-Mосква-,7
海.,2

The first line is fine: 4 letters. The Greek is good (second line). But, the Russian (third line) is wrong: 7 characters instead of 6. And the Chinese (last line) reports 2 letters instead of 1. What is wrong?

The UTF-8 encoding of some of the characters is multi-byte and is overlapping ASCII letter range which is all our function can handle. (Another error that could occur, not shown here, is that a character is not counted if it falls outside the ASCII letter range.)

Clearly, in this case, we have to become UTF-8 experts. Or, we can pay the cost to convert UTF-8 to a Java String. Lets do that next.

Converting VARCHAR Data to/from Java Strings

To convert an array of UTF-8 bytes to a string, we write:

  byte bytes[] = ...
  String str = new String(bytes, com.google.common.base.Charsets.UTF_8);

You must provide the character set argument. Otherwise, Java will use the default character set for whatever machine is running Drill, which can produce inconsistent results.

Similarly, to convert from a String to an array of bytes:

   byte bytes[] = String.getBytes(com.google.common.base.Charsets.UTF_8);

String Helper Functions

Drill provides a number of helper functions to work with VARCHAR columns, defined in the org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers class. (Remember to use the fully-qualified name when referencing these methods.)

public static String getStringFromVarCharHolder(VarCharHolder varCharHolder);
public static String getStringFromVarCharHolder(NullableVarCharHolder varCharHolder);
public static String toStringFromUTF8(int start, int end, DrillBuf buffer);

Note that the functions only convert from VARCHAR to String. We have to roll our own for the reverse conversion.

Unicode Version

Here is the Unicode version (in the same Java file) in which we let Java convert from UTF-8 to UTF-16 (used by the Java String class):

  @FunctionTemplate(
      name = "countLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class CountLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output IntHolder output;

    @Override
    public void setup() { }

    @Override
    public void eval() {
      String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);
      output.value = 0;
      for (int i = 0; i < value.length(); i++) {
        if (Character.isAlphabetic(value.charAt(i))) {
          output.value++;
        }
      }
    }
  }

Note that in the loop over the characters, we use the length of the String, not the buffer, since, in the UTF-8 multi-byte encoding, the numbers are different.

Let's try the query again:

aA 12 zZ!,4
Αθήνα?,5
-Mосква-,6
海.,1

Bingo! As we can see, we must trade off speed for accuracy. We must pay the cost to convert to Unicode in order to perform operations that depend on the Unicode encoding.

We've also seen how to work with the DrillBuf provided in the VarCharHolder class.

Returning a VARCHAR

Returning a VARCHAR is a bit more complex. When working with parameters, Drill has already allocated the space for the incoming values, we simply work with that existing buffer. But for the output, we must create a buffer. For obscure reasons, Drill insists we put the return value in direct memory.

We don't write the value directly to the output vector; instead we write it to a temporary buffer which is copied to the output. (Because of this copy, it is not clear that there is much benefit creating a direct memory buffer. Either we should write to the output vector, or use a heap buffer for a temporary value. Sigh... We just have to work with the code the way it exists, not the way we'd like it to be.)

Declare the Output Type

Suppose we want to create a function that returns the set of letter-like characters. We declare the output type as VARCHAR as follows:

  @FunctionTemplate(
      name = "extractLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output VarCharHolder output;

The @Inject Annotation

For reasons known only to Drill, we must tell Drill twice that we are returning a VARCHAR: once in the @Output annotation and a second time using the @Inject annotation. Then, we must glue the two together in our code. (Since there is only one way to do this correctly, one wonders why Drill doesn't just do it for us...)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    ...
    @Inject DrillBuf outputBuf;

Allocating the Buffer

The injected buffer is initialized with a capacity of 256 bytes. If we need more, we must replace it with a larger buffer. The method for doing so is:

  public DrillBuf reallocIfNeeded(final int size);

The buffer is reallocated to the new size. (Actually, it is allocated to the new size, rounded up to the next power of 2.) The return value is a new buffer that we use to replace the old one. Because of the way Drill works, both our old and new buffers are allocated until the query exits.

If we are curious about the buffer size, we can call:

  public int capacity();

The buffer is injected once, then reused for each function call. The buffer is a wrapper around direct memory. Our code must allocate the required memory (as shown below.) But, since the buffer is reused, once our function runs for a while, we may be able to reuse the previous buffer, if it is long enough.

But, because we reuse the same buffer, and Drill will not release buffers until the query exits, we can end up wasting quite a bit of space if we constantly grow our buffer from 256 to 512, 1024 bytes and so on. There is nothing we can do to fix this; we can just be aware of the issue.

Drill uses Netty as its memory allocator. Because of the way Netty works, code should avoid allocating buffers larger than 16 MB. Such a large size is unlikely for VARCHAR columns, but possible for VARBINARY columns if you happen to be working with images, audio or other binary blob types.

Writing to the Output Buffer

We will need the following ByteBuf methods:

  public ByteBuf setByte(int index, int value);
  public void setByte(int index, byte b);
  public ByteBuf setBytes(int index, byte[] src);
  public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length);

The ByteBuf return values are an artifact of the fact that DrillBuf extends Netty's ByteBuf class. Normally we don't use these values.

Example that Returns a VARCHAR

Let's put all this together to write a function that returns the set of letter-like characters rather than just counting them.

  @FunctionTemplate(
      name = "extractLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output VarCharHolder output;
    @Inject DrillBuf outputBuf;

    @Override
    public void setup() { }

    @Override
    public void eval() {

      // Convert the input to a string.

      String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);

      // Build up the output string.

      StringBuilder result = new StringBuilder();
      for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        if (Character.isAlphabetic(c)) {
          result.append(c);
        }
      }

      // Convert to the output buffer.

      byte outBytes[] = result.toString().getBytes();
      outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
      outputBuf.setBytes(0, outBytes);
      output.buffer = outputBuf;
      output.start = 0;
      output.end = outBytes.length;
    }
  }

Let's focus on the eval() method, as the rest has already been covered. eval() divides into three blocks. The first we've already discussed, the second is just plain Java code. The third is where we must focus. We start by converting our output String to an array of UTF-8 bytes:

      byte outBytes[] = result.toString().getBytes();

Next, we reallocate our working buffer to hold our data. The buffer is reused, so we need to reallocate memory only if our needs are larger than the current size:

      outputBuf = outputBuf.reallocIfNeeded(outBytes.length);

Note that reallocIfNeeded() returns a new buffer if reallocation occurs. If we forget to assign it back to our outputBuf, we'll continue working the the old buffer, and we may cause the query to fail with an IndexOutOfBounds exception.

As it turns out, Drill's generated code is designed to handle the case in which we replace the buffer. Here is an abstract of the generated code:

11:     public class ProjectorGen0 {
12:     
            // The injected buffer
13:         DrillBuf work0;
...
37:                         final NullableVarCharHolder output = new NullableVarCharHolder();
38:                         NullableVarCharHolder input = out4;
                            // Give the buffer to our function
39:                         DrillBuf outputBuf = work0;
40:                          
        // The inlined body of our function
41:     CountLettersFunctions$ExtractLettersFunction_eval: {
42:         String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);
...
            // We replace the buffer
56:         outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
...
61:     }
62:
                            // Drill copies our newly-created buffer
                            // into its own holding variable.      
63:                         work0 = outputBuf;

With the buffer allocated, we can copy our UTF-8 bytes into the buffer:

      outputBuf.setBytes(0, outBytes);

Finally, we tell the output VarCharHolder where to find its buffer, and where to find the data within the buffer:

      output.buffer = outputBuf;
      output.start = 0;
      output.end = outBytes.length;

Note the use of outBytes.length above. We want to set the length to the actual byte count. As it turns out, ByteBuf can also tell us its length via the capacity() method, but this will tell us the buffer length which is likely to be greater than the number of bytes.

Example Run

Let's try the function on the same input data as before:

SELECT word, extractLetters(word) FROM `cp`.`letterInput.json`
aA 12 zZ!,aAzZ
Αθήνα?,Αθήνα
-Mосква-,Mосква
海.,海

Looks like everything works.

Convenience Functions

The above description does things using just Drill-provided methods and Java. The GitHub repo for this article contains a UdfUtils class that handles some of the routine work.

public static String varCharInput(VarCharHolder input);

Converts a non-nullable VarChar to a Java String. Same as StringFunctionHelpers.getStringFromVarCharHolder().

public static String varCharInput(NullableVarCharHolder input);

Converts a nullable VarChar to a Java String. Same as StringFunctionHelpers.getStringFromVarCharHolder(), except that if the input VarChar is NULL, the returned Java String is null.

public static DrillBuf varCharOutput(String result,
       DrillBuf outputBuf, VarCharHolder output);

Does the work of converting a Java String to a non-nullable VarChar. Resizes the output buffer if needed. Assign the return value to your output buffer.

public static DrillBuf varCharOutput(String result,
       DrillBuf outputBuf, NullableVarCharHolder output);

As above, but handles nullable VarChars. If the input Java String is null, sets the output holder to null, else works the same as the non-null version.

Caveats

Use of the above functions is not free; they will prevent Drill from doing "scalar replacement" to eliminate the input or output VarCharHolder, causing a slight performance hit. (The same is true of the Drill-provided functions.) Still, the gain in developer convenience may be worth the runtime performance cost in some cases.

Example

Here is the example extractLetters function rewritten to use the non-nullable convenience functions:

    @Override
    public void eval() {

      // Convert the input to a string.

      String value = org.apache.drill.exec.expr.contrib.udfExample.UdfUtils.varCharInput(input);
      ...
      // Convert to the output buffer.

      outputBuf = org.apache.drill.exec.expr.contrib.udfExample.UdfUtils.varCharOutput(
                      result, outputBuf, output);
    }

VARBINARY Type

Everything shown above for VARCHAR works for VARBINARY as well. In fact, if we know that the VARBINARY holds a UTF-8-encoded text field (as can be the case for HBase), we can even convert the binary data to a Java String. The only difference is that Java provides no helper function, so we have to do the work ourselves:

  @Param VarBinaryHolder input;
...
    int len = input.end - input.start;
    byte buf[] = new byte[len];
    input.buffer.getBytes(input.start, buf);
    String value = new String(buf, com.google.common.base.Charsets.UTF_8);

Convenience Functions

The UdfUtils class also provides convenience functions for the case that a VARBINARY holds a string:

public static String varBinaryToString(NullableVarBinaryHolder input);

Testing a Simple UDF With VARCHARs

Testing UDFs is a chore if done directly with the Drill server. We must build the code, copy it to Drill, restart Drill, and run a query. Much simpler if we can use JUnit to do the detailed testing, then just use Drill for integration testing.

Let's show how to use the Drill test framework to test our UDF that uses VARCHAR parameters or return values. Such testing is a bit more complex than the earlier case with simple types.

First, let's define a dupIt function that duplicates a string n times:

  @FunctionTemplate(
      name = "dupit",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  // VARCHAR-REQUIRED dupit(VARCHAR-REQUIRED, INT-REQUIRED)

  public static class DupItFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Param IntHolder count;
    @Output VarCharHolder output;
    @Inject DrillBuf outputBuf;

    @Override
    public void setup() { }

    @Override
    public void eval() {

      // Convert the input to a string.

      String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);

      // Build up the output string.

      StringBuilder result = new StringBuilder();
      for (int i = 0; i < count.value; i++) {
        result.append(value);
      }

      // Convert to the output buffer.

      byte outBytes[] = result.toString().getBytes(com.google.common.base.Charsets.UTF_8);
      outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
      outputBuf.setBytes(0, outBytes);
      output.buffer = outputBuf;
      output.start = 0;
      output.end = outBytes.length;
    }
  }

We need to write some helper functions that simulate how Drill works with our function. First, let's define the test function using a special OperatorFixture used in Drill unit tests:

  @Test
  public void testDupItFn() throws Exception {
    try (OperatorFixture fixture = OperatorFixture.standardFixture()) {
      OperatorContext context = fixture.operatorContext(null);
      try {
        // Test code goes here
      } finally {
        context.close();
      }
    }
  }

The fixture, along with the OperatorContext allow us to create a managed buffer one that Drill will track for us. (The operator fixture is available in Drill 1.12 and later.) Let's use that feature to create an instance of our function and wire it up to its holders and to an input and output buffer:

  private DupItFunction newDupIt(OperatorContext context) {
    DupItFunction fn = new DupItFunction();
    fn.input = new VarCharHolder();
    // Larger than our largest output
    fn.input.buffer = context.getManagedBuffer(1024);
    fn.count = new IntHolder();
    fn.output = new VarCharHolder();
    // Default size, function should realloc
    fn.outputBuf = context.getManagedBuffer();
    fn.setup();
    return fn;
  }

Next, we need a function to call our function, complete with String-to-buffer-to-String conversions:

  private String callDupIt(DupItFunction fn, String value, int count) {
    fn.count.value = count;
    byte byteVal[] = value.getBytes(Charsets.UTF_8);
    fn.input.start = 0;
    fn.input.end = byteVal.length;
    fn.input.buffer.setBytes(0, byteVal);
    fn.eval();
    return StringFunctionHelpers.getStringFromVarCharHolder(fn.output);
  }

Given those, we can now easily write test cases:

        DupItFunction fn = newDupIt(context);
        assertEquals("", callDupIt(fn, "x", 0));
        assertEquals("xyxyxyxy", callDupIt(fn, "xy", 4));
        String result = callDupIt(fn, "0123456789", 1000);
        assertEquals(10_000, result.length());

The last case above verified that we are properly resizing our working buffer when needed.

And there we have it. For the price of a bit of boilerplate code, we can debug our UDF in our IDE without pulling our hair out trying to debug generated code in a Drill server.

Clone this wiki locally