Simplified UDF Framework
The material here has repeatedly pointed out that UDFs are not Java; they are a Drill-specific DSL that uses a simplified subset of Java constructs. This explains why developers often wrap themselves around the axle when they try to develop complex UDFs: many of the things you want to do for complicated code in Java are unsupported in UDFs. Frustration mounts, especially when trying to debug.
Recall that the reason that Drill uses the DSL technique is to extract maximum performance for Drill's own built-in functions. But, what if we were willing to trade off a bit of runtime performance for better development-time productivity? (And, as mentioned before, it is not clear that we will actually incur a runtime hit.)
We can do that by looking at the UDF not as the implementation of our function, but rather just a wrapper around our function. This lets us use our function in other Java code, and use normal JUnit tests to debug it.
The concept is simple:
- Put the function implementation into a non-UDF class.
- Write the UDF as a thin Drill-to-Java wrapper.
Let's see how this works for our log2 example.
First, we create a normal Java class to hold our function. Because this is plain Java, there are no restrictions on syntax or structure: we can put multiple functions in a single class, use constants, and so on. In fact, we may even be able to use code we already have as our implementation. For example:
    package org.apache.drill.exec.expr.contrib.udfExample;

    public class FunctionImpl {
      private static final double LOG_2 = Math.log(2.0D);

      public static final double log2(double x) {
        return Math.log(x) / LOG_2;
      }
    }
Here we put the implementation in the same package as the UDF, which is handy. But, if the implementation already exists, or we want to reuse the code elsewhere, we are free to use any package since Drill won't look for our code; only Java will.
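Because the implementation class is ordinary Java, it can hold several related functions that share constants. The following is only a minimal sketch of that idea: the class name `MathFunctions` and the `logBase` function are hypothetical and are not part of the original example or of Drill.

```java
// Hypothetical implementation class: the names MathFunctions and logBase
// are illustrative assumptions, not part of Drill or the log2 example.
public final class MathFunctions {

    // Shared constant, computed once when the class loads.
    private static final double LOG_2 = Math.log(2.0D);

    // Static helpers only, so instances are not needed.
    private MathFunctions() { }

    /** Base-2 logarithm, as in the log2 example. */
    public static double log2(double x) {
        return Math.log(x) / LOG_2;
    }

    /** Logarithm of x in an arbitrary base, using the same approach. */
    public static double logBase(double x, double base) {
        return Math.log(x) / Math.log(base);
    }

    public static void main(String[] args) {
        // Any Java code can call these functions; no Drill is required.
        System.out.println(log2(8.0));
        System.out.println(logBase(100.0, 10.0));
    }
}
```

Each function in such a class would still need its own thin UDF wrapper before Drill could call it.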
We can now create a plain-old JUnit test (outside of Drill):
    @Test
    public void testImpl() {
      assertEquals(1D, FunctionImpl.log2(2), 0.001D);
      assertEquals(2D, FunctionImpl.log2(4), 0.001D);
      assertEquals(-1D, FunctionImpl.log2(1.0/2.0), 0.001D);
    }
The beauty of this method is that, if our function is complex, we can work out the kinks without the additional complexity of Drill. Said another way, why struggle with Drill when the bug is in the function implementation itself?
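Testing in plain Java also makes it cheap to probe edge cases before Drill is involved. The sketch below is an illustration rather than part of the original example: `Log2EdgeCases` is a hypothetical class name, `log2` is copied from `FunctionImpl` so the code runs on its own, and a small `check` helper stands in for JUnit so no test library is needed. It relies on the documented behavior of `Math.log`, which returns negative infinity for zero and NaN for negative or NaN input.

```java
// Stand-alone sketch: Log2EdgeCases is a hypothetical name, and log2 is
// copied from FunctionImpl so the example runs without other classes.
public class Log2EdgeCases {
    private static final double LOG_2 = Math.log(2.0D);

    public static double log2(double x) {
        return Math.log(x) / LOG_2;
    }

    // Minimal stand-in for a JUnit assertion.
    static void check(boolean condition, String label) {
        if (!condition) {
            throw new AssertionError("failed: " + label);
        }
    }

    public static void main(String[] args) {
        check(Math.abs(log2(8) - 3.0) < 1e-9, "log2(8) is 3");
        check(log2(0) == Double.NEGATIVE_INFINITY, "log2(0) is -Infinity");
        check(Double.isNaN(log2(-1)), "log2(-1) is NaN");
        check(Double.isNaN(log2(Double.NaN)), "log2(NaN) is NaN");
        System.out.println("all edge cases behave as expected");
    }
}
```

Whether infinity and NaN are acceptable results for a SQL function is a design decision; the point is that the decision can be made and tested here, in ordinary Java.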
Next, we create a UDF as before, but this time it simply wraps the above implementation. We can create the UDF in its own class. Or, we can use a very handy trick used by Drill itself: define the UDF as a nested class inside our implementation class, as shown here:
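    package org.apache.drill.exec.expr.contrib.udfExample;

    import org.apache.drill.exec.expr.DrillSimpleFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate.NullHandling;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.holders.Float8Holder;

    public class FunctionImpl {
      private static final double LOG_2 = Math.log(2.0D);

      // Plain-Java implementation, testable without Drill.
      public static final double log2(double x) {
        return Math.log(x) / LOG_2;
      }

      @FunctionTemplate(
          name = "log2",
          scope = FunctionScope.SIMPLE,
          nulls = NullHandling.NULL_IF_NULL)
      // FLOAT8-REQUIRED log2(FLOAT8-REQUIRED)
      public static class Log2Wrapper implements DrillSimpleFunc {
        @Param public Float8Holder x;
        @Output public Float8Holder out;

        @Override
        public void setup() { }

        @Override
        public void eval() {
          // Fully qualified call to the implementation function.
          out.value = org.apache.drill.exec.expr.contrib.udfExample.FunctionImpl.log2(x.value);
        }
      }
    }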
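Note the fully qualified reference to the implementation function, even though it is in the same class as the UDF. Drill copies the body of eval() into its own generated code, so names that would normally resolve through imports or the enclosing class must be spelled out in full. (The need for this syntax was discussed ((need link)).)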
Note also the comment that summarizes the function name and types. This is a quick way to present the information to human readers, and it uses the notation that Drill uses internally.
We can now run the same integration test as before, this time using our wrapper function. The result should be the same.
Even if you believe that Drill can improve performance by avoiding the per-row function call needed for this framework, using this framework in the early days of development can save you time and effort. Once everything works, try moving your implementation from the plain-Java class to the UDF, carefully following the functionality restrictions we've discussed. If you run into issues, just revert to the two-class version and call it a day.