EVF Tutorial Scan Framework Creator
We've created a batch reader for the log plugin. But thus far it has been an "orphan": nothing calls it. Let's fix that.
EVF is a general-purpose framework: it handles many kinds of scans. We must customize it for each specific reader. Rather than doing so by creating subclasses, we instead assemble the pieces we need through composition by providing a framework builder class. This builder class is what allows the Easy framework to operate with both "legacy" and EVF-based readers.
Prior to the EVF, Easy format plugins were based on the original ScanBatch. At the start of execution, the Easy framework calls getRecordReader() in your plugin class to create a record reader for each file split. The Easy framework then passes all the readers to the scan batch.
With EVF, we use the record batch reader we just created. Instead of creating all readers up-front, EVF creates them on-the-fly as needed. EVF will create a separate instance for each file split. The easiest way to do so is to override the following method in your plugin class:
TODO: Replace with code for log reader.
@Override
public ManagedReader<? extends FileSchemaNegotiator> newBatchReader(
    EasySubScan scan, OptionManager options) throws ExecutionSetupException {
  TextParsingSettingsV3 settings = new TextParsingSettingsV3();
  settings.set(getConfig());
  return new CompliantTextBatchReader(settings);
}
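The difference between creating every reader up front and creating them on demand can be illustrated with a small, self-contained sketch. It does not use Drill classes: `LazyReaderSketch`, `SplitReader`, `createAllUpFront()` and `createOnDemand()` are hypothetical names invented for illustration, and file splits are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration only; none of these are Drill classes.
final class LazyReaderSketch {

  /** Stand-in for a reader bound to exactly one file split. */
  static final class SplitReader {
    final String split;
    SplitReader(String split) { this.split = split; }
  }

  /** Legacy style: every reader exists before the scan starts. */
  static List<SplitReader> createAllUpFront(
      List<String> splits, Function<String, SplitReader> factory) {
    List<SplitReader> readers = new ArrayList<>();
    for (String split : splits) {
      readers.add(factory.apply(split));
    }
    return readers;
  }

  /** EVF style: a reader is created only when the scan asks for the next one. */
  static Iterator<SplitReader> createOnDemand(
      List<String> splits, Function<String, SplitReader> factory) {
    final Iterator<String> splitIter = splits.iterator();
    return new Iterator<SplitReader>() {
      @Override public boolean hasNext() { return splitIter.hasNext(); }
      @Override public SplitReader next() { return factory.apply(splitIter.next()); }
    };
  }

  public static void main(String[] args) {
    List<String> splits = Arrays.asList("a.log", "b.log", "c.log");
    int[] created = {0};  // counts factory calls
    Function<String, SplitReader> factory = s -> {
      created[0]++;
      return new SplitReader(s);
    };

    Iterator<SplitReader> readers = createOnDemand(splits, factory);
    System.out.println("readers created before first use: " + created[0]);
    readers.next();
    System.out.println("readers created after first use: " + created[0]);
  }
}
```

In this sketch the factory plays the role that newBatchReader() plays in the plugin: it is called once per split, and only when the scan is ready for that split.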
In an advanced case, we could even create a different reader depending on some interesting condition. For example, the Parquet reader has both a "new" and "old" version with different capabilities. We have access to the scan, options, and the format config (via the getConfig() method).
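As a rough illustration of choosing a reader by condition, here is a minimal sketch that does not depend on Drill. Everything in it is an assumption made for the example: the `Reader` interface, the two reader classes, the option name `hypothetical.use_new_reader`, and the use of a plain `Map` in place of Drill's option manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Drill classes.
final class ReaderChoiceSketch {

  interface Reader {
    String describe();
  }

  static final class NewReader implements Reader {
    @Override public String describe() { return "new reader"; }
  }

  static final class OldReader implements Reader {
    @Override public String describe() { return "old reader"; }
  }

  // Made-up option name; a real plugin would consult its own options or config.
  static final String USE_NEW_READER = "hypothetical.use_new_reader";

  /** Picks a reader implementation based on a session option, defaulting to the old one. */
  static Reader newReader(Map<String, Boolean> options) {
    boolean useNew = options.getOrDefault(USE_NEW_READER, Boolean.FALSE);
    return useNew ? new NewReader() : new OldReader();
  }

  public static void main(String[] args) {
    Map<String, Boolean> options = new HashMap<>();
    System.out.println(newReader(options).describe());
    options.put(USE_NEW_READER, Boolean.TRUE);
    System.out.println(newReader(options).describe());
  }
}
```

The caller never needs to know which implementation it received, which is what lets a plugin switch readers without changing the framework around it.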
This method is not yet called anywhere, so the plugin should still run using the old reader.
EVF supports a number of "scan frameworks" and a wide variety of options. We use the "builder" pattern to specify how we want the scan to work: we create a builder, pass it options to configure the framework, then let the Easy scan framework do the actual building for us. Here's how we configure the file scan framework by adding a method to the plugin class:
@Override
protected FileScanBuilder frameworkBuilder(
    FragmentContext context, EasySubScan scan) throws ExecutionSetupException {
  FileScanBuilder builder = new FileScanBuilder();
  // The default type of regex columns is nullable VarChar,
  // so let's use that as the missing column type.
  builder.setNullType(Types.optional(MinorType.VARCHAR));
  return builder;
}
The log reader reads from a file, so we use the FileScanBuilder class. We could support the columns column to read into an array, like CSV, if we wanted.
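The division of labor in the builder pattern described above (the plugin configures a builder, the framework finishes the job) might be pictured with the following self-contained sketch. It is not Drill code: `SketchScanBuilder`, `SketchPlugin` and `buildScan()` are hypothetical names, and the assumption that the framework adds the file list before building is made only for the sake of the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are Drill classes.
final class BuilderHandoffSketch {

  /** Collects scan options; defaults apply to anything the plugin does not set. */
  static final class SketchScanBuilder {
    private String nullType = "nullable INT";
    private final List<String> files = new ArrayList<>();

    void setNullType(String nullType) { this.nullType = nullType; }

    void setFiles(List<String> files) {
      this.files.clear();
      this.files.addAll(files);
    }

    /** A real framework would build a scan operator; here we just describe it. */
    String build() {
      return "scan of " + files + " with missing columns as " + nullType;
    }
  }

  /** The plugin's only job: return a builder carrying its preferences. */
  interface SketchPlugin {
    SketchScanBuilder frameworkBuilder();
  }

  /** The framework's job: add what it knows (assumed here: the files), then build. */
  static String buildScan(SketchPlugin plugin, List<String> files) {
    SketchScanBuilder builder = plugin.frameworkBuilder();
    builder.setFiles(files);
    return builder.build();
  }

  public static void main(String[] args) {
    SketchPlugin logPlugin = () -> {
      SketchScanBuilder builder = new SketchScanBuilder();
      builder.setNullType("nullable VARCHAR");
      return builder;
    };
    System.out.println(buildScan(logPlugin, Arrays.asList("a.log", "b.log")));
  }
}
```

Because the plugin hands back a configured builder rather than a finished scan, the framework stays free to supply its own details without the plugin subclassing anything.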
We call setNullType() to define a type to use for missing columns rather than the traditional nullable INT. We observe that the native type of a regex column is nullable VarChar. So, if the user asks for a column that we don't have, we should use that same type so that types remain unchanged when the user later decides to define that column.
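To see why the choice of missing-column type matters, consider this simplified model of column resolution. It is only a sketch under stated assumptions: `NullTypeSketch` and `resolve()` are hypothetical, types are represented as plain strings, and the column names `ts` and `level` are invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration only; this is not how Drill represents types.
final class NullTypeSketch {

  /**
   * Assigns a type to each projected column: the reader's type if the reader
   * defines the column, otherwise the configured type for missing columns.
   */
  static Map<String, String> resolve(
      List<String> projected, Map<String, String> readerColumns, String nullType) {
    Map<String, String> resolved = new LinkedHashMap<>();
    for (String column : projected) {
      resolved.put(column, readerColumns.getOrDefault(column, nullType));
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<String> projected = Arrays.asList("ts", "level");

    // Today the regex defines only "ts"; "level" is missing.
    Map<String, String> today = new LinkedHashMap<>();
    today.put("ts", "nullable VARCHAR");

    // Later the user adds "level" to the regex definition.
    Map<String, String> later = new LinkedHashMap<>(today);
    later.put("level", "nullable VARCHAR");

    // With nullable INT as the missing type, "level" changes type once defined.
    System.out.println(resolve(projected, today, "nullable INT"));
    System.out.println(resolve(projected, later, "nullable INT"));

    // With nullable VARCHAR as the missing type, the schema is the same both times.
    System.out.println(resolve(projected, today, "nullable VARCHAR"));
    System.out.println(resolve(projected, later, "nullable VARCHAR"));
  }
}
```

With the missing type set to match the reader's native type, the resolved schema is identical before and after the user defines the column, so queries written against the earlier schema keep seeing the same types.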
After you add this method, the log reader will still use the old version of the reader because we've not told the Easy framework to call the method we just created.
With this method in place, our new version is "live". You should use your unit tests to step through the new code to make sure it works -- and to ensure you understand the EVF, or at least the parts you need.
We've now completed a "bare bones" conversion to the new framework. We'd be fine if we stopped here.
The new framework offers additional features that can further simplify the log format plugin. We'll look at those topics in the next section.