Data conversion for ingestion includes the following four types:
- Creating derived fields
- Data format conversion
- Dataset and schema tagging
- Encrypting sensitive information
Derived fields are used in the following scenarios:
- Create one or more primary or delta fields for incremental data compaction
- Push global information down to each row to denormalize a data structure
- Pull a nested data element up to top row level so that it can be used as primary or delta field
Derived fields are configured through ms.derived.fields.
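As an illustration of the second and third scenarios, the following is a minimal Python sketch of pushing a global value down to each row and pulling a nested element up to the top level. It is not how ms.derived.fields is implemented or configured; the helper names push_down and pull_up, the dotted-path notation, and the sample payload are all assumptions made for this example.

```python
def push_down(response, global_keys, rows_key):
    """Copy top-level (global) values into every row found under rows_key."""
    shared = {key: response[key] for key in global_keys}
    return [{**row, **shared} for row in response[rows_key]]


def pull_up(row, path, new_name):
    """Copy a nested value, addressed by a dotted path, to the top level of the row."""
    value = row
    for part in path.split("."):
        value = value[part]
    return {**row, new_name: value}


# Hypothetical API response: one global attribute plus a list of rows.
response = {
    "accountId": "a-1",
    "results": [
        {"id": 1, "meta": {"updatedAt": 1700000000000}},
        {"id": 2, "meta": {"updatedAt": 1700000005000}},
    ],
}

# Denormalize: every row now carries accountId.
rows = push_down(response, ["accountId"], "results")
# Lift the nested timestamp so it could serve as a delta field.
rows = [pull_up(row, "meta.updatedAt", "updatedAt") for row in rows]
```

After these two steps each row is self-contained: it carries the global accountId and exposes updatedAt at the top level, where it could be referenced as a primary or delta field.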
Data format conversion includes:
- Converting CSV data to JSON
- Converting JSON data to Avro
- Converting rows into batches of rows
Data format conversion is handled by converters, which are configured through converter.classes.
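To make two of the conversions listed above concrete, here is a small Python sketch of turning CSV text into JSON-like rows and then grouping rows into batches. It only illustrates the idea and does not reproduce the behavior of any Gobblin or CDI converter; the function names, the sample data, and the batch size of 2 are assumptions.

```python
import csv
import io
import json


def csv_to_json_rows(text):
    """Parse CSV text whose first line is the header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(record) for record in reader]


def batch_rows(rows, size):
    """Group rows into consecutive batches of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


csv_text = "id,name\n1,alpha\n2,beta\n3,gamma\n"
rows = csv_to_json_rows(csv_text)
batches = batch_rows(rows, 2)

# Each batch can be serialized as one JSON array.
print(json.dumps(batches[0]))
```

Note that a plain CSV parse yields string values only; assigning types such as integers or dates would be a separate step driven by a schema.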
Converters are optional, and a job can have more than one; that is, the number of converters can be 0 or more. Typical converters are:
- org.apache.gobblin.converter.avro.JsonIntermediateToAvroConverter
- org.apache.gobblin.converter.csv.CsvToJsonConverterV2
- com.linkedin.cdi.converter.JsonNormalizerConverter
- com.linkedin.cdi.converter.AvroNormalizerConverter
- org.apache.gobblin.converter.LumosAttributesConverter
- com.linkedin.cdi.converter.InFlowValidationConverter
- org.apache.gobblin.converter.IdentityConverter
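The "0 or more converters" idea can be pictured as a chain in which each converter receives the output of the previous one. The Python sketch below shows only that pattern; the converter functions in it are hypothetical stand-ins and do not correspond to the classes listed above.

```python
def lowercase_keys(records):
    """Hypothetical converter: normalize field names to lower case."""
    return [{key.lower(): value for key, value in record.items()} for record in records]


def drop_empty_values(records):
    """Hypothetical converter: remove fields whose value is empty."""
    return [
        {key: value for key, value in record.items() if value not in ("", None)}
        for record in records
    ]


def run_chain(records, converters):
    """Apply each converter in order; an empty chain returns the records unchanged."""
    for convert in converters:
        records = convert(records)
    return records


records = [{"ID": "1", "Name": "alpha", "Note": ""}]

# Zero converters: records pass through untouched.
unchanged = run_chain(records, [])
# Two converters: the second one sees the output of the first.
converted = run_chain(records, [lowercase_keys, drop_empty_values])
```

Because each step consumes the previous step's output, the order in which converters are listed matters.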
Each converter can have its own set of properties:
- ms.csv specifies CSV attributes such as header line position and column projection
- converter.avro.date.format, optional, only needed if there are "date" type fields
- converter.avro.time.format, optional, only needed if there are "time" type fields
- converter.avro.timestamp.format, optional, only needed if there are "timestamp" type fields
- ms.normalizer.batch.size, optional
- ms.data.explicit.eof, required to be true
Tagging converters attach attributes to the ingested dataset, at either the dataset level or the field level.
Currently, the following properties are used for dataset tagging:
- extract.primary.key.fields, one or more fields that can be used as the logical primary key of the dataset. A primary key field can be a nested field.
- extract.delta.fields, one or more fields that can be used as the delta key of the newly extracted records so that they can be merged properly with previously extracted records. Delta fields need to be of TIMESTAMP or LONG type. When a delta field is of LONG type, its values need to be epoch values.
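To show why both keys matter, here is a hedged Python sketch of merging newly extracted records with previously extracted ones: records are matched on the primary key, and the record with the larger delta value wins. This is an illustration of the concept only, not the compaction logic that consumes these properties; the function names, the dotted path used for the nested key, and the epoch-millisecond sample values are assumptions.

```python
def get_nested(record, path):
    """Read a possibly nested field addressed by a dotted path such as 'user.id'."""
    value = record
    for part in path.split("."):
        value = value[part]
    return value


def merge_records(previous, extracted, primary_key_fields, delta_field):
    """Keep, for each primary key, the record with the largest delta value."""
    latest = {}
    for record in previous + extracted:
        key = tuple(get_nested(record, field) for field in primary_key_fields)
        current = latest.get(key)
        if current is None or get_nested(record, delta_field) >= get_nested(current, delta_field):
            latest[key] = record
    return list(latest.values())


previous = [
    {"user": {"id": 1}, "status": "new", "updatedAt": 1700000000000},
    {"user": {"id": 2}, "status": "new", "updatedAt": 1700000000000},
]
extracted = [
    {"user": {"id": 1}, "status": "closed", "updatedAt": 1700000600000},
]

# 'user.id' plays the role of a nested primary key field, 'updatedAt' the delta field.
merged = merge_records(previous, extracted, ["user.id"], "updatedAt")
```

Here the record for user 1 is replaced by its newer version, while the record for user 2 is carried over unchanged. Without a comparable delta field there would be no reliable way to decide which version of a record is the latest.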
Fields that have to be stored with encryption for security can be configured through ms.encryption.fields.
Those fields will be encrypted using the Gobblin encryption codec.