diff --git a/_data-prepper/common-use-cases/codec-processor-combinations.md b/_data-prepper/common-use-cases/codec-processor-combinations.md new file mode 100644 index 0000000000..ae1209e973 --- /dev/null +++ b/_data-prepper/common-use-cases/codec-processor-combinations.md @@ -0,0 +1,47 @@ +--- +layout: default +title: Codec processor combinations +parent: Common use cases +nav_order: 25 +--- + +# Codec processor combinations + +At ingestion time, data received by the [`s3` source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/) can be parsed by [codecs]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#codec). Codecs compresses and decompresses large data sets in a certain format before ingestion them through a Data Prepper pipeline [processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/). + +While most codecs can be used with most processors, the following codec processor combinations can make your pipeline more efficient when used with the following input types. + +## JSON array + +A [JSON array](https://json-schema.org/understanding-json-schema/reference/array) is used to order elements of different types. Because an array is required in JSON, the data contained within the array must be tabular. + +The JSON array does not require a processor. + +## NDJSON + +Unlike a JSON array, [NDJSON](https://www.npmjs.com/package/ndjson) allows for each row of data to be delimited by a newline, meaning data is processed per line instead of an array. + +The NDJSON input type is parsed using the [newline]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) codec, which parses each single line as a single log event. The [parse_json]({{site.url}}{{site.baseurl}}data-prepper/pipelines/configuration/processors/parse-json/) processor then outputs each line as a single event. + +## CSV + +The CSV data type inputs data as a table. It can used without a codec or processor, but it does require one or the other, for example, either just the `csv` processor or the `csv` codec. + +The CSV input type is most effective when used with the following codec processor combinations. + +### `csv` codec + +When the [`csv` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#csv-codec) is used without a processor, it automatically detects headers from the CSV and uses them for index mapping. + +### `newline` codec + +The [`newline` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) parses each row as a single log event. The codec will only detect a header when `header_destination` is configured. The [csv]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/csv/) processor then outputs the event into columns. The header detected in `header_destination` from the `newline` codec can be used in the `csv` processor under `column_names_source_key.` + +## Parquet + +[Apache Parquet](https://parquet.apache.org/docs/overview/) is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it's configured with [S3 Select]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#using-s3_select-with-the-s3-source). + +## Avro + +[Apache Avro] helps streamline streaming data pipelines. It is most efficient when used with the [`avro` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/s3#avro-codec) inside an `s3` sink. + diff --git a/_data-prepper/pipelines/configuration/sources/s3.md b/_data-prepper/pipelines/configuration/sources/s3.md index 59580d749e..6c5a8cc46d 100644 --- a/_data-prepper/pipelines/configuration/sources/s3.md +++ b/_data-prepper/pipelines/configuration/sources/s3.md @@ -121,9 +121,9 @@ Option | Required | Type | Description ## codec -The `codec` determines how the `s3` source parses each S3 object. +The `codec` determines how the `s3` source parses each Amazon S3 object. For increased and more efficient performance, you can use [codec combinations]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/codec-processor-combinations/) with certain processors. -### newline codec +### `newline` codec The `newline` codec parses each single line as a single log event. This is ideal for most application logs because each event parses per single line. It can also be suitable for S3 objects that have individual JSON objects on each line, which matches well when used with the [parse_json]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor to parse each line. @@ -147,11 +147,14 @@ Option | Required | Type | Description `delimiter` | Yes | Integer | The delimiter separating columns. Default is `,`. `quote_character` | Yes | String | The character used as a text qualifier for CSV data. Default is `"`. `header` | No | String list | The header containing the column names used to parse CSV data. -`detect_header` | No | Boolean | Whether the first line of the S3 object should be interpreted as a header. Default is `true`. +`detect_header` | No | Boolean | Whether the first line of the Amazon S3 object should be interpreted as a header. Default is `true`. + + + ## Using `s3_select` with the `s3` source -When configuring `s3_select` to parse S3 objects, use the following options. +When configuring `s3_select` to parse Amazon S3 objects, use the following options: Option | Required | Type | Description :--- |:-----------------------|:------------| :--- @@ -222,12 +225,13 @@ Option | Required | Type | Description Option | Required | Type | Description :--- | :--- | :--- | :--- -`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan in the interval will start after the interval duration from the last scan ends and when all the objects from the previous scan are processed. Supports ISO_8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`). +`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan in the interval will start after the interval duration from the last scan ends and when all the objects from the previous scan are processed. Supports ISO 8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`). `count` | No | Integer | Specifies how many times a bucket will be scanned. Defaults to `Integer.MAX_VALUE`. + ## Metrics -The `s3` source includes the following metrics. +The `s3` source includes the following metrics: ### Counters