Add codec and processor combinations to S3 source (#5288)
* Add codec and processor combinations to S3 source.

Signed-off-by: Naarcha-AWS <[email protected]>

* Update s3.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Make codec-processor combinations its own file

Signed-off-by: Naarcha-AWS <[email protected]>

* Make Data Prepper codec page

Signed-off-by: Naarcha-AWS <[email protected]>

* Fix links

Signed-off-by: Naarcha-AWS <[email protected]>

* Fix links

Signed-off-by: Naarcha-AWS <[email protected]>

* Reformat article without tables.

Signed-off-by: Naarcha-AWS <[email protected]>

* fix link

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _data-prepper/common-use-cases/codec-processor-combinations.md

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update codec-processor-combinations.md

Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Naarcha-AWS and vagimeli authored Nov 18, 2023
1 parent 826e677 commit 72b3363
Showing 2 changed files with 57 additions and 6 deletions.
47 changes: 47 additions & 0 deletions _data-prepper/common-use-cases/codec-processor-combinations.md
@@ -0,0 +1,47 @@
---
layout: default
title: Codec processor combinations
parent: Common use cases
nav_order: 25
---

# Codec processor combinations

At ingestion time, data received by the [`s3` source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/) can be parsed by [codecs]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#codec). Codecs compress and decompress large data sets in a given format before ingesting them through a Data Prepper pipeline [processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/).

While most codecs can be used with most processors, the following codec-processor combinations can make your pipeline more efficient for the corresponding input types.

## JSON array

A [JSON array](https://json-schema.org/understanding-json-schema/reference/array) is used to order elements of different types. Because the JSON codec requires an array, the data contained within the array must be tabular.

The JSON array input type does not require a processor.
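
The following is a minimal sketch of an `s3` source pipeline that reads JSON arrays, assuming the `json` codec and a scan-based source; the bucket name, Region, and role ARN are placeholders:

```yaml
json-array-pipeline:
  source:
    s3:
      codec:
        json:                         # parses each object in the JSON array as an event
      scan:
        buckets:
          - bucket:
              name: example-bucket    # placeholder bucket name
      aws:
        region: us-east-1             # placeholder Region
        sts_role_arn: "arn:aws:iam::123456789012:role/Example-Role"  # placeholder role
  sink:
    - stdout:                         # stdout sink for illustration only
```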

## NDJSON

Unlike a JSON array, [NDJSON](https://www.npmjs.com/package/ndjson) delimits each row of data with a newline, meaning that data is processed line by line instead of as an array.

The NDJSON input type is parsed using the [`newline`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) codec, which parses each line as a single log event. The [`parse_json`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor then outputs each line as a single event.
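
A minimal sketch of this combination, assuming a scan-based source with placeholder bucket and AWS values, might look like the following:

```yaml
ndjson-pipeline:
  source:
    s3:
      codec:
        newline:                      # one log event per line
      scan:
        buckets:
          - bucket:
              name: example-bucket    # placeholder bucket name
      aws:
        region: us-east-1             # placeholder Region
  processor:
    - parse_json:                     # parses each line's JSON into structured fields
  sink:
    - stdout:
```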

## CSV

The CSV input type ingests data as a table. It does not require both a codec and a processor, but it does require one or the other, for example, just the `csv` processor or just the `csv` codec.

The CSV input type is most effective when used with the following codec-processor combinations.

### `csv` codec

When the [`csv` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#csv-codec) is used without a processor, it automatically detects headers from the CSV and uses them for index mapping.
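
A minimal sketch using only the `csv` codec, with placeholder bucket and AWS values, might look like the following:

```yaml
csv-codec-pipeline:
  source:
    s3:
      codec:
        csv:
          detect_header: true         # default; detected column names become event keys
      scan:
        buckets:
          - bucket:
              name: example-bucket    # placeholder bucket name
      aws:
        region: us-east-1             # placeholder Region
  sink:
    - stdout:
```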

### `newline` codec

The [`newline` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) parses each row as a single log event. The codec will only detect a header when `header_destination` is configured. The [`csv`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/csv/) processor then outputs the event into columns. The header detected in `header_destination` by the `newline` codec can be used in the `csv` processor under `column_names_source_key`.
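
The following sketch shows one way to wire these options together; the key name `header` and the bucket and AWS values are placeholders:

```yaml
csv-newline-pipeline:
  source:
    s3:
      codec:
        newline:
          header_destination: header  # stores the detected header line in this key
      scan:
        buckets:
          - bucket:
              name: example-bucket    # placeholder bucket name
      aws:
        region: us-east-1             # placeholder Region
  processor:
    - csv:
        column_names_source_key: header  # reads column names from the stored header
  sink:
    - stdout:
```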

## Parquet

[Apache Parquet](https://parquet.apache.org/docs/overview/) is a columnar storage format built for Hadoop. It is most efficient when read without a codec. Good results, however, can be achieved when the `s3` source is configured with [S3 Select]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#using-s3_select-with-the-s3-source).
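
A minimal sketch of the `s3` source configured with S3 Select for Parquet objects follows; the SQL expression, queue URL, and AWS values are placeholders:

```yaml
parquet-pipeline:
  source:
    s3:
      notification_type: sqs
      s3_select:
        expression: "select * from s3object s"  # placeholder SQL expression
        input_serialization: parquet
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder
      aws:
        region: us-east-1                       # placeholder Region
  sink:
    - stdout:
```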

## Avro

[Apache Avro](https://avro.apache.org/) helps streamline streaming data pipelines. It is most efficient when used with the [`avro` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/s3#avro-codec) inside an `s3` sink.
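
A sketch of an `s3` sink using the `avro` codec follows; the buckets, thresholds, schema, and AWS values are all placeholders:

```yaml
avro-pipeline:
  source:
    s3:
      codec:
        newline:                         # placeholder source configuration
      scan:
        buckets:
          - bucket:
              name: example-input-bucket # placeholder bucket name
      aws:
        region: us-east-1                # placeholder Region
  sink:
    - s3:
        bucket: example-output-bucket    # placeholder bucket name
        object_key:
          path_prefix: avro-events/
        threshold:
          event_count: 10000
          maximum_size: 50mb
          event_collect_timeout: 15s
        codec:
          avro:
            schema: >
              {
                "type": "record",
                "name": "LogEvent",
                "fields": [
                  { "name": "message", "type": ["null", "string"], "default": null }
                ]
              }
        aws:
          region: us-east-1              # placeholder Region
```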

16 changes: 10 additions & 6 deletions _data-prepper/pipelines/configuration/sources/s3.md
@@ -121,9 +121,9 @@ Option | Required | Type | Description

## codec

The `codec` determines how the `s3` source parses each S3 object.
The `codec` determines how the `s3` source parses each Amazon S3 object. To improve pipeline performance, you can use [codec combinations]({{site.url}}{{site.baseurl}}/data-prepper/common-use-cases/codec-processor-combinations/) with certain processors.

### newline codec
### `newline` codec

The `newline` codec parses each line as a single log event. This is ideal for most application logs because each line corresponds to a single event. It is also suitable for S3 objects that have an individual JSON object on each line, which pairs well with the [`parse_json`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor to parse each line.

@@ -147,11 +147,14 @@ Option | Required | Type | Description
`delimiter` | No | String | The character separating columns. Default is `,`.
`quote_character` | No | String | The character used as a text qualifier for CSV data. Default is `"`.
`header` | No | String list | The header containing the column names used to parse CSV data.
`detect_header` | No | Boolean | Whether the first line of the S3 object should be interpreted as a header. Default is `true`.
`detect_header` | No | Boolean | Whether the first line of the Amazon S3 object should be interpreted as a header. Default is `true`.

## Using `s3_select` with the `s3` source<a name="s3_select"></a>

When configuring `s3_select` to parse S3 objects, use the following options.
When configuring `s3_select` to parse Amazon S3 objects, use the following options:

Option | Required | Type | Description
:--- |:-----------------------|:------------| :---
@@ -222,12 +225,13 @@ Option | Required | Type | Description

Option | Required | Type | Description
:--- | :--- | :--- | :---
`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan in the interval will start after the interval duration from the last scan ends and when all the objects from the previous scan are processed. Supports ISO_8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`).
`interval` | Yes | String | Indicates the minimum interval between each scan. The next scan starts after the interval duration has elapsed since the last scan ended and after all the objects from the previous scan have been processed. Supports ISO 8601 notation strings, such as `PT20.345S` or `PT15M`, and notation strings for seconds (`60s`) and milliseconds (`1600ms`).
`count` | No | Integer | Specifies how many times a bucket will be scanned. Defaults to `Integer.MAX_VALUE`.
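
As a sketch, these scheduling options sit under the source's `scan` configuration; the bucket name and interval values are placeholders:

```yaml
scan-pipeline:
  source:
    s3:
      codec:
        newline:
      scan:
        scheduling:
          interval: PT1H              # wait at least one hour between scans
          count: 5                    # scan the bucket at most five times
        buckets:
          - bucket:
              name: example-bucket    # placeholder bucket name
      aws:
        region: us-east-1             # placeholder Region
  sink:
    - stdout:
```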


## Metrics

The `s3` source includes the following metrics.
The `s3` source includes the following metrics:

### Counters

