Add codec and processor combinations to S3 source (#5288)
* Add codec and processor combinations to S3 source.
  Signed-off-by: Naarcha-AWS <[email protected]>
* Update s3.md
  Signed-off-by: Naarcha-AWS <[email protected]>
* Make codec-processor combinations its own file
  Signed-off-by: Naarcha-AWS <[email protected]>
* Make Data Prepper codec page
  Signed-off-by: Naarcha-AWS <[email protected]>
* Fix links
  Signed-off-by: Naarcha-AWS <[email protected]>
* Fix links
  Signed-off-by: Naarcha-AWS <[email protected]>
* Reformat article without tables.
  Signed-off-by: Naarcha-AWS <[email protected]>
* fix link
  Signed-off-by: Naarcha-AWS <[email protected]>
* Apply suggestions from code review
  Signed-off-by: Naarcha-AWS <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: Naarcha-AWS <[email protected]>
* Update _data-prepper/common-use-cases/codec-processor-combinations.md
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: Naarcha-AWS <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: Naarcha-AWS <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: Naarcha-AWS <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Melissa Vagi <[email protected]>
  Signed-off-by: Naarcha-AWS <[email protected]>
* Update codec-processor-combinations.md
  Signed-off-by: Naarcha-AWS <[email protected]>

---------

Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
1 parent 826e677, commit 72b3363
Showing 2 changed files with 57 additions and 6 deletions.
47 changes: 47 additions & 0 deletions
_data-prepper/common-use-cases/codec-processor-combinations.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
--- | ||
layout: default | ||
title: Codec processor combinations | ||
parent: Common use cases | ||
nav_order: 25 | ||
--- | ||
|
||
# Codec processor combinations | ||
|
||
At ingestion time, data received by the [`s3` source]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/) can be parsed by [codecs]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#codec). Codecs compress and decompress large data sets in a certain format before ingesting them through a Data Prepper pipeline [processor]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/processors/). | ||
|
||
While most codecs can be used with most processors, the codec processor combinations described for each of the following input types can make your pipeline more efficient. | ||
|
||
## JSON array | ||
|
||
A [JSON array](https://json-schema.org/understanding-json-schema/reference/array) is used to order elements of different types. Because an array is required in JSON, the data contained within the array must be tabular. | ||
|
||
The JSON array does not require a processor. | ||
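
As a rough sketch of the processor-free case described above, the following pipeline reads objects that contain a JSON array and sends the resulting events directly to a sink. The text above does not name a codec, so the use of the `s3` source's `json` codec here is an assumption. The pipeline name, queue URL, AWS Region, and role ARN are hypothetical placeholders.

```yaml
# Hypothetical pipeline name and AWS values; replace them with your own.
json-array-pipeline:
  source:
    s3:
      notification_type: sqs
      codec:
        # Assumption: the json codec creates one event per array element.
        json:
      compression: none
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/example-data-prepper-role"
  # No processor section: the events go straight to the sink.
  sink:
    - stdout:
```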
|
||
## NDJSON | ||
|
||
Unlike a JSON array, [NDJSON](https://www.npmjs.com/package/ndjson) allows for each row of data to be delimited by a newline, meaning data is processed line by line instead of as an array. | ||
|
||
The NDJSON input type is parsed using the [`newline`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) codec, which parses each line as a single log event. The [`parse_json`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/parse-json/) processor then outputs each line as a single event. | ||
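
The `newline` codec and `parse_json` processor combination described above might be configured along the following lines. This is a minimal sketch, not a definitive configuration: the pipeline name, queue URL, AWS Region, and role ARN are hypothetical placeholders, and the processor is assumed to read from the `message` key written by the codec.

```yaml
# Hypothetical pipeline name and AWS values; replace them with your own.
ndjson-pipeline:
  source:
    s3:
      notification_type: sqs
      codec:
        # Each line of the S3 object becomes one log event.
        newline:
      compression: none
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/example-data-prepper-role"
  processor:
    # Assumption: the JSON text of each line is stored under the message key.
    - parse_json:
        source: message
  sink:
    - stdout:
```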
|
||
## CSV | ||
|
||
The CSV data type inputs data as a table. It can be used without a codec or without a processor, but it does require one or the other, for example, either only the `csv` processor or only the `csv` codec. | ||
|
||
The CSV input type is most effective when used with the following codec processor combinations. | ||
|
||
### `csv` codec | ||
|
||
When the [`csv` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#csv-codec) is used without a processor, it automatically detects headers from the CSV and uses them for index mapping. | ||
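
A codec-only configuration of this kind could look like the following sketch. The pipeline name, queue URL, AWS Region, and role ARN are hypothetical placeholders, and `detect_header` is set explicitly only to make the header behavior visible.

```yaml
# Hypothetical pipeline name and AWS values; replace them with your own.
csv-codec-pipeline:
  source:
    s3:
      notification_type: sqs
      codec:
        csv:
          # Treat the first line of each object as the column header.
          detect_header: true
      compression: none
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/example-data-prepper-role"
  # No processor section: the codec already splits each row into columns.
  sink:
    - stdout:
```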
|
||
### `newline` codec | ||
|
||
The [`newline` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#newline-codec) parses each row as a single log event. The codec will only detect a header when `header_destination` is configured. The [`csv`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/processors/csv/) processor then outputs the event into columns. The header detected in `header_destination` from the `newline` codec can be used in the `csv` processor under `column_names_source_key`. | ||
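
The handoff between `header_destination` and `column_names_source_key` can be sketched as follows. The key name `header`, the pipeline name, and the AWS values are hypothetical choices made for illustration; the only requirement drawn from the text above is that both options refer to the same key.

```yaml
# Hypothetical pipeline name and AWS values; replace them with your own.
csv-newline-pipeline:
  source:
    s3:
      notification_type: sqs
      codec:
        newline:
          # Store the header line of the object on each event under "header".
          header_destination: header
      compression: none
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::123456789012:role/example-data-prepper-role"
  processor:
    - csv:
        # Assumption: the raw row text is stored under the message key.
        source: message
        # Read column names from the key written by the newline codec.
        column_names_source_key: header
  sink:
    - stdout:
```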
|
||
## Parquet | ||
|
||
[Apache Parquet](https://parquet.apache.org/docs/overview/) is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it's configured with [S3 Select]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3#using-s3_select-with-the-s3-source). | ||
|
||
## Avro | ||
|
||
[Apache Avro](https://avro.apache.org/) helps streamline streaming data pipelines. It is most efficient when used with the [`avro` codec]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sinks/s3#avro-codec) inside an `s3` sink. | ||
|