DOC-3834 initial edit of data-pipelines pages
andy-stark-redis committed May 24, 2024
1 parent 9ba0ec9 commit 422aae6
Showing 4 changed files with 61 additions and 24 deletions.
@@ -16,18 +16,25 @@ type: integration
weight: 30
---

The data in the source database is often
[*normalized*](https://en.wikipedia.org/wiki/Database_normalization).
This means that columns can't have composite values (such as arrays) and relationships between entities
are expressed as mappings of primary keys to foreign keys between different tables.
Normalized data models reduce redundancy and improve data integrity for write queries, but this comes
at the expense of read speed.
A Redis cache, on the other hand, is focused on making *read* queries fast, so RDI provides data
*denormalization* to help with this.

## Nest strategy

*Nesting* is the strategy RDI uses to denormalize many-to-one relationships in the source database.
It does this by representing the
parent object (the "one") as a JSON document with the children (the "many") nested inside
a JSON map attribute.

{{< image filename="/images/rdi/nest-flow.png" >}}

Configure denormalization with a `nest` block in the child entities' RDI job, as shown in this example:

```yaml
source:
  # ...
```
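
As a rough sketch of a complete child-entity job (hypothetical `Invoice` and `InvoiceLine` tables; the option names follow the nest settings in the RDI reference, so verify them there), such a job might look like this:

```yaml
# Sketch of a nest job for a hypothetical InvoiceLine child table.
source:
  server_name: chinook            # hypothetical source name
  schema: public
  table: InvoiceLine              # the child ("many") table
output:
  - uses: redis.write
    with:
      nest:
        parent:
          # The parent ("one") entity that becomes the root of the JSON document.
          server_name: chinook
          schema: public
          table: Invoice
        nesting_key: InvoiceLineId   # unique key of each child row
        parent_key: InvoiceId        # foreign key linking child to parent
        path: $.InvoiceLineItems     # attribute that holds the nested map
        structure: map
      data_type: json
```

With a job like this, each captured `InvoiceLine` row would be nested under `$.InvoiceLineItems` in its parent `Invoice` document, keyed by its `InvoiceLineId`.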
@@ -1,8 +1,8 @@
---
Title: Configure data pipelines
linkTitle: Configure
description: Learn how to configure ingest pipelines for data transformation
weight: 1
alwaysopen: false
categories: ["redis-di"]
aliases:
@@ -12,7 +12,7 @@ RDI implements
[change data capture](https://en.wikipedia.org/wiki/Change_data_capture) (CDC)
with *pipelines*. (See the
[architecture overview]({{< relref "/integrate/redis-data-integration/ingest/architecture#overview" >}})
for an introduction to pipelines.) There are two basic types of pipeline:

- *Ingest* pipelines capture data from an external source database
and add it to a Redis target database.
@@ -28,12 +28,13 @@ structure of the configuration:
{{< image filename="images/rdi/ingest/ingest-config-folders.svg" >}}

The main configuration for the pipeline is in the `config.yaml` file.
This specifies the connection details for the source database (such
as host, username, and password) and also the queries that RDI will use
to extract the required data. You can also specify one or more optional *job* configurations in the `Jobs` folder. Use these to specify custom
*data transformations*
to apply to the source data before writing it to the target.

The sections below describe these two types of configuration files in more detail.

## The `config.yaml` file

@@ -73,8 +74,8 @@ The main sections of the file configure [`sources`](#sources) and [`targets`](#targets).

### Sources

The `sources` section has a subsection for the source that
you need to configure. The source section starts with a unique name
to identify the source (in the example we have a source
called `mysql` but you can choose any name you like). The example
configuration contains the following data:
@@ -110,7 +111,9 @@ and TLS/mTLS secrets here if you use them.
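
As a minimal sketch of the overall shape (placeholder hosts and names, not a working configuration; see the RDI reference for the full set of options), `config.yaml` pairs a named source with a Redis target:

```yaml
# Illustrative config.yaml skeleton with placeholder values.
sources:
  mysql:                          # unique name you choose for the source
    type: cdc
    connection:
      type: mysql
      host: mysql.example.com     # placeholder host
      port: 3306
targets:
  target:
    connection:
      type: redis
      host: redis.example.com     # placeholder Redis target host
      port: 12000
```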

## Job files

You can optionally supply one or more job files that specify how you want to
transform the captured data before writing it to the target.
Each job file contains a YAML
configuration that controls the transformation for a particular table from the source
database. For ingest pipelines, you can also add a `default-job.yaml` file to provide
a default transformation for tables that don't have a specific job file of their own.
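
For illustration, a job file for a hypothetical `Invoice` table might look something like the sketch below. The `add_field` transform and the `redis.write` output follow the patterns in the RDI reference; treat the exact option names as assumptions to verify there.

```yaml
# Sketch of a job file for a hypothetical Invoice table.
source:
  server_name: chinook            # hypothetical source name from config.yaml
  schema: public
  table: Invoice
transform:
  # Add a field with the upper-cased billing country to each record.
  - uses: add_field
    with:
      fields:
        - field: country
          expression: UPPER(BillingCountry)
          language: sql
output:
  - uses: redis.write
    with:
      data_type: json
      key:
        expression: concat(['invoice:', InvoiceId])
        language: jmespath
```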
@@ -165,13 +168,15 @@ available source, transform, and target configuration options and also a set
of example job configurations.

## Source preparation

Before using the pipeline, you must first prepare your source database to use
the Debezium connector for *change data capture (CDC)*. See the
[architecture overview]({{< relref "/integrate/redis-data-integration/ingest/architecture#overview" >}})
for more information about CDC.
Each database type has a different set of preparation steps. You can
find the preparation guides for the databases that RDI supports in the
[Prepare source databases]({{< relref "/integrate/redis-data-integration/ingest/data-pipelines/prepare-dbs" >}})
section.
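
For example (a sketch of the kind of change involved, not a substitute for the guides), preparing a MySQL source typically means making sure the binary log is enabled in row format:

```ini
# Illustrative my.cnf excerpt: Debezium's MySQL connector reads the binary log.
server-id     = 223344        # any unique, non-zero server ID
log_bin       = mysql-bin
binlog_format = ROW
```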

## Ingest pipeline lifecycle

@@ -16,15 +16,14 @@ type: integration
weight: 20
---

RDI automatically converts data that has a Debezium JSON schema into Redis types.
Some Debezium types require special conversion. For example:

- Date and Time types are converted to epoch time.
- Decimal numeric types are converted to strings so your app can use them
without losing precision.

The following Debezium logical types are supported:

- double
- float
@@ -42,10 +41,10 @@ The following Debezium logical types are currently handled:
- org.apache.kafka.connect.data.Decimal
- org.apache.kafka.connect.data.Time

These types are **not** supported and will return "Unsupported Error":

- io.debezium.time.interval

All other values are treated as plain strings.
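
As a purely illustrative example (hypothetical `Invoice` table; the exact representations depend on the source connector), a captured row might arrive with its values converted like this:

```yaml
# Hypothetical captured row after conversion, shown as YAML.
InvoiceId: 1
Total: "13.86"                # DECIMAL arrives as a string, preserving precision
InvoiceDate: 1609459200000    # timestamp converted to epoch time (assumed ms)
BillingCity: Oslo             # other values arrive as plain strings
```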

For more information, see the [full list of source database value conversions]({{<relref "/integrate/redis-data-integration/reference/data-types-conversion">}}).
@@ -0,0 +1,26 @@
---
Title: Prepare source databases
aliases: null
alwaysopen: false
categories:
- docs
- integrate
- rs
- rdi
description: Enable CDC features in your source databases
group: di
hideListLinks: false
linkTitle: Prepare source databases
summary: Redis Data Integration keeps Redis in sync with the primary database in near real time.
type: integration
weight: 30
---

Each database uses a different mechanism to track changes to its data, and these
mechanisms are generally not switched on by default.
RDI's Debezium collector uses these mechanisms for change data capture (CDC),
so you must prepare your source database before you can use it with RDI.

The pages in this section give detailed instructions to get your source
database ready for Debezium to use:
