docs: add chapter about sql persistence (#195)
paullatzelsperger authored Aug 16, 2024
1 parent 82d5a0e commit 7c5d46f
Showing 2 changed files with 290 additions and 41 deletions.
97 changes: 56 additions & 41 deletions developer/wip/for-contributors/contributor-handbook.md
# Contributor Documentation

<!-- TOC -->

* [0. Intended audience](#0-intended-audience)
* [1. Getting started](#1-getting-started)
* [1.1 Prerequisites](#11-prerequisites)
* [1.2 Terminology](#12-terminology)
* [1.3 Architectural and coding principles](#13-architectural-and-coding-principles)
* [2. The control plane](#2-the-control-plane)
* [2.1 Entities](#21-entities)
* [2.2 Programming Primitives](#22-programming-primitives)
* [2.3 Serialization via JSON-LD](#23-serialization-via-json-ld)
* [2.4 Extension model](#24-extension-model)
* [2.5 Dependency injection deep dive](#25-dependency-injection-deep-dive)
* [2.6 Service layers](#26-service-layers)
* [2.7 Policy Monitor](#27-policy-monitor)
* [2.8 Protocol extensions (DSP)](#28-protocol-extensions-dsp)
* [3. (Postgre-)SQL persistence](#3-postgre-sql-persistence)
* [4. The data plane](#4-the-data-plane)
* [4.1 Data plane signaling](#41-data-plane-signaling)
* [4.2 Data plane self-registration](#42-data-plane-self-registration)
* [4.3 Public API authentication](#43-public-api-authentication)
* [4.4 Writing a custom data plane extension (sink/source)](#44-writing-a-custom-data-plane-extension-sinksource)
* [4.5 Writing a custom data plane (using only DPS)](#45-writing-a-custom-data-plane-using-only-dps)
* [5. Development best practices](#5-development-best-practices)
* [5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests](#51-writing-unit--component--integration--api--endtoend-tests)
* [5.2 Other best practices](#52-other-best-practices)
* [6. Further concepts](#6-further-concepts)
* [6.2 Autodoc](#62-autodoc)
* [6.3 Adapting the Gradle build](#63-adapting-the-gradle-build)

<!-- TOC -->

## 0. Intended audience
Detailed documentation about the EDC service layers can be found [here](./control-plane/service-layers.md).

### 2.8 Protocol extensions (DSP)

## 3. (Postgre-)SQL persistence

PostgreSQL is a very popular open-source database with broad community and vendor adoption. It is also EDC's data
persistence technology of choice.

Every [store](./control-plane/service-layers.md#5-data-persistence) in the EDC that is intended to persist state comes
with two out-of-the-box implementations:

- in-memory
- sql (PostgreSQL dialect)

By default, the [in-memory stores](./control-plane/service-layers.md#51-in-memory-stores) are provided by dependency
injection; the SQL variants can be used by simply adding the relevant extensions (e.g. `asset-index-sql`,
`contract-negotiation-store-sql`, ...) to the classpath.

Detailed documentation about EDC's PostgreSQL implementations, covering translation mappings, querying, JSON field
mappers, etc., can be found [here](./postgres-persistence.md).

## 4. The data plane

### 4.1 Data plane signaling

### 4.2 Data plane self-registration

### 4.3 Public API authentication

### 4.4 Writing a custom data plane extension (sink/source)

### 4.5 Writing a custom data plane (using only DPS)

## 5. Development best practices

### 5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests

test pyramid...

### 5.2 Other best practices

-> link to best practices doc

## 6. Further concepts

### 6.2 Autodoc

### 6.3 Adapting the Gradle build
234 changes: 234 additions & 0 deletions developer/wip/for-contributors/postgres-persistence.md
# EDC Data Persistence with PostgreSQL

<!-- TOC -->
* [EDC Data Persistence with PostgreSQL](#edc-data-persistence-with-postgresql)
* [1. Configuring DataSources](#1-configuring-datasources)
* [1.1 Using custom datasource in stores](#11-using-custom-datasource-in-stores)
* [2. SQL Statement abstraction](#2-sql-statement-abstraction)
* [3. Querying PostgreSQL databases](#3-querying-postgresql-databases)
* [3.1 The canonical form](#31-the-canonical-form)
* [3.2 Translation Mappings](#32-translation-mappings)
* [3.2.1 Mapping primitive fields](#321-mapping-primitive-fields)
* [3.2.2 Mapping complex objects](#322-mapping-complex-objects)
* [Option 1: using foreign keys](#option-1-using-foreign-keys)
* [Option 2a: encoding the object](#option-2a-encoding-the-object)
* [Option 2b: encoding lists/arrays](#option-2b-encoding-listsarrays)
<!-- TOC -->

By default, the `in-memory` stores are provided by dependency injection; the `sql` implementations can be used by
simply registering the relevant extensions (e.g. `asset-index-sql`, `contract-negotiation-store-sql`, ...).

## 1. Configuring DataSources

To use the `sql` extensions, a `DataSource` is needed; it must be registered with the `DataSourceRegistry` service.

The `sql-pool-apache-commons` extension is responsible for creating and registering pooled data sources based on the
configuration. At least one data source named `"default"` is required.

```properties
edc.datasource.default.url=...
edc.datasource.default.name=...
edc.datasource.default.password=...
```

It is **recommended** to hold these values in the Vault rather than in configuration. The config key (e.g.
`edc.datasource.default.url`) serves as the secret alias. If no vault entries are found for these keys, they will be
obtained from the configuration. This is **unsafe** and should be avoided!

Other datasources can be defined using the same settings structure:

```properties
edc.datasource.<datasource-name>.url=...
edc.datasource.<datasource-name>.name=...
edc.datasource.<datasource-name>.password=...
```

`<datasource-name>` is a string that can then be referenced in a store's configuration to select a specific data source.
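
To illustrate, an extension could later obtain such a named datasource from the `DataSourceRegistry`. This is only a
sketch: the datasource name `inventory` and the extension class are hypothetical, assuming the registry's `resolve`
method from the EDC SPI.

```java
// Sketch: resolving a pooled DataSource that was created from the
// "edc.datasource.inventory.*" configuration entries (name is illustrative).
public class MyStoreExtension implements ServiceExtension {

    @Inject
    private DataSourceRegistry dataSourceRegistry;

    @Override
    public void initialize(ServiceExtensionContext context) {
        var dataSource = dataSourceRegistry.resolve("inventory");
        // hand the DataSource over to the store implementation...
    }
}
```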

### 1.1 Using custom datasource in stores

Using a custom datasource in a store can be done by configuring the setting:

```properties
edc.sql.store.<store-context>.datasource=<datasource-name>
```

Note that `<store-context>` can be an arbitrary string, but it is recommended to use a descriptive name. For example,
the `SqlPolicyStoreExtension` defines a data source name as follows:

```java
@Extension("SQL policy store")
public class SqlPolicyStoreExtension implements ServiceExtension {

    @Setting(value = "The datasource to be used", defaultValue = DataSourceRegistry.DEFAULT_DATASOURCE)
    public static final String DATASOURCE_NAME = "edc.sql.store.policy.datasource";

    @Override
    public void initialize(ServiceExtensionContext context) {
        var datasourceName = context.getConfig().getString(DATASOURCE_NAME, DataSourceRegistry.DEFAULT_DATASOURCE);
        //...
    }
}
```
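
A deployment could then point the policy store at a custom datasource purely through configuration, for example (a
sketch; all values and the datasource name `policies` are illustrative):

```properties
# illustrative values - the datasource name "policies" is arbitrary
edc.datasource.policies.url=jdbc:postgresql://localhost:5432/edc
edc.datasource.policies.name=...
edc.datasource.policies.password=...
edc.sql.store.policy.datasource=policies
```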

## 2. SQL Statement abstraction

EDC does not use any sort of Object-Relational Mapper (ORM), which would automatically translate Java object graphs to
SQL statements. Instead, EDC uses pre-canned, parameterized SQL statements.

We typically distinguish between literals such as table names or column names and "templates", which are SQL statements
such as `INSERT`.

Both are declared as getters in an interface that extends the `SqlStatements` interface, with literals being `default` methods and templates being implemented by a `BaseSqlDialectStatements` class.

A simple example could look like this:
```java
public class BaseSqlDialectStatements implements SomeEntityStatements {

    @Override
    public String getDeleteByIdTemplate() {
        return executeStatement().delete(getSomeEntityTable(), getIdColumn());
    }

    @Override
    public String getUpdateTemplate() {
        return executeStatement()
                .column(getIdColumn())
                .column(getSomeStringFieldColumn())
                .column(getCreatedAtColumn())
                .update(getSomeEntityTable(), getIdColumn());
    }
    //...
}
```
Note that the example makes use of the `SqlExecuteStatement` utility class, which should be used to construct all SQL
statements - _except queries_. Queries are special in that they have a highly dynamic aspect to them. For more
information, please read on in [this chapter](#3-querying-postgresql-databases).
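
For completeness, the statements interface implemented above might look roughly like this (a sketch; the entity, table
and column names are invented):

```java
// Sketch: literals as default methods, templates left to the dialect
// implementation (names are invented).
public interface SomeEntityStatements extends SqlStatements {

    default String getSomeEntityTable() {
        return "edc_some_entity";
    }

    default String getIdColumn() {
        return "id";
    }

    default String getSomeStringFieldColumn() {
        return "some_string_field";
    }

    default String getCreatedAtColumn() {
        return "created_at";
    }

    String getDeleteByIdTemplate();

    String getUpdateTemplate();
}
```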

As a general rule of thumb, issuing multiple statements (within one transaction) should be preferred over writing
complex nested statements. It is very easy to inadvertently create an inefficient or wasteful statement that causes high
resource load on the database server. The latency that is introduced by sending multiple statements to the DB server is
likely negligible in comparison, especially because EDC is architected towards reliability rather than latency.
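
As a sketch, a store method following this rule might issue two simple statements within one transaction, assuming
EDC's `TransactionContext` and `QueryExecutor` services; the statement templates and entity fields are invented:

```java
// Sketch: two simple statements in a single transaction instead of one
// complex nested statement.
public void update(SomeEntity entity) {
    transactionContext.execute(() -> {
        try (var connection = dataSourceRegistry.resolve(dataSourceName).getConnection()) {
            queryExecutor.execute(connection, statements.getDeleteByIdTemplate(), entity.getId());
            queryExecutor.execute(connection, statements.getInsertTemplate(), entity.getId(), entity.getSomeStringField());
        } catch (SQLException e) {
            throw new EdcPersistenceException(e);
        }
    });
}
```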

## 3. Querying PostgreSQL databases

Generally speaking, the basis for all queries is a `QuerySpec` object. This means that at some point a `QuerySpec` must
be translated into an SQL `SELECT` statement. The place to do this is the `SqlStatements` implementation, often called
`BaseSqlDialectStatements`:

```java
@Override
public SqlQueryStatement createQuery(QuerySpec querySpec) {
    var select = "SELECT * FROM %s".formatted(getSomeEntityTable());
    return new SqlQueryStatement(select, querySpec, new SomeEntityMapping(this), operatorTranslator);
}
```

Now, there are a few things to unpack here:
- the `SELECT` statement serves as the starting point for the query
- individual `WHERE` clauses get added by parsing the `filterExpression` property of the `QuerySpec`
- `LIMIT` and `OFFSET` clauses get appended based on `QuerySpec#offset` and `QuerySpec#limit`
- the `SomeEntityMapping` maps the canonical form onto the SQL literals
- the `operatorTranslator` is used to convert operators such as `=` or `like` into SQL operators
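
To illustrate, this is roughly how such a `createQuery` implementation could be used (a sketch; the entity and field
names are invented, and the exact rendered SQL may differ):

```java
// Sketch: a QuerySpec with a filter expression plus paging, translated
// into a parameterized SELECT statement.
var querySpec = QuerySpec.Builder.newInstance()
        .filter(List.of(new Criterion("someStringField", "=", "hello")))
        .limit(10)
        .offset(0)
        .build();

var statement = statements.createQuery(querySpec);
var sql = statement.getQueryAsString();  // e.g. "SELECT * FROM edc_some_entity WHERE some_string_field = ? LIMIT ? OFFSET ?"
var params = statement.getParameters();  // e.g. ["hello", 10, 0]
```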

### 3.1 The canonical form

Theoretically it is possible to map every schema onto every other schema, given that they are of equal cardinality. To
achieve that, EDC introduces the notion of a _canonical form_, which is our internal working schema for entities. In
other words, this is the schema in which objects are represented internally. If we ever support a wider variety of
translation and transformation paths, everything would have to be transformed into that canonical format first.

In actuality, the _canonical form_ of an object is defined by the Java class and its field names. For instance, a query
for contract negotiations must be specified using the field names of a `ContractNegotiation` object:

```java
public class ContractNegotiation {
    // ...
    private ContractAgreement contractAgreement;
    // ...
}

public class ContractAgreement {
    // ...
    private final String assetId;
}
```

Consequently, `contractAgreement.assetId` would be valid, whereas `contract_agreement.asset_id` would be invalid. In
other words, the left-hand operand looks as if we were traversing the Java object graph. This is what we call the
_canonical form_. Note the omission of the root object `contractNegotiation`!
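
A query against that canonical form could then look like this (a sketch; the asset id is invented):

```java
// Canonical-form filter: the left-hand operand traverses the Java
// object graph, omitting the root object.
var query = QuerySpec.Builder.newInstance()
        .filter(List.of(new Criterion("contractAgreement.assetId", "=", "asset-1")))
        .build();
```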

### 3.2 Translation Mappings

Translation mappings are EDC's way to map a `QuerySpec` to SQL statements. At their core, they contain a `Map`
associating each Java entity field name with the related SQL column name.

In order to decouple the canonical form from the SQL schema (or any other database schema), a mapping scheme exists to
map the canonical model onto the SQL model. This `TranslationMapping` is essentially a graph-like metamodel of the
entities: every Java entity has a related mapping class that contains its field names and the associated SQL column
names. The convention is to append `*Mapping` to the class name, e.g. `PolicyDefinitionMapping`.
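
A sketch of such a mapping class, following that convention (the entity and its columns are invented; `TranslationMapping`
is the base class mentioned above):

```java
// Hypothetical mapping for "SomeEntity": canonical (Java) field names
// on the left, SQL column names on the right.
public class SomeEntityMapping extends TranslationMapping {

    public SomeEntityMapping(SomeEntityStatements statements) {
        add("id", statements.getIdColumn());
        add("someStringField", statements.getSomeStringFieldColumn());
        add("createdAt", statements.getCreatedAtColumn());
    }
}
```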

#### 3.2.1 Mapping primitive fields

Primitive fields are stored directly as columns in SQL tables. Thus, mapping primitive data types is trivial: a simple
one-to-one mapping suffices. For example, `ContractNegotiation.counterPartyAddress` would be represented in the
`ContractNegotiationMapping` as an entry

```java
"counterPartyAddress"->"counterparty_address"
```

When constructing `WHERE/AND` clauses, the canonical property is simply replaced by the respective SQL column name.

#### 3.2.2 Mapping complex objects

For fields of a complex type, such as the `ContractNegotiation.contractAgreement` field, the mapping depends on how the
relational data model is defined. There are two basic variants we use:

#### Option 1: using foreign keys

In this case, the referenced object is stored in a separate table using a foreign key relation. Thus, the canonical
property (`contractAgreement`) is mapped onto the SQL schema using another `*Mapping` class. Here, this would be the
`ContractAgreementMapping`. When resolving a property in the canonical format (`contractAgreement.assetId`), this means
we must recursively descend into the model graph and resolve the correct SQL expression.
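
In a mapping class, such a complex field simply maps onto another `*Mapping` object, e.g. (a sketch, using the names
from above):

```java
// Sketch: the nested mapping lets "contractAgreement.assetId" be
// resolved by descending into the ContractAgreementMapping.
add("contractAgreement", new ContractAgreementMapping(statements));
```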

> Note: mapping `one-to-many` relations (= arrays/lists) with foreign keys is not implemented at this time.

#### Option 2a: encoding the object

Another popular way to store complex objects is to encode them in JSON and store them in a `VARCHAR` column. In
PostgreSQL we use the dedicated `JSON` type instead of `VARCHAR`. For example, the `TransferProcess` is stored in a table
called `edc_transfer_process`, and its `DataAddress` property is encoded in JSON and stored in a `JSON` field.

When querying for `TransferProcess` objects and mapping the filter expression
`contentDataAddress.properties.somekey=somevalue`, the `contentDataAddress` is represented as JSON; therefore, in the
`TransferProcessMapping` the `contentDataAddress` field maps to a `JsonFieldTranslator`:

```java
public TransferProcessMapping(TransferProcessStoreStatements statements) {
    // ...
    add(FIELD_CONTENTDATAADDRESS, new JsonFieldTranslator(statements.getContentDataAddressColumn()));
    // ...
}
```

which would then get translated to:

```sql
SELECT *
FROM edc_transfer_process
-- omit LEFT OUTER JOIN for readability
WHERE content_data_address -> 'properties' ->> 'somekey' = 'somevalue'
```

_Note that JSON queries are specific to PostgreSQL and are not portable to other database technologies!_

#### Option 2b: encoding lists/arrays

Like accessing objects, accessing lists/arrays of objects is possible using special JSON operators. In this case, the
Postgres-specific function `json_array_elements()` is used. Please refer to the [official
documentation](https://www.postgresql.org/docs/9.5/functions-json.html).

For an example of how this is done, please look at how the `TransferProcessMapping` maps a `ResourceManifest`, which in
turn contains a `List<ResourceDefinition>` using the `ResourceManifestMapping`.
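
A hedged sketch of what such a query can look like (the column and JSON key names are illustrative, not the actual EDC
schema):

```sql
-- Sketch: match transfer processes whose resource manifest contains a
-- resource definition with a given id (names are illustrative).
SELECT *
FROM edc_transfer_process
WHERE EXISTS (SELECT 1
              FROM json_array_elements(resource_manifest -> 'definitions') AS definition
              WHERE definition ->> 'id' = 'some-definition-id');
```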
