docs: add chapter about sql persistence (#195)
paullatzelsperger authored Aug 16, 2024
1 parent 82d5a0e commit 7c5d46f
Showing 2 changed files with 290 additions and 41 deletions.
97 changes: 56 additions & 41 deletions developer/wip/for-contributors/contributor-handbook.md
# Contributor Documentation

<!-- TOC -->

* [0. Intended audience](#0-intended-audience)
* [1. Getting started](#1-getting-started)
* [1.1 Prerequisites](#11-prerequisites)
* [1.2 Terminology](#12-terminology)
* [1.3 Architectural and coding principles](#13-architectural-and-coding-principles)
* [2. The control plane](#2-the-control-plane)
* [2.1 Entities](#21-entities)
* [2.2 Programming Primitives](#22-programming-primitives)
* [2.3 Serialization via JSON-LD](#23-serialization-via-json-ld)
* [2.4 Extension model](#24-extension-model)
* [2.5 Dependency injection deep dive](#25-dependency-injection-deep-dive)
* [2.6 Service layers](#26-service-layers)
* [2.7 Policy Monitor](#27-policy-monitor)
* [2.8 Protocol extensions (DSP)](#28-protocol-extensions-dsp)
* [3. (Postgre-)SQL persistence](#3-postgre-sql-persistence)
* [4. The data plane](#4-the-data-plane)
* [4.1 Data plane signaling](#41-data-plane-signaling)
* [4.2 Data plane self-registration](#42-data-plane-self-registration)
* [4.3 Public API authentication](#43-public-api-authentication)
* [4.4 Writing a custom data plane extension (sink/source)](#44-writing-a-custom-data-plane-extension-sinksource)
* [4.5 Writing a custom data plane (using only DPS)](#45-writing-a-custom-data-plane-using-only-dps)
* [5. Development best practices](#5-development-best-practices)
* [5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests](#51-writing-unit--component--integration--api--endtoend-tests)
* [5.2 Other best practices](#52-other-best-practices)
* [6. Further concepts](#6-further-concepts)
* [6.2 Autodoc](#62-autodoc)
* [6.3 Adapting the Gradle build](#63-adapting-the-gradle-build)

<!-- TOC -->

## 0. Intended audience
Detailed documentation about the EDC service layers can be found [here](./control-plane/service-layers.md).

### 2.8 Protocol extensions (DSP)

## 3. (Postgre-)SQL persistence

PostgreSQL is a very popular open-source database with broad community and vendor adoption. It is also EDC's data
persistence technology of choice.

Every [store](./control-plane/service-layers.md#5-data-persistence) in the EDC that is intended to persist state comes
with two out-of-the-box implementations:

- in-memory
- sql (PostgreSQL dialect)

By default, the [in-memory stores](./control-plane/service-layers.md#51-in-memory-stores) are provided by dependency
injection; the SQL variants can be used by simply adding the relevant extensions (e.g. `asset-index-sql`,
`contract-negotiation-store-sql`, ...) to the classpath.

Detailed documentation about EDC's PostgreSQL implementations, covering translation mappings, querying, JSON field
mappers, etc., can be found [here](./postgres-persistence.md).

## 4. The data plane

### 4.1 Data plane signaling

### 4.2 Data plane self-registration

### 4.3 Public API authentication

### 4.4 Writing a custom data plane extension (sink/source)

### 4.5 Writing a custom data plane (using only DPS)

## 5. Development best practices

### 5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests

test pyramid...

### 5.2 Other best practices

-> link to best practices doc

## 6. Further concepts

### 6.2 Autodoc

### 6.3 Adapting the Gradle build
234 changes: 234 additions & 0 deletions developer/wip/for-contributors/postgres-persistence.md
# EDC Data Persistence with PostgreSQL

<!-- TOC -->
* [EDC Data Persistence with PostgreSQL](#edc-data-persistence-with-postgresql)
* [1. Configuring DataSources](#1-configuring-datasources)
* [1.1 Using custom datasource in stores](#11-using-custom-datasource-in-stores)
* [2. SQL Statement abstraction](#2-sql-statement-abstraction)
* [3. Querying PostgreSQL databases](#3-querying-postgresql-databases)
* [3.1 The canonical form](#31-the-canonical-form)
* [3.2 Translation Mappings](#32-translation-mappings)
* [3.2.1 Mapping primitive fields](#321-mapping-primitive-fields)
* [3.2.2 Mapping complex objects](#322-mapping-complex-objects)
* [Option 1: using foreign keys](#option-1-using-foreign-keys)
* [Option 2a: encoding the object](#option-2a-encoding-the-object)
* [Option 2b: encoding lists/arrays](#option-2b-encoding-listsarrays)
<!-- TOC -->

By default, the `in-memory` stores are provided by dependency injection; the `sql` implementations can be used by
simply registering the relevant extensions (e.g. `asset-index-sql`, `contract-negotiation-store-sql`, ...).

## 1. Configuring DataSources

To use the `sql` extensions, a `DataSource` is needed; it must be registered with the `DataSourceRegistry` service.

The `sql-pool-apache-commons` extension is responsible for creating and registering pooled data sources based on the
configuration. At least one data source named `"default"` is required.

```properties
edc.datasource.default.url=...
edc.datasource.default.name=...
edc.datasource.default.password=...
```

It is **recommended** to hold these values in the Vault rather than in configuration. The config key (e.g.
`edc.datasource.default.url`) serves as the secret alias. If no vault entries are found for these keys, they will be
obtained from the configuration. This is **unsafe** and should be avoided!

Other datasources can be defined using the same settings structure:

```properties
edc.datasource.<datasource-name>.url=...
edc.datasource.<datasource-name>.name=...
edc.datasource.<datasource-name>.password=...
```

`<datasource-name>` is a string that can then be referenced in a store's configuration to select a specific data source.
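
To illustrate, an extension could later obtain such a named datasource from the `DataSourceRegistry`. This is only a
sketch: the datasource name `inventory` and the extension class are hypothetical, assuming the registry's `resolve`
method from the EDC SPI.

```java
// Sketch: resolving a pooled DataSource that was created from the
// "edc.datasource.inventory.*" configuration entries (name is illustrative).
public class MyStoreExtension implements ServiceExtension {

    @Inject
    private DataSourceRegistry dataSourceRegistry;

    @Override
    public void initialize(ServiceExtensionContext context) {
        var dataSource = dataSourceRegistry.resolve("inventory");
        // hand the DataSource over to the store implementation...
    }
}
```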

### 1.1 Using custom datasource in stores

Using a custom datasource in a store can be done by configuring the setting:

```properties
edc.sql.store.<store-context>.datasource=<datasource-name>
```

Note that `<store-context>` can be an arbitrary string, but it is recommended to use a descriptive name. For example,
the `SqlPolicyStoreExtension` defines a data source name as follows:

```java
@Extension("SQL policy store")
public class SqlPolicyStoreExtension implements ServiceExtension {

    @Setting(value = "The datasource to be used", defaultValue = DataSourceRegistry.DEFAULT_DATASOURCE)
    public static final String DATASOURCE_NAME = "edc.sql.store.policy.datasource";

    @Override
    public void initialize(ServiceExtensionContext context) {
        var datasourceName = context.getConfig().getString(DATASOURCE_NAME, DataSourceRegistry.DEFAULT_DATASOURCE);
        //...
    }
}
```
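
A deployment could then point the policy store at a custom datasource purely through configuration, for example (a
sketch; all values and the datasource name `policies` are illustrative):

```properties
# illustrative values - the datasource name "policies" is arbitrary
edc.datasource.policies.url=jdbc:postgresql://localhost:5432/edc
edc.datasource.policies.name=...
edc.datasource.policies.password=...
edc.sql.store.policy.datasource=policies
```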

## 2. SQL Statement abstraction

EDC does not use any sort of Object-Relational Mapper (ORM), which would automatically translate Java object graphs to
SQL statements. Instead, EDC uses pre-canned, parameterized SQL statements.

We typically distinguish between literals such as table names or column names and "templates", which are SQL statements
such as `INSERT`.

Both are declared as getters in an interface that extends the `SqlStatements` interface, with literals being `default` methods and templates being implemented by a `BaseSqlDialectStatements` class.

A simple example could look like this:
```java
public class BaseSqlDialectStatements implements SomeEntityStatements {

    @Override
    public String getDeleteByIdTemplate() {
        return executeStatement().delete(getSomeEntityTable(), getIdColumn());
    }

    @Override
    public String getUpdateTemplate() {
        return executeStatement()
                .column(getIdColumn())
                .column(getSomeStringFieldColumn())
                .column(getCreatedAtColumn())
                .update(getSomeEntityTable(), getIdColumn());
    }
    //...
}
```
Note that the example makes use of the `SqlExecuteStatement` utility class, which should be used to construct all SQL
statements - _except queries_. Queries are special in that they have a highly dynamic aspect to them. For more
information, please read on in [this chapter](#3-querying-postgresql-databases).
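
For completeness, the statements interface implemented above might look roughly like this (a sketch; the entity, table
and column names are invented):

```java
// Sketch: literals as default methods, templates left to the dialect
// implementation (names are invented).
public interface SomeEntityStatements extends SqlStatements {

    default String getSomeEntityTable() {
        return "edc_some_entity";
    }

    default String getIdColumn() {
        return "id";
    }

    default String getSomeStringFieldColumn() {
        return "some_string_field";
    }

    default String getCreatedAtColumn() {
        return "created_at";
    }

    String getDeleteByIdTemplate();

    String getUpdateTemplate();
}
```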

As a general rule of thumb, issuing multiple statements (within one transaction) should be preferred over writing
complex nested statements. It is very easy to inadvertently create an inefficient or wasteful statement that causes high
resource load on the database server. The latency that is introduced by sending multiple statements to the DB server is
likely negligible in comparison, especially because EDC is architected towards reliability rather than latency.
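
As a sketch, a store method following this rule might issue two simple statements within one transaction, assuming
EDC's `TransactionContext` and `QueryExecutor` services; the statement templates and entity fields are invented:

```java
// Sketch: two simple statements in a single transaction instead of one
// complex nested statement.
public void update(SomeEntity entity) {
    transactionContext.execute(() -> {
        try (var connection = dataSourceRegistry.resolve(dataSourceName).getConnection()) {
            queryExecutor.execute(connection, statements.getDeleteByIdTemplate(), entity.getId());
            queryExecutor.execute(connection, statements.getInsertTemplate(), entity.getId(), entity.getSomeStringField());
        } catch (SQLException e) {
            throw new EdcPersistenceException(e);
        }
    });
}
```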

## 3. Querying PostgreSQL databases

Generally speaking, the basis for all queries is a `QuerySpec` object. This means that at some point a `QuerySpec` must
be translated into an SQL `SELECT` statement. The place to do this is the `SqlStatements` implementation, often called
`BaseSqlDialectStatements`:

```java
@Override
public SqlQueryStatement createQuery(QuerySpec querySpec) {
    var select = "SELECT * FROM %s".formatted(getSomeEntityTable());
    return new SqlQueryStatement(select, querySpec, new SomeEntityMapping(this), operatorTranslator);
}
```

Now, there are a few things to unpack here:
- the `SELECT` statement serves as the starting point for the query
- individual `WHERE` clauses get added by parsing the `filterExpression` property of the `QuerySpec`
- `LIMIT` and `OFFSET` clauses get appended based on `QuerySpec#offset` and `QuerySpec#limit`
- the `SomeEntityMapping` maps the canonical form onto the SQL literals
- the `operatorTranslator` is used to convert operators such as `=` or `like` into SQL operators
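
To illustrate, this is roughly how such a `createQuery` implementation could be used (a sketch; the entity and field
names are invented, and the exact rendered SQL may differ):

```java
// Sketch: a QuerySpec with a filter expression plus paging, translated
// into a parameterized SELECT statement.
var querySpec = QuerySpec.Builder.newInstance()
        .filter(List.of(new Criterion("someStringField", "=", "hello")))
        .limit(10)
        .offset(0)
        .build();

var statement = statements.createQuery(querySpec);
var sql = statement.getQueryAsString();  // e.g. "SELECT * FROM edc_some_entity WHERE some_string_field = ? LIMIT ? OFFSET ?"
var params = statement.getParameters();  // e.g. ["hello", 10, 0]
```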

### 3.1 The canonical form

Theoretically it is possible to map every schema onto every other schema, given that they are of equal cardinality. To
achieve that, EDC introduces the notion of a _canonical form_, which is our internal working schema for entities. In
other words, this is the schema in which objects are represented internally. If we ever support a wider variety of
translation and transformation paths, everything would have to be transformed into that canonical format first.

In actuality, the _canonical form_ of an object is defined by the Java class and its field names. For instance, a query
for contract negotiations must be specified using the field names of a `ContractNegotiation` object:

```java
public class ContractNegotiation {
    // ...
    private ContractAgreement contractAgreement;
    // ...
}

public class ContractAgreement {
    // ...
    private final String assetId;
}
```

Consequently, `contractAgreement.assetId` would be valid, whereas `contract_agreement.asset_id` would be invalid. In
other words, the left-hand operand looks as if we were traversing the Java object graph. This is what we call the
_canonical form_. Note the omission of the root object `contractNegotiation`!
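
A query against that canonical form could then look like this (a sketch; the asset id is invented):

```java
// Canonical-form filter: the left-hand operand traverses the Java
// object graph, omitting the root object.
var query = QuerySpec.Builder.newInstance()
        .filter(List.of(new Criterion("contractAgreement.assetId", "=", "asset-1")))
        .build();
```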

### 3.2 Translation Mappings

Translation mappings are EDC's way to map a `QuerySpec` to SQL statements. At their core, they contain a `Map`
associating each Java entity field name with the related SQL column name.

In order to decouple the canonical form from the SQL schema (or any other database schema), a mapping scheme exists to
map the canonical model onto the SQL model. This `TranslationMapping` is essentially a graph-like metamodel of the
entities: every Java entity has a related mapping class that contains its field names and the associated SQL column
names. The convention is to append `*Mapping` to the class name, e.g. `PolicyDefinitionMapping`.
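
A sketch of such a mapping class, following that convention (the entity and its columns are invented; `TranslationMapping`
is the base class mentioned above):

```java
// Hypothetical mapping for "SomeEntity": canonical (Java) field names
// on the left, SQL column names on the right.
public class SomeEntityMapping extends TranslationMapping {

    public SomeEntityMapping(SomeEntityStatements statements) {
        add("id", statements.getIdColumn());
        add("someStringField", statements.getSomeStringFieldColumn());
        add("createdAt", statements.getCreatedAtColumn());
    }
}
```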

#### 3.2.1 Mapping primitive fields

Primitive fields are stored directly as columns in SQL tables. Thus, mapping primitive data types is trivial: a simple
one-to-one mapping suffices. For example, `ContractNegotiation.counterPartyAddress` would be represented in the
`ContractNegotiationMapping` as an entry

```java
"counterPartyAddress"->"counterparty_address"
```

When constructing `WHERE/AND` clauses, the canonical property is simply replaced by the respective SQL column name.

#### 3.2.2 Mapping complex objects

For fields of a complex type, such as the `ContractNegotiation.contractAgreement` field, the mapping depends on how the
relational data model is defined. There are two basic variants we use:

#### Option 1: using foreign keys

In this case, the referenced object is stored in a separate table using a foreign key relation. Thus, the canonical
property (`contractAgreement`) is mapped onto the SQL schema using another `*Mapping` class. Here, this would be the
`ContractAgreementMapping`. When resolving a property in the canonical format (`contractAgreement.assetId`), this means
we must recursively descend into the model graph and resolve the correct SQL expression.
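
In a mapping class, such a complex field simply maps onto another `*Mapping` object, e.g. (a sketch, using the names
from above):

```java
// Sketch: the nested mapping lets "contractAgreement.assetId" be
// resolved by descending into the ContractAgreementMapping.
add("contractAgreement", new ContractAgreementMapping(statements));
```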

> Note: mapping `one-to-many` relations (= arrays/lists) with foreign keys is not implemented at this time.

#### Option 2a: encoding the object

Another popular way to store complex objects is to encode them in JSON and store them in a `VARCHAR` column. In
PostgreSQL we use the dedicated `JSON` type instead of `VARCHAR`. For example, the `TransferProcess` is stored in a table
called `edc_transfer_process`, and its `DataAddress` property is encoded in JSON and stored in a `JSON` field.

When querying for `TransferProcess` objects and mapping the filter expression
`contentDataAddress.properties.somekey=somevalue`, the `contentDataAddress` is represented as JSON; therefore, in the
`TransferProcessMapping` the `contentDataAddress` field maps to a `JsonFieldTranslator`:

```java
public TransferProcessMapping(TransferProcessStoreStatements statements) {
    // ...
    add(FIELD_CONTENTDATAADDRESS, new JsonFieldTranslator(statements.getContentDataAddressColumn()));
    // ...
}
```

which would then get translated to:

```sql
SELECT *
FROM edc_transfer_process
-- omit LEFT OUTER JOIN for readability
WHERE content_data_address -> 'properties' ->> 'somekey' = 'somevalue'
```

_Note that JSON queries are specific to PostgreSQL and are not portable to other database technologies!_

#### Option 2b: encoding lists/arrays

Like accessing objects, accessing lists/arrays of objects is possible using special JSON operators. In this case, the
Postgres-specific function `json_array_elements()` is used. Please refer to the [official
documentation](https://www.postgresql.org/docs/9.5/functions-json.html).

For an example of how this is done, please look at how the `TransferProcessMapping` maps a `ResourceManifest`, which in
turn contains a `List<ResourceDefinition>` using the `ResourceManifestMapping`.
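
A hedged sketch of what such a query can look like (the column and JSON key names are illustrative, not the actual EDC
schema):

```sql
-- Sketch: match transfer processes whose resource manifest contains a
-- resource definition with a given id (names are illustrative).
SELECT *
FROM edc_transfer_process
WHERE EXISTS (SELECT 1
              FROM json_array_elements(resource_manifest -> 'definitions') AS definition
              WHERE definition ->> 'id' = 'some-definition-id');
```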
