diff --git a/developer/wip/for-contributors/contributor-handbook.md b/developer/wip/for-contributors/contributor-handbook.md index a73d9de2..e84ff25a 100644 --- a/developer/wip/for-contributors/contributor-handbook.md +++ b/developer/wip/for-contributors/contributor-handbook.md @@ -1,36 +1,34 @@ # Contributor Documentation - * [Contributor Documentation](#contributor-documentation) - * [0. Intended audience](#0-intended-audience) - * [1. Getting started](#1-getting-started) - * [1.1 Prerequisites](#11-prerequisites) - * [1.2 Terminology](#12-terminology) - * [1.3 Architectural and coding principles](#13-architectural-and-coding-principles) - * [2. The control plane](#2-the-control-plane) - * [2.1 Entities](#21-entities) - * [2.2 Programming Primitives](#22-programming-primitives) - * [2.3 Serialization via JSON-LD](#23-serialization-via-json-ld) - * [2.4 Extension model](#24-extension-model) - * [2.5 Dependency injection deep dive](#25-dependency-injection-deep-dive) - * [2.6 Service layers](#26-service-layers) - * [2.7 Policy Monitor](#27-policy-monitor) - * [2.8 Protocol extensions (DSP)](#28-protocol-extensions-dsp) - * [3. (Postgre-)SQL persistence](#3-postgre-sql-persistence) - * [4. The data plane](#4-the-data-plane) - * [4.1 Data plane signaling](#41-data-plane-signaling) - * [4.2 Data plane self-registration](#42-data-plane-self-registration) - * [4.3 Public API authentication](#43-public-api-authentication) - * [4.4 Writing a custom data plane extension (sink/source)](#44-writing-a-custom-data-plane-extension-sinksource) - * [4.5 Writing a custom data plane (using only DPS)](#45-writing-a-custom-data-plane-using-only-dps) - * [5. Development best practices](#5-development-best-practices) - * [5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests](#51-writing-unit--component--integration--api--endtoend-tests) - * [5.1 Other best practices](#51-other-best-practices) - * [6. Further concepts](#6-further-concepts) - * [6.2 Autodoc](#62-autodoc) - * [6.3 Adapting the Gradle build](#63-adapting-the-gradle-build) - + * [0. Intended audience](#0-intended-audience) + * [1. Getting started](#1-getting-started) + * [1.1 Prerequisites](#11-prerequisites) + * [1.2 Terminology](#12-terminology) + * [1.3 Architectural and coding principles](#13-architectural-and-coding-principles) + * [2. The control plane](#2-the-control-plane) + * [2.1 Entities](#21-entities) + * [2.2 Programming Primitives](#22-programming-primitives) + * [2.3 Serialization via JSON-LD](#23-serialization-via-json-ld) + * [2.4 Extension model](#24-extension-model) + * [2.5 Dependency injection deep dive](#25-dependency-injection-deep-dive) + * [2.6 Service layers](#26-service-layers) + * [2.7 Policy Monitor](#27-policy-monitor) + * [2.8 Protocol extensions (DSP)](#28-protocol-extensions-dsp) + * [3. (Postgre-)SQL persistence](#3-postgre-sql-persistence) + * [4. The data plane](#4-the-data-plane) + * [4.1 Data plane signaling](#41-data-plane-signaling) + * [4.2 Data plane self-registration](#42-data-plane-self-registration) + * [4.3 Public API authentication](#43-public-api-authentication) + * [4.4 Writing a custom data plane extension (sink/source)](#44-writing-a-custom-data-plane-extension-sinksource) + * [4.5 Writing a custom data plane (using only DPS)](#45-writing-a-custom-data-plane-using-only-dps) + * [5. 
Development best practices](#5-development-best-practices) + * [5.1 Writing Unit-, Component-, Integration-, Api-, EndToEnd-Tests](#51-writing-unit--component--integration--api--endtoend-tests) + * [5.1 Other best practices](#51-other-best-practices) + * [6. Further concepts](#6-further-concepts) + * [6.2 Autodoc](#62-autodoc) + * [6.3 Adapting the Gradle build](#63-adapting-the-gradle-build) ## 0. Intended audience @@ -171,7 +169,7 @@ Detailed documentation about the policy monitor can be found [here](./control-pl ### 2.8 Protocol extensions (DSP) -### 3. (Postgre-)SQL persistence +## 3. (Postgre-)SQL persistence PostgreSQL is a very popular open-source database and it has a large community and vendor adoption. It is also EDCs data persistence technology of choice. @@ -192,6 +190,8 @@ Detailed documentation about EDCs PostgreSQL implementations can be found [here] ### 4.1 Data plane signaling +Data Plane Signaling (DPS) is the communication protocol that is used between control planes and data planes. + ### 4.2 Data plane self-registration ### 4.3 Public API authentication diff --git a/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling.md b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling.md new file mode 100644 index 00000000..1fa703fc --- /dev/null +++ b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling.md @@ -0,0 +1,311 @@ +# The Data Plane Signaling interface + + +* [The Data Plane Signaling interface](#the-data-plane-signaling-interface) + * [1. `DataAddress` and `EndpointDataReference`](#1-dataaddress-and-endpointdatareference) + * [2. Signaling protocol messages and API endpoints](#2-signaling-protocol-messages-and-api-endpoints) + * [2.1 `START`](#21-start) + * [2.2 `SUSPEND`](#22-suspend) + * [2.3 `TERMINATE`](#23-terminate) + * [3. Data plane public API](#3-data-plane-public-api) + * [3.1 Endpoints and endpoint resolution](#31-endpoints-and-endpoint-resolution) + * [3.2 Public API Access Control](#32-public-api-access-control) + * [3.3 Token expiry and renewal](#33-token-expiry-and-renewal) + * [4. Data plane registration](#4-data-plane-registration) + * [5. Data plane selection](#5-data-plane-selection) + + +Data plane signaling (DPS) defines a set of API endpoints and message types which are used for communication between a +control plane and a data plane to control data flows. + +## 1. `DataAddress` and `EndpointDataReference` + +When the control plane signals to the data plane to start a client pull transfer process, the data plane returns a +`DataAddress`. This is only true for consumer-pull transfers - provider-push transfers **do not return** a +`DataAddress`. This `DataAddress` contains information the client can use to resolve the provider's data plane endpoint. +It also contains an access token (cf. [authorization](#32-public-api-access-control)). + +This `DataAddress` is returned by the provider control plane to the consumer in a `TransferProcessStarted` DSP message. +Its purpose is to inform the consumer where they can obtain the data, and which authorization token to use. + +The `EndpointDataReference` is a data structure that is used on the consumer side to contain all the relevant +information of the `DataAddress` and some additional information associated with the transfer, such as asset ID and +contract ID. Note that this is only the case if the consumer is implemented using EDC.
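To make this concrete, the following is an illustrative sketch of the information an `EndpointDataReference` entry might hold on the consumer side; the field names are simplified examples derived from the description above, not a normative schema:

```json
{
  "transferProcessId": "transfer-id",
  "assetId": "asset-id",
  "contractId": "contract-id",
  "endpoint": "https://provider-dataplane-host/public",
  "authorization": "<access-token>"
}
```

A consumer-side EDC typically persists such entries so that the data can later be pulled using the contained endpoint and token.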
+ +A transfer process may be `STARTED` multiple times (e.g., after it is temporarily `SUSPENDED`). In that case, the consumer may receive +a different `DataAddress` object as part of each start message. The consumer must **always** create a new `EDR` from +these messages and remove the previous `EDR`. Data plane implementations may choose to pass the same `DataAddress` or an +updated one. + +This start signaling pattern can be used to change a data plane's endpoint address, for example, after a software +upgrade or a load balancer switch-over. + +## 2. Signaling protocol messages and API endpoints + +All requests support idempotent behavior. Data planes must therefore perform request de-duplication. After a data plane +commits a request, it will return an ack to the control plane, which will transition the `TransferProcess` to its next +state (e.g., `STARTED`, `SUSPENDED`, `TERMINATED`). If a successful ack is not received, the control plane will resend +the request during a subsequent tick period. + +### 2.1 `START` + +During the transfer process `STARTING` phase, the provider control plane selects a data plane using the +`DataFlowController` implementations it has available, which will then send a `DataFlowStartMessage` to the data plane. + +The control plane (i.e. the `DataFlowController`) records which data plane was selected for the transfer process so that +it can properly route subsequent start, stop, and terminate requests. + +For client pull transfers, the data plane returns a `DataAddress` with an access token. + +If the data flow was previously `SUSPENDED`, the data plane may elect to return the same `DataAddress` or create a new +one. + +The provider control plane sends a `DataFlowStartMessage` to the provider data plane: + +```http request +POST https://dataplane-host:port/api/signaling/v1/dataflows +Content-Type: application/json + +{ + "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" }, + "@id": "transfer-id", + "@type": "DataFlowStartMessage", + "processId": "process-id", + "datasetId": "dataset-id", + "participantId": "participant-id", + "agreementId": "agreement-id", + "transferType": "HttpData-PULL", + "sourceDataAddress": { + "type": "HttpData", + "baseUrl": "https://jsonplaceholder.typicode.com/todos" + }, + "destinationDataAddress": { + "type": "HttpData", + "baseUrl": "https://jsonplaceholder.typicode.com/todos" + }, + "callbackAddress" : "http://control-plane", + "properties": { + "key": "value" + } +} +``` + +The data plane responds with a `DataFlowResponseMessage`, which contains the public endpoint, the authorization token and +possibly other information in the form of a `DataAddress`. For more information about how access tokens are generated, +please refer to [this chapter](#32-public-api-access-control). + +### 2.2 `SUSPEND` + +During the transfer process `SUSPENDING` phase, the `DataFlowController` will send a `DataFlowSuspendMessage` to the +data plane. The data plane will transition the data flow to the `SUSPENDED` state and invalidate the associated access +token. + +```http request +POST https://dataplane-host:port/api/signaling/v1/dataflows +Content-Type: application/json + +{ + "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" }, + "@type": "DataFlowSuspendMessage", + "reason": "reason" +} +``` + +### 2.3 `TERMINATE` + +During the transfer process `TERMINATING` phase, the `DataFlowController` will send a `DataFlowTerminateMessage` to the +data plane.
The data plane will transition the data flow to the `TERMINATED` state and invalidate the associated access +token. + +```http request +POST https://dataplane-host:port/api/signaling/v1/dataflows +Content-Type: application/json + +{ + "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" }, + "@type": "DataFlowTerminateMessage", + "reason": "reason" +} +``` + + +## 3. Data plane public API + +One popular use case for data transmission is where the provider organization exposes a REST API where consumers can +download data. We call this an "Http-PULL" transfer. This is especially useful for structured data, such as JSON, and it +can even be used to model _streaming_ data. + +To achieve that, the provider data plane can expose a "public API" that takes REST requests and satisfies them by +pulling data out of a `DataSource`, which it obtains by verifying and parsing the `Authorization` token (see [this +chapter](#32-public-api-access-control) for details). + +### 3.1 Endpoints and endpoint resolution + +### 3.2 Public API Access Control + +The design of the EDC Data Plane Framework is based on **non-renewable** access tokens. One access token will be +maintained for the period a transfer process is in the `STARTED` state. This duration may be a single request or a +series of requests spanning an indefinite period of time ("streaming"). + +Other data plane implementations may choose to support renewable tokens. Token renewal is often used as a strategy for +controlling access duration and mitigating leaked tokens. The EDC implementation handles access duration and +mitigates the risk of leaked tokens in the following ways. + +#### 3.2.1 Access Duration + +Access duration is controlled by the transfer process and contract agreement, not the token. If a transfer process is +moved from the `STARTED` to the `SUSPENDED`, `TERMINATED`, or `COMPLETED` state, the access token will no longer be +valid. Similarly, if a contract agreement is violated or otherwise invalidated, a cascade operation will terminate all +associated transfer processes. + +To achieve that, the data plane maintains a list of currently active/valid tokens. + +#### 3.2.2 Leaked Access Tokens + +If an access token is leaked or otherwise compromised, its associated transfer process is placed in the `TERMINATED` +state and a new one is started. In order to mitigate the possibility of ongoing data access when a leak is not +discovered, a data plane may implement token renewal. Limited-duration contract agreements and transfer processes may +also be used. For example, a transfer process could be terminated after a period of time by the provider and the +consumer can initiate a new process before or after that period. + +#### 3.2.3 Access Token Generation + +When the `DataFlowManager` receives a `DataFlowStartMessage` to start the data transmission, it uses the +`DataPlaneAuthorizationService` to generate an access token (in JWT format) and a `DataAddress`, which contains the +following information: + +- `endpoint`: the URL of the public API +- `endpointType`: should be `https://w3id.org/idsa/v4.1/HTTP` for HTTP pull transfers +- `authorization`: the newly generated access token. + +
+ DataAddress with access token + + ```json + { + "dspace:dataAddress": { + "@type": "dspace:DataAddress", + "dspace:endpointType": "https://w3id.org/idsa/v4.1/HTTP", + "dspace:endpoint": "http://example.com", + "dspace:endpointProperties": [ + { + "@type": "dspace:EndpointProperty", + "dspace:name": "https://w3id.org/edc/v0.0.1/ns/authorization", + "dspace:value": "token" + }, + { + "@type": "dspace:EndpointProperty", + "dspace:name": "https://w3id.org/edc/v0.0.1/ns/authType", + "dspace:value": "bearer" + } + ] + } + } + ``` +

+ +This `DataAddress` is returned in the `DataFlowResponseMessage` as mentioned [here](#21-start). With that alone, the data plane +would not be able to determine whether a token has been revoked or invalidated, so it must also _record_ the access token. + +To that end, the EDC data plane stores an `AccessTokenData` object that contains the token, the source `DataAddress` and +some information about the bearer of the token, specifically: + +- agreement ID +- asset ID +- transfer process ID +- flow type (`push` or `pull`) +- participant ID (of the consumer) +- transfer type (see [later sections](#4-data-plane-registration) for details) + +The token creation flow is illustrated by the following sequence diagram: + +![](./data-plane-signaling_create-token.png) + +#### 3.2.4 Access Token Validation and Revocation + +When the consumer executes a REST request against the provider data plane's public API, it must send the previously +received access token (inside the `DataAddress`) in the `Authorization` header. + +The data plane then attempts to resolve the `AccessTokenData` object associated with that token and checks that the +token is valid. + +The authorization flow is illustrated by the following sequence diagram: + +![](data-plane-signaling_authorize.png) + +A default implementation of the `DataPlaneAccessControlService` is provided that always returns `true`. Extensions can supply alternative implementations +that perform use-case-specific authorization checks. + +> Please note that the `DataPlaneAccessControlService` implementation must handle all request types (including transport +> types) in a data plane runtime. If multiple access check implementations are required, creating a multiplexer or +> individual data plane runtimes is recommended. + +> Note that in EDC, the access control check (step 8) always returns `true`! + +In order to _revoke_ the token with immediate effect, it is enough to delete the `AccessTokenData` object from the +database. This is done using the `DataPlaneAuthorizationService` as well. + +### 3.3 Token expiry and renewal + +EDC does **not currently implement** token expiry and renewal, so this section is intended for developers who wish to +provide a custom data plane. + +To implement token renewal, the recommended approach is to create an extension that exposes a refresh endpoint which can be +used by consumers. The URL of this refresh endpoint could be encoded in the original `DataAddress` in the +`dspace:endpointProperties` field. +> In any case, this will be a dataspace-specific solution, so administrative steps are +> required to achieve interoperability. + +## 4. Data plane registration + +The life cycle of a data plane is decoupled from the life cycle of a control plane. That means they can be started, +paused/resumed and stopped at different points in time. In clustered deployments, this is very likely the default +situation. With this, it is also possible to add or remove individual data planes at any time. + +When data planes come online, they _register_ with the control plane using the `DataPlaneSelectorControlApi`. Each +data plane sends a `DataPlaneInstance` object that contains information about its supported transfer types, supported +source types, URL, the data plane's `component ID` and other properties. + +From then on, the control plane sends periodic heartbeats to the data plane. + +## 5. Data plane selection + +During data plane self-registration, the control plane builds a list of `DataPlaneInstance` objects, each of which +represents one (logical) data plane component.
Note that these are _logical_ instances; that means even replicated +runtimes still only count as _one_ instance. + +In a periodic task, the control plane engages a state machine, the `DataPlaneSelectorManager`, to manage the state of each +registered data plane. To do that, it simply sends a REST request to the `/v1/dataflows/check` endpoint of the data +plane. If that returns successfully, the data plane is still up and running. +If not, the control plane will consider the data plane as "unavailable". + +In addition to availability, the control plane also records the _capabilities_ of each data plane, i.e. which +source data types and transfer types are supported. Each data plane must declare where it can transfer data _from_ +(source type, e.g. `AmazonS3`) and _how_ it can transfer data (transfer type, e.g. `Http-PULL`). + + +### 5.1 Building the catalog + +The data plane selection directly influences the contents of the catalog: for example, let's say that a particular +provider can transmit an asset either via HTTP (pull) or via S3 (push); each of these variants would then be +represented in the catalog as an individual `Distribution`. + +Upon building the catalog, the control plane checks for each `Asset` whether the `Asset.dataAddress.type` field is +contained in the list of `allowedTransferTypes` of each `DataPlaneInstance`. + +In the example above, at least one data plane has to have `Http-PULL` in its `allowedTransferTypes`, and at least one +has to have `AmazonS3-PUSH`. Note that one data plane could have _both_ entries. + +### 5.2 Fulfilling data requests + +Before a `START` message is sent from the control plane to a data plane via the Signaling API, the control plane first +checks which data planes can fulfill the request. If multiple data planes can fulfill the request, the `selectionStrategy` is +employed to determine the actual data plane. + +This check is necessary because a `START` message could contain a transfer type that is not supported by any of the +data planes, or because all data planes that could fulfill the request are unavailable. + +This algorithm is called _data plane selection_. + +> Selection strategies can be added via extensions, using the `SelectionStrategyRegistry`. By default, a data plane is +> selected at random.
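To tie registration (section 4) and selection together, here is an illustrative sketch of the kind of `DataPlaneInstance` a data plane might register with the control plane. The property names (in particular `allowedSourceTypes`, which is not mentioned above) and the exact payload shape are assumptions for illustration, not a normative schema:

```json
{
  "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" },
  "@id": "dataplane-1",
  "url": "https://dataplane-host:port/api/signaling/v1/dataflows",
  "allowedSourceTypes": [ "HttpData", "AmazonS3" ],
  "allowedTransferTypes": [ "Http-PULL", "AmazonS3-PUSH" ]
}
```

A data plane registered like this would contribute both the `Http-PULL` and the `AmazonS3-PUSH` variants from the catalog example in section 5.1, and it would be a selection candidate for `START` messages carrying either transfer type.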
\ No newline at end of file diff --git a/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.png b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.png new file mode 100644 index 00000000..33461b97 Binary files /dev/null and b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.png differ diff --git a/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.puml b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.puml new file mode 100644 index 00000000..1232f5bf --- /dev/null +++ b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_authorize.puml @@ -0,0 +1,49 @@ +' +' Copyright (c) 2024 Bayerische Motoren Werke Aktiengesellschaft (BMW AG) +' +' This program and the accompanying materials are made available under the +' terms of the Apache License, Version 2.0 which is available at +' https://www.apache.org/licenses/LICENSE-2.0 +' +' SPDX-License-Identifier: Apache-2.0 +' +' Contributors: +' Amadeus - Initial Version +' + +@startuml + + +skinParam NoteBackgroundColor WhiteSmoke +skinParam NoteFontColor Black +skinParam ParticipantBackgroundColor WhiteSmoke +skinParam ActorBackgroundColor WhiteSmoke +skinParam AgentBackgroundColor White +skinParam AgentBorderColor SkyBlue +skinparam shadowing false +skinparam BoxPadding 10 + +autonumber + +participant DataPlaneAuthorizationService as dpauths +participant DataPlaneAccessTokenService as dpats +participant DataPlaneAccessControlService as dpacs +participant TokenValidationService as tvs + +[-> dpauths: <> +activate dpauths +dpauths -> tvs: validate token w/ own public key +activate tvs +return Result +dpauths -> dpats: <> +activate dpats +dpats -> dpats: restore AccessTokenData \nfrom raw token +return AccessTokenData +dpauths -> dpacs: <> +dpacs --> dpacs: perform access control +dpacs -> dpauths: return true | false + + +return Result +deactivate dpauths +@enduml \ No newline at end of file diff --git a/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.png b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.png new file mode 100644 index 00000000..e693b72d Binary files /dev/null and b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.png differ diff --git a/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.puml b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.puml new file mode 100644 index 00000000..afcd67a1 --- /dev/null +++ b/developer/wip/for-contributors/data-plane/data-plane-signaling/data-plane-signaling_create-token.puml @@ -0,0 +1,49 @@ +' +' Copyright (c) 2024 Bayerische Motoren Werke Aktiengesellschaft (BMW AG) +' +' This program and the accompanying materials are made available under the +' terms of the Apache License, Version 2.0 which is available at +' https://www.apache.org/licenses/LICENSE-2.0 +' +' SPDX-License-Identifier: Apache-2.0 +' +' Contributors: +' Amadeus - Initial Version +' + +@startuml + + +skinParam NoteBackgroundColor WhiteSmoke +skinParam NoteFontColor Black +skinParam ParticipantBackgroundColor WhiteSmoke +skinParam ActorBackgroundColor WhiteSmoke +skinParam AgentBackgroundColor White +skinParam AgentBorderColor SkyBlue +skinparam 
shadowing false +skinparam BoxPadding 10 + +autonumber + +participant DataPlaneAuthorizationService as dpauths +participant DataPlaneAccessTokenService as dpats +participant TokenGenerationService as tgs + +[-> dpauths: <> +activate dpauths +dpauths -> dpauths: convert DataFlowStartMessage to \nTokenParameters + DataAddress +dpauths -> dpats: <> +activate dpats + +dpats->dpats: convert TokenParameters -> TokenDecorator[] + +dpats -> tgs: <> +activate tgs +return TokenRepresentation + +dpats --> dpauths: return TokenRepresentation \n(containing serialized JWT) +deactivate dpats +[<-- dpauths: return serialized JWT +deactivate dpauths + +@enduml \ No newline at end of file