feat: Add documentation on dataspaces, control plane, and data plane concepts (#186)

* Add doc

* Remove blank diagram

* Fix title names

* Update developer/wip/for-adopters/Data Plane Concepts.md

Co-authored-by: Paul Latzelsperger <[email protected]>

* Update developer/wip/for-adopters/Concepts.md

Co-authored-by: Paul Latzelsperger <[email protected]>

---------

Co-authored-by: Paul Latzelsperger <[email protected]>
jimmarino and paullatzelsperger authored Aug 14, 2024
1 parent 0f58248 commit 82d5a0e
Showing 20 changed files with 946 additions and 0 deletions.
30 changes: 30 additions & 0 deletions developer/wip/for-adopters/Concepts.md
@@ -0,0 +1,30 @@
These chapters cover the key EDC components and the abstractions EDC uses. After reading them, you will have a solid understanding of how EDC works.

## Dataspaces

A brief introduction to what a dataspace is and how it relates to EDC.

## Modules, Runtimes, and Components

An overview of the EDC modularity system.

## The Control Plane

Explains how data, policies, access control, and transfers are managed.

## The Data Plane

Describes how the EDC integrates with off-the-shelf protocols such as `HTTP`, `Kafka`, cloud object storage, and other technologies to transfer data between parties.

## The Identity Hub

Details how EDC implements decentralized identity, access control, and trust using standards such as [Decentralized Identifiers](https://www.w3.org/TR/did-core/) and [W3C Verifiable Credentials](https://www.w3.org/TR/vc-data-model/).

## The Federated Catalog

Covers how publishing and retrieving federated data catalogs works.

## General Concepts

Explains how to create deployment architectures, distributions, and extensions. This section also provides an overview of *Management Domains* and system configuration.


488 changes: 488 additions & 0 deletions developer/wip/for-adopters/Control Plane Concepts.md

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions developer/wip/for-adopters/Data Plane Concepts.md
@@ -0,0 +1,31 @@

A data plane is responsible for transmitting data using a wire protocol at the direction of the control plane. Data planes can vary greatly, from a simple serverless function to a data streaming platform or an API that clients access. One control plane may manage multiple data planes that specialize in the type of data sent or the wire protocol requested by the data consumer. This section provides an overview of how data planes work and the role they play in a dataspace.

## Separation of Concerns

Although a data plane can be co-located in the same process as a control plane, this is not a recommended setup. Typically, a data plane component is deployed as a separate set of instances to an independent environment such as a Kubernetes cluster. This allows the data plane to be operated and scaled independently from the control plane. At runtime, a data plane must register with a control plane, which in turn directs the data plane using the *Data Plane Signaling API*. EDC does not ship with an out-of-the-box data plane. Rather, it provides the *Data Plane Framework (DPF)*, a platform for building custom data planes. You can choose to start with the DPF or build your own data plane using your programming language of choice. In either case, understanding the data plane registration process and the Signaling API is the first step.

## Data Plane Registration

In the EDC model, control planes and data planes are dynamically associated. At startup, a data plane registers itself with a control plane using its component ID. Registration is idempotent and persisted, which makes it available to all clustered control plane runtimes. After a data plane is registered, the control plane periodically sends a heartbeat and culls the registration if the data plane becomes unavailable.

The data plane registration includes metadata about its capabilities, including:
- The supported wire protocols and transfer types, for example, "HTTP-based consumer pull" or "S3-based provider push".
- The supported data source types.

The control plane uses data plane metadata for two purposes. First, it is used to determine which data transfer types are available for an asset when generating a catalog. Second, the metadata is used to select a data plane when a transfer process is requested.
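To make this concrete, below is a minimal sketch of what a registration might carry. All type and field names are illustrative stand-ins for this document, not the actual EDC Data Plane Signaling types:

```java
// A minimal sketch of the metadata a data plane might submit when it
// registers with a control plane. All type and field names here are
// illustrative, not the actual EDC Data Plane Signaling types.
import java.util.Set;

public record DataPlaneRegistration(
        String componentId,                 // stable ID; re-registering with the same ID is idempotent
        String signalingEndpoint,           // URL the control plane uses to direct this data plane
        Set<String> supportedSourceTypes,   // data source types this data plane can read from
        Set<String> supportedTransferTypes  // e.g. HTTP-based consumer pull, S3-based provider push
) {
    // Hypothetical example registration for an HTTP/S3-capable data plane.
    public static DataPlaneRegistration example() {
        return new DataPlaneRegistration(
                "data-plane-1",
                "https://dataplane.example.com/api/signaling",
                Set.of("HttpData", "AmazonS3"),
                Set.of("HTTP-PULL", "S3-PUSH"));
    }
}
```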

## Data Plane Signaling

A control plane communicates with a data plane through a RESTful interface called the Data Plane Signaling API. Custom data planes can be written that integrate with the EDC control plane by implementing the registration protocol and the signaling API.

The Data Plane Signaling flow is shown below:

![[data-plane-signalling.png]]

When a transfer process is started and a data plane is selected, a start message is sent. If the transfer process is a consumer-pull type where data is accessed by the consumer, the response will contain an Endpoint Data Reference (EDR) with the coordinates to the data and an access token if one is required. The control plane may send additional signals, such as SUSPEND, RESUME, or TERMINATE, in response to events. For example, the control plane policy monitor could send a SUSPEND or TERMINATE message if a policy violation is encountered.
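As an illustration, the following sketch shows how a custom data plane might dispatch these signals. The enum and handler are simplified stand-ins, not the actual Signaling API types:

```java
// A simplified sketch of how a custom data plane might dispatch signals
// received from the control plane. The enum and method names are
// illustrative, not the actual Data Plane Signaling API.
public class SignalingDispatcher {

    enum Signal { START, SUSPEND, RESUME, TERMINATE }

    void onSignal(Signal signal, String transferProcessId) {
        switch (signal) {
            // For consumer-pull transfers, the START response would carry an
            // EDR with the data coordinates and an access token if required.
            case START -> startTransfer(transferProcessId);
            // Sent, for example, by the policy monitor on a policy violation.
            case SUSPEND -> suspendTransfer(transferProcessId);
            case RESUME -> resumeTransfer(transferProcessId);
            case TERMINATE -> terminateTransfer(transferProcessId);
        }
    }

    private void startTransfer(String id) { /* open the source/sink pipeline */ }
    private void suspendTransfer(String id) { /* pause without discarding state */ }
    private void resumeTransfer(String id) { /* continue a suspended transfer */ }
    private void terminateTransfer(String id) { /* tear down and release resources */ }
}
```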

## The Data Plane Framework (DPF)

EDC includes a framework for building custom data planes called the DPF. The DPF supports end-to-end streaming transfers (i.e., data content is streamed rather than materialized in memory) for scalability and supports both pull- and push-style transfers. The framework has extensibility points for supporting different data sources and sinks (e.g., S3, HTTP, Kafka) and can perform direct streaming between different source and sink types.
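The sketch below illustrates the source/sink principle behind the DPF using simplified interfaces invented for this example; the framework's real extension points differ in detail, but the streaming idea is the same:

```java
// An illustrative sketch of the source/sink principle behind the DPF.
// These interfaces are invented for this example; the real framework's
// extension points differ, but the streaming idea is the same.
import java.io.IOException;
import java.io.InputStream;

interface DataSource {
    InputStream open() throws IOException;  // e.g. an S3 object or HTTP response body
}

interface DataSink {
    void write(InputStream data) throws IOException; // e.g. an HTTP upload or Kafka producer
}

class StreamingPipeline {
    // Streams content from any source type to any sink type without
    // materializing it in memory, enabling direct S3 -> HTTP, HTTP -> Kafka,
    // and similar transfers.
    void transfer(DataSource source, DataSink sink) throws IOException {
        try (InputStream stream = source.open()) {
            sink.write(stream);
        }
    }
}
```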

The [EDC samples](https://github.com/eclipse-edc/Samples) contain examples of how to use the DPF.
61 changes: 61 additions & 0 deletions developer/wip/for-adopters/Dataspaces.md
@@ -0,0 +1,61 @@
The concept of a dataspace is the starting point for learning about the EDC. A dataspace is a *context* between one or more *participants* that share data. A participant is typically an organization, but it could be any entity, such as a service or machine.

### Dataspace Protocol (DSP): The Lingua Franca for Data Sharing

The messages exchanged in a dataspace are defined by the [Dataspace Protocol Specification (DSP)](https://github.com/eclipse-dataspace-protocol-base/DataspaceProtocol). EDC implements and builds on these asynchronous messaging patterns, so it will help to become acquainted with the specification. DSP defines how to retrieve data catalogs, conduct negotiations to create contract agreements that grant access to data, and send data over various lower-level wire protocols. While DSP focuses on the messaging layer for controlling data access, it does not specify how "trust" is established between participants. By trust, we mean the basis on which a provider decides to grant access to data, for example, by requiring the presentation of verifiable credentials issued by a third party. This is specified by the [Decentralized Claims Protocol (DCP)](https://github.com/eclipse-dataspace-dcp/decentralized-claims-protocol), which layers on DSP. We won't cover the two specifications here, other than to highlight a few key points that are essential to understanding how EDC works.
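As a rough mental model, the three DSP flows mentioned above can be summarized as follows (the names paraphrase the specification and are not exact DSP message types):

```java
// A rough mental model of the three DSP flows described above. The names
// paraphrase the specification; they are not exact DSP message types.
enum DspFlow {
    CATALOG_REQUEST,      // consumer retrieves the provider's data catalog
    CONTRACT_NEGOTIATION, // asynchronous exchange producing a contract agreement
    TRANSFER_PROCESS      // data sent over a lower-level wire protocol under that agreement
}
```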

After reading this document, we recommend consulting the DSP and DCP specifications for further information.

### The Question of Identity

One of the most important things to understand is how identities work in a dataspace and in EDC. A participant has a single identity, which is a URI. EDC supports multiple identity systems, including OAuth2 and the [Decentralized Claims Protocol (DCP)](https://github.com/eclipse-dataspace-dcp/decentralized-claims-protocol). If DCP is used, the identity will be a Web DID.
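For illustration, here is what a DCP-based participant identity looks like. The domain is hypothetical; resolution follows the `did:web` method:

```java
// Illustrative only: a participant identity expressed as a Web DID.
// The domain is hypothetical.
public class ParticipantIdentityExample {
    public static void main(String[] args) {
        String participantId = "did:web:connector.example.com";
        System.out.println("Participant identity: " + participantId);
        // Per the did:web method, this DID resolves to a DID document at:
        // https://connector.example.com/.well-known/did.json
    }
}
```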

An EDC component, such as a control plane, acts as a *participant agent*; in other words, it is a system that runs on behalf of a participant. Therefore, each component will use a single identity. This concept is important and nuanced. Let's consider several scenarios.

#### Simple Scenarios

##### Single Deployment

An organization deploys a single-instance control plane. This is the simplest possible setup, although it is not very reliable or scalable. In this scenario, the connector has exactly one identity. Now take the case where an organization decides on a more robust deployment with multiple control plane instances hosted as a Kubernetes `ReplicaSet`. The control plane instances still share the same identity.

##### Distributed Deployment

EDC supports the concept of *management domains*, which are realms of control. If different departments want to manage EDC components independently, the organization can define management domains where those components are deployed. Each management domain can be hosted on distinct Kubernetes clusters and potentially run in different cloud environments. Externally, the organization's EDC infrastructure appears as a unified whole, with a single top-level catalog containing multiple sub-catalogs and data sharing endpoints.

In this scenario, departments deploy their own control plane clusters. Again, each instance is configured with the same identity across all management domains.

#### Multiple Operating Units

In some dataspaces, a single legal entity may have multiple subdivisions operating independently. For example, a multinational may have autonomous operating units in different geographic regions with different data access rights. In this case, each operating unit is a dataspace participant with a distinct identity. EDC components deployed by each operating unit will be configured with different identities. From a dataspace perspective, each operating unit is a distinct entity.

### Common Misconceptions

#### Data transfers are only about sending static files

Data can be in a variety of forms. While the EDC can share static files, it also supports open-ended transfers such as streaming and API access. For example, many EDC use cases involve providing automated access to event streams or API endpoints, including pausing or terminating access based on continual evaluation of data use policies.

#### Dataspace software has to be installed

There is no such thing as dataspace "software" or a dataspace "application." A dataspace is a decentralized context. Participants deploy the EDC and communicate with other participant systems using DSP and DCP.

#### EDC adds a lot of overhead

EDC is designed as a lightweight, non-resource-intensive engine. EDC adds no overhead to data transmission since specialized wire protocols handle the latter. For example, EDC can be used to grant access to an API endpoint or data stream. Once access is obtained, the consumer can invoke the API directly or subscribe to a stream without requiring the request to be proxied through EDC components.

#### Cross-dataspace communication vs. interoperability

There is no such thing as cross-dataspace communication. All data sharing takes place within a dataspace. However, that does not mean there is no such thing as dataspace *interoperability*. Let's unpack this.

Consider two dataspaces, DS-1 and DS-2. It's possible for a participant P-A, a member of DS-1, to share data with P-B, a member of DS-2, under one of the following conditions:
- P-A is also a member of DS-2, or
- P-B is also a member of DS-1.

P-A shares data with P-B **in the context of** DS-1 or DS-2. Data does not flow between DS-1 and DS-2. It's possible for one EDC instance to operate within multiple dataspaces as long as its identity remains the same (if not, different EDC deployments will be needed).

Interoperability is different. Two dataspaces are interoperable if:

- They have compatible identity systems. For example, if both dataspaces use DCP and Web DIDs, or a form of OAuth2 with federation between the Identity Providers.
- They have a common set of verifiable credentials (or claims) and credential issuers.
- They have an agreed set of data sharing policies.

If these conditions are met, it is possible for a single connector deployment to participate in two dataspaces.




