Expand documentation
Based on review feedback, expand on the documentation.
- Include a high-level overview of the job lifecycle
- Add details on data rates
- Clarify the ordering behaviour
- Add reference to OAuth RFC
- 201 -> 20x for success codes
- Remove implementation details for OAuth flow
  It's documented more thoroughly elsewhere and we should encourage use
  of a framework or third-party authorization server.
calumcalder committed Dec 13, 2024
1 parent e5ae96f commit 0d4bee3
Showing 1 changed file with 81 additions and 39 deletions: extensions/data-transfer/portability-data-transfer-generic/README.md
# Generic Importers

## Overview

The Data Transfer Project transfer worker manages the transfer of data between data exporter and data importer services. \
The transfer of some set of user data is encapsulated by a transfer job. Each job consists of many data items from a particular data vertical (CALENDAR, PHOTOS, etc.). See the [Schemas](#schemas) section for the list of data verticals supported by Generic Importers.

Transfer jobs are initiated by a user, generally on the platform owned by the exporter service. When processing a transfer job the transfer worker pulls data items from the exporter service and pushes those data items to the importer service. \
The interfaces for pulling data from exporters and pushing data to importers are implemented in DTP by `Exporter` modules and `Importer` modules respectively. This allows the DTP worker to transfer data between services agnostic of protocol and encoding, and without assumptions about the storage or organisation of data at the destination.

The typical way to integrate with DTP as a data importer is to implement an `Importer` module through an extension that pushes data to a new or existing API. This requires writing and maintaining a Java extension, but gives complete control over how the module interacts with your API. \
As an alternative path for integration, the Generic Importers extension provides an `Importer` module implementation that can be configured to call any endpoint conforming to the API described in the [Generic Importer API](#generic-importer-api) section. This allows developers to focus on languages and frameworks they are familiar with, at the expense of control over the importer's behaviour and the ability to reuse existing APIs.

## Generic Importer API

The Generic Importer API uses HTTP as the application layer protocol and encodes most request data as `application/json`. Data items are sent one per request.

Your service should accept POST requests at a given base URL, with a kebab-case sub-path specific to each data vertical (`https://example.com/import/blobs`, `https://example.com/import/social-posts`). \
See the [HTTP Interface](#http-interface) section for details on the API.
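
For illustration, here is a minimal sketch of such a service in Python using Flask; the `/import` base path, the handler bodies, and the port are assumptions for the sketch, while the kebab-case sub-path per data vertical is what the API prescribes.

```python
# A minimal importer service sketch (assumes Flask: pip install flask).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/import/social-posts", methods=["POST"])
def import_social_post():
    item = request.get_json()  # basic data types arrive as application/json
    # ... persist `item` against the authenticated user's account ...
    return jsonify({}), 201

@app.route("/import/media", methods=["POST"])
def import_media():
    # file-based data types arrive as multipart/related; see the
    # HTTP Interface section for how to interpret the payload
    return jsonify({}), 201

if __name__ == "__main__":
    app.run(port=8080)
```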

When a transfer job is created, credentials to call your service endpoint are obtained using OAuth 2.0 (see [Authentication and Authorization](#authentication-and-authorization) below).
Your service should therefore support OAuth 2.0 or integrate with another service which provides these features.

### DTP Behaviour

This section describes how the DTP platform behaves, and in particular how the Generic `Importer` module works, to help guide decisions about how you implement your importer service.

#### Job Life-cycle

Transfer jobs are created by users on the platform hosting the DTP transfer worker, which is generally owned by the exporter. \
During job creation the platform needs to obtain credentials to call the exporting service and importing service on behalf of the user, which for importer services using Generic Importers means using the OAuth 2.0 Authorization Code flow discussed in [Authentication and Authorization](#authentication-and-authorization). \
Once a DTP transfer worker picks up the transfer job it begins sending POST requests to your endpoint containing user data items to associate with their account, authenticated and authorized using the access token obtained at job creation. \
The job continues until all items have been sent.

> [!NOTE]
> The Generic Importers extension currently provides no mechanism for informing an importer service of the beginning or end of a transfer job.

#### Ordering

Data items are sent to your importer service in the same order that they are exported from the exporter service. This makes the order of data items transferred in a job an implementation detail of the `Exporter` module and corresponding exporter service; however, there is a helpful convention that Exporters tend to follow: data items that are containers of other items, such as photo albums, are conventionally exported before the data items contained inside them (or at the latest in the same page). \
The Generic `Importer` module also follows this convention; exporters export data in pages (`ContainerResource`s, for those reading the code), and the Generic `Importer` module will import the albums in a page before the photos in that page.

##### Failures on dependent items

In some circumstances the transfer of a data item that is a container of other data items may fail. \
The DTP transfer worker makes a best effort to transfer all data and does not track dependencies between data items; importer services may not require the dependency to be satisfied, or may have their own way of recovering from such failures.
For this reason, if a container data item fails to transfer, the data items contained within that container will still be transferred. It is the responsibility of the importer service to decide how to handle this case.
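
As a sketch of one possible recovery strategy (the `albumId` field and the in-memory stores here are illustrative assumptions, not part of the API):

```python
# Hypothetical handling of a photo whose containing album never arrived,
# e.g. because the album's transfer failed: attach the photo to a stand-in
# album rather than rejecting it, so no data is lost.
albums: dict[tuple[str, str], dict] = {}  # (user_id, album_id) -> album
photos: list[tuple[str, dict]] = []       # (user_id, photo)

def import_photo(user_id: str, photo: dict) -> None:
    album_id = photo.get("albumId")
    if album_id is not None and (user_id, album_id) not in albums:
        # Container item missing: create a stand-in before accepting the photo.
        albums[(user_id, album_id)] = {"name": "Recovered album"}
    photos.append((user_id, photo))
```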

#### Concurrency

To keep the API simple, data items in a transfer job are sent by the `Importer` module to your HTTP service sequentially, one data item per request. \
Your API may receive multiple concurrent requests, even for the same user, if there are multiple concurrent transfer jobs to your service.
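
A service that wants to serialise writes per user can do so itself, for example with a per-user lock. This sketch is process-local and assumes the user id comes from your token validation; a multi-process deployment would need a distributed lock instead.

```python
import threading
from collections import defaultdict

# One lock per user; concurrent transfer jobs for different users proceed
# in parallel, while items for the same user are applied one at a time.
user_locks: defaultdict[str, threading.Lock] = defaultdict(threading.Lock)

def import_item(user_id: str, item: dict) -> None:
    with user_locks[user_id]:
        ...  # persist `item` for `user_id` (hypothetical storage call)
```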

#### Data Rates

The limiting factors for data import rate are the speeds of the export service and your HTTP service. Since export services batch their exports to the DTP transfer worker and are typically hosted in or near the same data-center as the transfer worker, in practice this means data will be sent as fast as your HTTP service can accept it. \
If your service cannot accept data at the rate the DTP transfer worker sends it, strategies for managing the load include throttling and rate-limiting.

Your service can throttle the DTP transfer worker by setting a maximum request rate per unit of time and queueing requests that exceed that quota until the next time period. Since the Generic `Importer` module sends data items sequentially, the next data item will not be sent until your service has completed a HTTP response for the current item, making this an effective strategy to slow the transfer worker down. \
Note that throttling requests for too long may cause the HTTP request to time-out.

Alternatively or additionally, your service can rate limit the DTP worker by responding with a 429 Too Many Requests status code when the transfer worker is sending data too quickly. The request will be retried with a delay as described in the [Retries](#retries) section. \
Note that the transfer worker will not retry sending a given data item indefinitely, so this may result in lost data.
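
A sketch of both strategies using a fixed-window counter follows; the window size and capacity are assumptions to tune for your service, and this naive version is not thread-safe (guard it with a lock under concurrent requests).

```python
import time

WINDOW_SECONDS = 1.0          # assumed window size
MAX_REQUESTS_PER_WINDOW = 10  # assumed capacity

_window_start = time.monotonic()
_count = 0

def admit(throttle: bool = True) -> bool:
    """True: handle the request now. False: respond 429 Too Many Requests."""
    global _window_start, _count
    now = time.monotonic()
    if now - _window_start >= WINDOW_SECONDS:
        _window_start, _count = now, 0  # new window
    _count += 1
    if _count <= MAX_REQUESTS_PER_WINDOW:
        return True
    if throttle:
        # Sleeping delays our response, and therefore the worker's next
        # request; keep this well below the worker's HTTP timeout.
        time.sleep(_window_start + WINDOW_SECONDS - now)
        return True
    return False
```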

#### Retries

If a request sent to your HTTP service fails it will usually be retried. The retry and back-off logic is based on a configuration set for the instance of the DTP transfer worker that processes the transfer job. The retry logic is therefore typically configured by the owners of each exporter service.

The DTP transfer worker's default retry strategy is defined in `distributions/demo-server/src/main/resources/config/retry/default.yaml`. At the time of writing, the default strategy is an exponential back-off with a 1.5x multiplier, starting at 1000ms, with up to 5 retry attempts. \
If the default retry behaviour is not sufficient for your service, please contact the owner of the exporter service to configure a more specific retry strategy for your service, or create an issue or pull request to make this configurable per import service.
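
For illustration, that default works out to roughly the following waits between attempts; exactly how the worker applies the multiplier is determined by its retry configuration.

```python
# Approximate back-off schedule: 1000ms initial interval, 1.5x multiplier,
# up to 5 retry attempts.
delays_ms = [round(1000 * 1.5 ** n) for n in range(5)]
print(delays_ms)  # [1000, 1500, 2250, 3375, 5062]
```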

### HTTP Interface

The Generic `Importer` module maps each data item in a job to one HTTP request made to your HTTP service. \
Data will be POSTed to your endpoint with a `Content-Type` of either `application/json` for basic data-types, or `multipart/related` for file-based data-types. See below for how to interpret each of these. \
Your endpoint should return a 20x status code if the resource has been created, 40x for errors caused by the Importer, or 50x for service errors. See [Endpoint Errors](#endpoint-errors) below for details on returning errors.
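
A sketch of dispatching on the request's media type follows; `handle_basic_item` and `handle_file_item` are hypothetical helpers, and note that most web frameworks (Flask included) do not parse `multipart/related` for you.

```python
from flask import Flask, request

app = Flask(__name__)

def handle_basic_item(item: dict):
    return {}, 201  # persist the JSON item (elided)

def handle_file_item(raw_body: bytes):
    # Feed the raw body to a MIME parser (e.g. the standard library's email
    # package) to split the JSON metadata part from the file data part.
    return {}, 201

@app.route("/import/media", methods=["POST"])
def import_media():
    if request.mimetype == "application/json":
        return handle_basic_item(request.get_json())   # basic data type
    if request.mimetype == "multipart/related":
        return handle_file_item(request.get_data())    # file-based data type
    return {"error": "unsupported media type"}, 415
```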

#### Basic data types

For basic data-types the Importer will send a POST request with a `Content-Type: application/json` header. The body of the POST request will be a UTF-8 encoded JSON payload conforming to the relevant data-type schema detailed in the [Schemas](#schemas) section.

For example, below is a full request for a SOCIAL-POSTS item (JSON formatted for readability):

```http
POST /import/social-posts HTTP/1.1
```

Endpoint: `/media`

The MEDIA vertical describes a user's photos and videos, and the albums that contain them.

Albums are conventionally imported before the photos and videos contained in them.

#### Basic Data Types

Endpoint: `/blobs`

The BLOBS vertical represents arbitrary file data and folder structures.

Folders are conventionally imported before any of the data they contain, which may include folders and/or files.

#### Basic Data Types

Endpoint: `/calendar`

The CALENDAR vertical describes calendars and events on those calendars.

Calendars are conventionally imported before the events associated with them.

All data exposed by the CALENDAR vertical is encoded as a basic data type.

Endpoint: `/social-posts`

The SOCIAL-POSTS vertical represents posts made by the user on a social media platform.

Only the `SocialActivity` data type is exposed by the SOCIAL-POSTS vertical, which is a basic data type.

The combination of the HTTP response code and `error` field can be used to encode specific error cases.

## Authentication and Authorization

Generic Importers support the [OAuth 2.0 Authorization Code Flow](https://datatracker.ietf.org/doc/html/rfc6749#section-1.3.1); platforms will direct users to your OAuth authorization page, requesting an authorization code with the OAuth scopes defined by your importer configuration, and then exchange the authorization code for an access token.
The access token will be sent as a Bearer token in the HTTP `Authorization` header with all import requests made by the DTP transfer worker.

The authorization server may be the same as your HTTP service, or a separate server that your service integrates with. \
When implementing your importer service there is likely an existing OAuth library that integrates with your chosen framework, or a third-party authorization service you can integrate with.
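
If your service does validate tokens itself, the per-request check is small. In this sketch `validate_access_token` is a hypothetical hook that would delegate to your authorization server (token introspection, JWT verification, or similar).

```python
from flask import request

def validate_access_token(token: str) -> bool:
    ...  # delegate to your authorization server (assumption)
    return bool(token)

def require_bearer_token():
    """Return an error response tuple, or None if the request is authorized."""
    auth = request.headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme != "Bearer" or not validate_access_token(token):
        return {"error": "unauthorized"}, 401
    return None
```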

### Token Refresh

The worker will request a token refresh through the standard [OAuth refresh token flow](https://datatracker.ietf.org/doc/html/rfc6749#section-6).
