Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Generic Importers #1389

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

calumcalder
Copy link
Collaborator

@calumcalder calumcalder commented Nov 14, 2024

To support importing data via DTP, developers are required to implement a Java extension module exposing a TransferExtension and Importer, with the Importer interfacing with a new or existing API also supported by the developer. Creating a Java extension creates a barrier for some developers who don't have the context or bandwidth to implement and maintain such an extension.
As a way to make it easier to integrate with DTP this PR implements a GenericTransferExtension and GenericImporter, which can be reused by developers without committing or maintaining Java code.

Generic Importers can be configured to call a HTTP endpoint created and maintained by the developer, allowing them to use the programming language and frameworks of their choice, as long as their API conforms to the OAuth Authorization Code Flow, and the HTTP JSON API laid out in the included README.

While this PR contains a functional implementation for the BLOBS, CALENDAR, MEDIA, and SOCIAL-POSTS verticals, I anticipate the implementation and API will be iterated on as developers begin testing and integration.

TODO

  • Auth
  • Configuration
    • Rework TransferExtension loading
  • Data-types
    • Blobs
    • Calendar
    • Media
    • Social Posts
  • Documentation
    • Service endpoints
    • Service schema
    • OAuth flow
    • Configuration
  • Tests
    • Serialization
    • OAuth
    • Importer

In preparation for creating a generic transfer extension, set the
`@type` field when serializing `SocialActivity*` objects.
Adds a generic `Importer` for SOCIAL_POSTS. The importer passes the
social data straight through, and is built to be extensible to other
data types.
Extends the `GenericImporter` from the previous commit to support data
with both JSON metadata and file data. As an example of usage, provide
an implementation for `BLOBS` data.
@calumcalder calumcalder force-pushed the feat/generic-exporters branch from dfd07cd to 312862b Compare November 15, 2024 12:23
@calumcalder calumcalder force-pushed the feat/generic-exporters branch from 7afade1 to 40e18b1 Compare November 15, 2024 15:23
Extracts the logic for serializing BLOBS data to a separate module.
Adds tests for the serializer.
Adds a generic wrapper around the payload, which will wrap all other
types after they undergo the same refactoring.
Extracts the SOCIAL_POSTS generic importer serialization logic to a
separate module.
Also adds tests, and wraps the payload in GenericPayload, like with
BLOBS that was done in the previous commit.
Extracts the CALENDAR generic importer serialization logic to a separate
module.
Also adds tests, and wraps the payload in GenericPayload, like with
other data types in previous commits.
Extracts the MEDIA generic importer serialization logic to a separate
module.
Also adds tests, and wraps the payload in GenericPayload, like with
other data types in previous commits.
Moves `CachedDownloadableItem` to the `BlobbySerializer` file, since
that's the only place it's used.
In previous commits the type information of exported data was being lost
at serialization time. This change delays converting to JSON
representation until the time when data is sent on the wire, allowing
type information to be preserved as long as possible and to avoid
double-serialization of the data.

The exported data have a union of possible types (the MediaSerializer,
for example, generates a list of MediaAlbums, VideoModels, and
PhotoModels), but unfortunately Java doesn't support proper union types.
A solution to this is to create an empty `ExportData` interface for each
serializer and have the possible subtypes 'implement' the empty
interface, then using the `ExportData` interface as a type bound.

This makes Media and Calendar serializers more verbose as we're having
to duplicate the underlying models, but it gives us a layer of
separation between the model and the serialized data, allowing them to
evolve independently, and guaranteeing that changes to the model will
require a corresponding change in the serializer, reducing the chance of
the serializer interface being broken.
Since we've added a layer between generic importer serializers and the
underlying models to better preserve type information, make use of this
by using a custom schema for media items. This lets us be
a) use consistent schemas for Photos and Videos
b) use TZ-aware date-times (although we have to assume the underlying
   Date is UTC when copying the model)
Simplifies the `@type` annotations of exported data. Might be worth
doing this for nested types too, although some of them are directly
using Models defined by DTP.
@calumcalder calumcalder force-pushed the feat/generic-exporters branch from da4861a to 23bec4a Compare November 28, 2024 10:43
Expands the `TransferServiceConfig` object to allow arbitraty data in a
`serviceConfig` field, to be later parsed by `TransferExtension`s at
initialization time.
Extends `GenericTransferExtension` to read from `serviceConfig` to
configure itself for the import service being targeted.

Also allows `TransferExtension`s to declare support for services
through a `supportsService` method, turning Service->Extension mappings
from one-to-one to many-to-one, meaning `GenericTransferExtension` can
be configured to support multiple services.

For now the configuration is fairly barebones; a list of supported
verticals (wrapped in an object to allow future support for
vertical-level configuration), the base of the API endpoint, and the
service name.
Tidies up the MediaSerializer class, and reverts some now-unnecessary
changes to the core DTP models.
Improves the serialization of Blobby data.
Refactors GenericImporter to make GenericFileImporter simpler, and to
remove some duplication in the two classes.
There was a manual test being used during development that was replaced
by unit tests.
File data had inherited some structure from when it was based on
`BlobbyStorageContainerResource`. Since refactoring to a standalone
schema, there's no need to have this structure, so simplify the schema
by flattening it.
@calumcalder calumcalder force-pushed the feat/generic-exporters branch from adb64ab to 76c37d7 Compare December 6, 2024 16:45
@calumcalder calumcalder marked this pull request as ready for review December 6, 2024 16:58
This makes the naming more consistent with e.g. MediaSerializer
lisad
lisad previously approved these changes Dec 11, 2024
Copy link
Member

@lisad lisad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it. The documentation is already very readable to an outsider.


### HTTP Interface

Data will be POSTed to your endpoint with a `Content-Type` of either `application/json` for basic data-types, or `multipart/related` for file-based data-types. See below for how to interpret each of these.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could use some basic early information about the bigger picture. Data will be POSTed to your endpoint - starting with what ? Ending when? How fast?

E.g. "When a user of a DTP-powered service requests a transfer to your HTTP service, DTP will first do an OAuth check then begin sending POST requests at a rate metered by the operator of the sending service. Each POST request is a single resource. The job continues until all objects have been sent. The sender makes no assumptions about the storage or organization of these resources at the destination."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and probably add that the order of sending of resources depends on the sender implementation? (but that albums should come before photos in those albums?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and all Calendar objects before all Event objects in those calendars...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added what ended up spiralling in to a lot of details (mostly what we discussed in the maintainers meeting) in 0d4bee3 - hopefully that covers the spirit of what you suggested here


## Authentication and Authorization

Generic Importers support the OAuth v2 Authorization Code Flow; exporters will direct users to your OAuth authorization page, requesting an authorization code with OAuth scopes defined by your importer configuration, which will then be used to claim an access token.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The token refresh endpoint should return a response containing the access token. If a new refresh_token is provided it will be used for future token refreshes.

```http
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add a comment that this is just an example, other OAuth2 compliant messages ought to work just fine. I assume that folks ought to be able to use a decent OAuth2 library, we're not asking them to hand-code this answer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a note in 0d4bee3 that you should probably be using a library or third-party auth server, and figured the refresh token API is documented well enough elsewhere that I've removed these details

}
}

private URL urlAppend(URL base, String suffix) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow this one is easier in python ;)
this seems like a broadly useful utility for the project - not that we should move it now, but flagging for interest

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure is! 🙂
I think because most other extensions use SDKs they don't have to deal with URLs manually, but definitely could be moved to a utility class if we spot it being reimplemented elsewhere

* Utility to manage {@link TokensAndUrlAuthData} containing OAuth refresh and access tokens.<br>
* See {@see #withAuthData} for how to use the auth data managed by this class.
*/
public class OAuthTokenManager {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this isn't already part of the DTP framework. Doesn't it have to be reimplemented like this for each extension? Not something that needs to be addressed in this PR for sure, just asking!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised too, I think similar to the urlAppend method because other extensions are built on SDKs the tokens are managed there

GenericImporter.configureObjectMapper(objectMapper);
}

<T> void assertJsonEquals(String expectedPayload, GenericPayload<T> actualWrapperPayload)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really useful but the longer instances where this is used were still somewhat ugly and hard to write with all the quotes, quote escapes, and line concatenations. Is it easy for assertJsonEquals to handle that too and make these tests even easier to write/read? And is this another utility that is potentially used elsewhere in DTP?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is another one where the python equivalent is much simpler!
Mutli-line string literals weren't added until Java 15 (13 with a compiler flag), and we're still on 11 so annoyingly can't utilise those and I think this is the best we can do for embedding large JSON strings in java files.
Looking at other tests, the imgur importer tests store expected JSON data in resource files which would be another option, but it does mean the data doesn't live in the test file itself

I could be convinced either way on this, and I'd be open to other suggestions if anyone has them

Request request =
new Request.Builder()
.url(endpoint)
.addHeader("Authorization", format("Bearer %s", authData.getToken()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't have to address this now. But probably a good practice to extract these common strings to a highlevel constants file that is accessible to all the extensions.


final class GenericTransferConstants {
public static final String SCHEMA_SOURCE_BASE =
"https://github.com/dtinit/data-transfer-project/blob/master";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused by this.

It seems like we are using this only in the tests to compare what is getting serialized. If that is the case, we should compare what is currently at the runtime but not what is currently in the master branch. There is 100% chance that someone might be running a little older version of generic importers compared to master and that would break the tests. Instead, we should just read the file from classpath and do what we are doing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets used when building the schema source string to go in GenericPayload, in each Serializer class. That ends up in the payload to the importer endpoint.

It could go out of date, but I don't anticipate the code moving very often and would hope instances catch up relatively quickly. I did try to get it generated at runtime or compile time, but I couldn't find a way to get it to point to the right place (the path ended up in a build directory that doesn't exist in the repo) - if you know a way to do it that would be helpful!

One idea I had here was to switch to generating the POJOs from a JSON schema at build time, at which point the schema source could point to that file and make a bit more sense

Based on review feedback, expand on the documentation.
- Include a high-level overview of job lifecyle
- Add details on data rates
- Clarify the ordering behaviour
- Add reference to OAuth RFC
- 201 -> 20x for success codes
- Remove implementation details for OAuth flow
  It's documented more thoroughly elsewhere and we should encourage use
  of a framework or third-party authorization server.
@calumcalder
Copy link
Collaborator Author

Thanks for the reviews!

@calumcalder calumcalder requested review from sparuvu and lisad December 13, 2024 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants