feat: Generic Importers #1389
base: master
Conversation
In preparation for creating a generic transfer extension, set the `@type` field when serializing `SocialActivity*` objects.
Adds a generic `Importer` for SOCIAL_POSTS. The importer passes the social data straight through, and is built to be extensible to other data types.
Extends the `GenericImporter` from the previous commit to support data with both JSON metadata and file data. As an example of usage, provide an implementation for `BLOBS` data.
Force-pushed from dfd07cd to 312862b.
Force-pushed from 7afade1 to 40e18b1.
Extracts the logic for serializing BLOBS data to a separate module. Adds tests for the serializer. Adds a generic wrapper around the payload, which will wrap all other types after they undergo the same refactoring.
Extracts the SOCIAL_POSTS generic importer serialization logic to a separate module. Also adds tests, and wraps the payload in GenericPayload, as was done for BLOBS in the previous commit.
Extracts the CALENDAR generic importer serialization logic to a separate module. Also adds tests, and wraps the payload in GenericPayload, like with other data types in previous commits.
Extracts the MEDIA generic importer serialization logic to a separate module. Also adds tests, and wraps the payload in GenericPayload, like with other data types in previous commits.
Moves `CachedDownloadableItem` to the `BlobbySerializer` file, since that's the only place it's used.
In previous commits the type information of exported data was being lost at serialization time. This change delays converting to a JSON representation until the data is sent on the wire, allowing type information to be preserved as long as possible and avoiding double-serialization of the data.

The exported data have a union of possible types (the MediaSerializer, for example, generates a list of MediaAlbums, VideoModels, and PhotoModels), but unfortunately Java doesn't support proper union types. A workaround is to create an empty `ExportData` interface for each serializer, have the possible subtypes 'implement' the empty interface, and then use the `ExportData` interface as a type bound. This makes the Media and Calendar serializers more verbose, as we have to duplicate the underlying models, but it gives us a layer of separation between the model and the serialized data, allowing them to evolve independently and guaranteeing that changes to the model will require a corresponding change in the serializer, reducing the chance of the serializer interface being broken.
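The empty-interface workaround described above can be sketched roughly as follows. All class and field names here are illustrative stand-ins, not the actual DTP types:

```java
// Sketch of the empty marker-interface "union type" workaround.
// Names are illustrative, not the actual DTP classes.
interface MediaExportData {}

// Each possible payload type implements the marker interface.
class MediaAlbumExport implements MediaExportData {
  final String albumName;
  MediaAlbumExport(String albumName) { this.albumName = albumName; }
}

class PhotoExport implements MediaExportData {
  final String title;
  PhotoExport(String title) { this.title = title; }
}

class MediaSerializerSketch {
  // The type bound restricts the serializer to the closed set of export
  // types, so adding a new model variant forces a serializer change too.
  static <T extends MediaExportData> String typeOf(T item) {
    return item.getClass().getSimpleName();
  }
}
```

The cost is duplicating model fields in the export types; the benefit is that the wire schema can no longer drift silently when a core model changes.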
Since we've added a layer between generic importer serializers and the underlying models to better preserve type information, make use of this by using a custom schema for media items. This lets us a) use consistent schemas for Photos and Videos, and b) use TZ-aware date-times (although we have to assume the underlying Date is UTC when copying the model).
Simplifies the `@type` annotations of exported data. Might be worth doing this for nested types too, although some of them are directly using Models defined by DTP.
Force-pushed from da4861a to 23bec4a.
Expands the `TransferServiceConfig` object to allow arbitrary data in a `serviceConfig` field, to be later parsed by `TransferExtension`s at initialization time. Extends `GenericTransferExtension` to read from `serviceConfig` to configure itself for the import service being targeted. Also allows `TransferExtension`s to declare support for services through a `supportsService` method, turning Service->Extension mappings from one-to-one to many-to-one, meaning `GenericTransferExtension` can be configured to support multiple services. For now the configuration is fairly barebones: a list of supported verticals (wrapped in an object to allow future support for vertical-level configuration), the base of the API endpoint, and the service name.
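A purely illustrative sketch of the shape that commit message describes; the actual key names and file layout are defined by `TransferServiceConfig` and may differ:

```yaml
# Illustrative only: service name, endpoint base, and verticals wrapped
# in objects (per the commit message). Key names are guesses.
serviceConfig:
  serviceId: ExampleService
  endpointBase: https://importer.example.com/dtp
  verticals:
    - vertical: MEDIA
    - vertical: SOCIAL-POSTS
```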
Tidies up the MediaSerializer class, and reverts some now-unnecessary changes to the core DTP models.
Improves the serialization of Blobby data.
Refactors GenericImporter to make GenericFileImporter simpler, and to remove some duplication in the two classes.
There was a manual test being used during development that was replaced by unit tests.
File data had inherited some structure from when it was based on `BlobbyStorageContainerResource`. Since refactoring to a standalone schema, there's no need to have this structure, so simplify the schema by flattening it.
Force-pushed from adb64ab to 76c37d7.
This makes the naming more consistent with e.g. MediaSerializer
love it. The documentation is already very readable to an outsider.
> ### HTTP Interface
>
> Data will be POSTed to your endpoint with a `Content-Type` of either `application/json` for basic data-types, or `multipart/related` for file-based data-types. See below for how to interpret each of these.
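For readers unfamiliar with `multipart/related`, a file-based POST can look roughly like the sketch below. The boundary, part order, paths, and field names are generic illustrations, not the PR's actual schema (see the included README for that):

```http
POST /import HTTP/1.1
Content-Type: multipart/related; boundary=example-boundary

--example-boundary
Content-Type: application/json

{"@type": "ExampleFileMetadata", "name": "IMG_0001.jpg"}
--example-boundary
Content-Type: application/octet-stream

<raw file bytes>
--example-boundary--
```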
I think this could use some basic early information about the bigger picture. "Data will be POSTed to your endpoint" - starting with what? Ending when? How fast?
E.g. "When a user of a DTP-powered service requests a transfer to your HTTP service, DTP will first do an OAuth check then begin sending POST requests at a rate metered by the operator of the sending service. Each POST request is a single resource. The job continues until all objects have been sent. The sender makes no assumptions about the storage or organization of these resources at the destination."
and probably add that the order of sending of resources depends on the sender implementation? (but that albums should come before photos in those albums?)
... and all Calendar objects before all Event objects in those calendars...
Added what ended up spiralling into a lot of detail (mostly what we discussed in the maintainers meeting) in 0d4bee3 - hopefully that covers the spirit of what you suggested here.
> ## Authentication and Authorization
>
> Generic Importers support the OAuth v2 Authorization Code Flow; exporters will direct users to your OAuth authorization page, requesting an authorization code with OAuth scopes defined by your importer configuration, which will then be used to claim an access token.
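For reference, the first leg of the Authorization Code Flow is the standard RFC 6749 section 4.1 authorization request; parameter values here are illustrative:

```http
GET /authorize?response_type=code&client_id=dtp-example&redirect_uri=https%3A%2F%2Fdtp.example%2Fcallback&scope=import.media&state=xyz HTTP/1.1
Host: auth.example.com
```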
add link to https://www.rfc-editor.org/rfc/rfc6749
> The token refresh endpoint should return a response containing the access token. If a new refresh_token is provided it will be used for future token refreshes.
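As background for the quoted sentence: a standard RFC 6749 section 5.1 token response looks like the following (token values are illustrative; any compliant OAuth2 library should produce this shape):

```http
HTTP/1.1 200 OK
Content-Type: application/json

{
  "access_token": "2YotnFZFEjr1zCsicMWpAA",
  "token_type": "Bearer",
  "expires_in": 3600,
  "refresh_token": "tGzv3JOkF0XG5Qx2TlKWIA"
}
```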
Should add a comment that this is just an example, other OAuth2 compliant messages ought to work just fine. I assume that folks ought to be able to use a decent OAuth2 library, we're not asking them to hand-code this answer.
I've added a note in 0d4bee3 that you should probably be using a library or third-party auth server, and figured the refresh token API is documented well enough elsewhere that I've removed these details
> `private URL urlAppend(URL base, String suffix) {`
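A minimal sketch of what such a helper might do; this is not the PR's actual implementation, just an illustration that normalizes the joining slash:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative sketch only; the real urlAppend may handle more cases.
class UrlUtilSketch {
  static URL urlAppend(URL base, String suffix) throws MalformedURLException {
    String baseStr = base.toString();
    // Strip a trailing slash from the base and a leading slash from the
    // suffix so exactly one slash joins the two parts.
    if (baseStr.endsWith("/")) {
      baseStr = baseStr.substring(0, baseStr.length() - 1);
    }
    if (suffix.startsWith("/")) {
      suffix = suffix.substring(1);
    }
    return new URL(baseStr + "/" + suffix);
  }
}
```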
wow this one is easier in python ;)
this seems like a broadly useful utility for the project - not that we should move it now, but flagging for interest
Sure is! 🙂
I think because most other extensions use SDKs they don't have to deal with URLs manually, but definitely could be moved to a utility class if we spot it being reimplemented elsewhere
> `* Utility to manage {@link TokensAndUrlAuthData} containing OAuth refresh and access tokens.`
> `* See {@see #withAuthData} for how to use the auth data managed by this class. */`
> `public class OAuthTokenManager {`
I'm surprised this isn't already part of the DTP framework. Doesn't it have to be reimplemented like this for each extension? Not something that needs to be addressed in this PR for sure, just asking!
I was surprised too. I think, similar to the `urlAppend` method, it's because other extensions are built on SDKs, so the tokens are managed there.
> `GenericImporter.configureObjectMapper(objectMapper);`
>
> `<T> void assertJsonEquals(String expectedPayload, GenericPayload<T> actualWrapperPayload)`
This seems really useful but the longer instances where this is used were still somewhat ugly and hard to write with all the quotes, quote escapes, and line concatenations. Is it easy for assertJsonEquals to handle that too and make these tests even easier to write/read? And is this another utility that is potentially used elsewhere in DTP?
Agreed, this is another one where the python equivalent is much simpler!
Multi-line string literals weren't added until Java 15 (13 with a compiler flag), and we're still on 11, so annoyingly we can't utilise those; I think this is the best we can do for embedding large JSON strings in Java files.
Looking at other tests, the imgur importer tests store expected JSON data in resource files which would be another option, but it does mean the data doesn't live in the test file itself
I could be convinced either way on this, and I'd be open to other suggestions if anyone has them
> `Request request =`
> `    new Request.Builder()`
> `        .url(endpoint)`
> `        .addHeader("Authorization", format("Bearer %s", authData.getToken()))`
you don't have to address this now. But probably a good practice to extract these common strings to a highlevel constants file that is accessible to all the extensions.
> `final class GenericTransferConstants {`
> `  public static final String SCHEMA_SOURCE_BASE =`
> `      "https://github.com/dtinit/data-transfer-project/blob/master";`
I am a little confused by this.
It seems like we are using this only in the tests to compare what is getting serialized. If that is the case, we should compare what is currently at the runtime but not what is currently in the master branch. There is 100% chance that someone might be running a little older version of generic importers compared to master and that would break the tests. Instead, we should just read the file from classpath and do what we are doing.
It gets used when building the schema source string to go in `GenericPayload`, in each `Serializer` class. That ends up in the payload to the importer endpoint.
It could go out of date, but I don't anticipate the code moving very often and would hope instances catch up relatively quickly. I did try to get it generated at runtime or compile time, but I couldn't find a way to get it to point to the right place (the path ended up in a build directory that doesn't exist in the repo) - if you know a way to do it that would be helpful!
One idea I had here was to switch to generating the POJOs from a JSON schema at build time, at which point the schema source could point to that file and make a bit more sense
Based on review feedback, expand on the documentation:
- Include a high-level overview of the job lifecycle
- Add details on data rates
- Clarify the ordering behaviour
- Add a reference to the OAuth RFC
- 201 -> 20x for success codes
- Remove implementation details for the OAuth flow; it's documented more thoroughly elsewhere and we should encourage use of a framework or third-party authorization server
Also use `Set.contains` to check for vertical support.
Thanks for the reviews!
To support importing data via DTP, developers are required to implement a Java extension module exposing a `TransferExtension` and `Importer`, with the `Importer` interfacing with a new or existing API also supported by the developer. Creating a Java extension creates a barrier for some developers who don't have the context or bandwidth to implement and maintain such an extension.

As a way to make it easier to integrate with DTP, this PR implements a `GenericTransferExtension` and `GenericImporter`, which can be reused by developers without committing or maintaining Java code. Generic Importers can be configured to call an HTTP endpoint created and maintained by the developer, allowing them to use the programming language and frameworks of their choice, as long as their API conforms to the OAuth Authorization Code Flow and the HTTP JSON API laid out in the included README.
While this PR contains a functional implementation for the BLOBS, CALENDAR, MEDIA, and SOCIAL-POSTS verticals, I anticipate the implementation and API will be iterated on as developers begin testing and integration.
TODO: `TransferExtension` loading