Extending the spec to include content creation/modification #140

Open
monotasker opened this issue Oct 15, 2018 · 21 comments
@monotasker
Collaborator

I'm wondering whether there's interest in the DTS group in the idea of extending the API spec to provide for creating and modifying content, in addition to fetching it. The context for this is that I'm part of a project looking at how we might create standard APIs to allow better modularity, reusability, and interoperability for TEI editing tools. It strikes me that some of the DTS endpoints could naturally be exposed for POST, PUT, PATCH, and DELETE requests. That would fairly naturally allow the API to cover a much broader range of use cases. I'm actually going to be drafting a proposal for that extension for a paper I'm giving in November.

@PonteIneptique
Member

PonteIneptique commented Oct 16, 2018

First, let me say it pleases me so much to read such a thing.

I think this one will need more time than the current fixes / small improvements we are drafting. It was originally thought about when we started working on DTS, but we decided that the first release should focus on delivery rather than on both ingestion + delivery. I think it would be nice to start thinking about it, while leaving space for it not to be mandatory. That is an absolute necessity (for reasons that might be too obvious but, basically, not everyone has the ability or the wish to handle this kind of input flow, while they might still want to serve content).

I think any draft will be welcome :). On the side, out of pure curiosity, can we know where the paper will be given? :)

@PietroLiuzzo
Contributor

For epigraphy.info as is being currently planned, this would be very useful. As some of you know, the plan since the Heidelberg meeting has been exactly to rely fully on DTS, and this would require the API specification to include at least POST. At the moment (please don't faint...) my demo applications for that use an intermediate, bespoke, non-standard model which is posted and contains links to GET the data which has changed from the DTS API.
I agree 200% with @PonteIneptique that it needs to be optional to implement any other method than GET.
Perhaps the draft extension can be shared here so that people do not have to wait for a publication? It would sound better to me to see it discussed here first instead of seeing it in a paper and discussing it after.
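
One way to picture "optional beyond GET" is a server that advertises which methods it supports per endpoint and answers anything else with the standard HTTP 405 status. This is only a minimal sketch: the `SUPPORTED_METHODS` table and the `check_method` helper are hypothetical and not part of any DTS draft.

```python
from http import HTTPStatus

# Hypothetical capability table: the HTTP methods one particular server
# chooses to support per endpoint. Only GET is assumed to be universal.
SUPPORTED_METHODS = {
    "collection": {"GET"},               # read-only provider
    "document": {"GET", "POST", "PUT"},  # provider that also accepts uploads
}


def check_method(endpoint: str, method: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a request before any real handling."""
    allowed = SUPPORTED_METHODS.get(endpoint)
    if allowed is None:
        return HTTPStatus.NOT_FOUND.value, {}
    if method.upper() not in allowed:
        # 405 plus an Allow header is the standard HTTP way of saying
        # "this method is not implemented on this resource".
        headers = {"Allow": ", ".join(sorted(allowed))}
        return HTTPStatus.METHOD_NOT_ALLOWED.value, headers
    return HTTPStatus.OK.value, {}
```

A read-only provider would then need no extra work: a POST to its collection endpoint simply yields 405 with `Allow: GET`.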

@monotasker
Collaborator Author

Sure, I'm happy to share the draft once I've got an initial version pulled together. Is this the best forum for doing that and discussing it?

@monotasker
Collaborator Author

The paper is being delivered (with Ken Penner) at the Society of Biblical Literature meeting in Denver this November, in the Humanities Computing section. The proposed DTS extensions are part of a broader interest in moving toward more modular and interoperable editing/publishing tools. Any feedback I can get before then would be fantastic.

@PonteIneptique
Member

(Own opinion) I think the best is to do it with some kind of personal repository, which could be a clone of this repository, and make a new page about "Extending services with publishing options" or some way better title than this one?

@monotasker
Collaborator Author

Yes, I could set up a cloned repo. In the meantime, since there seems to be some interest, I'd be glad for input here on three basic questions:

  1. What endpoint should be used for adding and modifying metadata about a document? It looks to me like at present it would be "collection." But semantically it seems odd to me to use "collection" to work on the extensive metadata contained in a TEI header, particularly when you're just working on one document. So I'm wondering about a dedicated endpoint like "docinfo" or "metadata" for that kind of editing. On the other hand, I know that some kinds of metadata are already retrievable from "collection."
  2. What format would be best to use for submitting metadata? I can think of four options for submitting new/updated data:
    a) as fragments of a TEI header (xml);
    b) as JSON-LD;
    c) as generic JSON using terms from a standard vocabulary;
    d) as query parameters using terms from a standard vocabulary.
    I'm leaning toward generic JSON, since everything in DTS other than the document endpoint seems to use JSON. It seems to me, though, that requiring JSON-LD in a request would be overkill. As for query parameters, a lot of the metadata would be very awkward (if not impossible) to squeeze into a param value. So while it might be good to allow a few parameters, extensive metadata uploads will probably have to be done by either xml or JSON.
  3. What vocabulary would be best to use for submitting metadata? This choice is related to the choice of format. If the data is submitted as TEI xml, then the TEI semantics are built in.
    a) TEI header syntax (probably the most complete);
    b) JSON using dts and dc terms (only covers a small subset of the necessary metadata for proper documentation);
    c) JSON using bespoke terms (I'm increasingly allergic to this.)
    At the moment my theoretical ideal would be to discover that someone has extensions to Dublin Core that line up easily with the semantics of TEI header syntax. Then it would be simple to accept JSON. Is there such an existing standard vocabulary? I thought I would ask before I dive further into searching on my own.
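
To make the four format options concrete, here is a rough sketch that serializes the same two pieces of metadata each way. The title and creator values are placeholders, and the choice of `dc:title` / `dc:creator` and of `titleStmt` as the TEI fragment is an assumption made for illustration only.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

TEI_NS = "http://www.tei-c.org/ns/1.0"
DC_NS = "http://purl.org/dc/terms/"

title, creator = "Example Title", "Example Author"  # placeholder values

# a) a fragment of a TEI header (xml)
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
title_stmt = ET.Element(f"{{{TEI_NS}}}titleStmt")
ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title
ET.SubElement(title_stmt, f"{{{TEI_NS}}}author").text = creator
as_tei = ET.tostring(title_stmt, encoding="unicode")

# b) JSON-LD: the same terms, with an explicit @context
as_jsonld = json.dumps(
    {"@context": {"dc": DC_NS}, "dc:title": title, "dc:creator": creator}
)

# c) generic JSON using the bare Dublin Core term names
as_json = json.dumps({"title": title, "creator": creator})

# d) query parameters: workable for flat values, awkward for nested ones
as_query = urlencode({"title": title, "creator": creator})
```

For two flat values all four are about equally easy; the differences only show up once the metadata becomes nested or repeated, which is where option d) stops being practical.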

Thanks for any input you can offer.

@hcayless
Contributor

There will be potentially huge complexities involved here. Not a reason to avoid doing it, just a warning from someone with a few scars.

  1. I would use the Collection endpoint to add metadata, and the Document endpoint to add or update documents (including TEI Headers). A "leaf" node in a collection would typically contain metadata about a document (though we're deliberately vague about document/collection boundaries).
  2. I'd say the format should depend on the type of data being changed. I'm not sure at this point how we could avoid using JSON-LD for updating collections. On the other hand, I could imagine extrapolating a collection leaf node from an uploaded TEI document, so there's that.
  3. We don't even have anything like a full metadata spec for collections. We punted and said "use Dublin Core", which I think was a good decision, but we're fuzzy at this point on what extended metadata might look like. I'm comfortable in saying there is nothing like a full mapping between the TEI Header and any sort of DC. There's a small subset of the TEI Header with fairly obvious correspondences to DC and after that all bets are off. They're not even really the same kind of thing...As I mentioned above, though, I could imagine conventions for extracting metadata from the TEI Header of a document.
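
The idea of extrapolating a collection leaf node from an uploaded TEI document could look roughly like the sketch below. The `HEADER_TO_DC` mapping is a hypothetical convention covering only the "fairly obvious" correspondences; it is not an agreed mapping, and everything else in the header is deliberately ignored.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical convention: only these header paths are mapped to Dublin Core.
HEADER_TO_DC = {
    "dc:title": "tei:fileDesc/tei:titleStmt/tei:title",
    "dc:creator": "tei:fileDesc/tei:titleStmt/tei:author",
    "dc:publisher": "tei:fileDesc/tei:publicationStmt/tei:publisher",
}


def dublin_core_from_tei(tei_xml: str) -> dict[str, list[str]]:
    """Extract the small mapped subset of header metadata, if present."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI)
    if header is None:
        return {}
    result: dict[str, list[str]] = {}
    for term, path in HEADER_TO_DC.items():
        # itertext() flattens any inline markup inside the element.
        values = ["".join(el.itertext()).strip() for el in header.findall(path, TEI)]
        values = [v for v in values if v]
        if values:
            result[term] = values
    return result


SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Text</title><author>Anonymous</author></titleStmt>
      <publicationStmt><p>Unpublished draft</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Placeholder</p></body></text>
</TEI>"""
```

Calling `dublin_core_from_tei(SAMPLE)` yields only a title and a creator; the prose-only `publicationStmt` has no obvious Dublin Core counterpart and is skipped, which illustrates where the "all bets are off" part begins.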

@PonteIneptique
Member

I would definitely push the use of JSON-LD and look at the dts:extended property, which accepts any ontology declared in the @context. This lets us say:

If you have simple metadata, there is dts:dublincore; for more complex and situational ones, use dts:extended.

And obviously, we need to at least accept TEI on the update/put front. But this one feels particularly hard to specify without stepping on project boundaries: we'll need to agree on error codes like "InvalidSchema" for service providers to explain why they did not accept the content.
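
As a sketch of the dts:dublincore / dts:extended split, the helper below builds a collection member and rejects any extended term whose prefix is not declared in the @context. The context IRIs are my assumption of what a DTS collection response uses, and the `ex` vocabulary and `build_member` function are purely hypothetical.

```python
# Assumed base context; check the IRIs against the DTS specification.
BASE_CONTEXT = {
    "@vocab": "https://www.w3.org/ns/hydra/core#",
    "dc": "http://purl.org/dc/terms/",
    "dts": "https://w3id.org/dts/api#",
}


def build_member(identifier, title, dublincore, extended=None, extra_context=None):
    """Build a collection member; extended terms must use a declared prefix."""
    context = {**BASE_CONTEXT, **(extra_context or {})}
    extended = extended or {}
    for key in extended:
        prefix = key.split(":", 1)[0]
        if ":" not in key or prefix not in context:
            raise ValueError(f"undeclared namespace for term {key!r}")
    member = {
        "@context": context,
        "@id": identifier,
        "@type": "Resource",
        "title": title,
        "dts:dublincore": dublincore,
    }
    if extended:
        member["dts:extended"] = extended
    return member


member = build_member(
    "urn:example:doc1",                                # placeholder identifier
    "Sample Text",
    {"dc:creator": ["Anonymous"]},
    extended={"ex:scriptType": "uncial"},              # hypothetical term
    extra_context={"ex": "https://example.org/vocab#"},
)
```

A server accepting such objects on input could run the same prefix check and answer with an agreed error code when it fails.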

@monotasker
Collaborator Author

Thanks for the input so far. I'm going to go ahead and fork this repository, adding some proposed extensions as the basis for further discussion. Just two follow-up questions for the moment.

  1. How practical is it to require proper JSON-LD in the payload for an api request? I see the need to return JSON-LD responses. But a lot of potential api users are never going to learn how to write proper JSON-LD. It's not an intuitive format, even for some experienced web developers. So my concern is that if we require it for requests we're going to set up a big hurdle in the way of adoption. I wonder if a more practical compromise would be to ask for JSON-LD based on some existing terminology (dc+), but to also accept JSON that uses the correct terminology without all of the referencing involved in LD? I guess I'm wondering what would be lost by allowing for that if it makes use of the api more feasible?
  2. I'm also a bit concerned about separating sharply between metadata stored in other ways and TEI headers. The TEI header is actually designed, I believe, to hold all of the possible metadata related to an item. If we're going to respect TEI standards, I think that means treating metadata stored in a TEI header and metadata stored in (say) a database as equivalent. So I don't think we should be deciding at the API level whether metadata will be stored in the document header or in a database. I would rather see one API route (one endpoint) for metadata of all kinds and let the individual project decide where they want to store it. (Ideally I would argue that the same metadata should be inserted into TEI headers for the documents and stored in a database. But that's not an API-level concern.)
  3. Related to this, I'm reconsidering whether we shouldn't treat TEI-formatted xml as the preferred format for submitting rich metadata. If dc doesn't give us the rich semantics to organize that kind of data, the TEI header does. So why not use the existing standard? That could mean we allow people to (a) submit the small subset of metadata supported by dc using JSON and (b) submit any metadata using TEI xml. That would give us a fully working semantic metadata standard immediately, while we work on a properly TEI-compatible extension for dc.
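
The compromise raised in point 1 (accept JSON-LD, but also plain JSON that uses the correct terms) might be sketched as a normalization step on the server. The `normalize_payload` function and the short list of accepted terms are hypothetical; the term names themselves are ordinary Dublin Core terms.

```python
DEFAULT_CONTEXT = {"dc": "http://purl.org/dc/terms/"}

# Hypothetical whitelist of bare Dublin Core terms accepted without a context.
KNOWN_DC_TERMS = {"title", "creator", "language", "date", "publisher", "description"}


def normalize_payload(payload: dict) -> dict:
    """Accept JSON-LD as-is; upgrade plain JSON that uses known terms."""
    if "@context" in payload:
        return payload  # already JSON-LD: pass through unchanged
    unknown = sorted(set(payload) - KNOWN_DC_TERMS)
    if unknown:
        raise ValueError(f"unknown terms: {', '.join(unknown)}")
    normalized = {"@context": dict(DEFAULT_CONTEXT)}
    normalized.update({f"dc:{key}": value for key, value in payload.items()})
    return normalized
```

With this, `normalize_payload({"title": "Sample Text"})` returns an object with a `dc` context and a `dc:title` key, so clients that never write a context still end up with linked data on the server side. What is lost is the ability to send terms outside the whitelist without learning JSON-LD.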

@PonteIneptique
Member

I will reply quickly to 1., a little about 2, and will need to think a bit more about 3.

  1. I think the best way would be to ask clients to send an actual collection object, with the modified/new values, in the same form as the API shows them. This would avoid having an input format that differs from the output format. And from there, it is rather simple, because it's quite constrained by the format of our DTS Collection model.
  2. I do not totally agree with "The TEI header is actually designed, I believe, to hold all of the possible metadata related to an item", as I believe some metadata are much more useful outside of TEI, mostly when they are repeated, like information about authors for whom you have multiple TEI files. You could definitely have, in your application, your metadata taken from the teiHeader, but it then becomes an application-specific link rather than a standard-wide procedure. This does not prevent drafting a spec for input data on the document and collection endpoints, though.
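
The "send back the same object you received" idea from point 1 can be sketched as a small round-trip helper. The `apply_changes` function and the sample object are hypothetical, and no network call is shown: the returned dictionary is simply what a client would serialize and send.

```python
import copy


def apply_changes(fetched: dict, changes: dict) -> dict:
    """Return a modified copy of a fetched collection object, same shape."""
    updated = copy.deepcopy(fetched)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(updated.get(key), dict):
            updated[key].update(value)  # merge one level, e.g. dts:dublincore
        else:
            updated[key] = value
    return updated


# What a client might have received from the collection endpoint (placeholder).
fetched = {
    "@id": "urn:example:doc1",
    "@type": "Resource",
    "title": "Old Title",
    "dts:dublincore": {"dc:creator": ["Anonymous"], "dc:language": ["grc"]},
}

to_send = apply_changes(
    fetched,
    {"title": "New Title", "dts:dublincore": {"dc:creator": ["A. Author"]}},
)
```

Because input and output share one shape, a server can validate the payload with the same rules it uses to build its own responses.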

I am not sure I am clear, so feel free to tell me, English is definitely not my first language :)

@monotasker
Collaborator Author

monotasker commented Oct 17, 2018

Thanks @PonteIneptique. No worry about language. If I were trying to write this in French it would take me all day!

I like the idea of having the collection object returned with modifications.

I may have been unclear in what I said about the TEI header. I didn't mean that it was designed to be the main storage medium for every purpose. As you say, an application will often want to store some kinds of metadata in other places as well. I just meant that the TEI header is explicitly designed to be able to organize extremely rich and varied information about a document. And the TEI spec seems to encourage that wherever else metadata is stored it is also stored in the TEI header.

I'm uncomfortable in principle with an API that lets me update the same information about a document through two different endpoints (document and collection). That seems to me to break the semantics of the endpoint. I don't think the API should have any opinion on how or where metadata is stored. It should just provide a semantically rational endpoint for sending the data. It doesn't make sense to me to send "creator" data to one endpoint if it's in JSON and a different endpoint if it's in xml. Similarly, I'm not comfortable with using one endpoint to update "creator" information and a different endpoint to update information on (say) a document's normalization scheme or orthography. Both are metadata, and so it seems to me that both should go to the same endpoint.

I also think we should distinguish generally between the format of a request payload and the internal storage mechanisms of the application. Right now the document endpoint returns TEI xml, but that doesn't mean a project is storing the document as xml. The api just requires that output because it's standard. Similarly, I think we should choose a payload format for modifying metadata based on what format is (a) standard, and (b) semantically rich enough. It's then up to the project to decide how they want to represent and store that data internally.

I hope this is a bit clearer. I'll look forward to hearing your thoughts on my 3. Basically, I want to be able to modify metadata that is as rich as the TEI syntax allows. To the extent that DC supports some of those semantics, I'm all for sending it as a modified JSON-LD object. But where DC doesn't support the TEI semantics, I think we need to find a way to support it. Since xml is already an exchange format, and you're already using xml in some responses, it would make sense to me in the short run to support those semantics by allowing xml upload as well. The user could then decide whether to send basic metadata using DC or richer metadata using TEI xml. I hope my reasoning is a bit clearer now.

@monotasker
Collaborator Author

Sorry for the wordiness of my replies, too. Can you tell that I'm on sabbatical right now?

@PonteIneptique
Member

Sorry for the delayed answer.

Similarly, I think we should choose a payload format for modifying metadata based on what format is (a) standard, and (b) semantically rich enough

To that, I'd also add that it would definitely feel weird, as a client, to provide an input mimetype different from the output one. Going to collection but sending TEI feels completely weird, as the basic format is ld+json.

Both are metadata, and so it seems to me that both should go to the same endpoint.

In an ideal world, probably. Unfortunately, some projects do not store rich metadata in TEI, some do. Mostly, what we decided is that the metadata endpoint should be different from the document endpoint in output type, also because xml is definitely losing the race when it comes to APIs.

Basically, I want to be able to modify metadata that is as rich as the TEI syntax allows

That's also an issue though, isn't it? TEI is so flexible in how it handles its metadata that there is one metadata scheme per project, if not more. If some metadata are too rich for the output format of collection, then I think it is fine if they are stored only in the teiHeader of the document. The collection endpoint is not about completeness, it's about re-usability, an area where I feel teiHeaders are definitely losing.

But where DC doesn't support the TEI semantics, I think we need to find a way to support it

I just want to add again that in collection, you have the ability to use other namespaces in the dts:extended property. While it might not be enough to cover all TEI situations, it should cover some.

The user could then decide whether to send basic metadata using DC or richer metadata using TEI xml.

The question still stands though: how should we translate TEI input into LD+JSON, which is the default output format of collection? My answer here would be: I think we should not, because there is no such official translation. But then again, I feel like it's fine that some modifications of the teiHeader in document might have an impact on other endpoints, mostly because if you are changing or adding a new citation node, it will already have an impact on navigation :)

Is my point of view clearer?

@hcayless
Contributor

I agree with Thibault. While I could see the usefulness of a “single source” implementation, where you could upload a TEI doc and it would automatically get a record in the collection endpoint, I would definitely not want to extract and attempt to represent all of the TEIHeader metadata there. That way lies madness.

We might end up issuing recommendations for providers of TEI documents on ways to represent information so that DTS can leverage it, though.

@monotasker
Collaborator Author

Thanks @PonteIneptique and @hcayless. Your responses are helpful and I've got some thinking to do.

What I'm struggling with as I read what you're saying is that it seems like we need an extended API to serve a very common use-case in text-editing: recording extended information about the text, its transcription, related publications, etc. I don't want to simply leave all of this up to each implementation to do ad hoc, because I'm envisioning tools that are highly inter-operable. So I don't want to just abandon the use-case.

It sounds like there are a couple of principles guiding both of your responses:

  1. You want each endpoint to accept and return the same data format. (How general is the agreement here? The existing spec allows the document endpoint to return more than one format.)
  2. You are concerned that the TEI header syntax is too flexible and exhaustive to be usable as an exchange format.

I can see the concern on both points. I don't find them as pressing as the two of you seem to, but I'm willing to continue being convinced. And I'm willing to work within the constraints that your team has agreed on. But I'm still wondering whether TEI headers are as hopeless as the two of you seem to suggest. Maybe if I put together a couple of concrete examples it would help me to see the problems.

What about adding another endpoint for extended document data? Something like "docinfo." Semantically it makes sense to me if we distinguish rich background information from the basic metadata listed in a collection catalogue (i.e., the collection endpoint). It also seems to me that the data format(s) we allow in sending and fetching that broader information might be different (JSON-LD?) than what we want to use for sections of document text (i.e., the document endpoint). If we add a "docinfo" endpoint, then would it make more sense to allow it to accept TEI header fragments? Or maybe allow it to return and accept either TEI header xml or JSON-LD?

By the way, @PonteIneptique, when you mention using other namespaces, are you suggesting that such namespaces already exist or that we create one? I think some of the TEI-header semantics should translate fairly easily to JSON. So if we had a "docinfo" endpoint I could look at starting to build such a namespace. But, again, if it already partially exists somewhere (beyond DC) I definitely don't want to reinvent the wheel.

@monotasker
Collaborator Author

For clarification: I'm thinking that a "docinfo" endpoint could return either a full TEI header or JSON-LD data, based on a url parameter in the request. Then the user could edit and return the same object, whether it's the TEI header or the JSON-LD object.
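
A rough sketch of that idea: one handler that returns the same record either as a TEI header or as JSON-LD, depending on a format parameter. The `docinfo_response` function, the `fmt` parameter values, and the record fields are all hypothetical, and the generated header is a bare skeleton rather than a schema-valid one.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def _tei_header(record: dict) -> str:
    # Skeleton only: a valid fileDesc also needs publicationStmt and sourceDesc.
    header = ET.Element(f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = record["title"]
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}author").text = record["creator"]
    return ET.tostring(header, encoding="unicode")


def docinfo_response(record: dict, fmt: str = "json-ld") -> tuple[str, str]:
    """Return (content type, body) for a hypothetical docinfo request."""
    if fmt == "tei":
        return "application/tei+xml", _tei_header(record)
    if fmt == "json-ld":
        body = {
            "@context": {"dc": "http://purl.org/dc/terms/"},
            "dc:title": record["title"],
            "dc:creator": record["creator"],
        }
        return "application/ld+json", json.dumps(body)
    raise ValueError(f"unsupported format: {fmt!r}")


RECORD = {"title": "Sample Text", "creator": "Anonymous"}  # placeholder data
```

The same switch could equally be driven by an Accept header instead of a url parameter; the sketch only shows that both views can be produced from one stored record.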

@PietroLiuzzo
Contributor

I think I would be more inclined to add selected information from the teiHeader in dts:extended so that I put there what I want to put there and only that

@monotasker monotasker self-assigned this Apr 26, 2019
@monotasker
Collaborator Author

I just submitted a pull request (to dev branch) for a draft of the expanded Documents endpoint documentation. I'm still integrating my separate document into the Collections-Endpoint.md file. So I'll submit a second pull request when that's done.

@monotasker
Collaborator Author

Okay, I merged the (revised) pull request to the dev branch today.

@monotasker
Collaborator Author

I've finally made the lingering fix to the Link headers in the revised and expanded Document endpoint docs. The issue was that hydra requires every response to include the URL of the json-ld api documentation in the Link header. So I've added this to the link header of every response in the docs:

</dts/api/document/documentation>; rel="apiDocumentation"

I'll be committing the updated version shortly and then I'll submit a pull request against the dev branch. Before we can merge that with master I'm going to have to go through and more-or-less manually merge dev with the changes to master made since June. (Github won't automerge them.) I'll work on getting that ready for final approval (with a pull request against master) for the next committee meeting.

@monotasker
Collaborator Author

Oh, quick question before I make the PR: Hydra includes a properly namespaced term to use in the "rel" value for those Link headers: http://www.w3.org/ns/hydra/core#apiDocumentation. I'm assuming that we should be using that full URL in that Link header:

</dts/api/document/documentation>; rel="http://www.w3.org/ns/hydra/core#apiDocumentation"
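
For reference, building that header value programmatically might look like the sketch below. The `api_documentation_link` helper is hypothetical. It defaults to the full IRI because RFC 8288 requires extension relation types to be URIs, which seems to point toward the second form, though that is my reading rather than a decision recorded in this thread.

```python
HYDRA_API_DOC = "http://www.w3.org/ns/hydra/core#apiDocumentation"


def api_documentation_link(doc_url: str, rel: str = HYDRA_API_DOC) -> str:
    """Format a Link header value pointing at the API documentation."""
    return f'<{doc_url}>; rel="{rel}"'


# Headers a server might attach to every response (path taken from the docs).
headers = {"Link": api_documentation_link("/dts/api/document/documentation")}
```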
