How Metadata Works in the Publishing World #57

dauwhe · 2019-09-09T16:42:54Z

In much of the EPUB world, the metadata that matters is not inside the EPUB, but outside (in the form of ONIX). The metadata inside EPUBs is often wrong, is difficult to change, and there is very little incentive to make it accurate since it's mostly unused.

In the web world, page metadata directly affects search ranking, Google rich snippets, etc. There is no out-of-band transmission of metadata. There is strong incentive to make it accurate.

How do we avoid the situation with EPUB, where we've spent decades worrying about metadata, continually changing how it's expressed, without really benefiting users?

mattgarrish · 2019-09-10T19:21:53Z

This is generally why it's a bad idea for publication specifications to dive so deeply into metadata vocabularies.

We wanted to provide a framework for metadata expressions with EPUB 3, but got sucked into the metadata vortex of despair by introducing some "essential" metadata that didn't seem to exist but that looked like the non-ONIX folks would need. That's led to EPUB being looked at as having to define the essential metadata, when metadata expressions really should be figured out at the publishing/publication level.

I opened w3c/wpub#429 in part because I see a lot of the same happening here. The more we layer in the more it looks like what we exclude doesn't matter, and that leads to more requests to include things. Plus the more we recommend for certain areas of publishing the more annoying we make metadata for others.

The manifest is somewhat unencumbered now that most metadata is optional, but we still define a whole lot of concepts that really aren't essential to user agents (dates, etc.).

During 3.1, we started to look at defining prescriptive metadata guidelines for publishers using alternative means, like best practices documents. Call me old fashioned, but it still strikes me as the best balance. Let each community define what it wants to express and how it wants to express it outside the specification.

iherman · 2019-09-11T04:02:27Z

Well... we should be careful. The presence of, e.g., ONIX is clearly important for trade publishing. But we also know from our discussions that, at this moment, Publication Manifests will not be considered by trade publishing for some time, as it will keep to EPUB 3.x.

What is the situation in other areas? E.g., the little I know about scholarly publishing is that (at least for journals) the "publication" content is not dominated by packaging, which puts things in a very different perspective.

mattgarrish · 2019-09-11T14:50:12Z

The presence of, e.g., ONIX is clearly important for trade publishing. But we also know from our discussions that, at this moment, Publication Manifests will not be considered by trade publishing for some time, as it will keep to EPUB 3.x.

Right, I'm not suggesting we pick a side in that debate.

I just look at the publication manifest and I see a "format" that is itself also one big metadata expression language so we're already in an ideal scenario. We don't need to turn the specification into a rehash of schema.org or dcterms or any other scheme, as these are already accommodated.
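As a rough illustration of that point, the sketch below builds a manifest as a Python dictionary and picks out the properties that come from another vocabulary through a prefix. The two-entry `@context`, the `dcterms:` prefix, the helper name `prefixed_properties`, and every value are assumptions made for illustration, not text taken from the specification.

```python
import json

# Hypothetical manifest: the context URLs and all values are illustrative assumptions.
example_manifest = {
    "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
    "type": "Book",
    "name": "Example Title",
    "author": "Example Author",
    # schema.org properties that the specification does not need to redefine
    "description": "A short description supplied by the publisher.",
    "copyrightYear": 2019,
    # a property borrowed from another vocabulary through a prefix
    "dcterms:subject": "Example subject",
}


def prefixed_properties(doc: dict) -> list:
    """Return the keys that use a vocabulary prefix such as 'dcterms:'."""
    return sorted(k for k in doc if ":" in k and not k.startswith("@"))


print(json.dumps(example_manifest, indent=2))
print(prefixed_properties(example_manifest))
```

Nothing in the dictionary requires the specification to list `description` or `dcterms:subject` itself; a community guide could recommend them without any change to the manifest format.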

I just feel we're better off staying out of the metadata sphere as much as we can. If we can't pin a property to something the user agent needs for some specific purpose, then the property probably doesn't belong in the specification.

We need to find a way to empower the publishing communities (hint, hint) to work out the details of what metadata belongs in the manifest when an ONIX record isn't the primary source of that descriptive detail, and for these communities to publish notes or guides for each relevant publishing realm.

llemeurfr · 2019-09-13T17:29:58Z

It has struck me, ever since I came into this publishing domain, that EPUB metadata lack a clear use case: who are these metadata made for? And the question is now the same for publication manifest metadata.

IMO they should be made for end users, helping users classify and find publications they have "acquired" and are present on their large "bookshelf" or "personal library". This is not discovery / commercial data to be used by booksellers (ONIX is made for that). This is not classification data to be used by academic libraries (MARC is good for that).

Once the use case is clear, we can decide which metadata are useful and which are not so.

gregoriopellegrino · 2019-10-07T09:25:48Z

As a note, for possible new versions of the specifications: comparing the metadata available in EPUB and those available in the pub-manifest, I noticed the lack of some information, which is used in real use cases. These are:

contributor: there may be more roles than the ones available in the pub-manifest (Afterword by, Epilogue by, Curated by, ...)
description
rights
source: I think it could be useful to have a reference to the paper book, another publication, etc.
subject

llemeurfr · 2019-10-07T09:36:00Z

@gregoriopellegrino the only metadata in your list I don't see the use for end-users (readers) is the rights information: if a user has a publication in his hands, what is the use of rights information for him? would it contain things like "you, reader, have the right to do this, but do not have the right to do that, with the publication you have acquired"?

gregoriopellegrino · 2019-10-07T09:37:51Z

llemeurfr · 2019-10-07T10:01:06Z

@gregoriopellegrino if a consensus is found around a copyright notice (I would support it), then a "copyrightNotice" property would be more interesting than a "rights" property.

We can have a look at the news industry, where a copyrightNotice property is defined (https://www.iptc.org/std/NewsML-G2/guidelines/#copyright-notice) as a child of a bigger "rightsInfo" structure (https://www.iptc.org/std/NewsML-G2/guidelines/#rights-metadata).

schema.org has another way: copyrightHolder + copyrightYear. In case of consensus around the concept, we'll have to choose our way.
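To make the two options concrete, here is a minimal sketch that contrasts a structured record using schema.org's `copyrightHolder` and `copyrightYear` with a free-text `copyrightNotice` string as proposed above. The sample values, the helper name `notice_from_schema_org`, and the wording of the rendered notice are assumptions for illustration only.

```python
def notice_from_schema_org(work: dict) -> str:
    """Render a display notice from copyrightHolder and copyrightYear."""
    holder = work["copyrightHolder"]
    # copyrightHolder may be a plain string or an object carrying a "name"
    name = holder["name"] if isinstance(holder, dict) else holder
    return f"Copyright © {work['copyrightYear']} {name}"


# Option 1: structured properties, from which a notice can be derived
structured_work = {
    "copyrightHolder": {"type": "Organization", "name": "Example Press"},
    "copyrightYear": 2019,
}

# Option 2: a single human-readable notice, stored exactly as the publisher wrote it
free_text_work = {
    "copyrightNotice": "Copyright © 2019 Example Press. All rights reserved.",
}

print(notice_from_schema_org(structured_work))
print(free_text_work["copyrightNotice"])
```

The structured form is easier for software to query, while the free-text form preserves whatever wording the publisher chose; that trade-off is the choice described above.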

iherman · 2019-10-08T03:29:15Z

This issue was discussed in a meeting.

No actions or resolutions


issue 57, metadata in the publishing world
Wendy Reid: #57
Wendy Reid: dave raised an issue about how metadata works in the publishing world
Dave Cramer: Everybody knows I worry about a lot of things. Much of our experience with EPUB has been spent in metadata rabbit holes and new vocabularies
… everyone has a property that is important to them, we spend a lot of effort
… metadata is not always exposed to the reader, and it travels separately from the EPUB itself
Charles LaPierre: except VitalSource is starting to expose the Accessibility Metadata
Dave Cramer: I raised this so we could be thoughtful about the metadata we expose
Ben Schroeter: to add to that
… we do supply a11y metadata in the epub that is used by distributors
Charles LaPierre: I’m on the a11y metadata thing
… vitalsource is exposing EPUB a11y metadata to users
Wendy Reid: that metadata is in ONIX, too?
Charles LaPierre: some of it; it’s not a 1:1 mapping; there’s more in ONIX 3
… but it’s not in ONIX 2.1, which is still widely used in US
Brady Duga: what Dave said is true for publisher-supplied ebooks, but not so much from user supplied epub
… but the metadata that matters is mostly author and title
Ivan Herman: I’ve said several times, if this manifest exercise becomes successful, it may not be in the worlds where EPUB is already successful
… we should not be bound by EPUB or ONIX
Laurent Le Meur: in the thorium reader we try to present metadata in the OPDS feed or in EPUB
… but there is no consistent set of user-oriented metadata
… but we would like to get it right… publisher, language, category, subject, narrator…
… all would be useful
Matt Garrish: there’s room to do metadata standardization outside of the standard itself
… rather than putting every metadata scheme in the spec itself, leave it to the communities
… there should be some core stuff
… it’s probably more efficient to do things outside
Bill Kasdorf: will there be a generic way to incorporate community-specific metadata?
… so scholarly publishers can include what they think is essential?
Matt Garrish: that’s exactly how it would work and how it is set up right now
… we’re a proxy for schema.org; we can use anything there without having it directly in our spec
… and our context files include more prefixes
… we are very flexible
Avneesh Singh: +1 Matt
Matt Garrish: there should be a clear purpose to list metadata in the core spec
Bill Kasdorf: I am anti-bloat
Gregorio Pellegrino: I understand what Matt says but some metadata is essential, like description
… we should suggest to use some metadata, because otherwise reading systems won’t implement
Laurent Le Meur: I agree that in the core spec maybe we don’t need an extensive set
… communities can define their own metadata
… who is the community? Is the audiobook community literally this working group?
… but we need something defined somewhere
Ivan Herman: one of the reasons we took JSON-LD is because schema.org used it
… but JSON-LD is ideally suited for this… you can just add things and it’s OK.
… laurent is right; for different areas there should be communities defining metadata
… I don’t know if there is additional metadata required by audiobooks, if so let’s add it
… it depends… a CG might be able to define some of these things
… the main goal is to provide a framework
Dave Cramer: I’m not opposed to metadata. We seem to think that embedding metadata is always good, but past experience shows this data is rarely used. I’m aware of few reading systems that use title and author, but we’ve made many EPUBs with copyright statements
Avneesh Singh: let this spec go to CR as is
Ivan Herman: See example to link external metadata (ONIX in this case)
Avneesh Singh: and we have the audiobooks spec; we need to know what audio publishers want
Ivan Herman: +1 to Avneesh
Avneesh Singh: and we can do a note or registry with metadata
Gregorio Pellegrino: I agree with avneesh
… the possibilities for defining the role of a contributor in schema.org are very poor
… we need ways to add that
Laurent Le Meur: adding metadata can be done step-by-step
… we need a group that can host these needs
… which group is it? This WG working on audiobooks?
… then we can wait for needs from publishers of audiobooks
Matt Garrish: to what gregorio said, we can request that schema.org add stuff that’s missing
Ivan Herman: +1 to matt
Gregorio Pellegrino: +1 to matt
Matt Garrish: it never ends well when we add metadata to our own standards
Ivan Herman: a partial answer to laurent
… there are two issues. one is, who are the groups that develop metadata? I don’t think there is one answer.
… two: how do you find the metadata that has already been developed?
… we may need a registry
Bill Kasdorf: in some sectors of publishing there are organizations that govern metadata
… IPTC, JATS, etc
… as we reach out to other sectors, we will find there are already metadata standards
Wendy Reid: this is not a question for us to solve today
… we have sufficient metadata in our specs for now. We’ll see how it goes in CR.
… so let’s move on
Ivan Herman: is it OK if we close this issue?
Dave Cramer: refresh your github
Gregorio Pellegrino: if we close the issue, how can we say we are thinking about this?
Gregorio Pellegrino: Fine
Ivan Herman: we could say it’s deferred
Wendy Reid: #98
Gregorio Pellegrino: Defer
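
Ivan Herman's remark in the transcript about linking external metadata (ONIX in that case) can be sketched roughly as follows. The `links` array, the `LinkedResource` entries, the `onix-record` relation value, the URLs, and the helper name `find_external_record` are assumptions for illustration; check the specification for the exact terms.

```python
from typing import Optional

# Hypothetical manifest that points to an external ONIX record instead of
# repeating the descriptive metadata inline. All values are illustrative.
linked_manifest = {
    "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
    "name": "Example Title",
    "links": [
        {
            "type": "LinkedResource",
            "url": "https://publisher.example.org/title-onix.xml",
            "encodingFormat": "application/onix+xml",
            "rel": "onix-record",  # assumed relation name
        },
        {
            "type": "LinkedResource",
            "url": "https://publisher.example.org/cover.jpg",
            "encodingFormat": "image/jpeg",
            "rel": "cover",
        },
    ],
}


def find_external_record(doc: dict, rel: str = "onix-record") -> Optional[str]:
    """Return the URL of the first link carrying the given relation, if any."""
    for link in doc.get("links", []):
        rels = link.get("rel", [])
        if isinstance(rels, str):  # accept a single string or a list of relations
            rels = [rels]
        if rel in rels:
            return link.get("url")
    return None


print(find_external_record(linked_manifest))
```

With this arrangement the manifest stays small, and the detailed record remains in the format the trade supply chain already uses.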

mrjj · 2020-03-07T12:06:41Z

schema.org has another way: copyrightHolder + copyrightYear. In case of consensus around the concept, we'll have to choose our way.

Just for information, we have a practical case where something like copyrightEndYear can matter. For example, a publisher purchases a non-exclusive sub-license for a classic work such as "The Hobbit, or There and Back Again". Such a sub-license is usually time-limited to 3-10 years, because otherwise the huge capital expenditure would outweigh any possible discount. In this case it is important to preserve the known revocation date of the publisher's sub-license along the rest of the distribution chain, in case the publisher stops controlling the legal status of distribution for any reason. I usually see this date defined in agreement papers in the legal department and in agreements between publisher and distributor, but there is no single place for a time mark such as "restrict download of this bundle after YYYY-MM-DD".

Schema.org is very unclear about the start year, which it defines as "The year during which the claimed copyright for the CreativeWork was first asserted": without information about the authority that asserted the IPR transfer, contract, or other agreement, it is just a four-digit number. To follow the practical case: as an end user, I may want to see the publisher's copyright notice to make sure that I am not witnessing an infringement and incurring a civil responsibility to report it. If there is a notice, that is enough to be sure I am not the responsible party. But if I want to check this agreement as a government worker, I first want to know which authority to contact besides the publisher. If this authority assigns identifiers related to the rights transfer, that is fine, and it is clear where to place any identifier of this kind. In both cases, the year of first copyright assertion by the holder seems to be a minor detail, especially without detailed information about all the related paperwork.

The EPUB license manifest template stored on a license server managed by the Readium platform covers this case: the status can be checked on the server, and the template also defines a time cap for any bundled EUL manifest file.

Internally, to describe license status resolution, we use a model very close to the IPR transfer model of the indecs framework. It is not a fit for DPub/EPUB 3.x, but it is a good tool for modelling different license statuses. According to this model, copyrightHolder appears to be a very unspecific note.

But the Schema.org TransferAction concept seems to be a good and compatible equivalent. Maybe it is worth considering as a possible recommended description practice, and it may be the answer to the question of where to put all these parties, dates, rights, and so on. It is really hard to define a perfect couple of fields for these needs, and CreativeWork itself is not that good for this. The legal side of business knowledge is usually about linkage, forming the legal status of all entities in their sum with temporal bindings, not about one thing with limitless properties. From my side, I see the idea of adding more aspects there as a crime against sanity. So I see some sense in using the existing part of the model that links business activity to the work, as intended, and preserving a place only for the EULA and for information about the responsible parties/authorities (usable in practice, like a registration number or contact info) for the case of transmission or bundling.

The other thing that has turned out to be very important for our digital publishing activity is the public domain status of a work, and the (ongoing/past) date of this status transfer. But this is a question as complex as the whole current topic.
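
The time-limited sub-license case described in this comment could be checked along the following lines. This is only a sketch: the `sublicenseEndDate` field and the `download_allowed` helper are hypothetical names invented here, and no such property is defined by schema.org or the manifest specification.

```python
from datetime import date


def download_allowed(bundle: dict, today: date) -> bool:
    """Return False once the hypothetical sub-license end date has passed."""
    end = bundle.get("sublicenseEndDate")  # hypothetical ISO date, YYYY-MM-DD
    if end is None:
        return True  # no time cap recorded for this bundle
    # the last day of the sub-license still counts as allowed
    return today <= date.fromisoformat(end)


# Illustrative bundle record; the title and date are made-up values.
bundle = {"name": "Example Classic", "sublicenseEndDate": "2025-12-31"}

print(download_allowed(bundle, date(2025, 6, 1)))
print(download_allowed(bundle, date(2026, 1, 1)))
```

A single date field like this captures only the time cap; it says nothing about which authority or agreement stands behind the date, which is the gap the comment points out.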

mattgarrish mentioned this issue Sep 13, 2019

Add basic dcterms:conformsTo example #59

Merged

dauwhe closed this as completed Oct 7, 2019

mattgarrish reopened this Oct 7, 2019

mattgarrish added the status: postponed label Oct 7, 2019
