This repository has been archived by the owner on Apr 16, 2020. It is now read-only.

Archive metadata and licensing --> js discussion #25

Closed
Tracked by #2
eminence opened this issue Sep 18, 2015 · 16 comments

@eminence
Collaborator

For each archive, we need a standard way to record some metadata with the archive. At the moment, the most important thing to include is licensing information, but we may find other information that we would like to require.

This issue is to track the discussion on this topic. Below is a draft proposal, with two examples. All aspects of this proposal are open for discussion.

  • Metadata should be stored in a file called _Metadata.json. The name is designed so that it'll appear near the top of directory listings.
  • The json object is a dictionary with the following keys:
    • title -- Provides a name for the archive
    • description -- A more verbose description, if needed
    • source -- A list of URLs where this data came from
    • license -- An array of dictionaries listing the relevant licenses. Each has the following keys:
      • summary -- a brief summary of the license
      • source -- Where to find the license/legal terms in full
    • last_synched -- an ISO 8601 timestamp indicating the last time this archive was updated
  • I think to start "license" and "title" should be required, others can be optional
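
To make the draft concrete, here is a minimal Python sketch of a check for such a file. The key names come from the proposal above; the function name `validate_metadata`, the example values, and the example.org URLs are placeholders made up for illustration, not part of any agreed format.

```python
import json

# Required keys per the draft above; everything else is optional for now.
REQUIRED_KEYS = {"title", "license"}


def validate_metadata(meta):
    """Return a list of problems found in a parsed _Metadata.json dictionary."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(meta)):
        problems.append(f"missing required key: {key}")
    licenses = meta.get("license", [])
    if not isinstance(licenses, list):
        problems.append("license must be an array")
        licenses = []
    # Each license entry is a dictionary with "summary" and "source".
    for i, entry in enumerate(licenses):
        for key in ("summary", "source"):
            if not isinstance(entry, dict) or key not in entry:
                problems.append(f"license[{i}] is missing {key}")
    return problems


# Hypothetical example file contents (all values are placeholders).
example = json.loads("""
{
  "title": "Example archive",
  "source": ["http://example.org/data"],
  "license": [
    {"summary": "Placeholder license summary",
     "source": "http://example.org/license"}
  ],
  "last_synched": "2015-09-18T00:00:00Z"
}
""")

print(validate_metadata(example))
```

An empty list means the file satisfies the two required keys; unknown extra keys are deliberately ignored, since the set of fields is still open for discussion.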

For two concrete examples, see the metadata for #23 and the metadata for #18

Other thoughts:

  • Should the metadata include maintainer information?
  • Should the metadata include the script/tool that was used to sync/update the archive? might be useful if the current maintainer goes away

CC #5 for related discussion

@davidar
Collaborator

davidar commented Sep 18, 2015

👍

However, instead of inventing our own format, ideally we could use an existing standard. For example:

@davidar
Collaborator

davidar commented Sep 18, 2015

I think we should separate metadata into two categories:

  1. machine-readable only, such as timestamps and hashes, which human end-users aren't likely to care about
  2. both machine- and human-readable, such as descriptions and licenses

For (1) I'm perfectly happy to just dump a (hidden) .metadata.json in the root directory, with whatever format is used by the tool used to update the archive.

For the second, I think we should use either (or both):

  • human-readable HTML + machine-readable tags
  • human-readable Markdown (or similar) + machine-readable YAML header (like Jekyll uses)

under the conventional README and LICENSE filenames. Personally I'm in favour of Markdown+YAML, and we can include a copy of the markdown viewer webapp.

To answer your other questions:

Should the metadata include maintainer information?

Yes, I'd say to include this in the license: e.g. "Original source blah, processed and uploaded to IPFS by blah"

Should the metadata include the script/tool that was used to sync/update the archive? might be useful if the current maintainer goes away

Definitely, I think it's even been suggested to put a copy of the tool within the archive itself. IPFS de-duplication means this has no more overhead than a link.

@eminence
Collaborator Author

What would be the purpose of including hashes? IPFS itself will ensure data integrity.

Something like Markdown or YAML sounds fine. I'd rather not use HTML, because HTML is not very friendly if you don't have a web browser to render it

@davidar
Collaborator

davidar commented Sep 18, 2015

What would be the purpose of including hashes? IPFS itself will ensure data integrity.

Some protocols (like rsync) support checking the hash of a remote file to see if it has changed. I'm basically talking about any metadata that the update tool can use internally to make its job easier.
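
As a rough illustration of the kind of internal bookkeeping meant here, the Python sketch below records a SHA-256 hash per file in a hidden .metadata.json and reports which files changed since the last run. It hashes local files only, as a stand-in for whatever remote check a real update tool would do; the function names and the state-file layout are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large archive files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(root, state_file=".metadata.json"):
    """Return paths whose hash differs from the recorded one, then update the record."""
    root = Path(root)
    state_path = root / state_file
    recorded = json.loads(state_path.read_text()) if state_path.exists() else {}
    # Hash every regular file under root except the state file itself.
    current = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p != state_path
    }
    changed = [name for name, digest in current.items()
               if recorded.get(name) != digest]
    state_path.write_text(json.dumps(current, indent=2))
    return changed
```

Calling `changed_files("path/to/archive")` twice in a row returns every file the first time and an empty list the second time, unless something was modified in between.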

Something like Markdown or YAML sounds fine. I'd rather not use HTML, because HTML is not very friendly if you don't have a web browser to render it

Agreed. Specifically I'm proposing something like:

README.md:

---
title: arXiv
source: http://arxiv.org/
authors:
  - arXiv contributors
  - IPFS archivists
updated: 2015-03-14
---
This is a mirror of the [Creative Commons](http://creativecommons.org)
subset of [arXiv](http://arxiv.org).

Yada yada

LICENSE.md:

---
license: http://creativecommons.org/licenses/by-sa/3.0/
title: CC-BY-SA-3.0
morePermissions: blah
attributionURL:
  - http://arxiv.org
  - http://ipfs.io
attributionName: arXiv, IPFS
---
You are free to:

    Share — copy and redistribute the material in any medium or format
    Adapt — remix, transform, and build upon the material 

Yada yada
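
For tooling, the machine-readable part of files like these has to be separated from the prose. A minimal Python sketch of that split is below; it only cuts the text at the `---` delimiters and leaves the header as a string, on the assumption that a real YAML parser would take over from there. The function name `split_front_matter` is made up for this example.

```python
def split_front_matter(text):
    """Split a document into (header, body); header is None when absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    # Look for the closing delimiter after the opening one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    return None, text  # no closing delimiter: treat everything as body


readme = """---
title: arXiv
source: http://arxiv.org/
updated: 2015-03-14
---
This is a mirror of the Creative Commons subset of arXiv.
"""

header, body = split_front_matter(readme)
print(header)
print(body)
```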

We also need to account for the fact that archives may have different licenses for different parts, in which case I'd suggest placing a separate LICENSE file into the relevant directories.
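
One way a tool could resolve which license applies to a given file under this scheme is to walk up from the file until it finds a LICENSE file. The Python sketch below assumes that "nearest LICENSE wins" rule and the two filenames shown; both are illustrative choices, not something settled in this thread.

```python
from pathlib import Path

# Filenames to look for in each directory (an assumption for this sketch).
LICENSE_NAMES = ("LICENSE", "LICENSE.md")


def find_license(path, archive_root):
    """Return the closest LICENSE file at or above `path`, stopping at archive_root."""
    archive_root = Path(archive_root).resolve()
    current = Path(path).resolve()
    if current.is_file():
        current = current.parent
    while True:
        for name in LICENSE_NAMES:
            candidate = current / name
            if candidate.is_file():
                return candidate
        # Stop at the archive root (or the filesystem root as a safety net).
        if current == archive_root or current.parent == current:
            return None
        current = current.parent
```

A file in a sub-directory with its own LICENSE resolves to that file, and everything else falls back to the top-level one.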

@eminence @jbenet Thoughts?

This was referenced Sep 19, 2015
@davidar davidar added the spec label Sep 19, 2015
@davidar davidar self-assigned this Sep 19, 2015
@jbenet
Contributor

jbenet commented Sep 19, 2015

  • 👎 on frontmatter. i think it confuses most people.
  • a package.jsonld based on OKFN's or npm's would probably work well.
  • we should try to use existing formats here if possible
  • typical to include purely verbatim license files

@davidar
Collaborator

davidar commented Sep 19, 2015

👎 on frontmatter. i think it confuses most people.

Fair enough. I meant it as a more readable alternative to the HTML microformats recommended by Creative Commons (and many others), which are even more confusing.

a package.jsonld based on OKFN's

👍 Thanks, that looks even better.

or npm's would probably work well.

👎 Yeah... I'm not drinking the NodeJS kool-aid ;)

we should try to use existing formats here if possible
typical to include purely verbatim license files

Sorry, I should have provided a reference, as I'm not the first person to propose something like this:

http://blog.martinfenner.org/2013/06/29/metadata-in-scholarly-markdown/

but I agree that it's not exactly widespread (yet :).

@jbenet
Contributor

jbenet commented Sep 19, 2015

👎 Yeah... I'm not drinking the NodeJS kool-aid ;)

Well, the OKFN data-package.json is directly derived from npm's package.json.

It turns out that node is one of the best programming systems out there, thanks to npm. npm got so much extremely right. The assumption that "it's js, it has to be bad" is so absurdly wrong. It beats go get/vendor, cabal, gem, and so on. cargo promises to be in the same ballpark, mostly because it copied npm in all the important things.

http://blog.martinfenner.org/2013/06/29/metadata-in-scholarly-markdown/

The problem with frontmatter is that it makes processing the files very annoying, particularly in APIs. I like it as a writer, but not a programmer.

@davidar
Collaborator

davidar commented Sep 19, 2015

I know this isn't the right place for this discussion, but I'll bite. I haven't used npm much, so I may be missing something, but looking at the spec, nothing particularly novel jumps out at me. It just looks like all the standard package fields, but in JSON.

Don't get me wrong, JavaScript actually ranks reasonably highly on my list compared to a lot of alternatives. But this current trend that JavaScript is the solution to every problem, and somehow solves it better than every other programming language, is frankly ridiculous. People complain about Haskell monads being painful, and yet callback hell is the best thing since sliced bread. Green threads have been around for a long time, and other languages have done a lot more in getting concurrency right. Don't even get me started on atom (1GB+ of ram for a text editor, seriously?).

@whyrusleeping
Contributor

(1GB+ of ram for a text editor, seriously?).

~17KB baby! anything more is bloat.

whyrusl+  5126  4.6  0.2 198740 16996 pts/3    S+   10:37   0:00 vim repo/fsrepo/fsrepo.go

@eminence
Collaborator Author

+1 on the data-packages format. My original proposal is fairly similar to this, so it matches up pretty well with what I had in mind

@davidar
Collaborator

davidar commented Sep 23, 2015

@eminence @jbenet Ok, so I'm thinking we should have:

  1. an OKFN datapackage.json file,
  2. a verbatim LICENSE file (either in the top-level directory, or sub-directories in the case of multiple licenses), and
  3. a standard README(.md) file containing any lengthy descriptions, etc.
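
A rough Python sketch of generating item 1 is below. The top-level keys `name`, `title`, `licenses` and `sources` follow the Data Package descriptor as I understand it, but the keys inside each license and source entry have differed between revisions of the spec, so treat them as assumptions and check the current spec. A full descriptor also lists the data files under `resources`, which is left out here. The function name `build_datapackage` is made up for this example.

```python
import json


def build_datapackage(name, title, license_name, license_url, source_url):
    """Build a minimal datapackage.json-style dictionary (field names are assumptions)."""
    return {
        "name": name,
        "title": title,
        # The full legal text lives in the verbatim LICENSE file next to this one.
        "licenses": [{"name": license_name, "path": license_url}],
        "sources": [{"title": title, "path": source_url}],
    }


package = build_datapackage(
    name="arxiv",
    title="arXiv",
    license_name="CC-BY-SA-3.0",
    license_url="http://creativecommons.org/licenses/by-sa/3.0/",
    source_url="http://arxiv.org/",
)

print(json.dumps(package, indent=2))
```

Lengthy descriptions stay in the README, so the descriptor only needs the short fields that tools are likely to index.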

@jbenet
Contributor

jbenet commented Sep 23, 2015

@davidar SGTM.

And, not talking about javascript. Talking about npm. This is inconsistent:

  • 👎 Yeah... I'm not drinking the NodeJS kool-aid ;)
  • 👍 Thanks, that looks even better.
  • the OKFN data-package.json is directly derived from npm's package.json.

The point is that the statement "not drinking the <THING> kool-aid" is typical of actively ignoring whatever <THING> is, including anything that may be good and valuable, instead of studying <THING> and dismissing the provably bad parts. I'm really tired of the js-hate, particularly when people make inconsistent or uninformed claims, like dismissing npm without even trying it, or understanding why it is well designed. It is similar to the dismissal that haskell gets from the "hardcore C/C++ systems people" (i.e. because they've not taken the time to understand it).

anyway, yep. not really worth discussing here.

@davidar
Collaborator

davidar commented Sep 23, 2015

The point is that the statement "not drinking the <THING> kool-aid" is typical of actively ignoring whatever <THING> is, including anything that may be good and valuable, instead of studying <THING> and dismissing the provably bad parts.

@jbenet Alright, I apologise for my wording, I could have phrased it better. For the record, I would have been equally dismissive about using Python's packaging format for this, or Debian's, or whatever, simply because software packaging and data packaging are different problems. In any case, I can approve of a data packaging format which happens to be derived from a small part of NPM without necessarily approving of NPM as a whole. It's not that I dislike NPM in particular, I just don't see the relevance to data packaging in comparison to any other software packaging format.

Also, despite what people seem to think, I don't hate JS/NPM any more than I hate Python/PyPI (which I use quite often). What I do hate is when people try to apply them to things outside of the domain in which it makes sense to do so ("when all you have is a hammer, everything looks like a nail"). My motto is "all programming languages suck, but some suck less than others in specific circumstances". JS is a good choice for some problems, for others it sucks (e.g. atom, IMHO). Haskell is good for some things, for systems programming it sucks. Python ... you get the idea.

In terms of NodeJS (and last I checked NPM is an official subproject) in particular, it's kind of the embodiment of applying a language to a problem it was never meant to solve. If it were marketed as a scripting language (in the same category as Python), then I wouldn't have a problem with it, but a lot of people make far more overzealous claims about it (yes, it's better than PHP, but there are a lot of other languages that are better still). It would be like trying to run Perl or Fortran in a web browser (please tell me nobody has tried that :). As a result, it makes me automatically skeptical of assertions about its superiority without any supporting facts. You keep telling me NPM is well-designed, and I've used it a little and tried researching myself to understand what you mean, but I'm not seeing anything all that special TBH. Like I said, please elaborate if you think I'm missing something.

Anyway, those were the thoughts I was trying to convey in my somewhat flippant remark :)

@rht

rht commented Sep 28, 2015

~17KB baby! anything more is bloat.

neovim ~11KB

@rht

rht commented Oct 23, 2015

Consider using spdx for license parsing, see ipfs/kubo#337.

@jbenet
Contributor

jbenet commented Dec 3, 2015

I reopened as #45 since this turned into js bikeshedding discussion

@jbenet jbenet closed this as completed Dec 3, 2015
@jbenet jbenet changed the title Archive metadata and licensing Archive metadata and licensing --> js bikeshedding Dec 3, 2015
@jbenet jbenet changed the title Archive metadata and licensing --> js bikeshedding Archive metadata and licensing --> js discussion Dec 3, 2015