Keeping tool metadata up to date with upstream services and limiting duplication #65

proycon · 2019-12-12T17:22:52Z

I have a concern regarding keeping the switchboard tool registry up to date with upstream tools and (unnecessary) duplication of metadata. These hosted tools will get updated now and then and the switchboard registry by definition lags a bit behind. (I doubt upstream tool developers will remember or be willing to update the switchboard registry every time they do a new deployment). In cases where the calling API does not change, it's not really a functional problem. Semantically though, fields such as "version" become a bit useless (or in the worst interpretation, misleading) if they do not point to the actual version used.

I'm wondering whether it might be an idea to have the switchboard actively harvest parts of the software metadata from the various sources (like once a day). Of course this requires that machine-parseable metadata is made available in the first place, which will not be possible in all cases because the information simply isn't offered, but in some cases where it is offered, and done so in a machine-parseable way, it would be a shame not to use it. For everything hosted in LaMachine, we have codemeta metadata (JSON-LD) available. My general recommendation is always: keep software metadata as close to the source(s) as possible, and let it trickle down.
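To make the harvesting idea concrete, here is a minimal sketch, not an existing switchboard feature. It assumes the upstream tool publishes a codemeta JSON-LD document using the standard codemeta terms (`name`, `version`, `description`, `license`, `author`). The registry entry shape and the function names `harvest` and `refresh` are hypothetical. Only generic fields are copied; anything else in the entry is left alone.

```python
import json

# Generic fields that upstream codemeta can supply. Fields that drive
# matching (e.g. "mimetypes") are deliberately not harvested here.
HARVESTABLE = ("name", "version", "description", "license")


def harvest(codemeta_text):
    """Extract generic software metadata from a codemeta JSON-LD string."""
    doc = json.loads(codemeta_text)
    harvested = {key: doc[key] for key in HARVESTABLE if key in doc}

    authors = doc.get("author", [])
    if isinstance(authors, dict):  # JSON-LD allows a single object instead of a list
        authors = [authors]
    names = []
    for person in authors:
        if isinstance(person, dict):
            parts = (person.get("givenName"), person.get("familyName"))
            name = " ".join(part for part in parts if part)
            if name:
                names.append(name)
    if names:
        harvested["authors"] = names
    return harvested


def refresh(entry, harvested):
    """Return a copy of a registry entry with harvested fields overriding stale ones."""
    return {**entry, **harvested}


# Hypothetical upstream document and (stale) registry entry.
upstream = json.dumps({
    "name": "Example Tool",
    "version": "2.3.0",
    "author": [{"givenName": "Ada", "familyName": "Example"}],
})
stale_entry = {"name": "Example Tool", "version": "2.0.0", "mimetypes": ["text/plain"]}
print(refresh(stale_entry, harvest(upstream)))
```

A daily job could fetch the document from each tool and call `refresh`; how the document is fetched and where entries are stored is left open here.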

A related issue is the amount of duplication in the metadata currently, even in the same registry. Looking at the various Weblicht entries for example, a lot of the more general software metadata is shared, but each service entrypoint demands its own entry in the registry, so shared information has to be edited in multiple places in the registry. For now it's all not too big of an issue, but when things scale up it may become so.
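One possible way to limit this duplication, sketched under assumptions: keep the shared metadata in one place and expand it into one entry per service entrypoint when the registry is built. The field names and the `expand` helper are hypothetical and do not reflect the actual registry format.

```python
def expand(shared, entrypoints):
    """Build one full registry entry per entrypoint from shared metadata.

    Values given on an entrypoint override the shared ones.
    """
    return [{**shared, **entrypoint} for entrypoint in entrypoints]


# Hypothetical data: metadata common to a family of services, edited once.
shared = {
    "description": "Example tool suite",
    "licence": "GPL-3.0",
    "homepage": "https://example.org/suite",
}
# Only what differs per entrypoint is listed here.
entrypoints = [
    {"name": "Suite: tokenizer", "url": "https://example.org/suite/tokenize"},
    {"name": "Suite: tagger", "url": "https://example.org/suite/tag"},
]

for entry in expand(shared, entrypoints):
    print(entry["name"], entry["licence"], entry["url"])
```

With this layout, a change to the shared description or licence is made once and reaches every generated entry.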

(Also somewhat related to clarin-eric/switchboard-tool-registry#5)

proycon · 2019-12-12T17:43:24Z

I want to add a bit more context about the solution we currently have in place at CLST in Nijmegen. The closest thing we have to a switchboard is our portal page (powered by labirinto). Like the switchboard, it offers an entry point to various tools (though it is much simpler and offers none of the data matching facilities the switchboard does).

Instead of having an independent metadata registry for the portal page, our registry is largely compiled automatically from the metadata of the actual software. Software metadata from the Python Package Index, Debian/Ubuntu repository, CRAN and Maven Central is read and converted to a common codemeta specification, which is specifically designed to map software metadata from different systems to a common scheme. In our case the software installation is managed by LaMachine, which takes care of calling the codemeta tools and aggregating all metadata into a single registry (amending it with things it can figure out itself). That registry can then in turn be used by labirinto, the portal tool.

Of course, this only concerns generic software metadata (name, authors, description, licence, version, etc.), which is an integral part of what you have in your registry but not sufficient for operation of the switchboard (you need more information, like what CLAM or OpenAPIs.org offers). I think what we should aim for in practice is to combine software metadata from various sources, with each source as close to upstream as possible, so the metadata does not risk losing relevance by going out of date in a constantly moving ecosystem.

(I'm also poking @JanOdijk because he may be interested in this discussion, as he has been involved with software metadata on the CMDI side of things for CLARIAH)

andmor- · 2019-12-18T16:33:33Z

While I do agree with the problem definition and principles, I want to point out that the proposed solution clashes against one of the basic principles of the Switchboard: "Keep the requirements for tools to be part of the Switchboard as low as possible".
Some considerations:

- Surely some tool providers do not have the resources to implement a metadata endpoint on their tools. Although not extremely complex, this is certainly more complex than the current requirement of implementing support for 1 GET parameter.
- Using JSON-LD still leaves open the actual schema to be used. So even your tools would probably need a new metadata endpoint to feed the switchboard.
- Worse than the situation we have now would be a situation where some tools provide their metadata via their endpoint and others via the Switchboard tool registry files.
- Metadata fields like version or authors are actually pointless for the Switchboard. They are not used, nor can I envision a usage for them.
- Metadata fields like mimetypes or languages are dangerous to be set dynamically by the receiving tools. Each tool maintainer tends to want her tools to be as visible as possible, while from the point of view of the users, the functionality that a certain tool offers for a certain mimetype might not be relevant, might not be mature enough or might not even integrate properly. That is, allowing these fields to be filled in dynamically by the tools in real time would transfer control of the switchboard matching mechanism to those endpoints. Without proper coordination and fail-safe mechanisms, this could lead to all kinds of problems.
- The metadata field url (or equivalent) will always have to be provided by some kind of registry so that the Switchboard knows what to harvest.

From the above, and looking at the original problem description, I propose that fields that are not needed and have great potential to become outdated, e.g. version or authors, are actually removed from the registry's metadata.

emanueldima · 2020-01-08T10:11:01Z

I agree with André's comment above; we should not add friction for the tools that we integrate. We should just remove the fields that are not essential and expect the tool's landing page to provide information about versions, authors, etc.

proycon · 2020-01-28T11:29:56Z

Yes, I understand where you are coming from and I completely agree the requirements for tools should be kept as low as possible. Removing fields that are not essential or cannot be kept up to date is a good idea. I understand you guys were planning an overhaul of the registry format anyway.

I did some exploratory work on this issue last December, after the workshop, and think there is a good middle ground, which neither puts the burden of harvesting on the switchboard nor requires tools to be harvestable and subscribe to a certain metadata format. We have many CLAM-based services in Nijmegen, so I simply wrote a tool that automatically creates switchboard registry entries for a given CLAM-based service (see https://github.com/proycon/clam2switchboard, still a work in progress). Now this of course only works for CLAM, but if there are any other commonly used frameworks then other such tools could possibly be created too. In any case, the current 'manual' method remains just as valid.

You then get automatically generated switchboard registry entries like: proycon/switchboard-tool-registry@1c54c3b (proof of concept, not ready for merge yet)
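As a rough illustration of what such a generator does, here is a minimal sketch. It is not the code of clam2switchboard. The shape of the input service description and the output field names (`name`, `url`, `mimetypes`, `languages`) are assumptions made for the example, and `to_registry_entry` is a hypothetical helper.

```python
import json


def to_registry_entry(service):
    """Translate a framework-specific service description into a registry entry.

    `service` is a hypothetical description with a title, an endpoint and a
    list of accepted inputs, each optionally naming a mimetype and a language.
    """
    inputs = service.get("inputs", [])
    return {
        "name": service["title"],
        "url": service["endpoint"],
        # Deduplicate and sort so regenerated entries produce stable diffs.
        "mimetypes": sorted({i["mimetype"] for i in inputs if "mimetype" in i}),
        "languages": sorted({i["language"] for i in inputs if "language" in i}),
    }


# Hypothetical service description, as a framework might expose it.
service = {
    "title": "Example Service",
    "endpoint": "https://example.org/service",
    "inputs": [
        {"mimetype": "text/plain", "language": "nld"},
        {"mimetype": "text/plain", "language": "eng"},
    ],
}
print(json.dumps(to_registry_entry(service), indent=2))
```

The generated entry can still be reviewed by hand before it goes into the registry, which keeps control over matching-relevant fields with the registry maintainers.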

emanueldima · 2020-12-02T13:22:38Z

I think we can close this. We decided to accept contrib scripts in https://github.com/clarin-eric/switchboard-tool-registry-contrib. These scripts can translate from tool-specific metadata formats to the Switchboard format and even automate the creation of PRs.

This was referenced Dec 13, 2019

Support multiple inputs #28

Open

Support multiple tasks #67

Open

emanueldima closed this as completed Dec 2, 2020
