Keeping tool metadata up to date with upstream services and limiting duplication #65
Comments
I want to add a bit more context about the solution we currently have in place at CLST in Nijmegen. The closest thing we have that is comparable to a switchboard is our portal page (powered by labirinto). Like the switchboard, it offers an entry point to various tools (though it is much simpler and offers none of the data-matching facilities the switchboard has). Instead of having an independent metadata registry for the portal page, our registry is largely compiled automatically from the metadata of the actual software. Software metadata from the Python Package Index, the Debian/Ubuntu repositories, CRAN, and Maven Central is read and converted to a common codemeta specification, which is specifically designed to map software metadata from different systems to a common scheme. In our case the software installation is managed by LaMachine, which takes care of calling the codemeta tools and aggregating all metadata into a single registry (amending it with things it can figure out itself), which can then in turn be used by labirinto, the portal tool. Of course, this only concerns generic software metadata (name, authors, description, licence, version, etc.), which is an integral part of what you have in your registry but not sufficient for operation of the switchboard (you need more information, like what CLAM or OpenAPIs.org offers). I think what we should aim for in practice is to combine software metadata from various sources, where those sources are as close to the upstream source as possible, so they don't run the risk of losing relevance by going out of date in a constantly moving ecosystem. (I'm also poking @JanOdijk because he may be interested in this discussion, as he has been involved with software metadata on the CMDI side of things for CLARIAH.)
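To make the conversion step described above a bit more concrete, here is a minimal sketch of mapping a handful of Python package metadata fields onto codemeta terms. It is not the actual code of the codemeta tools that LaMachine calls; the function name `pypi_to_codemeta` and the field subset are assumptions chosen for illustration, and the input is a plain dictionary rather than a live package lookup.

```python
# Illustrative only: a tiny subset of a PyPI -> codemeta mapping.
# The helper name and the chosen fields are hypothetical.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Python core-metadata key -> codemeta/schema.org property
PYPI_TO_CODEMETA = {
    "Name": "name",
    "Version": "version",
    "Summary": "description",
    "License": "license",
    "Home-page": "url",
}


def pypi_to_codemeta(pypi_meta):
    """Return a codemeta-style JSON-LD dict built from PyPI-style metadata."""
    doc = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}
    for src_key, dst_key in PYPI_TO_CODEMETA.items():
        value = pypi_meta.get(src_key)
        if value:  # skip fields the package does not declare
            doc[dst_key] = value
    author = pypi_meta.get("Author")
    if author:
        doc["author"] = [{"@type": "Person", "name": author}]
    return doc


example = pypi_to_codemeta(
    {"Name": "example-tool", "Version": "0.3.1", "Summary": "A demo tool",
     "Author": "Jane Doe"}
)
```

Similar small mappings would be needed per source (Debian, CRAN, Maven), which is the gap the codemeta crosswalk is meant to cover.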
While I do agree with the problem definition and principles, I want to point out that the proposed solution clashes against one of the basic principles of the Switchboard: "Keep the requirements for tools to be part of the Switchboard as low as possible".
From the above, and looking at the original problem description, I propose that fields that are not needed and with great potential to become outdated e.g.
I agree with André's comment above: we should not add friction for the tools that we integrate. We should just remove the fields that are not essential and expect the tool's landing page to provide information about versions, authors, etc.
Yes, I understand where you are coming from and I completely agree that the requirements for tools should be kept as low as possible. Removing fields that are not essential or cannot be kept up to date is a good idea. I understand you guys were planning an overhaul of the registry format anyway. I did some exploratory work on this issue last December, after the workshop, and I think there is a good middle ground, one that neither puts the burden of harvesting on the switchboard nor requires tools to be harvestable and subscribe to a certain metadata format. We have many CLAM-based services in Nijmegen, so I simply wrote a tool that automatically creates switchboard registry entries for a given CLAM-based service (see https://github.com/proycon/clam2switchboard, still a work in progress). This of course only works for CLAM, but if there are other commonly used frameworks, similar tools could be created for them too. In any case, the current 'manual' method remains just as valid. You then get automatically generated switchboard registry entries like: proycon/switchboard-tool-registry@1c54c3b (proof of concept, not ready for merge yet)
I think we can close this. We decided to accept contrib scripts in https://github.com/clarin-eric/switchboard-tool-registry-contrib. These scripts can translate from tool-specific metadata formats to the Switchboard format and even automate the creation of PRs.
I have a concern regarding keeping the switchboard tool registry up to date with upstream tools and (unnecessary) duplication of metadata. These hosted tools will get updated now and then and the switchboard registry by definition lags a bit behind. (I doubt upstream tool developers will remember or be willing to update the switchboard registry every time they do a new deployment). In cases where the calling API does not change, it's not really a functional problem. Semantically though, fields such as "version" become a bit useless (or in the worst interpretation, misleading) if they do not point to the actual version used.
I'm wondering whether it might be an idea to have the switchboard actively harvest parts of the software metadata from the various sources (like once a day). Of course this requires that machine-parseable metadata is made available in the first place, which will not be possible in all cases because the information simply isn't offered, but in some cases where it is offered, and done so in a machine-parseable way, it would be a shame not to use it. For everything hosted in LaMachine, we have codemeta metadata (JSON-LD) available. My general recommendation is always: keep software metadata as close to the source(s) as possible, and let it trickle down.
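As a rough illustration of what such a harvesting step could look like, the sketch below refreshes a registry entry from an already-fetched codemeta document. Everything here is an assumption made for the example: the function name `refresh_entry`, the registry keys `version` and `description`, and the idea that registry and codemeta keys line up one to one. Fetching the JSON-LD and scheduling the daily run are deliberately left out.

```python
import copy

# Hypothetical mapping: registry entry key -> codemeta key to harvest from.
HARVESTED_FIELDS = {"version": "version", "description": "description"}


def refresh_entry(entry, codemeta):
    """Return (updated_entry, changed_keys) without mutating the input entry.

    Only fields that upstream actually provides are overwritten; anything
    the codemeta document lacks keeps its hand-maintained registry value.
    """
    updated = copy.deepcopy(entry)
    changed = []
    for registry_key, codemeta_key in HARVESTED_FIELDS.items():
        upstream_value = codemeta.get(codemeta_key)
        if upstream_value is not None and updated.get(registry_key) != upstream_value:
            updated[registry_key] = upstream_value
            changed.append(registry_key)
    return updated, changed


old_entry = {"name": "Example Tool", "version": "1.0", "url": "https://example.org/tool"}
harvested = {"name": "example-tool", "version": "1.2"}
new_entry, changed_keys = refresh_entry(old_entry, harvested)
```

Returning the list of changed keys makes it easy to log or review what a harvest run altered, and tools without machine-parseable metadata are simply never passed through this step.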
A related issue is the amount of duplication in the metadata currently, even in the same registry. Looking at the various Weblicht entries for example, a lot of the more general software metadata is shared, but each service entrypoint demands its own entry in the registry, so shared information has to be edited in multiple places in the registry. For now it's all not too big of an issue, but when things scale up it may become so.
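One way to picture a fix for this duplication is to keep the shared metadata in one place and expand it into per-entrypoint entries when the registry is built. The sketch below assumes a hypothetical layout (a `shared` dictionary plus a list of entrypoint-specific dictionaries); it does not reflect the actual switchboard registry format, and the field names are invented for illustration.

```python
def expand_entries(shared, entrypoints):
    """Build one full entry per entrypoint from a single shared metadata block.

    Entrypoint-specific values take precedence over shared ones.
    """
    return [{**shared, **entrypoint} for entrypoint in entrypoints]


# Hypothetical example data, not taken from the real registry.
shared_metadata = {"homepage": "https://example.org/suite", "licence": "GPL-3.0"}
service_entrypoints = [
    {"name": "Suite tokenizer", "task": "Tokenization"},
    {"name": "Suite tagger", "task": "Part-of-speech tagging"},
]
entries = expand_entries(shared_metadata, service_entrypoints)
```

With this arrangement, a change to the shared homepage or licence is made once and propagates to every generated entry.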
(Also somewhat related to clarin-eric/switchboard-tool-registry#5)