Weekly updates
This is the first weekly report for the Catalog Assessment and Triage project.
My time this week was spent on two tasks: attempting to build the Catalog software locally, for testing and debugging, and getting the catalog MODS and MADS records into shape to examine in a database.
I wanted to try building the Catalog on an up-to-date OS with current software versions, in order to assess how hard it would be to maintain and extend. I followed the instructions here.
The instructions describe how to build the software on Ubuntu 10.04, which is no longer supported. The latest LTS release of Ubuntu is 16.04, so I built a virtual machine based on that version.
Many additional libraries and packages were also out of date. The instructions call for Java 6, for example, but Java 6 is no longer supported (the current version is 9). Tomcat 6 is no longer available from Apache's web site, but it turns out Tomcat is no longer needed at all, since Solr no longer requires it. Most seriously, running the Ruby bundler to install dependencies produced a cascade of errors because many packages could not be found.
It is unclear, then, how much effort would be required to update the Perseus Catalog to the latest version of Blacklight. Given that there are outstanding questions about the overall suitability of Blacklight to the complex metadata in the Perseus Catalog, it does not seem prudent to do anything further with it at this point, until we re-evaluate the Catalog's functional requirements and re-assess the suitability of other platforms.
A number of MODS and MADS records had accumulated minor errors (namespace errors, primarily), which Alison corrected. All the records in catalog_data/ now validate, so I can use programmatic techniques to address some of the tickets in the GitHub issues list.
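As an illustration of the kind of programmatic check that catches namespace errors, here is a minimal sketch in Python. It only inspects the namespace of a record's root element and is not a substitute for schema validation. The two namespace URIs are the standard MODS v3 and MADS v2 ones; the function names are hypothetical and not part of any existing Catalog tooling.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for MODS v3 and MADS v2.
MODS_NS = "http://www.loc.gov/mods/v3"
MADS_NS = "http://www.loc.gov/mads/v2"
EXPECTED_NAMESPACES = {MODS_NS, MADS_NS}


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or "" if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree writes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def has_expected_namespace(xml_text):
    """True if the root element is in the MODS or MADS namespace."""
    return root_namespace(xml_text) in EXPECTED_NAMESPACES
```

Run over every file in catalog_data/, a check like this would flag records whose root element carries a mistyped or missing namespace before they reach a full validator.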
In the coming week, Alison will review the tickets and classify them so we can group and prioritize. I will study the Blacklight code and the running web app to better understand the services it provides, its API, and its data requirements beyond the MODS and MADS data.
My time this week was spent correcting MODS and MADS records that had various data or other issues. I also corresponded with Cliff over email and a silent movie version of Google Hangout and tried my best to answer a number of questions regarding the MODS and MADS records, the CITE Collections table, and how the whole update process works.
This week I studied the Cite-collection tool to get a better understanding of the CITE tables and the role they play. I began a functional outline of the current Catalog, and I worked with Alison to begin to generate use cases for the Catalog more generally -- concentrating for the moment on how it is being used now. I also began to play around with a service-oriented architecture to use in the catalog re-implementation.
A plan is emerging. In the next week to ten days, I will go through the issues Alison has identified as Data Cleanup tasks and knock those off. In the process, I will develop some Schematron validators to help with future work. I will also begin to develop some integration scripts that will update the running catalog app whenever Alison commits a change. And I will continue to outline the functional requirements of the catalog and design a service-oriented architecture to meet them.
I went through all of the current issues in catalog_data and created a number of labels, a practice I also found very beneficial as it allowed me to close several issues and document a few issues Cliff can begin with for the next potential catalog update. Otherwise I simply tried to answer Cliff's questions as he had them and met with him again on Friday.
I spent the last two weeks wrestling with issue #102 ("Many duplicate records for epigrams in the Greek Anthology"). In the end, Alison asked me to revert the changes because they broke too many existing URNs, but the exercise allowed me to dig down deep and understand the state of the metadata. Unfortunately, it appears to be pretty severely broken: multiple duplicate records; multiple identifiers; incomplete or missing identifiers; data forks between the MODS and MADS records and the CITE tables; not to mention a troublesome disregard for file-naming conventions throughout the filesystem (lots of file names with spaces and periods in them). Most serious of all, the data model for abstract works, editions, and translations is fundamentally broken, and this has led to some pretty serious abuses of MODS (as well as lots of data duplication).
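Two of the problems listed above, duplicate identifiers and irregular file names, lend themselves to mechanical checks. The following is a minimal sketch in Python, not the tooling actually used on the Catalog. The naming rule (letters, digits, hyphens, and underscores, followed by a single extension) is an assumed convention chosen for illustration, and the function names and the `(filename, identifier)` input shape are hypothetical.

```python
import re
from collections import defaultdict

# Assumed convention (hypothetical): a base name made of letters, digits,
# hyphens, and underscores, followed by exactly one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def nonconforming_names(names):
    """Return the file names that contain spaces, extra periods, etc."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


def duplicate_identifiers(records):
    """Group files by identifier and keep identifiers used more than once.

    `records` is an iterable of (filename, identifier) pairs, for example
    as extracted from the identifier elements of each record.
    """
    files_by_id = defaultdict(list)
    for filename, identifier in records:
        files_by_id[identifier].append(filename)
    return {
        identifier: sorted(files)
        for identifier, files in files_by_id.items()
        if len(files) > 1
    }
```

A report built from these two functions would give a rough count of how many records each cleanup pass needs to touch, without modifying any data or URNs.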
These discoveries, coupled with my review of the current Catalog application, lead me to think that a fundamental assessment needs to be made now about the best use of the remaining development time on the catalog:
- Establish actual use cases for the catalog application. The current application is pretty simple from the user's side and needlessly complex on the management side.
- Establish actual use cases for the catalog data.
- Establish consistent data models. Clearly distinguish abstract works and expressions from specific editions and translations. Establish clear, machine-actionable linkages between works and authors. Harmonize the CITE tables and the MODS/MADS records, and then get rid of one of them.
In the short run, a basic re-implementation of the current catalog app might solve Alison's immediate problems by making it easier for her to modify her records and update the catalog:
- Back-propagate the CITE table data to the MODS and MADS records.
- Write an XML-database-based catalog application that works directly from the actual metadata.
I'd like to work with Alison and James Tauber this week to determine what direction to take.
Brief summary of last month's work:
- Developed a simple eXist-based catalog app for searching and browsing MODS and MADS data
- Lots of metadata cleanup
- Began to work on catalog_pending records