Building Custom Discovery for Digitized Collections Using Computational Methods

By Scott Bailey

Triangle Research Library Network (TRLN) Annual Meeting 2020

Computational methods, such as topic modeling, create an opportunity for librarians to build experimental graphical interfaces to digitized collections. In this hands-on workshop, participants will use the Python programming language to topic model a text corpus and build interactive visualizations that expose items in the collection in new ways.

Throughout this workshop, we'll pay careful attention to the moments in our process where human and expert attention is still required.
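As a preview of the kind of work we'll do, here is a minimal topic-modeling sketch using scikit-learn. It is not the workshop notebook itself: the `documents` list is a placeholder, and the notebook may use different libraries and parameters.

```python
# Minimal topic-modeling sketch (placeholder documents, not the workshop corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "animal welfare legislation and humane societies",
    "laboratory animals and research ethics",
    "wildlife conservation and habitat protection",
]

# Turn raw text into a document-term matrix, dropping very common English words
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Fit a small LDA model; the number of topics is a human judgment call
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(dtm)

# Print the top terms for each topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {', '.join(top_terms)}")
```

Even in this toy example, choosing the number of topics and interpreting the top terms are exactly the kinds of decisions that still require human and expert attention.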

Running the workshop code

All of the code in this workshop exists in Jupyter Notebooks (.ipynb files). The workshop code can be run in multiple ways.

If you already have a local Python installation and are comfortable working with virtual environments, go ahead and create a virtual environment and install the libraries listed in requirements.txt in your preferred way. This workshop was developed with Python 3.8 (by way of pyenv) and virtualenv for simplicity, but you could use conda, pipenv, poetry, or other environment managers.

If you are just getting started with Python or simply prefer to work in the browser (I recommend this for the live workshop), click on the Google Colab button below to open the workshop notebook in Google Colab, Google's hosted Jupyter Notebook environment. You'll be able to run all of the code in your browser.

Open In Colab

You can also run the full notebook in Binder by clicking the button below:

Binder

Our Corpus: The Animal Turn

The digitized collection we'll work with today is from NC State University Libraries' Special Collections: The Animal Turn. This is a grant-funded collection with substantial materials from NC State University Libraries' Animal Rights and Welfare collections, along with materials from the ASPCA's archives. We're only going to be working with the text of these materials today, but one could certainly use computational methods to explore the collection as images.

This collection is backed by IIIF (International Image Interoperability Framework) and includes OCR text. With permission, I've taken advantage of the public IIIF manifest for the collection, and scraped the OCR text. The scraping code is included in this repo in scrape.ipynb.
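The code actually used lives in scrape.ipynb; the sketch below only illustrates the general shape of the approach. It assumes a IIIF Presentation 2.x manifest whose canvases list a plain-text rendering, and the manifest URL shown is a placeholder, so the real manifest may be organized differently.

```python
# Illustrative sketch only; see scrape.ipynb for the code actually used.
# Assumes a IIIF Presentation 2.x manifest whose canvases expose their OCR
# as a plain-text "rendering" given as a list -- real manifests can differ.
import requests

MANIFEST_URL = "https://example.org/iiif/manifest.json"  # placeholder, not the real URL

manifest = requests.get(MANIFEST_URL).json()

page_texts = []
for sequence in manifest.get("sequences", []):
    for canvas in sequence.get("canvases", []):
        for rendering in canvas.get("rendering", []):
            if rendering.get("format") == "text/plain":
                # Fetch the OCR text associated with this page image
                page_texts.append(requests.get(rendering["@id"]).text)

full_text = "\n".join(page_texts)
print(f"Collected text from {len(page_texts)} pages")
```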

Resources on Machine Learning and Libraries

There is an ever-growing number of reports, articles, and environmental scans on the future of machine learning/artificial intelligence and libraries. Here are a select few if you want to read further:

Credits

This workshop is based on work I did while at Stanford Libraries as part of the SUL AI Studio, a library initiative to explore possible uses of machine learning/artificial intelligence in relation to library collections and services. Working with Javier de la Rosa, Rebecca Wingfield, and Arcadia Falcone on the Jarndyce Single Volume Nineteenth Century Novel Collection, I experimented with different machine learning approaches to produce semantic models and cluster documents, with an eye toward metadata creation and discovery. Code from that experimental, proof-of-concept work is here.

This workshop builds directly on that work and would not have been possible without our great project team.
