By Scott Bailey
Triangle Research Library Network (TRLN) Annual Meeting 2020
Computational methods, such as topic modeling, create an opportunity for librarians to build experimental graphical interfaces to digitized collections. In this hands-on workshop with the Python programming language, participants will learn how to topic model a text corpus, and build interactive visualizations to expose items in the collection in new ways.
Throughout this workshop, we'll pay careful attention to the moments in our process where human and expert attention is still required.
All of the code in this workshop exists in Jupyter Notebooks (.ipynb files). The workshop code can be run in multiple ways.
If you already have a local Python installation and are comfortable working with virtual environments, go ahead and create a virtual environment and install the libraries listed in requirements.txt in your preferred way. This workshop was developed with Python 3.8 (by way of pyenv) and virtualenv for simplicity, but you could use conda, pipenv, poetry, or other environment managers.
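If you have no preferred tool, the setup can be sketched with nothing but the Python standard library. This is a minimal sketch, not the setup the workshop was developed with: it swaps in the built-in venv module for virtualenv, the helper names env_python and create_env are hypothetical, and it assumes requirements.txt sits in the directory you run it from.

```python
import subprocess
import sys
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the path to the interpreter inside a virtual environment."""
    # Windows puts executables under Scripts\, other platforms under bin/.
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_env(env_dir: Path, requirements: Path) -> None:
    """Create a virtual environment and install the listed libraries into it."""
    # with_pip=True bootstraps pip inside the new environment.
    venv.create(env_dir, with_pip=True)
    # Call the environment's own interpreter so packages land inside it,
    # not in the Python that is running this script.
    subprocess.run(
        [str(env_python(env_dir)), "-m", "pip", "install", "-r", str(requirements)],
        check=True,
    )


# Example call (".venv" is an assumed directory name):
# create_env(Path(".venv"), Path("requirements.txt"))
```

After it finishes, activate the environment in your shell and start Jupyter from there so the notebooks see the installed libraries.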
If you are just getting started with Python or simply prefer to work in the browser (I recommend this for the live workshop), click on the Google Colab button below to open the workshop notebook in Google Colab, Google's hosted Jupyter Notebook environment. You'll be able to run all of the code in your browser.
You can also run the full notebook in Binder. To run in Binder, click the button below:
The digitized collection we'll work with today is from NC State University Libraries' Special Collections: The Animal Turn. This is a grant-funded collection with substantial materials from NC State University Libraries' Animal Rights and Welfare collections, along with materials from the ASPCA's archives. We're only going to be working with the text of these materials today, but one could certainly use computational methods to explore the collection as images.
This collection is backed by IIIF (International Image Interoperability Framework) and includes OCR text. With permission, I've taken advantage of the public IIIF manifest for the collection, and scraped the OCR text. The scraping code is included in this repo in scrape.ipynb.
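To give a feel for what working from a manifest involves, here is a minimal sketch of walking a IIIF Presentation 2.x manifest and collecting links to per-page OCR text. It is not the code in scrape.ipynb. It assumes the OCR is exposed as a seeAlso entry with format text/plain on each canvas, which varies between institutions and may not match this collection. The sample manifest, its example.org URLs, and the function names are all hypothetical.

```python
from typing import Iterator, List, Tuple


def iter_canvases(manifest: dict) -> Iterator[dict]:
    """Yield every canvas (roughly, every page) in a Presentation 2.x manifest."""
    for sequence in manifest.get("sequences", []):
        yield from sequence.get("canvases", [])


def ocr_links(manifest: dict, ocr_format: str = "text/plain") -> List[Tuple[str, str]]:
    """Return (canvas label, OCR URL) pairs for canvases that link to OCR text.

    Assumption: OCR is linked through a seeAlso entry whose "format"
    equals ocr_format. Real manifests may use otherContent or a service instead.
    """
    links = []
    for canvas in iter_canvases(manifest):
        see_also = canvas.get("seeAlso", [])
        # seeAlso may be a single object or a list of objects.
        if isinstance(see_also, dict):
            see_also = [see_also]
        for entry in see_also:
            if isinstance(entry, dict) and entry.get("format") == ocr_format:
                url = entry.get("@id")
                if url:
                    links.append((canvas.get("label", ""), url))
    return links


# Hypothetical two-page manifest, trimmed to the fields used above.
SAMPLE_MANIFEST = {
    "@id": "https://example.org/iiif/item-1/manifest",
    "sequences": [
        {
            "canvases": [
                {
                    "@id": "https://example.org/iiif/item-1/canvas/1",
                    "label": "Page 1",
                    "seeAlso": {
                        "@id": "https://example.org/ocr/item-1/page1.txt",
                        "format": "text/plain",
                    },
                },
                {
                    "@id": "https://example.org/iiif/item-1/canvas/2",
                    "label": "Page 2",
                    "seeAlso": [
                        {
                            "@id": "https://example.org/ocr/item-1/page2.hocr",
                            "format": "text/vnd.hocr+html",
                        },
                        {
                            "@id": "https://example.org/ocr/item-1/page2.txt",
                            "format": "text/plain",
                        },
                    ],
                },
            ]
        }
    ],
}

for label, url in ocr_links(SAMPLE_MANIFEST):
    print(label, url)
```

In a real scrape, each collected URL would then be fetched and the text saved alongside the item's identifier, ideally with a pause between requests and with the institution's permission.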
There is an ever-growing number of reports, articles, and environmental scans on the future of machine learning/artificial intelligence and libraries. Here are a select few pieces if you want to read further:
- Responsible Operations: Data Science, Machine Learning, and AI in Libraries by Thomas Padilla
- Machine Learning + Libraries: A Report on the State of the Field from LC Labs, written by Ryan Cordell
- Mapping the Current Landscape of Research Library Engagement with Emerging Technologies in Research and Learning from ARL, CNI, and Educause. This isn't exclusively about ML in libraries, but ML is a heavy and recurring theme.
- Shifting to Data Savvy: The Future of Data Science in Libraries, an IMLS-funded report on libraries, librarians, and data science
This workshop is based on work I did while at Stanford Libraries as part of the SUL AI Studio, a library initiative to explore possible uses of machine learning/artificial intelligence in relation to library collections and services. Working with Javier de la Rosa, Rebecca Wingfield, and Arcadia Falcone on the Jarndyce Single Volume Nineteenth Century Novel Collection, I experimented with different machine learning approaches to produce semantic models and cluster documents with an eye toward metadata creation and discovery. Code from that experimental, proof-of-concept-only work is here.
This workshop builds directly from that work and would not have been possible without our great project team.