The Marriott Reparative Metadata Assessment Tool (MaRMAT) is a Python application for auditing collections metadata files against a lexicon of potentially problematic terms. For PC users, a graphical interface handles file loading, column selection, and term matching, which makes the tool approachable for those with limited programming experience. The tool can also be run from the command line.
Code developed by Kaylee Alexander in collaboration with ChatGPT 3.5, Rachel Wittmann, and Anna Neatrour at the University of Utah's J. Willard Marriott Library.
-
1.1 About the Tool
1.2 The Lexicons
1.3 Features
1.4 Sample Data
-
2.1 Usage
2.2 Dependencies
2.3 Notes
-
3.1 Usage
3.2 Dependencies
3.3 Installation
The Marriott Reparative Metadata Assessment Tool (MaRMAT) builds on Duke University's Description Audit Tool. It assists digital collection metadata practitioners in bulk analysis of metadata collections, identifying potentially harmful language in description so that metadata can be repaired to reflect current and preferred terminology. While Duke University's Description Audit Tool was created to analyze MARC XML and EAD finding aid metadata, MaRMAT was developed to analyze metadata in a spreadsheet format. Because it requires only key column-header names, it can assess Dublin Core metadata as well as other schemas. In addition, the script has been altered to provide more custom querying capabilities.
MaRMAT is designed to query spreadsheet-based (CSV/TSV) metadata against a lexicon of potentially harmful terms in uncontrolled metadata elements, such as Title, Description, and Collection Title. Controlled metadata, such as Subject, can be queried against a database of outdated or problematic Library of Congress Subject Headings. Querying multiple columns of metadata at once against the provided lexicon, or a user-supplied custom lexicon, is designed to make bulk analysis more efficient than searching for keywords one at a time.
Identifying potentially harmful language, including problematic and outdated Library of Congress Subject Headings, is one step towards reparative metadata practices. Deciding what and how to change this metadata, however, is up to metadata practitioners and involves awareness, education, and sensitivity for the communities and history reflected in digital collections. The Digital Library Federation’s Inclusive Metadata Toolkit, created by the Digital Library Federation’s Cultural Assessment Working Group, provides resources to educate and assist in reparative metadata decision-making.
At the most basic level, MaRMAT is designed to match terms from a lexicon with textual data and produce a CSV file containing the matched results. It uses the Pandas library for data manipulation and regular expressions for text processing. It was designed primarily with librarians in mind, specifically those engaged in reparative metadata practices, to assist in identifying terms in their metadata that may be outdated, biased, or otherwise problematic. The underlying code (including preliminary iterations) and sample lexicons for using the tool can be accessed via the Code folder of this repository. For additional information about the GUI, see GUI-Documentation.
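The matching step described above can be sketched roughly as follows. This is a minimal illustration and not the code in the repository: the function name `find_matches`, the output column names other than "Identifier", whole-word case-insensitive matching, and the sample rows are all assumptions made for the example.

```python
import re

import pandas as pd


def find_matches(metadata, lexicon, columns, id_column, categories):
    """Return one row per (record, column, term) match. Hypothetical helper."""
    # Keep only the lexicon rows whose category was selected.
    selected = lexicon[lexicon["Category"].isin(categories)]
    rows = []
    for term, category in zip(selected["Term"], selected["Category"]):
        # Assumption: whole-word, case-insensitive matching.
        # re.escape keeps punctuation inside a term from acting as regex syntax.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for column in columns:
            text = metadata[column].fillna("").astype(str)
            mask = text.apply(lambda value: bool(pattern.search(value)))
            for _, record in metadata[mask].iterrows():
                rows.append({
                    "Identifier": record[id_column],
                    "Term": term,
                    "Category": category,
                    "Column": column,
                    "Original Text": record[column],
                })
    return pd.DataFrame(
        rows, columns=["Identifier", "Term", "Category", "Column", "Original Text"]
    )


# Illustrative rows only; these are not taken from the provided lexicons.
lexicon = pd.DataFrame({
    "Term": ["renowned", "example term"],
    "Category": ["Aggrandizement", "Example Category"],
})
metadata = pd.DataFrame({
    "Record ID": ["rec1", "rec2"],
    "Title": ["A renowned photographer", "Street scene"],
    "Description": ["Portrait", "View of a renowned hotel"],
})

results = find_matches(
    metadata, lexicon, ["Title", "Description"], "Record ID", ["Aggrandizement"]
)
print(results)
```

Calling `results.to_csv("matching_results.csv", index=False)` would then write the matches to disk in the same spirit as the tool's CSV export.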
An initial test case developed a tool for parsing, extracting, tokenizing, and preprocessing XML files containing Open Archives Initiative (OAI) feed metadata for library special collections. The tool then crosschecks tokens against Duke University's lexicons and appends the corresponding lexicon categories (Aggrandizement, Race Euphemisms, Race Terms, Slavery Terms, Gender Terms, LGBTQ, Mental Illness, and Disability) to each row in the CSV output. This tool is accessible via the XML Test Code folder of this repository. Please note that it may not work with all OAI feed formats or take into account resumption tokens.
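For readers unfamiliar with OAI feeds, the sketch below shows one way a single page of OAI-PMH Dublin Core output could be parsed with the standard library, including reading the resumption token that points to the next page. It is not the code in the XML Test Code folder; the function name `parse_oai_page` and the sample XML are invented for illustration, and real feeds vary in structure.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespace URIs.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_oai_page(xml_text):
    """Return (records, resumption_token) for one page. Hypothetical helper."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [el.text or "" for el in record.iterfind(".//dc:title", NS)]
        descriptions = [
            el.text or "" for el in record.iterfind(".//dc:description", NS)
        ]
        records.append({
            "identifier": identifier,
            "title": "; ".join(titles),
            "description": "; ".join(descriptions),
        })
    # An absent or empty token means there are no further pages.
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


# Invented sample page, for illustration only.
SAMPLE_PAGE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
          <dc:description>Sample description</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

records, token = parse_oai_page(SAMPLE_PAGE)
print(records, token)
```

A harvester that handles resumption tokens would request the next page with the returned token and repeat until the token is `None`.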
Several lexicons are provided to help begin your reparative metadata assessment. Not all of the terms in these lexicons will need remediation; rather, they may signal areas of your collections that should be reviewed carefully. Users may download the provided lexicons to use in MaRMAT as is, remove terms that may not be problematic in their metadata, or add terms and categories based on specific project needs. The only requirement for a lexicon to work against another file is that the CSV file contain two columns: "Term" and "Category" (case sensitive). Therefore, the tool's use is not limited to assessing metadata for problematic terms; it may also be loaded with a custom lexicon to perform matching against a variety of content types.
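Before running a custom lexicon through the tool, it can help to confirm that the file has the two required headers. The following is a small standalone check, not part of MaRMAT; the function name `load_lexicon` and the sample rows are assumptions, and the header names follow the "Term" and "Category" requirement stated above.

```python
import csv
import io

REQUIRED_COLUMNS = ("Term", "Category")  # case sensitive, per the requirement above


def load_lexicon(handle):
    """Read (term, category) pairs from an open CSV file. Hypothetical helper."""
    reader = csv.DictReader(handle)
    headers = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in headers]
    if missing:
        raise ValueError(
            "Lexicon is missing required column(s): " + ", ".join(missing)
        )
    pairs = []
    for row in reader:
        term = (row["Term"] or "").strip()
        category = (row["Category"] or "").strip()
        if term:  # skip blank rows
            pairs.append((term, category))
    return pairs


# In-memory stand-in for a custom lexicon file; the rows are placeholders.
sample = io.StringIO("Term,Category\nexample term,Example Category\n")
print(load_lexicon(sample))
```

With a file on disk, the same function could be called as `load_lexicon(open("my_lexicon.csv", newline="", encoding="utf-8"))`, where the file name is a placeholder.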
Lexicon | Description |
---|---|
Reparative Metadata Lexicon | The Reparative Metadata Lexicon includes potentially harmful terminology organized by category and is best suited for uncontrolled metadata fields (e.g., Title, Description). This lexicon has been adapted from Duke University's lexicons, which were created for similar use cases. For the Marriott Reparative Metadata Assessment Tool (MaRMAT), Duke's lexicons were modified by transposing across their category columns to create a single lexicon (term, category) that better accommodates users adding terms and categories without having to adjust the underlying code structure. |
Library of Congress Subject Heading Lexicon | The Library of Congress Subject Heading Lexicon includes changed and canceled Library of Congress Subject Headings (mostly from 2023) and headings that have been identified as problematic. The LCSH lexicon is best suited to run against the Subject field, or other fields that contain LCSH terms. |
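
The reshaping described for the Reparative Metadata Lexicon, from one column per category to a single two-column list, can be sketched with pandas as below. The wide layout shown is an assumption about the source lexicons, the function name `wide_to_long` is hypothetical, and the terms and category names are placeholders.

```python
import pandas as pd


def wide_to_long(wide):
    """Turn one-column-per-category data into (Term, Category) rows."""
    long = wide.melt(var_name="Category", value_name="Term")
    # Categories of unequal length leave empty cells; drop them.
    long = long.dropna(subset=["Term"])
    return long[["Term", "Category"]].reset_index(drop=True)


# Placeholder data standing in for a wide-format lexicon.
wide = pd.DataFrame({
    "Category A": ["term one", "term two"],
    "Category B": ["term three", None],
})

lexicon = wide_to_long(wide)
print(lexicon)
```

In the long format, adding a new category only means appending rows, which is the benefit the table describes.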
- Load lexicon and metadata files in CSV format.
- Select columns from the metadata file for analysis.
- Choose the column in the metadata file to be rewritten as the "Identifier" column so that the output can be reconciled with the original metadata file.
- Select categories of terms from the lexicon for analysis.
- Perform matching to find matches between selected columns and categories.
- Export results to a CSV file.
Coming soon
MaRMAT can be run by any user from the command line. Where indicated in the script, provide the paths to each file, specify the columns you wish to analyze, designate your "Identifier" column, and input the categories of terms you want to match. Then, run the Python file from your command line.
- Install Python if not already installed (Python 3.x recommended).
- Clone or download the MaRMAT repository.
- Navigate to the tool's directory in the command-line interface.
- Update the paths to your lexicon and metadata files in the MaRMAT-2.5.py script.
- Run the tool using the following command:
  python MaRMAT-2.5.py
- Follow the on-screen prompts to input the columns and categories:
  - Enter the names of the columns you want to analyze, separated by commas (e.g., "column1,column2").
  - Enter the name of the identifier column (e.g., the name of a column used as a record ID).
  - Enter the categories of terms you want to search for, separated by commas (e.g., "Category1,Category2").
- Review the matching results displayed on the console or in the generated CSV file.
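The comma-separated answers given at the prompts need to be split and checked against what is actually in the files. The sketch below shows one way to do that; it is not taken from MaRMAT-2.5.py, and the helper names `parse_list` and `validate_choices` are hypothetical.

```python
def parse_list(raw):
    """Split a comma-separated answer, trimming spaces and dropping blanks."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def validate_choices(requested, available, label):
    """Raise a readable error if any requested name is not available."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(
            "Unknown " + label + "(s): " + ", ".join(unknown)
            + ". Available: " + ", ".join(available)
        )
    return requested


# Stand-ins for a typed answer and for the metadata file's column headers.
answer = "Title, Description ,"
metadata_columns = ["Record ID", "Title", "Description", "Subject"]

columns = validate_choices(parse_list(answer), metadata_columns, "column")
print(columns)
```

Checking names up front gives a clearer message than a lookup error partway through matching, which matters because column and category names are case sensitive.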
- Python 3.x: Python is a widely used high-level programming language for general-purpose programming.
- pandas: Pandas is a Python library that provides easy-to-use data structures and data analysis tools for manipulating and analyzing structured data, particularly tabular data. Pandas can be installed via pip:
pip install pandas
- re: This module provides regular expression matching operations. It's a built-in module in Python and doesn't require separate installation.
Note: These dependencies are necessary to run the provided code successfully. Ensure that you have them installed before running the code.
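One quick way to confirm the dependencies are available before running the script is to ask Python whether each module can be found. This is a generic check rather than something shipped with MaRMAT; the function name `missing_modules` is hypothetical.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])

missing = missing_modules(["pandas", "re"])
if missing:
    print("Missing modules:", ", ".join(missing))
    print("Third-party modules can be installed with pip, e.g. pip install pandas")
else:
    print("All required modules were found.")
```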
- Ensure that both the lexicon and metadata files are in CSV format.
- The lexicon file should contain columns for terms and their corresponding categories ("Term","Category").
- The metadata file should contain the text data to be analyzed, with each row representing a separate entry.
- The metadata file should contain a column, such as a Record ID, that you can use as an "Identifier" to reconcile the tool's output with your original metadata.
- The tool outputs matching results to a CSV file named "matching_results.csv" in the tool's directory.
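Reconciling the output with the original metadata amounts to joining the two tables on the identifier. The sketch below uses small in-memory tables in place of the real files. The "Identifier" column name follows the description above, while the other output columns, the function name `reconcile`, and the sample rows are assumptions.

```python
import pandas as pd


def reconcile(metadata, results, id_column):
    """Left-join matches onto the original records. Hypothetical helper."""
    return metadata.merge(
        results, how="left", left_on=id_column, right_on="Identifier"
    )


# Stand-ins for the original metadata file and for matching_results.csv.
metadata = pd.DataFrame({
    "Record ID": ["rec1", "rec2"],
    "Title": ["First title", "Second title"],
})
results = pd.DataFrame({
    "Identifier": ["rec2"],
    "Term": ["example term"],
    "Category": ["Example Category"],
})

merged = reconcile(metadata, results, "Record ID")
flagged = merged[merged["Term"].notna()]
print(flagged)
```

When loading the real files, reading both with `pd.read_csv(path, dtype=str)` keeps numeric-looking identifiers as text so that the join keys compare equal.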
To facilitate wider use, the MaRMAT GUI allows users to easily load a lexicon and a metadata file, select a key column (i.e., Identifier) to use in reconciling matches, and choose the columns and categories they'd like to perform matching on.
Note: The GUI is not compatible with macOS. Additional information on the MaRMAT GUI is available here.
- Loading Files:
  - Click on the "Load Lexicon" button to load the lexicon file.
  - Click on the "Load Metadata" button to load the metadata file.
- Selecting Columns:
  - After loading files, click "Next" to proceed to column selection.
  - Select the columns from the metadata file that you want to analyze.
- Selecting Identifier Column:
  - After selecting columns, choose the column in the metadata file that will serve as the key column or "Identifier" column, such as a record ID.
- Selecting Categories:
  - Next, choose the categories of terms from the lexicon that you want to search for.
- Performing Matching:
  - Click "Perform Matching" to find matches between selected columns and categories.
  - The results will be exported to a CSV file.
- Python 3.x: Python is a widely used high-level programming language for general-purpose programming.
- Tkinter: Tkinter is Python's standard GUI (Graphical User Interface) package. It is used to create desktop applications with a graphical interface.
Note: These dependencies are essential for running the MaRMAT GUI. If you don't have Python installed, you can download it from the official Python website. Tkinter is usually included with Python distributions, so no separate installation is required.
No installation is required. Simply download and run the Python script to start the application on your PC.
This tool was inspired by the Duke University Libraries Description Audit Tool, developed by Noah Huffman at the Rubenstein Library, and expanded by Miriam Shams-Rainey (see Description-Audit).