This repository contains scripts to index metadata stored in a gypsum backend. Each index is a self-contained SQLite file that can be downloaded and queried without further reliance on the backend. We construct new indices by inspecting the backend for JSON metadata documents within each project-asset-version combination; existing indices are updated by routinely scanning the logs for new content.
This documentation is intended for system administrators to generate new indices, or developers of packages that use the indices. Users should not have to interact with these indices directly, as this should be mediated by client packages in relevant frameworks like R/Bioconductor. For example, the gypsum R client provides functions for downloading and caching the indices, which are then called by more user-facing packages like the scRNAseq R package.
All metadata documents should be JSON-formatted and assigned to a particular project-asset-version in the gypsum backend. The scripts in this repository will scrape the contents of the backend to identify metadata documents to add to the index.
One SQLite index is generated for each "class" of metadata documents.
All metadata documents of the same class should have the same base name, i.e., after removing any directories or /
-delimited prefix.
Typically, each metadata class corresponds to a specific JSON schema used to validate its documents;
this suggests that the documents are comparable with similar structure and values, though this is not strictly required for indexing.
Metadata documents are stored losslessly in the SQLite file as a JSONB column.
This can be directly used for queries via SQLite's various json_*
methods.
While most of this documentation will focus on files stored on gypsum's R2 bucket, the same code can also be used for gypsum-compatible stores on local filesystems. This means that on-premise metadata documents (e.g., managed by the Gobbler) can be indexed into equivalent SQLite files.
To enable efficient text search, string properties in each metadata document are converted into (field, token) pairs for indexing.
The field name for each property is defined as its .
-seperated path within the document, assuming that no properties have .
in the name.
For arrays, each item is treated as a separate token with the same field name as the array itself - we do not distinguish based on the ordering of items.
{
"a": {
"b": {
"c": 1, // field name is 'a.b.c'
"d": "asdasd" // field name is 'a.b.d'
},
"e": [
"qwerty", // field name is 'a.e',
{
"f": 100 // field name is 'a.e.f'
}
]
}
}
Our tokenization scheme is adapted from the FTS5 Unicode61 tokenizer. Each string is normalized to remove diacritics, converted into lowercase, and then split at any character that is not a Unicode letter/number or a dash. We consider dashes as part of the token to preserve scientific terms like gene names.
The version of the SQLite file structure is stored in the user_version
pragma as the integer MAJOR * 1000000 + MINOR * 1000 + PATCH
.
If the pragma has not been set or is zero, it can be assumed to be 1000000, i.e., version 1.0.0.
The version of the file structure should correspond to the version in the package.json
.
This is a table where each row corresponds to a successfully indexed project-asset-combination. The table contains the following fields:
vid
: integer, the version ID and the primary key for this table.project
: text, the name of the project.asset
: text, the name of the asset.version
: text, the name of the version.latest
: integer (1 or 0), whether this version is the latest for its asset.- (for
user_version >= 1.1.0
)user
: string, the identity of the uploading user. - (for
user_version >= 1.1.0
)time
: integer, the Unix timestamp when the upload finished.
Multicolumn indices are available for (project, asset, version)
and (latest, project, asset)
.
This is a table where each row corresponds to a metadata file within a project-asset-combination. It contains the following columns:
pid
: integer, the path ID and the primary key for this table.vid
: integer, a version ID fromversions
, specifying the project-asset-version containing this metadata file.path
: text, the relative path to the metadata file inside the project-asset-version's directory. For R2-based stores, this is the suffix to be added to{project}/{asset}/{version}
to obtain the object key of the file.metadata
: blob, the contents of the metadata file as JSONB.
An index is available for path
.
This is a table where each row corresponds to a unique token. It contains the following columns:
tid
: integer, the token ID and the primary key for this table.token
: text, the token.
An index is available for token
.
This is a table where each row corresponds to a unique field name. It contains the following columns:
fid
: integer, the field name ID and the primary key for this table.field
: text, the field name.
An index is available for field
.
This is a table where each row defines to a relationship between a metadata file, the field name and token. It contains the following columns:
pid
: integer, the path ID frompaths
.fid
: integer, the field name ID fromfields
.tid
: integer, the token ID fromtokens
.
The existence of a row indicates that a token (from tid
) is found in a field (named as fid
) of a metadata file (at pid
).
A multicolumn index is available for (tid, fid)
.
The scripts/
subdirectory contains several scripts for generating and updating the SQLite files.
These expect to have a modestly recent version of Node.js (tested on 16.19.1).
Required dependencies can be installed with the usual npm install
process.
Each script can either be run directly or via npx
:
# Older versions of Node.js may need a preceding NODE_OPTIONS='--experimental-fetch'.
./scripts/fresh.js --class BASENAME,DBNAME --gypsum URL --dir DIR
# Alternatively, to run from packages depending on this one:
npx --package=gypsum-metadata-index fresh --class BASENAME,DBNAME --gypsum URL --dir DIR
All scripts have the following arguments in common:
--class BASENAME,DBNAME
: a string describing a metadata document class. This should contain the base name of each metadata document and the name of the SQLite file in which to save the metadata, separated by a comma. This argument may be specified multiple times for different classes.--registry DIR
: a string containing a path to a directory containing a gypsum-like organization of files. This is equivalent to the Gobbler's registry.--gypsum URL
: a string containing the URL to a gypsum REST API. Ignored if--registry
is provided.--dir DIR
: a string containing a path to an output directory for the SQLite files.
The fresh.js
script will generate one SQLite file corresponding to each JSON schema.
This is done by listing all non-probational project-asset-versions in the gypsum backend,
extracting metadata documents for each class,
and inserting the metadata into the class-specific SQLite file.
./scripts/fresh.js --class BASENAME,DBNAME --gypsum URL --dir DIR
If the location specified in --dir
does not exist, it will be created.
Any existing SQLite files in --dir
will be overwritten by the new files as specified in --class
.
In addition to creating new SQLite files, fresh.js
will also add a modified
file containing a Unix timestamp.
This will be used by update.js
(see below) to determine which logs to consider during updates.
The update.js
script will modify each SQLite file based on recent changes in the gypsum backend.
It does so by scanning the logs in the backend, filtering for those generated after the modified
timestamp.
Each log may be used to perform an update to the SQLite file based on its action type (see here),
either by inserting rows corresponding to new objects or (more rarely) by deleting rows corresponding to deleted assets, versions or projects.
./scripts/update.js --class BASENAME,DBNAME --gypsum URL --dir DIR
It is assumed that the location specified in --dir
already contains the SQLite files named by --class
,
as update.js
should be run on indices after their creation by fresh.js
.
In addition to modifying the SQLite files, update.js
will update the modified
file to the timestamp of the last log.
The manual.js
script will reset the records for a specified project/asset/version in a SQLite file.
This is intended for manual patches to specific entries, e.g., if update.js
overlooks some changes in the backend.
./scripts/manual.js --class BASENAME,DBNAME --gypsum URL --dir DIR \
--project PROJECT
./scripts/manual.js --class BASENAME,DBNAME --gypsum URL --dir DIR \
--project PROJECT --asset ASSET
./scripts/manual.js --class BASENAME,DBNAME --gypsum URL --dir DIR \
--project PROJECT --asset ASSET -- version VERSION
This script has the following additional options:
--project PROJECT
: string containing the name of the project. This is a required argument.--asset ASSET
: string containing the name of the asset inside the project. This is an optional argument.--version VERSION
: string containing the name of the version of the asset. Optional and ignored if--asset
is not provided.
If only --project
is specified, the entire project is reindexed in each SQLite file.
This involves deleting all related records in all tables, then repopulating them after pulling metadata from on the backend.
If --asset
is specified, the reindexing is limited to the named asset,
while if --version
is also specified, the reindexing is further limited to the named version.
The gypsum documentation describes the expected file organization in the gypsum bucket. The same organization is used in the Gobbler's documentation for an on-premise alternative.
The bioconductor-metadata-index repository uses this repository to create SQLite indices for Bioconductor packages.