
Using QLever for UniProt

Hannah Bast edited this page Mar 16, 2023 · 5 revisions

Instructions for building a QLever index for the complete UniProt data, written by Hannah Bast on 27.04.2022, last updated on 16.03.2023.

Obtaining and preparing the data

I downloaded all RDF and OWL files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf as follows (at the time of the download, the files were from 12.10.2022).

DATE=2022-10-12
curl -s https://ftp.expasy.org/databases/uniprot/current_release/rdf/RELEASE.meta4 \
  | sed 's/<metalink.*/<metalink>/' \
  | xmllint --xpath '/metalink/files/file/url[@location="ch"]/text()' - \
  > uniprot.download-urls.${DATE}
mkdir -p rdf.${DATE}
> uniprot.${DATE}.download-log
cat uniprot.download-urls.${DATE} \
  | while read URL; do wget --no-verbose -P rdf.${DATE} ${URL} 2>&1 | tee -a uniprot.${DATE}.download-log; done
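With several hundred files, it is worth checking that every URL in the list actually ended up on disk. The following is a minimal sketch of such a check. The function name missing_downloads is hypothetical, and it assumes that each file is stored under the last path component of its URL (which is what wget -P does by default).

```shell
# Hypothetical helper: print every URL from a list whose file is missing
# or empty in the given download directory.
missing_downloads() {
  url_list="$1"
  dir="$2"
  while read -r url; do
    # Skip blank lines in the URL list.
    [ -z "${url}" ] && continue
    # Assumption: the local file name is the last path component of the URL.
    [ -s "${dir}/${url##*/}" ] || echo "${url}"
  done < "${url_list}"
}
```

Called as missing_downloads uniprot.download-urls.${DATE} rdf.${DATE}, it prints nothing when the download is complete. Any URL it prints can be fetched again with the same wget command.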

The total number of files with RDF data was 723, with a total size of 788 GB. I converted these files to compressed Turtle using Apache Jena and GNU parallel as follows (this takes over a day). The total file size of the resulting ttl.xz files was 702 GB.

XML2TTL="apache-jena-3.17.0/bin/rdfxml --output=ttl 2> /dev/null"
mkdir -p ttl.${DATE}
> rdf2ttl.commands.txt
for RDF in rdf.${DATE}/*.{owl,owl.xz,rdf,rdf.xz}; do \
  echo "xzcat -f ${RDF} | ${XML2TTL} | xz -c > ttl.${DATE}/$(basename ${RDF} | sed 's/\(rdf\|rdf.xz\|owl\|owl.xz\)$/ttl.xz/') && echo 'DONE converting ${RDF}'" >> rdf2ttl.commands.txt; done
cat rdf2ttl.commands.txt | parallel
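The renaming rule used in the loop above (any of .rdf, .rdf.xz, .owl, .owl.xz becomes .ttl.xz) can be sketched as a small function, which makes it easy to try on a few names before generating a day's worth of commands. The function name ttl_name is hypothetical; unlike the sed expression above, this variant also escapes the dots.

```shell
# Hypothetical helper: map an RDF/XML or OWL file name (optionally
# xz-compressed) to the name of its compressed Turtle counterpart.
ttl_name() {
  basename "$1" | sed -E 's/\.(rdf|owl)(\.xz)?$/.ttl.xz/'
}

# Example with file names that occur in the UniProt download.
ttl_name rdf.2022-10-12/taxonomy-hierarchy.rdf.xz
ttl_name rdf.2022-10-12/void.rdf
```

After the conversion has finished, xz -t ttl.${DATE}/*.ttl.xz is one way to check that none of the output files was truncated.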

Note that earlier versions of the UniProt RDF/XML files used inconsistent definitions for the base prefix [1]. This is no longer a problem, thanks to the UniProt team for the fix!

Compiling the QLever code

Clone the current QLever master and merge a small PR written specifically for UniProt. It changes two settings in the code that cannot yet be changed via the command line or settings file: it lowers the maximum size of a literal kept in RAM (from 1024 to 128), and it does not store any literals of the predicates rdf:value and up:md5Checksum in RAM. These settings are crucial; without them, the index build will run out of RAM.

git clone --recursive git@github.com:ad-freiburg/qlever
cd qlever
git merge origin/uniprot-settings
docker build -t qlever.uniprot .

To compile natively (without Docker), follow the instructions printed by the qlever script when you type qlever install-binaries.

Building the QLever index

Use the qlever script with the preconfigured Qleverfile for UniProt as follows. The first command downloads the Qleverfile, the second command builds the index.

. qlever uniprot
qlever index

If you want to use your natively compiled code, set USE_DOCKER = false in the Qleverfile. If you want to build all six permutations instead of just PSO and POS (which are enough for almost all queries), set PSO_AND_POS_ONLY = false in the Qleverfile (or remove the whole line with that variable, since the default is to build all six permutations).

Here are the stats (produced by qlever index-stats) of the index building on an AMD Ryzen 9 5900X PC (12 cores) with 128 GB of RAM:

Parse input             :  22.6 h
Build vocabularies      :  28.2 h
Convert to global IDs   :   4.6 h
PSO & POS permutations  :  29.9 h

TOTAL index build time  :  85.3 h
183 GB	uniprot.full.index.pos
209 GB	uniprot.full.index.pso
1.5 TB	uniprot.full.vocabulary.external
332 GB	uniprot.full.vocabulary.external.idsAndOffsets.mmap
 40 GB	uniprot.full.vocabulary.internal
2.3 TB	total
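The totals above can be re-derived from the individual lines, which is a handy sanity check when comparing with your own build. This is a minimal sketch: the helper name sum_values is hypothetical, and the size check treats 1.5 TB as roughly 1500 GB, which is an assumption about how the sizes were rounded.

```shell
# Hypothetical helper: add up one number per input line and print the
# sum with one decimal.
sum_values() {
  awk '{ total += $1 } END { printf "%.1f\n", total }'
}

# Phase times in hours, copied from the stats above.
printf '%s\n' 22.6 28.2 4.6 29.9 | sum_values

# File sizes in GB, treating 1.5 TB as roughly 1500 GB (an assumption).
printf '%s\n' 183 209 1500 332 40 | sum_values
```

The first sum reproduces the 85.3 h total. The second gives 2264 GB, which is consistent with the reported 2.3 TB.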

Starting the server

In the same directory, just type the following. The server is then up after about 1.5 minutes.

qlever start

[1] In the 2021-11-17 version of the UniProt data, the base prefix was defined inconsistently across files. Namely, it was defined as @prefix : <http://purl.uniprot.org/core/> in most files, but had a different definition in others. This is not forbidden, but confusing. For the indexing with QLever, we identified four files with an inconsistent definition of that prefix and fixed it as follows (this is very fast, since the respective files are small).

# eval is needed here so that the "2> /dev/null" inside ${XML2TTL} is parsed as a redirection.
xzcat -f rdf.${DATE}/taxonomy-hierarchy.rdf.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/taxonomy-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/uniparc-patents.rdf.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix schema:/; s/:mentions/schema:mentions/' | xz -c > ttl.${DATE}/uniparc-patents.ttl.xz &
xzcat -f rdf.${DATE}/go-hierarchy.owl.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/go-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/void.rdf | eval "${XML2TTL}" | sed '/@prefix :/d' | xz -c > ttl.${DATE}/void.ttl.xz &
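One way to find files with a deviating base prefix is to print the IRI bound to the empty prefix for each converted file and compare. The following is a minimal sketch, not the procedure we actually used. The function name base_prefix is hypothetical, and it assumes that the Turtle writer separates @prefix, the prefix name, and the IRI by whitespace.

```shell
# Hypothetical helper: read Turtle on stdin and print the IRI bound to the
# empty prefix ":" (first definition only), or nothing if there is none.
# Assumption: "@prefix", ":" and "<IRI>" are separated by whitespace.
base_prefix() {
  awk '$1 == "@prefix" && $2 == ":" { gsub(/[<>]/, "", $3); print $3; exit }'
}

# Example: list the base prefix of every converted file.
# for F in ttl.${DATE}/*.ttl.xz; do echo "${F} $(xzcat "${F}" | base_prefix)"; done
```

Since awk exits at the first match and the prefix declarations are at the top of each file, this is fast even for the large files. Any file whose output differs from http://purl.uniprot.org/core/ is a candidate for a fix like the ones above.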