Using QLever for UniProt

Log of building QLever for the complete UniProt data, written by Hannah Bast on 27.04.2022

Obtaining and preparing the data (last updated 11.11.2022)

I downloaded all RDF and OWL files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf as follows (at the time of the download, the files were from 12.10.2022).

DATE=2022-10-12
curl -s https://ftp.expasy.org/databases/uniprot/current_release/rdf/RELEASE.meta4 \
  | sed 's/<metalink.*/<metalink>/' \
  | xmllint --xpath '/metalink/files/file/url[@location="ch"]/text()' - \
  > uniprot.download-urls.${DATE}
mkdir -p rdf.${DATE}
> uniprot.${DATE}.download-log
cat uniprot.download-urls.${DATE} \
  | while read URL; do wget --no-verbose -P rdf.${DATE} ${URL} 2>&1 | tee -a uniprot.${DATE}.download-log; done
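
With several hundred files, a few downloads can fail without being noticed. The following is a minimal sketch (not part of the original log) that lists every URL whose file is missing or empty in the download directory. The function name list_missing_downloads is hypothetical, and the check assumes that wget stored each file under the basename of its URL, as wget -P does by default.

```shell
# List every URL from the URL file whose target file is missing or empty
# in the download directory (assumes files are stored under the URL's basename).
list_missing_downloads() {
  url_file="$1"
  download_dir="$2"
  while read -r url; do
    # Skip empty lines in the URL file
    [ -n "${url}" ] || continue
    # -s is true only if the file exists and has a size greater than zero
    [ -s "${download_dir}/$(basename "${url}")" ] || echo "${url}"
  done < "${url_file}"
}

# Example call (commented out, adapt to your setup):
# list_missing_downloads uniprot.download-urls.${DATE} rdf.${DATE}
```

If the function prints nothing, every URL has a non-empty file. It does not detect truncated downloads; for that, the download log would have to be inspected as well.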

The total number of files with RDF data was 723, with a total size of 788 GB. I converted these files to compressed Turtle using Apache Jena and GNU parallel as follows (this takes over a day). The total size of the resulting ttl.xz files was 633 GB (this figure is from the earlier data of 17.11.2021).

XML2TTL="apache-jena-3.17.0/bin/rdfxml --output=ttl 2> /dev/null"
mkdir -p ttl.${DATE}
> rdf2ttl.commands.txt
for RDF in rdf.${DATE}/*.{owl,owl.xz,rdf,rdf.xz}; do \
  echo "xzcat -f ${RDF} | ${XML2TTL} | xz -c > ttl.${DATE}/$(basename ${RDF} | sed 's/\(rdf\|rdf.xz\|owl\|owl.xz\)$/ttl.xz/') && echo 'DONE converting ${RDF}'" >> rdf2ttl.commands.txt; done
cat rdf2ttl.commands.txt | parallel
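
Since the conversion runs for over a day, it can be useful to check afterwards which inputs have no output yet. The following is a minimal sketch under my own assumptions, not the method used in the log: ttl_name_for and list_unconverted are hypothetical helper names, and the sed expression is a stricter variant of the renaming rule used above (same result for the four extensions owl, owl.xz, rdf, rdf.xz).

```shell
# Map an input file name to the name of its converted Turtle file,
# e.g. rdf/go-hierarchy.owl.xz -> go-hierarchy.ttl.xz
ttl_name_for() {
  basename "$1" | sed -E 's/\.(rdf|owl)(\.xz)?$/.ttl.xz/'
}

# List all inputs in the RDF directory whose converted file is missing
# or empty in the Turtle directory.
list_unconverted() {
  rdf_dir="$1"
  ttl_dir="$2"
  for rdf in "${rdf_dir}"/*.owl "${rdf_dir}"/*.owl.xz "${rdf_dir}"/*.rdf "${rdf_dir}"/*.rdf.xz; do
    # An unmatched glob pattern stays literal, so skip non-existing names
    [ -e "${rdf}" ] || continue
    [ -s "${ttl_dir}/$(ttl_name_for "${rdf}")" ] || echo "${rdf}"
  done
}

# Example call (commented out, adapt to your setup):
# list_unconverted rdf.${DATE} ttl.${DATE}
```

Note that this only catches missing outputs. A conversion that failed midway still leaves a small but non-empty .xz file, so the "DONE converting" messages and an integrity check such as xz -t remain worth looking at.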

Most of these files use the following definition for the base prefix: @prefix : <http://purl.uniprot.org/core/>. However, four of these files use a different definition for the base prefix. I replaced them with a "proper" prefix definition as follows (this is very fast, since the respective files are small).

xzcat -f rdf.${DATE}/taxonomy-hierarchy.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/taxonomy-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/uniparc-patents.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix schema:/; s/:mentions/schema:mentions/' | xz -c > ttl.${DATE}/uniparc-patents.ttl.xz &
xzcat -f rdf.${DATE}/go-hierarchy.owl.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/go-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/void.rdf | ${XML2TTL} | sed '/@prefix :/d' | xz -c > ttl.${DATE}/void.ttl.xz &
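
To find out which files deviate from the usual base prefix in the first place, one can print the IRI of the base prefix for each converted file. The following is a minimal sketch, not taken from the original log: base_prefix_iri is a hypothetical helper, and it assumes that the prefix declarations appear within the first 50 lines of a file and are written as "@prefix : <...>" at the start of a line.

```shell
# Read Turtle from stdin and print the IRI of the base prefix ("@prefix :"),
# looking only at the first 50 lines. Prints nothing if there is no base prefix.
base_prefix_iri() {
  head -50 | sed -n 's/^@prefix :[[:space:]]*<\([^>]*\)>.*/\1/p' | head -1
}

# Example loop (commented out, adapt to your setup): show all files whose
# base prefix is not http://purl.uniprot.org/core/
# for TTL in ttl.${DATE}/*.ttl.xz; do
#   echo "$(xzcat "${TTL}" | base_prefix_iri) ${TTL}"
# done | grep -v '^http://purl.uniprot.org/core/ '
```

Files without any base prefix also show up in the output of the loop, with an empty first column.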

Compiling the QLever code

I cloned the current QLever master (as of this writing, version from 20.04.2022) and merged a small PR specifically written for UniProt, which changes two settings in the code that cannot yet be changed via the command line. Namely, it lowers the maximum size of a literal kept in RAM (from 1024 to 128), and it does not store any literals of the predicates rdf:value and up:md5Checksum in RAM. To install everything needed to compile QLever on the machine, follow the instructions provided by the qlever script.

git clone --recursive git@github.com:ad-freiburg/qlever.git
cd qlever
git pull joka921:only-changes-for-uniprot
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER="g++-11" -DLOGLEVEL=INFO -DUSE_PARALLEL=true -GNinja ..
ninja
export PATH=$PATH:$(pwd)

Building the QLever index

I used the qlever script with the following Qleverfile (showing only the relevant sections), by simply typing qlever index in the directory with the UniProt data, as prepared above.

# Docker settings
USE_DOCKER       = 0

# Indexer settings
DB               = uniprot.full
RDF_FILES        = ttl.2021-11-17/*.ttl.xz
CAT_FILES        = "( ls ${RDF_FILES} | while read TTL; do xzcat \${TTL} | head -50 | grep ^@prefix; done | sort -u && ls ${RDF_FILES} | while read TTL; do xzcat \$TTL | grep -v ^@prefix; done )"
PSO_AND_POS_ONLY = 1
STXXL_MEMORY_GB  = 80

# Server settings
MEMORY_FOR_QUERIES             = 80
CACHE_MAX_SIZE_GB              = 50
CACHE_MAX_SIZE_GB_SINGLE_ENTRY = 10
CACHE_MAX_NUM_ENTRIES          = 100
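
The CAT_FILES command above makes two passes over the input: first it outputs each distinct prefix declaration once, then it outputs all remaining lines of all files. The following is a minimal sketch of that idea on uncompressed files, written for illustration only; merge_turtle is a hypothetical name and not part of the qlever script. As far as I can tell, this kind of merge is only correct when every prefix name has the same definition in all files, which is why the deviating base prefixes were replaced in the preparation step.

```shell
# Merge several Turtle files into one stream: distinct prefix declarations
# first, then all lines that are not prefix declarations.
merge_turtle() {
  # Pass 1: prefix declarations from the first 50 lines of each file, deduplicated
  for ttl in "$@"; do
    head -50 "${ttl}" | grep '^@prefix' || true
  done | sort -u
  # Pass 2: everything except the prefix declarations
  for ttl in "$@"; do
    grep -v '^@prefix' "${ttl}" || true
  done
}

# Example call (commented out, adapt to your setup):
# merge_turtle file1.ttl file2.ttl > merged.ttl
```

The "|| true" keeps the function from failing when a file has no prefix declarations or consists only of them.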

Here is the settings.json I used:

{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://purl.uniprot.org/uniprot/",
    "<http://purl.uniprot.org/uniparc/",
    "<http://purl.uniprot.org/uniref/",
    "<http://purl.uniprot.org/isoforms/",
    "<http://purl.uniprot.org/range/",
    "<http://purl.uniprot.org/position/",
    "<http://purl.uniprot.org/refseq/",
    "<http://purl.uniprot.org/embl-cds/",
    "<http://purl.uniprot.org/EMBL",
    "<http://purl.uniprot.org/PATRIC",
    "<http://purl.uniprot.org/SEED",
    "<http://purl.uniprot.org/gi",
    "<http://rdf.ebi.ac.uk/resource",
    "<http://purl.uniprot.org/SHA-384"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 20000000
}

Here are the stats (produced by qlever index-stats) of the index building on an AMD Ryzen 9 5900X PC (12 cores) with 128 GB of RAM:

Parse input             :  22.6 h
Build vocabularies      :  28.2 h
Convert to global IDs   :   4.6 h
PSO & POS permutations  :  29.9 h

TOTAL index build time  :  85.3 h

183 GB	uniprot.full.index.pos
209 GB	uniprot.full.index.pso
1.5 TB	uniprot.full.vocabulary.external
332 GB	uniprot.full.vocabulary.external.idsAndOffsets.mmap
 40 GB	uniprot.full.vocabulary.internal
2.3 TB	total

Starting the server

In the same directory, just type qlever start. The server is then up after a little over 1.5 minutes (about 100 seconds according to the timestamps in the log below).

Executing "start":

ServerMain -i uniprot.full -j 8 -m 80 -c 50 -e 10 -k 100 -p 7018 --only-pso-and-pos-permutations --no-patterns > uniprot.full.server-log.txt &

Starting the QLever server in the background and waiting till it's ready (Ctrl+C will not kill it) ...

2022-04-27 22:09:39.036	- INFO:  QLever Server, compiled on Apr 20 2022 06:31:01
2022-04-27 22:09:39.036	- INFO:  Initializing server ...
2022-04-27 22:09:39.052	- INFO:  Reading internal vocabulary from file uniprot.full.vocabulary.internal ...
2022-04-27 22:10:30.999	- INFO:  Done, number of words: 2,008,006,445
2022-04-27 22:10:30.999	- INFO:  Number of words in external vocabulary: 22,214,545,863
2022-04-27 22:10:31.024	- INFO:  Registered PSO permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049	- INFO:  Registered POS permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049	- INFO:  Only the PSO and POS permutation were loaded, SPARQL queries with predicate variables will therefore not work
2022-04-27 22:10:31.049	- INFO:  Sorting random result tables to estimate the sorting performance of this machine ...
2022-04-27 22:11:18.679	- INFO:  The server is ready
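
The startup time can be read off the timestamps of the first and the last log line. The following is a minimal sketch for doing that automatically; startup_seconds is a hypothetical helper, and it assumes that all timestamped lines have the format shown above and fall on the same day.

```shell
# Read a server log from stdin and print the number of seconds between the
# first and the last timestamped line (assumes both are on the same day).
startup_seconds() {
  awk '/^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
         # The second field is the time of day, e.g. 22:09:39.036
         split($2, t, ":")
         s = t[1] * 3600 + t[2] * 60 + t[3]
         if (!seen) { first = s; seen = 1 }
         last = s
       }
       END { if (seen) printf "%.1f\n", last - first }'
}

# Example call (commented out, adapt to your setup):
# startup_seconds < uniprot.full.server-log.txt
```

For the log lines shown above (22:09:39.036 to 22:11:18.679), this prints 99.6.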
