Using QLever for UniProt

Hannah Bast edited this page Nov 11, 2022 · 5 revisions

Log of building QLever for the complete UniProt data, written by Hannah Bast on 27.04.2022

Obtaining and preparing the data (last updated 11.11.2022)

I downloaded all RDF and OWL files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf as follows (at the time of the download, the files were from 12.10.2022).

DATE=2022-10-12
curl -s https://ftp.expasy.org/databases/uniprot/current_release/rdf/RELEASE.meta4 \
  | sed 's/<metalink.*/<metalink>/' \
  | xmllint --xpath '/metalink/files/file/url[@location="ch"]/text()' - \
  > uniprot.download-urls.${DATE}
mkdir -p rdf.${DATE}
> uniprot.${DATE}.download-log
cat uniprot.download-urls.${DATE} \
  | while read URL; do wget --no-verbose -P rdf.${DATE} ${URL} 2>&1 | tee -a uniprot.${DATE}.download-log; done
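
Since wget can fail on individual files, it may be worth checking that every URL in the list actually produced a file. The following is a minimal sketch of such a check, not part of the original log; the function name missing_downloads is made up for illustration.

```shell
# Hypothetical helper (not part of the original log): print every URL from
# the URL list that has no non-empty file of the same base name in the
# download directory.
missing_downloads() {
  local url_list="$1" dir="$2" url
  while read -r url || [ -n "${url}" ]; do
    [ -n "${url}" ] || continue
    # wget -P stores each file under its base name in the target directory.
    [ -s "${dir}/$(basename "${url}")" ] || echo "${url}"
  done < "${url_list}"
}

# Usage with the names from above:
#   missing_downloads uniprot.download-urls.${DATE} rdf.${DATE}
```

An empty output means that all files are there; any printed URL can be fetched again with the same wget call.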

The total number of files with RDF data was 723, with a total size of 788 GB. I converted these files to compressed Turtle using Apache Jena and GNU parallel as follows (this takes over a day). The total file size of the resulting ttl.xz files was [17.11.2021: 633 GB].

XML2TTL="apache-jena-3.17.0/bin/rdfxml --output=ttl 2> /dev/null"
mkdir -p ttl.${DATE}
> rdf2ttl.commands.txt
for RDF in rdf.${DATE}/*.{owl,owl.xz,rdf,rdf.xz}; do \
  echo "xzcat -f ${RDF} | ${XML2TTL} | xz -c > ttl.${DATE}/$(basename ${RDF} | sed 's/\(rdf\|rdf.xz\|owl\|owl.xz\)$/ttl.xz/') && echo 'DONE converting ${RDF}'" >> rdf2ttl.commands.txt; done
cat rdf2ttl.commands.txt | parallel
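
Because the conversion runs for more than a day, a job may fail without anyone noticing. The following sketch (an addition, not part of the original log; the function name unconverted_files is hypothetical) lists all RDF and OWL input files that have no non-empty converted counterpart.

```shell
# Hypothetical helper (not part of the original log): print every RDF/OWL
# input file without a non-empty .ttl.xz file in the output directory.
unconverted_files() {
  local rdf_dir="$1" ttl_dir="$2" rdf name
  for rdf in "${rdf_dir}"/*; do
    case "${rdf}" in
      *.rdf|*.rdf.xz|*.owl|*.owl.xz) ;;   # only files with RDF data
      *) continue ;;
    esac
    # Same renaming rule as in the conversion commands above:
    # strip .xz, then .rdf or .owl, then append .ttl.xz.
    name=$(basename "${rdf}")
    name=${name%.xz}; name=${name%.rdf}; name=${name%.owl}
    [ -s "${ttl_dir}/${name}.ttl.xz" ] || echo "${rdf}"
  done
}

# Usage with the names from above:
#   unconverted_files rdf.${DATE} ttl.${DATE}
```

This only checks that the output files exist. Whether they are intact xz archives can additionally be tested with xz -t ttl.${DATE}/*.ttl.xz.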

Most of these files use the following definition for the base prefix: @prefix : <http://purl.uniprot.org/core/>. However, four of these files use a different definition for the base prefix. I replaced them with a "proper" prefix definition as follows (this is very fast, since the respective files are small).

# eval is needed because XML2TTL contains a redirection (2> /dev/null), which a plain ${XML2TTL} would pass to rdfxml as arguments.
xzcat -f rdf.${DATE}/taxonomy-hierarchy.rdf.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/taxonomy-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/uniparc-patents.rdf.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix schema:/; s/:mentions/schema:mentions/' | xz -c > ttl.${DATE}/uniparc-patents.ttl.xz &
xzcat -f rdf.${DATE}/go-hierarchy.owl.xz | eval "${XML2TTL}" | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/go-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/void.rdf | eval "${XML2TTL}" | sed '/@prefix :/d' | xz -c > ttl.${DATE}/void.ttl.xz &
wait  # for the four background jobs to finish
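
To find out which files deviate (or to confirm that none do after the fix), one can look at the definition of the prefix : in the header of each Turtle file. The following is a sketch, not part of the original log. The function name default_prefix is made up, and the sketch assumes that all prefix declarations occur within the first 50 lines of a file.

```shell
# Hypothetical helper (not part of the original log): read Turtle from stdin
# and print the IRI bound to the empty prefix ":" (nothing if it is absent).
# Assumption: all @prefix lines are within the first 50 lines.
default_prefix() {
  head -n 50 \
    | sed -n 's/^@prefix[[:space:]][[:space:]]*:[[:space:]]*<\([^>]*\)>.*/\1/p' \
    | head -n 1
}

# Usage: report all files whose ":" prefix is not the UniProt core namespace.
#   for TTL in ttl.${DATE}/*.ttl.xz; do
#     P=$(xzcat "${TTL}" | default_prefix)
#     [ -z "${P}" ] || [ "${P}" = "http://purl.uniprot.org/core/" ] || echo "${TTL}: ${P}"
#   done
```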

Compiling the QLever code

I cloned the current QLever master (as of this writing, the version from 20.04.2022) and merged a small PR specifically written for UniProt, which changes two settings in the code that cannot yet be changed via the command line: it lowers the maximum size of a literal kept in RAM (from 1024 to 128), and it keeps no literals of the predicates rdf:value and up:md5Checksum in RAM. To install everything needed to compile QLever on the machine, follow the instructions provided by the qlever script.

git clone --recursive git@github.com:ad-freiburg/qlever.git
cd qlever
git pull joka921:only-changes-for-uniprot
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER="g++-11" -DLOGLEVEL=INFO -DUSE_PARALLEL=true -GNinja ..
ninja
export PATH=$PATH:$(pwd)

Building the QLever index

I used the qlever script with the following Qleverfile (showing only the relevant sections), by simply typing qlever index in the directory with the UniProt data, as prepared above.

# Docker settings
USE_DOCKER       = 0

# Indexer settings
DB               = uniprot.full
RDF_FILES        = ttl.2021-11-17/*.ttl.xz
CAT_FILES        = "( ls ${RDF_FILES} | while read TTL; do xzcat \${TTL} | head -50 | grep ^@prefix; done | sort -u && ls ${RDF_FILES} | while read TTL; do xzcat \$TTL | grep -v ^@prefix; done )"
PSO_AND_POS_ONLY = 1
STXXL_MEMORY_GB  = 80

# Server settings
MEMORY_FOR_QUERIES             = 80
CACHE_MAX_SIZE_GB              = 50
CACHE_MAX_SIZE_GB_SINGLE_ENTRY = 10
CACHE_MAX_NUM_ENTRIES          = 100 
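
The CAT_FILES command makes two passes over the input: it first emits the distinct @prefix lines from the first 50 lines of every file, then all remaining lines of all files, so that the indexer reads one Turtle stream with all prefixes declared up front. The following sketch illustrates the idea on two tiny made-up files. It is not part of the original log, the function name cat_prefixes_first is hypothetical, and it uses cat-style access to uncompressed files where the real command uses xzcat.

```shell
# Same two-pass idea as CAT_FILES, on uncompressed files for illustration
# (the real command decompresses each file with xzcat).
cat_prefixes_first() {
  local f
  # Pass 1: distinct prefix declarations from the header of each file.
  for f in "$@"; do head -n 50 "${f}" | grep '^@prefix'; done | sort -u
  # Pass 2: everything except prefix declarations, file after file.
  for f in "$@"; do grep -v '^@prefix' "${f}"; done
}

# Tiny made-up example input: two files with the same prefix declaration.
DEMO_DIR=$(mktemp -d)
printf '@prefix up: <http://purl.uniprot.org/core/> .\n<http://example.org/a> a up:Protein .\n' > "${DEMO_DIR}/a.ttl"
printf '@prefix up: <http://purl.uniprot.org/core/> .\n<http://example.org/b> a up:Protein .\n' > "${DEMO_DIR}/b.ttl"

# Prints the prefix declaration once, followed by the two triples.
cat_prefixes_first "${DEMO_DIR}/a.ttl" "${DEMO_DIR}/b.ttl"
```

This scheme only yields correct data if all files agree on the definition of each prefix, which is presumably why the files with a deviating definition of the prefix : had to be fixed during data preparation.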

Here is the settings.json I used:

{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://purl.uniprot.org/uniprot/",
    "<http://purl.uniprot.org/uniparc/",
    "<http://purl.uniprot.org/uniref/",
    "<http://purl.uniprot.org/isoforms/",
    "<http://purl.uniprot.org/range/",
    "<http://purl.uniprot.org/position/",
    "<http://purl.uniprot.org/refseq/",
    "<http://purl.uniprot.org/embl-cds/",
    "<http://purl.uniprot.org/EMBL",
    "<http://purl.uniprot.org/PATRIC",
    "<http://purl.uniprot.org/SEED",
    "<http://purl.uniprot.org/gi",
    "<http://rdf.ebi.ac.uk/resource",
    "<http://purl.uniprot.org/SHA-384"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 20000000
}

Here are the stats (produced by qlever index-stats) of the index building on an AMD Ryzen 9 5900X PC (12 cores) with 128 GB of RAM:

Parse input             :  22.6 h
Build vocabularies      :  28.2 h
Convert to global IDs   :   4.6 h
PSO & POS permutations  :  29.9 h

TOTAL index build time  :  85.3 h

183 GB	uniprot.full.index.pos
209 GB	uniprot.full.index.pso
1.5 TB	uniprot.full.vocabulary.external
332 GB	uniprot.full.vocabulary.external.idsAndOffsets.mmap
 40 GB	uniprot.full.vocabulary.internal
2.3 TB	total
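
As a plausibility check of these numbers, the four phases should add up to the total, and dividing the number of triples reported in the server log further down (93,713,158,653) by the total time gives the average indexing throughput. A small sketch of that arithmetic (an addition, not part of the original log; the function name index_build_summary is made up):

```shell
# Numbers copied from the stats above and from the server log further down.
index_build_summary() {
  awk 'BEGIN {
    total_h = 22.6 + 28.2 + 4.6 + 29.9   # the four phases, in hours
    triples = 93713158653                # "#triples" from the server log
    printf "total: %.1f h\n", total_h
    printf "throughput: %.0f million triples per hour\n", triples / total_h / 1e6
  }'
}

index_build_summary
```

The first line reproduces the 85.3 h from the stats, and the second line comes out at roughly 1.1 billion triples per hour.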

Starting the server

In the same directory, just type qlever start. The server is then ready after a little over 1.5 minutes (100 seconds in the log below).

Executing "start":

ServerMain -i uniprot.full -j 8 -m 80 -c 50 -e 10 -k 100 -p 7018 --only-pso-and-pos-permutations --no-patterns > uniprot.full.server-log.txt &

Starting the QLever server in the background and waiting till it's ready (Ctrl+C will not kill it) ...

2022-04-27 22:09:39.036	- INFO:  QLever Server, compiled on Apr 20 2022 06:31:01
2022-04-27 22:09:39.036	- INFO:  Initializing server ...
2022-04-27 22:09:39.052	- INFO:  Reading internal vocabulary from file uniprot.full.vocabulary.internal ...
2022-04-27 22:10:30.999	- INFO:  Done, number of words: 2,008,006,445
2022-04-27 22:10:30.999	- INFO:  Number of words in external vocabulary: 22,214,545,863
2022-04-27 22:10:31.024	- INFO:  Registered PSO permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049	- INFO:  Registered POS permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049	- INFO:  Only the PSO and POS permutation were loaded, SPARQL queries with predicate variables will therefore not work
2022-04-27 22:10:31.049	- INFO:  Sorting random result tables to estimate the sorting performance of this machine ...
2022-04-27 22:11:18.679	- INFO:  The server is ready
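
Once the log says that the server is ready, it can be queried via HTTP on the port given in the ServerMain call (7018). The following is a sketch, not part of the original log. The function names are made up, and it assumes that the server accepts a SPARQL query in the query parameter of a GET request and honors the Accept header. Note that the sample query uses fixed predicates only because, as the log says, queries with predicate variables will not work with just the PSO and POS permutations.

```shell
# Endpoint of the server started above (port 7018 from the ServerMain call).
QLEVER_URL="http://localhost:7018"

# Print a sample query with fixed predicates only (no predicate variables).
# The optional argument is the LIMIT (default: 5).
sample_query() {
  local limit="${1:-5}"
  cat <<EOF
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?protein WHERE { ?protein rdf:type up:Protein } LIMIT ${limit}
EOF
}

# Send a query via HTTP GET. Assumption: the server accepts a "query"
# parameter and honors the Accept header for the result format.
run_query() {
  curl -Gs "${QLEVER_URL}" \
    -H "Accept: application/sparql-results+json" \
    --data-urlencode "query=$1"
}

# Usage, once the server is ready:
#   run_query "$(sample_query 5)"
```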