Using QLever for UniProt
Log of building QLever for the complete UniProt data, written by Hannah Bast on 27.04.2022
I downloaded all RDF and OWL files from https://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf as follows (at the time of the download, the files were from 12.10.2022).
DATE=2022-10-12
curl -s https://ftp.expasy.org/databases/uniprot/current_release/rdf/RELEASE.meta4 \
| sed 's/<metalink.*/<metalink>/' \
| xmllint --xpath '/metalink/files/file/url[@location="ch"]/text()' - \
> uniprot.download-urls.${DATE}
mkdir -p rdf.${DATE}
> uniprot.${DATE}.download-log
cat uniprot.download-urls.${DATE} \
| while read URL; do wget --no-verbose -P rdf.${DATE} ${URL} 2>&1 | tee -a uniprot.${DATE}.download-log; done
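With several hundred files, a failed or truncated transfer is easy to miss in the log. Here is a minimal sketch of a completeness check, assuming wget stored each file under the basename of its URL. `missing_downloads` is a hypothetical helper, not part of the original setup.

```shell
# Hypothetical helper: print every URL from the list whose file is
# missing or empty in the download directory.
missing_downloads() {
  local url_list="$1" dir="$2" url
  while read -r url; do
    [ -n "${url}" ] || continue
    # Assumption: wget -P saved the file under the basename of the URL.
    [ -s "${dir}/$(basename "${url}")" ] || echo "${url}"
  done < "${url_list}"
}

# Usage with the names from the commands above:
# missing_downloads uniprot.download-urls.${DATE} rdf.${DATE}
```

An empty output means that every listed file is present and non-empty. It does not verify checksums.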
The total number of files with RDF data was 723, with a total size of 788 GB. I converted these files to compressed Turtle using Apache Jena and GNU parallel as follows (this takes over a day). The total file size of the resulting ttl.xz files was [11.17.2021: 633 GB].
XML2TTL="apache-jena-3.17.0/bin/rdfxml --output=ttl 2> /dev/null"
mkdir -p ttl.${DATE}
> rdf2ttl.commands.txt
for RDF in rdf.${DATE}/*.{owl,owl.xz,rdf,rdf.xz}; do \
echo "xzcat -f ${RDF} | ${XML2TTL} | xz -c > ttl.${DATE}/$(basename ${RDF} | sed 's/\(rdf\|rdf.xz\|owl\|owl.xz\)$/ttl.xz/') && echo 'DONE converting ${RDF}'" >> rdf2ttl.commands.txt; done
cat rdf2ttl.commands.txt | parallel
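Since the conversion runs for more than a day, it helps to check afterwards that every input has produced an output. The following sketch reproduces the output naming of the loop above (with an equivalent extended regular expression) and lists the inputs that have no non-empty result. `ttl_name` and `missing_conversions` are hypothetical helper names.

```shell
# Map an input file name to the output name used in the loop above,
# e.g. go-hierarchy.owl.xz -> go-hierarchy.ttl.xz and void.rdf -> void.ttl.xz.
ttl_name() {
  basename "$1" | sed -E 's/\.(rdf|owl)(\.xz)?$/.ttl.xz/'
}

# Print every RDF/OWL input in RDF_DIR that has no non-empty output in TTL_DIR.
missing_conversions() {
  local rdf_dir="$1" ttl_dir="$2" rdf
  for rdf in "${rdf_dir}"/*; do
    case "${rdf}" in
      *.rdf|*.rdf.xz|*.owl|*.owl.xz) ;;
      *) continue ;;   # skip anything that is not an RDF or OWL file
    esac
    [ -s "${ttl_dir}/$(ttl_name "${rdf}")" ] || echo "${rdf}"
  done
}

# Usage: missing_conversions rdf.${DATE} ttl.${DATE}
```

A non-empty output file does not prove that the conversion succeeded, so the 'DONE converting' messages remain the primary signal.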
Most of these files use the following definition for the base prefix: @prefix : <http://purl.uniprot.org/core/>. However, four of these files use a different definition for the base prefix. I replaced them with a "proper" prefix definition as follows (this is very fast, since the respective files are small).
xzcat -f rdf.${DATE}/taxonomy-hierarchy.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/taxonomy-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/uniparc-patents.rdf.xz | ${XML2TTL} | sed 's/@prefix :/@prefix schema:/; s/:mentions/schema:mentions/' | xz -c > ttl.${DATE}/uniparc-patents.ttl.xz &
xzcat -f rdf.${DATE}/go-hierarchy.owl.xz | ${XML2TTL} | sed 's/@prefix :/@prefix rdfs:/; s/:subClassOf/rdfs:subClassOf/' | xz -c > ttl.${DATE}/go-hierarchy.ttl.xz &
xzcat -f rdf.${DATE}/void.rdf | ${XML2TTL} | sed '/@prefix :/d' | xz -c > ttl.${DATE}/void.ttl.xz &
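To see what the sed expressions do, here is a sketch that applies the same substitution to a two-line Turtle sample. The sample triple and the rdfs namespace IRI are illustrative assumptions about what the converter emits for the taxonomy hierarchy, and `fix_base_prefix` is a hypothetical wrapper around the sed call used above.

```shell
# Hypothetical wrapper: rename the base prefix ":" to NAME and rewrite the
# one predicate PRED that uses it (same substitution as in the commands above).
fix_base_prefix() {
  local name="$1" pred="$2"
  sed "s/@prefix :/@prefix ${name}:/; s/:${pred}/${name}:${pred}/"
}

# Illustrative input (assumed shape of the converter output).
printf '%s\n' \
  '@prefix : <http://www.w3.org/2000/01/rdf-schema#> .' \
  '<http://purl.uniprot.org/taxonomy/9606> :subClassOf <http://purl.uniprot.org/taxonomy/9605> .' \
  | fix_base_prefix rdfs subClassOf
# Prints:
# @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# <http://purl.uniprot.org/taxonomy/9606> rdfs:subClassOf <http://purl.uniprot.org/taxonomy/9605> .
```

This only works because each of these small files uses the base prefix for a single predicate. A file with several base-prefixed names would need a more general rewrite.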
I cloned the current QLever master (as of this writing, version from 20.04.2022) and merged a small PR specifically written for UniProt, which changes two settings in the code that cannot yet be changed via the command line. Namely, it lowers the maximum size for a literal kept in RAM (from 1024 to 128) and does not store any literal of the predicates rdf:value and up:md5Checksum in RAM. To install everything needed to compile QLever on the machine, follow the instructions provided by the qlever script.
git clone --recursive git@github.com:ad-freiburg/qlever.git
cd qlever
git pull joka921:only-changes-for-uniprot
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER="g++-11" -DLOGLEVEL=INFO -DUSE_PARALLEL=true -GNinja ..
ninja
export PATH=$PATH:$(pwd)
I used the qlever script with the following Qleverfile (showing only the relevant sections), by simply typing qlever index in the directory with the UniProt data, as prepared above.
# Docker settings
USE_DOCKER = 0
# Indexer settings
DB = uniprot.full
RDF_FILES = ttl.2021-11-17/*.ttl.xz
CAT_FILES = "( ls ${RDF_FILES} | while read TTL; do xzcat \${TTL} | head -50 | grep ^@prefix; done | sort -u && ls ${RDF_FILES} | while read TTL; do xzcat \$TTL | grep -v ^@prefix; done )"
PSO_AND_POS_ONLY = 1
STXXL_MEMORY_GB = 80
# Server settings
MEMORY_FOR_QUERIES = 80
CACHE_MAX_SIZE_GB = 50
CACHE_MAX_SIZE_GB_SINGLE_ENTRY = 10
CACHE_MAX_NUM_ENTRIES = 100
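The CAT_FILES command makes two passes over the input: the first collects the distinct @prefix lines from the first 50 lines of each file, and the second emits all remaining lines, so that the concatenated stream declares each prefix once and before its first use. Here is a sketch of the same logic as a function, with the decompression command as a parameter so that it can be tried on small uncompressed files. `cat_with_hoisted_prefixes` is a hypothetical name.

```shell
# Hypothetical helper mirroring CAT_FILES. The first argument is the
# decompression command (e.g. "xzcat -f" or "cat"), the rest are input files.
cat_with_hoisted_prefixes() {
  local decompress="$1" f
  shift
  # Pass 1: distinct @prefix lines, taken from the first 50 lines of each file.
  for f in "$@"; do
    ${decompress} "${f}" | head -50 | grep '^@prefix'
  done | sort -u
  # Pass 2: all lines that are not @prefix declarations.
  for f in "$@"; do
    ${decompress} "${f}" | grep -v '^@prefix'
  done
}

# Usage on the converted files:
# cat_with_hoisted_prefixes "xzcat -f" ttl.${DATE}/*.ttl.xz | head
```

Like the original command, this assumes that all prefix declarations occur within the first 50 lines of a file and that the same prefix name is never bound to two different IRIs.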
Here is the settings.json I used:
{
  "languages-internal": ["en"],
  "prefixes-external": [
    "<http://purl.uniprot.org/uniprot/",
    "<http://purl.uniprot.org/uniparc/",
    "<http://purl.uniprot.org/uniref/",
    "<http://purl.uniprot.org/isoforms/",
    "<http://purl.uniprot.org/range/",
    "<http://purl.uniprot.org/position/",
    "<http://purl.uniprot.org/refseq/",
    "<http://purl.uniprot.org/embl-cds/",
    "<http://purl.uniprot.org/EMBL",
    "<http://purl.uniprot.org/PATRIC",
    "<http://purl.uniprot.org/SEED",
    "<http://purl.uniprot.org/gi",
    "<http://rdf.ebi.ac.uk/resource",
    "<http://purl.uniprot.org/SHA-384"
  ],
  "locale": {
    "language": "en",
    "country": "US",
    "ignore-punctuation": true
  },
  "ascii-prefixes-only": true,
  "num-triples-per-partial-vocab": 20000000
}
Here are the stats (produced by qlever index-stats) of the index building on an AMD Ryzen 9 5900X PC (12 cores) with 128 GB of RAM:
Parse input : 22.6 h
Build vocabularies : 28.2 h
Convert to global IDs : 4.6 h
PSO & POS permutations : 29.9 h
TOTAL index build time : 85.3 h
183 GB uniprot.full.index.pos
209 GB uniprot.full.index.pso
1.5 TB uniprot.full.vocabulary.external
332 GB uniprot.full.vocabulary.external.idsAndOffsets.mmap
40 GB uniprot.full.vocabulary.internal
2.3 TB total
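The totals can be cross-checked against the individual entries. A small sketch, in which `sum_lines` is a hypothetical helper and 1.5 TB is taken as 1500 GB:

```shell
# Sum the numbers given one per line on stdin and print the total
# with one decimal place.
sum_lines() {
  awk '{ total += $1 } END { printf "%.1f\n", total }'
}

printf '%s\n' 22.6 28.2 4.6 29.9 | sum_lines     # phases in hours, prints 85.3
printf '%s\n' 183 209 1500 332 40 | sum_lines    # file sizes in GB, prints 2264.0
```

The phases add up to the reported 85.3 h, and the files add up to 2264 GB, which matches the reported 2.3 TB after rounding.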
In the same directory, just type qlever start. The server is then up in 1.5 minutes.
Executing "start":
ServerMain -i uniprot.full -j 8 -m 80 -c 50 -e 10 -k 100 -p 7018 --only-pso-and-pos-permutations --no-patterns > uniprot.full.server-log.txt &
Starting the QLever server in the background and waiting till it's ready (Ctrl+C will not kill it) ...
2022-04-27 22:09:39.036 - INFO: QLever Server, compiled on Apr 20 2022 06:31:01
2022-04-27 22:09:39.036 - INFO: Initializing server ...
2022-04-27 22:09:39.052 - INFO: Reading internal vocabulary from file uniprot.full.vocabulary.internal ...
2022-04-27 22:10:30.999 - INFO: Done, number of words: 2,008,006,445
2022-04-27 22:10:30.999 - INFO: Number of words in external vocabulary: 22,214,545,863
2022-04-27 22:10:31.024 - INFO: Registered PSO permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049 - INFO: Registered POS permutation: #relations = 268, #blocks = 178,835, #triples = 93,713,158,653
2022-04-27 22:10:31.049 - INFO: Only the PSO and POS permutation were loaded, SPARQL queries with predicate variables will therefore not work
2022-04-27 22:10:31.049 - INFO: Sorting random result tables to estimate the sorting performance of this machine ...
2022-04-27 22:11:18.679 - INFO: The server is ready
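Combining the triple count from the log with the total index build time reported earlier (85.3 h) gives a rough average indexing throughput. A sketch of that calculation, with `triples_per_second` as a hypothetical helper name:

```shell
# Average indexing throughput in triples per second.
# Arguments: number of triples, build time in hours.
triples_per_second() {
  awk -v triples="$1" -v hours="$2" \
    'BEGIN { printf "%.0f\n", triples / (hours * 3600) }'
}

triples_per_second 93713158653 85.3
```

This comes to roughly 305,000 triples per second, averaged over all phases of the index build.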