Fast generation of RDF2Vec embeddings with a SPARQL endpoint
On this page, we provide a guide to:
- set up a DBPedia endpoint using a Stardog RDF store
- create a custom random walker which uses Python's multiprocessing library
- share some tips and tricks
- present benchmark results
We will load DBPedia into Stardog because we already have a lot of expertise working with it. Of course, you can load the data into any other RDF store that provides a SPARQL endpoint.
Stardog can be downloaded (e.g. with curl) from https://www.stardog.com/get-started/, after which you can unzip the downloaded file. (Keep in mind that Stardog requires Java 8 to work properly.)
Because a lot of triples will be loaded, we must make sure Stardog can use all the available resources of our server.
Therefore, it is necessary to set STARDOG_SERVER_JAVA_ARGS correctly, according to the following table:
# of Triples | JVM Heap Memory | Direct memory | Total System Memory |
---|---|---|---|
100 million | 3GB | 4GB | 8GB |
1 billion | 8GB | 20GB | 32GB |
10 billion | 30GB | 80GB | 128GB |
25 billion | 60GB | 160GB | 256GB |
50 billion | 80GB | 380GB | 512GB |
Our server setup had 32GB of RAM available, so we ran: export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g"
on our server, which enabled us to load 1 billion triples.
As we will load multiple files with a large number of triples, we will first put the Stardog server into bulk
loading mode.
Bulk mode can easily be enabled by creating a stardog.properties file in your STARDOG_HOME
folder (if you did not define this folder, your STARDOG_HOME
folder will be the same folder where you executed the curl command above).
In this stardog.properties file, you paste the following two lines:
memory.mode = bulk
strict.parsing = false
Disabling strict parsing turns off the parsing checks (some DBPedia triples violate the predefined ontological rules). Other settings can be listed in this properties file as well.
Now, you can start the Stardog database by executing the following command:
./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors
The first time you start Stardog, it will ask you to install a license. Just answer the questions asked in the terminal. Academic users can use Stardog for free for one year, other users for 60 days.
Keep in mind that your Stardog version number can be different. By default, we disable both the server's security and CORS, as we are not planning to make our database public. We refer to the Stardog documentation for more information (https://www.stardog.com/docs).
The following script can be used to download all the relevant triples for the October 2015 English version of DBPedia (the 2015 version is the one used in the original RDF2Vec paper):
mkdir -p data
cd data
mkdir core
cd core
wget -np -nd -r -A "ttl.bz2,nt.bz2" "http://downloads.dbpedia.org/2015-10/core/"
cd ..
mkdir core-i18n
cd core-i18n
wget -nd -np -r -A ttl.bz2 "http://downloads.dbpedia.org/2015-10/core-i18n/en/"
cd ..
wget -nd -np -r -A .owl "http://downloads.dbpedia.org/2015-10/dbpedia_2015-10.owl"
Again, you can edit this script to download other versions or other languages if needed.
Multiple ttl.bz2 and nt.bz2 files will be downloaded into the newly created data folder. Stardog can load these bz2 files directly, so you don't have to decompress them.
To load them into Stardog, you can use the following commands:
./stardog-7.4.5/bin/stardog-admin db create -n dbpedia $(find . -type f -name \*.bz2 -print | xargs)
./stardog-7.4.5/bin/stardog data add dbpedia data/dbpedia_2015-10.owl
We recommend running these commands in a separate screen or tmux session, so that you can follow the progress of loading the database with tail -f stardog.log
(Ctrl-C to quit).
Grab a coffee, this took ± 2 hours for our setup (32GB RAM).
After all triples are loaded, it is better to tear down the Stardog server using:
./stardog-7.4.5/bin/stardog-admin server stop
and change the stardog.properties memory.mode to:
memory.mode = default
This will rebalance the available 32GB of RAM to make sure SELECT queries can be performed optimally.
Now you can start the Stardog service again using ./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors
and you are ready to go.
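Before wiring up pyrdf2vec, you can verify that the endpoint answers SPARQL queries. Below is a minimal sanity check, a sketch that assumes the requests package is installed and that Stardog runs on localhost (adapt the host/IP otherwise):
```python
# Minimal sanity check for the SPARQL endpoint (a sketch; assumes the
# `requests` package is installed and Stardog runs on localhost:5820).
import requests

endpoint = "http://localhost:5820/dbpedia/query"  # adapt host/IP if needed
query = "SELECT ?p ?o WHERE { <http://dbpedia.org/resource/Brussels> ?p ?o } LIMIT 5"

response = requests.get(
    endpoint,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["p"]["value"], binding["o"]["value"])
```
If this prints a handful of predicate/object pairs for Brussels, the endpoint is ready.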
To use this Stardog service as a remote KG in our pyrdf2vec library, you can use code like the snippet below:
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
kg = KG(location="http://YOUR_STARDOG_IP_OR_LOCALHOST:5820/dbpedia/query", is_remote=True)
walkers = [RandomWalker(1, 200, UniformSampler())]
embedder = Word2Vec(size=200)
transformer = RDF2VecTransformer(walkers=walkers, embedder=embedder)
embeddings = transformer.fit_transform(kg, ['http://dbpedia.org/resource/Brussels'])
print(embeddings)
Make sure the IP address of your Stardog service is correctly filled in. Stardog runs on port 5820 by default.
The /dbpedia/query part defines the SPARQL endpoint of our database dbpedia,
which we used when loading the DBPedia triples.
Stardog can also host multiple databases on a single server.
As you will notice, executing SPARQL requests results in some delays, which increases the time needed to generate the embeddings.
Therefore, the RandomWalker code can be extended with a tqdm
progress bar to show the current state, and it can be parallelized using Python's built-in multiprocessing library to reduce these delays by executing the walk extraction in multiple processes.
The code to create such a RandomWalker is defined below:
from hashlib import md5
from multiprocessing import Pool
from typing import Any, List, Set, Tuple

import rdflib
from tqdm import tqdm

from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

class MultiProcessingRandomWalker(RandomWalker):
    def _proc(self, t):
        # Work unit executed by each process: extract and canonicalize
        # the walks for a single instance.
        kg, instance = t
        walks = self.extract_random_walks(kg, instance)
        canonical_walks = set()
        for walk in walks:
            canonical_walk = []
            for i, hop in enumerate(walk):  # type: ignore
                if i == 0 or i % 2 == 1:
                    canonical_walk.append(str(hop))
                else:
                    # Hash the object hops to keep the vocabulary small.
                    digest = md5(str(hop).encode()).digest()[:8]
                    canonical_walk.append(str(digest))
            canonical_walks.add(tuple(canonical_walk))
        return {instance: tuple(canonical_walks)}

    # Overwrite this method of the original RandomWalker.
    def _extract(self, kg: KG, instances: List[rdflib.URIRef]) -> Set[Tuple[Any, ...]]:
        canonical_walks = set()
        seq = [(kg, r) for _, r in enumerate(instances)]
        # Distribute the instances over 4 processes and show a tqdm progress bar.
        with Pool(4) as pool:
            res = list(tqdm(pool.imap_unordered(self._proc, seq),
                            total=len(seq)))
        res = {k: v for element in res for k, v in element.items()}
        for r in instances:
            canonical_walks.update(res[r])
        return canonical_walks
By default, we use 4 processes in the multiprocessing pool here. The code executed by each process is defined in the _proc function and is identical to the original inner loop of the RandomWalker in the random.py file.
You can use this MultiProcessingRandomWalker by simply providing it in the walkers argument list:
walkers = [MultiProcessingRandomWalker(1, 200, UniformSampler())]
- You will get the fastest results when running the MultiProcessingRandomWalker on the same machine as the Stardog service. This reduces the latency introduced by sending all the SPARQL responses over the network.
- You can play with the number of processes, but the general rule is to use 1 less than the number of processors on your machine. Python's multiprocessing library can be used for this (see the sketch after this list):
import multiprocessing; print(multiprocessing.cpu_count() - 1)
- Newer Python versions require that you run multiprocessing code inside the main function. So if needed, encapsulate your code in
if __name__ == '__main__':
- Using the MultiProcessingRandomWalker with the public DBPedia endpoint is a bad idea: you will make more requests per second than this public endpoint can handle (the number of parallel requests is limited). More concretely, you will be blocked. So use your own endpoint!
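Putting these tips together, a driver script could look like the sketch below. This is only a sketch: it assumes the MultiProcessingRandomWalker class from above is defined in (or imported into) the same file and that Stardog runs on localhost; replace the entity list with your own.
```python
# Sketch of a driver script applying the tips above (assumptions: the
# MultiProcessingRandomWalker class defined earlier is in scope and Stardog
# runs locally on port 5820).
import multiprocessing

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler

if __name__ == '__main__':
    # One process less than the number of available cores.
    n_jobs = max(1, multiprocessing.cpu_count() - 1)
    print(f"Using {n_jobs} processes")
    # To actually use n_jobs, replace the hard-coded Pool(4) in
    # MultiProcessingRandomWalker._extract with Pool(n_jobs).

    kg = KG(location="http://localhost:5820/dbpedia/query", is_remote=True)
    walkers = [MultiProcessingRandomWalker(1, 200, UniformSampler())]
    transformer = RDF2VecTransformer(walkers=walkers, embedder=Word2Vec(size=200))

    embeddings = transformer.fit_transform(kg, ['http://dbpedia.org/resource/Brussels'])
    print(embeddings)
```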
Below, we provide some timing results generated using the setup described above. We compare running our code on the Stardog server itself and on a laptop. The difference between these two results is that the code run on the laptop suffers additional delays because the SPARQL requests are sent over the network.
We also compare the influence of using the multiprocessing module.
Our benchmark dataset is defined by a set of DBPedia cities. We created a 200-dimensional embedding, using 200 random walks, for each of the 212 cities listed in this benchmark dataset and report the average number of instances that we can process per second.
Depth | Laptop (1 process) | Laptop (4 processes) | Server (1 process) | Server (4 processes) |
---|---|---|---|---|
1 | 0.78it/s | 2.58it/s | 2.09it/s | 8.14it/s |
2 | 0.04it/s | 0.22it/s | 0.78it/s | 2.85it/s |
3 | 0.03it/s | 0.14it/s | 0.57it/s | 2.05it/s |
4 | 0.02it/s | 0.12it/s | 0.52it/s | 1.91it/s |
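For reference, a rough sketch of how such rates can be measured is shown below. Again, this is only a sketch: it assumes the MultiProcessingRandomWalker class from above is in scope, Stardog runs locally, and cities is a hypothetical list holding the 212 DBPedia city URIs; the rates in the table cover the walk extraction only, so we time the _extract step directly.
```python
# Sketch for measuring the extraction rate (instances per second) per depth.
# Assumptions: MultiProcessingRandomWalker from above is in scope, Stardog runs
# locally, and `cities` holds the 212 benchmark city URIs (hypothetical list).
import time

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler

if __name__ == '__main__':
    cities = ['http://dbpedia.org/resource/Brussels']  # replace with the 212 cities
    kg = KG(location="http://localhost:5820/dbpedia/query", is_remote=True)

    for depth in (1, 2, 3, 4):
        walker = MultiProcessingRandomWalker(depth, 200, UniformSampler())
        start = time.time()
        walker._extract(kg, cities)  # the walk extraction step shown above
        rate = len(cities) / (time.time() - start)
        print(f"depth {depth}: {rate:.2f} instances/s")
```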