Managing FAIR Knowledge Graphs as Polyglot Data End Points: A Benchmark based on the rdf2pg Framework and Plant Biology Data
This repository contains code to benchmark three different graph databases and graph query languages, against plant biology datasets, which are conceptually aligned (based on the same data model) in the different database/language flavours.
This is work by the KnetMiner team and Carlos Bobed.
The alignment is produced by means of the rdf2pg framework, and this work contributes to assessing the benefits of managing data in multiple data languages and formats, by means of our rdf2pg tools.
This work is an extension of previous work by the KnetMiner team, which we presented at SWAT4LS 2018 (old presentation here).
We have tested three combinations of graph database, graph query language and data format:
- SPARQL on the Virtuoso triple store, dealing with RDF data (and the corresponding model).
- Cypher on the Neo4j graph database, with data directly imported into the database from our rdf2neo tool.
- Gremlin on ArcadeDB, with data imported from files in graphML format.
Details on the test settings used are in the dataset loading results report.
For each of the graph databases mentioned above, we have tested the loading and the query performance of three datasets:
- Biopax: a small dataset, mostly containing data about the Arabidopsis model organism, including pathways from AraCyc and gene annotations from Gene Ontology.
- Arabidopsis: a medium-size dataset, containing more data about Arabidopsis, including AraCyc, Gene Ontology, gene annotations from ENSEMBL Plants and TAIR, protein annotations from UniProt, scientific publications from PubMed.
- Poaceae: a large dataset with integrated data about different cereals (wheat, rice and barley), obtained from a variety of sources, including the ones mentioned above, plus genome-wide study data from AraGWAS and more. Partial access to this dataset is available via KnetMiner programmatic data access endpoints.
The figure below shows the main types contained in each dataset:
This model was encoded based on BioKNO, an application ontology, defined within the KnetMiner platform, to represent the data we deal with in the KnetMiner platform. BioKNO models common plant biology entities, some specific patterns used by KnetMiner applications, and mappings to existing biology ontologies and life science standards.
We have done two types of tests:
Loading tests, where we measured the time taken to populate each database with each of the tested datasets. See the linked report for details.
After loading each dataset, we performed querying tests, where, for each dataset, we tested all of the chosen databases and query languages, each time timing the same set of queries. More precisely, for each of the tested query languages, we wrote conceptually equivalent queries.
While "conceptually equivalent" is difficult to define precisely, informally, it means the best effort to search for data that have the same semantics and equivalent representations in the different technologies and formats being tested. It also means writing queries that, across different technologies, present similar levels of complexity and search engine challenges.
For example, where it is easy for Neo4j to return a node property or an empty value (because properties are attached to the nodes), we have translated this as OPTIONAL matches in SPARQL (since looking for a resource property is a triple pattern like any other).
The (Jupyter-based) reports linked above have more test details and detailed results.
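To make the property-versus-OPTIONAL translation more concrete, here is a minimal sketch of what such a pair of queries might look like. It is not taken from the benchmark's query set: the namespace, the `bk:Gene` type, the `bk:prefName` property, the `Gene` label and the `iri` property are illustrative assumptions.

```python
# Illustrative only: the prefix URI, class, label and property names below
# are hypothetical placeholders, not the benchmark's actual queries.
QUERIES = {
    # In SPARQL, a property that may be missing must be wrapped in OPTIONAL,
    # otherwise genes without a name would be dropped from the results.
    "sparql": """
        PREFIX bk: <http://example.org/bk#>
        SELECT ?gene ?name
        WHERE {
          ?gene a bk:Gene.
          OPTIONAL { ?gene bk:prefName ?name }
        }
    """,
    # In Cypher, accessing a missing node property simply yields null,
    # so no optional construct is needed for the same result shape.
    "cypher": """
        MATCH (gene:Gene)
        RETURN gene.iri AS gene, gene.prefName AS name
    """,
}

for language, query in QUERIES.items():
    print(f"--- {language} ---")
    print(query)
```

Both queries return one row per gene, with an empty name where the property is absent, which is the sense in which they are treated as equivalent here.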
- We have started testing ArcadeDB with its SQL dialect, using the same datasets and the same queries. These are preliminary results, and the work is to be continued.
Like the data, the queries listed below are based on the already-mentioned BioKNO ontology. We have split the benchmark queries into categories that take into account both the query semantics and the kind of challenge they pose to the query engines.
Regarding the semantic motif queries, these are patterns that occur often in KnetMiner, when we want to associate genes with other relevant entities (such as encoded proteins, biological processes, publications about genes or processes). In practice, a semantic motif query is a 'chain' pattern: it tries to follow a linear path from a gene to another entity, through a known chain of relations (e.g., Gene -> encodes -> Protein -> participates -> Process -> mentioned -> Publication). Details are in the KnetMiner Wiki and in the KnetMiner paper.
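As a rough illustration of the 'chain' idea, the sketch below turns a list of (relation, entity) steps into a linear Cypher pattern. The `cypher_chain` helper is hypothetical and written only for this example; the labels and relation names are copied from the example chain above and may not match the names used in the actual datasets or benchmark queries.

```python
def cypher_chain(start_label, steps):
    """Build a linear Cypher MATCH pattern from a start label and a list of
    (relation_type, node_label) steps. Hypothetical helper, for illustration."""
    if not steps:
        raise ValueError("a chain needs at least one step")
    pattern = f"(n0:{start_label})"
    for i, (rel_type, node_label) in enumerate(steps, start=1):
        # Each step extends the path by one outgoing relation and one node.
        pattern += f"-[:{rel_type}]->(n{i}:{node_label})"
    # Return the two ends of the chain: the gene and the entity it reaches.
    return f"MATCH {pattern} RETURN n0, n{len(steps)}"


motif = cypher_chain(
    "Gene",
    [
        ("encodes", "Protein"),
        ("participates", "Process"),
        ("mentioned", "Publication"),
    ],
)
print(motif)
```

The short, medium and long semantic motif queries in the benchmark differ mainly in how many such steps the chain contains and how complex each step is.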
WARNING: do not edit what follows! It is automatically generated via this code.
Common counts of elements like number of nodes, number of relations, etc.
- cnt: Counts instances, SPARQL, Cypher, Gremlin
- cntType: Instances of a given type, SPARQL, Cypher, Gremlin
- cntRel: Count relations, SPARQL, Cypher, Gremlin
- cntRelType: Count relations of a given type, SPARQL, Cypher, Gremlin
Queries that select elements, including simple joins.
- sel: Select entity and properties, SPARQL, Cypher, Gremlin
- join: Simple Join, SPARQL, Cypher, Gremlin
- joinRel: Join literal properties of reified relations, SPARQL, Cypher, Gremlin
- joinFilter: Simple join + attribute filter, SPARQL, Cypher, Gremlin
- joinRe: Simple join + regex search, SPARQL, Cypher, Gremlin
- joinReif: Join through relation property, SPARQL, Cypher, Gremlin
Queries that perform graph pattern and subquery unions.
- 2union: 2 unions, no nesting, SPARQL, Cypher, Gremlin
- 2union1Nest: 2 unions, 1 nesting, SPARQL, Cypher, Gremlin
- 2union1Nest+: 2 unions, 1 nesting (with Cypher CALL), SPARQL, Cypher, Gremlin
- pway: Complex union of paths over pathways, SPARQL, Cypher, Gremlin
- exist: Not exists, SPARQL, Cypher, Gremlin
- existAg: Not exists + aggregation, SPARQL, Cypher, Gremlin
Queries that perform data grouping and aggregations.
- grp: Group by, SPARQL, Cypher, Gremlin
- grpAg: Group by + 2 aggregation functions, SPARQL, Cypher, Gremlin
- mulGrpAg: Multiple subqueries having aggregations, SPARQL, Cypher, Gremlin
- nestAg: Nested and outer aggregations (see Q6 from the Berlin benchmark), SPARQL, Cypher, Gremlin
Queries that select and traverse paths.
- varPathC: Variable path query (fixed len), SPARQL, Cypher, Gremlin
- varPath: Variable path query (unbound len and restricted on top), SPARQL, Cypher, Gremlin
- shrtSmf: Short Semantic Motif, SPARQL, Cypher, Gremlin
- medSmf: Medium length Semantic Motif, SPARQL, Cypher, Gremlin
- lngSmf: Long and Complex Semantic Motif, SPARQL, Cypher, Gremlin