
Criterion benchmarks for RDF graph instances #20

Closed
robstewart57 opened this issue Dec 12, 2014 · 3 comments

@robstewart57 (Owner)

We don't know what the construction or query performance is for MGraph or TriplesGraph. We should know this so that we can spot which is faster for each use case of the rdf4h API. We should also have benchmarks in place so that any new RDF instance implementation can be measured against existing ones, which relates to #19.

Criterion would give us a robust benchmarking platform for understanding the performance of the rdf4h API and its limitations: https://hackage.haskell.org/package/criterion
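As a purely illustrative starting point, here is a minimal sketch of how such a suite could be wired up with Criterion. The rdf4h calls (parseFile, NTriplesParser, triplesOf, query, unode) are written from memory of the 1.x API and the module layout may differ; the fixture path and query subject are placeholders:

```haskell
-- Illustrative only: a Criterion suite comparing graph representations.
-- The rdf4h imports and calls are assumptions about the 1.x API, and
-- "bench-data.nt" / the query subject are placeholder values.
module Main where

import Criterion.Main
import Data.RDF                      -- exact module layout may differ
import qualified Data.Text as T

-- Parse the same fixture into a chosen graph representation.
loadGraph :: RDF rdf => IO rdf
loadGraph = either (error . show) id <$> parseFile NTriplesParser "bench-data.nt"

main :: IO ()
main = do
  tg <- loadGraph :: IO TriplesGraph
  mg <- loadGraph :: IO MGraph
  let subj = T.pack "http://example.org/s"   -- placeholder query subject
  defaultMain
    [ bgroup "parse"
        [ bench "TriplesGraph" $ nfIO (length . triplesOf <$> (loadGraph :: IO TriplesGraph))
        , bench "MGraph"       $ nfIO (length . triplesOf <$> (loadGraph :: IO MGraph))
        ]
    , bgroup "query"
        [ bench "TriplesGraph" $ whnf (\s -> length (query tg (Just (unode s)) Nothing Nothing)) subj
        , bench "MGraph"       $ whnf (\s -> length (query mg (Just (unode s)) Nothing Nothing)) subj
        ]
    ]
```

Forcing results through `length` keeps the comparison fair between representations without requiring NFData instances for the triple types.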

@cordawyn (Collaborator)

I think "graph building" vs "graph querying" performance is tightly coupled to these benchmarks. I'll get to that later, but first I'd like to discuss one very specific issue that I came across while refactoring TriplesGraph (following issue #12):

  • The output of graph implementations must have expanded URIs, or contain "expandable instances" (e.g. (Namespace, "URI fragment") pairs, or anything else that can be resolved into an absolute URI). That is, a simple UNode("rdf:type") is no good.

Or... perhaps it is fine to leave those URIs as they are?

We have an option of letting the user take care of prefix expansion and "absolutizing" URIs. But I don't see why we should force users to resolve "rdf:type" -- after all, if they stick to using "rdf:type" everywhere (building and querying the graph), it's as good as an absolute URI.

It would be different, however, if we accepted only absolute URIs into UNode and had a separate Node type for prefixed nodes, as you suggested earlier. Then we would also need a separate Node type for relative URIs... I think it all makes sense and has a certain underpinning in terms of the categories we are expected to manipulate in Haskell 😃, but as far as performance goes... I'm not so sure. Would it still be possible to efficiently build and query graphs where one node can be represented by 3 different node types?

So, to summarize, we have 3 options:

  1. Force absolute URIs in UNodes.
  2. Handle absolute, prefixed and relative URIs as separate types (see the sketch after this list).
  3. Leave it up to the user to resolve (or not resolve) prefixed/relative URIs.
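To make option 2 more concrete, here is a rough sketch of what separate node types plus a resolution step might look like; UriNode, resolveNode and the constructor names are hypothetical and not part of rdf4h:

```haskell
-- Hypothetical sketch only: none of these names exist in rdf4h.
module NodeSketch where

import           Data.Text (Text)
import qualified Data.Text as T

-- Distinct constructors for the three kinds of URI reference.
data UriNode
  = AbsoluteUNode Text          -- "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
  | PrefixedUNode Text Text     -- prefix and local part, e.g. "rdf" and "type"
  | RelativeUNode Text          -- resolved against the document base URI
  deriving (Eq, Ord, Show)

-- Resolving a node needs both the prefix mappings and the base URI.
resolveNode :: [(Text, Text)] -> Text -> UriNode -> Text
resolveNode _        _    (AbsoluteUNode u)       = u
resolveNode prefixes _    (PrefixedUNode pfx loc) =
  case lookup pfx prefixes of
    Just ns -> T.append ns loc
    Nothing -> loc              -- unknown prefix; a real implementation would report this
resolveNode _        base (RelativeUNode loc)     = T.append base loc
```

Whether carrying three constructors through building and querying is acceptable performance-wise is exactly what the benchmarks should tell us.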

And now, for that subtlety of building and querying graphs that I mentioned at the beginning. If we handle URIs in RDF4H ourselves (options 1 or 2 above), we still have two options for where to do it:

  1. Perform resolution at graph building time (mkRdf and possibly functions for adding triples/nodes).
  2. Perform resolution at graph querying time (query, select and a dozen other functions).

I think it largely depends on the given graph implementation, and we cannot say in advance which would be faster (although the 2nd option sounds like a lot more coding than the 1st, because one has to ensure consistent output from all the query functions, while doing it in mkRdf automatically ensures consistency throughout the rest of the implementation). I believe the same applies to filtering out duplicates; perhaps there are even implementations that allow duplicates in the input or output. In any case, it makes sense to have two benchmarks: one for building the graph and another for querying it; a small illustrative sketch of the two placements follows.
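As a purely illustrative picture of those two placements, continuing the hypothetical NodeSketch module from the earlier sketch (BenchGraph is a stand-in, not one of rdf4h's graph types):

```haskell
-- Continuing the hypothetical NodeSketch module; BenchGraph is a stand-in.
newtype BenchGraph = BenchGraph [(UriNode, UriNode, UriNode)]

-- Placement 1: resolve while building. Stored triples are already
-- absolute, so query/select need no further work.
buildResolved :: [(Text, Text)] -> Text -> [(UriNode, UriNode, UriNode)] -> BenchGraph
buildResolved prefixes base ts =
  BenchGraph [ (absN s, absN p, absN o) | (s, p, o) <- ts ]
  where absN = AbsoluteUNode . resolveNode prefixes base

-- Placement 2: store nodes as given and resolve at query time. Building
-- is cheaper, but every query function must repeat the resolution step
-- consistently.
querySubject :: [(Text, Text)] -> Text -> BenchGraph -> Text -> [(UriNode, UriNode, UriNode)]
querySubject prefixes base (BenchGraph ts) subj =
  [ t | t@(s, _, _) <- ts, resolveNode prefixes base s == subj ]
```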

This is most likely a question for graph implementors, so we don't have much to discuss here; I'm just sharing my thoughts. But I believe we should really sort out the handling of absolute/relative/prefixed URIs in Nodes first.

What are your thoughts?

@robstewart57 (Owner, Author)

I've made a start with 8a7e6b0. The preliminary results are:

Benchmark rdf4h-bench: RUNNING...
benchmarking parse/TriplesGraph
time                 768.8 ms   (759.9 ms .. 773.7 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 769.0 ms   (767.8 ms .. 769.9 ms)
std dev              1.328 ms   (0.0 s .. 1.517 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking parse/MGraph
time                 1.067 s    (775.7 ms .. 1.374 s)
                     0.990 R²   (0.963 R² .. 1.000 R²)
mean                 995.5 ms   (918.9 ms .. 1.046 s)
std dev              75.93 ms   (0.0 s .. 87.66 ms)
variance introduced by outliers: 21% (moderately inflated)

benchmarking parse/PatriciaTreeGraph
time                 1.289 s    (1.250 s .. 1.348 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.265 s    (1.256 s .. 1.272 s)
std dev              12.51 ms   (0.0 s .. 13.32 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking query/TriplesGraph
time                 493.7 μs   (484.9 μs .. 505.4 μs)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 487.1 μs   (484.4 μs .. 492.3 μs)
std dev              12.53 μs   (6.996 μs .. 18.90 μs)
variance introduced by outliers: 17% (moderately inflated)

benchmarking query/MGraph
time                 383.2 ns   (382.2 ns .. 384.7 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 382.9 ns   (382.6 ns .. 383.6 ns)
std dev              1.294 ns   (686.9 ps .. 2.390 ns)

benchmarking query/PatriciaTreeGraph
time                 12.88 ms   (12.80 ms .. 12.95 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.89 ms   (12.78 ms .. 12.98 ms)
std dev              244.8 μs   (155.0 μs .. 395.0 μs)

benchmarking select/TriplesGraph
time                 709.9 μs   (708.6 μs .. 711.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 712.5 μs   (711.2 μs .. 715.0 μs)
std dev              5.740 μs   (3.679 μs .. 8.686 μs)

benchmarking select/MGraph
time                 6.844 ms   (6.738 ms .. 6.933 ms)
                     0.996 R²   (0.993 R² .. 0.998 R²)
mean                 6.901 ms   (6.810 ms .. 7.045 ms)
std dev              321.9 μs   (224.2 μs .. 441.3 μs)
variance introduced by outliers: 24% (moderately inflated)

benchmarking select/PatriciaTreeGraph
time                 14.32 ms   (14.12 ms .. 14.50 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 14.26 ms   (14.13 ms .. 14.37 ms)
std dev              286.3 μs   (200.4 μs .. 459.9 μs)

Benchmark rdf4h-bench: FINISH

@robstewart57 (Owner, Author)

Here are the RDF graph implementation benchmarks from October 2016:

http://robstewart57.github.io/rdf4h/rdf-bench-04102016.html
