GH-2061 mapdb upgrade #2063
@@ -573,7 +573,7 @@
 <dependency>
 <groupId>org.mapdb</groupId>
 <artifactId>mapdb</artifactId>
-<version>1.0.8</version>
+<version>3.0.8</version>
That's quite a jump!
Yeah, but the MapDB documentation is still a bit sparse :-/ Neither v1 nor v2 is supported anymore...
v3 allows for an overflow: one can create an in-memory mapdb, set an "expire" on either size or access time, and the expired entries can be stored in e.g. another on-disk mapdb
See also jankotek/mapdb#708
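To make the overflow idea concrete, here is a minimal sketch using only plain JDK collections. It is not MapDB code and does not use the MapDB API: the class name `OverflowingCache` is made up for illustration, the in-memory tier is a size-bounded `LinkedHashMap`, and a second `HashMap` merely stands in for the on-disk store.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "expire on size, overflow to a second store".
final class OverflowingCache<K, V> {
    // Stand-in for the on-disk store (an assumption; a real setup would persist this).
    private final Map<K, V> overflow = new HashMap<>();
    private final LinkedHashMap<K, V> hot;

    OverflowingCache(final int maxHotEntries) {
        // accessOrder = true, so the eldest entry is the least recently accessed one.
        this.hot = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() > maxHotEntries) {
                    // "Expired" entries are moved to the overflow store, not dropped.
                    overflow.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    void put(K key, V value) {
        overflow.remove(key); // avoid keeping a stale copy in the overflow store
        hot.put(key, value);
    }

    V get(K key) {
        V value = hot.get(key);
        return value != null ? value : overflow.get(key);
    }

    int hotSize() {
        return hot.size();
    }

    int overflowSize() {
        return overflow.size();
    }

    public static void main(String[] args) {
        OverflowingCache<String, Integer> cache = new OverflowingCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3); // pushes "a" out of the in-memory tier
        System.out.println(cache.hotSize() + " in memory, " + cache.overflowSize() + " overflowed");
    }
}
```

The point of having the store do this itself is that the caller only ever talks to one map and never has to decide when to spill.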
That does look nice. If I remember correctly we sort of rolled our own for this by using a MapDB-backed set but only calling commit (which writes to disk) every N items. If we can now configure MapDB to handle that kind of overflow for us, it would simplify the code a lot (and it will probably also be more reliable).
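For readers unfamiliar with the "commit every N items" pattern mentioned above, a rough sketch follows. It is not the actual RDF4J code: `BatchedCommitSet` is a hypothetical name, the backing set is any `java.util.Set`, and the commit action is passed in as a `Runnable` (with MapDB this would presumably be the set and the commit call of the database that owns it).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: adds go to the backing set, commit runs once per batch.
final class BatchedCommitSet<E> {
    private final Set<E> backing;
    private final Runnable commit;
    private final int batchSize;
    private int uncommitted;

    BatchedCommitSet(Set<E> backing, Runnable commit, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.backing = backing;
        this.commit = commit;
        this.batchSize = batchSize;
    }

    boolean add(E element) {
        boolean added = backing.add(element);
        // Only new elements count towards the batch; duplicates change nothing.
        if (added && ++uncommitted >= batchSize) {
            commit.run();
            uncommitted = 0;
        }
        return added;
    }

    public static void main(String[] args) {
        int[] commits = {0};
        BatchedCommitSet<Integer> set =
                new BatchedCommitSet<>(new HashSet<Integer>(), () -> commits[0]++, 10);
        for (int i = 0; i < 25; i++) {
            set.add(i);
        }
        System.out.println("commits: " + commits[0]);
    }
}
```

Note that items added after the last full batch stay uncommitted until someone commits explicitly, which is one of the things a built-in overflow would take care of.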
Documentation is here: https://jankotek.gitbooks.io/mapdb/content/
There seems to be a long-standing issue with mmap and the JDK, which is a bit unfortunate since the performance gains are substantial (see https://jankotek.gitbooks.io/mapdb/content/performance/):
There is also bug in JVM. Mmaped file handles are not released until DirectByteBuffer is GCed. That means that mmap file remains open even after db.close() is called. On Windows it prevents file to be reopened or deleted. On Linux it consumes file descriptors, and could lead to errors once all descriptors are used.
There is a workaround for this bug using undocumented API. But it was linked to JVM crashes in rare cases and is disabled by default. Use DBMaker.cleanerHackEnable() to enable it.
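The behaviour described in the quoted documentation can be seen with plain `java.nio`, independent of MapDB. The sketch below (class and method names are made up) maps a temp file, writes through the mapping and closes the channel; per the `FileChannel.map` contract the mapping stays valid after the channel is closed and is only released once the buffer is garbage collected.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, JDK only.
final class MmapHandleDemo {

    /** Writes a long through a memory-mapped buffer and reads it back. */
    static long writeAndReadBack(Path file, long value) throws IOException {
        MappedByteBuffer buffer;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            buffer.putLong(0, value);
            buffer.force(); // flush the mapped region to the file
        }
        // The channel is closed here, but the mapping is still alive and readable.
        return buffer.getLong(0);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-demo", ".bin");
        System.out.println(writeAndReadBack(file, 42L));
        try {
            // On Windows this may fail while the mapping has not been collected yet.
            Files.deleteIfExists(file);
        } catch (IOException e) {
            System.err.println("Could not delete mapped file yet: " + e);
        }
    }
}
```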
@jeenbroekstra apparently the overflow in MapDB is only supported for Maps, not (yet) for Sets :-(
Otherwise it would indeed save a few lines in GroupIterator
I've approved because the change looks ok, but CI seems to have some issues. The test run appears to be stuck...
I've canceled the test run because it had been going for more than an hour and a half. I couldn't see details, but it clearly got stuck somewhere. I'll restart it on the off chance that it was a temporary glitch - I hear GitHub had some problems this morning. EDIT: Er. Right. That's annoying, there doesn't seem to be a way to retry the job from the UI (normally there's a 'restart workflow' button on the right, but apparently it only shows that if the build failed, not if it timed out like ours did). @barthanssens can you try starting it by pushing an (empty) commit? Something like:
Changed my mind about using mmap, since the MapDB documentation (https://jankotek.gitbooks.io/mapdb/content/performance/) mentions that it has JVM issues.
Looks good to me. It'd be nice if we could do some quick performance comparison (at least to make sure it's not a regression), but not a blocker.
OK, I'll try to compare the performance of MapDB 1 vs 3.
I'm having some issues with (de)serialization in a MapDB 3 microbenchmark, so don't merge this PR yet.
Ah nice, thanks.
Working on a (micro-)benchmark; an initial test seems to indicate that disk-based MapDB v3 is twice as slow as v1 when not using mmap (and mmap has its own issues) :-/ I'll do some more tests, especially with vs without transactions.
Was the old version using mmap?
Not by default, no (at least, that's the impression I get reading the MapDB code).
MapDB v3 also has an option for using fileChannels, and some other options to fiddle with.
Some JMH numbers on adding 50 x 1000 bindingsets in a hashset (without any other RDF4J handling):
MapDB 1.0.8: MapDB 3.0.8: without commits: 1,5s
That is a lot slower. I've sometimes used JFR with my benchmarks to figure out what is actually slower. Added this to the class:
Creates a recording.jfr file that can be opened with Java Mission Control. That way you can see what is taking so much more time with the new version.
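As a rough illustration of getting such a recording, here is a sketch that uses the JDK's programmatic `jdk.jfr` API (JDK 11 or later) rather than whatever was added to the benchmark class in this thread. The class name `JfrSketch` and the method `recordWhile` are made up; only `Recording`, `Configuration` and their methods come from the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

// Hypothetical helper: runs a workload under Flight Recorder and dumps the result.
final class JfrSketch {

    static Path recordWhile(Runnable workload, Path target) throws IOException, ParseException {
        // "default" is the low-overhead settings profile shipped with the JDK.
        Configuration settings = Configuration.getConfiguration("default");
        try (Recording recording = new Recording(settings)) {
            recording.start();
            workload.run();
            recording.stop();
            recording.dump(target); // write the collected events to the target file
        }
        return target;
    }

    public static void main(String[] args) throws Exception {
        Path target = Files.createTempDirectory("jfr-sketch").resolve("recording.jfr");
        recordWhile(() -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i % 7;
            }
            System.out.println(sum);
        }, target);
        System.out.println("Recording written to " + target);
    }
}
```

The resulting file can be opened in Java Mission Control in the same way as one produced via JVM flags.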
Progress: setting a specific serialization in MapDB 3.0.8 improves performance a lot.
I will be taking another look at this, as it may also be relevant to GH-2998.
Force-pushed from c36e7b6 to 6257895
I've rebased this against the current main branch.
Signed-off-by: Bart Hanssens <[email protected]>
… descriptors) Signed-off-by: Bart Hanssens <[email protected]>
Force-pushed from 27ba836 to 981b5cc
Rebased again, on develop. Makes it a bit clearer what's actually involved in this PR.
@barthanssens which particular benchmark did you use for the performance comparison? Is it checked in?
Hmz, have to check if I still have the code... In any case, it really was a micro-benchmark, directly using MapDB to add BindingSets instead of benchmarking the NativeStore with different MapDB versions.
@jeenbroekstra code can be found here: https://github.com/Fedict/mapdbtest
Ran some comparative benchmarks on my machine: MapDB 3.0.8
MapDB 1.0.8
It's not insignificant, but I am quite anxious about relying on a completely unsupported version. I'll see if I can come up with some other benchmarks that more closely mimic the use in the query engine, to see if the difference is as pronounced there.
Alternatively, I just stumbled across Ehcache (https://www.ehcache.org/) which looks potentially promising as well - it's a caching mechanism à la Guava but with a lot of extra knobs, including different caching tiers (among which on-disk persistence).
Yet another option could be Chronicle Map (https://github.com/OpenHFT/Chronicle-Map/blob/ea/docs/CM_Tutorial.adoc), but it looks like the max number of entries is fixed once initialized.
GitHub issue resolved: #2061
Briefly describe the changes proposed in this PR:
Use mmap for the temp file if available (= when using a 64-bit JVM)
PR Author Checklist:
Note: we merge all feature pull requests using squash and merge. See RDF4J git merge strategy for more details.