GH-2061 mapdb upgrade #2063

barthanssens · 2020-04-02T19:15:04Z

GitHub issue resolved: #2061

Briefly describe the changes proposed in this PR:

Upgrade to MapDB 3.0.8
~~Use mmap for the temp file if available (= when using a 64-bit JVM)~~
Use unnamed tempfile instead of creating a temp file manually

PR Author Checklist:

my pull request is self-contained
I've added tests for the changes I made
every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change
every commit has been signed off

Note: we merge all feature pull requests using squash and merge. See RDF4J git merge strategy for more details.

abrokenjester · 2020-04-02T21:21:59Z

pom.xml

@@ -573,7 +573,7 @@
 			<dependency>
 				<groupId>org.mapdb</groupId>
 				<artifactId>mapdb</artifactId>
-				<version>1.0.8</version>
+				<version>3.0.8</version>


That's quite a jump!

Yeah, but mapdb documentation is still a bit sparse though :-/ Neither v1 nor v2 are supported anymore...

v3 allows for an overflow: one can create an in-memory mapdb, set an "expire" on either size or access time, and the expired entries can be stored in e.g. another on-disk mapdb

See also jankotek/mapdb#708

That does look nice. If I remember correctly we sort of rolled our own for this by using a MapDB-backed set but only calling commit (which writes to disk) every N items. If we can now configure MapDB to handle that kind of overflow for us, it would simplify the code a lot (and it will probably also be more reliable).

Documentation is here: https://jankotek.gitbooks.io/mapdb/content/

There seems to be a long-standing issue with mmap and the JDK, which is a bit unfortunate since the performance gains are substantial: https://jankotek.gitbooks.io/mapdb/content/performance/

There is also bug in JVM. Mmaped file handles are not released until DirectByteBuffer is GCed. That means that mmap file remains open even after db.close() is called. On Windows it prevents file to be reopened or deleted. On Linux it consumes file descriptors, and could lead to errors once all descriptors are used.
There is a workaround for this bug using undocumented API. But it was linked to JVM crashes in rare cases and is disabled by default. Use DBMaker.cleanerHackEnable() to enable it.

@jeenbroekstra apparently the overflow in MapDB is only supported for Maps, not (yet) for Sets :-(

Otherwise it would indeed save a few lines in GroupIterator
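
For readers unfamiliar with the "commit every N items" approach mentioned in this review thread, here is a minimal sketch of the idea. It is not the actual RDF4J GroupIterator code: `BatchedCommitSet`, `commitHook` and `batchSize` are hypothetical names, the backing set is a plain `HashSet`, and the `Runnable` merely stands in for a call such as `db.commit()`.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical illustration only: items go into a backing set, and a commit
 * hook runs once for every batchSize newly added items.
 */
class BatchedCommitSet<T> {
    private final Set<T> backing;      // in practice this would be a disk-backed set
    private final Runnable commitHook; // stands in for something like db.commit()
    private final int batchSize;
    private int uncommitted = 0;
    private int commits = 0;

    BatchedCommitSet(Set<T> backing, Runnable commitHook, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.backing = backing;
        this.commitHook = commitHook;
        this.batchSize = batchSize;
    }

    /** Adds an item; duplicates do not count towards the next commit. */
    boolean add(T item) {
        boolean added = backing.add(item);
        if (added && ++uncommitted >= batchSize) {
            commitHook.run();
            commits++;
            uncommitted = 0;
        }
        return added;
    }

    int commitCount() {
        return commits;
    }

    int size() {
        return backing.size();
    }
}

class BatchedCommitDemo {
    public static void main(String[] args) {
        BatchedCommitSet<String> set =
                new BatchedCommitSet<>(new HashSet<>(), () -> { /* flush to disk here */ }, 1000);
        for (int i = 0; i < 2500; i++) {
            set.add("binding-" + i);
        }
        // prints "2500 items, 2 commits"
        System.out.println(set.size() + " items, " + set.commitCount() + " commits");
    }
}
```

The trade-off is that up to `batchSize - 1` items are still uncommitted at any point, which is the bookkeeping a built-in overflow mechanism would take over.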

abrokenjester · 2020-04-02T21:23:59Z

I've approved because the change looks ok, but CI seems to have some issues. The test run appears to be stuck...

abrokenjester · 2020-04-02T21:36:04Z

I've canceled the test run because it was going for more than an hour and a half. Couldn't see details but clearly got stuck somewhere. I'll restart it on the off chance that it was a temporary glitch - I hear GitHub had some problems this morning.

EDIT Er. Right. That's annoying, there doesn't seem to be a way to retry the job from the UI (normally there's a 'restart workflow' button on the right, but apparently it only shows that if the build failed, not if it timed out like ours did).

@barthanssens can you try starting it by pushing an (empty) commit? Something like:

git commit -s --allow-empty -m "Trigger verification"

CI build fails

barthanssens · 2020-04-03T13:13:47Z

Changed my mind about using mmap, since the MapDB documentation (https://jankotek.gitbooks.io/mapdb/content/performance/) mentions that it has JVM issues.
So I guess it's better not to use mmap.

Mmap files are highly dependent on the operating system. For example, on Windows you cannot delete a mmap file while it is locked by JVM. If Windows JVM dies without closing the mmap file, you have to restart Windows to release the file lock.
There is also bug in JVM. Mmaped file handles are not released until DirectByteBuffer is GCed. That means that mmap file remains open even after db.close() is called. On Windows it prevents file to be reopened or deleted. On Linux it consumes file descriptors, and could lead to errors once all descriptors are used.
There is a workaround for this bug using undocumented API. But it was linked to JVM crashes in rare cases and is disabled by default.

See also https://bugs.openjdk.java.net/browse/JDK-4724038
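
As a point of comparison for the mmap problem quoted above, here is a minimal sketch of plain `FileChannel` I/O on a temp file, using only the JDK. It does not use MapDB, and `ChannelTempFile` and `roundTrip` are hypothetical names chosen for illustration. Because no mapped buffer is involved, the file handle is released as soon as the channel is closed, so the temp file can be deleted right away.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChannelTempFile {
    /** Writes text to a fresh temp file via FileChannel, reads it back, then deletes the file. */
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("channel-demo", ".bin");
            try {
                byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
                try (FileChannel ch = FileChannel.open(tmp,
                        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                    // Write through the channel; no memory mapping is created.
                    ByteBuffer out = ByteBuffer.wrap(bytes);
                    while (out.hasRemaining()) {
                        ch.write(out);
                    }
                    // Rewind and read the same bytes back.
                    ByteBuffer in = ByteBuffer.allocate(bytes.length);
                    ch.position(0);
                    while (in.hasRemaining() && ch.read(in) != -1) {
                        // keep reading until the buffer is full or EOF is reached
                    }
                    return new String(in.array(), 0, in.position(), StandardCharsets.UTF_8);
                }
            } finally {
                // The channel is already closed here, so the delete is not blocked by an open handle.
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

By contrast, a buffer obtained from `FileChannel.map` stays mapped until the buffer itself is garbage-collected, which is the behaviour the MapDB documentation warns about.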

abrokenjester

Looks good to me. It'd be nice if we could do some quick performance comparison (at least to make sure it's not a regression), but not a blocker.

barthanssens · 2020-04-08T09:11:59Z

ok, i'll try to compare the performance of mapdb 1 vs 3

barthanssens · 2020-04-10T11:08:11Z

I'm having some issues with (de)serialization in a MapDB 3 microbenchmark, so don't merge this PR yet.

hmottestad · 2020-04-10T11:10:22Z

Jeen found that handy "Convert to draft" link there. Not exactly the best user design if you ask me, but hey...at least there is an option now :P

barthanssens · 2020-04-10T14:30:22Z

Ah nice, thanks.

barthanssens · 2020-04-15T08:35:51Z

Working on (micro-)benchmark, initial test seems to indicate that disk-based MapDB v3 is twice as slow as v1 when not using mmap (and mmap has its own issues) :-/

I'll do some more tests, especially with vs without transactions.

hmottestad · 2020-04-15T11:12:25Z

Was the old version using mmap?

barthanssens · 2020-04-15T12:01:36Z

Not by default, no (at least, that's the impression I get reading the MapDB code)

MapDB v3 also has an option for using fileChannels, and some other options to fiddle with

Some JMH numbers on adding 50 x 1000 binding sets to a hash set (without any other RDF4J handling):

MapDB 1.0.8:
with db.commit (= what is currently used in RDF4J): 0.8 s/op
without db.commit: 0.1 s/op

MapDB 3.0.8:
with commits: 6.2 s/op (!)
with commits, with FileChannel: 3.4 s/op
without commits: 1.5 s/op
without commits, with FileChannel: 0.8 s/op

hmottestad · 2020-04-15T12:13:55Z

That is a lot slower.

I've sometimes used JFR (Java Flight Recorder) with my benchmarks to figure out what is actually slower.

Added this to the class:

@Fork(value = 1, jvmArgs = {"-Xms8G", "-Xmx8G", "-Xmn4G", "-XX:+UseSerialGC", "-XX:+UnlockCommercialFeatures", "-XX:StartFlightRecording=delay=5s,duration=120s,filename=recording.jfr,settings=profile", "-XX:FlightRecorderOptions=samplethreads=true,stackdepth=1024", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints"})

Creates a recording.jfr file that can be opened with Java Mission Control.

That way you can see what is taking so much more time with the new version.

barthanssens · 2020-04-16T14:00:47Z

Progress: setting a specific serialization in mapdb 3.0.8 improves performance a lot.
Though JMH microbenchmark still takes 1.1s (vs 0.8s in MapDB 1.0.8)

abrokenjester · 2021-04-15T23:52:06Z

I will be taking another look at this, as it may also be relevant to GH-2998.

abrokenjester · 2021-04-16T00:19:05Z

I've rebased this against the current main branch.


abrokenjester · 2021-04-16T03:46:37Z

Rebased again, on develop. Makes it a bit clearer what's actually involved in this PR.

abrokenjester · 2021-04-16T03:48:58Z

@barthanssens which particular benchmark did you use to do performance comparison? Is it checked in?

barthanssens · 2021-04-16T06:25:05Z

> @barthanssens which particular benchmark did you use to do performance comparison? Is it checked in?

Hmz, have to check if I still have the code... In any case, it really was a micro-benchmark, directly using MapDB to add BindingSets instead of benchmarking the NativeStore with different MapDB versions

barthanssens · 2021-04-16T21:00:25Z

@jeenbroekstra code can be found here https://github.com/Fedict/mapdbtest

abrokenjester · 2021-04-17T01:14:07Z

Ran some comparative benchmarks on my machine:

MapDB 3.0.8

Benchmark         Mode  Cnt  Score   Error  Units
Main.addCommit    avgt   20  0.837 ± 0.089   s/op
Main.addNoCommit  avgt   20  0.194 ± 0.003   s/op

MapDB 1.0.8

Benchmark         Mode  Cnt  Score   Error  Units
Main.addCommit    avgt   20  0.527 ± 0.031   s/op
Main.addNoCommit  avgt   20  0.066 ± 0.001   s/op

It's not insignificant but I am quite anxious about relying on a completely unsupported version. I'll see if I can come up with some other benchmarks that more closely mimic the use in the query engine, see if the difference is as pronounced there.
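
To put the two tables above side by side, dividing each MapDB 3.0.8 score by the corresponding 1.0.8 score gives the relative slowdown. The error margins are ignored here, so these are rough figures only:

```latex
\[
\frac{0.837}{0.527} \approx 1.59 \quad \text{(addCommit)}, \qquad
\frac{0.194}{0.066} \approx 2.94 \quad \text{(addNoCommit)}
\]
```

In other words, on this machine v3 takes roughly 1.6 times as long per operation with commits, and roughly 2.9 times as long without.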

abrokenjester · 2021-04-17T01:30:02Z

Alternatively, I just stumbled across Ehcache (https://www.ehcache.org/) which looks potentially promising as well - it's a caching mechanism a la guava but with a lot of extra knobs, including different caching tiers (among which on-disk persistence).

barthanssens · 2021-04-17T09:49:23Z

Yet another option could be Chronicle Map (https://github.com/OpenHFT/Chronicle-Map/blob/ea/docs/CM_Tutorial.adoc), but it looks like the max number of entries is fixed once initialized
https://www.javadoc.io/doc/net.openhft/chronicle-map/latest/net/openhft/chronicle/map/ChronicleMapBuilder.html
Default is 1 million entries

abrokenjester previously approved these changes Apr 2, 2020

barthanssens requested a review from hmottestad April 3, 2020 13:04

abrokenjester approved these changes Apr 4, 2020

barthanssens added the WIP label Apr 10, 2020

hmottestad marked this pull request as draft April 10, 2020 11:08

hmottestad removed the WIP label Apr 10, 2020

hmottestad mentioned this pull request Apr 30, 2020

Considering using MapDB in place of HashFile. #2153

Closed

abrokenjester linked an issue May 7, 2020 that may be closed by this pull request

Upgrade MapDB #2061

Closed

hmottestad closed this Jul 24, 2020

abrokenjester reopened this Apr 15, 2021

abrokenjester force-pushed the GH-2061-mapdb-upgrade branch from c36e7b6 to 6257895 Compare April 16, 2021 00:18

abrokenjester self-requested a review April 16, 2021 00:21

barthanssens added 3 commits April 16, 2021 13:45

eclipse-rdf4jGH-2061 mapdb upgrade

009ff68

Signed-off-by:Bart Hanssens <[email protected]>

eclipse-rdf4jGH-2061 code formatting

2f6758f

Signed-off-by:Bart Hanssens <[email protected]>

eclipse-rdf4jGH-2061 don't use mmap since it has issues (no releasing…

f8588be

… descriptors) Signed-off-by:Bart Hanssens <[email protected]>

barthanssens and others added 2 commits April 16, 2021 13:45

eclipse-rdf4jGH-2061 use filechannel and faster serialization

77b6e60

Signed-off-by:Bart Hanssens <[email protected]>

eclipse-rdf4jGH-2061 formatting fixed

981b5cc

abrokenjester force-pushed the GH-2061-mapdb-upgrade branch from 27ba836 to 981b5cc Compare April 16, 2021 03:46

barthanssens closed this Dec 16, 2021

barthanssens deleted the GH-2061-mapdb-upgrade branch December 16, 2021 10:09
