Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-2061 mapdb upgrade #2063

Closed

Conversation

barthanssens
Copy link
Contributor

@barthanssens barthanssens commented Apr 2, 2020

GitHub issue resolved: #2061

Briefly describe the changes proposed in this PR:

  • Upgrade to MapDB 3.0.8
  • Use mmap for the temp file if available (= when using a 64-bit JVM)
  • Use unnamed tempfile instead of creating a temp file manually

PR Author Checklist:

  • my pull request is self-contained
  • I've added tests for the changes I made
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change
  • every commit has been signed off

Note: we merge all feature pull requests using squash and merge. See RDF4J git merge strategy for more details.

abrokenjester
abrokenjester previously approved these changes Apr 2, 2020
@@ -573,7 +573,7 @@
<dependency>
<groupId>org.mapdb</groupId>
<artifactId>mapdb</artifactId>
<version>1.0.8</version>
<version>3.0.8</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's quite a jump!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but mapdb documentation is still a bit sparse though :-/ Neither v1 nor v2 are supported anymore...

v3 allows for an overflow: one can create an in-memory mapdb, set an "expire" on either size or access time, and the expired entries can be stored in e.g. another on-disk mapdb

See also jankotek/mapdb#708

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does look nice. If I remember correctly we sort of rolled our own for this by using a MapDB backed set but only calling commit (which writes to disk) every N items. If we can now configure mapdb to handle that kind of overflow for us, it simplify the code a lot (and it will probably also be more reliable).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is here: https://jankotek.gitbooks.io/mapdb/content/

There seems to be a long-standing issue with mmap and JDK, which is a bit unfortunate since the performance gains are substantial https://jankotek.gitbooks.io/mapdb/content/performance/

There is also bug in JVM. Mmaped file handles are not released until DirectByteBuffer is GCed. That means that mmap file remains open even after db.close() is called. On Windows it prevents file to be reopened or deleted. On Linux it consumes file descriptors, and could lead to errors once all descriptors are used.
There is a workaround for this bug using undocumented API. But it was linked to JVM crashes in rare cases and is disabled by default. Use DBMaker.cleanerHackEnable() to enable it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeenbroekstra apparently the overflow in MapDB is not (yet) supported for Maps, not for Sets :-(

Otherwise it would indeed save a few lines in GroupIterator

@abrokenjester
Copy link
Contributor

I've approved because the change looks ok, but CI seems to have some issues. The test run appears to be stuck...

@abrokenjester
Copy link
Contributor

abrokenjester commented Apr 2, 2020

I've canceled the test run because it was going for more than an hour and a half. Couldn't see details but clearly got stuck somewhere. I'll restart it on the off chance that it was a temporary glitch - I hear github had some problems this morning.

EDIT Er. Right. That's annoying, there doesn't to be a way to retry the job from the UI (normally there's a 'restart workflow' button on the right, but apparently it only shows that if the build failed, not if it timed out like ours did).

@barthanssens can you try starting it by pushing an (empty) commit? Something like:

git commit -s --allow-empty -m "Trigger verification"

@barthanssens
Copy link
Contributor Author

Changed my mind about using mmap, since the mapdb documentation (https://jankotek.gitbooks.io/mapdb/content/performance/) mentions that it has JVM issues .
So I guess it's better not to use mmap.

Mmap files are highly dependent on the operating system. For example, on Windows you cannot delete a mmap file while it is locked by JVM. If Windows JVM dies without closing the mmap file, you have to restart Windows to release the file lock.
There is also bug in JVM. Mmaped file handles are not released until DirectByteBuffer is GCed. That means that mmap file remains open even after db.close() is called. On Windows it prevents file to be reopened or deleted. On Linux it consumes file descriptors, and could lead to errors once all descriptors are used.
There is a workaround for this bug using undocumented API. But it was linked to JVM crashes in rare cases and is disabled by default.

See also https://bugs.openjdk.java.net/browse/JDK-4724038

Copy link
Contributor

@abrokenjester abrokenjester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. It'd be nice if we could do some quick performance comparison (at least to make sure it's not a regression), but not a blocker.

@barthanssens
Copy link
Contributor Author

ok, i'll try to compare the performance of mapdb 1 vs 3

@barthanssens
Copy link
Contributor Author

I'm having some issues with (de)serilization in a mapdb 3 microbenchmark, so don't merge this PR yet.

@hmottestad hmottestad marked this pull request as draft April 10, 2020 11:08
@hmottestad
Copy link
Contributor

Screenshot 2020-04-10 at 13 08 51

Jeen found that handy "Convert to draft" link there. Not exactly the best user design if you ask me, but hey...at least there is an option now :P

@hmottestad hmottestad removed the WIP label Apr 10, 2020
@barthanssens
Copy link
Contributor Author

Ah nice, thanks.

@barthanssens
Copy link
Contributor Author

Working on (micro-)benchmark, initial test seems to indicate that disk-based MapDB v3 is twice as slow as v1 when not using mmap (and mmap has its own issues) :-/

I'll do some more tests, especially with vs without transactions.

@hmottestad
Copy link
Contributor

Was the old version using mmap?

@barthanssens
Copy link
Contributor Author

Not by default, no (at least, that's the impression I get reading the MapDB code)

MapDB v3 also has an option for using fileChannels, and some other options to fiddle with

Some JMH numbers on adding 50 x 1000 bindingsets in a hashset (without any other RDF4J handling):

MapDB 1.0.8:
with db.commit (= what is currently used in RDF4J): 0,8 seconds / op
without db.commit: 0.1 s

MapDB 3.0.8:
with commits: 6.2 s / op(!)
with commits, with filechannel: 3,4 s

without commits: 1,5s
without commits, with filechannel: 0.8s

@hmottestad
Copy link
Contributor

That is a lot slower.

I've used JRF with my benchmarks sometimes to figure out what is actually slower.

Added this to the class:

@Fork(value = 1, jvmArgs = {"-Xms8G", "-Xmx8G", "-Xmn4G", "-XX:+UseSerialGC", "-XX:+UnlockCommercialFeatures", "-XX:StartFlightRecording=delay=5s,duration=120s,filename=recording.jfr,settings=profile", "-XX:FlightRecorderOptions=samplethreads=true,stackdepth=1024", "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints"})

Creates a recording.jfr file that can be opened with Java Mission Control.

That way you can see what is taking so much more time with the new version.

@barthanssens
Copy link
Contributor Author

Progress: setting a specific serialization in mapdb 3.0.8 improves performance a lot.
Though JMH microbenchmark still takes 1.1s (vs 0.8s in MapDB 1.0.8)

@abrokenjester abrokenjester linked an issue May 7, 2020 that may be closed by this pull request
@hmottestad hmottestad closed this Jul 24, 2020
@abrokenjester abrokenjester reopened this Apr 15, 2021
@abrokenjester
Copy link
Contributor

I will be taking another look at this, as it may also be relevant to GH-2998.

@abrokenjester
Copy link
Contributor

I've rebased this against the current main branch.

@abrokenjester abrokenjester self-requested a review April 16, 2021 00:21
@abrokenjester
Copy link
Contributor

Rebased again, on develop. Makes it a bit clearer what's actually involved in this PR.

@abrokenjester
Copy link
Contributor

@barthanssens which particular benchmark did you use to do performance comparison? Is it checked in?

@barthanssens
Copy link
Contributor Author

@barthanssens which particular benchmark did you use to do performance comparison? Is it checked in?

Hmz, have to check if I still have the code... In any case, it really was a micro-benchmark, directly using MapDB to add BindingSets instead of benchmarking the NativeStore with different MapDB versions

@barthanssens
Copy link
Contributor Author

@jeenbroekstra code can be found here https://github.com/Fedict/mapdbtest

@abrokenjester
Copy link
Contributor

abrokenjester commented Apr 17, 2021

Ran some comparative benchmarks on my machine:

MapDB 3.0.8

Benchmark         Mode  Cnt  Score   Error  Units
Main.addCommit    avgt   20  0.837 ± 0.089   s/op
Main.addNoCommit  avgt   20  0.194 ± 0.003   s/op

MapDB 1.0.8

Benchmark         Mode  Cnt  Score   Error  Units
Main.addCommit    avgt   20  0.527 ± 0.031   s/op
Main.addNoCommit  avgt   20  0.066 ± 0.001   s/op

It's not insignificant but I am quite anxious about relying on a completely unsupported version. I'll see if I can come up with some other benchmarks that more closely mimic the use in the query engine, see if the difference is as pronounced there.

@abrokenjester
Copy link
Contributor

Alternatively, I just stumbled across Ehcache (https://www.ehcache.org/) which looks potentially promising as well - it's a caching mechanism a la guava but with a lot of extra knobs, including different caching tiers (among which on-disk persistence).

@barthanssens
Copy link
Contributor Author

Yet another option coud be Chronicle Map (https://github.com/OpenHFT/Chronicle-Map/blob/ea/docs/CM_Tutorial.adoc), but it looks like the max number of entries is fixed once initialized
https://www.javadoc.io/doc/net.openhft/chronicle-map/latest/net/openhft/chronicle/map/ChronicleMapBuilder.html
Default is 1 million entries

@barthanssens barthanssens deleted the GH-2061-mapdb-upgrade branch December 16, 2021 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrade MapDB
3 participants