
GH-2799 avoid OOM through HashMap resizing by setting initial size #2872

Merged
4 commits merged into main from GH-2799-memoryoverflowmodel-oom on Apr 24, 2021

Conversation

abrokenjester
Contributor

@abrokenjester abrokenjester commented Feb 24, 2021

GitHub issue resolved: #2799

Briefly describe the changes proposed in this PR:

  • initialize the LinkedHashModel with a fixed size equal to the threshold we use for overflow to disk - this should avoid unexpected resizing before overflow is triggered (see the sketch below this list)
  • note that LinkedHashModel itself internally already uses size * 2 for the statement set to compensate for the load factor, so using the exact block size should be fine for our purposes
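To make the first point concrete, here is a minimal sketch of the idea. It is illustrative only, not the actual MemoryOverflowModel code; the LARGE_BLOCK value below is a placeholder for the real overflow threshold.

import org.eclipse.rdf4j.model.impl.LinkedHashModel;

class OverflowCacheSketch {
	// hypothetical threshold, in number of statements
	private static final int LARGE_BLOCK = 10_000;

	// LinkedHashModel(int size) allocates roughly size * 2 internally to
	// compensate for the load factor, so passing the block size directly
	// should avoid any rehashing before the overflow point is reached
	private final LinkedHashModel memory = new LinkedHashModel(LARGE_BLOCK);
}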

PR Author Checklist (see the contributor guidelines for more details):

  • my pull request is self-contained
  • I've added tests for the changes I made
  • I've applied code formatting (you can use mvn process-resources to format from the command line)
  • I've squashed my commits down to one or a few meaningful commits
  • every commit message starts with the issue number (GH-xxxx) followed by a meaningful description of the change
  • every commit has been signed off

@abrokenjester
Contributor Author

I just realized there's a bit of a false assumption in this change, due to poor variable/constant naming: although the LARGE_BLOCK constant plays a role in the overflow trigger, it is not as simple as "overflow when we have more statements than this fixed constant".

An alternative is to move the "10% of the heap is still free" requirement up a bit, to, say, 15%.
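For reference, the kind of check being discussed looks roughly like this. This is a hypothetical illustration only; the real MemoryOverflowModel logic and names differ, and the percentage is the knob being debated here:

static boolean shouldOverflow(double minFreeFraction) {
	Runtime runtime = Runtime.getRuntime();
	long used = runtime.totalMemory() - runtime.freeMemory();
	// headroom the JVM can still hand out: unallocated heap plus allocated-but-unused heap
	long freeToAllocate = runtime.maxMemory() - used;
	// overflow to disk once less than e.g. 10% (or 15%) of the max heap is still free
	return freeToAllocate < minFreeFraction * runtime.maxMemory();
}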

@hmottestad
Contributor

I think we should start with a test, or a benchmark if that's easier. Essentially limit the memory available and load in a big file. I'm also curious whether this scales as it should, e.g. does loading a bigger file while giving more memory trigger the overflow as consistently as loading a smaller file with less memory available?

@abrokenjester
Contributor Author

abrokenjester commented Feb 24, 2021

Not at my desk right now, but I believe there's an existing compliance test for the overflow that we could extend if necessary.

Update: the test I was thinking of is TestNativeStoreMemoryOverflow, which is actually a unit test for the native store. I've got to admit I can't make out what the test is supposed to prove exactly: it seems to just add a bunch of statements and then check that they're there. It's not obvious to me how it actually checks the MemoryOverflow behavior.

@abrokenjester
Contributor Author

I'm not aware of a simple way to control the available heap space for a single JUnit test, so we may want to "abuse" a JMH benchmark for this purpose. I'm currently out of spare time, so I won't continue on this immediately. @hmottestad if you have an idea and the time/will to set up such a test, feel free to add it to this branch.

@hmottestad
Contributor

After I commented here I set out to google whether JUnit 5 could fork the JVM for a test in order to configure the amount of memory, but I could not find anything. I guess abusing JMH might be the only way, unfortunately.
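For the record, the "abuse" would look something like this. This is only a sketch; the class and method names are placeholders, not the benchmarks later added to this branch. The forked benchmark JVM gets its own heap settings, so the overflow behaviour can be exercised without constraining the build JVM:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class OverflowBenchmarkSketch {

	// cap the heap of the forked JVM so the overflow path is actually hit
	@Fork(value = 1, jvmArgs = { "-Xms600m", "-Xmx600m" })
	@Benchmark
	public void loadLargeFileWithSmallHeap() {
		// load a large RDF file into a NativeStore here and rely on
		// MemoryOverflowModel to spill to disk instead of throwing an OOM
	}
}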

@abrokenjester
Contributor Author

To be fair, even with benchmarking in place there's only so much we can do to make this work better: at the end of the day we are relying on an estimate of available heap space and predicted usage to trigger disk overflow. In real-life situations that estimate could be way off because of circumstances beyond our control (e.g. some other Java object suddenly consuming a lot of memory, or something as basic as the data currently being uploaded suddenly containing a few massive literal values). We make a best effort, but we can't guarantee the process won't run out of memory.

@abrokenjester
Contributor Author

Btw, a reason why disk syncing on memory overflow is giving us such a performance hit was given by Arjohn Kampman on the mailing list:

The reason that syncing the data to disk when the overflow is triggered takes so long is also related to a call to dataset(). When overflowToDisk() calls disk.addAll(memory), this triggers a call to SailSourceModel.add(s,p,o,c...) for each statement. This method then calls both contains(s,p,o,c...) and sink().approve(s,p,o,c) for each statement. The latter call starts a new transaction and updates the txn-status file, but the contains() call then commits the transaction for the previous statement via a call to dataset(), again updating the txn-status file. So for every cached statement, rdf4j does two I/O calls. On a spinning disk with an average write time of 10 ms, this limits the overflow process to at most 50 statements per second.

That is not directly related to this fix, but if we can address some of that, we can have more confidence that setting a lower overflow threshold won't cause a massive performance penalty.
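Paraphrasing the pattern described above as pseudocode (not the actual SailSourceModel source), the per-statement cost during overflow is roughly:

for (Statement st : memory) {
	// contains() commits the previous statement's transaction via dataset(),
	// updating the txn-status file (one disk write)
	if (!disk.contains(st.getSubject(), st.getPredicate(), st.getObject(), st.getContext())) {
		// approve() starts a new transaction, updating the txn-status file again (a second disk write)
		disk.add(st);
	}
}
// 2 writes per statement * ~10 ms per write on a spinning disk
// = ~20 ms per statement, i.e. at most ~50 statements per second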

@hmottestad
Contributor

Do you mind @jeenbroekstra if I force push this branch? I would like to add a commit with a benchmark as the first commit in this branch so that we can easily test before and after.

@abrokenjester
Contributor Author

Go for it

@hmottestad hmottestad force-pushed the GH-2799-memoryoverflowmodel-oom branch from e850bb4 to 590dad6 Compare April 16, 2021 06:26
@hmottestad
Contributor

ok, I've just force pushed, but will need to force push at least one more time before I'm done.

@hmottestad hmottestad force-pushed the GH-2799-memoryoverflowmodel-oom branch from 590dad6 to b76dde0 Compare April 16, 2021 08:21
@hmottestad
Contributor

I've added two benchmarks, one synthetic and one with a real-world file.

I can get the synthetic one to fail with very low memory, and the real-world file to fail with higher memory; both fail at different places. To make things even more complicated, the real-world file fails only within a certain memory range (500-700 MB); outside of that range it either overflows correctly or doesn't need to overflow.

@hmottestad
Contributor

hmottestad commented Apr 16, 2021

The synthetic benchmark only fails on Java 8 for me, probably due to using G1GC.

Btw, I tried using the parallel GC, and it failed differently:

java.lang.OutOfMemoryError: Java heap space
	at java.util.BitSet.initWords(BitSet.java:166)
	at java.util.BitSet.<init>(BitSet.java:161)
	at org.eclipse.rdf4j.sail.nativerdf.datastore.HashFile.<init>(HashFile.java:158)
	at org.eclipse.rdf4j.sail.nativerdf.datastore.HashFile.<init>(HashFile.java:98)
	at org.eclipse.rdf4j.sail.nativerdf.datastore.DataStore.<init>(DataStore.java:46)
	at org.eclipse.rdf4j.sail.nativerdf.ValueStore.<init>(ValueStore.java:139)
	at org.eclipse.rdf4j.sail.nativerdf.NativeSailStore.<init>(NativeSailStore.java:92)
	at org.eclipse.rdf4j.sail.nativerdf.NativeSailStore.<init>(NativeSailStore.java:80)
	at org.eclipse.rdf4j.sail.nativerdf.NativeStore$2.createSailStore(NativeStore.java:272)
	at org.eclipse.rdf4j.sail.nativerdf.MemoryOverflowModel.overflowToDisk(MemoryOverflowModel.java:263)
	at org.eclipse.rdf4j.sail.nativerdf.MemoryOverflowModel.checkMemoryOverflow(MemoryOverflowModel.java:250)
	at org.eclipse.rdf4j.sail.nativerdf.MemoryOverflowModel.add(MemoryOverflowModel.java:122)
	at org.eclipse.rdf4j.sail.base.Changeset.approve(Changeset.java:259)
	at org.eclipse.rdf4j.sail.base.SailSourceConnection.add(SailSourceConnection.java:709)
	at org.eclipse.rdf4j.sail.base.SailSourceConnection.addStatement(SailSourceConnection.java:577)
	at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.addStatement(AbstractSailConnection.java:443)
	at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.addWithoutCommit(SailRepositoryConnection.java:393)
	at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.addWithoutCommit(AbstractRepositoryConnection.java:508)
	at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.add(AbstractRepositoryConnection.java:418)
	at org.eclipse.rdf4j.sail.nativerdf.benchmark.OverflowBenchmarkSynthetic.lambda$addData$1(OverflowBenchmarkSynthetic.java:177)
	at org.eclipse.rdf4j.sail.nativerdf.benchmark.OverflowBenchmarkSynthetic$$Lambda$15/841851332.accept(Unknown Source)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
	at java.util.stream.IntPipeline$4$1.accept(IntPipeline.java:250)
	at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
	at java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:693)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)

EDIT: Yet another different failure point (Java 8 + G1GC):

java.lang.OutOfMemoryError: Java heap space
	at org.eclipse.rdf4j.sail.nativerdf.btree.Node.<init>(Node.java:59)
	at org.eclipse.rdf4j.sail.nativerdf.btree.BTree.lambda$new$0(BTree.java:96)
	at org.eclipse.rdf4j.sail.nativerdf.btree.BTree$$Lambda$11/564661451.apply(Unknown Source)
	at org.eclipse.rdf4j.sail.nativerdf.btree.ConcurrentNodeCache.lambda$readAndUse$1(ConcurrentNodeCache.java:47)
	at org.eclipse.rdf4j.sail.nativerdf.btree.ConcurrentNodeCache$$Lambda$21/1430898173.apply(Unknown Source)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
	at org.eclipse.rdf4j.sail.nativerdf.btree.ConcurrentNodeCache.readAndUse(ConcurrentNodeCache.java:46)
	at org.eclipse.rdf4j.sail.nativerdf.btree.BTree.readNode(BTree.java:1023)
	at org.eclipse.rdf4j.sail.nativerdf.btree.Node.getChildNode(Node.java:220)
	at org.eclipse.rdf4j.sail.nativerdf.btree.RangeIterator.findNext(RangeIterator.java:145)
	at org.eclipse.rdf4j.sail.nativerdf.btree.RangeIterator.next(RangeIterator.java:67)
	at org.eclipse.rdf4j.sail.nativerdf.TripleStore.commit(TripleStore.java:881)
	at org.eclipse.rdf4j.sail.nativerdf.NativeSailStore$NativeSailSink.flush(NativeSailStore.java:366)
	at org.eclipse.rdf4j.sail.base.SailSourceBranch.flush(SailSourceBranch.java:263)
	at org.eclipse.rdf4j.sail.base.SailSourceBranch.autoFlush(SailSourceBranch.java:345)
	at org.eclipse.rdf4j.sail.base.SailSourceBranch$1.close(SailSourceBranch.java:187)
	at org.eclipse.rdf4j.sail.base.SailSourceBranch.flush(SailSourceBranch.java:266)
	at org.eclipse.rdf4j.sail.base.UnionSailSource.flush(UnionSailSource.java:68)
	at org.eclipse.rdf4j.sail.base.SailSourceConnection.commitInternal(SailSourceConnection.java:469)
	at org.eclipse.rdf4j.sail.nativerdf.NativeStoreConnection.commitInternal(NativeStoreConnection.java:86)
	at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.commit(AbstractSailConnection.java:392)
	at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.commit(SailRepositoryConnection.java:216)
	at org.eclipse.rdf4j.sail.nativerdf.benchmark.OverflowBenchmarkSynthetic.loadLotsOfDataEmptyStore(OverflowBenchmarkSynthetic.java:86)
	at org.eclipse.rdf4j.sail.nativerdf.benchmark.generated.OverflowBenchmarkSynthetic_loadLotsOfDataEmptyStore_jmhTest.loadLotsOfDataEmptyStore_avgt_jmhStub(OverflowBenchmarkSynthetic_loadLotsOfDataEmptyStore_jmhTest.java:232)
	at org.eclipse.rdf4j.sail.nativerdf.benchmark.generated.OverflowBenchmarkSynthetic_loadLotsOfDataEmptyStore_jmhTest.loadLotsOfDataEmptyStore_AverageTime(OverflowBenchmarkSynthetic_loadLotsOfDataEmptyStore_jmhTest.java:173)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:453)
	at org.openjdk.jmh.runner.BenchmarkHandler$BenchmarkTask.call(BenchmarkHandler.java:437)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)


@abrokenjester
Contributor Author

Related: #2998

@hmottestad
Contributor

Not going to dig any further now. I've spent like 3 hours on this because IntelliJ was acting up and running the benchmarks takes a long time.

@abrokenjester
Contributor Author

I'm becoming more and more convinced that we should invest in replacing the current MemoryOverflowModel with a completely different implementation, something that just relies on standard Java Object Serialization or some simple abstraction like MapDB for disk syncing. I will try to spike something and see what we get.
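A very rough sketch of the plain-serialization direction, not a design proposal; it assumes the Statement implementations in use are Serializable, which I believe holds for the default rdf4j model classes:

import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.eclipse.rdf4j.model.Statement;

class SpillFileSketch implements Closeable {

	private final ObjectOutputStream out;

	SpillFileSketch(File file) throws IOException {
		// buffer writes so we don't pay one I/O call per statement
		this.out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
	}

	void spill(Statement st) throws IOException {
		out.writeObject(st);
	}

	@Override
	public void close() throws IOException {
		out.close();
	}
}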

@hmottestad
Contributor

Maybe for the time being just try to adjust the point where the model overflows to disk. Perhaps you can add some logging to show what it currently thinks is going on memory-wise, so you can see which numbers it's comparing. Also, maybe try without -Xms in the benchmark.

The real world benchmark is best to start with I think.

@hmottestad hmottestad force-pushed the GH-2799-memoryoverflowmodel-oom branch from 1e55523 to 1716d06 Compare April 23, 2021 14:21
@hmottestad hmottestad force-pushed the GH-2799-memoryoverflowmodel-oom branch from 1716d06 to 508e9ee Compare April 23, 2021 14:21
@hmottestad
Contributor

I did some logging and found out that sometimes the algorithm would decide that the max block size was something like 10 MB. This is when it would overflow. So I've just added a hard limit of 32MB of required available memory. This works fine for both my tests (benchmarks) and I feel it's a reasonable tradeoff. How many users give their NativeStore 64MB or 128MB and then load large files?
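In other words, something along these lines gets layered on top of the relative estimate. This is illustrative only; the constant name and exact condition are not a quote of the committed code:

static final long MIN_AVAILABLE_BYTES = 32 * 1024 * 1024; // hard 32 MB floor

static boolean shouldOverflow(long predictedBlockSize) {
	Runtime runtime = Runtime.getRuntime();
	long used = runtime.totalMemory() - runtime.freeMemory();
	long freeToAllocate = runtime.maxMemory() - used;
	// overflow when headroom drops below the absolute floor, even if the
	// predicted block size (e.g. the ~10 MB estimate seen in the logs) is smaller
	return freeToAllocate < Math.max(MIN_AVAILABLE_BYTES, predictedBlockSize);
}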

@abrokenjester
Contributor Author

I did some logging and found out that sometimes the algorithm would decide that the max block size was something like 10 MB. This is when it would overflow. So I've just added a hard limit of 32MB of required available memory. This works fine for both my tests (benchmarks) and I feel it's a reasonable tradeoff. How many users give their NativeStore 64MB or 128MB and then load large files?

This seems reasonable to me as well, nice workaround.


@abrokenjester abrokenjester left a comment


LGTM, happy to have this merged. I'll leave it to you.

@hmottestad hmottestad merged commit 62dacb6 into main Apr 24, 2021
@hmottestad hmottestad deleted the GH-2799-memoryoverflowmodel-oom branch April 24, 2021 09:10
Successfully merging this pull request may close these issues.

OOM in MemoryOverflowModel