Under moderate load, JVM eventually spins indefinitely when this shim is in use #7

Open
alienth opened this issue Mar 13, 2017 · 16 comments

Comments

alienth commented Mar 13, 2017

After writing a few hundred thousand datapoints, the JVM inevitably eats up all CPU and sits there not doing much. Even if you stop writing datapoints to the service, the process will sit there indefinitely eating up CPU. The JVM becomes unresponsive to sleuthing via JMX or jvmtop. I tried both OpenJDK and the Oracle JRE, both 1.8, and got the same behaviour. Very reproducible: jam hundreds of thousands of datapoints at it for roughly 10 minutes.

When the proc enters this state, no communication is happening with Cassandra.

All of the metrics and tagk/tagvs I'm sending already have IDs, so I don't believe it is an issue with the shim's pseudo row lock method to acquire those.

I was able to get two thread dumps while a proc was in this state. These dumps are on the same PID, separated by a few minutes.

threads5.txt
threads6.txt

Some type of soft resource deadlock, perhaps?
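
For reference, the kind of reproduction described above can be scripted against OpenTSDB's telnet put interface. A minimal sketch of such a load generator (the metric/tag names here are made up, and the default port 4242 is assumed):

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/**
 * Minimal load generator: jams datapoints at OpenTSDB's telnet "put"
 * interface. Metric/tag names and the host/port are placeholders.
 */
public class PutFlood {
    public static void main(String[] args) throws Exception {
        try (Socket sock = new Socket("localhost", 4242);
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(sock.getOutputStream(), StandardCharsets.UTF_8))) {
            long now = System.currentTimeMillis() / 1000;
            for (int i = 0; i < 500_000; i++) {
                // One datapoint per line: put <metric> <timestamp> <value> <tag=value>
                out.write("put test.metric " + (now + i) + " " + (i % 100)
                        + " host=web" + (i % 50) + "\n");
                if (i % 10_000 == 0) {
                    out.flush();
                }
            }
            out.flush();
        }
    }
}
```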

alienth changed the title from "Under moderate load, JVM eventually stops responding when this shim is in use" to "Under moderate load, JVM eventually spins indefinitely when this shim is in use" on Mar 13, 2017
alienth commented Mar 13, 2017

Mm, looks like a bunch of the threads are waiting after getOrCreateId. Since all of my data already has IDs, perhaps this indicates there is some issue with fetching IDs? :/ The table seems fine, though. And as I mentioned, while the proc is spinning there is no communication with Cassandra happening.

alienth commented Mar 14, 2017

After leaving the JVM like this for a few hours, the heap eventually grows until it starts hitting OOMs. I got a heap dump when it happened, so I'll see what I can find.

alienth commented Mar 14, 2017

Hmmmm
[image attachment]

alienth commented Mar 14, 2017

All of those nodes contain the following:

[image attachment]

alienth commented Mar 14, 2017

[image attachment]

alienth commented Mar 14, 2017

Reproduced the behaviour with auto_create_metrics set to false. Now all of the threads that are WAITING contain a getId frame, kinda as expected.

alienth commented Mar 14, 2017

threads.txt

Second dump, same proc, a few minutes later:

threads.txt

alienth commented Mar 14, 2017

Thread dump with async UI fetching code enabled:

threads.txt

alienth commented Mar 14, 2017

Did a tcpdump of the chatter with Cassandra up to the failure. This is a bit curious:

[image attachment]

alienth commented Mar 15, 2017

New thread dump on Astyanax 3.9.0, with pool connections dropped to 2 per host.
threads.txt

alienth commented Mar 15, 2017

Here's something curious I've found in MAT: a bunch of the threads are parked waiting on a single lock. That lock is held by a ThreadPoolExecutor owned by AstyanaxConfigurationImpl. The weird thing is that the executor has a completedTaskCount of 0.

alienth commented Mar 15, 2017

Confirmed that the ThreadPoolExecutor which is shared by both the AstyanaxConfigurationImpl and the ThriftKeyspaceImpl is at maximum capacity, and has never completed a single task (in the heap dump I have).
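
To illustrate the suspected failure mode in isolation (a toy sketch, not the actual Astyanax or shim code path): if every worker in a small shared executor blocks waiting on follow-up work submitted to the same executor, the pool saturates and nothing ever completes, which would match a completedTaskCount of 0:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Toy illustration of thread-pool starvation deadlock: a small shared
 * executor whose workers all block on follow-up tasks submitted to the
 * same pool. Not Astyanax code, just the general pattern.
 */
public class PoolStarvation {
    public static void main(String[] args) {
        // Small fixed pool standing in for the shared executor.
        ExecutorService pool = Executors.newFixedThreadPool(5);

        for (int i = 0; i < 5; i++) {
            pool.submit(() -> {
                // Every worker submits follow-up work to the same pool...
                Future<String> inner = pool.submit(() -> "result");
                // ...and blocks waiting for it. Once all workers are stuck
                // here, the queued inner tasks can never run and the pool's
                // completedTaskCount stays at 0.
                return inner.get();
            });
        }
    }
}
```

Whether this is exactly what is happening inside Astyanax here is unconfirmed; the sketch just shows how a saturated shared pool can wedge with zero completed tasks.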

alienth commented Mar 15, 2017

Cassandra yaml in use: https://gist.github.com/alienth/34d448e7525b5c2ded3471f7769c6841
Table definitions: https://gist.github.com/alienth/3e6493598e0c37824417b4ffd53843ef

(Note that the WITH COMPACT STORAGE line is what makes them visible to Thrift clients).

Keyspace stats:

Keyspace : tsdbuid
        Read Count: 3078259
        Read Latency: 0.0274613627378333 ms.
        Write Count: 118
        Write Latency: 0.02597457627118644 ms.

Keyspace : tsdb
        Read Count: 9
        Read Latency: 0.2872222222222222 ms.
        Write Count: 653899
        Write Latency: 0.01409285837721116 ms.

Details of the setup: 2-node cluster, RF of 1. Dell servers with 32 GB of RAM and 16-core E5640 CPUs. Disks are Dell SAS drives.

alienth commented Mar 15, 2017

threads.txt

alienth commented Mar 15, 2017

Solved this by customizing the Astyanax thread pool to have 64 workers, and bumping the connection pool's max connections per host to 20.
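
A rough sketch of what that tuning could look like. The 64 workers and the 20 max connections per host come from the comment above; the setAsyncExecutor(...) call and the "tsdb-pool" name are assumptions about the Astyanax version in use, not confirmed API:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;

/**
 * Sketch of the workaround: a bigger async executor and a higher
 * per-host connection limit.
 */
public class PoolTuning {
    public static void main(String[] args) {
        // 64 workers instead of Astyanax's small default async executor.
        ExecutorService astyanaxWorkers = Executors.newFixedThreadPool(64);

        AstyanaxConfigurationImpl config = new AstyanaxConfigurationImpl();
        // NOTE: setAsyncExecutor(...) is assumed to exist on this Astyanax
        // version; if it doesn't, the executor has to be wired in wherever
        // the shim builds its AstyanaxContext.
        config.setAsyncExecutor(astyanaxWorkers);

        // Raise max connections per host to 20 ("tsdb-pool" is a placeholder name).
        ConnectionPoolConfigurationImpl poolConfig =
                new ConnectionPoolConfigurationImpl("tsdb-pool")
                        .setMaxConnsPerHost(20);

        // config and poolConfig would then be passed to the AstyanaxContext builder.
    }
}
```

The actual wiring lives wherever the shim builds its AstyanaxContext, so adjust accordingly.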

alienth commented Mar 16, 2017

Addressed in #9.
