-
Notifications
You must be signed in to change notification settings - Fork 103
NDBench at Netflix
NDBench has been widely used inside Netflix. In the following sections, we talk about some use cases in which NDBench has proven to be a useful tool.
A couple of months ago, we finished the Cassandra migration from version 2.0 to 2.1. Prior to starting the process, it was imperative for us to understand the performance gains that we would achieve, as well as the performance hit we would incur during the rolling upgrade of our Cassandra instances. The figures below illustrate the p99 and p95 read latency differences using NDBench. In the first, we highlight the differences between Cassandra 2.0 (blue line) vs 2.1 (brown line).
Last year, we also migrated all our Cassandra instances from the older Red Hat 5.10 OS to Ubuntu 14.04 (Trusty Tahr). We used NDBench to measure performance under the newer operating system. In the following figure, we showcase the three phases of the migration process by using NDBench’s long-running benchmark capability. We used rolling terminations of the Cassandra instances to update the AMIs with the new OS, and NDBench to verify that there would be no client-side impact during the migration. NDBench also allowed us to validate that the performance of the new OS was better after the migration.
NDBench is also part of our AMI certification process, which consists of integration tests and deployment validation. We designed pipelines in Spinnaker and integrated NDBench into them. The following figure shows the bakery-to-release lifecycle. We initially bake an AMI with Cassandra, create a Cassandra cluster, create an NDBench cluster, configure it, and run a performance test. We finally review the results and make the decision on whether to promote an “Experimental” AMI to a “Candidate”. We use similar pipelines for Dynomite, testing out the replication functionalities with different client-side APIs. Passing the NDBench performance tests means that the AMI is ready to be used in the production environment. Similar pipelines are used across the board for other data store systems at Netflix.
NDBench allows us to run infinite horizon tests to identify potential memory leaks from long running processes that we develop or use in-house. At the same time, in our integration tests we introduce failure conditions, change the underlying variables of our systems, introduce CPU intensive operations (like repair/reconciliation), and determine the optimal performance based on the application requirements. Finally, our sidecars such as Priam, Dynomite-manager and Raigad perform various activities, such as multi-threaded backups to object storage systems. We want to make sure, through integration tests, that the performance of our data store systems is not affected.