From 83088e55f4d0f584f1fcad3d2852d6faf1dcbef9 Mon Sep 17 00:00:00 2001
From: "albert.autin"
Date: Wed, 16 Dec 2020 13:32:54 -0700
Subject: [PATCH 1/4] Disseminate tribal knowledge into README regarding partitions and break times.

---
 README.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/README.md b/README.md
index b0d7210..f8a51d3 100644
--- a/README.md
+++ b/README.md
@@ -272,6 +272,9 @@ The first time you test a device(s), you must prepare the device(s) by first
 cleaning them (writing zeros everywhere) and then "salting" them (writing
 random data everywhere) with act_prep.
 
+##### Partitions
+In some cases, and in particular if you plan to use them while running Aerospike, partitions can increase performance. Some systems perform better with more partitions, possibly due to more threads, so you may want to experiment with adding partitions. As a general rule of thumb, you do not want more partitions than the total number of cores - starting at 4 per drive and pulling that back if drives*partitions is too much; say for a 96 core machine with 8 SSD volumes, you do not want more than 12 partitions. If you use partitions, you can run act_prep on all partitions in parallel or run act_prep before partitioning.
+
 act_prep takes a device name as its only command-line parameter. For a typical
 240GB SSD, act_prep takes 30-60+ minutes to run. The time varies depending on
 the device and the capacity.
@@ -414,6 +417,15 @@ When doing stress testing at a level ABOVE where the device is certified, a
 device passes the test if ACT runs to completion, regardless of the number of
 errors.
 
+# Break times, testing anomalies
+A drive that has been tested at a much higher rate than it can pass may need a 'break'. For example, if you try to run a 100x test against a drive that only supports 20x - the drive may only pass at 10x. There appears to be some recovery time, perhaps due to the internal garbage collection of the hardware, where the drive has to recover. For this reason you can get more reliable results by testing low and ramping up until failure, but if you experience an ACT failure and wish to lower the test volume be aware that you may need to give the drive a 'break' before performing another test. The time in which a drive takes to recover is dependent on manufacturer and model and can vary by many hours and is not well understood by the ACT community currently.
+To illustrate this behavior, a sequence of tests may go like this:
+100x[PASS] -> 150x[PASS] -> 300x[FAIL] -> 150x[FAIL]. A 150x test could fail in this condition because of this 'break' period needed after a drive is pushed too far.
+Instead, you may have to do this:
+100x[PASS] -> 150x[PASS] -> 300x[FAIL] -> (wait: 8h?) -> 150x[PASS] -> 160x[PASS] ... and so on.
+
+Again, the wait period is not well known and likely varies quite a lot.
+
 ## ACT Configuration Reference
 ------------------------------
 

From 45fddf3035cef0b3e454db405797e6a259a42acc Mon Sep 17 00:00:00 2001
From: "albert.autin"
Date: Wed, 16 Dec 2020 13:34:03 -0700
Subject: [PATCH 2/4] Add a parted script to make people mad.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index f8a51d3..6dfd3e1 100644
--- a/README.md
+++ b/README.md
@@ -273,7 +273,7 @@ cleaning them (writing zeros everywhere) and then "salting" them (writing
 random data everywhere) with act_prep.
 
 ##### Partitions
-In some cases, and in particular if you plan to use them while running Aerospike, partitions can increase performance. Some systems perform better with more partitions, possibly due to more threads, so you may want to experiment with adding partitions. As a general rule of thumb, you do not want more partitions than the total number of cores - starting at 4 per drive and pulling that back if drives*partitions is too much; say for a 96 core machine with 8 SSD volumes, you do not want more than 12 partitions. If you use partitions, you can run act_prep on all partitions in parallel or run act_prep before partitioning.
+In some cases, and in particular if you plan to use them while running Aerospike, partitions can increase performance. Some systems perform better with more partitions, possibly due to more threads, so you may want to experiment with adding partitions. As a general rule of thumb, you do not want more partitions than the total number of cores - starting at 4 per drive and pulling that back if drives*partitions is too much; say for a 96 core machine with 8 SSD volumes, you do not want more than 12 partitions. If you use partitions, you can run act_prep on all partitions in parallel or run act_prep before partitioning. If using partition, be aware of boundaries such that a partition does not span multiple physical sectors. Example partition script: `parted --script /dev/mydrive mklabel gpt mkpart primary 0% 25% mkpart primary 25% 50% mkpart primary 50% 75% mkpart primary 75% 100%`.
 
 act_prep takes a device name as its only command-line parameter. For a typical
 240GB SSD, act_prep takes 30-60+ minutes to run. The time varies depending on

From 83df064ba411a7fea2f81f15a9cdba9c9ab604c6 Mon Sep 17 00:00:00 2001
From: Alb0t <40698370+Alb0t@users.noreply.github.com>
Date: Wed, 16 Dec 2020 16:32:26 -0700
Subject: [PATCH 3/4] Update README.md

Fix some wording, add a blurb about irqbalance.
---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6dfd3e1..3cbcd8b 100644
--- a/README.md
+++ b/README.md
@@ -273,7 +273,7 @@ cleaning them (writing zeros everywhere) and then "salting" them (writing
 random data everywhere) with act_prep.
 
 ##### Partitions
-In some cases, and in particular if you plan to use them while running Aerospike, partitions can increase performance. Some systems perform better with more partitions, possibly due to more threads, so you may want to experiment with adding partitions. As a general rule of thumb, you do not want more partitions than the total number of cores - starting at 4 per drive and pulling that back if drives*partitions is too much; say for a 96 core machine with 8 SSD volumes, you do not want more than 12 partitions. If you use partitions, you can run act_prep on all partitions in parallel or run act_prep before partitioning. If using partition, be aware of boundaries such that a partition does not span multiple physical sectors. Example partition script: `parted --script /dev/mydrive mklabel gpt mkpart primary 0% 25% mkpart primary 25% 50% mkpart primary 50% 75% mkpart primary 75% 100%`.
+In some cases, and in particular if you plan to use them while running Aerospike, you may want to test with ACT using partitions. Some systems perform better with more partitions, possibly due to more parallelization in some layer, so you may want to experiment with adding partitions. As a general rule of thumb, you do not want more partitions than the total number of cores - you could start at 4 per drive but pull that back if drives*partitions is too much; say for a 96 core machine with 8 SSD volumes, you do not want more than 12 partitions (8*12 = 96, so more partitions wouldn't be good.). If you use partitions, you can run act_prep on all partitions in parallel or run act_prep before partitioning. If using partitions, be aware of boundaries such that a partition does not span multiple physical sectors. Example partition script: `parted --script /dev/mydrive mklabel gpt mkpart primary 0% 25% mkpart primary 25% 50% mkpart primary 50% 75% mkpart primary 75% 100%`.
 
 act_prep takes a device name as its only command-line parameter. For a typical
 240GB SSD, act_prep takes 30-60+ minutes to run. The time varies depending on
@@ -426,6 +426,9 @@ Instead, you may have to do this:
 
 Again, the wait period is not well known and likely varies quite a lot.
 
+##### IRQBalance
+If your system has IRQBalance disabled, you can try to enable it. In some cases this had led to increased performance.
+
 ## ACT Configuration Reference
 ------------------------------
 

From fb8930000639631f57ba9823c508c89fbd4c4a04 Mon Sep 17 00:00:00 2001
From: "albert.autin"
Date: Wed, 7 Jul 2021 15:32:35 -0600
Subject: [PATCH 4/4] clarify

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 3cbcd8b..88e17f2 100644
--- a/README.md
+++ b/README.md
@@ -418,7 +418,7 @@ device passes the test if ACT runs to completion, regardless of the number of
 errors.
 
 # Break times, testing anomalies
-A drive that has been tested at a much higher rate than it can pass may need a 'break'. For example, if you try to run a 100x test against a drive that only supports 20x - the drive may only pass at 10x. There appears to be some recovery time, perhaps due to the internal garbage collection of the hardware, where the drive has to recover. For this reason you can get more reliable results by testing low and ramping up until failure, but if you experience an ACT failure and wish to lower the test volume be aware that you may need to give the drive a 'break' before performing another test. The time in which a drive takes to recover is dependent on manufacturer and model and can vary by many hours and is not well understood by the ACT community currently.
+A drive that has been tested at a much higher rate than it can pass may need a 'break'. For example, if you try to run a 100x test against a drive that only supports 20x - the drive may only pass at 10x until the drive has some time to catch up. There appears to be some recovery time, perhaps due to the internal garbage collection of the hardware, where the drive has to recover. For this reason you can get more reliable results by testing low and ramping up until failure, but if you experience an ACT failure and wish to lower the test volume be aware that you may need to give the drive a 'break' before performing another test. The time in which a drive takes to recover is dependent on manufacturer and model and can vary by many hours.
 To illustrate this behavior, a sequence of tests may go like this:
 100x[PASS] -> 150x[PASS] -> 300x[FAIL] -> 150x[FAIL]. A 150x test could fail in this condition because of this 'break' period needed after a drive is pushed too far.
 Instead, you may have to do this: