wrk2 produces incorrect results #138

berwynhoyt · 2023-10-30T20:56:44Z

As you can see in my write-up here, wrk2 can produce bad results under certain conditions. For example:

wrk2/wrk -d5 -c1000 -t250 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"
Running 5s test @ http://localhost:8085/multiply?a=2&b=3
  250 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.97s     1.28s    4.93s    60.75%
    Req/Sec       -nan      -nan   0.00      0.00%
  833071 requests in 1.25ms, 223.25MB read
  Socket errors: connect 0, read 0, write 0, timeout 29
Requests/sec: 663801593.63
Transfer/sec:    173.72GB

See how I specified 250 threads and a 5s test? wrk2 did issue 833071 requests over those 5s, but it reports an elapsed time of 1.25ms, which produces a ridiculous figure of 663 million requests/sec.
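To make the size of the error concrete, here is a minimal sketch of the arithmetic, using only the numbers from the output above. The function and variable names are illustrative and not taken from wrk2. Dividing by the printed 1.25ms gives roughly 666 million rather than the reported 663.8 million, presumably because the printed elapsed time is rounded.

```python
# Numbers copied from the wrk2 output above.
total_requests = 833_071       # "833071 requests in 1.25ms"
reported_elapsed_s = 1.25e-3   # elapsed time as printed by wrk2 (likely rounded)
requested_duration_s = 5.0     # the -d5 argument on the command line


def requests_per_sec(total: int, elapsed_s: float) -> float:
    """Throughput is simply total requests divided by elapsed seconds."""
    return total / elapsed_s


# Throughput implied by the bogus elapsed time.
bogus_rate = requests_per_sec(total_requests, reported_elapsed_s)

# Throughput if the elapsed time had matched the requested 5s.
plausible_rate = requests_per_sec(total_requests, requested_duration_s)

print(f"using reported 1.25ms: {bogus_rate:,.0f} requests/sec")
print(f"using requested 5s:    {plausible_rate:,.0f} requests/sec")
```

The second figure is what the summary line would show if the elapsed time were measured as the full 5s.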

It doesn't always think it's finished in milliseconds. Sometimes it is more like 1s and other times closer to 5.

You can check out my repository that uses wrk2 here if you want to reproduce the bug.


kchgoh · 2023-11-08T12:53:53Z

I'm also evaluating which load tool to use, so I'm glad to have come across your write-up and finding. Just wondering about 2 points:

1. Is there any reason to use a high number of threads? Unless I'm misunderstanding something, the load on the server is determined by the total number of connections (1000 in the example) rather than by the number of threads, and I think ideally the number of threads should not exceed the number of CPU cores by too much, to minimise context switching. I would be interested to know how it behaves if you use, say, 4 threads while keeping 1000 connections.
2. Is there any reason to run just a short 5s test? I don't know if it's relevant to this, but the Readme mentions:

It's important to note that wrk2 extends the initial calibration period to 10 seconds (from wrk's 0.5 second), so runs shorter than 10-20 seconds may not present useful information

I checked both wrk2 and wrk's documentation and couldn't seem to find what the calibration is for though.

berwynhoyt · 2023-11-08T21:53:02Z

Good questions.

Re (1), there is no reason to use so many connections except that the bogus results became most apparent when I did. Note that I found the most reliable results when I set #threads == #connections. My own project found that between 10 and 40 produced the maximum number of requests.

Re (2), I did not try that same test with a 10s period. I will do so now, on your prompting:

wrk2/wrk -d10 -c10 -t10 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"

I get much more reasonable results, though they still range between 200,000 and 500,000 requests, which is 2 to 4 times what I get with any other tool, so I think they're still not correct.

berwynhoyt · 2023-11-08T21:56:55Z

In that last 10-second test, the problem still seems to be the time it thinks it took to finish, which ranges from 3 to 10s (when it actually took 10s).

kchgoh · 2023-12-14T15:10:27Z

Not sure if this topic is still of interest... I recently had some time to read the source code, and I think the following might be an explanation:

There is a "calibration" period every time a test is started. The period is hard-coded as 10 seconds + number of connections * 5 ms (uint64_t calibrate_delay = CALIBRATE_DELAY_MS + (thread->connections * 5);). The 90th percentile latency observed during that period is used to determine the sampling interval used to collect data for the summary stats (long double interval = MAX(latency * 2, 10)).
Once the calibration period ends and the sampling interval has been determined, wrk2 clears the latency values collected during the calibration period and starts collecting from scratch.

So in the case of using 1000 connections, the test duration should ideally be > 15 sec, otherwise it would still be in the middle of the calibration period when the test ends. I found it's more reliable to test with a duration of 60s.
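The two expressions quoted above can be sketched as follows. This is not wrk2 code: the function names are made up for illustration, CALIBRATE_DELAY_MS is assumed to be 10000 based on the "10 seconds" described above, and the latency is assumed to be in milliseconds. Whether the connection count is per thread or the total is left to the caller.

```python
CALIBRATE_DELAY_MS = 10_000  # assumption: the "10 seconds" base described above


def calibrate_delay_ms(connections: int) -> int:
    """Mirror of: CALIBRATE_DELAY_MS + (thread->connections * 5)."""
    return CALIBRATE_DELAY_MS + connections * 5


def sampling_interval_ms(p90_latency_ms: float) -> float:
    """Mirror of: MAX(latency * 2, 10), with latency assumed to be in ms."""
    return max(p90_latency_ms * 2, 10)


def outlasts_calibration(duration_s: float, connections: int) -> bool:
    """True if the test runs longer than the calibration period."""
    return duration_s * 1000 > calibrate_delay_ms(connections)


for duration_s, connections in [(5, 1000), (10, 10), (60, 1000)]:
    delay_s = calibrate_delay_ms(connections) / 1000
    ok = outlasts_calibration(duration_s, connections)
    print(f"-d{duration_s} -c{connections}: calibration {delay_s}s, outlasts it: {ok}")
```

Under these assumptions, both the 5s run with 1000 connections and the 10s run with 10 connections end before calibration completes, while a 60s run leaves plenty of time for measurement afterwards.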

Searching wrk's issue discussions, it seems wrk used to have this calibration period too, but it was removed around 2018 (wg/wrk#280 (comment)). If I find some time I'll try to remove it in my local build and see if that allows running with short test durations.

berwynhoyt · 2023-12-14T23:08:52Z

If you are able to improve this, that would be FAB!
