
wrk2 produces incorrect results #138

Open
berwynhoyt opened this issue Oct 30, 2023 · 5 comments

Comments

berwynhoyt commented Oct 30, 2023

As you can see in my write-up here, wrk2 can produce bad results under certain conditions. For example:

wrk2/wrk -d5 -c1000 -t250 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"
Running 5s test @ http://localhost:8085/multiply?a=2&b=3
  250 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.97s     1.28s    4.93s    60.75%
    Req/Sec       -nan      -nan   0.00      0.00%
  833071 requests in 1.25ms, 223.25MB read
  Socket errors: connect 0, read 0, write 0, timeout 29
Requests/sec: 663801593.63
Transfer/sec:    173.72GB

See how I specified 250 threads and a 5s test? It did make 833071 requests in 5s, but as you can see, it thinks it did so in 1.25ms, producing a ridiculous figure of 663 million requests/sec.

It doesn't always think it finished in milliseconds: sometimes the reported duration is more like 1s, and other times closer to 5s.

You can check out my repository that uses wrk2 here if you want to reproduce the bug.

kchgoh commented Nov 8, 2023

I'm also evaluating which load tool to use, so I'm glad to have come across your write-up and finding. Just wondering about 2 points:

  1. Is there any reason to use such a high number of threads? Unless I'm misunderstanding something, the load on the server is determined by the total number of connections (1000 in the example), not by the number of threads; and ideally the number of threads should not exceed the number of CPU cores by too much, to minimise context switching. I'd be interested to know how it behaves if you use, say, 4 threads while keeping 1000 connections.
  2. Is there any reason to run just a short 5s test? I don't know if it's relevant to this, but in the Readme it mentions:

It's important to note that wrk2 extends the initial calibration period to 10 seconds (from wrk's 0.5 second), so runs shorter than 10-20 seconds may not present useful information

I checked both wrk2 and wrk's documentation and couldn't seem to find what the calibration is for though.

berwynhoyt (Author) commented

Good questions.

Re (1), there is no reason to use so many connections except that the bogus results became most apparent when I did. Note that I found the most reliable results when I set #threads == #connections. My own project found that between 10 and 40 threads/connections produced the maximum request rate.

Re (2), I did not try that same test with a 10s period. I will do so now, on your prompting:

wrk2/wrk -d10 -c10 -t10 -R 10000000 "http://localhost:8085/multiply?a=2&b=3"

I get much more reasonable results, though they still range between 200,000 and 500,000 requests, which is 2 to 4 times what I get with any other tool, so I think they're still not correct.

berwynhoyt (Author) commented

In that last 10-second test, the problem still seems to be the time it thinks the test took, which ranges from 3 to 10s (when it actually took 10s).

kchgoh commented Dec 14, 2023

Not sure if this topic is still of interest... I recently had some time to read the source code, and I think the following might be an explanation:

  • There is a "calibration" period every time a test is started. Its length is hard-coded as 10 seconds plus 5 milliseconds per connection (uint64_t calibrate_delay = CALIBRATE_DELAY_MS + (thread->connections * 5);). The 90th-percentile latency observed during that period determines the sampling interval used to collect data for the summary stats (long double interval = MAX(latency * 2, 10)).
  • Once the calibration period ends and the sampling interval is determined, wrk2 clears the latency values collected during calibration and starts collecting from scratch.

So in the case of using 1000 connections, the test duration should ideally be > 15 sec; otherwise the run would end while still in the middle of the calibration period. I found it more reliable to test with a duration of 60s.

Searching wrk's issue discussion, it seems wrk used to have this calibration period too, but it was removed around 2018 (wg/wrk#280 (comment)). If I find some time I'll try removing it in my local build and see whether that allows running with short test durations.

@berwynhoyt
Copy link
Author

If you are able to improve this, that would be FAB!
