Error: Warning stuck buffer_pool buffer with IO workload #8521

Open
rkomandu opened this issue Nov 12, 2024 · 6 comments

@rkomandu
Collaborator

rkomandu commented Nov 12, 2024

Environment info

  • NooBaa Version: noobaa-20241104 (5.17.1) - standalone NooBaa
  • Platform: standalone deployment on GPFS CES protocol nodes

Actual behavior

  1. Ran the Warp IO workload of "get"

./warp get --insecure --duration 60m --host .com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log

Observed the following in the log (the system was concurrently running a long versioning test in another directory):

Nov 11 03:54:17 node-gui0 [3532906]: [nsfs/3532906] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
grep "Error: Warning stuck buffer_pool buffer" noobaa.log | grep "03:54" |wc -l
30

No errors were observed on the client node. I mention around 03:54 because the GPFS daemon had started back on one node (out of the two protocol nodes); with RR-DNS configured, the IO continued to run when the HA event happened earlier. So the message above is not related to HA (will attach the logs).


Warp analyze --> ignore the 10 "read tcp" errors, as these are expected while the HA is happening (i.e. GPFS is starting):

./warp analyze --analyze.op GET --analyze.v  warp-get-2024-11-11[030028]-5qrv.csv.zst --debug
7550 operations loaded... Done!

----------------------------------------
Operation: GET (5050). Ran 1h0m4s. Size: 256000000 bytes. Concurrency: 20.
Errors: 15
First Errors:
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59180->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59164->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59270->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59244->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59256->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59258->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59178->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59166->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59222->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59234->9.42.93.99:6443: read: connection reset by peer


Requests considered: 5014:
 * Avg: 14.302s, 50%: 17.271s, 90%: 21.121s, 99%: 23.583s, Fastest: 1.486s, Slowest: 1m54.959s, StdDev: 9.26s
 * TTFB: Avg: 141ms, Best: 11ms, 25th: 55ms, Median: 73ms, 75th: 93ms, 90th: 118ms, 99th: 216ms, Worst: 1m35.183s StdDev: 2.096s
 * First Access: Avg: 17.059s, 50%: 19.136s, 90%: 21.355s, 99%: 23.466s, Fastest: 1.486s, Slowest: 1m52.859s, StdDev: 8.407s
 * First Access TTFB: Avg: 152ms, Best: 21ms, 25th: 59ms, Median: 78ms, 75th: 101ms, 90th: 130ms, 99th: 335ms, Worst: 1m33.717s StdDev: 2.18s
 * Last Access: Avg: 10.997s, 50%: 7.742s, 90%: 20.381s, 99%: 22.633s, Fastest: 1.486s, Slowest: 1m52.228s, StdDev: 6.788s
 * Last Access TTFB: Avg: 91ms, Best: 16ms, 25th: 52ms, Median: 69ms, 75th: 86ms, 90th: 108ms, 99th: 192ms, Worst: 36.283s StdDev: 782ms

Throughput:
* Average: 340.77 MiB/s, 1.40 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 678.2MiB/s, 2.78 obj/s (15s, starting 03:57:01 EST)
 * 50% Median: 248.9MiB/s, 1.02 obj/s (15s, starting 03:26:16 EST)
 * Slowest: 43.6MiB/s, 0.18 obj/s (15s, starting 03:14:16 EST)

Default endpoint forks are configured on the system, with 2 CES S3 nodes, each assigned 1 CES IP.

Expected behavior

Are these errors, as posted above, expected?

CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer

Steps to reproduce

Run warp as shown above; it occurred on a system that was under what I would call a medium workload.

More information - Screenshots / Logs / Other output

Will update once the logs are uploaded: https://ibm.ent.box.com/folder/293508364523

@rkomandu rkomandu added the NS-FS label Nov 12, 2024
@romayalon
Contributor

@shirady can you please take a look?

@shirady shirady self-assigned this Nov 14, 2024
@shirady
Contributor

shirady commented Nov 14, 2024

Hi @rkomandu ,
Could you please reproduce and provide logs with a higher debug level?
I had issues with the GPFS machine and couldn't reproduce a whole test run.

Additional Details

At the moment I'm looking for the level-1 print from here:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

As you can see from the error stack:

[ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)

It comes on a GET request; in NSFS this is read_object_stream, after we tried get_buffer (getting a buffer from memory to execute the read operation).

We want to see the values of sem._waiting_value and the trend: does it keep rising? Does it increase and decrease during the test? etc.
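
For example, a minimal Node.js sketch (not part of noobaa-core; it assumes the log line spellings shown in this thread and a hypothetical log path) could pull the waiting_value trend out of noobaa.log:

// Hypothetical helper: scan a local noobaa.log for the BufferPool.get_buffer
// prints and report how waiting_value evolves over time.
'use strict';
const fs = require('fs');
const readline = require('readline');

async function waiting_value_trend(log_path) {
    const rl = readline.createInterface({
        input: fs.createReadStream(log_path),
        crlfDelay: Infinity,
    });
    // Match both "waiting_value 3" and "waiting_value: 3" spellings of the print.
    const re = /waiting_value:?\s+(\d+)/g;
    for await (const line of rl) {
        if (!line.includes('BufferPool.get_buffer')) continue;
        const values = [...line.matchAll(re)].map(m => Number(m[1]));
        // Syslog timestamp prefix ("Nov 11 03:54:17") plus the observed values.
        console.log(line.slice(0, 15), 'waiting_value =', values.join(', '));
    }
}

// Assumed log path; adjust to the actual noobaa.log location.
waiting_value_trend(process.argv[2] || '/var/log/noobaa.log').catch(console.error);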

We know that it started with 0:

[3532906]: [nsfs/3532906] [L0] core.sdk.namespace_fs:: NamespaceFS: buffers_pool [ BufferPool.get_buffer: sem value: 2097152 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 33554432 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 425721856 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 3825205248 waiting_value: 0 buffers length: 0 ]

The warning print itself comes from here:

if (this.warning_timeout) {
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        console.error(err.stack);
    }, this.warning_timeout);
    warning_timer.unref();
}

This warning might be printed after 2 minutes of waiting, as configured here:
config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;
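
In other words, the stack trace presumably only shows up when getting or holding a pool buffer takes longer than NSFS_BUF_POOL_WARNING_TIMEOUT and the timer is not cleared in time. A minimal standalone sketch of the same setTimeout/unref pattern (illustrative only, with an assumed clearTimeout on completion; this is not the actual buffer_utils.js code):

// Sketch of the warning-timer pattern. The Error is created up front so its
// stack points at the caller, and the timer is unref'd so it never keeps the
// process alive on its own.
'use strict';

const WARNING_TIMEOUT_MS = 2 * 60 * 1000; // mirrors config.NSFS_BUF_POOL_WARNING_TIMEOUT

function with_stuck_warning(work) {
    const err = new Error('Warning stuck buffer_pool buffer');
    const warning_timer = setTimeout(() => console.error(err.stack), WARNING_TIMEOUT_MS);
    warning_timer.unref();
    return Promise.resolve()
        .then(work)
        .finally(() => clearTimeout(warning_timer)); // no warning if work finishes in time
}

// Usage: any operation slower than 2 minutes would print the warning stack.
with_stuck_warning(async () => {
    // ... get a buffer from the pool, stream the object, etc.
});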

BTW, although it is a warning, I'm not sure why it is printed with console.error and not console.warn; I can try to suggest a code change.
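
For reference, the suggested change could be as small as swapping the logging call (a sketch against the snippet quoted above, not a tested patch):

if (this.warning_timeout) {
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        // Log as a warning rather than an error, since the buffer may still
        // be released later; the stack is kept so the caller is identifiable.
        console.warn(err.stack);
    }, this.warning_timeout);
    warning_timer.unref();
}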

@rkomandu
Collaborator Author

rkomandu commented Nov 15, 2024

@shirady

I had issues with the GPFS machine and couldn't reproduce a whole test run.

Let's see if we can get your cluster working for the run.
For now, I kickstarted a 90-minute run with the debug level set to all:

DEBUGLEVEL : all
ENABLEMD5 : true

Will update once the run completes.

Note: As a high-level thought, this bug might be recreated under load on the system. I am saying this because issue 8524 with the versioning PUT method didn't run into this error when it ran for 3 hours (ran on Wed).

@rkomandu
Collaborator Author

@shirady

Ran the 90-minute Warp get run and didn't run into the buffer_pool error:

./warp get --insecure --duration 90m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-get-8521-15nov 2>&1| tee /tmp/newbucket-warp-get-255M-8521-15nov.log

[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui0 log]# grep "stuck" noobaa.log

[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# grep "stuck" noobaa.log

@rkomandu
Collaborator Author

Please try to check from the code-flow perspective; as mentioned, it could also be related to the load on the system.

@shirady
Contributor

shirady commented Nov 20, 2024

Hi,
I will share that I ran warp twice and didn't reproduce it.
I didn't find the print "Error: Warning stuck buffer_pool buffer" (or any "stuck") in the logs.
I ran it with a high debug level and didn't find any place where the waiting_value is not 0, based on this output:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

@rkomandu, I'm planning to check other things; I will update you here about it.

Additional information:

I ran:
./warp get --host=<IP-address-of-1-node> --access-key=<> --secret-key=<> --obj.size=256M --duration=60m --bucket=<bucket-name> --objects 1500 --insecure --tls (after creating the bucket).
I had to set --objects due to space limits on the machine.

The outputs:

  1. Without changing the NSFS_CALCULATE_MD5
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 367.05 MiB/s, 1.50 obj/s

Throughput, split into 192 x 5s:
 * Fastest: 389.2MiB/s, 1.59 obj/s
 * 50% Median: 368.5MiB/s, 1.51 obj/s
 * Slowest: 325.8MiB/s, 1.33 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 849.39 MiB/s, 3.48 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 896.0MiB/s, 3.67 obj/s
 * 50% Median: 851.6MiB/s, 3.49 obj/s
 * Slowest: 774.0MiB/s, 3.17 obj/s

  2. Changing the NSFS_CALCULATE_MD5 to true
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 161.74 MiB/s, 0.66 obj/s

Throughput, split into 141 x 15s:
 * Fastest: 171.4MiB/s, 0.70 obj/s
 * 50% Median: 162.1MiB/s, 0.66 obj/s
 * Slowest: 148.0MiB/s, 0.61 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 821.38 MiB/s, 3.36 obj/s

Throughput, split into 238 x 15s:
 * Fastest: 853.9MiB/s, 3.50 obj/s
 * 50% Median: 823.2MiB/s, 3.37 obj/s
 * Slowest: 741.7MiB/s, 3.04 obj/s
