Error: Warning stuck buffer_pool buffer with IO workload #8521

Open
rkomandu opened this issue Nov 12, 2024 · 6 comments

@rkomandu
Collaborator

rkomandu commented Nov 12, 2024

Environment info

  • NooBaa Version: noobaa-20241104 (5.17.1) - standalone NooBaa
  • Platform: standalone deployment on GPFS CES protocol nodes

Actual behavior

  1. Ran the Warp IO workload of "get"

./warp get --insecure --duration 60m --host .com:6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket warp-get-bucket-reg 2>&1| tee /tmp/warp-get-11nov2024.log

Observed the following in the log (the system was concurrently running a long versioning test in another directory):

Nov 11 03:54:17 node-gui0 [3532906]: [nsfs/3532906] [ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)
grep "Error: Warning stuck buffer_pool buffer" noobaa.log | grep "03:54" |wc -l
30

No errors were observed on the client node. I mention around 03:54 because the GPFS daemon had started back on one node (out of the two protocol nodes); with RR-DNS configured, the IO continued to run when the HA event happened earlier. So the message above is not related to HA (will attach the logs).


Warp analyze --> ignore the 10 "read tcp" errors, as these are expected while the HA is happening (i.e. GPFS is starting):

./warp analyze --analyze.op GET --analyze.v  warp-get-2024-11-11[030028]-5qrv.csv.zst --debug
7550 operations loaded... Done!

----------------------------------------
Operation: GET (5050). Ran 1h0m4s. Size: 256000000 bytes. Concurrency: 20.
Errors: 15
First Errors:
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59180->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59164->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59270->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59244->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59256->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59258->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59178->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59166->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59222->9.42.93.99:6443: read: connection reset by peer
 * https://<cesip>:6443/, 2024-11-11 03:52:28 -0500 EST: read tcp 9.42.124.113:59234->9.42.93.99:6443: read: connection reset by peer


Requests considered: 5014:
 * Avg: 14.302s, 50%: 17.271s, 90%: 21.121s, 99%: 23.583s, Fastest: 1.486s, Slowest: 1m54.959s, StdDev: 9.26s
 * TTFB: Avg: 141ms, Best: 11ms, 25th: 55ms, Median: 73ms, 75th: 93ms, 90th: 118ms, 99th: 216ms, Worst: 1m35.183s StdDev: 2.096s
 * First Access: Avg: 17.059s, 50%: 19.136s, 90%: 21.355s, 99%: 23.466s, Fastest: 1.486s, Slowest: 1m52.859s, StdDev: 8.407s
 * First Access TTFB: Avg: 152ms, Best: 21ms, 25th: 59ms, Median: 78ms, 75th: 101ms, 90th: 130ms, 99th: 335ms, Worst: 1m33.717s StdDev: 2.18s
 * Last Access: Avg: 10.997s, 50%: 7.742s, 90%: 20.381s, 99%: 22.633s, Fastest: 1.486s, Slowest: 1m52.228s, StdDev: 6.788s
 * Last Access TTFB: Avg: 91ms, Best: 16ms, 25th: 52ms, Median: 69ms, 75th: 86ms, 90th: 108ms, 99th: 192ms, Worst: 36.283s StdDev: 782ms

Throughput:
* Average: 340.77 MiB/s, 1.40 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 678.2MiB/s, 2.78 obj/s (15s, starting 03:57:01 EST)
 * 50% Median: 248.9MiB/s, 1.02 obj/s (15s, starting 03:26:16 EST)
 * Slowest: 43.6MiB/s, 0.18 obj/s (15s, starting 03:14:16 EST)

Default endpoint forks are configured on the system, with 2 CES S3 nodes, each assigned 1 CES IP.

Expected behavior

Are these errors, as posted above, expected?

CONSOLE:: Error: Warning stuck buffer_pool buffer at BuffersPool.get_buffer

Steps to reproduce

Run warp as shown above; it occurred on a system that was under what I would call a medium workload.

More information - Screenshots / Logs / Other output

Will update once the logs are uploaded: https://ibm.ent.box.com/folder/293508364523

@rkomandu rkomandu added the NS-FS label Nov 12, 2024
@romayalon
Contributor

@shirady can you please take a look?

@shirady shirady self-assigned this Nov 14, 2024
@shirady
Contributor

shirady commented Nov 14, 2024

Hi @rkomandu ,
Could you please reproduce and provide logs with a higher debug level?
I had issues with the GPFS machine and couldn't reproduce a whole test run.

Additional Details

At the moment I'm looking for the level-1 print from here:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

As you can see from the error stack:

[ERROR] CONSOLE:: Error: Warning stuck buffer_pool buffer
    at BuffersPool.get_buffer (/usr/local/noobaa-core/src/util/buffer_utils.js:218:25)
    at async NamespaceFS.read_object_stream (/usr/local/noobaa-core/src/sdk/namespace_fs.js:1080:46)
    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
    at async Object.get_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_get_object.js:116:25)
    at async handle_request (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:161:19)
    at async Object.s3_rest [as handler] (/usr/local/noobaa-core/src/endpoint/s3/s3_rest.js:66:9)

It comes on a GET request; in NSFS this is read_object_stream, after we tried get_buffer (getting a buffer from memory to execute the read operation).

We want to see the values of sem._waiting_value and the trend: does it keep rising? Does it increase and decrease during the test? etc.
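
For example, a minimal Node.js sketch (not part of noobaa-core; it assumes the log line spellings shown in this thread and a hypothetical log path) could pull the waiting_value trend out of noobaa.log:

// Hypothetical helper: scan a local noobaa.log for the BufferPool.get_buffer
// prints and report how waiting_value evolves over time.
'use strict';
const fs = require('fs');
const readline = require('readline');

async function waiting_value_trend(log_path) {
    const rl = readline.createInterface({
        input: fs.createReadStream(log_path),
        crlfDelay: Infinity,
    });
    // Match both "waiting_value 3" and "waiting_value: 3" spellings of the print.
    const re = /waiting_value:?\s+(\d+)/g;
    for await (const line of rl) {
        if (!line.includes('BufferPool.get_buffer')) continue;
        const values = [...line.matchAll(re)].map(m => Number(m[1]));
        // Syslog timestamp prefix ("Nov 11 03:54:17") plus the observed values.
        console.log(line.slice(0, 15), 'waiting_value =', values.join(', '));
    }
}

// Assumed log path; adjust to the actual noobaa.log location.
waiting_value_trend(process.argv[2] || '/var/log/noobaa.log').catch(console.error);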

We know that it started with 0:

[3532906]: [nsfs/3532906] [L0] core.sdk.namespace_fs:: NamespaceFS: buffers_pool [ BufferPool.get_buffer: sem value: 2097152 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 33554432 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 425721856 waiting_value: 0 buffers length: 0, BufferPool.get_buffer: sem value: 3825205248 waiting_value: 0 buffers length: 0 ]

The warning print itself comes from here:

if (this.warning_timeout) {
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        console.error(err.stack);
    }, this.warning_timeout);
    warning_timer.unref();
}

This warning might be printed after 2 minutes of waiting, as configured here:
config.NSFS_BUF_POOL_WARNING_TIMEOUT = 2 * 60 * 1000;
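
In other words, the stack trace presumably only shows up when getting or holding a pool buffer takes longer than NSFS_BUF_POOL_WARNING_TIMEOUT and the timer is not cleared in time. A minimal standalone sketch of the same setTimeout/unref pattern (illustrative only, with an assumed clearTimeout on completion; this is not the actual buffer_utils.js code):

// Sketch of the warning-timer pattern. The Error is created up front so its
// stack points at the caller, and the timer is unref'd so it never keeps the
// process alive on its own.
'use strict';

const WARNING_TIMEOUT_MS = 2 * 60 * 1000; // mirrors config.NSFS_BUF_POOL_WARNING_TIMEOUT

function with_stuck_warning(work) {
    const err = new Error('Warning stuck buffer_pool buffer');
    const warning_timer = setTimeout(() => console.error(err.stack), WARNING_TIMEOUT_MS);
    warning_timer.unref();
    return Promise.resolve()
        .then(work)
        .finally(() => clearTimeout(warning_timer)); // no warning if work finishes in time
}

// Usage: any operation slower than 2 minutes would print the warning stack.
with_stuck_warning(async () => {
    // ... get a buffer from the pool, stream the object, etc.
});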

BTW, although it is a warning, I'm not sure why it is printed with console.error and not console.warn; I can try to suggest a code change.
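
For reference, the suggested change could be as small as swapping the logging call (a sketch against the snippet quoted above, not a tested patch):

if (this.warning_timeout) {
    const err = new Error('Warning stuck buffer_pool buffer');
    warning_timer = setTimeout(() => {
        // Log as a warning rather than an error, since the buffer may still
        // be released later; the stack is kept so the caller is identifiable.
        console.warn(err.stack);
    }, this.warning_timeout);
    warning_timer.unref();
}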

@rkomandu
Collaborator Author

rkomandu commented Nov 15, 2024

@shirady

I had issues with the GPFS machine and couldn't reproduce a whole test run.

Let's see if we can get your cluster working for the run.
For now, I kickstarted a 90-minute run with the debug level set to all:

DEBUGLEVEL : all
ENABLEMD5 : true

Will update once the run completes.

Note: As a high-level thought, this bug might be recreated under load on the system. I am saying this because issue 8524 with the versioning PUT method didn't run into this error when it ran for 3 hours (ran on Wed).

@rkomandu
Collaborator Author

@shirady

Ran the 90-minute Warp get run and didn't run into the buffer_pool error:

./warp get --insecure --duration 90m --host :6443 --access-key KCxP4AN9937kVqoCrNIs --secret-key bIdwF/5nJtSnrHWXrhPOhkv1WqGjtayMk6D+aU/U --tls --obj.size 256M --bucket newbucket-warp-get-8521-15nov 2>&1| tee /tmp/newbucket-warp-get-255M-8521-15nov.log

[root@gui0 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui0 log]# grep "stuck" noobaa.log

[root@gui1 log]# zgrep "stuck" noobaa.log-20241115.gz
[root@gui1 log]# grep "stuck" noobaa.log

@rkomandu
Collaborator Author

Please try to check from the code-flow perspective; as mentioned, it could also be related to the load on the system.

@shirady
Contributor

shirady commented Nov 20, 2024

Hi,
I will share that I ran warp twice and didn't reproduce it.
I didn't find the print "Error: Warning stuck buffer_pool buffer" (or any "stuck") in the logs.
I ran it with a high debug level and didn't find any place where the waiting_value is not 0, based on this output:

dbg.log1('BufferPool.get_buffer: sem value', this.sem._value, 'waiting_value', this.sem._waiting_value, 'buffers length', this.buffers.length);

@rkomandu, I'm planning to check other things; I will update you here about it.

Additional information:

I ran:
./warp get --host=<IP-address-of-1-node> --access-key=<> --secret-key=<> --obj.size=256M --duration=60m --bucket=<bucket-name> --objects 1500 --insecure --tls (after creating the bucket).
I had to set --objects due to space limits on the machine.

The outputs:

  1. Without changing the NSFS_CALCULATE_MD5
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 367.05 MiB/s, 1.50 obj/s

Throughput, split into 192 x 5s:
 * Fastest: 389.2MiB/s, 1.59 obj/s
 * 50% Median: 368.5MiB/s, 1.51 obj/s
 * Slowest: 325.8MiB/s, 1.33 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 849.39 MiB/s, 3.48 obj/s

Throughput, split into 239 x 15s:
 * Fastest: 896.0MiB/s, 3.67 obj/s
 * 50% Median: 851.6MiB/s, 3.49 obj/s
 * Slowest: 774.0MiB/s, 3.17 obj/s

  2. Changing the NSFS_CALCULATE_MD5 to true
----------------------------------------
Operation: PUT. Concurrency: 20
* Average: 161.74 MiB/s, 0.66 obj/s

Throughput, split into 141 x 15s:
 * Fastest: 171.4MiB/s, 0.70 obj/s
 * 50% Median: 162.1MiB/s, 0.66 obj/s
 * Slowest: 148.0MiB/s, 0.61 obj/s

----------------------------------------
Operation: GET. Concurrency: 20
* Average: 821.38 MiB/s, 3.36 obj/s

Throughput, split into 238 x 15s:
 * Fastest: 853.9MiB/s, 3.50 obj/s
 * 50% Median: 823.2MiB/s, 3.37 obj/s
 * Slowest: 741.7MiB/s, 3.04 obj/s
