bad file descriptor when rapidly appending and writeback_cache is enabled #5315

Open

aeblyve opened this issue Nov 21, 2024 · 1 comment
Labels: kind/bug Something isn't working

aeblyve commented Nov 21, 2024

What happened:

I ran R code that rapidly appended data to a .csv file, but got bad file descriptor errors when I called warnings():

> warnings()
Warning messages:
1: In close.connection(file) : Problem closing connection:  Bad file descriptor
2: In close.connection(file) : Problem closing connection:  Bad file descriptor
3: In close.connection(file) : Problem closing connection:  Bad file descriptor
4: In close.connection(file) : Problem closing connection:  Bad file descriptor
5: In close.connection(file) : Problem closing connection:  Bad file descriptor
6: In close.connection(file) : Problem closing connection:  Bad file descriptor
7: In close.connection(file) : Problem closing connection:  Bad file descriptor
8: In close.connection(file) : Problem closing connection:  Bad file descriptor
9: In close.connection(file) : Problem closing connection:  Bad file descriptor
10: In close.connection(file) : Problem closing connection:  Bad file descriptor
11: In close.connection(file) : Problem closing connection:  Bad file descriptor
12: In close.connection(file) : Problem closing connection:  Bad file descriptor

The file is not written correctly; it is missing substantial pieces that should be there. How much is missing seems random, like a race condition.

What you expected to happen:

The data should be appended (even if it takes some time).

How to reproduce it (as minimally and precisely as possible):

This bash loop reproduced the same (or very similar) behavior on our mount:

[root@hostname juicefs]# for i in $(seq 1 10); do echo $i >> afile; done   
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor
-bash: echo: write error: Bad file descriptor

Anything else we need to know?

The issue consistently disappeared when we removed writeback_cache from the mount options; writeback on its own still works.
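
For clarity, here is what the mount options looked like after that change, i.e. the Options= line shown under Environment below with only writeback_cache removed; everything else stayed the same:

Options=_netdev,allow_other,writeback,max-uploads=40,buffer-size=1024,backup-meta=0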

Here is a sample from the .accesslog taken while the bug was happening with the R code. Notice the bad file descriptor on the read.


2024.11.21 12:26:57.506861 [uid:uid,gid:gid,pid:28644] read (57698439,4096,28672): (0) - bad file descriptor <0.000002>
2024.11.21 12:26:57.507507 [uid:uid,gid:gid,pid:28644] setattr (57698439[41684],0x460,[mtime=1732210017]): (57698439,[-rw-r--r--:0100644,1,uid,gid,1732209892,1732210017,1732210017,32225]) - OK <0.000609>
2024.11.21 12:26:57.507536 [uid:uid,gid:gid,pid:28644] flush (57698439,41684,83806B71FE2A25BE) - OK <0.000002>
2024.11.21 12:26:57.507566 [uid:0,gid:0,pid:0] release (57698439,41684) - OK <0.000006>
2024.11.21 12:26:57.511805 [uid:uid,gid:gid,pid:28644] getattr (57698438): (57698438,[-rwxr--r--:0100744,1,uid,gid,1732209874,1732209874,1732209874,4724]) - OK <0.000460>
2024.11.21 12:26:57.515278 [uid:uid,gid:gid,pid:28805] flush (57698438,39669,1D60D24A7855051D) - OK <0.000001>
2024.11.21 12:26:57.515346 [uid:uid,gid:gid,pid:28644] flush (57698438,39669,83806B71FE2A25BE) - OK <0.000001>
2024.11.21 12:26:57.515384 [uid:0,gid:0,pid:0] release (57698438,39669) - OK <0.000006>

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version:
    juicefs version 1.2.1+2024-08-30.cd871d1
  • Cloud provider or hardware configuration running JuiceFS:
    AWS EC2 instance.
  • OS (e.g cat /etc/os-release):
NAME="AlmaLinux"
VERSION="8.10 (Cerulean Leopard)"
ID="almalinux"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)"
ANSI_COLOR="0;34"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos"
HOME_URL="https://almalinux.org/"
DOCUMENTATION_URL="https://wiki.almalinux.org/"
BUG_REPORT_URL="https://bugs.almalinux.org/"

ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8"
ALMALINUX_MANTISBT_PROJECT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="AlmaLinux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"
SUPPORT_END=2029-06-01
  • Kernel (e.g. uname -a):
Linux hostname 4.18.0-553.27.1.el8_10.x86_64 #1 SMP Tue Nov 5 04:50:16 EST 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Object storage (cloud provider and region, or self maintained):
    Amazon S3.
  • Metadata engine info (version, cloud provider managed or self maintained):
    Redis on Amazon EC2, self maintained. Version 6.2.7
  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage):
    Amazon EC2 IP networking, and networking to Amazon S3.
  • Others:
    Systemd mount options at the time of failure looked like this:
Options=_netdev,allow_other,writeback_cache,writeback,max-uploads=40,buffer-size=1024,backup-meta=0
jiefenghuang (Contributor) commented

Hi @aeblyve, for your use case, we recommend using only writeback.
writeback_cache is a mechanism of FUSE, while writeback is a mechanism of JuiceFS.

The simplified data flow is: userspace -> kernel -> FUSE (default: write-through) -> JuiceFS -> object storage.
So under the default FUSE IO mode, i.e. write-through, append writes reach JuiceFS as sequential IO requests. JuiceFS then uses its writeback mechanism to write the data to local disk first and upload it to object storage asynchronously.

If you enable the writeback_cache FUSE IO mode, the kernel first caches write requests in the page cache and flushes pages to the underlying layer at unpredictable intervals. The IO requests JuiceFS receives may then become random writes, which is why this mode is not recommended.
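
As a rough sketch of that distinction in terms of mount commands (the metadata URL and mount point are placeholders; --writeback and the writeback_cache FUSE option are the ones referenced in this issue):

# JuiceFS-level write cache only: FUSE stays in write-through mode, and the
# client stages writes on local disk, then uploads them asynchronously.
juicefs mount --writeback redis://META-HOST:6379/1 /mnt/jfs

# Adding the FUSE kernel option on top of that is the configuration that
# triggered this issue (shown here only for contrast):
juicefs mount --writeback -o writeback_cache redis://META-HOST:6379/1 /mnt/jfs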

Additionally, we will look into your issue to see if it can be reproduced and investigate the specific problem.
