[Feature Request] Support fs health check monitor on Azure Blob Storage / S3 #16743

Open
audunsolemdal opened this issue Nov 29, 2024 · 3 comments
Labels
enhancement Enhancement or improvement to existing feature or request Storage Issues and PRs relating to data and metadata storage

Comments

@audunsolemdal

Is your feature request related to a problem? Please describe

Currently I am running the OpenSearch Helm chart on Kubernetes with monitor.fs.health.enabled = true.
My node uses Azure Blob Storage as the backend, which seems to work fine, but the fs health check fails:

[ERROR][o.o.m.f.FsHealthService ] [datalev-opensearch-master-0] health check of [/usr/share/opensearch/data/nodes/0] failed

Describe the solution you'd like

Ideally, a health check that works on Azure Blob Storage / Amazon S3 storage.

Related component

Other

Describe alternatives you've considered

Current workaround is setting
monitor.fs.health.enabled = false

Additional context

Error log

2024-11-29 10:15:54.713	[2024-11-29T09:15:54,712][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [opensearch-master-0] Finished housekeeping task for auto refresh streaming jobs.
2024-11-29 10:15:54.712	[2024-11-29T09:15:54,711][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [opensearch-master-0] Starting housekeeping task for auto refresh streaming jobs.
2024-11-29 10:15:53.436	[2024-11-29T09:15:53,436][INFO ][o.o.j.s.JobSweeper       ] [opensearch-master-0] Running full sweep
2024-11-29 10:15:53.336		at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-11-29 10:15:53.336		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2024-11-29 10:15:53.336		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2024-11-29 10:15:53.336		at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336		at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336		at org.opensearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:246) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336		at org.opensearch.monitor.fs.FsHealthService$FsHealthMonitor.run(FsHealthService.java:195) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336		at org.opensearch.monitor.fs.FsHealthService$FsHealthMonitor.monitorFSHealth(FsHealthService.java:228) [opensearch-2.18.0.jar:2.18.0]
2024-11-29 10:15:53.336		at java.base/sun.nio.ch.ChannelOutputStream.close(ChannelOutputStream.java:111) ~[?:?]
2024-11-29 10:15:53.336		at java.base/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:113) ~[?:?]
2024-11-29 10:15:53.336		at java.base/sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:210) ~[?:?]
2024-11-29 10:15:53.336		at java.base/jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
2024-11-29 10:15:53.336		at java.base/jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:178) ~[?:?]
2024-11-29 10:15:53.336		at java.base/sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:116) ~[?:?]
2024-11-29 10:15:53.336		at java.base/java.io.FileDescriptor$1.close(FileDescriptor.java:89) ~[?:?]
2024-11-29 10:15:53.336		at java.base/java.io.FileDescriptor.close(FileDescriptor.java:304) ~[?:?]
2024-11-29 10:15:53.336		at java.base/java.io.FileDescriptor.close0(Native Method) ~[?:?]
2024-11-29 10:15:53.336	java.io.IOException: Input/output error
2024-11-29 10:15:53.336	[2024-11-29T09:15:53,335][ERROR][o.o.m.f.FsHealthService  ] [opensearch-master-0] health check of [/usr/share/opensearch/data/nodes/0] failed

@audunsolemdal audunsolemdal added enhancement Enhancement or improvement to existing feature or request untriaged labels Nov 29, 2024
@github-actions github-actions bot added the Other label Nov 29, 2024
@andrross
Member

andrross commented Dec 2, 2024

Here is what the health check does: https://github.com/opensearch-project/OpenSearch/blob/2.x/server/src/main/java/org/opensearch/monitor/fs/FsHealthService.java#L223-L229

tl;dr: create a file, write a byte, fsync, close file, delete file

This is all pretty straightforward stuff using the Java NIO API. I would expect anything that is acting as a filesystem to need to work for these APIs.

@audunsolemdal It looks like your stack trace is pointing to a failure when attempting to close the OutputStream that was used to write a byte. Any idea why that might fail?

@andrross andrross added the Storage Issues and PRs relating to data and metadata storage label Dec 2, 2024
@andrross andrross removed the Other label Dec 2, 2024
@audunsolemdal
Author

> This is all pretty straightforward stuff using the Java NIO API. I would expect anything that is acting as a filesystem to need to work for these APIs.
>
> @audunsolemdal It looks like your stack trace is pointing to a failure when attempting to close the OutputStream that was used to write a byte. Any idea why that might fail?

I am not sure why it fails on that step, but Azure Blob Storage has some limitations compared to a full-fledged file system. I am mounting a Kubernetes Persistent Volume via the Azure Blob Storage CSI driver, which is based on Blobfuse2:

https://github.com/Azure/azure-storage-fuse?tab=readme-ov-file#un-supported-file-system-operations

So far I have not noticed any issues using this apart from the health check.

@andrross
Member

andrross commented Dec 3, 2024

I'm not sure if this is actionable for us at the moment. I think we'd need more specific details about the failure you're seeing and how we could accommodate it. It's really odd that these simple operations would fail but the actual usage of the file system by OpenSearch (which surely includes creating, writing, flushing, and deleting files) would succeed. I might suggest writing a super simple Java program that mimics our health check to see if you can isolate the failure in a more controlled environment. @audunsolemdal what do you think?
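
A minimal sketch of what such a program might look like, assuming the sequence summarized earlier in this thread (create a file, write a byte, fsync, close, delete). This is not the actual FsHealthService code: the class name `FsHealthProbe`, the temp file name, and the step labels are all hypothetical, and the fsync here is done through a second `FileChannel` purely as an illustration. It reports which step throws, which may help narrow down whether the close is really the failing operation on the Blobfuse2 mount.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class FsHealthProbe {
    // Hypothetical file name, chosen for this sketch only.
    static final String TEMP_FILE = ".fs_probe_temp_file";

    /** Runs each step in order; returns "ok" or the label of the step that threw. */
    static String probe(Path dir) {
        Path tmp = dir.resolve(TEMP_FILE);
        String step = "delete-stale";
        OutputStream os = null;
        try {
            Files.deleteIfExists(tmp);          // clear leftovers from an earlier run
            step = "create";
            os = Files.newOutputStream(tmp, StandardOpenOption.CREATE_NEW);
            step = "write";
            os.write(1);                        // write a single byte
            step = "fsync";
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ch.force(true);                 // flush data and metadata to storage
            }
            step = "close";
            os.close();                         // the step the stack trace points at
            os = null;
            step = "delete";
            Files.delete(tmp);
            return "ok";
        } catch (IOException e) {
            System.err.println("step [" + step + "] failed: " + e);
            return step;
        } finally {
            if (os != null) {
                try {
                    os.close();                 // best-effort cleanup after a failure
                } catch (IOException ignored) {
                    // already reporting the original failure
                }
            }
        }
    }

    public static void main(String[] args) {
        // Directory to probe; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(probe(dir));
    }
}
```

On a JDK that supports single-file source launch (Java 11 or later), something like `java FsHealthProbe.java /usr/share/opensearch/data/nodes/0` inside the pod should print `ok` or the label of the failing step. Running it against both the blob-backed mount and a local directory would give a useful comparison.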
