Kafka Ingestion Peon Tasks Success But Overlord Shows Failure #12701
For those still here, I appreciate you hanging around; hopefully we can find something useful in this discussion. I am going to split this into multiple posts due to character limits. I was finally able to do a debug session of the Overlord and Peon services and get more details on what seems to be the issue. I believe I have found something, but I am unsure yet of the proper way to proceed, so I will provide the details below. Some overall details useful for reading these logs:
1. Based on my findings in my cluster (your cluster may be different but could be seeing something similar), this is a problem that has been around in Druid for a while, or at least appears to date back to version "druid-12.2-rc3". Similar issues were opened by other users without resolution or real engagement from the community; there are probably more, but these are what my key search terms turned up:
In our specific case, the issue stems from the Supervisor/Overlord asking the running Kafka task to "pause" (which does seem to happen frequently). Usually these "pause" requests (an IPC method that uses HTTP) go through fine, but often the Peon responds with "202 Accepted" instead of the usual "200 OK" or "400 Bad Request". It's not clear to me yet why the 202 was issued and why the task never actually paused. Some relevant info below:
2. Continued: After the 202 Accepted, the Supervisor/Overlord will repeatedly hit the "/status" endpoint of the Peon to see if it actually paused. After the 202, the task appears to only ever respond with the "STARTING" status.
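(For context, those pause/status checks go over the task's internal chat endpoints on the Peon. A hedged sketch of hitting them by hand; host, port, and task id are placeholders, and these are internal chat-handler paths rather than a public API.)

```bash
# Ask the task for its current status
curl -s "http://<peon-host>:<peon-port>/druid/worker/v1/chat/<taskId>/status"

# Ask the task to pause; 200 means it paused, 202 means the pause was only accepted, not yet applied
curl -s -X POST "http://<peon-host>:<peon-port>/druid/worker/v1/chat/<taskId>/pause"
```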
3. Continued: According to the source, a task can be in one of the following states:
What we are seeing right now is that it is stuck in the "STARTING" status but appears to still be doing work on the Peon side with segments and actions. This is what I see on the Peon side during the 202 Accepted response:
4. Final: To answer my own previous question: when the PUBLISHING phase is completed and the task is done, the Peon is supposed to update the ZK node, which is then picked up by the Supervisor/Overlord.
But since it never completed fully, due to not actually "pausing" and then having a shutdown sent, I don't believe it ever got the chance to update the ZK node, and it ends up completing shortly after. It did appear to be able to publish/push segments and do handoffs, since it registered a "SUCCESS" status after shutdown, and I also see logs for segment handoffs/publishing right before shutdown. Some Peon logs below:
I am unsure of exactly how to proceed, but I am hoping to get some more insight from the community and also shed some light on an ongoing issue that has possibly been affecting many users across different Druid versions. I wonder why the "202 Accepted" response exists and what the expectation for it is, if it can lead to scenarios like this where the task never actually pauses and stays in the STARTING phase, and what conditions lead to it. I am also open to tuning retries/timeouts to see if we can temporarily work around this for now, but I am unsure how to change those (like the PT2S and maxRetries for "Waiting for task to pause", or rather extending the HTTP timeout so the 202s go away and the task hopefully responds with 200 after some time) and whether they would even help. I am also unsure whether upgrading from 0.22.1 => 0.23.0 would help; I don't see many items in the release notes about Peon/Overlord changes, but I am open to discussion/advice.
After more digging, we found that "202 Accepted" responses happen often for "pause" HTTP calls, but the task usually does pause/resume later. This specific issue happens when it doesn't actually pause. If we look at the code here: https://github.com/apache/druid/blob/0.22.1/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java we can see that it is supposed to log a debug message when it eventually does pause/resume after a 202:
For a normal pause with a "202 Accepted" response:
For cases where the Peon responds with "202 Accepted" but never pauses, we never see that debug message; it looks something like this:
I believe this is enough information to get the ball rolling, but I can definitely provide more if needed to better understand the issue; I would appreciate help from anyone. If others are experiencing the same issue, I would also appreciate a simple comment/confirmation so we can see whether this is still affecting a wide audience, as indicated by the other issues I linked.
@pchang388 - The effort that you have put into investigating and documenting this is commendable. I am not an expert on the protocol between supervisor and peon; I am hoping that someone who is can comment. I have, though, fixed one relevant issue in 0.23.0, so I recommend upgrading and then troubleshooting further. That will avoid those continuous retries from the supervisor to pause the peon. I would also suggest you take flame graphs on the peons when you run into this issue again. Here is an article on how to do that: https://support.imply.io/hc/en-us/articles/360033747953-Profiling-Druid-queries-using-flame-graphs. That way, we will know where the peon is spending its time.
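(For anyone following along, a hedged sketch of capturing a flame graph with sjk; the jar path and pid are placeholders, and the exact options should be checked against `sjk --help` for your version.)

```bash
# Capture stack samples from the peon/overlord JVM for a few minutes (Ctrl-C to stop)
java -jar sjk.jar stcap -p <pid> -o /tmp/peon.std

# Render the capture as an HTML flame graph
java -jar sjk.jar ssa --flame -f /tmp/peon.std > /tmp/peon-flame.html
```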
Hey @pchang388, is it possible that the state hasn't transitioned from STARTING to READING?
Hey @abhishekagarwal87 - appreciate the support! That sounds like a good idea; we are planning to do the upgrade sometime early/middle of next week and are hoping this goes away permanently with the new version. These types of issues can be hard to troubleshoot since, in our case, it is intermittent (a few days of running fine and then a few days of this problem occurring). I will provide an update on the situation after the upgrade and see where we are with this specific issue. And thank you for the flame graph profiling link; I was not aware this was something we could do, but it could definitely help us see more of what is happening when the "pause" does not actually go through. I will take a look and try it out if things are not resolved with the new version.
Hey @AmatyaAvadhanula - thanks for taking a look! I do see one log entry with that INFO-level message and also a few similar ones. That specific log message appears once early in the task lifecycle, during the starting phase (task start time: 2022-07-06T20:56:26.370Z). Here are the logs around that time frame, including the specific message:
Before and after the "pause" that never appeared to actually pause, I also see it producing metrics for the Kafka metrics emitter (probably not as relevant, but it shows the task can communicate with the Kafka brokers before and after):
@pchang388, thanks for the response. The task may not be in the "STARTING" phase at the time of the request, as this state is not assigned again after the task starts "READING". I think so because:
Perhaps the task has reached
I'm also not sure that
Hi all, so we upgraded to 0.23.0 and unfortunately the problem still persists. We have 50% or more of the tasks failing at times with the same/similar error; example:
These are INFO-level logs; I will probably have to run debug mode again to get more data and also try out the flame graphs for the Peons (as recommended earlier) to get a better idea of what's going on. But I just wanted to provide an update about the 0.23.0 upgrade and confirm that this specific issue persists.
Just in case, and for reference, these are our relevant configuration options for the Overlord and MM/Peons:
As mentioned, I will try to get some Peon flame graphs for you, @abhishekagarwal87.
Hi @pchang388, In
Hey @AmatyaAvadhanula - thank you very much for the follow-up. I am trying this out now and will let you know what I find!
Thank you again for your support and advice. I am not very familiar with Java but was able to follow the documentation and build the 0.23.0 release with the section you referenced commented out. My local machine runs Windows/WSL (not by choice; corporate requirements limit me to Windows, and I chose WSL to get close to a Unix-like workstation). In order to avoid potential issues with libraries and WSL compatibility, I opted to use a Docker container as my build environment; Dockerfile below:
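(A hypothetical sketch of such a build-environment image, not the original Dockerfile; any JDK 8 + Maven 3 + git image will do, and the base image tag is an assumption to adjust for your registry.)

```dockerfile
# Hypothetical build environment for the Druid 0.23.0 distribution: Maven 3 on JDK 8, plus git
FROM maven:3-jdk-8
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /build
```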
Then I cloned the Druid repo inside the container and switched to the 0.23.0 release:
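(Roughly these steps; the tag name follows the Apache Druid repository's release tags.)

```bash
# Clone the repo and check out the 0.23.0 release tag
git clone https://github.com/apache/druid.git
cd druid
git checkout druid-0.23.0
```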
I made the change to the referenced section and then used mvn to build the package:
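(A hedged sketch of a typical distribution build; extra skip flags for checkstyle, forbidden-apis, and similar checks may be needed depending on the environment.)

```bash
# The dist profile produces the binary tarball under distribution/target/
mvn clean install -Pdist -DskipTests
```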
I then took the tarball and installed/configured the new Druid build on all MM/Peon VMs. I am going to let tasks run for a bit and report back, when I get a chance, on what I am seeing with the changes.
Oops, didn't mean to close - reopening the issue. I will report back after we get some data/logs.
So I did let it run with the section commented out for a bit, and most of the tasks failed (~50%+) with the same error messages we've seen before, but I did see one that was different; relevant INFO-level logs below (I can also get you debug logs if needed). I'm unsure at the moment whether the change we made removed a race condition, but looking at the code I believe I understand why you suggested it, and it looks like it could possibly do so. Example task:
In this new example, the last chatRetry was at Knowing this, I was able to temporarily provide some relief and let our existing tasks complete (for the most part, 90%+) by increasing chatRetries to a very high number (8 => 25); since there are 10 seconds between retries after the 3rd retry, this allows ~2 more minutes for the Peon to actually pause. This is just a workaround for now, since the issue still exists with the Peons not pausing in a reasonable time, but it does provide some temporary relief. It seems like we will definitely need those flame graphs @abhishekagarwal87 mentioned to get a better idea of what the Peon is working on when the pause request comes in.
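(For anyone applying the same workaround, these are the supervisor spec fields involved. A minimal sketch of just that part of the tuningConfig; httpTimeout is left at its documented default.)

```json
{
  "tuningConfig": {
    "type": "kafka",
    "chatRetries": 25,
    "httpTimeout": "PT10S"
  }
}
```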
Since the Peon seems unable to pause in a reasonable timeframe, or is at times unresponsive/hung, I took a look at some of the metrics for the common actions it would do during a task lifecycle. According to the docs: "An indexing task starts running and building a new segment. It must determine the identifier of the segment before it starts building it. For a task that is appending (like a Kafka task, or an index task in append mode) this is done by calling an "allocate" API on the Overlord to potentially add a new partition to an existing set of segments. For a task that is overwriting (like a Hadoop task, or an index task not in append mode) this is done by locking an interval and creating a new version number and new set of segments. When the indexing task has finished reading data for the segment, it pushes it to deep storage and then publishes it by writing a record into the metadata store."
So during the
And during the
I hope this background information provides more detail on our setup/configuration and makes it easier to spot a potential issue/bottleneck (which the Overlord seems to be). I really do appreciate the help, @abhishekagarwal87 and @AmatyaAvadhanula. My next step is to get the flame graphs for the Peons to get an idea of what the threads are doing. Please let me know if you have any further suggestions or things to try, or if I should provide any more information.
Thank you @pchang388 for the detailed analysis. The best way to troubleshoot these degradations is to look at the flame graphs of the overlord and peon. Looking forward to that information. I skimmed through the information that is already available. There is no major degradation in the system and JVM metrics. It could be that the overlord just became slower due to an application issue. I have seen that sometimes overlord HTTP threads slow down because of metadata operations. Some of these threads, busy in metadata operations, hold a lock that blocks other threads. This brings the whole service to a crawl. We can confirm this by looking at flame graphs. If the metadata store is indeed the cause of slowness, then scaling the metadata store would certainly help.
Hey guys, just an update on my side - I am still working on getting the flame graphs for the Peons and Overlord. Our VMs are in a kind of locked-down environment and we had Btw, the sjk tool doesn't work with JAVA_HOME set as an env variable (aragozin/jvm-tools#73), but I tried the local copy/install method and it looks like it's working so far. I will hopefully be able to provide the graphs tomorrow; thank you guys again for your patience and help.
Hi, I am still working on this, but I do have ~an hour of data from the Overlord, Peon, and MM in flame graphs. Just to note, the Peons and MMs are using the custom build we made (with the section commented out); my plan tomorrow is to get more data using the standard 0.23.0 build (since the change did not seem to fix the issue, and it also ensures everyone's code is the same). If this is not preferred, please let me know and I will keep the custom build running. I am attaching the 3 flame graphs to this comment (in a single zip file). In this specific example, it appears the task did complete, although I'm not 100% sure since I was doing this while also working on other things and lost track of the taskId. As stated, I will try to get more flame graphs with the standard 0.23.0 build and also try to capture a Peon's graph during a failed task. Also to note, the Peon stack dump generation did have an error at the end:
It still has data, but I believe the error was because the PID shut down before the sjk duration elapsed. If this is unrecoverable, I can definitely get more, but hopefully there is some data here that points at a potential issue.
Yes, let's use the standard 0.23.0 build. Btw, we would need the flame graphs on the overlord and on the tasks that are not responding to the pause request.
You should also run sjk for only ~5 minutes, when the peon is not responding. Loading a flame graph with 1 hour of data is very slow.
Hi @abhishekagarwal87 and @AmatyaAvadhanula, I reverted back to the official 0.23.0 release for the MM/Peons and reran the sjk profiler on the Overlord and Peons for a smaller period of time (5 minutes) as requested. One of the Peons had 21 messages for I ran the sjk profiler using the flags shown below; example for a Peon:
Under normal configuration/settings, this task would have failed, since it exhausts the default chatRetries count (https://druid.apache.org/docs/latest/development/extensions-core/kafka-supervisor-reference.html):
The task was only allowed to succeed because we increased chatRetries from 8 => 25 as a workaround to get tasks to complete for now. If you'd like, I can turn that back down to 8 and let some tasks fail instead, but my thought was that since we are looking specifically at the pause request window, the outcome (SUCCESS/FAILED) of the tasks is not that important. Please let me know if you would prefer that I let it fail (we still have some failing; the chatRetries increase helps a lot, but 10-50% of tasks still fail intermittently) and also capture the stack dump of the failure window, and I can run it for a longer interval. It's difficult to predict which task will fail and get that specific stack dump, but I'm working on an Ansible playbook that will grep/grab the PID and do it for all Peons, so it may be easier next time. Here are the Overlord logs showing the 21 tries, which seems very high; the time to pause was actually ~194 seconds, or ~3.23 minutes:
I started the sjk profiler (on the Overlord and the Peon for this specific task) as soon as the first Overlord log showed up: ~ Attached is a zip of the Overlord and Peon flame HTML files; please let me know if you would prefer to also see the stack dump file (.std extension) and I can include that in a separate comment as well. index_kafka_REDACT_36184a251c6f5d5_chhibjdl_flame_graphs.zip Once again, thank you guys for continuing to assist me; I hope we learn something from this.
I also saved the Overlord and Peon task (INFO-level) logs for the example Peon task shown above; I can provide the entire output or snippets if needed, please let me know!
what is the size of |
I believe this is what you are asking for:
At the time of posting this comment, the oldest segment in there is showing as:
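(As a general note for others running the same kind of check, the metadata tables can be inspected directly on the metadata store. A hedged sketch with psql; connection details are placeholders, and the table names assume the default `druid_` base name.)

```bash
# Row counts for the usual suspects in the Druid metadata store
psql -h <metadata-host> -U <user> -d <druid-db> \
  -c "SELECT COUNT(*) FROM druid_pendingsegments;" \
  -c "SELECT COUNT(*) FROM druid_segments;" \
  -c "SELECT COUNT(*) FROM druid_tasks;"
```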
@pchang388 could you please provide some additional info:
Hey @AmatyaAvadhanula,
I am also planning on moving off Yugabyte for our metadata DB, since the query metrics do not seem very performant. I set up postgresql-ha in K8s, it seems to be better from my tests, and I do believe we will have better query performance after the move. I am still in the process of validating the metrics/performance of postgresql-ha before I do so. I am hoping that will make a difference, since I do sometimes see this specific error in the logs: https://support.yugabyte.com/hc/en-us/articles/4403469712397-Database-Transactions-errors-out-with-Restart-read-required- "Yugabyte DB by default runs in snapshot isolation level, preventing phantom and non-repeatable reads. In an environment where multiple transactions conflict with each other, it's possible the data has changed since it was the last read and before committing in a read/write transaction. This causes the current transaction to read the stale data or read data from the past. In such cases, the read operation needs to be restarted at a higher hybrid time."
@AmatyaAvadhanula - How will 2) help here? @pchang388 - To me, the metadata store seems to be the culprit. If you filter the
I was thinking that DB operations are slower because of the high scale. That doesn't seem to be the case here. You can also look at
I've been busy the last few days but will provide another update tomorrow. And thank you for that information! We also believed the metadata DB was the issue, but it was not apparent from regular metrics like read/write latency, etc. Yes, they were a little high, as I mentioned before, which led us to seek alternatives, but I did not think they were absurd/out of bounds. The metric you provided, along with your evaluation of the flame graph, was essential in confirming the metadata DB as the potential issue. We are in the process of changing the metadata DB from Yugabyte to open-source Postgres HA (Pgpool implementation). We have already done so in our dev and stg environments and are seeing much better numbers for that metric:
I will do the prod migration and let you know if we also see the same improvements there. I also have an idea of why the Yugabyte/Overlord interactions were causing locks or long wait times for DB operations; I will post that information as well in my follow-up tomorrow for other people to reference in case they have similar issues to ours.
That's great to hear. Looking at those numbers, migrating to pg will certainly help. I am curious why your team chose to use Yugabyte as the metadata store backend. I would also love to know what your Druid monitoring stack looks like.
So I did migrate to PG-HA in prod, and the results are shown below in this comment. But first, I want to say thank you @abhishekagarwal87 and @AmatyaAvadhanula; I really do appreciate you guys helping me on this and having a lot of patience! After the migration from Yugabyte to PG-HA, we have stable ingestion tasks and no more Overlord slowness/degraded behavior. What made this difficult to identify from our side was that all the metrics for Yugabyte DB (latency, operation metrics, etc.) pointed at somewhat high latency for reads and deletes, but not enough, in my opinion, to explain the task failures and the degree of Overlord slowness. Also, the number of errors for PSQL-related queries/actions in our Overlord logs was around ~1-5 in a 7-day period (with info/debug-level logging, although it's possible I missed some with my search terms). Here are the differences for the metric:
So the issue was the metadata DB (in our case, Yugabyte), as @abhishekagarwal87 pointed out. The symptoms for this type of issue include:
If you see these symptoms, you should consider testing out a different metadata DB to see if that relieves the issue. Now the question becomes: why did Yugabyte perform this way without noticeably bad latency metrics? I am not a DB person/expert, and what follows is all speculation based on my assumptions. I was curious and ran benchmark tests using pgbench and ysql_bench (Yugabyte's version of pgbench); what I noticed was that our Yugabyte instance was considerably faster but had a high percentage of errors when handling transactions.
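(A hedged sketch of the kind of pgbench run being compared here; ysql_bench takes largely the same flags, and host, user, database name, and scale are placeholders.)

```bash
# Initialize a benchmark database at scale factor 50, then drive it with 16 clients for 60 seconds
pgbench -h <host> -U <user> -i -s 50 bench_db
pgbench -h <host> -U <user> -c 16 -j 4 -T 60 bench_db
```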
These errors include things like:
As stated, I am not a DB guy, but these are transactions that are failing and requiring a retry, and according to the Yugabyte docs: "In scenarios where only a few transactions conflict with each other, optimistic concurrency control is a good strategy. This is generally the case in high-volume systems. For example, most web applications have short-lived connections to the database. YugabyteDB opts for optimistic concurrency in the case of simple transactions. This is achieved by assigning a random priority to each of the transactions. In the case of a conflict, the transaction with a lower priority is aborted. Some transactions that get aborted due to a conflict are internally retried while others result in an error to the end application." Based on that information, it appears the Overlord threads may hang on DB operations due to:
I'm sure it's possible that Yugabyte can be configured to match the needs of Druid, but I am not a DB guy, so I opted to use something more established and simple. Out of the box, Yugabyte did not work well with Druid and caused the issues we've been seeing. I hope this helps future users who may be experiencing the same issue and helps them resolve their problems.
I joined the team later than most, and Yugabyte was chosen because someone on our team had familiarity with it, or at least with the company, if I recall correctly. But to be honest, I am not a fan of it; it is really hard to troubleshoot and to actually read the logs. There's so much information about Raft elections, housekeeping tasks, etc. that it becomes difficult to follow. In my experience it also has weird quirks, like the quorum not being filled even though multiple replicas are running, but that's intermittent, so it could be due to infrastructure or networking issues. Yugabyte also has some tools to support cross-datacenter replication/quorum that we wanted to try. As for the metrics stack we use, I will dig that up and send you the information! Then I will close the ticket as solved. Once again, thank you both for everything and especially your patience and persistence.
Also, if any DB experts want to chime in on why Yugabyte was causing issues with the Overlord, that would be awesome! It's always good to learn new things. I've also learned I need to get familiar with Java as I continue to work with Apache services. I mostly use Python, JavaScript/TypeScript, and lately Golang for my dev projects, but it would definitely be good to know Java as well.
As requested, here is the metrics/monitoring stuff we have. We use the Grafana/Prometheus/Alertmanager stack as our platform on Kubernetes.
Thank you again @abhishekagarwal87 and @AmatyaAvadhanula
Thank you @pchang388. This is the exact information that I was looking for.
Agreed, that would definitely be nice. If I have some time, I may take a crack at it; another option is to create a small custom service that runs the queries directly against Druid and transforms the results to the Prometheus format, roughly as sketched below.
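(A hedged sketch of that idea: poll a Druid SQL query on the Broker and push the result to a Prometheus Pushgateway. Hostnames, the query, and the metric name are placeholders; it assumes curl and jq are available.)

```bash
#!/usr/bin/env bash
DRUID_BROKER="http://druid-broker:8082"
PUSHGATEWAY="http://pushgateway:9091"

while true; do
  # Count currently running tasks via the sys.tasks system table
  running=$(curl -s -X POST -H 'Content-Type: application/json' \
    -d '{"query": "SELECT COUNT(*) AS cnt FROM sys.tasks WHERE status = '\''RUNNING'\''"}' \
    "${DRUID_BROKER}/druid/v2/sql" | jq -r '.[0].cnt')

  # Expose the value in Prometheus text format via the Pushgateway
  echo "druid_custom_running_tasks ${running}" \
    | curl -s --data-binary @- "${PUSHGATEWAY}/metrics/job/druid_custom"

  sleep 60
done
```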
Apologies if this breaks any rules, but I tried the Druid forums without much success, so I am trying here to see if I can reach a different audience. Relevant information is below, and more details are in the Druid forum post.
Affected versions: 0.22.1, 0.23.0
Forum post: https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374
In general, when we run all our tasks, we start seeing issues between the Overlord and the MM/Peons. Oftentimes, the Peon will show that the task was successful, but the Overlord believes it failed and tries to shut it down. Things also start to get sluggish with the Overlord: it starts taking a while to recognize completed tasks and tasks that are trying to start, which seems to point at a communication/coordination failure between the Overlord and the MM/Peons. We even see task assignment timeouts between the Overlord and MM occur (PT10M in our case - the default is PT5M).
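(For reference, the assignment timeout mentioned above is the Overlord's remote task runner setting; a minimal sketch of raising it in the Overlord's runtime.properties, with PT5M being the documented default.)

```properties
# Overlord runtime.properties
druid.indexer.runner.taskAssignmentTimeout=PT10M
```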
The only thing that seems to help is reducing the number of tasks we have running concurrently by suspending certain supervisors, which also indicates an issue with the three Druid services handling the load of our current ingestion. But according to system metrics, resource usage is not hitting any limits and there is still more compute available. It's odd, since there are probably a lot of users ingesting more data per hour than us, and we don't see this type of issue in their discussions/white papers.
Any help will definitely be appreciated.