Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] pg_rewind: fatal: could not find previous WAL record #803

Open
1 of 2 tasks
antoniopetrole opened this issue Dec 19, 2024 · 10 comments
Open
1 of 2 tasks

[Bug] pg_rewind: fatal: could not find previous WAL record #803

antoniopetrole opened this issue Dec 19, 2024 · 10 comments
Labels
type: Bug Something isn't working

Comments

@antoniopetrole
Copy link
Member

Apache Cloudberry version

1.6.0

What happened

Hey all, so our cluster that we have has had a number of segments go down over the past month. Typically this isn't a major issue since we can just run gprecoverseg in an incremental fashion and things are back up within the hour. However almost every time we've had segments go down, the incremental recovery fails with the error

pg_rewind: fatal: could not find previous WAL record

I'm not sure if the default WAL settings are perhaps insufficient in the sense that the WAL records are being rotated out too quickly or if this is a bug. To provide some additional context, here's a timeline of events for segments that crashed on 12/16

12/16/2024 22:08:13: Segment 171 crashes due to a segfault (I'll be opening another issue for this today) from a query
12/16/2024 22:09:33: Segment 51 crashes due to a segfault from the same query
12/16/2024 22:18:38: We run gprecoverseg (no args) and it shows it's only going to recover segment51 and it fails with the following error

pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 22DB/87CEBDC0 on timeline 3
pg_rewind: error: could not open file "/data4/primary/gpseg51/pg_wal/00000003000022DB0000001B": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 22DB/6FFFFFF8

12/16/2024 22:19:29: We try running gprecoverseg (no args) again and receive the same exact error as above. It also is saying it's only recovering segment 51
12/16/2024 22:28:01: We try running gprecoverseg -F -a and get the following error

pg_basebackup: error: connection to server at "10.9.1.232", port 5003 failed: FATAL:  the database system is resetting

12/16/2024 22:40:04: We try running gprecoverseg (no args) again and now it's showing it's going to try and recover both segment 51 and segment 171 but ends up with the same exact error as before
12/16/2024 22:48:33: 6 more segments go down due to a segfault from the same query (segments 187,19,138,108,143,57)
12/16/2024 22:56:33: 23 more segments go down due to a segfault from the same query (segments 60,129,53,144,131,62,63,175,54,107,71,105,45,160,12,47,177,161,180,70,178,61,147)
12/16/2024 23:42:40: We successfully recover segment 70 with incremental (the duration was 34 minutes)
12/17/2024 1:20:45: We successfully recover 8 more segments incrementally (segments 60,53,62,54,71,107,12,47)
12/17/2024 1:30:12: We try recovering 19 more segments incrementally (19,57,61,63,105,108,129,131,138,143,144,147,160,161,171,175,177,178,180) and received the following error

pg_rewind: connected to server
pg_rewind: executing "/usr/local/cloudberry-db-1.6.0/bin/postgres" for target server to complete crash recovery
LOG:  gp_role forced to 'utility' in single-user mode
2024-12-17 01:30:34.368316 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","database system was interrupted; last known up at 2024-12-16 23:36:51 UTC",,,,,,,,"StartupXLOG","xlog.c",6819,
2024-12-17 01:30:34.383430 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","Synchronization of the wal directory starts.",,,,,,,,"SyncAllXLogFiles","fd.c",3446,
2024-12-17 01:30:34.383713 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","synchronization of the wal directory finishes.",,,,,,,,"SyncAllXLogFiles","fd.c",3448,
2024-12-17 01:30:34.513928 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","database system was not properly shut down; automatic recovery in progress",,,,,,,,"StartupXLOG","xlog.c",7388,
2024-12-17 01:31:00.783898 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","redo starts at 22D8/8C750D40",,,,,,,,"StartupXLOG","xlog.c",7677,
2024-12-17 01:31:00.783968 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","invalid record length at 22D8/8C750D78: wanted 24, got 0",,,,,,,,"ReadRecord","xlog.c",4484,
2024-12-17 01:31:00.784030 UTC,,,p540072,th1765611648,,,,0,,,seg108,,,,,"LOG","00000","redo done at 22D8/8C750D40 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s",,,,,,,,"StartupXLOG","xlog.c",7966,

PostgreSQL stand-alone backend 14.4
backend> pg_rewind: servers diverged at WAL location 22D8/8C73A0B8 on timeline 1
pg_rewind: error: could not open file "/data1/primary/gpseg108/pg_wal/00000001000022D80000001D": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 22D8/77FFFFA8

12/17/2024 3:02:46: We end up recovering 13 specified segments incrementally according to the logs (61,63,105,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:09:44: On our next run we try to recover more segments and we end up recovering duplicates of our last run (it seems like they didn't recover successfully although gprecoverseg says they did? I need to double check this)
(12,47,53,54,60,61,62,63,70,71,105,107,131,144,147,160,161,171,175,177,178,180)
12/17/2024 4:18:41: We kick off a FULL recovery for the remaining segments which works after running for 10 hours (segments 19,45,51,57,108,129,138,143,187)
12/17/2024 19:02:03: We run a rebalance and the cluster is back in a healthy state

So I know this is a lot of information to digest, I have a spreadsheet link below that shows the timeline of events along with error messages etc. The main thing I want to call out is that we received this WAL error on segments 51, 187, 45, 108, 57, and 19. However a bunch of the other segments that crashed at the same time as those were able to do an incremental with no issues. It seems like we effectively had to go through the process of eliminating potential problem segments from our gprecoverseg input file one at a time to figure out what's going wrong. What's also strange is that our first incremental recovery attempt was only 9 minutes after the segment went down and we hit this issue. However for the final incremental that worked it was almost 5 hours after those segments went down, so I'm not sure how valid the WAL record rotation theory is.

I'm happy to provide any clarifications/logs that might be needed for this, here is a more detailed timeline for review
https://docs.google.com/spreadsheets/d/1qJNC88rFAD6OzF2b0fEeNgAX57QdnKY8SZhnhbg9q6g/edit?usp=sharing

What you think should happen instead

The expected behavior here would be that there aren't inconsistencies in WAL files being available amongst a group of segments that all failed at the same time

How to reproduce

Unfortunately I'm not sure how to even reproduce this

Operating System

Rocky Linux 8.10 (Green Obsidian)

Anything else

This has happened 3 times in the past month requiring a full recovery for either all downed segments or a subset of them

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

@antoniopetrole antoniopetrole added the type: Bug Something isn't working label Dec 19, 2024
@edespino
Copy link
Contributor

FYI: technically, this is a pre-Apache Cloudberry (Incubating) release. The project has not had their initial Apache release.

@edespino
Copy link
Contributor

@antoniopetrole - the error messages indicate there were segfaults. Are your systems configured to capture core files?

@antoniopetrole
Copy link
Member Author

@edespino Yes we have specific directories where core files are dumped and I believe we have the core files for all of the times segments have gone down since we deployed the cluster. I'm going to include more information about that in my next issue regarding the segfault itself

@edespino
Copy link
Contributor

@antoniopetrole I am preparing a tool to assist with the core file review - extracting relevant information. I should have a version to share shortly.

@edespino
Copy link
Contributor

To assist with core file analysis, I created the following repository containing apache_cloudberry_core_analyzer.py:

https://github.com/cloudberry-contrib/support-tools

@my-ship-it
Copy link
Contributor

Please wait, then there will be engineers to help support.

@foreyes
Copy link
Contributor

foreyes commented Dec 20, 2024

@antoniopetrole Thanks for reporting this complicated issue, we are going to investigate this soon.
BTW, could you please @ me when another issue is opened?

@foreyes
Copy link
Contributor

foreyes commented Jan 7, 2025

Although we still haven't figured out the root cause of issue #806, I am also working on constructing a workload that may make pg_rewind to fail, in an attempt to reproduce this issue independently.

@foreyes
Copy link
Contributor

foreyes commented Jan 10, 2025

Have a potential fix for issue #806, and still finding the root cause. At the same time, we are also trying to reproduce this issue.

@antoniopetrole
Copy link
Member Author

@foreyes Thank you! I understand that this is a complex issue with a lot of moving parts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants