pgcopydb stopped with [TARGET 0] SSL error: bad length failure during WAL replay #885

Open
dverite opened this issue Sep 26, 2024 · 2 comments

@dverite

dverite commented Sep 26, 2024

Hi,
I'd like to report a failure seen when moving a large database (RDS pg14 to RDS pg14; client side: pgcopydb v0.17 on Ubuntu 22.04).

Invocation:

pgcopydb clone \
--table-jobs 4 \
--index-jobs 2 \
--restore-jobs 8 \
--split-tables-larger-than "25GB" \
--split-max-parts 4 \
--no-role-passwords \
--no-acl \
--no-tablespaces \
--skip-extensions \
--skip-ext-comments \
--skip-collations \
--skip-large-objects \
--skip-vacuum \
--slot-name migration_sre_storage \
--follow \
--verbose &>> logs.txt

Start:

2024-09-12 05:44:51.253 2090 INFO   main.c:136                Running pgcopydb version 0.17-1.pgdg22.04+1 from "/usr/bin/pgcopydb"

After about 5 hours running, it failed with the following output:

2024-09-12 10:37:11.880 2099 INFO   ld_apply.c:459            Replaying changes from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql"
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:2330              [TARGET 0] SSL error: bad length
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:2330              [TARGET 0] SSL SYSCALL error: EOF detected
2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:4822              Failed to setup replication origin transaction at origin LSN 16A48/A08C730 and origin timestamp "2024-09-12 05:57:35.937263+0000"
2024-09-12 10:37:11.915 2099 ERROR  ld_apply.c:818            Failed to setup apply transaction, see above for details
2024-09-12 10:37:11.915 2099 ERROR  ld_apply.c:525            Failed to apply SQL from file "/root/.local/share/pgcopydb/0000000100016A4800000002.sql", see above for details
2024-09-12 10:37:11.915 2099 INFO   follow.c:824              Apply process has terminated
2024-09-12 10:37:11.999 2096 ERROR  follow.c:1057             Subprocess catchup with pid 2099 has exited with error code 12
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1102             Process catchup has exited unexpectedly, and endpos is unset: terminating other processes
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151             kill -TERM 2097 (prefetch)
2024-09-12 10:37:11.999 2096 NOTICE follow.c:1151             kill -TERM 2098 (transform)
2024-09-12 10:37:11.999 2098 INFO   follow.c:771              Transform process has terminated
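To make it easier to see which process reported which error, the log lines above can be split into fields. The following is only a sketch: the field layout (timestamp, PID, level, source location, message) is inferred from the excerpt in this report rather than from pgcopydb documentation, and the names `LOG_LINE`, `parse_line` and `errors_by_pid` are made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above (an assumption, not a documented format):
# timestamp, PID, level, source file:line, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+)\s+"
    r"(?P<src>\S+:\d+)\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def errors_by_pid(lines):
    """Count ERROR lines per reporting PID."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR":
            counts[record["pid"]] += 1
    return dict(counts)


# Three lines copied from the output above.
sample = [
    "2024-09-12 10:37:11.915 2099 ERROR  pgsql.c:2330              [TARGET 0] SSL error: bad length",
    "2024-09-12 10:37:11.915 2099 INFO   follow.c:824              Apply process has terminated",
    "2024-09-12 10:37:11.999 2096 ERROR  follow.c:1057             Subprocess catchup with pid 2099 has exited with error code 12",
]

print(errors_by_pid(sample))
```

On the full output, this shows that the SSL errors come from PID 2099 (the catchup/apply subprocess), while PID 2096 only reports that the subprocess exited.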

The [TARGET 0] in the first error message is suspicious, since instead of 0 we expect the PID of the target backend.
The Postgres logs do not show anything at the time of the error, except "FATAL: connection to client lost".

The same error occurred twice, in two independent runs. It did not occur when testing smaller or less active databases in the same environment.

@arajkumar
Contributor

@dverite Could you please try the suggestion from #773 (comment)?

That is, modify the TCP keepalive GUCs on the target Postgres?

alter system set tcp_keepalives_count=60;
alter system set tcp_keepalives_idle=10;
alter system set tcp_keepalives_interval=10;

alter system probably won't work on RDS; you may need to change these settings in the instance's Parameter Group instead.
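For RDS, a rough sketch of the same change through a parameter group, using boto3's `modify_db_parameter_group` call. The group name `my-pg14-params` is hypothetical, and this assumes the instance uses a custom (non-default) parameter group in which these three parameters are modifiable; it has not been run against the environment in this report.

```python
def keepalive_parameters(count=60, idle=10, interval=10, apply_method="immediate"):
    """Build the Parameters payload for the three keepalive settings suggested above."""
    settings = {
        "tcp_keepalives_count": count,
        "tcp_keepalives_idle": idle,
        "tcp_keepalives_interval": interval,
    }
    return [
        {
            "ParameterName": name,
            "ParameterValue": str(value),
            "ApplyMethod": apply_method,
        }
        for name, value in settings.items()
    ]


def apply_to_parameter_group(group_name, parameters):
    """Send the payload to RDS. Needs boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the payload can be built without boto3 installed

    rds = boto3.client("rds")
    return rds.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=parameters,
    )


payload = keepalive_parameters()
for entry in payload:
    print(entry["ParameterName"], "=", entry["ParameterValue"])

# Hypothetical group name; uncomment only against a real, non-default parameter group:
# apply_to_parameter_group("my-pg14-params", payload)
```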

@dverite
Author

dverite commented Oct 29, 2024

This particular migration cannot be retried: it happened in the production environment only.

According to the console output, pgcopydb adds keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60 to the URL of the connection.

Independently of that, the server-side parameters that were active were:

       Parameter        | Value 
-------------------------+-------
 tcp_keepalives_count    | 2
 tcp_keepalives_idle     | 300
 tcp_keepalives_interval | 30
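To compare the two sets of values, here is a small sketch that estimates how long each side would take to declare an idle, unresponsive peer dead, using the usual approximation idle + interval × count. The host and database name in the URI are placeholders (only the query string comes from the console output), and the helper names are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def client_keepalives(uri):
    """Extract the keepalives* settings from a connection URI's query string."""
    params = parse_qs(urlsplit(uri).query)
    return {k: int(v[0]) for k, v in params.items() if k.startswith("keepalives")}


def detection_window(idle, interval, count):
    """Approximate seconds before an idle, dead peer is detected."""
    return idle + interval * count


# Placeholder host and database; the query string is the one pgcopydb appended.
uri = (
    "postgres://user@target.example.com/db"
    "?keepalives=1&keepalives_idle=10&keepalives_interval=10&keepalives_count=60"
)

client = client_keepalives(uri)
client_window = detection_window(
    client["keepalives_idle"],
    client["keepalives_interval"],
    client["keepalives_count"],
)

# Server-side values from the table above.
server_window = detection_window(idle=300, interval=30, count=2)

print("client:", client_window, "seconds")  # 10 + 10 * 60 = 610
print("server:", server_window, "seconds")  # 300 + 30 * 2 = 360
```

Under this approximation the server would give up on a silent client after about 6 minutes and the client would give up on a silent server after about 10 minutes. The window only applies while the connection is idle, so it does not by itself account for an error raised in the middle of an active replay.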

Also, no timeout was mentioned in our error output, contrary to the case of issue #773.
