While keeping an eye on production CDC task errors/warnings for an unrelated rollout, I noticed an interesting pattern. We have at least four production capture tasks from Neon Postgres instances which keep getting restarted and then quickly shutting down with the following pattern of logs:
```
2024-10-15T20:53:22Z INFO initializing connector {"address":"REDACTED.neon.tech:5432","database":"REDACTED","slot":"REDACTED","user":"REDACTED"}
2024-10-15T20:53:22Z INFO queried database version {REDACTED}
2024-10-15T20:53:23Z INFO discovering tables null
2024-10-15T20:53:23Z INFO starting replication {"publication":"flow_publication","slot":"flow_slot","startLSN":"REDACTED"}
2024-10-15T20:53:23Z INFO streaming to fence {"fenceAfter":"0s"}
2024-10-15T20:53:23Z INFO advancing confirmed LSN {"ackLSN":"REDACTED"}
2024-10-15T20:53:23Z INFO processed replication events {"events":4}
2024-10-15T20:53:23Z INFO streaming to fence {"fenceAfter":"1m0s"}
2024-10-15T20:53:33Z INFO advancing confirmed LSN {"ackLSN":"REDACTED"}
2024-10-15T20:53:53Z INFO processed replication events {"events":0}
2024-10-15T20:53:53Z WARN replication stream closed unexpectedly null
2024-10-15T20:53:53.755809536Z WARN connector exited with unconsumed input stream remainder {"module":"connector_init::rpc"}
```
We can see in the above logs that we connected and streamed to a fence (which in Postgres still implies a watermark write) at 2024-10-15T20:53:23Z, then went into an indefinite time-based streaming phase, and got disconnected when the replication stream closed at 2024-10-15T20:53:53Z.
This appears to be "benign" in the sense that the captures remain connected so long as replication events are flowing, and then shut down after exactly 30 seconds of idleness. I'm going to draw the obvious inference that Neon may be disconnecting idle connections very aggressively here.
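One quick way to test that hypothesis would be to open a bare logical replication connection against the same Neon endpoint, start streaming from a slot, deliberately send nothing, and time how long the idle stream survives. Here's a minimal sketch, assuming the jackc/pgconn and jackc/pglogrepl libraries; the DSN environment variable, slot name, and publication name are placeholders, not anything taken from the real task:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/jackc/pglogrepl"
	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	ctx := context.Background()

	// The DSN must include "replication=database" so this becomes a
	// logical replication connection. Placeholder environment variable.
	conn, err := pgconn.Connect(ctx, os.Getenv("REPLICATION_DSN"))
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close(ctx)

	sysident, err := pglogrepl.IdentifySystem(ctx, conn)
	if err != nil {
		log.Fatalf("identify system: %v", err)
	}

	// Placeholder slot name: use a pre-created throwaway pgoutput slot,
	// not the production one, so this doesn't steal the capture's stream.
	err = pglogrepl.StartReplication(ctx, conn, "idle_test_slot", sysident.XLogPos,
		pglogrepl.StartReplicationOptions{PluginArgs: []string{
			"proto_version '1'",
			"publication_names 'flow_publication'",
		}})
	if err != nil {
		log.Fatalf("start replication: %v", err)
	}

	// Read from the stream but never send a Standby Status Update, then
	// report how long the idle connection lasts before something closes it.
	start := time.Now()
	for {
		if _, err := conn.ReceiveMessage(ctx); err != nil {
			log.Printf("stream closed after %s: %v", time.Since(start), err)
			return
		}
	}
}
```

If that bare connection also drops at roughly 30 seconds, that would point at an idle cutoff in front of the database rather than anything the connector is doing.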
One odd thing, of course, is that we're supposed to be sending a "Standby Status Update" message on that replication stream every 10 seconds on the dot, regardless of whether the confirmed LSN has advanced -- we just don't log it unless the LSN advances, to avoid spamming the logs even worse. So it shouldn't be a TCP-level connection timeout, since the connection can never be idle for more than 10 seconds at a stretch, yet we're only getting disconnected after 30 seconds.
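For reference, that keepalive cadence looks roughly like the sketch below. This is not the connector's actual code, just an illustration of the pattern, again assuming jackc/pglogrepl and jackc/pgconn: a deadline-based loop sends a Standby Status Update carrying the confirmed LSN every 10 seconds, whether or not any replication traffic arrives, so the connection should never sit quiet for more than 10 seconds.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pglogrepl"
	"github.com/jackc/pgx/v5/pgconn"
	"github.com/jackc/pgx/v5/pgproto3"
)

// streamWithKeepalives is a hypothetical helper illustrating the cadence: it
// consumes the replication stream while sending a Standby Status Update every
// 10 seconds, regardless of whether confirmedLSN has advanced.
func streamWithKeepalives(ctx context.Context, conn *pgconn.PgConn, confirmedLSN pglogrepl.LSN) error {
	const standbyInterval = 10 * time.Second
	nextStatus := time.Now().Add(standbyInterval)

	for {
		// Send the periodic status update whenever it comes due.
		if time.Now().After(nextStatus) {
			err := pglogrepl.SendStandbyStatusUpdate(ctx, conn,
				pglogrepl.StandbyStatusUpdate{WALWritePosition: confirmedLSN})
			if err != nil {
				return fmt.Errorf("sending standby status update: %w", err)
			}
			nextStatus = time.Now().Add(standbyInterval)
		}

		// Wait for the next message, but never past the next status deadline.
		recvCtx, cancel := context.WithDeadline(ctx, nextStatus)
		msg, err := conn.ReceiveMessage(recvCtx)
		cancel()
		if pgconn.Timeout(err) {
			continue // no traffic; loop around and send the keepalive
		} else if err != nil {
			return fmt.Errorf("replication stream closed unexpectedly: %w", err)
		}

		copyData, ok := msg.(*pgproto3.CopyData)
		if !ok {
			continue
		}
		switch copyData.Data[0] {
		case pglogrepl.PrimaryKeepaliveMessageByteID:
			// The server may request an immediate reply here; the 10-second
			// cadence above already covers responding promptly.
		case pglogrepl.XLogDataByteID:
			xld, err := pglogrepl.ParseXLogData(copyData.Data[1:])
			if err != nil {
				return fmt.Errorf("parsing XLogData: %w", err)
			}
			_ = xld // hand the change event off to downstream processing
		}
	}
}
```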
That's as much investigation and thought as I've given the matter at present, just filing an issue to look into this more later.