Track logical replication slot catalog_xmin age (#19083)
Logical replication subscribers report the catalog_xmin to the
publisher. This is exposed through the pg_replication_slots system view.

The catalog_xmin also has effects beyond the system catalogs, notably on
index page reuse: a deleted index page can only be marked reusable once
nothing within the global shared visibility horizon can still see it.
That horizon includes catalog_xmin, so a lagging logical replication
slot can increase index size by preventing index pages from being reused.

The first iteration of metrics from pg_replication_slots didn't include
catalog_xmin. This patch adds it and reports the catalog_xmin age as a
metric.
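
As a rough illustration of how the new metric could be used, the sketch below flags slots whose catalog_xmin age exceeds a limit. The names `SlotAge`, `lagging_slots` and `DEFAULT_MAX_AGE`, as well as the threshold value, are assumptions made for this example and are not part of the patch; only the SQL expression is taken from it.

```python
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical threshold, chosen only for illustration; tune it per workload.
DEFAULT_MAX_AGE = 50_000_000

# The age expression added by the patch, reduced to two columns for this sketch.
SLOT_AGE_QUERY = """
SELECT slot_name,
       CASE WHEN catalog_xmin IS NULL THEN NULL ELSE age(catalog_xmin) END
FROM pg_replication_slots
"""


class SlotAge(NamedTuple):
    slot_name: str
    # None when the slot reports no catalog_xmin (typically physical slots).
    catalog_xmin_age: Optional[int]


def lagging_slots(rows: Iterable[SlotAge], max_age: int = DEFAULT_MAX_AGE) -> List[str]:
    """Return the names of slots whose catalog_xmin age exceeds max_age."""
    return [
        row.slot_name
        for row in rows
        if row.catalog_xmin_age is not None and row.catalog_xmin_age > max_age
    ]
```

Rows fetched with `SLOT_AGE_QUERY` through any PostgreSQL driver could be wrapped in `SlotAge` and passed to `lagging_slots`; slots with a NULL age are ignored rather than treated as zero.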
bonnefoa authored Nov 20, 2024
1 parent 3fe5fea commit 3d09b28
Showing 4 changed files with 5 additions and 0 deletions.
1 change: 1 addition & 0 deletions postgres/changelog.d/19083.added
@@ -0,0 +1 @@
Track logical replication slot catalog_xmin age
2 changes: 2 additions & 0 deletions postgres/datadog_checks/postgres/util.py
@@ -383,6 +383,7 @@ def get_list_chunks(lst, n):
CASE WHEN temporary THEN 'temporary' ELSE 'permanent' END,
CASE WHEN active THEN 'active' ELSE 'inactive' END,
CASE WHEN xmin IS NULL THEN NULL ELSE age(xmin) END,
CASE WHEN catalog_xmin IS NULL THEN NULL ELSE age(catalog_xmin) END,
pg_wal_lsn_diff(
CASE WHEN pg_is_in_recovery() THEN pg_last_wal_receive_lsn() ELSE pg_current_wal_lsn() END, restart_lsn),
pg_wal_lsn_diff(
@@ -395,6 +396,7 @@ def get_list_chunks(lst, n):
{'name': 'slot_persistence', 'type': 'tag'},
{'name': 'slot_state', 'type': 'tag'},
{'name': 'replication_slot.xmin_age', 'type': 'gauge'},
{'name': 'replication_slot.catalog_xmin_age', 'type': 'gauge'},
{'name': 'replication_slot.restart_delay_bytes', 'type': 'gauge'},
{'name': 'replication_slot.confirmed_flush_delay_bytes', 'type': 'gauge'},
],
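
The two hunks above add the SQL expression and its column descriptor at the same position, right after the xmin entries, which suggests the descriptors are matched to the query's output columns by position. A simplified sketch of that pairing follows; `row_to_metrics` is a hypothetical helper written for this illustration, not the integration's actual query executor, and skipping NULL values is an assumption of the sketch.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Subset of the column descriptors shown in the diff above.
COLUMNS = [
    {'name': 'slot_persistence', 'type': 'tag'},
    {'name': 'slot_state', 'type': 'tag'},
    {'name': 'replication_slot.xmin_age', 'type': 'gauge'},
    {'name': 'replication_slot.catalog_xmin_age', 'type': 'gauge'},
]


def row_to_metrics(
    row: Sequence[Any], columns: Sequence[Dict[str, str]] = COLUMNS
) -> Tuple[List[str], Dict[str, Any]]:
    """Pair each value in a result row with the descriptor at the same index."""
    tags: List[str] = []
    gauges: Dict[str, Any] = {}
    for column, value in zip(columns, row):
        if value is None:
            # Assumption: a NULL value (e.g. no catalog_xmin) yields no metric.
            continue
        if column['type'] == 'tag':
            tags.append(f"{column['name']}:{value}")
        else:
            gauges[column['name']] = value
    return tags, gauges
```

Under this positional scheme, adding the descriptor without the matching SQL expression (or the reverse) would shift every later column by one, which is why the patch changes both places together.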
1 change: 1 addition & 0 deletions postgres/metadata.csv
@@ -123,6 +123,7 @@ postgresql.replication.wal_replay_lag,gauge,,second,,"Time elapsed between flush
postgresql.replication.wal_write_lag,gauge,,second,,Time elapsed between flushing recent WAL locally and receiving notification that this standby server has written it (but not yet flushed it or applied it). This can be used to gauge the delay that synchronous_commit level remote_write incurred while committing if this server was configured as a synchronous standby. Only available with postgresql 10 and newer.,-1,postgres,repl write lag,
postgresql.replication_delay,gauge,,second,,The current replication delay in seconds. Only available with postgresql 9.1 and newer,-1,postgres,repl delay,
postgresql.replication_delay_bytes,gauge,,byte,,The current replication delay in bytes. Only available with postgresql 9.2 and newer,-1,postgres,repl delay bytes,
postgresql.replication_slot.catalog_xmin_age,gauge,,transaction,,"The age of the oldest transaction affecting the system catalogs that this slot needs the database to retain. VACUUM cannot remove catalog tuples deleted by any later transaction. This metric is tagged with slot_name, slot_type, slot_persistence, slot_state.",-1,postgres,repslot catalog_xmin,
postgresql.replication_slot.confirmed_flush_delay_bytes,gauge,,byte,,"The delay in bytes between the current WAL position and last position this slot's consumer confirmed. This is only available for logical replication slots. This metric is tagged with slot_name, slot_type, slot_persistence, slot_state.",-1,postgres,repslot flush,
postgresql.replication_slot.restart_delay_bytes,gauge,,byte,,"The amount of WAL bytes that the consumer of this slot may require and won't be automatically removed during checkpoints unless it exceeds max_slot_wal_keep_size parameter. Nothing is reported if there's no WAL reservation for this slot. This metric is tagged with slot_name, slot_type, slot_persistence, slot_state.",-1,postgres,repslot restart,
postgresql.replication_slot.spill_bytes,count,,byte,,"Amount of decoded transaction data spilled to disk while performing decoding of changes from WAL for this slot. This and other spill counters can be used to gauge the I/O occurred during logical decoding and allow tuning logical_decoding_work_mem. Extracted from pg_stat_replication_slots. Only available with PostgreSQL 14 and newer. This metric is tagged with slot_name, slot_type, slot_state.",-1,postgres,repslot spill_byte,
1 change: 1 addition & 0 deletions postgres/tests/common.py
@@ -314,6 +314,7 @@ def check_replication_slots(aggregator, expected_tags, count=1):
for metric_name in _iterate_metric_name(QUERY_PG_REPLICATION_SLOTS):
if 'slot_type:physical' in expected_tags and metric_name in [
'postgresql.replication_slot.confirmed_flush_delay_bytes',
'postgresql.replication_slot.catalog_xmin_age',
]:
continue
if 'slot_type:logical' in expected_tags and metric_name in [