Sample Header Ad - 728x90

Postgres Hot standby stuck starting waiting for contrrecord despite wal file being present

1 vote
2 answers
369 views
We are experimenting with [patroni Multi Datacenter HA](https://patroni.readthedocs.io/en/latest/ha_multi_dc.html) with postgres16 using replication slots. At high level for building the remote cluster from scratch , we are by running base backup, while active is experiencing writes . We use replication slot for hot standby via patroni . Here is example of active cluster config
postgresql:
  parameters:
    archive_command: /bin/true
    archive_mode: 'on'
    checkpoint_completion_target: '0.9'
    checkpoint_timeout: 300
    default_statistics_target: 100
    effective_cache_size: 12GB
    effective_io_concurrency: 2
    fsync: 'on'
    full_page_writes: 'on'
    hot_standby: true
    hot_standby_feedback: true
    idle_in_transaction_session_timeout: 10min
    log_min_duration_statement: 10000
    log_min_error_statement: ERROR
    log_min_messages: WARNING
    log_temp_files: 4000
    maintenance_work_mem: 3GB
    max_connections: 512
    max_parallel_maintenance_workers: 4
    max_parallel_workers: 10
    max_parallel_workers_per_gather: 2
    max_prepared_transactions: 1024
    max_replication_slots: 10
    max_slot_wal_keep_size: 20GB
    max_wal_senders: 10
    max_wal_size: 2GB
    max_worker_processes: 10
    min_wal_size: 512MB
    random_page_cost: 2
    shared_buffers: 8GB
    synchronous_commit: remote_write
    track_functions: all
    track_io_timing: 'on'
    wal_buffers: 16MB
    wal_compression: 'on'
    wal_keep_segments: 128
    wal_level: replica
    work_mem: 16MB
  use_pg_rewind: true
  use_slots: true
retry_timeout: 14
What we see in experiments is, once PG is started post basebackup via patroni, it moves to patroni leader/running state , then we apply standby cluster config
curl -X PATCH http://localhost:8008/config \
     -H "Content-Type: application/json" \
     -d '{
           "standby_cluster": {
             "host": "standby.example.com",
             "port": 5432,
             "primary_slot_name": "standby_slot"
           }
         }' \
     --data-urlencode "force=true"
This results in pg rewind , followed by cluster moving to Standby leader / streaming state most of the time. But at other times, it moves to Standby leader/ starting state at the end of rewind .What we see in pg logs is something similar to below
2024-08-19 20:02:00 UTC:::@::LOG:  contrecord is requested by 93/4C000028
2024-08-19 20:02:00 UTC:::@::LOG:  waiting for WAL to become available at 93/4C000040
Further analyzing lsn in replication slot indicates this record is generated post base backup time,was archived by leader just after base backup. In standby cluster , the wal file associated with lsn is present in pg_wal dir , but database is stuck on starting up , it never goes to streaming state and current state of replication slot is not active. Why wal file not being sent to hot standby despite being present on active cluster , to enable it move it to streaming state ?
Asked by Ganesh (121 rep)
Aug 19, 2024, 09:48 PM
Last activity: Sep 7, 2024, 04:12 AM