When queries start executing on the standby server, the replay lag starts to increase

postgresql replication high-availability master-slave-replication troubleshooting

I have two servers, each with the following specs: 8 vCPU, 32768 MB RAM, 640 GB SSD.

The master Postgres 13.3 database (db1) is installed on the first server (Ubuntu 16.04.7) with the following config:

shared_buffers = 16GB 
work_mem = 128MB
maintenance_work_mem = 8GB
effective_cache_size = 16GB

effective_io_concurrency = 400
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

wal_level = logical
synchronous_commit = on
max_wal_size = 4GB
min_wal_size = 32MB
wal_keep_size = 16384
wal_sender_timeout = 60s
checkpoint_completion_target = 0.7

synchronous_standby_names = 'FIRST 1 (db2_slave)'
max_standby_archive_delay = 1800s
max_standby_streaming_delay = 1800s

The standby is a Postgres 13.4 database (db2) installed on the second server (Ubuntu 20.04.3) with the following config:

shared_buffers = 24GB
work_mem = 128MB
maintenance_work_mem = 16GB
effective_cache_size = 24GB

effective_io_concurrency = 400
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8

wal_level = logical
synchronous_commit = on
max_wal_size = 4GB
min_wal_size = 32MB
checkpoint_completion_target = 0.7

primary_conninfo = 'host=... port=5432 user=repluser passfile=''...'' application_name=db2_slave'
primary_slot_name = 'db2'
hot_standby = on
max_standby_archive_delay = 1800s
max_standby_streaming_delay = 1800s

If I run iotop -u postgresql on the standby, I see two processes:

2229172 postgres: 13/main: walreceiver streaming DDFD/8E9FE9E0
2229138 postgres: 13/main: startup recovering 000000010000DDFD0000008E

After I run a **read query that takes a few seconds** on the standby (SELECT COUNT(*) FROM big_table;), the walreceiver keeps streaming, but the replica stops syncing and the startup process shows "waiting":

2229138 postgres: 13/main: startup recovering 000000010000DE0400000017 waiting
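
The same state can also be inspected from SQL on the standby rather than from process titles. The following is a minimal sketch, assuming a session connected to db2 with permission to read pg_stat_activity. The column aliases are illustrative and not part of the original setup.

```sql
-- Run on the standby (db2); these functions and views exist in PostgreSQL 13.
SELECT pg_is_in_recovery()       AS in_recovery,
       pg_is_wal_replay_paused() AS replay_paused,
       pg_last_wal_receive_lsn() AS received_lsn,
       pg_last_wal_replay_lsn()  AS replayed_lsn,
       -- WAL that has been received but not yet applied by the startup process
       pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                      pg_last_wal_replay_lsn())) AS not_yet_replayed,
       now() - pg_last_xact_replay_timestamp() AS since_last_replayed_commit;

-- What the startup and walreceiver processes are currently waiting on, if anything
SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type IN ('startup', 'walreceiver');
```

Note that since_last_replayed_commit also grows when the primary is simply idle, so it is only meaningful as a lag measure while the primary is writing.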

I ran this query on master:

SELECT client_addr                                                       as client,
       usename                                                           as user,
       application_name                                                  as name,
       state,
       sync_state                                                        as mode,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   as pending,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn))              as write,
       pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn))             as flush,
       pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn))            as replay,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as total_lag
FROM pg_stat_replication;

And the output was:

   client    |   user   |   name    |   state   | mode | pending |  write  |  flush  | replay | total_lag 
-------------+----------+-----------+-----------+------+---------+---------+---------+--------+-----------
 ...         | repluser | db2_slave | streaming | sync | 0 bytes | 0 bytes | 0 bytes | 21 MB  | 21 MB
(1 row)
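
The query above reports lag in bytes. As a complementary sketch, pg_stat_replication also exposes the lag as time intervals (write_lag, flush_lag, replay_lag, available since PostgreSQL 10), which may make it easier to compare against the duration of the query running on the standby:

```sql
-- Run on the primary (db1).
SELECT application_name,
       sync_state,
       write_lag,   -- time until the standby reported the WAL as written
       flush_lag,   -- time until the standby reported the WAL as flushed
       replay_lag   -- time until the standby reported the WAL as replayed
FROM pg_stat_replication;
```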

If I execute this query several times, the replay lag and total lag keep increasing for as long as the query (SELECT COUNT(*) FROM big_table) is running. Therefore, I would like to know:

1) Why does the replay lag keep increasing while an analytical query is running on the replica?
2) Why does the recovery process go into the "waiting" state as soon as I start a query on the standby?
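
One thing that may be worth checking here, as an assumption rather than a confirmed diagnosis, is whether WAL replay is being held back by a recovery conflict with the running query. With max_standby_streaming_delay = 1800s, the startup process is allowed to wait up to 30 minutes for a conflicting query before canceling it. A minimal sketch of what to look at on the standby:

```sql
-- Run on the standby (db2).
SHOW max_standby_streaming_delay;
SHOW hot_standby_feedback;

-- Counters of queries canceled because of recovery conflicts, per database.
-- A query that finishes before the delay expires is not canceled,
-- so these may stay at zero even while replay is waiting.
SELECT datname,
       confl_snapshot,
       confl_lock,
       confl_bufferpin,
       confl_deadlock,
       confl_tablespace
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```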

Asked by Andrei (111 rep)

Oct 30, 2021, 10:12 PM
Last activity: Nov 2, 2021, 05:21 AM
