Autovacuum unable to vacuum, causing restarts of Aurora Postgres DB
0 votes, 2 answers, 178 views
(Question moved from Stack Overflow to dba.stackexchange)
I am using AWS Aurora Postgres 13.8. We have around 35-40 replication slots in the DB, and we dropped a few of them because we were seeing restarts of the DB. CPU, memory, and read/write IOPS are normal. We are currently on db.r6i.24xlarge. We have verified that there are no long-running transactions and no blocked pids.
We dropped around 12 replication slots in total (first 7, then 5 more). The xmin horizon is advancing, but autovacuum is still not able to fully vacuum the tables. Below is from the logs.
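For reference, the slots were dropped with the standard `pg_drop_replication_slot()` function (the slot name below is just an example):

```sql
-- Inspect slots and their state before dropping
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- Drop one stale slot (slot name is illustrative)
SELECT pg_drop_replication_slot('stale_slot_name');
```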
First we dropped 7 slots; the logs after that were like:
2024-10-03 14:20:19 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
pages: 0 removed, 4175180 remain, 0 skipped due to pins, 649889 skipped frozen
tuples: 0 removed, 63044551 remain, 62974732 are dead but not yet removable, oldest xmin: 850469227
buffer usage: 7064601 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 9.86 s, system: 0.02 s, elapsed: 50.84 s
WAL usage: 0 records, 0 full page images, 0 bytes
Then we dropped 5 more slots; the logs were then like:
2024-10-03 17:02:12 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
pages: 0 removed, 4175180 remain, 0 skipped due to pins, 2299157 skipped frozen
tuples: 0 removed, 33025155 remain, 32953376 are dead but not yet removable, oldest xmin: 883545046
buffer usage: 3757715 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 4.93 s, system: 0.00 s, elapsed: 15.72 s
WAL usage: 0 records, 0 full page images, 0 bytes
I do understand that the number of dead tuples has been reduced significantly, but how many more replication slots do we need to drop in order to resolve this?
We have also started seeing the log entry below, in addition to the ones above.
2024-10-04 07:03:15 UTC::@::LOG: automatic vacuum of table "..": index scans: 1
pages: 0 removed, 5949 remain, 0 skipped due to pins, 557 skipped frozen
tuples: 5485 removed, 79016 remain, 0 are dead but not yet removable, oldest xmin: 903322403
buffer usage: 3757819 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 2.03 s, system: 0.00 s, elapsed: 13.15 s
WAL usage: 0 records, 0 full page images, 0 bytes
As per the latest comment by Laurenz Albe on Stack Overflow, I was asked to **drop all stale replication slots (where restart_lsn in pg_replication_slots is old)**.
I ran this query:
select slot_name, active, catalog_xmin, restart_lsn, pg_current_wal_insert_lsn()-restart_lsn as diff_lsn from pg_replication_slots order by diff_lsn desc;
The first 2 slots have a huge difference, with diff_lsn values of 793117723072 and 427503896816; all the remaining slots show 92440376. I understand the first 2 definitely need to be dropped. But what about the remaining ones, since their difference is also large?
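The same lag check can be made easier to read with `pg_wal_lsn_diff()` and `pg_size_pretty()` (a sketch equivalent to the query above):

```sql
SELECT slot_name,
       active,
       catalog_xmin,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn) DESC;
```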
Also, how can I prevent such scenarios in the future?
Asked by Nik
(101 rep)
Oct 4, 2024, 11:06 AM
Last activity: Jul 17, 2025, 02:00 AM