
Autovacuum unable to vacuum, causing restarts of Aurora Postgres DB

0 votes
2 answers
178 views
(Question moved from Stack Overflow to dba.stackexchange.)

I am using AWS Aurora PostgreSQL 13.8 on a db.r6i.24xlarge instance. We have around 35-40 replication slots in the DB, and we dropped a few of them because we were seeing restarts of the DB. CPU, memory, and read/write IOPS are normal. We have verified that there are no long-running transactions and no blocked pids.

We dropped around 12 replication slots (first 7, then 5). The xmin horizon is advancing, but autovacuum is still not able to fully vacuum the table. Below are the relevant log entries.

After we dropped the first 7 slots, the logs looked like this:

    2024-10-03 14:20:19 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
    pages: 0 removed, 4175180 remain, 0 skipped due to pins, 649889 skipped frozen
    tuples: 0 removed, 63044551 remain, 62974732 are dead but not yet removable, oldest xmin: 850469227
    buffer usage: 7064601 hits, 0 misses, 0 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
    system usage: CPU: user: 9.86 s, system: 0.02 s, elapsed: 50.84 s
    WAL usage: 0 records, 0 full page images, 0 bytes

Then we dropped 5 more slots, and the logs looked like this:

    2024-10-03 17:02:12 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
    pages: 0 removed, 4175180 remain, 0 skipped due to pins, 2299157 skipped frozen
    tuples: 0 removed, 33025155 remain, 32953376 are dead but not yet removable, oldest xmin: 883545046
    buffer usage: 3757715 hits, 0 misses, 0 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
    system usage: CPU: user: 4.93 s, system: 0.00 s, elapsed: 15.72 s
    WAL usage: 0 records, 0 full page images, 0 bytes

I understand that the number of dead tuples has gone down significantly, but how many more replication slots do we need to drop to get rid of this error? Apart from the above, we have also started seeing the following in the logs:

    2024-10-04 07:03:15 UTC::@::LOG: automatic vacuum of table "..": index scans: 1
    pages: 0 removed, 5949 remain, 0 skipped due to pins, 557 skipped frozen
    tuples: 5485 removed, 79016 remain, 0 are dead but not yet removable, oldest xmin: 903322403
    buffer usage: 3757819 hits, 0 misses, 0 dirtied
    avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
    system usage: CPU: user: 2.03 s, system: 0.00 s, elapsed: 13.15 s
    WAL usage: 0 records, 0 full page images, 0 bytes

As per the latest comment by Laurenz Albe on Stack Overflow, I was asked to **drop all stale replication slots (where restart_lsn in pg_replication_slots is old)**. I ran this query:

    select slot_name, active, catalog_xmin, restart_lsn,
           pg_current_wal_insert_lsn() - restart_lsn as diff_lsn
    from pg_replication_slots
    order by diff_lsn desc;

The first two slots show a huge difference (diff_lsn of 793117723072 and 427503896816), and all the remaining slots are at 92440376. I understand that the first two definitely need to be dropped, but what about the remaining ones, since their difference is also large? And how can I prevent such scenarios in the future?
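For reference, this is roughly how I plan to clean up the remaining stale slots. It is only a minimal sketch using standard PostgreSQL functions; the 1 GiB cutoff and the slot name are placeholders I chose, not recommendations:

    -- List slots whose restart_lsn lags far behind the current insert LSN.
    -- The 1 GiB cutoff below is an arbitrary placeholder threshold.
    SELECT slot_name,
           active,
           catalog_xmin,
           restart_lsn,
           pg_size_pretty(pg_current_wal_insert_lsn() - restart_lsn) AS lag
    FROM pg_replication_slots
    WHERE pg_current_wal_insert_lsn() - restart_lsn > 1073741824  -- 1 GiB
    ORDER BY pg_current_wal_insert_lsn() - restart_lsn DESC;

    -- Drop one stale slot at a time ('stale_slot_name' is a placeholder).
    -- An active slot cannot be dropped while its consumer is still connected.
    SELECT pg_drop_replication_slot('stale_slot_name');

The idea for prevention would be to run the first query periodically and alert when any slot's lag crosses the threshold, so that stale slots are noticed before they hold back the xmin horizon again, but I am open to better approaches.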
Asked by Nik (101 rep)
Oct 4, 2024, 11:06 AM
Last activity: Jul 17, 2025, 02:00 AM