Autovacuum unable to vacuum, causing restarts of Aurora Postgres DB
0 votes, 2 answers, 178 views
(Question moved from Stack Overflow to dba.stackexchange)
I am using AWS Aurora Postgres 13.8. We have around 35-40 replication slots in the DB, and we dropped a few of them because we were seeing restarts of the DB. CPU, memory, and read/write IOPS are normal. We are currently on db.r6i.24xlarge. We have verified that there are no long-running transactions and no blocked pids.
We dropped around 12 replication slots in total (first 7, then 5 more). The xmin horizon is advancing, but autovacuum is still not able to fully vacuum the tables. Below is from the logs.
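For reference, the slots were dropped with the standard `pg_drop_replication_slot()` function (the slot name below is just an example):

```sql
-- Inspect slots and their state before dropping
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;

-- Drop one stale slot (slot name is illustrative)
SELECT pg_drop_replication_slot('stale_slot_name');
```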
First we dropped 7 slots; the logs after that were like:
2024-10-03 14:20:19 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
pages: 0 removed, 4175180 remain, 0 skipped due to pins, 649889 skipped frozen
tuples: 0 removed, 63044551 remain, 62974732 are dead but not yet removable, oldest xmin: 850469227
buffer usage: 7064601 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 9.86 s, system: 0.02 s, elapsed: 50.84 s
WAL usage: 0 records, 0 full page images, 0 bytes
Then we dropped 5 more slots; the logs were then like:
2024-10-03 17:02:12 UTC::@::LOG: automatic vacuum of table ".pg_catalog.pg_statistic": index scans: 0
pages: 0 removed, 4175180 remain, 0 skipped due to pins, 2299157 skipped frozen
tuples: 0 removed, 33025155 remain, 32953376 are dead but not yet removable, oldest xmin: 883545046
buffer usage: 3757715 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 4.93 s, system: 0.00 s, elapsed: 15.72 s
WAL usage: 0 records, 0 full page images, 0 bytes
I do understand that the number of dead tuples has been reduced significantly, but how many more replication slots do we need to drop in order to resolve this?
We have also started seeing the log entry below, in addition to the ones above.
2024-10-04 07:03:15 UTC::@::LOG: automatic vacuum of table "..": index scans: 1
pages: 0 removed, 5949 remain, 0 skipped due to pins, 557 skipped frozen
tuples: 5485 removed, 79016 remain, 0 are dead but not yet removable, oldest xmin: 903322403
buffer usage: 3757819 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 2.03 s, system: 0.00 s, elapsed: 13.15 s
WAL usage: 0 records, 0 full page images, 0 bytes
As per the latest comment by Laurenz Albe on Stack Overflow, I was asked to **drop all stale replication slots (where restart_lsn in pg_replication_slots is old)**.
I ran this query:
select slot_name, active, catalog_xmin, restart_lsn, pg_current_wal_insert_lsn()-restart_lsn as diff_lsn from pg_replication_slots order by diff_lsn desc;
The first 2 slots have a huge difference, with diff_lsn values of 793117723072 and 427503896816; all the remaining slots show 92440376. I understand the first 2 definitely need to be dropped. But what about the remaining ones, since their difference is also large?
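The same lag check can be made easier to read with `pg_wal_lsn_diff()` and `pg_size_pretty()` (a sketch equivalent to the query above):

```sql
SELECT slot_name,
       active,
       catalog_xmin,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_insert_lsn(), restart_lsn) DESC;
```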
Also, how can I prevent such scenarios in the future?
Asked by Nik
(101 rep)
Oct 4, 2024, 11:06 AM
Last activity: Jul 17, 2025, 02:00 AM