
PostgreSQL 16 filesystem snapshot in combination with PIT recovery: "invalid checkpoint record"

0 votes
0 answers
81 views
We are migrating many deployments to a Kubernetes cluster, and I'm currently testing and documenting the restore process for our PostgreSQL databases.

In our legacy environment we take NetApp snapshots of the database volumes. We also write WAL at replica level. In the rare cases where rolling back to a snapshot was not enough, we saved the WAL files from pg_wal and from our wal_archive, restored from the snapshot, replaced the WAL files in the snapshot with the previously saved ones, and configured a recovery_target_time. To my knowledge this always worked when it was needed, although it was needed very rarely.

Now I'd like to do the same in the Kubernetes environment. It is still a NetApp snapshot, although it is now triggered by Velero; technically NetApp is still taking the snapshot. I can start the database from the snapshot: it will of course complain that it was not shut down correctly, but it recovers. But whenever I try a point-in-time recovery by removing all WAL files from the snapshot before starting the database and replacing them with all WAL files from the live system, I get a variant of "invalid checkpoint record":
2024-08-14 09:25:58.630 CEST  LOG:  starting PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-08-14 09:25:58.631 CEST  LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-08-14 09:25:58.631 CEST  LOG:  listening on IPv6 address "::", port 5432
2024-08-14 09:25:58.636 CEST  LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-08-14 09:25:58.674 CEST  LOG:  database system was interrupted; last known up at 2024-08-13 09:36:16 CEST
sh: 1: cannot open /var/lib/postgresql/wal/pg_wal/wal_archive/00000002.history.gz: No such file
2024-08-14 09:26:04.265 CEST  LOG:  starting point-in-time recovery to 2024-08-14 08:40:00+02
2024-08-14 09:26:04.267 CEST  LOG:  invalid checkpoint record
2024-08-14 09:26:04.267 CEST  PANIC:  could not locate a valid checkpoint record
2024-08-14 09:26:04.599 CEST  LOG:  startup process (PID 41) was terminated by signal 6: Aborted
2024-08-14 09:26:04.599 CEST  LOG:  aborting startup due to startup process failure
2024-08-14 09:26:04.603 CEST  LOG:  database system is shut down
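
One way to narrow down the "invalid checkpoint record" PANIC is to check whether the WAL segment that the snapshot's pg_control points at is still available after the WAL files have been swapped. The following is only a sketch under assumptions, not part of the original setup: the helper names redo_wal_file and find_segment are hypothetical, the directories in the usage comment are guessed from the archive path in the log, and the parsing relies on the "Latest checkpoint's REDO WAL file" line printed by pg_controldata (run with LC_ALL=C so the label is not translated).

```shell
#!/bin/sh
# Hypothetical helper: read pg_controldata output on stdin and print the
# name of the WAL segment that holds the latest checkpoint's REDO location.
redo_wal_file() {
    sed -n "s/^Latest checkpoint's REDO WAL file:[[:space:]]*//p"
}

# Hypothetical helper: find_segment NAME DIR...
# Prints the first DIR that contains NAME (plain or gzip-compressed) and
# returns 0; returns 1 if no directory contains it.
find_segment() {
    seg=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$seg" ] || [ -f "$dir/$seg.gz" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Usage (assumed paths; pg_controldata is assumed to be on PATH):
#   seg=$(LC_ALL=C pg_controldata "$PGDATA" | redo_wal_file)
#   find_segment "$seg" /var/lib/postgresql/wal/pg_wal \
#       /var/lib/postgresql/wal/pg_wal/wal_archive || echo "missing: $seg"
```

If the segment is found neither in pg_wal nor in the archive, that would at least be consistent with the PANIC, since recovery has to start replay from that checkpoint.
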
Does the recovery_target_time conflict with the crash recovery?
archive_command = 'test ! -f /var/lib/postgresql/wal/pg_wal/wal_archive/%f.gz && /bin/gzip -c %p > /var/lib/postgresql/wal/pg_wal/wal_archive/%f.gz'

restore_command = 'gunzip < /var/lib/postgresql/wal/pg_wal/wal_archive/%f.gz %p'

recovery_target_time = '2024-08-14 08:40:00 Europe/Berlin'
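
For comparison, here is a minimal sketch of a gzip archive/restore pair written as shell functions, so the round trip can be tried outside PostgreSQL. The function names archive_wal and restore_wal and their argument order are hypothetical. The point of the sketch is that the restore side redirects the output of gunzip into the destination path, which is the role %p plays in restore_command. The restore_command quoted above has no ">" before %p, so it may be worth checking whether that is only a typo in this post or also present in the real configuration.

```shell
#!/bin/sh
# Hypothetical stand-in for archive_command.
# archive_wal SRC NAME ARCHIVE_DIR
# Compresses SRC to ARCHIVE_DIR/NAME.gz and refuses to overwrite an
# existing archive file (non-zero exit status in that case).
archive_wal() {
    a_src=$1
    a_name=$2
    a_dir=$3
    test ! -f "$a_dir/$a_name.gz" && gzip -c "$a_src" > "$a_dir/$a_name.gz"
}

# Hypothetical stand-in for restore_command.
# restore_wal NAME DEST ARCHIVE_DIR
# Decompresses ARCHIVE_DIR/NAME.gz into DEST. If the archive file does not
# exist, the input redirection fails and the status is non-zero, which is
# how PostgreSQL learns that a file is not in the archive.
restore_wal() {
    r_name=$1
    r_dest=$2
    r_dir=$3
    gunzip < "$r_dir/$r_name.gz" > "$r_dest"
}
```

Written as a postgresql.conf setting, the same idea would read restore_command = 'gunzip < /var/lib/postgresql/wal/pg_wal/wal_archive/%f.gz > %p'. A failed lookup of a file such as 00000002.history.gz is expected during recovery and is not an error by itself.
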
I also tried copying the saved WAL files over the ones in the snapshot, since crash recovery from the snapshot does work, but this only led to complaints about corruption and a premature abort of recovery. I fail to see why we were able to do recoveries in the past and now can't.

There is one difference. In the Kubernetes test there was an error: a separate pod was launched with a script that was supposed to delete old WAL files from wal_archive. This script was misconfigured and ran a pg_basebackup. The information about that base backup could possibly be the problem, I hope… I will rebuild the setup and try again with a configuration without the base backup.

Do you have any suggestions? I was hoping to take safe snapshots with pg_backup_start()/pg_backup_stop() as Velero pre- and post-hooks. But this does not look like a good fit, because one session has to stay open during the snapshot, between start and stop. I might not be able to do this as a pre-/post-hook.
Asked by pema83 (23 rep)
Aug 14, 2024, 08:36 AM