
Postgres pg_wal size increasing, possibly from Patroni outage

0 votes
0 answers
50 views
I've recently hit an issue where pg_wal files were not being cleaned up on a replica Postgres server, despite these having been archived on the master instance. Patroni was down at the time due to an ETCD outage; after correcting this and Patroni re-establishing its connections, the WAL files began to be cleaned up on the read-only instance again. I saw no replication lag during this period, and the second replica was not affected in the same way, even though ETCD was down on all 3 servers.

System

* Postgres 16
* Patroni 4.0.4
* 3 nodes in the cluster: 1 leader, 2 read-only replicas
* Cluster is managed by Patroni
* Replication is done via physical replication

Question

From my understanding of Patroni, an outage of Patroni itself shouldn't prevent the WAL files on the replica instance from being removed. Am I missing something here in terms of the active responsibilities of Patroni? I confirmed that the read-only replica was still in standby mode and operating as such during that time (roughly the kind of check sketched after the log excerpt below), so it wasn't that it was promoted during the outage. From that time, Patroni was repeatedly failing with:
Jul 23 05:02:24  patroni: 2025-07-23 05:02:24,141 ERROR: watchprefix failed: ProtocolError("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes rea
Jul 23 05:02:25  patroni: Traceback (most recent call last):
...
Jul 23 05:02:25  patroni: File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 262, in _do_http_request
Jul 23 05:02:25  patroni: raise etcd.EtcdConnectionFailed('No more machines in the cluster')
Jul 23 05:02:25  patroni: etcd.EtcdConnectionFailed: No more machines in the cluster 
Jul 23 05:02:25  systemd: patroni.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 05:02:25  systemd: patroni.service: Failed with result 'exit-code'.
Jul 23 05:02:25  systemd: patroni.service: Unit process 407680 (postgres) remains running after unit stopped.
Jul 23 05:02:25  systemd: patroni.service: Unit process 407683 (postgres) remains running after unit stopped.
...
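To be concrete about the standby check mentioned above: something along the following lines (illustrative queries, assuming a psql session on the affected replica, not the exact commands I ran) confirms the node is still in recovery and shows whether a replication slot on it could be pinning WAL:

-- Confirm the replica is still a standby (returns true while in recovery).
SELECT pg_is_in_recovery();

-- Any replication slot on this node whose restart_lsn lags far behind the
-- replay position forces pg_wal to be retained; wal_status/safe_wal_size
-- (available on Postgres 16) also indicate whether a slot blocks recycling.
SELECT slot_name, slot_type, active, restart_lsn, wal_status, safe_wal_size,
       pg_wal_lsn_diff(pg_last_wal_replay_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;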
From the Patroni/systemd logs above, Patroni itself is crashing while the postgres process remains up (the PID of the postgres server process is 407680). During this time I also looked at pg_stat_activity and didn't see any long-running queries that I'd expect to hold the WAL back (a query along the lines of the sketch at the end of this post). In terms of the Postgres logs, the only thing of note is the point at which pg_wal started to increase, which is somewhat confirmed by the restartpoint entries:
2025-07-22 06:46:02.235 UTC  LOG:  restartpoint complete: wrote 517224 buffers (6.2%); 0 WAL file(s) added, 246 removed, 23 recycled; write=66.028 s, sync=0.499 s, total=66.721 s; sync files=686, longest=0.109 s, average=0.001 s; distance=4407526 kB, estimate=4407526 kB; lsn=21204/6CAA5F0, redo lsn=21203/13048F80
2025-07-22 06:47:27.708 UTC  LOG:  restartpoint complete: wrote 523662 buffers (6.2%); 0 WAL file(s) added, 1 removed, 22 recycled; write=77.386 s, sync=0.263 s, total=77.676 s; sync files=676, longest=0.012 s, average=0.001 s; distance=4407428 kB, estimate=4407516 kB; lsn=21205/159E6E88, redo lsn=21204/2006A040
2025-07-22 06:49:06.802 UTC  LOG:  restartpoint complete: wrote 562491 buffers (6.7%); 1 WAL file(s) added, 0 removed, 0 recycled; write=89.970 s, sync=0.256 s, total=90.321 s; sync files=598, longest=0.019 s, average=0.001 s; distance=4407504 kB, estimate=4407515 kB; lsn=21206/2207D1D0, redo lsn=21205/2D09E288
Here the number of WAL files removed per restartpoint drops to 0, and then on recovery Postgres receives a SIGHUP, which causes it to reload its configuration:
2025-07-23 08:16:53.955 UTC  LOG:  received SIGHUP, reloading configuration files
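For reference, something along these lines (illustrative queries run on the replica, not necessarily the exact ones I used) is what I mean by checking pg_stat_activity and watching the WAL the standby is keeping:

-- Long-running backends and any backend_xmin that could delay cleanup.
SELECT pid, state, backend_xmin, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY xact_start;

-- Total size of pg_wal on this node (grows when files stop being recycled).
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();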
Asked by Iamterribleatcoding (1 rep)
Jul 23, 2025, 11:37 AM
Last activity: Jul 23, 2025, 09:49 PM