
Deadlock seems to occasionally, but repeatedly, lead to replication failure

2 votes
0 answers
33 views
We have two PostgreSQL databases running on RHEL virtual servers (PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20), 64-bit). Our production server/database, which we'll call **Simon**, runs Airflow to do ETL and stores the resulting data; our analytics server/database, **Garfunkel**, receives that data over logical replication. Both virtual servers are supposedly in the same data centre, so I don't think networking should be an issue, and we don't think the problem is caused by either server running out of resources.

The publication was set up on select tables and select columns, e.g.:

```sql
CREATE PUBLICATION prod_publication
    FOR TABLE data_tables
    WITH (publish = 'insert, update, delete', publish_via_partition_root = true);
```

The subscription on Garfunkel was then created with:

```sql
CREATE SUBSCRIPTION prod_data
    CONNECTION 'host=simon port=5432 user=subscriber_bot dbname=data_warehouse'
    PUBLICATION prod_raw_publication
    WITH (connect = true, enabled = true, create_slot = false, slot_name = prod_data,
          synchronous_commit = 'off', binary = false, streaming = 'False',
          two_phase = false, disable_on_error = false);
```

Airflow runs on Simon, but its application database is a separate "database" on the same server, which is not being published.

Occasionally, when backfilling data, Airflow hits a deadlock, and this seems to lead to a failure of replication. For example, per this log on Simon:

```
2025-03-19 15:44:38.792 EDT ERROR:  deadlock detected
2025-03-19 15:44:38.792 EDT DETAIL:  Process 2354905 waits for ShareLock on transaction 157900530; blocked by process 2355302.
    Process 2355302 waits for ShareLock on transaction 157900529; blocked by process 2354905.
    Process 2354905: UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2025-03-19T19:44:34.528971+00:00'::timestamptz WHERE dag_run.id = 17606
    Process 2355302: UPDATE task_instance SET state='scheduled', try_number=CASE WHEN (task_instance.state IS NULL OR task_instance.state != 'up_for_reschedule') THEN task_instance.try_number + 1 ELSE task_instance.try_number END, updated_at='2025-03-19T19:44:37.805821+00:00'::timestamptz WHERE task_instance.dag_id = 'unit_validation' AND task_instance.run_id = 'backfill__2025-02-28T10:00:00-05:00' AND (task_instance.task_id, task_instance.map_index) IN (('BasicValidation.t_staging_area', -1))
2025-03-19 15:44:38.792 EDT HINT:  See server log for query details.
2025-03-19 15:44:38.792 EDT CONTEXT:  while updating tuple (412,11) in relation "dag_run"
2025-03-19 15:44:38.792 EDT STATEMENT:  UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2025-03-19T19:44:34.528971+00:00'::timestamptz WHERE dag_run.id = 17606
2025-03-19 15:44:48.804 EDT LOG:  checkpoint starting: time
2025-03-19 15:48:53.836 EDT LOG:  checkpoint complete: wrote 2104 buffers (12.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=245.029 s, sync=0.002 s, total=245.033 s; sync files=116, longest=0.001 s, average=0.001 s; distance=128470 kB, estimate=128470 kB
2025-03-19 15:49:48.895 EDT LOG:  checkpoint starting: time
2025-03-19 15:50:52.679 EDT LOG:  checkpoint complete: wrote 640 buffers (3.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=63.783 s, sync=0.001 s, total=63.785 s; sync files=38, longest=0.001 s, average=0.001 s; distance=9520 kB, estimate=116575 kB
2025-03-19 15:51:24.374 EDT LOG:  could not receive data from client: Connection reset by peer
```
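As a starting point for diagnosis, this is the kind of publisher-side state we could capture on Simon the next time replication appears to stall. It is only a sketch against the standard PostgreSQL 15 statistics views; the slot name `prod_data` is taken from the subscription DDL above.

```sql
-- On Simon (publisher): is a walsender still streaming for the subscription,
-- and how far behind are write/flush/replay on the subscriber?
SELECT pid, application_name, state,
       sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- State of the logical slot used by the subscription: is it active,
-- and how much WAL is being retained behind its restart_lsn?
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'prod_data';

-- The timeout behind "terminating walsender process due to replication timeout"
SHOW wal_sender_timeout;
```

An inactive slot with steadily growing retained WAL would suggest the walsender has died (as in the timeout log below) rather than the subscriber merely falling behind.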
2025-04-03 13:37:24.177 EDT LOG:  terminating walsender process due to replication timeout In both cases, dropping and recreating the subscription solves the issue, but this seems unideal. What other information can we collect to help diagnose this issue? How can we be guarding against it?
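On the Garfunkel side, a similar sketch of what we could collect, again using only the standard PostgreSQL 15 views; `prod_data` is the subscription name from the DDL above:

```sql
-- On Garfunkel (subscriber): is the apply worker running, and when did it last
-- receive or acknowledge anything from Simon?
SELECT subname, pid, received_lsn, latest_end_lsn, latest_end_time,
       last_msg_send_time, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname = 'prod_data';

-- Cumulative apply / initial-sync error counts for the subscription (new in PostgreSQL 15)
SELECT subname, apply_error_count, sync_error_count
FROM pg_stat_subscription_stats
WHERE subname = 'prod_data';

-- The receiving-side counterpart of wal_sender_timeout
SHOW wal_receiver_timeout;
```

Since we run with `disable_on_error = false`, a climbing `apply_error_count` would at least confirm that the apply worker is repeatedly failing and restarting rather than silently idle.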
Asked by raphael (599 rep)
Apr 4, 2025, 03:36 PM
Last activity: Apr 4, 2025, 03:41 PM