
Deadlock seems to occasionally, but repeatedly, lead to replication failure

2 votes
0 answers
33 views
We have two PostgreSQL databases running on RHEL virtual servers (PostgreSQL 15.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20), 64-bit). Our production server/database, which we'll call **Simon**, runs Airflow to do ETL and stores the resulting data; our analytics server/database, **Garfunkel**, receives that data over logical replication. Both virtual servers are supposedly in the same data centre, so I don't think networking should be an issue, and we don't think the problem is caused by either server running out of resources.

The publication was set up on select tables and select columns, e.g.:

```sql
CREATE PUBLICATION prod_publication
    FOR TABLE data_tables
    WITH (publish = 'insert, update, delete', publish_via_partition_root = true);
```

The subscription on Garfunkel was then created with:

```sql
CREATE SUBSCRIPTION prod_data
    CONNECTION 'host=simon port=5432 user=subscriber_bot dbname=data_warehouse'
    PUBLICATION prod_raw_publication
    WITH (connect = true, enabled = true, create_slot = false, slot_name = prod_data,
          synchronous_commit = 'off', binary = false, streaming = 'False',
          two_phase = false, disable_on_error = false);
```

Airflow runs on Simon, but its application database is a separate "database" on the same server, which is not being published.

Occasionally, when backfilling data, Airflow hits a deadlock, and this seems to lead to a failure of replication. For example, per this log on Simon:

```
2025-03-19 15:44:38.792 EDT ERROR:  deadlock detected
2025-03-19 15:44:38.792 EDT DETAIL:  Process 2354905 waits for ShareLock on transaction 157900530; blocked by process 2355302.
    Process 2355302 waits for ShareLock on transaction 157900529; blocked by process 2354905.
    Process 2354905: UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2025-03-19T19:44:34.528971+00:00'::timestamptz WHERE dag_run.id = 17606
    Process 2355302: UPDATE task_instance SET state='scheduled', try_number=CASE WHEN (task_instance.state IS NULL OR task_instance.state != 'up_for_reschedule') THEN task_instance.try_number + 1 ELSE task_instance.try_number END, updated_at='2025-03-19T19:44:37.805821+00:00'::timestamptz WHERE task_instance.dag_id = 'unit_validation' AND task_instance.run_id = 'backfill__2025-02-28T10:00:00-05:00' AND (task_instance.task_id, task_instance.map_index) IN (('BasicValidation.t_staging_area', -1))
2025-03-19 15:44:38.792 EDT HINT:  See server log for query details.
2025-03-19 15:44:38.792 EDT CONTEXT:  while updating tuple (412,11) in relation "dag_run"
2025-03-19 15:44:38.792 EDT STATEMENT:  UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2025-03-19T19:44:34.528971+00:00'::timestamptz WHERE dag_run.id = 17606
2025-03-19 15:44:48.804 EDT LOG:  checkpoint starting: time
2025-03-19 15:48:53.836 EDT LOG:  checkpoint complete: wrote 2104 buffers (12.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=245.029 s, sync=0.002 s, total=245.033 s; sync files=116, longest=0.001 s, average=0.001 s; distance=128470 kB, estimate=128470 kB
2025-03-19 15:49:48.895 EDT LOG:  checkpoint starting: time
2025-03-19 15:50:52.679 EDT LOG:  checkpoint complete: wrote 640 buffers (3.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=63.783 s, sync=0.001 s, total=63.785 s; sync files=38, longest=0.001 s, average=0.001 s; distance=9520 kB, estimate=116575 kB
2025-03-19 15:51:24.374 EDT LOG:  could not receive data from client: Connection reset by peer
```
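As a starting point for diagnosis, this is the kind of publisher-side state we could capture on Simon the next time replication appears to stall. It is only a sketch against the standard PostgreSQL 15 statistics views; the slot name `prod_data` is taken from the subscription DDL above.

```sql
-- On Simon (publisher): is a walsender still streaming for the subscription,
-- and how far behind are write/flush/replay on the subscriber?
SELECT pid, application_name, state,
       sent_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- State of the logical slot used by the subscription: is it active,
-- and how much WAL is being retained behind its restart_lsn?
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_name = 'prod_data';

-- The timeout behind "terminating walsender process due to replication timeout"
SHOW wal_sender_timeout;
```

An inactive slot with steadily growing retained WAL would suggest the walsender has died (as in the timeout log below) rather than the subscriber merely falling behind.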
2025-04-03 13:37:24.177 EDT LOG:  terminating walsender process due to replication timeout In both cases, dropping and recreating the subscription solves the issue, but this seems unideal. What other information can we collect to help diagnose this issue? How can we be guarding against it?
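On the Garfunkel side, a similar sketch of what we could collect, again using only the standard PostgreSQL 15 views; `prod_data` is the subscription name from the DDL above:

```sql
-- On Garfunkel (subscriber): is the apply worker running, and when did it last
-- receive or acknowledge anything from Simon?
SELECT subname, pid, received_lsn, latest_end_lsn, latest_end_time,
       last_msg_send_time, last_msg_receipt_time
FROM pg_stat_subscription
WHERE subname = 'prod_data';

-- Cumulative apply / initial-sync error counts for the subscription (new in PostgreSQL 15)
SELECT subname, apply_error_count, sync_error_count
FROM pg_stat_subscription_stats
WHERE subname = 'prod_data';

-- The receiving-side counterpart of wal_sender_timeout
SHOW wal_receiver_timeout;
```

Since we run with `disable_on_error = false`, a climbing `apply_error_count` would at least confirm that the apply worker is repeatedly failing and restarting rather than silently idle.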
Asked by raphael (599 rep)
Apr 4, 2025, 03:36 PM
Last activity: Apr 4, 2025, 03:41 PM