
Optimizing bulk update performance in PostgreSQL

55 votes · 2 answers · 89,234 views
Using PG 9.1 on Ubuntu 12.04. It currently takes up to 24h for us to run a large set of UPDATE statements on a database, which are of the form:

    UPDATE table SET field1 = constant1, field2 = constant2, ... WHERE id = constid

(We're just overwriting fields of objects identified by ID.) The values come from an external data source (not already in the DB in a table). The tables each have a handful of indices and no foreign key constraints. No COMMIT is made until the end.

It takes 2h to import a pg_dump of the entire DB, which seems like a reasonable baseline to target. Short of producing a custom program that somehow reconstructs a data set for PostgreSQL to re-import, is there anything we can do to bring the bulk UPDATE performance closer to that of the import? (This is an area that we believe log-structured merge trees handle well, but we're wondering if there's anything we can do within PostgreSQL.)

Some ideas:

- dropping all non-ID indices and rebuilding them afterward?
- increasing checkpoint_segments, but does this actually help sustained long-term throughput?
- using the techniques mentioned here? (Load the new data as a table, then "merge in" the old data where the ID is not found in the new data; see the sketch below.)

Basically there are a bunch of things to try and we're not sure which are the most effective, or whether we're overlooking other things. We'll be spending the next few days experimenting, but we thought we'd ask here as well. I do have concurrent load on the table, but it's read-only.
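For reference, here's a rough sketch of the staging-table idea from the last bullet: bulk-load the external values into a table, then apply them in a single set-based UPDATE instead of millions of single-row statements. The table and column names (target_table, staging, field1, field2) and the CSV path are placeholders, not from our actual schema.

    BEGIN;

    -- Bulk-load the external data into an UNLOGGED staging table
    -- (UNLOGGED is available since 9.1 and skips WAL for the load).
    -- COPY FROM a server-side file needs superuser; psql's \copy
    -- works from the client instead.
    CREATE UNLOGGED TABLE staging (
        id     integer PRIMARY KEY,
        field1 text,
        field2 text
    );
    COPY staging (id, field1, field2) FROM '/path/to/updates.csv' WITH (FORMAT csv);
    ANALYZE staging;

    -- Apply all changes as one join-driven UPDATE rather than
    -- one UPDATE statement per id.
    UPDATE target_table AS t
    SET    field1 = s.field1,
           field2 = s.field2
    FROM   staging AS s
    WHERE  t.id = s.id;

    DROP TABLE staging;
    COMMIT;

The point of this shape is that PostgreSQL can plan a single join over the staging table rather than parsing, planning, and executing each per-row statement separately; whether it actually beats our current approach is what we'd be measuring.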
Asked by xyzzyrz (671 rep)
Apr 27, 2013, 12:20 AM
Last activity: Sep 3, 2024, 07:51 PM