PostgreSQL: Approach for aggregating records and upsert vs delete and insert

0 votes
1 answer
106 views
postgresql performance insert aggregate delete
                          We have a file processing service that extracts some device configuration information from daily files and stores it into PostgreSQL tables.

We have two layers of tables,
- The loading tables that contain records for every single day
- The compiled tables that store aggregated records based on a hash and date_from to show a timeline of configuration changes. The hash includes all the fields except the primary key.

Following is a sample of the tables:

    CREATE TABLE public.logger_main_config_loading (
        id text NOT NULL,
        hash text NOT NULL,
        logger_oem_id text NOT NULL,
        logger_model_id text,
        logger_serial_number text NOT NULL,
        logger_id text,
        logger_name text,
        date_from timestamp without time zone NOT NULL,
        date_to timestamp without time zone,
        encryption_pin_or_key text,
        enclosure_lock_details text,
        data_transfer_details text,
        offset_from_utc_hrs numeric,
        sampling_rate_sec integer,
        averaging_period_minutes integer,
        clock_is_auto_synced boolean,
        notes text,
        update_at timestamp with time zone DEFAULT CURRENT_TIMESTAMP,
        updated_by uuid,
        measurement_location_uuid uuid NOT NULL,
        logger_firmware_version text
    );

This table has the following Primary Key:

    ALTER TABLE ONLY public.logger_main_config_loading
        ADD CONSTRAINT logger_main_config_loading_pkey
    PRIMARY KEY (measurement_location_uuid, id, hash, date_from);

The table has the following indexes:

    CREATE INDEX
    	idx_logger_main_config_loading
    ON
    	public.logger_main_config_column_name_loading
    USING
    	btree (measurement_location_uuid, logger_main_config_loading_id, logger_main_config_loading_hash, logger_main_config_loading_date_from);

    CREATE INDEX idx_main_config_loading_measurement_location_uuid ON public.logger_main_config_loading USING btree (measurement_location_uuid);

The logger_main_config_compiled table has exactly the same structure, but just stores records aggregated by id, hash and date_from.

Here is a sample query that we use to get aggregated records from the loading table and insert them in the compiled table:

    SELECT
            log_main_load_1.*,
            log_main_load_2.date_to AS compiled_date_to
        FROM
            logger_main_config_loading AS log_main_load_1
            INNER JOIN (
                SELECT
                    id,
                    hash,
                    measurement_location_uuid,
                    min(date_from) AS date_from,
                    max(date_to) AS date_to
                FROM
                    logger_main_config_loading
                WHERE measurement_location_uuid = %(measurement_location_uuid)s
                GROUP BY
                    id,
                    hash,
                    measurement_location_uuid
            ) AS log_main_load_2
            ON
                log_main_load_1.id = log_main_load_2.id
                AND log_main_load_1.hash = log_main_load_2.hash
                AND log_main_load_1.measurement_location_uuid = log_main_load_2.measurement_location_uuid
                AND log_main_load_1.date_from = log_main_load_2.date_from
        WHERE
            log_main_load_1.measurement_location_uuid = %(measurement_location_uuid)s;


We are planning to do away with these queries and do the aggregation purely through Python code and insert the records in the compiled table as some scenarios do not get handled.
For example,
if there are records in the loading table already from a particular date_from, and a file before that date is processed (new date_from before the min(date_from) in the table at that point),
the records in the compiled table result in a new record as the date_from is different (different primary key), however, we need the same record's date_from to be updated if the hash has not changed.

Hence, have the following questions:
- What is the cleanest approach to achieve this? Is it a recommended practice to delete the compiled table records each time and insert them after aggregating them through code for each measurement_location_uuid? I do understand that indexes get rebuilt each time records are deleted, however this table may not even a million of records.
- Based on the table structure, are any other performance considerations or indexes that need to be added/removed?
- We expect idempotency in this implementation, even if a file is added again, there should be no effect on the compiled table. In this case delete and rewrite may be unnecessary.

I would really appreciate some suggestions so that this can be implemented in a less complex and efficient manner.

Thank you in advance.
                        
Asked by ShwetaJ (153 rep)
Jul 30, 2024, 02:07 PM
Last activity: Aug 6, 2024, 08:49 PM
PostgreSQL: Approach for aggregating records and upsert vs delete and insert

Related Questions