Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0
votes
1
answers
21
views
Cannot access timescaledb data after moving postgres db to new server
I was moving a PostgreSQL database cluster to another server and did what I thought was the simplest thing: copy the data files and start up a new server of the same version (in my case v. 14; this was the first step in a server upgrade, the old server was on CentOS 7, the new one is Rocky Linux 9). This worked well for all the databases except the one using TimescaleDB, where the tables are unavailable even though I have installed TimescaleDB on the new server as well. (I also have some databases using PostGIS heavily; those are working just fine after the move.) After starting up the new server, I had to manually run "CREATE EXTENSION timescaledb"; that seems to work and \dx shows the extension, but the data are not accessible. Any tips on what to do?
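In case it helps, a minimal set of sanity checks for this situation might look like the following (a sketch; it assumes TimescaleDB must be preloaded and that the extension version should match what the copied catalog expects):

```
-- The library has to be preloaded before the extension's objects are usable:
SHOW shared_preload_libraries;   -- must include 'timescaledb' in postgresql.conf
-- After a file-level copy the extension should already exist in the database:
SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';
-- Only if the new server ships a newer TimescaleDB release than the copied catalog:
-- ALTER EXTENSION timescaledb UPDATE;
```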
MortenSickel
(261 rep)
Jul 17, 2025, 07:50 AM
• Last activity: Jul 23, 2025, 11:41 AM
0
votes
1
answers
158
views
First metric in each time "bucket" for multiple IDs
I have the following table in PostgreSQL 12:
CREATE TABLE vehicle_fuel (
vehicle_id int NOT NULL
, submitted_at timestamp NOT NULL
, fuel float NOT NULL);
Currently there are approximately 1000 vehicle IDs each with an entry every 15 minutes or so, just over 100 million rows in total.
I'd like to be able to plot the fuel usage on a line chart for multiple chosen vehicles for any chosen time interval. I'd like to plot only <= 200 points spread relatively evenly over the time interval for each vehicle.
My current solution uses generate_series() to generate evenly spread "buckets" and then chooses the first timestamp within each "bucket" for each vehicle. E.g. for 10 vehicles and 200 points over the past 6 months:
SELECT vehicle_id, submitted_at, fuel
FROM (VALUES (1), (3), (4), (34), (44), (56), (76), (79), (81), (83)) vehicle_ids(v)
CROSS JOIN (SELECT generate_series('2020-05-17T00:00:00'::timestamp, '2020-11-17T00:00:00'::timestamp,
'79488 seconds') AS bucket
LIMIT 200) AS buckets
CROSS JOIN LATERAL ( SELECT vehicle_id,
submitted_at,
fuel
FROM vehicle_fuel
WHERE vehicle_id = v
AND submitted_at <= buckets.bucket
ORDER BY submitted_at DESC LIMIT 1) data
ORDER BY vehicle_id, submitted_at DESC;
However, this does not scale well with the number of vehicles requested: ~2300 ms for the above query, and significantly longer when I add more vehicles.
Is there any way I can make this faster? [dbfiddle](https://dbfiddle.uk/?rdbms=postgres_12&fiddle=5c242838890272564fa8834139194b0d)
I'm also using TimescaleDB, in case there is anything from it that can be utilised.
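For reference, a sketch of the same idea expressed with TimescaleDB's time_bucket() and last() aggregate (it assumes vehicle_fuel is a hypertable and reuses the bucket width from the query above; whether it is faster depends on the chunk layout and indexes):

```
SELECT vehicle_id,
       time_bucket('79488 seconds', submitted_at) AS bucket,
       last(fuel, submitted_at) AS fuel   -- latest reading within each bucket
FROM vehicle_fuel
WHERE vehicle_id IN (1, 3, 4, 34, 44, 56, 76, 79, 81, 83)
  AND submitted_at BETWEEN '2020-05-17' AND '2020-11-17'
GROUP BY vehicle_id, bucket
ORDER BY vehicle_id, bucket;
```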
Vic
(1 rep)
Nov 17, 2020, 03:33 PM
• Last activity: Jul 8, 2025, 02:05 AM
0
votes
1
answers
426
views
Sharding in Timescaledb (Postgres) Opensource
**If I am wrong in my understanding, please feel free to correct me**
In my project, we have a time-series database. It is set up as a 3-node (one leader, two read replicas) Patroni cluster. Each node is an AWS EC2 instance where time-series data is stored in hypertables provided by the TimescaleDB extension on a Postgres database. We are using open-source TimescaleDB here.
As the data grows each day, the EBS data volume on a node (EC2 instance) is expected to hit its size limit in the future. Hence the need for sharding.
As a potential solution, we looked at distributed hypertables in TimescaleDB. But that seems to be a dead end, as multi-node support (on top of which distributed hypertables are built) has been deprecated in open-source TimescaleDB.
There is another option, i.e. to use Citus (a Postgres extension that implements sharding). But Citus doesn't support the TimescaleDB extension in Postgres. So, as a high-level solution in this case, we would have to convert TimescaleDB hypertables to regular Postgres tables first in order to use Citus. So far Citus seems to be the most suitable (relatively) choice to implement sharding.
Could someone please suggest a better way (if there is any)?
**Edit Note**: Data archival or purging is not an option for us. All of the data is needed. Compression has already been applied as much as possible. This has bought us some additional time before the storage limit of an EBS data volume is reached, but eventually sharding will be required.
HelloJack
(1 rep)
Aug 10, 2024, 10:50 AM
• Last activity: May 11, 2025, 03:04 PM
0
votes
1
answers
391
views
Optimizing TimescaleDB Setup for High-Volume Time-Series Data
I'm seeking advice on how to optimize my TimescaleDB setup, which handles a large volume of time-series data. I have around 20,000 time-series profiles with a one-year duration, using a quarter-hourly time resolution (4 timestamps per hour). This amounts to approximately 700 million entries. My database is hosted on an Azure PostgreSQL server.
Here are the details of my setup:
**Hardware Specifications:**
4 vCores
16 GiB memory
512 GB storage
Database Structure:
I have two tables, one for the load profiles with the columns (id, time, value, sensor_id), and another table with the columns (id, sensor_id). There are two indexes on the load profile table, one on (sensor_id, time), and another on sensor_id.
**Sample Query:**
A typical query I use to aggregate data is:
SELECT AVG(value), time
FROM public.loadprofilepool
WHERE sensor_id IN (
SELECT id
FROM public.sensor_table
ORDER BY RANDOM()
LIMIT 500
)
GROUP BY time;
Please note that this is a sample query where the list of sensor_ids is generated on the fly. In a real situation, the list of ids would come from elsewhere.
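For context, one approach often suggested for repeated aggregations like this in TimescaleDB is a continuous aggregate that pre-computes per-sensor averages into coarser buckets; a minimal sketch, assuming loadprofilepool is a hypertable on time and using hourly buckets as a placeholder resolution:

```
CREATE MATERIALIZED VIEW loadprofile_hourly
WITH (timescaledb.continuous) AS
SELECT sensor_id,
       time_bucket(INTERVAL '1 hour', time) AS bucket,
       AVG(value) AS avg_value
FROM public.loadprofilepool
GROUP BY sensor_id, bucket;
```

Queries for a chosen set of sensors would then read from the (much smaller) view instead of the raw table.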
**Data Distribution:**
For now, there are 24 * 4 * 365 rows per sensor (one year at quarter-hourly resolution) and there are 20,000 sensors. In the future, there will also be live sensor data, whose distribution will depend on the specific sensor.
**Performance Metrics:**
When running these queries, the CPU usage does not exceed 20% and memory usage is constant at about 40%.
Given these details, I'm struggling with query speed. Extracting 10 to 1000 profiles and summing them up to generate a timeseries for each timestamp currently takes about 5 to 20 seconds, whereas my target is less than 5 seconds.
**My questions are as follows:**
1. Is my current setup the most efficient for handling and querying this volume and type of time-series data? If not, could you suggest alternative methods? I've considered NoSQL databases, cloud storage with Zarr or NetCDF files, but I'm not sure which, if any, would be more suitable.
2. How can I optimize my current setup to achieve faster query results? Are there specific TimescaleDB or PostgreSQL configurations or optimizations, indexing strategies, or query formulation tactics that would help improve performance?
Thank you in advance for your help. Any suggestions or guidance would be greatly appreciated.
Best regards,
Hannes
Hannes
(1 rep)
Jul 5, 2023, 03:20 PM
• Last activity: May 8, 2025, 03:08 PM
0
votes
1
answers
341
views
finding cause for sometimes slow query
I am trying to find the cause for a query that is slow, but only sometimes.
I have slow query logging active, and I see the query being logged, for example:
timescaledb-2 timescaledb 2023-08-04 13:43:12 UTC [32626]: [64ccfe56.7f72-3] device_monitoring@postgres,app=PostgreSQL JDBC Driver [00000] LOG: duration: 27178.742 ms bind S_9:
Sometimes the query times out, my client timeout is set to 30s, so I get:
timescaledb-2 timescaledb 2023-08-04 13:33:23 UTC : [64ccf76c.745c-3] device_monitoring@postgres,app=PostgreSQL JDBC Driver ERROR: canceling statement due to statement timeout
Running [explain on the query](https://explain.dalibo.com/plan/5e36b5b82fd1d661#plan), I don't see any immediate problems, and the query runs fast most of the time.
I have pg_stat_statements active, but these slow executions are never stored:
postgres=# select max(max_exec_time) from pg_stat_statements ;
-[ RECORD 1 ]----
max | 2286.040693
That is only about 2 seconds. Do I need to activate any setting for slow queries to be included in pg_stat_statements?
I have log_lock_waits set to on, but no slow locks are logged.
What else can be the cause of these slow queries?
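One way to see what the occasional slow execution is actually doing is the auto_explain module; a sketch (the threshold values are placeholders, and for a JDBC application the module would typically be loaded via session_preload_libraries rather than per session):

```
LOAD 'auto_explain';                       -- current session only; requires superuser
SET auto_explain.log_min_duration = '5s';  -- log plans of statements slower than this
SET auto_explain.log_analyze = on;         -- include actual timings and row counts
SET auto_explain.log_buffers = on;         -- include buffer/IO statistics
```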
simao
(127 rep)
Aug 4, 2023, 01:54 PM
• Last activity: Mar 7, 2025, 06:03 PM
3
votes
1
answers
1068
views
A complicated scenario with TimescaleDB and compression
We have a timescaleDB with a fairly large data set (> 1.5TB, >1B rows).
The project has been running into one delay after another because we just can't make the queries we need to perform fast enough (large scans over many rows to find specific conditions, etc.)
Recently, we've started to experiment with compression and saw that it had quite a significant impact on performance, but we've been running into a wall.
We record financial trades and, at the core, the trades have a timestamp, the symbol being transacted and various other info.
There are roughly 140 different symbols, and each of them runs independently of the others. They don't know about each other.
All transactions end up in a single table. We can't have multiple tables because some queries require access to the different symbols at once.
The chunks are 1 day long, and we write a whole day at once.
Since we process the symbols separately, sometimes symbol A may be writing day 2, while symbol B writes day 5.
This works very well until compression gets enabled:
Each chunk represents one day of ALL symbols, so if a chunk gets compressed while one symbol hasn't been written yet, the write operation will fail.
We can't manually 'close' a day/chunk because some data for some symbol may arrive a few days later or some days may be empty, so there is no way to determine that a day is closed.
So my question is:
how can we partition things knowing that:
- we want to use compression.
- we need access to all symbols within a single query, so we can't have many tables.
- we can't determine when the daily data for a symbol will arrive; it's not uncommon to have 3-4 days difference between symbols.
If we could create a chunk per timestamp and per symbol, it would work, but I haven't found anything indicating that it would be possible.
--------------------------------
Edit:
I've experimented with partitioning the hypertable, but this doesn't provide the expected result.
Let's take one thread:
- it gets live data that gets written as it comes.
- it gets authoritative data in blocks of 24h. It is sequential (days arrive in order) and can be a bit old (sometimes it takes a few days to get an update), but it has to replace whatever is in the table, since it takes precedence over other data. Prior to inserting the day into the database, it'll erase the whole day in question, which may already have been populated from the live data.
- I have compression set so that it will not compress the last 3 days of data, but everything older
So, in theory, I'm writing live data and trusted data arrives in 24h blocks, sometimes a few days later and overwrites everything on its way.
These blocks can be compressed.
Now, there are about 140 of these threads, each representing a symbol, and while symbol A may be on day 5, symbol B can be on day 3 and since the chunks represent a day, I can only compress days 0, 1 and 2.
I was hoping that partitioning with:
SELECT create_hypertable(
'exchange.{tableName}',
'ts',
partitioning_column => 'ticker',
number_partitions => 150,
create_default_indexes => false,
chunk_time_interval => INTERVAL '1 day',
if_not_exists => TRUE);
would create one chunk per symbol ('ticker') per day. So symbol A could have its own compression, symbol B could have its own compression, etc. because each symbol has its own chunk / day, and I wouldn't have to look at the most laggy symbol to decide where compression stops.
But this doesn't seem to be the case.
Is there any solution for this?
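For completeness, the compression settings being discussed look roughly like this; segmenting by the symbol column keeps each symbol's rows together inside a compressed chunk (a sketch: the table name is a placeholder and add_compression_policy is the TimescaleDB 2.x function name):

```
ALTER TABLE exchange.trades SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'ticker',
    timescaledb.compress_orderby   = 'ts DESC'
);
-- Only compress chunks old enough that even the laggiest symbol should have arrived:
SELECT add_compression_policy('exchange.trades', INTERVAL '7 days');
```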
Thomas
(355 rep)
Jun 28, 2022, 10:49 PM
• Last activity: Mar 3, 2025, 05:04 PM
0
votes
1
answers
116
views
What is the maximum number of columns allowed in a TimescaleDB table?
Do hypertables in TimescaleDB share Postgres's ~1600 column limit? I am writing a system to store a large number (1000s) of sensor values into a table, with one nullable column per sensor value. Since the table storage is columnar, I don't expect a wide table to be too bad for performance because each column is stored to disk separately. However, I will run into the column size limit if one exists.
If I can't use a schema with a very wide table, I will probably denormalize into one table per sensor, with a single value column. Would TimescaleDB be likely to offer significant improvements over Postgres for this schema (many tables with a timestamp and a value column)?
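For comparison, the layout often used instead of one-table-per-sensor is a single narrow table keyed by a sensor id; a sketch with illustrative names:

```
CREATE TABLE sensor_readings (
    ts        timestamptz      NOT NULL,
    sensor_id int              NOT NULL,
    value     double precision NULL
);
SELECT create_hypertable('sensor_readings', 'ts');
CREATE INDEX ON sensor_readings (sensor_id, ts DESC);
```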
John Alexander
(1 rep)
Feb 8, 2025, 07:21 AM
• Last activity: Feb 11, 2025, 05:20 PM
1
votes
2
answers
638
views
Optimizing table for timeseries Postgres data table
I have the below table, which maintains a timeseries result. A row only becomes relevant when the signal is true. When signal is false, it just marks that for that particular timestamp we got a result but it is not a valid one, so the res and other columns just contain null values. When signal is null, it marks that we are yet to receive a result for this timestamp. The signal is very sparse in nature; it is only true for maybe less than 7% of the records. Also, the inserts made to this table are not ordered according to timestamp; older dates could arrive at a later time.
CREATE TABLE public.res
(
pid integer NOT NULL,
aid integer NOT NULL,
cid integer NOT NULL,
"time" timestamp without time zone NOT NULL,
signal boolean,
price numeric,
res double precision[] NOT NULL,
...
CONSTRAINT res_pkey PRIMARY KEY (pid, aid, cid, "time")
)
This table can contain millions of records and is growing exponentially as my database grows. I want to optimize this table, so I have the following questions:
- Is each row the same size? Or, since the row only makes sense if signal is true, can it be dynamically sized, hence keeping the overall size of the table low? A minimal row would contain (pid, aid, cid, time, signal, price) and the maximal row would additionally contain (res and the remaining columns). Is it possible to do this in Postgres with its data types? I do not want to create separate tables because run-time joins could be very expensive when there are millions of records.
- This SO answer says "In effect NULL storage is absolutely free for tables up to 8 columns", but I have many more columns.
- Any other suggestions you might have to deal with such problems? (One idea is sketched after this list.)
- I read about TimescaleDB, but since the records do not get inserted in order of timestamps, does it have any advantage over Postgres in this use case?
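The idea referred to above: since signal is true for well under 10% of rows, a partial index makes the relevant rows cheap to find without splitting the table (a sketch using the column names from the DDL):

```
CREATE INDEX res_signal_true_idx
    ON public.res (pid, aid, cid, "time")
    WHERE signal;
```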
Thanks
user4772933
(133 rep)
Jan 18, 2021, 11:34 AM
• Last activity: Feb 11, 2025, 12:07 PM
0
votes
1
answers
947
views
Ways to optimize the PostgreSQL/TimeScaleDB query
What approaches I could take to optimize the performance of the following joinless query on the following PostgreSQL/TimeScaleDB table? So far, I managed to create the right index which is obviously being used by the query planner. But the query is still not fast enough.
The layout of the table and its indices is:
The query is:
SELECT
entity_id,
event_type,
payload_type,
encode(last(payload, timestamp), 'escape')::json AS aggregated_value
FROM event_data
WHERE
entity_id IN ('AA','AB','AC','AD','AE','AF','AG','AH','AI','AJ','AK','AL','AM','AN','AO','AP','AQ','AR','AS','AT','AU','AV','AW','AX','AY','AZ','BA','BB','BC','BD','BE','BF','BG','BH','BI','BJ','BK','BL','BM','BN','BO','BP','BQ','BR','BS','BT','BU','BV','BW','BX')
AND payload_type IN ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r')
AND timestamp BETWEEN '2020-06-02T04:52:48.00Z' AND '2020-06-02T07:52:48.00Z'
GROUP BY 1,2,3
ORDER BY 1,2,3
OFFSET 0 LIMIT 2001;
The output of EXPLAIN ANALYZE is: https://explain.depesz.com/s/okww
Limit (cost=91617.29..92079.57 rows=2001 width=89) (actual time=2026.624..2457.510 rows=800 loops=1)
-> Finalize GroupAggregate (cost=91617.29..103685.10 rows=52235 width=89) (actual time=2026.622..2457.349 rows=800 loops=1)
Group Key: _hyper_2_88_chunk.entity_id, _hyper_2_88_chunk.event_type, _hyper_2_88_chunk.payload_type
-> Gather Merge (cost=91617.29..101987.46 rows=52235 width=89) (actual time=2026.421..2462.836 rows=1600 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial GroupAggregate (cost=90617.28..95111.01 rows=52235 width=89) (actual time=2017.985..2427.205 rows=800 loops=2)
Group Key: _hyper_2_88_chunk.entity_id, _hyper_2_88_chunk.event_type, _hyper_2_88_chunk.payload_type
-> Sort (cost=90617.28..91385.44 rows=307264 width=121) (actual time=2017.471..2199.512 rows=260608 loops=2)
Sort Key: _hyper_2_88_chunk.entity_id, _hyper_2_88_chunk.event_type, _hyper_2_88_chunk.payload_type
Sort Method: external merge Disk: 35000kB
-> Append (cost=0.42..50922.41 rows=307264 width=121) (actual time=0.145..818.428 rows=260608 loops=2)
-> Parallel Index Scan using _hyper_2_88_chunk_idx_event_timestamp on _hyper_2_88_chunk (cost=0.42..3866.03 rows=11135 width=121) (actual time=0.145..92.745 rows=10440 loops=2)
Index Cond: (("timestamp" >= '2020-06-02 04:52:48'::timestamp without time zone) AND ("timestamp" Parallel Index Scan using _hyper_2_90_chunk_idx_entity_id_payload_type_timestamp on _hyper_2_90_chunk (cost=0.42..15838.58 rows=102301 width=121) (actual time=0.045..207.567 rows=86832 loops=2)
Index Cond: ((entity_id = ANY ('{AA,AB,AC,AD,AE,AF,AG,AH,AI,AJ,AK,AL,AM,AN,AO,AP,AQ,AR,AS,AT,AU,AV,AW,AX,AY,AZ,BA,BB,BC,BD,BE,BF,BG,BH,BI,BJ,BK,BL,BM,BN,BO,BP,BQ,BR,BS,BT,BU,BV,BW,BX}'::text[])) AND (payload_type = ANY ('{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r}'::text[])) AND ("timestamp" >= '2020-06-02 04:52:48'::timestamp without time zone) AND ("timestamp" Parallel Index Scan using _hyper_2_89_chunk_idx_entity_id_payload_type_timestamp on _hyper_2_89_chunk (cost=0.42..15811.99 rows=102444 width=121) (actual time=0.031..274.653 rows=86816 loops=2)
Index Cond: ((entity_id = ANY ('{AA,AB,AC,AD,AE,AF,AG,AH,AI,AJ,AK,AL,AM,AN,AO,AP,AQ,AR,AS,AT,AU,AV,AW,AX,AY,AZ,BA,BB,BC,BD,BE,BF,BG,BH,BI,BJ,BK,BL,BM,BN,BO,BP,BQ,BR,BS,BT,BU,BV,BW,BX}'::text[])) AND (payload_type = ANY ('{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r}'::text[])) AND ("timestamp" >= '2020-06-02 04:52:48'::timestamp without time zone) AND ("timestamp" Parallel Index Scan using _hyper_2_91_chunk_idx_entity_id_payload_type_timestamp on _hyper_2_91_chunk (cost=0.42..15405.82 rows=91384 width=121) (actual time=0.051..180.641 rows=76520 loops=2)
Index Cond: ((entity_id = ANY ('{AA,AB,AC,AD,AE,AF,AG,AH,AI,AJ,AK,AL,AM,AN,AO,AP,AQ,AR,AS,AT,AU,AV,AW,AX,AY,AZ,BA,BB,BC,BD,BE,BF,BG,BH,BI,BJ,BK,BL,BM,BN,BO,BP,BQ,BR,BS,BT,BU,BV,BW,BX}'::text[])) AND (payload_type = ANY ('{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r}'::text[])) AND ("timestamp" >= '2020-06-02 04:52:48'::timestamp without time zone) AND ("timestamp" <= '2020-06-02 07:52:48'::timestamp without time zone))
Planning time: 3.622 ms
Execution time: 2478.976 ms
Work mem:
eventdb=# show work_mem;
work_mem
----------
5242kB
(1 row)
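One detail from the plan above: the sort spills roughly 35 MB to disk while work_mem is only 5242kB, so raising it for the reporting session is a cheap experiment (the value below is an assumption, not a recommendation):

```
SET work_mem = '64MB';  -- session-level; re-run EXPLAIN (ANALYZE) and check the "Sort Method"
```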
Min-Soo Pipefeet
(101 rep)
Jun 8, 2020, 07:10 AM
• Last activity: Feb 7, 2025, 04:03 PM
0
votes
1
answers
74
views
timescaledb/postgresql size reduction - _timescaledb_internal tables take up tons of space after deleting rows
I have a timescaledb set up, from which I just deleted a large amount of data. I vacuumed the tables that I deleted the data from, and the output of
pg_size_pretty(pg_database_size(current_database()))
dropped from 87GB to 50GB.
However, I deleted almost all of the data, so I was expecting the size to drop to basically nothing. If I run a query to sum the size of the tables from each table schema, I get this result:
NOTICE: Total size of the database: 50 GB
NOTICE: Size of large objects: 8192 bytes
NOTICE: Schema pg_catalog total table size: 24 MB
NOTICE: Schema information_schema total table size: 248 kB
NOTICE: Schema _timescaledb_cache total table size: 0 bytes
NOTICE: Schema _timescaledb_catalog total table size: 1168 kB
NOTICE: Schema _timescaledb_internal total table size: 50 GB
NOTICE: Schema _timescaledb_config total table size: 80 kB
NOTICE: Schema public total table size: 6680 kB
You can see the public tables where I actually have data points take up less than 7MB, but the internal _timescaledb_internal
tables are taking 50GB! How can I reduce this now that most of the data has been deleted?
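For context, the space is held by the chunk tables that TimescaleDB creates under _timescaledb_internal, so reclaiming it has to go through the hypertable itself; a sketch with a placeholder hypertable name (function names are the TimescaleDB 2.x ones):

```
-- See which chunks still hold the space:
SELECT chunk_schema, chunk_name, pg_size_pretty(total_bytes) AS size
FROM chunks_detailed_size('my_hypertable')
ORDER BY total_bytes DESC;
-- Drop whole chunks that are entirely obsolete (far cheaper than DELETE + VACUUM):
SELECT drop_chunks('my_hypertable', older_than => INTERVAL '1 year');
```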
Datguy
(11 rep)
Apr 28, 2024, 12:18 AM
• Last activity: Dec 31, 2024, 04:30 PM
0
votes
0
answers
81
views
Does TimescaleDB/PostgreSQL work well with Azure Disk backup?
I'm investigating improvements to a self-hosted TimescaleDB in Microsoft Azure. We have Timescale running in an AKS cluster, with a Standard SSD Azure Disk mounted as its persistent storage. Thus far we've used
pg_dump
for our backup strategy, but we're dissatisfied with its running time and the dump size.
Our research suggests that Azure Disk backup may be appropriate for our use case. [Microsoft claims](https://learn.microsoft.com/en-us/azure/backup/disk-backup-overview?source=recommendations#key-benefits-of-disk-backup) that Azure Disk backup is useful when "invoking freeze and thaw on Linux virtual machines to get application-consistent backup puts undue overhead on production workload availability." To the best of our understanding, that is indeed how one would build TimescaleDB filesystem backups from the ground up.
[The PostgreSQL documentation](https://www.postgresql.org/docs/current/continuous-archiving.html) also notes that "we can combine a file-system-level backup with backup of the WAL files...We do not need a perfectly consistent file system backup as the starting point." This also bodes well, but it is perhaps a bit unclear how far from "perfectly consistent" we can reasonably get before things break.
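For what it's worth, the "file-system backup plus WAL" combination the documentation describes requires WAL archiving to be enabled; a sketch (the archive destination is a placeholder):

```
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal-archive/%f';
SELECT pg_reload_conf();  -- archive_command can be reloaded; wal_level and archive_mode need a restart
```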
Is Azure Disk backup the sort of atomic operation that would help prevent an inconsistent/corrupted database? Has anyone tried this approach before? If it didn't work - why not, and what did?
Dominic Demierre
(1 rep)
Nov 7, 2024, 05:28 PM
0
votes
1
answers
86
views
How to Reorder Primary Key Constraint on a Large TimescaleDB Table Without Downtime?
I am trying to reorder the primary key constraint on sampleTable to improve query performance.
Here are the details:
Database: PostgreSQL (12.18), TimescaleDB (1.7.5)
Table Name: sampleTable
DDL:
CREATE TABLE sampleTable (
ts timestamp NOT NULL,
col1 int8 NOT NULL,
col2 varchar(30) NOT NULL,
col3 int8 NOT NULL,
CONSTRAINT idx_sampleTable_pk PRIMARY KEY (ts, col1, col2, col3)
);
The sampleTable contains a large amount of data (approximately 500GB). I attempted to reorder the primary key constraint with the following queries:
-- Drop the old primary key constraint
ALTER TABLE sampleTable
DROP CONSTRAINT idx_sampleTable_pk;
-- Add the new primary key constraint with the updated column order
ALTER TABLE sampleTable
ADD CONSTRAINT idx_sampleTable_pk PRIMARY KEY (col1, col2, col3, ts);
However, creating the new constraint takes approximately 4 hours, during which time the database is inaccessible, resulting in complete downtime.
To avoid this downtime, I considered an alternative approach suggested in TimescaleDB documentation, known as [CREATE INDEX (Transaction Per Chunk)](https://docs.timescale.com/api/latest/hypertable/create_index/#create-index-transaction-per-chunk) . I plan to create the temporary index in chunks and, once the index is built, drop the existing constraint and create the new one using the new index.
The steps are as follows:
-- Create the temporary index chunkwise
CREATE INDEX sampleTable_temp_idx ON sampleTable (col1, col2, col3, ts)
WITH (timescaledb.transaction_per_chunk);
-- Drop the existing primary key constraint
ALTER TABLE sampleTable
DROP CONSTRAINT idx_sampleTable_pk;
-- Add the new primary key constraint using the temporary index
ALTER TABLE sampleTable
ADD CONSTRAINT idx_sampleTable_pk
PRIMARY KEY USING INDEX sampleTable_temp_idx;
However, I encounter the following error when creating the new constraint with the temporary index:
SQL Error : ERROR: "sampleTable_temp_idx" is not a unique index
Detail: Cannot create a primary key or unique constraint using such an index.
To address this error, I updated the index creation query to:
CREATE UNIQUE INDEX sampleTable_temp_idx ON sampleTable (col1, col2, col3, ts)
WITH (timescaledb.transaction_per_chunk);
But now I get the following error:
SQL Error [0A000]: ERROR: cannot use timescaledb.transaction_per_chunk with unique or primary key
How can I resolve this issue and avoid downtime? I have also tried using CONCURRENTLY for index creation, but it is not supported on hypertables.
Unmesh Kadam
(13 rep)
Aug 28, 2024, 04:58 PM
• Last activity: Aug 28, 2024, 06:27 PM
0
votes
1
answers
56
views
INSTEAD OF INSERT ON never gets triggered from continous aggregate (timescale)
I'm trying to make a trigger based on when a continuous aggregate is updated.
I had the trigger working on the source table, but when trying to add it on the aggregate:
CREATE TRIGGER request_insert_trigger
AFTER INSERT ON request_summary
FOR EACH ROW
EXECUTE FUNCTION notify_request_insert();
I get:
ERROR: "request_summary" is a view
Views cannot have row-level BEFORE or AFTER triggers
So I changed AFTER INSERT ON to INSTEAD OF INSERT ON (see below), based on the documentation, and then I don't get an error. **But my notification never triggers.**
Perhaps what I'm trying to do isn't even possible? (But if it is not possible to have triggers on views, why does the trigger creation with INSTEAD OF INSERT ON not give an error?)
I've tried using non-real-time aggregation, but the problem remains even then. There are no errors in my logs.
CREATE MATERIALIZED VIEW request_summary
WITH (timescaledb.continuous) AS
SELECT name,
time_bucket(INTERVAL '5s', time) AS bucket,
AVG(response_time),
count(*)
FROM request
GROUP BY name, bucket;
CREATE OR REPLACE FUNCTION notify_request_insert()
RETURNS trigger AS $$
DECLARE
payload JSON;
BEGIN
-- Convert the new row into JSON
payload := row_to_json(NEW);
-- Send the notification with the payload
PERFORM pg_notify('insert_event', payload::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER request_insert_trigger
INSTEAD OF INSERT ON request_summary
FOR EACH ROW
EXECUTE FUNCTION notify_request_insert();
Cyberwiz
(101 rep)
Jul 16, 2024, 08:35 PM
• Last activity: Jul 23, 2024, 03:13 PM
0
votes
1
answers
70
views
Best database solution for high volume tables / data indexing (snapshots) [PostgreSQL]
My problem at the moment is that I need to index/save snapshots of some data every x minutes, so every x minutes I am inserting around 500k new rows into a table, each of which represents a snapshot of an account.
My general problem is that this creates around 100 GB of new data per day at the moment. What is the best way to compress this without increasing the time spent on querying?
Is TimescaleDB a nice solution, even though this isn't a time-series-based table?
Ala
(1 rep)
Jul 15, 2024, 11:52 PM
• Last activity: Jul 20, 2024, 04:42 PM
0
votes
0
answers
100
views
What are downsides of using composite types and array type columns in TimescaleDB hypertable?
I'm considering using a composite type, an array of a standard type, and an array of a composite type as columns in a TimescaleDB hypertable. I'm curious whether using such columns has any downsides, like less effective compression compared to using standard scalar data types?
oliora
(101 rep)
Jun 20, 2024, 01:44 PM
• Last activity: Jun 20, 2024, 01:51 PM
1
votes
1
answers
1460
views
Should a time index be in ascending or descending order?
Planning some new tables, I am trying to decide whether an index should be "ascending" or "descending".
The table will be quite large (I imagine approx. 2000 inserts per minute, initially migrating from a different table with about 1 billion rows).
I will be using timescaledb extension for this (for partitioning by time).
This is how the table could be created:
create table "Sample"(
"id" bigserial,
"deviceId" int not null,
"timestamp" timestamptz not null,
"value" float8 not null
);
select create_hypertable('"Sample"', 'timestamp'); -- creates a desc index on "timestamp"
create index on "Sample"("deviceId", "timestamp"); -- should this be "desc"?
These are the two most common queries we'll be running (deviceId and timestamps may vary of course):
select "timestamp", "value"
from "Sample" where "deviceId"=123 and "timestamp"<'2024-01-01Z'
order by "timestamp" desc limit 1;
And
select "timestamp", "value"
from "Sample" where "deviceId"=123 and "timestamp" between '2024-01-01Z' and '2024-02-01Z'
order by "timestamp" asc;
So what I am trying to understand is in what order should the "timestamp" be? And why?
My (probably wrong) intuition tells me that the index with "timestamp" should be in ascending order, because I need to order the data by timestamp in *ascending* order.
But, the examples in the Timescale Documentation always index the time columns in descending order. I don't quite understand why.
What is the ideal choice of indices here?
birgersp
(175 rep)
Jun 5, 2024, 11:48 AM
• Last activity: Jun 6, 2024, 06:39 AM
1
votes
0
answers
30
views
Optimizing subqueries and common table expression queries
I have a database table where users send alert messages, the alert messages have their own categories.
I use PostgreSQL with Timescale.
I have two queries:
1. Given a user, get the latest alert per category.
2. Given a user, get the latest 10 alerts per category.
And the queries are as follows:
Query 1:
SELECT *
FROM (
SELECT DISTINCT ON (category) *
FROM alert_messages
WHERE username = ''
AND (subcategory = ''
OR subcategory LIKE '%')
ORDER BY category, timestamp DESC
) sub
ORDER BY timestamp DESC
Query 2:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER(
PARTITION BY category ORDER BY timestamp DESC) rn
FROM alert_messages
WHERE username = ''
AND subcategory != ''
AND subcategory != ''
)
SELECT *
FROM cte
WHERE rn <= 10
Query 1:
"Unique (cost=2493853.29..2581310.26 rows=64 width=48) (actual time=10606.587..10783.197 rows=16 loops=1)"
" -> Gather Merge (cost=2493853.29..2579577.23 rows=693211 width=48) (actual time=10606.586..10766.832 rows=560581 loops=1)"
" Workers Planned: 8"
" Workers Launched: 8"
" -> Sort (cost=2492853.14..2493069.77 rows=86650 width=48) (actual time=10510.154..10515.959 rows=62287 loops=9)"
" Sort Key: _hyper_1_209_chunk.category, _hyper_1_209_chunk.""timestamp"" DESC"
" Sort Method: external merge Disk: 15824kB"
" Worker 0: Sort Method: quicksort Memory: 1282kB"
" Worker 1: Sort Method: external merge Disk: 5264kB"
" Worker 2: Sort Method: external merge Disk: 6744kB"
" Worker 3: Sort Method: external merge Disk: 4344kB"
" Worker 4: Sort Method: quicksort Memory: 435kB"
" Worker 5: Sort Method: quicksort Memory: 497kB"
" Worker 6: Sort Method: quicksort Memory: 1587kB"
" Worker 7: Sort Method: external merge Disk: 2992kB"
" -> Parallel Append (cost=0.00..2484184.48 rows=86650 width=48) (actual time=7831.976..10441.152 rows=62287 loops=9)"
" -> Parallel Seq Scan on _hyper_1_209_chunk (cost=0.00..397507.98 rows=1522 width=40) (actual time=8285.449..10145.736 rows=9042 loops=1)"
" Filter: ((((username)::text = ''::text) AND ((subcategory)::text = ''::text)) OR (()::text ~~ '%diagnostic_updater'::text))"
" Rows Removed by Filter: 7566380"
" -> Parallel Seq Scan on _hyper_1_220_chunk (cost=0.00..384560.45 rows=4955 width=37) (actual time=1920.476..2716.668 rows=808 loops=4)"
" Filter: ((((username)::text = ''::text) AND ((subcategory)::text = ''::text)) OR (()::text ~~ '%diagnostic_updater'::text))"
" Rows Removed by Filter: 4076316"
" -> Parallel Seq Scan on _hyper_1_221_chunk (cost=0.00..321639.51 rows=2258 width=37) (actual time=935.219..1268.861 rows=480 loops=9)"
" Filter: ((((username)::text = ''::text) AND ((subcategory)::text = ''::text)) OR (()::text ~~ '%diagnostic_updater'::text))"
" Rows Removed by Filter: 1516415"
" [ .... Removed similar lines .... ]"
"Planning Time: 15.378 ms"
"JIT:"
" Functions: 7309"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 465.846 ms, Inlining 513.276 ms, Optimization 39806.196 ms, Emission 28927.894 ms, Total 69713.211 ms"
"Execution Time: 10814.961 ms"
Query 2:
"WindowAgg (cost=693791.34..987063.68 rows=2077557 width=76) (actual time=9684.123..10218.021 rows=226 loops=1)"
" Run Condition: (row_number() OVER (?) Gather Merge (cost=693791.34..950706.43 rows=2077557 width=68) (actual time=9684.089..10159.708 rows=1671306 loops=1)"
" Workers Planned: 8"
" Workers Launched: 8"
" -> Sort (cost=692791.19..693440.41 rows=259687 width=68) (actual time=8732.486..8751.575 rows=185701 loops=9)"
" Sort Key: _hyper_1_114_chunk.hardware_id, _hyper_1_114_chunk.""timestamp"" DESC"
" Sort Method: external merge Disk: 9592kB"
" Worker 0: Sort Method: external merge Disk: 20064kB"
" Worker 1: Sort Method: external merge Disk: 13016kB"
" Worker 2: Sort Method: quicksort Memory: 25kB"
" Worker 3: Sort Method: external merge Disk: 24840kB"
" Worker 4: Sort Method: quicksort Memory: 26kB"
" Worker 5: Sort Method: external merge Disk: 33080kB"
" Worker 6: Sort Method: external merge Disk: 26304kB"
" Worker 7: Sort Method: quicksort Memory: 26kB"
" -> Parallel Append (cost=209.98..663196.80 rows=259687 width=68) (actual time=8107.200..8568.288 rows=185701 loops=9)"
" -> Parallel Bitmap Heap Scan on _hyper_1_114_chunk (cost=631.98..11084.95 rows=22851 width=65) (actual time=7248.905..7296.457 rows=71547 loops=1)"
" Recheck Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" Rows Removed by Filter: 1076"
" -> Bitmap Index Scan on _hyper_1_114_chunk_clerts_clerts_username_28789efc_like_1 (cost=0.00..614.27 rows=72623 width=0) (actual time=4.217..4.217 rows=72623 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" -> Parallel Bitmap Heap Scan on _hyper_1_116_chunk (cost=516.44..9825.32 rows=24531 width=68) (actual time=4730.966..4745.737 rows=29615 loops=2)"
" Recheck Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" -> Bitmap Index Scan on _hyper_1_116_chunk_clerts_clerts_username_28789efc_like_1 (cost=0.00..501.72 rows=59230 width=0) (actual time=4.505..4.505 rows=59230 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" [ .... Cut similar lines .... ]"
" -> Parallel Index Scan using _hyper_1_220_chunk_clerts_clerts_username_28789efc_1 on _hyper_1_220_chunk (cost=0.43..1753.27 rows=15825 width=68) (actual time=0.928..98.999 rows=25380 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" Rows Removed by Filter: 265"
" -> Parallel Index Scan using _hyper_1_221_chunk_clerts_clerts_username_28789efc_1 on _hyper_1_221_chunk (cost=0.43..770.81 rows=5459 width=68) (actual time=1.014..82.271 rows=8872 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" Rows Removed by Filter: 914"
" [ .... Cut similar lines .... ]"
" -> Parallel Index Scan using _hyper_1_102_chunk_clerts_clerts_username_28789efc_like_1 on _hyper_1_102_chunk (cost=0.29..2.51 rows=1 width=68) (actual time=0.284..0.285 rows=0 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" -> Parallel Index Scan using _hyper_1_104_chunk_clerts_clerts_username_28789efc_like_1 on _hyper_1_104_chunk (cost=0.29..2.51 rows=1 width=65) (actual time=0.333..0.333 rows=0 loops=1)"
" Index Cond: ((username)::text = ''::text)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text))"
" [ .... Cut similar lines .... ]"
" -> Parallel Seq Scan on _hyper_1_206_chunk (cost=0.00..34850.68 rows=40556 width=68) (actual time=0.124..79.937 rows=2145 loops=6)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text) AND ((username)::text = ''::text))"
" Rows Removed by Filter: 104243"
" -> Parallel Seq Scan on _hyper_1_117_chunk (cost=0.00..34643.20 rows=31455 width=69) (actual time=0.955..152.430 rows=22981 loops=3)"
" Filter: (((subcategory)::text ''::text) AND ((subcategory)::text ''::text) AND ((username)::text = ''::text))"
" Rows Removed by Filter: 183738"
" [ .... Cut similar lines .... ]"
"Planning Time: 15.181 ms"
"JIT:"
" Functions: 10733"
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 526.707 ms, Inlining 514.124 ms, Optimization 45236.398 ms, Emission 27189.984 ms, Total 73467.213 ms"
"Execution Time: 10248.488 ms"
Query 1 looks straightforward, with the planner going through the usernames and the subcategories. I have tried to add a multi-column index here for the username and subcategory, but it didn't generally improve the execution time significantly.
Query 2 uses an index that's already there for the username, but having to go through millions and millions of records takes time.
Not sure how to optimize these queries, is there a better way to write these queries, or an index that would speed up the queries?
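On the index side, a composite index of this shape is one candidate, since both queries filter on username and partition/order by category and timestamp (names taken from the queries above; whether the planner prefers it over the existing indexes is not guaranteed):

```
CREATE INDEX alert_messages_user_cat_ts_idx
    ON alert_messages (username, category, "timestamp" DESC);
```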
Another complication in the situation is that there are users and message categories whose last message was 6+ months ago. This may be a flaw in the system, as users are supposed to send messages in all categories on a regular basis. If this is the case, then it might be viable to partition the data in such a way that the queries would only look at the past month or two of data.
**EDIT**
The best way to tackle this issue is to partition the database.
Running queries on a large database (61 million rows and growing) will tend to be slow, especially as the queries get more complicated.
It will of course depend on the use case. But for my use case, it would be better to just partition the last 3 months of data, to get a fast response while still providing accurate information.
Razgriz
(113 rep)
May 28, 2024, 12:06 PM
• Last activity: May 30, 2024, 07:17 AM
0
votes
1
answers
72
views
A different filter value results in a different (slower) query plan
I am executing the following query in Postgres 15 with the Timescale extension on an alerts table to get the latest alert for a username.
EXPLAIN ANALYZE
SELECT *
FROM alerts_alerts
WHERE username IN ('')
ORDER BY timestamp DESC
LIMIT 1
For most usernames, the query executes quickly, under 150ms. However, for some usernames, it takes longer. Almost all databases have around the same number of alerts, around 450, and most of them have fairly recent data, all within the past 6 months.
Here's the
Explain Analyze
for the problematic username:
~~~none
"Limit (cost=0.29..2262.68 rows=1 width=86) (actual time=36129.346..36129.370 rows=1 loops=1)"
" -> Custom Scan (ChunkAppend) on alerts_alerts (cost=0.29..2262.68 rows=1 width=86) (actual time=36129.344..36129.368 rows=1 loops=1)"
" Order: alerts_alerts.""timestamp"" DESC"
" -> Index Scan using _hyper_1_234_chunk_alerts_alerts_timestamp_idx_1 on _hyper_1_234_chunk (cost=0.29..2262.68 rows=1 width=89) (actual time=5.795..5.796 rows=0 loops=1)"
" Filter: ((username)::text = 'username_long_query'::text)"
" Rows Removed by Filter: 30506"
" -> Index Scan using _hyper_1_233_chunk_alerts_alerts_timestamp_idx_1 on _hyper_1_233_chunk (cost=0.29..4337.82 rows=1 width=91) (actual time=11.112..11.112 rows=0 loops=1)"
" Filter: ((username)::text = 'username_long_query'::text)"
" Rows Removed by Filter: 59534"
[ ... Cut redundant log lines here ... ]
" -> Index Scan using _hyper_1_156_chunk_alerts_alerts_timestamp_idx_1 on _hyper_1_156_chunk (cost=0.42..11418.54 rows=2591 width=80) (never executed)"
" Filter: ((username)::text = 'username_long_query'::text)"
" -> Index Scan using _hyper_1_155_chunk_alerts_alerts_timestamp_idx_1 on _hyper_1_155_chunk (cost=0.29..7353.95 rows=749 width=84) (never executed)"
" Filter: ((username)::text = 'username_long_query'::text)"
[ ... Cut redundant log lines here ... ]
"Planning Time: 13.154 ms"
"Execution Time: 36129.923 ms"
~~~
Now, this is the Explain Analyze
for the usernames that execute quickly:
~~~none
"Limit (cost=471.73..471.73 rows=1 width=458) (actual time=1.672..1.691 rows=1 loops=1)"
" -> Sort (cost=471.73..472.76 rows=414 width=458) (actual time=1.671..1.689 rows=1 loops=1)"
" Sort Key: _hyper_1_234_chunk.""timestamp"" DESC"
" Sort Method: top-N heapsort Memory: 27kB"
" -> Append (cost=0.29..469.66 rows=414 width=457) (actual time=1.585..1.654 rows=210 loops=1)"
" -> Index Scan using _hyper_1_234_chunk_alerts_alerts_fleet_a3933a38_1 on _hyper_1_234_chunk (cost=0.29..2.49 rows=1 width=372) (actual time=0.006..0.007 rows=0 loops=1)"
" Index Cond: ((username)::text = 'username_value'::text)"
" -> Index Scan using _hyper_1_233_chunk_alerts_alerts_fleet_a3933a38_1 on _hyper_1_233_chunk (cost=0.29..2.37 rows=1 width=385) (actual time=0.006..0.006 rows=0 loops=1)"
" Index Cond: ((username)::text = 'username_value'::text)"
[ ... Cut redundant log lines here ... ]
" -> Seq Scan on _hyper_1_83_chunk (cost=0.00..1.12 rows=1 width=504) (actual time=0.013..0.013 rows=0 loops=1)"
" Filter: ((username)::text = 'username_value'::text)"
" Rows Removed by Filter: 10"
" -> Seq Scan on _hyper_1_81_chunk (cost=0.00..1.12 rows=1 width=504) (actual time=0.009..0.009 rows=0 loops=1)"
" Filter: ((username)::text = 'username_value'::text)"
" Rows Removed by Filter: 10"
"Planning Time: 899.811 ms"
"Execution Time: 2.613 ms"
~~~
Preliminary research suggests doing maintenance on the database table. After executing the vacuum command, the queries were executed again but the results did not change.
It should also be noted that there are other usernames that use the "problematic" planning, but the execution time is still quick.
Not sure how to resolve this discrepancy in query execution time. It could be useful to add another index, but as I'm new to PostgreSQL, I'm currently not sure about the best approach to this.
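As a sketch of the "another index" idea: an index leading on username and ending on timestamp would let each chunk return its newest row for a given username directly, instead of walking the timestamp index and filtering (illustrative only):

```
CREATE INDEX alerts_alerts_username_ts_idx
    ON alerts_alerts (username, "timestamp" DESC);
```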
Razgriz
(113 rep)
May 24, 2024, 11:41 AM
• Last activity: May 27, 2024, 06:34 AM
0
votes
2
answers
251
views
Database size is far larger than sum of tables, even after VACUUM FULL?
Here is my query and its output in psql, run as an admin on that database. There is a massive discrepancy between the 'total size of the database' at 50GB, and the sum of the tables, ~7MB. This is immediately after running `VACUUM FULL ANALYZE table_name` for every table. My understanding is that `p...
Here is my query and its output in psql, run as an admin on that database.
There is a massive discrepancy between the 'total size of the database' at 50GB, and the sum of the tables, ~7MB.
This is immediately after running
VACUUM FULL ANALYZE table_name
for every table. My understanding is that pg_total_relation_size
includes the size of things like indexes, so that does not account for the discrepancy. What could be causing this?
core=# DO $$
core$# DECLARE
core$# table_name TEXT;
core$# db_size TEXT;
core$# table_size TEXT;
core$# index_size TEXT;
core$# BEGIN
core$# -- Get the total size of the database
core$# SELECT pg_size_pretty(pg_database_size(current_database())) INTO db_size;
core$# RAISE NOTICE 'Total size of the database: %', db_size;
core$#
core$# -- Cursor to fetch table names
core$# FOR table_name IN
core$# SELECT t.table_name
core$# FROM information_schema.tables t
core$# WHERE t.table_schema = 'public' -- Change 'public' to your schema name if it's different
core$# AND t.table_type = 'BASE TABLE'
core$# LOOP
core$# -- Get the size of each table and print
core$# EXECUTE format('SELECT pg_size_pretty(pg_total_relation_size(''%I.%I''))', 'public', table_name) INTO table_size;
core$# RAISE NOTICE 'Table % size: %', table_name, table_size;
core$#
core$# END LOOP;
core$# END $$;
NOTICE: Total size of the database: 50 GB
NOTICE: Table table1 size: 16 kB
NOTICE: Table table2 size: 48 kB
NOTICE: Table table3 size: 16 kB
NOTICE: Table table4 size: 32 kB
NOTICE: Table table5 size: 32 kB
NOTICE: Table table6 size: 40 kB
NOTICE: Table table7 size: 72 kB
NOTICE: Table table8 size: 80 kB
NOTICE: Table table9 size: 24 kB
NOTICE: Table table10 size: 2440 kB
NOTICE: Table table11 size: 32 kB
NOTICE: Table table12 size: 32 kB
NOTICE: Table table13 size: 120 kB
NOTICE: Table table14 size: 24 kB
NOTICE: Table table15 size: 24 kB
NOTICE: Table table16 size: 24 kB
NOTICE: Table table17 size: 32 kB
NOTICE: Table table18 size: 32 kB
NOTICE: Table table19 size: 3352 kB
NOTICE: Table table20 size: 40 kB
NOTICE: Table table21 size: 144 kB
NOTICE: Table table122 size: 24 kB
DO
It doesn't seem to be related to large objects either:
SELECT pg_size_pretty(pg_total_relation_size('pg_largeobject'));
pg_size_pretty
----------------
8192 bytes
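A per-schema breakdown makes the gap visible (the DO block above only iterates over the public schema); a sketch using the system catalogs, counting ordinary tables and materialized views (pg_total_relation_size already includes their indexes and TOAST data):

```
SELECT n.nspname AS schema_name,
       pg_size_pretty(SUM(pg_total_relation_size(c.oid))) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
GROUP BY n.nspname
ORDER BY SUM(pg_total_relation_size(c.oid)) DESC;
```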
Datguy
(11 rep)
Apr 27, 2024, 02:32 PM
• Last activity: Apr 29, 2024, 03:46 PM
0
votes
0
answers
66
views
Optimize queries to track trading algorithm performance
I have a panel where a user can upload Python trading algorithms acting on the Binance API and track the performance of each algorithm in a chart of data points per algorithm:
For each algorithm the amount of BTC, USDT and the total funds (USDT + BTC * actual_btc_price) are displayed.
To achieve this I have the following tables.
**algorithms** - Holding all the algorithms and their configuration.

Table "public.algorithms"
Column | Type | Collation | Nullable | Default
------------------+------------------------+-----------+----------+-------------------------
id | character varying(255) | | not null |
description | character varying(512) | | not null |
start_funds_usdt | numeric | | not null | 0.00
interval | character varying(3) | | not null | '1s'::character varying
run_every_sec | integer | | not null | 0
user_id | integer | | not null |
Indexes:
"algorithms_pkey" PRIMARY KEY, btree (id)
Referenced by:
TABLE "_timescaledb_internal._hyper_4_1_chunk" CONSTRAINT "1_1_history_algorithm_id_fkey" FOREIGN KEY (algorithm_id) REFERENCES algorithms(id) ON DELETE CASCADE
TABLE "history" CONSTRAINT "history_algorithm_id_fkey" FOREIGN KEY (algorithm_id) REFERENCES algorithms(id) ON DELETE CASCADE
**history** - Each order buy/sell is stored in history.
Table "public.history"
Column | Type | Collation | Nullable | Default
--------------+-----------------------------+-----------+----------+-------------------------
algorithm_id | character varying(255) | | not null |
order_id | character varying(12) | | | NULL::character varying
action | character varying(5) | | | NULL::character varying
btc | numeric | | not null |
usdt | numeric | | not null |
btc_price | numeric | | not null |
created_at | timestamp without time zone | | not null | CURRENT_TIMESTAMP
Indexes:
"history_created_at_idx" btree (created_at DESC)
Foreign-key constraints:
"history_algorithm_id_fkey" FOREIGN KEY (algorithm_id) REFERENCES algorithms(id) ON DELETE CASCADE
Triggers:
ts_insert_blocker BEFORE INSERT ON history FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Number of child tables: 1 (Use \d+ to list them.)
In the chart I want to display the value of btc, usdt, and total funds for every minute. Note that there were the following impediments:
- Not every minute a order is executed and thus the table history doesn't have a record for each minute.
- The price of BTC is volatile so the value of total_funds can vary even if the algorithm is not running or isn't making any orders.
The first impediment I solved by executing the following query every minute via a cron job in my Rust code:
# {} is the current btc_price retrieved by the Binance API.
DO
$$
declare f record;
begin
for f in SELECT DISTINCT id from algorithms
loop
insert into history(algorithm_id, btc, usdt, btc_price) values (f.id, 0.0, 0.00, {});
end loop;
end
$$;
By executing this I have access to the latest btc_price for every minute in the table history, even if an order was not made in that minute.
The following issue was that my initial approach was slow: I took the first timestamp available in history for a given algorithm, added a minute to it until the current time was reached, and in that loop summed all the values of the USDT and BTC columns where the created_at value was lower than the timestamp in the loop.
This was incredibly slow. I solved it by making a view called **history_aggregate**.
CREATE MATERIALIZED VIEW history_aggregate AS
SELECT
created_at,
algorithm_id,
SUM(usdt) OVER (PARTITION BY algorithm_id ORDER BY created_at) AS total_usdt,
SUM(btc) OVER (PARTITION BY algorithm_id ORDER BY created_at) as total_btc
FROM
history
GROUP BY
algorithm_id, history.btc, history.btc_price, history.usdt, history.created_at;
This view gets refreshed every minute. This way the calculations (summing for each minute) are already done in the background before they are requested.
When the chart is requested the following query gets executed:
WITH btc_price_cte AS (
SELECT btc_price FROM history where algorithm_id = $1 ORDER BY created_at DESC LIMIT 1
)
SELECT
start_funds_usdt + COALESCE(h.total_usdt, 0) + COALESCE(h.total_btc * btc_price_cte.btc_price, 0) AS current_funds_total,
start_funds_usdt + COALESCE(h.total_usdt, 0) AS current_funds_usdt,
COALESCE(h.total_btc, 0) AS current_funds_btc,
h.created_at::TEXT AS ts
FROM
algorithms
LEFT JOIN
history_aggregate h ON h.algorithm_id = algorithms.id
CROSS JOIN
btc_price_cte
WHERE
algorithms.id = $1
GROUP BY
algorithm_id, start_funds_usdt, h.total_usdt, h.total_btc, btc_price_cte.btc_price, h.created_at
ORDER BY h.created_at;
I use a Common Table Expression to retrieve the price of BTC at a given timestamp, and by calling data from the view I can get the USDT and BTC amount the algorithm had on that specific timestamp.
This approach works and even seems to run pretty fast, but I was wondering if this approach could be optimized, for example by using TimescaleDB features?
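One TimescaleDB-specific angle, sketched below: the per-minute placeholder inserts could potentially be replaced by gap-filling at query time with time_bucket_gapfill() and locf() (the algorithm id and time range are placeholders):

```
SELECT time_bucket_gapfill('1 minute', created_at) AS minute,
       locf(last(btc_price, created_at)) AS btc_price,  -- carry the last known price forward
       COALESCE(SUM(usdt), 0) AS usdt_delta,
       COALESCE(SUM(btc), 0)  AS btc_delta
FROM history
WHERE algorithm_id = 'my-algo'
  AND created_at >= '2024-01-01' AND created_at < '2024-01-08'
GROUP BY minute
ORDER BY minute;
```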
O'Niel
(61 rep)
Jan 24, 2024, 02:23 AM
• Last activity: Jan 27, 2024, 01:38 AM