Query performance on static, large PostgreSQL table
4
votes
1
answer
1303
views
I tried to have this as detailed as possible. Sorry about the length!
### Background
I created the following partitioned table
protein_snp_assoc
on a PostgreSQL (version 12.13) database:
CREATE TABLE protein_snp_assoc (
protein_id int not null,
snp_id int not null,
beta double precision,
se double precision,
logp double precision
) PARTITION BY RANGE (snp_id);
I then created 51 partitions, each containing roughly 150 million lines (total lines 7.65 billion), based on the following template:
CREATE TABLE IF NOT EXISTS protein_snp_assoc_(x) PARTITION OF protein_snp_assoc
FOR VALUES FROM (y) TO (z);
where x
ranged from 1 to 51, and y, z
defined intervals, each of length 150,000. As an example, the first two and last partitions are:
protein_snp_assoc_1 FOR VALUES FROM (1) TO (150001),
protein_snp_assoc_2 FOR VALUES FROM (150001) TO (300001), ...
protein_snp_assoc_51 FOR VALUES FROM (7500001) TO (7650001)
The variable column protein_id
has 1,000 unique values (1 to 1,000) and snp_id
has 7,500,000 unique values (1 to 7,650,001). As the pair (snp_id, protein_id)
uniquely determines a row in the table, I used the two columns to create a BTree index, with snp_id
as the left-most variable:
CREATE INDEX ON protein_snp_assoc (snp_id, protein_id);
This will be a static database. It currently has about 20% of the total data that will be on it (since I'm prototyping), but once all the data has been added to the database, no further rows will be added (nor deleted).
### Typical queries
The most common queries will be (a) single SNP/protein queries, (b) single protein, multiple SNPs queries, and (c) multiple proteins and multiple SNPs queries.
Example queries where I use VALUES
as I read on this site that it can increase performance when IN(...)
has multiple values.
-- Single SNP/protein
SELECT
*
FROM
protein_snp_assoc
WHERE
snp_id IN (VALUES (1))
AND
protein_id IN (VALUES(1));
-- Multiple SNPs, single protein
SELECT
*
FROM
protein_snp_assoc
WHERE
snp_id IN (VALUES (1), (2))
AND
protein_id IN (VALUES(1));
-- Multiple SNPs, multiple proteins
SELECT
*
FROM
protein_snp_assoc
WHERE
snp_id IN (VALUES (1), (2))
AND
protein_id IN (VALUES (1),(2));
The EXPLAIN ANALYZE
for each query can be seen here (pastebin links):
[Single SNP/protein (pastebin)](https://pastebin.com/zfaRCibT) , [Multiple SNPs, single protein (pastebin)](https://pastebin.com/jLdu2hqJ) , [Mulitple SNPs, mulitple protein (pastebin)](https://pastebin.com/MYjBxjpT) .
### Benchmarking and comparison to Arrow/parquet
I ran 1,000 queries for multiple SNP/protein combinations, where the SNPs and proteins were randomly drawn before being inserted into the query. To get some sort of reference, I converted the raw data files I used to populate the database to .parquet
files and ran similar queries using R and the arrow
package. The results can be seen in the table below (all times are in milliseconds, lq
and uq
are the 25% and 75% percentiles, respectively).
|index | n_snps| n_proteins| min| lq| mean| median| uq| max|
|:--------|----------:|-------:|---------:|----------:|----------:|----------:|----------:|---------:|
|postgres | 1| 1| 0.05900| 14.71125| 18.02112| 17.92850| 21.14350| 45.2850|
|arrow | 1| 1| 34.31822| 44.62842| 49.30316| 46.29033| 48.07222| 577.1411|
|postgres | 1| 2| 4.07100| 20.97125| 25.40618| 25.15250| 29.35375| 68.7700|
|arrow | 1| 2| 47.61873| 61.47562| 67.87060| 63.99824| 65.68011| 629.5121|
|postgres | 10| 1| 118.18900| 167.11100| 181.76304| 180.50250| 196.41475| 262.9640|
|arrow | 10| 1| 138.73902| 164.25678| 177.47847| 167.69684| 176.15489| 704.3115|
|postgres | 10| 2| 168.10500| 231.74825| 248.74577| 248.45400| 264.95825| 330.2810|
|arrow | 10| 2| 219.73495| 269.54206| 287.34815| 281.79291| 286.22803| 819.4827|
|postgres | 10| 10| 731.77300| 893.28625| 940.90282| 941.69650| 989.38625| 1162.4810|
|arrow | 10| 10| 930.18264| 1038.39510| 1089.43522| 1080.01131| 1100.22580| 2313.4975|
|postgres | 50| 1| 665.23800| 799.89600| 849.73860| 850.91900| 898.27900| 1050.0710|
|arrow | 50| 1| 682.10049| 711.62065| 766.24498| 735.49283| 750.97367| 1335.6018|
As you can see, as the number of SNPs or proteins (or both) were increased, PostgreSQL and Arrow started performing similarly (although the worst-case for Arrow was consistently worse).
### Hardware
[CPU (pastebin)](https://pastebin.com/Xr2Y7NLh) . The HDD is a Seagate IronWolf 10TB (ST10000VN0008). Memory is 64GB but I can't see the specific type since I don't have sudo
privileges on the machine. Operating system: Ubuntu 22.04.1 LTS.
### My question
The results of the benchmark make me believe that my database is not optimized. I'm worried that as I start adding more data to the database, performance will suffer. Is there any way I can speed up queries that involve multiple proteins and SNPs, either with better design, queries or some other sort of tuning?
### Update 2023-03-12
Thanks to Erwin and everyone else who has engaged. I followed Erwin's directions exactly (the only exception being that I couldn't update from v12 to v15) and then redid the benchmarks for this new table. Here are the results (compared to the original design), where index_order = snp_first
is the original design and index_order = protein_first
is the design proposed by Erwin:
|index_order | n_snps| n_proteins| min| lq| mean| median| uq| max|
|:-------------|------:|----------:|--------:|----------:|----------:|---------:|----------:|--------:|
|snp_first | 1| 1| 0.059| 14.71125| 18.02112| 17.9285| 21.14350| 45.285|
|protein_first | 1| 1| 0.060| 20.96200| 24.87686| 26.3945| 30.31275| 126.046|
|snp_first | 1| 2| 4.071| 20.97125| 25.40618| 25.1525| 29.35375| 68.770|
|protein_first | 1| 2| 2.764| 37.02300| 44.31820| 45.7595| 52.30925| 84.515|
|snp_first | 10| 1| 118.189| 167.11100| 181.76304| 180.5025| 196.41475| 262.964|
|protein_first | 10| 1| 29.754| 215.37700| 221.30159| 255.3445| 276.62725| 380.930|
|snp_first | 10| 2| 168.105| 231.74825| 248.74577| 248.4540| 264.95825| 330.281|
|protein_first | 10| 2| 88.473| 320.08475| 417.07273| 461.6155| 501.66350| 593.604|
|snp_first | 10| 10| 731.773| 893.28625| 940.90282| 941.6965| 989.38625| 1162.481|
|protein_first | 10| 10| 1189.058| 1906.78050| 2040.40170| 2054.9985| 2194.80550| 2595.215|
|snp_first | 50| 1| 665.238| 799.89600| 849.73860| 850.9190| 898.27900| 1050.071|
|protein_first | 50| 1| 200.521| 910.52700| 934.64351| 1091.5340| 1149.79875| 1319.777|
As you can see, the original design is considerably faster, especially on the most time-consuming queries. I'll have a chat with the sys admin about updating to v15 this week, to see if that improves performance. In any case, I think this experiment has demonstrated that this is either a query problem (the queries I wrote are probably suboptimal, see comments on how I use VALUES
) or a hardware problem (the server is old).
**Answers to some questions in comments**
*jjanes:* See this pastebin: https://pastebin.com/qQR3GtZ4
*a_horse_with_no_name*: The VALUES
idea came from here: https://dba.stackexchange.com/questions/91247/optimizing-a-postgres-query-with-a-large-in
I wrote a test query:
EXPLAIN
WITH value_list (protein_id, snp_id) as (
values
(1, 1),
(1, 2)
)
SELECT
*
FROM protein_snp_assoc AS p
INNER JOIN
value_list v on (p.protein_id, snp_id) = (v.protein_id, v.snp_id);
I thought this gave me the same query plan as WHERE/IN
but I see now I was wrong. I'll look into this and see if it's better. **Edit:** it seems to perform on par with WHERE/IN
and VALUES
. So I guess this isn't the real bottleneck.
*bobflux*: I can't share the data but you can simulate it easily. Here's a quick example in R:
{R}
sim_data <- function(i, n_snps) {
data.frame(
protein_id = rep(i, n_snps),
snp_id = 1:n_snps,
beta = rnorm(n = n_snps, mean = 0, sd = 1),
se = abs(rnorm(n = n_snps, mean = 0, sd = 1)),
logp = abs(rnorm(n = n_snps, mean = 2, sd = 1))
)
}
protein_id <- 10
n_snps <- 7650000
sim_data(protein_id, n_snps)
*nz_21*: I wrote custom scripts in R and bash.
Asked by jommi
(43 rep)
Mar 11, 2023, 12:47 AM
Last activity: Mar 12, 2023, 10:16 PM
Last activity: Mar 12, 2023, 10:16 PM