Query performance on static, large PostgreSQL table

4 votes

1 answer

1303 views

postgresql database-design query-performance postgresql-performance read-only-database

I tried to have this as detailed as possible. Sorry about the length! ### Background I created the following partitioned table protein_snp_assoc on a PostgreSQL (version 12.13) database:

CREATE TABLE protein_snp_assoc (
  protein_id    int not null,
  snp_id        int not null,
  beta          double precision,
  se            double precision,
  logp          double precision
) PARTITION BY RANGE (snp_id);

I then created 51 partitions, each containing roughly 150 million lines (total lines 7.65 billion), based on the following template:

CREATE TABLE IF NOT EXISTS protein_snp_assoc_(x) PARTITION OF protein_snp_assoc
  FOR VALUES FROM (y) TO (z);

where x ranged from 1 to 51, and y, z defined intervals, each of length 150,000. As an example, the first two and last partitions are:

protein_snp_assoc_1 FOR VALUES FROM (1) TO (150001),
protein_snp_assoc_2 FOR VALUES FROM (150001) TO (300001), ...
protein_snp_assoc_51 FOR VALUES FROM (7500001) TO (7650001)

The variable column protein_id has 1,000 unique values (1 to 1,000) and snp_id has 7,500,000 unique values (1 to 7,650,001). As the pair (snp_id, protein_id) uniquely determines a row in the table, I used the two columns to create a BTree index, with snp_id as the left-most variable:

CREATE INDEX ON protein_snp_assoc (snp_id, protein_id);

This will be a static database. It currently has about 20% of the total data that will be on it (since I'm prototyping), but once all the data has been added to the database, no further rows will be added (nor deleted). ### Typical queries The most common queries will be (a) single SNP/protein queries, (b) single protein, multiple SNPs queries, and (c) multiple proteins and multiple SNPs queries. Example queries where I use VALUES as I read on this site that it can increase performance when IN(...) has multiple values.

-- Single SNP/protein
SELECT 
  * 
FROM 
  protein_snp_assoc
WHERE
  snp_id IN (VALUES (1))
AND
  protein_id IN (VALUES(1));
  
-- Multiple SNPs, single protein
SELECT 
  * 
FROM 
  protein_snp_assoc
WHERE
  snp_id IN (VALUES (1), (2))
AND
  protein_id IN (VALUES(1));
  
-- Multiple SNPs, multiple proteins
SELECT 
  * 
FROM 
  protein_snp_assoc 
WHERE
  snp_id IN (VALUES (1), (2))
AND
  protein_id IN (VALUES (1),(2));

The EXPLAIN ANALYZE for each query can be seen here (pastebin links): [Single SNP/protein (pastebin)](https://pastebin.com/zfaRCibT) , [Multiple SNPs, single protein (pastebin)](https://pastebin.com/jLdu2hqJ) , [Mulitple SNPs, mulitple protein (pastebin)](https://pastebin.com/MYjBxjpT) . ### Benchmarking and comparison to Arrow/parquet I ran 1,000 queries for multiple SNP/protein combinations, where the SNPs and proteins were randomly drawn before being inserted into the query. To get some sort of reference, I converted the raw data files I used to populate the database to .parquet files and ran similar queries using R and the arrow package. The results can be seen in the table below (all times are in milliseconds, lq and uq are the 25% and 75% percentiles, respectively). |index | n_snps| n_proteins| min| lq| mean| median| uq| max| |:--------|----------:|-------:|---------:|----------:|----------:|----------:|----------:|---------:| |postgres | 1| 1| 0.05900| 14.71125| 18.02112| 17.92850| 21.14350| 45.2850| |arrow | 1| 1| 34.31822| 44.62842| 49.30316| 46.29033| 48.07222| 577.1411| |postgres | 1| 2| 4.07100| 20.97125| 25.40618| 25.15250| 29.35375| 68.7700| |arrow | 1| 2| 47.61873| 61.47562| 67.87060| 63.99824| 65.68011| 629.5121| |postgres | 10| 1| 118.18900| 167.11100| 181.76304| 180.50250| 196.41475| 262.9640| |arrow | 10| 1| 138.73902| 164.25678| 177.47847| 167.69684| 176.15489| 704.3115| |postgres | 10| 2| 168.10500| 231.74825| 248.74577| 248.45400| 264.95825| 330.2810| |arrow | 10| 2| 219.73495| 269.54206| 287.34815| 281.79291| 286.22803| 819.4827| |postgres | 10| 10| 731.77300| 893.28625| 940.90282| 941.69650| 989.38625| 1162.4810| |arrow | 10| 10| 930.18264| 1038.39510| 1089.43522| 1080.01131| 1100.22580| 2313.4975| |postgres | 50| 1| 665.23800| 799.89600| 849.73860| 850.91900| 898.27900| 1050.0710| |arrow | 50| 1| 682.10049| 711.62065| 766.24498| 735.49283| 750.97367| 1335.6018| As you can see, as the number of SNPs or proteins (or both) were increased, PostgreSQL and Arrow started performing similarly (although the worst-case for Arrow was consistently worse). ### Hardware [CPU (pastebin)](https://pastebin.com/Xr2Y7NLh) . The HDD is a Seagate IronWolf 10TB (ST10000VN0008). Memory is 64GB but I can't see the specific type since I don't have sudo privileges on the machine. Operating system: Ubuntu 22.04.1 LTS. ### My question The results of the benchmark make me believe that my database is not optimized. I'm worried that as I start adding more data to the database, performance will suffer. Is there any way I can speed up queries that involve multiple proteins and SNPs, either with better design, queries or some other sort of tuning? ### Update 2023-03-12 Thanks to Erwin and everyone else who has engaged. I followed Erwin's directions exactly (the only exception being that I couldn't update from v12 to v15) and then redid the benchmarks for this new table. Here are the results (compared to the original design), where index_order = snp_first is the original design and index_order = protein_first is the design proposed by Erwin: |index_order | n_snps| n_proteins| min| lq| mean| median| uq| max| |:-------------|------:|----------:|--------:|----------:|----------:|---------:|----------:|--------:| |snp_first | 1| 1| 0.059| 14.71125| 18.02112| 17.9285| 21.14350| 45.285| |protein_first | 1| 1| 0.060| 20.96200| 24.87686| 26.3945| 30.31275| 126.046| |snp_first | 1| 2| 4.071| 20.97125| 25.40618| 25.1525| 29.35375| 68.770| |protein_first | 1| 2| 2.764| 37.02300| 44.31820| 45.7595| 52.30925| 84.515| |snp_first | 10| 1| 118.189| 167.11100| 181.76304| 180.5025| 196.41475| 262.964| |protein_first | 10| 1| 29.754| 215.37700| 221.30159| 255.3445| 276.62725| 380.930| |snp_first | 10| 2| 168.105| 231.74825| 248.74577| 248.4540| 264.95825| 330.281| |protein_first | 10| 2| 88.473| 320.08475| 417.07273| 461.6155| 501.66350| 593.604| |snp_first | 10| 10| 731.773| 893.28625| 940.90282| 941.6965| 989.38625| 1162.481| |protein_first | 10| 10| 1189.058| 1906.78050| 2040.40170| 2054.9985| 2194.80550| 2595.215| |snp_first | 50| 1| 665.238| 799.89600| 849.73860| 850.9190| 898.27900| 1050.071| |protein_first | 50| 1| 200.521| 910.52700| 934.64351| 1091.5340| 1149.79875| 1319.777| As you can see, the original design is considerably faster, especially on the most time-consuming queries. I'll have a chat with the sys admin about updating to v15 this week, to see if that improves performance. In any case, I think this experiment has demonstrated that this is either a query problem (the queries I wrote are probably suboptimal, see comments on how I use VALUES) or a hardware problem (the server is old). **Answers to some questions in comments** *jjanes:* See this pastebin: https://pastebin.com/qQR3GtZ4 *a_horse_with_no_name*: The VALUES idea came from here: https://dba.stackexchange.com/questions/91247/optimizing-a-postgres-query-with-a-large-in I wrote a test query:

EXPLAIN
WITH value_list (protein_id, snp_id) as (
  values
    (1, 1),
    (1, 2)
)
SELECT 
  *
FROM protein_snp_assoc AS p
INNER JOIN
value_list v on (p.protein_id, snp_id) = (v.protein_id, v.snp_id);

I thought this gave me the same query plan as WHERE/IN but I see now I was wrong. I'll look into this and see if it's better. **Edit:** it seems to perform on par with WHERE/IN and VALUES. So I guess this isn't the real bottleneck. *bobflux*: I can't share the data but you can simulate it easily. Here's a quick example in R:

{R}
sim_data <- function(i, n_snps) {
  data.frame(
    protein_id = rep(i, n_snps),
    snp_id = 1:n_snps,
    beta = rnorm(n = n_snps, mean = 0, sd = 1),
    se = abs(rnorm(n = n_snps, mean = 0, sd = 1)),
    logp = abs(rnorm(n = n_snps, mean = 2, sd = 1))
  )
}
protein_id <- 10
n_snps <- 7650000
sim_data(protein_id, n_snps)

*nz_21*: I wrote custom scripts in R and bash.

Asked by jommi (43 rep)

Mar 11, 2023, 12:47 AM
Last activity: Mar 12, 2023, 10:16 PM

Query performance on static, large PostgreSQL table

Related Questions