Best Database for Bulk Reads and Fast Key Lookups for Genomic Data
-1 votes · 1 answer · 62 views
I'm developing a system to store and query genomic data using a relational database (PostgreSQL being my primary candidate).
The dataset is quite large: around 9 billion records with over 300 columns, totaling more than 30 TB of data.
Here's an example of the data structure in Go:
type Variant struct {
    VariantVcf pgtype.Text `json:"variant_vcf"`
    Chromosome pgtype.Text `json:"chromosome"`
    Position   pgtype.Text `json:"position"`
    BravoAn    pgtype.Int8 `json:"bravo_an"`
    // ... many additional fields
}
Here are some sample SQL queries:
Single ID Lookup:
SELECT * FROM variants WHERE variant_vcf = '1-14809-A-T';
Bulk ID Lookup:
SELECT * FROM variants
WHERE variant_vcf IN ('1-14809-A-T', '2-23456-G-C', 'X-78901-T-G');
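For larger lists, interpolating thousands of literals into an IN (...) clause gets unwieldy; the whole list can instead be bound as a single array parameter with = ANY($1), which PostgreSQL can still satisfy from an index on variant_vcf. A minimal sketch in Go, assuming pgx/v5 and the variants table above (DATABASE_URL and the selected columns are illustrative):

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/jackc/pgx/v5"
)

func main() {
    ctx := context.Background()

    // Connection string from the environment, e.g.
    // postgres://user:pass@localhost:5432/genomics
    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close(ctx)

    // Bind the whole ID list as one text[] parameter; pgx maps
    // []string to a PostgreSQL array automatically, so the SQL
    // text stays constant regardless of list size.
    ids := []string{"1-14809-A-T", "2-23456-G-C", "X-78901-T-G"}
    rows, err := conn.Query(ctx,
        `SELECT variant_vcf, chromosome, position
         FROM variants
         WHERE variant_vcf = ANY($1)`, ids)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close()

    for rows.Next() {
        var vcf, chrom, pos string
        if err := rows.Scan(&vcf, &chrom, &pos); err != nil {
            log.Fatal(err)
        }
        fmt.Println(vcf, chrom, pos)
    }
    if err := rows.Err(); err != nil {
        log.Fatal(err)
    }
}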
Questions for the Community:
Although I plan to stick with a relational database, are there scenarios where a key-value store or another specialized database would outperform PostgreSQL for this workload, especially bulk key-based retrievals?
We expect around 2,000 concurrent users (a conservative estimate), each performing 1-4 lookups on average. The majority of these are bulk key-based lookups (e.g., 1k-10M IDs per request).
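For requests at the top of that range (millions of IDs), a pattern often suggested for PostgreSQL is to stream the keys into a temporary table with COPY and join against it, rather than shipping one enormous parameter; the join lets the planner choose a hash or merge join instead of millions of individual index probes. A minimal sketch, again assuming pgx/v5; lookupBulk is a hypothetical helper name and the ID list is assumed de-duplicated:

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/jackc/pgx/v5"
)

// lookupBulk streams a large ID list into a temp table via COPY,
// then joins it against variants in the same transaction.
// Assumes ids contains no duplicates (tmp_ids has a primary key).
func lookupBulk(ctx context.Context, conn *pgx.Conn, ids []string) error {
    tx, err := conn.Begin(ctx)
    if err != nil {
        return err
    }
    defer tx.Rollback(ctx) // no-op if the commit below succeeds

    if _, err := tx.Exec(ctx,
        `CREATE TEMP TABLE tmp_ids (variant_vcf text PRIMARY KEY) ON COMMIT DROP`); err != nil {
        return err
    }

    // COPY is far cheaper than millions of single-row INSERTs.
    copyRows := make([][]any, len(ids))
    for i, id := range ids {
        copyRows[i] = []any{id}
    }
    if _, err := tx.CopyFrom(ctx, pgx.Identifier{"tmp_ids"},
        []string{"variant_vcf"}, pgx.CopyFromRows(copyRows)); err != nil {
        return err
    }

    rows, err := tx.Query(ctx,
        `SELECT v.variant_vcf, v.chromosome, v.position
         FROM variants v
         JOIN tmp_ids t USING (variant_vcf)`)
    if err != nil {
        return err
    }
    for rows.Next() {
        var vcf, chrom, pos string
        if err := rows.Scan(&vcf, &chrom, &pos); err != nil {
            rows.Close()
            return err
        }
        fmt.Println(vcf, chrom, pos)
    }
    rows.Close() // close before Commit reuses the connection
    if err := rows.Err(); err != nil {
        return err
    }
    return tx.Commit(ctx)
}

func main() {
    ctx := context.Background()
    conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close(ctx)

    if err := lookupBulk(ctx, conn, []string{"1-14809-A-T", "2-23456-G-C"}); err != nil {
        log.Fatal(err)
    }
}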
Asked by mad (1 rep) on Mar 27, 2025, 10:29 PM
Last activity: Mar 28, 2025, 08:58 PM