Best Database for Bulk Reads and Fast Key Lookups for Genomic Data

-1 votes
1 answer
62 views
I'm developing a system to store and query genomic data using a relational database (PostgreSQL is my primary candidate). The dataset is large: around 9 billion records with over 300 columns, totaling more than 30 TB. Here's an example of the data structure in Go:

```go
type Variant struct {
    VariantVcf pgtype.Text `json:"variant_vcf"`
    Chromosome pgtype.Text `json:"chromosome"`
    Position   pgtype.Text `json:"position"`
    BravoAn    pgtype.Int8 `json:"bravo_an"`
    // ... many additional fields
}
```

Here are some sample SQL queries:

Single ID lookup:

```sql
SELECT * FROM variants WHERE variant_vcf = '1-14809-A-T';
```

Bulk ID lookup:

```sql
SELECT * FROM variants WHERE variant_vcf IN ('1-14809-A-T', '2-23456-G-C', 'X-78901-T-G');
```

Questions for the community: although I plan to stick with relational databases, are there scenarios where a key-value store or another specialized database would outperform PostgreSQL, especially for bulk key-based retrieval? We expect around 2,000 concurrent users (a conservative estimate), each performing 1-4 lookups on average. The majority of these lookups are bulk key-based lookups (e.g., 1k–10M IDs per request). A minimal sketch of how I'd issue such a bulk lookup from Go is included below.
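For context, here is a minimal sketch of the bulk-lookup path from Go. It is not from my production code; it assumes pgx/v5, a `variants` table with the columns shown above, and a placeholder connection string. It passes the ID list as a single array parameter with `= ANY($1)`, which avoids building a huge `IN (...)` literal and keeps the query parameterized regardless of list size:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	ctx := context.Background()

	// Hypothetical connection string; adjust for your environment.
	pool, err := pgxpool.New(ctx, "postgres://user:pass@localhost:5432/genomics")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Bulk key lookup: the whole ID list is bound as one text[] parameter,
	// so the same prepared statement works for 3 IDs or 3 million.
	ids := []string{"1-14809-A-T", "2-23456-G-C", "X-78901-T-G"}

	rows, err := pool.Query(ctx,
		`SELECT variant_vcf, chromosome, position
		   FROM variants
		  WHERE variant_vcf = ANY($1)`, ids)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var vcf, chrom, pos string
		if err := rows.Scan(&vcf, &chrom, &pos); err != nil {
			log.Fatal(err)
		}
		fmt.Println(vcf, chrom, pos)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

At the upper end of my range (millions of IDs per request), I understand the usual alternative is to COPY the IDs into a temporary table and join against it rather than binding one giant array, but I'd welcome guidance on where that crossover point lies.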
Asked by mad (1 rep)
Mar 27, 2025, 10:29 PM
Last activity: Mar 28, 2025, 08:58 PM