Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0 votes · 1 answer · 181 views
Configure mongo for one shot full db read
I have a Kubernetes cluster on bare metal with 3 machines having 8 GB of RAM each. All my apps and a MongoDB 4.0.9 replica set run on it.
There is an import program that:
1. Downloads an 8 GB mongo dump from an external source and restores it into a freshly created database A. There is one collection without indexes.
2. Fully browses the restored collection with one `find({})` and about 1M `next()` calls.
3. Emits AMQP messages for each doc; the messages are then stored in mongo databases B, C, and D.
4. Drops database A.
Step 2 uses a lot of RAM.
How should I configure mongo for database A (leaving database B normal) to minimize the footprint of the import operation? (More important tasks are running in the cluster.)
For example, can I configure mongo not to cache the temporary collection?
Slim
(291 rep)
Apr 12, 2019, 07:46 PM
• Last activity: Jun 29, 2025, 02:02 AM
0 votes · 1 answer · 898 views
How do I truncate performance_schema tables on a read replica?
On occasion, you have to truncate `events_statements_summary_by_digest` when it hits the row limit. On a write master, this is easy, but on a read replica, there doesn't seem to be a way to do it. The read-only enforcement extends to the performance_schema tables. Since the data are created locally, it seems to me that there should be a way to exempt those tables (or even the entire performance_schema DB) from the read-only enforcement?
I have already set `performance_schema_events_statements_history_long_size` to 10000, which is getting pretty large. I could set it larger (well, if it weren't read-only), but the size of that table affects overall system performance, and I'd rather not just crank the limit to something obscene - especially when so many of the entries in the history haven't been used in a month or two. It's better to toss out the garbage.
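For reference, the standard reset on a writable server is a one-liner, and the read-only behavior differs by variable: plain `read_only` is not enforced for accounts with `SUPER`, while `super_read_only` (MySQL) blocks even those. A hedged sketch, assuming you can tolerate briefly lifting `super_read_only`:

```sql
-- Standard reset on a writable server:
TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;

-- On a replica enforcing only read_only, a SUPER account can run the
-- TRUNCATE directly. Under super_read_only you would have to lift it
-- for a moment (note: this briefly allows other writes too):
SET GLOBAL super_read_only = OFF;
TRUNCATE TABLE performance_schema.events_statements_summary_by_digest;
SET GLOBAL super_read_only = ON;
```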
Sniggerfardimungus
(101 rep)
Apr 18, 2018, 08:47 PM
• Last activity: Jun 28, 2025, 04:03 PM
1 vote · 0 answers · 77 views
Are there exactly two possible causes of statistics updates on a readable replica?
I have a table that is modified regularly. Due to these frequent modifications, statistics update on it very frequently. To my surprise, I even see queries running on a SQL Server 2022 readable secondary AG that are trying to update statistics. I have been able to distinguish two types of statistics updates.
I believe the first type to be the replica updating stats in response to being queried. In other words, it's an automatic stats update just like on the primary. I know this because `EXEC sp_WhoIsActive @get_plans = 2` on the replica shows me execution plans that are clearly updating statistics (e.g. they have the blatant `STATMAN` stuff in their XML), despite the query text just being a normal user-written query of the replica. They also appear in the `auto_stats` Extended Event, usually as completing successfully.
With `sp_WhoIsActive @show_system_spids = 1, @get_locks = 1`, I can also see a second type of statistics update. This second type is the redo thread trying to take an `Sch-M` lock on a resource_type of `METADATA.STATS`. The blocked process report confirms this and gives me both the object_id and stats_id of the object in question.
My question is this: **if I see a statistics update on a readable AG replica, is it always one of the two cases described above?**
All of the documentation that I've read, particularly [this](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/active-secondaries-readable-secondary-replicas-always-on-availability-groups?view=sql-server-ver16#Read-OnlyStats), suggests that my first type can only be the creation of _temporary_ statistics on the readable replica. I have five objections to this idea, listed below, so I am led to believe that the two cases I have listed above cannot be all of the possible ways that a statistics update can happen on an AG replica. My objections are these:
1. The workload on the secondary is much less than the primary despite the read-only queries being exactly the same, so there is no good reason why the statistics on the primary should be stale.
2. I have never seen statistics from this table in the output of `SELECT * FROM [MyDb].sys.stats WHERE is_temporary = 1`. `tempdb` doesn't have it either.
3. The auto_stats Extended Event tells me which statistic it is updating. The suffix `_readonly_database_statistic` **never** occurs. I either get the names of real statistics or blanks (for failures).
4. The statistics are created too frequently. On a bad day, auto_stats reports them being made every 15 minutes. Temporary statistics surely can't be that short-lived?
5. [This blog](https://techcommunity.microsoft.com/blog/sqlserver/alwayson-making-latest-statistics-available-on-readable-secondary-read-only-data/383987) suggests that there is a distinction between temporary statistics explicitly created to help a readable replica and statistics that already exist on the primary but are stale when the replica is queried. However, the documentation I linked earlier strongly suggests that all temporary statistics have the `_readonly_database_statistic` suffix that I am not seeing.
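One way to probe this on the secondary: `sys.dm_db_stats_properties` reports when a given statistic was last updated and how many modifications it has accumulated since. A minimal sketch, assuming the `object_id`/`stats_id` pair comes from the blocked process report mentioned above (the literals are placeholders):

```sql
-- Run on the readable secondary, in the user database.
DECLARE @obj int = 1234567, @stat int = 3;  -- placeholders from the report

SELECT s.name,
       s.is_temporary,          -- 1 would indicate a replica-local statistic
       sp.last_updated,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = @obj
  AND s.stats_id  = @stat;
```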
J. Mini
(1225 rep)
Apr 17, 2025, 09:37 PM
• Last activity: Apr 28, 2025, 08:42 PM
2 votes · 1 answer · 72 views
What copies of my database, if any, do readable secondaries make bigger?
Suppose that I have an Enterprise Edition database that has completely default settings and nothing but normal rowstore tables. No RCSI, no snapshot isolation and no triggers. Assume nothing that would require [row versioning](https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-locking-and-row-versioning-guide?view=sql-server-ver16#space-used-in-data-rows) .
Clearly, these assumptions mean that I will not be paying the extra 14-bytes-per-row cost of row versioning. However, readable AG secondaries run under SNAPSHOT isolation and SNAPSHOT isolation requires row versioning. So if I put my database in an Availability Group and add a readable secondary, what copies of my database (if any) will start paying the 14-bytes-per-row cost of row versioning? Does this change if I have a non-readable secondary in addition to the readable secondary? What if I fail over and fail back?
The relevant documentation, linked earlier, makes no comment on this. I plan to run experiments myself, but AGs are notoriously easy to make mistakes with and I don't want to draw incorrect conclusions.
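One low-risk way to measure this empirically: `sys.dm_db_index_physical_stats` exposes `avg_record_size_in_bytes`, so you can compare row sizes on each copy before and after adding the readable secondary (the version tag is only added when rows are subsequently modified). A sketch, with illustrative database and table names:

```sql
-- Compare this output across copies, before and after enabling the
-- readable secondary; a ~14-byte jump in average record size for
-- freshly modified rows indicates the row-versioning tag.
SELECT index_id,
       avg_record_size_in_bytes,
       record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'MyDb'),
         OBJECT_ID(N'dbo.MyTable'),
         NULL, NULL, 'DETAILED');
```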
J. Mini
(1225 rep)
Mar 17, 2025, 08:59 PM
• Last activity: Mar 30, 2025, 03:17 PM
0 votes · 1 answer · 2042 views
Does a read replica help if the problem is heavy queries?
The application that our team is responsible for is having DB performance problems ever more often. The application querying it doesn't issue lots of concurrent queries; the problem is rather some heavy queries joining a few tables. These queries are executed just every few minutes, and executions seldom overlap, but they sometimes make the DB choke, and as the system holds more data, it's a growing problem. The solution so far has been optimizing these queries, e.g. by fetching fewer fields, and scaling up the DB.
I am relatively new to the team, and today I asked if a read replica had been considered. I know little about DBs, so that felt natural to me, but a senior engineer told me that it would barely help, because the read replica would have the same problems as the master: it would still need to apply the same writes as the master, the heavy queries would be equally heavy for it, and the chances of it timing out would be the same. The only gain would be splitting the number of queries each instance has to serve.
Put like that, it sounds very reasonable. My questions are:
1. Is a read replica really not of any help in this situation?
2. What are some alternatives?
PS: in case it matters, it's PostgreSQL.
antonro
(101 rep)
Jul 29, 2019, 03:48 PM
• Last activity: Mar 29, 2025, 03:09 PM
1 vote · 1 answer · 469 views
Cannot back up a read_only database with FILESTREAM
I'm facing a strange issue where my backup fails with the message:
>Msg 3906, Level 16, State 1, Line 5
>Failed to update database "MyDBB" because the database is read-only.
>Msg 3013, Level 16, State 1, Line 5
>BACKUP DATABASE is terminating abnormally.
I saw a post talking about in-memory objects causing that issue, but in this case, there is no "memory optimized data".
There is a `FILESTREAM` file, but I couldn't find any documentation explaining this behavior.
Has anyone faced this issue before? Any way to work around it without setting the DB to `READ_WRITE`?
I think (can't find evidence to prove it) that the backup used to work.
Dominique Boucher
(3297 rep)
Feb 28, 2024, 03:56 PM
• Last activity: Feb 11, 2025, 08:02 PM
0 votes · 3 answers · 91 views
Performance improvements for read only SQL Db that is only ever restored from backup
I receive a `.bak` file from a 3rd party on a monthly basis and create one DB per receipt. The `.bak` file size is ~1 GB.
This data is for lookup only and I never write to it. Are there any steps I can take to improve performance for this scenario?
**I am considering:**
`ALTER DATABASE [TESTDB] SET READ_ONLY WITH NO_WAIT`
**Should I consider:**
- Shrink log file or any log file operations?
- Do anything with indexes, such as [reorganize an index](https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver16#reorganize-an-index)?
I am using the current version of SQL Express.
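A minimal sketch of the monthly refresh, assuming the database name from the question and an illustrative backup path; the log is dead weight for a lookup-only copy, so shrinking it once after restore is reasonable:

```sql
RESTORE DATABASE [TESTDB]
FROM DISK = N'C:\drops\monthly.bak'
WITH REPLACE, RECOVERY;

USE [TESTDB];
DBCC SHRINKFILE (2, 1);   -- file_id 2 is typically the log file

ALTER DATABASE [TESTDB] SET READ_ONLY WITH NO_WAIT;
```

Index rebuilds may not pay off here: the `.bak` preserves the indexes exactly as the third party built them, and a read-only database cannot fragment further.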
ttugates
(37 rep)
Jan 14, 2025, 05:00 PM
• Last activity: Jan 18, 2025, 11:28 AM
0 votes · 1 answer · 49 views
Is it possible to give readonly access to a database during an upgrade
I am upgrading a SQL Server environment which has an application server, web server, and database server, and I want to be able to give my users read-only access to their data during the upgrade. Does anyone know if this is possible, and how could I do it?
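If the upgrade only touches the application and web tiers, the simplest approach is to flip the database itself to read-only for the duration. A minimal sketch, with a hypothetical database name:

```sql
-- Kick out in-flight writers and make the database read-only:
ALTER DATABASE [AppDb] SET READ_ONLY WITH ROLLBACK IMMEDIATE;

-- ... perform the upgrade ...

ALTER DATABASE [AppDb] SET READ_WRITE WITH ROLLBACK IMMEDIATE;
```

If the upgrade modifies the database itself, you would instead need a separate read-only copy (a database snapshot, or a restored backup) for users to query.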
Jen Burrows
(1 rep)
Nov 15, 2024, 01:16 PM
• Last activity: Nov 15, 2024, 02:24 PM
1 vote · 1 answer · 350 views
How to set mariadb read_write for some specific tables while starting as read_only=1
I am trying to set up a MariaDB slave which is an exact replica of the master DB. As per the suggestion, I start my slave with `--relay-log=mysql-relay-bin --log-slave-updates=1 --read-only=1`. But as I use my slave DB as a staging server for my website, I need some tables like `cache_*` to be writable to make my staging server work properly.
Can you please tell me whether this is possible? If not, is there any workaround?
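One blunt workaround, since `read_only` is not enforced for accounts holding `SUPER`: give the stage site a dedicated account with `SUPER`. This is a sketch only (account name, host, and schema are illustrative), and it weakens the safety of the replica, because such an account can write to any table, including replicated ones:

```sql
CREATE USER 'stage_writer'@'10.0.0.%' IDENTIFIED BY '...';
GRANT SUPER ON *.* TO 'stage_writer'@'10.0.0.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON mydb.* TO 'stage_writer'@'10.0.0.%';
```

A cleaner alternative is to keep the writable `cache_*` tables in a separate schema that is excluded from replication via replication filters.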
SkyRar
(125 rep)
Jan 2, 2019, 06:42 AM
• Last activity: Aug 7, 2024, 12:03 AM
0 votes · 1 answer · 112 views
Querying whole PostgreSQL table in replica on CTID
I noticed it is quite fast to query a whole PostgreSQL table in chunks on `ctid`, as explained for example in this Airbyte blog post. I need this for a syncing task that might take from hours to a few days; I can't tell yet, as it's related to syncing data to another service, and it also might be paused and resumed on occasion.
But I'm wondering about the issue of `ctid` values changing after vacuuming, and more precisely how this affects reading from a read replica. I checked `last_vacuum` and `vacuum_count` in `pg_stat_all_tables` for the table in question in the read replica; the vacuum count is 0 and there is no last vacuum time. I notice the order of rows returned is not exactly chronological (the rows with the first `ctid` values are about a year older than the chronologically first ones), but that's not an issue for me as long as it doesn't change in the span of a few days and I can iterate over all rows in chunks on `ctid`.
My question is: does this mean that the read replica table has actually not been vacuumed, and therefore the `ctid`s have never changed? And is there something else that might lead to a change in `ctid` in a read replica?
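For reference, the chunked-scan pattern itself, with a hypothetical table name; `pg_class.relpages` gives the upper bound for the block ranges:

```sql
-- How many heap pages the table has (upper bound for chunking):
SELECT relpages FROM pg_class WHERE relname = 'my_table';

-- One chunk: all tuples in blocks 0..49999.
SELECT *
FROM my_table
WHERE ctid >= '(0,0)'::tid
  AND ctid <  '(50000,0)'::tid;  -- the next chunk starts at (50000,0)
```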
Nikolay D
(101 rep)
Aug 1, 2024, 08:30 AM
• Last activity: Aug 1, 2024, 02:00 PM
0 votes · 2 answers · 787 views
What happens to my read-replica database in failover mode?
I have a few basic questions about Azure failover. I am new to failover. I've been reading the literature and I'm ready to give this a shot. Before I take the plunge, I'd like to have a better understanding of how this will work.
We have a primary database on a server on one region, let's call that server DBPrime. We recently created a read-replica on a server in a different region. Our first step was to offload read workloads to the read-replica. Let's call that server DBRead.
We now want to take the next step and set up failover. If I understand the literature correctly, I don't need a third server for failover purposes. Rather, I can use the read-replica server to function as my alternate server for failover?
Here is my understanding of what will happen if failover is triggered. Please let me know if I'm correct about each of these statements, or if I'm not understanding this properly.
1. Azure will swap server names so that DBRead is now DBPrime and vice versa.
2. The new DBPrime will transition from read-only mode to read-write.
3. Application references to DBPrime will seamlessly access the new DBPrime
4. Workloads referencing DBRead will not be able to access the server because that server is currently down (which triggered the failover to begin with).
(We can discuss separately what happens when we come out of failover. But no sense in getting to that until I understand what happens when we go into failover mode.)
Thank you for your advice!
Yossi Geretz
(349 rep)
May 12, 2024, 07:38 PM
• Last activity: Jul 29, 2024, 07:53 AM
1 vote · 2 answers · 553 views
How can I monitor query performance on my Azure read-replica database?
I'm looking at **Query Performance Insight** for my read-replica database. But the information it shows are the performance stats from the ***primary*** database. How can I get the performance stats for the read-replica?

Yossi Geretz
(349 rep)
May 16, 2024, 02:14 PM
• Last activity: Jun 18, 2024, 01:06 AM
0 votes · 1 answer · 43 views
Can I create a Geo Replica of a Named Replica?
For clarity, we have two databases on a server in a given region. One database is the primary for the read/write workload, and one is a replica for the read-only workload. I'm trying to set up a failover which will work for both workloads. So I'm trying to get the primary, as well as its named replica, replicated over to the standby server.
We're using Azure's hosted database (PaaS) service. We have a primary DB which handles our transactional workload. I created a Named Replica on the same server, to offload read-only workloads.
I also created a Geo Replica of the primary DB which is hosted on a separate server in a different region. I created a failover group which can swap my server back and forth between regions.
But now I'd like to have a Geo Replica of the Named Replica (the read-only DB on my primary server) so that when I go into failover mode, I can support both my transactional and read-only workloads. Is this possible? Seemingly, the only replica I can make is from the primary DB. I can't find any way to use the Named Replica as the base for a Geo Replica. Which leaves me in a catch-22. I can't create a second Geo Replicated database onto the failover server because there's already a database there by that name. I can't create a Named Replica onto the failover server because a Named Replica must be in the same region (which would be useless for failover purposes, anyway).
See the screenshot. In a nutshell, I am trying to get a replicated database named `test_read` onto my server `r8standby` in the South Central datacenter, but I'm having trouble figuring out how to set this up. Is this at all possible?

Yossi Geretz
(349 rep)
May 13, 2024, 11:54 PM
• Last activity: May 14, 2024, 10:33 PM
0 votes · 1 answer · 1246 views
How do I create user logins on my read replica database?
Is it possible to have different users accessing my read replica DB? Isn't this in fact a popular use case for a read replica? I want a certain limited set of credentials to have access to my primary read-write DB, and then I want a bunch of other folks to be able to have read-only access to the read replica for reporting, ad-hoc querying, etc. I don't necessarily want the two databases to have the exact same accessors.
I created a Microsoft Entra login on the server which hosts my read replica, but here's what I get when I attempt to create the user for database access:
```
Msg 3906, Level 16, State 2, Line 1
Failed to update database "********" because the database is read-only.
```
Well, of course the database is read-only, it's my read replica. But I'm not trying to write data into the database, merely set the meta-security details for who in my organization is allowed to access the read replica. This can't be done?

Yossi Geretz
(349 rep)
Apr 15, 2024, 01:13 AM
• Last activity: Apr 15, 2024, 06:07 PM
-1 votes · 2 answers · 361 views
readonly user on PostgreSQL 14 lets me create new tables, why?
I tried to apply each of these guides to create a read_only_user:
- https://docs.snaplet.dev/guides/postgresql
- https://www.commandprompt.com/education/how-to-create-a-read-only-user-in-postgresql/
- https://www.keyvanfatehi.com/2021/07/14/how-to-create-read-only-user-in-postgresql/
In all cases, the new read_only_user is able to create a new table in my database (here: MicroReseau), which I want to prevent.
**Why can the read_only_user create new tables, and how do I prevent that?**
Additional info:

Eric Ouellet
(101 rep)
Nov 2, 2023, 06:38 PM
• Last activity: Nov 3, 2023, 01:38 PM
1 vote · 1 answer · 785 views
read only replica of azure sql database within the same server
I have the following Bicep to create a SQL server and SQL database:
```
resource sqlServer 'Microsoft.Sql/servers@2021-02-01-preview' = {
  name: sqlServerName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    administratorLogin: userName
    administratorLoginPassword: password
    administrators: {}
  }

  resource sqlDataBaseOne 'databases@2021-02-01-preview' = {
    name: 'databaseone'
    location: location
    sku: {
      name: 'GP_S_Gen5'
      tier: 'GeneralPurpose'
      family: 'Gen5'
      capacity: 4
    }
  }
}
```
Is it possible to create a read-only replica of `databaseone` within the same server?
I only see a Geo-Replica option when trying to create a replica, which places the replica on a different server:

user989988
(111 rep)
May 1, 2023, 10:05 PM
• Last activity: May 2, 2023, 04:21 PM
1 vote · 1 answer · 783 views
Why can't my read-only user role see any of my tables' primary key information in psql?
I have a PostgreSQL database on Azure, and I've created a readonly_user role that allows for external parties to connect to the db in a read-only capacity using a connection string.
This readonly_user has been granted SELECT privileges, and it also has USAGE access on the public schema.
However, the tables in my database have composite primary keys. When I am logged in as the admin user, I can access these primary keys with the following command:
```sql
SELECT
    tc.table_schema,
    tc.table_name,
    kcu.column_name
FROM
    information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu ON tc.constraint_name = kcu.constraint_name
WHERE
    tc.table_name = 'player_data' AND
    tc.constraint_type = 'PRIMARY KEY';
```
For some reason, no matter how I try to play with the usage and access roles for readonly_user, this command (and others like it) to access the primary key information for the db tables *always* returns 0 rows. There should be at least 3 for every table, which I see clearly when I issue this command connected as the admin role.
Is there something about a read-only user that limits its access to the information_schema? How can I access the primary keys of a table as a read-only user?
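A plausible explanation, per the PostgreSQL docs: `information_schema.table_constraints` only shows constraints on tables the current user owns or has some privilege *other than* SELECT on, which hides them from a SELECT-only role. The `pg_catalog` tables have no such filter; a sketch of the same lookup against them:

```sql
-- Primary key columns of player_data, visible to any role that can
-- see the table's metadata (pg_catalog does not privilege-filter rows):
SELECT a.attname AS column_name
FROM pg_index i
JOIN pg_attribute a
  ON a.attrelid = i.indrelid
 AND a.attnum   = ANY (i.indkey)
WHERE i.indrelid = 'public.player_data'::regclass
  AND i.indisprimary;
```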
justadampaul
(113 rep)
Apr 30, 2023, 06:38 PM
• Last activity: May 1, 2023, 03:19 AM
0 votes · 1 answer · 2637 views
Alter database from Read-Only to Regular mode? A warm-standby database is read-only (Error 5063)
I am trying to change the mode of a database from "Standby / Read-Only" to "Normal" so that I can use the mdf file to attach to another SQL Server. One instruction that I got was: go to Properties --> Options --> State --> change Database Read-Only from True to False.

Java
(253 rep)
Mar 18, 2023, 12:44 AM
• Last activity: Mar 18, 2023, 10:07 PM
4 votes · 1 answer · 1302 views
Query performance on static, large PostgreSQL table
I tried to have this as detailed as possible. Sorry about the length!
### Background
I created the following partitioned table `protein_snp_assoc` on a PostgreSQL (version 12.13) database:
```sql
CREATE TABLE protein_snp_assoc (
    protein_id int not null,
    snp_id int not null,
    beta double precision,
    se double precision,
    logp double precision
) PARTITION BY RANGE (snp_id);
```
I then created 51 partitions, each containing roughly 150 million rows (7.65 billion rows total), based on the following template:
```sql
CREATE TABLE IF NOT EXISTS protein_snp_assoc_(x) PARTITION OF protein_snp_assoc
FOR VALUES FROM (y) TO (z);
```
where `x` ranged from 1 to 51, and `y`, `z` defined intervals, each of length 150,000. As an example, the first two and last partitions are:
```sql
protein_snp_assoc_1  FOR VALUES FROM (1)       TO (150001),
protein_snp_assoc_2  FOR VALUES FROM (150001)  TO (300001), ...
protein_snp_assoc_51 FOR VALUES FROM (7500001) TO (7650001)
```
The column `protein_id` has 1,000 unique values (1 to 1,000) and `snp_id` has 7,650,000 unique values (1 to 7,650,000). As the pair `(snp_id, protein_id)` uniquely determines a row in the table, I used the two columns to create a BTree index, with `snp_id` as the left-most column:

```sql
CREATE INDEX ON protein_snp_assoc (snp_id, protein_id);
```
This will be a static database. It currently has about 20% of the total data that will be on it (since I'm prototyping), but once all the data has been added to the database, no further rows will be added (nor deleted).
### Typical queries
The most common queries will be (a) single SNP/protein queries, (b) single protein, multiple SNPs queries, and (c) multiple proteins and multiple SNPs queries.
Example queries follow, where I use `VALUES` as I read on this site that it can increase performance when `IN (...)` has multiple values.
```sql
-- Single SNP/protein
SELECT *
FROM protein_snp_assoc
WHERE snp_id IN (VALUES (1))
  AND protein_id IN (VALUES (1));

-- Multiple SNPs, single protein
SELECT *
FROM protein_snp_assoc
WHERE snp_id IN (VALUES (1), (2))
  AND protein_id IN (VALUES (1));

-- Multiple SNPs, multiple proteins
SELECT *
FROM protein_snp_assoc
WHERE snp_id IN (VALUES (1), (2))
  AND protein_id IN (VALUES (1), (2));
```
The `EXPLAIN ANALYZE` output for each query can be seen here (pastebin links): [Single SNP/protein (pastebin)](https://pastebin.com/zfaRCibT), [Multiple SNPs, single protein (pastebin)](https://pastebin.com/jLdu2hqJ), [Multiple SNPs, multiple proteins (pastebin)](https://pastebin.com/MYjBxjpT).
### Benchmarking and comparison to Arrow/parquet
I ran 1,000 queries for multiple SNP/protein combinations, where the SNPs and proteins were randomly drawn before being inserted into the query. To get some sort of reference, I converted the raw data files I used to populate the database to `.parquet` files and ran similar queries using R and the `arrow` package. The results can be seen in the table below (all times are in milliseconds; `lq` and `uq` are the 25% and 75% percentiles, respectively).
|index | n_snps| n_proteins| min| lq| mean| median| uq| max|
|:--------|----------:|-------:|---------:|----------:|----------:|----------:|----------:|---------:|
|postgres | 1| 1| 0.05900| 14.71125| 18.02112| 17.92850| 21.14350| 45.2850|
|arrow | 1| 1| 34.31822| 44.62842| 49.30316| 46.29033| 48.07222| 577.1411|
|postgres | 1| 2| 4.07100| 20.97125| 25.40618| 25.15250| 29.35375| 68.7700|
|arrow | 1| 2| 47.61873| 61.47562| 67.87060| 63.99824| 65.68011| 629.5121|
|postgres | 10| 1| 118.18900| 167.11100| 181.76304| 180.50250| 196.41475| 262.9640|
|arrow | 10| 1| 138.73902| 164.25678| 177.47847| 167.69684| 176.15489| 704.3115|
|postgres | 10| 2| 168.10500| 231.74825| 248.74577| 248.45400| 264.95825| 330.2810|
|arrow | 10| 2| 219.73495| 269.54206| 287.34815| 281.79291| 286.22803| 819.4827|
|postgres | 10| 10| 731.77300| 893.28625| 940.90282| 941.69650| 989.38625| 1162.4810|
|arrow | 10| 10| 930.18264| 1038.39510| 1089.43522| 1080.01131| 1100.22580| 2313.4975|
|postgres | 50| 1| 665.23800| 799.89600| 849.73860| 850.91900| 898.27900| 1050.0710|
|arrow | 50| 1| 682.10049| 711.62065| 766.24498| 735.49283| 750.97367| 1335.6018|
As you can see, as the number of SNPs or proteins (or both) were increased, PostgreSQL and Arrow started performing similarly (although the worst-case for Arrow was consistently worse).
### Hardware
[CPU (pastebin)](https://pastebin.com/Xr2Y7NLh). The HDD is a Seagate IronWolf 10TB (ST10000VN0008). Memory is 64GB, but I can't see the specific type since I don't have `sudo` privileges on the machine. Operating system: Ubuntu 22.04.1 LTS.
### My question
The results of the benchmark make me believe that my database is not optimized. I'm worried that as I start adding more data to the database, performance will suffer. Is there any way I can speed up queries that involve multiple proteins and SNPs, either with better design, queries or some other sort of tuning?
### Update 2023-03-12
Thanks to Erwin and everyone else who has engaged. I followed Erwin's directions exactly (the only exception being that I couldn't update from v12 to v15) and then redid the benchmarks for this new table. Here are the results (compared to the original design), where `index_order = snp_first` is the original design and `index_order = protein_first` is the design proposed by Erwin:
|index_order | n_snps| n_proteins| min| lq| mean| median| uq| max|
|:-------------|------:|----------:|--------:|----------:|----------:|---------:|----------:|--------:|
|snp_first | 1| 1| 0.059| 14.71125| 18.02112| 17.9285| 21.14350| 45.285|
|protein_first | 1| 1| 0.060| 20.96200| 24.87686| 26.3945| 30.31275| 126.046|
|snp_first | 1| 2| 4.071| 20.97125| 25.40618| 25.1525| 29.35375| 68.770|
|protein_first | 1| 2| 2.764| 37.02300| 44.31820| 45.7595| 52.30925| 84.515|
|snp_first | 10| 1| 118.189| 167.11100| 181.76304| 180.5025| 196.41475| 262.964|
|protein_first | 10| 1| 29.754| 215.37700| 221.30159| 255.3445| 276.62725| 380.930|
|snp_first | 10| 2| 168.105| 231.74825| 248.74577| 248.4540| 264.95825| 330.281|
|protein_first | 10| 2| 88.473| 320.08475| 417.07273| 461.6155| 501.66350| 593.604|
|snp_first | 10| 10| 731.773| 893.28625| 940.90282| 941.6965| 989.38625| 1162.481|
|protein_first | 10| 10| 1189.058| 1906.78050| 2040.40170| 2054.9985| 2194.80550| 2595.215|
|snp_first | 50| 1| 665.238| 799.89600| 849.73860| 850.9190| 898.27900| 1050.071|
|protein_first | 50| 1| 200.521| 910.52700| 934.64351| 1091.5340| 1149.79875| 1319.777|
As you can see, the original design is considerably faster, especially on the most time-consuming queries. I'll have a chat with the sysadmin about updating to v15 this week, to see if that improves performance. In any case, I think this experiment has demonstrated that this is either a query problem (the queries I wrote are probably suboptimal; see comments on how I use `VALUES`) or a hardware problem (the server is old).
**Answers to some questions in comments**
*jjanes:* See this pastebin: https://pastebin.com/qQR3GtZ4
*a_horse_with_no_name*: The `VALUES` idea came from here: https://dba.stackexchange.com/questions/91247/optimizing-a-postgres-query-with-a-large-in
I wrote a test query:
```sql
EXPLAIN
WITH value_list (protein_id, snp_id) AS (
    VALUES
        (1, 1),
        (1, 2)
)
SELECT *
FROM protein_snp_assoc AS p
INNER JOIN value_list v ON (p.protein_id, p.snp_id) = (v.protein_id, v.snp_id);
```
I thought this gave me the same query plan as `WHERE/IN`, but I see now I was wrong. I'll look into this and see if it's better. **Edit:** it seems to perform on par with `WHERE/IN` and `VALUES`. So I guess this isn't the real bottleneck.
*bobflux*: I can't share the data but you can simulate it easily. Here's a quick example in R:
```r
sim_data <- function(i, n_snps) {
  data.frame(
    protein_id = rep(i, n_snps),
    snp_id = 1:n_snps,
    beta = rnorm(n = n_snps, mean = 0, sd = 1),
    se = abs(rnorm(n = n_snps, mean = 0, sd = 1)),
    logp = abs(rnorm(n = n_snps, mean = 2, sd = 1))
  )
}

protein_id <- 10
n_snps <- 7650000
sim_data(protein_id, n_snps)
```
*nz_21*: I wrote custom scripts in R and bash.
jommi
(43 rep)
Mar 11, 2023, 12:47 AM
• Last activity: Mar 12, 2023, 10:16 PM
4 votes · 2 answers · 26599 views
Insert- and Read-only table in Postgres
I want to create an "immutable" Postgres database, where users can `insert`/`select` (write/read) data only, but cannot `update`/`delete` (change/delete).
I came to know there is a `FOR UPDATE` lock, but did not get how to use it.
Let's say, for example, I have the table below; how can I make it immutable (or, if I understood correctly, how can I use the `FOR UPDATE` lock permanently)?
```sql
CREATE TABLE account(
    user_id serial PRIMARY KEY,
    username VARCHAR (50) UNIQUE NOT NULL,
    password VARCHAR (50) NOT NULL,
    email VARCHAR (355) UNIQUE NOT NULL,
    created_on TIMESTAMP NOT NULL,
    last_login TIMESTAMP
);
```
Hasan A Yousef
(155 rep)
Mar 18, 2019, 08:01 PM
• Last activity: Jan 17, 2023, 03:27 PM
Showing page 1 of 20 total questions