
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes
1 answer
145 views
Storing revisions and relations between revised items in 1:N and N:N scenarios for audits
I have several of the following situations in a MySQL schema that I'm designing for a custom CRM/ERP solution.

1. 1:N related tables `SO` and `SO_ITEMS` (SO is just one example; I have several) that both need revisions. The data sets stored tend to get pretty big for some of these tables and, according to the business rules I gathered, there will be many changes that need auditing, so I predict that copying all the child items for each revision of the parent will not scale very well.
2. N:N related tables `SO` and `PO` (where I use a pivot table, of course) that also need to be audited, so that changes on at least one side imply (or may imply) storing an updated revision of the data entry from the other table. (We need not worry about the child `_items` entries for each of these tables because, although they may be related, I don't require auditing for them for now.)

My strategy for now is to use a separate table suffixed with `_revision` for each of the above tables, store the current revision number in the normal table, and store a revision number for each revision, at least for the parents. So my questions are:

* Is my schema flawed in some major way that I don't foresee for this kind of scenario? (I suppose many other applications out there have solved this problem.)
* Should I make a direct relationship between `x_revision` and `y_revision`, or use the revision number and query through the normal (current revision?) table? I want the variant that reduces complexity the most, and I'm pretty sure querying through is simpler for most operations. (I suspect there will not be a big disparity between SELECT/INSERT/UPDATE in normal operation, so I don't need to optimise for one type of DB operation.) Am I missing something?
* Is it common practice to copy-on-write the revisions, or should I lazily insert entries into the revisions table only when a change appears? Any pros/cons of these two options?
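For concreteness, here is a minimal sketch of the `_revision` layout the asker describes, in MySQL DDL (the column names are illustrative assumptions, not from the original):

```sql
-- Current rows live in SO; every saved version also lands in SO_revision.
CREATE TABLE SO (
    so_id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    current_revision INT UNSIGNED NOT NULL DEFAULT 1,
    customer_id      INT UNSIGNED NOT NULL,
    status           VARCHAR(20)  NOT NULL
);

CREATE TABLE SO_revision (
    so_id       INT UNSIGNED NOT NULL,
    revision_no INT UNSIGNED NOT NULL,
    customer_id INT UNSIGNED NOT NULL,
    status      VARCHAR(20)  NOT NULL,
    revised_at  DATETIME     NOT NULL,
    PRIMARY KEY (so_id, revision_no),
    FOREIGN KEY (so_id) REFERENCES SO (so_id)
);
```

With `(so_id, revision_no)` as the history key, "querying through the normal table" is a plain join on `so_id` plus `current_revision`, which supports the asker's hunch that a direct `x_revision`-to-`y_revision` link is not strictly required.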
Byakkun (1 rep)
May 11, 2016, 09:52 AM • Last activity: Jul 20, 2025, 12:08 AM
0 votes
1 answer
150 views
Airflow to BigQuery data load taking forever
I'm currently working as a junior data engineer. My main job right now is to move data from a MySQL DB (which gets updated every few minutes via webhooks) and send it to BigQuery as frequently as possible using Airflow, as this is our main DB for later analyzing data with Power BI. The problem is that the bigger tables (which only have ~1,000 rows) take about 2 hours to load to BQ, which makes this impossible to scale; I can't imagine what will happen in the future when the deltas alone are 10,000 rows each... This works using pandas and SQLAlchemy by extracting the data as a dataframe and using the `to_sql` method, passing all the BQ connection parameters. I am already uploading only incrementals/deltas, so that is not the problem. Do you have any advice? Is Airflow the right tool for this? I've been searching for solutions for weeks but couldn't find anything.
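A likely culprit at this scale is `to_sql` emitting row-by-row inserts. One common alternative is to bulk-load the delta into a staging table and apply it with a single set-based statement; a hedged sketch in BigQuery SQL (the dataset, table, and column names are assumptions):

```sql
-- Apply a staged delta to the target table in one statement
-- instead of thousands of single-row inserts.
MERGE `analytics.orders` AS t
USING `analytics.orders_delta` AS d
ON t.order_id = d.order_id
WHEN MATCHED THEN
  UPDATE SET status = d.status, updated_at = d.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, status, updated_at)
  VALUES (d.order_id, d.status, d.updated_at);
```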
Ayrton (1 rep)
Aug 28, 2022, 11:11 PM • Last activity: Jul 17, 2025, 11:03 AM
30 votes
8 answers
85786 views
Pros/Cons of using multiple databases vs using a single database
I was working on a new project which has the requirement to use 7 databases, the argument being that performance, stability, and optimization are more easily implemented. While I don't agree, I'm having trouble collecting good arguments for using a single database (splitting the tables into logical domains). One argument I have so far is data integrity (I can't use foreign keys between databases). What are good pros/cons for using a single or multiple databases?

**[summary so far]**

Arguments against multiple databases:

- Losing data integrity (can't use foreign keys across databases)
- Losing restore integrity
- Gaining complexity (db users/roles)
- Small odds the server/database will go down

Solutions:

- Use schemas to separate domains.
- POC: Use dummy data to prove the point with execution plans in 7 DBs vs. 1 DB.
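On the "use schemas to separate domains" point, a hedged sketch in SQL Server syntax (names illustrative): schemas give the logical split the 7-database plan wants, while cross-domain foreign keys keep working.

```sql
-- One database, two logical domains.
CREATE SCHEMA sales;
GO
CREATE SCHEMA billing;
GO
CREATE TABLE sales.customer (
    customer_id INT NOT NULL PRIMARY KEY
);
CREATE TABLE billing.invoice (
    invoice_id  INT NOT NULL PRIMARY KEY,
    customer_id INT NOT NULL
        REFERENCES sales.customer (customer_id)  -- cross-schema FK works
);
-- A cross-DATABASE foreign key is not supported, which is exactly
-- the data-integrity argument against splitting into 7 databases.
```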
rdkleine (421 rep)
Jul 12, 2011, 09:08 AM • Last activity: Jul 5, 2025, 05:39 PM
1 vote
1 answer
869 views
SCCM Database on the SQL Server instance shared with other applications
My company uses SCCM with a dedicated SQL Server instance running version 2016. I noticed that this VM is completely idle most of the time: I barely see any CPU activity, and wait statistics since startup are almost zero. Given that, I wanted to move the SCCM-related databases to a SQL Server 2019 instance shared with other applications that are not very active. It would allow me to reduce SQL licensing costs. During the migration, we encountered two issues. First, the process failed with this error:
ERROR: Site server does not have admin rights on remote sql server machine '*Destination SQL instance*'
We got past that by adding the primary site server to the local admin group on the new SQL instance. That was somewhat expected, because the prerequisite is mentioned on various websites. But after that, the process failed again with another error:
Failed to get sql service account, Server:,instance:

Error: Unsupported Sql service account
At this point, we rolled back the change. On the destination server we use a Managed Service Account as the SQL Server service account, so I suspect that was the problem. Could you please advise whether that could be the case, or whether it is something else? I was expecting that we would simply re-point the SCCM app to the new server and it would communicate with SQL Server like most other applications. Looking at the error messages we got during the installation, it seems that is not the case: SCCM tries (at least during installation) to interact with instance- and VM-level objects. In this situation, I wonder whether it is safe to share its databases with other applications. Could you please advise if anyone is running SCCM with its database on a SQL instance shared with other applications? Is there any risk associated with such an approach (I mean specific to SCCM, not related to consolidation problems in general)?
Radek Gąska (192 rep)
Feb 16, 2023, 02:51 PM • Last activity: Jun 21, 2025, 04:02 PM
0 votes
1 answer
189 views
Efficient many to two relationship
I'm building out a database with a `users` table, and I want to store the mutual (Facebook) friends between every set of two users in the database. Doing this in a normalized, efficient way seems tricky. I've considered two options so far:

***Option 1***

Create a table `mutual_friend_set` with columns `id`, `user1_id`, `user2_id`. Create another table `mutual_friends` with columns `id`, `mutual_friend_set_id`, `name`, `picture`, `unique_mutual_friend_id`. The problem with this solution is that there's nothing to distinguish `user1_id` from `user2_id`, so I'd need to either create duplicate entries such as:

| id | user1_id | user2_id |
|----|----------|----------|
| 1  | 123      | 456      |
| 1  | 456      | 123      |

which would take up twice as many rows as necessary and throw normalization out the window, or ensure the lower of the two user ids is in the `user1_id` column. If, however, I wanted to get all mutual friend sets for a single user, I'd need to query across both columns.

***Option 2***

Create a table `mutual_friend_set` with fields `id`, `user_pair`, such that `user_pair` is a string of the combined user relationship, delimited by a comma, e.g. `id: 1, user_pair: '123,456'`. The user with the lower ID would be placed before the comma. This would get around the normalization issues, but of course if I wanted to grab all the mutual friend pairs for a single user I'd have to run a LIKE query, which isn't exactly efficient.

I assume this is a common problem, which raises the question: is there a standard way of doing this? Given my inexperience, am I missing something obvious? Any ideas appreciated.
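A hedged sketch of the canonical-ordering variant of Option 1 (generic SQL; names from the post): a CHECK constraint keeps exactly one row per pair, and a two-column query finds all pairs for one user.

```sql
CREATE TABLE mutual_friend_set (
    id       INT NOT NULL PRIMARY KEY,
    user1_id INT NOT NULL,
    user2_id INT NOT NULL,
    CHECK (user1_id < user2_id),       -- canonical order: one row per pair
    UNIQUE (user1_id, user2_id)
);

-- All pairs involving user 123; an index on each column keeps this cheap.
SELECT *
FROM mutual_friend_set
WHERE user1_id = 123
   OR user2_id = 123;
```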
PlankTon (183 rep)
Mar 9, 2018, 06:31 AM • Last activity: Jun 20, 2025, 10:00 PM
0 votes
1 answer
249 views
Model data with many to many and overrides
I'm having some trouble modeling the database for this particular scenario in Postgres. We have a set of toys that contain several pieces. Each piece is associated with one or more materials. All of these relationships are many-to-many, because a piece can be placed in many different toys, and a material can be associated with many different pieces.

Here's where I lack the knowledge on implementation. There's another many-to-many relationship, where a toy can be made by many factories, and a factory can manufacture many types of toys. In the general case, for a particular toy, we want to know what pieces it has and which materials are associated with those pieces. However, there's a case where the material can be different for a given piece *depending on the factory the toy is made in*.

In short, there's a "default" case where we know that, for a given toy, we have a set of pieces, and for each of those pieces, we have a set of materials. But for an arbitrary factory, the material associated with a piece is different. The furthest I've gotten is putting this information in the `piece_material` junction table:

| piece_id | material_id | factory_id |
|----------|-------------|------------|
| 1        | 1           |            |
| 1        | 2           |            |
| 1        | 3           | 1          |

But this obviously doesn't work, because I don't have information indicating which material is being replaced in a particular factory. I'm looking for a query that will get the materials associated with a piece, applying the overrides when any are present (i.e., in the last row of the above example, if it's overriding material 2, that would only get materials 1 and 3 back).
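One common fix for the gap the asker identifies is to record which default row each override replaces; a hedged sketch (the `replaces_material_id` column and the surrogate key are assumptions, not from the original):

```sql
CREATE TABLE piece_material (
    id                   INT PRIMARY KEY,
    piece_id             INT NOT NULL,
    material_id          INT NOT NULL,
    factory_id           INT,  -- NULL = default, applies in every factory
    replaces_material_id INT   -- set only on override rows
);

-- Materials for piece 1 when made in factory 1: the factory's overrides,
-- plus every default that no override replaces.
SELECT pm.material_id
FROM piece_material pm
WHERE pm.piece_id = 1
  AND (pm.factory_id = 1
       OR (pm.factory_id IS NULL
           AND NOT EXISTS (SELECT 1
                           FROM piece_material o
                           WHERE o.piece_id = pm.piece_id
                             AND o.factory_id = 1
                             AND o.replaces_material_id = pm.material_id)));
```

With the example data (defaults 1 and 2, plus material 3 overriding material 2 in factory 1), this returns materials 1 and 3, which is the behavior the question asks for.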
BrDaHa (101 rep)
May 9, 2017, 06:35 PM • Last activity: Jun 7, 2025, 12:03 PM
-1 votes
1 answer
722 views
Database structure for multiple user types
I am developing a database that will store users of different types. Currently there are only two types of users, clients and agents. In the future there may be a new type of user, "organization," which will create and manage agents. The system also includes administrators, implemented through roles. The idea is to have a "users" table, which is mainly responsible for login functionality (email, password, token, and other technical fields), and additional tables for each of the corresponding user types, which have their own unique set of columns. For example, agents may have fields such as specialty, experience, diplomas, etc. Currently clients do not have any unique fields, but this may change in the future. However, as I have not managed a database of this complexity before, I am interested in how to organize a few aspects:

1. Firstly, a user can be any of the aforementioned types. For example, if a user is logged in as an agent, they should still be able to place an order.
2. Secondly, it is necessary to provide the ability to create clients or agents without creating a user account, for example if a client contacts the company by phone or email without registering on the website. It should also be possible to add the "organization" type, where the organization's administrator can create agents with or without login credentials (saving the record in the "agent" table without a record in the "user" table).
3. Thirdly, I would like to draw attention to the "appointment" table. The table contains foreign keys to clients and agents, not users.

The main problems I am facing are:

1. Duplication of basic data (first_name, last_name) for clients and agents, as this data is stored in the "user" table, but I need to be able to create clients and agents without a user account.
2. Doubts about the need for the "client" table, as it currently has no additional fields and complicates the system.

**Here's what I've got so far**

*[ER diagram]*
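A hedged sketch of one common answer to the duplication problem: hold the shared identity in a `person` table and make the login account just another optional subtype (generic SQL; all names are illustrative):

```sql
-- Shared identity lives exactly once.
CREATE TABLE person (
    person_id  INT PRIMARY KEY,
    first_name VARCHAR(100) NOT NULL,
    last_name  VARCHAR(100) NOT NULL
);

-- Only people who can log in get an account row.
CREATE TABLE user_account (
    person_id     INT PRIMARY KEY REFERENCES person (person_id),
    email         VARCHAR(255) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL
);

-- Subtypes carry only their own columns.
CREATE TABLE agent (
    person_id        INT PRIMARY KEY REFERENCES person (person_id),
    specialty        VARCHAR(100),
    experience_years INT
);

CREATE TABLE appointment (
    appointment_id INT PRIMARY KEY,
    client_id      INT NOT NULL REFERENCES person (person_id),
    agent_id       INT NOT NULL REFERENCES agent (person_id)
);
```

In this shape a phone-in client is simply a `person` row with no `user_account` row, and the thin `client` table can be dropped until it actually has columns of its own.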
Fapalz (1 rep)
Apr 18, 2023, 11:18 AM • Last activity: May 29, 2025, 06:09 PM
0 votes
1 answer
408 views
The most generic e-shop DB schema
Let's say I create a schema for an e-shop that potentially sells everything: laptops, phones, tires, snacks, etc. Considering this, I cannot create a new table/entity for each category of goods the shop sells and describe the goods' properties with new attributes. So I went with a more generic model to describe each product:

*[schema diagram]*

To me it seems fine, but as soon as I tried to implement this solution, it turned out to be impossible to execute any query like

> select names and prices of all laptops where ssd capacity >= 512 and weight < 2kg and price <= 2000 money

How is it possible to change this schema and keep the ability to store information about any product?
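For reference, such a filter is expressible against a generic entity-attribute-value layout, just painfully: one join per filtered attribute. A hedged sketch assuming `product` and `product_attribute(product_id, name, value)` tables (both names are assumptions):

```sql
SELECT p.name, p.price
FROM product p
JOIN product_attribute ssd
  ON ssd.product_id = p.id AND ssd.name = 'ssd_capacity_gb'
JOIN product_attribute wt
  ON wt.product_id = p.id AND wt.name = 'weight_kg'
WHERE p.category = 'laptop'
  AND CAST(ssd.value AS DECIMAL(10, 2)) >= 512
  AND CAST(wt.value  AS DECIMAL(10, 2)) < 2
  AND p.price <= 2000;
-- Every extra attribute filter costs another self-join, and the CASTs
-- defeat most indexing; JSON columns or per-category tables are the
-- usual escape hatches.
```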
63350541 (1 rep)
Mar 20, 2021, 12:00 PM • Last activity: Apr 18, 2025, 02:04 PM
1 vote
3 answers
1326 views
Database design for monthly leaderboards
I have a system where users can make predictions on events. There are about two events a week. When an event ends, the system scores all the predictions and updates this in the database. Here is an approximate representation of my current DB structure:

Users table

| ID | username | score |
|----|----------|-------|
| 1  | alice    | 12    |
| 2  | bob      | 22    |

Games table

| ID | gameTitle      | gameTime            | questions | answers |
|----|----------------|---------------------|-----------|---------|
| 1  | Football match | 2019-08-11 14:00:00 | {json}    | {json}  |
| 2  | Hockey game    | 2019-07-11 15:00:00 | {json}    | {json}  |

Predictions table:

| ID | gameID (FK) | userID (FK) | prediction | score | rank |
|----|-------------|-------------|------------|-------|------|
| 1  | 1           | 1           | {json}     | 6     | 1    |
| 2  | 1           | 2           | {json}     | 4     | 2    |
| 3  | 2           | 1           | {json}     | 2     | 2    |

Initially, 'score' and 'rank' in Predictions are NULL; after the event ends I run a script that goes through all entries for that game, scores them per user, and generates a rank. With this structure I can do things like:

- List the highest-scoring users of all time (SELECT from Users, score descending)
- List the rankings for each event (SELECT rank from Predictions where gameID = X, descending)

However, what I now want is time-constrained rankings: which users scored highest across all events in the current month, or for the month of February, etc. I know I could do this manually by going through the Games table to get all gameIDs in the given month, then going through the Predictions table for all predictions with those IDs, adding up the score for each user, and returning that. However, that seems crazily inefficient and resource-intensive, especially considering this will be an API call. I'm therefore wondering how I can accommodate this in my current database, whether it's possible and/or advised, or whether I should use some other technology.

- One idea would be to generate a new table for each month, and update and pull from this when relevant.
- Another would be a 'monthly score' column in the Users table that resets each month. However, this wouldn't have historical data, which would be nice (though I guess it could always be generated).

I feel like there might be some database paradigm trick that I'm missing, so it'd be great to hear a proposed solution for this.
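The "manual" approach described is, in relational terms, a single aggregate query rather than application-side loops, and at two events a week it stays small; a hedged sketch against the tables above:

```sql
-- Monthly leaderboard for August 2019; indexes on Games.gameTime and
-- Predictions.gameID keep this cheap at this volume.
SELECT u.username, SUM(p.score) AS monthly_score
FROM Predictions p
JOIN Games g ON g.ID = p.gameID
JOIN Users u ON u.ID = p.userID
WHERE g.gameTime >= '2019-08-01'
  AND g.gameTime <  '2019-09-01'
GROUP BY u.username
ORDER BY monthly_score DESC;
```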
David (11 rep)
Aug 15, 2019, 09:40 AM • Last activity: Apr 16, 2025, 10:07 PM
0 votes
0 answers
70 views
How do you create a table that can have x number of possible things joining into it?
# The problem

I have a system that I'm building where I need to allow users to track time against any number of other systems. For example, I have a concept called a Project; a project can contain 0 or many Tasks, where each task can contain anything from equipment being used, to materials being consumed, to other tasks which can have more of the same. Think of a Gantt chart. In this chart a user needs to be able to track time against the material being used, a vehicle being driven, or equipment being used. Right now I just track time against the project as a whole, which doesn't allow the granularity that I need.

## Db schema outline

This is a *VERY ROUGH* outline of what this might be like. It is not fully accurate to my schema, but it should get the point across. There are more tables around time and projects and employees, but think of this as a very abstracted version of the system. As more and more items need to track time, they would need to be added to the Activities table, which seems wrong.

*[schema diagram]*

# Ideas

One way I thought about solving this was having a join table that can join to 0-1 of a project, task, material, equipment, or vehicle. This seems to work; the problem is that as more things get added, I'm going to need to keep adding more and more foreign keys that won't get used, as only one project, task, equipment, etc. will ever be referenced at a time. This seems like a very inefficient way of doing it.

Another way might be to have each item that can track time hold an FK to the join table (currently called activities/effort), with that table not having any FKs out to the tables that use it, but still having a column for the ID of the thing it ties back to. I wouldn't be able to enforce the relationships this way, but it seems to be a decent solution, especially if I break the time-management stuff into its own schema so it wouldn't know about other systems anyway. Basically, this table would need more software to manage it, and we would have to be very careful about what writes to it and when, but it seems like it's likely the best solution at the moment.

# The question

Given the problem I have outlined, does anyone have ideas for how one might represent this kind of data? In case it matters, our DB is Postgres, but this is more of an architectural design problem.

# EDIT 1:

I have been working on the problem a bit more and have another idea for how it could be done. I don't know that this is a solution just yet, so I don't want to mark it as such. Basically, it involves having my timecard table (replacing the activities one) carry a timeTrackingId, where each employee can have multiple timecards (mapped via the timesheet). Each timecard tracks the time for one given trackable entity. So if an employee is tracking time against a project and 3 tasks, they will effectively have 4 timecards, with whichever one is active at any given time decided by the software.

Here is an updated diagram:

*[updated schema diagram]*
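A hedged sketch of the first idea with its integrity worry patched: one nullable FK per trackable type plus a CHECK that exactly one is set (Postgres syntax; table and column names are illustrative):

```sql
CREATE TABLE time_entry (
    id           BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    employee_id  BIGINT NOT NULL REFERENCES employee (id),
    minutes      INT    NOT NULL,
    project_id   BIGINT REFERENCES project (id),
    task_id      BIGINT REFERENCES task (id),
    equipment_id BIGINT REFERENCES equipment (id),
    vehicle_id   BIGINT REFERENCES vehicle (id),
    -- exactly one trackable target per row
    CHECK (num_nonnulls(project_id, task_id, equipment_id, vehicle_id) = 1)
);
```

Each new trackable type still means a new column, but the unused FKs are just NULLs, and referential integrity stays enforced, which the second (FK-less) idea gives up.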
Kayden Miller (1 rep)
Feb 8, 2025, 11:04 PM • Last activity: Feb 9, 2025, 09:57 PM
3 votes
2 answers
1545 views
Kafka Internal Data structure vs LSM tree
I was going through database storage engines and found out about LSM trees. I have also read about Kafka's architecture and know that Kafka is internally a commit log. I want to know whether Kafka internally uses an LSM data structure for its append-only store, or some other data structure for storing its data.
Ayman Patel (153 rep)
Mar 1, 2021, 07:39 AM • Last activity: Nov 26, 2024, 02:48 PM
0 votes
1 answer
53 views
MongoDB ReplicaSet system configuration and modeling
**I want to create a replica set model that looks like this:**

- PC 1: ip=1.1.1.1 - the only one that writes
- PC 2: ip=2.2.2.2 - read only
- PC 3: ip=3.3.3.3 - read only

**Use case:** On PC 1 I have a program that writes to the DB; on PC 2 and PC 3 I have a program that reads that data. I want to be able to access the data from each PC even if there is no connection. The solution I thought of is creating a replica set; that way data is replicated to each PC locally. The challenge is how to configure the replica set.

**System requirements:** Only PC 1 does the writing, meaning it is the primary. If the connection between the PCs drops, each PC should be able to work standalone; as far as I understand, the solution for that is nominating that PC as primary, and therefore an arbiter is needed (you need a group of at least 2 to hold an election). Based on this requirement, each PC runs a DB instance and an arbiter, and the configuration should look something like this:

    members: [
      { _id: 0, host: "1.1.1.1:27017", priority: 10 },
      { _id: 1, host: "1.1.1.1:27018", arbiterOnly: true },
      { _id: 2, host: "2.2.2.2:27017", priority: 1 },
      { _id: 3, host: "2.2.2.2:27018", arbiterOnly: true },
      { _id: 4, host: "3.3.3.3:27017", priority: 1 },
      { _id: 5, host: "3.3.3.3:27018", arbiterOnly: true }
    ]

I have tested this scenario and shut down PC 1 and PC 2, so PC 3 was left alone; on PC 3 the DB instance and arbiter are running, but for some reason the instance on PC 3 won't become primary (meaning I can't access the DB and read). I would love to know why. Maybe this is the problem?

    { "setDefaultRWConcern": 1, "defaultWriteConcern": { "w": 2 }, "defaultReadConcern": { "level": "majority" } }
Roman Milman (1 rep)
Jul 30, 2024, 06:37 AM • Last activity: Aug 1, 2024, 09:27 AM
-1 votes
1 answer
33 views
problem of correspondence between scheduled shifts and actual shifts
The `employee` table has a strong key. On the one hand I record the shift calendar of each employee (`shift calendar`), and on the other hand I record the shifts that were actually carried out by the employee. Now suppose employee "X" has two shifts, one on Monday morning and the other on Tuesday morning, and suppose he arrives on Monday evening. When he arrives, is that the Monday morning shift, where he arrived late, or the Tuesday morning shift, where he arrived early? There is no correspondence between the scheduled shifts and the shifts carried out. How can I add a match without creating redundancy?

*[ER diagram]*
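One common shape for making the match explicit, sketched under assumed table names (not from the original): the worked shift carries a nullable FK to the scheduled shift it fulfils, so the pairing is stored once and nothing else is duplicated.

```sql
CREATE TABLE scheduled_shift (
    scheduled_shift_id INT PRIMARY KEY,
    employee_id        INT NOT NULL REFERENCES employee (employee_id),
    starts_at          TIMESTAMP NOT NULL,
    ends_at            TIMESTAMP NOT NULL
);

CREATE TABLE worked_shift (
    worked_shift_id    INT PRIMARY KEY,
    employee_id        INT NOT NULL REFERENCES employee (employee_id),
    clock_in           TIMESTAMP NOT NULL,
    clock_out          TIMESTAMP,
    -- which planned shift this fulfils; NULL for unplanned work
    scheduled_shift_id INT UNIQUE REFERENCES scheduled_shift (scheduled_shift_id)
);
```

The UNIQUE on the FK keeps the pairing 1:1; whether a Monday-evening arrival points at the Monday or the Tuesday scheduled shift becomes a business rule applied at clock-in, not a schema ambiguity.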
Domenico Lugubre (1 rep)
May 10, 2024, 04:22 PM • Last activity: May 10, 2024, 08:24 PM
2 votes
2 answers
8631 views
SQL Server 2016 Multiple Availability Groups and instances
So we are getting ready to make a major architecture shift, and I get to be on the hook for it, yay! Currently we have 22 separate availability groups, each with its own Windows cluster, across 44 servers. We want to move to 6 servers to hold the 22 AGs (yes, I know it doesn't divide evenly). So, for example, a given server (VM) would run 4 SQL Server instances; each instance would be an availability group and have its own listener. Something like:

* NODE1: SQLINST1, SQLINST2, SQLINST3, SQLINST4
* NODE2: SQLINST1, SQLINST2, SQLINST3, SQLINST4
* AGs: SQLINST1, SQLINST2, SQLINST3, SQLINST4
* Listeners: SQLINST1Lis, SQLINST2Lis, SQLINST3Lis, SQLINST4Lis

My questions are:

* Does each SQL Server instance also need to be its own Windows cluster, or could one cluster suffice for all the named instances?
* Will this work at all?
trocheaz (165 rep)
Apr 25, 2017, 07:50 PM • Last activity: May 9, 2024, 09:15 AM
0 votes
1 answer
116 views
How Can I Efficiently Structure a Data Model for Handling Both One-Time and Recurring Tasks in a Task Management app?
Fellow developers and architects! I'm in the process of designing a web app aimed at managing tasks and recurring tasks, which can be seen as habits. The unique challenge I'm facing revolves around how best to structure my data model to efficiently handle both one-time tasks and recurring tasks, without differentiating between them during data retrieval.

The current design splits the "task" concept into two entities:

1. **Task**: holds shared task data such as title, description, category, type, recurrence rule, etc.
2. **State**: contains specific instance data such as state (new, in-progress, done), start date, due date, etc.

This structure means that both tables/entities must be joined and fetched together at all times, considering that task entity attributes (like description and title) are often updated.

**To provide more context, here are a few use cases to consider:**

**Use case 1**: Imagine a monthly calendar view, and today is January 1st, with each user having hundreds of tasks, each recurring many times a week. The user wants to see all the tasks they have to do in the following December and change the due date/time of some of them.

**Use case 2**: With the same setup as above, a user has a task with a daily recurrence. They now want to change the recurrence to every second day and change the due time.

**Use case 3**: Again the same calendar setup, but now the user wants to change the description and title of some tasks on a regular basis. Of course, this change should apply to every single recurrence of that particular task.

**Given these requirements, I have several questions for the community:**

1. **Performance & scalability**: What are your thoughts on the proposed data model in terms of performance and scalability?
2. **Database selection**: Which type of database would be best suited for this application (SQL vs. NoSQL), and why?
3. **Alternative models**: Are there more efficient data models or structures that could handle these scenarios better?
4. **State entity creation**: For recurring tasks, should "state" entities be generated in advance or on demand? How can this be optimized for both user experience and system performance? If I create state entities in advance, I risk the user picking a time frame where those states are still missing, so I have to check whether new state entities need to be created anyway. On the other hand, if I do it on demand, every fetch operation first has to check and, if needed, create the state entities, which will make all the queries very slow.

Your insights, especially if you've tackled similar challenges, would be invaluable to me. I'm particularly interested in any architectural advice, design patterns, or technology recommendations that could enhance the app's functionality and user experience.
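A hedged sketch of the Task/State split with lazy, idempotent materialization (generic SQL; all names are illustrative): the UNIQUE constraint is what makes "create on demand" safe to run on every fetch.

```sql
CREATE TABLE task (
    task_id         INT PRIMARY KEY,
    title           VARCHAR(200) NOT NULL,
    description     TEXT,
    recurrence_rule VARCHAR(255)   -- e.g. an iCalendar RRULE; NULL = one-time
);

-- One row per concrete occurrence, created when a window is first viewed.
CREATE TABLE task_state (
    state_id INT PRIMARY KEY,
    task_id  INT NOT NULL REFERENCES task (task_id),
    due_at   TIMESTAMP NOT NULL,
    status   VARCHAR(20) NOT NULL DEFAULT 'new',
    UNIQUE (task_id, due_at)       -- insert-if-absent cannot duplicate a slot
);
```

In this shape, edits to title or description (use case 3) touch only `task` and apply to every occurrence, while changing the recurrence (use case 2) rewrites `recurrence_rule` and regenerates only the future, still-'new' `task_state` rows.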
nanobot (1 rep)
Apr 6, 2024, 12:57 PM • Last activity: Apr 7, 2024, 12:38 PM
0 votes
1 answer
38 views
Handling Id Uniqueness With a Single Key or with a Composite Key
First off, I'm new to architecting a DB, and I'm looking for help from people who are more experienced and can point out gotchas and issues I might not see. I'm building out a new DB schema, and I'm weighing the benefits of using composite keys for uniqueness vs. a single primary key. Here is an example: an Owner owns things, Things have components, Components have descriptions.

### Approach 1 - Composite Key

One way I can architect this is by giving the owner an INT id, OwnerId, and then anything they own a SMALLINT id, OwnedThingId. The result is that each owner can own MAX(SMALLINT) things. The other result is that the OwnedThingIds are repeated in the table; for example, the table would hold many different OwnedThingIds of 1. Uniqueness is only attainable by combining the OwnedThingId with the OwnerId. In this case the PK would be (OwnerId, OwnedThingId).

*[diagram: Approach 1]*

### Approach 2 - Single Key

The other way I can architect this is by making the OwnedThingId bigger, say an INT, and making that field unique throughout the whole table. In this case there would only be one thing with an id of 1, regardless of how many owners there are. The result is a limit of MAX(INT) for **all owners**. However, the record is now easier to refer to, since I only need an OwnedThingId instead of both an OwnerId and an OwnedThingId.

*[diagram: Approach 2]*

### Approach 1 - Pros & Cons

Pro: I like Approach 1 because it has the potential to take less space with smaller keys and because it's more contained. The number of owners in the system doesn't really matter; the OwnedThingIds available to the owners don't change. It also mirrors the real world a bit better, i.e., student id 12 isn't the same student at every school.

Con: Since multiple ids are required to determine uniqueness, a future table might have to store and refer to X number of ids instead of just one.

### Approach 2 - Pros & Cons

Pro: Only one field is needed to confer uniqueness; a record is unique across all owners. There is also less to keep track of; a future table would only need to store and refer to the one unique id.

Con: It can take more space because it uses bigger int values. Each owner that claims an id reduces the pool of available ids for all owners.

### Conclusion

Like I said, I'm leaning towards Approach 1, but I get the feeling I might be shooting myself in the foot as time goes on and the system grows. Thus I am reaching out to those with more architecture experience to inform me of potential gotchas or reasons why I should consider one over the other.
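For concreteness, the two layouts sketched side by side in generic SQL (names adapted from the post); the trade-off Approach 1's "Con" describes shows up in the child table, which must inherit the whole composite key:

```sql
-- Approach 1: identity scoped to the owner (composite key).
CREATE TABLE OwnedThing_v1 (
    OwnerId      INT      NOT NULL REFERENCES Owner (OwnerId),
    OwnedThingId SMALLINT NOT NULL,          -- unique only per owner
    PRIMARY KEY (OwnerId, OwnedThingId)
);

CREATE TABLE Component_v1 (
    OwnerId      INT      NOT NULL,
    OwnedThingId SMALLINT NOT NULL,
    ComponentId  SMALLINT NOT NULL,
    PRIMARY KEY (OwnerId, OwnedThingId, ComponentId),
    FOREIGN KEY (OwnerId, OwnedThingId)
        REFERENCES OwnedThing_v1 (OwnerId, OwnedThingId)
);

-- Approach 2: global identity (single surrogate key).
CREATE TABLE OwnedThing_v2 (
    OwnedThingId INT NOT NULL PRIMARY KEY,   -- unique across all owners
    OwnerId      INT NOT NULL REFERENCES Owner (OwnerId)
);

CREATE TABLE Component_v2 (
    ComponentId  INT NOT NULL PRIMARY KEY,
    OwnedThingId INT NOT NULL REFERENCES OwnedThing_v2 (OwnedThingId)
);
```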
jeremysawesome (121 rep)
Mar 27, 2024, 06:01 PM • Last activity: Mar 28, 2024, 04:49 AM
0 votes
1 answer
134 views
Data servers: Do query optimisers re-write queries removing redundant columns during the plan creation?
Although I'm using SQL Server, as this is a question of whether an optimiser will rewrite a query to remove redundant columns, I'm pitching this at all data servers: RDBMS, NoSQL, MPP, anything capable of holding and querying stored data using SQL that optimises a query before running it. I kinda feel that they would, or at least that it would seem logical, especially as it'd be crazy to fill a cache unnecessarily, but I can't find any evidence that they do.

I don't want to get bogged down in how environment, network, server, table, and cache loads, or table size and performance, will alter the selected plans; this is just a very high-level question: would the server rewrite a query to remove redundant columns and/or joins (but mostly columns) that are not in any way used to generate the result?

On my isolated dev server, I have this test query running against a tiny 290k-row test table. It has a PK on an identity field and a composite index; whilst two of the fields from the derived table are covered as part of that index, the primary field under test is not. The derived table in this instance has 7 redundant columns, and I'm executing these three DBCC commands before each run so as to start with a cold cache:

    DBCC FreeProcCache;
    DBCC DropCleanBuffers;
    DBCC FreeSystemCache('sql plans');

    Select a.provider_type
          ,Count(1) As count_of_provider_type
    From (
          Select customer_id
                ,access_plan
                ,provider_type
                ,ap_postcode
                ,browser
                ,session_start_date
                ,session_end_date
                ,payment_method
          From adhoc..datacentre
         ) a
    Group By a.provider_type;

This returns the actual plan below, and from the profiler, CPU: 92, Reads: 23703.

*[actual execution plan]*

Then, having re-run the three DBCC commands to return the cache to cold, running this re-expressed query:

    Select provider_type
          ,Count(1) As count_of_provider_type
    From adhoc..datacentre
    Group By provider_type;

gives me this actual plan, and profiler CPU: 78, Reads: 23476.

*[actual execution plan]*

Notice any similarities? Given the batch count, IO, and CPU, this leads me to suspect the optimiser did rewrite the first query to remove the derived table and the redundant columns. But how can I prove it? I can't find anything at learn.microsoft under Query Processing Architecture to suggest that the optimiser would rewrite the query, and neither can I find a way of seeing what was transferred to cache. Does anything exist that can say definitively exactly what was read and cached?

Remember: although I'm using SQL Server, I'd be interested to know how other RDBMS/MPP products such as GBQ, Redshift, Athena, Snowflake, etc. would handle this.

And finally, the why. What nutjob would write the first query without realising it could be re-expressed? This is twofold: firstly views, and secondly, and more prominently, SQL from visualisation and reporting tools capable of accepting an SQL script, which is often functionally equivalent to a non-materialised view. As we all know, views can be abused. They shouldn't be, and in an ideal world users would create views as isolated models to spit out a result set ready for ingestion by the tool the model was designed for, which is also the direction for visualisation and reporting tools. But we all know this never happens. Just like a doctor who has a cream for that, so too has engineering built a view that includes "what you're after" in its output, "so you don't need to go and create a new query, just query that view." And if the view is basic enough, maybe it's being used to replace a table but includes SCD-2 logic; maybe it selects non-engineering and/or PII data; or maybe it is a basic model with a couple of non-complex joins. But if a user were to query this view for only a couple of fields, would the optimiser rewrite the query to remove redundant columns from a single-table view, and possibly remove redundant joins from a multi-table view? As I said at the beginning, I feel the optimiser would, but I need to be able to evidence this beyond conjecture or theory.
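On the "does anything exist that can say what was read and cached" point: SQL Server exposes the buffer pool's contents through DMVs, so cached pages can be counted per object after each cold-cache run. A hedged sketch using documented DMVs (run in the test database; `datacentre` is the table from the post):

```sql
-- Pages currently in the buffer pool, per object and index, for this DB.
SELECT o.name   AS object_name,
       i.name   AS index_name,
       COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
  ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
  ON p.hobt_id = au.container_id
 AND au.type IN (1, 3)              -- in-row and row-overflow data
JOIN sys.objects AS o ON o.object_id = p.object_id
JOIN sys.indexes AS i ON i.object_id = p.object_id
                     AND i.index_id  = p.index_id
WHERE bd.database_id = DB_ID()
GROUP BY o.name, i.name
ORDER BY cached_pages DESC;
```

Running this after each of the two cold-cache tests would show whether the derived-table version pulled any pages into cache that the re-expressed version did not.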
Steve Martin (9 rep)
Mar 3, 2024, 08:49 AM • Last activity: Mar 3, 2024, 09:15 AM
1 vote
2 answers
60 views
Optimized design for a score table
I have a simple question, yet I couldn't find a direct answer/explanation. Sorry if it's a duplicate. I want to build a score table for a small quiz game where users get questions and can be either right or wrong. As I see it, I have two choices: either a table where 1 row = 1 user answer, like this:

| userId | rightAnswer | wrongAnswer |
|--------|-------------|-------------|
| 1      | 0           | 1           |
| 1      | 0           | 1           |
| 1      | 1           | 0           |
| ...    | ...         | ...         |

or a table where 1 row = 1 user's global score:

| userId | rightAnswers | wrongAnswers |
|--------|--------------|--------------|
| 1      | 42           | 21           |
| 2      | 100          | 0            |
| 3      | 12           | 13           |
| ...    | ...          | ...          |

I have pretty much zero experience with database/SQL optimization, so I don't know which one is the more efficient. Here is my thought process:

**First option**: Better(?)/faster(?) to add/update a score, because I don't care about the current table state; I just have to insert a row. But to get a user's global score I would need a heavy(?) query like

> SELECT SUM(right), SUM(wrong) WHERE userId = x;

Since I would want to show a user's global score on each page/request, I feel like this is not the smart choice. Plus, since 1 row = 1 user answer, the table could get very, very large.

**Second option**: Slower(?) to add a score, since I would have to update an existing row, meaning SELECT then UPDATE on each answer. Not particularly my case, but this option is less flexible, since, unlike the first option, I won't be able to store which question was rightly/wrongly answered. But with this I can get a user's total score without a heavy query. It would be better if an UPDATE query could return the resulting row, so I don't have to UPDATE and then SELECT to show it.

Lastly, if you are kind enough to spend time answering me, could you briefly explain why one or the other is technically better, and how I could test it (is there some sort of tool/procedure)? Thanks
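On the "if an UPDATE could return the resulting row" aside: some engines do exactly that. A hedged sketch in PostgreSQL syntax (the table name is an assumption):

```sql
-- Second option's layout: bump the running total and get the new
-- totals back in one round trip via PostgreSQL's RETURNING clause.
UPDATE user_score
SET    rightAnswers = rightAnswers + 1
WHERE  userId = 1
RETURNING rightAnswers, wrongAnswers;
```

As for testing: timing both layouts against a few million rows of generated dummy data, using the database's own timing and EXPLAIN facilities, is the usual procedure.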
1000k (13 rep)
Feb 14, 2024, 11:34 AM • Last activity: Feb 15, 2024, 12:12 AM
1 vote
1 answer
35 views
Seeking Guidance on Storage Models for Frequently Executed SQL Statement
An application frequently executes the following prepared statement in SQL:
    SELECT AVG(StockUnits)
    FROM PRODUCT;
I'm currently exploring the most suitable storage model for the database architecture. After some research, the term "Decomposition Storage Model" has come up, but could anyone help me understand why the "Decomposition Storage Model" might be particularly well-suited for handling the mentioned SQL statement? Any advice, explanations, or pointers to relevant resources would be highly appreciated.
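For intuition: under the Decomposition Storage Model each column is stored separately, so an aggregate over a single column reads only that column's data rather than every full row. A hedged illustration in SQL Server syntax, where a columnstore index provides a DSM-style layout (the index name is an assumption):

```sql
-- N-ary (row) storage: AVG(StockUnits) must scan whole rows.
-- Decomposed (column) storage: only the StockUnits segments are read.
CREATE NONCLUSTERED COLUMNSTORE INDEX ix_product_stockunits
    ON PRODUCT (StockUnits);

SELECT AVG(StockUnits)
FROM PRODUCT;   -- now satisfiable from the single decomposed column
```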
cricket900 (129 rep)
Dec 3, 2023, 04:27 PM • Last activity: Dec 3, 2023, 05:50 PM
0 votes
1 answer
52 views
Two instance on one server vs 2 server
I have one server where one database stresses the server and blocks other DBs. The developer has optimized the queries, but we still need more power. The SQL Server is a VM in a vCloud Director tenant, and we see two possible solutions:

* **Two instances on the same server**: We add a second instance on the same VM and double the system resources. To the second instance we move only the stressed DB and limit its CPU and RAM usage, so that when the DB is under stress, the first instance has dedicated resources and there are no blocking queries, and when the second instance is down, the first instance has more resources. As another advantage, I don't have to pay for another SQL Server license...
* **Two servers**: A new server dedicated to the stressed DB... As a disadvantage, I have to pay for another SQL Server license, and the resources are dedicated to the single DB even when it is not under stress...

At this time we think the two-instance solution is the right way. Are we wrong? Are there disadvantages we have not considered? What do you think? Thanks in advance for your responses... Stefano
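On the "limit its CPU and RAM usage" part of the two-instance option: the standard per-instance knobs are `max server memory` and the processor affinity settings. A hedged sketch, to be run on each instance (the values are examples only):

```sql
-- Cap this instance's memory so the two instances on the VM
-- don't starve each other (value in MB; example only).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 8192;
RECONFIGURE;
```

CPU can be constrained similarly through each instance's processor affinity, so the stressed database's instance cannot consume every core.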
Stefano Ambrogi (13 rep)
Oct 23, 2023, 02:37 PM • Last activity: Oct 23, 2023, 02:50 PM