Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
2
votes
1
answers
149
views
How to implement updates from mySQL operational DB to Azure SQL DB for reporting
We have an **operational mySQL DB running on AWS** for a transactional system and an **Azure SQL DB for reporting** with PowerBI. Now I'd like to regularly (e.g. every night) do an update of certain tables from the mySQL DB to the Azure SQL DB.
I found this description of how to do incremental copies using Azure Data Factory, but the alternatives don't seem feasible to me:
1. Delta data loading from database by using a watermark requires adding watermark columns to the source DB, but I don't want to make changes to the operational DB because it is managed and regularly updated by the transactional system.
2. Delta data loading from SQL DB by using the Change Tracking technology seems to require a SQL Server DB as the source, if I understand it correctly.
The remaining two alternatives apply only to updates from files, not DBs, to my understanding.
Are there other feasible alternatives given the conditions described? They don't necessarily need to involve Azure Data Factory, but the updates should run fully automated in the cloud.
Maybe a non-incremental update (i.e. full replacement of the target DB tables every time) would be an option too, but I'm afraid that this would lead to high costs on the Azure SQL Server side - please share any experience on that as well, if available.
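A minimal sketch of the full-copy-plus-upsert idea on the Azure SQL side (not from the original post): the nightly pipeline bulk-copies the MySQL table into a staging table (e.g. with an Azure Data Factory copy activity), and a MERGE reconciles staging with the reporting table, so no watermark columns are needed in the source. The schema and column names (stg.Customer, rpt.Customer, CustomerId, Name, City) are placeholders.

```
-- Hypothetical target-side upsert, run after the nightly copy into the staging table.
-- stg.Customer is truncated and reloaded from MySQL each night; rpt.Customer is the reporting copy.
MERGE rpt.Customer AS tgt
USING stg.Customer AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, City)
    VALUES (src.CustomerId, src.Name, src.City)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

Compared with dropping and reloading the reporting tables outright, this only rewrites changed rows on the Azure SQL side, which may keep DTU/vCore consumption lower; whether that matters depends on the table sizes involved.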
Jörg Brenninkmeyer
(121 rep)
May 2, 2019, 07:26 AM
• Last activity: Jul 21, 2025, 06:05 AM
0
votes
1
answers
154
views
How to model changing OrderLine rows in fact table
In our business, we have Orders that are made up of OrderItems. Over time, these Orders can change status (e.g. Received, Challenged, Planned, Completed). During this process, the OrderItems for an Order can be changed, added, or removed.
For example, when an Order is finally fulfilled, the OrderItems might actually be for different Items and at different prices from the OrderItems on the original Order.
I'm wondering how to model this. Any suggestions welcome!
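One possible shape, purely as an illustration and not taken from the original post: keep the fact at the grain "one row per version of an order line", with effective dates, so every change, addition, or removal is a new row and history stays queryable. All names and types below are assumptions.

```
-- Hypothetical fact at the grain: one row per version of an order line.
CREATE TABLE FactOrderLineHistory (
    OrderKey         INT            NOT NULL,  -- FK to DimOrder
    OrderLineNumber  INT            NOT NULL,
    ItemKey          INT            NOT NULL,  -- FK to DimItem
    OrderStatusKey   INT            NOT NULL,  -- FK to DimOrderStatus (Received, Challenged, Planned, Completed)
    Quantity         DECIMAL(18,4)  NOT NULL,
    UnitPrice        DECIMAL(18,4)  NOT NULL,
    EffectiveFromKey INT            NOT NULL,  -- FK to DimDate: when this version became true
    EffectiveToKey   INT            NULL,      -- NULL (or a far-future key) for the current version
    IsCurrent        BIT            NOT NULL
);
```

Filtering on `IsCurrent = 1` gives the lines as they stand today; filtering on a date between the effective keys reconstructs the order as it looked at any point in its lifecycle.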
Steve
(21 rep)
Jul 18, 2019, 08:03 AM
• Last activity: Jul 14, 2025, 08:03 PM
1
votes
2
answers
176
views
Best way to store millions of measurements with hundreds of columns
I work with infrared spectrometers (several hundred of them) to analyze chemical compounds, and I am looking for an efficient solution to store the measured data so that I can process it later on.
Each time a spectrum is measured by a spectrometer, it gives me what I call an “acquisition”.
An acquisition is made of the following data:
- The spectrometer ID (array of max. 15 characters);
- ID of the measured chemical compound (array of characters);
- Date and time of the measurement (as an array of characters, e.g. 2023-05-19 13:24:00);
- Internal temperatures, voltages, etc. measured by the spectrometer (about 20 measurements);
- A vector with the 500 wavelengths we wanted to measure
- Example : 2001, 2002, …, 2499, 2500
- A vector with the 500 wavelengths which were actually measured
- Example : 2000.9456, 2002.5498, …, 2499.7648, 2500.0195
- A vector with 500 values of light intensities measured by the spectrometer (one value per wavelength)
- Example : 11614.9756, 11611.9512, …, 16084.7073, 16127.7561
- A vector with 500 values which represent the stability of each previous light intensity measurement
- Example : 2.6314, 3.2165, …, 2.2051, 1.9872
Note that while the first vector of target wavelengths is constant, the other three vectors will never be the same from one acquisition to another.
All these data are saved as .csv files (one acquisition = one csv file). The content of each .csv typically looks like this:
| Spectrometer ID | Product ID | Date/time | Temperature | Voltage |
| --- | --- | --- | --- | --- |
| S_A01_A023 | Ethanol | 2023-05-19 13:24:00 | 23.4 °C | 35.8912 V |
| 2001 | 2002 | … | 2499 | 2500 |
| 2000.9456 | 2002.5498 | … | 2499.7648 | 2500.0195 |
| 11614.9756 | 11611.9512 | … | 16084.7073 | 16127.7561 |
| 2.6314 | 3.2165 | … | 2.2051 | 1.9872 |
I already have about 6 million such acquisition files, and this number will increase by about 1 million every month.
So far I have only performed statistics on one spectrometer at a time, which is easily manageable with Matlab as I have at most 50k acquisitions per spectrometer.
This is done by parsing the content of each .csv into a Matlab structure and saving it to a .mat file, which is a few hundred megabytes at most. On my machine, loading the .mat file takes at most 5 seconds, which is acceptable.
However, I now want to perform statistics on acquisitions coming from different scenarios. One such scenario could be, for instance, to perform some calculations on every ethanol spectrum measured at a temperature above 50 °C between January and May 2023 whose measured light intensity at 2005 nm is below 6000.
In Matlab, this would mean loading a file where the 6+ million acquisitions are saved and discarding most of them to keep only the 2k samples that are of interest for the above test case.
This solution is not acceptable since retrieving the data would be inefficient and time-consuming. Moreover, many different scenarios will need to be tested every day.
Therefore, I am looking for the best solution to store my data, one that can be interfaced with Matlab (or Python) to quickly extract the relevant subset. It is important to note that this data needs to be accessible by multiple people, not just myself; often two or three people will access the data simultaneously.
Note that all statistical computations will be performed with Matlab or Python.
I do not know much about databases, data warehouses, SQL, Hadoop, or other systems for storing and accessing large quantities of data. However, I am willing to learn, and I would be grateful if you could help me with this topic.
I hope that my explanations are clear enough. If not, do not hesitate to ask for clarification.
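For illustration only (this is not part of the question, and all names and types are assumptions): one common relational layout is a narrow header table plus a long-format table with one row per wavelength point, which makes the example scenario a plain indexed query. Postgres-flavoured DDL is shown; at roughly 6M acquisitions × 500 points (~3 billion rows), partitioning or a columnar store (e.g. Parquet queried from Python) would be worth comparing.

```
-- Hypothetical schema: one header row per acquisition, 500 detail rows per acquisition.
CREATE TABLE acquisition (
    acquisition_id  BIGINT       PRIMARY KEY,
    spectrometer_id VARCHAR(15)  NOT NULL,
    product_id      VARCHAR(50)  NOT NULL,
    measured_at     TIMESTAMP    NOT NULL,
    temperature_c   REAL,
    voltage_v       REAL
    -- ... the remaining ~20 housekeeping measurements
);

CREATE TABLE spectrum_point (
    acquisition_id      BIGINT   NOT NULL REFERENCES acquisition (acquisition_id),
    point_index         SMALLINT NOT NULL,   -- 1..500
    target_wavelength   REAL     NOT NULL,
    measured_wavelength REAL     NOT NULL,
    intensity           REAL     NOT NULL,
    stability           REAL     NOT NULL,
    PRIMARY KEY (acquisition_id, point_index)
);

-- The example scenario (ethanol, >50 °C, Jan-May 2023, intensity at 2005 nm below 6000):
SELECT a.acquisition_id
FROM acquisition a
JOIN spectrum_point p ON p.acquisition_id = a.acquisition_id
WHERE a.product_id = 'Ethanol'
  AND a.temperature_c > 50
  AND a.measured_at >= '2023-01-01' AND a.measured_at < '2023-06-01'
  AND p.target_wavelength = 2005
  AND p.intensity < 6000;
```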
Xav59130
(11 rep)
May 19, 2023, 03:24 PM
• Last activity: Jul 7, 2025, 12:06 AM
5
votes
2
answers
1555
views
Should I snowflake or duplicate it across my facts?
I'm building up a data warehouse to be used by SSAS to create cubes on, and I'm debating between two possible schemas.
In my warehouse, I've got two different fact tables that track daily changes in dollar values. Each entity in these fact tables has an underlying Sales Order and Line to which it relates. These SOs and Lines then have other related dimensions, such as customer, product, etc. About 12 sub-dimensions total so far.
My question is whether I should roll all these sub-dimensions up directly into the fact tables, or use a little snowflaking in my warehouse and have them branch off the Sales Order and Lines dimension instead.
The first option obviously follows a star-schema model better. However, if changes are made, such as adding additional dimensions, it means more maintenance: basically doing the ETL once per fact table rather than just once on the SO dimension. As well, if a new fact is added that relates to Sales Orders, I'd have to go through the whole process again.
As this is my first DW/OLAP project, I'm not familiar on where the line should be drawn on snowflaking, and other people's thoughts would be highly appreciated.
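A compressed illustration of the two layouts being weighed (hypothetical names, not from the original post); it only sketches the key placement, not the full dimension design.

```
-- Option A: pure star - every sub-dimension key is repeated on each fact table
CREATE TABLE FactDailyValue_Star (
    DateKey       INT   NOT NULL,
    SalesOrderKey INT   NOT NULL,
    LineKey       INT   NOT NULL,
    CustomerKey   INT   NOT NULL,   -- duplicated on every fact relating to the line
    ProductKey    INT   NOT NULL,   -- ...and so on for the ~12 sub-dimensions
    DollarValue   MONEY NOT NULL
);

-- Option B: light snowflake - sub-dimension keys live on the line dimension
CREATE TABLE DimSalesOrderLine (
    LineKey       INT NOT NULL PRIMARY KEY,
    SalesOrderKey INT NOT NULL,
    CustomerKey   INT NOT NULL,
    ProductKey    INT NOT NULL
);

CREATE TABLE FactDailyValue_Snowflake (
    DateKey     INT   NOT NULL,
    LineKey     INT   NOT NULL,    -- sub-dimensions reached through DimSalesOrderLine
    DollarValue MONEY NOT NULL
);
```

Option A keeps the fact-to-dimension relationships simple for SSAS at the cost of repeating the key lookups in every fact ETL; option B centralises the lookups but adds a join (or a referenced dimension in the cube) between the facts and the sub-dimensions.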
Evan M.
(231 rep)
Mar 26, 2013, 03:45 PM
• Last activity: Jun 28, 2025, 10:06 PM
0
votes
1
answers
189
views
Handling delayed key info in real time data warehouse
Our data warehouse is currently loaded with a traditional daily ETL batch job, but we're looking to soon implement a few stars in real-time (fed by streaming Kafka messages, FWIW). So only a couple fact tables and dimensions will be in real-time, while the rest will remain loaded daily. I was wondering how others have dealt with the issue of a real-time fact table with FKs into daily batch-loaded dimensions.
The scenario: our business is pushing for our Sales Workflow star schema to be real-time so we can run intraday workflow analytics, but there are keys in the fact table, such as those into customers and sales reps (and several others), whose dimensions will still be populated daily. Without turning the entire DW real-time, what are some best practices for solving the "delayed" key problem?
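One widely used pattern for this (offered here only as a hedged sketch, not necessarily what the poster should adopt) is the late-arriving / inferred dimension member: when a streaming fact references a business key the batch-loaded dimension hasn't seen yet, insert a stub row, take its surrogate key, and let the nightly dimension load flesh the stub out later. The table and column names below are hypothetical.

```
-- Hypothetical stub ("inferred member") handling for a streaming fact row.
-- (A real streaming loader needs concurrency-safe upsert handling around the IF/INSERT.)
DECLARE @CustomerBK  VARCHAR(50) = 'CUST-00042';  -- business key carried on the Kafka message
DECLARE @CustomerKey INT;

IF NOT EXISTS (SELECT 1 FROM dbo.DimCustomer WHERE CustomerBK = @CustomerBK)
    INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName, IsInferred)
    VALUES (@CustomerBK, 'Unknown (late arriving)', 1);

SELECT @CustomerKey = CustomerKey
FROM dbo.DimCustomer
WHERE CustomerBK = @CustomerBK;

-- The daily dimension ETL later UPDATEs rows where IsInferred = 1 with full attributes,
-- so the surrogate key already stamped on the fact never has to change.
```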
romanpilot
(101 rep)
Oct 2, 2017, 05:57 PM
• Last activity: Jun 28, 2025, 07:01 PM
1
votes
2
answers
569
views
MySQL data warehouse calculate difference over years in percentage
I'm trying to create a query to calculate the percentage difference based on a time range (e.g. `WHERE dim_date.year BETWEEN 2009 AND 2011`), spread over months.
My question is how to add a new column that calculates the difference (positive or negative) against the same month of the base year.
I'm aware these are two questions in one, but if the above is not possible, would it be possible to compare a given date range against the current year/month?
Please see the table below for a better overview of the question; this is the desired output (screenshot omitted).
The simple SELECT statement (without COUNT or SUM) is like this:
SELECT
dim_date.year,
dim_date.month,
fact.total_cost
FROM fact
INNER JOIN dim_date
ON fact.date_key = dim_date.date_key
WHERE dim_date.year BETWEEN 2009 AND 2011
For simplicity, please have a look at the star schema (screenshot omitted).
Thank you in advance for any help.
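One possible shape for the year-over-year comparison (a sketch, not from the original post, and assuming `dim_date` exposes `year` and `month` as in the excerpt above): join each month's total to the total of the same month in the base year 2009. On MySQL 8+ a window-function version (e.g. `FIRST_VALUE`) would be an alternative.

```
-- Sketch: monthly totals for 2009-2011 plus the % difference vs. the same month of 2009.
SELECT d.year,
       d.month,
       SUM(f.total_cost) AS total_cost,
       100 * (SUM(f.total_cost) - base.base_cost) / base.base_cost AS pct_diff_vs_2009
FROM fact f
INNER JOIN dim_date d
        ON d.date_key = f.date_key
INNER JOIN (  -- per-month totals of the base year
            SELECT d2.month, SUM(f2.total_cost) AS base_cost
            FROM fact f2
            INNER JOIN dim_date d2 ON d2.date_key = f2.date_key
            WHERE d2.year = 2009
            GROUP BY d2.month
           ) AS base
        ON base.month = d.month
WHERE d.year BETWEEN 2009 AND 2011
GROUP BY d.year, d.month, base.base_cost
ORDER BY d.year, d.month;
```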
Gallex
(113 rep)
Dec 1, 2018, 12:28 PM
• Last activity: Jun 17, 2025, 01:03 AM
0
votes
1
answers
221
views
SSRS and DBA Scope
Just a quick question regarding DBAs: recently I was asked about the scope of the DBA role in terms of SSRS and BI. In your experience, who handles and maintains/configures your SSRS, is this part of the DBA's job or the sysadmin's? And for a data warehouse, do we need a DBA to develop it, or a DW engineer?
Thank you to all who answer.
Mel Cuya
(31 rep)
Oct 16, 2018, 03:33 AM
• Last activity: Jun 8, 2025, 01:05 PM
1
votes
2
answers
228
views
best approach to model relation between customers - dimension dim_customer (SCD2) which can be a HEAD/SUB customer
Hi there, we have the concept of departments: each department can either be its own head or have a head department. (Similar to how each person in an organisation is an employee, but some employees have a head/supervisor, and some act as a supervisor for a certain time, which can also change if employees leave.)
So we have a dim_department, an SCD2 dimension that can slowly change over time, for example if the name changes.
Let's say dep1 is now becoming the head of dep2. Later another department, "dep3", will be opened and will then become the head, so the head of dep2 changes.
What would be the best approach to model that? I guess there are several options/solutions to solve this problem.
First option: add the head SK in dim_department. Every time a head changes, this would lead to a new entry in the dimension: in our case a new entry for dep3 would be created, and a new (slowly changing) entry for dep1, closing the old current_yn row and opening a new one.
Second option: create a bridge table or a factless fact table. Or am I totally wrong here?
A third option would just be to put the head SK in the fact tables.
Another option: a snowflake dimension? Hmm, wondering if that would even work with an SCD2 dimension.
So in a nutshell, it is a relation of this SCD2 table with itself.
What I see as a problem with option 1 is that I first have to handle the HEAD entries, because their attributes could change (e.g. a name change) and generate a new SK, and then I would need to look up that key when handling the sub-departments.
What do you think? Is this total bullshit, or am I totally on the wrong path?
Thx
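A sketch of a variant of option 1 (only an illustration, not from the original post): store the head's durable business key, rather than its surrogate key, on the SCD2 row; the head's current surrogate key is then resolved at query (or fact-load) time, which sidesteps the load-ordering problem described above. Column names are assumptions.

```
-- Hypothetical SCD2 department dimension with a self-reference via the durable key.
CREATE TABLE dim_department (
    department_sk      INT          NOT NULL,  -- surrogate key, new row per SCD2 change
    department_bk      VARCHAR(20)  NOT NULL,  -- durable business key, e.g. 'dep1'
    department_name    VARCHAR(100) NOT NULL,
    head_department_bk VARCHAR(20)  NOT NULL,  -- durable key of the head (its own key if it is its own head)
    valid_from         DATE         NOT NULL,
    valid_to           DATE         NULL,
    current_yn         CHAR(1)      NOT NULL
);

-- Resolving the current head at query time: a self-join on the durable key.
SELECT sub.department_name,
       head.department_name AS head_department_name
FROM dim_department AS sub
JOIN dim_department AS head
  ON head.department_bk = sub.head_department_bk
 AND head.current_yn    = 'Y'
WHERE sub.current_yn = 'Y';
```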
walpino
(11 rep)
Feb 2, 2021, 10:51 AM
• Last activity: Jun 8, 2025, 12:05 AM
0
votes
1
answers
939
views
Periodic snapshot fact table with monthly grain - Question on dimensions
I am fairly new to data-warehousing but have read my fair share of books and online tutorials to gain an introductory understanding of the basic components that build up a data warehouse.
My end goal is to be able to do a headcount of employees at the end of the month, therefore I designed my fact table as a monthly snapshot where each end-of-month I will populate my fact table (one row per employee per month).
My dimension tables include a `DimEmployee` table which is an SCD2 (any change made to an employee's information causes a new row to be introduced in the data source, and I am keeping track of the most recent employee information via an `is_current` flag set to 'Y' or 'N').
My question is: do I need to update the dimensions daily and only insert whatever data I have at end-of-month into the fact table? Or do I update the dimensions end-of-month as well?
Thank you in advance!
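For illustration only (hypothetical table and column names beyond the `DimEmployee`/`is_current` ones mentioned above): if the dimension is maintained daily, or at least brought current before month-end close, the snapshot load simply picks the rows that are current on the snapshot date.

```
-- Hypothetical month-end headcount snapshot insert.
INSERT INTO FactEmployeeMonthlySnapshot (MonthEndDateKey, EmployeeKey, HeadcountFlag)
SELECT 20170430      AS MonthEndDateKey,   -- date key of the month being closed
       e.EmployeeKey,
       1             AS HeadcountFlag
FROM DimEmployee AS e
WHERE e.is_current = 'Y'
  AND e.EmploymentStatus = 'Active';       -- assumed status attribute
```

The trade-off in the question is mostly about freshness: updating `DimEmployee` daily means the correct rows are already in place when the month-end snapshot runs, while updating it only at month-end makes the dimension and fact loads a single job but leaves the dimension stale in between.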
Alistair
(141 rep)
Apr 16, 2017, 08:01 PM
• Last activity: Jun 3, 2025, 12:03 PM
1
votes
2
answers
63
views
Sql Server: Query assistance to retrieve count in two different groupings within a dataset
We have a dataset of "visits & events" with their "frequency"; the setup script and sample data are shown further down.
From this dataset, we want to capture some aggregate metrics.
For example:
1. Number of times we see **ItemValue** (M) for **FacilityID** (1).
- Answer is 3
2. Number of times we see **ItemID** (1) for **FacilityID** (1).
- Answer is 6
3. Number of **UniqueVisitID**'s for a given **ItemValue** (M) & **FacilityID** (1)
- Answer is 3
4. Number of **UniqueVisitID**'s for a given **ItemID** (1) & **FacilityID** (1)
- Answer is 6
5. Number of **UniqueVisitID**'s in the current dataset
- Answer is 21
Further down is the work-in-progress SQL query that answers questions 1 to 3; for questions 4 & 5, I need some mechanism to aggregate the numbers within the same SQL query.
**My progress so far:**
Using a CTE and an extra query I can get answers for bullet points 4 & 5; see the two-step solution at the end.
**Setup script and sample data**
DROP TABLE IF EXISTS dbo.EventMetrics
GO
--
-- Create table with sample dataset
--
CREATE TABLE dbo.EventMetrics
(
EventMetricsID INT NOT NULL IDENTITY(1,1)
, EventDate INT NOT NULL
, FacilityID INT NOT NULL
, UniqueVisitID VARCHAR(25) NOT NULL
, ItemID INT NOT NULL
, ItemValue VARCHAR(10) NOT NULL
, ItemFrequency INT NOT NULL
)
--
-- Insert some sample dataset
--
INSERT INTO dbo.EventMetrics (EventDate, FacilityID, UniqueVisitID, ItemID, ItemValue, ItemFrequency)
VALUES (20250101, 1, '2025-01-01_1_100', 1, 'M', 1)
, (20250101, 1, '2025-01-01_1_101', 1, 'M', 1)
, (20250101, 1, '2025-01-01_1_102', 1, 'M', 1)
, (20250101, 1, '2025-01-01_1_103', 1, 'F', 1)
, (20250101, 1, '2025-01-01_1_104', 1, 'F', 1)
, (20250101, 1, '2025-01-01_1_105', 1, 'F', 1)
, (20250101, 1, '2025-01-01_1_106', 2, 'Boston', 1)
, (20250101, 1, '2025-01-01_1_107', 2, 'Boston', 1)
, (20250101, 1, '2025-01-01_1_108', 2, 'Boston', 1)
, (20250101, 1, '2025-01-01_1_109', 2, 'Tampa', 1)
, (20250101, 1, '2025-01-01_1_110', 2, 'London', 1)
, (20250101, 2, '2025-01-01_2_111', 1, 'M', 1)
, (20250101, 2, '2025-01-01_2_113', 1, 'M', 1)
, (20250101, 2, '2025-01-01_2_114', 1, 'F', 1)
, (20250101, 2, '2025-01-01_2_115', 2, 'Boston', 1)
, (20250101, 2, '2025-01-01_2_116', 2, 'Tampa', 1)
, (20250101, 2, '2025-01-01_2_117', 2, 'Tampa', 1)
, (20250101, 2, '2025-01-01_2_118', 2, 'Miami', 1)
, (20250101, 2, '2025-01-01_2_119', 2, 'Miami', 1)
, (20250101, 2, '2025-01-01_2_120', 2, 'Miami', 1)
, (20250101, 2, '2025-01-01_2_121', 2, 'Miami', 1)
SELECT * FROM dbo.EventMetrics
The sample data looks like this (screenshot omitted).
--
-- WIP query
--
SELECT FacilityID
, ItemID
, ItemValue
, SUM(SUM(ItemFrequency)) OVER (PARTITION BY ItemID, FacilityID) AS [Total_Items_Per_Facility]
, SUM(SUM(ItemFrequency)) OVER (PARTITION BY ItemID, FacilityID, ItemValue) AS [Count_Per_ItemValue_Per_Facility]
, COUNT(DISTINCT UniqueVisitID) AS [Count_Of_UniqueVisits_Per_ItemValue_Per_Facility]
, '??' AS [Count_Of_UniqueVisits_Per_ItemID_Per_Facility]
, '??' AS [Count_Of_UniqueVisits_For_This_Dataset]
FROM dbo.EventMetrics
WHERE EventDate BETWEEN 20250101 AND 20250101
AND FacilityID IN (1, 2)
AND ItemID IN (1, 2)
GROUP BY FacilityID
, ItemID
, ItemValue
ORDER BY FacilityID
, ItemID
, ItemValue
The above T-SQL gives me the below result (screenshot omitted).
--
-- Current two step Solution, with two queries.
--
DECLARE @TotalUniqueVisitIDs INT
SELECT @TotalUniqueVisitIDs = COUNT(DISTINCT UniqueVisitID)
FROM dbo.EventMetrics
WHERE EventDate BETWEEN 20250101 AND 20250101
AND FacilityID IN (1, 2)
AND ItemID IN (1, 2)
;WITH Count_UniqueVisitID_Per_Facility (FacilityID, ItemID, Count_UniqueVisitID_Per_Facility)
AS (
SELECT FacilityID, ItemID, COUNT(DISTINCT UniqueVisitID) AS [Count_UniqueVisitID_Per_Facility]
FROM dbo.EventMetrics
WHERE EventDate BETWEEN 20250101 AND 20250101
AND FacilityID IN (1, 2)
AND ItemID IN (1, 2)
GROUP BY FacilityID, ItemID
)
SELECT
E.FacilityID
, E.ItemID
, E.ItemValue
, SUM(SUM(E.ItemFrequency)) OVER (PARTITION BY E.ItemID, E.FacilityID) AS [Total_Items_Per_Facility]
, SUM(SUM(E.ItemFrequency)) OVER (PARTITION BY E.ItemID, E.FacilityID, ItemValue) AS [Count_Per_ItemValue_Per_Facility]
, COUNT(DISTINCT UniqueVisitID) AS [Count_Of_UniqueVisits_Per_ItemValue_Per_Facility]
, UVF.Count_UniqueVisitID_Per_Facility AS [Count_Of_UniqueVisits_Per_ItemID_Per_Facility]
, @TotalUniqueVisitIDs AS [Count_Of_UniqueVisits_For_This_Dataset]
FROM dbo.EventMetrics AS E
INNER JOIN Count_UniqueVisitID_Per_Facility AS UVF
ON UVF.FacilityID = E.FacilityID
AND UVF.ItemID = E.ItemID
WHERE E.EventDate BETWEEN 20250101 AND 20250101
AND E.FacilityID IN (1, 2)
AND E.ItemID IN (1, 2)
GROUP BY E.FacilityID
, E.ItemID
, E.ItemValue
, UVF.Count_UniqueVisitID_Per_Facility
ORDER BY E.FacilityID
, E.ItemID
, E.ItemValue
**Question**
How can I get one query to return metrics for all 5?
**Context:**
1. This is a data warehouse table with billions of records
2. Clustered Columnstore index exists
3. There are more metrics generated as part of this query; for clarity, only the pertinent parts are included.
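One way to fold the two missing distinct counts into a single statement (a sketch only, not validated against the real billion-row table): since `COUNT(DISTINCT ...)` is not allowed as a window function, pre-compute the distinct visit counts per row in a CTE with the ascending-plus-descending `DENSE_RANK` trick, then carry them through the existing `GROUP BY`.

```
;WITH Base AS (
    SELECT FacilityID, ItemID, ItemValue, UniqueVisitID, ItemFrequency
         -- distinct visits per (FacilityID, ItemID): DENSE_RANK up + DENSE_RANK down - 1
         , DENSE_RANK() OVER (PARTITION BY FacilityID, ItemID ORDER BY UniqueVisitID)
           + DENSE_RANK() OVER (PARTITION BY FacilityID, ItemID ORDER BY UniqueVisitID DESC)
           - 1 AS VisitsPerItemPerFacility
         -- distinct visits across the whole filtered dataset
         , DENSE_RANK() OVER (ORDER BY UniqueVisitID)
           + DENSE_RANK() OVER (ORDER BY UniqueVisitID DESC)
           - 1 AS VisitsInDataset
    FROM dbo.EventMetrics
    WHERE EventDate BETWEEN 20250101 AND 20250101
      AND FacilityID IN (1, 2)
      AND ItemID IN (1, 2)
)
SELECT FacilityID
     , ItemID
     , ItemValue
     , SUM(SUM(ItemFrequency)) OVER (PARTITION BY ItemID, FacilityID)            AS [Total_Items_Per_Facility]
     , SUM(SUM(ItemFrequency)) OVER (PARTITION BY ItemID, FacilityID, ItemValue) AS [Count_Per_ItemValue_Per_Facility]
     , COUNT(DISTINCT UniqueVisitID)                                             AS [Count_Of_UniqueVisits_Per_ItemValue_Per_Facility]
     , MAX(VisitsPerItemPerFacility)                                             AS [Count_Of_UniqueVisits_Per_ItemID_Per_Facility]
     , MAX(VisitsInDataset)                                                      AS [Count_Of_UniqueVisits_For_This_Dataset]
FROM Base
GROUP BY FacilityID, ItemID, ItemValue
ORDER BY FacilityID, ItemID, ItemValue;
```

Every row in a partition gets the same `DENSE_RANK`-based value (it equals the number of distinct `UniqueVisitID`s in that partition), so `MAX()` inside the `GROUP BY` simply carries it through. Whether the extra sorts are acceptable on a columnstore of that size would need testing.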
ToC
(727 rep)
May 29, 2025, 07:35 PM
• Last activity: May 30, 2025, 09:41 AM
2
votes
1
answers
264
views
Securing Functional/Service ID for MSSQL with Windows Authentication
I'm currently assessing a data warehouse implementation in our environment, based on these requirements:
1. A functional/service ID to be set up to access multiple DB servers of other applications in the company for replication purposes
2. The password of the service ID cannot be changed and won't expire
Based on some research, I learned that Windows authentication is generally recommended as it is more secure (here and here).
Exploring Windows authentication further, I noticed the guides always link it to Kerberos authentication, which involves a domain controller and an AD account (link). That makes ID management easier, but since this is a functional ID, I'm not sure it is the best approach. There is also an article here that suggests using a local account instead of a domain account when possible.
Question:
1. Since it is a service ID that will connect to many applications, should I really avoid a domain account? My concern is that the password never expires and the account has access to most of the DBs, so the risk is higher if this single account is compromised.
2. Does setting up a login using a local Windows account on each DB server provide the same advantages as a domain account (e.g. use of the Kerberos protocol)?
3. Is creating multiple local Windows accounts the recommended approach given the above requirements? If there are any resources on this, please share them so I can study further on my own.
I hope my questions are specific rather than opinion-based/open-ended, and I appreciate any input and sharing.
nlks
(21 rep)
Dec 29, 2022, 02:44 AM
• Last activity: May 22, 2025, 06:01 AM
23
votes
2
answers
42606
views
What are Measures and Dimensions in Cubes
I'm new to Microsoft SQL Server Business Intelligence and `Analysis Services` (but I've been programming for years with SQL Server). Can anyone describe Measures and Dimensions in Cubes in simple words (with images, if possible)?
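A very small relational illustration (hypothetical tables, not from the original post): in a cube, measures are the numeric values you aggregate, and dimensions are the descriptive tables you slice them by.

```
-- Dimensions: descriptive attributes you group and filter by.
CREATE TABLE DimProduct (ProductKey INT PRIMARY KEY, ProductName NVARCHAR(50), Category NVARCHAR(50));
CREATE TABLE DimDate    (DateKey INT PRIMARY KEY, CalendarDate DATE, [Year] INT, [Month] INT);

-- Fact: the measures (Quantity, SalesAmount) plus keys to the dimensions.
CREATE TABLE FactSales (
    DateKey     INT   NOT NULL REFERENCES DimDate (DateKey),
    ProductKey  INT   NOT NULL REFERENCES DimProduct (ProductKey),
    Quantity    INT   NOT NULL,       -- measure
    SalesAmount MONEY NOT NULL        -- measure
);

-- "Total sales by category and year" = a measure aggregated across two dimensions,
-- which is what browsing a cube does for you.
SELECT p.Category, d.[Year], SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
JOIN DimDate    AS d ON d.DateKey    = f.DateKey
GROUP BY p.Category, d.[Year];
```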
DooDoo
(203 rep)
Jul 3, 2013, 11:01 AM
• Last activity: May 5, 2025, 04:22 AM
0
votes
1
answers
1015
views
Detecting deleted rows in mysql
I have a Spring Boot webapp with a MySQL database. I'm trying to replicate this data into an MSSQL database for running reports against.
All rows are audited with created_at and modified_at datetimes, so I can easily fetch new and changed rows. The problem is that when a row is deleted, this will not be reflected in the target database.
Off the top of my head, there are a couple of ways to solve this:
- Triggers and a "history" table. Whenever there's an update to the table, it inserts a copy of the row and a column describing the action taken.
- A deleted bit, where 1 marks a deleted row (this would have me redo all queries though, and now I see the advantage of having a class in between the controller and the repository)
- And the very verbose method of querying all rows and checking what's missing (will not do)
Is there a *right way* of handling this?
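A minimal sketch of the first option (trigger plus history table), assuming a hypothetical `orders` table with an `id` primary key; the real table and column names would differ.

```
-- Hypothetical audit table recording deletions.
CREATE TABLE orders_deleted (
  order_id   BIGINT   NOT NULL,
  deleted_at DATETIME NOT NULL,
  PRIMARY KEY (order_id, deleted_at)
);

-- MySQL trigger: log the key of every deleted row.
CREATE TRIGGER orders_after_delete
AFTER DELETE ON orders
FOR EACH ROW
  INSERT INTO orders_deleted (order_id, deleted_at)
  VALUES (OLD.id, NOW());
```

The replication job can then delete (or flag) the matching rows in MSSQL for every `orders_deleted` entry newer than the last sync, the same way it already uses `modified_at` for updates.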
jared
(123 rep)
Jun 3, 2023, 01:55 PM
• Last activity: Apr 17, 2025, 04:10 AM
4
votes
1
answers
642
views
Revision Tracking & Source Control for Azure SQL Data Warehouse
What is a good approach for tracking incremental changes to database tables, stored procedures, etc. for Azure SQL Data Warehouse?
I am in the process of moving a large database over to Azure SQL Data Warehouse. The prior approach for change tracking was using a 'Database Project' in Visual Studio 2015. This allows easy source control integration with TFS or Git or whatever. When you want to publish, you just target the destination database and it generates a change script.
This functionality does not at all work for Azure SQL Data Warehouse. Visual Studio (and the latest SSDT) simply can't target SQL DW. This means the process of publishing is extremely tedious, entirely manual and extremely error prone.
Is there another comparable approach you are using for this type of project?
John Hargrove
(149 rep)
Oct 20, 2017, 05:10 AM
• Last activity: Apr 10, 2025, 01:08 PM
3
votes
3
answers
5928
views
Should dates in dimensional tables use dimDate?
Assuming my dimDate has a surrogate key. Should all the date columns in the *dimensional tables* (not the fact tables) store the surrogate key of date dimension? Or just plain date?
For example, in dimensional table dimCustomer, there may be birthday, join date, graduation date, .... Etc.
u23432534
(1565 rep)
Jun 5, 2017, 07:54 PM
• Last activity: Apr 5, 2025, 11:08 PM
3
votes
4
answers
7843
views
OBT (One Big Table) vs Star Schema for Data Warehousing
I am trying to prepare myself for some interviews at a FAANG, while coming from a much smaller company. Part of the interview process involves data modeling and ETL design. Kimball's "The Data Warehouse Toolkit" describes a dimensional model that represents a company, like Target or Walmart, retail sales and its associated dimensions. However, Kimball seems to suggest that although dimensions (date, store, customer, product, etc) are highly denormalized, fact tables are normalized with FKs to their related dimensions.
I have heard of companies like mine using OBT, or the "One Big Table" approach, to make analysts' lives easier and increase query performance by removing the need for JOINs. Also, with storage being so cheap these days and most modern DWs (Redshift, BigQuery, etc.) using a columnar architecture, I would think we could safely set aside concerns about the extra redundancy (of having the same data appear in two tables), but is it possible that, when architecting for huge applications, that redundancy becomes an issue? What is the correct way of designing a data warehouse for each use case?
ohbrobig
(131 rep)
Mar 29, 2022, 07:25 PM
• Last activity: Mar 10, 2025, 04:56 PM
1
votes
1
answers
426
views
Robust way to build a DWH without downtime
I have a process that builds a data warehouse using Postgres. The process runs every day at midnight and works in three steps:
* Copy the production database (prod) to a temporary database (dwh_tmp) using pg_dump
* Generate the fact and dimension tables in the temporary schema using the data from the production database
* Rename the current DWH database (dwh -> dwh_old) and move in the temporary database instead (dwh_tmp -> dwh) using `ALTER DATABASE dwh_tmp RENAME TO dwh`
The problem is that the last step fails if there are still open connections to the dwh database. To get rid of these connections (e.g. using `pg_terminate_backend()`) I would need to run the DWH generation process with super-user privileges, which I'd like to avoid for security reasons.
Any suggestions on how to implement this in a more robust way? My main goal is to make sure that the DWH is always in a good state and the downtime is as small as possible, hence my idea of first building a temporary DWH and then renaming it. I'm not in direct control of the connecting processes that keep me from renaming the database (as they're opened from a different host).
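One hedged possibility on PostgreSQL 9.6 or later (a sketch under assumptions: the build runs as a role named dwh_builder that owns both databases and connects to a maintenance database, not to dwh itself): a superuser grants the built-in `pg_signal_backend` role once, after which the build can terminate the old reporting connections without being a superuser (it still cannot touch superuser-owned backends).

```
-- One-time setup, run by a superuser:
GRANT pg_signal_backend TO dwh_builder;

-- In the nightly build, just before the swap (run while connected to e.g. the postgres DB):
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'dwh'
  AND pid <> pg_backend_pid();

ALTER DATABASE dwh     RENAME TO dwh_old;
ALTER DATABASE dwh_tmp RENAME TO dwh;
```

There is still a small window in which a new connection can sneak in between the terminate and the rename, so the build may need to retry (or temporarily revoke CONNECT on dwh) to be fully robust.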
ThePhysicist
(641 rep)
Jun 10, 2016, 10:06 AM
• Last activity: Mar 8, 2025, 09:04 AM
1
votes
2
answers
696
views
Dimension for Product Properties based on Product Type - Sparse Dimension?
I am creating a dimension of product properties for sales facts.
The properties of a product depend on its product type. For example:
- Type = smartphone. Properties = model, OS, size
- Type = book. Properties = author, title
How should the dimension be modeled in this case?
Should I create one dimension which contains ALL properties?
In that case the dimension content will be sparse; there will be many null values.
|----------------------------------------------------|
| DimKey | Type | Model | OS | Size | AUTHOR | TITLE |
Or should I create a dimension per type?
In that case the sales fact will have many FKs.
|-------------------------------------------------------------|
| FactKey | Quantity | Total | Book_FK | Smartphone_FK | .... |
Is there any other way to do this?
rendybjunior
(259 rep)
Nov 4, 2014, 12:34 PM
• Last activity: Mar 3, 2025, 10:02 PM
0
votes
1
answers
2587
views
How to Store a Many-to-Many Relationship Between Fact Tables in a Data Warehouse
What are some structures and models to store many-to-many relational data between two fact tables in a data warehouse? Currently, I am using a mapping table which includes the primary keys from both tables, but I am wondering if there is a better approach?
My specific use case is relating invoices issued to payments received. I have an invoice fact table (each record designated with a unique InvoiceId) and a payment fact table (each record is designated with a unique PaymentId). My mapping table has columns for the InvoiceId and PaymentId as well as other useful information, but InvoiceId and PaymentId are not necessarily unique in their respective columns (because a payment can apply to multiple invoices and an invoice can be paid with multiple payments).
Thanks in advance for any advice and suggestions!
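A sketch of one common refinement of the mapping table the poster already has (hypothetical names, not presented as the definitive design): keep the bridge, but treat it as an explicit allocation fact that carries the applied amount, so the many-to-many relationship is also additive.

```
-- Hypothetical allocation table sitting between the two fact tables.
CREATE TABLE FactInvoicePaymentAllocation (
    InvoiceId         INT           NOT NULL,  -- FK to the invoice fact
    PaymentId         INT           NOT NULL,  -- FK to the payment fact
    AllocatedAmount   DECIMAL(18,2) NOT NULL,  -- portion of the payment applied to this invoice
    AllocationDateKey INT           NOT NULL,
    PRIMARY KEY (InvoiceId, PaymentId)
);

-- SUM(AllocatedAmount) by InvoiceId gives "paid so far" per invoice;
-- SUM(AllocatedAmount) by PaymentId gives "applied so far" per payment.
```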
Edward
(1 rep)
Jun 26, 2020, 11:20 PM
• Last activity: Feb 22, 2025, 04:08 AM
25
votes
3
answers
13961
views
What are the arguments in favor of using ELT process over ETL?
I realized that my company uses an ELT (extract-load-transform) process instead of using an ETL (extract-transform-load) process.
What are the differences in the two approaches and in which situations would one be "better" than the other? It would be great if you could provide some examples.
HelloWorld1
(797 rep)
Jun 14, 2012, 08:24 AM
• Last activity: Jan 9, 2025, 12:26 AM