Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0 votes · 1 answer · 939 views
Periodic snapshot fact table with monthly grain - Question on dimensions
I am fairly new to data-warehousing but have read my fair share of books and online tutorials to gain an introductory understanding of the basic components that build up a data warehouse.
My end goal is to be able to do a headcount of employees at the end of each month, so I designed my fact table as a monthly snapshot: at each month-end I populate the fact table with one row per employee per month.
My dimensional model includes a DimEmployee table, which is an SCD2 (any change made to an employee's information causes a new row to be introduced in the data source, and I keep track of the most recent employee information via an is_current flag set to 'Y' or 'N').
My question is: do I need to update the dimensions daily and only insert whatever data I have at end-of-month into the fact table? Or do I update the dimensions end-of-month as well?
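For reference, here is a sketch of the end-of-month load I have in mind (SQL Server flavored; the fact table name and columns are illustrative):
INSERT INTO FactMonthlyHeadcount (snapshot_date, employee_key)
SELECT EOMONTH(GETDATE()) AS snapshot_date,   -- the month-end being snapshotted
       e.employee_key
FROM   DimEmployee AS e
WHERE  e.is_current = 'Y';                    -- one row per current employee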
Thank you in advance!
Alistair
(141 rep)
Apr 16, 2017, 08:01 PM
• Last activity: Jun 3, 2025, 12:03 PM
0 votes · 0 answers · 15 views
Have I correctly converted this DB table into dimension tables for a star schema?
I'm trying to replace a pretty complicated query for a dashboard. We currently track practices, phone numbers, and start/end dates partially in an Excel spreadsheet and partially in a database. Whenever we want to update any of the info, like which practice uses which number during which time period, it is a very involved process that includes checking numbers and practices in Excel to make sure that nothing is duplicated, etc. If we want to change something so that practices share a line, we have to go into the DB and combine the names, etc. I think the entire process could be simplified if we used a star schema instead of the system that we currently have.
We currently have a database table with the following schema
| Group Practice | Practice | TransferType | DialedNumber | DigitsDialed | StartDate | EndDate |
I think the table should be broken out into the following dimensions:
* Practice
| PK | Group Practice | Practice |
* Phone Number
| PK | DialedNumber | DigitsDialed |
* TransferType
| PK | TransferType |
* Calendar Table
There would be a fact table at the center that would track "Transfer Volume by Practice".
A phone number can have one or more transfer types for one or more practices simultaneously.
For example, (123)456-7890 could be transfertype1 for practice A from Jan 1, 2024 - Feb 1, 2024. It could then be used as transfertype2 for practice B from Feb 2, 2024 - Mar 1, 2024.
Or Number (234)567-8901 could be transfertype 2 for practice C from Jan 1, 2024 - Feb 1, 2024 and it could simultaneously be transfertype 3 for practice D from Jan 1, 2024 - Mar 1, 2024.
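To make this concrete, here's a rough sketch of the DDL I'm picturing (SQL Server flavored; all names are illustrative):
CREATE TABLE DimPractice (
    PracticeKey   int IDENTITY PRIMARY KEY,
    GroupPractice varchar(100),
    Practice      varchar(100)
);

CREATE TABLE DimPhoneNumber (
    PhoneNumberKey int IDENTITY PRIMARY KEY,
    DialedNumber   varchar(20),
    DigitsDialed   varchar(20)
);

CREATE TABLE DimTransferType (
    TransferTypeKey int IDENTITY PRIMARY KEY,
    TransferType    varchar(50)
);

CREATE TABLE FactTransferVolume (
    DateKey         int,   -- points at the calendar dimension
    PracticeKey     int,
    PhoneNumberKey  int,
    TransferTypeKey int,
    TransferCount   int
);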
I've read textbook definitions of how to set up a star schema but I want to make sure that I'm doing this correctly. Do my dimensions make sense? Would this capture the "slowly changing dimensions" in the data?
j.jerrod.taylor
(101 rep)
Oct 18, 2024, 05:11 PM
0 votes · 1 answer · 71 views
Point-in-Time-Recovery of Postgres at an arbitrary time for audit purposes
TL;DR
Would you recommend using CDC or pg_wal-based recovery to reproduce the state of the database at an arbitrary time, or sticking with joining SCD2 tables (with start and end dates for every record) and querying all data before a given date?
We are developing a new microservice to support our application. For audit purposes our calculations need to be reproducible, so the current plan is to use SCD2, with every record having a start_date and an end_date. This way the state of the database at a given time can be reproduced if a calculation from the past ever needs to be redone. Joining and maintaining versioned SCD2 tables seems cumbersome, but that is the current plan.
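For what it's worth, the point-in-time lookup against an SCD2 table reduces to one predicate per table (:as_of is a pseudo-parameter; the table name is illustrative):
SELECT *
FROM   some_versioned_table
WHERE  start_date <= :as_of
  AND  (end_date > :as_of OR end_date IS NULL);  -- the row version in effect at :as_of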
Alternatively, would it work to have the database store only the current state (updates actually update a record rather than marking it obsolete and inserting a new version), archive the replica-level WAL files with sufficiently fine granularity, and rebuild the state on demand?
Or keep an audit table and rebuild from that?
The scale/magnitude is around 100k records per year.
Any thoughts welcome.
P.S. we don't have a DBA on the team.
Mr Arkadin
(1 rep)
May 3, 2024, 04:33 PM
• Last activity: May 3, 2024, 08:04 PM
6 votes · 1 answer · 410 views
Designing a dimensional DB off a normalized source that already implements SCD's
I have built an SSIS ETL to integrate various data sources (one from MySQL, two from SQL Server) into a single SQL Server relational and normalized database, which I've called [NDS].
The SSIS ETL handles type-2 updates, and so the [NDS] generates surrogate keys and SCD tables include an [_EffectiveFrom] timestamp and a nullable [_EffectiveTo] column, and there are constraints for the natural keys and beautiful foreign keys linking all the data together.
Now I wanted to build an SSAS dimensional database off of it, and it didn't take too long before I realized I was setting myself up for a snowflake schema.
So I'm thinking of adding a new [DDS] (relational) database, to create the *actual* dimension and fact tables that will feed the DSV's for the SSAS db.
This [DDS] database would be as denormalized as humanly possible, so as to "flatten" the facts and dimensions (like, [OrderHeaders]+[OrderDetails] into an [Orders] fact table, and [CustomerStores]+[Customers]+[SalesReps] into some [Customers] dimension table) - doing this should not only make it easier for me to build the dimension hierarchies in SSAS, it should also make it easier to come up with an actual star schema.
I have a few questions though:
- Can I reuse a subset of my existing surrogate keys? I'm thinking to take the existing key for the most granular level and make that the dimension key. Is that a good approach, or should I just ignore the [NDS] surrogate keys and make the [DDS] (relational db) generate a new set of surrogate keys?
- How to handle SCD's? For example, "Materials" and "Suppliers" will generate new records in [NDS] when some specific fields change in the source system... I think I'll have to design the SSIS ETL to only load the "last image" records into the [DDS] db, and then re-implement the type-2 updates in that process, i.e. treat the [NDS] as a "source system" that keeps history, while duplicating everything in this [DDS] database. But then, why would I need to keep history in the [NDS] *and* the [DDS]? Clearly something's not right.
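For what it's worth, the "last image" extract from the [NDS] would presumably just filter on the open-ended rows (the schema and table names are from my model, e.g. Materials):

SELECT *
FROM   [NDS].dbo.Materials
WHERE  [_EffectiveTo] IS NULL;   -- the current version of each material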
Am I setting myself up for a Big Mess™, or am I on the right track?

Mathieu Guindon
(914 rep)
Dec 1, 2015, 06:31 PM
• Last activity: Mar 26, 2024, 05:28 PM
0 votes · 0 answers · 64 views
SQL Query group by to find EffectiveFrom and EffectiveTo Dates for repeating key value pair
I have a table with a snapshot of repeated data for a key and value pair with record_timestamps. The dataset can also contain late-arriving historical record_timestamps.
I want to generate an output with an EffectiveFrom and EffectiveTo timestamp for each combination of key and value, where the validity closes when the value changes and a new record is created for a new combination.
Note: The same values can reappear after an interval with a different record_timestamp.
**Snapshot of repeated data**
**Expected Outcome**
I have tried using LEAD and LAG, partitioned by the keys and ordered by the timestamp, to find the previous and next value for each record and use that to group sets. However, this does not work; I am missing something, since the same value can reappear after an interval and those occurrences should not be grouped together.
create table equipments_staging
(
    equipment string,
    item      string,
    value     string,
    timestamp timestamp
);
insert into equipments_staging
values
('M','X','1','2023-11-10 13:00')
,('M','X','2','2023-11-11 13:00')
,('M','X','2','2023-11-12 13:00')
,('M','X','1','2023-11-13 13:00')
,('M','X','1','2023-11-14 13:00')
,('M','X','1','2023-11-15 13:00')
,('M','X','2','2023-11-16 13:00')
,('M','X','2','2023-11-17 13:00')
,('M','X','1','2023-11-18 13:00');


Select equipment, item, value,
    LAG(value,1)  OVER (PARTITION BY equipment, item ORDER BY timestamp) AS prev,
    LEAD(value,1) OVER (PARTITION BY equipment, item ORDER BY timestamp) AS next,
    coalesce(LEAD(timestamp,1) OVER (PARTITION BY equipment, item ORDER BY timestamp), '9999-12-31') AS next_timestamp,
    timestamp
from equipments_staging;
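For reference, the gaps-and-islands pattern I've been trying to get to would start a new run whenever the value differs from the previous row, then aggregate per run - something like this (untested sketch):

WITH changes AS (
    SELECT equipment, item, value, timestamp,
           CASE WHEN value = LAG(value) OVER (PARTITION BY equipment, item ORDER BY timestamp)
                THEN 0 ELSE 1 END AS is_change   -- 1 starts a new run (also for the first row)
    FROM equipments_staging
),
runs AS (
    SELECT *,
           SUM(is_change) OVER (PARTITION BY equipment, item ORDER BY timestamp) AS run_id
    FROM changes
)
SELECT equipment, item, value,
       MIN(timestamp) AS effective_from,
       COALESCE(LEAD(MIN(timestamp)) OVER (PARTITION BY equipment, item ORDER BY MIN(timestamp)),
                '9999-12-31') AS effective_to
FROM runs
GROUP BY equipment, item, value, run_id;

Because run_id keeps increasing, a value that reappears later lands in a new run instead of being merged with its earlier one - but I'm not sure this handles the late-arriving timestamps correctly.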
user137279
(1 rep)
Feb 19, 2024, 08:26 PM
• Last activity: Feb 19, 2024, 09:21 PM
0 votes · 0 answers · 105 views
How do I create a history table that allows me to retrieve the times when a row was active, by year & month?
I want to be able to retrieve, from a historical table, the in-between months during which a client was active.
Table: DimensionEpisode
| episode_id | client_id | episode_status | discharge_date | admission_date | program_id | length_of_stay | subprogram | ...other attributes |
|------------|-----------|----------------|----------------------|----------------------|------------|----------------|------------|--------------------|
| 63 | 61362 | CLOSED | 2018-09-06T00:00:00 | 2014-10-01T00:00:00 | 72 | 1436 | Union PC | ... |
| 64 | 61343 | OPEN | 2019-03-15T00:00:00 | 2015-05-20T00:00:00 | 80 | 1000 | Rehab | ... |
| 65 | 61344 | OPEN | 2020-01-10T00:00:00 | 2018-11-12T00:00:00 | 65 | 750 | Psychiatry | ... |
| 66 | 61345 | CLOSED | 2021-07-02T00:00:00 | 2019-09-28T00:00:00 | 72 | 1250 | Union PC | ... |
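What I'm picturing is joining to a month calendar so each episode expands into one row per month it was active - roughly like this (a hypothetical DimMonth table with month_start/month_end columns; SQL dialect unspecified):

SELECT e.client_id,
       e.episode_id,
       m.month_start
FROM   DimensionEpisode AS e
JOIN   DimMonth AS m
       ON m.month_end   >= e.admission_date   -- month overlaps the episode's
      AND m.month_start <= e.discharge_date   -- active interval
ORDER  BY e.client_id, m.month_start;

But I'm not sure whether to materialize this into the history table or compute it at query time like this.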
Sam Johnson
(21 rep)
Dec 6, 2023, 03:33 PM
7 votes · 6 answers · 1965 views
Are duplicated NVARCHAR values stored as copies in SQL Server?
I'm designing a table that will contain a lot of rows, so I need to be careful not to store too much information. One of the columns is an NVARCHAR(MAX) column and it contains the address of our customers. As addresses do not change often, this column will contain many repeated values and thus quite some redundancy.
So I was wondering if I need to normalize this myself by maintaining some sort of look-up table of address strings (note that if an address changes I need to maintain history, so it's not a matter of the usual normalization), or whether SQL Server points repeated strings at the same stored value behind the scenes. Or maybe it offers a column option to do so. Another approach that came to mind is to use COMPRESS, but I guess this does not make sense as the data itself (i.e. the address) is not long.
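The look-up table approach I'm considering would look something like this (illustrative only; note an NVARCHAR(MAX) column can't be indexed for uniqueness, so a persisted hash column would probably be needed to de-duplicate):

CREATE TABLE dbo.Address (
    AddressId   int IDENTITY PRIMARY KEY,
    AddressText nvarchar(max) NOT NULL
);

CREATE TABLE dbo.CustomerAddressHistory (
    CustomerId int NOT NULL,
    AddressId  int NOT NULL REFERENCES dbo.Address (AddressId),
    ValidFrom  datetime2 NOT NULL,
    ValidTo    datetime2 NULL   -- NULL = current address
);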
Reading/writing performance is not so much of a concern as the data will be accumulated over time.
Dejan
(175 rep)
Nov 18, 2020, 09:49 AM
• Last activity: Jun 26, 2023, 02:34 PM
1 vote · 1 answer · 865 views
Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting
I have a fact table and an scd-type-2 dimension table. I want to produce sales report by region and year.
I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little warning at the bottom:
Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint.
and points to this link.
**Question:** Is there a more optimal way to query when I'm joining using a between condition (instead of an = condition)?
fact table:
create table sales
(name string
,sale_date date
,sold_amt long);
insert into sales values
('John','2022-02-02',100),
('John','2022-03-03',100),
('John','2023-02-02',200),
('John','2023-03-03',200),
('Rick','2022-02-02',300),
('Rick','2023-02-02',400);
dimension table (scd-type-2)
create table employee_scd2
(name string
,region string
,start_date date
,end_date date
,is_current boolean); -- unused, kept for completeness
insert into employee_scd2 values
('John','NAM', '2010-01-01', '2022-12-31', false),
-- John transferred from NAM to APAC starting 2023
('John','APAC', '2023-01-01', '9999-01-01', true),
('Rick','NAM', '2020-01-01', '9999-12-31', true);
sales report by region and year
select e.region,
year(s.sale_date) as sale_year,
SUM(s.sold_amt) as sale_amt
from sales s
left join employee_scd2 e
on e.name = s.name
and s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);
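If I understand the linked docs correctly, the only change would be adding a range join hint with a bin size, something like this (the 60-day bin is just a guess):

select /*+ RANGE_JOIN(e, 60) */
       e.region,
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt) as sale_amt
from sales s
left join employee_scd2 e
    on e.name = s.name
    and s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);

But is there a fundamentally better join shape than the between, or is the hint the best one can do?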
---
I've read following and some more:
* https://dba.stackexchange.com/questions/184639/join-fact-table-to-scd-type-2-how-to-write-query-sql-server
* https://www.startdataengineering.com/post/how-to-join-fact-scd2-tables/
* How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark
Kashyap
(145 rep)
Mar 7, 2023, 04:49 PM
• Last activity: Jun 7, 2023, 05:43 PM
0 votes · 1 answer · 849 views
Implementing revisioning logic with SCD2/4
I posted this question on StackOverflow, but it didn't get much attention, so I'm hoping I'll have better luck here.
I'm working on a project that is a sort of bill-of-materials manager that needs to support revisioning of components. After some research, I've stumbled upon slowly changing dimensions type 2 and type 4, which seem like the most appropriate patterns for this use case.
However, I'm a little bit confused about a couple of concepts:
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE components (
    id INTEGER PRIMARY KEY,
    name TEXT,
    project_id INTEGER,
    CONSTRAINT fk_project FOREIGN KEY (project_id) REFERENCES projects (id)
);
If I wanted to implement SCD Type 2, would the components table get the additional columns start_date, end_date, and active? Or would Type 2 mean adding a revisions table with the same structure as components plus the additional columns, and Type 4 mean having both a revisions table and a revisions_history table?
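To make the first option concrete, my guess at the Type 2 version of components would be (the extra columns are my assumption):

CREATE TABLE components (
    id INTEGER PRIMARY KEY,
    component_id INTEGER,   -- stable business key shared by all revisions
    name TEXT,
    project_id INTEGER,
    start_date TIMESTAMP,   -- when this revision became effective
    end_date TIMESTAMP,     -- NULL while the revision is active
    active BOOLEAN,
    CONSTRAINT fk_project FOREIGN KEY (project_id) REFERENCES projects (id)
);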
Any help would be greatly appreciated!
user1960118
(143 rep)
Sep 5, 2021, 01:58 PM
• Last activity: Apr 15, 2023, 08:01 AM
0 votes · 1 answer · 211 views
Converting EAVT table into SCD type 2
After a lot of research and picking people's brains, I'm still unable to find a good/clean solution for converting an entity-attribute-value-timestamp table to an SCD type 2 dimension.
Here's the issue:
I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to).
So for every object (company, product, etc.) I have a table with the current state in a relational model, and another history table that contains all value changes to all attributes, with valid_from/valid_to columns for the validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example:
Company has two tables:
- **Current state of the Companies:**
| company_id | name | number_of_employees | city |
| -------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | 500 | Paris |
| 2 | Company 2 | 500 | Paris |
- **History Table:**
| company_id | attribute | value | valid_from | valid_to |
| -------- | -------------- | -------------- | -------------- | -------------- |
| 1 | city | New York | 01/01/2020 | 01/05/2022 |
| 1 | city | Paris | 01/05/2022 | 12/31/9999 |
| 1 | number_of_employees | 50 | 01/01/2021 | 01/01/2022 |
| 1 | number_of_employees | 100 | 01/01/2022 | 12/31/9999 |
What I want to have as a result is the following:
| company_id | name | city | number_of_employees | valid_from | valid_to | is_active |
| -------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | New York | null | 01/01/2020 | 01/01/2021 | false |
| 1 | Company 1 | New York | 50 | 01/01/2021 | 01/01/2022 | false |
| 1 | Company 1 | New York | 100 | 01/01/2022 | 01/05/2022 | false |
| 1 | Company 1 | Paris | 100 | 01/05/2022 | 12/31/9999 | true |
So based on this example, we have a company that started on 01/01/2020 with New York as its city; the number of employees wasn't populated at that time.
We then modified the company on 01/01/2021 to set the number of employees to 50.
We modified it again on 01/01/2022 to change the number of employees to 100, and finally changed the city from New York to Paris on 01/05/2022.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows.
The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.
To add more complexity to the task, imagine we have about 120 attributes; also, if a company has never been changed (just created and still in its creation state), it won't exist in the history table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the current-state table into the SCD (a union between the current table and the history result table). Fun, right! :)
To give you a sense of the technical environment: the CRM is HubSpot, data is replicated from HubSpot to BigQuery, and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV-model tables, but I'm stuck at the calculation of the valid_from/valid_to in the resulting SCD. (Example of using pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD).
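The only generic shape I can think of is a "date spine": take every distinct valid_from per company as a snapshot date, pick each attribute's value effective at that date, then pivot. A theoretical sketch (BigQuery-flavored, untested; "history" stands for the history table above, and the PIVOT list would have to be generated for all ~120 attributes):

WITH spine AS (
    SELECT DISTINCT company_id, valid_from AS snapshot_date
    FROM history
),
state_at AS (
    SELECT s.company_id, s.snapshot_date, h.attribute, h.value
    FROM spine AS s
    JOIN history AS h
      ON h.company_id = s.company_id
     AND h.valid_from <= s.snapshot_date
     AND h.valid_to   >  s.snapshot_date
)
SELECT *
FROM state_at
PIVOT (MAX(value) FOR attribute IN ('city', 'number_of_employees'));
-- valid_from = snapshot_date; valid_to = LEAD(snapshot_date) per company,
-- defaulting to 12/31/9999 for the last row

But I don't know if this scales, or whether there is a cleaner pattern.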
So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard-coding and performance issues? And how to calculate the valid_from/valid_to dynamically?
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or just plain opinions are highly appreciated, as this is the last step in the adoption of a whole data culture in the company I work for, and if I can't make this work, well... my credibility will take a hit! So please help a fellow lost IT professional! :D
Rami Ouanes
(1 rep)
May 4, 2022, 02:07 PM
• Last activity: Dec 10, 2022, 06:04 AM
5 votes · 4 answers · 1015 views
Consolidating a row of data, based on previous rows
I'm trying to build a history table out of an audit log (ultimately to build out a type 2 dimension table). Unfortunately, the audit log only records the specific fields being changed. Here's a rough example of what I'm talking about:
CREATE TABLE Staff(
[ID] int,
[Surname] varchar(5),
[FirstName] varchar(4),
[Office] varchar(9),
[Date] varchar(10)
);
INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
(001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
(001, NULL, NULL, 'Sydney', '2015-03-01'),
(002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
(002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
(002, NULL, NULL, 'Sydney', '2015-01-01'),
(002, NULL, NULL, 'Perth', '2015-03-01');
The first entry for a particular staff member is from when their record was created, and each subsequent record is an update... but it only shows the field that was updated*. I want to "fill out" each update row with the rest of the employee record as it stood at that point, i.e. a result like this:
001, Smith, Bill, Melbourne, 2015-01-01
001, Smith, Bill, Sydney, 2015-03-01
002, Brown, Mary, Melbourne, 2014-04-01
002, Jones, Mary, Adelaide, 2014-05-01
002, Jones, Mary, Sydney, 2015-01-01
002, Jones, Mary, Perth, 2015-03-01
I know I can do this using a while loop or a cursor, but I suspect there is probably a more performant option.
---
*A NULL always means "value didn't change" rather than "value changed to NULL".
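The cursor-free shape I'm imagining (SQL Server syntax, untested) pulls each column's most recent non-NULL value up to each row's date:

SELECT s.[ID],
       sn.Surname,
       fn.FirstName,
       o.Office,
       s.[Date]
FROM Staff AS s
OUTER APPLY (SELECT TOP (1) Surname   FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND Surname   IS NOT NULL
             ORDER BY [Date] DESC) AS sn
OUTER APPLY (SELECT TOP (1) FirstName FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND FirstName IS NOT NULL
             ORDER BY [Date] DESC) AS fn
OUTER APPLY (SELECT TOP (1) Office    FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND Office    IS NOT NULL
             ORDER BY [Date] DESC) AS o
ORDER BY s.[ID], s.[Date];

Is there anything better than three correlated lookups per row?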
user2045064
(63 rep)
Sep 2, 2016, 08:56 AM
• Last activity: Aug 2, 2022, 12:23 PM
1 vote · 1 answer · 1823 views
How to deal with fact table data that needs to be version controlled?
I have the following simplified sport_match 'fact' table:
| match_id | tournament_id | player_id_p1 | player_id_p2 | p1_final_score | p2_final_score |
|----------|---------------|--------------|--------------|----------------|----------------|
| 1 | 1 | 1 | 2 | 1 | 0 |
| 2 | 1 | 1 | 2 | 3 | 1 |
| 3 | 2 | 3 | 2 | 2 | 3 |
| 4 | 2 | 3 | 2 | 4 | 0 |
The table is updated from an API that issues INSERT, UPDATE and DELETE SQL instructions via text files.
Occasionally there is a mistake in the scores, and because I need to be able to run historical analyses from a specific point in time I need to capture both the incorrect entry and the correct entry. For this reason I started to look at adopting a Slowly Changing Dimension Type 2 method and translating all the API instructions to INSERT. This would give me a table that looks like this:
| match_key | match_id | tournament_id | player_id_p1 | player_id_p2 | p1_final_score | p2_final_score | start_date | current_flag |
|-----------|----------|---------------|--------------|--------------|----------------|----------------|------------------|--------------|
| 1 | 1 | 1 | 1 | 2 | 1 | 0 | 01/01/2000 00:00 | Y |
| 2 | 2 | 1 | 1 | 2 | 3 | 1 | 02/01/2000 00:00 | Y |
| 3 | 3 | 2 | 3 | 2 | 2 | 3 | 03/01/2000 00:00 | Y |
| 4 | 4 | 2 | 3 | 2 | 4 | 0 | 04/01/2000 00:00 | N |
| 5 | 4 | 2 | 3 | 2 | 4 | 1 | 04/01/2000 00:01 | Y |
However, I realised I was applying a 'dimension' principle to a 'fact' table.
Is this a viable approach or should I be looking at a different design?
Jossy
(83 rep)
Jun 6, 2022, 08:28 PM
• Last activity: Jun 6, 2022, 09:04 PM
0 votes · 1 answer · 43 views
Do I have the right Slowly Changing Dimensions type for my version controlled tennis match database?
I'm trying to version control my database using the principles of Slowly Changing Dimensions. I've opted to use Type 2 with generation start and end columns instead of datetimes.
In a simplified example I have three tables:
**player:**
| player_key | player_id | country_id | start | end |
|------------|-----------|------------|-------|-----|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 2 | 2 | 1 | |
| 3 | 1 | 3 | 2 | |
**tournament:**
| tournament_key | tournament_id | surface_id | start | end |
|----------------|---------------|------------|-------|-----|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 2 | 2 | |
**tennis_match:**
| match_id | tournament_key | player_key_p1 | player_key_p2 | start | end |
|----------|----------------|---------------|---------------|-------|-----|
| 1 | 1 | 1 | 2 | 1 | |
| 2 | 1 | 1 | 2 | 1 | |
| 3 | 2 | 3 | 2 | 2 | |
| 4 | 2 | 3 | 2 | 2 | |
I now want to extract all the matches and their respective tournament and player data to run some analysis on it. If I run the following query:
SELECT
    match_id,
    tournament_key,
    player_key_p1,
    player_key_p2,
    t.surface_id,
    p1.country_id,
    p2.country_id
FROM
    tennis_match AS m
JOIN
    player AS p1 ON p1.player_key = m.player_key_p1
JOIN
    player AS p2 ON p2.player_key = m.player_key_p2
JOIN
    tournament AS t ON t.tournament_key = m.tournament_key
This gives me:
| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1 | 1 | 1 | 2 | 1 | 1 | 2 |
| 2 | 1 | 1 | 2 | 1 | 1 | 2 |
| 3 | 2 | 3 | 2 | 2 | 3 | 2 |
| 4 | 2 | 3 | 2 | 2 | 3 | 2 |
The issue I'm facing is that the surface_id and p1_country_id change part way through the matches because, well, they changed part way through the matches. However, for the purposes of my analysis, at match_id = 4 I should be using the values of the latest versions of player and tournament:
| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1 | 1 | 1 | 3 | 2 | 3 | 2 |
| 2 | 1 | 1 | 3 | 2 | 3 | 2 |
| 3 | 2 | 2 | 3 | 2 | 3 | 2 |
| 4 | 2 | 2 | 3 | 2 | 3 | 2 |
So I figure that to get the data in the format I need, I'm going to have to write some reasonably complex (for me) queries. This has got me questioning whether I have the right structure.
If I'd gone for a Type 4 approach then my queries on the non-history tables would be nice and simple. However, if I wanted to run an analysis from a point in the past I'd have to head to the history table and I reckon I'd have the same challenge as I have now. Plus I'd have the added hassle of managing history tables and having to figure out a solution for deleted records.
I did look at Type 6, but this looked like I needed to duplicate the version-controlled columns: one set for the current_state and one for the historic_state. As some of the version-controlled tables have hundreds of columns, this didn't seem like the right approach either, so I didn't review it much further.
Finally getting to my question... do I have the right data structure and just need to knuckle down on query writing or could I implement a better design?
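For example, to resolve every match to the latest versions, I think I'd have to hop from the stored key back to the natural key and then to the open-ended row - something like this (untested; assuming an empty end is NULL, and end may need quoting as a reserved word):

SELECT m.match_id,
       t_cur.surface_id,
       p1_cur.country_id AS p1_country_id,
       p2_cur.country_id AS p2_country_id
FROM tennis_match AS m
JOIN player AS p1     ON p1.player_key = m.player_key_p1
JOIN player AS p1_cur ON p1_cur.player_id = p1.player_id AND p1_cur."end" IS NULL
JOIN player AS p2     ON p2.player_key = m.player_key_p2
JOIN player AS p2_cur ON p2_cur.player_id = p2.player_id AND p2_cur."end" IS NULL
JOIN tournament AS t     ON t.tournament_key = m.tournament_key
JOIN tournament AS t_cur ON t_cur.tournament_id = t.tournament_id AND t_cur."end" IS NULL;

Doubling every join like this is what makes me wonder if the structure is wrong.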
Jossy
(83 rep)
Jun 4, 2022, 09:26 PM
• Last activity: Jun 6, 2022, 07:42 PM
0 votes · 1 answer · 282 views
Fact table linking to natural key rather than row key in a slowly changing dimension table?
I’m designing a data warehouse and have a Slowly Changing Dimension Type 2 table something like this:
UserKey | UserID | Group | EffectiveDate | ExpiryDate | IsCurrent |
---------|--------|----------|---------------|------------|-----------|
1 | 1001 | GR1 | 2021-01-01 | 9999-12-31 | Y |
2 | 1002 | GR1 | 2021-07-31 | 2022-02-28 | N |
3 | 1002 | GR2 | 2022-02-28 | 9999-12-31 | Y |
And I have a fact table, which whilst it does have dates, doesn't really relate to the user dimension at any particular point in time, say a subscriptions table like this:
SubsKey | SubsID | SubsType | UserKey / UserId ? | StartDate | EndDate
---------|--------|----------|-------------------| -----------|------------|
55501 | SBP501 | Premium | ??? / 2 | 2021-08-01 | 2022-08-01 |
55502 | SBB123 | Bonus | ??? / 2 | 2022-08-01 | 2022-09-01 |
The subscriptions could be set up for dates in the future for example, or be changed retrospectively. Or alternatively we might have a fact table which doesn't have any date aspect at all.
In such cases is it reasonable for the fact table to reference the natural key UserID instead of the type 2 dimension row key UserKey? It's not obvious which UserKey ought to be used.
I suppose one might use whichever row key is current at the time the fact row is generated, but that would seem to imply that the fact relates to a particular snapshot in time for the user, when in reality it doesn't.
stovroz
(631 rep)
Apr 6, 2022, 08:27 PM
• Last activity: Apr 6, 2022, 09:58 PM
0 votes · 0 answers · 434 views
Configure Type 2 SCD on existing tables in SQL Server
I've got two tables, FACILITIES and CONTRACTS. It's a 1:M facility->contract relationship. The tables have unique keys generated by the upstream system and are truncate-reloaded every night. Neither table has any primary key or foreign key/unique key enforcement. Users typically join on the facility ID, which exists on the contracts table.
I am trying to set up SCD Type 2 for these two tables and have created the appropriate and working code using the MERGE command, and have tested it successfully. I know that I will need to set up new primary/foreign keys, since Type 2 will cause the business key (what they are currently using as a unique identifier) to be duplicated.
However, I'm unsure how to handle existing data. Both tables have several thousand rows, so I don't know how to link existing contracts and facilities using the new PK/FK combos I set up. I know that going forward they will work fine, but not sure about now.
How do I link the existing data with the new PK/FK setup so the associations for the new keys are correct for SCD going forward?
Here's how they look now:
| FacilityID | Facility_PK (the one I set up) |
| -----------| -------------------------------|
| 12345 | 1 |
| 23456 | 2 |
| ContractID | Contract_PK | Facility ID | Facility_FK |
| ---------- | -------------|---------------------------------| ------------|
| 7890 | 1 | 12345 | ??? |
| 4444 | 2 | 12345 | ??? |
| 2345 | 3 | 23456 | ??? |
EDIT: Further context:
Once the Slowly Changing Dimension logic is implemented, the existing ID fields will be duplicated and really considered business keys.
| FacilityID | Facility_PK (the one I set up) | USER | Active Indicator |
| -----------| -------------------------------|----------|------------------|
| 12345 | 1 | User A |0 |
| 23456 | 2 | User A |1 |
| 12345 | 3 | User B |1 |
Now there is no longer a unique link to the CONTRACTS table. So I can add a PK to this FACILITIES table and keep the rows unique, but how do I update the data in the CONTRACTS foreign key so that this works properly?
If you were to now join based on FacilityID, you'd get back both records and would be unable to tell which contract is associated with which point-in-time facility record.
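The backfill I have in mind (SQL Server syntax; assuming exactly one current row per FacilityID and an ActiveIndicator flag as above) would be something like:

UPDATE c
SET    c.Facility_FK = f.Facility_PK
FROM   CONTRACTS  AS c
JOIN   FACILITIES AS f
       ON  f.FacilityID = c.FacilityID
       AND f.ActiveIndicator = 1;   -- point each contract at the current facility version

Is that the right way to seed the keys, or does history need special handling?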
seve
(1 rep)
Mar 13, 2021, 05:06 PM
• Last activity: Mar 14, 2021, 12:29 AM
1 vote · 2 answers · 2365 views
Data Warehouse - Slowly Changing Dimensions with Many to Many Relationships
As an example, let's say I have a fact table with two dimensions and one measure
**FactMoney table**
----------
ProjectKey int
PersonKey int
CashAmount money
----------
The two dimensions are defined like this:
**DimProject (a type 0 dimension - i.e. static)**
----------
ProjectKey int
ProjectName varchar(50)
----------
**DimPerson (a type 2 slowly changing dimension)**
----------
PersonKey int
PersonNaturalKey int
PersonName varchar(50)
EffectiveStartDate datetime
EffectiveEndDate datetime
IsCurrent bit
----------
Pretty straightforward so far. Now I'll introduce a Person Category concept.
**DimCategory**
----------
CategoryKey int
CategoryName varchar(50)
----------
And build an M2M relationship between DimPerson and DimCategory
**BridgePersonCategory**
----------
PersonKey int
CategoryKey int
----------
So - people can have 1..n categories.
My problem is - as Person is a slowly changing dimension, when a person's name changes, we add a new person row and update our effective dates and IsCurrent flags, no big deal.
But what do we do with the person's categories? Do we need to add more rows to the bridge table every time a new person version pops up?
And as a corollary, if a person's categories change, does that mean we need to create a new row in the person table?
(please note I've also asked this question at stack-O and as directed by the first commenter, now posted this question here.)
Scott Herbert
(193 rep)
Nov 1, 2013, 12:53 AM
• Last activity: Mar 12, 2021, 11:04 PM
1 vote · 1 answer · 59 views
Type 2 Dimension for a fact measure
I have a need to store products along with their prices and quantities (and many other things) and track the history.
Now, given both price and quantity are stored as integers, it is my understanding these are fact measures. So we might have a table like:
f_products:
id
sku
price
quantity
As the price varies over time we can insert a new fact, and I can introduce effective dates to the table to allow for querying the price or quantity at a given time.
But with this design, it means if the quantity changes, I must insert another row, repeating the unchanged price value again. And while this isn't too bad for 2 columns, I have in fact 10.
So, my design is:
f_products:
id
sku
price:
id
product_id
price
start_date
end_date
another "dimension":
id
product_id
value
start_date
end_date
and so on
Now the "fact" no longer needs to change when the price changes, and I can query for historical values across these tables.
So my question is, this doesn't seem to fit the standard fact / dimension designs, so I wonder how more experienced modellers would approach these requirements.
sksamuel
(113 rep)
Feb 23, 2021, 02:11 AM
• Last activity: Feb 23, 2021, 02:49 AM
1 vote · 3 answers · 854 views
The right way to model multiple FACTs schema
**Background**
I'm in the process of designing a database (using a STAR schema).
There are three tables to model: *products*, *tests*, *states*.
The database will be used to store results of *tests* conducted on *products* (in a great simplification). There can be many *tests* pointing out to a single product but each *test* is unique (they are not shared among *products*). Besides, I need to record the current *state* of the product, at the time when the *test* was conducted. Let's assume that the *state* of a *product* describes its current location and owner, which are changing very often. That will most likely involve SCD lvl 2 - to track the history of *state* changes and to be able to locate a *product* with all its *tests* as well as the *states* it had during these *tests*.
**Problem**
I'm not entirely sure how to model this problem. It seems obvious to store every *test* in a FACT table; this table would then consist of thousands of transactions. On the other hand, there will also be hundreds (and later thousands) of *products*, so I should probably keep them in a second FACT table. Then there will also be thousands of *state* changes, so in order to record their entire history, I would need to keep them in a... FACT table as well? I've been told that FACT tables are typically used to store many-row transactional data, but then where are the DIMs in this model?
I also don't know how to model the relationships between these tables. *Products*-*states* is a 1:* relationship. *Products*-*tests* is 1:* as well. Finally, *states*-*tests* is 1:* too. I would then go with linking *products* to *states* and then *states* to *tests* (*products* 1<-* *states* 1<-* *tests*), which would allow me to find all *states* for a particular *product* and all *tests* (in all *states* or in a selected *state*). What do you think about that? The problem here is that, as I keep adding *states*, I have two options: either keep duplicating *products* in the *products* table (with an added "recorded_timestamp" column) or use an SCD lvl 2 in the *states* table, pointing to the *products* table with a FK - but this would effectively make the *products* table a DIM!
Any help here would be very appreciated.
JustGiveMeData
(113 rep)
Jul 12, 2020, 10:14 PM
• Last activity: Jul 17, 2020, 01:12 PM
3 votes · 2 answers · 416 views
Historicizing of database models
We want to build an SQL database for a system to manage regulations in different countries for different years. The problem is that the regulations change every year for every country. For example, for social insurance you need different data in 2015 than you needed in 2014, and we have companies that enter their data to connect with their social insurance information for every year.
If I wanted to build my models on this data, I would have to create a different model for every year, each with different attributes. I can't just use migrations, because I have to keep the historical data and information.
Right now I am researching the right solution, and I would like to know: is there a name for this problem, so that I can find existing solutions?
herbertp.
(89 rep)
Dec 6, 2016, 08:11 AM
• Last activity: Jun 19, 2020, 07:00 PM
1 vote · 0 answers · 320 views
Create SCD type 2 table from change log table using SQL
I have a change log table, which captures the transition of a state with its time, for example:
user_id,user_status_from,user_status_to, created_date
7cc4d A2 A1 2019-11-03 23:04:26
7cc4d A1 A6 2019-11-03 23:05:28
7cc4d A6 I4 2019-11-16 10:00:34
7cc4d I4 A1 2020-03-16 10:00:36
Basically this table records the transition from one status to another. Here it says that the transition from **A2 --> A1** occurred at **2019-11-03 23:04:26** (first row), then the user moved from **A1 --> A6** at **2019-11-03 23:05:28**, and so on.
Now I want to write SQL to build a table that tells the history of each status, for example: the user was in a particular status from this date to that date; basically the lifetime of each status. I want to create a table like the one below:
userId status startingdatetime endingdatetime
7cc4d A1 2019-11-03 23:04:26 2019-11-03 23:05:28
7cc4d A6 2019-11-03 23:05:28 2019-11-16 10:00:34
7cc4d I4 2019-11-16 10:00:34 2020-03-16 10:00:36
7cc4d A1 2020-03-16 10:00:36 2020-04-02 08:24:50
As you can see, this table is simply a logical derivation of the table above: each row says how long a particular status lasted, from one date to another.
How can I write SQL to create this final table? (Please note that the **LAG/LEAD** functions don't work in my version of MySQL Server.)
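Since LAG/LEAD aren't available, one idea is a correlated subquery that fetches the next transition time per user (assuming the table is named change_log; the open-ended current status comes back as NULL and could be COALESCEd to NOW()):

SELECT c1.user_id,
       c1.user_status_to AS status,
       c1.created_date   AS startingdatetime,
       (SELECT MIN(c2.created_date)
          FROM change_log AS c2
         WHERE c2.user_id = c1.user_id
           AND c2.created_date > c1.created_date) AS endingdatetime
FROM change_log AS c1
ORDER BY c1.user_id, c1.created_date;

Would this be acceptable performance-wise, or is there a better way on an old MySQL version?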
user3065757
(111 rep)
Apr 5, 2020, 04:02 AM