Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
0 votes · 1 answer · 939 views
Periodic snapshot fact table with monthly grain - Question on dimensions
I am fairly new to data-warehousing but have read my fair share of books and online tutorials to gain an introductory understanding of the basic components that build up a data warehouse.
My end goal is to be able to do a headcount of employees at the end of each month, so I designed my fact table as a monthly snapshot: at each month-end I populate the fact table with one row per employee per month.
My dimensional model includes a DimEmployee table, which is an SCD2 (any change made to an employee's information causes a new row to be introduced in the data source, and I keep track of the most recent employee information via an is_current flag set to 'Y' or 'N').
My question is: do I need to update the dimensions daily and only insert whatever data I have at end-of-month into the fact table? Or do I update the dimensions end-of-month as well?
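For reference, here is a sketch of the end-of-month load I have in mind (SQL Server flavored; the fact table name and columns are illustrative):
INSERT INTO FactMonthlyHeadcount (snapshot_date, employee_key)
SELECT EOMONTH(GETDATE()) AS snapshot_date,   -- the month-end being snapshotted
       e.employee_key
FROM   DimEmployee AS e
WHERE  e.is_current = 'Y';                    -- one row per current employee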
Thank you in advance!
Alistair
(141 rep)
Apr 16, 2017, 08:01 PM
• Last activity: Jun 3, 2025, 12:03 PM
0 votes · 0 answers · 15 views
Have I correctly converted this DB table into dimension tables for a star schema?
I'm trying to replace a pretty complicated query for a dashboard. We currently track practices, phone numbers, and start/end dates partially in an Excel spreadsheet and partially in a database. Whenever we want to update any of the info, like which practice uses which number during which time period, it is a very involved process that includes checking numbers and practices in Excel to make sure that nothing is duplicated, etc. If we want to change something so that practices share a line, we have to go into the DB and combine the names, etc. I think the entire process could be simplified if we used a star schema instead of the system that we currently have.
We currently have a database table with the following schema
| Group Practice | Practice | TransferType | DialedNumber | DigitsDialed | StartDate | EndDate |
I think the table should be broken out into the following dimensions:
* Practice
| PK | Group Practice | Practice |
* Phone Number
| PK | DialedNumber | DigitsDialed |
* TransferType
| PK | TransferType |
* Calendar Table
There would be a fact table at the center that would track "Transfer Volume by Practice".
A phone number can have one or more transfer types for one or more practices simultaneously.
For example, (123)456-7890 could be transfertype1 for practice A from Jan 1, 2024 - Feb 1, 2024. It could then be used as transfertype2 for practice B from Feb 2, 2024 - Mar 1, 2024.
Or Number (234)567-8901 could be transfertype 2 for practice C from Jan 1, 2024 - Feb 1, 2024 and it could simultaneously be transfertype 3 for practice D from Jan 1, 2024 - Mar 1, 2024.
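To make this concrete, here's a rough sketch of the DDL I'm picturing (SQL Server flavored; all names are illustrative):
CREATE TABLE DimPractice (
    PracticeKey   int IDENTITY PRIMARY KEY,
    GroupPractice varchar(100),
    Practice      varchar(100)
);

CREATE TABLE DimPhoneNumber (
    PhoneNumberKey int IDENTITY PRIMARY KEY,
    DialedNumber   varchar(20),
    DigitsDialed   varchar(20)
);

CREATE TABLE DimTransferType (
    TransferTypeKey int IDENTITY PRIMARY KEY,
    TransferType    varchar(50)
);

CREATE TABLE FactTransferVolume (
    DateKey         int,   -- points at the calendar dimension
    PracticeKey     int,
    PhoneNumberKey  int,
    TransferTypeKey int,
    TransferCount   int
);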
I've read textbook definitions of how to set up a star schema but I want to make sure that I'm doing this correctly. Do my dimensions make sense? Would this capture the "slowly changing dimensions" in the data?
j.jerrod.taylor
(101 rep)
Oct 18, 2024, 05:11 PM
0 votes · 1 answer · 71 views
Point-in-Time-Recovery of Postgres at an arbitrary time for audit purposes
TL;DR
Would you recommend using CDC or pg_wal-based recovery to reproduce the state of the database at an arbitrary time, or sticking with joining SCD2 tables (with start and end dates for every record) and querying all data before a given date?
We are developing a new microservice to support our application. For audit purposes our calculations need to be reproducible, so the current plan is to use SCD2, with every record having a start_date and an end_date. This way the state of the database at a given time can be reproduced if a calculation from the past ever needs to be redone. Joining and maintaining versioned SCD2 tables seems cumbersome, but that is the current plan.
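For what it's worth, the point-in-time lookup against an SCD2 table reduces to one predicate per table (:as_of is a pseudo-parameter; the table name is illustrative):
SELECT *
FROM   some_versioned_table
WHERE  start_date <= :as_of
  AND  (end_date > :as_of OR end_date IS NULL);  -- the row version in effect at :as_of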
Alternatively, would it work to have the database store only the current state (updates actually update a record rather than marking it obsolete and inserting a new version), archive the replica-level WAL files with sufficiently fine granularity, and rebuild the state on demand?
Or keep an audit table and rebuild from that?
The scale/magnitude is around 100k records per year.
Any thoughts welcome.
P.S. we don't have a DBA on the team.
Mr Arkadin
(1 rep)
May 3, 2024, 04:33 PM
• Last activity: May 3, 2024, 08:04 PM
6 votes · 1 answer · 410 views
Designing a dimensional DB off a normalized source that already implements SCD's
I have built an SSIS ETL to integrate various data sources (one from MySQL, two from SQL Server) into a single SQL Server relational and normalized database, which I've called [NDS].
The SSIS ETL handles type-2 updates, and so the [NDS] generates surrogate keys and SCD tables include an [_EffectiveFrom] timestamp and a nullable [_EffectiveTo] column, and there are constraints for the natural keys and beautiful foreign keys linking all the data together.
Now I wanted to build an SSAS dimensional database off of it, and it didn't take too long before I realized I was setting myself up for a snowflake schema.
So I'm thinking of adding a new [DDS] (relational) database, to create the *actual* dimension and fact tables that will feed the DSV's for the SSAS db.
This [DDS] database would be as denormalized as humanly possible, so as to "flatten" the facts and dimensions (like, [OrderHeaders]+[OrderDetails] into an [Orders] fact table, and [CustomerStores]+[Customers]+[SalesReps] into some [Customers] dimension table) - doing this should not only make it easier for me to build the dimension hierarchies in SSAS, it should also make it easier to come up with an actual star schema.
I have a few questions though:
- Can I reuse a subset of my existing surrogate keys? I'm thinking to take the existing key for the most granular level and make that the dimension key. Is that a good approach, or should I just ignore the [NDS] surrogate keys and make the [DDS] (relational db) generate a new set of surrogate keys?
- How to handle SCD's? For example, "Materials" and "Suppliers" will generate new records in [NDS] when some specific fields change in the source system... I think I'll have to design the SSIS ETL to only load the "last image" records into the [DDS] db, and then re-implement the type-2 updates in that process, i.e. treat the [NDS] as a "source system" that keeps history, while duplicating everything in this [DDS] database. But then, why would I need to keep history in the [NDS] *and* the [DDS]? Clearly something's not right.
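For what it's worth, the "last image" extract from the [NDS] would presumably just filter on the open-ended rows (the schema and table names are from my model, e.g. Materials):

SELECT *
FROM   [NDS].dbo.Materials
WHERE  [_EffectiveTo] IS NULL;   -- the current version of each material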
Am I setting myself up for a Big Mess™, or am I on the right track?

Mathieu Guindon
(914 rep)
Dec 1, 2015, 06:31 PM
• Last activity: Mar 26, 2024, 05:28 PM
0 votes · 0 answers · 64 views
SQL Query group by to find EffectiveFrom and EffectiveTo Dates for repeating key value pair
I have a table with a snapshot of repeated data for a key and value pair with record_timestamps. The dataset can also contain late-arriving historical record_timestamps.
I want to generate an output with an EffectiveFrom and EffectiveTo timestamp for each combination of key and value, where the validity closes when the value changes and a new record is created for a new combination.
Note: The same values can reappear after an interval with a different record_timestamp.
**Snapshot of repeated data**
**Expected Outcome**
I have tried using LEAD and LAG, partitioned by the keys and ordered by the timestamp, to find the previous and next value for each record and use that to group sets. However, this does not work; I am missing something, since the same value can reappear after an interval and those occurrences should not be grouped together.
create table equipments_staging
(
    equipment string,
    item      string,
    value     string,
    timestamp timestamp
);
insert into equipments_staging
values
('M','X','1','2023-11-10 13:00')
,('M','X','2','2023-11-11 13:00')
,('M','X','2','2023-11-12 13:00')
,('M','X','1','2023-11-13 13:00')
,('M','X','1','2023-11-14 13:00')
,('M','X','1','2023-11-15 13:00')
,('M','X','2','2023-11-16 13:00')
,('M','X','2','2023-11-17 13:00')
,('M','X','1','2023-11-18 13:00');


Select equipment, item, value,
    LAG(value,1)  OVER (PARTITION BY equipment, item ORDER BY timestamp) AS prev,
    LEAD(value,1) OVER (PARTITION BY equipment, item ORDER BY timestamp) AS next,
    coalesce(LEAD(timestamp,1) OVER (PARTITION BY equipment, item ORDER BY timestamp), '9999-12-31') AS next_timestamp,
    timestamp
from equipments_staging;
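For reference, the gaps-and-islands pattern I've been trying to get to would start a new run whenever the value differs from the previous row, then aggregate per run - something like this (untested sketch):

WITH changes AS (
    SELECT equipment, item, value, timestamp,
           CASE WHEN value = LAG(value) OVER (PARTITION BY equipment, item ORDER BY timestamp)
                THEN 0 ELSE 1 END AS is_change   -- 1 starts a new run (also for the first row)
    FROM equipments_staging
),
runs AS (
    SELECT *,
           SUM(is_change) OVER (PARTITION BY equipment, item ORDER BY timestamp) AS run_id
    FROM changes
)
SELECT equipment, item, value,
       MIN(timestamp) AS effective_from,
       COALESCE(LEAD(MIN(timestamp)) OVER (PARTITION BY equipment, item ORDER BY MIN(timestamp)),
                '9999-12-31') AS effective_to
FROM runs
GROUP BY equipment, item, value, run_id;

Because run_id keeps increasing, a value that reappears later lands in a new run instead of being merged with its earlier one - but I'm not sure this handles the late-arriving timestamps correctly.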
user137279
(1 rep)
Feb 19, 2024, 08:26 PM
• Last activity: Feb 19, 2024, 09:21 PM
0 votes · 0 answers · 105 views
How do I create a history table that allows me to retrieve the times when a row was active, by year & month?
I want to be able to retrieve, from a historical table, the in-between months during which a client was active.
Table: DimensionEpisode
| episode_id | client_id | episode_status | discharge_date | admission_date | program_id | length_of_stay | subprogram | ...other attributes |
|------------|-----------|----------------|----------------------|----------------------|------------|----------------|------------|--------------------|
| 63 | 61362 | CLOSED | 2018-09-06T00:00:00 | 2014-10-01T00:00:00 | 72 | 1436 | Union PC | ... |
| 64 | 61343 | OPEN | 2019-03-15T00:00:00 | 2015-05-20T00:00:00 | 80 | 1000 | Rehab | ... |
| 65 | 61344 | OPEN | 2020-01-10T00:00:00 | 2018-11-12T00:00:00 | 65 | 750 | Psychiatry | ... |
| 66 | 61345 | CLOSED | 2021-07-02T00:00:00 | 2019-09-28T00:00:00 | 72 | 1250 | Union PC | ... |
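What I'm picturing is joining to a month calendar so each episode expands into one row per month it was active - roughly like this (a hypothetical DimMonth table with month_start/month_end columns; SQL dialect unspecified):

SELECT e.client_id,
       e.episode_id,
       m.month_start
FROM   DimensionEpisode AS e
JOIN   DimMonth AS m
       ON m.month_end   >= e.admission_date   -- month overlaps the episode's
      AND m.month_start <= e.discharge_date   -- active interval
ORDER  BY e.client_id, m.month_start;

But I'm not sure whether to materialize this into the history table or compute it at query time like this.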
Sam Johnson
(21 rep)
Dec 6, 2023, 03:33 PM
7 votes · 6 answers · 1965 views
Are duplicated NVARCHAR values stored as copies in SQL Server?
I'm designing a table that will contain a lot of rows, so I need to be careful not to store too much information. One of the columns is an NVARCHAR(MAX) column and it contains the address of our customers. As addresses do not change often, this column will contain many repeated values and thus quite some redundancy.
So I was wondering if I need to normalize this myself by maintaining some sort of look-up table of address strings (note that if an address changes I need to maintain history, so it's not a matter of the usual normalization), or whether SQL Server points repeated strings at the same stored value behind the scenes. Or maybe it offers a column option to do so. Another approach that came to mind is to use COMPRESS, but I guess this does not make sense as the data itself (i.e. the address) is not long.
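The look-up table approach I'm considering would look something like this (illustrative only; note an NVARCHAR(MAX) column can't be indexed for uniqueness, so a persisted hash column would probably be needed to de-duplicate):

CREATE TABLE dbo.Address (
    AddressId   int IDENTITY PRIMARY KEY,
    AddressText nvarchar(max) NOT NULL
);

CREATE TABLE dbo.CustomerAddressHistory (
    CustomerId int NOT NULL,
    AddressId  int NOT NULL REFERENCES dbo.Address (AddressId),
    ValidFrom  datetime2 NOT NULL,
    ValidTo    datetime2 NULL   -- NULL = current address
);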
Reading/writing performance is not so much of a concern as the data will be accumulated over time.
Dejan
(175 rep)
Nov 18, 2020, 09:49 AM
• Last activity: Jun 26, 2023, 02:34 PM
1 vote · 1 answer · 865 views
Optimal join for joining facts with scd-type-2 dimension for aggregation/reporting
I have a fact table and an scd-type-2 dimension table. I want to produce sales report by region and year.
I have a working solution with a query that joins them for reporting purposes. When I run the query in Spark/Databricks, it gives me a little warning at the bottom:
Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint.
and points to this link.
**Question:** Is there a more optimal way to query when I'm joining using a between condition (instead of an = condition)?
fact table:
create table sales
(name string
,sale_date date
,sold_amt long);
insert into sales values
('John','2022-02-02',100),
('John','2022-03-03',100),
('John','2023-02-02',200),
('John','2023-03-03',200),
('Rick','2022-02-02',300),
('Rick','2023-02-02',400);
dimension table (scd-type-2)
create table employee_scd2
(name string
,region string
,start_date date
,end_date date
,is_current boolean); -- unused, kept for completeness
insert into employee_scd2 values
('John','NAM', '2010-01-01', '2022-12-31', false),
-- John transferred from NAM to APAC starting 2023
('John','APAC', '2023-01-01', '9999-01-01', true),
('Rick','NAM', '2020-01-01', '9999-12-31', true);
sales report by region and year
select e.region,
year(s.sale_date) as sale_year,
SUM(s.sold_amt) as sale_amt
from sales s
left join employee_scd2 e
on e.name = s.name
and s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);
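If I understand the linked docs correctly, the only change would be adding a range join hint with a bin size, something like this (the 60-day bin is just a guess):

select /*+ RANGE_JOIN(e, 60) */
       e.region,
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt) as sale_amt
from sales s
left join employee_scd2 e
    on e.name = s.name
    and s.sale_date between e.start_date and e.end_date
group by e.region, year(s.sale_date);

But is there a fundamentally better join shape than the between, or is the hint the best one can do?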
---
I've read following and some more:
* https://dba.stackexchange.com/questions/184639/join-fact-table-to-scd-type-2-how-to-write-query-sql-server
* https://www.startdataengineering.com/post/how-to-join-fact-scd2-tables/
* How to implement Slowly Changing Dimensions (SCD2) Type 2 in Spark
Kashyap
(145 rep)
Mar 7, 2023, 04:49 PM
• Last activity: Jun 7, 2023, 05:43 PM
0 votes · 1 answer · 849 views
Implementing revisioning logic with SCD2/4
I posted this question on StackOverflow, but it didn't get much attention, so I'm hoping I'll have better luck here.
I'm working on a project that is a sort of bill-of-materials manager that needs to support revisioning of components. After some research, I've stumbled upon slowly changing dimensions type 2 and type 4, which seem like the most appropriate patterns for this use case.
However, I'm a little bit confused about a couple of concepts:
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE components (
    id INTEGER PRIMARY KEY,
    name TEXT,
    project_id INTEGER,
    CONSTRAINT fk_project FOREIGN KEY (project_id) REFERENCES projects (id)
);
If I wanted to implement SCD Type 2, would the components table get the additional columns start_date, end_date, and active? Or would Type 2 mean adding a revisions table with the same structure as components plus the additional columns, and Type 4 mean having both a revisions table and a revisions_history table?
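To make the first option concrete, my guess at the Type 2 version of components would be (the extra columns are my assumption):

CREATE TABLE components (
    id INTEGER PRIMARY KEY,
    component_id INTEGER,   -- stable business key shared by all revisions
    name TEXT,
    project_id INTEGER,
    start_date TIMESTAMP,   -- when this revision became effective
    end_date TIMESTAMP,     -- NULL while the revision is active
    active BOOLEAN,
    CONSTRAINT fk_project FOREIGN KEY (project_id) REFERENCES projects (id)
);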
Any help would be greatly appreciated!
user1960118
(143 rep)
Sep 5, 2021, 01:58 PM
• Last activity: Apr 15, 2023, 08:01 AM
0 votes · 1 answer · 211 views
Converting EAVT table into SCD type 2
After a lot of research and picking people's brains, I'm still unable to find a good/clean solution for converting an entity-attribute-value-timestamp table to an SCD type 2 dimension.
Here's the issue:
I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to).
So for every object (company, product, etc.) I have a table with the current state in a relational model, and another history table that contains all value changes to all attributes, with valid_from/valid_to columns for the validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example:
Company has two tables:
- **Current state of the Companies:**
| company_id | name | number_of_employees | city |
| -------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | 500 | Paris |
| 2 | Company 2 | 500 | Paris |
- **History Table:**
| company_id | attribute | value | valid_from | valid_to |
| -------- | -------------- | -------------- | -------------- | -------------- |
| 1 | city | New York | 01/01/2020 | 01/05/2022 |
| 1 | city | Paris | 01/05/2022 | 12/31/9999 |
| 1 | number_of_employees | 50 | 01/01/2021 | 01/01/2022 |
| 1 | number_of_employees | 100 | 01/01/2022 | 12/31/9999 |
What I want to have as a result is the following:
| company_id | name | city | number_of_employees | valid_from | valid_to | is_active |
| -------- | -------------- | -------------- | -------------- | -------------- | -------------- | -------------- |
| 1 | Company 1 | New York | null | 01/01/2020 | 01/01/2021 | false |
| 1 | Company 1 | New York | 50 | 01/01/2021 | 01/01/2022 | false |
| 1 | Company 1 | New York | 100 | 01/01/2022 | 01/05/2022 | false |
| 1 | Company 1 | Paris | 100 | 01/05/2022 | 12/31/9999 | true |
So based on this example, we have a company that started on 01/01/2020 with New York as its city; the number of employees wasn't populated at that time.
We then modified the company on 01/01/2021 to set the number of employees to 50.
We modified it again on 01/01/2022 to change the number of employees to 100, and finally changed the city from New York to Paris on 01/05/2022.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows.
The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.
To add more complexity to the task, imagine we have about 120 attributes; also, if a company has never been changed (just created and still in its creation state), it won't exist in the history table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the current-state table into the SCD (a union between the current table and the history result table). Fun, right! :)
To give you a sense of the technical environment: the CRM is HubSpot, data is replicated from HubSpot to BigQuery, and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV-model tables, but I'm stuck at the calculation of the valid_from/valid_to in the resulting SCD. (Example of using pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD).
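The only generic shape I can think of is a "date spine": take every distinct valid_from per company as a snapshot date, pick each attribute's value effective at that date, then pivot. A theoretical sketch (BigQuery-flavored, untested; "history" stands for the history table above, and the PIVOT list would have to be generated for all ~120 attributes):

WITH spine AS (
    SELECT DISTINCT company_id, valid_from AS snapshot_date
    FROM history
),
state_at AS (
    SELECT s.company_id, s.snapshot_date, h.attribute, h.value
    FROM spine AS s
    JOIN history AS h
      ON h.company_id = s.company_id
     AND h.valid_from <= s.snapshot_date
     AND h.valid_to   >  s.snapshot_date
)
SELECT *
FROM state_at
PIVOT (MAX(value) FOR attribute IN ('city', 'number_of_employees'));
-- valid_from = snapshot_date; valid_to = LEAD(snapshot_date) per company,
-- defaulting to 12/31/9999 for the last row

But I don't know if this scales, or whether there is a cleaner pattern.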
So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard-coding and performance issues? And how to calculate the valid_from/valid_to dynamically?
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or just plain opinions are highly appreciated, as this is the last step in the adoption of a whole data culture in the company I work for, and if I can't make this work, well... my credibility will take a hit! So please help a fellow lost IT professional! :D
Rami Ouanes
(1 rep)
May 4, 2022, 02:07 PM
• Last activity: Dec 10, 2022, 06:04 AM
5 votes · 4 answers · 1015 views
Consolidating a row of data, based on previous rows
I'm trying to build a history table out of an audit log (ultimately to build out a type 2 dimension table). Unfortunately, the audit log only records the specific fields being changed. Here's a rough example of what I'm talking about:
CREATE TABLE Staff(
[ID] int,
[Surname] varchar(5),
[FirstName] varchar(4),
[Office] varchar(9),
[Date] varchar(10)
);
INSERT INTO Staff ([ID], [Surname], [FirstName], [Office], [Date])
VALUES
(001, 'Smith', 'Bill', 'Melbourne', '2015-01-01'),
(001, NULL, NULL, 'Sydney', '2015-03-01'),
(002, 'Brown', 'Mary', 'Melbourne', '2014-04-01'),
(002, 'Jones', NULL, 'Adelaide', '2014-05-01'),
(002, NULL, NULL, 'Sydney', '2015-01-01'),
(002, NULL, NULL, 'Perth', '2015-03-01');
The first entry for a particular staff member is from when their record was created, and each subsequent record is an update... but it only shows the field that was updated*. I want to "fill out" each update row with the rest of the employee record as it stood at that point, i.e. a result like this:
001, Smith, Bill, Melbourne, 2015-01-01
001, Smith, Bill, Sydney, 2015-03-01
002, Brown, Mary, Melbourne, 2014-04-01
002, Jones, Mary, Adelaide, 2014-05-01
002, Jones, Mary, Sydney, 2015-01-01
002, Jones, Mary, Perth, 2015-03-01
I know I can do this using a while loop or a cursor, but I suspect there is probably a more performant option.
---
*A NULL always means "value didn't change" rather than "value changed to NULL".
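The cursor-free shape I'm imagining (SQL Server syntax, untested) pulls each column's most recent non-NULL value up to each row's date:

SELECT s.[ID],
       sn.Surname,
       fn.FirstName,
       o.Office,
       s.[Date]
FROM Staff AS s
OUTER APPLY (SELECT TOP (1) Surname   FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND Surname   IS NOT NULL
             ORDER BY [Date] DESC) AS sn
OUTER APPLY (SELECT TOP (1) FirstName FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND FirstName IS NOT NULL
             ORDER BY [Date] DESC) AS fn
OUTER APPLY (SELECT TOP (1) Office    FROM Staff
             WHERE [ID] = s.[ID] AND [Date] <= s.[Date] AND Office    IS NOT NULL
             ORDER BY [Date] DESC) AS o
ORDER BY s.[ID], s.[Date];

Is there anything better than three correlated lookups per row?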
user2045064
(63 rep)
Sep 2, 2016, 08:56 AM
• Last activity: Aug 2, 2022, 12:23 PM
1 vote · 1 answer · 1823 views
How to deal with fact table data that needs to be version controlled?
I have the following simplified sport_match 'fact' table:
| match_id | tournament_id | player_id_p1 | player_id_p2 | p1_final_score | p2_final_score |
|----------|---------------|--------------|--------------|----------------|----------------|
| 1 | 1 | 1 | 2 | 1 | 0 |
| 2 | 1 | 1 | 2 | 3 | 1 |
| 3 | 2 | 3 | 2 | 2 | 3 |
| 4 | 2 | 3 | 2 | 4 | 0 |
The table is updated from an API that issues INSERT, UPDATE and DELETE SQL instructions via text files.
Occasionally there is a mistake in the scores, and because I need to be able to run historical analyses from a specific point in time I need to capture both the incorrect entry and the correct entry. For this reason I started to look at adopting a Slowly Changing Dimension Type 2 method and translating all the API instructions to INSERT. This would give me a table that looks like this:
| match_key | match_id | tournament_id | player_id_p1 | player_id_p2 | p1_final_score | p2_final_score | start_date | current_flag |
|-----------|----------|---------------|--------------|--------------|----------------|----------------|------------------|--------------|
| 1 | 1 | 1 | 1 | 2 | 1 | 0 | 01/01/2000 00:00 | Y |
| 2 | 2 | 1 | 1 | 2 | 3 | 1 | 02/01/2000 00:00 | Y |
| 3 | 3 | 2 | 3 | 2 | 2 | 3 | 03/01/2000 00:00 | Y |
| 4 | 4 | 2 | 3 | 2 | 4 | 0 | 04/01/2000 00:00 | N |
| 5 | 4 | 2 | 3 | 2 | 4 | 1 | 04/01/2000 00:01 | Y |
However, I realised I was applying a 'dimension' principle to a 'fact' table.
Is this a viable approach or should I be looking at a different design?
Jossy
(83 rep)
Jun 6, 2022, 08:28 PM
• Last activity: Jun 6, 2022, 09:04 PM
0 votes · 1 answer · 43 views
Do I have the right Slowly Changing Dimensions type for my version controlled tennis match database?
I'm trying to version control my database using the principles of Slowly Changing Dimensions. I've opted to use Type 2 with generation start and end columns instead of datetimes.
In a simplified example I have three tables:
**player:**
| player_key | player_id | country_id | start | end |
|------------|-----------|------------|-------|-----|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 2 | 2 | 1 | |
| 3 | 1 | 3 | 2 | |
**tournament:**
| tournament_key | tournament_id | surface_id | start | end |
|----------------|---------------|------------|-------|-----|
| 1 | 1 | 1 | 1 | 2 |
| 2 | 1 | 2 | 2 | |
**tennis_match:**
| match_id | tournament_key | player_key_p1 | player_key_p2 | start | end |
|----------|----------------|---------------|---------------|-------|-----|
| 1 | 1 | 1 | 2 | 1 | |
| 2 | 1 | 1 | 2 | 1 | |
| 3 | 2 | 3 | 2 | 2 | |
| 4 | 2 | 3 | 2 | 2 | |
I now want to extract all the matches and their respective tournament and player data to run some analysis on it. If I run the following query:
SELECT
    match_id,
    tournament_key,
    player_key_p1,
    player_key_p2,
    t.surface_id,
    p1.country_id,
    p2.country_id
FROM
    tennis_match AS m
JOIN
    player AS p1 ON p1.player_key = m.player_key_p1
JOIN
    player AS p2 ON p2.player_key = m.player_key_p2
JOIN
    tournament AS t ON t.tournament_key = m.tournament_key
This gives me:
| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1 | 1 | 1 | 2 | 1 | 1 | 2 |
| 2 | 1 | 1 | 2 | 1 | 1 | 2 |
| 3 | 2 | 3 | 2 | 2 | 3 | 2 |
| 4 | 2 | 3 | 2 | 2 | 3 | 2 |
The issue I'm facing is that the surface_id and p1_country_id change part way through the matches because, well, they changed part way through the matches. However, for the purposes of my analysis, at match_id = 4 I should be using the values of the latest versions of player and tournament:
| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1 | 1 | 1 | 3 | 2 | 3 | 2 |
| 2 | 1 | 1 | 3 | 2 | 3 | 2 |
| 3 | 2 | 2 | 3 | 2 | 3 | 2 |
| 4 | 2 | 2 | 3 | 2 | 3 | 2 |
So I figure that to get the data in the format I need, I'm going to have to write some reasonably complex (for me) queries. This has got me questioning whether I have the right structure.
If I'd gone for a Type 4 approach then my queries on the non-history tables would be nice and simple. However, if I wanted to run an analysis from a point in the past I'd have to head to the history table and I reckon I'd have the same challenge as I have now. Plus I'd have the added hassle of managing history tables and having to figure out a solution for deleted records.
I did look at Type 6, but this looked like I needed to duplicate the version-controlled columns: one set for the current_state and one for the historic_state. As some of the version-controlled tables have hundreds of columns, this didn't seem like the right approach either, so I didn't review it much further.
Finally getting to my question... do I have the right data structure and just need to knuckle down on query writing or could I implement a better design?
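For example, to resolve every match to the latest versions, I think I'd have to hop from the stored key back to the natural key and then to the open-ended row - something like this (untested; assuming an empty end is NULL, and end may need quoting as a reserved word):

SELECT m.match_id,
       t_cur.surface_id,
       p1_cur.country_id AS p1_country_id,
       p2_cur.country_id AS p2_country_id
FROM tennis_match AS m
JOIN player AS p1     ON p1.player_key = m.player_key_p1
JOIN player AS p1_cur ON p1_cur.player_id = p1.player_id AND p1_cur."end" IS NULL
JOIN player AS p2     ON p2.player_key = m.player_key_p2
JOIN player AS p2_cur ON p2_cur.player_id = p2.player_id AND p2_cur."end" IS NULL
JOIN tournament AS t     ON t.tournament_key = m.tournament_key
JOIN tournament AS t_cur ON t_cur.tournament_id = t.tournament_id AND t_cur."end" IS NULL;

Doubling every join like this is what makes me wonder if the structure is wrong.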
Jossy
(83 rep)
Jun 4, 2022, 09:26 PM
• Last activity: Jun 6, 2022, 07:42 PM
0 votes · 1 answer · 282 views
Fact table linking to natural key rather than row key in a slowly changing dimension table?
I’m designing a data warehouse and have a Slowly Changing Dimension Type 2 table something like this:
UserKey | UserID | Group | EffectiveDate | ExpiryDate | IsCurrent |
---------|--------|----------|---------------|------------|-----------|
1 | 1001 | GR1 | 2021-01-01 | 9999-12-31 | Y |
2 | 1002 | GR1 | 2021-07-31 | 2022-02-28 | N |
3 | 1002 | GR2 | 2022-02-28 | 9999-12-31 | Y |
And I have a fact table, which whilst it does have dates, doesn't really relate to the user dimension at any particular point in time, say a subscriptions table like this:
SubsKey | SubsID | SubsType | UserKey / UserId ? | StartDate | EndDate
---------|--------|----------|-------------------| -----------|------------|
55501 | SBP501 | Premium | ??? / 2 | 2021-08-01 | 2022-08-01 |
55502 | SBB123 | Bonus | ??? / 2 | 2022-08-01 | 2022-09-01 |
The subscriptions could be set up for dates in the future for example, or be changed retrospectively. Or alternatively we might have a fact table which doesn't have any date aspect at all.
In such cases is it reasonable for the fact table to reference the natural key UserID instead of the type 2 dimension row key UserKey? It's not obvious which UserKey ought to be used.
I suppose one might use whichever row key is current at the time the fact row is generated, but that would seem to imply that the fact relates to a particular snapshot in time for the user, when in reality it doesn't.
stovroz
(631 rep)
Apr 6, 2022, 08:27 PM
• Last activity: Apr 6, 2022, 09:58 PM
0 votes · 0 answers · 434 views
Configure Type 2 SCD on existing tables in SQL Server
I've got two tables, FACILITIES and CONTRACTS. It's a 1:M facility->contract relationship. The tables have unique keys generated by the upstream system and are truncate-reloaded every night. Neither table has any primary key or foreign key/unique key enforcement. Users typically join on the facility ID, which exists on the contracts table.
I am trying to set up SCD Type 2 for these two tables and have created the appropriate and working code using the MERGE command, and have tested it successfully. I know that I will need to set up new primary/foreign keys, since Type 2 will cause the business key (what they are currently using as a unique identifier) to be duplicated.
However, I'm unsure how to handle existing data. Both tables have several thousand rows, so I don't know how to link existing contracts and facilities using the new PK/FK combos I set up. I know that going forward they will work fine, but not sure about now.
How do I link the existing data with the new PK/FK setup so the associations for the new keys are correct for SCD going forward?
Here's how they look now:
| FacilityID | Facility_PK (the one I set up) |
| -----------| -------------------------------|
| 12345 | 1 |
| 23456 | 2 |
| ContractID | Contract_PK | Facility ID | Facility_FK |
| ---------- | -------------|---------------------------------| ------------|
| 7890 | 1 | 12345 | ??? |
| 4444 | 2 | 12345 | ??? |
| 2345 | 3 | 23456 | ??? |
EDIT: Further context:
Once the Slowly Changing Dimension logic is implemented, the existing ID fields will be duplicated and really considered business keys.
| FacilityID | Facility_PK (the one I set up) | USER | Active Indicator |
| -----------| -------------------------------|----------|------------------|
| 12345 | 1 | User A |0 |
| 23456 | 2 | User A |1 |
| 12345 | 3 | User B |1 |
Now there is no longer a unique link to the CONTRACTS table. So I can add a PK to this FACILITIES table and keep the rows unique, but how do I update the data in the CONTRACTS foreign key so that this works properly?
If you were to now join based on FacilityID, you'd get back both records and would be unable to tell which contract is associated with which point-in-time facility record.
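The backfill I have in mind (SQL Server syntax; assuming exactly one current row per FacilityID and an ActiveIndicator flag as above) would be something like:

UPDATE c
SET    c.Facility_FK = f.Facility_PK
FROM   CONTRACTS  AS c
JOIN   FACILITIES AS f
       ON  f.FacilityID = c.FacilityID
       AND f.ActiveIndicator = 1;   -- point each contract at the current facility version

Is that the right way to seed the keys, or does history need special handling?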
seve
(1 rep)
Mar 13, 2021, 05:06 PM
• Last activity: Mar 14, 2021, 12:29 AM
1 vote · 2 answers · 2365 views
Data Warehouse - Slowly Changing Dimensions with Many to Many Relationships
As an example, let's say I have a fact table with two dimensions and one measure
**FactMoney table**
----------
ProjectKey int
PersonKey int
CashAmount money
----------
The two dimensions are defined like this:
**DimProject (a type 0 dimension - i.e. static)**
----------
ProjectKey int
ProjectName varchar(50)
----------
**DimPerson (a type 2 slowly changing dimension)**
----------
PersonKey int
PersonNaturalKey int
PersonName varchar(50)
EffectiveStartDate datetime
EffectiveEndDate datetime
IsCurrent bit
----------
Pretty straightforward so far. Now I'll introduce a Person Category concept.
**DimCategory**
----------
CategoryKey int
CategoryName varchar(50)
----------
And build an M2M relationship between DimPerson and DimCategory
**BridgePersonCategory**
----------
PersonKey int
CategoryKey int
----------
So - people can have 1..n categories.
My problem is - as Person is a slowly changing dimension, when a person's name changes, we add a new person row and update our effective dates and IsCurrent flags, no big deal.
But what do we do with the person's categories? Do we need to add more rows to the bridge table every time a new person version pops up?
And as a corollary, if a person's categories change, does that mean we need to create a new row in the person table?
(please note I've also asked this question at stack-O and as directed by the first commenter, now posted this question here.)
Scott Herbert
(193 rep)
Nov 1, 2013, 12:53 AM
• Last activity: Mar 12, 2021, 11:04 PM
1 vote · 1 answer · 59 views
Type 2 Dimension for a fact measure
I have a need to store products along with their prices and quantities (and many other things) and track the history.
Now, given both price and quantity are stored as integers, it is my understanding these are fact measures. So we might have a table like:
f_products:
id
sku
price
quantity
As the price varies over time we can insert a new fact, and I can introduce effective dates to the table to allow for querying the price or quantity at a given time.
But with this design, it means if the quantity changes, I must insert another row, repeating the unchanged price value again. And while this isn't too bad for 2 columns, I have in fact 10.
So, my design is:
f_products:
id
sku
price:
id
product_id
price
start_date
end_date
another "dimension":
id
product_id
value
start_date
end_date
and so on
Now the "fact" no longer needs to change when the price changes, and I can query for historical values across these tables.
So my question is, this doesn't seem to fit the standard fact / dimension designs, so I wonder how more experienced modellers would approach these requirements.
sksamuel
(113 rep)
Feb 23, 2021, 02:11 AM
• Last activity: Feb 23, 2021, 02:49 AM
1 vote · 3 answers · 854 views
The right way to model multiple FACTs schema
**Background**
I'm in the process of designing a database (using a STAR schema).
There are three tables to model: *products*, *tests*, *states*.
The database will be used to store results of *tests* conducted on *products* (in a great simplification). There can be many *tests* pointing out to a single product but each *test* is unique (they are not shared among *products*). Besides, I need to record the current *state* of the product, at the time when the *test* was conducted. Let's assume that the *state* of a *product* describes its current location and owner, which are changing very often. That will most likely involve SCD lvl 2 - to track the history of *state* changes and to be able to locate a *product* with all its *tests* as well as the *states* it had during these *tests*.
**Problem**
I'm not entirely sure how to model this problem. It seems obvious to store every *test* in a FACT table; this table would then consist of thousands of transactions. On the other hand, there will also be hundreds (and later thousands) of *products*, so I should probably keep them in a second FACT table. Then there will also be thousands of *state* changes, so in order to record their entire history, I would need to keep them in a... FACT table as well? I've been told that FACT tables are typically used to store many-row transactional data, but then where are the DIMs in this model?
I also don't know how to model the relationships between these tables. *Products*-*states* is a 1:* relationship. *Products*-*tests* is 1:* as well. Finally, *states*-*tests* is 1:* too. I would then go with linking *products* to *states* and then *states* to *tests* (*products* 1<-* *states* 1<-* *tests*), which would allow me to find all *states* for a particular *product* and all *tests* (in all *states* or in a selected *state*). What do you think about that? The problem here is that, as I keep adding *states*, I have two options: either keep duplicating *products* in the *products* table (with an added "recorded_timestamp" column) or use an SCD lvl 2 in the *states* table, pointing to the *products* table with a FK - but this would effectively make the *products* table a DIM!
Any help here would be very appreciated.
JustGiveMeData
(113 rep)
Jul 12, 2020, 10:14 PM
• Last activity: Jul 17, 2020, 01:12 PM
3 votes · 2 answers · 416 views
Historicizing of database models
We want to build an SQL database for a system to manage regulations in different countries for different years. The problem is that the regulations change every year for every country. For example, for social insurance you need different data in 2015 than you needed in 2014, and we have companies that enter their data to connect with their social insurance information for every year.
If I wanted to build my models on this data, I would have to create a different model for every year, each with different attributes. I can't just use migrations, because I have to keep the historical data and information.
Right now I am researching the right solution, and I would like to know: is there a name for this problem, so that I can find existing solutions?
herbertp.
(89 rep)
Dec 6, 2016, 08:11 AM
• Last activity: Jun 19, 2020, 07:00 PM
1 vote · 0 answers · 320 views
Create SCD type 2 table from change log table using SQL
I have a change log table, which captures the transition of a state with its time, for example:
user_id,user_status_from,user_status_to, created_date
7cc4d A2 A1 2019-11-03 23:04:26
7cc4d A1 A6 2019-11-03 23:05:28
7cc4d A6 I4 2019-11-16 10:00:34
7cc4d I4 A1 2020-03-16 10:00:36
Basically this table records the transition from one status to another. Here it says that the transition from **A2 --> A1** occurred at **2019-11-03 23:04:26** (first row), then the user moved from **A1 --> A6** at **2019-11-03 23:05:28**, and so on.
Now I want to write SQL to build a table that tells the history of each status, for example: the user was in a particular status from this date to that date; basically the lifetime of each status. I want to create a table like the one below:
userId status startingdatetime endingdatetime
7cc4d A1 2019-11-03 23:04:26 2019-11-03 23:05:28
7cc4d A6 2019-11-03 23:05:28 2019-11-16 10:00:34
7cc4d I4 2019-11-16 10:00:34 2020-03-16 10:00:36
7cc4d A1 2020-03-16 10:00:36 2020-04-02 08:24:50
As you can see, this table is simply a logical derivation of the table above: each row says how long a particular status lasted, from one date to another.
How can I write SQL to create this final table? (Please note that the **LAG/LEAD** functions don't work in my version of MySQL Server.)
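Since LAG/LEAD aren't available, one idea is a correlated subquery that fetches the next transition time per user (assuming the table is named change_log; the open-ended current status comes back as NULL and could be COALESCEd to NOW()):

SELECT c1.user_id,
       c1.user_status_to AS status,
       c1.created_date   AS startingdatetime,
       (SELECT MIN(c2.created_date)
          FROM change_log AS c2
         WHERE c2.user_id = c1.user_id
           AND c2.created_date > c1.created_date) AS endingdatetime
FROM change_log AS c1
ORDER BY c1.user_id, c1.created_date;

Would this be acceptable performance-wise, or is there a better way on an old MySQL version?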
user3065757
(111 rep)
Apr 5, 2020, 04:02 AM