
Do I have the right Slowly Changing Dimensions type for my version controlled tennis match database?

I'm trying to version control my database using the principles of Slowly Changing Dimensions. I've opted to use Type 2 with generation start and end columns instead of datetimes. In a simplified example I have three tables:

**player:**

| player_key | player_id | country_id | start | end |
|------------|-----------|------------|-------|-----|
| 1          | 1         | 1          | 1     | 2   |
| 2          | 2         | 2          | 1     |     |
| 3          | 1         | 3          | 2     |     |

**tournament:**

| tournament_key | tournament_id | surface_id | start | end |
|----------------|---------------|------------|-------|-----|
| 1              | 1             | 1          | 1     | 2   |
| 2              | 1             | 2          | 2     |     |

**tennis_match:**

| match_id | tournament_key | player_key_p1 | player_key_p2 | start | end |
|----------|----------------|---------------|---------------|-------|-----|
| 1        | 1              | 1             | 2             | 1     |     |
| 2        | 1              | 1             | 2             | 1     |     |
| 3        | 2              | 3             | 2             | 2     |     |
| 4        | 2              | 3             | 2             | 2     |     |

I now want to extract all the matches and their respective tournament and player data to run some analysis on it. If I run the following query:
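SELECT
    -- columns are qualified because tournament_key exists in two tables
    m.match_id,
    m.tournament_key,
    m.player_key_p1,
    m.player_key_p2,
    t.surface_id,
    p1.country_id AS p1_country_id,
    p2.country_id AS p2_country_id
FROM
    tennis_match AS m
        JOIN
    player AS p1 ON p1.player_key = m.player_key_p1
        JOIN
    -- second player join needs its own alias and the p2 key
    player AS p2 ON p2.player_key = m.player_key_p2
        JOIN
    tournament AS t ON t.tournament_key = m.tournament_key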
This gives me:

| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1        | 1              | 1             | 2             | 1          | 1             | 2             |
| 2        | 1              | 1             | 2             | 1          | 1             | 2             |
| 3        | 2              | 3             | 2             | 2          | 3             | 2             |
| 4        | 2              | 3             | 2             | 2          | 3             | 2             |

The issue I'm facing is that the surface_id and p1_country_id change part way through the matches because, well, they changed part way through the matches. However, for the purposes of my analysis at match_id = 4 I should be using the values of the latest versions of player and tournament for every match:

| match_id | tournament_key | player_key_p1 | player_key_p2 | surface_id | p1_country_id | p2_country_id |
|----------|----------------|---------------|---------------|------------|---------------|---------------|
| 1        | 1              | 1             | 3             | 2          | 3             | 2             |
| 2        | 1              | 1             | 3             | 2          | 3             | 2             |
| 3        | 2              | 2             | 3             | 2          | 3             | 2             |
| 4        | 2              | 2             | 3             | 2          | 3             | 2             |

So I figure that to get the data in the format I need, I'm going to have to write some reasonably complex queries (for me). This has got me questioning whether I have the right structure.

If I'd gone for a Type 4 approach then my queries on the non-history tables would be nice and simple. However, if I wanted to run an analysis from a point in the past I'd have to head to the history table, and I reckon I'd have the same challenge as I have now. Plus I'd have the added hassle of managing history tables and having to figure out a solution for deleted records.

I did look at Type 6, but it looked like I'd need to duplicate every version controlled column: one to hold a current_state and one to hold a historic_state. As some of the version controlled tables have hundreds of columns this didn't seem like the right approach either, so I didn't review it much further.

Finally getting to my question...

Do I have the right data structure and just need to knuckle down on query writing, or could I implement a better design?
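For reference, here is a minimal sketch of the kind of "as of generation" query I think the Type 2 structure would need. It loads the sample rows above into an in-memory SQLite database and hops from each surrogate key to the natural key (player_id, tournament_id), then back to whichever version was valid at the chosen generation. It assumes `end` is exclusive (a row with start = 1 and end = 2 is valid at generation 1 only) and that a blank `end` means "still current". The function name `matches_as_of` and the `_g` aliases are just names I made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "start" and "end" are quoted because END is an SQL keyword.
conn.executescript("""
CREATE TABLE player (player_key INTEGER PRIMARY KEY, player_id INTEGER,
                     country_id INTEGER, "start" INTEGER, "end" INTEGER);
CREATE TABLE tournament (tournament_key INTEGER PRIMARY KEY, tournament_id INTEGER,
                         surface_id INTEGER, "start" INTEGER, "end" INTEGER);
CREATE TABLE tennis_match (match_id INTEGER PRIMARY KEY, tournament_key INTEGER,
                           player_key_p1 INTEGER, player_key_p2 INTEGER,
                           "start" INTEGER, "end" INTEGER);
INSERT INTO player VALUES (1, 1, 1, 1, 2), (2, 2, 2, 1, NULL), (3, 1, 3, 2, NULL);
INSERT INTO tournament VALUES (1, 1, 1, 1, 2), (2, 1, 2, 2, NULL);
INSERT INTO tennis_match VALUES (1, 1, 1, 2, 1, NULL), (2, 1, 1, 2, 1, NULL),
                                (3, 2, 3, 2, 2, NULL), (4, 2, 3, 2, 2, NULL);
""")

# Each dimension is joined twice: once on the surrogate key stored in the match
# (to find the natural key), and once on the natural key restricted to the
# version valid at generation :g (assumption: "end" is exclusive).
AS_OF_SQL = """
SELECT m.match_id,
       t_g.surface_id,
       p1_g.country_id AS p1_country_id,
       p2_g.country_id AS p2_country_id
FROM tennis_match AS m
JOIN tournament AS t    ON t.tournament_key = m.tournament_key
JOIN tournament AS t_g  ON t_g.tournament_id = t.tournament_id
     AND t_g."start" <= :g AND (t_g."end" IS NULL OR t_g."end" > :g)
JOIN player AS p1       ON p1.player_key = m.player_key_p1
JOIN player AS p1_g     ON p1_g.player_id = p1.player_id
     AND p1_g."start" <= :g AND (p1_g."end" IS NULL OR p1_g."end" > :g)
JOIN player AS p2       ON p2.player_key = m.player_key_p2
JOIN player AS p2_g     ON p2_g.player_id = p2.player_id
     AND p2_g."start" <= :g AND (p2_g."end" IS NULL OR p2_g."end" > :g)
WHERE m."start" <= :g AND (m."end" IS NULL OR m."end" > :g)
ORDER BY m.match_id
"""


def matches_as_of(connection, generation):
    """Return (match_id, surface_id, p1_country_id, p2_country_id) rows
    using the dimension versions valid at the given generation."""
    return connection.execute(AS_OF_SQL, {"g": generation}).fetchall()


for row in matches_as_of(conn, 2):
    print(row)
```

Passing the latest generation (2 in the sample data) gives every match the current surface_id and country_id values, while passing 1 gives the view as it stood before the changes. If that double join is acceptable, the same query shape would cover both the "latest" and the "point in the past" analyses without separate history tables.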
Asked by Jossy (83 rep)
Jun 4, 2022, 09:26 PM
Last activity: Jun 6, 2022, 07:42 PM