Aggregate SCD Type 2 data "as of" each day
## Problem
When working with SCD type 2 data it's easy to see the state of a table "as of" a given point in time by using your date columns, e.g. `valid_from` and `valid_to`. For example:

```sql
select * from table
where '2023-11-01' between valid_from and valid_to
```

You can then group by or use window functions to aggregate.
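As a concrete sketch of that single-date aggregation, using the `Reservations` example table defined below:

```sql
-- Aggregate the state of the table as of one fixed date
select
  event_id,
  reservation_status,
  sum(quantity) as sum_quantity
from Reservations
where date '2023-01-02' between valid_from and valid_to
group by event_id, reservation_status;
```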
But what if I want to do that repeatedly, for each date in a range (e.g. daily)? I do not need to aggregate *across* these dates, just within each one.
## Example
Let's say I have a table that tracks the `quantity` of people for each `reservation_id`. Both the `quantity` of people and the `reservation_status` can change over time, as tracked by the table. Each reservation is attached to an `event_id`. `event_date` is included to help limit scope if needed (see assumptions below).
| reservation_id | reservation_status | quantity | event_id | event_date | valid_from | valid_to |
| -------------- | ------------------ | -------- | -------- | ---------- | ---------- | -------- |
| 1 | active | 4 | 100 | 2023-05-25 | 2023-01-01 | 2023-01-02 |
| 2 | active | 2 | 200 | 2024-01-07 | 2023-01-01 | 2023-01-03 |
| 3 | active | 7 | 100 | 2023-05-25 | 2023-01-02 | 9999-12-31 |
| 4 | active | 1 | 200 | 2024-01-07 | 2023-01-02 | 9999-12-31 |
| 1 | active | 5 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 5 | active | 8 | 100 | 2023-05-25 | 2023-01-03 | 9999-12-31 |
| 2 | cancelled | 2 | 200 | 2024-01-07 | 2023-01-04 | 9999-12-31 |
| 6 | active | 3 | 100 | 2023-05-25 | 2023-01-06 | 9999-12-31 |
Here it is as PostgreSQL (since BigQuery is more difficult to test in a db-fiddle):

```sql
CREATE TABLE Reservations (
  "reservation_id" INTEGER,
  "reservation_status" VARCHAR(9),
  "quantity" INTEGER,
  "event_id" INTEGER,
  "event_date" DATE,
  "valid_from" DATE,
  "valid_to" DATE
);

INSERT INTO Reservations
  ("reservation_id", "reservation_status", "quantity", "event_id", "event_date", "valid_from", "valid_to")
VALUES
  ('1', 'active',    '4', '100', '2023-05-25', '2023-01-01', '2023-01-02'),
  ('2', 'active',    '2', '200', '2024-01-07', '2023-01-01', '2023-01-03'),
  ('3', 'active',    '7', '100', '2023-05-25', '2023-01-02', '9999-12-31'),
  ('4', 'active',    '1', '200', '2024-01-07', '2023-01-02', '9999-12-31'),
  ('1', 'active',    '5', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
  ('5', 'active',    '8', '100', '2023-05-25', '2023-01-03', '9999-12-31'),
  ('2', 'cancelled', '2', '200', '2024-01-07', '2023-01-04', '9999-12-31'),
  ('6', 'active',    '3', '100', '2023-05-25', '2023-01-06', '9999-12-31');
```
While this is ultimately for BigQuery, an answer in any dialect will be accepted as long as it's somewhat generic.
##### Assumptions

- "As of" dates can be a list or range based off the min/max of `valid_from`
- A `valid_to` of `9999-12-31` marks the most recent (current) data
- All reservations for a given event will fall between `event_date - INTERVAL '2 years'` and `event_date`. This doesn't change anything for this example, but may be useful for scaling(?)
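Per the first assumption, the list of "as of" dates could itself be derived from the table. A PostgreSQL sketch using `generate_series` (BigQuery's equivalent would be `GENERATE_DATE_ARRAY`):

```sql
-- Build the as-of date list from the min/max of valid_from
select d::date as as_of_date
from generate_series(
       (select min(valid_from) from Reservations),
       (select max(valid_from) from Reservations),
       interval '1 day'
     ) as d;
```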
### Desired output
I'd like to know the sum of `quantity` grouped by `event_id` and `reservation_status` *as of* each interval (day).
| as_of_date | event_id | reservation_status | sum_quantity |
| ---------- | -------- | ------------------ | ------------ |
| 2023-01-01 | 100 | active | 4 |
| 2023-01-01 | 200 | active | 2 |
| 2023-01-02 | 100 | active | 11 |
| 2023-01-02 | 200 | active | 2 |
| 2023-01-03 | 100 | active | 20 |
| 2023-01-03 | 200 | active | 3 |
| 2023-01-04 | 100 | active | 20 |
| 2023-01-04 | 200 | active | 1 |
| 2023-01-04 | 200 | cancelled | 2 |
| 2023-01-06 | 100 | active | 23 |
*Row values are a rough estimate, and will differ if using the complete range of dates.*
I essentially want to do the following:
```sql
/* Invalid SQL, just for conceptual purposes */
-- Given a list of dates, for each "date":
select
  event_id,
  reservation_status,
  sum(quantity)
from table
where {{date}} between valid_from and valid_to
group by
  event_id,
  reservation_status
```
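One set-based way to express that loop, as a sketch in PostgreSQL (in BigQuery the date list could come from `UNNEST(GENERATE_DATE_ARRAY(...))` instead): generate the date list, join it to the table on the validity window, and aggregate per date:

```sql
-- Join each candidate date to the rows valid on that date, then aggregate
with as_of_dates as (
  select d::date as as_of_date
  from generate_series(
         (select min(valid_from) from Reservations),
         (select max(valid_from) from Reservations),
         interval '1 day'
       ) as d
)
select
  a.as_of_date,
  r.event_id,
  r.reservation_status,
  sum(r.quantity) as sum_quantity
from as_of_dates a
join Reservations r
  on a.as_of_date between r.valid_from and r.valid_to
group by a.as_of_date, r.event_id, r.reservation_status
order by a.as_of_date, r.event_id, r.reservation_status;
```

The join replaces the per-date loop: each row of `as_of_dates` picks up every reservation version whose `valid_from`/`valid_to` window covers it, so the `group by` produces one aggregate per date without any procedural code.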
I believe this can be done using a procedural language, such as a for loop, but I feel like I am overthinking it and just having trouble combining simpler concepts.
Asked by camtech
(143 rep)
Nov 28, 2023, 05:58 AM
Last activity: Nov 29, 2023, 08:03 PM