What's the fastest way to query multiple tables at once where each table represents one year of data?
1 vote, 1 answer, 449 views
I'm developing a web app based around a database with multiple decades of data and multiple schema changes over the years. Each year of data is released as a zip file containing CSVs named table1.csv, table2.csv, table3.csv, etc.
My initial goal was to just get the data into an _intermediate_ MySQL database without making any significant changes. Since there are schema changes over the years, I couldn't get away with simply creating table1, table2, table3 tables, so my compromise was to add a year prefix. For example, 1990_table1, 1991_table1, 1992_table1, etc. Besides this, I added some basic indexes on columns that can uniquely identify a record, which I'll get into next.
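Roughly, one of these year-prefixed tables and its indexes looks like the sketch below (column names and types are placeholders, the real schemas vary by year; the index names are the ones that show up in the EXPLAIN output further down):

CREATE TABLE 1990_table1 (
    column1    VARCHAR(20),
    column2    VARCHAR(20),
    column3    VARCHAR(20),
    column4    VARCHAR(20),
    column5    VARCHAR(20),
    date_month TINYINT,
    date_day   TINYINT,
    date_year  SMALLINT,
    ZIP        CHAR(5)
    -- ...plus the remaining columns from table1.csv
);

-- Basic index on the columns that can uniquely identify a record,
-- plus an index on ZIP for lookups by zip code.
CREATE INDEX COMBINED_IDENTIFIER_KEY_INDEX
    ON 1990_table1 (column1, column2, column3, column4, column5, date_month, date_day, date_year);
CREATE INDEX ZIP_INDEX ON 1990_table1 (ZIP);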
One of the complications with the schema is that there's no single primary key. column1, column2, column3, column4, column5, as well as three date columns (date_month, date_day, and date_year), comprise the unique identifier in the early years. In the middle years, the three date columns are combined into one date column with varying formats such as mddyy and, later, mddyyyy. In recent years, the data has been released with a primary key that combines all of the columnX columns and the date column, for example: column1_column2_column3_column4_column5_date.
To normalize the date formats, I created a separate table that has a standard date column, with columns that tie back to the original tables. Then I imported the data for each year, using a few functions to parse the varying date formats. Now, to get a standard date for each record, I'd run:
-- Early years.
SELECT
    1990_table1.*,
    date_table.normalized_date
FROM 1990_table1
INNER JOIN date_table
    ON 1990_table1.column1 = date_table.column1
    AND 1990_table1.column2 = date_table.column2
    AND 1990_table1.column3 = date_table.column3
    AND 1990_table1.column4 = date_table.column4
    AND 1990_table1.column5 = date_table.column5
    AND 1990_table1.month = date_table.month
    AND 1990_table1.day = date_table.day
    AND 1990_table1.year = date_table.year;
-- Middle years.
SELECT
    2000_table1.*,
    date_table.date
FROM 2000_table1
INNER JOIN date_table
    ON 2000_table1.column1 = date_table.column1
    AND 2000_table1.column2 = date_table.column2
    AND 2000_table1.column3 = date_table.column3
    AND 2000_table1.column4 = date_table.column4
    AND 2000_table1.column5 = date_table.column5
    AND 2000_table1.date = date_table.date;
-- Recent years.
SELECT
    2020_table1.*,
    date_table.date
FROM 2020_table1
INNER JOIN date_table
    ON 2020_table1.combined_identifier = date_table.combined_identifier;
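For context, the date_table import parses each year's raw date format into the normalized column. A rough, hypothetical sketch of that step for the middle-year formats (mddyy and, later, mddyyyy), assuming the raw value is stored as a string and reusing the column names from the queries above; the actual parsing functions differ:

INSERT INTO date_table (column1, column2, column3, column4, column5, date, normalized_date)
SELECT
    column1, column2, column3, column4, column5, date,
    -- Left-pad to a fixed width so single-digit months parse correctly,
    -- then convert with the matching format string.
    CASE
        WHEN CHAR_LENGTH(date) <= 6 THEN STR_TO_DATE(LPAD(date, 6, '0'), '%m%d%y')
        ELSE STR_TO_DATE(LPAD(date, 8, '0'), '%m%d%Y')
    END
FROM 2000_table1;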
My next step was creating views for each year using the JOINs above. In addition to getting a standard date, there are two more joins pulling in data from 1-to-1 tables. These tables relate back to the original table, again, based on the columnX and date columns.
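Each per-year view therefore looks something like the sketch below (the date join mirrors the "early years" query above; the 1-to-1 joins are abbreviated and extra_column is a placeholder):

CREATE VIEW 1990_table1_view AS
SELECT
    1990_table1.*,
    date_table.normalized_date,
    1990_table2.extra_column
FROM 1990_table1
INNER JOIN date_table
    ON 1990_table1.column1 = date_table.column1
    AND 1990_table1.column2 = date_table.column2
    AND 1990_table1.column3 = date_table.column3
    AND 1990_table1.column4 = date_table.column4
    AND 1990_table1.column5 = date_table.column5
    AND 1990_table1.month = date_table.month
    AND 1990_table1.day = date_table.day
    AND 1990_table1.year = date_table.year
INNER JOIN 1990_table2  -- one of the 1-to-1 tables
    ON 1990_table2.column1 = 1990_table1.column1
    AND 1990_table2.column2 = 1990_table1.column2
    -- ...plus the remaining identifier/date conditions; the second 1-to-1
    -- join follows the same pattern.
;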
After that, I created another view for testing purposes that UNION ALLs all the years with the same schema, for example 1990-1999. My ultimate goal is to create a "master" view that allows me to query all the years at once, but for testing purposes it's 1990-1999 currently:
SELECT * FROM 1990_to_1999_table1_view
WHERE YEAR(date) = '1990' AND ZIP = '90210' LIMIT 100;
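Here, 1990_to_1999_table1_view is just a UNION ALL of the per-year views, along these lines:

CREATE VIEW 1990_to_1999_table1_view AS
SELECT * FROM 1990_table1_view
UNION ALL
SELECT * FROM 1991_table1_view
UNION ALL
SELECT * FROM 1992_table1_view
-- ...and so on through...
UNION ALL
SELECT * FROM 1999_table1_view;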
The problem is that this query is far too slow to be used in an autosuggest search field in my web app. I'm not quite sure why it's so slow (single-digit seconds). Ideally, I'd like to get it under 250 ms.
Here's what EXPLAIN shows. The Nested loop inner join block is repeated and looks more or less the same for each year in the view that UNIONs all the years together; I've cut off the other Nested loop inner join blocks for brevity.
-> Limit: 100 row(s) (cost=232.39..232.39 rows=0.1)
    -> Table scan on my_view (cost=232.39..232.39 rows=0.1)
        -> Union all materialize (cost=229.89..229.89 rows=0.1)
            -> Nested loop inner join (cost=1.12 rows=0.0007)
                -> Nested loop inner join (cost=0.86 rows=0.01)
                    -> Nested loop inner join (cost=0.79 rows=0.1)
                        -> Index lookup on 1990_table1 using ZIP_INDEX (ZIP='90210') (cost=0.35 rows=1)
                        -> Filter: (1990_table2.column1 = 1990_table1.column1) (cost=0.32 rows=0.1)
                            -> Index lookup on 1990_table2 using column2_INDEX (column2=1990_table1.column2) (cost=0.32 rows=1)
                    -> Filter: (1990_table1.column1 = 1990_table1.column1) (cost=0.61 rows=0.1)
                        -> Index lookup on 1990_table1 using COMBINED_IDENTIFIER_KEY_INDEX (column2=1990_table1.column2, column3=1990_table1.column3, column4=1990_table1.column4) (cost=0.61 rows=1)
                -> Filter: ((date_table.year = 1990_table1.year) and (date_table.day = 1990_table1.day) and (date_table.month = 1990_table1.month) and (date_table.column2 = 1990_table1.column2) and (date_table.column1 = 1990_table1.column1) and (date_table.date = 1990_table1.date) and (1990_table1.column4 = date_table.column4) and (year(date_table.DATE) = 1990)) (cost=17.77 rows=0.05)
                    -> Index lookup on date_table using column3_index (column3=1990_table1.column3), with index condition: (1990_table1.column3 = date_table.column3) (cost=17.77 rows=19)
Is it a bad idea to create a main view (table1_view) of sub-views (1990_to_1999_table1_view, 2000_to_2009_table1_view, etc.) of sub-sub-views (1990_table1_view, 1991_table1_view, etc.)? Or can I improve the queries with different joins or indexes? Or should I bite the bullet and create brand new tables (all-in-all about 100 GB) with a unified schema that all the year-prefixed tables can be imported into?
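For clarity, the unified-schema option would mean one table per logical dataset with a year column, populated from each year-prefixed table. A hypothetical sketch (table, column, and index names are placeholders):

CREATE TABLE table1_unified (
    data_year       SMALLINT NOT NULL,
    column1         VARCHAR(20),
    column2         VARCHAR(20),
    column3         VARCHAR(20),
    column4         VARCHAR(20),
    column5         VARCHAR(20),
    normalized_date DATE,
    ZIP             CHAR(5),
    -- ...remaining columns shared across years...
    KEY zip_year_idx (ZIP, data_year)
);

INSERT INTO table1_unified
    (data_year, column1, column2, column3, column4, column5, normalized_date, ZIP)
SELECT 1990, column1, column2, column3, column4, column5, normalized_date, ZIP
FROM 1990_table1_view;
-- ...repeated (or scripted) for each year-prefixed table.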
Asked by Tyler (111 rep)
Apr 25, 2023, 05:09 PM
Last activity: May 27, 2025, 03:10 AM