What's the fastest way to query multiple tables at once where each table represents one year of data?

1 vote
1 answer
449 views
I'm developing a web app based around a database with multiple decades of data and multiple schema changes over the years. Each year of data is released as a zip file containing CSVs named table1.csv, table2.csv, table3.csv, etc. My initial goal was to just get the data into an _intermediate_ MySQL database without making any significant changes. Since there are schema changes over the years, I couldn't get away with simply creating table1, table2, table3 tables, so my compromise was to add a year prefix: for example, 1990_table1, 1991_table1, 1992_table1, etc. Besides this, I added some basic indexes on the columns that can uniquely identify a record, which I'll get into next.

One of the complications with the schema is that there's no single primary key. column1, column2, column3, column4, column5, as well as three date columns date_month, date_day, and date_year, comprise the unique identifier in the early years. In the middle years, the three date columns are combined into one date column with varying formats such as mddyy and, later, mddyyyy. In recent years, data has been released with a primary key that combines all of the columnX columns and the date column, for example: column1_column2_column3_column4_column5_date.

To normalize the date formats, I created a separate table that has a standard date column along with columns that tie back to the original tables. Then I imported the data for each year, using a few functions to parse the varying date formats. Now, to get a standard date for each record, I'd run:
-- Early years.
SELECT
    1990_table1.*,
    date_table.normalized_date
FROM 1990_table1
INNER JOIN
    date_table
ON
    1990_table1.column1 = date_table.column1 AND
    1990_table1.column2 = date_table.column2 AND
    1990_table1.column3 = date_table.column3 AND
    1990_table1.column4 = date_table.column4 AND
    1990_table1.column5 = date_table.column5 AND
    1990_table1.month   = date_table.month   AND
    1990_table1.day     = date_table.day     AND
    1990_table1.year    = date_table.year;

-- Middle years.
SELECT
    2000_table1.*,
    date_table.date
FROM 2000_table1
INNER JOIN
    date_table
ON
    2000_table1.column1 = date_table.column1 AND
    2000_table1.column2 = date_table.column2 AND
    2000_table1.column3 = date_table.column3 AND
    2000_table1.column4 = date_table.column4 AND
    2000_table1.column5 = date_table.column5 AND
    2000_table1.date    = date_table.date;

-- Recent years.
SELECT
    2020_table1.*,
    date_table.date
FROM 2020_table1
INNER JOIN
    date_table
ON
    2020_table1.combined_identifier = date_table.combined_identifier;
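The "few functions to parse the varying date formats" mentioned above boil down to roughly the following during import. This is only a simplified sketch of the middle-years case: the column names and the left-padding trick for the missing leading zero in the month are illustrative, not my exact import code.

-- Sketch: populate date_table for a middle-years table, deriving
-- normalized_date from the variable-width mddyy / mddyyyy strings.
INSERT INTO date_table (column1, column2, column3, column4, column5, date, normalized_date)
SELECT
    column1, column2, column3, column4, column5, date,
    CASE
        WHEN CHAR_LENGTH(date) <= 6
            THEN STR_TO_DATE(LPAD(date, 6, '0'), '%m%d%y')   -- mddyy
        ELSE STR_TO_DATE(LPAD(date, 8, '0'), '%m%d%Y')       -- mddyyyy
    END
FROM 2000_table1;
-- (The early years can be built directly from the month, day, and year columns.)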
My next step was creating views for each year using the JOINs above. In addition to getting a standard date, there are two more joins pulling in data from 1-to-1 tables; these tables relate back to the original table, again based on the columnX and date columns. After that, I created another view for testing purposes that UNION ALLs all the years that share the same schema (for example, 1990-1999). My ultimate goal is to create a "master" view that allows me to query all the years at once, but for testing purposes it's 1990-1999 currently (the view layering is sketched after the query below):
SELECT * FROM 1990_to_1999_table1_view WHERE YEAR(date) = '1990' AND ZIP = '90210' LIMIT 100;
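To make the layering concrete, the views are built roughly like this (simplified; each per-year view is just the corresponding SELECT above wrapped in a CREATE VIEW, and the real view bodies also include the two extra 1-to-1 joins):

-- e.g. CREATE VIEW 1990_table1_view AS <the early-years SELECT shown earlier>;
-- Then one UNION ALL view per run of years that share a schema:
CREATE VIEW 1990_to_1999_table1_view AS
SELECT * FROM 1990_table1_view
UNION ALL
SELECT * FROM 1991_table1_view
UNION ALL
SELECT * FROM 1992_table1_view;
-- ...and so on through 1999_table1_view. The eventual "master" view would then
-- UNION ALL the decade views together:
-- CREATE VIEW table1_view AS
--     SELECT * FROM 1990_to_1999_table1_view
--     UNION ALL
--     SELECT * FROM 2000_to_2009_table1_view
--     ...;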
The problem is that this query against 1990_to_1999_table1_view is far too slow to back an autosuggest search field in my web app. I'm not quite sure why it's so slow (single-digit seconds); ideally, I'd like to get it under 250 ms. Here's what EXPLAIN shows. The Nested loop inner join block is repeated and looks more or less the same for each year in the view that UNIONs all the years together; I cut off the other Nested loop inner joins for brevity.
-> Limit: 100 row(s)  (cost=232.39..232.39 rows=0.1)
    -> Table scan on my_view  (cost=232.39..232.39 rows=0.1)
        -> Union all materialize  (cost=229.89..229.89 rows=0.1)
            -> Nested loop inner join  (cost=1.12 rows=0.0007)
                -> Nested loop inner join  (cost=0.86 rows=0.01)
                    -> Nested loop inner join  (cost=0.79 rows=0.1)
                        -> Index lookup on 1990_table1 using ZIP_INDEX (ZIP='90210')  (cost=0.35 rows=1)
                        -> Filter: (1990_table2.column1 = 1990_table1.column1)  (cost=0.32 rows=0.1)
                            -> Index lookup on 1990_table2 using column2_INDEX (column2=1990_table1.column2)  (cost=0.32 rows=1)
                    -> Filter: (1990_table1.column1 = 1990_table1.column1)  (cost=0.61 rows=0.1)
                        -> Index lookup on 1990_table1 using COMBINED_IDENTIFIER_KEY_INDEX (column2=1990_table1.column2, column3=1990_table1.column3, column4=1990_table1.column4)  (cost=0.61 rows=1)
                -> Filter: ((date_table.year = 1990_table1.year) and (date_table.day = 1990_table1.day) and (date_table.month = 1990_table1.month) and (date_table.column2 = 1990_table1.column2) and (date_table.column1 = 1990_table1.column1) and (date_table.date = 1990_table1.date) and (1990_table1.column4 = date_table.column4) and (year(date_table.DATE) = 1990))  (cost=17.77 rows=0.05)
                    -> Index lookup on date_table using column3_index (column3=1990_table1.column3), with index condition: (1990_table1.column3 = date_table.column3)  (cost=17.77 rows=19)
Is it a bad idea to create a main view (table1_view) out of sub-views (1990_to_1999_table1_view, 2000_to_2009_table1_view, etc.), which are themselves built from sub-sub-views (1990_table1_view, 1991_table1_view, etc.)? Or can I improve the queries with different joins or indexes? Or should I bite the bullet and create brand-new tables (all in all, about 100 GB) with a unified schema into which all the year-prefixed tables can be imported?
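For what it's worth, the unified-schema option I have in mind would look roughly like this; the column list, types, and index are only illustrative, and table1_unified is a placeholder name:

CREATE TABLE table1_unified (
    column1         VARCHAR(64) NOT NULL,
    column2         VARCHAR(64) NOT NULL,
    column3         VARCHAR(64) NOT NULL,
    column4         VARCHAR(64) NOT NULL,
    column5         VARCHAR(64) NOT NULL,
    normalized_date DATE        NOT NULL,
    ZIP             CHAR(5),
    -- ...remaining payload columns, defined once for all years...
    KEY idx_zip_date (ZIP, normalized_date)
);

-- Load one era at a time, reusing the views that already resolve the
-- varying date formats and the 1-to-1 joins:
INSERT INTO table1_unified (column1, column2, column3, column4, column5, normalized_date, ZIP)
SELECT column1, column2, column3, column4, column5, normalized_date, ZIP
FROM 1990_to_1999_table1_view;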
Asked by Tyler (111 rep)
Apr 25, 2023, 05:09 PM
Last activity: May 27, 2025, 03:10 AM