Database design to handle millions of rows in MySQL
8 votes, 3 answers, 4962 views
We are running an application that is collecting data much faster than we anticipated. To adapt, we are redesigning the database.
After reading this, this, and this, I am not sure what the best approach for our design is, considering our hardware is very humble.
The main tables that are causing problems are:
- SCANS
- DOMAINS
- DOCUMENTS
- VALUES
Currently we use a single table to store the data. The relation between the tables is:
- 1 **SCAN** -> (avg 4) **DOMAINS** -> (avg 3,000) **DOCUMENTS** -> (avg 51,000) **VALUES**
- 1 SCAN points to an average of 4 entries in DOMAINS.
- Those 4 entries in DOMAINS point to an average of 12,000 entries in DOCUMENTS.
- Those 12,000 entries in DOCUMENTS point to an average of 204,000 entries in VALUES (a rough DDL sketch of this hierarchy follows below).
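For reference, the sketch. It is illustrative only: the column lists are trimmed to the ids, and the real layout of VALUES is in the DESCRIBE output further down.

```sql
-- Illustrative sketch of the hierarchy, not our exact DDL.
-- VALUES is a reserved word in MySQL, hence the backticks.
CREATE TABLE SCANS (
    id_scan     INT AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE DOMAINS (
    id_domain   INT AUTO_INCREMENT PRIMARY KEY,
    id_scan     INT NOT NULL,                    -- ~4 domains per scan
    FOREIGN KEY (id_scan) REFERENCES SCANS (id_scan)
);

CREATE TABLE DOCUMENTS (
    id_document INT AUTO_INCREMENT PRIMARY KEY,
    id_domain   INT NOT NULL,                    -- ~3,000 documents per domain
    FOREIGN KEY (id_domain) REFERENCES DOMAINS (id_domain)
);

CREATE TABLE `VALUES` (
    id_value    INT AUTO_INCREMENT PRIMARY KEY,
    id_document INT NOT NULL,                    -- ~17 values per document
    FOREIGN KEY (id_document) REFERENCES DOCUMENTS (id_document)
);
```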
We are currently performing around 100 scans/day, which means inserting around 20,400,000 rows per day into VALUES (100 scans × ~204,000 VALUES per scan).
We are considering splitting the VALUES table into one table per period:
- **VALUES_year_month**, with the intention of distributing the load between them. But if we multiply the number of scanners, this mechanism does not scale.
- **VALUES_year_month_day**, but then we end up with far too many tables in the same DB.
Either way, if we increase the number of scans per day, neither solution seems scalable.
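To make the idea concrete, this is roughly what the manual split would look like (illustrative only; the table names are assumptions and we have not implemented this):

```sql
-- Manual "table per month" split, sketched for illustration only.
-- The application has to pick the right table on every write, and
-- reads that span months need a UNION over the per-month tables.
CREATE TABLE VALUES_2022_08 LIKE `VALUES`;
CREATE TABLE VALUES_2022_09 LIKE `VALUES`;

SELECT value, COUNT(id_value) AS data_count
FROM (
    SELECT value, id_value FROM VALUES_2022_08 WHERE id_analysis = 23
    UNION ALL
    SELECT value, id_value FROM VALUES_2022_09 WHERE id_analysis = 23
) AS v
GROUP BY value;
```

Every new month (or day) adds another table, and every scanner we add multiplies the write load on whichever table is current.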
At this point, keeping all the data in a centralized DB does not seem to be the best option for scalability reasons... but at the same time, a distributed system would increase the load time significantly.
What would be a reasonable approach? I am sure we are not the first team to run into this issue! :P
**EDIT**
*How much data do we read per query?*
That depends on the SCAN. Not all scans have the same amount of data. The range varies between:
- 1 SCAN --> 200 VALUES
- 1 SCAN --> 200,000 VALUES
The information is presented on a front end to the end user, so we have split up how the queries are requested from the backend to avoid overloading the server, but in some cases it is not enough due to the high number of VALUES.
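As an illustration of the kind of chunking I mean (simplified, not our exact backend query; the real ones also carry the filters shown in EDIT II below):

```sql
-- Fetch the VALUES of one analysis in fixed-size chunks, keyed on the
-- primary key, instead of pulling the whole result set at once.
-- The id_analysis and id_value literals are placeholders.
SELECT id_value, value, url, filetype
FROM `VALUES`
WHERE id_analysis = 8950
  AND id_value > 1250000   -- last id_value returned by the previous chunk
ORDER BY id_value
LIMIT 5000;
```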
*When is the data read?*
That entirely depends on the end users. Some days they read tens of SCANS, other days none, and other days hundreds.
**EDIT II**
EXPLAIN ANALYZE results for two queries: the first one is quick, the second one slow.
```sql
EXPLAIN ANALYZE
SELECT value,
       url,
       filetype,
       severity,
       COUNT(id_value) AS data_count
FROM VALUES
WHERE (weigth = 150 OR weigth = 100)
  AND id_analysis = 23
  AND is_hidden = 0
  AND is_hidden_by_user = 0
GROUP BY value
ORDER BY data_count DESC
```
**Result 1:**

```
-> Sort row IDs: data_count DESC  (actual time=34.016..34.016 rows=0 loops=1)
    -> Table scan on <temporary>  (actual time=34.006..34.006 rows=0 loops=1)
        -> Aggregate using temporary table  (actual time=34.005..34.005 rows=0 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100)))  (cost=1.00 rows=0.05) (actual time=0.024..0.024 rows=0 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=23)  (cost=1.00 rows=1) (actual time=0.024..0.024 rows=0 loops=1)
```
**Result 2:**

```
-> Sort row IDs: data_count DESC  (actual time=187172.159..187172.173 rows=136 loops=1)
    -> Table scan on <temporary>  (actual time=187172.079..187172.111 rows=136 loops=1)
        -> Aggregate using temporary table  (actual time=187172.077..187172.077 rows=136 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and ((VALUES.weigth = 150) or (VALUES.weigth = 100)))  (cost=264956.35 rows=695) (actual time=249.030..186775.012 rows=52289 loops=1)
                -> Index lookup on VALUES using id_analysis (id_analysis=8950)  (cost=264956.35 rows=265154) (actual time=248.979..186696.529 rows=134236 loops=1)
```
**EDIT III**
> Consider `PARTITION`ing

This is a great suggestion, kudos! From what I have read, that is the native equivalent of splitting tables in the way we were considering.
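To check my understanding, the native version would look something like the sketch below. It assumes a created_at column that the current VALUES table does not have, and in MySQL the partitioning column must be part of every unique key, so the primary key becomes composite.

```sql
-- Native RANGE partitioning sketch (assumes a hypothetical created_at column).
-- Only a few columns are shown; the real layout is in the DESCRIBE output below.
CREATE TABLE `VALUES` (
    id_value    INT NOT NULL AUTO_INCREMENT,
    id_document INT NOT NULL,
    id_analysis INT,
    weigth      INT,
    value       MEDIUMTEXT,
    created_at  DATE NOT NULL,
    PRIMARY KEY (id_value, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2022_08 VALUES LESS THAN (TO_DAYS('2022-09-01')),
    PARTITION p2022_09 VALUES LESS THAN (TO_DAYS('2022-10-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
```

Dropping an old month would then be an `ALTER TABLE ... DROP PARTITION` instead of deleting millions of rows.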
> `(weigth = 150 OR weigth = 100)` is a rather strange test.

Removing the `OR` clause improves the timing:
```
-> Sort row IDs: data_count DESC  (actual time=101261.260..101261.271 rows=113 loops=1)
    -> Table scan on <temporary>  (actual time=101261.187..101261.216 rows=113 loops=1)
        -> Aggregate using temporary table  (actual time=101261.185..101261.185 rows=113 loops=1)
            -> Filter: ((VALUES.is_hidden_by_user = 0) and (VALUES.is_hidden = 0) and (VALUES.id_analysis = 8950) and (VALUES.weigth = 150))  (cost=79965.29 rows=623) (actual time=83848.835..100942.179 rows=52259 loops=1)
                -> Intersect rows sorted by row ID  (cost=79965.29 rows=62292) (actual time=83848.830..100908.758 rows=52259 loops=1)
                    -> Index range scan on VALUES using id_analysis over (id_analysis = 8950)  (cost=291.66 rows=265154) (actual time=0.100..443.145 rows=134236 loops=1)
                    -> Index range scan on VALUES using weigth over (weigth = 150)  (cost=13492.63 rows=12380386) (actual time=0.043..83511.686 rows=7822871 loops=1)
```
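For what it is worth, the plan above intersects two single-column indexes (id_analysis and weigth), and the weigth scan alone touches ~7.8M rows. One thing we are considering (untested on our data, index name made up) is a composite index so the filter can be resolved from a single index:

```sql
-- Hypothetical composite index covering the filter columns of the slow query;
-- weigth goes last so both weigth = 150 and weigth = 100 can still be read
-- as ranges from the same index.
ALTER TABLE `VALUES`
    ADD INDEX idx_analysis_hidden_weigth
        (id_analysis, is_hidden, is_hidden_by_user, weigth);
```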
> Please elaborate on `value` versus `id_value`

I believe it is just bad naming:
```
+-------------------+-------------+------+-----+---------+----------------+
| Field             | Type        | Null | Key | Default | Extra          |
+-------------------+-------------+------+-----+---------+----------------+
| id_value          | int         | NO   | PRI | NULL    | auto_increment |
| id_document       | int         | NO   | MUL | NULL    |                |
| id_tag            | int         | YES  | MUL | NULL    |                |
| value             | mediumtext  | YES  |     | NULL    |                |
| weigth            | int         | YES  | MUL | NULL    |                |
| id_analysis       | int         | YES  | MUL | NULL    |                |
| url               | text        | YES  |     | NULL    |                |
| domain            | varchar(64) | YES  |     | NULL    |                |
| filetype          | varchar(16) | YES  |     | NULL    |                |
| severity_name     | varchar(16) | YES  |     | NULL    |                |
| id_domain         | int         | YES  | MUL | NULL    |                |
| id_city           | int         | YES  | MUL | NULL    |                |
| city_name         | varchar(32) | YES  |     | NULL    |                |
| is_hidden         | tinyint     | NO   |     | 0       |                |
| id_company        | int         | YES  |     | NULL    |                |
| is_hidden_by_user | tinyint(1)  | NO   |     | 0       |                |
+-------------------+-------------+------+-----+---------+----------------+
```
Asked by Javi M
(61 rep)
Aug 23, 2022, 07:08 PM
Last activity: Aug 24, 2022, 07:36 PM