How important is the "clustered" property of MySQL PK index?
I am importing ~50M rows into MySQL 8, InnoDB. It's on AWS RDS with GP3 storage.
The unique key of the rows is a UUID-like string.
When querying we will never care about this unique key, except when upserting new/modified rows from the primary source.
Normally the unique id would be the PK. But I have read that the PK index in MySQL is special because it aims to 'cluster' the data for similar values, to enhance performance.
It seems that with a UUID-like string as the PK, the clustering will not help our queries.
If I were to partition the table I would do it by date range.
I could imagine defining a synthetic PK, or a composite PK, that combines the date field and uuid to get a clustering that is more likely to support the queries we actually do.
My question is this: how important is it to have a PK clustering that supports the typical queries (i.e. fetched results likely to be 'close' in the index)?
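To make "close in the index" concrete, here is a rough simulation sketch. It is not a benchmark and does not touch MySQL: it assumes rows are packed into fixed-size pages in PK order, and it counts how many distinct pages a one-week date range would touch under a UUID-only PK versus a composite (date, uuid) PK. The page size of 100 rows, the 50,000-row sample, the synthetic data, and all names (`pages_touched`, `in_one_week`, etc.) are assumptions made up for illustration.

```python
import random
from datetime import date, timedelta

ROWS_PER_PAGE = 100  # assumption: real InnoDB pages are 16 KB by default and row counts vary


def pages_touched(rows, pk, wanted):
    """Lay rows out in PK order, pack them into fixed-size pages,
    and count the distinct pages that hold at least one wanted row."""
    ordered = sorted(rows, key=pk)
    return len({i // ROWS_PER_PAGE for i, row in enumerate(ordered) if wanted(row)})


random.seed(0)  # fixed seed so runs are repeatable
start = date(2024, 1, 1)
# Each row is (uuid-like hex string, event date); both are synthetic.
rows = [
    (f"{random.getrandbits(128):032x}", start + timedelta(days=random.randrange(365)))
    for _ in range(50_000)
]


def in_one_week(row):
    """Stand-in for a typical date-range query predicate."""
    return date(2024, 3, 1) <= row[1] < date(2024, 3, 8)


matching = sum(1 for row in rows if in_one_week(row))
uuid_pk_pages = pages_touched(rows, pk=lambda r: r[0], wanted=in_one_week)
date_uuid_pk_pages = pages_touched(rows, pk=lambda r: (r[1], r[0]), wanted=in_one_week)

print(f"rows matching the one-week range: {matching}")
print(f"pages touched with PK (uuid):       {uuid_pk_pages}")
print(f"pages touched with PK (date, uuid): {date_uuid_pk_pages}")
```

The model only counts pages. It says nothing about buffer pool hit rates or how cheap random reads are on SSD-backed storage, which is the part I am unsure about.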
Presumably the typical case of an auto-incrementing id for PK also results in clustering that has little relation to typical queries (often no reason to select adjacent ids).
I am thinking specifically about whether modern SSD storage makes this type of optimisation less important, obsolete... or even more important?
### More context
https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
> #### How the Clustered Index Speeds Up Queries
> Accessing a row through the
> clustered index is fast because the index search leads directly to the
> page that contains the row data. If a table is large, the clustered
> index architecture often saves a disk I/O operation when compared to
> storage organizations that store row data using a different page from
> the index record.
It seems like the "clustered"-ness of the PK index is only of value for queries which select by PK.
It's about co-locating the row data with the index (?)
So if all the application queries that I care about use secondary indexes I guess it doesn't really matter what the properties of the PK are? e.g. including a date partition column in the PK isn't going to magically speed up queries using a different index.
Is that right?
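For reference, this is a toy sketch of how I understand a non-covering secondary index lookup to work, based on the same manual page: the secondary index entry holds the PK value, and the row is then fetched from the clustered index by that PK. Plain dicts stand in for B-trees, and the ids, column names (`created_on`, `payload`) and the `select_by_date` helper are hypothetical.

```python
# Toy model only: plain dicts stand in for InnoDB B-trees.
# Clustered index: PK -> full row (the row data lives with the PK entry).
clustered = {
    "a1f3": {"id": "a1f3", "created_on": "2024-03-01", "payload": "x"},
    "07bc": {"id": "07bc", "created_on": "2024-03-01", "payload": "y"},
    "e9d2": {"id": "e9d2", "created_on": "2024-03-02", "payload": "z"},
}

# Secondary index on created_on: indexed value -> PK values, not row data.
by_created_on = {}
for pk, row in clustered.items():
    by_created_on.setdefault(row["created_on"], []).append(pk)


def select_by_date(day):
    """Return full rows for one day plus a count of index lookups performed."""
    lookups = {"secondary": 1, "clustered": 0}
    result = []
    for pk in by_created_on.get(day, []):
        # Each match needs a second lookup, by PK, to reach the row data.
        lookups["clustered"] += 1
        result.append(clustered[pk])
    return result, lookups


found, lookups = select_by_date("2024-03-01")
print(lookups)
```

If this picture is accurate, every row matched through a secondary index still costs a PK lookup in the clustered index (unless the index covers the query), so where those PKs land might matter after all.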
Asked by Anentropic
(558 rep)
Apr 20, 2024, 07:48 AM
Last activity: Apr 28, 2024, 06:59 PM