How important is the "clustered" property of MySQL PK index?

-1 votes
2 answers
58 views
I am importing ~50M rows into MySQL 8, InnoDB. It's on AWS RDS with GP3 storage.

The unique key of the rows is a uuid-like string. When querying we will never care about this unique key, except when upserting new/modified rows from the primary source.

Normally the unique id would be the PK. But I have read that PK index in MySQL is special because it aims to 'cluster' the data for similar values, to enhance performance. It seems that by using uuid-like string as PK the clustering will not help our queries.

If I was to partition the table I would do it by date range. I could imagine defining a synthetic PK, or a composite PK, that combines the date field and uuid to get a clustering that is more likely to support the queries we actually do.

My question is this: how important is it to have a PK clustering that supports the typical queries (i.e. fetched results likely to be 'close' in the index)?

Presumably the typical case of an auto-incrementing id for PK also results in clustering that has little relation to typical queries (often no reason to select adjacent ids).

I am thinking specifically about whether modern SSD storage makes this type of optimisation less important, obsolete... or even more important?

### More context

https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html

> #### How the Clustered Index Speeds Up Queries
>
> Accessing a row through the clustered index is fast because the index search leads directly to the page that contains the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

It seems like the "clustered"-ness of the PK index is only of value for queries which select by PK. It's about co-locating the row data with the index (?)

So if all the application queries that I care about use secondary indexes I guess it doesn't really matter what the properties of the PK are? e.g. including a date partition column in the PK isn't going to magically speed up queries using a different index. Is that right?
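
To make "fetched results likely to be 'close' in the index" concrete, here is a toy Python model, not InnoDB itself. It lays rows out in primary-key order, packs them into fixed-size "pages", and counts how many distinct pages hold the rows for one week of dates under each PK choice. Everything in it is an assumption for illustration: the 100-rows-per-page capacity, the 50,000-row sample, uniformly random dates, and the hypothetical helpers `build_rows` and `pages_touched`. It ignores secondary indexes, the buffer pool and page splits, so it only sketches the locality question being asked.

```python
import random
import uuid
from datetime import date, timedelta

ROWS_PER_PAGE = 100  # assumed leaf-page capacity, purely illustrative


def build_rows(n=50_000, days=365, seed=0):
    """Generate (date, uuid-string) rows with uniformly random dates."""
    rng = random.Random(seed)
    start = date(2023, 1, 1)
    return [
        (
            start + timedelta(days=rng.randrange(days)),
            str(uuid.UUID(int=rng.getrandbits(128))),
        )
        for _ in range(n)
    ]


def pages_touched(rows, pk, wanted):
    """Count distinct pages holding the rows that satisfy `wanted`
    when rows are stored in the order given by the key function `pk`."""
    ordered = sorted(rows, key=pk)
    return len(
        {i // ROWS_PER_PAGE for i, row in enumerate(ordered) if wanted(row)}
    )


rows = build_rows()
lo, hi = date(2023, 6, 1), date(2023, 6, 7)


def in_week(row):
    return lo <= row[0] <= hi


# PK = uuid alone: matching rows are scattered across the whole table.
by_uuid = pages_touched(rows, pk=lambda r: r[1], wanted=in_week)

# PK = (date, uuid): matching rows are stored next to each other.
by_date_uuid = pages_touched(rows, pk=lambda r: (r[0], r[1]), wanted=in_week)

print("pages holding one week of rows, PK=uuid:        ", by_uuid)
print("pages holding one week of rows, PK=(date, uuid):", by_date_uuid)
```

The model only says where the matching rows live, not how the query finds them, so whether the difference matters on SSD-backed storage is still the open question.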
Asked by Anentropic (558 rep)
Apr 20, 2024, 07:48 AM
Last activity: Apr 28, 2024, 06:59 PM