LIKE query with leading wildcards in large MySQL production database
0
votes
1
answer
2335
views
According to insights in Amazon RDS the following query in a large MySQL production database is causing high loads ( ~ 50.000.000 entries):
SELECT * FROM entities WHERE status='ready' AND user_id='81663729' AND (primary_name LIKE '%mysearch%' OR additional_names LIKE '%mysearch%') ORDER BY id DESC LIMIT 0, 100000
Columns responsible for it:
common_name: VARCHAR(255)
additional_names: VARCHAR(2000) (Note: This is *unfortunately* a comma-seperated string of zero to hundreds of names per entry, which is horrible database design)
Not a big surprise that this query is performing bad: Leading wildcards make indices more or less unusable (primary_name
is indexed, additional_names
is not). Some users have a whole lot of entries in this table (currently up to ~ 2.000.000 entries per user, numbers of entries growing) - so the user_id in the query will not rescue its performance.
My question is what might be the best way to solve this problem.
**Approach 1: MySQL FULLTEXT index**
Create a fulltext index:
CREATE FULLTEXT INDEX domain ON entities (primary_name,additional_names);
New Query:
SELECT * FROM entities WHERE status='ready' AND user_id='81663729' AND MATCH (primary_name,additional_names) AGAINST ('mysearch') IN NATURAL LANGUAGE MODE ORDER BY id DESC LIMIT 0, 100000;
*Pro*: Easiest to migrate and implement?
*Contra*: Not sure if this the requirements? Creating that index might take a huge amount of time.
**Approach 2: MySQL JSON**
Step 1: Add an additional column with JSON
ALTER TABLE entities ADD additional_names_j JSON AFTER additional_names;
Step 2: All new entities are inserted into the new JSON column as json as well as into the old column.
Step 3: A script which might run a few days makes sure that all old entities have their entries in the JSON column.
Step 4: Application is adjusted to use the JSON column.
Step 5: Delete the old additional_names column.
Query: * FROM entities WHERE status='ready' AND user_id='81663729' AND 'mysearch' member of ORDER BY id DESC LIMIT 0, 100000;
*Pro*: Searching the JSON would work much better and it would be possible to extract the names without comma exploding and such stuff.
*Contra*: Not sure if partial matching for JSON columns is even possible? Furthermore the effort is quite high.
**Approach 3: Create a relations table and a names table**
This would basically mean that we have a table with id and name and a m:n relationship table.
*Pro*: Would be nice and clean design.
*Contra*: Would have to exist and be populated in parallel to the running application, and in this case it would cause enormous changes in the application logic.
My question to the experts is: Which of those approaches would solve the situation best - and why?
Note: The table is growing by ~ 5 entries per second, downtime is not allowed. I am open to any other approach to make this query faster, which I am not aware of at the moment!
Asked by Blackbam
(225 rep)
May 24, 2022, 02:01 PM
Last activity: May 24, 2022, 03:45 PM
Last activity: May 24, 2022, 03:45 PM