Sample Header Ad - 728x90

optimization with subquery not working as expected

0 votes
1 answer
136 views
I have a MySQL InnoDB database. One table called affymetrixProbeset contains more than 300 million rows. Querying this table with INNER JOIN to other tables, with an ORDER BY and offset/limit takes almost 4 minutes. I saw that creating a subquery could be more optimized. I tried to move from my initial query with INNER JOIN to an approach using a subquery The query is:
SELECT affymetrixProbeset.* 
FROM affymetrixProbeset 
WHERE affymetrixProbeset.bgeeAffymetrixChipId IN(
    SELECT affymetrixChip.bgeeAffymetrixChipId 
    FROM affymetrixChip INNER JOIN cond ON affymetrixChip.conditionId = cond.conditionId
    WHERE cond.speciesId = 9606)
order by affymetrixProbeset.affymetrixProbesetId, affymetrixProbeset.bgeeGeneId
limit 10;
The columns used in the ORDER BY correspond to the primary key of the table affymetrixProbeset. The explain related to this query is
+----+-------------+--------------------+------------+--------+------------------------------+-------------+---------+--------------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table              | partitions | type   | possible_keys                | key         | key_len | ref                                              | rows  | filtered | Extra                                        |
+----+-------------+--------------------+------------+--------+------------------------------+-------------+---------+--------------------------------------------------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | affymetrixChip     | NULL       | index  | PRIMARY,conditionId          | conditionId | 3       | NULL                                             | 12990 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | cond               | NULL       | eq_ref | PRIMARY,speciesId            | PRIMARY     | 3       | bgee_v15_dev.affymetrixChip.conditionId          |     1 |    10.99 | Using where                                  |
|  1 | SIMPLE      | affymetrixProbeset | NULL       | ref    | PRIMARY,bgeeAffymetrixChipId | PRIMARY     | 3       | bgee_v15_dev.affymetrixChip.bgeeAffymetrixChipId |   175 |   100.00 | NULL                                         |
+----+-------------+--------------------+------------+--------+------------------------------+-------------+---------+--------------------------------------------------+-------+----------+----------------------------------------------+
The explain is exactly the same as my initial query. The query takes the same time (~4 minutes) It looks like the ORDER BY is not done on the affymetrixProbeset table but directly on the affymetrixChip table that is part of the subquery. I was expecting MySQL to run the subquery and then order using the affymetrixProbeset table. Could you explain to me why it is not done as I expected? I checked also the performance of separating this query in two. The first one is:
SELECT affymetrixChip.bgeeAffymetrixChipId 
FROM affymetrixChip INNER JOIN cond ON affymetrixChip.conditionId = cond.conditionId
WHERE cond.speciesId = 9606;
I then passed the bgeeAffymetrixChipIds to the query
SELECT affymetrixProbeset.* 
FROM affymetrixProbeset 
WHERE affymetrixProbeset.bgeeAffymetrixChipId IN(.....)
order by affymetrixProbeset.affymetrixProbesetId, affymetrixProbeset.bgeeGeneId
limit 10;
I wrote ..... in the query as I had more than 5000 IDs. The query run almost instantaneous as it order using the primary key. Could you please explain why the subquery did not perform as I expected? Is there an other option to optimize the query than creating 2 queries? UPDATE: After adding some indexes proposed by @Rick, the explain now looks like :
+----+-------------+--------------------+------------+--------+-----------------------------------------------------+-------------+---------+------------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table              | partitions | type   | possible_keys                                       | key         | key_len | ref                                            | rows  | filtered | Extra                                        |
+----+-------------+--------------------+------------+--------+-----------------------------------------------------+-------------+---------+------------------------------------------------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | affymetrixChip     | NULL       | index  | PRIMARY,conditionId                                 | conditionId | 6       | NULL                                           | 12561 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | cond               | NULL       | eq_ref | PRIMARY,speciesId                                   | PRIMARY     | 3       | bgee_v15_0.affymetrixChip.conditionId          |     1 |    10.94 | Using where                                  |
|  1 | SIMPLE      | affymetrixProbeset | NULL       | ref    | PRIMARY,bgeeAffymetrixChipId_2,bgeeAffymetrixChipId | PRIMARY     | 3       | bgee_v15_0.affymetrixChip.bgeeAffymetrixChipId |   174 |   100.00 | NULL                                         |
+----+-------------+--------------------+------------+--------+-----------------------------------------------------+-------------+---------+------------------------------------------------+-------+----------+----------------------------------------------+
The query still takes 3 minutes to run. The SHOW CREATE TABLE of the PK/FK/indexes of these tables are :
# table cond
  PRIMARY KEY (conditionId),
  UNIQUE KEY speciesId (speciesId,conditionId),
  UNIQUE KEY anatEntityId (anatEntityId,cellTypeId,stageId,speciesId,sex,sexInferred,strain),
  KEY exprMappedConditionId (exprMappedConditionId),
  KEY cellTypeId (cellTypeId),
  KEY stageId (stageId),
  CONSTRAINT cond_ibfk_1 FOREIGN KEY (exprMappedConditionId) REFERENCES cond (conditionId) ON DELETE CASCADE,
  CONSTRAINT cond_ibfk_2 FOREIGN KEY (anatEntityId) REFERENCES anatEntity (anatEntityId) ON DELETE CASCADE,
  CONSTRAINT cond_ibfk_3 FOREIGN KEY (cellTypeId) REFERENCES anatEntity (anatEntityId) ON DELETE CASCADE,
  CONSTRAINT cond_ibfk_4 FOREIGN KEY (stageId) REFERENCES stage (stageId) ON DELETE CASCADE,
  CONSTRAINT cond_ibfk_5 FOREIGN KEY (speciesId) REFERENCES species (speciesId) ON DELETE CASCADE

# table affymetrixChip
  PRIMARY KEY (bgeeAffymetrixChipId),
  UNIQUE KEY affymetrixChipId (affymetrixChipId,microarrayExperimentId),
  UNIQUE KEY conditionId (conditionId,bgeeAffymetrixChipId),
  KEY microarrayExperimentId (microarrayExperimentId),
  KEY chipTypeId (chipTypeId),
  CONSTRAINT affymetrixChip_ibfk_1 FOREIGN KEY (microarrayExperimentId) REFERENCES microarrayExperiment (microarrayExperimentId) ON DELETE CASCADE,
  CONSTRAINT affymetrixChip_ibfk_2 FOREIGN KEY (chipTypeId) REFERENCES chipType (chipTypeId) ON DELETE SET NULL,
  CONSTRAINT affymetrixChip_ibfk_3 FOREIGN KEY (conditionId) REFERENCES cond (conditionId) ON DELETE CASCADE

#table affymetrixProbeset
  PRIMARY KEY (bgeeAffymetrixChipId,affymetrixProbesetId),
  UNIQUE KEY bgeeAffymetrixChipId_2 (bgeeAffymetrixChipId,affymetrixProbesetId,bgeeGeneId),
  KEY bgeeAffymetrixChipId (bgeeAffymetrixChipId,expressionId,bgeeGeneId,normalizedSignalIntensity),
  KEY expressionId (expressionId),
  KEY bgeeGeneId (bgeeGeneId,bgeeAffymetrixChipId,affymetrixProbesetId),
  CONSTRAINT affymetrixProbeset_ibfk_1 FOREIGN KEY (bgeeAffymetrixChipId) REFERENCES affymetrixChip (bgeeAffymetrixChipId) ON DELETE CASCADE,
  CONSTRAINT affymetrixProbeset_ibfk_2 FOREIGN KEY (bgeeGeneId) REFERENCES gene (bgeeGeneId) ON DELETE CASCADE,
  CONSTRAINT affymetrixProbeset_ibfk_3 FOREIGN KEY (expressionId) REFERENCES expression (expressionId) ON DELETE SET NULL
UPDATE 2 : Sizes of the tables as result of SHOW TABLE STATUS
| Name                                                | Engine | Version | Row_format | Rows       | Avg_row_length | Data_length   | Max_data_length | Index_length | Data_free    | Auto_increment | Create_time         | Update_time         | Check_time | Collation       | Checksum | Create_options | Comment     
| affymetrixChip                                      | InnoDB |      10 | Dynamic    |      13020 |            122 |       1589248 |               0 |      1097728 |      2097152 |         104523 | 2022-12-02 13:23:15 | NULL                | NULL       | utf8_general_ci |     NULL |                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| affymetrixProbeset                                  | InnoDB |      10 | Dynamic    |  312237020 |             93 |   29169287168 |               0 |  41072721920 |      6291456 |           NULL | 2022-10-31 09:34:15 | NULL                | NULL       | utf8_general_ci |     NULL |                |  
| cond                                                | InnoDB |      10 | Dynamic    |      44578 |            106 |       4734976 |               0 |      6815744 |      2097152 |          44663 | 2022-12-02 13:19:09 | NULL                | NULL       | utf8_general_ci |     NULL |                |
The query below gives 4889 results
SELECT COUNT(*) FROM cond WHERE speciesId = 9606;
The query below gives 5452 results
SELECT COUNT(bgeeAffymetrixChipId) FROM affymetrixChip INNER JOIN cond ON cond.conditionId = affymetrixChip.conditionId WHERE cond.speciesId = 9606;
Asked by jwollbrett (1 rep)
Nov 25, 2022, 05:18 PM
Last activity: Aug 10, 2025, 03:05 PM