I want to understand the performance of the OPTIMIZE query in ClickHouse. I am planning to use it to remove duplicates right after a bulk insert into a MergeTree table, which gives me two options:
OPTIMIZE TABLE db.table DEDUPLICATE
or
OPTIMIZE TABLE db.table FINAL DEDUPLICATE
I understand that the first statement only deduplicates the inserted parts if they haven't already been merged, whereas the second deduplicates the whole table. However, I am concerned about performance; from a rough analysis of OPTIMIZE TABLE db.table FINAL DEDUPLICATE
on tables of different sizes, I can see it getting dramatically worse as the table grows (0.1s for 0.1M rows, 1s for 0.3M rows, 12s for 10M rows). I am assuming OPTIMIZE TABLE db.table DEDUPLICATE
depends on the insert size rather than the whole table size, so it should be more performant?
Can anyone point me to some literature on the performance of these operations?
In addition, do these problems go away if I replace the table with a ReplacingMergeTree? I imagine the same process happens under the hood, so it may not matter either way.
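For reference, this is the kind of ReplacingMergeTree setup I have in mind (table and column names are placeholders, not my real schema):

```sql
-- Hypothetical schema: rows sharing the same ORDER BY key are
-- collapsed during background merges, keeping the row with the
-- highest `inserted` value (the optional version column).
CREATE TABLE db.table_replacing
(
    id       UInt64,
    payload  String,
    inserted DateTime
)
ENGINE = ReplacingMergeTree(inserted)
ORDER BY id;

-- Deduplication is only guaranteed after merges have run, so a
-- read that must see deduplicated data needs FINAL:
SELECT * FROM db.table_replacing FINAL;
```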
Asked by AmyChodorowski
(113 rep)
Aug 19, 2021, 11:47 AM
Last activity: Aug 23, 2021, 04:40 AM