Why is sorting a table (loaded with random data) faster than actually sorting random data?
4 votes · 2 answers · 990 views
I want to benchmark sorting random records with the external merge sort algorithm on PostgreSQL. So I tried the following two approaches (one right after the other, keeping all parameters/configuration the same):
Attempt 1:
CREATE TABLE test(id BIGINT, name varchar(200));
INSERT INTO test (id,name) SELECT (random() * 1000000), concat(CONCAT(md5(random()::text), md5(random()::text))) FROM generate_series(1, 1000000) as t;
explain analyze select * from test order by id, name;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Gather Merge (cost=41486.43..63526.06 rows=188898 width=426) (actual time=76.477..207.253 rows=1000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=40486.40..40722.52 rows=94449 width=426) (actual time=73.418..101.593 rows=333333 loops=3)
Sort Key: id, name
Sort Method: external merge Disk: 29744kB
Worker 0: Sort Method: external merge Disk: 26008kB
Worker 1: Sort Method: external merge Disk: 25512kB
-> Parallel Seq Scan on test (cost=0.00..14278.49 rows=94449 width=426) (actual time=0.011..20.945 rows=333333 loops=3)
Planning Time: 2.820 ms
Execution Time: 227.090 ms
(11 rows)
Attempt 2:
explain analyze SELECT (random() * 1000000) as id, concat(CONCAT(md5(random()::text), md5(random()::text))) as name
FROM generate_series(1, 1000000) as t order by id, name;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=194348.85..196848.85 rows=1000000 width=40) (actual time=1707.086..1768.986 rows=1000000 loops=1)
Sort Key: ((random() * '1000000'::double precision)), (concat(concat(md5((random())::text), md5((random())::text))))
Sort Method: external merge Disk: 81256kB
-> Function Scan on generate_series t (cost=0.00..40000.00 rows=1000000 width=40) (actual time=55.734..1388.681 rows=1000000 loops=1)
Planning Time: 0.191 ms
JIT:
Functions: 3
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.338 ms (Deform 0.000 ms), Inlining 0.000 ms, Optimization 0.497 ms, Emission 11.837 ms, Total 12.672 ms
Execution Time: 1841.843 ms
(10 rows)
Can someone explain why sorting data that is randomly generated on the fly is slower than sorting similar random data read from disk?
I re-ran both queries with max_parallel_workers_per_gather = 0; the latency of the first query rose to 360 ms, while, as expected, the second query didn't budge.
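One way to narrow this down further (a sketch I have not run, not output from the question) is to time the row generation alone, without the ORDER BY, and then to materialize the generated rows before sorting them, so both attempts sort pre-computed data. The table name `scratch` is hypothetical:

```sql
-- Sketch: measure generation cost alone (no sort). Attempt 2's plan already
-- shows the function scan taking ~1.4 s of the ~1.8 s total, so most of the
-- gap may be generation, not sorting.
EXPLAIN ANALYZE
SELECT (random() * 1000000) AS id,
       concat(md5(random()::text), md5(random()::text)) AS name
FROM generate_series(1, 1000000) AS t;

-- Sketch: materialize the generated rows first, then sort them, so the
-- sort input is pre-computed in both attempts. "scratch" is a made-up name.
CREATE TEMP TABLE scratch AS
SELECT (random() * 1000000) AS id,
       concat(md5(random()::text), md5(random()::text)) AS name
FROM generate_series(1, 1000000) AS t;

EXPLAIN ANALYZE SELECT * FROM scratch ORDER BY id, name;
```

Note also a difference visible in the two plans above: Attempt 1 uses a Parallel Seq Scan split across two workers, while Attempt 2's Function Scan on generate_series runs in a single process.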
Asked by Rohan Dvivedi
(43 rep)
Dec 27, 2024, 08:40 AM
Last activity: Dec 27, 2024, 03:55 PM