like_regex vs jsonb_build_object performance
2
votes
0
answers
308
views
I'm working with Postges 14.8 and using JSONB to store data.
The queries that I'm working with perform filtering and JOIN'ing based on stuff in this JSONB column named
data
. We need to perform JOIN's to get the data we need.
The setup is simple. We have one table that is partitioned based on one of our columns and then we JOIN rows in different partitions. Each partition has the same amount of rows in our tests - around 8 mln.
We'd like to have one place to INSERT rows and partitioning improved our performance.
I'm using a gin index like this:
CREATE INDEX ix_tab_data ON public.tab USING gin (data jsonb_path_ops);
And a btree index for pagination:
CREATE INDEX ix_tab_created_timestamp ON public.tab USING btree (created_timestamp DESC);
They're then inherited by partitions.
Most queries look something like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
FROM public.tab_user AS User
JOIN public.tab_address AS Address ON Address.data @> jsonb_build_object('Id', User.data->'AddressId')
WHERE
Address.data @> jsonb_build_object('StreetName', 'Test')
ORDER BY User.created_timestamp DESC
LIMIT 500
OFFSET 1000;
Query plan:
Limit (cost=10000001275.74..10000001829.33 rows=500 width=136) (actual time=193.677..293.630 rows=500 loops=1)
Buffers: shared hit=106258
-> Nested Loop (cost=10000000168.54..149189116161.91 rows=125713639119 width=136) (actual time=0.333..293.527 rows=1500 loops=1)
Buffers: shared hit=106258
-> Index Scan using tab_user_created_timestamp_idx on tab_user user (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.012..7.544 rows=8728 loops=1)
Buffers: shared hit=8754
-> Bitmap Heap Scan on tab_address address (cost=168.11..15972.39 rows=14692 width=373) (actual time=0.032..0.032 rows=0 loops=8728)
Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @> jsonb_build_object('StreetName', 'Test')))
Rows Removed by Index Recheck: 0
Heap Blocks: exact=1502
Buffers: shared hit=97504
-> Bitmap Index Scan on tab_address_data_idx (cost=0.00..164.44 rows=14692 width=0) (actual time=0.030..0.030 rows=0 loops=8728)
Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @> jsonb_build_object('StreetName', 'Test')))
Buffers: shared hit=96002
Planning:
Buffers: shared hit=2
Planning Time: 0.175 ms
Execution Time: 293.705 ms
GIN indexes are used and query plans look fine. From what I saw online this @> jsonb_build_object
method is recommended by most.
However when we've done tests by sending many queries we found they don't perform as fast as we'd like to, so we've started working on improvements.
Since jsonb_build_object
creates a new JSONB and then we check for containment we've tried to use a json_path.
So the query looked like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
FROM public.tab_user AS User
JOIN public.tab_address AS Address ON Address.data@>jsonb_build_object('Id', User.data->'AddressId')
WHERE
Address.data @? '$.StreetName ? (@ == "Test")'
ORDER BY User.created_timestamp DESC
LIMIT 500
OFFSET 1000;
Query plan:
Limit (cost=10000001273.23..10000001825.58 rows=500 width=136) (actual time=189.903..287.950 rows=500 loops=1)
Buffers: shared hit=106258
-> Nested Loop (cost=10000000168.54..148874820513.06 rows=125713639119 width=136) (actual time=0.207..287.849 rows=1500 loops=1)
Buffers: shared hit=106258
-> Index Scan using tab_user_created_timestamp_idx on tab_user user (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.007..7.470 rows=8728 loops=1)
Buffers: shared hit=8754
-> Bitmap Heap Scan on tab_address address (cost=168.11..15935.66 rows=14692 width=373) (actual time=0.031..0.031 rows=0 loops=8728)
Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ == "Test")'::jsonpath))
Rows Removed by Index Recheck: 0
Heap Blocks: exact=1502
Buffers: shared hit=97504
-> Bitmap Index Scan on tab_address_data_idx (cost=0.00..164.43 rows=14692 width=0) (actual time=0.030..0.030 rows=0 loops=8728)
Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ == "Test")'::jsonpath))
Buffers: shared hit=96002
Planning:
Buffers: shared hit=2
Planning Time: 0.141 ms
Execution Time: 288.016 ms
Performance was the same as with jsonb_build_object
.
We've tried using like_regex
instead of ==
in the json_path and surprisingly it was fastest.
Current query looks like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
FROM public.tab_user AS User
JOIN public.tab_address AS Address ON Address.data@>jsonb_build_object('Id', User.data->'AddressId')
WHERE
Address.data @? '$.StreetName ? (@ like_regex "Test")'
ORDER BY User.created_timestamp DESC
LIMIT 500
OFFSET 1000;
Query plan:
Limit (cost=10050199374.23..10051799360.17 rows=500 width=136) (actual time=40.426..61.151 rows=500 loops=1)
Buffers: shared hit=61138
-> Nested Loop (cost=10046999402.34..402290157972457.19 rows=125713639119 width=136) (actual time=0.054..61.049 rows=1500 loops=1)
Buffers: shared hit=61138
-> Index Scan using tab_user_created_timestamp_idx on tab_user user (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.008..6.836 rows=8728 loops=1)
Buffers: shared hit=8754
-> Bitmap Heap Scan on tab_address address (cost=46999401.91..47015169.46 rows=14692 width=373) (actual time=0.006..0.006 rows=0 loops=8728)
Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ like_regex "Test")'::jsonpath))
Rows Removed by Index Recheck: 1
Heap Blocks: exact=8744
Buffers: shared hit=52384
-> Bitmap Index Scan on tab_address_data_idx (cost=0.00..46999398.24 rows=14692 width=0) (actual time=0.003..0.003 rows=1 loops=8728)
Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ like_regex "Test")'::jsonpath))
Buffers: shared hit=43640
Planning:
Buffers: shared hit=2
Planning Time: 0.171 ms
Execution Time: 61.214 ms
All queries return the same rows but like_regex
is fastest - which is counterintuitive.
Around 16% of rows in public.tab_address
have the StreetName
we're looking for in our queries.
The more JSONB filters in WHERE the bigger the performance difference.
We are currently looking for an explanation of this behavior.
From what I saw online, regex searches tend to be more resource-intensive. Maybe when combined with the jsonb_path_ops index, Postgres can optimize the path traversal and quickly identify relevant records before applying the regex filter?
It seems like query #2 should be faster than a regex search. I guess the performance depends on how the jsonb_path_ops index handles equality checks versus pattern matching.
Does someone have an idea why like_regex
is faster than jsonb_build_object
and ==
in json_path?
Asked by MRzeczkowski
(21 rep)
Dec 12, 2023, 12:40 PM
Last activity: Feb 9, 2025, 06:02 PM
Last activity: Feb 9, 2025, 06:02 PM