Sample Header Ad - 728x90

like_regex vs jsonb_build_object performance

2 votes
0 answers
308 views
I'm working with Postges 14.8 and using JSONB to store data. The queries that I'm working with perform filtering and JOIN'ing based on stuff in this JSONB column named data. We need to perform JOIN's to get the data we need. The setup is simple. We have one table that is partitioned based on one of our columns and then we JOIN rows in different partitions. Each partition has the same amount of rows in our tests - around 8 mln. We'd like to have one place to INSERT rows and partitioning improved our performance. I'm using a gin index like this: CREATE INDEX ix_tab_data ON public.tab USING gin (data jsonb_path_ops); And a btree index for pagination: CREATE INDEX ix_tab_created_timestamp ON public.tab USING btree (created_timestamp DESC); They're then inherited by partitions. Most queries look something like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
 FROM public.tab_user AS User
 JOIN public.tab_address AS Address ON Address.data @> jsonb_build_object('Id', User.data->'AddressId')
 WHERE
  Address.data @> jsonb_build_object('StreetName', 'Test')
 ORDER BY User.created_timestamp DESC
 LIMIT 500
 OFFSET 1000;
Query plan:
Limit  (cost=10000001275.74..10000001829.33 rows=500 width=136) (actual time=193.677..293.630 rows=500 loops=1)
  Buffers: shared hit=106258
  ->  Nested Loop  (cost=10000000168.54..149189116161.91 rows=125713639119 width=136) (actual time=0.333..293.527 rows=1500 loops=1)
        Buffers: shared hit=106258
        ->  Index Scan using tab_user_created_timestamp_idx on tab_user user  (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.012..7.544 rows=8728 loops=1)
              Buffers: shared hit=8754
        ->  Bitmap Heap Scan on tab_address address  (cost=168.11..15972.39 rows=14692 width=373) (actual time=0.032..0.032 rows=0 loops=8728)
              Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @> jsonb_build_object('StreetName', 'Test')))
              Rows Removed by Index Recheck: 0
              Heap Blocks: exact=1502
              Buffers: shared hit=97504
              ->  Bitmap Index Scan on tab_address_data_idx  (cost=0.00..164.44 rows=14692 width=0) (actual time=0.030..0.030 rows=0 loops=8728)
                    Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @> jsonb_build_object('StreetName', 'Test')))
                    Buffers: shared hit=96002
Planning:
  Buffers: shared hit=2
Planning Time: 0.175 ms
Execution Time: 293.705 ms
GIN indexes are used and query plans look fine. From what I saw online this @> jsonb_build_object method is recommended by most. However when we've done tests by sending many queries we found they don't perform as fast as we'd like to, so we've started working on improvements. Since jsonb_build_object creates a new JSONB and then we check for containment we've tried to use a json_path. So the query looked like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
 FROM public.tab_user AS User
 JOIN public.tab_address AS Address ON Address.data@>jsonb_build_object('Id', User.data->'AddressId')
 WHERE
  Address.data @? '$.StreetName ? (@ == "Test")'
 ORDER BY User.created_timestamp DESC
 LIMIT 500
 OFFSET 1000;
Query plan:
Limit  (cost=10000001273.23..10000001825.58 rows=500 width=136) (actual time=189.903..287.950 rows=500 loops=1)
  Buffers: shared hit=106258
  ->  Nested Loop  (cost=10000000168.54..148874820513.06 rows=125713639119 width=136) (actual time=0.207..287.849 rows=1500 loops=1)
        Buffers: shared hit=106258
        ->  Index Scan using tab_user_created_timestamp_idx on tab_user user  (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.007..7.470 rows=8728 loops=1)
              Buffers: shared hit=8754
        ->  Bitmap Heap Scan on tab_address address  (cost=168.11..15935.66 rows=14692 width=373) (actual time=0.031..0.031 rows=0 loops=8728)
              Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ == "Test")'::jsonpath))
              Rows Removed by Index Recheck: 0
              Heap Blocks: exact=1502
              Buffers: shared hit=97504
              ->  Bitmap Index Scan on tab_address_data_idx  (cost=0.00..164.43 rows=14692 width=0) (actual time=0.030..0.030 rows=0 loops=8728)
                    Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ == "Test")'::jsonpath))
                    Buffers: shared hit=96002
Planning:
  Buffers: shared hit=2
Planning Time: 0.141 ms
Execution Time: 288.016 ms
Performance was the same as with jsonb_build_object. We've tried using like_regex instead of == in the json_path and surprisingly it was fastest. Current query looks like this:
SELECT User.data->>'Name' AS User_Name
, User.data->>'Surname' AS User_Surname
, User.data->>'Gender' AS User_Gender
, Address.data->>'StreetName' AS Address_StreetName
 FROM public.tab_user AS User
 JOIN public.tab_address AS Address ON Address.data@>jsonb_build_object('Id', User.data->'AddressId')
 WHERE
  Address.data @? '$.StreetName ? (@ like_regex "Test")'
 ORDER BY User.created_timestamp DESC
 LIMIT 500
 OFFSET 1000;
Query plan:
Limit  (cost=10050199374.23..10051799360.17 rows=500 width=136) (actual time=40.426..61.151 rows=500 loops=1)
  Buffers: shared hit=61138
  ->  Nested Loop  (cost=10046999402.34..402290157972457.19 rows=125713639119 width=136) (actual time=0.054..61.049 rows=1500 loops=1)
        Buffers: shared hit=61138
        ->  Index Scan using tab_user_created_timestamp_idx on tab_user user  (cost=0.43..9720087.72 rows=8556337 width=205) (actual time=0.008..6.836 rows=8728 loops=1)
              Buffers: shared hit=8754
        ->  Bitmap Heap Scan on tab_address address  (cost=46999401.91..47015169.46 rows=14692 width=373) (actual time=0.006..0.006 rows=0 loops=8728)
              Recheck Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ like_regex "Test")'::jsonpath))
              Rows Removed by Index Recheck: 1
              Heap Blocks: exact=8744
              Buffers: shared hit=52384
              ->  Bitmap Index Scan on tab_address_data_idx  (cost=0.00..46999398.24 rows=14692 width=0) (actual time=0.003..0.003 rows=1 loops=8728)
                    Index Cond: ((data @> jsonb_build_object('Id', (user.data -> 'AddressId'::text))) AND (data @? '$."StreetName"?(@ like_regex "Test")'::jsonpath))
                    Buffers: shared hit=43640
Planning:
  Buffers: shared hit=2
Planning Time: 0.171 ms
Execution Time: 61.214 ms
All queries return the same rows but like_regex is fastest - which is counterintuitive. Around 16% of rows in public.tab_address have the StreetName we're looking for in our queries. The more JSONB filters in WHERE the bigger the performance difference. We are currently looking for an explanation of this behavior. From what I saw online, regex searches tend to be more resource-intensive. Maybe when combined with the jsonb_path_ops index, Postgres can optimize the path traversal and quickly identify relevant records before applying the regex filter? It seems like query #2 should be faster than a regex search. I guess the performance depends on how the jsonb_path_ops index handles equality checks versus pattern matching. Does someone have an idea why like_regex is faster than jsonb_build_object and == in json_path?
Asked by MRzeczkowski (21 rep)
Dec 12, 2023, 12:40 PM
Last activity: Feb 9, 2025, 06:02 PM