
Find friends of friends (recursively) efficiently using PostgreSQL

9 votes
1 answer
717 views
Objective: Users submit their Contact Books, and then the application looks for connections between users based on their Phone Numbers - something like the "6 Handshakes" idea (https://en.wikipedia.org/wiki/Six_degrees_of_separation).

Problem: Make this query perform close to real time: when a User submits his phone number, he gets the **full** list of other phones he may know. A plain list - without the connections themselves (graph vertices etc.) - but full, not paginated (this requirement is here because the original goal is more complex).

Question: Is it possible to achieve close-to-real-time performance with a pure relational database, without graph databases (Neo4j, etc.), graph extensions (Bitnine AgensGraph) or a workflow redesign? Any denormalization is possible, but to my understanding it won't help.

Given:
test=# select * from connections;
 user_phone | friend_phone
------------+--------------
          1 |            2
          1 |            3
          1 |            4
          2 |            6
          2 |            7
          2 |            8
          8 |           10
          8 |           11
          8 |           12
         20 |           30
         40 |           50
         60 |           70
I expect to receive the following connections for User with Phone === 1:
friend_phone
--------------
            2
            3
            4
            6
            7
            8
           10
           11
           12
(9 rows)
It is really difficult to estimate real-world connection numbers, but I tested with at least:

- 10,000 Users (Phone Numbers)
- each User randomly assigned 50-1000 *Connections* to pseudo-random other Users
- resulting in about 1,000,000 Connections

If this is impossible to achieve in general (with some tricky ORDER BYs or subqueries etc.), what metrics should be considered to understand, for example, that:

> with 1,000,000 connections you need a 128GB RAM instance to get a 2-second response time

and

> for 100,000,000 connections you need a 1TB RAM instance to get a 5-second response time?

P.S. I tried subqueries, CTEs and JOINs, but eventually found that WITH RECURSIVE is the most explicit way, and it has the same resulting time as the other approaches.

This is the table:
CREATE TABLE connections (
  user_phone bigint NOT NULL,
  friend_phone bigint NOT NULL
);
This is how I seed the data:
INSERT INTO connections(user_phone, friend_phone) (
  SELECT generate_series AS user_phone,
         generate_series(1, (random()*5000)::int) AS friend_phone
  FROM generate_series(1, 500)
  ORDER BY user_phone
);
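(The script above uses smaller parameters than the 10,000-user test set described earlier. A seed closer to that setup might look roughly like the following; the per-user connection count formula and the uniform choice of friends are only an approximation of the described distribution, not the exact script:)

-- Sketch only: roughly 10,000 users with 50-1000 connections each.
-- The "50 + (u % 951)" per-user count and the uniform random friend choice
-- are approximations, not the original seeding logic.
INSERT INTO connections(user_phone, friend_phone)
SELECT DISTINCT t.user_phone, t.friend_phone
FROM (
    SELECT u::bigint AS user_phone,
           (1 + floor(random() * 10000))::bigint AS friend_phone
    FROM generate_series(1, 10000) AS u
    CROSS JOIN LATERAL generate_series(1, 50 + (u % 951)) AS g  -- 50..1000 rows per user
) AS t
WHERE t.user_phone <> t.friend_phone;  -- drop self-connections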
I've created some indexes:
test=# \d connections
               Table "public.connections"
    Column    |  Type  | Collation | Nullable | Default
--------------+--------+-----------+----------+---------
 user_phone   | bigint |           | not null |
 friend_phone | bigint |           | not null |
Indexes:
    "connections_user_phone_friend_phone_idx" UNIQUE, btree (user_phone, friend_phone)
    "connections_friend_phone_idx" btree (friend_phone)
    "connections_friend_phone_user_phone_idx" btree (friend_phone, user_phone)
    "connections_user_phone_idx" btree (user_phone)
I expect friend_phones.count >>> user_phones.count, so the index on connections(friend_phone, user_phone) seems the most appropriate, but I created all four indexes during testing. 2,270,852 connection records were generated. Then I ran this query:
WITH RECURSIVE p AS (
    SELECT friend_phone FROM connections WHERE connections.user_phone = 1
    
    UNION

    SELECT friends.friend_phone FROM connections AS friends JOIN p ON friends.user_phone = p.friend_phone
  )
SELECT COUNT(friend_phone) FROM p;
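(The COUNT is there only to keep the benchmark output small; for the actual use case described above the final SELECT would return the list itself, e.g.:)

WITH RECURSIVE p AS (
    SELECT friend_phone FROM connections WHERE connections.user_phone = 1

    UNION

    SELECT friends.friend_phone FROM connections AS friends JOIN p ON friends.user_phone = p.friend_phone
  )
SELECT friend_phone FROM p;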
The COUNT query returns 5002 and executes in about 3 seconds. EXPLAIN ANALYZE looks like the following:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3111105.00..3111105.01 rows=1 width=8) (actual time=3151.781..3151.781 rows=1 loops=1)
   CTE p
     ->  Recursive Union  (cost=0.43..2146207.20 rows=42884347 width=8) (actual time=0.060..3148.294 rows=5002 loops=1)
           ->  Index Scan using connections_user_phone_idx on connections  (cost=0.43..3003.69 rows=4617 width=8) (actual time=0.055..2.024 rows=4137 loops=1)
                 Index Cond: (user_phone = 1)
           ->  Merge Join  (cost=4500.77..128551.66 rows=4287973 width=8) (actual time=768.577..1359.598 rows=635428 loops=2)
                 Merge Cond: (friends.user_phone = p_1.friend_phone)
                 ->  Index Scan using connections_user_phone_idx on connections friends  (cost=0.43..54054.59 rows=2270852 width=16) (actual time=0.013..793.467 rows=1722262 loops=2)
                 ->  Sort  (cost=4500.34..4615.77 rows=46170 width=8) (actual time=0.765..74.850 rows=637677 loops=2)
                       Sort Key: p_1.friend_phone
                       Sort Method: quicksort  Memory: 65kB
                       ->  WorkTable Scan on p p_1  (cost=0.00..923.40 rows=46170 width=8) (actual time=0.001..0.314 rows=2501 loops=2)
   ->  CTE Scan on p  (cost=0.00..857686.94 rows=42884347 width=8) (actual time=0.062..3150.755 rows=5002 loops=1)
 Planning Time: 0.409 ms
 Execution Time: 3152.412 ms
(15 rows)
I feel like I'm missing something, because even if many loops are required, the number of connections reachable from each user is finite and far smaller than the total number of connections (in the example above, about 5000 user connections against 2.2M connections overall, ~0.25%). Maybe some specific index type? Maybe some other tricks I don't know about? Thanks in advance for reading such a big question :)

P.P.S. I used Postgres 12 with the following postgresql.conf:
# DB Version: 12
# OS Type: mac
# DB Type: web
# Total Memory (RAM): 16 GB
# CPUs num: 8
# Data Storage: ssd

max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.1
work_mem = 5242kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
Asked by Viktor Vsk (213 rep)
Jul 15, 2020, 04:17 PM
Last activity: May 23, 2021, 11:01 PM