I have 2 tables: *mailings* and *form_responses*. The *form_responses* table has a foreign key with the *mailings* table, so a *mailing* has *one-to-many* *form_responses*. What I want to do is calculate *mailing* stats: per *mailing* get total *form_responses* and when the last *form_response* was received. I want to return the first 20 *mailings* that have last received a *form_response*. But when it has to query 50k+ rows, the query gets really slow.
**mailings** (~100k rows)
Column | Type | Modifiers
--------------+-----------------------------+---------------------------------------------------
id | integer | pkey
project_id | integer | not null
deleted | boolean |
active | boolean |
created_at | timestamp with time zone |
Indexes:
"mailings_pkey" PRIMARY KEY, btree (id)
"mailing_on_project_id_where_not_deleted_and_active" btree (project_id) (WHERE deleted IS NOT true AND active IS TRUE)
+ individual indexes on almost all columns
**form_responses** (~6m rows)
Column | Type | Modifiers
--------------+-----------------------------+---------------------------------------------------
id | integer | pkey
project_id | integer | not null
mailing_id | integer |
deleted | boolean |
is_unanswered| boolean |
created_at | timestamp with time zone |
Indexes:
"form_responses_pkey" PRIMARY KEY, btree (id)
"form_responses_on_pid_cr_time_where_not_del_not_is_unans" btree (project_id, created_time) (WHERE deleted IS NOT true AND is_uanswered IS NOT TRUE)
+ individual indexes on almost all columns
I want to get the mailing stats of a project (id=1), within a date range. I only want the stats of mailings that are not deleted and active and only count non deleted responses.
So this gives me the following query
EXPLAIN (ANALYZE, BUFFERS)
SELECT
count(DISTINCT form_responses.id) AS total,
max(form_responses.created_time) AS form_response_last_received,
(SELECT mailings.name FROM mailings WHERE id = form_responses.mailing_id) AS name,
form_responses.mailing_id AS mailing_id FROM form_responses
JOIN
mailings ON mailings.id = form_responses.mailing_id
WHERE
form_responses.deleted IS NOT true AND
form_responses.is_unanswered IS NOT true AND
form_responses.created_time >= '2017-08-02 00:00:00' AND
form_responses.created_time Sort (cost=308997.69..308998.19 rows=197 width=16) (actual time=7762.552..7762.557 rows=20 loops=1)
Sort Key: (max(form_responses.created_time))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=74535 read=13513 dirtied=1
-> GroupAggregate (cost=307039.18..308992.45 rows=197 width=16) (actual time=7673.523..7762.278 rows=686 loops=1)
Group Key: form_responses.mailing_id
Buffers: shared hit=74535 read=13513 dirtied=1
-> Sort (cost=307039.18..307111.58 rows=28961 width=16) (actual time=7673.277..7702.783 rows=137542 loops=1)
Sort Key: form_responses.mailing_id
Sort Method: quicksort Memory: 12592kB
Buffers: shared hit=71790 read=13512 dirtied=1
-> Hash Join (cost=7376.42..304892.90 rows=28961 width=16) (actual time=44.576..7609.818 rows=137542 loops=1)
Hash Cond: (form_responses.mailing_id = mailings.id)
Buffers: shared hit=71790 read=13512 dirtied=1
-> Bitmap Heap Scan on form_responses (cost=3815.27..299862.55 rows=134811 width=16) (actual time=41.439..7499.305 rows=137543 loops=1)
Recheck Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time Bitmap Index Scan on form_responses_on_pid_cr_time_where_not_del_not_is_unans (cost=0.00..3781.57 rows=134811 width=0) (actual time=23.892..23.892 rows=137543 loops=1)
Index Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time Hash (cost=3286.82..3286.82 rows=21946 width=4) (actual time=3.093..3.093 rows=2866 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 101kB
Buffers: shared hit=758 dirtied=1
-> Bitmap Heap Scan on mailings (cost=874.50..3286.82 rows=21946 width=4) (actual time=0.403..2.145 rows=2866 loops=1)
Recheck Cond: ((project_id = 1) AND (deleted IS NOT TRUE) AND (active IS TRUE))
Heap Blocks: exact=732
Buffers: shared hit=758 dirtied=1
-> Bitmap Index Scan on mailing_on_project_id_where_not_deleted_and_active (cost=0.00..869.01 rows=21946 width=0) (actual time=0.315..0.315 rows=2866 loops=1)
Index Cond: (project_id = 1)
Buffers: shared hit=26
SubPlan 1
-> Index Scan using mailings_pkey on mailings mailings_1 (cost=0.42..8.44 rows=1 width=26) (actual time=0.003..0.003 rows=1 loops=686)
Index Cond: (id = form_responses.mailing_id)
Buffers: shared hit=2745 read=1
Planning time: 0.865 ms
Execution time: 7764.951 ms
In this example it queries ~135k rows, this can go up to almost ~600k. As you can see 99% of the time is spend on the Bitmap heap scan. I run a
db.t2.large
RDS database at AWS with 100gb (SSD) of storage, which runs postgres 9.4.7
- Is there anyway to avoid doing the bitmap heap scan?
- Or a way to improve the speed of this?
- Can I create a covering index, so it doesn't have to read the form responses table itself?
- Should I cluster the *form_responses* table on *project_id* and/or *mailing_id*?
- Should I upgrade my hardware?
- Am I asking too much of my Database?
Asked by Peter Van 't Zand
(13 rep)
Jan 31, 2018, 02:00 PM
Last activity: Mar 6, 2020, 10:30 AM
Last activity: Mar 6, 2020, 10:30 AM