Bitmap Heapscan is slow

I have 2 tables: *mailings* and *form_responses*. The *form_responses* table has a foreign key to the *mailings* table, so one *mailing* has many *form_responses*. I want to calculate *mailing* stats: for each *mailing*, the total number of *form_responses* and when the last *form_response* was received. I want to return the 20 *mailings* that most recently received a *form_response*. But when the query has to go through 50k+ rows, it gets really slow.

**mailings** (~100k rows)


        Column    |            Type             |                     Modifiers                     
    --------------+-----------------------------+---------------------------------------------------
     id           | integer                     | pkey
     project_id   | integer                     | not null
     deleted      | boolean                     |
     active       | boolean                     |
     created_at   | timestamp with time zone    | 
    
    Indexes:
     "mailings_pkey" PRIMARY KEY, btree (id)
     "mailing_on_project_id_where_not_deleted_and_active" btree (project_id) (WHERE deleted IS NOT true AND active IS TRUE)
     + individual indexes on almost all columns


**form_responses** (~6m rows)


        Column    |            Type             |                     Modifiers                     
    --------------+-----------------------------+---------------------------------------------------
     id           | integer                     | pkey
     project_id   | integer                     | not null
     mailing_id   | integer                     | 
     deleted      | boolean                     |
     is_unanswered| boolean                     |
     created_time | timestamp with time zone    | 
    
    Indexes:
     "form_responses_pkey" PRIMARY KEY, btree (id)
     "form_responses_on_pid_cr_time_where_not_del_not_is_unans" btree (project_id, created_time) (WHERE deleted IS NOT true AND is_unanswered IS NOT TRUE)
     + individual indexes on almost all columns

I want to get the mailing stats of a project (id=1) within a date range. I only want stats for mailings that are active and not deleted, and I only want to count non-deleted responses.

This gives me the following query and plan:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT
      count(DISTINCT form_responses.id) AS total,
      max(form_responses.created_time) AS form_response_last_received,
      (SELECT mailings.name FROM mailings WHERE id = form_responses.mailing_id) AS name,
      form_responses.mailing_id AS mailing_id FROM form_responses
    JOIN
      mailings ON mailings.id = form_responses.mailing_id
    WHERE
      form_responses.deleted IS NOT true AND
      form_responses.is_unanswered IS NOT true AND
      form_responses.created_time >= '2017-08-02 00:00:00' AND
      form_responses.created_time [...]

      ->  Sort  (cost=308997.69..308998.19 rows=197 width=16) (actual time=7762.552..7762.557 rows=20 loops=1)
            Sort Key: (max(form_responses.created_time))
            Sort Method: top-N heapsort  Memory: 26kB
            Buffers: shared hit=74535 read=13513 dirtied=1
            ->  GroupAggregate  (cost=307039.18..308992.45 rows=197 width=16) (actual time=7673.523..7762.278 rows=686 loops=1)
                  Group Key: form_responses.mailing_id
                  Buffers: shared hit=74535 read=13513 dirtied=1
                  ->  Sort  (cost=307039.18..307111.58 rows=28961 width=16) (actual time=7673.277..7702.783 rows=137542 loops=1)
                        Sort Key: form_responses.mailing_id
                        Sort Method: quicksort  Memory: 12592kB
                        Buffers: shared hit=71790 read=13512 dirtied=1
                        ->  Hash Join  (cost=7376.42..304892.90 rows=28961 width=16) (actual time=44.576..7609.818 rows=137542 loops=1)
                              Hash Cond: (form_responses.mailing_id = mailings.id)
                              Buffers: shared hit=71790 read=13512 dirtied=1
                              ->  Bitmap Heap Scan on form_responses  (cost=3815.27..299862.55 rows=134811 width=16) (actual time=41.439..7499.305 rows=137543 loops=1)
                                    Recheck Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time [...]
                                    ->  Bitmap Index Scan on form_responses_on_pid_cr_time_where_not_del_not_is_unans  (cost=0.00..3781.57 rows=134811 width=0) (actual time=23.892..23.892 rows=137543 loops=1)
                                          Index Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time [...]
                              ->  Hash  (cost=3286.82..3286.82 rows=21946 width=4) (actual time=3.093..3.093 rows=2866 loops=1)
                                    Buckets: 4096  Batches: 1  Memory Usage: 101kB
                                    Buffers: shared hit=758 dirtied=1
                                    ->  Bitmap Heap Scan on mailings  (cost=874.50..3286.82 rows=21946 width=4) (actual time=0.403..2.145 rows=2866 loops=1)
                                          Recheck Cond: ((project_id = 1) AND (deleted IS NOT TRUE) AND (active IS TRUE))
                                          Heap Blocks: exact=732
                                          Buffers: shared hit=758 dirtied=1
                                          ->  Bitmap Index Scan on mailing_on_project_id_where_not_deleted_and_active  (cost=0.00..869.01 rows=21946 width=0) (actual time=0.315..0.315 rows=2866 loops=1)
                                                Index Cond: (project_id = 1)
                                                Buffers: shared hit=26
                  SubPlan 1
                    ->  Index Scan using mailings_pkey on mailings mailings_1  (cost=0.42..8.44 rows=1 width=26) (actual time=0.003..0.003 rows=1 loops=686)
                          Index Cond: (id = form_responses.mailing_id)
                          Buffers: shared hit=2745 read=1
    Planning time: 0.865 ms
    Execution time: 7764.951 ms

In this example the query touches ~135k rows; this can go up to almost 600k. As you can see, roughly 97% of the execution time (about 7.5 s of 7.76 s) is spent on the Bitmap Heap Scan on *form_responses*. I run a db.t2.large RDS instance at AWS with 100 GB of SSD storage, running PostgreSQL 9.4.7.

- Is there any way to avoid doing the bitmap heap scan?
- Or a way to improve the speed of this?
- Can I create a covering index, so it doesn't have to read the form responses table itself? 
- Should I cluster the *form_responses* table on *project_id* and/or *mailing_id*?
- Should I upgrade my hardware? 
- Am I asking too much of my database?
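
To make the covering-index and clustering questions concrete, here is a rough sketch of what the two options could look like. Both index names are made up for illustration, and the column list is an assumption based on the columns the query above touches (using `created_time`, as in the query and the plan). Treat it as a starting point for testing, not as a verified fix.

```sql
-- Option 1: covering index (hypothetical name). Index-only scans exist in 9.4,
-- but before PostgreSQL 9.6 the planner skips them on a partial index when the
-- query filters on columns that are not stored in the index, so deleted and
-- is_unanswered are repeated as trailing key columns here.
CREATE INDEX CONCURRENTLY form_responses_stats_covering_idx
    ON form_responses (project_id, created_time, mailing_id, id, deleted, is_unanswered)
    WHERE deleted IS NOT TRUE AND is_unanswered IS NOT TRUE;

-- Index-only scans avoid the heap only for pages marked all-visible,
-- so the visibility map needs to be current.
VACUUM (ANALYZE) form_responses;

-- Option 2: physically order the table. CLUSTER rejects partial indexes,
-- so this assumes a separate, non-partial index (hypothetical name).
CREATE INDEX CONCURRENTLY form_responses_pid_created_time_idx
    ON form_responses (project_id, created_time);
CLUSTER form_responses USING form_responses_pid_created_time_idx;
ANALYZE form_responses;
```

Both options are heavy on a ~6m-row table. `CLUSTER` holds an exclusive lock on the table while it runs, and the physical ordering is not maintained for rows written afterwards, so it would have to be repeated periodically.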

                        
Asked by Peter Van 't Zand (13 rep)
Jan 31, 2018, 02:00 PM
Last activity: Mar 6, 2020, 10:30 AM