What index to use with lots of duplicate values?
22 votes · 3 answers · 17148 views
Let's make a few assumptions:
I have a table that looks like this:
```
a | b
---+---
a | -1
a | 17
...
a | 21
c | 17
c | -3
...
c | 22
```
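For reference, a minimal definition matching that layout might be the following; the column types are my assumption, not something fixed:

```sql
-- Hypothetical schema matching the sample above (names/types are assumptions).
CREATE TABLE t (
    a text,      -- ~100k distinct values, each appearing ~100k times
    b integer
);
```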
Facts about my set:
* Size of the whole table is ~ 10^10 rows.
* I have ~ 100k rows with value `a` in column `a`, and similarly for the other values (e.g. `c`).
* That means ~ 100k distinct values in column `a`.
* Most of my queries will read all or most of the rows for a given value of `a`, e.g. `select sum(b) from t where a = 'c'` (see the sketch after this list).
* The table is written in such a way that consecutive values are physically close (either it's written in order, or we assume `CLUSTER` was used on that table and column `a`).
* The table is rarely if ever updated; we're only concerned about read speed.
* The table is relatively narrow (say ~ 25 bytes per tuple, plus 23 bytes of tuple overhead).
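To make the read pattern concrete, this is the kind of query I mean, wrapped in `EXPLAIN` so the access path chosen for each index can be compared (the literal `'c'` is just one of the ~100k values):

```sql
-- Typical query: read all rows for one value of a and aggregate b.
-- EXPLAIN (ANALYZE, BUFFERS) shows which index (if any) is used and how
-- many index/heap pages are actually read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sum(b) FROM t WHERE a = 'c';
```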
Now the question is, what kind of index should I be using? My understanding is:
* **BTree** My issue here is that the BTree index will be huge since, as far as I know, it will store duplicate values (it has to, since it can't assume the table is physically sorted). If the BTree is huge, I end up having to read both the index and the parts of the table that the index points to. (We can use `fillfactor = 100` to decrease the size of the index a bit.)
* **BRIN** My understanding is that I can have a small index here at the expense of reading useless pages. A small `pages_per_range` means the index is bigger (which is a problem with BRIN since I need to read the whole index); a big `pages_per_range` means I'll read a lot of useless pages. Is there a magic formula to find a good value of `pages_per_range` that takes those trade-offs into account?
* **GIN/GiST** I'm not sure these are relevant here since they're mostly used for full-text search, but I also hear that they're good at dealing with duplicate keys. Would either a `GIN` or `GiST` index help here? (Rough definitions for all three candidates are sketched right after this list.)
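For concreteness, here is roughly what the candidates would look like; the index names and the `pages_per_range` value are placeholders, and the GIN variant assumes the `btree_gin` contrib extension, since plain GIN doesn't index an ordinary scalar column directly:

```sql
-- BTree: one entry per row, packed as tightly as possible for a read-only table.
CREATE INDEX t_a_btree ON t (a) WITH (fillfactor = 100);

-- BRIN: tiny index, but the right pages_per_range is the open question.
CREATE INDEX t_a_brin ON t USING brin (a) WITH (pages_per_range = 32);

-- GIN: needs btree_gin to index a plain text/int column; stores each
-- distinct key once with a compressed posting list of row pointers.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX t_a_gin ON t USING gin (a);
```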
Another question is, will Postgres use the fact that a table is `CLUSTER`ed (assuming no updates) in the query planner (e.g. by binary searching for the relevant start/end pages)? Somewhat related, can I just store all my columns in a BTree and drop the table altogether (or achieve something equivalent; I believe those are clustered indexes in SQL Server)? Is there some hybrid BTree/BRIN index that would help here?
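In case it clarifies what I mean by a clustered index: as far as I understand, Postgres has no real equivalent, and the closest approximation I can think of is a multicolumn BTree plus `CLUSTER`, roughly like this (the index name is made up):

```sql
-- An (a, b) btree can both drive CLUSTER and allow index-only scans for
-- sum(b), so the heap may not need to be touched at query time.
-- Index-only scans also require the visibility map to be up to date,
-- i.e. a recent VACUUM.
CREATE INDEX t_a_b ON t (a, b);
CLUSTER t USING t_a_b;   -- rewrite the table in (a, b) order
VACUUM ANALYZE t;        -- set the visibility map and refresh statistics
```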
I'd rather avoid using arrays to store my values since my query will end up less readable that way (I understand this would reduce the cost of the 23 bytes per tuple overhead by reducing the number of tuples).
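For completeness, this is the array layout I'm trying to avoid; it does cut per-tuple overhead, but to my eyes the query becomes noticeably less readable (names here are made up):

```sql
-- One row per distinct value of a, with all b values packed into an array.
CREATE TABLE t_arr AS
SELECT a, array_agg(b) AS bs
FROM   t
GROUP  BY a;

-- Equivalent of "select sum(b) from t where a = 'c'":
SELECT sum(x)
FROM   t_arr, unnest(bs) AS x
WHERE  a = 'c';
```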
Asked by foo (323 rep)
Mar 10, 2017, 03:24 PM
Last activity: Nov 21, 2024, 09:12 PM