PostgreSQL (big data): search by tags and order by timestamp
4
votes
5
answers
4959
views
We need to add a search feature on a large table (200M+ rows):
item_id | tags | created_at | ...
-------------------------------------------------------------------
1 | ['tag1', 'bar2'] | 2020-01-06 12:43:32 |
2 | ['example5', 'tag9', 'foo2'] | 2020-01-10 10:40:00 |
3 | ['test1', 'tag5'] | 2020-01-11 12:43:32 |
...
The queries would be similar to this one:
SELECT * FROM items
WHERE tags @> ARRAY['t2', 't5']::varchar[]
ORDER BY created_at DESC
LIMIT 100;
Basically it's like searching some logs by tags and ordering them by timestamp. Seems a common scenario...
What index should we use? Have you ever tested something similar in production?
- Example 1: create a GIN index on tags. The problem is that the search may return millions of results and in order to apply order / limit you need to make millions of reads from the table on disk (in order to get the created_at value for each row).
- Example 2: add the btree_gin extension and create a composite index on created_at and tags. The problem is the same as above: I think that PostgreSQL cannot use ordering since the index is declared as a GIN index and not as a btree.
- Example 3: create a btree index on created_at and tags. PostgreSQL needs to scan the whole index, since btree doesn't support array operators. I also fear that due to the SELECT *
PostgreSQL will not use an index-only scan, thus resulting in millions of reads from disk (that would be actually useless since it only needs 100 reads from disk).
Asked by collimarco
(653 rep)
Jan 16, 2020, 11:32 AM
Last activity: Dec 31, 2024, 06:01 AM
Last activity: Dec 31, 2024, 06:01 AM