Sample Header Ad - 728x90

PostgreSQL imbalanced index query optimization

2 votes
1 answer
215 views
I am having performance issues with my query because the index is heavily imbalanced. I have a column family_id on a table. The table has about 150 million records. This family_id column identifies which records inside the table belong to the same family. A family has, on average, ~5 members. However, because of the data is not clean, there exist edge cases in our data where the family_id is NULL (~600.000 records) or where the family_id is -1 (~3.2 million records). I am building a data pipeline in Logstash which queries this data and incrementally forwards it to a Elasticsearch index. For my use case, I need to bring the data on a per family basis. The case where family_id is -1 or NULL requires special treatment, and therefore I need to filter based on that. However, such queries take extremely long for a batch of 10.000 records. To give an idea of the time difference; the normal case takes about 10-20 seconds, whereas the -1 or NULL case takes about 30 minutes. Is there a proper solution for such cases such that I can speed up this query? Thanks! **Query and DDL** As requested I have added the table DDL, query and EXPLAIN ANALYZE. The table DDL:
CREATE TABLE "xml".t_patent_document_values (
	patent_document_value_id serial4 NOT NULL,
	publication_id int4 NOT NULL,
	ucid varchar(32) NULL,
	lang bpchar(2) NULL,
	country bpchar(2) NULL,
	doc_number varchar(32) NULL,
	kind varchar(4) NULL,
	published date NULL,
	produced date NULL,
	withdraw bool NULL DEFAULT false,
	family_id int4 NULL,
	status varchar(16) NULL,
	modified_load_id int4 NOT NULL,
	created_load_id int4 NULL,
	deleted_load_id int4 NULL,
	CONSTRAINT t_patent_document_values_pkey PRIMARY KEY (patent_document_value_id)
);
CREATE INDEX family_id_publication_id_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE (family_id = '-1'::integer);
CREATE INDEX family_id_publication_id_null_idx ON xml.t_patent_document_values USING btree (family_id, publication_id) WHERE ((family_id IS NULL) AND (publication_id > 0));
CREATE INDEX idx_country_published ON xml.t_patent_document_values USING btree (country, published);
CREATE INDEX idx_created_load_id ON xml.t_patent_document_values USING btree (created_load_id);
CREATE INDEX idx_created_load_id_modified_load_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id);
CREATE INDEX idx_created_load_id_modified_load_id_family_id ON xml.t_patent_document_values USING btree (created_load_id, modified_load_id, family_id);
CREATE INDEX idx_ctry_dnum ON xml.t_patent_document_values USING btree (country, doc_number);
CREATE INDEX idx_family_id ON xml.t_patent_document_values USING btree (family_id);
CREATE INDEX idx_modified_load_id ON xml.t_patent_document_values USING btree (modified_load_id);
CREATE INDEX idx_patent_document_publication_id ON xml.t_patent_document_values USING btree (publication_id);
CREATE INDEX idx_patent_document_published ON xml.t_patent_document_values USING btree (published);
CREATE INDEX idx_patent_document_ucid ON xml.t_patent_document_values USING btree (ucid);
CREATE INDEX idx_patent_document_values_country ON xml.t_patent_document_values USING btree (country);
CREATE INDEX idx_published_month ON xml.t_patent_document_values USING btree (date_part('month'::text, published));
CREATE INDEX idx_published_year ON xml.t_patent_document_values USING btree (date_part('year'::text, published));
CREATE INDEX idx_t_patent_document_values_cmp ON xml.t_patent_document_values USING btree (publication_id, modified_load_id);
CREATE INDEX idx_withdrawn ON xml.t_patent_document_values USING btree (withdraw);
The fast query (family_id != -1 or family_id is not null):
select array_agg(tpdv.publication_id)
from 
	xml.t_patent_document_values tpdv 
where 
	tpdv.family_id > 0
group by tpdv.family_id
order by tpdv.family_id
limit 1000
Its EXPLAIN ANALYZE:
Limit  (cost=0.57..20178.18 rows=1000 width=36) (actual time=0.017..16.191 rows=1000 loops=1)
  ->  GroupAggregate  (cost=0.57..473243036.65 rows=23453864 width=36) (actual time=0.016..16.123 rows=1000 loops=1)
        Group Key: family_id
        ->  Index Scan using idx_family_id on t_patent_document_values tpdv  (cost=0.57..472220861.17 rows=145800436 width=8) (actual time=0.012..15.608 rows=1002 loops=1)
              Index Cond: (family_id > 0)
Planning time: 0.323 ms
Execution time: 16.259 ms
Note that in reality this query will do a lot more aggregations over the family_id, but this is a simplified version such that the problem might be easier to tackle. The slow query:
select *
from 
	xml.t_patent_document_values tpdv 
	where (tpdv.family_id = -1 or tpdv.family_id is null) and publication_id > 0
order by publication_id
limit 1000
Its EXPLAIN ANALYZE:
Limit  (cost=0.57..122889.16 rows=1000 width=73) (actual time=52632.236..1731648.507 rows=1000 loops=1)
  ->  Index Scan using idx_patent_document_publication_id on t_patent_document_values tpdv  (cost=0.57..484776575.75 rows=3944846 width=73) (actual time=52632.235..1731648.326 rows=1000 loops=1)
        Index Cond: (publication_id > 0)
        Filter: ((family_id = '-1'::integer) OR (family_id IS NULL))
        Rows Removed by Filter: 27646323
Planning time: 3.102 ms
Execution time: 1731657.620 ms
Asked by thijsvdp (121 rep)
Dec 12, 2022, 11:51 AM
Last activity: Jun 11, 2025, 05:06 AM