
Database Administrators

Q&A for database professionals who wish to improve their database skills

Latest Questions

0 votes
1 answer
190 views
Small databases, large backup files
We have PostgreSQL 9.6 databases and a daily task running a pg_dump for all databases, though these backups are getting "large" at this point. My database was 900MB, then I tried clearing it by deleting old history which is not necessary anymore and after that ran VACUUM FULL. The statistics in pgAdmin say that the database is now only 30MB. When I run the pg_dump command manually through a command prompt, it creates a file of 22MB. When I run my daily task (Windows Task Scheduler) it still creates a backup file of 1GB. What am I missing at this point?
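A minimal sketch, not part of the original question: assuming the scheduled task dumps every database on the cluster, one quick check is to compare on-disk sizes and see whether another database accounts for the 1GB file.

```sql
-- List databases on the cluster by size; illustrative check only.
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```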
TimVK (111 rep)
May 9, 2018, 01:42 PM • Last activity: Jun 26, 2025, 10:07 PM
0 votes
1 answer
3603 views
Stored Procedure for Vacuum in PostgreSQL
I want to create a stored procedure to trigger vacuum. I know there are many ways to automate vacuum, but I would like to use a stored procedure (PostgreSQL 9.6). The problem I'm facing is that I can't run VACUUM inside BEGIN/END, so there is a workaround: we can use set autocommit=on. But when I add this autocommit inside the stored procedure, it throws an error. Sample SP:
create procedure test(table_list varchar(1000))
LANGUAGE plpgsql 
AS 
$$ 
DECLARE 
*** some values ***
BEGIN 
EXECUTE 'set autocommit=on';
*** Rest of the command to get the table names one by one and vacuum in a loop***
END;
$$
Is this possible? Or any alternate way?
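A hedged sketch of one alternate way (not from the original post): since VACUUM cannot run inside a function on 9.6, the statements can be generated by SQL and executed from the client, for example with psql's \gexec.

```sql
-- Generate one VACUUM per user table; \gexec makes psql run each
-- returned row as its own statement, outside any transaction block.
SELECT format('VACUUM (VERBOSE, ANALYZE) %I.%I', schemaname, relname)
FROM pg_stat_user_tables
\gexec
```

The same list could be restricted to the table_list argument from the question if only selected tables should be vacuumed.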
TheDataGuy (1986 rep)
Apr 11, 2020, 08:27 PM • Last activity: Jun 9, 2025, 04:04 PM
1 vote
1 answer
1532 views
Perform update and raise notice in Postgres Function
I'm trying to add a notice (or something similar) to a function that is executed on update by a trigger in a PostgreSQL 9.6 db. I want to let the user who edited the table know that something happened. I don't know if "NOTICE" is the proper RAISE level (why not "INFO"?). The central question is: **how can I add the RAISE NOTICE after the update?**
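A minimal sketch of what such a trigger function could look like (table and column names are placeholders, not from the original post); the RAISE runs inside an AFTER UPDATE trigger and the message goes back to the session that issued the UPDATE.

```sql
-- Hypothetical names; sketch only. NOTICE vs INFO mainly changes how
-- client_min_messages / log_min_messages filter the message.
CREATE OR REPLACE FUNCTION notify_after_update() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'row % was updated at %', NEW.id, now();
    RETURN NEW;
END;
$$;

CREATE TRIGGER trg_notify_after_update
AFTER UPDATE ON my_table
FOR EACH ROW EXECUTE PROCEDURE notify_after_update();  -- 9.6 syntax
```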
guillermo_dangelo (175 rep)
Mar 4, 2021, 04:27 PM • Last activity: May 29, 2025, 10:04 PM
1 vote
3 answers
269 views
PostgreSQL query performance issue
We are sometimes getting poor performance (~14s) when using the following query (PostgreSQL 9.6) to fetch rows from the table items whose ID is present in the table items_categories: SELECT items.* FROM items WHERE EXISTS ( SELECT item_id FROM items_categories WHERE item_id = items.id AND category_id = 626 ) AND items.active = TRUE -- possibly some others "AND" here to use more filters on "items", but not considered for this question ORDER BY modified_at DESC LIMIT 10 Relevant parts of our schema: Table "public.items" Column | Type | Modifiers -----------------------+-------------------+---------------------------------------------------- id | integer | not null default nextval('items_id_seq'::regclass) active | boolean | default true modified_at | timestamp without time zone | default now() Indexes: "items_pkey" PRIMARY KEY, btree (id) "active_idx" btree (active) "aggregate_idx" btree (id) "items_modified_at_idx" btree (modified_at) Table "public.items_categories" Column | Type | Modifiers -------------+---------+----------- item_id | integer | not null category_id | integer | not null Indexes: "unique_cat_item_assoc" UNIQUE CONSTRAINT, btree (item_id, category_id) "items_categories_1_idx" btree (category_id) "items_categories_2_idx" btree (item_id) Foreign-key constraints: "items_categories_category_id_fkey" FOREIGN KEY (category_id) REFERENCES categories(id) "items_categories_item_id_fkey" FOREIGN KEY (item_id) REFERENCES items(id) The table items contains ~2 M rows, and the table items_categories contains ~4 M rows When we ask for 10 items (i.e. LIMIT 10 at the end of the above query) and 10 or more rows match in items_categories, the performance is good (~10ms), but when we ask for 10 items and less than 10 rows match in items_categories, then the query takes ~14s because it’s doing an index scan on items.modified_at to look at each 2 M rows. Query plan when less than 10 rows match in items_categories (poor performance): Limit (cost=0.86..11696.68 rows=10 width=1797) (actual time=168.376..14484.854 rows=7 loops=1) -> Nested Loop Semi Join (cost=0.86..2746178.23 rows=2348 width=1797) (actual time=168.376..14484.836 rows=7 loops=1) -> Index Scan Backward using items_modified_at_idx on items (cost=0.43..1680609.95 rows=2243424 width=1797) (actual time=0.054..7611.300 rows=2251395 loops=1) Filter: active Rows Removed by Filter: 2467 -> Index Only Scan using unique_cat_item_assoc on items_categories (cost=0.43..0.47 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=2251395) Index Cond: ((item_id = items.id) AND (category_id = 626)) Heap Fetches: 7 Planning time: 3.082 ms Execution time: 14485.057 ms Query plan when more than 10 rows match in items_categories (good performance): Limit (cost=0.86..24.07 rows=10 width=1857) (actual time=3.575..3.757 rows=10 loops=1) -> Nested Loop Semi Join (cost=0.86..2763459.56 rows=1190819 width=1857) (actual time=3.574..3.752 rows=10 loops=1) -> Index Scan Backward using items_modified_at_idx on items (cost=0.43..1684408.22 rows=2246967 width=1857) (actual time=0.013..2.205 rows=751 loops=1) Filter: active -> Index Only Scan using unique_cat_item_assoc on items_categories (cost=0.43..0.47 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=751) Index Cond: ((item_id = items.id) AND (category_id = 20)) Heap Fetches: 10 Planning time: 1.650 ms Execution time: 3.868 ms How can we tune this query to handle both situations? (i.e. good performances no matter how many rows of items_categories match). 
I have a POC working where I first count the number of matching rows in items_categories (separate query) then if the number is low I use a CTE to work on a subset of items instead of all its rows, but it’s really a dirty temporary hack IMO… If the number is high, the CTE takes too long and it's more efficient to NOT use this CTE in this case (i.e. the above query performs better). Thank you!
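Purely illustrative, not an answer from the thread: one workaround sometimes tried for this kind of plan flip is to wrap the sort key in a no-op expression, so the planner can no longer walk items_modified_at_idx backwards and instead filters by category first. Whether that stays acceptable for categories with many matching rows would need testing.

```sql
-- Sketch only: "+ interval '0'" stops the ORDER BY from matching the plain
-- index on modified_at, pushing the planner toward join-then-sort.
SELECT items.*
FROM items
WHERE EXISTS (
        SELECT 1
        FROM items_categories
        WHERE item_id = items.id
          AND category_id = 626
      )
  AND items.active = TRUE
ORDER BY modified_at + interval '0' DESC
LIMIT 10;
```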
xav (113 rep)
Jul 26, 2021, 11:31 AM • Last activity: May 29, 2025, 11:01 AM
1 vote
2 answers
1810 views
Update all the elements of JSON array nested in a JSON column (Postgres)
I have a json column with data that looks like this:
CREATE TABLE foo(mycol)
AS SELECT $${
    "a": [
        { "q": 5 },
        { "q": 8 },
        { "q": 10 }
    ]
}$$::jsonb;
I'd like to map over the elements in the array and add an extra property to get something like this:
{
    a: [
        { q: 5, r: [] },
        { q: 8, r: [] },
        { q: 10, r: [] }
    ]
}
Any and all help would be greatly appreciated. PostgreSQL 9.6.6
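A hedged sketch of one way this could be done on 9.6 (jsonb_set, jsonb_agg and the || concatenation operator all exist there): rebuild the "a" array with the extra property merged into every element.

```sql
-- Rewrite the "a" array, appending an empty "r" array to each element.
UPDATE foo
SET mycol = jsonb_set(
        mycol,
        '{a}',
        (SELECT jsonb_agg(elem || '{"r": []}'::jsonb)
         FROM jsonb_array_elements(mycol -> 'a') AS elem)
    );
```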
mushishi78 (121 rep)
Aug 19, 2019, 12:57 PM • Last activity: May 27, 2025, 11:02 AM
2 votes
1 answer
246 views
Postgresql 9.6, create a user with only select permissions for a given database/schema
I want to create a user that: - has its default privileges revoked when I create it - can only perform SELECTs against a given database/schema. Is this possible?
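A hedged sketch of the usual recipe (role, database, and schema names are placeholders): grant only CONNECT, USAGE, and SELECT, and set default privileges so tables created later stay readable.

```sql
-- Placeholder names; adjust as needed.
CREATE ROLE report_reader LOGIN PASSWORD 'secret';

REVOKE ALL ON DATABASE mydb FROM PUBLIC;   -- optional hardening for everyone else
GRANT CONNECT ON DATABASE mydb TO report_reader;
GRANT USAGE ON SCHEMA public TO report_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO report_reader;

-- Tables created later (by the role running this) remain readable.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO report_reader;
```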
LiteWait (121 rep)
Jun 24, 2020, 03:12 PM • Last activity: May 26, 2025, 09:07 PM
0 votes
1 answer
370 views
pglogical library
I would like to try pglogical to set up an archive server from a live database, filtering on INSERT and UPDATE statements (DELETE statements will not be replicated to the archive server). I followed the official documentation, but when I set shared_preload_libraries = 'pglogical' in my configuration file (postgresql.conf) the server could not start. There is no pglogical file in the ./usr/pgsql-9.6/lib/ directory. How do I set up all the prerequisites so it starts properly (with v9.5 and v9.6)? I am using PostgreSQL v9.5 and v9.6 for a POC on CentOS 7. Many thanks.
Mika (361 rep)
Mar 4, 2018, 04:09 PM • Last activity: May 26, 2025, 03:03 PM
2 votes
1 answer
272 views
How to fetch related rows on multiple conditions with ranking
I have following table for storing item information in PostgreSQL. id | name | category | seller | m_age | status | type_id | region | -------------------------------------------------------------------------- 10 | TBox | Gun | 14 | 4 | Q4 | 12 | EU | 11 | Globe | Gun | 36 | 12 | Q1 | 17 | EU | 12 | Nike | CFK | 12 | 5 | Q2 | 12 | AS | 32 | Raven | Knife | 21 | 14 | Q1 | 12 | EU | 46 | TBox | Gun | 14 | 10 | Q4 | 12 | AS | 46 | Blaster | Gun | 14 | 10 | Q4 | 12 | EU | 78 | PaintGun | Gun | 12 | 10 | Q2 | 54 | US | 98 | TBox | OD | 21 | 4 | Q3 | 54 | AS | I want to find related items on basic of certain criteria. Category > title > Tech Info Matched(m_age, status, type_id, region) > Seller So for id 10 - TBox item rank should be following(name and category column value and Info matched and reason cololms are for visual reference only) id | name | category | INFO MATCHED | RANK | REASON | -------------------------------------------------------------------------------- 10 | TBox | Gun | - | - | - | -------------------------------------------------------------------------------- 11 | Diff | Same | 1 | 3 | same cat diff name 1 info match | 12 | Diff | Diff | 1 | 6 | diff cat diff name 1 info match | 32 | Diff | Diff | 2 | 7 | diff cat diff name 2 info match | 46 | Same | Same | 3 | 1 | same cat and name | 46 | Diff | Same | 4 | 2 | same cat diff name 4 info match | 78 | Diff | Same | 0 | 4 | same cat diff name 0 info match | 98 | Same | Diff | 1 | 5 | diff cat same name | I am still a novice to SQL. I have tried with normal CASE and function. but haven't got proper results. Rank will not work with this many condition. Is there any way to pull this results anyhow with PostgreSQL?
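Not from the thread, just a rough sketch of how a score like this can be expressed with boolean/CASE arithmetic against a reference row. The table name "items" is a placeholder, and the exact tie-breaking order from the question would still need tuning; this only shows the shape.

```sql
-- Rank every other item against item 10: count matching "tech info"
-- columns and push category/name matches to the front.
SELECT o.id, o.name, o.category,
       (o.m_age   = r.m_age  )::int +
       (o.status  = r.status )::int +
       (o.type_id = r.type_id)::int +
       (o.region  = r.region )::int AS info_matched,
       row_number() OVER (
           ORDER BY (o.category = r.category) DESC,
                    (o.name = r.name) DESC,
                    (o.m_age   = r.m_age  )::int +
                    (o.status  = r.status )::int +
                    (o.type_id = r.type_id)::int +
                    (o.region  = r.region )::int DESC
       ) AS rank
FROM items o
JOIN items r ON r.id = 10
WHERE o.id <> r.id;
```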
Rajan M (21 rep)
Apr 1, 2020, 06:33 AM • Last activity: May 14, 2025, 12:04 AM
2 votes
1 answer
342 views
Slow PostgreSQL 9.6.5 query with LIMIT, but only for tsquery with no results
Table a has > 3 million rows and this index: CREATE INDEX idx_a_txt ON a USING gin (txt); This query takes 5 seconds if I search for something that *doesn't* exist, and 31 milliseconds if I search for something that does: SELECT * FROM a WHERE a.txt @@ to_tsquery('simple_english', $1) LIMIT 10 The query plan: Limit (cost=0.00..847.17 rows=10 width=1285) -> Seq Scan on a (cost=0.00..560995.49 rows=6622 width=1285) Filter: (txt @@ '''test'''::tsquery) If I leave off the LIMIT 10, the results are reversed: In the non-existing case, it takes 12 milliseconds, but in the existing case, it takes > 30 seconds (naturally, because it's trying to return 50,000 rows.) Anyway, the query plan instead is: Bitmap Heap Scan on a (cost=1839.32..26016.86 rows=6622 width=1285) Recheck Cond: (txt @@ '''blahblahblah'''::tsquery) -> Bitmap Index Scan on idx_a_txt (cost=0.00..1837.67 rows=6622 width=0) Index Cond: (txt @@ '''blahblahblah'''::tsquery) I've run ANALYZE on table a as suggested [here](https://dba.stackexchange.com/a/80992/28774) but the results are still as above. If I instead SELECT COUNT(*) I get 50000 back in 150 milliseconds, or 0 back in 12 milliseconds. Is there any way to make such a query work quickly whether there are results or not? Reproducible example ============================================ Not quite the 5s, but 2s still vs 12-31ms, so you get the idea, at least: CREATE TEXT SEARCH DICTIONARY simple_english ( TEMPLATE = simple, stopwords = 'english' ); CREATE TEXT SEARCH CONFIGURATION simple_english ( PARSER = "default" ); ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR asciihword WITH simple_english; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR asciiword WITH simple_english; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR email WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR file WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR float WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR host WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR hword WITH simple_english; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR hword_asciipart WITH simple_english; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR hword_numpart WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR hword_part WITH simple_english; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR int WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR numhword WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR numword WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR sfloat WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR uint WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR url WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR url_path WITH simple; ALTER TEXT SEARCH CONFIGURATION simple_english ADD MAPPING FOR version WITH simple; drop table if exists a; create table a (id text, b text NOT NULL, c text, d timestamp with time zone NOT NULL, e text NOT NULL, f timestamp with time zone, g timestamp with time zone, h integer NOT NULL DEFAULT 0, i integer NOT NULL DEFAULT 0, j text, k text, l text, m text, n timestamp with time zone, o text NOT NULL, p text, q timestamp with time zone NOT NULL, r timestamp with time zone NOT NULL, s text, t character varying(5) NOT NULL 
DEFAULT 'fb'::character varying, txt tsvector ); CREATE INDEX idx_a_txt ON a USING gin (txt); insert into a (id, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, txt) SELECT uuid_in(md5(random()::text || now()::text)::cstring), 'foobar', 'foobar2', now(), md5(random()::text), now(), now(), random()*100, random()*100, random()::text, random()::text || ' ' || random()::text, random()::text || ' ' || random()::text, '', null, (random()*100000)::text, '', now(), now(), '', 'fb', to_tsvector('simple_english', md5(random()::text)) FROM generate_series(1,3400000);
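Purely illustrative, not from the original post: on 9.6 a CTE is an optimization fence, so one hedged workaround is to force the bitmap scan on the GIN index before LIMIT is applied. This trades some extra work in the many-matches case for predictable behaviour when nothing matches; it would need testing against the real data.

```sql
-- The CTE materializes the full-text matches via the GIN index first;
-- LIMIT then applies to that result instead of driving a seq scan.
WITH hits AS (
    SELECT *
    FROM a
    WHERE txt @@ to_tsquery('simple_english', 'test')
)
SELECT *
FROM hits
LIMIT 10;
```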
Kev (474 rep)
Oct 23, 2017, 07:20 PM • Last activity: Apr 22, 2025, 07:04 AM
0 votes
1 answer
519 views
select time intervals from a nested query
I have a table which has the following columns: id, start_timestamp, device_id. I am trying to get all start_timestamps within 5 seconds of a range of generated periodic timestamps. For example, generating a series of timestamps every 80 seconds, I need to get all 'start_timestamps' that fall within 5 seconds of each of those timestamps. So something like this query, but I can't figure out the actual query. CTEs might be too expensive as the table has millions of records. Appreciate any help, thanks! ``` SELECT * FROM time_intersections WHERE start_timestamp **within 5 seconds of each of these:** ( SELECT * FROM generate_series(timestamp '2021-03-07', timestamp '2021-08-16', interval '80 sec') )
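A hedged sketch of one way to express this (table and column names as given in the question): joining the generated series against the table with a range condition, which also lets an index on start_timestamp be used.

```sql
-- Each generated tick g.ts selects the rows within +/- 5 seconds of it.
SELECT t.*
FROM generate_series(timestamp '2021-03-07',
                     timestamp '2021-08-16',
                     interval '80 sec') AS g(ts)
JOIN time_intersections AS t
  ON t.start_timestamp BETWEEN g.ts - interval '5 sec'
                           AND g.ts + interval '5 sec';
```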
roy naufal (187 rep)
Sep 23, 2021, 04:13 PM • Last activity: Mar 16, 2025, 06:04 PM
1 vote
2 answers
3899 views
Setup VIP for PostgreSQL Replication
I have been assigned a task to set up streaming replication in PostgreSQL. I am not a DBA by profession and am doing this for the first time. I have set up replication; now, as we don't want to change the database connection string in the application during failover, I need to set up a VIP so that I provide only one IP in the connection string. I have used this article for setting up and testing the VIP. I am able to set up the VIP on a machine and to connect to PostgreSQL with this VIP. But when I tried to assign the same VIP to the slave server, I got a "cannot assign requested address" bind error. I need help to set up the VIP and make it work with streaming replication so that applications always point to the master server, i.e. point to the new master after failover. Let's suppose a case: if the VIP is assigned to the master and that VM goes down, how will I unassign the VIP from the old master and assign it to the new master?
YogeshR (93 rep)
Feb 17, 2020, 09:13 AM • Last activity: Feb 25, 2025, 02:04 PM
0 votes
3 answers
346 views
Postgres Specific table access
We have a requirement in our Postgres database: we want to give specific table permissions to a particular set of users. We also have Airflow in our environment, which syncs tables, but sometimes new columns are added to a table so we have to drop the table, due to which the specific table access for the users is gone. Access to a specific table is given through GRANT. Can you suggest a way in which specific table access can be given, and will remain if the table is dropped and recreated?
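Not from the thread: GRANTs are dropped with the table, but default privileges survive because they apply to tables created afterwards. A hedged sketch with placeholder role names; note that this covers every future table the creating role makes in the schema, not a single table, so it may be broader than the requirement.

```sql
-- Tables created later by airflow_user in this schema are readable by
-- report_role without re-granting after each drop/recreate.
ALTER DEFAULT PRIVILEGES FOR ROLE airflow_user IN SCHEMA public
    GRANT SELECT ON TABLES TO report_role;
```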
Jatin Bisht (3 rep)
Dec 3, 2020, 06:42 AM • Last activity: Jan 25, 2025, 12:04 PM
3 votes
2 answers
4650 views
Issues with enabling pg_stat_statements
I am having issues querying pg_stat_statements and am puzzled about what else I missed. See below what I have done. I still have issues when I query pg_stat_statements! I get an error saying the library is not loaded. 1. #show config_file;
------------------------------------------------------------
 /var/opt/rh/rh-postgresql96/lib/pgsql/data/postgresql.conf
2. I have edited the config file and added the following
shared_preload_libraries = 'pg_stat_statements'         # (change requires restart)
pg_stat_statements.max = 10000
pg_stat_statements.track = all
3. # \dx
List of installed extensions
        Name        | Version |   Schema   |                        Description
--------------------+---------+------------+-----------------------------------------------------------
 pg_stat_statements | 1.4     | public     | track execution statistics of all SQL statements executed
 plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)
4. #select count(*) from pg_stat_activity;
count
-------
    43
(1 row)
5. #select * from pg_stat_statements;
ERROR:  pg_stat_statements must be loaded via shared_preload_libraries
Thanks, Lucia
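Not part of the original post: since the extension is installed but the error persists, one hedged check is to confirm the running server has actually picked up the setting (it requires a full restart, not a reload), and that the edited file is the one in use.

```sql
-- Both should reflect the edited file and show pg_stat_statements preloaded.
SHOW config_file;
SHOW shared_preload_libraries;
```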
Lucia Sugianto (31 rep)
Oct 7, 2020, 10:56 PM • Last activity: Dec 31, 2024, 03:05 PM
0 votes
3 answers
1123 views
How to allow in-progress transactions to complete when shutting down PostgreSQL?
I'm working on automating as much as possible the mechanics of resizing an EBS volume on which we store all databases. We're going to host a large number of large-but-infrequently-used databases on a single EC2 instance, where the secondary EBS volume hosts PGDATA (aka all the data files), whose configuration we've embedded in this script here . One of the issues we've run into is that when extending the filesystem using **fsadm** under a running PostgreSQL, it determines that the target is busy and won't **umount** the volume. This is expected. I would like to shut down the database as gracefully as possible - while I'm fine to kill of open idle connections, I'd rather not kill it in the middle of any in-progress transactions. I'm hoping to find some kind of mechanism that allows me to block new connections, block any further transactions from existing connections, and allow existing transactions to complete. As far as I can tell, the most native way to shut down the database is using **pg_ctl stop -m**, where the **-m** takes one of three options: smart, fast and immediate. From my experiments, it appears that - **smart** will fail if there are any open idle connections (which is too generous for my needs, as my developers may leave long-lived connections open in either their Django app or their PGAdmin consoles). - **fast** will kill all connections whether active or idle, which seems to set me up to risk corrupting an in-progress transaction Does anyone have any suggestions for more gracefully shutting down PostgreSQL than killing off all active connections? Are there SQL commands I can run before firing off **pg_ctl stop**?
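Not from the thread, but a hedged sketch of SQL that can run before pg_ctl stop (database name is a placeholder): refuse new ordinary connections at the database level, terminate only idle sessions, and leave sessions with work in progress alone until they finish.

```sql
-- Superusers can still connect; ordinary new connections are refused.
ALTER DATABASE mydb CONNECTION LIMIT 0;

-- Terminate only idle sessions; active transactions keep running.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'mydb'
  AND state = 'idle'
  AND pid <> pg_backend_pid();
```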
ParanoidMike (101 rep)
Mar 25, 2018, 06:01 PM • Last activity: Dec 13, 2024, 06:34 PM
22 votes
3 answers
17144 views
What index to use with lots of duplicate values?
Let's make a few assumptions: I have a table that looks like this:
 a | b
---+---
 a | -1
 a | 17
 ...
 a | 21
 c | 17
 c | -3
 ...
 c | 22
Facts about my set:
* Size of the whole table is ~ 10^10 rows.
* I have ~ 100k rows with value a in column a, similar for other values (e.g. c).
* That means ~ 100k distinct values in column 'a'.
* Most of my queries will read all or most of the values for a given value in a, e.g. select sum(b) from t where a = 'c'.
* The table is written in such a way that consecutive values are physically close (either it's written in order, or we assume CLUSTER was used on that table and column a).
* The table is rarely if ever updated, we're only concerned about read speed.
* The table is relatively narrow (say ~25 bytes per tuple, + 23 bytes overhead).
Now the question is, what kind of index should I be using? My understanding is:
* **BTree** My issue here is that the BTree index will be huge since as far as I know it will store duplicate values (it has to, since it can't assume the table is physically sorted). If the BTree is huge, I end up having to read both the index and the parts of the table that the index points to. (We can use fillfactor = 100 to decrease the size of the index a bit.)
* **BRIN** My understanding is that I can have a small index here at the expense of reading useless pages. Using a small pages_per_range means that the index is bigger (which is a problem with BRIN since I need to read the whole index), having a big pages_per_range means that I'll read a lot of useless pages. Is there a magic formula to find a good value of pages_per_range that takes into account those trade-offs?
* **GIN/GiST** Not sure those are relevant here since they're mostly used for full-text search, but I also hear that they're good at dealing with duplicate keys. Would either a GIN or GiST index help here?
Another question is, will Postgres use the fact that a table is CLUSTERed (assuming no updates) in the query planner (e.g. by binary searching for the relevant start/end pages)? Somewhat related, can I just store all my columns in a BTree and drop the table altogether (or achieve something equivalent, I believe those are clustered indices in SQL Server)? Is there some hybrid BTree/BRIN index that would help here? I'd rather avoid using arrays to store my values since my query will end up less readable that way (I understand this would reduce the cost of the 23 bytes per tuple overhead by reducing the number of tuples).
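For reference (not part of the original question), creating a BRIN index with an explicit pages_per_range looks like this; the setting is exactly the knob weighed in the question.

```sql
-- Sketch: BRIN on the physically-clustered column. A smaller
-- pages_per_range gives a bigger but more selective index; larger, the opposite.
CREATE INDEX t_a_brin ON t USING brin (a) WITH (pages_per_range = 32);
```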
foo (323 rep)
Mar 10, 2017, 03:24 PM • Last activity: Nov 21, 2024, 09:12 PM
0 votes
2 answers
263 views
unable to extract json column data (postgres 9.6)
I can run below query to get the data in given format as: SELECT to_timestamp(unnest(ARRAY[[1725042600, 1725043500], [1725041700, 1725042600], [1725043500, 1725044400], [1725040800, 1725041700]])); to_timestamp ------------------------ 2024-08-30 18:30:00+00 2024-08-30 18:45:00+00 2024-08-30 18:15:00+00 2024-08-30 18:30:00+00 2024-08-30 18:45:00+00 2024-08-30 19:00:00+00 2024-08-30 18:00:00+00 2024-08-30 18:15:00+00 (8 rows) I am using postgres version 9.6 and below is schema test_database=> \d fact_completeness_count_requests Table "public.fact_completeness_count_requests" Column | Type | Collation | Nullable | Default -------------------------+-----------------------------+-----------+----------+--------- request_id | character varying(64) | | not null | event_type | character varying(255) | | not null | technology | character varying(255) | | | vendor | character varying(255) | | | name | character varying(255) | | not null | dataset_metadata | json | | not null | I am using below query to genrate the expected output. As shown in query output above. Not sure how to use array and to_timestamp function with below query. Is it possible to archive with postgres 9.6 version select request_id ,dataset_metadata->> 'intervals_epoch_seconds' as epoc_seconds from fact_completeness_count_requests ; 31319ad1-e848-4ec3-9c3e-967981e2ae45-0 | [[1725048000, 1725051600]] 7a05cc38-5303-417d-88ce-fe3a604570d2 | [[1725055200, 1725056100]] ae6c2b09-8a95-4ac0-9846-6e76071579af | [[1725050700, 1725051600], [1725049800, 1725050700], [1725048900, 1725049800], [1725048000, 1725048900]]
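A hedged sketch of one way to expand the nested arrays on 9.6 (json_array_elements and json_array_elements_text are available there): unnest the outer array of [start, end] pairs, then each pair, and convert every epoch second.

```sql
-- Two levels of unnesting, then epoch-to-timestamp conversion.
SELECT r.request_id,
       to_timestamp(sec::bigint) AS ts
FROM fact_completeness_count_requests AS r,
     json_array_elements(r.dataset_metadata -> 'intervals_epoch_seconds') AS pair,
     json_array_elements_text(pair) AS sec;
```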
vikrant rana (291 rep)
Aug 31, 2024, 12:30 PM • Last activity: Sep 3, 2024, 02:22 PM
1 vote
1 answer
52 views
Why does ORDER BY significantly slow down my query with a computed score of trigram similarity and null field?
I'm working on optimizing a query in PostgreSQL, and I've encountered a performance issue when using the ORDER BY clause. The query is intended to search profiles based on a similarity match to a name (for example: 'john') and then order the results by a computed score. The score is a combination of word similarity and whether the profile has an avatar. Here's the query: SELECT uuid, type, byline, display_name, username, avatar, ( word_similarity('john', search_text) + CASE WHEN avatar != '' THEN 1 ELSE 0 END ) AS combined_score FROM test_mv_all_profiles WHERE 'john' Sort (cost=35130.07..35158.41 rows=11335 width=52) (actual time=8092.502..8092.565 rows=100 loops=1) Sort Key: ((word_similarity('john'::text, search_text) + (CASE WHEN ((avatar)::text ''::text) THEN 1 ELSE 0 END)::double precision)) DESC Sort Method: top-N heapsort Memory: 51kB Buffers: shared hit=66811 -> Bitmap Heap Scan on test_mv_all_profiles (cost=187.84..34696.86 rows=11335 width=52) (actual time=69.060..8052.737 rows=90765 loops=1) Recheck Cond: ('john'::text Bitmap Index Scan on test_idx_mv_social_profile_search_text_trigram_idx_gin (cost=0.00..185.01 rows=11335 width=0) (actual time=58.323..58.323 rows=91483 loops=1) Index Cond: ('john'::text Index Scan using test_idx_mv_social_profile_search_text_trigram_idx on test_mv_all_profiles (cost=0.42..44444.13 rows=11335 width=52) (actual time=0.506..4.417 rows=100 loops=1) Index Cond: ('john'::text <% search_text) Rows Removed by Index Recheck: 1 Buffers: shared hit=311 Planning time: 0.118 ms Execution time: 4.482 ms My questions: - Why does the ORDER BY clause slow down the query so much? - Is there a way to optimize this query while keeping the ORDER BY clause? Would adding an index on the computed score help, and if so, how should I approach that? Additional Information: The table test_mv_all_profiles is a materialized view with around 11M rows. We are using a rather old version of Postgres (9.6) so some newer features are not available to us in the mean time. The search_text field is a concatenation of multiple columns (like username, first name, and last_name). I already have a trigram index on search_text for the similarity search. I'm looking for advice on how to maintain performance while still being able to sort by the combined score. Any insights or recommendations would be greatly appreciated!
Sheila Loekito (11 rep)
Aug 28, 2024, 10:11 PM • Last activity: Aug 29, 2024, 10:29 AM
0 votes
1 answer
117 views
UNIQUE CONSTRAINT on two columns, but duplicates appeared: how to track down and fix?
We have a table for users:
CREATE UNIQUE INDEX uk_users_login_socnet ON public.users USING btree (login, socnet)
We had maintenance on our server, during which it was backed up and replicated to a different server using pg_basebackup (completed by our sysadmins, reportedly without errors or issues). Since then we have errors in our logs like the following:
ERROR:  duplicate key value violates unique constraint "uk_users_login_socnet"
DETAIL:  Key (login, socnet)=(user@example.com, 1) already exists. (UPDATE users SET extra = COALESCE(extra, '') || hstore('udid', '') || hstore('udid_last_update', '1721646485') WHERE id = 1234567;)
In the meantime, the following query returns no results:
SELECT u1.login, u1.socnet
FROM users u1
INNER JOIN users u2 ON u1.login = u2.login AND u1.socnet = u2.socnet
while its EXPLAIN ANALYZE is:
Hash Join  (cost=2817129.95..6386837.44 rows=9847200 width=18) (actual time=13654.755..31681.796 rows=15584155 loops=1)
  Hash Cond: (((u1.login)::text = (u2.login)::text) AND (u1.socnet = u2.socnet))
  ->  Seq Scan on users u1  (cost=0.00..2490077.18 rows=15678918 width=18) (actual time=0.026..8806.271 rows=15582973 loops=1)
  ->  Hash  (cost=2490077.18..2490077.18 rows=15678918 width=18) (actual time=13653.093..13653.093 rows=15582973 loops=1)
        Buckets: 524288  Batches: 32  Memory Usage: 28510kB
        ->  Seq Scan on users u2  (cost=0.00..2490077.18 rows=15678918 width=18) (actual time=0.186..10887.626 rows=15582973 loops=1)
Planning time: 1.749 ms
Execution time: 32166.739 ms
But this query:
SELECT t1.login, t1.socnet, t2.login, t2.socnet
FROM users AS t1
LEFT JOIN (SELECT id, login, socnet FROM users) AS t2
ON t2.login = t1.login AND t1.socnet = t2.socnet
WHERE t1.id != t2.id
shows a list of duplicated rows with the same login and socnet, which to me makes no sense. Here's the respective EXPLAIN ANALYZE:
Hash Join  (cost=2817129.95..6411455.44 rows=9847199 width=36) (actual time=17015.349..33466.957 rows=1182 loops=1)
  Hash Cond: (((t1.login)::text = (users.login)::text) AND (t1.socnet = users.socnet))
  Join Filter: (t1.id <> users.id)
  Rows Removed by Join Filter: 15583110
  ->  Seq Scan on users t1  (cost=0.00..2490077.18 rows=15678918 width=22) (actual time=0.034..9902.685 rows=15583110 loops=1)
  ->  Hash  (cost=2490077.18..2490077.18 rows=15678918 width=22) (actual time=14344.722..14344.722 rows=15583110 loops=1)
        Buckets: 524288  Batches: 32  Memory Usage: 30951kB
        ->  Seq Scan on users  (cost=0.00..2490077.18 rows=15678918 width=22) (actual time=0.024..11382.363 rows=15583110 loops=1)
Planning time: 1.764 ms
Execution time: 33467.260 ms
The PostgreSQL server version is 9.6. My questions are: - What is happening? Why do I have duplicate data within a UNIQUE CONSTRAINT? - Why do I have different results for INNER JOIN and LEFT JOIN? - What might have happened to the DB structure and how do I debug it? Where should I look for hints on what's broken? - How do I fix this to a consistent state?
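Not from the thread, but two hedged checks that fit this situation: locate the duplicates with an aggregate (so the possibly-corrupt index is not consulted), and rebuild the index once the offending rows are cleaned up.

```sql
-- Aggregate scan: shows (login, socnet) pairs stored more than once,
-- regardless of what the damaged unique index claims.
SELECT login, socnet, count(*) AS copies
FROM users
GROUP BY login, socnet
HAVING count(*) > 1;

-- After de-duplicating those rows, rebuild the unique index.
REINDEX INDEX uk_users_login_socnet;
```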
Eduard Sukharev (129 rep)
Jul 22, 2024, 11:49 AM • Last activity: Jul 22, 2024, 01:28 PM
2 votes
1 answer
1109 views
Arbitrary queries on n:m relationship, including "all" and "any"
I'm using postgres >= 9.6. I have tasks, tags and task_tags tables, for a typical n:m relationship between tasks and tags. I'd like to be able to programmatically build queries against the tasks table that supports queries against the actual fields of tasks, but also on the tags (tag names) of a task. Queries on the task fields themselves are straight-forward. Queries on the tags for a "does it have tag A?" are also straight-forward. What I am struggling with is coming up with a select/query structure that would allow me to also match things like "does it have tag A and tag B"? The best I've come up with is a lateral join on a subquery with an array aggregation, and then using array matching functions, e.g.: SELECT DISTINCT ON (tasks.id) tasks.* FROM tasks, LATERAL (SELECT array_agg(tags.name) AS tags FROM task_tags INNER JOIN tags ON task_tags.tag_id = tags.id WHERE task_tags.task_id = tasks.id GROUP BY task_tags.task_id) tt WHERE tt.tags @> array['tag1'::varchar, 'tag3'::varchar]; That way, it should be possible to programmatically build a WHERE clause (using tasks.* and tt.tags) satisfying all of the conditions of the user-provided "query". However, I'm not sure if this is the best way of doing it - thoughts? Is this query any efficient? Is there any index I could create that would improve it? Similarly, is there any way at all of making it work with wildcards against the tag names? Normal array matching wouldn't allow that, and solutions I've seen suggest using unnest (or, well, not using arrays in the first place), but then I'd lose the ability of saying "it needs to have both tagA and tagB". Is there any other way of building a query on these relationships that would allow that kind of "both tagA and tagB" matching?
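Not from the thread: a hedged sketch of the classic relational-division shape, which keeps "must have all of these tags" semantics and also allows wildcards, since it filters on tags.name directly (it assumes tasks.id is the primary key so SELECT t.* with GROUP BY t.id is allowed).

```sql
-- Tasks having BOTH 'tagA' and 'tagB': the distinct count must equal the
-- number of required tags. A LIKE/ILIKE pattern can replace the IN list.
SELECT t.*
FROM tasks t
JOIN task_tags tt ON tt.task_id = t.id
JOIN tags g       ON g.id = tt.tag_id
WHERE g.name IN ('tagA', 'tagB')
GROUP BY t.id
HAVING count(DISTINCT g.name) = 2;
```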
Alex Hornung (123 rep)
Dec 31, 2017, 09:34 AM • Last activity: Jul 22, 2024, 12:50 AM
1 vote
2 answers
6333 views
pg_dump --exclude-table not working on one specific table
Wanting to exclude a few very large table in a pg_dump, I use the -T option multiple times, like this: pg_dump --clean clientportal -T article_plain -T callstats -T cdrviewer_cdr -T calls -T cdrviewer_datasource -T django_sesssion -T django_admin_log article_plain however still gets included in the dump. Using --schema-only here for demonstration purposes: postgres@server:~/bin$ pg_dump -s --clean clientportal -T article_plain -T callstats -T cdrviewer_cdr -T calls -T cdrviewer_datasource -T django_sesssion -T django_admin_log | grep article_plain ALTER TABLE ONLY otrs.article_plain DROP CONSTRAINT fk_article_plain_create_by_id; ALTER TABLE ONLY otrs.article_plain DROP CONSTRAINT fk_article_plain_change_by_id; ALTER TABLE ONLY otrs.article_plain DROP CONSTRAINT fk_article_plain_article_id_id; DROP INDEX otrs.article_plain_article_id; ALTER TABLE ONLY otrs.article_plain DROP CONSTRAINT article_plain_pkey; ALTER TABLE otrs.article_plain ALTER COLUMN id DROP DEFAULT; DROP SEQUENCE otrs.article_plain_id_seq; DROP TABLE otrs.article_plain; -- Name: article_plain; Type: TABLE; Schema: otrs; Owner: sg1admin CREATE TABLE article_plain ( <----- there is article_plain ALTER TABLE article_plain OWNER TO sg1admin; ... Why is that?
Julien Lamarche (131 rep)
May 13, 2021, 06:11 PM • Last activity: Jun 25, 2024, 05:19 PM
Showing page 1 of 20 total questions