Database Administrators
Q&A for database professionals who wish to improve their database skills
Latest Questions
1
votes
0
answers
24
views
Suggestions on migrating and sharding a 1 TB standalone PostgreSQL cluster (with partitions and multiple databases) to Citus
We are currently in the **evaluation/planning phase**, assessing Citus (open source) as a multi-tenant sharding solution for our existing PostgreSQL database cluster. We anticipate significant data growth from a new application feature and need to scale beyond our current single-node setup.
Current Setup:
- A standalone, single-node PostgreSQL instance (open source) with approximately 1 TB of data.
- pgpool-II is used for load balancing, with a primary and one standby for basic high availability.
- The cluster contains multiple databases; each database contains multiple schemas.
- Within those schemas we have numerous tables, functions, and other SQL objects.
- Many tables use standard PostgreSQL declarative partitioning (both range and hash methods) for performance management.

Goal: We aim to migrate to a Citus cluster and implement database sharding to handle the expected large volume of new data, while also effectively incorporating our existing 1 TB of historical data.
We have reviewed some of the basic Citus documentation but have several key questions regarding the migration and ongoing management of our specific setup:
1. Handling Multiple Databases: Our current structure has data spread across multiple logical databases within the single PostgreSQL instance. How should this multi-database structure be handled when migrating to Citus? Is there a recommended approach for managing multiple logical databases within a Citus cluster from a migration and ongoing data distribution perspective?
2. Existing and Future Partitioning: We heavily use standard PostgreSQL range and hash partitioning on our tables. How are these existing partitioned tables and their data migrated into a Citus distributed-table setup? Does Citus automatically handle the data from partitions? How is partitioning typically handled for new and ongoing data within a Citus distributed table? Can we still use time-based partitioning effectively within Citus shards, for example? (See the sketch after this list for the pattern we have in mind.)
3. Load Balancing and High Availability (pgpool/Replication): Can our existing pgpool-II setup be repurposed or used in conjunction with a Citus cluster for load balancing client connections? What are the recommended strategies for high availability, replication, and load balancing within a Citus cluster itself (for both coordinator and worker nodes)?
4. Schema Distribution to Workers: When we distribute a table (e.g., using create_distributed_table), how does Citus handle the schema definition on the worker nodes? Does the full schema structure (including other non-distributed objects or schemas) get replicated to workers, or only the necessary table definitions for the shards?
5. Monitoring in a Distributed Environment: On our single node, we rely on standard PostgreSQL system views and functions like pg_stat_activity, pg_stat_statements, pg_control_checkpoint(), pg_buffercache, pg_stat_user_tables, pg_stat_bgwriter, etc., for monitoring and performance tuning. How do these tools work in a distributed Citus cluster, where data and activity are spread across multiple worker nodes? How do we get a comprehensive, cluster-wide view of performance and activity?
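For question 2, here is a minimal sketch of the pattern we understand from the Citus time-series documentation, where a table is both range-partitioned by time with standard declarative partitioning and distributed on a tenant key. The table and column names (events, tenant_id, created_at) are placeholders, not our real schema:
```
-- Hypothetical tenant table: range-partitioned by time, then distributed by tenant.
CREATE TABLE events (
    tenant_id  bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- Distribute the partitioned parent; each partition is sharded on tenant_id.
SELECT create_distributed_table('events', 'tenant_id');

-- Time partitions can still be added afterwards in the usual way.
CREATE TABLE events_2025_05 PARTITION OF events
    FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');
```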
We would appreciate guidance, insights, or experiences from the community on these points as we plan our migration to Citus.
**Also, please advise if there is another sharding solution that can be tried for our current setup.**
Ramzy
(11 rep)
May 16, 2025, 07:48 AM
0
votes
1
answers
424
views
Sharding in Timescaledb (Postgres) Opensource
**If I am wrong in my understanding, please feel free to correct me**
In my project, we have a time-series database. It is set up as a 3-node Patroni cluster (one leader, two read replicas). Each node is an AWS EC2 instance where the time-series data is stored in hypertables provided by the TimescaleDB extension on a PostgreSQL database. We are using open-source TimescaleDB here.
As the data grows each day, the EBS data volume on a node (EC2 instance) is expected to hit its size limit in the future. Hence the need for sharding.
As a potential solution, we looked at distributed hypertables in TimescaleDB, but that seems to be a dead end, since multi-node support (on top of which distributed hypertables are built) has been deprecated in open-source TimescaleDB.
Another option is Citus (a Postgres extension that implements sharding). However, Citus doesn't support the TimescaleDB extension in Postgres, so at a high level we would have to convert the TimescaleDB hypertables to regular Postgres tables first in order to use Citus. So far Citus seems to be the most suitable (relatively speaking) choice for implementing sharding.
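To make that conversion step concrete, here is a minimal sketch of what we have in mind; metrics, device_id, and metrics_plain are placeholder names, and a real migration would also need to carry over indexes and constraints and deal with compressed chunks:
```
-- Materialize the hypertable into a plain PostgreSQL table.
CREATE TABLE metrics_plain (LIKE metrics INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
INSERT INTO metrics_plain SELECT * FROM metrics;  -- reads all hypertable chunks

-- Then, on the Citus cluster, shard the plain table on a suitable key.
SELECT create_distributed_table('metrics_plain', 'device_id');
```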
Could someone please suggest a better way (if there is any)?
**Edit Note**: Data archival or purging is not an option for us; all of the data is needed. Compression has already been applied as much as possible. This has bought us some additional time before the storage limit of an EBS data volume is reached, but eventually sharding will be required.
HelloJack
(1 rep)
Aug 10, 2024, 10:50 AM
• Last activity: May 11, 2025, 03:04 PM
1
votes
0
answers
1330
views
Postgres Citus error "complex joins are only supported when all distributed tables are joined on their distribution columns with equal operator"
I have two tables defined as distributed in Citus based on the same field:
select create_distributed_table('gedi','clould_metadata_id');
select create_distributed_table('cloud_metadata','clould_metadata_id');
The clould_metadata_id column is unique in cloud_metadata; the gedi table stores millions of records, while cloud_metadata stores around 3,000.
When I try to inner join those tables using:
select * from cloud_metadata cm
inner join gedi g
on cm.clould_metadata_id = g.clould_metadata_id
I get the error message "SQL Error [0A000]: ERROR: complex joins are only supported when all distributed tables are joined on their distribution columns with equal operator".
I believe that's precisely what I'm doing: joining those tables on their distribution columns. So what am I doing wrong?
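For reference, a quick sanity check on the metadata (hedged: the citus_tables view exists from Citus 10 onward; older versions expose similar columns in pg_dist_partition) is to confirm that both tables really report the same distribution column and colocation group:
```
SELECT table_name, distribution_column, colocation_id
FROM citus_tables
WHERE table_name IN ('gedi'::regclass, 'cloud_metadata'::regclass);
```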
Mauro Assis
(111 rep)
Sep 26, 2021, 01:17 PM
• Last activity: Mar 3, 2025, 07:04 AM
0
votes
0
answers
77
views
Postgresql Logical Replication from citus distributed table to another citus distributed table
I followed this guide (https://docs.citusdata.com/en/stable/develop/integrations.html) to implement logical replication of the "items" table between two PostgreSQL databases. On DB1 the "items" table is distributed using the Citus extension (1 coordinator + 2 worker nodes); on DB2 the Citus extension is installed but the "items" table is not distributed (single node).
In detail: On DB1 I created the items_pub publication:
CREATE PUBLICATION items_pub FOR TABLE items;
and then I created the replication slots:
SELECT * FROM run_command_on_all_nodes(
$$ SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput', false) $$
);
On DB2 I created the subscriptions:
create subscription sub
connection 'host=10.100.100.37 user=user dbname=DB1 password=password port=5432'
publication items_pub
WITH (copy_data=true,create_slot=false,slot_name='cdc_slot');
create subscription sub_1
connection 'host=10.100.100.38 user=user dbname=DB1 password=password port=5432'
publication items_pub
WITH (copy_data=false,create_slot=false,slot_name='cdc_slot');
create subscription sub_2
connection 'host=10.100.100.39 user=user dbname=DB1 password=password port=5432'
publication items_pub
WITH (copy_data=false,create_slot=false,slot_name='cdc_slot');
Everything works perfectly and every event of INSERT, DELETE, UPDATE on the items table of DB1 is propagated correctly to the items table of DB2.
The problem arises when I convert the "items" table on DB2 into a distributed table via Citus (again with 1 coordinator + 2 worker nodes). At that point, no events coming from DB1 are received by DB2. I have tried several solutions without success. I would like to understand whether what I am trying to implement is possible, and if there are other solutions.
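For anyone reproducing this, the subscription and slot state can be inspected with standard views; this is only a diagnostic sketch, not a fix:
```
-- On DB2 (subscriber): is each subscription still receiving changes?
SELECT subname, received_lsn, latest_end_lsn, last_msg_receipt_time
FROM pg_stat_subscription;

-- On DB1 (publisher side), across coordinator and workers: is the slot active?
SELECT * FROM run_command_on_all_nodes(
  $$ SELECT count(*) FROM pg_replication_slots
     WHERE slot_name = 'cdc_slot' AND active $$
);
```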
Babylon Big Data
(1 rep)
Jul 10, 2024, 12:07 PM
• Last activity: Jul 10, 2024, 12:16 PM
0
votes
1
answers
104
views
Citus: Parallel worker not launching
We are performing Citus benchmarking and came across a situation where the planner makes a parallel execution plan, but during execution no parallel workers are launched. The cluster has 3 worker nodes, each with 2 CPUs, running on CentOS. What can be the reason behind this?
-> Finalize Aggregate (cost=76650.31..76650.32 rows=1 width=32) (actual time=174741.832..174741.999 rows=1 loops=1)
-> Gather (cost=76650.09..76650.30 rows=2 width=32) (actual time=174741.807..174741.974 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 0
-> Partial Aggregate (cost=75650.09..75650.10 rows=1 width=32) (actual time=174740.490..174740.492 rows=1 loops=
The query we are using for benchmarking is:
select sum(amount) from tab_dist_1 where id>=49820 and id<=59743292;
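For context, these are the settings that gate parallel-worker launch; Citus itself also uses background workers from the same max_worker_processes pool on each node, so if that pool is exhausted the executor silently falls back to "Workers Launched: 0". A hedged checklist, not a diagnosis:
```
-- Run on each worker node.
SHOW max_worker_processes;
SHOW max_parallel_workers;
SHOW max_parallel_workers_per_gather;

-- How many background-worker slots are currently occupied.
SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type;
```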
goodfella
(595 rep)
Jun 10, 2024, 05:03 AM
• Last activity: Jun 12, 2024, 07:12 AM
0
votes
1
answers
54
views
INSERT INTO my_table (fields, ...) VALUES (value_1, ...), (value_2, ...), ... is slow in Citus
Environment:
- host:
- resources:32 cores, 64GB RAM, Docker 24.07
- OS: CentOS 7
- disk: 10 separate disk partitions
The 10 separate disk partitions are as below
df -h
filesystem capacity used avail used% mount
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 32G 66M 32G 1% /run
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/mapper/centos-root 168G 16G 153G 10% /
/dev/sde1 9.8T 107M 9.8T 1% /data03
/dev/sdj1 9.8T 33M 9.8T 1% /data08
/dev/sdd1 9.8T 105M 9.8T 1% /data02
/dev/sdl1 9.8T 33M 9.8T 1% /data10
/dev/sdb1 9.8T 1.4G 9.8T 1% /data00
/dev/sdh1 9.8T 33M 9.8T 1% /data06
/dev/sdi1 9.8T 33M 9.8T 1% /data07
/dev/sdf1 9.8T 107M 9.8T 1% /data04
/dev/sdc1 9.8T 107M 9.8T 1% /data01
/dev/sdg1 9.8T 33M 9.8T 1% /data05
/dev/sda1 1014M 226M 789M 23% /boot
tmpfs 6.3G 0 6.3G 0% /run/user/1010
I use the docker-compose.yml below to build a Citus cluster in Docker (one coordinator and three workers), plus a standalone PostgreSQL container named "pg-16". They all use the same image. The only difference between pg-16 and the citus-xxxx containers is that I remove shared_preload_libraries='citus' from postgresql.conf for pg-16. I didn't change any other configuration in postgresql.conf; everything is at its default value.
version: "3.7"
services:
citus-coordinator:
image: citusdata/citus:latest
container_name: citus-coordinator
hostname: citus-coordinator
deploy:
resources:
limits:
cpus: "4"
memory: "16G"
ports:
- 5440:5432
environment:
POSTGRES_PASSWORD: 'postgres'
volumes:
- /data01/postgresql:/var/lib/postgresql
- /data01/postgresql/data:/var/lib/postgresql/data
citus-worker1:
image: citusdata/citus:latest
container_name: citus-worker1
hostname: citus-worker1
deploy:
resources:
limits:
cpus: "4"
memory: "16G"
ports:
- 5441:5432
environment:
POSTGRES_PASSWORD: 'postgres'
volumes:
- /data02/postgresql:/var/lib/postgresql
- /data02/postgresql/data:/var/lib/postgresql/data
citus-worker2:
image: citusdata/citus:latest
container_name: citus-worker2
hostname: citus-worker2
deploy:
resources:
limits:
cpus: "4"
memory: "16G"
ports:
- 5442:5432
environment:
POSTGRES_PASSWORD: 'postgres'
volumes:
- /data03/postgresql:/var/lib/postgresql
- /data03/postgresql/data:/var/lib/postgresql/data
citus-worker3:
image: citusdata/citus:latest
container_name: citus-worker3
hostname: citus-worker3
deploy:
resources:
limits:
cpus: "4"
memory: "16G"
ports:
- 5443:5432
environment:
POSTGRES_PASSWORD: 'postgres'
volumes:
- /data04/postgresql:/var/lib/postgresql
- /data04/postgresql/data:/var/lib/postgresql/data
pg-16:
image: citusdata/citus:latest
container_name: pg-16
hostname: pg-16
deploy:
resources:
limits:
cpus: "4"
memory: "16G"
ports:
- 5433:5432
environment:
POSTGRES_PASSWORD: 'postgres'
volumes:
- /data00/postgresql:/var/lib/postgresql
- /data00/postgresql/data:/var/lib/postgresql/data
As you can see in my docker-compose.yml, each Citus node (including the coordinator) uses its own disk partition, so disk I/O is not shared between nodes.
My table gm_data2 is as below:
create table if not exists gm_data2
(
id bigint not null,
period_type varchar(2) not null,
billing_determinates varchar(40),
file_id bigint not null,
region varchar(2) not null,
ratio varchar(8) not null,
date_created timestamp not null,
last_updated timestamp not null,
read_seq varchar(40) not null,
cyc_dt varchar(3) not null,
cust_addr varchar(40) not null,
hes_code varchar(30),
seg_type varchar(3) not null,
ctr_type varchar(5) not null,
custno varchar(20) not null,
meterno varchar(16) not null constraint gm_data2_pkey primary key,
tou varchar(5),
cust_name varchar(30),
publish_id varchar(5),
group_id varchar(3),
reg_ctr_vol varchar(8),
period_dt varchar(3),
service_group_id varchar(4),
read_dt varchar(10),
billing_mark varchar(2),
high_low_vol integer
);
After the above Citus containers were created, I completed the required actions to configure the cluster:
dbbench=> SELECT * FROM citus_get_active_worker_nodes();
node_name | node_port
---------------+-----------
citus-worker3 | 5432
citus-worker2 | 5432
citus-worker1 | 5432
dbbench=> SELECT create_distributed_table('gm_data2', 'meterno');
create_distributed_table
--------------------------
(1 row)
dbbench=> SELECT rebalance_table_shards('gm_data2');
rebalance_table_shards
------------------------
(1 row)
SELECT count(1) FROM citus_shards;
count
-------
32
(1 row)
dbbench=> SELECT * FROM citus_shards;
table_name | shardid | shard_name | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------+---------+-----------------+------------------+---------------+-------------------+----------+------------
gm_data2 | 102040 | gm_data2_102040 | distributed | 2 | citus-coordinator | 5432 | 565248
gm_data2 | 102041 | gm_data2_102041 | distributed | 2 | citus-worker1 | 5432 | 581632
gm_data2 | 102042 | gm_data2_102042 | distributed | 2 | citus-worker2 | 5432 | 557056
gm_data2 | 102043 | gm_data2_102043 | distributed | 2 | citus-worker3 | 5432 | 892928
gm_data2 | 102044 | gm_data2_102044 | distributed | 2 | citus-coordinator | 5432 | 917504
gm_data2 | 102045 | gm_data2_102045 | distributed | 2 | citus-worker1 | 5432 | 589824
gm_data2 | 102046 | gm_data2_102046 | distributed | 2 | citus-worker2 | 5432 | 933888
gm_data2 | 102047 | gm_data2_102047 | distributed | 2 | citus-worker3 | 5432 | 892928
gm_data2 | 102048 | gm_data2_102048 | distributed | 2 | citus-coordinator | 5432 | 892928
gm_data2 | 102049 | gm_data2_102049 | distributed | 2 | citus-worker1 | 5432 | 573440
gm_data2 | 102050 | gm_data2_102050 | distributed | 2 | citus-worker2 | 5432 | 933888
gm_data2 | 102051 | gm_data2_102051 | distributed | 2 | citus-worker3 | 5432 | 925696
gm_data2 | 102052 | gm_data2_102052 | distributed | 2 | citus-coordinator | 5432 | 892928
gm_data2 | 102053 | gm_data2_102053 | distributed | 2 | citus-worker1 | 5432 | 573440
gm_data2 | 102054 | gm_data2_102054 | distributed | 2 | citus-worker2 | 5432 | 917504
gm_data2 | 102055 | gm_data2_102055 | distributed | 2 | citus-worker3 | 5432 | 909312
gm_data2 | 102056 | gm_data2_102056 | distributed | 2 | citus-coordinator | 5432 | 950272
gm_data2 | 102057 | gm_data2_102057 | distributed | 2 | citus-worker1 | 5432 | 548864
gm_data2 | 102058 | gm_data2_102058 | distributed | 2 | citus-worker2 | 5432 | 909312
gm_data2 | 102059 | gm_data2_102059 | distributed | 2 | citus-worker3 | 5432 | 950272
gm_data2 | 102060 | gm_data2_102060 | distributed | 2 | citus-coordinator | 5432 | 917504
gm_data2 | 102061 | gm_data2_102061 | distributed | 2 | citus-worker1 | 5432 | 573440
gm_data2 | 102062 | gm_data2_102062 | distributed | 2 | citus-worker2 | 5432 | 901120
gm_data2 | 102063 | gm_data2_102063 | distributed | 2 | citus-worker3 | 5432 | 933888
gm_data2 | 102064 | gm_data2_102064 | distributed | 2 | citus-coordinator | 5432 | 917504
gm_data2 | 102065 | gm_data2_102065 | distributed | 2 | citus-worker1 | 5432 | 581632
gm_data2 | 102066 | gm_data2_102066 | distributed | 2 | citus-worker2 | 5432 | 917504
gm_data2 | 102067 | gm_data2_102067 | distributed | 2 | citus-worker3 | 5432 | 876544
gm_data2 | 102068 | gm_data2_102068 | distributed | 2 | citus-coordinator | 5432 | 917504
gm_data2 | 102069 | gm_data2_102069 | distributed | 2 | citus-worker1 | 5432 | 581632
gm_data2 | 102070 | gm_data2_102070 | distributed | 2 | citus-worker2 | 5432 | 884736
gm_data2 | 102071 | gm_data2_102071 | distributed | 2 | citus-worker3 | 5432 | 925696
(32 rows)
Now I need to insert 100,000 records into the table gm_data2. There are two ways to insert these records.
- Method 1 (batch size = 1): run the INSERT below once per record, i.e. 100,000 separate INSERT statements:
INSERT INTO gm_data2 (field11, field12, ...) values (value1_1, value1_2, ...)
- Method 2 (batch size = 100,000): use one multi-row INSERT to load all 100,000 records into gm_data2 in a single statement:
INSERT INTO gm_data2(field11, field12, ...) values
(value1_1, value1_2, ...),
(value2_1, value2_2, ...),
(value3_1, value3_2, ...),
.....
(value100000_1, value100000_2, ...)
Below are my test results; each cell is in seconds.
| server | batch_size=1 | batch_size=1000 | batch_size=10000 | batch_size=100000 |
| -------------- | ------------ | --------------- | ---------------- | ----------------- |
| citus cluster | 62.752 | 26.746 | 35.002 | 40.899 |
| pg-16 | 28.908 | 7.876 | 8.375 | 8.909 |
My question is: why is the 3-node Citus cluster far slower than the standalone pg-16, and how can I tune Citus performance?
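For what it's worth, the bulk-ingestion path Citus documents is COPY rather than multi-row INSERT; a hedged sketch of what I plan to compare against (the CSV path is a placeholder):
```
-- psql client-side COPY into the distributed table; the coordinator
-- hashes meterno per row and streams rows to the owning shards.
\copy gm_data2 FROM '/tmp/gm_data2.csv' WITH (FORMAT csv, HEADER true)
```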
wureka
(159 rep)
Dec 11, 2023, 11:47 PM
• Last activity: Jan 4, 2024, 01:09 PM
0
votes
1
answers
152
views
Setting up the Citus extension in an SSH environment
I tried to run CREATE EXTENSION citus;. However, I was given the following error:
ERROR: Citus can only be loaded via shared_preload_libraries
HINT: Add citus to shared_preload_libraries configuration variable in postgresql.conf in master and workers. Note that citus should be at the beginning of shared_preload_libraries.
However, I do not have root access. How can I create the extension in this case?
I am using the PostgreSQL source code to set up the database. Is there a specific file in here that I can modify to enable the extension?
Here is the list of files in my postgresql-15.3 source code folder:
COPYRIGHT GNUmakefile.in INSTALL README config config.status configure.ac doc
GNUmakefile HISTORY Makefile aclocal.m4 config.log configure contrib src
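Assuming the server you built from source runs under your own OS user (so you own the data directory), root is not actually required: the setting only needs a superuser connection and a server restart. A hedged sketch:
```
-- Written to postgresql.auto.conf in the data directory; overrides postgresql.conf.
ALTER SYSTEM SET shared_preload_libraries = 'citus';
-- Restart the server (e.g. pg_ctl -D <datadir> restart), then:
-- CREATE EXTENSION citus;
```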
Disdainty
(3 rep)
Oct 7, 2023, 06:06 PM
• Last activity: Oct 8, 2023, 05:08 AM
2
votes
1
answers
492
views
Can the Citus coordinator and a worker node be on the same machine?
Reading Citus Data's tutorial, I was wondering: in a small cluster, can I have the coordinator and one of the workers installed on the same machine?
I am testing this on a small scale cluster with two worker nodes to do the computing job. And I'd like to use as few machines as possible.
So far, I only have experience with installing one PostgreSQL server instance in each machine (Ubuntu Linux). In this setting, does the coordinator have to be on its own machine?
(This is with PostgreSQL 14 and citus 11.0)
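For illustration, a hedged sketch of the metadata calls as I read the Citus 11 docs (hostnames and ports are placeholders); the coordinator can apparently be registered so that it also holds shards, which avoids dedicating a machine to it:
```
-- Run on the coordinator.
SELECT citus_set_coordinator_host('node-a', 5432);

-- Let the coordinator's own instance hold shards as well (group 0).
SELECT citus_add_node('node-a', 5432, groupid => 0);

-- Add the second machine as a regular worker.
SELECT citus_add_node('node-b', 5432);
```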
tinlyx
(3820 rep)
Aug 22, 2022, 10:03 PM
• Last activity: Aug 5, 2023, 02:04 AM
6
votes
0
answers
411
views
Right way to set up a secondary citus worker
I am experimenting with the postgres-addon citus. So far I have setup a cluster with one coordinator and three worker nodes. Now I want to setup a secondary for one of the worker nodes. So far I have setup streaming replication from that worker node to the fourth (standby) node. Do I have to add the...
I am experimenting with the Postgres extension Citus.
So far I have set up a cluster with one coordinator and three worker nodes.
Now I want to set up a secondary for one of the worker nodes.
So far I have set up streaming replication from that worker node to the fourth (standby) node.
Do I have to add the standby node as a secondary now, using add_node with noderole "secondary"?
Or is **citus_add_secondary_node** better, since it lets you specify the matching *primary* node?
Or is it simply enough to set up a fresh database with the Citus extension, and *add_node/secondary* will turn that node into a secondary with all the data from the worker?
I get the right group-id for the secondary with this command:
select * from pg_dist_node;
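For reference, the call I am considering, with placeholder host names (this assumes the streaming standby is already in sync with its primary worker):
```
-- Register the standby of worker1 as a secondary in worker1's group.
SELECT citus_add_secondary_node('worker1-standby', 5432, 'worker1', 5432);

-- Reads go to secondaries only for sessions that opt in:
SET citus.use_secondary_nodes = 'always';
```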
Nils
(149 rep)
Nov 17, 2022, 02:14 PM
• Last activity: Dec 13, 2022, 09:59 PM
1
votes
2
answers
680
views
PostgreSQL not working with Citus and pg_stat_statements at the same time
So, I built PostgreSQL with the Citus extension in Docker. Following the official Citus documentation, I ran this command in the terminal:
docker run -d -v /root/volumes/citus-data:/data --network citus-network --name citus_coordinator -p 5500:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus:11.1
The database built successfully. But I also want to create the pg_stat_statements extension, so I configured the postgresql.conf file:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
track_activity_query_size = 2048
Then I restarted the PostgreSQL container and ran this query in the terminal:
SELECT * FROM pg_stat_statements;
I saw this error:
ERROR: pg_stat_statements must be loaded via shared_preload_libraries
I don't understand why the config file change wasn't picked up. What was my mistake?
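For reference, a hedged sketch of the configuration I expected to need: the Citus image already preloads citus, and Citus documents that citus must come first in the list, so pg_stat_statements has to be appended to the existing shared_preload_libraries rather than replacing it:
```
-- Keep citus first; a full restart (not a reload) is required afterwards.
ALTER SYSTEM SET shared_preload_libraries = 'citus,pg_stat_statements';
```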
**Solved**
I ran the PostgreSQL container with the postgres:latest image and entered the container.
I then installed Citus into the container, installed postgresql-contrib with apt-get, and ran create extension pg_stat_statements;.
After that, everything worked for me.
Gismat Gusein
(61 rep)
Oct 13, 2022, 07:32 AM
• Last activity: Oct 18, 2022, 08:33 AM
1
votes
2
answers
1047
views
Replicating PostgreSQL data into Citus/Greenplum?
I need to integrate data from 3 different PostgreSQL databases (OLTP application backends) in a data warehouse. For the data warehouse itself I am considering Citus or Greenplum. There is a requirement that the data from the applications be synced to the data warehouse as close to real time as possible (anything above a 3-5 minute delay is unacceptable; real-time replication would be best). In this regard I have the following questions:
1. Will Postgres logical replication work with Citus? Citus is a Postgres extension, can you treat a Citus cluster as an ordinary Postgres database? If yes, then logical replication should theoretically work, but how does it deal with distributed tables?
2. Greenplum is a Postgres fork, so will Postgres logical replication work with it at all? I have also read that Greenplum is not optimized for OLTP workloads; does that mean it will break when I try to ingest OLTP data into it?
3. If logical replication does not work with Citus/Greenplum, then how to stream data from Postgres? Do I need to stream logical-level WAL into Kafka and then write custom logic for translating it into SQL statements on the target database? Are there any tools for that?
Bonus question: does anyone have experience with both Citus and Greenplum, especially with their SQL limitations? I know that Citus does not fully support correlated subqueries and recursive CTEs, does Greenplum have any similar limitations?
I would appreciate any help with these questions, I tried googling but there is little or no info on the subject, could you please give at least some direction?
Denis Arharov
(101 rep)
Feb 4, 2021, 12:44 PM
• Last activity: Jan 13, 2022, 10:15 AM
0
votes
0
answers
105
views
Why does Postgres/Citus use the index when querying the table, but not through the view?
I have a Citus 9.5 database (postgres 11) on Azure and executing a complicated query with CTEs.
The schema has a simple view from another schema as:
create view my_schema.table1
as
select * from other_schema.table1
When my query joins to my_schema.table1 it does a Seq Scan on the table, which leads to poor performance.
However, when my query joins directly to other_schema.table1 it uses the primary key index, which is fast.
I have run ANALYZE [table] on both the table and the view.
Why does Postgres not use the index when I query via the view, but it does when I query via the table?
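To make the comparison concrete, a minimal reproduction sketch (the id column and value are placeholders for our actual key); the interesting part is whether the predicate gets pushed through the view to the primary-key index:
```
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM other_schema.table1 WHERE id = 42;  -- uses the primary-key index

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_schema.table1 WHERE id = 42;     -- Seq Scan in our case
```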
Brendan Hill
(301 rep)
Dec 13, 2021, 10:55 PM
• Last activity: Dec 14, 2021, 12:16 AM
0
votes
0
answers
1506
views
AWS DMS Task failed with error: Error executing source loop; Stream component failed at subtask 0
I want to migrate my PostgresDB hosted in Citus cloud service to AWS RDS Aurora Postgres. I am using AWS DMS service. Have created task but getting following errors:
Last failure message Last Error Stream Component Fatal error. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2860] Error executing source loop; Stream component failed at subtask 0, component st_0_QOIS7XIGJDKNPY6RXMGYRLJQHY2P7IQBWIBA5NQ; Stream component 'st_0_QOIS7XIGJDKNPY6RXMGYRLJQHY2P7IQBWIBA5NQ' terminated [reptask/replicationtask.c:2868] Stop Reason FATAL_ERROR Error Level FATAL
Frankly speaking, I am not able to understand what is wrong here, so any help is appreciated.
cloudwatch logs:

Ashish Karpe
(101 rep)
Oct 15, 2021, 10:48 AM
0
votes
1
answers
602
views
Slow insert in Postgresql when sharding using declarative partitioning and postgres_fdw?
We have been trying to partition a Postgres database on Google Cloud using built-in Postgres declarative partitioning and postgres_fdw, as explained here.
We are running commands as follows:
Shard 1:
CREATE TABLE message_1 (
id SERIAL,
m_type character varying(20),
content character varying(256) NOT NULL,
is_received boolean NOT NULL,
is_seen boolean NOT NULL,
is_active boolean NOT NULL,
created_at timestamp with time zone NOT NULL,
room_no_id integer NOT NULL,
sender_id integer NOT NULL
);
CREATE TABLE message_2 (
id SERIAL,
m_type character varying(20),
content character varying(256) NOT NULL,
is_received boolean NOT NULL,
is_seen boolean NOT NULL,
is_active boolean NOT NULL,
created_at timestamp with time zone NOT NULL,
room_no_id integer NOT NULL,
sender_id integer NOT NULL
);
Shard 2:
CREATE TABLE message_3 (
id SERIAL,
m_type character varying(20),
content character varying(256) NOT NULL,
is_received boolean NOT NULL,
is_seen boolean NOT NULL,
is_active boolean NOT NULL,
created_at timestamp with time zone NOT NULL,
room_no_id integer NOT NULL,
sender_id integer NOT NULL
);
CREATE TABLE message_4 (
id SERIAL,
m_type character varying(20),
content character varying(256) NOT NULL,
is_received boolean NOT NULL,
is_seen boolean NOT NULL,
is_active boolean NOT NULL,
created_at timestamp with time zone NOT NULL,
room_no_id integer NOT NULL,
sender_id integer NOT NULL
);
Source machine:
CREATE SERVER shard_1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard_1_ip', dbname 'shard_1_db', port '5432');
CREATE SERVER shard_2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'shard_2_ip', dbname 'shard_2_db', port '5432');
CREATE USER MAPPING for source_user SERVER shard_1 OPTIONS (user 'shard_1_user', password 'shard_1_user_password');
CREATE USER MAPPING for source_user SERVER shard_2 OPTIONS (user 'shard_2_user', password 'shard_2_user_password');
CREATE TABLE room (
id SERIAL PRIMARY KEY,
name character varying(20) NOT NULL,
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL,
is_active boolean NOT NULL
);
insert into room (
name, created_at, updated_at, is_active
)
select
concat('Room_', floor(random() * 400000 + 1)::int, '_', floor(random() * 400000 + 1)::int),
i,
i,
TRUE
from generate_series('2019-01-01 00:00:00'::timestamp, '2019-4-30 01:00:00', '5 seconds') as s(i);
CREATE TABLE message (
id SERIAL,
m_type character varying(20),
content character varying(256) NOT NULL,
is_received boolean NOT NULL,
is_seen boolean NOT NULL,
is_active boolean NOT NULL,
created_at timestamp with time zone NOT NULL,
room_no_id integer NOT NULL,
sender_id integer NOT NULL
) PARTITION BY HASH (room_no_id);
CREATE FOREIGN TABLE message_1
PARTITION OF message
FOR VALUES WITH (MODULUS 4, REMAINDER 1)
SERVER shard_1;
CREATE FOREIGN TABLE message_2
PARTITION OF message
FOR VALUES WITH (MODULUS 4, REMAINDER 2)
SERVER shard_1;
CREATE FOREIGN TABLE message_3
PARTITION OF message
FOR VALUES WITH (MODULUS 4, REMAINDER 3)
SERVER shard_2;
CREATE FOREIGN TABLE message_4
PARTITION OF message
FOR VALUES WITH (MODULUS 4, REMAINDER 0)
SERVER shard_2;
The problem we are facing is that when we are trying to insert data using following query:
insert into message (
m_type, content, is_received, is_seen, is_active, created_at, room_no_id, sender_id
)
select
'TEXT',
CASE WHEN s.i % 2 = 0 THEN 'text 1'
ELSE 'text 2'
end,
TRUE,
TRUE,
TRUE,
dr.created_at + s.i * (interval '1 hour'),
dr.id,
CASE WHEN s.i % 2 = 0 THEN split_part(dr.name, '_', 2)::int
ELSE split_part(dr.name, '_', 3)::int
end
from room as dr, generate_series(0, 10) as s(i);
It takes nearly 1 hour 50 minutes to do so. When the table is not sharded, the same load takes around 8 minutes, so sharding makes it roughly 14 times slower. Are we missing anything here, or are inserts really that slow when sharding with this method?
Citus seems to perform better on inserts, as described in this video, so it seems a little odd that sharding would degrade performance by this much. It may well not perform as well as Citus, but why is the performance so low?
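A hedged note for anyone with the same layout on newer servers: postgres_fdw sends one INSERT per row to each foreign partition by default, and on PostgreSQL 14 and later the batch_size foreign-server option batches those round trips:
```
-- Only available when running PostgreSQL 14 or later.
ALTER SERVER shard_1 OPTIONS (ADD batch_size '1000');
ALTER SERVER shard_2 OPTIONS (ADD batch_size '1000');
```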
hardik24
(93 rep)
Jun 23, 2020, 08:53 PM
• Last activity: Sep 27, 2021, 12:07 AM
1
votes
2
answers
6092
views
PostgreSQL surrogate keys: use sequence/bigint or OID?
I'm looking at a need for surrogate keys in tables I'm designing on a PostgreSQL-derived DBMS (Citus). Would OIDs suffice? Is there a downside to using them instead of creating a bigint field and a sequence?
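For comparison, the bigint-plus-sequence variant I have in mind is just a bigserial column (table and column names are illustrative):
```
CREATE TABLE account (
    id   bigserial PRIMARY KEY,  -- bigint backed by an implicit sequence
    name text NOT NULL
);
```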
Andrew Wolfe
(584 rep)
Feb 1, 2016, 05:43 PM
• Last activity: Sep 27, 2021, 12:04 AM
0
votes
1
answers
3568
views
PostgreSQL + Citus - multi master - replication of data
I have found the Citus extension, which is recommended for replicating data between multi-master nodes (I was using Galera Cluster before moving to PostgreSQL).
However, I cannot find any example of how to replicate tables without creating any additional (distributed) tables, because replication has to work without any modification of the C# code that uses PostgreSQL as its database.
It has to work the same way as Galera Cluster (multi-master replication between nodes).
Perhaps I've tried the wrong extension, but I've been reviewing the options for 3 days and it seemed like a good fit.
Could someone recommend a good extension that offers multi-master replication without any changes to the existing queries used by our software?
Mike
(111 rep)
Apr 9, 2019, 11:13 AM
• Last activity: Sep 27, 2021, 12:03 AM
5
votes
2
answers
685
views
PostgreSQL Citus Extension Temp Table
Is it possible with the Citus extension in PostgreSQL to create a temp table that is copied to each worker node (like a reference table)?
When I run SQL like this:
DROP TABLE IF EXISTS mypoint;
CREATE TEMP TABLE mypoint (mpoint geometry primary key);
SELECT create_reference_table('mypoint');
I get the error:
> ERROR: unacceptable schema name "pg_temp_6"
> DETAIL: The prefix "pg_" is reserved for system schemas.
> CONTEXT: while executing command on mynode01:5432 SQL state: 42939
The reason I am asking is that there are further SQL commands afterwards where I need to join the temp table with a distributed table (which is not allowed by the Citus extension).
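The workaround I am currently testing (hedged: it gives up the automatic cleanup of a temp table) is a short-lived regular table in an ordinary schema:
```
-- A regular table can become a reference table, be joined
-- against distributed tables, and then be dropped when done.
CREATE TABLE mypoint_scratch (mpoint geometry PRIMARY KEY);
SELECT create_reference_table('mypoint_scratch');

-- ... joins with distributed tables here ...

DROP TABLE mypoint_scratch;  -- also removes the copies on the workers
```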
Slobodan Savkovic
(151 rep)
Mar 17, 2018, 10:16 PM
• Last activity: Sep 27, 2021, 12:00 AM