
Suggestions for migrating and sharding a 1 TB standalone PostgreSQL cluster with partitions and multiple databases using Citus

1 vote
0 answers
24 views
We're currently in the **evaluation/planning phase**, assessing open-source Citus as a multi-tenant sharding solution for our existing PostgreSQL database cluster. We anticipate significant data growth from a new application feature and need to scale beyond our current single-node setup.

**Current setup:**

- Standalone single-node PostgreSQL (open source), approximately 1 TB of data.
- pgpool-II for load balancing, with a primary and one standby for basic high availability.
- Multiple databases within the cluster; each database contains multiple schemas.
- Within the schemas, numerous tables, functions, and other SQL objects.
- Many tables use standard PostgreSQL declarative partitioning (both range and hash methods) for performance management.

**Goal:** Migrate to a Citus cluster and implement database sharding to handle the expected large volume of new data, while also effectively incorporating our existing 1 TB of historical data.

We have reviewed the basic Citus documentation but still have several key questions about migrating and operating our specific setup:

1. **Handling multiple databases:** Our data is spread across multiple logical databases within the single PostgreSQL instance. How should this multi-database structure be handled when migrating to Citus? Is there a recommended approach for managing multiple logical databases within a Citus cluster, both during migration and for ongoing data distribution? (Our working assumption is sketched in the first example at the end of this post.)

2. **Existing and future partitioning:** We make heavy use of standard PostgreSQL range and hash partitioning. How are existing partitioned tables and their data migrated into Citus distributed tables? Does Citus automatically handle the data in the partitions? How is partitioning typically handled for new and ongoing data in a Citus distributed table? For example, can we still use time-based partitioning effectively within Citus shards? (See the second sketch at the end of this post.)

3. **Load balancing and high availability (pgpool/replication):** Can our existing pgpool-II setup be repurposed or used alongside a Citus cluster for load balancing client connections? What are the recommended strategies for high availability, replication, and load balancing within a Citus cluster itself, for both the coordinator and the worker nodes?

4. **Schema distribution to workers:** When we distribute a table (e.g., with create_distributed_table), how does Citus handle the schema definition on the worker nodes? Is the full schema structure (including non-distributed objects and other schemas) replicated to the workers, or only the table definitions needed for the shards?

5. **Monitoring in a distributed environment:** On our single node we rely on standard PostgreSQL system views and functions such as pg_stat_activity, pg_stat_statements, pg_control_checkpoint(), pg_buffercache, pg_stat_user_tables, and pg_stat_bgwriter for monitoring and performance tuning. How do these work in a distributed Citus cluster, where data and activity are spread across multiple worker nodes? How do we get a comprehensive, cluster-wide view of performance and activity? (See the monitoring sketch at the end of this post.)

We would appreciate guidance, insights, or experiences from the community on these points as we plan our migration to Citus. **Also, please advise if there is another sharding solution we could try for our current setup.** For context, our current (untested) assumptions are sketched below; please correct anything that is wrong.
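To make question 1 concrete, our working assumption is that Citus distributes tables within a single database, so our multiple logical databases would probably need to be consolidated, e.g. one schema per former database. A rough sketch with a hypothetical `app_billing` schema and `invoices` table (names are ours, for illustration only):

```sql
-- Assumption: each former logical database becomes a schema inside one Citus database,
-- since (as far as we understand) Citus distributes tables within a single database.
CREATE SCHEMA app_billing;

CREATE TABLE app_billing.invoices (
    tenant_id  bigint  NOT NULL,
    invoice_id bigint  NOT NULL,
    amount     numeric,
    PRIMARY KEY (tenant_id, invoice_id)
);

-- Either distribute individual tables on the tenant column ...
SELECT create_distributed_table('app_billing.invoices', 'tenant_id');

-- ... or, on newer Citus versions (12+), distribute the whole schema as one unit
-- (schema-based sharding) -- we have not tested this:
-- SELECT citus_schema_distribute('app_billing');
```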
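For question 2, our rough understanding from the docs (not yet tested) is that the partitioned parent table is the one that gets distributed and its partitions follow, with new time partitions created by a Citus helper. A minimal sketch with a hypothetical `events` table:

```sql
-- Hypothetical multi-tenant table, range-partitioned by time (for illustration).
CREATE TABLE events (
    tenant_id  bigint      NOT NULL,
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb,
    PRIMARY KEY (tenant_id, event_id, created_at)
) PARTITION BY RANGE (created_at);

-- Distribute the partitioned parent on the tenant column; we assume existing and
-- future partitions are sharded and co-located along with it.
SELECT create_distributed_table('events', 'tenant_id');

-- Keep creating monthly time partitions on top of the distributed table.
-- create_time_partitions() is a Citus helper; exact availability depends on the version.
SELECT create_time_partitions(
    table_name         := 'events',
    partition_interval := '1 month',
    end_at             := now() + interval '3 months'
);
```

Is this roughly how existing range-partitioned tables with 1 TB of historical data would be brought across, or is a different loading path recommended?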
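For question 5, we found references to Citus-specific counterparts of the standard views and to `run_command_on_workers()`. We have not verified these in practice (exact view names and columns may vary by Citus version), but we assume cluster-wide monitoring would look roughly like this:

```sql
-- Cluster-wide view of running queries; Citus 11+ exposes citus_stat_activity
-- (older versions have citus_dist_stat_activity) -- our assumption from the docs.
SELECT global_pid, nodeid, state, query
FROM citus_stat_activity
WHERE state <> 'idle';

-- Cross-node query statistics; requires pg_stat_statements to be installed.
SELECT queryid, partition_key, calls, query
FROM citus_stat_statements
ORDER BY calls DESC
LIMIT 20;

-- Fallback: run an ordinary statement on every worker and collect the results.
SELECT nodename, nodeport, result
FROM run_command_on_workers($$ SELECT count(*) FROM pg_stat_activity $$);
```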
Asked by Ramzy (11 rep)
May 16, 2025, 07:48 AM