Citus Installation and Configuration
PostgreSQL 15.8
Citus 12.1-1
Contents
Citus Configuration
Configure the Cluster
Worker Nodes
Create and Distribute Data
Adding a New Node to the Cluster
Rebalancing the Data
Un-distributing the Data
Test Case
Configuring Citus on an AWS RDS PostgreSQL Instance
Citus Configuration
Citus is an extension that enables PostgreSQL to scale horizontally across multiple machines by
sharding tables and creating redundant copies.
Key features of a Citus cluster include:
- Creating distributed tables that are sharded across a cluster of PostgreSQL nodes, effectively
combining their CPU, memory, storage, and I/O resources.
- Replicating reference tables across all nodes to facilitate joins, foreign keys, and maximize read
performance from distributed tables.
- Routing and parallelizing SELECT, DML, and other operations on distributed tables across the
cluster using the distributed query engine.
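As a minimal illustration of these two table types (the table and column names below are hypothetical, not part of this POC):
-- Shard a large table across the worker nodes by a distribution column
SELECT create_distributed_table('orders', 'customer_id');
-- Replicate a small lookup table to every node as a reference table
SELECT create_reference_table('countries');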
Configure the Cluster
A cluster configuration was simulated with the setup below.
Here is the cluster's node list:
 nodeid |  nodename  | groupid | isactive
--------+------------+---------+----------
      1 | 10.88.11.8 |       0 | t
      2 | 10.88.11.8 |       1 | t
      3 | 10.88.11.8 |       2 | t
      4 | 10.88.11.8 |       3 | t
      5 | 10.88.11.8 |       5 | t
shardeg=# select citus_set_coordinator_host('10.88.11.8', 5432);
Add the rest of the nodes to the cluster.
Since I have used the same server for POC purposes, I am adding the nodes with the command below:
shardeg=# select citus_add_node('10.88.11.8', 5433);
If the nodes are distributed across remote servers, pass each node's hostname or IP address instead; the registered nodes can be verified in the pg_dist_node catalog.
Repeat the above for all the nodes in the cluster.
Since I have created four separate PostgreSQL instances (clusters) on the same server for this POC, you are seeing the same IP address for every node; internally, the instances are managed through different ports.
The node in group 0 serves as the primary (coordinator) node, acting as the central controller in the Citus cluster that manages query execution, data distribution, and transaction consistency. Nodes 2 through 5 are worker (redundancy) nodes.
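For reference, a sketch of the node registration commands used for this single-host POC layout (one port per instance, matching the worker list shown later; on a real multi-server cluster each call would take that server's hostname or IP):
SELECT citus_set_coordinator_host('10.88.11.8', 5432);  -- coordinator
SELECT citus_add_node('10.88.11.8', 5433);              -- worker instances
SELECT citus_add_node('10.88.11.8', 5434);
SELECT citus_add_node('10.88.11.8', 5435);
SELECT citus_add_node('10.88.11.8', 5436);
SELECT citus_add_node('10.88.11.8', 5439);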
Created 2x redundancy across a cluster of four (4) worker nodes:
shardeg=# show citus.shard_replication_factor;
citus.shard_replication_factor
--------------------------------
2
(1 row)
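The replication factor shown above was set on the coordinator before the tables were distributed. A minimal sketch of setting it, either for the session or persisted via ALTER SYSTEM (assumes superuser access):
SET citus.shard_replication_factor = 2;               -- current session only
ALTER SYSTEM SET citus.shard_replication_factor = 2;  -- persist on the coordinator
SELECT pg_reload_conf();                              -- reload so the new value takes effect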
Worker Nodes:
shardeg=# select * from master_get_active_worker_nodes();
node_name | node_port
------------+-----------
10.88.11.8 | 5435
10.88.11.8 | 5439
10.88.11.8 | 5434
10.88.11.8 | 5436
10.88.11.8 | 5433
(5 rows)
Create and Distribute Data
Log in to the coordinator and execute the following commands:
shardeg=# \dt+
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+------------------+-------+----------+-------------+---------------+---------+-------------
public | pgbench_accounts | table | postgres | permanent | heap | 0 bytes |
public | pgbench_branches | table | postgres | permanent | heap | 0 bytes |
public | pgbench_history | table | postgres | permanent | heap | 0 bytes |
public | pgbench_tellers | table | postgres | permanent | heap | 0 bytes |
(4 rows)
Distribute the pgbench tables across nodes:
shardeg=# select create_distributed_table('pgbench_history', 'aid');
create_distributed_table
--------------------------
(1 row)
shardeg=# select create_distributed_table('pgbench_accounts', 'aid');
create_distributed_table
--------------------------
(1 row)
shardeg=# select create_distributed_table('pgbench_branches', 'bid');
create_distributed_table
--------------------------
(1 row)
shardeg=# select create_distributed_table('pgbench_tellers', 'tid');
create_distributed_table
--------------------------
(1 row)
The default number of shards created for a table is 32 (controlled by citus.shard_count), and the shards are distributed across the available worker nodes.
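The shard count applies to tables distributed after the setting is changed; a small sketch of inspecting and overriding it (64 is just an example value):
SHOW citus.shard_count;      -- default is 32
SET citus.shard_count = 64;  -- affects subsequently distributed tables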
shardeg=# select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id |  nodename  | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+------------+----------+------------
 pgbench_accounts |  102328 | pgbench_accounts_102328 | distributed      |            12 | 10.88.11.8 |     5433 |  125911040
 pgbench_accounts |  102328 | pgbench_accounts_102328 | distributed      |            12 | 10.88.11.8 |     5434 |  125911040
 pgbench_branches |  102360 | pgbench_branches_102360 | distributed      |            12 | 10.88.11.8 |     5433 |       8192
 pgbench_branches |  102360 | pgbench_branches_102360 | distributed      |            12 | 10.88.11.8 |     5434 |       8192
 pgbench_history  |  102309 | pgbench_history_102309  | distributed      |            12 | 10.88.11.8 |     5436 |          0
 pgbench_history  |  102309 | pgbench_history_102309  | distributed      |            12 | 10.88.11.8 |     5439 |          0
 pgbench_tellers  |  102414 | pgbench_tellers_102414  | distributed      |            12 | 10.88.11.8 |     5435 |       8192
 pgbench_tellers  |  102414 | pgbench_tellers_102414  | distributed      |            12 | 10.88.11.8 |     5436 |       8192
Adding a New Node to the Cluster
shardeg=# SELECT * from citus_add_node('10.88.11.8',5411);
citus_add_node
----------------
7
(1 row)
shardeg=# select * from master_get_active_worker_nodes();
node_name | node_port
------------+-----------
10.88.11.8 | 5435
10.88.11.8 | 5439
10.88.11.8 | 5434
10.88.11.8 | 5436
10.88.11.8 | 5411
10.88.11.8 | 5433
(6 rows)
shardeg=#
Rebalancing the Data
Create the indexes on the tables.
shardeg=# create unique index pgbench_accounts_pk on pgbench_accounts(aid);
CREATE INDEX
shardeg=# create unique index pgbench_branches_pk on pgbench_branches(bid);
CREATE INDEX
shardeg=# create unique index pgbench_tellers_pk on pgbench_tellers(tid);
CREATE INDEX
shardeg=#
shardeg=# select * from rebalance_table_shards();
NOTICE: Moving shard 102333 from 10.88.11.8:5434 to 10.88.11.8:5411 ...
ERROR: ERROR: logical decoding requires wal_level >= logical
CONTEXT: while executing command on 10.88.11.8:5434
while executing command on localhost:5432
shardeg=#
The above error occurs because wal_level was not set to logical. wal_level should be set to logical on all nodes, since the shard rebalancing process moves data using logical replication.
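A sketch of the fix applied on every node; changing wal_level requires a restart of each PostgreSQL instance before it takes effect:
ALTER SYSTEM SET wal_level = logical;
-- Restart the instance, then confirm:
SHOW wal_level;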
When rebalancing is executed, the data is redistributed across the cluster, including the new node.
shardeg=# select * from rebalance_table_shards(); -- with factor 2.
NOTICE: Moving shard 102333 from 10.88.11.8:5434 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102340 from 10.88.11.8:5435 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102329 from 10.88.11.8:5434 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102358 from 10.88.11.8:5433 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102354 from 10.88.11.8:5435 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102335 from 10.88.11.8:5436 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102349 from 10.88.11.8:5434 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102337 from 10.88.11.8:5433 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102331 from 10.88.11.8:5439 to 10.88.11.8:5411 ...
NOTICE: Moving shard 102345 from 10.88.11.8:5435 to 10.88.11.8:5411 ...
You'll notice that the shards are moved between the worker nodes, not from the coordinator (primary) node.
Execute: select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id |  nodename  | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+------------+----------+------------
 pgbench_accounts |  102328 | pgbench_accounts_102328 | distributed      |            12 | 10.88.11.8 |     5433 |  146989056
 pgbench_accounts |  102328 | pgbench_accounts_102328 | distributed      |            12 | 10.88.11.8 |     5434 |  146989056
 pgbench_accounts |  102329 | pgbench_accounts_102329 | distributed      |            12 | 10.88.11.8 |     5435 |  147357696
 pgbench_accounts |  102329 | pgbench_accounts_102329 | distributed      |            12 | 10.88.11.8 |     5411 |  147357696
 pgbench_branches |  102361 | pgbench_branches_102361 | distributed      |            12 | 10.88.11.8 |     5411 |      57344
 pgbench_branches |  102361 | pgbench_branches_102361 | distributed      |            12 | 10.88.11.8 |     5435 |      57344
Altering the Replication Factor
Recreated the pgbench_history table, rebalanced it, and replicated it onto the new node with a replication factor of 3.
citus.shard_replication_factor
--------------------------------
 3
(1 row)
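A sketch of the steps used for the factor-3 copy, assuming the table is recreated (and reloaded) and then re-distributed so the new shards pick up the higher factor:
SET citus.shard_replication_factor = 3;
-- Re-distribute the recreated table; each shard now gets three placements
SELECT create_distributed_table('pgbench_history', 'aid');
The excerpt below shows the resulting placements, with pgbench_history now appearing on three nodes under a new colocation group (13).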
 pgbench_branches |  102391 | pgbench_branches_102391 | distributed      |            12 | 10.88.11.8 |     5434 |      57344
 pgbench_branches |  102391 | pgbench_branches_102391 | distributed      |            12 | 10.88.11.8 |     5435 |      57344
 pgbench_history  |  102424 | pgbench_history_102424  | distributed      |            13 | 10.88.11.8 |     5411 |      40960
 pgbench_history  |  102424 | pgbench_history_102424  | distributed      |            13 | 10.88.11.8 |     5433 |      40960
 pgbench_history  |  102424 | pgbench_history_102424  | distributed      |            13 | 10.88.11.8 |     5434 |      40960
Un-distributing the Data
Deleted the original distribution (un-distributed the tables) and then recreated the distribution.
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id |  nodename  | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+------------+----------+------------
 pgbench_accounts |  102488 | pgbench_accounts_102488 | distributed      |            14 | 10.88.11.8 |     5411 |  146907136
 pgbench_accounts |  102488 | pgbench_accounts_102488 | distributed      |            14 | 10.88.11.8 |     5433 |  146907136
 pgbench_accounts |  102488 | pgbench_accounts_102488 | distributed      |            14 | 10.88.11.8 |     5434 |  146907136
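For reference, a sketch of the un-distribute / re-distribute cycle; Citus provides undistribute_table() to convert a distributed table back into a regular local table on the coordinator:
SELECT undistribute_table('pgbench_accounts');
-- Distribute it again; new shard IDs and a new colocation group are assigned
SELECT create_distributed_table('pgbench_accounts', 'aid');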
Test Case
shardeg=# explain (analyze ) select * from pgbench_accounts where aid=30 limit 50;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.56..8.58 rows=1 width=352) (actual time=0.107..0.111 rows=1 loops=1)
   ->  Index Scan using pk_accounts on pgbench_accounts  (cost=0.56..8.58 rows=1 width=352) (actual time=0.104..0.107 rows=1 loops=1)
         Index Cond: (aid = 30)
Planning Time: 0.384 ms
Execution Time: 0.161 ms
shardeg=#
Data was being retrieved from the primary (coordinator) node.
Deleted the tables from the primary (coordinator) and executed the same query:
shardeg=# explain (analyze) select * from pgbench_accounts where aid=30 limit 50;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=0 width=0) (actual time=12.721..12.724 rows=1 loops=1)
Task Count: 1
Tuple data received from nodes: 96 bytes
Tasks Shown: All
-> Task
Tuple data received from node: 96 bytes
Node: host=10.88.11.8 port=5411 dbname=shardeg
   ->  Limit  (cost=0.42..8.44 rows=1 width=97) (actual time=0.168..0.171 rows=1 loops=1)
         ->  Index Scan using pk_accounts_102516 on pgbench_accounts_102516 pgbench_accounts  (cost=0.42..8.44 rows=1 width=97) (actual time=0.162..0.164 rows=1 loops=1)
Index Cond: (aid = 30)
Planning Time: 0.671 ms
Execution Time: 0.256 ms
Planning Time: 2.036 ms
Execution Time: 12.831 ms
shardeg=#
Data is now being retrieved from the shard (worker) node; in both cases, the query itself is submitted to and executed from the primary (coordinator) node.
Configuring Citus on an AWS RDS PostgreSQL Instance
Check Compatibility and Prerequisites
RDS PostgreSQL Version:
PostgreSQL 11 or later
Citus Extension Availability:
Citus cannot be installed or configured directly on an AWS RDS PostgreSQL instance. As a fully managed database service, AWS RDS imposes limitations on the installation of custom extensions that are not natively supported by Amazon. While AWS RDS does offer support for a range of PostgreSQL extensions, Citus is not included.
Amazon EC2 to Deploy Citus:
Deploy PostgreSQL with the Citus extension on Amazon EC2 instances. This approach grants you complete control over the operating system and PostgreSQL configuration, enabling you to install and customize Citus according to your needs.
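Once the PostgreSQL and Citus packages are installed at the OS level on each EC2 instance, the extension is enabled per instance and database; a minimal sketch:
ALTER SYSTEM SET shared_preload_libraries = 'citus';  -- requires a restart
-- After restarting the instance, enable the extension in the target database
CREATE EXTENSION citus;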
Native Partitioning:
Consider leveraging PostgreSQL's native partitioning features within AWS RDS.
Although not as robust as Citus for handling distributed workloads, these features can
still enhance performance and manageability for large tables.
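A minimal sketch of native range partitioning on RDS (the table and column names are hypothetical):
-- Parent table partitioned by a date range
CREATE TABLE events (
    event_id   bigint,
    event_time timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);
-- One partition per month, created manually or by a scheduled job
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');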