Contents
- PLAN_PARTITIONING_SHARDING.md — PostgreSQL Partitioning & Sharding Compatibility
- 1. Executive Summary
- 2. Background: PostgreSQL Partitioning
- 3. Scenario 1: Partitioned Source Tables
- 4. Scenario 2: Partitioned Stream Table Storage
- 5. Scenario 3: Distributed Sharding (Citus)
- 6. Scenario 4: Foreign Data Wrapper (FDW) Sharding
- 7. Test Coverage Gaps
- 8. Action Items
- 9. Summary Matrix
- Appendix A: PostgreSQL Partition Trigger Propagation
- Appendix B: Partition Key in Row ID
PLAN_PARTITIONING_SHARDING.md — PostgreSQL Partitioning & Sharding Compatibility
Status: Implemented (PT1–PT5 in v0.6.0)
Date: 2026-03-01
Updated: 2026-03-14 — PT1–PT5 implemented. E2E tests, ATTACH PARTITION detection, WAL publish_via_partition_root, foreign table info, and user-facing tutorial all shipped.
Related: PLAN_CITUS.md · REPORT_TIMESCALEDB.md · REPORT_TRIGGERS_VS_REPLICATION.md
1. Executive Summary
This report analyses how pg_trickle interacts with PostgreSQL’s native table partitioning (declarative partitioning, available since PG 10) and external sharding solutions (Citus, pg_partman). It covers three scenarios:
- Partitioned source tables — source tables referenced in stream table defining queries are partitioned.
- Partitioned stream table storage — the materialized storage table backing a stream table is itself partitioned.
- Distributed/sharded deployments — the database spans multiple nodes (Citus, foreign-data-wrapper sharding, read replicas).
Key findings:
- Native partitioning of source tables works today with minor caveats. PostgreSQL 13+ propagates row-level triggers from the partitioned parent to child partitions, so CDC captures all DML transparently. The extension already logs an info message (F13) confirming this.
- Partitioned storage tables are not supported and would require non-trivial changes to the storage table creation, MERGE/delta application, and row ID management paths.
- Distributed sharding (Citus) is a much larger problem with a dedicated plan already in PLAN_CITUS.md. The core issues (OID locality, WAL LSN incomparability, trigger propagation) are architectural.
- The most impactful near-term improvement is validating and documenting the native partitioning story, and adding targeted tests.
2. Background: PostgreSQL Partitioning
2.1. Declarative Partitioning (PG 10+)
PostgreSQL supports declarative partitioning via PARTITION BY:
CREATE TABLE orders (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
customer_id INT,
total NUMERIC
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2025 PARTITION OF orders
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE orders_2026 PARTITION OF orders
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
Partition strategies: RANGE, LIST, HASH. Sub-partitioning is supported.
2.2. Key Properties for pg_trickle
| Property | Behaviour | Impact |
|---|---|---|
| OID | Parent table has its own OID; each child partition has a separate OID | CDC triggers are keyed by OID — must use parent OID |
| Triggers | PG 13+: row-level triggers on parent fire for DML routed to any partition | CDC works transparently |
| Triggers (PG 11–12) | Triggers on parent do NOT fire for partition-routed DML | CDC would miss all changes — effectively broken |
| TRUNCATE | TRUNCATE parent truncates all partitions; fires statement-level trigger on parent |
Our TRUNCATE capture trigger works |
| DDL propagation | ALTER TABLE parent ADD COLUMN propagates to all partitions |
DDL hooks see the parent’s OID — reinit works |
| Partition DDL | ALTER TABLE parent ATTACH/DETACH PARTITION fires ddl_command_end |
Hook fires but no column change → no action needed |
| Primary keys | PK must include all partition key columns | pk_hash computation works (resolves PK via parent OID) |
| Statistics | pg_class.reltuples on parent is the sum of child reltuples (PG 14+) |
Adaptive FULL/DIFFERENTIAL ratio works correctly |
| WAL | Each partition generates its own WAL records | Logical replication must include all partitions in publication |
| Indexes | Partitioned indexes are virtual; each partition has its own | No impact (we don’t read source indexes) |
2.3. Partition Management Extensions
- pg_partman — Automated partition creation/retention. Compatible: it
manages DDL on child partitions, which doesn’t affect the parent’s schema.
Dropping old partitions is equivalent to
DROP TABLEon a child — the parent OID remains valid. - TimescaleDB hypertables — Transparent time-based partitioning. See REPORT_TIMESCALEDB.md for detailed analysis. Hypertables are partitioned tables with additional metadata; triggers on the parent propagate to chunks.
3. Scenario 1: Partitioned Source Tables
3.1. Current Status — Works (PG 13+)
When a user creates a stream table over a partitioned source:
SELECT pgtrickle.create_stream_table(
'daily_totals',
$$SELECT date_trunc('day', created_at) AS day,
sum(total) AS total
FROM orders
GROUP BY 1$$,
schedule := '5 minutes'
);
The following happens:
Source resolution —
extract_source_relations()resolvesordersto the parent’s OID. Child partition OIDs are NOT returned because the parser sees only the parent.Relkind detection —
warn_source_table_properties()detectsrelkind = 'p'and emits F13 info message.CDC trigger creation —
create_change_trigger()creates the trigger on the parent table. PostgreSQL 13+ automatically propagates the trigger to all existing and future child partitions.Change capture — When DML hits any child partition, the trigger fires with
NEW/OLDrows, writing topgtrickle_changes.changes_<parent_oid>. Thepg_current_wal_lsn()call in the trigger correctly reflects the WAL position of the child partition’s write.PK resolution —
resolve_pk_columns()queriespg_constrainton the parent OID. For partitioned tables, the PK constraint is on the parent (and includes partition key columns). This works correctly.Column resolution —
resolve_source_column_defs()queriespg_attributeon the parent OID. All partitions share the same column set. Works correctly.Full refresh —
SELECT ... FROM ordersnaturally reads all partitions via partition pruning. Works correctly.Differential refresh — The delta query reads from
pgtrickle_changes.changes_<parent_oid>. Since all partition DML writes to the same buffer (keyed by parent OID), the delta is complete.
3.2. Caveats
| Caveat | Severity | Details |
|---|---|---|
| PG 11–12 incompatibility | N/A | pg_trickle targets PG 18 only; not relevant |
| ATTACH PARTITION with existing data | Medium | ALTER TABLE orders ATTACH PARTITION orders_2027 ... does NOT fire row-level triggers for existing rows in the newly attached partition. The stream table will miss these rows until the next FULL refresh or reinitialization. |
| DETACH PARTITION | Low | ALTER TABLE orders DETACH PARTITION orders_2024 ... removes the partition. Future DML on the detached table won’t fire the parent’s trigger. Existing change buffer data for already-captured rows is fine. DDL hooks should detect the ALTER TABLE and consider reinit. |
| Very large partition counts | Low | With hundreds of partitions, the trigger is installed on each child. Trigger overhead is per-row and per-partition, but the trigger function is shared. No scalability concern observed up to ~500 partitions in PG benchmarks. |
| WAL mode CDC | Medium | For logical replication, the publication must include all partitions. CREATE PUBLICATION ... FOR TABLE orders in PG 13+ uses the publish_via_partition_root option to group all child WAL records under the parent table. wal_decoder.rs must set this option when creating publications for partitioned sources. |
3.3. ATTACH PARTITION Gap
This is the most significant gap. When a partition with pre-existing data is attached to a partitioned source table:
-- This table already has 1M rows:
CREATE TABLE orders_2027 (LIKE orders INCLUDING ALL);
INSERT INTO orders_2027 SELECT ... ;
-- Attach it — no row-level triggers fire for existing rows:
ALTER TABLE orders ATTACH PARTITION orders_2027
FOR VALUES FROM ('2027-01-01') TO ('2028-01-01');
The 1M existing rows are now visible via SELECT * FROM orders but were
never captured by the CDC trigger. The stream table is stale.
Detection: The DDL event trigger (_on_ddl_end) fires for
ALTER TABLE ... ATTACH PARTITION. The command_tag is ALTER TABLE and the
objid is the parent table’s OID. The hook should detect that the parent is a
tracked source and mark downstream STs for reinitialize.
Current behaviour: The DDL hook currently checks for column changes (ADD/
DROP/ALTER COLUMN). An ATTACH PARTITION doesn’t change columns, so it may not
trigger reinit. This needs verification and potentially a targeted check for
the ATTACH PARTITION subcommand.
Recommended fix: In handle_ddl_command(), when the command_tag is
ALTER TABLE and the target is a partitioned source, additionally check
pg_event_trigger_ddl_commands() for partition-related subcommands. If
ATTACH PARTITION is detected, force needs_reinit = true.
3.4. WAL Mode: Publication Configuration
When transitioning from trigger to WAL mode for a partitioned source, the publication must be configured correctly:
-- Current code in wal_decoder.rs creates:
CREATE PUBLICATION pgtrickle_<oid> FOR TABLE <table_name>;
-- For partitioned tables, this should use:
CREATE PUBLICATION pgtrickle_<oid>
FOR TABLE <table_name>
WITH (publish_via_partition_root = true);
The publish_via_partition_root option (PG 13+) ensures that changes to child
partitions are published under the parent table’s identity, matching the
trigger-mode behavior where all changes appear in a single buffer keyed by
the parent OID.
Without this option, each child partition publishes its changes under its own OID, and the decoder would need to map child OIDs back to the parent — a significant complication.
4. Scenario 2: Partitioned Stream Table Storage
4.1. Current Status — Not Supported
Stream table storage tables are created as plain heap tables:
CREATE TABLE <schema>.<name> (
__pgt_row_id BIGINT NOT NULL,
col1 TYPE,
col2 TYPE,
...
);
CREATE UNIQUE INDEX ... ON <schema>.<name> (__pgt_row_id);
There is no mechanism to partition the storage table.
4.2. Why Users Might Want This
- Large materialized views — When a stream table materializes billions of rows (e.g., time-series aggregations), partitioning the storage table by time could enable partition pruning on downstream queries and efficient retention management.
- TTL / retention — Users could drop old partitions of a stream table rather than running expensive DELETE operations.
- Query performance — Partition pruning on the storage table accelerates analytical queries that filter by the partition key.
4.3. Challenges
| Challenge | Severity | Details |
|---|---|---|
| Storage table creation | Medium | api.rs creates the storage table via CREATE TABLE. Would need to accept an optional PARTITION BY clause from the user and generate child partitions or rely on pg_partman. |
| MERGE/delta application | High | The differential refresh uses DELETE + INSERT keyed by __pgt_row_id. With partitioning, the partition key must be part of the row or derivable from it. The MERGE target would need to be the parent table, and PostgreSQL would route rows to the correct partition. This works transparently for INSERT but DELETE requires the partition key to be in the WHERE clause for partition pruning (otherwise it scans all partitions). |
| Row ID uniqueness | Medium | __pgt_row_id is a hash, not monotonically increasing. A UNIQUE INDEX on a partitioned table requires the partition key in the index. The unique constraint would become (__pgt_row_id, <partition_key>), which weakens the uniqueness guarantee to per-partition scope. |
| Full refresh (TRUNCATE + INSERT) | Low | TRUNCATE on a partitioned table truncates all partitions. Works correctly. |
| Reinitialize | Medium | Currently drops and recreates the storage table. Would need to preserve the partition structure, or the user would need to re-specify it. |
| User triggers | Low | User triggers on the storage parent propagate to partitions. No issue. |
4.4. Potential Design
If implemented, partitioned storage could be an opt-in feature:
SELECT pgtrickle.create_stream_table(
'daily_totals',
$$SELECT date_trunc('day', created_at) AS day,
sum(total) AS total
FROM orders GROUP BY 1$$,
schedule := '5 minutes',
partition_by := 'RANGE (day)', -- new parameter
partition_spec := '{
"auto_create": true,
"retention": "2 years",
"interval": "1 month"
}'
);
The extension would:
1. Create the parent storage table with PARTITION BY RANGE (day).
2. Use pg_partman or custom logic to auto-create child partitions.
3. Modify delta application to include the partition key in DELETE clauses.
4. Adjust the __pgt_row_id unique index to include the partition key.
5. Optionally manage retention (drop partitions older than threshold).
Recommendation: Defer this to a future release. The complexity is moderate and the user base requesting partitioned storage tables should be established first. Users needing partitioned storage today can create a partitioned table manually and use a full-refresh-only stream table.
5. Scenario 3: Distributed Sharding (Citus)
5.1. Summary
Distributed sharding via Citus is fundamentally different from native partitioning. While native partitioning keeps all data on a single PostgreSQL instance with transparent query routing, Citus distributes data across multiple independent PostgreSQL nodes.
A comprehensive plan exists in PLAN_CITUS.md. The key incompatibilities are:
| Component | Native Partitioning | Citus Distributed |
|---|---|---|
| OIDs | Parent OID is stable; child OIDs are local | OIDs differ across coordinator and workers |
| WAL | Single WAL stream | Independent WAL per node |
| Triggers | Propagate from parent to children | Do NOT propagate to worker shards |
| SPI | Single instance, full catalog access | Coordinator has catalog; workers have partial |
| MERGE | Fully supported | Limited support; cross-shard joins problematic |
| Background workers | Single scheduler | One scheduler per node or coordinator-only |
| Advisory locks | Instance-wide | Node-local |
5.2. Impact on pg_trickle Components
| Component | Native Partitioning Impact | Citus Impact |
|---|---|---|
CDC triggers (cdc.rs) |
Works — triggers propagate | Broken — triggers don’t propagate to workers |
| Change buffers | Single buffer per parent OID | Need distributed or coordinator-aggregated buffers |
WAL decoder (wal_decoder.rs) |
Needs publish_via_partition_root |
Need multi-node slot management |
Frontier/LSN (version.rs) |
Single WAL, single LSN timeline | Incomparable LSNs across nodes |
Delta SQL (dvm/) |
No change needed | Cross-shard join complications |
MERGE application (refresh.rs) |
Works transparently | Need INSERT ON CONFLICT + DELETE |
DAG/scheduler (dag.rs, scheduler.rs) |
No change | Need coordinator-only or multi-node coordination |
DDL hooks (hooks.rs) |
Works — events fire on parent | Only fires on coordinator |
Shared memory (shmem.rs) |
No change | Node-local; need LISTEN/NOTIFY |
5.3. Recommendation
Native PostgreSQL partitioning and Citus sharding should be treated as two entirely separate features with different timelines:
- Native partitioning — Near-term; mostly works today. Needs testing, ATTACH PARTITION detection, and WAL publication configuration.
- Citus — Long-term; architectural changes needed. Follow the phased plan in PLAN_CITUS.md.
6. Scenario 4: Foreign Data Wrapper (FDW) Sharding
6.1. Overview
Some deployments use postgres_fdw to create a “sharded” view across multiple
PostgreSQL instances:
CREATE FOREIGN TABLE remote_orders (...)
SERVER remote_server OPTIONS (table_name 'orders');
Or with partitioned foreign tables (PG 14+):
CREATE TABLE orders (...) PARTITION BY RANGE (region);
CREATE FOREIGN TABLE orders_us PARTITION OF orders
FOR VALUES IN ('US')
SERVER us_server OPTIONS (table_name 'orders');
CREATE FOREIGN TABLE orders_eu PARTITION OF orders
FOR VALUES IN ('EU')
SERVER eu_server OPTIONS (table_name 'orders');
6.2. Compatibility
| Aspect | Status | Details |
|---|---|---|
| Source resolution | Works | Foreign tables have OIDs in the local catalog |
| CDC triggers | Broken | Row-level triggers cannot be created on foreign tables (ERROR: "..." is not a table or view) |
| WAL mode | Broken | Foreign tables don’t generate local WAL |
| Full refresh | Works | SELECT * FROM foreign_table executes the remote query |
| Partition mix | Partial | A partitioned table with some local and some foreign children: triggers fire only on local partitions; foreign partition DML is invisible to CDC |
6.3. Recommendation
Foreign-table sources should be restricted to FULL refresh mode only. The
extension should detect relkind = 'f' (foreign table) during source
resolution and emit a warning that differential refresh is not available.
For partitioned tables with a mix of local and foreign children, the safest
approach is to require FULL refresh mode, since CDC cannot capture changes
from foreign partitions.
7. Test Coverage Gaps
Current test coverage for partitioning scenarios:
| Scenario | Covered? | Notes |
|---|---|---|
| Partitioned source — basic CRUD | No | Need E2E test with RANGE partitioned source |
| Partitioned source — differential refresh | No | Verify delta capture across partitions |
| Partitioned source — partition pruning in full refresh | No | Verify optimizer uses pruning |
| ATTACH PARTITION with existing data | No | Verify reinit or staleness detection |
| DETACH PARTITION | No | Verify DDL hook fires |
| pg_partman automatic partition creation | No | Verify triggers propagate to new partitions |
| Source with sub-partitioning | No | Two-level partitioning |
| Partitioned source — WAL mode CDC | No | Verify publish_via_partition_root |
| Foreign table source — full refresh only | No | Verify trigger creation error handling |
| Mixed local/foreign partitions | No | Verify correct warning or mode restriction |
Recommended test additions: Create a new tests/e2e_partitioning_tests.rs
file with tests covering the scenarios above.
8. Action Items
Near-Term (Current Release)
| # | Action | Component | Effort |
|---|---|---|---|
| 1 | Add E2E tests for partitioned source tables (basic CRUD, differential refresh) | tests/ |
1–2 days |
| 2 | Verify ATTACH PARTITION triggers reinit; add detection if missing | src/hooks.rs |
1 day |
| 3 | Set publish_via_partition_root = true in WAL publication creation for partitioned sources |
src/wal_decoder.rs |
Half day |
| 4 | Detect foreign table sources (relkind = 'f') and restrict to FULL mode |
src/api.rs |
Half day |
| 5 | Document partitioned source table support in user docs | docs/ |
Half day |
Medium-Term (Future Release)
| # | Action | Component | Effort |
|---|---|---|---|
| 6 | Explore partitioned storage table support (opt-in) | src/api.rs, src/refresh.rs |
2–3 weeks |
| 7 | Handle DETACH PARTITION explicitly in DDL hooks | src/hooks.rs |
1 day |
| 8 | Test with pg_partman automated partition management | tests/ |
1–2 days |
Long-Term
| # | Action | Component | Effort |
|---|---|---|---|
| 9 | Citus distributed table support | All modules | ~6 months (see PLAN_CITUS.md) |
| 10 | FDW-sharded source support with remote CDC | New module | Research needed |
9. Summary Matrix
| Deployment Pattern | Source Tables | Storage Tables | Status |
|---|---|---|---|
| Native partitioning (RANGE/LIST/HASH) | Works (PG 13+) | Not supported | Ready with caveats |
| Sub-partitioning | Works (same as above) | Not supported | Untested |
| pg_partman managed | Works (triggers propagate to new partitions) | Not supported | Untested |
| TimescaleDB hypertables | Works (see REPORT_TIMESCALEDB.md) | Not supported | Research |
| Citus distributed | Not supported | Not supported | See PLAN_CITUS.md |
| Citus reference tables | Works (coordinator-local triggers) | Possible | Untested |
| FDW foreign tables | Full refresh only | N/A | Needs implementation |
| Mixed local + foreign partitions | Full refresh only | N/A | Needs implementation |
Appendix A: PostgreSQL Partition Trigger Propagation
Starting in PostgreSQL 13, row-level triggers created on a partitioned parent
table are automatically cloned to all existing and newly created child partitions.
The trigger fires with the child partition’s TG_TABLE_NAME and TG_RELID,
but the trigger function is the same.
Key behavior details:
-- Create trigger on parent
CREATE TRIGGER cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION capture_changes();
-- PostgreSQL automatically creates matching triggers on:
-- orders_2025, orders_2026, orders_2027, ...
-- New partition created later:
CREATE TABLE orders_2028 PARTITION OF orders
FOR VALUES FROM ('2028-01-01') TO ('2029-01-01');
-- Trigger is automatically cloned to orders_2028
Within the trigger function for pg_trickle, the TG_RELID is the child
partition’s OID, not the parent’s. However, since our trigger function is
generated with the parent OID hardcoded (e.g., pg_trickle_cdc_fn_<parent_oid>),
the change buffer table name is derived from the parent OID, and all partition
DML is correctly routed to the same buffer table.
Appendix B: Partition Key in Row ID
For partitioned storage tables (Scenario 2), the __pgt_row_id uniqueness
guarantee requires special handling:
Current: UNIQUE INDEX ON storage (__pgt_row_id)
→ Works for unpartitioned tables
With partitioning:
UNIQUE INDEX ON storage (__pgt_row_id, <partition_key>)
→ Uniqueness is per-partition only
→ Two rows in different partitions could have the same __pgt_row_id
Alternative:
Use __pgt_row_id as a composite of (original_hash, partition_key_hash)
→ Globally unique but changes hash semantics
This is one of the key design decisions that would need resolution before implementing partitioned storage tables.