PLAN: OFFSET Support (ORDER BY + LIMIT + OFFSET)

Status: Complete — Core implementation done. Clippy clean. Upgrade SQL and CHANGELOG shipped in v0.2.2. Effort: ~4–8 hours (leverages existing TopK infrastructure)


Problem Statement

OFFSET is currently rejected in all defining-query contexts:

-- ❌ Rejected today
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10 OFFSET 20

-- ❌ Also rejected
SELECT * FROM orders ORDER BY created_at DESC OFFSET 10

The original rationale (documented in the FAQ and PLAN_ORDER_BY_LIMIT_OFFSET) was that OFFSET “destroys boundary stability” — every source change can shift which rows fall inside the window. However, this argument applies to delta-based incremental maintenance, not to scoped recomputation.

Since TopK already uses scoped recomputation (re-execute the full query and MERGE the result), adding OFFSET to the query has zero additional complexity in the refresh path.


Analysis

Why OFFSET was originally rejected

The FAQ states:

TopK without OFFSET has a stable boundary: only rows that cross the N-th position threshold change membership. OFFSET destroys this stability.

This is true for delta-based IVM (Option D in PLAN_ORDER_BY_LIMIT_OFFSET — the overflow-buffer approach). But TopK was implemented as Option B — scoped recomputation — which doesn’t reason about boundaries at all. It simply:

  1. Re-executes SELECT ... ORDER BY ... LIMIT N against source tables
  2. MERGEs the result into the stream table

The MERGE is agnostic to what the query returns. Whether it’s LIMIT 10 or LIMIT 10 OFFSET 20, the flow is identical.

Impact on future optimizations

PLAN_ORDER_BY_LIMIT_OFFSET describes two future optimizations:

Optimization Impact of OFFSET
Option C — Boundary tracking With LIMIT N alone, you track one boundary (the N-th row’s sort key). With OFFSET M LIMIT N, you need two boundaries (the M-th and M+N-th). This doubles the boundary-check logic but doesn’t fundamentally break it.
Option D — Overflow buffer More complex: the overflow buffer must cover positions before the OFFSET too. Would need rethinking. Unlikely to be implemented for OFFSET.

Neither optimization is implemented or planned near-term. The scoped recomputation path handles OFFSET correctly with no additional cost.

Three OFFSET patterns to consider

Pattern Example Bounded? Useful?
ORDER BY + LIMIT + OFFSET ORDER BY score DESC LIMIT 10 OFFSET 20 Yes (N rows) Yes — “page” or “window” into ranked results
ORDER BY + OFFSET (no LIMIT) ORDER BY score DESC OFFSET 10 No (all rows minus M) Marginal — large, unbounded result set
OFFSET without ORDER BY OFFSET 10 No No — non-deterministic, already rejected

Recommendation

Support ORDER BY + LIMIT + OFFSET (the “Paged TopK” pattern). Reject the other two:

  • ORDER BY + OFFSET without LIMIT materializes a potentially huge result set (total rows minus M). This negates the TopK benefit of bounded storage and bounded refresh cost. Reject with a clear message suggesting adding LIMIT.

  • OFFSET without ORDER BY remains non-deterministic and is already rejected.


Use Cases

Use Case Query Pattern
Paginated dashboards ORDER BY created_at DESC LIMIT 20 OFFSET 40 — “page 3”
Excluding outliers ORDER BY score DESC LIMIT 50 OFFSET 5 — top 50, skipping top 5
Windowed leaderboards ORDER BY points DESC LIMIT 10 OFFSET 10 — “second tier”
Cascading pages Multiple stream tables with different OFFSETs for different pages

Caveat for users

When source data changes, the “page” shifts — a row on page 3 may move to page 2 or 4. This is semantically correct (the stream table always reflects the current state of the page) but may surprise users who expect row stability. This should be documented clearly.


Implementation Plan

Step 1: Parser — accept ORDER BY + LIMIT + OFFSET

File: src/dvm/parser.rs

Current behavior in detect_topk_pattern(): rust if !select.limitOffset.is_null() { return Err(PgTrickleError::UnsupportedOperator( "OFFSET is not supported with LIMIT in defining queries. ..." )); }

Change: When ORDER BY + LIMIT + OFFSET are all present, extract the OFFSET value (must be a constant non-negative integer, same as LIMIT):

let offset_value = if !select.limitOffset.is_null() {
    match unsafe { extract_const_int_from_node(select.limitOffset) } {
        Some(v) if v >= 0 => Some(v),
        Some(_) => return Err(PgTrickleError::UnsupportedOperator(
            "OFFSET must be a non-negative integer.".into()
        )),
        None => return Err(PgTrickleError::UnsupportedOperator(
            "OFFSET in defining queries must be a constant integer.".into()
        )),
    }
} else {
    None
};

Extend TopKInfo to include the offset:

pub struct TopKInfo {
    pub limit_value: i64,
    pub offset_value: Option<i64>,  // NEW
    pub full_query: String,
    pub base_query: String,
    pub order_by_sql: String,
}

reject_limit_offset() — currently rejects OFFSET unconditionally. Change to only reject OFFSET when ORDER BY + LIMIT are not present (the non-TopK path):

// OFFSET without ORDER BY + LIMIT is not supported.
// If ORDER BY + LIMIT + OFFSET → handled by detect_topk_pattern().
if !select.limitOffset.is_null() {
    // Only reject if this is not a TopK candidate.
    // TopK detection runs first and validates OFFSET there.
    return Err(PgTrickleError::UnsupportedOperator(
        "OFFSET is not supported in defining queries without ORDER BY + LIMIT. \
         Use ORDER BY + LIMIT + OFFSET for paged TopK stream tables."
            .into(),
    ));
}

Since detect_topk_pattern() runs before reject_limit_offset() in the API flow, and TopK tables bypass the DVM pipeline (which has its own reject_limit_offset in step 7), no additional changes are needed in the rejection flow.


Step 2: Catalog — store offset metadata

File: src/catalog.rs

Add a new nullable column to pgtrickle.pgt_stream_tables:

ALTER TABLE pgtrickle.pgt_stream_tables
  ADD COLUMN topk_offset INT;  -- NULL = no OFFSET (same as OFFSET 0)

Add field to StreamTableMeta: rust pub topk_offset: Option<i32>,

Update insert(), extract_meta(), and all SELECT queries to include the new column.

Migration SQL (for pg_trickle--{current}--{next}.sql): sql ALTER TABLE pgtrickle.pgt_stream_tables ADD COLUMN topk_offset INT;


Step 3: Refresh — include OFFSET in reconstructed query

File: src/refresh.rsexecute_topk_refresh()

Current query reconstruction: rust let topk_query = format!( "{} ORDER BY {} LIMIT {}", st.defining_query, topk_order_by, topk_limit );

Change to: rust let topk_query = if let Some(offset) = st.topk_offset { format!( "{} ORDER BY {} LIMIT {} OFFSET {}", st.defining_query, topk_order_by, topk_limit, offset ) } else { format!( "{} ORDER BY {} LIMIT {}", st.defining_query, topk_order_by, topk_limit ) };

No other refresh logic changes needed — the MERGE is agnostic to the source query shape.


Step 4: API — pass offset through

File: src/api.rs

When calling StreamTableMeta::insert(), pass topk_info.offset_value:

StreamTableMeta::insert(
    // ... existing args ...
    topk_info.as_ref().map(|t| t.limit_value as i32),
    topk_info.as_ref().map(|t| t.order_by_sql.as_str()),
    topk_info.as_ref().and_then(|t| t.offset_value.map(|v| v as i32)),  // NEW
    // ...
);

Step 5: strip_order_by_and_limit → also strip OFFSET

File: src/dvm/parser.rs

The existing strip_order_by_and_limit() function removes ORDER BY and LIMIT from the query to produce the base_query. It must also strip OFFSET. Rename to strip_order_by_limit_offset() or extend the existing implementation.


Step 6: Tests

E2E tests (add to tests/e2e_topk_tests.rs):

Test Scenario
test_topk_with_offset_create ORDER BY score DESC LIMIT 10 OFFSET 5 — verify 10 rows stored, starting from position 6
test_topk_with_offset_differential Insert/delete rows, verify page shifts correctly after refresh
test_topk_with_offset_zero OFFSET 0 — equivalent to no offset
test_topk_offset_without_limit_rejected ORDER BY x OFFSET 10 (no LIMIT) → error
test_topk_offset_without_order_by_rejected LIMIT 10 OFFSET 5 (no ORDER BY) → error
test_topk_offset_non_constant_rejected OFFSET (SELECT ...) → error
test_topk_offset_negative_rejected OFFSET -1 → error
test_topk_offset_with_aggregates GROUP BY dept ORDER BY SUM(x) DESC LIMIT 5 OFFSET 2
test_topk_offset_catalog_metadata Verify topk_offset stored correctly in catalog
test_topk_offset_fetch_syntax FETCH NEXT 10 ROWS ONLY OFFSET 5

Unit tests in parser.rs: - detect_topk_pattern with OFFSET returns correct offset_value - strip_order_by_limit_offset removes all three clauses


Step 7: Documentation updates

File Change
docs/FAQ.md Update “Why is OFFSET not supported with TopK?” → explain it’s now supported. Add caveat about page shifting.
docs/SQL_REFERENCE.md Update create_stream_table docs to mention OFFSET support with TopK.
docs/DVM_OPERATORS.md Update line 12: LIMIT / OFFSET → clarify OFFSET is allowed with ORDER BY + LIMIT.
README.md Update the SQL support table row for LIMIT/OFFSET.
CHANGELOG.md Add entry for OFFSET support.

Task Checklist

  • [x] Parser: accept OFFSET in detect_topk_pattern() (Step 1)
  • [x] Parser: update reject_limit_offset() flow (Step 1)
  • [x] Parser: extend strip_order_by_and_limit() to strip OFFSET (Step 5)
  • [x] Parser: extend TopKInfo with offset_value field (Step 1)
  • [x] Catalog: add topk_offset column to schema DDL in lib.rs (Step 2)
  • [x] Catalog: update StreamTableMeta struct + CRUD (Step 2)
  • [x] Refresh: include OFFSET in query reconstruction (Step 3)
  • [x] API: pass offset through to catalog insert (Step 4)
  • [x] E2E tests: 8 new tests + 2 updated rejection tests (Step 6)
  • [x] Documentation: FAQ, SQL_REFERENCE, DVM_OPERATORS, README (Step 7)
  • [x] cargo fmt -- --check clean
  • [x] just lint (clippy) — clean as of v0.2.2
  • [ ] just test-e2e — requires Docker E2E image build
  • [x] Upgrade migration SQL (sql/pg_trickle--0.2.1--0.2.2.sql) — topk_offset pre-provisioned in 0.2.1; 0.2.2 script handles function signature updates
  • [ ] Unit tests: parser tests for OFFSET handling in parser.rs #[cfg(test)]
  • [x] CHANGELOG entry — added in v0.2.2

What Remains (Prioritized)

# Item Priority Notes
1 clippy validation P0 ✅ Done — clean in v0.2.2
2 E2E test run P0 just build-e2e-image && just test-e2e to validate all 9 OFFSET tests
3 Upgrade migration SQL P1 ✅ Done — topk_offset column pre-provisioned in 0.2.1
4 CHANGELOG entry P1 ✅ Done — added in v0.2.2
5 Unit tests P2 Parser-level tests for detect_topk_pattern with OFFSET; nice-to-have since E2E covers the full path

Complexity Assessment

Component Effort Risk
Parser changes 1–2 hours Low — extends existing TopK detection
Catalog schema + migration 1 hour Low — single nullable column
Refresh path 30 min Very low — one conditional format string
API plumbing 30 min Low — pass-through
Tests 2–3 hours Low — follows existing TopK test patterns
Documentation 1 hour Low
Total ~4–8 hours Low

The low effort is because this is purely an extension of the existing TopK infrastructure. No new operators, no new refresh strategies, no query rewriting — just one more clause in the reconstructed query.


What We’re NOT Doing

  • ORDER BY + OFFSET without LIMIT — rejected because the result set is unbounded. Users should add LIMIT.
  • Delta-based OFFSET maintenance — not needed. Scoped recomputation handles it correctly.
  • Boundary tracking optimization (Option C) for OFFSET — deferred. Can be added later as a pure optimization. Would need to track two sort-key boundaries instead of one.
  • IMMEDIATE mode + OFFSET — rejected (same as TopK without OFFSET). Scoped recomputation is incompatible with within-transaction IVM.

Further Reading