Design & Historical Implementation Plan

This document preserves the initial phased implementation plan and design considerations for pgproto.

🏗️ Architecture (Historical)

1. Internal Storage

Protobuf messages are binary. We store them internally using a Postgres varlena (variable length) structure.

typedef struct {
    int32 vl_len_;                      // varlena header: total size including this header; access via VARSIZE()/SET_VARSIZE()
    char  data[FLEXIBLE_ARRAY_MEMBER];  // Serialized Protobuf bytes
} ProtobufData;

2. Schema Registry (Dynamic Reflection)

To understand what fields are in a binary blob, the extension needs the schema. We will use the Schema-Registered model.

  1. Registry Table: A system table (or extension-owned table) will store FileDescriptorSet blobs generated by protoc.
  2. Caching (Session Memory): To avoid parsing the schema on every row access, we will cache parsed descriptors in a hash table allocated in Postgres' TopMemoryContext. That context is local to each backend process and lives for the whole session, so every session builds its own cache.
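
The caching idea can be sketched in standalone C, with no Postgres headers. This is only an illustration under stated assumptions: a fixed-size, open-addressing table keyed by schema name, with hypothetical names (DescCache, ParsedSchema, cache_get, cache_put) that do not come from the plan. Inside the extension, a Postgres dynahash table whose memory lives in TopMemoryContext would presumably play this role.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS  64   /* assumption: small fixed capacity, power of two */
#define NAME_MAX_LEN 64   /* assumption: schema names shorter than 64 bytes */

/* Hypothetical stand-in for a parsed FileDescriptorSet. */
typedef struct {
    int parsed_id;
} ParsedSchema;

typedef struct {
    char          name[NAME_MAX_LEN];   /* empty string marks a free slot */
    ParsedSchema *schema;
} CacheSlot;

typedef struct {
    CacheSlot slots[CACHE_SLOTS];       /* must start zero-initialized */
} DescCache;

/* FNV-1a hash of a NUL-terminated string. */
static uint32_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

/* Return the cached schema, or NULL on a miss. */
ParsedSchema *cache_get(const DescCache *c, const char *name)
{
    assert(c != NULL && name != NULL);
    uint32_t i = hash_name(name) & (CACHE_SLOTS - 1);
    for (int probes = 0; probes < CACHE_SLOTS; probes++) {
        const CacheSlot *s = &c->slots[i];
        if (s->name[0] == '\0')
            return NULL;                /* a free slot ends the probe chain */
        if (strcmp(s->name, name) == 0)
            return s->schema;
        i = (i + 1) & (CACHE_SLOTS - 1);
    }
    return NULL;
}

/* Insert or overwrite. Returns 0 on success, -1 if the table is full
 * or the name is empty or too long. */
int cache_put(DescCache *c, const char *name, ParsedSchema *schema)
{
    assert(c != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0 || len >= NAME_MAX_LEN)
        return -1;
    uint32_t i = hash_name(name) & (CACHE_SLOTS - 1);
    for (int probes = 0; probes < CACHE_SLOTS; probes++) {
        CacheSlot *s = &c->slots[i];
        if (s->name[0] == '\0' || strcmp(s->name, name) == 0) {
            memcpy(s->name, name, len + 1);
            s->schema = schema;
            return 0;
        }
        i = (i + 1) & (CACHE_SLOTS - 1);
    }
    return -1;
}
```

A row-access function would call cache_get first and only parse the registry blob (then cache_put) on a miss. The sketch never evicts entries, so a real cache would also need invalidation when a registered schema changes.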

📅 Phased Implementation Plan

Phase 0: Toolchain Setup (Docker)

Establish the development environment inside an isolated Docker container to avoid polluting the host machine.

  • Base Environment: A Dockerfile based on the official postgres:18 image (Latest Stable).
  • System Dependencies: build-essential, postgresql-server-dev-18, libprotobuf-c-dev, protobuf-c-compiler.

Phase 1: Varlena Infrastructure & Field-Tag Extraction

Establish the custom type and the C build environment.

  • Required Files: pgproto.control, Makefile (PGXS), pgproto--1.0.sql, pgproto.c.
  • Internal Custom Type: protobuf, backed by a varlena structure (vl_len_ and vl_dat).
  • I/O Handlers: protobuf_in and protobuf_out, using hex encoding for the text representation.
  • Target Functions: pb_get_int32(protobuf, tag_number).
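
The hex conversion behind the I/O handlers can be sketched as plain C helpers. This is a minimal sketch, not the extension's code: pb_hex_decode and pb_hex_encode are hypothetical names, and the real protobuf_in and protobuf_out would additionally wrap the bytes in a varlena and use the Postgres function-call interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/*
 * Decode hex text into bytes (the protobuf_in direction).
 * Returns the number of bytes written, or -1 if the input is malformed
 * or out_cap is too small.
 */
long pb_hex_decode(const char *hex, size_t hex_len, uint8_t *out, size_t out_cap)
{
    assert(hex != NULL && out != NULL);
    if (hex_len % 2 != 0 || hex_len / 2 > out_cap)
        return -1;
    for (size_t i = 0; i < hex_len; i += 2) {
        int hi = hex_val(hex[i]);
        int lo = hex_val(hex[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (uint8_t) ((hi << 4) | lo);
    }
    return (long) (hex_len / 2);
}

/*
 * Encode bytes as lowercase hex (the protobuf_out direction).
 * out needs room for 2 * len + 1 chars. Returns the number of chars
 * written, excluding the trailing NUL, or -1 if out_cap is too small.
 */
long pb_hex_encode(const uint8_t *in, size_t len, char *out, size_t out_cap)
{
    static const char digits[] = "0123456789abcdef";
    assert(in != NULL && out != NULL);
    if (out_cap < 2 * len + 1)
        return -1;
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[in[i] >> 4];
        out[2 * i + 1] = digits[in[i] & 0x0F];
    }
    out[2 * len] = '\0';
    return (long) (2 * len);
}
```

Rejecting odd-length input and non-hex characters up front means a bad literal fails at input time rather than producing a blob that breaks later during field extraction.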

Phase 2: Schema Registry & Dynamic Reflection

Transition from hardcoded tag numbers to named query paths.

  • Schema Table: pb_schemas storing FileDescriptorSet binary blobs.
  • Caching Architecture: Cache parsed descriptors in a session-wide hash table (TopMemoryContext) to prevent parsing on every row fetch.
  • Target Functions: pb_get_string(protobuf, 'schema_name.MessageName', 'field.subfield').
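
Moving from tag numbers to named paths comes down to resolving each dotted segment against a message descriptor. The sketch below assumes a deliberately tiny, hypothetical descriptor model (FieldDesc, MsgDesc, resolve_path); the real descriptors would come from the parsed FileDescriptorSet and carry far more information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct MsgDesc;

/* Hypothetical, simplified field descriptor. */
typedef struct FieldDesc {
    const char           *name;
    uint32_t              tag;   /* field number on the wire */
    const struct MsgDesc *sub;   /* nested message type, NULL for scalars */
} FieldDesc;

/* Hypothetical, simplified message descriptor. */
typedef struct MsgDesc {
    const FieldDesc *fields;
    size_t           n_fields;
} MsgDesc;

/* Linear search for a field whose name equals the given path segment. */
static const FieldDesc *find_field(const MsgDesc *msg, const char *seg, size_t seg_len)
{
    for (size_t i = 0; i < msg->n_fields; i++) {
        const FieldDesc *f = &msg->fields[i];
        if (strlen(f->name) == seg_len && memcmp(f->name, seg, seg_len) == 0)
            return f;
    }
    return NULL;
}

/*
 * Resolve a dotted path such as "field.subfield" into a list of tag numbers.
 * Returns the number of tags written, or -1 if a segment is empty or unknown,
 * the path descends into a scalar, or max_tags is exceeded.
 */
int resolve_path(const MsgDesc *root, const char *path, uint32_t *tags, size_t max_tags)
{
    assert(root != NULL && path != NULL && tags != NULL);
    const MsgDesc *msg = root;
    const char *seg = path;
    size_t n = 0;

    for (;;) {
        const char *dot = strchr(seg, '.');
        size_t seg_len = dot ? (size_t) (dot - seg) : strlen(seg);

        if (msg == NULL || seg_len == 0 || n == max_tags)
            return -1;
        const FieldDesc *f = find_field(msg, seg, seg_len);
        if (f == NULL)
            return -1;
        tags[n++] = f->tag;
        if (dot == NULL)
            return (int) n;
        msg = f->sub;            /* descend into the nested message type */
        seg = dot + 1;
    }
}
```

Resolving the path once per query and reusing the resulting tag list would keep the per-row work down to walking the wire format.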

Phase 3: Optimizations & Lazy Parsing

Improve the performance of reading large protobuf messages.

  • Core Logic: Instead of deserializing the whole message, walk the wire format and skip the bytes of fields whose tags are not requested, either through protobuf-c or by jumping over raw wire-format fields directly.
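
Skipping unrelated fields relies on the protobuf wire format: every field starts with a varint key holding the field number and wire type, and the wire type tells the reader how many bytes to jump over. The following is a minimal sketch under that model; pb_find_int32 is a hypothetical helper name, not the extension's function, and it handles only a top-level varint field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a base-128 varint. Returns bytes consumed, or 0 if the input is
 * truncated or longer than the 10-byte maximum. */
static size_t read_varint(const uint8_t *p, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        v |= (uint64_t) (p[i] & 0x7F) << (7 * i);
        if ((p[i] & 0x80) == 0) {
            *out = v;
            return i + 1;
        }
    }
    return 0;
}

/*
 * Scan a serialized message for a varint field with the given number,
 * skipping every other field without decoding it. The last occurrence wins.
 * Returns 1 if found, 0 if absent, -1 on malformed input.
 */
int pb_find_int32(const uint8_t *buf, size_t len, uint32_t field_number, int32_t *out)
{
    assert(buf != NULL && out != NULL);
    size_t pos = 0;
    int found = 0;

    while (pos < len) {
        uint64_t key, v;
        size_t n = read_varint(buf + pos, len - pos, &key);
        if (n == 0)
            return -1;
        pos += n;

        uint32_t wire_type = (uint32_t) (key & 0x7);
        uint64_t number = key >> 3;

        switch (wire_type) {
        case 0:                         /* VARINT */
            n = read_varint(buf + pos, len - pos, &v);
            if (n == 0)
                return -1;
            pos += n;
            if (number == field_number) {
                *out = (int32_t) (uint32_t) v;   /* keep the low 32 bits */
                found = 1;
            }
            break;
        case 1:                         /* I64: fixed 8 bytes */
            if (len - pos < 8)
                return -1;
            pos += 8;
            break;
        case 2:                         /* LEN: length prefix, then payload */
            n = read_varint(buf + pos, len - pos, &v);
            if (n == 0)
                return -1;
            pos += n;
            if (v > len - pos)
                return -1;
            pos += (size_t) v;          /* jump over the payload unread */
            break;
        case 5:                         /* I32: fixed 4 bytes */
            if (len - pos < 4)
                return -1;
            pos += 4;
            break;
        default:                        /* groups (3, 4) are not handled here */
            return -1;
        }
    }
    return found;
}
```

Because length-delimited fields are skipped by their prefix alone, large strings and nested messages cost a single pointer jump when they are not the requested field.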

Phase 4: Query Polish (TOAST, Operators)

Improve developer ergonomics.

  • TOAST Support: Mark storage as extended so Postgres can automatically compress large protobuf messages and move them out-of-line.
  • Operators: Shorthand syntax such as protobuf -> 'field' and protobuf #> '{path,to_field}'.

Phase 5: Purge JSONB (Strict Native Purity)

The final objective is zero reliance on JSONB.

  • Removals: Strip any pb_to_jsonb utilities and any internal jsonb conversion pathways used as bridges.
  • Custom Indexing: Implement direct indexing using custom C operator classes rather than relying on JSONB indexes.


💻 API Draft (Initial)

Custom Types

  • protobuf: The custom type for storing serialized bytes.

Functions

  • pb_to_jsonb(protobuf, text schema_name) returns jsonb
  • pb_get_string(protobuf, text schema_name, text path) returns text
  • pb_get_int(protobuf, text schema_name, text path) returns int4

Operators

  • protobuf -> path (Shorthand for extraction).