Design & Historical Implementation Plan
This document preserves the initial phased implementation plan and design considerations for pgproto.
🏗️ Architecture (Historical)
1. Internal Storage
Protobuf messages are binary. We store them internally using a Postgres varlena (variable length) structure.
typedef struct {
    int32 vl_len_;                     /* varlena header: total size including itself; use VARSIZE/SET_VARSIZE, never access directly */
    char  data[FLEXIBLE_ARRAY_MEMBER]; /* serialized Protobuf bytes */
} ProtobufData;
2. Schema Registry (Dynamic Reflection)
To understand what fields are in a binary blob, the extension needs the schema. We will use the Schema-Registered model.
- Registry Table: A system table (or extension-owned table) will store FileDescriptorSet blobs generated by protoc.
- Caching (Shared/Session Memory): To avoid parsing the schema on every row access, we will cache parsed descriptors in a hash table allocated in Postgres' TopMemoryContext, so they live for the duration of the session.
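The parse-once-per-session idea can be illustrated with a small standalone sketch. It is not the extension's code: it uses plain malloc and a fixed-size open-addressing table where the plan calls for a Postgres hash table in TopMemoryContext, and every name in it (schema_cache_get, ParseFn, demo_parse, CACHE_SLOTS) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64 /* hypothetical capacity; must be a power of two */

typedef struct {
    char *schema_name; /* NULL marks an empty slot */
    void *descriptor;  /* opaque parsed descriptor */
} CacheSlot;

/* Callback that parses a schema by name; stands in for FileDescriptorSet parsing. */
typedef void *(*ParseFn)(const char *schema_name);

static CacheSlot cache[CACHE_SLOTS]; /* zero-initialized, lives for the whole process */

/* FNV-1a hash of the schema name, reduced to a slot index. */
static size_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h & (CACHE_SLOTS - 1);
}

/* Return the cached descriptor for schema_name, parsing it only on the first request. */
void *schema_cache_get(const char *schema_name, ParseFn parse)
{
    size_t start = hash_name(schema_name);
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        CacheSlot *slot = &cache[(start + i) & (CACHE_SLOTS - 1)];
        if (slot->schema_name == NULL) {
            /* Cache miss: parse once and remember the result. */
            void *desc = parse(schema_name);
            if (desc == NULL)
                return NULL;
            size_t n = strlen(schema_name) + 1;
            char *copy = malloc(n);
            if (copy == NULL)
                return desc; /* out of memory: return the result uncached */
            memcpy(copy, schema_name, n);
            slot->schema_name = copy;
            slot->descriptor = desc;
            return desc;
        }
        if (strcmp(slot->schema_name, schema_name) == 0)
            return slot->descriptor; /* cache hit: no parsing */
    }
    return parse(schema_name); /* table full: fall back to an uncached parse */
}

/* Hypothetical stand-in parser that only counts how often it is called. */
int demo_parse_calls = 0;

void *demo_parse(const char *schema_name)
{
    static int dummy_descriptor;
    (void)schema_name;
    demo_parse_calls++;
    return &dummy_descriptor;
}
```

Calling schema_cache_get twice with the same name invokes the parser once. In the extension itself, the allocations would presumably come from TopMemoryContext rather than malloc, so that entries survive across queries within a session.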
📅 Phased Implementation Plan
Phase 0: Toolchain Setup (Docker)
Establish the development environment inside an isolated Docker container to avoid polluting the host machine.
- Base Environment: A Dockerfile based on the official postgres:18 image (Latest Stable).
- System Dependencies: build-essential, postgresql-server-dev-18, libprotobuf-c-dev, protobuf-c-compiler.
Phase 1: Varlena Infrastructure & Field-Tag Extraction
Establish the custom type and the C build environment.
- Required Files: pgproto.control, Makefile (PGXS), pgproto--1.0.sql, pgproto.c.
- Internal Custom Type: protobuf, backed by a varlena structure (vl_len_ and vl_dat).
- I/O Handlers: protobuf_in and protobuf_out using Hex encoding.
- Target Functions: pb_get_int32(protobuf, tag_number).
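As an illustration of the hex text representation planned for protobuf_in and protobuf_out, the following standalone sketch converts between a hex string and raw bytes. It deliberately omits the Postgres function-manager glue (PG_FUNCTION_ARGS, palloc, error reporting), and the helper names hex_decode and hex_encode are hypothetical, not part of the plan.

```c
#include <stddef.h>
#include <stdint.h>

/* Map one hex digit to its value, or -1 if the character is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode a NUL-terminated hex string into out (capacity out_cap bytes).
 * Returns the number of bytes written, or -1 on odd length, a non-hex
 * character, or insufficient capacity.
 */
long hex_decode(const char *hex, uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    while (hex[0] != '\0') {
        int hi = hex_nibble(hex[0]);
        int lo = (hex[1] != '\0') ? hex_nibble(hex[1]) : -1;
        if (hi < 0 || lo < 0 || n >= out_cap)
            return -1;
        out[n++] = (uint8_t)((hi << 4) | lo);
        hex += 2;
    }
    return (long)n;
}

/* Encode len bytes as lowercase hex; out must hold 2 * len + 1 characters. */
void hex_encode(const uint8_t *data, size_t len, char *out)
{
    static const char digits[] = "0123456789abcdef";
    for (size_t i = 0; i < len; i++) {
        out[2 * i]     = digits[data[i] >> 4];
        out[2 * i + 1] = digits[data[i] & 0x0F];
    }
    out[2 * len] = '\0';
}
```

In an input handler, the decoded bytes would be copied into the varlena payload. An output handler would do the reverse with hex_encode.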
Phase 2: Schema Registry & Dynamic Reflection
Transition from hardcoded tag numbers to named query paths.
- Schema Table: pb_schemas storing FileDescriptorSet binary blobs.
- Caching Architecture: Cache parsed descriptors in a session-wide hash table (TopMemoryContext) to prevent parsing on every row fetch.
- Target Functions: pb_get_string(protobuf, 'schema_name.MessageName', 'field.subfield').
Phase 3: Optimizations & Lazy Parsing
Improve performance of reading large protobuf messages.
- Core Logic: Instead of deserializing the full message, skip over the bytes of fields whose tags were not requested, using protobuf-c pointer skipping or raw wire-format tag jumps.
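A raw wire-format scan of the kind described above might look like the following sketch. It walks the top-level fields of a serialized message, skips every field whose number does not match, and decodes only the requested varint. It relies on the standard Protobuf wire types (0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit) and treats the deprecated group types as errors. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a base-128 varint at buf[*pos] and advance *pos. False if truncated or overlong. */
static bool read_varint(const uint8_t *buf, size_t len, size_t *pos, uint64_t *out)
{
    uint64_t value = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (*pos >= len)
            return false;
        uint8_t byte = buf[(*pos)++];
        value |= (uint64_t)(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            *out = value;
            return true;
        }
    }
    return false; /* more than 10 bytes: malformed */
}

/* Advance *pos past the payload of a field with the given wire type. */
static bool skip_field(const uint8_t *buf, size_t len, size_t *pos, unsigned wire_type)
{
    uint64_t n;
    switch (wire_type) {
    case 0: /* varint */
        return read_varint(buf, len, pos, &n);
    case 1: /* fixed 64-bit */
        if (len - *pos < 8) return false;
        *pos += 8;
        return true;
    case 2: /* length-delimited: jump over the payload without reading it */
        if (!read_varint(buf, len, pos, &n) || n > len - *pos) return false;
        *pos += (size_t)n;
        return true;
    case 5: /* fixed 32-bit */
        if (len - *pos < 4) return false;
        *pos += 4;
        return true;
    default: /* groups (3, 4) and unknown types are not handled in this sketch */
        return false;
    }
}

/*
 * Find a top-level varint field by number and store it in *out as an int32.
 * Returns false if the field is absent or the message is malformed.
 */
bool pb_scan_int32(const uint8_t *buf, size_t len, uint32_t tag, int32_t *out)
{
    size_t pos = 0;
    bool found = false;
    while (pos < len) {
        uint64_t key, value;
        if (!read_varint(buf, len, &pos, &key))
            return false;
        unsigned wire_type = (unsigned)(key & 7);
        if ((key >> 3) == tag && wire_type == 0) {
            if (!read_varint(buf, len, &pos, &value))
                return false;
            *out = (int32_t)(uint32_t)value; /* keep the low 32 bits; last occurrence wins */
            found = true;
        } else if (!skip_field(buf, len, &pos, wire_type)) {
            return false;
        }
    }
    return found;
}
```

Length-delimited fields such as strings and nested messages cost only one length read and a pointer bump, which is where the saving over full deserialization would come from on large messages.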
Phase 4: Query Polish (TOAST, Operators)
Improve developer ergonomics at the query level.
- TOAST Support: Mark storage as extended so Postgres can automatically compress large protobuf messages and store them out-of-line.
- Operators: Shorthand syntaxes like protobuf -> 'field' and protobuf #> '{path,to_field}'.
Phase 5: Purge JSONB (Strict Native Purity)
The final objective is zero reliance on JSONB.
- Removals: Strip any pb_to_jsonb utilities or internal jsonb conversion pathways used as bridges.
- Custom Indexing: Implement direct indexing using custom C operator classes rather than relying on JSONB indices.
💻 API Draft (Initial)
Custom Types
protobuf: The custom type for storing serialized bytes.
Functions
- pb_to_jsonb(protobuf, text schema_name) returns jsonb
- pb_get_string(protobuf, text schema_name, text path) returns text
- pb_get_int(protobuf, text schema_name, text path) returns int4
Operators
protobuf -> path (shorthand for extraction).