json_fdw

This Release
json_fdw 1.0.0
Date
Status
Stable
Abstract
Foreign Data Wrapper for JSON files
Description
PostgreSQL extension which implements a Foreign Data Wrapper (FDW) for JSON files.
Released By
hadi
License
GPL 3
Resources
Special Files
Tags

Extensions

json_fdw 1.0.0
Foreign Data Wrapper for JSON files

README

json_fdw

This PostgreSQL extension implements a Foreign Data Wrapper (FDW) for JSON files. The extension doesn't require any data to be loaded into the database, and supports analytic queries against array types, nested fields, and heterogeneous documents.

json_fdw currently only works with PostgreSQL 9.2, and uses YAJL to parse JSON files. Future releases of this wrapper will use the JSON parser functions that are to going to be introduced in the PostgreSQL 9.3 release.

Building

json_fdw depends on yajl-2.0 for parsing, and zlib-devel to read compressed files. So we need to install these packages first:

## Fedora 17+
sudo yum install zlib-devel yajl-devel

## Ubuntu 12.10+
sudo apt-get update
sudo apt-get install zlib1g-dev libyajl-dev

## Other Linux Distributions
(First install zlib-devel, cmake, and ruby)
wget http://github.com/lloyd/yajl/tarball/2.0.1 -O yajl-2.0.1.tar.gz
tar -xzvf yajl-2.0.1.tar.gz
cd lloyd-yajl-f4b2b1a
./configure
make
sudo make install
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/libyajl.conf
sudo ldconfig

Once you have yajl-2.0 and zlib installed on your machine, you are ready to build json_fdw. For this, you need to include the pg_config directory path in your make command. This path is typically the same as your PostgreSQL installation's bin/ directory path. For example:

PATH=/usr/local/pgsql/bin/:$PATH make
sudo PATH=/usr/local/pgsql/bin/:$PATH make install

Note: In RedHat 5.X and CentOS 5.X you may need to edit the Makefile and change "-l:libyajl.so.2" to "-lyajl".

Usage

These two parameters can be set on a JSON foreign table object.

  • filename: The absolute path of a json file or a gzipped json file.
  • max_error_count: Maximum number of invalid json documents to skip before erroring out. Defaults to 0.

As an example, we demonstrate querying a compressed JSON file from scratch here. We note that the underlying file contains JSON documents separated by newlines, and that no data needs to be loaded into the database. Let's now start with downloading the file.

wget http://examples.citusdata.com/customer_reviews_nested_1998.json.gz

Next, let's log into Postgres, and run the following commands to create a foreign table associated with this JSON file.

-- load extension first time after install
CREATE EXTENSION json_fdw;

-- create server object
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

-- create foreign table
CREATE FOREIGN TABLE customer_reviews
(
    customer_id TEXT,
    "review.date" DATE,
    "review.rating" INTEGER,
    "product.id" CHAR(10),
    "product.group" TEXT,
    "product.title" TEXT,
    "product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');

-- optionally, collect data distribution statistics
ANALYZE customer_reviews;

Finally, let's run some example SQL queries on your JSON file.

-- find all reviews a particular customer made on the Dune series in 1998

SELECT
    customer_id, "review.rating", "product.id", "product.title"
FROM
    customer_reviews
WHERE
    customer_id ='A27T7HVDXA3K2A' AND
    "product.title" LIKE '%Dune%' AND
    "review.date" >= '1998-01-01' AND
    "review.date" <= '1998-12-31';

-- do we have a correlation between a book's title's length and its review ratings?

SELECT
    width_bucket(length("product.title"), 1, 50, 5) title_length_bucket,
    round(avg("review.rating"), 2) AS review_average,
    count(*)
FROM
    customer_reviews
WHERE
    "product.group" = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;

Table Schema Conventions

There are three things worth noting about table schemas. First, nested fields in JSON documents are referenced using dot separators. For example, a field defined as "review": { "rating" : 5 } in a JSON document is declared as "review.rating" in the foreign table schema. The quotes around "review.rating" are necessary, as identifiers that include dots aren't valid in Postgres otherwise.

Second, the foreign table schema is defined at read-time. If you have an additional field that you'd like to query, such as "review.votes", you can simply add the column name and start querying for data. You can even create multiple table schemas for the same underlying file, and query through them.

Third, json_fdw assumes that underlying data can be heterogeneous. If you are querying for a column, and this field doesn't exist in a document, or the field's data type doesn't match the declared column type, json_fdw considers that particular field to be null.

Querying Multiple Files

json_fdw borrows its semantics from file_fdw, and associates one foreign table with one JSON file. If you'd like to query all your JSON files from one table, you could take one of two approaches. You could either use PostgreSQL's basic table partitioning feature, and manually create one child table per JSON file.

Alternatively, you could use CitusDB binaries, and "stage" data into a distributed foreign table. With this approach, you can also have the database automatically collect statistics about the underlying data, and apply query optimizations such as partition pruning. For more info, please see our documentation page at http://citusdata.com/docs/foreign-data , or contact us at engage @ citusdata.com.

Limitations

  • json_fdw only supports files that consist of one JSON document per line. It doesn't support objects that span multiple lines.

  • PostgreSQL limits column names to 63 characters by default. If you need column names that are longer, you can increase the NAMEDATALEN constant in src/include/pg_config_manual.h, compile, and reinstall.

Copyright (c) 2013 Citus Data, Inc.

This module is free software; you can redistribute it and/or modify it under the GNU GPL v3.0 License.

For all types of questions and comments about the wrapper, please contact us at engage @ citusdata.com.