Extensions
- json_fdw 1.0.0
- Foreign Data Wrapper for JSON files
README
Contents
json_fdw
This PostgreSQL extension implements a Foreign Data Wrapper (FDW) for JSON files. The extension doesn't require any data to be loaded into the database, and supports analytic queries against array types, nested fields, and heterogeneous documents.
json_fdw currently only works with PostgreSQL 9.2, and uses YAJL to parse JSON files. Future releases of this wrapper will use the JSON parser functions that are to going to be introduced in the PostgreSQL 9.3 release.
Building
json_fdw depends on yajl-2.0 for parsing, and zlib-devel to read compressed files. So we need to install these packages first:
## Fedora 17+
sudo yum install zlib-devel yajl-devel
## Ubuntu 12.10+
sudo apt-get update
sudo apt-get install zlib1g-dev libyajl-dev
## Other Linux Distributions
(First install zlib-devel, cmake, and ruby)
wget http://github.com/lloyd/yajl/tarball/2.0.1 -O yajl-2.0.1.tar.gz
tar -xzvf yajl-2.0.1.tar.gz
cd lloyd-yajl-f4b2b1a
./configure
make
sudo make install
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/libyajl.conf
sudo ldconfig
Once you have yajl-2.0 and zlib installed on your machine, you are ready to build json_fdw. For this, you need to include the pg_config directory path in your make command. This path is typically the same as your PostgreSQL installation's bin/ directory path. For example:
PATH=/usr/local/pgsql/bin/:$PATH make
sudo PATH=/usr/local/pgsql/bin/:$PATH make install
Note: In RedHat 5.X and CentOS 5.X you may need to edit the Makefile and change "-l:libyajl.so.2" to "-lyajl".
Usage
These two parameters can be set on a JSON foreign table object.
- filename: The absolute path of a json file or a gzipped json file.
- max_error_count: Maximum number of invalid json documents to skip before erroring out. Defaults to 0.
As an example, we demonstrate querying a compressed JSON file from scratch here. We note that the underlying file contains JSON documents separated by newlines, and that no data needs to be loaded into the database. Let's now start with downloading the file.
wget http://examples.citusdata.com/customer_reviews_nested_1998.json.gz
Next, let's log into Postgres, and run the following commands to create a foreign table associated with this JSON file.
-- load extension first time after install
CREATE EXTENSION json_fdw;
-- create server object
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
-- create foreign table
CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
-- optionally, collect data distribution statistics
ANALYZE customer_reviews;
Finally, let's run some example SQL queries on your JSON file.
-- find all reviews a particular customer made on the Dune series in 1998
SELECT
customer_id, "review.rating", "product.id", "product.title"
FROM
customer_reviews
WHERE
customer_id ='A27T7HVDXA3K2A' AND
"product.title" LIKE '%Dune%' AND
"review.date" >= '1998-01-01' AND
"review.date" <= '1998-12-31';
-- do we have a correlation between a book's title's length and its review ratings?
SELECT
width_bucket(length("product.title"), 1, 50, 5) title_length_bucket,
round(avg("review.rating"), 2) AS review_average,
count(*)
FROM
customer_reviews
WHERE
"product.group" = 'Book'
GROUP BY
title_length_bucket
ORDER BY
title_length_bucket;
Table Schema Conventions
There are three things worth noting about table schemas. First, nested fields in JSON documents are referenced using dot separators. For example, a field defined as "review": { "rating" : 5 } in a JSON document is declared as "review.rating" in the foreign table schema. The quotes around "review.rating" are necessary, as identifiers that include dots aren't valid in Postgres otherwise.
Second, the foreign table schema is defined at read-time. If you have an additional field that you'd like to query, such as "review.votes", you can simply add the column name and start querying for data. You can even create multiple table schemas for the same underlying file, and query through them.
Third, json_fdw assumes that underlying data can be heterogeneous. If you are querying for a column, and this field doesn't exist in a document, or the field's data type doesn't match the declared column type, json_fdw considers that particular field to be null.
Querying Multiple Files
json_fdw borrows its semantics from file_fdw, and associates one foreign table with one JSON file. If you'd like to query all your JSON files from one table, you could take one of two approaches. You could either use PostgreSQL's basic table partitioning feature, and manually create one child table per JSON file.
Alternatively, you could use CitusDB binaries, and "stage" data into a distributed foreign table. With this approach, you can also have the database automatically collect statistics about the underlying data, and apply query optimizations such as partition pruning. For more info, please see our documentation page at http://citusdata.com/docs/foreign-data , or contact us at engage @ citusdata.com.
Limitations
json_fdw only supports files that consist of one JSON document per line. It doesn't support objects that span multiple lines.
PostgreSQL limits column names to 63 characters by default. If you need column names that are longer, you can increase the NAMEDATALEN constant in src/include/pg_config_manual.h, compile, and reinstall.
Copyright
Copyright (c) 2013 Citus Data, Inc.
This module is free software; you can redistribute it and/or modify it under the GNU GPL v3.0 License.
For all types of questions and comments about the wrapper, please contact us at engage @ citusdata.com.