PostgreSQL_Anonymizer 0.3.0

This Release
PostgreSQL_Anonymizer 0.3.0
Date
Status
Stable
Other Releases
Abstract
Data Anonymization for Postgres
Description
Mask or replace sensitive data with Postgres.
Released By
daamien
License
PostgreSQL
Resources
Special Files
Tags

Extensions

anon 0.3.0
Data Anonymization for Postgres

Documentation

AUTHORS
PostgreSQL Anonymizer Development Team
CHANGELOG
CHANGELOG
README
Data Sources
CONTRIBUTING
How To Contribute
NEWS
PostgreSQL Anonymizer 0.3 : In-Place Masking and Anonymous Dumps
LICENSE
The PostgreSQL License
RELEASING
Publishing a new Release

README

PostgreSQL Anonymizer

Anonymization & Data Masking for PostgreSQL

postgresql_anonymizer is an extension to mask or replace personally identifiable information (PII) or commercially sensitive data from a PostgreSQL database.

The projet is aiming toward a declarative approach of anonymization. This means we're trying to extend PostgreSQL Data Definition Language (DDL) in order to specify the anonymization strategy inside the table definition itself.

Once the maskings rules are defined, you can access the anonymized data in 3
different ways :

In addition, various Masking Functions are available : randomization, faking, partial scrambling, shufflin, noise or even your own custom function !

Read the Concepts section for more details and NEWS.md for information about the latest version.

Declaring The Masking Rules

The main idea of this extension is to offer anonymization by design.

The data masking rules should be written by the people who develop the application because they have the best knowledge of how the data model works. Therefore masking rules must be implemented directly inside the database schema.

This allows to mask the data directly inside the PostgreSQL instance without using an external tool and thus limiting the exposure and the risks of data leak.

The data masking rules are declared simply by using the COMMENT syntax :

```sql =# CREATE EXTENSION IF NOT EXISTS anon CASCADE;

=# SELECT anon.load();

=# CREATE TABLE player( id SERIAL, name TEXT, points INT);

=# COMMENT ON COLUMN player.name IS 'MASKED WITH FUNCTION anon.fake_last_name()'; ```

If your columns already have comments, simply append the MASKED WITH FUNCTION statement at the end of the comment.

In-Place Anonymization

You can permanetly remove the PII from a database with anon.anymize_database(). This will destroy the original data. Use with care.

```sql =# SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+------------------+------------+---------------+---------+--------- 911 | Chuck Norris | 1940-03-10 | Texas Rangers | 75001 | 12 112 | David Hasselhoff | 1952-07-17 | Baywatch | 90001 | 423

=# CREATE EXTENSION IF NOT EXISTS anon CASCADE; =# SELECT anon.load();

=# COMMENT ON COLUMN customer.full_name -# IS 'MASKED WITH FUNCTION anon.fake_first_name() || '' '' || anon.fake_last_name()';

=# COMMENT ON COLUMN customer.birth
-# IS 'MASKED WITH FUNCTION anon.random_date_between(''01/01/1920''::DATE,now())';

=# COMMENT ON COLUMN customer.employer -# IS 'MASKED WITH FUNCTION anon.fake_company()';

=# COMMENT ON COLUMN customer.zipcode -# IS 'MASKED WITH FUNCTION anon.random_zip()';

=# SELECT anon.anonymize_database();

=# SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+-------------------+------------+------------------+---------+--------- 911 | michel Duffus | 1970-03-24 | Body Expressions | 63824 | 12 112 | andromache Tulip | 1921-03-24 | Dot Darcy

```

You can also use anonymize_table() and anonymize_column() to remove data from a subset of the database.

Dynamic Masking

You can hide the PII from a role by declaring it as a "MASKED". Other roles will still access the original data.

Example:

sql =# SELECT * FROM people; id | fistname | lastname | phone
----+----------+----------+------------ T1 | Sarah | Conor | 0609110911 (1 row)

Step 1 : Activate the dynamic masking engine

sql =# CREATE EXTENSION IF NOT EXISTS anon CASCADE; =# SELECT anon.start_dynamic_masking();

Step 2 : Declare a masked user

sql =# CREATE ROLE skynet LOGIN; =# COMMENT ON ROLE skynet IS 'MASKED';

Step 3 : Declare the masking rules

```sql =# COMMENT ON COLUMN people.lastname IS 'MASKED WITH FUNCTION anon.fake_last_name()';

=# COMMENT ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$**$$,2)'; ```

Step 4 : Connect with the masked user

sql =# \! psql peopledb -U skynet -c 'SELECT * FROM people;' id | fistname | lastname | phone
----+----------+-----------+------------ T1 | Sarah | Stranahan | 06******11 (1 row)

Anonymous Dumps

Due to the core design of this extension, you cannot use pg_dump with a masked user. If you want to export the entire database with the anonymized data, you must use the anon.dump() function :

console $ psql [...] -qtA -c 'SELECT anon.dump()' your_dabatase > dump.sql

NB: The -qtA flags are required.

Warning

This is projet is at an early stage of development and should used carefully.

We need your feedback and ideas ! Let us know what you think of this tool,how it fits your needs and what features are missing.

You can either open an issue or send a message at contact@dalibo.com.

Requirements

This extension is officially supported on PostgreSQL 9.6 and later. It should also work on PostgreSQL 9.5 with a bit of hacking. See NOTES.md for more details.

It requires two extensions :

  • tsm_system_rows which is delivered by the postgresql-contrib package of the main linux distributions

  • ddlx a very cool DDL extrator

Install

Simply run :

console sudo pgxn install ddlx sudo pgxn install postgresql_anonymizer

or see INSTALL.md for more detailed instructions or if you want to deploy it on Amazon RDS or some other DBAAS service.

Limitations

  • The masking are declared using the comments on columns. If you data model already contains comments on some columns, you must append the masking rule after the original comment

  • The dynamic masking system only works with one schema (by default public). When you start the masking engine with start_dynamic_masking(), you can specify the schema that will be masked with SELECT start_dynamic_masking('sales');. However in-place anonymization with anon.anonymize()and anonymous export with anon.dump() will work fine will multiple schemas.

Performance

So far, we've done very few performance tests. Depending on the size of your data set and number of columns your need to anonymize, you might end up with a very slow process.

Here's some ideas :

Sampling

If your need to anonymize data for testing purpose, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data. There are mulitple way to extract a sample of database :

Materialized Views

Dynamic masking is not always required ! In some cases, it is more efficient to build Materialized Views instead.

For instance:

SQL CREATE MATERIALIZED VIEW masked_customer AS SELECT id, anon.random_last_name() AS name, anon.random_date_between('01/01/1920'::DATE,now()) AS birth, fk_last_order, store_id FROM customer;