PostgreSQL_Anonymizer 0.2.1

This Release

Date

2018-10-29

Status

Stable

Latest Stable

PostgreSQL_Anonymizer 1.3.2 — 2024-04-02

Other Releases

Abstract

Data Anonymization for Postgres

Description

This extension is a suite of database functions that replace sensitive data with plausible values.

Released By

daamien

License

PostgreSQL

Resources

Special Files

Tags

Data Anonymization, Masking

Extensions

anon 0.2.1: Data Anonymization for Postgres

Documentation

LICENSE: The PostgreSQL License
NEWS: Inroducing PostgreSQL Anonymizer 0.2.1 !
NOTES: Development Notes
CHANGELOG: CHANGELOG
README: Data Sources
CONTRIBUTING: How To Contribute

README

Anonymizing and Masking Data with PostgreSQL

postgresql_anonymizer is an extension to mask or replace personally identifiable information (PII) or commercially sensitive data from a PostgreSQL database.

The projet is aiming toward a declarative approach of anonymization. This means we're trying to extend PostgreSQL Data Definition Language (DDL) in order to specify the anonymization strategy inside the table definition itself.

The extension can be used to put dynamic masks on certain users or permanently modify sensitive data. Various masking techniques are available : randomization, partial scrambling or custom rules.

Read the Concepts section for more details.

Warning

This is projet is at an early stage of development and should used carefully.

I need your feedback and ideas ! Let me know what you think of this tool,how it fits your needs and what features are missing.

You can either open an issue or send a message at daamien@gmail.com.

Example

```sql =# CREATE EXTENSION IF NOT EXISTS anon CASCADE;

=# SELECT anon.load();

=# UPDATE customer -# SET -# full_name=anon.random_first_name() || ' ' || anon.random_last_name(), -# birth=anon.random_date_between('01/01/1920'::DATE,now()), -# employer=anon.random_company(), -# zipcode=anon.random_zip() -# ;

Declarative Data Masking

sql =# SELECT * FROM people; id | name | phone ------+----------------+------------ T800 | Schwarzenegger | 0609110911 (1 row)

STEP 1 : Activate the masking engine

sql =# CREATE EXTENSION IF NOT EXISTS anon CASCADE; =# SELECT anon.mask_init();

STEP 2 : Declare a masked user

sql =# CREATE ROLE skynet; =# COMMENT ON ROLE skynet IS 'MASKED';

STEP 3 : Declare the masking rules

```sql =# COMMENT ON COLUMN people.name IS 'MASKED WITH FUNCTION anon.random_last_name()';

=# COMMENT ON COLUMN people.phone IS 'MASKED WITH FUNCTION anon.partial(phone,2,$$**$$,2)'; ```

STEP 4 : Connect with the masked user

sql =# \! psql test -U skynet -c 'SELECT * FROM people;' id | name | phone ------+----------+------------ T800 | Nunziata | 06******11 (1 row)

Requirements

This extension is officially supported on PostgreSQL 9.6 and later. It should also work on PostgreSQL 9.5 with a bit of hacking. See NOTES.md for more details.

It requires an extension named tsm_system_rows, which is delivered by the postgresql-contrib package of the main linux distributions.

Install

With PGXN :

console sudo apt install pgxnclient (or pip install pgxn) sudo pgxn install postgresql_anonymizer

From source :

console make sudo make install

How To Use

Load the extension in your database like this:

sql CREATE EXTENSION IF NOT EXISTS anon CASCADE; SELECT anon.load();

The load() function will charge a default dataset of random data ( lists names, cities, etc. ). If you want to use your own dataset, you can load custom CSV files with load('/path/to/custom_cvs_files/')

You now have access to the following functions :

Generic data

anon.random_date() returns a date
anon.random_date_between(d1,d2) returns a date between d1 and d2
anon.random_int_between(i1,i2) returns an integer between i1 and i2
anon.random_string(n) returns a TEXT value containing n letters

Personal data

anon.random_first_name() returns a generic first name
anon.random_last_name() returns a generic last name
anon.random_email() returns a valid email address
anon.random_zip() returns a 5-digit code
anon.random_city() returns an existing city
anon.random_city_in_country(c) returns a city in country c
anon.random_region() returns an existing region
anon.random_region_in_country(c) returns a region in country c
anon.random_country() returns a country
anon.random_phone(p) return a 8-digit phone with p as a prefix

Company data

anon.random_company() returns a generic company name
anon.random_iban() returns a valid IBAN
anon.random_siret() returns a valid SIRET
anon.random_siren() returns a valid SIREN

Upgrade

Currently there's no way to upgrade easily from a version to another. The operation ALTER EXTENSION ... UPDATE ... is not supported.

You need to drop and recreate the extension after every upgrade.

Concepts

Two main strategies are used:

Dynamic Masking offers an altered view of the real data without modifying it. Some users may only read the masked data, others may access the authentic version.
Permanent Destruction is the definitive action of substituting the sensitive information with uncorrelated data. Once processed, the authentic data cannot be retrieved.

The data can be altered with several techniques:

Deletion or Nullification simply removes data.
Static Subtitution consistently replaces the data with a generic values. For instance: replacing all values of TEXT column with the value "CONFIDENTIAL".
Variance is the action of "shifting" dates and numeric values. For example, by applying a +/- 10% variance to a salary column, the dataset will remain meaningful.
Encryption uses an encryption algorithm and requires a private key. If the key is stolen, authentic data can be revealed.
Shuffling mixes values within the same columns. This method is open to being reversed if the shuffling algorithm can be deciphered.
Randomization replace sensitive data with random-but-plausible values. The goal is to avoid any identification from the data record while remaining suitable for testing, data analysis and data processing.
Partial scrambling is similar to static substitution but leaves out some part of the data. For instance : a credit card number can be replaced by '40XX XXXX XXXX XX96'
Custom rules are designed to alter data following specific needs. For instance, randomizing simultanously a zipcode and a city name while keeping them coherent.

For now, this extension is especially focusing on randomization and Partial Scrambling and Custom Rules but it should be easy to implement other methods as well.

Performance

So far, we've done very few performance tests. Depending on the size of your data set and number of columns your need to anonymize, you might end up with a very slow process.

Here's some ideas :

Sampling

If your need to anonymize data for testing purpose, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data. There are mulitple way to extract a sample of database :

Materialized Views

Dynamic masking is not always required ! In some cases, it is more efficient to build Materialized Views instead.

For instance:

SQL CREATE MATERIALIZED VIEW masked_customer AS SELECT id, anon.random_last_name() AS name, anon.random_date_between('01/01/1920'::DATE,now()) AS birth, fk_last_order, store_id FROM customer;

Links

Dynamic Data Masking With MS SQL Server https://docs.microsoft.com/fr-fr/sql/relational-databases/security/dynamic-data-masking
Citus : Using search_path and views to hide columns for reporting with Postgres https://www.citusdata.com/blog/2018/07/03/masking-columns-in-postgresql/
MariaDB : Masking with maxscale https://mariadb.com/kb/en/mariadb-enterprise/mariadb-maxscale-21-masking/
pgsync https://github.com/ankane/pgsync

PGXN

PostgreSQL Extension Network

PostgreSQL_Anonymizer 0.2.1

Extensions

Documentation

README

Contents

Anonymizing and Masking Data with PostgreSQL

Warning

Example

Declarative Data Masking

Requirements

Install

With PGXN :

From source :

How To Use

Generic data

Personal data

Company data

Upgrade

Concepts

Performance

Sampling

Materialized Views

Links