Extensions
- anon 0.0.3
- Data Anonymization for Postgres
Documentation
README
Contents
Data Anonymizer Extension for PostgreSQL
postgresql_anonymizer
is a set of SQL functions that remove personally
identifiable values from a PostgreSQL table and replace them with
random-but-plausible values. The goal is to avoid any identification
from the data record while remaining suitable for testing, data analysis and
data processing.
This is projet is at an early stage of development and should used carefully.
Example
```sql =# CREATE EXTENSION IF NOT EXISTS tsm_system_rows; =# CREATE EXTENSION IF NOT EXISTS anon;
=# SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+------------------+------------+---------------+---------+--------- 911 | Chuck Norris | 1940-03-10 | Texas Rangers | 75001 | 12 112 | David Hasselhoff | 1952-07-17 | Baywatch | 90001 | 423
=# UPDATE customer -# SET -# full_name=anon.random_first_name() || ' ' || anon.random_last_name(), -# birth=anon.random_date_between('01/01/1920'::DATE,now()), -# employer=anon.random_company(), -# zipcode=anon.random_zip() -# ;
=# SELECT * FROM customer; id | full_name | birth | employer | zipcode | fk_shop -----+-------------------+------------+------------------+---------+--------- 911 | michel Duffus | 1970-03-24 | Body Expressions | 63824 | 12 112 | andromache Tulip | 1921-03-24 | Dot Darcy | 73231 | 423 ```
Requirements
This extension will work with PostgreSQL 9.5 and later versions.
It requires an extension named tsm_system_rows
, which is delivered by the
postgresql-contrib
package of the main linux distributions.
Install
With PGXN :
console
apt install pgxnclient (or pip install pgxn)
pgxn install postgresql_anonymizer
From source :
console
make
make install
How To Use
Load the extension in your database like this:
sql
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
CREATE EXTENSION IF NOT EXISTS anon;
You now have access to the following functions :
Generic data
- anon.random_date() returns a date
- anon.random_date_between(d1,d2) returns a date between
d1
andd2
- anon.random_int_between(i1,i2) returns an integer between
i1
andi2
- anon.random_string(n) returns a TEXT value containing
n
letters
Personal data
- anon.random_first_name() returns a generic first name
- anon.random_last_name() returns a generic last name
- anon.random_email() returns a valid email address
- anon.random_zip() returns a 5-digit code
- anon.random_city() returns an existing city
- anon.random_city_in_country(c) returns a city in country
c
- anon.random_region() returns an existing region
- anon.random_region_in_country(c) returns a region in country
c
- anon.random_country() returns a country
- anon.random_phone(p) return a 8-digit phone with
p
as a prefix
Company data
- anon.random_company() returns a generic company name
- anon.random_iban() : returns a valid IBAN
- anon.random_siret() : returns a valid SIRET
- anon.random_siren() : returns a valid SIREN
Performance
So far, we've done very few performance tests. Depending on the size of your data set and number of columns your need to anonymize, you might end up with a very slow process.
Here's some ideas :
Sampling
If your need to anonymize data for testing purpose, chances are that a smaller subset of your database will be enough. In that case, you can easily speed up the anonymization by downsizing the volume of data. There are mulitple way to extract a sample of database :
Dynamic Masking
If you need to limit sensitive data exposure to non-privileged users, you can build Dynamic Masking Views that will automatically replace personal data with anonymized values.
For instance:
SQL
CREATE VIEW masked_customer AS
SELECT
id,
anon.random_last_name() AS name,
anon.random_date_between('01/01/1920'::DATE,now()) AS birth,
fk_last_order,
store_id
FROM customer;
In certain use cases, Materialized View can be usefull here.