Contents
Important notice
This is a pre-release version of the extension and is not intended for general use yet. It may be unstable and documentation is limited. If you have any questions, please contact us at hello@open-diffix.org.
User Guide
This document provides detailed information about the configuration, behavior and recommended usage of Diffix for PostgreSQL.
What is Diffix?
Diffix is a bundled set of mechanisms for anonymizing structured data. It was jointly developed by Aircloak GmbH and the Max Planck Institute for Software Systems. Diffix exploits mechanisms that have been in use by national statistics offices for decades: aggregation, generalization, noise, suppression, and swapping. It automatically applies these mechanisms as needed on a query-by-query basis to minimize noise while ensuring strong anonymity.
What is Diffix for PostgreSQL?
Diffix for PostgreSQL is an implementation of the Diffix mechanism for PostgreSQL databases. It is provided as an extension module for PostgreSQL version 13 or higher.
Configuring the extension
System behaviour can be configured using a combination of custom variables and security labels.
Labeling database objects
The module acts as a security provider and allows the marking of database objects with various anonymization labels, which associate additional metadata, needed for anonymization, with existing objects. For more details about security labels, see the official documentation page.
Only superusers can set anonymization labels.
To remove an anonymization label from an object, set it to NULL
.
Execute SELECT * FROM diffix.show_labels();
to display the current labels in use by the extension.
Tables can be labeled as public
or personal
. Direct access is allowed to public data even for restricted users.
Unlabeled tables can not be queried (unless treat_unmarked_tables_as_public
is set to true
).
Anonymization ID (AID) columns for a personal table have to be marked with the anonymization label aid
.
A personal table can have one or more AID columns.
In order to label a table as personal
and AID columns, use the
diffix.mark_personal(namespace, table_name, aid_columns...)
procedure, for example:
SQL
CALL diffix.mark_personal('public', 'my_table', 'id', 'last_name');
In order to label a table as public
, use the diffix.mark_public(namespace, table_name)
procedure.
Regular users can be marked with the anoymization labels direct
, anonymized_trusted
or anonymized_untrusted
.
The value of the custom variable pg_diffix.default_access_level
determines the access level for unlabeled regular users.
SQL
SECURITY LABEL FOR pg_diffix ON ROLE analyst IS 'anonymized_trusted';
Restricted features and extensions
At access levels other than direct
, various data and features built into PostgreSQL are restricted. Among others:
- Issue utility statements like
COPY
andALTER TABLE
, beside a few allowlisted ones, are not allowed. - Some of the data in
pg_catalog
tables likepg_user_functions
is not accessible. - Selected subset of less frequently used PostgreSQL query features like
EXISTS
orNULLIF
are disabled. - Inheritance involving a personal table is not allowed.
- Some of the output of
EXPLAIN
for queries involving a personal table is censored.
NOTE If any of the currently blocked features is necessary for your use case, open an issue and let us know.
Row level security (RLS) can be enabled and used on personal tables. It is advised that the active policies are vetted from the point of view of anonymity.
It is also strongly advised to vet any other extensions which are enabled alongside pg_diffix
,
as well as any user-defined functions and aggregate functions.
System settings
The module exposes a number of custom variables under the pg_diffix
prefix.
Superusers can change these variables at runtime for their own session,
while regular users only have read access to them (with few notable exceptions).
To use different values for all future sessions, they have to be set in the configuration file.
Execute SELECT * FROM diffix.show_settings();
to display the current settings of the extension.
If the result is empty, make sure pg_diffix
is loaded.
Data access settings
pg_diffix.default_access_level
- Determines the access level for unlabeled users. Default value is direct
.
pg_diffix.session_access_level
- Sets the access level for the current session. It can never be higher than the access
level for the current user. Can be changed by all users. Defaults to maximum access level allowed.
pg_diffix.treat_unmarked_tables_as_public
- Controls whether unmarked tables are readable and treated as public data.
Default value is false
.
Noise settings
pg_diffix.noise_layer_sd
- Standard deviation for each noise layer added to aggregates. Default value is 1.0.
Low count filter settings
pg_diffix.low_count_min_threshold
- The lower bound for the number of distinct AID values that must be present in a
bucket for it to pass the low count filter. Default value is 2.
pg_diffix.low_count_mean_gap
- The number of standard deviations between the lower bound and the mean of the
low count filter threshold. Default value is 2.0.
pg_diffix.low_count_layer_sd
- The standard deviation for each noise layer used when calculating the low count filter
threshold. Default value is 1.0.
pg_diffix.compute_suppress_bin
- If True
, some rows in the returned result might belong to the suppress bin,
combining data of all the suppressed (rejected by the low count filter) protected entities. For an aggregating query,
the suppress bin will be returned in the first row, while for a non-aggregating query - in multiple initial rows. In
both cases, these rows have all column values NULL
(*
for text-typed columns, customizable via
pg_diffix.text_label_for_suppress_bin
). Note that the suppress bin may itself be suppressed and not returned at all.
pg_diffix.text_label_for_suppress_bin
- The value to use for the text-typed grouping labels in the suppress bin row.
Default value is *
.
Aggregation settings
pg_diffix.outlier_count_min
- Default value is 1. Must not be greater than outlier_count_max
.
pg_diffix.outlier_count_max
- Default value is 2. Must not be smaller than outlier_count_min
.
pg_diffix.top_count_min
- Default value is 4. Must not be greater than top_count_max
.
pg_diffix.top_count_max
- Default value is 6. Must not be smaller than top_count_min
.
NOTE The outlier interval (outlier_count_min, outlier_count_max)
must not be wider than the (top_count_min, top_count_max)
interval.