This is a pre-release version of the extension and is not intended for general use yet. It may be unstable and documentation is limited. If you have any questions, please contact us at email@example.com.
This document provides detailed information about the configuration, behavior and recommended usage of Diffix for PostgreSQL.
What is Diffix?
Diffix is a bundled set of mechanisms for anonymizing structured data. It was jointly developed by Aircloak GmbH and the Max Planck Institute for Software Systems. Diffix exploits mechanisms that have been in use by national statistics offices for decades: aggregation, generalization, noise, suppression, and swapping. It automatically applies these mechanisms as needed on a query-by-query basis to minimize noise while ensuring strong anonymity.
What is Diffix for PostgreSQL?
Diffix for PostgreSQL is an implementation of the Diffix mechanism for PostgreSQL databases. It is provided as an extension module for PostgreSQL version 13 or higher.
Configuring the extension
System behaviour can be configured using a combination of custom variables and security labels.
Labeling database objects
The module acts as a security provider and allows the marking of database objects with various anonymization labels, which associate additional metadata, needed for anonymization, with existing objects. For more details about security labels, see the official documentation page.
Only superusers can set anonymization labels.
To remove an anonymization label from an object, set it to
SELECT * FROM diffix.show_labels(); to display the current labels in use by the extension.
Tables can be labeled as
personal. Direct access is allowed to public data even for restricted users.
Unlabeled tables can not be queried (unless
treat_unmarked_tables_as_public is set to
Anonymization ID (AID) columns for a personal table have to be marked with the anonymization label
A personal table can have one or more AID columns.
In order to label a table as
personal and AID columns, use the
diffix.mark_personal(namespace, table_name, aid_columns...) procedure, for example:
CALL diffix.mark_personal('public', 'my_table', 'id', 'last_name');
In order to label a table as
public, use the
diffix.mark_public(namespace, table_name) procedure.
Regular users can be marked with the anoymization labels
The value of the custom variable
pg_diffix.default_access_level determines the access level for unlabeled regular users.
SECURITY LABEL FOR pg_diffix ON ROLE analyst IS 'anonymized_trusted';
Restricted features and extensions
At access levels other than
direct, various data and features built into PostgreSQL are restricted. Among others:
- Issue utility statements like
ALTER TABLE, beside a few allowlisted ones, are not allowed.
- Some of the data in
pg_user_functionsis not accessible.
- Selected subset of less frequently used PostgreSQL query features like
- Inheritance involving a personal table is not allowed.
- Some of the output of
EXPLAINfor queries involving a personal table is censored.
NOTE If any of the currently blocked features is necessary for your use case, open an issue and let us know.
Row level security (RLS) can be enabled and used on personal tables. It is advised that the active policies are vetted from the point of view of anonymity.
It is also strongly advised to vet any other extensions which are enabled alongside
as well as any user-defined functions and aggregate functions.
The module exposes a number of custom variables under the
Superusers can change these variables at runtime for their own session,
while regular users only have read access to them (with few notable exceptions).
To use different values for all future sessions, they have to be set in the configuration file.
SELECT * FROM diffix.show_settings(); to display the current settings of the extension.
If the result is empty, make sure
pg_diffix is loaded.
Data access settings
pg_diffix.default_access_level - Determines the access level for unlabeled users. Default value is
pg_diffix.session_access_level - Sets the access level for the current session. It can never be higher than the access
level for the current user. Can be changed by all users. Defaults to maximum access level allowed.
pg_diffix.treat_unmarked_tables_as_public - Controls whether unmarked tables are readable and treated as public data.
Default value is
pg_diffix.noise_layer_sd - Standard deviation for each noise layer added to aggregates. Default value is 1.0.
Low count filter settings
pg_diffix.low_count_min_threshold - The lower bound for the number of distinct AID values that must be present in a
bucket for it to pass the low count filter. Default value is 2.
pg_diffix.low_count_mean_gap - The number of standard deviations between the lower bound and the mean of the
low count filter threshold. Default value is 2.0.
pg_diffix.low_count_layer_sd - The standard deviation for each noise layer used when calculating the low count filter
threshold. Default value is 1.0.
pg_diffix.compute_suppress_bin - If
True, some rows in the returned result might belong to the suppress bin,
combining data of all the suppressed (rejected by the low count filter) protected entities. For an aggregating query,
the suppress bin will be returned in the first row, while for a non-aggregating query - in multiple initial rows. In
both cases, these rows have all column values
* for text-typed columns, customizable via
pg_diffix.text_label_for_suppress_bin). Note that the suppress bin may itself be suppressed and not returned at all.
pg_diffix.text_label_for_suppress_bin - The value to use for the text-typed grouping labels in the suppress bin row.
Default value is
pg_diffix.outlier_count_min - Default value is 1. Must not be greater than
pg_diffix.outlier_count_max - Default value is 2. Must not be smaller than
pg_diffix.top_count_min - Default value is 4. Must not be greater than
pg_diffix.top_count_max - Default value is 6. Must not be smaller than
NOTE The outlier interval
(outlier_count_min, outlier_count_max) must not be wider than the
(top_count_min, top_count_max) interval.