Contents
Analyst guide
This document describes features and restrictions of pg_diffix
for users with anonymized access to a database.
The banking notebook provides a walkthrough with examples and explanations of various
mechanisms that Diffix uses to protect personal data.
Table of Contents
- Access levels
- Anonymizing queries
- Unrestricted queries
- Post processing
- Utility statements
- Suppress bin
- Supported functions
- Aggregates
- diffix.count_histogram(aid, bin_size)
- Numeric generalization functions
- diffix.floor_by(col, K)
- diffix.round_by(col, K)
- diffix.ceil_by(col, K)
- width_bucket(operand, low, high, count)`
- String generalization functions
- substring(text_column, start, count)
- Type casts
- Utility functions
- diffix.is_suppress_bin(*)
- diffix.unnest_histogram(histogram)
Access levels
Users can have one of the following access levels to a database:
direct
- Direct (non-anonymized) access to data. Restrictions listed in this document do not apply in direct mode.anonymized_trusted
- Anonymized access to data. Prevents accidental release of personal data.anonymized_untrusted
- Anonymized access to data. Prevents intentional release of personal data.
Use SELECT diffix.access_level()
to see the current access level.
Anonymizing queries
In anonymized access level, queries targeting personal tables are restricted to a limited subset of SQL. A personal table is a relation that contains data of individuals or other protected entities. Administrators identify and mark such tables during configuration.
When selecting data from personal tables, the following SQL features are allowed:
Explicit grouping
Grouping queries have the following form:
SELECT col1, col2, ..., count(...)
FROM personal_table
GROUP BY col1, col2, ...
Zero or more table columns col1
, col2
, ... may be specified.
Numeric and string columns may optionally be generalized.
count(...)
is any of the supported aggregate variants.
Any number of aggregates may be specified (including none).
Example:
SELECT city, year_of_birth, count(*), diffix.count_noise(*)
FROM customers
GROUP BY city, year_of_birth
Example:
SELECT count(*), count(DISTINCT city)
FROM customers
Implicit grouping
The GROUP BY
clause may be omitted:
SELECT col1, col2, ..., colX
FROM personal_table
In this case, the results are grouped implicitly by the selected columns.
Bins with insufficient individuals are filtered out, and the remaining
bins are repeated for (anonymized) count(*)
times.
Semantically this is equivalent to:
SELECT col1, col2, ..., colX, generate_series(1, anon_count)
FROM (
SELECT col1, col2, ..., colX, count(*) AS anon_count
FROM personal_table
GROUP BY col1, col2, ..., colX
) x
Example:
SELECT city, year_of_birth
FROM customers
Input filtering
The WHERE
clause may specify one or more simple equality conditions which all have to match simultaneously for a row to be included in the analysis.
A simple equality condition consists of an equality between a raw or generalized column, on the left side, and a constant value, on the right side.
Combining conditions using OR
or specifying oher types of conditions is not permitted.
Example:
SELECT count(*)
FROM customers
WHERE city = 'Berlin' AND gender = 'F'
Example:
SELECT city, count(*)
FROM customers
WHERE substring(date_of_birth, 1, 4) = 2000
GROUP BY city
AID columns are not allowed in filtering conditions. Administrators can additionally restrict the usage of certain columns for data filtering by untrusted users.
Unrestricted queries
Queries against personal tables are limited to the SQL subset described in the anonymized queries section. However, queries that do not target personal tables are unrestricted, meaning any SQL can be executed.
If pg_diffix.treat_unmarked_tables_as_public
is set to false
, then tables not marked as personal
or public
are not queryable in any way.
Use diffix.show_labels() to see which tables are marked as personal
or public
.
Post processing
Any SQL may be executed on the output of anonymizing subqueries as their output is no longer considered personal.
Example: The following query selects the average number of individuals in each city.
SELECT avg(x.num_individuals)
FROM (
SELECT city, count(*) as num_individuals
FROM customers
GROUP BY city
) x
Projections of grouping expressions in SELECT
and the HAVING
clause of anonymizing
queries are also considered post processing and are accepted.
Example: The following query post processes the grouping labels and filters the output bins.
SELECT 'City: ' || upper(city), count(*)
FROM customers
GROUP BY city
HAVING count(*) > 10
Utility statements
Other SQL statements
Most utility statements are disallowed in anonymized access level. Allowed SQL statements include:
EXPLAIN
- explain is partially allowed. OptionsCOSTS
andANALYZE
are rejected.SET
- a subset of parameters can be changed by non-superusers, even in anonymized mode. Note that mostpg_diffix
parameters require superuser access.
Any statement that alters the database (INSERT
, DELETE
, UPDATE
, DROP
, ...) or causes any other persistent state change is forbidden.
COPY
statements and other indirect methods of reading data from the database are forbidden.
diffix.show_labels()
Run SELECT * FROM diffix.show_labels()
to see the complete list of labels on database objects.
diffix.access_level()
Run SELECT diffix.access_level()
to see the current access level of the session.
Possible values are direct
, anonymized_trusted
, and anonymized_untrusted
.
Suppress bin
When performing grouping in queries that target personal tables, bins that pertain to too few individuals are suppressed.
Diffix combines the counts of all suppressed rows to a special suppress bin.
The suppress bin is the first row of the anonymized query result, unless sorted or post processed by other means.
The suppress bin may itself be suppressed.
All grouping columns of the suppress bin are set to *
for text-typed columns and NULL
for other types.
The configuration parameter pg_diffix.compute_suppress_bin
controls whether the suppress bin is computed and emitted.
The configuration parameter pg_diffix.text_label_for_suppress_bin
specifies the value to use for text-typed grouping labels.
Default value is *
.
The aggregate diffix.is_suppress_bin(*)
returns true
for the suppress bin, and false for any other bin.
This can be used to identify the suppress bin if NULL
values are ambiguous for a grouping label.
Supported functions
Aggregates
The following versions of aggregates are supported:
count(*)
- count all rows.count(col)
- counts non-null occurrences of the given column.count(distinct col)
- counts distinct values of the given column.sum(col)
- sums values in the given column.avg(col)
- calculates the average of the given column.diffix.count_histogram(aid, bin_size=1)
- computes a histogram that describes the distribution of rows among entities. See below for details.
Results of these aggregates are anonymized by applying noise as described in the specification.
Each of the count(...)
, sum(...)
, avg(...)
has an accompanying aggregate,
which returns the approximate magnitude of noise added during anonymization (in terms of its standard deviation).
These are: diffix.count_noise(...)
, diffix.sum_noise(...)
, diffix.avg_noise(...)
, respectively.
diffix.count_histogram(aid, bin_size)
Returns a 2-dimensional array of shape bigint[][2]
, where each entry is a pair of [row_count, num_entities]
.
The row_count
represents the number of rows contributed by num_entities
distinct protected entities.
Example:
``` SELECT diffix.count_histogram(account) FROM transactions;
count_histogram
{{NULL,7},{1,15},{2,13},{4,6}} (1 row) ```
The result of the above query can be interpreted as:
15 accounts have made a single transaction (1 row in result bucket), 13 accounts have made 2 transactions (2 rows),
6 accounts have made 4 transactions, and 7 accounts have made some other number of transactions (identified by the NULL
count).
The reported num_entities
is a noisy value, but not the row_count
itself. Bins with insufficient num_entities
are merged to
a suppress bin of shape {NULL, num_entities}
where num_entities
is also noisy. The suppress bin may itself be suppressed.
The optional bin_size
parameter allows generalizing the bins' row_count
to minimize suppression.
It acts identically to the diffix.floor_by()
function.
The histogram array can be unwrapped to a set of pairs by using diffix.unnest_histogram().
Restrictions: The aid
parameter must be a reference to a column tagged as an AID (identifier of a protected entity).
In untrusted mode, bin_size
is restricted to a money style number:
1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.
Numeric generalization functions
diffix.floor_by(col, K)
Rounds column to width K
and aligns to the lower edge of the interval. Equivalent to floor(col / K) * K
.
Restrictions: In untrusted mode, K
is restricted to a money style number:
1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.
diffix.round_by(col, K)
Rounds column to width K
and aligns to the closest edge of the interval. Equivalent to round(col / K) * K
.
Restrictions: Not allowed in untrusted mode.
diffix.ceil_by(col, K)
Rounds column to width K
and aligns to the upper edge of the interval. Equivalent to ceil(col / K) * K
.
Restrictions: Not allowed in untrusted mode.
width_bucket(operand, low, high, count)`
Default Postgres function that returns the bucket number of the column value in an equal-width histogram.
Restrictions: Not allowed in untrusted mode.
String generalization functions
substring(text_column, start, count)
Default Postgres function.
Restrictions: In untrusted mode, only start = 1
is allowed.
Type casts
When selecting or generalizing columns of personal tables, the following type conversions are allowed:
- All numeric types may be converted to other numeric types.
Numeric types are:
smallint
,integer
,bigint
,float
,double
, andnumeric
(decimal
). When converting from a real type to an integer type, implicit rounding occurs (away from zero). Such casts count as generalizing expressions and no further generalizations are allowed. - Date/time types may be converted to text.
Date/time types are:
date
,time
,timestamp
,timetz
, andtimestamptz
. The DateStyle configuration parameter determines the output format of the string.
Example:
SELECT substring(date_of_birth::text, 1, 4) AS year, count(*)
FROM customers
GROUP BY 1
Utility functions
diffix.is_suppress_bin(*)
Aggregate that returns true
only for the suppress bin, false
otherwise.
diffix.unnest_histogram(histogram)
Unnests a 2-dimensional array into a result set of 1-dimensional arrays.
Example:
``` SELECT diffix.unnest_histogram(diffix.count_histogram(account)) AS bins FROM transactions;
bins
{NULL,7} {1,15} {2,13} {4,6} (4 rows) ```