Contents
Analyst guide
This document describes features and restrictions of pg_diffix
for users with anonymized access to a database.
The banking notebook provides a walkthough with examples and explanations of various
mechanisms that Diffix Elm uses to protect personal data.
Table of Contents
- Analyst guide
- Access levels
- Anonymized queries
- Unrestricted queries
- Post processing
- Utility statements
- Suppress bin
- Supported functions
Access levels
Users can have one of the following access levels to a database:
direct
- Direct (non-anonymized) access to data. Restrictions listed in this document do not apply in direct mode.anonymized_trusted
- Anonymized access to data. Prevents accidental release of personal data.anonymized_untrusted
- Anonymized access to data. Prevents intentional release of personal data.
Use SELECT diffix.access_level()
to see the current access level.
Anonymized queries
In anonymized access level, queries targeting personal tables are restricted to a limited subset of SQL. A personal table is a relation that contains data of individuals or other protected entities. Administrators identify and mark such tables during configuration.
When selecting data from personal tables, the following queries are allowed.
Queries with grouping
Grouping queries have the following form:
SELECT col1, col2, ..., count(...)
FROM personal_table
GROUP BY col1, col2, ...
Zero or more table columns col1
, col2
, ... may be specified.
Numeric and string columns may optionally be generalized.
count()
is any of the supported count aggregate variants.
Any number of count aggregates may be specified (including none).
Example:
SELECT city, year_of_birth, count(*)
FROM customers
GROUP BY city, year_of_birth
Example:
SELECT count(*), count(DISTINCT city)
FROM customers
Queries with implicit grouping
The GROUP BY
clause may be omitted:
SELECT col1, col2, ..., colX
FROM personal_table
In this case, the results are grouped implicitly by the selected columns.
Bins with insufficient individuals are filtered out, and the remaining
bins are repeated for (anonymized) count(*)
times.
Semantically this is equivalent to:
SELECT col1, col2, ..., colX, generate_series(1, anon_count)
FROM (
SELECT col1, col2, ..., colX, count(*) AS anon_count
FROM personal_table
GROUP BY col1, col2, ..., colX
) x
Example:
SELECT city, year_of_birth
FROM customers
Unrestricted queries
Queries against personal tables are limited to the SQL subset described in the anonymized queries section. However, queries that do not target personal tables are unrestricted, meaning any SQL can be executed.
If pg_diffix.treat_unmarked_tables_as_public
is set to false
, then tables not marked as personal
or public
are not queryable in any way.
Use diffix.show_labels() to see which tables are marked as personal
or public
.
Post processing
Any SQL may be executed on the output of anonymizing subqueries as their output is no longer considered personal.
Example: The following query selects the average number of individuals in each city.
SELECT avg(x.num_individuals)
FROM (
SELECT city, count(*) as num_individuals
FROM customers
GROUP BY city
) x
Projections of grouping expressions in SELECT
and the HAVING
clause of anonymizing
queries are also considered post processing and are accepted.
Example: The following query post processes the grouping labels and filters the output bins.
SELECT 'City: ' || upper(city), count(*)
FROM customers
GROUP BY city
HAVING count(*) > 10
Utility statements
Other SQL statements
Most utility statements are disallowed in anonymized access level. Allowed SQL statements include:
EXPLAIN
- explain is partially allowed. OptionsCOSTS
andANALYZE
are rejected.SET
- a subset of parameters can be changed by non-superusers, even in anonymized mode. Note that mostpg_diffix
parameters require superuser access.
Any statement that alters the database (INSERT
, DELETE
, UPDATE
, DROP
, ...) or causes any other persistent state change is forbidden.
COPY
statements and other indirect methods of reading data from the database are forbidden.
diffix.show_labels()
Run SELECT * FROM diffix.show_labels()
to see the complete list of labels on database objects.
diffix.access_level()
Run SELECT diffix.access_level()
to see the current access level of the session.
Possible values are direct
, anonymized_trusted
, and anonymized_untrusted
.
Suppress bin
When performing grouping in queries that target personal tables, bins that pertain to too few individuals are suppressed.
Diffix combines the counts of all suppressed rows to a special suppress bin.
The suppress bin is the first row of the anonymized query result, unless sorted or post processed by other means.
The suppress bin may itself be suppressed.
All grouping columns of the suppress bin are set to *
for text-typed columns and NULL
for other types.
The configuration parameter pg_diffix.compute_suppress_bin
controls whether the suppress bin is computed and emitted.
The configuration parameter pg_diffix.text_label_for_suppress_bin
specifies the value to use for text-typed grouping labels.
Default value is *
.
The aggregate diffix.is_suppress_bin(*)
returns true
for the suppress bin, and false for any other bin.
This can be used to identify the suppress bin if NULL
values are ambiguous for a grouping label.
Supported functions
Count
The following versions of the count aggregate are supported:
count(*)
- count all rows.count(col)
- counts non-null occurrences of the given column.count(distinct col)
- counts distinct values of the given column.
Results of these aggregates are anonymized by applying noise as described in the specification.
Numeric generalization functions
diffix.floor_by(col, K)
Rounds column to width K
and aligns to the lower edge of the interval. Equivalent to floor(col / K) * K
.
Restrictions: In untrusted mode, K
is restricted to a money style number:
1, 2, or 5 preceeded by or followed by zeros ⟨... 0.1, 0.2, 0.5, 1, 2, 5, 10, ...⟩.
diffix.round_by(col, K)
Rounds column to width K
and aligns to the closest edge of the interval. Equivalent to round(col / K) * K
.
Restrictions: Not allowed in untrusted mode.
diffix.ceil_by(col, K)
Rounds column to width K
and aligns to the upper edge of the interval. Equivalent to ceil(col / K) * K
.
Restrictions: Not allowed in untrusted mode.
width_bucket(operand, low, high, count)`
Default Postgres function that returns the bucket number of the column value in an equal-width histogram.
Restrictions: Not allowed in untrusted mode.
String generalization functions
substring(text_column, start, count)
Default Postgres function.
Restrictions: In untrusted mode, only start = 1
is allowed.
Type casts
When selecting or generalizing columns of personal tables, the following type conversions are allowed:
- All numeric types may be converted to other numeric types.
Numeric types are:
smallint
,integer
,bigint
,float
,double
, andnumeric
(decimal
). When converting from a real type to an integer type, implicit rounding occurs (away from zero). Such casts count as generalizing expressions and no further generalizations are allowed. - Date/time types may be converted to text.
Date/time types are:
date
,time
,timestamp
,timetz
, andtimestamptz
. The DateStyle configuration parameter determines the output format of the string.
Example:
SELECT substring(date_of_birth::text, 1, 4) AS year, count(*)
FROM customers
GROUP BY 1
Utility functions
diffix.is_suppress_bin(*)
Aggregate that returns true
only for the suppress bin, false
otherwise.