`pg_anon` is an efficient tool for anonymizing PostgreSQL data, designed with IT companies in mind. Such companies often store "sensitive" data that includes both commercial secrets and personal user information such as contact numbers, passport details, etc.

The tool comes in handy when database contents must be transferred from the production environment to other environments for performance testing or functionality debugging during development. With `pg_anon`, no sensitive data is exposed, preventing potential data leaks.
`pg_anon` works in several modes:

- `init`: Creates the `anon_funcs` schema with anonymization functions.
- `create-dict`: Scans the DB data and creates a meta-dictionary with an anonymization profile.
- `dump`: Creates a database structure dump using the Postgres `pg_dump` tool, and data dumps using `COPY ...` queries with anonymization functions. The data dump step saves data locally in `*.bin.gz` format. During this step, the data is anonymized on the database side by `anon_funcs`.
- `restore`: Restores the database structure using the Postgres `pg_restore` tool and restores the data from the dump to the target DB. `restore` mode can also restore the database structure and the data separately.
`pg_anon` is written in Python 3 and requires the third-party libraries listed in `requirements.txt`.

It uses the following tools and technologies:

- The Postgres `pg_dump` tool for dumping the database structure.
- The Postgres `pg_restore` tool for restoring the database structure.
- Postgres functions for the anonymization process.

The tool supports Python 3.11 and higher. The code is hosted in the pg_anon repository on GitHub: https://github.com/TantorLabs/pg_anon
The installation process differs slightly depending on your operating system.
On macOS:

- Install Python 3 if it isn't installed:
  - Install Homebrew, then run: `brew install python@3.11`
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `python3 -m venv venv`
  - Activate the virtual environment: `source venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
On Ubuntu and RedHat/CentOS:

- Install Python 3 if it isn't installed: `sudo apt-get install python3.11` (Ubuntu) or `sudo yum install python311` (RedHat/CentOS)
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `python3 -m venv venv`
  - Activate the virtual environment: `source venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
On Windows:

- Install Python 3 if it isn't installed: download it from the official Python website
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `py -m venv venv`
  - Activate the virtual environment: `.\venv\Scripts\activate`
- Install the dependencies: `pip install -r requirements.txt`
To test `pg_anon`, you need a local database. This section covers installing PostgreSQL and running the test suite.

To facilitate testing, here is how to set up PostgreSQL on Ubuntu:
1. Add the repository configuration:

   ```
   echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt focal-pgdg main" >> /etc/apt/sources.list.d/pgdg.list
   wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
   ```

2. Update the package lists and install PostgreSQL:

   ```
   apt -y update
   apt -y install postgresql-15 postgresql-client-15
   ```

3. Allow connections to the PostgreSQL server:

   ```
   sed -i '/listen_addresses/s/^#//g' /etc/postgresql/15/main/postgresql.conf
   sed -ie "s/^listen_addresses.*/listen_addresses = '127.0.0.1'/" /etc/postgresql/15/main/postgresql.conf
   ```

4. Restart the PostgreSQL instance for the changes to take effect:

   ```
   pg_ctlcluster 15 main restart
   ```

5. Create a test user with superuser rights to allow running COPY commands:

   ```
   psql -c "CREATE USER anon_test_user WITH PASSWORD 'mYy5RexGsZ' SUPERUSER;" -U postgres
   ```
To validate that your setup is functioning correctly, run the unit tests:

```
python test/full_test.py -v
```
Upon successful execution, the output should resemble the following:

```
Ran N tests in ...
OK
```
If all tests pass, the application is ready to use.
To run a specific test case, use the following pattern:

```
python3 test/full_test.py -v PGAnonValidateUnitTest
```
Test database connection settings can be overridden using environment variables:

```
set TEST_DB_USER=anon_test_user
set TEST_DB_USER_PASSWORD=mYy5RexGsZ
set TEST_DB_HOST=127.0.0.1
set TEST_DB_PORT=5432
set TEST_SOURCE_DB=test_source_db
set TEST_TARGET_DB=test_target_db
```
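For illustration, a test harness could pick these variables up with fallbacks to the defaults shown above. This helper is a sketch and is not pg_anon's actual test code:

```python
import os

# Read the test connection settings, falling back to the documented
# defaults when a variable is not set. Illustrative only.
def test_db_config() -> dict:
    return {
        "user": os.environ.get("TEST_DB_USER", "anon_test_user"),
        "password": os.environ.get("TEST_DB_USER_PASSWORD", "mYy5RexGsZ"),
        "host": os.environ.get("TEST_DB_HOST", "127.0.0.1"),
        "port": int(os.environ.get("TEST_DB_PORT", "5432")),
        "source_db": os.environ.get("TEST_SOURCE_DB", "test_source_db"),
        "target_db": os.environ.get("TEST_TARGET_DB", "test_target_db"),
    }

print(test_db_config()["host"])
```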
To display the help message for the CLI, run:

```
python pg_anon.py --help
```
Common pg_anon options:

| Option | Description |
|---|---|
| `--debug` | Enable debug mode (default false) |
| `--verbose` | Configure verbose mode: [info, debug, error] (default info) |
| `--threads` | Number of threads for I/O operations (default 4) |
| `--processes` | Number of processes for multiprocessing operations (default 4) |
Database configuration options:

| Option | Description |
|---|---|
| `--db-host` | Specifies your database host |
| `--db-port` | Specifies your database port |
| `--db-name` | Specifies your database name |
| `--db-user` | Specifies your database user |
| `--db-user-password` | Specifies your database user password |
| `--db-passfile` | Path to the file containing the password to be used when connecting to the database |
| `--db-ssl-key-file` | Path to the client SSL key file for secure connections to the database |
| `--db-ssl-cert-file` | Path to the client SSL certificate file for secure connections to the database |
| `--db-ssl-ca-file` | Path to the root SSL certificate file, used to verify the server's certificate |
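As a sketch, these options map naturally onto a libpq-style connection string, with `--db-passfile` acting like a simplified `~/.pgpass` lookup. The helper below is an illustration of that mapping, not pg_anon's actual implementation:

```python
# Assemble a libpq-style DSN from --db-* style parameters.
# The passfile lookup is deliberately simplified: real .pgpass matching
# also considers port and database name.
def build_dsn(host, port, dbname, user, password=None, passfile=None):
    parts = [f"host={host}", f"port={port}", f"dbname={dbname}", f"user={user}"]
    if password is None and passfile:
        # .pgpass lines look like host:port:database:user:password
        with open(passfile) as f:
            for line in f:
                h, p, d, u, pw = line.strip().split(":")
                if h in (host, "*") and u in (user, "*"):
                    password = pw
                    break
    if password is not None:
        parts.append(f"password={password}")
    return " ".join(parts)

print(build_dsn("127.0.0.1", 5432, "test_source_db", "postgres", password="postgres"))
```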
To initialize the `anon_funcs` schema, run pg_anon in `init` mode:

```
python pg_anon.py --mode init \
  --db-user postgres \
  --db-user-password postgres \
  --db-name test_source_db
```
Prerequisites:

- A generated or manually created dictionary `*.py` file with an anonymization profile
- The `anon_funcs` schema created in `init` mode
To create the dictionary:

```
python pg_anon.py --mode create-dict \
  --db-user postgres \
  --db-user-password postgres \
  --db-name test_source_db \
  --dict-file test_meta_dict.py \
  --output-dict-file test_dict_output.py \
  --processes 2
```
| Option | Description |
|---|---|
| `--dict-file` | Specifies the dictionary file with data about anonymization |
| `--output-dict-file` | The output dictionary will be saved to this file |
| `--scan-mode` | Defines whether to scan all data or only part of it: ["full", "partial"] (default "partial") |
| `--scan-partial-rows` | In `--scan-mode partial`, defines the number of rows to scan (default 10000) |
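To give an intuition for partial scanning, the sketch below inspects only the first N rows per table and flags fields whose values match simple sensitive-data patterns. The patterns, rule names, and row cap are assumptions for illustration; pg_anon's real create-dict heuristics are richer:

```python
import re

# Illustrative sensitive-data patterns; not pg_anon's actual rules.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def scan_partial(rows, limit=10000):
    """Flag fields whose sampled values look sensitive."""
    flagged = {}
    for row in rows[:limit]:          # partial mode: cap the rows scanned
        for field, value in row.items():
            for rule, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    flagged.setdefault(field, rule)
    return flagged

rows = [
    {"id": "1", "contact": "alice@example.com", "note": "ok"},
    {"id": "2", "contact": "+7 912 345-67-89", "note": "ok"},
]
print(scan_partial(rows))  # the "contact" field is flagged
```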
Prerequisites:

- The `anon_funcs` schema with anonymization functions must exist. See the `--mode init` example.
- An output dictionary file with meta-information about database fields and their anonymization rules must exist. See the `--mode create-dict` example.
- To create the structure dump and the data dump:

  ```
  python pg_anon.py --mode dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --dict-file test_dict_output.py
  ```

- To create only the structure dump:

  ```
  python pg_anon.py --mode sync-struct-dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --output-dir test_sync_struct_dump \
    --dict-file test_dict_output.py
  ```

- To create only the data dump:

  ```
  python pg_anon.py --mode sync-data-dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --output-dir test_sync_data_dump \
    --dict-file test_dict_output.py
  ```

  This mode can be useful for scheduling database synchronization, for example with `cron`.
Possible options in `--mode dump`:

| Option | Description |
|---|---|
| `--validate-dict` | Validate the dictionary: show the tables and run the SQL queries without exporting data (default false) |
| `--validate-full` | Same as `--validate-dict`, plus data export with a limit (default false) |
| `--clear-output-dir` | In dump mode, clears the output directory of files from a previous dump (default true) |
| `--pg-dump` | Path to the pg_dump Postgres tool (default `/usr/bin/pg_dump`) |
| `--output-dir` | Output directory for dump files (default "") |
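To illustrate the shape of the dump output, the sketch below writes anonymized rows as gzip-compressed, COPY-style text. The md5 rule, file name, and tab-separated text format are assumptions for illustration; pg_anon anonymizes on the database side and stores data in its own `*.bin.gz` files:

```python
import gzip
import hashlib
import os
import tempfile

# Write rows to a gzip file, replacing values of anonymized fields
# with their md5 digest. Illustrative only.
def dump_table(rows, anonymize_fields, path):
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            values = [
                hashlib.md5(value.encode()).hexdigest()
                if field in anonymize_fields else value
                for field, value in row
            ]
            out.write("\t".join(values) + "\n")

path = os.path.join(tempfile.gettempdir(), "tbl_a.bin.gz")
dump_table([[("id", "1"), ("email", "alice@example.com")]], {"email"}, path)
with gzip.open(path, "rt", encoding="utf-8") as f:
    content = f.read()
print(content)
```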
- Each restore mode requires the output of the corresponding dump mode.
- Restore structure and data:

  This mode requires the dump output created in `--mode=dump`.

  ```
  python pg_anon.py --mode restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_dict_output \
    --verbose debug
  ```

- Restore structure only:

  This mode requires the dump output created in `--mode=sync-struct-dump`.

  ```
  python pg_anon.py --mode sync-struct-restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_sync_struct_dump \
    --verbose debug
  ```

- Restore data only:

  This mode requires the dump output created in `--mode=sync-data-dump`.

  ```
  python pg_anon.py --mode sync-data-restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_sync_data_dump \
    --verbose debug
  ```
Possible options in `--mode restore`:

| Option | Description |
|---|---|
| `--input-dir` | Input directory containing the dump files created in dump mode |
| `--disable-checks` | Disable checks of disk space and PostgreSQL version (default false) |
| `--seq-init-by-max-value` | Initialize sequences based on maximum column values. Otherwise, the sequences will be initialized based on the values of the source database |
| `--drop-custom-check-constr` | Drop all CHECK constraints containing user-defined procedures to avoid performance degradation at the data loading stage |
| `--pg-restore` | Path to the pg_restore Postgres tool |
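To make `--seq-init-by-max-value` concrete: after the data is loaded, each sequence is set from the current maximum of the column it feeds, rather than from the counter recorded in the source database. The SQL shape is standard PostgreSQL `setval`; the helper itself is an illustrative sketch, not pg_anon's code:

```python
# Generate the setval statement a restore step might run per sequence.
# is_called=true makes the next nextval() return max_value + 1.
def seq_init_sql(seq_name, table, column, max_value):
    return (
        f"SELECT setval('{seq_name}', {max_value}, true);"
        f"  -- next value after max({table}.{column})"
    )

print(seq_init_sql("public.users_id_seq", "public.users", "id", 42))
```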
If you have a table that contains objects and fields for anonymization, you can use this SQL query to generate a dictionary in JSON format:

```sql
select
    jsonb_pretty(
        json_agg(json_build_object('schema', T.schm, 'table', T.tbl, 'fields', flds))::jsonb
    )
from (
    select
        T.schm,
        T.tbl,
        JSON_OBJECT_AGG(fld, mrule) as flds
    from (
        select 'schm_1' as schm, 'tbl_a' as tbl, 'fld_1' as fld, 'md5(fld_1)' as mrule
        union all
        select 'schm_1', 'tbl_a', 'fld_2', 'md5(fld_2)'
        union all
        select 'schm_1', 'tbl_b', 'fld_1', 'md5(fld_1)'
        union all
        select 'schm_1', 'tbl_b', 'fld_2', 'md5(fld_2)'
    ) T
    group by schm, tbl
) T
```
>>

```json
[
    {
        "table": "tbl_b",
        "fields": {
            "fld_1": "md5(fld_1)",
            "fld_2": "md5(fld_2)"
        },
        "schema": "schm_1"
    },
    {
        "table": "tbl_a",
        "fields": {
            "fld_1": "md5(fld_1)",
            "fld_2": "md5(fld_2)"
        },
        "schema": "schm_1"
    }
]
```
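The same dictionary structure can be assembled directly in Python; the field-to-rule mapping below mirrors the SQL example above:

```python
import json

# Map (schema, table) pairs to field -> anonymization-rule dictionaries,
# then flatten into the profile structure shown above.
tables = {
    ("schm_1", "tbl_a"): {"fld_1": "md5(fld_1)", "fld_2": "md5(fld_2)"},
    ("schm_1", "tbl_b"): {"fld_1": "md5(fld_1)", "fld_2": "md5(fld_2)"},
}
profile = [
    {"schema": schm, "table": tbl, "fields": flds}
    for (schm, tbl), flds in tables.items()
]
print(json.dumps(profile, indent=4))
```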
Dictionary values can contain special and non-ASCII characters; Python's `json` module escapes them on dump and restores them on load:

```python
import json

j = {"k": "_TBL.$complex#имя;@&* a'2"}
json.dumps(j)
```

>>

```
'{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'
```

```python
s = '{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'
u = json.loads(s)
print(u['k'])
```

>>

```
_TBL.$complex#имя;@&* a'2
```
pg_anon uses the Poetry dependency management tool for managing dependencies and building packages.

To add a new dependency, install Poetry and run:

`poetry add <package_name>`

To lock the dependencies, run:

`poetry lock --no-update`

Additionally, export the latest packages to `requirements.txt` using the poetry export plugin:

`poetry export -f requirements.txt --output requirements.txt`

To build the package, run:

`poetry build`

Alternatively, the package can be built using setuptools:

`python3 setup.py sdist`
- `--format`: COPY data format; can be overridden by `--copy-options`. Selects the data format to be read or written: text, csv, or binary.
- `--copy-options`: Options for the COPY command, such as "with binary".
- Support restoring the structure after a full dump: currently the database structure can be restored only after a struct dump, so you cannot restore the structure after a full dump.
- Simplify commands and options to improve the user experience.