`pg_anon` is an efficient tool for anonymizing PostgreSQL data, designed with IT companies in mind. Such companies often store "sensitive" data that includes both commercial secrets and personal user information such as contact numbers, passport details, etc.

The tool comes in handy when database contents must be transferred from the production environment to other environments for performance testing or functionality debugging during development. With `pg_anon`, no sensitive data is exposed, preventing potential data leaks.
`pg_anon` works in several modes:

- `init`: Creates the `anon_funcs` schema with anonymization functions.
- `create-dict`: Scans the DB data and creates a meta-dictionary with an anonymization profile.
- `dump`: Creates a database structure dump using the Postgres `pg_dump` tool, and data dumps using `COPY ...` queries with anonymization functions. The data dump step saves data locally in `*.bin.gz` format. During this step, the data is anonymized on the database side by `anon_funcs`.
- `restore`: Restores the database structure using the Postgres `pg_restore` tool and restores the data from the dump to the target DB. `restore` mode can also restore the database structure and the data separately.
`pg_anon` is written in Python 3 and requires the third-party libraries listed in `requirements.txt`.

It uses the following tools and technologies:

- The Postgres `pg_dump` tool for dumping the database structure.
- The Postgres `pg_restore` tool for restoring the database structure.
- Postgres functions for the anonymization process.

The tool supports Python 3.11 and higher. The code is hosted in the pg_anon repository on GitHub: https://github.com/TantorLabs/pg_anon
The installation process differs slightly depending on your operating system.
On macOS:

- Install Python 3 if it isn't installed:
  - Install Homebrew, then run: `brew install python@3.11`
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `python3 -m venv venv`
  - Activate the virtual environment: `source venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
On Ubuntu and RedHat/CentOS:

- Install Python 3 if it isn't installed: `sudo apt-get install python3.11` (Ubuntu) or `sudo yum install python311` (RedHat/CentOS)
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `python3 -m venv venv`
  - Activate the virtual environment: `source venv/bin/activate`
- Install the dependencies: `pip install -r requirements.txt`
On Windows:

- Install Python 3 if it isn't installed: download it from the official Python website
- Clone the repository: `git clone https://github.com/TantorLabs/pg_anon.git`
- Go to the project directory: `cd pg_anon`
- Set up a virtual environment:
  - Create the virtual environment: `py -m venv venv`
  - Activate the virtual environment: `.\venv\Scripts\activate`
- Install the dependencies: `pip install -r requirements.txt`
To test `pg_anon`, you need a local database. This section covers installing PostgreSQL and running the test suite.

To facilitate testing, here is how to set up PostgreSQL on Ubuntu:
1. Add the repository configuration:

   ```
   echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt focal-pgdg main" >> /etc/apt/sources.list.d/pgdg.list
   wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
   ```

2. Update the package lists and install PostgreSQL:

   ```
   apt -y update
   apt -y install postgresql-15 postgresql-client-15
   ```

3. Allow connections to the PostgreSQL server:

   ```
   sed -i '/listen_addresses/s/^#//g' /etc/postgresql/15/main/postgresql.conf
   sed -ie "s/^listen_addresses.*/listen_addresses = '127.0.0.1'/" /etc/postgresql/15/main/postgresql.conf
   ```

4. Restart the PostgreSQL instance for the changes to take effect:

   ```
   pg_ctlcluster 15 main restart
   ```

5. Create a test user with superuser rights to allow running COPY commands:

   ```
   psql -c "CREATE USER anon_test_user WITH PASSWORD 'mYy5RexGsZ' SUPERUSER;" -U postgres
   ```
To validate that your setup is functioning correctly, run the unit tests:

```
python test/full_test.py -v
```
Upon successful execution, the output should resemble the following:

```
Ran N tests in ...
OK
```
If all tests pass, the application is ready to use.
To run a specific test case, use the following pattern:

```
python3 test/full_test.py -v PGAnonValidateUnitTest
```
Test database connection settings can be overridden using environment variables:

```
set TEST_DB_USER=anon_test_user
set TEST_DB_USER_PASSWORD=mYy5RexGsZ
set TEST_DB_HOST=127.0.0.1
set TEST_DB_PORT=5432
set TEST_SOURCE_DB=test_source_db
set TEST_TARGET_DB=test_target_db
```
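For illustration, a test harness could pick these variables up with fallbacks to the defaults shown above. This helper is a sketch and is not pg_anon's actual test code:

```python
import os

# Read the test connection settings, falling back to the documented
# defaults when a variable is not set. Illustrative only.
def test_db_config() -> dict:
    return {
        "user": os.environ.get("TEST_DB_USER", "anon_test_user"),
        "password": os.environ.get("TEST_DB_USER_PASSWORD", "mYy5RexGsZ"),
        "host": os.environ.get("TEST_DB_HOST", "127.0.0.1"),
        "port": int(os.environ.get("TEST_DB_PORT", "5432")),
        "source_db": os.environ.get("TEST_SOURCE_DB", "test_source_db"),
        "target_db": os.environ.get("TEST_TARGET_DB", "test_target_db"),
    }

print(test_db_config()["host"])
```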
To display the help message for the CLI, run:

```
python pg_anon.py --help
```
Common pg_anon options:

| Option | Description |
|---|---|
| `--debug` | Enable debug mode (default false) |
| `--verbose` | Configure verbose mode: [info, debug, error] (default info) |
| `--threads` | Number of threads for I/O operations (default 4) |
| `--processes` | Number of processes for multiprocessing operations (default 4) |
Database configuration options:

| Option | Description |
|---|---|
| `--db-host` | Specifies your database host |
| `--db-port` | Specifies your database port |
| `--db-name` | Specifies your database name |
| `--db-user` | Specifies your database user |
| `--db-user-password` | Specifies your database user password |
| `--db-passfile` | Path to the file containing the password to be used when connecting to the database |
| `--db-ssl-key-file` | Path to the client SSL key file for secure connections to the database |
| `--db-ssl-cert-file` | Path to the client SSL certificate file for secure connections to the database |
| `--db-ssl-ca-file` | Path to the root SSL certificate file, used to verify the server's certificate |
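As a sketch, these options map naturally onto a libpq-style connection string, with `--db-passfile` acting like a simplified `~/.pgpass` lookup. The helper below is an illustration of that mapping, not pg_anon's actual implementation:

```python
# Assemble a libpq-style DSN from --db-* style parameters.
# The passfile lookup is deliberately simplified: real .pgpass matching
# also considers port and database name.
def build_dsn(host, port, dbname, user, password=None, passfile=None):
    parts = [f"host={host}", f"port={port}", f"dbname={dbname}", f"user={user}"]
    if password is None and passfile:
        # .pgpass lines look like host:port:database:user:password
        with open(passfile) as f:
            for line in f:
                h, p, d, u, pw = line.strip().split(":")
                if h in (host, "*") and u in (user, "*"):
                    password = pw
                    break
    if password is not None:
        parts.append(f"password={password}")
    return " ".join(parts)

print(build_dsn("127.0.0.1", 5432, "test_source_db", "postgres", password="postgres"))
```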
To initialize the `anon_funcs` schema, run pg_anon in `init` mode:

```
python pg_anon.py --mode init \
  --db-user postgres \
  --db-user-password postgres \
  --db-name test_source_db
```
Prerequisites:

- A generated or manually created dictionary `*.py` file with an anonymization profile
- The `anon_funcs` schema created in `init` mode
To create the dictionary:

```
python pg_anon.py --mode create-dict \
  --db-user postgres \
  --db-user-password postgres \
  --db-name test_source_db \
  --dict-file test_meta_dict.py \
  --output-dict-file test_dict_output.py \
  --processes 2
```
| Option | Description |
|---|---|
| `--dict-file` | Specifies the dictionary file with data about anonymization |
| `--output-dict-file` | The output dictionary will be saved to this file |
| `--scan-mode` | Defines whether to scan all data or only part of it: ["full", "partial"] (default "partial") |
| `--scan-partial-rows` | In `--scan-mode partial`, defines the number of rows to scan (default 10000) |
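To give an intuition for partial scanning, the sketch below inspects only the first N rows per table and flags fields whose values match simple sensitive-data patterns. The patterns, rule names, and row cap are assumptions for illustration; pg_anon's real create-dict heuristics are richer:

```python
import re

# Illustrative sensitive-data patterns; not pg_anon's actual rules.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}

def scan_partial(rows, limit=10000):
    """Flag fields whose sampled values look sensitive."""
    flagged = {}
    for row in rows[:limit]:          # partial mode: cap the rows scanned
        for field, value in row.items():
            for rule, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    flagged.setdefault(field, rule)
    return flagged

rows = [
    {"id": "1", "contact": "alice@example.com", "note": "ok"},
    {"id": "2", "contact": "+7 912 345-67-89", "note": "ok"},
]
print(scan_partial(rows))  # the "contact" field is flagged
```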
Prerequisites:

- The `anon_funcs` schema with anonymization functions must exist. See the `--mode init` example.
- An output dictionary file with meta-information about database fields and their anonymization rules must exist. See the `--mode create-dict` example.
- To create the structure dump and the data dump:

  ```
  python pg_anon.py --mode dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --dict-file test_dict_output.py
  ```

- To create only the structure dump:

  ```
  python pg_anon.py --mode sync-struct-dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --output-dir test_sync_struct_dump \
    --dict-file test_dict_output.py
  ```

- To create only the data dump:

  ```
  python pg_anon.py --mode sync-data-dump \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_source_db \
    --output-dir test_sync_data_dump \
    --dict-file test_dict_output.py
  ```

  This mode can be useful for scheduling database synchronization, for example with `cron`.
Possible options in `--mode dump`:

| Option | Description |
|---|---|
| `--validate-dict` | Validate the dictionary: show the tables and run the SQL queries without exporting data (default false) |
| `--validate-full` | Same as `--validate-dict`, plus data export with a limit (default false) |
| `--clear-output-dir` | In dump mode, clears the output directory of files from a previous dump (default true) |
| `--pg-dump` | Path to the pg_dump Postgres tool (default `/usr/bin/pg_dump`) |
| `--output-dir` | Output directory for dump files (default "") |
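To illustrate the shape of the dump output, the sketch below writes anonymized rows as gzip-compressed, COPY-style text. The md5 rule, file name, and tab-separated text format are assumptions for illustration; pg_anon anonymizes on the database side and stores data in its own `*.bin.gz` files:

```python
import gzip
import hashlib
import os
import tempfile

# Write rows to a gzip file, replacing values of anonymized fields
# with their md5 digest. Illustrative only.
def dump_table(rows, anonymize_fields, path):
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for row in rows:
            values = [
                hashlib.md5(value.encode()).hexdigest()
                if field in anonymize_fields else value
                for field, value in row
            ]
            out.write("\t".join(values) + "\n")

path = os.path.join(tempfile.gettempdir(), "tbl_a.bin.gz")
dump_table([[("id", "1"), ("email", "alice@example.com")]], {"email"}, path)
with gzip.open(path, "rt", encoding="utf-8") as f:
    content = f.read()
print(content)
```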
- Each restore mode requires the output of the corresponding dump mode.
- Restore structure and data:

  This mode requires the dump output created in `--mode=dump`.

  ```
  python pg_anon.py --mode restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_dict_output \
    --verbose debug
  ```

- Restore structure only:

  This mode requires the dump output created in `--mode=sync-struct-dump`.

  ```
  python pg_anon.py --mode sync-struct-restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_sync_struct_dump \
    --verbose debug
  ```

- Restore data only:

  This mode requires the dump output created in `--mode=sync-data-dump`.

  ```
  python pg_anon.py --mode sync-data-restore \
    --db-host=127.0.0.1 \
    --db-user postgres \
    --db-user-password postgres \
    --db-name test_target_db \
    --input-dir test_sync_data_dump \
    --verbose debug
  ```
Possible options in `--mode restore`:

| Option | Description |
|---|---|
| `--input-dir` | Input directory containing the dump files created in dump mode |
| `--disable-checks` | Disable checks of disk space and PostgreSQL version (default false) |
| `--seq-init-by-max-value` | Initialize sequences based on maximum column values. Otherwise, the sequences will be initialized based on the values of the source database |
| `--drop-custom-check-constr` | Drop all CHECK constraints containing user-defined procedures to avoid performance degradation at the data loading stage |
| `--pg-restore` | Path to the pg_restore Postgres tool |
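To make `--seq-init-by-max-value` concrete: after the data is loaded, each sequence is set from the current maximum of the column it feeds, rather than from the counter recorded in the source database. The SQL shape is standard PostgreSQL `setval`; the helper itself is an illustrative sketch, not pg_anon's code:

```python
# Generate the setval statement a restore step might run per sequence.
# is_called=true makes the next nextval() return max_value + 1.
def seq_init_sql(seq_name, table, column, max_value):
    return (
        f"SELECT setval('{seq_name}', {max_value}, true);"
        f"  -- next value after max({table}.{column})"
    )

print(seq_init_sql("public.users_id_seq", "public.users", "id", 42))
```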
If you have a table that contains objects and fields for anonymization, you can use this SQL query to generate a dictionary in JSON format:

```sql
select
    jsonb_pretty(
        json_agg(json_build_object('schema', T.schm, 'table', T.tbl, 'fields', flds))::jsonb
    )
from (
    select
        T.schm,
        T.tbl,
        JSON_OBJECT_AGG(fld, mrule) as flds
    from (
        select 'schm_1' as schm, 'tbl_a' as tbl, 'fld_1' as fld, 'md5(fld_1)' as mrule
        union all
        select 'schm_1', 'tbl_a', 'fld_2', 'md5(fld_2)'
        union all
        select 'schm_1', 'tbl_b', 'fld_1', 'md5(fld_1)'
        union all
        select 'schm_1', 'tbl_b', 'fld_2', 'md5(fld_2)'
    ) T
    group by schm, tbl
) T
```
>>

```json
[
    {
        "table": "tbl_b",
        "fields": {
            "fld_1": "md5(fld_1)",
            "fld_2": "md5(fld_2)"
        },
        "schema": "schm_1"
    },
    {
        "table": "tbl_a",
        "fields": {
            "fld_1": "md5(fld_1)",
            "fld_2": "md5(fld_2)"
        },
        "schema": "schm_1"
    }
]
```
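The same dictionary structure can be assembled directly in Python; the field-to-rule mapping below mirrors the SQL example above:

```python
import json

# Map (schema, table) pairs to field -> anonymization-rule dictionaries,
# then flatten into the profile structure shown above.
tables = {
    ("schm_1", "tbl_a"): {"fld_1": "md5(fld_1)", "fld_2": "md5(fld_2)"},
    ("schm_1", "tbl_b"): {"fld_1": "md5(fld_1)", "fld_2": "md5(fld_2)"},
}
profile = [
    {"schema": schm, "table": tbl, "fields": flds}
    for (schm, tbl), flds in tables.items()
]
print(json.dumps(profile, indent=4))
```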
Dictionary values can contain special and non-ASCII characters; Python's `json` module escapes them on dump and restores them on load:

```python
import json

j = {"k": "_TBL.$complex#имя;@&* a'2"}
json.dumps(j)
```

>>

```
'{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'
```

```python
s = '{"k": "_TBL.$complex#\\u0438\\u043c\\u044f;@&* a\'2"}'
u = json.loads(s)
print(u['k'])
```

>>

```
_TBL.$complex#имя;@&* a'2
```
pg_anon uses the Poetry dependency management tool for managing dependencies and building packages.

To add a new dependency, install Poetry and run:

`poetry add <package_name>`

To lock the dependencies, run:

`poetry lock --no-update`

Additionally, export the latest packages to `requirements.txt` using the poetry export plugin:

`poetry export -f requirements.txt --output requirements.txt`

To build the package, run:

`poetry build`

Alternatively, the package can be built using setuptools:

`python3 setup.py sdist`
- `--format`: COPY data format; can be overridden by `--copy-options`. Selects the data format to be read or written: text, csv, or binary.
- `--copy-options`: Options for the COPY command, such as "with binary".
- Support restoring the structure after a full dump: currently the database structure can be restored only after a struct dump, so you cannot restore the structure after a full dump.
- Simplify commands and options to improve the user experience.