Anonymizing pg_dump data isn't always straight forward. This tool helps. It has no dependencies, other than ruby. Another good thing is that sensitive data doesn't need to ever be stored temporarily as this can be used with Unix pipes and the data is passed through as IO data.
pg_dump_anonymize
does not anonymize any data automatically. It is very much BYOAD (Bring Your Own Anonymizing Definition). Inside your anonymizing definition, you can use any ruby gems you like, such as faker. This gem makes applying anonymizing definitions to pg_dump
output easy.
Add this line to your application's Gemfile:
gem 'pg_dump_anonymize'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install pg_dump_anonymize
pg_dump_anonymize
does not anonymize any data by default. You must provide your own anonymizing definition. The gem currently requires the format of output pg_dump
is plain text (not compressed) and uses the default COPY
behavior instead of INSERT INTO
.
Example usage:
pg_dump --no-privileges --no-owner <database-name> | pg_dump_anonymize -d sample_definition.rb > anonymized_dump.sql
You can also pipe the anonymized SQL directly into psql
to avoid intermediate SQL dump files.
The definition file is any ruby code you'd like, but it must return a hash with table names as the top level keys and attribute names nested under the table keys. Any faking ruby gems dependencies, and other gem dependencies you may choose to use in your definitions, are gems that will need to be available on the server this is ran on.
Example:
{
table_1: {
sensitive_field_1: 'some string value',
sensitive_field_2: -> { rand(100) },
sensitive_field_3: -> (original_value, row_context) do
if original_value
'xxxxx'
end.tap { |new_val| row_context[field_3_val] = new_val }
end
sensitive_field_4: -> (_original_value, row_context) do
row_context[:field_3_val] ? 'foo' : 'bar'
end
},
table_2: {
sensitive_field_3: nil
}
}
Here is a more concrete example using the faker gem.
require 'faker'
{
users: {
first_name: -> { Faker::Name.first_name },
last_name: -> { Faker::Name.last_name },
email: -> { Faker::Internet.email },
city: 'Portland'
},
accounts: {
bank_name: -> { Faker::Bank.name },
account_num: -> { Faker::Bank.account_number },
routing_num: -> { Faker::Bank.routing_number }
}
}
- Write some tests (so far this has been tested manually)
After checking out the repo, run bin/setup
to install dependencies. Then, run rake spec
to run the tests. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
This has been tested with a 1GB dump file, and it took 11 seconds on a 2013 MacBook Pro to anonymize it. It largely depends on how much you're anonymizing, and the anonymizing definitions you're applying. So milage will vary. Still, this is plenty fast for most of my needs.
Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/pg_dump_anonymize.
The gem is available as open source under the terms of the MIT License.