Tool to create anonymized production data dump to use for PERF and other TEST environments.
Install gem using:
$ gem install data-anonymization
Install required database adapter library for active record:
$ gem install sqlite3
Create a Ruby program using the data-anonymization DSL, as in the following my_dsl.rb:
require 'data-anonymization'
database 'DatabaseName' do
strategy DataAnon::Strategy::Blacklist # whitelist (default) or blacklist
# database config as active record connection hash
source_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
# User -> table name (case sensitive)
table 'User' do
# id, DateOfBirth, FirstName, LastName, UserName, Password -> table column names (case sensitive)
primary_key 'id' # composite key is also supported
anonymize 'DateOfBirth','FirstName','LastName' # uses default anonymization based on data types
anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
anonymize('Password') { |field| "password" }
end
...
end
Run using:
$ ruby my_dsl.rb
- Whitelist using Chinook sample database
- Blacklist using Chinook sample database
- Whitelist with composite primary key using DellStore sample database
- Blacklist with composite primary key using DellStore sample database
Major changes:
- Added support for Parallel table execution
Please see the Github 0.3.0 milestone page for more details on changes/fixes in release 0.3.0
- Added a progress bar using the 'powerbar' gem, which also shows the ETA for each table.
- Added more strategies
- Fixed default anonymization strategies for boolean and integer values
- Added support for composite primary key
- First initial release
- MongoDB anonymization support (NoSQL document based database support)
- Generate DSL from database and build schema from source as part of Whitelist approach.
Please use Github issues to share feedback, feature suggestions and report issues.
For almost all projects there is a need for a production data dump in order to run performance tests, rehearse production releases and debug production issues. However, getting production data and using it is often not feasible for multiple reasons, the primary one being privacy concerns for user data. Hence the need for data anonymization. This tool helps you get an anonymized production data dump using either the Blacklist or the Whitelist strategy.
This approach essentially leaves all fields unchanged with the exception of those specified by the user, which are scrambled/anonymized (hence the name blacklist).
For Blacklist, create a copy of the prod database and choose the fields to be anonymized, e.g. username, password, email, name, geo location etc., based on user specification. Most of the fields have different rules, e.g. password should be set to the same value for all users, and email needs to be valid.
The problem with this approach is that when new fields are added they will not be anonymized by default. Human error in omitting users' personal data could be damaging.
database 'DatabaseName' do
strategy DataAnon::Strategy::Blacklist
source_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
...
end
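To make the blacklist idea concrete, here is a minimal plain-Ruby sketch that does not use the gem. A row is modelled as a Hash, and `blacklist_anonymize` and the `rules` hash are hypothetical names introduced only for illustration. Only the listed fields are rewritten; everything else passes through untouched.

```ruby
# Hypothetical illustration of the blacklist idea; not the gem's implementation.
# rules maps a column name to a callable that receives (value, row_number).
def blacklist_anonymize(row, row_number, rules)
  row.each_with_object({}) do |(column, value), result|
    rule = rules[column]
    # Columns without a rule are copied as is.
    result[column] = rule ? rule.call(value, row_number) : value
  end
end

rules = {
  'UserName' => ->(_value, n) { "user#{n}" },
  'Password' => ->(_value, _n) { 'password' }
}

row = { 'id' => 7, 'UserName' => 'jsmith', 'Password' => 's3cret', 'Country' => 'UK' }
anonymized = blacklist_anonymize(row, 1, rules)
puts anonymized.inspect
```

Note that a column added to the table later (say, an email column) would have no rule and would be copied unchanged, which is exactly the risk described for this approach.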
This approach, by default, scrambles/anonymizes all fields except a list of fields which are allowed to be copied as is (hence the name whitelist). By default all data needs to be anonymized, so data from the production database is sanitized record by record and inserted as anonymized data into the destination database. The source database needs to be read-only. All fields are anonymized using the default anonymization strategy, which is based on the datatype, unless a special anonymization strategy is specified. For instance, special strategies could be used for emails, passwords, usernames etc. A whitelisted field implies that it's okay to copy the data as is and anonymization isn't required. This way any new field will be anonymized by default, and if we need it as is, we add it to the whitelist explicitly. This prevents human error and protects sensitive information.
database 'DatabaseName' do
strategy DataAnon::Strategy::Whitelist
source_db :adapter => 'sqlite3', :database => 'sample-data/chinook.sqlite'
destination_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
...
end
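The whitelist idea can be sketched in plain Ruby in the same spirit, again without the gem. The names `whitelist_anonymize` and `default_rule` are hypothetical, and the default rule shown here is an assumption chosen for brevity rather than the gem's type-based defaults.

```ruby
# Hypothetical illustration of the whitelist idea; not the gem's implementation.
# Every column is anonymized unless it is explicitly whitelisted.
def whitelist_anonymize(row, whitelist, default_rule)
  row.each_with_object({}) do |(column, value), result|
    result[column] = whitelist.include?(column) ? value : default_rule.call(value)
  end
end

# Assumed default: replace strings with same-length filler, zero out everything else.
default_rule = ->(value) { value.is_a?(String) ? 'x' * value.length : 0 }

source_row = { 'id' => 7, 'Country' => 'UK', 'Email' => 'jsmith@example.com' }
destination_row = whitelist_anonymize(source_row, %w[id Country], default_rule)
puts destination_row.inspect
```

Because the source row is never modified and a fresh Hash is built for the destination, the sketch mirrors the read-only source and separate destination described for this approach.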
- In the Whitelist approach, make the source database connection READONLY.
- Change the default field strategies to avoid repeating the same strategy again and again in your DSL.
- To run anonymization in parallel at table level, provided there are no FK constraints on the tables, use the DataAnon::Parallel::Table strategy.
Currently provides the capability of running anonymization in parallel at table level, provided there are no FK constraints on the tables. It uses the Parallel gem provided by Michael Grosser. By default it starts multiple parallel Ruby processes, each processing tables one by one.
database 'DellStore' do
strategy DataAnon::Strategy::Whitelist
execution_strategy DataAnon::Parallel::Table # by default sequential table processing
...
end
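As a rough illustration of table-level parallelism, the sketch below processes independent tables concurrently. It deliberately uses standard-library threads as a stand-in for the Parallel gem's worker processes, and `process_tables_in_parallel` plus the table names and row counts are made up for the example.

```ruby
# Hypothetical sketch: process independent tables concurrently.
# Uses stdlib threads as a stand-in for the Parallel gem's worker processes.
def process_tables_in_parallel(table_names)
  threads = table_names.map do |table_name|
    Thread.new { [table_name, yield(table_name)] }
  end
  # Thread#value waits for the thread and returns its block's result.
  threads.map(&:value).to_h
end

row_counts = { 'customers' => 3, 'orders' => 5 } # made-up table sizes
results = process_tables_in_parallel(row_counts.keys) do |table_name|
  "#{table_name}: #{row_counts[table_name]} rows anonymized"
end
puts results.values
```

The tables finish in no guaranteed order, which is why this style of execution only suits tables without FK constraints between them.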
The object that gets passed along with the field strategies has the following attribute accessors:
- name: current field/column name
- value: current field/column value
- row_number: current row number
- ar_record: active record of the current row under processing
Default anonymization strategy for string content. Uses the default 'Lorem ipsum...' text, or the text supplied to the strategy, to generate a string of the same length.
anonymize('UserName').using FieldStrategy::LoremIpsum.new
anonymize('UserName').using FieldStrategy::LoremIpsum.new("very large string....")
anonymize('UserName').using FieldStrategy::LoremIpsum.new(File.read('my_file.txt'))
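One plausible way to produce same-length filler text is to repeat the source text and cut it to the length of the original value. The sketch below shows that idea in plain Ruby; `same_length_filler` and `DEFAULT_TEXT` are hypothetical names and this is not necessarily how the gem does it.

```ruby
# Hypothetical sketch of same-length filler text; not the gem's implementation.
DEFAULT_TEXT = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '.freeze

def same_length_filler(value, text = DEFAULT_TEXT)
  raise ArgumentError, 'filler text must not be empty' if text.empty?
  return value if value.empty?

  # Repeat the source text until it is long enough, then cut to the original length.
  repeats = (value.length.to_f / text.length).ceil
  (text * repeats)[0, value.length]
end

puts same_length_filler('jsmith')
puts same_length_filler('a' * 80).length
```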
Generates random string of same length.
anonymize('UserName').using FieldStrategy::RandomString.new
Simple string evaluation within the DataAnon::Core::Field context. Can be used for email and username anonymization. Make sure to put the string in single quotes, otherwise Ruby will interpolate it inline before the strategy receives it.
anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
anonymize('Email').using FieldStrategy::StringTemplate.new('valid.address+#{row_number}@gmail.com')
anonymize('Email').using FieldStrategy::StringTemplate.new('useremail#{row_number}@mailinator.com')
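The need for single quotes becomes clearer with a small sketch of deferred evaluation. Here `Field` is a hypothetical Struct standing in for DataAnon::Core::Field, and `render_template` is an illustrative helper, not the gem's code: the template stays a literal string until it is evaluated with a field as `self`.

```ruby
# Hypothetical stand-in for DataAnon::Core::Field with the documented accessors.
Field = Struct.new(:name, :value, :row_number, :ar_record)

# Evaluates the template as a double-quoted string with the field as self,
# so #{row_number} resolves against the field at anonymization time.
# Assumes a trusted template that contains no double quotes.
def render_template(template, field)
  field.instance_eval(%("#{template}"))
end

field = Field.new('UserName', 'jsmith', 5, nil)
puts render_template('user#{row_number}', field)
puts render_template('useremail#{row_number}@mailinator.com', field)
```

Had the template been written in double quotes, Ruby would have tried to interpolate `row_number` at the point where the DSL is defined, before any row exists.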
Select randomly one of the values specified.
anonymize('State').using FieldStrategy::SelectFromList.new(['New York','Georgia',...])
anonymize('NameTitle').using FieldStrategy::SelectFromList.new(['Mr','Mrs','Dr',...])
Similar to SelectFromList; the only difference is that the list of values is picked up from a file. A classic use is anonymizing a states field.
anonymize('State').using FieldStrategy::SelectFromFile.new('states.txt')
Keeping the format the same, it replaces each digit in the string with a random digit.
anonymize('CreditCardNumber').using FieldStrategy::FormattedStringNumber.new
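A format-preserving digit replacement can be sketched in one line of plain Ruby. `randomize_digits` is a hypothetical helper used only to illustrate the behaviour; separators and letters are left in place.

```ruby
# Hypothetical sketch: replace every digit with a random digit, keep all other characters.
def randomize_digits(value)
  value.gsub(/\d/) { rand(10).to_s }
end

original = '4111-1111-1111-1111'
randomized = randomize_digits(original)
puts randomized
```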
Similar to SelectFromList; the difference is that the list of values is collected from a database table using a distinct column query.
# values are collected using `select distinct state from customers` query
anonymize('State').using FieldStrategy::SelectFromDatabase.new('customers','state')
Generates address using the geojson format file. The default US/UK file chooses randomly from 300 addresses. The large data set can be downloaded from here
anonymize('Address').using FieldStrategy::RandomAddress.region_US
anonymize('Address').using FieldStrategy::RandomAddress.region_UK
# get your own geo_json file and use it
anonymize('Address').using FieldStrategy::RandomAddress.new('my_geo_json.json')
Similar to RandomAddress, generates city using the geojson format file. The default US/UK file chooses randomly from 300 addresses. The large data set can be downloaded from here
anonymize('City').using FieldStrategy::RandomCity.region_US
anonymize('City').using FieldStrategy::RandomCity.region_UK
# get your own geo_json file and use it
anonymize('City').using FieldStrategy::RandomCity.new('my_geo_json.json')
Similar to RandomAddress, generates province using the geojson format file. The default US/UK file chooses randomly from 300 addresses. The large data set can be downloaded from here
anonymize('Province').using FieldStrategy::RandomProvince.region_US
anonymize('Province').using FieldStrategy::RandomProvince.region_UK
# get your own geo_json file and use it
anonymize('Province').using FieldStrategy::RandomProvince.new('my_geo_json.json')
Similar to RandomAddress, generates zipcode using the geojson format file. The default US/UK file chooses randomly from 300 addresses. The large data set can be downloaded from here
anonymize('Address').using FieldStrategy::RandomZipcode.region_US
anonymize('Address').using FieldStrategy::RandomZipcode.region_UK
# get your own geo_json file and use it
anonymize('Address').using FieldStrategy::RandomZipcode.new('my_geo_json.json')
Keeping the format the same, it replaces each digit in the string with a random digit.
anonymize('PhoneNumber').using FieldStrategy::RandomPhoneNumber.new
Anonymizes each field (except year and seconds) within its natural range (e.g. hour between 1-24 and day within the month) based on the true/false input for that field. By default, all fields are anonymized.
#anonymizes month and hour fields, leaving the day and minute fields untouched
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.new(true,false,true,false)
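The per-field behaviour can be sketched with the standard `date` library. `anonymize_date_time` is a hypothetical function whose flag order (month, day, hour, minute) follows the example above; the 0-23 hour range and the clamping of the day to the length of the new month are assumptions of this sketch.

```ruby
require 'date'

# Hypothetical sketch; flags are given in the order month, day, hour, minute.
def anonymize_date_time(original, any_month = true, any_day = true, any_hour = true, any_minute = true)
  new_month = any_month ? rand(1..12) : original.month
  days_in_month = Date.new(original.year, new_month, -1).day
  # Assumption: clamp the day so that e.g. 31 January never becomes 31 February.
  new_day = any_day ? rand(1..days_in_month) : [original.day, days_in_month].min
  new_hour = any_hour ? rand(0..23) : original.hour
  new_minute = any_minute ? rand(0..59) : original.minute
  # Year and seconds are always carried over unchanged.
  DateTime.new(original.year, new_month, new_day, new_hour, new_minute, original.second)
end

born = DateTime.new(1985, 7, 14, 9, 30, 45)
shifted = anonymize_date_time(born, true, false, true, false)
puts shifted
```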
In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
# anonymizes only the month field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_month
# anonymizes only the day field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_day
# anonymizes only the hour field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_hour
# anonymizes only the minute field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDateTime.only_minute
Identical to the AnonymizeDateTime strategy above, except that the returned object is of type Time.
Anonymizes the day and month fields within their natural range based on the true/false input for that field. By default both fields are anonymized.
# anonymizes month and leaves day unchanged
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.new(true,false)
In addition to customizing which fields you want anonymized, there are some helper methods which allow for quick anonymization
# anonymizes only the month field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_month
# anonymizes only the day field
anonymize('DateOfBirth').using FieldStrategy::AnonymizeDate.only_day
Shifts the date and time randomly within the given range. By default it shifts the date within + or - 10 days and the time within 30 minutes.
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new
# shifts date within 20 days and time within 50 minutes
anonymize('DateOfBirth').using FieldStrategy::DateTimeDelta.new(20, 50)
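A random shift of this kind is easy to sketch with Ruby's `Time`. `shift_date_time` is a hypothetical helper; shifting by whole days and whole minutes is an assumption of the sketch, and the defaults mirror the ones described above.

```ruby
# Hypothetical sketch of a random date/time shift; not the gem's implementation.
def shift_date_time(time, day_delta = 10, minute_delta = 30)
  # Pick a whole number of days and minutes, each within + or - its delta.
  seconds = rand(-day_delta..day_delta) * 86_400 + rand(-minute_delta..minute_delta) * 60
  time + seconds
end

original = Time.utc(2012, 6, 15, 12, 0, 0)
shifted = shift_date_time(original, 20, 50)
puts shifted
```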
Identical to the DateTimeDelta strategy above, except that the returned object is of type Time.
Shifts the date randomly within the given delta range. By default it shifts the date within + or - 10 days.
anonymize('DateOfBirth').using FieldStrategy::DateDelta.new
# shifts date within 25 days
anonymize('DateOfBirth').using FieldStrategy::DateDelta.new(25)
Generates an email randomly using the given HOSTNAME and TLD. By default it generates the hostname randomly along with the email id.
anonymize('Email').using FieldStrategy::RandomEmail.new('thoughtworks','com')
Generates a valid unique gmail address by taking advantage of the gmail + strategy. Takes in a valid gmail username and generates emails of the form username+<number>@gmail.com
anonymize('Email').using FieldStrategy::GmailTemplate.new('username')
Generates random email using mailinator hostname. e.g. @mailinator.com
anonymize('Email').using FieldStrategy::RandomMailinatorEmail.new
Generates random user name of same length as original user name.
anonymize('Username').using FieldStrategy::RandomUserName.new
Randomly picks up first name from the predefined list in the file. Default file is part of the gem. File should contain first name on each line.
anonymize('FirstName').using FieldStrategy::RandomFirstName.new
anonymize('FirstName').using FieldStrategy::RandomFirstName.new('my_first_names.txt')
Randomly picks up last name from the predefined list in the file. Default file is part of the gem. File should contain last name on each line.
anonymize('LastName').using FieldStrategy::RandomLastName.new
anonymize('LastName').using FieldStrategy::RandomLastName.new('my_last_names.txt')
Generates full name using the RandomFirstName and RandomLastName strategies. It also creates the s
anonymize('FullName').using FieldStrategy::RandomFullName.new
anonymize('FullName').using FieldStrategy::RandomFullName.new('my_first_names.txt', 'my_last_names.txt')
Generates random integer number between given two numbers. Default range is 0 to 100.
anonymize('Age').using FieldStrategy::RandomInteger.new(18,70)
Shifts the current value randomly within given delta + and -. Default is 10
anonymize('Age').using FieldStrategy::RandomIntegerDelta.new(2)
Generates random float number between given two numbers. Default range is 0.0 to 100.0
anonymize('points').using FieldStrategy::RandomFloat.new(3.0,5.0)
Shifts the current value randomly within given delta + and -. Default is 10.0
anonymize('points').using FieldStrategy::RandomFloatDelta.new(2.5)
The field parameter in the following code is a DataAnon::Core::Field.
class MyFieldStrategy
# an anonymize method is all that is required
def anonymize field
# write your code here
end
end
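As a filled-in example of such a class, the sketch below masks everything except the last four characters of a value. `MaskAllButLastFour` and `FakeField` are hypothetical names; `FakeField` is a Struct standing in for DataAnon::Core::Field so that the sketch runs without the gem.

```ruby
# Hypothetical custom strategy: keeps the last four characters, masks the rest.
class MaskAllButLastFour
  def anonymize(field)
    value = field.value.to_s
    return value if value.length <= 4

    ('*' * (value.length - 4)) + value[-4, 4]
  end
end

# Stand-in for DataAnon::Core::Field so the sketch runs without the gem.
FakeField = Struct.new(:name, :value, :row_number, :ar_record)
field = FakeField.new('AccountNumber', '1234567890', 1, nil)
masked = MaskAllButLastFour.new.anonymize(field)
puts masked
```

Within the DSL, an instance of such a class would presumably be passed to `.using` in the same way as the built-in strategies.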
Write your own anonymous field strategies within the DSL:
table 'User' do
anonymize('Password') { |field| "password" }
anonymize('email') do |field|
"test+#{field.row_number}@gmail.com"
end
end
# Work in progress... TO BE COMPLETED
DEFAULT_STRATEGIES = {:string => FieldStrategy::LoremIpsum.new,
:fixnum => FieldStrategy::RandomIntegerDelta.new(5),
:bignum => FieldStrategy::RandomIntegerDelta.new(5000),
:float => FieldStrategy::RandomFloatDelta.new(5.0),
:datetime => FieldStrategy::DateTimeDelta.new,
:time => FieldStrategy::TimeDelta.new,
:date => FieldStrategy::DateDelta.new,
:trueclass => FieldStrategy::RandomBoolean.new,
:falseclass => FieldStrategy::RandomBoolean.new
}
Overriding default field strategies:
database 'Chinook' do
...
default_field_strategies :string => FieldStrategy::RandomString.new
...
end
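The interplay between type-based defaults and overrides can be sketched as a Hash lookup keyed by the value's class. This is an illustration only: `DEFAULTS` and `anonymize_by_type` are hypothetical, the lambdas are simplified stand-ins for the real strategies, and the keys follow Ruby 2.4 and later, where integers report their class as Integer.

```ruby
# Hypothetical sketch of per-type defaults with overrides; not the gem's implementation.
DEFAULTS = {
  string: ->(value) { 'x' * value.length },
  integer: ->(value) { value + rand(-5..5) },
  float: ->(value) { value + rand(-5.0..5.0) }
}.freeze

def anonymize_by_type(value, overrides = {})
  strategies = DEFAULTS.merge(overrides)
  key = value.class.name.downcase.to_sym # 'String' -> :string, 'Integer' -> :integer
  strategy = strategies.fetch(key) do
    raise ArgumentError, "no default strategy for #{value.class}"
  end
  strategy.call(value)
end

puts anonymize_by_type('jsmith')
puts anonymize_by_type(42)
# Override the string default, much like default_field_strategies does in the DSL.
puts anonymize_by_type('jsmith', { string: ->(value) { value.reverse } })
```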
How do I switch off the progress bar?
# add following line in your ruby file
ENV['show_progress'] = 'false'
Logger provides debug level messages, including the database queries of active record.
DataAnon::Utils::Logging.logger.level = Logger::INFO
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
- ThoughtWorks Inc, for allowing us to build this tool and make it open source.
- Birinder and Panda for reviewing the documentation.
- Dan Abel for introducing me to Blacklist and Whitelist approach for data anonymization.
- Chirga Doshi for encouraging me to get this done.
- Aditya Karle for the Logo. (Coming Soon...)