sunitparekh/data-anonymization

batch_size acts like limit

brki opened this issue · 4 comments

brki commented

When batch_size is set, not all records are processed. In fact, only as many records as batch_size specifies are processed.

I'm using a whitelisting strategy on an sqlite database. The anonymization is defined like this:

require 'data-anonymization'
require 'sqlite3'

DataAnon::Utils::Logging.logger.level = Logger::INFO

database 'foobar' do

  strategy DataAnon::Strategy::Whitelist
  source_db :adapter => 'sqlite3', :database => 'foo.db'
  destination_db :adapter => 'sqlite3', :database => 'foo.anon.db'

  table "foo" do
    primary_key "id"
    batch_size 10

    whitelist "id"
    anonymize("title") { |field| field.value + "foo" }
  end

end

The table foo was created in both the source and destination databases like this:

sqlite> CREATE TABLE foo(id INTEGER, title TEXT);

The table foo has 4999 records in the source_db, and no records in the destination_db.

When I run the anonymization script, only 10 records are created in the destination_db. I expect all 4999 records to appear in the destination_db.
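To make the difference between the expected and the observed behavior concrete, here is a minimal plain-Ruby sketch. It does not use the gem or a database: the `records` array, the `anonymize` lambda, and the two helper methods are hypothetical stand-ins, and nothing here is taken from the data-anonymization source. It only shows that batching should visit every row, whereas a limit stops after the first batch.

```ruby
# Hypothetical stand-in for the 4999 rows of table foo (not the gem's code).
records = (1..4999).map { |id| { id: id, title: "title #{id}" } }
batch_size = 10

# Stand-in for the anonymize("title") block from the report.
anonymize = ->(row) { row.merge(title: row[:title] + "foo") }

# Expected: batch_size only controls how many rows are handled per step,
# so every row is still processed.
def process_in_batches(records, batch_size, anonymize)
  records.each_slice(batch_size).flat_map { |batch| batch.map(&anonymize) }
end

# Observed: behaves as if only the first batch were ever fetched.
def process_like_limit(records, batch_size, anonymize)
  records.first(batch_size).map(&anonymize)
end

batched = process_in_batches(records, batch_size, anonymize)
limited = process_like_limit(records, batch_size, anonymize)

puts "batched: #{batched.size} rows"
puts "limited: #{limited.size} rows"
```

With the numbers from this report, the batched variant covers all 4999 rows, while the limit-like variant stops at 10, which matches what ends up in the destination_db.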

No errors are reported, the script output looks like:

[vagrant@localhost]$ ruby ruby_scripts/test.rb
I, [2016-01-19T15:11:19.867351 #8410]  INFO -- : Processing table foo records in batch size of 10
foo                  [     1/4999  ] ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉  0% 00:00:00
I, [2016-01-19T15:11:21.740611 #8410]  INFO -- : Fields missing the anonymization strategy

I tried with debug-level logging enabled, but no errors were shown then, either.

These are the gem versions installed:

*** LOCAL GEMS ***
activemodel (4.2.5)
activerecord (4.2.5)
activesupport (4.2.5)
arel (6.0.3)
bigdecimal (1.2.6)
bson (3.2.6, 1.12.5)
bson_ext (1.12.4)
builder (3.2.2)
composite_primary_keys (8.1.2)
data-anonymization (0.7.2)
hashie (3.4.3)
i18n (0.7.0)
io-console (0.4.3)
json (1.8.1)
minitest (5.4.3)
mongo (2.1.2)
parallel (1.6.1)
pg (0.18.4)
power_assert (0.2.2)
powerbar (1.0.16)
protected_attributes (1.1.3)
psych (2.0.8)
rake (10.4.2)
rdoc (4.2.0)
rgeo (0.5.2)
rgeo-geojson (0.4.2)
sqlite3 (1.3.11)
test-unit (3.0.8)
thor (0.19.1)
thread_safe (0.3.5)
tzinfo (1.2.2)

The version of ruby is 2.2.2p95.

Thank you for reporting the issue. Give me a couple of days and I will look into it.

Regards,
Sunit

I looked into it quickly. It looks like changes in the activerecord library are causing it to break. I will need more time to look into the activerecord code.

Give me some time; I will look into it over the weekend. Till then, avoid using batch_size :-)

Thanks
Sunit

@janraasch Thanks for the pull request. I merged it and released 0.7.3.

Sure. Thank you for merging this :)