dyno failing to restore incremental backup?
keen99 opened this issue · 6 comments
simple test case with a small table:
ENV=dev
TABLE=dsrtest2
. config.env.$ENV
bin/incremental-backfill.js $AWS_REGION/$TABLE s3://$BackupBucket/$BackupPrefix
bin/incremental-snapshot.js s3://$BackupBucket/$BackupPrefix/$TABLE s3://$BackupBucket/${TABLE}-snapshot
s3print s3://$BackupBucket/${TABLE}-snapshot | dyno put $AWS_REGION/dsr-test-restore-$TABLE
%% sh test-backup.sh
12 - 11.89/s
[Fri, 09 Dec 2016 23:54:59 GMT] [info] [incremental-snapshot] Starting snapshot from s3://dsr-ddb-rep-testing/testprefix/dsrtest2 to s3://dsr-ddb-rep-testing/dsrtest2-snapshot
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Starting upload of part #0, 0 bytes uploaded, 12 items uploaded @ 6.26 items/s
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Uploaded snapshot to s3://dsr-ddb-rep-testing/dsrtest2-snapshot
[Fri, 09 Dec 2016 23:55:01 GMT] [info] [incremental-snapshot] Wrote 12 items and 148 bytes to snapshot
undefined:1
�
^
SyntaxError: Unexpected token in JSON at position 0
at Object.parse (native)
at Function.module.exports.deserialize (/Users/draistrick/git/github/dynamodb-replicator/node_modules/dyno/lib/serialization.js:49:18)
at Transform.Parser.parser._transform (/Users/draistrick/git/github/dynamodb-replicator/node_modules/dyno/bin/cli.js:94:25)
at Transform._read (_stream_transform.js:167:10)
at Transform._write (_stream_transform.js:155:12)
at doWrite (_stream_writable.js:307:12)
at writeOrBuffer (_stream_writable.js:293:5)
at Transform.Writable.write (_stream_writable.js:220:11)
at Stream.ondata (stream.js:31:26)
at emitOne (events.js:96:13)
Next step would be to diff the two tables - but the pipe to dyno fails. I've tried 1.0.0 and 1.3.0 with the same result.
What data format is dyno expecting? The file on s3 (I tried multiple tables, including real data tables) looks like a binary blob?
cheese:~%% aws --region=us-west-2 s3 cp s3://dsr-ddb-rep-testing/dsrtest-snapshot -
m�1�
��ߠl�EG�EB�uL0\�Tuq�ݵ#������$L�6�/8�%Z�r�[d�p
���5h)��X�ֻ�j�ƪ�
ۘ��&�WJ'❑��`�T��������
cheese:~%%
So maybe this is a problem with backfill? Or am I missing something? :)
2016-12-09 18:54:35 149 dsrtest-snapshot
2016-12-09 18:55:01 148 dsrtest2-snapshot
2016-12-09 18:37:20 1428 receipt_log_dev-01-snapshot
2016-12-09 18:53:15 13457328 showdownlive_dev-01-snapshot
oh.
cheese:~%% aws --region=us-west-2 s3 cp s3://dsr-ddb-rep-testing/dsrtest-snapshot -|gzcat
{"a":{"S":"b"},"what":{"S":"new10"}}
{"b":{"S":"ccd"},"what":{"S":"a"}}
{"aa":{"S":"bb"},"what":{"S":"asdf"}}
{"a":{"S":"11"},"what":{"S":"new2"}}
{"a":{"S":"asdf"},"what":{"S":"sdfg"}}
{"what":{"S":"test2"}}
{"a":{"S":"fish faster 8"},"what":{"S":"new"}}
{"a":{"S":"bb"},"what":{"S":"bb"}}
{"a":{"S":"b"},"what":{"S":"new1"}}
{"a":{"S":"aa"},"b":{"S":"cc"},"what":{"S":"b"}}
{"a":{"S":"test1"},"what":{"S":"test"}}
{"what":{"S":"test4"}}
That still doesn't work - s3print | gzcat fails, apparently s3print is outputting something extra...
%% s3print s3://$BackupBucket/${TABLE}-snapshot | gzcat
{"what":{"S":"new10"},"a":{"S":"b"}}
{"what":{"S":"test4"}}
{"what":{"S":"new2"},"a":{"S":"11"}}
{"b":{"S":"ccd"},"what":{"S":"a"}}
{"what":{"S":"new1"},"a":{"S":"b"}}
{"aa":{"S":"bb"},"what":{"S":"asdf"}}
{"b":{"S":"cc"},"what":{"S":"b"},"a":{"S":"aa"}}
{"what":{"S":"new"},"a":{"S":"fish faster 8"}}
{"what":{"S":"sdfg"},"a":{"S":"asdf"}}
{"what":{"S":"test2"},"a":{"S":"asdf"}}
{"what":{"S":"test"},"a":{"S":"test1"}}
{"what":{"S":"bb"},"a":{"S":"bb"}}
gzcat: (stdin): trailing garbage ignored
but
aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat | dyno put $AWS_REGION/dsr-test-restore-$TABLE
almost works - except dyno requires the table to already exist.
I guess we're not storing table create details with the backups, so we can't actually directly do a restore to a new table - have to discover the old table's setup and recreate that first. :/
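For my test tables, something like this would probably work to recreate the destination table before the dyno put (a rough sketch: it assumes jq is available, writes to an illustrative /tmp/table-def.json, and ignores secondary indexes and stream settings):
# Pull the source table's schema and keep only the fields create-table needs
aws --region $AWS_REGION dynamodb describe-table --table-name $TABLE \
  | jq '.Table | {
      TableName: ("dsr-test-restore-" + .TableName),
      AttributeDefinitions,
      KeySchema,
      ProvisionedThroughput: {
        ReadCapacityUnits: .ProvisionedThroughput.ReadCapacityUnits,
        WriteCapacityUnits: .ProvisionedThroughput.WriteCapacityUnits
      }
    }' > /tmp/table-def.json
# Create the restore target and wait for it to become ACTIVE before writing
aws --region $AWS_REGION dynamodb create-table --cli-input-json file:///tmp/table-def.json
aws --region $AWS_REGION dynamodb wait table-exists --table-name dsr-test-restore-$TABLE
# Load the snapshot into the new table
aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat | dyno put $AWS_REGION/dsr-test-restore-$TABLE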
dyno's export CLI writes a similar snapshot with a table description as the first line, and then dyno's import CLI can read that table description and create a new table. This doesn't exactly help for the incremental snapshot case (which doesn't even have knowledge of the table schema), but perhaps there's some code over there that can help you with a pipeline that utilizes dyno import?
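A full-table roundtrip with those commands might look roughly like this (an unverified sketch; it assumes the export/import subcommands take the same region/table argument that dyno put does, and the -export key name is just for illustration):
# dyno export writes the table description as the first line, followed by the records
dyno export $AWS_REGION/$TABLE | gzip | aws s3 cp - s3://$BackupBucket/${TABLE}-export
# dyno import reads that first line and creates the destination table before loading
aws s3 cp s3://$BackupBucket/${TABLE}-export - | gzcat | dyno import $AWS_REGION/dsr-test-restore-$TABLE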
Thanks. Seems like the logical process for incrementals (and snapshots built from them) would be to have the streams-triggered function maintain a describeTable object in s3 that's updated every time the trigger runs. The snapshot creation would then include that, and we could just use dyno import to load it into a new table. Rolling back through time through the versioned bucket would always have the corresponding table schema (not that the schema can really change, I suppose, but maybe there are important parts of the config that matter and can change).
OK, I've got working logic to extract table descriptions at the time of the lambda events and store them. While the only thing that's really likely to change is scaling limits, if I'm recreating a point in time, I'd rather have the whole point in time. :)
Would you be interested in this as a PR? It currently stores the description at bucket/prefix/tablename.description and would be non-impacting (except for IAM policy updates to read it, I guess) for existing workflows.
My next step is to update the s3-snapshot.js code to include the description in a form that dyno import can handle.
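Roughly what I have in mind for the restore side (a sketch only: it assumes the stored object is the raw DescribeTable response, and that dyno import will accept a single-line JSON table description as the first line of its input - the exact shape still needs checking against what dyno export writes):
# Prepend the saved description to the decompressed snapshot and hand the
# whole stream to dyno import, which should create the table itself
(
  aws s3 cp s3://$BackupBucket/$BackupPrefix/${TABLE}.description - | jq -c .Table
  aws s3 cp s3://$BackupBucket/${TABLE}-snapshot - | gzcat
) | dyno import $AWS_REGION/dsr-test-restore-$TABLE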
If you want this as a PR - would you prefer it to NOT attach the description if one doesn't exist? My options seem to be: no description, description from s3, or the current live description if there isn't one on s3.
Not sure what the best approach would be for existing consumers - so figured I'd ask. Maybe another CLI arg for the different path? Dunno.
I'll start working on that tomorrow for my needs - but happy to make it portable if you've got a direction you'd prefer.
I'm glad you're finding something that can work for you here.
There are a couple of issues that I could imagine coming up with the approach you're pursuing:
- Are you writing the table description every time lambda is invoked? This could cause throttling on the DynamoDB DescribeTable API and/or S3 PutObject API for tables with very high write load.
- On tables with very large record counts, we've found that we have to perform the snapshot spread across several processes. Each process ends up being responsible for scraping the S3 objects with names starting with 0, 1, ... up to f. At the end, a "reduce" process rolls up the results of each of the individual processes into a single snapshot file. I am imagining that having the tablename.description record in the midst of the incremental records might mess with that final rollup step.
Generally speaking, we've used the snapshots for point-in-time restores of individual records, rather than wholesale restores of the entire database. I do feel like there's some appeal towards keeping the incremental backup step (that happens on Lambda) responsible for nothing more than that one thing. What if the DescribeTable request was made alongside the s3-snapshot.js call, perhaps optional?
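For illustration, that could be as simple as running the describe next to the snapshot step (a sketch; the second command and the -snapshot.description key are illustrative, not an existing option of the incremental-snapshot.js CLI):
# Take the incremental snapshot as before, then capture the table description
# once per snapshot run instead of on every lambda invocation
bin/incremental-snapshot.js s3://$BackupBucket/$BackupPrefix/$TABLE s3://$BackupBucket/${TABLE}-snapshot
aws --region $AWS_REGION dynamodb describe-table --table-name $TABLE \
  | aws s3 cp - s3://$BackupBucket/${TABLE}-snapshot.description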