s3 support for data handling
dstengle opened this issue · 4 comments
Is it possible to use s3 and elastic mapreduce for processing sstables from backups? Do you push data into hdfs and then process it from -> to there?
Yes, so what we are doing currently is moving the data from S3 to HDFS and splitting the files. We do this for two reasons. First is that the data is snappy compressed in our workflow by Priam. We could get around that by allowing Aegisthus to decompress the files directly from S3. The other reason is that we often have very large files and we have found that splitting large SSTables vastly improves our performance.
If you didn't have these constraints you could read directly from s3. Or a patch to the code if the snappy compression was the only issue.
My plan is to check in our distcp that we use in our workflow as soon as I can get back to the project (next week at this point).
@charsmith, just fyi I recall it's not possible to stream from s3 for (non cassandra-compressed) Priam data files, as aegisthus needs to know what's the uncompressed size of the file, which Priam doesn't store anywhere.
I think the distcp code could potentially be useful though.
@danchia That is currently correct. We experimented a long time ago with just decompressing on the fly and letting the job run to the end of the file. It is possible, but the performance when dealing with 100+ GB files was much worse, and so in the end we left our two part process.
As of commit 319b913 Aegisthus can now use s3 as an input directory instead of only hdfs input directories.
We don't use this feature in production. All of our jobs still use the two step process of distcp to hdfs and then running Aegisthus against the uncompressed data on hdfs that Charles mentioned above.