Cassandra 2.x support
charsmith opened this issue · 15 comments
Cassandra moves on and we need to upgrade. This task needs a little research and at least a dependency change.
Is this on the radar at all? We have committed to Cassandra 2.x and we're now looking at ways to do reporting that don't necessarily impact the runtime schema of the data.
This is on my radar. I have had limited time to work on this so I have been prioritizing things that Netflix needs in our current workflow, but 2.x support is actually coming faster than I expected. I don't believe that we are going to need a code change for it, just a change to the dependencies, but I need to test that.
@charsmith There are a couple of code changes that need to be made. I'm working on this this week and will keep you posted.
Awesome. Our timeline is not until Oct, so I haven't started the debugging yet.
I think I have it working =)
@charsmith I haven't tested this, but a rough sketch will look like:
https://github.com/coursera/aegisthus/compare/cassandra-20?expand=1
We'd have to drop 1.1 support, as I don't see an easy way to make it work. Also, the lack of rowsizes within the Data file is a huge pain - the code to read columns from a row is kinda hacky in order to support dropping bad columns past a certain size.
@danchia I merged your code into the branch where I'm working on Cassandra 2.0.x support. Hopefully I'll get that finished up next week. Thanks for the help.
@danielbwatson what's the status of your C* 2.0 branch? I'm planning to update Aegisthus to the new columnar version here at Coursera this week, and I'm happy to test it out / fix any outstanding issues. Let me know how we can collaborate on this!
Looking at it, the version on github doesn't really do C* 2.0 yet. I'm going to base my work off your c2_support branch.
Right, the columnar code that is currently in master will not work with C* 2.0 yet.
Charles wanted to deprecate the old scanner and output formats, but there was a lot of testing overhead and conceptual overhead to keep in mind with all the places Aegisthus was reading and writing SSTables.
The work in my c2_support branch you referenced cleans up the code and I think simplifies a lot of parts. In particular (for anyone else who hasn't looked at it yet):
From the commit message:
- Removed the non-columnar reducer and all of its input and output formats.
- The reducer now emits a BytesWritable key (the bytes of the row id) and a RowWritable value. The RowWritable represents a row and all its columns.
- All outputting and formatting is now done in the OutputFormatter classes.
- All configuration options are now in Aegisthus.Feature
- All code in aegisthus-hadoop has now been consistently formatted for tabs and spacing.
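For anyone who wants a feel for the key/value shape described in the commit message, here is a minimal sketch of a row value that serializes itself with the same `write(DataOutput)` / `readFields(DataInput)` contract that Hadoop's `Writable` interface uses. It deliberately avoids the Hadoop dependency so it runs standalone. `RowSketch`, `addColumn`, and `roundTrip` are hypothetical names for illustration only; this is not the actual RowWritable from the c2_support branch, and the byte layout is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a "row and all its columns" value.
// The layout (count, then length-prefixed name/value pairs) is assumed.
class RowSketch {
    final List<byte[]> names = new ArrayList<>();
    final List<byte[]> values = new ArrayList<>();

    void addColumn(byte[] name, byte[] value) {
        names.add(name);
        values.add(value);
    }

    // Same method shape as Writable.write: serialize all columns.
    void write(DataOutput out) throws IOException {
        out.writeInt(names.size());
        for (int i = 0; i < names.size(); i++) {
            writeBytes(out, names.get(i));
            writeBytes(out, values.get(i));
        }
    }

    // Same method shape as Writable.readFields: reset, then repopulate.
    void readFields(DataInput in) throws IOException {
        names.clear();
        values.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            names.add(readBytes(in));
            values.add(readBytes(in));
        }
    }

    private static void writeBytes(DataOutput out, byte[] b) throws IOException {
        out.writeInt(b.length);
        out.write(b);
    }

    private static byte[] readBytes(DataInput in) throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return b;
    }

    // Serialize a row to bytes and read it back into a fresh instance.
    static RowSketch roundTrip(RowSketch row) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            row.write(new DataOutputStream(buf));
            RowSketch copy = new RowSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RowSketch row = new RowSketch();
        row.addColumn(new byte[] {1}, new byte[] {2, 3});
        System.out.println(roundTrip(row).names.size());
    }
}
```

In a real job the key would be a Hadoop `BytesWritable` wrapping the row id bytes and the value class would implement `Writable` directly.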
I forgot to mention:
- Added the ability to specify the converter for the column value instead of always using BytesType.
- All JSON is now formatted by the Jackson streaming JSON generator.
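To illustrate what a pluggable column value converter could look like, here is a rough standalone sketch. It assumes the converter is chosen by a type name and turns the raw value bytes into a string: hex for the default (which is how raw bytes are typically rendered) and decoded text for UTF-8 values. `ColumnValueConverters`, `forName`, `toHex`, and `toUtf8` are hypothetical names, not the API in the branch, and the real code would presumably delegate to Cassandra's own type classes rather than reimplementing them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical converter lookup; names and behavior are assumptions.
class ColumnValueConverters {
    // Render raw bytes as lowercase hex.
    static String toHex(ByteBuffer value) {
        ByteBuffer copy = value.duplicate(); // leave the caller's position untouched
        StringBuilder sb = new StringBuilder();
        while (copy.hasRemaining()) {
            sb.append(String.format("%02x", copy.get() & 0xff));
        }
        return sb.toString();
    }

    // Decode the bytes as UTF-8 text.
    static String toUtf8(ByteBuffer value) {
        return StandardCharsets.UTF_8.decode(value.duplicate()).toString();
    }

    // Pick a converter by type name, falling back to hex for anything unknown.
    static Function<ByteBuffer, String> forName(String typeName) {
        if ("UTF8Type".equals(typeName)) {
            return ColumnValueConverters::toUtf8;
        }
        return ColumnValueConverters::toHex;
    }

    public static void main(String[] args) {
        ByteBuffer raw = ByteBuffer.wrap("hi".getBytes(StandardCharsets.UTF_8));
        System.out.println(forName("BytesType").apply(raw)); // prints 6869
        System.out.println(forName("UTF8Type").apply(raw));  // prints hi
    }
}
```

Falling back to hex keeps the old always-BytesType behavior as the default, so only jobs that opt in to a specific converter would see different output.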
@danchia All that said I'd be happy to collaborate more closely on this with you. I wanted to look at your KV Mapper and see how big an impact my changes would have on you but I haven't had a chance yet. We can either coordinate through issues or you can email me (my email username is dwatson).
I'm going to redo the KVMapper too, so I wouldn't worry about it.
As a strawman proposal, I'm happy to attempt to get C* 2.0 support working with your c2_support branch. I have it working for the old code, so I should be able to massage things into place. I'll send a PR your way soon.
I think a lot of the simplifications in the commit message are very welcome. No more tabs is welcome too :)
Ah, I messed up while looking at branches. The c2_support branch seems to have the necessary changes in place.
I'll let you know if I find anything in testing.
Can you guys confirm whether there is support for commit log readers with Cassandra 2.x in the c2_support branch of Aegisthus yet? I don't see any changes, but I just want to know if there are any plans to provide such support in the future. Advice is much appreciated.
We haven't updated the commit log reader yet as we no longer use it in our pipeline. @danielbwatson Can we merge the 2.0 pull request and then open the commit log reader as a separate issue? I believe the change should be a matter of refactoring the deserialization into something we can use for commit logs as well.
@narsing3 We originally wrote the commit log reader because we thought it would allow us to do point-in-time pulls of a cluster without needing a flush, which is still true. The problem for us is that our clusters are write heavy, so we would have to be constantly flushing the commit logs out to S3 (which is how we manage this flow). So we have remained content to pull from the SSTables themselves, and that is why the reader isn't supported for 2.x.