fingltd/4mc

merge to hadoop?

Closed this issue · 9 comments

I'm surprised it's not yet part of the Apache Hadoop project :)
LZO is a pain to index, and it has some licensing issues.
Great project.

Thanks for the good feedback.
Hadoop 2.x ships an LZ4 codec by default, but it cannot be configured for a desired compression ratio and it does not provide any splittability.
I would be happy to see this as a patch to Hadoop 2.x, but so far I have not even been able to get the attention of the ElephantBird guys to work on integrating 4mc into EB to replace LZO.
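
To illustrate what "splittability" means here, a minimal sketch: the input is compressed in independent blocks and the start offset of each block is recorded, so a reader can decompress any range of blocks without reading the ones before it. This is not 4mc's actual on-disk format, which this thread does not describe. `zlib` from the Python standard library stands in for LZ4, and `BLOCK_SIZE`, `compress_blocks` and `read_split` are hypothetical names chosen for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for this sketch


def compress_blocks(data: bytes, level: int = 6):
    """Compress data in independent blocks; return (payload, offsets)."""
    payload = bytearray()
    offsets = []  # byte offset at which each compressed block starts
    for start in range(0, len(data), BLOCK_SIZE):
        offsets.append(len(payload))
        # Each block is a complete zlib stream, so it can be decoded alone.
        payload += zlib.compress(data[start:start + BLOCK_SIZE], level)
    return bytes(payload), offsets


def read_split(payload: bytes, offsets, first: int, last: int) -> bytes:
    """Decompress only blocks first..last-1, skipping everything before them."""
    ends = offsets[1:] + [len(payload)]  # end offset of each block
    out = bytearray()
    for i in range(first, last):
        out += zlib.decompress(payload[offsets[i]:ends[i]])
    return bytes(out)
```

The `level` argument plays the role of the configurable compression ratio mentioned above, and the `offsets` list is what lets separate workers each take their own range of blocks.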

I just emailed the Cloudera folks to have a look and file a JIRA ticket to integrate it.
Hopefully this will get integrated. Thanks a lot!

Thanks!

Please let us know when it is integrated.

Waiting for integration with Hadoop.

ianoc commented

EB as in elephantbird from twitter? Do you have a PR/issue to add support?

(Replacing isn't really an option for something like a serialization library, since people have TBs/PBs of data written with existing formats.)

Yes, sorry, 'replacing' is wrong here; 'add support' makes much more sense.
I got in touch with some EB devs but never got positive feedback on the idea of integration, so I never opened a PR/issue on EB about it.

ianoc commented

I think we'd be fine with the integration, though we at Twitter aren't super likely to use it. I'd like to try it out, but will probably do that outside EB. We have discussed getting off those container formats in EB, so if we were to migrate, it would more likely be to something sequence-file based for ourselves (which handles splitting regardless of compression). But I plan on trying out the extra options and such from 4mc to see how they perform for our existing LZ4 use cases.

Very good, let me know what you think and how you find it.
Moreover, I agree with your approach: using a protobuf container is not the best option from a performance point of view when you already have a super-packet containing other info. In our tests we saw a small performance degradation when moving from our data blocks (compressed with LZ4 anyway) to EB/4mc (also running only in native C++ code). Of course, it was more than acceptable given the scalability we have in the Hadoop/EB architecture and, most of all, given that the EB framework is already coded and bug-free :)

Hi,

I don't think I am in any way connected to this mail.
Please remove me from the notifications.

Regards,
Ravitej

On Mon, Jul 25, 2016 at 5:53 AM, Carlo Medas notifications@github.com
wrote:

Closed #4.


Regards

RaviTej Somayajula