/bin2seq

A suite of code to demonstrate big binary data processing in Hadoop, e.g., images, netcdf, hdfs5, etc.

Primary LanguageJavaApache License 2.0Apache-2.0

It is a prerequisite to convert files in arbitrary binary formats into a Hadoop SequenceFile format,
so they can be split by M/R jobs and processing can be parallelized to distribute nodes.  
 
The 'bin2seq' borrows ideas from 'tar2seq' developed by [Stuart Sierra](http://stuartsierra.com/2008/04/24/a-million-little-files)

Thanks to the latest Hadoop development like multiple NameNode-support and HDFS-federation (e.g., HDFS-1052), 
the “large-number-of-small-files” does not pose the same challenge as when 'tar2seq' was developed (prior to Hadoop v0.22).  
Therefore, bin2seq is a much simpler because we are performing one-binary-file-to-one-sequencefile (bin2seq) 
conversion, and the benefit of do so is to maintain the same file structure and nomenclature across 
traditional file system and HDFS.  

Code-wise, there is little similarity between bin2seq and tar2seq. bin2seq is built by Maven.