Collection of utilities for managing data on Hadoop powered by Apache Spark.
Tested on an HDInsight cluster using the JAR at http://go.microsoft.com/fwlink/?LinkID=723585&clcid=0x409.
The Avro Folder Defragmenter takes multiple small Avro files with the same or compatible schemas and merges them into a smaller number of larger files, which is better suited to Hadoop.
Some of the features of this Defragmenter are:
- Can overwrite the source folder in place (specify a target path that is the same as the source folder).
- Moves the existing target to a trash folder if overwrite is enabled.
- Can work with an external Avro schema. If no Avro schema file is specified, the utility picks the schema from the latest Avro file in the folder (the file with the most recent last-modification date). The new files written by the defragmentation get that schema.
- Right before writing the target, the Defragmenter verifies that the source is in exactly the same state as when the job first started reading the data for defragmentation. If the source was modified after that checkpoint, the process aborts the final target overwrite.
- The source must contain at least 2 Avro files.
- Can work on partitioned (multilevel) folders. The target folder is created with the same folder structure.
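
The checkpoint and "latest file" behaviour listed above can be illustrated with a minimal Python sketch. It is not the utility's actual code: it walks a local folder with `os` as a stand-in for HDFS, and the function names `snapshot`, `latest_file`, and `source_unchanged` are hypothetical.

```python
import os


def snapshot(folder):
    """Map each .avro file (path relative to folder) to (size, mtime_ns)."""
    state = {}
    # os.walk descends into sub-folders, so partitioned layouts are covered.
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.endswith(".avro"):
                path = os.path.join(root, name)
                info = os.stat(path)
                rel = os.path.relpath(path, folder)
                state[rel] = (info.st_size, info.st_mtime_ns)
    return state


def latest_file(state):
    """Relative path of the most recently modified file in a snapshot."""
    return max(state, key=lambda rel: state[rel][1])


def source_unchanged(checkpoint, folder):
    """True only if no file was added, removed, resized, or re-written."""
    return checkpoint == snapshot(folder)
```

In this sketch a job would call `snapshot` before reading, use `latest_file` to choose where the schema comes from when none is supplied, and call `source_unchanged` immediately before the final overwrite.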
** In Progress: Logic to calculate the number of partitions dynamically based on data size
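
One possible way to derive an output file count from data size is sketched below in Python. This is only an assumption about how such logic could work, not the project's planned implementation: the function name `estimate_file_count` and the 128 MB target size per file are hypothetical choices.

```python
import math

# Assumed target size per output file; pick a value that suits the cluster.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def estimate_file_count(file_sizes, target_bytes=DEFAULT_TARGET_BYTES):
    """Number of output files needed so each is at most target_bytes."""
    total = sum(file_sizes)
    # Always produce at least one file, even for an empty or tiny input.
    return max(1, math.ceil(total / target_bytes))
```

The result could be passed as the value of `--fileCount` instead of a hand-picked number.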
DefragmentAvroFolder 1.0 Usage: DefragmentAvroFolder [options]
--sourceFolder
#### Sample command line arguments for spark-submit (DefragmentAvroFolder):
--sourceFolder /data/AvroFolderCompactor/avrodata --targetFolder /data/AvroFolderCompactor/avrodata --avroSchema /data/AvroFolderCompactor/schema/airline.avsc --fileCount 2 --runningLocally --overwriteTarget --trashFolder /data/AvroFolderCompactor/trash --tmpFolder /data/AvroFolderCompactor/tmp
1.2. Source and target can be different (the existing target is moved to trash first, then overwritten at the end):
--sourceFolder /data/AvroFolderCompactor/source/avrodata --targetFolder /data/AvroFolderCompactor/target/avrodata --avroSchema /data/AvroFolderCompactor/schema/airline.avsc --fileCount 2 --runningLocally --overwriteTarget --trashFolder /data/AvroFolderCompactor/trash --tmpFolder /data/AvroFolderCompactor/tmp
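
The overwrite sequence implied by `--overwriteTarget`, `--trashFolder`, and `--tmpFolder` might look like the following Python sketch on a local file system. It assumes the new files are first written under the tmp folder and then promoted, which the options suggest but this README does not spell out. The function name `replace_target` and the timestamp suffix on trashed folders are hypothetical.

```python
import os
import shutil
import time


def replace_target(tmp_output, target, trash_folder):
    """Move an existing target into the trash, then promote the new output.

    Returns the trashed path, or None if there was no previous target.
    """
    target = os.path.normpath(target)
    trashed = None
    if os.path.exists(target):
        os.makedirs(trash_folder, exist_ok=True)
        # Timestamp suffix (an assumption) keeps earlier trashed copies apart.
        trashed = os.path.join(
            trash_folder, f"{os.path.basename(target)}_{time.time_ns()}"
        )
        shutil.move(target, trashed)
    # Make sure the parent of the target exists before the final move.
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    shutil.move(tmp_output, target)
    return trashed
```

Moving the old target aside rather than deleting it means a failed run can be recovered by moving the trashed folder back.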