kafka-to-hdfs

Pipe the messages from a Kafka topic into HDFS.

Outline

This is far simpler than LinkedIn's Camus, but it does the job for simple loads from Kafka into HDFS. It is useful for capturing a live sample of a Kafka topic over a period of time so that you can run MapReduce on it later.

Usage

Given a Kafka topic, the program streams its messages into HDFS at the path given by --output.path. It starts a new file each time the number of messages set by --messages.per.file has been written.
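
The file-rolling behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the program's actual code: the function name `roll_messages`, the numeric-suffix file naming, the one-message-per-line layout, and the reading of --flush.size as "flush after this many messages" are all assumptions, and an in-memory file stands in for HDFS.

```python
import io


def roll_messages(messages, base_path, messages_per_file, flush_size, open_file):
    """Write messages one per line, starting a new file every
    `messages_per_file` messages. Returns the list of paths created."""
    paths = []
    out = None
    written_in_file = 0
    for message in messages:
        if out is None:
            # Hypothetical naming scheme: append a counter to the base path.
            path = f"{base_path}.{len(paths)}"
            out = open_file(path)
            paths.append(path)
        out.write(message + "\n")
        written_in_file += 1
        # Assumed meaning of --flush.size: flush every N messages.
        if written_in_file % flush_size == 0:
            out.flush()
        # Roll over once the per-file quota has been written.
        if written_in_file == messages_per_file:
            out.close()
            out = None
            written_in_file = 0
    if out is not None:
        out.close()
    return paths


class MemoryFile(io.StringIO):
    """In-memory stand-in for an HDFS output stream."""

    def close(self):
        self.saved = self.getvalue()  # keep contents readable after close
        super().close()


files = {}


def open_in_memory(path):
    files[path] = MemoryFile()
    return files[path]


paths = roll_messages(
    (f"msg-{i}" for i in range(25)),
    base_path="tweets.txt",
    messages_per_file=10,
    flush_size=5,
    open_file=open_in_memory,
)
print(paths)
```

With 25 messages and a quota of 10 per file, the sketch produces two full files and a third, partial file holding the remaining 5 messages.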

Required arguments

--zookeeper.host     e.g. zookeeper1
--output.path        e.g. hdfs://hdfs-cluster:9000/user/me/tweets.txt
--hdfs.site.xml      e.g. /Users/me/libraries/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
--core.site.xml      e.g. /Users/me/libraries/hadoop-2.6.0/etc/hadoop/core-site.xml
--kafka.topic        e.g. query.mentions
--flush.size         e.g. 100
--messages.per.file  e.g. 10000
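
To show how these arguments fit together, here is a small Python sketch that mirrors the flags above with `argparse`. It is not the program's own parser, which is not shown here; `build_parser` is a hypothetical helper, and treating --flush.size and --messages.per.file as integers is an assumption based on the example values.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the required arguments listed above."""
    parser = argparse.ArgumentParser(prog="kafka-to-hdfs")
    string_flags = (
        "zookeeper.host",
        "output.path",
        "hdfs.site.xml",
        "core.site.xml",
        "kafka.topic",
    )
    for flag in string_flags:
        # Store under an underscore name so values are plain attributes.
        parser.add_argument(f"--{flag}", dest=flag.replace(".", "_"), required=True)
    for flag in ("flush.size", "messages.per.file"):
        # Assumed to be integer counts, as the example values suggest.
        parser.add_argument(
            f"--{flag}", dest=flag.replace(".", "_"), type=int, required=True
        )
    return parser


# Example values taken from the list above.
args = build_parser().parse_args([
    "--zookeeper.host", "zookeeper1",
    "--output.path", "hdfs://hdfs-cluster:9000/user/me/tweets.txt",
    "--hdfs.site.xml", "/Users/me/libraries/hadoop-2.6.0/etc/hadoop/hdfs-site.xml",
    "--core.site.xml", "/Users/me/libraries/hadoop-2.6.0/etc/hadoop/core-site.xml",
    "--kafka.topic", "query.mentions",
    "--flush.size", "100",
    "--messages.per.file", "10000",
])
print(args.kafka_topic, args.flush_size, args.messages_per_file)
```

Because every flag is marked required, omitting any one of them makes the parser exit with a usage error.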
