Flamingo HDFS File Uploader - a tool that uploads files from one data source to another and schedules upload jobs.

License: The GNU General Public License v3.0

Flamingo HDFS File Uploader provides a range of features for uploading files between data sources such as the local filesystem, HDFS, and FTP. It also includes a job scheduler, which makes it useful when a data source needs to be processed at regular intervals.

[Features]

  1. Job Scheduler
     - Multiple jobs in XML
       o Multiple <job> tags
     - Scheduling
       o Cron Expression
       o Timezone (default: Asia/Seoul)
       o Start (default: current), End (default: infinite)
       o Job Priority (default: 5)
       o Misfire Instruction (Smart Policy)
  2. Expression Language in XML
     - EL Functions (e.g. ${concat('a', 'b')}, ${dateFormat('yyyyMMdd')}, ${tommorow('yyyyMMdd')}, ...)
     - EL Constants (e.g. ${GB}, ${TB}, ...)
     - System Properties (e.g. ${user.dir}, ...)
     - Global Variables (e.g. ${maxFiles}, ...)
  3. DataSource
     - Local FileSystem
       o File Selector
         - Regular Expression
         - Ant Path Style
         - Start With
         - End With
         - Simple Date Format
     - HDFS
       o Hadoop Cluster Selection
     - FTP (planned)
       o Passive, Active
  4. File processing procedures
     - Local FileSystem
       o Source Directory -> Working Directory
       o Working Directory -> Staging Directory (HDFS)
       o Working Directory -> Error Directory or Complete Directory
     - HDFS
       o Staging Directory -> Target Directory
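
The HDFS side of this procedure is a two-phase delivery: a file is first uploaded to the staging directory and then moved to the target directory, as the example log messages further below also show. As a rough illustration only (this is not the uploader's own code, and the paths are hypothetical), that leg could be written with the standard Hadoop FileSystem API:

  // Minimal sketch, not the uploader's actual implementation. The cluster
  // address is taken from the example configuration; all paths are made up.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class StagedUploadSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("fs.default.name", "hdfs://172.27.21.143:9000");
          FileSystem fs = FileSystem.get(conf);

          Path working = new Path("file:/tmp/slurper/work/rain_20120820-020701.txt.processing");
          Path staging = new Path("/rain/stage/20120820/rain_20120820-020701.txt");
          Path target  = new Path("/rain/upload/2012/08/20/rain_20120820-020701.txt");

          // Working Directory -> Staging Directory (HDFS)
          fs.copyFromLocalFile(working, staging);

          // Staging Directory -> Target Directory (a rename within HDFS)
          fs.mkdirs(target.getParent());
          fs.rename(staging, target);
      }
  }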

[Example]

  <?xml version="1.0" encoding="UTF-8"?>
  <flamingo xmlns="http://www.openflamingo.org/schema/uploader"
            xsi:schemaLocation="http://www.openflamingo.org/schema/uploader flamingo-uploader-1.0.xsd"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <description>Seoul Public Data - Rain Collection Job</description>

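    <!-- Hadoop clusters the uploader can write to; a job's <hdfs cluster="..."> element selects one by name. -->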
    <clusters>
      <cluster name="dev" description="Development Hadoop Cluster">
        <fs.default.name>hdfs://172.27.21.143:9000</fs.default.name>
        <mapred.job.tracker>172.27.21.143:9001</mapred.job.tracker>
      </cluster>
    </clusters>

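    <!-- Global variables can be referenced as ${name} anywhere below; their values may themselves use EL functions (see ${currentDate} in <stagingPath>). -->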
    <globalVariables>
      <globalVariable name="currentDate" value="${dateFormat('yyyyMMdd')}" description="string"/>
    </globalVariables>

    <job name="Seoul_Rain" description="Seoul Public Data - Rain Collection Job">
      <schedule>
        <cronExpression>0 * * * * ?</cronExpression>
      </schedule>
      <policy>
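        <!-- ingress: where files are picked up (local filesystem); outgress: where they are delivered (HDFS staging path, then target path). -->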
        <ingress>
          <local>
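            <!-- File selector: conditionType chooses how <condition> is matched (regular expression, Ant path style, starts with, ends with, or SimpleDateFormat; see Features). -->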
            <sourceDirectory conditionType="antPattern">
              <path>/home/${user.name}/rain</path>
              <condition>rain_*.txt</condition>
            </sourceDirectory>
            <workingDirectory>/tmp/${user.name}/work</workingDirectory>
            <completeDirectory>/tmp/${user.name}/complete</completeDirectory>
            <removeAfterCopy>false</removeAfterCopy>
            <errorDirectory>/tmp/${user.name}/error</errorDirectory>
          </local>
        </ingress>
        <outgress>
          <hdfs cluster="dev">
            <targetPath>/rain/upload/${user.name}/${dateFormat('yyyy')}/${dateFormat('MM')}/${dateFormat('dd')}</targetPath>
            <stagingPath>/rain/stage/${user.name}/${currentDate}</stagingPath>
          </hdfs>
        </outgress>
      </policy>
    </job>
  </flamingo>
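
The <cronExpression> above uses the six-field, seconds-first syntax with a '?' in the day-of-week position; '0 * * * * ?' fires at second 0 of every minute. That syntax matches Quartz's cron format, so, assuming a Quartz-based scheduler (an assumption), an expression can be sanity-checked before it goes into a job definition:

  // Throwaway check of a cron expression, assuming the Quartz library is on the classpath.
  import java.util.Date;
  import org.quartz.CronExpression;

  public class CronSanityCheck {
      public static void main(String[] args) throws Exception {
          // Fields: second minute hour day-of-month month day-of-week
          String expr = "0 * * * * ?";
          System.out.println("Valid: " + CronExpression.isValidExpression(expr));
          CronExpression cron = new CronExpression(expr);
          System.out.println("Next fire time: " + cron.getNextValidTimeAfter(new Date()));
      }
  }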

[Example log messages]

  [Seoul_Rain_201208200208_517544283] --------------------------------------------
  [Seoul_Rain_201208200208_517544283] Starting job 'Seoul_Rain'
  [Seoul_Rain_201208200208_517544283] --------------------------------------------
  [Seoul_Rain_201208200208_517544283] Moved source file 'file:/root/kimbyounggon/rain/rain_20120820-020701.txt' to the working directory as 'file:/tmp/slurper/work/rain_20120820-020701.txt'.
  [Seoul_Rain_201208200208_517544283] The Hadoop cluster to use for the HDFS upload is 'dev'; obtained the cluster's file system.
  [Seoul_Rain_201208200208_517544283] The target directory for the HDFS upload is '/rain_test' and the staging directory is '/rain_stage'.
  [Seoul_Rain_201208200208_517544283] Uploaded the working directory file 'file:/tmp/slurper/work/rain_20120820-020701.txt.processing' to the staging directory as '/rain_stage/20120820004808_1072302239'.
  [Seoul_Rain_201208200208_517544283] Moved the staged file '/rain_stage/20120820004808_1072302239' to '/rain_test/rain_20120820-020701.txt'.
  [Seoul_Rain_201208200208_517544283] File copy complete. Moved the source file 'file:/tmp/slurper/work/rain_20120820-020701.txt.processing' to 'file:/tmp/slurper/complete/rain_20120820-020701.txt'.
  [Seoul_Rain_201208200208_517544283] Completed job 'Seoul_Rain'
  [Seoul_Rain_201208200208_517544283] Total time for job 'Seoul_Rain': 0 seconds (start: 2012-08-20 02:08:00 / end: 2012-08-20 02:08:00).

[Run procedure]

  1. Run 'git clone https://github.com/fharenheit/flamingo-hdfs-file-uploader.git'
  2. Run 'mvn clean package dependency:copy-dependencies assembly:assembly'
     or Run 'sh assembly.sh'
  3. Run 'unzip target/flamingo-hdfs-file-uploader-0.x-bin.zip'
     or 'tar xvfz target/flamingo-hdfs-file-uploader-0.x-bin.tar.gz'
  4. Edit 'flamingo-hdfs-file-uploader-0.x/conf/example.xml'
  5. Run 'cd flamingo-hdfs-file-uploader-0.x;sh start.sh'

[Assembly procedure for distribution]

  1. Run 'git clone https://github.com/fharenheit/flamingo-hdfs-file-uploader.git'
  2. Edit pom.xml and change version (from 1.x-SNAPSHOT to 1.x)
  3. Run 'mvn clean package dependency:copy-dependencies assembly:assembly'
     or Run 'sh assembly.sh'

[Maven Release procedure]

  1. Run 'git clone https://github.com/fharenheit/flamingo-hdfs-file-uploader.git'
  2. Edit pom.xml and change <distributionManagement> so that its repository <id> values match the <servers> entries below (e.g. 'snapshots' and 'releases')
  3. Edit $USER_HOME/.m2/settings.xml and add a <servers> section:
    <?xml version="1.0" encoding="UTF-8"?>
    <settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                                  http://maven.apache.org/xsd/settings-1.0.0.xsd">

      <servers>
        <server>
          <id>snapshots</id>
          <username>YOUR_USERNAME</username>
          <password>YOUR_PASSWORD</password>
        </server>
        <server>
          <id>releases</id>
          <username>YOUR_USERNAME</username>
          <password>YOUR_PASSWORD</password>
        </server>
      </servers>
    </settings>
  4. Run 'mvn -DperformRelease=true clean deploy' to deploy the distribution to the Maven repository.