The process can be summarized in the following steps:
- Ingest the data (CSV) from an SFTP server.
- Process it with Spark and store it in Parquet/Hive, keeping both the original data and the transformed version.
- Store the results back on the SFTP server.
There are currently two implementations:
- Spark with an SFTP connector
- Spark with Hadoop's SFTP FileSystem Support
This is a working prototype, `ReadingSFTPConnectorApp`, based on the springml connector: https://github.com/springml/spark-sftp
This solution is currently the fastest of the two.
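As an illustration of the three pipeline steps above, a minimal sketch using the springml connector might look like the following; the host, credentials, table names and file paths are placeholders, not the project's actual configuration:

```scala
import org.apache.spark.sql.SparkSession

object SftpConnectorSketch extends App {
  val spark = SparkSession.builder()
    .appName("sftp-connector-sketch") // placeholder app name
    .enableHiveSupport()
    .getOrCreate()

  val sftpUser = sys.env.getOrElse("SFTP_USER", "user")
  val sftpPass = sys.env.getOrElse("SFTP_PASS", "password")

  // 1. Ingest the CSV from the SFTP server.
  val raw = spark.read
    .format("com.springml.spark.sftp")
    .option("host", "sftp.example.com")
    .option("username", sftpUser)
    .option("password", sftpPass)
    .option("fileType", "csv")
    .option("inferSchema", "true")
    .load("/data/input.csv")

  // 2. Store the original data and a transformed version as Parquet-backed Hive tables.
  raw.write.mode("overwrite").format("parquet").saveAsTable("raw_data")
  val transformed = raw.dropDuplicates()
  transformed.write.mode("overwrite").format("parquet").saveAsTable("clean_data")

  // 3. Write the results back to the SFTP server.
  transformed.write
    .format("com.springml.spark.sftp")
    .option("host", "sftp.example.com")
    .option("username", sftpUser)
    .option("password", sftpPass)
    .option("fileType", "csv")
    .save("/data/output.csv")
}
```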
With this solution (`ReadingSFTPHadoopApp`) we use Hadoop's native SFTP support. The implementation in the current Hadoop releases contains some bugs, and the existing patches have not been merged into the released Hadoop libraries, so we have included the patched version in our code.
Issues about native support of SFTP in Hadoop:
- https://issues.apache.org/jira/browse/HADOOP-5732
- https://issues.apache.org/jira/browse/HADOOP-14444
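A minimal sketch of the Hadoop-based approach, assuming the standard `org.apache.hadoop.fs.sftp.SFTPFileSystem` class name and placeholder host, credentials and paths:

```scala
import org.apache.spark.sql.SparkSession

object SftpHadoopSketch extends App {
  val spark = SparkSession.builder()
    .appName("sftp-hadoop-sketch") // placeholder app name
    .getOrCreate()

  // Point the sftp:// scheme at the (patched) SFTPFileSystem bundled with this project.
  spark.sparkContext.hadoopConfiguration
    .set("fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem")

  val sftpUser = sys.env.getOrElse("SFTP_USER", "user")
  val sftpPass = sys.env.getOrElse("SFTP_PASS", "password")

  // Credentials are carried in the URI; Spark reads the CSV through Hadoop's FileSystem API.
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(s"sftp://$sftpUser:$sftpPass@sftp.example.com/data/input.csv")

  df.show()
}
```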
To run the apps locally with `sbt run`, you need to add the master setting to the `SparkConf`: `.set("spark.master", "local[*]")`.
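For example (the app name below is a placeholder):

```scala
import org.apache.spark.SparkConf

// Only needed for local runs via `sbt run`; on a cluster the master is supplied by spark-submit.
val conf = new SparkConf()
  .setAppName("sparksftp") // placeholder app name
  .set("spark.master", "local[*]")
```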
Run `sbt assembly` to build the .jar.
Then submit the job, for example: `PATH-TO-SPARK/bin/spark-submit --class org.fortysevendeg.sparksftp.ReadingSFTPHadoopApp --driver-class-path $(/PATH-TO-HADOOP/bin/hadoop classpath) --conf 'spark.executor.extraJavaOptions=-Dspark.executorEnv.SFTP_USER=XXXX' --driver-java-options="-Dspark.executorEnv.SFTP_USER=XXX" --files sftp.conf ./target/scala-2.11/sparksftpTest-assembly-0.0.1.jar`
You can use the scripts `gcloudsubmit1.sh` and `gcloudsubmit2.sh` as examples for submitting the Spark jobs to a Google Cloud Dataproc cluster.
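For reference, a hypothetical sketch of what such a submit command may look like; the cluster name, region, bucket and jar path are placeholders, not the values used in the actual scripts:

```bash
# Submit the assembled jar to a Dataproc cluster (placeholder names throughout).
gcloud dataproc jobs submit spark \
  --cluster=my-dataproc-cluster \
  --region=europe-west1 \
  --class=org.fortysevendeg.sparksftp.ReadingSFTPHadoopApp \
  --jars=gs://my-bucket/sparksftpTest-assembly-0.0.1.jar \
  --files=sftp.conf
```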