How to run the pipeline

All commands assume that Spark is correctly installed and that spark-submit is available on your $PATH.
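
A quick way to confirm the prerequisite above is to check that the required executables resolve on your $PATH before building. The sketch below is only an illustration: require_cmd is a hypothetical helper name, not something shipped with this project.

```shell
# Hypothetical helper: succeeds if the named command is found on $PATH,
# otherwise prints an error to stderr and returns a non-zero status.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "error: '$1' not found on \$PATH" >&2
  return 1
}

# Example usage before building and submitting the job:
#   require_cmd spark-submit && require_cmd sbt
```

If either check fails, install the missing tool or add its bin directory to $PATH before running the commands in the following sections.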

Local pipeline run with provided test files

All test files are located in the ./test_data directory. To run the Spark pipeline, use the following commands:

# build project and produce fat jar file
sbt clean assembly

# submit the Spark job using this command
spark-submit \
  --class "com.spark.home.assignment.S3App" \
  --master "local[*]" \
  target/scala-2.13/s3-app.jar \
    --input ./test_data \
    --output ./target/result.tsv

Local pipeline run with your data folder

# build project and produce fat jar file
sbt clean assembly

# submit the Spark job using this command
spark-submit \
  --class "com.spark.home.assignment.S3App" \
  --master "local[*]" \
  target/scala-2.13/s3-app.jar \
    --input /your/input/data/directory \
    --output /your/result/file/path.tsv

Run pipeline on S3 bucket

First, provide valid credentials in your credentials file, located at ~/.aws/credentials. By default, the pipeline uses the default profile. To use a different credentials file, pass the --credentials option with the full path to that file. The full command is below:

# submit the Spark job using this command
spark-submit \
  --class "com.spark.home.assignment.S3App" \
  --master "local[*]" \
  target/scala-2.13/s3-app.jar \
    --input s3n://data-processing-spark/input \
    --output s3n://data-processing-spark/output/result.tsv \
    --credentials ~/.aws/credentials
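
For reference, a credentials file in the standard AWS shared-credentials format looks roughly like the sketch below. The key values are placeholders, and it is an assumption that the pipeline reads the file in this standard format. The [default] section corresponds to the default profile mentioned above.

```ini
# ~/.aws/credentials (placeholder values, replace with your own keys)
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

Keep this file out of version control, since it contains secrets.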



To execute the unit tests, run:

sbt test