ScalaSparkSample


# This is boilerplate Spark/Scala code, with sbt and ScalaTest
# To run this on Windows, install the Hadoop common binaries (winutils.exe)
# To run on a cluster, use spark-submit
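
For reference, a minimal sketch of what a boilerplate Spark/Scala entry point can look like (the object name and data here are illustrative only; the real classes in this repo live under com.eva.app):

import org.apache.spark.sql.SparkSession

// Hypothetical minimal entry point, not the repo's actual code
object BoilerplateApp {
  def main(args: Array[String]): Unit = {
    // "local[*]" is handy for laptop runs; drop .master(...) when submitting to a cluster
    val spark = SparkSession.builder()
      .appName("BoilerplateApp")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.show()

    spark.stop()
  }
}
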
# Step 1: set up the code on your machine (install git)
git clone https://github.com/bjaggi/SparkScalaDemo.git

# Step 2: build the project (install sbt)
sbt package or sbt run
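
A rough sketch of what a build.sbt for this kind of project can look like (the versions below are assumptions for illustration; check the repo's actual build.sbt):

// build.sbt (illustrative versions only)
name := "sparkdemo"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.3.0" % "provided",
  "org.scalatest"    %% "scalatest"  % "3.0.5" % Test
)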

# Step 2.1: to work on this in Eclipse, generate the project files with
sbt eclipse
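
The sbt eclipse command comes from the sbteclipse plugin; if it is not already configured in this repo, it is typically enabled in project/plugins.sbt along these lines (the plugin version is an assumption):

// project/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.2.4")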

# Step 3: change code (TODO: add unit tests)
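
A minimal ScalaTest sketch of the kind of unit test that could be added (the suite name, data, and assertion are illustrative only; real tests would exercise this repo's own transformations):

import org.scalatest.FunSuite

// Illustrative test only
class SampleSpec extends FunSuite {
  test("a simple join keeps only matching keys") {
    val left  = Seq((1, "a"), (2, "b"))
    val right = Seq((2, "x"), (3, "y"))
    val joined = for {
      (lk, lv) <- left
      (rk, rv) <- right
      if lk == rk
    } yield (lk, lv, rv)
    assert(joined == Seq((2, "b", "x")))
  }
}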

# Step 4: review all changed code
git status

# Step 5: commit code with a message
git commit -am "comments for code changed"

# Step 6: run on Spark
spark-submit --class com.eva.app.HiveJoin --master local sparkdemo.jar
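
The HiveJoin class itself is in this repo; as a rough idea only, a Hive-backed join job in Spark generally looks something like the sketch below (the table and column names are made up, see com.eva.app.HiveJoin for the real job):

import org.apache.spark.sql.SparkSession

// Illustrative only, not the repo's actual HiveJoin
object HiveJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveJoinSketch")
      .enableHiveSupport()   // lets spark.sql(...) read Hive tables
      .getOrCreate()

    // Hypothetical table and column names
    val joined = spark.sql(
      """SELECT o.order_id, c.customer_name
        |FROM orders o
        |JOIN customers c ON o.customer_id = c.customer_id""".stripMargin)

    joined.show()
    spark.stop()
  }
}
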

# Step 7: to build a package, go to the project root and run
sbt package

# Step 8: run the tests with sbt
sbt test

# Step 9: run via sbt (not fully configured yet; probably not required either)
sbt run
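
If sbt run complains about a missing or ambiguous main class, the usual fix is to pin the main class in build.sbt, roughly like this (the class name is assumed from step 6):

// build.sbt
mainClass in (Compile, run) := Some("com.eva.app.HiveJoin")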

Ballpark config (a code sketch using these numbers follows the list)
1> 4-6 cores per executor
1.a> With 90 cores on the cluster and 5 cores per executor, we end up with 90/5 = 18 executors
2> Leave one executor's worth of resources (5 cores) for the YARN ApplicationMaster, which leaves 17 executors
3> Partition size is usually 128 MB
4> Each node gets 3 executors (on a 15-core node)
5> 63 GB / 3 = 21 GB per executor; budget about 19 GB each after subtracting the off-heap/memory overhead
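
Putting the same ballpark numbers into code, a submit-time configuration could look roughly like this (these values are just the worked example above, not a recommendation for every cluster):

import org.apache.spark.sql.SparkSession

// Ballpark from above: 90 cores / 5 cores per executor = 18 executors,
// minus 1 for the YARN ApplicationMaster = 17; ~19 GB heap each after overhead
val spark = SparkSession.builder()
  .appName("BallparkConfigExample")
  .config("spark.executor.instances", "17")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .getOrCreate()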



Good link (adding a new column to a DataFrame by using the values of multiple other columns):
https://stackoverflow.com/questions/47445328/adding-a-new-column-to-a-dataframe-by-using-the-values-of-multiple-other-columns?rq=1&utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa