The project has been released for Spark 1.6. The branch name is spark-1.6.
To build it, run the following commands:
git clone -b spark-1.6 <git.repository>
cd spark_ocr_tesseract
mvn clean package
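Before submitting, it can help to confirm that the build produced the jar and that it contains the main class used by the spark-submit commands in this file. The following is only a sketch: it assumes `unzip` is installed, and `class_entry` is a hypothetical helper introduced here for illustration.

```shell
#!/usr/bin/env bash
# Convert a fully qualified class name into its entry path inside a jar.
# (class_entry is a hypothetical helper, not part of the project.)
class_entry() {
  echo "$(echo "$1" | tr '.' '/').class"
}

JAR=target/job-ocr-pdf-1.0.jar
MAIN_CLASS=com.jene.cognitive.OcrTess4jSpark

if [ -f "$JAR" ]; then
  # List the jar contents and look for the compiled main class.
  if unzip -l "$JAR" | grep -q "$(class_entry "$MAIN_CLASS")"; then
    echo "OK: $MAIN_CLASS found in $JAR"
  else
    echo "MISSING: $MAIN_CLASS not found in $JAR"
  fi
else
  echo "Jar not built yet: $JAR"
fi
```

Run it from the spark_ocr_tesseract directory after `mvn clean package`.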
The following commands run the job on the cluster:
----------------SPARK DYNAMIC ALLOCATION---------- MOST STABLE: 34 TASKS, 18 CONTAINERS, 35 CORES, 140800 MB MEMORY ALLOCATED
spark-submit --master yarn-cluster --class "com.jene.cognitive.OcrTess4jSpark" \
  --driver-memory 1g --driver-cores 1 --executor-memory 3g --executor-cores 2 \
  --driver-java-options "-Denv=dev -Djna.library.path=$LD_LIBRARY_PATH:/usr/local/lib64 -Djava.library.path='$LD_LIBRARY_PATH:/usr/local/lib64' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.yarn.executor.memoryOverhead=5046 \
  --conf spark.rdd.compress=true \
  --conf spark.broadcast.compress=true \
  --conf "spark.executor.extraJavaOptions=-Djna.library.path=$LD_LIBRARY_PATH:/usr/local/lib64" \
  --conf "spark.yarn.am.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf "spark.executor.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf "spark.driver.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --conf spark.dynamicAllocation.enabled=true \
  target/job-ocr-pdf-1.0.jar
----------------NO SPARK DYNAMIC ALLOCATION---------- MOST STABLE: 34 TASKS, 23 CONTAINERS, 45 CORES, 181760 MB MEMORY ALLOCATED
spark-submit --master yarn-cluster --class "com.jene.cognitive.OcrTess4jSpark" \
  --driver-memory 1g --driver-cores 1 --executor-memory 3g --executor-cores 2 --num-executors 40 \
  --driver-java-options "-Denv=dev -Djna.library.path=$LD_LIBRARY_PATH:/usr/local/lib64 -Djava.library.path='$LD_LIBRARY_PATH:/usr/local/lib64' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf spark.yarn.executor.memoryOverhead=5046 \
  --conf spark.rdd.compress=true \
  --conf spark.broadcast.compress=true \
  --conf "spark.executor.extraJavaOptions=-Djna.library.path=$LD_LIBRARY_PATH:/usr/local/lib64" \
  --conf "spark.yarn.am.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf "spark.executor.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf "spark.driver.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib64" \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  target/job-ocr-pdf-1.0.jar
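The container, core and memory figures in the two headers above can be reproduced with some simple arithmetic, which may help when tuning --executor-memory or memoryOverhead. The sketch below rests on assumptions that are not stated in this file: the memory figures are in MB, one container is the driver (application master) and the rest are executors, the driver overhead is Spark's 384 MB minimum, and YARN rounds each request up to a multiple of 512 MB. The 512 MB increment is inferred from the totals, not read from the cluster configuration.

```shell
#!/usr/bin/env bash
# Round (heap + overhead) up to the next multiple of the YARN allocation increment.
# $1 = heap in MB, $2 = overhead in MB, $3 = increment in MB
container_mb() {
  local requested=$(( $1 + $2 ))
  echo $(( (requested + $3 - 1) / $3 * $3 ))
}

INCREMENT=512  # assumption: inferred from the reported totals

# --executor-memory 3g plus spark.yarn.executor.memoryOverhead=5046
executor_mb=$(container_mb 3072 5046 "$INCREMENT")
# --driver-memory 1g plus the assumed 384 MB minimum driver overhead
driver_mb=$(container_mb 1024 384 "$INCREMENT")

# Totals for N executors: N executor containers plus one driver container.
total_mb()    { echo $(( $1 * executor_mb + driver_mb )); }
# --executor-cores 2 per executor plus --driver-cores 1
total_cores() { echo $(( $1 * 2 + 1 )); }

echo "executor container: ${executor_mb} MB"
echo "driver container:   ${driver_mb} MB"
echo "17 executors: $(total_mb 17) MB, $(total_cores 17) cores"
echo "22 executors: $(total_mb 22) MB, $(total_cores 22) cores"
```

Under these assumptions, 17 executors give 18 containers, 35 cores and 140800 MB, and 22 executors give 23 containers, 45 cores and 181760 MB, matching the two headers. Note that the second command requests 40 executors; the reported figures would correspond to 22 being granted, presumably because of cluster capacity.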
----------------KILL YARN APP----------
yarn application -kill application_XXXXXXXX_XXXX
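The application id for the kill command has to be looked up first. One possible way, sketched below, is to filter the output of `yarn application -list`. It assumes the job appears in YARN under a name containing the main class (OcrTess4jSpark), which is typically the case when no --name is passed to spark-submit; `app_ids` is a hypothetical helper introduced here.

```shell
#!/usr/bin/env bash
# Extract unique YARN application ids from whatever text arrives on stdin.
# (app_ids is a hypothetical helper, not part of the project.)
app_ids() {
  grep -oE 'application_[0-9]+_[0-9]+' | sort -u
}

# Only query the cluster when the yarn CLI is available on this machine.
if command -v yarn >/dev/null 2>&1; then
  # Assumption: the application name contains the main class name.
  yarn application -list -appStates RUNNING 2>/dev/null \
    | grep 'OcrTess4jSpark' \
    | app_ids
fi
```

Each printed id can then be passed to `yarn application -kill`.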
----------------TEST PURPOSES----------
spark-submit --master "local[*]" --class "com.jene.cognitive.OcrTess4jSpark" \
  --driver-memory 8g \
  --driver-java-options "-Denv=dev -Djna.library.path=$LD_LIBRARY_PATH:/usr/local/lib64 -Djava.library.path='$LD_LIBRARY_PATH:/usr/local/lib64' -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  target/job-ocr-pdf-1.0.jar
----------------CLIENT MODE (LESS STABLE)----------
spark-submit --master yarn-client --class "com.jene.cognitive.OcrTess4jSpark" \
  --driver-java-options "-Denv=dev -Djna.library.path='$LD_LIBRARY_PATH:/usr/local/lib64' -Djava.library.path=$LD_LIBRARY_PATH:/usr/local/lib64 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --driver-memory 6g --num-executors 8 --executor-memory 6g \
  --conf "spark.executor.extraJavaOptions=-Djna.library.path='$LD_LIBRARY_PATH:/usr/local/lib64'" \
  --conf "spark.executor.extraLibraryPath=/apps/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/hadoop/lib/native:/usr/local/lib" \
  target/job-ocr-pdf-1.0.jar
Before running the job, check that the project properties are set as follows:
path-pdfs=/data/ocr/pdfs (hdfs)
path-output=/apps/tmp/ocr_txtfiles/ (fs)
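The two properties above can be checked with a small script before submitting. This is a minimal sketch: the properties file location used below is hypothetical (this file does not say where the properties live), and `get_prop` and `check_prop` are helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Print the value of a key from a Java-style properties file.
# $1 = key, $2 = properties file. If the key appears twice, the last one wins.
get_prop() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Compare a property with its expected value.
# $1 = key, $2 = expected value, $3 = properties file
check_prop() {
  local actual
  actual=$(get_prop "$1" "$3")
  if [ "$actual" = "$2" ]; then
    echo "OK   $1=$actual"
  else
    echo "FAIL $1=$actual (expected $2)"
    return 1
  fi
}

# Hypothetical location; pass the real file as the first argument.
PROPS="${1:-src/main/resources/dev.properties}"

if [ -f "$PROPS" ]; then
  check_prop path-pdfs   /data/ocr/pdfs          "$PROPS"
  check_prop path-output /apps/tmp/ocr_txtfiles/ "$PROPS"
fi
```

On a cluster node, `hdfs dfs -test -d /data/ocr/pdfs` can additionally confirm that the HDFS input directory exists.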