
Homework for the course on Big Data Computing, held at the University of Padua

Homework 4

Accessing the cluster:

putty: gajdusek@login.dei.unipd.it

when the connection opens: ssh -p 2222 group50@ (insert the new password...)

to upload the file from Windows to the machine in the lab:

  • check that the PuTTY folder is in the PATH environment variable
  • in Windows cmd: pscp C:\Users\pavel\PycharmProjects\BigDataAll\HW3\HW3_files\G50HM4.py gajdusek@login.dei.unipd.it:Downloads

to upload the file from the machine in the lab to the cluster (in the PuTTY command line):

  • scp -P 2222 G50HM4.py group50@

  • log into the cluster: ssh -p 2222 group50@

  • hdfs dfs -ls /tmp

  • copy the file from /tmp to our folder: hdfs dfs -copyFromLocal /tmp/G50HM4.py G50HM4.py

  • check that the file was copied: hdfs dfs -ls

  • yarn application -list
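The steps above can be collected in a small Python sketch that only builds the command lines and does not run them. The function name upload_steps, the placeholder CLUSTER_HOST, and the scp destination directory are assumptions for illustration: the real host is not given in these notes, and /tmp is used only because the copyFromLocal step reads the file from there.

```python
import shlex


def upload_steps(script, login, remote_dir, port=2222):
    """Build, without running them, the commands for the upload steps.

    login is 'user@host' (hypothetical placeholder below);
    remote_dir is the directory scp should copy the script into.
    """
    remote_path = f"{remote_dir.rstrip('/')}/{script}"
    commands = [
        # scp takes the port with an upper-case -P ...
        ["scp", "-P", str(port), script, f"{login}:{remote_dir}"],
        # ... while ssh takes it with a lower-case -p
        ["ssh", "-p", str(port), login],
        # the remaining commands are typed on the cluster itself
        ["hdfs", "dfs", "-ls", "/tmp"],
        ["hdfs", "dfs", "-copyFromLocal", remote_path, script],
        ["hdfs", "dfs", "-ls"],
        ["yarn", "application", "-list"],
    ]
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # CLUSTER_HOST is a placeholder, not the real cluster address
    for line in upload_steps("G50HM4.py", "group50@CLUSTER_HOST", "/tmp"):
        print(line)
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; each line can be pasted into the terminal once the real host is filled in.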

TODO: running jobs:

TODO: I don't know why, but I can't run the script from our local folder... It reports that the file was not found, so I run it from /tmp.

According to the homework webpage:

  • spark-submit --conf spark.pyspark.python=python3 --num-executors 4 G50HM4.py /data/
  • maximum X (the value passed to --num-executors, 4 in the command above): 32
  • to pass one of preloaded files as an argument: specify path /data/filename, e.g. /data/HIGGS11M7D.txt.gz
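As a rough sketch, the submit line can be assembled in Python with the limit checked before anything is sent to the cluster. This assumes that X in the list above is the --num-executors value; the names spark_submit_command, preloaded_path and MAX_EXECUTORS are made up for this example.

```python
import shlex

MAX_EXECUTORS = 32  # assumed to be the "maximum X: 32" limit quoted above


def preloaded_path(filename):
    """Return the path of a preloaded dataset, which lives under /data/."""
    return "/data/" + filename


def spark_submit_command(script, dataset, num_executors=4):
    """Build (but do not run) the spark-submit command line as a string."""
    if not 1 <= num_executors <= MAX_EXECUTORS:
        raise ValueError(
            f"num_executors must be between 1 and {MAX_EXECUTORS}, "
            f"got {num_executors}"
        )
    args = [
        "spark-submit",
        "--conf", "spark.pyspark.python=python3",
        "--num-executors", str(num_executors),
        script,
        dataset,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(spark_submit_command("G50HM4.py", preloaded_path("HIGGS11M7D.txt.gz")))
```

Checking the executor count locally gives an immediate error message instead of a rejected or queued job.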

for more details see here