
SparkSQL-Using-Pyspark

Spark SQL functions and operations using PySpark and spark-submit

A basic movie CSV dataset of about 5,000 records is read into a Spark SQL DataFrame and manipulated using different DataFrame operations. I have tried to cover the following (a sketch of these operations follows the list):

  1. Adding a computed column to a DataFrame
  2. Grouping operations on a DataFrame
  3. Reading CSV and JSON files into a Spark SQL DataFrame
  4. Using an external package at runtime with spark-submit (see the example at the end)
  5. UDFs in PySpark
  6. Join operations
  7. Other operations such as case-when, unionAll, orderBy, array column manipulation, windowing operations, HiveContext, and so on
  8. Writing a DataFrame to a file
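
A minimal sketch of steps 1-3, plus the array-column manipulation from step 7, assuming the Spark 2.x SparkSession API; the file names (`movies.csv`, `ratings.json`) and column names (`gross`, `budget`, `genres`, `imdb_score`) are hypothetical, and the repository's actual schema may differ:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MovieAnalysis").getOrCreate()

# Step 3: read a CSV (with header and inferred types) and a JSON file
# into Spark SQL DataFrames. File names here are hypothetical.
movies = spark.read.csv("movies.csv", header=True, inferSchema=True)
ratings = spark.read.json("ratings.json")

# Step 1: add a computed column (column names are assumed)
movies = movies.withColumn("profit", F.col("gross") - F.col("budget"))

# Array column manipulation (step 7): split a pipe-delimited genres
# field into an array column, then explode it to one row per genre
movies = movies.withColumn("genre_list", F.split(F.col("genres"), r"\|"))
by_genre = movies.withColumn("genre", F.explode("genre_list"))

# Step 2: a grouping operation -- average IMDB score per genre
avg_by_genre = by_genre.groupBy("genre").agg(F.avg("imdb_score").alias("avg_score"))
avg_by_genre.show(5)
```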
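A companion sketch for the UDF, join, case-when, windowing, and write steps, continuing from the DataFrames defined above; the `movie_id` join key and `title_year` column are likewise assumptions:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Step 5: a Python UDF that buckets the IMDB score
def rating_band(score):
    return "good" if score is not None and score >= 7.0 else "average"

rating_band_udf = F.udf(rating_band, StringType())
movies = movies.withColumn("band", rating_band_udf("imdb_score"))

# Step 6: a join (the movie_id key is an assumption)
joined = movies.join(ratings, on="movie_id", how="inner")

# Step 7: case-when via when/otherwise, plus a windowing operation
movies = movies.withColumn(
    "era", F.when(F.col("title_year") < 2000, "classic").otherwise("modern")
)
w = Window.partitionBy("genre").orderBy(F.desc("imdb_score"))
top3 = (by_genre.withColumn("rank", F.row_number().over(w))
                .filter(F.col("rank") <= 3))

# Step 8: order the report and write it out as CSV
(avg_by_genre.orderBy(F.desc("avg_score"))
             .write.mode("overwrite")
             .csv("output/avg_by_genre", header=True))
```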

Different types of reports are produced using DataFrame operations, and the results are exported as output files. Sample output is also shown alongside the code, just as it would appear in the Spark shell.
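
For step 4, an external dependency can be pulled in at runtime with spark-submit's `--packages` flag. The exact package used by this repository is an assumption; on Spark 1.x, for example, the spark-csv package was commonly needed to read CSVs, and the script name below is hypothetical:

```
spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 movie_analysis.py
```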