/data-engineering-exercises

Data engineering exercises including Spark (Scala and PySpark), PostgreSQL query optimization, Hadoop, Hive and Neo4j.

Primary language: Jupyter Notebook

Data engineering exercises

  • dic/ex1:
    • Hadoop implementation of $\chi^2$ feature selection for text features.
  • dic/ex2:
    • Spark (Scala) implementation of $\chi^2$ feature selection for text features.
    • PySpark implementation of text classification using TF-IDF features and $\chi^2$ feature selection.
  • adbs/ex1:
    • PostgreSQL query optimization.
  • adbs/ex2:
    • Spark RDD API, Spark DataFrame API, and Spark SQL queries (Scala).
    • Hadoop job on local filesystem and HDFS.
    • HiveQL query optimization.
  • adbs/ex3:
    • Relational DBs:
      • Distributed joins with minimal communication cost.
      • Denormalization.
    • Cypher queries in a Neo4j (graph) database.
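
For readers unfamiliar with the $\chi^2$ feature selection mentioned under dic/ex1 and dic/ex2, here is a minimal single-machine sketch in plain Python. It is not the code from this repository, and it does not use Hadoop or Spark. It assumes the common document-count formulation $\chi^2 = \frac{N(AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$, where $A$ is the number of documents in the category that contain the term, $B$ the number outside the category that contain it, $C$ the number in the category without it, $D$ the number outside the category without it, and $N$ the total number of documents. The function names `chi_square_scores` and `top_terms` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs):
    """Compute chi-square scores per (category, term).

    docs: iterable of (category, tokens) pairs.
    Returns a dict {category: {term: chi2}} based on document counts.
    """
    # A term counts at most once per document.
    docs = [(category, set(tokens)) for category, tokens in docs]
    n = len(docs)

    docs_per_category = Counter(category for category, _ in docs)
    docs_per_term = Counter()
    docs_per_category_term = defaultdict(Counter)
    for category, terms in docs:
        for term in terms:
            docs_per_term[term] += 1
            docs_per_category_term[category][term] += 1

    scores = {}
    for category, n_cat in docs_per_category.items():
        scores[category] = {}
        for term, n_term in docs_per_term.items():
            a = docs_per_category_term[category][term]  # in category, has term
            b = n_term - a                              # other categories, has term
            c = n_cat - a                               # in category, lacks term
            d = n - a - b - c                           # other categories, lacks term
            denom = (a + b) * (a + c) * (b + d) * (c + d)
            # A zero denominator means the term or category covers all documents.
            scores[category][term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_terms(scores, category, k):
    """Return the k highest-scoring terms for a category (ties broken alphabetically)."""
    ranked = sorted(scores[category].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus (made up for illustration).
corpus = [
    ("sports", ["ball", "goal"]),
    ("sports", ["ball", "team"]),
    ("tech", ["code", "team"]),
    ("tech", ["code", "bug"]),
]
scores = chi_square_scores(corpus)
selected = top_terms(scores, "sports", 2)
```

In a MapReduce or Spark setting, the three counters would typically be produced by distributed aggregations, and only the final scoring step resembles the loop above.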