- `dic/ex1`:
  - Hadoop implementation of $\chi^2$ feature selection for text features.
- `dic/ex2`:
  - Spark (Scala) implementation of $\chi^2$ feature selection for text features.
  - PySpark implementation of text classification using tf-idf features and $\chi^2$ feature selection.
- `adbs/ex1`:
  - PostgreSQL query optimization.
- `adbs/ex2`:
  - Spark RDD API, Spark DataFrame API and Spark SQL queries (Scala).
  - Hadoop job on local filesystem and HDFS.
  - HiveQL query optimization.
- `adbs/ex3`:
  - Relational DBs:
    - Distributed joins with minimal communication cost.
    - Denormalization.
  - Cypher queries in a Neo4j (graph) database.
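
For orientation, the $\chi^2$ feature selection mentioned in the `dic` exercises can be sketched in plain Python. This is a minimal illustration, not the code used in the exercises. It assumes the common 2x2 contingency-table formulation for a term and a category, and the names `chi_square` and `select_top_terms` are hypothetical.

```python
from typing import Dict, List, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_top_terms(counts: Dict[str, Tuple[int, int, int, int]], k: int) -> List[str]:
    """Return the k terms with the highest chi-square score for one category."""
    scores = {term: chi_square(*table) for term, table in counts.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Toy counts, invented for illustration only.
example_counts = {
    "goal": (10, 0, 0, 10),  # perfectly associated with the category
    "the": (5, 5, 5, 5),     # independent of the category
}
top = select_top_terms(example_counts, 1)
```

With the toy counts above, `chi_square(10, 0, 0, 10)` evaluates to `20.0` and `chi_square(5, 5, 5, 5)` to `0.0`, so `top` contains only `"goal"`. A distributed version (Hadoop or Spark) would typically compute the four counts per term and category with a map/reduce step and then apply the same formula.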
nopperl/data-engineering-exercises
Data engineering exercises including Spark (Scala and PySpark), PostgreSQL query optimization, Hadoop, Hive and Neo4j.