
A collection of small projects exploring PySpark features and functionality including packages and modules, algorithms, and general data science techniques.

Primary language: Jupyter Notebook

PySpark Projects

A collection of small projects exploring PySpark features and functionality including:

Packages and modules

readStream/writeStream, Pipeline, OneHotEncoder, StringIndexer, StandardScaler, VectorAssembler
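As a minimal sketch of how the feature stages listed above can be chained, the following builds a single Pipeline from them. It assumes Spark 3.x (where OneHotEncoder accepts `inputCols`/`outputCols`) and an active SparkSession. The helper names `encoded_names` and `build_preprocessing_pipeline` and the `_idx`/`_vec`/`raw_features` column names are hypothetical, chosen for illustration rather than taken from the notebooks.

```python
def encoded_names(col):
    """Return the (index, one-hot vector) column names derived from `col`.

    The `_idx` / `_vec` suffixes are an illustrative naming convention.
    """
    return f"{col}_idx", f"{col}_vec"


def build_preprocessing_pipeline(categorical_col, numeric_cols):
    """Chain index -> one-hot encode -> assemble -> scale into one Pipeline.

    Requires an active SparkSession, because pyspark.ml stages wrap JVM objects.
    """
    # Imported lazily so this module can be loaded without a Spark runtime.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoder,
        StandardScaler,
        StringIndexer,
        VectorAssembler,
    )

    idx_col, vec_col = encoded_names(categorical_col)
    stages = [
        # Map each category string to a numeric index; unseen values get their own bucket.
        StringIndexer(inputCol=categorical_col, outputCol=idx_col, handleInvalid="keep"),
        # Turn the index into a sparse one-hot vector.
        OneHotEncoder(inputCols=[idx_col], outputCols=[vec_col]),
        # Concatenate the encoded and numeric columns into one feature vector.
        VectorAssembler(inputCols=[vec_col, *numeric_cols], outputCol="raw_features"),
        # Scale to unit standard deviation; withMean=False keeps sparse vectors sparse.
        StandardScaler(inputCol="raw_features", outputCol="features",
                       withMean=False, withStd=True),
    ]
    return Pipeline(stages=stages)
```

With a DataFrame `df` holding, say, a `city` string column and numeric `age` and `income` columns (hypothetical names), `build_preprocessing_pipeline("city", ["age", "income"]).fit(df).transform(df)` would add a `features` column ready for an estimator.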

Algorithms

RandomForestClassifier, KMeans, LinearRegression, ridge and LASSO regression, LogisticRegression
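Spark ML has no separate ridge or LASSO estimator; both are usually expressed through LinearRegression's `regParam` and `elasticNetParam`, where `elasticNetParam=0.0` gives a pure L2 (ridge) penalty and `1.0` a pure L1 (LASSO) penalty. The sketch below shows that mapping. The helper names `penalty_params` and `build_regression` are hypothetical, and the notebooks may configure these models differently.

```python
def penalty_params(kind, strength):
    """Map a penalty name to LinearRegression's regParam / elasticNetParam."""
    mixing = {"ridge": 0.0, "lasso": 1.0}  # 0.0 = pure L2, 1.0 = pure L1
    key = kind.lower()
    if key not in mixing:
        raise ValueError(f"unknown penalty {kind!r}; expected 'ridge' or 'lasso'")
    if strength < 0:
        raise ValueError("regularization strength must be non-negative")
    return {"regParam": float(strength), "elasticNetParam": mixing[key]}


def build_regression(kind, strength, features_col="features", label_col="label"):
    """Return an unfitted LinearRegression with the requested penalty.

    Requires an active SparkSession; the default column names are assumptions.
    """
    # Imported lazily so this module can be loaded without a Spark runtime.
    from pyspark.ml.regression import LinearRegression

    return LinearRegression(
        featuresCol=features_col,
        labelCol=label_col,
        **penalty_params(kind, strength),
    )
```

For example, `build_regression("lasso", 0.1).fit(train_df)` would fit an L1-penalized model on a hypothetical `train_df`. Values of `elasticNetParam` strictly between 0 and 1 blend the two penalties.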

Techniques

feature extraction, evaluating the collinearity of features, calculating AUC, extracting feature importances, pre-processing, and exploratory data analysis (EDA).
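To make the AUC item concrete, here is a small sketch, not code from the notebooks. `roc_auc` is a plain-Python reference that computes AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count as half). `spark_auc` shows the Spark counterpart using BinaryClassificationEvaluator. Both function names are hypothetical, and `spark_auc` assumes a predictions DataFrame with the default `rawPrediction` column and a `label` column.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def spark_auc(predictions, label_col="label"):
    """Evaluate areaUnderROC on a DataFrame produced by a fitted classifier.

    Requires an active SparkSession; the column names are assumptions.
    """
    # Imported lazily so this module can be loaded without a Spark runtime.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    evaluator = BinaryClassificationEvaluator(
        rawPredictionCol="rawPrediction",
        labelCol=label_col,
        metricName="areaUnderROC",
    )
    return evaluator.evaluate(predictions)
```

The pairwise version is quadratic in the number of rows, so it is only suitable for checking small samples. Spark's evaluator approximates the curve from thresholds, so the two may differ slightly on large or heavily tied data.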