- jupyter_kafka - How to do real-time sentiment analysis using a Jupyter notebook, Kafka and NLTK
- kafka_nlp - Building a real-time NLP pipeline using Kafka and spaCy
- livy_batch_emr - How to do better deployments of Spark jobs to AWS EMR using Apache Livy
- pandas_validation - How to do column validation with pandas
- porto_seguro_spark - Safe driver prediction using PySpark and Logistic Regression
- pyspark-project-template - How to set up a Python and Spark development environment with good software engineering practices
- titanic_spark - Real-time prediction using Spark Structured Streaming, XGBoost and Scala
- titanic_xgboost - PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset
- scala_notebook_test - How to run Scala and Spark in a Jupyter notebook
- mlflow-automl - How to build an integration between AutoML and MLflow
- realtime_kafka - Building a real-time prediction pipeline using Spark Structured Streaming and microservices
- realtime_fraud_detection - How to build a real-time fraud detection pipeline using Faust and MLflow
- terraform_eks_spark - How to run a PySpark job in Kubernetes (AWS EKS)
- s3a_spark - How to read Parquet data from S3 using the S3A protocol and temporary credentials in PySpark

If you enjoy my articles and find them useful, please feel free to buy me a beer. I spend a lot of time writing them. Cheers!