Pinned Repositories
apache-airflow-study
Python code that implements a simple ETL on Apache Airflow
Hive_emr_python
Hive streaming using Python and Hive's TRANSFORM function
kafka-stream-poc
POC showing how to use Kafka Streams to read Avro from a Kafka topic, filter for only the desired value, and print it to the console.
kafka_connect_rds_to_s3_json
Retrieves data from PostgreSQL on RDS (non-CDC) and ingests it into AWS S3 as JSON strings.
kafka_flink_deduplicate1
This project consumes messages from a Kafka topic using Flink and deduplicates the incoming messages.
kafka_publisher_json_to_azure_event_hub
This demo tests a Kafka publisher that publishes JSON strings to Azure Event Hub (with Kafka support enabled).
kstreams
martingale_ea_improvement
Aims to improve a forex robot that uses a martingale strategy
poc_streaming_twitter_to_kafka_to_spark_to_hdfs
An attempt to build a data pipeline that reads the Twitter stream and stores tweet data in HDFS
pyspark_read_write_to_hive
Correct way to read a JSON file on AWS S3 with PySpark
jitkasempin's Repositories
jitkasempin/airflow-maintenance-dags
A series of DAGs/Workflows to help maintain the operation of Airflow
jitkasempin/automl-gs
Provide an input CSV and a target field to predict, generate a model + code to run it.
jitkasempin/avro-fastserde
Fast Apache Avro serialization/deserialization library
jitkasempin/avro-util
Collection of utilities to allow writing java code that operates across a wide range of avro versions.
jitkasempin/bigquery-etl-dataflow-sample
jitkasempin/bigquery-ml-templates
BigQuery ML SQL templates for common marketing use cases
jitkasempin/cloud-opensource-python
Dependency Management Toolkit for Google Cloud Python Projects
jitkasempin/code-snippets
Small Google Cloud Platform examples and code snippets.
jitkasempin/Data-Wrangling-with-Python
Simplify your ETL processes with these hands-on data sanitation tips, tricks, and best practices
jitkasempin/DataflowTemplates
Google-provided Cloud Dataflow template pipelines for solving simple in-Cloud data tasks
jitkasempin/datalake
Data Lake template
jitkasempin/dbeam
DBeam extracts SQL tables using JDBC and Apache Beam
jitkasempin/faust
Python Stream Processing
jitkasempin/getting_started_with_pyspark
Materials for the class Getting Started with PySpark
jitkasempin/kaniko
Build Container Images In Kubernetes
jitkasempin/lambda-arch
Applying the Lambda Architecture with Spark, Kafka, and Cassandra.
jitkasempin/mlflow
Open source platform for the machine learning lifecycle
jitkasempin/modin
Modin: Speed up your Pandas workflows by changing a single line of code
jitkasempin/nuclio
High-Performance Serverless event and data processing platform
jitkasempin/pro-devops-with-google-cloud-platform
Source Code for 'Pro DevOps with Google Cloud Platform' by Pierluigi Riti
jitkasempin/professional-services
Common solutions and tools developed by Google Cloud's Professional Services team
jitkasempin/PyMySQL
Pure Python MySQL Client
jitkasempin/python-mysql-replication
Pure Python implementation of the MySQL replication protocol, built on top of PyMySQL
jitkasempin/rabbitmq-connect
jitkasempin/serverless_data_pipeline_gcp
Schedule a data pipeline in Google Cloud using Cloud Functions, BigQuery, Cloud Storage, Cloud Scheduler, and Pub/Sub
jitkasempin/sope
Sope - Apache Spark ETL Utilities
jitkasempin/streamalert
StreamAlert is a serverless, real-time data analysis framework that lets you ingest, analyze, and alert on data from any environment, using data sources and alerting logic you define.
jitkasempin/tableschema-py
A Python library for working with Table Schema.
jitkasempin/useful_blog_post_data_engineer
jitkasempin/yq
Command-line YAML and XML processor - jq wrapper for YAML/XML documents