falaybeg

VodafoneTurkey

falaybeg's Stars

remoteintech/remote-jobs
A list of semi to fully remote-friendly companies (jobs) in tech.
Language: JavaScript 30.2k 926 1783.2k
dotnet-architecture/eShopOnContainers
Cross-platform .NET sample microservices and container based application that runs on Linux, Windows, and macOS. Powered by .NET 7, Docker Containers and Azure Kubernetes Services. Supports Visual Studio, VS for Mac and CLI based environments with Docker CLI, dotnet CLI, VS Code or any other code editor. Moved to https://github.com/dotnet/eShop.
Language: C# 24.5k 1.5k 1.3k10.3k
recommenders-team/recommenders
Best Practices on Recommendation Systems
Language: Python 19.4k 276 8723.1k
DataTalksClub/mlops-zoomcamp
Free MLOps course from DataTalks.Club
Language: Jupyter Notebook 11.2k 185 932.2k
rzashakeri/beautify-github-profile
This repository will assist you in creating a more beautiful and appealing github profile, and you will have access to a comprehensive range of tools and tutorials for beautifying your github profile. 🪄 ⭐
11.2k 74 23574
mjhea0/awesome-fastapi
A curated list of awesome things related to FastAPI
8.8k 169 23672
khuyentran1401/Data-science
Collection of useful data science topics along with articles, videos, and code
Language: Jupyter Notebook 4.1k 143 81k
mercari/ml-system-design-pattern
System design patterns for machine learning
2.3k 73 14245
mstrYoda/kubernetes-kitap
1.7k 72 26191
ruanyf/simple-bash-scripts
A collection of simple Bash scripts
Language: Shell 1.7k 55 91k
damklis/DataEngineeringProject
Example end to end data engineering project.
Language: Python 1.2k 13 6229
jfrazee/awesome-nifi
A list of useful Apache NiFi resources, processor bundles and tools
941 94 5230
alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
Language: Jupyter Notebook 871 9 0191
ververica/flink-sql-cookbook
The Apache Flink SQL Cookbook is a curated collection of examples, patterns, and use cases of Apache Flink SQL. Many of the recipes are completely self-contained and can be run in Ververica Platform as is.
Language: Dockerfile 863 55 7199
AmoDinho/datacamp-python-data-science-track
All the slides, accompanying code and exercises all stored in this repo. 🎈
Language: Python 800 20 5527
rasbt/machine-learning-notes
Collection of useful machine learning codes and snippets (originally intended for my personal use)
Language: Jupyter Notebook 789 25 15142
ankurchavda/SparkLearning
A comprehensive Spark guide collated from multiple sources that can be referred to learn more about Spark or as an interview refresher.
653 19 074
josephmachado/beginner_de_project
Beginner data engineering project - batch edition
Language: HTML 479 10 18142
ozlerhakan/datacamp
🍧 DataCamp data-science and machine learning courses
Language: Jupyter Notebook 341 13 0185
Paulescu/real-time-data-pipelines-in-python
Real-time Feature Pipelines in Python ⚡
Language: Python 251 8 262
Jcharis/DataScienceTools
Useful Data Science and Machine Learning Tools, Libraries and Packages
Language: Jupyter Notebook 230 15 0229
trevoirwilliams/HR.LeaveManagement.CleanArchitecture-dotnet5
Educational Project to demonstrate MediatR, CQRS & Onion/Clean Architecture in ASP.NET Core
Language: C# 230 5 1145
adrianhajdin/node_express_crud_api
Language: JavaScript 157 4 096
Nneji123/Serving-Machine-Learning-Models
This repository contains instructions, template source code and examples on how to serve/deploy machine learning models using various frameworks and applications such as Docker, Flask, FastAPI, BentoML, Streamlit, MLflow and even code on how to deploy your machine learning model as an android app.
Language: CSS 53 0 010
seikko/Project.MicroServices
Language: JavaScript 47 1 24
alpinegizmo/flink-mobile-data-usage
Language: Java 42 3 115
ultranet1/APACHE_AIRFLOW_DATA_PIPELINES
Project Description: A music streaming company wants to introduce more automation and monitoring to their data warehouse ETL pipelines, and they have concluded that the best tool to achieve this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output generated by this pipeline, and the pipeline is expected to run daily on a schedule, pulling new data from the source and storing the results in the destination.

Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline design: At a high level, the pipeline does the following tasks:
- Extract data from multiple S3 locations.
- Load the data into the Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[Figure: Structure of the Airflow DAG]

Design Goals: Based on the requirements of our data consumers, our pipeline is required to adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, a task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, generating reports, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow comes with some simple built-in operators such as PythonOperator, BashOperator and DummyOperator; however, Airflow also lets you extend BaseOperator and create custom operators. For this project, I developed several custom operators.

[Figure: operators]

The description of each of these operators follows:
- StageToRedshiftOperator: Stages data to a specific Redshift cluster from a specified S3 location. The operator uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: Loads data to the given fact table by running the provided SQL statement. Supports delete-insert and append style loads.
- LoadDimensionOperator: Loads data to the given dimension table by running the provided SQL statement. Supports delete-insert and append style loads.
- SubDagOperator: Two or more operators can be grouped into one task using the SubDagOperator. Here, I am grouping the tasks of checking whether the given table has rows and then running a series of data quality SQL commands.
- HasRowsOperator: Data quality check to ensure that the specified table has rows.
- DataQualityOperator: Performs data quality checks by running SQL statements to validate the data.
- SongPopularityOperator: Calculates the top ten most popular songs for a given interval. The interval is dictated by the DAG schedule.
- UnloadToS3Operator: Stores the analysis result back to the given S3 location.

Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline. Please note, this is a trivial analysis, but you can imagine other, more complex queries that follow a similar structure.

S3 Input events data:
    s3://<bucket>/log_data/2018/11/
    2018-11-01-events.json
    2018-11-02-events.json
    2018-11-03-events.json
    ..
    2018-11-28-events.json
    2018-11-29-events.json
    2018-11-30-events.json

S3 Output song popularity data:
    s3://skuchkula-topsongs/
    songpopularity_2018-11-01
    songpopularity_2018-11-02
    songpopularity_2018-11-03
    ...
    songpopularity_2018-11-28
    songpopularity_2018-11-29
    songpopularity_2018-11-30

The DAG can be configured by giving it some default_args, which specify the start_date, end_date and the other design choices mentioned above.

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }

How to run this project?

Step 1: Create AWS Redshift Cluster using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the AWS Redshift Cluster. Make a note of:
    DWN_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
    DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml. Ensure that you have mapped the volume to point to the location where you have your DAGs. NOTE: You can find details of how to manage Apache Airflow on Mac here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

[Figure: start_airflow]

Step 3: Configure Apache Airflow Hooks. On the left is the S3 connection. The login and password are the access key and secret key of the IAM user that you created. By using these credentials, we are able to read data from S3. On the right is the Redshift connection. These values can be easily gathered from your Redshift cluster connection details.

Step 4: Execute the create-tables-dag. This DAG will create the staging, fact and dimension tables. We trigger it manually because we want to keep table creation out of the main DAG. Normally, creation of tables can be handled by just triggering a script, but for the sake of illustration I created a DAG for this and had Airflow trigger it. You can turn off the DAG once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with a schedule interval of @daily and the execution end date is 2018-11-30, Airflow will automatically schedule and trigger the DAG runs once per day, 30 times in total. The 30 DAG runs, ranging from start_date to end_date, are triggered by Airflow once per day.

[Figure: schedule]
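The daily schedule and the partitioned S3 layout described above can be sketched in plain Python without Airflow. This is only an illustration under stated assumptions: the helper names (daily_run_dates, input_key, output_key) are hypothetical and not part of the project, the bucket placeholder is left as written in the description, and the real pipeline presumably builds these paths through Airflow's templated fields instead.

```python
from datetime import date, timedelta

# Values copied from the description; "<bucket>" is left as the placeholder it is there.
START_DATE = date(2018, 11, 1)
END_DATE = date(2018, 11, 30)
INPUT_PREFIX = "s3://<bucket>/log_data"
OUTPUT_PREFIX = "s3://skuchkula-topsongs"


def daily_run_dates(start, end):
    """Return one date per daily run, from start to end inclusive (hypothetical helper)."""
    days = (end - start).days + 1
    return [start + timedelta(days=offset) for offset in range(days)]


def input_key(run_date):
    """Build the year/month-partitioned events path for one run date."""
    return (
        f"{INPUT_PREFIX}/{run_date.year}/{run_date.month:02d}/"
        f"{run_date.isoformat()}-events.json"
    )


def output_key(run_date):
    """Build the song popularity output path for one run date."""
    return f"{OUTPUT_PREFIX}/songpopularity_{run_date.isoformat()}"


runs = daily_run_dates(START_DATE, END_DATE)
print(len(runs))            # number of daily runs between the two dates
print(input_key(runs[0]))   # events file read by the first run
print(output_key(runs[-1])) # result written by the last run
```

Each run date maps to exactly one input file and one output object, which is what lets a daily run be retried or backfilled without depending on past runs.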
Language: Python 14 1 02
georgeanndata/data_engineer
Language: Jupyter Notebook 6 1 02
Subatomic-Software/autoflink
Completely runtime driven flink streaming application, with UI construction and orchestration
Language: Java 3 3 00
Alan-pan/FlinkTutorial-1.17
Language: Java 1 1 01
