Covid-19 is the dominant research topic this season. We have decided to analyze the recovery rates of patients based on their age, gender, and other criteria. To achieve this, we will first use a dataset from Kaggle to process static data, then move on to live streaming data from Twitter. Apache Hadoop will serve as the file system, Apache Flink will handle the live Twitter stream, and Python will tie the whole project together, hence the name: PyFlink-Covid-Vaccine.
- Swaroop Reddy
- Annie Chandolu
- Alekhya Jaddu
- Tejaswi Reddy Kandula
- Naga Anshitha Velagapudi
- Harika Kulkarni
- As a future improvement to the project, we will use live streaming data (tweets) from Twitter.
- Programming Language: Python
- Streaming Engine: Flink
- Wiki-Link for Flink
- File System: Hadoop
- Swaroop Reddy - Going to work on HDFS (Hadoop) MapReduce Programming Model.
- Annie Samarpitha - I will be working with Alekhya on Python programming.
- Alekhya Jaddu - Will be working on the programming part using Python scripts.
- Tejaswi Reddy Kandula - Going to work on Shell Scripting.
- Naga Anshitha Velagapudi - Going to work on Flink, which is used to process large-scale data streams.
- Harika Kulkarni - Will be working on Flink.
- Swaroop Reddy Gottigundala - Writing a Flink Python DataStream API program.
- Annie Samarpitha Chandolu - Analysis of weekly case counts and weekly death counts.
- Alekhya Jaddu - Wordcount using pyFlink.
- Tejaswi Reddy Kandula - Wordcount for all Covid cases using pyFlink on the covid_19_clean_complete dataset.
- Naga Anshitha Velagapudi - Analyzing the number of times/days each country administered vaccinations.
- Harika Kulkarni - Countrywise highest recovery rates versus death rates.
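Several of the tasks above center on a PyFlink word count. As a Flink-free illustration of what such a job computes, here is a minimal sketch in plain Python (the sample lines are hypothetical stand-ins for rows of the Covid dataset):

```python
from collections import Counter

def word_count(lines):
    """Count word occurrences across an iterable of text lines,
    mirroring the classic map (split) / reduce (count) steps."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Hypothetical sample text, for illustration only.
sample = ["covid cases rising", "covid recoveries rising"]
print(word_count(sample)["covid"])  # each line mentions "covid" once -> 2
```

The real job would express the same split-and-count pipeline with pyFlink operators over the dataset.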
- Apache Flink
- pip
- Python (3.6.0 to 3.8.0)
- Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
- Flink also provides batch processing, graph processing, and iterative processing for machine learning applications.
- Flink is considered the next-generation stream processing system.
- Flink offers substantially higher processing speeds than Spark and Hadoop.
- Flink provides low latency and high throughput.
- Apache Flink
- pip
- Python (3.6.0 to 3.8.0)
If any other version of Python is already installed on your system, use the command below to uninstall it:
choco uninstall python
To install a specific version of Python, use the command below:
choco install python --version=3.8.0
The Python version must be 3.5, 3.6, 3.7, or 3.8 for PyFlink. Run the following command to make sure it meets the requirement:
$ python --version
Use the command below to install apache-flink:
$ python -m pip install apache-flink
You can also build PyFlink from source by following the development guide.
Note: Starting from Flink 1.11, running PyFlink jobs locally on Windows is also supported, so you can develop and debug PyFlink jobs on Windows.
https://app.vidgrid.com/view/QTuVfghYRV38
I am doing an analysis on a Covid dataset which is stored in the following repository:
https://github.com/annie0sc/practice-flink-wordcount
I am working on computing the countrywise highest recovery rates versus death rates from the Covid data.
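A minimal sketch of that computation in plain Python, using hypothetical per-country totals (the real job would read the recovered/death/confirmed columns from the Covid dataset via pyFlink):

```python
def recovery_vs_death_rates(totals):
    """Given {country: (recovered, deaths, confirmed)}, return a list of
    (country, recovery_rate, death_rate) sorted by recovery rate, highest first."""
    rates = []
    for country, (recovered, deaths, confirmed) in totals.items():
        rates.append((country, recovered / confirmed, deaths / confirmed))
    return sorted(rates, key=lambda r: r[1], reverse=True)

# Hypothetical figures, for illustration only.
totals = {"A": (90, 5, 100), "B": (70, 20, 100)}
for country, rec_rate, death_rate in recovery_vs_death_rates(totals):
    print(country, rec_rate, death_rate)
```

Sorting by recovery rate while carrying the death rate alongside lets both metrics be compared country by country in one pass.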
- Python
- Flink
- pip
Colab is a Python development environment that runs in the browser using Google Cloud. With Google Colab we can:
- Write and execute code in Python
- Document your code with text that supports mathematical equations
- Create/Upload/Share notebooks
- Import/Save notebooks from/to Google Drive
- Import/Publish notebooks from GitHub
- Import external datasets e.g. from Kaggle
- Integrate PyTorch, TensorFlow, Keras, OpenCV
- Free Cloud service with free GPU
Input File: Link to input file
Step 1: As Colab stores your notebooks in Google Drive, ensure that you are logged in to your Google Drive account before proceeding further.
Step 2: Open the following URL in your browser: https://colab.research.google.com (assuming that you are logged into your Google Drive).
Step 3: Click on the NEW NOTEBOOK link at the bottom of the screen. A new notebook will open up.
Step 4: Enter a trivial piece of Python code in the code window and execute it by clicking the arrow on the left side of the code window.
Step 5: Install apache-flink and all necessary packages: $ python -m pip install apache-flink
Step 6: Compare TotalDeaths and TotalRecovered from the Covid data.
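The comparison step above can be sketched in plain Python. The column names (Deaths, Recovered) are assumed from the usual layout of the Kaggle Covid dataset and may need adjusting:

```python
import csv
import io

def totals(csv_text):
    """Sum the Deaths and Recovered columns of a Covid CSV
    (column names assumed; adjust to the actual dataset)."""
    deaths = recovered = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        deaths += int(row["Deaths"])
        recovered += int(row["Recovered"])
    return deaths, recovered

# Tiny in-memory sample standing in for the real input file.
sample = "Country/Region,Deaths,Recovered\nA,5,90\nB,20,70\n"
print(totals(sample))  # (25, 160)
```

In the notebook, the same aggregation would run over the uploaded input file instead of an in-memory string.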
Output File: Link to Output file
References: https://flink.apache.org/flink-architecture.html
I'm working on analyzing the number of times/days on which a country administered vaccinations.
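A plain-Python sketch of that count, assuming records of (country, date) vaccination entries (the field layout is hypothetical; the real job would use pyFlink over the vaccination dataset):

```python
from collections import defaultdict

def vaccination_days(records):
    """Count the number of distinct days on which each country
    administered vaccinations, given (country, date) records."""
    days = defaultdict(set)
    for country, date in records:
        days[country].add(date)
    return {country: len(dates) for country, dates in days.items()}

# Hypothetical records, for illustration only; duplicates on the
# same day are collapsed by the per-country set of dates.
records = [("A", "2021-01-01"), ("A", "2021-01-02"),
           ("A", "2021-01-01"), ("B", "2021-01-01")]
print(vaccination_days(records))  # {'A': 2, 'B': 1}
```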
**Click on view raw to view/access the video**
- Python
- Flink
- pip
- Installing PyFlink
- $ python -m pip install apache-flink
- Required packages:
- from pyflink.common.serialization import SimpleStringEncoder
- from pyflink.common.typeinfo import Types
- from pyflink.datastream import StreamExecutionEnvironment
- from pyflink.datastream.connectors import StreamingFileSink
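The imports above correspond to the program that the steps below run as datastream_tutorial.py. A sketch along those lines, following the official PyFlink DataStream tutorial for Flink 1.12 (requires a working apache-flink installation to run; the job name is arbitrary):

```python
from pyflink.common.serialization import SimpleStringEncoder
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import StreamingFileSink


def tutorial():
    # Create the execution environment; parallelism 1 keeps a single output file.
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)
    # A bounded two-element stream: (1, 'aaa') and (2, 'bbb').
    ds = env.from_collection(
        collection=[(1, 'aaa'), (2, 'bbb')],
        type_info=Types.ROW([Types.INT(), Types.STRING()]))
    # Write each row as a line of text under /tmp/output.
    ds.add_sink(StreamingFileSink
                .for_row_format('/tmp/output', SimpleStringEncoder())
                .build())
    env.execute("tutorial_job")


if __name__ == '__main__':
    tutorial()
```

Running this is what produces the `1,aaa` / `2,bbb` lines shown in the output below.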
- First, make sure that the output directory doesn't exist:
- rm -rf /tmp/output
- Then run the example you just created on the command line:
- $ python datastream_tutorial.py
- Finally, inspect the result, which is written to the /tmp/output folder:
- $ find /tmp/output -type f -exec cat {} \;
- 1,aaa
- 2,bbb