Bigdata-PyFlink

Introduction to PyFlink and its examples

Team Members

  • Sumana Reddy
  • Navya Devineni
  • Ravichander Reddy
  • Krishna Sumanth
  • Swaroopa Tirumalareddy
  • Vishal Reddy

Data Sets

Introduction to PyFlink

  • PyFlink is the Python API for Apache Flink. It lets you write Flink batch and streaming jobs in Python.

Subtopics

  • Krishna Sumanth Koyyalamudi - Average time gap between the year a movie or TV show was released and the year it was added to Netflix.
  • Swaroopa Tirumalareddy - Initial operations that report the number of rows, the memory usage, details about the columns, whether there are any null values, and the data types in my dataset.
  • Ravichander Reddy Goli - To really understand what is going on in the data, we need to see its distribution. I would like to show this using a histogram.
  • Vishal Reddy Vennavaram - Various operations, such as word count, using PyFlink.
  • Sumana Reddy Reddybathula - A scatter matrix, to see how the variables in the data may be related to each other.
  • Navya Devineni - How to install PyFlink, including a local installation.
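
As an illustration of the first subtopic, the average gap between release year and the year a title was added can be sketched with pandas. This is a minimal sketch, not the project's notebook code: the rows are made up, and the column names `release_year` and `date_added` (with dates written like "January 1, 2019") are assumptions about the dataset's layout.

```python
import pandas as pd

# Hypothetical sample rows; the column names and date format are assumptions.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [2015, 2018, 2020],
    "date_added": ["January 1, 2019", "March 15, 2019", "July 4, 2020"],
})

# Parse the text date and keep only the year the title was added.
added_year = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y"
).dt.year

# Gap in whole years between release and arrival on the platform.
df["gap_years"] = added_year - df["release_year"]

average_gap = df["gap_years"].mean()
print(f"Average gap: {average_gap:.2f} years")
```

Working with whole years keeps the arithmetic simple, at the cost of ignoring the month in which a title was added.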

VidGrid video links (Individual)

Swaroopa Tirumalareddy

For this project, I took a dataset from kaggle.com that contains information about Netflix movies and TV shows. My contribution was to run operations that give basic information about the dataset: displaying the feature names; fetching the number of rows, the memory usage, details about the columns, the data types, and whether there are any null values; and counting the number of different values in a specified column.
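
The operations described above map onto a handful of pandas calls. The following is a minimal sketch under stated assumptions: the small DataFrame is invented for illustration, and the column names (`type`, `title`, `director`, `release_year`) are assumed rather than taken from the project's notebook.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; in practice it would be loaded
# from the uploaded CSV file with pd.read_csv.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "release_year": [2015, 2018, 2020, 2020],
})

# Feature (column) names and number of rows.
feature_names = list(df.columns)
n_rows = len(df)

# Prints column dtypes, non-null counts, and memory usage.
df.info(memory_usage="deep")

# Number of null values in each column.
nulls_per_column = df.isnull().sum()

# How often each value occurs in one column, and how many distinct
# values another column holds.
type_counts = df["type"].value_counts()
distinct_years = df["release_year"].nunique()

print(feature_names, n_rows)
print(nulls_per_column)
print(type_counts)
print(distinct_years)
```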

Prerequisites:

  • Apache Flink
  • Python
  • Colaboratory (Google Colab), which lets you write and execute Python in your browser with zero configuration, free access to GPUs, and easy sharing.
  • A dataset to perform operations on

Process and Commands:

1 First, open Colab in a web browser and select New notebook.

2 Install Apache Flink using the following command:

  • !pip install apache-flink

3 Once Apache Flink is installed, import all the necessary libraries.

4 For my operations, I imported the following libraries:

  • from pyflink.table import StreamTableEnvironment, DataTypes, table_config

  • from pyflink.datastream import StreamExecutionEnvironment

  • import pandas as pd

  • from pandas.plotting import scatter_matrix

5 After importing all the required libraries, upload your dataset to Colab and start working on your project.

References: