Apache Beam is a data processing platform. The processing can be for analytics or for ETL (Extract, Transform, Load). Apache Beam doesn't rely on any one execution engine. The input can be streaming data or batch data, and it can come from a source such as a relational database or an in-memory database. Apache Beam is therefore agnostic to the execution platform, to the data, and to the programming language: you can write your logic in Java, Python, or Go.
Pipelines
A pipeline describes the end-to-end data processing job.
PCollection
A PCollection is the data in a pipeline. Reading the input data produces a PCollection, and applying a transformation to that data produces a new PCollection.
PTransform
A PTransform is the logic applied to the data (https://beam.apache.org/documentation/programming-guide/#transforms).
Runner
A runner specifies where and how the pipeline should execute.
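The four concepts above can be seen together in a small example. The following is a minimal sketch, not taken from this project: the function names `to_upper` and `run` and the sample titles are hypothetical, and it assumes `apache-beam` is installed and that the local DirectRunner is acceptable.

```python
def to_upper(title):
    """Per-element logic; beam.Map wraps it into a PTransform."""
    return title.upper()


def run(titles):
    """Build and run a tiny pipeline on the local DirectRunner."""
    # Imported here so to_upper can be used without Beam installed;
    # a top-level import is the more usual style.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # The runner decides where the pipeline executes.
    options = PipelineOptions(["--runner=DirectRunner"])

    # The pipeline runs when the with-block exits.
    with beam.Pipeline(options=options) as pipeline:
        # beam.Create turns an in-memory list into the first PCollection.
        raw = pipeline | "CreateTitles" >> beam.Create(titles)
        # Applying a PTransform yields a new PCollection.
        upper = raw | "ToUpper" >> beam.Map(to_upper)
        # Print each element; a real pipeline would write to a sink.
        upper | "Print" >> beam.Map(print)
```

Calling `run(["dark", "ozark"])` prints each title in upper case; the order of the printed elements is not guaranteed, because a PCollection is unordered.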
python --version
pip --version
Python must be 3.6 or higher, and pip must be 7.0.0 or newer.
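The same version check can also be done from Python itself, which is handy inside a notebook. This is a rough sketch using only the standard library; the helper names `parse_version`, `python_ok`, and `pip_ok` are made up for illustration, and it assumes pip is available as `python -m pip`.

```python
import re
import subprocess
import sys

MIN_PYTHON = (3, 6)
MIN_PIP = (7, 0, 0)


def parse_version(text):
    """Turn a version string such as '21.2.4' into a 3-part tuple of ints."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    # Pad short versions such as '7.0' with zeros so tuples compare correctly.
    return tuple(numbers + [0] * (3 - len(numbers)))


def python_ok():
    """True if the running interpreter meets the minimum Python version."""
    return sys.version_info[:2] >= MIN_PYTHON


def pip_ok():
    """True if the pip belonging to this interpreter meets the minimum."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Output looks like: "pip 21.2.4 from /path/to/pip (python 3.9)"
    return parse_version(result.stdout.split()[1]) >= MIN_PIP
```

If either `python_ok()` or `pip_ok()` returns `False`, upgrade that tool before installing Apache Beam.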
python -m pip install apache-beam
- Extra Requirements
To install extra dependencies, use the following command:
pip install apache-beam[gcp,aws,test,docs]
For more detail go to this link
Google Colab has Python preinstalled, which makes it easy to start using Apache Beam.
- Open a browser such as Firefox or Safari
- Search for Google Colab
- Click on the first link, which is Google Colab
- Sign in with a Google account
- When the window listing recent notebooks appears, click on a notebook
Note: Google Colab works similarly to Jupyter Notebook
- After writing and running the code, save the file locally or to GitHub
See my netflixGroupBy.ipynb Colab Python notebook
Sri Sudheera Project input file