Task: implementing a batch processing
job as part of a data ETL pipeline.
- Subtasks:
- Parsing the input file
- 70/30 partition of the data per class into two splits, namely train and eval (see the sketch after this list)
- Input: bounded data for Beam
- Generate the data with the slightly changed generate_data.py script:
python generate_data.py
- Then run the split script, which saves the output folders (train and eval) into a result folder:
python data_split_script.py
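For reference, the 70/30 split itself can be expressed with Beam's Partition transform. The sketch below is only an illustration of the approach, not the actual data_split_script.py; the input file name data.csv, its layout (class label in the first field), and the output prefixes under result/ are assumptions.

# Illustrative 70/30 split with Beam's Partition transform (not the actual script).
# data.csv, its field layout, and the result/ prefixes are assumptions.
import random

import apache_beam as beam


def to_split(record, num_partitions):
    # Send roughly 70% of records to partition 0 (train) and 30% to partition 1 (eval).
    return 0 if random.random() < 0.7 else 1


with beam.Pipeline() as p:
    records = p | "Read" >> beam.io.ReadFromText("data.csv")
    train, eval_split = records | "Split70_30" >> beam.Partition(to_split, 2)
    train | "WriteTrain" >> beam.io.WriteToText("result/train/part")
    eval_split | "WriteEval" >> beam.io.WriteToText("result/eval/part")

A random per-element assignment gives roughly a 70/30 ratio within each class; an exact per-class split would require grouping by the class label first.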
Distributed computing framework: Apache Beam. It is well suited for batch data processing pipelines (which serves our purpose here) and job scheduling. The task involves designing a pipeline and executing it on a runner. Beam automatically handles the execution of a pipeline on many workers of distributed computing frameworks such as Spark. Beam offers flexibility in SDKs, pipeline design, custom transformations, execution on different runners, and much more.
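As a concrete illustration of pipeline design with a custom transformation, a minimal sketch might look like this; the ParseRecord DoFn and the input.csv file name are placeholders, not part of the actual task code.

# Minimal Beam pipeline with a custom transformation (illustrative only;
# ParseRecord and input.csv are placeholders).
import apache_beam as beam

class ParseRecord(beam.DoFn):
    """Custom transform: split a CSV line into (class_label, feature_fields)."""
    def process(self, line):
        fields = line.strip().split(",")
        yield fields[0], fields[1:]

with beam.Pipeline() as p:  # no runner specified, so the DirectRunner is used
    (p
     | "Read" >> beam.io.ReadFromText("input.csv")
     | "Parse" >> beam.ParDo(ParseRecord())
     | "Print" >> beam.Map(print))

The same pipeline code runs unchanged on other runners; only the pipeline options change.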
Currently, I am testing the Beam pipeline locally (with the DirectRunner) since the data file is small. In Beam, we can choose different runners, e.g., Spark or Flink. We set up the runner while designing the pipeline, using the pipeline options that configure the pipeline's execution.
- Setup in an options list:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",
])
# Pass the options by keyword; the first positional argument of Pipeline is the runner.
with beam.Pipeline(options=options) as pipeline:
    ...  # build the pipeline's transforms here
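For the local testing mentioned above, the same pattern works with the DirectRunner; this variant is only a sketch, not a separate script from the task.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local-testing variant: the DirectRunner executes the pipeline in-process,
# which suits the small data file used here.
local_options = PipelineOptions(["--runner=DirectRunner"])
with beam.Pipeline(options=local_options) as pipeline:
    ...  # same transforms as above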
- The distribution of datapoints per class is random.
- Two possible classes: asset_1 and asset_2.
- Independent of the size of the input data file, the number of datapoints per class in the data is not fixed (a quick per-class count is sketched below).
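To sanity-check this, the per-class counts can be computed inside the pipeline. The sketch below assumes the class label is the first comma-separated field of each line and that the input file is named data.csv; both are assumptions, not facts from the scripts.

# Count datapoints per class; the input name data.csv and the "label is the
# first CSV field" layout are assumptions for illustration only.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.io.ReadFromText("data.csv")
     | "KeyByClass" >> beam.Map(lambda line: (line.split(",")[0], 1))
     | "CountPerClass" >> beam.combiners.Count.PerKey()
     | "Print" >> beam.Map(print))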