/dataskewexample

Data skew case example for Flink

Primary LanguageScalaGNU General Public License v3.0GPL-3.0

# Data Skew in Flink Example Project

This is the project I used to exemplify the issue explored in this article: https://medium.com/@sanr_71172/dealing-with-data-skew-in-flink-b7e4c82c35ef

To run it you will need both Scala and Python installed. Also, we are using Flink 1.13 as it's the current version that AWS Kinesis supports.

### Preparation

You must first generate some synthetic data that Flink will use to run on. This is done by executing the script "data_generator.py". You can customize the amount of records inside the script itself.

Once this is done, you should fill in the absolute path to the generated .csv file that contains the data (in src/resources) inside the two jobs used in this example: SkewedJob.scala and FixedJob.scala

### How to run

Once the preparation is done, it is recommended to run the project in a local cluster with some amount of parallelization (I used 4) to really see the benefit of the Bundle Operator.

I recommend this guide to set up the cluster: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/try-flink/local_installation/

Once you have the cluster up, you can build the project with:
sbt clean assembly

And then run the desired job on the cluster with:
flink run -c net.sitecore.SkewedJob target/scala-2.11/DataSkewExample-assembly-0.1-SNAPSHOT.jar