
This repository provides a Maven archetype for a Spark/Scala data engineering framework.


PROMPT Framework V2 - The Highly Efficient and Modular Spark Data Engineering Framework


Building a new Spark ETL/ELT data engineering pipeline from scratch is hard, and the effort can easily drift away from the final goal. Keeping a close eye on day-to-day Spark production work and its hurdles led to PROMPT: a framework that makes implementing big data engineering projects much easier.

PROMPT is a modular framework for building fast, testable Scala Spark applications on top of structured and unstructured data. PROMPT provides an easy-to-use API for reading and writing data with as much parallelism as possible.

Currently, there is no framework that stays close to Spark and still provides the ability to achieve large-vision goals such as designing a whole financial system or an aviation data-flow pipeline.

The PROMPT Framework combines productivity and performance, making it easy to build scalable, type-safe Spark data engineering applications in Scala, especially when developing production pipelines. PROMPT is functional as well as developer-friendly, providing modular transformation steps throughout the data-flow pipeline. With PROMPT, Spark can scale predictably thanks to an abstracted, non-blocking architecture that stays static at the functional level and modular at the development level.

1. Functional Smoothness!

The one thing that makes any data-flow pipeline work well is answer-oriented thinking: always knowing what the final goal is. That is what PROMPT enables in the world of big data, where it is easy to get lost in the complexity of systems, interfaces, data size, velocity and, finally, the goal itself.

1. PROMPT brings the concept of data modelling back into the world of Big Data.
2. You do not have to care about pipeline setup code.
3. Helper APIs make things generic.

2. Technical Smoothness

PROMPT provides the capability to debug at any level of the data flow, from reading and transforming through processing to visualization of the data.

PROMPT consists of the following modules:

├── common
│   ├── common.iml
│   ├── pom.xml
│   └── src
│       ├── main
│       │   └── scala
│       │       └── com
│       │           └── example
│       │               └── data
│       │                   └── engineering
│       │                       └── common
│       │                           ├── Helpers.scala
│       │                           ├── jobs
│       │                           │   └── SparkSessionForCommonModuleJobs.scala
│       │                           ├── oracle
│       │                           │   └── CaseClassGenerator.scala
│       │                           └── TypeString.scala
│       └── test
│           └── scala
│               └── com
│                   └── example
│                       └── data
│                           └── engineering
│                               └── common
│                                   ├── oracle
│                                   │   └── CaseClassGeneratorSpec.scala
│                                   └── SparkSpec.scala
├── data-engineering.iml
├── data-layer
│   ├── data-layer.iml
│   ├── pom.xml
│   └── src
│       ├── main
│       │   └── scala
│       │       └── com
│       │           └── example
│       │               └── dataengineering
│       │                   └── data
│       │                       └── layer
│       │                           ├── clients
│       │                           │   ├── DataProvider.scala
│       │                           │   ├── FileSystem.scala
│       │                           │   ├── MySql.scala
│       │                           │   ├── Oracle.scala
│       │                           │   └── Sap.scala
│       │                           ├── datasources
│       │                           │   ├── dbOracle
│       │                           │   │   ├── Load.scala
│       │                           │   │   └── Schema.scala
│       │                           │   ├── fileSystemSource
│       │                           │   │   ├── Load.scala
│       │                           │   │   └── Schema.scala
│       │                           │   └── Loader.scala
│       │                           ├── jobs
│       │                           │   ├── filesystemJob
│       │                           │   │   ├── FileSystemBatchJob.scala
│       │                           │   │   └── JobCaseClassHandler.scala
│       │                           │   └── SparkSessionForDataLayerJobs.scala
│       │                           ├── output
│       │                           │   └── Writer.scala
│       │                           └── schemas
│       │                               └── LoaderSchema.scala
│       └── test
│           └── scala
│               └── com
│                   └── example
│                       └── dataengineering
│                           └── data
│                               └── layer
│                                   ├── datasources
│                                   │   ├── dbOracle
│                                   │   │   └── LoadSpec.scala
│                                   │   └── fileSystemSource
│                                   │       ├── LoadSpec.scala
│                                   │       └── resources
│                                   │           ├── userdata
│                                   │           │   └── userdata.parquet
│                                   │           └── users
│                                   │               └── users.parquet
│                                   ├── jobs
│                                   │   └── filesystemJob
│                                   │       ├── FileSystemBatchJobSpec.scala
│                                   │       └── JobCaseClasshandlerSpec.scala
│                                   └── SparkSpec.scala
├── functional-models-layer
│   ├── functional-models-layer.iml
│   └── pom.xml
├── LICENSE
├── models-layer
│   ├── models-layer.iml
│   └── pom.xml
├── pipeline
│   ├── pipeline.iml
│   └── pom.xml
├── pom.xml
├── README.md
├── src
│   └── main
│       └── resources
│           └── META-INF
│               └── maven
│                   └── archetype-metadata.xml
└── visualizations-layer
    ├── pom.xml
    └── visualizations-layer.iml

3. Modules in Detail

a. io - Input/Output (Scala object)

ioSchema in PROMPT consists of multiple case classes that define and bind to the input and output data.

Loader in PROMPT reads data into Spark data structures such as RDDs or Datasets. It may behave differently depending on the kind of input data, but the output of a Loader should always be bound to a case class structure.

Writer in PROMPT writes data from Spark data structures such as RDDs or Datasets that are already bound to ioSchema case classes.
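
As a rough illustration of how these pieces fit together, here is a minimal sketch; the User case class, method names and paths are hypothetical, not part of the framework's published API:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical ioSchema: case classes that bind the input and output data.
object IoSchema {
  case class User(id: Long, name: String, country: String)
}

// Hypothetical Loader: reads raw data into a Dataset bound to an ioSchema case class.
object Loader {
  import IoSchema.User

  def loadUsers(spark: SparkSession, path: String): Dataset[User] = {
    import spark.implicits._
    spark.read.parquet(path).as[User] // the output is bound to the User case class
  }
}

// Hypothetical Writer: writes a Dataset that is already bound to an ioSchema case class.
object Writer {
  import IoSchema.User

  def writeUsers(users: Dataset[User], path: String): Unit =
    users.write.mode("overwrite").parquet(path)
}
```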

b. api (Scala object)

LoaderHelper is the object where helper methods reside to load data through the Loader cleanly and smoothly.

ModellerHelper is the object where helper methods reside to shape data according to the needs of either models or modelProcessors.
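
A hedged sketch of what such helper objects might look like; the method names are illustrative only:

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical LoaderHelper: generic utilities used when loading data.
object LoaderHelper {
  def loadTyped[T: Encoder](spark: SparkSession, format: String, path: String): Dataset[T] =
    spark.read.format(format).load(path).as[T]
}

// Hypothetical ModellerHelper: utilities that shape data for models or modelProcessors.
object ModellerHelper {
  def dropNullRows[T](ds: Dataset[T], columns: Seq[String]): Dataset[T] =
    columns.foldLeft(ds)((acc, c) => acc.filter(col(c).isNotNull))
}
```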

c. modeller (Scala object)

This is the object where all the transformations aligned to the business requirements or targets live. There can be multiple layers of modellers, but they all need to take their parameters either from Loaders or from other modellers. The output of every method inside a modeller needs to be mapped to case classes in the ModellerSchema.

ModellerSchema in PROMPT consists of multiple case classes that define and bind to the modellers' input and output data.
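
Continuing the hypothetical sketch from the io section, a modeller that takes a Loader's output and maps its result onto a ModellerSchema case class might look like this:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical ModellerSchema: case classes that bind the modeller's output.
object ModellerSchema {
  case class UserCountByCountry(country: String, userCount: Long)
}

// Hypothetical modeller: transformations aligned to one business target.
// It takes its input from a Loader (or another modeller) and maps every
// result onto a ModellerSchema case class.
object UserModeller {
  import IoSchema.User
  import ModellerSchema.UserCountByCountry

  def usersPerCountry(spark: SparkSession, users: Dataset[User]): Dataset[UserCountByCountry] = {
    import spark.implicits._
    users.groupByKey(_.country)
      .count()
      .map { case (country, n) => UserCountByCountry(country, n) }
  }
}
```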

d. functionalModel (Scala trait)

This is a wrapper covering all the models inside it, and it provides the functional layer that connects and implements the data-flow pipeline according to the business requirements; it can process one or more types of models.
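
One possible shape for such a trait, reusing the hypothetical Loader and UserModeller from the sketches above:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical functionalModel: a trait that wires loaders and modellers
// into one data-flow pipeline for a single business requirement.
trait UserCountPipeline {
  import IoSchema.User
  import ModellerSchema.UserCountByCountry

  def spark: SparkSession
  def inputPath: String
  def outputPath: String

  def run(): Unit = {
    val users: Dataset[User] = Loader.loadUsers(spark, inputPath)
    val counts: Dataset[UserCountByCountry] = UserModeller.usersPerCountry(spark, users)
    counts.write.mode("overwrite").parquet(outputPath)
  }
}
```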

e. Jobs (Scala object with main)

This is where the application starts executing. A Job mainly sits on top of a functionalModel, so it controls a specific execution flow from the start to the end of the pipeline.
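
And a hypothetical Job sitting on top of that functionalModel; the app name and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Job: the entry point that drives one functionalModel end to end.
object UserCountJob extends UserCountPipeline {
  override lazy val spark: SparkSession =
    SparkSession.builder().appName("user-count-job").getOrCreate()

  override val inputPath: String  = "/data/in/users"
  override val outputPath: String = "/data/out/users_per_country"

  def main(args: Array[String]): Unit =
    try run() finally spark.stop()
}
```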

The mind-map diagram below shows how PROMPT preserves Scala's type safety while still providing the ability to move around the pipeline and perform multi-level transformations with great ease.

The whole framework is based on the uniqueness of Scala singleton objects. As described on scala-lang.org, methods and values that aren't associated with individual instances of a class belong in singleton objects, denoted by the keyword object instead of class. Once singleton objects are cached, they can be reused until parameters such as the loading path or other requirements change.

That matters even more in a Spark job where every other transformation or layer is doing the same work again and again.

                         main
                           ▲
             +------functionalModel------+
             |                           |
    ModellerFile12Cache         ModellerFile23Cache
        |           |               |           |
        |           +-------+-------+           |
        |                   |                   |
 LoaderFile1Cache    LoaderFile2Cache    LoaderFile3Cache
        |                   |                   |
      file1               file2               file3

With this modular approach we will be able to crunch lots of data at a fine-grained level as part of each piece of functionality.

At every stage Spark creates new instances of serialized objects because of Java serialization: when a class instance is serialized, a new object is created on every deserialization. The same experiment with a singleton (Scala's object) shows the opposite: even if it is read 10 times, the same object is used every time.
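
A small experiment along these lines (the helper names and tags below are illustrative only): a class instance captured in a closure travels with each task through Java serialization, whereas a top-level singleton is initialized once per executor JVM and reused by every task running there.

```scala
import org.apache.spark.sql.SparkSession

// A class instance captured in the closure is serialized with every task
// and deserialized again on the executors.
class PerTaskHelper extends Serializable {
  def tag: String = s"class@${System.identityHashCode(this)}"
}

// A singleton object is not shipped with the closure; each executor JVM
// initializes it once and all tasks on that JVM reuse the same instance.
object PerJvmHelper {
  def tag: String = s"object@${System.identityHashCode(this)}"
}

object SingletonDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("singleton-demo")
      .master("local[2]") // local experiment only
      .getOrCreate()

    val helper = new PerTaskHelper
    spark.sparkContext
      .parallelize(1 to 4, numSlices = 4)
      .map(i => (i, helper.tag, PerJvmHelper.tag))
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```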

Below is a diagram of handling files as buffers by creating a cache in off-heap memory.

Multi-singleton object explanation: as part of PROMPT, a modeller never has to connect back to a loader, since the loader's data is already cached in JVM off-heap memory.

A Spark process, whether it runs on a cluster or locally, is a JVM process. For any JVM process, the heap size is configured with the -Xmx and -Xms flags.

By default, Spark starts with a 512 MB JVM heap. Spark's spark.storage.safetyFraction parameter limits usage to 90% of that heap to avoid out-of-memory errors. Spark is not really an in-memory tool; it simply utilizes the memory for its Least Recently Used (LRU) cache.

Then there is another parameter, spark.storage.memoryFraction, which controls how much of the safe heap (60%) is used for storage, and spark.shuffle.memoryFraction, which manages shuffle memory. If a shuffle exceeds the safe memory, the job crashes or a driver OOM error is raised.
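
For reference, a sketch of how these legacy (pre-Spark 1.6) settings could be passed on a SparkConf; on current Spark versions the unified memory manager (spark.memory.fraction) replaces them:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MemoryConfigExample {
  def main(args: Array[String]): Unit = {
    // Legacy memory knobs referred to above (superseded by spark.memory.fraction).
    val conf = new SparkConf()
      .setAppName("memory-config-example")
      .set("spark.storage.memoryFraction", "0.6") // share of the safe heap used for the LRU block cache
      .set("spark.shuffle.memoryFraction", "0.2") // share of the heap reserved for shuffle buffers

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... pipeline code ...
    spark.stop()
  }
}
```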

To bypass that, PROMPT provides singleton objects in which every method that explicitly implements a transformation, and can be reused in various places and multiple contexts, is used statically.

https://github.com/prompt-spark/stackexchange-spark-scala-analyser

Just download or clone the framework repository and open the project (it is Maven-based, built from pom.xml files) in any IDE such as IntelliJ. To build the archetype and install it into your local Maven repository:

$ mvn archetype:create-from-project
$ cd target/generated-sources/archetype/
$ mvn install

Then generate a new project from the archetype:

$ mkdir /tmp/archetype
$ cd /tmp/archetype
$ mvn archetype:generate -DarchetypeCatalog=local