Spark pipeline to ingest and query data

postman-pipeline

Multiple approaches to building Spark pipelines for data ingestion and querying.

Points to achieve

  • Code should follow OOP principles
  • Non-blocking, parallel ingestion
  • Update products in the table using sku as the primary key (upsert)
  • Count products, aggregated by name
  • Multiple notebook runs must not truncate the created table
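The goals above can be sketched outside Spark in a few lines of plain Python. This is a minimal, hypothetical illustration (the class name, schema, and batch shape are assumptions, not taken from the repo): an OOP ingestor that submits batches to a thread pool without blocking the caller, upserts rows keyed on sku so reruns never truncate the table, and aggregates product counts by name.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

class ProductIngestor:
    """Hypothetical sketch: parallel batch ingestion with sku-keyed upserts."""

    def __init__(self):
        # sku -> product row; reruns update rows in place instead of truncating
        self.table = {}

    def upsert(self, row):
        # sku acts as the primary key: an existing row is overwritten
        self.table[row["sku"]] = row

    def ingest_batch(self, batch):
        for row in batch:
            self.upsert(row)
        return len(batch)

    def ingest_parallel(self, batches, workers=4):
        # submit() returns immediately, so ingestion is non-blocking per batch
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(self.ingest_batch, b) for b in batches]
            return sum(f.result() for f in futures)

    def counts_by_name(self):
        # aggregated count of products per name
        return Counter(row["name"] for row in self.table.values())
```

In the actual notebooks the same shape would map onto Spark primitives (DataFrames, merge/upsert writes, `groupBy("name").count()`); the sketch only shows the control flow the requirements describe.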

The Spark-MongoDB approach is best suited for this task, since data modelling is the main concern: querying and updating documents subject to constraints is comparatively easy in MongoDB.
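To make that last point concrete, here is an in-memory stand-in for MongoDB's upsert semantics, which is what keeps sku-keyed updates simple there. The function and schema are illustrative assumptions; with a real deployment the equivalent pymongo call would be `collection.update_one({"sku": sku}, {"$set": doc}, upsert=True)`.

```python
def mongo_style_upsert(collection, flt, update, upsert=True):
    """Stand-in for pymongo's update_one(flt, {"$set": ...}, upsert=True),
    run against a plain list of dicts instead of a real collection."""
    for doc in collection:
        # match on the filter fields (here the filter would be {"sku": ...})
        if all(doc.get(k) == v for k, v in flt.items()):
            doc.update(update["$set"])  # update the matched document in place
            return "updated"
    if upsert:
        # no match: insert a new document built from the filter plus the update
        new_doc = dict(flt)
        new_doc.update(update["$set"])
        collection.append(new_doc)
        return "inserted"
    return "noop"
```

Because the constraint (sku) lives in the filter and the change lives in `$set`, repeated runs converge on one document per sku rather than duplicating or truncating data.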