propertyScrapingDataEngineering

An example project where I test out different data engineering tools, such as Apache Druid, PySpark and Dagster.

With Scrapy I'm extracting key information about various apartments, loading it into S3-compatible storage (MinIO) and ingesting it into Apache Druid.
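
As a rough sketch of that flow, the snippet below shows a minimal Scrapy item pipeline that collects the scraped apartment items and uploads them as one JSON file to a MinIO bucket. The endpoint localhost:9000, the bucket name apartments and the way the MINIO_USER/MINIO_PASSWORD environment variables are read are assumptions for illustration, not the project's actual configuration.

  import io
  import json
  import os
  from datetime import datetime

  from minio import Minio


  class MinioUploadPipeline:
      """Collect scraped apartment items and upload them to MinIO as one JSON file."""

      def open_spider(self, spider):
          # Assumption: MinIO runs locally and the credentials come from the environment.
          self.client = Minio(
              "localhost:9000",
              access_key=os.environ["MINIO_USER"],
              secret_key=os.environ["MINIO_PASSWORD"],
              secure=False,
          )
          if not self.client.bucket_exists("apartments"):
              self.client.make_bucket("apartments")
          self.items = []

      def process_item(self, item, spider):
          # Keep the item in memory; it is written out once the crawl finishes.
          self.items.append(dict(item))
          return item

      def close_spider(self, spider):
          # One JSON file per crawl; Druid can then ingest it from the S3-compatible bucket.
          payload = json.dumps(self.items).encode("utf-8")
          self.client.put_object(
              "apartments",
              f"raw/{datetime.now():%Y-%m-%d_%H%M%S}.json",
              io.BytesIO(payload),
              length=len(payload),
              content_type="application/json",
          )

Such a pipeline would then be enabled via the ITEM_PIPELINES setting in the Scrapy project's settings.py.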

Tech Stack

  • Scrapy
  • Dagster
  • Apache Druid
  • Docker
  • Apache Superset
  • Jupyter Notebook
  • MinIO (https://min.io/)
  • PySpark

Environment Variables

To run this project, you will need to add the following environment variables to your .env file:

MINIO_USER

MINIO_PASSWORD
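
For example, a minimal .env could look like this (placeholder values, not real credentials):

  MINIO_USER=minio-admin
  MINIO_PASSWORD=change-me

In the sketch above, these two variables are read as the MinIO access key and secret key.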

Installation

Install the project's dependencies with pip or Poetry

  pip install -r requirements.txt

Or

  poetry install

Run Locally

Clone the project

  git clone https://github.com/stejul/dataEngineeringExample

Install dependencies

  poetry install

Or

  pip install -r requirements.txt

Start the server

WIP SECTION

Running Tests

To run tests, run the following command (assuming the test suite uses pytest):

  poetry run pytest

Deployment

To deploy this project locally, start the services with Docker (assuming they are defined in a docker-compose.yml):

  docker compose up -d

Lessons Learned

wip

License

MIT

Authors