Building an End-to-End Open-Source Modern Data Platform for Biomedical Data

Primary language: TSQL. License: GNU General Public License v3.0 (GPL-3.0).

This workshop at PyBCN 2022 is a detailed guide to help you navigate the modern data stack and build your own platform using open-source technologies. Data engineering has experienced enormous growth in recent years, enabling rapid progress and innovation as more people than ever think about data resources and how to better leverage them. In this workshop we will explore the related technologies and build, from scratch, an end-to-end modern data platform for the analysis of medical data.

We will use open-source tools and libraries, including the Python-based dbt, Apache Airflow, and Querybook.

The platform will consist of the following components:

  • Data warehouse
  • Data integration
  • Data transformation
  • Data orchestration
  • Data visualization

INSTALL REQUIREMENTS

  • Install Python
  • Install Java
  • Install Docker
    • on Linux, edit your /etc/hosts file and add the line: 172.17.0.1 docker.host.internal
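
The requirements above can be checked before moving on. The following is a minimal sketch, not part of the workshop repository: the helper names missing_tools and has_host_entry are hypothetical, and the executable names python3, java and docker are assumptions about how the tools are installed on your machine. It reports which tools are not on your PATH and whether /etc/hosts already contains the docker.host.internal mapping.

```python
import shutil


def missing_tools(tools=("python3", "java", "docker")):
    """Return the names in `tools` that are not found on PATH.

    The default executable names are assumptions; adjust them to your setup.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


def has_host_entry(hosts_text, ip="172.17.0.1", name="docker.host.internal"):
    """Return True if a non-comment line in hosts_text maps `ip` to `name`."""
    for line in hosts_text.splitlines():
        # Drop anything after '#', then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip and name in fields[1:]:
            return True
    return False


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    try:
        with open("/etc/hosts") as hosts_file:
            print("hosts entry present:", has_host_entry(hosts_file.read()))
    except OSError as exc:
        print("Could not read /etc/hosts:", exc)
```

The script only reads /etc/hosts and never modifies it, so it can be run without root privileges; adding the line itself remains a manual step.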

INSTALL COMPONENTS

Have a drink, and relax ...

FIRST STEPS

  • Clone this repo
git clone https://github.com/alabarga/pybcn22-modern-data-stack.git