/dataforge-core

DataForge helps data teams write functional transformation pipelines by leveraging software engineering principles

Primary LanguagePLpgSQLApache License 2.0Apache-2.0

DataForge Core-Light DataForge Core-Dark

DataForge helps data analysts and engineers build and extend data solutions by leveraging modern software engineering principles.

OSSRank

Understanding DataForge

DataForge enables writing of inline functions using single-column SQL expressions rather than CTEs, procedural scripts, or set-based models.

For an overview of the underlying concepts, check out this introduction blog.

Each function:

  • is pure, with no side effects
  • returns single column
  • is composable with other functions

DataForge software engineering principles:

These principles allow DataForge projects to be easy to modify and extend - even with thousands of integrated pipelines.

Explore the Core CLI or learn more about how Core powers DataForge Cloud.

Requirements

Dataforge Core is a code framework and command line tool to develop transformation functions and compile them into executable Spark SQL.

To run the CLI you will need:

  • Java 8 or higher
  • A PostgreSQL v14+ server with a dedicated empty database
    • Check out our friends over at Tembo
  • Python version 3.12+

The CLI also includes an integration to run the code in Databricks. To support this you need:

Installation and Quickstart

  • Open a new command line window

  • Validate Java and Python are installed correctly:

    > java --version
    openjdk 21.0.3 2024-04-16 LTS
    
    > python --version
    Python 3.12.3
    
  • Install Dataforge by running:

    > pip install dataforge-core
    Collecting dataforge-core...
    Installing collected packages: dataforge-core
    Successfully installed dataforge-core...
    
  • Validate installation:

    > dataforge --version
    dataforge-core 1.0.0
    
  • Configure connections and credentials to Postgres and optionally Databricks

    > dataforge --configure
    Enter postgres connection string: postgresql://postgres:<postgres-server-url>:5432/postgres
    Do you want to configure Databricks SQL Warehouse connection (y/n)? y
    Enter Server hostname: <workspace-url>.cloud.databricks.com
    Enter HTTP path: /sql/1.0/warehouses/<warehouse-guid>
    Enter access token: <token-guid>
    Enter catalog name: <unity_catalog_name>
    Enter schema name: <schema_in_catalog_name>
    Connecting to Databricks SQL Warehouse <workspace-url>.cloud.databricks.com
    Databricks connection validated successfully
    Profile saved in C:\Users...
    
  • Navigate to an empty folder and initialize project structure and sample files:

    > dataforge --init
    Initialized project in C:\Users...
    
  • Deploy dataforge structures to Postgres

    > dataforge --seed
    All objects in schema(s) log,meta in postgres database will be deleted. Do you want to continue (y/n)? y
    Initializing database..
    Database initialized
    
  • Build sample project

    > dataforge --build
    Validating project path C:\Users...
    Started import with id 1
    Importing project files...
    <list of files>
    Files parsed
    Loading objects...
    Objects loaded
    Expressions validated
    Generated 8 source queries
    Generated 1 output queries
    Generated run.sql
    Import completed successfully
    
  • Execute in Databricks

    > dataforge --run
    Connecting to Databricks SQL Warehouse <workspace-url>.cloud.databricks.com
    Executing query
    Execution completed successfully
    

Commands

-h, --helpDisplay this help message and exit
-v, --versionDisplay the installed DataForge version
-c, --configureConnect to Postgres database and optionally Databricks SQL Warehouse
-s, --seedDeploy tables and scripts to postgres database
-i, --init [Project Path]Initialize project folder structure with sample code
-b, --build [Project Path]Compile code, store results in Postgres, and generate target SQL files
-r, --run [Project Path]Run compiled project on Databricks SQL Warehouse
-p, --profile [Profile Path]Update path of stored credentials profile file

Links