PowerTools

PowerTools is a utility library for working with Python, Apache Spark, and AWS Glue Spark. It provides a collection of tools and functions that streamline data processing workflows.

Table of Contents

- Installation
- Usage
  - Quick Start
  - Python Utilities
  - Spark Utilities
  - Glue Spark Utilities
    - 1. Read
      - CSV
      - PARQUET
      - HUDI
      - DELTA LAKE
    - 2. Tran
    - 3. Write
    - 4. Log
    - 5. AWS
- Contributing
- License

Installation

You can install PowerTools using pip:

pip install powertools

Usage

Quick Start

from pyspark.sql import functions as f

from lps_glue import LPSGlue

# `path` must point to your data location; define it before running this snippet.

with LPSGlue(spark_shell=True) as lpsglue:
    df = lpsglue.read.csv(path)  # Read data from CSV
    df = lpsglue.tran.add_column(df, 'example_col1', f.lit('example'))  # Add a column
    lpsglue.write.hudi(
        df=df,
        path=path,
        primary_key='pk1',
        partition_by=["part1", "part2"],
        order_by='ts',
        dedup=False
    )  # Write df in HUDI format
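
The keyword arguments passed to `lpsglue.write.hudi` resemble the standard Hudi Spark datasource settings. As a rough sketch only, and assuming that `primary_key`, `partition_by`, and `order_by` correspond to Hudi's record key, partition path, and precombine fields (this README does not confirm it), the mapping could look like the following. The helper `build_hudi_options` and its `table_name` parameter are hypothetical and not part of PowerTools; `dedup` is left out because its behavior is not documented here.

```python
from typing import Dict, List, Optional


def build_hudi_options(
    table_name: str,
    primary_key: str,
    order_by: str,
    partition_by: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Hypothetical helper: translate friendly arguments into Hudi write options."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": primary_key,
        "hoodie.datasource.write.precombine.field": order_by,
    }
    if partition_by:
        # Partition columns are joined with commas; multi-column partitioning
        # may also need a matching key generator setting in Hudi.
        options["hoodie.datasource.write.partitionpath.field"] = ",".join(partition_by)
    return options


# Same values as in the Quick Start; the table name is a placeholder.
options = build_hudi_options(
    table_name="example_table",
    primary_key="pk1",
    order_by="ts",
    partition_by=["part1", "part2"],
)
```

In plain PySpark, a dictionary like this would typically be passed as `df.write.format("hudi").options(**options).mode("append").save(path)`.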

Python Utilities

*Work In Progress:*

- Data manipulation using pandas
- Parallelization using concurrent.futures

More to come. Stay tuned for updates!
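
Until these utilities are available, the kind of parallelization mentioned above can be done directly with the standard library. The sketch below is not PowerTools code; `parallel_map` and `square` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def parallel_map(func: Callable[[T], R], items: Iterable[T], max_workers: int = 4) -> List[R]:
    """Apply func to every item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        return list(executor.map(func, items))


def square(x: int) -> int:
    """Toy workload standing in for a real task such as a file download."""
    return x * x


results = parallel_map(square, [1, 2, 3, 4])
```

Threads suit I/O-bound work such as network or storage calls; for CPU-bound work, `ProcessPoolExecutor` from the same module is usually the better fit.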

Spark Utilities

*Coming Soon*

Glue Spark Utilities

Glue Spark Utilities is organized into five main modules: Read, Tran, Write, Log, and AWS.

1. Read

Read data in any of the supported formats with Spark, without installing additional dependencies.

CSV

  lpsglue.read.csv(path=filename)

PARQUET

  lpsglue.read.parquet(path=filename)

HUDI

  lpsglue.read.hudi(path=filename)

DELTA LAKE

  lpsglue.read.delta(path=filename)
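
For comparison, plain PySpark reads each of these formats through `spark.read.format(...).load(...)`. The sketch below is an assumption about what such a wrapper might delegate to, not PowerTools' actual implementation. `read_any` and `SUPPORTED_FORMATS` are hypothetical names, and in plain PySpark the Hudi and Delta Lake formats need their connectors on the Spark classpath.

```python
SUPPORTED_FORMATS = ("csv", "parquet", "hudi", "delta")


def read_any(spark, fmt, path, **options):
    """Hypothetical helper: load `path` with the Spark data source named `fmt`.

    `spark` is expected to be a pyspark.sql.SparkSession (or a compatible object).
    """
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    # DataFrameReader.format, .options, and .load are standard PySpark calls.
    return spark.read.format(fmt).options(**options).load(path)
```

With an active session, a call such as `read_any(spark, "csv", filename, header="true")` would return a DataFrame.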

2. Tran

3. Write

4. Log

5. AWS

Contributing

We welcome contributions to PowerTools! If you have any ideas, suggestions, or bug reports, please open an issue or submit a pull request on our GitHub repository.

License

PowerTools is licensed under the MIT License.

jeffreykky/powertools