PowerTools is a utility library designed to simplify and enhance your experience with Python, Apache Spark, and AWS Glue Spark. It provides a collection of tools and functions to streamline your data processing workflows.
You can install PowerTools using pip:
pip install powertools
from pyspark.sql import functions as f  # provides f.lit used below
from lps_glue import LPSGlue

path = "path/to/data"  # placeholder; replace with your own location

with LPSGlue(spark_shell=True) as lpsglue:
    df = lpsglue.read.csv(path)  # Read data from CSV
    df = lpsglue.tran.add_column(df, 'example_col1', f.lit('example'))  # Add column
    lpsglue.write.hudi(
        df=df,
        path=path,
        primary_key='pk1',
        partition_by=["part1", "part2"],
        order_by='ts',
        dedup=False
    )  # Write df in HUDI format
- data manipulation using pandas
- parallelization using concurrent.futures
- and more. Stay tuned for updates!
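As a rough illustration of the parallelization item above, here is a minimal sketch that uses only the standard-library `concurrent.futures` module. It is not part of PowerTools: `process_path` and `process_all` are hypothetical names, and the per-file task is a stand-in for whatever read or transform call you would actually run.

```python
from concurrent.futures import ThreadPoolExecutor


def process_path(path):
    """Hypothetical per-file task; stands in for a real read or transform."""
    return path.upper()


def process_all(paths, max_workers=4):
    """Run process_path over every path in a thread pool.

    Executor.map returns results in the same order as the inputs,
    regardless of which task finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_path, paths))


results = process_all(["a.csv", "b.csv", "c.csv"])
```

Threads suit I/O-bound work such as fetching files; for CPU-bound work in plain Python, `ProcessPoolExecutor` offers the same interface.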
Glue Spark Utilities provides five main modules.
Read data in any format using Spark, without installing additional dependencies:
lpsglue.read.csv(path=filename)
lpsglue.read.parquet(path=filename)
lpsglue.read.hudi(path=filename)
lpsglue.read.delta(path=filename)
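For comparison, the sketch below shows how a similar format dispatch could be written against the standard Spark `DataFrameReader` API (`format`, `options`, `load`). This is an assumption for illustration only, not how PowerTools is implemented: `read_any` and `SUPPORTED_FORMATS` are hypothetical names, and `spark` is expected to be an existing `SparkSession`.

```python
# Hypothetical helper, not part of PowerTools: dispatch on a format name
# using only the standard Spark DataFrameReader methods.
SUPPORTED_FORMATS = {"csv", "parquet", "hudi", "delta"}


def read_any(spark, path, fmt, **options):
    """Load `path` as a DataFrame via spark.read.format(fmt).

    `spark` is assumed to be a SparkSession; extra keyword arguments are
    passed through as reader options (for example header="true" for CSV).
    """
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return spark.read.format(fmt).options(**options).load(path)
```

A call would look like `read_any(spark, filename, "csv", header="true")`. On a plain Spark installation the `hudi` and `delta` sources only resolve when their connectors are on the classpath.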
We welcome contributions to PowerTools! If you have any ideas, suggestions, or bug reports, please open an issue or submit a pull request on our GitHub repository.
PowerTools is licensed under the MIT License.