Pandas on AWS
An AWS Professional Services open source initiative | aws-proserve-opensource@amazon.com
Source | Installation Command |
---|---|
PyPI | pip install awswrangler |
Conda | conda install -c conda-forge awswrangler |
- Quick Start
- Read The Docs
- Community Resources
- Who uses AWS Data Wrangler?
- Amazon SageMaker Data Wrangler?
Installation command: pip install awswrangler
import awswrangler as wr
import pandas as pd
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data on Data Lake
wr.s3.to_parquet(
df=df,
path="s3://bucket/dataset/",
dataset=True,
database="my_db",
table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()
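In the Redshift example above, `con.close()` is skipped if the query raises. The following is a minimal sketch of one way to guarantee the connection is closed, using the standard library's `contextlib.closing`. The helper name `read_redshift` is hypothetical and not part of the library; the connection name and SQL are the same placeholders used in the Quick Start.

```python
from contextlib import closing

import pandas as pd


def read_redshift(sql: str, glue_connection: str) -> pd.DataFrame:
    """Hypothetical helper: run `sql` on Redshift and always close the connection."""
    # Imported inside the function so this module loads even where
    # awswrangler is not installed (an illustrative choice, not a requirement).
    import awswrangler as wr

    # closing() calls con.close() when the block exits, even if the query raises.
    with closing(wr.redshift.connect(glue_connection)) as con:
        return wr.redshift.read_sql_query(sql, con=con)
```

With valid AWS credentials and an existing Glue connection, this could be called as `read_redshift("SELECT * FROM external_schema.my_table", "my-glue-connection")`.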
- What is AWS Data Wrangler?
- Install
- Tutorials
- 001 - Introduction
- 002 - Sessions
- 003 - Amazon S3
- 004 - Parquet Datasets
- 005 - Glue Catalog
- 006 - Amazon Athena
- 007 - Databases (Redshift, MySQL and PostgreSQL)
- 008 - Redshift - Copy & Unload
- 009 - Redshift - Append, Overwrite and Upsert
- 010 - Parquet Crawler
- 011 - CSV Datasets
- 012 - CSV Crawler
- 013 - Merging Datasets on S3
- 014 - Schema Evolution
- 015 - EMR
- 016 - EMR & Docker
- 017 - Partition Projection
- 018 - QuickSight
- 019 - Athena Cache
- 020 - Spark Table Interoperability
- 021 - Global Configurations
- 022 - Writing Partitions Concurrently
- 023 - Flexible Partitions Filter
- 024 - Athena Query Metadata
- 025 - Redshift - Loading Parquet files with Spectrum
- 026 - Amazon Timestream
- 027 - Amazon Timestream 2
- API Reference
- License
- Contributing
- Legacy Docs (pre-1.0.0)
Please send a Pull Request with your resource reference and @githubhandle.
- Optimize Python ETL by extending Pandas with AWS Data Wrangler [@igorborgest]
- Reading Parquet Files With AWS Lambda [@anand086]
- Transform AWS CloudTrail data using AWS Data Wrangler [@anand086]
- Rename Glue Tables using AWS Data Wrangler [@anand086]
- Getting started on AWS Data Wrangler and Athena [@dheerajsharma21]
- Simplifying Pandas integration with AWS data related services [@bvsubhash]
Knowing which companies are using this library is important to help prioritize the project internally.
If you can, please send a Pull Request with your company name and @githubhandle.
- Amazon
- AWS
- Cepsa [@alvaropc]
- Cognitivo [@msantino]
- Digio [@afonsomy]
- DNX [@DNXLabs]
- Funcional Health Tech [@webysther]
- Informa Markets [@mateusmorato]
- LINE TV [@bryanyang0528]
- M4U [@Thiago-Dantas]
- nrd.io [@mrtns]
- OKRA Technologies [@JPFrancoia, @schot]
- Pier [@flaviomax]
- Pismo [@msantino]
- ringDNA [@msropp]
- Serasa Experian [@andre-marcos-perez]
- Shipwell [@zacharycarter]
- strongDM [@mrtns]
- Thinkbumblebee [@dheerajsharma21]
- Zillow [@nicholas-miles]
Amazon SageMaker Data Wrangler is a new SageMaker Studio feature with a similar name but a different purpose from the AWS Data Wrangler open source project.
- AWS Data Wrangler is open source, runs anywhere, and is focused on code.
- Amazon SageMaker Data Wrangler is specific to the SageMaker Studio environment and is focused on a visual interface.