Version 2.0.0 introduces a change that may not be compatible with your current workflows/pipelines. Prior to 2.0.0, the default when writing a DataFrame to redshift was to write all data types as VARCHAR. In the current version the redshift data types will be inferred from the DataFrame dtypes.
This package is designed to make it easier to get data from redshift into a pandas DataFrame and vice versa. The pandas_redshift package only supports python3.
pip install pandas-redshift
import pandas_redshift as pr
Connect to redshift. If port is not supplied it will be set to the Amazon default of 5439.
As of release 1.1.2 you can exclude the password if you are using a .pgpass file.
pr.connect_to_redshift(dbname = <dbname>,
                       host = <host>,
                       port = <port>,
                       user = <user>,
                       password = <password>)
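For example, a filled-in connection might look like the following. The host, database, and user names here are made up, and the password argument is simply left out on the assumption (per the note above) that a matching entry exists in your .pgpass file.

# Hypothetical connection; all values below are placeholders
pr.connect_to_redshift(dbname = 'analytics',
                       host = 'example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com',
                       port = 5439,
                       user = 'analyst')
# password omitted: assumed to be resolved from ~/.pgpass (release 1.1.2+)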
Query redshift and return a pandas DataFrame.
data = pr.redshift_to_pandas('select * from gawronski.nba_shots_log')
data.iloc[0:6, 0:5]
    GAME_ID                  MATCHUP LOCATION  W  FINAL_MARGIN
0  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
1  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
2  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
3  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
4  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
5  21400899  MAR 4, 2015 - CHA @ BKN        A  W            24
Write a pandas DataFrame to redshift. Requires access to an S3 bucket and that pr.connect_to_redshift has already been run.
If the table currently exists IT WILL BE DROPPED and then the pandas DataFrame will be put in its place.
If you set append = True the table will be appended to (if it exists).
# Connect to S3
pr.connect_to_s3(aws_access_key_id = <aws_access_key_id>,
                 aws_secret_access_key = <aws_secret_access_key>,
                 bucket = <bucket>,
                 subdirectory = <subdirectory>
                 # As of release 1.1.1 you are able to specify an aws_session_token (if necessary):
                 # aws_session_token = <aws_session_token>
                 )
# Write the DataFrame to S3 and then to redshift
pr.pandas_to_redshift(data_frame = data,
                      redshift_table_name = 'gawronski.nba_shots_log')
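To add rows to an existing table instead of dropping and recreating it, pass append = True (see the full list of options below):

# Append to the existing table rather than replacing it
pr.pandas_to_redshift(data_frame = data,
                      redshift_table_name = 'gawronski.nba_shots_log',
                      append = True)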
Other options:
As of v1.1.2 you can specify the region (necessary if the S3 bucket is in a different region than your Redshift cluster).
pr.pandas_to_redshift(data_frame,
                      redshift_table_name,
                      # Defaults:
                      column_data_types = None, # A list of column data types. As of 2.0.0, if not supplied the data types will be inferred from the DataFrame dtypes.
                      index = False,
                      save_local = False, # If set to True a csv of the DataFrame will also be saved in the current directory
                      delimiter = ',',
                      quotechar = '"',
                      dateformat = 'auto',
                      timeformat = 'auto',
                      region = '',
                      append = False)
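For example, to reproduce the pre-2.0.0 behaviour of writing every column as VARCHAR you can pass an explicit list of data types. This sketch assumes column_data_types takes one entry per DataFrame column, in column order; the VARCHAR length and the region value are illustrative only.

pr.pandas_to_redshift(data_frame = data,
                      redshift_table_name = 'gawronski.nba_shots_log',
                      # one type per column of the DataFrame
                      column_data_types = ['VARCHAR(256)'] * len(data.columns),
                      # only needed if the S3 bucket is in a different region than the cluster
                      region = 'us-west-2')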
Redshift data types: http://docs.aws.amazon.com/redshift/latest/dg/c_Supported_data_types.html
Finally, close the cursor, commit and close the database connection, and remove the connection variables from the environment.
pr.close_up_shop()
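Putting the pieces together, a typical round trip might look something like the sketch below. All credentials, bucket names, and table names are placeholders, and the transformation step is just an illustration.

import pandas_redshift as pr

# Placeholders throughout -- substitute your own cluster, credentials, and bucket
pr.connect_to_redshift(dbname = 'analytics',
                       host = 'example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com',
                       port = 5439,
                       user = 'analyst',
                       password = 'not-a-real-password')

pr.connect_to_s3(aws_access_key_id = 'placeholder-access-key-id',
                 aws_secret_access_key = 'placeholder-secret-key',
                 bucket = 'my-etl-bucket',
                 subdirectory = 'staging')

# Pull data down, transform it, and push the result back up
shots = pr.redshift_to_pandas('select * from gawronski.nba_shots_log')
shots = shots[shots['LOCATION'] == 'A']  # keep only away games, as an illustration
pr.pandas_to_redshift(data_frame = shots,
                      redshift_table_name = 'gawronski.nba_shots_log_away')

pr.close_up_shop()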
As this package is largely a layer over psycopg2, a convenience function has been added to execute and commit SQL statements that run entirely on redshift and don't return data to your local machine (for example, creating a new table).
pr.exec_commit("""
create table combined_table as
select *
from table1
union
select *
from table2;
""")
If you encounter the error: psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block
you can access the psycopg2 internals with the following:
pr.core.connect.commit()
pr.core.connect.rollback()
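For example, a minimal recovery sketch (assuming pr.core.connect is the live psycopg2 connection, as shown above, and using a deliberately failing statement) might look like:

import psycopg2

try:
    # This statement is expected to fail and abort the transaction
    pr.exec_commit('drop table gawronski.table_that_does_not_exist;')
except psycopg2.Error:
    # Roll back so the connection can accept new commands again
    pr.core.connect.rollback()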