Kelly McVey / Wenbin Zhao Billy
For our ETL project we will be collecting stock market data from the S&P for the last 5 years using Kaggle, along with consumer confidence index data from OECD. They come in the form of CSV files, and will be joined by date. Since the consumer confidence data ia based on quarter averages, we will need to find the averages for all 500 companies over the same time period.
Both data sets are in CSV format. We need to make some selections on the source webpage in order to download the proper format. After downloading the CSV files. We use pandas to load them into dataframes.
https://data.oecd.org/leadind/consumer-confidence-index-cci.htm
https://www.kaggle.com/camnugent/sandp500/version/4#
- All stocks are grouped by month
- An average (mean) aggregated value was calculated from closing price.
- Date format is changed to YYYY-MM format, renamed column to "month"
- only select country US
- change time filed name to 'month' in order to match with S&P500 data set
- final dataset (merged data) is completed product / dataset