This is a data analytics project for a 401k portfolio.
The data in example.csv has been sanitized.
Tech stack: Apache Spark via PySpark, pandas, Python, Jupyter Notebook, matplotlib, and Delta tables.
- The Jupyter notebook uses PySpark to read the example dataset.
- PySpark's SQL capability is used to perform some data cleaning.
- Some columns contained $ signs that could not be processed as numbers, so the $ signs were removed.
- Some columns used - or parentheses ( ) to indicate a negative amount or quantity.
- Date columns were parsed as dates on the initial import of the CSV.
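The cleaning rules above (strip $, treat `-` or parentheses as negative) can be sketched as a plain Python helper; in the notebook this is done with PySpark SQL, and the function name here is illustrative only:

```python
def clean_amount(raw: str) -> float:
    """Normalize a currency string like '$1,234.56' or '($50.00)' to a float."""
    s = raw.strip().replace("$", "").replace(",", "")
    # Parentheses denote a negative amount or quantity
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]
    return float(s)
```

The same logic could be registered as a UDF or expressed with Spark SQL string functions when cleaning the full dataset.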
- Use SQL to:
- identify the securities invested in
- identify the total amount allocated
- identify personal contributions (category: Employee pre-tax contributions)
- identify employer contributions (category: Employer matching 401k contributions (fully vested))
- identify portfolio fees
- identify allocated contributions per category
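The SQL aggregations above can be mirrored in pandas; the column names and sample values below are assumptions for illustration, not the real dataset:

```python
import pandas as pd

# Hypothetical transaction table with a category and signed amount per row
txns = pd.DataFrame({
    "category": [
        "Employee pre-tax contributions",
        "Employer matching 401k contributions (fully vested)",
        "Employee pre-tax contributions",
        "Fees",
    ],
    "amount": [500.0, 250.0, 500.0, -3.5],
})

# Allocated contributions per category (GROUP BY category)
per_category = txns.groupby("category")["amount"].sum()

# Total amount allocated: sum of positive inflows only
total_allocated = txns.loc[txns["amount"] > 0, "amount"].sum()
```

In the notebook the equivalent queries run through `spark.sql` against the cleaned table.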
- use pandas to generate visualizations of contribution percentages
- calculate the total contribution per category
- calculate the total quantity (shares) and total amount per security
- visualize each security's progress
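A minimal sketch of the contribution-percentage visualization with pandas and matplotlib; the categories and amounts are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-category contribution totals
contrib = pd.Series({"Employee pre-tax": 1000.0, "Employer match": 500.0})

# Convert totals to percentages of the whole
pct = contrib / contrib.sum() * 100

fig, ax = plt.subplots()
ax.pie(pct, labels=pct.index, autopct="%1.1f%%")
ax.set_title("Contribution percentages by category")
fig.savefig("contributions.png")
```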
- Use the Twelve Data API to gather current market prices.
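A sketch of fetching a current price from Twelve Data's `/price` endpoint using only the standard library. The response shape (a JSON object with a string `"price"` field) and the `fetch_price` helper are assumptions based on the public API docs, not code from this repo:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

TWELVE_DATA_URL = "https://api.twelvedata.com/price"

def parse_price(payload: dict) -> float:
    # Twelve Data returns the price as a string, e.g. {"price": "189.95"}
    return float(payload["price"])

def fetch_price(symbol: str, api_key: str) -> float:
    """Fetch the latest quoted price for a symbol (requires an API key)."""
    url = f"{TWELVE_DATA_URL}?{urlencode({'symbol': symbol, 'apikey': api_key})}"
    with urlopen(url, timeout=10) as resp:
        return parse_price(json.load(resp))
```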
- As a public GitHub data analytics project, some analytics cannot be shared due to the personal nature of the data.
- Stuff to share
- Tracking personal portfolio strategy changes
- Tracking dividend income and fees
- percentage of dividends relative to total security shares held
- yearly fees can be calculated
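The dividend and fee metrics above could be computed like this in pandas; the event table, column names, and share count are invented for illustration:

```python
import pandas as pd

# Hypothetical dividend/fee events for one security
events = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-31", "2023-06-30", "2024-03-31"]),
    "type": ["dividend", "fee", "dividend"],
    "amount": [12.50, -3.25, 14.00],
})
shares_held = 100.0

# Dividend income relative to shares held
dividends = events.loc[events["type"] == "dividend", "amount"]
dividend_per_share = dividends.sum() / shares_held

# Yearly fees: group fee rows by calendar year
fees = events.loc[events["type"] == "fee"].copy()
yearly_fees = fees.groupby(fees["date"].dt.year)["amount"].sum()
```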