- Version 0.1
- One pager
This project applies machine learning to combined Census and American Community Survey datasets to predict the future population of any Place in the United States.
- Executive Summary
- Presentation
- Dependencies
- Results (Overview)
- Process
- Prophet
- Results
- Python 3.7.2
- fbprophet 0.4.post2
- pandas 0.24.1
- NumPy 1.16.2
- scikit-learn 0.20.3
- Matplotlib 3.0.3
Overview
- The model performed with an average cross-validated Mean Absolute Percentage Error (MAPE) of 2.29%
- Calculated by comparing the final training year's actual values with the Model's predictions for that year
- Predictions on test years 2016 and 2017 proved consistently better than this
- Averaged 1.826% and 1.993% MAPE (respectively)
- Slightly more accurate than Baseline in 2016, and much more accurate than Baseline in 2017
- Averaged 2.052% and 3.855% MAPE (respectively)
- Baseline predicted change equal to the largest year-to-year change seen in the past 5 years
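The MAPE metric and the Baseline rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code; the function names and the signed-largest-change reading of "largest year-to-year change" are assumptions.

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

def baseline_forecast(history):
    """Baseline: project the largest year-to-year change
    seen in the past 5 years onto the most recent value."""
    recent = np.asarray(history[-6:])            # 6 values -> 5 deltas
    deltas = np.diff(recent)
    biggest = deltas[np.argmax(np.abs(deltas))]  # largest change, keeping its sign
    return recent[-1] + biggest
```

With this rule, a Place whose population grew by at most 2% per year in the recent past gets a baseline forecast of roughly 2% growth for the next year.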
- Examined a large number of geographic filters on Total Population
- E.g. Place, 5-digit Zip, County
- Determined Place to be most usable
- Why Place?
- The maximum number of census measurements per Place was 5
- A large majority of Places had at least 3 measurements
- Almost all had at least 2
- Counties varied too widely in number of measurements
- Many East Coast counties have been measured since 1790 and have 20+ measurements
- Numerous others have fewer than 10, some only 1 or 2
- 5-digit Zip measurements began too recently
- No multi-decade historical data
- US Census
- 1970-2010
- ACS 5-year Estimate
- 2011, '12, '13, '14, '15, '16, '17
- The Baseline turned out to be incredibly difficult to beat
- Especially in highly populated areas (e.g. New York City)
- Made use of Facebook’s Prophet to forecast Total Population
- Required conversion of data to time-series format
- Allows interpretation on both a per-year and a decades-long basis
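The conversion to time-series format might look like the sketch below: Prophet expects a two-column frame, `ds` (datestamp) and `y` (value). The wide layout and column names here are hypothetical, not the project's actual schema.

```python
import pandas as pd

# Hypothetical wide-format input: one Place, population by year
wide = pd.DataFrame({
    "place": ["Springfield"],
    "1990": [50000], "2000": [52000], "2010": [54000], "2015": [55000],
})

# Reshape to Prophet's expected two-column time-series frame
ts = (wide.drop(columns="place")
          .T
          .reset_index()
          .rename(columns={"index": "ds", 0: "y"}))
ts["ds"] = pd.to_datetime(ts["ds"], format="%Y")
```

Each year column becomes one row, so decennial census counts and yearly ACS estimates can sit in the same series.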
- Did not consider places with Total Population less than 1,000
- Neither Baseline nor Model performed nearly as well on these as on other areas
- Fit Model on each sample place
- Compared to Actual 2016 & 2017 Populations
- Model consistently outperformed when randomly sampling places (largest sample was 5,000)
- In specific instances, however, it came up short
- "Sweet spots" seemed to be Places with total population below 150,000 or above 650,000
- Model struggled with mid-sized metropolises (e.g. New Orleans)
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. The method was first open sourced by Facebook in late 2017, and is available in Python and R.
Prophet makes use of a decomposable time series model with three main model components: trend, seasonality, and holidays.
y(t) = g(t) + s(t) + h(t) + e(t)
where:
- g(t)
- trend models non-periodic changes (i.e. growth over time)
- s(t)
- seasonality represents periodic changes (i.e. weekly, monthly, yearly)
- h(t)
- ties in effects of holidays (on potentially irregular schedules ≥ 1 day(s))
- e(t)
- covers idiosyncratic changes not accommodated by the model
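As a small numeric illustration of the additive structure (the series below is made up, not census data):

```python
import numpy as np

t = np.arange(10)                  # time index
g = 1000 + 5 * t                   # g(t): linear growth trend
s = 3 * np.sin(2 * np.pi * t / 5)  # s(t): period-5 seasonal cycle
h = np.where(t == 7, 20.0, 0.0)    # h(t): one-off holiday bump
e = np.zeros_like(t, dtype=float)  # e(t): idiosyncratic term (zero here)

y = g + s + h + e                  # y(t) = g(t) + s(t) + h(t) + e(t)
```

Prophet fits each component separately and sums them, which is what makes the decomposition interpretable.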
- For more on the equation behind the procedure, check out The Math of Prophet [10 min]
- To learn more about how to use Prophet, see Intro to Facebook Prophet [9 min]
As the question at hand relied on decennial and yearly datapoints, Prophet was set to exclude daily and weekly seasonality while staying alert to year-to-year trends and shifts in those trends over time. This was achieved with:
# use fbprophet to make a Prophet model
import fbprophet

place_prophet = fbprophet.Prophet(changepoint_prior_scale=0.15,
                                  daily_seasonality=False,
                                  weekly_seasonality=False,
                                  yearly_seasonality=True,
                                  n_changepoints=10)
- The model performed with an average cross-validated Mean Absolute Percentage Error (MAPE) of 2.29%
- Calculated by comparing the final training year's actual values with the Model's predictions for that year
- Predictions on test years 2016 and 2017 proved consistently better than this
- Averaged 1.826% and 1.993% MAPE (respectively)
- Slightly more accurate than Baseline in 2016, and much more accurate than Baseline in 2017
- Averaged 2.052% and 3.855% MAPE (respectively)
- Baseline predicted change equal to the largest year-to-year change seen in the past 5 years
- Results are based on averages from 5 random samples of 100 Places
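The repeated-random-sample evaluation described above could be sketched as follows; the function name, the fixed seed, and the exact sampling details are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_mape(actuals, predictions, n_samples=5, sample_size=100):
    """Average MAPE over repeated random samples of Places.

    `actuals` and `predictions` are aligned 1-D arrays with one
    entry per Place, for a single test year."""
    actuals = np.asarray(actuals, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    scores = []
    for _ in range(n_samples):
        idx = rng.choice(len(actuals), size=sample_size, replace=False)
        pct_err = np.abs((actuals[idx] - predictions[idx]) / actuals[idx])
        scores.append(pct_err.mean() * 100)
    return float(np.mean(scores))
```

Averaging over several random samples of Places reduces the chance that one unusual metro area dominates the reported MAPE.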