We will learn how to use Python to forecast time series data and predict future data points.
Course Curriculum
- Course Goals
- Understand how to use Python to work with Time Series data
- Use Pandas and Statsmodels to visualize Time Series data
- Be able to use a wide variety of forecasting techniques on Time Series data
- Set Up and Installation
- NumPy Basics: Quick section on NumPy basics and how to manipulate data with it
- Pandas Basics: Pandas is used for data analysis and exploration. We'll learn how to use this library since it's fundamental to handling our data sources
- Pandas Visualization: Pandas also comes with a lot of built in visualization tools which we will explore to quickly view time series data on a chart
- Time Series with Pandas: After learning the Pandas Fundamentals we will focus on specialized tools within pandas specifically designed to work with time stamped data
- Time Series Analysis with Statsmodels: Statsmodels is a statistical library for Python that contains an entire suite of time series statistical analysis tools. This section will be an introduction to using Statsmodels for basic time series analysis
- General Forecasting Models: In this section we'll dive deep into various forecasting models based on ARIMA (AutoRegressive Integrated Moving Average)
- Deep Learning and Prophet: The final sections will show the latest state of the art methods for forecasting, including Recurrent Neural Networks and Facebook's Prophet library
Table of Contents
NumPy is a numerical processing library that can efficiently handle large data sets stored as arrays (see Glossary). Later we will learn about Pandas, which is built directly on top of the NumPy library. It provides (see Repo):
- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Useful linear algebra, Fourier transform, and random number capabilities
Example:
import numpy as np
# Create a 2-D array, set every second element in
# some rows and find max per row:
x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
x
# array([[ 0, 1, 2, 3, 4],
# [-99, 6, -99, 8, -99],
# [-99, 11, -99, 13, -99]])
x.max(axis=1)
# array([ 4, 8, 13])
Section Goals
- Creating NumPy Arrays
- Indexing and Selection on Arrays
- General Operations
We transform a nested list into a NumPy 2-dim array (matrix). Example:
l = [[1,2,3],[4,5,6]]
m = np.array(l)
Note that NumPy broadcasts operations on arrays (see Documentation). Generally, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
NumPy operations are usually done on pairs of arrays on an element-by-element basis. Conceptually, NumPy prepends ones to the shape of the smaller array until the two shapes are compatible. See mCoding Video.
Broadcasting works with addition, subtraction, multiplication, exponentiation, min/max, and many other element-wise operations.
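As a quick illustration (a minimal sketch, reusing the matrix from above):
# A (3,) array is conceptually treated as shape (1, 3) and stretched across both rows
m = np.array([[1, 2, 3], [4, 5, 6]])
m + np.array([10, 20, 30])
# array([[11, 22, 33],
#        [14, 25, 36]])
# A scalar is broadcast the same way
m * 2
# array([[ 2,  4,  6],
#        [ 8, 10, 12]])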
Note: from Documentation.
# Do this (new version)
from numpy.random import default_rng
rng = default_rng()
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)
# instead of this (legacy version)
from numpy import random
vals = random.standard_normal(10)
more_vals = random.standard_normal(10)
Create an array of the given shape and populate it with random samples from a uniform distribution:
# Create a 2-D array and populate it with
# random numbers from a uniform distro
np.random.rand(2,2)
This uses the legacy container `RandomState`. The new version looks like this:
rng = np.random.default_rng()
rng.standard_normal(10)
Call `default_rng` to get a new instance of a `Generator`, then call its methods to obtain samples from different distributions. By default, `Generator` uses bits provided by `PCG64`, which has better statistical properties than the legacy `MT19937` used in `RandomState`. See Documentation.
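For instance, the Generator counterpart of the legacy np.random.rand(2, 2) call above is the random method (a minimal sketch):
rng = np.random.default_rng()
# 2x2 array of uniform samples from [0, 1)
rng.random((2, 2))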
The seed is an optional parameter. Note that the seed only applies to the generator instance it is passed to, so it takes effect in the cell where the generator is created.
import numpy as np
rng = np.random.default_rng(12345)
# print(rng)
# Generator(PCG64)
rfloat = rng.random()
rfloat
# 0.22733602246716966
type(rfloat)
# <class 'float'>
Slices of an array are views into the original array, so broadcasting and reassigning values on a slice will affect the original array. Example:
arr = rng.integers(1,11,10)
arr
# array([ 5, 4, 10, 4, 1, 5, 8, 2, 5, 2])
# slice of the first 5 elements of array
s = arr[0:5:]
s[::] = 99
s
# array([99, 99, 99, 99, 99])
arr
# array([99, 99, 99, 99, 99, 5, 8, 2, 5, 2])
If this is not the intended behavior, we can copy the array using the `copy` method.
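A minimal sketch of using copy to keep the original array unchanged:
arr = rng.integers(1, 11, 10)
# Copy instead of a view
s = arr[0:5].copy()
s[:] = 99
# arr is unchanged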
We can also slice n-dim arrays. Example:
arr = rng.integers(1,11,10)
arr = arr.reshape(2,5)
arr
# array([[ 7, 5, 4, 3, 6],
# [ 7, 10, 5, 2, 9]])
arr[1::,1::]
# array([[10, 5, 2, 9]])
Conditional selection: we can also broadcast comparisons. Example:
arr = rng.integers(1,11,10)
arr
# array([7, 8, 1, 4, 8, 9, 5, 9, 9, 4])
bool_arr = arr > 4
bool_arr
# array([ True, True, False, False, True, True, True, True, True, False])
We can pass the Boolean array as a filter for the original array.
arr[bool_arr]
# array([7, 8, 8, 9, 5, 9, 9])
# or more compact
# read as array where array is greater than 4
arr[arr>4]
We can broadcast operations or choose an axis or axes along which the operations are computed. Example:
arr = np.arange(9).reshape(3,3)
arr
# array([[0, 1, 2],
# [3, 4, 5],
# [6, 7, 8]])
# Determine the median along the rows
m_arr = np.median(arr, axis=1).astype('i')
m_arr
# array([1, 4, 7], dtype=int32)
Note that we cast the elements of the resulting array to an integer (see astype method). The data type codes can be found in the documentation. Alternatively we can use the NumPy data types.
In this section we'll learn about:
- Series and DataFrames
- Missing Data
- GroupBy
- Operations
- Data I/O (Input and Output)
Note that the Documentation has a section on Time series / date functionality. From its Repo: pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
- Easy handling of missing data (represented as `NaN`, `NA`, or `NaT`) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let `Series`, `DataFrame`, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging
A Series is the basic building block of Pandas. It holds an array of data organized by a (named) index.
A NumPy array has data with an implicit integer index:
index | data |
---|---|
0 | 123 |
1 | 456 |
2 | 789 |
Series is built on top of a NumPy array and can have a named index.
index | named_index | data |
---|---|---|
0 | A | 123 |
1 | B | 456 |
2 | C | 789 |
Note that a pandas Series takes in data and an index.
my_series = pd.Series(data=[1,2,3], index=list('ABC'))
We can broadcast operations directly on Series; values are aligned based on their index labels, for example:
s1 = pd.Series(data=[1,2,3], index=list('ABC'))
s2 = pd.Series(data=[4,5,6,7,8], index=list('ABCDE'))
s1 + s2
# A 5.0
# B 7.0
# C NaN
# D NaN
# E NaN
# dtype: float64
s1.iloc[0]
# 1
s1.loc['A']
# 1
A DataFrame is simply multiple series that share the same index. Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure (see Documentation). Example:
# Create a 2-dim NumPy array
arr = np.arange(1,10).reshape(3,3)
arr
# array([[1, 2, 3],
# [4, 5, 6],
# [7, 8, 9]])
# Create DataFrame (2-dim data structure)
# The rows in the 2-dim array become the rows of the DataFrame
# The index here is integer position based
df = pd.DataFrame(data = arr)
df
# 0 1 2
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
# The index can also be label based
# We can also name the columns (Series in the DataFrame)
pd.DataFrame(data = arr, columns = list('ABC'), index=list('XYZ'))
# A B C
# X 1 2 3
# Y 4 5 6
# Z 7 8 9
The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
- Enables automatic and explicit data alignment
- Allows intuitive getting and setting of subsets of the data set
Another example:
rng = np.random.default_rng(42)
m = rng.standard_normal((5,4))
m
pd.DataFrame(data=m, index=list('ABCDE'), columns=list('WXYZ'))
# W X Y Z
# A 0.304717 -1.039984 0.750451 0.940565
# B -1.951035 -1.302180 0.127840 -0.316243
# C -0.016801 -0.853044 0.879398 0.777792
# D 0.066031 1.127241 0.467509 -0.859292
# E 0.368751 -0.958883 0.878450 -0.049926
We can index and slice data (label based or integer position based) in the data frame (see Documentation).
Object Type | Indexers |
---|---|
Series | s.loc[indexer] |
DataFrame | df.loc[row_indexer,column_indexer] |
If we're interested in getting the columns, we use basic indexing with `[]`.
Object Type | Selection | Return Value Type |
---|---|---|
Series | series[label] | scalar value |
DataFrame | frame[colname] | Series corresponding to colname |
# label based slicing
df.loc[::,::]
# integer position based slicing
df.iloc[::,::]
# If we slice the DataFrame, we see that it consists of J Series, J = len(columns)
type(df.loc[::,'W'])
# pandas.core.series.Series
# We can also pass a list or array of labels
df.loc[::,['W','Z']]
Another common operation is the use of boolean vectors to filter the data. The operators are: `|` for or, `&` for and, and `~` for not. These must be grouped using parentheses. Example:
# Change value in cell [A,X] to one
df.loc['A','X'] = 1
# Create filter, return cells where value is greater than zero
cond = df > 0
# Filter DataFrame and drop NA cells
df[cond].dropna(axis='index')
# W X Y Z
# A 0.304717 1.0 0.750451 0.940565
From Documentation: you may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):
crit = df['W'] > 0
df[crit]
# or
df.loc[crit,::]
# W X Y Z
# A 0.304717 -1.039984 0.750451 0.940565
# D 0.066031 1.127241 0.467509 -0.859292
# E 0.368751 -0.958883 0.878450 -0.049926
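A short sketch combining multiple conditions with these operators (each condition wrapped in parentheses):
# Rows where both W and Y are positive
df[(df['W'] > 0) & (df['Y'] > 0)]
# Rows where W is positive or Z is negative
df[(df['W'] > 0) | (df['Z'] < 0)]
# Rows where W is not positive
df[~(df['W'] > 0)]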
We can reset and set the index of the DataFrame.
rng = np.random.default_rng(42)
data = rng.standard_normal((5,4))
data
# array([[ 0.30471708, -1.03998411, 0.7504512 , 0.94056472],
# [-1.95103519, -1.30217951, 0.1278404 , -0.31624259],
# [-0.01680116, -0.85304393, 0.87939797, 0.77779194],
# [ 0.0660307 , 1.12724121, 0.46750934, -0.85929246],
# [ 0.36875078, -0.9588826 , 0.8784503 , -0.04992591]])
idx = list('ABCDE')
cols = list('WXYZ')
df = pd.DataFrame(data=data, index=idx, columns=cols)
df
# W X Y Z
# A 0.304717 -1.039984 0.750451 0.940565
# B -1.951035 -1.302180 0.127840 -0.316243
# C -0.016801 -0.853044 0.879398 0.777792
# D 0.066031 1.127241 0.467509 -0.859292
# E 0.368751 -0.958883 0.878450 -0.049926
# Reset index to default integer position based index
df.reset_index(inplace=True)
# Rename old index
df.rename({'index':'old_index'},axis=1,inplace=True)
# Create data for the new index
new_index = 'AA BB CC DD EE'.split()
# Assign data to new column
df['new_index'] = new_index
# Set new index
df.set_index('new_index',inplace=True)
df
# old_index W X Y Z
# new_index
# AA A 0.304717 -1.039984 0.750451 0.940565
# BB B -1.951035 -1.302180 0.127840 -0.316243
# CC C -0.016801 -0.853044 0.879398 0.777792
# DD D 0.066031 1.127241 0.467509 -0.859292
# EE E 0.368751 -0.958883 0.878450 -0.049926
We have three options:
- Keep the missing data (NaN), if the forecasting method can handle it
- Drop the missing data (the entire row, including the timestamp)
- Fill in the missing data with some value (best estimated guess)
series_dict = dict(A=[1,2,np.nan],B=[5,np.nan,np.nan],C=[1,2,3])
df = pd.DataFrame(series_dict)
df
# A B C
# 0 1.0 5.0 1
# 1 2.0 NaN 2
# 2 NaN NaN 3
df.dropna()
# A B C
# 0 1.0 5.0 1
df.dropna(axis=1)
# C
# 0 1
# 1 2
# 2 3
# Keep rows with at least two non-NA values
df.dropna(thresh=2)
# A B C
# 0 1.0 5.0 1
# 1 2.0 NaN 2
df.fillna('fill_value')
# A B C
# 0 1.0 5.0 1
# 1 2.0 fill_value 2
# 2 fill_value fill_value 3
df.fillna(method='pad')
# A B C
# 0 1.0 5.0 1
# 1 2.0 5.0 2
# 2 2.0 5.0 3
Often you may want to perform an analysis based on the values of a specific column, meaning you group the other columns by that column. To do this, we perform 3 steps:
- Split / Group
- Apply
- Combine
Pandas will automatically make the grouped by column the index of the new resulting DataFrame. Example:
data = dict(Company='GOOG GOOG MSFT MSFT FB FB'.split(),
Person='Sam Charlie Amy Vanessa Carl Sarah'.split(),
Sales=[200, 120, 340, 124, 243, 350])
data
# {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
# 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
# 'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)
df
# Company Person Sales
# 0 GOOG Sam 200
# 1 GOOG Charlie 120
# 2 MSFT Amy 340
# 3 MSFT Vanessa 124
# 4 FB Carl 243
# 5 FB Sarah 350
# Split the DataFrame
df.groupby('Company')
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe9c704f7d0>
# Apply aggregate method or function call and combine
df.groupby('Company').describe()
# Sales
# count mean std min 25% 50% 75% max
# Company
# FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
# GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
# MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
# We can also sort by a multi-index column
df.groupby('Company').describe().sort_values([('Sales','mean')])
# Sales
# count mean std min 25% 50% 75% max
# Company
# GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
# MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
# FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
# Custom describe method with range
def describe2(df):
    # Standard describe plus a 'range' row (max - min)
    df_res = df.describe()
    df_res.loc['range'] = df_res.loc['max'] - df_res.loc['min']
    return df_res
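A possible way to use this helper (a sketch; the output layout depends on the pandas version):
# Apply the custom describe to each company group
df.groupby('Company').apply(describe2)
# or to the numeric columns of the whole frame
describe2(df[['Sales']])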
Methods:
- unique
- value_counts(normalize=False)
- apply
- drop
- info
- describe
- sort_values
Attributes:
- index
- columns
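A quick sketch of a few of these, assuming the sales DataFrame from the GroupBy example above:
df['Company'].unique()
# array(['GOOG', 'MSFT', 'FB'], dtype=object)
df['Company'].value_counts(normalize=True)
# relative frequency of each company
df.sort_values('Sales', ascending=False).head(2)
# the two largest sales
df.columns
# Index(['Company', 'Person', 'Sales'], dtype='object')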
See Documentation. Example:
l = pd.read_html(
'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/'
)
len(l)
# 1
df = l[0]
df.head(2)
# Bank NameBank CityCity StateSt CertCert \
# 0 Almena State Bank Almena KS 15426
# 1 First City Bank of Florida Fort Walton Beach FL 16748
# Acquiring InstitutionAI Closing DateClosing FundFund
# 0 Equity Bank October 23, 2020 10538
# 1 United Fidelity Bank, fsb October 16, 2020 10537
See Documentation.
Histogram of dataframe or column:
- x-axis = value of num_observation along column
- y-axis = its frequency in dataframe
- color = columns of dataframe (note that a KDE might be a better choice to compare distributions)
df1['A'].plot.hist()
# We can customize the histogram, for example:
df1['A'].plot.hist(bins=20, edgecolor='k', grid=True).autoscale(enable=True,
axis='both',
tight=True)
Bar Plot of dataframe:
- x-axis = index (treated as a cat_observation)
- y-axis = value of num_observation along column
- color = columns of dataframe
We can also stack the bars with the stacked param.
df2.plot.bar(stacked=True,grid=True)
Line Plot of dataframe or column:
- x-axis = index (treated as a num_observation)
- y-axis = value of num_observation along column
- color = columns of dataframe
Information is displayed as a series of data points called 'markers' connected by straight line segments.
df2.plot.line()
# Customize
df2.plot.line(y='a', figsize=(10, 4), grid=True, lw=3).autoscale(enable=True,
axis='both',
tight=True)
Area Plot of dataframe or column:
- x-axis = index
- y-axis = stacked num_observations
- color = columns of dataframe
Cumulative stacked line plots.
# Cumulative line plots; index is the x-axis of the plot
df2.plot.area(figsize=(10, 4), grid=True, alpha=0.5).autoscale(axis='both',
tight=True)
# Overlapping area plots of singular columns; not very useful
df2.plot.area(figsize=(10, 4), grid=True, alpha=0.5,
stacked=False).autoscale(enable=True, axis='both', tight=True)
Scatter Plot of two num_columns of a dataframe (as param): We visualize the following data for each marker / record (along 0-axis):
- x-axis = num_observation_1
- y-axis = num_observation_2
- color = num_observation_3
- size = num_observation_4
df1.plot.scatter(x='A',
y='B',
s=df1['C'] * 50,
c='D',
alpha=.5,
cmap='coolwarm')
Box Plot of dataframe or column: Graphical representation of a Five-number summary with some additional information:
- x-axis = num_column
- y-axis = statistics of num_observations along the column
If we group by a cat_column then we get:
- box plot for each column
- x-axis = value of cat_observation in group-by-column
- y-axis = statistics of num_observation along the column of the group
Note: The Five-number summary consists of:
- (1) sample minimum
- (2) lower / first quartile
- (3) median / second quartile
- (4) upper / third quartile
- (5) sample maximum
- IQR = Q_3 - Q_1
- the whiskers
- outliers
df2.head(3)
# a b c d e
# 0 0.039762 0.218517 0.103423 0.957904 x
# 1 0.937288 0.041567 0.899125 0.977680 y
# 2 0.780504 0.008948 0.557808 0.797510 x
print(df2['e'].unique())
# ['x' 'y' 'z']
# Basic boxplot
df2.boxplot()
# We can group by values of a categorical column
df2.boxplot(by='e')
Example from the documentation:
# Instantiate RNG with seed
rng = np.random.default_rng(42)
# Create dataframe with two num_columns
df = pd.DataFrame(rng.standard_normal((10,2)),columns=['Col1', 'Col2'])
# Add a third cat_column
df['Col3'] = pd.Series(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'])
df
# Col1 Col2 Col3
# 0 0.304717 -1.039984 A
# 1 0.750451 0.940565 A
# 2 -1.951035 -1.302180 A
# 3 0.127840 -0.316243 A
# 4 -0.016801 -0.853044 A
# 5 0.879398 0.777792 B
# 6 0.066031 1.127241 B
# 7 0.467509 -0.859292 B
# 8 0.368751 -0.958883 NaN
# 9 0.878450 -0.049926 NaN
boxplot = df.boxplot(by='Col3', figsize=(10, 4))
The result is two subplots (one for each num_column) with two box plots each (one for each cat_observation value in the cat_column) with statistics from each group (Col1, CatA & Col1, CatB + Col2, CatA & Col2, CatB)
KDE of dataframe or column:
- x-axis = value of num_observation along column
- y-axis = approx. of PDF for num_column
- color = columns of dataframe
df2.plot.kde()
Hexbin plot of two num_columns of a dataframe:
Motivation: in a scatter plot with too many points, we cannot see how many are stacked on top of each other. Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.
- x-axis = num_observation_1
- y-axis = num_observation_2
# Instantiate RNG with seed
rng = np.random.default_rng(42)
df = pd.DataFrame(data=rng.standard_normal((1_000, 2)), columns=list('ab'))
df.head(3)
# a b
# 0 0.304717 -1.039984
# 1 0.750451 0.940565
# 2 -1.951035 -1.302180
df.plot.hexbin(x='a',y='b',gridsize=25,cmap='Oranges')
ax = df2[list('ac')].plot(figsize=(10, 3), ls=':', lw=3, title='My title')
# c = 'blue'
ax.set(xlabel='My x label', ylabel='My y label')
ax.legend(bbox_to_anchor=(1, 1))
# loc = 0
In this section we will discuss:
- Date Time Index Basics
- Time Resampling
- Time shifting
- Rolling and Expanding
- Time Series Visualization
- Time Series Project Exercise
We start by creating datetime objects:
from datetime import datetime
datetime(year, month, day[, hour[, minute[, second[, microsecond[,tzinfo]]]]])
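For example (a minimal sketch):
dt = datetime(2020, 3, 15, 13, 30)
dt
# datetime.datetime(2020, 3, 15, 13, 30)
dt.year, dt.month, dt.day
# (2020, 3, 15)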
NumPy handles dates more efficiently than Python's datetime format. See Documentation.
# This is treated as a normal array
np.array('2020-03-15 2020-03-16 2020-03-17'.split())
# array(['2020-03-15', '2020-03-16', '2020-03-17'], dtype='<U10')
# This is an array of datetime64 objects with day [D] precision
np.array('2020-03-15 2020-03-16 2020-03-17'.split(), dtype='datetime64')
#array(['2020-03-15', '2020-03-16', '2020-03-17'], dtype='datetime64[D]')
# We can also create a range of datetime objects
np.arange('2018-06-01', '2018-06-23', 7, dtype='datetime64[D]')
# array(['2018-06-01', '2018-06-08', '2018-06-15', '2018-06-22'],
# dtype='datetime64[D]')
Pandas also has methods to work with datetime data. See Documentation. Pandas has a specialized DateTimeIndex to work with datetime objects.
pd.to_datetime(['1/2/2018', '1/3/2018'])
# DatetimeIndex(['2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)
pd.to_datetime(['2/1/2018', '3/1/2018'])
# DatetimeIndex(['2018-02-01', '2018-03-01'], dtype='datetime64[ns]', freq=None)
# We can specify the format to bypass Pandas inferring it
pd.to_datetime(['2/1/2018', '3/1/2018'], format='%d/%m/%Y')
# DatetimeIndex(['2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)
pd.date_range('2020-01-01', periods=7, freq='D')
# DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
# '2020-01-05', '2020-01-06', '2020-01-07'],
# dtype='datetime64[ns]', freq='D')
Note that we use `arange` to specify the step size and dtype, and `date_range` to specify the number of elements (periods) and the frequency.
data = rng.standard_normal((3, 2))
cols = list('AB')
idx = pd.date_range('2020-01-01', periods=3, freq='D')
df = pd.DataFrame(data, index=idx, columns=cols)
df
# A B
# 2020-01-01 0.066031 1.127241
# 2020-01-02 0.467509 -0.859292
# 2020-01-03 0.368751 -0.958883
df.index
# DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
df.index.min()
# Timestamp('2020-01-01 00:00:00', freq='D')
df.index.argmin()
# 0
Resampling is a convenience method for frequency conversion and resampling of time series.
Similar to a group operation but based on time frequency. See Time Series / Date Documentation and thread. Also see Offset Alias, Anchored Offsets, and Resampling. Refer to the Resample Page. Some aliases can be found in the course material.
Alias | Description |
---|---|
B | business day frequency |
C | custom business day frequency |
D | calendar day frequency |
W | weekly frequency |
M | month end frequency |
SM | semi-month end frequency (15th and end of month) |
BM | business month end frequency |
CBM | custom business month end frequency |
MS | month start frequency |
SMS | semi-month start frequency (1st and 15th) |
BMS | business month start frequency |
CBMS | custom business month start frequency |
Q | quarter end frequency |
BQ | business quarter end frequency |
QS | quarter start frequency |
BQS | business quarter start frequency |
A, Y | year end frequency |
BA, BY | business year end frequency |
AS, YS | year start frequency |
BAS, BYS | business year start frequency |
BH | business hour frequency |
H | hourly frequency |
T, min | minutely frequency |
S | secondly frequency |
L, ms | milliseconds |
U, us | microseconds |
N | nanoseconds |
# Instantiate the dataframe with a named index
df = pd.read_csv(file, index_col='Date', parse_dates=True)
# If we do not pass the parse_dates parameter, use the following.
# df.index = pd.to_datetime(df.index)
# daily -> yearly mean
df.resample(rule='A').mean()
# Close Volume
# Date
# 2015-12-31 50.078100 8.649190e+06
# 2016-12-31 53.891732 9.300633e+06
# 2017-12-31 55.457310 9.296078e+06
# 2018-12-31 56.870005 1.122883e+07
# daily close price -> mean yearly close price
df['Close'].resample('A').mean()
# Date
# 2015-12-31 50.078100
# 2016-12-31 53.891732
# 2017-12-31 55.457310
# 2018-12-31 56.870005
# Freq: A-DEC, Name: Close, dtype: float64
df['Close'].resample('A').mean().plot.bar()
We use `matplotlib.pyplot` when we want to customize the plot. See Documentation; Bar Plot Page. Pyplot also offers a Tutorial. Also see this thread.
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
# import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
# Create x- and y-arrays
idx = df.resample(rule='A').mean().index.year
y = df['Close'].resample('A').mean().values
# Create plot
ax = plt.bar(idx,y,tick_label=idx)
We can shift the data along the 0-axis. See Documentation.
# Empty data point in the first row and lose last data point
df.shift(1)
# Shift data points to end of the month
df.shift(periods=1, freq='M')
See Documentation. A common process with time series is to create data based on a rolling mean: we divide the data into windows of time and apply an aggregate function to each window.
The `rolling` method behaves like `ROWS BETWEEN n PRECEDING AND CURRENT ROW` in Postgres.
# Plot of the closing price and its 30 moving average
df.Close.plot(figsize=(12, 5))
df.Close.rolling(window=30).mean().plot()
To add a legend to the plot, we can add the MA(30) column to the dataframe.
df['MA(30)'] = df.Close.rolling(window=30).mean()
df.iloc[29:34:]
# Close Volume MA(30)
# Date
# 2015-02-13 42.8942 6109522 39.835057
# 2015-02-17 43.1050 6386900 40.005020
# 2015-02-18 43.5593 6541986 40.214393
# 2015-02-19 43.6390 6109176 40.436533
# 2015-02-20 43.7982 6462662 40.633647
df['Close MA(30)'.split()].plot(figsize=(12, 5))
The `expanding` method behaves like `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` in Postgres.
df.Close.expanding().mean().head()
# Date
# 2015-01-02 38.006100
# 2015-01-05 37.642100
# 2015-01-06 37.419667
# 2015-01-07 37.535950
# 2015-01-08 37.727980
# Name: Close, dtype: float64
See Anatomy of a figure for more details. See notebook for reference. See plt dates Documentation.
See also strftime() and strptime() Behavior.
# Columns have wildly different scales -> Plot the curves separately
df.plot()
# Create plot and set title
ax = df.Close.plot(figsize=(12, 6), title='Close Price vs Date', grid=True)
# Scale the graph
ax.autoscale(axis='both', tight=True)
# Set labels
ax.set(xlabel='Date', ylabel='Close Price')
# Filter the data and plot
# Alternatively, pass the xlim and ylim param in the plot method
ax = df.Close.loc['2017-01-01':'2018-12-31'].plot(figsize=(12, 4), grid=True)
ax.autoscale(axis='both', tight=True)
We can also adjust the tick label:
# import dates module from plt
from matplotlib import dates
# Create subplot
ax = df.Close.loc['2017-01-01':'2017-03-01'].plot(figsize=(12, 4), grid=True)
# Scale subplot
ax.autoscale(axis='x',tight=True)
# Remove x-label
ax.set(xlabel='')
# Set location of tick labels
ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))
# Set format of tick labels
ax.xaxis.set_major_formatter(dates.DateFormatter('%B %d (%a)'))
We can also adjust the minor tick label and the grid for each axis:
# Create subplot
ax = df.Close.loc['2017-01-01':'2017-03-01'].plot(figsize=(12, 4))
# Scale subplot
ax.autoscale(axis='x',tight=True)
# Remove x-label
ax.set(xlabel='')
# Set location of tick labels
ax.xaxis.set_major_locator(dates.WeekdayLocator(byweekday=0))
# Set format of tick labels
ax.xaxis.set_major_formatter(dates.DateFormatter('%d'))
# Set location of minor tick labels
ax.xaxis.set_minor_locator(dates.MonthLocator())
# Set format of minor tick labels
ax.xaxis.set_minor_formatter(dates.DateFormatter('\n\n%b'));
# Show grid
ax.xaxis.grid(True)
Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
# from matplotlib.dates import DateFormatter
# from matplotlib import dates
# import matplotlib.dates as mdates
# General
import numpy as np
import pandas as pd
# from matplotlib import dates
import matplotlib.pyplot as plt
from datetime import datetime
import pylab
import seaborn as sns
# import statsmodels.api as sm
sns.set_style("whitegrid")
pylab.rc("figure", figsize=(16, 8))
pylab.rc("font", size=14)
# Forecasting Models
from statsmodels.tsa.filters.hp_filter import hpfilter
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from statsmodels.tsa.holtwinters import ExponentialSmoothing
See Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos.
Statsmodels Documentation. Statsmodels TSA Documentation. See also: How to Decompose Time Series Data into Trend and Seasonality. See also the TSA Tutorial.
Section Goals:
- Introduction to Statsmodels
- ETS Decomposition
- Moving Averages
- Holt Winters Methods
- Statsmodels Time Series Exercises
First we'll learn how to call a test function from Statsmodels and about the Hodrick-Prescott filter. Time series data has particular properties:
- Trends:
- Upward: slope is positive in general (if you would average it out)
- Horizontal or Stationary
- Downward: slope is negative in general
- Seasonality: Repeating trends
- Cyclical: Trends with no set repetition
The Hodrick-Prescott filter separates a time series `y` into a trend component `tau` and a cyclical component `c`:
y_t = \tau_t + c_t
The components are determined by minimizing the following quadratic loss / cost / error function, where lambda is a smoothing parameter:
\min _{\tau_{t}} \sum_{t=1}^{T} c_{t}^{2}+\lambda \sum_{t=1}^{T}\left[\left(\tau_{t}-\tau_{t-1}\right)-\left(\tau_{t-1}-\tau_{t-2}\right)\right]^{2}
The lambda value above handles variations in the growth rate of the trend component.
Recommended lambda values for the Hodrick-Prescott filter (a usage sketch follows this list):
- Quarterly: 1,600
- Annually: 6.25
- Monthly: 129,600
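A usage sketch with statsmodels (assuming a time-indexed df with a 'Close' column as in the earlier examples; the column is only an illustration):
from statsmodels.tsa.filters.hp_filter import hpfilter
# hpfilter returns the cyclical and the trend component (in that order)
cycle, trend = hpfilter(df['Close'], lamb=1600)
df['trend'] = trend
df[['Close', 'trend']].plot(figsize=(12, 5))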
ETS Models (Error-Trend-Seasonality). This general term stands for a variety of different models, e.g.:
- Exponential Smoothing
- Trend Methods Models
- ETS Decomposition
Statsmodels provides a seasonal decomposition tool we can use to separate out the different components. We've already seen a simplistic example of this in the Introduction to Statsmodels section with the Hodrick-Prescott filter.
ETS (Error-Trend-Seasonality) models will take each of those terms for "smoothing" and may add them, multiply them, or even just leave some of them out. Based on these key factors, we can try to create a model to fit our data.
Visualizing the data based off its ETS is a good way to build an understanding of its behaviour.
There are two types of ETS models:
- Additive model
- Multiplicative model
We apply an additive model when it seems that the trend is more linear and the seasonality and trend components seem to be constant over time (e.g. every year we add 10,000 passengers).
A multiplicative model is more appropriate when we are increasing (or decreasing) at a non-linear rate (e.g. each year we double the amount of passengers).
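A minimal decomposition sketch with statsmodels (the 'Passengers' column and monthly data are assumptions for illustration):
from statsmodels.tsa.seasonal import seasonal_decompose
# Use 'multiplicative' when the seasonal swings grow with the level, otherwise 'additive'
result = seasonal_decompose(df['Passengers'], model='multiplicative')
result.plot()
# The components are available as attributes
result.trend
result.seasonal
result.resid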
See Pandas Documentation. See Statsmodels Documentation. EWMA (Exponentially weighted moving average) models. We've previously seen how calculating simple moving averages can allow us to create a simple model that describes some trend level behavior of a time series.
We can expand on the idea of an SMA (simple moving average) by using an EWMA (exponentially weighted moving average). An issue with SMA is that the entire model is constrained to the same window size.
It would be nice if we could have more recent data weighted more heavily than older data. We do this by implementing an EWMA instead of an SMA. Basic SMA has some cons:
- Smaller windows will lead to more noise, rather than signal
- It will always lag by the size of the window
- It will never reach the full peak or valley of the data due to the averaging.
- Does not really inform you about possible future behavior, all it really does is describe trends in your data.
- Extreme historical values can skew your SMA significantly
To help fix some of these issues, we can use an EWMA. An EWMA reduces the lag effect of the SMA by putting more weight on values that occurred more recently (thus the name).
The amount of weight applied to the most recent values depends on the parameters used in the EWMA and the number of periods in the window. In general a weighted moving average is described by the following formula:
y_t = \frac{\sum\limits_{i=0}^t w_i x_{t-i}}{\sum\limits_{i=0}^t w_i}
We now need to define the weight function `w_i`. When the parameter `adjust=True`:
y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ...
+ (1 - \alpha)^t x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ...
+ (1 - \alpha)^t}
Otherwise:
\begin{split}y_0 &= x_0 \\
y_t &= (1 - \alpha) y_{t-1} + \alpha x_t,\end{split}
\begin{split}
w_i = \begin{cases}
\alpha (1 - \alpha)^i & \text{if } i < t \\
(1 - \alpha)^i & \text{if } i = t.
\end{cases}
\end{split}
Alpha can be specified through one of three parameters (span, center of mass, or half-life):
\begin{split}\alpha =
\begin{cases}
\frac{2}{s + 1}, & \text{for span}\ s \geq 1\\
\frac{1}{1 + c}, & \text{for center of mass}\ c \geq 0\\
1 - \exp\left(\frac{\log 0.5}{h}\right), & \text{for half-life}\ h > 0
\end{cases}
\end{split}
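In pandas this is available through the ewm method; a short sketch comparing it to a simple moving average (the 'Close' column is just an example):
# 12-period SMA vs 12-period EWMA
df['SMA(12)'] = df['Close'].rolling(window=12).mean()
df['EWMA(12)'] = df['Close'].ewm(span=12, adjust=False).mean()
df[['Close', 'SMA(12)', 'EWMA(12)']].plot(figsize=(12, 5))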
Previously with EWMA we applied Simple Exponential Smoothing using just one smoothing factor alpha. This failed to account for other contributing factors like trend and seasonality.
The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations: one for the level l_t, one for the trend b_t, and one for the seasonal component s_t, with corresponding smoothing parameters alpha, beta, and gamma.
There are two variations to this method that differ in the nature of the seasonal component: the additive method is preferred when the seasonal variations are roughly constant through the series, while the multiplicative method is preferred when the seasonal variations are changing proportional to the level of the series.
We start with a Single Exponential Smoothing:
y_0 = x_0
y_t = (1-\alpha)y_{t-1} + \alpha x_t
We can expand this to Double Exponential Smoothing. In Double Exponential Smoothing (aka Holt's Method) we introduce a new smoothing factor beta that addresses trend:
\text{level: } l_t = (1-\alpha)l_{t-1} + \alpha x_t
\text{trend: } b_t = (1-\beta)b_{t-1} + \beta(l_t - l_{t-1})
\text{fitted model: } y_t = l_t + b_t
\text{forecasting model: } \hat{y}_{t+h} = l_t + hb_t,~ h \text{ is the periods into the future}
In this method we do not take into account a seasonal component. With Triple Exponential Smoothing (aka the Holt-Winters Method) we introduce a smoothing factor gamma that addresses seasonality:
\begin{aligned}
\text{level: }l_{t} &=(1-\alpha) l_{t-1}+\alpha x_{t}, \\
\text{trend: }b_{t} &=(1-\beta) b_{t-1}+\beta\left(l_{t}-l_{t-1}\right) \\
\text{seasonal: }c_{t} &=(1-\gamma) c_{t-L}+\gamma\left(x_{t}-l_{t-1}-b_{t-1}\right) \\
\text{fitted model: }y_{t} &=\left(l_{t}+b_{t}\right) c_{t} \\
\text{forecasting model: }\hat{y}_{t+m} &=\left(l_{t}+m b_{t}\right) c_{t-L+1+(m-1) \bmod L},~m \text{ is the periods into the future}
\end{aligned}
Here L represents the number of divisions per cycle. In our case looking at monthly data that displays a repeating pattern each year, we would use L=12.
We need to set the frequency of the index to use this method. See Offset Aliases. See also this GitHub thread if the method does not converge.
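A minimal fitting sketch, assuming a monthly series df['Passengers'] with a yearly seasonal pattern (column name and 'MS' frequency are assumptions):
# The DatetimeIndex needs an explicit frequency
df.index.freq = 'MS'
model = ExponentialSmoothing(df['Passengers'],
                             trend='add',
                             seasonal='mul',
                             seasonal_periods=12)
fitted = model.fit()
# Fitted values on the training data and a 36-period forecast
df['HW'] = fitted.fittedvalues
forecast = fitted.forecast(36)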
# Forecasting Models
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Metrics: Evaluating Predictions
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Autocorrelation
import statsmodels.api as sm
from statsmodels.tsa.stattools import acovf, acf, pacf, pacf_yw, pacf_ols
# import warnings
# warnings.filterwarnings('ignore')
# Lag Plot
from pandas.plotting import lag_plot
Section Overview:
- Introduction to Forecasting
- ACF and PACF plots
- Auto Regression AR
- Descriptive Statistics and Tests
- Choosing ARIMA orders
- ARIMA based models
Standard Forecasting Procedure:
- Choose a Model
- Split data into train and test sets
- Fit model on training set
- Evaluate model on test set
- Re-fit model on entire data set
- Forecast for future data
Test sets will be the most recent end of the data (see sklearn TimeSeriesSplit)
The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required. Keep in mind, the longer the forecast horizon, the more likely your prediction becomes less accurate.
After we fit the model on the training data, we forecast to match up with the test data dates. Then we can compare our results for evaluation.
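A simple index-based split sketch (the 20% cutoff is an arbitrary example):
# Keep the most recent observations as the test set
test_size = int(len(df) * 0.2)
train = df.iloc[:-test_size]
test = df.iloc[-test_size:]
# Fit on train, forecast len(test) periods, then compare against test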
You may have heard of evaluation metrics like accuracy or recall. These sorts of metrics aren't useful for time series forecasting problems; we need metrics designed for continuous values!
Let's discuss some of the most common evaluation metrics for regression:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Square Error (RMSE)
Whenever we perform a forecast for a continuous value on a test set, we have two values: `y`, the value of the observation (in the test data), and `y_hat`, the predicted value from our forecast.
Mean Absolute Error (MAE): This is the mean of the absolute value of the errors. It is easy to understand.
\frac{1}{n} \sum_{i=1}^{n}\left|y_{i}-\hat{y}_{i}\right|
An issue with MAE, though, is that simply averaging the residuals won't alert us if the forecast was really off for a few points. We want to be aware of any prediction errors that are very large (even if there are only a few).
Mean Squared Error (MSE): This is the mean of the squared errors. Larger errors are noted more than with MAE making MSE more popular.
\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}
There is an issue with MSE however! Because we square the residuals, the units are now also squared, which is hard to interpret!
Root Mean Square Error (RMSE): This is the root of the mean of the squared errors. Most popular (has same units as y).
\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}}
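A quick sketch of computing these with scikit-learn (test['Close'] and predictions stand in for the held-out values and the forecast):
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
mae = mean_absolute_error(test['Close'], predictions)
mse = mean_squared_error(test['Close'], predictions)
rmse = np.sqrt(mse)  # same units as the original series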
Let's learn about 2 very useful plot types
- ACF AutoCorrelation Function Plot
- PACF Partial AutoCorrelation Function Plot
To understand these plots, we first need to understand correlation. Correlation is a measure of the strength of the linear relationship between two variables. The closer the correlation is to +1, the stronger the positive linear relationship; the closer the correlation is to -1, the stronger the negative linear relationship; and the closer the correlation is to zero, the weaker the linear relationship.
An autocorrelation plot (also known as a correlogram) shows the correlation of the series with itself, lagged by t time units. So the y-axis is the correlation and the x-axis is the number of time units of lag.
Imagine we had some sales data. We can compare the standard sales data against the sales data shifted by 1 time step. This answers the question, "How correlated are today's sales to yesterday's sales?"
We consider `y = Sales on Day t` vs `x = Sales on Day t-1` and calculate the correlation. We then plot that correlation value on the plot `y = Autocorr` vs `x = Shift`.
An autocorrelation plot shows the correlation of the series with itself, lagged by x time units. You do this for all possible time lags x, and this defines the plot. Typical examples:
- Gradual decline: Common for seasonal data. The distance between peaks represents the seasonal period
- Sharp drop-off: After a couple of shifts the correlation rapidly drops
It makes sense that in general there is a decline of some sort: the further away you get with the shift, the less likely the time series is to be correlated with itself.
For the partial autocorrelation plot, we essentially plot the relationship between `y = previous day's residuals` and `x = the real values of the current day`. In general we expect the partial autocorrelation to drop off quite quickly.
The ACF describes the autocorrelation between an observation and another observation at a prior time step that includes direct and indirect dependence information. The PACF only describes the direct relationship between an observation and its lag.
These two plots can help choose order parameters for ARIMA based models. Later on, we will see that it is usually much easier to perform a grid search of the parameter values, rather than attempt to read these plots directly.
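A plotting sketch with statsmodels (the lag count and the 'Close' column are illustrative):
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# ACF: direct and indirect dependence
plot_acf(df['Close'], lags=40)
# PACF: only the direct relationship with each lag
plot_pacf(df['Close'], lags=40)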
Downgrading Jupyter Lab and Conda Packages
We downgrade jupyter lab from 3.2 to 3.1 to increase contextual help performance. We also install seaborn and a formatter.
conda install -c conda-forge jupyterlab=3.1.19
conda install -c conda-forge seaborn==0.11.0
conda install -c conda-forge jupyterlab_code_formatter
If-Name-Equals-Main Idiom
From video. We should be using the following idiom in all Python (non-library) scripts:
def main():
    print('Hello World')

if __name__ == '__main__':
    main()
Dictionaries
6 Different ways to create Dictionaries (source)
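A sketch of six common ways (not necessarily the exact list from the source):
d1 = {'a': 1, 'b': 2}                      # literal
d2 = dict(a=1, b=2)                        # keyword arguments
d3 = dict([('a', 1), ('b', 2)])            # list of key/value pairs
d4 = dict(zip('ab', [1, 2]))               # two parallel sequences
d5 = {k: v for k, v in zip('ab', [1, 2])}  # dict comprehension
d6 = dict.fromkeys('ab', 0)                # keys with a default value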
Stack Overflow Threads
Time series decomposition involves thinking of a series as a combination of the following components.
- level
- trend
- seasonality
- noise