bukosabino/ta

Issue in the calculation of the EMA

TomLouisKeller opened this issue · 5 comments

I might be wrong here, but I think the calculation of the EMA is not entirely correct.

Current code:

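import pandas as pd

def ema(series, periods):
    # Seed: SMA of the first `periods` values (the rows before it are NaN)
    sma = series.rolling(window=periods, min_periods=periods).mean()[:periods]
    rest = series[periods:]
    # EMA over the seed followed by the remaining prices
    return pd.concat([sma, rest]).ewm(span=periods, adjust=False).mean()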

Analysis:
Assume periods=5 (for easier explanation).
The first line of code leaves the first 4 rows as NaN
and sets the 5th row to the SMA of the first 5 rows.
That is then concatenated with the rest of the series.
As a result, the 6th row is the EMA of:
[NaN, NaN, NaN, NaN, SMA of the first 5 rows, 6th price]

Problems:
The 5th row is an SMA, not an EMA.
The following rows are still not calculated from the expected 5 prices,
but from NaN values, the SMA, and then the actual prices.
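A minimal sketch that reproduces the behaviour described above. The sample prices are made up, and `ema_current` is only a copy of the current code with `.iloc` added so the slices are explicitly positional.

```python
import pandas as pd

def ema_current(series, periods):
    # Same logic as the current code: SMA seed, then the remaining prices.
    sma = series.rolling(window=periods, min_periods=periods).mean().iloc[:periods]
    rest = series.iloc[periods:]
    return pd.concat([sma, rest]).ewm(span=periods, adjust=False).mean()

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # made-up sample data
periods = 5
alpha = 2 / (periods + 1)  # smoothing factor that pandas derives from span

result = ema_current(prices, periods)

# Row 5 is the plain SMA of the first 5 prices.
sma_seed = prices.iloc[:periods].mean()
# Row 6 only blends that SMA seed with the 6th price.
row6_expected = alpha * prices.iloc[5] + (1 - alpha) * sma_seed

print(result)
print("SMA seed:", sma_seed, "expected row 6:", row6_expected)
```

Rows 1 to 4 come out as NaN, row 5 equals the SMA seed, and every later row is the usual recursive update starting from that seed.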

I made a quick Jupyter Notebook for a better explanation:
https://colab.research.google.com/drive/1aPb0qPoeNWVLIk_ErhvoZTk1ImNzznCO

My proposition:

def ema(series, periods, fillna=False):
    if fillna is True:
        return series.ewm(span=periods, min_periods=0, adjust=False).mean()

    return series.ewm(span=periods, min_periods=periods, adjust=False).mean()
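To see what the `fillna` switch changes, here is a small sketch with made-up prices. `ema_proposed` mirrors the proposition above; only the way `min_periods` is chosen is condensed into one line.

```python
import pandas as pd

def ema_proposed(series, periods, fillna=False):
    # min_periods=0 emits a value from the first row on;
    # min_periods=periods keeps the first periods-1 rows as NaN.
    min_periods = 0 if fillna else periods
    return series.ewm(span=periods, min_periods=min_periods, adjust=False).mean()

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])  # made-up sample data

with_nan = ema_proposed(prices, 5)
filled = ema_proposed(prices, 5, fillna=True)

print(pd.DataFrame({"fillna=False": with_nan, "fillna=True": filled}))
```

Both variants agree from row 5 onwards. Unlike the current code, the first visible value is not the SMA of the first 5 prices but the recursive EMA that started at the first price.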

Hi Tom,

You are not incorrect. There are a few different ways the EMA is calculated in the wild. One of them computes the SMA over the first period and then builds the rest of the EMA from that initial value and the remaining closing prices. I also thought it was odd, but I found it being used by the popular TA-Lib.

In fact, it was documented in ta_EMA.c:

   /* The first EMA is calculated differently. It
    * then become the seed for subsequent EMA.
    *
    * The algorithm for this seed vary widely.
    * Only 3 are implemented here:
    *
    * TA_MA_CLASSIC:
    *    Use a simple MA of the first 'period'.
    *    This is the approach most widely documented.
    *
    * TA_MA_METASTOCK:
    *    Use first price bar value as a seed
    *    from the begining of all the available
    *    data.
    *
    * TA_MA_TRADESTATION:
    *    Use 4th price bar as a seed, except when
    *    period is 1 who use 2th price bar or something
    *    like that... (not an obvious one...).
    */
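The effect of the seed can be sketched in plain Python. `ema_with_seed` is a hypothetical helper written for this illustration, not TA-Lib's API; the "first" option assumes the MetaStock-style seed simply means starting the recursion at the first price, and the TradeStation variant is left out.

```python
def ema_with_seed(prices, period, seed="sma"):
    """Recursive EMA with a configurable seed (illustrative helper only)."""
    alpha = 2.0 / (period + 1)
    if seed == "sma":
        # Classic style: seed with the simple average of the first `period` prices.
        start = period - 1
        value = sum(prices[:period]) / period
    elif seed == "first":
        # First-bar style: seed with the very first price.
        start = 0
        value = prices[0]
    else:
        raise ValueError("seed must be 'sma' or 'first'")

    out = [None] * start + [value]
    for price in prices[start + 1:]:
        # Standard recursive update: blend the new price with the previous EMA.
        value = alpha * price + (1 - alpha) * value
        out.append(value)
    return out

prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # made-up sample data
classic = ema_with_seed(prices, 5, seed="sma")
first_bar = ema_with_seed(prices, 5, seed="first")

for i in range(4, len(prices)):
    print(i, classic[i], first_bar[i], abs(classic[i] - first_bar[i]))
```

The two variants use the same update rule, so the gap between them shrinks by a factor of (1 - alpha) with every new price; the seed matters mostly for the earliest values.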

In my version so far, I have incorporated both methods: your straight-up pandas ewm version and @christian-janiake-movile's submission. Furthermore, I have added a bunch of other indicators on top of the current code base while extending the pandas DataFrame.

KJ

Good question Tom,

What do you think about this? @cjaniake @christian-janiake-movile @paduel

Hi @bukosabino,
@TomLouisKeller's proposition looks perfect, and is much more accurate when series are incomplete.
Agreed!

Hi guys,

I did some more research on EMAs.
I initially assumed that EMAs, like SMAs, only take the last N periods in a series
and then weight them somehow. This assumption is incorrect.

From Wikipedia:
... EMA is referred to as an N-day EMA. Despite the name suggesting there are N periods, the terminology only specifies the α factor. N is not a stopping point for the calculation in the way it is in an SMA or WMA.

The EMA takes all prior available data points into account for its calculation.
The variable N/periods/span is only used to calculate alpha, which in turn determines the weight of each prior data point.
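A short sketch of those weights, assuming the span convention alpha = 2 / (N + 1) that pandas uses and a history long enough that the seed no longer matters. `ema_weights` is a hypothetical helper for illustration.

```python
def ema_weights(periods, n_points):
    """Weight of the price k steps back, for k = 0 .. n_points - 1."""
    alpha = 2.0 / (periods + 1)
    return [alpha * (1 - alpha) ** k for k in range(n_points)]

periods = 5
weights = ema_weights(periods, n_points=20)

# Share of the total weight carried by the most recent N prices.
share_of_last_n = sum(weights[:periods])

for k, w in enumerate(weights[:8]):
    print(f"{k} steps back: weight {w:.4f}")
print(f"share of the last {periods} prices: {share_of_last_n:.4f}")
```

For N=5 the last 5 prices carry roughly 87% of the weight; the rest comes from older prices, which is why cutting off the start of a series changes later EMA values.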

In this notebook you can see that the values at index 9 and 10 differ if we cut off the first 5 data points:
https://colab.research.google.com/drive/10DMtAQ7quQ-oVkuq0Tk8SYnOuCntnvGQ

If we use the existing ta.ema code, the values at index 9 and 10 in cells 5 and 7 are even further apart:
https://colab.research.google.com/drive/1pbBfZTo8VE4K6XsWv8k7OYlawaHF6kE9

If we set adjust=True, the values at index 9 and 10 in cell 6 are closer to their true values:
https://colab.research.google.com/drive/1HbcbaD4nocxAZgtzGtlAkTZG9PWUnnCg

My updated proposition (adjust=True is the default):

def ema(series, periods, fillna=False):
    if fillna:
        return series.ewm(span=periods, min_periods=0).mean()
    return series.ewm(span=periods, min_periods=periods).mean()

What do you guys think?
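For reference, a sketch of what adjust=True computes, checked against a manual calculation. The prices are made up and `ema_adjusted_manual` is a hypothetical helper: with adjust=True each value is a weighted average over the data seen so far, normalised by the sum of the weights, whereas adjust=False uses the recursive update seeded with the first price.

```python
import pandas as pd

prices = pd.Series([10.0, 11.0, 13.0, 12.0, 15.0, 16.0, 14.0])  # made-up sample data
periods = 5
alpha = 2 / (periods + 1)

def ema_adjusted_manual(values, alpha):
    """Weighted average of all values so far, divided by the sum of the weights."""
    out = []
    for t in range(len(values)):
        weights = [(1 - alpha) ** i for i in range(t + 1)]  # for x_t, x_{t-1}, ..., x_0
        window = values[t::-1]                              # x_t, x_{t-1}, ..., x_0
        out.append(sum(w * x for w, x in zip(weights, window)) / sum(weights))
    return out

manual = ema_adjusted_manual(prices.tolist(), alpha)
adjusted = prices.ewm(span=periods, min_periods=0).mean()                  # adjust=True (default)
recursive = prices.ewm(span=periods, min_periods=0, adjust=False).mean()   # seeded recursion

print(pd.DataFrame({"adjust=True": adjusted, "manual": manual, "adjust=False": recursive}))
```

The two pandas variants differ most in the first rows, where the unadjusted version still leans heavily on the first price, and converge as more data comes in.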

The last code suggested by Tom is implemented in the latest version of the ta library (0.3.4).

Regards.