USEPA/flowsa

Estimate non-zero FlowAmounts when below source reporting threshold

matthewlchambers opened this issue · 6 comments

The Aircraft row is missing from a newly generated EPA_GHGI_T_3_14 FBA (year is 2016, if that matters).

Yes, there are a few lingering issues in a handful of tables that I'm working through. Feel free to document any more that you see here.

I see that for 2016, that row is identified as being less than 0.05 MMT (specific value not given). Is that why it's excluded? If so, maybe there's a better way of handling that situation?

Yep, that would be it. I do think your update to add a new field for that makes sense. In the current GHGI branch it's not yet implemented.

Renamed this issue to broaden the scope to the issue at the heart of this example: a source provides no value, but the value is also not given as zero, and is instead marked with a symbol such as '+' or '-'. This can happen, for example, when the value is smaller than the significant figures reported.

This is really something that needs to be implemented for each specific FBA. We have a few examples now where this is done:

EIA MECS:

df = df.assign(
    FlowAmount=df.FlowAmount.mask(df.FlowAmount.str.isnumeric() == False,
                                  np.nan),
    Suppressed=df.FlowAmount.where(df.FlowAmount.str.isnumeric() == False,
                                   np.nan),
    Spread=df.Spread.mask(df.Spread.str.isnumeric() == False, np.nan)
)
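
To make the `mask`/`where` split above easier to follow, here is a minimal toy run. The frame and its values are made up for illustration and are not real MECS data; only the `FlowAmount` and `Suppressed` column names come from the snippet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw values read in as strings (made-up numbers).
df = pd.DataFrame({"FlowAmount": ["120", "*", "D", "45"]})

# True where the cell holds only digits, False for markers such as "*" or "D".
is_num = df.FlowAmount.str.isnumeric()

split = df.assign(
    # mask: blank out the non-numeric cells, keep the numeric ones
    FlowAmount=df.FlowAmount.mask(~is_num, np.nan),
    # where: keep only the non-numeric cells, i.e. the suppression markers
    Suppressed=df.FlowAmount.where(~is_num, np.nan),
)
print(split)
```

One thing to keep in mind: `str.isnumeric()` is False for strings containing a decimal point or thousands separator (for example "1.5" or "1,234"), so this pattern fits sources whose values arrive as plain digit strings.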

Census SAS:

# set suppressed values to 0 but mark as suppressed
# otherwise set non-numeric to nan
df = (df.assign(
    Suppressed = np.where(df.FlowAmount.str.strip().isin(["S", "Z", "D"]),
                          df.FlowAmount.str.strip(),
                          np.nan),
    FlowAmount = np.where(df.FlowAmount.str.strip().isin(["S", "Z", "D"]),
                          0,
                          df.FlowAmount)))
df = (df.assign(
    Suppressed = np.where(df.FlowAmount.str.endswith('(s)') == True,
                          '(s)',
                          df.Suppressed),
    FlowAmount = np.where(df.FlowAmount.str.endswith('(s)') == True,
                          df.FlowAmount.str.replace(',','').str[:-3],
                          df.FlowAmount),
    ))

GHGI:

# set suppressed values to 0 but mark as suppressed
# otherwise set non-numeric to nan
try:
    df = (df.assign(
        Suppressed = np.where(df.FlowAmount.str.strip() == "+", "+",
                              np.nan),
        FlowAmount = pd.Series(
            np.where(df.FlowAmount.str.strip() == "+", 0,
                     df.FlowAmount.str.replace(',',''))))
        )
    df = (df.assign(
        FlowAmount = np.where(pd.to_numeric(
            df.FlowAmount, errors='coerce').isnull(),
            np.nan, pd.to_numeric(
            df.FlowAmount, errors='coerce')))
        .dropna(subset='FlowAmount')
        )
except AttributeError:
    # if no string in FlowAmount, then proceed
    df = df.dropna(subset='FlowAmount')
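
For anyone who wants to try the '+' handling outside of flowsa, here is a rough standalone sketch of the same idea on toy strings. It is not the flowsa implementation: it swaps `np.where` for `Series.where`/`Series.mask` plus `pd.to_numeric`, and the input values (including the "n/a" entry) are made up.

```python
import pandas as pd

# Toy raw column: a number with a thousands separator, a "+" marker,
# an arbitrary non-numeric entry, and a plain number (all made up).
raw = pd.DataFrame({"FlowAmount": ["1,234.5", " + ", "n/a", "0.3"]})

text = raw.FlowAmount.str.strip()
is_plus = text == "+"

# Remove thousands separators and stand in "0" for the "+" rows,
# then coerce anything still non-numeric to NaN.
cleaned = text.str.replace(",", "").mask(is_plus, "0")

parsed = raw.assign(
    Suppressed=text.where(is_plus),  # "+" where flagged, NaN elsewhere
    FlowAmount=pd.to_numeric(cleaned, errors="coerce"),
).dropna(subset=["FlowAmount"])      # drops the "n/a" row only
print(parsed)
```

The "+" row survives with a FlowAmount of 0 and a `Suppressed` flag, so a later step can still decide what value to assign to it.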

The approach to handling the suppressed data is then specified in an FBS, for example with this function for MECS:

def estimate_suppressed_mecs_energy(
        fba: FlowByActivity,
        **kwargs
        ) -> FlowByActivity:
    '''
    Rough first pass at an estimation method, for testing purposes. This
    will drop rows with 'D' or 'Q' values, on the grounds that as far as I can
    tell we don't have any more information for them than we do for any
    industry without its own line item in the MECS anyway. '*' is for value
    less than 0.5 Trillion Btu and will be assumed to be 0.25 Trillion Btu
    '''
    if 'Suppressed' not in fba.columns:
        log.warning('The current MECS dataframe does not contain data '
                    'on estimation method and so suppressed data will '
                    'not be assessed.')
        return fba
    dropped = fba.query('Suppressed not in ["D", "Q"]')
    unsuppressed = dropped.assign(
        FlowAmount=dropped.FlowAmount.mask(dropped.Suppressed == '*', 0.25)
    )
    return unsuppressed.drop(columns='Suppressed')
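
The same pattern could be sketched for the GHGI case that started this issue. The helper below is hypothetical and not part of flowsa: it works on a plain pandas DataFrame rather than a FlowByActivity, and it assumes that '+' means "less than 0.05 MMT" and that half the threshold is an acceptable stand-in, mirroring the 0.25 Trillion Btu assumption in the MECS function. The activity names and values are toy data.

```python
import pandas as pd

# Assumption: "+" marks a value below 0.05 MMT, as described for the 2016 Aircraft row.
ASSUMED_THRESHOLD = 0.05


def estimate_below_threshold(df, marker="+", threshold=ASSUMED_THRESHOLD):
    """Hypothetical helper: replace flagged rows with half the reporting threshold."""
    if "Suppressed" not in df.columns:
        # Nothing to estimate if the FBA did not record suppression markers.
        return df
    estimated = df.assign(
        FlowAmount=df.FlowAmount.mask(df.Suppressed == marker, threshold / 2)
    )
    return estimated.drop(columns="Suppressed")


# Toy frame: one flagged row stored as 0, one ordinary row (made-up values).
fba = pd.DataFrame({
    "ActivityProducedBy": ["Aircraft", "Other"],
    "FlowAmount": [0.0, 12.3],
    "Suppressed": ["+", None],
})
result = estimate_below_threshold(fba)
print(result)
```

Whether half the threshold is a reasonable estimate is a per-source judgment call, which is why this kind of choice sits in the FBS method rather than in the FBA itself.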

Going to close this issue as resolved, knowing that this can be added to FBAs as they are updated.