Estimate non-zero FlowAmounts when below source reporting threshold

The Aircraft row is missing from a newly generated EPA_GHGI_T_3_14 FBA (year is 2016, if that matters).

Yes, there are a few lingering issues in a handful of tables that I'm working through. Feel free to document any more that you see here.

I see that for 2016, that row is identified as being less than 0.05 MMT (specific value not given). Is that why it's excluded? If so, maybe there's a better way of handling that situation?

Yep, that would be it. I do think your update to add a new field for that makes sense. In the current GHGI branch it's not yet implemented.

Renamed this issue to broaden the scope to the issue at the heart of this example: a source provides no value for a row but also does not report it as zero, instead showing a symbol such as '+' or '-'. This can happen, for example, when the value is smaller than the significant figures provided.

This is really something that needs to be implemented for each specific FBA. We have a few examples now where this is done:

EIA MECS:

flowsa/flowsa/data_source_scripts/EIA_MECS.py

Lines 426 to 432 in 0d44ae9

    
    df = df.assign(
        FlowAmount=df.FlowAmount.mask(df.FlowAmount.str.isnumeric() == False,
                                      np.nan),
        Suppressed=df.FlowAmount.where(df.FlowAmount.str.isnumeric() == False,
                                       np.nan),
        Spread=df.Spread.mask(df.Spread.str.isnumeric() == False, np.nan)
    )
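
To make the pattern concrete, here is a minimal sketch of the same split on a toy frame. The values are invented for illustration and are not MECS data, and the `is_num` helper variable is not in the original. One thing to keep in mind: `str.isnumeric()` is only true for strings made up entirely of digit characters, so a value such as '1.5' or '1,234' would also be treated as non-numeric by this check.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not actual MECS values
df = pd.DataFrame({"FlowAmount": ["12", "*", "D", "7"]})

# True where the string consists only of digit characters
is_num = df.FlowAmount.str.isnumeric()

df = df.assign(
    # keep numeric strings, replace flags with NaN
    FlowAmount=df.FlowAmount.where(is_num, np.nan),
    # keep the flags, replace numeric strings with NaN
    Suppressed=df.FlowAmount.mask(is_num, np.nan),
)
print(df)
```

Both keyword arguments are evaluated against the original `df` before `assign` runs, so the `Suppressed` column still sees the raw flags even though `FlowAmount` is being overwritten in the same call.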

Census SAS:

flowsa/flowsa/data_source_scripts/Census_SAS.py

Lines 66 to 82 in c18da34

    
    # set suppressed values to 0 but mark as suppressed
    # otherwise set non-numeric to nan
    df = (df.assign(
            Suppressed = np.where(df.FlowAmount.str.strip().isin(["S", "Z", "D"]),
                                  df.FlowAmount.str.strip(),
                                  np.nan),
            FlowAmount = np.where(df.FlowAmount.str.strip().isin(["S", "Z", "D"]),
                                  0,
                                  df.FlowAmount)))
    df = (df.assign(
            Suppressed = np.where(df.FlowAmount.str.endswith('(s)') == True,
                                  '(s)',
                                  df.Suppressed),
            FlowAmount = np.where(df.FlowAmount.str.endswith('(s)') == True,
                                  df.FlowAmount.str.replace(',','').str[:-3],
                                  df.FlowAmount),
        ))

GHGI:

flowsa/flowsa/data_source_scripts/EPA_GHGI.py

Lines 441 to 460 in 56c6da4

    
    # set suppressed values to 0 but mark as suppressed
    # otherwise set non-numeric to nan
    try:
        df = (df.assign(
                Suppressed = np.where(df.FlowAmount.str.strip() == "+", "+",
                                      np.nan),
                FlowAmount = pd.Series(
                    np.where(df.FlowAmount.str.strip() == "+", 0,
                             df.FlowAmount.str.replace(',',''))))
            )
        df = (df.assign(
                FlowAmount = np.where(pd.to_numeric(
                    df.FlowAmount, errors='coerce').isnull(),
                                      np.nan, pd.to_numeric(
                                          df.FlowAmount, errors='coerce')))
            .dropna(subset='FlowAmount')
            )
    except AttributeError:
        # if no string in FlowAmount, then proceed
        df = df.dropna(subset='FlowAmount')

The approach to handling the suppressed data is then specified in an FBS, for example with this function for MECS:

flowsa/flowsa/data_source_scripts/EIA_MECS.py

Lines 437 to 458 in 0d44ae9

    
    def estimate_suppressed_mecs_energy(
            fba: FlowByActivity,
            **kwargs
        ) -> FlowByActivity:
        '''
        Rough first pass at an estimation method, for testing purposes. This
        will drop rows with 'D' or 'Q' values, on the grounds that as far as I can
        tell we don't have any more information for them than we do for any
        industry without its own line item in the MECS anyway. '*' is for value
        less than 0.5 Trillion Btu and will be assumed to be 0.25 Trillion Btu
        '''
        if 'Suppressed' not in fba.columns:
            log.warning('The current MECS dataframe does not contain data '
                        'on estimation method and so suppressed data will '
                        'not be assessed.')
            return fba
        dropped = fba.query('Suppressed not in ["D", "Q"]')
        unsuppressed = dropped.assign(
            FlowAmount=dropped.FlowAmount.mask(dropped.Suppressed == '*', 0.25)
        )

        return unsuppressed.drop(columns='Suppressed')
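
By analogy, the GHGI '+' rows from the original example could be handled the same way. The following is only a sketch, not something that exists in flowsa. The function name `estimate_suppressed_ghgi`, its `threshold` parameter, and the choice of half the 0.05 MMT reporting threshold (mirroring the 0.25 of 0.5 Trillion Btu assumption in the MECS function) are all assumptions made for illustration. It works on a plain pandas DataFrame rather than a `FlowByActivity` so that it runs on its own.

```python
import pandas as pd


def estimate_suppressed_ghgi(
        df: pd.DataFrame,
        threshold: float = 0.05,  # assumed reporting threshold, in MMT
) -> pd.DataFrame:
    """Hypothetical helper: give '+' rows half the reporting threshold."""
    if "Suppressed" not in df.columns:
        # nothing to estimate without the marker column
        return df
    estimated = df.assign(
        FlowAmount=df.FlowAmount.mask(df.Suppressed == "+", threshold / 2)
    )
    return estimated.drop(columns="Suppressed")


# Toy frame shaped like the output of the GHGI parsing step;
# the rows and values are made up for illustration
toy = pd.DataFrame({
    "ActivityProducedBy": ["Aircraft", "Passenger Cars"],
    "FlowAmount": [0.0, 1.2],
    "Suppressed": ["+", None],
})
result = estimate_suppressed_ghgi(toy)
print(result)
```

With something like this, the Aircraft row would be kept with a small non-zero estimate instead of being dropped or left at 0. Whether half the threshold is a reasonable estimate is a judgment call for each source.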

Going to close this issue as resolved, knowing that this can be added to FBAs as they are updated.
