`NA`s in benchmark-columns should yield `NA`s in `risk_category` (except "all")

Question

Closed this issue 5 months ago · 8 comments

Dear @maurolepore, while writing a test for this issue I found that the indicators' code does not assign a risk category to rows with NA in the profile_ranking column; those rows are dropped from the output instead. They should be kept with an NA risk category. The NA in profile_ranking comes from NA in isic_4digit. @AnneSchoenauer, do you agree?

Please have a look at this reprex:

library(tibble)
  library(readr)
  library(tiltIndicator)
  companies <- tibble(
    companies_id = c("id1"),
    clustered = c("cl1"),
    activity_uuid_product_uuid = c("uuid1"),
    unit = c("any"))
  
  co2 <- tibble(
    activity_uuid_product_uuid = c("uuid1", "uuid1", "uuid1", "uuid1", "uuid1", "uuid1"),
    co2_footprint = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0),
    ei_activity_name = c("any", "any", "any", "any", "any", "any"),
    ei_geography = c("any", "any", "any", "any", "any", "any"),
    isic_4digit = c(NA, NA, NA, NA, NA, NA),
    tilt_sector = c("any", "any", "any", "any", "any", "any"),
    tilt_subsector = c("any", "any", "any", "any", "any", "any"),
    unit = c("any", "any", "any", "any", "any", "any"),
    grouped_by = c("all", "isic_4digit", "tilt_sector", "unit", "unit_isic_4digit", "unit_tilt_sector"),
    profile_ranking = c(1.0, NA, 1.0, 1.0, NA, 1.0))
  
  co2
#> # A tibble: 6 × 10
#>   activity_uuid_produc…¹ co2_footprint ei_activity_name ei_geography isic_4digit
#>   <chr>                          <dbl> <chr>            <chr>        <lgl>      
#> 1 uuid1                              1 any              any          NA         
#> 2 uuid1                              1 any              any          NA         
#> 3 uuid1                              1 any              any          NA         
#> 4 uuid1                              1 any              any          NA         
#> 5 uuid1                              1 any              any          NA         
#> 6 uuid1                              1 any              any          NA         
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 5 more variables: tilt_sector <chr>, tilt_subsector <chr>, unit <chr>,
#> #   grouped_by <chr>, profile_ranking <dbl>
  
  result <- emissions_profile(
    companies,
    co2) |>
    unnest_product()
  
  # bad
  result
#> # A tibble: 4 × 7
#>   companies_id grouped_by       risk_category profile_ranking clustered
#>   <chr>        <chr>            <chr>                   <dbl> <chr>    
#> 1 id1          all              high                        1 cl1      
#> 2 id1          tilt_sector      high                        1 cl1      
#> 3 id1          unit             high                        1 cl1      
#> 4 id1          unit_tilt_sector high                        1 cl1      
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>

# good (expected)
#> # A tibble: 6 × 7
#>   companies_id grouped_by       risk_category profile_ranking clustered
#>   <chr>        <chr>            <chr>                   <dbl> <chr>
#> 1 id1          all              high                        1 cl1
#> 2 id1          tilt_sector      high                        1 cl1
#> 3 id1          unit             high                        1 cl1
#> 4 id1          unit_tilt_sector high                        1 cl1
#> 5 id1          isic_4digit      NA                         NA cl1
#> 6 id1          unit_isic_4digit NA                         NA cl1
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>

Created on 2023-12-11 with reprex v2.0.2

Answer 1 · 2023-12-12T06:56:35.000Z

Hi both. Yes, I agree. If there is NA in isic_4digit, the risk category for the benchmarks isic_sec and isic_sec_unit is NA. The same holds for tilt sector: if it is missing, the risk category for the benchmarks tilt_sec and tilt_sec_unit is NA as well.

Answer 2 · 2023-12-12T13:04:05.000Z

a01

If there is NA in the ISIC_4digit the risk category for the benchmark isic_sec, isic_sec_unit is NA.

Thanks @kalashsinghal for the reprex and @AnneSchoenauer for the confirmation.

ml01. This seems to conflict with this other requirement but I need to check: #393 (comment)
ml02. Do you expect this both at product and company level?
ml03. Do you expect this only emissions_profile()?
ml04. Is the current behaviour expected for the other benchmarks (i.e. values of grouped_by)?

devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9107'

companies <- example_companies()
companies
#> # A tibble: 1 × 8
#>   companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector
#>   <chr>        <chr>     <chr>                      <chr>  <chr>     <chr>      
#> 1 a            a         a                          total  energy    a          
#> # ℹ 2 more variables: tilt_subsector <chr>, type <chr>

co2 <- example_products(
  profile_ranking = c(1.0, NA, NA),
  grouped_by = c("all", "isic_4digit", "unit_isic_4digit")
)
co2
#> # A tibble: 3 × 7
#>   profile_ranking grouped_by       activity_uuid_product_uuid tilt_sector unit 
#>             <dbl> <chr>            <chr>                      <chr>       <chr>
#> 1               1 all              a                          a           a    
#> 2              NA isic_4digit      a                          a           a    
#> 3              NA unit_isic_4digit a                          a           a    
#> # ℹ 2 more variables: isic_4digit <chr>, co2_footprint <dbl>

result <- emissions_profile(companies, co2)

result |> unnest_product()
#> # A tibble: 1 × 7
#>   companies_id grouped_by risk_category profile_ranking clustered
#>   <chr>        <chr>      <chr>                   <dbl> <chr>    
#> 1 a            all        high                        1 a        
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>

result |> unnest_company()
#> # A tibble: 3 × 4
#>   companies_id grouped_by risk_category value
#>   <chr>        <chr>      <chr>         <dbl>
#> 1 a            all        high              1
#> 2 a            all        medium            0
#> 3 a            all        low               0

a02

Same holds if tilt sector is missing then the risk category is as well for benchmark tilt_sec, tilt_sec_unit NA.

ml05. So in the reprex you expect risk_category to be NA instead of 1?
ml06. Is the output at company level as you expect?

# If tilt sector is missing then the risk category is as well for benchmark
# tilt_sec, tilt_sec_unit NA.

devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9107'

companies <- example_companies()
companies
#> # A tibble: 1 × 8
#>   companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector
#>   <chr>        <chr>     <chr>                      <chr>  <chr>     <chr>      
#> 1 a            a         a                          total  energy    a          
#> # ℹ 2 more variables: tilt_subsector <chr>, type <chr>

co2 <- example_products(tilt_sector = c(1.0, NA))
co2
#> # A tibble: 2 × 5
#>   tilt_sector activity_uuid_product_uuid unit  isic_4digit co2_footprint
#>         <dbl> <chr>                      <chr> <chr>               <dbl>
#> 1           1 a                          a     '1234'                  1
#> 2          NA a                          a     '1234'                  1

result <- emissions_profile(companies, co2)

# FIXME `risk_category` should be NA?
result |> unnest_product() |> filter(grepl("tilt_sec", grouped_by))
#> # A tibble: 2 × 7
#>   companies_id grouped_by       risk_category profile_ranking clustered
#>   <chr>        <chr>            <chr>                   <dbl> <chr>    
#> 1 a            tilt_sector      high                        1 a        
#> 2 a            unit_tilt_sector high                        1 a        
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>

# ASK: Is this as you expect?
result |> unnest_company() |> filter(grepl("tilt_sec", grouped_by))
#> # A tibble: 6 × 4
#>   companies_id grouped_by       risk_category value
#>   <chr>        <chr>            <chr>         <dbl>
#> 1 a            tilt_sector      high              1
#> 2 a            tilt_sector      medium            0
#> 3 a            tilt_sector      low               0
#> 4 a            unit_tilt_sector high              1
#> 5 a            unit_tilt_sector medium            0
#> 6 a            unit_tilt_sector low               0

Answer 3 · 2023-12-14T11:37:09.000Z

Hi Mauro,
Here the answers:

ml01. This seems to conflict with this other requirement but I need to check: #393 (comment).

I think it should first produce the risk_category NA. Whether we drop it later is a question for the tiltIndicatorAfter package (that is, whether we want to show it). What is important, however, is that an NA risk category for one benchmark does not cause all other benchmarks to be dropped. For example, the product apple could have a risk category for the benchmark all_unit but not for tilt_sector. In that case we need the risk category values for the benchmark all_unit, and NA for the benchmark tilt_sector.
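
A minimal base R sketch of this "keep every benchmark" requirement, not a description of how tiltIndicator builds its output: first create one row per product and benchmark, then left-join the computed categories so that a benchmark without a category stays in the result as NA. All data and object names here are hypothetical, and the benchmark labels are borrowed from the `grouped_by` values used earlier in this thread.

```r
# Hypothetical categories already computed for the product "apple".
# There is no row for "tilt_sector" because no category could be computed.
computed <- data.frame(
  product = c("apple", "apple"),
  grouped_by = c("all", "unit"),
  risk_category = c("high", "high"),
  stringsAsFactors = FALSE
)

# One row per product and benchmark that the output should contain.
wanted <- expand.grid(
  product = "apple",
  grouped_by = c("all", "unit", "tilt_sector"),
  stringsAsFactors = FALSE
)

# all.x = TRUE makes this a left join: rows of `wanted` without a match
# in `computed` are kept, with NA in risk_category.
out <- merge(wanted, computed, by = c("product", "grouped_by"), all.x = TRUE)
out
```

With an inner join (the default of `merge()`), the tilt_sector row would disappear, which is the kind of silent drop this issue is about.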

ml02. Do you expect this both at product and company level?

Please see ml06 for this.

ml03. Do you expect this only emissions_profile()?

No. But for sector_profile() it depends on the availability of scenarios. For example, if a product has no reduction targets for certain scenarios (IPR, WEO), the risk_category will also be NA. However, I think the code works fine for those cases; I didn't notice any complications in the output tables.

ml04. Is the current behaviour expected for the other benchmarks (i.e. values of grouped_by)?

Yes. It can happen for all benchmarks except "all".

ml05. So in the reprex you expect risk_category to be NA instead of 1?

It depends on the benchmark. If I understand correctly, in your reprex the profile_ranking was 1 when grouped by "all". In this case the risk_category for the benchmark "all" should be the one that corresponds to a ranking of 1 ("high" in the reprex output). However, the profile ranking was NA for the groups "Isic_4_digit" and "Isic_4_digit_unit". If the profile ranking is NA, it means that the product cannot be placed in those groups, most likely because the product has no ISIC code associated with it. In this case the risk_category for the benchmarks "isic_4_digit" and "isic_4_digit_unit" should be NA.
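
As a minimal sketch of this product-level rule (not the tiltIndicator implementation), an NA ranking can simply propagate to the category. The helper name `categorize_risk_sketch()` and the thresholds 1/3 and 2/3 are assumptions made for illustration only.

```r
# Hypothetical helper, not part of tiltIndicator: map a ranking to a category.
# The thresholds are assumed values; an NA ranking yields an NA category.
categorize_risk_sketch <- function(profile_ranking, low = 1 / 3, high = 2 / 3) {
  ifelse(
    is.na(profile_ranking), NA_character_,
    ifelse(profile_ranking <= low, "low",
      ifelse(profile_ranking <= high, "medium", "high")
    )
  )
}

# Same rankings as in the reprex for ml05.
product <- data.frame(
  grouped_by = c("all", "isic_4digit", "unit_isic_4digit"),
  profile_ranking = c(1.0, NA, NA),
  stringsAsFactors = FALSE
)
product$risk_category <- categorize_risk_sketch(product$profile_ranking)
product
```

The point of the sketch is that all three rows survive: only the category is NA, the benchmark row itself is not removed.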

ml06. Is the output at company level as you expect?

No, in this case I would expect that there is another category NA. So it would be (the value for high changes to 0.5 and row 4 is new):

#>   companies_id grouped_by       risk_category value
#>   <chr>        <chr>            <chr>         <dbl>
#> 1 a            tilt_sector      high            0.5
#> 2 a            tilt_sector      medium          0
#> 3 a            tilt_sector      low             0
#> 4 a            tilt_sector      NA              0.5
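
These expected values can be reproduced with a small base R sketch. It assumes that `value` is the share of a company's products falling in each category, and that the company has one product rated high and one product without a category. Both the data and this reading of `value` are assumptions, not something confirmed by the package code.

```r
# Hypothetical product-level categories of one company under one benchmark.
product_risk <- c("high", NA)

# Categories to report, with NA as an explicit extra category.
categories <- c("high", "medium", "low", NA)

# Share of products in each category; NA needs is.na() because
# comparing with NA via `==` never returns TRUE.
share <- vapply(
  categories,
  function(category) {
    if (is.na(category)) mean(is.na(product_risk)) else mean(product_risk %in% category)
  },
  numeric(1),
  USE.NAMES = FALSE
)

company <- data.frame(risk_category = categories, value = share)
company
```

Because NA is counted as its own category, the values still sum to 1, and the share of high drops from 1 to 0.5.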

@Tilmon could you please also read this repo and confirm? Thanks!

Answer 4 · 2023-12-14T18:05:05.000Z

Hi there!

I agree with @AnneSchoenauer's comments and explanations. I have one question about @maurolepore's reprex though, re ml05 & ml06.

What's confusing to me (I'm not sure whether it's a mistake in the reprex or I just misunderstood something) is that the reprex for ml05 and ml06 uses an example dataset with two different tilt_sector entries (once tilt_sector = 1, once tilt_sector = NA) for the same activity_uuid_product_uuid, see screenshot below:

I don't think such a case exists, right? If this really is an "impossible" example, I think the result of the reprex that you gave as an example may also be misleading.

In any case, I agree with Anne's suggestion:

No in this case I would expect that there is another category NA. So it would be

Answer 5 · 2023-12-19T14:14:27.000Z

OK thanks everyone for your input. There is a whole lot going on here and I'll need to dive deep before I can do anything meaningful. We can discuss the priority during the tech meeting today.

@Tilmon, thanks for mentioning that the toy dataset I created is confusing. Once I understand this thread deeply enough I hope to be able to expose the problem with a reprex using a more realistic dataset.

Answer 6 · 2024-01-08T22:20:37.000Z

I'm re-doing the reprex because grouped_by and profile_ranking are no longer passed via the co2 dataset but are instead computed internally in tiltIndicator.

Currently

devtools::load_all()
#> ℹ Loading tiltIndicator

options(width = 500)
packageVersion("tiltIndicator")
#> [1] '0.0.0.9108'


companies <- example_companies()
companies
#> # A tibble: 1 × 8
#>   companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector tilt_subsector type 
#>   <chr>        <chr>     <chr>                      <chr>  <chr>     <chr>       <chr>          <chr>
#> 1 a            a         a                          total  energy    a           a              ipr

co2 <- example_products(isic_4digit = NA)
co2
#> # A tibble: 1 × 5
#>   isic_4digit activity_uuid_product_uuid tilt_sector unit  co2_footprint
#>   <lgl>       <chr>                      <chr>       <chr>         <dbl>
#> 1 NA          a                          a           a                 1

result <- emissions_profile(companies, co2)

result |> unnest_product()
#> # A tibble: 4 × 7
#>   companies_id grouped_by       risk_category profile_ranking clustered activity_uuid_product_uuid co2_footprint
#>   <chr>        <chr>            <chr>                   <dbl> <chr>     <chr>                              <dbl>
#> 1 a            all              high                        1 a         a                                      1
#> 2 a            tilt_sector      high                        1 a         a                                      1
#> 3 a            unit             high                        1 a         a                                      1
#> 4 a            unit_tilt_sector high                        1 a         a                                      1

result |> unnest_company()
#> # A tibble: 12 × 4
#>    companies_id grouped_by       risk_category value
#>    <chr>        <chr>            <chr>         <dbl>
#>  1 a            all              high              1
#>  2 a            all              medium            0
#>  3 a            all              low               0
#>  4 a            tilt_sector      high              1
#>  5 a            tilt_sector      medium            0
#>  6 a            tilt_sector      low               0
#>  7 a            unit             high              1
#>  8 a            unit             medium            0
#>  9 a            unit             low               0
#> 10 a            unit_tilt_sector high              1
#> 11 a            unit_tilt_sector medium            0
#> 12 a            unit_tilt_sector low               0

Notes to self

My new read of the answers to my questions reveals that this issue is very complex. I'll need to break it down and address each component separately. A clear division is between the output at product level and the output at company level. Although

At product level the requirement seems pretty straightforward -- although it's still unclear if the requirement applies to all indicators, and all benchmarks. Here I'll need to ask again.

At company level the requirement seems less straightforward. We introduce a new level of risk_category: NA. This changes the structure of the expected output and may break multiple tests.

Computing a value for the NA level of risk_category may be a lil tricky, so make sure to test it well.

Answer 7 · 2024-01-29T07:26:35.000Z

@Tilmon I put this on high priority as I will need this for the Bundesbank data. If this is a problem let me know!

Answer 8 · 2024-02-13T15:23:47.000Z

Note:
cc @maurolepore @AnneSchoenauer

As discussed in the tech sprint (2024-02-13):

This issue applies to all 4 indicators.
For emission profile & emission upstream profile, missing values occur if we don't have data on any of unit, tilt_sector, or isic_4digit, as these variables define the benchmarks.
For sector and sector upstream, missing values occur for sector & subsector, which makes mapping to a scenario-year combination impossible and hence leads to NAs for a given grouped_by.
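
For the emission profiles, the link between missing columns and affected benchmarks can be sketched as below. The mapping from each `grouped_by` value to its defining columns is an assumption inferred from the benchmark names in this thread, and the product data is hypothetical; the sector indicators are not covered by this sketch.

```r
# Assumed mapping from each `grouped_by` value to the columns that define it.
# "all" depends on no column, so it can never be missing.
benchmark_columns <- list(
  all = character(0),
  isic_4digit = "isic_4digit",
  tilt_sector = "tilt_sector",
  unit = "unit",
  unit_isic_4digit = c("unit", "isic_4digit"),
  unit_tilt_sector = c("unit", "tilt_sector")
)

# Hypothetical product with a missing ISIC code.
product <- list(unit = "kg", tilt_sector = "energy", isic_4digit = NA_character_)

# A benchmark is unavailable if any of its defining columns is missing.
is_unavailable <- vapply(
  benchmark_columns,
  function(columns) any(is.na(unlist(product[columns]))),
  logical(1)
)

# Benchmarks whose risk_category should be NA for this product.
names(is_unavailable)[is_unavailable]
```

This also shows why "all" is the one exception named in the issue title: it has no defining column that could be missing.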