`NA`s in benchmark-columns should yield `NA`s in `risk_category` (except "all")
Closed this issue · 8 comments
Dear @maurolepore, I was writing tests for this issue and found that the indicators' code does not assign a risk category when `profile_ranking` is `NA`. Those rows should have an `NA` risk category. The `NA`s appear in the `profile_ranking` column due to `NA` in `isic_4digit`. @AnneSchoenauer Do you agree?
Please have a look at this reprex:
library(tibble)
library(readr)
library(tiltIndicator)
companies <- tibble(
companies_id = c("id1"),
clustered = c("cl1"),
activity_uuid_product_uuid = c("uuid1"),
unit = c("any"))
co2 <- tibble(
activity_uuid_product_uuid = c("uuid1", "uuid1", "uuid1", "uuid1", "uuid1", "uuid1"),
co2_footprint = c(1.0, 1.0, 1.0, 1.0, 1.0, 1.0),
ei_activity_name = c("any", "any", "any", "any", "any", "any"),
ei_geography = c("any", "any", "any", "any", "any", "any"),
isic_4digit = c(NA, NA, NA, NA, NA, NA),
tilt_sector = c("any", "any", "any", "any", "any", "any"),
tilt_subsector = c("any", "any", "any", "any", "any", "any"),
unit = c("any", "any", "any", "any", "any", "any"),
grouped_by = c("all", "isic_4digit", "tilt_sector", "unit", "unit_isic_4digit", "unit_tilt_sector"),
profile_ranking = c(1.0, NA, 1.0, 1.0, NA, 1.0))
co2
#> # A tibble: 6 × 10
#> activity_uuid_produc…¹ co2_footprint ei_activity_name ei_geography isic_4digit
#> <chr> <dbl> <chr> <chr> <lgl>
#> 1 uuid1 1 any any NA
#> 2 uuid1 1 any any NA
#> 3 uuid1 1 any any NA
#> 4 uuid1 1 any any NA
#> 5 uuid1 1 any any NA
#> 6 uuid1 1 any any NA
#> # ℹ abbreviated name: ¹activity_uuid_product_uuid
#> # ℹ 5 more variables: tilt_sector <chr>, tilt_subsector <chr>, unit <chr>,
#> # grouped_by <chr>, profile_ranking <dbl>
result <- emissions_profile(
companies,
co2) |>
unnest_product()
# bad
result
#> # A tibble: 4 × 7
#> companies_id grouped_by risk_category profile_ranking clustered
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 id1 all high 1 cl1
#> 2 id1 tilt_sector high 1 cl1
#> 3 id1 unit high 1 cl1
#> 4 id1 unit_tilt_sector high 1 cl1
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>
# good (expected)
#> # A tibble: 4 × 7
#> companies_id grouped_by risk_category profile_ranking clustered
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 id1 all high 1 cl1
#> 2 id1 tilt_sector high 1 cl1
#> 3 id1 unit high 1 cl1
#> 4 id1 unit_tilt_sector high 1 cl1
#> 5 id1 isic_4digit NA NA cl1
#> 6 id1 unit_isic_4digit NA NA cl1
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>
Created on 2023-12-11 with reprex v2.0.2
Hi both. Yes, I agree. If there is an NA in ISIC_4digit, the risk category for the benchmarks isic_sec and isic_sec_unit is NA. The same holds if the tilt sector is missing: the risk category for the benchmarks tilt_sec and tilt_sec_unit is then NA as well.
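To make the product-level expectation concrete, here is a minimal base-R sketch that keeps every benchmark row and lets an `NA` in `profile_ranking` flow through to `risk_category`. The helper `categorize_with_na()` is hypothetical (it is not a tiltIndicator function), and the cut-offs 1/3 and 2/3 are assumptions for illustration only; the `grouped_by` and `profile_ranking` values are the ones from the reprex.

```r
# Hypothetical helper, not part of tiltIndicator. The thresholds are assumed.
categorize_with_na <- function(ranking, low = 1 / 3, high = 2 / 3) {
  # Start from NA so that a missing ranking stays a missing category.
  out <- rep(NA_character_, length(ranking))
  out[!is.na(ranking) & ranking <= low] <- "low"
  out[!is.na(ranking) & ranking > low & ranking <= high] <- "medium"
  out[!is.na(ranking) & ranking > high] <- "high"
  out
}

product <- data.frame(
  grouped_by = c(
    "all", "isic_4digit", "tilt_sector",
    "unit", "unit_isic_4digit", "unit_tilt_sector"
  ),
  profile_ranking = c(1, NA, 1, 1, NA, 1)
)

# No row is dropped: benchmarks with a missing ranking get an NA category.
product$risk_category <- categorize_with_na(product$profile_ranking)
product
```

All six benchmarks stay in the result, so a missing value for one benchmark does not remove the others.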
a01
If there is NA in the ISIC_4digit the risk category for the benchmark isic_sec, isic_sec_unit is NA.
Thanks @kalashsinghal for the reprex and @AnneSchoenauer for the confirmation.
- ml01. This seems to conflict with this other requirement but I need to check: #393 (comment)
- ml02. Do you expect this both at product and company level?
- ml03. Do you expect this only for `emissions_profile()`?
- ml04. Is the current behaviour expected for the other benchmarks (i.e. values of `grouped_by`)?
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9107'
companies <- example_companies()
companies
#> # A tibble: 1 × 8
#> companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a a a total energy a
#> # ℹ 2 more variables: tilt_subsector <chr>, type <chr>
co2 <- example_products(
profile_ranking = c(1.0, NA, NA),
grouped_by = c("all", "isic_4digit", "unit_isic_4digit")
)
co2
#> # A tibble: 3 × 7
#> profile_ranking grouped_by activity_uuid_product_uuid tilt_sector unit
#> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 all a a a
#> 2 NA isic_4digit a a a
#> 3 NA unit_isic_4digit a a a
#> # ℹ 2 more variables: isic_4digit <chr>, co2_footprint <dbl>
result <- emissions_profile(companies, co2)
result |> unnest_product()
#> # A tibble: 1 × 7
#> companies_id grouped_by risk_category profile_ranking clustered
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 a all high 1 a
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>
result |> unnest_company()
#> # A tibble: 3 × 4
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a all high 1
#> 2 a all medium 0
#> 3 a all low 0
a02
Same holds if tilt sector is missing then the risk category is as well for benchmark tilt_sec, tilt_sec_unit NA.
- ml05. So in the reprex you expect `risk_category` to be `NA` instead of `1`?
- ml06. Is the output at company level as you expect?
# If tilt sector is missing then the risk category is as well for benchmark
# tilt_sec, tilt_sec_unit NA.
devtools::load_all()
#> ℹ Loading tiltIndicator
packageVersion("tiltIndicator")
#> [1] '0.0.0.9107'
companies <- example_companies()
companies
#> # A tibble: 1 × 8
#> companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a a a total energy a
#> # ℹ 2 more variables: tilt_subsector <chr>, type <chr>
co2 <- example_products(tilt_sector = c(1.0, NA))
co2
#> # A tibble: 2 × 5
#> tilt_sector activity_uuid_product_uuid unit isic_4digit co2_footprint
#> <dbl> <chr> <chr> <chr> <dbl>
#> 1 1 a a '1234' 1
#> 2 NA a a '1234' 1
result <- emissions_profile(companies, co2)
# FIXME `risk_category` should be NA?
result |> unnest_product() |> filter(grepl("tilt_sec", grouped_by))
#> # A tibble: 2 × 7
#> companies_id grouped_by risk_category profile_ranking clustered
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 a tilt_sector high 1 a
#> 2 a unit_tilt_sector high 1 a
#> # ℹ 2 more variables: activity_uuid_product_uuid <chr>, co2_footprint <dbl>
# ASK: Is this as you expect?
result |> unnest_company() |> filter(grepl("tilt_sec", grouped_by))
#> # A tibble: 6 × 4
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a tilt_sector high 1
#> 2 a tilt_sector medium 0
#> 3 a tilt_sector low 0
#> 4 a unit_tilt_sector high 1
#> 5 a unit_tilt_sector medium 0
#> 6 a unit_tilt_sector low 0
Hi Mauro,
Here the answers:
ml01. This seems to conflict with this other requirement but I need to check: #393 (comment).
I think it should first produce an NA risk_category. Whether we drop it later is a question for the tiltIndicatorAfter package (i.e. whether we want to display it). What is important, however, is that an NA risk category for one benchmark does not cause all other benchmarks to be dropped. For example, the product apple could have a risk category for the benchmark all_unit but not for tilt_sector. In that case we need risk category values for the benchmark all_unit and NA for the benchmark tilt_sector.
ml02. Do you expect this both at product and company level?
Please see ml06 for this.
ml03. Do you expect this only emissions_profile()?
No. But for sector_profile() it depends on the availability of scenarios. For example, if a product has no reduction targets for certain scenarios (IPR, WEO), the risk_category will also be NA. However, I think the code works fine for those cases; I didn't notice any complications in the output tables.
ml04. Is the current behaviour expected for the other benchmarks (i.e. values of grouped_by)?
Yes. It can happen for all benchmarks except "all".
ml05. So in the reprex you expect risk_category to be NA instead of 1?
It depends on the benchmark. If I understand your reprex correctly, the profile_ranking was 1 when grouped by "all". In this case the risk_category for the benchmark "all" should be 1. However, the profile ranking was NA for the groups "isic_4digit" and "unit_isic_4digit". If the profile ranking is NA, the product cannot be placed in those groups, most likely because it has no ISIC code associated with it. In this case the risk_category for the benchmarks "isic_4digit" and "unit_isic_4digit" should be NA.
ml06. Is the output at company level as you expect?
No. In this case I would expect an additional category, NA. So it would be:
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a tilt_sector high 0.5
#> 2 a tilt_sector medium 0
#> 3 a tilt_sector low 0
#> 4 a tilt_sector NA 0.5
@Tilmon could you please also read this repo and confirm? Thanks!
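The expected company-level table above comes down to a little arithmetic: each `value` is the share of the company's products in a risk category, with `NA` counted as a category of its own. A minimal base-R sketch, assuming two products under the `tilt_sector` benchmark, one "high" and one without a category (the input vector is made up for illustration and this is not how tiltIndicator computes it):

```r
# Assumed product-level risk categories for one company and one benchmark.
risk_category <- c("high", NA)

# Fix the levels so that empty categories still show up with a share of 0.
levels_wanted <- c("high", "medium", "low")
counts <- table(factor(risk_category, levels = levels_wanted), useNA = "always")

# Counts come back in the order high, medium, low, NA.
company <- data.frame(
  risk_category = c(levels_wanted, NA),
  value = as.numeric(counts) / length(risk_category)
)
company
```

Here `high` and `NA` each get 0.5 while `medium` and `low` stay at 0, so the values still sum to 1.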
Hi there!
I agree with @AnneSchoenauer's comments and explanations. I have one question about @maurolepore's reprex though, re ml05 & ml06.
What's confusing to me (not sure if it's a mistake in the reprex or I just misunderstood something) is that in the reprex for ml05 and ml06, you use an example dataset that has two different `tilt_sector` entries (once tilt_sector = 1, once tilt_sector = NA) for the same `activity_uuid_product_uuid`, see screenshot below:
I don't think such a case exists, right? If this is indeed an "impossible" example, I think the result you gave in the reprex may also be misleading.
In any case, I agree with Anne's suggestion:
No in this case I would expect that there is another category NA. So it would be
OK thanks everyone for your input. There is a whole lot going on here and I'll need to dive deep before I can do anything meaningful. We can discuss the priority during the tech meeting today.
@Tilmon, thanks for mentioning that the toy dataset I created is confusing. Once I understand this thread deeply enough I hope to be able to expose the problem with a reprex using a more realistic dataset.
I'm re-doing the reprex because `grouped_by` and `profile_ranking` are no longer passed via the `co2` datasets but instead they are computed internally in tiltIndicator.
Currently
devtools::load_all()
#> ℹ Loading tiltIndicator
options(width = 500)
packageVersion("tiltIndicator")
#> [1] '0.0.0.9108'
companies <- example_companies()
companies
#> # A tibble: 1 × 8
#> companies_id clustered activity_uuid_product_uuid sector subsector tilt_sector tilt_subsector type
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a a a total energy a a ipr
co2 <- example_products(isic_4digit = NA)
co2
#> # A tibble: 1 × 5
#> isic_4digit activity_uuid_product_uuid tilt_sector unit co2_footprint
#> <lgl> <chr> <chr> <chr> <dbl>
#> 1 NA a a a 1
result <- emissions_profile(companies, co2)
result |> unnest_product()
#> # A tibble: 4 × 7
#> companies_id grouped_by risk_category profile_ranking clustered activity_uuid_product_uuid co2_footprint
#> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl>
#> 1 a all high 1 a a 1
#> 2 a tilt_sector high 1 a a 1
#> 3 a unit high 1 a a 1
#> 4 a unit_tilt_sector high 1 a a 1
result |> unnest_company()
#> # A tibble: 12 × 4
#> companies_id grouped_by risk_category value
#> <chr> <chr> <chr> <dbl>
#> 1 a all high 1
#> 2 a all medium 0
#> 3 a all low 0
#> 4 a tilt_sector high 1
#> 5 a tilt_sector medium 0
#> 6 a tilt_sector low 0
#> 7 a unit high 1
#> 8 a unit medium 0
#> 9 a unit low 0
#> 10 a unit_tilt_sector high 1
#> 11 a unit_tilt_sector medium 0
#> 12 a unit_tilt_sector low 0
Notes to self
My new read of the answers to my questions reveals that this issue is very complex. I'll need to break it down and address each component separately. A clear division is between the output at product level and the output at company level.
At product level the requirement seems pretty straightforward, although it's still unclear whether it applies to all indicators and all benchmarks. Here I'll need to ask again.
At company level the requirement seems less straightforward. We introduce a new level of `risk_category`: `NA`. This changes the structure of the expected output and may break multiple tests.
- Computing a `value` for the `NA` level of `risk_category` may be a little tricky, so make sure to test it well.
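One property worth testing at company level is that, for every benchmark, the values across `high`, `medium`, `low` and `NA` add up to 1. A sketch of such a check in base R, using a made-up product-level table (the data and the tolerance are assumptions, and this is not taken from the package's test suite):

```r
# Made-up product-level results: two products per benchmark, one with NA.
products <- data.frame(
  grouped_by = c("all", "all", "tilt_sector", "tilt_sector"),
  risk_category = c("high", "low", "high", NA)
)

risk <- factor(products$risk_category, levels = c("high", "medium", "low"))

# Rows are benchmarks, columns are high, medium, low and (if present) NA.
counts <- table(products$grouped_by, risk, useNA = "ifany")
shares <- prop.table(counts, margin = 1)

# Each benchmark's shares, including the NA level, should sum to 1.
stopifnot(all(abs(rowSums(shares) - 1) < 1e-12))
shares
```

If the `NA` level were silently dropped before computing the shares, the remaining categories would be renormalised and this check would still pass, so a test should also assert the `NA` share itself (0.5 for `tilt_sector` in this toy data).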
@Tilmon I put this on high priority as I will need this for the Bundesbank data. If this is a problem let me know!
Note:
cc @maurolepore @AnneSchoenauer
As discussed in the tech sprint (2024-02-13),
- this issue applies to all 4 indicators
- for emission profile & emission upstream profile, missing values occur if we don't have data on any of `unit`, `tilt_sector`, or `isic_4digit`, as these variables define the benchmarks.
- For sector and sector upstream, missing values occur for sector & subsector, which makes mapping to a scenario-year combination impossible and hence leads to `NA`s for a given `grouped_by`.
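For the emission profiles, the link between missing inputs and missing benchmarks can be sketched as a lookup from each `grouped_by` value to the columns that define it. The mapping below is inferred from the benchmark names used in the reprexes in this thread and is an assumption, as is the helper name `benchmarks_with_na()`; neither is part of tiltIndicator.

```r
# Assumed mapping: which input columns define each benchmark.
benchmark_columns <- list(
  all = character(0),
  isic_4digit = "isic_4digit",
  tilt_sector = "tilt_sector",
  unit = "unit",
  unit_isic_4digit = c("unit", "isic_4digit"),
  unit_tilt_sector = c("unit", "tilt_sector")
)

# Hypothetical helper: name the benchmarks that cannot be computed for a
# product because at least one of their defining columns is NA.
benchmarks_with_na <- function(product) {
  is_missing <- vapply(
    benchmark_columns,
    function(cols) any(is.na(unlist(product[cols]))),
    logical(1)
  )
  names(is_missing)[is_missing]
}

# Illustrative product with a missing ISIC code.
product <- list(unit = "kg", tilt_sector = "energy", isic_4digit = NA_character_)
benchmarks_with_na(product)
```

Because "all" depends on no column, it can never be flagged, which matches the statement earlier in the thread that the `NA` case can occur for every benchmark except "all".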