pfmc-assessments/nwfscSurvey

Counts of bins in `UnexpandedLFs.fn()` on lines 95-96 is unexpected

Closed this issue · 13 comments

https://github.com/nwfsc-assess/nwfscSurvey/blob/7755b4e1dda51bbbe4210d47a675ef5bfe7d9850/R/unexpandedLF.fn.R#L95-L96

I am uncertain what is being counted here. I would think that all of the columns for each bin should sum to equal Nsamp but it doesn't. For example, when you are using data(bio_nwfsc_combo) there are 1244 rows of sexed fish in 2003 yet if you sum the columns for bins of out you get 126.
Here is my example call

data(bio_nwfsc_combo)
new <- UnexpandedLFs.fn(
  dir = getwd(),
  datL = bio_nwfsc_combo,
  fleet = "my best fleet",
  month = 7,
  lgthBins = seq(1,100,by=10)
)

Here is the {dplyr} code that I was using to get at the number of sexed lengths per year while I was inside UnexpandedLFs.fn().

  datL %>%
    dplyr::filter(Sex %in% sex_out) %>% # Filter for sexes in sex_out
    dplyr::group_by(bin = cut(Length_cm, breaks = lgthBins)) %>% # Make new column "bin"
    dplyr::count(Year, Sex, bin) %>% # Make new column n, count of number of lengths per year and sex in a bin
    tidyr::pivot_wider(names_from = c(bin, Sex), values_from = n) # transform long to wide

I just ran the following, where the numbers did add up to 1244:

> library(nwfscSurvey)
> data(bio_nwfsc_combo)
> new <- UnexpandedLFs.fn(
  dir = getwd(),
  datL = bio_nwfsc_combo,
  fleet = "my best fleet",
  month = 7,
  lgthBins = seq(1,100,by=10)
)
> head(new$comps, 1)
    year month         fleet sex partition Nsamp F-1 F-11 F-21 F-31 F-41 F-51
Row 2003     7 my best fleet   3         0  1244   0   36   56   68  162  202
    F-61 F-71 F-81 F-91 M-1 M-11 M-21 M-31 M-41 M-51 M-61 M-71 M-81 M-91
Row  185   52   32    9   0   39   41   55   95  127   80    5    0    0
> sum(new$comps[1,-(1:6)])
[1] 1244

@kellijohnson-NOAA, is it possible that either bio_nwfsc_combo or UnexpandedLFs.fn() got modified on your computer?

I think the trouble was I changed the input length bins to understand what it was doing and b/c it is not using raw length_cm but rather an augmented vector of information then I was operating on different information. Which is why I normally combine {dplyr} calls rather than saving objects. Nevertheless, it was something that I was doing not an error in the code.

@brianlangseth-NOAA @andi-stephens-NOAA and @chantelwetzel-noaa can you respond on the comments that I put in the example {dplyr} calls above. Are these comments helpful? What comments would actually be helpful? It is difficult to know what to put because I would have the tendency to use ?dplyr::count or args(dplyr::count) rather than relying on comments. I tend to put comments for things that are not included in the code such as one-off examples or discussions on why one path instead of another was chosen.

The two lines of code flagged above file in either a 0 when there were no observations of fish in that year, fleet, and length bin combination and the second line of code is triggered when there is >0 observations of fish in that year, fleet, and length bin combination. @kellijohnson-NOAA discussed how these two lines could be simplified if we opted to keep the looped code (noting that we probably would be moving to the dplyr approach).

Looking over the comments, I still have questions on the last two commands, count and pivot_wider. Does the count just count the number of observation in a length bin for a sex within a year? I had to google a video on pivot_wider to understand this function. I think it might be more helpful here to say that the call is "appending columns for sex and bin combinations where the value by sex and bin is equal to the observations". Is this approximately correct?

@kellijohnson-NOAA This is not easy because it is impossible to know what is intuitive to each user.

To me dplyr::filter is straightforward and the comment just replicates the code. I would say it should be removed. dplyr::group_by is straightforward too however I'm use to dplyr::mutate being used to add a variable, so in this case the comment was helpful. However, if you had used mutate instead, I would say the comment just replicates the code and it should be removed. For the third comment, I did not know count creates a new variable. It is not really relevant but it is used in pivot_wider so having that in the comment helped. Otherwise, I tend to prefer writing the purpose of the line as opposed to what it does.

#Convert into the format needed for SS3 with the number of lengths entered for
#each year (as rows) for each length bins and sex (as columns)
  datL %>%
    dplyr::filter(Sex %in% sex_out) %>% 
    dplyr::group_by(bin = cut(Length_cm, breaks = lgthBins)) %>% #Put every length into a length bin
    dplyr::count(Year, Sex, bin) %>% # Make new column n, which is number of lengths per year and sex in a bin
    tidyr::pivot_wider(names_from = c(bin, Sex), values_from = n) # Reformat

As an alternative, because the code seems to be doing two things (determining the number of lengths for each year, sex, and bin; and then reformatting it into the format needed for SS3) applying unique piping with overall comments for each step could make it more clear. In this case you wouldn't really need the individual comments

#Determine the number of lengths by year for the desired bins and sexes
datL_counts <-  datL %>%
    dplyr::filter(Sex %in% sex_out) %>%
    dplyr::group_by(bin = cut(Length_cm, breaks = lgthBins)) %>%
    dplyr::count(Year, Sex, bin) 

#Reformat to have years as rows and bins by sexes as columns
datL_counts %>%
    tidyr::pivot_wider(names_from = c(bin, Sex), values_from = n) 

Thanks everyone for the discussion on this. Here is a function that I wrote to get the frequencies.

  create_frequencies <- function(data, valuecolumn, bins) {
    # TODO: Find a away to pass Year and Sex as ...
    stopifnot("Year" %in% colnames(data))
    stopifnot("Sex" %in% colnames(data))
    tidy <- data %>%
      dplyr::mutate(bin = cut(
        x = .data[[valuecolumn]],
        breaks = bins,
        right = FALSE
      )) %>%
      dplyr::count(Year, Sex, bin) %>%
      tidyr::pivot_wider(
        names_from = c(Sex, bin),
        values_from = n,
        values_fill = 0) %>%
      dplyr::mutate(Nsamp = rowSums(select(., -Year), na.rm = TRUE))
    return(tidy)
  }

I obviously need to comment what is going on and choices here but instead of replicating the for loops in UnexpandedLFs.fn for males and females and then again for unsexed fish, we could just use

  tidyFM <- create_frequencies(
    datL %>% filter(Sex %in% c("F", "M")),
    valuecolumn = "Length_cm",
    bins = lgthBins
  )
  tidyU <- create_frequencies(
    datL %>% filter(Sex %in% c("U")),
    valuecolumn = "Length_cm",
    bins = lgthBins
  )

Thoughts?

Can I suggest that do the data manipulation in separate lines of code?

  create_frequencies <- function(data, valuecolumn, bins) {

    # TODO: Find a away to pass Year and Sex as ...
    stopifnot("Year" %in% colnames(data))
    stopifnot("Sex" %in% colnames(data))

    tidy <- data %>%
      dplyr::mutate(bin = cut(
        x = .data[[valuecolumn]],  <--- Can this be written as "data[, valuecolumn]" ?
        breaks = bins,
        right = FALSE
      )) 

    tidy <- tidy %>%
      dplyr::count(Year, Sex, bin) %>%
      tidyr::pivot_wider(
        names_from = c(Sex, bin),
        values_from = n,
        values_fill = 0) 

   tidy <- tidy %>%
      dplyr::mutate(Nsamp = rowSums(select(., -Year), na.rm = TRUE))  

    return(tidy)
  }

I am not sure if the code works as separated above, but I find breaking out these commands makes it much simpler to understand for us dplyr newbies. I am also really big into providing blank lines in my code to make it visually easier to read which may be partially why I get confused with these large multiple data manipulation code chunks.

@kellijohnson-NOAA, this looks great to me and I like the @chantelwetzel-noaa suggestion.

In terms of repeating for males and females and unsexed, I would love to get some better consensus on if/when it makes sense to

  1. partition unsexed fish into male and female bins using the sex ratio of the observed fish in each length bin,
  2. include unsexed fish as a separate vector of unsexed observations, or
  3. ignore unsexed fish.

The current system where each STAT independently struggles to figure out what seems best in the absence of clear best practices is not awesome. If Option 1 is going to be available, we should think about where it makes the most sense to add that step and how to streamline it.

I definitely would like to develop a method to partition the unsexed fish into either males or females based on sex ratio information. I went to add this a while ago and realized the logic needed for this was a bit more complicated and opted to set it aside for the time being. Of course, I did not put notes in the code as to why I thought it was going to be complicated and can't remember exactly what the stumbling blocks at the time were (could have just been a time issue).

However, I think we need to set this up in a manner that the default is to not apply a sex ratio. This at least forces people to really see the split of observations by sexed and unsexed if they have not been looking. I have encountered (and probably done) some really terrible applications where the sex ratio was applied by default resulting in really augmented length compositions versus to those that would have been created when not applying a sex ratio.

I moved this thread about what to do with unsexed fish to PEPtools issues #8. I like being able to close out issues and use them for reference later. Thus, when they cover multiple topics the previous is hard to do.

Thanks for the comments on the dplyr code and what would be helpful to make it more readable. I am going to do a quick test in the morning to see if writing to three separate objects makes the code slower. If it doesn't, then I am fine with breaking it up. It is not best practice to overwrite the name of the object though so I would advocate for providing each object a unique name that explains what it is. I learned about this from Matthew Supernaw in C++ training.

Then I propose that as a group we write a few tests for this new function in a branch and a test for the the old function. Then we can change out some of the code and see if the test for the old function still passes. Please add to this to do list if you see other things that need to occur.

Perfect. Thank you @kellijohnson-NOAA. I completely agree about using different object names for each step, but was too lazy to do this in my example broken up code. Also, in terms of speed, I suspect the broken up version is still faster than the old loop code which was "fast" enough for me to not be bothered.

Since this thread has been moved to PEPtools I will close this issue. Additionally, users are being encouraged to use get_raw_comps() rather than this function.