gadget-framework/mfdb

standardized indices of abundance

Closed this issue · 13 comments

I'm beginning to have some doubts about the way I'm calculating indices of abundance in mfdb from bottom trawl surveys (DATRAS).

Information on the number of fish by length class (with or without standardization for haul duration, i.e. CPUE) is stored in a so-called HL table and imported with 'mfdb_import_survey'.
I routinely extract length distributions from this table to make gadget 'catchdistribution' likelihood components, which I think is fine.

I wonder whether the same table can also be used to calculate indices of abundance for specified length groups. The main issue I see is that the number of hauls varies among years, so if I use 'mfdb_sample_count', which sums the number of fish within a length group, I would easily get biased indices. Do you agree?
(1) Do you think a mean rather than the sum would resolve (most of) the problem?
(2) Alternatively, would it be better to extract the CPUE at haul level and fit a GLM-type regression, using the estimated year effect as the index of abundance?

Which approach do you think I should prefer, (1) or (2)?
I suppose other case studies (e.g. Iceland) have data arranged similarly. How do you approach this?
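
To make the concern concrete, here is a minimal Python sketch with made-up haul counts (not DATRAS data; the numbers and the helper name `yearly_indices` are hypothetical). Both years have the same catch per haul, but the second year has twice as many hauls, so the summed index doubles while the mean per haul stays flat.

```python
from statistics import mean

# Hypothetical counts of fish in one length group, one value per haul.
hauls_by_year = {
    2001: [10, 12, 8],
    2002: [10, 12, 8, 10, 12, 8],  # same catch rate, twice as many hauls
}

def yearly_indices(hauls_by_year):
    """Return {year: (sum of counts, mean count per haul)}."""
    return {
        year: (sum(counts), mean(counts))
        for year, counts in hauls_by_year.items()
    }

indices = yearly_indices(hauls_by_year)
for year, (total, per_haul) in sorted(indices.items()):
    print(year, total, per_haul)
```

The mean only removes the effect of the number of hauls; it does not account for hauls of different duration.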

Thank you in advance for any thoughts.

bthe commented

To calculate survey indices from a survey one should use number/biomass divided by area swept. In the Icelandic case the survey indices are built from a standardised survey, i.e. the same number of tow stations, constant tow length (unless the bag is full), etc. So straight aggregation into number per length group "should" be fine in Iceland, as we are only looking for a relative index of abundance/biomass.

A problem occurs with the "Icelandic approach" when, as you describe, the index of abundance/biomass is derived from a) a survey that is not standardised between years, like the one you have, or b) commercial logbooks. In case a) I would suggest standardising the counts by the area swept, but for case b) a GLM from which the year effects are extracted would be sensible.

Case a) should be fairly trivial to implement. Case b) would, I think, be more involved, as one would want to control the independent variables in the regression. Both approaches would be useful to have, in particular linked with the bootstrap mechanics.

Agreed, case a) is what would be needed for the Baltic (and also the North Sea) trawl surveys.
In practice, I would need an mfdb function that calculates the index from data imported with mfdb_import_survey. This function should sum the number (or weight) of fish in each area, time and length group and divide it by the sum of the tow duration, or even better the swept area, in each area, time and length group.
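
The calculation described above could be sketched roughly as follows. This is plain Python over made-up haul records, not an mfdb function; the record layout, the area and length-group labels, and the name `effort_scaled_index` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical haul-level records:
# (area, year, length_group, count, tow_duration_minutes)
records = [
    ("SD25", 2001, "len20-30", 40, 30),
    ("SD25", 2001, "len20-30", 20, 30),
    ("SD25", 2002, "len20-30", 30, 30),
    ("SD25", 2002, "len20-30", 15, 15),
    ("SD25", 2002, "len20-30", 45, 45),
]

def effort_scaled_index(records):
    """sum(count) / sum(tow duration) per (area, year, length group)."""
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [count sum, effort sum]
    for area, year, length_group, count, duration in records:
        key = (area, year, length_group)
        totals[key][0] += count
        totals[key][1] += duration
    # Skip groups without any recorded effort to avoid dividing by zero.
    return {key: c / e for key, (c, e) in totals.items() if e > 0}

index = effort_scaled_index(records)
```

Note that hauls catching no fish in a length group would still need a record with count 0, otherwise their effort is left out of the denominator. Replacing tow duration with swept area only changes the last field of each record.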

@vbartolino I think this already exists. Or at least something along these lines.

One can upload tow information (lat, long, depth, length) with mfdb_import_tow_taxonomy, and then associate samples uploaded with mfdb_import_survey with a given tow.

Then one can use tow_length as an abundance index in mfdb_sample_meanlength and friends. This tries to calculate what you describe, but it'll be sum(count/tow-duration, ...) rather than sum(count, ...) / sum(tow-duration, ...). There isn't an equivalent using weight instead of count either.
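
A small numeric illustration of why the two forms differ, using invented hauls with an identical catch rate (this is a Python sketch of the arithmetic only, not of what mfdb runs internally):

```python
# Hypothetical hauls for one year and length group: (count, tow_duration).
# Every haul has the same catch rate of 1 fish per unit of tow duration.
hauls = [(30, 30), (15, 15), (45, 45)]

# sum(count / tow-duration): grows with the number of hauls.
sum_of_ratios = sum(count / duration for count, duration in hauls)

# sum(count) / sum(tow-duration): stays at the common catch rate.
ratio_of_sums = sum(c for c, _ in hauls) / sum(d for _, d in hauls)

print(sum_of_ratios, ratio_of_sums)
```

With three hauls the first form gives three times the catch rate, so it remains sensitive to the number of hauls per year.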

Unfortunately I didn't add a gap for swept area, but it could be added.

Does this sound useful?

mfdb_sample_meanlength is perfect for calculating, for instance, mean length at age, which serves a CatchStatistics likelihood component in Gadget well, but I'm not sure it can solve the issue here.

By definition mfdb_sample_meanlength doesn't allow specifying a length interval, which is exactly what we would need if we are looking for an index of abundance for a specified length group. I have the feeling that mfdb_sample_count is the closest one, but I think sum(count, ...) / sum(tow-duration, ...) is the required calculation if we want to standardise for unbalanced sampling effort in different years.

Sorry, mentioning mfdb_sample_meanlength confused the issue. I should have said mfdb_sample_count.

To clarify the issue with an example, this is the distribution of hauls from the bottom trawl survey in the Baltic:
[Image: hauls_position]

Looking at the number of hauls per year, it is clear that the sum of the number of fish, or even the sum of the standardised number of fish per haul (i.e. CPUE), would be very biased. In addition to what is written above, an alternative could be to calculate mean(count/tow-duration) rather than sum(count/tow-duration).

Sorry @vbartolino I got halfway through this then got it confused with the other issue. I understand what's going on I think but I was getting excessively fancy trying to make it fit in with what's already there. Will try a separate function that you can use at least for now.

@vbartolino I've (finally) added mfdb_sample_scaled which gives total counts / mean weights, divided by SUM(tow_length). Does this help you along?

As for swept area, we'd need to store h and X_2 in the database before it could be calculated. The main question here is whether they should be a property of the gear or of the tow in the database. I'm tempted to say the latter, since we only store reasonably broad classifications of gear, and presumably these values can vary within similar gear types.

Thanks for the addition. I'll test it asap and give you feedback.
Is h the tow_length? Not sure what x_2 is.

In this case, tow_length should be naturally a tow property.

I was just going off a definition I found online for trawl surveys.

[Image: swept area]

So here D is tow length, and then h and X_2 turn this into a swept area. Course, what you do here probably depends on gear type. But I guess the basic idea is to take a surface area that the gear covers and multiply it by the tow length.
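
As a rough sketch of that idea in Python: the code below assumes the definition referred to is the common textbook one, swept area = D * h * X_2, with h the head-rope length and X_2 the fraction of the head-rope length that corresponds to the width of the swept path. That reading of h and X_2 is an assumption, as are the function name and the example values.

```python
def swept_area(tow_length, head_rope_length, x2):
    """Swept area as D * h * X_2 (assumed definition, see lead-in).

    tow_length       -- D, distance covered by the tow
    head_rope_length -- h, in the same length unit as D
    x2               -- X_2, fraction of h that is the effective path width
    """
    if not 0 < x2 <= 1:
        raise ValueError("x2 is expected to be a fraction in (0, 1]")
    return tow_length * head_rope_length * x2

# Hypothetical tow: 3704 m long, 30 m head rope, X_2 = 0.5.
area_m2 = swept_area(3704.0, 30.0, 0.5)
print(area_m2)
```

Under this reading, tow length varies per tow, while h and X_2 describe the gear as it was rigged for that tow.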

bthe commented

@lentinj when you talk about total counts / mean weights I assume you mean either total counts or total biomass, not division.

In general the tow properties recorded in our surveys are:

  • Weather (wind force and direction, temperature (air, surface and bottom))
  • Sea (current strength and direction)
  • Tow and gear properties (speed, direction, depth, mesh size, horizontal and vertical opening)
  • location in time and space (both at beginning and end of the haul, so diel variations could be accounted for)

I think that the gear opening can be fairly variable, but in general the surveys try to keep it as constant as possible (and we only standardise by tow length). How the gear opening is kept fixed varies between surveys; I know the Norwegians have some straps in the mouth of the gear.

You may find this informative: https://github.com/einarhjorleifsson/smx/blob/master/R/indices_dplyr.R shows how the official survey indices from the Icelandic surveys are calculated.

> @lentinj when you talk about total counts / mean weights I assume you mean either total counts or total biomass, not division.

Yes :)

Not sure there's anything to do here now, closing to tidy up.