Abalone Harvest Analysis

Abalone Image Photo by La La

Requirements

  • R
  • renv

Environment set up

renv::restore()

Introduction

Abalones are an economic and recreational resource that is threatened by a variety of factors which include: pollution, disease, loss of habitat, predation, commercial harvesting, sport fishing and illegal harvesting. Environmental variation and the availability of nutrients affect the growth and maturation rate of abalones. Over the last 20+ years it is estimated the commercial catch of abalone worldwide has declined in the neighborhood of 40%. Abalones are easily over harvested because of slow growth rates and variable reproductive success. Being able to quickly determine the age composition of a regional abalone population would be an important capability. The information so derived could be used to manage harvesting requirements.

The data are derived from an observational study of abalones. The intent of the investigators was to predict the age of abalone from physical measurements thus avoiding the necessity of counting growth rings for aging. Ideally, a growth ring is produced each year of age. Currently, age is determined by drilling the shell and counting the number of shell rings using a microscope. This is a difficult and time consuming process. Ring clarity can be an issue. At the completion of the breeding season sexing abalone can be difficult. Similar difficulties are experienced when trying to determine the sex of immature abalone. The study was not successful. The investigators concluded additional information would be required such as weather patterns and location which affect food availability.

I will analyze the data provided from this study further.

1. Data Preparation


The structure of the data. We can see that there are 1,036 observations (rows) and 8 variables (columns).

## 'data.frame':    1036 obs. of  8 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...

The basic summary of the data with a table of Class vs Rings.

##  SEX         LENGTH           DIAM            HEIGHT          WHOLE        
##  F:326   Min.   : 2.73   Min.   : 1.995   Min.   :0.525   Min.   :  1.625  
##  I:329   1st Qu.: 9.45   1st Qu.: 7.350   1st Qu.:2.415   1st Qu.: 56.484  
##  M:381   Median :11.45   Median : 8.925   Median :2.940   Median :101.344  
##          Mean   :11.08   Mean   : 8.622   Mean   :2.947   Mean   :105.832  
##          3rd Qu.:13.02   3rd Qu.:10.185   3rd Qu.:3.570   3rd Qu.:150.319  
##          Max.   :16.80   Max.   :13.230   Max.   :4.935   Max.   :315.750  
##      SHUCK              RINGS        CLASS        VOLUME       
##  Min.   :  0.5625   Min.   : 3.000   A1:108   Min.   :  3.612  
##  1st Qu.: 23.3006   1st Qu.: 8.000   A2:236   1st Qu.:163.545  
##  Median : 42.5700   Median : 9.000   A3:329   Median :307.363  
##  Mean   : 45.4396   Mean   : 9.993   A4:188   Mean   :326.804  
##  3rd Qu.: 64.2897   3rd Qu.:11.000   A5:175   3rd Qu.:463.264  
##  Max.   :157.0800   Max.   :25.000            Max.   :995.673  
##      RATIO        
##  Min.   :0.06734  
##  1st Qu.:0.12241  
##  Median :0.13914  
##  Mean   :0.14205  
##  3rd Qu.:0.15911  
##  Max.   :0.31176

##     
##        3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
##   A1   9   8  24  67   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   A2   0   0   0   0  91 145   0   0   0   0   0   0   0   0   0   0   0   0
##   A3   0   0   0   0   0   0 182 147   0   0   0   0   0   0   0   0   0   0
##   A4   0   0   0   0   0   0   0   0 125  63   0   0   0   0   0   0   0   0
##   A5   0   0   0   0   0   0   0   0   0   0  48  35  27  15  13   8   8   6
##     
##       21  22  23  24  25
##   A1   0   0   0   0   0
##   A2   0   0   0   0   0
##   A3   0   0   0   0   0
##   A4   0   0   0   0   0
##   A5   4   1   7   2   1

The variables being discussed here is SEX and CLASS. SEX contains male, female, and infant. CLASS are ordered age classification based on RINGS (A1=youngest, and A6= oldest). Overall, the distribution seems like it’s following a normal distribution, but with a slight left skewness. Potential outliers may be abalones that are considered infants in the A4 and A5 groups. Since A4 and A5 are older, their sex should have been identified and it could be an error in the data.

##      
##         A1   A2   A3   A4   A5  Sum
##   F      5   41  121   82   77  326
##   I     91  133   65   21   19  329
##   M     12   62  143   85   79  381
##   Sum  108  236  329  188  175 1036

Female and male abalones are left skewed and infant abalones are right skewed in the ordered class. This is most likely because it is harder to identify the sex of younger abalones. The number of infant abalones decrease as the age groups increase because the sex are identified in older groups.


Both plots have a positive correlation to whole. As age groups increase, whole increases and the y variable of volume and shuck increase as well. The first plot has A3-A5 scattered around the middle and top right part of the graph. This behavior seems random without a clear trend. However, for the second plot, we can observe certain trends. The younger groups such as A1-A3 are much closer to the maximum line compared to the older groups such A4-A5. It seems like as abalones age, the ratio of shuck to whole starts to decrease. This is an interesting finding because the first plot is a visualization of volume vs whole. Even though the volume may increase as abalones age, that doesn’t mean the shuck will increase as well.


All three sexes are close to a normality. However, all 3 sexes seem to be right skewed as we can observe from the histogram and qqplot. The boxplot shows there are outliers in all 3 sexes above the upper extreme.

SEX LENGTH DIAM HEIGHT WHOLE SHUCK RINGS CLASS VOLUME RATIO
3 I 10.080 7.350 2.205 79.37500 44.00000 6 A1 163.364040 0.2693371
37 I 4.305 3.255 0.945 6.18750 2.93750 3 A1 13.242072 0.2218308
42 I 2.835 2.730 0.840 3.62500 1.56250 4 A1 6.501222 0.2403394
58 I 6.720 4.305 1.680 22.62500 11.00000 5 A1 48.601728 0.2263294
67 I 5.040 3.675 0.945 9.65625 3.93750 5 A1 17.503290 0.2249577
89 I 3.360 2.310 0.525 2.43750 0.93750 4 A1 4.074840 0.2300704
105 I 6.930 4.725 1.575 23.37500 11.81250 7 A2 51.572194 0.2290478
200 I 9.135 6.300 2.520 74.56250 32.37500 8 A2 145.027260 0.2232339
350 F 7.980 6.720 2.415 80.93750 40.37500 7 A2 129.505824 0.3117620
379 F 15.330 11.970 3.465 252.06250 134.89812 10 A3 635.827846 0.2121614
420 F 11.550 7.980 3.465 150.62500 68.55375 10 A3 319.365585 0.2146560
421 F 13.125 10.290 2.310 142.00000 66.47062 9 A3 311.979938 0.2130606
458 F 11.445 8.085 3.150 139.81250 68.49062 9 A3 291.478399 0.2349767
586 F 12.180 9.450 4.935 133.87500 38.25000 14 A5 568.023435 0.0673388
746 M 13.440 10.815 1.680 130.25000 63.73125 10 A3 244.194048 0.2609861
754 M 10.500 7.770 3.150 132.68750 61.13250 9 A3 256.992750 0.2378764
803 M 10.710 8.610 3.255 160.31250 70.41375 9 A3 300.153640 0.2345924
810 M 12.285 9.870 3.465 176.12500 99.00000 10 A3 420.141472 0.2356349
852 M 11.550 8.820 3.360 167.56250 78.27187 10 A3 342.286560 0.2286735

The majority of the outliers have a ratio above 0.22. There was 1 observation that had a ratio of 0.06 which made it as an outlier in the lower extreme. Also, there were no outliers for abalones in the A4 class.


Since there is a positive correlation between the variables, I think it would do fairly well to predict the age. However, as age approaches the A4 & A5 class, it would be difficult to classify because both groups have similar boxplot stats for volume and whole.


A1 A2 A3 A4 A5
F 255.29938 276.8573 412.6079 498.0489 486.1525
I 66.51618 160.3200 270.7406 316.4129 318.6930
M 103.72320 245.3857 358.1181 442.6155 440.2074
A1 A2 A3 A4 A5
F 38.90000 42.50305 59.69121 69.05161 59.17076
I 10.11332 23.41024 37.17969 39.85369 36.47047
M 16.39583 38.33855 52.96933 61.42726 55.02762
A1 A2 A3 A4 A5
F 0.1546644 0.1554605 0.1450304 0.1379609 0.1233605
I 0.1569554 0.1475600 0.1372256 0.1244413 0.1167649
M 0.1512698 0.1564017 0.1462123 0.1364881 0.1262089

It seems volume and shuck follow a similar trend. As age increases, the volume and shuck increases as well. The females have the highest average volume and shuck in any age group for these 2 variables. However, when it comes to ratio, it follows a different trend. As age increases, the average ratio decreases. Also, the lines intersect frequently. In A1 groups, the infants have the highest ratio. After A2, the female and male keeps intersecting. I wonder why the trend for the ratio are inconsistent compared to volume and shuck. It makes me even wonder if some of the abalones were correctly classified because there are inconsistencies in the ratio.

We can suggest that as abalones age, the volume and whole increase. Comparing the infant and adults, we can observe that the adults have more variability as rings increases. The infants have a relatively small variance compared to the adults.


The biggest reason that the original study failed was because it was not possible to completely predict the age of the abalone from the physical measurements. Specifically, the age group of A4-A5 had identical sizes which was difficult to differentiate. For the age groups in early ages such as A1-A3, it may be possibly to predict the age with the physical appearances.

I would inquire if there are certain groups that need to be considered. For example, for studies related to health, age is an extremely important variable. During the COVID pandemic, the Simpson’s paradox became a topic of concern while analyzing the statistics of fatality rate and vaccine effectiveness. I would inquire about any demographic information as possible.

One of the most difficult elements of observational studies is that the sample may not represent the population. For example, even though a study used stratified sampling methods to gather data, the same results may not hold true if the data was collected in an entirely different environment (e.g., country, time of day/year, etc.). With observational studies, correlation can possibly be determined, but not causality. Determining causality would be extremely difficult and it will require domain knowledge and critical thinking.

2. Data Analysis


The structure of the data. We can see that there are 1,036 observations (rows) and 10 variables (columns).

## 'data.frame':    1036 obs. of  10 variables:
##  $ SEX   : Factor w/ 3 levels "F","I","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LENGTH: num  5.57 3.67 10.08 4.09 6.93 ...
##  $ DIAM  : num  4.09 2.62 7.35 3.15 4.83 ...
##  $ HEIGHT: num  1.26 0.84 2.205 0.945 1.785 ...
##  $ WHOLE : num  11.5 3.5 79.38 4.69 21.19 ...
##  $ SHUCK : num  4.31 1.19 44 2.25 9.88 ...
##  $ RINGS : int  6 4 6 3 6 6 5 6 5 6 ...
##  $ CLASS : Factor w/ 5 levels "A1","A2","A3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOLUME: num  28.7 8.1 163.4 12.2 59.7 ...
##  $ RATIO : num  0.15 0.147 0.269 0.185 0.165 ...

## skewness: 0.714705598826372

## kurtosis: 4.66729775944278

## excess kurtosis: 1.66729775944278

## skewness: -0.0939154815619048

## kurtosis: 3.535430948318

## excess kurtosis: 0.535430948318001

## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$RATIO by mydata$CLASS
## Bartlett's K-squared = 21.49, df = 4, p-value = 0.0002531

## 
##  Bartlett test of homogeneity of variances
## 
## data:  mydata$L_RATIO by mydata$CLASS
## Bartlett's K-squared = 3.1891, df = 4, p-value = 0.5267

L_RATIO (log ratio) exhibits a better conformance to a normal distribution with homogeneous variances across age. The skewness is close to 0 which indicates it’s a symmetrical distribution. The kurtosis is also close to 3 which indicates it is mesokurtic and resembles a normal distribution peak. We can visually see this with the histogram and qqplot as well. We can also see that the L_RATIO have homogeneous variances across ages with the Barlett test. Since the p-value is much greater than 0.05, we fail to reject the null hypothesis of having identical variances. On the other hand, the RATIO has p-value of 0.0002531 for the Barlett test which indicates that the variances are not homogeneous across ages.

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.370 < 2e-16 ***
## SEX            2  0.091 0.04569   6.644 0.00136 ** 
## CLASS:SEX      8  0.027 0.00334   0.485 0.86709    
## Residuals   1021  7.021 0.00688                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##               Df Sum Sq Mean Sq F value  Pr(>F)    
## CLASS          4  1.055 0.26384  38.524 < 2e-16 ***
## SEX            2  0.091 0.04569   6.671 0.00132 ** 
## Residuals   1029  7.047 0.00685                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

While we get statistically significant results with CLASS and SEX independently, that is not the case when we factor CLASS and SEX. We can say there is an interaction effect between these two independent variables. Since both of these variables affects the outcome of L_RATIO, we inevitably get a non-significant result when these two are interacting.

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L_RATIO ~ CLASS + SEX, data = mydata)
## 
## $CLASS
##              diff         lwr          upr     p adj
## A2-A1 -0.01248831 -0.03876038  0.013783756 0.6919456
## A3-A1 -0.03426008 -0.05933928 -0.009180867 0.0018630
## A4-A1 -0.05863763 -0.08594237 -0.031332896 0.0000001
## A5-A1 -0.09997200 -0.12764430 -0.072299703 0.0000000
## A3-A2 -0.02177176 -0.04106269 -0.002480831 0.0178413
## A4-A2 -0.04614932 -0.06825638 -0.024042262 0.0000002
## A5-A2 -0.08748369 -0.11004316 -0.064924223 0.0000000
## A4-A3 -0.02437756 -0.04505283 -0.003702280 0.0114638
## A5-A3 -0.06571193 -0.08687025 -0.044553605 0.0000000
## A5-A4 -0.04133437 -0.06508845 -0.017580286 0.0000223
## 
## $SEX
##             diff          lwr           upr     p adj
## I-F -0.015890329 -0.031069561 -0.0007110968 0.0376673
## M-F  0.002069057 -0.012585555  0.0167236691 0.9412689
## M-I  0.017959386  0.003340824  0.0325779478 0.0111881

We can identify many statistically significant results from different age groups. The only non-significant result we observe was from A2-A1. This indicates that as the abolone’s age group change (besides from A1 to A2), the L_RATIO will change as well. Male and female abalones can be combined into a single category because both have significant differences from infants and we have an extremely high adjusted p-value which suggests there is no difference between male and female.

## 
## ADULT     I 
##   707   329

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Infants have a right skewed distribution compared to adults. It would be hard to separate infants and adults around the lower volume areas, since there are a lot of overlaps in the two distributions (especially volumes under 300). For larger volumes, it may be easier to separate infants and adults. For volumes over 500, most of the abalones tend to be adults.

The log-transformation reduces the variability of the relationship between volume and shuck we can make a better distinction between groups. We can see that there is a positive correlation between volume and shuck. For class levels, we can observe that the younger age groups, especially the A1 groups, are separated from the rest. We can observe that in general, when age groups increase, the shuck and volume increase as well. This observation is replicated in the type level. Most of the infants are scattered around the lower end of the volume and shuck. The adults are scattered around the higher end of the volume and shuck.

## 
## ADULT     I 
##   747   289

## 
## Call:
## lm(formula = L_SHUCK ~ L_VOLUME + CLASS + TYPE, data = mydata)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.270634 -0.054287  0.000159  0.055986  0.309718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.796418   0.021718 -36.672  < 2e-16 ***
## L_VOLUME     0.999303   0.010262  97.377  < 2e-16 ***
## CLASSA2     -0.018005   0.011005  -1.636 0.102124    
## CLASSA3     -0.047310   0.012474  -3.793 0.000158 ***
## CLASSA4     -0.075782   0.014056  -5.391 8.67e-08 ***
## CLASSA5     -0.117119   0.014131  -8.288 3.56e-16 ***
## TYPEI       -0.021093   0.007688  -2.744 0.006180 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08297 on 1029 degrees of freedom
## Multiple R-squared:  0.9504, Adjusted R-squared:  0.9501 
## F-statistic:  3287 on 6 and 1029 DF,  p-value: < 2.2e-16

The summary here shows the reference level of class as A1 and type as adult. The coefficients for classes A2-A5 are the differences associated with A1 and adult. We can imply that the classes A1-A5 are highly correlated variables. When we plotted the data in the scatter plot earlier, it showed positive correlation for volume and shuck for all classes A1-A5 independently. However, when we fit the model and considered all the predictor variables together, we can see that the classes A2-A5 have a negative effect. The coefficients shows that if the data belongs to a class in A2-A5 and holding every other predictors constant, it will have a negative effect when the reference class is A1 and reference type is adult.

Compared to class, type is not as an important predictor for shuck. The coefficient are much smaller compared to most of the other classes. However, it is important to keep in mind that the dependent variable is in log 10 scale. This means that even though it may be a small coefficient, it still can influence the results greatly.


## skewness: -0.0594523446253668

## kurtosis: 3.34330818394132

## excess kurtosis: 0.343308183941319

## 
##  Bartlett test of homogeneity of variances
## 
## data:  tempdf$residuals by tempdf$CLASS
## Bartlett's K-squared = 3.6882, df = 4, p-value = 0.4498

## 
##  Bartlett test of homogeneity of variances
## 
## data:  tempdf$residuals by tempdf$TYPE
## Bartlett's K-squared = 0.31156, df = 1, p-value = 0.5767

With the above calculations, we checked the assumptions of linear regression. We checked if the residuals were normally distributed and tested the homogeneity of variance of the residuals across classes. Both of these assumptions were met. We can also observe that volume and shuck have a linear relationship from 3(b) and we know each abalones are independent from each other. Therefore, this model indeed fit the dataset and it may be useful for harvesting decisions.


Harvest Strategy:

There is a tradeoff faced in managing abalone harvest. The infant population must be protected since it represents future harvests. On the other hand, the harvest should be designed to be efficient with a yield to justify the effort. This assignment will use VOLUME to form binary decision rules to guide harvesting. If VOLUME is below a “cutoff” (i.e. a specified volume), that individual will not be harvested. If above, it will be harvested. Different rules are possible.The Management needs to make a decision to implement 1 rule that meets the business goal.


## [1] 526.6383

## [1] 0.2476573

## [1] 0

## [1] 133.8214

## [1] 0.4982699

## [1] 0.9330656

## [1] 384.5584

## [1] 0.02422145

## [1] 0.4993307

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

We can clearly observe a difference in median between infants and adults. There will be a trade off of trying to preserve as much infants versus trying to harvest as much adults as possible. Since we want to save as much infants as possible, we don’t want to choose the median infant point as the cut off. This would harvest roughly half of all infants which would have a negative effect on future harvests. Choosing the median adult point as the cut off would be the better option because it would still save roughly 97.6% of all infants while harvesting half of all adults. There needs to be a business decision made based on the choosing the right cut off point for the business.


More harvest strategies:

This part will address the determination of a cutoff volume.value corresponding to the observed maximum difference in harvest percentages of adults and infants. In other words, we want to find the volume value such that the vertical distance between the infant curve and the adult curve is maximum. To calculate this result, the vectors of proportions from item (6) must be used. These proportions must be converted from “not harvested” to “harvested” proportions by using (1 - prop.infants) for infants, and (1 - prop.adults) for adults. The reason the proportion for infants drops sooner than adults is that infants are maturing and becoming adults with larger volumes.


## [1] 0.7416332

## [1] 0.7416332

## [1] 0.1764706

## [1] 0.1764706

## [1] 0.2871972

## [1] 0.8259705

## [1] 0.8666894

## [1] 0.246

## [1] 0.1785714

##                      Volume     TPR     FPR totalHarvest
## Protect all infants 526.600   0.246   0.000        0.180
## Median infants      133.800   0.933   0.498        0.813
## Median adults       384.600   0.499   0.024        0.367
## Max difference      262.100   0.742   0.176        0.584
## Zero A1 infants     206.800   0.826   0.287        0.676

There are various thresholds depending on the cutoff volume and there are trade offs. Protect all infants is the most conservative threshold with only harvesting volumes that are equal or greater than 526.8. Because this method is protecting all infants, the FPR is 0, but TPR is limited to 0.246 as well. This will leave out many adult abalones that should have been harvested in order to protect all infants. The median infants, on the other hand, have a very aggressive cutoff volume at 133.8. This method has a high TPR in 0.933, but has a FPR of 0.498. This method harvests roughly half of infant abalones that shouldn’t have been harvested in order to harvest more adults. The median adults, max difference, and zero A1 infants are in between the 2 extreme cutoff points. The business needs to make a decision based on the trade offs of harvesting the abalones now or keeping them for the future.