NEON-biodiversity/Ostats

Plotting function

Closed this issue · 28 comments

@qdread
Hi Quentin, I have been working on the plotting function based on your code. May I ask you a few questions? The line number refers to the R script in internal_docs.

Line 11: sample size=24. Does it need to be changeable/customizable for the user? Is there a way to make colors more distinct from one another?

Line 25: stat_density() adjust = 2. Does it need to be changeable/customizable?

Line 28 & 29: scale_x_continuous and scale_y continuous: breaks, labels and limits settings need to be removed for the function to generalize over other datasets, (or user can input these as arguments?)

Line 30: labelling of overlap values -- if geom_text(aes(.., x=..., y=...)) is used, the label shifts around when the axes change. I could think of two other ways: one is to use annotation_custom with textGrob (from here,) and the other is to add overlap values to the labeller so that the values appear together with the names of the sites. I tried both but failed. Do you have any suggestions?

When I removed scale_x_continuous and tried plot weight instead of log10(weight) the plotting is not visible. Does this mean I have to input suitable values for breaks, limits, etc?

When I remove limits for scale_x_continuous limits = c(0.5,3), it looks like the tail of the graph goes up. How might this be solved?

Thank you very much for your patience!

@yyue-r @isafluck
In general, I think the best way to proceed with the plotting function is to provide sensible default values that will work "out of the box" for most cases, but also make everything as customizable as possible.

In response to your specific questions:

Line 11: There is nothing special about the colors. I think it would be best to provide a default color scheme using the viridis palette or something similar. I chose sample size 24 for the example plot because there were 24 species in the original dataset for the plot. So I would just keep the colors more or less as is, but allow the user to specify the number of distinct colors in the palette (I used 10 because if you have too many, the eye can't really tell them apart anyway). Then sample that palette with size equal to the number of species so that each species gets a color.

Line 25: Yes it would be a good idea to provide default values for arguments to stat_density() but allow the user to customize them. I would add an ellipsis ... to the arguments of the plotting function and then pass the additional arguments to stat_density(). You can implement this as follows:

plotting function <- function(<the main arguments>, ...) {
  stat_density_args <- list(...)
  <put some code here to use default arguments to stat_density if they aren't provided by ... >
  density_layer <- do.call(stat_density, c(stat_density_args, list(other arguments to stat_density))
  
  ggplot(blah blah blah) +
    density_layer +
    etc etc 
}

If that is confusing, let me know.

Line 28 and 29: Yes, you need to probably supply default limits, labels, and breaks, but also provide a way for the user to input them. The best way to get default x axis limits would be to take the minimum and maximum trait value in the dataset for plotting, and multiply the minimum value by something like 0.5 and the maximum value by 1.5. That should include all the values but include a little extra space so that if the tails of the density polygons stick out a long way past the highest value, it will not get cut off as you noticed when you removed the x axis limits. Maybe 1.5 is not enough, you can try out other numbers.

Line 30: That can probably be fixed by using coordinates for annotating the plot relative to the panel area, rather than using data coordinates. Then the label will always appear in the same location even if the axes change. I think you are on the right track with annotation_custom(). That should work so when I get the chance, I will go back in and try to edit the code to get it to work. Alternatively, we could just leave out the overlap label entirely and let the users add that in themselves if they want to. I might prefer the second option.

Let me know if this answers all the questions!

@qdread Hi Quentin Thank you so much! I will make the changes to the function.

hey! found a way to not have to use the argument "trait name" in the function! it is because the trait name is on the name column of the ostats output (first matrix of the list) already
Now I`m trying to understand an Error in UseMethod("mutate_") ........

@yyue-r , did you include how you joined the community name with its overlap value in your last update of the plot function? I don't find it. I managed to solve the mutate problem now I am running everything again to test with other data. Fingers crossed!

The function worked! Only with this command:
Ostats_plot(indiv_dat = indiv_dat, siteID = siteID, taxonID = taxonID, trait = trait, overlap_dat = overlap_dat)
the traits distribution for each of the communities are plotted with the ostats overlap values in the top left!

#---#
I think we could add an additional condition argument: "sites2use", for those who don't want to plot all the sites and only some of them, What do you think?
Like, if it is null, the function plot all the sites, but if people select some sites in the sites2use" argument, the argument works and the function plot only the sites selected.
Sorry if my programming language is not so clear.

Hello!
I have updated the Ostats_plot function!
For me everything went well, I could plot all the sites and only selected ones just changing the "sites2use" argument. However, if you could try it to see if it works for you I would appreciate it!
I think we have to test with fake data as well before finishing for good...

Further discussion:
1)simple free_y option @yyue-r
2)passing on stat_density arguments and providing the user with the option to set their own x (maybe also y) limits- @yyue-r
3)Sydne suggested adding another function that plots the mean trait measurements as vertical lines for each community

Arya, if you could explain exactly what you mean in the 1 and 2 (about the stat_density) it would be great, so I could improve it into the function. I don't know exactly what to think (like, if it is needed) about the x and y limits, once they follow the limits of the trait values.

I think the 3th item could be imputed into this ostat_plot function

Hi @yyue-r and @isafluck ! I just fixed up the plotting function some, to clean up some redundant code, add the labels with overlap statistic on each panel, and a couple other little things. I wanted to bring a couple things to your attention that you could work on. See commit da4931f.

First of all, if you notice, I put the prefix ggplot2:: in front of all the ggplot2 functions. It would be great if you could go through all the code and add that "namespace prefix" explicitly for all functions that come from a non-base package. That is needed because you can't really call library(ggplot2) from inside a package code. You just need to specify the package every single time you call a function. Annoying, but it's part of writing packages.

Second, I tagged a bunch of stuff with the tag #FIXME. That is a special tag so Rstudio highlights it in green for you! It points out a few places that still need your attention. Mainly we still need to add more options for the user to change labels and other details, but still provide a default value if they don't want to provide a value.

Great work on this! Let me know if you have any questions.

Hello!
@qdread just a question... are these lines
siteID <- indiv_dat$siteID
taxonID <- indiv_dat$taxonID
etc
necessary? My question is, if the user put another name for the argument siteID for example, the function will know that this name refers to the siteID argument (if it is in the right position)?
Ex:
If the user use this:
Ostats_plot<-function(indiv_dat, sitesnames, taxons, trait, overlap_dat, sites2use = NULL, n_col = 3, colorvalues = NULL)

will the function know that "sitesnames" is the siteID argument and "taxons" is the taxonID argument?

Sorry if it is a super simple programming question

@isafluck To answer the second part of the question first, yes, if you give the arguments in the correct order, even if they are not named explicitly, it will work. The function will use the arguments in the order the user puts them in, so yes it would know that the second argument is siteID, and so forth.

As for the first part of the question, the way the function is currently written, you are overwriting any input the user would have put in for siteID and taxonID, so probably you can get rid of those lines.

Nice! I will exclude them.
I'm fixing the stat_density now!

Thank you very much!

Hello all!
I have just updated the function to add an optional free_y scale in facet_wrap (scale_yo argument; but I think the documentation of it needs to be improved), and also add the option to plot the mean of each species trait over the polygons (media argument).
In our meetings, surged the idea to include an option to plot only the means without the polygons. But I think Arya's idea to create a separate function for that is great. In my opinion, this ostats plot function is complete (even though the code still needs to be improved once I think I do a lot of unnecessary commands to get what I want... hehe...). I believe a separate function to plot only the trait means for each species would make the package cleaner. What do you think?

Hi @isafluck thanks for your work! Would you mind putting a few images in this thread showing the output of the plotting function with different options? That would be really great. I can take a look after that.

Hey @isafluck I took a look and here are a couple of things that could be improved on the function.

  • There is a variable called all which should have a different name because all is the name of a function in base R and it is better not to duplicate that name, to avoid conflicts.
  • The argument media says that trait medians are plotted, but the function calculates the mean, not the median. We should decide which to use and stick with it.
  • Line 66 has a for loop that goes from 1 to 23 but it is not guaranteed that there will be 23 species. It should be changed to find the number of unique names and loop through that many times.
  • I wasn't sure what the suffix _o stands for on the argument names adjust_o, limits_xo, alpha_o, scale_o. We should probably just call them adjust, scale, etc.

Again thanks for your work on this!

Hello, Quentin!
*** I fixed the first and second items you cited - sorry for the media thing... is because in Portuguese media is mean, I did not mean "median").

***But when I tried to fix the for, something got weird... when I set for(i in length(unique(taxonID))){.... it accuses that we have 25 species (?). Before it was 23?
I believe I did not modify the raw data because I'm always getting the data from figshare indiv_dat <- read_csv('https://ndownloader.figshare.com/files/9167548', col_names = T)
Look:
These are the 25 species when I run unique(taxonID) (after filtering for the sites sites2use<- c('BART','KONZ','JORN'))
[1] "MYGA" "NAIN" "PELE" "PEMA" "CHPE" "DIME" "DIOR" "DISP" "ONAR" "ONLE" "PEFL" "NEAL" "PGSP"
[14] "SIHI" "MUMU" "NEMI" "PEFA" "CHHI" "MIOC" "REME" "ZAHU" "NEFL" "MIPI" "PESP" "DPSP"

These two last species seem weird to me...is it right?
When I try to run the for step by step (species per species), the mean values appear only until the i<-23 (23th species). When I set i<-24 and i<-25 I get NA for the mean value (like they does not exist).

With these two NA's, the second column with the plotting means does not work.

See the pictures before I tried to fix the for and after
ostats_plot fiz colors and for MEAN OK
ostats_plot fiz colors and for MEAN FAIL

***I'm also trying for the colors of the means and distribution to match, I don't understand why they ate ending up different if I use the same object to specify the colors (taxonID).

***About the suffix, names of arguments etc, I will try with these simpler names when all these bigger problems are resolved, what do you think?

Thank you so much,

Hi @isafluck , It looks like the two additional species are something like PESP for "Peromyscus sp.", an individual that was only identified to the genus level. Those are very rare in the dataset. Probably, the mouse escaped before it could be identified and weighed. So there is only one individual of each of those species codes, and neither of them have a body mass measurement. So there is no mean for those species because all measurements are NA for those species. That is fine if we have some species with only NA measurements. I think the problem was that you removed the NA before calculating the taxon means. So those taxa did not appear in the table. I edited the code so that the NA are removed at the same step as calculating the means. Now all 25 should be there.

Your question about the colors matching is an easy fix. I just replace scale_fill_manual with scale_colour_manual on line 120 because since it is now lines and not polygons, you need to use a color aesthetic, not fill.

See commit 592d2dd . I only made a few tiny changes. But I didn't test them so please test and make sure this fixes it! :-D

Omg, it was just a na.rm = TRUE instead of a na.omit before! And I did not know about these differences between scale_fill and scale_colour... thank you!
I tested the new code (also changed the default of ncol for 1) but the colors still seem not to match... I'll try to fix it!
ostats_plot fixed for, fix colors still

***Ok, I tried a lot of things to make the colors match, but nothing seemed to work! :( I tried to attach each species to a different color in another table inside the function, changed positions, tried to merge one ggplot inside the other for not having to call the colorvalues again, but nothing worked, colors still do not match, as you can see in the image.
I believe I need some extra help here... I'm going to chat with ben on next Wednesday, so maybe he and I can fix this.

ostats plot why colors dont match_2

***I also fixed the suffixes in the function and removed the overlap values from the means plot (because I think it did not has much sense...) and leave the y label only in the first column of plots (distributions), as you can see in the image too.

@isafluck I think the way to do it would be to add names to colorvalues which is passed to the scale_color_manual and scale_fill_manual functions to make sure that the same color matches with the same taxon each time. You would do this by setting the names of colorvalues to be the taxon IDs.

Just include this somewhere before the plot is created:

names(colorvalues) <- unique(taxonID)

It worked!! My face is like "can't believe it was this" and at the same time "uhul!"
Thank you so much @qdread !
Next step: I will format the code according to the neon tutorial. Then I think is done! Do you think the function needs some other argument or other changing?
colors match

Hi @isafluck and @yyue-r,

I've finished my work on the vignette for now, but note that the code chunk for running the plotting function is not yet working (it's commented out right now to allow for knitting). Can you take a look and get that running?

Also, it would be great to have a chunk of code in the vignette that provides an example for the mean plotting function. I've made an empty code chunk where I think that would make sense to insert that you'll see in the .Rmd file for the vignette.

Thanks!
Sydne

Hello @sydnerecord !!
I have updated the example for the ostats_plot in the vignette!
I tested here and It worked.
Just two questions:

  • Do I have to change it in another place as well?
  • Shouldn't we give the user the results of the ostats function? Or just saying that the argument overlap_dat is "#your results from the Ostat function - see \code{\link{Ostats}} " is enough?

Thank you!

Hi @isafluck,

Great! You should add this as an example to the man page for ostats_plot.

I think as long as the overlap_dat is still in the history of the current R session, we don't need to run Ostats again. We'll catch this if it is wrong when we go to knit the vignette.

Thanks!
Sydne

@isafluck
The vignette is almost ready to go out to the larger group, but I need for you to run the code chunk with your example for the plotting function and error check it. As of now, it does not run. To knit the document it needs the Ostats output for input. Sorry if this was not clear in our exchange over Github issues. This example will also need to be updated in the documentations scripts in the man folder.

Also, in the final code chunk I have made a note asking you to put in some code to illustrate how to make a plot with distributions and means - if the current example does not already do that.

One other note, Arya had a change of class schedules, so she can no longer make the meeting on 9/28. Are you ok doing the presentation solo to the group? It should be short (5 minutes) and just give a quick intro to Ostats before we delve into their comments about the package and vignette.

Let me know when the above changes have been made to the vignette and man pages, then I'll send the vignette out to the group shortly thereafter (unless you'd like to). Either way, I'll open the contents of this email as an issue in Github, so we can keep track of things.

Hi @isafluck - Arya made a couple changes to the Ostats_plot example in the vignette, but it still yields this error message: Error in Ostats_plot(indiv_dat = indiv_dat, siteID = siteID, taxonID = taxonID, :
unused arguments (name_x = "Body Weight (log-transformed)", means = TRUE)
I'd like you to take a stab at fixing it, but let Ben or I know if you get stuck and then we can look into it more. This is the last thing that needs to be addressed before the vignette goes out to the team. Thanks!

Hi @isafluck Never mind about the error I mentioned earlier today. I think we've got it sorted out. Arya reminded me that I needed to rebuild Ostats with dev tools. I'll take one last look at stuff then send the vignette out to the team for feedback (likely tomorrow).

Hello @sydnerecord sorry for the delay in solving this problem, I am dealing with the consulate to issue my visa.
Please see if the code for the example looks like this:

#'@examples
#'#load data:
#'#indiv_dat <- read_csv('https://ndownloader.figshare.com/files/9167548', col_names = T)
#'
#'#set the arguments:
#'#overlap_dat <- Ostats_bysite2015 #your results from the Ostat function - see \code{\link{Ostats}}
#'#siteID <- indiv_dat$siteID
#'#taxonID <- indiv_dat$taxonID
#'#trait <- indiv_dat$logweight
#'
#'#to plot only selected sites:
#'#sites2use<- c('BART','KONZ','JORN')
#'
#'#Ostats_plot(indiv_dat = indiv_dat, siteID = siteID, taxonID = taxonID, trait = trait, overlap_dat = overlap_dat, sites2use = sites2use, name_x = 'Body Weight (log-transformed)', means=T)

If it is like that, it should work (after running the ostats function to obtain the Ostats_bysite2015 object which is the result of the ostats function)

@sydnerecord did it work?