ropensci/pathviewr

Automation of `max_frame_gap` argument

Closed this issue · 10 comments

https://github.com/vbaliga/pathviewR/blob/f4a043e75cb22e993aebc82790ecd730604b5439/R/utility_functions.R#L1029

I made some changes to how the max_frame_gap argument of separate_trajectories() behaves. The gist:

  • numeric values (e.g. max_frame_gap = 1) can be used and the function basically behaves as it always has. One minor thing: if this value is set too high (e.g. max_frame_gap = 10000000000), then the function internally resets it to the actual maximum frame gap found within the data.
  • alternatively, you can set max_frame_gap = "autodetect" to ask the function to guesstimate this for you.
    • The autodetect procedure, right now, is to generate a list of all frame gaps (on a per-subject basis; I can explain why later) and then filter out all cases where the frame gap is 1 (i.e. things are working normally). From the remaining frame gaps, we then take the median value as a rough estimate of what a max_frame_gap value could be.
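
For anyone who wants to poke at the median idea outside the package, here is a minimal sketch of the procedure described above. It is not the code inside separate_trajectories(); the helper name `median_frame_gap()` and the toy vectors are made up for illustration, and it assumes `frame` and `subject` are equal-length vectors.

```r
# Hypothetical helper (not pathviewR internals): per-subject median of the
# frame gaps that are larger than 1.
median_frame_gap <- function(frame, subject) {
  vapply(
    split(frame, subject),
    function(f) {
      gaps <- diff(sort(f))   # gap between consecutive observed frames
      gaps <- gaps[gaps > 1]  # drop gaps of 1 (no frames were dropped)
      # Assumption: fall back to 1 when a subject never drops a frame
      if (length(gaps) == 0) 1 else stats::median(gaps)
    },
    numeric(1)
  )
}

# Toy data: bird01 has gaps of 3 and 5, bird02 has a single gap of 2
frames   <- c(1, 2, 3, 6, 7, 12, 13, 1, 2, 4, 5)
subjects <- c(rep("bird01", 7), rep("bird02", 4))
median_frame_gap(frames, subjects)
# bird01 bird02
#      4      2
```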

That said, I am not sold on the idea of the median observed frame gap being the best value to use.

I'm very enthusiastic about keeping Melissa's determine_fg_M() function in pathviewR (if she's cool with that). At the very least, it is a very useful function for a user to be able to visualize how things change with frame gap choice and not need to blindly trust what we may automate within separate_trajectories().

But I also think that in addition to keeping this standalone plotting function, some of determine_fg_M()'s internals can be added into separate_trajectories(). Automating the max_frame_gap value to be where the "elbow" of the plot is seems like a better way to go vs. choosing the median as I mentioned above.

To that end, I think something like changepoint detection could work. There's even a changepoint package that could be useful here. I have a little experience playing around with this; I'll leave this piece as an example.

Curious to see what you guys think. Also please scream if any changes I made today break stuff for you!

Cool! I really like the idea of using the "elbow" of determine_fg for the "autodetect" in max_frame_gap with the option to override if you think it should be otherwise. I also agree that the max_frame_gap should be applied per-subject and not to an entire file. At least with my setup, the quality of tracking depends on the quality of each bird's head device, and that quality varied wildly depending on how long the device was on the bird (some only lasted a couple of days, others an entire month, so the tracking was terrible by the end, i.e. many dropped frames for that individual).

I also want to add an output to get_full_trajectories so the user knows how many trajectories they are "losing" at that step--I'm thinking of adding it to the metadata in attributes unless you think it should go somewhere else? I'll get it working the way I'm thinking in ZFVG first and see what you think before sending it over to pathviewR.

> I really like the idea of using the "elbow" of determine_fg for the "autodetect" in max_frame_gap with the option to override if you think it should be otherwise.

Yup! Very much agree. The thing I anticipate having difficulty coding is the case where you'd like to override autodetect. Would it still be OK to provide a single number which is then applied across subjects? It might be a little harder to ask the user to provide a list of frame gaps (one per subject). Then again, I have seen other functions do things like: argument = "manual", argument_options = c(x, y, z), where x, y, and z would be the max_frame_gap values specific to each of 3 subjects. The slight issue there is we'd also have to institute a check that the argument_options (or whatever we'd name it) is the same length as the number of subjects in the data. That all said, do you guys have any other ideas of how to handle all this?
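
To make the length check concrete, here is a rough sketch of one way a single value or a per-subject vector could both be accepted. The function name `resolve_frame_gaps()` and its arguments are placeholders, not anything that exists in pathviewR.

```r
# Hypothetical helper: accept either one frame gap (recycled across subjects)
# or exactly one frame gap per subject.
resolve_frame_gaps <- function(subjects, max_frame_gap) {
  subject_names <- unique(subjects)
  if (!is.numeric(max_frame_gap)) {
    stop("max_frame_gap must be numeric in this sketch")
  }
  if (length(max_frame_gap) == 1) {
    # A single number is applied to every subject
    max_frame_gap <- rep(max_frame_gap, length(subject_names))
  } else if (length(max_frame_gap) != length(subject_names)) {
    stop(
      "Expected 1 or ", length(subject_names),
      " frame gap values, but got ", length(max_frame_gap)
    )
  }
  # Name each value by the subject it applies to
  stats::setNames(max_frame_gap, subject_names)
}

subjects <- c("bird01", "bird01", "bird02", "bird03")
resolve_frame_gaps(subjects, 2)           # same gap for all three birds
resolve_frame_gaps(subjects, c(1, 3, 5))  # one gap per bird
```

Passing a vector of the wrong length (e.g. `c(1, 2)` for three subjects) stops with an error rather than silently recycling.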

I think at some point today, I'll modify the steps within the autodetect to find the elbow spot when trajectories from all subjects are pooled. Then, when we decide how to specifically code things for a per-subject basis, we can modify this further.

> I also want to add an output to get_full_trajectories so the user knows how many trajectories they are "losing" at that step--I'm thinking of adding it to the metadata in attributes unless you think it should go somewhere else?

Yeah I like this, too! Definitely agree it should go in the attributes.

I also like the idea of using the elbow, but I wonder if there will be an elbow to select in every case? I don't know anything about change points, but I wonder if it would work well on a smooth curve resembling f(x) = 1/sqrt(x)?
If we were to automate all of it, could a change point function be run on "elbow plots" drawn for each bird?

> I also want to add an output to get_full_trajectories so the user knows how many trajectories they are "losing" at that step--I'm thinking of adding it to the metadata in attributes unless you think it should go somewhere else?
>
> Yeah I like this, too! Definitely agree it should go in the attributes.

Just added this to pathviewR
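
For readers following along, the general pattern of recording "how many trajectories were dropped" in an attribute might look something like the sketch below. This is a toy stand-in, not get_full_trajectories() itself; the function name, the `min_rows` rule, and the attribute name `trajectories_removed` are all assumptions for illustration.

```r
# Toy stand-in for a filtering step (NOT get_full_trajectories()):
# keep trajectories with at least `min_rows` rows and record how many were dropped.
keep_long_trajectories <- function(df, min_rows) {
  rows_per_traj <- table(df$traj_id)
  keep_ids <- names(rows_per_traj)[rows_per_traj >= min_rows]
  out <- df[df$traj_id %in% keep_ids, , drop = FALSE]
  # Hypothetical attribute name
  attr(out, "trajectories_removed") <- length(rows_per_traj) - length(keep_ids)
  out
}

toy <- data.frame(traj_id = c("a", "a", "a", "b", "c", "c"), x = 1:6)
filtered <- keep_long_trajectories(toy, min_rows = 2)
attr(filtered, "trajectories_removed")  # trajectory "b" was dropped, so 1
```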

OK, I made big changes to separate_trajectories()'s "autodetect" behavior in ddada72

When autodetecting, the function now internally generates a plot of trajectory counts over a range of frame gaps (a la determine_fg_M()). The range of frame gaps runs from 1 to (0.25 * framerate), as I felt that was a reasonable upper bound. All of this is done collectively over all data: pooling all data from all subjects (can be split up later, but getting to this point was hard enough). The "elbow" of the plot is then found by drawing an imaginary line between the first and final points on that graph and then finding the distance of each point to that line. The frame gap value that maximizes that distance is the one at the "elbow" point, and the max_frame_gap value is thereafter set to that value. If you'd like a visualization of how this all looks, I got the inspiration from the accepted solution in this stackoverflow post.
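
A minimal sketch of the distance-to-line idea described above, for anyone who wants to try it on their own curve. The helper name `find_elbow()` and the toy counts are invented here; the sketch also assumes no rescaling of the axes, which may differ from what separate_trajectories() does internally.

```r
# Hypothetical helper: return the x value whose point lies farthest from the
# straight line joining the first and last points of the curve.
find_elbow <- function(x, y) {
  n <- length(x)
  # Direction of the line joining the first and last points
  dx <- x[n] - x[1]
  dy <- y[n] - y[1]
  # Perpendicular distance of every point to that line
  dists <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)
  x[which.max(dists)]
}

# Toy curve: trajectory counts that drop quickly and then level off
frame_gaps  <- 1:10
traj_counts <- c(100, 60, 35, 20, 15, 13, 12, 11, 11, 10)
find_elbow(frame_gaps, traj_counts)  # 4
```

Because the reference line is anchored at the last point, changing the range of frame gaps moves the line and can move the elbow with it.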

I also renamed determine_fg_M() to visualize_frame_gap_choice() and made edits within that function to add the same elbow distance calculation. A black vertical line is now drawn at the frame gap choice that is the most elbow-y.

It is also worth noting that the range of frame gaps considered has heavy influence over where the elbow specifically is. An easy way to see this is via visualize_frame_gap_choice(); from pathviewR_pancakes:333 :

## Complementary visualization function, adapted from Melissa's determine_fg_M()
visualize_frame_gap_choice(jul_29_selected, loops = 25)
## Note that you'll get a different answer if a different loop length is used:
visualize_frame_gap_choice(jul_29_selected, loops = 20)
visualize_frame_gap_choice(jul_29_selected, loops = 50)

This bothers me a little bit. Maybe we can find a better way to go about this?

Ah yes, now I recall this problem from writing determine_fg_M() in the first place. I think this is why I kept coming back to needing an additional way to make this decision. Some options:

  • Set an upper limit based on distance. E.g. 1 frame = ~2 cm of tunnel length, so if a 20 cm gap seems too large when we're only analyzing, say, 50% of the tunnel (~1 m), then the upper limit should be a frame gap of 10? You could use an average velocity to determine what distance you're missing when you lose 1 frame.
  • Or maybe another way to think about limiting it based on distance: you can't have an upper limit higher than x% of the total length you're analyzing. Like, it wouldn't make sense, if you're only analyzing 1 m of tunnel (after selecting 50%), to have a frame gap of 90-100 cm, so we could set the upper limit at, say, 25% of the total analyzed length (1 m) and then determine the number of frames equivalent to that distance? Though your default of 0.25 * frame rate is essentially already doing that.
  • Or build a giant parameter space plot of frame gap vs. span vs. select x percent from ALL data to help decide optimal ranges for all 3 of those arguments, or find any other logical way of guiding frame gap decision making.

It's also possible that once the data isn't pooled but is divided per subject (as mentioned earlier), each individual's ideal frame gap will be clearer regardless of the upper limit? I dunno if that's true, but my sense is that for birds with good devices/good tracking, the upper limit will matter less than for birds with bad devices/bad tracking... and if setting the upper limit pretty low essentially filters out those bad birds/tracking, that's not a bad thing (how many times can I say bad in one sentence)?
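
The distance-based arithmetic above can be sketched as a tiny helper. The function name and arguments are hypothetical; the numbers are the rough ones from this comment (about 2 cm travelled per frame, 1 m of tunnel analyzed, gaps capped at 25% of that length).

```r
# Hypothetical helper: convert a distance budget into a frame gap upper limit.
# cm_per_frame could be estimated as mean velocity (cm/s) divided by frame rate.
frame_gap_limit_from_distance <- function(analyzed_length_cm,
                                          max_proportion,
                                          cm_per_frame) {
  # Largest whole number of frames whose distance stays within the budget
  floor(analyzed_length_cm * max_proportion / cm_per_frame)
}

frame_gap_limit_from_distance(
  analyzed_length_cm = 100,
  max_proportion = 0.25,
  cm_per_frame = 2
)
# 12
```

Since velocity varies within and across flights, any single cm_per_frame value is only an average, so the resulting limit is a rough guide rather than an exact distance cap.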

I don't think I've helped at all... but these are at least the kinds of things I've been thinking RE this particular problem and my particular data set. Not sure how useful these would be for other applications/data sets?

This definitely helps, and I'm getting the sense that we might benefit from meeting and plotting some of this stuff together in real time. Perhaps we should touch base on Monday and see what our availability looks like for that week?

In the meantime, here's some thoughts:

  • When I next get a chance to work in pathviewR, I'll see about implementing a per-subject scheme of autodetecting frame gaps. I don't think it will be all that hard -- I just ran out of steam last night, as getting the function to that point took some time.
  • I agree that we should consider what metric we are trying to minimize. My motivation for doing 0.25 * frame rate was to see if we could minimize the loss of total number of frames (or total time). The 0.25 is definitely arbitrary, but I think if we're gonna go this route we should ensure that we have << 1-sec total loss of info. But we could just make that decision part of an optional argument and let the user decide.
  • A distance-based metric could make sense too. It would get complicated as velocity varies (the frame gap lengths would necessarily be inconsistent), but setting the gap so that on average we lose no more than e.g. 5% of the tunnel length (or selected length) could be a great way to go.
  • Another way to frame it is that minimizing the total number of frames lost is perhaps best if it is observed that the recording software blips out at consistent intervals. Minimizing the total distance lost could work best if there is a possibility of an external object obscuring the view of subjects in a consistent way.
  • Another thing is maybe looking for the "elbow" isn't the best practice? We might consider assessing ways of determining if the trajectory count vs. frame gap curve shows asymptotic behavior. Then again, I expect that the point at which it flattens out would be pretty closely estimated by the elbow point anyway.
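
As one possible way to explore the "where does it flatten out" question from the last bullet, here is a rough sketch. It assumes a new trajectory starts whenever the gap between consecutive frames exceeds max_frame_gap (which may not match separate_trajectories() exactly), and the plateau rule itself is just a made-up heuristic: the smallest frame gap whose trajectory count is within `tol` of the count at the largest gap considered.

```r
# Assumption: a new trajectory begins whenever consecutive frames are more
# than `max_frame_gap` apart.
count_trajectories <- function(frame, max_frame_gap) {
  1 + sum(diff(sort(frame)) > max_frame_gap)
}

# Hypothetical plateau rule: smallest candidate gap whose count is within
# `tol` trajectories of the count at the largest candidate gap.
find_plateau <- function(frame, max_gaps, tol = 0) {
  counts <- vapply(max_gaps, function(g) count_trajectories(frame, g), numeric(1))
  final_count <- counts[length(counts)]
  max_gaps[which(abs(counts - final_count) <= tol)[1]]
}

# Toy subject with dropped-frame gaps of 3, 8, and 35
frames <- c(1:5, 8:12, 20:25, 60:70)
count_trajectories(frames, 1)  # 4 trajectories
find_plateau(frames, 1:10)     # 8
```

Like the elbow, this still depends on the largest frame gap considered, so it does not remove the need to pick a sensible upper bound.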

OK, "autodetect" now does things on a per-subject basis as of 57b67b5

The max_frame_gaps still end up at the elbows of the curves, as I had it before. Again, happy to refine all this after we get a chance to discuss best practices. But we can simply swap out the method of determining max_frame_gap and we should be good to go.

Some example code from pancakes:317:

## Splits the data by subject and computes a max_frame_gap for each subject
jul_29_labeled_autodetect <-
  jul_29_selected %>% separate_trajectories(max_frame_gap = "autodetect")
plot(jul_29_labeled_autodetect$position_length,
     jul_29_labeled_autodetect$position_width,
     asp = 1, col = as.factor(jul_29_labeled_autodetect$traj_id))

Frame gap values are reported in attributes:

  attr(jul_29_labeled_autodetect, "max_frame_gap")

And two new features!!!

  1. Use frame_gap_messaging = TRUE to get reports of selected frame gaps
jul_29_labeled_autodetect <-
  jul_29_selected %>% separate_trajectories(max_frame_gap = "autodetect",
                                            frame_gap_messaging = TRUE)
  2. Use frame_gap_plotting = TRUE to get elbow plots! One per subject
jul_29_labeled_autodetect <-
  jul_29_selected %>% separate_trajectories(max_frame_gap = "autodetect",
                                            frame_gap_plotting = TRUE)

Have fun playing around with it and let me know if anything looks odd! But in the meantime have a good weekend!

Some notes based on our meeting and newly-implemented changes:

  • Determining max_frame_gap on a per-subject basis helped out a lot
  • We landed on the idea that faulty subjects (e.g. birds with frame gaps > 10) may be more easily handled by removing them after the get_full_trajectories() step via Melissa's removal function rmbird_byflightnum(). This removal function will likely be edited to allow removal of subjects with high frame gaps in addition to subject-treatment combos that are not well-represented.
  • As of 79a4ebc, a frame_rate_proportion argument (default = 0.1) has been implemented. This value is multiplied by the frame rate to get an upper bound on what the maximum frame gap could be.
  • As of 132058e, separate_trajectories() no longer uses attr(obj_name, "subject_names_simple") internally but rather unique(obj_name$subject) to determine the number and identity of subjects. This, I hope, will be more friendly to batch-analyses, but at a later point we might want to develop functions that specialize on analyzing files in batch.
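
To illustrate the last two notes, here is a small sketch of how the upper bound and the subject list could be derived. This is not the code in separate_trajectories(); the helper name is hypothetical, and rounding to the nearest whole frame (with a floor of 1) is an assumption.

```r
# Hypothetical helper: candidate frame gaps from 1 up to
# frame_rate * frame_rate_proportion (rounded; assumed behavior).
candidate_frame_gaps <- function(frame_rate, frame_rate_proportion = 0.1) {
  upper <- max(1, round(frame_rate * frame_rate_proportion))
  seq_len(upper)
}

candidate_frame_gaps(100)        # 1 through 10
candidate_frame_gaps(100, 0.25)  # 1 through 25, the earlier upper bound

# Subjects taken from the data itself rather than from an attribute
toy <- data.frame(subject = c("bird01", "bird02", "bird01"))
unique(toy$subject)
```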

Closing this because I think the specific issue of how to determine max_frame_gap is resolved. But note that we decided that functions to remove faulty subjects and/or underrepresented treatments downstream will be necessary.