Effect of changing epoch periods on activity endpoints
Closed this issue · 4 comments
Hello,
This post is related to #193 (which was closed and fixed). I ran a quick comparison on a few subjects using `--epochPeriod 60` and the standard settings, which correspond to `--epochPeriod 30`. I calculated Spearman's rank correlations of all computed activity endpoints across the subjects (n = 20) to see whether changing the epoch period has any effect on the computed activity endpoints.
I experienced some unexpected behavior: while most of the endpoints correlate very well, a considerable number of features correlate poorly. Additionally, I observed very large fluctuations in individual cases. For example, "sedentary-hourOfWeekend-19-avg" was estimated as 0 using `--epochPeriod 30` and as 1 using `--epochPeriod 60` in one of the subjects.
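For reference, the per-endpoint comparison described above can be sketched as follows. The data here are synthetic stand-ins: the real per-subject endpoint tables come from the tool's summary output, and the column names below are purely illustrative, not the tool's actual schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the endpoint tables produced with
# --epochPeriod 30 and --epochPeriod 60 (rows = subjects, columns = endpoints).
n_subjects = 20
a = pd.DataFrame(
    rng.normal(size=(n_subjects, 3)),
    columns=["acc-avg", "sleep-avg", "MVPA-avg"],  # illustrative names only
)
# Simulate the 60s run as the 30s run plus a small perturbation.
b = a + rng.normal(scale=0.1, size=a.shape)

# Spearman's rank correlation per endpoint between the two epoch settings.
rhos = {col: a[col].corr(b[col], method="spearman") for col in a.columns}
print(rhos)
```

With real data, endpoints whose rho drops well below 1 would be the ones worth inspecting for epoch-length sensitivity.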
My guess is that this observation can be explained by the fact that the models estimating the activity endpoints are trained on 30s epochs. Does anyone have more information about this?
Thanks!
Hi @sehamsick
You're right that the model is trained on 30s epochs, so it might be less reliable when applied to 60s epochs. I think that's something we need to sort out in the future @aidendoherty. We could disentangle the activity prediction pipeline, which should always use 30s epochs (or whatever length the model was trained on), from the rest of the pipeline.
Can you share which features you found to have very poor correlations? I think most of the features should be more or less stable with respect to epoch size.
Hi @chanshing
Thanks for the reply! If this is the case, I would agree with disentangling the activity prediction pipeline from the rest, or with including a warning (in code or documentation) when a user deviates from the standard (trained) epoch definition.
I didn't see a pattern (e.g. that only sleep endpoints would be affected), but I also didn't spend much time on the analysis. Additionally, I was more concerned about individual outliers than about small random shifts in the data. However, here's some additional information:
The definition of "poor" correlation is often subjective and depends on the question, so here is the distribution of Spearman's rank correlations across the features:
And here's a list of the 20 lowest-correlated features (ranging from ~0.5 to 0.7):
['sleep-hourOfWeekday-10-avg', 'sedentary-hourOfDay-20-avg', 'light-hourOfDay-20-avg', 'sedentary-hourOfWeekend-20-avg', 'sedentary-hourOfWeekday-20-avg', 'cutPointVPA-hourOfWeekday-14-avg', 'moderate-vigorous-hourOfWeekend-20-avg', 'moderate-vigorous-hourOfWeekend-11-avg', 'sleep-hourOfWeekend-20-avg', 'light-hourOfWeekday-20-avg', 'sleep-hourOfDay-10-avg', 'sedentary-hourOfWeekend-19-avg', 'moderate-vigorous-hourOfWeekend-9-avg', 'sleep-fri-avg', 'cutPointVPA-hourOfWeekday-17-avg', 'cutPointVPA-hourOfDay-16-avg', 'MET-hourOfDay-20-avg', 'sedentary-hourOfWeekend-10-avg', 'sedentary-hourOfDay-10-avg', 'MET-hourOfWeekday-20-avg']
Thank you for sharing this @sehamsick
It's hard to tell what's going on, but since we're looking at 500+ variables, we're bound to find some with low correlation. For now I think your best option is to stick to the 30s epoch and postprocess the TimeSeries.csv file to rederive the statistics you want at 1min resolution (for example, by averaging consecutive 30s epochs).
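The averaging step suggested above could be sketched with pandas. This is a minimal illustration, assuming a timestamp column and a numeric acceleration column; the real TimeSeries.csv produced by the accelerometer package has its own header and columns, so the names here (`time`, `acc`) are assumptions.

```python
import io

import pandas as pd

# Toy 30s-epoch time series standing in for TimeSeries.csv
# (column names are illustrative, not the tool's actual schema).
csv_30s = io.StringIO(
    "time,acc\n"
    "2023-01-02 10:00:00,0.02\n"
    "2023-01-02 10:00:30,0.04\n"
    "2023-01-02 10:01:00,0.10\n"
    "2023-01-02 10:01:30,0.20\n"
)
df = pd.read_csv(csv_30s, parse_dates=["time"], index_col="time")

# Average each pair of consecutive 30s epochs into one 1min epoch.
df_1min = df.resample("1min").mean()
print(df_1min)
```

Summary statistics (e.g. hour-of-day averages) can then be recomputed from `df_1min`, while the activity classification itself keeps running on the 30s epochs it was trained on.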