Effect of changing epoch periods on activity endpoints
Closed this issue · 4 comments
Hello,
This post is related to #193 (which was closed and fixed). I ran a quick comparison on a few subjects using `--epochPeriod 60` and the standard settings, which correspond to `--epochPeriod 30`. I calculated Spearman's rank correlations of all computed activity endpoints across the subjects (n = 20) to see whether changing the epoch period has any effect on the computed activity endpoints.
I experienced some unexpected behavior: while most of the endpoints correlate very well, a considerable number of features correlate poorly. Additionally, I observed very large fluctuations in individual cases. For example, "sedentary-hourOfWeekend-19-avg" was estimated as 0 using `--epochPeriod 30` and as 1 using `--epochPeriod 60` in one of the subjects.
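For reference, the per-endpoint comparison described above can be sketched as follows. The data here are synthetic stand-ins: the real per-subject endpoint tables come from the tool's summary output, and the column names below are purely illustrative, not the tool's actual schema.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the endpoint tables produced with
# --epochPeriod 30 and --epochPeriod 60 (rows = subjects, columns = endpoints).
n_subjects = 20
a = pd.DataFrame(
    rng.normal(size=(n_subjects, 3)),
    columns=["acc-avg", "sleep-avg", "MVPA-avg"],  # illustrative names only
)
# Simulate the 60s run as the 30s run plus a small perturbation.
b = a + rng.normal(scale=0.1, size=a.shape)

# Spearman's rank correlation per endpoint between the two epoch settings.
rhos = {col: a[col].corr(b[col], method="spearman") for col in a.columns}
print(rhos)
```

With real data, endpoints whose rho drops well below 1 would be the ones worth inspecting for epoch-length sensitivity.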
My guess is that this observation can be explained by the fact that the models estimating the activity endpoints are trained on 30s epochs. Does anyone have more information about this?
Thanks!
Hi @sehamsick
You're right that the model is trained on 30s epochs, so it might be less reliable when applied to 60s epochs. I think that's something we need to sort out in the future @aidendoherty. We could disentangle the activity prediction pipeline, which should always use 30s epochs (or whatever length the model was trained on), from the rest of the pipeline.
Can you share which features you found to have very poor correlations? I think most of the features should be more or less stable with respect to epoch size.
Hi @chanshing
Thanks for the reply! If this is the case, I would agree with disentangling the activity prediction pipeline from the rest, or with including a warning (in code or documentation) when a user deviates from the standard (trained) epoch definition.
I didn't see a pattern (e.g. that only sleep endpoints would be affected), but I also didn't spend much time on the analysis. Additionally, I was more concerned about individual outliers than about small random shifts in the data. However, here's some additional information:
The definition of "poor" correlation is often subjective and depends on the question, so here is the distribution of Spearman's rank correlations across the features:
And here's a list of the 20 lowest-correlated features (ranging from ~0.5 to 0.7):
['sleep-hourOfWeekday-10-avg', 'sedentary-hourOfDay-20-avg', 'light-hourOfDay-20-avg', 'sedentary-hourOfWeekend-20-avg', 'sedentary-hourOfWeekday-20-avg', 'cutPointVPA-hourOfWeekday-14-avg', 'moderate-vigorous-hourOfWeekend-20-avg', 'moderate-vigorous-hourOfWeekend-11-avg', 'sleep-hourOfWeekend-20-avg', 'light-hourOfWeekday-20-avg', 'sleep-hourOfDay-10-avg', 'sedentary-hourOfWeekend-19-avg', 'moderate-vigorous-hourOfWeekend-9-avg', 'sleep-fri-avg', 'cutPointVPA-hourOfWeekday-17-avg', 'cutPointVPA-hourOfDay-16-avg', 'MET-hourOfDay-20-avg', 'sedentary-hourOfWeekend-10-avg', 'sedentary-hourOfDay-10-avg', 'MET-hourOfWeekday-20-avg']
Thank you for sharing this @sehamsick
It's hard to tell what's going on, but since we're looking at 500+ variables, we're bound to find some with low correlation. For now I think your best option is to stick to the 30s epoch and postprocess the TimeSeries.csv file to rederive the statistics you want at 1min resolution (for example, by averaging consecutive 30s epochs).
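The averaging step suggested above could be sketched with pandas. This is a minimal illustration, assuming a timestamp column and a numeric acceleration column; the real TimeSeries.csv produced by the accelerometer package has its own header and columns, so the names here (`time`, `acc`) are assumptions.

```python
import io

import pandas as pd

# Toy 30s-epoch time series standing in for TimeSeries.csv
# (column names are illustrative, not the tool's actual schema).
csv_30s = io.StringIO(
    "time,acc\n"
    "2023-01-02 10:00:00,0.02\n"
    "2023-01-02 10:00:30,0.04\n"
    "2023-01-02 10:01:00,0.10\n"
    "2023-01-02 10:01:30,0.20\n"
)
df = pd.read_csv(csv_30s, parse_dates=["time"], index_col="time")

# Average each pair of consecutive 30s epochs into one 1min epoch.
df_1min = df.resample("1min").mean()
print(df_1min)
```

Summary statistics (e.g. hour-of-day averages) can then be recomputed from `df_1min`, while the activity classification itself keeps running on the 30s epochs it was trained on.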