QData/spacetimeformer

Guidance on how to modify training loop to use data from longitudinal study

slppo opened this issue · 5 comments

slppo commented

Hello,

I'm working on a project using data from a longitudinal study, with many subjects tied to exam dates and associated features at each of those exam dates. Looking into the core task Spacetimeformer is targeting, it seems more focused on datasets that have a single time index, predicting uniformly against that index. I'm wondering whether it would be too challenging to modify the training loop to support multiple time indices, since each subject has their own (e.g., Subject 1 has an exam date of 2016/01/02, Subject 2 also has an exam date of 2016/01/02, etc.).

I imagine I can modify the CSVTorchDset to load across subjects to keep it simple (one batch still has one subject's worth of context + target points), but before digging into it further I was wondering if you had come across this while researching and have any guidance.
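Concretely, here's a rough sketch of what I have in mind (column names like `subject_id` and `exam_date` are from my dataset, not the actual CSVTorchDset schema):

```python
# Hypothetical sketch of loading per-subject windows from a long-format
# table, so each example (and batch) only ever contains one subject's data.
# Column names ("subject_id", "exam_date", "TARGET") are illustrative.
import pandas as pd

def subject_windows(df, context_len, target_len):
    """Yield (subject_id, context_rows, target_rows) per subject, using
    each subject's own exam dates as its time index."""
    total = context_len + target_len
    for sid, group in df.sort_values("exam_date").groupby("subject_id"):
        if len(group) < total:
            continue  # not enough exams for one context + target window
        for start in range(len(group) - total + 1):
            window = group.iloc[start : start + total]
            yield sid, window.iloc[:context_len], window.iloc[context_len:]

df = pd.DataFrame({
    "subject_id": [2, 2, 2, 2, 4, 4],
    "exam_date": pd.to_datetime(["2005-09-12", "2006-03-13", "2006-09-12",
                                 "2007-09-12", "2005-09-07", "2006-03-09"]),
    "TARGET": [3, 3, 3, 3, 0, 0],
})
examples = list(subject_windows(df, context_len=2, target_len=1))
# subject 2 has 4 exams -> two (2+1)-length windows; subject 4 is skipped
```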

Let me know if you'd like to see some sample data.

Thanks! Excellent project by the way.

jakegrigsby commented

Hi, and thanks!

Sorry, I am not sure I fully understand the question. Do you have multiple variables ({y_0, ... y_N}), but each series is taken at different time intervals? Or does each batch contain data from a different variable?

The first case is definitely possible in theory, because we treat the value of every variable as its own token and give each token its own timestamp (usually in [year, month, day, hour, minute, second] format). So if every variable were sampled at different intervals, that could be reflected in the Time2Vec embeddings of their tokens. This also allows some variables to be missing at different timesteps, or to appear very irregularly. However, the specific way I implemented the embedding in this code is designed to map the standard multivariate time series dataset format to the spacetimeformer input. Unfortunately, it would take some modifications to get more flexibility here. This feature has been requested a few times, and I plan on adding it for some new experiments over the next few weeks.
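For intuition, the Time2Vec mapping itself is small: one linear component plus sinusoidal components at learned frequencies. A NumPy sketch with random stand-ins for the learned weights (not the actual embedding module in this repo):

```python
# Minimal Time2Vec sketch. In the real model, omega and phi are learned
# parameters; random values stand in for them here.
import numpy as np

def time2vec(tau, omega, phi):
    """tau: scalar time feature; omega, phi: frequency/phase vectors.
    Index 0 stays linear (trend); the rest pass through a sine, letting
    the embedding capture periodic structure at learned frequencies."""
    raw = omega * tau + phi
    out = np.sin(raw)
    out[0] = raw[0]  # linear trend component
    return out

rng = np.random.default_rng(0)
omega, phi = rng.normal(size=8), rng.normal(size=8)
emb = time2vec(0.5, omega, phi)  # an 8-dim embedding of timestamp 0.5
```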

slppo commented

Yeah so it'd be closer to the first description with multiple variables but each series taken at different time intervals. Here's an example to hopefully illustrate more clearly:

| Subject ID | EXAMDATE | AGE | ...other features... | TARGET |
|------------|------------|------|----------------------|--------|
| 1 | 2005-09-08 | 74.3 | ... | 0 |
| 2 | 2005-09-12 | 81.3 | ... | 3 |
| 2 | 2006-03-13 | 81.8 | ... | 3 |
| 2 | 2006-09-12 | 82.3 | ... | 3 |
| 2 | 2007-09-12 | 83.3 | ... | 3 |
| 3 | 2005-11-08 | 67.5 | ... | 1 |
| 4 | 2005-09-07 | 73.7 | ... | 0 |
| 4 | 2006-03-09 | 74.2 | ... | 0 |
| 4 | 2006-09-05 | 74.7 | ... | 0 |
| 4 | 2007-09-07 | 75.7 | ... | 0 |
Notice how the timestamps (EXAMDATE) of subjects 2 and 4 overlap, for example. So a unique aspect here is that time intervals can coincide: two subjects may share the same EXAMDATE while still having distinct values for the other variables. During training I'd like to use all available data across all subjects, but at inference time forecasting would be for just one subject.

jakegrigsby commented

Ok so the default embedding is designed to look at long sequences of N variables that have some relationship to each other and make a prediction of those same N variables. With some pretty simple changes we can do things like read (N + k) variables (with k additional sources of information) and still only predict N. By separating all (N + k) variables into a sequence of their own tokens, Spacetimeformer gains extra flexibility by letting variables be missing or sampled at different frequencies.

Your use case seems like more of a meta-learning problem... trying to make predictions about new subjects from limited data, rather than using the relationships between multiple subjects to make more accurate predictions. It looks like you have N = 1 (TARGET), k = |{AGE, ...other features...}|, with each subject_id essentially creating its own sequence problem? You might want to split it up so that training sequences have varied length and never include more than one subject (each batch has all prior results for a given subject?)
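To illustrate the split I mean, here is a rough sketch (function and field names are hypothetical, not from this repo):

```python
# Rough sketch: build varied-length, single-subject training sequences
# where each example is "all prior exams for one subject". Names are
# hypothetical, not from the spacetimeformer codebase.
import numpy as np

def prior_history_examples(records):
    """records: iterable of (subject_id, exam_date, feature_list).
    Yields (subject_id, history, next_exam) with a growing history, so
    sequence lengths vary and never mix subjects."""
    by_subject = {}
    for sid, date, feats in sorted(records, key=lambda r: r[1]):
        by_subject.setdefault(sid, []).append((date, feats))
    for sid, exams in by_subject.items():
        for i in range(1, len(exams)):
            yield sid, exams[:i], exams[i]

def pad_histories(histories, feat_dim):
    """Right-pad a batch of varied-length feature sequences with zeros."""
    max_len = max(len(h) for h in histories)
    batch = np.zeros((len(histories), max_len, feat_dim))
    for b, hist in enumerate(histories):
        for t, (_, feats) in enumerate(hist):
            batch[b, t] = feats
    return batch

records = [
    (4, "2005-09-07", [73.7]), (4, "2006-03-09", [74.2]),
    (4, "2006-09-05", [74.7]), (4, "2007-09-07", [75.7]),
    (3, "2005-11-08", [67.5]),  # only one exam -> yields no examples
]
examples = list(prior_history_examples(records))
batch = pad_histories([h for _, h, _ in examples], feat_dim=1)
```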

slppo commented

OK gotcha -- that makes a lot of sense. Yes, that's correct: to use your terminology, we have N = 1 and k ≈ 10.

I like the idea of batching by subject ID (one sequence per subject) during training, but my ultimate end goal is inference for a single subject at a single timepoint, and I'm hoping that would still give Spacetimeformer sufficient context to get good results.

I'll look more deeply into the Time2Vec embedding logic so I have a better understanding of what this would look like, but if I'm hearing you right, it sounds like this is the thing you're working on right now as a requested feature.

Hi @jakegrigsby,

Sorry for bringing such an old issue back up, but I have a learning problem similar to @slppo's. I have 14 different signals (analogous to @slppo's subject IDs) with synchronized readings every minute, each having k = 5 and N = 1 variables. The N variables are ahead-in-time variables, unavailable at inference time; i.e., the model should predict N knowing only k.

I have mainly two questions for you:

  1. With some pretty simple changes we can do things like read N + k variables (with k additional sources of information) and still only predict N.

Could you elaborate on what these simple changes are? Would they work in a setup where N is unavailable at inference time?

  2. You might want to split it up so that training sequences have varied length and never include more than one subject Id (each batch has all prior results for a given subject Id?)

What if the different signals (subject IDs, here) have a non-negligible correlation between their k variables? Would splitting the batches into individual per-subject subsets still allow the model to learn this inter-subject correlation? At inference time, should the Spacetimeformer model be conditioned to make a prediction for a specific subject ID?
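For instance, I wonder whether appending the subject ID as a one-hot context feature would be a reasonable way to condition the model. A rough sketch of what I mean (shapes and names are just illustrative, not anything the repo provides):

```python
# Sketch (my own guess, not the library's API): condition on subject id
# by appending a one-hot subject indicator to the k context features.
import numpy as np

def add_subject_onehot(features, subject_idx, n_subjects):
    """features: (T, k) array -> (T, k + n_subjects) array with the
    subject indicator broadcast across all T timesteps."""
    onehot = np.zeros((features.shape[0], n_subjects))
    onehot[:, subject_idx] = 1.0
    return np.concatenate([features, onehot], axis=1)

x = np.random.default_rng(1).normal(size=(60, 5))  # 60 minutes, k = 5
x_cond = add_subject_onehot(x, subject_idx=3, n_subjects=14)
# at inference, setting the one-hot selects which signal to predict for
```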

I have been using an LSTM recurrent neural network with decent results, but I'd like to see how your model would perform.