Determine inputs into the predictive models (and starting satellites)
vc1492a opened this issue · 9 comments
In most cases, clever feature engineering provides the best bang for the buck in terms of adding predictive power to trained models. We should discuss and explore here what the most likely drivers of predictive performance are, and document the combinations of model inputs we try.
An obvious place to start is with the following combinations of values:
Index | Variables | Notes
---|---|---
1 | Window of past dStec/dt values in a look-back window of size w, specific to a satellite. | We ought to establish what the model can do with a limited amount of information as a baseline for comparison. This ought to be explored in two ways: first, by averaging or taking the min/max of values across ground stations for a particular satellite and, second, by using the values from each of the ground stations as individual inputs to a model for a specific satellite.
2 | The above, plus the elevation of the satellite. | The motivation is that this variable could "control" for environmental conditions the satellite may experience during model training, since some of the variation in the dependent variable could be captured through the specific environmental conditions of the satellite(s) modeled.
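As a rough sketch of input combination 1, the look-back window can be built by sliding a window of size w over the dStec/dt series, with the next value as the target (function and variable names here are illustrative, not from our codebase):

```python
import numpy as np

def lookback_windows(dstec_dt, w):
    """Build a feature matrix of trailing dStec/dt windows.

    Row t holds the w values ending at time t; the target is the
    value at t + 1.
    """
    series = np.asarray(dstec_dt, dtype=float)
    # Windows over all but the last sample, so each has a next-step target.
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], w)
    y = series[w:]
    return X, y

# Toy series of 6 samples with window size 3 yields 3 training rows.
X, y = lookback_windows([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], w=3)
```

The same construction would be applied per satellite, either to station-aggregated values or to each ground station's series as a separate input.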
@MichelaRavanelli mentioned she could identify which satellites show clear perturbations in dStec/dt. Can you please list them here? Thanks!
Any other considerations? Michela mentioned some filtering for the elevation data may be needed.
@vc1492a The satellites that clearly show the perturbations are G10, G20, G04, G07 and G08.
For every file (e.g. ahup_g20), I suggest using only the observations with an elevation value higher than 15 degrees, so as not to include noisy measurements.
Satellites G04, G07, G08, G10, G13, G20, G23 are examined in previous work, which includes the satellites noted above. I think it would be good to follow that set, as it lets us directly compare our work with at least one prior study in the literature as a start. We can always expand from there or note it as future work.
Noted on the filtering of elevation values. It would be nice to include that filtering step as a configurable flag in our experimental pipeline: when set to true, the modeling process filters the data by elevation; when false, it does not. That way, we could measure the effect the filtering step has on model performance. Deep neural nets are known to handle noisy data well given enough training, so it would be interesting to see the effect.
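A minimal sketch of that configurable flag, assuming a pandas DataFrame with an `elevation` column (the column name and function name are hypothetical; the 15-degree cutoff follows the suggestion above):

```python
import pandas as pd

def maybe_filter_elevation(df, enabled=True, min_elevation_deg=15.0):
    """Drop low-elevation (noisy) observations when the flag is set."""
    if not enabled:
        return df
    return df[df["elevation"] > min_elevation_deg]

# Toy observations: one below the cutoff, two above.
obs = pd.DataFrame({"elevation": [5.0, 20.0, 45.0],
                    "dstec_dt": [0.3, 0.1, 0.2]})
filtered = maybe_filter_elevation(obs, enabled=True)     # keeps 2 rows
unfiltered = maybe_filter_elevation(obs, enabled=False)  # keeps all 3
```

Running the pipeline once with the flag on and once with it off would then isolate the filter's effect on model performance.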
@hamlinliu17 @MichelaRavanelli updated my initial comment at the top of this issue to add some notes on an initial set of model inputs to explore and reformatted the bulleted list into a table for easier viewing. I also added a second entry to the table and described some thoughts for using some of the other variables in the model to account for variation in the dependent variable that could be caused, in part, by environmental conditions experienced by the satellite.
After using the data in practice and performing a few ad-hoc modeling runs, I am more confident that the model, if used in practice, should be retrained each day, predicting future values only one day at a time (this is how I have structured our experiments thus far). This is driven by the intuition that more recent patterns are more relevant in predicting future TEC variation. @MichelaRavanelli what do you think?
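The daily retraining scheme could be sketched as a walk-forward loop, with toy stand-ins for the actual model (the `fit`/`predict` callables and names are illustrative only):

```python
import numpy as np

def daily_walk_forward(days, fit, predict):
    """Retrain on all history each day, then predict that day's values."""
    preds = []
    for i in range(1, len(days)):
        history = np.concatenate(days[:i])  # everything up to yesterday
        model = fit(history)                # fresh model each day
        preds.append(predict(model, len(days[i])))
    return preds

# Toy stand-ins: "fit" memorizes the historical mean, "predict" repeats it.
fit = lambda history: float(history.mean())
predict = lambda model, n: [model] * n

days = [np.array([1.0, 3.0]), np.array([2.0]), np.array([4.0, 4.0])]
preds = daily_walk_forward(days, fit, predict)
# preds[0] covers day 2 (mean of day 1 is 2.0); preds[1] covers day 3.
```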
The intuition is right: the ionosphere changes continuously according to various factors, so it makes sense that more recent TEC patterns give better predictions!
Great, thanks!
On further iteration and modeling runs, it is becoming clear that the best way to approach the problem is to train a model that simply learns a transformation function (a net) between a single input (independent) variable and the dependent variable. So far, using the elevation data as the only input, combined with a small model, has worked best for training a model whose residual behavior allows for anomaly detection.
I don't see a reason to deviate from this formula for now, as it delivers a pattern in the residuals that we can use for anomaly detection much more consistently.
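A minimal sketch of the residual-based detection idea, with a simple polynomial fit standing in for the small net (the degree, z-score threshold, and names are all illustrative assumptions, not our actual setup):

```python
import numpy as np

def residual_anomalies(elevation, dstec_dt, z_thresh=3.0):
    """Fit dStec/dt from elevation alone; flag large residual z-scores."""
    # Degree-2 polynomial as a stand-in for the small net.
    coeffs = np.polyfit(elevation, dstec_dt, deg=2)
    residuals = dstec_dt - np.polyval(coeffs, elevation)
    z = (residuals - residuals.mean()) / residuals.std()
    return np.abs(z) > z_thresh

# Toy data: a smooth elevation-dependent trend with one injected spike.
elevation = np.linspace(15.0, 80.0, 50)
dstec_dt = 0.01 * elevation
dstec_dt[25] += 5.0
flags = residual_anomalies(elevation, dstec_dt)  # flags[25] is True
```

The point is that a fit driven by elevation alone leaves perturbations in the residuals, which is where the anomaly detection happens.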
It seems that using only the elevation is the way to obtain the desired results at the moment. From today's discussion, we have also decided to use only the G07 satellite. While we may need to revisit model inputs if we explore other architectures, that is not relevant to this first pass of work, so I'll be closing this issue.