vc1492a/tidd

Validate training and testing data

vc1492a opened this issue · 4 comments

Some of the ground station and satellite combinations do not show any visible perturbations in the TEC variation that a model could identify. These combinations can be found by manually reviewing a line chart of the TEC variation for each track: if the perturbation from the tsunami wave isn't clear (usually a wave-like pattern), there's little to no chance the tsunami impacted that track, and it should be removed from the dataset used in experiments.
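A minimal sketch of how these review plots could be batch-generated for manual inspection. The filenames, column layout, and the `plot_track` helper are assumptions for illustration, not the project's actual code:

```python
import os

import matplotlib
matplotlib.use("Agg")  # headless rendering for batch plot generation
import matplotlib.pyplot as plt
import numpy as np

def plot_track(times, tec, station, satellite, out_dir="review_plots"):
    """Save a line chart of TEC variation for one ground station /
    satellite pair so it can be manually reviewed for a wave-like
    perturbation from the tsunami."""
    os.makedirs(out_dir, exist_ok=True)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(times, tec, linewidth=0.8)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("TEC variation")
    ax.set_title(f"{station} / {satellite}")
    path = os.path.join(out_dir, f"{station}_{satellite}.png")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    return path

# Synthetic example: noise with a wave-like perturbation in the second half.
t = np.linspace(0, 3600, 3600)
tec = 0.05 * np.random.randn(t.size) + 0.3 * np.sin(2 * np.pi * t / 900) * (t > 1800)
plot_track(t, tec, "ahup", "G07")
```

One chart per track makes it easy to eyeball which combinations show the wave-like signature and which should be dropped.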

@hamlinliu17 since I have the data locally and since it takes forever to upload this much data to S3 (I don't want to do it twice), I'll tackle this issue and others related to prepping data before uploading it to S3 so you have access.

(Screenshots, 2021-01-22: example TEC variation line charts used for review.)

Generated the plots needed to review the tracks and clean up the training data. 🚀 I'll get back to that within the next few days or early next week.

After doing some more review of the plots, I realized that the tracks from satellites G04 and G10 contain too much noise for a machine learning or deep learning model to work with. The amount of noise in the data makes it significantly more difficult for a model to pick up on and correctly classify the perturbed regions.

For modeling, we'll use data with the following satellites: G07, G08, G20. I'll work on creating an updated dataset and archiving the full dataset we have currently.

It took a while, but I copied the original dataset we generated and then removed the tracks with satellites G04 and G10. We had 279 ground station and satellite combinations to start for the experiment and have now brought that down to 168. This implies that there are a total of 168 anomalous events that can be detected in the data.
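The filtering step above could be sketched as follows. The `"<station>_<satellite>"` identifier format is an assumption about how tracks are named, not the dataset's actual layout:

```python
# Satellites whose tracks are clean enough to keep for modeling,
# and those removed for excessive noise (per the review above).
KEEP_SATELLITES = {"G07", "G08", "G20"}
DROP_SATELLITES = {"G04", "G10"}

def filter_tracks(track_ids):
    """Split hypothetical "<station>_<satellite>" track identifiers into
    those kept for the experiment and those archived as too noisy."""
    kept = [t for t in track_ids if t.split("_")[1] in KEEP_SATELLITES]
    dropped = [t for t in track_ids if t.split("_")[1] in DROP_SATELLITES]
    return kept, dropped

tracks = ["ahup_G07", "ahup_G04", "mkea_G10", "mkea_G20", "hilo_G08"]
kept, dropped = filter_tracks(tracks)
print(kept)     # ['ahup_G07', 'mkea_G20', 'hilo_G08']
print(dropped)  # ['ahup_G04', 'mkea_G10']
```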

Common practice is to set aside some amount of data for validating the model, which we will cover in #71. 20% of 168 is 33.6, which we will round up to 34 and set aside in #70.
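The holdout split could look like the sketch below, rounding the validation size up as in the 20% of 168 → 34 calculation (the function and seed are illustrative, not the project's actual split code):

```python
import math
import random

def holdout_split(track_ids, holdout_frac=0.2, seed=42):
    """Shuffle the tracks and set aside a validation holdout, rounding
    its size up (ceil) so 20% of 168 yields 34 rather than 33."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_holdout = math.ceil(len(track_ids) * holdout_frac)
    shuffled = track_ids[:]
    rng.shuffle(shuffled)
    return shuffled[n_holdout:], shuffled[:n_holdout]

tracks = [f"track_{i:03d}" for i in range(168)]
train, holdout = holdout_split(tracks)
print(len(train), len(holdout))  # 134 34
```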

Going to commit the notebook I used to generate the plots (they're nice plots, could come in handy for publication). I'll then close this issue and proceed to #70 on the same branch, feature/validate_data.