Construct simulated real-world experiment
vc1492a opened this issue · 23 comments
While the metrics resulting from the model training process are helpful and important to consider, they do not fully convey how the model will behave and perform in practice. More specifically, in practice data will be read, transformed into an image, and classified in real time (as opposed to the random shuffle of data during model training). It would be helpful to be able to measure the performance of the model under a real-world scenario, where inference is performed at regular intervals on new data.
We'll need to create a sort of mini pipeline, where we maintain a sliding window of data that is converted into an image and then fed into the trained model every t time steps. With each classification, we keep track of the classifications for each index and then take a per-time-step (index) majority vote to classify the time step as anomalous or not. Thus, we end up with time periods labeled as anomalous or not. We compare these to the ground truth time ranges: if a significant enough overlap occurs, we have a true positive (false positives and false negatives work the other way). We could then generate the precision, recall, F-score metrics, etc.
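A rough sketch of the loop I have in mind is below, assuming a hypothetical `series_to_image` converter and a trained Keras-style `model`; names, shapes, and the 0.5 threshold are illustrative only, not what we'll necessarily ship:

```python
import numpy as np

def sliding_window_votes(series, model, window_size=60, step=1):
    """Classify each window and take a per-index majority vote."""
    votes = [[] for _ in range(len(series))]
    for start in range(0, len(series) - window_size + 1, step):
        window = series[start:start + window_size]
        image = series_to_image(window)  # hypothetical converter
        prob = float(np.ravel(model.predict(image[np.newaxis]))[0])
        label = int(prob >= 0.5)
        for idx in range(start, start + window_size):
            votes[idx].append(label)
    # Majority vote per time step; indices with no votes default to normal (0).
    return np.array([int(np.mean(v) >= 0.5) if v else 0 for v in votes])
```

The resulting per-time-step labels would then be compared against the ground truth anomaly ranges to tally true positives, false positives, and false negatives.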
@hamlinliu17 getting started on this issue now that we have some of our baseline metrics and modeling process in place, thanks again for your hard work there.
Going to commit this work to a new branch `feature/real_world_experiment`. Goal is to get this completed fairly quickly, as we'll want to refine our modeling approach based on how the model will behave in the real world.
Possible title: Detecting Tsunami-related Total Electron Content Anomalies in the Ionosphere with Convolutional Neural Networks
@vc1492a will we be using the Chile dataset to help us with the real-world application?
@hamlinliu17 nah, we're going to leave the Chilean data out of this round of work and save it for future work (it would be a natural extension once the approach is more stabilized). For now, we are using the validation set from the image-balanced dataset for the real-world experiment.
@hamlinliu17 when you get the chance to take a look at the code and are working on the refactor, could you be sure to write in some functionality that stores the classifications and their predicted probabilities to disk? It would be helpful to have them locally stored for making visualizations that combine with data not directly used in the experiment (such as latitudes and longitudes).
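Something like the sketch below is what I have in mind; `timestamps`, `probabilities`, and `labels` are assumed to come out of the experiment, and the file path and column names are placeholders:

```python
# Sketch only: persist per-timestep classifications and predicted
# probabilities so they can later be joined with station metadata
# (e.g. latitude / longitude) for visualization.
import pandas as pd

def save_classifications(timestamps, probabilities, labels,
                         path="classifications.csv"):
    frame = pd.DataFrame({
        "timestamp": timestamps,
        "predicted_probability": probabilities,
        "predicted_label": labels,
    })
    frame.to_csv(path, index=False)
    return frame
```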
@vc1492a Will do. I was able to abstract most of the functionality away from the `real_world_experiment` function. Will try to get this and all the refactoring done around Friday.
@vc1492a There is the problem where we get fewer classifications than total timestamps, so I will label the missing timestamps as normal for now and see if anything arises from that.
@vc1492a I made a recent commit but will still be working on it. I think there may be some bugs that I created when I refactored, so I will try to test it out over the next couple of days.
> There is the problem where we get fewer classifications than total timestamps, so I will label the missing timestamps as normal for now and see if anything arises from that.
Does this occur on all of the periods you have tried or only some? I got an error with the ground station / satellite combination `kaep__G20` and I think a few others, but never looked more closely into what it might be. Is the difference between the two 60? If so, that's due to the window size from the images. The float data that is read in needs to be truncated so it is shorter than its original size, due to the windowing method used to generate the images.
Let me know if it's another issue, and sounds good. I'm largely working on writing the first draft of the paper at the moment, which I hope will get us started and will also help us flesh out what other adjustments we want to make for this first paper.
Note that there are also a few ground station / satellite combinations that do not have image data for the day of the earthquake in the validation set. We need to fix these or remove them from the validation and training sets. This hasn't been handled yet but is referenced in #82, and could be causing the issues you are seeing.
> Does this occur on all of the periods you have tried or only some? I got an error with the ground station / satellite combination `kaep__G20` and I think a few others, but never looked more closely into what it might be. Is the difference between the two 60? If so, that's due to the window size from the images. The float data that is read in needs to be truncated so it is shorter than its original size, due to the windowing method used to generate the images.
I think it's mostly from the window size, but I am trying to see if it's affecting all the combinations. The difference between the two is 59, which makes sense given the window size.
Yup, that's the issue - it's due to the window size. The solution is to truncate the first 59 observations off of the float data. I had a manual fix for this in the old code, but it was easy to miss.
In the future, it would be nice if this were parameterized, as that value could also be set during the image dataset generation process and automatically passed to assessment (imagining a world where generating the images is part of the experimental pipeline, not just the model training).
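For illustration, a sketch of the alignment fix with the window size passed in rather than hard-coded (variable names here are hypothetical, not the project's actual API):

```python
# Sketch: drop the first (window_size - 1) observations so the float data
# lines up with the classifications produced from the windowed images.
def align_to_classifications(float_data, classifications, window_size=60):
    offset = window_size - 1  # 59 for a 60-sample window
    truncated = float_data[offset:]
    assert len(truncated) == len(classifications), (
        f"expected {len(classifications)} rows, got {len(truncated)}"
    )
    return truncated
```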
> @hamlinliu17 when you get the chance to take a look at the code and are working on the refactor, could you be sure to write in some functionality that stores the classifications and their predicted probabilities to disk? It would be helpful to have them locally stored for making visualizations that combine with data not directly used in the experiment (such as latitudes and longitudes).
Added it. Hopefully it works on your end as well.
@hamlinliu17 thanks, checked out your work. Looks good! Let's tackle the next steps below before merging this branch into the `dev` branch:
- Add type hints to any functions we have so far that don't yet have any.
- Add docstrings to any functions we have so far that don't yet have any.
- Abstract away the plotting functions from the main real-world experiment function. When doing so, let's make sure those functions also work with the classification data we read back in from disk.
- Let's abstract away the generation of the false positives, false negatives, and true positives into a separate function.
- Let's also abstract away the generation of the precision, recall, and F1-score metrics so that it's a bit cleaner (a minimal sketch of what this could look like follows this list).
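Something along these lines is what I'm picturing for that last item; the counting of true/false positives and false negatives is assumed to happen elsewhere, and the function name is a placeholder:

```python
# Sketch: precision, recall, and F1 from already-counted true positives,
# false positives, and false negatives, with division-by-zero guards.
def prf_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```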
We can tackle these together and chat about them the next time we meet @hamlinliu17! I think once these are in place we ought to clean up the notebook and merge into the `dev` branch. I'll be pushing a small commit soon where I introduce the ability to track the sequence lengths for true positives, false positives, and false negatives. This is something I want to comment on in the paper, as short sequence lengths are common in the false positives for our work so far (which can be remedied with some simple strategies to make our approach a lot more performant, albeit with time lags introduced for real-time applications).
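As a purely illustrative example of one such simple strategy (not what's implemented today), anomalous runs shorter than some minimum length could be suppressed, trading a small detection delay for fewer spurious false positives:

```python
import numpy as np

def filter_short_runs(labels: np.ndarray, min_length: int = 3) -> np.ndarray:
    """Set anomalous (1) runs shorter than `min_length` back to normal (0)."""
    filtered = labels.copy()
    start = None
    for i, value in enumerate(np.append(labels, 0)):  # sentinel closes a trailing run
        if value == 1 and start is None:
            start = i
        elif value == 0 and start is not None:
            if i - start < min_length:
                filtered[start:i] = 0
            start = None
    return filtered
```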
Lastly, I'd like to discuss as a team soon (@hamlinliu17 @MichelaRavanelli) the applied work we want to do, as that will inform our future work and where and how we spend time scaling up the project as it is today into the beginnings of a software prototype (graduating from notebooks to software!). It's looking clearer and clearer to me that the next big focus will be to scale up the code / approach from what we have today before we step into the cool applied work we want to do.
@hamlinliu17 do you have time in the next week or so to tackle the above to-do list before we merge with `dev`?
@vc1492a Sounds good, I will try and get to work on it these next few days.
@vc1492a I have finished with most of the software/code cleanup, though I need to test it with data. I will commit the changes after we resolve the problems with the S3 bucket.
@hamlinliu17 updated the README with the new data path / unpacking instructions. Be sure to update your path in the modeling notebook - worked great on my end, let me know if you run into any issues (you'll want to `mv` your current `data` directory).
@vc1492a I got all the data, and everything is running smoothly. Going to add a bit more to each docstring and will commit once completed. Feel free to check anything and add to the docstrings.
Sounds great, thanks!
Pulled and looks good to me @hamlinliu17! Was able to run without issue.
@hamlinliu17 currently testing the abstraction of the metrics for the validation section of the experiment - once that's complete I'll open a PR that we can review before pulling the `feature/real_world_experiment` branch into `dev`.
We'll subsequently go through the issue tracker and make sure it's up to date. Please feel free to add any issues or notes you think may be missing based on our recent discussions. We may transition some of these to a future board to separate existing from future work (we'll do that later though - for now let's brain dump on one board).
@hamlinliu17 pushed the metrics refactor. Going to merge this into `dev` and delete this branch.