get autoencoder working with Thwaites Glacier data
Closed this issue · 18 comments
Get the autoencoder from #14 working with the Thwaites Glacier (TG) data downloaded in #13.
Document your progress very thoroughly in a notebook and put the notebook in this repo. When documenting, I suggest writing lots of explanatory comments and including markdown cells in the notebook that explain the overall motivation and approach you are taking. Doing this for every notebook will make sure you can reuse the notebook in the future.
cc @Templar129
Day 2 Update:
I tried using the bed topography from the TG data as the training dataset, and trained a simple autoencoder model in TensorFlow. Since the input has only a few samples, we wanted to increase the number of layers. It turned out that for the bed topography, nearly 60% of the plot is blank, and the remaining 40% containing the glacier topography is concentrated in the center of the image. This makes it easy for the model to overfit: it tends to produce an output that is completely blank yet still has a low MSE.
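For reference, a minimal sketch of this kind of simple TensorFlow/Keras convolutional autoencoder (the 64x64 patch size and layer widths here are illustrative assumptions, not the exact settings used in the notebook):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal convolutional autoencoder sketch; the 64x64 patch size and
# layer widths are illustrative assumptions, not the notebook's settings.
def build_autoencoder(input_shape=(64, 64, 1)):
    inputs = layers.Input(shape=input_shape)
    # Encoder: compress the bed topography patch
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
    encoded = layers.MaxPooling2D(2)(x)
    # Decoder: reconstruct the patch from the compressed representation
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)
    outputs = layers.Conv2D(1, 3, activation="linear", padding="same")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```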
The next step should be to improve the model.
There are several possible ways to do so:
We can try defining our own loss function, so that blank pixels contribute less to the loss and glacier pixels contribute more. We can also add more layers so the model can better learn the pattern of the training data.
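For example, a weighted loss along these lines might work (this sketch assumes blank pixels have been filled with 0 after removing NaNs; the 0.1 / 1.0 weights are arbitrary placeholders):

```python
import tensorflow as tf

# Sketch of a weighted MSE that down-weights blank (non-glacier) pixels.
# Blank pixels are assumed to have been filled with 0 after removing NaNs;
# the 0.1 / 1.0 weights are illustrative choices, not tuned values.
def weighted_mse(y_true, y_pred, blank_weight=0.1, glacier_weight=1.0):
    glacier_mask = tf.cast(tf.not_equal(y_true, 0.0), tf.float32)
    weights = blank_weight + (glacier_weight - blank_weight) * glacier_mask
    return tf.reduce_mean(weights * tf.square(y_true - y_pred))

# model.compile(optimizer="adam", loss=weighted_mse)
```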
@Templar129 exciting work! One issue might be the null values as you noted. Have you tried using a subset of the topography and training the autoencoder on a subset region that doesn't include any nan values?
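For example, a simple way to pull out NaN-free training patches could look like this (the array name `bed`, the window size, and the stride are hypothetical placeholders, not code from the notebook):

```python
import numpy as np

# 'bed' is assumed to be a 2D array of bed elevations with NaNs outside
# the glacier. Scan fixed-size windows and keep those without any NaNs.
def nan_free_patches(bed, size=64, stride=64):
    patches = []
    for i in range(0, bed.shape[0] - size + 1, stride):
        for j in range(0, bed.shape[1] - size + 1, stride):
            window = bed[i:i + size, j:j + size]
            if not np.isnan(window).any():
                patches.append(window)
    return np.stack(patches) if patches else np.empty((0, size, size))
```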
Hi all, I just looked up some material and have an idea for our data.
One possible way to get a rotated, cropped rectangle that contains the glacier while being as small as possible is to use PCA. The values in our data matrix range from 0 down to large negative numbers, plus NaNs. If we replace the NaNs with something like 1, they remain clearly distinguishable from the values inside the glacier. If we then run a PCA on the matrix, it will help us find a main axis: an x-y axis that is centered on the glacier and tilted the same way as the glacier.
From this we know the angle of that axis relative to the horizontal. With the help of the angle and the OpenCV package (cv2), we can rotate the original data to the right orientation and crop it. cv2 is a Python package that can identify the main region of an image, fit a small enough bounding rectangle around it, and give you that cropped picture.
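Roughly, the idea could be sketched like this (using scikit-learn's PCA on the glacier pixel coordinates and OpenCV for the rotation; the array name `bed`, the NaN fill value, and the other details are assumptions rather than the exact notebook code):

```python
import numpy as np
import cv2
from sklearn.decomposition import PCA

# 'bed' is assumed to be a 2D array with NaNs outside the glacier.
mask = ~np.isnan(bed)
rows, cols = np.where(mask)
coords = np.column_stack([cols, rows]).astype(float)

# PCA on the glacier pixel coordinates gives the main axis of the glacier.
pca = PCA(n_components=2).fit(coords)
angle = np.degrees(np.arctan2(pca.components_[0, 1], pca.components_[0, 0]))

# Rotate the NaN-filled array so the main axis is roughly horizontal,
# then crop to the bounding box of the rotated glacier mask.
filled = np.where(mask, bed, 1.0).astype(np.float32)
center = (float(coords[:, 0].mean()), float(coords[:, 1].mean()))
rot = cv2.getRotationMatrix2D(center, angle, 1.0)
h, w = bed.shape
rotated = cv2.warpAffine(filled, rot, (w, h), flags=cv2.INTER_NEAREST)
rotated_mask = cv2.warpAffine(mask.astype(np.uint8), rot, (w, h)) > 0
ys, xs = np.where(rotated_mask)
cropped = rotated[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```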
Below is the new picture I have got.
@Templar129 This is great progress, Hengbo! I agree with jonny, especially regarding the periodograms. The roughness looks different between the two, and I'm curious to see whether that is an issue.
@jkingslake Yes, I will make the histograms and periodograms for both the original and predicted data. I will attach the results here in another comment later.
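For reference, a sketch of how such a comparison could be made (`original`, `predicted`, and the grid spacing `dx` are placeholders; the actual notebook may do this differently):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import periodogram

# 'original' and 'predicted' are assumed to be 2D arrays on the same grid;
# 'dx' is the grid spacing in metres (placeholder value).
def mean_row_periodogram(field, dx=500.0):
    freqs, power = periodogram(field, fs=1.0 / dx, axis=1)
    return freqs, power.mean(axis=0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Histograms of elevation values (NaNs dropped)
ax1.hist(original[np.isfinite(original)], bins=100, alpha=0.5, label="original")
ax1.hist(predicted[np.isfinite(predicted)], bins=100, alpha=0.5, label="predicted")
ax1.set_xlabel("bed elevation (m)")
ax1.legend()
# Row-averaged periodograms on log-log axes
for name, field in [("original", original), ("predicted", predicted)]:
    f, p = mean_row_periodogram(field)
    ax2.loglog(f[1:], p[1:], label=name)
ax2.set_xlabel("spatial frequency (1/m)")
ax2.set_ylabel("power")
ax2.legend()
plt.show()
```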
@hoffmaao And I totally agree about the roughness. Many autoencoders tend to give smoother results. Looking at the graph, I would say the predicted result is more blurred and less sharp than the original. There might be some parameters we can adjust to make the output look better; I will look into this next.
Again, this is great work! Seems like some of our worries here about the power at high frequencies may be confirmed. I have a method that's pretty customary in the geostatistics literature that we could consider using to generate synthetic topographies for the sake of comparison. I think that could be a good benchmark as others have used these in ensembles to identify the correct bed topography.
Thanks, and it's true there's a significant difference in the periodograms. I think the method you mentioned would be great: we can use it to generate a synthetic topography and put them all on the same plot to see the differences. Then we can see where the problem is and maybe fix it to some extent by adjusting a few parameters.
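As an illustration only (the specific geostatistical method referred to above isn't spelled out in this thread), one common way to generate a synthetic topography for comparison is spectral synthesis of a Gaussian random field with a power-law spectrum; the exponent below is an arbitrary choice:

```python
import numpy as np

# Illustrative only: synthetic topography as a random field with a
# power-law spectrum (a stand-in for whatever geostatistical method is
# actually used; shape, beta, and seed are arbitrary placeholders).
def synthetic_topography(shape=(512, 512), beta=3.0, seed=0):
    rng = np.random.default_rng(seed)
    ky = np.fft.fftfreq(shape[0])[:, None]
    kx = np.fft.fftfreq(shape[1])[None, :]
    k = np.sqrt(kx**2 + ky**2)
    k[0, 0] = np.inf                      # avoid division by zero at DC
    amplitude = k ** (-beta / 2.0)
    phase = np.exp(2j * np.pi * rng.random(shape))
    field = np.fft.ifft2(amplitude * phase).real
    return (field - field.mean()) / field.std()
```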
This is quick work getting these plotted. Well done. Clearly some differences. The question is whether this is an issue for us, given that the simulations may not be (or maybe do not need to be) as high resolution as the data.
What do you think?
Hi, I was trying to tune some of the hyperparameters of the autoencoder model. It turns out that in our case, since we only have one large image, it might not be a good choice to use a model with too complicated a structure.
So I looked this up and found that, to keep the details and patterns of the original data, some people use 'skip connections'. These are connections that carry the structure of the original data past the bottleneck and pass it to the output.
I have run a demo and the output picture looks better. I will try adding one or two layers, using the skip-connection method, and adding more training epochs to see what happens.
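A minimal sketch of what adding a skip connection could look like in Keras (a small U-Net-style layout; the layer sizes are illustrative and the actual demo architecture may differ):

```python
from tensorflow.keras import layers, models

# Small autoencoder with one skip connection; layer sizes are illustrative.
def build_skip_autoencoder(input_shape=(64, 64, 1)):
    inputs = layers.Input(shape=input_shape)
    e1 = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(e1)
    bottleneck = layers.Conv2D(8, 3, activation="relu", padding="same")(p1)
    u1 = layers.UpSampling2D(2)(bottleneck)
    # Skip connection: concatenate the encoder feature map with the
    # upsampled decoder features so fine detail can reach the output.
    merged = layers.Concatenate()([u1, e1])
    d1 = layers.Conv2D(16, 3, activation="relu", padding="same")(merged)
    outputs = layers.Conv2D(1, 3, activation="linear", padding="same")(d1)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```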
Hi everyone, I just created a fork to upload the autoencoder Jupyter notebook. It now has the old and new versions of our autoencoder code, and some comparisons between them. I am still working on the comments and instructions in the code so it can be easily accessed and replicated. Here is the link: https://github.com/Templar129/Autoencoder_demo
The main codes are in Autoencoder_example/TG_data_autoencoder.ipynb
Looks great! Looks like you are getting reasonable results in terms of the values and the frequencies, albeit with some loss of high-frequency information.
I would like to see more documentation in the notebook at some stage though. In general, a well-documented notebook has a markdown cell preceding every code cell explaining what you are going to do, and each code cell should do just one thing (set up the encoder layers, remove NaNs, define the model from the layers, etc.).
Yes! I think the documentation could be more detailed and better organized. I will edit it and add more information to make it easier to read later.
Maybe we close this issue, given that you have successfully got the autoencoder working with the TG data, and then make another one about improving the documentation of the notebook.
Yes, the autoencoder now works successfully with the TG data. We can further develop the autoencoder and even start looking at a VAE (variational autoencoder). I will create a new issue about the documentation and how we can further improve it.