ASSERT-KTH/CodRep

Participant #13: Team madPL, University of Wisconsin--Madison & Microsoft Research


Created for Team madPL (University of Wisconsin--Madison & Microsoft Research) for discussions. Welcome!

Jordan Henkel, Shuvendu Lahiri, Ben Liblit, Thomas Reps

We have a technique that treats the repair problem as a search/ranking problem: we extract features and then run a "learning to rank" technique on the data. As a post-processing step, we rule out the highest-ranked prediction if applying the repair at that location yields a file that fails to parse (provided the file parsed successfully without the repair).
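
For concreteness, here is a minimal sketch of that post-processing step, assuming a Python pipeline and the javalang parser; the helper names and the candidate representation are illustrative, not taken from our actual submission.

```python
import javalang


def parses(source: str) -> bool:
    """Return True if `source` parses as a Java compilation unit."""
    try:
        javalang.parse.parse(source)
        return True
    except Exception:  # syntax or tokenizer errors
        return False


def apply_repair(source: str, line_no: int, replacement: str) -> str:
    """Replace line `line_no` (1-indexed) with the proposed repair line."""
    lines = source.split("\n")
    lines[line_no - 1] = replacement
    return "\n".join(lines)


def rerank(ranked_lines, source, replacement):
    """Rule out the top-ranked line if repairing there breaks a file that
    parsed fine before the repair; otherwise keep the ranking unchanged."""
    if len(ranked_lines) < 2 or not parses(source):
        return ranked_lines
    top = ranked_lines[0]
    if not parses(apply_repair(source, top, replacement)):
        return ranked_lines[1:] + [top]  # demote the unparseable repair
    return ranked_lines
```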

Here's a table that summarizes our results:

| Trained on | Loss on Dataset 1 | Loss on Dataset 2 | Loss on Dataset 3 | Loss on Dataset 4 | Parseability check |
|---|---|---|---|---|---|
| 80% of Dataset 2 | 0.087606 | 0.068825 | 0.05736 | 0.06536 | NO |
| 80% of Dataset 2 | 0.085909 | 0.067685 | 0.05537 | 0.06484 | YES |
| 80% of Datasets 1, 2, 3 | 0.069487 | 0.066061 | 0.04301 | 0.07607 | NO |
| 80% of Datasets 1, 2, 4 | 0.056232 | 0.058874 | 0.05606 | 0.03400 | NO |
| 80% of Datasets 1, 3, 4 | 0.052917 | 0.085307 | 0.03244 | 0.03716 | NO |
| 80% of Datasets 2, 3, 4 | 0.096918 | 0.065058 | 0.03698 | 0.03990 | NO |
| 80% of Datasets 1, 2, 3, 4 | 0.044905 | 0.051056 | 0.02839 | 0.03525 | NO |
| 80% of Datasets 1, 2, 3, 4 | 0.044459 | 0.050524 | 0.02831 | 0.03515 | YES |

The first two rows show our best performance when training on 80% of a single dataset (Dataset 2). The next four rows show performance under cross-validation, holding out one whole dataset each time and training on 80% of the other three. The last two rows show the performance of a model trained on all four datasets, with and without the parseability filter.

One difficulty with this technique is that its performance on totally unseen data is unpredictable. It usually generalizes well enough, but I'm sure that, with more time to tune and better features, one could build a model that generalizes better.

We've made our submission available via Docker Hub (it uses the model trained on all datasets). To run it on a new dataset, do the following on a machine with Docker installed:

```
docker pull jjhenkel/instauro
docker run -it --rm -v /path/to/Datasets/NewDataset:/data jjhenkel/instauro
```

This is a really interesting result.

It is funny to see that by learning from Datasets 2, 3, and 4 you obtain a worse result on Dataset 1 than by training on Dataset 2 alone.

By any chance, do you have the effectiveness of your approach on the tasks that were not used during training (the held-out 20%)?

During training, did you take into account that some tasks are duplicated?

Hi @tdurieux

I didn't save performance measurements for the 20% used for validation. I did watch some models complete training, and each time performance on the 20% was within a percent or two of performance on the 80% (the learner was a learning-to-rank model using Precision@1 as its metric).
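
For reference, Precision@1 here is just the fraction of tasks whose top-ranked candidate line is the true repair line. A minimal sketch, with an illustrative task representation rather than our actual data format:

```python
def precision_at_1(tasks):
    """`tasks` is a list of (ranked_candidate_lines, true_repair_line) pairs.
    Returns the fraction of tasks whose top-ranked candidate is correct."""
    hits = sum(1 for ranked, truth in tasks if ranked and ranked[0] == truth)
    return hits / len(tasks)


# Example: the true line is ranked first in 2 of 3 tasks -> ~0.667
print(precision_at_1([([12, 7, 40], 12), ([3, 9], 3), ([5, 1], 1)]))
```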

The learner does not take duplicate tasks into account (that is, I do not filter duplicates anywhere). That said, I think it may be interesting to train on 100% of three of the datasets and use the held-out dataset as a validation set. With this strategy the learner would stop when it no longer makes progress on the held-out set, which may help prevent overfitting.
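
A minimal sketch of that validation strategy, assuming an incremental training step and a held-out evaluation function (both hypothetical placeholders, not our actual training code):

```python
def train_with_holdout(fit_one_round, evaluate_holdout, max_rounds=200, patience=10):
    """Run training rounds until the held-out score (e.g. Precision@1 on the
    held-out dataset) stops improving for `patience` consecutive rounds."""
    best_score, stale = float("-inf"), 0
    for _ in range(max_rounds):
        fit_one_round()              # one incremental learning-to-rank step
        score = evaluate_holdout()   # score on the held-out dataset
        if score > best_score:
            best_score, stale = score, 0
        else:
            stale += 1
        if stale >= patience:
            break                    # no recent progress: stop to limit overfitting
    return best_score
```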

Indeed interesting ... and quite good! Looking forward to the performance on the hidden dataset.