microsoft/graph-based-code-modelling

Where is the dataset?

Closed this issue · 14 comments

Hi @mmjb ,
How are you?

Is the dataset used for the ICLR'19 paper available?
I always thought that it was the same one as in the ICLR'18 paper, but I just saw in the papers that the ICLR'19 one is much larger.

Thanks!

mmjb commented

Sorry, we didn't release the dataset here, only the tool to extract it, and we will not be able to release the dataset. Roughly, the dataset is considered to be a distribution of the original source code, and so we would need to get approval from our legal team for each of the (hundreds of) projects we used here...

Do you think that the ICLR'18 dataset here: https://aka.ms/iclr18-prog-graphs-dataset
is large enough to be useful? (useful == sensible enough to compare different models on)

(in the worst case I can create a new dataset, but then I will need to de-duplicate etc.)

mmjb commented

That dataset is specialized towards the VarMisuse task (in that it only contains subgraphs centered around a hole into which a variable should go), so I don't think that it would work for other scenarios.

What task are you trying to evaluate? Generating source code?

Yes, basically the ExprGen task as in your paper.

We're currently implementing our approach inside your data extractor (and plugging it in like the PhogExtractor to make sure our model, your model, and your baselines run on the same holes).

mmjb commented

I fear that's the sanest way of doing things. Sorry not to be more helpful here, but the legal requirements at a global megacorporation sometimes make certain things surprisingly hard...

Yes, I understand.
Can we use the ICLR'18 dataset at all, as a temporary solution?
Is it in raw ".cs" files or a specific preprocessed format?

mmjb commented

It's in preprocessed JSON, essentially the output of the VarMisuse spin of the Extractor...

Maybe you can release that dataset as raw files? So I can run my extractor on the same raw data?

mmjb commented

@mallamanis, you dealt with the dataset release -- can we do that?

I can try to get to this, but I'll need a few days before I can...

@urialon if you want this sooner, it might be faster to do the following (and I will eventually do the same thing, since we've deleted all other data by now)

a) Appendix D of the ICLR'18 paper lists the projects along with the git SHA we used.
b) Clone these projects and set HEAD to that SHA.
c) The files in these repositories are the ones used in the extracted data.
d) For additional filtering: in the .json files, each entry has information about the file (a filename field) the data was extracted from. Keep only the files that appear in any of the .jsons.

(if you do this, let us know)
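The steps above could be sketched roughly as follows. Note this is an illustrative sketch, not code from the repo: the project URL and SHA are placeholders (the real list is in Appendix D of the ICLR'18 paper), and it assumes each JSON entry's filename field matches the on-disk file name; if the field is actually a path relative to the repo root, the final filter would compare relative paths instead.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical (project, SHA) pairs -- placeholders, not real entries;
# the actual list is in Appendix D of the ICLR'18 paper.
PROJECTS = [
    ("https://github.com/example/SomeProject", "<sha-from-appendix-d>"),
]

def checkout(repo_url: str, sha: str, dest: Path) -> None:
    """Steps (b)/(c): clone a project and pin HEAD to the SHA used for extraction."""
    subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    subprocess.run(["git", "-C", str(dest), "checkout", sha], check=True)

def extracted_filenames(json_paths) -> set:
    """Step (d): collect the `filename` field from every entry in the released .json files."""
    names = set()
    for path in json_paths:
        with open(path) as f:
            for entry in json.load(f):
                names.add(entry["filename"])
    return names

def filter_files(repo_dir, keep: set):
    """Step (d): keep only the .cs files that appear in some .json entry
    (assumes `filename` is a bare file name, not a relative path)."""
    return [p for p in Path(repo_dir).rglob("*.cs") if p.name in keep]
```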

That's great!
Thanks guys!

Hey @mallamanis,
I did what you suggested.

Can you please approve the following script?
It does work; I just want to verify that I am taking the right graphs, jsons, repos, etc.
https://gist.github.com/urialon/bae095ebd86a0411ee97883dfcb5ae5b

  • There were a few projects that were in the Appendix but not in the dataset, and a project in the dataset that did not appear in the paper. I included all of them; see the comments in the code. When a repo was not in the dataset, I took all *.cs files. When a repo was in the dataset but not in the paper (actually this was just OpenLiveWriter), I took the latest commit. Is there a SHA for that repo?
  • I don't care about the internal train/dev/test split within a repo; I just assigned each entire repo to train, dev, or test according to the *.gz files in the graphs dir.

Sorry for the trouble, I just figured that it would be easier for you to review my code than to write it yourself.

Thanks!

This looks great! And I believe that the script is correct.

  • We had to remove some repos from Tbl4 because we cannot release them given their licenses. (The ICLR'18 results on the reduced dataset are in the table on the last page.)
  • OpenLiveWriter is a peculiar case. I recall that I was not able to compile it and removed it from the dataset. But since it's still in there, feel free to use it.
  • I think it's perfectly alright to ignore the internal split. The only minor aspect here is that for the ICLR'18 paper we often used RavenDB and CommonMark.NET as the dev set, i.e. a small dataset where we test variations/changes in our code. This might imply that our model has somewhat "overfitted" to these two projects because of the tuning we performed on them. Having said that, I don't believe this to be the case, and you should feel free to reshuffle things.
  • Note that RavenDB also has a full port of Newtonsoft internally (probably the largest duplicate) which was removed during deduplication.
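As an aside, the deduplication mentioned above is commonly done by hashing file contents. A minimal, generic sketch of that idea, assuming exact-duplicate detection only (the actual procedure used when building the dataset may well have been more sophisticated, e.g. near-duplicate detection):

```python
import hashlib
from pathlib import Path

def dedup_by_content(root):
    """Generic content-hash dedup: keep the first file seen for each distinct
    content hash, and report the rest as duplicates. Illustrative only -- not
    the exact deduplication procedure used for the dataset."""
    seen = {}        # sha256 hex digest -> first Path with that content
    duplicates = []
    for path in sorted(Path(root).rglob("*.cs")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)
        else:
            seen[digest] = path
    return list(seen.values()), duplicates
```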

Hope this helps... Sorry for making things so complicated. Releasing/redistributing code of various licenses requires a lot of legal effort for any company and some open-source licenses make things even harder.

Let us know if we can be of more help :)

Yes, thanks, it helps a lot!