microsoft/graph-based-code-modelling

Implementation of VarNaming task from ICLR '18

Closed this issue ยท 21 comments

Hi,

My name is Alex Haigh, and I'm a master's student at Stanford. For a project, I'm working to reproduce (and hopefully extend) some of your results on VarNaming from your '18 ICLR paper. My understanding of your model for that task is that you:

  1. Replace each instance of your target variable with a <SLOT> token
  2. Represent every other variable as a concatenation of the average of the (learnable) embedding for each subtoken in its name and the type of the variable (as described on the top of p. 5) here
  3. Run message passing with a GGNN for 8 timesteps, using the program graph
  4. Average the final representation of every <SLOT> token
  5. Use this as input to a GRU decoder that outputs the variable name as a sequence of subtokens.

I found the dataset here, and it looks like it's in the format digestible by utils/tensorise.py. Similarly, the model you use for VarNaming seems to be the Graph2SeqModel.

So, is this all you need to do to reproduce the results?

  • run utils/tensorise.py --model graph2seq on the dataset published in ICLR '18
  • train a graph2seq model on the dataset using utils/train.py PATH_TO_TENSORISED_GRAPHS --model graph2seq

Just wanted to make sure I'm looking in the right place, and would also appreciate any other tips you have. Also, what modifications did you make to the model based on Cvitovic et al? And is there a way you can compare results with/without those modifications?

Thanks!

mmjb commented

Heya,

This sounds mostly right (there may be some patching required in tensorise to look at the variable name instead of the target expression). The code released here is indeed quite similar to what we used for the ICLR'18 model (it's the basis before some refactorings / extensions for the ICLR'19 paper). The Graph2SeqModel is a bit different from what we used, but you should be able to retool it fairly easily.

The only obvious problem I see is that I removed the subtokenization bits at some point (when introducing the CharCNNs for node labels). If you really want to reproduce the original result, you would need to patch this back in (essentially, look for uses of the cg_node_label_embedding_style hyperparameter to find places you need to modify). Apart from that, I see no obvious issue, but I also haven't tried this ... so good luck, and come back with questions? :-)

Re the Cvitovic et al. extensions: These are controlled by the cg_add_subtoken_nodes hyperparameter (for adding the meta-nodes for each used subtoken) and the cg_node_label_embedding_style
hyperparameter (which is CharCNN for their best-performing configuration).

I think we also added residual GGNN connections, which you should be able to turn off with cg_ggnn_residual_connections: {}.

Good luck!

Thanks for your help here! Seriously appreciate it.

Right now, I'm working with tensorize.py to process the data, and I'm getting an error in the _load_metadata_from_sample function in contextgraphmodel.py. It seems like the context graphs in the dataset don't have EdgeValues fields, so the program crashes at the line

for edge_type, values in raw_sample['ContextGraph']['EdgeValues'].items():

After poking around a bit, I'm also unsure what the cg_edge_value_sizes field of the metadata does, so I'm not sure how to patch this up. Any ideas? Did you change the way you stored the graphs between papers? An example of the program graph in the old dataset is here

mmjb commented

Awesome. As you mentioned in the first post, the decoder needs to look at the variable name instead of the expression that you're trying to generate. In SeqDecoder.load_metadata_from_sample, it looks at the SymbolLabels field of the sample, which doesn't exist in the graph samples I have.

From the sample graph in the repo, SymbolLabels is just keeping a count of how often each subtoken appears in an answer in order to specify the output vocabulary. The most similar thing in the dataset from ICLR '18 is the field called SymbolCandidates, which per the description is a:

List of type-correct in-scope variables that can be used in the considered slot. Each variable represented by a dictionary in which "SymbolDummyNode" identifies the node in "ContextGraph" corresponding to the candidate, "IsCorrect" marks if the candidate is the correct choice, and "SymbolName" gives the name of the variable candidate [Unneeded for task, used for debugging].

My plan is to just find the correct variable name and split it up into subtokens based on camelCase (as you say in the paper: the name inputStreamBuffer is treated as the sequence [input, stream, buffer]). This seems reasonable, but I want to get your input first to make sure that is in fact what you do. Also, is there any chance you've already written a utility that tokenizes strings into camelCase somewhere?

Looking forward to training time, there's also a Productions field that gets used in SeqDecoder.load_data_from_sample and in collect_token_seq (in model.py). This seems like an artifact of expanding your program graph based on production rules and is unnecessary for the simpler VarNaming task - can I just get rid of that bit and replace it with the actual answer tokens?

Thanks again for the help!

mmjb commented

Just wanted to say thanks again for the help here - we have the pipeline working and were able to overfit a small dataset. Now training on the full dataset!

mmjb commented

Yeah, if you could share that, that'd be great! Although rather than re-compiling the dataset, it might be easier just to traverse the existing graph instance and replace instances of the answer token (if that makes any sense).

My edits are a fork of this repo, and when I get a chance I'll make a little readme explaining my changes/the steps I did to make it work.

Hey Marc,

Just following up here - would love suggestions on the easiest way to create a dataset like the one you used for VarNaming.

Also: do you remember the hyperparameters you used when training the VarNaming model (namely how many epochs)? The default patience value (5) gives up after ~20 epochs since there isn't any improvement from epochs 15-20 when I run it.

mmjb commented

Heya,

I've just pushed a new branch (dev/mabrocks/varnaming) with the data extractor for variable naming (copy & pasted from our internal repo from back then).

The problem with using the existing data is that the graphs we are extracting are truncated. More precisely, we have the notion of a "hole" (e.g., the removed expression) and we then copy a subgraph around this hole, based on the idea that the GNN in the end will only use K=8 propagation steps anyway. For VariableNaming, we have more holes, corresponding to different uses of the variable, and so the graphs are richer than what would you get from just using the dataset we generated for expression generation...

Re hypers, @mallamanis should know the details...

I believe that we used the default hyperparameters here. I am sure that a more exhaustive hyper-optimization could have helped here.

Note, that for variable naming we have used a GRU decoder that generated the variable name per-subtoken. This isn't included in this repo.

Thanks for all the help (and the clarifications on the nuances of your dataset), guys! I'll check out what you did with the data extractor and look a little more into doing an exhaustive hyperparameter search. How often did you encounter graphs with diameter > 8 where this would make a difference?

@mallamanis - I thought the SeqDecoder in this repo did what you're saying. Did you use something different?

mmjb commented

Graphs with a substantial diameter are the norm -- this truncation is relevant for more or less every sample in the dataset. I'm currently at ICLR and am having a hard time to do proper hacking, but I can try to look into recreating the ICLR'18 VarNaming dataset next week.

Awesome - would be very much appreciated. Have fun at ICLR!

@haighal You're right! I missed that SeqDecoder could be used for variable naming too.

Great! Thanks for the confirmation.

I know it's a slightly different task than what you've published (I'm really doing variable naming with only 1 masked instance and thus a different/truncated subgraph), but I'm still hoping to get interesting results with this.

I trained the model pretty quickly, but testing is taking ages: I ran the test script with 4 parallel workers on Friday afternoon and it still isn't complete (it's now Tuesday at 4 pm here in California). Do you guys have any idea why it might be taking so long/where there might be inefficiencies in the test script? I'm not tensorizing the test graphs before running test.py, but I don't think that's the issue because model.test() seems to do that for us.

Alex

Also, after I can get it working with C# code, I'm hoping to apply a similar model to a Python dataset. This obviously limits the scope of edge types I can use, and I was planning to start with only the syntax edges (Child and NextToken) and not use variable type information (since it's unavailable in Python). Two questions on that end:

  1. What's the easiest way to remove the type hierarchy from the model? It seems like we can just set 'cg_node_type_embedding_size' = 0, but is there anywhere else in the tensorisation pipeline we'd need to make substantial modifications?

  2. How do you actually compute the NextToken edges from the AST? I found this in the paper but am still a bit unsure (do you just arbitrarily link the terminals based on the order in which they're traversed?):

We use Child edges to connect nodes according to the AST. As this does not induce an order on children of a syntax node, we additionally add NextToken edges connecting each syntax token to its successor. An example of this is shown in Fig. 2a.

Thanks!

mmjb commented

Getting the NextToken edges right is indeed a bit harder to do with the Python parser (in Roslyn, we can (a) rely on the fact that the source code visitor visits things in the right order, see

private void AddToken(SyntaxNode parentNode, SyntaxToken token)
)

However, Patrick, whom Miltos and I worked with on a recent ICLR paper, also released Python2Graph code, so you might want to look into simply reusing that:
https://github.com/CoderPat/structured-neural-summarization/blob/master/parsers/sourcecode/barone/ast_graph_generator.py

Re type embeddings: Setting the size to 0 should just work, but it may require making a few things more robust in the tensorisation pipeline (i.e., access to the types would need to become optional). However, this data is just passed through, so the changes required should be fairly minor.

Marc

Perfect, thanks for sharing! And do you have any ideas on why testing might be so slow (two posts above)?

mmjb commented

Oh, sorry, yes. I don't have a clear idea, but I have a hunch. Concretely, I observed in the experiments for the ICLR'19 paper that the sequence decoder would get slower with time; in the end, I think I traced it to the tensorflow session collecting more and more constants or something. This problem resolved itself when I introduced per-file parallelisation (https://github.com/microsoft/graph-based-code-modelling/blob/master/Models/utils/test.py#L53) because that is creating a new model copy (with its own tf.Session) for every file that it's testing on. Now, my test files happened to be quite small; if yours are large, the problem may continue to exist for you...

So, to make that actionable:
(1) Run test from the start, track how fast it's making progress. If you see that it's getting slower: Yay, you're running into this issue.
(2) If this is indeed your problem, instead of principled fix, you can just take test test files and quickly split them into smaller files and everything should magically work...

Hey Marc,

Just wanted to say thanks again for your very helpful comments here (that was indeed the issue I was having at test time). We got an implementation working on Python ASTs using only the Syntax edges and got results pretty similar to what you reported in your ablation from the ICLR '18 paper!

Accuracy@1: 34.5671%
Accuracy@3: 42.4864%
Accuracy@5: 45.6902%

The code lives at https://github.com/haighal/graph-based-code-modelling/ and has a thorough summary of both (1) our modifications to your code base and (2) the steps to generate our results. Let me know if you have any questions - my email is haighal [at] cs.stanford.edu

Cheers,
Alex