Cannot try new RFC
Hi, I tried to use the code to process new RFCs, but it didn't work.
For example, I just tried IPv6, which was included in the "rfcs-original" folder. I used the command: python3 nlp-parser/preprocess.py --protocol IPv6
It seems that I ran into two problems: OpenNLP and the code itself. The OpenNLP problem may have nothing to do with your code, but could it be the reason the code didn't work when processing IPv6? Or is there something wrong in the code?
Hi @SongyuanSui , thanks for opening this issue. Currently we are only committing to fixing issues relating to the reproduction of our results, as we approach the Oakland presentation date. So, we might not be able to invest much time into fixing this for you at this time. However, I'm happy to provide some light guidance and maybe the issue is quickly fixable.
Did you actually make rfcs-annotated-tidied/IPv6.xml? At first glance, I think you need to modify tidy_documents.py to include IPv6, then run it to generate the tidied version in rfcs-annotated-tidied, before you can try doing the FSM extraction.
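Concretely, the protocol list in tidy_documents.py is hard-coded, so the change would be roughly this (sketch only, untested; it assumes an rfcs-annotated/IPv6.txt exists for the loop to pick up):

```python
# tidy_documents.py -- add the new RFC to the hard-coded protocol list
for protocol in ["BGPv4", "DCCP", "LTP", "PPTP", "SCTP", "TCP", "IPv6"]:
    input_file = "rfcs-annotated/{}.txt".format(protocol)
    output_file = "rfcs-annotated-tidied/{}.xml".format(protocol)
    # ... existing tidying logic ...
```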
Hi Max. I found these lines in tidy_documents.py:
for protocol in ["BGPv4", "DCCP", "LTP", "PPTP", "SCTP", "TCP"]:
    input_file = "rfcs-annotated/{}.txt".format(protocol)
    output_file = "rfcs-annotated-tidied/{}.xml".format(protocol)
According to them, the input files have already been annotated before this step. I wonder how to annotate the raw RFCs in the "rfcs-original" folder to generate the annotated RFCs in the "rfcs-annotated" folder?
Thank you for your help!
Oh, gotcha. The files in rfcs-annotated are annotated by hand. What you want to do is annotate automatically; to do this you should use one of our models, e.g., our linear model. You'll need to modify the code to allow the RFC you want to use. Then take a look at the example targets, e.g., here. Once you have this working, it'll write an annotated RFC to the --outdir of your choice, which you can then find and feed into the FSM extraction pipeline, e.g., like this. If you want to do attack synthesis you will need to do additional steps, e.g., writing LTL properties.
I see. But I found that the input files of these models come from the "rfcs-bio" folder. Even the RFC file that will be automatically annotated by the model also comes from this folder. It seems that the files in the "rfcs-bio" folder were generated by preprocess.py.
Does this mean that if I want to try a new RFC, I need to annotate it by hand first, then use preprocess.py to get its BIO form, and finally use the BIO form as the model's input, so that the model can annotate it automatically?
You definitely don't need to annotate by hand first. I'll try to take a look in the coming days .... I don't know the pre-processing code very well unfortunately.
Great, thank you very much!
Hi Songyuan, you would need to adapt the code so that it works for a new RFC file. Currently, prediction and testing are done together, so an option needs to be added to predict and generate an output without trying to test anything (testing is what would require manual annotations). And yes, you will need to run all the preprocessing steps so that you can generate all the intermediate files.
Currently, I do not have a written manual for this, but I can take a look in about a week or two.
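To sketch the kind of option I mean (hypothetical flag and structure; the real scripts are organized differently):

```python
import argparse

# Hypothetical sketch of a predict-only mode: separate "predict and write
# output" from "score against gold annotations", which only exist for the
# manually annotated RFCs.
parser = argparse.ArgumentParser()
parser.add_argument("--protocol", required=True)
parser.add_argument("--outdir", required=True)
parser.add_argument("--predict_only", action="store_true",
                    help="write predictions without evaluating against annotations")
args = parser.parse_args()

# ... load the trained model and the preprocessed (BIO) input for args.protocol ...
# ... run prediction and write the annotated XML to args.outdir ...

if not args.predict_only:
    # only possible when gold annotations exist for this RFC
    # ... compare predictions against the annotations and report metrics ...
    pass
```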
Hi Maria, thank you very much! Looking forward to your manual.
Hello, I am trying to predict on a new RFC document and have run into some problems. I would like to ask how to solve them.

1. In the training and test code, namely bert_bilstm_crf.py, the annotated data is used, which is divided into control and chunk. According to the division method in the paper, chunk is split according to punctuation and reserved words; I wonder how control is selected when labeling. When making predictions for a new RFC document, which paragraphs in the text should be labeled as control?
2. In the training and test code there are only 7 tags, and state and event are extracted from the annotated XML file. So for a new RFC document, how can I extract this content? Do I need to manually extract it from the text?
3. The parameters level_h and level_d are used in the write_result function when the predicted result is written as an XML file, and they are also obtained from the annotated data. How do I get these values for a new RFC document?
4. In the training and testing code, feature_size depends on the vocab_size and pos_size of the RFC documents that participated in training and testing, and the model depends on feature_size. Does this mean that each prediction for a new RFC document requires training a new model? In addition, if the whole document is used as input (the annotated data is only a small part of the document), it will contain many words, increasing vocab_size and feature_size, which in turn increases the number of model parameters and the training and prediction time. Is there any solution? (Or should I remove the features related to vocabulary and POS?)
Hi Lyonelyl,

1. Control was annotated manually, based on the text that was relevant. During training, we rely on this information being available. In this paper, we did not tackle the problem of predicting control scopes. We currently do not support new RFCs out of the box using our code base, but it is on our to-do list (all authors are busy with other projects, so it has taken some time). You could try guessing control statements using indentation or other heuristics (see the rough sketch after this list), or annotate them.
2. We assume this information is given to us in advance. So yes, you need to provide it manually.
3. See number 1. I can let you know once we get around to adapting our code. Alternatively, you can provide some annotations yourself for your target RFC.
4. Not necessarily; you could use the same trained model to predict multiple RFCs, but feature_size would be decided based on your training data. In our code we did K-fold cross-validation, which is why we trained the model multiple times. Yes, the downside of those types of features is that they grow with the number of words that you use. You can choose to use them or not, or you could modify the code so that you only include words based on some criterion (for example, words that appear at least N times).

Hope that helps.
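For example, a very crude indentation heuristic (illustrative only, not what the released code does) might look like this:

```python
# Rough sketch: guess control scopes from indentation. A line indented deeper
# than the currently open block opens a nested <control>; dedenting closes
# blocks. Not the project's actual implementation.
def guess_controls(lines):
    out, stack = [], []               # stack holds open indentation levels
    for line in lines:
        if not line.strip():          # keep blank lines as-is
            out.append(line)
            continue
        indent = len(line) - len(line.lstrip())
        while stack and indent < stack[-1]:
            out.append("</control>")
            stack.pop()
        if not stack or indent > stack[-1]:
            out.append("<control>")
            stack.append(indent)
        out.append(line)
    out.extend("</control>" for _ in stack)
    return out

if __name__ == "__main__":
    sample = [
        "In the CLOSED state:",
        "   if an ACK is received,",
        "      send a RST and remain in CLOSED.",
    ]
    print("\n".join(guess_controls(sample)))
```

Something like this would at least give the pipeline outer and nested control blocks to work with, at the cost of noisy scopes.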
Hello, Maria. Thank you so much for answering my questions!
Does this mean that if I want to predict on a new RFC document, I need to do some manual annotation:

- manually extract the states and events related to the state machine from the RFC document;
- find the paragraphs related to the state machine and mark them as control;

and then the model predicts the type of each chunk in control?
Another problem is that the parameters level_h and level_d seem to generate the indentation when writing the XML, whereas the parameters are obtained from the annotated XML. Since I haven't looked closely at the code that generates the FSM from the intermediate representation, I wonder whether these parameters are necessary to generate the FSM. Can I simply set them to 0?
Thanks a lot!
I don't think any manual annotation should be necessary - @mlpacheco can confirm.
Hi Max. If, as you say, no annotations are needed, are there any tool files that can help me convert a plain-text RFC file into input for the model, e.g. splitting the text into controls and chunks and getting the other parameters needed to write the results?
Actually, Lyonelyl is right.
The way things are set up right now, you need some annotations -- the state and event definitions (not each reference), and at least one outer control block (and recursive controls when appropriate). These could be either provided manually or guessed. For example, you could exploit the structure of the RFC and just wrap each paragraph or section in a control block (this is kind of what I had in mind to support this feature).
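As a rough illustration of that idea (not something the repo currently does), you could split the RFC text on blank lines and wrap each paragraph:

```python
# Illustrative sketch only: wrap every blank-line-separated paragraph in a
# single <control> block, as a crude stand-in for manually annotated scopes.
def wrap_paragraphs(rfc_text):
    paragraphs = [p for p in rfc_text.split("\n\n") if p.strip()]
    return "\n\n".join("<control>\n{}\n</control>".format(p) for p in paragraphs)
```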
When I developed this, I never intended for it to be used out of the box for any RFC, as we were just interested in our own experimentation. It could be adapted for this, but it requires some work.
Level d and level h are going to be crucial in getting things right when there are recursive statements. I'll take a closer look tomorrow morning to give you an answer on how to use them, because you should be able to (I don't remember off the top of my head).
Hello, Maria! Thank you so much for answering my question in your busy schedule!
I tried setting the values of both level_h and level_d to 0. This causes the output XML file and the FSM image to be quite different from the original result, and each control in the XML is independent. As you say, these two parameters are helpful for recursive or nested statements, and I think they have a big impact on the FSM extraction.
I understand from the preprocess_phrases.py file that level_h and level_d represent horizontal and depth features respectively, but I don't fully understand what they mean yet. Can you give a simple explanation or a small example to help me understand and extract these features from the text myself?
Thank you very much!
preprocess_phrases relies on control annotations. For example, every time there is a control inside another control, the level_d feature goes up by one, and once it exits the inner control the level_d feature decreases again (hence depth). level_h tracks the start of new control statements inside an outer control block, irrespective of depth. For example, if there are two consecutive controls at the same depth, they will have different level_h (but the same level_d). The number is always increasing.
Quick example:

<control> h 0, depth 0
  <control> h 1, depth 1
    <control> h 2, depth 2
    </control> h 2, depth 2
    <control> h 3, depth 2
    </control> h 3, depth 2
  </control> h 1, depth 1
</control> h 0, depth 0
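If it helps, here is a small standalone illustration (not the repo's code) of how the two counters evolve over a stream of control open/close tags; it reproduces the example above:

```python
# Standalone illustration (not the project's code): level_d follows nesting
# depth, level_h increments on every new <control> and never resets; a closing
# tag repeats the values of the block it closes.
def levels(tags):
    next_h, depth = 0, -1
    stack, out = [], []               # stack holds (h, depth) of open controls
    for tag in tags:
        if tag == "<control>":
            depth += 1
            stack.append((next_h, depth))
            next_h += 1
            out.append((tag,) + stack[-1])
        else:                         # "</control>"
            h, d = stack.pop()
            out.append((tag, h, d))
            depth -= 1
    return out

tags = ["<control>", "<control>", "<control>", "</control>",
        "<control>", "</control>", "</control>", "</control>"]
for tag, level_h, level_d in levels(tags):
    print(tag, "h", level_h, "depth", level_d)
```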
When these annotations don't exist, you will need to find methods to guess them -- we added a candidate solution for this on issue #4.
Hi Lyonelyl,

1. Control was annotated manually, based on the text that was relevant. During training, we rely on this information being available. In this paper, we did not tackle the problem of predicting outer control scopes, and our solution during inference was to "guess" control statements based on indentation. We currently do not support new RFCs out of the box using our code base, but it is on our to-do list (all authors are busy with other projects, so it has taken some time). You could take a similar approach by guessing control statements using indentation or other heuristics, or annotate them.
2. We assume this information is given to us in advance. So yes, you need to provide it manually.
3. See number 1. I can let you know once we get around to adapting our code. Alternatively, you can provide some annotations yourself for your target RFC. Note that you do not need to annotate nested controls, as the current code guesses this for you, but you do need to at least signal the outer block.
4. Not necessarily; you could use the same trained model to predict multiple RFCs, but feature_size would be decided based on your training data. In our code we did K-fold cross-validation, which is why we trained the model multiple times. Yes, the downside of those types of features is that they grow with the number of words that you use. You can choose to use them or not, or you could modify the code so that you only include words based on some criterion (for example, words that appear at least N times).

Hope that helps.
Hello @mlpacheco, you mention "nested controls"; does that mean I just need to annotate the outermost tag, as follows?
For the FSM extraction you should do multiple levels of control. I recommend you experiment with this a bit to get a sense of how it works, and also read the code (which is not very complicated, and is explained in both prose and pseudocode in the paper). For most realistic protocols you end up needing multiple levels of control simply because of the complex structure of the English in the document. The gold annotations also give good examples of this. Maria can probably comment more later when she has time.
Sorry, I'm new to machine learning. Thank you so much for your reply!
Do you mean that if I want to try a new RFC (such as DHCP), all the tags (action, trigger, timer, ...) in rfcs-annotated/DHCP.txt need to be manually annotated, or do I just need to annotate the state, variables, timer, and control?
In the above discussion and the paper, there is no mention of how to annotate action, trigger, transition, etc. Are they annotated automatically or manually?
Hi @cheif-zyo -- if you just want to predict on a new protocol RFC, you would only need to annotate def_event, def_state and control (recursively when appropriate, or just the outer control block, using the solution outlined in issue #4 to guess recursive controls). If you want to test on a new protocol and recover prediction performance, then you would need to annotate everything so that predictions and annotations can be contrasted.
This code base was not designed to adapt seamlessly to new protocols, as it was dedicated to our own experimentation. If you would like to support this, you would need to come up with a way to "guess" outer control statements and def_states, def_events. We have plans to add something like this to this repository, but it is not a priority at the moment.
Hi @mlpacheco,
Thank you so much for answering my question in your busy schedule!
As you say, def_states, def_events and the outer control must be annotated by hand. But I still have a question about whether args also need manual annotation:

    fp_vars = open("rfcs-definitions/def_vars.txt", "w")
    for i, var_def in enumerate(xml.iter('def_var')):
        fp_vars.write("{}\t{}\n".format(protocol, var_def.text))

Or do they not matter because they are not explicitly referenced by annotations in the rest of the text?
Args do not need manual annotations, they are identified using an off-the-shelf semantic role labeler.