How to get the restaurant.npz file from the raw XML files
zhangshaolei1998 opened this issue · 9 comments
Thank you for the code.
How did you generate the restaurant.npz file from the raw XML files? What are the internal format and ordering of restaurant.npz?
Thanks again!
Good question. You need to:
1. Write an XML parser that retrieves both the review sentences and the character-level spans.
2. Tokenize the review sentences and align the character-level spans to token level: define a simple state machine that transitions among B, I, and O, then output a token-level label by scanning each character and checking whether it falls within a character-level span, changing the output label accordingly. (The BERT implementation for SQuAD, which maps character-level answers to token spans, could be a good starting point.)
3. Build a vocabulary for all tokens, indexing them with integers, and save all inputs/outputs as NumPy arrays.
4. Save them into a .npz file via np.savez().
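A minimal sketch of step 2 (character-span to BIO alignment). The function name, the whitespace tokenizer, and the half-open span convention are my own illustrative assumptions, not the repo's actual code:

```python
def char_span_to_bio(sentence, span_start, span_end):
    """Return (tokens, labels) with one 'B'/'I'/'O' label per token.

    span_start/span_end are character offsets (end exclusive) of the
    aspect term inside `sentence`.
    """
    tokens, labels = [], []
    pos = 0
    inside = False  # state: are we currently inside the aspect span?
    for tok in sentence.split():
        start = sentence.index(tok, pos)  # char offset of this token
        end = start + len(tok)
        pos = end
        if start >= span_end or end <= span_start:
            labels.append('O')  # token lies entirely outside the span
            inside = False
        elif not inside:
            labels.append('B')  # first token that overlaps the span
            inside = True
        else:
            labels.append('I')  # continuation of the span
        tokens.append(tok)
    return tokens, labels

sent = "the fried rice is amazing"
# span chars 4..14 cover "fried rice"
print(char_span_to_bio(sent, 4, 14))
# → (['the', 'fried', 'rice', 'is', 'amazing'], ['O', 'B', 'I', 'O', 'O'])
```

A real implementation also has to handle subword tokenization and punctuation, which is where the SQuAD char-to-token mapping code mentioned above becomes useful.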
Our evaluation code somehow reverses this process back to XML format.
Thank you very much.
Sorry for reopening the issue. I am trying to reproduce the experiments, and it is still not clear to me. For example, after performing steps 1 and 2, I know that a token 'word' occurs in the dataset 20 times; among those occurrences it has the B tag 10 times, the I tag 6 times, and the O tag 4 times. How can I turn this into the format of the vocabulary in the 'restaurant.npz' file? For example, train_X and train_y there have length 2000, every element in both lists has length 83, the elements in X look like [33 38 235 5687 19 0 0 ... 0], and the elements in y look like [0 1 2 0 0 0 ... 0]. How do we get such a format? I would be very grateful for your help!
Sorry for the delayed reply. I am just back from a long trip.
2000 should be the number of training examples.
First, build a vocabulary mapping words to indices (a dictionary). As the training data is small, every word is in the vocabulary, for example: {'<pad>': 0, '<unk>': 1, 'I': 2, 'am': 3, ..., 'interested': 20, ...}.
Given tokenized sequences [['I', 'am', 'interested', 'in', 'mac'], ['I', 'am', ...], ...], look up the dictionary and turn them into [[2, 3, 20, ..., 0, 0], [2, 3, ..., 0], ...] (the extra positions up to length 83 are padded with 0).
y is similar, but uses another dictionary: {'O': 0, 'B': 1, 'I': 2}.
Hope this helps.
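The indexing and padding described above can be sketched as follows. MAX_LEN = 83 and the label dictionary come from this thread; the special tokens, function names, and the tiny toy corpus (whose word indices will differ from the real dataset's) are illustrative assumptions:

```python
import numpy as np

MAX_LEN = 83
label_idx = {'O': 0, 'B': 1, 'I': 2}

def build_vocab(tokenized_sents):
    # assumed special tokens: 0 doubles as the padding index
    vocab = {'<pad>': 0, '<unk>': 1}
    for sent in tokenized_sents:
        for tok in sent:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(sent, labels, vocab):
    # fixed-length int arrays; positions past the sentence stay 0 (pad)
    x = np.zeros(MAX_LEN, dtype=np.int64)
    y = np.zeros(MAX_LEN, dtype=np.int64)
    for i, (tok, lab) in enumerate(zip(sent, labels)):
        x[i] = vocab.get(tok, vocab['<unk>'])
        y[i] = label_idx[lab]
    return x, y

sents = [['I', 'am', 'interested', 'in', 'mac']]
labels = [['O', 'O', 'O', 'O', 'B']]
vocab = build_vocab(sents)
X = np.stack([encode(s, l, vocab)[0] for s, l in zip(sents, labels)])
Y = np.stack([encode(s, l, vocab)[1] for s, l in zip(sents, labels)])
np.savez('restaurant.npz', train_X=X, train_y=Y)  # final .npz step
```

Here X has shape (num_examples, 83), matching the arrays described in the question.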
Thank you very much!
A slightly off-topic question: what is the connection between *.npz and *_raw_test.json? And what is word_idx.json used for?
word_idx.json is a dictionary mapping words to their indices in the .npz file.
_raw_test.json is used to map the indexed words back to text for evaluation.
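A minimal sketch of how word_idx.json could be used to map the integer rows back to text. The file names follow this thread, but the toy dictionary (shown inline instead of loaded from disk) and the inversion logic are assumptions:

```python
# In practice the mapping would be loaded from disk, e.g.:
#   import json
#   word_idx = json.load(open('word_idx.json'))
word_idx = {'<pad>': 0, 'I': 2, 'am': 3, 'interested': 20}  # toy example

# invert word -> index into index -> word
idx_word = {i: w for w, i in word_idx.items()}

def decode(row):
    # skip the padding index 0 and map each id back to its word
    return ' '.join(idx_word[i] for i in row if i != 0)

print(decode([2, 3, 20, 0, 0]))  # → "I am interested"
```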
@howardhsu Thanks a lot. Could you release the code of those steps? It's really helpful to understand how the data is constructed.
There's a draft implementation you can start from: https://github.com/howardhsu/BERT-for-RRC-ABSA/tree/master/preprocessing