howardhsu/DE-CNN

How do you get the restaurant.npz file from the original xml files?

zhangshaolei1998 opened this issue · 9 comments

Thank you for the code.
Could you explain how you generated the restaurant.npz file from the original xml files? What are the internal format and ordering of restaurant.npz?

Thanks again!

Good question. You need to:
1. write an XML parser, retrieving both the review sentences and the character-level aspect spans;
2. tokenize the review sentences and align the character-level spans to the token level: define a simple state machine that transitions among B, I, and O, scanning each character and checking whether it falls within a character-level span, changing the output label accordingly (see the sketch after this list; I noticed the BERT implementation of SQuAD, which maps character-level answers to token spans, could be a good starting point);
3. build a vocabulary for all tokens, indexing them with integers, and save all inputs/outputs as numpy arrays;
4. save them into a .npz file via np.savez().
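A rough sketch of steps 1 and 2, assuming a SemEval-2014-style XML layout (`<sentence><text>...</text>` with `<aspectTerm term="..." from="..." to="..."/>` character offsets) and a naive whitespace tokenizer; this is not the repo's actual preprocessing code:

```python
# Minimal sketch, not the authors' pipeline. Assumes SemEval-2014-style XML
# and a naive tokenizer; partial token/span overlaps are not handled.
import xml.etree.ElementTree as ET

def parse_xml(path):
    """Return (sentence_text, [(from, to), ...]) pairs with char-level spans."""
    examples = []
    for sent in ET.parse(path).getroot().iter('sentence'):
        text = sent.find('text').text
        spans = [(int(t.get('from')), int(t.get('to')))
                 for t in sent.iter('aspectTerm')]
        examples.append((text, spans))
    return examples

def char_spans_to_bio(text, spans, tokens):
    """Align char-level spans to token-level B/I/O labels by scanning each
    token's character offsets against the aspect spans."""
    labels = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # char offset of this token
        end = start + len(tok)
        pos = end
        label = 'O'
        for (f, t) in spans:
            if start >= f and end <= t:
                # inside a span: B if the token starts the span, else I
                label = 'B' if start == f else 'I'
        labels.append(label)
    return labels

# Toy example:
text = "The staff was rude."
spans = [(4, 9)]                          # character span of "staff"
tokens = text.replace('.', ' .').split()  # naive whitespace tokenizer
print(char_spans_to_bio(text, spans, tokens))  # ['O', 'B', 'O', 'O', 'O']
```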

Our evaluation code somehow reverses this process back to XML format.

Thank you very much.

Sorry for reopening the issue. I am trying to reproduce the experiments, and it is still not clear to me. For example, after performing steps 1 and 2, I know that a token 'word' occurs in the dataset 20 times; of those occurrences, 10 have the B tag, 6 the I tag, and 4 the O tag. How can I turn this into the format of the vocabulary in the 'restaurant.npz' file? For example, train_X and train_y there have length 2000, every element in both lists has length 83, the elements in X look like: [ 33 38 235 5687 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], and the elements in y look like: [0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]. How do we get this format? I would be very grateful for your help!

Sorry for the delayed reply. I am just back from a long trip.

2000 should be the number of training examples.
First, build a word-to-index vocabulary (a dictionary). As the training data is small, every word is in the vocabulary, for example: {'<pad>': 0, '<unk>': 1, 'I': 2, 'am': 3, ..., 'interested': 20, ...}.
Given tokenized sequences [ ['I', 'am', 'interested', 'in', 'mac'], ['I', 'am', ...], ... ], look up the dictionary and turn them into [ [2, 3, 20, ..., 0, 0], [2, 3, ..., 0], ... ] (the extra positions up to length 83 are padded with 0).

y is similar but uses another dictionary: {'O': 0, 'B': 1, 'I': 2}.
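A minimal sketch of this lookup-and-pad step (the toy vocabulary, the '<pad>'/'<unk>' entries, and the train_X/train_y key names are illustrative assumptions, not necessarily the repo's actual code):

```python
# Sketch of vocabulary lookup, zero-padding to a fixed length, and np.savez.
import numpy as np

MAX_LEN = 83                                # max sentence length in the data
word_idx = {'<pad>': 0, '<unk>': 1, 'I': 2, 'am': 3, 'interested': 20}  # toy vocab
label_idx = {'O': 0, 'B': 1, 'I': 2}

def encode(tokens, labels):
    x = np.zeros(MAX_LEN, dtype=np.int64)   # index 0 doubles as padding
    y = np.zeros(MAX_LEN, dtype=np.int64)   # 'O' (= 0) doubles as padding
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        x[i] = word_idx.get(tok, word_idx['<unk>'])
        y[i] = label_idx[lab]
    return x, y

x, y = encode(['I', 'am', 'interested'], ['O', 'B', 'I'])
train_X = np.stack([x])                     # shape (num_examples, 83)
train_y = np.stack([y])
np.savez('restaurant.npz', train_X=train_X, train_y=train_y)
```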
Hope this helps.

Thank you very much!

A slightly off-topic question: what is the connection between *.npz and *_raw_test.json? And what is word_idx.json used for?

word_idx.json is a dictionary mapping each word to its index in the .npz file.
_raw_test.json maps the indexed words back to text for evaluation.
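For example, a hedged sketch of how word_idx.json could be used to decode a row of the .npz arrays back into text (the 'train_X' key name here is an assumption carried over from the discussion above):

```python
# Invert word_idx.json and map one indexed row back to tokens.
import json
import numpy as np

with open('word_idx.json') as f:
    word_idx = json.load(f)                     # {"word": index, ...}
idx_word = {i: w for w, i in word_idx.items()}  # invert the mapping

data = np.load('restaurant.npz')
row = data['train_X'][0]
tokens = [idx_word[int(i)] for i in row if i != 0]  # drop the 0 padding
print(' '.join(tokens))
```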

@howardhsu Thanks a lot. Could you release the code for those steps? It would really help in understanding how the data is constructed.