howardhsu/DE-CNN

How do you get the restaurant.npz file from the original xml files?

zhangshaolei1998 opened this issue · 9 comments

Thank you for the code.
Could you explain how you generated the restaurant.npz file from the original xml files? What are the internal format and ordering of restaurant.npz?

Thanks again!

Good question. You need to:
1. write an XML parser, retrieving both the review sentences and the character-level aspect spans;
2. tokenize the review sentences and align the character-level spans to the token level: define a simple state machine that transitions among B, I, and O, scanning each character and checking whether it falls within a character-level span, changing the output label accordingly (see the sketch after this list; I noticed the BERT implementation of SQuAD, which maps character-level answers to token spans, could be a good starting point);
3. build a vocabulary for all tokens, indexing them with integers, and save all inputs/outputs as numpy arrays;
4. save them into a .npz file via np.savez().
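A rough sketch of steps 1 and 2, assuming a SemEval-2014-style XML layout (`<sentence><text>...</text>` with `<aspectTerm term="..." from="..." to="..."/>` character offsets) and a naive whitespace tokenizer; this is not the repo's actual preprocessing code:

```python
# Minimal sketch, not the authors' pipeline. Assumes SemEval-2014-style XML
# and a naive tokenizer; partial token/span overlaps are not handled.
import xml.etree.ElementTree as ET

def parse_xml(path):
    """Return (sentence_text, [(from, to), ...]) pairs with char-level spans."""
    examples = []
    for sent in ET.parse(path).getroot().iter('sentence'):
        text = sent.find('text').text
        spans = [(int(t.get('from')), int(t.get('to')))
                 for t in sent.iter('aspectTerm')]
        examples.append((text, spans))
    return examples

def char_spans_to_bio(text, spans, tokens):
    """Align char-level spans to token-level B/I/O labels by scanning each
    token's character offsets against the aspect spans."""
    labels = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)   # char offset of this token
        end = start + len(tok)
        pos = end
        label = 'O'
        for (f, t) in spans:
            if start >= f and end <= t:
                # inside a span: B if the token starts the span, else I
                label = 'B' if start == f else 'I'
        labels.append(label)
    return labels

# Toy example:
text = "The staff was rude."
spans = [(4, 9)]                          # character span of "staff"
tokens = text.replace('.', ' .').split()  # naive whitespace tokenizer
print(char_spans_to_bio(text, spans, tokens))  # ['O', 'B', 'O', 'O', 'O']
```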

Our evaluation code somehow reverses this process back to XML format.

Thank you very much.

Sorry for reopening the issue. I am trying to reproduce the experiments, and it is still not clear to me. For example, after performing steps 1 and 2, I know that a token 'word' occurs in the dataset 20 times; of those occurrences, 10 have the B tag, 6 the I tag, and 4 the O tag. How can I turn this into the format of the vocabulary in the 'restaurant.npz' file? For example, train_X and train_y there have length 2000, every element in both lists has length 83, the elements in X look like: [ 33 38 235 5687 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0], and the elements in y look like: [0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]. How do we get this format? I would be very grateful for your help!

Sorry for the delayed reply. I am just back from a long trip.

2000 should be the number of training examples.
First, build a word-to-index vocabulary (a dictionary). As the training data is small, every word is in the vocabulary, for example: {'<pad>': 0, '<unk>': 1, 'I': 2, 'am': 3, ..., 'interested': 20, ...}.
Given tokenized sequences [ ['I', 'am', 'interested', 'in', 'mac'], ['I', 'am', ...], ... ], look up the dictionary and turn them into [ [2, 3, 20, ..., 0, 0], [2, 3, ..., 0], ... ] (the extra positions up to length 83 are padded with 0).

y is similar but uses another dictionary: {'O': 0, 'B': 1, 'I': 2}.
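A minimal sketch of this lookup-and-pad step (the toy vocabulary, the '<pad>'/'<unk>' entries, and the train_X/train_y key names are illustrative assumptions, not necessarily the repo's actual code):

```python
# Sketch of vocabulary lookup, zero-padding to a fixed length, and np.savez.
import numpy as np

MAX_LEN = 83                                # max sentence length in the data
word_idx = {'<pad>': 0, '<unk>': 1, 'I': 2, 'am': 3, 'interested': 20}  # toy vocab
label_idx = {'O': 0, 'B': 1, 'I': 2}

def encode(tokens, labels):
    x = np.zeros(MAX_LEN, dtype=np.int64)   # index 0 doubles as padding
    y = np.zeros(MAX_LEN, dtype=np.int64)   # 'O' (= 0) doubles as padding
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        x[i] = word_idx.get(tok, word_idx['<unk>'])
        y[i] = label_idx[lab]
    return x, y

x, y = encode(['I', 'am', 'interested'], ['O', 'B', 'I'])
train_X = np.stack([x])                     # shape (num_examples, 83)
train_y = np.stack([y])
np.savez('restaurant.npz', train_X=train_X, train_y=train_y)
```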
Hope this helps.

Thank you very much!

A slightly off-topic question: what is the connection between *.npz and *_raw_test.json? And what is word_idx.json used for?

word_idx.json is a dictionary mapping each word to its index in the .npz file.
_raw_test.json maps the indexed words back to text for evaluation.
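For example, a hedged sketch of how word_idx.json could be used to decode a row of the .npz arrays back into text (the 'train_X' key name here is an assumption carried over from the discussion above):

```python
# Invert word_idx.json and map one indexed row back to tokens.
import json
import numpy as np

with open('word_idx.json') as f:
    word_idx = json.load(f)                     # {"word": index, ...}
idx_word = {i: w for w, i in word_idx.items()}  # invert the mapping

data = np.load('restaurant.npz')
row = data['train_X'][0]
tokens = [idx_word[int(i)] for i in row if i != 0]  # drop the 0 padding
print(' '.join(tokens))
```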

@howardhsu Thanks a lot. Could you release the code for those steps? It would really help in understanding how the data is constructed.