howardhsu/DE-CNN

How is the data organized in npz files?

Closed this issue · 2 comments

Hello, thank you for sharing the code.
It seems that the training data was processed and saved in an .npz file. Would you please explain how the data is organized in the .npz file? I found that the shape of train_X is [2895, 83] and the shape of valid_X is [150, 83]. What is the exact meaning of these shapes? Many thanks!

Thanks for your interest in our work.
The advantages of using a single npz file are as follows:
(1) fast binary data loading (e.g., no parsing as with json or pickle);
(2) one file for all: one line of code loads all the data, and you can pull out a specific piece later with a key string (e.g., ae_data['train_X']);
(3) low memory usage (most of the data stays on disk until it is accessed);
(4) the numpy data format is the closest to what almost all DL libraries expect.
Saving this file takes just one line:
np.savez('data.npz', train_X=train_X, train_y=train_y, ...)
(https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.savez.html)
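For concreteness, here is a small end-to-end sketch of saving and loading such a file. The key names mirror the ones mentioned in this thread, but the shapes and dtypes are only assumptions for illustration, not the repository's actual preprocessing code.

```python
import numpy as np

# Hypothetical arrays, shaped to mirror the keys discussed in this thread
# (per-token labels are an assumption; the real preprocessing may differ).
train_X = np.zeros((2895, 83), dtype=np.int64)   # padded word-index sequences
train_y = np.zeros((2895, 83), dtype=np.int64)   # per-token labels
valid_X = np.zeros((150, 83), dtype=np.int64)
valid_y = np.zeros((150, 83), dtype=np.int64)

# One call saves everything into a single binary file.
np.savez('data.npz', train_X=train_X, train_y=train_y,
         valid_X=valid_X, valid_y=valid_y)

# One line loads it back; np.load returns a lazy NpzFile, so each array is
# only read from disk when its key is accessed.
ae_data = np.load('data.npz')
print(ae_data['train_X'].shape)   # (2895, 83)
```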

Note that SemEval does not have a validation split, and the single .npz file keeps this setting.
However, a DL model generally needs a validation set, so we split off 150 validation examples from, e.g., ae_data['train_X'] and ae_data['train_y']. That is what leaves train_X with 2895 examples, I assume.
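To make the split concrete, something like the following would produce the 2895/150 training/validation arrays. This is only a sketch under the assumption that the full training set has 3045 examples; it is not the exact code used in the repository.

```python
import numpy as np

# Assume the full SemEval training set was loaded as full_X, full_y
# with 3045 examples (2895 + 150); these arrays are placeholders.
full_X = np.zeros((3045, 83), dtype=np.int64)
full_y = np.zeros((3045, 83), dtype=np.int64)

rng = np.random.RandomState(0)
perm = rng.permutation(len(full_X))

# First 150 shuffled indices become the validation set, the rest training.
valid_idx, train_idx = perm[:150], perm[150:]
valid_X, valid_y = full_X[valid_idx], full_y[valid_idx]
train_X, train_y = full_X[train_idx], full_y[train_idx]

print(train_X.shape, valid_X.shape)   # (2895, 83) (150, 83)
```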

An example is a sequence of word indexes. We set the maximum length to 83 (sequences shorter than 83 are padded with 0). Each index corresponds to a word in word_idx, so a sequence represents a tokenized sentence.
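As an illustration of what one row of shape [83] might contain, here is a hedged sketch that encodes a tokenized sentence using a made-up word_idx; the real vocabulary and encoding code in the repository may look different.

```python
import numpy as np

MAX_LEN = 83  # maximum sequence length used in the npz file

# Hypothetical vocabulary; in the real data, word_idx maps every word in the
# corpus to a positive integer, with 0 reserved for padding.
word_idx = {'the': 1, 'battery': 2, 'life': 3, 'is': 4, 'great': 5}

def encode(tokens, word_idx, max_len=MAX_LEN):
    """Turn a tokenized sentence into a 0-padded index sequence of length max_len."""
    idxs = [word_idx[w] for w in tokens[:max_len]]
    return np.pad(idxs, (0, max_len - len(idxs)), constant_values=0)

row = encode(['the', 'battery', 'life', 'is', 'great'], word_idx)
print(row.shape)   # (83,)
print(row[:7])     # [1 2 3 4 5 0 0]
```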

Hope this answers your question.

Clearly explained! Thank you so much!