abisee/cnn-dailymail

Note about chunking data

abisee opened this issue · 2 comments

For simplicity, we originally provided code to produce a single train.bin, val.bin and test.bin file for the data. However, in our experiments for the paper we split the data into several chunks, each containing 1000 examples (i.e. train_000.bin, train_001.bin, ..., train_287.bin). In the interest of reproducibility, make_datafiles.py has now been updated to also produce chunked data, saved in finished_files/chunked, and the README for the Tensorflow code now gives instructions for chunked data. If you've already run make_datafiles.py to obtain train.bin/val.bin/test.bin files, then just run

import make_datafiles
make_datafiles.chunk_all()

in Python, from the cnn-dailymail directory, to get the chunked files (it takes a few seconds).
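For reference, here is a minimal sketch of the record format the chunking works over, assuming the layout written by make_datafiles.py: each record is an 8-byte length header followed by that many bytes of serialized tf.Example. Chunking is then just reading records from the big file and writing them back out, 1000 per chunk. (The helper names here are illustrative, not the actual functions in make_datafiles.py.)

```python
import struct

CHUNK_SIZE = 1000  # examples per chunk, matching train_000.bin ... train_287.bin

def read_records(stream):
    """Yield serialized tf.Example byte strings from a .bin stream.

    Assumes each record is an 8-byte little-endian length header
    followed by that many bytes of payload.
    """
    while True:
        header = stream.read(8)
        if not header:
            return  # end of file
        (length,) = struct.unpack('q', header)
        yield stream.read(length)

def write_record(stream, example_bytes):
    """Write one record in the same length-prefixed format."""
    stream.write(struct.pack('q', len(example_bytes)))
    stream.write(example_bytes)
```

Splitting train.bin into chunks is then a loop that opens a new `train_%03d.bin` output file every `CHUNK_SIZE` records.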

To use your chunked datafiles with the Tensorflow code, set e.g.

--data_path=/path/to/chunked/train_*

You don't have to restart training from the beginning to switch to the chunked datafiles.

Why does it matter?

The multi-threaded batcher code is originally from the TextSum project. The idea is that each input thread calls example_generator, which shuffles the chunks and then reads from them in that randomized order. Thus 16 threads concurrently fill the input queue with examples drawn from different, randomly-chosen chunks. If your data is in a single file, however, the multi-threaded batcher will result in 16 threads concurrently filling the input queue with examples drawn in order from the same single .bin file. Firstly, this might produce batches containing more duplicate examples than we want. Secondly, reading through the dataset in order may produce different training results from reading through it in randomized chunks.
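The reading pattern described above can be sketched roughly as follows, assuming the length-prefixed record format written by make_datafiles.py (this is an illustrative sketch of the TextSum-style design, not the exact batcher code):

```python
import glob
import random
import struct

def example_generator(data_path, single_pass=False):
    """Yield serialized examples, shuffling the chunk order each pass.

    Each batcher thread runs its own generator, so with chunked data
    the threads walk different randomly-ordered chunk lists; with one
    big .bin file, every thread reads the same file front-to-back.
    """
    while True:
        filelist = glob.glob(data_path)  # e.g. /path/to/chunked/train_*
        assert filelist, 'Error: empty filelist at %s' % data_path
        random.shuffle(filelist)
        for fname in filelist:
            with open(fname, 'rb') as f:
                while True:
                    header = f.read(8)
                    if not header:
                        break  # end of this chunk
                    (length,) = struct.unpack('q', header)
                    yield f.read(length)
        if single_pass:
            break
```

With a single train.bin, `random.shuffle` has a one-element list to shuffle, so every thread yields the same examples in the same order.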

  • If you're concerned about duplicate examples in batches, either chunk your data or switch the batcher to single-threaded by setting
self._num_example_q_threads = 1 # num threads to fill example queue
self._num_batch_q_threads = 1  # num threads to fill batch queue

(From a speed point of view, the multi-threaded batcher is probably unnecessary for many systems anyway).

  • If you're concerned about reproducibility and the effect on training of reading the data in randomized chunks vs. from a single file, then chunk your data.

For those having trouble with pre-processing, this should help:
https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail