The RVL-CDIP dataset should be downloaded from the original website: https://www.cs.cmu.edu/~aharley/rvl-cdip/. This repository is particularly useful if you want to train a model on all images without worrying about benchmarks (it does not keep the original train/test/val split!). For example, you can use it to train a network that is later reused for transfer learning :).
- Move the downloaded file into a folder with plenty of free disk space.
- Extract the archive by running:
tar -xvzf "./rvl-cdip.tar.gz"
- The directory should look something like the image, without the dataset folder (that one is created automatically by the script).
- Use compose.py to build the per-category dataset. Note that this might take a while.
python compose.py
- Done
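
If you would rather do the disk-space check and the extraction step from Python, the following is a minimal sketch using only the standard library. It is not part of this repository: the helper names `has_enough_space` and `extract_archive` are hypothetical, and the "twice the archive size" margin is an arbitrary assumption, not a measured requirement.

```python
import shutil
import tarfile
from pathlib import Path


def has_enough_space(target_dir, required_bytes):
    """Return True if the filesystem holding target_dir has at least required_bytes free."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes


def extract_archive(archive_path, target_dir):
    """Extract a gzip-compressed tar archive into target_dir and return that directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression,
    # which corresponds to the -x and -z flags of the tar command above.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return target


if __name__ == "__main__":
    archive = Path("./rvl-cdip.tar.gz")
    if not archive.exists():
        print(f"{archive} not found, download it first")
    else:
        # Assumption: ask for at least twice the archive size as free space.
        needed = 2 * archive.stat().st_size
        if has_enough_space(".", needed):
            extract_archive(archive, ".")
            print("Extraction finished, now run: python compose.py")
        else:
            print("Not enough free disk space in the current folder")
```

The sketch extracts into the current folder, like the `tar` command in the steps above, so `compose.py` can be run afterwards in the same way.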