
Dataset Agnostic Word Segmentation

TensorFlow implementation of Towards A Dataset Agnostic Word Segmentation

Usage

Train

python main.py --name train_example --train-hmap --train-regression --iters 100000 --dataset iamdb  --data_type train --gpu_id 0

Evaluation

python main.py --name train_example --eval-run --iters 500 --dataset iamdb  --data_type val1 --gpu_id 0

Expected output

[Images omitted: a sample Bentham (iclef) document page and its predicted word heatmap]

Dependencies

  • Python 2.7
  • TensorFlow 1.3
  • OpenCV
  • CUDA

Also see requirements.txt

License

Code is released under the MIT License (refer to the LICENSE file for details).

Data

You can load several datasets automatically by passing the following command-line flags:

  1. dataset
  2. data_type
  3. data_type_prob

Possible values for the dataset flag are 'iamdb', 'iclef', 'botany', 'icdar' and 'from-pkl-folder'; see below for further details. Available data_type values are 'train', 'val1', 'val2' and 'test'. You can mix several data types by passing a string in which the types are separated by '#'. Multiple data types are sampled uniformly unless you also pass a '#'-separated string of probabilities to 'data_type_prob' (notice: it must contain exactly one probability per data type).

For example:

python main.py ... --dataset iamdb  --data_type #train#val1#val2 --data_type_prob #0.75#0.125#0.125

You can change this behaviour by modifying the code in input_stream.py
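For illustration, here is a minimal sketch (not the repo's actual parser in input_stream.py) of how such '#'-separated strings could be turned into a weighted sampler; the function name parse_mix is hypothetical:

import numpy as np

def parse_mix(data_type, data_type_prob=None):
    # '#train#val1#val2' -> ['train', 'val1', 'val2']
    types = [t for t in data_type.split('#') if t]
    if data_type_prob:
        # '#0.75#0.125#0.125' -> [0.75, 0.125, 0.125]
        probs = [float(p) for p in data_type_prob.split('#') if p]
        assert len(probs) == len(types), 'one probability per data type'
    else:
        # no probabilities given: sample the data types uniformly
        probs = [1.0 / len(types)] * len(types)
    return types, probs

types, probs = parse_mix('#train#val1#val2', '#0.75#0.125#0.125')
split = np.random.choice(types, p=probs)  # split to draw the next document from

Drawing with np.random.choice matches the behaviour described above: uniform sampling by default, weighted sampling when probabilities are supplied.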

Assumed directory structure for the available datasets

   Root
   └── datasets
       ├── icdar
       ├── iclef
       ├── iamdb
       └── botany

You probably want to use symlinks rather than keeping the data inside the repo directory:

cd datasets
ln -s /path/to/icdar icdar

ICDAR2013

  • Available for download from here (you need the images and the word ground truth data)

  • Open the rar files to the following directory structure:

      datasets/icdar
      ├── gt_words_test
      ├── gt_words_train
      ├── images_test
      └── images_train
    
  • Command-line flags: pass 'icdar' as the value of the dataset flag; available data_type values are {train, val1, test}

  • Binary ground truth data converted to bounding box coordinates is available as a pkl from here

You can get dataset splits and cached Numpy arrays of bounding box data from here
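If you want a quick sanity check of the cached bounding-box data, a minimal sketch follows; the file name bbox_cache.pkl and the assumed layout (a dict mapping image identifiers to (N, 4) box arrays) are illustrative guesses, so adapt them to the file you actually download:

from __future__ import print_function
import pickle

# hypothetical file name; use the pkl you actually downloaded
with open('datasets/icdar/bbox_cache.pkl', 'rb') as f:
    boxes = pickle.load(f)

# assumed layout: {image_name: (N, 4) array of word bounding boxes}
for image_name in sorted(boxes)[:3]:
    print(image_name, boxes[image_name].shape)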

IAM Database

  • Available for download from here (registration is required)

  • Make sure you have the following directory structure:

     datasets/iamdb
      ├── ascii
      ├── forms
      ├── lines
      ├── sentences
      ├── words
      └── xml
    
  • Command-line flags: pass 'iamdb' as the value of the dataset flag; available data_type values are {train, val1, val2, test}

You can get dataset splits from here

Bentham dataset (iclef)

  • Available for download from here

  • Make sure you have the following directory structure:

     datasets/iclef
      ├── bboxs_train_for_query-by-example.txt
      ├── pages_devel_jpg
      ├── pages_devel_xml
      ├── pages_test_jpg
      ├── pages_test_xml
      ├── pages_train_jpg
      └── pages_train_xml
    
  • Command-line flags: pass 'iclef' as the value of the dataset flag; available data_type values are {train, val1, val2, test}

You can get dataset splits from here

Botany

  • Available for download from here

  • Make sure you have the following directory structure:

     datasets/botany
     ├── Botany_Train_III_PageImages
     ├── Botany_Train_II_PageImages
     ├── Botany_Train_I_PageImages
     ├── Botany_Test_PageImages
     └── xml
         ├── Botany_Train_III_WL.xml
         ├── Botany_Train_II_WL.xml
         ├── Botany_Train_I_WL.xml
         └── Botany_Test_GT_SegFree_QbS.xml
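
After downloading, you may want to sanity-check the XML word lists. Here is a minimal sketch; it assumes word entries carry x/y/w/h bounding-box attributes, which you should verify against the actual schema of the downloaded files:

import xml.etree.ElementTree as ET

tree = ET.parse('datasets/botany/xml/Botany_Train_I_WL.xml')
boxes = []
for elem in tree.iter():
    a = elem.attrib
    # ASSUMPTION: word entries carry x/y/w/h attributes; verify the
    # real tag and attribute names against the downloaded files
    if all(k in a for k in ('x', 'y', 'w', 'h')):
        x, y, w, h = [int(float(a[k])) for k in ('x', 'y', 'w', 'h')]
        boxes.append((x, y, x + w, y + h))  # (x1, y1, x2, y2)
print('parsed %d word boxes' % len(boxes))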