BU-TD

Scene understanding requires the extraction and representation of scene components, such as objects and their parts, people, and places, together with their individual properties, as well as relations and interactions between them. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (‘counter-streams’ structure). The BU-TD model extracts and recognizes scene constituents with their selected properties and relations, and uses them to describe and understand the image.

Currently the repository contains the code for the Persons and EMNIST experiments (described in Sections 3 and 5 of the paper). The code creates the data sets used in the paper and builds the bottom-up (BU) / top-down (TD) network model (the counter-stream).

Code

The code is based on Python 3.6 and uses PyTorch (version 1.6) and torchvision (0.7); newer versions will probably work as well. The requirements are listed in requirements.txt and can also be installed with:

conda install matplotlib scikit-image Pillow

For image augmentation also install:

conda install imgaug py-opencv
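
Alternatively, assuming a standard pip setup, the requirements file can be installed directly (this is generic pip usage, not a project-specific command):

pip install -r requirements.txt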

Persons details

Download the raw Persons data (get it here and place it in persons/data/avatars).

Next, run the following from within the persons/code folder. Create the sufficient data set:

python create_dataset.py

and the extended data set (use -e):

python create_dataset.py -e

Both data sets will be created in the data folder.

Run the training code for the sufficient set (-e for the extended set):

python avatar_details.py [-e]

A folder with all the learned models and a log file will be created under the data/results folder.
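
As a rough, illustrative sketch of how a saved model could then be inspected with PyTorch (the run folder and file name below are placeholders; the actual names generated under data/results depend on the run):

import torch

# placeholder path: the actual run folder and file names under data/results vary per run
state = torch.load('data/results/<run_folder>/<model_file>.pt', map_location='cpu')

# a checkpoint is typically a state_dict (or a dict wrapping one); list a few of its keys
for key in list(state)[:10]:
    print(key)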

EMNIST spatial relations

Run from within the emnist/code folder. Create the sufficient data set (-e for the extended set) with either 6 or 24 characters in each image (-n 6 or -n 24):

python create_dataset.py -n 24 -e

The EMNIST raw dataset will be downloaded and processed (using torchvision) and the spatial data set will be created in the data folder.
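
For reference, the raw download step relies on the standard torchvision EMNIST dataset. A minimal standalone sketch of that step is shown below; the root folder and split are assumptions for illustration, not necessarily the exact parameters used by create_dataset.py:

from torchvision import datasets

# downloads the raw EMNIST files into ./data on the first run only
emnist = datasets.EMNIST(root='./data', split='balanced', train=True, download=True)
print(len(emnist))        # number of samples in the chosen split
image, label = emnist[0]  # a PIL image and its integer class label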

Run the training code for the sufficient set (using -e for the extended set and the corresponding -n):

python emnist_spatial.py -n 24 -e

A folder with all the learned models and a log file will be created under the data/results folder.

Extracting scene structures

Code will be added soon.

Paper

If you find our work useful in your research or publications, please cite:

Image interpretation by iterative bottom-up top-down processing

Shimon Ullman, Liav Assif, Alona Strugatski, Ben-Zion Vatashsky, Hila Levi, Aviv Netanyahu, Adam Yaari

TODO

  • Change the design of create_dataset to be one big file with a plugin for each data set:

    • emnist
    • persons
    • clevr
    • omniglot
  • Make create_dataset idempotent: running it again after the data set has already been created should not download it again (see the sketch after this list)

  • Change the download function so that its download path is not relative

  • PLAN

    • Make the beta branch the main branch, replacing the current main
    • Create packages from the supplmentery dir and the v26 dir
    • End up with a robust base version of the model
    • Scan the code for problems
    • Update the documentation for the new training procedures
    • Create a new requirements file for conda
  • Longer term

    • Create a workflow and a proper GitHub Action to invoke training / debugging / simple automatic tests
    • Move some uses of np.array over to the torch framework
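
A minimal sketch of the idempotency idea mentioned in the TODO list above; the folder layout and the helper name are assumptions for illustration, not the repository's actual API:

from pathlib import Path

# hypothetical helper: skip the download step if the processed data set already exists
def ensure_dataset(data_dir, download_fn):
    data_dir = Path(data_dir)
    if data_dir.exists() and any(data_dir.iterdir()):
        print(f'{data_dir} already exists, skipping download')
        return data_dir
    data_dir.mkdir(parents=True, exist_ok=True)
    download_fn(data_dir)  # only reached on the first run
    return data_dir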

Contribute

@yonatansverdlov @idan-tankel