
🏘️ BnB Dataset 🏘️


This repository contains a set of scripts for downloading a dataset of image-caption pairs from Airbnb.

🛠️ 1. Get started

First, you need [git lfs](https://git-lfs.github.com/) to clone the repository. Install it from command line:

curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install

You can now clone the repository:

git clone https://github.com/airbert-vln/bnb-dataset.git

If you cloned the repository without LFS installed, you will have received an error message. You can fix it by running:

make lfs

You need a recent version of Python (3.8 or higher) and install the dependencies through poetry:

# install python for ubuntu 20.04
sudo apt install python3 python3-pip 
pip install poetry

# install dependencies
poetry install

# activate the environment (do it at each new shell)
poetry shell

Note that typing is used extensively in these scripts. This was a real time saver for detecting errors before runtime. You might want to set up your IDE properly to play well with mypy. I recommend the coc.nvim extension coc-pyright for neovim users.

Managing a large number of images is tricky and usually takes a lot of time. The scripts therefore split the task among several workers: a cache folder keeps the ordered item list for each worker, while each worker produces its own output file. Look for the num_workers or num_procs parameters in the argtyped Arguments.
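The worker pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name and cache layout are assumptions.

```python
from pathlib import Path
import tempfile

def split_among_workers(items, num_workers, cache_dir):
    """Write one ordered item list per worker so each worker can
    process (and resume) its own shard independently."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # round-robin sharding keeps shard sizes balanced
    shards = [items[rank::num_workers] for rank in range(num_workers)]
    for rank, shard in enumerate(shards):
        (cache / f"worker-{rank}.txt").write_text("\n".join(shard))
    return shards
```

Each worker then reads only its own `worker-<rank>.txt` and writes its own output file, so a crashed worker can be restarted without touching the others.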

🗺️ 2. Download listings from Airbnb

This step builds a TSV file with 4 columns: listing ID, photo ID, image URL, image caption. Too high a request rate will get you rejected by Airbnb, so it is advised to split the job among several IP addresses.

Please note that you can use the pre-computed TSV file used in our paper for training and for testing. The file was generated during Christmas 2019 (yeah, before Covid. Sounds so far away now!). Some images might not be available anymore.

Also, note that this file contains only a fraction of all Airbnb listings. It might be interesting to extend it.
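Given the four columns named above, the TSV can be parsed with the standard library. A minimal sketch (the helper name and dict keys are illustrative, not from the repository):

```python
import csv
from io import StringIO

def read_listing_rows(tsv_text):
    """Parse the 4-column TSV: listing ID, photo ID, image URL, image caption."""
    reader = csv.reader(StringIO(tsv_text), delimiter="\t")
    return [
        {"listing_id": lid, "photo_id": pid, "url": url, "caption": cap}
        for lid, pid, url, cap in reader
    ]
```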

2.1. Create a list of regions

Airbnb listings are searched within a specific region, so we first need to initialize the list of regions. A quick hack for that consists in scraping Wikipedia lists of places, as done in the script cities.py.

For this script, you need to download and install Selenium. The instructions below are valid only for a Linux distribution; otherwise, follow the guide from the Selenium documentation.

pip install selenium
wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux32.tar.gz
mkdir -p $HOME/.local/bin
export PATH=$PATH:$HOME/.local/bin
tar -xvf geckodriver-v0.30.0-linux32.tar.gz -C $HOME/.local/bin
# Testing the driver path is recognized:
geckodriver --version

Here is how I scraped a list of cities. You might want to update this script in order to increase the number of cities.

python cities.py --output data/cities.txt

You can see other examples in the locations/ folder, used as an attempt to enlarge the BnB dataset.

2.2. Download listings

# Download a list of listings from the list of cities
python search_listings.py --locations data/cities.txt --output data/listings

# Download JSON files for each listing
python download_listings.py --listings data/listings.txt --output data/merlin --with_photo
# Note that you can also download reviews and info (see python download_listings.py --help)

# Extract photo URLs from listing export files
python extract_photo_metadata.py --merlin data/merlin --output data/bnb-dataset-raw.tsv

2.3. Filter captions

# Apply basic rules to remove some captions
python filter_captions.py --input data/bnb-dataset-raw.tsv --output data/bnb-dataset.tsv
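The "basic rules" amount to simple heuristics. The real rules live in filter_captions.py; the sketch below is only an illustrative stand-in, with assumed thresholds:

```python
def keep_caption(caption, min_words=2, max_words=50):
    """Illustrative caption filter: drop captions that are too short,
    too long, or mostly non-alphabetic (prices, phone numbers, symbols)."""
    words = caption.split()
    if not (min_words <= len(words) <= max_words):
        return False
    letters = sum(c.isalpha() for c in caption)
    # require that more than half the characters are letters
    return letters / max(len(caption), 1) > 0.5
```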

📸 3. Get images

Now we want to download images and filter out outdoor images.

3.1. Download images

The download rate can be higher here before the server kicks you out, but it is still preferable to use a pool of IP addresses.

python download_images.py --csv_file data/bnb-dataset.tsv --output data/images --correspondance /tmp/cache-download-images/
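A throttled download loop can be sketched with the standard library alone. This is a simplified illustration, assuming a per-listing folder layout; the real script additionally manages the correspondence cache and worker sharding:

```python
import time
import urllib.request
from pathlib import Path

def photo_path(output_dir, listing_id, photo_id):
    """Group images per listing (the exact layout here is an assumption)."""
    return Path(output_dir) / str(listing_id) / f"{photo_id}.jpg"

def download(url, dest, delay=0.5):
    """Fetch one image, sleeping between requests to stay polite."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    time.sleep(delay)
```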

3.2. Optionally, make sure images were correctly downloaded

python detect_errors.py --images data/images --merlin data/merlin

3.3. Filter out outdoor images

Outdoor images tend to be of lower quality and their captions are often not relevant. We first detect outdoor images with a CNN pretrained on the Places365 dataset; later on, we keep only the indoor images.

Note that the output of this step is also used for image merging.

# Detect room types
python detect_room.py --output data/places365/detect.tsv --images data/images

# Keep only indoor images
python extract_indoor.py --output data/bnb-dataset-indoor.tsv --detection data/places365/detect.tsv
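Places365 ships a category list with an indoor/outdoor flag per scene, which makes the decision rule simple. The sketch below uses a tiny stand-in category set, not the real Places365 list:

```python
# Stand-in outdoor scene labels; the real list comes with Places365.
OUTDOOR = {"beach", "street", "garden", "patio"}

def is_indoor(top_categories):
    """Keep a photo when at most half of its top predicted scenes are outdoor."""
    outdoor_votes = sum(c in OUTDOOR for c in top_categories)
    return outdoor_votes <= len(top_categories) // 2
```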

💽 4. Build an LMDB database with BnB pictures

Extract visual features and store them in a single file. Several steps are required to achieve this. Unfortunately, we don't own the rights to the Airbnb images, and thus we are not permitted to share our own LMDB file.

4.1. Split between train and test

5% of the dataset is allocated to the test set:

round() {
  printf "%.${2}f" "${1}"
}

num_rows=$(wc -l < data/bnb-dataset-indoor.tsv)

# bash arithmetic is integer-only, so compute the 5% fraction with awk
test=$(round "$(awk "BEGIN {print $num_rows * 0.05}")" 0)
tail -n "$test" data/bnb-dataset-indoor.tsv > data/bnb-test-indoor-filtered.tsv

train=$((num_rows - test))
head -n "$train" data/bnb-dataset-indoor.tsv > data/bnb-train-indoor-filtered.tsv

4.2. Extract bottom-up top-down features

This step is one of the most annoying ones, since the installation of bottom-up top-down attention is outdated. I put a Dockerfile and a Singularity definition file in the container folder to help you with that. Note that this step is also extremely slow, and you might want to use multiple GPUs.

python precompute_airbnb_img_features_with_butd.py  --images data/images

If this step is too difficult, open an issue and I'll try to use the PyTorch version instead.

4.3. Build an LMDB file

# Extract keys
python extract_keys.py --output data/keys.txt --datasets data/bnb-dataset.indoor.tsv
# Create an LMDB
python convert_to_lmdb.py --output img_features --keys data/keys.txt

Note that you can split the LMDB into multiple files by using a number of workers. This could be relevant when your LMDB file is super huge!

🔗 5. Create dataset files with path-instruction pairs

Almost there! We built image-caption pairs, and now we want to convert them into path-instruction pairs. In practice, we just produce JSON files that you can feed into the training repository.

⛓️ 5.1. Concatenation

python preprocess_dataset.py --csv data/bnb-train.tsv --name bnb_train
python preprocess_dataset.py --csv data/bnb-test.tsv --name bnb_test
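Conceptually, concatenation treats a listing's photos as a path and the concatenation of their captions as the instruction. A minimal sketch of one resulting item; the field names are assumptions, see preprocess_dataset.py for the real schema:

```python
def listing_to_path_instruction(listing_id, photos):
    """photos: list of (photo_id, caption) pairs for one listing."""
    return {
        "listing_id": listing_id,
        "path": [pid for pid, _ in photos],
        "instruction": " ".join(cap for _, cap in photos if cap),
    }
```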

👥 5.2. Image merging

python merge_photos.py --source bnb_train.py --output merge+bnb_train.py --detection-dir data/places365 
python merge_photos.py --source bnb_test.py --output merge+bnb_test.py --detection-dir data/places365

👨‍👩‍👧 5.3. Captionless insertion

python preprocess_dataset.py --csv data/bnb-dataset.indoor.tsv --captionless True --min-caption 2 --min-length 4 --name 2capt+bnb_train

python preprocess_dataset.py --csv datasets/data/bnb-dataset.indoor.tsv --captionless True --min-caption 2 --min-length 4 --name 2capt+bnb_test

👣 5.4. Instruction rephrasing

# Extract noun phrases from BnB captions
python extract_noun_phrases.py --source data/airbnb-train-indoor-filtered.tsv --output data/bnb-train.np.tsv 
python extract_noun_phrases.py --source data/airbnb-test-indoor-filtered.tsv --output data/bnb-test.np.tsv 

# Extract noun phrases from R2R train set
python perturbate_dataset.py --infile R2R_train.json --outfile np_train.json --mode object --training True 

5.5. Create the testset

You need to create a testset for each dataset. Here is an example for captionless insertion.

python build_testset.py --output data/bnb/2capt+testset.json --out-listing False --captions 2capt+bnb_test.json