/airbnb-workflow

Example workflow for reproducible science project, using public AirBnB data

Primary LanguagePerl

Reproducible research workflow with public AirBnB Data

Hannes Datta, h.datta@tilburguniversity.edu Hulai Zhang, h.zhang_4@tilburguniversity.edu

Toolkit

To illustrate the versatility of automating one's research, we are using various tools (R, Python, STATA, PERL and TeX/LyX) in this workflow!

How to get started

  • Install Python.
    • Anaconda is recommended. Download Anaconda.
    • check availability: type anaconda --version in the command line.
  • Install the Python Package TableFill: pip install git+https://github.com/mcaceresb/tablefill
  • Install the Kaggle package.
    • Kaggle API instruction for installation and setup.
  • Install Automation tools.
    • GNU make: already installed in Mac and Linux OS. Download Make for Windows OS and install.
    • Windows OS users only: make Make available via the command line.
      • Right Click on Computer
      • Go to Property, and click Advanced System Settings
      • Choose Environment Variables, and choose Path under the system variables, click edit
      • Add the bin of Make
    • check availability: type make --version in the command line.
  • Install Stata.
    • making Stata available via the command line. Instruction for adding Stata to path.
    • check availability: type $STATA_BIN --version in the command line.
  • Install LyX.
    • LyX is an open source document processor based on the LaTeX. Download LyX.
    • make sure LyX available via the command line. Instruction for adding LyX to path.
    • check availability: type $LYX_BIN in the command line.

How to run the workflow

Open your command line tool:

  • Check whether your present working directory is airbnb-workflow by typing pwd in terminal
    • if not, type cd yourpath/airbnb-workflow to change your directory to airbnb-workflow
  • Type make in the command line.

Directory structure

Make sure makefile is put in the present working directory. The directory structure for the workflow is shown below.

├── data
├── gen
│   ├── analysis
│   │   ├── input
│   │   ├── output
│   │   │   ├── figure
│   │   │   ├── log
│   │   │   └── table
│   │   └── temp
│   ├── data_preparation
│   │   ├── audit
│   │   │   ├── figure
│   │   │   ├── log
│   │   │   └── table
│   │   ├── input
│   │   ├── output
│   │   │   ├── figure
│   │   │   ├── log
│   │   │   └── table
│   │   └── temp
│   └── paper
│       ├── input
│       ├── output
│       └── temp
└── src
    ├── analysis
    ├── data_preparation
    └── paper
  • gen: all generated files such as tables, figures, logs.
    • Three parts: data_preparation, analysis, and paper.
    • audit: put the resulting log/tables/figures of audit program. It has three sub-folders: figure, log, and table.
    • temp : put the temporary files, such as some intermediate datasets. We may delete these filed in the end.
    • output: put results, including the generated figures in sub-folder figure, log files in sub-folder log, and tables in sub-folder table.
    • input: put all temporary input files
  • data: all raw data.
  • src: all source codes.
    • Three parts: data_preparation, analysis, and paper (including tex files).