Jupyter Notebook for RESTify Experiment data mining

RESTify Data Analysis

Data, Data-Mining and Visualization for the RESTify experiment.

About

This repository hosts the sources and raw input data needed to replicate the empirical findings of the RESTify experiment. The data can be reproduced and inspected with a Jupyter Notebook instance or, for more experienced users and collaborators, with a preconfigured PyCharm project.

To replicate our data analysis, you have the following options:

Dockerized Notebook

This repository hosts a Docker configuration that creates a containerized Jupyter Notebook instance with all runtime dependencies.
The notebook allows you to locally replicate our methodology and all findings, together with in-depth explanations.

Instructions for Docker (macOS / Linux host):

  1. Install Docker
    (After install, test your setup with: docker run hello-world)
  2. Clone this repository:
    git clone https://github.com/m5c/RestifyJupyter.git
  3. Build and Run the Jupyter Notebook Container:
    cd RestifyJupyter; ./docker-autostart.sh
    (On Linux, you may need to prefix the docker command with sudo)
  4. Access the Notebook: http://127.0.0.1:8889/notebooks/Restify.ipynb

If you see a notebook with all paper figures and stats, you have succeeded.
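If the page does not load, it can help to first check whether anything is answering on the notebook port at all. The following is a minimal sketch using only the Python standard library; the helper name `is_reachable` is made up for this illustration, and the URL is the one from step 4. It only tells you whether an HTTP server responds, not whether the notebook rendered correctly.

```python
import http.client
import urllib.error
import urllib.request

# URL from step 4 above; adjust it if you changed the port mapping.
NOTEBOOK_URL = "http://127.0.0.1:8889/notebooks/Restify.ipynb"


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, False otherwise."""
    # The target is a local address, so bypass any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # A server answered, just not with a success status (e.g. 403 or 404).
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a malformed reply: treat as not reachable.
        return False
```

Calling `is_reachable(NOTEBOOK_URL)` after running `./docker-autostart.sh` should return `True` once the container is up. A `False` result suggests the container is not running or the port is mapped differently.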

Manual Notebook

This section explains how to run the Jupyter Notebook instance natively. For this to work, you must install all runtime dependencies. The steps below install the dependencies in a virtual environment, so your system-wide Python installation stays clutter-free.

  1. Install Python 3.9 or newer. Make sure the newly installed python version is set as default. Verify with: python --version
  2. Go into the project and create a new virtual environment (local python folder with all dependencies):
    cd RestifyJupyter
    python3 -m venv .env
    source .env/bin/activate
  3. Install all required python libraries, using the pip3 package manager:
    pip3 install pandas numpy matplotlib plotly scipy statsmodels seaborn jupyter
    Alternatively, install everything listed in requirements.txt with: pip3 install -r requirements.txt
  4. Start up the Notebook:
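Separately from the start-up step, a quick sanity check of the virtual environment can save time when the notebook fails on import. The sketch below is not part of the repository: the function names are hypothetical, and the package list is copied from the pip3 command in step 3. Run it with the interpreter of the activated .env environment.

```python
import sys
from importlib import metadata

# Minimum interpreter version from step 1 and package names from step 3.
MIN_PYTHON = (3, 9)
REQUIRED_PACKAGES = [
    "pandas", "numpy", "matplotlib", "plotly",
    "scipy", "statsmodels", "seaborn", "jupyter",
]


def python_is_recent_enough(version=None, minimum=MIN_PYTHON) -> bool:
    """Compare (major, minor) of the given or running interpreter to the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


def missing_packages(names=REQUIRED_PACKAGES) -> list:
    """Return the pip package names with no installed distribution in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

An empty list from `missing_packages()` together with `python_is_recent_enough()` returning `True` indicates that steps 1 to 3 took effect in the interpreter you are actually using.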

PyCharm IDE

Complementary to replicating our results with a Jupyter Notebook, you can also directly execute the Python code used for data mining. This option provides in-depth access to implementation details and is intended for data scientists who want to either:

  • Validate the correctness of our extracted data at coding level.
  • Enrich the data analysis we implemented with additional insights.

All runtime dependencies, including Python itself, can be installed directly from PyCharm. However, it is important that the IDE is configured to use the correct interpreter.

  1. Install PyCharm. The free Community Edition is sufficient.
  2. Install the python3 interpreter. You find a corresponding option in the PyCharm -> Settings menu.
  3. Install all required libraries. Open the PyCharm -> Settings -> Project -> Interpreter menu.
  4. Install PyLint. Open the plugins menu: PyCharm -> Settings -> Plugins.
  5. Select the desired run configuration, to replicate any of our results:
    • For every code cell of the Notebook, there is a corresponding preconfigured run configuration.
    • We recommend that you run the run_all_pseudo_cell.py script, which recreates all statistical figures and listings from the paper.

Inputs and Outputs

  • Inputs:
    The Notebook works on the CSV data stored in source-csv-files. It is the same data as provided in our replication bundle.
  • Outputs:
    • Figures are written to generated-plots
    • Intermediate CSV files are written to generated-csv-files
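To make the directory layout concrete, here is a small standard-library sketch that lists the input CSV files and makes sure both output folders exist. The function names are hypothetical and not taken from the codebase. The sketch also assumes that the CSV files sit directly inside source-csv-files rather than in subfolders, which may not match the actual layout.

```python
from pathlib import Path

# Directory names as given in the list above, relative to the repository root.
INPUT_DIR = "source-csv-files"
OUTPUT_DIRS = ("generated-plots", "generated-csv-files")


def list_input_csvs(repo_root):
    """Return the CSV files found directly in the input directory, sorted by path."""
    return sorted((Path(repo_root) / INPUT_DIR).glob("*.csv"))


def ensure_output_dirs(repo_root):
    """Create the output directories if they are missing and return their paths."""
    paths = []
    for name in OUTPUT_DIRS:
        path = Path(repo_root) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths.append(path)
    return paths
```

For example, `list_input_csvs(".")` run from the repository root shows which input files a custom analysis could read, and `ensure_output_dirs(".")` is safe to call repeatedly before writing your own figures or CSV files.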

Implementation Details

This section is only relevant for data analysts who want to tweak the notebook output / visualization, or reuse part of the codebase for similar project layouts.

Label Makers

For scatter plots and scatter series you can easily change how samples are annotated. Just pass a different LabelMaker at the moment of scatter instantiation.
LabelMakers are defined in restify_mining/scatter_plotters/extractors.

If you wish to annotate only selected dots, edit the labeloverride.csv and use a custom LabelMaker.

  • To remove all labels, use the EmptyLabelMaker.
  • To annotate full codenames (colour + animal) use the FullLabelMaker.
  • To annotate group-internal codenames (only animal), use the AnimalLabelMaker.
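The idea behind swapping LabelMakers is a plain strategy pattern: the plotting code asks an interchangeable object for the text to print next to each sample. The sketch below illustrates that idea only. The classes are stand-ins that share names with the ones listed above, but the method name `make_label`, the `annotate` helper, the `SelectedLabelMaker` class and the codename format `colour-animal` are all assumptions. The real signatures in restify_mining/scatter_plotters/extractors may differ.

```python
from abc import ABC, abstractmethod


class LabelMaker(ABC):
    """Hypothetical interface: maps a participant codename to an annotation text."""

    @abstractmethod
    def make_label(self, codename: str) -> str:
        ...


class EmptyLabelMaker(LabelMaker):
    def make_label(self, codename: str) -> str:
        return ""  # no annotation at all


class FullLabelMaker(LabelMaker):
    def make_label(self, codename: str) -> str:
        return codename  # colour and animal


class AnimalLabelMaker(LabelMaker):
    def make_label(self, codename: str) -> str:
        # Assumed codename format "<colour>-<animal>": keep only the animal part.
        return codename.split("-", 1)[-1]


class SelectedLabelMaker(LabelMaker):
    """Custom example: label only a chosen subset of codenames, leave the rest blank."""

    def __init__(self, selected):
        self._selected = set(selected)

    def make_label(self, codename: str) -> str:
        return codename if codename in self._selected else ""


def annotate(codenames, label_maker: LabelMaker):
    """Stand-in for the plotter: collect one label per sample."""
    return [label_maker.make_label(name) for name in codenames]
```

With made-up codenames, `annotate(["blue-fox", "red-owl"], AnimalLabelMaker())` yields `["fox", "owl"]`, while passing `SelectedLabelMaker({"red-owl"})` labels only the second sample. The plotting code stays unchanged in both cases, which is the point of passing the LabelMaker at instantiation time.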

License

This software is released under the open-source MIT License.

Author / References