ESCALATE_report

Transform experimental data into ML ready datasets!

Primary language: Python. License: MIT.

Authors: Ian Pendleton, Michael Tynes, Aaron Dharna

Science Contact: jschrier .at. fordham.edu, ian .at. pendletonian.com

Technical Debugging: vshekar .at. haverford.edu, gcattabrig .at. haverford.edu

Overview

Retrieves experiment files from supported locations and processes them into intermediary JSON files on the user's local machine. The generated JSON files are then used to build a 2D CSV of the data in a format compatible with most machine learning software (e.g. scikit-learn). Additional configuration is required to map the existing data structures to headers that match the user's desired configuration. These mappings are typically trivial for computer scientists, but may be more challenging for non-domain experts or individuals unfamiliar with manipulating dataframes. The dataset is augmented with chemical calculations such as concentrations, temperatures derived from models of plate temperature, and other empirical observations. In the final steps the dataset is supplemented with chemical features and calculations derived from ChemAxon, RDKit, and local datasets saved to this repository. Additional information on how to control the generation of _feat_ and _calc_ columns can be found in the user documentation here.
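The JSON-to-CSV step described above can be illustrated with a minimal sketch. This is not the ESCALATE_report implementation: the record layout, the field names, and the helper names `flatten` and `records_to_csv` are all hypothetical, and only the Python standard library is used.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def records_to_csv(records):
    """Return CSV text with one row per record and one column per flattened key."""
    rows = [flatten(r) for r in records]
    # Union of all keys in first-seen order, so no column is dropped.
    header = list(dict.fromkeys(k for row in rows for k in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=header, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical experiment records; the real intermediary JSON layout may differ.
example = json.loads("""
[
  {"run": "exp1", "reagent": {"name": "A", "volume": 1.5}},
  {"run": "exp2", "reagent": {"name": "B", "volume": 2.0}, "temp": 95}
]
""")
csv_text = records_to_csv(example)
print(csv_text)
```

Records that lack a field simply get an empty cell in that column, which is one common way to keep the table rectangular.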

The original ESCALATE publication can be found here.

User documents, relating to a complete cycle of escalate, can be found here.

Installation

This build process has been tested on macOS High Sierra (10.13.5), macOS Catalina (10.15.3), Ubuntu Bionic Beaver (18.04), and Windows 10 (version 1909, OS Build 18363.418).

Windows Users: Please note that while Windows has been tested, it is not the recommended operating system. Everything is more challenging, the installation is messier, logging is limited, and the file system interaction is more brittle.

Mac and Linux

Initial Setup

Pip Install

  1. Create new python 3.8 environment in conda and activate:

    conda create -n escalate_report python=3.8

    conda activate escalate_report

  2. Install the latest version of the pip package manager

    conda install pip

  3. Then install requirements (still in the escalate_report environment)

    pip install -r requirements.txt

  4. Then install conda dependent pieces:

    conda install -c conda-forge rdkit

Conda Install

  1. Execute:

    conda update conda

    conda env create -f environment.yml

    The conda env create command will automatically create an escalate_report environment

Custom Environment (Package List)

Windows Users will likely need to use this

Pip install the following python packages prior to use:

  • pandas, numpy, gspread, pydrive, cerberus, google-api-python-client==1.7.4, xlrd, xlwt, tqdm, pytest (json ships with the Python standard library and does not need to be installed)

conda install -c conda-forge rdkit

Please report any failures of the above installation methods to the repo admins.
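Before reporting a failure, it can help to check which packages are actually importable in the active environment. The sketch below is an assumption-based helper, not part of the repository: `find_missing` is a hypothetical name, and the module list uses the import names of the packages listed above.

```python
import importlib.util

# Import names for the packages listed above; note that
# google-api-python-client is imported as `googleapiclient`.
MODULES = [
    "pandas", "numpy", "gspread", "pydrive", "cerberus",
    "googleapiclient", "xlrd", "xlwt", "tqdm", "pytest", "rdkit",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


missing = find_missing(MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated escalate_report environment; an empty result only shows that the modules can be found, not that their versions are compatible.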

Authentication Setup

  1. Download the securekey files and move them into the root folder (./, aka. current working directory, aka. ESCALATE_report-master/ if downloaded from git). Do not distribute these keys! (Contact a dev for access)

    Note: Navigate to the wiki for more information on setting up a new lab or generating additional authentication keys

  2. Ensure that the files 'client_secrets.json' and 'creds.json' are both present in the root folder (./, aka. current working directory, aka. ESCALATE_report-master/ if downloaded from git). The correct folder for these keys is the one which contains the runme.py script.

  3. Stop here if you don't want to use ChemAxon for feature generation. RDKit and the available ESCALATE features will still be generated.

    • Note: ESCALATE will throw warnings if ChemAxon features are specified in type_command.csv. These can be ignored if that is the desired functionality.
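A quick way to confirm step 2 is to check for both key files from the folder that contains runme.py. This is a minimal sketch; `missing_key_files` is a hypothetical helper and not a function provided by ESCALATE_report.

```python
from pathlib import Path

# File names taken from step 2 above.
REQUIRED_KEY_FILES = ("client_secrets.json", "creds.json")


def missing_key_files(root="."):
    """Return the required key files that are absent from `root`."""
    root = Path(root)
    return [name for name in REQUIRED_KEY_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_key_files()
    if missing:
        print("Missing key files:", ", ".join(missing))
    else:
        print("Both key files found.")
```

The check only tests for presence; it does not validate the contents of the keys.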

Optional for ChemAxon Support

  1. Download and install ChemAxon JChemSuite and obtain a ChemAxon license (free for academic use)

  2. Follow the installation instructions found on ChemAxon's website. Be sure to note the location of the JChemSuite installation (e.g. ~/opt/chemaxon/jchemsuite/bin on Linux or /Applications/JChemSuite/bin/ on macOS)

  3. You will need to specify the location of your ChemAxon installation at the bottom of ./expworkup/devconfig.py.

Running The Code

Currently supported google_drive_target_name (user defined folder names):

  • MIT Data: MIT_PVLab
  • HC and LBL Data: 4-Data-WF3_Iodide, 4-Data-WF3_Alloying, 4-Data-Bromides, 4-Data-Iodides
  • Development: dev

Basic Overview

A more detailed instruction manual, including videos showing how to operate the code, can be found in the ESCALATE user manual.

Definitions

<my_local_folder>: the name of the folder where files should be created. This will be automatically created by ESCALATE_report if it does not exist. The specified name will also be used for the final exported CSV (i.e. if <my_local_folder> is perovskitedata, perovskitedata.csv will be generated)

<google_drive_target_name>: one or more of the available datasets. See examples below.

  1. You can always get runtime information by executing:

    python runme.py --help

  2. To execute a normal run with chemaxon, rdkit, and ESCALATE calcs (see installation instructions above for more details)

    python runme.py <my_local_folder> -d <google_drive_target_name>

  3. To improve the clarity of column headers specify them in the dataset_rename.json file. All columns can be viewed in the initial run by executing:

    python runme.py <my_local_folder> -d <google_drive_target_name> --raw 1

  4. Columns that do not conform to the _{category}_ naming scheme (e.g., _feat_, _rxn_) will be omitted unless --raw 1 is enabled!

    • A list of the columns not conforming to the naming scheme will be exported to './<my_local_folder>/logging/UNNAMED_REPORT_COLUMNS.txt'.
    • The USER can specify an appropriate name in dataset_rename.json
    • To see all columns with naming directly from datasource use: --raw 1
    • Conflicting namespaces will be purged!
  5. Significant flexibility is enabled for _feat_ (via type_command.csv) and _calc_ (via ./utils/calc_command.py) specification. For examples, discussion, and limitations of these specifications, please see the USER docs.

    • _calc_ generation can be skipped by using the --disablecalcs True flag on the CLI
    • To speed up calc and feature development the first portion of the code can be skipped by:
      1. Running the code with --offline 1
      2. After the first iteration completes, running future instances with --offline 2
  6. A file named <my_local_folder>.csv will contain the 2d CSV of the dataset using the configured headers from the data or the mapping developed for the lab. The data/ folder will contain the generated JSONs.

  7. Intermediate dataframes can be exported in bulk by specifying:

    python runme.py <my_local_folder> -d <google_drive_target_name> --debug 1
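The column handling in items 3 and 4 can be pictured with the sketch below. It is illustrative only: the regular expression is an assumed reading of the _{category}_ scheme, dataset_rename.json is assumed to hold a flat old-name to new-name mapping, and `split_columns` and the example column names are hypothetical.

```python
import re

# Assumption: a conforming column starts with _{category}_, e.g. _feat_ or _rxn_.
CATEGORY_PATTERN = re.compile(r"^_[A-Za-z0-9]+_")


def split_columns(columns, rename_map=None):
    """Apply renames, then separate conforming columns from unnamed ones."""
    rename_map = rename_map or {}
    renamed = [rename_map.get(name, name) for name in columns]
    kept = [name for name in renamed if CATEGORY_PATTERN.match(name)]
    unnamed = [name for name in renamed if not CATEGORY_PATTERN.match(name)]
    return kept, unnamed


# Hypothetical raw headers and rename mapping.
raw_columns = ["_rxn_temperature", "_feat_molweight", "operator notes", "stir rate"]
rename_map = {"stir rate": "_rxn_stirrate"}

kept, unnamed = split_columns(raw_columns, rename_map)
print("kept:", kept)
print("unnamed:", unnamed)
```

In this sketch the `unnamed` list plays the role of the UNNAMED_REPORT_COLUMNS.txt report: anything left there would need an entry in the rename mapping to survive a run without --raw 1.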

To add additional target directories please see the how-to guide here. If you would like to add these to the existing datasets, please issue a git merge request after you add the necessary information.

Report to Versioned Data to ESCALATion

More detailed instructions can be found in the ESCALATE user manual.

If you are using Windows 10, please follow these instructions on what you will need to set up your environment. Consider using Ubuntu or WSL instead!

  1. Ensure that versioned data repo and escalation are installed

  2. Create an issue on versioned repo with new crank-number

  3. python runme.py <my_local_folder> -d <google_drive_target_name> -v <crank-number>

  4. This will generate files for upload to versioned data repo with the names:

    • <crank-number>.<dataset-name>.csv
    • <crank-number>.<dataset-name>.index.csv
  5. Move these files to the /pathto/versioned-dataset/data/perovskite/<my_local_folder>

  6. Follow Readme.md instructions for versioned-datasets
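The file names from step 4 and the destination from step 5 can be assembled as in the sketch below. The helper names are hypothetical, and the dataset name is passed separately from the local folder because the source does not state whether the two always match.

```python
from pathlib import PurePosixPath


def versioned_filenames(crank_number, dataset_name):
    """Return the data and index file names for one crank."""
    stem = f"{crank_number}.{dataset_name}"
    return f"{stem}.csv", f"{stem}.index.csv"


def destination_paths(versioned_root, local_folder, crank_number, dataset_name):
    """Return where each generated file belongs inside the versioned data repo."""
    target_dir = PurePosixPath(versioned_root) / "data" / "perovskite" / local_folder
    names = versioned_filenames(crank_number, dataset_name)
    return [str(target_dir / name) for name in names]


# Placeholder root path, as in step 5 above.
for path in destination_paths("/pathto/versioned-dataset", "perovskitedata",
                              "0111", "perovskitedata"):
    print(path)
```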

Include a state-set file with Crank

  1. Obtain a stateset or generate a stateset

  2. python runme.py <my_local_folder> -d <google_drive_target_name> -v <crank-number> -s <state-set_file_name.csv>

  3. Follow steps 5-6 above

Example Usage

  • python runme.py 4-Data-Iodides -d 4-Data-Iodides

  • python runme.py 4-Data-Iodides -d 4-Data-Iodides 4-Data-WF3_Iodide 4-Data-WF3_Alloying

  • python runme.py dev -d dev --debug 1 --raw 1 --offline 1

  • python runme.py perovskitedata -d 4-Data-Iodides --verdata 0111 --state example.csv
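When many runs need to be scripted, the command lines above can be assembled programmatically. The sketch below only builds the argument list from flags that appear in this README; `build_command` is a hypothetical helper, and nothing is executed.

```python
import shlex


def build_command(local_folder, targets, debug=None, raw=None,
                  offline=None, verdata=None, state=None):
    """Build a runme.py argument list from the flags shown in the examples."""
    cmd = ["python", "runme.py", local_folder, "-d", *targets]
    optional = (
        ("--debug", debug),
        ("--raw", raw),
        ("--offline", offline),
        ("--verdata", verdata),
        ("--state", state),
    )
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


dev_run = build_command("dev", ["dev"], debug=1, raw=1, offline=1)
print(shlex.join(dev_run))
```

The resulting list can be passed to `subprocess.run` from the folder that contains runme.py; passing a list rather than a single string avoids shell quoting problems with folder names.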

FAQs, Troubleshooting, and Tutorials

  1. FAQs
  2. Troubleshooting Help: please send the log file, any terminal output, and a brief explanation to ipendlet .at. haverford.edu for help.
  3. Tutorials
    1. Adding a new target for data workup
    2. Adding a new target for experiment generation