/MARGE

Machine learning Algorithm for Radiative transfer of Generated Exoplanets

Primary LanguagePLSQLOtherNOASSERTION

                                MARGE
   Machine learning Algorithm for Radiative transfer of Generated Exoplanets
===============================================================================


Author :       Michael D. Himes    University of Central Florida
Contact:       mhimes@knights.ucf.edu

Advisor:       Joseph Harrington   University of Central Florida

Contributors:  Adam D. Cobb
                 - plan-net code inspired NN class & training
               David C. Wright
                 - containerized the code
               Zaccheus Scheffer
                 - assisted with refactoring the codebase that led to MARGE


Acknowledgements
----------------
This research was supported by the NASA Fellowship Activity under NASA Grant 
80NSSC20K0682.  We gratefully thank Nvidia Corporation for the Titan Xp GPU 
that was used throughout development of the software.


Summary
=======
MARGE is an all-in-one package to generate exoplanet spectra across a 
defined parameter space, process the output, and train machine learning (ML) 
models as a fast approximation to radiative transfer (RT).

MARGE comes with complete documentation as well as a user manual to assist 
in its usage.  Users can find the latest MARGE User Manual at 
https://exosports.github.io/MARGE/doc/MARGE_User_Manual.html.

MARGE is an open-source project that welcomes improvements from the community 
to be submitted via pull requests on Github.  To be accepted, such improvements 
must be generally consistent with the existing coding style, and all changes 
must be updated in associated documentation.

MARGE is released under the Reproducible Research Software License.  Users are 
free to use the software for personal reasons.  Users publishing results from 
MARGE or modified versions of it in a peer-reviewed journal are required to 
publicly release all necessary code/data to reproduce the published work.  
Modified versions of MARGE are required to also be released open source under 
the same Reproducible Research Software License.  For more details, see the 
text of the license.


Files & Directories
===================
MARGE contains various files and directories, described here.

doc/            - Contains documentation for MARGE.  The User Manual contains 
                  the information in the README with more detail, as well as a 
                  walkthrough for setup and running an example.
environment.yml - Conda environment for MARGE.
example/        - Contains example configuration files for MARGE.
lib/            - Contains the classes and functions of MARGE.
  callbacks.py  - Contains Keras Callback classes.
  datagen/      - Contains files related to data generation
    BART/       - Files necessary for data generation/processing with BART.
      BARTfunc.py - Modified MCMC function that saves out full spectra.
    datagen.py  - Contains functions for generating/processing data with BART.
    datagen_pypsg.py - As above, but for the pypsg format 
             (see https://gitlab.com/frontierdevelopmentlab/astrobiology/pypsg).
  loader.py     - Contains functions related to loading processed data.
  NN.py         - Contains the NN model class, and a driver function for model 
                  training/validating/testing.
  plotter.py    - Contains plotting functions.
  stats.py      - Contains functions related to statistics.
  utils.py      - Contains utiity functions used for internal processing.
Makefile        - Handles setting up BART, and creating a TLI file.
MARGE.py        - The executable driver for MARGE. Accepts a configuration file.
modules/        - Contains necessary modules for data generation.
                  Initially empty; will contain BART via the Makefile.
README          - This file!
requirements.txt- Additional packages for the conda environment installed via 
                  pip.

Note that all .py files have complete documentation; consult a specific file 
for more details.


Installation
============
At present, MARGE uses BART, the Bayesian Atmospheric Radiative Transfer
(BART) code, for data generation.  For details, see the associated user manual 
in the modules/BART/doc directory.

To build the included conda environment, enter 
    conda env create -f environment.yml
into a terminal.  Then, activate it via 
    conda activate marge

If running MARGE in data generation mode, users must first compile BART's 
submodules. This can be done by entering
    make bart
into a terminal while in the MARGE directory.
This requires SWIG.  If the user does not have SWIG installed, they 
can easily add SWIG to the current conda environment by entering
    conda install swig
into the terminal.

Note that the supplied conda environment does not include SWIG.  Also note 
that it uses tensorflow-gpu, which assumes the user has Nvidia drivers 
installed.  If this is not the case, after building the environment, non-GPU 
users must enter 
    conda remove -n marge tensorflow-gpu
and follow the prompts.  This will remove the requirement for Nvidia drivers.

For more details, see the MARGE User Manual.


Executing MARGE
===============
MARGE has 2 execution modes: data generation, and ML model training.

As mentioned above, data generation is handled via BART.
Before data generation can begin, users must first generate a TLI file via 
pylineread, BART's line list-processing code.  To do so, input
    make TLI cfile=<path/to/configration file>
to call pylineread with a configuration file.  For details on how to set this 
up, see the pylineread directory within BART's RT module, transit.

After generating a TLI file, data generation can begin.  After setting up the 
MARGE configuration file, run MARGE via
    ./MARGE.py <path/to/configuration file>

Note that data generation requires a BART configuration file.  In order to save 
the data, the BART config file must have the 'savemodel' parameter set to 
<name>.npy.  This config file must also have the 'modelper' parameter set; this 
controls the number of iterations per chain to save out each model file, to 
ensure that the evaluated models will fit in memory.  If the models will fit in 
memory as a single array, set modelper=0 to save out a single file with all 
models.  If modelper>0, the generated .npy files will be saved to a subdirectory
within BART's output directory, named <name> from the 'savemodel' parameter.

For a detailed example walkthrough, see the README in the example/ subdirectory 
or the user manual.


ML Model Training: Data
-----------------------
If training an ML model using MARGE, data must be formatted a particular way.
At present, MARGE has a single allowed format, which is a 2D array saved as a 
.NPY file.  Each row corresponds to a unique case.  Each column corresponds to 
the inputs followed by the outputs, e.g.,
[input0, input1, input2, ..., output0, output1, output2, ...]

Users are encouraged to introduce additions to loader.py to allow additional 
data formats, which would be specified in the configuration file.  
Please submit any such modifications via pull requests on Github.


Setting Up a MARGE Configuration File
=====================================
A MARGE configuration file contains many options, which are detailed in this 
section.  For an example, see config.cfg in the example subdirectory, and the 
associated README with instructions on execution.

Directories
-----------
inputdir   : str.  Directory containing MARGE inputs.
outputdir  : str.  Directory containing MARGE outputs.
plotdir    : str.  Directory to save plots. 
                   If relative path, subdirectory with respect to `outputdir`.
datadir    : str.  Directory to store generated data. 
                   If relative path, subdirectory with respect to `outputdir`.
preddir    : str.  Directory to store validation and test set predictions and 
                   true values. 
                   If relative path, subdirectory with respect to `outputdir`.


Datagen Parameters
------------------
datagen    : bool. Determines whether to generate data.
datagenfile: str.  File containing functions to generate and/or process 
                   data.  Do NOT include the .py extension!
                   See the files in the lib/datagen directory for examples.
                   User-specified files must have identically-named 
                   functions with an identical set of inputs.  If an 
                   additional input is required, the user must modify the 
                   code in MARGE.py accordingly.  
                   Please submit a pull request if this occurs!
cfile      : str.  Name of the configuration file for data generation.
                   Can be absolute path, or relative path to `inputdir`.
processdat : bool. Determines whether to process the generated data.
preservedat: bool. Determines whether to preserve the unprocessed data after 
                   completing data processing.
                   Note: if False, it will PERMANENTLY DELETE the original, 
                         unprocessed data!



Neural Network (NN) Parameters
------------------------------
nnmodel    : bool. Determines whether to use an NN model.
resume     : bool. Determines whether to resume training a previously-trained 
                   model.
seed       : int.  Random seed.
trainflag  : bool. Determines whether to train    an NN model.
validflag  : bool. Determines whether to validate an NN model.
testflag   : bool. Determines whether to test     an NN model.

TFR_file   : str.  Prefix for the TFRecords files to be created.
buffer     : int.  Number of batches to pre-load into memory.
ncores     : int.  Number of CPU cores to use to load the data in parallel.

normalize  : bool. Determines whether to normalize the data by its mean and 
                   standard deviation.
scale      : bool. Determines whether to scale the data to be within a range.
scalelims  : floats. The min, max of the range of the scaled data.
                     It is recommended to use -1, 1

weight_file: str.  File containing NN model weights.
                   NOTE: MUST end in .h5
input_dim   : int.  Dimensionality of the input  to the NN.
output_dim  : int.  Dimensionality of the output of the NN.
ilog        : bool. Determines whether to take the log10 of the input  data.
                    Alternatively, specify comma-, space-, or newline-separated 
                    integers to selectively take the log of certain inputs.
olog        : bool. Determines whether to take the log10 of the output data.
                    Alternatively, specify comma-, space-, or newline-separated 
                    integers to selectively take the log of certain outputs.

gridsearch  : bool. Determines whether to perform a gridsearch over 
                    architectures.
architectures: strings. Name/identifier for a given model architecture.  
                        Names must not include spaces!
                        For multiple architectures, separate with a space or 
                        indented newlines.
nodes : ints. Number of nodes per layer with nodes.
layers: strings. Type of each hidden layer.  
                 Options: dense, conv1d, maxpool1d, avgpool1d, flatten, dropout
lay_params: list. Parameters for each layer (e.g., kernel size). 
                  For no parameter or the default, use None.
activations: strings. Type of activation function per layer with nodes.
                      Options: linear, relu, leakyrelu, elu, tanh, sigmoid, 
                               exponential, softsign, softplus, softmax
act_params: list. Parameters for each activation.  
                  Use None for no parameter or the default value.
                  Values specified only apply to LeakyReLU and ELU.

epochs     : int.  Maximum number of iterations through the training data set.
patience   : int.  Early-stopping criteria; stops training after `patience` 
                   epochs of no improvement in validation loss.
batch_size : int.  Mini-batch size for training/validation steps.

lengthscale: float. Minimum learning rate.
max_lr     : float. Maximum learning rate.
clr_mode   : str.   Specifies the function to use for a cyclical learning rate 
                    (CLR).
                    'triangular' linearly increases from `lengthscale` to 
                    `max_lr` over `clr_steps` iterations, then decreases.
                    'triangular2' performs similar to `triangular', except that 
                    the `max_lr` value is decreased by 2 every complete cycle,
                    i.e., 2 * `clr_steps`.
                    'exp_range' performs similar to 'triangular2', except that 
                    the amplitude decreases according to an exponential based 
                    on the epoch number, rather than the CLR cycle.
clr_steps  : int.   Number of steps through a half-cycle of the learning rate.
                    E.g., if using clr_mode = 'triangular' and clr_steps = 4, 
                    Every 8 epochs will have the same learning rate.
                    It is recommended to use an even value.
                    For more details, see Smith (2015), Cyclical Learning Rates 
                    for Training Neural Networks.


Plotting Parameters
-------------------
xvals       : str.  .NPY file with the x-axis values corresponding to 
                    the NN output.
xlabel      : str.  X-axis label for plots.
ylabel      : str.  Y-axis label for plots.
plot_cases : ints. Specifies which cases in the test set should be 
                   plotted vs. the true spectrum.
                   Note: must be separated by spaces or indented newlines.


Statistics Files
----------------
fmean      : str.  File name containing the mean of each input/output.
                   Assumed to be in `inputdir`.
fstdev     : str.  File name containing the standard deviation of each 
                   input/output.
                   Assumed to be in `inputdir`.
fmin       : str.  File name containing the minimum of each input/output.
                   Assumed to be in `inputdir`.
fmax       : str.  File name containing the maximum of each input/output.
                   Assumed to be in `inputdir`.
rmse_file  : str.  Prefix for the file to be saved containing the root mean 
                   squared error of predictions on the validation \& test data.
                   Saved into `outputdir`.
r2_file    : str.  Prefix for the file to be saved containing the 
                   coefficient of determination (R\^2) of predictions on the 
                   validation \& test data.
                   Saved into `outputdir`.
filters : strings. (optional) Paths/to/filter files that define a 
                   bandpass over `xvals`. 
                   Two columns, wavelength and transmission.
                   Used to compute statistics for band-integrated values.
filt2um : float. (default: 1.0) Factor to convert the filter files' 
                 X values to microns. 
                 Only used if `filters` is specified.


Versions
========
MARGE was developed on a Unix/Linux machine using the following 
versions of packages:
 - Python 3.7.2
 - Keras 2.2.4
 - Numpy 1.16.2
 - Matplotlib 3.0.2
 - mpi4py 3.0.3
 - Scipy 1.2.1
 - sklearn 0.20.2
 - Tensorflow 1.13.1
 - CUDA 9.1.85
 - cuDNN 7.5.00
 - ONNX 1.6.0
 - keras2onnx 1.6.1
 - onnx2keras 0.0.22

MARGE also requires a working MPI distribution if using BART for 
data generation.  MARGE was developed using MPICH version 3.3.2.


Be kind
=======
Please cite this paper if you found this package useful for your research:

Himes et al. 2022, PSJ, 3, 91
https://iopscience.iop.org/article/10.3847/PSJ/abe3fd/meta

@ARTICLE{2022PSJ.....3...91H,
       author = {{Himes}, Michael D. and {Harrington}, Joseph and {Cobb}, Adam D. and {G{\"u}ne{\c{s}} Baydin}, At{\i}l{\i}m and {Soboczenski}, Frank and {O'Beirne}, Molly D. and {Zorzan}, Simone and {Wright}, David C. and {Scheffer}, Zacchaeus and {Domagal-Goldman}, Shawn D. and {Arney}, Giada N.},
        title = "{Accurate Machine-learning Atmospheric Retrieval via a Neural-network Surrogate Model for Radiative Transfer}",
      journal = {\psj},
     keywords = {Exoplanet atmospheres, Bayesian statistics, Posterior distribution, Convolutional neural networks, Neural networks, 487, 1900, 1926, 1938, 1933},
         year = 2022,
        month = apr,
       volume = {3},
       number = {4},
          eid = {91},
        pages = {91},
          doi = {10.3847/PSJ/abe3fd},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2022PSJ.....3...91H},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

Thanks!