
GP model for predicting relaxation spectra


1. INPUT DATA, x

I am now considering bidisperse blends, where the individual components may themselves be polydisperse. Thus, the input vector is x = (Z1, Z2, pd1, pd2, w1) instead of the previous x = (Z1, Z2, w1).

Z1, Z2 are confined to [5, 50]
pd1, pd2 are confined to [1.01, 1.5]
w1 lies in (0, 1], with Z1 > Z2 as before
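
For concreteness, here is a minimal sketch of drawing one admissible input vector under these bounds (the helper name and the rejection step for Z1 > Z2 are illustrative, not code from the repo):

    import numpy as np

    rng = np.random.default_rng()

    def draw_x():
        """Draw one admissible x = (Z1, Z2, pd1, pd2, w1); illustrative only."""
        while True:
            Z1, Z2 = rng.uniform(5.0, 50.0, size=2)
            if Z1 > Z2:                          # enforce the ordering constraint
                break
        pd1, pd2 = rng.uniform(1.01, 1.5, size=2)
        w1 = rng.uniform(0.0, 1.0)               # (0, 1]; endpoint handling glossed over
        return np.array([Z1, Z2, pd1, pd2, w1])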


********************************************
2. OUTSIDE PROGRAMS

In principle, you don't have to touch either of these programs.

1. tdd.py - has the simulation model (TDD-DR). Since it is based on double reptation, it is relatively fast. I ditched the idea of using the slip link model, because it takes too long for polydisperse samples.

This program generates (t, phi)

2. contSpec.py - takes (t, phi) via the file "relax.dat" and prints out a spectrum "h.dat"

parameters for the program are set through the input file "inpReSpect.dat"
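
For orientation, the (t, phi) -> H(s) handshake between the two programs can be scripted; a sketch, assuming contSpec.py reads "relax.dat" and writes "h.dat" in the working directory (the two-column file layouts are my assumption):

    import subprocess
    import numpy as np

    # (t, phi) as produced by tdd.py; placeholder arrays here
    t = np.logspace(-2, 4, 100)
    phi = np.exp(-t / 10.0)

    np.savetxt("relax.dat", np.column_stack([t, phi]))  # input for contSpec.py
    subprocess.run(["python3", "contSpec.py"], check=True)
    s, H = np.loadtxt("h.dat", unpack=True)             # recovered spectrum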

*********************************************


3. GP routine

I split the Jupyter notebook into four parts.

1. genTrain.py: generates the training data and stores it in TrainData/
- xtrain.dat contains the samples, and h*.dat are corresponding spectra in order
- note that internally I divide Z1 and Z2 by Zmax = 50, so that all elements of "x" are of order 1.
- TASK FOR PANKAJ: right now I am using uniform sampling. You should get it to use Latin hypercube sampling (LHS), and save the results of all the calculations permanently. A sketch of LHS follows.
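
A minimal sketch of Latin hypercube sampling with scipy (bounds from Section 1; the Z1 > Z2 filter and the Zmax normalization follow the notes above):

    import numpy as np
    from scipy.stats import qmc

    sampler = qmc.LatinHypercube(d=5, seed=0)
    u = sampler.random(n=200)                      # points in [0, 1]^5
    lo = [5.0, 5.0, 1.01, 1.01, 0.0]               # (Z1, Z2, pd1, pd2, w1)
    hi = [50.0, 50.0, 1.5, 1.5, 1.0]
    x = qmc.scale(u, lo, hi)
    x = x[x[:, 0] > x[:, 1]]                       # keep Z1 > Z2 (~half survive)
    x[:, :2] /= 50.0                               # divide Z1, Z2 by Zmax = 50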


2. commonGP.py
- this contains functions required during training of the hyperparameters and during predictions. This module is imported by the following two scripts.

3. trainHyper.py: reads TrainData/ and fits the values of the hyperparameters [7 parameters now, due to pd]
- also I run 10 replicas during the fitting and pick the best one.
- this is still the rate-limiting step
- TASK FOR PANKAJ: I haven't incorporated the Jacobian, so the calculation is somewhat slow right now. The closed-form gradient is sketched below.
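
For reference, the gradient of the GP log marginal likelihood is available in closed form, so the Jacobian does not require numerical differentiation. A sketch of the standard expressions (K is the kernel matrix at the current hyperparameters, dK_list its derivatives with respect to each one; not code from the repo):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def log_marg_lik_and_grad(K, dK_list, y):
        """Standard GP identities: lml and d(lml)/d(theta_i)."""
        n = len(y)
        L = cho_factor(K, lower=True)
        alpha = cho_solve(L, y)                            # K^{-1} y
        lml = (-0.5 * y @ alpha
               - np.sum(np.log(np.diag(L[0])))             # -1/2 log|K|
               - 0.5 * n * np.log(2 * np.pi))
        Kinv = cho_solve(L, np.eye(n))
        grad = np.array([0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)
                         for dK in dK_list])
        return lml, grad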

4. predictGP.py: reads TrainData/ and also "TrainData/hyper.dat" and makes a prediction.

A typical sequence involves:
python3 genTrain.py [adjust sampling]
python3 trainHyper.py 
python3 predictGP.py [adjust test point]



################################################### Updated readme ######################################


I have included a description of the functions, the input parameters, and how to run in every .py script, and below.

Following are the scripts:
1) CommonGP.py => Utilities for GP (few minor changes from your version)
2) ContSpec.py => Helps find the continuous spectrum (unchanged)
3) genTest.py  => Routine to generate test datasets (new script)
4) genTrain.py => Routine to generate training datasets (major change from your version; generates Figure 2 of the draft)
5) Load_Workspace.py => Routine that handles loading, saving, and clearing of the workspace (new script)
6) plotParam.py => Plots optimized parameters vs n (new script; generates Figure 5 of the draft)
7) predictGP.py => Prediction routines (moderate change from your version)
8) profiler.py  => Routine to profile code (new script)
9) tdd.py       => Implements the double reptation model (unchanged script)
10) trainHyper.py => Implements the training routine (moderate change from the original script)
11) Visualize_DF_and_CP-V2.py => Decomposes DF and CP and plots them (new script; generates Figure 4 of the draft)
12) Visualize_Hvs&PhivsT.py => Plots a figure with 3 subplots; see the script for a description (new script; generates Figure 3 of the draft)
13) visualize_spectrum.py => Plots H vs s for different polydispersities (new script; generates Figure 1 of the draft)
14) Visualize_TestErrorAnalysis.py => Runs and plots the mean and median absolute deviation for the test error analysis (new script; generates Figure 7 of the draft)


###### Scripts Description #################



--> CommonGP.py (Utilities for GP; few minor changes from your version)

 This is a python3 script which takes in a bidisperse blend: Z = [x, x], pd = [x, x], w1 = x, and

  (1) runs TDD and outputs phi(t)
  (2) runs pyReSpect on a fixed "s" grid and outputs H(s)

 Note: this new version of commonGP.py has two additional functions, kernal_derivative and partial_derivative_R1R2, in case we want to work with the Jacobian. Currently we are not using the Jacobian, so these functions don't matter.
       


--> ContSpec.py => Helps find the continuous spectrum (unchanged from the original version)

 March 2019 major update:
 (i)   added the plateau modulus G0 calculation (also in pyReSpect-time)
 (ii)  followed Hansen's Bayesian interpretation of Tikhonov regularization to extract p(lambda)
 (iii) simplified the L-curve routine (starting from high lambda and moving to low)
 (iv)  changed the definitions of rho2 and eta2 (no longer dividing by 1/n and 1/nl)


--> genTest.py => Routine to generate test datasets (new script)

  This script generates the test data. I am going to include the Xtest data in the zip file, so in general you don't have to run this script.

  This script takes the arguments:
    1. Sample_Size: the number of samples to draw by Latin hypercube sampling for the test data.
          Note: the generated samples are then subjected to the constraints of our problem, like M1 > M2.
                This eliminates about half of the generated samples, so the actual size is
                ~Sample_Size/2.
    2. Param_Value: values for (Z_min, Z_max, pdmin, pdmax); fixed for our case, but adjustable just in case.

    3. nreplica: the number of copies of the xtest data of size Sample_Size to generate.

    4. isSave: if True, save the data and remove the existing .dat files.

    5. isPlot: if True, plot the drawn samples.


  How to run:
            python genTest.py --Sample_Size 50 --isPlot True/False --isSave True/False --nreplica int_N
    
   
   
--> genTrain.py => Routine to generate training datasets (major change from the old version)


 This is a python3 script which generates a bidisperse blend, Z = [x, x], pd = [x, x], w1 = x, by either uniform or Latin hypercube sampling, and then:
               (1) runs TDD and outputs phi(t)
               (2) runs pyReSpect on a fixed "s" grid and outputs H(s)


  This script generates the training data (h*.dat and xtrain.dat). I am going to include the training data in the zip file, so in general you don't have to run this script.


  This script takes the arguments:
    1. Sample_Size: the number of samples to draw by Latin hypercube sampling for the training data.
          Note: the generated samples are then subjected to the constraints of our problem, like M1 > M2.
                This eliminates about half of the generated samples, so the actual size is
                ~Sample_Size/2.
    2. nw:  value for the parameter w
    3. npd: value for the parameter pd
    4. nz:  value for the parameter Z
    Note: the three arguments above (nw, npd, nz) are used only when generating uniformly sampled data, not Latin hypercube data.
    5. Param_Value: values for (Z_min, Z_max, pdmin, pdmax); fixed for our case, but adjustable just in case.
    6. isPlot: flag to plot the Latin hypercube or uniform samples.
    7. isSave: if True, save the *.dat files in TrainData/.

   How to run:
       python genTrain.py --Sample_Size 100 --isSave True/False --(other flags as desired)
       

--> Load_Workspace.py => Routine that handles loading, saving, and clearing of the workspace (new script)

     This python script takes care of loading, clearing, and saving the data in the TrainData workspace.
     The TrainData workspace is defined by all the .dat files present in the TrainData folder itself, not in its
     subfolders (50, 200, 1600, etc.). A sketch of the clear/load operations appears after this list.

     The script takes the arguments:
     1. ClearWorkspace: pass True if you want to remove all the .dat files from the TrainData folder.
                        Note: it is not going to clear any files in the subfolders.
                        To run: python Load_Workspace.py --ClearWorkspace True

     2. load: pass True if you want to load data into the workspace.
                        To run: python Load_Workspace.py --load True --Folder 50
                        Note: pass the Folder flag along with it, to specify from which subfolder to load the data.

     3. SaveCurrent: pass True if you want to save data to the workspace. Note: for SaveCurrent to work,
                        you need to pass the Folder flag along with it, to specify to which subfolder to save the data.
                        (Note: if you are saving to an existing subfolder, it will first clear all the data in that subfolder
                        and then save the current files.) This is generally run to save the data files (*.dat) and hyper.dat
                        after successful convergence of the optimization routine.
                        To run: python Load_Workspace.py --SaveCurrent True --Folder 100
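
  What ClearWorkspace and load do can be pictured with a few lines of standard-library code; this is a sketch of the idea under the folder layout described above, not the script itself:

    import glob, os, shutil

    def clear_workspace(root="TrainData"):
        for f in glob.glob(os.path.join(root, "*.dat")):   # top level only
            os.remove(f)

    def load_workspace(folder, root="TrainData"):
        clear_workspace(root)
        for f in glob.glob(os.path.join(root, str(folder), "*.dat")):
            shutil.copy(f, root)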

--> plotParam.py => Plots optimized parameters vs n (new script)

 This script plots the optimized parameters (SIGMA, GAMMA, TIME, LIKELIHOOD) for different training-data sizes (n).

 The script takes the argument:
      1. Folder_list: pass the subfolder names for which to plot the corresponding parameters. For example, if you
                 pass 50 100 200, the parameters will be plotted for these three n values.
 To run:
        python plotParam.py --Folder_list 50 100 200 400
  
  
--> predictGP.py => Prediction routines (moderate change from your version)


      This python script mainly contains the routines required to plot the predictions. The differences between
      this and the original version are: (1) I have included a function that generates phi, t back from h, s; and (2)
      a modification in the function plotPredTrian: basically, it now has a routine to check whether the
      original data lie within the range (prediction - 2.5*SD, prediction + 2.5*SD); see the sketch below.

      No arguments are required for this script.
      To run: python predictGP.py
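
      The coverage check in (2) amounts to testing whether each observation falls inside the +/- 2.5 SD band of the GP posterior; in sketch form (array names are illustrative):

        import numpy as np

        def coverage(h_true, h_pred, h_std, k=2.5):
            """Fraction of points inside the prediction +/- k*SD band."""
            inside = np.abs(h_true - h_pred) <= k * h_std
            return inside.mean()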
  
--> profiler.py => Routine to profile code (new script)


  This script runs a deterministic profiler on trainHyper.py.
  The script takes two arguments:
   1. whether to include the Jacobian in the trainHyper.py routine
   2. how many times the optimization process is run

  To run:
   make sure you have the correct data in the workspace; if not, run the following commands:
   1. python Load_Workspace.py --ClearWorkspace True
   2. python Load_Workspace.py --load True --Folder 50 (or whatever)
   3. python profiler.py --nreplica 10 or 5 (or your choice)

  I have also attached code for running a line profiler; if you wish to use line_profiler (another
  deterministic profiler), please comment out the main code and uncomment the section at the bottom.
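
  For orientation, cProfile is the standard-library deterministic profiler; a minimal sketch of profiling one training run (the trainHyper.main entry point is a placeholder, not the actual interface):

    import cProfile, pstats
    import trainHyper                        # assumes an importable entry point

    with cProfile.Profile() as pr:
        trainHyper.main()                    # placeholder: one optimization run
    pstats.Stats(pr).sort_stats("cumulative").print_stats(15)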



--> tdd.py => Implements the double reptation model (unchanged script)


  2/11/2020: Version 3, where I only use one method (Schieber's), and as the base case use float(Z),
             using integers only if requested.
  The program implements the double reptation model (DRM) for
  (a) blends of monodisperse fractions (phiFxn), given (w, Z)
  (b) a polydisperse blend with a logNormal distribution, given (Zw and pd)
      The latter module, plotPhiPD, works well only when pd >= 1.01 (i.e., at least mildly polydisperse);
      when pd is small, the distribution formally becomes a Dirac delta, leading to poor numerical integration.
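
  The mixing rule behind double reptation is worth stating: the blend relaxation function is the square of the weight-averaged square roots of the component relaxation functions. A sketch with a single-exponential placeholder for the monodisperse kernel (the Z**3.4 time scale is illustrative, not the TDD-DR kernel):

    import numpy as np

    def phi_blend(t, w, Z):
        """Double reptation mixing rule: phi(t) = (sum_i w_i * P_i(t)**0.5)**2."""
        tau = Z**3.4                               # illustrative reptation times
        P = np.exp(-t[None, :] / tau[:, None])     # placeholder component kernels
        return (w @ np.sqrt(P))**2

    t = np.logspace(-2, 6, 200)
    phi = phi_blend(t, w=np.array([0.3, 0.7]), Z=np.array([40.0, 10.0]))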


--> trainHyper.py => Implements the training routine (moderate change from the original script)

  This is a python3 script which takes in a bidisperse blend: Z = [x, x], pd = [x, x], w1 = x, and
   (1) reads in the input data from TrainData/ [xtrain.dat and h*.dat]
   (2) using functions from commonGP.py, optimizes the hyperparameters


  This python script takes the following arguments:
   1. include_jacobian: if yes, the Jacobian will be used in the optimization process
   2. nreplica: the number of times the optimization routine is run (see the multi-restart sketch below)
   3. isSave: if yes, the hyperparameters will be saved in the file TrainData/hyper.dat
   4. isDebug: if true, the decomposition plot will be plotted for each parameter; default is false

  Note: for this script to work, the corresponding .dat files need to be
       present in the TrainData/ workspace. Please run the following commands before running trainHyper.py:
       *   python Load_Workspace.py --ClearWorkspace True
       *   python Load_Workspace.py --load True --Folder (folder name)
  Then run trainHyper.py:
       * python trainHyper.py --nreplica 10 --isSave False
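
  The nreplica logic is a plain multi-restart loop: run the optimizer from several random starts and keep the best. A sketch (neg_lml stands in for the objective built from commonGP.py; the start distribution is illustrative):

    import numpy as np
    from scipy.optimize import minimize

    def fit_hyper(neg_lml, n_params=7, nreplica=10, seed=0):
        """Keep the replica with the lowest objective value."""
        rng = np.random.default_rng(seed)
        best = None
        for _ in range(nreplica):
            x0 = rng.uniform(-1.0, 1.0, size=n_params)   # random start
            res = minimize(neg_lml, x0, method="L-BFGS-B")
            if best is None or res.fun < best.fun:
                best = res
        return best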
       
 
--> Visualize_DF_and_CP-V2.py => Decomposes DF and CP and plots them (new script)

  This python script plots CP and DF for different parameters (see the sketch of the decomposition below).
  The script takes the following arguments:
    1. paramNum: plots the CP/DF decomposition for sigma_2, gamma_1, gamma_21, gamma_22, gamma_23, gamma_24 for
                 paramNum 0, 1, 2, 3, 4, 5, 6 respectively.
    2. Folder: (note: dtype = int) loads the training data of the given folder into the TrainData workspace.
    3. SpanLogLim_list: the x-axis limits for plotting the parameters.

  To run:
    For example, if you want to plot sigma_2 for folder 100 in the log limits -1 to -0.4, you would pass:
    python Visualize_DF_and_CP-V2.py --Folder 100 --paramNum 0 --SpanLogLim_list -1 -0.4

    Note: this script will clear the workspace and load the data for you (from the folder passed above);
    there is no need to clear and load the data yourself.
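
  Assuming DF and CP denote the data-fit term and complexity penalty of the GP log marginal likelihood (its standard decomposition, up to the constant term), the plotted quantities are, in sketch form:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def df_cp(K, y):
        """Split log p(y|X) into data fit (DF) and complexity penalty (CP)."""
        L = cho_factor(K, lower=True)
        DF = -0.5 * y @ cho_solve(L, y)          # -1/2 y^T K^{-1} y
        CP = -np.sum(np.log(np.diag(L[0])))      # -1/2 log|K|
        return DF, CP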
   
   
   
--> Visualize_Hvs&PhivsT.py => Plots a figure with 3 subplots; see the script for a description (new script)

   This script plots 3 subplots on a single figure instance:
   the first subplot plots the prediction on the training data;
   the second subplot plots the prediction (H vs s) on an unknown test sample;
   the third plots phi vs t for the same unknown data.

    To run:
     First make sure that the appropriate data is loaded in the workspace:
       python Load_Workspace.py --ClearWorkspace True
       python Load_Workspace.py --load True --Folder 50/100/200...
     Now run the main routine:
       python Visualize_HvsSPhivsT.py

  
--> visualize_spectrum.py => Plots H vs s for different polydispersities (new script)

   This script plots t, phi and the corresponding relaxation spectrum s, H for a given input.
   To run: python visualize_spectrum.py
   
--> Visualize_TestErrorAnalysis.py => Plots 3 subplots: the mean and median absolute deviation of phi and H for different test sets (sketched below), plus a third subplot showing how the uncertainty estimate varies with n (new script).


  This script takes the following arguments:
  1. idx: I have 10 different test sets in the TrainData folder, where each set has ~25 examples; the
          analysis will consider the first idx test sets. (Note: idx should be <= 10, as
          there are only 10 test datasets.)

  2. Folder_list: list of the data folders (n values) used for the analysis. For example, if you want to consider n = 50, 100,
                 200, and 400, you pass the Folder_list argument as (--Folder_list 50 100 200 400).

  3. Adverserial: if true, the test will be done on the adversarial test set. Adversarial tests contain examples which
                  lie at the extreme ends of our input space. Note: you don't need the idx flag if you are running
                  adversarial examples.

  To run the script, normal example:
          python Visualize_TestErrorAnalysis.py --idx 5 --Folder_list 50 100 200

          adversarial example:
            python Visualize_TestErrorAnalysis.py --Folder_list 50 100 200 --Adverserial True

   Note: all the loading and clearing of the workspace is automated in this script.
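
  The two error summaries are plain aggregates of the absolute prediction error over a test set; in sketch form (array names are illustrative):

    import numpy as np

    def error_summary(H_true, H_pred):
        """Mean and median absolute deviation across a test set."""
        abs_dev = np.abs(np.asarray(H_true) - np.asarray(H_pred))
        return abs_dev.mean(), np.median(abs_dev)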