rom-comma

Gaussian Process Regression, Global Sensitivity Analysis and Reduced Order Modelling by COMMA Research at The University of Sheffield

installation

Simply place the romcomma package in a folder included in PYTHONPATH (e.g. site-packages). Test the installation by running the installation_test module, from anywhere.
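
As a quick sanity check, something like the following can be run from any directory (a minimal sketch only — the exact test-module path may differ, so adjust it to match the package layout):

```python
# Minimal check that romcomma is importable, assuming it sits on PYTHONPATH.
# The installation_test module mentioned above can then be run from anywhere,
# e.g. python -m romcomma.installation_test (adjust the module path if needed).
import romcomma

print(romcomma.__file__)   # shows which copy of the package Python has found
```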

documentation

Dependencies are documented in pyproject.toml.

Full documentation for the romcomma package is published on readthedocs.

getting started

The following is not intended as a substitute for the full package documentation, but sketches the most essential, salient and practically important architectural features of the romcomma package. These are introduced by module (or package) name, in order of workflow priority, which presumably reflects a new user's first steps. Familiarity with Gaussian Processes (GPs), Global Sensitivity Analysis (GSA) and Reduction of Order by Marginalization (ROM) is largely assumed.

data

The data module contains classes for importing and storing the data being analyzed.

Import is from a csv file or a pandas DataFrame, in either case tabulated with precisely two header rows, as follows:

|                                   | Input                  | ... | Input                  | Output                 | ... | Output                 |
|                                   | X1                     | ... | XM                     | Y1                     | ... | YL                     |
| optional column of N row indices  | N rows of numeric data | ... | N rows of numeric data | N rows of numeric data | ... | N rows of numeric data |

Any first-line header may be used instead of "Input", so long as it is the same for every column to be treated as input. Any first-line header may be used instead of "Output", so long as it is the same for every column to be treated as output, and is different to the first-line header for inputs.

Any second-line headers may be used, without restriction. But internally, the romcomma package sees

  • An (N, M) design matrix of inputs called X.
  • An (N, L) design matrix of outputs called Y.
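
For example, a table in this shape can be built with a two-level pandas column MultiIndex; the column names and data below are purely illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative only: N = 4 rows, M = 2 input columns and L = 1 output column,
# laid out with the two header rows described above.
N = 4
columns = pd.MultiIndex.from_tuples(
    [("Input", "X1"), ("Input", "X2"), ("Output", "Y1")])
df = pd.DataFrame(np.random.rand(N, 3), columns=columns)

df.to_csv("sample.csv")                                     # two header rows plus a column of row indices
df = pd.read_csv("sample.csv", header=[0, 1], index_col=0)  # reads the same layout back
```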

The key assumption is that each input column is sampled from a uniform distribution Xi ~ U[min_i, max_i]. There is no claim that the methods used by this software have any validity at all if this assumption is violated.

If instead Xi follows some other distribution with cumulative distribution function CDF, the user should apply the probability integral transform, replacing input column i by CDF(Xi) ~ U[0, 1], prior to any data import.
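
As a sketch of that pre-processing step (assuming, purely for illustration, that the raw column happens to be standard normal):

```python
import numpy as np
from scipy import stats

# Probability integral transform: mapping a column through its own CDF yields
# a U[0, 1] column, which can then be tabulated and imported as usual.
x = np.random.standard_normal(1000)   # a raw input column that is N[0, 1], not uniform
x_uniform = stats.norm.cdf(x)         # CDF(Xi) ~ U[0, 1]
```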

Repository

Data is initially imported into a Repository object, which handles storage, retrieval and metadata for repo.data. Every Repository object writes to and reads from its own repo.folder.

Crucially, every Repository object exposes a parameter K which triggers K-fold cross-validation of the repo's data. Setting repo.K=K generates K Fold objects.
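
A sketch of this workflow follows. The import call is an assumption — Repository.from_csv is a hypothetical constructor name, so consult the readthedocs documentation for the actual import signature; the repo.K setter is as described above.

```python
from romcomma import data

repo = data.Repository.from_csv(folder="my_repo", csv="sample.csv")  # hypothetical constructor and arguments
repo.K = 5   # triggers 5-fold cross-validation, generating 5 Fold objects under my_repo
```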

Fold

All data analysis is performed on Fold objects. A Fold is really a kind of Repository, with the addition of

  • fold.test_data, stored in a table (Frame) of N/K rows. The test_data does not overlap the (training) data in this Fold, except when the parent repo.K=1 and the ersatz fold.test_data=fold.data is applied.
  • Normalization of inputs: All training and test data inputs are transformed from Xi ~ U[min_i, max_i] to the standard normal distribution Xi ~ N[0, 1], as demanded by the analyses implemented by romcomma (a sketch of this transform follows the list). Outputs are simultaneously normalized to zero mean and unit variance. Normalization exposes an undo method to return to the original variables used in the parent Repository.
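
The input normalization amounts to the following (an illustrative sketch, not the package's internal code):

```python
import numpy as np
from scipy import stats

def normalize_inputs(X: np.ndarray) -> np.ndarray:
    """Map an (N, M) design matrix of U[min_i, max_i] columns to N[0, 1] columns."""
    U = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # rescale each column to U[0, 1]
    U = np.clip(U, 1e-9, 1.0 - 1e-9)                           # keep away from the endpoints
    return stats.norm.ppf(U)                                   # inverse standard-normal CDF: ~ N[0, 1]
```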

The repo.K Folds are stored under the parent repo, in fold.folder=repo.folder\fold.k for k in range(repo.K). For the purposes of model integration, an additional, unvalidated, ersatz fold.K is included, whose N datapoints of (training) data equal its test_data, just as in the ersatz case repo.K=1 described above.
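
For concreteness, the resulting folder names under this convention look as follows (the repo folder name is hypothetical):

```python
from pathlib import Path

repo_folder = Path("my_repo")                                 # hypothetical repo.folder
K = 3
fold_folders = [repo_folder / f"fold.{k}" for k in range(K)]  # the K cross-validation Folds
ersatz_folder = repo_folder / f"fold.{K}"                     # the ersatz fold.K, trained and tested on all N datapoints
```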