adjtomo/seisflows

Decouple modules and global parameter/paths for a more Pythonic implementation

Closed this issue · 1 comments

bch0w commented

Background

Currently, SeisFlows is built upon sys.modules, where each sub-class (e.g., seisflows.workflow.inversion) is imported and then stored in sys.modules to be globally accessible to all other modules. Parameters and paths are also read in and shared in this way. Some pseudocode example of how this works looks like:

file_1.py

import sys
from seisflows.workflow.inversion import Inversion
from seisflows.tools.wrappers import loadyaml

sys.modules["seisflows_parameters"] = loadyaml("parameters.yaml")
sys.modules["seisflows_workflow"] = Inversion()

file_2.py

import sys
workflow = sys.modules["seisflows_workflow"]
PAR = sys.modules["seisflows_parameters"]
if PAR.BEGIN == 1:
    workflow.checkpoint()

With this implementation, you MUST import file_1.py BEFORE importing file_2.py.

The Problem

Although this approach makes SeisFlows flexible in how it passes information between modules, it simultaneously makes it difficult to use SeisFlows in a Pythonic manner. Some key issues include:

  • Unable to import SeisFlows modules without first establishing the entire workflow. This makes it impossible to use individual pieces of SeisFlows without first instantiating the entire package. In other words, you cannot run the following code:
from seisflows.optimize.LBFGS import LBFGS   # this will throw a KeyError 
  • Modules must be imported in a specific order so that they may access one another through sys.modules. This forces a mandatory set up procedure each time the code is to be used, and leads to undesirable in-function runtime import statements for modules which need to access each other out of order:

seisflows.preprocess.pyatoa.Pyatoa

    def setup_event_pyaflowa(self, source_name=None):
        # Late import because preprocess is loaded before optimize,
        # Optimize required to know which iteration/step_count we are at
        solver = sys.modules["seisflows_solver"]
        optimize = sys.modules["seisflows_optimize"]
        iteration = optimize.iter
        if source_name is None:
            source_name = solver.source_names[0]
        ...
  • Unit testing becomes a problem since each module may be reliant on others. E.g., you cannot test the optimization library alone, as it is coupled to the solver module (related issue: #12)

Goals

The proposed solution would decouple the package from the sys.modules implementation in an effort to improve the modularity of SeisFlows. The end goals of this would be that:

  • Each module (solver, optimize etc.) should be importable and usable on its own, without any reliance on the other modules.
from seisflows.optimize.LBFGS import LBFGS as optimize  # we want this to work
direction = optimize.compute_direction(gradient=g_new)
  • Modules should communicate to one another via function call I/O, rather than globally accessible imports.
from seisflows.solver.specfem2d import SPECFEM2D as solver
from seisflows.optimize.LBFGS import LBFGS as optimize

m_new = solver.merge(solver.load(model_path="path/to/model_init"))
status = optimize.check_model(model=m_new)
  • SeisFlows should be scriptable, i.e., we should not require a complicated start up procedure each time the package is used.

main.py

from seisflows.tools.wrappers import loadyaml
from seisflows.workflow.inversion import Inversion as workflow

parameters = loadyaml("parameters.yaml")
for i in range(parameters.begin, parameters.end):
     workflow.run()

Plan of Attack

This will involve a large code refactor of SeisFlows that will affect almost every file and effectively re-design the core code implementation. Similarly I foresee the public API changing on how SeisFlows will be implemented.

  • Drive this refactor with tests (TDD). Each rewritten component should be unit tested.
  • Remove sys.modules implementation. Any previous reliance on globally accessible modules should be rewritten as parameter passing via function inputs/outputs
  • Concentrate parameter, path and module setup in a main workflow script controlled by the User.
  • Remove global parameter accessibility in favor of initializing parameters via init() and a separate check() function which checks the validity of input parameters.
  • Decouple the hardcoded directory structure. Expose paths explicitly in a main workflow script, rather than having them be implicitly defined throughout the code.
  • Remove the complicated start up procedure that occurs during seisflows submit. Expose any simplified start up in the main workflow script.

Ideally I would like to have a main workflow script that looked something like this:

import os
from seisflows.config import Status  # Used to keep track of processes
from seisflows.config import custom_import
from seisflows.tools.wrappers import loadyaml

if __name__ == "__main__":
    workdir = os.getcwd()
    paths, parameters = loadyaml("parameters.yaml")
    statusfile = Status(path="status.txt")

    # Standardized 'setup' chunk
    system = custom_import(parameters.system)(**parameters)
    workflow = custom_import(parameters.workflow)(**parameters)
    solver = custom_import(parameters.solver)(**parameters)
    optimize = custom_import(parameters.optimize)(**parameters)
    preprocess = custom_import(parameters.preprocess)(**parameters)
    postprocess = custom_import(parameters.postprocess)(**parameters)

    # Check that the input parameters are set correctly
    for module in [system, workflow, solver, optimize, preprocess, postprocess]:
        module.check()

    for iteration in range(parameters.begin, parameters.end):
        system.run(solver.generate_data, path=paths.solver, status=statusfile)
        system.run(solver.generate_initial_model, paths=path.model_init, status=statusfile)
        system.run(solver.forward_simulation, paths=path.gradient, status=statusfile)
        
        gradient_new = solver.load_model(path=path.gradient)
        system.run(postprocess.smooth_gradient, gradient=gradient, status=statusfile)
        ...
bch0w commented

This issue has been resolved with #124 and #125