MiraGeoscience/geoapps

Running in a loop issue - losing workspace

RichardScottOZ opened this issue · 21 comments

This one is odd:-

I have a loop (@domfournier knows this one)

Running a small inversion, move onto the next

and get

Running Inversion
model has any nan: 0
=============================== Projected GNCG ===============================
  #     beta     phi_d     phi_m       f      |proj(x-g)-x|  LS    Comment   
-----------------------------------------------------------------------------
   0  1.28e+11  4.99e+09  2.21e-02  7.81e+09    1.34e+02      0              
  3%|| 2746/87957 [06:21<3:17:14,  7.20it/s]
Traceback (most recent call last):
  File "C:\Users\rscott\OneDrive - OZ Minerals\FunctionMiraLU.py", line 757, in <module>
    mrec, inv_group = run_inversion("Inversion_equivalent_source_REF", grav_survey, simulation, reference_model=ref_model)
  File "C:\Users\rscott\OneDrive - OZ Minerals\FunctionMiraLU.py", line 532, in run_inversion
    mrec = inv.run(m0)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 105, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\inversion.py", line 66, in run
    self.m = self.opt.minimize(self.invProb.evalFunction, self.invProb.model)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 105, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\optimization.py", line 398, in minimize
    self.doEndIteration(xt)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 93, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\code_utils.py", line 178, in wrapper
    return f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\optimization.py", line 701, in doEndIteration
    self.callback(xt)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\inversion.py", line 72, in _optCallback
    self.directiveList.call("endIter")
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\directives.py", line 190, in call
    getattr(r, ruleType)()
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\directives.py", line 1036, in endIter
    data = self.h5_object.add_data(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\objects\object_base.py", line 137, in add_data
    data_object = self.workspace.create_entity(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\workspace\workspace.py", line 307, in create_entity
    created_entity = self.create_data(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\workspace\workspace.py", line 276, in create_data
    created_entity = member(data_type, **entity_kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\float_data.py", line 29, in __init__
    super().__init__(data_type, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\numeric_data.py", line 30, in __init__
    super().__init__(data_type, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\data.py", line 48, in __init__
    super().__init__(**kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 55, in __init__
    setattr(self, "parent", kwargs["parent"])
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 249, in parent
    if self.parent is not None:
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 262, in parent
    self._parent = self.workspace.root
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 356, in workspace
    return self.entity_type.workspace
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity_type.py", line 166, in workspace
    assert workspace is not None
AssertionError

which is strange - it actually writes the iteration 0 to the geoh5 workspace ok - then fails - the first run works fine and writes to workspace, this is the second run through (which writes to another workspace - e.g. if 2745 in the first one, 2746 if the next one exists).

If it couldn't find it would have thought the first one would fail so something happening to lose inside that - or with a previous ws existing? Not debugged at all yet.

All precreated workspaces, too.

JIRA issue [GEOPY-355] was created.

    @property
    def workspace(self) -> "ws.Workspace":
        """
        :obj:`~geoh5py.workspace.workspace.Workspace` registering this type.
        """
        workspace = self._workspace()

        # Workspace should never be null, unless this is a dangling type object,
        # which means workspace has been deleted.
        assert workspace is not None
        return workspace

workspace is definitely there and presumably inv.run doesn't randomly kill them and resurrect them. :)

So from entity_type, definitely:-

Workspace is: <geoh5py.workspace.workspace.Workspace object at 0x000001E25DEE43A0>
Workspace is: None

and one I tested that one, can be opened in Analyst and in code

here

File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 105, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\inversion.py", line 66, in run
h5objectenditer: <geoh5py.objects.octree.Octree object at 0x00000222059F9FA0>
    self.m = self.opt.minimize(self.invProb.evalFunction, self.invProb.model)

so ws maybe disappearing when getting to the points part?

as the inversion run one earlier-
```python
3%|| 2375/87957 [00:20<00:10, 7952.00it/s]model has any nan: 0
=============================== Projected GNCG ===============================
  #     beta     phi_d     phi_m       f      |proj(x-g)-x|  LS    Comment   
-----------------------------------------------------------------------------
   0  1.28e+11  1.34e+10  2.51e-02  1.67e+10    1.21e+02      0              
h5objectenditer: <geoh5py.objects.octree.Octree object at 0x000002225ECC1A60>
h5objectenditer: <geoh5py.objects.points.Points object at 0x000002225EF27C40>
__exception__ = {tuple: 3} (<class 'AssertionError'>, AssertionError(), <traceback object at 0x00000171E78F6E00>)
self = {DataType} <geoh5py.data.data_type.DataType object at 0x00000171E50C4370>
workspace = {NoneType} None

So crapping out when trying to save the Points

  un_order = np.argsort(mesh._ubc_order)
<SimPEG.directives.Update_IRLS object at 0x000001E7BD6FC760> initialize
<SimPEG.directives.UpdateSensitivityWeights object at 0x000001E7BD18C4F0> initialize
<SimPEG.directives.UpdatePreconditioner object at 0x000001E7BD18CA60> initialize
<SimPEG.directives.BetaEstimate_ByEig object at 0x000001E7BD18CF10> initialize
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BD18CE20> initialize
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BD18C5B0> initialize
model has any nan: 0
=============================== Projected GNCG ===============================
  #     beta     phi_d     phi_m       f      |proj(x-g)-x|  LS    Comment   
-----------------------------------------------------------------------------
   0  1.23e+11  1.32e+11  5.99e-02  1.39e+11    1.02e+02      0              
<SimPEG.directives.Update_IRLS object at 0x000001E7BD6FC760> endIter
<SimPEG.directives.UpdateSensitivityWeights object at 0x000001E7BD18C4F0> endIter
<SimPEG.directives.UpdatePreconditioner object at 0x000001E7BD18CA60> endIter
<SimPEG.directives.BetaEstimate_ByEig object at 0x000001E7BD18CF10> endIter
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BD18CE20> endIter
h5objectenditer: <geoh5py.objects.octree.Octree object at 0x000001E7BA9E2CD0>
  3%|| 2770/87957 [07:53<4:02:49,  5.85it/s]
Traceback (most recent call last):

as per the previous one

<SimPEG.directives.Update_IRLS object at 0x000001E7BAA48BB0> initialize
<SimPEG.directives.UpdateSensitivityWeights object at 0x000001E7BCAB2190> initialize
<SimPEG.directives.UpdatePreconditioner object at 0x000001E7BCAB2160> initialize
<SimPEG.directives.BetaEstimate_ByEig object at 0x000001E7BCAB20A0> initialize
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BCAB2280> initialize
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BCAB22E0> initialize
  1%|| 1199/87957 [00:20<00:07, 11987.47it/s]model has any nan: 0
=============================== Projected GNCG ===============================
  #     beta     phi_d     phi_m       f      |proj(x-g)-x|  LS    Comment   
-----------------------------------------------------------------------------
   0  1.28e+11  1.39e+11  8.18e-02  1.49e+11    1.10e+02      0              
<SimPEG.directives.Update_IRLS object at 0x000001E7BAA48BB0> endIter
<SimPEG.directives.UpdateSensitivityWeights object at 0x000001E7BCAB2190> endIter
<SimPEG.directives.UpdatePreconditioner object at 0x000001E7BCAB2160> endIter
<SimPEG.directives.BetaEstimate_ByEig object at 0x000001E7BCAB20A0> endIter
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BCAB2280> endIter
h5objectenditer: <geoh5py.objects.octree.Octree object at 0x000001E7BA9EA4F0>
<SimPEG.directives.SaveIterationsGeoH5 object at 0x000001E7BCAB22E0> endIter
h5objectenditer: <geoh5py.objects.points.Points object at 0x000001E7BA9EF0A0>

and when we try a couple more in the loop

<SimPEG.directives.Update_IRLS object at 0x0000021D84C25C40> initialize
<SimPEG.directives.UpdateSensitivityWeights object at 0x0000021D84C25220> initialize
<SimPEG.directives.UpdatePreconditioner object at 0x0000021D84C25400> initialize
<SimPEG.directives.BetaEstimate_ByEig object at 0x0000021D84C25E50> initialize
<SimPEG.directives.SaveIterationsGeoH5 object at 0x0000021D84C25A30> initialize
h5objectinit: <geoh5py.objects.octree.Octree object at 0x0000021D81FE5D60>
<SimPEG.directives.SaveIterationsGeoH5 object at 0x0000021D84C25DF0> initialize
h5objectinit: <geoh5py.objects.points.Points object at 0x0000021D84B71D30>
model has any nan: 0
=============================== Projected GNCG ===============================
  #     beta     phi_d     phi_m       f      |proj(x-g)-x|  LS    Comment   
-----------------------------------------------------------------------------
   0  1.28e+11  1.33e+11  4.89e-02  1.39e+11    1.01e+02      0              
<SimPEG.directives.Update_IRLS object at 0x0000021D84C25C40> endIter
<SimPEG.directives.UpdateSensitivityWeights object at 0x0000021D84C25220> endIter
<SimPEG.directives.UpdatePreconditioner object at 0x0000021D84C25400> endIter
  3%|| 2772/87957 [07:20<3:45:30,  6.30it/s]
Traceback (most recent call last):
  File "C:\Users\rscott\OneDrive - OZ Minerals\Exploration2021\FunctionMiraLU_Main.py", line 763, in <module>
    mrec, inv_group = run_inversion("Inversion_equivalent_source_REF", grav_survey, simulation, reference_model=ref_model)
  File "C:\Users\rscott\OneDrive - OZ Minerals\Exploration2021\FunctionMiraLU_Main.py", line 538, in run_inversion
    mrec = inv.run(m0)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 105, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\inversion.py", line 66, in run
    self.m = self.opt.minimize(self.invProb.evalFunction, self.invProb.model)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 105, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\optimization.py", line 398, in minimize
    self.doEndIteration(xt)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\counter_utils.py", line 93, in wrapper
    out = f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\utils\code_utils.py", line 178, in wrapper
    return f(self, *args, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\optimization.py", line 701, in doEndIteration
    self.callback(xt)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\inversion.py", line 72, in _optCallback
    self.directiveList.call("endIter")
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\directives.py", line 191, in call
    getattr(r, ruleType)()
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\SimPEG\directives.py", line 1038, in endIter
    data = self.h5_object.add_data(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\objects\object_base.py", line 137, in add_data
    data_object = self.workspace.create_entity(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\workspace\workspace.py", line 307, in create_entity
    created_entity = self.create_data(
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\workspace\workspace.py", line 276, in create_data
    created_entity = member(data_type, **entity_kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\float_data.py", line 29, in __init__
<SimPEG.directives.BetaEstimate_ByEig object at 0x0000021D84C25E50> endIter
<SimPEG.directives.SaveIterationsGeoH5 object at 0x0000021D84C25A30> endIter
h5objectenditer: <geoh5py.objects.octree.Octree object at 0x0000021D81FE5D60>
    super().__init__(data_type, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\numeric_data.py", line 30, in __init__
    super().__init__(data_type, **kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\data\data.py", line 48, in __init__
    super().__init__(**kwargs)
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 55, in __init__
    setattr(self, "parent", kwargs["parent"])
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 249, in parent
    if self.parent is not None:
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 262, in parent
    self._parent = self.workspace.root
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity.py", line 356, in workspace
    return self.entity_type.workspace
  File "C:\Users\rscott\AppData\Local\Continuum\anaconda3\envs\geoapps\lib\site-packages\geoh5py\shared\entity_type.py", line 167, in workspace
    assert workspace is not None
AssertionError

so initialise ok - fail to end

Hi @RichardScottOZ
Really hard to tell like that. It would be easier to help if you had a small piece of code that reproduces the crash.

Sure... just not small in that sense, will see what I can do... pretty much that Inversion Workflow with reference model.. and the second time in a loop it is crashing in the first equivalent part.

@domfournier sent you an email with the details/code - and test data etc.

Still don't know exactly what is going on

However, tried running it in a threadpool - same problem.

Ran in a process pool - and ok - when I was looping before was calling the script from a loop - so thought this would probably work.

So something in that call directives/end iteration space that maybe has a thread memory reference issue (or something I don't understand, anyway).

yea sorry, I ll get to it tomorrow. True, could be an access right if the threads are trying to write at the same time. Might have to re-structure the run...

Maybe something Dask even is possible?

Thanks very much, be good to check speed of thread version vs process when going to do a lot.

A ThreadPool crash on unix, too - so not an OS quirk

Again, hard to tell without running in debug here. But from the error log, it looks like it is trying to write data that may belong to a workspace that no longer exists. Try to see if the script has pointers to the workspace of the previous run...

Ideally, you would want your parallel runs to be self-contained. i.e. Each worker copies the necessary entities to a local workspace, do its work, and copy the final result back at the end to a common repo. Or you do the merge at the end.

@RichardScottOZ did you get to the bottom of this?

No, not yet sorry, haven't had time to re-delve into debugging a lower level than I had got to as yet as above. So I have to investigate more

When I did the big job ran each one literally in a separate container.

It was breaking at a particular point in the directives.

Most likely on the SaveGeoh5 directives.

Yeah, it was as I recall - but I hadn't worked it why it was losing it at that point.