adamtuft/otter

Otter failing to read OTF2 file

Closed this issue · 13 comments

Hi,

I'm testing Otter with NemoLite2D's openMP task implementation.
I'm using python 3.9, otf2, llvm11, and run the application with:
OMP_TOOL_LIBRARIES=/home/achalk/otter/lib/libotter.so ./nemolite2d_omp.exe

I'm then trying to do
python3 -m otter trace/otter_trace.316382/otter_trace.316382.otf2
and otter is failing:

Loading OTF2 anchor file: trace/otter_trace.316382/otter_trace.316382.otf2
An unhandled python exception has occurred in an OTF2_GlobalEvtReaderCallback:
Traceback (most recent call last):
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/_otf2/GlobalEvtReaderCallbacks.py", line 50, in wrapper
    ret = func(location, time, py_userData, attribute_list, *args)
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/otf2/event_reader.py", line 143, in _enter
    self._append(events.Enter, *args)
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/otf2/event_reader.py", line 18, in _append
    event = event_type._construct(self.reader.definitions, time, *args)
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/otf2/events.py", line 106, in _construct
    return cls(time, registry.regions[region])
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/otf2/registry.py", line 120, in __getitem__
    return self._elements_by_ref[ref]
KeyError: 1
Traceback (most recent call last):
  File "/home/achalk/.INSTALL/python-3.9.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/achalk/.INSTALL/python-3.9.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/achalk/otter/src/python/otter/__main__.py", line 55, in <module>
    tr = trace.Archive(args.anchorfile, verbose=True)
  File "/home/achalk/otter/src/python/otter/trace.py", line 176, in __init__
    self.events = LocationEventMap(tr.events, self.attr)
  File "/home/achalk/otter/src/python/otter/trace.py", line 91, in __init__
    for l, e in events:
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/otf2/event_reader.py", line 46, in __iter__
    read_events = _otf2.GlobalEvtReader_ReadEvents(self._handle, self.batch_events)
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/_otf2/GlobalEvtReader.py", line 40, in GlobalEvtReader_ReadEvents
    c_ReadEvents(reader, recordsToRead, ctypes.byref(recordsRead))
  File "/home/achalk/.INSTALL/otf2/lib/python3.6/site-packages/_otf2/ErrorCodes.py", line 166, in HandleErrorCode
    raise Error(errcode)
_otf2.ErrorCodes.Error: INTERRUPTED_BY_CALLBACK (92): Record reading interrupted by reader callback

otf2-print otter_trace.316382.otf2 does correctly output stuff so the trace should be readable I assume?

Is there something else I need to do to be able to get it to analyse the file?

Hi Aidan,

Thanks for posting this issue. Are you able to share the trace you recorded and are trying to read please? I have seen similar errors before where python was unable to read the trace, but otf2-print appeared to read it ok. It turned out there were some invalid or missing definitions in the trace. Please can you report the result of the following command to see if that is the case here:

otf2-print -A trace/otter_trace.316382/otter_trace.316382.otf2 2>&1 | grep -in "error\|warning\|invalid"

Thanks

It might be - that command outputs a large number of lines:

194947:ENTER                                      0    31171575682257354  Region: INVALID <32498>
194949:LEAVE                                      0    31171575682353562  Region: INVALID <32498>
194953:ENTER                                      0    31171575682360654  Region: INVALID <32499>
194955:LEAVE                                      0    31171575682458195  Region: INVALID <32499>
194959:ENTER                                      0    31171575682459758  Region: INVALID <32500>
194961:LEAVE                                      0    31171575682554778  Region: INVALID <32500>

I can share the trace - everything should be in https://drive.google.com/file/d/1a4jvxqaFcS0M8mmF7VM_n4rAE1bB7Clh/view?usp=sharing

Thanks, I'll take a look at the trace. It would also be really helpful if you could re-compile Otter with all debug statements enabled and then share the contents of the log:

make clean
source otter-defs.sh
make
OMP_TOOL_LIBRARIES=/home/achalk/otter/lib/libotter.so ./nemolite2d_omp.exe 2>badtrace.log

EDIT: Would also be good to do this with the minimal problem size that reproduces the error (there are 65,000 warnings/errors in the log).

EDIT2: Having looked at the log, it seems that the region definitions for the explicit tasks are not being created or saved to the trace for some reason, so the task-enter/task-leave events cannot be associated with a region when the trace is read back in python.

EDIT3: The definitions only include a single region and a single location:

REGION 0 Name: "initial task 0" <102> (Aka. "" <0>), Descr.: "" <0>, Role: TASK, Paradigm: OPENMP, Flags: NONE, File: "" <0>, Begin: 0, End: 0
LOCATION 0 Name: "Thread 0" <103>, Type: CPU_THREAD, # Events: 97404, Group: "OMP Process" <0>

Ok - I've rerun with a smaller example which should only make 2 tasks in each region instead of 32. This still is reporting 5k invalids in otf2-print

I wonder if this is something to do with Fortran but I don't know how otter does the region definitions. Hopefully I can just upload the new logs here
badtrace.log
otter_trace.345738.tar.gz

Edit - never mind this code is mostly C++ with some linking to some fortran libraries i think

Thanks, I will most likely take a closer look at this over the weekend. Please could you point me at the source of the target application?

EDIT: already made an interesting observation - there are thread-begin, task-create and task-enter events being recorded, but no parallel-begin event that I can see. I would expect the target application to make use of at least 1 parallel region, is this not the case?

grep -in "\(event\)" badtrace.log gives:

31:[d] [on_ompt_callback_thread_begin   ] [t=0] (event) thread-begin
32:[d] [on_ompt_callback_implicit_task  ] [t=0] (event) initial-task-begin
36:[d] [trace_event_enter               ] [t=0] enter region 0x172ec70
38:[d] [on_ompt_callback_task_create    ] [t=0] (event) task-create
43:[d] [on_ompt_callback_task_schedule  ] [t=0] (event) task-schedule
44:[d] [trace_event_enter               ] [t=0] enter region 0x172f6d0
47:[d] [on_ompt_callback_task_schedule  ] [t=0] (event) task-schedule
49:[d] [trace_event_leave               ] [t=0] leave region 0x172f6d0
50:[d] [on_ompt_callback_task_create    ] [t=0] (event) task-create
55:[d] [on_ompt_callback_task_schedule  ] [t=0] (event) task-schedule
56:[d] [trace_event_enter               ] [t=0] enter region 0x182fb20
59:[d] [on_ompt_callback_task_schedule  ] [t=0] (event) task-schedule
...

Argh this might all be my fault, let me investigate - I had quickly converted this from ompss to OpenMP but may have messed up apologies!

Otter should probably handle no parallel-begin in some way (even if its just "No parallel begin found, exiting"), though I don't remember if its defined what a compiler/runtime has to do upon finding parallel OpenMP pragmas outside of a parallel region.

Edit: Ok, so those errors were definitely from compiling an ompss program with OpenMP (since ompss doesn't require definition of a parallel region). I've now built the OpenMP program correctly, with the parallel region and single, and am now getting

Processing taskwait nodes:
 node 5020 task(s)=[1]
  child_tasks={5, 7, 9, 11, 12, 14, 15, 16, 17, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36}
  node 5018: task=28 crt_ts=100469
Traceback (most recent call last):
  File "/home/achalk/.INSTALL/python-3.9.0/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/achalk/.INSTALL/python-3.9.0/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/achalk/otter/src/python/otter/__main__.py", line 450, in <module>
    taskleavenode, = g.vs.select(lambda v: v['event_type']=='task_leave' and v['unique_id']==taskcreatenode['unique_id'])
  File "/home/achalk/.INSTALL/python-3.9.0/lib/python3.9/site-packages/igraph/__init__.py", line 4544, in select
    vs = _VertexSeq.select(self, *args)
  File "/home/achalk/otter/src/python/otter/__main__.py", line 450, in <lambda>
    taskleavenode, = g.vs.select(lambda v: v['event_type']=='task_leave' and v['unique_id']==taskcreatenode['unique_id'])
NameError: name 'taskcreatenode' is not defined

These are the relevant files now I actually am compiling an OpenMP program...(sorry)

badtrace.log
otter_trace.351897.tar.gz

As for the code - its difficult for me to share the actual flang-compiling version of the libraries, as I don't have write access to the dependencies to upload recent version. The repo is available at https://github.com/stfc/PSycloneBench

The code I'm using is a slightly modified version of the benchmark in benchmarks/nemo/nemolite2d/manual_versions/psykal_cpp_ompss - the only changes are to convert the dependencies from OMPSS out(x, y, z, ...) to OpenMP's depend(out: x,y,z,...), and to wrap the c_invoke_time_step function with

#pragma omp parallel 
{
#pragma omp single
{
//....all the task code here
}
}

I can try to get the real code to you but it will only compile (to work with otter) with flang11 - if you need this just let me know what you're durham id is so I can email you it but I probably can't until Monday

That's great! So the trace appears to be readable in python now. I will have to check what the spec says about pragmas outside any parallel region, I thought I could rely on there always being an enclosing parallel region.

Are you using the very latest python code from main? I believe I've seen and resolved that python error already. Note that with the current python code you can use the graph_to_dot function to write to a .dot file. I typically invoke python with -i to get an interactive prompt at the end.

I have it on my to-do list to finalise the post-processing so this can all be done non-interactively from the command line.

No I'm not on the main branch - I was on the dev-trace-otf2 branch as Holger had said that was the main development branch - should i try the main branch instead?

Yes please, I've merged all the latest code into main. It occurs to me that I've been developing the python on a separate, local branch which is why you haven't been able to see the updated code 🤦‍♂️ that's my fault, I need to tidy up the branches at some point soon!

Yeah - that works now. Is the graph_to_dot the best way to manage/plot the output for larger graphs? The .svg files produced are not really viewable, I assume due to the number of tasks created in the system being too large for igraph to plot succesfully?

At the moment your options are to export to dot/graphml, or to manipulate the graph directly in python. I would like to implement the ability to easily reduce a highly-nested graph, or one with many sibling tasks, but that is down my list of priorities at the moment. If you export to dot/graphml you could import into a graph editing suite like yEd - I've had some luck doing so with dot files in yEd Live, though I haven't tried this on very large graphs yet.

I've marked this as closed since we found the cause of the errors and no changes to the code were required 👍