StanfordLegion/legion

Legion: profiler crashing

Closed this issue · 15 comments

The profiler seems to be crashing on the latest master:

Reading log file "prof_0.gz"...
Matched 10988545 objects
thread 'main' panicked at src/state.rs:3046:49:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:144:5
   3: legion_prof::state::State::create_task
   4: legion_prof::state::State::process_records
   5: legion_prof::main

There is a log file here: http://sapling2.stanford.edu/~seshu/s3d_ammonia/prof_0.gz

Here's the ProcID the lookup is failing on:

[src/state.rs:3047:29] proc_id = ProcID(
    2089670227099910147,
)
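
For context, the panic is the shape of an unchecked processor lookup. A minimal sketch of that pattern, assuming a map of processors keyed by ProcID (the names and the simplified create_task signature here are stand-ins for illustration, not the profiler's actual code):

use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ProcID(u64);

struct Proc;

struct State {
    procs: BTreeMap<ProcID, Proc>,
}

impl State {
    // A TaskInfo record names a processor, and the lookup assumes the
    // corresponding ProcDesc has already been processed. If the ProcDesc
    // appears later in the log, get_mut() returns None and unwrap() panics,
    // matching the backtrace above.
    fn create_task(&mut self, proc_id: ProcID) -> &mut Proc {
        self.procs.get_mut(&proc_id).unwrap()
    }
}

fn main() {
    let mut state = State { procs: BTreeMap::new() };
    // With no ProcDesc processed yet, this panics just like the report:
    state.create_task(ProcID(2089670227099910147));
}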

Here are all the ProcDesc records. Note the use of grep -n to give us line numbers.

$ cargo run --release --all-features -- dump prof_0.gz | grep -n ProcDesc
4765:{"ProcDesc":{"proc_id":2089672426123165700,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
4766:{"ProcDesc":{"proc_id":2089672426123165701,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
4767:{"ProcDesc":{"proc_id":2089672426123165696,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
4768:{"ProcDesc":{"proc_id":2089672426123165697,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
4769:{"ProcDesc":{"proc_id":2089672426123165698,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
4770:{"ProcDesc":{"proc_id":2089672426123165699,"kind":2,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099400:{"ProcDesc":{"proc_id":2089670227099910144,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099401:{"ProcDesc":{"proc_id":2089670227099910145,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099402:{"ProcDesc":{"proc_id":2089670227099910146,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099403:{"ProcDesc":{"proc_id":2089670227099910147,"kind":2,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099404:{"ProcDesc":{"proc_id":2089670227099910148,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099405:{"ProcDesc":{"proc_id":2089670227099910149,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099406:{"ProcDesc":{"proc_id":2089674625146421252,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099407:{"ProcDesc":{"proc_id":2089674625146421253,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099408:{"ProcDesc":{"proc_id":2089674625146421248,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099409:{"ProcDesc":{"proc_id":2089674625146421249,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099410:{"ProcDesc":{"proc_id":2089674625146421250,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099411:{"ProcDesc":{"proc_id":2089674625146421251,"kind":2,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099412:{"ProcDesc":{"proc_id":2089671326611537924,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099413:{"ProcDesc":{"proc_id":2089671326611537925,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099414:{"ProcDesc":{"proc_id":2089671326611537920,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099415:{"ProcDesc":{"proc_id":2089671326611537921,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099416:{"ProcDesc":{"proc_id":2089671326611537922,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
3099417:{"ProcDesc":{"proc_id":2089671326611537923,"kind":2,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211118:{"ProcDesc":{"proc_id":2089683421239443460,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211119:{"ProcDesc":{"proc_id":2089683421239443461,"kind":1,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211120:{"ProcDesc":{"proc_id":2089683421239443456,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211121:{"ProcDesc":{"proc_id":2089683421239443457,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211122:{"ProcDesc":{"proc_id":2089683421239443458,"kind":3,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}
7211123:{"ProcDesc":{"proc_id":2089683421239443459,"kind":2,"cuda_device_uuid":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}

The ProcDesc for processor 2089670227099910147 appears on line 3,099,403. (Note that legion_prof dump generates one line per record, so this is essentially record number 3 million and something.)

We can grep for records that refer to that ProcID:

$ cargo run --release --all-features -- dump prof_0.gz | grep -n 2089670227099910147 | head
1182:{"MapperName":{"mapper_id":0,"mapper_proc":2089670227099910147,"name":"Default Mapper on Processor 1d00000000000003"}}
1185:{"MapperName":{"mapper_id":0,"mapper_proc":2089670227099910147,"name":"RHST Mapper for Processor 1d00000000000003"}}
646011:{"TaskInfo":{"op_id":640,"task_id":1048630,"variant_id":1,"proc_id":2089670227099910147,"create":4911738313,"ready":4911738413,"start":4915593753,"stop":4951164397,"creator":9223372174303166485,"fevent":9223372243031031811}}
646015:{"TaskInfo":{"op_id":624,"task_id":1048630,"variant_id":1,"proc_id":2089670227099910147,"create":4911711072,"ready":4911711262,"start":4915548739,"stop":4951211295,"creator":9223372174301069348,"fevent":9223372243032080387}}
646019:{"TaskInfo":{"op_id":672,"task_id":1048630,"variant_id":1,"proc_id":2089670227099910147,"create":4966019877,"ready":4966020018,"start":4966020990,"stop":4969203870,"creator":9223372174306312202,"fevent":9223372311777771527}}
646025:{"TaskInfo":{"op_id":3248,"task_id":1048620,"variant_id":1,"proc_id":2089670227099910147,"create":5110106102,"ready":5243183320,"start":5250631188,"stop":5270754353,"creator":9223372174353498121,"fevent":9223372174360838154}}
646026:{"TaskInfo":{"op_id":2582784,"task_id":1048623,"variant_id":1,"proc_id":2089670227099910147,"create":8816965040,"ready":8825387784,"start":8825393986,"stop":8860030931,"creator":9223372176871129204,"fevent":9223372314512457884}}
646027:{"TaskInfo":{"op_id":2592720,"task_id":1048620,"variant_id":1,"proc_id":2089670227099910147,"create":9021410943,"ready":9021411033,"start":9021413628,"stop":9272239026,"creator":9223372314521895342,"fevent":9223372314434863111}}
646029:{"TaskInfo":{"op_id":2869728,"task_id":1048620,"variant_id":1,"proc_id":2089670227099910147,"create":10445185449,"ready":10547213269,"start":10547230131,"stop":10977769235,"creator":9223372179408683027,"fevent":9223372251993210881}}
646031:{"TaskInfo":{"op_id":3023200,"task_id":1048623,"variant_id":1,"proc_id":2089670227099910147,"create":10651526650,"ready":12342565681,"start":12342573706,"stop":12377272627,"creator":9223372322192228383,"fevent":9223372190068506625}}

We see TaskInfo records appearing as early as line 646,011.

So the question for @lightsighter is why we're getting ProcDesc logs after the corresponding TaskInfo logs. And for that matter, why don't we get all the processors logged all at once?

So the question for @lightsighter is why we're getting ProcDesc logs after the corresponding TaskInfo logs. And for that matter, why don't we get all the processors logged all at once?

You're never guaranteed to get the ProcDesc before the TaskInfo. If that has been happening, it is purely by accident. We log processors as they are referenced on each node because we don't want to record the entire machine in every log file; that would not be scalable. Instead, each node logs a processor the first time it is referenced, so that every processor appearing in a log file has a corresponding ProcDesc in that same file, in case only some of the log files are presented to the profiler.
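
A rough sketch of that log-on-first-reference idea, written in Rust purely for illustration (the real logging lives in Legion's C++ runtime, and these names are made up):

use std::collections::HashSet;

// Each node keeps track of which processors it has already described, and
// emits a ProcDesc record the first time a processor appears in any other
// record bound for that node's log file.
struct NodeLogger {
    described_procs: HashSet<u64>,
}

impl NodeLogger {
    fn reference_proc(&mut self, proc_id: u64) {
        if self.described_procs.insert(proc_id) {
            // First reference on this node: emit the ProcDesc so this log
            // file is self-describing even if the other nodes' files are
            // never handed to the profiler.
            self.emit_proc_desc(proc_id);
        }
    }

    fn emit_proc_desc(&mut self, _proc_id: u64) {
        // write the ProcDesc record to this node's log file
    }
}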

You're never guaranteed to get the ProcDesc before the TaskInfo. If that has been happening, it is purely by accident.

That has literally been the way the profiler has worked since the beginning.

We log processors as they are referenced on each node

Does a TaskInfo qualify as a "reference"? If so, how does this fail to satisfy the above property?

We've been down the road of deferring references to things in the past. It's a pretty big pain in the rear, and requires one of the following:

  1. Temporary storage for tasks because we cannot assume processors exist at the point where they are logged (i.e., more temporary data structures in process_record we have to pass around everywhere), or
  2. Allowing the usage of "incomplete" processors (i.e., more fields of Proc are going to be optional and filled in later, meaning more ugly .unwrap() calls all over the place), or
  3. Buffering the entire set of records and doing multiple passes over them (arguably the worst of all worlds: increased memory usage and a potentially large increase in running time)

All of the options have ugly tradeoffs and in my opinion it's far easier to handle if we can make sure ProcDesc comes before any references on the same node. (I'm not talking about cross-node references here, only references within a given node's log file.)
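
To make option (1) above concrete, here is a rough sketch of what deferring task records until their processor exists could look like (the types and helper names are hypothetical, not the profiler's actual ones):

use std::collections::HashMap;

struct TaskRecord; // stand-in for the parsed TaskInfo payload

struct State {
    // Tasks whose processor has not been described yet, keyed by proc_id.
    pending_tasks: HashMap<u64, Vec<TaskRecord>>,
}

impl State {
    fn process_task_info(&mut self, proc_id: u64, task: TaskRecord) {
        if self.proc_exists(proc_id) {
            self.attach_task(proc_id, task);
        } else {
            // Processor not seen yet: park the task until its ProcDesc shows up.
            self.pending_tasks.entry(proc_id).or_default().push(task);
        }
    }

    fn process_proc_desc(&mut self, proc_id: u64) {
        // ...create the processor, then replay anything that was waiting on it.
        if let Some(tasks) = self.pending_tasks.remove(&proc_id) {
            for task in tasks {
                self.attach_task(proc_id, task);
            }
        }
    }

    fn proc_exists(&self, _proc_id: u64) -> bool { false }
    fn attach_task(&mut self, _proc_id: u64, _task: TaskRecord) {}
}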

Can you write down somewhere all the places that you make that kind of assumption?

I checked the Python code; we allow "incomplete" processors in the Python script. Can we do the same thing for the Rust code as well?

I checked the Python code; we allow "incomplete" processors in the Python script. Can we do the same thing for the Rust code as well?

I already said it here: #1719 (comment)

It means making more things optional, being able to make fewer assumptions about what's set where, and generally needing to check more things in more places.

Python only gets away with this because the code is wildly unsafe and will break in unexpected ways if you fail to memorize the undocumented invariants.

Can you write down somewhere all the places that you make that kind of assumption?

I'm not sure if you're asking me to document the current behavior of the code, or try to enumerate some more general invariant.

Based on a cursory glance at the code, it appears the only places we make hard queries into the set of processors are when we process a TaskInfo, GPUTaskInfo, MetaInfo, MessageInfo, etc., that is, all the logger records that end up creating processor entries.

There is a fix for this in https://gitlab.com/StanfordLegion/legion/-/merge_requests/1389 . @lightsighter @eddy16112 see if you're happy with it.

This follows approach (2) from #1719 (comment). The kind field of Proc is now optional. This means you will need to unwrap() wherever it is used. Otherwise, tasks create the processor when they are parsed and everything happens as before.
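
Roughly, the shape of approach (2) as described above, with the kind field optional and processors created on first reference (the helper names and enum variants here are illustrative, not the exact code in the MR):

use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ProcID(u64);

#[derive(Clone, Copy)]
enum ProcKind {
    Cpu,
    Gpu,
    Util,
}

struct Proc {
    // None until this processor's ProcDesc record is actually seen.
    kind: Option<ProcKind>,
}

struct State {
    procs: BTreeMap<ProcID, Proc>,
}

impl State {
    // Any record that references a processor gets it, creating an
    // "incomplete" entry if the ProcDesc has not arrived yet.
    fn find_or_create_proc(&mut self, proc_id: ProcID) -> &mut Proc {
        self.procs.entry(proc_id).or_insert(Proc { kind: None })
    }

    // Whenever the ProcDesc does arrive, it fills in the kind.
    fn process_proc_desc(&mut self, proc_id: ProcID, kind: ProcKind) {
        self.find_or_create_proc(proc_id).kind = Some(kind);
    }
}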

I think that merge request is not going to be needed, as I can fix this one. It comes from the changes to deduplicate processor/memory logging. I'll have a merge request later today.

Here is a Legion fix for this case:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/1390

In general, I find maintaining these kinds of invariants very brittle because there's no way to enforce them. In this particular case, you now pay a performance penalty the first time you see a processor or memory, and the time it takes to write to the log file will show up in the lifetime of the task, which I find distasteful. But it is required in order to guarantee that processor and memory descriptions show up in the file in the right order when they're being used by parallel threads.

Now that we have both versions, we can directly compare the tradeoffs involved:

  1. A slight increase in the complexity of profiler state (less bad than I originally expected, since it's only the one field) and an additional invariant maintainers of the profiler will need to track, or
  2. A performance hit to (presumably) the first task to run on each processor to get the processor logged before any task is recorded on it, and invariant(s) the maintainers of the runtime will need to track.

MR 1389 ended up being less bad than I expected, so if you'd prefer to go that direction, I'm fine with it.

Update from the meeting today:

We decided to merge both fixes: the profiler fix makes it more tolerant of out-of-order logs, and the Legion-side fix is probably not too bad in practice. We can reevaluate later if we find a large performance impact.

Both fixes have been merged.

I think the legacy profile output is broken. It's now generating profiles with processor names like "Some(CPU) Proc 3", and you can't expand any of the processors in a profile.

See: http://sapling2.stanford.edu/~seshu/s3d_scaling/4ranks/pwave_x_1_ammonia/legion_prof_csv/?start=0&end=46088478.68393575&collapseAll=false&resolution=10

Please try https://gitlab.com/StanfordLegion/legion/-/merge_requests/1404 and let me know if it fixes the issue for you.

That works.