CASITA is a tool for automatic analysis of OTF2 trace files that have been 
generated with Score-P. It determines program activities with high impact on the
total program runtime and the load balancing. CASITA generates an OTF2 trace 
with additional information such as the critical path, waiting time, and the
cause of wait states. The same metrics are used to generate a summary profile 
which rates activities according to their potential to improve the program runtime
and the load balancing. A summary of inefficient patterns exposes waiting times
in the individual programming models and APIs.

Internally, CASITA constructs a distributed DAG in which each node represents an
event in time and each edge represents a dependency between events on different
locations (processes, threads, and CUDA streams). Events on the same location
have an implicit dependency through the happens-before relation. The local DAGs,
one per MPI process, are connected via remote edges. Only MPI, OpenMP, and CUDA
nodes are represented in the graph. Nevertheless, events from compiler
instrumentation are accounted for.
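
The graph model described above can be illustrated with a minimal C++ sketch.
This is not CASITA's implementation: the Edge type, the criticalPath function,
and the convention that a waiting edge carries weight 0 are assumptions made for
illustration only. The sketch also assumes that nodes are numbered in
topological order and that all dependencies, including the implicit ones on the
same location, are passed as explicit edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical dependency edge between two event nodes (indices into the DAG).
// weight is the time attributed to the edge; a waiting edge carries 0 here.
struct Edge {
    std::size_t from;
    std::size_t to;
    std::uint64_t weight;
};

// Returns the node indices of the heaviest path that ends at sink.
// Assumption: nodes are in topological order, i.e. from < to for every edge.
std::vector<std::size_t> criticalPath(std::size_t nodeCount,
                                      std::vector<Edge> edges,
                                      std::size_t sink) {
    assert(sink < nodeCount);
    const std::size_t none = static_cast<std::size_t>(-1);
    std::vector<std::uint64_t> dist(nodeCount, 0);
    std::vector<std::size_t> pred(nodeCount, none);

    // Process edges by ascending source so dist[from] is final when used.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.from < b.from; });
    for (const Edge& e : edges) {
        assert(e.from < e.to && e.to < nodeCount);
        const std::uint64_t candidate = dist[e.from] + e.weight;
        if (pred[e.to] == none || candidate > dist[e.to]) {
            dist[e.to] = candidate;
            pred[e.to] = e.from;
        }
    }

    // Walk the predecessor chain back from the sink, then reverse it.
    std::vector<std::size_t> path;
    for (std::size_t n = sink; n != none; n = pred[n]) {
        path.push_back(n);
    }
    std::reverse(path.begin(), path.end());
    return path;
}
```

In this sketch, a remote edge is simply an Edge whose endpoints belong to
different MPI processes. A distributed implementation would have to exchange
the path information between processes instead of using one shared vector.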

Publications:

"CASITA: A Tool for Identifying Critical Optimization Targets in Distributed 
Heterogeneous Applications"
http://dx.doi.org/10.1109/ICPPW.2014.35

"Scalable critical-path analysis and optimization guidance for hybrid MPI-CUDA 
applications"
http://dx.doi.org/10.1177/1094342016661865

"Critical-blame analysis for OpenMP 4.0 offloading on Intel Xeon Phi"
http://dx.doi.org/10.1016/j.jss.2015.12.050

"Integrating Critical-Blame Analysis for Heterogeneous Applications into the 
Score-P Workflow"
http://dx.doi.org/10.1007/978-3-319-16012-2_8

"Analyzing Offloading Inefficiencies in Scalable Heterogeneous Applications"
http://dx.doi.org/10.1007/978-3-319-67630-2_34


CASITA analysis requirements:

The MPI analysis is currently based on reenacting the MPI communication in the
forward and backward directions, which means that the respective communication
records have to be available in the trace. CASITA also needs the region enter 
and leave events of MPI communication functions. Currently, the MPI support is 
limited to (two-sided) point-to-point communication and blocking collectives.
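
To show why the region enter and leave events of both communication partners
are needed, the following C++ sketch computes the waiting time of a blocking
receive whose matching send starts late (a pattern commonly called "late
sender"). The Region type and the lateSenderWait function are hypothetical, and
the rule is a generic textbook formulation, not necessarily the exact rule
CASITA applies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical enter/leave timestamps of one MPI function call.
struct Region {
    std::uint64_t enter;
    std::uint64_t leave;
};

// Time the receiver spends waiting because the sender entered its send late.
// Returns 0 if the sender was already in the send when the receive started.
std::uint64_t lateSenderWait(const Region& send, const Region& recv) {
    assert(send.enter <= send.leave && recv.enter <= recv.leave);
    if (send.enter <= recv.enter) {
        return 0;
    }
    // Clamp to the receive region so the result never exceeds its duration.
    return std::min(send.enter, recv.leave) - recv.enter;
}
```

Because the sender's timestamp lives on another process, a rule like this can
only be evaluated after the communication has been replayed between the two
partners, which is what the communication records in the trace make possible.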

The OpenMP analysis is still based on the OPARI2 instrumentation. It requires 
the fork/join, parallel begin/end, and barrier begin/end records. Both the MPI
and the OpenMP analysis work with the default Score-P trace output.
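
As an illustration of what the barrier begin records make possible, the C++
sketch below derives a per-thread waiting time at one barrier: every thread
waits until the last thread has arrived. The function name barrierWaitTimes and
its input layout are assumptions for this example, not part of CASITA.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// enterTimes[i] is the barrier begin timestamp of thread i (hypothetical input).
// Returns, per thread, the time spent waiting for the last thread to arrive.
std::vector<std::uint64_t> barrierWaitTimes(
        const std::vector<std::uint64_t>& enterTimes) {
    std::vector<std::uint64_t> waits;
    if (enterTimes.empty()) {
        return waits;
    }
    const std::uint64_t last =
        *std::max_element(enterTimes.begin(), enterTimes.end());
    waits.reserve(enterTimes.size());
    for (std::uint64_t t : enterTimes) {
        waits.push_back(last - t);
    }
    return waits;
}
```

The thread with a waiting time of 0 is the one that arrived last and is
therefore the cause of the wait state of all other threads at that barrier.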

CUDA analysis has been supported since Score-P 1.3. The respective OTF2 trace file has
to contain the following information:
* Enter and leave events of CUDA driver API functions that synchronize with the
  device (including blocking CUDA memory copies and CUDA event queries) as well
  as CUDA kernel launch and CUDA event record
* Enter and leave events of kernel launches and kernels
* Kernel references (dependency information between launch and synchronization
  of kernels)
The minimum set of Score-P CUDA recording features is "driver,kernel,references",
which can be set with the environment variable SCOREP_CUDA_ENABLE.
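
A small pre-flight check for this setting could look like the following C++
sketch. The function hasMinimumCudaFeatures is hypothetical. It only
understands a literal comma-separated list of feature names without
whitespace, and it assumes that the Score-P variable is spelled
SCOREP_CUDA_ENABLE.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Checks whether a comma-separated feature list (as it would be assigned to
// SCOREP_CUDA_ENABLE) contains the three features named above.
// Assumption: plain feature names only, no whitespace and no aliases.
bool hasMinimumCudaFeatures(const std::string& value) {
    std::set<std::string> enabled;
    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        enabled.insert(item);
    }
    const std::vector<std::string> required = {"driver", "kernel", "references"};
    for (const std::string& name : required) {
        if (enabled.count(name) == 0) {
            return false;
        }
    }
    return true;
}
```

A caller could read the variable with std::getenv("SCOREP_CUDA_ENABLE") from
<cstdlib> and pass an empty string when the result is a null pointer.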

The OpenCL analysis also requires kernel dependencies, e.g. to detect the OpenCL
queue to which a kernel is enqueued or which is synchronized via clFinish. This
is currently implemented in a Score-P development branch. OpenACC analysis is
indirectly supported through the low-level paradigms CUDA and OpenCL.