CommDiagnostics

Usage

use CommDiagnostics;

This module provides support for reporting and counting communication operations between network-connected locales. The operations include various kinds of remote reads (GETs), remote writes (PUTs), and remote executions. Callers can request on-the-fly output each time a remote operation occurs, or count such operations as they occur and retrieve the counts later. The former gives more detailed information but has much more overhead. The latter has much less overhead but only provides aggregate information.

On-the-fly Reporting

All forms of communication reporting and counting are done between pairs of function calls that turn it on and off. On-the-fly reporting across all locales is done like this:

startVerboseComm();
// between start/stop calls, report comm ops initiated on any locale
stopVerboseComm();

On-the-fly reporting for just the calling locale is similar. Only the procedure names change:

startVerboseCommHere();
// between start/stop calls, report comm ops initiated on this locale
stopVerboseCommHere();
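
As a minimal sketch of the per-locale variant in a complete program, the following turns reporting on and off from inside an on block, so only operations initiated by the locale running that block are reported. The use of Locales[numLocales-1] is an illustrative choice that lets the program also run on a single locale, where no communication would be reported.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;                 // x is stored on locale 0

  on Locales[numLocales-1] {
    // Report only comm ops initiated on the locale running this block.
    startVerboseCommHere();
    x = x + 1;                    // remote access to x when numLocales > 1
    stopVerboseCommHere();
  }
}
```

Because reporting starts after the on block has already begun, the remote execution itself falls outside the reported window.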

In either case, the output produced consists of a line written to stdout for each communication operation. (Here stdout means the file associated with the process, not the Chapel channel with the same name.)

Consider this little example program:

use CommDiagnostics;
proc main() {
  startVerboseComm();
  var x: int = 1;
  on Locales(1) {     // should execute_on a blocking task onto locale 1
    x = x + 1;        // should invoke a remote put and a remote get
  }
  stopVerboseComm();
}

Executing this on two locales with the -nl 2 command line option results in the following output:

0: remote task created on 1
1: t.chpl:6: remote get from 0
1: t.chpl:6: remote put to 0

The initial number refers to the locale reporting the communication event. The file name and line number point to the place in the code that triggered the communication event. (For remote execute_ons, file name and line number information is not yet reported.)

Counting Communication Operations

Counting communication operations requires a few more calls than just reporting them does. In particular, the counts have to be retrieved after they are collected and, if they have been used previously, the internal counters have to be reset before counting is turned on. Counting across all locales is done like this:

// (optional) if we counted previously, reset the counters to zero
resetCommDiagnostics();
startCommDiagnostics();
// between start/stop calls, count comm ops initiated on any locale
stopCommDiagnostics();
// retrieve the counts and report the results
writeln(getCommDiagnostics());

Counting on just the calling locale is similar. Just as for on-the-fly reporting, only the procedure names change:

// (optional) if we counted previously, reset the counters to zero
resetCommDiagnosticsHere();
startCommDiagnosticsHere();
// between start/stop calls, count comm ops initiated on this locale
stopCommDiagnosticsHere();
// retrieve the counts and report the results
writeln(getCommDiagnosticsHere());

The optional call to reset the counters is only needed when a program collects counts more than once. In this case, the counters have to be set back to zero before starting the second and succeeding counting periods. By far the most common situation is that programs only collect communication counts once per run, in which case this step is not needed.
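
A minimal sketch of a program that counts twice, and therefore resets the counters before the second period, might look like this. The choice of Locales[numLocales-1] and the "first"/"second" labels are illustrative assumptions, not part of the module.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  // First counting period: the counters start at zero, so no reset is needed.
  startCommDiagnostics();
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();
  writeln("first:  ", getCommDiagnostics());

  // Second counting period: reset first so that the counts from the
  // first period do not carry over into this one.
  resetCommDiagnostics();
  startCommDiagnostics();
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();
  writeln("second: ", getCommDiagnostics());
}
```

Without the call to resetCommDiagnostics(), the second report would be expected to include the operations counted during the first period as well.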

Note that the same internal mechanisms and counters are used for counting on all locales and counting on just the calling locale, so trying to do both at once may lead to surprising turn-on/turn-off behavior and/or incorrect results.

Consider this little example program:

use CommDiagnostics;
proc main() {
  startCommDiagnostics();
  var x: int = 1;
  on Locales(1) {     // should execute_on a blocking task onto locale 1
    x = x + 1;        // should invoke a remote put and a remote get
  }
  stopCommDiagnostics();
  writeln(getCommDiagnostics());
}

Executing this on two locales with the -nl 2 command line option results in the following output:

(get = 0, get_nb = 0, put = 0, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, execute_on = 1, execute_on_fast = 0, execute_on_nb = 0) (get = 1, get_nb = 0, put = 1, put_nb = 0, test_nb = 0, wait_nb = 0, try_nb = 0, execute_on = 0, execute_on_fast = 0, execute_on_nb = 0)

The first parenthesized group contains the counts for locale 0, and the second contains the counts for locale 1. Each group counts the operations initiated on that locale. So, for the instrumented section of this program we can say that locale 0 initiated one remote execute_on, and locale 1 initiated one remote get and one remote put.
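
Instead of printing the whole array, the per-locale records can be examined field by field. The following is a sketch under the assumption that the fields are the plain uint(64) values listed in the record description below; the variable names and the output format are illustrative only.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  startCommDiagnostics();
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();

  // getCommDiagnostics() returns one record per locale, indexed by LocaleSpace.
  const counts = getCommDiagnostics();

  var totalGets, totalPuts, totalOns: uint(64);
  for loc in LocaleSpace {
    const c = counts[loc];
    writeln("locale ", loc, ": get=", c.get, " put=", c.put,
            " execute_on=", c.execute_on);
    totalGets += c.get;
    totalPuts += c.put;
    totalOns  += c.execute_on;
  }
  writeln("total: get=", totalGets, " put=", totalPuts,
          " execute_on=", totalOns);
}
```

Summing a few fields this way gives a compact overview when a program runs on many locales and the full per-locale output would be hard to scan.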

Studying Communication During Module Initialization

It is hard for a programmer to determine exactly what happens during initialization or teardown of a module, because the code that runs then does so only implicitly, as a result of the declarations present. Even when that code can be identified, producing debug output or logging data for later reporting might not work: the Chapel capabilities needed to do so may be implemented by built-in modules that are themselves not yet initialized or have already been torn down.

To help with that problem, this module provides built-in support for studying communication operations during module initialization and teardown. To use it, set either or both of the config params printInitVerboseComm and printInitCommCounts, described below. You can do this by using appropriate -sconfigParamName=value command line options when you compile your program.

The reporting and/or counting enabled by these covers all of program execution, from just before the first module is initialized until just after the last one is torn down. This is almost always a superset of the part of the program that is of interest, which is often just a single module. To learn what communication is being done by a single module during its initialization and teardown, it is often necessary to run a small test program twice, once with that module present and once without it, and compare the two results.
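
As a sketch of how this might be used, the hypothetical test program below contains no explicit diagnostics calls; the config params are set on the compiler command line shown in its header comment. The file name hello.chpl and the -o hello option are illustrative, and the use statement is included on the assumption that the module must be part of the program for its config params to be settable.

```chapel
// hello.chpl: a hypothetical test program with no explicit diagnostics calls.
//
// Compile with one or both config params set, then run on two locales:
//
//   chpl -sprintInitCommCounts=true -o hello hello.chpl
//   ./hello -nl 2
//
//   chpl -sprintInitVerboseComm=true -o hello hello.chpl
//   ./hello -nl 2
//
use CommDiagnostics;

proc main() {
  writeln("hello from ", numLocales, " locale(s)");
}
```

To isolate one module's contribution, the same commands can be repeated with an added use of the module under study, and the two sets of output compared.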

record chpl_commDiagnostics

Aggregated communication operation counts. This record type is defined in the same way by both the underlying comm layer(s) and this module, because we don't have a good way to inherit types back and forth between the two. This first definition duplicates the one in the comm layer(s).

var get: uint(64)

blocking GETs, in which initiator waits for completion

var get_nb: uint(64)

non-blocking GETs

var put: uint(64)

blocking PUTs, in which initiator waits for completion

var put_nb: uint(64)

non-blocking PUTs

var test_nb: uint(64)

tests for non-blocking GET/PUT completions

var wait_nb: uint(64)

blocking waits for non-blocking GET/PUT completions

var try_nb: uint(64)

non-blocking waits for non-blocking GET/PUT completions

var execute_on: uint(64)

blocking remote executions, in which initiator waits for completion

var execute_on_fast: uint(64)

blocking remote executions performed by the target locale's Active Message handler

var execute_on_nb: uint(64)

non-blocking remote executions

type commDiagnostics = chpl_commDiagnostics

The Chapel record type inherits the comm layer definition of it.

proc startVerboseComm()

Start on-the-fly reporting of communication initiated on any locale.

proc stopVerboseComm()

Stop on-the-fly reporting of communication initiated on any locale.

proc startVerboseCommHere()

Start on-the-fly reporting of communication initiated on this locale.

proc stopVerboseCommHere()

Stop on-the-fly reporting of communication initiated on this locale.

proc startCommDiagnostics()

Start counting communication operations across the whole program.

proc stopCommDiagnostics()

Stop counting communication operations across the whole program.

proc startCommDiagnosticsHere()

Start counting communication operations initiated on this locale.

proc stopCommDiagnosticsHere()

Stop counting communication operations initiated on this locale.

proc resetCommDiagnostics()

Reset aggregate communication counts across the whole program.

proc resetCommDiagnosticsHere()

Reset aggregate communication counts on the calling locale.

proc getCommDiagnostics()

Retrieve aggregate communication counts for the whole program.

Returns: array of counts of comm ops initiated on each locale
Return type: [LocaleSpace] commDiagnostics

proc getCommDiagnosticsHere()

Retrieve aggregate communication counts for this locale.

Returns: counts of comm ops initiated on this locale
Return type: commDiagnostics

config param printInitVerboseComm = false

If this is set, on-the-fly reporting of communication operations will be turned on before any module initialization begins and turned off after all module teardown ends. See procedures startVerboseComm and stopVerboseComm for more information.

config param printInitCommCounts = false

If this is set, communication operations are counted from before any module initialization begins until after all module teardown ends, and then the aggregate counts are printed. See procedures startCommDiagnostics, stopCommDiagnostics, and getCommDiagnostics for more information.