CommDiagnostics

Usage

use CommDiagnostics;

or

import CommDiagnostics;

Warning

The CommDiagnostics module is unstable and may change in the future.

Supports counting and reporting network communication operations.

This module provides support for reporting and counting communication operations between network-connected locales. The operations include various kinds of remote reads (GETs), remote writes (PUTs), and remote executions. Callers can request on-the-fly output each time a remote operation occurs, or count such operations as they occur and retrieve the counts later. The former gives more detailed information but has much more overhead. The latter has much less overhead but only provides aggregate information.

On-the-fly Reporting

All forms of communication reporting and counting are done between pairs of function calls that turn it on and off. On-the-fly reporting across all locales is done like this:

startVerboseComm();
// between start/stop calls, report comm ops initiated on any locale
stopVerboseComm();

On-the-fly reporting for just the calling locale is similar. Only the procedure names change:

startVerboseCommHere();
// between start/stop calls, report comm ops initiated on this locale
stopVerboseCommHere();

In either case, the output produced consists of a line written to stdout for each communication operation. (Here stdout means the file associated with the process, not the Chapel channel with the same name.)
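As a minimal sketch of the per-locale variant, the program below turns reporting on only from inside an on-statement, so only operations initiated on that locale would be reported. Using the last locale (`numLocales - 1`) is an illustrative choice, not something the module requires; on a single locale no communication occurs and nothing is printed.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  // Illustrative choice: run on the last locale so the sketch also
  // compiles and runs with a single locale (no comm in that case).
  on Locales[numLocales - 1] {
    startVerboseCommHere();   // report comm ops initiated on this locale only
    x = x + 1;                // x lives on locale 0, so this is a remote access
    stopVerboseCommHere();
  }
}
```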

Consider this little example program:

use CommDiagnostics;
proc main() {
  startVerboseComm();
  var x: int = 1;
  on Locales(1) {     // should execute_on a blocking task onto locale 1
    x = x + 1;        // should invoke a remote put and a remote get
  }
  stopVerboseComm();
}

Executing this on two locales with the -nl 2 command line option results in the following output:

0: remote task created on 1
1: t.chpl:6: remote get from 0, 8 bytes
1: t.chpl:6: remote put to 0, 8 bytes

The initial number refers to the locale reporting the communication event. The file name and line number point to the place in the code that triggered the communication event. (For remote execute_ons, file name and line number information is not yet reported.)

Counting Communication Operations

Counting communication operations requires a few more calls than just reporting them does. In particular, the counts have to be retrieved after they are collected and, if the counters have been used previously, they have to be reset before counting is turned on. Counting across all locales is done like this:

// (optional) if we counted previously, reset the counters to zero
resetCommDiagnostics();
startCommDiagnostics();
// between start/stop calls, count comm ops initiated on any locale
stopCommDiagnostics();
// retrieve the counts and report the results
writeln(getCommDiagnostics());

Counting on just the calling locale is similar. Just as for on-the-fly reporting, only the procedure names change:

// (optional) if we counted previously, reset the counters to zero
resetCommDiagnosticsHere();
startCommDiagnosticsHere();
// between start/stop calls, count comm ops initiated on this locale
stopCommDiagnosticsHere();
// retrieve the counts and report the results
writeln(getCommDiagnosticsHere());

The optional call to reset the counters is only needed when a program collects counts more than once. In this case, the counters have to be set back to zero before starting the second and each subsequent counting period. By far the most common situation is that programs only collect communication counts once per run, in which case this step is not needed.

Note that the same internal mechanisms and counters are used for counting on all locales and counting on just the calling locale, so trying to do both at once may lead to surprising turn-on/turn-off behavior and/or incorrect results.
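The reset step can be illustrated with a program that counts two separate periods. This is a sketch only: the two "phases" and the use of the last locale are assumptions made for illustration, and the actual counts depend on the number of locales and the comm layer in use.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  // First counting period: the counters start at zero, so no reset is needed.
  startCommDiagnostics();
  on Locales[numLocales - 1] do x += 1;
  stopCommDiagnostics();
  writeln("phase 1: ", getCommDiagnostics());

  // Second counting period: reset first, otherwise the phase 1 counts
  // would still be included in the totals reported for phase 2.
  resetCommDiagnostics();
  startCommDiagnostics();
  on Locales[numLocales - 1] do x += 2;
  stopCommDiagnostics();
  writeln("phase 2: ", getCommDiagnostics());
}
```

Both periods use the all-locales procedures, which avoids mixing them with the calling-locale variants.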

Consider this little example program:

use CommDiagnostics;
proc main() {
  startCommDiagnostics();
  var x: int = 1;
  on Locales(1) {     // should execute_on a blocking task onto locale 1
    x = x + 1;        // should invoke a remote put and a remote get
  }
  stopCommDiagnostics();
  writeln(getCommDiagnostics());
}

Executing this on two locales with the -nl 2 command line option results in the following output:

(execute_on = 1) (get = 1, put = 1)

The first parenthesized group contains the counts for locale 0, and the second contains the counts for locale 1. So, for the instrumented section of this program, one remote execute_on was initiated on locale 0, and one remote get and one remote put were initiated on locale 1.
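Because getCommDiagnostics() returns one record per locale, individual counters can also be read by field name instead of printing the whole array. The sketch below assumes the field names listed for the record in this module (get, put, execute_on); which counters are actually non-zero may vary with the comm layer and compiler optimizations (for example, an on-statement may be counted under execute_on_fast instead).

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  startCommDiagnostics();
  on Locales[numLocales - 1] {
    x = x + 1;
  }
  stopCommDiagnostics();

  // One commDiagnostics record per locale, indexed by locale ID.
  const counts = getCommDiagnostics();
  for locId in LocaleSpace {
    const d = counts[locId];
    writeln("locale ", locId,
            ": get=", d.get,
            " put=", d.put,
            " execute_on=", d.execute_on);
  }
}
```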

Studying Communication During Module Initialization

It is hard for a programmer to determine exactly what happens during initialization or teardown of a module, because the code that runs then does so only implicitly, as a result of the declarations present. Even if that code can be identified, producing debug output or logging data for later reporting might not work: the Chapel capabilities needed to do so may be implemented by built-in modules that are not yet initialized or have already been torn down.

To help with that problem, this module provides built-in support for studying communication operations during module initialization and teardown. To use it, set either or both of the config params printInitVerboseComm and printInitCommCounts, described below. You can do this by using appropriate -sconfigParamName=value command line options when you compile your program.

The reporting and/or counting enabled by these covers all of program execution, from just before the first module is initialized until just after the last one is torn down. This is almost always a superset of the part of the program that is of interest, which is often just a single module. To learn what communication is being done by a single module during its initialization and teardown it is often necessary to run a small test program twice, once with that module present and once without it.
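As a hedged sketch of this workflow, the small test program below performs a remote update at module level, so the communication happens during module initialization rather than inside main(). The file name t.chpl, the executable name t, and the module-level statement are assumptions chosen for illustration; the compile and run commands appear as comments.

```chapel
// Hypothetical file name: t.chpl
//
// Compile with counting enabled for all of program execution:
//   chpl -sprintInitCommCounts=true -o t t.chpl
// Run on two locales:
//   ./t -nl 2

use CommDiagnostics;

// Module-level code runs during module initialization, before main().
var x: int = 1;
on Locales[numLocales - 1] do x += 1;

proc main() {
  writeln("x = ", x);
}
```

Running the same program without the module-level on-statement gives a baseline to compare against, along the lines of the two-run approach described above.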

config param commDiagsStacktrace = false

If this is true, print a stack trace for each comm event reported after startVerboseComm.

config param commDiagsPrintUnstable = false

If this is false, a written commDiagnostics value does not include “unstable” fields even when they are non-zero. Unstable fields are those expected to have unpredictable values for multiple executions of the same code sequence. Setting this to true causes such fields, if non-zero, to be included when a commDiagnostics value is written. At present the only unstable field is the amo counter, whose instability is due to the use of atomic reads in spin loops that wait for parallelism and on-statements to complete.

record chpl_commDiagnostics

Aggregated communication operation counts. This record type is defined in the same way by both the underlying comm layer(s) and this module, because we don’t have a good way to inherit types back and forth between the two. This first definition duplicates the one in the comm layer(s).

var get : uint(64)

blocking GETs, in which initiator waits for completion

var get_nb : uint(64)

non-blocking GETs

var put : uint(64)

blocking PUTs, in which initiator waits for completion

var put_nb : uint(64)

non-blocking PUTs

var test_nb : uint(64)

tests for non-blocking GET/PUT completions

var wait_nb : uint(64)

blocking waits for non-blocking GET/PUT completions

var try_nb : uint(64)

non-blocking waits for non-blocking GET/PUT completions

var amo : uint(64)

atomic memory operations

var execute_on : uint(64)

blocking remote executions, in which initiator waits for completion

var execute_on_fast : uint(64)

blocking remote executions performed by the target locale’s Active Message handler

var execute_on_nb : uint(64)

non-blocking remote executions

var cache_get_hits : uint(64)

GETs that were handled by the cache. GETs counted here did not require the cache to communicate in order to return the result.

var cache_get_misses : uint(64)

GETs that were not handled by the cache - that is, GETs where the cache needed to communicate with another locale.

var cache_put_hits : uint(64)

PUTs that were stored in cache pages that already existed.

var cache_put_misses : uint(64)

PUTs that required the cache to create a new page to store them.

var cache_num_prefetches : uint(64)

Number of prefetches issued to the remote cache at the granularity of cache pages. This counter is specifically triggered via calls to chpl_comm_remote_prefetch.

var cache_num_page_readaheads : uint(64)

Number of readaheads issued to the remote cache at the granularity of cache pages.

var cache_prefetch_unused : uint(64)

Number of cache pages that were prefetched but evicted from the cache before being accessed (i.e., the prefetches were too early).

var cache_prefetch_waited : uint(64)

Number of cache pages that were prefetched but did not arrive in the cache before being accessed (i.e., the prefetches were too late).

var cache_readahead_unused : uint(64)

Number of cache pages that were read ahead but evicted from the cache before being accessed (i.e., the readaheads were too early).

var cache_readahead_waited : uint(64)

Number of cache pages that were read ahead but did not arrive in the cache before being accessed (i.e., the readaheads were too late).

type commDiagnostics = chpl_commDiagnostics

The Chapel record type inherits the comm layer definition of it.

proc startVerboseComm()

Start on-the-fly reporting of communication initiated on any locale.

proc stopVerboseComm()

Stop on-the-fly reporting of communication initiated on any locale.

proc startVerboseCommHere()

Start on-the-fly reporting of communication initiated on this locale.

proc stopVerboseCommHere()

Stop on-the-fly reporting of communication initiated on this locale.

proc startCommDiagnostics()

Start counting communication operations across the whole program.

proc stopCommDiagnostics()

Stop counting communication operations across the whole program.

proc startCommDiagnosticsHere()

Start counting communication operations initiated on this locale.

proc stopCommDiagnosticsHere()

Stop counting communication operations initiated on this locale.

proc resetCommDiagnostics()

Reset aggregate communication counts across the whole program.

proc resetCommDiagnosticsHere()

Reset aggregate communication counts on the calling locale.

proc getCommDiagnostics()

Retrieve aggregate communication counts for the whole program.

Returns:

array of counts of comm ops initiated on each locale

Return type:

[LocaleSpace] commDiagnostics

proc getCommDiagnosticsHere()

Retrieve aggregate communication counts for this locale.

Returns:

counts of comm ops initiated on this locale

Return type:

commDiagnostics

proc printCommDiagnosticsTable(printEmptyColumns = false)

Print the current communication counts in a markdown table using a row per locale and a column per operation. By default, operations for which all locales have a count of zero are not displayed in the table, though an argument can be used to reverse that behavior.

Arguments:

printEmptyColumns : bool – Indicates whether empty columns should be printed (defaults to false)
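A minimal usage sketch follows, assuming counts are collected with startCommDiagnostics and stopCommDiagnostics before the table is printed. The exact table layout is not shown here because it depends on which counters are non-zero.

```chapel
use CommDiagnostics;

proc main() {
  var x: int = 1;

  startCommDiagnostics();
  on Locales[numLocales - 1] do x += 1;
  stopCommDiagnostics();

  // Default: omit operations whose count is zero on every locale.
  printCommDiagnosticsTable();

  // Include a column for every operation, even if all of its counts are zero.
  printCommDiagnosticsTable(printEmptyColumns=true);
}
```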

config param printInitVerboseComm = false

If this is set, on-the-fly reporting of communication operations will be turned on before any module initialization begins and turned off after all module teardown ends. See procedures startVerboseComm and stopVerboseComm for more information.

config param printInitCommCounts = false

If this is set, communication operations are counted from before any module initialization begins until after all module teardown ends, and then the aggregate counts are printed. See procedures startCommDiagnostics, stopCommDiagnostics, and getCommDiagnostics for more information.