CommDiagnostics
Usage
use CommDiagnostics;
or
import CommDiagnostics;
Warning
The CommDiagnostics module is unstable and may change in the future.
Supports counting and reporting network communication operations.
This module provides support for reporting and counting communication operations between network-connected locales. The operations include various kinds of remote reads (GETs), remote writes (PUTs), and remote executions. Callers can request on-the-fly output each time a remote operation occurs, or count such operations as they occur and retrieve the counts later. The former gives more detailed information but has much more overhead. The latter has much less overhead but only provides aggregate information.
On-the-fly Reporting
All forms of communication reporting and counting are done between pairs of function calls that turn it on and off. On-the-fly reporting across all locales is done like this:
startVerboseComm();
// between start/stop calls, report comm ops initiated on any locale
stopVerboseComm();
On-the-fly reporting for just the calling locale is similar. Only the procedure names change:
startVerboseCommHere();
// between start/stop calls, report comm ops initiated on this locale
stopVerboseCommHere();
In either case, the output consists of one line written to stdout for each communication operation. (Here stdout means the file associated with the process, not the Chapel channel with the same name.)
Consider this little example program:
use CommDiagnostics;
proc main() {
  startVerboseComm();
  var x: int = 1;
  on Locales(1) {  // should execute_on a blocking task onto locale 1
    x = x + 1;     // should invoke a remote put and a remote get
  }
  stopVerboseComm();
}
Executing this on two locales with the -nl 2 command line option results in the following output:
0: remote task created on 1
1: t.chpl:6: remote get from 0, 8 bytes
1: t.chpl:6: remote put to 0, 8 bytes
The initial number refers to the locale reporting the communication event. The file name and line number point to the place in the code that triggered the communication event. (For remote execute_ons, file name and line number information is not yet reported.)
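The locale-specific variant can be exercised in much the same way. The following is a minimal sketch, not taken from the module itself: the helper name incrementFromLastLocale is hypothetical, and it assumes the program is run with at least two locales so that the last locale differs from locale 0. Reporting is switched on and off from inside the on-block, so only communication initiated by that locale is reported.

```chapel
use CommDiagnostics;

// Hypothetical helper (not part of CommDiagnostics): increment `x` from the
// last locale while reporting only the comm ops that locale initiates.
proc incrementFromLastLocale(ref x: int) {
  on Locales[numLocales-1] {
    startVerboseCommHere();
    x = x + 1;  // when run remotely, reads and writes x over the network
    stopVerboseCommHere();
  }
}

proc main() {
  var x = 1;
  incrementFromLastLocale(x);
  writeln(x);
}
```

When run on a single locale, the on-block executes locally and no communication lines should appear.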
Counting Communication Operations
Counting communication operations requires a few more calls than just reporting them does. In particular, the counts have to be retrieved after they are collected and, if they have been used previously, the internal counters have to be reset before counting is turned on. Counting across all locales is done like this:
// (optional) if we counted previously, reset the counters to zero
resetCommDiagnostics();
startCommDiagnostics();
// between start/stop calls, count comm ops initiated on any locale
stopCommDiagnostics();
// retrieve the counts and report the results
writeln(getCommDiagnostics());
Counting on just the calling locale is similar. Just as for on-the-fly reporting, only the procedure names change:
// (optional) if we counted previously, reset the counters to zero
resetCommDiagnosticsHere();
startCommDiagnosticsHere();
// between start/stop calls, count comm ops initiated on this locale
stopCommDiagnosticsHere();
// retrieve the counts and report the results
writeln(getCommDiagnosticsHere());
The optional call to reset the counters is only needed when a program collects counts more than once. In this case, the counters have to be set back to zero before starting the second and succeeding counting periods. By far the most common situation is that programs only collect communication counts once per run, in which case this step is not needed.
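As an illustration of collecting counts more than once, the sketch below measures the same operation in two separate counting periods and resets the counters at the start of each. The helper name countOneRemoteIncrement is hypothetical, and the example assumes a run with at least two locales; it is one possible arrangement, not the only way to structure repeated measurements.

```chapel
use CommDiagnostics;

// Hypothetical helper (not part of CommDiagnostics): count the comm ops
// performed by one increment of `x` from the last locale.
proc countOneRemoteIncrement(ref x: int) {
  resetCommDiagnostics();   // clear counts left over from any earlier period
  startCommDiagnostics();
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();
  return getCommDiagnostics();
}

proc main() {
  var x = 1;
  writeln("first period:  ", countOneRemoteIncrement(x));
  writeln("second period: ", countOneRemoteIncrement(x));
}
```

Without the reset, the second period would report the counts of both periods added together.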
Note that the same internal mechanisms and counters are used for counting on all locales and counting on just the calling locale, so trying to do both at once may lead to surprising turn-on/turn-off behavior and/or incorrect results.
Consider this little example program:
use CommDiagnostics;
proc main() {
  startCommDiagnostics();
  var x: int = 1;
  on Locales(1) {  // should execute_on a blocking task onto locale 1
    x = x + 1;     // should invoke a remote put and a remote get
  }
  stopCommDiagnostics();
  writeln(getCommDiagnostics());
}
Executing this on two locales with the -nl 2 command line option results in the following output:
(execute_on = 1) (get = 1, put = 1)
The first parenthesized group contains the counts for locale 0, and the second contains the counts for locale 1. Each group lists the operations initiated by that locale, so for the instrumented section of this program we can say that locale 0 initiated one remote execute_on, and locale 1 initiated one remote get and one remote put.
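The returned counts can also be inspected field by field rather than written as a whole. The sketch below assumes a run with at least two locales and uses the get, put, and execute_on fields of the commDiagnostics record documented on this page; the helper totalGetsAndPuts is hypothetical and only illustrates aggregating across locales.

```chapel
use CommDiagnostics;

// Hypothetical helper (not part of CommDiagnostics): sum the blocking GET
// and PUT counts over all locales.
proc totalGetsAndPuts(counts: [] commDiagnostics): uint(64) {
  var total: uint(64) = 0;
  for c in counts do
    total += c.get + c.put;
  return total;
}

proc main() {
  startCommDiagnostics();
  var x: int = 1;
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();

  const counts = getCommDiagnostics();  // one record per locale
  for (loc, c) in zip(Locales, counts) do
    writeln("locale ", loc.id, ": get=", c.get, " put=", c.put,
            " execute_on=", c.execute_on);
  writeln("total gets and puts: ", totalGetsAndPuts(counts));
}
```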
Studying Communication During Module Initialization
It is hard for a programmer to determine exactly what happens during initialization or teardown of a module, because the code that runs then does so only implicitly, as a result of the declarations present. Even if that code can be identified, printing debug output or logging data for later reporting might not work: the Chapel capabilities needed to do so are implemented by built-in modules that may not yet be initialized or may already have been torn down.
To help with that problem, this module provides built-in support for studying communication operations during module initialization and teardown. To use it, set either or both of the config params printInitVerboseComm and printInitCommCounts, described below. You can do this by using appropriate -sconfigParamName=value command line options when you compile your program.
The reporting and/or counting enabled by these covers all of program execution, from just before the first module is initialized until just after the last one is torn down. This is almost always a superset of the part of the program that is of interest, which is often just a single module. To learn what communication is being done by a single module during its initialization and teardown it is often necessary to run a small test program twice, once with that module present and once without it.
- config param commDiagsStacktrace = false
If true, print a stack trace for each comm event reported after startVerboseComm.
- config param commDiagsPrintUnstable = false
If this is false, a written commDiagnostics value does not include “unstable” fields even when they are non-zero. Unstable fields are those expected to have unpredictable values for multiple executions of the same code sequence. Setting this to true causes such fields, if non-zero, to be included when a commDiagnostics value is written. At present the only unstable field is the amo counter, whose instability is due to the use of atomic reads in spin loops that wait for parallelism and on-statements to complete.
- record chpl_commDiagnostics
Aggregated communication operation counts. This record type is defined in the same way by both the underlying comm layer(s) and this module, because we don’t have a good way to inherit types back and forth between the two. This first definition duplicates the one in the comm layer(s).
- var get : uint(64)
blocking GETs, in which initiator waits for completion
- var get_nb : uint(64)
non-blocking GETs
- var put : uint(64)
blocking PUTs, in which initiator waits for completion
- var put_nb : uint(64)
non-blocking PUTs
- var test_nb : uint(64)
tests for non-blocking GET/PUT completions
- var wait_nb : uint(64)
blocking waits for non-blocking GET/PUT completions
- var try_nb : uint(64)
non-blocking waits for non-blocking GET/PUT completions
- var amo : uint(64)
atomic memory operations
- var execute_on : uint(64)
blocking remote executions, in which initiator waits for completion
- var execute_on_fast : uint(64)
blocking remote executions performed by the target locale’s Active Message handler
- var execute_on_nb : uint(64)
non-blocking remote executions
- var cache_get_hits : uint(64)
GETs that were handled by the cache. GETs counted here did not require the cache to communicate in order to return the result.
- var cache_get_misses : uint(64)
GETs that were not handled by the cache - that is, GETs where the cache needed to communicate with another locale.
- var cache_put_hits : uint(64)
PUTs that were stored in cache pages that already existed.
- var cache_put_misses : uint(64)
PUTs that required the cache to create a new page to store them.
- var cache_num_prefetches : uint(64)
Number of prefetches issued to the remote cache at the granularity of cache pages. This counter is specifically triggered via calls to chpl_comm_remote_prefetch.
- var cache_num_page_readaheads : uint(64)
Number of readaheads issued to the remote cache at the granularity of cache pages.
- var cache_prefetch_unused : uint(64)
Number of cache pages that were prefetched but evicted from the cache before being accessed (i.e., the prefetches were too early).
- var cache_prefetch_waited : uint(64)
Number of cache pages that were prefetched but did not arrive in the cache before being accessed (i.e., the prefetches were too late).
- var cache_readahead_unused : uint(64)
Number of cache pages that were read ahead but evicted from the cache before being accessed (i.e., the readaheads were too early).
- var cache_readahead_waited : uint(64)
Number of cache pages that were read ahead but did not arrive in the cache before being accessed (i.e., the readaheads were too late).
- type commDiagnostics = chpl_commDiagnostics
The Chapel record type inherits the comm layer definition of it.
- proc startVerboseComm()
Start on-the-fly reporting of communication initiated on any locale.
- proc stopVerboseComm()
Stop on-the-fly reporting of communication initiated on any locale.
- proc startVerboseCommHere()
Start on-the-fly reporting of communication initiated on this locale.
- proc stopVerboseCommHere()
Stop on-the-fly reporting of communication initiated on this locale.
- proc startCommDiagnostics()
Start counting communication operations across the whole program.
- proc stopCommDiagnostics()
Stop counting communication operations across the whole program.
- proc startCommDiagnosticsHere()
Start counting communication operations initiated on this locale.
- proc stopCommDiagnosticsHere()
Stop counting communication operations initiated on this locale.
- proc resetCommDiagnostics()
Reset aggregate communication counts across the whole program.
- proc resetCommDiagnosticsHere()
Reset aggregate communication counts on the calling locale.
- proc getCommDiagnostics()
Retrieve aggregate communication counts for the whole program.
- Returns:
array of counts of comm ops initiated on each locale
- Return type:
[LocaleSpace] commDiagnostics
- proc getCommDiagnosticsHere()
Retrieve aggregate communication counts for this locale.
- Returns:
counts of comm ops initiated on this locale
- Return type:
commDiagnostics
- proc printCommDiagnosticsTable(printEmptyColumns = false)
Print the current communication counts in a markdown table using a row per locale and a column per operation. By default, operations for which all locales have a count of zero are not displayed in the table, though an argument can be used to reverse that behavior.
- Arguments:
printEmptyColumns : bool – Indicates whether empty columns should be printed (defaults to false)
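A short sketch of how the table might be printed after a counting period is shown here. It assumes a run with at least two locales and makes no claim about the exact layout of the table; the second call passes printEmptyColumns=true to include the columns that are zero on every locale.

```chapel
use CommDiagnostics;

proc main() {
  var x = 1;
  resetCommDiagnostics();
  startCommDiagnostics();
  on Locales[numLocales-1] do x += 1;
  stopCommDiagnostics();

  // Default: columns whose counts are zero on every locale are omitted.
  printCommDiagnosticsTable();

  // Show every operation column, including the all-zero ones.
  printCommDiagnosticsTable(printEmptyColumns=true);
}
```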
- config param printInitVerboseComm = false
If this is set, on-the-fly reporting of communication operations will be turned on before any module initialization begins and turned off after all module teardown ends. See procedures startVerboseComm and stopVerboseComm for more information.
- config param printInitCommCounts = false
If this is set, communication operations are counted from before any module initialization begins until after all module teardown ends, and then the aggregate counts are printed. See procedures startCommDiagnostics, stopCommDiagnostics, and getCommDiagnostics for more information.