# Communication Optimization for the Chapel Programming Language Michael Ferguson Cray Inc. March 24, 2016 #### Safe Harbor Statement This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements. #### **Talk Outline** - 15 m Chapel Background - 5m Communication Optimization Motivation - 5m Memory Consistency Models constrain optimization - 15 m Sequential Consistency for Data Race Free Programs - 5m Optimizing communication with a cache for remote data - 5m Using LLVM to optimize communication I ANALYZ ## What is Chapel? ## **Chapel's Origins: HPCS** ## **DARPA HPCS: High Productivity Computing Systems** - Goal: improve productivity by a factor of 10x - Timeframe: summer 2002 fall 2012 - Cray developed a new system architecture, network, software, ... - this became the very successful Cray XC30™ Supercomputer Series ...and a new programming language: Chapel COMPUTE ## **Chapel Motivation** # Q: Why doesn't parallel programming have an equivalent to Python / Matlab / Java / C++ / (your favorite programming language here)? - one that makes it easy to quickly get codes up and running - one that is portable across system architectures and scales - one that bridges the HPC, data analysis, and mainstream communities # A: We believe this is due not to any particular technical challenge, but rather a lack of sufficient... - ...long-term efforts - ...resources - ...community will - ...co-design between developers and users - ...patience ## Chapel is our attempt to change this ## **Chapel's Implementation** - Being developed as open source at GitHub - Licensed as Apache v2.0 software - Portable design and implementation, targeting: - multicore desktops and laptops - commodity clusters and the cloud - HPC systems from Cray and other vendors - in-progress: manycore processors, CPU+accelerator hybrids, ... ## Chapel is a Collaborative, Community Effort Proudly Operated by Battelle Since 1965 (and many others as well...) http://chapel.cray.com/collaborations.html COMPUTE ## **Sustained Performance Milestones** - 1 GF 1988: Cray Y-MP; 8 Processors - Static finite element analysisFortran77 + Cray autotasking + vectorization - 1 TF 1998: Cray T3E; 1,024 Processors - Modeling of metallic magnet atoms - Fortran + MPI (Message Passing Interface) - 1 PF 2008: Cray XT5; 150,000 Processors - Superconductive materials - C++/Fortran + MPI + vectorization - 1 EF ~20\_\_: Cray \_\_\_\_; ~10,000,000 Processors - TBD - TBD: C/C++/Fortran + MPI + OpenMP/OpenACC/CUDA/OpenCL? Or, perhaps something completely different? COMPUTE CRAY" **Given:** *m*-element vectors *A*, *B*, *C* Compute: $\forall i \in 1..m$ , $A_i = B_i + \alpha \cdot C_i$ ## In pictures: CRAY **Given:** *m*-element vectors *A*, *B*, *C* Compute: $\forall i \in 1..m$ , $A_i = B_i + \alpha \cdot C_i$ ## In pictures, in parallel: CRAY **Given:** *m*-element vectors *A*, *B*, *C* Compute: $\forall i \in 1..m$ , $A_i = B_i + \alpha \cdot C_i$ ## In pictures, in parallel (distributed memory): CRAY **Given:** *m*-element vectors *A*, *B*, *C* Compute: $\forall i \in 1..m$ , $A_i = B_i + \alpha \cdot C_i$ In pictures, in parallel (distributed memory multicore): #### **STREAM Triad: MPI** #include <hpcc.h> ``` MPI ``` ``` static int VectorSize; static double *a, *b, *c; int HPCC StarStream(HPCC Params *params) { int myRank, commSize; int rv, errCount; MPI Comm comm = MPI COMM WORLD; MPI Comm size( comm, &commSize ); MPI Comm rank ( comm, &myRank ); rv = HPCC Stream( params, 0 == myRank); MPI Reduce ( &rv, &errCount, 1, MPI INT, MPI SUM, 0, comm ); return errCount; int HPCC Stream(HPCC Params *params, int doIO) { register int j; double scalar; VectorSize = HPCC LocalVectorSize( params, 3, sizeof(double), 0 ); a = HPCC XMALLOC( double, VectorSize ); b = HPCC XMALLOC( double, VectorSize ); c = HPCC XMALLOC( double, VectorSize ); ``` ``` if (!a || !b || !c) { if (c) HPCC free(c); if (b) HPCC free(b); if (a) HPCC free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize ); fclose( outFile ); return 1; for (j=0; j<VectorSize; j++) {</pre> b[j] = 2.0; c[j] = 1.0; scalar = 3.0; for (j=0; j<VectorSize; j++)</pre> a[j] = b[j] + scalar * c[j]; HPCC free(c); ``` HPCC free(b); HPCC free(a); ## **STREAM Triad: MPI+OpenMP** #### MPI + OpenMP ``` #include <hpcc.h> if (!a || !b || !c) { #ifdef OPENMP if (c) HPCC free(c); #include <omp.h> #endif if (b) HPCC free(b); if (a) HPCC free(a); if (doIO) { static int VectorSize; static double *a, *b, *c; fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize ); int HPCC StarStream(HPCC Params *params) { fclose( outFile ); int myRank, commSize; int rv, errCount; return 1; MPI Comm comm = MPI COMM WORLD; MPI Comm size( comm, &commSize ); #ifdef OPENMP MPI Comm rank( comm, &myRank ); #pragma omp parallel for #endif rv = HPCC Stream( params, 0 == myRank); for (j=0; j<VectorSize; j++) {</pre> MPI Reduce ( &rv, &errCount, 1, MPI INT, MPI SUM, 0, b[j] = 2.0; comm ); c[j] = 1.0; return errCount; scalar = 3.0; int HPCC Stream(HPCC Params *params, int doIO) { #ifdef OPENMP register int j; #pragma omp parallel for double scalar; #endif for (j=0; j<VectorSize; j++)</pre> VectorSize = HPCC LocalVectorSize( params, 3, a[j] = b[j] + scalar * c[j]; sizeof(double), 0 ); HPCC free(c); a = HPCC XMALLOC( double, VectorSize ); HPCC free(b); b = HPCC XMALLOC( double, VectorSize ); HPCC free(a); c = HPCC XMALLOC( double, VectorSize ); ``` STORE ## STREAM Triad: MPI+OpenMP vs. CUDA #### MPI + OpenMP ``` #ifdef OPENMP #include <omp.h> #include <omp.h> #endif static int VectorSize; static double *a, *b, *c; int HPCC StarStream(HPCC Params *params) { int myRank, commSize; int rv, errCount; MPI Comm comm = MPI COMM WORLD; MPI Comm size( comm, &commSize ); MPI Comm_rank( comm, &myRank ); rv = HPCC Stream( params, 0 == myRank); MPI Reduce( &rv, &errCount, 1, MPI INT, MPI SUM, 0, comm ); return errCount; } ``` #### CUDA ``` #define N 2000000 int main() { float *d_a, *d_b, *d_c; float scalar; cudaMalloc((void**)&d_a, sizeof(float)*N); cudaMalloc((void**)&d_b, sizeof(float)*N); cudaMalloc((void**)&d_c, sizeof(float)*N); dim3 dimBlock(128); dim3 dimGrid(N/dimBlock.x); ``` ## HPC suffers from too many distinct notations for expressing parallelism and locality ``` a = HPCC XMALLOC( double, VectorSize ); b = HPCC XMALLOC( double, VectorSize ); c = HPCC XMALLOC( double, VectorSize ); if (!a || !b || !c) { if (c) HPCC free(c); if (b) HPCC free(b); if (a) HPCC free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize ); fclose( outFile ); return 1; #ifdef OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) {</pre> b[j] = 2.0; c[j] = 1.0; scalar = 3.0; #ifdef OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++)</pre> a[j] = b[j] + scalar * c[j]; HPCC free(c); HPCC free (b); HPCC free(a); return 0; ``` ``` set array<<<dimGrid,dimBlock>>>(d b, .5f, N); set array<<<dimGrid,dimBlock>>>(d c, .5f, N); scalar=3.0f; STREAM Triad<<dimGrid,dimBlock>>>(d b, d c, d a, scalar, N); cudaThreadSynchronize(); cudaFree(d a); cudaFree(d b); cudaFree(d c); global void set array(float *a, float value, int len) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < len) a[idx] = value;</pre> void STREAM Triad( float *a, float *b, float *c, float scalar, int len) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < len) c[idx] = a[idx]+scalar*b[idx];</pre> } ``` ## Why so many programming models? ## HPC has traditionally given users... - ...low-level, control-centric programming models - ...ones that are closely tied to the underlying hardware - ...ones that support only a single type of parallelism | Type of HW Parallelism | Programming Model | Unit of Parallelism | |-----------------------------------|------------------------|---------------------| | Inter-node | MPI | executable | | Intra-node/multicore | OpenMP / pthreads | iteration/task | | Instruction-level vectors/threads | pragmas | iteration | | GPU/accelerator | Open[MP CL ACC] / CUDA | SIMD function/task | benefits: lots of control; decent generality; easy to implement downsides: lots of user-managed detail; brittle to changes ## Rewinding a few slides... #### MPI + OpenMP ``` #ifdef OPENMP #include <omp.h> #endif static int VectorSize; static double *a, *b, *c; int HPCC StarStream(HPCC Params *params) { int myRank, commSize; int rv, errCount; MPI Comm comm = MPI COMM WORLD; MPI Comm size( comm, &commSize); MPI Comm rank( comm, &myRank); rv = HPCC Stream( params, 0 == myRank); MPI Reduce( &rv, &errCount, 1, MPI INT, MPI SUM, 0, comm ); return errCount; } ``` #### CUDA ``` #define N 2000000 int main() { float *d_a, *d_b, *d_c; float scalar; cudaMalloc((void**)&d_a, sizeof(float)*N); cudaMalloc((void**)&d_b, sizeof(float)*N); cudaMalloc((void**)&d_c, sizeof(float)*N); dim3 dimBlock(128); dim3 dimGrid(N/dimBlock.x ); ``` ## HPC suffers from too many distinct notations for expressing parallelism and locality ``` a = HPCC XMALLOC( double, VectorSize ); b = HPCC XMALLOC( double, VectorSize ); c = HPCC XMALLOC( double, VectorSize ); if (!a || !b || !c) { if (c) HPCC free(c); if (b) HPCC free(b); if (a) HPCC free(a); if (doIO) { fprintf( outFile, "Failed to allocate memory (%d).\n", VectorSize ); fclose( outFile ); return 1; #ifdef OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++) {</pre> b[j] = 2.0; c[j] = 1.0; scalar = 3.0; #ifdef OPENMP #pragma omp parallel for #endif for (j=0; j<VectorSize; j++)</pre> a[j] = b[j] + scalar * c[j]; HPCC free(c); HPCC free (b); HPCC free(a); return 0; ``` ``` set array<<<dimGrid,dimBlock>>>(d b, .5f, N); set array<<<dimGrid,dimBlock>>>(d c, .5f, N); scalar=3.0f; STREAM Triad<<dimGrid,dimBlock>>>(d b, d c, d a, scalar, N); cudaThreadSynchronize(); cudaFree(d a); cudaFree(d b); cudaFree(d c); global void set array(float *a, float value, int len) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < len) a[idx] = value;</pre> void STREAM Triad( float *a, float *b, float *c, float scalar, int len) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < len) c[idx] = a[idx]+scalar*b[idx];</pre> } ``` ## **STREAM Triad: Chapel** ``` Chapel #include <hpcc.h> #ifdef OPENMP #include <omp.h> config const m = 1000, static int VectorSize: static double *a, *b, *c; alpha = 3.0; int HPCC StarStream (HPCC Params int myRank, commSize; int rv, errCount; MPI Comm comm = MPI COMM WORLD; the special const ProblemSpace = {1..m}(dmapped ... MPI Comm size ( comm, &commSize ); MPI Comm rank ( comm, &myRank ); sauce rv = HPCC Stream( params, 0 == myRa MPI Reduce ( &rv, &errCount, 1, MPI var A, B, C: [ProblemSpace] real; return errCount; int HPCC Stream (HPCC Params *params, register int j; = 2.0; double scalar; VectorSize = HPCC LocalVectorSize( N); = 1.0; a = HPCC XMALLOC ( double, VectorSiz b = HPCC XMALLOC ( double, VectorSiz N); c = HPCC XMALLOC ( double, VectorSiz if (!a || !b || !c) { if (c) HPCC free(c); A = B + alpha * C; if (b) HPCC free(b); c, da, scalar, N); if (a) HPCC_free(a); if (doIO) fprintf( outFile, "Failed to a fclose( outFile ); -------------------- _____ ------ ``` <u>Philosophy:</u> Good language design can tease details of locality and parallelism away from an algorithm, permitting the compiler, runtime, applied scientist, and HPC expert to each focus on their strengths. ## **Motivating Chapel Themes** CRAY - 1) General Parallel Programming - 2) Global-View Abstractions - 3) Multiresolution Design - 4) Control over Locality/Affinity - **5)** Reduce HPC ↔ Mainstream Language Gap ## **Motivating Chapel Themes** - 1) General Parallel Programming - 2) Global-View Abstractions - 3) Multiresolution Design - 4) PGAS: Control over Locality/Affinity - **5)** Reduce HPC ← Mainstream Language Gap ## 1) General Parallel Programming ### With a unified set of concepts... ## ...express any parallelism desired in a user's program - Styles: data-parallel, task-parallel, concurrency, nested, ... - Levels: model, function, loop, statement, expression ## ...target any parallelism available in the hardware Types: machines, nodes, cores, instruction | Type of HW Parallelism | Programming Model | Unit of Parallelism | |-----------------------------------|-------------------|---------------------| | Inter-node | Chapel | task(or executable) | | Intra-node/multicore | Chapel | iteration/task | | Instruction-level vectors/threads | Chapel | iteration | | GPU/accelerator | Chapel | SIMD function/task | ## **PGAS Programming in a Nutshell** 23 ## Global Address Space: - permit parallel tasks to access variables by naming them - regardless of whether they are local or remote - compiler / library / runtime will take care of communication Images / Threads / Locales / Places / etc. (think: "compute nodes") ## **PGAS** Programming in a Nutshell ## **Global Address Space:** - permit parallel tasks to access variables by naming them - regardless of whether they are local or remote - compiler / library / runtime will take care of communication #### **Partitioned:** - establish a strong model for reasoning about locality - every variable has a well-defined location in the system - local variables are typically cheaper to access than remote ones ## **PGAS** Programming in a Nutshell ## **Global Address Space:** - permit parallel tasks to access variables by naming them - regardless of whether they are local or remote - compiler / library / runtime will take care of communication #### **Partitioned:** - estal - ev Communication is implicit! One sided GET and PUT. • loc k = i + j; Images / Threads / Locales / Places / etc. (think: "compute nodes") STORE COMPUTE # WHY COMMUNICATION OPTIMIZATION? # TRAIN LATENCY (8 HOURTRIP, 60 TON CARS, 60 SEC/CAR) ## TRAIN BANDWIDTH # INFINIBAND (IB) LATENCY Request Size (bytes) ## INFINIBAND (IB) BANDWIDTH with small 10-node cluster, QDR IB 3500 Max BW: 5000 MB/s 2625 Bandwidth MB/s 1750 875 16 32 128 256 512 8 64 # MEMORY MODELS CONSTRAIN PREFETCH AND WRITE-BEHIND ## AGGREGATION ## **OVERLAP** #### Thread 1 x = 42; notify = 1; #### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); #### Thread 1 x = 42; notify = 1; #### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); ${\rm compiler}\ or\ {\rm processor}$ #### Thread 1 r1 = 42; notify = 1; x = r1; #### Thread 2 r2 = notify; while 0 == r2 { /\* wait \*/ } compute\_with(x); Compiler and processor would like to start loads earlier in order to hide memory latency. We'll call that prefetch. Compiler and processor would like to complete stores later in order to hide memory latency. We'll call that write behind. #### **PGAS** Programming in a Nutshell #### **Global Address Space:** - permit parallel tasks to access variables by naming them - regardless of whether they are local or remote COMPUTE • compiler / library / runtime will take care of communication #### **Partitioned:** - estal - ev - Communication is implicit! - One sided GET and PUT. Images / Threads / Locales / Places / etc. (think: "compute nodes") STORE # REMEMBER THE RACY PROGRAM? #### Thread 1 x = 42; notify = true; #### Thread 2 while 0 == notify { /\* wait \*/ } compute\_with(x); compiler or processor #### Thread 1 r1 = 42; notify = 1; x = r1; #### Thread 2 r2 = notify; while 0 == r2 { /\* wait \*/ } compute\_with(x); COMPUTE # SC FOR DRF # Memory model for C11, C++11, Chapel: data race free programs are sequentially consistent - See Adve, S.V., Boehm, H.-J. 2010. Memory models: a case for rethinking parallel languages and hardware. Communications of the ACM 53(8): 90–101. <a href="http://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext">http://cacm.acm.org/magazines/2010/8/96610-memory-models-a-case-for-rethinking-parallel-languages-and-hardware/fulltext</a> - Chapel has a new specification chapter describing the memory consistency model. See <a href="http://chapel.cray.com/spec/spec-0.98.pdf">http://chapel.cray.com/spec/spec-0.98.pdf</a> section 29, page 217. ANALYZE # CONFIGURABLE SC-DRF - atomic operations in Chapel and C++ support: - memory\_order\_relaxed "atomic only" - memory\_order\_acquire "acquire" - memory\_order\_release "release" - memory\_order\_seq\_cst "sequentially consistent" - Beware! No global total order for relaxed, acquire, and release. Instead, the order is per atomic variable. $$x = 1;$$ $$x = 1;$$ | x = 1; | |--------| | | | | | | | | | | | | | | | | | | | | $$x = 1;$$ $$x = 1;$$ $$y = 3;$$ $z = 4;$ #### **Memory Order** $$x = 1;$$ Some re-orderings are allowed. read-after-write order preserved within tasks write-after-write order preserved within tasks Bad reordering! (ie, compiler bug) Sequential programs must work as if executed in program order # ASIDE: WEAK MEMORY CONSISTENCY ``` 1 x starts at 0; if someOption then 2 x = 2; if someOtherOption then 3 x = 3; 4 return x; ``` # ASIDE: WEAK MEMORY CONSISTENCY ``` 1 x starts at 0; 2 PUT 2 into x; 3 PUT 3 into x; 4 GET x; ``` Chapel **OpenSHMEM** result must be 3 result could be 0, 2, or 3 # MORE EXAMPLES: SHARED VARIABLES $$x = 0x1234;$$ #### **Memory Order** $$x = 0x1234;$$ Bad program: Data Race. No global order! This outcome is possible: $$c == 0 \times AB34$$ Bad program: Data Race. No global order! This outcome is possible: $$b == 1$$ #### **Memory Order** write behind could reorder: $$ok = 1;$$ $$x = 1;$$ read ahead could reorder: $$c = x;$$ $b = ok;$ In Chapel and C++, atomic vars default to SC ordering which includes both acquire and release | atomic $A = 1$ ; | |------------------| | c = atomic A; | | b = atomic A; | | atomic $A = 2$ ; | | | | | | | | | | | | | SC atomic vars create a global memory order. C == 2 b == 0 not possible e.g. | atomic $A = 1$ ; | |------------------| | c = atomic A; | | b = atomic A; | | atomic A = 2; | | | | | | | | | | | | | $$b = atomic A;$$ atomic $A = 2;$ ``` b = atomic A; atomic A = 1; c = atomic A; atomic A = 2; ``` SC atomic ops constrain the code around them $$b == 1$$ implies | x = 2; | |---------------| | atomic A = 1; | | b = atomic A; | | c = x; | | | | | | | | | | | | | #### **Memory Order** atomic $$A = 1$$ ; atomic $B = 2$ ; SC atomic vars create a global total memory order. $$c == -1 \mid \mid c == 1$$ #### **Program Order** ``` relaxed atomic A = 1; relaxed atomic B = 2; ``` ``` waitFor(atomic B == 2); c = atomic A; ``` Is an outcome of $C == \emptyset$ possible? #### **Program Order** ``` relaxed atomic A = 1; relaxed atomic B = 2; ``` ``` waitFor(atomic B == 2); c = atomic A; ``` Is an outcome of $C == \emptyset$ possible? Yes. While atomic vars cannot create race conditions, relaxed atomics don't create a total order. e.g. write behind could reorder: relaxed atomic B = 2; relaxed atomic A = 1; #### CACHE FOR REMOTE DATA - Goal: communication aggregation and overlap - Bonus points: avoiding repeated communication - Software cache in Chapel's runtime - One cache per pthread - Write-back cache with dirty bits #### CACHE COHERENCY - Simple, local coherency - Discard all cached data on acquire - Wait for pending operations on a *release* Strategy used in related work with UPC # CACHE FEATURES | | Overlap | | Aggregation | | |---------------------------------------------|---------|-----|-------------|-----| | | GET | PUT | GET | PUT | | Do PUTs in background | | X | | | | Start one PUT per contiguous written region | | | | X | | Round GETs up to 64-byte cache lines | | | X | | | Sequential read-ahead | X | | X | | | Programmer-provided prefetch hints* | X | | | | COMPUTE I STORE ANALYZE ### SYNTHETIC BENCHMARKS # APPLICATION BENCHMARKS #### PREFETCH EXAMPLE ``` var A:[1..n] int; on Locales[1] { var sum:int; // Optional warm up for i in 1..k do prefetch(A[f(i)]); for i in 1..n { if i+k <= n then prefetch(A[f(i+k)]);</pre> sum += A[f(i)] ``` #### PREFETCH EXAMPLE COMMUNICATION WITH LLVM "LLVM-based Communication Optimizations for PGAS Programs" Akihito Hayashi, Jisheng Zhao, Michael Ferguson, Vivek Sarkar # THE VISION: SHARED PGAS OPTIMIZATION PASSES #### EXAMPLE ``` // x is remote var sum = 0; for i in 1..100 { sum += get(x); ``` ``` // x is possibly remote var sum = 0; var sum = 0; for i in 1..100 { %1 = get(x); %1 = \mathbf{get}(\mathbf{x}); for i in 1..100 { sum += \%1; sum += \%1; TO DISTRIBUTED TO GLOBAL MEMORY // existing LLVM opt var sum = 0; var sum = 0; for i in 1..100 { %1 = load < 100 > %x %1 = load < 100 > %x for i in 1..100 { sum += \%1; sum += %rl; EXISTING LLVM OPTIMIZATION LICM ``` load <100> %x = load i64 addrspace(100)\* %x ANALYZE ANALYZE STORE STORE ANALYZE I ANALYZE STORE ANALYZE # CACHING VS LLVM #### **Legal Disclaimer** Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc.