Using the Chapel Compiler to Develop Language Tooling

Despite its name, the Chapel compiler isn’t just for compiling Chapel programs. As a benefit of an ongoing rewrite — an effort the team has nicknamed Dyno — Chapel’s front-end is being redesigned to be more modular and re-usable. This direction has allowed the team to separate Chapel’s documentation tool, chpldoc, from the compiler and make it a standalone tool. In addition, as we’ve written about before, we used the new front-end to develop a language server, chpl-language-server, and a linter, chplcheck.

The new front-end is not just for use by the core Chapel team; by using the new compiler library, anyone can develop their own tools that interact with Chapel’s source code. In this post, I’ll tell you how you can do that, and give other examples of what can be done. The library is written in C++, but I find that its Python bindings are an excellent way to get started and iterate on language tooling.

Getting Started

The process for installing the Python bindings to the compiler is well-documented elsewhere, and may well change after I write this post as the bindings become more mature, so I will not go over it here.

Let’s start with something simple. The Chapel convention is that record types should have names that start with a lowercase letter (more specifically, Chapel records should be in camelCase). Let’s write a script that finds all record declarations in a given file, and makes sure that they follow this convention. The full code is below; I will explain it in detail in subsequent paragraphs.

records.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


import re
import sys
from chapel import *

def is_camel_case(name: str):
    return re.fullmatch(r"([a-z]+([A-Z][a-z]*|\d+)*|[A-Z]+)?", name)

context = Context()
modules = context.parse(sys.argv[1])

for module in modules:
    for record, _ in each_matching(module, Record):
        if not is_camel_case(record.name()):
            print("Record name is not in camel case:", record.name())

Let’s break this piece of code down and go piece-by-piece. At the top, we have some imports. For the most part, these are the standard library. The only one specific to the Chapel compiler — as you might have guessed — is the following:

from chapel import *

For convenience, I import the entire module.

Next up, I define a function that checks if a string is in camelCase. Its definition isn’t so important to this post, but feel free to expand the explanation below if you’re interested.

(Click here to see how is_camel_case works)

The is_camel_case function uses a regular expression to check if a string is in the desired format. This expression might look a little daunting. All it does, however, is specify that the first character should be a lowercase letter ([a-z]), after which can follow any number or lowercase letters (+ means “one or more”). After that, any number of chunks can follow that start with an uppercase letter ([A-Z]).

A special exception is made for words made up of only uppercase letters, to allow for acronyms such as GPU.

The regular expression below is precisely the one used by the Chapel linter, chplcheck!

def is_camel_case(name: str):
    return re.fullmatch(r"([a-z]+([A-Z][a-z]*|\d+)*|[A-Z]+)?", name)

Finally, we get to the code that makes use of the compiler front-end. At the core of Dyno is the Context object. I will go into more detail about this later. For now, it’s sufficient to understand that the Context keeps track of all of the compiler state, including its configuration, the source code being compiled, and any information that has been extracted from it. With the context in hand, we can parse whatever file the user has given us on the command line:

context = Context()
modules = context.parse(sys.argv[1])

A Chapel file is a collection of modules. When the file is parsed, the Chapel compiler will return a list of these modules. All that’s left is to look at all the records in the given file, and check if their names are in camelCase. If they are not, we print a message.

for module in modules:
    for record, _ in each_matching(module, Record):
        if not is_camel_case(record.name()):
            print("Record name is not in camel case:", record.name())

The each_matching function is provided by the chapel module; it takes as arguments a pattern and a place to search for that pattern. In our case, the pattern is simply Record (representing record declarations in the source code), and the place is the module object. For each piece of code that matches the pattern, the function will return a tuple containing the matching object. It also returns a second value, which we ignore here; this value is used when the pattern is more complex and we want to extract more information from the matching object. The program is traversed recursively by each_matching, so it would return nested records as well.

Running the script on the following Chapel file:

record-naming.chpl

1
2


record fine {}
record NotFine {}

I get the following output:

Record name is not in camel case: NotFine

What I presented here is a very simplified implementation of what goes on in the chplcheck linter! In just 14 lines, we were able to get started on developing language tooling.

Abstract Syntax Trees

The Chapel compiler, like most others, for the most part does not work with the textual representation of the code. Instead, through a process called parsing, the compiler converts the source code into a tree representation; specifically, an abstract syntax tree (AST). ASTs naturally encode the precedence of operators, nesting of expressions, and other syntactic information that is harder to retrieve from the program text.

As an example, take a look at the following program and its AST representation.

one-two-three.chpl

1
2


var x = 1+2*3;
writeln(x);

A tree corresponding to the expression 1+2*3

All Chapel code is eventually contained within a module. Thus, a Module node is at the root of the syntax tree. Each node has children that represent other pieces of code contained within it. Since the module in the above program contains two statements, these two statements are children of this module.

The first statement is a variable declaration. This is represented using a Variable node. The node contains the name of the variable being declared, x. The only child of this node is the expression that is being used to initialize x, which is 1+2*3. Because multiplication has a higher precedence than addition, that expression is interpreted as 1+(2*3). So, the multiplication is “contained” within the addition, and the multiplication node (OpCall *) is a child of the addition node
(OpCall +).

The second is a call to writeln with the variable x. This is represented by a function call node. The children of this call are the “called expression” (in this case, the writeln procedure) and the arguments being passed (in this case, the variable x).

Each type of node in the tree can be used as a pattern. The ones we’ve seen so far are Record, Module, Variable, OpCall, IntLiteral, FnCall, and Identifier. To drive the point home, we can print the value of each integer literal and each binary operation in the program.

ops.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


import sys
from chapel import *

context = Context()
modules = context.parse(sys.argv[1])

for module in modules:
    for op, _ in each_matching(module, OpCall):
        print("Found an operation:", op.op())

    for lit, _ in each_matching(module, IntLiteral):
        print("Found a literal:", lit.text())

Found an operation: +
Found an operation: *
Found a literal: 1
Found a literal: 2
Found a literal: 3

The types of nodes in the Chapel AST form a Python class hierarchy. For instance, both the FnCall and the OpCall nodes inherit from a Call node. If you wanted to match all calls, using the Call node as a pattern would match both function and operator calls. Similarly, all loops that have index variables (e.g., for, foreach, forall) inherit from an IndexableLoop base class. I’ve included the entire list of available classes, organized in a tree, below. Because it is quite large, I’ve collapsed it to avoid taking up too much vertical space; you can click the sentence below to expand it.

(Click here to see the Dyno class hierarchy)

AstNode
├── AnonFormal
├── As
├── Array
├── Attribute
├── AttributeGroup
├── Break
├── Catch
├── Cobegin
├── Conditional
├── Comment
├── Continue
├── Delete
├── Domain
├── Dot
├── EmptyStmt
├── ErroneousExpression
├── ExternBlock
├── FunctionSignature
├── Identifier
├── Implements
├── Import
├── Include
├── Init
├── Label
├── Let
├── New
├── Range
├── Require
├── Return
├── Select
├── Throw
├── Try
├── Use
├── VisibilityClause
├── When
├── WithClause
├── Yield
├── SimpleBlockLike
│   ├── Begin
│   ├── Block
│   ├── Defer
│   ├── Local
│   ├── Manage
│   ├── On
│   ├── Serial
│   └── Sync
├── Loop
│   ├── DoWhile
│   ├── While
│   └── IndexableLoop
│       ├── BracketLoop
│       ├── Coforall
│       ├── For
│       ├── Forall
│       └── Foreach
├── Literal
│   ├── BoolLiteral
│   ├── ImagLiteral
│   ├── IntLiteral
│   ├── RealLiteral
│   ├── UintLiteral
│   └── StringLikeLiteral
│       ├── BytesLiteral
│       ├── CStringLiteral
│       └── StringLiteral
├── Call
│   ├── FnCall
│   ├── OpCall
│   ├── PrimCall
│   ├── Reduce
│   ├── Scan
│   ├── Tuple
│   └── Zip
└── Decl
    ├── MultiDecl
    ├── TupleDecl
    ├── ForwardingDecl
    └── NamedDecl
        ├── EnumElement
        ├── Function
        ├── Interface
        ├── Module
        ├── TypeQuery
        ├── ReduceIntent
        ├── VarLikeDecl
        │   ├── Formal
        │   ├── TaskVar
        │   ├── VarArgFormal
        │   └── Variable
        └── TypeDecl
            ├── Enum
            └── AggregateDecl
                ├── Class
                ├── Record
                └── Union

The exact class hierarchy may differ depending on the version of Chapel. I used the following script to generate the formatted version above, which can be used to get an up-to-date version.

hierarchy.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


# Modified from the following StackOverflow answer:
#
# https://stackoverflow.com/a/59109706

# prefix components:
space =  '    '
branch = '│   '
# pointers:
tee =    '├── '
last =   '└── '


def tree(node, prefix: str=''):
    """A recursive generator, given a an AST node,
    will yield a visual tree structure line by line
    with each line prefixed by the same characters
    """
    contents = list(node.__subclasses__())
    # contents each get pointers that are ├── with a final └── :
    pointers = [tee] * (len(contents) - 1) + [last]
    for pointer, child in zip(pointers, contents):
        yield prefix + pointer + child.__name__
        if len(child.__subclasses__()) > 0: # extend the prefix and recurse:
            extension = branch if pointer == tee else space
            # i.e. space because last, └── , above so no more |
            yield from tree(child, prefix=prefix+extension)

from chapel import *
print(AstNode.__name__)
for line in tree(AstNode):
    print(line)

When writing Chapel tooling, the AST nodes are one of the primary ways in which a programmer interacts with a Chapel program. The various methods provided by AST node classes, as well as other available features, are documented in the auto-generated chapel module documentation (in particular, the documentation for classes, including AST node classes, starts with AggregateDecl here). The Python bindings also generate a .pyi file which contains the same information, and can be used for Python type checking and autocompletion in editors.

More Patterns and the `chapel.replace` Module

I’ve mentioned before that AST nodes can be used as patterns. However, not all patterns are just AST nodes. The chapel module supports writing more complicated patterns, which can help find more specific pieces of code. For this section, we’ll implement a somewhat limited and impractical version of a common transformation: constant folding. This transformation replaces operations on known values with their result. Thus, a program like:

var x = 1+2;

Might be transformed into:

var x = 3;

I would like to stress that this transformation will be limited — we will simplify 1+2*3 into 1+6, not 7, and we will only handle integers — and impractical, in the sense that the Chapel compiler already performs constant folding as a part of compiling a program (so transforming a source file in this way will not have any advantages). However, implementing this transformation will allow us to play with more sorts of patterns.

Another form of pattern in Chapel’s API is a list. When the pattern is a list, the first element will be matched against AST nodes, whereas the subsequent elements will be matched against the children of the matched node. Thus, [OpCall, IntLiteral, IntLiteral] is a pattern that matches any binary operation whose operands are integer literals. The expression 2*3 will match this pattern, but 1+2*3, as a whole, will not. Running the following Python code:

for module in modules:
    for op, _ in each_matching(module, [OpCall, IntLiteral, IntLiteral]):
        print("Found an operation:", op.op())

on our previous example file, one-two-three.chpl, produces the following output, which excludes the addition:

Found an operation: *

A powerful feature of the pattern API is being able to store parts of the matched AST into a dictionary. Specifically, replacing IntLiteral with ("?x", IntLiteral) will still match the same type of node, but will store the match into key "x" of the dictionary. This can be used to conveniently retrieve child AST nodes that are nested deeper in the tree. We can adjust our pattern to do this:

for module in modules:
    pattern = [OpCall, ("?lhs", IntLiteral), ("?rhs", IntLiteral)]
    for op, vars in each_matching(module, pattern):
        print("Found an operation:", op.op())
        print("Left operand:", vars["lhs"].text())
        print("Right operand:", vars["rhs"].text())

Note that the variable we previously ignored — the second element of the tuple yielded by each_matching — is now stored into the dictionary variable vars. Running the script above produces:

Found an operation: *
Left operand: 2
Right operand: 3

Given this information, we can work on simplification. For the time being, let’s just implement the four basic arithmetic operations: addition, subtraction, multiplication, and division.

def simplify(opnode, lhs, rhs):
  op = opnode.op()
  lhs_val = int(lhs.text())
  rhs_val = int(rhs.text())

  if op == "+":
    return lhs_val + rhs_val
  elif op == "-":
    return lhs_val - rhs_val
  elif op == "*":
    return lhs_val * rhs_val
  elif op == "/":
    return lhs_val // rhs_val
  else:
    return None

for module in modules:
    pattern = [OpCall, ("?lhs", IntLiteral), ("?rhs", IntLiteral)]
    for op, vars in each_matching(module, pattern):
        result = simplify(op, vars["lhs"], vars["rhs"])
        if result is not None:
            (first_line, _) = op.location().start()
            print("I would simplify an expression on line", first_line, "to", result)

Running it on the example above, one-two-three.chpl, the script produces:

I would simplify an expression on line 1 to 6

To actually perform the simplification, we will use the chapel.replace module. This module is specifically provided to help modify Chapel programs via their ASTs. The core feature of this module is the run function, which takes a Python function that finds nodes to replace and then takes over the execution of the Python program to transform it into a command-line replacer. To make use of this, all we need to do is turn the outer loop over modules into a function. Instead of calling print, this function should yield the node-to-replace, as well as the new textual value to replace it with.

def simple_constant_fold(rc, module):
    for op, vars in each_matching(module, [OpCall, ("?lhs", IntLiteral), ("?rhs", IntLiteral)]):
        result = simplify(op, vars["lhs"], vars["rhs"])
        if result is not None:
            yield (op, str(result))

run(simple_constant_fold)

Running this file as follows:

python fold.py one-two-three.chpl

Produces:

var x = 1+6;
writeln(x);

The following is the complete script we developed in this section:

fold.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27


from chapel import *
from chapel.replace import run

def simplify(opnode, lhs, rhs):
  op = opnode.op()
  lhs_val = int(lhs.text())
  rhs_val = int(rhs.text())

  if op == "+":
    return lhs_val + rhs_val
  elif op == "-":
    return lhs_val - rhs_val
  elif op == "*":
    return lhs_val * rhs_val
  elif op == "/":
    return lhs_val // rhs_val
  else:
    return None

def simple_constant_fold(rc, module):
    pattern = [OpCall, ("?lhs", IntLiteral), ("?rhs", IntLiteral)]
    for op, vars in each_matching(module, pattern):
        result = simplify(op, vars["lhs"], vars["rhs"])
        if result is not None:
            yield (op, str(result))

run(simple_constant_fold)

Using Semantic Information

So far, all of the things we’ve done with our Python scripts have been syntactic: they looked solely at the structure of the program, without needing to make sense of the program’s meaning.

What does it mean to look at the program’s meaning? For an example, take a look at the following program:

use IO;

writeln(ioMode : string);

This is a valid Chapel program, and it prints ioMode. Where did ioMode come from, though? There certainly isn’t a definition of that type in this snippet. The answer is that ioMode has been brought in through the use IO statement at the top of the program. If we were to write a version of the program that was only slightly different, it would not compile:

use IO except ioMode;

writeln(ioMode : string); // error: ioMode is not defined

To understand whether or not an identifier like ioMode is valid in a given scope, we need to understand what the surrounding statements — such as use IO — actually do, and how they affect the program. This is what I mean by the “meaning” of the program. In the field of programming languages, the “meaning” of a program is often referred to as its semantics.

“

A major advantage of using the Chapel compiler library to develop language tooling is that it can be queried for semantic information, which would be very hard to replicate in a standalone tool.

”

Following the semantics can be tricky. In the first example, even though ioMode doesn’t occur in the use statement, it’s brought into scope from the IO module. In the second example, even though ioMode is explicitly mentioned, it’s excluded, and therefore not in scope. A major advantage of using the Chapel compiler library to develop language tooling is that it can be queried for semantic information, which would be very hard to replicate in a standalone tool.

To show this off, let’s write a script which I will dub “docbot”. It will read a Chapel program, find all references to standard variables and types, and print out links to their documentation on the Chapel website. We start out as before, except that this time, I configure context’s search paths. This enables it to find the standard library modules.[note:When using semantic information, it’s important to enable Dyno to access the standard modules. This is true in part because the standard modules define a number of essential Chapel procedures (e.g., writeln). More importantly, many Chapel features (ranges, arrays, tuples) are defined using module code. When querying type information in particular, these features will be inscrutable without the standard modules.] The empty lists I use as arguments indicate that I am not overriding any of the default search paths.

1
2
3
4
5
6


import sys
from chapel import *

context = Context()
context.set_module_paths([], [])
modules = context.parse(sys.argv[1])

For this program, the “secret ingredient” will be the to_node method. This method, defined on Identifier and Dot nodes (which represent something like x.y), finds what the node refers to[note:One caveat to using to_node is that it does not perform call resolution. Chapel supports function overloading, which means that without type information, it’s not always possible to determine what the foo in foo(x) refers to. Type checking is far more complicated than name resolution (hence, slower), and is still an active area of work within Dyno.

It’s possible to use type resolution to retrieve refers-to information, but this is not done by to_node, and doing so is prone to running into limitations of Dyno’s current implementation.] and returns the AST node of that definition. The following Chapel program is commented with some examples:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


record someRecord {}
var myNumber = 42;

// calling to_node on 'someRecord' will return the declaration on line 1
writeln(new someRecord());

// calling to_node on 'myNumber' will return the declaration on line 2
writeln(myNumber);

// calling to_node on 'IO' will return the AST node for the IO module.
use IO;

The to_node method uses the exact same process as the Chapel compiler when performing name resolution. This has two important consequences:

You do not have to handle any of it yourself. There’s no need to worry about scopes, shadowing, uses and imports, or any of the other complexities of name resolution.
The information you get will always match what the Chapel compiler would see. This guarantees correctness, in the sense of matching the reference implementation.

To implement “docbot”, we once again iterate over all the modules, and for each node that’s an Identifier or a Dot node, we use to_node to compute what the node refers to. All that’s left then is to print the associated link, as well as the line that it occurs on.

42
43
44
45
46
47
48
49
50


for module in modules:
    for ident, _ in each_matching(module, set([Identifier, Dot])):
        to_node = ident.to_node()
        if not to_node:
            continue

        line_no, _ = ident.location().start()
        full_url = find_doc_link(to_node)
        print("On line {}, URL: {}".format(line_no, full_url))

I’ve glossed over finding the documentation links for the referenced definitions. There is a way to do this, but it’s not particularly critical for the point of this demonstration. As a result, I will relegate its explanation to the appendix; it’s sufficient to take for granted a function find_doc_link, which, given an AST node of a standard library definition, generates a link to its documentation.

The above example uses a set pattern, which is used to match either of its arguments. Thus, set([Identifier, Dot]) will match either an Identifier or a Dot node.

Running this script on the following Chapel program:

list-io.chpl

1
2
3
4


use IO, List;

var x = new list(int);
var y = ioMode.r;

Results in the following output:

On line 1, URL: https://chapel-lang.org/docs/modules/standard/IO.html
On line 1, URL: https://chapel-lang.org/docs/modules/standard/List.html
On line 3, URL: https://chapel-lang.org/docs/modules/standard/List.html#List.list
On line 4, URL: https://chapel-lang.org/docs/modules/standard/IO.html#IO.ioMode.r
On line 4, URL: https://chapel-lang.org/docs/modules/standard/IO.html#IO.ioMode

All of these links work and take us to the Chapel docs!

You can view or download the complete docbot.py script below.

docbot.py

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51


import sys
from chapel import *

context = Context()
context.set_module_paths([], [])
modules = context.parse(sys.argv[1])

def parent_module(node):
    while not isinstance(node, Module):
        node = node.parent_symbol()
    return node

ROOT_URL = "https://chapel-lang.org/docs/modules/standard/"

def build_url(module):
    modules = []
    while module:
        modules.append(module.name())
        module = module.parent_symbol()
    return "/".join(reversed(modules)) + ".html"

def build_anchor(node):
    if isinstance(node, Module):
        return ""

    names = []
    while node:
        names.append(node.name())

        # Anchors don't contain outer modules
        if isinstance(node, Module):
            break

        node = node.parent_symbol()
    return "#" + ".".join(reversed(names))

def find_doc_link(node):
    module = parent_module(node)
    full_url = ROOT_URL + build_url(module) + build_anchor(node)
    return full_url

for module in modules:
    for ident, _ in each_matching(module, set([Identifier, Dot])):
        to_node = ident.to_node()
        if not to_node:
            continue

        line_no, _ = ident.location().start()
        full_url = find_doc_link(to_node)
        print("On line {}, URL: {}".format(line_no, full_url))

To go further with semantic information, you might use the following methods:

type can be used to trigger type resolution and figure out what type a given node has. This can be invoked on any AST node, though only expression-like nodes will return a meaningful result.
resolve gives you access to more of the resolution information. In addition to including the type above, it also includes the results of function resolution, if any, which make it possible to inspect what overloads were selected when resolving calls.
resolve_via can be used to perform resolution within generic instantiations. This method is used to implement the “generic views” in the Chapel Language Server.

Next Steps

I hope this article has given you a taste of what the Chapel front-end library can do. The Dyno effort is not just a project to improve or rewrite the code of the Chapel compiler; its goals also include allowing others to leverage the compiler for their own purposes. The Python bindings are a great way to get started with this, though the full API is also available in C++. Please see the chpldoc source code for an example of using the C++ API, and the chpl-language-server or chplcheck source code for larger examples of using the Python API.

Appendix: Building Documentation URLs

This appendix describes how “docbot” generates URLs to documentation. One thing to know is that the Chapel documentation of the standard modules is organized by module. Thus, the IO module will have its own page, as would the List module, and so on. If we find that an Identifier refers to some declaration, we will need to find which module it comes from.

 8
 9
10
11


def parent_module(node):
    while not isinstance(node, Module):
        node = node.parent_symbol()
    return node

The parent_symbol method finds the symbol — function declaration, module, record, etc. — inside which the given node is being defined. We simply keep traversing the AST upwards until we find a module (as I mentioned before, all Chapel code is contained within some module).

There are two more helper functions I ended up using. One of these is build_url, which builds the name of the HTML file corresponding to a particular module. This name is not just the module name, because some modules can be nested inside of others (e.g., we might have OuterModule.InnerModule.html). Thus, this function traverses upwards through the AST to build a fully-qualified module path.

15
16
17
18
19
20


def build_url(module):
    modules = []
    while module:
        modules.append(module.name())
        module = module.parent_symbol()
    return "/".join(reversed(modules)) + ".html"

The last bit is finding the right definition within its module’s page. Chapel provides HTML IDs for each definition. For a module-level variable or type declaration, the ID is just its name. For something like an element of an enumeration (e.g., the r in ioMode), the ID is fully qualified (i.e., ioMode.r). The build_anchor function takes care of getting this piece.

22
23
24
25
26
27
28
29
30
31
32
33
34
35


def build_anchor(node):
    if isinstance(node, Module):
        return ""

    names = []
    while node:
        names.append(node.name())

        # Anchors don't contain outer modules
        if isinstance(node, Module):
            break

        node = node.parent_symbol()
    return "#" + ".".join(reversed(names))

Finally, find_doc_link puts all of this together:

37
38
39
40


def find_doc_link(node):
    module = parent_module(node)
    full_url = ROOT_URL + build_url(module) + build_anchor(node)
    return full_url