Welcome to day 7 of Chapel’s Advent of Code 2022 series. We’re over halfway through the twelve days of Chapel AoC! In case you haven’t been following the series, check out the introductory Advent of Code 2022: Twelve Days of Chapel article for more context.
The Task at Hand and My Approach
In today’s puzzle, we are given a list of terminal-like commands (
ls
and cd
), as well as output corresponding to running these commands. The commands
explore a fictional file system, which can have files (objects with size)
as well as directories that group files and other (sub-)directories. The
problem then asks to compute the sizes of each folder, and to total up the
sizes of all folders that are smaller than a particular threshold.
The tree-like nature of the file system does not make it amenable to
representations based on arrays, lists, or maps alone. The trouble with
these data types is that they’re flat. Our input could have arbitrary
levels of nested directories. However, arrays, lists, and maps cannot have
such arbitrary nesting — we’d need something like a list of lists of lists of…
We could, of course, use the map
and list
data types to represent the
file system with some sort of adjacency list.
However, such an implementation would be somewhat clunky and hard to use.
Instead, in this article I use a different tool from the repertoire of Chapel language
features, one we haven’t seen so far: classes. Specifically, I use a class, Dir
, to represent
directories in the file system, and build up a tree of these directories
while reading the input. I then create an iterator over this tree that
computes and yields the sizes of the folders. From there, it’s easy to
pick out all directory sizes smaller than the threshold and sum them up.
If you skip right to your favorite parts of a movie, here’s a full solution for the day:
aoc2022-day07-dir-traversals.chpl
|
|
And now, on to the explanation train. Before the train departs, let’s import
a few of the modules we’ll use today. IO
is a permanent fixture in our
solutions (we always need to read input!), and List
is a familiar face.
The only newcomer here is Map
, which helps us associate keys with values,
much like a dictionary in Python, a hash in Ruby, or a map in C++.
We’ll use maps and lists for storing the various files and directories
on the file system.
|
|
With that, our train’s first stop: classes!
Classes in Chapel
Like in most languages, classes in Chapel are a way to group together related pieces of data. Up until now, we’ve used tuples for this purpose. Tuples, however, have a couple of limitations when it comes to solving today’s Advent of Code problem:
- We can’t name a tuple’s elements. Whenever you make and use a tuple, it is up to you to remember the order of the elements within it, and what each element represents.
- Tuples are a statically constrained data structure. We can’t nest tuples within tuples to a depth not known at compile time, just like we couldn’t arbitrarily nest lists within lists.
Classes have neither of these limitations. They do, however, need to be explicitly created within Chapel code. For example, one might create a class to store information about a person:
class person {
var firstName, lastName: string;
}
We’ve seen plenty of var
statements used to create variables; when used
within a class, var
declares a member variable (also known as a field)
for the class. Our person
contains two pieces of data in its fields: the
person’s first name (firstName
) and last name (lastName
).
With that class definition in hand, we can create instances of the person
class
using the new
keyword.
var biggestCandyFan = new person("Daniel", "Fedorin");
As usual, we can rely on type inference to only write the type person
once;
Chapel figures out that biggestCandyFan
is a person
. Now, it’s easy to get
the various fields back out of a class:
writeln("The biggest fan of candy is ", biggestCandyFan.firstName);
Believe it or not, we’ve already seen enough of classes to see how to represent
nested data structures. The key observation is that classes have names, which
means that we can create fields that refer back to instances of the same class. Here’s
an example of what I mean, in the form of a modified person
class:
class person {
var firstName, lastName: string;
var children: list(owned person);
}
The highlighted line is new. We’ve added a list of children to our person.
These children are themselves instances of person
, which means they too
can have children of their own. Et voilà - we’ve got a nested data structure!
Memory Management Strategies
You probably noticed that children
’s type is list(owned person)
—
note the owned
. This keyword is an indication of the way that memory is
allocated and maintained for classes: their memory management. To create
a class, a Chapel program asks for some memory from the computer (allocates it).
This memory is kept by the program until the instance of a class is no longer
needed, at which point it’s deallocated/freed. The challenge is knowing when
a class is no longer needed! This is where memory management strategies,
like owned
, come in.
We don’t need to get too deep into the various memory management strategies in today’s post.
(If you’re curious, here’s a brief description of each strategy…)
-
When using the
owned
strategy, a class instance has one “owner” variable. The instance is only around as long as this owner exists. As soon as the owner disappears, the class instance is deallocated. In some cases — though we won’t be covering them today — ownership can be transferred from one variable to another, but no two values can own the same class instance at the same time.Other variables can still refer to an
owned
class instance, but they must borrow it, creating, for example, aborrowed person
. Borrows do not affect the lifetime of a class nor when it is deallocated. -
When using the
shared
strategy, Chapel keeps track of how many places still have variables that refer to a particular instance of a class. This is typically called a reference count. Each time a variable is created or changed to refer to a class instance, the instance’s reference count increases. When that variable goes out of scope and disappears, the reference count decreases. Finally, when the reference count reaches zero (no more variables refer to the class instance), there’s no point in keeping it around anymore, and its memory is deallocated.As is the case with
owned
, other variables can borrowshared
class instances. Such borrows do not affect the reference count at all, and therefore don’t influence when the instance is freed. -
When using the
unmanaged
strategy, you’re promising to manually free the memory later, using thedelete
keyword. This is very similar to hownew
/delete
work in classic C++.
So, the owned
keyword in our children
list means we’ve opted for the
owned
memory management strategy. The implication of this is that
when a “parent” person is deallocated, so are all of its children
(since the person class, through its children
list, owns each child).
If we aren’t planning on sharing our data, owned
is the preferred strategy. This is because
it precludes the need for some bookkeeping, which
often makes a difference in terms of performance. The added benefit to using
owned
, in my personal view, is that it’s easier to figure out when something
will be deleted — there’s no chance of some other variable, elsewhere in my program,
preventing a class instance’s deallocation.
Methods
Remember how I said that classes can be used to group together pieces
of related data? Well, they can do more than that. They can also group
together operations on this data, in the form of methods. For instance,
we could add the following definition inside the class
declaration
for our person
:
class person {
// ... as before
proc getGreeting() {
return "Hello, " + this.firstName + "!";
}
}
Just like fields can be thought of as var
s that are associated with a particular
class instance, methods can be thought of as procedures associated with
a particular class instance. Thus, methods behave pretty much exactly
like the proc
s we’ve seen so far, with the notable difference of being able to
access that class instance through the this
keyword.
For example, inside the body of a method like getGreeting
above,
this.firstName
gets us the person’s first name, and this.lastName
would
get us their last name.
We can call methods using the dot syntax:
// Prints "Hello, Daniel!"
writeln(biggestCandyFan.getGreeting());
Methods are a powerful tool for abstraction; rather than writing external code
that refers to the various fields of a class, we can put that logic
inside of methods, and avoid exposing it to the rest of the world. A person
writing .getGreeting()
will not need to know how a name is represented
in the person
class.
Another sort of method is a type method (sometimes referred to as
a static method in other languages). Rather than being called on
an instance of a person, like biggestCandyFan
or daniel
, it’s called
on the class itself. For instance:
class person {
// ... as before
proc type createBiggestCandyFan() {
return new person("Daniel", "Fedorin");
}
}
var biggestCandyFan = person.createBiggestCandyFan();
Methods like this have the benefit of being associated with a particular class.
This means that another class can have its own createBiggestCandyFan()
method, and there won’t be any confusion or problems arising from trying
to figure out which is which. Perhaps dogs (represented by a hypothetical
dog
class) have a biggest candy fan, too!
var biggestCandyFan = person.createBiggestCandyFan();
var biggestCandyFanDog = dog.createBiggestCandyFan();
A Dir
Class to Represent Directories
Back to the solution. The class I use for tracking directories is actually not too different
from our modified person
class above. Each directory
[note:
Despite the recent media noise about ChatGPT, directories have not yet
been granted personhood, and do not have both first and last names.
]
as well as a collection of files and directories it contains.
|
|
Since files have no
additional information to them besides their size, I decided to represent
them as a map — a directory’s files
field associates each file’s name
with that file’s size. The subdirectories are represented just like
the children
field from our person
record, as a list of owned Dir
s.
There are a few more things I want to add to Dir
;
the first is a way to read our directory from our puzzle input.
Reading the File System with the fromInput
Type Method
For reasons of abstraction and avoiding conflicts, I put
the code for creating a directory from user input into a type method on Dir
. Within
this method, I include the now-familiar code for reading from the
input using readLine
, until we run out of lines.
|
|
Notice that I’m accepting the name for the
directory as a string formal and initializing a new variable newDir
with that name.
Notice also that I don’t need to provide the files
and dirs
as arguments to new Dir
— they have default values in the
class definition. By default, new
uses the owned
memory management
strategy. For the time being, the newDir
variable owns our
directory-under-construction.
We’re reading lines now; all that’s left is to figure out what to do
with them. The first case is that of $ cd ..
. When we see that line,
it means that we’re done looking at the current directory; none
of the subsequent ls
lines will be meant for us. Thus, we break
out of the input while
-loop.
|
|
If the cd
command is used, but its argument isn’t ..
, we’re being
asked to descend into a sub-directory of our current newDir
.
In this case, we call the fromInput
method again, recursively,
to create a subdirectory of the current one. This
call will keep consuming lines from the input until the sub-directory
has been processed, at which point it will return it to us. We’ll
immediately append this sub-directory to the newDir.dirs
list,
which becomes the sub-directory’s new owner.
Recall that we need to give fromInput
the name of the new
sub-directory. We can figure out the name by slicing the string
starting after the $ cd
prefix. Since I want to get the rest of the
characters after the prefix, I leave the end of my range unbounded, which
makes the slice go until the characters run out at the end of the string.
If you’re feeling shaky on lists and pushBack
, check out our day 5 article.
If you want a little refresher on slicing, we first covered it on day 3.
|
|
As it turns out, all that’s left is to handle files. We already get
directory names from cd
, so there’s no reason to worry about
lines starting with dir
. The ls
command itself always precedes
the list of files and directories; by itself, it provides us no
additional information. Thus, our last case is a line that’s neither
dir
nor ls
. Such a line is a file, so its format will be a number
followed by the file’s name.
I use the partition
method on the line to split it into three
pieces: the part before the space, the space itself, and the part
after the space. After that, I can just update the newDir
map,
associating the file called name
with its size. I use an integer cast
to convert size
(a string) to a number.
|
|
That’s it for the loop! Once the loop stops running, we know we’re done
processing the directory. All that remains is to return it. Returning
an owned
value from a function or method transfers ownership to whatever
code calls the function or method.
|
|
One more thing: I have explicitly annotated the
return type of fromInput
to be owned Dir
to let Chapel know
that I’m using the owned
memory management strategy. This might just
be the first return type annotation we’ve written so far. Up until now,
Chapel has been able to deduce the return types of our procedures
and iterators automatically. However, here, because we are using
recursion, it needs just a little bit of help: determining the types
in the body of fromInput
requires knowing the type of fromInput
itself!
The manual type annotation helps break that loop.
An Iterator Method for Listing Directory Sizes
Let’s recap. What we have now is a data structure, Dir
, which represents
the directory tree. We also have a type method, Dir.fromInput
that
converts our puzzle input into this data structure. What’s left?
The way I see it, the problem is composed of three pieces:
- Go through all of the directory sizes…
- … ignoring those that are above a certain threshold …
- … and sum them.
Over the past week, we’ve gotten really good at summing things! In
Chapel, we can just use + reduce
to compute the sum of something
iterable, so there’s point number three. For point two, it turns out that
those loop expressions
from yesterday can be used to filter out elements like so:
[for i in iterable] if someCondition then i
Putting these two pieces together, we might write something like:
+ reduce [for size in directorySizes] if size < 1000000 then size
That directorySizes
expression is the only “fictional” piece of the solution.
Perhaps we can make our Dir
tree support an iterator of directory sizes?
Then, we’d have our answer.
In my solution, I do just that. Methods on classes don’t have to be procedures —
they can also be iterators. There’s only one complication. We want our
iterator method to yield the sizes of all of the various sub-directories
within a Dir
including sub-directories of sub-directories. That’s because
we have to sum them all up as per the problem statement. However, when
computing the size of a directory, we don’t want to include sub-sub-directories
in our counting: the direct sub-directories already include the sizes of
their own contents. To make this work, I added a parentSize
formal to
the iterator method, which represents a reference to the parent directory’s
size. When it’s done yielding its own size, as well as the sizes of the
sub-directories, the iterator method will add its own size to its parent’s.
Here’s the implementation of the iterator method; I’ll talk about it in more detail below.
|
|
The first thing this method does is create a new variable, size
,
representing the current directory’s size. It’s initialized to the sum
of all the file sizes. However, at this point, that’s not the whole size —
we also need to figure out how much data is stored in the subdirectories.
I use a for
loop over the dirs
list to examine each sub-directory
of the current folder in turn. Each of these sub-directories is its
own full-fledged Dir
, so we can call its dirSizes
method. This gives us an iterator of all directory sizes from subDir
.
I simply yield them from the parent iterator, making it yield
the sizes of all directories, including nested ones. Notice that I also
provide size
as the argument to the recursive call to dirSizes
:
the inner for-loop serves the double purpose of yielding directory sizes
and finishing computing the current folder’s size.
Once all of the sub-directory sizes have been yielded, the size
variable
includes all the files in the folder, including nested ones. Thus, I use it to yield
the size of the current folder. I also add size
to parentSize
.
That concludes our Dir
class!
|
|
Putting It All Together
With our Dir
class complete, we can finally make use of it in our code.
The first thing we need to do is read our file system from the input;
this is accomplished using the fromInput
method.
|
|
Next up, we can use that + reduce
expression I described above. I use
a new variable, rootSize
, to represent the size of the top-level directory.
After the call to dirSizes
completes, it will be set to the total size of
the root directory, i.e., the total disk usage.
|
|
I could’ve omitted the argument to dirSizes
— notice from the method’s
signature that I provide a default value for parentSize
.
iter dirSizes(ref parentSize = 0): int {
However, knowing rootSize
lets us easily compute the amount of space we need
to free up (for part 2 of today’s problem).
|
|
We can now re-use our dirSizes
stream to check every directory size again,
this time looking for the smallest folder that meets a certain threshold.
A min
reduction takes care of this:
|
|
And there’s the solution to part 2, as well!
Summary
This concludes today’s description of my solution. This time, I introduced Chapel’s classes — defining them, creating fields and adding methods. We got a little taste of memory management strategies and ownership, though I deliberately kept it light to avoid introducing too many new concepts.
Admittedly, today’s solution is (for the most part) serial. Although the
+ reduce
expression that computes the initial size
of a directory from
its files
is eligible for parallelization, the dirSizes
iterator is not. The main
reason for this is that the interaction between recursive parallel iterators and
reductions is, at the time of writing, unimplemented.
Nevertheless, I think that using even a serial iterator has yielded an elegant
solution (pun intended).
If you wanted to write a parallel version, I’d advise creating a new,
non-iterator method on Dir
that solves just part 1 of today’s puzzle.
This method could return a tuple of two elements, perhaps sumSmallSizes
and dirSize
; then, a simple forall
loop over dirs
(and judicious use of reduce intents,
which are described in our day 4 article)
will let you compute the answer in parallel.
Thanks for reading! Please feel free to ask any questions or post any comments you have in the new Blog Category of Chapel’s Discourse Page.
Updates to this article
Date | Change |
---|---|
Feb 5, 2023 | Replaced list.append() with new list.pushBack() method |