Strings¶

The following documentation shows functions and methods used to manipulate and process Chapel strings.

The string type in Chapel represents a sequence of UTF-8 characters and is most often used to represent textual data.

Methods Available in Standard Modules¶

Besides the functions below, some other modules provide routines that are useful for working with strings. The IO module provides format, which creates a string from a format string and its arguments, and also includes functions for reading and writing strings. The Regex module provides routines for searching within strings.
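As a brief illustration of the format routine mentioned above, the following sketch builds a string from a format string. The variable names are illustrative, and the %s and %i specifiers are assumed to follow the IO module's formatted I/O conventions.

```chapel
use IO;

// Build a string by substituting the arguments into the format string.
var name = "cart";
var count = 3;
var msg = "%s has %i items".format(name, count);

writeln(msg);  // cart has 3 items
```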

Casts from String to a Numeric Type¶

The string type supports casting to numeric types. Such casts will convert the string to the numeric type and throw an error if the string is invalid. For example:

var number = "a":int;

throws an error when it is executed, but

var number = "1":int;

stores the value 1 in number.

To learn more about handling these errors, see the Language-Specification page on Error Handling.
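One way to handle a failed cast is to wrap it in a try block and fall back to a default value. The following is a minimal sketch; the helper name parseOrDefault is hypothetical and not part of any standard module.

```chapel
// Hypothetical helper: convert s to an int, or return fallback
// if the cast throws because s is not a valid integer.
proc parseOrDefault(s: string, fallback: int): int {
  var result = fallback;
  try {
    result = s:int;
  } catch {
    // The cast threw, so the fallback value is kept.
  }
  return result;
}

writeln(parseOrDefault("42", 0));   // 42
writeln(parseOrDefault("a", -1));   // -1
```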

Unicode Support¶

Chapel strings use the UTF-8 encoding. Note that ASCII strings are a simple subset of UTF-8 strings, because every ASCII character is a UTF-8 character with the same meaning.

Non-Unicode Data and Chapel Strings¶

For doing string operations on non-Unicode or arbitrary data, consider using bytes instead of string. However, there may be cases where string must be used with non-Unicode data. Examples of this are file system and path operations on systems where UTF-8 file names are not enforced.

In such scenarios, non-UTF-8 data can be escaped and stored in a string in a way that it can be restored when needed. For example:

var myBytes = b"Illegal \xff sequence";  // \xff is non UTF-8
var myEscapedString = myBytes.decode(policy=decodePolicy.escape);

will escape the illegal 0xFF byte and store it in the string. The escaping strategy is similar to Python’s “surrogate escapes” and is as follows.

Each individual byte in an illegal sequence is bitwise-or’ed with 0xDC00 to create a 2-byte codepoint.

Then, this codepoint is encoded in UTF-8 and stored in the string buffer.

This strategy typically results in storing 3 bytes for each byte in the illegal sequence. Similarly escaped strings can also be created with createStringWithNewBuffer using a C buffer.
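The arithmetic behind the two steps above can be worked through by hand for the byte 0xFF. The sketch below uses plain integer operations only (it does not call into the string implementation) and assumes the standard three-byte UTF-8 layout for codepoints in the range U+0800 to U+FFFF.

```chapel
// Step 1: bitwise-or the illegal byte with 0xDC00.
const illegalByte = 0xFF;
const escaped = illegalByte | 0xDC00;          // 0xDCFF

// Step 2: encode the codepoint using the three-byte UTF-8 layout
// 1110xxxx 10xxxxxx 10xxxxxx.
const b0 = 0xE0 | (escaped >> 12);             // 0xED
const b1 = 0x80 | ((escaped >> 6) & 0x3F);     // 0xB3
const b2 = 0x80 | (escaped & 0x3F);            // 0xBF

writeln(escaped == 0xDCFF);                       // true
writeln(b0 == 0xED && b1 == 0xB3 && b2 == 0xBF);  // true
```

This is why one illegal byte typically occupies three bytes once escaped.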

An escaped data sequence can be reconstructed with encode:

var reconstructedBytes = myEscapedString.encode(policy=encodePolicy.unescape);
writeln(myBytes == reconstructedBytes);  // prints true

Alternatively, an escaped sequence can be used as-is without reconstructing the bytes:

var escapedBytes = myEscapedString.encode(policy=encodePolicy.pass);
writeln(myBytes == escapedBytes);  // prints false

Note

Strings that contain escaped sequences cannot be directly used with unformatted I/O functions such as writeln. Formatted I/O can be used to print such strings with binary formatters such as %|s.

Note

The standard FileSystem, Path and IO modules can use escaped strings as described above for paths and file names.

Lengths and Offsets in Unicode Strings¶

For Unicode strings, and in particular UTF-8 strings, there are several possible units for offsets or lengths:

bytes

codepoints

graphemes

Most methods on the Chapel string type currently work with codepoint units by default. For example, size returns the length in codepoints and int values passed into this are offsets in codepoint units.

It is possible to indicate byte or codepoint units for indexing in the string methods by using arguments of type byteIndex or codepointIndex respectively.

For speed of indexing with their result values, find() and rfind() return a byteIndex.
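To make the difference between byte and codepoint units concrete, the following sketch builds a short string containing a two-byte character and compares its lengths. The string is assembled with codepointToString so that the example does not depend on how the source file is encoded; the variable names are illustrative.

```chapel
// U+00E9 (e with an acute accent) occupies two bytes in UTF-8.
const eAcute = codepointToString(0xE9: int(32));
const s = eAcute + "t" + eAcute;

writeln(s.size);           // 3 (codepoints)
writeln(s.numCodepoints);  // 3
writeln(s.numBytes);       // 5 (2 + 1 + 2)
```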

Note

Support for grapheme units is not implemented at this time.

Using the byteIndex and codepointIndex types¶

A value of type byteIndex or codepointIndex can be passed to certain string functions to indicate that the function should operate with units of bytes or codepoints. Passing a codepointIndex has the same behavior as passing an integral type. See string.this for an example.

Both of these types can be created from an int via assignment or cast. They also support addition and subtraction with int. Finally, values of the same type can be compared.

For example, the following function returns a string containing only the second byte of the argument:

proc getSecondByte(arg:string) {
  var offsetInBytes = 1:byteIndex;
  return arg[offsetInBytes];
}

Whereas the following function returns a string containing only the second codepoint of the argument:

proc getSecondCodepoint(arg:string) {
  var offsetInCodepoints = 1:codepointIndex;
  return arg[offsetInCodepoints];
}
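The arithmetic and comparison support described above can be sketched as follows. This is an illustration only, using an ASCII string so that byte and codepoint offsets coincide; the variable names are hypothetical.

```chapel
var s = "hello";

var i = 1: codepointIndex;   // created from an int by a cast
var j = i + 2;               // adding an int keeps the codepointIndex type

writeln(s[j]);               // l (offset 3)
writeln(i < j);              // true
```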

Predefined Routines on Strings¶

The string type:

type string¶

supports the following methods:

proc createStringWithBorrowedBuffer(x: string): string¶

Creates a new string which borrows the internal buffer of another string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments: x : string – Object to borrow the buffer from
Returns: A new string

proc createStringWithBorrowedBuffer(x: c_string, length = x.size): string throws

Creates a new string which borrows the internal buffer of a c_string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments

x : c_string – Object to borrow the buffer from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.

Throws

DecodeError if x contains non-UTF-8 characters.

Returns

A new string

proc createStringWithBorrowedBuffer(x: c_ptr(?t), length: int, size: int): string throws

Creates a new string which borrows the memory allocated for a c_ptr. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments

x : c_ptr(uint(8)) or c_ptr(c_char) – Object to borrow the buffer from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size – Size of memory allocated for x in bytes

Throws

DecodeError if x contains non-UTF-8 characters.

Returns

A new string

proc createStringWithOwnedBuffer(x: c_string, length = x.size): string throws¶

Creates a new string which takes ownership of the internal buffer of a c_string. The buffer will be freed when the string is deinitialized.

Arguments

x : c_string – Object to take ownership of the buffer from
length – Length of the string stored in x in bytes, excluding the terminating null byte.

Returns

A new string

proc createStringWithOwnedBuffer(x: c_ptr(?t), length: int, size: int): string throws

Creates a new string which takes ownership of the memory allocated for a c_ptr. The buffer will be freed when the string is deinitialized.

Arguments

x : c_ptr(uint(8)) or c_ptr(c_char) – Object to take ownership of the buffer from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size – Size of memory allocated for x in bytes

Throws

DecodeError if x contains non-UTF-8 characters.

Returns

A new string

proc createStringWithNewBuffer(x: string): string¶

Creates a new string by creating a copy of the buffer of another string.

Arguments: x : string – Object to copy the buffer from
Returns: A new string

proc createStringWithNewBuffer(x: c_string, length = x.size, policy = decodePolicy.strict): string throws

Creates a new string by creating a copy of the buffer of a c_string.

Arguments

x : c_string – Object to copy the buffer from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
policy –
- decodePolicy.strict raises an error
- decodePolicy.replace replaces the malformed character with UTF-8 replacement character
- decodePolicy.drop drops the data silently
- decodePolicy.escape escapes each illegal byte with private use codepoints

Throws

DecodeError if decodePolicy.strict is passed to the policy argument and x contains non-UTF-8 characters.

Returns

A new string

proc createStringWithNewBuffer(x: c_ptr(?t), length: int, size = length + 1, policy = decodePolicy.strict): string throws

Creates a new string by creating a copy of a buffer.

Arguments

x : c_ptr(uint(8)) or c_ptr(c_char) – The buffer to copy
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size – Size of memory allocated for x in bytes. This argument is ignored by this function.

Throws

DecodeError if x contains non-UTF-8 characters.

Returns

A new string

proc string.size: int¶

Returns: The number of codepoints in the string.

proc string.indices: range¶

Returns: The indices that can be used to index into the string (i.e., the range 0..<this.size)

proc string.numBytes: int¶

Returns: The number of bytes in the string.

proc string.numCodepoints: int¶

Returns: The number of codepoints in the string, assuming the string is correctly-encoded UTF-8.

proc string.localize(): string¶

Gets a version of the string that is on the currently executing locale.

Returns: A shallow copy if the string is already on the current locale, otherwise a deep copy is performed.

proc string.c_str(): c_string¶

Get a c_string from a string.

Warning

This can only be called safely on a string whose home is the current locale. This property can be enforced by calling string.localize() before c_str(). If the string is remote, the program will halt.

For example:

var my_string = "Hello!";
on different_locale {
  printf("%s", my_string.localize().c_str());
}

Returns: A c_string that points to the underlying buffer used by this string. The returned c_string is only valid when used on the same locale as the string.

proc string.encode(policy = encodePolicy.pass): bytes¶

Returns a bytes from the given string. If the string contains escaped non-UTF-8 bytes, the policy argument determines how they are handled.

Arguments: policy – encodePolicy.pass directly copies the (potentially escaped) data, encodePolicy.unescape recovers the escaped bytes back.
Returns: bytes

iter string.items(): string¶

Iterates over the string character by character.

For example:

var str = "abcd";
for c in str.items() {
  writeln(c);
}

Output:

a
b
c
d

iter string.these(): string¶

Iterates over the string character by character, yielding 1-codepoint strings. (A synonym for string.items)

For example:

var str = "abcd";
for c in str {
  writeln(c);
}

Output:

a
b
c
d

iter string.bytes(): uint(8)¶: Iterates over the string byte by byte.

iter string.codepoints(): int(32)¶: Iterates over the string Unicode character by Unicode character.

proc string.toByte(): uint(8)¶

Returns: The value of a single-byte string as an integer.

proc string.byte(i: int): uint(8)¶

Returns: The value of the i th byte as an integer.

proc string.toCodepoint(): int(32)¶

Returns: The value of a single-codepoint string as an integer.

proc string.codepoint(i: int): int(32)¶

Returns: The value of the i th multibyte character as an integer.
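The byte and codepoint accessors above can be exercised on a small ASCII string, where each character is one byte and one codepoint. This sketch assumes the 0-based indexing described by string.indices.

```chapel
var s = "hi";

for b in s.bytes() do writeln(b);          // 104, then 105
for cp in s.codepoints() do writeln(cp);   // 104, then 105

writeln(s.byte(1));          // 105 ('i')
writeln(s.codepoint(0));     // 104 ('h')
writeln("A".toByte());       // 65
writeln("A".toCodepoint());  // 65
```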

proc string.this(i: byteIndex): string¶

Return the codepoint starting at the i th byte in the string

Returns: A string with the complete multibyte character starting at the specified byte index from 0..#string.numBytes

proc string.this(i: codepointIndex): string

Return the i th codepoint in the string. (A synonym for string.item)

Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints

proc string.this(i: int): string

Return the i th codepoint in the string. (A synonym for string.item)

Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints

proc string.item(i: codepointIndex): string¶

Return the i th codepoint in the string

Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints

proc string.item(i: int): string

Return the i th codepoint in the string

Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints

proc string.this(r: range(?)): string throws

Slice a string. Halts if r is non-empty and not completely inside the range 0..<string.size when compiled with --checks. --fast disables this check.

Arguments: r – range of the indices the new string should be made from
Throws: CodepointSplittingError if slicing results in splitting a multi-byte codepoint.
Returns: a new string that is a substring within 0..<string.size. If the length of r is zero, an empty string is returned.
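A short sketch of slicing with integer (codepoint) ranges follows. The string and variable names are illustrative, and the use of an unbounded range for "to the end of the string" is an assumption about range support in slicing.

```chapel
var s = "Hello, world";

var greeting = s[0..4];   // codepoints 0 through 4
var rest = s[7..];        // from codepoint 7 to the end

writeln(greeting);  // Hello
writeln(rest);      // world
```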

proc string.isEmpty(): bool¶

Returns

true – when the string is empty
false – otherwise

proc string.startsWith(patterns: string ...): bool¶

Arguments

patterns – A varargs list of strings to match against.

Returns

true – when the string begins with one or more of the patterns
false – otherwise

proc string.endsWith(patterns: string ...): bool¶

Arguments

patterns – A varargs list of strings to match against.

Returns

true – when the string ends with one or more of the patterns
false – otherwise
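Because both methods accept a varargs list, several candidate prefixes or suffixes can be tested in one call. A minimal sketch, with an illustrative file name:

```chapel
var fileName = "report.csv";

// True if the string ends with any one of the listed patterns.
writeln(fileName.endsWith(".csv", ".tsv"));      // true
writeln(fileName.startsWith("draft", "final"));  // false
```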

proc string.find(pattern: string, indices: range(?) = this.byteIndices: range(byteIndex)): byteIndex¶

Arguments

pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size

Returns

the index of the first occurrence of pattern within a string, or -1 if the pattern is not in the string.

proc string.rfind(pattern: string, indices: range(?) = this.byteIndices: range(byteIndex)): byteIndex¶

Arguments

pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size

Returns

the index of the first occurrence of pattern when searching from the right (that is, the last occurrence within the string), or -1 if the pattern is not in the string.
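The following sketch contrasts find and rfind on a string with a repeated pattern. The results are cast to int for printing and comparison; that cast from byteIndex is assumed to be available alongside the int-to-byteIndex cast described earlier.

```chapel
var s = "banana";

var first = s.find("an"): int;     // leftmost match
var last = s.rfind("an"): int;     // rightmost match
var missing = s.find("xyz"): int;  // pattern not present

writeln(first);    // 1
writeln(last);     // 3
writeln(missing);  // -1
```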

proc string.count(pattern: string, indices: range(?) = this.indices): int¶

Arguments

pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size

Returns

the number of times pattern occurs in the string

proc string.replace(pattern: string, replacement: string, count: int = -1): string¶

Arguments

pattern – the string to search for
replacement – the string to replace pattern with
count – an optional integer specifying the number of replacements to make, values less than zero will replace all occurrences

Returns

a copy of the string where replacement replaces pattern up to count times
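A brief sketch of count and replace, including the optional limit on the number of replacements. The input string is illustrative.

```chapel
var s = "a-b-c-d";

writeln(s.count("-"));            // 3
writeln(s.replace("-", "+"));     // a+b+c+d
writeln(s.replace("-", "+", 2));  // a+b+c-d (only the first two replaced)
```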

iter string.split(sep: string, maxsplit: int = -1, ignoreEmpty: bool = false): string¶

Splits the string on sep yielding the substring between each occurrence, up to maxsplit times.

Arguments

sep – The delimiter used to break the string into chunks.
maxsplit – The number of times to split the string, negative values indicate no limit.
ignoreEmpty –
- When true – Empty strings will not be yielded,
  and will not count towards maxsplit
- When false – Empty strings will be yielded when
  sep occurs multiple times in a row.

iter string.split(maxsplit: int = -1): string

Works as above, but uses runs of whitespace as the delimiter.

Arguments: maxsplit – The number of times to split the string, negative values indicate no limit.
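The effect of ignoreEmpty and of the whitespace-splitting overload can be seen in the following sketch. The brackets in the output are only there to make empty pieces visible.

```chapel
// Yields "a", "b", "", "c": the empty piece between the two commas is kept.
for piece in "a,b,,c".split(",") do
  writeln("[", piece, "]");

// With ignoreEmpty=true, the empty piece is skipped.
for piece in "a,b,,c".split(",", ignoreEmpty=true) do
  writeln("[", piece, "]");

// With no separator, runs of whitespace act as the delimiter.
for word in "one  two\tthree".split() do
  writeln("[", word, "]");
```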

proc string.join(const ref x: string ...): string¶

Returns a new string, which is the concatenation of all of the strings passed in with the contents of the method receiver inserted between them.

var myString = "|".join("a","10","d");
writeln(myString); // prints: "a|10|d"

Arguments: x – string values to be joined
Returns: A string

proc string.join(const ref x): string

Returns a new string, which is the concatenation of all of the strings passed in with the contents of the method receiver inserted between them.

var tup = ("a","10","d");
var myJoinedTuple = "|".join(tup);
writeln(myJoinedTuple); // prints: "a|10|d"

var myJoinedArray = "|".join(["a","10","d"]);
writeln(myJoinedArray); // prints: "a|10|d"

Arguments: x – An array or tuple of string values to be joined
Returns: A string

proc string.strip(chars: string = " \t\r\n", leading = true, trailing = true): string¶

Arguments

chars – A string containing each character to remove. Defaults to " \t\r\n".
leading – Indicates if leading occurrences should be removed. Defaults to true.
trailing – Indicates if trailing occurrences should be removed. Defaults to true.

Returns

A new string with leading and/or trailing occurrences of characters in chars removed as appropriate.
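A minimal sketch of strip with its default arguments and with a custom character set. The input strings are illustrative.

```chapel
var padded = "  hello\n";
writeln(padded.strip());                      // hello

// Remove only leading occurrences of 'x'.
var marked = "xxhixx";
writeln(marked.strip("x", trailing=false));   // hixx
```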

proc string.partition(sep: string): 3*(string)¶: Splits the string on sep into a 3*string consisting of the section before sep, sep, and the section after sep. If sep is not found, the tuple will contain the whole string, and then two empty strings.
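The returned tuple can be unpacked directly into three variables. The sketch below shows both the case where sep is found and the case where it is not; the variable names are illustrative.

```chapel
// Separator found: before, separator, after.
var (key, sep, value) = "name=Chapel".partition("=");
writeln(key);    // name
writeln(value);  // Chapel

// Separator not found: the whole string, then two empty strings.
var (whole, noSep, nothing) = "novalue".partition("=");
writeln(whole);            // novalue
writeln(noSep.isEmpty());  // true
```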

proc string.dedent(columns = 0, ignoreFirst = true): string¶

Remove indentation from each line of string.

This can be useful when applied to multi-line strings that are indented in the source code, but should not be indented in the output.

When columns == 0, determine the level of indentation to remove from all lines by finding the common leading whitespace across all non-empty lines. Empty lines are lines containing only whitespace. Tabs and spaces are the only whitespaces that are considered, but are not treated as the same characters when determining common whitespace.

When columns > 0, remove columns leading whitespace characters from each line. Tabs are not considered whitespace when columns > 0, so only leading spaces are removed.

Arguments

columns – The number of columns of indentation to remove. Infer common leading whitespace if columns == 0.
ignoreFirst – When true, ignore first line when determining the common leading whitespace, and make no changes to the first line.

Returns

A new string with indentation removed.

Warning

string.dedent is subject to change in the future.

proc string.isUpper(): bool¶

Checks if all the characters in the string are either uppercase (A-Z) or uncased (not a letter).

Returns

true – if the string contains at least one uppercase character and no lowercase characters, ignoring uncased characters.
false – otherwise

proc string.isLower(): bool¶

Checks if all the characters in the string are either lowercase (a-z) or uncased (not a letter).

Returns

true – when there are no uppercase characters in the string.
false – otherwise

proc string.isSpace(): bool¶

Checks if all the characters in the string are whitespace (' ', '\t', '\n', '\v', '\f', '\r').

Returns

true – when all the characters are whitespace.
false – otherwise

proc string.isAlpha(): bool¶

Checks if all the characters in the string are alphabetic (a-zA-Z).

Returns

true – when the characters are alphabetic.
false – otherwise

proc string.isDigit(): bool¶

Checks if all the characters in the string are digits (0-9).

Returns

true – when the characters are digits.
false – otherwise

proc string.isAlnum(): bool¶

Checks if all the characters in the string are alphanumeric (a-zA-Z0-9).

Returns

true – when the characters are alphanumeric.
false – otherwise

proc string.isPrintable(): bool¶

Checks if all the characters in the string are printable.

Returns

true – when the characters are printable.
false – otherwise

proc string.isTitle(): bool¶

Checks if all uppercase characters are preceded by uncased characters, and if all lowercase characters are preceded by cased characters.

Returns

true – when the condition described above is met.
false – otherwise
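A few of the predicates above applied to simple ASCII inputs, as a quick sketch. The inputs were chosen so that each result follows directly from the descriptions given.

```chapel
writeln("HELLO 123".isUpper());    // true: uppercase letters plus uncased characters
writeln("hello".isLower());        // true
writeln(" \t\n".isSpace());        // true
writeln("abc123".isAlpha());       // false: contains digits
writeln("abc123".isAlnum());       // true
writeln("123".isDigit());          // true
writeln("Hello World".isTitle());  // true
```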

proc string.toLower(): string¶

Returns: A new string with all uppercase characters replaced with their lowercase counterpart.

Note

The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in the Unicode mapping.

proc string.toUpper(): string¶

Returns: A new string with all lowercase characters replaced with their uppercase counterpart.

Note

The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in the Unicode mapping.

proc string.toTitle(): string¶

Returns: A new string with all cased characters following an uncased character converted to uppercase, and all cased characters following another cased character converted to lowercase.

Note

The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in the Unicode mapping.
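The three case conversions can be compared side by side on ASCII input, as in this brief sketch:

```chapel
writeln("Hello World".toUpper());  // HELLO WORLD
writeln("Hello World".toLower());  // hello world
writeln("hello wORLD".toTitle());  // Hello World
```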

operator =(ref lhs: string, rhs: string): void¶: Copies the string rhs into the string lhs.

operator string.+(s0: string, s1: string): string¶

Returns: A new string which is the result of concatenating s0 and s1

operator *(s: string, n: integral): string¶

Returns: A new string which is the result of repeating s n times. If n is less than or equal to 0, an empty string is returned.

The operation is commutative. For example:

writeln("Hello! " * 3);
or
writeln(3 * "Hello! ");

Results in:

Hello! Hello! Hello!

operator string.+=(ref lhs: string, const ref rhs: string): void¶: Appends the string rhs to the string lhs.

proc codepointToString(i: int(32)): string¶

Returns: A string storing the complete multibyte character sequence that corresponds to the codepoint value i.
