Strings
The following documentation shows functions and methods used to manipulate and process Chapel strings.
The string type in Chapel represents a sequence of UTF-8 characters and is most often used to represent textual data.
Methods Available in Standard Modules
Besides the functions below, some other modules provide routines that are useful for working with strings. The IO module provides format, which creates a string that is the result of formatting. It also includes functions for reading and writing strings. The Regex module also provides some routines for searching within strings.
Casts from String to a Numeric Type
The string type supports casting to numeric types. Such casts convert the string to the numeric type and throw an error if the string is invalid. For example:
var number = "a":int;
throws an error when it is executed, but
var number = "1":int;
stores the value 1 in number.
To learn more about handling these errors, see the Language-Specification page on Error Handling.
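As a minimal sketch of handling a failed cast, the helper below wraps the conversion in a try/catch block and falls back to a default value. The name parseOrDefault and its fallback parameter are hypothetical and exist only for illustration.

```chapel
// Hypothetical helper: convert a string to an int, or return `fallback`
// when the cast throws because the string is not a valid integer.
proc parseOrDefault(s: string, fallback: int): int {
  var result = fallback;
  try {
    result = s: int;
  } catch e {
    // `e` is the error thrown by the failed cast.
    writeln("cannot convert \"", s, "\": ", e.message());
  }
  return result;
}

writeln(parseOrDefault("1", 0));   // the cast succeeds, so this prints 1
writeln(parseOrDefault("a", -1));  // the cast throws, so the fallback is printed
```

Because the catch clause handles every error, the helper itself does not need to be declared with throws.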
Unicode Support
Chapel strings use the UTF-8 encoding. Note that ASCII strings are a simple subset of UTF-8 strings, because every ASCII character is a UTF-8 character with the same meaning.
Non-Unicode Data and Chapel Strings
For doing string operations on non-Unicode or arbitrary data, consider using bytes instead of string. However, there may be cases where string must be used with non-Unicode data. Examples of this are file system and path operations on systems where UTF-8 file names are not enforced.
In such scenarios, non-UTF-8 data can be escaped and stored in a string in a way that it can be restored when needed. For example:
var myBytes = b"Illegal \xff sequence"; // \xff is non UTF-8
var myEscapedString = myBytes.decode(policy=decodePolicy.escape);
will escape the illegal 0xFF byte and store it in the string. The escaping strategy is similar to Python’s “surrogate escapes” and is as follows.
Each individual byte in an illegal sequence is bitwise-or’ed with 0xDC00 to create a 2-byte codepoint. Then, this codepoint is encoded in UTF-8 and stored in the string buffer.
This strategy typically results in storing 3 bytes for each byte in the illegal sequence. Similarly escaped strings can also be created with createCopyingBuffer using a C buffer.
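The size effect of escaping can be observed with a short sketch like the one below. It reuses the bytes value from the example above and only relies on bytes.size and string.numBytes. The exact numbers printed are not asserted here; per the description, the escaped string is expected to need more bytes than the original data.

```chapel
var myBytes = b"Illegal \xff sequence";
var escaped = myBytes.decode(policy=decodePolicy.escape);

// The illegal byte 0xFF is escaped as the codepoint (0xFF | 0xDC00) = 0xDCFF,
// which is then stored in its UTF-8 encoding inside the string buffer.
writeln(myBytes.size);      // number of bytes in the original data
writeln(escaped.numBytes);  // number of bytes stored in the escaped string
```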
An escaped data sequence can be reconstructed with encode:
var reconstructedBytes = myEscapedString.encode(policy=encodePolicy.unescape);
writeln(myBytes == reconstructedBytes); // prints true
Alternatively, an escaped sequence can be used as-is without reconstructing the bytes:
var escapedBytes = myEscapedString.encode(policy=encodePolicy.pass);
writeln(myBytes == escapedBytes); // prints false
Note
Strings that contain escaped sequences cannot be directly used with unformatted I/O functions such as writeln. Formatted I/O can be used to print such strings with binary formatters such as %|s.
Note
The standard FileSystem, Path and IO modules can use escaped strings as described above for paths and file names.
Lengths and Offsets in Unicode Strings
For Unicode strings, and in particular UTF-8 strings, there are several possible units for offsets or lengths:
bytes
codepoints
graphemes
Most methods on the Chapel string type currently work with codepoint units by default. For example, size returns the length in codepoints and int values passed into this are offsets in codepoint units.
It is possible to indicate byte or codepoint units for indexing in the string methods by using arguments of type byteIndex or codepointIndex respectively.
For speed of indexing with their result values, find() and rfind() return a byteIndex.
Note
Support for grapheme units is not implemented at this time.
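The difference between codepoint and byte units shows up as soon as a string contains a non-ASCII character. The sketch below assumes the source file is saved as UTF-8 and that each "é" is the single precomposed codepoint U+00E9, which takes two bytes in UTF-8.

```chapel
// Assumption: "é" is the precomposed codepoint U+00E9 (2 bytes in UTF-8).
var s = "résumé";

writeln(s.size);       // 6: length in codepoints
writeln(s.numBytes);   // 8: length in bytes
writeln(s.find("s"));  // 3: find() reports a byte offset
writeln(s[2]);         // s: an int index counts codepoints
```

The letter "s" is the third codepoint (index 2) but starts at byte offset 3, because the preceding "é" occupies two bytes.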
Using the byteIndex and codepointIndex types
A value of type byteIndex or codepointIndex can be passed to certain string functions to indicate that the function should operate with units of bytes or codepoints. Passing a codepointIndex has the same behavior as passing an integral type. See this for an example.
Both of these types can be created from an int via assignment or cast. They also support addition and subtraction with int. Finally, values of the same type can be compared.
For example, the following function returns a string containing only the second byte of the argument:
proc getSecondByte(arg:string) {
  var offsetInBytes = 1:byteIndex;
  return arg[offsetInBytes];
}
Whereas the following function returns a string containing only the second codepoint of the argument:
proc getSecondCodepoint(arg:string) {
  var offsetInCodepoints = 1:codepointIndex;
  return arg[offsetInCodepoints];
}
Predefined Routines on Strings
The string type:
- type string
supports the following methods:
- proc type string.createBorrowingBuffer(x: string) : string
Warning
‘createBorrowingBuffer’ is unstable and may change in the future
Creates a new string which borrows the internal buffer of another string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
- Arguments:
x : string – Object to borrow the buffer from
- Returns:
A new string
- proc type string.createBorrowingBuffer(x: c_ptr(?t), length = strLen(x)) : string throws
Warning
‘createBorrowingBuffer’ is unstable and may change in the future
Creates a new string which borrows the memory allocated for a c_ptr. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
- Arguments:
x : c_ptr(uint(8)) or c_ptr(int(8)) – The buffer to borrow from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createBorrowingBuffer(x: c_ptrConst(?t), length = strLen(x)) : string throws
Warning
‘createBorrowingBuffer’ is unstable and may change in the future
Creates a new string which borrows the memory allocated for a c_ptrConst. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
- Arguments:
x : c_ptrConst(uint(8)) or c_ptrConst(int(8)) – The buffer to borrow from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createBorrowingBuffer(x: c_ptr(?t), length: int, size: int) : string throws
Warning
‘createBorrowingBuffer’ is unstable and may change in the future
Creates a new string which borrows the memory allocated for a c_ptr. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
- Arguments:
x : c_ptr(uint(8)) or c_ptr(int(8)) – The buffer to borrow from
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size : int – Size of memory allocated for x in bytes
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createAdoptingBuffer(x: c_ptr(?t), length = strLen(x)) : string throws
Creates a new string which takes ownership of the memory allocated for a c_ptr. The buffer will be freed when the string is deinitialized.
- Arguments:
x : c_ptr(uint(8)) or c_ptr(int(8)) – The buffer to take ownership of
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createAdoptingBuffer(x: c_ptrConst(?t), length = strLen(x)) : string throws
Creates a new string which takes ownership of the memory allocated for a c_ptrConst. The buffer will be freed when the string is deinitialized.
- Arguments:
x : c_ptrConst(uint(8)) or c_ptrConst(int(8)) – The buffer to take ownership of
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createAdoptingBuffer(x: c_ptr(?t), length: int, size: int) : string throws
Creates a new string which takes ownership of the memory allocated for a c_ptr. The buffer will be freed when the string is deinitialized.
- Arguments:
x : c_ptr(uint(8)) or c_ptr(int(8)) – The buffer to take ownership of
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size : int – Size of memory allocated for x in bytes
- Throws:
A DecodeError if x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createCopyingBuffer(x: c_ptrConst(?t), length = strLen(x), policy = decodePolicy.strict) : string throws
Creates a new string by creating a copy of the memory allocated for a c_ptrConst.
- Arguments:
x : c_ptrConst(uint(8)) or c_ptrConst(int(8)) – The buffer to copy
length : int – Length of x in bytes, excluding the terminating null byte.
policy –
- decodePolicy.strict raises an error
- decodePolicy.replace replaces the malformed character with the UTF-8 replacement character
- decodePolicy.drop drops the data silently
- decodePolicy.escape escapes each illegal byte with private use codepoints
- Throws:
A DecodeError if decodePolicy.strict is passed to the policy argument and x contains non-UTF-8 characters.
- Returns:
A new string
- proc type string.createCopyingBuffer(x: c_ptr(?t), length = strLen(x), size = length + 1, policy = decodePolicy.strict) : string throws
Creates a new string by creating a copy of a buffer.
- Arguments:
x : c_ptr(uint(8)) or c_ptr(int(8)) – The buffer to copy
length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
size : int – Size of memory allocated for x in bytes. This argument is ignored by this function.
policy –
- decodePolicy.strict raises an error
- decodePolicy.replace replaces the malformed character with the UTF-8 replacement character
- decodePolicy.drop drops the data silently
- decodePolicy.escape escapes each illegal byte with private use codepoints
- Throws:
A DecodeError if decodePolicy.strict is passed to the policy argument and x contains non-UTF-8 characters.
- Returns:
A new string
- proc string.indices : range
- Returns:
The indices that can be used to index into the string (i.e., the range 0..<this.size)
- proc string.numCodepoints : int
- Returns:
The number of codepoints in the string, assuming the string is correctly-encoded UTF-8.
- proc string.localize() : string
Warning
string.localize() is unstable and may change in a future release
Gets a version of the string that is on the currently executing locale.
- Returns:
A shallow copy if the string is already on the current locale, otherwise a deep copy is performed.
- proc string.c_str() : c_ptrConst(c_char)
Warning
‘string.c_str()’ has moved to ‘CTypes’. Please ‘use CTypes’ to access ‘c_str’
Get a c_ptrConst(c_char) from a string. The returned c_ptrConst shares the buffer with the string.
Warning
This can only be called safely on a string whose home is the current locale. This property can be enforced by calling string.localize() before string.c_str(). If the string is remote, the program will halt.
For example:
var my_string = "Hello!";
on different_locale {
  printf("%s", my_string.localize().c_str());
}
- Returns:
A c_ptrConst(c_char) that points to the underlying buffer used by this string. The returned c_ptrConst(c_char) is only valid when used on the same locale as the string.
- proc string.encode(policy = encodePolicy.pass) : bytes
Returns a bytes from the given string. If the string contains some escaped non-UTF-8 bytes, the policy argument determines the action.
- Arguments:
policy – encodePolicy.pass directly copies the (potentially escaped) data, encodePolicy.unescape recovers the escaped bytes back.
- Returns:
bytes
- iter string.items() : string
Iterates over the string character by character.
For example:
var str = "abcd";
for c in str.items() {
  writeln(c);
}
Output:
a
b
c
d
- iter string.these() : string
Iterates over the string character by character, yielding 1-codepoint strings. (A synonym for string.items)
For example:
var str = "abcd";
for c in str {
  writeln(c);
}
Output:
a
b
c
d
- iter string.codepoints() : int(32)
Iterates over the string Unicode character by Unicode character.
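To contrast the iterators, the following sketch walks the same string once as 1-codepoint strings and once as integer codepoint values. It assumes the source file is UTF-8 and that "ñ" is the precomposed codepoint U+00F1.

```chapel
// Assumption: "ñ" is the precomposed codepoint U+00F1 (decimal 241).
var s = "añb";

// Default iteration yields 1-codepoint strings: a, ñ, b (one per line).
for ch in s do
  writeln(ch);

// codepoints() yields the integer values: 97, 241, 98 (one per line).
for cp in s.codepoints() do
  writeln(cp);
```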
- proc string.byte(i: int) : uint(8)
- Returns:
The value of the i th byte as an integer.
- proc string.codepoint(i: int) : int(32)
- Returns:
The value of the i th multibyte character as an integer.
- proc string.this(i: byteIndex) : string
Return the codepoint starting at the i th byte in the string.
- Returns:
A new string with the complete multibyte character starting at the specified byte index from 0..#string.numBytes
- proc string.this(i: codepointIndex) : string
Return the i th codepoint in the string. (A synonym for string.item)
- Returns:
A new string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
- proc string.this(i: int) : string
Return the i th codepoint in the string. (A synonym for string.item)
- Returns:
A new string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
- proc string.item(i: codepointIndex) : string
Return the i th codepoint in the string.
- Returns:
A new string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
- proc string.item(i: int) : string
Return the i th codepoint in the string.
- Returns:
A new string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
- proc string.this(r: range(?)) : string throws where r.idxType == byteIndex
Slice a string. Halts if r is non-empty and not completely inside the range 0..<string.size when compiled with --checks. --fast disables this check.
- Arguments:
r – range of the indices the new string should be made from
- Throws:
A CodepointSplitError if slicing results in splitting a multi-byte codepoint.
- Returns:
A new string that is a substring within 0..<string.size. If the length of r is zero, an empty string is returned.
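A hedged sketch of the two slicing flavors follows. It assumes that an int range counts codepoints (in line with the indexing rules described earlier) and that a range(byteIndex) can be written by casting both bounds. The string literal assumes "ñ" is the precomposed codepoint U+00F1.

```chapel
// Assumption: "ñ" is U+00F1, so the bytes are: a (1), ñ (2), b (1).
var s = "añb";

// An int range counts codepoints, so this selects the first two codepoints.
writeln(s[0..1]);

// A byteIndex range counts bytes. Bytes 0..1 end in the middle of "ñ",
// so per the description above this slice is expected to throw.
try {
  writeln(s[0:byteIndex..1:byteIndex]);
} catch e {
  writeln("slice failed: ", e.message());
}
```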
- proc string.startsWith(patterns: string ...) : bool
- Arguments:
patterns – A varargs list of strings to match against.
- Returns:
true – when the string begins with one or more of the patterns
false – otherwise
- proc string.endsWith(patterns: string ...) : bool
- Arguments:
patterns – A varargs list of strings to match against.
- Returns:
true – when the string ends with one or more of the patterns
false – otherwise
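Since both methods accept several patterns at once, a single call can test a string against a set of prefixes or suffixes. The variable name fileName below is purely illustrative.

```chapel
var fileName = "report.csv";

writeln(fileName.endsWith(".csv", ".tsv"));    // true: ends with ".csv"
writeln(fileName.startsWith("draft", "tmp"));  // false: neither prefix matches
```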
- proc string.find(pattern: string, indices: range(?) = this.byteIndices: range(byteIndex)) : byteIndex
- Arguments:
pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size
- Returns:
the index of the first occurrence of pattern within a string, or -1 if the pattern is not in the string.
- proc string.rfind(pattern: string, indices: range(?) = this.byteIndices: range(byteIndex)) : byteIndex
- Arguments:
pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size
- Returns:
the index of the first occurrence from the right of pattern within a string, or -1 if the pattern is not in the string.
- proc string.count(pattern: string, indices: range(?) = this.indices) : int
- Arguments:
pattern – the string to search for
indices – an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 0..<string.size
- Returns:
the number of times pattern occurs in the string
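The three search routines can be compared on one ASCII string, where byte offsets and codepoint offsets coincide. This is a small illustrative sketch rather than an exhaustive test.

```chapel
var s = "banana";

writeln(s.find("an"));   // 1: first occurrence starts at offset 1
writeln(s.rfind("an"));  // 3: first occurrence from the right starts at offset 3
writeln(s.count("an"));  // 2: "an" occurs twice
writeln(s.find("x"));    // -1: the pattern is not in the string
```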
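- proc string.replace(pattern: string, replacement: string, count: int = -1) : string
- Arguments:
pattern – the string to search for
replacement – the string to replace pattern with
count – an optional limit on the number of replacements to make; negative values replace all matches
- Returns:
a copy of the string where replacement replaces pattern up to count times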
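As a brief illustration of the count argument, the sketch below replaces every occurrence first and then only the first one.

```chapel
var s = "a-b-c";

writeln(s.replace("-", "+"));     // a+b+c: the default count of -1 replaces all matches
writeln(s.replace("-", "+", 1));  // a+b-c: at most one replacement
```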
- iter string.split(sep: string, maxsplit: int = -1, ignoreEmpty: bool = false) : string
Splits the string on sep yielding the substring between each occurrence, up to maxsplit times.
- Arguments:
sep – The delimiter used to break the string into chunks.
maxsplit – The number of times to split the string, negative values indicate no limit.
ignoreEmpty –
- When true – Empty strings will not be yielded, and will not count towards maxsplit
- When false – Empty strings will be yielded when sep occurs multiple times in a row.
- iter string.split(maxsplit: int = -1) : string
Works as above, but uses runs of whitespace as the delimiter.
- Arguments:
maxsplit – The number of times to split the string, negative values indicate no limit.
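The effect of the split arguments is easiest to see side by side. In the sketch below each yielded piece is printed in brackets so that empty strings stay visible; the inputs are arbitrary examples.

```chapel
// Default: empty strings between adjacent separators are yielded.
for part in "a,b,,c".split(",") do
  writeln("[", part, "]");   // [a] [b] [] [c], one per line

// ignoreEmpty=true skips the empty piece.
for part in "a,b,,c".split(",", ignoreEmpty=true) do
  writeln("[", part, "]");   // [a] [b] [c], one per line

// maxsplit=1 splits only once; the remainder is yielded whole.
for part in "a,b,c".split(",", maxsplit=1) do
  writeln("[", part, "]");   // [a] [b,c], one per line

// Without sep, runs of whitespace act as the delimiter.
for word in "the quick  fox".split() do
  writeln("[", word, "]");   // [the] [quick] [fox], one per line
```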
- proc string.join(const ref x: string ...) : string
Returns a new string, which is the concatenation of all of the strings passed in with the contents of the method receiver inserted between them.
var myString = "|".join("a","10","d");
writeln(myString); // prints: "a|10|d"
- proc string.join(const ref x) : string
Returns a new string, which is the concatenation of all of the strings passed in with the contents of the method receiver inserted between them.
var tup = ("a","10","d");
var myJoinedTuple = "|".join(tup);
writeln(myJoinedTuple); // prints: "a|10|d"
var myJoinedArray = "|".join(["a","10","d"]);
writeln(myJoinedArray); // prints: "a|10|d"
- proc string.strip(chars: string = " \t\r\n", leading = true, trailing = true) : string
- Arguments:
chars – A string containing each character to remove. Defaults to " \t\r\n".
leading – Indicates if leading occurrences should be removed. Defaults to true.
trailing – Indicates if trailing occurrences should be removed. Defaults to true.
- Returns:
A new string with leading and/or trailing occurrences of characters in chars removed as appropriate.
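A short sketch of how the three arguments interact; the brackets in the first line only make the stripped whitespace visible.

```chapel
writeln("[", "  hi \n".strip(), "]");          // [hi]: default chars are whitespace
writeln("xxhixx".strip("x"));                  // hi: both ends are stripped
writeln("xxhixx".strip("x", trailing=false));  // hixx: only leading "x" is removed
```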
- proc string.partition(sep: string) : 3*(string)
Splits the string on sep into a 3*string consisting of the section before sep, sep, and the section after sep. If sep is not found, the tuple will contain the whole string, and then two empty strings.
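Because the result is a 3-tuple, it can be destructured directly. The variable names in this sketch are illustrative only.

```chapel
// Separator found: the tuple holds the text before, the separator, and the text after.
var (key, sep, value) = "key=value".partition("=");
writeln(key);    // key
writeln(sep);    // =
writeln(value);  // value

// Separator not found: the whole string comes first, followed by two empty strings.
var (head, mid, tail) = "novalue".partition("=");
writeln(head);                          // novalue
writeln(mid.size == 0, " ", tail.size == 0);  // true true
```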
- proc string.dedent(columns = 0, ignoreFirst = true) : string
Warning
string.dedent is subject to change in the future.
Remove indentation from each line of a string.
This can be useful when applied to multi-line strings that are indented in the source code, but should not be indented in the output.
When columns == 0, determine the level of indentation to remove from all lines by finding the common leading whitespace across all non-empty lines. Empty lines are lines containing only whitespace. Tabs and spaces are the only whitespaces that are considered, but are not treated as the same characters when determining common whitespace.
When columns > 0, remove columns leading whitespace characters from each line. Tabs are not considered whitespace when columns > 0, so only leading spaces are removed.
- Arguments:
columns – The number of columns of indentation to remove. Infer common leading whitespace if columns == 0.
ignoreFirst – When true, ignore the first line when determining the common leading whitespace, and make no changes to the first line.
- Returns:
A new string with indentation removed.
- proc string.isUpper() : bool
Checks if all the characters in the string are either uppercase (A-Z) or uncased (not a letter).
- Returns:
true – if the string contains at least one uppercase character and no lowercase characters, ignoring uncased characters.
false – otherwise
- proc string.isLower() : bool
Checks if all the characters in the string are either lowercase (a-z) or uncased (not a letter).
- Returns:
true – when there are no uppercase characters in the string.
false – otherwise
- proc string.isSpace() : bool
Checks if all the characters in the string are whitespace (' ', '\t', '\n', '\v', '\f', '\r').
- Returns:
true – when all the characters are whitespace.
false – otherwise
- proc string.isAlpha() : bool
Checks if all the characters in the string are alphabetic (a-zA-Z).
- Returns:
true – when the characters are alphabetic.
false – otherwise
- proc string.isDigit() : bool
Checks if all the characters in the string are digits (0-9).
- Returns:
true – when the characters are digits.
false – otherwise
- proc string.isAlnum() : bool
Checks if all the characters in the string are alphanumeric (a-zA-Z0-9).
- Returns:
true – when the characters are alphanumeric.
false – otherwise
- proc string.isPrintable() : bool
Checks if all the characters in the string are printable.
- Returns:
true – when the characters are printable.
false – otherwise
- proc string.isTitle() : bool
Checks if all uppercase characters are preceded by uncased characters, and if all lowercase characters are preceded by cased characters.
- Returns:
true – when the condition described above is met.
false – otherwise
- proc string.toLower() : string
- Returns:
A new string with all uppercase characters replaced with their lowercase counterpart.
Note
The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in Unicode mapping.
- proc string.toUpper() : string
- Returns:
A new string with all lowercase characters replaced with their uppercase counterpart.
Note
The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in Unicode mapping.
- proc string.toTitle() : string
- Returns:
A new string with all cased characters following an uncased character converted to uppercase, and all cased characters following another cased character converted to lowercase.
Note
The case change operation is not currently performed on characters whose cases take a different number of bytes to represent in Unicode mapping.
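The case conversions and the matching predicates can be tried together on a plain ASCII string, which avoids the multi-byte limitation noted above. This is only an illustrative sketch.

```chapel
var s = "hello World";

writeln(s.toUpper());            // HELLO WORLD
writeln(s.toLower());            // hello world
writeln(s.toTitle());            // Hello World
writeln(s.toTitle().isTitle());  // true: the title-cased result satisfies isTitle
writeln("ABC 123".isUpper());    // true: uppercase letters plus uncased characters
```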
- operator string.+(s0: string, s1: string) : string
- Returns:
A new string which is the result of concatenating s0 and s1
- operator *(s: string, n: integral) : string
- Returns:
A new string which is the result of repeating s n times. If n is less than or equal to 0, an empty string is returned.
The operation is commutative. For example:
writeln("Hello! " * 3);
or
writeln(3 * "Hello! ");
Results in:
Hello! Hello! Hello!
- operator string.+=(ref lhs: string, const ref rhs: string) : void
Appends the string rhs to the string lhs.
- proc ref string.appendCodepointValues(codepoints: int ...) : void
Warning
‘string.appendCodepointValues’ is unstable and may change in the future
Appends the codepoint values passed to the string this.
Any argument not in 0..0x10FFFF is not a valid Unicode codepoint. This function will append the replacement character 0xFFFD instead of such invalid arguments.