Strings¶
The following documentation shows functions and methods used to manipulate and process Chapel strings.
Methods Available in Other Modules¶
Besides the functions below, some other modules provide routines that are
useful for working with strings. The IO
module provides
IO.string.format which creates a string that is the result of
formatting. It also includes functions for reading and writing strings.
The Regexp
module also provides some routines for searching
within strings.
Casts from String to a Numeric Type¶
This module supports casts from string
to numeric types. Such casts
will convert the string to the numeric type and throw an error if the string
is invalid. For example:
var number = "a":int;
throws an error when it is executed, but
var number = "1":int;
stores the value 1
in number
.
To learn more about handling these errors, see the Error Handling technical note.
Unicode Support¶
Chapel strings use the UTF-8 encoding. Note that ASCII strings are a simple subset of UTF-8 strings, because every ASCII character is a UTF-8 character with the same meaning.
UTF-8 strings might not work properly if a UTF-8 environment is not used. See character set environment for more information.
Non-Unicode Data and Chapel Strings¶
For doing string operations on non-Unicode or arbitrary data, consider using
bytes
instead of string. However, there may be cases where
string
must be used with non-Unicode data. Examples of this are file
system and path operations on systems where UTF-8 file names are not enforced.
In such scenarios, non-UTF-8 data can be escaped and stored in a string in a way that it can be restored when needed. For example:
var myBytes = b"Illegal \xff sequence"; // \xff is non UTF-8
var myEscapedString = myBytes.decode(policy=decodePolicy.escape);
will escape the illegal 0xFF byte and store it in the string. The escaping strategy is similar to Python’s “surrogate escapes” and is as follows.
- Each individual byte in an illegal sequence is bitwise-or’ed with 0xDC00 to create a 2-byte codepoint.
- Then, this codepoint is encoded in UTF-8 and stored in the string buffer.
This strategy typically results in storing 3 bytes for each byte in the illegal
sequence. Similarly escaped strings can also be created with
createStringWithNewBuffer
using a C buffer.
An escaped data sequence can be reconstructed with encode
:
var reconstructedBytes = myEscapedString.encode(policy=encodePolicy.unescape);
writeln(myBytes == reconstructedBytes); // prints true
Alternatively, escaped sequence can be used as-is without reconstructing the bytes:
var escapedBytes = myEscapedString.encode(policy=encodePolicy.pass);
writeln(myBytes == escapedBytes); // prints false
Note
Strings that contain escaped sequences cannot be directly used with
unformatted I/O functions such as writeln
. Formatted I/O can be used to print such strings with binary
formatters such as %|s
.
Note
The standard FileSystem
, Path
and IO
modules can use
strings described above for paths and file names.
Lengths and Offsets in Unicode Strings¶
For Unicode strings, and in particular UTF-8 strings, there are several possible units for offsets or lengths:
- bytes
- codepoints
- graphemes
Most methods on the Chapel string type currently work with codepoint units by
default. For example, size
returns the length in codepoints
and int values passed into this
are offsets in codepoint
units.
It is possible to indicate byte or codepoint units for indexing in the
string methods by using arguments of type byteIndex
or
codepointIndex
respectively.
For speed of indexing with their result values, find()
and rfind()
return a byteIndex
.
Note
Support for grapheme units is not implemented at this time.
-
record
byteIndex
¶ A value of type
byteIndex
can be passed to certain string functions to indicate that the function should operate with units of bytes. Seethis
.An int can be added to a
byteIndex
, producing anotherbyteIndex
. OnebyteIndex
can be subtracted from another, producing an int distance between them. AbyteIndex
can also be compared with anotherbyteIndex
or with an int .To create or modify a
byteIndex
, cast or assign it from an int. For example, the following function returns a string containing only the second byte of the argument:proc getSecondByte(arg:string) : int { var offsetInBytes = 2:byteIndex; return arg[offsetInBytes]; }
-
proc
init
(i: int)¶
-
proc
init=
(other: byteIndex)¶
-
proc
init=
(i: int)
-
proc
writeThis
(f) throws¶
-
proc
-
record
codepointIndex
¶ A value of type
codepointIndex
can be passed to certain string functions to indicate that the function should operate with units of codepoints. Seethis
.An int can be added to a
codepointIndex
, producing anothercodepointIndex
. OnecodepointIndex
can be subtracted from another, producing an int distance between them. AcodepointIndex
can also be compared with anothercodepointIndex
or with an int .To create or modify a
codepointIndex
, cast or assign it from an int. For example, the following function returns a string containing only the second codepoint of the argument:proc getSecondCodepoint(arg:string) : int { var offsetInCodepoints = 2:codepointIndex; return arg[offsetInCodepoints]; }
-
proc
init
(i: int)¶
-
proc
init=
(i: int)¶
-
proc
init=
(cpi: codepointIndex)
-
proc
writeThis
(f) throws¶
-
proc
-
proc
createStringWithBorrowedBuffer
(x: string)¶ Creates a new string which borrows the internal buffer of another string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
Arguments: x : string – Object to borrow the buffer from Returns: A new string
-
proc
createStringWithBorrowedBuffer
(x: c_string, length = x.size) throws Creates a new string which borrows the internal buffer of a c_string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
Arguments: - x : c_string – Object to borrow the buffer from
- length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
Throws: DecodeError if x contains non-UTF-8 characters.
Returns: A new string
-
proc
createStringWithBorrowedBuffer
(x: bufferType, length: int, size: int) throws Creates a new string which borrows the memory allocated for a c_ptr(uint(8)). If the buffer is freed before the string returned from this function, accessing it is undefined behavior.
Arguments: - x : bufferType (i.e. c_ptr(uint(8))) – Object to borrow the buffer from
- length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- size – Size of memory allocated for x in bytes
Throws: DecodeError if x contains non-UTF-8 characters.
Returns: A new string
-
proc
createStringWithOwnedBuffer
(x: c_string, length = x.size) throws¶ Creates a new string which takes ownership of the internal buffer of a c_string. The buffer will be freed when the string is deinitialized.
Arguments: - x : c_string – Object to take ownership of the buffer from
- length – Length of the string stored in x in bytes, excluding the terminating null byte.
Returns: A new string
-
proc
createStringWithOwnedBuffer
(x: bufferType, length: int, size: int) throws Creates a new string which takes ownership of the memory allocated for a c_ptr(uint(8)). The buffer will be freed when the string is deinitialized.
Arguments: - x : bufferType (i.e. c_ptr(uint(8))) – Object to take ownership of the buffer from
- length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- size – Size of memory allocated for x in bytes
Throws: DecodeError if x contains non-UTF-8 characters.
Returns: A new string
-
proc
createStringWithNewBuffer
(x: string)¶ Creates a new string by creating a copy of the buffer of another string.
Arguments: x : string – Object to copy the buffer from Returns: A new string
-
proc
createStringWithNewBuffer
(x: c_string, length = x.size, policy = decodePolicy.strict) throws Creates a new string by creating a copy of the buffer of a c_string.
Arguments: - x : c_string – Object to copy the buffer from
- length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- policy –
- decodePolicy.strict raises an error
- decodePolicy.replace replaces the malformed character with UTF-8 replacement character
- decodePolicy.drop drops the data silently
- decodePolicy.escape escapes each illegal byte with private use codepoints
Throws: DecodeError if decodePolicy.strict is passed to the policy argument and x contains non-UTF-8 characters.
Returns: A new string
-
proc
createStringWithNewBuffer
(x: bufferType, length: int, size = length+1, policy = decodePolicy.strict) throws Creates a new string by creating a copy of a buffer.
Arguments: - x : bufferType (i.e. c_ptr(uint(8))) – The buffer to copy
- length : int – Length of the string stored in x in bytes, excluding the terminating null byte.
- size – Size of memory allocated for x in bytes. This argument is ignored by this function.
Throws: DecodeError if x contains non-UTF-8 characters.
Returns: A new string
-
record
string
¶ -
proc
init=
(s: string)¶
-
proc
init=
(cs: c_string)
-
proc
length
¶ Deprecated - please use
string.size
.
-
proc
size
¶ Returns: The number of codepoints in the string.
-
proc
indices
¶ Returns: The indices that can be used to index into the string (i.e., the range 0..<this.size
)
-
proc
numBytes
¶ Returns: The number of bytes in the string.
-
proc
numCodepoints
¶ Returns: The number of codepoints in the string, assuming the string is correctly-encoded UTF-8.
-
proc
localize
(): string¶ Gets a version of the
string
that is on the currently executing locale.Returns: A shallow copy if the string
is already on the current locale, otherwise a deep copy is performed.
-
proc
c_str
(): c_string¶ Get a c_string from a
string
.Warning
This can only be called safely on a
string
whose home is the current locale. This property can be enforced by callingstring.localize()
beforec_str()
. If the string is remote, the program will halt.For example:
var my_string = "Hello!"; on different_locale { printf("%s", my_string.localize().c_str()); }
Returns: A c_string that points to the underlying buffer used by this string
. The returned c_string is only valid when used on the same locale as the string.
-
proc
encode
(policy = encodePolicy.pass): bytes¶ Returns a
bytes
from the givenstring
. If the string contains some escaped non-UTF8 bytes, policy argument determines the action.Arguments: policy – encodePolicy.pass directly copies the (potentially escaped) data, encodePolicy.unescape recovers the escaped bytes back. Returns: bytes
-
iter
items
(): string¶ Iterates over the string character by character.
For example:
var str = "abcd"; for c in str { writeln(c); }
Output:
a b c d
-
iter
these
(): string¶ Iterates over the string character by character, yielding 1-codepoint strings. (A synonym for
items
)For example:
var str = "abcd"; for c in str { writeln(c); }
Output:
a b c d
-
iter
chpl_bytes
(): byteType¶ Iterates over the string byte by byte.
-
iter
codepoints
(): int(32)¶ Iterates over the string Unicode character by Unicode character.
-
proc
toByte
(): uint(8)¶ Returns: The value of a single-byte string as an integer.
-
proc
byte
(i: int): uint(8)¶ Returns: The value of the i th byte as an integer.
-
proc
toCodepoint
(): int(32)¶ Returns: The value of a single-codepoint string as an integer.
-
proc
codepoint
(i: int): int(32)¶ Returns: The value of the i th multibyte character as an integer.
-
proc
this
(i: byteIndex): string¶ Return the codepoint starting at the i th byte in the string
Returns: A string with the complete multibyte character starting at the specified byte index from 0..#string.numBytes
-
proc
this
(i: codepointIndex): string Return the i th codepoint in the string. (A synonym for
item
)Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
-
proc
this
(i: int): string Return the i th codepoint in the string. (A synonym for
item
)Returns: A string with the complete multibyte character starting at the specified codepoint index from 1..string.numCodepoints
-
proc
item
(i: codepointIndex): string¶ Return the i th codepoint in the string
Returns: A string with the complete multibyte character starting at the specified codepoint index from 1..string.numCodepoints
-
proc
item
(i: int): string Return the i th codepoint in the string
Returns: A string with the complete multibyte character starting at the specified codepoint index from 0..#string.numCodepoints
-
proc
this
(r: range(?)): string Slice a string. Halts if r is non-empty and not completely inside the range
0..<string.size
when compiled with –checks. –fast disables this check.Arguments: r – range of the indices the new string should be made from Returns: a new string that is a substring within 0..<string.size
. If the length of r is zero, an empty string is returned.
-
proc
isEmpty
(): bool¶ Returns: - true – when the string is empty
- false – otherwise
-
proc
startsWith
(needles: string ...): bool¶ Arguments: needles – A varargs list of strings to match against. Returns: - true – when the string begins with one or more of the needles
- false – otherwise
-
proc
endsWith
(needles: string ...): bool¶ Arguments: needles – A varargs list of strings to match against. Returns: - true – when the string ends with one or more of the needles
- false – otherwise
-
proc
find
(needle: string, region: range(?) = 0: byteIndex..): byteIndex¶ Arguments: - needle – the string to search for
- region – an optional range defining the substring to search within,
default is the whole string. Halts if the range is not
within
0..<string.size
Returns: the index of the first occurrence of needle within a string, or -1 if the needle is not in the string.
-
proc
rfind
(needle: string, region: range(?) = 0: byteIndex..): byteIndex¶ Arguments: - needle – the string to search for
- region – an optional range defining the substring to search within,
default is the whole string. Halts if the range is not
within
0..<string.size
Returns: the index of the first occurrence from the right of needle within a string, or -1 if the needle is not in the string.
-
proc
count
(needle: string, region: range(?) = 0..): int¶ Arguments: - needle – the string to search for
- region – an optional range defining the substring to search within,
default is the whole string. Halts if the range is not
within
0..<string.size
Returns: the number of times needle occurs in the string
-
proc
replace
(needle: string, replacement: string, count: int = -1): string¶ Arguments: - needle – the string to search for
- replacement – the string to replace needle with
- count – an optional integer specifying the number of replacements to make, values less than zero will replace all occurrences
Returns: a copy of the string where replacement replaces needle up to count times
-
iter
split
(sep: string, maxsplit: int = -1, ignoreEmpty: bool = false)¶ Splits the string on sep yielding the substring between each occurrence, up to maxsplit times.
Arguments: - sep – The delimiter used to break the string into chunks.
- maxsplit – The number of times to split the string, negative values indicate no limit.
- ignoreEmpty –
- When true – Empty strings will not be yielded,
- and will not count towards maxsplit
- When false – Empty strings will be yielded when
- sep occurs multiple times in a row.
-
iter
split
(maxsplit: int = -1) Works as above, but uses runs of whitespace as the delimiter.
Arguments: maxsplit – The number of times to split the string, negative values indicate no limit.
-
proc
join
(const ref x: string ...): string¶ Returns a new string, which is the concatenation of all of the strings passed in with the receiving string inserted between them.
var x = "|".join("a","10","d"); writeln(x); // prints: "a|10|d"
-
proc
join
(const ref x): string Same as the varargs version, but with a homogeneous tuple of strings.
var x = "|".join("a","10","d"); writeln(x); // prints: "a|10|d"
-
proc
join
(const ref S: [] string): string Same as the varargs version, but with all the strings in an array.
var x = "|".join(["a","10","d"]); writeln(x); // prints: "a|10|d"
-
proc
strip
(chars: string = " trn", leading = true, trailing = true): string¶ Arguments: - chars – A string containing each character to remove. Defaults to ” \t\r\n”.
- leading – Indicates if leading occurrences should be removed. Defaults to true.
- trailing – Indicates if trailing occurrences should be removed. Defaults to true.
Returns: A new string with leading and/or trailing occurrences of characters in chars removed as appropriate.
-
proc
partition
(sep: string): 3*(string)¶ Splits the string on sep into a 3*string consisting of the section before sep, sep, and the section after sep. If sep is not found, the tuple will contain the whole string, and then two empty strings.
-
proc
isUpper
(): bool¶ Checks if all the characters in the string are either uppercase (A-Z) or uncased (not a letter).
returns: - true – if the string contains at least one uppercase
- character and no lowercase characters, ignoring uncased characters.
- false – otherwise
-
proc
isLower
(): bool¶ Checks if all the characters in the string are either lowercase (a-z) or uncased (not a letter).
returns: - true – when there are no uppercase characters in the string.
- false – otherwise
-
proc
isSpace
(): bool¶ Checks if all the characters in the string are whitespace (‘ ‘, ‘t’, ‘n’, ‘v’, ‘f’, ‘r’).
returns: - true – when all the characters are whitespace.
- false – otherwise
-
proc
isAlpha
(): bool¶ Checks if all the characters in the string are alphabetic (a-zA-Z).
returns: - true – when the characters are alphabetic.
- false – otherwise
-
proc
isDigit
(): bool¶ Checks if all the characters in the string are digits (0-9).
returns: - true – when the characters are digits.
- false – otherwise
-
proc
isAlnum
(): bool¶ Checks if all the characters in the string are alphanumeric (a-zA-Z0-9).
returns: - true – when the characters are alphanumeric.
- false – otherwise
-
proc
isPrintable
(): bool¶ Checks if all the characters in the string are printable.
returns: - true – when the characters are printable.
- false – otherwise
-
proc
isTitle
(): bool¶ Checks if all uppercase characters are preceded by uncased characters, and if all lowercase characters are preceded by cased characters.
Returns: - true – when the condition described above is met.
- false – otherwise
-
proc
toLower
(): string¶ Returns: A new string with all uppercase characters replaced with their lowercase counterpart. Note
The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping.
-
proc
toUpper
(): string¶ Returns: A new string with all lowercase characters replaced with their uppercase counterpart. Note
The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping.
-
proc
toTitle
(): string¶ Returns: A new string with all cased characters following an uncased character converted to uppercase, and all cased characters following another cased character converted to lowercase. Note
The case change operation is not currently performed on characters whose cases take different number of bytes to represent in Unicode mapping.
-
proc
-
proc
=
(ref lhs: byteIndex, rhs: int)¶ Copies the int rhs into the byteIndex lhs.
-
proc
=
(ref lhs: codepointIndex, rhs: int) Copies the int rhs into the codepointIndex lhs.
-
proc
=
(ref lhs: string, rhs: string) Copies the string rhs into the string lhs.
-
proc
=
(ref lhs: string, rhs_c: c_string) Copies the c_string rhs_c into the string lhs.
Halts if lhs is a remote string.
-
proc
+
(s0: string, s1: string)¶ Returns: A new string which is the result of concatenating s0 and s1
-
proc
*
(s: string, n: integral)¶ Returns: A new string which is the result of repeating s n times. If n is less than or equal to 0, an empty string is returned. For example:
writeln("Hello! " * 3);
Results in:
Hello! Hello! Hello!
-
proc
+=
(ref lhs: string, const ref rhs: string): void¶ Appends the string rhs to the string lhs.
-
proc
codepointToString
(i: int(32))¶ Returns: A string storing the complete multibyte character sequence that corresponds to the codepoint value i.