Strings

The following documentation shows functions and methods used to manipulate and process Chapel strings.

Methods Available in Other Modules

Besides the functions below, some other modules provide routines that are useful for working with strings. The IO module provides IO.string.format which creates a string that is the result of formatting. It also includes functions for reading and writing strings. The Regexp module also provides some routines for searching within strings.

Casts from String to a Numeric Type

This module supports casts from string to numeric types. Such casts will convert the string to the numeric type and throw an error if the string is invalid. For example:

var number = "a":int;

throws an error when it is executed, but

var number = "1":int;

stores the value 1 in number.

To learn more about handling these errors, see the Error Handling technical note.
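
As a minimal sketch of handling a failed cast, the hypothetical helper below (parseOrFallback is not part of this module) wraps the cast in a try/catch block and returns a caller-supplied fallback when the string is not a valid integer:

```chapel
// Hypothetical helper, not part of the String module: convert s to an
// int, returning fallback if the cast throws because s is invalid.
proc parseOrFallback(s: string, fallback: int): int {
  var result = fallback;
  try {
    result = s:int;   // throws if s does not hold a valid integer
  } catch {
    // The cast failed; keep the fallback value.
  }
  return result;
}

writeln(parseOrFallback("1", 0));   // prints: 1
writeln(parseOrFallback("a", -1));  // prints: -1
```

A catch-all clause is used here for brevity; code that needs to distinguish failure causes would catch a specific error type instead.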

Activating Unicode Support

Chapel strings normally use the UTF-8 encoding. Note that ASCII strings are a simple subset of UTF-8 strings, because every ASCII character is a UTF-8 character with the same meaning.

Certain environment variables may need to be set in order to enable UTF-8 support. Setting these environment variables may not be necessary at all on some systems.

Note

For example, Chapel is currently tested with the following settings:

export LANG=en_US.UTF-8

export LC_COLLATE=C

unset LC_ALL

LANG sets the default character set. LC_COLLATE overrides it for sorting, so that we get consistent results. Anything in LC_ALL would override everything, so we unset it.

Note

Chapel currently relies upon C multibyte character support and may work with other settings of these variables that request non-Unicode multibyte character sets. However, such a configuration is not regularly tested and may not work in the future.

Lengths and Offsets in Unicode Strings

For Unicode strings, and in particular UTF-8 strings, there are several possible units for offsets or lengths:

  • bytes
  • codepoints
  • graphemes

Most methods on the Chapel string type currently work in codepoint units by default. For example, length returns the length in codepoints, and int values passed to the indexing method this are offsets in codepoint units.

It is possible to indicate byte or codepoint units for indexing in the string methods by using arguments of type byteIndex or codepointIndex respectively.

find() and rfind() return a byteIndex, so that their results can be used to index back into the string quickly.

Note

Support for grapheme units is not implemented at this time.
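
The difference between the units can be seen with a short string that contains multibyte characters. This is a sketch that assumes a UTF-8 encoded source file and the 1-based indexing described in this document; the variable names are illustrative only:

```chapel
// "é" occupies 2 bytes in UTF-8, "t" occupies 1 byte.
var s = "été";

writeln(s.numCodepoints);        // prints: 3
writeln(s.numBytes);             // prints: 5

// An int index counts codepoints by default.
writeln(s[2]);                   // prints: t
writeln(s[2:codepointIndex]);    // prints: t

// A byteIndex counts bytes: "t" starts at byte 3, after the 2-byte "é".
writeln(s[3:byteIndex]);         // prints: t

// find() returns a byteIndex, which can be used to index directly.
var pos = s.find("t");
writeln(s[pos]);                 // prints: t
```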

record byteIndex

A value of type byteIndex can be passed to certain string functions to indicate that the function should operate with units of bytes. See this.

An int can be added to a byteIndex, producing another byteIndex. One byteIndex can be subtracted from another, producing an int distance between them. A byteIndex can also be compared with another byteIndex or with an int.

To create or modify a byteIndex, cast or assign it from an int. For example, the following function returns a string containing only the second byte of the argument:

proc getSecondByte(arg:string) : string {
  var offsetInBytes = 2:byteIndex;
  return arg[offsetInBytes];
}
proc init(i: int)
proc init=(i: int)
proc writeThis(f)
record codepointIndex

A value of type codepointIndex can be passed to certain string functions to indicate that the function should operate with units of codepoints. See this.

An int can be added to a codepointIndex, producing another codepointIndex. One codepointIndex can be subtracted from another, producing an int distance between them. A codepointIndex can also be compared with another codepointIndex or with an int.

To create or modify a codepointIndex, cast or assign it from an int. For example, the following function returns a string containing only the second codepoint of the argument:

proc getSecondCodepoint(arg:string) : string {
  var offsetInCodepoints = 2:codepointIndex;
  return arg[offsetInCodepoints];
}
proc init(i: int)
proc init=(i: int)
proc writeThis(f)
proc createStringWithBorrowedBuffer(s: string)

Creates a new string which borrows the internal buffer of another string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments: s : string -- Object to borrow the buffer from
Returns: A new string
proc createStringWithBorrowedBuffer(s: c_string, length = s.length)

Creates a new string which borrows the internal buffer of a c_string. If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments:
  • s : c_string -- Object to borrow the buffer from
  • length : int -- Length of the c_string in bytes, excluding the terminating null byte.
Returns:

A new string

proc createStringWithBorrowedBuffer(s: bufferType, length: int, size: int)

Creates a new string which borrows the memory allocated for a c_ptr(uint(8)). If the buffer is freed before the string returned from this function, accessing it is undefined behavior.

Arguments:
  • s : bufferType (i.e. c_ptr(uint(8))) -- Object to borrow the buffer from
  • length : int -- Length of the string stored in s, excluding the terminating null byte.
  • size : int -- Size of memory allocated for s in bytes
Returns:

A new string

proc createStringWithOwnedBuffer(s: c_string, length = s.length)

Creates a new string which takes ownership of the internal buffer of a c_string. The buffer will be freed when the string is deinitialized.

Arguments:
  • s : c_string -- Object to take ownership of the buffer from
  • length : int -- Length of the string stored in s, excluding the terminating null byte.
Returns:

A new string

proc createStringWithOwnedBuffer(s: bufferType, length: int, size: int)

Creates a new string which takes ownership of the memory allocated for a c_ptr(uint(8)). The buffer will be freed when the string is deinitialized.

Arguments:
  • s : bufferType (i.e. c_ptr(uint(8))) -- Object to take ownership of the buffer from
  • length : int -- Length of the string stored in s, excluding the terminating null byte.
  • size : int -- Size of memory allocated for s in bytes
Returns:

A new string

proc createStringWithNewBuffer(s: string)

Creates a new string by creating a copy of the buffer of another string.

Arguments: s : string -- Object to copy the buffer from
Returns: A new string
proc createStringWithNewBuffer(s: c_string, length = s.length)

Creates a new string by creating a copy of the buffer of a c_string.

Arguments:
  • s : c_string -- Object to copy the buffer from
  • length : int -- Length of the c_string in bytes, excluding the terminating null byte.
Returns:

A new string

proc createStringWithNewBuffer(s: bufferType, length: int, size: int)

Creates a new string by creating a copy of a buffer.

Arguments:
  • s : bufferType (i.e. c_ptr(uint(8))) -- The buffer to copy
  • length : int -- Length of the string stored in s, excluding the terminating null byte.
  • size : int -- Size of memory allocated for s in bytes
Returns:

A new string
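
The following sketch contrasts the copying and borrowing variants, using only the string overloads listed above. The variable names are illustrative, and the comments restate the lifetime rule from the descriptions rather than anything enforced by the compiler:

```chapel
var original = "hello";

// Independent copy: owns its own buffer, so it stays valid
// regardless of what later happens to original.
var copied = createStringWithNewBuffer(original);

// Borrowed view: shares the buffer of original, so it must not be
// used after that buffer has been freed.
var borrowed = createStringWithBorrowedBuffer(original);

writeln(copied);    // prints: hello
writeln(borrowed);  // prints: hello
```

Borrowing avoids a copy, at the cost of making the caller responsible for the buffer's lifetime; the copying variant is the safer default when in doubt.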

record string
proc init(s: string, isowned: bool = true)

Initialize a new string from s. If isowned is set to true then s will be fully copied into the new instance. If it is false a shallow copy will be made such that any in-place modifications to the new string may appear in s. It is the responsibility of the user to ensure that the underlying buffer is not freed while being used as part of a shallow copy.

Warning

String initializers are deprecated. Use createString* functions, instead.

proc init=(s: string)
proc init(cs: c_string, length: int = cs.length, isowned: bool = true, needToCopy: bool = true)

Initialize a new string from the c_string cs. If isowned is set to true, the backing buffer will be freed when the new record is destroyed. If needToCopy is set to true, the c_string will be copied into the record, otherwise it will be used directly. It is the responsibility of the user to ensure that the underlying buffer is not freed if the c_string is not copied in.

Warning

String initializers are deprecated. Use createString* functions, instead.

proc init=(cs: c_string)
proc init(buff: bufferType, length: int, size: int, isowned: bool = true, needToCopy: bool = true)

Initialize a new string from buff ( c_ptr(uint(8)) ). size indicates the total size of the buffer available, while length indicates the current length of the string in the buffer (the common case would be size-1 for a C-style string). If isowned is set to true, the backing buffer will be freed when the new record is destroyed. If needToCopy is set to true, the buffer will be copied into the record, otherwise it will be used directly. It is the responsibility of the user to ensure that the underlying buffer is not freed if the buffer is not copied in.

Warning

String initializers are deprecated. Use createString* functions, instead.

proc length
Returns:The number of codepoints in the string.
proc size
Returns:The number of codepoints in the string.
proc numBytes
Returns:The number of bytes in the string.
proc numCodepoints
Returns:The number of codepoints in the string, assuming the string is correctly-encoded UTF-8.
proc localize(): string

Gets a version of the string that is on the currently executing locale.

Returns: A shallow copy if the string is already on the current locale; otherwise, a deep copy.
proc c_str(): c_string

Get a c_string from a string.

Warning

This can only be called safely on a string whose home is the current locale. This property can be enforced by calling string.localize() before c_str(). If the string is remote, the program will halt.

For example:

var my_string = "Hello!";
on different_locale {
  printf("%s", my_string.localize().c_str());
}
Returns:A c_string that points to the underlying buffer used by this string. The returned c_string is only valid when used on the same locale as the string.
iter these(): string

Iterates over the string character by character.

For example:

var str = "abcd";
for c in str {
  writeln(c);
}

Output:

a
b
c
d
iter chpl_bytes(): byteType

Iterates over the string byte by byte.

iter codepoints(): int(32)

Iterates over the string Unicode character by Unicode character.

proc toByte(): uint(8)
Returns:The value of a single-byte string as an integer.
proc byte(i: int): uint(8)
Returns: The value of the i-th byte as an integer.
proc toCodepoint(): int(32)
Returns:The value of a single-codepoint string as an integer.
proc codepoint(i: int): int(32)
Returns: The value of the i-th multibyte character as an integer.
proc this(i: byteIndex): string

Return the codepoint starting at the i-th byte in the string

Returns:A string with the complete multibyte character starting at the specified byte index from 1..string.numBytes
proc this(i: codepointIndex): string

Return the i-th codepoint in the string

Returns:A string with the complete multibyte character starting at the specified codepoint index from 1..string.numCodepoints
proc this(i: int): string

Return the i-th codepoint in the string

Returns:A string with the complete multibyte character starting at the specified codepoint index from 1..string.numCodepoints
proc this(r: range(?)): string

Slice a string. Halts if r is non-empty and not completely inside the range 1..string.length when compiled with --checks. --fast disables this check.

Arguments:r -- range of the indices the new string should be made from
Returns:a new string that is a substring within 1..string.length. If the length of r is zero, an empty string is returned.
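
A brief sketch of slicing, assuming the 1-based indices described above (the variable name is illustrative):

```chapel
var s = "chapel";

writeln(s[1..3]);            // prints: cha
writeln(s[4..6]);            // prints: pel

// A zero-length range yields an empty string.
writeln(s[1..0].isEmpty());  // prints: true
```
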
proc isEmpty(): bool
Returns:
  • true -- when the string is empty
  • false -- otherwise
proc startsWith(needles: string ...): bool
Arguments:needles -- A varargs list of strings to match against.
Returns:
  • true -- when the string begins with one or more of the needles
  • false -- otherwise
proc endsWith(needles: string ...): bool
Arguments:needles -- A varargs list of strings to match against.
Returns:
  • true -- when the string ends with one or more of the needles
  • false -- otherwise
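
Because both methods accept a varargs list, several candidates can be tested in one call. A small sketch with an illustrative file name:

```chapel
var fname = "report.csv";

writeln(fname.startsWith("report"));      // prints: true
writeln(fname.endsWith(".txt", ".csv"));  // prints: true  (one needle matches)
writeln(fname.endsWith(".txt"));          // prints: false
```
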
proc find(needle: string, region: range(?) = 1: byteIndex..): byteIndex
Arguments:
  • needle -- the string to search for
  • region -- an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 1..string.length
Returns:

the index of the first occurrence of needle within a string, or 0 if the needle is not in the string.

proc rfind(needle: string, region: range(?) = 1: byteIndex..): byteIndex
Arguments:
  • needle -- the string to search for
  • region -- an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 1..string.length
Returns:

the index of the first occurrence from the right of needle within a string, or 0 if the needle is not in the string.

proc count(needle: string, region: range(?) = 1..): int
Arguments:
  • needle -- the string to search for
  • region -- an optional range defining the substring to search within, default is the whole string. Halts if the range is not within 1..string.length
Returns:

the number of times needle occurs in the string

proc replace(needle: string, replacement: string, count: int = -1): string
Arguments:
  • needle -- the string to search for
  • replacement -- the string to replace needle with
  • count -- an optional integer specifying the number of replacements to make, values less than zero will replace all occurrences
Returns:

a copy of the string where replacement replaces needle up to count times
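
The searching methods above can be sketched together on a single ASCII string, where byte and codepoint positions coincide. The example assumes the 1-based indices described in this document; the variable names are illustrative:

```chapel
var word = "banana";

var first = word.find("an");   // byteIndex of the leftmost match
var last  = word.rfind("an");  // byteIndex of the rightmost match

// A byteIndex can be compared directly with an int.
writeln(first == 2);                   // prints: true
writeln(last == 4);                    // prints: true
writeln(word.find("xyz") == 0);        // prints: true  (not found)

writeln(word.count("an"));             // prints: 2
writeln(word.replace("an", "AN"));     // prints: bANANa
writeln(word.replace("an", "AN", 1));  // prints: bANana
```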

iter split(sep: string, maxsplit: int = -1, ignoreEmpty: bool = false)

Splits the string on sep yielding the substring between each occurrence, up to maxsplit times.

Arguments:
  • sep -- The delimiter used to break the string into chunks.
  • maxsplit -- The number of times to split the string, negative values indicate no limit.
  • ignoreEmpty --
    • When true -- Empty strings will not be yielded,
      and will not count towards maxsplit
    • When false -- Empty strings will be yielded when
      sep occurs multiple times in a row.
iter split(maxsplit: int = -1)

Works as above, but uses runs of whitespace as the delimiter.

Arguments:maxsplit -- The number of times to split the string, negative values indicate no limit.
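
The effect of the split arguments can be sketched as follows. The input strings are illustrative, and each field is wrapped in brackets so that empty strings are visible:

```chapel
// Default: empty strings are yielded where sep occurs twice in a row.
for field in "a,b,,c".split(",") do
  writeln("[", field, "]");
// prints: [a] [b] [] [c], one per line

// ignoreEmpty=true drops the empty field.
for field in "a,b,,c".split(",", ignoreEmpty=true) do
  writeln("[", field, "]");
// prints: [a] [b] [c], one per line

// maxsplit=1 splits once and yields the remainder unsplit.
for field in "a,b,,c".split(",", maxsplit=1) do
  writeln("[", field, "]");
// prints: [a] [b,,c], one per line

// With no separator, runs of whitespace act as the delimiter.
for w in "one  two\tthree".split() do
  writeln("[", w, "]");
// prints: [one] [two] [three], one per line
```
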
proc join(const ref S: string ...): string

Returns a new string, which is the concatenation of all of the strings passed in with the receiving string inserted between them.

var x = "|".join("a","10","d");
writeln(x); // prints: "a|10|d"
proc join(const ref S): string

Same as the varargs version, but with a homogeneous tuple of strings.

var x = "|".join(("a","10","d"));
writeln(x); // prints: "a|10|d"
proc join(const ref S: [] string): string

Same as the varargs version, but with all the strings in an array.

var x = "|".join(["a","10","d"]);
writeln(x); // prints: "a|10|d"
proc strip(chars: string = " \t\r\n", leading = true, trailing = true): string
Arguments:
  • chars -- A string containing each character to remove. Defaults to " \t\r\n".
  • leading -- Indicates if leading occurrences should be removed. Defaults to true.
  • trailing -- Indicates if trailing occurrences should be removed. Defaults to true.
Returns:

A new string with leading and/or trailing occurrences of characters in chars removed as appropriate.
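
A short sketch of how the three arguments interact; the inputs are illustrative, and brackets are printed around results where whitespace matters:

```chapel
// Default: strip whitespace from both ends.
writeln("[", "  padded \n".strip(), "]");               // prints: [padded]

// Keep trailing characters by disabling trailing removal.
writeln("[", "  padded  ".strip(trailing=false), "]");  // prints: [padded  ]

// A custom set of characters to remove.
writeln("xxvaluexx".strip("x"));                        // prints: value
writeln("xxvaluexx".strip("x", leading=false));         // prints: xxvalue
```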

proc partition(sep: string): 3*(string)

Splits the string on sep into a 3*string consisting of the section before sep, sep, and the section after sep. If sep is not found, the tuple will contain the whole string, and then two empty strings.
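
For instance, partition can split a key/value pair in one step. This sketch uses illustrative variable names and tuple destructuring to unpack the result:

```chapel
var (key, eq, value) = "name=chapel".partition("=");
writeln(key);    // prints: name
writeln(eq);     // prints: =
writeln(value);  // prints: chapel

// When sep is absent, the whole string comes first, followed by
// two empty strings.
var (whole, missingSep, rest) = "no separator".partition("=");
writeln(whole);                                   // prints: no separator
writeln(missingSep.isEmpty() && rest.isEmpty());  // prints: true
```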

proc isUpper(): bool

Checks if all the characters in the string are either uppercase (A-Z) or uncased (not a letter).

Returns:
  • true -- if the string contains at least one uppercase character and no lowercase characters, ignoring uncased characters.
  • false -- otherwise

proc isLower(): bool

Checks if all the characters in the string are either lowercase (a-z) or uncased (not a letter).

returns:
  • true -- when there are no uppercase characters in the string.
  • false -- otherwise
proc isSpace(): bool

Checks if all the characters in the string are whitespace (' ', '\t', '\n', '\v', '\f', '\r').

returns:
  • true -- when all the characters are whitespace.
  • false -- otherwise
proc isAlpha(): bool

Checks if all the characters in the string are alphabetic (a-zA-Z).

returns:
  • true -- when the characters are alphabetic.
  • false -- otherwise
proc isDigit(): bool

Checks if all the characters in the string are digits (0-9).

returns:
  • true -- when the characters are digits.
  • false -- otherwise
proc isAlnum(): bool

Checks if all the characters in the string are alphanumeric (a-zA-Z0-9).

returns:
  • true -- when the characters are alphanumeric.
  • false -- otherwise
proc isPrintable(): bool

Checks if all the characters in the string are printable.

returns:
  • true -- when the characters are printable.
  • false -- otherwise
proc isTitle(): bool

Checks if all uppercase characters are preceded by uncased characters, and if all lowercase characters are preceded by cased characters.

Returns:
  • true -- when the condition described above is met.
  • false -- otherwise
proc toLower(): string
Returns:A new string with all uppercase characters replaced with their lowercase counterpart.

Note

The case change is currently not performed on characters whose uppercase and lowercase forms take a different number of bytes to represent in Unicode.

proc toUpper(): string
Returns:A new string with all lowercase characters replaced with their uppercase counterpart.

Note

The case change is currently not performed on characters whose uppercase and lowercase forms take a different number of bytes to represent in Unicode.

proc toTitle(): string
Returns:A new string with all cased characters following an uncased character converted to uppercase, and all cased characters following another cased character converted to lowercase.

Note

The case change is currently not performed on characters whose uppercase and lowercase forms take a different number of bytes to represent in Unicode.
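
The three case conversions can be compared on one ASCII input. This is a small sketch with an illustrative string; it also checks the converted title against isTitle:

```chapel
var s = "hello World";

writeln(s.toUpper());            // prints: HELLO WORLD
writeln(s.toLower());            // prints: hello world
writeln(s.toTitle());            // prints: Hello World

// The result of toTitle() satisfies the isTitle() condition.
writeln(s.toTitle().isTitle());  // prints: true
```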

proc =(ref lhs: byteIndex, rhs: int)

Copies the int rhs into the byteIndex lhs.

proc =(ref lhs: codepointIndex, rhs: int)

Copies the int rhs into the codepointIndex lhs.

proc =(ref lhs: string, rhs: string)

Copies the string rhs into the string lhs.

proc =(ref lhs: string, rhs_c: c_string)

Copies the c_string rhs_c into the string lhs.

Halts if lhs is a remote string.

proc +(s0: string, s1: string)
Returns:A new string which is the result of concatenating s0 and s1
proc *(s: string, n: integral)
Returns:A new string which is the result of repeating s n times. If n is less than or equal to 0, an empty string is returned.

For example:

writeln("Hello! " * 3);

Results in:

Hello! Hello! Hello!
proc +(s: string, x: numeric)

The following concatenation functions return a new string which is the result of casting the non-string argument to a string, and concatenating that result with s.

proc +(x: numeric, s: string)
proc +(s: string, x: enumerated)
proc +(x: enumerated, s: string)
proc +(s: string, x: bool)
proc +(x: bool, s: string)
proc +=(ref lhs: string, const ref rhs: string): void

Appends the string rhs to the string lhs.

proc ascii(a: string): uint(8)
Returns:The byte value of the first character in a as an integer.

Warning

This method is deprecated. Use toByte or byte methods, instead.

proc asciiToString(i: uint(8))
Returns:A string with the single character with the ASCII value i.

Warning

This method is deprecated. Use codepointToString method, instead.

proc codepointToString(i: int(32))
Returns:A string storing the complete multibyte character sequence that corresponds to the codepoint value i.