stringi: Fast and Portable Character String Processing in R

Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.


Introduction
Stringology (Crochemore and Rytter 2003) deals with algorithms and data structures for character string processing (Jurafsky and Martin 2008; Szpankowski 2001). From the perspective of applied statistics and data science, it is worth stressing that many interesting data sets first come in unstructured or contaminated textual forms, for instance when they have been fetched from different APIs (application programming interfaces) or gathered by means of web scraping techniques.
Diverse data cleansing and preparation operations (Dasu and Johnson 2003; Van der Loo and De Jonge 2018; see also Section 2 below for a real-world example) need to be applied before an analyst can begin to enjoy an orderly and meaningful data frame, matrix, or spreadsheet finally at their disposal. Activities related to information retrieval, computer vision, bioinformatics, natural language processing, or even musicology can also benefit from including them in data processing pipelines (Jurafsky and Martin 2008; Kurtz et al. 2004).
Although statisticians and data analysts are usually very proficient in numerical computing and data wrangling, awareness of how crucial text operations are in the generic data-oriented skill-set is yet to reach a more operational level. This paper aims to fill this gap.
Most statistical computing ecosystems provide only a basic set of text operations. In particular, base R (R Core Team 2022) is mostly restricted to pattern matching, string concatenation, substring extraction, trimming, padding, wrapping, simple character case conversion, and string collation; see Chambers (2008, Chapter 8) and Table 1 below. The stringr package (Wickham 2010), first released in November 2009, implemented an alternative, "tidy" API for text data processing (cleaned-up function names, more beginner-friendly outputs, etc.; the list of 21 functions that were available in stringr at that time is given in Table 1). The early stringr package featured a few wrappers around a subset of its base R counterparts. Base R string facilities, however, are to this day not only limited in scope, but also suffer from a number of portability issues: the same code may yield different results on different operating systems; see Section 3 for some examples.
In order to significantly broaden the array of string processing operations and assure that they are portable, in 2013 the current author developed the open source stringi package (pronounced "stringy", IPA [stringi]; Gagolewski et al. 2022). Its API was compatible with that of early stringr's, which some users found convenient. However, for the processing of text in different locales, which are plentiful, stringi relies on ICU (International Components for Unicode; see https://icu.unicode.org/), a mature library that fully conforms with the Unicode standard and which provides globalization support for a broad range of other software applications as well, from web browsers to database systems. Services not covered by ICU were implemented from scratch to guarantee that they are as efficient as possible.
Over the years, stringi has established itself as robust, production-quality software; for many years now it has been one of the most often downloaded R extensions. Interestingly, in 2015 the aforementioned stringr package was rewritten as a set of wrappers around some of the stringi functions instead of the base R ones. In Section 14.7 of "R for Data Science" (Wickham and Grolemund 2017) we read: "stringr is useful when you're learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation functions. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 250 functions to stringr's 49." Also, it is worth noting that the recently-introduced stringx package (Gagolewski 2021) supplies a stringi-based set of portable and efficient replacements for and enhancements of the base R functions. This paper describes the most noteworthy facilities provided by stringi that statisticians and data analysts may find useful in their daily activities. We demonstrate how important it is for a modern data scientist to be aware of the challenges of natural language processing in the internet era: how to make "groß" compare equal to "GROSS", count the number of occurrences of "AGA" within "ACTGAGAGACGGGTTAGAGACT", make "a13" ordered before "a100", or convert between "GRINNING FACE" and " " back and forth. Such operations are performed by ICU itself; we therefore believe that what follows may be of interest to data-oriented practitioners employing Python (Van Rossum et al. 2011), Perl, Julia (Bezanson, Edelman, Karpinski, and Shah 2017), PHP, etc., as ICU has bindings for many other languages.
Table 1: Functions in (the historical) stringr 0.6.2 and their counterparts in base R 4.1.

Here is the outline of the paper:
• Section 2 illustrates the importance of string processing in an example data preparation activity.
• General package design principles are outlined in Section 3, including the use cases of deep vectorization, the concepts of data flow, and the main deviations from base R (also with regards to portability and speed).
• Basic string operations, such as computing length and width of strings, string concatenation, extracting and replacing substrings, are discussed in Section 4.
• Section 5 discusses searching for fixed substrings: counting the number of matches, locating their positions, replacing them with other data, and splitting strings into tokens.
• Section 6 details ICU regular expressions, which are a powerful tool for matching patterns defined in a more abstract way, e.g., extracting numbers from text so that they can be processed quantitatively, identifying hyperlinks, etc. We show where ICU is different from other libraries like PCRE (Perl-compatible regular expressions; https://www.pcre.org/); in particular that it enables portable, Unicode-correct lookups, for instance, involving sequences of emojis or mathematical symbols.
• Section 7 deals with the locale-aware ICU Collator, which is suitable for natural language processing activities; this is where we demonstrate that text processing in different languages or regions is governed by quite diverse rules, deviating significantly from the US-ASCII ("C/POSIX.1") setting. The operations discussed therein include testing string equivalence (which can turn out useful when we scrape data that consist of nonnormalized strings, ignorable punctuation, or accented characters) as well as arranging strings with regards to different linear orderings.
• Section 8 covers some other useful operations such as text boundary analysis (for splitting text into words or sentences), trimming, padding, and other formatting, random string generation, character transliteration (converting between cases and alphabets, removing diacritic marks, etc.) as well as date-time formatting and parsing in any locale (e.g., Japanese dates in a German R).
• Section 9 details on encoding conversion and detection (which is key when reading or writing text files that are to be communicated across different systems) as well as Unicode normalization (which can be useful for removing formatting distinctions from text, e.g., superscripts or font variants).
• Finally, Section 10 concludes the paper.
This paper is by no means a substitute for the comprehensive yet much more technical and in-depth reference manual available via a call to help(package = "stringi"); see also https://stringi.gagolewski.com/. Rather, below we explain the package's key design principles and broadly introduce the ideas and services that help program, correct, and optimize text processing workflows.
Let us emphasize that all the below-presented illustrations, i.e., calls to stringi functions on different example arguments together with the generated outputs, form an integral part of this manuscript's text. They have been included based on the author's experience-based belief that each "picture" (that we print out below using a monospaced font) is worth hundreds of words.
All code chunk outputs presented in this paper were obtained in R 4.1.1. The R environment itself and all the packages used herein are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/. In particular, install.packages("stringi") can be called to fetch the object of our focus. By calling:

R> library("stringi")
R> cat(stri_info(short = TRUE))
stringi_1.7.6 (en_AU.UTF-8; ICU4C 69.1 [bundle]; Unicode 13.0)

we can load and attach the package's namespace and display some basic information thereon. Hence, below we shall be working with stringi 1.7.6; however, as the package's API is considered stable, the presented material should be relevant to any later versions.

Use case: Data preparation
Before going into details on the broad array of facilities offered by the stringi package itself, let us first demonstrate that string processing is indeed a relevant part of statistical data analysis workflows. What follows is a short case study where we prepare a web-scraped data set for further processing.
Assume we wish to gather and analyze climate data for major cities around the world based on information downloaded from Wikipedia. For each location from a given list of settlements (e.g., fetched from one of the pages linked under https://en.wikipedia.org/wiki/Lists_of_cities), we would like to harvest the relevant temperature and precipitation data. Without loss of generality, let us focus on the city of Melbourne, VIC, Australia. Most Wikipedia pages related to particular cities include a table labelled as "Climate data". We need to pinpoint it amongst all the other tables. For this, we will rely on stringi's stri_detect_fixed() function that, in the configuration below, is used to extract the index of the relevant table.
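The scraping call itself is not reproduced in this excerpt; the following toy sketch (with a hypothetical all_captions vector standing in for the tables' text content) illustrates how stri_detect_fixed() can pinpoint the relevant table:

```r
library(stringi)

# hypothetical text content of the tables found on a city's Wikipedia page
all_captions <- c(
  "Demographics of Melbourne",
  "Climate data for Melbourne Regional Office (1991-2015)",
  "Sister cities"
)

# index of the first table whose text mentions "Climate data"
idx <- which(stri_detect_fixed(all_captions, "Climate data"))[1]
idx  # here: 2
```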

[1] 4
Of course, the detailed description of all the facilities brought by stringi is covered below. In the meantime, let us use rvest's html_table() to convert the above table to a data frame object.
R> (x <- as.data.frame(html_table(all_tables[[idx]], fill = TRUE)))
   Climate data for Melbourne Regional Office (1991-2015)
1  Month
2  Record high °C (°F)
3  Average high °C (°F)
4  Daily mean °C (°F)
5  Average low °C (°F)
6  Record low °C (°F)
7  Average rainfall mm (inches)
8  Average rainy days (>= 1mm)
9  Average afternoon relative humidity (%)
10 Mean monthly sunshine hours
11 Source: Bureau of Meteorology.[85][86][87]

It is evident that this object requires some significant cleansing and transforming before it can be subject to any statistical analyses. First, for the sake of convenience, let us convert it to a character matrix so that the processing of all the cells can be vectorized (a matrix in R is just a single "long" vector, whereas a data frame is a list of many atomic vectors).

R> x <- as.matrix(x)
The as.numeric() function will find the parsing of the Unicode MINUS SIGN (U+2212, "−") difficult; therefore, let us call the transliterator first in order to replace it (and other potentially problematic characters) with its simpler equivalent:

R> x[, ] <- stri_trans_general(x, "Publishing-Any; Any-ASCII")

Note that it is the first row of the matrix that defines the column names. Moreover, the last row just gives the data source and hence may be removed.
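As a minimal, self-contained illustration of the transliteration step above (with a toy input in place of the scraped matrix):

```r
library(stringi)

# U+2212 MINUS SIGN, common in typeset tables, breaks as.numeric()
raw <- "\u22122.8"
suppressWarnings(as.numeric(raw))  # NA

# transliterate typographic characters to simpler ASCII equivalents first
clean <- stri_trans_general(raw, "Publishing-Any; Any-ASCII")
as.numeric(clean)  # parses successfully now
```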

General design principles
The API of the early releases of stringi was designed to be fairly compatible with that of the 0.6.2 version of the stringr package (Wickham 2010, version dated 2012; see Table 1), with some fixes in the consistency of the handling of missing values and zero-length vectors, amongst others. However, instead of being merely thin wrappers around base R functions, which we have identified as not necessarily portable across platforms and not really suitable for natural language processing tasks, all the functionality has been implemented from the ground up, with the use of ICU services wherever applicable. Since the initial release, an abundance of new features has been added, and the package can now be considered a comprehensive workhorse for text data processing. Note that the stringi API is stable. Future releases aim for as much backward compatibility as possible so that other software projects can safely rely on it.

Naming
Function and argument names use a combination of lowercase letters and underscores (and no dots). To avoid namespace clashes, all function names feature the "stri_" prefix. Names are fairly self-explanatory, e.g., stri_locate_first_regex and stri_locate_all_fixed find, respectively, the first match to a regular expression and all occurrences of a substring as-is.

Vectorization
Individual character (or code point) strings can be entered using double quotes or apostrophes:

R> "spam"
[1] "spam"

However, as the R language does not feature any classical scalar types, strings are wrapped in atomic vectors of type 'character':

R> typeof("spam")
[1] "character"
R> length("spam")
[1] 1

Hence, we will be using the terms "string" and "character vector of length 1" interchangeably.
Not having a separate scalar type is very convenient; the so-called vectorization strategy encourages writing code that processes whole collections of objects, all at once, regardless of their size.

Acting elementwise with recycling
Binary and higher-arity operations in R are oftentimes vectorized with respect to all arguments (or at least the crucial, non-optional ones). As a prototype, let us consider the binary arithmetic, logical, or comparison operators (and, to some extent, paste(), strrep(), and more generally mapply()), for example, multiplication:

R> c(10, -1) * c(1, 2, 3, 4)
[1] 10 -2 30 -4

Calling "x * y" multiplies the corresponding components of the two vectors elementwise. If one operand is shorter than the other, the former is recycled as many times as necessary to match the length of the latter (a warning is emitted if partial recycling occurs). Also, acting on a zero-length input always yields an empty vector.
All functions in stringi follow this convention (with some obvious exceptions, such as the collapse argument in stri_join(), locale in stri_datetime_parse(), etc.). In particular, all string search functions are vectorized with respect to both the haystack and the needle arguments (and, e.g., the replacement string, if applicable).
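For instance (a small sketch of our own), both arguments of stri_count_fixed() are recycled elementwise: the first needle is sought in the first string, the second needle in the second string, and so forth:

```r
library(stringi)

# needle is recycled against haystack: "a" is sought in the 1st string,
# "b" in the 2nd
stri_count_fixed(c("abab", "aaaa"), c("a", "b"))
# [1] 2 0
```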
On a side note, to match different patterns with respect to each column, we can (amongst others) apply matrix transposition twice (t(stri_count_fixed(t(haystack), needle))).
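A sketch of this idea follows; note that, since the counting function returns a plain vector, we restore the matrix shape explicitly here for clarity:

```r
library(stringi)

haystack <- matrix(c("aa", "ab", "ba", "bb"), nrow = 2)  # 2 x 2 string matrix
needle <- c("a", "b")  # pattern for column 1 and column 2, respectively

# transposing makes needle recycling walk along the rows, i.e., it matches
# the j-th pattern against the j-th column of the original matrix
counts <- matrix(stri_count_fixed(t(haystack), needle), nrow = ncol(haystack))
res <- t(counts)
res  # res[i, j] = number of occurrences of needle[j] in haystack[i, j]
```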

Data flow
All vector-like arguments (including factors and objects) in stringi are treated in the same manner: for example, if a function expects a character vector on input and an object of another type is provided, as.character() is called first (e.g., 1:2 is treated as c("1", "2")).
Following Wickham (2010), stringi makes sure the output data types are consistent and that different functions are interoperable. This makes operation chaining easier and less error prone.
For example, stri_extract_first_regex() finds the first occurrence of a pattern in each string, therefore the output is a character vector of the same length as the input (after recycling, if necessary).
On the other hand, stri_extract_all_regex() identifies all occurrences of a pattern, whose counts may differ from input to input, therefore it yields a list of character vectors.
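To illustrate the difference between the two output types (our own toy example):

```r
library(stringi)

txt <- c("spam a1 b22", "no digits here")

# first match per string -> character vector (NA where there is no match)
stri_extract_first_regex(txt, "\\d+")
# [1] "1" NA

# all matches per string -> list of character vectors
stri_extract_all_regex(txt, "\\d+")
```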
Also, care is taken so that the data or x argument is most often listed as the first one (e.g., in base R we have grepl(needle, haystack) vs. stri_detect(haystack, needle) here). This makes the functions more intuitive to use, but also more forward pipe operator-friendly (either when using "|>" introduced in R 4.1 or "%>%" from magrittr, Bache and Wickham 2022).
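For example, with the base pipe available since R 4.1 (a trivial sketch of our own):

```r
library(stringi)

# the haystack-first argument order composes naturally with |>
c("spam", "eggs", "bacon") |>
  stri_detect_fixed("a")
# [1]  TRUE FALSE  TRUE
```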
Furthermore, for increased convenience, some functions have been added despite the fact that they can be trivially reduced to a series of other calls. In particular, writing:

R> stri_sub_all(haystack, stri_locate_all_regex(haystack,
+     "\\b\\w{1,4}\\b", omit_no_match = TRUE))

yields the same result as in the previous example, but refers to haystack twice.

Further deviations from base R
stringi can be used as a replacement of the existing string processing functions. Also, it offers many facilities not available in base R. Apart from being fully vectorized with respect to all crucial arguments, propagating missing values and empty vectors consistently, and following coherent naming conventions, our functions deviate from their classic counterparts even further.
Following Unicode standards. Thanks to the comprehensive coverage of the most important services provided by ICU, its users gain access to collation, pattern searching, normalization, transliteration, etc., that follow the recent Unicode standards for text processing in any locale. Due to this, as we state in Section 9.2, all inputs are converted to Unicode and outputs are always in UTF-8.
Portability issues in base R. As we have mentioned in the introduction, base R string operations have traditionally been limited in scope. There might also be some issues with regards to their portability, for a number of reasons. For instance, different versions of the PCRE (8.x or 10.x) pattern matching library may be linked to during the compilation of R. On Windows, there is a custom implementation of iconv (compare https://www.gnu.org/software/libiconv/) whose set of character encoding IDs is not fully compatible with that on GNU/Linux: to select the Polish locale, we need to pass "Polish_Poland" to Sys.setlocale() on Windows, but "pl_PL" on Linux. Interestingly, R can be built against the system ICU so that its Collator is used for comparing strings (e.g., via the "<=" operator); however, this is only optional and does not provide access to any other Unicode services.

Basic string operations
Let us proceed with a detailed description of the most important facilities in the stringi package that might be of interest to the broad statistical and data analysis audience.

Computing length and width
First we shall review the functions related to determining the number of entities in each string.
Let us consider the following character vector. The x object consists of 5 character strings. stri_length() computes the length of each string; more precisely, the function gives the number of Unicode code points in each string (see Section 9.1 for more details).

R> stri_length(x)
[1]  4  2  3 NA  0

The first string carries 4 ASCII (English) letters, the second consists of 2 Chinese characters (U+4F60, U+597D; a greeting), and the third is comprised of 3 zero-width spaces (U+200B). Note that the 5th element in x is an empty string, "", hence its length is 0. Moreover, there is a missing (NA) value at index 4, therefore the corresponding length is undefined as well.
When formatting strings for display (e.g., in a report dynamically generated with Sweave() or knitr; see Xie 2015), a string's width estimate may be more informative: an approximate number of text columns it will occupy when printed using a monospaced font. In particular, many Chinese, Japanese, Korean, and most emoji characters take up two text cells. Some code points, on the other hand, might be of width 0 (e.g., the above ZERO WIDTH SPACE, U+200B).
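The difference between length and width can be illustrated as follows (a small example of our own; stri_width() gives the estimate discussed above):

```r
library(stringi)

strs <- c("lorem", "\u4f60\u597d", "\u200b\u200b\u200b")  # ASCII, Chinese, zero-width spaces

stri_length(strs)  # number of code points
# [1] 5 2 3
stri_width(strs)   # estimated number of text columns when printed
# [1] 5 4 0
```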

Joining
Below we describe the functions that are related to string concatenation.

Extracting and replacing substrings
The next group of functions deals with the extraction and replacement of particular sequences of code points in given strings.

Indexing vectors.
Recall that in order to select a subsequence from any R vector, we use the square-bracket operator with an index vector consisting of either non-negative integers, negative integers, or logical values.
For example, here is how to select specific elements in a vector:

[1] "spam"  "bacon"

Exclusion of elements at specific positions can be performed with negative indices. Filtering based on a logical vector can be used to extract strings fulfilling desired criteria:

[1] "spam"      "buckwheat" "bacon"

Extracting substrings. A character vector is, in its very essence, a sequence of sequences of code points. To extract specific substrings from each string in a collection, we can use the stri_sub() function.
"From-to" and "from-length" matrices. The second parameter of both stri_sub() and stri_sub_list() can also be fed with a two-column matrix of the form cbind(from, to). Here, the first column gives the start indices and the second column defines the end ones. Such matrices are generated, amongst others, by the stri_locate_*() functions (see below for details).
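For instance (our own sketch, with a hypothetical two-row index matrix m), feeding stri_sub() with such a matrix extracts one substring per row, with the input string recycled as needed:

```r
library(stringi)

m <- cbind(c(1, 5), c(3, 7))  # start indices, end indices

# one substring per matrix row
stri_sub("abcdefgh", m)
# [1] "abc" "efg"
```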
Note the difference between the above output and the following one:

R> stri_sub_all(c("abcdefgh", "ijklmnop"), from_to)

This time, we extract four identical sections from each of the two inputs.
Moreover, if the second column of the index matrix is named "length" (and only if this is exactly the case), i.e., the indexer is of the form cbind(from, length = length), extraction will be based on substring lengths rather than on end positions.
Replacing substrings in-place. The corresponding replacement functions modify a character vector in-place:

R> y <- "spam, egg, spam, spam, bacon, and spam"
R> stri_sub(y, 7, length = 3) <- "spam"
R> y
[1] "spam, spam, spam, spam, bacon, and spam"

Note that the state of y has changed in such a way that the substring of length 3 starting at the 7th code point was replaced by length-4 content.

[1] "A BB CCC"
This has replaced 3 length-2 chunks within y with new content.

Code-pointwise comparing
There are many circumstances where we are faced with testing whether two strings (or parts thereof) consist of exactly the same Unicode code points, in exactly the same order. These include, for instance, matching a nucleotide sequence in a DNA profile and querying for system resources based on file names or UUIDs. Such tasks, due to their simplicity, can be performed very efficiently.

Testing for equality of strings
To quickly test whether the corresponding strings in two character vectors are identical (in a code-pointwise manner), we can use the %s===% operator or, equivalently, the stri_cmp_eq() function. Moreover, %s!==% and stri_cmp_neq() implement the not-equal-to relation.

[1] FALSE TRUE FALSE FALSE NA
Due to recycling, the first string was compared against the 5 strings in the 2nd operand. There is only 1 exact match.
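For example (a sketch of our own), code-pointwise comparison is strict: no case folding or canonical equivalence is applied:

```r
library(stringi)

# exact code point sequences only; "groß" and "gross" are NOT equal here
stri_cmp_eq(c("gro\u00df", "gross", "spam"), c("gross", "gross", "spam"))
# [1] FALSE  TRUE  TRUE

# the %s===% operator is an equivalent notation
"spam" %s===% "spam"
# [1] TRUE
```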

Searching for fixed strings
For detecting if a string contains a given fixed substring (code-pointwisely), the fast KMP algorithm (Knuth, Morris, and Pratt 1977), which runs in O(n + p) time (where n is the length of the string and p is the length of the pattern), has been implemented in stringi (with numerous tweaks for even faster matching).

Table 2: String search/pattern matching functions in stringi. Each function, unless otherwise indicated, can be used in conjunction with any search engine, e.g., we have stri_count_fixed() (see Section 5), stri_detect_regex() (see Section 6), and stri_split_coll() (see Section 7).

Table 2 lists the string search functions available in stringi. Below we explain their behavior in the context of fixed pattern matching. Notably, their description is quite detailed, because, as we shall soon find out, the corresponding operations are available for the two other search engines: based on regular expressions and the ICU Collator, see Sections 6 and 7.

Counting matches
The stri_count_fixed() function counts the number of times a fixed pattern occurs in a given string.

Search engine options
The pattern matching engine may be tuned up by passing further arguments to the search functions (via "..."; they are redirected as-is to stri_opts_fixed()). Table 3 gives the list of available options.

Table 3: Options for the fixed pattern search engine, see stri_opts_fixed().
• case_insensitive: Logical; whether to enable the simple case-insensitive matching (defaults to FALSE)
• overlap: Logical; whether to enable the detection of overlapping matches (defaults to FALSE); available in stri_extract_all_fixed(), stri_locate_all_fixed(), and stri_count_fixed()
First, we may switch on the simplistic case-insensitive matching.
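Together with the overlap option, this lets us reproduce the example from the introduction (our own sketch): counting the occurrences of "AGA" within a DNA sequence, with overlapping matches included:

```r
library(stringi)

dna <- "ACTGAGAGACGGGTTAGAGACT"

stri_count_fixed(dna, "aga", case_insensitive = TRUE)                  # non-overlapping
# [1] 2
stri_count_fixed(dna, "aga", case_insensitive = TRUE, overlap = TRUE)  # overlapping
# [1] 4
```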

Detecting and subsetting patterns
A somewhat simplified version of the above search task involves asking whether a pattern occurs in a string at all. This operation can be performed with a call to stri_detect_fixed().

[1] TRUE TRUE FALSE TRUE FALSE FALSE NA TRUE
We can also indicate that we are interested in no-matches by passing negate = TRUE.
What is more, there is an option to stop searching once a given number of matches has been found in the haystack vector (as a whole), which can speed up the processing of larger data sets. This can be useful in scenarios such as "find the first 2 matching resource IDs".
There are also functions that verify whether a string starts or ends with a pattern match:

R> stri_startswith_fixed(x, "abc")

Pattern detection is often performed in conjunction with character vector subsetting. This is why we have a specialized (and hence slightly faster) function that returns only the strings that match a given pattern.
R> stri_subset_fixed(x, "abc", omit_na = TRUE)
[1] "abc"    "abcd"   "xyzabc" "abc"

The above is equivalent to x[which(stri_detect_fixed(x, "abc"))] (note the argument responsible for the removal of missing values), but avoids writing x twice. It is hence particularly convenient when x is generated programmatically on the fly, using some complicated expression. Also, it works well with the forward pipe operator, as we can write "x |> stri_subset_fixed("abc", omit_na = TRUE)".

Locating and extracting patterns
The functions from the stri_locate_*() family aim to pinpoint the positions of pattern matches. First, we may be interested in getting to know the location of the first or the last pattern occurrence:

R> x <- c("aga", "actg", NA, "AGagaGAgaga")
R> stri_locate_first_fixed(x, "aga")

In both examples we obtain a two-column matrix with the number of rows determined by the recycling rule (here: the length of x). In the former case, we get a "from-to" matrix (get_length = FALSE; the default), where missing values correspond to either missing inputs or no-matches. The latter gives a "from-length"-type matrix, where negative lengths correspond to the not-founds.
Second, we may be yearning for the locations of all the matching substrings. As the number of possible answers may vary from string to string, the result is a list of index matrices.
R> stri_locate_all_fixed(x, "aga", overlap = TRUE, case_insensitive = TRUE)

Note again that a no-match is indicated by a single-row matrix with two missing values (or with a negative length if get_length = TRUE). This behavior can be changed by setting the omit_no_match argument to TRUE.
Moreover, stri_replace_first() and stri_replace_last() can identify and replace the first and the last match, respectively.
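For instance (our own example, using the fixed-pattern variants):

```r
library(stringi)

plate <- "spam, eggs, spam, bacon"

stri_replace_first_fixed(plate, "spam", "ham")  # first match only
# [1] "ham, eggs, spam, bacon"
stri_replace_all_fixed(plate, "spam", "ham")    # all matches
# [1] "ham, eggs, ham, bacon"
```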

[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] "f" "g" "h" "i" "j"

The result is a list of character vectors, as each string in the haystack might be split into a possibly different number of tokens.
There is also an option to limit the number of tokens (parameter n).
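For example (a sketch of our own), with n = 2 at most two tokens are produced and the remainder of the string is left unsplit:

```r
library(stringi)

stri_split_fixed("a,b,c,d", ",", n = 2)
# [[1]]
# [1] "a"     "b,c,d"
```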

Regular expressions
Regular expressions (regexes) provide us with a concise grammar for defining systematic patterns which can be sought in character strings. Examples of such patterns include: specific fixed substrings, emojis of any kind, stand-alone sequences of lower-case Latin letters ("words"), substrings that can be interpreted as real numbers (with or without fractional part, also in scientific notation), telephone numbers, email addresses, or URLs.
Theoretically, the concept of regular pattern matching dates back to the so-called regular languages and finite state automata (Kleene 1951); see also Hopcroft and Ullman (1979) and Rabin and Scott (1959). Regexes in the form we know today were already present in qed (Ritchie and Thompson 1970), one of the pre-Unix implementations of the command-line text editor and a predecessor of the well-known sed.
Base R gives access to two different regex matching engines (via functions such as gregexpr() and grep(), see Table 1):
• ERE (extended regular expressions that conform to the POSIX.2-1992 standard); used by default;
• PCRE (Perl-compatible regular expressions); activated when perl = TRUE is set.
Package stringi, on the other hand, provides access to the regex engine implemented in ICU, which was inspired by Java's util.regex in JDK 1.4. Its syntax is mostly compatible with that of PCRE, although certain more advanced facets might not be supported (e.g., recursive patterns). On the other hand, ICU regexes fully conform to the Unicode Technical Standard #18 (Davis and Heninger 2021) and hence provide comprehensive support for Unicode.
It is worth noting that most programming languages as well as advanced text editors and development environments (including Kate, Eclipse, VSCode, and RStudio) support finding or replacing patterns with regexes. Therefore, they should be amongst the instruments at every data scientist's disposal. One general introduction to regexes is Friedl (2006). The ICU flavor is summarized at https://unicode-org.github.io/icu/userguide/strings/regexp.html. Below we provide a concise yet comprehensive introduction to the topic from the perspective of the stringi package users. This time we will use the pattern search routines whose names end with the *_regex() suffix. Apart from stri_detect_regex(), stri_locate_all_regex(), and so forth, in Section 6.4 we introduce stri_match_all_regex(). Moreover, Table 4 lists the available options for the regex engine.

Matching individual characters
We begin by discussing different ways to define character sets. In this part, determining the length of all matching substrings will be quite straightforward.
The following characters have special meaning to the regex engine: ".", "^", "$", "*", "+", "?", "(", ")", "[", "]", "{", "}", "|", and "\".

Table 4: Options for the regular expressions search engine, see stri_opts_regex().
• multi_line: Logical; defaults to FALSE; if set, "$" and "^" recognize line terminators within a string; otherwise, they match only at the start and end of the input
• unix_lines: Logical; defaults to FALSE; when enabled, only the Unix line ending, i.e., U+000A, is honored as a terminator by ".", "$", and "^"
• uword: Logical; defaults to FALSE; whether to use the Unicode definition of word boundaries (see Section 8.1), which are quite different from the traditional regex word boundaries
• error_on_unknown_escapes: Logical; defaults to FALSE; whether unrecognized backslash-escaped characters trigger an error; by default, unknown backslash-escaped ASCII letters represent themselves
• time_limit: Integer; processing time limit for match operations in ∼milliseconds (depends on the CPU speed); 0 for no limit (the default)
• stack_limit: Integer; maximal size, in bytes, of the heap storage available for the matcher's backtracking stack; setting a limit is desirable if poorly written regexes are expected on input; 0 for no limit (the default)
Any regular expression that does not contain the above behaves like a fixed pattern:

R> stri_count_regex("spam, eggs, spam, bacon, sausage, and spam", "spam")
[1] 3

There are hence 3 occurrences of a pattern that is comprised of 4 code points: "s" followed by "p", then by "a", and ending with "m".
However, this time the case insensitive mode fully supports Unicode matching (note that this does not mean that it considers canonically equivalent strings as equal; see Section 7.2 for a discussion and a workaround):
R> stri_detect_regex("groß", "GROSS", case_insensitive = TRUE)

[1] TRUE
If we wish to include a special character as part of a regular expression, so that it is treated literally, we need to escape it with a backslash, "\". Yet, the backslash itself has a special meaning to R, see help("Quotes"); therefore, in R string literals it needs to be preceded by another backslash.
The dot's insensitivity to the newline character is motivated by the need to maintain compatibility with tools such as grep (when searching within text files in a line-by-line manner; compare https://www.gnu.org/software/grep/). This behavior can be altered by setting the dot_all option to TRUE.
Defining character sets. Sets of characters that are allowed at a given position can be specified within square brackets. For instance, the "[hj]am" regex matches: "h" or "j", followed by "a", followed by "m". In other words, "ham" and "jam" are the only two strings that are matched by this pattern (unless matching is done case-insensitively).
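To illustrate the character-set pattern discussed above (the input vector here is our own):

```r
library("stringi")

# "[hj]am" matches "h" or "j", then "a", then "m":
stri_detect_regex(c("ham", "jam", "spam"), "[hj]am")
# [1]  TRUE  TRUE FALSE
```

"spam" is not matched, as it features neither "h" nor "j" before "am".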
The following characters, if used within square brackets, may be treated non-literally: "\", "[", "]", "^", "-", and "&". Therefore, to include any of them as-is in a character set, a backslash-escape must be used. For example, "[\[\]\\]" matches a backslash or a square bracket.
Complementing sets. Including "^" after the opening square bracket denotes the set complement. Hence, "[^abc]" matches any code point except "a", "b", and "c". Here is an example where we seek any substring that consists of 3 non-spaces.
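A minimal illustration of the latter (the input string below is our own):

```r
library("stringi")

# "[^ ]{3}" matches any three consecutive non-space characters:
stri_extract_all_regex("spam, eggs", "[^ ]{3}")
# [[1]]
# [1] "spa" "egg"
```

Note that "m," is not reported, because the third character following it is a space.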
Ranges of the form "[a-z]" (all code points between "a" and "z", inclusive) may also be specified. Nowadays, in the processing of text in natural languages, this notation should rather be avoided: note, for example, that the "ą" (Polish "a" with ogonek) lies outside the "[a-z]" range.
Avoiding POSIX classes. The use of the POSIX-like character classes should be avoided, because they are generally not well-defined.
In particular, in POSIX-like regex engines, "[:punct:]" stands for the character class corresponding to the ispunct() function in C (see "man 3 ispunct" on Unix-like systems). According to ISO/IEC 9899:1990 (ISO C90), ispunct() tests for any printing character except for the space or a character for which isalnum() is true.
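Unicode character categories (compare Section 6.1), on the other hand, are well-defined; for instance, "\p{P}" denotes the punctuation category (the input below is our own):

```r
library("stringi")

# "\p{P}+" matches maximal runs of Unicode punctuation characters:
stri_extract_all_regex("spam, eggs!", "\\p{P}+")
# [[1]]
# [1] "," "!"
```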
Alternating. The alternation operator, "|", matches either its left or its right branch. Matching is always done left-to-right, on a first-come, first-served basis. Hence, if the left branch is a subset of the right one, the latter will never be matched. In particular, "(al|alga|algae)" can only match "al". To fix this, we can write "(algae|alga|al)".

Quantifiers
More often than not, a variable number of instances of the same subexpression needs to be captured or its presence should be made optional. This can be achieved by means of the following quantifiers:
• "?" matches 0 or 1 times;
• "*" matches 0 or more times;
• "+" matches 1 or more times;
• "{n,m}" matches between n and m times;
• "{n,}" matches at least n times;
• "{n}" matches exactly n times.
By default, the quantifiers are greedy: they match the repeated subexpression as many times as possible. The "?" suffix (hence, quantifiers such as "??", "*?", "+?", and so forth) tries with as few occurrences as possible (to obtain a match still). For example, of the three patterns "\(.+\)", "\(.+?\)", and "\([^)]+\)", the first regex is greedy: it matches an opening bracket, then as many characters as possible (including ")") that are followed by a closing bracket. The two other patterns terminate as soon as the first closing bracket is found.
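The greedy vs. lazy behavior can be illustrated as follows (the input string is our own):

```r
library("stringi")

x <- "code (in R) and data (in CSV)"
stri_extract_first_regex(x, "\\(.+\\)")    # greedy: up to the last ")"
# [1] "(in R) and data (in CSV)"
stri_extract_first_regex(x, "\\(.+?\\)")   # lazy: stops at the first ")"
# [1] "(in R)"
stri_extract_first_regex(x, "\\([^)]+\\)") # ")" cannot occur inside the match
# [1] "(in R)"
```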
Let us stress that the quantifier is applied to the subexpression that stands directly before it. Grouping parentheses can be used in case they are needed.
Performance notes. ICU, just like PCRE, uses a nondeterministic finite automaton-type algorithm. Hence, due to backtracking, some ill-defined regexes can lead to exponential matching times (e.g., "(a+)+b" applied on "aaaa...aaaaac"). If such patterns are expected, setting the time_limit or stack_limit option is recommended.
(One such match attempt took over 16 seconds of CPU time on the author's machine.) Nevertheless, oftentimes such regexes can naturally be reformulated to fix the underlying issue. The ICU user guide on regular expressions also recommends using possessive quantifiers ("?+", "*+", "++", and so on), which match as many times as possible but, contrary to the plain-greedy ones, never backtrack when they happen to consume too much data.
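The no-backtracking behavior of possessive quantifiers can be demonstrated on a toy example (of our own devising):

```r
library("stringi")

stri_detect_regex("aaa", "a*a")   # greedy: "a*" backtracks to leave one "a"
# [1] TRUE
stri_detect_regex("aaa", "a*+a")  # possessive: "a*+" never gives characters back
# [1] FALSE
```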
See also the re2r (a wrapper around the RE2 library; Wenfeng 2020) package's documentation and the references therein for a discussion.

Capture groups and references thereto
Round-bracketed subexpressions carry one additional characteristic: they form the so-called capture groups that can be extracted separately or be referred to in other parts of the same regex.
Extracting capture group matches. The above is evident when we use the versions of stri_extract_*() that are sensitive to the presence of capture groups:
     [,1]                     [,2]       [,3]
[1,] "name='Sir Launcelot'"   "name"     "Sir Launcelot"
[2,] "quest='Seek the Grail'" "quest"    "Seek the Grail"
[3,] "favcolor='blue'"        "favcolor" "blue"
The findings are presented in matrix form. The first column gives the complete matches, the second column stores the matches to the first capture group, and so forth.
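A sketch of how such a matrix can be generated (the input string below is our own, modeled on the matches shown above):

```r
library("stringi")

x <- "name='Sir Launcelot', quest='Seek the Grail', favcolor='blue'"
# each match yields the whole "key='value'" text plus the two capture groups:
stri_match_all_regex(x, "(\\w+)='(.+?)'")[[1]]
```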
Replacing with capture group matches. Matches to particular capture groups can be recalled in replacement strings when using stri_replace_*(). Here, the match in its entirety is denoted with "$0", then "$1" stores whatever was caught by the first capture group, "$2" is the match to the second capture group, etc. Moreover, "\$" gives the dollar-sign.
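For instance (an illustrative call of our own):

```r
library("stringi")

# swap the two words, recalling them via "$1" and "$2":
stri_replace_all_regex("Sir Launcelot", "^(\\w+) (\\w+)$", "$2, $1")
# [1] "Launcelot, Sir"
```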

Anchoring
Lastly, let us mention the ways to match a pattern at a given abstract position within a string.
Matching at the beginning or end of a string. "^" and "$" match, respectively, start and end of the string (or each line within a string, if the multi_line option is set to TRUE).
R> x <- c("spam egg", "bacon spam", "spam", "egg spam bacon", "sausage")
R> p <- c("spam", "^spam", "spam$", "spam$|^spam", "^spam$")
R> structure(outer(x, p, stri_detect_regex), dimnames = list(x, p))
                spam ^spam spam$ spam$|^spam ^spam$
spam egg        TRUE  TRUE FALSE        TRUE  FALSE
bacon spam      TRUE FALSE  TRUE        TRUE  FALSE
spam            TRUE  TRUE  TRUE        TRUE   TRUE
egg spam bacon  TRUE FALSE FALSE       FALSE  FALSE
sausage        FALSE FALSE FALSE       FALSE  FALSE
The 5 regular expressions match "spam", respectively: anywhere within the string, at the beginning, at the end, at the beginning or end, and in strings that are equal to the pattern itself.
Matching at word boundaries. Furthermore, "\b" matches at a "word boundary", e.g., near spaces, punctuation marks, or at the start/end of a string (i.e., wherever there is a transition between a word, "\w", and a non-word character, "\W", or vice versa).

[[1]]
[1] "12"   "34.5"
Note the possessive quantifier, "?+": try matching a dot and a sequence of digits and, if such a sequence is present but not followed by a word boundary, do not retry by matching a word boundary only.
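A pattern of this kind can be sketched as follows (the input string is our own):

```r
library("stringi")

# digits, optionally (possessively) a fractional part, then a word boundary:
stri_extract_all_regex("lengths: 12 and 34.5", "\\d+(\\.\\d+)?+\\b")
# [[1]]
# [1] "12"   "34.5"
```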
Looking behind and ahead. There are also ways to guarantee that a pattern occurrence begins or ends with a match to some subexpression:
[[1]]
[1] "spam" "spam" "eggs" "spam"

[[2]]
[1] "I"    "like" "and"
The first regex captures words that are followed by "," or ".". The second one matches words that are followed by neither "," nor ".".
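These results can be reproduced with look-ahead assertions, for instance (an input of our own, consistent with the matches above):

```r
library("stringi")

x <- "I like spam, spam, eggs, and spam."
# (?=...) is a positive, (?!...) a negative zero-width look-ahead:
stri_extract_all_regex(x, c("\\w+(?=[,.])", "\\w+\\b(?![,.])"))
# [[1]]
# [1] "spam" "spam" "eggs" "spam"
#
# [[2]]
# [1] "I"    "like" "and"
```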

Collation
Historically, code-pointwise comparison was used in most string comparison activities, especially when strings in ASCII (i.e., English) were involved. However, nowadays this does not necessarily constitute the most suitable approach to the processing of natural-language texts. In particular, code-pointwise matching takes neither accented and conjoined letters nor ignorable punctuation and case into account.
The ICU Collation Service provides the basis for string comparison activities such as string sorting and searching, or determining if two strings are equivalent. This time, though, due to its conformance to the Unicode Collation Algorithm (Davis, Whistler, and Scherer 2021), we may expect that the generated results will meet the requirements of culturally correct natural language processing in any locale.

Locales
String collation is amongst many locale-sensitive operations available in stringi. Before proceeding any further, we should first discuss how we can parameterize the ICU services so as to deliver the results that reflect the expectations of a specific user community, such as the speakers of different languages and their various regional variants.

Specifying locales.
A locale specifier is of the form "Language", "Language_Country", or "Language_Country_Variant", where:
• Language is, most frequently, a two- or three-letter code that conforms to the ISO-639-1 or ISO-639-2 standard, respectively; e.g., "en" or "eng" for English, "es" or "spa" for Spanish, "zh" or "zho" for Chinese, and "mas" for Masai (which lacks a corresponding two-letter code); however, more specific language identifiers may also be available, e.g., "zh_Hans" for Simplified and "zh_Hant" for Traditional Chinese, or "sr_Cyrl" for Cyrillic- and "sr_Latn" for Latin-script Serbian;
• Country is a two-letter code following the ISO-3166 standard that enables different language conventions within the same language; e.g., US English ("en_US") and Australian English ("en_AU") not only observe some differences in spelling and vocabulary, but also in the units of measurement;
• Variant is an identifier indicating a preference towards some convention within the same country; e.g., "de_DE_PREEURO" formats currency values using the pre-2002 Deutsche Mark (DEM).
Moreover, following the "@" symbol, semicolon-separated "key=value" pairs can be appended to the locale specifier, in order to customize some locale-sensitive services even further (see below for an example using "@collation=phonebook" and Section 8.5 for "@calendar=hebrew", amongst others).

Listing locales.
To list the available locale identifiers, we call stri_locale_list():
R> length(stri_locale_list())
[1] 784
As the number of supported locales is very high, here we shall display only 5 randomly chosen ones:
R> sample(stri_locale_list(), 5)
[1] "nl_CW"      "pt_CH"      "ff_Latn_SL" "en_PH"      "en_HK"
Querying for locale-specific services. The availability of locale-specific services can only be determined during the request for a particular resource, which may depend on the ICU library version actually in use, as well as on the way the ICU Data Library (icudt) has been packaged. Therefore, for maximum portability, it is best to rely on the ICU library bundle that is shipped with stringi. This is the case on Windows and macOS, whose users typically download the pre-compiled versions of the package from CRAN. However, on various flavors of GNU/Linux and other Unix-based systems, the system ICU is used more eagerly. To force building ICU from sources, we may call:
R> install.packages("stringi", configure.args = "--disable-pkg-config")
Overall, if a requested service is unavailable in a given locale, the best possible match is returned.
Default locale. Each locale-sensitive operation in stringi selects the current default locale if no locale has been explicitly requested, i.e., when a function's locale argument (see Table 5) is left alone in its "NULL" state. The default locale is initially set to match the system locale on the current platform, and may be changed with stri_locale_set(), e.g., in the very rare case of improper automatic locale detection.
As we have stated in the introduction, in this paper we use:
R> stri_locale_get()
[1] "en_AU"
i.e., the Australian-English locale (which formats dates like "29 September 2021" and uses metric units of measurement).

Testing string equivalence
In Unicode, some characters may have multiple representations. For instance, "LATIN SMALL LETTER A WITH OGONEK" ("ą") can be stored as a single code point U+0105 or as a sequence that is comprised of the letter "LATIN SMALL LETTER A", U+0061, and the "COMBINING OGONEK", U+0328 (when rendered properly, they should appear as if they were identical glyphs). This is an example of canonical equivalence of strings.
Testing for the Unicode equivalence between strings can be performed by calling %s==% and, more generally, stri_cmp_equiv(), or their negated versions, %s!=% and stri_cmp_nequiv().
In the example below we have: a followed by ogonek (two code points) vs. a with ogonek (single code point).

[1] FALSE FALSE TRUE
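A minimal sketch of such a test (the two strings are constructed as described above):

```r
library("stringi")

x <- "a\u0328"  # "a" followed by a combining ogonek (two code points)
y <- "\u0105"   # "a with ogonek" (a single code point)
x == y                # code-pointwise comparison
# [1] FALSE
stri_cmp_equiv(x, y)  # canonical equivalence
# [1] TRUE
```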
Moreover, stri_duplicated_any() returns the index of the first non-unique element.
For instance, here is a comparison in the current default locale (Australian English):
R> "chaotic" %s<% "hard"
[1] TRUE

Collator options
Collation strength. The Unicode Collation Algorithm can go beyond simple canonical equivalence: it can treat some other (depending on the context) differences as negligible too.
The strength option controls the Collator's "attention to detail". For instance, it can be used to make the ligature "ﬀ" (U+FB00) compare equal to the two-letter sequence "ff":
R> x <- c("groß", "gross", "GROSS")
R> stri_unique(x, strength = 1)
[1] "groß"
R> stri_unique(x, strength = 2)
[1] "groß" "gross"
Hence, strength equal to 1 takes only primary differences into account. Strength of 2 will also be sensitive to secondary differences (it distinguishes between "ß" and "ss" above), but will ignore tertiary differences (case).
A note on compatibility equivalence. In Section 9.4 we describe different ways to normalize canonically equivalent code point sequences so that they are represented by the same code points, which can account for some negligible differences (as in the "a with ogonek" example above).
Apart from ignoring punctuation and case, the Unicode Standard Annex #15 also discusses the so-called compatibility equivalence of strings. This is a looser form of similarity; it is observed where there is the same abstract content, yet displayed by means of different glyphs, for instance "¼" (U+00BC) vs. "1/4", or "ℝ" (U+211D) vs. "R". In the latter case, whether these should be treated as equal depends on the context (e.g., the set of real numbers vs. one's favourite programming language). Compatibility decompositions (NFKC, NFKD) mentioned in Section 9.4 or other types of transliteration can be used to normalize strings so that such differences are not accounted for.
Also, for "fuzzy" matching of strings, the stringdist package (Van der Loo 2014) might be helpful.

Searching for fixed strings revisited
The ICU Collator can also be utilized where there is a need to locate the occurrences of simple textual patterns. The counterparts of the string search functions described in Section 5 have their names ending with *_coll(). Albeit slower than the *_fixed() functions, they are more appropriate in natural language processing activities.
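For instance (an illustrative call of our own, echoing the earlier "groß"/"GROSS" example; the collator options are passed via "..."):

```r
library("stringi")

stri_detect_fixed("groß", "GROSS")                # code-pointwise: no match
# [1] FALSE
stri_detect_coll("groß", "GROSS", strength = 1)   # primary strength: "ß" ~ "ss", case ignored
# [1] TRUE
```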

Other operations
In the sequel, we cover the functions that deal with text boundary detection, random string generation, date/time formatting and parsing, amongst others.

Analyzing text boundaries
Text boundary analysis aims at locating linguistic delimiters for the purpose of splitting text into lines, word-wrapping, counting characters or words, locating particular text units (e.g., the 3rd sentence), etc.
Generally, text boundary analysis is a locale-sensitive operation, see Davis and Chapman (2021). For example, in Japanese and Chinese, spaces are not used for separation of words -a line break can occur even in the middle of a word. Nevertheless, these languages have punctuation and diacritical marks that cannot start or end a line, so this must also be taken into account.
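For instance (the inputs below are our own):

```r
library("stringi")

# count word-like tokens (punctuation is not counted):
stri_count_words("I like spam, eggs, and spam.")
# [1] 6

# split at sentence boundaries:
stri_split_boundaries("I like spam. I like eggs.", type = "sentence")
```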

Trimming, padding, and other formatting
The following functions can be used for pretty-printing character strings or text on the console, dynamically generating reports (e.g., with Sweave() or knitr; see Xie 2015), or creating text files (e.g., with stri_write_lines(); see Section 9.3).
Padding. stri_pad() pads strings with some character so that they reach the desired widths (as in stri_width()). This can be used to centre, left-, or right-align a message when printed with, e.g., cat().
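For example (an illustrative call of our own):

```r
library("stringi")

# centre a 4-character string within a 12-character field:
stri_pad("SPAM", width = 12, side = "both", pad = "*")
# [1] "****SPAM****"
```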

Trimming.
A dual operation is that of trimming from the left or right side of strings:
R> x <- " spam, eggs, and lovely spam.\n"
R> stri_trim(x)
[1] "spam, eggs, and lovely spam."
Word wrapping. The stri_wrap() function splits each (possibly long) string in a character vector into chunks of at most a given width. By default, the dynamic word wrap algorithm (Knuth and Plass 1981) that minimizes the raggedness of the formatted text is used. However, there is also an option (cost_exponent = 0) to use the greedy alignment, for compatibility with the built-in strwrap().
Applying string templates. stri_sprintf() is a Unicode-aware rewrite of the built-in sprintf() function. In particular, it enables formatting and padding based on character width, not just the number of code points. The function is also available as a binary operator, %s$%, which is similar to Python's % overloaded for objects of type str.
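A brief illustration (the format strings are our own):

```r
library("stringi")

stri_sprintf("%8s: %.2f", "spam", 3.14159)  # width-aware padding and formatting
# [1] "    spam: 3.14"

"[%6s]" %s$% "spam"                         # the operator form
# [1] "[  spam]"
```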

Generating random strings
Apart from stri_rand_lipsum(), which produces random-ish text paragraphs ("placeholders" for real text), we have access to a function that generates sequences of characters uniformly sampled (with replacement) from a given set.
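For example, the following generates three strings of length 8, each character drawn uniformly from a character set given as a regex-like class (the seed makes the draw reproducible within a given R version):

```r
library("stringi")

set.seed(123)
stri_rand_strings(3, 8, "[a-z0-9]")
```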

Transliterating
Transliteration, in its broad sense, deals with the substitution of characters or their groups for different ones, according to some well-defined, possibly context-aware, rules. It may be useful, amongst others, when "normalizing" pieces of strings or identifiers so that they can be more easily compared with each other.
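As an example, the built-in "latin-ascii" transform substitutes ASCII approximations for Latin letters with diacritics and ligature-like characters (the input is our own):

```r
library("stringi")

stri_trans_general("groß żółć", "latin-ascii")
# [1] "gross zolc"
```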
Case mapping. Mapping to upper, lower, or title case is a language-and context-sensitive operation that can change the total number of code points in a string.
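For instance (illustrative calls of our own):

```r
library("stringi")

stri_trans_toupper("groß")          # one code point more than the input
# [1] "GROSS"
stri_trans_totitle("spam and eggs") # title case, word by word
# [1] "Spam And Eggs"
```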

Input and output
This section deals with some more advanced topics related to the operability of text processing applications between different platforms. In particular, we discuss how to assure that data read from various input connections are interpreted in the correct manner.

Dealing with Unicode code points
The Unicode Standard (as well as the Universal Coded Character Set, i.e., ISO/IEC 10646) currently defines over 140,000 abstract characters together with their corresponding code points: integers between 0 and 1,114,111 (or, in hexadecimal notation, between 0000 and 10FFFF; see https://www.unicode.org/charts/). In particular, here is the number of code points in some popular categories (compare Section 6.1), such as letters, numbers, and the like.
The first 256 code points are identical to the ones defined by ISO/IEC 8859-1 (ISO Latin-1; "Western European"), which itself extends US-ASCII (codes not greater than 127, i.e., hexadecimal 7F). For instance, the code point that we are used to denoting as U+007A (the "U+" prefix is followed by a sequence of hexadecimal digits; hexadecimal 7A corresponds to decimal 122) encodes the lower case letter "z". To input such a code point in R, we write:
R> "\u007A"
[1] "z"
For communicating with ICU and other libraries, we may need to escape a given string, for example, as follows (recall that to input a backslash in R, we must escape it with another backslash):
R> x <- "zß 你好"
R> stri_escape_unicode(x)
[1] "z\\u00df \\u4f60\\u597d"
It is worth noting that despite the fact that some output devices might be unable to display certain code points correctly (due to, e.g., missing fonts), the correctness of their processing with stringi is still guaranteed by ICU.

Character encodings
When storing strings in RAM or on the disk, we need to decide upon the actual way of representing the code points as sequences of bytes. The two most popular encodings in the Unicode family are UTF-8 and UTF-16:
R> x <- "abz0ąß 你好!"
R> stri_encode(x, to = "UTF-8", to_raw = TRUE)
[[1]]
 [1] 61 62 7a 30 c4 85 c3 9f 20 e4 bd a0 e5 a5 bd 21
The package also provides a means to guess the unknown encoding of a byte stream, stri_enc_detect(). Nevertheless, encoding detection is an operation that relies on heuristics; therefore, there is a chance that the output might be imprecise or even misleading.

Converting encodings.
Knowing the desired source and destination encoding precisely, stri_encode() can be called to perform the conversion. Contrary to the built-in function iconv(), which relies on different underlying libraries, the current function is portable across operating systems.
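For instance (an illustrative conversion of our own):

```r
library("stringi")

# re-encode a UTF-8 string as ISO-8859-1 (Latin-1) bytes:
stri_encode("zß", to = "ISO-8859-1", to_raw = TRUE)
# [[1]]
# [1] 7a df
```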
Splitting the output into text lines gives:
R> tail(stri_split_lines1(y), 4)
Users seeking Unicode-aware replacements for base R string processing functions are kindly referred to the stringx package (Gagolewski 2021), which is a set of wrappers around stringi offering a more classic API (functions such as grepl(), substring(), etc.; compare Table 1).
stringi functions can also be accessed from within C++ code. Authors of statistical/data analysis software who would like to speed up their projects are encouraged to check out the Exam-pleRcppStringi package available at https://github.com/gagolews/ExampleRcppStringi, which serves as a working prototype developed using Rcpp (Eddelbuettel 2013).
Future of stringi. Over the years, many useful R packages related to text processing have been developed; see Feinerer, Hornik, and Meyer (2008) and Welbers, Van Atteveldt, and Benoit (2017) for some reviews. Several of them are listed in the CRAN Task View Natural Language Processing (Wild 2022). At the time of writing, stringi itself had over 200 strong (direct) reverse dependencies and has established itself as one of the most frequently downloaded R extensions. Its user base is growing steadily.
Most importantly, the package can be relied upon by other software projects as its API is considered stable and most changes are backward compatible.
Future work will involve the porting of stringi to different scientific/statistical computing environments, including Julia and Python with the NumPy (Van der Walt, Colbert, and Varoquaux 2011) ecosystem, offering more Unicode-aware alternatives to the vectorized text processing facilities from numpy.char and pandas (McKinney 2017, Chapter 7).
Moreover, stringi shall be extended further so as to provide an even broader coverage of the ICU services.