How can you split a character vector into substrings in R? The strsplit() function divides the elements of a character vector into substrings using a specified delimiter or pattern. In this article, I will explain the strsplit() function, covering its syntax, parameters, and usage, and show how to split the elements of a character vector into a list of substrings based on a specified pattern.
Key points
- strsplit() is an R function that splits character vectors or strings into substrings based on a specified delimiter.
- The function returns a list of substrings obtained by splitting the input string at the specified delimiter.
- You can use regular expressions as delimiters for more advanced splitting patterns.
- Setting the fixed parameter to TRUE allows splitting by a specified delimiter without treating it as a regular expression.
- Multiple delimiters can be specified using regular expressions, enabling splits based on any of the given delimiters.
- The perl parameter enables the use of Perl-compatible regular expressions for more sophisticated pattern matching.
- Understanding strsplit() and its parameters allows for efficient manipulation and parsing of character vectors in R.
- When working with character data in R, you often need to split strings into smaller components for data manipulation, analysis, or processing tasks.
strsplit() Function
The strsplit() function in R splits a character vector or string into substrings based on a specific delimiter, which can be a literal character or a regular expression pattern. You pass the delimiter through the split parameter, and the function divides the character vector into a list of substrings.
Syntax of the strsplit() Function
Following is the syntax of the strsplit() function.
# Syntax of the strsplit() function
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
Parameters of strsplit()
Following are the parameters of the strsplit() function.
x : A character vector whose elements are to be split. A single string is simply a character vector of length one.
split : The delimiter used to split the string. By default it is interpreted as a regular expression (unless fixed = TRUE). It can also be a vector of such strings, in which case it is recycled along x.
fixed : A logical value. If TRUE, split is matched exactly as a literal string; otherwise, it is treated as a regular expression.
perl : A logical value that, if set to TRUE, enables Perl-compatible regular expressions.
useBytes : A logical value. If TRUE, the matching is done byte-by-byte rather than character-by-character.
Return Value
The function returns a list of the same length as x. Each element of the list is a character vector holding the substrings obtained by splitting the corresponding element of x at the specified delimiter.
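Because the return value is a list, it can help to see what happens when the input has more than one element. The following is a minimal sketch; the vector langs and the variable names are made up for illustration and are not part of the examples in this article.

```r
# Hypothetical input: three strings in one character vector
langs <- c("java python", "r", "pyspark scala spark")

# Split every element on a single space
parts <- strsplit(langs, " ")

# The list has one element per input string
n_elements <- length(parts)

# lengths() reports how many substrings each element produced
n_substrings <- lengths(parts)

# Flatten into one character vector when the grouping is not needed
all_parts <- unlist(parts)

# Pick the first substring of every element
first_parts <- vapply(parts, function(p) p[1], character(1))

print(parts)
print(all_parts)
print(first_parts)
```

When only a single string is split, indexing the result with [[1]] is a common way to get a plain character vector back.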
Split Character Vector in R by Delimiter
Let’s create a character vector and use the strsplit() function to split its elements into substrings based on a specified delimiter. The result will be a list of substrings.
# Split the character vector using strsplit
# Create character vector
char_vec = "java python r pyspark"
print("Character Vector")
print(char_vec)
split_vector <- strsplit(char_vec, " ")
print("After splitting the vector:")
print(split_vector)
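# Output:
# [1] "Character Vector"
# [1] "java python r pyspark"
# [1] "After splitting the vector:"
# [[1]]
# [1] "java" "python" "r" "pyspark"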
Split the Character Vector by Regular Expression
You can also split elements of a character vector into substrings using regular expressions. Below is an example of splitting a character vector at any sequence of digits from 0 to 9 using the regular expression [0-9]+.
# Split the character vector by regular expression
# Create character vector
char_vec = "Spark2By4Examples1."
print("Character Vector")
print(char_vec)
split_vector <- strsplit(char_vec, "[0-9]+")
print("After splitting the vector:")
print(split_vector)
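From the above, the string is split at every run of one or more digits. In the pattern [0-9]+, the character class [0-9] matches a single digit and the + quantifier means "one or more", so consecutive digits are consumed as a single delimiter. The digits themselves do not appear in the result.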
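# Output:
# [1] "Character Vector"
# [1] "Spark2By4Examples1."
# [1] "After splitting the vector:"
# [[1]]
# [1] "Spark" "By" "Examples" "."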
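To see what the + quantifier contributes in a digit pattern, it can be compared with the same character class without it. This is a small sketch; the input string "a12b" and the variable names are chosen only for illustration.

```r
# Hypothetical input with two adjacent digits
code <- "a12b"

# Without the quantifier, each digit is its own delimiter,
# so an empty string appears between "1" and "2"
single_digit <- strsplit(code, "[0-9]")[[1]]

# With +, the run "12" is treated as one delimiter
digit_run <- strsplit(code, "[0-9]+")[[1]]

print(single_digit)
print(digit_run)
```

The empty string in the first result is the zero-length piece between the two adjacent delimiters, which is usually not what you want when parsing.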
Split the R Vector with a fixed Param
Alternatively, you can use the fixed parameter of the strsplit() function to split the vector. With fixed = TRUE, the split argument is matched as a literal string rather than interpreted as a regular expression, and the function splits the character vector into a list of substrings at each literal match.
# Split the character vector by fixed = TRUE
# Create character vector
char_vec = "java/python/r/pyspark"
split_vector <- strsplit(char_vec, "/", fixed = TRUE)
print("After splitting the vector:")
print(split_vector)
# Output:
# [1] "After splitting the vector:"
# [[1]]
# [1] "java" "python" "r" "pyspark"
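The difference between literal and regular-expression matching becomes visible when the delimiter is a regex metacharacter such as a period. The sketch below uses a made-up input, "a.b.c", to contrast the two modes.

```r
# Hypothetical input that uses a period as the delimiter
dotted <- "a.b.c"

# As a regular expression, "." matches any character,
# so every character is consumed as a delimiter
as_regex <- strsplit(dotted, ".")[[1]]

# With fixed = TRUE, "." is matched literally
as_literal <- strsplit(dotted, ".", fixed = TRUE)[[1]]

# Escaping the period in a character class is a regex alternative
as_class <- strsplit(dotted, "[.]")[[1]]

print(as_regex)
print(as_literal)
print(as_class)
```

Using fixed = TRUE is a reasonable default whenever the delimiter is a plain string, since it avoids having to remember which characters need escaping.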
Split the Character Vector by Multiple Delimiters
To split a character vector by multiple delimiters, you can specify multiple delimiters using regular expressions. This allows splitting the vector into substrings whenever any of the specified delimiters are encountered.
# Split the Character Vector by Multiple Delimiters
# Create character vector
char_vec = "java/python,r%pyspark"
# Assign to split_vector so the print() call below shows this result
split_vector <- strsplit(char_vec, "/|,|%")
print("After splitting the vector:")
print(split_vector)
# Output:
# [1] "After splitting the vector:"
# [[1]]
# [1] "java" "python" "r" "pyspark"
Here, the regular expression /|,|% splits the vector at each occurrence of any of the specified delimiters.
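An alternation such as /|,|% can also be written as a character class, and adding a + quantifier lets a run of adjacent delimiters count as one. The following sketch assumes a made-up input in which delimiters appear back to back.

```r
# Hypothetical input with adjacent delimiters
messy <- "java//python,%r"

# Alternation: every single delimiter splits, leaving empty strings
by_alternation <- strsplit(messy, "/|,|%")[[1]]

# Character class with +: a run of delimiters is one split point
by_class_run <- strsplit(messy, "[/,%]+")[[1]]

print(by_alternation)
print(by_class_run)
```

Which form is appropriate depends on the data: keep the empty strings if they represent missing fields, and collapse the runs if they are just noise.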
Split the R Vector with a perl Param
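Finally, you can use the perl parameter to enable Perl-compatible regular expressions for more advanced pattern matching when splitting a character vector. By setting perl = TRUE, the pattern is interpreted by the Perl-compatible engine, which supports constructs such as lookahead and lookbehind. The pattern in the example below, [/%,], is a simple character class that would also work without perl = TRUE; it is used here only to show the parameter.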
# Split the character vector using strsplit by setting perl = TRUE
# Create character vector
char_vec = "java/python%r,pyspark"
# Split the character vector using strsplit
split_vector <- strsplit(char_vec, "[/%,]", perl = TRUE)
print("After splitting the vector:")
print(split_vector)
# Output:
# [1] "After splitting the vector:"
# [[1]]
# [1] "java" "python" "r" "pyspark"
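A case where perl = TRUE is actually required is a pattern with a lookahead. The sketch below splits on commas that are not followed by a space, so "Last, First" pairs stay intact; the input string and variable names are made up for illustration.

```r
# Hypothetical input: records separated by "," and names written as "Last, First"
people <- "Doe, Jane,Smith, John,R"

# ",(?! )" matches a comma only when the next character is not a space.
# The negative lookahead (?! ) needs the Perl-compatible engine.
records <- strsplit(people, ",(?! )", perl = TRUE)[[1]]

print(records)
```

The lookahead only checks the following character and does not consume it, so nothing besides the comma is removed from the output.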
Conclusion
In this article, I discussed the strsplit() function, an essential tool in R for managing character data, providing flexibility and control over the splitting of strings into substrings. We explored the syntax, parameters, and usage of the strsplit() function, demonstrating how to split elements in a character vector into a list of substrings based on specified single or multiple delimiters.
Happy Learning!!