Python regex split

The re module is Python's built-in library for regular expressions, and it is useful for a wide range of text manipulation and analysis tasks. It provides a function for most common text-processing jobs, for example, re.search(), re.match(), re.sub() (for replacing), and re.findall(). One of the most useful of these is re.split(), which splits a string into a list of substrings.

Introduction to re.split()

Before you start using re.split(), it helps to know what it does and how it is called. The re.split() function splits a string into a list of substrings wherever a specified pattern matches. Below is its syntax:


re.split(pattern, string, maxsplit=0, flags=0)

The table below explains the arguments:

Argument	Description
pattern	The regular expression pattern that defines the splitting criteria
string	The input string to be split
maxsplit (optional)	The maximum number of splits to perform; the rest of the string is returned as the final element of the list. The default is 0, which means there is no limit and all possible splits are made
flags (optional)	Additional flags that modify the behavior of the regex matching. You can use constants like `re.IGNORECASE` for case-insensitive matching

re.split() method arguments
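
To make the two optional arguments concrete, here is a minimal sketch of how they change the result. The sample strings and variable names are illustrative assumptions, not part of the examples in this article.

```python
import re

# illustrative sample data (an assumption, not from the article's examples)
record = "2024,01,15,some,free,text"

# maxsplit=2 stops after two splits; the rest of the string stays intact
limited = re.split(r',', record, maxsplit=2)
print(limited)  # ['2024', '01', '15,some,free,text']

# re.IGNORECASE lets the pattern 'x' match both 'x' and 'X'
dimensions = "10x20X30"
sizes = re.split(r'x', dimensions, flags=re.IGNORECASE)
print(sizes)  # ['10', '20', '30']
```

Passing maxsplit and flags by keyword, as done here, keeps the call readable and avoids mixing the two up.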

Splitting with Simple Patterns

Now that you know the syntax of re.split(), let us look at some examples that use simple split patterns.

Splitting by whitespace

The first example splits a sentence on whitespace:


import re

# declaring a sentence to be split
sentence = "Hello, welcome to spark by examples?"

# declaring a pattern for splitting the sentence
pattern = r'\s'

# assigning the result of the method to words variable
words = re.split(pattern, sentence) 

# prints the words on the console
print(words)

In this example, the pattern r'\s' is used, which matches a single whitespace character (a space, tab, newline, and so on). The re.split() function splits the sentence string into a list of substrings wherever a whitespace character is found. The resulting list, words, contains each word as an element. Punctuation is not whitespace, so it stays attached to the neighboring word.

Output:


['Hello,', 'welcome', 'to', 'spark', 'by', 'examples?']
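
One thing to watch for with r'\s' is that every whitespace character is its own split point, so consecutive whitespace produces empty strings. The sketch below compares it with r'\s+', which treats a run of whitespace as one delimiter. The sample text is an illustrative assumption.

```python
import re

# illustrative sentence containing a double space and a tab
text = "Hello,  welcome\tto Python"

# r'\s' splits at every single whitespace character,
# so the double space leaves an empty string behind
single_ws = re.split(r'\s', text)
print(single_ws)  # ['Hello,', '', 'welcome', 'to', 'Python']

# r'\s+' treats each run of whitespace as one delimiter
run_ws = re.split(r'\s+', text)
print(run_ws)  # ['Hello,', 'welcome', 'to', 'Python']
```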

Splitting by comma

You might also want to split a string on commas. The code for that is as follows:


import re

# declaring a string to be split
languages = "Python, Scala, PySpark, Java, C++, C#"

# declaring a pattern for splitting the sentence
pattern = r','

# assigning the result of the method to language_list variable
language_list = re.split(pattern, languages)

# prints the language_list on the console
print(language_list)

In the code snippet, a string languages is declared, which contains a list of programming languages separated by commas.

The pattern r',' is defined to represent the comma character. This pattern will be used to split the string.

The re.split() method is called with the pattern and languages as arguments. It splits the languages string wherever a comma is found and returns a list of substrings.

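Finally, the resulting list is assigned to the variable language_list. Because the pattern matches only the comma itself, the space that follows each comma stays attached to the next item, as the output shows.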

Output:


['Python', ' Scala', ' PySpark', ' Java', ' C++', ' C#']
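
If the leading spaces in that list are not wanted, one option is to make the whitespace part of the delimiter. This is a minimal sketch of that idea, reusing the same languages string; the pattern r',\s*' is a suggested variation, not the pattern used above.

```python
import re

languages = "Python, Scala, PySpark, Java, C++, C#"

# a comma followed by zero or more whitespace characters
pattern = r',\s*'

language_list = re.split(pattern, languages)
print(language_list)  # ['Python', 'Scala', 'PySpark', 'Java', 'C++', 'C#']
```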

Splitting by period

What if you want to split a sentence or string on a period (.)? Here is how you would implement the code:


import re

# declaring a sentence to be split
sentence = "SparkByExamples is very good resource. Especially for those starting programming."

# declaring a pattern for splitting the sentence
pattern = r'\.'

# assigning the result of the method to sentences variable
sentences = re.split(pattern, sentence)

# prints the sentences on the console
print(sentences)

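Here, a sentence is declared in the sentence variable. The pattern r'\.' is defined to represent the period character. The backslash is needed because an unescaped . is a regex metacharacter that matches almost any character. This pattern will be used to split the sentence into separate sentences.

The re.split() function is called with the pattern and sentence as arguments. It splits the sentence string wherever a period is found and returns a list of substrings representing each sentence. Because the string ends with a period, the last element of the list is an empty string.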

Output:


['SparkByExamples is very good resource', ' Especially for those starting programming', '']
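
The output above contains a leading space on the second sentence and an empty string at the end. One way to tidy this up, sketched here as a suggestion rather than as part of re.split() itself, is to strip each piece and drop the empty ones with a list comprehension.

```python
import re

sentence = "SparkByExamples is very good resource. Especially for those starting programming."

# split on a literal period, as in the example above
parts = re.split(r'\.', sentence)

# strip surrounding whitespace and discard pieces that end up empty
sentences = [part.strip() for part in parts if part.strip()]
print(sentences)
# ['SparkByExamples is very good resource', 'Especially for those starting programming']
```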

Splitting with Complex Patterns

As you work with re.split(), you will encounter scenarios where a single literal delimiter is not enough to split the text the way you need. In those cases, you can use more complex patterns.

Splitting by multiple characters

Let us look at an example where we split the sentence using multiple characters:


import re

# declaring a sentence to be split
sentence = "Spark-By-Examples, A fine_platform_to_learn_coding,quickly"

# declaring a pattern for splitting the sentence
pattern = r'[-_, ]'

# assigning the result of the method to parts variable
parts = re.split(pattern, sentence)

# prints the parts on the console
print(parts)

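In this example, we split the sentence string on any of several characters: hyphen, underscore, comma, and space. The pattern r'[-_, ]' is a character class that matches any one of these characters, and each part of the sentence becomes a separate element in the resulting list. Note the empty string in the output: the comma and the space after "Examples" are two adjacent delimiters, so re.split() returns the empty string that lies between them.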

Output:


['Spark', 'By', 'Examples', '', 'A', 'fine', 'platform', 'to', 'learn', 'coding', 'quickly']
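
The empty string in the output above appears because the comma and the space after "Examples" are matched as two separate delimiters. A minimal sketch of one way to avoid it is to add a + quantifier so that a run of delimiter characters counts as a single split point. The pattern r'[-_, ]+' is a suggested variation on the one above.

```python
import re

sentence = "Spark-By-Examples, A fine_platform_to_learn_coding,quickly"

# one or more consecutive delimiter characters count as a single delimiter
pattern = r'[-_, ]+'

parts = re.split(pattern, sentence)
print(parts)
# ['Spark', 'By', 'Examples', 'A', 'fine', 'platform', 'to', 'learn', 'coding', 'quickly']
```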

Splitting using lookarounds

If you want to split at a position in the string rather than on a delimiter character, you can use lookarounds. Here is how you would implement the code:


import re

# declaring a sentence to be split
sentence = "1234ABC5678DEF91011"

# declaring a pattern for splitting the sentence
pattern = r'(?<=\d)(?=[A-Z])'

# assigning the result of the method to parts variable
parts = re.split(pattern, sentence)

# prints the parts on the console
print(parts)

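In this example, we split the sentence string at each position where a digit is immediately followed by an uppercase letter. The pattern r'(?<=\d)(?=[A-Z])' uses a positive lookbehind (?<=\d) and a positive lookahead (?=[A-Z]) to find the desired split positions. Lookarounds match a position without consuming any characters, so nothing is removed from the string, and each part becomes a separate element in the resulting list. Splitting on an empty match like this requires Python 3.7 or later.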

Output:


['1234', 'ABC5678', 'DEF91011']
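
The pattern above only splits where a digit is followed by a letter, which is why 'ABC5678' stays in one piece. As a sketch of how the same technique extends, the alternation below also splits where an uppercase letter is followed by a digit. The combined pattern is an illustrative variation, and like the original it relies on empty-match splitting, which needs Python 3.7 or later.

```python
import re

sentence = "1234ABC5678DEF91011"

# split at digit-to-letter boundaries OR letter-to-digit boundaries
pattern = r'(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)'

parts = re.split(pattern, sentence)
print(parts)  # ['1234', 'ABC', '5678', 'DEF', '91011']
```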

Conclusion

That concludes this tutorial. Thanks for reading, and please do read the other tutorials on Python RegEx.