Python String Unicode

How to convert string to Unicode characters in Python? Strings in Python are sequences of characters that are used to represent text. Unicode is a character encoding standard that aims to cover all characters from all human languages and various symbols. It allows computers to handle text in a uniform and represent and standardized way. Python includes built-in support for Unicode strings, and there are different techniques to create Unicode strings.

1. Quick Examples of Python String Unicode

If you are in a hurry, below are some quick examples of converting strings to Unicode characters.


# Quick examples of converting string to unicode characters

import re

# Initializing string
string = "Sparkbyexamples"

# Example 1: Using re.sub() and ord() method
def unicode_replacement(match):
    char = match.group()
    return r'\u{:04X}'.format(ord(char))
result = re.sub(r'.', unicode_replacement, string)

# Example 2: Convert string to Unicode 
# Using re.sub() and ord() method
result = (re.sub('.', lambda x: r'\u % 04X' % ord(x.group()), string))

# Example 3: Convert string to Unicode 
# Using join(), format(), and ord() method
string = "Python"
unicode_result = ' '.join([r'\u{:04X}'.format(ord(ele)) for ele in string])

2. Convert String to Unicode Using re.sub() and ord() Method

You can convert a string to its Unicode representation using the re.sub() function and the ord() function. For example, you can import the re module to work with regular expressions and define the string containing the characters you want to convert to Unicode. Then you can define a function named unicode_replacement that takes a match object as an argument.

This function extracts the matched character using match.group(), converts it to its Unicode code point using ord(), and formats the Unicode code point as a four-digit hexadecimal representation with leading zeros using '{:04X}'.format(ord(char)). The re.sub() function is then used to replace each character in the string with its Unicode representation. The regular expression matches each individual character. The unicode_replacement function is used as the replacement function within re.sub(). Finally, you print the original string and the resulting string with Unicode representations.


import re

# Initializing string
string = "Sparkbyexamples"
print("Original string:", string)

# Using re.sub() and ord() method
def unicode_replacement(match):
    char = match.group()
    return r'\u{:04X}'.format(ord(char))

result = re.sub(r'.', unicode_replacement, string)
print("After converting the string to unicode:\n", result)

Yields below output.

To convert a string to its Unicode use the re.sub() and ord(), is another approach. When using the \u escape sequence in a regular expression, ensure it is followed by four hexadecimal digits representing the Unicode code point. However, there’s a space in the \u % 04X part, which should be removed.

In the below example. It takes the input string "Sparkbyexamples" and converts each character in the string to its Unicode representation using the re.sub() function and the ord() function. The regular expression matches each character in the string and the lambda function within re.sub() converts each matched character to its Unicode code point using the %04X format specifier for four-digit hexadecimal representation.


import re
 
# Initializing string
string = "Sparkbyexamples"
print("Original string:", string)
 
# Convert string to Unicode 
# Using re.sub() and ord() method
result = (re.sub('.', lambda x: r'\u % 04X' % ord(x.group()), string))
print("After converting the string to unicode:\n", result)

# Output:
# Original string: Sparkbyexamples
# After converting the string to unicode: \u  053\u  070\u  061\u  072\u  06B\u  062\u  079\u  065\u  078\u  061\u  06D\u  070\u  06C\u  065\u  073

3. Convert String to Unicode Using join(), format(), and ord()

You can also use the join() method along with the format() method and the ord() function to convert a string to its Unicode representation.

In the below example, you use a list comprehension to iterate through each character in the string. For each character, you can use the ord() function to get its Unicode code point, and then you use the format() method to create the Unicode escape sequence using the hexadecimal value of the code point. The resulting Unicode escape sequences are joined using the join() method, and each Unicode code point is separated by a space.


# Initializing string
string = "Python"
print("Original string:", string)

# Convert string to Unicode 
# Using join(), format(), and ord() method
unicode_result = ' '.join([r'\u{:04X}'.format(ord(ele)) for ele in string])
print("After converting the string to Unicode:", unicode_result)

Yields below output.

Conclusion

In this article, I have explained some of the methods of Python like re.sub(), ord(), lambda, join(), and format() functions and using these methods how you can convert string to its Unicode representation with well-defined examples.

Happy Learning !!

1. Quick Examples of Python String Unicode

2. Convert String to Unicode Using re.sub() and ord() Method

3. Convert String to Unicode Using join(), format(), and ord()

Conclusion

Related Articles