Working with Unicode in Python
Last Updated: 31 Jan, 2024
Unicode serves as the global standard for character encoding, ensuring uniform text representation across diverse computing environments. Python, a widely used programming language, adopts the Unicode Standard for its strings, facilitating internationalization in software development.
This tutorial aims to provide a foundational understanding of working with Unicode in Python, covering key aspects such as encoding, normalization, and handling Unicode errors.
How To Work With Unicode In Python?
Below are some of the ways by which we can work with Unicode in Python:
- Converting Unicode Code Points
- Normalize Unicode
- Unicode with NFD & NFC
- Regular Expressions
- Solving Unicode Errors
Converting Unicode Code Points in Python
Encoding, the process of representing text as bytes, is crucial for internationalized data in Python 3. A Python 3 string (str) is a sequence of Unicode code points, and UTF-8 is the default encoding for source files and for str.encode(). Let's create the copyright symbol (©) using its Unicode code point. The example creates a string s from the escape sequence \u00A9; because the escape names a Unicode code point, printing s displays the corresponding symbol '©'.
Python3
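A minimal sketch of the example described above. The variable name s comes from the text; the ord(), chr(), and encode() calls are additions that show the relationship between the code point and its UTF-8 bytes.

```python
# Build the copyright sign from its code point, U+00A9.
s = '\u00A9'
print(s)  # ©

# ord() and chr() convert between a one-character string and its code point.
code_point = ord(s)
print(hex(code_point))        # 0xa9
print(chr(code_point) == s)   # True

# Encoding turns the str into bytes; this character needs two bytes in UTF-8.
encoded = s.encode('utf-8')
print(encoded)                        # b'\xc2\xa9'
print(encoded.decode('utf-8') == s)   # True
```

Displaying the symbol itself still depends on the terminal being able to render it.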
Normalizing Unicode in Python
Normalization is crucial for determining whether two strings that are built from different code points represent the same text. For instance, the Unicode characters 'R' (LATIN CAPITAL LETTER R) and 'ℜ' (BLACK-LETTER CAPITAL R) are the same letter in different styles, but Python strings treat them as different characters because their code points differ. Normalization helps address such issues.
The code below prints False because Python compares strings code point by code point. Normalization matters even more for combined characters, where an accented letter can be written either as one precomposed code point or as a base letter followed by a combining mark; the strings s1 and s2 in the next section show this case.
Python3
styled_R = 'ℜ'
normal_R = 'R'
print(styled_R == normal_R)
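As a hedged illustration of how normalization can resolve this mismatch, the sketch below uses normalize() and name() from the standard unicodedata module. The choice of the NFKC form and the variable name folded are assumptions made for this example, not part of the original code.

```python
from unicodedata import name, normalize

styled_R = '\u211C'   # the same character as 'ℜ' above, written as an escape
normal_R = 'R'

# The two characters have different names and code points.
print(name(styled_R))  # BLACK-LETTER CAPITAL R
print(name(normal_R))  # LATIN CAPITAL LETTER R

# NFKC applies compatibility mappings, which replace the styled letter
# with the plain one, so the comparison now succeeds.
folded = normalize('NFKC', styled_R)
print(folded == normal_R)  # True
```

Compatibility normalization discards the styling, so it suits searching and comparison better than storing the text that a user typed.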
Normalize Unicode with NFD & NFC
Python's unicodedata module provides the normalize() function for normalizing Unicode strings. The normalization forms include NFD, NFC, NFKD, and NFKC.
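The code below demonstrates the effect of different normalization forms on string length. NFD decomposes characters into a base character plus combining marks, while NFC composes them into a single precomposed character where one exists. NFKD and NFKC do the same but also apply compatibility mappings, which replace characters such as ligatures and styled letters with their plain equivalents. Here the precomposed s1 grows from 5 to 6 characters under NFD, and the decomposed s2 shrinks from 6 to 5 under NFC.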
Python3
from unicodedata import normalize
s1 = 'hôtel'
s2 = 'ho\u0302tel'
s1_nfd = normalize('NFD', s1)
print(len(s1), len(s1_nfd))
s2_nfc = normalize('NFC', s2)
print(len(s2), len(s2_nfc))
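The sketch below shows one way to put these forms to use: comparing strings after normalizing both sides. The helper name same_text is hypothetical, and the ligature example is an added illustration of the difference between NFC and NFKC.

```python
from unicodedata import normalize

def same_text(a, b, form='NFC'):
    """Compare two strings after applying the same normalization form.

    same_text is a hypothetical helper written for this example.
    """
    return normalize(form, a) == normalize(form, b)

s1 = 'h\u00f4tel'     # precomposed ô (one code point)
s2 = 'ho\u0302tel'    # 'o' followed by COMBINING CIRCUMFLEX ACCENT

print(s1 == s2)            # False: the code points differ
print(same_text(s1, s2))   # True: identical after NFC

# Compatibility forms go further than NFC and NFD.
ligature = '\ufb01'        # LATIN SMALL LIGATURE FI
print(normalize('NFC', ligature) == ligature)  # True: NFC keeps the ligature
print(normalize('NFKC', ligature))             # fi
```

Normalizing both operands with the same form is the key point; which form to choose depends on whether compatibility characters should be treated as equal to their plain counterparts.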
Solving Unicode Errors in Python
Two common Unicode errors are UnicodeEncodeError and UnicodeDecodeError. Handling these errors is crucial for robust Unicode support.
Solving a UnicodeEncodeError: The code snippet below shows different error-handling strategies for encoding a string that contains characters outside the ASCII character set.
Python3
ascii_unsupported = '\ufb06'
# Using 'ignore', 'replace', and 'xmlcharrefreplace' to handle errors
print(ascii_unsupported.encode('ascii', errors='ignore'))
print(ascii_unsupported.encode('ascii', errors='replace'))
print(ascii_unsupported.encode('ascii', errors='xmlcharrefreplace'))
Output
b''
b'?'
b'&#64262;'
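For context, the sketch below shows what happens without an errors argument: the default strict mode raises UnicodeEncodeError, which can be caught and inspected. The variable names failed and fallback, and the choice of UTF-8 as the fallback encoding, are assumptions made for this illustration.

```python
ascii_unsupported = '\ufb06'  # LATIN SMALL LIGATURE ST

failed = None
try:
    # The default errors='strict' raises instead of altering the text.
    ascii_unsupported.encode('ascii')
except UnicodeEncodeError as exc:
    # The exception records which slice of the input could not be encoded.
    failed = exc.object[exc.start:exc.end]
    print(exc.reason)  # ordinal not in range(128)

# An encoding that covers all of Unicode avoids the error entirely.
fallback = ascii_unsupported.encode('utf-8')
print(fallback)  # b'\xef\xac\x86'

# 'backslashreplace' is another handler; it keeps the code point readable.
print(ascii_unsupported.encode('ascii', errors='backslashreplace'))  # b'\\ufb06'
```

Unlike 'ignore' and 'replace', choosing a wider encoding loses no information, so it is usually worth considering first.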
Solving a UnicodeDecodeError : Below, code snippet demonstrates error handling when attempting to decode a byte string into an incompatible encoding.
Python3
iso_supported = '§A'
b = iso_supported.encode('iso8859_1')
# Using 'replace' and 'ignore' to handle errors
print(b.decode('utf-8', errors='replace'))
print(b.decode('utf-8', errors='ignore'))
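Both handlers above lose the '§' character. A minimal sketch of the alternative follows: catch the UnicodeDecodeError raised by strict decoding, then decode with the encoding the bytes were actually written in. The variable names bad_byte and text are assumptions made for this example.

```python
iso_supported = '\u00a7A'            # the same text as '§A' above
b = iso_supported.encode('iso8859_1')
print(b)  # b'\xa7A'

bad_byte = None
try:
    # Strict decoding fails because 0xA7 cannot start a UTF-8 sequence.
    b.decode('utf-8')
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start:exc.end]
    print(exc.reason)  # invalid start byte

# Decoding with the original encoding recovers the text without loss.
text = b.decode('iso8859_1')
print(text == iso_supported)  # True
```

This only works when the source encoding is known; the error handlers remain the fallback when it is not.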