🐍 Python - Encoding
The identity of a Unicode character is its code point. A code point is a number from 0 to 1,114,111, usually shown as 4 to 6 hexadecimal digits with a U+ prefix, e.g. A is U+0041.
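As a quick check of the numbers above, the built-ins `ord` and `chr` convert between a character and its code point. This is only a sketch; the helper name `code_point_label` is made up for illustration.

```python
import sys


def code_point_label(char: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least 4 hex digits, more when the code point needs them
    return f'U+{ord(char):04X}'


assert ord('A') == 0x41 == 65
assert chr(0x41) == 'A'
assert code_point_label('A') == 'U+0041'
assert code_point_label('\u20ac') == 'U+20AC'        # EURO SIGN
assert code_point_label('\U0001F40D') == 'U+1F40D'   # SNAKE emoji
# The highest code point is 0x10FFFF, i.e. 1,114,111
assert sys.maxunicode == 0x10FFFF == 1_114_111
```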
An encoding is an algorithm that converts code points to bytes. The actual bytes that represent a character depend on the encoding in use, e.g. A (U+0041) is encoded as \x41 in UTF-8 or \x41\x00 in UTF-16LE.
    cafe: bytes = bytes('café', encoding='utf8')
    assert cafe == b'caf\xc3\xa9'
    # indexing a bytes object yields integers in range(256)
    assert cafe[0] == 99
    assert cafe[1] == 97
    assert cafe[2] == 102
    assert cafe[3] == 195
    assert cafe[4] == 169
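The A example can be checked directly. The sketch below encodes the same code points with a few of Python's standard codecs; the sample characters are chosen only for illustration and are written as escapes so the code does not depend on how this page is stored.

```python
# Same code point, different bytes depending on the encoding
assert 'A'.encode('utf-8') == b'\x41'
assert 'A'.encode('utf-16-le') == b'\x41\x00'
assert 'A'.encode('utf-16-be') == b'\x00\x41'
assert 'A'.encode('utf-32-le') == b'\x41\x00\x00\x00'

# UTF-8 is variable width: 1 to 4 bytes per code point
samples = 'A\u00e9\u20ac\U0001F40D'  # A, e with acute, euro sign, snake emoji
widths = {char: len(char.encode('utf-8')) for char in samples}
assert widths == {'A': 1, '\u00e9': 2, '\u20ac': 3, '\U0001F40D': 4}
```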
You encode strings to bytes for storage or transfer. Vice versa, you decode bytes to strings so that humans can read them.
    s: str = 'café'
    assert len(s) == 4
    b: bytes = s.encode('utf8')
    assert b == b'caf\xc3\xa9'
    assert len(b) == 5
    assert b.decode('utf8') == 'café'
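Not every encoding can encode all code points: the UTF encodings can, but most legacy encodings cover only a small subset. On the other hand, some legacy encodings can decode any byte stream without raising an error, which produces "gremlins", i.e. garbage characters.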
    city: str = 'São Paulo'
    assert city.encode('utf8') == b'S\xc3\xa3o Paulo'
    # city.encode('cp437')  # => UnicodeEncodeError: can't encode character '\xe3'
    assert city.encode('cp437', errors='ignore') == b'So Paulo'
    assert city.encode('cp437', errors='replace') == b'S?o Paulo'
    assert city.encode('cp437', errors='xmlcharrefreplace') == b'S&#227;o Paulo'
    octets: bytes = b'Montr\xe9al'
    assert octets.decode('cp1252') == 'Montréal'
    assert octets.decode('koi8_r') == 'MontrИal'
    # octets.decode('utf8')  # => UnicodeDecodeError: can't decode byte 0xe9
    assert octets.decode('utf8', errors='replace') == 'Montr�al'
Python 3 source files should always use UTF-8, which is also the default source encoding. Other encodings can be declared, but just don't. The encoding of a byte sequence can only be guessed, and UTF-8 should always be your first guess.
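A minimal sketch of the "UTF-8 first" guess: try a strict UTF-8 decode and only fall back to a legacy codec when that fails. The function name `decode_guess` and the choice of cp1252 as fallback are assumptions made for this example, not a general recommendation.

```python
def decode_guess(data: bytes, fallback: str = 'cp1252') -> str:
    """Decode bytes as UTF-8 if possible, else with a fallback codec."""
    try:
        # Strict decoding: invalid UTF-8 raises instead of guessing
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Assumed fallback; undecodable bytes become U+FFFD
        return data.decode(fallback, errors='replace')


assert decode_guess(b'caf\xc3\xa9') == 'caf\u00e9'      # valid UTF-8
assert decode_guess(b'Montr\xe9al') == 'Montr\u00e9al'  # not UTF-8, cp1252 used
```

Trying UTF-8 first works reasonably well because text in a legacy encoding rarely happens to be valid UTF-8, so a successful strict decode is a strong hint.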
- UTF-16 has two variants, little-endian and big-endian.
- In little-endian order, the least significant byte of each 16-bit code unit comes first.
- In big-endian order, the most significant byte of each 16-bit code unit comes first.
- The variant in use is signalled by a byte-order mark (BOM) at the start of the data, or fixed by naming UTF-16LE / UTF-16BE explicitly. According to the standard, a UTF-16 file without a BOM should be treated as big-endian, but in practice that is not always the case.
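The byte order and the BOM can be inspected with the standard `codecs` module. This is a small sketch; the sample text is arbitrary.

```python
import codecs

text = 'A'

# Explicit endianness: no BOM is written
assert text.encode('utf-16-le') == b'\x41\x00'
assert text.encode('utf-16-be') == b'\x00\x41'

# Plain 'utf-16' prepends a BOM and uses the platform's native byte order
encoded = text.encode('utf-16')
assert encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
assert len(encoded) == 4

# When decoding, the BOM selects the byte order and is then dropped
assert codecs.BOM_UTF16_LE == b'\xff\xfe'
assert (codecs.BOM_UTF16_LE + b'\x41\x00').decode('utf-16') == 'A'
assert (codecs.BOM_UTF16_BE + b'\x00\x41').decode('utf-16') == 'A'
```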
Never rely on the default encoding. It depends on the operating system and locale.
    import locale
    assert locale.getpreferredencoding() == 'UTF-8'  # default on linux and mac systems
    with open('cafe.txt', 'w', encoding='utf8') as f:
        f.write('café')
    # default on windows systems
    with open('cafe.txt', encoding='cp1252') as f:
        assert f.read() == 'cafÃ©'
There are canonically equivalent sequences that are not equal when compared. You can normalize these with the unicodedata module. NFC produces the shortest equivalent form, while NFKC can be used to get maximum compatibility for searching and indexing, but it loses or distorts information, e.g. with exponents.
    from unicodedata import normalize
    s1: str = 'café'
    s2: str = 'cafe\u0301'
    print(s1, s2)  # => café café
    assert len(s1) == 4
    assert len(s2) == 5
    assert s1 != s2
    ns1: str = normalize('NFC', s1)
    ns2: str = normalize('NFC', s2)
    print(ns1, ns2)  # => café café
    assert len(ns1) == len(ns2)
    assert ns1 == ns2
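The information loss under NFKC mentioned above can be seen with a few compatibility characters. This is a short sketch; the sample characters are picked only for illustration.

```python
from unicodedata import normalize

# NFC leaves compatibility characters alone
assert normalize('NFC', '4\u00b2') == '4\u00b2'   # 4 followed by SUPERSCRIPT TWO
# NFKC replaces them: the exponent becomes a plain digit, so 4 squared reads as 42
assert normalize('NFKC', '4\u00b2') == '42'
# VULGAR FRACTION ONE HALF becomes '1', FRACTION SLASH (U+2044), '2'
assert normalize('NFKC', '\u00bd') == '1\u20442'
# MICRO SIGN (U+00B5) becomes GREEK SMALL LETTER MU (U+03BC)
assert normalize('NFKC', '\u00b5') == '\u03bc'
```

That is fine for building a search index, but the normalized text should not replace the stored original.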
Case folding is a quick way to improve your search results. str.casefold() works like str.lower() but also handles special cases such as ß, which becomes ss. You can go the extra mile and remove combining marks and accents too. Watching an American struggle to type ö might be fun, but it is not the best user experience.
    from unicodedata import combining, normalize
    text: str = 'Straße Pönthö, 10€'
    assert normalize('NFC', text).casefold() == 'strasse pönthö, 10€'
    def search_token(text: str) -> str:
        normalized = normalize('NFD', text)
        shaved = ''.join(c for c in normalized if not combining(c))
        return normalize('NFC', shaved).casefold()
    assert search_token(text) == 'strasse pontho, 10€'
If you are sorting strings from many languages, just use the PyUCA library. Using locale.setlocale is error-prone and tedious.
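To see why plain sorting is not enough, the sketch below sorts a few accented words with the default code point order and then with a rough stdlib-only key. The key function `shave_key` is a hypothetical stand-in for illustration and does not implement the Unicode Collation Algorithm.

```python
from unicodedata import combining, normalize

fruits = ['caju', 'atemoia', 'caj\u00e1', 'a\u00e7a\u00ed', 'acerola']

# Default sort compares code points, so accented letters land after 'z'
by_code_point = sorted(fruits)
assert by_code_point == ['acerola', 'atemoia', 'a\u00e7a\u00ed', 'caju', 'caj\u00e1']


def shave_key(text: str) -> str:
    """Hypothetical sort key: drop combining marks, then casefold."""
    decomposed = normalize('NFD', text)
    return ''.join(c for c in decomposed if not combining(c)).casefold()


# Accents are ignored for ordering, which matches what a reader expects here
by_shaved = sorted(fruits, key=shave_key)
assert by_shaved == ['a\u00e7a\u00ed', 'acerola', 'atemoia', 'caj\u00e1', 'caju']
```

With PyUCA installed, the usual pattern is `sorted(fruits, key=pyuca.Collator().sort_key)`, which applies the real collation rules instead of this approximation.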
- Python Tricks The Book, Dan Bader
- Fluent Python, Luciano Ramalho