🐍 Python - Encoding
The identity of a Unicode character comes from its code point: a number from 0 to 1,114,111, usually shown as 4 to 6 hex digits prefixed with U+, e.g. A is U+0041.
An encoding is an algorithm that converts code points to bytes. The actual bytes that represent a character depend on the encoding in use, e.g. A (U+0041) is encoded as \x41 in UTF-8 or \x41\x00 in UTF-16LE.
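A quick sanity check of that claim (these two assertions are my own illustration, not from the original notes):
assert 'A'.encode('utf8') == b'\x41'
assert 'A'.encode('utf_16_le') == b'A\x00'  # same bytes as b'\x41\x00'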
cafe: bytes = bytes('café', encoding='utf8')
assert cafe == b'caf\xc3\xa9'
assert cafe[0] == 99
assert cafe[1] == 97
assert cafe[2] == 102
assert cafe[3] == 195
assert cafe[4] == 169
You encode strings to bytes for storage or transfer. Vice versa, you decode bytes to strings so that humans can read them.
s: str = 'café'
assert len(s) == 4
b: bytes = s.encode('utf8')
assert b == b'caf\xc3\xa9'
assert len(b) == 5
assert b.decode('utf8') == 'café'
Not every encoding can encode all code points: the UTF encodings can, but most legacy encodings cannot. On the other hand, some legacy encodings can decode any byte stream without error, silently producing "gremlins" or garbage characters.
city: str = 'São Paulo'
assert city.encode('utf8') == b'S\xc3\xa3o Paulo'
# city.encode('cp437') # => UnicodeEncodeError: can't encode character '\xe3'
assert city.encode('cp437', errors='ignore') == b'So Paulo'
assert city.encode('cp437', errors='replace') == b'S?o Paulo'
assert city.encode('cp437', errors='xmlcharrefreplace') == b'S&#227;o Paulo'
octets: bytes = b'Montr\xe9al'
assert octets.decode('cp1252') == 'Montréal'
assert octets.decode('koi8_r') == 'MontrИal'
# octets.decode('utf8') # => UnicodeDecodeError: can't decode byte 0xe9
assert octets.decode('utf8', errors='replace') == 'Montr�al'
Python 3 source code files should always use UTF-8 encoding. There are ways to use other encodings, but just don't. You can only guess the encoding of a byte sequence, and UTF-8 should always be your first guess.
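A hedged sketch of the "UTF-8 first" guess; guess_decode and its cp1252 fallback are my own illustration, not a standard recipe:
def guess_decode(data: bytes) -> str:
    try:
        return data.decode('utf8')  # always try UTF-8 first
    except UnicodeDecodeError:
        return data.decode('cp1252')  # assumed fallback for this sketch
assert guess_decode(b'caf\xc3\xa9') == 'café'  # valid UTF-8 bytes
assert guess_decode(b'caf\xe9') == 'café'  # legacy latin-style bytes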
UTF-16
- UTF-16 has two variants: little-endian and big-endian.
- On little-endian machines, the least significant byte of each code unit comes first.
- On big-endian machines, the most significant byte of each code unit comes first.
- Which one is in use is signalled with a byte-order mark (BOM), or you choose UTF-16LE / UTF-16BE explicitly. By the standard, a UTF-16 file without a BOM should be treated as big-endian, but in practice that is not always the case. See the sketch below.
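A small sketch of how the byte order shows up in the encoded bytes (the assertions below are my own illustration; the generic utf_16 codec writes a BOM, the LE/BE variants do not):
assert 'El Niño'.encode('utf_16_le') == b'E\x00l\x00 \x00N\x00i\x00\xf1\x00o\x00'
assert 'El Niño'.encode('utf_16_be') == b'\x00E\x00l\x00 \x00N\x00i\x00\xf1\x00o'
assert 'El Niño'.encode('utf_16')[:2] in (b'\xff\xfe', b'\xfe\xff')  # BOM: b'\xff\xfe' means little-endian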
Never rely on the default encoding; it depends on the operating system and locale.
import locale
assert locale.getpreferredencoding() == 'UTF-8'
# utf8 is the default on linux and mac systems
with open('cafe.txt', 'w', encoding='utf8') as f:
    f.write('café')
# cp1252 is the default on windows systems; reading the UTF-8 bytes with it gives mojibake
with open('cafe.txt', encoding='cp1252') as f:
    assert f.read() == 'cafÃ©'
There are canonically equivalent sequences that do not compare as equal. You can normalize them with the unicodedata module: 'NFC' produces the shortest equivalent form, while 'NFKC' gives maximum compatibility for searching and indexing but can lose or distort information, e.g. with exponents.
s1: str = 'café'
s2: str = 'cafe\u0301'
print(s1, s2) # => café café
assert len(s1) == 4
assert len(s2) == 5
assert s1 != s2
from unicodedata import normalize
ns1: str = normalize('NFC', s1)
ns2: str = normalize('NFC', s2)
print(ns1, ns2) # => café café
assert len(ns1) == len(ns2)
assert ns1 == ns2
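A quick illustration (my own addition, reusing the normalize import above) of how NFKC trades fidelity for searchability:
assert normalize('NFKC', '4²') == '42'  # superscript flattened, meaning distorted
assert normalize('NFKC', '½') == '1⁄2'  # VULGAR FRACTION ONE HALF -> '1' + FRACTION SLASH + '2'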
Case folding is a quick way to improve search matching. str.casefold() is similar to str.lower() but more aggressive, e.g. it folds ß to ss. You can go the extra mile and remove combining marks and accents too: watching an American struggle to type ö might be fun, but it is not the best user experience.
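A small check of the lower() vs casefold() difference (added as an illustration, not from the original notes):
assert 'Straße'.lower() == 'straße'  # lower() keeps the sharp s
assert 'Straße'.casefold() == 'strasse'  # casefold() folds ß to ss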
from unicodedata import combining, normalize
text: str = 'Straße Pönthö, 10€'
assert normalize('NFC', text).casefold() == 'strasse pönthö, 10€'
def search_token(text: str) -> str:
    # decompose, drop combining marks (accents), recompose, then case fold
    normalized = normalize('NFD', text)
    shaved = ''.join(c for c in normalized if not combining(c))
    return normalize('NFC', shaved).casefold()
assert search_token(text) == 'strasse pontho, 10€'
If you are sorting strings from many languages, just use the PyUCA library. Using locale.setlocale is error-prone and tedious.
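A minimal sketch, assuming the third-party pyuca package is installed (pip install pyuca); the fruit list and expected order are just an illustration:
import pyuca
coll = pyuca.Collator()
fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']
assert sorted(fruits, key=coll.sort_key) == ['açaí', 'acerola', 'atemoia', 'cajá', 'caju']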
Source
- Python Tricks The Book, Dan Bader
- Fluent Python, Luciano Ramalho