Encoding
This note is about encoding as a rule set for converting information to binary and vice versa.
First we define some terms so we understand what we are talking about:
- Code: a set of rules for converting a piece of information into another format.
- Encoding (1): a rule set for converting information to binary and from binary back to information.
- Encoding (2): the process of transforming information into another format.
- Decoding: the process of transforming binary back into information (the reverse of Encoding (2)).
This note is specifically about Encoding (1).
Text in digital systems is always a sequence of bits that is translated into human-readable characters, usually by means of lookup tables. If the wrong lookup table is used, the wrong characters are shown.
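A minimal sketch in Python (the example string is my own choice) showing how the same bytes yield different text depending on which lookup table (encoding) is used for decoding:

```python
# The same byte sequence decoded with two different "lookup tables" (encodings).
data = "héllo".encode("utf-8")     # b'h\xc3\xa9llo'

print(data.decode("utf-8"))    # héllo  -> correct table, correct characters
print(data.decode("latin-1"))  # hÃ©llo -> wrong table, garbled characters
```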
Unicode
Unicode is a code, a character set. It is a large table mapping specific characters to specific numbers.
A = U+0041 (hexadecimal 41, decimal 65)
These numbers are called code points.
UTFs (UCS Transformation Formats) are encodings that translate those code points to binary so they can be stored and transferred.
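For example, in Python (a sketch, not tied to any particular project) you can inspect a character's code point and see how each UTF serializes it to bytes:

```python
ch = "A"
print(hex(ord(ch)))            # 0x41 -> the code point U+0041

print(ch.encode("utf-8"))      # b'A'              (1 byte)
print(ch.encode("utf-16-be"))  # b'\x00A'          (2 bytes)
print(ch.encode("utf-32-be"))  # b'\x00\x00\x00A'  (4 bytes)
```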
There are multiple variants of UTF encodings:
- UTF-8 favors efficiency for English text.
- UTF-16 favors efficiency for Asian text.
- UTF-32 favors a consistent (fixed) byte length per character.
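A quick comparison (the sample sentences are arbitrary) of how many bytes each encoding needs for English versus Japanese text:

```python
english  = "Hello, world"
japanese = "こんにちは世界"

for text in (english, japanese):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"{text!r:18} {enc:9} {len(text.encode(enc)):2} bytes")
# English:  UTF-8 is smallest (1 byte per character).
# Japanese: UTF-16 (2 bytes per character) beats UTF-8 (3 bytes per character).
```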
Do not include a BOM (Byte Order Mark) unless you have a good reason: it can cause problems, e.g. in browsers.
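In Python the BOM shows up as extra bytes at the start of the data when the "utf-8-sig" codec is used (shown here only to make the BOM visible, not as a recommendation):

```python
print("A".encode("utf-8"))      # b'A'             -> no BOM
print("A".encode("utf-8-sig"))  # b'\xef\xbb\xbfA' -> BOM prepended (EF BB BF)
```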
If a character cannot be expressed in a single code unit, several units are glued together using special marker bits (multi-byte sequences in UTF-8, surrogate pairs in UTF-16). Thus, UTF-8 and UTF-16 are variable-length encodings, while UTF-32 is a fixed-length encoding.
Character  Encoding  Value
A          UTF-8     01000001
A          UTF-16    00000000 01000001
A          UTF-32    00000000 00000000 00000000 01000001
あ         UTF-8     11100011 10000001 10000010
あ         UTF-16    00110000 01000010
あ         UTF-32    00000000 00000000 00110000 01000010
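The table above can be reproduced with a short script (a sketch; the helper name bits is my own):

```python
def bits(data: bytes) -> str:
    """Render a byte string as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in data)

for ch in ("A", "あ"):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(ch, enc, bits(ch.encode(enc)))
```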
UTF-16 is generally the worst of both worlds: it is variable-length, yet not optimized for ASCII-heavy text. You should always use UTF-8 for projects that need to be multilingual; consider UTF-16 only if your target is exclusively Asian-language text.