ruk·si

Encoding

Updated at 2012-11-18 18:41

This note is about encoding as a set of rules for converting information to binary and back.

First we define some terms so we understand what we are talking about:

  • Code: rules about converting a piece of information into another format.
  • Encoding (1): a rule set to convert information to binary and from binary to information.
  • Encoding (2): the process of transforming information to another format.
  • Decoding: the process of transforming encoded data, such as binary, back into the original information.

This note is specifically about Encoding (1).

Text in digital systems is always stored as a sequence of bits that is translated into human-readable characters, usually with lookup tables. If the wrong lookup table is used, the wrong characters are shown.
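A minimal sketch of the wrong-lookup-table problem in Python. The sample character and the choice of Latin-1 as the mismatched table are only illustrative assumptions; any pair of incompatible encodings shows the same effect.

```python
# The same bytes read through two different lookup tables (encodings).
text = "é"
data = text.encode("utf-8")      # stored as two bytes: 0xC3 0xA9

right = data.decode("utf-8")     # matching table
wrong = data.decode("latin-1")   # mismatched table, one character per byte

print(right)  # é
print(wrong)  # Ã©
```

The bytes never change; only the table used to interpret them does.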

Unicode

Unicode is a code, a character set. It is a large table mapping specific characters to specific numbers.

A = 41 (hexadecimal, usually written U+0041)

These numbers are called code points.
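As a small illustration, Python's built-in ord() and chr() expose this table directly. The characters below are just examples; the U+XXXX formatting is the conventional way to write a Unicode number.

```python
# Look up the Unicode number of a character, and the character for a number.
number_a = ord("A")
print(number_a)                 # 65
print(hex(number_a))            # 0x41

label = f"U+{ord('あ'):04X}"    # conventional U+XXXX notation
print(label)                    # U+3042

char = chr(0x3042)
print(char)                     # あ
```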

UTFs (UCS Transformation Formats) are encodings that translate code points to binary so they can be stored and transferred.

There are multiple variants of UTF encodings:

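  • UTF-8 favors efficiency for English text: ASCII characters take 1 byte each.
  • UTF-16 favors efficiency for Asian text: most CJK characters take 2 bytes instead of the 3 they need in UTF-8.
  • UTF-32 favors consistency in translated byte length: every character takes exactly 4 bytes.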
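A rough way to see these trade-offs is to count the bytes each encoding needs for the same text. The two sample strings are arbitrary assumptions, and the "-be" codec names are used only so that Python does not prepend a byte order mark to the result.

```python
# Compare encoded sizes of an English and a Japanese sample string.
samples = {"English": "Hello, world", "Japanese": "こんにちは世界"}
encodings = ("utf-8", "utf-16-be", "utf-32-be")

sizes = {
    label: {enc: len(text.encode(enc)) for enc in encodings}
    for label, text in samples.items()
}

for label, per_encoding in sizes.items():
    print(label, per_encoding)
# English {'utf-8': 12, 'utf-16-be': 24, 'utf-32-be': 48}
# Japanese {'utf-8': 21, 'utf-16-be': 14, 'utf-32-be': 28}
```

Real documents often mix scripts with ASCII markup, so measuring your own data is more reliable than these toy strings.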

Do not include a BOM (Byte Order Mark) unless you have a good reason. It can cause problems, e.g. in browsers.
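A short sketch of what a BOM looks like in practice, using Python's standard "utf-8-sig" codec, which writes and strips the UTF-8 BOM. How a given browser or parser reacts to the leftover character is not shown here.

```python
import codecs

text = "A"
plain = text.encode("utf-8")          # no BOM
with_bom = text.encode("utf-8-sig")   # BOM bytes EF BB BF prepended

print(plain)     # b'A'
print(with_bom)  # b'\xef\xbb\xbfA'

# A plain UTF-8 decoder keeps the BOM as an invisible U+FEFF character,
# which is how it ends up leaking into the start of the text.
leaked = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

print(repr(leaked))    # '\ufeffA'
print(repr(stripped))  # 'A'
```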

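If a character does not fit in the smallest unit, more bytes are chained together. UTF-8 uses 1 to 4 bytes per character, with marker bits in each byte showing whether it starts or continues a sequence. UTF-16 uses 2 bytes for most characters and 4 bytes (a surrogate pair) for the rest. Thus, UTF-8 and UTF-16 are variable-length encodings while UTF-32 is a fixed-length encoding.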

    Encoding                                Character value
A   UTF-8                                          01000001
A   UTF-16                             00000000    01000001
A   UTF-32     00000000    00000000    00000000    01000001

あ   UTF-8                 11100011    10000001    10000010
あ   UTF-16                            00110000    01000010
あ   UTF-32    00000000    00000000    00110000    01000010
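The table above can be reproduced with a few lines of Python. The helper name to_bits is hypothetical, and the big-endian codec names are an assumption made to match the byte order shown in the table and to avoid a BOM.

```python
def to_bits(char: str, encoding: str) -> str:
    """Encode one character and render each byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in char.encode(encoding))

for char in ("A", "あ"):
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(char, encoding, to_bits(char, encoding))

# A character outside the first 65,536 Unicode numbers needs 4 bytes
# in UTF-16 as well, which is what makes UTF-16 variable-length.
emoji_utf16_length = len("😀".encode("utf-16-be"))
print(emoji_utf16_length)  # 4
```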

UTF-16 is generally the worst of both worlds: it is variable-length like UTF-8, yet not as compact for ASCII-heavy text. You should use UTF-8 for projects that need to be multilingual. You may consider UTF-16 if your text is almost entirely in Asian languages.