
Tokenization

Updated at 2024-10-29 23:21

Tokenization is the process of breaking a text into words, phrases, symbols, or other meaningful elements. The resulting tokens are then used as input for further processing such as parsing or text analysis.

Tokenization isn't strictly related to machine learning, but it's a common preprocessing step in NLP and LLM tasks.

In machine learning, tokenization converts text into a format that algorithms can process more easily, typically numeric token IDs, so that models can learn relationships between tokens.

You can see tokenization in action e.g. on OpenAI's online tokenizer page.

With ChatGPT, one token generally corresponds to ~4 characters of common English text.

Quick brown fox jumps over the lazy dog.

[Quick][ brown][ fox][ jumps][ over][ the][ lazy][ dog][.]

[28903, 19705, 68347, 65613, 1072, 290, 29082, 6446, 13]
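If you want to inspect tokenization programmatically, here is a minimal sketch using the tiktoken library (an assumption; any tokenizer library works). The exact token IDs depend on the encoding you pick, so they may differ from the ones above.

```python
# Minimal sketch of encoding text into tokens with tiktoken (assumed installed).
# Token IDs differ between encodings, so they may not match the example above.
import tiktoken

# "o200k_base" is one of OpenAI's public encodings; pick whichever matches your
# model, e.g. with tiktoken.encoding_for_model("gpt-4o").
enc = tiktoken.get_encoding("o200k_base")

text = "Quick brown fox jumps over the lazy dog."
token_ids = enc.encode(text)

print(len(token_ids))                 # how many tokens the text costs
print(token_ids)                      # the integer IDs the model actually sees
print(enc.decode(token_ids) == text)  # decoding round-trips to the original text
```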

Most tokenizers use subword tokenization, which lets a relatively small vocabulary of tokens represent arbitrary text, including rare or unseen words.

The word "unbelievable" could be tokenized into "un", "bel" and "ievable".
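To see the subword pieces, you can decode each token ID separately. A small sketch along the same lines, again assuming tiktoken; the actual split of a given word depends on the encoding and may differ from the example above.

```python
# Minimal sketch of inspecting subword pieces; splits vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for word in ["unbelievable", "tokenization", "dog"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # the text fragment behind each token
    print(f"{word!r} -> {pieces}")
```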

You usually optimize your input/output formats so that they take as few tokens as possible to represent the data while still preserving the necessary information.

For example, YAML is arguably more human-readable than JSON and might tokenize better in certain scenarios, but JSON is usually more compact overall; it can also be minified to strip whitespace, and it tends to use fewer tokens for deeply nested structures.
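To compare formats for your own data, you can simply count tokens for each serialization. A rough sketch, assuming tiktoken and PyYAML are installed; which format wins depends on the encoding and on the shape of your data.

```python
# Rough sketch comparing token counts of the same data serialized as YAML and JSON.
# Assumes tiktoken and PyYAML are installed; results vary by encoding and data shape.
import json

import tiktoken
import yaml

enc = tiktoken.get_encoding("o200k_base")


def count_tokens(text: str) -> int:
    return len(enc.encode(text))


data = {
    "users": [
        {"name": "Alice", "age": 30, "tags": ["admin", "dev"]},
        {"name": "Bob", "age": 25, "tags": ["dev"]},
    ]
}

print("YAML:          ", count_tokens(yaml.safe_dump(data)))
print("JSON, pretty:  ", count_tokens(json.dumps(data, indent=2)))
print("JSON, minified:", count_tokens(json.dumps(data, separators=(",", ":"))))
```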