The problem with tokenization

One of the crucial preprocessing steps for large language models (LLMs) is tokenization: converting the input text into tokens and reconstructing output tokens back into text. The most common approach is to treat tokens as subword units. Using whole words directly would limit the vocabulary to known words, while character-level tokenization could handle unknown words but would significantly increase input length. Longer inputs mean higher computational and memory requirements, which is a problem because the cost of self-attention scales quadratically with sequence length in current LLMs.
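
To make the trade-off concrete, here is a back-of-the-envelope sketch in plain Python. The subword split is hypothetical (a real BPE tokenizer would learn its own pieces), but it shows how character-level tokenization inflates the sequence length and, with it, the quadratic attention cost.

```python
# Compare sequence lengths under a hypothetical subword split vs. character-level
# tokenization, and the relative cost of quadratic self-attention for each.

text = "Tokenization converts input text into tokens."

# Hypothetical subword split -- real tokenizers (e.g. BPE) learn these pieces from data.
subword_tokens = ["Token", "ization", " converts", " input", " text", " into", " tokens", "."]
char_tokens = list(text)  # character-level: one token per character

n_subword = len(subword_tokens)
n_char = len(char_tokens)

# Self-attention compares every token with every other token: O(n^2).
cost_subword = n_subword ** 2
cost_char = n_char ** 2

print(f"subword: {n_subword} tokens -> {cost_subword} attention pairs")
print(f"chars:   {n_char} tokens -> {cost_char} attention pairs")
print(f"character-level attention is ~{cost_char / cost_subword:.0f}x more expensive here")
```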

In a large language model these tokens are mapped to high-dimensional embeddings that serve as their learned representations. Embeddings play a critical role: for example, preserving pre-trained embeddings while re-initializing the other parameters leads to faster training.
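
As a rough illustration, the following PyTorch sketch shows the idea on a toy model: copy a pre-trained embedding table into a freshly initialized model while leaving every other parameter at its new initialization. The model structure and sizes are made up for the example.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # token id -> d_model-dim vector
        self.proj = nn.Linear(d_model, vocab_size)      # the "other" parameters

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))

pretrained = TinyLM()   # stands in for a trained checkpoint
fresh = TinyLM()        # newly initialized model

# Copy only the embedding weights; every other parameter keeps its fresh init.
with torch.no_grad():
    fresh.embed.weight.copy_(pretrained.embed.weight)

logits = fresh(torch.tensor([[1, 2, 3]]))  # shape: (batch=1, seq=3, vocab_size)
print(logits.shape)
```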

Still, LLMs can struggle with seemingly simple tasks that depend on information below the token level. A classic example is counting the letters in a word – that information is lost during tokenization.
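
A quick way to see this is to look at what the model actually receives. The snippet below uses the tiktoken library purely for illustration; the exact split depends on the vocabulary, but the point is that the model gets opaque integer ids rather than characters.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)        # a handful of opaque integers
print(pieces)           # the subword pieces the model actually sees
print(word.count("r"))  # 3 -- trivial on characters, hidden behind the token ids
```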


The root cause is that tokens are treated as independent units, ignoring their compositional nature. In English, for instance, the singular and plural forms of a noun often appear as separate tokens, forcing the model to learn their relationship from many examples across different contexts.
A more intelligent tokenization approach could encode this compositional information directly into the embeddings.
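
Here is a minimal sketch of why this hurts, using a hypothetical vocabulary and a randomly initialized embedding table (all ids and sizes are made up): the rows for "cat" and "cats" share nothing by construction, so any relation between them has to be learned from data.

```python
import numpy as np

vocab = {"cat": 1042, "cats": 7719, "dog": 318, "dogs": 5123}  # ids are made up
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10000, 64))  # one independently initialized row per id

e_cat, e_cats = embeddings[vocab["cat"]], embeddings[vocab["cats"]]

# Cosine similarity between the two rows is ~0: the plural relation must be
# learned entirely from co-occurrence data, not from the token representation.
cos = e_cat @ e_cats / (np.linalg.norm(e_cat) * np.linalg.norm(e_cats))
print(round(float(cos), 3))
```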

The challenge lies in developing such a tokenizer in a data-driven, end-to-end manner, so that compositionality is captured in the embeddings rather than imposed through hand-crafted rules.

Solving this tokenization challenge could significantly improve LLM performance, data efficiency, and our understanding of how to imbue these models with true compositional linguistic knowledge.


Want to know more? Let us talk!