Research on Earth

Extensive Tokenizer Library

Making a 300k tokenizer

Introduction: The Attention Revolution permalink

A few years ago, the paper "Attention Is All You Need" introduced the attention mechanism to the world. This paper was a game changer for the fields of machine learning and artificial intelligence.

Let's say we have a paragraph, a small one, that contains all the text generated by humanity. Now someone suggests this formula to you as a way to predict the next word in the paragraph:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V $$

So, what do you do?

Understanding Tokenization permalink

First, we need to break the paragraph down into words. That's tokenization.

"Hello world, this is a paragraph" -> ["Hello", "world", "this", "is", "a", "paragraph"]

Then we assign a number to each word. But those numbers are not assigned at random.
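A minimal sketch of that word-level step in Python (a toy illustration, not what real tokenizers do):

```python
# Toy word-level tokenization: split the text into words,
# then assign each distinct word an integer ID in order of appearance.
text = "Hello world, this is a paragraph"
words = text.replace(",", "").split()

vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

ids = [vocab[word] for word in words]
print(words)  # ['Hello', 'world', 'this', 'is', 'a', 'paragraph']
print(ids)    # [0, 1, 2, 3, 4, 5]
```

Real tokenizers don't work on whole words, though, which is where BPE comes in.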

What is Byte-Pair Encoding? permalink

Once we have a training corpus, we use the Byte-Pair Encoding (BPE) algorithm to assign numbers to each word.

The first step is to break the corpus down into bytes. In byte-level BPE, the first 256 IDs are reserved for the 256 possible byte values, so each ASCII character simply maps to its byte value:

  • 65 A
  • 66 B
  • 67 C
  • 68 D
  • 69 E
  • 70 F
  • 71 G
  • 72 H
  • 73 I
  • 74 J
  • ...

Then we need to find the most common pair of bytes and assign a new ID to it, starting right after the 256 byte values. For example, "th" and "is" are among the most common byte pairs in a typical English paragraph:

  • 256 th
  • 257 is
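The merge step above can be sketched in a few lines of Python (a toy version of the idea, not tiktoken's actual implementation):

```python
from collections import Counter

def most_common_pair(ids):
    """Count adjacent ID pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("this is the thing".encode("utf-8"))
pair = most_common_pair(ids)
print(bytes(pair).decode())        # th  (the most common pair here)

merged = merge(ids, pair, 256)     # the first merged token gets ID 256
print(len(ids), "->", len(merged)) # 17 -> 14
```

Repeating this merge step thousands of times, always on the current most frequent pair, is what builds up a full BPE vocabulary.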

Evolution of OpenAI's Tokenizers permalink

OpenAI maintains the tiktoken library for tokenizing text, and it has changed its tokenization vocabulary across model generations:

import tiktoken

# GPT-2 and GPT-3
enc_gpt2 = tiktoken.get_encoding("gpt2")
print(f"GPT-2/3: {enc_gpt2.n_vocab:,} tokens") # 50,257

# GPT-3.5-turbo and GPT-4
enc_gpt4 = tiktoken.get_encoding("cl100k_base")
print(f"GPT-4: {enc_gpt4.n_vocab:,} tokens") # 100,256

# GPT-4o (newest)
enc_gpt4o = tiktoken.get_encoding("o200k_base")
print(f"GPT-4o: {enc_gpt4o.n_vocab:,} tokens") # 200,019

Dynamic Tokenization Across Models permalink

Now, something I found interesting is that tokenization is not static: the same text produces a different token count under each encoding:

import tiktoken

text = "El rápido zorro marrón salta sobre el perro perezoso"  # Spanish for "The quick brown fox jumps over the lazy dog"

gpt2 = tiktoken.get_encoding("gpt2")
gpt4 = tiktoken.get_encoding("cl100k_base")
gpt4o = tiktoken.get_encoding("o200k_base")

print(f"GPT-2: {len(gpt2.encode(text)):2} tokens") # 22 tokens
print(f"GPT-4: {len(gpt4.encode(text)):2} tokens") # 13 tokens
print(f"GPT-4o: {len(gpt4o.encode(text)):2} tokens") # 11 tokens

Now, fewer tokens doesn't mean less information. Token counts depend on how well the vocabulary covers a language, not on how much the text says: a Chinese sentence and its English translation carry the same information, yet they can tokenize to very different counts.

The Trade-off: Vocabulary Size vs. Sequence Length permalink

There is an interesting pattern across OpenAI's model releases: a trade-off between vocabulary size and sequence length.

OpenAI increases the vocabulary size to decrease the sequence length.

For example, "Artificial Intelligence" can be chunked into "Art" "if" "icial" "Int" "elli" "gence" by a tokenizer with a small vocabulary, but with a larger vocabulary it can be chunked into just "Artificial" "Intelligence".

A smaller vocabulary makes the model itself lighter and more efficient, but at the cost of longer token sequences and, in practice, worse results on some tasks.
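The chunking difference can be illustrated with a toy greedy longest-match tokenizer and two hypothetical, hand-written vocabularies (real BPE tokenizers derive their vocabulary from learned merges, not a word list like this):

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest chunk first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # fall back to a single character
            i += 1
    return tokens

small_vocab = {"Art", "if", "icial", "Int", "elli", "gence", " "}
large_vocab = small_vocab | {"Artificial", "Intelligence"}

text = "Artificial Intelligence"
print(greedy_tokenize(text, small_vocab))  # ['Art', 'if', 'icial', ' ', 'Int', 'elli', 'gence']
print(greedy_tokenize(text, large_vocab))  # ['Artificial', ' ', 'Intelligence']
```

Same text, same information: seven tokens with the small vocabulary, three with the large one.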

Making O(n²) Less Expensive permalink

Attention is O(n²): n is the number of tokens, and you HAVE to multiply each token with every other token. Now, because we're making the tokenization more efficient (that is, "Artificial" is one token instead of 3 or 4), we reduce the n in that algorithmic complexity.

That is, we train the model with a huge vocabulary (many more token IDs) so that at inference time it processes fewer tokens, because each token covers more text.
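Rough arithmetic with the token counts from the Spanish sentence above shows how much this saves:

```python
# Token counts for the same sentence, taken from the example above.
counts = {"GPT-2": 22, "GPT-4": 13, "GPT-4o": 11}

# Attention compares every token with every other token: ~n^2 work.
for name, n in counts.items():
    print(f"{name}: {n} tokens -> {n * n} pairwise comparisons")

# Halving the token count quarters the attention cost.
speedup = (counts["GPT-2"] ** 2) / (counts["GPT-4o"] ** 2)
print(speedup)  # 4.0
```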

Now I found that amazing, because it means, at some point, we'll have a long word like otorrinolaringólogo as a token, or a sentence like This is a long sentence that I want to tokenize as a token, right? Almost.

Introducing ExtendedTokenizer permalink

That's why I made ExtendedTokenizer. This Python library is based on the tiktoken library, but with one change: instead of stopping at 200K tokens, the vocabulary goes up to 250K.

Remember that BPE (Byte-Pair Encoding) only assigns an ID to frequently occurring pairs of bytes.

So, if you think a whole sentence might become a single token one day, it probably won't: if a chunk of characters is long but rarely used, it isn't worth a vocabulary slot.

Past roughly 200K tokens, the returns diminish and the model starts to generate worse results.

The Role of Zipf's Law permalink

Now we use Zipf's law for that:

  1. Zipf's law: the most frequent word occurs twice as often as the second most frequent word, three times as often as the third, and so on. In other words, a word's frequency is inversely proportional to its rank.

$$ \text{frequency} \propto \frac{1}{\text{rank}} $$
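A quick sketch of what that implies for a huge vocabulary, assuming token usage roughly follows Zipf's law (the 250,000 figure is just an illustrative vocabulary size, and the normalization is a simple harmonic sum):

```python
def zipf_share(rank, vocab_size):
    """Approximate fraction of all token occurrences claimed by the
    token at `rank`, if frequency is proportional to 1/rank."""
    harmonic = sum(1 / r for r in range(1, vocab_size + 1))
    return (1 / rank) / harmonic

V = 250_000  # illustrative vocabulary size
print(f"rank 1:      {zipf_share(1, V):.4%}")
print(f"rank 1000:   {zipf_share(1000, V):.4%}")
print(f"rank 200000: {zipf_share(200_000, V):.7%}")
```

A token sitting at rank 200,000 would appear so rarely that it contributes almost nothing, which is why blindly growing the vocabulary stops paying off.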
