Large Language Models (LLMs) like BERT, GPT, and their variants have achieved impressive results in understanding and generating human language. However, they operate on tokens, discrete units of text, rather than directly on the raw characters we humans see. This raises an interesting question: how do these models handle numbers? The process, called number tokenization, is more nuanced than you might think and has a significant impact on how well LLMs perform numerical tasks.
Numbers are crucial in everyday language and in many specialized domains: think of prices in financial reports, dates and quantities in everyday conversation, measurements in scientific text, or the answers to arithmetic word problems.
If numbers aren’t tokenized effectively, models can struggle with these kinds of tasks. A model might treat “123” and “12345” as unrelated token sequences with no particular relationship to each other, which is not how a human reads them. This limits its ability to reason about quantities and perform mathematical operations.
LLMs use subword tokenization techniques such as Byte-Pair Encoding (BPE) or WordPiece. These techniques don’t necessarily treat a whole number as a single token: depending on which digit sequences made it into the vocabulary during training, a number may be kept intact, split into multi-digit chunks, or broken into individual digits.
Tokenizing numbers as individual digits or digit groupings has several consequences: the same quantity can be split differently depending on its surrounding context, numerically close values may share no tokens at all, and the resulting token IDs carry no explicit information about magnitude. The small demonstration below illustrates the first point.
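As a quick illustration, the following sketch (assuming the Hugging Face transformers library and the gpt2 tokenizer, both of which also appear later in this post) tokenizes the same digit string in three different contexts; the exact splits depend on the tokenizer’s learned vocabulary, so treat the printed output as indicative rather than definitive.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The same digit string in three contexts: standalone, after a space,
# and glued to a word. The splits can differ in each case.
for text in ["1234567", " 1234567", "order1234567"]:
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {tokens} ({len(tokens)} tokens)")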
Let’s now examine more broadly how different tokenizers handle numbers, using a few Python snippets.
from transformers import AutoTokenizer
def tokenize_and_show(text, tokenizer_name="bert-base-uncased"):
    # Load the requested tokenizer, split the text into tokens,
    # and map those tokens to their vocabulary IDs.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    tokens = tokenizer.tokenize(text)
    print(f"Text: '{text}'")
    print(f"Tokens: {tokens}")
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"Token IDs: {token_ids}")
print("BERT Tokenizer Examples:")
tokenize_and_show("The price is $123.45", "bert-base-uncased")
tokenize_and_show("The year is 2024.", "bert-base-uncased")
tokenize_and_show("Pi is approximately 3.14159", "bert-base-uncased")
print("\nGPT-2 Tokenizer Examples:")
tokenize_and_show("The price is $123.45", "gpt2")
tokenize_and_show("The year is 2024.", "gpt2")
tokenize_and_show("Pi is approximately 3.14159", "gpt2")
print("\nBART Tokenizer Examples:")
tokenize_and_show("The price is $123.45", "facebook/bart-base")
tokenize_and_show("The year is 2024.", "facebook/bart-base")
tokenize_and_show("Pi is approximately 3.14159", "facebook/bart-base")
Note: the code above requires the transformers library, which you can install with pip install transformers. Running the script produces output along these lines:
BERT Tokenizer Examples:
Text: 'The price is $123.45'
Tokens: ['the', 'price', 'is', '$', '123', '.', '45']
Token IDs: [1996, 3923, 2003, 1002, 14887, 1012, 2345]
Text: 'The year is 2024.'
Tokens: ['the', 'year', 'is', '2024', '.']
Token IDs: [1996, 2095, 2003, 26590, 1012]
Text: 'Pi is approximately 3.14159'
Tokens: ['pi', 'is', 'approximately', '3', '.', '14159']
Token IDs: [12379, 2003, 10486, 1020, 1012, 22325]
GPT-2 Tokenizer Examples:
Text: 'The price is $123.45'
Tokens: ['The', 'Ġprice', 'Ġis', 'Ġ$', '123', '.', '45']
Token IDs: [464, 3518, 318, 502, 1251, 13, 1048]
Text: 'The year is 2024.'
Tokens: ['The', 'Ġyear', 'Ġis', 'Ġ2024', '.']
Token IDs: [464, 1380, 318, 14527, 13]
Text: 'Pi is approximately 3.14159'
Tokens: ['Pi', 'Ġis', 'Ġapproximately', 'Ġ3', '.', '14159']
Token IDs: [35301, 318, 11405, 383, 13, 47311]
BART Tokenizer Examples:
Text: 'The price is $123.45'
Tokens: ['The', 'Ġprice', 'Ġis', 'Ġ$', '123', '.', '45']
Token IDs: [464, 728, 16, 754, 1503, 4, 568]
Text: 'The year is 2024.'
Tokens: ['The', 'Ġyear', 'Ġis', 'Ġ2024', '.']
Token IDs: [464, 1030, 16, 14201, 4]
Text: 'Pi is approximately 3.14159'
Tokens: ['Pi', 'Ġis', 'Ġapproximately', 'Ġ3', '.', '14159']
Token IDs: [19693, 16, 3673, 69, 4, 31710]
You can observe that the tokenizers behave similarly in structure but differ in detail. All three split “$123.45” into separate tokens for the currency symbol, the integer part, the decimal point, and the fractional part, and none of them treats “3.14159” as a single numeric token; the value is always broken around the decimal point. BERT lowercases its tokens, GPT-2 and BART mark leading spaces with the “Ġ” symbol, and the same surface tokens map to entirely different token IDs in each vocabulary.
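To get a feel for how uneven this splitting can be, here is another small, hedged sketch (again assuming the transformers library and the gpt2 tokenizer) that counts how many tokens numbers of increasing length produce; the exact counts depend on the vocabulary the tokenizer learned during training.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Longer numbers do not necessarily grow by one token per digit,
# because multi-digit chunks are merged only if they exist in the vocabulary.
for number in ["7", "42", "365", "1024", "99999", "3141592653"]:
    tokens = tokenizer.tokenize(number)
    print(f"{number:>12} -> {len(tokens)} token(s): {tokens}")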
Effective number tokenization is still an area of active research. Open challenges include the inconsistent way digits are grouped across contexts and vocabularies, and the fact that token IDs carry no notion of numeric magnitude.
Proposed directions include tokenization schemes that represent numbers more uniformly, training strategies tailored to numerical data, and specialized modules for arithmetic reasoning; one simple normalization idea is sketched below.
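As one concrete example of a more uniform representation, text can be preprocessed so that every digit becomes its own token, trading longer sequences for consistency. The sketch below is a minimal, hypothetical preprocessing step in that spirit; the split_digits helper is illustrative and not tied to any particular model’s actual pipeline.
import re

def split_digits(text: str) -> str:
    # Hypothetical preprocessing: insert spaces between consecutive digits
    # so that each digit is tokenized on its own. Real systems may instead
    # group digits (e.g., in threes) or use dedicated numeric embeddings.
    return re.sub(r"\d+", lambda m: " ".join(m.group(0)), text)

print(split_digits("The price is $123.45"))
# -> The price is $1 2 3.4 5
Applied before tokenization, this guarantees that “123” and “12345” share digit-level tokens, at the cost of more tokens per number.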
Number tokenization is a critical but often overlooked part of how Large Language Models understand and process text. How numbers are broken into tokens significantly affects performance, especially on tasks that require numerical reasoning. While existing models perform well on general natural language processing, continued research into tokenization methods should further strengthen their ability to understand and work with numbers.
By being aware of these underlying processes, both developers and users of LLMs can better understand the capabilities and limitations of these powerful tools when dealing with numbers.