A simple yet efficient implementation of the Byte Pair Encoding (BPE) algorithm for text tokenization. This Python module can be used as a standalone command-line tool or imported as a library in your projects.
Byte Pair Encoding is a data compression technique that iteratively replaces the most frequent pair of bytes (or characters) in a sequence with a single, unused byte (or character). In the context of natural language processing, it's used as a subword tokenization method, creating a vocabulary that includes both whole words and subword units.
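A single merge step can be illustrated in a few lines. This is a sketch only, not the module's own code, and `merge_most_frequent` is a hypothetical helper name:

```python
from collections import Counter

def merge_most_frequent(word):
    """One BPE merge step on a single tokenized word.

    Illustrative sketch only; not part of the bpe module.
    """
    # Count how often each adjacent token pair occurs
    pairs = Counter(zip(word, word[1:]))
    if not pairs:
        return word
    # Pick the most frequent pair ...
    (a, b), _ = pairs.most_common(1)[0]
    # ... and replace every occurrence with the concatenated token
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
            merged.append(a + b)
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

print(merge_most_frequent(list("banana")))  # ['b', 'an', 'an', 'a']
```

Here the pair `('a', 'n')` occurs twice, so both occurrences are merged into the new token `an`.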
No installation is required beyond the Python dependencies:

```shell
pip install tqdm
```

The BPE tokenizer can be run directly from the command line:

```shell
python bpe.py [input] [-n NSTEPS] [-o OUTPUT] [-r]
```

- `input` - Path to the input file (default: `-` for stdin)
- `-n, --nsteps` - Number of BPE steps to perform (default: 100)
- `-o, --output` - Output file prefix for the vocabulary, tokens, and frequency table
- `-r, --relative` - Use relative frequencies instead of absolute counts
- `-l, --lowercase` - Convert input text to lowercase before processing
Process a file and output tokenized text to stdout:

```shell
python bpe.py corpora/sherlock.txt
```

Process stdin and output tokenized text to stdout:

```shell
cat corpora/sherlock.txt | python bpe.py
```

Perform 200 BPE steps on a file:

```shell
python bpe.py corpora/sherlock.txt -n 200
```

Save outputs to files:

```shell
python bpe.py corpora/sherlock.txt -o output/sherlock
```

This generates three files:

- `output/sherlock.vocab.json` - The vocabulary as a JSON list
- `output/sherlock.tokens.txt` - The tokenized text
- `output/sherlock.freqs.json` - The frequency table
You can import the `bpe` function from the module:

```python
from bpe import bpe

# Read your text
with open("your_text_file.txt", "r") as f:
    text = f.read()

# Perform BPE
vocab, tokens, freq_table = bpe(
    text=text,             # The input text
    nsteps=100,            # Number of BPE steps
    relative_freqs=False,  # Whether to use relative frequencies
)

# Access the results
print(f"Vocabulary size: {len(vocab)}")
print(f"First 10 tokens in first word: {tokens[0][:10]}")
print(f"Most common token: {max(freq_table, key=freq_table.get)}")
```
1. **Initialization**: The text is split into words, and each word is represented as a list of individual characters.

2. **Iterative Merging**: For `nsteps` iterations:
   - Count the frequency of all adjacent token pairs
   - Find the most frequent pair
   - Replace all occurrences of that pair with a new token (the concatenation of the pair)
   - Add this new token to the vocabulary

3. **Output**:
   - A vocabulary (set of tokens)
   - Tokenized text (list of lists, where each inner list represents a word)
   - A frequency table for all tokens
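The stages above can be sketched as follows. This `bpe_sketch` is an illustrative reimplementation, not the module's actual `bpe` function; tie-breaking between equally frequent pairs and the frequency-table format may differ:

```python
from collections import Counter

def bpe_sketch(text, nsteps):
    """Illustrative sketch of the BPE stages; not the bpe module's code."""
    # 1. Initialization: split into words, each word a list of characters
    tokens = [list(word) for word in text.split()]
    vocab = {ch for word in tokens for ch in word}

    # 2. Iterative merging
    for _ in range(nsteps):
        # Count the frequency of all adjacent token pairs across all words
        pairs = Counter()
        for word in tokens:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        # Find the most frequent pair and merge it everywhere
        (a, b), _count = pairs.most_common(1)[0]
        new_token = a + b
        vocab.add(new_token)
        for word in tokens:
            i = 0
            while i < len(word) - 1:
                if word[i] == a and word[i + 1] == b:
                    word[i : i + 2] = [new_token]
                else:
                    i += 1

    # 3. Output: vocabulary, tokenized text, and token frequency table
    freq_table = Counter(tok for word in tokens for tok in word)
    return vocab, tokens, dict(freq_table)

vocab, tokens, freqs = bpe_sketch("low low lower", nsteps=2)
print(tokens)  # [['low'], ['low'], ['low', 'e', 'r']]
```

After two steps, `l o` merges into `lo` and then `lo w` into `low`, so the shared prefix of all three words becomes a single token.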
This project is licensed under the BSD 3-Clause License - see the LICENSE.md file for details.