Amharic tokenizer with a GPT-style BPE-like pipeline over decomposed fidel. Implements: cleaning → fidel decomposition → BPE training/application → detokenization, with a Cython core for speed.
- Vocab size: 30,000 tokens
- Trained on a larger and more diverse Amharic corpus
- Improved tokenization quality and detokenization accuracy
- Better handling of edge cases and rare words
- Pretrained tokenizer loading
- You can now load a pretrained tokenizer directly:
```python
from amharic_tokenizer import AmharicTokenizer

tok = AmharicTokenizer.load("amh_bpe_v0.2.6")
```

This version ships with a pretrained model (`amh_bpe_v0.2.6`) that can be used immediately, without any additional setup or training.
- Full token-to-ID and ID-to-token functionality
- Added complete round-trip processing methods:

```python
tokens = tok.tokenize(text)
ids = tok.encode(text)
detokenized = tok.detokenize(tokens)
```

The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.
Example round-trip check (`test_roundtrip_basic.py`):

```python
from amharic_tokenizer import AmharicTokenizer


def test_roundtrip_basic():
    """Load a trained tokenizer, tokenize text, convert to IDs, and detokenize."""
    tok = AmharicTokenizer.load("amh_bpe_v0.2.6")
    text = (
        "የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ"
    )

    tokens = tok.tokenize(text)
    ids = tok.encode(text)
    detokenized = tok.detokenize(tokens)

    print("Original Text: ", text)
    print("Tokens: ", tokens)
    print("IDs: ", ids)
    print("Detokenized Text: ", detokenized)

    assert text == detokenized, "Detokenized text does not match the original."


if __name__ == "__main__":
    test_roundtrip_basic()
```
Output:

```text
Tokenizer state loaded from amh_bpe_v0.2.6.json
Original Text: የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ
Tokens: ['የአከኦ', 'ረኢደአረእ<eow>', 'ለእመኣተእ<eow>', 'ገአ', 'ፀ', 'አ<eow>', 'በአረአ', 'ከአተእ<eow>', 'የአሀኦነ', 'ኣቸአወእ<eow>', 'የአ', 'ከአተአመኣ', 'ቸእነእ<eow>', 'ሰአፈአረ', 'ኦቸእ<eow>', 'በአ', 'ነአወኣረኢወኦቸእ<eow>', 'አነእደአ', 'በአተእ<eow>', 'በአሰአ', 'ዓተእ<eow>', '2', '0', '9', '<eow>', 'ከኢለኦ<eow>', 'መኤተእረእ<eow>', 'የአመኢ', 'ጓ', 'ዘ', 'አወእ<eow>', 'አወ', 'እለኦ<eow>', 'ነእ', 'ፈኣ', 'ሰእ<eow>', 'ከአ', 'ጀኣ', 'መኣየእ', 'ከኣ<eow>', 'ቀአጠእለኦ<eow>', 'ከኡ', 'በኣ<eow>', 'ደአረእሰ', 'ኡኣለእ<eow>', 'ጠአቀእለኣየእ<eow>']
IDs: [2794, 4229, 1136, 66, 37, 79, 711, 1556, 1480, 116, 43, 1467, 1162, 4664, 68, 45, 1618, 2182, 219, 1831, 879, 1, 1, 1, 0, 2824, 2684, 95, 1, 27, 58, 46, 4373, 67, 206, 83, 62, 1083, 4653, 230, 3916, 191, 202, 1221, 477, 496]
Detokenized Text: የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ
```

- Added `vocab_size` property for inspecting the model vocabulary.
- Added `test_roundtrip_basic.py` example script for verifying tokenizer round-trip behavior.
- The internal `<eow>` token remains an end-of-word marker and is excluded from the final detokenized output (see the short sketch below).
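
For example, a short sketch using the pretrained model and the new property (assuming `amh_bpe_v0.2.6` is available locally, as in the release notes above):

```python
from amharic_tokenizer import AmharicTokenizer

tok = AmharicTokenizer.load("amh_bpe_v0.2.6")

# Inspect the size of the learned vocabulary via the new property.
print(tok.vocab_size)

# Subword tokens carry the <eow> marker at word boundaries; strip it
# when you only want the bare subwords for display.
tokens = tok.tokenize("ኢትዮጵያ ጥሩ ናት።")
display_tokens = [t.replace('<eow>', '') for t in tokens if t != '<eow>']
print(display_tokens)
```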
Install from PyPI:

```bash
python -m venv .venv
source .venv/bin/activate   # Linux/Mac
.venv\Scripts\activate      # Windows
pip install amharic-tokenizer
```

Verify the CLI:

```bash
amh-tokenizer --help
```

Or install from source in editable mode:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

Train from the command line:

```bash
# Train on a cleaned Amharic text corpus and save model
amh-tokenizer train /abs/path/to/cleaned_amharic.txt /abs/path/to/amh_bpe \
  --num-merges 50000 --verbose --log-every 2000

# Example using relative paths
amh-tokenizer train cleaned_amharic.txt amh_bpe --num-merges 50000 --verbose --log-every 2000
```

Or train from Python:

```python
from amharic_tokenizer.tokenizer import AmharicTokenizer

tokenizer = AmharicTokenizer(vocab_size=5000, num_merges=2000)
tokenizer.train(corpus_text, verbose=True, log_every=100)
tokenizer.save("amh_bpe_model")
tokenizer = AmharicTokenizer.load("amh_bpe_model")
```

Tokenize and detokenize with a trained model:

```python
from amharic_tokenizer import AmharicTokenizer
# Load a trained model
tok = AmharicTokenizer.load("amh_bpe_v0.2.6")
text = "ኢትዮጵያ ጥሩ ናት።"
# Tokenize
tokens = tok.tokenize(text)
print(tokens) # variable-length subword tokens
# Tokens to ids
ids = tok.encode(text) # or tok.convert_tokens_to_ids(tokens)
decoded = tok.decode(ids) # or tok.detokenize(tokens)
display_tokens = [t.replace('<eow>', '') for t in tokens if t != '<eow>']
print(display_tokens)
# Detokenize back to original text
print(tok.detokenize(tokens))
```

Try the example script on a single string or a whole file:

```bash
# Test a single string
python examples/try_tokenizer.py amh_bpe --text "ኢትዮጵያ ጥሩ ናት።"

# Test a file
python examples/try_tokenizer.py amh_bpe --file cleaned_amharic.txt
```

Tip: If running examples directly by path, ensure the package is installed (`pip install -e .`) or run it as a module from the project root:

```bash
python -m examples.try_tokenizer amh_bpe --text "..."
```

API reference (a short usage sketch follows the list):

- `AmharicTokenizer(num_merges=50000)`
- `train(corpus_text, verbose=False, log_every=1000) -> int`
- `tokenize(text) -> list[str]`
- `detokenize(tokens) -> str`
- `save(path_prefix)` / `load(path_prefix)`
- `is_trained() -> bool`
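
A rough sketch of how these calls fit together; the tiny inline corpus and the `amh_bpe_demo` path prefix are placeholders, and the meaning of `train()`'s integer return value is not documented here, so it is only printed:

```python
from amharic_tokenizer.tokenizer import AmharicTokenizer

# Tiny inline corpus purely for illustration; real training should use a
# large, cleaned Amharic corpus as shown earlier.
corpus_text = "ኢትዮጵያ ጥሩ ናት። የኮሪደር ልማት ገፀ በረከት"

tok = AmharicTokenizer(num_merges=100)
print(tok.is_trained())            # False before training

result = tok.train(corpus_text)    # returns an int per the signature above
print(result, tok.is_trained())    # ... True after training

tok.save("amh_bpe_demo")                       # placeholder path prefix
tok = AmharicTokenizer.load("amh_bpe_demo")
print(tok.tokenize("ልማት"))
```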
- Longer, more diverse corpora and a higher `num_merges` produce longer subwords.
- Training and tokenization work over decomposed fidel; detokenization recomposes the original Amharic characters (see the sketch after this list).
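
To make the decomposition note concrete, here is a minimal sketch of the scheme. The three-entry mapping is reconstructed from the sample tokens shown earlier ('ልማት' appears there as 'ለእመኣተእ<eow>'); the package's own decomposition table is internal and covers the full fidel inventory:

```python
# Illustrative mapping only: each fidel maps to its first-order consonant
# plus an independent vowel letter, as seen in the sample tokens above.
DECOMPOSE = {'ል': 'ለእ', 'ማ': 'መኣ', 'ት': 'ተእ'}
RECOMPOSE = {v: k for k, v in DECOMPOSE.items()}

def decompose(word: str) -> str:
    return ''.join(DECOMPOSE.get(ch, ch) for ch in word)

def recompose(seq: str) -> str:
    out, i = [], 0
    while i < len(seq):
        pair = seq[i:i + 2]
        if pair in RECOMPOSE:   # greedy two-character match
            out.append(RECOMPOSE[pair])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return ''.join(out)

print(decompose('ልማት'))     # ለእመኣተእ  (matches the token above, minus <eow>)
print(recompose('ለእመኣተእ'))  # ልማት
```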
- `ModuleNotFoundError` inside the repo: install in editable mode (`pip install -e .`) or run scripts from outside the repo to avoid shadowing the installed package.
- TestPyPI installs: resolve build dependencies from PyPI:

```bash
pip install -i https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple amharic-tokenizer
```

This project is licensed under the MIT License – see the LICENSE file for details.