Syllable-aware BPE tokenizer for the Amharic language (አማርኛ) – fast, accurate, trainable.


Amharic Tokenizer 🇪🇹


An Amharic tokenizer with a GPT-style, BPE-like pipeline over decomposed fidel. The pipeline is: cleaning → fidel decomposition → BPE training/application → detokenization, with a Cython core for speed.
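
The fidel decomposition step can be pictured with a small, self-contained sketch. This is an illustrative simplification, not the package's actual implementation; the vowel-letter mapping and the pass-through of labialized (8th-order) forms are assumptions made for the example:

VOWELS = ["አ", "ኡ", "ኢ", "ኣ", "ኤ", "እ", "ኦ"]  # 1st–7th vowel orders

def decompose(text: str) -> str:
    """Split each fidel into its first-order consonant plus an explicit vowel letter."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x1200 <= cp <= 0x137F:            # Ethiopic Unicode block
            order = (cp - 0x1200) % 8         # vowel order within the consonant family
            if order < 7:                     # labialized 8th-order forms left as-is here
                out.append(chr(cp - order) + VOWELS[order])
                continue
        out.append(ch)                        # non-fidel characters pass through
    return "".join(out)

print(decompose("ኢትዮጵያ"))  # -> አኢተእየኦጰእየኣ

BPE merges are then learned and applied over these decomposed sequences, and detokenization recomposes the original fidel; the library also appends an internal <eow> end-of-word marker, as shown in the example output below.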


What's new in v0.2.6

  • Vocab size: 30,000 tokens
  • Trained on a larger and more diverse Amharic corpus
  • Improved tokenization quality and detokenization accuracy
  • Better handling of edge cases and rare words
  1. Pretrained tokenizer loading
  • You can now load a pretrained tokenizer directly:
from amharic_tokenizer import AmharicTokenizer
tok = AmharicTokenizer.load("amh_bpe_v0.2.6")

This release ships a pretrained model (amh_bpe_v0.2.6) that can be used immediately, without any additional setup or training.

  2. Full token-to-ID and ID-to-token functionality
  • Added complete round-trip processing methods:
tokens = tok.tokenize(text)
ids = tok.encode(text)  # or tok.convert_tokens_to_ids(tokens)
detokenized = tok.detokenize(tokens)

The tokenizer now supports seamless conversion between tokens and IDs, ensuring full consistency between tokenization and detokenization.


Test Script: test_roundtrip_basic.py

from amharic_tokenizer import AmharicTokenizer


def test_roundtrip_basic():
    """Load a trained tokenizer, tokenize text, convert to IDs, and detokenize."""
    tok = AmharicTokenizer.load("amh_bpe_v0.2.6")
    text = (
        "የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ" )

    tokens = tok.tokenize(text)
    ids = tok.encode(text)
    detokenized = tok.detokenize(tokens)
    print("Original Text: ", text)
    print("Tokens: ", tokens)
    print("IDs: ", ids)
    print("Detokenized Text: ", detokenized)
    assert text == detokenized, "Detokenized text does not match the original."
if __name__ == "__main__":
    test_roundtrip_basic()

Output:    
    Tokenizer state loaded from amh_bpe_v0.2.6.json
    Original Text:  የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ
    Tokens:  ['የአከኦ', 'ረኢደአረእ<eow>', 'ለእመኣተእ<eow>', 'ገአ', 'ፀ', 'አ<eow>', 'በአረአ', 'ከአተእ<eow>', 'የአሀኦነ', 'ኣቸአወእ<eow>', 'የአ', 'ከአተአመኣ', 'ቸእነእ<eow>', 'ሰአፈአረ', 'ኦቸእ<eow>', 'በአ', 'ነአወኣረኢወኦቸእ<eow>', 'አነእደአ', 'በአተእ<eow>', 'በአሰአ', 'ዓተእ<eow>', '2', '0', '9', '<eow>', 'ከኢለኦ<eow>', 'መኤተእረእ<eow>', 'የአመኢ', 'ጓ', 'ዘ', 'አወእ<eow>', 'አወ', 'እለኦ<eow>', 'ነእ', 'ፈኣ', 'ሰእ<eow>', 'ከአ', 'ጀኣ', 'መኣየእ', 'ከኣ<eow>', 'ቀአጠእለኦ<eow>', 'ከኡ', 'በኣ<eow>', 'ደአረእሰ', 'ኡኣለእ<eow>', 'ጠአቀእለኣየእ<eow>']
    IDs:  [2794, 4229, 1136, 66, 37, 79, 711, 1556, 1480, 116, 43, 1467, 1162, 4664, 68, 45, 1618, 2182, 219, 1831, 879, 1, 1, 1, 0, 2824, 2684, 95, 1, 27, 58, 46, 4373, 67, 206, 83, 62, 1083, 4653, 230, 3916, 191, 202, 1221, 477, 496]
    Detokenized Text:  የኮሪደር ልማት ገፀ በረከት የሆናቸው የከተማችን ሰፈሮች በነዋሪዎች አንደበት በሰዓት 209 ኪሎ ሜትር የሚጓዘው አውሎ ንፋስ ከጃማይካ ቀጥሎ ኩባ ደርሷል ጠቅላይ

Additional Improvements

  • Added a vocab_size property for inspecting the model vocabulary size (see the snippet after this list).
  • Added the test_roundtrip_basic.py example script for verifying tokenizer round-trip behavior.
  • The internal <eow> token remains an end-of-word marker and is excluded from the final detokenized output.
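
For example, to check the size of the bundled model (a quick sketch; assumes vocab_size is exposed as a plain read-only property):

from amharic_tokenizer import AmharicTokenizer

tok = AmharicTokenizer.load("amh_bpe_v0.2.6")
print(tok.vocab_size)  # the v0.2.6 model is documented as 30,000 tokens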

Installation

From PyPI (recommended)

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate     # Windows

pip install amharic-tokenizer

Verify the CLI:

amh-tokenizer --help

From source (for development)

git clone https://github.com/sefineh-ai/Amharic-Tokenizer.git
cd Amharic-Tokenizer

python -m venv .venv
source .venv/bin/activate
pip install -e .

Training (CLI)

# Train on a cleaned Amharic text corpus and save model
amh-tokenizer train /abs/path/to/cleaned_amharic.txt /abs/path/to/amh_bpe \
  --num-merges 50000 --verbose --log-every 2000

# Example using relative paths
amh-tokenizer train cleaned_amharic.txt amh_bpe --num-merges 50000 --verbose --log-every 2000

Training (Python)

from amharic_tokenizer.tokenizer import AmharicTokenizer

tokenizer = AmharicTokenizer(vocab_size=5000, num_merges=2000)
tokenizer.train(corpus_text, verbose=True, log_every=100)
tokenizer.save("amh_bpe_model")
tokenizer = AmharicTokenizer.load("amh_bpe_model")

Quick Usage (Python)

from amharic_tokenizer import AmharicTokenizer

# Load a trained model
tok = AmharicTokenizer.load("amh_bpe_v0.2.6")

text = "ኢትዮጵያ ጥሩ ናት።"

# Tokenize
tokens = tok.tokenize(text)
print(tokens)  # variable-length subword tokens
# Tokens to ids
ids = tok.encode(text) # or tok.convert_tokens_to_ids(tokens)
decoded = tok.decode(ids)  # or tok.detokenize(tokens)

display_tokens = [t.replace('<eow>', '') for t in tokens if t != '<eow>']
print(display_tokens)

# Detokenize back to original text
print(tok.detokenize(tokens))

Example Script

# Test a single string
python examples/try_tokenizer.py amh_bpe --text "ኢትዮጵያ ጥሩ ናት።"

# Test a file
python examples/try_tokenizer.py amh_bpe --file cleaned_amharic.txt

Tip: If running examples directly by path, ensure the package is installed (pip install -e .) or run as a module from the project root:

python -m examples.try_tokenizer amh_bpe --text "..."

API

AmharicTokenizer(num_merges=50000)
  • train(corpus_text, verbose=False, log_every=1000) -> int
  • tokenize(text) -> list[str]
  • encode(text) -> list[int] / decode(ids) -> str
  • convert_tokens_to_ids(tokens) -> list[int]
  • detokenize(tokens) -> str
  • save(path_prefix) / load(path_prefix)
  • is_trained() -> bool
  • vocab_size property
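
A small sketch tying these methods together (file names are placeholders; it assumes a freshly constructed tokenizer reports is_trained() == False until train() has been called):

from pathlib import Path
from amharic_tokenizer import AmharicTokenizer

corpus_path = Path("cleaned_amharic.txt")   # placeholder corpus file
tok = AmharicTokenizer(num_merges=50000)

if not tok.is_trained():                    # new instance: no merges learned yet
    tok.train(corpus_path.read_text(encoding="utf-8"), verbose=True, log_every=1000)
    tok.save("amh_bpe_model")               # placeholder path prefix

print(tok.is_trained())                     # True once training has completed
print(tok.tokenize("ኢትዮጵያ ጥሩ ናት።"))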

Notes

  • Larger, more diverse corpora and a higher num_merges produce longer subword tokens.
  • Training and tokenization work over decomposed fidel; detokenization recomposes the original Amharic characters.

Troubleshooting

  • ModuleNotFoundError inside the repo: install in editable mode (pip install -e .) or run scripts from outside the repo to avoid shadowing the installed package.
  • TestPyPI installs: resolve build dependencies from PyPI:
pip install -i https://test.pypi.org/simple/ \
    --extra-index-url https://pypi.org/simple amharic-tokenizer

License

This project is licensed under the MIT License – see the LICENSE file for details.