Kumzari

The first digital dataset and translation model for the endangered Kumzari language.

Due to copyright restrictions on some source materials, the full dataset cannot be publicly redistributed. The methodology and processing are fully open-source.

About

7,000+ documented entries, of which 3,000+ have been re-processed for improved accuracy.
Kumzari is traditionally an unwritten language; the Perso-Arabic (Persian) script has been used to represent it.
Built by applying vision-enabled language models to sources dating from 1929 to the present.
Part of an ongoing cultural and linguistic preservation effort.

Dataset

A public sample of the dataset, derived exclusively from public-domain sources, is included.
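
For readers who want to inspect the public sample, the following is a minimal sketch of how it might be loaded and summarized. It assumes the sample is the `sample_public.json` file listed in the repository and that it holds either a top-level list of objects or an object wrapping such a list. The actual schema is not documented here, so `load_entries` and `field_counts` are hypothetical helper names and no field names are assumed.

```python
import json
from collections import Counter
from pathlib import Path


def load_entries(path):
    """Load a JSON file and return its records as a list of dicts.

    Assumption: the file is either a top-level list of objects, or a
    single object in which one of the values is a list of objects.
    The real layout of sample_public.json may differ.
    """
    # Read as UTF-8 explicitly so Perso-Arabic text decodes correctly
    # regardless of the platform's default encoding.
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        records = data
    elif isinstance(data, dict):
        # Use the first list-valued field; otherwise treat the object
        # itself as a single record.
        records = next((v for v in data.values() if isinstance(v, list)), [data])
    else:
        raise ValueError(f"unexpected top-level JSON type: {type(data).__name__}")
    # Keep only object-like records so field counting is well defined.
    return [r for r in records if isinstance(r, dict)]


def field_counts(entries):
    """Count how many records contain each field name."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.keys())
    return counts


# Example usage (run from the repository root):
# entries = load_entries("sample_public.json")
# print(len(entries), "entries")
# print(field_counts(entries).most_common())
```

Counting field names first is a cheap way to discover the sample's structure before writing any code that depends on specific keys.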

The complete dataset incorporates copyrighted materials and is therefore not publicly released.

Status

This project is a work in progress.


karimongitb/kumzari
