A tool for filtering Tibetan text files based on language model perplexity, separating high-quality documents from low-quality ones.
- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/BoCorpusQC.git
  cd BoCorpusQC
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
You can use the main script `kenlm_qc.py` to filter a directory of `.txt` files. The script will process each file, calculate its perplexity, and then sort it into either a `good_quality` or `bad_quality` sub-directory.
- `--input_dir`: (Required) The path to the directory containing the `.txt` files you want to filter.
- `--output_dir`: (Required) The path to the directory where the sorted files will be saved.
- `--num_workers`: (Optional) The number of parallel processes to use for scoring the files. If not specified, it defaults to the total number of CPU cores on your machine.
```bash
python src/BoCorpusQC/kenlm_qc.py \
  --input_dir /path/to/your/text_files \
  --output_dir /path/to/your/output_folder \
  --num_workers 4
```

This command will:

- Process all `.txt` files in `/path/to/your/text_files` using 4 CPU cores.
- Create two new folders inside `/path/to/your/output_folder`:
  - `good_quality`: Contains the top 33.3% of files with the lowest perplexity scores.
  - `bad_quality`: Contains the remaining 66.7% of files.
You can also use the library directly in Python to compute the perplexity of any given text:
```python
from BoCorpusQC.kenlm_qc import load_models, calculate_perplexity

# Load the KenLM and SentencePiece models (downloaded automatically from Hugging Face Hub)
kenlm_model, sp_model = load_models()

# Your Tibetan text
text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"

# Calculate perplexity
ppl = calculate_perplexity(text, kenlm_model, sp_model)
print(f"Perplexity: {ppl:.4f}")
```

A lower perplexity score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
This tool evaluates the quality of Tibetan text files using a pre-trained KenLM language model.
- Model Loading: The script automatically downloads a Tibetan KenLM model (`openpecha/BoKenlm`) and a SentencePiece tokenizer (`openpecha/BoSentencePiece`) from the Hugging Face Hub.
- Perplexity Calculation: It processes each `.txt` file in the input directory as a single document and calculates its perplexity score. A lower score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
- Dynamic Thresholding: The script calculates a dynamic quality threshold based on the distribution of perplexity scores across all files. It sets the threshold to keep the top one-third of the best-scoring documents. This two-pass approach ensures that the definition of "good quality" is always relative to the specific dataset being processed.
- Parallel Processing: To speed up computation, the script uses multiprocessing to calculate perplexity scores for multiple files in parallel (see the sketch after this list).
- Output: Based on the calculated threshold, each file is copied into either the `good_quality` or `bad_quality` subdirectory in your specified output folder.
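As a rough illustration of this two-pass flow, the sketch below scores files in parallel with `multiprocessing.Pool`, derives a 33.3rd-percentile cutoff, and copies each file into one of the two subdirectories. It reuses the `load_models` and `calculate_perplexity` helpers shown earlier, but it is only a simplified stand-in for what `kenlm_qc.py` actually does; the `sort_files` and `_score_file` names are hypothetical.

```python
# Simplified sketch of the two-pass filter: parallel scoring, then thresholding.
import shutil
from multiprocessing import Pool
from pathlib import Path

import numpy as np

from BoCorpusQC.kenlm_qc import load_models, calculate_perplexity

_MODELS = None  # cached per worker process so the models load only once per worker


def _score_file(path: Path) -> tuple[Path, float]:
    # Hypothetical helper: return one document's perplexity.
    global _MODELS
    if _MODELS is None:
        _MODELS = load_models()
    kenlm_model, sp_model = _MODELS
    text = path.read_text(encoding="utf-8")
    return path, calculate_perplexity(text, kenlm_model, sp_model)


def sort_files(input_dir: str, output_dir: str, num_workers: int = 4) -> None:
    files = sorted(Path(input_dir).glob("*.txt"))

    # Pass 1: compute a perplexity score for every file in parallel.
    with Pool(num_workers) as pool:
        scored = pool.map(_score_file, files)

    # Pass 2: keep the lowest-perplexity third below the cutoff, the rest above it.
    threshold = np.percentile([ppl for _, ppl in scored], 100 / 3)
    for name in ("good_quality", "bad_quality"):
        (Path(output_dir) / name).mkdir(parents=True, exist_ok=True)
    for path, ppl in scored:
        dest = "good_quality" if ppl <= threshold else "bad_quality"
        shutil.copy(path, Path(output_dir) / dest / path.name)


if __name__ == "__main__":
    sort_files("/path/to/your/text_files", "/path/to/your/output_folder", num_workers=4)
```

Loading the models lazily inside each worker keeps the Hugging Face downloads and KenLM initialization off the per-document hot path while still letting every process score files independently.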
If you'd like to help out, check out our contributing guidelines.
- File an issue on our GitHub repository.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
This project is licensed under the MIT License.