A tool for filtering Tibetan text files based on language model perplexity, separating high-quality documents from low-quality ones.
- Clone the repository:

  ```bash
  git clone https://github.com/your-repo/BoCorpusQC.git
  cd BoCorpusQC
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
You can use the main script `kenlm_qc.py` to filter a directory of `.txt` files. The script will process each file, calculate its perplexity, and then sort it into either a `good_quality` or `bad_quality` sub-directory.
- `--input_dir`: (Required) The path to the directory containing the `.txt` files you want to filter.
- `--output_dir`: (Required) The path to the directory where the sorted files will be saved.
- `--num_workers`: (Optional) The number of parallel processes to use for scoring the files. If not specified, it defaults to the total number of CPU cores on your machine.
```bash
python src/BoCorpusQC/kenlm_qc.py \
  --input_dir /path/to/your/text_files \
  --output_dir /path/to/your/output_folder \
  --num_workers 4
```

This command will:

- Process all `.txt` files in `/path/to/your/text_files` using 4 CPU cores.
- Create two new folders inside `/path/to/your/output_folder`:
  - `good_quality`: Contains the top 33.3% of files with the lowest perplexity scores.
  - `bad_quality`: Contains the remaining 66.7% of files.
You can also use the library directly in Python to compute the perplexity of any given text:
```python
from BoCorpusQC.kenlm_qc import load_models, calculate_perplexity

# Load the KenLM and SentencePiece models (downloaded automatically from Hugging Face Hub)
kenlm_model, sp_model = load_models()

# Your Tibetan text
text = "བཀྲ་ཤིས་བདེ་ལེགས། ཁམས་བཟང་ངམ།"

# Calculate perplexity
ppl = calculate_perplexity(text, kenlm_model, sp_model)
print(f"Perplexity: {ppl:.4f}")
```

A lower perplexity score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
This tool evaluates the quality of Tibetan text files using a pre-trained KenLM language model.
- Model Loading: The script automatically downloads a Tibetan KenLM model (`openpecha/BoKenlm`) and a SentencePiece tokenizer (`openpecha/BoSentencePiece`) from the Hugging Face Hub.
- Perplexity Calculation: It processes each `.txt` file in the input directory as a single document and calculates its perplexity score. A lower score indicates that the text is more fluent and predictable according to the language model, suggesting higher quality.
- Dynamic Thresholding: The script calculates a dynamic quality threshold based on the distribution of perplexity scores across all files. It sets the threshold to keep the top one-third of the best-scoring documents. This two-pass approach ensures that the definition of "good quality" is always relative to the specific dataset being processed.
- Parallel Processing: To speed up computation, the script uses multiprocessing to calculate perplexity scores for multiple files in parallel (see the sketch after this list).
- Output: Based on the calculated threshold, each file is copied into either the `good_quality` or `bad_quality` subdirectory in your specified output folder.
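As a rough illustration of this two-pass flow, the sketch below scores files in parallel with `multiprocessing.Pool`, derives a 33.3rd-percentile cutoff, and copies each file into one of the two subdirectories. It reuses the `load_models` and `calculate_perplexity` helpers shown earlier, but it is only a simplified stand-in for what `kenlm_qc.py` actually does; the `sort_files` and `_score_file` names are hypothetical.

```python
# Simplified sketch of the two-pass filter: parallel scoring, then thresholding.
import shutil
from multiprocessing import Pool
from pathlib import Path

import numpy as np

from BoCorpusQC.kenlm_qc import load_models, calculate_perplexity

_MODELS = None  # cached per worker process so the models load only once per worker


def _score_file(path: Path) -> tuple[Path, float]:
    # Hypothetical helper: return one document's perplexity.
    global _MODELS
    if _MODELS is None:
        _MODELS = load_models()
    kenlm_model, sp_model = _MODELS
    text = path.read_text(encoding="utf-8")
    return path, calculate_perplexity(text, kenlm_model, sp_model)


def sort_files(input_dir: str, output_dir: str, num_workers: int = 4) -> None:
    files = sorted(Path(input_dir).glob("*.txt"))

    # Pass 1: compute a perplexity score for every file in parallel.
    with Pool(num_workers) as pool:
        scored = pool.map(_score_file, files)

    # Pass 2: keep the lowest-perplexity third below the cutoff, the rest above it.
    threshold = np.percentile([ppl for _, ppl in scored], 100 / 3)
    for name in ("good_quality", "bad_quality"):
        (Path(output_dir) / name).mkdir(parents=True, exist_ok=True)
    for path, ppl in scored:
        dest = "good_quality" if ppl <= threshold else "bad_quality"
        shutil.copy(path, Path(output_dir) / dest / path.name)


if __name__ == "__main__":
    sort_files("/path/to/your/text_files", "/path/to/your/output_folder", num_workers=4)
```

Loading the models lazily inside each worker keeps the Hugging Face downloads and KenLM initialization off the per-document hot path while still letting every process score files independently.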
If you'd like to help out, check out our contributing guidelines.
- File an issue on our GitHub repository.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
This project is licensed under the MIT License.