This project evaluates the performance of Automatic Speech Recognition (ASR) systems and applies topic modeling techniques to Setswana transcriptions. The goal is to analyze the quality of ASR transcriptions and explore how well topic modeling can extract meaningful topics from Setswana text data.
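ASR quality is commonly summarized as word error rate (WER): the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. This README does not say which metric the project reports, so treat WER as an assumption. The function name `word_error_rate` and the sample phrases below are illustrative only and are not taken from the repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative phrases: the hypothesis drops one of four reference words.
print(word_error_rate("ke a leboga rra", "ke leboga rra"))  # 0.25
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so it is best read as an error ratio rather than a percentage capped at 100.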
- ASR Transcription: Scripts for processing audio files and generating transcriptions using models like Wav2Vec2.
- Topic Modeling: Implementation of Latent Dirichlet Allocation (LDA) and BERTopic for extracting topics from Setswana text.
- Data Preprocessing: Includes lemmatization, stopword removal, and sentence splitting.
- Visualization: Generate visualizations such as word clouds and statistical charts.
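The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in src/: the stopword set is a small placeholder (the project's actual list is kept under data/), the function names are hypothetical, and lemmatization is left out because it depends on a Setswana-specific resource this README does not describe.

```python
import re

# Placeholder stopwords for illustration only; load the project's real list from data/.
STOPWORDS = {"le", "ka", "ya", "go", "mo", "a", "e", "o"}


def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Lowercase a sentence and extract word tokens (\\w also matches letters such as š)."""
    return re.findall(r"\w+", sentence.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


text = "Pula e a na. Bana ba tshameka mo jarateng!"
for sentence in split_sentences(text):
    print(remove_stopwords(tokenize(sentence)))
```

Splitting on punctuation is a simplification. ASR output often has no punctuation at all, in which case sentence boundaries need a different signal such as pause length or a fixed window of words.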
- data/: Contains raw and processed datasets, including Setswana stopwords and transcriptions.
- docs/: Includes results and visualizations generated during the analysis.
- models/: Directory for storing trained models.
- notebooks/: Jupyter notebooks for interactive exploration and experimentation.
- src/: Python scripts for preprocessing, transcription, and topic modeling.
- references/: Additional reference materials.
- numpy, pandas, matplotlib, seaborn, scikit-learn, gensim, nltk, torch, transformers
- Jupyter Notebook (optional, for interactive exploration)
- Clone the repository:

      git clone https://github.com/your-repo/Evaluating-ASR-Transcription-Topic-Modeling-for-Setswana.git
      cd Evaluating-ASR-Transcription-Topic-Modeling-for-Setswana

- Install the required packages:

      pip install -r requirements.txt

- Place your Setswana audio files in the appropriate directory.
- Ensure the stopword list and other preprocessing files are in the data/ folder.
Run the transcription scripts in src/:

    python src/transcribe_setswana.py

Use the notebooks in notebooks/ for topic modeling:
- Open BertTopic.ipynb or LDA.ipynb.
- Follow the instructions to preprocess data and generate topics.
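LDA operates on bag-of-words counts rather than raw text, so each tokenized transcript has to be converted to (token id, count) pairs before a model is fitted. The sketch below shows that conversion in plain Python. It mirrors the shape of the pairs that gensim's `Dictionary.doc2bow` returns, but the helper names and sample tokens are hypothetical, and the notebooks may prepare their input differently.

```python
from collections import Counter


def build_vocabulary(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bag_of_words(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Illustrative tokenized documents (already lowercased and stopword-filtered).
docs = [["pula", "metsi", "pula"], ["dikgomo", "metsi"]]
vocab = build_vocabulary(docs)
corpus = [to_bag_of_words(doc, vocab) for doc in docs]
print(corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded at this step, which is why preprocessing choices such as stopword removal have a large effect on the topics LDA finds.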
Generate visualizations using the scripts or notebooks provided in the repository.
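Word clouds and frequency charts both start from a table of word counts across the corpus. As a rough sketch of that step (the function name `top_words` and the sample tokens are assumptions, not code from the repository), the counts can be gathered with `collections.Counter` and then passed to whichever plotting tool is in use:

```python
from collections import Counter


def top_words(docs, n=10):
    """Count tokens across all documents and return the n most frequent as (word, count)."""
    counts = Counter(token for doc in docs for token in doc)
    return counts.most_common(n)


# Illustrative tokenized documents.
docs = [["pula", "metsi", "pula"], ["pula", "metsi", "dikgomo"]]
print(top_words(docs, n=2))  # [('pula', 3), ('metsi', 2)]
```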
Results and visualizations are stored in the docs/ folder, including:
- Word clouds
- Statistical charts
- Topic modeling outputs
Contributions are welcome! Please fork the repository and submit a pull request.
This project is licensed under the MIT License.