This workflow automatically aligns speech audio with text transcripts at the word and phoneme levels using the Montreal Forced Aligner (MFA).
# 1. Create and activate environment
conda create -n mfa_env -c conda-forge montreal-forced-aligner -y
conda activate mfa_env
# 2. Download models
mfa model download dictionary english_us_arpa
mfa model download acoustic english_mfa
# 3. Prepare dataset
# Ensure data/ready_corpus contains .wav and .txt pairs with matching base names
# 4. Validate
mfa validate data/ready_corpus english_us_arpa english_mfa
# 5. Align
mfa align data/ready_corpus english_us_arpa english_mfa outputs/aligned
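Before validating, it can help to confirm that every audio file in the corpus has a transcript and vice versa. The sketch below is one way to do that; `check_pairs` is a hypothetical helper, not part of MFA, and it assumes a flat directory of .wav and .txt files (it does not descend into speaker subfolders).

```shell
# Hypothetical helper (not an MFA command): report files that lack a partner.
# Prints one line per unpaired file and returns 1 if any were found, else 0.
check_pairs() {
    dir="$1"
    missing=0
    for wav in "$dir"/*.wav; do
        [ -e "$wav" ] || continue              # glob matched nothing
        if [ ! -f "${wav%.wav}.txt" ]; then
            echo "missing transcript: ${wav%.wav}.txt"
            missing=1
        fi
    done
    for txt in "$dir"/*.txt; do
        [ -e "$txt" ] || continue              # glob matched nothing
        if [ ! -f "${txt%.txt}.wav" ]; then
            echo "missing audio: ${txt%.txt}.wav"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_pairs data/ready_corpus` prints nothing and returns 0 when every file has a partner.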
Outputs
Alignment Files → outputs/aligned/*.TextGrid
Alignment Report → outputs/aligned/alignment_analysis.csv
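To get a first look at the alignment report, the following sketch lists its columns and counts its data rows. `summarize_report` is a hypothetical helper; it makes no assumption about which columns MFA writes, and it splits the header naively on commas, so it would misread header fields that themselves contain quoted commas.

```shell
# Hypothetical helper: summarise a CSV report without assuming its column names.
summarize_report() {
    report="$1"
    echo "columns:"
    # Split the header row on commas, one indented column name per line.
    head -n 1 "$report" | tr ',' '\n' | sed 's/^/  /'
    # Count every line after the header as one data row.
    rows=$(tail -n +2 "$report" | wc -l | tr -d ' ')
    echo "rows: $rows"
}
```

For example, `summarize_report outputs/aligned/alignment_analysis.csv` prints the column names followed by the number of data rows.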
Each .TextGrid contains:
Word tier → timestamps for words
Phone tier → timestamps for phonemes
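A quick way to confirm that a TextGrid contains both tiers is to print its tier names. The sketch below assumes the file is a UTF-8, long-format Praat TextGrid, where each tier has a line of the form `name = "..."`. `list_tiers` is a hypothetical helper. MFA output commonly names the tiers `words` and `phones`, but treat that as an assumption to verify against your own files.

```shell
# Hypothetical helper: print the tier names found in a long-format TextGrid.
list_tiers() {
    # Tier headers look like:  name = "words"
    # Anchoring at line start avoids matching interval text lines.
    grep -E '^[[:space:]]*name = "' "$1" | sed -E 's/.*name = "(.*)".*/\1/'
}
```

For example, `list_tiers outputs/aligned/F2BJ_RLP1.TextGrid` prints one tier name per line.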
Visualization
Open in Praat:
Open → Read from file → F2BJ_RLP1.wav
Open → Read from file → F2BJ_RLP1.TextGrid
Select both → View & Edit
Observations
Word and phone boundaries aligned accurately.
Minor timing deviations in fast speech segments.
english_us_arpa dictionary and english_mfa acoustic model performed well.