A comprehensive collection of Python scripts for extracting, processing, and managing course data from various sources at Beijing Normal Hong Kong Baptist University (BNBU).
This toolkit provides automated solutions for:
- Course Description Extraction - Extract structured course data from PDF documents
- Teacher Information Fetching - Scrape teacher information from university websites
- Course Offerings Processing - Parse Excel timetables to generate course-lecturer relationships
- Handbook Data Extraction - AI-powered extraction from PDF handbooks using the DeepSeek API
- Data Onboarding & Cleanup - Process and standardize course data for database import
You must obtain the required data yourself from official sources:
- Course description PDFs from BNBU Academic Registry
- Excel timetable files from official university sources
- PDF handbook files from university filebase/academic systems
- Any other required input files from authorized university channels
Users are responsible for ensuring they have proper authorization to access and process university data.
- Clone the repository
- Install dependencies:
pip install -r requirements-pdf.txt
- Set up API keys (for handbook extraction):
export DEEPSEEK_API='your-key'
- Obtain required data files from official university sources (see disclaimer above)
- Navigate to the specific module and follow the detailed instructions in each README
📄 00-cde - Course Description Extraction
Extract course descriptions from official PDF documents.
cd 00-cde
python cd-extract.py <path_to_pdf> --export tsv
👥 01-teacher - Teacher Information Fetching
Fetch teacher information from the BNBU staff website.
cd 01-teacher
python teacher-fetch.py
📊 02-offering - Course Offerings Extraction
Process Excel timetables to extract course offerings and department information.
cd 02-offering
python offering-extract.py --process
python offering-extract.py --departments
📚 04-handbook - Handbook Data Extraction
AI-powered extraction from PDF handbooks using the DeepSeek API.
cd 04-handbook
export DEEPSEEK_API='your-api-key'
python pdf_extract_courses.py --input-dir handbooks/2025
🔧 05-onboarding - Data Onboarding & Cleanup
Process and clean course data for database import with AI enhancement.
cd 05-onboarding
python course_onboarding.py input/courses_export.tsv output/reference_courses.tsv
python course-cleanup.py
autoauto/
├── 00-cde/ # Course Description Extraction
├── 01-teacher/ # Teacher Information Fetching
├── 02-offering/ # Course Offerings Processing
├── 04-handbook/ # Handbook Data Extraction
├── 05-onboarding/ # Data Onboarding & Cleanup
└── README.md # This file
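
As a quick sanity check after cloning, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the toolkit: the helper name `missing_modules` is hypothetical, and it only checks that the listed module directories exist.

```python
from pathlib import Path

# Module directories listed in the project structure above.
MODULE_DIRS = ["00-cde", "01-teacher", "02-offering", "04-handbook", "05-onboarding"]


def missing_modules(repo_root="."):
    """Return the module directories that are absent under repo_root."""
    root = Path(repo_root)
    return [name for name in MODULE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    for name in missing_modules():
        print("missing module directory:", name)
```

Run it from the repository root; no output means every module directory was found.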
- Python 3.x
- Required packages (see requirements-pdf.txt)
- DeepSeek API key (for handbook extraction)
- Ollama service (for onboarding AI features)
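
The requirements above can be checked before running any script. The sketch below is illustrative only: `check_prerequisites` is a hypothetical helper, and it assumes that Ollama is installed locally with its command-line tool named `ollama` on the PATH. If Ollama runs elsewhere in your setup, drop or adapt that check.

```python
import os
import shutil
import sys
from pathlib import Path


def check_prerequisites(env=None, repo_root="."):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # The requirement is only "Python 3.x", so any Python 3 interpreter passes.
    if sys.version_info.major < 3:
        problems.append("Python 3.x is required")

    # requirements-pdf.txt is the dependency file named in the install step.
    if not (Path(repo_root) / "requirements-pdf.txt").is_file():
        problems.append("requirements-pdf.txt not found (run from the repository root)")

    # DEEPSEEK_API is the variable exported in the setup instructions.
    if not env.get("DEEPSEEK_API"):
        problems.append("DEEPSEEK_API is not set (needed for handbook extraction)")

    # Assumption: the Ollama CLI is installed locally and named "ollama".
    if shutil.which("ollama") is None:
        problems.append("ollama executable not found on PATH (needed for onboarding AI features)")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The function returns messages instead of exiting, so a missing DeepSeek key does not block modules that never call the API.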
- Extract course descriptions from PDFs (00-cde)
- Fetch teacher information (01-teacher)
- Process timetable data (02-offering)
- Extract handbook courses with AI (04-handbook)
- Onboard and clean data for database (05-onboarding)
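
The workflow above could be driven from one script. The following is a rough sketch under stated assumptions, not something shipped with the repository: `run_pipeline` and `STEPS` are hypothetical names, the commands are copied from the module sections of this README, and `<path_to_pdf>` and `handbooks/2025` are placeholders for your own inputs. Whether each script consumes the previous step's output is not stated here, so check the module READMEs before chaining them.

```python
import subprocess

# (module directory, command) pairs copied from the module sections of this README.
# The PDF path and handbook directory are placeholders to replace with real inputs.
STEPS = [
    ("00-cde", ["python", "cd-extract.py", "<path_to_pdf>", "--export", "tsv"]),
    ("01-teacher", ["python", "teacher-fetch.py"]),
    ("02-offering", ["python", "offering-extract.py", "--process"]),
    ("02-offering", ["python", "offering-extract.py", "--departments"]),
    ("04-handbook", ["python", "pdf_extract_courses.py", "--input-dir", "handbooks/2025"]),
    ("05-onboarding", ["python", "course_onboarding.py",
                       "input/courses_export.tsv", "output/reference_courses.tsv"]),
    ("05-onboarding", ["python", "course-cleanup.py"]),
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Run each step inside its module directory, stopping at the first failure."""
    executed = []
    for module_dir, command in steps:
        executed.append((module_dir, " ".join(command)))
        if dry_run:
            # Dry run: only show what would be executed.
            print(f"[{module_dir}] {' '.join(command)}")
            continue
        # check=True raises CalledProcessError if a script exits with a non-zero status.
        subprocess.run(command, cwd=module_dir, check=True)
    return executed


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

The default dry run only prints the commands. Pass `dry_run=False` once the input files are in place and `DEEPSEEK_API` is exported.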
For detailed usage instructions, examples, and troubleshooting:
- Check the individual README files in each module directory
- Review example input/output files in each module's directories
- Ensure all prerequisites are met before running scripts
When adding new features or scripts:
- Follow the existing directory structure
- Update the relevant module README
- Add example files where appropriate
- Update this main README if adding new modules