This is a working repository to convert EIKEN past exam PDFs into machine-friendly structured data (AST/JSON), and then generate a study-oriented vocabulary dataset (frequency, grade-level scores, translations/examples via LLM). The viewer/ app lets you quickly browse the generated metadata.jsonl.
ast/: TypeScript AST definitionssteps/: pipeline steps (Python + TypeScript)data/: intermediate artifacts and outputsviewer/: a simple viewer fordata/vocab/metadata.jsonl(SolidJS + Vite)
- Python
3.11(see.python-version) - Use existing venv:
source .venv/bin/activate
- With
uv(uv.lockexists):uv sync && source .venv/bin/activate
- Without
uv(example):python -m venv .venv && source .venv/bin/activatepip install -e .
- Root:
bun install
- Viewer:
cd viewer && bun install
All inputs/outputs live under data/ by default.
- Goal: merge question/answer/script PDFs into a single, easier-to-process PDF
- Code:
steps/1-preprocess-pdf/*.py(notebook:steps/1-preprocess-pdf/main.ipynb) - Output (example):
data/eiken_combined/{grade}_{year}_{admin}.pdf
- Goal: read PDFs per section and emit minified XML-like text
- Code:
steps/2-format-by-llm/index.ts - Input:
data/eiken_combined/*.pdf - Output:
data/output/{grade}_{year}_{admin}/reading_*.txt/listening_*.txt - Run (example):
bun steps/2-format-by-llm/index.ts- Dependency:
@evex/rakutenai(authentication is configured in your runtime environment)
- Dependency:
Converts data/output/**/*.txt (HTML/XML-ish) into data/parsed/*.json matching ast/*.
- Run:
bun run step:3:parse-llm-output - Script:
steps/3-parse-llm-output/output_html_to_ast_json.ts
- Goal: collect strings from parsed JSON as inputs for vocabulary counting
- Code:
steps/4-build-wordlist-source/main.ts - Input:
data/parsed/*.json - Output:
data/wordlist-sources/*.json - Run (example):
bun steps/4-build-wordlist-source/main.ts
- Goal: count unigrams/phrases and produce per-grade scores as CSV
- Code:
steps/5-build-vocablist/main.ipynb,steps/5-build-vocablist/count.py - Output (example):
data/wordlist-scores/vocablist_by_grade.csv
- Goal: enrich scored vocab items into a learning-friendly JSONL
- Code:
steps/6-add-context/main.ts - Input:
data/wordlist-scores/vocablist_by_grade.csv - Output:
data/vocab/metadata.jsonl - Run (example):
bun steps/6-add-context/main.ts- Dependency:
@evex/rakutenai(authentication is configured in your runtime environment)
- Dependency:
viewer/ is a small local UI to search/sort metadata.jsonl and copy TSV (e.g. for Quizlet).
- Copy
data/vocab/metadata.jsonltoviewer/src/assets/metadata.jsonl - Start:
cd viewer && bun run dev