Greek Learner Texts Vocabulary Corpus Prep

Extraction and normalization of a tagged corpus of (initially) Attic prose for the Greek Learner Texts Project.

tagged-texts/ contains the extracted files (with minor manual corrections).
scripts/gather.py did the initial extraction.
counts.tsv gives current token counts.
scripts/stats.py produced those counts.
base-texts/ contains the chunked base texts.
scripts/extract_base.py produced those chunked base texts.
tokenized-texts/ contains tokenized base texts.
scripts/tokens.py produced those tokenized base texts.
aligned-tagging/ contains initial alignment of different taggings of each text.
scripts/align.py produced those alignments.
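
As a rough illustration of what aligning two taggings of the same text involves, the sketch below pairs up tokens using Python's standard difflib module. It is not the code in scripts/align.py, whose method is not described here. The function name align_taggings and the sample tokens are hypothetical, and the sketch assumes each tagging has already been reduced to a plain list of token strings.

```python
from difflib import SequenceMatcher


def align_taggings(tokens_a, tokens_b):
    """Pair tokens from two taggings of one text (illustrative only).

    Returns a list of (token_a, token_b) tuples. None marks a position
    where one tagging has a token and the other has no counterpart.
    """
    tokens_a = list(tokens_a)
    tokens_b = list(tokens_b)
    # autojunk=False stops difflib from ignoring very frequent tokens
    # (such as the article or "καὶ") in long sequences.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    pairs = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        left = tokens_a[a1:a2]
        right = tokens_b[b1:b2]
        if op != "equal":
            # replace / delete / insert: pad the shorter side with None
            width = max(len(left), len(right))
            left += [None] * (width - len(left))
            right += [None] * (width - len(right))
        pairs.extend(zip(left, right))
    return pairs


if __name__ == "__main__":
    # Hypothetical input: the second tagging is missing one token.
    first = ["Δαρείου", "καὶ", "Παρυσάτιδος", "γίγνονται", "παῖδες", "δύο"]
    second = ["Δαρείου", "Παρυσάτιδος", "γίγνονται", "παῖδες", "δύο"]
    for a, b in align_taggings(first, second):
        print(a, b, sep="\t")
```

Rows containing None are the places where two taggings disagree about tokenization, which is where manual review would most likely be needed.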

Works Included

Thucydides (0003) 001 (Books 1–3)
Isocrates (0010) 007 008 009 011 019 021
Demosthenes (0014) 001 004 005 006 018 020 021
Xenophon (0032) Anabasis (006)
Plato (0059) Euthyphro (001) Apology (002) Crito (003) Symposium (011) Republic (030)
Lysias (0540) 001 002 003 004 005 006 007 008 009 010 012 013 014 015 016 017 018 019 020 022 023 025 026 032 033
