Extraction and normalization of an (initially) Attic Prose tagged corpus for the Greek Learner Texts Project.

greek-learner-texts/vocabulary-corpus-prep

Greek Learner Texts Vocabulary Corpus Prep

Extraction and normalization of an (initially) Attic Prose tagged corpus for the Greek Learner Texts Project.

  • tagged-texts/ contains the extracted files (with minor manual corrections).
  • scripts/gather.py did the initial extraction.
  • counts.tsv gives current token counts.
  • scripts/stats.py produced those counts.
  • base-texts/ contains the chunked base texts.
  • scripts/extract_base.py produced those chunked base texts.
  • tokenized-texts/ contains tokenized base texts.
  • scripts/tokens.py produced those tokenized base texts.
  • aligned-tagging/ contains initial alignment of different taggings of each text.
  • scripts/align.py produced those alignments.
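
The counting step can be illustrated with a minimal sketch. This is not the code in scripts/stats.py, and the README does not describe the file formats, so everything here is an assumption: tokens are taken to be whitespace-separated, and the two-column layout (file name, token count, plus a TOTAL row) is a hypothetical stand-in for whatever counts.tsv actually contains. The function names are likewise invented for illustration.

```python
import csv
import io
from pathlib import Path


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens (assumed tokenization, not the repo's)."""
    return len(text.split())


def counts_for_directory(directory: Path) -> dict[str, int]:
    """Map each *.txt file name in `directory` to its token count.

    The *.txt extension is an assumption; adjust the glob to the real layout.
    """
    counts = {}
    for path in sorted(directory.glob("*.txt")):
        counts[path.name] = count_tokens(path.read_text(encoding="utf-8"))
    return counts


def counts_to_tsv(counts: dict[str, int]) -> str:
    """Render counts as tab-separated text with a header and a TOTAL row.

    The column names are hypothetical, not taken from counts.tsv.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["file", "tokens"])
    for name, n in counts.items():
        writer.writerow([name, n])
    writer.writerow(["TOTAL", sum(counts.values())])
    return buffer.getvalue()
```

Under these assumptions, something like `print(counts_to_tsv(counts_for_directory(Path("tokenized-texts"))))` would print one row per file followed by the total.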

Works Included

  • Thucydides (0003) 001 (Books 1–3)
  • Isocrates (0010) 007 008 009 011 019 021
  • Demosthenes (0014) 001 004 005 006 018 020 021
  • Xenophon (0032) Anabasis (006)
  • Plato (0059) Euthyphro (001) Apology (002) Crito (003) Symposium (011) Republic (030)
  • Lysias (0540) 001 002 003 004 005 006 007 008 009 010 012 013 014 015 016 017 018 019 020 022 023 025 026 032 033

Datasets Included
