The UniversalCEFR Data Directory

UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference). The collection comprises a total of 505,807 CEFR-labeled texts annotated in 13 languages and 4 scripts (Latin, Arabic, Devanagari, and Cyrillic):

  • English (en)
  • Spanish (es)
  • German (de)
  • Dutch (nl)
  • Czech (cs)
  • Italian (it)
  • French (fr)
  • Estonian (et)
  • Portuguese (pt)
  • Arabic (ar)
  • Hindi (hi)
  • Russian (ru)
  • Welsh (cy)

The project paper can be found here: https://arxiv.org/abs/2506.01419

UniversalCEFR Data Format / Schema

To ensure interoperability, ease of transformation, and machine readability, we adopted a standardised JSON format for each CEFR-labeled text. The fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.

  • title: The unique title of the text retrieved from its original corpus (NA if there is no title, as with CEFR-assessed sentences or paragraphs).
  • lang: The source language of the text in ISO 639-1 format (e.g., en for English).
  • source_name: The name of the source dataset from which the text was collected, as indicated by its source dataset, paper, and/or documentation (e.g., cambridge-exams from Xia et al., 2016).
  • format: The level of granularity of the text, as indicated by its source dataset, paper, and/or documentation. The recognized formats are: [document-level, paragraph-level, discourse-level, sentence-level].
  • category: The classification of the text in terms of who created the material. The recognized categories are reference for texts created by experts, teachers, and language learning professionals, and learner for texts written by language learners and students.
  • cefr_level: The CEFR level associated with the text. The six recognized CEFR levels are: [A1, A2, B1, B2, C1, C2]. A small fraction (<1%) of texts in UniversalCEFR are unlabelled, carry plus levels (e.g., A1+), or have only a coarse level indicator (e.g., A, B).
  • license: The licensing information associated with the text (e.g., CC-BY-NC-SA, or Unknown if not stated).
  • text: The actual content of the text itself.
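
For illustration, a single record following this schema might look like the Python sketch below; the field names come from the list above, while the concrete values and the check_record helper are made up for the example.

```python
# A minimal sketch of one UniversalCEFR record, assuming the schema above.
# The field values are illustrative, not taken from a real corpus entry.
EXPECTED_FIELDS = {
    "title", "lang", "source_name", "format",
    "category", "cefr_level", "license", "text",
}

example_record = {
    "title": "NA",                      # NA when the source provides no title
    "lang": "en",                       # ISO 639-1 language code
    "source_name": "cambridge-exams",   # source dataset identifier
    "format": "document-level",         # document/paragraph/discourse/sentence-level
    "category": "reference",            # "reference" or "learner"
    "cefr_level": "B2",                 # one of A1, A2, B1, B2, C1, C2
    "license": "CC BY-NC-SA 4.0",
    "text": "An example CEFR-labeled passage would go here.",
}

def check_record(record: dict) -> bool:
    """Return True if the record carries exactly the expected schema fields."""
    return set(record) == EXPECTED_FIELDS

assert check_record(example_record)
```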

The UniversalCEFR Data Directory

The current compilation for UniversalCEFR is composed of 26 publicly accessible CEFR-labeled corpora, which can be used for non-commercial research; derivations may be created as long as they follow the same licenses.

We provide an informative data directory covering proficiency-related information about the compiled datasets, including language, format, category, annotation method, distinct learner L1s, inter-annotator agreement, and license, which may be useful when working with UniversalCEFR.

Corpus Name | Lang (ISO 639-1) | Format | Category | Size | Annotation Method | Expert Annotators | Distinct L1 | Inter-Annotator Agreement | CEFR Coverage | License | Resource
cambridge-exams | en | document-level | reference | 331 | n/a | n/a | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Xia et al. (2016)
elg-cefr-en | en | document-level | reference | 712 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022)
cefr-sp | en | sentence-level | reference | 17,000 | manual | 2 | n/a | r = 0.75, 0.73 | A1-C2 | CC BY-NC-SA 4.0 | Arase et al. (2022)
elg-cefr-de | de | document-level | reference | 509 | manual | 3 | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Breukker (2022)
elg-cefr-nl | nl | document-level | reference | 3,596 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022)
icle500 | en | document-level | learner | 500 | manual | 28 | ur, pa, bg, zh, cs, nl, fi, fr, de, el, hu, it, ja, ko, lt, mk, no, fa, pl, pt, ru, sr, es, sv, tn, tr | Rasch kappa = -0.02 | A1-C2, plus | CC0 1.0 | Thwaites et al. (2024)
cefr-asag | en | paragraph-level | learner | 299 | manual | 3 | fr | Krippendorff alpha = 0.81 | A1-C2 | CC BY-NC-SA 4.0 | Tack et al. (2017)
merlin-cs | cs | paragraph-level | learner | 441 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A2-B2 | CC BY-SA 4.0 | Boyd et al. (2014)
merlin-it | it | paragraph-level | learner | 813 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-B1 | CC BY-SA 4.0 | Boyd et al. (2014)
merlin-de | de | paragraph-level | learner | 1,033 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-C1 | CC BY-SA 4.0 | Boyd et al. (2014)
hablacultura | es | paragraph-level | reference | 710 | manual | multiple | n/a | n/a | A2-C1 | CC BY NC 4.0 | Vasquez-Rodrigues et al. (2022)
kwiziq-es | es | document-level | reference | 206 | manual | multiple | n/a | n/a | A1-C1 | CC BY NC 4.0 | Vasquez-Rodrigues et al. (2022)
kwiziq-fr | fr | document-level | reference | 344 | manual | multiple | n/a | n/a | A1-C1 | CC BY NC 4.0 | Original
caes | es | document-level | learner | 30,935 | computer-assisted | multiple | pt, zh, ar, fr, ru | n/a | A1-C1 | CC BY NC 4.0 | Vasquez-Rodrigues et al. (2022)
deplain-web-doc | de | document-level | reference | 394 | manual | 2 | n/a | Cohen kappa = 0.85 | A1, A2, B2, C2 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023)
deplain-apa-doc | de | document-level | reference | 483 | manual | 2 | n/a | Cohen kappa = 0.85 | A2-B1 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023)
deplain-apa-sent | de | sentence-level | reference | 483 | manual | 2 | n/a | n/a | A2-B2 | By request | Stodden et al. (2023)
elle | et | paragraph-level, document-level | learner | 1,697 | manual | 2 | n/a | n/a | A2-C1 | CC BY 4.0 | Vajjala and Rama (2018)
efcamdat-cleaned | en | sentence-level, paragraph-level | learner | 406,062 | manual | n/a | br, zh, tw, ru, sa, mx, de, it, fr, jp, tr | n/a | A1-C1 | Cambridge | Geertzen et al. (2013), Shatz (2020), Huang et al. (2020)
beast2019 | en | sentence-level | learner | 3,600 | manual | multiple | n/a | n/a | A1-C2 | CC BY SA NC 4.0 | Bryant et al. (2019)
peapl2 | pt | paragraph-level | learner | 481 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C2 | CC BY SA NC 4.0 | Martins et al. (2019)
cople2 | pt | paragraph-level | learner | 942 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C1 | CC BY SA NC 4.0 | Mendes et al. (2016)
zaebuc | ar | paragraph-level | learner | 214 | manual | 3 | en | Unnamed kappa = 0.99 | A2-C1 | CC BY SA NC 4.0 | Habash and Palfreyman (2022)
readme | ar, en, fr, hi, ru | sentence-level | reference | 9,757 | computer-assisted | 2 | n/a | Krippendorff kappa = 0.67, 0.78 | A1-C2 | CC BY SA NC 4.0 | Naous et al. (2024)
apa-lha | de | document-level | reference | 3,130 | n/a | n/a | n/a | n/a | A2-B1 | Public | Spring et al. (2021)
learn-welsh | cy | document-level, sentence-level, discourse-level | reference | 1,372 | manual | n/a | n/a | n/a | A1-A2 | Public | Original
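
The directory itself can also be handled programmatically. The sketch below encodes three rows copied from the table above as plain dictionaries and filters them by language and category; the dictionary keys and the corpora_for helper are illustrative, not an official API.

```python
# A sketch of querying the data directory programmatically.
# The three rows below are copied from the table above; the dictionary
# keys themselves are illustrative, not an official API.
directory = [
    {"corpus": "cambridge-exams", "lang": ["en"], "format": "document-level",
     "category": "reference", "size": 331},
    {"corpus": "merlin-de", "lang": ["de"], "format": "paragraph-level",
     "category": "learner", "size": 1033},
    {"corpus": "readme", "lang": ["ar", "en", "fr", "hi", "ru"],
     "format": "sentence-level", "category": "reference", "size": 9757},
]

def corpora_for(lang: str, category: str | None = None) -> list[str]:
    """List corpora covering a language, optionally restricted to one category."""
    return [
        row["corpus"] for row in directory
        if lang in row["lang"] and (category is None or row["category"] == category)
    ]

print(corpora_for("en"))                      # ['cambridge-exams', 'readme']
print(corpora_for("en", category="learner"))  # []
```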

Accessing UniversalCEFR

If you're interested in a specific dataset or group of datasets from UniversalCEFR, you may access their transformed, standardised versions through the UniversalCEFR Huggingface Org: https://huggingface.co/UniversalCEFR
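
For example, with the datasets library installed, loading a single corpus might look like the sketch below; the repository id and split name are assumptions for illustration, so check the org page for the actual dataset names.

```python
# A sketch of loading one UniversalCEFR dataset from the Hugging Face Hub.
# The dataset id below is a hypothetical example; see the UniversalCEFR org
# page (https://huggingface.co/UniversalCEFR) for the real repository names.
from datasets import load_dataset

dataset = load_dataset("UniversalCEFR/cefr_sp_en")  # hypothetical repo id

# Records follow the standardised schema described above; "train" is an
# assumed split name.
sample = dataset["train"][0]
print(sample["lang"], sample["cefr_level"], sample["text"][:80])
```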

If you use any of the datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them in the data directory above.

Note that a few datasets in UniversalCEFR---EFCAMDAT, APA-LHA, BEA Shared Task 2019 Write and Improve, and DEPlain---are not directly available from the UniversalCEFR Huggingface Org, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the universal-cefr-experiments repository to transform the raw versions into the UniversalCEFR format.

Contact

For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial (jmri20@bath.ac.uk).

Reference

Please use the following information when citing UniversalCEFR:

BibTeX format:

@inproceedings{imperial-etal-2025-universalcefr,
    title = "{U}niversal{CEFR}: Enabling Open Multilingual Research on Language Proficiency Assessment",
    author = "Imperial, Joseph Marvin  and
      Barayan, Abdullah  and
      Stodden, Regina  and
      Wilkens, Rodrigo  and
      Mu{\~n}oz S{\'a}nchez, Ricardo  and
      Gao, Lingyun  and
      Torgbi, Melissa  and
      Knight, Dawn  and
      Forey, Gail  and
      Jablonkai, Reka R.  and
      Kochmar, Ekaterina  and
      Reynolds, Robert Joshua  and
      Ribeiro, Eug{\'e}nio  and
      Saggion, Horacio  and
      Volodina, Elena  and
      Vajjala, Sowmya  and
      Fran{\c{c}}ois, Thomas  and
      Alva-Manchego, Fernando  and
      Tayyar Madabushi, Harish",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.491/",
    doi = "10.18653/v1/2025.emnlp-main.491",
    pages = "9714--9766",
    ISBN = "979-8-89176-332-6",
    abstract = "We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community."
}

Written with StackEdit.
