UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference). The collection comprises a total of 505,807 CEFR-labeled texts in 13 languages and 4 scripts (Latin, Arabic, Devanagari, and Cyrillic).
- English (en)
- Spanish (es)
- German (de)
- Dutch (nl)
- Czech (cs)
- Italian (it)
- French (fr)
- Estonian (et)
- Portuguese (pt)
- Arabic (ar)
- Hindi (hi)
- Russian (ru)
- Welsh (cy)
The project paper can be found here: https://arxiv.org/abs/2506.01419
To ensure interoperability, ease of transformation, and machine readability, we adopted a standardised JSON format for each CEFR-labeled text. The fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.
| Field | Description |
|---|---|
| title | The unique title of the text as given in its original corpus (NA if the text has no title, e.g., CEFR-assessed sentences or paragraphs). |
| lang | The source language of the text in ISO 639-1 format (e.g., en for English). |
| source_name | The name of the source dataset the text was collected from, as indicated in that dataset's release, paper, and/or documentation (e.g., cambridge-exams from Xia et al., 2016). |
| format | The level of granularity of the text, as indicated in its source dataset, paper, and/or documentation. The recognized formats are: document-level, paragraph-level, discourse-level, sentence-level. |
| category | The classification of the text in terms of who created it: reference for texts created by experts, teachers, and language-learning professionals, and learner for texts written by language learners and students. |
| cefr_level | The CEFR level associated with the text. The six recognized CEFR levels are A1, A2, B1, B2, C1, and C2. A small fraction (<1%) of texts in UniversalCEFR are unlabelled, carry a plus sign (e.g., A1+), or have only a coarse level indicator (e.g., A, B). |
| license | The licensing information associated with the text (e.g., CC-BY-NC-SA, or Unknown if not stated). |
| text | The content of the text itself. |
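To make the schema concrete, here is a hypothetical record following the fields above (the values are invented for illustration, not drawn from a real corpus), round-tripped through JSON to show it is plain, machine-readable data:

```python
import json

# Hypothetical UniversalCEFR record; field names follow the schema above,
# but the values are made up for demonstration.
record = {
    "title": "NA",
    "lang": "en",
    "source_name": "cambridge-exams",
    "format": "document-level",
    "category": "reference",
    "cefr_level": "B2",
    "license": "CC BY-NC-SA 4.0",
    "text": "Museums across the city stay open late on Fridays.",
}

# Serialise and parse back to confirm the record survives a JSON round trip.
restored = json.loads(json.dumps(record, ensure_ascii=False))
print(restored["lang"], restored["cefr_level"])  # en B2
```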
The current compilation of UniversalCEFR comprises 26 CEFR-labeled, publicly accessible corpora that can be used for non-commercial research; derivations may be created as long as they follow the same license.
We provide an informative data directory covering proficiency-related information about the compiled datasets, including language, format, category, annotation method, distinct learner L1s, inter-annotator agreement, and license, which may be useful when working with UniversalCEFR.
| Corpus Name | Lang (ISO 639-1) | Format | Category | Size | Annotation Method | Expert Annotators | Distinct L1 | Inter-Annotator Agreement | CEFR Coverage | License | Resource |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cambridge-exams | en | document-level | reference | 331 | n/a | n/a | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Xia et al. (2016) |
| elg-cefr-en | en | document-level | reference | 712 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022) |
| cefr-sp | en | sentence-level | reference | 17,000 | manual | 2 | n/a | r = 0.75, 0.73 | A1-C2 | CC BY-NC-SA 4.0 | Arase et al. (2022) |
| elg-cefr-de | de | document-level | reference | 509 | manual | 3 | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Breukker (2022) |
| elg-cefr-nl | nl | document-level | reference | 3,596 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022) |
| icle500 | en | document-level | learner | 500 | manual | 28 | ur, pa, bg, zh, cs, nl, fi, fr, de, el, hu, it, ja, ko, lt, mk, no, fa, pl, pt, ru, sr, es, sv, tn, tr | Rasch kappa = -0.02 | A1-C2, plus | CC0 1.0 | Thwaites et al. (2024) |
| cefr-asag | en | paragraph-level | learner | 299 | manual | 3 | fr | Krippendorff alpha = 0.81 | A1-C2 | CC BY-NC-SA 4.0 | Tack et al. (2017) |
| merlin-cs | cs | paragraph-level | learner | 441 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A2-B2 | CC BY-SA 4.0 | Boyd et al. (2014) |
| merlin-it | it | paragraph-level | learner | 813 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-B1 | CC BY-SA 4.0 | Boyd et al. (2014) |
| merlin-de | de | paragraph-level | learner | 1,033 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-C1 | CC BY-SA 4.0 | Boyd et al. (2014) |
| hablacultura | es | paragraph-level | reference | 710 | manual | multiple | n/a | n/a | A2-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| kwiziq-es | es | document-level | reference | 206 | manual | multiple | n/a | n/a | A1-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| kwiziq-fr | fr | document-level | reference | 344 | manual | multiple | n/a | n/a | A1-C1 | CC BY-NC 4.0 | Original |
| caes | es | document-level | learner | 30,935 | computer-assisted | multiple | pt, zh, ar, fr, ru | n/a | A1-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| deplain-web-doc | de | document-level | reference | 394 | manual | 2 | n/a | Cohen kappa = 0.85 | A1, A2, B2, C2 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023) |
| deplain-apa-doc | de | document-level | reference | 483 | manual | 2 | n/a | Cohen kappa = 0.85 | A2-B1 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023) |
| deplain-apa-sent | de | sentence-level | reference | 483 | manual | 2 | n/a | n/a | A2-B2 | By request | Stodden et al. (2023) |
| elle | et | paragraph-level, document-level | learner | 1,697 | manual | 2 | n/a | n/a | A2-C1 | CC BY 4.0 | Vajjala and Rama (2018) |
| efcamdat-cleaned | en | sentence-level, paragraph-level | learner | 406,062 | manual | n/a | br, zh, tw, ru, sa, mx, de, it, fr, jp, tr | n/a | A1-C1 | Cambridge | Geertzen et al. (2013), Shatz (2020), Huang et al. (2020) |
| beast2019 | en | sentence-level | learner | 3,600 | manual | multiple | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Bryant et al. (2019) |
| peapl2 | pt | paragraph-level | learner | 481 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C2 | CC BY-NC-SA 4.0 | Martins et al. (2019) |
| cople2 | pt | paragraph-level | learner | 942 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C1 | CC BY-NC-SA 4.0 | Mendes et al. (2016) |
| zaebuc | ar | paragraph-level | learner | 214 | manual | 3 | en | Unnamed kappa = 0.99 | A2-C1 | CC BY-NC-SA 4.0 | Habash and Palfreyman (2022) |
| readme | ar, en, fr, hi, ru | sentence-level | reference | 9,757 | computer-assisted | 2 | n/a | Krippendorff kappa = 0.67, 0.78 | A1-C2 | CC BY-NC-SA 4.0 | Naous et al. (2024) |
| apa-lha | de | document-level | reference | 3,130 | n/a | n/a | n/a | n/a | A2-B1 | Public | Spring et al. (2021) |
| learn-welsh | cy | document-level, sentence-level, discourse-level | reference | 1,372 | manual | n/a | n/a | n/a | A1-A2 | Public | Original |
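Because every corpus is transformed into the same schema, records from different source datasets can be pooled and filtered uniformly. A minimal sketch, using invented records, of selecting learner-written German texts and tallying their CEFR levels:

```python
from collections import Counter

# Invented records following the UniversalCEFR schema (not real corpus data).
records = [
    {"lang": "de", "category": "learner", "cefr_level": "B1", "source_name": "merlin-de"},
    {"lang": "de", "category": "learner", "cefr_level": "A2", "source_name": "merlin-de"},
    {"lang": "de", "category": "reference", "cefr_level": "B2", "source_name": "elg-cefr-de"},
    {"lang": "cs", "category": "learner", "cefr_level": "B1", "source_name": "merlin-cs"},
]

# Pooled filtering works the same way regardless of which corpus a record came from.
german_learner = [r for r in records if r["lang"] == "de" and r["category"] == "learner"]
level_counts = Counter(r["cefr_level"] for r in german_learner)
print(level_counts)
```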
If you're interested in a specific dataset or group of datasets from UniversalCEFR, you can access their transformed, standardised versions through the UniversalCEFR Huggingface Org: https://huggingface.co/UniversalCEFR
If you use any of the datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them in the data directory above.
Note that a few datasets in UniversalCEFR---EFCAMDAT, APA-LHA, BEA Shared Task 2019 Write and Improve, and DEPlain---are not directly available from the UniversalCEFR Huggingface Org, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the universal-cefr-experiments repository to transform the raw versions into the UniversalCEFR format.
For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial (jmri20@bath.ac.uk).
Please use the following information when citing UniversalCEFR:
BibTeX format:
@inproceedings{imperial-etal-2025-universalcefr,
title = "{U}niversal{CEFR}: Enabling Open Multilingual Research on Language Proficiency Assessment",
author = "Imperial, Joseph Marvin and
Barayan, Abdullah and
Stodden, Regina and
Wilkens, Rodrigo and
Mu{\~n}oz S{\'a}nchez, Ricardo and
Gao, Lingyun and
Torgbi, Melissa and
Knight, Dawn and
Forey, Gail and
Jablonkai, Reka R. and
Kochmar, Ekaterina and
Reynolds, Robert Joshua and
Ribeiro, Eug{\'e}nio and
Saggion, Horacio and
Volodina, Elena and
Vajjala, Sowmya and
Fran{\c{c}}ois, Thomas and
Alva-Manchego, Fernando and
Tayyar Madabushi, Harish",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.491/",
doi = "10.18653/v1/2025.emnlp-main.491",
pages = "9714--9766",
ISBN = "979-8-89176-332-6",
abstract = "We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community."
}