UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference). The collection comprises a total of 505,807 CEFR-labeled texts in 13 languages and 4 scripts (Latin, Arabic, Devanagari, and Cyrillic).
- English (en)
- Spanish (es)
- German (de)
- Dutch (nl)
- Czech (cs)
- Italian (it)
- French (fr)
- Estonian (et)
- Portuguese (pt)
- Arabic (ar)
- Hindi (hi)
- Russian (ru)
- Welsh (cy)
The project paper can be found here: https://arxiv.org/abs/2506.01419
To ensure interoperability, ease of transformation, and machine readability, we adopted a standardised JSON format for each CEFR-labeled text. The fields include the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.
| Field | Description |
|---|---|
| title | The unique title of the text as given in its original corpus (NA if the text has no title, e.g., CEFR-assessed sentences or paragraphs). |
| lang | The source language of the text in ISO 639-1 format (e.g., en for English). |
| source_name | The name of the source dataset the text was collected from, as indicated in that dataset's release, paper, and/or documentation (e.g., cambridge-exams from Xia et al., 2016). |
| format | The level of granularity of the text, as indicated in its source dataset, paper, and/or documentation. The recognized formats are: document-level, paragraph-level, discourse-level, sentence-level. |
| category | The classification of the text in terms of who created it: reference for texts created by experts, teachers, and language-learning professionals, and learner for texts written by language learners and students. |
| cefr_level | The CEFR level associated with the text. The six recognized CEFR levels are A1, A2, B1, B2, C1, and C2. A small fraction (<1%) of texts in UniversalCEFR are unlabelled, carry a plus sign (e.g., A1+), or have only a coarse level indicator (e.g., A, B). |
| license | The licensing information associated with the text (e.g., CC-BY-NC-SA, or Unknown if not stated). |
| text | The content of the text itself. |
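To make the schema concrete, here is a hypothetical record following the fields above (the values are invented for illustration, not drawn from a real corpus), round-tripped through JSON to show it is plain, machine-readable data:

```python
import json

# Hypothetical UniversalCEFR record; field names follow the schema above,
# but the values are made up for demonstration.
record = {
    "title": "NA",
    "lang": "en",
    "source_name": "cambridge-exams",
    "format": "document-level",
    "category": "reference",
    "cefr_level": "B2",
    "license": "CC BY-NC-SA 4.0",
    "text": "Museums across the city stay open late on Fridays.",
}

# Serialise and parse back to confirm the record survives a JSON round trip.
restored = json.loads(json.dumps(record, ensure_ascii=False))
print(restored["lang"], restored["cefr_level"])  # en B2
```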
The current compilation of UniversalCEFR comprises 26 CEFR-labeled, publicly accessible corpora that can be used for non-commercial research; derivations may be created as long as they follow the same license.
We provide an informative data directory covering proficiency-related information about the compiled datasets, including language, format, category, annotation method, distinct learner L1s, inter-annotator agreement, and license, which may be useful when working with UniversalCEFR.
| Corpus Name | Lang (ISO 639-1) | Format | Category | Size | Annotation Method | Expert Annotators | Distinct L1 | Inter-Annotator Agreement | CEFR Coverage | License | Resource |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cambridge-exams | en | document-level | reference | 331 | n/a | n/a | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Xia et al. (2016) |
| elg-cefr-en | en | document-level | reference | 712 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022) |
| cefr-sp | en | sentence-level | reference | 17,000 | manual | 2 | n/a | r = 0.75, 0.73 | A1-C2 | CC BY-NC-SA 4.0 | Arase et al. (2022) |
| elg-cefr-de | de | document-level | reference | 509 | manual | 3 | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Breukker (2022) |
| elg-cefr-nl | nl | document-level | reference | 3,596 | manual | 3 | n/a | n/a | A1-C2, plus | CC BY-NC-SA 4.0 | Breukker (2022) |
| icle500 | en | document-level | learner | 500 | manual | 28 | ur, pa, bg, zh, cs, nl, fi, fr, de, el, hu, it, ja, ko, lt, mk, no, fa, pl, pt, ru, sr, es, sv, tn, tr | Rasch kappa = -0.02 | A1-C2, plus | CC0 1.0 | Thwaites et al. (2024) |
| cefr-asag | en | paragraph-level | learner | 299 | manual | 3 | fr | Krippendorff alpha = 0.81 | A1-C2 | CC BY-NC-SA 4.0 | Tack et al. (2017) |
| merlin-cs | cs | paragraph-level | learner | 441 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A2-B2 | CC BY-SA 4.0 | Boyd et al. (2014) |
| merlin-it | it | paragraph-level | learner | 813 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-B1 | CC BY-SA 4.0 | Boyd et al. (2014) |
| merlin-de | de | paragraph-level | learner | 1,033 | manual | multiple | hu, de, fr, ru, pl, en, sk, es | n/a | A1-C1 | CC BY-SA 4.0 | Boyd et al. (2014) |
| hablacultura | es | paragraph-level | reference | 710 | manual | multiple | n/a | n/a | A2-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| kwiziq-es | es | document-level | reference | 206 | manual | multiple | n/a | n/a | A1-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| kwiziq-fr | fr | document-level | reference | 344 | manual | multiple | n/a | n/a | A1-C1 | CC BY-NC 4.0 | Original |
| caes | es | document-level | learner | 30,935 | computer-assisted | multiple | pt, zh, ar, fr, ru | n/a | A1-C1 | CC BY-NC 4.0 | Vasquez-Rodrigues et al. (2022) |
| deplain-web-doc | de | document-level | reference | 394 | manual | 2 | n/a | Cohen kappa = 0.85 | A1, A2, B2, C2 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023) |
| deplain-apa-doc | de | document-level | reference | 483 | manual | 2 | n/a | Cohen kappa = 0.85 | A2-B1 | CC-BY-SA-3, CC-BY-4, CC-BY-NC-ND-4, save_use_share | Stodden et al. (2023) |
| deplain-apa-sent | de | sentence-level | reference | 483 | manual | 2 | n/a | n/a | A2-B2 | By request | Stodden et al. (2023) |
| elle | et | paragraph-level, document-level | learner | 1,697 | manual | 2 | n/a | n/a | A2-C1 | CC BY 4.0 | Vajjala and Rama (2018) |
| efcamdat-cleaned | en | sentence-level, paragraph-level | learner | 406,062 | manual | n/a | br, zh, tw, ru, sa, mx, de, it, fr, jp, tr | n/a | A1-C1 | Cambridge | Geertzen et al. (2013), Shatz (2020), Huang et al. (2020) |
| beast2019 | en | sentence-level | learner | 3,600 | manual | multiple | n/a | n/a | A1-C2 | CC BY-NC-SA 4.0 | Bryant et al. (2019) |
| peapl2 | pt | paragraph-level | learner | 481 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C2 | CC BY-NC-SA 4.0 | Martins et al. (2019) |
| cople2 | pt | paragraph-level | learner | 942 | manual | n/a | zh, en, es, de, ru, fr, ja, it, nl, ar, pl, ko, ro, sv | n/a | A1-C1 | CC BY-NC-SA 4.0 | Mendes et al. (2016) |
| zaebuc | ar | paragraph-level | learner | 214 | manual | 3 | en | Unnamed kappa = 0.99 | A2-C1 | CC BY-NC-SA 4.0 | Habash and Palfreyman (2022) |
| readme | ar, en, fr, hi, ru | sentence-level | reference | 9,757 | computer-assisted | 2 | n/a | Krippendorff kappa = 0.67, 0.78 | A1-C2 | CC BY-NC-SA 4.0 | Naous et al. (2024) |
| apa-lha | de | document-level | reference | 3,130 | n/a | n/a | n/a | n/a | A2-B1 | Public | Spring et al. (2021) |
| learn-welsh | cy | document-level, sentence-level, discourse-level | reference | 1,372 | manual | n/a | n/a | n/a | A1-A2 | Public | Original |
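Because every corpus is transformed into the same schema, records from different source datasets can be pooled and filtered uniformly. A minimal sketch, using invented records, of selecting learner-written German texts and tallying their CEFR levels:

```python
from collections import Counter

# Invented records following the UniversalCEFR schema (not real corpus data).
records = [
    {"lang": "de", "category": "learner", "cefr_level": "B1", "source_name": "merlin-de"},
    {"lang": "de", "category": "learner", "cefr_level": "A2", "source_name": "merlin-de"},
    {"lang": "de", "category": "reference", "cefr_level": "B2", "source_name": "elg-cefr-de"},
    {"lang": "cs", "category": "learner", "cefr_level": "B1", "source_name": "merlin-cs"},
]

# Pooled filtering works the same way regardless of which corpus a record came from.
german_learner = [r for r in records if r["lang"] == "de" and r["category"] == "learner"]
level_counts = Counter(r["cefr_level"] for r in german_learner)
print(level_counts)
```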
If you're interested in a specific dataset or group of datasets from UniversalCEFR, you can access their transformed, standardised versions through the UniversalCEFR Huggingface Org: https://huggingface.co/UniversalCEFR
If you use any of the datasets indexed in UniversalCEFR, please cite the original dataset papers they are associated with. You can find them in the data directory above.
Note that a few datasets in UniversalCEFR---EFCAMDAT, APA-LHA, BEA Shared Task 2019 Write and Improve, and DEPlain---are not directly available from the UniversalCEFR Huggingface Org, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you've done this, you can use the preprocessing Python scripts in the universal-cefr-experiments repository to transform the raw versions into the UniversalCEFR format.
For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial (jmri20@bath.ac.uk).
Please use the following information when citing UniversalCEFR:
BibTeX format:
@inproceedings{imperial-etal-2025-universalcefr,
title = "{U}niversal{CEFR}: Enabling Open Multilingual Research on Language Proficiency Assessment",
author = "Imperial, Joseph Marvin and
Barayan, Abdullah and
Stodden, Regina and
Wilkens, Rodrigo and
Mu{\~n}oz S{\'a}nchez, Ricardo and
Gao, Lingyun and
Torgbi, Melissa and
Knight, Dawn and
Forey, Gail and
Jablonkai, Reka R. and
Kochmar, Ekaterina and
Reynolds, Robert Joshua and
Ribeiro, Eug{\'e}nio and
Saggion, Horacio and
Volodina, Elena and
Vajjala, Sowmya and
Fran{\c{c}}ois, Thomas and
Alva-Manchego, Fernando and
Tayyar Madabushi, Harish",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.491/",
doi = "10.18653/v1/2025.emnlp-main.491",
pages = "9714--9766",
ISBN = "979-8-89176-332-6",
abstract = "We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community."
}