This repository hosts the code and data used for the GEM Workshop 2023 paper "Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models" by Joseph Imperial and Harish Tayyar Madabushi.
Readability Standard Alignment is the task proposed in the paper for evaluating whether instruction-tuned large language models (listed below) can match the readability levels of generated texts to specifications provided in the prompts. In the paper, we iteratively enrich prompts with details from the Flesch-Kincaid Grade Level and the [CEFR Framework](https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions) to see whether alignment with the specified target reading level improves. We do this for two tasks: prompt-based story completion and prompt-based narrative simplification. Please read the paper for the outcomes of the experiments.
- ChatGPT (GPT-3.5 Turbo)
- Llama-7B
- FlanT5-250M
- BLOOMZ-3B
- LongForm T5 XL-3B
- Dolly-3B
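One of the two readability standards above, the Flesch-Kincaid Grade Level, can be computed directly from surface text statistics: `0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59`. The sketch below is a minimal illustration of that formula, not the paper's implementation; it assumes a naive vowel-group syllable heuristic, whereas production tools typically use a pronouncing dictionary.

```python
import re

def count_syllables(word):
    # Rough heuristic: count vowel groups, discount a trailing silent "e".
    # Real implementations use a pronouncing dictionary (e.g. CMUdict).
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def fkgl(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Higher scores correspond to harder texts (roughly US school grade levels), which is why longer sentences and longer words both push the score up.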
The data folder contains the generated texts per model per task. Please see the dedicated sections in the paper for the generation specifications used for each task.
The code folder contains the scripts used to prompt the instruction-tuned models to generate story completions or simplifications.
If you need any help reproducing the results, please don't hesitate to contact me via the details below:
Joseph Marvin Imperial
jmri20@bath.ac.uk
www.josephimperial.com