Official implementation of the paper published in Knowledge-Based Systems (2024).
ORUGA (Optimizing Readability Using Genetic Algorithms) is an unsupervised framework designed to automatically enhance the readability of text. Unlike deep learning approaches that require massive training datasets, ORUGA uses evolutionary strategies (Genetic Algorithms) to minimize complexity metrics (like FKGL) while preserving semantic meaning.
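For orientation, FKGL (Flesch-Kincaid Grade Level) is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is a minimal illustration of that formula, not the code used in this repository (the warning further down indicates the project relies on `py-readability-metrics`). The tokenization and the `count_syllables` heuristic are simplifying assumptions.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a text; lower means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Shorter sentences and words with fewer syllables both lower the score, which is why synonym replacement can reduce it.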
If you utilize this framework or code in your research, please cite the following paper:
@article{martinez2024oruga,
author = {Jorge Martinez-Gil},
title = {Optimizing readability using genetic algorithms},
journal = {Knowledge-Based Systems},
volume = {284},
pages = {111273},
year = {2024},
issn = {0950-7051},
doi = {10.1016/j.knosys.2023.111273}
}
For a general-audience overview of the concepts behind this framework, refer to this three-part series on Medium:
- Part 1: Introduction to Readability Optimization
- Part 2: Implementation Details
- Part 3: Advanced Optimization Strategies
To reproduce the experiments, install the dependencies:
pip install -r requirements.txt
Warning: CRITICAL DEPENDENCY CONFLICT
- Package Name: Ensure you use `pygad==2.1.0`. Newer versions may cause compatibility errors with the evolutionary logic.
- Namespace Conflict: There is a known namespace collision between `Readability` and `readability-lxml`.
  - This project uses `py-readability-metrics`.
  - Do not install `readability-lxml` in the same environment, or the imports will fail.
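One way to catch both problems before running an experiment is a small pre-flight check. The sketch below uses only the standard library's `importlib.metadata`. The function names and the `lookup` parameter are hypothetical helpers introduced here for illustration and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_PYGAD = "2.1.0"          # version pin stated in this README
CONFLICTING = "readability-lxml"  # should not share the environment


def installed_version(dist_name, lookup=version):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return lookup(dist_name)
    except PackageNotFoundError:
        return None


def check_environment(lookup=version):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    pygad = installed_version("pygad", lookup)
    if pygad != REQUIRED_PYGAD:
        problems.append(f"pygad=={REQUIRED_PYGAD} expected, found {pygad}")
    if installed_version(CONFLICTING, lookup) is not None:
        problems.append(f"{CONFLICTING} is installed and may shadow py-readability-metrics")
    return problems
```

Calling `check_environment()` with no arguments inspects the active environment. The `lookup` argument exists only so the check can be exercised without installing anything.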
The repository allows you to reproduce the single-objective and multi-objective evolutionary experiments reported in the paper.
The single-objective scripts focus solely on minimizing the FKGL (Flesch-Kincaid Grade Level) score, each using a different synonym replacement strategy.
| Script | Strategy | Description |
|---|---|---|
| `oruga_wordnet.py` | WordNet | Uses the NLTK WordNet lexical database for synonym retrieval. Fast and standard. |
| `oruga_word2vec.py` | Word2Vec | Uses vector embeddings to find synonyms. Note: Slower execution due to vector operations. |
| `oruga_webscraping.py` | Web | Scrapes external thesaurus sites. Note: Please use responsibly to avoid rate limiting. |
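The general idea behind all three scripts, a genetic algorithm that picks one synonym per word to minimize a readability score, can be sketched in plain Python. This is an illustrative toy, not the repository's implementation. It does not use the PyGAD API, the synonym dictionary is hand-written where the scripts would query WordNet, Word2Vec, or a thesaurus site, and `mean_word_length` is a stand-in for FKGL. All names are assumptions.

```python
import random


def evolve(words, candidates, score, generations=30, pop_size=20,
           mutation_rate=0.2, seed=0):
    """Minimize score(text) by choosing one option per word position.

    words:      tokens of the original text
    candidates: dict mapping a token to alternative words (toy data here)
    score:      callable taking a string and returning a number to minimize
    """
    rng = random.Random(seed)
    # Option 0 at every position is the original word.
    options = [[w] + candidates.get(w, []) for w in words]

    def decode(genes):
        return " ".join(opts[g] for opts, g in zip(options, genes))

    def fitness(genes):
        return score(decode(genes))

    # Include the unchanged text so the result is never worse than the input.
    population = [[0] * len(words)]
    population += [[rng.randrange(len(o)) for o in options]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=fitness)
        next_gen = population[:2]  # elitism: carry over the two best
        while len(next_gen) < pop_size:
            a, b = rng.sample(population[: pop_size // 2], 2)
            cut = rng.randrange(1, len(words)) if len(words) > 1 else 0
            child = a[:cut] + b[cut:]  # single-point crossover
            for i, opts in enumerate(options):
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(len(opts))  # mutation
            next_gen.append(child)
        population = next_gen

    best = min(population, key=fitness)
    return decode(best), fitness(best)


def mean_word_length(text):
    """Stand-in objective (assumption); the scripts minimize FKGL instead."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens)


words = "we utilize numerous methodologies".split()
toy_synonyms = {"utilize": ["use"], "numerous": ["many"],
                "methodologies": ["methods"]}
simplified, value = evolve(words, toy_synonyms, mean_word_length)
```

Because the unchanged text is part of the initial population and the best individuals are carried over each generation, the returned score cannot exceed that of the original text.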
The multi-objective scripts implement the paper's advanced contributions: they simultaneously minimize the readability score (FKGL) and the text modification rate, which keeps the algorithm from changing too many words and causing semantic drift.
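With two objectives there is usually no single best rewrite, only a set of trade-offs in which neither objective can improve without worsening the other. The sketch below illustrates Pareto dominance, the comparison that both NSGA-II and GDE3 build on. The `modification_rate` definition (fraction of changed word positions) is an assumption made for this example and may differ from the measure used in the paper.

```python
def modification_rate(original, candidate):
    """Fraction of word positions that differ (assumes one-for-one replacement)."""
    a, b = original.split(), candidate.split()
    if len(a) != len(b):
        raise ValueError("sketch assumes equal token counts")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dominates(p, q):
    """True if p is no worse than q on every objective and strictly better on one."""
    return (all(x <= y for x, y in zip(p, q))
            and any(x < y for x, y in zip(p, q)))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (FKGL, modification rate) pairs for five candidate rewrites.
scores = [(12.0, 0.0), (9.5, 0.2), (9.5, 0.4), (8.0, 0.5), (10.0, 0.6)]
front = pareto_front(scores)
```

Each point on the front is a different compromise between readability and fidelity to the original wording, so the final choice among them is left to the user.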
Using NSGA-II (Non-dominated Sorting Genetic Algorithm II):
# Basic Semantic Protection
python oruga2_nsga2.py
# Advanced Semantic Protection (using Word Mover's Distance - WMD)
python oruga2_nsga2_wmd.py
Using GDE3 (Generalized Differential Evolution 3):
# Basic Semantic Protection
python oruga2_gde3.py
# Advanced Semantic Protection (using Word Mover's Distance - WMD)
python oruga2_gde3_wmd.py
The repository includes `texts.txt`, which contains the benchmarking dataset used in the study:
- Content: 10 text samples extracted from Wikipedia.
- Diversity: The samples vary in length, topic, and initial complexity to test the robustness of the algorithm.
This project is licensed under the MIT License.
