ORUGA: Optimizing Readability Using Genetic Algorithms

Official implementation of the paper published in Knowledge-Based Systems (2024).

ORUGA (Optimizing Readability Using Genetic Algorithms) is an unsupervised framework that automatically improves the readability of text. Unlike deep learning approaches, which require large training datasets, ORUGA uses genetic algorithms to minimize complexity metrics such as the Flesch-Kincaid Grade Level (FKGL) while preserving the meaning of the text.
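For orientation, FKGL combines average sentence length with average syllables per word. The sketch below computes it with a crude vowel-group syllable heuristic. The helper names `count_syllables` and `fkgl` are illustrative and not part of this repository. The project itself uses py-readability-metrics, which may count syllables differently and so produce different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text: str) -> float:
    """FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


hard = "The physician administered the medication."
easy = "The doctor gave the drug."
print(f"hard: {fkgl(hard):.2f}  easy: {fkgl(easy):.2f}")
```

Replacing long words with shorter synonyms lowers the syllables-per-word term, which is the lever the optimizer pulls.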

📄 Citation

If you use this framework or code in your research, please cite the following paper:

@article{martinez2024oruga,
    author = {Jorge Martinez-Gil},
    title = {Optimizing readability using genetic algorithms},
    journal = {Knowledge-Based Systems},
    volume = {284},
    pages = {111273},
    year = {2024},
    issn = {0950-7051},
    doi = {10.1016/j.knosys.2023.111273}    
}

📚 Tutorials & Context

For a general audience overview of the concepts behind this framework, refer to this three-part series on Medium:

⚙️ Installation

To reproduce the experiments, install the dependencies:

pip install -r requirements.txt

Warning

CRITICAL DEPENDENCY CONFLICT

Package version: Use pygad==2.1.0. Newer versions may cause compatibility errors with the evolutionary logic.
Namespace conflict: py-readability-metrics and readability-lxml both install a top-level `readability` module, so they collide when installed together.
- This project uses py-readability-metrics.
- Do not install readability-lxml in the same environment, or the imports will fail.
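A quick way to check an environment against the two constraints above is to inspect the installed distributions. This is a minimal sketch using only the standard library's `importlib.metadata`. The helper names `installed_packages` and `find_problems` are hypothetical and are not shipped with the repository.

```python
from importlib import metadata


def installed_packages() -> dict:
    """Map normalized distribution names to their installed versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            packages[name.lower().replace("_", "-")] = dist.version
    return packages


def find_problems(installed: dict) -> list:
    """Return human-readable warnings for the constraints listed in this README."""
    problems = []
    if "readability-lxml" in installed:
        problems.append("readability-lxml is installed and shadows the 'readability' module")
    if "py-readability-metrics" not in installed:
        problems.append("py-readability-metrics is not installed")
    if installed.get("pygad") != "2.1.0":
        problems.append(f"pygad version is {installed.get('pygad')}, expected 2.1.0")
    return problems


if __name__ == "__main__":
    for problem in find_problems(installed_packages()):
        print("WARNING:", problem)
```

Running the script in a clean virtual environment after `pip install -r requirements.txt` should print nothing if the pins in `requirements.txt` match the constraints above.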

🧪 Experimental Reproduction

The repository allows you to reproduce the single-objective and multi-objective evolutionary experiments reported in the paper.

1. Single-Objective Optimization

These scripts focus solely on minimizing the FKGL (Flesch-Kincaid Grade Level) score using different synonym replacement strategies.

| Script | Strategy | Description |
|---|---|---|
| `oruga_wordnet.py` | WordNet | Uses the NLTK WordNet lexical database for synonym retrieval. Fast and standard. |
| `oruga_word2vec.py` | Word2Vec | Uses vector embeddings to find synonyms. Note: slower execution due to vector operations. |
| `oruga_webscraping.py` | Web | Scrapes external thesaurus sites. Note: please use responsibly to avoid rate limiting. |
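The single-objective idea can be illustrated with a dependency-free toy: each gene chooses either the original word or one of its synonyms, and fitness is the FKGL of the decoded sentence. This is a sketch under stated assumptions, not the repository's implementation. The actual scripts use pygad and retrieve synonyms from WordNet, Word2Vec, or the web, while the `SYNONYMS` table, the syllable heuristic, and all function names here are made up for illustration.

```python
import random
import re

# Hypothetical synonym table standing in for WordNet / Word2Vec / web lookups.
SYNONYMS = {
    "physician": ["doctor"],
    "administered": ["gave"],
    "medication": ["drug", "medicine"],
}


def syllables(word):
    """Crude syllable estimate: groups of consecutive vowels, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(words, n_sentences=1):
    """Flesch-Kincaid Grade Level for a list of words forming n_sentences."""
    total = sum(syllables(w) for w in words)
    return 0.39 * len(words) / n_sentences + 11.8 * total / len(words) - 15.59


def decode(tokens, genes):
    """Gene 0 keeps the original word; gene k > 0 selects the k-th synonym."""
    out = []
    for tok, gene in zip(tokens, genes):
        options = [tok] + SYNONYMS.get(tok, [])
        out.append(options[gene % len(options)])
    return out


def evolve(tokens, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    n_opts = [1 + len(SYNONYMS.get(t, [])) for t in tokens]

    def score(genes):
        return fkgl(decode(tokens, genes))

    # Seed the population with the unmodified text plus random variants.
    population = [[0] * len(tokens)]
    population += [[rng.randrange(n) for n in n_opts] for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=score)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(tokens))  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [
                rng.randrange(n) if rng.random() < mutation_rate else g
                for g, n in zip(child, n_opts)
            ]
            children.append(child)
        population = parents + children

    best = min(population, key=score)
    return decode(tokens, best), score(best)


tokens = "the physician administered the medication".split()
best_words, best_score = evolve(tokens)
print(" ".join(best_words), round(best_score, 2))
```

Because the best half of each generation is carried over unchanged, the score of the returned text can never be worse than that of the original.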

2. Multi-Objective Optimization (NSGA-II & GDE3)

These scripts implement the paper's multi-objective contribution: they minimize the readability score (FKGL) and the text modification rate at the same time. Penalizing the modification rate keeps the algorithm from replacing too many words, which limits semantic drift.
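With two objectives there is usually no single best rewrite, only a set of trade-offs in which neither objective can improve without worsening the other. The sketch below shows the Pareto-dominance test that underlies non-dominated sorting in NSGA-II. The definition of `modification_rate` (fraction of changed token positions) is an assumption, and the candidate scores are invented for illustration and are not results from the paper.

```python
def modification_rate(original, candidate):
    """Assumed definition: fraction of token positions whose word changed."""
    changed = sum(o != c for o, c in zip(original, candidate))
    return changed / len(original)


def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    Both objectives are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative (FKGL, modification rate) pairs for five candidate rewrites.
candidates = [(19.4, 0.0), (12.3, 0.2), (12.3, 0.4), (2.9, 0.6), (8.0, 0.8)]
front = pareto_front(candidates)
print(front)
```

Here `(12.3, 0.4)` is dropped because `(12.3, 0.2)` reaches the same FKGL with fewer changes, and `(8.0, 0.8)` is dropped because `(2.9, 0.6)` is better on both objectives.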

Using NSGA-II (Non-dominated Sorting Genetic Algorithm II):

# Basic Semantic Protection
python oruga2_nsga2.py

# Advanced Semantic Protection (using Word Mover's Distance - WMD)
python oruga3_nsga2_wmd.py

Using GDE3 (Generalized Differential Evolution 3):

# Basic Semantic Protection
python oruga2_gde3.py

# Advanced Semantic Protection (using Word Mover's Distance - WMD)
python oruga3_gde3_wmd.py

📊 Dataset

The repository includes texts.txt, which contains the benchmarking dataset used in the study:

Content: 10 text samples extracted from Wikipedia.
Diversity: Varies in length, topic, and initial complexity levels to test the robustness of the algorithm.

📄 License

This project is licensed under the MIT License.
