
Commit 38c1a5c

Merge pull request #32 from MIT-Emerging-Talent/experiment
Milestone 5 - Experiment folder
2 parents bddfb5e + b6a395e commit 38c1a5c

File tree

3 files changed (+1007, -0 lines changed)


2_open_source_models/README.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# **Open-Source Model Experiments**

This directory contains four standalone experiments exploring
**local, open-source language models** for Retrieval-Augmented Generation
(RAG), model evaluation, recursive editing, and sustainability tracking
(energy & CO₂ emissions).
Each subfolder includes its own notebook, documentation, outputs, and
model-specific setup.

---

## Directory Structure

```text
2_open_source_models/

├── distilled_models/
│   └── rag_and_distilled_model/

├── quantized_models/
│   └── mistral7b/

└── slm/
    ├── google_gemma/
    └── qwen/
```

Each subfolder contains a self-contained model with its own README,
notebook(s), generated outputs, and energy/emissions logs where applicable.

---

## Project Summaries

Below is a concise description of each model project, so the purpose of the
overall folder can be grasped at a glance.

---

### **1. Distilled Models – RAG + Instruction-Tuned Distilled LMs**

**Folder:** `distilled_models/rag_and_distilled_model/`
**Notebook:** `Apollo11_rag&distilled.ipynb`

This project uses a lightweight **LaMini-Flan-T5-248M** distilled model
combined with a **MiniLM** embedding model to run a fully local
Retrieval-Augmented Generation pipeline on the Apollo 11 dataset.
It demonstrates:

* Local embeddings and ChromaDB vector storage
* RAG-based question answering
* Evaluation across several prompt types
* Emissions tracking and generated output logs

Ideal for showing how **compact distilled models** can handle
RAG efficiently on CPU or modest GPU hardware.
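
As a rough illustration of the kind of pipeline this notebook builds, the
sketch below wires a MiniLM embedder, a ChromaDB collection, and
LaMini-Flan-T5-248M into a minimal local RAG loop. The exact MiniLM variant,
collection name, and placeholder chunks are assumptions for the example, not
the notebook's actual code.

```python
# Minimal local RAG sketch (illustrative only; not the notebook's exact code).
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import chromadb

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed MiniLM variant
generator = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-248M")

client = chromadb.Client()
collection = client.create_collection("apollo11")  # hypothetical collection name

# Placeholder chunks; the notebook uses the Apollo 11 dataset instead.
chunks = [
    "Apollo 11 launched from Kennedy Space Center on July 16, 1969.",
    "Neil Armstrong and Buzz Aldrin landed the lunar module Eagle on July 20, 1969.",
]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Retrieve the most relevant chunks and feed them to the distilled model.
question = "When did Apollo 11 launch?"
hits = collection.query(
    query_embeddings=embedder.encode([question]).tolist(),
    n_results=2,
)
context = "\n".join(hits["documents"][0])
prompt = f"Answer using the context.\nContext:\n{context}\n\nQuestion: {question}"
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```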

---

### **2. Quantized Models – Mistral 7B RAG Pipeline**

**Folder:** `quantized_models/mistral7b/`

This project evaluates a **quantized Mistral-7B (GGUF)** model running
fully locally via `llama-cpp-python`.
It focuses on:

* Retrieval-Augmented Generation using LlamaIndex
* Local inference using a 4-bit quantized LLM
* Document processing, embedding (BGE-small), and top-k retrieval
* Practical observations on feasibility and performance on a laptop

A strong example of how quantization enables
**large-model capability at small-device cost**.
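
The core of the local, quantized setup can be sketched with `llama-cpp-python`
as below. The GGUF file name, prompt format, and parameters are assumptions,
and the LlamaIndex + BGE-small retrieval step is stubbed out with a placeholder
context string.

```python
# Minimal sketch of local inference with a 4-bit quantized GGUF model
# via llama-cpp-python (illustrative; paths and parameters are assumptions).
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads; tune for the machine at hand
)

# In the full pipeline, `context` would come from LlamaIndex top-k retrieval
# over BGE-small embeddings; here it is a placeholder string.
context = "Apollo 11 landed on the Moon on July 20, 1969."
question = "When did Apollo 11 land on the Moon?"

out = llm(
    f"[INST] Use the context to answer.\nContext: {context}\nQuestion: {question} [/INST]",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```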

---

### **3. Small Language Model (SLM): Google Gemma 2-2B**

**Folder:** `slm/google_gemma/`

This experiment implements a structured RAG workflow with Google’s lightweight
**Gemma 2-2B** model and a fixed Apollo 11 source text.
Key features include:

* Standardized 21-prompt evaluation set
* RAG pipeline with chunked retrieval
* Draft → Critic → Refiner multi-step generation
* Real-time emissions logging with CodeCarbon
* Fully reproducible testing and reporting

This project demonstrates how even very small open-weight models can
perform multi-step reasoning when paired with thoughtful prompting and revision
cycles.
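
A minimal sketch of the CodeCarbon pattern this project relies on is shown
below, assuming a generic `generate_fn` callable standing in for the Gemma 2-2B
generation step; the project name and wrapper function are illustrative, not
taken from the notebook.

```python
# Minimal sketch of per-query emissions logging with CodeCarbon
# (illustrative; the model call is a placeholder for Gemma 2-2B generation).
from codecarbon import EmissionsTracker

def answer_with_tracking(generate_fn, prompt):
    """Run one generation and return (answer, kg CO2-equivalent emitted)."""
    tracker = EmissionsTracker(project_name="gemma2_rag", log_level="error")
    tracker.start()
    try:
        answer = generate_fn(prompt)   # e.g. a transformers pipeline call
    finally:
        emissions_kg = tracker.stop()  # estimated kg CO2eq for this query
    return answer, emissions_kg
```

`tracker.stop()` returns the estimated kilograms of CO₂-equivalent for the
tracked span, which is the kind of figure recorded in the per-query logs.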

---

### **4. Small Language Model (SLM): Qwen 2.5B + Recursive Editing**

**Folder:** `slm/qwen/`

This notebook experiments with **Qwen 2.5B**, integrating:

* RAG retrieval
* A recursive editing loop (Draft → Critic → Refine)
* Context retrieval through Hugging Face embeddings
* Energy + CO₂ logging for each query

Outputs are saved in markdown form with all iterations and emissions data.
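
The recursive editing idea can be expressed in a few lines. The sketch below
assumes a `generate` callable (for example, a wrapped Qwen pipeline) and uses
illustrative prompt wording rather than the notebook's actual prompts.

```python
# Minimal sketch of the Draft -> Critic -> Refine loop (illustrative; the
# `generate` callable and prompt wording are assumptions, not the notebook's code).
def recursive_edit(generate, question, context, rounds=2):
    draft = generate(f"Context:\n{context}\n\nAnswer the question: {question}")
    for _ in range(rounds):
        critique = generate(
            f"Question: {question}\nDraft answer:\n{draft}\n\n"
            "List factual errors, omissions, and unclear wording."
        )
        draft = generate(
            f"Question: {question}\nContext:\n{context}\n"
            f"Draft:\n{draft}\nCritique:\n{critique}\n\n"
            "Rewrite the draft, fixing the issues in the critique."
        )
    return draft
```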

---

## Purpose of This Collection

This folder exists to:

* Compare how different **model sizes**, **architectures**, and
  **inference strategies** behave on the **same tasks**.
* Demonstrate **fully local RAG pipelines** using only open-source components.
* Document **energy and carbon trade-offs** in local LLM usage.
* Provide reproducible examples that can be extended or rerun with other models.

Each subfolder is designed as a standalone experiment, but together they
form a cohesive study of open-source LLM efficiency and performance.

---

## Notes

* All code is intended to run locally.
* Each folder includes its own notebook and README with instructions.
* Energy/emissions reporting is included where relevant (via CodeCarbon).
* Datasets and prompts are standardized across projects for fairness and
  comparability.

3_experiment/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

# AI Model Comparison Experiment

## Evaluating Open-Source vs. Commercial Language Models

This folder contains the materials for our experiment comparing open-source and
commercial AI models through human evaluation. Participants were asked to read
pairs of AI-generated texts and judge their quality without knowing which model
produced which text.

---

## What This Experiment Is

We created a survey where each question includes two texts, **Text A** and
**Text B**, generated by different AI models. One text always comes from an
**open-source model**, and the other from a **commercial model**. Participants:

* Choose which text they prefer
* Guess which model type generated each text
* Rate both texts (accuracy, clarity, relevance, faithfulness)

All evaluations are blind to remove brand bias.

---

## Why We Did This

Open-source AI models are advancing quickly, and we wanted to understand
whether they are perceived as competitive alternatives to commercial systems.
While benchmarks can measure performance numerically, they don’t reflect how
humans actually experience AI-generated writing.

This experiment aims to answer questions like:

* Do people notice a consistent quality difference?
* Can users accurately identify commercial vs. open-source output?
* Are open-source models “good enough” for real-world tasks?

Understanding these perceptions is important for evaluating the viability of
sustainable, accessible, and transparent AI systems.

---

## Why We Chose This Method

We used a **paired, blind comparison** because it provides a clean way to
assess text quality without model reputation influencing the results.
Participants judge writing on its own merits, which helps us collect more
reliable data.

We included multiple task types (summarization, paraphrasing, reasoning, and
creative writing) because each one tests a different aspect of model behavior.
This variety gives us a broader picture of model strengths and weaknesses.

---

## Why This Approach Works Well

This survey-based structure is simple and easy for participants to
understand. It mirrors how people naturally interact with AI systems: reading
text and forming opinions about quality. By keeping the evaluation blind, we
minimize bias and generate more meaningful insights into real user perception.

The method also helps determine whether open-source models, especially
optimized ones, can realistically serve as alternatives to commercial systems
in practical use.

---

## Contents of This Folder

```text
3_experiment/
├── survey_form.md   # The form text used in the study
└── README.md        # Explanation of the experiment (this file)
```
