
Commit 38c1a5c

Merge pull request #32 from MIT-Emerging-Talent/experiment
Milestone 5 - Experiment folder
2 parents bddfb5e + b6a395e commit 38c1a5c

File tree

3 files changed (+1007, -0 lines changed)


2_open_source_models/README.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# **Open-Source Model Experiments**

This directory contains four standalone experiments exploring
**local, open-source language models** for Retrieval-Augmented Generation
(RAG), model evaluation, recursive editing, and sustainability tracking
(energy & CO₂ emissions).
Each subfolder includes its own notebook, documentation, outputs, and
model-specific setup.

---

## Directory Structure

```text
2_open_source_models/

├── distilled_models/
│   └── rag_and_distilled_model/

├── quantized_models/
│   └── mistral7b/

└── slm/
    ├── google_gemma/
    └── qwen/
```

Each subfolder contains a self-contained model with its own README,
notebook(s), generated outputs, and energy/emissions logs where applicable.

---

## Project Summaries

Below is a concise description of each model project, so the purpose of the
overall folder can be grasped at a glance.

---

### **1. Distilled Models – RAG + Instruction-Tuned Distilled LMs**

**Folder:** `distilled_models/rag_and_distilled_model/`
**Notebook:** `Apollo11_rag&distilled.ipynb`

This project uses a lightweight **LaMini-Flan-T5-248M** distilled model
combined with a **MiniLM** embedding model to run a fully local
Retrieval-Augmented Generation pipeline on the Apollo 11 dataset.
It demonstrates:

* Local embeddings and ChromaDB vector storage
* RAG-based question answering
* Evaluation across several prompt types
* Emissions tracking and generated output logs

Ideal for showing how **compact distilled models** can handle
RAG efficiently on CPU or modest GPU hardware.
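
As a rough illustration of the kind of pipeline this notebook builds, the
sketch below wires a MiniLM embedder, a ChromaDB collection, and
LaMini-Flan-T5-248M into a minimal local RAG loop. The exact MiniLM variant,
collection name, and placeholder chunks are assumptions for the example, not
the notebook's actual code.

```python
# Minimal local RAG sketch (illustrative only; not the notebook's exact code).
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import chromadb

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed MiniLM variant
generator = pipeline("text2text-generation", model="MBZUAI/LaMini-Flan-T5-248M")

client = chromadb.Client()
collection = client.create_collection("apollo11")  # hypothetical collection name

# Placeholder chunks; the notebook uses the Apollo 11 dataset instead.
chunks = [
    "Apollo 11 launched from Kennedy Space Center on July 16, 1969.",
    "Neil Armstrong and Buzz Aldrin landed the lunar module Eagle on July 20, 1969.",
]
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
)

# Retrieve the most relevant chunks and feed them to the distilled model.
question = "When did Apollo 11 launch?"
hits = collection.query(
    query_embeddings=embedder.encode([question]).tolist(),
    n_results=2,
)
context = "\n".join(hits["documents"][0])
prompt = f"Answer using the context.\nContext:\n{context}\n\nQuestion: {question}"
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```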

---

### **2. Quantized Models – Mistral 7B RAG Pipeline**

**Folder:** `quantized_models/mistral7b/`

This project evaluates a **quantized Mistral-7B (GGUF)** model running
fully locally via `llama-cpp-python`.
It focuses on:

* Retrieval-Augmented Generation using LlamaIndex
* Local inference using a 4-bit quantized LLM
* Document processing, embedding (BGE-small), and top-k retrieval
* Practical observations on feasibility and performance on a laptop

A strong example of how quantization enables
**large-model capability at small-device cost**.
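
The core of the local, quantized setup can be sketched with `llama-cpp-python`
as below. The GGUF file name, prompt format, and parameters are assumptions,
and the LlamaIndex + BGE-small retrieval step is stubbed out with a placeholder
context string.

```python
# Minimal sketch of local inference with a 4-bit quantized GGUF model
# via llama-cpp-python (illustrative; paths and parameters are assumptions).
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads; tune for the machine at hand
)

# In the full pipeline, `context` would come from LlamaIndex top-k retrieval
# over BGE-small embeddings; here it is a placeholder string.
context = "Apollo 11 landed on the Moon on July 20, 1969."
question = "When did Apollo 11 land on the Moon?"

out = llm(
    f"[INST] Use the context to answer.\nContext: {context}\nQuestion: {question} [/INST]",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```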

---

### **3. Small Language Model (SLM): Google Gemma 2-2B**

**Folder:** `slm/google_gemma/`

This experiment implements a structured RAG workflow with Google’s lightweight
**Gemma 2-2B** model and a fixed Apollo 11 source text.
Key features include:

* Standardized 21-prompt evaluation set
* RAG pipeline with chunked retrieval
* Draft → Critic → Refiner multi-step generation
* Real-time emissions logging with CodeCarbon
* Fully reproducible testing and reporting

This project demonstrates how even very small open-weight models can
perform multi-step reasoning when paired with thoughtful prompting and revision
cycles.
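
A minimal sketch of the CodeCarbon pattern this project relies on is shown
below, assuming a generic `generate_fn` callable standing in for the Gemma 2-2B
generation step; the project name and wrapper function are illustrative, not
taken from the notebook.

```python
# Minimal sketch of per-query emissions logging with CodeCarbon
# (illustrative; the model call is a placeholder for Gemma 2-2B generation).
from codecarbon import EmissionsTracker

def answer_with_tracking(generate_fn, prompt):
    """Run one generation and return (answer, kg CO2-equivalent emitted)."""
    tracker = EmissionsTracker(project_name="gemma2_rag", log_level="error")
    tracker.start()
    try:
        answer = generate_fn(prompt)   # e.g. a transformers pipeline call
    finally:
        emissions_kg = tracker.stop()  # estimated kg CO2eq for this query
    return answer, emissions_kg
```

`tracker.stop()` returns the estimated kilograms of CO₂-equivalent for the
tracked span, which is the kind of figure recorded in the per-query logs.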

---

### **4. Small Language Model (SLM): Qwen 2.5B + Recursive Editing**

**Folder:** `slm/qwen/`

This notebook experiments with **Qwen 2.5B**, integrating:

* RAG retrieval
* A recursive editing loop (Draft → Critic → Refine)
* Context retrieval through Hugging Face embeddings
* Energy + CO₂ logging for each query

Outputs are saved in markdown form with all iterations and emissions data.
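
The recursive editing idea can be expressed in a few lines. The sketch below
assumes a `generate` callable (for example, a wrapped Qwen pipeline) and uses
illustrative prompt wording rather than the notebook's actual prompts.

```python
# Minimal sketch of the Draft -> Critic -> Refine loop (illustrative; the
# `generate` callable and prompt wording are assumptions, not the notebook's code).
def recursive_edit(generate, question, context, rounds=2):
    draft = generate(f"Context:\n{context}\n\nAnswer the question: {question}")
    for _ in range(rounds):
        critique = generate(
            f"Question: {question}\nDraft answer:\n{draft}\n\n"
            "List factual errors, omissions, and unclear wording."
        )
        draft = generate(
            f"Question: {question}\nContext:\n{context}\n"
            f"Draft:\n{draft}\nCritique:\n{critique}\n\n"
            "Rewrite the draft, fixing the issues in the critique."
        )
    return draft
```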

---

## Purpose of This Collection

This folder exists to:

* Compare how different **model sizes**, **architectures**, and
  **inference strategies** behave on the **same tasks**.
* Demonstrate **fully local RAG pipelines** using only open-source components.
* Document **energy and carbon trade-offs** in local LLM usage.
* Provide reproducible examples that can be extended or rerun with other models.

Each subfolder is designed as a standalone experiment, but together they
form a cohesive study of open-source LLM efficiency and performance.

---

## Notes

* All code is intended to run locally.
* Each folder includes its own notebook and README with instructions.
* Energy/emissions reporting is included where relevant (via CodeCarbon).
* Datasets and prompts are standardized across projects for fairness and
  comparability.

3_experiment/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@

# AI Model Comparison Experiment

## Evaluating Open-Source vs. Commercial Language Models

This folder contains the materials for our experiment comparing open-source and
commercial AI models through human evaluation. Participants were asked to read
pairs of AI-generated texts and judge their quality without knowing which model
produced which text.

---

## What This Experiment Is

We created a survey where each question includes two texts, **Text A** and
**Text B**, generated by different AI models. One text always comes from an
**open-source model**, and the other from a **commercial model**. Participants:

* Choose which text they prefer
* Guess which model type generated each text
* Rate both texts (accuracy, clarity, relevance, faithfulness)

All evaluations are blind to remove brand bias.

---

## Why We Did This

Open-source AI models are advancing quickly, and we wanted to understand
whether they are perceived as competitive alternatives to commercial systems.
While benchmarks can measure performance numerically, they don’t reflect how
humans actually experience AI-generated writing.

This experiment aims to answer questions like:

* Do people notice a consistent quality difference?
* Can users accurately identify commercial vs. open-source output?
* Are open-source models “good enough” for real-world tasks?

Understanding these perceptions is important for evaluating the viability of
sustainable, accessible, and transparent AI systems.

---

## Why We Chose This Method

We used a **paired, blind comparison** because it provides a clean way to
assess text quality without model reputation influencing the results.
Participants judge writing on its own merits, which helps us collect more
reliable data.

We included multiple task types (summarization, paraphrasing, reasoning, and
creative writing) because each one tests a different aspect of model behavior.
This variety gives us a broader picture of model strengths and weaknesses.

---

## Why This Approach Works Well

This survey-based structure is simple and easy for participants to
understand. It mirrors how people naturally interact with AI systems: reading
text and forming opinions about quality. By keeping the evaluation blind, we
minimize bias and generate more meaningful insights into real user perception.

The method also helps determine whether open-source models, especially
optimized ones, can realistically serve as alternatives to commercial systems
in practical use.

---

## Contents of This Folder

```text
3_experiment/
├── survey_form.md   # The form text used in the study
└── README.md        # Explanation of the experiment (this file)
```
