Commit 6d51e6d

Merge pull request #30 from MIT-Emerging-Talent/m_m_recursive
Milestone 4- main readme update
2 parents add3f85 + b8425f9 commit 6d51e6d

2 files changed: +703 -0 lines changed
Lines changed: 282 additions & 0 deletions
@@ -0,0 +1,282 @@
# Hybrid RAG with Critic–Refiner Workflow (Qwen2.5 + LAmini)

## 1. 🎯 Goal

This project implements a **Retrieval-Augmented Generation (RAG)** pipeline enhanced
with a **dual-stage Critic–Refiner architecture**.

The main objective was to create a **highly accurate, context-grounded, and reliable
question-answering system**, combining:

- **Qwen2.5-7B-Instruct** (cloud-based Critic)
- **LAmini (local GGUF model)** (Refiner)
- **LlamaIndex** (retrieval engine)

The system evaluates each draft answer with the critic model, which flags
factual errors or missing context, and then rewrites the draft with a local refiner
model.
The aim is answers that are **trustworthy**, **grounded**, and **fully derived
from source documents**.

---

## 2. 🤖 About the Models Used

### 2.1 Qwen2.5-7B-Instruct (Critic Model)

Qwen2.5-7B-Instruct is an instruction-tuned LLM developed by Alibaba Cloud.
It was chosen as the **Critic** for these reasons:

- **High factual reliability:** Qwen models score well on truthfulness
  and instruction-following benchmarks.
- **Suited to evaluation:** As a cloud-based model on the Hugging Face Inference API,
  it is fast, stable, and accurate.
- **Strong reasoning capabilities:** Well suited to checking alignment between
  retrieved context and generated draft answers.

### 2.2 LAmini (Local Refiner Model)

LAmini is a compact, efficient, open-source model designed for rewriting and
stylistic refinement.
It was selected as the **Refiner** because:

- **Small and fast:** Runs comfortably on consumer hardware in `.gguf` format.
- **Good at rewriting:** Well suited to polishing or correcting drafts based on
  reviewer feedback.
- **Local privacy:** No online requests; all refinement happens locally.
- **Lightweight:** Fits the project's goal of low-cost, local execution.

### 2.3 Why a Critic–Refiner System?

This architecture divides the work as follows:

- The **Critic** checks for correctness, consistency, and missing facts.
- The **Refiner** rewrites only what the Critic flagged.
- The workflow is designed to minimize hallucinations and keep answers grounded in the source.

This structure is heavily inspired by **self-correcting LLM systems** and
**Human-in-the-Loop editorial workflows**, with the human editor replaced by an
automated critic.

---

## 3. 🛠️ Methodology: Retrieval-Augmented Generation (RAG)

To answer questions based on documents not included in the LLM’s training data,
RAG augments the model’s knowledge using retrieval.
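
As a minimal sketch of the retrieval idea (not the project's actual LlamaIndex code), the example below ranks text chunks by cosine similarity to the question. The `embed`, `cosine`, and `retrieve` names are hypothetical, and the bag-of-words "embedding" is a toy stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a term-count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "The guidance computer raised 1202 program alarms during the descent.",
    "The crew collected lunar samples after landing.",
    "Mission Control told the crew it was safe to continue the descent.",
]
top = retrieve("What alarms did the guidance computer raise?", chunks, top_k=1)
print(top)
```

In a real pipeline the vectors would come from a dense embedding model and the search from a vector index; only the ranking principle carries over.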

The pipeline works as follows:

1. **Retrieval:**
   User question → Convert to embedding → Search vector index → Retrieve relevant
   text chunks.

2. **Draft Generation:**
   The retrieved context + question are used to generate a **draft answer**.

3. **Critic Evaluation (Qwen2.5):**
   The critic compares the draft answer against the retrieved context and returns:
   - `[OK]`: the draft is accurate
   - `[REVISE]`: the draft contains errors or is missing information,
     followed by a bulleted list of required corrections.

4. **Refinement (LAmini):**
   LAmini rewrites the draft based **only on the critic’s feedback**, producing
   the final polished answer.

This keeps the final answer accurate and consistent with the source documents.
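
The four steps above can be sketched as a small control loop. This is an illustration under stated assumptions, not the project's implementation: `critic_refiner_loop`, the `max_cycles` limit, and the three stub functions are hypothetical, and the stubs stand in for the real model calls.

```python
from typing import Callable, List, Tuple


def critic_refiner_loop(
    question: str,
    context: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], Tuple[str, List[str]]],
    refine: Callable[[str, List[str]], str],
    max_cycles: int = 2,  # assumed cap on critic/refiner rounds
) -> str:
    """Draft an answer, then alternate critic and refiner until [OK] or the cap."""
    draft = generate(question, context)
    for _ in range(max_cycles):
        verdict, feedback = critique(draft, context)
        if verdict == "[OK]":
            break  # critic accepted the draft
        draft = refine(draft, feedback)  # rewrite using only the feedback
    return draft


# Stub models, used only to exercise the control flow.
def fake_generate(question: str, context: str) -> str:
    return "The alarm was 1203."


def fake_critique(draft: str, context: str) -> Tuple[str, List[str]]:
    if "1202" in draft:
        return "[OK]", []
    return "[REVISE]", ["Alarm code should be 1202."]


def fake_refine(draft: str, feedback: List[str]) -> str:
    return draft.replace("1203", "1202")


final = critic_refiner_loop(
    "Which alarm appeared?",
    "The 1202 alarm appeared.",
    fake_generate,
    fake_critique,
    fake_refine,
)
print(final)
```

Passing the three stages as callables keeps the loop independent of any particular model client, so the cloud critic and local refiner can be swapped without touching the control flow.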

### Implementation Details

- **Framework:** `LlamaIndex`
- **Local Model Loader:** `llama-cpp-python`
- **Embedding Model:** `HuggingFaceEmbedding` (e.g., BAAI/bge-small)
- **Critic Model:** `Qwen/Qwen2.5-7B-Instruct` via Hugging Face Inference API
- **Refiner Model:** `LAmini-Chat` in `.gguf` format
- **Energy Tracking:** CodeCarbon (`OfflineEmissionsTracker`)

---

## 4. 📑 Prompt Engineering: The Editorial Workflow

### 4.1 Critic Prompt

The Critic acts like a strict editor.

It must:

- Judge the draft answer
- Compare it with the source context
- Output `[OK]` or `[REVISE]`
- Provide bullet-point feedback only when necessary

Example behavior:

> [REVISE]
>
> - The draft added information not found in the source context.
> - Missing key fact about X.
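
A critic reply in this shape can be split into a verdict and a list of corrections. The sketch below assumes the tag appears on the first non-empty line, followed by one correction per line; `parse_critic_response` is a hypothetical helper, not part of the project.

```python
from typing import List, Tuple


def parse_critic_response(text: str) -> Tuple[str, List[str]]:
    """Split a critic reply into its verdict tag and a list of feedback items."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty critic response")
    verdict = lines[0]
    if verdict not in ("[OK]", "[REVISE]"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Drop leading bullet markers ("-", "*", "•") from each feedback line.
    feedback = [ln.lstrip("-*• ").strip() for ln in lines[1:]]
    return verdict, feedback


reply = """[REVISE]
- The draft added information not found in the source context.
- Missing key fact about X.
"""
verdict, feedback = parse_critic_response(reply)
print(verdict, feedback)
```

Rejecting replies without a recognized tag is a design choice for this sketch; a more forgiving parser could treat an untagged bullet list as `[REVISE]`.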

### 4.2 Refiner Prompt (LAmini)

The Refiner receives:

- Draft answer
- Editor (Critic) feedback

It rewrites the answer accordingly, following strict rules:

- Only fix issues the Critic highlighted
- No new information allowed
- Must produce a complete final answer

These rules keep the Refiner from introducing new hallucinations while it applies the corrections.
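
One way to encode these rules is a small prompt template. The wording below is a hypothetical example, not the project's actual prompt, and `build_refiner_prompt` is an assumed helper name.

```python
from typing import List


def build_refiner_prompt(draft: str, feedback: List[str]) -> str:
    """Assemble a refiner prompt from the draft and the critic's feedback items."""
    bullets = "\n".join(f"- {item}" for item in feedback)
    return (
        "You are a careful rewriter.\n"
        "Rules:\n"
        "1. Fix only the issues listed in the editor feedback.\n"
        "2. Do not add any new information.\n"
        "3. Return a complete final answer.\n\n"
        f"Draft answer:\n{draft}\n\n"
        f"Editor feedback:\n{bullets}\n\n"
        "Final answer:"
    )


prompt = build_refiner_prompt(
    "The alarm was 1203.",
    ["Alarm code should be 1202.", "Missing key fact about X."],
)
print(prompt)
```

Ending the prompt with `Final answer:` nudges a small local model to emit only the rewritten text rather than commentary about the edits.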

---

## 5. 📊 Sample Workflow (Prompts & Responses)

You can include your own examples below.

### Example: [Summarization]

- **Prompt:**
  > In your own words, describe what happened when the computer alarms
  > appeared during the landing.

### **Cycle 1:**

- **Draft Answer:**
  > As the descent began, Armstrong and Aldrin found themselves
  > passing landmarks on the surface two or three seconds early. The problem could
  > have been mascons, concentrations of high mass in a region or regions of the
  > Moon's crust that contains a gravitational anomaly, potentially altering Eagle's
  > trajectory. Five minutes into the descent, the burn and 6,000 feet (1,800 m)
  > above the surface of the moon, the LM guidance computer (LGC) distracted the
  > crew with the first of several unexpected 1201 and 1202 program alarms. Inside
  > Mission Control Center, computer engineer Jack Garman told Guidance Officer
  > Steve Bales it was safe to continue the descent. The program alarm indicated
  > "executive overflows," meaning the guidance computer could not complete all its
  > tasks in real-time and had to postpone some of them. Margaret Hamilton, the
  > Director of Apollo Flight Computer Programming at the MIT Charles Stark Draper
  > Laboratory later recalled: "To blame the computer for the Apollo 11 problems is
  > like blaming the person who spots a fire and calls the fire department.
  > Actually, the computer was programmed to do more than recognize error
  > conditions. A complete set of recovery programs was incorporated into the
  > software. The software's action, in this case, was to eliminate lower priority
  > tasks and re-establish the more important ones. The computer, rather than almost
  > forcing an abort, prevented an abort. If the computer hadn't recognized this
  > problem and took recovery action, I doubt if Apollo 11 would have been the
  > successful Moon landing it was."

- **Critic Response:**

  > - The draft does not accurately describe the computer alarms that appeared
  >   during the landing. The draft should include the specific computer alarms
  >   (1201 and 1202) and the crew's reaction to them.
  > - The draft does not mention the specific actions taken by the crew in
  >   response to the computer alarms, such as Armstrong taking semi-automatic control.
  > - The draft does not include the details about Mission Control's role in
  >   addressing the alarms and the reassurance given to the crew.
  > - The draft does not mention the post-mission analysis showing the real fuel
  >   remaining was probably closer to 50 seconds, which is critical information.

### **Cycle 2:**

- **Draft Answer:**
  > The LM guidance computer (LGC) distracted the crew with the first
  > of several unexpected 1201 and 1202 program alarms during the Apollo 11 mission.
  > The program alarm indicated "executive overflows," meaning the guidance computer
  > could not complete all its tasks in real-time and had to postpone some of them.
  > The recovery programs were incorporated into the software, eliminating lower
  > priority tasks and re-establishing the more important ones. Margaret Hamilton,
  > the Director of Apollo Flight Computer Programming at the MIT Charles Stark
  > Draper Laboratory later recalled that to blame the computer for the Apollo 11,
  > problems is like blaming the person who spots a fire and calls the fire
  > department. Actually, the computer was programmed to do more than recognize
  > error conditions. The software's action, in this case, was to eliminate lower
  > priority jobs and regain the more significant ones. If the computer hadn't
  > recognized this problem and took recovery action, I doubt if Apollo 11 would
  > have been the successful Moon landing it was.

- **Critic Response:**

  > - The draft does not mention the specific computer alarms (1201 and 1202)
  >   that appeared during the landing.
  > - The draft does not reference the exact timing of the alarms (five minutes
  >   into the descent, 6,000 feet above the surface).
  > - The draft does not include the specific actions taken by Mission Control
  >   to address the alarms.
  > - The draft does not mention the critical role of the guidance computer in
  >   safely continuing the descent despite the alarms.
  > - The draft does not include the perspective of the astronauts experiencing
  >   the alarms.
  > - The draft does not provide the exact quote from Jack Garman to Steve Bales.
  > - The draft does not mention Margaret Hamilton's role or her specific
  >   recollection of the event.

- **Refined Answer (Final):**

  > The LM guidance computer (LGC) distracted the crew with the first
  > of several unexpected 1201 and 1202 program alarms during the Apollo 11 mission.
  > The program alarm indicated "executive overflows," meaning the guidance computer
  > could not complete all its tasks in real-time and had to postpone some of them.
  > The recovery programs were incorporated into the software, eliminating lower
  > priority tasks and re-establishing the more important ones. Margaret Hamilton,
  > the Director of Apollo Flight Computer Programming at the MIT Charles Stark
  > Draper Laboratory later recalled that to blame the computer for the Apollo 11,
  > problems is like blaming the person who spots a fire and calls the fire
  > department. Actually, the computer was programmed to do more than recognize
  > error conditions. The software's action, in this case, was to eliminate lower
  > priority jobs and regain the more significant ones. If the computer hadn't
  > recognized this problem and took recovery action, I doubt if Apollo 11 would
  > have been the successful Moon landing it was.

---

## 6. 🌱 Environmental Tracking

We used **CodeCarbon** to measure local compute emissions and energy usage.

This enables:

- Transparency regarding energy cost
- Comparison with API-based approaches
- Understanding environmental impact on local hardware
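
A minimal sketch of how a tracked run might be wrapped is shown below. It assumes a tracker object exposing `start()` and `stop()`, which is the interface CodeCarbon's `OfflineEmissionsTracker` provides; `run_with_tracking` and `FakeTracker` are hypothetical names used so the example runs without CodeCarbon installed.

```python
from typing import Any, Callable, Tuple


def run_with_tracking(tracker: Any, task: Callable[[], Any]) -> Tuple[Any, Any]:
    """Run task() between tracker.start() and tracker.stop().

    Returns the task result and whatever stop() reports.
    """
    tracker.start()
    try:
        result = task()
    finally:
        # Always stop the tracker, even if the task raises.
        emissions = tracker.stop()
    return result, emissions


class FakeTracker:
    """Stand-in with the same start()/stop() shape, for demonstration only."""

    def start(self) -> None:
        pass

    def stop(self) -> float:
        return 0.0


# With CodeCarbon installed, the fake could be replaced by, for example:
#   from codecarbon import OfflineEmissionsTracker
#   tracker = OfflineEmissionsTracker(country_iso_code="USA")
result, emissions = run_with_tracking(FakeTracker(), lambda: sum(range(10)))
print(result, emissions)
```

Stopping the tracker in a `finally` block means a failed refinement call still closes the measurement instead of leaving it running.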

---

## 7. 📚 References (Reputable Sources)

All documentation used:

- Hugging Face Inference API
  <https://huggingface.co/docs/api-inference>

- LlamaIndex Documentation
  <https://docs.llamaindex.ai>

- LAmini Models
  <https://huggingface.co/LinkSoul/LAmini-Chat>

- Qwen2.5 Models
  <https://huggingface.co/Qwen>

- LlamaCPP / GGUF Models
  <https://github.com/ggerganov/llama.cpp>

- CodeCarbon
  <https://mlco2.github.io/codecarbon/>

---

## 8. ✅ Summary

This project demonstrates a hybrid RAG architecture that blends cloud
reasoning and local refinement.
The Critic–Refiner pipeline is designed to increase accuracy, reduce
hallucinations, and keep answers faithful to the source documents.

LAmini provides fast, private, offline rewriting, while Qwen2.5 provides
high-quality factual evaluation.

Together, they form a reliable, cost-efficient, and production-ready RAG system.
