2026/p1_measuring_software/g7_llm_quantization.md
---
author: Ceylin Ece, Georgios Markozanis, Kunal Narwani, Amy van der Meijden
group_number: 7
title: "Energy Efficiency of Quantized vs Full-Precision LLM Inference"
image: "img/g7_llm_quantization/project_cover.png"
date: 02/12/2026
summary: |-
Large Language Models are energy-intensive, but quantization techniques promise
to reduce their computational demands. This project compares the energy consumption
of running identical prompts through small LLMs (Llama 3.2 1-3B) in both full-precision
(fp16) and 4-bit quantized (GGUF) formats. We will measure energy consumption, throughput,
and quality trade-offs to provide empirical data for sustainable AI deployment decisions.
identifier: p1_measuring_software_2026 # Do not change this
all_projects_page: "../p1_measuring_software" # Do not change this
---

## Project Topic: Energy Efficiency of Quantized vs Full-Precision LLM Inference

### Motivation

Large Language Models have become ubiquitous in modern software systems, but their energy consumption raises sustainability concerns. Model quantization, that is, reducing the numerical precision of weights and activations, is widely adopted to make LLMs more efficient, yet its real-world energy impact remains poorly quantified.

### Research Question

**How does model quantization affect energy consumption during LLM inference?**

We will compare:
- **Full-precision models** (fp16) as our baseline
- **4-bit quantized models** (GGUF format) as the optimized variant
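
As a rough illustration, the two variants could be loaded as sketched below, assuming `transformers` with PyTorch for the fp16 baseline and `llama-cpp-python` for the GGUF build. The Hugging Face model ID, the GGUF file path, and the generation settings are placeholders, not final experiment parameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

prompt = "Explain the difference between fp16 and 4-bit quantization."

# Baseline: full-precision (fp16) weights served through Hugging Face transformers.
model_id = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
inputs = tokenizer(prompt, return_tensors="pt").to(fp16_model.device)
fp16_out = fp16_model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(fp16_out[0], skip_special_tokens=True))

# Optimized variant: 4-bit quantized GGUF weights served through llama-cpp-python.
quantized = Llama(model_path="models/llama-3.2-1b-instruct-Q4_K_M.gguf", n_ctx=2048)  # placeholder path
q_out = quantized(prompt, max_tokens=128)
print(q_out["choices"][0]["text"])
```

Both variants would run on the same machine with the same prompt set so that any difference in energy or quality can be attributed to the precision format.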

### Experimental Approach

**Models**: Small open-source LLMs such as Llama 3.2 (1B or 3B parameters) that can run on consumer hardware.

**Setup**: We will run identical standardized prompt sets through both model variants on the same hardware, measuring energy consumption (Joules), energy per token, inference throughput, and output quality. Each configuration will be run multiple times to ensure statistical validity, following best practices from the course (zen mode, fixed system settings, controlled environment).
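
One possible way to collect the raw energy numbers is sketched below, assuming an Intel CPU on Linux where the RAPL package counter is exposed through `powercap`; GPU energy would need a separate tool, and the repetition count and cool-down period are placeholders rather than the final protocol.

```python
import time
from pathlib import Path

# RAPL package-energy counter, exposed on Intel/Linux systems (value in microjoules).
RAPL_COUNTER = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def read_energy_uj() -> int:
    return int(RAPL_COUNTER.read_text())

def measure_runs(generate, prompt, repetitions=30, cooldown_s=60):
    """Repeatedly call `generate(prompt)` (which should return the number of
    generated tokens) and record energy, token count, and wall-clock time per run."""
    runs = []
    for _ in range(repetitions):
        e_start, t_start = read_energy_uj(), time.perf_counter()
        n_tokens = generate(prompt)
        e_end, t_end = read_energy_uj(), time.perf_counter()
        joules = (e_end - e_start) / 1e6  # counter wraparound is ignored in this sketch
        runs.append({
            "joules": joules,
            "joules_per_token": joules / n_tokens,
            "tokens_per_second": n_tokens / (t_end - t_start),
        })
        time.sleep(cooldown_s)  # placeholder cool-down between repetitions
    return runs
```

Energy per token and throughput then follow directly from the per-run records, and the repeated runs provide the sample needed for statistical comparison between the two variants.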

**Quality assessment**: To evaluate whether quantization degrades output, we will use a combination of automated benchmarks and manual comparison of generated outputs.
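
For the automated side, one simple check, sketched below with Python's standard library, is to score how closely the quantized output tracks the fp16 baseline for the same prompt and flag strongly diverging cases for manual inspection; the 0.7 threshold and the example data are placeholders.

```python
import difflib

def agreement(fp16_output: str, quantized_output: str) -> float:
    """Lexical similarity between the two variants' completions (0 = disjoint, 1 = identical)."""
    return difflib.SequenceMatcher(None, fp16_output, quantized_output).ratio()

# Placeholder data: prompt ID -> (fp16 completion, quantized completion).
pairs = {
    "prompt-001": ("The capital of France is Paris.", "The capital of France is Paris."),
    "prompt-002": ("Quantization reduces precision.", "Quantization is a French pastry."),
}

# Flag prompts where the quantized model diverges strongly from the baseline.
divergent = [pid for pid, (a, b) in pairs.items() if agreement(a, b) < 0.7]
print(divergent)  # these cases go to manual review
```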

### Expected Outcomes

We expect quantization to meaningfully reduce energy consumption while maintaining acceptable output quality, providing practical insights for sustainable LLM deployment decisions.