An email processing pipeline with a Streamlit web interface that connects to Gmail, classifies emails into topics using AWS Bedrock LLMs, and generates structured summaries of email threads grouped by topic. The system uses the instructor library with Pydantic models for structured LLM outputs. This was built as part of a project for the Columbia Engineering X Amazon Bedrock Innovation Challenge.
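As a rough illustration of the "instructor + Pydantic" structured-output pattern mentioned above, the sketch below defines a response model and asks an LLM to fill it. This is not the project's actual code: the `EmailClassification` model, the Bedrock model ID, and the `instructor.from_bedrock` client wiring are assumptions that may differ from what `src/` implements.

```python
# Minimal sketch of structured LLM output with instructor + Pydantic.
# Model names, fields, and the Bedrock client wiring are illustrative assumptions.
from pydantic import BaseModel, Field
import boto3
import instructor


class EmailClassification(BaseModel):
    topic: str = Field(description="Best-matching topic name, or 'uncategorized'")
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str


# Recent instructor versions can wrap a Bedrock runtime client; adjust the
# client setup and parameter names (model vs. modelId) to your instructor version.
client = instructor.from_bedrock(boto3.client("bedrock-runtime"))

result = client.chat.completions.create(
    modelId="openai.gpt-oss-20b-1:0",  # hypothetical Bedrock model ID
    response_model=EmailClassification,
    messages=[
        {"role": "user", "content": "Classify this email: 'Budget review moved to Friday.'"}
    ],
)
print(result.topic, result.confidence)
```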
Demo video: emailprime_demo.mp4
- Streamlit Web App — Interactive two-panel interface for managing projects and viewing summaries
- Automatic email fetching and processing (once per session)
- Project (topic) management with simplified 2-field creation form (name + description; LLM auto-generates classification attributes)
- Delete projects with confirmation dialog — safely remove projects and all associated emails
- Individual project re-summarization
- Email count and last updated timestamps
- Expandable individual email viewer with thread context separation
- Sync status display showing email counts and last update time
- Email Service — Unified orchestration service that coordinates all operations (fetch, classify, summarize, delete)
- Topic Manager — CRUD operations for email classification topics with LLM-powered attribute generation
- Metadata Manager — Tracks sync times, email counts, last update timestamps, and topic statistics
- Gmail Connector — OAuth2 authentication and email extraction with thread-aware body parsing
- Bedrock Integration — AWS Bedrock-backed LLM for classification and summarization using the `instructor` library
- `aws_session.py` — Creates an AWS session for AWS Titan and S3 buckets using `config.env`
- `embedder.py` — Embeds email data using AWS Titan
- `ingest.py` — Ingests email data by embedding it with `embedder.py` and stores vector files in the S3 bucket
- `query.py` — Retrieves the 10 most relevant emails from the S3 bucket
- `utils.py` — Utility functions for email operations
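To give a feel for the embedding step, the snippet below calls Amazon Titan Text Embeddings through the Bedrock runtime. It is a sketch under assumptions: the model ID and request shape used by `embedder.py` may differ.

```python
# Sketch: embed one email body with Amazon Titan via the Bedrock runtime.
# The model ID and request shape are assumptions; embedder.py may differ.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-2")

def embed_text(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]  # e.g. a 1536-dimensional vector

vector = embed_text("Subject: Q4 budget review\n\nLet's meet Friday to finalize numbers.")
print(len(vector))
```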
- `src/email_connector.py` — Standalone Gmail client that handles OAuth2 and saves messages to `data/emails.json`
- `src/bedrock_classifier.py` — CLI tool to classify emails into topics defined in `topics.json`
- `src/bedrock_summarizer.py` — CLI tool to generate per-topic structured summaries
- `data/` — Email messages and topic-classified subfolders
- `data/classified_emails/` — Per-topic email files and summaries
- `topics.json` — Topic definitions with keywords, intent, and classification attributes
- `data/metadata.json` — Sync times and statistics
- `env.yml` — Conda environment file for reproducible setup
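To make "keywords, intent, and classification attributes" concrete, one topic entry might look roughly like the dict below. The field names are illustrative only and may not match the exact `topics.json` schema.

```python
# Illustrative shape of one topic entry; the real topics.json schema may differ.
example_topic = {
    "name": "Q4 Budget Planning",
    "description": "Emails about budget planning for Q4",
    "keywords": ["budget", "forecast", "Q4", "spend"],
    "intent": "Track budget discussions, approvals, and deadlines",
    "classification_attributes": [
        "mentions budget figures or cost centers",
        "references Q4 planning meetings or deadlines",
    ],
}
```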
conda env create -f env.yml
conda activate amzn-email-summarizer

This creates an isolated environment with Python 3.11 and all required dependencies listed in `env.yml`.
This project uses Husky to automatically clean up generated files before commits. This prevents accidentally committing topics.json and data/ files.
First-time setup (run once):
npm install husky --save-dev
npx husky install

This installs Husky and the pre-commit hook that runs `cleanup.sh` automatically before each commit.
What it does:
- Deletes `topics.json` if it exists
- Deletes all files in the `data/` directory
- Prints concise status messages
- Prevents accidental commits of generated files
If you need to commit these files intentionally, you can bypass the hook:
git commit --no-verify

The project requires two sets of credentials that must NOT be committed:
- Go to Google Cloud Console
- Enable the Gmail API for your project
- Create OAuth 2.0 Client ID credentials (Application type: Desktop app)
- Download `credentials.json` and place it in the repository root
- On first run, the app will trigger the OAuth flow in your browser and auto-generate `token.json`
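The first-run flow is the standard installed-app OAuth dance from google-auth-oauthlib. A minimal sketch is shown below; the scope and file paths are assumptions that match the defaults above, and `GmailClient`'s actual implementation may differ.

```python
# Sketch of the first-run OAuth flow; GmailClient's actual scopes/paths may differ.
import os
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]  # assumed scope

if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
else:
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)   # opens the browser consent screen
    with open("token.json", "w") as fh:
        fh.write(creds.to_json())           # token reused on later runs
```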
Set AWS credentials via environment variables or the AWS credential chain:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-2

Or configure via the AWS CLI:
aws configure

- Create `config.env` in the repository root using the template in `config.env.template`.
- Then, put `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` in `config.env`.
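For reference, `aws_session.py` presumably builds a boto3 session from these values. A sketch under that assumption (variable names taken from the list above, behavior not verified against the actual module) might look like:

```python
# Sketch: build a boto3 session from config.env (the actual aws_session.py may differ).
import os
import boto3
from dotenv import load_dotenv

load_dotenv("config.env")  # reads S3_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_REGION", "us-east-2"),
)
s3 = session.client("s3")
bucket = os.environ["S3_BUCKET"]
```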
streamlit run app.py

The app will open at http://localhost:8501.
- Add a Project — Click "➕ Add Project" in the sidebar
  - Enter Project Name (e.g., "Q4 Budget Planning")
  - Enter Project Description (e.g., "Emails about budget planning for Q4")
  - The system automatically generates classification attributes using the LLM (see the sketch after these steps)
  - Click Add Topic to create the project
- Email Auto-Sync — Emails automatically fetch on page load
  - Emails are automatically fetched, classified, and summarized on first load
  - Subsequent page loads skip auto-fetch (runs once per session)
- View Results — Select a project from the sidebar
  - View the AI-generated summary with key points, decisions, and action items
  - Expand individual emails to read full content (only the current email is displayed; the full thread is used for classification)
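The attribute-generation step referenced in "Add a Project" can be pictured as another structured-output call. The model and helper below are hypothetical sketches, not the project's schema or code.

```python
# Hypothetical sketch of LLM-generated classification attributes for a new project.
# The field names and the generate_attributes helper are illustrative only.
from pydantic import BaseModel


class TopicAttributes(BaseModel):
    keywords: list[str]
    intent: str
    classification_attributes: list[str]


def generate_attributes(client, name: str, description: str) -> TopicAttributes:
    # `client` is assumed to be an instructor-wrapped Bedrock client, as in the
    # earlier sketch; only name + description are supplied by the user.
    return client.chat.completions.create(
        modelId="openai.gpt-oss-20b-1:0",  # hypothetical model ID
        response_model=TopicAttributes,
        messages=[{
            "role": "user",
            "content": f"Generate classification attributes for a project named "
                       f"'{name}' described as: {description}",
        }],
    )
```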
- "♻️ Re-classify All" — Re-classify all emails (useful after adding new projects)
- "🔄 Re-summarize" — Update the summary for a specific project
- "📁 Projects" — Browse all projects in the sidebar with email counts and last updated times
- "📧 Individual Emails" — View full email content within each project
- "🗑️ Delete Topic" — Remove a project with confirmation (permanently deletes project, emails, and metadata)
- "📊 Sync Status" — See total emails fetched, classified emails, and last sync timestamp in sidebar
Email Prime uses a scalable, AWS-native architecture designed to handle large-scale email processing with intelligent classification and summarization. The system leverages multiple AWS services including Bedrock for LLM inference, S3 for storage, DynamoDB for metadata, and a RAG (Retrieval-Augmented Generation) pipeline with FAISS for semantic email search.
Key architectural components:
- AWS Bedrock — LLM inference with gpt-oss-20b model for classification and summarization
- Amazon Titan Embeddings — Vector embeddings for semantic search via RAG pipeline
- FAISS Vector Database — In-memory, low-latency vector search for retrieving the top-10 relevant emails (see the sketch below)
- RAG Pipeline — Enriches LLM prompts with contextually relevant emails for improved accuracy
- Service Layer — Unified `EmailService` orchestrating fetch, classify, summarize, and delete operations
- Data Storage — JSON files (local), S3 (production), and DynamoDB for metadata and timestamps
For a detailed architectural breakdown including deployment strategies, AWS service integration, scalability characteristics, and security architecture, see docs/ARCHITECTURE.md.
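As an illustration of the FAISS retrieval step above (top-10 semantic search over Titan embeddings), here is a minimal sketch with placeholder vectors; the real pipeline builds the index from embedded emails and persists vector files via S3.

```python
# Sketch: top-10 semantic retrieval with FAISS over Titan-style embeddings.
# Vectors here are random placeholders; the real index is built from embedded emails.
import faiss
import numpy as np

dim = 1536                                    # Titan text embedding dimension
email_vectors = np.random.rand(500, dim).astype("float32")

index = faiss.IndexFlatL2(dim)                # exact (brute-force) L2 search
index.add(email_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 10)   # 10 most relevant emails
print(ids[0])                                 # row indices of the matching emails
```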
The app uses a unified EmailService that coordinates all operations:
- Fetches emails from Gmail via `GmailClient`
- Classifies emails into topics using AWS Bedrock with the `instructor` library
- Generates structured summaries with key points, decisions, and action items
- Manages topics and metadata through `TopicManager` and `MetadataManager`
- Leverages the RAG pipeline for semantic context retrieval during classification
User Action (Streamlit UI)
↓
EmailService orchestration
├─ fetch_emails() → Gmail API → data/emails.json
├─ embed_emails() → Amazon Titan → FAISS vector index
├─ classify_emails() → RAG query → Top 10 emails → AWS Bedrock → data/classified_emails/{topic}/*.json
└─ summarize_all_topics() → AWS Bedrock → data/classified_emails/{topic}/summary.json
↓
Streamlit reruns and displays updated data
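Concretely, the once-per-session orchestration can be pictured roughly as below. The method names follow the flow diagram above; the import path, constructor, and control flow are assumptions rather than the app's actual code.

```python
# Sketch of how the Streamlit app might drive EmailService once per session.
# Method names come from the flow diagram above; everything else is illustrative.
import streamlit as st
from src.email_service import EmailService   # assumed module path

if "service" not in st.session_state:
    st.session_state.service = EmailService()   # persisted across Streamlit reruns
    service = st.session_state.service
    service.fetch_emails()            # Gmail API -> data/emails.json
    service.embed_emails()            # Amazon Titan -> FAISS vector index
    service.classify_emails()         # RAG query + Bedrock -> classified_emails/
    service.summarize_all_topics()    # Bedrock -> per-topic summary.json
else:
    service = st.session_state.service
```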
- Incremental Processing — Only new emails are classified by default (avoids unnecessary API calls)
- LLM-Powered Topics — Users provide just name + description; system generates all classification attributes
- RAG-Enriched Classification — Semantic search retrieves contextually relevant emails to improve LLM accuracy
- Complete Lifecycle Management — Create, view, and delete projects directly from the UI
- Structured Output — Uses Pydantic models with `instructor` for reliable LLM responses
- Session State — `EmailService` persists in the Streamlit session for fast interactions
- Smart Email Display — Full email threads available to LLM for context, but UI shows only current email to reduce clutter
- Automatic Sync — Emails fetch and classify once per session on app load
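The incremental behaviour ("only new emails are classified by default") amounts to skipping IDs that already appear under the classified folders. A hypothetical sketch, following the `data/classified_emails/{topic}/*.json` layout described earlier:

```python
# Hypothetical sketch of incremental classification: skip already-classified IDs.
# File layout follows the data/classified_emails/{topic}/*.json convention above.
import json
from pathlib import Path

def unclassified_emails(all_emails: list[dict]) -> list[dict]:
    classified_ids = set()
    for path in Path("data/classified_emails").glob("*/*.json"):
        if path.name == "summary.json":      # per-topic summary, not an email
            continue
        classified_ids.add(json.loads(path.read_text()).get("id"))
    return [email for email in all_emails if email.get("id") not in classified_ids]
```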
The CLI components can still be run independently if needed:
python src/email_connector.py

Saves emails to `data/emails.json` (uses the OAuth flow on first run).
python src/bedrock_classifier.py

Reads from `data/emails.json` and classifies emails into the topics defined in `topics.json`.
Note: Legacy script recreates data/classified_emails/ on each run (non-incremental).
python src/bedrock_summarizer.py

Generates per-topic summaries from classified emails.
Recommendation: Use the Streamlit app for all operations. Legacy scripts are provided for backwards compatibility.
The following credential files are already in .gitignore:
- `credentials.json` — Google OAuth 2.0 client secret (downloaded from Google Cloud Console)
- `token.json` — OAuth token (auto-generated after first successful authentication)
Configure AWS credentials via environment variables or AWS credential chain:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-2

Or use the AWS CLI configuration:
aws configure

Optional configuration via .env file or environment:
- `AWS_REGION` — AWS region (defaults to `us-east-2` if not set)
Security Note: Never commit files containing secrets or personal data. The .gitignore file already excludes all credential files and generated data directories.
