An email processing pipeline with a Streamlit web interface that connects to Gmail, classifies emails into topics using AWS Bedrock LLMs, and generates structured summaries of email threads grouped by topic. The system uses the instructor library with Pydantic models for structured LLM outputs. This was built as part of a project for the Columbia Engineering X Amazon Bedrock Innovation Challenge.
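As a rough illustration of the "instructor + Pydantic" structured-output pattern mentioned above, the sketch below defines a response model and asks an LLM to fill it. This is not the project's actual code: the `EmailClassification` model, the Bedrock model ID, and the `instructor.from_bedrock` client wiring are assumptions that may differ from what `src/` implements.

```python
# Minimal sketch of structured LLM output with instructor + Pydantic.
# Model names, fields, and the Bedrock client wiring are illustrative assumptions.
from pydantic import BaseModel, Field
import boto3
import instructor


class EmailClassification(BaseModel):
    topic: str = Field(description="Best-matching topic name, or 'uncategorized'")
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str


# Recent instructor versions can wrap a Bedrock runtime client; adjust the
# client setup and parameter names (model vs. modelId) to your instructor version.
client = instructor.from_bedrock(boto3.client("bedrock-runtime"))

result = client.chat.completions.create(
    modelId="openai.gpt-oss-20b-1:0",  # hypothetical Bedrock model ID
    response_model=EmailClassification,
    messages=[
        {"role": "user", "content": "Classify this email: 'Budget review moved to Friday.'"}
    ],
)
print(result.topic, result.confidence)
```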
Demo video: emailprime_demo.mp4
- Streamlit Web App — Interactive two-panel interface for managing projects and viewing summaries
- Automatic email fetching and processing (once per session)
- Project (topic) management with simplified 2-field creation form (name + description; LLM auto-generates classification attributes)
- Delete projects with confirmation dialog — safely remove projects and all associated emails
- Individual project re-summarization
- Email count and last updated timestamps
- Expandable individual email viewer with thread context separation
- Sync status display showing email counts and last update time
- Email Service — Unified orchestration service that coordinates all operations (fetch, classify, summarize, delete)
- Topic Manager — CRUD operations for email classification topics with LLM-powered attribute generation
- Metadata Manager — Tracks sync times, email counts, last update timestamps, and topic statistics
- Gmail Connector — OAuth2 authentication and email extraction with thread-aware body parsing
- Bedrock Integration — AWS Bedrock-backed LLM for classification and summarization using the `instructor` library
- `aws_session.py` — Creates an AWS session for AWS Titan and S3 buckets using `config.env`
- `embedder.py` — Embeds email data using AWS Titan
- `ingest.py` — Ingests email data by embedding it with `embedder.py` and stores vector files in the S3 bucket
- `query.py` — Retrieves the 10 most relevant emails from the S3 bucket
- `utils.py` — Utility functions for email operations
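To give a feel for the embedding step, the snippet below calls Amazon Titan Text Embeddings through the Bedrock runtime. It is a sketch under assumptions: the model ID and request shape used by `embedder.py` may differ.

```python
# Sketch: embed one email body with Amazon Titan via the Bedrock runtime.
# The model ID and request shape are assumptions; embedder.py may differ.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-2")

def embed_text(text: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]  # e.g. a 1536-dimensional vector

vector = embed_text("Subject: Q4 budget review\n\nLet's meet Friday to finalize numbers.")
print(len(vector))
```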
- `src/email_connector.py` — Standalone Gmail client that handles OAuth2 and saves messages to `data/emails.json`
- `src/bedrock_classifier.py` — CLI tool to classify emails into topics defined in `topics.json`
- `src/bedrock_summarizer.py` — CLI tool to generate per-topic structured summaries
- `data/` — Email messages and topic-classified subfolders
- `data/classified_emails/` — Per-topic email files and summaries
- `topics.json` — Topic definitions with keywords, intent, and classification attributes
- `data/metadata.json` — Sync times and statistics
- `env.yml` — Conda environment file for reproducible setup
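To make "keywords, intent, and classification attributes" concrete, one topic entry might look roughly like the dict below. The field names are illustrative only and may not match the exact `topics.json` schema.

```python
# Illustrative shape of one topic entry; the real topics.json schema may differ.
example_topic = {
    "name": "Q4 Budget Planning",
    "description": "Emails about budget planning for Q4",
    "keywords": ["budget", "forecast", "Q4", "spend"],
    "intent": "Track budget discussions, approvals, and deadlines",
    "classification_attributes": [
        "mentions budget figures or cost centers",
        "references Q4 planning meetings or deadlines",
    ],
}
```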
conda env create -f env.yml
conda activate amzn-email-summarizer

This creates an isolated environment with Python 3.11 and all required dependencies listed in `env.yml`.
This project uses Husky to automatically clean up generated files before commits. This prevents accidentally committing topics.json and data/ files.
First-time setup (run once):
npm install husky --save-dev
npx husky install

This installs Husky and the pre-commit hook that runs `cleanup.sh` automatically before each commit.
What it does:
- Deletes `topics.json` if it exists
- Deletes all files in the `data/` directory
- Prints concise status messages
- Prevents accidental commits of generated files
If you need to commit these files intentionally, you can bypass the hook:
git commit --no-verify

The project requires two sets of credentials that must NOT be committed:
- Go to Google Cloud Console
- Enable the Gmail API for your project
- Create OAuth 2.0 Client ID credentials (Application type: Desktop app)
- Download `credentials.json` and place it in the repository root
- On first run, the app will trigger the OAuth flow in your browser and auto-generate `token.json`
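The first-run flow is the standard installed-app OAuth dance from google-auth-oauthlib. A minimal sketch is shown below; the scope and file paths are assumptions that match the defaults above, and `GmailClient`'s actual implementation may differ.

```python
# Sketch of the first-run OAuth flow; GmailClient's actual scopes/paths may differ.
import os
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]  # assumed scope

if os.path.exists("token.json"):
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
else:
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)   # opens the browser consent screen
    with open("token.json", "w") as fh:
        fh.write(creds.to_json())           # token reused on later runs
```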
Set AWS credentials via environment variables or the AWS credential chain:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-2

Or configure via the AWS CLI:
aws configure

- Create `config.env` in the repository root using the template in `config.env.template`.
- Then, put `S3_BUCKET`, `AWS_ACCESS_KEY_ID`, and `AWS_SECRET_ACCESS_KEY` in `config.env`.
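For reference, `aws_session.py` presumably builds a boto3 session from these values. A sketch under that assumption (variable names taken from the list above, behavior not verified against the actual module) might look like:

```python
# Sketch: build a boto3 session from config.env (the actual aws_session.py may differ).
import os
import boto3
from dotenv import load_dotenv

load_dotenv("config.env")  # reads S3_BUCKET, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY

session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    region_name=os.environ.get("AWS_REGION", "us-east-2"),
)
s3 = session.client("s3")
bucket = os.environ["S3_BUCKET"]
```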
streamlit run app.py

The app will open at http://localhost:8501.
- Add a Project — Click "➕ Add Project" in the sidebar
  - Enter Project Name (e.g., "Q4 Budget Planning")
  - Enter Project Description (e.g., "Emails about budget planning for Q4")
  - The system automatically generates classification attributes using the LLM (see the sketch after these steps)
  - Click Add Topic to create the project
- Email Auto-Sync — Emails automatically fetch on page load
  - Emails are automatically fetched, classified, and summarized on first load
  - Subsequent page loads skip auto-fetch (runs once per session)
- View Results — Select a project from the sidebar
  - View the AI-generated summary with key points, decisions, and action items
  - Expand individual emails to read full content (only the current email is displayed; the full thread is used for classification)
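The attribute-generation step referenced in "Add a Project" can be pictured as another structured-output call. The model and helper below are hypothetical sketches, not the project's schema or code.

```python
# Hypothetical sketch of LLM-generated classification attributes for a new project.
# The field names and the generate_attributes helper are illustrative only.
from pydantic import BaseModel


class TopicAttributes(BaseModel):
    keywords: list[str]
    intent: str
    classification_attributes: list[str]


def generate_attributes(client, name: str, description: str) -> TopicAttributes:
    # `client` is assumed to be an instructor-wrapped Bedrock client, as in the
    # earlier sketch; only name + description are supplied by the user.
    return client.chat.completions.create(
        modelId="openai.gpt-oss-20b-1:0",  # hypothetical model ID
        response_model=TopicAttributes,
        messages=[{
            "role": "user",
            "content": f"Generate classification attributes for a project named "
                       f"'{name}' described as: {description}",
        }],
    )
```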
- "♻️ Re-classify All" — Re-classify all emails (useful after adding new projects)
- "🔄 Re-summarize" — Update the summary for a specific project
- "📁 Projects" — Browse all projects in the sidebar with email counts and last updated times
- "📧 Individual Emails" — View full email content within each project
- "🗑️ Delete Topic" — Remove a project with confirmation (permanently deletes project, emails, and metadata)
- "📊 Sync Status" — See total emails fetched, classified emails, and last sync timestamp in sidebar
Email Prime uses a scalable, AWS-native architecture designed to handle large-scale email processing with intelligent classification and summarization. The system leverages multiple AWS services including Bedrock for LLM inference, S3 for storage, DynamoDB for metadata, and a RAG (Retrieval-Augmented Generation) pipeline with FAISS for semantic email search.
Key architectural components:
- AWS Bedrock — LLM inference with gpt-oss-20b model for classification and summarization
- Amazon Titan Embeddings — Vector embeddings for semantic search via RAG pipeline
- FAISS Vector Database — In-memory, low-latency vector search for retrieving the top-10 relevant emails (see the sketch below)
- RAG Pipeline — Enriches LLM prompts with contextually relevant emails for improved accuracy
- Service Layer — Unified `EmailService` orchestrating fetch, classify, summarize, and delete operations
- Data Storage — JSON files (local), S3 (production), and DynamoDB for metadata and timestamps
For a detailed architectural breakdown including deployment strategies, AWS service integration, scalability characteristics, and security architecture, see docs/ARCHITECTURE.md.
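As an illustration of the FAISS retrieval step above (top-10 semantic search over Titan embeddings), here is a minimal sketch with placeholder vectors; the real pipeline builds the index from embedded emails and persists vector files via S3.

```python
# Sketch: top-10 semantic retrieval with FAISS over Titan-style embeddings.
# Vectors here are random placeholders; the real index is built from embedded emails.
import faiss
import numpy as np

dim = 1536                                    # Titan text embedding dimension
email_vectors = np.random.rand(500, dim).astype("float32")

index = faiss.IndexFlatL2(dim)                # exact (brute-force) L2 search
index.add(email_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 10)   # 10 most relevant emails
print(ids[0])                                 # row indices of the matching emails
```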
The app uses a unified EmailService that coordinates all operations:
- Fetches emails from Gmail via `GmailClient`
- Classifies emails into topics using AWS Bedrock with the `instructor` library
- Generates structured summaries with key points, decisions, and action items
- Manages topics and metadata through `TopicManager` and `MetadataManager`
- Leverages the RAG pipeline for semantic context retrieval during classification
User Action (Streamlit UI)
↓
EmailService orchestration
├─ fetch_emails() → Gmail API → data/emails.json
├─ embed_emails() → Amazon Titan → FAISS vector index
├─ classify_emails() → RAG query → Top 10 emails → AWS Bedrock → data/classified_emails/{topic}/*.json
└─ summarize_all_topics() → AWS Bedrock → data/classified_emails/{topic}/summary.json
↓
Streamlit reruns and displays updated data
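Concretely, the once-per-session orchestration can be pictured roughly as below. The method names follow the flow diagram above; the import path, constructor, and control flow are assumptions rather than the app's actual code.

```python
# Sketch of how the Streamlit app might drive EmailService once per session.
# Method names come from the flow diagram above; everything else is illustrative.
import streamlit as st
from src.email_service import EmailService   # assumed module path

if "service" not in st.session_state:
    st.session_state.service = EmailService()   # persisted across Streamlit reruns
    service = st.session_state.service
    service.fetch_emails()            # Gmail API -> data/emails.json
    service.embed_emails()            # Amazon Titan -> FAISS vector index
    service.classify_emails()         # RAG query + Bedrock -> classified_emails/
    service.summarize_all_topics()    # Bedrock -> per-topic summary.json
else:
    service = st.session_state.service
```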
- Incremental Processing — Only new emails are classified by default (avoids unnecessary API calls)
- LLM-Powered Topics — Users provide just name + description; system generates all classification attributes
- RAG-Enriched Classification — Semantic search retrieves contextually relevant emails to improve LLM accuracy
- Complete Lifecycle Management — Create, view, and delete projects directly from the UI
- Structured Output — Uses Pydantic models with `instructor` for reliable LLM responses
- Session State — `EmailService` persists in the Streamlit session for fast interactions
- Smart Email Display — Full email threads available to LLM for context, but UI shows only current email to reduce clutter
- Automatic Sync — Emails fetch and classify once per session on app load
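The incremental behaviour ("only new emails are classified by default") amounts to skipping IDs that already appear under the classified folders. A hypothetical sketch, following the `data/classified_emails/{topic}/*.json` layout described earlier:

```python
# Hypothetical sketch of incremental classification: skip already-classified IDs.
# File layout follows the data/classified_emails/{topic}/*.json convention above.
import json
from pathlib import Path

def unclassified_emails(all_emails: list[dict]) -> list[dict]:
    classified_ids = set()
    for path in Path("data/classified_emails").glob("*/*.json"):
        if path.name == "summary.json":      # per-topic summary, not an email
            continue
        classified_ids.add(json.loads(path.read_text()).get("id"))
    return [email for email in all_emails if email.get("id") not in classified_ids]
```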
The CLI components can still be run independently if needed:
python src/email_connector.py

Saves emails to `data/emails.json` (uses the OAuth flow on first run).
python src/bedrock_classifier.py

Reads from `data/emails.json` and classifies emails into the topics defined in `topics.json`.
Note: Legacy script recreates data/classified_emails/ on each run (non-incremental).
python src/bedrock_summarizer.py

Generates per-topic summaries from classified emails.
Recommendation: Use the Streamlit app for all operations. Legacy scripts are provided for backwards compatibility.
The following credential files are already in .gitignore:
- `credentials.json` — Google OAuth 2.0 client secret (downloaded from Google Cloud Console)
- `token.json` — OAuth token (auto-generated after first successful authentication)
Configure AWS credentials via environment variables or AWS credential chain:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-2

Or use the AWS CLI configuration:
aws configure

Optional configuration via .env file or environment:
- `AWS_REGION` — AWS region (defaults to `us-east-2` if not set)
Security Note: Never commit files containing secrets or personal data. The .gitignore file already excludes all credential files and generated data directories.
