End-to-end serverless healthcare data engineering pipeline analyzing 15,000+ US nursing homes using AWS cloud infrastructure
An enterprise-grade ETL pipeline that ingests 20 CMS datasets from Google Drive, performs comprehensive data quality transformations, and delivers actionable insights on nurse staffing, readmission rates, and facility performance through serverless SQL analytics.
Healthcare administrators and CMS regulators need to:
- Monitor nursing home staffing adequacy across 15,000+ US facilities
- Identify dangerous combinations of low staffing + high readmission rates
- Track workforce stability and its impact on patient outcomes
- Ensure regulatory compliance with minimum care standards
A fully automated, serverless data pipeline that:
- Ingests 2GB of CMS CSV data from Google Drive via OAuth 2.0
- Transforms with data quality checks and incremental loading
- Stores as optimized Parquet in S3 (75% compression)
- Catalogs schemas automatically using AWS Glue Crawlers
- Queries via serverless Amazon Athena SQL
- Delivers 6 business-critical metrics
- ✅ Identified 54 high-risk facilities requiring immediate intervention
- ✅ Validated -0.41 correlation between staffing and readmissions
- ✅ Achieved $2/month cost for processing 2GB of healthcare data
- ✅ 75% storage reduction using Parquet Snappy compression
- ✅ 42% of facilities fall below CMS staffing standards
- Secure OAuth 2.0 JWT authentication
- Incremental ingestion (see the manifest sketch after this feature list)
- Memory-safe streaming for large files
- Data quality: missing handling, deduplication, outlier treatment, type optimization
- Outputs columnar Parquet (Snappy)
- Glue Crawlers to auto-catalog schemas
- Athena SQL for scalable serverless queries
- 5+ healthcare KPIs computed, including:
- Bed Utilization
- Staffing Adequacy
- Nurse Turnover
- Readmission Rates
- Staffing-Readmission Correlation
- Real-time Athena query execution
- State-level comparison charts
- Facility-level drill-downs
- Risk scoring and heatmaps
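A minimal sketch of how the incremental ingestion above could be tracked, assuming a JSON manifest of already-processed Google Drive file IDs kept in S3; the bucket, key, and helper names are illustrative, not the repo's actual implementation:

```python
# Illustrative only: incremental ingestion tracked via an S3-backed manifest.
# Bucket, key, and helper names are assumptions, not the repo's actual code.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "cms-healthcare-silver"            # hypothetical bucket
MANIFEST_KEY = "manifests/processed.json"   # hypothetical manifest location

def load_manifest() -> set:
    """Return the set of Google Drive file IDs already ingested."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)
        return set(json.loads(obj["Body"].read()))
    except s3.exceptions.NoSuchKey:
        return set()                         # first run: nothing processed yet

def save_manifest(processed: set) -> None:
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(sorted(processed)).encode("utf-8"))

def incremental_run(drive_files: list, ingest_one) -> None:
    """Ingest only files whose Drive ID is not yet in the manifest."""
    processed = load_manifest()
    for f in drive_files:                    # each f: {"id": ..., "name": ...}
        if f["id"] in processed:
            continue                         # skip files loaded on a previous run
        ingest_one(f)                        # download -> clean -> write Parquet
        processed.add(f["id"])
    save_manifest(processed)
```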
🎥 Streamlit Demo Video
Google Drive (20 CSVs)
        │ OAuth 2.0
        ▼
AWS Glue ETL (Python)
        │ Parquet + DQ
        ▼
Amazon S3 (Silver Layer)
        │ Glue Crawler
        ▼
AWS Glue Data Catalog
        │ SQL
        ▼
Amazon Athena
        │ boto3
        ▼
Streamlit Dashboard
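The "Glue Crawler" hop in the diagram can be triggered programmatically once the silver-layer Parquet lands in S3, so the Data Catalog stays in sync. A hedged boto3 sketch (the crawler name and region are assumptions):

```python
# Illustrative sketch: refresh the Glue Data Catalog after new Parquet arrives in S3.
# The crawler name and region are assumptions, not values from this repo.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")
CRAWLER_NAME = "cms-silver-crawler"   # hypothetical crawler defined over the silver layer

def refresh_catalog() -> None:
    glue.start_crawler(Name=CRAWLER_NAME)
    # Poll until the crawler returns to READY so Athena queries see the latest schema.
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(15)
```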
cms-healthcare-analytics/
│
├── etl/
│   └── glue_etl_google_drive_to_s3.py    # ETL to ingest + clean + store Parquet
│
├── streamlit_app/
│   └── app.py                            # Streamlit dashboard querying Athena
│
├── sql/
│   └── metrics_queries.sql               # Bed Utilization, Staffing, Turnover, Correlation
│
├── license/
│   └── MIT License.md
│
├── .gitignore
├── README.md                             # <-- YOU ARE HERE
└── requirements.txt
Why each technology was chosen:
- AWS Glue
  - Serverless ETL
  - Python support
  - Ideal for large CSV → Parquet conversions
  - Zero-maintenance orchestration
- Amazon S3
  - Centralized data lake
  - Parquet + Snappy for 75% storage savings
  - Schema-evolution friendly
- Amazon Athena
  - Serverless SQL engine
  - No infrastructure to manage
  - Perfect for analytics dashboards
- Streamlit
  - Lightweight UI layer
  - Direct Python integration
  - Zero backend required
- Google Drive API (OAuth 2.0)
  - Secure, enterprise-grade ingestion
  - Automated access to remote CMS files
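As a rough illustration of this ingestion path, the sketch below authenticates with a service account (OAuth 2.0 JWT) via google-auth and streams a CSV from the Drive v3 API in chunks; the key file and file ID are placeholders, and the actual ETL script may authenticate differently:

```python
# Illustrative sketch: service-account (OAuth 2.0 / JWT) access to Google Drive.
# The credential file and file ID are placeholders; the real ETL may differ.
import io
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES)            # placeholder key file
drive = build("drive", "v3", credentials=creds)

def download_csv(file_id: str, dest_path: str) -> None:
    """Stream one CMS CSV from Drive to local disk in 10 MB chunks."""
    request = drive.files().get_media(fileId=file_id)
    with io.FileIO(dest_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request, chunksize=10 * 1024 * 1024)
        done = False
        while not done:
            _, done = downloader.next_chunk()          # memory-safe streaming
```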
Bed Utilization
Measures: facility capacity strain
Formula: avg_residents_per_day / certified_beds
Insight: identified facilities running >100% utilization, indicating overcrowding.
Staffing Adequacy
Measures: staffing sufficiency
CMS minimum benchmark: 4.1 hrs/resident/day
Insight: 42% of facilities fall below the minimum.
Nurse Turnover
Measures: workforce stability
Insight:
- 1 in 4 facilities have >75% turnover
- Strong predictor of poor quality
Readmission Rates
Insight:
- National average ~16%
- High-risk facilities reach 22-25%
Staffing-Readmission Correlation
Formula: CORR(staffing_hours_per_resident, readmission_rate)
Insight:
- National correlation: -0.41 (moderate negative)
- Higher staffing → lower readmissions
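For a quick sanity check outside Athena, the same correlation can be reproduced locally with pandas over the silver-layer Parquet; the S3 path below is a placeholder, the column names follow the formula above, and reading s3:// paths requires s3fs/pyarrow:

```python
# Illustrative cross-check of the Athena CORR() metric with pandas.
# The S3 path is a placeholder; column names follow the formula above.
import pandas as pd

df = pd.read_parquet(
    "s3://cms-healthcare-silver/provider_metrics/",    # hypothetical silver-layer path
    columns=["staffing_hours_per_resident", "readmission_rate"],
)
r = df["staffing_hours_per_resident"].corr(df["readmission_rate"])  # Pearson r
print(f"Staffing vs. readmission correlation: {r:.2f}")             # ~ -0.41 nationally
```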
git clone https://github.com/<aninori>/cms-healthcare-analytics.git
cd cms-healthcare-analytics
python -m venv venv
venv\Scripts\activate   # Windows
pip install -r requirements.txt
cd streamlit_app
streamlit run app.py

You will see:
- Facility-level analytics
- Interactive charts
- Live Athena integrations
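Under the hood, the dashboard reaches Athena through boto3, roughly as in the sketch below; the database name, results bucket, table, and query are placeholders rather than the exact contents of app.py:

```python
# Illustrative sketch of a boto3-based Athena helper for the Streamlit dashboard.
# Database, results bucket, and table names are placeholders, not the repo's config.
import time
import boto3
import pandas as pd
import streamlit as st

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql: str, database: str = "cms_healthcare") -> pd.DataFrame:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": "s3://cms-athena-results/"},
    )["QueryExecutionId"]
    while True:                                          # poll until the query finishes
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    cols = [c.get("VarCharValue", "") for c in rows[0]["Data"]]      # header row
    data = [[c.get("VarCharValue") for c in r["Data"]] for r in rows[1:]]
    return pd.DataFrame(data, columns=cols)

st.title("CMS Nursing Home Analytics")
st.dataframe(run_query(
    "SELECT state, AVG(readmission_rate) AS avg_readmission "
    "FROM provider_metrics GROUP BY state ORDER BY avg_readmission DESC"))
```

Note that get_query_results returns at most 1,000 rows per call, so a production dashboard would paginate or read the CSV result file Athena writes to S3 for larger result sets.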
📂 /etl/glue_etl_google_drive_to_s3.py
Includes:
- OAuth JWT Auth
- Chunked CSV ingestion
- DQ transformations
- Incremental load logic
- Parquet writer
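A condensed sketch of the chunked-ingest pattern this script describes (read the CSV in chunks, apply data-quality fixes, append to a Snappy-compressed Parquet file); the chunk size, fill rules, and paths are illustrative assumptions:

```python
# Condensed sketch of the chunked CSV -> data quality -> Parquet pattern.
# Chunk size, fill rules, and paths are illustrative assumptions only.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def clean(chunk: pd.DataFrame) -> pd.DataFrame:
    chunk = chunk.drop_duplicates()
    num_cols = chunk.select_dtypes(include="number").columns
    chunk[num_cols] = chunk[num_cols].fillna(0)             # missing-value handling
    obj_cols = chunk.select_dtypes(include="object").columns
    chunk[obj_cols] = chunk[obj_cols].fillna("")
    return chunk

def csv_to_parquet(csv_path: str, parquet_path: str, chunksize: int = 100_000) -> None:
    """Stream a large CSV through cleaning and write one Snappy-compressed Parquet file."""
    writer = None
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, low_memory=False):
        table = pa.Table.from_pandas(clean(chunk), preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema, compression="snappy")
        else:
            table = table.cast(writer.schema)                # keep chunk schemas consistent
        writer.write_table(table)
    if writer is not None:
        writer.close()
```

Writing through a single ParquetWriter keeps memory bounded to one chunk at a time while still producing one compressed file per dataset.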
📂 /sql/metrics_queries.sql
Contains:
- Bed Utilization SQL
- Staffing Hours SQL
- Turnover SQL
- Readmission SQL
- Correlation SQL
All datasets sourced from CMS (Centers for Medicare & Medicaid Services):
| Dataset | Type | Records | Purpose |
|---|---|---|---|
| FY_2024_SNF_VBP_Facility_Performance | Fact | ~15,000 | Readmission rates, VBP scores |
| NH_ProviderInfo_Oct2024 | Dimension | ~15,400 | Staffing, beds, ratings |
| NH_QualityMsr_MDS_Oct2024 | Fact | ~15,000 | Care quality metrics |
| NH_Penalties_Oct2024 | Fact | ~3,500 | Financial penalties |
| NH_CovidVaxProvider_20241027 | Fact | ~15,000 | Vaccination rates |
| ... 15 more datasets | | | |

Total Size: 2GB (CSV) → 500MB (Parquet)
Based on analysis of 15,000+ nursing homes:
- Immediate CMS Action
  - Prioritize inspections for 54 high-risk facilities
  - Implement mandatory staffing improvement plans
- State-Level Interventions
  - Focus on the 9 states with the highest risk scores
  - Provide $25M in workforce development funding
- Policy Impact
  - 15% staffing increase → 5-6% readmission reduction
  - ROI: $3.2M in avoided penalties annually
- Operational Efficiency
  - Target 85-90% bed utilization for optimal quality
  - Address the 54% national turnover rate
- AWS Glue ETL with OAuth authentication
- Data quality transformations
- S3 Parquet storage
- Athena SQL metrics
- Streamlit dashboard completion
- Interactive visualizations (Plotly)
- Real-time monitoring alerts
- Machine learning readmission prediction
- Historical trend analysis (Q1-Q4 2024)
- Automated reporting (weekly PDF)
- CloudWatch monitoring dashboards
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Python: Follow PEP 8
- SQL: Use uppercase keywords
- Comments: Docstrings for all functions
This project is licensed under the MIT License - see LICENSE file for details.
Naga Sai Anirudh Nori
- GitHub: @aninori
- LinkedIn: https://linkedin.com/in/anirudh-nori
- Email: anirudhnori01@gmail.com
- CMS for providing public healthcare datasets
- AWS for serverless infrastructure
- Apache Parquet community for columnar format
- Healthcare data engineering community
For questions or issues:
- Open a GitHub Issue
- Email: anirudhnori01@gmail.com
- LinkedIn: https://linkedin.com/in/anirudh-nori
⭐ If you find this project helpful, please consider giving it a star!
