aninori/cms-healthcare-analytics

🏥 CMS Nursing Home Analytics Pipeline


End-to-end serverless healthcare data engineering pipeline analyzing 15,000+ US nursing homes using AWS cloud infrastructure

An enterprise-grade ETL pipeline that ingests 20 CMS datasets from Google Drive, performs comprehensive data quality transformations, and delivers actionable insights on nurse staffing, readmission rates, and facility performance through serverless SQL analytics.


📊 Project Overview

Business Problem

Healthcare administrators and CMS regulators need to:

  • Monitor nursing home staffing adequacy across 15,000+ US facilities
  • Identify dangerous combinations of low staffing + high readmission rates
  • Track workforce stability and its impact on patient outcomes
  • Ensure regulatory compliance with minimum care standards

Solution

A fully automated, serverless data pipeline that:

  1. Ingests 2GB of CMS CSV data from Google Drive via OAuth 2.0
  2. Transforms with data quality checks and incremental loading
  3. Stores as optimized Parquet in S3 (75% compression)
  4. Catalogs schemas automatically using AWS Glue Crawlers
  5. Queries via serverless Amazon Athena SQL
  6. Delivers 6 business-critical metrics
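
Step 4 could be triggered from Python roughly as follows. This is a minimal sketch rather than the repository's code: the crawler name `cms-silver-crawler` is hypothetical, and the `glue` argument is assumed to be a `boto3.client("glue")` with permission to start and read the crawler.

```python
import time


def run_crawler(glue, crawler_name, poll_seconds=15, timeout_seconds=900):
    """Start a Glue crawler and block until it returns to the READY state.

    `glue` is assumed to be a boto3 Glue client (boto3.client("glue")).
    Returns the status of the last crawl, e.g. "SUCCEEDED" or "FAILED".
    """
    glue.start_crawler(Name=crawler_name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is only present once the crawler has run at least once.
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {crawler_name!r} did not finish in time")
```

With real credentials the call would look like `run_crawler(boto3.client("glue"), "cms-silver-crawler")`. Passing the client in as an argument keeps the function easy to exercise with a stub.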

Key Results

  • ✅ Identified 54 high-risk facilities requiring immediate intervention
  • ✅ Validated -0.41 correlation between staffing and readmissions
  • ✅ Achieved $2/month cost for processing 2GB healthcare data
  • ✅ 75% storage reduction using Parquet Snappy compression
  • ✅ 42% of facilities fall below CMS staffing standards

🚀 Key Features

✔ Automated ETL Pipeline (Google Drive → AWS S3)

  • Secure OAuth 2.0 JWT authentication
  • Incremental ingestion
  • Memory-safe streaming for large files
  • Data quality: missing handling, deduplication, outlier treatment, type optimization
  • Outputs columnar Parquet (Snappy)
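
The data quality steps listed above could look roughly like this when applied row by row. This is a minimal sketch using only the standard library; the column names (`provider_id`, `staffing_hours`) and the clipping bounds are illustrative assumptions, not values taken from the ETL script.

```python
import csv
import io


def clean_rows(lines, key, numeric_col, lower, upper):
    """Yield cleaned rows one at a time so the whole file never sits in memory.

    - deduplication: keep the first row seen for each `key` value
    - missing handling: blank or unparseable `numeric_col` becomes None
    - outlier treatment: present values are clipped into [lower, upper]
    - type optimization: `numeric_col` is emitted as float instead of str
    """
    seen = set()
    for row in csv.DictReader(lines):
        if row[key] in seen:
            continue
        seen.add(row[key])
        raw = (row[numeric_col] or "").strip()
        try:
            value = float(raw)
        except ValueError:
            value = None  # keep missing values explicit rather than guessing
        if value is not None:
            value = min(max(value, lower), upper)
        row[numeric_col] = value
        yield row


# Made-up sample data for illustration only.
sample = io.StringIO(
    "provider_id,staffing_hours\n"
    "A1,3.9\n"
    "A1,3.9\n"   # duplicate of the previous row
    "B2,\n"      # missing value
    "C3,250\n"   # implausible outlier
)
cleaned = list(clean_rows(sample, "provider_id", "staffing_hours", 0.0, 24.0))
```

Because `clean_rows` is a generator, the same function works on an open file handle for a multi-gigabyte CSV; only the set of seen keys grows with the input.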

✔ AWS Glue + Athena Analytics Layer

  • Glue crawlers to auto-catalog schema

  • Athena SQL for scalable serverless queries

  • 5+ Healthcare KPIs computed, including:

    • Bed Utilization
    • Staffing Adequacy
    • Nurse Turnover
    • Readmission Rates
    • Staffing-Readmission Correlation

✔ Interactive Streamlit Dashboard

  • Real-time Athena query execution
  • State-level comparison charts
  • Facility-level drill-downs
  • Risk scoring and heatmaps
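
Running an Athena query from the dashboard through boto3 might be sketched as below. The function takes the client as an argument (assumed to be `boto3.client("athena")`); the database name and S3 output location are whatever the deployment uses and are not specified in this README.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run `sql` on Athena and return the result rows as a list of dicts.

    `athena` is assumed to be boto3.client("athena"). Only the first page of
    results is read, which is enough for small aggregate queries.
    """
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        reason = status["QueryExecution"]["Status"].get("StateChangeReason", "")
        raise RuntimeError(f"Athena query {state}: {reason}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

All values come back as strings (or `None` for SQL NULL), so a dashboard would typically cast numeric columns before charting them.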

🎥 Streamlit Demo Video

Watch Video


🏗 Architecture

Google Drive (20 CSVs)
        │  OAuth 2.0
        ▼
AWS Glue ETL (Python)
        │  Parquet + DQ
        ▼
Amazon S3 - Silver Layer
        │  Glue Crawler
        ▼
AWS Glue Data Catalog
        │  SQL
        ▼
Amazon Athena
        │  boto3
        ▼
Streamlit Dashboard


Healthcare Architecture Diagram


📁 Repository Structure

cms-healthcare-analytics/
│
├── etl/
│   └── glue_etl_google_drive_to_s3.py      # ETL to ingest + clean + store parquet
│
├── streamlit_app/
│   └── app.py                               # Streamlit dashboard querying Athena
│
├── sql/
│   └── metrics_queries.sql                  # Bed Utilization, Staffing, Turnover, Correlation
│
├── license/
│   └── MIT License.md
│
├── .gitignore
├── README.md                                # <-- YOU ARE HERE
└── requirements.txt

🛠 Technology Stack

AWS Glue

Chosen for:

  • Serverless ETL
  • Python support
  • Ideal for large CSV → Parquet conversions
  • Zero-maintenance orchestration

Amazon S3

  • Centralized data lake
  • Parquet + Snappy for 75% storage savings
  • Schema evolution friendly
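
The CSV to Parquet conversion with Snappy compression might be sketched as follows, assuming `pandas` and `pyarrow` are installed. Reading every column as a string is a simplification chosen here so that all chunks share one schema; the actual ETL script may type its columns differently.

```python
def csv_to_parquet(csv_path, parquet_path, chunk_rows=100_000):
    """Stream a CSV into one Snappy-compressed Parquet file, chunk by chunk.

    Returns the number of data rows written. Only `chunk_rows` rows are held
    in memory at any time.
    """
    # Imported lazily so the module loads even where these are not installed.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    writer = None
    schema = None
    rows = 0
    try:
        for chunk in pd.read_csv(csv_path, chunksize=chunk_rows, dtype=str):
            if writer is None:
                # One fixed all-string schema keeps every chunk compatible.
                schema = pa.schema([(name, pa.string()) for name in chunk.columns])
                writer = pq.ParquetWriter(parquet_path, schema, compression="snappy")
            table = pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
            writer.write_table(table)
            rows += len(chunk)
    finally:
        if writer is not None:
            writer.close()
    return rows
```

Writing to a local path and uploading the finished file to S3 afterwards keeps the sketch independent of any particular S3 filesystem library.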

Amazon Athena

  • Serverless SQL engine
  • No infrastructure to manage
  • Perfect for analytics dashboards

Streamlit

  • Lightweight UI layer
  • Direct Python integration
  • Zero backend required

Google Drive API (OAuth 2.0 JWT)

  • Secure enterprise-grade ingestion
  • Automated access to remote CMS files
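
A chunked download from Google Drive could be sketched like this, assuming that "OAuth 2.0 JWT" refers to a Google service-account key and that the `google-auth` and `google-api-python-client` packages are installed. The key path and file ID passed in are placeholders supplied by the caller; none of these names come from the ETL script.

```python
DRIVE_READONLY_SCOPE = "https://www.googleapis.com/auth/drive.readonly"


def build_drive_service(key_path):
    """Authenticate with a service-account key (signed-JWT OAuth 2.0 flow)."""
    # Imported lazily so the module loads even without the Google libraries.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        key_path, scopes=[DRIVE_READONLY_SCOPE]
    )
    return build("drive", "v3", credentials=creds)


def download_file(service, file_id, out_file, chunk_bytes=10 * 1024 * 1024):
    """Stream one Drive file into `out_file` (a binary file object) in chunks."""
    from googleapiclient.http import MediaIoBaseDownload

    request = service.files().get_media(fileId=file_id)
    downloader = MediaIoBaseDownload(out_file, request, chunksize=chunk_bytes)
    done = False
    while not done:
        # next_chunk() returns (progress, done); each call fetches one chunk.
        _progress, done = downloader.next_chunk()
```

Writing into an open file handle rather than an in-memory buffer means memory use is bounded by `chunk_bytes`, whatever the size of the CSV.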

📊 Key Healthcare KPIs

1️⃣ Bed Utilization Rate

Measures: Facility capacity strain

Formula:

avg_residents_per_day / certified_beds

Insight: Identified facilities running >100% utilization, indicating overcrowding.
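
As a worked example of the formula, here is a minimal sketch in plain Python. The facility names and numbers are made up for illustration; guarding against zero or missing bed counts is an assumption about how the data should be handled, not something stated in the SQL.

```python
def bed_utilization(avg_residents_per_day, certified_beds):
    """Return utilization as a fraction, or None when beds is missing or zero."""
    if not certified_beds:
        return None
    return avg_residents_per_day / certified_beds


# Made-up facilities: (name, average residents per day, certified beds)
facilities = [
    ("Facility A", 88.0, 100),
    ("Facility B", 126.0, 120),
    ("Facility C", 40.0, 0),  # bad record: no certified beds reported
]
overcrowded = [
    name
    for name, residents, beds in facilities
    if (rate := bed_utilization(residents, beds)) is not None and rate > 1.0
]
```

Here only Facility B exceeds 100% (126 / 120 = 1.05), and Facility C is skipped rather than causing a division by zero.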


2️⃣ Nurse Staffing Hours per Resident

Measures: Staffing sufficiency

CMS minimum benchmark: 4.1 hrs/resident/day

Insight: 42% of facilities fall below the minimum.
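
The share of facilities below the benchmark can be computed as in this small sketch. The sample values are made up, and excluding facilities with no reported staffing from the denominator is an assumption made here for illustration.

```python
CMS_BENCHMARK_HPRD = 4.1  # hours per resident per day, as quoted above


def share_below_benchmark(hours_per_resident, benchmark=CMS_BENCHMARK_HPRD):
    """Return the fraction of facilities whose staffing is below `benchmark`.

    Facilities with a missing value (None) are left out of the denominator.
    """
    reported = [h for h in hours_per_resident if h is not None]
    if not reported:
        return None
    return sum(h < benchmark for h in reported) / len(reported)


sample_hours = [3.2, 4.1, 4.5, None, 3.9, 5.0]  # made-up values
share = share_below_benchmark(sample_hours)
```

In this sample two of the five reporting facilities (3.2 and 3.9) are below 4.1, so `share` is 0.4. A facility exactly at the benchmark counts as meeting it.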


3️⃣ Nursing Staff Turnover Rate

Measures: Workforce stability

Insight:

  • 1 in 4 facilities have >75% turnover
  • Strong predictor of poor quality

4️⃣ Readmission Rate (Facility Performance)

Insight:

  • National average ~16%
  • High-risk facilities reach 22–25%

5️⃣ Correlation: Staffing vs Readmission

CORR(staffing_hours_per_resident, readmission_rate)

Insight:

  • National correlation: -0.41 (moderate negative)
  • Higher staffing is associated with lower readmissions
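
For readers who want to check the statistic outside Athena, the Pearson coefficient that a SQL `CORR` aggregate returns can be reproduced in a few lines. This is a sketch with made-up numbers, not the project data.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either input has no variation.
    return cov / math.sqrt(var_x * var_y)


# Made-up values: staffing hours per resident vs. readmission rate.
staffing = [3.0, 3.5, 4.0, 4.5, 5.0]
readmission = [0.24, 0.21, 0.19, 0.17, 0.14]
r = pearson_corr(staffing, readmission)
```

A value near -1 means the two series move in opposite directions almost perfectly; the -0.41 reported above indicates a weaker, moderate relationship.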

📦 Installation & Local Setup

1. Clone the Repo

git clone https://github.com/aninori/cms-healthcare-analytics.git
cd cms-healthcare-analytics

2. Create Virtual Environment

python -m venv venv
venv\Scripts\activate      # Windows
source venv/bin/activate   # macOS/Linux

3. Install Requirements

pip install -r requirements.txt

🧪 Running the Streamlit Dashboard

cd streamlit_app
streamlit run app.py

You will see:

  • Facility-level analytics
  • Interactive charts
  • Live Athena integrations

📁 ETL Script Location

📁 /etl/glue_etl_google_drive_to_s3.py

Includes:

  • OAuth JWT Auth
  • Chunked CSV ingestion
  • DQ transformations
  • Incremental load logic
  • Parquet writer
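
One common way to implement incremental loading is a manifest that records which Drive files have already been processed. The sketch below assumes that approach; the manifest format is hypothetical, while `id`, `name`, and `modifiedTime` are standard Drive API v3 file fields.

```python
import json


def files_to_process(drive_files, manifest):
    """Return the Drive files that are new or changed since the last run.

    `drive_files` is a list of dicts with the Drive API v3 fields `id`, `name`
    and `modifiedTime`; `manifest` maps file id -> modifiedTime already loaded.
    """
    return [f for f in drive_files if manifest.get(f["id"]) != f["modifiedTime"]]


def update_manifest(manifest, processed_files):
    """Record the processed files and return the manifest as a JSON string."""
    for f in processed_files:
        manifest[f["id"]] = f["modifiedTime"]
    return json.dumps(manifest, indent=2, sort_keys=True)


# Made-up listing and manifest for illustration.
listing = [
    {"id": "f1", "name": "a.csv", "modifiedTime": "2024-10-01T00:00:00.000Z"},
    {"id": "f2", "name": "b.csv", "modifiedTime": "2024-10-27T00:00:00.000Z"},
    {"id": "f3", "name": "c.csv", "modifiedTime": "2024-10-27T00:00:00.000Z"},
]
manifest = {
    "f1": "2024-10-01T00:00:00.000Z",  # unchanged, will be skipped
    "f2": "2024-10-01T00:00:00.000Z",  # modified since last run
}
pending = files_to_process(listing, manifest)
```

Here `pending` contains `b.csv` (changed) and `c.csv` (new). The JSON string returned by `update_manifest` could be stored next to the data in S3 so the next run can skip unchanged files.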

📈 SQL Metrics Location

📁 /sql/metrics_queries.sql

Contains:

  • Bed Utilization SQL
  • Staffing Hours SQL
  • Turnover SQL
  • Readmission SQL
  • Correlation SQL

📈 Data Sources

All datasets sourced from CMS (Centers for Medicare & Medicaid Services):

| Dataset | Type | Records | Purpose |
|---|---|---|---|
| FY_2024_SNF_VBP_Facility_Performance | Fact | ~15,000 | Readmission rates, VBP scores |
| NH_ProviderInfo_Oct2024 | Dimension | ~15,400 | Staffing, beds, ratings |
| NH_QualityMsr_MDS_Oct2024 | Fact | ~15,000 | Care quality metrics |
| NH_Penalties_Oct2024 | Fact | ~3,500 | Financial penalties |
| NH_CovidVaxProvider_20241027 | Fact | ~15,000 | Vaccination rates |

... 15 more datasets

Total Size: 2GB (CSV) → 500MB (Parquet)



🎯 Business Recommendations

Based on analysis of 15,000+ nursing homes:

  1. Immediate CMS Action

    • Prioritize inspections for 54 high-risk facilities
    • Implement mandatory staffing improvement plans
  2. State-Level Interventions

    • Focus on 9 states with highest risk scores
    • Provide $25M workforce development funding
  3. Policy Impact

    • 15% staffing increase → 5-6% readmission reduction
    • ROI: $3.2M in avoided penalties annually
  4. Operational Efficiency

    • Target 85-90% bed utilization for optimal quality
    • Address 54% national turnover rate

🚧 Roadmap

Phase 1: Core Pipeline ✅ (Completed)

  • AWS Glue ETL with OAuth authentication
  • Data quality transformations
  • S3 Parquet storage
  • Athena SQL metrics

Phase 2: Analytics Enhancement ✅ (Completed)

  • Streamlit dashboard completion
  • Interactive visualizations (Plotly)
  • Real-time monitoring alerts

Phase 3: Advanced Features 📋 (Planned)

  • Machine learning readmission prediction
  • Historical trend analysis (Q1-Q4 2024)
  • Automated reporting (weekly PDF)
  • CloudWatch monitoring dashboards

🤝 Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Style

  • Python: Follow PEP 8
  • SQL: Use uppercase keywords
  • Comments: Docstrings for all functions

📝 License

This project is licensed under the MIT License; see license/MIT License.md for details.


👤 Author

Naga Sai Anirudh Nori


🙏 Acknowledgments

  • CMS for providing public healthcare datasets
  • AWS for serverless infrastructure
  • Apache Parquet community for columnar format
  • Healthcare data engineering community

For questions or issues:

  1. Open a GitHub Issue
  2. Email: anirudhnori01@gmail.com
  3. LinkedIn: https://linkedin.com/in/anirudh-nori

⭐ If you find this project helpful, please consider giving it a star!


Built with ❤️ using AWS, Python, and Healthcare Data
