End-to-end serverless healthcare data engineering pipeline analyzing 15,000+ US nursing homes using AWS cloud infrastructure
An enterprise-grade ETL pipeline that ingests 20 CMS datasets from Google Drive, performs comprehensive data quality transformations, and delivers actionable insights on nurse staffing, readmission rates, and facility performance through serverless SQL analytics.
Healthcare administrators and CMS regulators need to:
- Monitor nursing home staffing adequacy across 15,000+ US facilities
- Identify dangerous combinations of low staffing + high readmission rates
- Track workforce stability and its impact on patient outcomes
- Ensure regulatory compliance with minimum care standards
A fully automated, serverless data pipeline that:
- Ingests 2GB of CMS CSV data from Google Drive via OAuth 2.0
- Transforms with data quality checks and incremental loading
- Stores as optimized Parquet in S3 (75% compression)
- Catalogs schemas automatically using AWS Glue Crawlers
- Queries via serverless Amazon Athena SQL
- Delivers 6 business-critical metrics
- ✅ Identified 54 high-risk facilities requiring immediate intervention
- ✅ Validated -0.41 correlation between staffing and readmissions
- ✅ Achieved $2/month cost for processing 2GB of healthcare data
- ✅ 75% storage reduction using Parquet Snappy compression
- ✅ 42% of facilities fall below CMS staffing standards
- Secure OAuth 2.0 JWT authentication
- Incremental ingestion (see the manifest sketch after this feature list)
- Memory-safe streaming for large files
- Data quality: missing handling, deduplication, outlier treatment, type optimization
- Outputs columnar Parquet (Snappy)
- Glue Crawlers to auto-catalog schemas
- Athena SQL for scalable serverless queries
- 5+ healthcare KPIs computed, including:
- Bed Utilization
- Staffing Adequacy
- Nurse Turnover
- Readmission Rates
- Staffing-Readmission Correlation
- Real-time Athena query execution
- State-level comparison charts
- Facility-level drill-downs
- Risk scoring and heatmaps
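A minimal sketch of how the incremental ingestion above could be tracked, assuming a JSON manifest of already-processed Google Drive file IDs kept in S3; the bucket, key, and helper names are illustrative, not the repo's actual implementation:

```python
# Illustrative only: incremental ingestion tracked via an S3-backed manifest.
# Bucket, key, and helper names are assumptions, not the repo's actual code.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "cms-healthcare-silver"            # hypothetical bucket
MANIFEST_KEY = "manifests/processed.json"   # hypothetical manifest location

def load_manifest() -> set:
    """Return the set of Google Drive file IDs already ingested."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)
        return set(json.loads(obj["Body"].read()))
    except s3.exceptions.NoSuchKey:
        return set()                         # first run: nothing processed yet

def save_manifest(processed: set) -> None:
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY,
                  Body=json.dumps(sorted(processed)).encode("utf-8"))

def incremental_run(drive_files: list, ingest_one) -> None:
    """Ingest only files whose Drive ID is not yet in the manifest."""
    processed = load_manifest()
    for f in drive_files:                    # each f: {"id": ..., "name": ...}
        if f["id"] in processed:
            continue                         # skip files loaded on a previous run
        ingest_one(f)                        # download -> clean -> write Parquet
        processed.add(f["id"])
    save_manifest(processed)
```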
🎥 Streamlit Demo Video
Google Drive (20 CSVs)
        │ OAuth 2.0
        ▼
AWS Glue ETL (Python)
        │ Parquet + DQ
        ▼
Amazon S3 (Silver Layer)
        │ Glue Crawler
        ▼
AWS Glue Data Catalog
        │ SQL
        ▼
Amazon Athena
        │ boto3
        ▼
Streamlit Dashboard
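The "Glue Crawler" hop in the diagram can be triggered programmatically once the silver-layer Parquet lands in S3, so the Data Catalog stays in sync. A hedged boto3 sketch (the crawler name and region are assumptions):

```python
# Illustrative sketch: refresh the Glue Data Catalog after new Parquet arrives in S3.
# The crawler name and region are assumptions, not values from this repo.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")
CRAWLER_NAME = "cms-silver-crawler"   # hypothetical crawler defined over the silver layer

def refresh_catalog() -> None:
    glue.start_crawler(Name=CRAWLER_NAME)
    # Poll until the crawler returns to READY so Athena queries see the latest schema.
    while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
        time.sleep(15)
```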
cms-healthcare-analytics/
│
├── etl/
│   └── glue_etl_google_drive_to_s3.py    # ETL to ingest + clean + store Parquet
│
├── streamlit_app/
│   └── app.py                            # Streamlit dashboard querying Athena
│
├── sql/
│   └── metrics_queries.sql               # Bed Utilization, Staffing, Turnover, Correlation
│
├── license/
│   └── MIT License.md
│
├── .gitignore
├── README.md                             # <-- YOU ARE HERE
└── requirements.txt
Why each technology was chosen:
- AWS Glue
  - Serverless ETL
  - Python support
  - Ideal for large CSV → Parquet conversions
  - Zero-maintenance orchestration
- Amazon S3
  - Centralized data lake
  - Parquet + Snappy for 75% storage savings
  - Schema-evolution friendly
- Amazon Athena
  - Serverless SQL engine
  - No infrastructure to manage
  - Perfect for analytics dashboards
- Streamlit
  - Lightweight UI layer
  - Direct Python integration
  - Zero backend required
- Google Drive API (OAuth 2.0)
  - Secure, enterprise-grade ingestion
  - Automated access to remote CMS files
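As a rough illustration of this ingestion path, the sketch below authenticates with a service account (OAuth 2.0 JWT) via google-auth and streams a CSV from the Drive v3 API in chunks; the key file and file ID are placeholders, and the actual ETL script may authenticate differently:

```python
# Illustrative sketch: service-account (OAuth 2.0 / JWT) access to Google Drive.
# The credential file and file ID are placeholders; the real ETL may differ.
import io
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES)            # placeholder key file
drive = build("drive", "v3", credentials=creds)

def download_csv(file_id: str, dest_path: str) -> None:
    """Stream one CMS CSV from Drive to local disk in 10 MB chunks."""
    request = drive.files().get_media(fileId=file_id)
    with io.FileIO(dest_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request, chunksize=10 * 1024 * 1024)
        done = False
        while not done:
            _, done = downloader.next_chunk()          # memory-safe streaming
```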
Bed Utilization
Measures: facility capacity strain
Formula: avg_residents_per_day / certified_beds
Insight: identified facilities running >100% utilization, indicating overcrowding.
Staffing Adequacy
Measures: staffing sufficiency
CMS minimum benchmark: 4.1 hrs/resident/day
Insight: 42% of facilities fall below the minimum.
Nurse Turnover
Measures: workforce stability
Insight:
- 1 in 4 facilities have >75% turnover
- Strong predictor of poor quality
Readmission Rates
Insight:
- National average ~16%
- High-risk facilities reach 22-25%
Staffing-Readmission Correlation
Formula: CORR(staffing_hours_per_resident, readmission_rate)
Insight:
- National correlation: -0.41 (moderate negative)
- Higher staffing → lower readmissions
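For a quick sanity check outside Athena, the same correlation can be reproduced locally with pandas over the silver-layer Parquet; the S3 path below is a placeholder, the column names follow the formula above, and reading s3:// paths requires s3fs/pyarrow:

```python
# Illustrative cross-check of the Athena CORR() metric with pandas.
# The S3 path is a placeholder; column names follow the formula above.
import pandas as pd

df = pd.read_parquet(
    "s3://cms-healthcare-silver/provider_metrics/",    # hypothetical silver-layer path
    columns=["staffing_hours_per_resident", "readmission_rate"],
)
r = df["staffing_hours_per_resident"].corr(df["readmission_rate"])  # Pearson r
print(f"Staffing vs. readmission correlation: {r:.2f}")             # ~ -0.41 nationally
```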
git clone https://github.com/<aninori>/cms-healthcare-analytics.git
cd cms-healthcare-analytics
python -m venv venv
venv\Scripts\activate   # Windows
pip install -r requirements.txt
cd streamlit_app
streamlit run app.py

You will see:
- Facility-level analytics
- Interactive charts
- Live Athena integrations
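Under the hood, the dashboard reaches Athena through boto3, roughly as in the sketch below; the database name, results bucket, table, and query are placeholders rather than the exact contents of app.py:

```python
# Illustrative sketch of a boto3-based Athena helper for the Streamlit dashboard.
# Database, results bucket, and table names are placeholders, not the repo's config.
import time
import boto3
import pandas as pd
import streamlit as st

athena = boto3.client("athena", region_name="us-east-1")

def run_query(sql: str, database: str = "cms_healthcare") -> pd.DataFrame:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": "s3://cms-athena-results/"},
    )["QueryExecutionId"]
    while True:                                          # poll until the query finishes
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    cols = [c.get("VarCharValue", "") for c in rows[0]["Data"]]      # header row
    data = [[c.get("VarCharValue") for c in r["Data"]] for r in rows[1:]]
    return pd.DataFrame(data, columns=cols)

st.title("CMS Nursing Home Analytics")
st.dataframe(run_query(
    "SELECT state, AVG(readmission_rate) AS avg_readmission "
    "FROM provider_metrics GROUP BY state ORDER BY avg_readmission DESC"))
```

Note that get_query_results returns at most 1,000 rows per call, so a production dashboard would paginate or read the CSV result file Athena writes to S3 for larger result sets.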
📂 /etl/glue_etl_google_drive_to_s3.py
Includes:
- OAuth JWT Auth
- Chunked CSV ingestion
- DQ transformations
- Incremental load logic
- Parquet writer
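A condensed sketch of the chunked-ingest pattern this script describes (read the CSV in chunks, apply data-quality fixes, append to a Snappy-compressed Parquet file); the chunk size, fill rules, and paths are illustrative assumptions:

```python
# Condensed sketch of the chunked CSV -> data quality -> Parquet pattern.
# Chunk size, fill rules, and paths are illustrative assumptions only.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def clean(chunk: pd.DataFrame) -> pd.DataFrame:
    chunk = chunk.drop_duplicates()
    num_cols = chunk.select_dtypes(include="number").columns
    chunk[num_cols] = chunk[num_cols].fillna(0)             # missing-value handling
    obj_cols = chunk.select_dtypes(include="object").columns
    chunk[obj_cols] = chunk[obj_cols].fillna("")
    return chunk

def csv_to_parquet(csv_path: str, parquet_path: str, chunksize: int = 100_000) -> None:
    """Stream a large CSV through cleaning and write one Snappy-compressed Parquet file."""
    writer = None
    for chunk in pd.read_csv(csv_path, chunksize=chunksize, low_memory=False):
        table = pa.Table.from_pandas(clean(chunk), preserve_index=False)
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema, compression="snappy")
        else:
            table = table.cast(writer.schema)                # keep chunk schemas consistent
        writer.write_table(table)
    if writer is not None:
        writer.close()
```

Writing through a single ParquetWriter keeps memory bounded to one chunk at a time while still producing one compressed file per dataset.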
📂 /sql/metrics_queries.sql
Contains:
- Bed Utilization SQL
- Staffing Hours SQL
- Turnover SQL
- Readmission SQL
- Correlation SQL
All datasets sourced from CMS (Centers for Medicare & Medicaid Services):
| Dataset | Type | Records | Purpose |
|---|---|---|---|
| FY_2024_SNF_VBP_Facility_Performance | Fact | ~15,000 | Readmission rates, VBP scores |
| NH_ProviderInfo_Oct2024 | Dimension | ~15,400 | Staffing, beds, ratings |
| NH_QualityMsr_MDS_Oct2024 | Fact | ~15,000 | Care quality metrics |
| NH_Penalties_Oct2024 | Fact | ~3,500 | Financial penalties |
| NH_CovidVaxProvider_20241027 | Fact | ~15,000 | Vaccination rates |
| ... 15 more datasets | | | |

Total Size: 2GB (CSV) → 500MB (Parquet)
Based on analysis of 15,000+ nursing homes:
- Immediate CMS Action
  - Prioritize inspections for 54 high-risk facilities
  - Implement mandatory staffing improvement plans
- State-Level Interventions
  - Focus on the 9 states with the highest risk scores
  - Provide $25M in workforce development funding
- Policy Impact
  - 15% staffing increase → 5-6% readmission reduction
  - ROI: $3.2M in avoided penalties annually
- Operational Efficiency
  - Target 85-90% bed utilization for optimal quality
  - Address the 54% national turnover rate
- AWS Glue ETL with OAuth authentication
- Data quality transformations
- S3 Parquet storage
- Athena SQL metrics
- Streamlit dashboard completion
- Interactive visualizations (Plotly)
- Real-time monitoring alerts
- Machine learning readmission prediction
- Historical trend analysis (Q1-Q4 2024)
- Automated reporting (weekly PDF)
- CloudWatch monitoring dashboards
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Python: Follow PEP 8
- SQL: Use uppercase keywords
- Comments: Docstrings for all functions
This project is licensed under the MIT License - see LICENSE file for details.
Naga Sai Anirudh Nori
- GitHub: @aninori
- LinkedIn: https://linkedin.com/in/anirudh-nori
- Email: anirudhnori01@gmail.com
- CMS for providing public healthcare datasets
- AWS for serverless infrastructure
- Apache Parquet community for columnar format
- Healthcare data engineering community
For questions or issues:
- Open a GitHub Issue
- Email: anirudhnori01@gmail.com
- LinkedIn: https://linkedin.com/in/anirudh-nori
⭐ If you find this project helpful, please consider giving it a star!
