This project implements a reproducible Linux (WSL) + R data analysis pipeline for exploring extinction risk estimates in the Urban (2015) dataset distributed with CSB. The workflow emphasizes data validation, lightweight preprocessing, and statistical analysis without a database, which keeps it portable and transparent.
The dataset contains study-level extinction estimates across taxa, regions, prediction years, and modeling assumptions.
- WSL (Ubuntu) – execution environment
- Bash – pipeline orchestration and automation
- Python (standard library + pandas) – data profiling and preprocessing
- R – statistical analysis and visualization
- Git/GitHub – version control and reproducibility
.
├── data/
│   ├── raw/          # Original dataset (TSV)
│   └── processed/    # Cleaned and grouped intermediate files
├── scripts/          # Bash pipeline scripts
├── R/                # R analysis scripts
├── reports/          # Figures and analysis outputs
├── logs/             # Pipeline logs
└── README.md
chmod +x scripts/*.sh # Make scripts executable
./scripts/run_all.sh # Run full pipeline end-to-end
Or run each step individually:
./scripts/01_profile_validate.sh
./scripts/02_build_intermediates.sh
Rscript R/01_analysis.R
- Data profiling and validation (Bash + Python).
- Intermediate data construction (Bash + Python).
- Statistical analysis & visualization (R).
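As an illustration of what the profiling and validation step might check, here is a minimal sketch using only the Python standard library. It is not the repository's actual script: the function name `profile_tsv`, the column name `percent`, the `NA` missing-value marker, and the 0 to 100 range rule are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Assumption: empty strings and "NA" mark missing values.
MISSING = ("", "NA")


def profile_tsv(handle, percent_column):
    """Count rows, missing values per column, and out-of-range percentages."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = Counter()
    out_of_range = []
    n_rows = 0
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        n_rows += 1
        for name, value in row.items():
            if name is None:
                continue  # extra unnamed fields on a malformed line
            if value is None or value.strip() in MISSING:
                missing[name] += 1
        raw = (row.get(percent_column) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue  # skip values that cannot be parsed as numbers
        if not 0.0 <= value <= 100.0:
            out_of_range.append((line_no, value))
    return {"rows": n_rows, "missing": dict(missing), "out_of_range": out_of_range}


# Toy input, invented for this example (not taken from the dataset).
sample = "study\tpercent\nA\t12.5\nB\t\nC\t140\n"
report = profile_tsv(io.StringIO(sample), "percent")
print(report)
```

On the toy input, the report flags one missing `percent` value and one value above 100. A real run would open the TSV under `data/raw/` instead of an in-memory string.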
- reports/overall_weighted.csv
- reports/fig_weighted_by_region_taxa.png
- reports/fig_threshold_sensitivity.png
- reports/percent_check_top25.csv (data quality check)
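The "weighted" outputs suggest a weighted average of study-level estimates within groups. The repository computes these in R. The Python sketch below only illustrates the arithmetic, under the assumption that each study contributes an estimate and a non-negative weight. The function name and the keys `group`, `percent`, and `weight` are hypothetical, not the dataset's actual column names.

```python
from collections import defaultdict


def weighted_mean_by_group(records, group_key, value_key, weight_key):
    """Return {group: sum(w * x) / sum(w)} for groups with positive total weight."""
    totals = defaultdict(lambda: [0.0, 0.0])  # group -> [sum(w * x), sum(w)]
    for rec in records:
        w = float(rec[weight_key])
        totals[rec[group_key]][0] += w * float(rec[value_key])
        totals[rec[group_key]][1] += w
    return {g: wx / w for g, (wx, w) in totals.items() if w > 0}


# Toy records, invented for this example (not taken from the dataset).
records = [
    {"group": "g1", "percent": 10.0, "weight": 1},
    {"group": "g1", "percent": 20.0, "weight": 3},
    {"group": "g2", "percent": 5.0, "weight": 2},
]
result = weighted_mean_by_group(records, "group", "percent", "weight")
print(result)  # {'g1': 17.5, 'g2': 5.0}
```

For `g1`, the heavier-weighted study pulls the mean toward 20: (1 × 10 + 3 × 20) / 4 = 17.5, compared with an unweighted mean of 15.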
All results can be regenerated on any Linux or WSL system with Bash, Python (with pandas), and R installed by running the pipeline scripts in this repository.