This repository contains a Kubernetes controller that automatically increases the size of a Persistent Volume Claim (PVC) when it is nearing full (on either space or inode usage). It is designed specifically for Google Kubernetes Engine (GKE) Autopilot and uses Google Managed Prometheus for metrics.
Keeping volumes at a minimal size helps reduce cost, but manually scaling them up is tedious and time-consuming for a DevOps or Systems Administrator. This controller is typically used on the storage volumes backing stateful services in Kubernetes such as Prometheus, MySQL, Redis, and RabbitMQ.
- GKE Autopilot Cluster with Google Managed Prometheus enabled
- kubectl installed and set up against your cluster
- The Helm 3.0+ binary
- Google Managed Prometheus enabled on your cluster
- A StorageClass with allowVolumeExpansion set to true
- Workload Identity enabled on your GKE cluster (see the quick check below)
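If you are unsure whether Workload Identity and managed Prometheus collection are already enabled, the quick check below may help. CLUSTER_NAME, REGION, and PROJECT_ID are placeholders for your own cluster, location, and project.
# Print the Workload Identity pool (empty output means it is not enabled)
gcloud container clusters describe CLUSTER_NAME \
  --region REGION --project PROJECT_ID \
  --format="value(workloadIdentityConfig.workloadPool)"
# Print whether managed Prometheus collection is enabled (True/False)
gcloud container clusters describe CLUSTER_NAME \
  --region REGION --project PROJECT_ID \
  --format="value(monitoringConfig.managedPrometheusConfig.enabled)"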
You must have a StorageClass which supports volume expansion. To check/enable this:
# First, check if your storage class supports volume expansion...
$ kubectl get storageclasses
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
standard-rwo pd.csi.storage.gke.io Delete WaitForFirstConsumer true 10d
# If ALLOWVOLUMEEXPANSION is not set to true, patch it to enable this
kubectl patch storageclass standard-rwo -p '{"allowVolumeExpansion": true}'
- Create a GCP Service Account:
export PROJECT_ID=your-gcp-project-id
export NAMESPACE=your-namespace # Where volume-autoscaler will be deployed
# Create the GCP service account
gcloud iam service-accounts create volume-autoscaler \
--display-name="Volume Autoscaler Service Account" \
--project=$PROJECT_ID
# Grant necessary permissions
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:volume-autoscaler@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/monitoring.viewer"- Enable Workload Identity binding:
# Allow the Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding \
volume-autoscaler@$PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:$PROJECT_ID.svc.id.goog[$NAMESPACE/volume-autoscaler]"# Add the Helm repository
helm repo add gke-volume-autoscaler https://executioner1939.github.io/gke-volume-autoscaler/
helm repo update
# Install the chart
helm install volume-autoscaler gke-volume-autoscaler/volume-autoscaler \
--namespace $NAMESPACE \
--create-namespace \
--set gcp_project_id=$PROJECT_ID \
--set serviceAccount.annotations."iam\.gke\.io/gcp-service-account"="volume-autoscaler@$PROJECT_ID.iam.gserviceaccount.com"
# Or with Slack notifications
helm install volume-autoscaler gke-volume-autoscaler/volume-autoscaler \
--namespace $NAMESPACE \
--create-namespace \
--set gcp_project_id=$PROJECT_ID \
--set serviceAccount.annotations."iam\.gke\.io/gcp-service-account"="volume-autoscaler@$PROJECT_ID.iam.gserviceaccount.com" \
--set "slack_webhook_url=https://hooks.slack.com/services/123123123/4564564564/789789789789789789" \
--set "slack_channel=my-slack-channel-name" \
--set "slack_message_prefix=GKE Cluster: my-cluster"# To view what changes it will make (requires helm diff plugin)
helm diff upgrade volume-autoscaler gke-volume-autoscaler/volume-autoscaler \
--namespace $NAMESPACE \
--set gcp_project_id=$PROJECT_ID \
--set serviceAccount.annotations."iam\.gke\.io/gcp-service-account"="volume-autoscaler@$PROJECT_ID.iam.gserviceaccount.com"
# To remove the service
helm uninstall volume-autoscaler -n $NAMESPACE

To confirm the volume autoscaler is working properly:
# Deploy a test PVC that fills up quickly
kubectl apply -f https://raw.githubusercontent.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler/master/examples/simple-pod-with-pvc.yaml
# Check the logs
kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=volume-autoscaler --follow

For this to work, the volume must be mounted by a running pod. Google Managed Prometheus collects kubelet_volume_stats_* metrics only from mounted volumes.
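If you want to confirm those metrics are actually arriving, you can query Google Managed Prometheus directly through the Cloud Monitoring PromQL-compatible endpoint. This is only a sanity check and is not required; it assumes gcloud is authenticated and $PROJECT_ID is still exported from the earlier steps.
# Query the Cloud Monitoring PromQL API for the volume stats metric
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v1/projects/$PROJECT_ID/location/global/prometheus/api/v1/query" \
  --data-urlencode 'query=kubelet_volume_stats_used_bytes'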
Cloud providers restrict how often a volume can be resized; on Google Cloud you must wait before the same volume can be expanded again. The default cooldown is therefore 6 hours plus a 10-minute buffer (22200 seconds).
Control behavior per-PVC with annotations:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sample-volume-claim
  annotations:
    volume.autoscaler.kubernetes.io/scale-above-percent: "80"
    volume.autoscaler.kubernetes.io/scale-after-intervals: "5"
    volume.autoscaler.kubernetes.io/scale-up-percent: "20"
    volume.autoscaler.kubernetes.io/scale-up-min-increment: "1000000000"
    volume.autoscaler.kubernetes.io/scale-up-max-increment: "100000000000"
    volume.autoscaler.kubernetes.io/scale-up-max-size: "16000000000000"
    volume.autoscaler.kubernetes.io/scale-cooldown-time: "22200"
    volume.autoscaler.kubernetes.io/ignore: "false"
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard-rwo

The autoscaler exposes metrics on port 8000:
| Metric Name | Type | Description |
|---|---|---|
| volume_autoscaler_resize_evaluated_total | counter | Times we evaluated resizing PVCs |
| volume_autoscaler_resize_attempted_total | counter | Times we attempted to resize |
| volume_autoscaler_resize_successful_total | counter | Times we successfully resized |
| volume_autoscaler_resize_failure_total | counter | Times we failed to resize |
| volume_autoscaler_num_valid_pvcs | gauge | Number of valid PVCs detected |
| volume_autoscaler_num_pvcs_above_threshold | gauge | Number of PVCs above the threshold |
| volume_autoscaler_num_pvcs_below_threshold | gauge | Number of PVCs below the threshold |
| volume_autoscaler_release_info | info | Version information |
| volume_autoscaler_settings_info | info | Current settings |
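To eyeball these metrics, you can port-forward to the controller and scrape the endpoint once. The deployment name below assumes the default Helm release name of volume-autoscaler; adjust it if you installed under a different name.
# In one terminal, forward the metrics port
kubectl -n $NAMESPACE port-forward deploy/volume-autoscaler 8000:8000
# In another terminal, scrape it once
curl -s http://localhost:8000/metrics | grep ^volume_autoscaler_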
# Verify the annotation on the Kubernetes service account
kubectl get serviceaccount volume-autoscaler -n $NAMESPACE -o yaml
# Test authentication from a pod
kubectl run -it --rm debug \
--image=google/cloud-sdk:slim \
--overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "volume-autoscaler"}}' \
-n $NAMESPACE \
-- /bin/bash
# Inside the pod, check if authentication works
gcloud auth list

- Authentication Errors: Ensure Workload Identity is properly configured and the service accounts are correctly bound
- No Metrics Found: Verify Google Managed Prometheus is enabled and collecting kubelet metrics
- Volumes Not Scaling: Check that volumes are mounted and have exceeded the threshold for the required intervals
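When a volume does not scale, the PVC's events usually explain why (for example, a rejected resize or a cooldown still in effect). These standard kubectl commands are a reasonable first stop; sample-volume-claim is a placeholder name.
# Look at the PVC's conditions and recent events; resize requests and failures show up here
kubectl describe pvc sample-volume-claim -n $NAMESPACE
# List all PVC-related events in the namespace
kubectl get events -n $NAMESPACE --field-selector involvedObject.kind=PersistentVolumeClaim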
# Install dependencies
pip3 install -r requirements.txt
# Set your GCP project
export GCP_PROJECT_ID=your-project-id
# Run in dry-run mode
DRY_RUN=true VERBOSE=true python3 main.py

| Variable Name | Default | Description |
|---|---|---|
| GCP_PROJECT_ID | auto-detect | Google Cloud Project ID |
| INTERVAL_TIME | 60 | How often to check volumes (seconds) |
| SCALE_ABOVE_PERCENT | 80 | Threshold percentage to trigger scaling |
| SCALE_AFTER_INTERVALS | 5 | Intervals above threshold before scaling |
| SCALE_UP_PERCENT | 20 | Percentage to increase volume size |
| SCALE_UP_MIN_INCREMENT | 1000000000 | Minimum resize in bytes (1GB) |
| SCALE_UP_MAX_SIZE | 16000000000000 | Maximum volume size in bytes (16TB) |
| SCALE_COOLDOWN_TIME | 22200 | Cooldown between resizes (seconds) |
| DRY_RUN | false | Test mode - no actual resizing |
| VERBOSE | false | Enable detailed logging |
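As a rough illustration of how the sizing settings above combine (this mirrors their documented meaning, not the controller's exact code path), scaling a 10Gi volume with the defaults works out as follows:
# Illustrative only: grow by SCALE_UP_PERCENT, enforce the minimum increment,
# and cap the result at SCALE_UP_MAX_SIZE
SCALE_UP_PERCENT=20
SCALE_UP_MIN_INCREMENT=1000000000
SCALE_UP_MAX_SIZE=16000000000000
CURRENT_BYTES=$(( 10 * 1024 * 1024 * 1024 ))   # 10Gi
INCREASE=$(( CURRENT_BYTES * SCALE_UP_PERCENT / 100 ))
(( INCREASE < SCALE_UP_MIN_INCREMENT )) && INCREASE=$SCALE_UP_MIN_INCREMENT
NEW_BYTES=$(( CURRENT_BYTES + INCREASE ))
(( NEW_BYTES > SCALE_UP_MAX_SIZE )) && NEW_BYTES=$SCALE_UP_MAX_SIZE
echo "Would request $NEW_BYTES bytes (~12Gi)"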
- Complete rewrite for Google Managed Prometheus
- Removed standard Prometheus support
- Native GKE Autopilot and Workload Identity integration
- Simplified configuration and deployment
See original repository for pre-GMP versions
This is a fork focused on Google Managed Prometheus. For the original multi-prometheus version, see the original repository.
