4 changes: 4 additions & 0 deletions alerts/google-gke/metadata.yaml
@@ -112,3 +112,7 @@ alert_policy_templates:
id: restarts-containers-within-workload
description: "Alerts if a container restarts within a 5 minute window"
version: 1
-
id: uptime-checks-workload
description: "Alerts if an uptime check fails"
version: 1
37 changes: 37 additions & 0 deletions alerts/google-gke/uptime-checks-workload.v1.json
@@ -0,0 +1,37 @@
{
"displayName": "${CLUSTER_NAME}/${WORKLOAD_NAME} GKE Load Balancer Check uptime failure",

I think this is fine for now, but note that we can technically support uptime checks for ingress via URL, so we may want to consider passing in the "Load Balancer" part of the display name as a variable.

"documentation": {},
"userLabels": {
"workload_name": "${WORKLOAD_NAME}",

If we want to match all the environment variables set from the Workload Details Observability tab, we should also add:

workload_type = gke_deployment
location = ${LOCATION}
project_id = ${PROJECT_ID}
namespace = ${NAMESPACE}
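
A sketch of the `userLabels` block with those labels added. It assumes the template would receive `${LOCATION}`, `${PROJECT_ID}`, and `${NAMESPACE}` the same way it receives `${CLUSTER_NAME}`; that wiring is not shown in this PR.

```json
{
  "userLabels": {
    "workload_name": "${WORKLOAD_NAME}",
    "workload_type": "gke_deployment",
    "location": "${LOCATION}",
    "project_id": "${PROJECT_ID}",
    "namespace": "${NAMESPACE}",
    "cluster": "${CLUSTER_NAME}",
    "uptime_check_id": "${UPTIME_CHECK_ID}"
  }
}
```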

Author

Not sure if I'm missing something: where are these user labels getting populated in the Workload Details tab?

"cluster": "${CLUSTER_NAME}",
"uptime_check_id": "${UPTIME_CHECK_ID}"
},
"conditions": [

Missing conditions.displayName -> "Failure of ${alertPolicy.displayName}"

Author

Not sure how we'll populate the alert policy display name. I think it would make sense to put the uptime check name here?
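
One way the condition could carry its own display name without referencing the policy name, sketched with the existing `${UPTIME_CHECK_ID}` variable as a stand-in. A dedicated check-name variable would be a new addition, not something this template defines today. The `aggregations` and `trigger` fields are left out here for brevity.

```json
{
  "displayName": "Failure of uptime check ${UPTIME_CHECK_ID}",
  "conditionThreshold": {
    "comparison": "COMPARISON_GT",
    "duration": "${UPTIME_DURATION}",
    "filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND metric.label.check_id=\"${UPTIME_CHECK_ID}\" AND resource.type=\"k8s_service\"",
    "thresholdValue": 1
  }
}
```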

{
"conditionThreshold": {
"aggregations": [
{
"alignmentPeriod": "1200s",
"perSeriesAligner": "ALIGN_NEXT_OLDER",
"crossSeriesReducer": "REDUCE_COUNT_FALSE",
"groupByFields": [
"resource.label.*"

Our current alert policy lists these out, but if we want to support ingress + load balancer through the same policy template, then I think the wildcard is fine.
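
For comparison, an explicit list for the `k8s_service` resource type might look like the sketch below. The label names assume the standard label set for that resource type and should be checked against the monitored resource documentation before use.

```json
{
  "groupByFields": [
    "resource.label.project_id",
    "resource.label.location",
    "resource.label.cluster_name",
    "resource.label.namespace_name",
    "resource.label.service_name"
  ]
}
```

An explicit list ties the template to one resource type, which is the trade-off the wildcard avoids.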

]
}
],
"comparison": "COMPARISON_GT",
"duration": "${UPTIME_DURATION}",
"filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND metric.label.check_id=\"${UPTIME_CHECK_ID}\" AND resource.type=\"k8s_service\"",
"thresholdValue": 1,
"trigger": {
"count": 1
}
}
}
],
"alertStrategy": {
"autoClose": "604800s"
},
"combiner": "OR",
"enabled": true
}