4 changes: 2 additions & 2 deletions logs/README.md
Original file line number Diff line number Diff line change
@@ -100,11 +100,11 @@ The **Logs** Plugin comes with a [Failover Connector](https://github.com/open-te
| openTelemetry.openSearchLogs.index | string | `nil` | Name for OpenSearch index |
| openTelemetry.prometheus.additionalLabels | object | `{}` | Label selectors for the Prometheus resources to be picked up by prometheus-operator. |
| openTelemetry.prometheus.podMonitor | object | `{"enabled":true}` | Activates the pod-monitoring for the Logs Collector. |
| openTelemetry.prometheus.rules | object | `{"additionalRuleLabels":null,"annotations":{},"create":true,"enabled":["FilelogRefusedLogs","LogsOTelLogsMissing","LogsOTelLogsDecreasing","ReconcileErrors","ReceiverRefusedMetric","WorkqueueDepth"],"labels":{}}` | Default rules for monitoring the opentelemetry components. |
| openTelemetry.prometheus.rules | object | `{"additionalRuleLabels":null,"annotations":{},"create":true,"enabled":["FilelogRefusedLogs","LogsOTelLogsMissing","LogsOTelLogsDecreasing","LogsExportingFailed","ReconcileErrors","ReceiverRefusedMetric","WorkqueueDepth"],"labels":{}}` | Default rules for monitoring the opentelemetry components. |
| openTelemetry.prometheus.rules.additionalRuleLabels | string | `nil` | Additional labels for PrometheusRule alerts. |
| openTelemetry.prometheus.rules.annotations | object | `{}` | Annotations for PrometheusRules. |
| openTelemetry.prometheus.rules.create | bool | `true` | Enables PrometheusRule resources to be created. |
| openTelemetry.prometheus.rules.enabled | list | `["FilelogRefusedLogs","LogsOTelLogsMissing","LogsOTelLogsDecreasing","ReconcileErrors","ReceiverRefusedMetric","WorkqueueDepth"]` | PrometheusRules to enable. |
| openTelemetry.prometheus.rules.enabled | list | `["FilelogRefusedLogs","LogsOTelLogsMissing","LogsOTelLogsDecreasing","LogsExportingFailed","ReconcileErrors","ReceiverRefusedMetric","WorkqueueDepth"]` | PrometheusRules to enable. |
| openTelemetry.prometheus.rules.labels | object | `{}` | Labels for PrometheusRules. |
| openTelemetry.prometheus.serviceMonitor | object | `{"enabled":true}` | Activates the service-monitoring for the Logs Collector. |
| openTelemetry.region | string | `nil` | Region label for Logging |
2 changes: 1 addition & 1 deletion logs/charts/Chart.yaml
@@ -3,7 +3,7 @@

apiVersion: v2
name: logs
version: 0.0.4
version: 0.0.5
description: OpenTelemetry Operator Helm chart for Kubernetes
icon: https://raw.githubusercontent.com/cncf/artwork/a718fa97fffec1b9fd14147682e9e3ac0c8817cb/projects/opentelemetry/icon/color/opentelemetry-icon-color.png
type: application
1 change: 1 addition & 0 deletions logs/charts/ci/test-values.yaml
@@ -62,6 +62,7 @@ openTelemetry:
- ReconcileErrors
- FilelogRefusedLogs
- LogsOTelLogsMissing
- LogsExportingFailed

testFramework:
enabled: true
12 changes: 12 additions & 0 deletions logs/charts/templates/alerts.yaml
@@ -69,6 +69,18 @@ spec:
description: 'OTel on {{`{{ $labels.k8s_cluster_name }}`}} in {{`{{ $labels.region }}`}} is sending 4 times fewer logs in the last 2h. Please check.'
{{- end }}

{{- if (has "LogsExportingFailed" .Values.openTelemetry.prometheus.rules.enabled) }}
- alert: LogsExportingFailed
expr: sum(increase(otelcol_exporter_sent_log_records_total{job="logs/opentelemetry-collector-logs"}[1h])) by (k8s_cluster_name)/ sum(increase(otelcol_exporter_send_failed_log_records_total{job="logs/opentelemetry-collector-logs"}[1h]) + increase(otelcol_exporter_sent_log_records_total{job="logs/opentelemetry-collector-logs"}[1h])) by (k8s_cluster_name) < 0.9
for: 1h
labels:
severity: info
playbook: 'docs/support/playbook/logs/otel-logs-exporting-failed'
{{- include "plugin.additionalRuleLabels" . | nindent 10 }}
annotations:
summary: OTel log exporting is failing. Check logs and exporter connectivity.
description: 'OTel Collectors on {{`{{ $labels.k8s_cluster_name }}`}} are exporting logs below 90%. Please check.'
{{- end }}
{{- if and (has "ReconcileErrors" .Values.openTelemetry.prometheus.rules.enabled) (".Values.opentelemetry-operator.enabled") }}
- alert: ReconcileErrors
expr: rate(controller_runtime_reconcile_total{controller="opentelemetrycollector",result="error"}[5m]) > 0
1 change: 1 addition & 0 deletions logs/charts/values.yaml
@@ -109,6 +109,7 @@ openTelemetry:
- FilelogRefusedLogs
- LogsOTelLogsMissing
- LogsOTelLogsDecreasing
- LogsExportingFailed
- ReconcileErrors
- ReceiverRefusedMetric
- WorkqueueDepth
64 changes: 64 additions & 0 deletions logs/playbooks/LogsExportingFailed.md
@@ -0,0 +1,64 @@
---
title: Logs exporting is failing
weight: 20
---

## Root Cause Analysis

Determine if the lack of sent logs is expected:
- Check if the Plugin is up and running via the Greenhouse Dashboard.
- Check if there are other alerts that would indicate a lack of sent logs (`CrashLoopBackOff`, `ErrImagePull`).
- Check the operational logs of the pods (failing requests, incorrect credentials for log shipping).
- Check that the sink (e.g. OpenSearch) is ready to receive logs, for example by verifying that the backend pods are in the `Running` state:
```bash
kubectl get pods -n <backend-namespace> -o wide
```
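
For orientation, the `LogsExportingFailed` alert compares the log records that were sent successfully with all records the exporter attempted to send (sent plus failed) over the last hour, per cluster, and fires when that share stays below 0.9 for one hour. The arithmetic can be sketched as follows. The function names and the sample counter values are hypothetical and only illustrate the ratio; they are not part of the chart, and the sketch ignores the `for: 1h` hold time.

```bash
# Share of log records exported successfully: sent / (sent + failed).
# Both arguments are hypothetical 1h increases of the exporter counters.
export_success_ratio() {
  awk -v sent="$1" -v failed="$2" 'BEGIN {
    total = sent + failed
    if (total == 0) { print "NaN"; exit }   # nothing was attempted, no ratio
    printf "%.2f\n", sent / total
  }'
}

# Prints "firing" when the share is below the 0.9 threshold of the alert,
# otherwise "ok" (including the case where nothing was attempted).
alert_state() {
  awk -v sent="$1" -v failed="$2" 'BEGIN {
    total = sent + failed
    if (total > 0 && sent / total < 0.9) print "firing"; else print "ok"
  }'
}
```

For example, `export_success_ratio 850 150` describes a cluster where 850 of 1000 records were exported, which is below the threshold, so `alert_state 850 150` reports a firing condition.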

## Solution

### In Case of Expected Downtime
Mute the alert temporarily via Greenhouse until the plugin is healthy again and notify the responsible service owner.

### In Case of an Unexpected Downtime

If other pods in the cluster are working, check the operational logs for any error messages:
```bash
kubectl logs daemonset/logs-collector -n <namespace> | grep -i 'error'
```
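
Note that `kubectl logs daemonset/<name>` reads from a single pod of the DaemonSet, so errors that only occur on other nodes can be missed. A minimal sketch for counting error lines per collector pod is shown here. The function name is hypothetical, and the label selector is an assumption that has to be taken from the actual deployment; it is not defined in this playbook.

```bash
# Count error lines of the last hour for every collector pod.
# $1: namespace
# $2: label selector matching the collector pods (assumption: deployment-specific)
errors_per_pod() {
  namespace="$1"
  selector="$2"
  for pod in $(kubectl get pods -n "$namespace" -l "$selector" -o name); do
    # grep -c exits non-zero when nothing matches, so ignore its exit status.
    count="$(kubectl logs "$pod" -n "$namespace" --since=1h | grep -ci 'error' || true)"
    printf '%s %s\n' "$pod" "$count"
  done
}
```

Calling `errors_per_pod <namespace> <label-selector>` prints one line per pod with the pod name and the number of matching lines, which helps to tell whether the failure is limited to single nodes or affects all collectors.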

Determine the course of action accordingly. Some previously observed issues are listed below.

#### Is it ConfigMap related?
A configuration issue, syntax problem, or indentation issue within the pipeline.
1. Check the configMap of the running collector and make sure that the collector is running the latest configMap:
```bash
kubectl get ds/logs-collector -n <namespace> -o=jsonpath='{.spec.template.spec.volumes[].configMap.name}'
kubectl get cm -n <namespace> --sort-by=.metadata.creationTimestamp
```
2. Action: Update the configMap and deploy a fix.
3. Action: Restart the logs-collector:
```bash
kubectl rollout restart daemonset/logs-collector -n <namespace>
```

#### Is it a connection issue between the collector and the sink?
There could be an issue with throttling, latency, or connection timeouts when exporting to the sink.
1. Check if the sink is running and accepting connections.
2. Check if the sink has enough resources to accept new logs (CPU, memory, storage).
3. Action: Refer to the documentation of the logs sink.

#### Is it an authentication/authorization issue between the collector and the sink?
Permission issues caused by missing, wrong, or out-of-sync credentials.
1. Check which credentials are being used by checking the secret:
```bash
kubectl get ds/logs-collector -n <namespace> -o=jsonpath='{.spec.template.spec.containers[].envFrom[].secretRef.name}'
```
2. Action: Update the secret that holds the credentials used by the collector.
3. Action: Restart the logs-collector:
3. Action: Restart the logs-collector:
```bash
kubectl rollout restart daemonset/logs-collector -n <namespace>
```
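
The restart and the follow-up check that the pods were recreated can be combined. The sketch below is one possible way to do this, assuming the DaemonSet name `logs-collector` used in this playbook; the function name and the default timeout of `5m` are arbitrary choices for illustration.

```bash
# Restart the collector DaemonSet and block until the new pods are ready.
# $1: namespace
# $2: optional timeout for the rollout (defaults to 5m, an arbitrary choice)
restart_and_wait() {
  namespace="$1"
  timeout="${2:-5m}"
  kubectl rollout restart daemonset/logs-collector -n "$namespace" &&
    kubectl rollout status daemonset/logs-collector -n "$namespace" --timeout="$timeout"
}
```

`kubectl rollout status` returns a non-zero exit code if the rollout does not complete within the timeout, so `restart_and_wait <namespace>` can be used directly in a condition or script.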

### Ensure that the pods have been recreated by the operator after some time

### Observe the logs for the Pod to ensure that the problem has been resolved.

### Done.
4 changes: 2 additions & 2 deletions logs/plugindefinition.yaml
@@ -6,14 +6,14 @@ kind: PluginDefinition
metadata:
name: logs
spec:
version: 0.11.16
version: 0.11.17
displayName: Logs
description: Observability framework for instrumenting, generating, collecting, and exporting logs.
icon: https://raw.githubusercontent.com/cloudoperators/greenhouse-extensions/main/logs/logo.png
helmChart:
name: logs
repository: oci://ghcr.io/cloudoperators/greenhouse-extensions/charts
version: 0.0.4
version: 0.0.5
options:
- default: true
description: Set to true to enable the installation of the OpenTelemetry Operator.