Commit ea8f78a

Update documentation (openshift#213)
* Update documentation
* Fix failing validate
1 parent 9ea2c27 commit ea8f78a

12 files changed: 293 additions & 648 deletions

CONTRIBUTING.md

Lines changed: 1 addition & 5 deletions
@@ -45,6 +45,7 @@ The dependencies so far are:
 - Pagerduty
 - OCM
 - Tekton
+- osd-network-verifier
 
 To test Tekton and the deployment configuration, put CAD on OpenShift behind a Tekton event-listener and use curl to trigger pipeline runs by using example payloads.
 
@@ -62,8 +63,3 @@ make check-duplicate-error-messages
 ```
 
 Verify that there are no two entries with the same string.
-This also forces us to use `fmt.Errorf` and not a `errors.New`
-
-## Other
-
-additional steps will be added as required
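
Reviewer note: this diff does not show how `make check-duplicate-error-messages` is implemented. As an illustration only, the core idea of "no two entries with the same string" can be sketched in Go as below. The function name `findDuplicates` and the sample messages are hypothetical and not taken from the repository.

```go
package main

import (
	"fmt"
	"sort"
)

// findDuplicates returns every message that occurs more than once,
// sorted so the output is stable between runs.
func findDuplicates(messages []string) []string {
	counts := make(map[string]int)
	for _, m := range messages {
		counts[m]++
	}
	var dups []string
	for m, n := range counts {
		if n > 1 {
			dups = append(dups, m)
		}
	}
	sort.Strings(dups)
	return dups
}

func main() {
	// Hypothetical sample input; a real check would collect these
	// strings from the Go sources.
	messages := []string{
		"could not initialize ocm client: %w",
		"could not retrieve cluster info: %w",
		"could not initialize ocm client: %w",
	}
	for _, d := range findDuplicates(messages) {
		fmt.Printf("duplicate error message: %q\n", d)
	}
}
```

A check like this only works if each error message is a distinct literal string, which is presumably why the repository wants them unique.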

README.md

Lines changed: 34 additions & 27 deletions
@@ -9,23 +9,23 @@
 - [Contributing](#contributing)
 - [Documentation](#documentation)
 - [CAD CLI](#cad-cli)
+- [Investigations](#investigations)
 - [Integrations](#integrations)
 - [Overview](#overview)
-- [Alert firing investigation](#alert-firing-investigation)
-- [CHGM investigation overview](#chgm-investigation-overview)
 - [Templates](#templates)
 - [Dashboards](#dashboards)
 - [Deployment](#deployment)
 - [Boilerplate](#boilerplate)
 - [PipelinePruner](#pipelinepruner)
+- [Required ENV variables](#required-env-variables)
 
 # Configuration Anomaly Detection
 
 [![Configuration Anomaly Detection](./images/CadCat.png)](https://github.com/openshift/configuration-anomaly-detection)
 
 ## About
 
-Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE investigation by detecting cluster anomalies and sending relevant communications to the cluster owner.
+Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.
 
 ## Contributing
 
@@ -35,13 +35,20 @@ To contribute to CAD, please see our [CONTRIBUTING Document](CONTRIBUTING.md).
 
 ## CAD CLI
 
-* [cadctl](./cadctl/README.md) -- Performs workflow for 'cluster has gone missing' (CHGM) alerts.
+* [cadctl](./cadctl/README.md) -- Performs investigation workflow.
+
+## Investigations
+
+Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.
+
+Investigation specific documentation can be found in the according investigation folder, e.g. for [ClusterHasGoneMissing](./pkg/investigations/chgm/README.md).
 
 ## Integrations
 
-* [AWS](./pkg/aws/README.md) -- Logging into the cluster, retreiving instance info and AWS CloudTrail events.
-* [PagerDuty](./pkg/pagerduty/README.md) -- Retrieving alert info, esclating or silencing incidents, and adding notes.
-* [OCM](./pkg/ocm/README.md) -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
+* [AWS](https://github.com/aws/aws-sdk-go) -- Logging into the cluster, retreiving instance info and AWS CloudTrail events.
+* [PagerDuty](https://github.com/PagerDuty/go-pagerduty) -- Retrieving alert info, esclating or silencing incidents, and adding notes.
+* [OCM](https://github.com/openshift-online/ocm-sdk-go) -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
+* [osd-network-verifier](https://github.com/openshift/osd-network-verifier) -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
 
 ## Overview
 
@@ -53,26 +60,6 @@ To contribute to CAD, please see our [CONTRIBUTING Document](CONTRIBUTING.md).
 ![CAD Overview](./images/cad_overview/cad_architecture_dark.png#gh-dark-mode-only)
 ![CAD Overview](./images/cad_overview/cad_architecture_light.png#gh-light-mode-only)
 
-### Alert firing investigation
-
-1. PagerDuty webhook receives CHGM alert from Dead Man's Snitch.
-2. CAD Tekton pipeline is triggered via PagerDuty sending a webhook to Tekton EventListener.
-3. Logs into AWS account of cluster and checks for stopped/terminated instances.
-- If unable to access AWS account, posts "cluster credentials are missing" limited support reason.
-4. If stopped/terminated instances are found, pulls AWS CloudTrail events for those instances.
-- If no stopped/terminated instances are found, escalates to SRE for further investigation.
-5. If the user of the event is:
-- Authorized (SRE or OSD managed), runs the network verifier and escalates the alert to SRE for futher investigation.
-- **Note:** Authorized users have prefix RH-SRE, osdManagedAdmin, or have the ManagedOpenShift-Installer-Role.
-- Not authorized (not SRE or OSD managed), posts the appropriate limited support reason and silences the alert.
-6. Adds notes with investigation details to the PagerDuty alert.
-
-
-## CHGM investigation overview
-
-![CHGM investigation overview](./images/cad_chgm_investigation/chgm_investigation_dark.png#gh-dark-mode-only)
-![CHGM investigation overview](./images/cad_chgm_investigation/chgm_investigation_light.png#gh-light-mode-only)
-
 ## Templates
 
 * [Update-Template](./hack/update-template/README.md) -- Updating configuration-anomaly-detection-template.Template.yaml.
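
Reviewer note: the "Alert firing investigation" steps removed from the README in this hunk describe a decision flow for the CHGM alert. For readers who want that flow in compact form, a minimal Go sketch of steps 3 to 5 follows. The `findings` type, the `decide` function and the string matching in `isAuthorized` are assumptions made for illustration; they are not CAD's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// findings is a hypothetical summary of what the investigation observed.
type findings struct {
	awsAccessible    bool   // could CAD log into the cluster's AWS account?
	stoppedInstances int    // number of stopped/terminated instances found
	eventUser        string // user recorded on the CloudTrail event
}

// isAuthorized approximates the note in the removed text: authorized users
// have the prefix RH-SRE or osdManagedAdmin, or carry the
// ManagedOpenShift-Installer-Role. The exact matching rule is assumed.
func isAuthorized(user string) bool {
	return strings.HasPrefix(user, "RH-SRE") ||
		strings.HasPrefix(user, "osdManagedAdmin") ||
		strings.Contains(user, "ManagedOpenShift-Installer-Role")
}

// decide maps the findings to the action described in steps 3 to 5.
func decide(f findings) string {
	switch {
	case !f.awsAccessible:
		return "post limited support: cluster credentials are missing"
	case f.stoppedInstances == 0:
		return "escalate to SRE"
	case isAuthorized(f.eventUser):
		return "run network verifier, then escalate to SRE"
	default:
		return "post limited support and silence the alert"
	}
}

func main() {
	f := findings{awsAccessible: true, stoppedInstances: 2, eventUser: "customer-admin"}
	fmt.Println(decide(f))
}
```

In every branch, step 6 of the removed text (adding investigation notes to the PagerDuty alert) would still follow.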
@@ -95,3 +82,23 @@ Grafana dashboard configmaps are stored in the [Dashboards](./dashboards/) direc
 ## PipelinePruner
 
 * [PipelinePruner](./openshift/PipelinePruning.md) -- Documentation about PipelineRun pruning.
+
+## Required ENV variables
+
+* `CAD_OCM_CLIENT_ID`: refers to the OCM client ID used by CAD to initialize the OCM client
+* `CAD_OCM_CLIENT_SECRET`: refers to the OCM client secret used by CAD to initialize the OCM client
+* `CAD_OCM_URL`: refers to the used OCM url used by CAD to initialize the OCM client
+* `AWS_ACCESS_KEY_ID`: refers to the access key id of the base AWS account used by CAD
+* `AWS_SECRET_ACCESS_KEY`: refers to the secret access key of the base AWS account used by CAD
+* `CAD_AWS_CSS_JUMPROLE`: refers to the arn of the RH-SRE-CCS-Access jumprole
+* `CAD_AWS_SUPPORT_JUMPROLE`: refers to the arn of the RH-Technical-Support-Access jumprole
+* `CAD_ESCALATION_POLICY`: refers to the escalation policy CAD should use to escalate the incident to
+* `CAD_PD_EMAIL`: refers to the email for a login via mail/pw credentials
+* `CAD_PD_PW`: refers to the password for a login via mail/pw credentials
+* `CAD_PD_TOKEN`: refers to the generated private access token for token-based authentication
+* `CAD_PD_USERNAME`: refers to the username of CAD on PagerDuty
+* `CAD_SILENT_POLICY`: refers to the silent policy CAD should use if the incident shall be silent
+* `PD_SIGNATURE`: refers to the PagerDuty webhook signature (HMAC+SHA256)
+* `X_SECRET_TOKEN`: refers to our custom Secret Token for authenticating against our pipeline
+
+For Red Hat employees, these environment variables can be found in the SRE-P vault.
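
Reviewer note: the new "Required ENV variables" section lists fifteen variables but does not say how CAD reacts when one is absent. A minimal Go sketch of a startup check is shown below, using only the standard library. The names `requiredEnv` and `missingEnv` are hypothetical, and the sketch treats every listed variable as mandatory because the heading suggests so; whether CAD accepts the mail/password pair and the token as alternatives is not stated in this diff.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredEnv mirrors the variable names listed in the README section.
var requiredEnv = []string{
	"CAD_OCM_CLIENT_ID",
	"CAD_OCM_CLIENT_SECRET",
	"CAD_OCM_URL",
	"AWS_ACCESS_KEY_ID",
	"AWS_SECRET_ACCESS_KEY",
	"CAD_AWS_CSS_JUMPROLE",
	"CAD_AWS_SUPPORT_JUMPROLE",
	"CAD_ESCALATION_POLICY",
	"CAD_PD_EMAIL",
	"CAD_PD_PW",
	"CAD_PD_TOKEN",
	"CAD_PD_USERNAME",
	"CAD_SILENT_POLICY",
	"PD_SIGNATURE",
	"X_SECRET_TOKEN",
}

// missingEnv returns the names that are unset or empty according to lookup.
// Passing the lookup function in keeps the check testable without touching
// the real process environment.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.LookupEnv); len(missing) > 0 {
		fmt.Fprintf(os.Stderr, "missing required environment variables: %s\n",
			strings.Join(missing, ", "))
		os.Exit(1)
	}
	fmt.Println("all required environment variables are set")
}
```

Reporting all missing names at once, rather than failing on the first, saves repeated runs when several variables are unset.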

images/cad_overview/README.md

Whitespace-only changes.
