alert-mgmt-01: k8s foundation #787
sradco wants to merge 1 commit into openshift:alerts-management-api from
Conversation
Walkthrough: The PR upgrades core Kubernetes and Prometheus dependencies to v0.34.2+ while introducing a comprehensive Kubernetes client abstraction layer with managers for Prometheus rules, namespace monitoring, alerting health, and alert configuration management.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 8
🧹 Nitpick comments (1)
pkg/k8s/client.go (1)
57-62: routeClientset is created but not stored on the client struct.
The routeClientset is only passed to newPrometheusAlerts. If future methods on client need route access, it won't be available. This is fine if prometheusAlerts is the only consumer, but worth noting for awareness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/client.go` around lines 57 - 62, The client constructor builds a routeClientset but doesn't store it on the client struct; update the client struct (type client) to include a routeClientset field and assign the created routeClientset to that field in the constructor where c := &client{...} is created, so routeClientset is available for future methods; keep passing routeClientset into newPrometheusAlerts as before but also store it on the client instance (reference symbols: routeClientset, client struct, newPrometheusAlerts, prometheusAlerts).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Around line 12-13: Update the Prometheus endpoint constants in vars.go so they
include the required "/api" prefix: change the PrometheusAlertsPath and
PrometheusRulesPath constants from "/v1/alerts" and "/v1/rules" to
"/api/v1/alerts" and "/api/v1/rules" respectively; locate the symbols
PrometheusAlertsPath and PrometheusRulesPath in vars.go and modify their string
values to the corrected paths.
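As a quick illustration of the finding above, here is a minimal sketch of the corrected constants and how they would be joined onto a base URL. The constant names follow the finding; the base URL and the alertsURL helper are purely illustrative, not code from this PR.

```go
package main

import "fmt"

// The Prometheus HTTP API serves its v1 endpoints under the "/api" prefix,
// so the path constants must carry it (the fix suggested for vars.go).
const (
	PrometheusAlertsPath = "/api/v1/alerts"
	PrometheusRulesPath  = "/api/v1/rules"
)

// alertsURL joins a base URL with the alerts path (helper name is illustrative).
func alertsURL(base string) string {
	return base + PrometheusAlertsPath
}

func main() {
	fmt.Println(alertsURL("https://thanos-querier.example:9091"))
}
```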
In `@pkg/k8s/client.go`:
- Around line 36-94: The Client interface methods ConfigMaps()
ConfigMapInterface and AlertingHealth(ctx context.Context) (AlertingHealth,
error) are declared but not implemented on the *client type; add receiver
methods on *client named ConfigMaps and AlertingHealth that return the
appropriate values (for ConfigMaps return the configmap manager/implementation
instance held on the client or create one similarly to namespaceManager; for
AlertingHealth call or create the alert-health checker using existing clients
like monitoringv1clientset or prometheusAlerts and return (AlertingHealth,
error)). Ensure the method signatures exactly match the interface (ConfigMaps()
ConfigMapInterface and AlertingHealth(ctx context.Context) (AlertingHealth,
error)) so the compile-time assertion var _ Client = (*client)(nil) succeeds.
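The compile-time assertion mentioned in the finding works like this trimmed-down sketch. The ConfigMapInterface and manager bodies here are stand-ins, not the real pkg/k8s types; the point is that `var _ Client = (*client)(nil)` fails to compile until every interface method exists on *client with a matching signature.

```go
package main

import "fmt"

// Stand-ins for the real pkg/k8s types named in the finding.
type ConfigMapInterface interface{ Name() string }

type configMapManager struct{}

func (configMapManager) Name() string { return "configmap-manager" }

type Client interface {
	ConfigMaps() ConfigMapInterface
}

type client struct {
	configMaps ConfigMapInterface
}

// ConfigMaps returns the manager stored on the client; the signature must
// match the interface exactly for the assertion below to hold.
func (c *client) ConfigMaps() ConfigMapInterface { return c.configMaps }

// Fails at compile time if *client drifts from the Client interface.
var _ Client = (*client)(nil)

func main() {
	c := &client{configMaps: configMapManager{}}
	fmt.Println(c.ConfigMaps().Name())
}
```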
In `@pkg/k8s/namespace.go`:
- Around line 69-75: WaitForNamedCacheSync call's boolean return value is
ignored: after starting nm.informer.Run(ctx.Done()) you must check the result of
cache.WaitForNamedCacheSync("Namespace informer", ctx.Done(),
nm.informer.HasSynced) and if it returns false, return an error instead of
returning nm nil; update the function that constructs/returns nm to propagate a
descriptive error (e.g., "namespace informer failed to sync") when
WaitForNamedCacheSync returns false so the caller knows the informer never
synced.
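The fail-fast shape the comment asks for can be sketched as follows. Here waitForSync stands in for cache.WaitForNamedCacheSync, which reports false when the informer never syncs (for example, when the context is cancelled first); the manager type is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

type namespaceManager struct{}

// newNamespaceManager returns an error instead of a half-initialized manager
// when the informer cache never syncs.
func newNamespaceManager(waitForSync func() bool) (*namespaceManager, error) {
	if !waitForSync() {
		return nil, errors.New("namespace informer failed to sync")
	}
	return &namespaceManager{}, nil
}

func main() {
	if _, err := newNamespaceManager(func() bool { return false }); err != nil {
		fmt.Println("sync failure surfaced:", err)
	}
}
```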
In `@pkg/k8s/prometheus_rule.go`:
- Around line 21-39: Change newPrometheusRuleManager to return
(*prometheusRuleManager, error), check the boolean result of
cache.WaitForNamedCacheSync and return an error if it returns false (e.g.,
context cancelled or sync failed) instead of discarding it; update the function
signature and its return statements (prometheusRuleManager constructor) and
adjust the caller in pkg/k8s/client.go to handle the returned error (as shown in
other managers like newNamespaceManager/newAlertRelabelConfigManager) so
creation fails fast when the informer never syncs.
- Around line 82-89: The Delete handler in prometheusRuleManager (func Delete)
returns an error that only mentions the resource name; update the error
formatting to include both namespace and name (same pattern used in
Update/AddRule) by changing the fmt.Errorf call to include namespace and name in
the message so logs show "failed to delete PrometheusRule <namespace>/<name>:
%w".
- Around line 45-58: The List method prometheusRuleManager.List currently
ignores the namespace parameter and returns all items from
prm.informer.GetStore().List(); update it to filter the returned PrometheusRule
objects by namespace: iterate prs as before, assert item to
*monitoringv1.PrometheusRule (pr), then skip entries where pr.GetNamespace() !=
namespace when namespace is non-empty (and if you want to preserve previous
behavior, treat an empty namespace as "no filtering" to return all). Ensure the
function still returns a slice of monitoringv1.PrometheusRule built from the
filtered items and returns nil error.
In `@pkg/k8s/types.go`:
- Around line 157-159: The IsClusterMonitoringNamespace method currently returns
only bool which conflates “not cluster-monitoring” with lookup/read errors;
update the NamespaceInterface by changing IsClusterMonitoringNamespace(name
string) bool to IsClusterMonitoringNamespace(name string) (bool, error) and then
update all implementations of that method (and callers) to return (true, nil) or
(false, nil) for a definitive negative result and to return (false, err) when
namespace retrieval/label read fails so errors are propagated rather than
silently treated as false.
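The (bool, error) contract can be sketched over a map-backed lookup; the real implementation would read namespace labels from an informer store, and the label key used here is an assumption for illustration.

```go
package main

import "fmt"

// Assumed label key, for illustration only.
const clusterMonitoringLabel = "openshift.io/cluster-monitoring"

// namespaceStore maps namespace name -> labels, standing in for the informer store.
type namespaceStore map[string]map[string]string

// IsClusterMonitoringNamespace distinguishes a definitive negative answer
// (false, nil) from a lookup failure (false, err).
func (s namespaceStore) IsClusterMonitoringNamespace(name string) (bool, error) {
	labels, ok := s[name]
	if !ok {
		return false, fmt.Errorf("namespace %q not found", name)
	}
	return labels[clusterMonitoringLabel] == "true", nil
}

func main() {
	s := namespaceStore{
		"openshift-monitoring": {clusterMonitoringLabel: "true"},
		"default":              {},
	}
	fmt.Println(s.IsClusterMonitoringNamespace("openshift-monitoring")) // definitive positive
	fmt.Println(s.IsClusterMonitoringNamespace("missing"))              // lookup error, not a silent false
}
```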
- Around line 147-150: RelabeledRulesInterface currently has List(ctx
context.Context) []monitoringv1.Rule and Get(ctx context.Context, id string)
(monitoringv1.Rule, bool) which cannot surface backend or context errors; change
the contract so List returns ([]monitoringv1.Rule, error) and Get returns
(monitoringv1.Rule, bool, error) (or alternatively (*monitoringv1.Rule, error)
if preferred), then update all implementations of RelabeledRulesInterface and
their callers to propagate and handle the error return (including
context/cancellation and backend failure cases) so failures are not silently
swallowed; update unit tests and any call sites that assume the old signatures
accordingly.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (1)
go.sum is excluded by !**/*.sum
📒 Files selected for processing (9)
go.mod, pkg/k8s/auth_context.go, pkg/k8s/client.go, pkg/k8s/client_factory.go, pkg/k8s/namespace.go, pkg/k8s/prometheus_rule.go, pkg/k8s/prometheus_rules_types.go, pkg/k8s/types.go, pkg/k8s/vars.go
pkg/k8s/prometheus_rule.go
Outdated
func newPrometheusRuleManager(ctx context.Context, clientset *monitoringv1client.Clientset) *prometheusRuleManager {
	informer := cache.NewSharedIndexInformer(
		prometheusRuleListWatchForAllNamespaces(clientset),
		&monitoringv1.PrometheusRule{},
		0,
		cache.Indexers{},
	)

	go informer.Run(ctx.Done())

	cache.WaitForNamedCacheSync("PrometheusRule informer", ctx.Done(),
		informer.HasSynced,
	)

	return &prometheusRuleManager{
		clientset: clientset,
		informer:  informer,
	}
}
WaitForNamedCacheSync return value is discarded; no error returned to caller.
WaitForNamedCacheSync returns false if synchronization fails (e.g., context cancelled), but the result is ignored and newPrometheusRuleManager does not return an error. Other managers in this PR (newNamespaceManager, newAlertRelabelConfigManager) return (*manager, error), but this one only returns *prometheusRuleManager. If the cache never syncs, operations will silently work with stale or empty data.
🐛 Proposed fix: return an error on sync failure
-func newPrometheusRuleManager(ctx context.Context, clientset *monitoringv1client.Clientset) *prometheusRuleManager {
+func newPrometheusRuleManager(ctx context.Context, clientset *monitoringv1client.Clientset) (*prometheusRuleManager, error) {
informer := cache.NewSharedIndexInformer(
prometheusRuleListWatchForAllNamespaces(clientset),
&monitoringv1.PrometheusRule{},
0,
cache.Indexers{},
)
go informer.Run(ctx.Done())
- cache.WaitForNamedCacheSync("PrometheusRule informer", ctx.Done(),
+ if !cache.WaitForNamedCacheSync("PrometheusRule informer", ctx.Done(),
informer.HasSynced,
- )
+ ) {
+ return nil, fmt.Errorf("failed to sync PrometheusRule informer cache")
+ }
return &prometheusRuleManager{
clientset: clientset,
informer: informer,
- }
+ }, nil
}

Note: This change requires updating the call site in pkg/k8s/client.go (Line 64) to handle the error:
c.prometheusRuleManager, err = newPrometheusRuleManager(ctx, monitoringv1clientset)
if err != nil {
return nil, fmt.Errorf("failed to create prometheus rule manager: %w", err)
}
pkg/k8s/prometheus_rule.go
Outdated
func (prm *prometheusRuleManager) List(ctx context.Context, namespace string) ([]monitoringv1.PrometheusRule, error) {
	prs := prm.informer.GetStore().List()

	prometheusRules := make([]monitoringv1.PrometheusRule, 0, len(prs))
	for _, item := range prs {
		pr, ok := item.(*monitoringv1.PrometheusRule)
		if !ok {
			continue
		}
		prometheusRules = append(prometheusRules, *pr)
	}

	return prometheusRules, nil
}
Bug: List ignores the namespace parameter.
The namespace parameter is accepted but never used. The method returns all PrometheusRule resources from the informer store regardless of namespace. This silently returns a superset of results when callers expect namespace-scoped results.
🐛 Proposed fix to filter by namespace
func (prm *prometheusRuleManager) List(ctx context.Context, namespace string) ([]monitoringv1.PrometheusRule, error) {
prs := prm.informer.GetStore().List()
prometheusRules := make([]monitoringv1.PrometheusRule, 0, len(prs))
for _, item := range prs {
pr, ok := item.(*monitoringv1.PrometheusRule)
if !ok {
continue
}
+ if namespace != "" && pr.Namespace != namespace {
+ continue
+ }
prometheusRules = append(prometheusRules, *pr)
}
return prometheusRules, nil
}
pkg/k8s/types.go
Outdated
	List(ctx context.Context) []monitoringv1.Rule

	// Get retrieves the relabeled rule for a given id
	Get(ctx context.Context, id string) (monitoringv1.Rule, bool)
Expose failures in RelabeledRulesInterface reads.
At Line [147] and Line [150], the API can’t report failures even though calls are context-aware. This forces implementations to swallow backend/config errors and can present incorrect empty data.
💡 Suggested contract update
type RelabeledRulesInterface interface {
// List retrieves the relabeled rules for a given PrometheusRule
- List(ctx context.Context) []monitoringv1.Rule
+ List(ctx context.Context) ([]monitoringv1.Rule, error)
// Get retrieves the relabeled rule for a given id
- Get(ctx context.Context, id string) (monitoringv1.Rule, bool)
+ Get(ctx context.Context, id string) (monitoringv1.Rule, bool, error)
// Config returns the list of alert relabel configs
Config() []*relabel.Config
}
@sradco: This pull request references mgmt-01 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@sradco: This pull request references mgmt-01 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
@sradco: The following tests failed.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jgbernalp, @jan--f, @avlitman, @machacekondra: Please review this PR.
pkg/k8s/namespace.go
Outdated
|
|
go nm.informer.Run(ctx.Done())

cache.WaitForNamedCacheSync("Namespace informer", ctx.Done(),
Should we handle the error returned here?
…nt API

Add the foundational k8s layer for alert management:
- Client factory for dynamic and typed Kubernetes clients
- Namespace resolution helpers
- Auth context extraction from HTTP requests
- Core types (AlertingRuleSource, sortable slices)
- Shared vars (label names, configmap keys)
- PrometheusRule types and rule parsing/filtering helpers

Signed-off-by: Shirly Radco <sradco@redhat.com>
Signed-off-by: João Vilaça <jvilaca@redhat.com>
Signed-off-by: Aviv Litman <alitman@redhat.com>
Signed-off-by: machadovilaca <machadovilaca@gmail.com>
Co-authored-by: AI Assistant <noreply@cursor.com>
@simonpasquier, @jgbernalp thank you for the reviews. I fixed the issues you raised.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: simonpasquier, sradco. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Alert Management API — Part 1/8: k8s foundation
Summary
- GET /api/v1/alerting/health stub (returns 501 Not Implemented) to make the intended API shape/call-path reviewable early
This PR is part of a stacked series. Please review in order.
Summary by CodeRabbit
Release Notes
Chores
New Features