TRT-1989: add metrics and isolate test matview refreshes#3234
TRT-1989: add metrics and isolate test matview refreshes#3234openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds a DB-backed API to count distinct job runs and tests by lookback, exposes those counts as Prometheus gauges during matview refresh, reorders matview execution for load smoothing, and records load/refresh duration to a Prometheus pushgateway. Changes
Sequence Diagram(s)sequenceDiagram
participant S as SippyServer
participant A as API (pkg/api)
participant DB as Database
participant P as Pushgateway/Prometheus
participant M as MatViewWorkers
S->>A: GetJobRunTestsCountByLookback(lookbackDays)
A->>DB: SELECT COUNT(DISTINCT prow_job_run_id), COUNT(DISTINCT test_id) WHERE created_at > truncatedTime
DB-->>A: (job_runs_count, test_ids_count)
A-->>S: (job_runs_count, test_ids_count)
S->>S: set GaugeVecs for each lookback_days
S->>P: push gauges (optional if pushgateway configured)
S->>M: enqueue non-2d matviews
S->>M: enqueue 2d matview (last)
M-->>S: refresh completed
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pkg/api/tests.go`:
- Around line 263-290: GetJobRunTestsCountByLookback uses
time.Now().Add(...).Truncate(...) which is local-time dependent; change the
midnight calculation to use time.Date with UTC to get a reliable UTC midnight
start (e.g. build truncatedTime via time.Date(...).UTC()) and validate that
lookbackDays is positive (return an error for non-positive values) before
running the DB query; update references to truncatedTime and lookbackDays inside
the function accordingly.
In `@pkg/sippyserver/server.go`:
- Around line 309-315: The metric observation for totalRefreshMinutesMetric is
happening in RefreshData() after promPusher.Add() so the Pushgateway gets the
previous cycle's value; move the call to
totalRefreshMinutesMetric.Observe(float64(elapsed.Minutes())) into
refreshMaterializedViews() immediately before the code that calls
promPusher.Add(), and remove the redundant Observe call from RefreshData() so
the pushed metric reflects the just-completed refresh cycle.
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pkg/sippyserver/server.go`:
- Around line 133-141: Fix the grammar in the Prometheus metric Help texts for
matViewUniqueNumberOfTests and matViewUniqueNumberOfJobRuns by changing "the
based on lookback days" to "based on lookback days" in the GaugeOpts Help fields
so the descriptions read correctly.
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
a3cb03a to
baf2f13
Compare
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cmd/sippy/load.go`:
- Around line 316-319: The metric Gauge is being created inside the RunE
invocation (loadMetricGauge via promauto.NewGauge) which can panic on duplicate
registrations; move the gauge declaration to package scope (mirroring the
pattern used in server.go) and replace the in-function creation with uses of
that package-level variable, or alternatively create the gauge with
prometheus.NewGauge and register it once on a dedicated registry passed to the
pusher; update references in RunE to use the package-level loadMetricGauge (or
the single-registered gauge/registry) instead of calling promauto.NewGauge
repeatedly.
In `@pkg/sippyserver/server.go`:
- Around line 201-215: The loop captures the loop variable lookback in the
goroutine causing incorrect values; fix by passing lookback into the goroutine
or binding it to a new local variable before launching the goroutine so each
closure gets its own copy; update the anonymous function launched in the for
loop that calls api.GetJobRunTestsCountByLookback to accept a lookback parameter
(or use lb := lookback then use lb inside) and use that local value when logging
and when setting matViewUniqueNumberOfTests and matViewUniqueNumberOfJobRuns so
labels reflect the correct lookback.
🧹 Nitpick comments (1)
pkg/sippyserver/server.go (1)
228-244: LGTM with minor comment clarification suggestion.The logic correctly defers
prow_test_report_2d_matviewto run last. The comment on lines 228-229 mentions separating both 7d and 2d matviews, but only the 2d matview is actually isolated. Consider updating the comment to reflect the actual behavior.📝 Suggested comment update
- // separate prow_test_report_7d_matview and prow_test_report_2d_matview to try to cut down on - // CPU overload + // defer prow_test_report_2d_matview to run last to reduce CPU spikes var twoDayMatView *db.PostgresView
baf2f13 to
75c4ea0
Compare
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
75c4ea0 to
43c8e28
Compare
|
@neisw: This pull request references TRT-1989 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Scheduling required tests: |
|
|
||
| elapsed = time.Since(start) | ||
| log.WithField("elapsed", elapsed).Info("load and refresh complete") | ||
| loadMetricGauge.Set(float64(elapsed.Minutes())) |
There was a problem hiding this comment.
Should this be within the promPusher != nil conditional block?
There was a problem hiding this comment.
It definitely isn't required but could easily be moved.
| // separate prow_test_report_7d_matview and prow_test_report_2d_matview to try to cut down on | ||
| // CPU overload | ||
| var twoDayMatView *db.PostgresView | ||
| for _, pmv := range db.PostgresMatViews { | ||
| ch <- pmv.Name | ||
| if pmv.Name == "prow_test_report_2d_matview" { | ||
| twoDayMatViewTmp := pmv | ||
| twoDayMatView = &twoDayMatViewTmp | ||
| } else { | ||
| ch <- pmv.Name | ||
| } | ||
| } | ||
|
|
||
| // add the 2 day mat view so it runs last, it is possible we might need to wait for the 14 to complete and run this standalone | ||
| // but we don't want to do that if we don't have to | ||
| if twoDayMatView != nil { | ||
| ch <- twoDayMatView.Name |
There was a problem hiding this comment.
I think it would be a lot simpler to just reorder the matviews in the PostgresMatViews slice. Maybe just adding a comment that says that the 2d needs to be last along with it.
There was a problem hiding this comment.
Not an unreasonable ask..
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: neisw, smg247 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@neisw: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary by CodeRabbit