Skip to content

TRT-1989: add metrics and isolate test matview refreshes#3234

Merged
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
neisw:test-count-metrics
Feb 4, 2026
Merged

TRT-1989: add metrics and isolate test matview refreshes#3234
openshift-merge-bot[bot] merged 1 commit intoopenshift:mainfrom
neisw:test-count-metrics

Conversation

@neisw
Copy link
Contributor

@neisw neisw commented Feb 3, 2026

Summary by CodeRabbit

  • New Features
    • Added Prometheus metrics reporting for unique test counts and job runs across configurable lookback periods.
    • Enabled automatic push of these metrics and load/refresh duration to a Prometheus Pushgateway when configured.
    • Matview refresh ordering adjusted to reduce peak CPU load and improve refresh predictability.

@openshift-ci-robot
Copy link

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 3, 2026
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 3, 2026

Walkthrough

Adds a DB-backed API to count distinct job runs and tests by lookback, exposes those counts as Prometheus gauges during matview refresh, reorders matview execution for load smoothing, and records load/refresh duration to a Prometheus pushgateway.

Changes

Cohort / File(s) Summary
API Layer
pkg/api/tests.go
Adds GetJobRunTestsCountByLookback(dbc *db.DB, lookbackDays int) (int64, int64, error) to validate lookback, query distinct prow_job_run_id and test_id since truncated UTC midnight, and return job run and test counts; logs query duration and returns errors as (-1, -1, err).
Server Metrics & Refresh
pkg/sippyserver/server.go
Adds two Prometheus GaugeVecs sippy_matviews_unique_number_of_tests and sippy_matviews_unique_number_of_job_runs labeled by lookback_days; pre-refresh fetches counts via the new API for configured lookbacks (e.g., 14 and 9), sets gauges and pushes to pushgateway when configured; introduces a pre-refresh WaitGroup, adjusts matview enqueue order to run non-2d matviews first and 2d matview last, and pushes the new gauges during refresh.
Load Command Metrics
cmd/sippy/load.go
Adds Prometheus gauge sippy_data_load_refresh_minutes, registers it with a pushgateway pusher when SIPPY_PROMETHEUS_PUSHGATEWAY is set, records total load/refresh elapsed time to the gauge, and conditionally pushes metrics and logs push results.

Sequence Diagram(s)

sequenceDiagram
    participant S as SippyServer
    participant A as API (pkg/api)
    participant DB as Database
    participant P as Pushgateway/Prometheus
    participant M as MatViewWorkers

    S->>A: GetJobRunTestsCountByLookback(lookbackDays)
    A->>DB: SELECT COUNT(DISTINCT prow_job_run_id), COUNT(DISTINCT test_id) WHERE created_at > truncatedTime
    DB-->>A: (job_runs_count, test_ids_count)
    A-->>S: (job_runs_count, test_ids_count)
    S->>S: set GaugeVecs for each lookback_days
    S->>P: push gauges (optional if pushgateway configured)
    S->>M: enqueue non-2d matviews
    S->>M: enqueue 2d matview (last)
    M-->>S: refresh completed
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 3
❌ Failed checks (3 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Go Error Handling ⚠️ Warning Error values are ignored using blank identifier pattern without justification or logging in GetJobRunTestsCountByLookback calls Capture and handle errors from GetJobRunTestsCountByLookback explicitly with proper logging or remove calls if non-essential
Single Responsibility And Clear Naming ⚠️ Warning GetJobRunTestsCountByLookback returns bare (int64, int64, error) without named struct wrapper; callers must remember return value order instead of using named fields. Refactor to return public struct like JobRunTestCounts with named fields (JobRuns, Tests) instead of unpacking internal counts struct.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main changes: adding metrics and isolating test matview refreshes, which directly correspond to the primary work across all modified files.
Sql Injection Prevention ✅ Passed The GetJobRunTestsCountByLookback function uses GORM's parameterized query system with ? placeholders for the truncatedTime parameter, hardcoded table and column names, and validated lookbackDays input, preventing SQL injection vulnerabilities.
Excessive Css In React Should Use Styles ✅ Passed The custom check regarding excessive inline CSS in React components is not applicable to this pull request. The PR contains only backend Go code modifications with no React components or CSS styling present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from deads2k and deepsm007 February 3, 2026 18:21
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2026
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Added comprehensive metrics collection for tracking test run and test count metrics over customizable lookback periods, enabling better performance visibility and monitoring capabilities.

  • Enhanced system operation logging to include detailed elapsed timing information for refresh processes and operations.

  • Performance

  • Optimized internal refresh workflows to reduce CPU overhead and improve overall system efficiency during peak operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@pkg/api/tests.go`:
- Around line 263-290: GetJobRunTestsCountByLookback uses
time.Now().Add(...).Truncate(...) which is local-time dependent; change the
midnight calculation to use time.Date with UTC to get a reliable UTC midnight
start (e.g. build truncatedTime via time.Date(...).UTC()) and validate that
lookbackDays is positive (return an error for non-positive values) before
running the DB query; update references to truncatedTime and lookbackDays inside
the function accordingly.

In `@pkg/sippyserver/server.go`:
- Around line 309-315: The metric observation for totalRefreshMinutesMetric is
happening in RefreshData() after promPusher.Add() so the Pushgateway gets the
previous cycle's value; move the call to
totalRefreshMinutesMetric.Observe(float64(elapsed.Minutes())) into
refreshMaterializedViews() immediately before the code that calls
promPusher.Add(), and remove the redundant Observe call from RefreshData() so
the pushed metric reflects the just-completed refresh cycle.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Added metrics tracking for unique test and job-run counts over configurable lookback periods and new gauges to expose them.

  • Renamed timing metrics to report durations in minutes for clearer monitoring.

  • Performance

  • Reworked materialized view refresh ordering to reduce CPU spikes by running the longest refresh separately.

  • Improved elapsed-time logging for refresh operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Added metrics exposing unique test and job-run counts over configurable lookback periods.

  • Performance

  • Reordered materialized view refreshes so the longest-running view executes last to reduce CPU spikes.

  • Kept parallel refresh workers while isolating the heavy view run.

  • Consolidated and improved refresh completion logging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pkg/sippyserver/server.go`:
- Around line 133-141: Fix the grammar in the Prometheus metric Help texts for
matViewUniqueNumberOfTests and matViewUniqueNumberOfJobRuns by changing "the
based on lookback days" to "based on lookback days" in the GaugeOpts Help fields
so the descriptions read correctly.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Exposes new metrics for unique test and job-run counts over lookback periods (available as Prometheus gauges).

  • Performance

  • Reorders materialized view refreshes so the longest-running view executes last to reduce CPU spikes.

  • Retains parallel refresh workers while isolating the heavy view run.

  • Consolidates refresh completion logging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 3, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Adds Prometheus gauges exposing unique test and job-run counts over configurable lookback periods (labeled by lookback days).

  • Performance

  • Reorders materialized view refreshes so the longest-running view executes last to reduce CPU spikes.

  • Retains parallel refresh workers while isolating the heavy view run.

  • Consolidates refresh completion logging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e

@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Exposes Prometheus gauges for unique test and job-run counts over lookback windows.

  • Adds a Prometheus gauge reporting total data load time.

  • Performance

  • Reorders materialized view refreshes so the longest-running view runs last to reduce CPU spikes.

  • Keeps parallel refresh workers while isolating the heavy view run.

  • Consolidates refresh completion logging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e

@neisw neisw force-pushed the test-count-metrics branch from a3cb03a to baf2f13 Compare February 4, 2026 01:59
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features

  • Exposes Prometheus gauges for unique test and job-run counts by lookback windows.

  • Adds a Prometheus gauge reporting total data load/refresh time, pushed when configured.

  • Performance

  • Reorders materialized view refreshes so the longest-running view runs last to reduce CPU spikes.

  • Keeps two parallel refresh workers while isolating the heavy view run; pre-fetches lookback metrics before refresh.

  • Style

  • Minor formatting cleanup.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cmd/sippy/load.go`:
- Around line 316-319: The metric Gauge is being created inside the RunE
invocation (loadMetricGauge via promauto.NewGauge) which can panic on duplicate
registrations; move the gauge declaration to package scope (mirroring the
pattern used in server.go) and replace the in-function creation with uses of
that package-level variable, or alternatively create the gauge with
prometheus.NewGauge and register it once on a dedicated registry passed to the
pusher; update references in RunE to use the package-level loadMetricGauge (or
the single-registered gauge/registry) instead of calling promauto.NewGauge
repeatedly.

In `@pkg/sippyserver/server.go`:
- Around line 201-215: The loop captures the loop variable lookback in the
goroutine causing incorrect values; fix by passing lookback into the goroutine
or binding it to a new local variable before launching the goroutine so each
closure gets its own copy; update the anonymous function launched in the for
loop that calls api.GetJobRunTestsCountByLookback to accept a lookback parameter
(or use lb := lookback then use lb inside) and use that local value when logging
and when setting matViewUniqueNumberOfTests and matViewUniqueNumberOfJobRuns so
labels reflect the correct lookback.
🧹 Nitpick comments (1)
pkg/sippyserver/server.go (1)

228-244: LGTM with minor comment clarification suggestion.

The logic correctly defers prow_test_report_2d_matview to run last. The comment on lines 228-229 mentions separating both 7d and 2d matviews, but only the 2d matview is actually isolated. Consider updating the comment to reflect the actual behavior.

📝 Suggested comment update
-	// separate prow_test_report_7d_matview and prow_test_report_2d_matview to try to cut down on
-	// CPU overload
+	// defer prow_test_report_2d_matview to run last to reduce CPU spikes
 	var twoDayMatView *db.PostgresView

@neisw neisw force-pushed the test-count-metrics branch from baf2f13 to 75c4ea0 Compare February 4, 2026 02:09
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

New Features

  • Introduced Prometheus metrics to track unique test counts and job runs across different lookback periods
  • Added automatic load/refresh duration reporting to Prometheus pushgateway when configured
  • Enhanced system monitoring with new observability metrics for improved visibility into operations

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@neisw neisw force-pushed the test-count-metrics branch from 75c4ea0 to 43c8e28 Compare February 4, 2026 02:14
@openshift-ci-robot
Copy link

openshift-ci-robot commented Feb 4, 2026

@neisw: This pull request references TRT-1989 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary by CodeRabbit

  • New Features
  • Added Prometheus metrics reporting for unique test counts and job runs across configurable lookback periods.
  • Enabled automatic push of these metrics and load/refresh duration to a Prometheus Pushgateway when configured.
  • Matview refresh ordering adjusted to reduce peak CPU load and improve refresh predictability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

Scheduling required tests:
/test e2e


elapsed = time.Since(start)
log.WithField("elapsed", elapsed).Info("load and refresh complete")
loadMetricGauge.Set(float64(elapsed.Minutes()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be within the promPusher != nil conditional block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely isn't required but could easily be moved.

Comment on lines +228 to +243
// separate prow_test_report_7d_matview and prow_test_report_2d_matview to try to cut down on
// CPU overload
var twoDayMatView *db.PostgresView
for _, pmv := range db.PostgresMatViews {
ch <- pmv.Name
if pmv.Name == "prow_test_report_2d_matview" {
twoDayMatViewTmp := pmv
twoDayMatView = &twoDayMatViewTmp
} else {
ch <- pmv.Name
}
}

// add the 2 day mat view so it runs last, it is possible we might need to wait for the 14 to complete and run this standalone
// but we don't want to do that if we don't have to
if twoDayMatView != nil {
ch <- twoDayMatView.Name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be a lot simpler to just reorder the matviews in the PostgresMatViews slice. Maybe just adding a comment that says that the 2d needs to be last along with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an unreasonable ask..

@smg247
Copy link
Member

smg247 commented Feb 4, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 4, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: neisw, smg247

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 4, 2026

@neisw: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 94f73c6 into openshift:main Feb 4, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants