Skip to content

Conversation

@ofekshenawa
Copy link
Collaborator

@ofekshenawa ofekshenawa commented Dec 3, 2025

This PR implements Observability (Metrics) for go-redis, adding comprehensive OpenTelemetry metrics support following the OpenTelemetry Database Client Semantic Conventions.

The implementation uses a Bridge Pattern to keep the core library dependency-free while providing optional, metrics instrumentation through the new extra/redisotel-native package.

What's Included:
Metric groups covering aspects of Redis operations:

  • Command metrics: Operation duration with retry tracking
  • Connection basic: Connection count and creation time
  • Resiliency: Errors, handoffs, timeout relaxation
  • Connection advanced: Wait time and use time
  • Pubsub metrics: Published and received messages
  • Stream metrics: Processing duration and maintenance notifications

@ofekshenawa ofekshenawa marked this pull request as ready for review December 3, 2025 15:06
Copy link
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

submitting partial review

Comment on lines 35 to 71
// Global callback for connection state changes (set by otel package)
connectionStateChangeCallback func(ctx context.Context, cn *Conn, fromState, toState string)

// Global callback for connection creation time (set by otel package)
connectionCreateTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection relaxed timeout changes (set by otel package)
// Parameters: ctx, delta (+1/-1), cn, poolName, notificationType
connectionRelaxedTimeoutCallback func(ctx context.Context, delta int, cn *Conn, poolName, notificationType string)

// Global callback for connection handoff (set by otel package)
// Parameters: ctx, cn, poolName
connectionHandoffCallback func(ctx context.Context, cn *Conn, poolName string)

// Global callback for error tracking (set by otel package)
// Parameters: ctx, errorType, cn, statusCode, isInternal, retryAttempts
errorCallback func(ctx context.Context, errorType string, cn *Conn, statusCode string, isInternal bool, retryAttempts int)

// Global callback for maintenance notifications (set by otel package)
// Parameters: ctx, cn, notificationType
maintenanceNotificationCallback func(ctx context.Context, cn *Conn, notificationType string)

// Global callback for connection wait time (set by otel package)
// Parameters: ctx, duration, cn
connectionWaitTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection use time (set by otel package)
// Parameters: ctx, duration, cn
connectionUseTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn)

// Global callback for connection timeouts (set by otel package)
// Parameters: ctx, cn, timeoutType
connectionTimeoutCallback func(ctx context.Context, cn *Conn, timeoutType string)

// Global callback for connection closed (set by otel package)
// Parameters: ctx, cn, reason, err
connectionClosedCallback func(ctx context.Context, cn *Conn, reason string, err error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This global callbacks are not thread safe. Someone may set (or set nil - remove) the callback while a check on its existence is happening. In the best case scenario - we will end up having undefined behaviour, worst case - nil pointer panics. We should fix this with either a global callback mutex or atomic values here. I do think a RWMutex will be an easier and cleaner approach.

Copy link
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left two comments that can be considered for all get/set operations. other than that the thread safety looks ok if we are willing to accept that there is a time window for which getters may have returned an old value. can we make sure if we remove the telemetry, this won't break the app? In the case when the old value is nil I assume we can accept a single event not being tracked.

Comment on lines 98 to 103
func getConnectionStateChangeCallback() func(ctx context.Context, cn *Conn, fromState, toState string) {
callbackMu.RLock()
cb := connectionStateChangeCallback
callbackMu.RUnlock()
return cb
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a slight chance that connectionStateChangeCallback can be changed after the runlick but before the use. If we agree that this timeframe is something we are willing to tolerate, this is fine.

Comment on lines 207 to 211
func SetConnectionTimeoutCallback(fn func(ctx context.Context, cn *Conn, timeoutType string)) {
callbackMu.Lock()
connectionTimeoutCallback = fn
callbackMu.Unlock()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are setting multiple callbacks (as we should in most cases) wouldn't it be better to have a single lock -> set all callbacks -> unlock block for them?

vladvildanov
vladvildanov previously approved these changes Jan 23, 2026
}

// Add pool name
baseAttrs = append(baseAttrs, attribute.String("db.client.connection.pool.name", poolInfo.name))
Copy link
Contributor

@elena-kolevska elena-kolevska Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to see all attributes defined as constants rather than using them as strings across multiple places/files.

Copy link
Member

@ndyakov ndyakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, can merge as is, left couple of improvements that we can address

Comment on lines +352 to +363
func isTimeoutError(err error) bool {
if err == nil {
return false
}

// Check for net.Error with Timeout() method (standard way)
if netErr, ok := err.(net.Error); ok {
return netErr.Timeout()
}

return false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the internal implementation that we have for detecting timeout errors:

func isTimeoutError(err error) (isTimeout bool, hasTimeoutFlag bool) {
	// Check for timeoutError interface (works with wrapped errors)
	var te timeoutError
	if errors.As(err, &te) {
		return true, te.Timeout()
	}

	// Check for net.Error specifically (common case for network timeouts)
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true, netErr.Timeout()
	}

	return false, false
}

It could be other type of error (not only net.Error) that has Timeout() method.

Comment on lines +391 to +392
strings.Contains(errStr, "dns") ||
strings.Contains(errStr, "lookup") ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the actual error message is something like dns lookup let's have them together. I can see how lookup can be part of another error type.

return "other"
}

func getErrorCategoryFromType(errorType string) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are both getErrorCategoryFromType and getErrorCategory needed? Can we combine them since they are doing pretty much the same thing?

)
}
connectionCreateTime, err = meter.Float64Histogram(
"db.client.connection.create_time",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have this as const?

}

connectionRelaxedTimeout, err = meter.Int64UpDownCounter(
"redis.client.connection.relaxed_timeout",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have this as const?


// Record relaxed timeout metric (post-handoff)
if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil {
relaxedTimeoutCallback(ctx, 1, conn, "main", "HANDOFF")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"main" as const

// successfully completed the handoff, no retry needed and no error
// Notify metrics: connection handoff succeeded
if handoffCallback := pool.GetMetricConnectionHandoffCallback(); handoffCallback != nil {
handoffCallback(ctx, conn, "main")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, let's have "main" as const


// Record relaxed timeout metric
if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil {
relaxedTimeoutCallback(ctx, 1, conn, "main", "MIGRATING")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"main" as const


// Record relaxed timeout metric
if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil {
relaxedTimeoutCallback(ctx, 1, conn, "main", "FAILING_OVER")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"main" as const

Comment on lines +26 to +34
// generateUniqueID generates a short unique identifier for pool names.
func generateUniqueID() string {
b := make([]byte, 4)
if _, err := rand.Read(b); err != nil {
return ""
}
return hex.EncodeToString(b)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to use global auto increment so we know which pool is it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants