-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Add OpenTelemetry Native Metrics Support #3637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Elena Kolevska <elena@kolevska.com>
Signed-off-by: Elena Kolevska <elena@kolevska.com>
5b7ff5f to
168f918
Compare
0940dd5 to
b8afd7f
Compare
…quest metric type from UpAndDownCounter to Async gauge.
ndyakov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
submitting partial review
internal/pool/pool.go
Outdated
| // Global callback for connection state changes (set by otel package) | ||
| connectionStateChangeCallback func(ctx context.Context, cn *Conn, fromState, toState string) | ||
|
|
||
| // Global callback for connection creation time (set by otel package) | ||
| connectionCreateTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn) | ||
|
|
||
| // Global callback for connection relaxed timeout changes (set by otel package) | ||
| // Parameters: ctx, delta (+1/-1), cn, poolName, notificationType | ||
| connectionRelaxedTimeoutCallback func(ctx context.Context, delta int, cn *Conn, poolName, notificationType string) | ||
|
|
||
| // Global callback for connection handoff (set by otel package) | ||
| // Parameters: ctx, cn, poolName | ||
| connectionHandoffCallback func(ctx context.Context, cn *Conn, poolName string) | ||
|
|
||
| // Global callback for error tracking (set by otel package) | ||
| // Parameters: ctx, errorType, cn, statusCode, isInternal, retryAttempts | ||
| errorCallback func(ctx context.Context, errorType string, cn *Conn, statusCode string, isInternal bool, retryAttempts int) | ||
|
|
||
| // Global callback for maintenance notifications (set by otel package) | ||
| // Parameters: ctx, cn, notificationType | ||
| maintenanceNotificationCallback func(ctx context.Context, cn *Conn, notificationType string) | ||
|
|
||
| // Global callback for connection wait time (set by otel package) | ||
| // Parameters: ctx, duration, cn | ||
| connectionWaitTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn) | ||
|
|
||
| // Global callback for connection use time (set by otel package) | ||
| // Parameters: ctx, duration, cn | ||
| connectionUseTimeCallback func(ctx context.Context, duration time.Duration, cn *Conn) | ||
|
|
||
| // Global callback for connection timeouts (set by otel package) | ||
| // Parameters: ctx, cn, timeoutType | ||
| connectionTimeoutCallback func(ctx context.Context, cn *Conn, timeoutType string) | ||
|
|
||
| // Global callback for connection closed (set by otel package) | ||
| // Parameters: ctx, cn, reason, err | ||
| connectionClosedCallback func(ctx context.Context, cn *Conn, reason string, err error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This global callbacks are not thread safe. Someone may set (or set nil - remove) the callback while a check on its existence is happening. In the best case scenario - we will end up having undefined behaviour, worst case - nil pointer panics. We should fix this with either a global callback mutex or atomic values here. I do think a RWMutex will be an easier and cleaner approach.
ndyakov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left two comments that can be considered for all get/set operations. other than that the thread safety looks ok if we are willing to accept that there is a time window for which getters may have returned an old value. can we make sure if we remove the telemetry, this won't break the app? In the case when the old value is nil I assume we can accept a single event not being tracked.
internal/pool/pool.go
Outdated
| func getConnectionStateChangeCallback() func(ctx context.Context, cn *Conn, fromState, toState string) { | ||
| callbackMu.RLock() | ||
| cb := connectionStateChangeCallback | ||
| callbackMu.RUnlock() | ||
| return cb | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a slight chance that connectionStateChangeCallback can be changed after the runlick but before the use. If we agree that this timeframe is something we are willing to tolerate, this is fine.
internal/pool/pool.go
Outdated
| func SetConnectionTimeoutCallback(fn func(ctx context.Context, cn *Conn, timeoutType string)) { | ||
| callbackMu.Lock() | ||
| connectionTimeoutCallback = fn | ||
| callbackMu.Unlock() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are setting multiple callbacks (as we should in most cases) wouldn't it be better to have a single lock -> set all callbacks -> unlock block for them?
extra/redisotel-native/redisotel.go
Outdated
| } | ||
|
|
||
| // Add pool name | ||
| baseAttrs = append(baseAttrs, attribute.String("db.client.connection.pool.name", poolInfo.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to see all attributes defined as constants rather than using them as strings across multiple places/files.
ndyakov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall looks good, can merge as is, left couple of improvements that we can address
| func isTimeoutError(err error) bool { | ||
| if err == nil { | ||
| return false | ||
| } | ||
|
|
||
| // Check for net.Error with Timeout() method (standard way) | ||
| if netErr, ok := err.(net.Error); ok { | ||
| return netErr.Timeout() | ||
| } | ||
|
|
||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the internal implementation that we have for detecting timeout errors:
func isTimeoutError(err error) (isTimeout bool, hasTimeoutFlag bool) {
// Check for timeoutError interface (works with wrapped errors)
var te timeoutError
if errors.As(err, &te) {
return true, te.Timeout()
}
// Check for net.Error specifically (common case for network timeouts)
var netErr net.Error
if errors.As(err, &netErr) {
return true, netErr.Timeout()
}
return false, false
}
It could be other type of error (not only net.Error) that has Timeout() method.
| strings.Contains(errStr, "dns") || | ||
| strings.Contains(errStr, "lookup") || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the actual error message is something like dns lookup let's have them together. I can see how lookup can be part of another error type.
| return "other" | ||
| } | ||
|
|
||
| func getErrorCategoryFromType(errorType string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are both getErrorCategoryFromType and getErrorCategory needed? Can we combine them since they are doing pretty much the same thing?
| ) | ||
| } | ||
| connectionCreateTime, err = meter.Float64Histogram( | ||
| "db.client.connection.create_time", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have this as const?
| } | ||
|
|
||
| connectionRelaxedTimeout, err = meter.Int64UpDownCounter( | ||
| "redis.client.connection.relaxed_timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have this as const?
|
|
||
| // Record relaxed timeout metric (post-handoff) | ||
| if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil { | ||
| relaxedTimeoutCallback(ctx, 1, conn, "main", "HANDOFF") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"main" as const
| // successfully completed the handoff, no retry needed and no error | ||
| // Notify metrics: connection handoff succeeded | ||
| if handoffCallback := pool.GetMetricConnectionHandoffCallback(); handoffCallback != nil { | ||
| handoffCallback(ctx, conn, "main") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, let's have "main" as const
|
|
||
| // Record relaxed timeout metric | ||
| if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil { | ||
| relaxedTimeoutCallback(ctx, 1, conn, "main", "MIGRATING") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"main" as const
|
|
||
| // Record relaxed timeout metric | ||
| if relaxedTimeoutCallback := pool.GetMetricConnectionRelaxedTimeoutCallback(); relaxedTimeoutCallback != nil { | ||
| relaxedTimeoutCallback(ctx, 1, conn, "main", "FAILING_OVER") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"main" as const
| // generateUniqueID generates a short unique identifier for pool names. | ||
| func generateUniqueID() string { | ||
| b := make([]byte, 4) | ||
| if _, err := rand.Read(b); err != nil { | ||
| return "" | ||
| } | ||
| return hex.EncodeToString(b) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to use global auto increment so we know which pool is it?
This PR implements Observability (Metrics) for go-redis, adding comprehensive OpenTelemetry metrics support following the OpenTelemetry Database Client Semantic Conventions.
The implementation uses a Bridge Pattern to keep the core library dependency-free while providing optional, metrics instrumentation through the new extra/redisotel-native package.
What's Included:
Metric groups covering aspects of Redis operations: