Changes from 23 commits
Commits
39 commits
194ed91
feat: databse query summary
hannahramadan Dec 2, 2025
6f7eb0e
Merge branch 'main' into db_query_summary
hannahramadan Dec 4, 2025
051059a
Add more test cases
hannahramadan Dec 5, 2025
5b8d528
Complex! But working
hannahramadan Dec 5, 2025
ec1b120
Add cache test
hannahramadan Dec 10, 2025
45545b9
Add Tokenzier documentation
hannahramadan Dec 11, 2025
a327ace
Update cache docs and tests
hannahramadan Dec 11, 2025
4480f6d
Use token array v struct
hannahramadan Dec 11, 2025
58b8d18
update documentation
hannahramadan Dec 12, 2025
4965a46
Merge branch 'open-telemetry:main' into db_query_summary
hannahramadan Jan 6, 2026
d39c6fb
update tokenizer
hannahramadan Jan 6, 2026
49843a0
Merge branch 'db_query_summary' of https://github.com/hannahramadan/o…
hannahramadan Jan 6, 2026
d91edb3
update parser
hannahramadan Jan 6, 2026
b34f17e
Test update
hannahramadan Jan 6, 2026
c654908
rubocop -a
hannahramadan Jan 7, 2026
f25fded
Rubocop updates
hannahramadan Jan 7, 2026
550b778
Rubocop offenses
hannahramadan Jan 9, 2026
2cd1db2
Add Benchmark to test group
hannahramadan Jan 9, 2026
b80b5d2
Give each instrumentation its own cache
hannahramadan Jan 9, 2026
b5454aa
Add 255 max length boundary
hannahramadan Jan 9, 2026
c72f16a
Add CALL keyword
hannahramadan Jan 9, 2026
5ed3f64
Update README
hannahramadan Jan 9, 2026
83b3727
Fix README linter complaints
hannahramadan Jan 9, 2026
ae29dba
Update namespaces
hannahramadan Jan 13, 2026
859d854
Merge branch 'main' into db_query_summary
hannahramadan Jan 13, 2026
6c9f56c
rubocop
hannahramadan Jan 13, 2026
e3d4b95
Remove unused code
hannahramadan Jan 13, 2026
a1e982a
Refactor complex method
hannahramadan Jan 13, 2026
088d3e3
rubocop
hannahramadan Jan 13, 2026
f4a0db7
Code cleaning
hannahramadan Jan 14, 2026
8ed5750
More modules! Smaller Parser class
hannahramadan Jan 16, 2026
488cf1d
Clean up fast path
hannahramadan Jan 16, 2026
486a4a8
Update tokenizer
hannahramadan Jan 16, 2026
9b8e039
Small tweaks
hannahramadan Jan 16, 2026
2ffa47c
README updates
hannahramadan Jan 17, 2026
605bee9
linter
hannahramadan Jan 17, 2026
ca73e10
linter round2
hannahramadan Jan 17, 2026
4b79eed
Update README
hannahramadan Jan 17, 2026
6f7d21d
Update README
hannahramadan Jan 17, 2026
1 change: 1 addition & 0 deletions helpers/sql-processor/Gemfile
@@ -21,5 +21,6 @@ group :test do
  if RUBY_VERSION >= '3.4'
    gem 'base64'
    gem 'mutex_m'
    gem 'benchmark'
  end
end
@@ -0,0 +1,46 @@
# frozen_string_literal: true

# Copyright The OpenTelemetry Authors
#
# SPDX-License-Identifier: Apache-2.0

require_relative 'query_summary/tokenizer'
require_relative 'query_summary/cache'
require_relative 'query_summary/parser'

module OpenTelemetry
  module Helpers
    # QuerySummary generates high-level summaries of SQL queries, made up of
    # key operations and table names.
    #
    # To use this in your instrumentation, create a Cache instance and pass it
    # to the generate_summary method:
    #
    # Example:
    #   cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 1000)
    #   summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(
    #     "SELECT * FROM users WHERE id = 1",
    #     cache: cache
    #   )
    #   # => "SELECT users"
    module QuerySummary
      module_function

      # Generates a high-level summary of a SQL query using the provided cache.
      #
      # @param query [String] The SQL query to summarize
      # @param cache [Cache] The cache instance to use for storing/retrieving summaries
      # @return [String] The query summary or 'UNKNOWN' if parsing fails
      #
      # @api public
      def generate_summary(query, cache:)
        cache.fetch(query) do
          tokens = Tokenizer.tokenize(query)
          Parser.build_summary_from_tokens(tokens)
        end
      rescue StandardError
        'UNKNOWN'
Review comment from the PR author (project member):

db.query.summary should be used as a database span name, but if the summary is not available, then the name should be {db.operation.name} {target}. We should keep that in mind when changing db instrumentation. If the return value of the summary is UNKNOWN, then we have more work to do to create the span name.

      end
    end
  end
end
@@ -0,0 +1,174 @@
# Query Summary Module

The query summary module transforms SQL queries into high-level summaries for OpenTelemetry span attributes.

```ruby
cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 1000)

summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(
  "SELECT * FROM users WHERE id = 1",
  cache: cache
)
puts summary # => "SELECT users"

query = "SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id"
summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(query, cache: cache)
puts summary # => "SELECT users orders"
```

## Examples

| Input SQL | Output Summary |
| --------- | -------------- |
| `SELECT * FROM users WHERE id = 1` | `SELECT users` |
| `INSERT INTO orders VALUES (1, 2, 3)` | `INSERT orders` |
| `CREATE TABLE products (id INT)` | `CREATE TABLE products` |
| `EXEC GetUserStats @userId = 123` | `EXEC GetUserStats` |
| `CALL update_user_profile(456, 'new@email.com')` | `CALL update_user_profile` |
| `SELECT * FROM table1 UNION SELECT * FROM table2` | `SELECT table1 table2` |

## Complex Examples

| Complex SQL | Summary | Why Useful |
| ----------- | ------- | ---------- |
| `SELECT u.*, p.name FROM users u LEFT JOIN profiles p ON u.id=p.user_id WHERE u.created_at > '2023-01-01' AND p.active = 1` | `SELECT users profiles` | Shows JOIN handling, removes sensitive data |
| `INSERT INTO audit_logs (user_id, action, details, created_at) VALUES (?, ?, ?, NOW())` | `INSERT audit_logs` | Removes parameter placeholders |
| `CREATE PROCEDURE update_user(id INT) AS BEGIN UPDATE users SET last_seen=NOW() WHERE id=id; END` | `CREATE PROCEDURE update_user` | Handles stored procedures |
| `CALL generate_monthly_report(2023, 12, 'summary', @user_id)` | `CALL generate_monthly_report` | Stored procedure calls with parameters removed |
| `SELECT * FROM users UNION SELECT * FROM customers` | `SELECT users customers` | UNION queries consolidated (table names merged) |
| `SELECT * FROM orders UNION ALL SELECT * FROM returns` | `SELECT orders UNION ALL SELECT returns` | UNION ALL preserved (not consolidated like regular UNION) |
| `WITH recent AS (SELECT * FROM orders WHERE date > ?) SELECT r.*, u.name FROM recent r JOIN users u` | `WITH recent SELECT orders SELECT recent users` | Handles CTEs and subqueries |
| `UPDATE users SET status = 'active' WHERE id IN (1,2,3,4,5)` | `UPDATE users` | Removes literal values |
| `DELETE FROM sessions WHERE expires_at < NOW() AND user_id = ?` | `DELETE sessions` | Removes sensitive conditions |

## Configuration

Each instrumentation creates its own cache instance:

```ruby
pg_cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 2000)

summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(query, cache: pg_cache)
```

**Configuration in OpenTelemetry instrumentations:**
```ruby
OpenTelemetry::SDK.configure do |c|
  c.use 'OpenTelemetry::Instrumentation::PG', {
    query_summary_cache_size: 5000,
    query_summary_enabled: true
  }
end
```

## Architecture

The module uses a three-stage pipeline to transform SQL queries into summaries:

```text
SQL Query → Tokenizer → Parser → Summary
Cache (stores results)
```

## Tokenizer Details

Breaks SQL strings into structured tokens using `StringScanner` for parsing.

**Example:**
```ruby
# Input query
"SELECT * FROM users WHERE id = 1"

# Resulting tokens, as [type, value] pairs
[
  [:keyword, "SELECT"],
  [:operator, "*"],
  [:keyword, "FROM"],
  [:identifier, "users"],
  [:keyword, "WHERE"],
  [:identifier, "id"],
  [:operator, "="],
  [:numeric, "1"]
]
```

**Token Types:**
- `:keyword` - SQL keywords (SELECT, FROM, WHERE, EXEC, EXECUTE, CALL, CREATE, etc.)
- `:identifier` - Table/column names, procedure names
- `:quoted_identifier` - Quoted names (`` `table` ``, `[table]`, `"table"`)
- `:operator` - SQL operators (=, <, >, +, -, *, (, ), ;)
- `:numeric` - Numbers (integers, decimals, scientific notation, hex)
- `:string` - String literals (automatically filtered for privacy)
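
As a rough illustration of how `StringScanner` can produce `[type, value]` pairs like those above, here is a minimal sketch. It is not the module's actual tokenizer: the `SketchTokenizer` name, its short keyword list, and its regular expressions are assumptions made for this example, and it ignores quoted identifiers, comments, and hex or scientific notation.

```ruby
require 'strscan'

# Hypothetical, simplified tokenizer used only for illustration.
module SketchTokenizer
  # Assumed keyword subset; the real tokenizer recognizes more keywords.
  KEYWORDS = %w[SELECT FROM WHERE INSERT INTO UPDATE DELETE JOIN CREATE TABLE EXEC CALL].freeze

  module_function

  def tokenize(sql)
    scanner = StringScanner.new(sql)
    tokens = []
    until scanner.eos?
      # Whitespace separates tokens but is never emitted.
      next if scanner.skip(/\s+/)

      if (text = scanner.scan(/'(?:[^']|'')*'/))
        # Single-quoted string literal, with '' as an escaped quote.
        tokens << [:string, text]
      elsif (text = scanner.scan(/\d+(?:\.\d+)?/))
        tokens << [:numeric, text]
      elsif (text = scanner.scan(/[A-Za-z_][A-Za-z0-9_]*/))
        type = KEYWORDS.include?(text.upcase) ? :keyword : :identifier
        tokens << [type, text]
      else
        # Anything else is treated as a one-character operator.
        tokens << [:operator, scanner.getch]
      end
    end
    tokens
  end
end

SketchTokenizer.tokenize('SELECT * FROM users WHERE id = 1')
# => [[:keyword, "SELECT"], [:operator, "*"], [:keyword, "FROM"], [:identifier, "users"],
#     [:keyword, "WHERE"], [:identifier, "id"], [:operator, "="], [:numeric, "1"]]
```

Emitting plain two-element arrays keeps each token cheap to allocate, which matches the "token array" shape shown in the example above.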

## Parser Details

Uses a **finite state machine** with three states to extract operations and table names from tokens.

### Parser State Flow
```text
SQL Token → PARSING → FROM/JOIN → EXPECT_COLLECTION → table names
↓ ↓ ↓
Operations WHERE/END Back to PARSING
AS BEGIN → DDL_BODY (skip everything)
```

**States:**
- **PARSING** - Default state, looking for SQL operations (SELECT, CREATE, EXEC, CALL, etc.)
- **EXPECT_COLLECTION** - Collecting table names after FROM, INTO, JOIN
- **DDL_BODY** - Inside stored procedures/triggers, skips all tokens

**Process:**
1. Process tokens sequentially with finite state machine
2. Apply state transitions based on SQL keywords
3. Collect operations and table names with lookahead for complex patterns
4. Consolidate UNION queries in post-processing
5. Truncate final summary at 255 characters (OpenTelemetry spec compliance)
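
The process above can be sketched with a reduced state machine. This is an illustrative assumption, not the PR's `Parser`: the `SketchParser` module and its keyword lists are hypothetical, it models only the PARSING and EXPECT_COLLECTION states, and it omits DDL_BODY, lookahead, and UNION consolidation.

```ruby
# Hypothetical two-state reduction of the state machine described above.
module SketchParser
  OPERATIONS = %w[SELECT INSERT UPDATE DELETE].freeze # assumed subset
  TABLE_MARKERS = %w[FROM INTO JOIN].freeze
  MAX_LENGTH = 255

  module_function

  # tokens is an array of [type, value] pairs, as produced by a tokenizer.
  def summarize(tokens)
    state = :parsing
    parts = []

    tokens.each do |type, value|
      upcased = value.upcase
      case state
      when :parsing
        if type == :keyword && OPERATIONS.include?(upcased)
          parts << upcased
          # UPDATE is followed directly by its table name.
          state = :expect_collection if upcased == 'UPDATE'
        elsif type == :keyword && TABLE_MARKERS.include?(upcased)
          state = :expect_collection
        end
      when :expect_collection
        # The first identifier after FROM/INTO/JOIN is taken as the table name.
        if type == :identifier
          parts << value
          state = :parsing
        end
      end
    end

    # Keep at most the first 255 characters of the summary.
    parts.join(' ')[0, MAX_LENGTH]
  end
end

tokens = [
  [:keyword, 'SELECT'], [:operator, '*'], [:keyword, 'FROM'], [:identifier, 'users'],
  [:keyword, 'WHERE'], [:identifier, 'id'], [:operator, '='], [:numeric, '1']
]
SketchParser.summarize(tokens) # => "SELECT users"
```

Because literals and conditions never match a transition, they are dropped without any explicit filtering step.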

## Cache Details

LRU cache that stores generated summaries to avoid reprocessing identical queries.

**Features:**
- Default size: 1000 entries
- Mutex synchronization for thread safety
- LRU eviction: the least recently used entry is removed when the cache reaches its size limit
- `fetch(key) { block }` interface

**Behavior:**
- Cache hits return stored values without executing the block
- Cache misses execute the block and store the result
- When `configure` sets a size smaller than the current number of entries, the cache is cleared completely
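
The eviction order can be illustrated with a small stand-in class. `SketchLru` is a hypothetical, single-threaded simplification written for this example (it has no `Mutex`), relying on the fact that a Ruby `Hash` preserves insertion order, so the first key is always the least recently used one.

```ruby
# Hypothetical, single-threaded stand-in used only to show eviction order.
class SketchLru
  def initialize(size:)
    @size = size
    @store = {}
  end

  def fetch(key)
    if @store.key?(key)
      # Hit: delete and re-insert so the key moves to the most recent position.
      value = @store.delete(key)
      @store[key] = value
      return value
    end

    # Miss: compute the value, evict the oldest entry if at capacity, then store.
    value = yield
    @store.shift if @store.size >= @size
    @store[key] = value
  end

  def keys
    @store.keys
  end
end

cache = SketchLru.new(size: 2)
cache.fetch('a') { 'A' }
cache.fetch('b') { 'B' }
cache.fetch('a') { 'unused' } # hit: 'a' becomes the most recently used key
cache.fetch('c') { 'C' }      # miss at capacity: 'b' is evicted
cache.keys # => ["a", "c"]
```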

## Integration

Each database instrumentation needs two new configuration options: `query_summary_cache_size` and `query_summary_enabled`.

```ruby
class Instrumentation < OpenTelemetry::Instrumentation::Base
  option :query_summary_cache_size, default: 1000, validate: :integer
  option :query_summary_enabled, default: true, validate: :boolean

  install do |config|
    require_dependencies
    initialize_query_summary_cache(config) if config[:query_summary_enabled]
    patch_client
  end

  class << self
    attr_reader :query_summary_cache
  end

  private

  def initialize_query_summary_cache(config)
    require 'opentelemetry-helpers-sql-processor'
    self.class.instance_variable_set(
      :@query_summary_cache,
      OpenTelemetry::Helpers::QuerySummary::Cache.new(size: config[:query_summary_cache_size])
    )
  rescue LoadError
    OpenTelemetry.logger.debug('Query summary helper not available')
  end
end
```

The query summary populates the `db.query.summary` attribute, which is a recommended attribute. When `db.query.summary` is available, it should also be used as the span's name.
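
One way an instrumentation might choose a span name is sketched below. The `sketch_span_name` helper, its parameters, and its `nil` return are assumptions for illustration and are not part of this PR. It prefers the summary and, when the summary is missing or `'UNKNOWN'`, falls back to `{db.operation.name} {target}`.

```ruby
# Hypothetical helper; the name and signature are assumptions for illustration.
def sketch_span_name(summary, operation_name: nil, target: nil)
  # Use the summary when one was produced.
  return summary unless summary.nil? || summary.empty? || summary == 'UNKNOWN'

  # Otherwise fall back to "{db.operation.name} {target}", skipping missing parts.
  fallback = [operation_name, target].compact.join(' ')
  fallback.empty? ? nil : fallback
end

sketch_span_name('SELECT users')                                     # => "SELECT users"
sketch_span_name('UNKNOWN', operation_name: 'SELECT', target: 'users') # => "SELECT users"
sketch_span_name('UNKNOWN')                                          # => nil
```

Returning `nil` when nothing is available leaves the final default to the caller, since that choice depends on the individual instrumentation.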
@@ -0,0 +1,80 @@
# frozen_string_literal: true

# Copyright The OpenTelemetry Authors
#
# SPDX-License-Identifier: Apache-2.0

module OpenTelemetry
  module Helpers
    module QuerySummary
      # Cache provides thread-safe LRU caching for query summaries.
      #
      # Stores generated query summaries to avoid reprocessing identical queries.
      # When cache reaches maximum size, least recently used entries are evicted first (LRU).
      # Uses mutex synchronization for thread safety in concurrent applications.
      #
      # @example Basic usage
      #   cache = Cache.new
      #   cache.fetch("SELECT * FROM users") { "SELECT users" } # => "SELECT users"
      #   cache.fetch("SELECT * FROM users") { "won't execute" } # => "SELECT users" (cached)
      #
      # @example Custom size and configuration
      #   cache = Cache.new(size: 500)
      #   cache.configure(size: 100) # Resize and clear if needed
      class Cache
        DEFAULT_SIZE = 1000

        def initialize(size: DEFAULT_SIZE)
          @cache = {}
          @cache_mutex = Mutex.new
          @cache_size = size
        end

        # Retrieves cached value or computes and caches new value.
        #
        # @param key [Object] Cache key (typically SQL query string)
        # @yield Block to execute if key not found in cache
        # @return [Object] Cached value or result of block execution
        def fetch(key)
          @cache_mutex.synchronize do
            if @cache.key?(key)
              # Move to end (most recently used) by deleting and re-inserting
              value = @cache.delete(key)
              @cache[key] = value
              return value
            end

            result = yield
            evict_if_needed
            @cache[key] = result
            result
          end
        end

        # Configures the cache with a new size limit.
        #
        # If the new size is smaller than the current number of cached entries,
        # the cache is cleared completely to ensure it fits within the new limit.
        #
        # @param size [Integer] Maximum number of entries to cache (default: 1000)
        # @return [void]
        def configure(size: DEFAULT_SIZE)
          @cache_mutex.synchronize do
            @cache_size = size
            @cache.clear if @cache.size > size
          end
        end

        private

        def clear
          @cache.clear
        end

        def evict_if_needed
          @cache.shift if @cache.size >= @cache_size
        end
      end
    end
  end
end