feat: Add SQL Query Summary - Tokenizer, Parser, Cache #1918
Open
hannahramadan
wants to merge
39
commits into
open-telemetry:main
from
hannahramadan:db_query_summary
Changes from 23 commits
194ed91
feat: databse query summary
hannahramadan 6f7eb0e
Merge branch 'main' into db_query_summary
hannahramadan 051059a
Add more test cases
hannahramadan 5b8d528
Complex! But working
hannahramadan ec1b120
Add cache test
hannahramadan 45545b9
Add Tokenzier documentation
hannahramadan a327ace
Update cache docs and tests
hannahramadan 4480f6d
Use token array v struct
hannahramadan 58b8d18
update documentation
hannahramadan 4965a46
Merge branch 'open-telemetry:main' into db_query_summary
hannahramadan d39c6fb
update tokenizer
hannahramadan 49843a0
Merge branch 'db_query_summary' of https://github.com/hannahramadan/o…
hannahramadan d91edb3
update parser
hannahramadan b34f17e
Test update
hannahramadan c654908
rubocop -a
hannahramadan f25fded
Rubocop updates
hannahramadan 550b778
Rubocop offenses
hannahramadan 2cd1db2
Add Benchmark to test group
hannahramadan b80b5d2
Give each instrumentation its own cache
hannahramadan b5454aa
Add 255 max length boundary
hannahramadan c72f16a
Add CALL keyword
hannahramadan 5ed3f64
Update README
hannahramadan 83b3727
Fix README linter complaints
hannahramadan ae29dba
Update namespaces
hannahramadan 859d854
Merge branch 'main' into db_query_summary
hannahramadan 6c9f56c
rubocop
hannahramadan e3d4b95
Remove unused code
hannahramadan a1e982a
Refactor complex method
hannahramadan 088d3e3
rubocop
hannahramadan f4a0db7
Code cleaning
hannahramadan 8ed5750
More modules! Smaller Parser class
hannahramadan 488cf1d
Clean up fast path
hannahramadan 486a4a8
Update tokenizer
hannahramadan 9b8e039
Small tweaks
hannahramadan 2ffa47c
README updates
hannahramadan 605bee9
linter
hannahramadan ca73e10
linter round2
hannahramadan 4b79eed
Update README
hannahramadan 6f7d21d
Update README
hannahramadan
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,5 +21,6 @@ group :test do | |
| if RUBY_VERSION >= '3.4' | ||
| gem 'base64' | ||
| gem 'mutex_m' | ||
| gem 'benchmark' | ||
| end | ||
| end | ||
46 changes: 46 additions & 0 deletions
helpers/sql-processor/lib/opentelemetry/helpers/sql_processor/query_summary.rb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| # Copyright The OpenTelemetry Authors | ||
| # | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| require_relative 'query_summary/tokenizer' | ||
| require_relative 'query_summary/cache' | ||
| require_relative 'query_summary/parser' | ||
|
|
||
| module OpenTelemetry | ||
| module Helpers | ||
| # QuerySummary generates high-level summaries of SQL queries, made up of | ||
| # key operations and table names. | ||
| # | ||
| # To use this in your instrumentation, create a Cache instance and pass it | ||
| # to the generate_summary method: | ||
| # | ||
| # Example: | ||
| # cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 1000) | ||
| # summary = OpenTelemetry::Helpers::QuerySummary.generate_summary( | ||
| # "SELECT * FROM users WHERE id = 1", | ||
| # cache: cache | ||
| # ) | ||
| # # => "SELECT users" | ||
| module QuerySummary | ||
| module_function | ||
|
|
||
| # Generates a high-level summary of a SQL query using the provided cache. | ||
| # | ||
| # @param query [String] The SQL query to summarize | ||
| # @param cache [Cache] The cache instance to use for storing/retrieving summaries | ||
| # @return [String] The query summary or 'UNKNOWN' if parsing fails | ||
| # | ||
| # @api public | ||
| def generate_summary(query, cache:) | ||
| cache.fetch(query) do | ||
| tokens = Tokenizer.tokenize(query) | ||
| Parser.build_summary_from_tokens(tokens) | ||
| end | ||
| rescue StandardError | ||
| 'UNKNOWN' | ||
| end | ||
| end | ||
| end | ||
| end | ||
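One consequence of wrapping `cache.fetch` in a method-level `rescue` is that a failed summarization returns `'UNKNOWN'` without storing anything, so the same failing query is re-processed on every call. The sketch below illustrates that pattern only; `SketchCache` and `sketch_generate_summary` are hypothetical stand-ins, not the PR's `Cache`, `Tokenizer`, or `Parser`.

```ruby
# Hypothetical stand-in exposing the same fetch(key) { block } shape as the PR's Cache.
class SketchCache
  attr_reader :store

  def initialize
    @store = {}
  end

  # Returns the stored value, or runs the block and stores its result.
  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

# Hypothetical summarizer: the block body stands in for Tokenizer + Parser.
def sketch_generate_summary(query, cache:)
  cache.fetch(query) do
    raise ArgumentError, 'cannot summarize' unless query.is_a?(String)

    query.split.first.upcase
  end
rescue StandardError
  # Any failure inside the block surfaces here; nothing was cached for this key.
  'UNKNOWN'
end

cache = SketchCache.new
puts sketch_generate_summary('select * from users', cache: cache) # => SELECT
puts sketch_generate_summary(nil, cache: cache)                   # => UNKNOWN
puts cache.store.key?(nil)                                        # => false
```

If repeated failures on hot queries turn out to matter, caching the `'UNKNOWN'` result would be one option, at the cost of masking transient errors.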
174 changes: 174 additions & 0 deletions
...s/sql-processor/lib/opentelemetry/helpers/sql_processor/query_summary/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,174 @@ | ||
| # Query Summary Module | ||
|
|
||
| The query summary module transforms SQL queries into high-level summaries for OpenTelemetry span attributes. | ||
|
|
||
| ```ruby | ||
| cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 1000) | ||
|
|
||
| summary = OpenTelemetry::Helpers::QuerySummary.generate_summary( | ||
| "SELECT * FROM users WHERE id = 1", | ||
| cache: cache | ||
| ) | ||
| puts summary # => "SELECT users" | ||
|
|
||
| query = "SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id" | ||
| summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(query, cache: cache) | ||
| puts summary # => "SELECT users orders" | ||
| ``` | ||
|
|
||
| ## Examples | ||
|
|
||
| | Input SQL | Output Summary | | ||
| | --------- | -------------- | | ||
| | `SELECT * FROM users WHERE id = 1` | `SELECT users` | | ||
| | `INSERT INTO orders VALUES (1, 2, 3)` | `INSERT orders` | | ||
| | `CREATE TABLE products (id INT)` | `CREATE TABLE products` | | ||
| | `EXEC GetUserStats @userId = 123` | `EXEC GetUserStats` | | ||
| | `CALL update_user_profile(456, 'new@email.com')` | `CALL update_user_profile` | | ||
| | `SELECT * FROM table1 UNION SELECT * FROM table2` | `SELECT table1 table2` | | ||
|
|
||
| ## Complex Examples | ||
|
|
||
| | Complex SQL | Summary | Why Useful | | ||
| | ----------- | ------- | ---------- | | ||
| | `SELECT u.*, p.name FROM users u LEFT JOIN profiles p ON u.id=p.user_id WHERE u.created_at > '2023-01-01' AND p.active = 1` | `SELECT users profiles` | Shows JOIN handling, removes sensitive data | | ||
| | `INSERT INTO audit_logs (user_id, action, details, created_at) VALUES (?, ?, ?, NOW())` | `INSERT audit_logs` | Removes parameter placeholders | | ||
| | `CREATE PROCEDURE update_user(id INT) AS BEGIN UPDATE users SET last_seen=NOW() WHERE id=id; END` | `CREATE PROCEDURE update_user` | Handles stored procedures | | ||
| | `CALL generate_monthly_report(2023, 12, 'summary', @user_id)` | `CALL generate_monthly_report` | Stored procedure calls with parameters removed | | ||
| | `SELECT * FROM users UNION SELECT * FROM customers` | `SELECT users customers` | UNION queries consolidated (table names merged) | | ||
| | `SELECT * FROM orders UNION ALL SELECT * FROM returns` | `SELECT orders UNION ALL SELECT returns` | UNION ALL preserved (not consolidated like regular UNION) | | ||
| | `WITH recent AS (SELECT * FROM orders WHERE date > ?) SELECT r.*, u.name FROM recent r JOIN users u` | `WITH recent SELECT orders SELECT recent users` | Handles CTEs and subqueries | | ||
| | `UPDATE users SET status = 'active' WHERE id IN (1,2,3,4,5)` | `UPDATE users` | Removes literal values | | ||
| | `DELETE FROM sessions WHERE expires_at < NOW() AND user_id = ?` | `DELETE sessions` | Removes sensitive conditions | | ||
|
|
||
| ## Configuration | ||
|
|
||
| Each instrumentation creates its own cache instance: | ||
|
|
||
| ```ruby | ||
| pg_cache = OpenTelemetry::Helpers::QuerySummary::Cache.new(size: 2000) | ||
|
|
||
| summary = OpenTelemetry::Helpers::QuerySummary.generate_summary(query, cache: pg_cache) | ||
| ``` | ||
|
|
||
| **Configuration in OpenTelemetry instrumentations:** | ||
| ```ruby | ||
| OpenTelemetry::SDK.configure do |c| | ||
| c.use 'OpenTelemetry::Instrumentation::PG', { | ||
| query_summary_cache_size: 5000, | ||
| query_summary_enabled: true | ||
| } | ||
| end | ||
| ``` | ||
|
|
||
| ## Architecture | ||
|
|
||
| The module uses a three-stage pipeline to transform SQL queries into summaries: | ||
|
|
||
| ```text | ||
| SQL Query → Tokenizer → Parser → Summary | ||
| ↕ | ||
| Cache (stores results) | ||
| ``` | ||
|
|
||
| ## Tokenizer Details | ||
|
|
||
| Breaks SQL strings into structured tokens using `StringScanner` for parsing. | ||
|
|
||
| **Example:** | ||
| ```ruby | ||
| "SELECT * FROM users WHERE id = 1" | ||
|
|
||
| [ | ||
| [:keyword, "SELECT"], | ||
| [:operator, "*"], | ||
| [:keyword, "FROM"], | ||
| [:identifier, "users"], | ||
| [:keyword, "WHERE"], | ||
| [:identifier, "id"], | ||
| [:operator, "="], | ||
| [:numeric, "1"] | ||
| ] | ||
| ``` | ||
|
|
||
| **Token Types:** | ||
| - `:keyword` - SQL keywords (SELECT, FROM, WHERE, EXEC, EXECUTE, CALL, CREATE, etc.) | ||
| - `:identifier` - Table/column names, procedure names | ||
| - `:quoted_identifier` - Quoted names (`table`, [table], "table") | ||
| - `:operator` - SQL operators (=, <, >, +, -, *, (, ), ;) | ||
| - `:numeric` - Numbers (integers, decimals, scientific notation, hex) | ||
| - `:string` - String literals (automatically filtered for privacy) | ||
|
|
||
| ## Parser Details | ||
|
|
||
| Uses a **finite state machine** with three states to extract operations and table names from tokens. | ||
|
|
||
| ### Parser State Flow | ||
| ```text | ||
| SQL Token → PARSING → FROM/JOIN → EXPECT_COLLECTION → table names | ||
| ↓ ↓ ↓ | ||
| Operations WHERE/END Back to PARSING | ||
| ↓ | ||
| AS BEGIN → DDL_BODY (skip everything) | ||
| ``` | ||
|
|
||
| **States:** | ||
| - **PARSING** - Default state, looking for SQL operations (SELECT, CREATE, EXEC, CALL, etc.) | ||
| - **EXPECT_COLLECTION** - Collecting table names after FROM, INTO, JOIN | ||
| - **DDL_BODY** - Inside stored procedures/triggers, skips all tokens | ||
|
|
||
| **Process:** | ||
| 1. Process tokens sequentially with finite state machine | ||
| 2. Apply state transitions based on SQL keywords | ||
| 3. Collect operations and table names with lookahead for complex patterns | ||
| 4. Consolidate UNION queries in post-processing | ||
| 5. Truncate final summary at 255 characters (OpenTelemetry spec compliance) | ||
|
|
||
| ## Cache Details | ||
|
|
||
| LRU cache that stores generated summaries to avoid reprocessing identical queries. | ||
|
|
||
| **Features:** | ||
| - Default size: 1000 entries | ||
| - Mutex synchronization for thread safety | ||
| - LRU eviction: oldest entries removed when cache reaches size limit | ||
| - `fetch(key) { block }` interface | ||
|
|
||
| **Behavior:** | ||
| - Cache hits return stored values without executing the block | ||
| - Cache misses execute the block and store the result | ||
| - When resizing to a smaller size, cache is cleared completely | ||
|
|
||
| ## Integration | ||
|
|
||
| Each database instrumentation needs two new configuration options: `query_summary_cache_size` and `query_summary_enabled`. | ||
|
|
||
| ```ruby | ||
| class Instrumentation < OpenTelemetry::Instrumentation::Base | ||
| option :query_summary_cache_size, default: 1000, validate: :integer | ||
| option :query_summary_enabled, default: true, validate: :boolean | ||
|
|
||
| install do |config| | ||
| require_dependencies | ||
| initialize_query_summary_cache(config) if config[:query_summary_enabled] | ||
| patch_client | ||
| end | ||
|
|
||
| class << self | ||
| attr_reader :query_summary_cache | ||
| end | ||
|
|
||
| private | ||
|
|
||
| def initialize_query_summary_cache(config) | ||
| require 'opentelemetry-helpers-sql-processor' | ||
| self.class.instance_variable_set( | ||
| :@query_summary_cache, | ||
| OpenTelemetry::Helpers::QuerySummary::Cache.new(size: config[:query_summary_cache_size]) | ||
| ) | ||
| rescue LoadError | ||
| OpenTelemetry.logger.debug('Query summary helper not available') | ||
| end | ||
| end | ||
| ``` | ||
| The query summary populates the `db.query.summary` span attribute. When a summary is available, the semantic conventions recommend using it as the span name. |
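The Tokenizer Details and Parser Details sections of the README above can be illustrated with a much smaller sketch. It is not the PR's implementation: the keyword list is a tiny assumed subset, only two of the three parser states are modeled, and `sketch_tokenize` and `sketch_summary` are hypothetical names. It does use `StringScanner` and `[type, value]` token pairs as the README describes.

```ruby
require 'strscan'

# Assumed keyword subset for illustration; the real tokenizer recognizes more.
SKETCH_KEYWORDS = %w[SELECT FROM WHERE JOIN INSERT INTO UPDATE DELETE].freeze
SKETCH_OPERATIONS = %w[SELECT INSERT UPDATE DELETE].freeze
SKETCH_TABLE_INTRODUCERS = %w[FROM INTO JOIN UPDATE].freeze

# Splits SQL into [type, value] pairs.
def sketch_tokenize(sql)
  scanner = StringScanner.new(sql)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next
    elsif (text = scanner.scan(/'(?:[^']|'')*'/))
      tokens << [:string, text]
    elsif (text = scanner.scan(/\d+(?:\.\d+)?/))
      tokens << [:numeric, text]
    elsif (text = scanner.scan(/[A-Za-z_][A-Za-z0-9_]*/))
      type = SKETCH_KEYWORDS.include?(text.upcase) ? :keyword : :identifier
      tokens << [type, text]
    else
      # Anything else is treated as a single-character operator.
      tokens << [:operator, scanner.getch]
    end
  end
  tokens
end

# Two-state machine: :parsing looks for operations, :expect_collection
# takes the next identifier as a table name and returns to :parsing.
def sketch_summary(tokens)
  state = :parsing
  parts = []
  tokens.each do |type, value|
    if state == :parsing
      next unless type == :keyword

      word = value.upcase
      parts << word if SKETCH_OPERATIONS.include?(word)
      state = :expect_collection if SKETCH_TABLE_INTRODUCERS.include?(word)
    elsif type == :identifier
      parts << value
      state = :parsing
    end
  end
  parts.join(' ')[0, 255] # same 255-character cap the README mentions
end

query = 'SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id'
puts sketch_summary(sketch_tokenize(query)) # => SELECT users orders
```

String literals and numbers are tokenized but never copied into the summary, which is how literal values drop out of the result in this sketch. UNION consolidation, CTEs, quoted identifiers, and the `DDL_BODY` state are deliberately left out.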
80 changes: 80 additions & 0 deletions
helpers/sql-processor/lib/opentelemetry/helpers/sql_processor/query_summary/cache.rb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| # frozen_string_literal: true | ||
|
|
||
| # Copyright The OpenTelemetry Authors | ||
| # | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| module OpenTelemetry | ||
| module Helpers | ||
| module QuerySummary | ||
| # Cache provides thread-safe LRU caching for query summaries. | ||
| # | ||
| # Stores generated query summaries to avoid reprocessing identical queries. | ||
| # When cache reaches maximum size, least recently used entries are evicted first (LRU). | ||
| # Uses mutex synchronization for thread safety in concurrent applications. | ||
| # | ||
| # @example Basic usage | ||
| # cache = Cache.new | ||
| # cache.fetch("SELECT * FROM users") { "SELECT users" } # => "SELECT users" | ||
| # cache.fetch("SELECT * FROM users") { "won't execute" } # => "SELECT users" (cached) | ||
| # | ||
| # @example Custom size and configuration | ||
| # cache = Cache.new(size: 500) | ||
| # cache.configure(size: 100) # Resize and clear if needed | ||
| class Cache | ||
| DEFAULT_SIZE = 1000 | ||
|
|
||
| def initialize(size: DEFAULT_SIZE) | ||
| @cache = {} | ||
| @cache_mutex = Mutex.new | ||
| @cache_size = size | ||
| end | ||
|
|
||
| # Retrieves cached value or computes and caches new value. | ||
| # | ||
| # @param key [Object] Cache key (typically SQL query string) | ||
| # @yield Block to execute if key not found in cache | ||
| # @return [Object] Cached value or result of block execution | ||
| def fetch(key) | ||
| @cache_mutex.synchronize do | ||
| if @cache.key?(key) | ||
| # Move to end (most recently used) by deleting and re-inserting | ||
| value = @cache.delete(key) | ||
| @cache[key] = value | ||
| return value | ||
| end | ||
|
|
||
| result = yield | ||
| evict_if_needed | ||
| @cache[key] = result | ||
| result | ||
| end | ||
| end | ||
|
|
||
| # Configures the cache with a new size limit. | ||
| # | ||
| # If the new size is smaller than the current number of cached entries, | ||
| # the cache is cleared completely to ensure it fits within the new limit. | ||
| # | ||
| # @param size [Integer] Maximum number of entries to cache (default: 1000) | ||
| # @return [void] | ||
| def configure(size: DEFAULT_SIZE) | ||
| @cache_mutex.synchronize do | ||
| @cache_size = size | ||
| @cache.clear if @cache.size > size | ||
| end | ||
| end | ||
|
|
||
| private | ||
|
|
||
| def clear | ||
| @cache.clear | ||
| end | ||
|
|
||
| def evict_if_needed | ||
| @cache.shift if @cache.size >= @cache_size | ||
| end | ||
| end | ||
| end | ||
| end | ||
| end |
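The eviction order in `cache.rb` relies on Ruby's `Hash` preserving insertion order: deleting and re-inserting a key moves it to the end, and `Hash#shift` removes the first (least recently used) entry. A minimal sketch of that mechanism, using a plain `Hash` and a hypothetical `lru_touch` helper without the mutex:

```ruby
# Hypothetical helper mirroring the hit/miss branches of Cache#fetch on a bare Hash.
def lru_touch(store, key, value, max_size)
  if store.key?(key)
    # Hit: delete and re-insert so the key becomes the most recently used.
    existing = store.delete(key)
    store[key] = existing
  else
    # Miss: drop the oldest entry first if the store is already full.
    store.shift if store.size >= max_size
    store[key] = value
  end
  store[key]
end

store = {}
lru_touch(store, 'q1', 'SELECT users', 2)
lru_touch(store, 'q2', 'SELECT orders', 2)
lru_touch(store, 'q1', 'ignored', 2)         # hit: q1 moves behind q2
lru_touch(store, 'q3', 'DELETE sessions', 2) # miss at capacity: q2 is evicted
puts store.keys.inspect # => ["q1", "q3"]
```

Because the hit path refreshes recency, `q2` is evicted rather than `q1` even though `q1` was inserted first.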
`db.query.summary` should be used as a database span name, but if the summary is not available, then the name should be `{db.operation.name} {target}`. We should keep that in mind when changing db instrumentation. If the return value of the summary is `UNKNOWN`, then we have more work to do to create the span name.
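The fallback described in this comment could look roughly like the sketch below. `sketch_span_name` and its keyword arguments are hypothetical and not part of the PR; the only detail taken from the diff is that `generate_summary` returns the string `'UNKNOWN'` on failure.

```ruby
UNKNOWN_SUMMARY = 'UNKNOWN'

# Hypothetical helper: prefer the query summary, otherwise fall back to
# "{db.operation.name} {target}" built from whatever parts are present.
def sketch_span_name(summary:, operation: nil, target: nil)
  usable = summary && !summary.empty? && summary != UNKNOWN_SUMMARY
  return summary if usable

  fallback = [operation, target].compact.join(' ')
  # Returns nil when nothing is available so the caller can pick a default.
  fallback.empty? ? nil : fallback
end

puts sketch_span_name(summary: 'SELECT users')
# => SELECT users
puts sketch_span_name(summary: 'UNKNOWN', operation: 'SELECT', target: 'users')
# => SELECT users
```

Treating `'UNKNOWN'` the same as a missing summary keeps the sentinel out of span names; what to use when neither an operation nor a target is known is left to the instrumentation.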