Handle NaN/Inf values in metric evaluation and roofline plotting #3123
base: develop
Conversation
Pull request overview
This pull request adds robust NaN/Inf value handling across the metric evaluation and roofline plotting pipeline to prevent analysis from crashing when encountering invalid numeric values. The changes span test files, utility functions, and core analysis modules.
Changes:
- Added NaN/Inf detection and warning emission in metric evaluation functions (parser.py and analysis_db.py)
- Introduced a sanitize_roofline_data() function to filter out invalid data points from roofline plots
- Modified roofline plotting to use index-based labels instead of kernel names after sanitization
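In spirit, the sanitization boils down to dropping any (AI, performance) pair that is not a finite number. A minimal standalone sketch of that idea, assuming illustrative helper names rather than the actual signatures in roofline_calc.py:

```python
import math


def is_valid(value) -> bool:
    """True for finite real numbers; False for NaN, +/-Inf, None, non-numerics."""
    try:
        return value is not None and math.isfinite(float(value))
    except (TypeError, ValueError):
        return False


def filter_points(ai_values, perf_values):
    """Keep only (AI, performance) pairs where both values are finite."""
    kept = [
        (a, p)
        for a, p in zip(ai_values, perf_values)
        if is_valid(a) and is_valid(p)
    ]
    return [a for a, _ in kept], [p for _, p in kept]
```

Points are dropped pairwise: if either coordinate is invalid, the whole point is removed, since a roofline marker needs both an x (AI) and a y (performance) value.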
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_utils.py | Comprehensive test coverage for the new sanitize_roofline_data() function: 17 test cases covering NaN/Inf handling, numpy types, and edge cases |
| tests/test_analyze_commands.py | Test coverage for NaN/Inf handling in evaluate() and MetricEvaluator.eval_expression(): 15 test cases |
| src/utils/roofline_calc.py | New sanitize_roofline_data() function to filter invalid roofline data points; contains a critical bug in its loop logic |
| src/utils/parser.py | Added NaN/Inf detection with warnings and an np.errstate context manager in MetricEvaluator.eval_expression() |
| src/rocprof_compute_analyze/analysis_db.py | Added NaN/Inf detection with warnings and an np.errstate context manager in the evaluate() method; moved the numpy import to follow PEP 8 ordering |
| src/roofline.py | Integrated the sanitization call before plotting; modified plotting code to use index-based labels (K0, K1, etc.) instead of kernel names |
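The np.errstate context manager mentioned in the table silences numpy's divide-by-zero and invalid-operation RuntimeWarnings during evaluation, so the resulting inf/NaN values can be inspected explicitly afterwards instead of surfacing as warnings mid-computation. A minimal illustration:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])

# Suppress numpy's divide/invalid RuntimeWarnings while evaluating;
# the resulting inf/NaN values are then checked explicitly.
with np.errstate(divide="ignore", invalid="ignore"):
    result = a / b

assert np.isinf(result[0])  # 1/0 -> inf
assert np.isnan(result[1])  # 0/0 -> nan
```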
Force-pushed from c2b5aa1 to d5ddfa0.
Add 33 tests covering sanitize_roofline_data, db_analysis.evaluate(), and MetricEvaluator.eval_expression() NaN/Inf detection and handling. Co-authored-by: Cursor <cursoragent@cursor.com>
Force-pushed from d5f9327 to f2cb567.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
```python
from unittest import mock

import numpy as np
import pandas as pd


def test_metric_evaluator_nan_inf_handling():
    """Test that MetricEvaluator.eval_expression() handles NaN, Inf, None, and NA."""
    from utils.parser import MetricEvaluator

    def make_evaluator(raw_pmc_df=None):
        if raw_pmc_df is None:
            raw_pmc_df = pd.DataFrame({"a": [10.0], "b": [2.0]})
        return MetricEvaluator(raw_pmc_df, {}, {})

    # Valid expression should evaluate correctly without warnings
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] + raw_pmc_df['b'][0]")
        assert result == 12.0
        mock_warn.assert_not_called()

    # Division by zero should produce inf and emit a warning
    evaluator = make_evaluator(pd.DataFrame({"a": [1.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert np.isinf(result)

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    inf_warnings = [c for c in warn_calls if "inf" in c]
    assert len(inf_warnings) > 0

    # 0/0 -> NaN -> detected as scalar NA -> returns "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [0.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert result == "N/A"

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    nan_warnings = [c for c in warn_calls if "NaN" in c]
    assert len(nan_warnings) > 0

    # pd.NA expression should return "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [pd.NA]}))
    with mock.patch("utils.parser.console_warning"):
        result = evaluator.eval_expression("raw_pmc_df['a'][0]")
        assert result == "N/A"

    # None expression should return "N/A"
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_debug"):
        result = evaluator.eval_expression("None")
        assert result == "N/A"
```
Copilot AI (Feb 10, 2026):
The new MetricEvaluator/analysis_db NaN/Inf tests don’t cover expressions that should return non-numeric scalars (e.g. CONCAT(...) / to_concat). Given the new np.isinf/np.isnan logic, adding a regression test that evaluates a CONCAT expression and asserts the string result is preserved would help prevent accidental breakage of existing metric expressions.
```python
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to inf - likely due to "
        "division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to NaN - likely due to "
        "invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return "N/A"
```
Copilot AI (Feb 10, 2026):
The np.isinf/np.isnan checks run for any scalar eval_result. This will raise TypeError for valid non-numeric scalar results (e.g. CONCAT(...)/to_concat returns a str), causing the expression to incorrectly fall into the exception handler and return "N/A". Guard these checks with a numeric type check (e.g., isinstance(eval_result, (int, float, np.number))) or wrap np.isinf/np.isnan in a try/except so string/scalar results aren’t treated as evaluation failures.
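One way to express the suggested isinstance guard, sketched as a standalone helper (the helper name and the warn callback are illustrative, not the project's actual API):

```python
import numpy as np


def warn_on_anomalies(eval_result, expr, warn):
    """Warn on inf/NaN scalars without tripping over non-numeric results.

    np.isinf()/np.isnan() raise TypeError on non-numeric input, so the
    checks are restricted to numeric scalars; string results (e.g. from a
    CONCAT-style expression) pass through untouched.
    """
    if isinstance(eval_result, (int, float, np.number)):
        if np.isinf(eval_result):
            warn(f"Expression '{expr}' evaluated to inf")
        elif np.isnan(eval_result):
            warn(f"Expression '{expr}' evaluated to NaN")
    return eval_result
```

With this guard in place, a string result never reaches np.isinf/np.isnan, so it cannot fall into the generic exception handler and be reported as "N/A".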
```python
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to inf - likely "
        "due to division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to NaN - likely "
        "due to invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return None
```
Copilot AI (Feb 10, 2026):
np.isnan(eval_result) is executed for any scalar result. If an expression legitimately evaluates to a non-numeric scalar (e.g. to_concat(...) returns a str), np.isnan will raise TypeError and the code will fall into the generic exception handler, returning None and logging a misleading failure. Restrict the isinf/isnan checks to numeric scalars (e.g., isinstance(eval_result, (int, float, np.number))) or catch TypeError around these checks so string results remain supported.
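The alternative mentioned here, catching TypeError around the check, might look like this standalone sketch (the helper name is illustrative):

```python
import numpy as np


def is_nan_scalar(eval_result) -> bool:
    """NaN check that tolerates non-numeric scalars.

    Wrapping np.isnan in try/except keeps string results (e.g. from
    to_concat(...)) from being routed into the generic error handler.
    """
    try:
        return bool(np.isscalar(eval_result) and np.isnan(eval_result))
    except TypeError:
        # np.isnan raises TypeError for strings and other non-numerics;
        # treat those as "not NaN" rather than as evaluation failures.
        return False
```

Compared with the isinstance guard, this variant also handles exotic numeric-like types without enumerating them, at the cost of relying on exception control flow.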
```python
# Copy kernel names as-is initially
kernel_names = ai_data.get("kernelNames", [])

# Process each AI cache level (ai_hbm, ai_l2, ai_l1)
for key in ai_data.keys():
    if not key.startswith("ai_"):
        continue

    ai_values = ai_data[key][0] if len(ai_data[key]) > 0 else []
    perf_values = ai_data[key][1] if len(ai_data[key]) > 1 else []

    sanitized_ai: list[float] = []
    sanitized_perf: list[float] = []

    for i, (ai_val, perf_val) in enumerate(zip(ai_values, perf_values)):
        kernel_name = kernel_names[i] if i < len(kernel_names) else f"kernel_{i}"

        if not is_valid(ai_val) or not is_valid(perf_val):
            cache_level = key.replace("ai_", "").upper()
            warnings.append(
                f"Removed {cache_level} data point for {kernel_name}: "
                f"AI={ai_val}, Perf={perf_val}"
            )
        else:
            sanitized_ai.append(ai_val)
            sanitized_perf.append(perf_val)
```
Copilot AI (Feb 10, 2026):
sanitize_roofline_data assumes kernelNames[i] corresponds to the ith element of each ai_* list, but calc_ai_analyze() appends kernelNames once per kernel (when performance > 0) while each ai_* list only appends when that cache-level AI is > 0. This means kernelNames and ai_* indices can diverge, so the warning messages can report the wrong kernel for a removed point. Consider returning/using per-cache-level kernel name lists (or a kept-index mapping) so warnings (and any future labeling) stay accurate.
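A sketch of the suggested fix, keeping a per-cache-level kernel-name list aligned with the kept points (standalone and illustrative; the real data layout lives in calc_ai_analyze()):

```python
import math


def sanitize_with_names(ai_values, perf_values, kernel_names):
    """Filter invalid points while keeping kernel names aligned per level.

    Returns (ai, perf, names, warnings): the kept coordinates, the kernel
    name for each kept point, and a warning per dropped point, so labels
    and warnings always refer to the correct kernel.
    """
    kept_ai, kept_perf, kept_names, warnings = [], [], [], []
    for i, (a, p) in enumerate(zip(ai_values, perf_values)):
        name = kernel_names[i] if i < len(kernel_names) else f"kernel_{i}"
        if math.isfinite(a) and math.isfinite(p):
            kept_ai.append(a)
            kept_perf.append(p)
            kept_names.append(name)
        else:
            warnings.append(f"Removed data point for {name}: AI={a}, Perf={p}")
    return kept_ai, kept_perf, kept_names, warnings
```

Because the returned name list is filtered in lockstep with the coordinates, downstream plotting can label points by kernel name again instead of falling back to positional K{i} labels.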
```diff
 if ai_val > 0 and perf_val > 0:
     plt.plot(
-        [self.__ai_data[key][0][i]],
-        [self.__ai_data[key][1][i]],
-        label=f"AI_{cache_level}_{kernel_names[i]}",
+        [ai_val],
+        [perf_val],
+        label=f"AI_{cache_level}_K{i}",
         color=color_scheme[cache_level],
         marker=kernel_markers[i % len(kernel_markers)],
     )
-    val1 = (
-        self.__ai_data[key][0][i]
-        if i < len(self.__ai_data[key][0])
-        else "N/A"
-    )
-    val2 = (
-        self.__ai_data[key][1][i]
-        if i < len(self.__ai_data[key][1])
-        else "N/A"
-    )
-    console_debug("roofline", f"AI_{kernel_names[i]}: {val1}, {val2}")
+    console_debug(
+        "roofline", f"AI_{cache_level}_K{i}: {ai_val}, {perf_val}"
+    )
```
Copilot AI (Feb 10, 2026):
The plot labels/debug output were changed from kernel_names[i] to K{i}. This avoids index errors, but it also drops the (potentially meaningful) kernel identifier and makes it harder to correlate plotted points with the analysis tables. If you address the underlying index-alignment issue (e.g., by keeping kernel names aligned with each ai_* list), you can preserve the original kernel labels here without risking crashes.
Motivation
To prevent analysis from failing on NaN and Inf values during roofline plotting.
Technical Details
Adds robust NaN/Inf handling across the metric evaluation and roofline plotting pipeline to prevent crashes.
JIRA ID
AIPROFCOMP-145
Test Plan
Test Result
Submission Checklist