Conversation

@abchoudh-amd commented Feb 6, 2026

Motivation

To prevent analysis from failing on NaN and Inf values while plotting the roofline.

Technical Details

Adds robust NaN/Inf handling across the metric evaluation and roofline plotting pipeline to prevent crashes.

JIRA ID

AIPROFCOMP-145

Test Plan

  • Analyze instmix and ipc sample workloads
  • Run ctests

Test Result

  • Analysis of instmix and ipc sample workloads does not crash
  • Ctests pass

Submission Checklist

@abchoudh-amd requested a review from a team as a code owner on February 6, 2026 16:09
Copilot AI review requested due to automatic review settings February 6, 2026 16:09

Copilot AI left a comment

Pull request overview

This pull request adds robust NaN/Inf value handling across the metric evaluation and roofline plotting pipeline to prevent analysis from crashing when encountering invalid numeric values. The changes span test files, utility functions, and core analysis modules.

Changes:

  • Added NaN/Inf detection and warning emission in metric evaluation functions (parser.py and analysis_db.py)
  • Introduced sanitize_roofline_data() function to filter out invalid data points from roofline plots (a minimal sketch follows this list)
  • Modified roofline plotting to use index-based labels instead of kernel names after sanitization
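
For orientation, here is a minimal sketch of the kind of filtering sanitize_roofline_data() performs, assuming the [ai_values, perf_values] layout visible in the roofline_calc.py snippet later in this review; the names and return shape are illustrative, not the actual signature:

import math

def is_finite_number(value) -> bool:
    # Rejects None, NaN, and +/-Inf; numpy scalars convert cleanly via float().
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False

def sanitize(ai_data: dict) -> tuple[dict, list[str]]:
    # Drop any (AI, perf) pair where either coordinate is invalid,
    # recording a warning per removed point.
    sanitized: dict = {}
    warnings: list[str] = []
    for key, value in ai_data.items():
        if not key.startswith("ai_"):
            sanitized[key] = value  # e.g. kernelNames passes through untouched
            continue
        ai_values = value[0] if len(value) > 0 else []
        perf_values = value[1] if len(value) > 1 else []
        kept_ai, kept_perf = [], []
        for i, (ai, perf) in enumerate(zip(ai_values, perf_values)):
            if is_finite_number(ai) and is_finite_number(perf):
                kept_ai.append(ai)
                kept_perf.append(perf)
            else:
                warnings.append(f"Removed {key} point {i}: AI={ai}, Perf={perf}")
        sanitized[key] = [kept_ai, kept_perf]
    return sanitized, warnings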

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/test_utils.py: Comprehensive test coverage for the new sanitize_roofline_data() function with 17 test cases covering NaN/Inf handling, numpy types, and edge cases
  • tests/test_analyze_commands.py: Test coverage for NaN/Inf handling in evaluate() and MetricEvaluator.eval_expression() with 15 test cases
  • src/utils/roofline_calc.py: New sanitize_roofline_data() function to filter invalid roofline data points; contains a critical bug in the loop logic
  • src/utils/parser.py: Added NaN/Inf detection with warnings and an np.errstate context manager in MetricEvaluator.eval_expression()
  • src/rocprof_compute_analyze/analysis_db.py: Added NaN/Inf detection with warnings and an np.errstate context manager in the evaluate() method; moved the numpy import to follow PEP 8 ordering
  • src/roofline.py: Integrated the sanitization call before plotting; modified plotting code to use index-based labels (K0, K1, etc.) instead of kernel names

@abchoudh-amd force-pushed the users/abchoudh/nan_conversion branch 2 times, most recently from c2b5aa1 to d5ddfa0 on February 9, 2026 08:16
abchoudh-amd and others added 8 commits February 9, 2026 09:58
Add 33 tests covering sanitize_roofline_data, db_analysis.evaluate(),
and MetricEvaluator.eval_expression() NaN/Inf detection and handling.

Co-authored-by: Cursor <cursoragent@cursor.com>
@abchoudh-amd force-pushed the users/abchoudh/nan_conversion branch from d5f9327 to f2cb567 on February 9, 2026 15:58

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Comment on lines +8230 to +8276
def test_metric_evaluator_nan_inf_handling():
    """Test that MetricEvaluator.eval_expression() handles NaN, Inf, None, and NA."""
    from utils.parser import MetricEvaluator

    def make_evaluator(raw_pmc_df=None):
        if raw_pmc_df is None:
            raw_pmc_df = pd.DataFrame({"a": [10.0], "b": [2.0]})
        return MetricEvaluator(raw_pmc_df, {}, {})

    # Valid expression should evaluate correctly without warnings
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] + raw_pmc_df['b'][0]")
        assert result == 12.0
        mock_warn.assert_not_called()

    # Division by zero should produce inf and emit a warning
    evaluator = make_evaluator(pd.DataFrame({"a": [1.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert np.isinf(result)

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    inf_warnings = [c for c in warn_calls if "inf" in c]
    assert len(inf_warnings) > 0

    # 0/0 -> NaN -> detected as scalar NA -> returns "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [0.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert result == "N/A"

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    nan_warnings = [c for c in warn_calls if "NaN" in c]
    assert len(nan_warnings) > 0

    # pd.NA expression should return "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [pd.NA]}))
    with mock.patch("utils.parser.console_warning"):
        result = evaluator.eval_expression("raw_pmc_df['a'][0]")
        assert result == "N/A"

    # None expression should return "N/A"
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_debug"):
        result = evaluator.eval_expression("None")
        assert result == "N/A"

Copilot AI Feb 10, 2026

The new MetricEvaluator/analysis_db NaN/Inf tests don’t cover expressions that should return non-numeric scalars (e.g. CONCAT(...) / to_concat). Given the new np.isinf/np.isnan logic, adding a regression test that evaluates a CONCAT expression and asserts the string result is preserved would help prevent accidental breakage of existing metric expressions.
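
A sketch of such a regression test, assuming eval_expression() evaluates ordinary Python expressions the way the tests above do; the plain string concatenation here merely stands in for a real CONCAT(...)/to_concat metric expression. With a numeric-scalar guard in place, the string result survives; with unguarded np.isnan/np.isinf checks, it would fall into the exception handler and come back as "N/A":

import pandas as pd
from unittest import mock

def test_eval_expression_preserves_string_scalar():
    from utils.parser import MetricEvaluator

    evaluator = MetricEvaluator(pd.DataFrame({"a": [1.0]}), {}, {})
    with mock.patch("utils.parser.console_warning") as mock_warn:
        # A plain string concatenation stands in for CONCAT(...)/to_concat.
        result = evaluator.eval_expression("'kernel' + '_' + 'name'")
    # The string result must survive the new NaN/Inf checks unchanged.
    assert result == "kernel_name"
    mock_warn.assert_not_called()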

Comment on lines +522 to +536
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to inf - likely due to "
        "division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to NaN - likely due to "
        "invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return "N/A"

Copilot AI Feb 10, 2026

The np.isinf/np.isnan checks run for any scalar eval_result. This will raise TypeError for valid non-numeric scalar results (e.g. CONCAT(...)/to_concat returns a str), causing the expression to incorrectly fall into the exception handler and return "N/A". Guard these checks with a numeric type check (e.g., isinstance(eval_result, (int, float, np.number))) or wrap np.isinf/np.isnan in a try/except so string/scalar results aren’t treated as evaluation failures.
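
One way to express the suggested guard (a sketch; is_numeric_scalar is a hypothetical helper, not part of the codebase):

import numpy as np

def is_numeric_scalar(value) -> bool:
    # bool is an int subclass, so exclude it explicitly; np.number covers
    # numpy scalar types such as np.float64 and np.int64.
    return isinstance(value, (int, float, np.number)) and not isinstance(value, bool)

# The checks above would then become:
#   if is_numeric_scalar(eval_result) and np.isinf(eval_result): ...
#   if is_numeric_scalar(eval_result) and np.isnan(eval_result): ...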

Comment on lines +479 to +493
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to inf - likely "
        "due to division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to NaN - likely "
        "due to invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return None

Copilot AI Feb 10, 2026

np.isnan(eval_result) is executed for any scalar result. If an expression legitimately evaluates to a non-numeric scalar (e.g. to_concat(...) returns a str), np.isnan will raise TypeError and the code will fall into the generic exception handler, returning None and logging a misleading failure. Restrict the isinf/isnan checks to numeric scalars (e.g., isinstance(eval_result, (int, float, np.number))) or catch TypeError around these checks so string results remain supported.
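
Alternatively, the try/except variant mentioned above could look like this (a sketch; safe_isnan and safe_isinf are hypothetical helpers):

import numpy as np

def safe_isnan(value) -> bool:
    # np.isnan raises TypeError for non-numeric scalars such as the str
    # returned by to_concat(...); treat those as "not NaN".
    try:
        return bool(np.isnan(value))
    except TypeError:
        return False

def safe_isinf(value) -> bool:
    try:
        return bool(np.isinf(value))
    except TypeError:
        return False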

Comment on lines +130 to +156
# Copy kernel names as-is initially
kernel_names = ai_data.get("kernelNames", [])

# Process each AI cache level (ai_hbm, ai_l2, ai_l1)
for key in ai_data.keys():
    if not key.startswith("ai_"):
        continue

    ai_values = ai_data[key][0] if len(ai_data[key]) > 0 else []
    perf_values = ai_data[key][1] if len(ai_data[key]) > 1 else []

    sanitized_ai: list[float] = []
    sanitized_perf: list[float] = []

    for i, (ai_val, perf_val) in enumerate(zip(ai_values, perf_values)):
        kernel_name = kernel_names[i] if i < len(kernel_names) else f"kernel_{i}"

        if not is_valid(ai_val) or not is_valid(perf_val):
            cache_level = key.replace("ai_", "").upper()
            warnings.append(
                f"Removed {cache_level} data point for {kernel_name}: "
                f"AI={ai_val}, Perf={perf_val}"
            )
        else:
            sanitized_ai.append(ai_val)
            sanitized_perf.append(perf_val)


Copilot AI Feb 10, 2026

sanitize_roofline_data assumes kernelNames[i] corresponds to the ith element of each ai_* list, but calc_ai_analyze() appends kernelNames once per kernel (when performance > 0) while each ai_* list only appends when that cache-level AI is > 0. This means kernelNames and ai_* indices can diverge, so the warning messages can report the wrong kernel for a removed point. Consider returning/using per-cache-level kernel name lists (or a kept-index mapping) so warnings (and any future labeling) stay accurate.
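
One possible shape for the kept-index mapping suggested above (a sketch with hypothetical names; the finite-number check mirrors the is_valid() used in the snippet):

import math

def sanitize_with_kept_indices(ai_values, perf_values):
    # Return the surviving (AI, perf) pairs together with their original
    # indices, so per-cache-level kernel names can be looked up accurately.
    kept_ai, kept_perf, kept_idx = [], [], []
    for i, (ai, perf) in enumerate(zip(ai_values, perf_values)):
        try:
            valid = math.isfinite(float(ai)) and math.isfinite(float(perf))
        except (TypeError, ValueError):
            valid = False
        if valid:
            kept_ai.append(ai)
            kept_perf.append(perf)
            kept_idx.append(i)
    return kept_ai, kept_perf, kept_idx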

Comment on lines +1369 to +1379
  if ai_val > 0 and perf_val > 0:
      plt.plot(
-         [self.__ai_data[key][0][i]],
-         [self.__ai_data[key][1][i]],
-         label=f"AI_{cache_level}_{kernel_names[i]}",
+         [ai_val],
+         [perf_val],
+         label=f"AI_{cache_level}_K{i}",
          color=color_scheme[cache_level],
          marker=kernel_markers[i % len(kernel_markers)],
      )
-     val1 = (
-         self.__ai_data[key][0][i]
-         if i < len(self.__ai_data[key][0])
-         else "N/A"
-     )
-     val2 = (
-         self.__ai_data[key][1][i]
-         if i < len(self.__ai_data[key][1])
-         else "N/A"
-     )
-     console_debug("roofline", f"AI_{kernel_names[i]}: {val1}, {val2}")
+     console_debug(
+         "roofline", f"AI_{cache_level}_K{i}: {ai_val}, {perf_val}"
+     )

Copilot AI Feb 10, 2026

The plot labels/debug output were changed from kernel_names[i] to K{i}. This avoids index errors, but it also drops the (potentially meaningful) kernel identifier and makes it harder to correlate plotted points with the analysis tables. If you address the underlying index-alignment issue (e.g., by keeping kernel names aligned with each ai_* list), you can preserve the original kernel labels here without risking crashes.
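
With a kept-index mapping like the sketch in the previous comment, the original labels could be restored along these lines (hypothetical helper; plot_points and its arguments are illustrative):

import matplotlib.pyplot as plt

def plot_points(kept_ai, kept_perf, kept_idx, kernel_names, cache_level):
    # Label each surviving point with its original kernel name, falling
    # back to an index label only if the name list is too short.
    for ai, perf, orig_i in zip(kept_ai, kept_perf, kept_idx):
        name = kernel_names[orig_i] if orig_i < len(kernel_names) else f"K{orig_i}"
        plt.plot([ai], [perf], marker="o", label=f"AI_{cache_level}_{name}")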
