Conversation

@abchoudh-amd commented Feb 6, 2026

Motivation

To prevent analysis from failing on NaN and Inf values while plotting the roofline.

Technical Details

Adds robust NaN/Inf handling across the metric evaluation and roofline plotting pipeline to prevent crashes.

JIRA ID

AIPROFCOMP-145

Test Plan

  • Analyze instmix and ipc sample workloads
  • Run ctests

Test Result

  • Analysis of instmix and ipc sample workloads does not crash
  • Ctests pass

Submission Checklist

@abchoudh-amd requested a review from a team as a code owner on February 6, 2026 16:09
Copilot AI review requested due to automatic review settings February 6, 2026 16:09

Copilot AI left a comment

Pull request overview

This pull request adds robust NaN/Inf value handling across the metric evaluation and roofline plotting pipeline to prevent analysis from crashing when encountering invalid numeric values. The changes span test files, utility functions, and core analysis modules.

Changes:

  • Added NaN/Inf detection and warning emission in metric evaluation functions (parser.py and analysis_db.py)
  • Introduced sanitize_roofline_data() function to filter out invalid data points from roofline plots (a minimal sketch follows this list)
  • Modified roofline plotting to use index-based labels instead of kernel names after sanitization
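
For orientation, here is a minimal sketch of the kind of filtering sanitize_roofline_data() performs, assuming the [ai_values, perf_values] layout visible in the roofline_calc.py snippet later in this review; the names and return shape are illustrative, not the actual signature:

import math

def is_finite_number(value) -> bool:
    # Rejects None, NaN, and +/-Inf; numpy scalars convert cleanly via float().
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False

def sanitize(ai_data: dict) -> tuple[dict, list[str]]:
    # Drop any (AI, perf) pair where either coordinate is invalid,
    # recording a warning per removed point.
    sanitized: dict = {}
    warnings: list[str] = []
    for key, value in ai_data.items():
        if not key.startswith("ai_"):
            sanitized[key] = value  # e.g. kernelNames passes through untouched
            continue
        ai_values = value[0] if len(value) > 0 else []
        perf_values = value[1] if len(value) > 1 else []
        kept_ai, kept_perf = [], []
        for i, (ai, perf) in enumerate(zip(ai_values, perf_values)):
            if is_finite_number(ai) and is_finite_number(perf):
                kept_ai.append(ai)
                kept_perf.append(perf)
            else:
                warnings.append(f"Removed {key} point {i}: AI={ai}, Perf={perf}")
        sanitized[key] = [kept_ai, kept_perf]
    return sanitized, warnings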

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/test_utils.py: Comprehensive test coverage for the new sanitize_roofline_data() function with 17 test cases covering NaN/Inf handling, numpy types, and edge cases
  • tests/test_analyze_commands.py: Test coverage for NaN/Inf handling in evaluate() and MetricEvaluator.eval_expression() with 15 test cases
  • src/utils/roofline_calc.py: New sanitize_roofline_data() function to filter invalid roofline data points; contains a critical bug in the loop logic
  • src/utils/parser.py: Added NaN/Inf detection with warnings and an np.errstate context manager in MetricEvaluator.eval_expression()
  • src/rocprof_compute_analyze/analysis_db.py: Added NaN/Inf detection with warnings and an np.errstate context manager in the evaluate() method; moved the numpy import to follow PEP 8 ordering
  • src/roofline.py: Integrated the sanitization call before plotting; modified plotting code to use index-based labels (K0, K1, etc.) instead of kernel names

@abchoudh-amd force-pushed the users/abchoudh/nan_conversion branch 2 times, most recently from c2b5aa1 to d5ddfa0 on February 9, 2026 08:16
abchoudh-amd and others added 8 commits February 9, 2026 09:58
Add 33 tests covering sanitize_roofline_data, db_analysis.evaluate(),
and MetricEvaluator.eval_expression() NaN/Inf detection and handling.

Co-authored-by: Cursor <cursoragent@cursor.com>
@abchoudh-amd force-pushed the users/abchoudh/nan_conversion branch from d5f9327 to f2cb567 on February 9, 2026 15:58

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Comment on lines +8230 to +8276
def test_metric_evaluator_nan_inf_handling():
    """Test that MetricEvaluator.eval_expression() handles NaN, Inf, None, and NA."""
    from utils.parser import MetricEvaluator

    def make_evaluator(raw_pmc_df=None):
        if raw_pmc_df is None:
            raw_pmc_df = pd.DataFrame({"a": [10.0], "b": [2.0]})
        return MetricEvaluator(raw_pmc_df, {}, {})

    # Valid expression should evaluate correctly without warnings
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] + raw_pmc_df['b'][0]")
        assert result == 12.0
        mock_warn.assert_not_called()

    # Division by zero should produce inf and emit a warning
    evaluator = make_evaluator(pd.DataFrame({"a": [1.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert np.isinf(result)

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    inf_warnings = [c for c in warn_calls if "inf" in c]
    assert len(inf_warnings) > 0

    # 0/0 -> NaN -> detected as scalar NA -> returns "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [0.0], "b": [0.0]}))
    with mock.patch("utils.parser.console_warning") as mock_warn:
        result = evaluator.eval_expression("raw_pmc_df['a'][0] / raw_pmc_df['b'][0]")
        assert result == "N/A"

    warn_calls = [str(c) for c in mock_warn.call_args_list]
    nan_warnings = [c for c in warn_calls if "NaN" in c]
    assert len(nan_warnings) > 0

    # pd.NA expression should return "N/A"
    evaluator = make_evaluator(pd.DataFrame({"a": [pd.NA]}))
    with mock.patch("utils.parser.console_warning"):
        result = evaluator.eval_expression("raw_pmc_df['a'][0]")
        assert result == "N/A"

    # None expression should return "N/A"
    evaluator = make_evaluator()
    with mock.patch("utils.parser.console_debug"):
        result = evaluator.eval_expression("None")
        assert result == "N/A"

Copilot AI Feb 10, 2026

The new MetricEvaluator/analysis_db NaN/Inf tests don’t cover expressions that should return non-numeric scalars (e.g. CONCAT(...) / to_concat). Given the new np.isinf/np.isnan logic, adding a regression test that evaluates a CONCAT expression and asserts the string result is preserved would help prevent accidental breakage of existing metric expressions.
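
A sketch of such a regression test, assuming eval_expression() evaluates ordinary Python expressions the way the tests above do; the plain string concatenation here merely stands in for a real CONCAT(...)/to_concat metric expression. With a numeric-scalar guard in place, the string result survives; with unguarded np.isnan/np.isinf checks, it would fall into the exception handler and come back as "N/A":

import pandas as pd
from unittest import mock

def test_eval_expression_preserves_string_scalar():
    from utils.parser import MetricEvaluator

    evaluator = MetricEvaluator(pd.DataFrame({"a": [1.0]}), {}, {})
    with mock.patch("utils.parser.console_warning") as mock_warn:
        # A plain string concatenation stands in for CONCAT(...)/to_concat.
        result = evaluator.eval_expression("'kernel' + '_' + 'name'")
    # The string result must survive the new NaN/Inf checks unchanged.
    assert result == "kernel_name"
    mock_warn.assert_not_called()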

Comment on lines +522 to +536
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to inf - likely due to "
        "division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression '{expr}' evaluated to NaN - likely due to "
        "invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return "N/A"

Copilot AI Feb 10, 2026

The np.isinf/np.isnan checks run for any scalar eval_result. This will raise TypeError for valid non-numeric scalar results (e.g. CONCAT(...)/to_concat returns a str), causing the expression to incorrectly fall into the exception handler and return "N/A". Guard these checks with a numeric type check (e.g., isinstance(eval_result, (int, float, np.number))) or wrap np.isinf/np.isnan in a try/except so string/scalar results aren’t treated as evaluation failures.
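
One way to express the suggested guard (a sketch; is_numeric_scalar is a hypothetical helper, not part of the codebase):

import numpy as np

def is_numeric_scalar(value) -> bool:
    # bool is an int subclass, so exclude it explicitly; np.number covers
    # numpy scalar types such as np.float64 and np.int64.
    return isinstance(value, (int, float, np.number)) and not isinstance(value, bool)

# The checks above would then become:
#   if is_numeric_scalar(eval_result) and np.isinf(eval_result): ...
#   if is_numeric_scalar(eval_result) and np.isnan(eval_result): ...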

Comment on lines +479 to +493
# Check for inf values resulting from division by zero
if np.isscalar(eval_result) and np.isinf(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to inf - likely "
        "due to division by zero."
    )

# Check for NaN values resulting from invalid operations
# (e.g., 0/0 or sqrt(negative number))
if np.isscalar(eval_result) and np.isnan(eval_result):
    console_warning(
        f"Expression for {name}: '{value}' evaluated to NaN - likely "
        "due to invalid operation (e.g., 0/0 or sqrt(negative number))."
    )
    return None

Copilot AI Feb 10, 2026

np.isnan(eval_result) is executed for any scalar result. If an expression legitimately evaluates to a non-numeric scalar (e.g. to_concat(...) returns a str), np.isnan will raise TypeError and the code will fall into the generic exception handler, returning None and logging a misleading failure. Restrict the isinf/isnan checks to numeric scalars (e.g., isinstance(eval_result, (int, float, np.number))) or catch TypeError around these checks so string results remain supported.
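
Alternatively, the try/except variant mentioned above could look like this (a sketch; safe_isnan and safe_isinf are hypothetical helpers):

import numpy as np

def safe_isnan(value) -> bool:
    # np.isnan raises TypeError for non-numeric scalars such as the str
    # returned by to_concat(...); treat those as "not NaN".
    try:
        return bool(np.isnan(value))
    except TypeError:
        return False

def safe_isinf(value) -> bool:
    try:
        return bool(np.isinf(value))
    except TypeError:
        return False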

Comment on lines +130 to +156
# Copy kernel names as-is initially
kernel_names = ai_data.get("kernelNames", [])

# Process each AI cache level (ai_hbm, ai_l2, ai_l1)
for key in ai_data.keys():
    if not key.startswith("ai_"):
        continue

    ai_values = ai_data[key][0] if len(ai_data[key]) > 0 else []
    perf_values = ai_data[key][1] if len(ai_data[key]) > 1 else []

    sanitized_ai: list[float] = []
    sanitized_perf: list[float] = []

    for i, (ai_val, perf_val) in enumerate(zip(ai_values, perf_values)):
        kernel_name = kernel_names[i] if i < len(kernel_names) else f"kernel_{i}"

        if not is_valid(ai_val) or not is_valid(perf_val):
            cache_level = key.replace("ai_", "").upper()
            warnings.append(
                f"Removed {cache_level} data point for {kernel_name}: "
                f"AI={ai_val}, Perf={perf_val}"
            )
        else:
            sanitized_ai.append(ai_val)
            sanitized_perf.append(perf_val)


Copilot AI Feb 10, 2026

sanitize_roofline_data assumes kernelNames[i] corresponds to the ith element of each ai_* list, but calc_ai_analyze() appends kernelNames once per kernel (when performance > 0) while each ai_* list only appends when that cache-level AI is > 0. This means kernelNames and ai_* indices can diverge, so the warning messages can report the wrong kernel for a removed point. Consider returning/using per-cache-level kernel name lists (or a kept-index mapping) so warnings (and any future labeling) stay accurate.
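
One possible shape for the kept-index mapping suggested above (a sketch with hypothetical names; the finite-number check mirrors the is_valid() used in the snippet):

import math

def sanitize_with_kept_indices(ai_values, perf_values):
    # Return the surviving (AI, perf) pairs together with their original
    # indices, so per-cache-level kernel names can be looked up accurately.
    kept_ai, kept_perf, kept_idx = [], [], []
    for i, (ai, perf) in enumerate(zip(ai_values, perf_values)):
        try:
            valid = math.isfinite(float(ai)) and math.isfinite(float(perf))
        except (TypeError, ValueError):
            valid = False
        if valid:
            kept_ai.append(ai)
            kept_perf.append(perf)
            kept_idx.append(i)
    return kept_ai, kept_perf, kept_idx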

Comment on lines +1369 to +1379
  if ai_val > 0 and perf_val > 0:
      plt.plot(
-         [self.__ai_data[key][0][i]],
-         [self.__ai_data[key][1][i]],
-         label=f"AI_{cache_level}_{kernel_names[i]}",
+         [ai_val],
+         [perf_val],
+         label=f"AI_{cache_level}_K{i}",
          color=color_scheme[cache_level],
          marker=kernel_markers[i % len(kernel_markers)],
      )
-     val1 = (
-         self.__ai_data[key][0][i]
-         if i < len(self.__ai_data[key][0])
-         else "N/A"
-     )
-     val2 = (
-         self.__ai_data[key][1][i]
-         if i < len(self.__ai_data[key][1])
-         else "N/A"
-     )
-     console_debug("roofline", f"AI_{kernel_names[i]}: {val1}, {val2}")
+     console_debug(
+         "roofline", f"AI_{cache_level}_K{i}: {ai_val}, {perf_val}"
+     )

Copilot AI Feb 10, 2026

The plot labels/debug output were changed from kernel_names[i] to K{i}. This avoids index errors, but it also drops the (potentially meaningful) kernel identifier and makes it harder to correlate plotted points with the analysis tables. If you address the underlying index-alignment issue (e.g., by keeping kernel names aligned with each ai_* list), you can preserve the original kernel labels here without risking crashes.
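
With a kept-index mapping like the sketch in the previous comment, the original labels could be restored along these lines (hypothetical helper; plot_points and its arguments are illustrative):

import matplotlib.pyplot as plt

def plot_points(kept_ai, kept_perf, kept_idx, kernel_names, cache_level):
    # Label each surviving point with its original kernel name, falling
    # back to an index label only if the name list is too short.
    for ai, perf, orig_i in zip(kept_ai, kept_perf, kept_idx):
        name = kernel_names[orig_i] if orig_i < len(kernel_names) else f"K{orig_i}"
        plt.plot([ai], [perf], marker="o", label=f"AI_{cache_level}_{name}")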
