[Vectorization] Support integer truncation operations

An example of an elementwise op with `arith.trunci` is:

```
%3 = linalg.generic {
       indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>],
       iterator_types = ["parallel", "parallel", "parallel", "parallel"]
     } ins(%2 : tensor<256x16x64x64xi32>) outs(%i8out : tensor<256x16x64x64xi8>) {
  ^bb0(%in: i32, %out: i8):
    %4 = arith.shrsi %in, %cst_shift : i32
    %5 = arith.trunci %4 : i32 to i8
    linalg.yield %5 : i8
} -> tensor<256x16x64x64xi8>
```
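
For reference, the per-element semantics of the body above can be sketched in plain Python. This is only an illustrative scalar model, not code from the repository: the helper names `shrsi_i32`, `trunci_i32_to_i8`, and `scale_trunc` are hypothetical, the value of `%cst_shift` is not given in this issue so the shift amounts below are made up, and the shift is assumed to be in the range 0 to 31.

```python
def shrsi_i32(x: int, shift: int) -> int:
    """Model of arith.shrsi on i32: arithmetic (sign-preserving) right shift.

    Assumes x fits in a signed 32-bit integer and 0 <= shift < 32.
    Python's >> on ints is already an arithmetic shift.
    """
    assert -(1 << 31) <= x < (1 << 31)
    assert 0 <= shift < 32
    return x >> shift


def trunci_i32_to_i8(x: int) -> int:
    """Model of arith.trunci i32 -> i8: keep the low 8 bits, drop the rest.

    The low byte is then reinterpreted as a signed two's-complement i8.
    """
    low = x & 0xFF
    return low - 0x100 if low >= 0x80 else low


def scale_trunc(x: int, shift: int) -> int:
    """Shift then truncate, mirroring the linalg.generic body for one element."""
    return trunci_i32_to_i8(shrsi_i32(x, shift))


# Example with a hypothetical shift of 4.
values = [-1024, 70000, 2048]
results = [scale_trunc(v, 4) for v in values]
```

Note that truncation wraps rather than saturates: `2048 >> 4` is 128, which does not fit in i8 and becomes -128 after dropping the upper bits.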

**Motivation:**

We observed a significant performance improvement when using a ukernel for the operation above: the mean run time drops from about 8705 us to about 1149 us, roughly 7.6x. The vectorization path with Peano does not support this operation, so the code falls back to scalar execution, which results in the poor performance shown below:

```
-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time              8719 us          150 us          100 items_per_second=114.69/s
BM_matmul4d_trunci/process_time/real_time              8719 us          159 us          100 items_per_second=114.697/s
BM_matmul4d_trunci/process_time/real_time              8688 us          155 us          100 items_per_second=115.098/s
BM_matmul4d_trunci/process_time/real_time              8689 us          159 us          100 items_per_second=115.094/s
BM_matmul4d_trunci/process_time/real_time              8709 us          159 us          100 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_mean         8705 us          156 us            5 items_per_second=114.882/s
BM_matmul4d_trunci/process_time/real_time_median       8709 us          159 us            5 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_stddev       15.4 us         4.13 us            5 items_per_second=0.20339/s
BM_matmul4d_trunci/process_time/real_time_cv           0.18 %          2.64 %             5 items_per_second=0.18%
```

Below is the performance with the ukernel path:
```
-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time              1146 us         51.5 us          700 items_per_second=872.337/s
BM_matmul4d_trunci/process_time/real_time              1149 us         64.8 us          700 items_per_second=870.528/s
BM_matmul4d_trunci/process_time/real_time              1149 us         65.5 us          700 items_per_second=869.999/s
BM_matmul4d_trunci/process_time/real_time              1149 us         65.9 us          700 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time              1150 us         66.9 us          700 items_per_second=869.819/s
BM_matmul4d_trunci/process_time/real_time_mean         1149 us         62.9 us            5 items_per_second=870.629/s
BM_matmul4d_trunci/process_time/real_time_median       1149 us         65.5 us            5 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time_stddev       1.32 us         6.42 us            5 items_per_second=1.00111/s
BM_matmul4d_trunci/process_time/real_time_cv           0.11 %         10.20 %             5 items_per_second=0.11%
```

To reproduce the numbers for the ukernel path, run:
```
python cpu_comparison/run.py results /**iree-installed-path**/ --peano_dir=/**llvm-aie-installed-path**/ --xrt_lite_n_core_rows=$XRT_LITE_N_CORE_ROWS --xrt_lite_n_core_cols=$XRT_LITE_N_CORE_COLS --target_device="npu4" -v --vitis_dir=/opt/Xilinx/Vitis/2024.2 --tests=matmul4d_scale_trunci_1024_16384_512_i8_i32_O1_npu4_outline_4_level_tiling_ukernel_peano_ctrlpkt_benchmark
```

For the vectorization path, set `"use_ukernel": False` in https://github.com/nod-ai/iree-amd-aie/blob/b8ad10e8c7274cd744327d99b4b9a7447b0558c9/build_tools/ci/cpu_comparison/run.py#L2215 and run the same command.
