[Vectorization] Support integer truncation operations #1270

@yzhang93

An example of an elementwise op with trunci is as follows:

%3 = linalg.generic {
       indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
                        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>],
       iterator_types = ["parallel", "parallel", "parallel", "parallel"]
     } ins(%2 : tensor<256x16x64x64xi32>) outs(%i8out : tensor<256x16x64x64xi8>) {
  ^bb0(%in: i32, %out: i8):
    %4 = arith.shrsi %in, %cst_shift : i32
    %5 = arith.trunci %4 : i32 to i8
    linalg.yield %5 : i8
} -> tensor<256x16x64x64xi8>
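
For reference, the per-element semantics of the body above can be sketched in plain Python. This is only an illustration, not the lowering itself: the helper names are hypothetical, and the shift amount is an assumption because the value of %cst_shift is not given here. It relies on arith.shrsi being a sign-preserving right shift and arith.trunci discarding the upper bits (wrapping, not saturating).

```python
def shrsi_i32(value: int, shift: int) -> int:
    """Arithmetic (sign-preserving) right shift, as arith.shrsi does for i32."""
    # Python's >> on ints is already an arithmetic shift.
    return value >> shift


def trunci_i32_to_i8(value: int) -> int:
    """Keep the low 8 bits and reinterpret them as a signed i8 (two's complement)."""
    low = value & 0xFF
    return low - 0x100 if low >= 0x80 else low


def scale_trunci(values, shift):
    """Apply shift-then-truncate to each element, mirroring the linalg.generic body."""
    return [trunci_i32_to_i8(shrsi_i32(v, shift)) for v in values]


# The shift of 4 is an assumed value chosen only for this example.
print(scale_trunci([1024, -1024, 70000], 4))  # [64, -64, 23]
```

Note that 70000 >> 4 = 4375 does not fit in i8, so truncation wraps it to 23; a vectorized implementation has to reproduce this wrapping behaviour lane by lane.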

Motivation:

We observed significant performance improvements when using a ukernel for the operation above. However, since the vectorization path with Peano doesn't support this operation, the code falls back to scalar execution, resulting in poor performance as below:

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time              8719 us          150 us          100 items_per_second=114.69/s
BM_matmul4d_trunci/process_time/real_time              8719 us          159 us          100 items_per_second=114.697/s
BM_matmul4d_trunci/process_time/real_time              8688 us          155 us          100 items_per_second=115.098/s
BM_matmul4d_trunci/process_time/real_time              8689 us          159 us          100 items_per_second=115.094/s
BM_matmul4d_trunci/process_time/real_time              8709 us          159 us          100 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_mean         8705 us          156 us            5 items_per_second=114.882/s
BM_matmul4d_trunci/process_time/real_time_median       8709 us          159 us            5 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_stddev       15.4 us         4.13 us            5 items_per_second=0.20339/s
BM_matmul4d_trunci/process_time/real_time_cv           0.18 %          2.64 %             5 items_per_second=0.18%

Below is the performance with ukernel path:

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time              1146 us         51.5 us          700 items_per_second=872.337/s
BM_matmul4d_trunci/process_time/real_time              1149 us         64.8 us          700 items_per_second=870.528/s
BM_matmul4d_trunci/process_time/real_time              1149 us         65.5 us          700 items_per_second=869.999/s
BM_matmul4d_trunci/process_time/real_time              1149 us         65.9 us          700 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time              1150 us         66.9 us          700 items_per_second=869.819/s
BM_matmul4d_trunci/process_time/real_time_mean         1149 us         62.9 us            5 items_per_second=870.629/s
BM_matmul4d_trunci/process_time/real_time_median       1149 us         65.5 us            5 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time_stddev       1.32 us         6.42 us            5 items_per_second=1.00111/s
BM_matmul4d_trunci/process_time/real_time_cv           0.11 %         10.20 %             5 items_per_second=0.11%
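
To put the two runs side by side, the speedup can be computed from the reported mean values. The snippet below is a small sketch that only uses the `real_time_mean` rows from the two tables above; the variable names are made up for illustration.

```python
# Mean wall-clock time per iteration (us), taken from the real_time_mean rows.
vectorization_mean_us = 8705.0  # Peano vectorization path (scalar fallback)
ukernel_mean_us = 1149.0        # ukernel path

# Mean throughput (items per second) from the same rows.
vectorization_items_per_s = 114.882
ukernel_items_per_s = 870.629

time_speedup = vectorization_mean_us / ukernel_mean_us
throughput_ratio = ukernel_items_per_s / vectorization_items_per_s

print(f"time speedup:     {time_speedup:.2f}x")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```

Both ratios come out at roughly 7.6x in favour of the ukernel path.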

To reproduce the numbers for the ukernel path, run:

python cpu_comparison/run.py results /**iree-installed-path**/ --peano_dir=/**llvm-aie-installed-path**/ --xrt_lite_n_core_rows=$XRT_LITE_N_CORE_ROWS --xrt_lite_n_core_cols=$XRT_LITE_N_CORE_COLS --target_device="npu4" -v --vitis_dir=/opt/Xilinx/Vitis/2024.2 --tests=matmul4d_scale_trunci_1024_16384_512_i8_i32_O1_npu4_outline_4_level_tiling_ukernel_peano_ctrlpkt_benchmark

For the vectorization path, set "use_ukernel": False in
