An example of an elementwise op with trunci is as follows:
%3 = linalg.generic {
  indexing_maps = [
    affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>,
    affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
  ],
  iterator_types = ["parallel", "parallel", "parallel", "parallel"]
} ins(%2 : tensor<256x16x64x64xi32>) outs(%i8out : tensor<256x16x64x64xi8>) {
^bb0(%in: i32, %out: i8):
  %4 = arith.shrsi %in, %cst_shift : i32
  %5 = arith.trunci %4 : i32 to i8
  linalg.yield %5 : i8
} -> tensor<256x16x64x64xi8>
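For reference, the per-element computation in the op above can be sketched in plain Python. This is only an illustrative model of the arith.shrsi and arith.trunci semantics (arithmetic right shift, then keeping the low 8 bits as a signed value), not code from the repository. The function names and the shift amount of 4 are assumptions, since the value of %cst_shift is not given here.

```python
def shrsi_i32(value: int, shift: int) -> int:
    """Arithmetic (sign-preserving) right shift of a signed 32-bit value."""
    assert -2**31 <= value < 2**31, "value must fit in i32"
    assert 0 <= shift < 32, "shift must be smaller than the bit width"
    # Python's >> on ints is already an arithmetic shift.
    return value >> shift


def trunci_i32_to_i8(value: int) -> int:
    """Keep the low 8 bits and reinterpret them as a signed 8-bit integer."""
    low = value & 0xFF
    return low - 0x100 if low >= 0x80 else low


def scale_trunci(values, shift):
    """Apply shift-then-truncate to every element (hypothetical helper)."""
    return [trunci_i32_to_i8(shrsi_i32(v, shift)) for v in values]


# Hypothetical shift amount; the real %cst_shift is not shown in this issue.
result = scale_trunci([1024, -1024, 65535, -1], shift=4)
print(result)
```

Note that truncation discards the high bits rather than saturating, so 65535 >> 4 = 4095 becomes -1 after truncation to i8. A model like this can serve as a quick check on the output of either the ukernel or the vectorized path.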
Motivation:
We observed a significant performance improvement when using a ukernel for the operation above: about 7.6x in mean real time (8705 us vs. 1149 us). However, the vectorization path with Peano does not support this operation, so the code falls back to scalar execution. The resulting scalar performance is shown below:
-----------------------------------------------------------------------------------------------------------
Benchmark                                               Time         CPU  Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time            8719 us      150 us         100 items_per_second=114.69/s
BM_matmul4d_trunci/process_time/real_time            8719 us      159 us         100 items_per_second=114.697/s
BM_matmul4d_trunci/process_time/real_time            8688 us      155 us         100 items_per_second=115.098/s
BM_matmul4d_trunci/process_time/real_time            8689 us      159 us         100 items_per_second=115.094/s
BM_matmul4d_trunci/process_time/real_time            8709 us      159 us         100 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_mean       8705 us      156 us           5 items_per_second=114.882/s
BM_matmul4d_trunci/process_time/real_time_median     8709 us      159 us           5 items_per_second=114.83/s
BM_matmul4d_trunci/process_time/real_time_stddev     15.4 us     4.13 us           5 items_per_second=0.20339/s
BM_matmul4d_trunci/process_time/real_time_cv          0.18 %      2.64 %           5 items_per_second=0.18%
Below is the performance with the ukernel path:
-----------------------------------------------------------------------------------------------------------
Benchmark                                               Time         CPU  Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_matmul4d_trunci/process_time/real_time            1146 us     51.5 us         700 items_per_second=872.337/s
BM_matmul4d_trunci/process_time/real_time            1149 us     64.8 us         700 items_per_second=870.528/s
BM_matmul4d_trunci/process_time/real_time            1149 us     65.5 us         700 items_per_second=869.999/s
BM_matmul4d_trunci/process_time/real_time            1149 us     65.9 us         700 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time            1150 us     66.9 us         700 items_per_second=869.819/s
BM_matmul4d_trunci/process_time/real_time_mean       1149 us     62.9 us           5 items_per_second=870.629/s
BM_matmul4d_trunci/process_time/real_time_median     1149 us     65.5 us           5 items_per_second=870.46/s
BM_matmul4d_trunci/process_time/real_time_stddev     1.32 us     6.42 us           5 items_per_second=1.00111/s
BM_matmul4d_trunci/process_time/real_time_cv          0.11 %     10.20 %           5 items_per_second=0.11%
To reproduce the ukernel numbers, run:
python cpu_comparison/run.py results /**iree-installed-path**/ --peano_dir=/**llvm-aie-installed-path**/ --xrt_lite_n_core_rows=$XRT_LITE_N_CORE_ROWS --xrt_lite_n_core_cols=$XRT_LITE_N_CORE_COLS --target_device="npu4" -v --vitis_dir=/opt/Xilinx/Vitis/2024.2 --tests=matmul4d_scale_trunci_1024_16384_512_i8_i32_O1_npu4_outline_4_level_tiling_ukernel_peano_ctrlpkt_benchmark
For the vectorization path, change "use_ukernel": True to "use_ukernel": False at line 2215 of iree-amd-aie/build_tools/ci/cpu_comparison/run.py (as of commit b8ad10e).