Hello, team.
I have a question about the CU pressure caused by Mori's dispatch/combine kernels.
When setting up a Mori configuration, I noticed that the grid and block dimensions are set by the block_num and warp_num_per_block parameters.
I'm wondering if these values can give any indication of roughly how many CUs (Compute Units) Mori is utilizing.
Additionally, is there any guidance on how to adjust the warp_num_per_block and block_num parameters in different scenarios to achieve better performance?
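
To make the first question concrete, here is the back-of-the-envelope estimate I have in mind. It is only a sketch under assumptions I have not verified against Mori itself: that block_num is the grid size (number of workgroups), that each block has warp_num_per_block wavefronts of 64 threads, and that a workgroup always runs on a single CU. The helper names, the max_blocks_per_cu parameter, and the example values are hypothetical and not part of Mori.

```python
import math

WAVEFRONT_SIZE = 64  # assumption: 64-wide wavefronts (e.g. CDNA GPUs)


def threads_per_block(warp_num_per_block, wavefront_size=WAVEFRONT_SIZE):
    """Threads in one block, assuming block size = warps * wavefront size."""
    return warp_num_per_block * wavefront_size


def estimate_cu_range(block_num, num_cus, max_blocks_per_cu=1):
    """Return (lower, upper) bounds on the CUs one kernel launch can occupy.

    upper: every block lands on its own CU, capped by the CU count.
    lower: the scheduler packs max_blocks_per_cu blocks onto each CU.
    max_blocks_per_cu is a hypothetical occupancy limit; the real value
    depends on register and LDS usage, which this sketch does not model.
    """
    upper = min(block_num, num_cus)
    lower = min(num_cus, math.ceil(block_num / max_blocks_per_cu))
    return lower, upper


# Hypothetical values: block_num=80, warp_num_per_block=8, a 304-CU GPU.
print(threads_per_block(8))           # 512
print(estimate_cu_range(80, 304, 2))  # (40, 80)
```

If this reasoning is roughly right, block_num would act as an upper bound on the number of CUs the kernel can occupy, but I would appreciate confirmation.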