Beaver.MLIR.Dialect.XeVM (beaver v0.4.7)
Summary
Functions
xevm.blockload2d - 2D block load
xevm.blockload - subgroup block load
xevm.blockprefetch2d - 2D block prefetch
xevm.blockstore2d - 2D block store
xevm.blockstore - subgroup block store
xevm.group_count.x
xevm.group_count.y
xevm.group_count.z
xevm.group_id.x
xevm.group_id.y
xevm.group_id.z
xevm.lane_id
xevm.local_id.x
xevm.local_id.y
xevm.local_id.z
xevm.local_size.x
xevm.local_size.y
xevm.local_size.z
xevm.memfence
xevm.mma - Subgroup matrix multiply-add
xevm.prefetch - Prefetch data into a cache subsystem.
xevm.subgroup_id
xevm.subgroup_size
Functions
xevm.blockload2d - 2D block load
Attributes
- elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
- tile_width - Single, I32Attr, 32-bit signless integer attribute
- tile_height - Single, I32Attr, 32-bit signless integer attribute
- v_blocks - Single, I32Attr, 32-bit signless integer attribute
- transpose - Single, I1Attr, 1-bit signless integer attribute
- pack_register - Single, I1Attr, 1-bit signless integer attribute
- cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators
Operands
- ptr - Single, LLVM_AnyPointer, LLVM pointer type
- base_width - Single, I32, 32-bit signless integer
- base_height - Single, I32, 32-bit signless integer
- base_pitch - Single, I32, 32-bit signless integer
- x - Single, I32, 32-bit signless integer
- y - Single, I32, 32-bit signless integer
Results
- res - Single, anonymous/composite constraint, a vector with length 8 of bfloat16 type values
Description
The xevm.blockload2d operation loads a two dimensional matrix tile
from a base matrix residing in global memory. The parameters are:
- ptr - the base address of the base matrix containing the tile to load
- base_width - the width of the base matrix in number of bytes.
- base_height - the number of rows in the base matrix
- base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
- x, y, tile_width, tile_height - the starting offsets and shape of the tile to load in number of elements.
- elem_size_in_bits - the size in bits of the matrix element type
  - 32 for f32, tf32
  - 16 for f16, int16, bf16
  - 8 for int8
- v_blocks - number of consecutive tiles in innermost dimension direction to load
- transpose - transpose the tile in registers (useful for 32 bit element type)
- pack_register - pack element types narrower than register bit width. [M, N] => [M/factor, N, factor] where factor is register_size_in_bits / elem_size_in_bits
- cache_control - an enumerator that sets the cache behaviour
Notes:
- the transpose and pack_register parameters are mutually exclusive
- transposing the loaded tile is used for the A matrix in the backward path or for the B matrix operand (D = C + A * B), where A has row-major layout and B should have column-major layout in memory.
- if the loaded tile contains out-of-bounds elements of the matrix, they are filled with 0.
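The zero-fill rule in the notes can be illustrated with a small element-level model. This is only a sketch: the function name block_load_2d and the list-of-lists matrix are hypothetical, and the model ignores byte widths, v_blocks, cache control, and how the hardware distributes the tile across lanes.

```python
def block_load_2d(matrix, x, y, tile_width, tile_height):
    """Element-level model of a 2D tile load (hypothetical helper).

    The tile's top-left element is at column x, row y of `matrix`.
    Positions outside the matrix read as 0, mirroring the note that
    out-of-bounds elements are filled with 0.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    tile = []
    for r in range(y, y + tile_height):
        tile_row = []
        for c in range(x, x + tile_width):
            in_bounds = 0 <= r < rows and 0 <= c < cols
            tile_row.append(matrix[r][c] if in_bounds else 0)
        tile.append(tile_row)
    return tile


# A 3-row by 4-column base matrix whose entries encode (row, column).
base = [[10 * r + c for c in range(4)] for r in range(3)]

# A 3x3 tile starting at column 2, row 1 hangs over the right and bottom edges.
tile = block_load_2d(base, x=2, y=1, tile_width=3, tile_height=3)
# tile == [[12, 13, 0], [22, 23, 0], [0, 0, 0]]
```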
Example:
%base_width_a = arith.constant 32 : i32
%base_height_a = arith.constant 8 : i32
%base_pitch_a = arith.constant 32 : i32
%x = arith.constant 0 : i32
%y = arith.constant 0 : i32
%loaded_a = xevm.blockload2d %src, %base_width_a, %base_height_a, %base_pitch_a, %x, %y
<{elem_size_in_bits=16 : i32, tile_width=16 : i32, tile_height=8 : i32,
v_blocks=1 : i32, transpose=false, pack_register=false,
cache_control=#xevm.load_cache_control<Default>}>
: (!llvm.ptr<1>, i32, i32, i32, i32, i32) -> vector<8xi16>
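The numbers in the example above can be cross-checked with a little arithmetic. The sketch below rests on assumptions the source does not state: a subgroup size of 16, a tile spread evenly across the lanes, and a 32-bit register width for pack_register. All helper names are hypothetical.

```python
def row_bytes(tile_width, elem_size_in_bits):
    """Bytes covered by one tile row (tile_width is in elements)."""
    return tile_width * elem_size_in_bits // 8


def elements_per_lane(tile_width, tile_height, v_blocks, subgroup_size):
    """Elements each lane holds if the tile is split evenly (assumption)."""
    total = tile_width * tile_height * v_blocks
    if total % subgroup_size != 0:
        raise ValueError("tile does not divide evenly across the subgroup")
    return total // subgroup_size


def packed_shape(m, n, elem_size_in_bits, register_size_in_bits=32):
    """[M, N] => [M/factor, N, factor]; the 32-bit register size is assumed."""
    factor = register_size_in_bits // elem_size_in_bits
    return (m // factor, n, factor)


# Values from the example: 16-bit elements, a 16 x 8 tile, v_blocks = 1.
example_row_bytes = row_bytes(16, 16)               # 32, equal to base_width_a
example_per_lane = elements_per_lane(16, 8, 1, 16)  # 8, matching vector<8xi16>
example_packed = packed_shape(8, 16, 16)            # (4, 16, 2)
```

Under these assumptions the tile row is exactly as wide as the base matrix (32 bytes), and each lane receives 8 elements, which agrees with the vector<8xi16> result type in the example.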
xevm.blockload - subgroup block load
Attributes
- cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators
Operands
- ptr - Single, LLVM_AnyPointer, LLVM pointer type
Results
- res - Single, anonymous/composite constraint, fixed-length vector of 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer values of ranks 1
Description
Reads one or more components of Result data for each invocation
in the subgroup from the specified ptr as a block operation.
The data is read strided, so the first value read is:
ptr[ SubgroupLocalInvocationId ]
and the second value read is:
ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]
Result type may be a scalar or vector type of scalar element type.
The parameters are:
- ptr - the base address to load from. Must be uniform across subgroup.
- cache_control - an enumerator that sets the cache behaviour
Example:
%loaded_a = xevm.blockload %src
<{cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
: (!llvm.ptr<1>) -> vector<4xi16>
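The strided read pattern described above can be modelled in a few lines. This is a sketch only: memory is a plain Python list indexed in elements, subgroup_size stands in for SubgroupMaxSize, and the function name is hypothetical.

```python
def subgroup_block_load(memory, base, subgroup_size, components):
    """Model of the strided block read (hypothetical helper).

    Lane `lane` receives, as its k-th component, the element at
    base + lane + k * subgroup_size.
    """
    return [
        [memory[base + lane + k * subgroup_size] for k in range(components)]
        for lane in range(subgroup_size)
    ]


# 32 elements, a subgroup of 8 lanes, 4 components per lane.
memory = list(range(100, 132))
per_lane = subgroup_block_load(memory, base=0, subgroup_size=8, components=4)
# per_lane[0] == [100, 108, 116, 124]
# per_lane[3] == [103, 111, 119, 127]
```

Consecutive lanes read consecutive elements for each component, which is why the operation can be issued as one block access per component.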
xevm.blockprefetch2d - 2D block prefetch
Attributes
- elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
- tile_width - Single, I32Attr, 32-bit signless integer attribute
- tile_height - Single, I32Attr, 32-bit signless integer attribute
- v_blocks - Single, I32Attr, 32-bit signless integer attribute
- cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators
Operands
- ptr - Single, LLVM_AnyPointer, LLVM pointer type
- base_width - Single, I32, 32-bit signless integer
- base_height - Single, I32, 32-bit signless integer
- base_pitch - Single, I32, 32-bit signless integer
- x - Single, I32, 32-bit signless integer
- y - Single, I32, 32-bit signless integer
Description
The xevm.blockprefetch2d operation prefetches a two dimensional tile
from a larger base matrix residing in global memory. The parameters are:
- ptr - the base address of the base matrix containing the tile to prefetch
- base_width - the width of the base matrix in number of bytes.
- base_height - the number of rows in the base matrix
- base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
- x, y, tile_width, tile_height - the starting offsets and shape of the tile to prefetch in number of elements.
- elem_size_in_bits - the size in bits of the matrix element
  - 32 for f32, tf32
  - 16 for f16, int16, bf16
  - 8 for int8, int4, int2
- v_blocks - number of tiles in innermost dimension direction to prefetch
- cache_control - an enumerator that sets the cache behaviour
Example:
xevm.blockprefetch2d %ptr, %base_width, %base_height, %base_pitch, %x, %y
<{elem_size_in_bits=8 : i32, tile_width=32 : i32, tile_height=8 : i32,
v_blocks=1 : i32, cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
: (!llvm.ptr<1>, i32, i32, i32, i32, i32)
xevm.blockstore2d - 2D block store
Attributes
- elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
- tile_width - Single, I32Attr, 32-bit signless integer attribute
- tile_height - Single, I32Attr, 32-bit signless integer attribute
- cache_control - Optional, XeVM_StoreCacheControlAttr, Describe the cache settings for store operators
Operands
- ptr - Single, LLVM_AnyPointer, LLVM pointer type
- base_width - Single, I32, 32-bit signless integer
- base_height - Single, I32, 32-bit signless integer
- base_pitch - Single, I32, 32-bit signless integer
- x - Single, I32, 32-bit signless integer
- y - Single, I32, 32-bit signless integer
- stored_val - Single, anonymous/composite constraint, a vector with length 8 of bfloat16 type values
Description
The xevm.blockstore2d operation stores a two dimensional tile into a
larger matrix residing in global memory. The parameters are:
- ptr - the base address of the target matrix where to store the tile
- base_width - the width of the base matrix in number of bytes.
- base_height - the number of rows in the base matrix
- base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
- x, y, tile_width, tile_height - the starting offsets and shape of the tile to store in number of elements.
- elem_size_in_bits - the size in bits of the matrix element
  - 32 for f32, tf32
  - 16 for f16, int16, bf16
  - 8 for int8
- cache_control - an enumerator that sets the cache behaviour
- stored_val - the tile to store
Example:
%base_width_c = arith.constant 64 : i32
%base_height_c = arith.constant 8 : i32
%base_pitch_c = arith.constant 64 : i32
%x = arith.constant 0 : i32
%y = arith.constant 0 : i32
xevm.blockstore2d %dst, %base_width_c, %base_height_c, %base_pitch_c, %x, %y, %src
<{elem_size_in_bits=32 : i32, tile_width=16 : i32, tile_height=8 : i32,
cache_control=#xevm.store_cache_control<Default>}>
: (!llvm.ptr<1>, i32, i32, i32, i32, i32, vector<8xi32>)
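Because base_width and base_pitch are given in bytes while x and tile_width are in elements, it can help to work out which bytes a tile touches. The sketch below assumes the usual row-pitch layout, where row r starts r * base_pitch bytes after ptr; that layout and the helper name are assumptions, not something the source spells out.

```python
def tile_row_byte_ranges(x, y, tile_width, tile_height,
                         base_pitch, elem_size_in_bits):
    """Half-open (start, end) byte ranges, relative to ptr, for each tile row.

    Assumes row r of the base matrix begins at byte r * base_pitch.
    """
    elem_bytes = elem_size_in_bits // 8
    ranges = []
    for row in range(tile_height):
        start = (y + row) * base_pitch + x * elem_bytes
        ranges.append((start, start + tile_width * elem_bytes))
    return ranges


# Values from the example: 32-bit elements, a 16 x 8 tile at (0, 0), pitch 64.
ranges = tile_row_byte_ranges(x=0, y=0, tile_width=16, tile_height=8,
                              base_pitch=64, elem_size_in_bits=32)
# ranges[0] == (0, 64) and ranges[-1] == (448, 512)
```

With these values each tile row spans all 64 bytes of base_width_c, so the tile covers the whole 8-row matrix.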
xevm.blockstore - subgroup block store
Attributes
- cache_control - Optional, XeVM_StoreCacheControlAttr, Describe the cache settings for store operators
Operands
- ptr - Single, LLVM_AnyPointer, LLVM pointer type
- val - Single, anonymous/composite constraint, fixed-length vector of 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer values of ranks 1
Description
Writes one or more components of val for each invocation
in the subgroup to the specified ptr as a block operation.
The data is written strided, so the first value is written to:
ptr[ SubgroupLocalInvocationId ]
and the second value is written to:
ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]
val type may be a scalar or vector type of scalar element type.
The parameters are:
- ptr - the base address to store to. Must be uniform across subgroup.
- val - the value to store
- cache_control - an enumerator that sets the cache behaviour
Example:
xevm.blockstore %ptr, %val
<{cache_control=#xevm.store_cache_control<L1uc_L2uc_L3uc>}>
: (!llvm.ptr<1>, vector<4xi16>)
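The strided write pattern can be modelled as a scatter from per-lane vectors into a flat buffer. This is a sketch under simplifying assumptions: memory is a Python list indexed in elements, subgroup_size stands in for SubgroupMaxSize, and the function name is hypothetical.

```python
def subgroup_block_store(memory, base, per_lane_values, subgroup_size):
    """Model of the strided block write (hypothetical helper).

    The k-th component of lane `lane` is written to
    base + lane + k * subgroup_size.
    """
    for lane, values in enumerate(per_lane_values):
        for k, value in enumerate(values):
            memory[base + lane + k * subgroup_size] = value
    return memory


# Four lanes, two components each, written into an 8-element buffer.
buffer = [0] * 8
subgroup_block_store(buffer, 0, [[1, 5], [2, 6], [3, 7], [4, 8]],
                     subgroup_size=4)
# buffer == [1, 2, 3, 4, 5, 6, 7, 8]
```

The first components of all lanes land contiguously, followed by the second components, so the buffer ends up in component-major order.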
xevm.group_count.x
xevm.group_count.y
xevm.group_count.z
xevm.group_id.x
xevm.group_id.y
xevm.group_id.z
xevm.lane_id
xevm.local_id.x
xevm.local_id.y
xevm.local_id.z
xevm.local_size.x
xevm.local_size.y
xevm.local_size.z
xevm.memfence
xevm.mma - Subgroup matrix multiply-add
Attributes
- shape - Single, XeVM_MMAShapeAttr
- types - Single, XeVM_MMATypesAttr
Operands
- a - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
- b - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
- c - Optional, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
Results
- d - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
Description
The xevm.mma operation is a cooperative operation in which all threads/lanes in
a subgroup participate and carry out matrix multiplication plus accumulation:
D = C + A x B
where the A, B, C input matrices and the result D have shapes:
- D : MxN
- C : MxN
- A : MxK
- B : KxN
Parameters:
- a - vector of matrix A elements.
- b - vector of matrix B elements.
- c - (optional) vector of matrix C elements.
- shape - the shape of the matrices, specified as M, N, and K values.
- types - the data types of the matrices, specified as D, A, B, and optionally C.
Example:
%d = xevm.mma %a, %b, %c { shape=<m=8, n=16, k=16>, types=<d=f32, a=f16, b=f16, c=f32> }
: (vector<8xi16>, vector<8xi32>, vector<8xf32>) -> vector<8xf32>
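As a reference for the arithmetic only, the sketch below computes D = C + A x B on whole matrices and then checks the per-lane vector lengths in the example. It does not model how elements are laid out across lanes. The subgroup size of 16 and the packing of f16 values into i16 or i32 containers are assumptions inferred from the example's operand types, and both helper names are hypothetical.

```python
def mma_reference(a, b, c=None):
    """Whole-matrix D = C + A x B; C is treated as zero when omitted."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("A is MxK, so B must have K rows")
    d = [[(c[i][j] if c is not None else 0) for j in range(n)]
         for i in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]
    return d


def lane_vector_length(rows, cols, elem_bits, container_bits, subgroup_size=16):
    """Containers per lane if the matrix is split evenly (assumption)."""
    return rows * cols * elem_bits // (subgroup_size * container_bits)


d = mma_reference([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])
# d == [[20, 23], [44, 51]]

# Example shape m=8, n=16, k=16 with f16 inputs and f32 accumulators.
len_a = lane_vector_length(8, 16, 16, 16)   # 8, matching vector<8xi16>
len_b = lane_vector_length(16, 16, 16, 32)  # 8, matching vector<8xi32>
len_d = lane_vector_length(8, 16, 32, 32)   # 8, matching vector<8xf32>
```

Under these assumptions B holds twice as many f16 elements as A, which is consistent with it being passed as 32-bit containers while A uses 16-bit ones.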
xevm.prefetch - Prefetch data into a cache subsystem.
Attributes
- cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators
Operands
- ptr - Single, anonymous/composite constraint, vector of any type values
Description
Work-item issues a prefetch from global memory to cache:
- ptr - LLVM pointer with address space. Address space must be 1 (global) or 4 (generic)
- cache_control - specify caching options
xevm.subgroup_id
xevm.subgroup_size