Beaver.MLIR.Dialect.XeVM (beaver v0.4.7)

Summary

Functions

xevm.blockload2d - 2D block load

xevm.blockload - subgroup block load

xevm.blockprefetch2d - 2D block prefetch

xevm.blockstore2d - 2D block store

xevm.blockstore - subgroup block store

xevm.group_count.x

xevm.group_count.y

xevm.group_count.z

xevm.group_id.x

xevm.group_id.y

xevm.group_id.z

xevm.lane_id

xevm.local_id.x

xevm.local_id.y

xevm.local_id.z

xevm.local_size.x

xevm.local_size.y

xevm.local_size.z

xevm.memfence

xevm.mma - Subgroup matrix multiply-add

xevm.prefetch - Prefetch data into a cache subsystem.

xevm.subgroup_id

xevm.subgroup_size

Functions

blockload2d(ssa)

xevm.blockload2d - 2D block load

Attributes

  • elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
  • tile_width - Single, I32Attr, 32-bit signless integer attribute
  • tile_height - Single, I32Attr, 32-bit signless integer attribute
  • v_blocks - Single, I32Attr, 32-bit signless integer attribute
  • transpose - Single, I1Attr, 1-bit signless integer attribute
  • pack_register - Single, I1Attr, 1-bit signless integer attribute
  • cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • base_width - Single, I32, 32-bit signless integer
  • base_height - Single, I32, 32-bit signless integer
  • base_pitch - Single, I32, 32-bit signless integer
  • x - Single, I32, 32-bit signless integer
  • y - Single, I32, 32-bit signless integer

Results

  • res - Single, anonymous/composite constraint, a vector with length 8 of bfloat16 type values

Description

The xevm.blockload2d operation loads a two-dimensional matrix tile from a base matrix residing in global memory. The parameters are:

  • ptr - the base address of the base matrix containing the tile to load
  • base_width - the width of the base matrix in number of bytes.
  • base_height - the number of rows in the base matrix
  • base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
  • x, y, tile_width, tile_height - the starting offsets and shape of the tile to load in number of elements.
  • elem_size_in_bits - the size in bits of the matrix element type
    • 32 for f32, tf32
    • 16 for f16, int16, bf16
    • 8 for int8
  • v_blocks - number of consecutive tiles in innermost dimension direction to load
  • transpose - transpose the tile in registers (useful for 32 bit element type)
  • pack_register - pack element types narrower than register bit width. [M, N] => [M/factor, N, factor] where factor is register_size_in_bits / elem_size_in_bits
  • cache_control - an enumerator that sets the cache behaviour
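The pack_register shape rule above can be sketched in plain Python. This is only an illustrative model, not the dialect's implementation: the function name `pack_register`, the default register width of 32 bits, and the choice to group `factor` consecutive rows of the same column into the innermost dimension are all assumptions made here; the description above only fixes the resulting shape.

```python
def pack_register(tile, elem_size_in_bits, register_size_in_bits=32):
    """Reshape an M x N tile into [M/factor][N][factor].

    Assumption: `factor` consecutive rows of one column are grouped together
    (the source only states the resulting shape, not the element order).
    """
    factor = register_size_in_bits // elem_size_in_bits
    m, n = len(tile), len(tile[0])
    if m % factor != 0:
        raise ValueError("M must be divisible by the packing factor")
    return [
        [[tile[i * factor + f][j] for f in range(factor)] for j in range(n)]
        for i in range(m // factor)
    ]


# A 4 x 2 tile of 16-bit elements, packed into 32-bit registers (factor = 2).
tile = [[1, 2], [3, 4], [5, 6], [7, 8]]
packed = pack_register(tile, elem_size_in_bits=16)
```

With 16-bit elements the factor is 2, so the 4 x 2 tile becomes a 2 x 2 x 2 structure.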

Notes:

  • the transpose and pack_register parameters are mutually exclusive
  • transposing the loaded tile is useful for the A matrix in the backward path and for the B matrix operand (D = C + A * B), where A has row-major layout and B should have column-major layout in memory.
  • if the loaded tile contains out-of-bounds elements of the matrix, they are filled with 0.
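The zero-fill rule for out-of-bounds elements can be modelled with a small reference function. This is a sketch under simplifying assumptions: the base matrix is a Python list of rows, `load_tile_2d` is a hypothetical name, and base_pitch, v_blocks, transpose, and pack_register are ignored.

```python
def load_tile_2d(matrix, base_width_bytes, base_height, elem_size_in_bits,
                 x, y, tile_width, tile_height):
    """Return a tile_height x tile_width tile starting at column x, row y.

    base_width_bytes is in bytes; x, y and the tile shape are in elements.
    Elements outside the base matrix are returned as 0.
    """
    elem_bytes = elem_size_in_bits // 8
    width_in_elems = base_width_bytes // elem_bytes
    tile = []
    for r in range(y, y + tile_height):
        row = []
        for c in range(x, x + tile_width):
            in_bounds = 0 <= r < base_height and 0 <= c < width_in_elems
            row.append(matrix[r][c] if in_bounds else 0)
        tile.append(row)
    return tile


# A 2 x 4 matrix of 16-bit elements: 4 elements * 2 bytes = 8 bytes wide.
base = [[1, 2, 3, 4], [5, 6, 7, 8]]
tile = load_tile_2d(base, base_width_bytes=8, base_height=2,
                    elem_size_in_bits=16, x=2, y=1,
                    tile_width=3, tile_height=2)
```

Here the requested tile extends one column past the right edge and one row past the bottom edge, so those positions come back as 0.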

Example:

  %base_width_a = arith.constant 32 : i32
  %base_height_a = arith.constant 8 : i32
  %base_pitch_a = arith.constant 32 : i32
  %x = arith.constant 0 : i32
  %y = arith.constant 0 : i32
  %loaded_a = xevm.blockload2d %src, %base_width_a, %base_height_a, %base_pitch_a, %x, %y
                <{elem_size_in_bits=16 : i32, tile_width=16 : i32, tile_height=8 : i32,
                  v_blocks=1 : i32, transpose=false, pack_register=false,
                  cache_control=#xevm.load_cache_control<Default>}>
                : (!llvm.ptr<1>, i32, i32, i32, i32, i32) -> vector<8xi16>

blockload(ssa)

xevm.blockload - subgroup block load

Attributes

  • cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type

Results

  • res - Single, anonymous/composite constraint, fixed-length vector of 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer values of ranks 1

Description

Reads one or more components of Result data for each invocation in the subgroup from the specified ptr as a block operation. The data is read strided, so the first value read is:

  ptr[ SubgroupLocalInvocationId ]

and the second value read is:

  ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]

Result type may be a scalar or vector type of scalar element type.

The parameters are:

  • ptr - the base address to load from. Must be uniform across subgroup.
  • cache_control - an enumerator that sets the cache behaviour
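The strided addressing described above can be sketched as a reference model. This is an illustration only: memory is a flat Python list indexed in elements, `block_load` is a hypothetical name, and the subgroup size is passed explicitly in place of SubgroupMaxSize.

```python
def block_load(memory, base, subgroup_size, components):
    """Return, for each lane, the values that lane would read.

    Component i of lane `lane` comes from memory[base + lane + i * subgroup_size].
    """
    return [
        [memory[base + lane + i * subgroup_size] for i in range(components)]
        for lane in range(subgroup_size)
    ]


memory = list(range(100, 132))
per_lane = block_load(memory, base=0, subgroup_size=4, components=2)
```

Consecutive lanes read consecutive elements, and each further component is one subgroup width further along.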

Example:

  %loaded_a = xevm.blockload %src
                  <{cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
                : (!llvm.ptr<1>) -> vector<4xi16>

blockprefetch2d(ssa)

xevm.blockprefetch2d - 2D block prefetch

Attributes

  • elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
  • tile_width - Single, I32Attr, 32-bit signless integer attribute
  • tile_height - Single, I32Attr, 32-bit signless integer attribute
  • v_blocks - Single, I32Attr, 32-bit signless integer attribute
  • cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • base_width - Single, I32, 32-bit signless integer
  • base_height - Single, I32, 32-bit signless integer
  • base_pitch - Single, I32, 32-bit signless integer
  • x - Single, I32, 32-bit signless integer
  • y - Single, I32, 32-bit signless integer

Description

The xevm.blockprefetch2d operation prefetches a two-dimensional tile from a larger base matrix residing in global memory. The parameters are:

  • ptr - the base address of the base matrix containing the tile to prefetch
  • base_width - the width of the base matrix in number of bytes.
  • base_height - the number of rows in the base matrix
  • base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
  • x, y, tile_width, tile_height - the starting offsets and shape of tile to prefetch in number of elements.
  • elem_size_in_bits - the size in bits of the matrix element
    • 32 for f32, tf32
    • 16 for f16, int16, bf16
    • 8 for int8, int4, int2
  • v_blocks - number of tiles in innermost dimension direction to prefetch
  • cache_control - an enumerator that sets the cache behaviour
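As a rough guide to how much data a prefetch covers, the tile parameters can be combined into a byte count. This is a sketch assuming the v_blocks tiles are distinct and fully inside the base matrix; `prefetch_footprint_bytes` is a hypothetical helper, not part of the dialect.

```python
def prefetch_footprint_bytes(elem_size_in_bits, tile_width, tile_height, v_blocks):
    """Bytes covered by v_blocks tiles of tile_width x tile_height elements."""
    return tile_width * tile_height * v_blocks * elem_size_in_bits // 8


# Same parameters as the example for this operation.
footprint = prefetch_footprint_bytes(elem_size_in_bits=8, tile_width=32,
                                     tile_height=8, v_blocks=1)
```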

Example:

  xevm.blockprefetch2d %ptr, %base_width, %base_height, %base_pitch, %x, %y
    <{elem_size_in_bits=8 : i32, tile_width=32 : i32, tile_height=8 : i32,
      v_blocks=1 : i32, cache_control=#xevm.load_cache_control<L1uc_L2uc_L3uc>}>
    : (!llvm.ptr<1>, i32, i32, i32, i32, i32)

blockstore2d(ssa)

xevm.blockstore2d - 2D block store

Attributes

  • elem_size_in_bits - Single, I32Attr, 32-bit signless integer attribute
  • tile_width - Single, I32Attr, 32-bit signless integer attribute
  • tile_height - Single, I32Attr, 32-bit signless integer attribute
  • cache_control - Optional, XeVM_StoreCacheControlAttr, Describe the cache settings for store operators

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • base_width - Single, I32, 32-bit signless integer
  • base_height - Single, I32, 32-bit signless integer
  • base_pitch - Single, I32, 32-bit signless integer
  • x - Single, I32, 32-bit signless integer
  • y - Single, I32, 32-bit signless integer
  • stored_val - Single, anonymous/composite constraint, a vector with length 8 of bfloat16 type values

Description

The xevm.blockstore2d operation stores a two-dimensional tile into a larger matrix residing in global memory. The parameters are:

  • ptr - the base address of the target matrix where to store the tile
  • base_width - the width of the base matrix in number of bytes.
  • base_height - the number of rows in the base matrix
  • base_pitch - the physical stride between the first columns of the current row and the subsequent row in number of bytes.
  • x, y, tile_width, tile_height - the starting offsets and shape of the tile to store in number of elements.
  • elem_size_in_bits - the size in bits of the matrix element
    • 32 for f32, tf32
    • 16 for f16, int16, bf16
    • 8 for int8
  • cache_control - an enumerator that sets the cache behaviour
  • stored_val - the tile to store
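The relationship between base_pitch (in bytes) and the x, y offsets (in elements) can be sketched by computing the byte offset of every tile element relative to ptr. This is an illustrative model with a hypothetical function name; it does not model bounds handling or how elements are distributed across lanes.

```python
def tile_byte_offsets(base_pitch, elem_size_in_bits, x, y, tile_width, tile_height):
    """Byte offset from ptr for each element of the tile, row by row.

    Rows are base_pitch bytes apart; columns are one element apart.
    """
    elem_bytes = elem_size_in_bits // 8
    return [
        [(y + r) * base_pitch + (x + c) * elem_bytes for c in range(tile_width)]
        for r in range(tile_height)
    ]


# An 8-row tile of 16 32-bit elements written at the origin of a matrix
# whose rows are 64 bytes apart.
offsets = tile_byte_offsets(base_pitch=64, elem_size_in_bits=32,
                            x=0, y=0, tile_width=16, tile_height=8)
```

With these values each tile row fills exactly one 64-byte matrix row, so the tile covers the matrix contiguously.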

Example:

  %base_width_c = arith.constant 64 : i32
  %base_height_c = arith.constant 8 : i32
  %base_pitch_c = arith.constant 64 : i32
  %x = arith.constant 0 : i32
  %y = arith.constant 0 : i32
  xevm.blockstore2d %dst, %base_width_c, %base_height_c, %base_pitch_c, %x, %y, %src
    <{elem_size_in_bits=32 : i32, tile_width=16 : i32, tile_height=8 : i32,
      cache_control=#xevm.store_cache_control<Default>}>
    : (!llvm.ptr<1>, i32, i32, i32, i32, i32, vector<8xi32>)

blockstore(ssa)

xevm.blockstore - subgroup block store

Attributes

  • cache_control - Optional, XeVM_StoreCacheControlAttr, Describe the cache settings for store operators

Operands

  • ptr - Single, LLVM_AnyPointer, LLVM pointer type
  • val - Single, anonymous/composite constraint, fixed-length vector of 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer values of ranks 1

Description

Writes one or more components of val for each invocation in the subgroup to the specified ptr as a block operation. The data is written strided, so the first value is written to:

  ptr[ SubgroupLocalInvocationId ]

and the second value is written to:

  ptr[ SubgroupLocalInvocationId + SubgroupMaxSize ]

val type may be a scalar or vector type of scalar element type.

The parameters are:

  • ptr - the base address to store to. Must be uniform across subgroup.
  • val - the value to store
  • cache_control - an enumerator that sets the cache behaviour
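The strided write pattern can be sketched as a reference model. As an illustration only, memory is a flat Python list indexed in elements, `block_store` is a hypothetical name, and the number of lanes supplied stands in for SubgroupMaxSize.

```python
def block_store(memory, base, per_lane_values):
    """Write each lane's components into memory in place.

    Component i of lane `lane` goes to memory[base + lane + i * subgroup_size].
    """
    subgroup_size = len(per_lane_values)
    for lane, values in enumerate(per_lane_values):
        for i, value in enumerate(values):
            memory[base + lane + i * subgroup_size] = value


memory = [0] * 8
block_store(memory, base=0, per_lane_values=[[1, 5], [2, 6], [3, 7], [4, 8]])
```

The first component of every lane lands in one contiguous run, followed by the second component of every lane.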

Example:

  xevm.blockstore %ptr, %val
    <{cache_control=#xevm.store_cache_control<L1uc_L2uc_L3uc>}>
    : (!llvm.ptr<1>, vector<4xi16>)

group_count_x(ssa)

xevm.group_count.x

group_count_y(ssa)

xevm.group_count.y

group_count_z(ssa)

xevm.group_count.z

group_id_x(ssa)

xevm.group_id.x

group_id_y(ssa)

xevm.group_id.y

group_id_z(ssa)

xevm.group_id.z

lane_id(ssa)

xevm.lane_id

local_id_x(ssa)

xevm.local_id.x

local_id_y(ssa)

xevm.local_id.y

local_id_z(ssa)

xevm.local_id.z

local_size_x(ssa)

xevm.local_size.x

local_size_y(ssa)

xevm.local_size.y

local_size_z(ssa)

xevm.local_size.z

memfence(ssa)

xevm.memfence

mma(ssa)

xevm.mma - Subgroup matrix multiply-add

Attributes

  • shape - Single, XeVM_MMAShapeAttr
  • types - Single, XeVM_MMATypesAttr

Operands

  • a - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
  • b - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1
  • c - Optional, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1

Results

  • d - Single, anonymous/composite constraint, fixed-length vector of 8-bit integer or 16-bit integer or 32-bit integer or 32-bit float or tf32 type or 16-bit float or bfloat16 type values of ranks 1

Description

The xevm.mma operation is a cooperative operation in which all threads/lanes in a subgroup participate and carry out matrix multiplication plus accumulation:

D = C + A x B

where the A, B, C input matrices and the result D have shapes:

- D : MxN
- C : MxN
- A : MxK
- B : KxN

Parameters:

  • a - vector of matrix A elements.
  • b - vector of matrix B elements.
  • c - (optional) vector of matrix C elements.
  • shape - the shape of the matrices, specified as M, N, and K values.
  • types - the data types of the matrices, specified as D, A, B, and optionally C.
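The arithmetic D = C + A x B can be written as a scalar reference. This sketch models only the mathematical result on whole matrices held as Python lists; it says nothing about how elements are distributed across lanes or about precision, and `mma_reference` is a hypothetical name.

```python
def mma_reference(a, b, c=None):
    """Compute D = C + A x B for A (M x K), B (K x N), optional C (M x N)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("A must have K columns")
    d = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = c[i][j] if c is not None else 0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            d[i][j] = acc
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
d_with_c = mma_reference(a, b, c)
d_without_c = mma_reference(a, b)
```

When c is omitted, the accumulator starts from zero, which matches treating C as optional.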

Example:

  %d = xevm.mma %a, %b, %c { shape=<m=8, n=16, k=16>, types=<d=f32, a=f16, b=f16, c=f32> }
         : (vector<8xi16>, vector<8xi32>, vector<8xf32>) -> vector<8xf32>
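The operand vector lengths in this example are consistent with dividing each matrix evenly across the lanes of a subgroup. The arithmetic below is a sketch that assumes a subgroup size of 16 and that two f16 values of B are packed into each i32; neither assumption is stated in the description above.

```python
def elements_per_lane(rows, cols, subgroup_size):
    """Number of matrix elements each lane holds when split evenly."""
    total = rows * cols
    if total % subgroup_size != 0:
        raise ValueError("matrix does not divide evenly across the subgroup")
    return total // subgroup_size


M, N, K = 8, 16, 16
SUBGROUP_SIZE = 16  # assumption, not given in the source

a_per_lane = elements_per_lane(M, K, SUBGROUP_SIZE)  # f16 values of A per lane
b_per_lane = elements_per_lane(K, N, SUBGROUP_SIZE)  # f16 values of B per lane
c_per_lane = elements_per_lane(M, N, SUBGROUP_SIZE)  # f32 values of C per lane

# Assumption: B's 16-bit values are packed pairwise into 32-bit integers.
b_packed_i32 = b_per_lane * 16 // 32
```

Under these assumptions the counts line up with vector<8xi16> for A, vector<8xi32> for B, and vector<8xf32> for C and D.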

prefetch(ssa)

xevm.prefetch - Prefetch data into a cache subsystem.

Attributes

  • cache_control - Optional, XeVM_LoadCacheControlAttr, Describe the cache settings for load operators

Operands

  • ptr - Single, anonymous/composite constraint, vector of any type values

Description

A work-item issues a prefetch from global memory to cache. The parameters are:

  • ptr - LLVM pointer with address space. Address space must be 1 (global) or 4 (generic)
  • cache_control - specify caching options

subgroup_id(ssa)

xevm.subgroup_id

subgroup_size(ssa)

xevm.subgroup_size