Beaver.MLIR.Dialect.XeGPU (beaver v0.4.7)

Summary

Functions

xegpu.alloc_nbarrier

xegpu.atomic_rmw - Atomic read-modify-write operation on the TensorDesc.

xegpu.convert_layout - Convert the layout of the input operand

xegpu.create_mem_desc - Create a memory descriptor.

xegpu.create_nd_tdesc - Create nd-tensor descriptor operation

xegpu.create_tdesc - create scattered tensor descriptors (TensorDesc).

xegpu.dpas - It performs mma computation

xegpu.fence

xegpu.init_nbarrier - It assigns a named barrier to the current thread.

xegpu.load - load a set of scattered data points from memory.

xegpu.load_matrix

xegpu.load_nd - loads an n-D block from memory (represented by TensorDesc) to registers (represented by vector)

xegpu.mem_desc_subview

xegpu.nbarrier_arrive - It signals the arrival at the named barrier.

xegpu.nbarrier_wait - It waits for a named barrier.

xegpu.prefetch - prefetches a set of scattered data points to cache

xegpu.prefetch_nd - prefetches a n-D block to cache

xegpu.store - store data to scattered memory locations.

xegpu.store_matrix

xegpu.store_nd - stores a n-D block register region back to memory, currently only supports 2D

xegpu.update_nd_offset - It updates the offsets for the TensorDesc.

xegpu.update_offset - It updates the offsets for the given tensor descriptor

Functions

alloc_nbarrier(ssa)

xegpu.alloc_nbarrier

atomic_rmw(ssa)

xegpu.atomic_rmw - Atomic read-modify-write operation on the TensorDesc.

Attributes

  • kind - Single, AtomicRMWKindAttr, allowed 64-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Operands

  • tensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • mask - Single, XeGPU_MaskType, fixed-length vector of 1-bit signless integer values
  • value - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values

Results

  • result - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values

Description

The xegpu.atomic_rmw operation performs a read-modify-write on the region described by the TensorDesc, free from data races. The kind enumeration specifies the modification to be performed. The mask operand has the same shape as the TensorDesc and is used to enable or disable specific data points of the TensorDesc. The value operand represents the new value to be applied during the modification.
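The role of the mask and value operands described above can be modelled with a short Python sketch. It rests on assumptions not stated in this page: the helper name masked_atomic_rmw is hypothetical, the result is assumed to hold the value read before the modification, and disabled lanes are given a placeholder result of 0. It only illustrates the per-lane behaviour and says nothing about how the hardware guarantees atomicity.

```python
from typing import Callable, List


def masked_atomic_rmw(memory: List[int], offsets: List[int], mask: List[bool],
                      values: List[int],
                      modify: Callable[[int, int], int]) -> List[int]:
    """Hypothetical model of a masked read-modify-write over scattered points.

    Assumptions (not confirmed by the documentation above): the result is the
    value read before modification, and disabled lanes yield a placeholder 0.
    """
    result = []
    for off, enabled, val in zip(offsets, mask, values):
        if not enabled:
            result.append(0)  # placeholder: this lane does not touch memory
            continue
        old = memory[off]
        memory[off] = modify(old, val)  # e.g. an integer add for one "kind"
        result.append(old)
    return result


memory = [10, 20, 30, 40]
old = masked_atomic_rmw(memory, [0, 1, 2, 3], [True, False, True, True],
                        [1, 1, 1, 1], lambda a, b: a + b)
```

Here lane 1 is masked off, so memory[1] is left untouched while the other three locations are incremented.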

convert_layout(ssa)

xegpu.convert_layout - Convert the layout of the input operand

Attributes

  • input_layout - Single, DistributeLayoutAttr, DistributeLayoutAttr instance
  • target_layout - Single, DistributeLayoutAttr, DistributeLayoutAttr instance

Operands

  • source - Single, XeGPU_VectorType, vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2/3/4/5/6

Results

  • result - Single, XeGPU_VectorType, vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2/3/4/5/6

Description

convert_layout redistributes data across subgroups and/or work-items from the input_layout to the target_layout. Both input_layout and target_layout must correspond to the same programming scope, such as workgroup-level (wg) or subgroup-level (sg) code. This operation is not valid once the IR is lowered to the work-item (WI) level, because that is the end result of all distributions.

create_mem_desc(ssa)

xegpu.create_mem_desc - Create a memory descriptor.

Operands

  • source - Single, anonymous/composite constraint, statically shaped memref of 8-bit signless integer values for shared memory

Results

  • mem_desc - Single, XeGPU_MemDesc, MemDesc describing the data in SLM

Description

Creates a memory descriptor from a shared local memory (SLM) buffer and an XeGPU-specific memory layout. The resulting memory descriptor must have the same size as the underlying shared local memory.

Arguments:

  • source : a 1D statically shaped memref with element type i8, representing the raw SLM buffer.

Results:

  • mem_desc : the memory descriptor.
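The size requirement above (the descriptor must cover exactly the raw i8 buffer) can be checked with simple arithmetic. The following Python sketch is only an illustration; the function name mem_desc_matches_slm and its parameters are hypothetical and not part of XeGPU or Beaver.

```python
from math import prod
from typing import Sequence


def mem_desc_matches_slm(slm_bytes: int, shape: Sequence[int],
                         elem_bits: int) -> bool:
    """Return True when a descriptor of `shape` with `elem_bits`-wide elements
    occupies exactly `slm_bytes` bytes of the raw i8 SLM buffer."""
    # Compare in bits so sub-byte element types do not need special handling.
    return prod(shape) * elem_bits == slm_bytes * 8


# A 2048-byte i8 buffer viewed as 32x32 elements of a 16-bit type.
fits = mem_desc_matches_slm(2048, (32, 32), 16)
# The same buffer is too small for 32x32 elements of a 32-bit type.
too_small = mem_desc_matches_slm(2048, (32, 32), 32)
```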

create_nd_tdesc(ssa)

xegpu.create_nd_tdesc - Create nd-tensor descriptor operation

Attributes

  • const_offsets - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • const_shape - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • const_strides - Optional, DenseI64ArrayAttr, i64 dense array attribute

Operands

  • source - Single, XeGPU_BaseAddrType, non-0-ranked.memref of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 64-bit unsigned integer or 32-bit unsigned integer or 64-bit signless integer or 32-bit signless integer
  • offsets - Variadic, Index, variadic of index
  • shape - Variadic, Index, variadic of index
  • strides - Variadic, Index, variadic of index

Results

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.

Description

The "create_nd_tdesc" operation creates a TensorDescType that represents a sub-view of a 1D/2D memory region inside the one or two innermost dimensions of the source. (It can be extended to support n-D memory regions in the future if needed.) Elements in the sub-view are contiguous in each dimension. It encodes the following important information for supporting Intel hardware features:

Arguments:

  • source: an object representing (the starting address/pointer of) a memory region. It can be either a memref object or simply a pointer represented by the uint64_t type. For dynamic memrefs or pointers, the shape and layout information of the memory region must be passed explicitly via the shape and strides parameters.

  • offsets: index values that represent the offsets from the "source" in each dimension at which the sub-view of the target memory will be created. They are encoded via "offsets" and "const_offsets", so that they can be given in various forms, such as operands (e.g., [%c0, %c]) and attributes (e.g., [2, 4]).

  • shape: the shape information of the memory region pointed to by the "source". It is typically encoded via the MemRefType of the source, e.g., memref<4096x4096xf16>. But if "source" is simply a pointer represented as uint64_t type, or a memref type without shape information, e.g., memref<?x?xf16>, the shape information has to be passed explicitly via the "shape" and "const_shape" arguments.

  • strides: the strides of the memory region pointed to by the "source". Similar to shape, they are typically encoded via the MemRefType of the source. But if "source" is simply a pointer represented as uint64_t type, or a memref type without shape information, e.g., memref<?x?xf16>, the strides information has to be passed explicitly via the "strides" and "const_strides" arguments.

Results:

  • res: nd tensor descriptor

Example 1 (suppose the tensor shape inferred by the compiler is 8x16):

%0 = memref.alloc() : memref<1024x1024xf32>
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0]: memref<1024x1024xf32> -> TensorDesc<8x16xf32>

Example 2 (suppose the tensor shape inferred by the compiler is 8x16):

%0 = memref.alloc(%h, %w) : memref<?x?xf32>
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1]: memref<?x?xf32> -> TensorDesc<8x16xf32>

Example 3 (suppose the tensor shape inferred by the compiler is 8x16):

%0 = ... : ui64
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%1 = xegpu.create_nd_tdesc %0[%c0, %c0], [%h, %w], [%w, %c1]: ui64 -> TensorDesc<8x16xf32>
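To make the interplay of offsets, strides and the tile shape in the examples above concrete, the following Python sketch lists which linear element offsets (relative to the base address) an 8x16 tile covers. The helper name tile_element_offsets is hypothetical, and the values h = w = 1024 are chosen only for illustration; this is a model of the addressing arithmetic, not of the operation itself.

```python
from typing import List, Tuple


def tile_element_offsets(offsets: Tuple[int, int], strides: Tuple[int, int],
                         tile_shape: Tuple[int, int]) -> List[List[int]]:
    """Linear element offsets, from the base address, covered by a 2D tile."""
    (off_r, off_c), (s_r, s_c), (rows, cols) = offsets, strides, tile_shape
    return [[(off_r + i) * s_r + (off_c + j) * s_c for j in range(cols)]
            for i in range(rows)]


# Mirrors Examples 2 and 3 with w = 1024: offsets [0, 0], strides [w, 1].
tile = tile_element_offsets((0, 0), (1024, 1), (8, 16))
first_row_end = tile[0][15]   # last element of the first tile row
second_row_start = tile[1][0]  # one full source row (w elements) later
```

With a row-major source, consecutive tile rows are one source row apart, which is why the strides operand in Examples 2 and 3 is [%w, %c1].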

create_tdesc(ssa)

xegpu.create_tdesc - create scattered tensor descriptors (TensorDesc).

Operands

  • source - Single, XeGPU_GatherScatterBaseAddrType, 1D memref of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 64-bit unsigned integer or 32-bit unsigned integer or 64-bit signless integer or 32-bit signless integer
  • offsets - Single, XeGPU_OffsetType, fixed-length vector of index values

Results

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.

Description

"create_tdesc" is similar to "create_nd_tdesc" in that it creates a Tensor Descriptor (TensorDescType) for a memory region. While "create_nd_tdesc" creates continuous subviews, "create_tdesc" creates non-continuous (scattered) subviews, allowing each work-item in a subgroup to specify its own offset. It accepts the following parameters:

Arguments:

  • source: a 1D memref or pointer (i64, i32, ui64, ui32) that represents the flattened memory object.
  • offsets: a vector containing the offset of each access point. Its size is fixed to the hardware-supported subgroup size, e.g., 16 on PVC, implying that each element in the vector corresponds to a work-item (SIMT lane) in the subgroup.

Results:

  • res: scattered tensor descriptor

The first dimension of the result TensorDesc corresponds to work-items, so it should match the dimension of offsets. It may also have a second dimension corresponding to the chunk_size if the chunk size is larger than 1.

Example 1: It assumes subgroup size is 4, and accesses a[0], a[16], a[32], a[64]

%a = memref.alloc() : memref<1024xf32>
%0 = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
%1 = xegpu.create_tdesc %a, %0: memref<1024xf32>, vector<4xindex> -> TensorDesc<4xf32>

Example 2: It assumes subgroup size is 4, and each work-item accesses 8 elements. It accesses 32 data elements in total: a[0:7], a[16:23], a[32:39], a[64:71]

%0 = memref.alloc() : memref<1024xf32>
%off = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
      -> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>>

Example 3: It is similar to Example 2, but there are some overlaps among work-items. It accesses: a[0:7], a[4:11], a[8:15], a[12:19]

%0 = memref.alloc() : memref<1024xf32>
%off = arith.constant dense<[0, 4, 8, 12]> : vector<4xindex>
%1 = xegpu.create_tdesc %0, %off : memref<1024xf32>, vector<4xindex>
      -> TensorDesc<4x8xf32, #xegpu.scattered_tdesc_attr<chunk_size = 8>>
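The access patterns of the three examples above can be reproduced with a short Python sketch. The helper name scattered_indices is hypothetical; it only models which element indices each work-item touches for a given offsets vector and chunk size, and is not part of XeGPU or Beaver.

```python
from typing import List, Sequence


def scattered_indices(offsets: Sequence[int],
                      chunk_size: int = 1) -> List[List[int]]:
    """One row per work-item: the element indices that work-item accesses."""
    return [list(range(off, off + chunk_size)) for off in offsets]


example1 = scattered_indices([0, 16, 32, 64])               # one element each
example2 = scattered_indices([0, 16, 32, 64], chunk_size=8)  # disjoint chunks
example3 = scattered_indices([0, 4, 8, 12], chunk_size=8)    # overlapping

# Elements touched by more than one work-item in Example 3.
overlap = sorted(set(example3[0]) & set(example3[1]))
```

The row count matches the length of offsets (the first TensorDesc dimension) and the row length matches chunk_size (the optional second dimension).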

dpas(ssa)

xegpu.dpas - It performs mma computation

Operands

  • lhs - Single, XeGPU_DpasOprType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2/3
  • rhs - Single, XeGPU_DpasOprType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2/3
  • acc - Optional, XeGPU_DpasResType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2

Results

  • result - Single, XeGPU_DpasResType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values of ranks 1/2

Description

DPAS performs matrix multiplication on matrix A of size mxk and matrix B of size kxn, and accumulates onto matrix C of size mxn, producing a result matrix of the same size, where m=8, n=16 and k=8 * 32/bit_width_of_elem_type. So for the fp16 data type, the matrices are A: vector<8x16xf16>, B: vector<16x16xf16>, and C/D: vector<8x16xf32>. Besides the matrix size requirements, DPAS also requires A and B to be loaded with the required data layout. Specifically, the VNNI layout is required for the B operand. It is achieved by adding the packed attribute to the load_nd operator. Due to the VNNI transformation, B operands can be represented as a 3D vector, with the last dimension representing the VNNI factor, which is computed as 32/bit_width_of_elem_type. Thus, B: vector<16x16xf16> can be represented as B: vector<8x16x2xf16>.

In SIMT code, each work-item from a subgroup holds a data fragment for A, B, C and the result, which are represented as 1D vectors. Please refer to the OpenCL Intel extensions (https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html) for more details about the fragment distribution.

Note: on PVC, the hardware can perform a load with VNNI transformation when the data element type is 16-bit or lower precision, taking 2 or 4 elements from the first dimension and inserting them into the newly added innermost dimension.
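The shape arithmetic and the VNNI packing described above can be worked through in a few lines of Python. This is a sketch under stated assumptions: the names dpas_shapes and vnni_pack are hypothetical, matrices are plain nested lists, and the packing follows the wording of the note (take `factor` consecutive elements along the first dimension and place them in a new innermost dimension).

```python
from typing import Dict, List, Tuple


def dpas_shapes(elem_bits: int, m: int = 8, n: int = 16) -> Dict[str, Tuple[int, ...]]:
    """Operand shapes from m=8, n=16, k = 8 * 32 / bit_width_of_elem_type."""
    vnni = 32 // elem_bits            # VNNI factor
    k = 8 * vnni
    return {"A": (m, k), "B": (k, n),
            "B_vnni": (k // vnni, n, vnni), "C": (m, n)}


def vnni_pack(b: List[List[int]], factor: int) -> List[List[List[int]]]:
    """Repack a k x n matrix into (k/factor) x n x factor."""
    k, n = len(b), len(b[0])
    return [[[b[i * factor + v][j] for v in range(factor)]
             for j in range(n)]
            for i in range(k // factor)]


fp16 = dpas_shapes(16)  # the fp16 case discussed in the description

# Tiny 4x2 matrix packed with factor 2; entry value encodes (row, column).
b = [[r * 10 + c for c in range(2)] for r in range(4)]
packed = vnni_pack(b, 2)
```

For 16-bit elements this yields A 8x16, B 16x16 and the packed form 8x16x2, matching the vector types quoted in the description.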

fence(ssa)

xegpu.fence

init_nbarrier(ssa)

xegpu.init_nbarrier - It assigns a named barrier to the current thread.

Operands

  • nbarrier_id - Single, I8, 8-bit signless integer
  • participant_thread_num - Single, I8, 8-bit signless integer

Results

  • result - Single, XeGPU_Nbarrier, !xegpu.nbarrier a custom XeGPU type representing a barrier.

Description

InitNbarrierOp assigns the named barrier with the specified barrier ID (0~31) to the current thread. Multiple threads may bind to the same named barrier, and participant_thread_num specifies the total number of threads associated with the nbarrier. It returns an object of NbarrierType representing the barrier.

load(ssa)

xegpu.load - load a set of scattered data points from memory.

Attributes

  • chunk_size - Optional, I64Attr, 64-bit signless integer attribute
  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators

Operands

  • source - Single, XeGPU_GatherScatterSourceType, TensorDesc describing regions of interested data. or 1D memref of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 64-bit unsigned integer or 32-bit unsigned integer or 64-bit signless integer or 32-bit signless integer
  • offsets - Optional, anonymous/composite constraint, fixed-length vector of index values or index
  • mask - Single, anonymous/composite constraint, fixed-length vector of 1-bit signless integer values or 1-bit signless integer

Results

  • value - Single, anonymous/composite constraint, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type

Description

The load operation loads data per work-item. The output describes the data being loaded at the subgroup level, so its size is consistent with the number of work-items in a subgroup. When the chunk size is larger than 1, the output is a 2D vector, with dim-0 corresponding to work-items and dim-1 corresponding to the chunk size loaded by each work-item. The mask operand masks out memory accesses, so it is safe to pass out-of-boundary addresses/offsets as long as they are masked. It applies to slots of SIMD lanes.

In SIMT mode, the result is a 1D vector that represents the data to be loaded by each work-item. If its size is not 1, the size should be equal to the chunk size.

Arguments:

  • source: represents the memory region to be loaded from, which can be either a tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32). In case of tensor_desc, offsets come from the producer create_tdesc op. tensor_desc cannot be used in SIMT mode.
  • offsets: represents offsets from source. Required if source is not a TensorDescType. offsets is a vector of index type whose length is either the subgroup size or 1 in SIMT mode. A scalar offset is also valid in SIMT mode.
  • mask: is a vector of i1 type, which is used to mask out the memory access. mask is a vector of size equal to the subgroup size, or 1 in SIMT mode. scalar mask is also valid for SIMT mode.
  • chunk_size: (optional) the number of contiguous elements to load per work-item.
  • l1_hint, l2_hint, l3_hint: are optional cache hints for each level of cache.

Results:

  • res: represents loaded data

Example 1:

   %2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
                            l2_hint = #xegpu.cache_hint<uncached>,
                            l3_hint = #xegpu.cache_hint<uncached>}>
         : !xegpu.tensor_desc<16xf32, #xegpu.scatter_tdesc_attr<memory_space=global>>,
           vector<16xi1> -> vector<16xf32>

Example 2:

   %2 = xegpu.load %1, %0 <{l1_hint = #xegpu.cache_hint<cached>,
                            l2_hint = #xegpu.cache_hint<uncached>,
                            l3_hint = #xegpu.cache_hint<uncached>}>
         : !xegpu.tensor_desc<16x8xf32, #xegpu.scatter_tdesc_attr<memory_space=global, chunk_size=8>>,
           vector<16xi1> -> vector<16x8xf32>

Example 3: A variant that accepts a memref as the base pointer and offsets instead of a scattered TensorDesc. It combines "create scattered TensorDesc" and "load with scattered TensorDesc". The source operand could also be a raw pointer (ui64, ui32, i64, i32). Please refer to create_tdesc for the restrictions on memref.

   %a = memref.alloc() : memref<1024xf32>
   %offsets = vector.step : vector<16xindex>
   %mask = vector.constant_mask [16]: vector<16xi1>
   %val = xegpu.load %a[%offsets], %mask {l1_hint = #xegpu.cache_hint<cached>,
                          l2_hint = #xegpu.cache_hint<cached>,
                          l3_hint = #xegpu.cache_hint<cached>}
     : memref<1024xf32>, vector<16xindex>, vector<16xi1> -> vector<16xf32>

Example 4 (SIMT mode): SIMT mode only accepts the offsets variant. chunk_size can be inferred from result type. In this example, chunk_size is 8.

   %3 = xegpu.load %1[%2], %0 <{l1_hint = #xegpu.cache_hint<cached>,
                            l2_hint = #xegpu.cache_hint<uncached>,
                            l3_hint = #xegpu.cache_hint<uncached>}>
         : memref<128xf32>, vector<1xindex>, vector<1xi1> -> vector<8xf32>
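The masking and chunking behaviour described for this operation can be modelled in Python as a masked gather. The function name masked_gather and the fill placeholder are hypothetical: this page does not say what a masked-off lane contains, so None is used here purely as a marker. The point of the sketch is that a masked-off lane never dereferences its offset, which is why out-of-boundary offsets are safe when masked.

```python
from typing import List, Optional, Sequence


def masked_gather(memory: Sequence[int], offsets: Sequence[int],
                  mask: Sequence[bool], chunk_size: int = 1,
                  fill: Optional[int] = None) -> List[List[Optional[int]]]:
    """One row per work-item (dim-0), chunk_size elements per row (dim-1)."""
    rows = []
    for off, enabled in zip(offsets, mask):
        if enabled:
            rows.append([memory[off + c] for c in range(chunk_size)])
        else:
            # Assumption: contents of masked-off lanes are unspecified.
            rows.append([fill] * chunk_size)
    return rows


memory = list(range(100))
# The third offset is out of bounds, but its lane is masked off.
loaded = masked_gather(memory, [0, 16, 1000], [True, True, False],
                       chunk_size=2)
```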

load_matrix(ssa)

xegpu.load_matrix

Attributes

  • const_offsets - Single, DenseI64ArrayAttr, i64 dense array attribute
  • layout - Optional, DistributeLayoutAttr, DistributeLayoutAttr instance

Operands

  • mem_desc - Single, XeGPU_MemDesc, MemDesc describing the data in SLM
  • offsets - Variadic, Index, variadic of index

Results

  • res - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values

Description

This operation loads a 2D block of data from shared local memory (SLM) as specified by the provided 2D mem_desc. Only 2D memory descriptors are supported; use the subview operation to obtain a compatible 2D mem_desc from a higher-rank descriptor if needed.

Arguments:

  • mem_desc: the memory descriptor identifying the SLM region.
  • offsets: the coordinates within the matrix to read from.
  • layout: [optional] an attribute for guiding distributions among subgroups and/or work-items. It currently can accept either LayoutAttr or SliceAttr.

Results:

  • res: the matrix elements loaded from SLM.

load_nd(ssa)

xegpu.load_nd - loads an n-D block from memory (represented by TensorDesc) to registers (represented by vector)

Attributes

  • const_offsets - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • packed - Optional, UnitAttr, unit attribute
  • transpose - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators

Operands

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • offsets - Variadic, Index, variadic of index

Results

  • value - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values

Description

LoadNdOp essentially mimics the hardware block read instruction to read a block of data from memory into registers. It takes a set of optional cache hints for each level of cache: L1, L2 and L3. If the hardware does not have a corresponding cache, the corresponding cache hint attribute will be masked. VNNI transformation is a hardware feature of Intel GPUs that packs data during the load for the B operand of a matrix operation when the bit width of the data type is less than 32 bits, e.g., fp16. Transpose is another Intel hardware feature, which transposes the data while loading it when the data type is fp32 or fp64. This implies that VNNI and transpose cannot be used at the same time. It is only available for 1D or 2D blocked tensor_desc.

In SIMT mode, the result vector represents the data to be loaded by each work-item.

Example 1:

  xegpu.load_nd %1 {transpose = [1, 0],
                    l1_hint = #xegpu.cache_hint<cached>,
                    l2_hint = #xegpu.cache_hint<uncached>,
                    l3_hint = #xegpu.cache_hint<streaming>}
          : !xegpu.tensor_desc<8x16xf32> -> vector<16x8xf32>

Example 2 (SIMT mode):

  xegpu.load_nd %1 {l1_hint = #xegpu.cache_hint<cached>,
                    l2_hint = #xegpu.cache_hint<uncached>}
    : !xegpu.tensor_desc<8x16xf32> -> vector<8xf32>
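The effect of transpose = [1, 0] in Example 1, where an 8x16 descriptor yields a vector<16x8xf32>, can be pictured with a small Python model of a block read followed by a transpose. The function name load_nd_model and the row-major flat memory are assumptions made for illustration; the hardware performs the transposition as part of the load rather than as a separate step.

```python
from typing import List, Sequence, Tuple


def load_nd_model(memory: Sequence[int], row_stride: int,
                  offsets: Tuple[int, int], tile_shape: Tuple[int, int],
                  transpose: bool = False) -> List[List[int]]:
    """Read a rows x cols tile from row-major memory, optionally transposed."""
    off_r, off_c = offsets
    rows, cols = tile_shape
    tile = [[memory[(off_r + i) * row_stride + off_c + j]
             for j in range(cols)]
            for i in range(rows)]
    if transpose:
        tile = [list(col) for col in zip(*tile)]  # swap the two dimensions
    return tile


memory = list(range(32 * 32))  # a 32x32 row-major source
plain = load_nd_model(memory, 32, (0, 0), (8, 16))
transposed = load_nd_model(memory, 32, (0, 0), (8, 16), transpose=True)
```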

mem_desc_subview(ssa)

xegpu.mem_desc_subview

Attributes

  • const_offsets - Single, DenseI64ArrayAttr, i64 dense array attribute

Operands

  • src - Single, XeGPU_MemDesc, MemDesc describing the data in SLM
  • offsets - Variadic, Index, variadic of index

Results

  • res - Single, XeGPU_MemDesc, MemDesc describing the data in SLM

Description

Creates a subview of a memory descriptor. The resulting memory descriptor can have a lower rank than the source; in this case, the result dimensions correspond to the higher-order dimensions of the source memory descriptor.

Arguments:

  • src : a memory descriptor.
  • offsets : the coordinates within the matrix the subview will be created from.

Results:

  • res : a memory descriptor with smaller size.

nbarrier_arrive(ssa)

xegpu.nbarrier_arrive - It signals the arrival at the named barrier.

Operands

  • nbarrier - Single, XeGPU_Nbarrier, !xegpu.nbarrier a custom XeGPU type representing a barrier.

Description

NbarrierArriveOp signals the hardware (or other threads) that the current thread has produced its data for the consumer threads. When the hardware has been signalled by participant_thread_num threads for the named barrier, it notifies the threads waiting on the named barrier to continue their work.

nbarrier_wait(ssa)

xegpu.nbarrier_wait - It waits for a named barrier.

Operands

  • nbarrier - Single, XeGPU_Nbarrier, !xegpu.nbarrier a custom XeGPU type representing a barrier.

Description

NbarrierWaitOp tells the hardware which named barrier the current thread is waiting for, so that the thread is notified when the named barrier is completed.
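The split between arriving at a named barrier and waiting on it can be illustrated with a host-side analogy in Python. This is only a sketch: the class NamedBarrier and the function run_demo are hypothetical, the model is single-use, and it uses OS threads and a condition variable rather than anything resembling the GPU hardware mechanism.

```python
import threading
from typing import List, Optional, Tuple


class NamedBarrier:
    """Single-use model of a named barrier with separate arrive and wait."""

    def __init__(self, participant_thread_num: int) -> None:
        self._expected = participant_thread_num
        self._arrived = 0
        self._cond = threading.Condition()

    def arrive(self) -> None:
        # Signal that this thread has produced its data.
        with self._cond:
            self._arrived += 1
            if self._arrived >= self._expected:
                self._cond.notify_all()

    def wait(self) -> None:
        # Block until all participants have arrived.
        with self._cond:
            self._cond.wait_for(lambda: self._arrived >= self._expected)


def run_demo(num_threads: int = 4) -> Tuple[List[Optional[int]], List[Optional[List[Optional[int]]]]]:
    barrier = NamedBarrier(num_threads)
    produced: List[Optional[int]] = [None] * num_threads
    seen: List[Optional[List[Optional[int]]]] = [None] * num_threads

    def worker(tid: int) -> None:
        produced[tid] = tid * tid      # produce data
        barrier.arrive()               # announce it
        barrier.wait()                 # wait for every producer
        seen[tid] = list(produced)     # all data is visible here

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return produced, seen


produced, seen = run_demo(4)
```

Because every worker writes its slot before calling arrive, each snapshot taken after wait contains the data of all participants.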

prefetch(ssa)

xegpu.prefetch - prefetches a set of scattered data points to cache

Attributes

  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • offset_align_byte - Optional, I64Attr, 64-bit signless integer attribute

Operands

  • source - Single, XeGPU_GatherScatterSourceType, TensorDesc describing regions of interested data. or 1D memref of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 64-bit unsigned integer or 32-bit unsigned integer or 64-bit signless integer or 32-bit signless integer
  • offsets - Optional, anonymous/composite constraint, fixed-length vector of index values or index

Description

It issues instructions to prefetch a set of scattered data points from memory to each level of the cache based on their cache policy. As compared to prefetch_nd, which works on non-scattered TensorDesc, it works on scattered TensorDesc instead.

Arguments:

  • source: represents the memory region to be prefetched from, which can be either a tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32). In case of tensor_desc, offsets come from the producer create_tdesc op. tensor_desc cannot be used in SIMT mode.
  • offsets: represents offsets from source. Required if source is not a TensorDescType. offsets is a vector of index type whose length is either the subgroup size or 1 in SIMT mode. A scalar offset is also valid in SIMT mode.
  • l1_hint, l2_hint, l3_hint: are optional cache hints for each level of cache.
  • offset_align_byte: required if source is a pointer. If source is not a pointer, it is not allowed. Represents the alignment in bytes of each offset in offsets.

Example 1:

  xegpu.prefetch %tdesc {l1_hint = #xegpu.cache_hint<cached>,
                         l2_hint = #xegpu.cache_hint<cached>,
                         l3_hint = #xegpu.cache_hint<cached>}
    : !xegpu.tensor_desc<16xf16>

Example 2: A variant that accepts a memref as the base pointer and offsets instead of a scattered TensorDesc. It combines "create scattered TensorDesc" and "prefetch with scattered TensorDesc". The source operand could also be a raw pointer (ui64, ui32, i64, i32). Please refer to create_tdesc for the restrictions on memref.

  %a = memref.alloc() : memref<1024xf32>
  %0 = arith.constant dense<[0, 16, 32, 64]> : vector<4xindex>
  xegpu.prefetch %a[%0] {l1_hint = #xegpu.cache_hint<cached>,
                         l2_hint = #xegpu.cache_hint<cached>,
                         l3_hint = #xegpu.cache_hint<cached>}
    : memref<1024xf32>, vector<4xindex>

Example 3 (SIMT mode): SIMT mode only accepts the offsets variant.

  xegpu.prefetch %0[%1] {l1_hint = #xegpu.cache_hint<cached>,
                         l2_hint = #xegpu.cache_hint<cached>,
                         l3_hint = #xegpu.cache_hint<cached>}
    : memref<256xf32>, vector<1xindex>

Example 4 (SIMT mode): SIMT mode only accepts the offsets variant.

  xegpu.prefetch %0[%1] {l1_hint = #xegpu.cache_hint<cached>,
                         l2_hint = #xegpu.cache_hint<cached>,
                         l3_hint = #xegpu.cache_hint<cached>,
                         offset_align_byte = 2}
    : i64, vector<1xindex>

prefetch_nd(ssa)

xegpu.prefetch_nd - prefetches a n-D block to cache

Attributes

  • const_offsets - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators

Operands

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • offsets - Variadic, Index, variadic of index

Description

It issues an instruction to prefetch a block of data from contiguous memory regions to each level of the cache based on their cache policy.

Example:

  xegpu.prefetch_nd %tdesc {l1_hint = #xegpu.cache_hint<cached>,
                            l2_hint = #xegpu.cache_hint<cached>,
                            l3_hint = #xegpu.cache_hint<cached>}
    : !xegpu.tensor_desc<8x16xf16>

store(ssa)

xegpu.store - store data to scattered memory locations.

Attributes

  • chunk_size - Optional, I64Attr, 64-bit signless integer attribute
  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators

Operands

  • value - Single, anonymous/composite constraint, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type
  • dest - Single, XeGPU_GatherScatterSourceType, TensorDesc describing regions of interested data. or 1D memref of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values or 64-bit unsigned integer or 32-bit unsigned integer or 64-bit signless integer or 32-bit signless integer
  • offsets - Optional, anonymous/composite constraint, fixed-length vector of index values or index
  • mask - Single, anonymous/composite constraint, fixed-length vector of 1-bit signless integer values or 1-bit signless integer

Description

It (aka. store) stores data to scattered memory locations. The value is typically a 1D vector, but when the chunk size of the TensorDesc is larger than 1, it is a 2D vector instead. In the latter case, dim-1 of the value corresponds to the SIMD lanes and dim-0 of the value corresponds to the chunk size stored per lane. store_scatter therefore has a transpose effect, similar to load_gather, and a transpose attribute is introduced on purpose to make sure users are aware of this implicit transformation.

In SIMT mode, the value is a 1D vector that represents the data to be stored by each work-item. If its size is not 1, it should be equal to the chunk size.

Arguments:

  • value: represents the data to be stored.
  • dest: represents the memory region to be stored to, which can be either a tensor_desc or a 1D memref or pointer (ui64, ui32, i64 or i32). In case of tensor_desc, offsets come from the producer create_tdesc op. tensor_desc cannot be used in SIMT mode.
  • offsets: represents offsets from dest. Required if dest is not a TensorDescType. offsets is a vector of index type whose length is either the subgroup size or, in SIMT mode, 1. A scalar offset is also valid in SIMT mode.
  • mask: is a vector of i1 type, which is used to mask out the memory access. mask is a vector of size equal to the subgroup size, or 1 in SIMT mode. scalar mask is also valid for SIMT mode.
  • chunk_size: (optional) represents the number of contiguous elements to store per work-item.
  • l1_hint, l2_hint, l3_hint: are optional cache hints for each level of cache.
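
To make the roles of offsets, mask and chunk_size concrete, here is a minimal Python model of a masked scattered store into a flat list. It is a sketch under stated assumptions, not the op's implementation: the helper name `scatter_store` is hypothetical, offsets are taken to be in elements, and the value is indexed lane-first (one chunk per lane), following the shapes used in Example 2 of this op.

```python
# Hypothetical model of a masked scattered store; not the XeGPU implementation.
def scatter_store(dest, value, offsets, mask, chunk_size=1):
    """Write per-lane data into the flat list `dest` and return it.

    Assumptions of this sketch: offsets are in elements, and value[lane]
    holds the chunk_size contiguous elements of that lane (a bare scalar
    when chunk_size is 1).
    """
    if not (len(value) == len(offsets) == len(mask)):
        raise ValueError("value, offsets and mask need one entry per lane")
    for lane, (off, keep) in enumerate(zip(offsets, mask)):
        if not keep:
            continue  # masked-out lanes leave memory untouched
        chunk = value[lane] if chunk_size > 1 else [value[lane]]
        if len(chunk) != chunk_size:
            raise ValueError("each lane must supply chunk_size elements")
        for j, elem in enumerate(chunk):
            dest[off + j] = elem
    return dest


mem = [0.0] * 16
scatter_store(mem,
              [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
              offsets=[0, 4, 8, 12],
              mask=[True, False, True, True],
              chunk_size=2)
print(mem)
```

Lane 1 is masked out, so elements 4 and 5 of `mem` keep their initial value while the other three lanes each write two contiguous elements.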

Example 1:

   xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
                            l2_hint = #xegpu.cache_hint<write_back>,
                            l3_hint = #xegpu.cache_hint<write_through>}>
         : vector<16xf32>, !xegpu.tensor_desc<16xf32, #xegpu.scattered_tdesc_attr<>>, vector<16xi1>

Example 2:

   xegpu.store %0, %1, %2 <{l1_hint = #xegpu.cache_hint<uncached>,
                            l2_hint = #xegpu.cache_hint<write_back>,
                            l3_hint = #xegpu.cache_hint<write_through>}>
         : vector<16x8xf32>, !xegpu.tensor_desc<16x8xf32, #xegpu.scattered_tdesc_attr<chunk_size=8>>, vector<16xi1>

Example 3: A variant that accepts a memref as the base pointer and an offset instead of a scattered TensorDesc. It combines "create scattered TensorDesc" and "store with scattered TensorDesc". The dest operand could also be a raw pointer (uint64_t). Please refer to create_tdesc for the restrictions on memref.

   %a = memref.alloc() : memref<1024xf32>
   %val = arith.constant dense<0.0> : vector<16xf32>
   %offsets = vector.step : vector<16xindex>
   %mask = vector.constant_mask [16] : vector<16xi1>
   xegpu.store %val, %a[%offsets], %mask {l1_hint = #xegpu.cache_hint<cached>,
                          l2_hint = #xegpu.cache_hint<cached>,
                          l3_hint = #xegpu.cache_hint<cached>}
     : vector<16xf32>, memref<1024xf32>, vector<16xindex>, vector<16xi1>

Example 4 (SIMT mode): SIMT mode only accepts the offsets variant. chunk_size can be inferred from value type. In this example, chunk_size is 8.

   xegpu.store %0, %1[%2], %3 <{l1_hint = #xegpu.cache_hint<uncached>,
                            l2_hint = #xegpu.cache_hint<write_back>,
                            l3_hint = #xegpu.cache_hint<write_through>}>
         : vector<8xf32>, memref<256xf32>, vector<1xindex>, vector<1xi1>

store_matrix(ssa)

xegpu.store_matrix

Attributes

  • const_offsets - Single, DenseI64ArrayAttr, i64 dense array attribute
  • layout - Optional, DistributeLayoutAttr, DistributeLayoutAttr instance

Operands

  • data - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values
  • mem_desc - Single, XeGPU_MemDesc, MemDesc describing the data in SLM
  • offsets - Variadic, Index, variadic of index

Description

This operation stores a 2D data fragment into the shared local memory region specified by a 2D mem_desc. Only 2D memory descriptors are supported; use the subview operation to obtain a 2D mem_desc from a higher-rank descriptor if needed.

Arguments:

  • mem_desc: the memory descriptor specifying the SLM region.
  • offsets: the coordinates within the matrix where the data will be written.
  • data: the values to be stored in the matrix.
  • layout: [optional] An attribute for guiding distributions among subgroups and/or work-items. It currently can accept either LayoutAttr or SliceAttr.
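
The effect of the offsets can be sketched in plain Python by modeling the SLM region as a list of rows. This is an illustration only: the helper name `store_matrix` below is a stand-in, and it assumes the offsets are the (row, column) coordinates of the fragment's top-left element, which the description above does not spell out.

```python
# Hypothetical model of writing a 2D fragment into a 2D matrix.
def store_matrix(matrix, data, offsets):
    """Copy the 2D fragment `data` into `matrix` starting at `offsets`.

    Assumption of this sketch: offsets = (row, col) of the top-left
    element of the fragment inside the matrix.
    """
    row0, col0 = offsets
    rows, cols = len(data), len(data[0])
    if row0 < 0 or col0 < 0:
        raise IndexError("offsets must be non-negative")
    if row0 + rows > len(matrix) or col0 + cols > len(matrix[0]):
        raise IndexError("fragment does not fit inside the matrix")
    for r in range(rows):
        for c in range(cols):
            matrix[row0 + r][col0 + c] = data[r][c]
    return matrix


slm = [[0] * 4 for _ in range(4)]
store_matrix(slm, [[1, 2], [3, 4]], offsets=(1, 2))
for row in slm:
    print(row)
```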

store_nd(ssa)

xegpu.store_nd - stores a n-D block register region back to memory, currently only supports 2D

Attributes

  • const_offsets - Optional, DenseI64ArrayAttr, i64 dense array attribute
  • l1_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l2_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators
  • l3_hint - Optional, XeGPU_CacheHintAttr, Describe the cache settings for prefetch/load/store operators

Operands

  • value - Single, XeGPU_ValueType, fixed-length vector of 1-bit signless integer or 8-bit signless integer or 16-bit signless integer or 32-bit signless integer or 64-bit signless integer or 1-bit signed integer or 8-bit signed integer or 16-bit signed integer or 32-bit signed integer or 64-bit signed integer or 1-bit unsigned integer or 8-bit unsigned integer or 16-bit unsigned integer or 32-bit unsigned integer or 64-bit unsigned integer or 16-bit float or 32-bit float or 64-bit float or bfloat16 type or tf32 type values
  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • offsets - Variadic, Index, variadic of index

Description

StoreNdOp essentially mimics the hardware block write instruction to write a block of data from registers into the memory region described by the TensorDesc. It takes a set of optional cache hints for each level of cache: L1, L2 and L3. If the hardware does not have a corresponding cache, the corresponding cache hint attribute will be masked. It is only available to 1D or 2D blocked tensor_desc.

In SIMT mode, the input vector represents the data to be stored by each work-item.

Example 1:

  xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
                         l2_hint = #xegpu.cache_hint<write_back>,
                         l3_hint = #xegpu.cache_hint<write_through>}
                         : vector<8x16xf16>, !xegpu.tensor_desc<8x16xf16>

Example 2 (SIMT mode):

  xegpu.store_nd %3, %2 {l1_hint = #xegpu.cache_hint<uncached>,
                         l2_hint = #xegpu.cache_hint<write_back>,
                         l3_hint = #xegpu.cache_hint<write_through>}
                         : vector<8xf16>, !xegpu.tensor_desc<8x16xf16>
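
In Example 2 each work-item holds 8 elements of the 8x16 block. The sketch below shows one way such per-work-item vectors could cover the block. It assumes 16 work-items with work-item `lane` holding column `lane`; that distribution is an assumption made for illustration (the text above does not state it), and the helper name `assemble_block` is hypothetical.

```python
# Hypothetical reassembly of a block from per-work-item vectors.
def assemble_block(per_lane, rows, cols):
    """Rebuild a rows x cols block from per-work-item vectors.

    Assumption of this sketch: there are `cols` work-items and
    work-item `lane` holds the `rows` elements of column `lane`.
    """
    if len(per_lane) != cols or any(len(v) != rows for v in per_lane):
        raise ValueError("expected cols vectors of length rows")
    return [[per_lane[lane][r] for lane in range(cols)] for r in range(rows)]


# 16 work-items, each holding an 8-element vector (values encode lane and row).
per_lane = [[lane * 10 + r for r in range(8)] for lane in range(16)]
block = assemble_block(per_lane, rows=8, cols=16)
print(len(block), len(block[0]))
```

Under this assumption, the 16 vectors of length 8 together account for all 128 elements of the 8x16 block.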

update_nd_offset(ssa)

xegpu.update_nd_offset - It updates the offsets for the TensorDesc.

Attributes

  • const_offsets - Single, DenseI64ArrayAttr, i64 dense array attribute

Operands

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • offsets - Variadic, Index, variadic of index

Results

  • result - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.

Description

The op updates the offset of the given TensorDesc.

The offsets are relative to the current position and are expressed in number of elements. The result is a TensorDesc of the same type as the input.

Example:

    %2 = xegpu.update_nd_offset %1, [0, 16]: !xegpu.tensor_desc<8x16xf32>
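
Because the offsets are relative, repeated updates accumulate. A minimal Python model of that bookkeeping follows; the function below only tracks the block's start position as a tuple and is a hypothetical stand-in, not the op itself.

```python
# Hypothetical model: track the start position of an n-D block.
def update_nd_offset(current, delta):
    """Return the new start position after applying a relative shift."""
    if len(current) != len(delta):
        raise ValueError("one shift per dimension is required")
    return tuple(c + d for c, d in zip(current, delta))


pos = (0, 0)
for _ in range(3):
    # Shift by 16 elements along the second dimension, as in the example above.
    pos = update_nd_offset(pos, (0, 16))
print(pos)  # (0, 48)
```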

update_offset(ssa)

xegpu.update_offset - It updates the offsets for the given tensor descriptor

Operands

  • TensorDesc - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.
  • offsets - Single, XeGPU_OffsetType, fixed-length vector of index values

Results

  • result - Single, XeGPU_TensorDesc, TensorDesc describing regions of interested data.

Description

It behaves similarly to update_nd_offset in that it updates the offset of a TensorDesc, and the offsets are relative to the current position, in number of elements. However, update_nd_offset updates the start point of a 2D block, so its offset contains two elements representing the shift in each dimension. update_offset updates the offset per work-item, so its offsets contain values representing the shift for each work-item.

Example:
```mlir
  %off = arith.constant dense<[32, 32, 32, 32]> : vector<4xindex>
  %2 = xegpu.update_offset %1, %off :
          !xegpu.tensor_desc<4x2xf32, #xegpu.scattered_tdesc_attr<chunk_size=2>>, vector<4xindex>
```
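
The per-work-item nature of the update can be modeled in a few lines of Python. This is a sketch only: the starting offsets below are made-up values for illustration, and the function is a hypothetical stand-in that tracks offsets as a plain list.

```python
# Hypothetical model: one offset per work-item, each shifted independently.
def update_offset(offsets, shifts):
    """Return new per-work-item offsets after applying relative shifts."""
    if len(offsets) != len(shifts):
        raise ValueError("one shift per work-item is required")
    return [o + s for o, s in zip(offsets, shifts)]


start = [0, 8, 16, 24]  # made-up starting offsets for 4 work-items
print(update_offset(start, [32, 32, 32, 32]))  # [32, 40, 48, 56]
```

In contrast to update_nd_offset, the length of the shift vector here follows the number of work-items rather than the rank of the block.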