Beaver.MLIR.Dialect.AMDGPU (beaver v0.4.7)
Summary
Functions
amdgpu.dpp - AMDGPU DPP operation
amdgpu.ext_packed_fp8 - Extend a fp8 value to a float or a vector of packed fp8 values to two floats
amdgpu.fat_raw_buffer_cast - Create a raw buffer fat pointer that matches memref
amdgpu.gather_to_lds - MLIR wrapper for CDNA Gather to LDS instructions
amdgpu.lds_barrier
amdgpu.memory_counter_wait
amdgpu.mfma - MLIR wrapper for CDNA mfma instructions
amdgpu.packed_scaled_trunc - Round two floats into a packed vector of floats
amdgpu.packed_stoch_round_fp8 - Round float stochastically into a packed vector of 8-bit floats
amdgpu.packed_trunc_2xfp8 - Round two floats into a packed vector of 8-bit floats
amdgpu.permlane_swap - AMDGPU permlane swap op
amdgpu.raw_buffer_atomic_cmpswap - Raw Buffer Atomic compare-and-swap
amdgpu.raw_buffer_atomic_fadd - Raw Buffer Floating-point Atomic Add (MI-* only)
amdgpu.raw_buffer_atomic_fmax - Raw Buffer Floating-point Atomic Max (non-GFX9)
amdgpu.raw_buffer_atomic_smax - Raw Buffer Signed Integer Atomic Max
amdgpu.raw_buffer_atomic_umin - Raw Buffer Unsigned Integer Atomic Min
amdgpu.raw_buffer_load - Raw Buffer load, exposing GCN features
amdgpu.raw_buffer_store - Raw Buffer Store, exposing GCN features
amdgpu.scaled_ext_packed - Extend a vector of packed floating point values
amdgpu.scaled_mfma - MLIR wrapper for CDNA scaled mfma instructions
amdgpu.sched_barrier
amdgpu.swizzle_bitmode - AMDGPU ds_swizzle op, bitmode variant
amdgpu.transpose_load - MLIR wrapper for CDNA Transpose Load instructions
amdgpu.wmma - MLIR wrapper for RDNA3 wmma instructions
Functions
amdgpu.dpp - AMDGPU DPP operation
This op has support for result type inference.
Attributes
- kind - Single, AMDGPU_DPPPermAttr, The possible permutations for a DPP operation
- permArgument - Optional, anonymous/composite constraint, 32-bit signless integer attribute or array attribute or unit attribute
- row_mask - Single, I32Attr, 32-bit signless integer attribute
- bank_mask - Single, I32Attr, 32-bit signless integer attribute
- bound_ctrl - Single, BoolAttr, bool attribute
Operands
- old - Single, AnyType, any type
- src - Single, AnyType, any type
Results
- result - Single, AnyType, any type
Description
This operation represents DPP functionality in a GPU program. DPP provides the following operations:
- Full crossbar in a group of four (quad_perm)
- Wavefront shift left by one lane (wave_shl)
- Wavefront shift right by one lane (wave_shr)
- Wavefront rotate right by one lane (wave_ror)
- Wavefront rotate left by one lane (wave_rol)
- Row shift left by 1–15 lanes (row_shl)
- Row shift right by 1–15 lanes (row_shr)
- Row rotate right by 1–15 lanes (row_ror)
- Reverse within a row (row_mirror)
- Reverse within a half-row (row_half_mirror)
- Broadcast the 15th lane of each row to the next row (row_bcast)
- Broadcast lane 31 to rows 2 and 3 (row_bcast)
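As a rough illustration of three of the permutations listed above, the following Python sketch models them on a plain list of per-lane values. It is not part of Beaver or MLIR: the function name `dpp_permute` is hypothetical, rows are assumed to be 16 lanes wide, every lane is assumed active, and row_mask, bank_mask, bound_ctrl, and the old operand are ignored.

```python
def dpp_permute(src, kind, quad_perm=None):
    """Model a few DPP permutations on a list of per-lane values.

    Assumes all lanes are active and ignores row_mask, bank_mask,
    bound_ctrl, and the `old` operand.
    """
    out = []
    for lane in range(len(src)):
        if kind == "quad_perm":
            # Full crossbar within each group of four lanes.
            base = lane - lane % 4
            out.append(src[base + quad_perm[lane % 4]])
        elif kind == "row_mirror":
            # Reverse the 16 lanes of each row.
            base = lane - lane % 16
            out.append(src[base + 15 - lane % 16])
        elif kind == "row_half_mirror":
            # Reverse each half-row of 8 lanes.
            base = lane - lane % 8
            out.append(src[base + 7 - lane % 8])
        else:
            raise ValueError(f"unmodeled DPP kind: {kind}")
    return out


lanes = list(range(32))
print(dpp_permute(lanes, "quad_perm", quad_perm=[1, 0, 3, 2])[:8])
print(dpp_permute(lanes, "row_mirror")[:4])
print(dpp_permute(lanes, "row_half_mirror")[:4])
```

The shift, rotate, and broadcast kinds are left out because their behavior at row boundaries depends on bound_ctrl and the old operand, which this sketch does not model.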
amdgpu.ext_packed_fp8 - Extend a fp8 value to a float or a vector of packed fp8 values to two floats
Attributes
- index - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 3
Operands
- source - Single, anonymous/composite constraint, f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type or vector of f8E5M2FNUZ type or f8E4M3FNUZ type or f8E5M2 type or f8E4M3FN type values of length 1/2/3/4
Results
- res - Single, anonymous/composite constraint, 32-bit float or fixed-length vector of 32-bit float values of length 2
Description
Extend one or two 8-bit floats in source[index] to a 32-bit float or
two floats and return them.
This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub 32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 4 such floats.
If the passed-in vector has fewer than four elements, or the input is scalar, the remaining values in the <4 x i8> will be filled with undefined values as needed.
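The scalar form of this op can be sketched in Python for the f8E5M2 case only. The sketch relies on f8E5M2 having the same layout as the upper byte of an IEEE half-precision value; the helper names are hypothetical, and `None` stands in for the undefined padding lanes described above.

```python
import struct


def decode_f8e5m2(byte):
    """Decode one f8E5M2 byte by widening it to the top byte of an IEEE half."""
    return struct.unpack("<e", struct.pack("<H", (byte & 0xFF) << 8))[0]


def ext_packed_fp8_scalar(source, index):
    """Extend source[index] to a float, padding short inputs to 4 lanes."""
    if not 0 <= index <= 3:
        raise ValueError("index must be in 0..3")
    packed = list(source) + [None] * (4 - len(source))  # None marks an undefined lane
    if packed[index] is None:
        raise ValueError("selected lane is undefined")
    return decode_f8e5m2(packed[index])


print(ext_packed_fp8_scalar([0x3C, 0x40, 0xC0], 1))
```

The two-float result form and the other fp8 formats are not modeled here.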
amdgpu.fat_raw_buffer_cast - Create a raw buffer fat pointer that matches memref
This op has support for result type inference.
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- resetOffset - Optional, UnitAttr, unit attribute
Operands
- source - Single, AnyMemRef, memref of any type values
- validBytes - Optional, I64, 64-bit signless integer
- cacheSwizzleStride - Optional, anonymous/composite constraint, 14-bit signless integer
Results
- result - Single, AnyMemRef, memref of any type values
Description
Wraps the memory pointed to by source as a raw buffer fat pointer, or,
in LLVM terms, a ptr addrspace(7), returning a memref that has the same
sizes and layout but the #amdgpu.address_space<fat_raw_buffer>
address space.
This memref can be used with standard memref operations like memref.load,
memref.store, and memref.atomicrmw, which will be lowered to the relevant
buffer intrinsics. (vector.masked_load/store will work once there's backend
support for lowering them, and then this document will be updated)
If validBytes is given, it is the number of bytes that will be valid as
an offset into the result. If it is not provided, this will be inferred from
the size of the memref during lowering. This size is
max_{d = 0 upto rank(source)} (sizes[d] * strides[d]) * sizeof(element type).
The flags of the buffer descriptor will be set up to enable raw usage -
for example, stride = 0, add_tid = 0, and so on. The boundsCheck
property determines if bounds checking is enabled or not (on architectures
where this can be controlled - that is, on RDNA chips).
If cacheSwizzleStride is provided, L1 cache swizzling will be enabled
on architectures that support it. This swizzling, unlike the main swizzling
mode (whose usage makes a buffer non-raw) does not affect index calculation,
but does affect cache behavior. Mixing access between cache-swizzled raw
buffers and other forms of memory access, like ordinary pointer loads or
unswizzled buffer pointers can cause incorrect behavior and must be avoided.
This operation preserves the sizes, strides, and offset of the input
memref - they'll be added in by memref.load later. However, if
resetOffset is set, that offset will be added to the base pointer.
If the value of the memref's offset is not uniform (independent of the lane/thread ID),
this will lead to substantially decreased performance due to the need for
a waterfall loop on the base address of the buffer resource.
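The size inferred when validBytes is omitted can be worked through with a small Python sketch. The function name is hypothetical, and treating a rank-0 memref as one element is an assumption of this sketch rather than something stated above.

```python
def inferred_valid_bytes(sizes, strides, element_bytes):
    """Largest sizes[d] * strides[d] over all dimensions, scaled by the element size."""
    # Assumption: a rank-0 memref covers exactly one element.
    extent = max((size * stride for size, stride in zip(sizes, strides)), default=1)
    return extent * element_bytes


# A contiguous 4x8 memref of f32 (strides [8, 1]).
print(inferred_valid_bytes([4, 8], [8, 1], 4))
# The same shape with rows padded to 16 elements (strides [16, 1]).
print(inferred_valid_bytes([4, 8], [16, 1], 4))
```

For the padded layout the result covers the padding as well, since the outer dimension's stride dominates the maximum.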
amdgpu.gather_to_lds - MLIR wrapper for CDNA Gather to LDS instructions
Attributes
- transferType - Single, TypeAttr, any type attribute
Operands
- src - Single, AnyMemRef, memref of any type values
- srcIndices - Variadic, Index, variadic of index
- dst - Single, AnyMemRef, memref of any type values
- dstIndices - Variadic, Index, variadic of index
Description
The amdgpu.gather_to_lds op is a wrapper around the global_load_lds instructions.
Operands:
- $src: global memory (including fat buffer) memref to read from.
- $srcIndices: indices into $src to read from for this thread.
- $dst: LDS memory memref to write to.
- $dstIndices: base indices into $dst to write to for the subgroup of this thread. The elements gathered by the subgroup will be written contiguously in order of lane ID starting at $dst[$dstIndices]. Byte-sized (ex. i8) or short-sized (ex. i16) types will be zero-padded/extended to 32 bits before being written. 96-bit types (ex. vector<3xf32>) will be zero-padded to 128 bits before being written. Only the offsets held by lane 0 are used.
- $transferType: type of the data to be transferred by each thread. This is used to determine the size of the data to be transferred and the number of threads in the subgroup. The transfer type must be a scalar type or a vector type with a single element type.
The $dst, along with its indices, points to the memory location the subgroup of this thread
will write to.
Note: only supported on gfx9 and gfx10.
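The padding and contiguous-write rules for $dstIndices can be sketched as simple arithmetic. Both helper names are hypothetical, and transfer widths other than those named above are assumed to be written unpadded.

```python
def lds_slot_bits(transfer_bits):
    """Bits each lane occupies in LDS after the padding rules described above."""
    if transfer_bits in (8, 16):
        return 32   # byte- and short-sized types are extended to 32 bits
    if transfer_bits == 96:
        return 128  # 96-bit types are padded to 128 bits
    return transfer_bits  # assumption: other widths are written as-is


def lds_byte_offset(base_byte_offset, lane_id, transfer_bits):
    """Byte offset written by lane_id, counting contiguously from lane 0's base."""
    return base_byte_offset + lane_id * (lds_slot_bits(transfer_bits) // 8)


print(lds_byte_offset(0, 3, 16))   # lane 3 with an i16 transfer
print(lds_byte_offset(64, 2, 96))  # lane 2 with a vector<3xf32> transfer
```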
amdgpu.lds_barrier
amdgpu.memory_counter_wait
amdgpu.mfma - MLIR wrapper for CDNA mfma instructions
This op has support for result type inference.
Attributes
- m - Single, I32Attr, 32-bit signless integer attribute
- n - Single, I32Attr, 32-bit signless integer attribute
- k - Single, I32Attr, 32-bit signless integer attribute
- blocks - Single, I32Attr, 32-bit signless integer attribute
- cbsz - Single, I32Attr, 32-bit signless integer attribute
- abid - Single, I32Attr, 32-bit signless integer attribute
- blgp - Single, AMDGPU_MFMAPermBAttr, The possible permutations of the lanes storing B available in an MFMA
- reducePrecision - Optional, UnitAttr, unit attribute
- negateA - Optional, UnitAttr, unit attribute
- negateB - Optional, UnitAttr, unit attribute
- negateC - Optional, UnitAttr, unit attribute
Operands
- sourceA - Single, MFMAInTypes, 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
- sourceB - Single, MFMAInTypes, 32-bit float or 64-bit float or 32-bit signless integer or 64-bit signless integer or vector of 32-bit float values of length 2 or vector of 16-bit float values of length 4/8 or vector of bfloat16 type values of length 2/4/8 or vector of 8-bit signless integer values of length 4/8/16 or vector of f8E5M2FNUZ type or f8E4M3FNUZ type values of length 8 or vector of f8E5M2 type or f8E4M3FN type values of length 8/32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
- destC - Single, MFMAOutTypes, 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4
Results
- destD - Single, MFMAOutTypes, 64-bit float or vector of 32-bit float values of length 4/16/32 or vector of 32-bit signless integer values of length 4/16/32 or vector of 64-bit float values of length 4
Description
The amdgpu.mfma op is an MLIR wrapper around intrinsics
for various mfma instructions in the CDNA architecture, which perform
multiple outer products in order to allow fast matrix multiplication.
The wrapper will select an appropriate mfma instruction, if one is available,
based on the provided m, k, n, and blocks attributes, along with the
types of the source and destination arguments.
For information on the layouts of the input and output matrices (which are stored
in sourceA, sourceB, destC, and destD), see the CDNA ISA documentation.
The cbsz, abid, and blgp parameters control how the lanes of the wave
are permuted when matrix data is being loaded: blgp selects one of a number of
fixed permutations, cbsz specifies the log_2 of the number of chunks the lanes
holding sourceA are split into, and abid selects one of those chunks.
Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA
intrinsics that take an integer type of width 4K. For example,
one can provide a vector<4xi8> as an argument to an MFMA instruction that
logically takes 4 i8s but whose intrinsics are specified to take an i32.
In these cases, the bytes in the vector will be concatenated in little-endian
order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).
The negateA, negateB, and negateC flags are only supported for double-precision operations on gfx94x.
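The little-endian concatenation of a vector<4Kxi8> argument, and the relationship between cbsz and the number of chunks, can be sketched in Python. Both function names are hypothetical and exist only for this illustration.

```python
def pack_i8_vector(values):
    """Concatenate 8-bit lanes little-endian: values[0] lands in bits 7:0."""
    packed = 0
    for lane, value in enumerate(values):
        packed |= (value & 0xFF) << (8 * lane)
    return packed


def abid_chunk_count(cbsz):
    """cbsz is the log2 of the number of chunks the sourceA lanes are split into."""
    return 1 << cbsz


print(hex(pack_i8_vector([0x11, 0x22, 0x33, 0x44])))
print(abid_chunk_count(2))
```

With cbsz = 2 there are four chunks, so abid would select among chunks 0 through 3.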
amdgpu.packed_scaled_trunc - Round two floats into a packed vector of floats
Attributes
- index - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 7
Operands
- source - Single, anonymous/composite constraint, vector of 32-bit float or 16-bit float or bfloat16 type values of length 1/2
- scale - Single, F32, 32-bit float
- existing - Optional, anonymous/composite constraint, fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8
Results
- res - Single, anonymous/composite constraint, fixed-length vector of f8E5M2 type or f8E4M3FN type values of length 4 or fixed-length vector of f4E2M1FN type values of length 8
Description
Scale and round the inputs source (which is undefined if not
specified) into the low or high word (bottom two or top two) elements
of the returned vector, keeping the other two elements of existing
unchanged if present (or undefined if it was not passed in).
The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics take 32-bit wide packed vectors of float values.
amdgpu.packed_stoch_round_fp8 - Round float stochastically into a packed vector of 8-bit floats
Attributes
- storeIndex - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 3
Operands
- source - Single, F32, 32-bit float
- stochiasticParam - Single, I32, 32-bit signless integer
- existing - Optional, anonymous/composite constraint, fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4
Results
- res - Single, anonymous/composite constraint, fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4
Description
Round the input source, adding in stochiasticParam, and place it into
the element of res at index storeIndex.
If existing is passed in, elements of res other than the one at storeIndex
are copied from existing.
The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.
amdgpu.packed_trunc_2xfp8 - Round two floats into a packed vector of 8-bit floats
Attributes
- wordIndex - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 1
Operands
- sourceA - Single, F32, 32-bit float
- sourceB - Optional, F32, 32-bit float
- existing - Optional, anonymous/composite constraint, fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4
Results
- res - Single, anonymous/composite constraint, fixed-length vector of f8E4M3FNUZ type or f8E5M2FNUZ type or f8E4M3FN type or f8E5M2 type values of length 4
Description
Round the inputs sourceA and sourceB (which is undefined if not
specified) into the low or high word (bottom two or top two) elements
of the returned vector, keeping the other two elements of existing
unchanged if present (or undefined if it was not passed in).
The reason for this odd signature is that AMD GPUs cannot easily work with sub-registers, and so the conversion intrinsics (which are currently the only way to work with 8-bit float types) take packed vectors of 4 8-bit values.
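The way wordIndex selects which half of the result is overwritten can be sketched as follows. This models element placement only: `round_to_fp8` is a hypothetical stand-in for the hardware rounding step (identity by default), and `None` represents an undefined element.

```python
UNDEF = None  # stands in for an undefined element


def packed_trunc_2xfp8(source_a, source_b=UNDEF, existing=None, word_index=0,
                       round_to_fp8=lambda x: x):
    """Place two rounded values into the low (0) or high (1) word of a 4-vector."""
    if word_index not in (0, 1):
        raise ValueError("wordIndex must be 0 or 1")
    # Elements outside the selected word come from `existing`, or are undefined.
    res = list(existing) if existing is not None else [UNDEF] * 4
    res[2 * word_index] = round_to_fp8(source_a)
    res[2 * word_index + 1] = UNDEF if source_b is UNDEF else round_to_fp8(source_b)
    return res


print(packed_trunc_2xfp8(1.0, 2.0, existing=[9, 9, 9, 9], word_index=1))
print(packed_trunc_2xfp8(1.0))
```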
amdgpu.permlane_swap - AMDGPU permlane swap op
This op has support for result type inference.
Attributes
- row_length - Single, I32Attr, 32-bit signless integer attribute
- fetch_inactive - Single, BoolAttr, bool attribute
- bound_ctrl - Single, BoolAttr, bool attribute
Operands
- src - Single, AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Results
- result - Single, AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Description
High-level wrapper on rocdl.permlane{16,32}.swap variants for permutations
on rows of lanes in a subgroup.
Supports arbitrary int/float/vector types, which will be repacked to i32 and
one or more rocdl.permlane_swap ops during lowering.
Supported lane permutations:
- Swap the data between odd and even rows of 16 lanes
- Swap the data between the first 32 lanes and the last 32 lanes
Example:
%0 = amdgpu.permlane_swap %src 16 : f16
%1 = amdgpu.permlane_swap %src 32 { fetch_inactive = true, bound_ctrl = true } : f16
Operands:
- $src: Vector register to permute across lanes of the subgroup.
- $row_length: The length of a row to permute in number of lanes (valid values are 16 and 32).
- $fetch_inactive: Optional. Used to determine behavior of a fetch from a disabled lane.
  - fetch_inactive = false: If the source lane is disabled, use bound_ctrl to determine the source value.
  - fetch_inactive = true: If the source lane is disabled, fetch the source value anyway (ignoring bound_ctrl).
- $bound_ctrl: Optional. Used to determine what a thread should do if its source operand is from a disabled lane: use the value zero, or disable the write.
  - bound_ctrl = false: Do not write when source is from a disabled lane
  - bound_ctrl = true: Use zero as input if source is from a disabled lane
Note: Lowering is only supported on gfx950 and up.
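The two supported permutations can be modeled on a list of per-lane values. This is a simplified sketch that assumes a 64-lane subgroup with every lane active, so fetch_inactive and bound_ctrl play no role; the Python function below is hypothetical and merely shares the op's name.

```python
def permlane_swap(src, row_length):
    """Each lane receives the value held by its partner lane in the paired row."""
    if row_length not in (16, 32):
        raise ValueError("row_length must be 16 or 32")
    # Flipping the row_length bit of the lane ID pairs even rows with odd rows.
    return [src[lane ^ row_length] for lane in range(len(src))]


lanes = list(range(64))
print(permlane_swap(lanes, 16)[:4], permlane_swap(lanes, 16)[16:20])
print(permlane_swap(lanes, 32)[:4], permlane_swap(lanes, 32)[32:36])
```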
amdgpu.raw_buffer_atomic_cmpswap - Raw Buffer Atomic compare-and-swap
This op has support for result type inference.
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- src - Single, AnyType, any type
- cmp - Single, AnyType, any type
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Results
- value - Single, AnyType, any type
Description
The amdgpu.raw_buffer_atomic_cmpswap op is a wrapper around the
buffer-based atomic compare-and-swap available on AMD GPUs.
The index into the buffer is computed as for memref.store with the addition
of indexOffset (which is used to aid in emitting vectorized code) and,
if present, sgprOffset (which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref's element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.raw_buffer_atomic_fadd - Raw Buffer Floating-point Atomic Add (MI-* only)
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- value - Single, anonymous/composite constraint, 32-bit float or vector of 16-bit float or bfloat16 type values of length 2
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Description
The amdgpu.raw_buffer_atomic_fadd op is a wrapper around the
buffer-based atomic floating point addition available on the MI-* series
of AMD GPUs.
The index into the buffer is computed as for memref.store with the addition
of indexOffset (which is used to aid in emitting vectorized code) and,
if present, sgprOffset (which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref's element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.raw_buffer_atomic_fmax - Raw Buffer Floating-point Atomic Max (non-GFX9)
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- value - Single, anonymous/composite constraint, 32-bit float or 64-bit float
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Description
The amdgpu.raw_buffer_atomic_fmax op is a wrapper around the
buffer-based atomic floating point max available on AMD GPUs (except GFX9).
The index into the buffer is computed as for memref.store with the addition
of indexOffset (which is used to aid in emitting vectorized code) and,
if present, sgprOffset (which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref's element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.raw_buffer_atomic_smax - Raw Buffer Signed Integer Atomic Max
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- value - Single, I32, 32-bit signless integer
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Description
The amdgpu.raw_buffer_atomic_smax op is a wrapper around the
buffer-based atomic signed integer max available on AMD GPUs.
The index into the buffer is computed as for memref.store with the addition
of indexOffset (which is used to aid in emitting vectorized code) and,
if present, sgprOffset (which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref's element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.raw_buffer_atomic_umin - Raw Buffer Unsigned Integer Atomic Min
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- value - Single, I32, 32-bit signless integer
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Description
The amdgpu.raw_buffer_atomic_umin op is a wrapper around the
buffer-based atomic unsigned integer min available on AMD GPUs.
The index into the buffer is computed as for memref.store with the addition
of indexOffset (which is used to aid in emitting vectorized code) and,
if present, sgprOffset (which is added after bounds checks and includes
any non-zero offset on the memref type).
All indexing components are given in terms of the memref's element size, not the byte lengths required by the intrinsic.
Out of bounds atomic operations are ignored in hardware.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.raw_buffer_load - Raw Buffer load, exposing GCN features
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Results
- value - Single, AnyType, any type
Description
The amdgpu.raw_buffer_load op is a wrapper around the buffer load intrinsics
available on AMD GPUs, including extensions in newer GPUs.
The index into the buffer is computed as for memref.load with the addition
of indexOffset and sgprOffset (which may or may not be considered
in bounds checks and includes any offset present on the memref type if it's
non-zero).
All indices and offsets are in units of the memref's data type and are converted to bytes during lowering.
When a load is out of bounds, the instruction returns zero.
Partially out-of-bounds loads have chipset-dependent behavior: whether reading
2 elements starting at index 7 of a memref<8xf32> returns the last element
in the first vector component depends on the architecture.
The memref struct is converted into a buffer resource (a V#) and the arguments are translated to intrinsic arguments as follows:
- The base address of the buffer is the base address of the memref
- The stride is 0 to enable raw mode
- The number of records is the size of the memref, in bytes. In the case of dynamically-shaped memrefs, this is computed at runtime as max_d (size(d) * stride(d)) * sizeof(elementType(memref))
- The offset enable bit is 1, the index enable bit is 0.
- The thread ID addition bit is off
- If boundsCheck is false and the target chipset is RDNA, OOB_SELECT is set to 2 to disable bounds checks, otherwise it is 3
- The cache coherency bits are off
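The element-indexed addressing and the zero-on-out-of-bounds behavior described above can be sketched on a flat Python list. This is a simplified model, not the lowering itself: the function below is hypothetical, bounds checking is assumed to be enabled, and sgpr_offset is included in the bounds check here even though the text notes that the hardware may or may not do so.

```python
def raw_buffer_load(buffer, indices, strides, index_offset=0, sgpr_offset=0):
    """Return the addressed element, or zero when the access is out of bounds."""
    # Linearize as memref.load would, in units of elements rather than bytes.
    linear = sum(i * s for i, s in zip(indices, strides))
    element = linear + index_offset + sgpr_offset
    if 0 <= element < len(buffer):
        return buffer[element]
    return 0


buf = list(range(100, 108))  # backing storage for a 2x4 row-major memref
print(raw_buffer_load(buf, [1, 2], [4, 1]))
print(raw_buffer_load(buf, [1, 2], [4, 1], index_offset=1))
print(raw_buffer_load(buf, [2, 0], [4, 1]))
```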
amdgpu.raw_buffer_store - Raw Buffer Store, exposing GCN features
Attributes
- boundsCheck - Single, BoolAttr, bool attribute
- indexOffset - Optional, I32Attr, 32-bit signless integer attribute
Operands
- value - Single, AnyType, any type
- memref - Single, AnyMemRef, memref of any type values
- indices - Variadic, I32, variadic of 32-bit signless integer
- sgprOffset - Optional, I32, 32-bit signless integer
Description
The amdgpu.raw_buffer_store op is a wrapper around the buffer store
intrinsics available on AMD GPUs, including extensions in newer GPUs.
The store index is computed as in memref.store with the addition of
indexOffset (which is included for uniformity with atomics and may be useful
when writing vectorized code) and sgprOffset (which is added after bounds
checks and implicitly includes the offset of the memref type if non-zero).
All index components are in terms of the elements of the memref, not bytes,
and are scaled up appropriately.
Out of bounds stores are ignored in hardware. Whether a vector write that includes some in-bounds and some out-of-bounds components is partially completed is chipset-dependent.
See amdgpu.raw_buffer_load for a description of how the underlying
instruction is constructed.
amdgpu.scaled_ext_packed - Extend a vector of packed floating point values
Attributes
- index - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 7
Operands
- source - Single, anonymous/composite constraint, vector of f8E5M2 type or f8E4M3FN type values of length 1/2/3/4 or vector of f4E2M1FN type values of length 1/2/3/4/5/6/7/8
- scale - Single, F32, 32-bit float
Results
- res - Single, anonymous/composite constraint, fixed-length vector of 32-bit float values of length 2 or fixed-length vector of 16-bit float values of length 2 or fixed-length vector of bfloat16 type values of length 2
Description
Extend and scale two packed floats in source[index] to two floats and
return them.
This rather unusual signature arises from the fact that AMD GPUs cannot easily work with sub 32-bit quantities, so the compiler intrinsics for extending 8-bit floats (which are, currently, the only way to work with this operation) take packed vectors of 2 such floats.
If the passed-in vector has fewer than two elements, or the input is scalar, the remaining values in the <2 x i8> will be filled with undefined values as needed.
amdgpu.scaled_mfma - MLIR wrapper for CDNA scaled mfma instructions
This op has support for result type inference.
Attributes
- m - Single, I32Attr, 32-bit signless integer attribute
- n - Single, I32Attr, 32-bit signless integer attribute
- k - Single, I32Attr, 32-bit signless integer attribute
- scalesIdxA - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 3
- scalesIdxB - Single, I32Attr, 32-bit signless integer attribute whose value is non-negative whose maximum value is 3
Operands
- sourceA - Single, ScaledMFMAInTypes, vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
- sourceB - Single, ScaledMFMAInTypes, vector of f8E5M2 type or f8E4M3FN type values of length 32 or vector of f6E2M3FN type or f6E3M2FN type or f4E2M1FN type values of length 32
- destC - Single, ScaledMFMAOutTypes, vector of 32-bit float values of length 4/16
- scalesA - Single, anonymous/composite constraint, f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4
- scalesB - Single, anonymous/composite constraint, f8E8M0FNU type or fixed-length vector of f8E8M0FNU type values of length 4
Results
- destD - Single, ScaledMFMAOutTypes, vector of 32-bit float values of length 4/16
Description
The amdgpu.scaled_mfma op is an MLIR wrapper around intrinsics
for various scaled versions of mfma instructions in the CDNA architecture, which perform
multiple outer products in order to allow fast matrix multiplication.
The wrapper will select an appropriate mfma instruction, if one is available,
based on the provided m, n, and k attributes, along with the
types of the source and destination arguments.
Note, this wrapper allows specifying vector<4Kxi8> arguments to MFMA
intrinsics that take an integer type of width 4K. For example,
one can provide a vector<4xi8> as an argument to an MFMA instruction that
logically takes 4 i8s but whose intrinsics are specified to take an i32.
In these cases, the bytes in the vector will be concatenated in little-endian
order (that is, v[0] will go to arg[7:0], v[1] to arg[15:8] and so on).
This wrapper takes inspiration from amdgpu.mfma, but has some key differences:
- amdgpu.scaled_mfma operates on fp4 (f4E2M1FN), fp6 (f6E2M3FN and f6E3M2FN) and fp8 (f8E4M3FN and f8E5M2) types using either M=N=16, K=128 or M=N=32, K=64 as their tile size.
- amdgpu.scaled_mfma does not support broadcasting. So, cbsz, abid, and blgp are omitted from this wrapper.
- The negateA, negateB, and negateC flags in amdgpu.mfma are only supported for double-precision operations on gfx94x and so are not included here.
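The two supported tile sizes line up with the destC/destD vector lengths of 4 and 16 listed for this op if the M x N result tile is spread evenly over a wavefront. The Python sketch below checks that arithmetic; the 64-lane wavefront and the function name are assumptions of this sketch, not something the documentation above states.

```python
WAVE_SIZE = 64  # assumption: the result tile is distributed over 64 lanes

SCALED_MFMA_TILES = {(16, 16, 128), (32, 32, 64)}  # (m, n, k) tile sizes named above


def scaled_mfma_result_length(m, n, k):
    """Per-lane length of destC/destD for a supported tile, else ValueError."""
    if (m, n, k) not in SCALED_MFMA_TILES:
        raise ValueError(f"unsupported scaled_mfma tile: {(m, n, k)}")
    return m * n // WAVE_SIZE


print(scaled_mfma_result_length(16, 16, 128))
print(scaled_mfma_result_length(32, 32, 64))
```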
amdgpu.sched_barrier
amdgpu.swizzle_bitmode - AMDGPU ds_swizzle op, bitmode variant
This op has support for result type inference.
Attributes
- and_mask - Single, I32Attr, 32-bit signless integer attribute
- or_mask - Single, I32Attr, 32-bit signless integer attribute
- xor_mask - Single, I32Attr, 32-bit signless integer attribute
Operands
- src - Single, AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Results
- result - Single, AnyIntegerOrFloatOr1DVector, Integer or Float or fixed-length vector of Integer or Float values of ranks 1
Description
High-level wrapper on bitmode rocdl.ds_swizzle op, masks are represented
as separate fields so user won't need to do manual bitpacking.
Supports arbitrary int/float/vector types, which will be repacked to i32 and
one or more rocdl.ds_swizzle ops during lowering.
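To give a feel for what the three masks do and for the bitpacking this op hides, here is a Python sketch based on the bitmode ds_swizzle behavior as described in AMD's ISA documentation, not on anything stated on this page: each mask is 5 bits wide, lanes are permuted within groups of 32, and the packed offset places and_mask in bits 4:0, or_mask in bits 9:5, and xor_mask in bits 14:10. Treat those details, the function names, and the all-lanes-active simplification as assumptions.

```python
def swizzle_bitmode(src, and_mask, or_mask, xor_mask):
    """Permute per-lane values within groups of 32 lanes using the three masks."""
    out = []
    for lane in range(len(src)):
        group_base = lane & ~0x1F  # first lane of this group of 32
        source = (((lane & 0x1F) & and_mask) | or_mask) ^ xor_mask
        out.append(src[group_base + (source & 0x1F)])
    return out


def pack_swizzle_offset(and_mask, or_mask, xor_mask):
    """The manual bitpacking that separate mask attributes make unnecessary."""
    return (and_mask & 0x1F) | ((or_mask & 0x1F) << 5) | ((xor_mask & 0x1F) << 10)


lanes = list(range(64))
print(swizzle_bitmode(lanes, 0x1F, 0, 1)[:4])  # swap neighbouring lanes
print(hex(pack_swizzle_offset(0x1F, 0, 1)))
```

With and_mask = 0x1F and or_mask = 0, only xor_mask has an effect, which makes butterfly-style exchanges easy to express.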
amdgpu.transpose_load - MLIR wrapper for CDNA Transpose Load instructions
Operands
- src - Single, AnyMemRef, memref of any type values
- srcIndices - Variadic, Index, variadic of index
Results
- result - Single, anonymous/composite constraint, vector of any type values
Description
The amdgpu.transpose_load op is a wrapper around the ds_read_tr instructions.
The transpose load op represents a subgroup load from LDS memory,
where the subgroup of threads collectively reads a matrix from the source
memref, with each thread reading a vector of the matrix, and gets a transposed matrix
as the result. That is, each thread reads a vector of the column-major matrix at different
indices, and the thread's read result is a vector of the corresponding row of the transposed
matrix.
This op is a direct wrapper around the ROCDL ds_read_tr family intrinsics. Please refer
to the CDNA4 ISA documentation for more details about its exact semantics.
Format example:
%0 = amdgpu.transpose_load %src[%srcIndices] : memref<128x256xf16> -> vector<4xf16>
Operands:
- $src: LDS memref to read from.
- $srcIndices: indices into $src to read from for this thread.
- $result: target register this transpose load instruction will write to.
Note: Lowering is only supported on gfx950 and up.
amdgpu.wmma - MLIR wrapper for RDNA3 wmma instructions
This op has support for result type inference.
Attributes
- subwordOffset - Single, I32Attr, 32-bit signless integer attribute whose minimum value is 0 whose maximum value is 1
- unsignedA - Optional, UnitAttr, unit attribute
- unsignedB - Optional, UnitAttr, unit attribute
- clamp - Optional, UnitAttr, unit attribute
Operands
- sourceA - Single, WMMAInTypes, vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16
- sourceB - Single, WMMAInTypes, vector of 16-bit float or bfloat16 type or 8-bit signless integer or 8-bit signed integer or 8-bit unsigned integer or 4-bit signless integer or 4-bit signed integer or 4-bit unsigned integer or f8E4M3FN type or f8E5M2 type values of length 4/8/16
- destC - Single, WMMAOutTypes, vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16
Results
- destD - Single, WMMAOutTypes, vector of 32-bit float or 32-bit signless integer values of length 4/8 or vector of 16-bit float or bfloat16 type values of length 4/8/16
Description
The amdgpu.wmma op is an MLIR wrapper around intrinsics
for various wmma instructions in the RDNA3 or RDNA4 architecture, which
perform a 16x16 * 16x16 matrix multiplication for different data types.
Note that in gfx12/RDNA4, there is also a 16x32 * 32x16 instruction for 4-bit
integer inputs.
On gfx11/RDNA3, when emitting an f16->f16 (or bf16->bf16) wmma, the output is a 16xf16 (or 16xbf16) vector containing only 8 valid values:
- If subwordOffset is 0, then the output is stored at indices 0, 2, 4, ..., 14.
- If subwordOffset is 1, then the output is stored at indices 1, 3, 5, ..., 15.
On gfx12/RDNA4, the result is instead returned as a vector<8 x f16/bf16> where all values are valid and the subwordOffset must be 0, as it cannot be used.
unsignedA and unsignedB flag that the int8 LLVM inputs are unsigned.
The clamp flag is used to saturate the output of type T to numeric_limits<T>::max()
in case of overflow.
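The subwordOffset indexing on gfx11/RDNA3 and the clamp behavior can be illustrated with a short Python sketch. The helper names are hypothetical, and the clamp helper models only the saturation to the maximum value that the text describes.

```python
def wmma_valid_indices(subword_offset):
    """Indices of the 16-element result that hold valid values on gfx11/RDNA3."""
    if subword_offset not in (0, 1):
        raise ValueError("subwordOffset must be 0 or 1")
    return list(range(subword_offset, 16, 2))


def clamp_to_max(value, type_max):
    """Saturate an overflowing result to the maximum of the output type."""
    return min(value, type_max)


print(wmma_valid_indices(0))
print(wmma_valid_indices(1))
print(clamp_to_max(2**31 + 5, 2**31 - 1))
```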