ISA Extension Spec

Instructions

  • mld_w
  • mst_b
  • mst_h
  • mst_w
  • mzero
  • mmaqa_b
  • mmada_h
  • mmasa_w
  • fmmacc_b
  • fmmacc_h
  • fmmacc_s

Gem5 Implementation

Matrix Register Model

The matrix register model is defined in: src/arch/riscv/regs/mat.hh

  • NumMatRegs = 8
  • Matrix registers are exposed as m0 through m7
  • Storage uses MatRegContainer = gem5::MatStore<16, 4>
  • Each matrix register stores 4 * 4 * 32 bits = 512 bits (64 bytes):
    • 4 rows of 16 bytes each
    • 4 columns of 32-bit words per row

The same raw storage is reinterpreted as:

  • int8_t for mmaqa_b / fmmacc_b
  • int16_t / fp16 bit patterns for mmada_h / fmmacc_h
  • int32_t / fp32 bit patterns for mmasa_w / fmmacc_s
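
The reinterpretation above can be sketched in plain C++. This is a minimal illustration only: TileSketch, readLane and writeLane are hypothetical names that mirror the 4 x 16-byte layout and are not the gem5::MatStore API. Lanes are copied with memcpy so that the same bytes can be viewed at different widths without aliasing problems.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for one matrix register: 4 rows of 16 raw bytes.
// It only mirrors the size described above, not the real container.
constexpr std::size_t kRowBytes = 16;
constexpr std::size_t kNumRows = 4;

struct TileSketch
{
    std::array<std::array<std::uint8_t, kRowBytes>, kNumRows> rows{};
};

// Read lane `idx` of row `row` as element type T
// (for example int8_t, int16_t, int32_t or float).
template <typename T>
T readLane(const TileSketch &t, std::size_t row, std::size_t idx)
{
    assert(row < kNumRows && (idx + 1) * sizeof(T) <= kRowBytes);
    T v;
    std::memcpy(&v, t.rows[row].data() + idx * sizeof(T), sizeof(T));
    return v;
}

// Write lane `idx` of row `row` from a value of element type T.
template <typename T>
void writeLane(TileSketch &t, std::size_t row, std::size_t idx, T v)
{
    assert(row < kNumRows && (idx + 1) * sizeof(T) <= kRowBytes);
    std::memcpy(t.rows[row].data() + idx * sizeof(T), &v, sizeof(T));
}
```

With this layout a row holds 16 int8_t lanes, 8 int16_t lanes, or 4 int32_t / fp32 lanes, which matches the dot-product lengths of the arithmetic instructions.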

Matrix Register Row Packing Helpers

mat.hh also contains the helper functions used by the memory micro-ops:

  • row byte serialization
  • row word deserialization
  • per-macro-instruction load state support

Instruction

decoder.isa:

  1. Check and remove ENABLE_QMAT
  2. The RD name used in decode is not valid

Matrix instruction format support lives in: src/arch/riscv/isa/formats/matrix.isa

This file includes:

  • regular matrix op templates
  • matrix macro instruction templates
  • matrix load row micro-op templates
  • matrix store row micro-op templates

Arithmetic Instructions

Each arithmetic instruction behaves as follows:

  • mzero clears the destination matrix register
  • mmaqa_b performs 16-element signed byte dot products per output element
  • mmada_h performs 8-element signed halfword dot products per output element
  • mmasa_w performs 4-element signed word dot products per output element
  • fmmacc_s performs 4-element fp32 dot products per output element
  • fmmacc_h converts fp16 bit patterns to fp32 and accumulates in fp32
  • fmmacc_b currently treats 8-bit lanes as integer values converted to float
    before accumulation
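
As a rough illustration of the integer case, the sketch below models the 16-element signed byte dot product of mmaqa_b. Several points are assumptions, not taken from the implementation: the operand orientation (row i of the first source dotted with row j of the second source), the 32-bit accumulator type, and the names RowsI8, AccI32 and mmaqaSketch. Overflow behaviour is not described in this document, so the sketch asserts instead of guessing between wrap-around and saturation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical views of the tiles: sources as 4 rows of 16 signed bytes,
// accumulator as 4 x 4 signed 32-bit words.
using RowsI8 = std::array<std::array<std::int8_t, 16>, 4>;
using AccI32 = std::array<std::array<std::int32_t, 4>, 4>;

// acc[i][j] += sum over k of a[i][k] * b[j][k], with k = 0..15.
// The row-by-row orientation is an assumption of this sketch.
void mmaqaSketch(AccI32 &acc, const RowsI8 &a, const RowsI8 &b)
{
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            std::int64_t sum = 0;
            for (std::size_t k = 0; k < 16; ++k) {
                sum += std::int64_t{a[i][k]} * std::int64_t{b[j][k]};
            }
            const std::int64_t wide = std::int64_t{acc[i][j]} + sum;
            // Overflow policy is unspecified here, so refuse rather than guess.
            assert(wide >= INT32_MIN && wide <= INT32_MAX);
            acc[i][j] = static_cast<std::int32_t>(wide);
        }
    }
}
```

The same loop shape would apply to mmada_h and mmasa_w with 8 and 4 elements per dot product respectively, under the same assumptions.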

Memory Access Instructions

Matrix memory instructions are expanded into row-level micro-ops during decoding:

  • mld_w expands into 4 row-load micro-ops
  • mst_b, mst_h, mst_w each expand into 4 row-store micro-ops
    • mst_w: 4 x 32-bit words per row
    • mst_h: each 32-bit word split into low 16 bits then high 16 bits
    • mst_b: each 32-bit word split into 4 little-endian bytes
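
The per-row splitting used by mst_h and mst_b can be sketched as follows. The names RowWords, splitRowHalves and splitRowBytes are hypothetical and do not correspond to the helpers in mat.hh. The sketch only shows the element order described above: low half before high half, and bytes from least to most significant.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One matrix row viewed as 4 x 32-bit words (hypothetical alias).
using RowWords = std::array<std::uint32_t, 4>;

// mst_h order: for each word, the low 16 bits, then the high 16 bits.
std::vector<std::uint16_t> splitRowHalves(const RowWords &row)
{
    std::vector<std::uint16_t> out;
    out.reserve(8);
    for (std::uint32_t w : row) {
        out.push_back(static_cast<std::uint16_t>(w & 0xFFFFu));
        out.push_back(static_cast<std::uint16_t>(w >> 16));
    }
    assert(out.size() == 8);
    return out;
}

// mst_b order: for each word, 4 bytes from least to most significant.
std::vector<std::uint8_t> splitRowBytes(const RowWords &row)
{
    std::vector<std::uint8_t> out;
    out.reserve(16);
    for (std::uint32_t w : row) {
        for (unsigned shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<std::uint8_t>((w >> shift) & 0xFFu));
        }
    }
    assert(out.size() == 16);
    return out;
}
```

For example, the word 0x11223344 yields the halfwords 0x3344, 0x1122 and the bytes 0x44, 0x33, 0x22, 0x11.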

Architectural visibility:

mld_w makes its result visible atomically, only once the whole load has finished:

  • the 4 row loads fill a temporary tile buffer
  • the matrix destination register is only committed after all rows complete
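
A minimal sketch of this commit rule, assuming a staging buffer and a per-row completion mask. TileLoadStateSketch and its members are hypothetical names and are not the load state type used by the micro-ops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Row = std::array<std::uint32_t, 4>;
using Tile = std::array<Row, 4>;

// Hypothetical per-macro-instruction load state: rows land in a staging
// tile, and the destination register is written only after all 4 arrive.
class TileLoadStateSketch
{
  public:
    // Record one completed row load. Returns true once all rows are done.
    bool completeRow(std::size_t row, const Row &data)
    {
        assert(row < 4);
        staging[row] = data;
        doneMask |= (1u << row);
        return doneMask == 0xFu;
    }

    // Copy the staged tile into `dest` if every row has completed.
    // Otherwise `dest` is left untouched and false is returned.
    bool commitTo(Tile &dest)
    {
        if (doneMask != 0xFu) {
            return false;
        }
        dest = staging;
        doneMask = 0;
        return true;
    }

  private:
    Tile staging{};
    unsigned doneMask = 0;
};
```

With this shape, a load that is interrupted after some rows leaves the old register contents intact, because only commitTo ever writes the destination.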