• Target access pattern: accesses that are contiguous in the virtual address space or separated by a constant stride
  • Well suited to prefetching accesses over dense matrices and arrays (see the example after this list)
  • Widely used in commercial processors
  • Originated in the IBM System/370 series
  • Less effective for pointer-based data structures (linked lists, etc.)
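
For concreteness, a minimal example of the access pattern a stride prefetcher targets: walking one column of a row-major matrix issues loads whose addresses differ by a constant stride. The matrix dimensions below are illustrative assumptions, not values from the source.

```cpp
#include <cstdio>

// Column walk over a row-major matrix: consecutive loads are separated by a
// constant stride of COLS * sizeof(double) bytes, which a stride prefetcher
// can learn after a few accesses. (Dimensions are illustrative.)
constexpr int ROWS = 1024, COLS = 1024;
static double m[ROWS][COLS];

int main() {
    double sum = 0.0;
    for (int r = 0; r < ROWS; ++r)
        sum += m[r][0];   // address advances by COLS * 8 bytes per iteration
    std::printf("stride = %zu bytes, sum = %f\n", COLS * sizeof(double), sum);
    return 0;
}
```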

Key issues a stride prefetcher implementation must address:

  1. Distinguishing multiple interleaved stride access streams
    • Solution: the Reference Prediction Table used by stride prefetchers
  2. Choosing the prefetch degree / prefetch depth: how many consecutive stride prefetches to issue per trigger (a sketch follows this list)
    • Not too small: otherwise the benefit of prefetching is limited
    • Not too large: otherwise early prefetched cachelines may be evicted before they are accessed, and short streams may pollute the cache
    • Solution: train a separate prefetch degree for each prefetch stream
  3. An aggressive stride prefetcher can pollute the cache when a prefetch stream has low accuracy
    • Solution: introduce a stream buffer / prefetch buffer to hold prefetched data; note that this does not reduce the bandwidth consumed by prefetches or the extra energy they cost
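
A hedged sketch of the prefetch-degree knob from item 2: once a stream's stride is confirmed, `degree` requests are issued beyond the demand address. The function and parameter names are illustrative, not taken from any cited design.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: issue `degree` prefetches beyond the demand address of a
// confirmed stream. A larger degree hides more latency but risks evicting
// lines before use and polluting the cache on short streams, which motivates
// per-stream training of the degree.
std::vector<uint64_t> stride_prefetch_addrs(uint64_t demand_addr, int64_t stride, int degree) {
    std::vector<uint64_t> addrs;
    for (int k = 1; k <= degree; ++k)
        addrs.push_back(demand_addr + static_cast<uint64_t>(k * stride));
    return addrs;
}
```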

Sequential Prefetcher

A. J. Smith. “Sequential Program Prefetching in Memory Hierarchies.” Computer, v. 11, no. 12, 1978. DOI: 10.1109/C-M.1978.218016.

Stride Prefetcher

Originally proposed in:
J.-L. Baer and T.-F. Chen. “An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty.” In Proc. of Supercomputing, 1991. DOI: 10.1145/125826.125932.

Analysis of sequential and stride prefetchers:
F. Dahlgren and P. Stenstrom. “Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors.” In Proc. of the 1st IEEE Symposium on High-Performance Computer Architecture, 1995. DOI: 10.1109/HPCA.1995.386554.

Core design: Reference Prediction Table

  • Tracks the access stride per load-instruction PC
  • Whenever the same stride is observed on two consecutive accesses, the last address and the stride are used to compute one or more prefetch addresses (sketch below)
  • Well suited to prefetching access patterns over multidimensional arrays or arrays of structs
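
A minimal sketch of the table described above, assuming a direct-mapped 256-entry table and a confirm-on-second-repeat rule; these sizes and field names are illustrative, not details of Baer and Chen's design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Minimal Reference Prediction Table sketch (sizes and field names are
// illustrative). One entry per load PC, direct-mapped; an entry remembers the
// last address and last stride and issues a prefetch only once the same
// stride has been observed on two consecutive accesses.
struct RPTEntry {
    uint64_t tag = 0;        // load PC that owns this entry
    uint64_t last_addr = 0;  // address of this PC's previous access
    int64_t  stride = 0;     // most recently observed stride
    int      confirmed = 0;  // consecutive repeats of the same stride
    bool     valid = false;
};

class ReferencePredictionTable {
    static constexpr size_t SIZE = 256;   // assumed capacity
    std::array<RPTEntry, SIZE> table_{};

public:
    // Train on a (PC, address) pair; return a prefetch address once the
    // stride repeats across two consecutive accesses by the same PC.
    std::optional<uint64_t> access(uint64_t pc, uint64_t addr) {
        RPTEntry& e = table_[pc % SIZE];
        if (!e.valid || e.tag != pc) {          // allocate (or steal) the entry
            e = RPTEntry{pc, addr, 0, 0, true};
            return std::nullopt;
        }
        int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
        if (stride != 0 && stride == e.stride)
            ++e.confirmed;                      // same stride as the previous pair
        else
            e.confirmed = 0;
        e.stride = stride;
        e.last_addr = addr;
        if (e.confirmed >= 1)                   // stride seen twice in a row
            return addr + static_cast<uint64_t>(e.stride);
        return std::nullopt;
    }
};
```

Fed the addresses a, a + 64, a + 128 by one PC, the table confirms the 64-byte stride on the third access and returns a + 192 as the prefetch target; in a real prefetcher this step would fan out into several addresses according to the prefetch degree discussed earlier.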

Variants

  1. Compactly representing multiple strides:
    Y. Ishii, M. Inaba and K. Hiraki. “Access Map Pattern Matching for High Performance Data Cache Prefetch.” Journal of Instruction-Level Parallelism, v. 13, 2011.
  2. Extending prefetching to more irregular access patterns by predicting stride lengths:
    S. Sair, T. Sherwood, and B. Calder. “A Decoupled Predictor-Directed Stream Prefetching Architecture.” IEEE Transactions on Computers, v. 52, no. 3, 2003. DOI: 10.1109/
  3. Using a simple state machine to track a histogram of recent stream lengths and adaptively pick an appropriate prefetch depth for each prefetch stream, so that the prefetcher remains effective even for short streams spanning only a few addresses (a sketch follows this list):
    I. Hur and C. Lin. “Memory Prefetching Using Adaptive Stream Detection.” In Proc. of the 39th Annual ACM/IEEE International Symposium on Microarchitecture, 2006. DOI: 10.1109/MICRO.2006.32.
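
A rough sketch of the adaptive-depth idea in item 3: record how long each stream ran before it ended, then choose the deepest prefetch depth that most observed streams would still cover. The bucket count and the selection rule are simplifying assumptions, not the mechanism from Hur and Lin's paper.

```cpp
#include <array>

// Rough sketch of adaptive prefetch-depth selection from a stream-length
// histogram (the bucket count and selection rule are simplifying assumptions).
class AdaptiveDepth {
    static constexpr int MAX_LEN = 16;
    std::array<int, MAX_LEN + 1> hist_{};  // hist_[n] = streams that ended after n accesses
    int total_ = 0;

public:
    // Record the final length of a stream once it dies.
    void stream_ended(int length) {
        if (length < 1) return;
        if (length > MAX_LEN) length = MAX_LEN;
        ++hist_[length];
        ++total_;
    }

    // Choose the deepest depth d such that at least `frac` of observed streams
    // ran for d or more accesses, so mostly-short streams get a shallow depth.
    int choose_depth(double frac = 0.5) const {
        if (total_ == 0) return 1;
        int at_least_d = 0;
        for (int d = MAX_LEN; d >= 1; --d) {
            at_least_d += hist_[d];
            if (static_cast<double>(at_least_d) / total_ >= frac)
                return d;
        }
        return 1;
    }
};
```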

Stream Buffer / Prefetch Buffer

  • First proposed in:
    N. P. Jouppi. “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.” In Proc. of the 17th Annual International Symposium on Computer Architecture, 1990. DOI: 10.1145/325164.325162.
  • Using stream buffers to fully replace the L2 cache:
    S. Palacharla and R. E. Kessler. “Evaluating Stream Buffers As a Secondary Cache Replacement.” In Proc. of the 21st Annual International Symposium on Computer Architecture, 1994.
  • Further issues and optimizations:
    • C. Zhang and S. A. McKee. “Hardware-Only Stream Prefetching and Dynamic Access Ordering.” In Proc. of the 14th Annual International Conference on Supercomputing, 2000. DOI: 10.1145/335231.335247.
    • S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham. “Effective Stream-Based and Execution-Based Data Prefetching.” In Proc. of the 18th Annual International Conference on Supercomputing, 2004. DOI: 10.1145/1006209.1006211.
  • Each stream buffer holds cachelines from a single stream
  • Stream buffers are probed in parallel with the L1 data cache
  • A hit in a stream buffer typically moves the hit cacheline into the L1 data cache

Organization

  1. Implemented as a FIFO, where only the head cacheline of each stream buffer can be accessed (sketch below)
  2. Organized set-associatively; when the stride detection mechanism observes a new prefetch stream, all not-yet-accessed blocks of the old stream are discarded, typically under an LRU or round-robin replacement policy
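
A minimal sketch of the FIFO organization in item 1: only the head entry of each buffer is compared against the missing address; a head hit hands the line to the L1 data cache and the buffer refills its tail. The depth, line size, and unit-stride refill are illustrative assumptions.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative FIFO stream buffer (item 1 above): only the head entry of each
// buffer is checked on an L1D miss; a head hit promotes that line into the
// L1D and the buffer prefetches one more line at its tail. The depth, line
// size, and unit-stride refill are assumptions of the sketch.
constexpr uint64_t LINE  = 64;
constexpr size_t   DEPTH = 4;

struct StreamBuffer {
    std::deque<uint64_t> lines;   // prefetched cacheline addresses, head = front()
    int64_t stride = LINE;        // unit-stride stream for simplicity

    void allocate(uint64_t miss_addr) {        // start a new stream past the miss
        lines.clear();
        for (size_t k = 1; k <= DEPTH; ++k)
            lines.push_back(miss_addr + k * LINE);
    }

    // Probe on an L1D miss: a head hit moves the line to the L1D and refills the tail.
    bool probe(uint64_t addr, std::vector<uint64_t>& to_l1d) {
        uint64_t line_addr = addr & ~(LINE - 1);
        if (lines.empty() || lines.front() != line_addr)
            return false;                       // only the head entry is visible
        to_l1d.push_back(lines.front());        // hand the hit line to the L1D
        uint64_t next = lines.back() + static_cast<uint64_t>(stride);
        lines.pop_front();
        lines.push_back(next);                  // keep the FIFO full
        return true;
    }
};
```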