• Target access pattern: accesses that are contiguous in the virtual address space or separated by a constant stride
  • Well suited to prefetching accesses over dense matrices and arrays (see the example after this list)
  • Widely used in commercial processors
  • Originated in the IBM System/370 series
  • Less effective for pointer-based data structures (linked lists, etc.)
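
For concreteness, a minimal example of the access pattern a stride prefetcher targets: walking one column of a row-major matrix issues loads whose addresses differ by a constant stride. The matrix dimensions below are illustrative assumptions, not values from the source.

```cpp
#include <cstdio>

// Column walk over a row-major matrix: consecutive loads are separated by a
// constant stride of COLS * sizeof(double) bytes, which a stride prefetcher
// can learn after a few accesses. (Dimensions are illustrative.)
constexpr int ROWS = 1024, COLS = 1024;
static double m[ROWS][COLS];

int main() {
    double sum = 0.0;
    for (int r = 0; r < ROWS; ++r)
        sum += m[r][0];   // address advances by COLS * 8 bytes per iteration
    std::printf("stride = %zu bytes, sum = %f\n", COLS * sizeof(double), sum);
    return 0;
}
```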

Key issues a stride prefetcher implementation must address:

  1. Distinguishing multiple interleaved stride access streams
    • Solution: the Reference Prediction Table used by stride prefetchers
  2. Choosing the prefetch degree / prefetch depth: how many consecutive stride prefetches to issue per trigger (a sketch follows this list)
    • Not too small: otherwise the benefit of prefetching is limited
    • Not too large: otherwise early prefetched cachelines may be evicted before they are accessed, and short streams may pollute the cache
    • Solution: train a separate prefetch degree for each prefetch stream
  3. An aggressive stride prefetcher can pollute the cache when a prefetch stream has low accuracy
    • Solution: introduce a stream buffer / prefetch buffer to hold prefetched data; note that this does not reduce the bandwidth consumed by prefetches or the extra energy they cost
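
A hedged sketch of the prefetch-degree knob from item 2: once a stream's stride is confirmed, `degree` requests are issued beyond the demand address. The function and parameter names are illustrative, not taken from any cited design.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: issue `degree` prefetches beyond the demand address of a
// confirmed stream. A larger degree hides more latency but risks evicting
// lines before use and polluting the cache on short streams, which motivates
// per-stream training of the degree.
std::vector<uint64_t> stride_prefetch_addrs(uint64_t demand_addr, int64_t stride, int degree) {
    std::vector<uint64_t> addrs;
    for (int k = 1; k <= degree; ++k)
        addrs.push_back(demand_addr + static_cast<uint64_t>(k * stride));
    return addrs;
}
```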

Sequential Prefetcher

A. J. Smith. “Sequential Program Prefetching in Memory Hierarchies.” Computer, v. 11, no. 12, 1978. DOI: 10.1109/C-M.1978.218016.

Stride Prefetcher

Originally proposed in:
J.-L. Baer and T.-F. Chen. “An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty.” In Proc. of Supercomputing, 1991. DOI: 10.1145/125826.125932.

Analysis of sequential and stride prefetchers:
F. Dahlgren and P. Stenstrom. “Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors.” In Proc. of the 1st IEEE Symposium on High-Performance Computer Architecture, 1995. DOI: 10.1109/HPCA.1995.386554.

Core design: Reference Prediction Table

  • Tracks the access stride per load-instruction PC
  • Whenever the same stride is observed on two consecutive accesses, the last address and the stride are used to compute one or more prefetch addresses (sketch below)
  • Well suited to prefetching access patterns over multidimensional arrays or arrays of structs
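
A minimal sketch of the table described above, assuming a direct-mapped 256-entry table and a confirm-on-second-repeat rule; these sizes and field names are illustrative, not details of Baer and Chen's design.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Minimal Reference Prediction Table sketch (sizes and field names are
// illustrative). One entry per load PC, direct-mapped; an entry remembers the
// last address and last stride and issues a prefetch only once the same
// stride has been observed on two consecutive accesses.
struct RPTEntry {
    uint64_t tag = 0;        // load PC that owns this entry
    uint64_t last_addr = 0;  // address of this PC's previous access
    int64_t  stride = 0;     // most recently observed stride
    int      confirmed = 0;  // consecutive repeats of the same stride
    bool     valid = false;
};

class ReferencePredictionTable {
    static constexpr size_t SIZE = 256;   // assumed capacity
    std::array<RPTEntry, SIZE> table_{};

public:
    // Train on a (PC, address) pair; return a prefetch address once the
    // stride repeats across two consecutive accesses by the same PC.
    std::optional<uint64_t> access(uint64_t pc, uint64_t addr) {
        RPTEntry& e = table_[pc % SIZE];
        if (!e.valid || e.tag != pc) {          // allocate (or steal) the entry
            e = RPTEntry{pc, addr, 0, 0, true};
            return std::nullopt;
        }
        int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
        if (stride != 0 && stride == e.stride)
            ++e.confirmed;                      // same stride as the previous pair
        else
            e.confirmed = 0;
        e.stride = stride;
        e.last_addr = addr;
        if (e.confirmed >= 1)                   // stride seen twice in a row
            return addr + static_cast<uint64_t>(e.stride);
        return std::nullopt;
    }
};
```

Fed the addresses a, a + 64, a + 128 by one PC, the table confirms the 64-byte stride on the third access and returns a + 192 as the prefetch target; in a real prefetcher this step would fan out into several addresses according to the prefetch degree discussed earlier.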

Variants

  1. Compactly representing multiple strides:
    Y. Ishii, M. Inaba and K. Hiraki. “Access Map Pattern Matching for High Performance Data Cache Prefetch.” Journal of Instruction-Level Parallelism, v. 13, 2011.
  2. Extending prefetching to more irregular access patterns by predicting stride lengths:
    S. Sair, T. Sherwood, and B. Calder. “A Decoupled Predictor-Directed Stream Prefetching Architecture.” IEEE Transactions on Computers, v. 52, no. 3, 2003. DOI: 10.1109/
  3. Using a simple state machine to track a histogram of recent stream lengths and adaptively pick an appropriate prefetch depth for each prefetch stream, so that the prefetcher remains effective even for short streams spanning only a few addresses (a sketch follows this list):
    I. Hur and C. Lin. “Memory Prefetching Using Adaptive Stream Detection.” In Proc. of the 39th Annual ACM/IEEE International Symposium on Microarchitecture, 2006. DOI: 10.1109/MICRO.2006.32.
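
A rough sketch of the adaptive-depth idea in item 3: record how long each stream ran before it ended, then choose the deepest prefetch depth that most observed streams would still cover. The bucket count and the selection rule are simplifying assumptions, not the mechanism from Hur and Lin's paper.

```cpp
#include <array>

// Rough sketch of adaptive prefetch-depth selection from a stream-length
// histogram (the bucket count and selection rule are simplifying assumptions).
class AdaptiveDepth {
    static constexpr int MAX_LEN = 16;
    std::array<int, MAX_LEN + 1> hist_{};  // hist_[n] = streams that ended after n accesses
    int total_ = 0;

public:
    // Record the final length of a stream once it dies.
    void stream_ended(int length) {
        if (length < 1) return;
        if (length > MAX_LEN) length = MAX_LEN;
        ++hist_[length];
        ++total_;
    }

    // Choose the deepest depth d such that at least `frac` of observed streams
    // ran for d or more accesses, so mostly-short streams get a shallow depth.
    int choose_depth(double frac = 0.5) const {
        if (total_ == 0) return 1;
        int at_least_d = 0;
        for (int d = MAX_LEN; d >= 1; --d) {
            at_least_d += hist_[d];
            if (static_cast<double>(at_least_d) / total_ >= frac)
                return d;
        }
        return 1;
    }
};
```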

Stream Buffer / Prefetch Buffer

  • First proposed in:
    N. P. Jouppi. “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers.” In Proc. of the 17th Annual International Symposium on Computer Architecture, 1990. DOI: 10.1145/325164.325162.
  • Using stream buffers to fully replace the L2 cache:
    S. Palacharla and R. E. Kessler. “Evaluating Stream Buffers As a Secondary Cache Replacement.” In Proc. of the 21st Annual International Symposium on Computer Architecture, 1994.
  • Further issues and optimizations:
    • C. Zhang and S. A. McKee. “Hardware-Only Stream Prefetching and Dynamic Access Ordering.” In Proc. of the 14th Annual International Conference on Supercomputing, 2000. DOI: 10.1145/335231.335247.
    • S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham. “Effective Stream-Based and Execution-Based Data Prefetching.” In Proc. of the 18th Annual International Conference on Supercomputing, 2004. DOI: 10.1145/1006209.1006211.
  • Each stream buffer holds cachelines from a single stream
  • Stream buffers are probed in parallel with the L1 data cache
  • A hit in a stream buffer typically moves the hit cacheline into the L1 data cache

Organization

  1. Implemented as a FIFO, where only the head cacheline of each stream buffer can be accessed (sketch below)
  2. Organized set-associatively; when the stride detection mechanism observes a new prefetch stream, all not-yet-accessed blocks of the old stream are discarded, typically under an LRU or round-robin replacement policy
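
A minimal sketch of the FIFO organization in item 1: only the head entry of each buffer is compared against the missing address; a head hit hands the line to the L1 data cache and the buffer refills its tail. The depth, line size, and unit-stride refill are illustrative assumptions.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Illustrative FIFO stream buffer (item 1 above): only the head entry of each
// buffer is checked on an L1D miss; a head hit promotes that line into the
// L1D and the buffer prefetches one more line at its tail. The depth, line
// size, and unit-stride refill are assumptions of the sketch.
constexpr uint64_t LINE  = 64;
constexpr size_t   DEPTH = 4;

struct StreamBuffer {
    std::deque<uint64_t> lines;   // prefetched cacheline addresses, head = front()
    int64_t stride = LINE;        // unit-stride stream for simplicity

    void allocate(uint64_t miss_addr) {        // start a new stream past the miss
        lines.clear();
        for (size_t k = 1; k <= DEPTH; ++k)
            lines.push_back(miss_addr + k * LINE);
    }

    // Probe on an L1D miss: a head hit moves the line to the L1D and refills the tail.
    bool probe(uint64_t addr, std::vector<uint64_t>& to_l1d) {
        uint64_t line_addr = addr & ~(LINE - 1);
        if (lines.empty() || lines.front() != line_addr)
            return false;                       // only the head entry is visible
        to_l1d.push_back(lines.front());        // hand the hit line to the L1D
        uint64_t next = lines.back() + static_cast<uint64_t>(stride);
        lines.pop_front();
        lines.push_back(next);                  // keep the FIFO full
        return true;
    }
};
```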