Address-Correlating Prefetchers

最早揭示数据访问的地址相关性：
J-L. Baer, J-L., and G. R. Sager. “Dynamic Improvement of Locality in Virtual Memory Systems.” IEEE Transactions on Software Engineering, v. 1, 1976. DOI: 10.1109/ TSE.1976.233801. 17

最早提出成对地址关联的预取器 correlation-based prefetcher：

M. J. Charney and A. P. Reeves. “Generalized Correlation-Based Hardware Prefetching.” Technical Report EE-CEG-95-1, School of Electrical Engineering, Cornell University, Feb. 1995. 17

M. J. Charney. Correlation-Based Hardware Prefetching, 1996. Ph.D. diss., Cornell University, 1996. 17

将地址关联从成对扩展到组和访问序列之间的关联

T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36

T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22

时间关联 (temporal correlation) 预取器的提出：
T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22

基于 Jump Pointer 进行预取

C.-K. Luk and T. C. Mowry. “Compiler Based Prefetching for Recursive Data Structures.” In Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996. DOI: 10.1145/237090.237190. 17, 36

A. Roth, A. Moshovos, and G. S. Sohi. “Dependence Based Prefetching for Linked Data Structures.” In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998. DOI: 10.1145/291069.291034. 17, 33

A. Roth and G. S. Sohi. “Effective Jump Pointer Prefetching for Linked Data Structures.” In Proc. of the 26th Annual International Symposium on Computer Architecture, 1999. DOI: 10.1109/ISCA.1999.765944. 17, 33

J. Collins, S. Sair, B. Calder, and D. M. Tullsen. “Pointer Cache Assisted Prefetching.” In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002. DOI: 10.1109/MICRO.2002.1176239. 17

依赖 Jump Pointer 的预取器通常需要软件或编译器支持对指针进行 Hint
对于特定的数据结构遍历(如链表) 可能非常有效
关键缺点: Jump Pointer 的距离难以调整, 指针本身的存储开销也可能很高

遍历的距离必须仔细权衡,以提供足够的预测,同时又不会跳过太多元素。跳转指针,且

Content Directed Prefetcher

Content Directed Prefetchers 不使用 Hint , 尝试解引用并预取任何可能构成有效虚拟地址的 Load Value

R. Cooksey, S. Jourdan, and D. Grunwald. “A Stateless, Content-Directed Data Prefetching Mechanism.” In Proc. of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002. DOI: 10.1145/605397.605427. 18

E. Ebrahimi, O. Mutlu, and Y. N. Patt. “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems.” In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/ HPCA.2009.4798232. 18

Pair-Wise Correlation Predictors 成对关联预取器

本质上是一个查找表: 将一个地址映射到访问序列中紧随其后的另一个地址

可以捕获顺序和步幅关系，以及指针地址与指针指向的地址之间的关系
问题：
1. 依赖于访问的重复性（时间局部性），本质是训练一个地址映射查找表，无法预取未访问的地址(而 stride 预取则可以预取未访问过的地址)
2. 映射表的空间需求大：每个访问地址都需要一个 entry，存储需求与工作集成正比

Markov Prefetcher

D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” In Proc. of the 24th Annual International Symposium on Computer Architecture, 1997. DOI: 10.1145/264107.264207. 18, 19

D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” IEEE Transactions on Computers, v. 48 no. 2, 1999. DOI: 10.1109/12.752653. 18

Markov Predictor

为每个地址映射多个后继地址 (牺牲准确率以提高覆盖率)
每个 Entry 包含
- Tag 匹配查找
- N 个预取后继地址候选项: (N 的典型配置为 4)
  预取后继地址
  置信度 (用于选择预取和替换)
  替换策略元数据（典型为 LRU）
查找表最初实现为 1MB (现代 workload 往往需要更大的查找表)

参考自一阶 Markov 过程，每个状态对应一个 Load Data Addr，后续地址的 Confidence 和 LRU 元数据代表转移概率

存在的问题:

预取前瞻性和内存级并行性受限，预取器仅尝试预测下一次缺失
- 从触发地址到访问后继地址之间的时间太短，不足以隐藏所预取数据的访问延迟
- 改进：
  1. 提高 Prefetch Depth
  2. 为每次预取操作选择一个更早的触发事件: Dead Block Correlation Prefetcher
覆盖率受限于片上查找表的容量
- 改进：

改进 1: 提高 Prefetch Depth 以提高预取前瞻性

使用初始预测地址递归查找映射表发起更深的预取:
T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36

折叠查找表：

Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22

Y. Solihin, J. Lee, and J. Torrellas. “Correlation Prefetching with a User-Level Memory Thread.” IEEE Transactions on Parallel and Distributed Systems, v. 14, no. 6, 2003. DOI: 10.1109/TPDS.2003.1206504. 20

通过使用初始预测地址递归查找映射表发起更深的预取
- 困难 1 : 较高的查找延迟, 急剧增加对查找表的带宽需求
- 困难 2 : 可能的预取候选数量会随预取深度呈几何级数增长
  (需要一种适当的策略来限制这种增长,以维持准确率)
折叠查找表：每个 entry 存储一个后继地址序列（而不是多个候选的后继地址）
- 需要一个合适的策略来选择记录哪些后继流 (最近或概率最高的后继地址)
- 解决了查找延迟/带宽问题
- 困难 1：使查找表更新更复杂, 一个地址可能记录在多个 entries 中
- 困难 2: 这样的表组织必须将最大预取深度固定为后续地址存储的长度
  - 存储长度太少, 预测序列将被截断,牺牲了潜在的覆盖率和前瞻性
  - 存储长度过高，浪费存储空间，准确率可能会下降

研究表明，重复访存序列的长度变化范围可达数个数量级:

T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36

T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22

Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22

M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22

T.F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. “Temporal Streams in Commercial Server Applications.” In Proc. of the IEEE International Symposium on Workload Characterization, 2008. DOI: 10.1109/IISWC.2008.4636095. 20, 22, 32

乱序 CPU 的内存级并行研究：
Y. Chou, B. Fahs, and S. Abraham. “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism.” In Proc. of the 31st Annual International Symposium on Computer Architecture, 2004. DOI: 10.1145/1028176.1006708. 20

访存序列的规律：

重复访存序列的长度变化范围可达数个数量级
一个访存序列开始时的前几次 misses 一般无法实现及时的预取 —— 触发 miss 在时间上过于接近后续块的访问
- 记录和预取这些地址: 浪费存储空间，并延迟对序列中更深层有用块的预取
- 省略前几个地址，并从预取能带来延迟优势的第一个 Miss 开始定义序列更合理
- 在乱序 CPU 中：指令窗口经常导致多个片外 misses 并发发出，导致第一次 miss 时后续的很多 misses 也已并发发出
  （每组中平均发出的 misses 称为应用程序的内存级并行性）

Epoch-Based Correlation Prefetcher: 基于周期的关联预取 (EBCP) 机制

Y. Chou. “Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications.” In Proc. of the 40th Annual ACM/IEEE International Symposium on Microarchitecture, 2007. DOI: 10.1109/MICRO.2007.39. 20

每个并行组内的第一个 miss 被用作触发地址，并用于查找后续组中的 misses ，从而跳过同一个组内在遇到触发 miss 时可能已经在处理中的 misses

改进 2: Dead Block Correlation Prefetcher 死块关联预取器 (提高预取前瞻性)

Dead Cache Block:

T. R. Puzak. Analysis of Cache Replacement-Algorithms, 1985. Ph.D. diss., Univ. Massachusetts, Amherst,1985. 20

A. Mendelson, D. Thiebaut, and D. K. Pradhan. “Modeling Live and Dead Lines in Cache Memory Systems.” IEEE Transactions on Computers, v. 42, no. 1. DOI: 10.1109/12.192209. 20

D. A. Wood, M. D. Hill, and R. E. Kessler. “A Model for Estimating Trace-Sample Miss Ratios.” In Proc. of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991. DOI: 10.1145/107971.107981. 20

优化点：

cacheline 在 cache 中度过的大部分时间处于 Dead 状态(仍被缓存，但在 invalidation 或 eviction 前不会被再次访问) (占据存储空间，却无法提供 cache hit)
预取器可以用预取的 cacheline 替换 dead block，而完全没有缓存污染的风险

死块关联预取器 (DBCP):

A.-C. Lai, C. Fide, and B. Falsafi. “Dead-Block Prediction and Dead-Block Correlating Prefetchers.” In Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI: 10.1145/379240.379259. 19, 20, 21

M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22

H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency.” In Proc. of the 41st Annual ACM/IEEE International Symposium on Microarchitecture, 2008. DOI: 10.1109/ MICRO.2008.4771793. 20, 21

基于代码关联预测 dead block:

A.-C. Lai, C. Fide, and B. Falsafi. “Dead-Block Prediction and Dead-Block Correlating Prefetchers.” In Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI: 10.1145/379240.379259. 19, 20, 21

M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22

H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency.” In Proc. of the 41st Annual ACM/IEEE International Symposium on Microarchitecture, 2008. DOI: 10.1109/ MICRO.2008.4771793. 20, 21

基于计时机制预测 dead block:

Z. Hu, S. Kaxiras, and M. Martonosi. “Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior.” In Proc. of the 29th Annual International Symposium on Computer Architecture, 2002. DOI: 10.1109/ISCA.2002.1003579. 20, 21

Dead Block Correlation Prefetcher(DBCP) 试图预测对 cacheline 的最后一次访问，然后利用这次访问作为触发点,预取将替换 dead block 。从某种意义上说, DBCP 最大化了前瞻性,因为它在 cache 中存储空间变得可用以接收预取块的最早时刻发出预取

预测一个 cacheline 何时变为 dead block:

基于代码关联 code correlation 预测
- 旨在识别 cacheline 在被踢出前最后一次访问它的指令 (导致该块变为 dead block 的那次访问)
- 基于观察: cacheline 每次进入 cache 时，往往会被相同的 Load/Store 序列访问: 从分配该 cacheline 的第一次 miss 开始到该 cacheline 变为 dead block 的最后一次访问
- 优势: 为一个 cacheline 学习到的访问序列可以应用于预测其他地址的 death event
基于计时机制 time keeping 预测
- 旨在预测 cacheline 死亡前的时间,而非指示其死亡的具体访问
- 基于观察：一个 cacheline 的生存期(以时钟周期衡量)每次进入缓存时往往相似
- Markov Prefetch Table 会扩展一个额外字段，指示该 cacheline 上次进入 cache 时的生存期。然后经过一个适当的安全裕度(例如,先前生存期的两倍)后,该块被预测为 dead block

改进 3: 应对片上存储面积的限制

Tag Correlating Prefetcher:
Z. Hu, M. Martonosi, and S. Kaxiras. “TCP: Tag Correlating Prefetchers.” In Proc. of the 9th IEEE Symposium on High-Performance Computer Architecture, 2003. DOI: 10.1109/ HPCA.2003.1183549. 21

Correlation Table 迁移到片外:

T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22

Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22

M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22

T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, A. Moshovos. “Practical Off-Chip Meta-Data for Temporal Memory Streaming.”In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/HPCA.2009.4798239. 22, 25

A. Jain and C. Lin. “Linearizing Irregular Memory Accesses for Improved Correlated Prefetching.” In Proc. of the 46th Annual ACM/IEEE International Symposium on Microarchitecture, 2013. DOI: 10.1145/2540708.2540730. 22, 25

通过 Temporal Streaming 分摊元数据访问成本：
T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, A. Moshovos. “Practical Off-Chip Meta-Data for Temporal Memory Streaming.”In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/HPCA.2009.4798239. 22, 25

改进片上 correlation table 的存储效率: Tag Correlating Prefetcher
- 每个 Entry 仅存储 Tag 而非完整的 cacheline 地址
- 容量提升有限
将 correlation table 迁移到主存中
- 覆盖率提高
- 预取元数据访问延迟增加（甚至和预取所需 cacheline 时间一致）
- 在此基础上必须提高前瞻性以隐藏元数据访问延时(除改进 1，改进 2 外，还可以通过时间流 Temporal Streaming 分摊片外元数据访问成本)

Global History Buffer

todo…