Address-Correlating Prefetchers
- 最早揭示数据访问的地址相关性:
J-L. Baer, J-L., and G. R. Sager. “Dynamic Improvement of Locality in Virtual Memory Systems.” IEEE Transactions on Software Engineering, v. 1, 1976. DOI: 10.1109/ TSE.1976.233801. 17- 最早提出成对地址关联的预取器 correlation-based prefetcher:
- M. J. Charney and A. P. Reeves. “Generalized Correlation-Based Hardware Prefetching.” Technical Report EE-CEG-95-1, School of Electrical Engineering, Cornell University, Feb. 1995. 17
- M. J. Charney. Correlation-Based Hardware Prefetching, 1996. Ph.D. diss., Cornell University, 1996. 17
- 将地址关联从成对扩展到组和访问序列之间的关联
- T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36
- T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22
- 时间关联 (temporal correlation) 预取器的提出:
T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22
基于 Jump Pointer 进行预取
- C.-K. Luk and T. C. Mowry. “Compiler Based Prefetching for Recursive Data Structures.” In Proc. of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996. DOI: 10.1145/237090.237190. 17, 36
- A. Roth, A. Moshovos, and G. S. Sohi. “Dependence Based Prefetching for Linked Data Structures.” In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 1998. DOI: 10.1145/291069.291034. 17, 33
- A. Roth and G. S. Sohi. “Effective Jump Pointer Prefetching for Linked Data Structures.” In Proc. of the 26th Annual International Symposium on Computer Architecture, 1999. DOI: 10.1109/ISCA.1999.765944. 17, 33
- J. Collins, S. Sair, B. Calder, and D. M. Tullsen. “Pointer Cache Assisted Prefetching.” In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002. DOI: 10.1109/MICRO.2002.1176239. 17
- 依赖 Jump Pointer 的预取器通常需要软件或编译器支持对指针进行 Hint
- 对于特定的数据结构遍历(如链表) 可能非常有效
- 关键缺点: Jump Pointer 的距离难以调整, 指针本身的存储开销也可能很高
遍历的距离必须仔细权衡,以提供足够的预测,同时又不会跳过太多元素。跳转指针,且
Content Directed Prefetcher
Content Directed Prefetchers 不使用 Hint , 尝试解引用并预取任何可能构成有效虚拟地址的 Load Value
- R. Cooksey, S. Jourdan, and D. Grunwald. “A Stateless, Content-Directed Data Prefetching Mechanism.” In Proc. of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 2002. DOI: 10.1145/605397.605427. 18
- E. Ebrahimi, O. Mutlu, and Y. N. Patt. “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems.” In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/ HPCA.2009.4798232. 18
Pair-Wise Correlation Predictors 成对关联预取器
本质上是一个查找表: 将一个地址映射到访问序列中紧随其后的另一个地址
- 可以捕获顺序和步幅关系,以及指针地址与指针指向的地址之间的关系
- 问题:
- 依赖于访问的重复性(时间局部性),本质是训练一个地址映射查找表,无法预取未访问的地址(而 stride 预取则可以预取未访问过的地址)
- 映射表的空间需求大:每个访问地址都需要一个 entry,存储需求与工作集成正比
Markov Prefetcher
- D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” In Proc. of the 24th Annual International Symposium on Computer Architecture, 1997. DOI: 10.1145/264107.264207. 18, 19
- D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” IEEE Transactions on Computers, v. 48 no. 2, 1999. DOI: 10.1109/12.752653. 18
- 为每个地址映射多个后继地址 (牺牲准确率以提高覆盖率)
每个 Entry 包含- Tag 匹配查找
- N 个预取后继地址候选项: (N 的典型配置为 4)
预取后继地址
置信度 (用于选择预取和替换)
替换策略元数据(典型为 LRU)
- 查找表最初实现为 1MB (现代 workload 往往需要更大的查找表)
参考自一阶 Markov 过程,每个状态对应一个 Load Data Addr,后续地址的 Confidence 和 LRU 元数据代表转移概率
存在的问题:
- 预取前瞻性和内存级并行性受限,预取器仅尝试预测下一次缺失
- 从触发地址到访问后继地址之间的时间太短,不足以隐藏所预取数据的访问延迟
- 改进:
- 提高 Prefetch Depth
- 为每次预取操作选择一个更早的触发事件: Dead Block Correlation Prefetcher
- 覆盖率受限于片上查找表的容量
- 改进:
改进 1: 提高 Prefetch Depth 以提高预取前瞻性
- 使用初始预测地址递归查找映射表发起更深的预取:
T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36- 折叠查找表:
- Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22
- Y. Solihin, J. Lee, and J. Torrellas. “Correlation Prefetching with a User-Level Memory Thread.” IEEE Transactions on Parallel and Distributed Systems, v. 14, no. 6, 2003. DOI: 10.1109/TPDS.2003.1206504. 20
- 通过使用初始预测地址递归查找映射表发起更深的预取
- 困难 1 : 较高的查找延迟, 急剧增加对查找表的带宽需求
- 困难 2 : 可能的预取候选数量会随预取深度呈几何级数增长
(需要一种适当的策略来限制这种增长,以维持准确率)
- 折叠查找表:每个 entry 存储一个后继地址序列(而不是多个候选的后继地址)
- 需要一个合适的策略来选择记录哪些后继流 (最近或概率最高的后继地址)
- 解决了查找延迟/带宽问题
- 困难 1:使查找表更新更复杂, 一个地址可能记录在多个 entries 中
- 困难 2: 这样的表组织必须将最大预取深度固定为后续地址存储的长度
- 存储长度太少, 预测序列将被截断,牺牲了潜在的覆盖率和前瞻性
- 存储长度过高,浪费存储空间,准确率可能会下降
研究表明,重复访存序列的长度变化范围可达数个数量级:
- T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for General-Purpose Programs.” In Proc. of the Conference on Programming Language Design and Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36
- T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22
- Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22
- M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22
- T.F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. “Temporal Streams in Commercial Server Applications.” In Proc. of the IEEE International Symposium on Workload Characterization, 2008. DOI: 10.1109/IISWC.2008.4636095. 20, 22, 32
乱序 CPU 的内存级并行研究:
Y. Chou, B. Fahs, and S. Abraham. “Microarchitecture Optimizations for Exploiting Memory-Level Parallelism.” In Proc. of the 31st Annual International Symposium on Computer Architecture, 2004. DOI: 10.1145/1028176.1006708. 20
访存序列的规律:
- 重复访存序列的长度变化范围可达数个数量级
- 一个访存序列开始时的前几次 misses 一般无法实现及时的预取 —— 触发 miss 在时间上过于接近后续块的访问
- 记录和预取这些地址: 浪费存储空间,并延迟对序列中更深层有用块的预取
- 省略前几个地址,并从预取能带来延迟优势的第一个 Miss 开始定义序列更合理
- 在乱序 CPU 中:指令窗口经常导致多个片外 misses 并发发出,导致第一次 miss 时后续的很多 misses 也已并发发出
(每组中平均发出的 misses 称为应用程序的内存级并行性)
Epoch-Based Correlation Prefetcher: 基于周期的关联预取 (EBCP) 机制
Y. Chou. “Low-Cost Epoch-Based Correlation Prefetching for Commercial Applications.” In Proc. of the 40th Annual ACM/IEEE International Symposium on Microarchitecture, 2007. DOI: 10.1109/MICRO.2007.39. 20
每个并行组内的第一个 miss 被用作触发地址,并用于查找后续组中的 misses ,从而跳过同一个组内在遇到触发 miss 时可能已经在处理中的 misses
改进 2: Dead Block Correlation Prefetcher 死块关联预取器 (提高预取前瞻性)
Dead Cache Block:
- T. R. Puzak. Analysis of Cache Replacement-Algorithms, 1985. Ph.D. diss., Univ. Massachusetts, Amherst,1985. 20
- A. Mendelson, D. Thiebaut, and D. K. Pradhan. “Modeling Live and Dead Lines in Cache Memory Systems.” IEEE Transactions on Computers, v. 42, no. 1. DOI: 10.1109/12.192209. 20
- D. A. Wood, M. D. Hill, and R. E. Kessler. “A Model for Estimating Trace-Sample Miss Ratios.” In Proc. of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991. DOI: 10.1145/107971.107981. 20
优化点:
- cacheline 在 cache 中度过的大部分时间处于 Dead 状态(仍被缓存,但在 invalidation 或 eviction 前不会被再次访问) (占据存储空间,却无法提供 cache hit)
- 预取器可以用预取的 cacheline 替换 dead block,而完全没有缓存污染的风险
死块关联预取器 (DBCP):
- A.-C. Lai, C. Fide, and B. Falsafi. “Dead-Block Prediction and Dead-Block Correlating Prefetchers.” In Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI: 10.1145/379240.379259. 19, 20, 21
- M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22
- H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency.” In Proc. of the 41st Annual ACM/IEEE International Symposium on Microarchitecture, 2008. DOI: 10.1109/ MICRO.2008.4771793. 20, 21
基于代码关联预测 dead block:
- A.-C. Lai, C. Fide, and B. Falsafi. “Dead-Block Prediction and Dead-Block Correlating Prefetchers.” In Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI: 10.1145/379240.379259. 19, 20, 21
- M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22
- H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for Eliminating Dead Blocks and Increasing Cache Efficiency.” In Proc. of the 41st Annual ACM/IEEE International Symposium on Microarchitecture, 2008. DOI: 10.1109/ MICRO.2008.4771793. 20, 21
基于计时机制预测 dead block:
- Z. Hu, S. Kaxiras, and M. Martonosi. “Timekeeping in the Memory System: Predicting and Optimizing Memory Behavior.” In Proc. of the 29th Annual International Symposium on Computer Architecture, 2002. DOI: 10.1109/ISCA.2002.1003579. 20, 21
Dead Block Correlation Prefetcher(DBCP) 试图预测对 cacheline 的最后一次访问,然后利用这次访问作为触发点,预取将替换 dead block 。从某种意义上说, DBCP 最大化了前瞻性,因为它在 cache 中存储空间变得可用以接收预取块的最早时刻 发出预取
预测一个 cacheline 何时变为 dead block:
- 基于代码关联 code correlation 预测
- 旨在识别 cacheline 在被踢出前最后一次访问它的指令 (导致该块变为 dead block 的那次访问)
- 基于观察: cacheline 每次进入 cache 时,往往会被相同的 Load/Store 序列访问: 从分配该 cacheline 的第一次 miss 开始到该 cacheline 变为 dead block 的最后一次访问
- 优势: 为一个 cacheline 学习到的访问序列可以应用于预测其他地址的 death event
- 基于计时机制 time keeping 预测
- 旨在预测 cacheline 死亡前的时间,而非指示其死亡的具体访问
- 基于观察:一个 cacheline 的生存期(以时钟周期衡量)每次进入缓存时往往相似
- Markov Prefetch Table 会扩展一个额外字段,指示该 cacheline 上次进入 cache 时的生存期。然后经过一个适当的安全裕度(例如,先前生存期的两倍)后,该块被预测为 dead block
改进 3: 应对片上存储面积的限制
- Tag Correlating Prefetcher:
Z. Hu, M. Martonosi, and S. Kaxiras. “TCP: Tag Correlating Prefetchers.” In Proc. of the 9th IEEE Symposium on High-Performance Computer Architecture, 2003. DOI: 10.1109/ HPCA.2003.1183549. 21- Correlation Table 迁移到片外:
- T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22
- Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture, May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22
- M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE International Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/ ISPASS.2007.363741. 20, 21, 22
- T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, A. Moshovos. “Practical Off-Chip Meta-Data for Temporal Memory Streaming.”In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/HPCA.2009.4798239. 22, 25
- A. Jain and C. Lin. “Linearizing Irregular Memory Accesses for Improved Correlated Prefetching.” In Proc. of the 46th Annual ACM/IEEE International Symposium on Microarchitecture, 2013. DOI: 10.1145/2540708.2540730. 22, 25
- 通过 Temporal Streaming 分摊元数据访问成本:
T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, A. Moshovos. “Practical Off-Chip Meta-Data for Temporal Memory Streaming.”In Proc. of the 15th International Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/HPCA.2009.4798239. 22, 25
- 改进片上 correlation table 的存储效率: Tag Correlating Prefetcher
- 每个 Entry 仅存储 Tag 而非完整的 cacheline 地址
- 容量提升有限
- 将 correlation table 迁移到主存中
Global History Buffer
todo…