硬件预取的任务:
- 预测访存地址 (决定:精确率 accuracy) 取决于:
- 程序本身的访存行为
- 在 cache 层级结构中的哪个层次进行预取
- L1 可以捕获所有的访存,精确度高,但浪费资源( L1 一般会 hit ,不需要预取,会浪费大量的预取资源)
- 低层级 cache 捕获的访存请求少
- cache placement 和 replacement 策略
- 预测何时预取 (决定:及时性 timely)
- 太早:可能会踢出处理器此时需要先使用的 cache block
- 太晚:可能由于访存延时导致迟迟得不到预取的结果
- 决定如何放置预取数据
- binding prefetch :预取到寄存器
- 浪费寄存器资源
- 即使此时访存负载很高,依旧会强制预取
- 语义错误(预取无效地址会触发 page fault)
- 不能应用到指令预取
- non-binding prefetch:预取到 cache 或者 buffer 中
- 这些 cache 和 buffer 同样需要参与一致性协议
- binding prefetch :预取到寄存器
预取策略的评估指标 (trade-off)
- Coverage 覆盖率: 在处理器显式的访存请求中,预测成功的占比
- Accuracy 精确度: 实际发射的访存请求中,预取有用的占比
指令预取 Instruction Prefetching
- 更适用于大规模指令工作负载,如服务器和云工作负载。(桌面应用程序和科学计算的工作负载相比之下指令规模并不大)
- CPU 内部的乱序执行只能 overlap 数据访存的延时,对于指令访存则起不了太大作用
现有预取器对比:
预取器 | Accuracy | Cost |
---|---|---|
Next-Line Prefetching | 50% | < 1 KB |
Fetch-Directed Prefetching | > 50% | < 1 KB |
Discontinuity Prefetching | > 50% | > 1 KB |
Prescient Fetch | > 50% | > 1 KB |
Temporal Instruction Fetch Streaming (TIFS) | 95% | ~ 64 KB/core |
Return-address stack-Directed Instruction Prefetching (RDIP) | > 95% | ~ 64 KB/core |
Proactive Instruction Fetch | > 99% | ~ 256 KB/core |
Next-Line Prefetching
A. J. Smith. “Sequential Program Prefetching in Memory Hierarchies.” Computer, v. 11, no. 12, 1978. DOI: 10.1109/C-M.1978.218016. 7, 15
最简单,顺序预取,适用于顺序指令流的场景
- 从下级 cache 预取到的指令被放入 Stream buffer(一个小的全相联 buffer)
- 当 Stream buffer 中的一条指令被处理器请求取走时,该指令转移到 cache 中并向下级 cache 发起下一次预取
第一个实现的机器为 IBM System/360 Model 91
扩展: 将每次预取的长度扩展为连续的基本块
- A. Ramirez, O. J. Santana, J. L. Larriba-Pey and M. Valero. “Fetching Instruction Streams.” In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002. 8
- O. J. Santana, A. Ramirez, and M. Valero. “Enlarging Instruction Streams.” IEEE Transactions on Computers, v. 56, no. 10, 2007. DOI: 10.1109/TC.2007.70742. 8, 11
Fetch-Directed Prefetching
Branch-predictor-directed prefetchers: 利用分支预测器以探索指令流的变化
- I-C. K. Chen, C-C. Lee, and T. N. Mudge. “Instruction Prefetching Using Branch Prediction Information.” In Proc. of the IEEE International Conference on Computer Design, 1997. DOI: 10.1109/ICCD.1997.628926. 9
- G. Reinman, B. Calder, and T. Austin. “Fetch Directed Instruction Prefetching.” In Proc. of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, 1999. 9
- A. V. Veidenbaum, Q. Zhao, and A. Shameer. “Non-Sequential Instruction Cache Prefetching for Multiple–Issue Processors.” International Journal of High Speed Computing, v. 10, no. 1, 1999. DOI: 10.1142/S0129053399000065. 9
- R. Panda, P. V. Gratz, and D. A. Jiménez. “B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors.” In Proc. of the 18th International Symposium on High-Performance Computer Architecture, 2012. DOI: 10.1109/L-CA.2011.33. 9
- T. Sherwood, S. Sair, and B. Calder. “Predictor-Directed Stream Buffers.” In Proc. of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000. DOI: 10.1145/360128.360135. 9
Fetch-Directed Instruction Prefetching (FDIP)
- 将分支预测器从 IFU 中解耦出来,并在二者之间引入 fetch target queue (FTQ)
- 预取器使用 FTQ 中的地址从 L2 Cache 中预取指令块,并放入一个小的全相联 buffer : Prefetch Buffer
- IFU 会并行访问 Prefetch Buffer 和 L1 ICache
- L1 ICache 中空闲的端口还会监测 Prefetch Buffer 中是否有相应的指令块以避免二者之间存储冗余的指令块
- L1 ICache 会将 miss 的地址放入 prefetch instruction queue (PIQ) 以准备预取
limit: 在所有的指令 cache miss 中有将近一半需要超过 16 个连续的正确分支预测才能生成候选的预取地址 (统计数据)
Discontinuity Prefetching
定位于处理指令序列中的不连续性:函数调用、跳转分支、 traps
Prefetcher | From | Features |
Wrong-Path Prefetching | J. Pierce, and T. N. Mudge. “Wrong-Path Instruction Prefetching.” In Proc. of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, 1996. 11 | 解决 FDIP 中对于分支未跳转路径的处理问题(这条路径之后很可能会被使用) |
Branch-History Guided Prefetcher | V. Srinivasan, E. S. Davidson, G. S. Tyson, M. J. Charney, and T. R. Puzak. “Branch History Guided Instruction Prefetching.” In Proc. of the 7th International Symposium on High-Performance Computer Architecture, 2001. DOI: 10.1109/HPCA.2001.903271. 11 | 根据更早的指令预测不连续的指令流(使用分支预测器追踪) |
Execution-History Guided Prefetcher | Y. Zhang, S. Haga, and R. Barua. “Execution History Guided Instruction Prefetching.” In Proc. of the 16th Annual International Conference on Supercomputing, 2002. DOI: 10.1145/514191.514220. 11 | |
Multiple-Stream Predictor | O. J. Santana, A. Ramirez, and M. Valero. “Enlarging Instruction Streams.” IEEE Transactions on Computers, v. 56, no. 10, 2007. DOI: 10.1109/TC.2007.70742. 8, 11 | |
Next-Trace Predictors | Q. Jacobson, E. Rotenberg, and J. E. Smith. “Path-Based Next Trace Prediction.” In Proc. of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997. DOI: 10.1109/MICRO.1997.645793. 11 | |
Call Graph Prefetching (对服务器应用程序很有效果) | M. Annavaram, J. M. Patel, and E. S. Davidson. “Call Graph Prefetching for Database Applications.” ACM Transactions on Computer Systems, v. 21, no. 4, 2003. DOI: 10.1145/945506.945509. 11 | |
DiscontinuityPredictor |
Discontinuity Predictor
Recent Example
L. Spracklen, Y. Chou, and S. G. Abraham. “Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications.” In Proc. of the 11th International Symposium on High-Performance Computer Architecture, 2005. DOI: 10.1109/ HPCA.2005.13. 11
- 维护一张表示取指不连续性的表: discontinuity table
- 包含 被预测为跳转的分支的基本块的 PC 到分支目标地址的映射 (只预测一个分支跳转目标)
- IFU 取指时,查询 discontinuity table
- 如果地址匹配,除了顺序预取外,还预取不连续路径
- 简单且硬件开销很小
- 只能跨越单一的取指不连续性
- 递归查询以探索其他路径会导致预取的基本块数量呈指数增长
- 限制最多跨越一次不连续性会限制预取器的预测能力
- 覆盖率有限
- 因为对于每个 cache block 该表只会记录单一不连续性,而实际情况下一个指令块内可能包含多个跳转分支
Prescient Fetch
todo…
Temporal Instruction Fetch Streaming (TIFS)
todo…
Return-address stack-Directed Instruction Prefetching (RDIP)
todo…
Proactive Instruction Fetch
todo…
References
- A Primer on Hardware Prefetching