Title	Patent Number	Inc.	Year	url
Reducing latency for pointer chasing loads	US9710268B2	Apple Inc.	2017	https://patents.google.com/patent/US9710268B2/en

这篇专利解决的是 OoO 处理器中 pointer chasing 的 load-to-load / load-to-store 地址生成延迟问题

TLDR：

当 older producer load 的结果会作为 younger memory operation 的地址源操作数时，常规路径需要把 load data 送回寄存器文件或 reservation station 后再送到 AGU，延迟较长；该专利增加一条从 L1 data cache 读出路径直接到 AGU 输入 mux 的早期旁路
关键风险是 producer load 的结果不一定来自 L1D：如果它实际从 store queue 转发，或者 L1D miss，或者数据需要 alignment/sign extension/endian 处理，早期旁路可能给 AGU 错数据
因此专利使用 load-store dependency predictor/LSD predictor 预测 producer load 是否会 hit store queue，以及是否不能在早期窗口提供可直接作为地址的数据；若预测不会 hit store queue，则 dependent load/store 可按更短延迟 M 发射；若预测会 hit store queue，则按更长延迟 N 发射，其中 M<N，一个实施例中 M=N-1，例如 3-cycle vs 4-cycle
预测错误时 flush/replay dependent consumer memory operation，并训练 LSD predictor，后续遇到相同 producer load 或 dependency pair 时不再错误提前发射

背景和动机

该专利面向高频、深流水、乱序执行处理器中的内存相关指令延迟。load/store 指令通常先由源操作数生成 effective/virtual address，经 DTLB 翻译并访问 L1 data cache 或 store queue。对于普通 load，结果最终会被写入物理寄存器文件、reservation station 或通过普通 bypass 网络转发给 younger dependent instructions。

pointer chasing 的特征是 older load 读出的值本身是指针，并且这个值随后被 younger load/store 用作地址源操作数。例如链表遍历：

1
2
3

node = node->next;
value = node->field;
node = node->next;

微架构上可抽象成：

1 2	L0: R7 <- LD [R3 + 8] ; producer load，读出下一个指针 L1: R10 <- LD [R7 + 4] ; consumer load，地址依赖 R7

这里 L1 不能真正完成地址生成，直到 L0 的结果可用。

传统 OoO 用 register renaming、reservation station wakeup、bypass network 缩短 ALU 相关指令的延迟，但对 load-result-as-address 的场景仍有额外代价：

L0 的数据从 data cache 出来后，需要经过对齐/扩展、写回或全局旁路，再到 L1 所在的 AGU
这些跨 die 的长线、repeater、staging storage 会累积成显著延迟
链表遍历等 pointer chasing 程序会反复经历这段延迟，吞吐被串行依赖链限制

已有处理器通常也有 store-to-load forwarding：当 load 的地址与 older store 匹配时，load 数据来自 store queue，而不是 data cache。

专利的核心不是“预测 pointer 值”，而是预测 producer load 结果来源和可用时序，从而决定 dependent memory operation 是否可以更早发射

相关工作/背景引用中比较重要的体系结构脉络如下：

Title	Authors, From	url
Memory Dependence Prediction Using Store Sets	Chrysos et al., ISCA 1998	https://doi.org/10.1145/279358.279378
Streamlining Inter-operation Memory Communication via Data Dependence Prediction	Moshovos et al., MICRO 1997	https://doi.org/10.1145/266800.266811
Speculative Memory Cloaking and Bypassing	Moshovos et al., International Journal of Parallel Programming 1999	https://doi.org/10.1023/A:1018727608919
The Alpha 21264: A 500 MHz Out-of-Order Execution Microprocessor	Leibholz et al., IEEE 1997	https://doi.org/10.1109/40.566207
Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions	Jaleel et al., HPCA 2005	https://doi.org/10.1109/HPCA.2005.33

Insight

这篇专利的 insight 为：对于 pointer chasing，真正关键的不是提前把下一个 cache line 取回来，而是缩短“producer load 读出指针 -> consumer load/store 用该指针生成地址”的一段数据路径

在常规 load-use latency 中，load 数据的消费方往往是 ALU 或后续指令的源操作数
若 consumer 是 memory operation，则这个 load result 还要进入 AGU 的地址生成路径
AGU 通常在 load-store pipe 的较早 stage 工作，而 L1D data 通常在后面 stage 才出来
若完全走普通写回/唤醒/选择路径，consumer memory operation 必须等 producer load 的结果变成普通可调度源操作数(进入 Issue Queue 或者物理寄存器堆)后再发射，导致链式 load 有较大 bubble

专利主要针对的是一个可利用的特殊情况：若 producer load 是 L1D hit，且读出的数据格式已经可以直接作为地址源操作数，那么在 L1D data access 结束时，可以不等它经过完整 alignment/writeback/bypass 网络，而是直接把 data cache 读出的 result data 接到下一条 dependent memory operation 的 AGU 输入 mux。为了让 consumer op 在同一周期到达 AGU，调度器需要提前发射 consumer op，使其 pipeline stage 与 producer 的 early result 对齐。

因此，调度问题变成一个时序预测问题：

producer load issued at cycle T

如果 producer data 来自 L1D hit:
  result data 可在较早窗口到达 AGU bypass
  consumer memory op 可在 T + M 发射

如果 producer data 来自 store queue 或其他较慢路径:
  result data 到 AGU 更晚
  consumer memory op 需在 T + N 发射

其中 M < N；一个实施例中 M=N-1，比如 M=3, N=4

register rename/mapper/decode 已能知道 younger load/store 的地址源寄存器依赖 older load 的 destination register
难点是 producer load 结果来源是否足够快：L1D hit 可走早期旁路，store queue hit 则通常不能
LSD predictor 记录历史上哪些 producer load 或 load-store pair 会出现 store queue source、ordering violation、早期窗口不可用等情况
对未命中的 producer load，专利倾向于认为其不会 hit store queue，从而允许提前发射；若预测错误，则 flush/replay 并训练 predictor

与传统 memory dependence predictor 的区别是：

传统 Store Sets/LSQ 预测主要避免 load 早于 conflicting older store 执行导致 ordering violation

本专利复用/扩展 LSD predictor，不只记录 store-load ordering，还记录 producer load 的“结果来源/可旁路性”

预测目标不是“consumer load 是否依赖 older store”，而是“producer load 是否会从 store queue 获得结果，或是否不能在 early window 给 consumer AGU 提供干净数据”。

专利没有消除 load miss，也没有预测下一个指针值；它只在 L1D hit 且格式简单时，把每个 pointer chasing hop 的 load-to-address-use latency 减少约一个周期。对长链表、树遍历、哈希桶链、对象图遍历等串行指针链来说，每跳 1 cycle 的减少会直接累积

核心设计

核心设计由四部分组成：

在 map/dispatch 或 decode/mapper 阶段检测 older producer load 与 younger consumer memory operation 的寄存器依赖
使用 LSD predictor 判断 producer load 是否可能 hit store queue，或历史上是否不能在 early bypass 窗口提供可直接作为地址源的数据。
若预测 producer load 不会 hit store queue，则让 dependent load/store 提前发射，使 consumer 到达 AGU 的周期与 producer load 的 L1D early result bypass 到达周期重合
在 LSU 内部增加从 data cache access 输出到 AGU 输入 mux 的直接旁路；预测错误或格式条件不满足时 flush/replay consumer，并训练 predictor

                      +------------------------------+
Fetch/Decode          | Core                         |
   |                  |                              |
   v                  |  +------------------------+  |
Map/Dispatch          |  | Map/Dispatch           |  |
 - rename             |  |  - ROB                 |  |
 - detect dep         |  |  - LSD Predictor       |  |
 - lookup LSD         |  +-----------+------------+  |
   |                               dispatch           |
   v                                                  |
+-------------------- LSU -------------------------+ |
|                                                    | |
|  Reservation Station / picker                      | |
|     |                                              | |
|     v                                              | |
|  Load-store pipe                                   | |
|     Stage2: RF read ----+                          | |
|                         v                          | |
|                     +-------+                      | |
|  early bypass ----> | mux   | ---> Stage3 AGU ---> DTLB
|  from L1D           +-------+                      | |
|                         ^                          | |
|                         |                          | |
|     Stage4: L1D lookup / Store Queue lookup        | |
|                         |                          | |
|     Stage5: L1D data access -----------------------+ |
|                         |                            |
|                         v                            |
|                    alignment/sign-ext                |
|                         |                            |
|                         v                            |
|                 normal writeback/bypass              |
+----------------------------------------------------+ |

设计点 1: 依赖检测

针对的问题：
不是所有 load-use dependency 都适合优化。

该机制只对 producer 是 older load，consumer 是 younger memory operation，且 producer result 用于 consumer 地址生成的情况有意义
若 consumer 是普通 ALU 指令，AGU 早期旁路没有帮助
若 consumer 的地址源来自 ALU/store queue 等其他路径，也不符合该专利关注的 fast path

解决的思路：
在 decode/mapper/map-dispatch 阶段利用寄存器重命名信息检测依赖。
producer load 的 destination register 与 consumer load/store 的 address source register 相同，即可标记出 pointer chasing 风格的 load-to-load 或 load-to-store 地址依赖

设计：

producer load:
  Rptr <- LD [Rbase + imm0]

consumer load/store:
  Rval <- LD [Rptr + imm1]
  或
  ST [Rptr + imm1] <- Rdata

检测条件:
  consumer.addr_src_phys_reg == producer.dest_phys_reg
  producer older than consumer in program/ROB order
  consumer is memory operation requiring AGU address generation

这一步的本质是把普通寄存器 RAW 依赖细分为“load result drives AGU input”的特殊 RAW

设计点 2: LSD predictor 预测 producer load 是否 hit store queue

针对的问题：
producer load 的结果可能来自 L1 data cache，也可能来自 store queue forwarding。
若来自 store queue，则 store queue access 和 alignment path 通常更晚，未必能满足 consumer AGU 的 early bypass 时序。盲目提前发射会让 consumer 用错地址源。

解决的思路：
使用 load-store dependency predictor 记录历史信息。若 producer load 在 predictor 中命中某些 entry，表示它曾经导致 store queue source、ordering violation 或 early window 不可用，则预测其不能走短延迟路径；若未命中，则预测不会 hit store queue，可以走 L1D fast path。

LSD predictor table 的一个实施例字段如下：

1
2
3

+-------+-------------+------------+--------+--------------+--------+
|Valid  | Producer PC | Store RNUM | Armed  | Dependent PC | Status |
+-------+-------------+------------+--------+--------------+--------+

字段含义：

Valid：entry 是否有效，reset 时可清零，也参与替换策略。
Producer PC：识别 producer load/store 的 PC，可与 architectural register 组合或 hash；字段可按 CAM 方式比较。
Store RNUM：当记录 store-load 依赖时，保存 producing store 的 ROB index/RNUM。
Armed：用于 store-load ordering predictor；producer store dispatch 后置位，store issue 后清除。若 dependent load dispatch 时命中 armed entry，则建立并强制依赖。
Dependent PC：识别 dependent operation 的 PC，也可 CAM 查找。
Status：区分 entry 类型，例如 load-store ordering violation、producer load 从 store queue source、producer load 无法在 early window 向 dependent memory op 提供 data-cache result 等。

LSD predictor

为 256-entry fully associative 结构
也可以把 producer PC table、dependent PC table、status table 拆成多个相关联的 SRAM/CAM 表
允许 multiple simultaneous accesses/updates，以适应多发射、多 pipe LSU。

关键点是 predictor 被复用成两类功能：

传统功能:
  记录 store-load pair，避免 load-store ordering violation

本专利扩展功能:
  记录 producer load 是否不适合 L1D early bypass
  例如:
    - producer load 实际从 store queue forwarding 得到数据
    - producer load 在 early issue 窗口不能提供正确数据
    - producer data 需要 alignment/sign extension/endian 修正

设计点 3: L1D 到 AGU 的 early bypass 数据路径

针对的问题：
普通 load result path 经过 alignment、mux、全局 bypass/writeback，再被 consumer 的 reservation station/AGU 使用。这条路径长，且 AGU 所需数据到达太晚，导致 dependent load/store 发射推迟。

解决的思路：
在 LSU 内部从 data cache access 输出处拉一条 fast bypass，直接接回 AGU 输入 mux。这样 producer load 的 L1D hit data 在 stage 5 出来后，不必先写回 RF/RS，而可以在 LSU 本地作为 younger consumer memory operation 的地址源操作数。

load-store pipe 可整理如下：

Stage 1:
  Reservation Station issue

Stage 2:
  Register File Read
       |
       v
     mux 315 <------------------------------+
       |                                    |
Stage 3|                                    |
       v                                    |
     AGU 320 --> DTLB 325                   |
       |            |                       |
Stage 4|            |                       |
       v            v                       |
  Data Cache     Store Queue                |
  Lookup 330     Lookup 335                 |
       |            |                       |
Stage 5|            v                       |
       v       Store Queue Access 355       |
  Data Cache Access 340                     |
       |                                    |
       +---- Early Bypass Result Data 345 --+
       |
       v
  Align 350
       |
Stage 6
       v
  mux 365 -> normal bypass/writeback/RF

旁路发生在 alignment block 之前，因此路径短，但对数据格式有要求。
旁路目的地是 AGU 输入 mux，而不是普通 integer ALU bypass；这是专门为 load-result-as-address 场景服务的局部旁路
store queue path 也属于 LSU data path，但专利明确指出某些实施例只实现 data cache 到 AGU 的 fast bypass，不实现 store queue 到 AGU 的 fast bypass，因为 store queue 数据可能无法满足 timing requirement
若 LSU 有两个 load-store pipe，则每个 pipe 可有自己的 reservation station 和 AGU；fast bypass 需要能把 producer pipe 的 data-cache result 送到相应 consumer AGU

设计点 4: consumer 提前发射的时序控制

针对的问题：
仅有 bypass wire 不够。consumer memory operation 必须被调度到正确周期，使其到达 AGU 时 producer load 的 early result 也刚好可用。如果 consumer 发射太早，AGU 没数据；太晚，则失去收益

解决的思路：
调度器根据 predictor 输出选择两种发射距离：短延迟 M 或长延迟 N。

若预测 producer load hit store queue:
  consumer issue = producer issue + N
  等 store queue result 在 AGU 可用

若预测 producer load miss store queue，即走 data-cache fast path:
  consumer issue = producer issue + M
  使 consumer 到 AGU 与 L1D early result 到 AGU 同周期

约束:
  M, N 为正整数
  M < N
  一个实施例中 M = N - 1
  例子: data cache path 3 cycles, store queue path 4 cycles

ASCII 时序图：

cycle:          T       T+1       T+2       T+3       T+4
producer LD:   issue   RF        AGU/TLB   lookup    L1D data
                                                       |
                                                       | early bypass
consumer LD
short path:            issue     RF        AGU <-------+

consumer LD
long path:                       issue     RF        AGU <--- store queue data

核心是：data cache result 的可用窗口比 store queue result 早一个周期，因此 consumer 可提前一个周期发射。

设计点 5: 错误检测、flush/replay 与训练

针对的问题：
提前发射是投机行为。预测可能错，或者 producer load 虽然没有 store queue hit，但仍不满足数据格式要求。若继续执行，consumer load/store 的地址可能错误，进而造成错误数据访问或错误 store 地址。

解决的思路：
在 producer load 后续阶段确认实际来源和条件。如果预测错误或条件不满足，则 flush/replay dependent consumer memory operation，并训练 LSD predictor，避免后续同一 producer load 再走错误短路径。

  +-------------------------------------------+
  | Detect dependency: producer load ->       |
  | consumer memory operation                 |
  +--------------------+----------------------+
                       |
                       v
            +----------------------+
            | Predict load hits SQ?|
            +----------+-----------+
                       |
         yes           | no
          |            v
          |    +-------------------------------+
          |    | Issue consumer to coincide    |
          |    | with data-cache result at AGU |
          |    +---------------+---------------+
          |                    |
          |                    v
          |    +-------------------------------+
          |    | Bypass data-cache result      |
          |    | directly to AGU               |
          |    +---------------+---------------+
          |                    |
          |                    v
          |    +----------------------+
          |    | Prediction confirmed?|
          |    +-----+------------+---+
          |          |yes         |no
          |          v            v
          |  +----------------+  +-------------------+
          |  | Other conds OK?|  | Flush consumer    |
          |  +-----+----------+  +---------+---------+
          |        |yes/no                 |
          |        |                       v
          |        |              +-------------------+
          |        |              | Train LSD         |
          |        |              | predictor         |
          |        |              +-------------------+
          v        v
+-------------------------------+
| If hit SQ predicted: wait     |
| until SQ result reaches AGU   |
+-------------------------------+

需要检查的 “other conditions” 包括：

producer load result 对齐在 word boundary
producer load result 不需要 sign extension
不处于 Big Endian mode，或至少 endian 排列不会使 early pre-alignment data 的字节解释错误

原因是 early bypass 从 data cache access 到 mux 发生在 alignment block 之前。
如果 load 是 byte/halfword load、非对齐 load、需要 sign-extension 的 load，或者 endian mode 会改变所需字节位置，那么 raw cache data 不是最终语义值，直接作为地址源可能错误

设计点 6: load-to-store 也受益

针对的问题：
pointer chasing 不只表现为 dependent load，也可能表现为 store address depends on prior load。例如遍历链表时向当前节点字段写入，store 的地址同样需要 producer load 读出的指针。

解决的思路：
只要 younger memory operation 的地址生成依赖 older producer load，就可以使用同一套提前发射和 AGU bypass 机制

producer:
  Rptr <- LD [Rbase + next_off]

consumer load:
  Rval <- LD [Rptr + field_off]

consumer store:
  ST [Rptr + field_off] <- Rdata

与预取器的关系

可以与 pointer-chasing prefetcher 互补：

pointer-chasing prefetcher:
  试图提前把 future node cache line 拉入 L1/L2

本专利:
  当 current producer load 已经 L1D hit 时，
  缩短 current pointer value 被下一条 load/store AGU 使用的延迟

若两者结合，prefetcher 降低 miss latency，本专利降低 L1 hit dependent-use latency

实现风险和硬件代价

AGU 输入 mux 增大：mux 增加 early bypass input，可能影响 AGU stage timing
LSU 内部旁路布线：从 L1D data access 输出到一个或多个 AGU 的低延迟连线需要仔细 floorplan
Predictor 多端口/CAM 代价：多发射和多 LSU pipe 需要多路查询/更新，fully associative 结构有面积和功耗成本
Recovery 复杂度：错误时需要 flush/replay dependent consumer memory operation；实现上要确保 store 不会在错误地址不可逆提交。由于 store 通常在 commit 前不写入 cache/memory，这与 OoO 精确状态机制兼容，但 store address/data queue 状态仍需正确清理
Memory ordering 交互：LSD predictor 同时服务 ordering violation 和 early bypass viability，需要 status 字段避免不同训练原因相互污染