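Background

This paper focuses on the front-end bottleneck of data center applications:

Modern servers widely use a decoupled front-end and instruction prefetching such as FDIP, so many L1I misses can already be tolerated
But instruction misses in L2 cause decode starvation, which empties the issue queue and eventually stalls the core
If the L2 hit rate for instructions were near perfect, the average performance gain over Tree-Based PLRU (TPLRU) across 12 data center workloads would reach 19.8%, with a maximum of 75%
The existing SOTA scheme, EMISSARY, improves performance by only 2.2% on average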

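SOTA: EMISSARY

1. Critical instruction lines are dynamic

The idea of EMISSARY: identify instruction lines that cause decode starvation as critical, and keep them in L2 as long as possible

Problem: criticality is not a static property but a dynamic one that depends strongly on context

On average, only 28.32% of critical lines stay critical consistently
The remaining 71.68% of critical lines are dynamically critical (d-critical)
- The same instruction line sometimes causes decode starvation and sometimes does not
- This happens in both user-mode code and OS code

Marking criticality statically has two problems:

A line that is critical on the current path may go unidentified
A line that was once critical, but is not critical on the current path, ends up over-protected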

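2. Branch history is not considered

Control flow determines criticality, and control flow can be approximated by branch history

The same cache-line-aligned PC can show completely different decode starvation behavior under different branch histories

The paper gives an example with one specific PC in finagle-chirper: when the branch history is 11000100 it always causes decode starvation, while with history 01000100 it never does

The reason: branch history changes the control-flow path, and with it the local reuse distance of that line in L2

Conclusion: criticality detection cannot look at the PC alone; branch history must be brought in as context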

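3. Reuse distance is not distinguished

Even critical lines should not all be protected equally, because their reuse behavior differs widely
EMISSARY separates critical from non-critical lines, but within the critical lines it does not further distinguish short reuse from long reuse

The paper finds:

Once a line is confirmed critical it is almost always reused, because dead-on-arrival critical lines have already been filtered out
But critical lines still differ in short/mid/long reuse
EMISSARY's approach: when the critical lines in a set exceed a threshold, it replaces among the critical lines, so critical lines with mid/long reuse are evicted before their reuse arrives

In addition, reuse should be measured as "local reuse" rather than "global reuse":

Replacement decisions are made within a set
- So the local reuse distance is more meaningful than the global reuse distance across the whole L2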

Experimental analysis of reuse for critical / non-critical lines:
Instruction lines are grouped by local reuse distance into:

no reuse
short: [0,16)
mid: [16,32)
long: 32+

Then the distributions for critical and non-critical lines are compared:

Critical lines are almost always reused
About 19% of non-critical lines are dead, with no reuse
Among critical lines, short reuse accounts for about 19.10% on average
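
The binning above can be sketched in a few lines of Python. This is only an illustration, not the paper's measurement code. In particular, the notes do not define how local reuse distance is counted, so the sketch assumes it is the number of intervening accesses to the same set between two accesses to the same line. The modulo set mapping and the function names are also hypothetical.

```python
from collections import defaultdict


def local_reuse_distances(line_addrs, num_sets):
    """Return, for each access, the local distance to the next reuse of that line.

    Assumption: distance = number of other accesses to the same set that occur
    between two consecutive accesses to the line. None means never reused.
    """
    set_clock = defaultdict(int)   # per-set access counter
    last_seen = {}                 # line -> (trace index, set clock at that access)
    result = [None] * len(line_addrs)
    for i, line in enumerate(line_addrs):
        s = line % num_sets        # assumed set-index function
        if line in last_seen:
            prev_i, prev_clock = last_seen[line]
            result[prev_i] = set_clock[s] - prev_clock - 1
        last_seen[line] = (i, set_clock[s])
        set_clock[s] += 1
    return result


def classify_local_reuse(distance):
    """Map a local reuse distance to the bins used in the notes."""
    if distance is None:
        return "no reuse"
    if distance < 16:
        return "short"   # [0,16)
    if distance < 32:
        return "mid"     # [16,32)
    return "long"        # 32+


if __name__ == "__main__":
    trace = [0, 2, 4, 0, 1, 0]     # made-up line addresses, 2 sets
    for d in local_reuse_distances(trace, num_sets=2):
        print(d, classify_local_reuse(d))
```

Counting per set rather than over the whole trace is what makes the distance "local": the access to line 1 in the example lands in another set and does not lengthen the distance for line 0.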

ICARUS design

ICARUS can be understood as two parts:

BHC: Branch-History-based Criticality detection
BRC: bin-based replacement Based on Reuse and Criticality

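1. BHC: using branch history to improve critical fetch detection

ICARUS introduces a Critical Instruction fetch Identification Table (CIT):

When decode starvation occurs while the issue queue is empty, that instruction fetch is treated as a critical fetch
The cache-line address and the branch history are hashed together
The hash indexes the CIT (a small table)
The corresponding 2-bit saturating counter is incremented
When the counter exceeds the threshold of 2, a criticality signal is sent to L2 and that L2 line is marked critical
- This filters out sporadic critical misses that occur only once and have no later reuse

Details:

The branch history is 9 bits long
The CIT has 512 entries, each a 2-bit saturating counter (tagless, so aliasing is allowed)
To avoid long-term over-prediction, the CIT is cleared every 1M cycles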
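
The CIT steps above can be sketched as a small Python model. This is a minimal illustration under stated assumptions, not the paper's hardware design: the notes do not give the hash function, so a plain XOR of the line address with the 9-bit history is assumed; "exceeds the threshold of 2" is read literally as a strict comparison; and the class and method names are hypothetical.

```python
COUNTER_MAX = 3  # a 2-bit saturating counter holds 0..3


def is_critical_fetch(decode_starved, issue_queue_empty):
    """A fetch counts as critical when decode starves and the issue queue is empty."""
    return decode_starved and issue_queue_empty


class CriticalFetchTable:
    """Tagless table of 2-bit counters indexed by hash(line address, branch history)."""

    def __init__(self, entries=512, history_bits=9, threshold=2,
                 reset_period=1_000_000):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.threshold = threshold
        self.reset_period = reset_period
        self.counters = [0] * entries
        self.epoch = 0

    def index(self, line_addr, branch_history):
        # Assumed hash: XOR-fold, then modulo the table size. Aliasing is allowed.
        return (line_addr ^ (branch_history & self.history_mask)) % self.entries

    def record_critical_fetch(self, line_addr, branch_history, cycle):
        """Bump the counter; return True if a criticality signal goes to L2."""
        epoch = cycle // self.reset_period
        if epoch != self.epoch:            # periodic clear every reset_period cycles
            self.counters = [0] * self.entries
            self.epoch = epoch
        i = self.index(line_addr, branch_history)
        self.counters[i] = min(COUNTER_MAX, self.counters[i] + 1)
        return self.counters[i] > self.threshold  # strict ">" is an assumption


if __name__ == "__main__":
    cit = CriticalFetchTable()
    line, hist = 0x1000, 0b110001001       # arbitrary illustrative values
    for cycle in (10, 20, 30):
        if is_critical_fetch(decode_starved=True, issue_queue_empty=True):
            print(cycle, cit.record_critical_fetch(line, hist, cycle))
```

Because the index mixes in the branch history, the same line reached along a different path trains a different counter, which is how the table separates critical and non-critical contexts for one PC.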

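Direct benefit of BHC

An intermediate experiment: replace only EMISSARY's "PC-based criticality signature" with BHC's "PC + branch history", while keeping EMISSARY's replacement logic

Results relative to TPLRU:

The original EMISSARY reduces decode starvation cycles per instruction by 2.5%
With BHC, the reduction becomes 6.5%

Conclusion: criticality detection itself is a bottleneck, and branch history clearly improves identification quality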

2. BRC: bringing reuse into replacement

ICARUS keeps two bits for each L2 cache line:

criticality bit
reuse bit

Cache lines therefore fall into four bins:

[0,0]: non-critical & not reused
[0,1]: non-critical & reused
[1,1]: critical & reused
[1,0]: critical & not reused (highest protection priority)
- A line that is critical but not yet reused most needs to stay in the cache until its future reuse happens

Protection priority: [1,0] > [1,1] > [0,1] > [0,0]

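Watermark mechanism: quota constraints

Evicting strictly by priority could let high-priority cache lines fill an entire set and squeeze out the other lines

So the authors assign a watermark to each of the four bins:

[0,0] → 2
[0,1] → 4
[1,1] → 6
[1,0] → 4

Meaning: a bin becomes a preferred source of eviction candidates only once its line count exceeds its watermark

The authors explain the empirical reasoning behind these numbers:

If the [0,0] watermark is too large, non-critical lines with no reuse waste space and hurt critical lines
[0,1] gets 4 to leave room for the few lines that move from [0,1] to [1,1]
[1,1] has a larger watermark than [1,0] because once a line enters [1,1] it is usually reused at least once more
If the watermarks of [1,1] and [1,0] are set too low, performance drops noticeably, in some cases by more than 2%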
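
One possible reading of the watermark rule is sketched below in Python. The notes only say that a bin becomes a preferred eviction source once it exceeds its watermark, so the rest is assumed for illustration: bins are scanned from least to most protected, the first over-quota bin supplies the victim, the fallback is the least-protected non-empty bin, and the oldest line inside the chosen bin is evicted. The function name and the (critical, reused, age) tuple layout are hypothetical.

```python
from collections import Counter

# Bins as (criticality bit, reuse bit), ordered from least to most protected.
EVICTION_ORDER = [(0, 0), (0, 1), (1, 1), (1, 0)]
WATERMARK = {(0, 0): 2, (0, 1): 4, (1, 1): 6, (1, 0): 4}


def choose_victim(ways):
    """Pick the way to evict from one full set.

    ways: list of (critical, reused, age) tuples, one per way.
    Assumed policy: first over-watermark bin in EVICTION_ORDER, else the
    least-protected non-empty bin; oldest line within the bin (age tie-break
    is an assumption, the notes do not specify it).
    """
    occupancy = Counter((c, r) for c, r, _ in ways)
    over = [b for b in EVICTION_ORDER if occupancy[b] > WATERMARK[b]]
    if over:
        target = over[0]
    else:
        target = next(b for b in EVICTION_ORDER if occupancy[b] > 0)
    candidates = [i for i, (c, r, _) in enumerate(ways) if (c, r) == target]
    return max(candidates, key=lambda i: ways[i][2])


if __name__ == "__main__":
    # 16-way set: [0,0] holds 3 lines, one more than its watermark of 2.
    ways = ([(0, 0, a) for a in (5, 9, 1)] + [(0, 1, 0)] * 4
            + [(1, 1, 0)] * 5 + [(1, 0, 0)] * 4)
    print(choose_victim(ways))
```

The four watermarks sum to 16, which matches the 16-way L2 in the evaluated configuration, so in a full set either some bin is over its quota or every bin sits exactly at it; the fallback covers the second case.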

Insertion and state transitions

A cache line enters [0,0] on initial insertion

After that:

On a cache hit, the reuse bit is set to 1
On a criticality signal from the CIT, the criticality bit is set to 1
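
The transitions above amount to a two-bit state per line, sketched here in Python. The class and method names are hypothetical. The notes do not describe any way to clear a bit while the line is resident, so this sketch assumes the bits only reset when a new line is inserted.

```python
from dataclasses import dataclass


@dataclass
class L2LineState:
    """Per-line metadata: criticality bit and reuse bit, both 0 at insertion."""
    critical: bool = False
    reused: bool = False

    def on_hit(self):
        # A cache hit sets the reuse bit.
        self.reused = True

    def on_criticality_signal(self):
        # A criticality signal from the CIT sets the criticality bit.
        self.critical = True

    @property
    def bin(self):
        """Bin label as (criticality bit, reuse bit), e.g. (1, 0) for [1,0]."""
        return (int(self.critical), int(self.reused))


if __name__ == "__main__":
    line = L2LineState()             # inserted into [0,0]
    line.on_hit()                    # moves to [0,1]
    line.on_criticality_signal()     # moves to [1,1]
    print(line.bin)
```

A line can reach [1,1] either through [0,1] (hit first, then signal) or through [1,0] (signal first, then hit).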

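Evaluation

Setup

Platform: gem5 full-system simulation, modeling a cache hierarchy similar to Intel Granite Rapids:

L1I 64KB
Private unified L2, 2MB, 16-way
L3 3MB/core
FDIP enabled
gem5 FS mode: Ubuntu 18.04
Each benchmark is warmed up for 50M instructions and measured for 200M instructions

Workloads: 12 data center applications:

These include tpcc, wikipedia, finagle-chirper, kafka, tomcat, verilator, and web-search, among others
The average instruction footprint is 1.66MB
The average critical fetch ratio is 3.49%, yet critical fetches cause 23.18% of front-end stalls on average
For verilator, critical fetches account for 19.89% and cause 90.69% of stalls

This shows that critical fetches are rare but account for a large share of the bottleneck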

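Performance

The complete ICARUS (BHC+BRC) improves performance over TPLRU by 5.6% on average, and by up to 51%
EMISSARY improves performance by 2.2% on average

Effect on instruction MPKI and decode starvation

ICARUS reduces the average L2 instruction MPKI from 4.72 to 1.94
It also significantly reduces decode starvation cycles per instruction, and beats EMISSARY on every benchmark

On tomcat, instruction MPKI actually rises, yet decode starvation still falls and performance still improves
The authors' explanation: the extra misses fall mainly on non-critical lines, which the policy deliberately sacrifices to keep the critical lines that really cause front-end stalls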

Interaction with prefetchers

With the instruction prefetcher PDIP:

PDIP on top of TPLRU gains 2.5%
PDIP + EMISSARY gains 4%
PDIP + ICARUS gains 6%

With the data prefetcher IP-stride:

IP-stride + TPLRU gains 1.6%
IP-stride + EMISSARY gains 3.7%
IP-stride + ICARUS gains 7.4%

With PDIP + IP-stride together:

With EMISSARY, the average gain is 5.6%
With ICARUS, the average gain is 7.7%

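Sensitivity to cache size

L1I size

From 32KB to 256KB, ICARUS beats EMISSARY throughout
Even with a 256KB L1I, ICARUS still gains 4.6% over TPLRU, while EMISSARY gains only 2.2%

L2 size

From 1MB to 4MB, the gains of both schemes shrink as capacity grows
With a 4MB L2, ICARUS still gains <1.4%, while EMISSARY gains about 0.3%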

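Storage overhead

ICARUS costs 0.13KB + 8KB of storage for a 2MB L2, broken down as:

CIT: 512 × 2-bit, 128B in total
Branch history register: 9 bits
Two flags (critical/reuse) per L2 line: about 8KB in total

Compared with EMISSARY's 4KB, ICARUS needs 4.13KB more, in exchange for a clearly higher performance gain. At the scale of a 2MB L2 this cost is acceptable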
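
The storage figures above can be checked with a short calculation. The 64-byte cache line size is an assumption (the notes do not state it), chosen because it reproduces the quoted 8KB of per-line flags; the variable names are illustrative only.

```python
LINE_BYTES = 64                      # assumed cache line size
L2_BYTES = 2 * 1024 * 1024           # 2MB L2
EMISSARY_BYTES = 4 * 1024            # 4KB, as quoted in the notes

l2_lines = L2_BYTES // LINE_BYTES    # number of L2 lines

cit_bytes = 512 * 2 / 8              # 512 entries x 2-bit counters
bhr_bytes = 9 / 8                    # 9-bit branch history register
detect_kb = (cit_bytes + bhr_bytes) / 1024

flag_bytes = l2_lines * 2 / 8        # criticality bit + reuse bit per line
flag_kb = flag_bytes / 1024

extra_kb = (cit_bytes + bhr_bytes + flag_bytes - EMISSARY_BYTES) / 1024

if __name__ == "__main__":
    print(f"CIT: {cit_bytes:.0f} B, detection logic: {detect_kb:.2f} KB")
    print(f"per-line flags: {flag_kb:.0f} KB")
    print(f"extra over EMISSARY: {extra_kb:.2f} KB")
```

Under the same 64-byte assumption, EMISSARY's 4KB corresponds to one bit per L2 line, so the second per-line bit accounts for nearly all of the additional storage.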

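Limitations

Strong dependence on workload type

The benefit of ICARUS depends heavily on:

A large code footprint
L1I misses already being well hidden by the decoupled front-end
L2 instruction misses still being frequent and costly
A critical fetch ratio that is small but responsible for a large share of stalls

These conditions hold for datacenter/server workloads
For some desktop applications, embedded scenarios, or programs with a small code footprint, the benefit may be limited

Watermarks are empirical (static) parameters

The watermarks of the four bins were tuned through a sweep
The best values may change with a different architecture, associativity, or workload mix

Future work: make the watermarks adapt dynamically instead of using fixed constants

A gap to perfect L2 for instructions remains

ICARUS raises the average gain over TPLRU from EMISSARY's 2.2% to 5.6%, clearly ahead of EMISSARY,
but it is still far from the 19.8% of a perfect L2 for instructions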