处理器使用 write buffer 以掩盖 store miss 延迟,同时与 store 相关的 load 指令可以将 store 的写数据前递给 load 指令,使 load 指令更早得到数据。
write buffer 在 store-load 的情况下会违反 SC 模型,并且 write buffer 对多处理器可见
定义
第一次形式化提出:
P. Sindhu, J.-M. Frailong, and M. Ceklov. Formal specification of memory models. Technical Report CSL-91–11, Xerox Palo Alto Research Center, December 1991. DOI: 10.1007/978-1-4615-3604-8_2. 50
TSO execution 的要求:
- 所有核将自己的 load/store 插入 global order <m 的时候需满足各自的 program order <p (不论访存地址是否相同)
- if L(a) <p L(b) => L(a) <m L(b)
- if L(a) <p S(b) => L(a) <m S(b)
- if S(a) <p S(b) => S(a) <m S(b)
- [line-through]#if S(a) <p L(b) => S(a) <m L(b)#
- 每次 load 拿到的值是 global order 中该 load 之前的最后一次 store 的值
- Value of L(a) = Value of $MAX_{<m} {S(a) | S(a) <m L(a) or S(a) <p L(a) }$
- FENCE 指定访存的排序
- if L(a) <p FENCE => L(a) <m FENCE
- if S(a) <p FENCE => S(a) <m FENCE
- if FENCE <p FENCE => FENCE <m FENCE
- if FENCE <p L(a) => FENCE <m L(a)
- if FENCE <p S(a) => FENCE <m S(a)
TSO ordering rules (X 表示强制定序, B 表示需要 bypass)
Operation 2 | |||||
Operation 1 | Load | Store | RMW | FENCE | |
Load | X | X | X | X | |
Store | B | X | X | X | |
RWM | X | X | X | X | |
FENCE | X | X | X | X |
x86 与 TSO
x86 手册并未说明其内存模型为 TSO 模型,但目前测试 x86 所使用的内存模型满足 TSO 模型,因此也称为 x86-TSO 模型
- P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-TSO: A rigorous and usable programmer’s model for x86 multiprocessors. Communications of the ACM, July 2010. DOI: 10.1145/1785414.1785443. 45, 46, 50
- S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In Proc. of the Conference on Theorem Proving in Higher Order Logics, 2009. DOI: 10.1007/978-3-64203359-9_27. 46, 50
- S. Sarkar, P. Sewell, F. Z. Nardelli, S. Owens, T. Ridge, T. Braibant, M. O. Myreen, and J. Alglave. The semantics of x86-CC multiprocessor machine code. In Proc. of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 379391, 2009. DOI: 10.1145/1480881.1480929. 46, 50
实现
The Switch
- 相比于 SC switch ,每个核额外有一个 FIFO write buffer
- load 可以从 write buffer 中前递数据值或等待 switch 切换到该核进行访存
- 如果 write buffer 满,阻塞该核
- switch 切换到该核时,要么执行下一个 load 要么执行 store buffer 头部的 store
cache-coherence system
SC 的所有优化实现皆可应用到 TSO ,只需要在核和内存之间加一个 write buffer
对于多线程而言,每个核上的多个线程之间不能互相进行前递(上下文不同), write buffer 应当对每个线程是私有的,解决方法:
- 每个线程各自使用一个私有的 write buffer
- 各线程共享 write buffer,但 write buffer 的每个 entry 需要增加 tag 域,当 tag 匹配时才可前递
Non-speculative TSO reordering
load 和 store 可以非预测性地重排序,同时满足 TSO 模型: coherence delaying
- A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras. Non-speculative load-load reordering in TSO. In Proc. of the 44th Annual International Symposium on Computer Architecture, 2017. DOI: 10.1145/3079856.3080220. 49
- A. Ros and S. Kaxiras. The superfluous load queue. In 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018. DOI: 10.1109/micro.2018.00017. 49
- A. Ros and S. Kaxiras. Non-speculative store coalescing in total store order. In 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018. DOI: 10.1109/isca.2018.00028. 49
原子指令
- load 部分的执行必须在先前访存指令(load & store)之后
- RMW 指令的 load 部分执行前 write buffer 中的 store 必须全部退役
- RMW 指令本身原子性的要求
- load 部分执行时需要获取的 coherence 状态为 M 状态 (read-write)
- load 部分和 store 部分之间,该核必须保持对 cache 相应 block 的 M 状态
write buffer 满足一定的条件也可以在 RMW 指令前不排空
- write buffer 中的所有 store 对应的 cache block 都是 M 状态,并且维持到 RMW 指令退役
- 每个核检查 load 预测执行的正确性
B. Rajaram, V. Nagarajan, S. Sarkar, and M. Elver. Fast RMWs for TSO: Semantics and implementation. In Proc. of ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013. DOI: 10.1145/2491956.2462196. 49_
Fence 的实现
简单实现(在不经常使用 FENCE 的场景下,可以使用):
- 排空 write buffer 中的 store 指令
- 后续的 load 指令需等待 FENCE 退役后才可以继续执行
优化实现(在经常使用 FENCE 的场景下,可以使用):
- FENCE 之后的访存操作在处理 FENCE 之前可以非预测性低退役:
- coherence delays
- predecessor serialization
FENCE 的优化实现:
- Y. Duan, A. Muzahid, and J. Torrellas. WeeFence: Toward making fences free in TSO. In The 40th Annual International Symposium on Computer Architecture, 2013. DOI: 10.1145/2485922.2485941. 49
- Y. Duan, N. Honarmand, and J. Torrellas. Asymmetric memory fences: Optimizing both performance and implementability. In Proc. of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015. DOI: 10.1145/2694344.2694388. 49
- K. Gharachorloo, M. Sharma, S. Steely, and S. Van Doren. Architecture and design of AlphaServer GS320. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000. DOI: 10.1145/378993.378997. 49
- C. Lin, V. Nagarajan, and R. Gupta. Efficient sequential consistency using conditional fences. In 19th International Conference on Parallel Architectures and Compilation Techniques, 2010. DOI: 10.1145/1854273.1854312. 49