处理器使用 write buffer 以掩盖 store miss 延迟，同时与 store 相关的 load 指令可以将 store 的写数据前递给 load 指令，使 load 指令更早得到数据。

write buffer 在 store-load 的情况下会违反 SC 模型，并且 write buffer 对多处理器可见

定义

第一次形式化提出:
P. Sindhu, J.-M. Frailong, and M. Ceklov. Formal specification of memory models. Technical Report CSL-91–11, Xerox Palo Alto Research Center, December 1991. DOI: 10.1007/978-1-4615-3604-8_2. 50

TSO execution 的要求:

所有核将自己的 load/store 插入 global order <m 的时候需满足各自的 program order L(a) <m L(b)
- if L(a) L(a) <m S(b)
- if S(a) S(a) <m S(b)
- [line-through]#if S(a) S(a) <m L(b)#
每次 load 拿到的值是 global order 中该 load 之前的最后一次 store 的值
- Value of L(a) = Value of $MAX_{<m} {S(a) | S(a) <m L(a) or S(a) <p L(a) }$
FENCE 指定访存的排序
- if L(a) L(a) <m FENCE
- if S(a) S(a) <m FENCE
- if FENCE FENCE <m FENCE
- if FENCE FENCE <m L(a)
- if FENCE FENCE <m S(a)

TSO ordering rules (X 表示强制定序, B 表示需要 bypass)

Operation 2
Operation 1		Load	Store	RMW	FENCE
	Load	X	X	X	X
	Store	B	X	X	X
	RWM	X	X	X	X
	FENCE	X	X	X	X

x86 与 TSO
x86 手册并未说明其内存模型为 TSO 模型，但目前测试 x86 所使用的内存模型满足 TSO 模型，因此也称为 x86-TSO 模型

P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen. x86-TSO: A rigorous and usable programmer’s model for x86 multiprocessors. Communications of the ACM, July 2010. DOI: 10.1145/1785414.1785443. 45, 46, 50

S. Owens, S. Sarkar, and P. Sewell. A better x86 memory model: x86-TSO. In Proc. of the Conference on Theorem Proving in Higher Order Logics, 2009. DOI: 10.1007/978-3-64203359-9_27. 46, 50

S. Sarkar, P. Sewell, F. Z. Nardelli, S. Owens, T. Ridge, T. Braibant, M. O. Myreen, and J. Alglave. The semantics of x86-CC multiprocessor machine code. In Proc. of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 379391, 2009. DOI: 10.1145/1480881.1480929. 46, 50

实现

The Switch

相比于 SC switch ，每个核额外有一个 FIFO write buffer
load 可以从 write buffer 中前递数据值或等待 switch 切换到该核进行访存
如果 write buffer 满，阻塞该核
switch 切换到该核时，要么执行下一个 load 要么执行 store buffer 头部的 store

cache-coherence system

SC 的所有优化实现皆可应用到 TSO ，只需要在核和内存之间加一个 write buffer

对于多线程而言，每个核上的多个线程之间不能互相进行前递（上下文不同）， write buffer 应当对每个线程是私有的，解决方法：

每个线程各自使用一个私有的 write buffer
各线程共享 write buffer，但 write buffer 的每个 entry 需要增加 tag 域，当 tag 匹配时才可前递

Non-speculative TSO reordering
load 和 store 可以非预测性地重排序，同时满足 TSO 模型： coherence delaying

A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras. Non-speculative load-load reordering in TSO. In Proc. of the 44th Annual International Symposium on Computer Architecture, 2017. DOI: 10.1145/3079856.3080220. 49

A. Ros and S. Kaxiras. The superfluous load queue. In 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018. DOI: 10.1109/micro.2018.00017. 49

A. Ros and S. Kaxiras. Non-speculative store coalescing in total store order. In 45th ACM/IEEE Annual International Symposium on Computer Architecture, 2018. DOI: 10.1109/isca.2018.00028. 49

原子指令

load 部分的执行必须在先前访存指令(load & store)之后
- RMW 指令的 load 部分执行前 write buffer 中的 store 必须全部退役
RMW 指令本身原子性的要求
- load 部分执行时需要获取的 coherence 状态为 M 状态 (read-write)
- load 部分和 store 部分之间，该核必须保持对 cache 相应 block 的 M 状态

write buffer 满足一定的条件也可以在 RMW 指令前不排空

write buffer 中的所有 store 对应的 cache block 都是 M 状态，并且维持到 RMW 指令退役
每个核检查 load 预测执行的正确性

B. Rajaram, V. Nagarajan, S. Sarkar, and M. Elver. Fast RMWs for TSO: Semantics and implementation. In Proc. of ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013. DOI: 10.1145/2491956.2462196. 49_

Fence 的实现

简单实现（在不经常使用 FENCE 的场景下，可以使用）：

排空 write buffer 中的 store 指令
后续的 load 指令需等待 FENCE 退役后才可以继续执行

优化实现（在经常使用 FENCE 的场景下，可以使用）：

FENCE 之后的访存操作在处理 FENCE 之前可以非预测性低退役：
- coherence delays
- predecessor serialization

FENCE 的优化实现:

Y. Duan, A. Muzahid, and J. Torrellas. WeeFence: Toward making fences free in TSO. In The 40th Annual International Symposium on Computer Architecture, 2013. DOI: 10.1145/2485922.2485941. 49

Y. Duan, N. Honarmand, and J. Torrellas. Asymmetric memory fences: Optimizing both performance and implementability. In Proc. of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, 2015. DOI: 10.1145/2694344.2694388. 49

K. Gharachorloo, M. Sharma, S. Steely, and S. Van Doren. Architecture and design of AlphaServer GS320. In Proc. of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 2000. DOI: 10.1145/378993.378997. 49

C. Lin, V. Nagarajan, and R. Gupta. Efficient sequential consistency using conditional fences. In 19th International Conference on Parallel Architectures and Compilation Techniques, 2010. DOI: 10.1145/1854273.1854312. 49

xixi's blog

Total Store Order (TSO)

定义

实现

The Switch

cache-coherence system

原子指令

Fence 的实现