定义

第一次提出的定义

L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–91, September 1979. DOI: 10.1109/tc.1979.1675439. 22, 35

单处理器 sequential: 执行的结果与按照程序指定的顺序执行的结果相同

多处理器 sequentially consistent:
任何执行的结果都相当于以某种顺序依次执行所有处理器（核心）的操作，并且每个单独的处理器（核心）的操作按照其程序指定的顺序出现在该序列中

memory order: 程序指定的所有操作的顺序

SC 模型严格遵守 memory order

以公理的形式定义

D. L. Weaver and T. Germond, Eds. SPARC Architecture Manual (Version 9). PTR Prentice Hall, 1994. 23

符号表:

L(a) / S(a)	对地址 a 的 Load/Store
<p	Program Order: 每个核顺序执行所有访存操作的 total order
<m	Global Order: 所有核的访存操作的 total order

SC execution 的要求:

所有核将自己的 load/store 插入 global order <m 的时候需满足各自的 program order <p (不论访存地址是否相同)
- if L(a) <p L(b) => L(a) <m L(b)
- if L(a) <p S(b) => L(a) <m S(b)
- if S(a) <p S(b) => S(a) <m S(b)
- if S(a) <p L(b) => S(a) <m L(b)
每次 load 拿到的值是 global order 中该 load 之前的最后一次 store 的值
- Value of L(a) = Value of $MAX_{<m} {S(a) | S(a) <m L(a)}$, $MAX_{<m}$ 表示 <m 中最近的操作

SC ordering rules (X 表示强制定序):

Operation 2
Operation 1		Load	Store	RMW
	Load	X	X	X
	Store	X	X	X
	RMW	X	X	X

第一个证明每个核向其 cache(支持 cache coherence) 发出的访存请求满足 program order 时满足 SC 模型:
A. Meixner and D. J. Sorin. Dynamic verification of memory consistency in cachecoherent multithreaded computer architectures. In Proc. of the International Conference on Dependable Systems and Networks, pp. 73–82, June 2006. DOI: 10.1109/dsn.2006.29. 35_

实现

Naive Implementations

The Multitasking Uniprocessor: 在单个顺序核上执行所有线程
- 要求上下文切换前访存操作必须全部完成。
The Switch: 多个核将访存操作提交给一个 switch 选择执行
- 每个核提交的访存操作需满足自己的 program order
- switch 选择核来执行实际的访存，需要满足 global order
- switch 选择的策略不能饿死某个核的 ready 请求以保证公平

The Switch:

满足 SC 模型
缺点: 顺序瓶颈问题: 不具备可扩展性（核数增多时性能瓶颈）

支持 Cache Coherence 的基础实现

L1Cache 中的 block 需要支持 modified(M) 和 shared(S) 两个状态
1. modified 状态表示一个核可以对其进行读写
2. shared 状态表示一个或多个核对其只可读

通过基本的 cache coherence 保证了 SWMR 的不变性以实现 SC

对于地址不冲突的访存请求可以并发执行，提高可扩展性
将 core 的实现与 coherence 的实现解耦

支持 Cache Coherence 的优化实现

与其他 CPU 特性结合使用以提高性能

Non-Binding Prefetching

在 cache-coherenct memory system 中对 block 的预取会改变 block 的 coherence 状态，而对于 memory consistency model 而言预取相当于 nop ，因为对 coherence 状态的获取时间不会影响每个 core 本身的 program order

Speculative Cores

分支指令之后的 load/store 在遇到分支预测失败时会被冲刷出流水线。

对于被冲刷的 load 而言，如果访存请求已经发出，其效果相当于预取，不会影响 SC
对于被冲刷的 store 而言，相当于发出 getM 的 coherence 状态请求，实际的 store 并未发生(store 只有在达到 ROB 的顶端确认可以执行时才会执行)

Dynamically Scheduled Cores

乱序执行/动态调度可能会对不相关的 load 指令重排序(其他核不可见)，会违反 SC 规定的 program order

对于 SC 的预测执行，需要验证乱序核的预测执行是否正确（和其他核的请求不冲突）: L1 -> L2 (预测执行先执行 L2)
乱序核预测执行完 L2 ，在 commit L2 之前检查 cache 中 evict 的 block 以及向 cache 发出的 coherence 请求等是否和 L2 请求的 block 一致
- 如果此时有 GetM 的请求或者 block 一致，说明其他核会试图写入该 block, 乱序核需要使预测执行失效
在预测执行的 L2 指令 commit 时再次发出 load 请求
- 如果两次 load 的值相同，说明预测执行正确，否则说明有其他核在该 load 执行和 commit 之间发出了冲突的写请求，此时乱序核需要清除预测执行
- 对于支持值预测的处理器而言，不能比较两次 load 的结果 (ABA problem)

K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the performance of memory consistency models. In Proc. of the International Conference on Parallel Processing, vol. I, pp. 355–64, August 1991. 30_

Multithreading

一个核内的多个线程共享 L1 cache ：粗粒度多线程、细粒度多线程、同时多线程

需要确保线程 T1 不能读到同一核内另一个线程 T2 此时正在写入的值（该写入此时对其他线程还不可见）

Post-retirement speculation

单核乱序核中的 write buffer 将所有的 store 加入以隐藏 store miss latency, 同时 load 指令通过访问 write buffer 检查相关性

多核中 store miss latency 会更长，为提高性能, 可以预测性地使 store miss 之前的 load/store 指令退役，同时维护这些预测性退役的指令的状态

细粒度：

C. Guiady, B. Falsafi, and T. Vijaykumar. Is SC C ILP D RC? In Proc. of the 26th Annual International Symposium on Computer Architecture, pp. 162–71, May 1999. DOI: 10.1109/isca.1999.765948. 32

P. Ranganathan, V. S. Pai, and S. V. Adve. Using speculative retirement and larger instruction windows to narrow the performance gap between memory consistency models. In Proc. of the 9th ACM Symposium on Parallel Algorithms and Architectures, pp. 199–210, June 1997. DOI: 10.1145/258492.258512. 32

粗粒度：

C. Blundell, M. M. K. Martin, and T. F. Wenisch. InvisiFence: Performance-transparent memory ordering in conventional multiprocessors. In Proc. of the 36th Annual International Symposium on Computer Architecture, June 2009. DOI: 10.1145/1555754.1555785. 32

L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. BulkSC: Bulk enforcement of sequential consistency. In Proc. of the 34th Annual International Symposium on Computer Architecture, June 2007. DOI: 10.1145/1250662.1250697. 32

L. Hammond et al. Transactional memory coherence and consistency. In Proc. of the 31st Annual International Symposium on Computer Architecture, June 2004. DOI: 10.1109/isca.2004.1310767. 32

T. F. Wenisch, A. Ailamaki, A. Moshovos, and B. Falsafi. Mechanisms for store-wait-free multiprocessors. In Proc. of the 34th Annual International Symposium on Computer Architecture, June 2007. DOI: 10.1145/1250662.1250696. 32

Non-speculative reordering

采用非预测执行的方式乱序执行访存指令以满足 SC ，同时这种重排序不能被其他核看到

coherence delaying:
当一个更年轻的访存指令要退役时如果旧的访存指令还未退役，年轻访存指令的 coherence 请求会延时直到旧访存指令退役。可能会导致死锁，需要检测规避

predecessor serialization:
旧的访存操作需要确保新的访存操作在其之后完成是安全的

Conflict Ordering: 允许 load/store 在 store miss 之后退役，要求
- 未完成的 store 是顺序的
- 新的访存操作和该 store miss 不冲突
Gope and Lipasti 对于顺序处理器的定制方法: load/store 可以顺序地从目录中取出互斥锁，乱序地退役
借助于编译器或者 MMU 的帮助使得访存操作被安全地重排序
- thread-private 和 read-only 变量可以被安全重排序

Conflict Ordering:
C. Lin, V. Nagarajan, R. Gupta, and B. Rajaram. Efficient sequential consistency via conflict ordering. In Proc. of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS, 2012. DOI: 10.1145/2150976.2151006. 32_

Gope and Lipasti
D. Gope and M. H. Lipasti. Atomic SC for simple in-order processors. In 20th IEEE International Symposium on High Performance Computer Architecture, 2014. DOI: 10.1109/hpca.2014.6835950. 32

借助编译器或 MMU 的帮助
A. Singh, S. Narayanasamy, D. Marino, T.D Millstein, and M. Musuvathi. End-to-end sequential consistency. In 39th International Symposium on Computer Architecture, 2012. DOI: 10.1109/isca.2012.6237045. 32

SC 下的原子操作

原子操作用于多线程程序的同步，实现自旋锁或同步原语。

常用的原子指令为 read-modify-write(RMW) 包括：

test-and-set
fetch-and-increment
compare-and-swap

在 SC 模型中， RMW 指令的 load 和 store 在 memory-order 上必须连续出现

原子操作的实现

naive 实现: 执行原子操作时锁住 memory system ，不允许其他核进行访存操作
- 实现正确，直观，但牺牲了性能
激进实现: 一个核将其 cache 中的某个 block 获取到 M 状态后对其进行连续的 load 和 store，且在 store 完成以前该 block 不能响应其他核的 coherence 请求
优化实现：一个核的 cache 如果有一个只读状态的 block ，先预测地 load 一次，再获取到该 block 的 M 状态时，执行 store ，并在 load 和 store 之间检查该 block 是否被踢出 cache
- 如果被踢出，需要 squash 之前预测执行的 load

实例： MIPS R10000

实现了 directory coherence protocol: MESI
系统总线最多连接 4 个核

执行

以 program order 将 load / store 放入 address queue ，并预测发射执行
如果 address queue 中地址匹配， load 可以从 store 获得前递数据，否则从 dcache 中获取
如果 cache 中和 load 相关的 block 因为 coherence 失效或者冲突失效而被踢出时， load 之后的所有指令全部需要 squash 并重新执行

xixi's blog

Sequential Consistency (SC)

定义