This series of articles is my study notes for the MIT 6.824 course.
FaRM writes go to RAM, not disk, which eliminates a huge bottleneck. A write to RAM takes about 200 ns, versus about 10 ms for a hard drive and about 100 us for an SSD. But RAM loses its contents in a power failure, so it is not persistent by itself.
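To put the latencies above side by side, here is a small arithmetic sketch. The three latencies are the approximate figures quoted in these notes; the function name `slowdown_vs_ram` is just an illustrative helper.

```python
# Approximate write latencies quoted in the notes, in nanoseconds.
RAM_NS = 200
SSD_NS = 100_000       # 100 us
HDD_NS = 10_000_000    # 10 ms


def slowdown_vs_ram(device_ns: int) -> int:
    """How many RAM writes fit in the time of one write to the device."""
    return device_ns // RAM_NS


ssd_slowdown = slowdown_vs_ram(SSD_NS)
hdd_slowdown = slowdown_vs_ram(HDD_NS)
print(f"SSD write ~{ssd_slowdown}x slower than RAM, hard drive ~{hdd_slowdown}x")
```

With these numbers an SSD write costs roughly 500 RAM writes and a hard drive write roughly 50,000, which is why keeping writes in RAM matters so much.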
Why not just write to the RAM of f+1 machines, to tolerate f failures? That might be enough if failures were always independent, but power failure is not independent: it may strike 100% of the machines!
So there are batteries in every rack, which can run the machines for a few minutes, long enough for the software to save RAM contents to SSD: “non-volatile RAM”. What if a crash prevents the s/w from writing to SSD, e.g. a bug in FaRM or the kernel, or a cpu/memory/hardware error? FaRM copes with single-machine crashes by copying data from the RAM of the crashed machine’s surviving replicas to other machines, so that there are always f+1 copies. Crashes (other than power failure) must be independent!
why is the network often a performance bottleneck?
the usual setup:

    app                         app
    ---                         ---
    socket buffers              buffers
    TCP                         TCP
    NIC driver                  driver
    NIC  ---------------------- NIC
lots of expensive CPU operations:
- system calls
- copy messages
And all of this happens twice for an RPC (once for the request, once for the reply)! It’s hard to build an RPC system that can deliver more than a few 100,000 RPCs / second. Wire b/w (e.g. 10 gigabits/second) is rarely the limit for short RPCs. These per-packet CPU costs are the limiting factor for small messages.
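A rough back-of-the-envelope check of the claim that wire bandwidth is rarely the limit. The 10 gigabits/second link comes from the notes; the RPC rate of 300,000 per second and the 100-byte message size are illustrative assumptions, and protocol headers are ignored.

```python
# Illustrative assumptions, not figures from the notes:
RPCS_PER_SECOND = 300_000   # one reading of "a few 100,000 / second"
BYTES_PER_MESSAGE = 100     # hypothetical small request or reply
MESSAGES_PER_RPC = 2        # request + reply

# From the notes: a 10 gigabits/second link.
LINK_BITS_PER_SECOND = 10_000_000_000


def wire_bits_per_second(rpcs: int, bytes_per_msg: int,
                         msgs_per_rpc: int = MESSAGES_PER_RPC) -> int:
    """Payload bits placed on the wire per second (headers ignored)."""
    return rpcs * msgs_per_rpc * bytes_per_msg * 8


used = wire_bits_per_second(RPCS_PER_SECOND, BYTES_PER_MESSAGE)
fraction = used / LINK_BITS_PER_SECOND
print(f"{used / 1e6:.0f} Mbit/s used, {fraction:.1%} of the link")
```

Under these assumptions the RPCs use under 5% of the link, so the CPU work per packet, not the wire, is what caps throughput.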
Two classes of concurrency control for transactions:
pessimistic: wait for a lock on first use of an object; hold it until commit/abort. This is called two-phase locking. Conflicts cause delays.
optimistic: access object without locking; commit “validates” to see if OK.
- Valid: do the writes
- Invalid: abort

This approach is called Optimistic Concurrency Control (OCC).
FaRM uses OCC. The reason: OCC lets FaRM read using one-sided RDMA reads. Because reads take no locks under OCC, the server storing the object does not need to do any work, such as setting a lock, when an object is read.
FaRM transaction API (simplified):
    txCreate()
    o = txRead(oid)    -- RDMA
    o.f += 1
    txWrite(oid, o)    -- purely local
    ok = txCommit()    -- Figure 4
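To see how this API behaves under OCC, here is a minimal single-process sketch in Python. It is not FaRM's implementation: the `STORE` dictionary stands in for remote primaries, there is no RDMA, no locking and no replication, and the snake_case function names and the `Tx` and `StoredObject` classes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    value: dict
    version: int = 0


# Hypothetical in-process store standing in for the primaries' memory.
STORE = {"x": StoredObject({"f": 0})}


@dataclass
class Tx:
    read_versions: dict = field(default_factory=dict)  # oid -> version seen
    writes: dict = field(default_factory=dict)          # oid -> new value


def tx_create() -> Tx:
    return Tx()


def tx_read(tx: Tx, oid: str) -> dict:
    obj = STORE[oid]                      # stands in for a one-sided RDMA read
    tx.read_versions[oid] = obj.version   # remember the version for validation
    return dict(obj.value)                # private copy for this transaction


def tx_write(tx: Tx, oid: str, value: dict) -> None:
    if oid not in tx.read_versions:
        raise ValueError("txWrite must be preceded by txRead")
    tx.writes[oid] = value                # purely local, no communication


def tx_commit(tx: Tx) -> bool:
    # Validate: every written object must still be at the version we read.
    if any(STORE[oid].version != tx.read_versions[oid] for oid in tx.writes):
        return False                      # invalid: abort
    for oid, value in tx.writes.items():  # valid: do the writes
        STORE[oid].value = value
        STORE[oid].version += 1
    return True


# Two transactions race to increment the same field.
t1, t2 = tx_create(), tx_create()
o1, o2 = tx_read(t1, "x"), tx_read(t2, "x")
o1["f"] += 1
o2["f"] += 1
tx_write(t1, "x", o1)
tx_write(t2, "x", o2)
ok1, ok2 = tx_commit(t1), tx_commit(t2)
print(ok1, ok2, STORE["x"].value)
```

The first transaction commits and bumps the version; the second fails validation because the version it read is now stale, so the increment is applied only once.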
What’s in an oid: <region #, address>.
region # indexes a mapping to [ primary, backup1, … ]. Target NIC can use address directly to read or write RAM so target CPU doesn’t have to be involved.
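The oid lookup can be sketched as follows. The machine names and the contents of `REGION_MAP` are made up for illustration; only the shape (region # maps to a list whose first entry is the primary) comes from the notes.

```python
from typing import NamedTuple


class Oid(NamedTuple):
    region: int    # region #
    address: int   # address of the object within the region


# Hypothetical mapping: region # -> [primary, backup1, ...]
REGION_MAP = {
    0: ["machine-a", "machine-b", "machine-c"],
    1: ["machine-b", "machine-c", "machine-a"],
}


def primary_of(oid: Oid) -> str:
    return REGION_MAP[oid.region][0]


def backups_of(oid: Oid) -> list:
    return REGION_MAP[oid.region][1:]


def rdma_read_target(oid: Oid) -> tuple:
    """Where a one-sided read would go: (machine, address in its RAM)."""
    return (primary_of(oid), oid.address)


oid = Oid(region=1, address=0x40)
print(rdma_read_target(oid), backups_of(oid))
```

Because the address is carried inside the oid, the reader needs only the region mapping to decide which NIC to target; nothing has to be looked up on the target machine.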
Server memory layout: regions, each an array of objects.
Object layout: header with version # and lock.
Every region is replicated on one primary and f backups, for f+1 replicas. Only the primary serves reads; all f+1 replicas see commits and writes. Replication yields availability if there are <= f failures, i.e. the region is available as long as one replica stays alive. That takes fewer replicas than Raft, which needs 2f+1 to tolerate f failures!
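The comparison with Raft comes down to replica counts. Raft needs a majority of servers alive, so tolerating f failures takes 2f+1 servers, while the scheme described here needs f+1. A tiny sketch (the function names are illustrative):

```python
def farm_replicas(f: int) -> int:
    """One primary plus f backups."""
    return f + 1


def raft_replicas(f: int) -> int:
    """Raft needs a majority alive, so 2f+1 servers tolerate f failures."""
    return 2 * f + 1


for f in (1, 2, 3):
    print(f"f={f}: FaRM region needs {farm_replicas(f)} replicas, "
          f"Raft group needs {raft_replicas(f)}")
```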
txRead: a one-sided RDMA read fetches the object directly from the primary’s memory, which is fast! It also fetches the object’s version number, to detect concurrent writes.
txWrite: must be preceded by txRead. It just writes the local copy; no communication.
transaction execution / commit protocol without failure – Figure 4
LOCK (first message in commit protocol)
- TC sends to primary of each written object
- TC uses RDMA to append to its log at each primary
- LOCK record contains oid, version # xaction read, new value
- primary s/w polls log, sees LOCK, validates, sends "yes" or "no" reply message
- note: LOCK is both logged in primary's NVRAM *and* an RPC exchange
what does primary CPU do on receipt of LOCK?
(for each object)
- if object locked, or version != what xaction read, reply “no”
- otherwise set the lock flag and return “yes”
- note: does not block if object is already locked
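The primary's check for a single object can be sketched as below. `ObjectHeader` and `handle_lock` are hypothetical names, and this single-threaded sketch ignores the fact that a real primary would have to make the check and the set atomic.

```python
from dataclasses import dataclass


@dataclass
class ObjectHeader:
    version: int
    locked: bool = False


def handle_lock(header: ObjectHeader, version_read: int) -> bool:
    """Return True for a "yes" reply, False for "no". Never blocks."""
    if header.locked or header.version != version_read:
        return False          # already locked, or changed since it was read
    header.locked = True      # otherwise take the lock
    return True


h = ObjectHeader(version=3)
first = handle_lock(h, version_read=3)    # fresh object, matching version
second = handle_lock(h, version_read=3)   # now locked by the first caller

stale = ObjectHeader(version=4)
third = handle_lock(stale, version_read=3)  # version moved on since the read
print(first, second, third)
```

Replying "no" instead of waiting means a conflicting transaction aborts quickly and can retry, rather than queueing behind the lock holder.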
TC waits for all LOCK reply messages
- if any “no”, abort:
  - send ABORT to primaries so they can release locks
  - return “no” from txCommit()
TC sends COMMIT-PRIMARY to primary of each written object
- uses RDMA to append to primary’s log
- TC only waits for hardware ack; does not wait for primary to process log entry
- TC returns “yes” from txCommit()
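The coordinator's side of the steps so far can be sketched as a toy simulation. `FakePrimary` is a hypothetical stand-in whose vote is fixed in advance, a Python list plays the role of the primary's log, and modelling ABORT as another log append is an assumption (the notes only say ABORT is sent).

```python
class FakePrimary:
    """Hypothetical stand-in for a primary; its LOCK vote is preset."""

    def __init__(self, name: str, vote: bool):
        self.name = name
        self.vote = vote
        self.log = []                 # records appended by the TC

    def append(self, record: str) -> None:
        # Returning from this call models the NIC's hardware ack:
        # the record is in the log, but has not been processed yet.
        self.log.append(record)

    def lock_reply(self) -> bool:
        return self.vote


def tc_commit(primaries: list) -> bool:
    for p in primaries:
        p.append("LOCK")
    votes = [p.lock_reply() for p in primaries]   # wait for all replies
    if not all(votes):
        for p in primaries:
            p.append("ABORT")                     # let primaries release locks
        return False                              # txCommit() returns "no"
    for p in primaries:
        p.append("COMMIT-PRIMARY")                # wait only for the ack
    return True                                   # txCommit() returns "yes"


good = [FakePrimary("p1", True), FakePrimary("p2", True)]
bad = [FakePrimary("p1", True), FakePrimary("p2", False)]
print(tc_commit(good), tc_commit(bad))
```

Note that `tc_commit` returns as soon as the COMMIT-PRIMARY records are appended; nothing in it waits for a primary to apply them.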
what does primary do when it processes the COMMIT-PRIMARY in its log?
- copy new value over object’s memory
- increment object’s version #
- clear object’s lock flag
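Those three steps, plus the lock release on ABORT mentioned earlier, can be sketched as follows. `StoredObject`, `apply_commit_primary` and `release_on_abort` are hypothetical names, and the object is assumed to have been locked by an earlier LOCK record.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    value: object
    version: int
    locked: bool


def apply_commit_primary(obj: StoredObject, new_value: object) -> None:
    obj.value = new_value   # copy new value over the object's memory
    obj.version += 1        # increment the object's version #
    obj.locked = False      # clear the object's lock flag


def release_on_abort(obj: StoredObject) -> None:
    obj.locked = False      # value and version stay untouched


committed = StoredObject(value=10, version=7, locked=True)
apply_commit_primary(committed, new_value=11)

aborted = StoredObject(value=10, version=7, locked=True)
release_on_abort(aborted)
print(committed, aborted)
```

The version bump is what makes any other transaction that read the old value fail its check when it later tries to lock the object.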