
MIT 6.824 Study Notes (5)

This series of posts contains my study notes for the MIT 6.824 course.

FaRM Notes

FaRM writes go to RAM, not disk, which eliminates a huge bottleneck. A RAM write takes about 200 ns, versus roughly 100 us for an SSD and 10 ms for a hard drive. But RAM loses its contents on power failure, so it is not persistent by itself.

Is it enough to just write to the RAM of f+1 machines to tolerate f failures? That might suffice if failures were always independent, but a power failure is not independent: it may strike 100% of the machines at once!

So FaRM puts batteries in every rack that can run the machines for a few minutes: "non-volatile RAM" (NVRAM). What if a crash prevents the software from writing to SSD, e.g. a bug in FaRM or the kernel, or a CPU/memory/hardware error? FaRM copes with single-machine crashes by copying data from the RAM of the crashed machine's replicas to other machines, to ensure there are always f+1 copies. Crashes (other than power failure) must be independent!

Why is the network often a performance bottleneck?

    the usual setup:
    app                       app
    ---                       ---
    socket buffers            buffers
    TCP                       TCP
    NIC driver                driver
    NIC  -------------------- NIC

lots of expensive CPU operations:

  • system calls
  • copy messages
  • interrupts

and all of this happens twice per RPC! It's hard to build an RPC system that can deliver more than a few 100,000 requests per second. Wire bandwidth (e.g. 10 gigabits/second) is rarely the limit for short RPCs; these per-packet CPU costs are the limiting factor for small messages.
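A back-of-the-envelope calculation makes the gap concrete. The message size and CPU-bound rate below are illustrative assumptions (the note only says "a few 100,000 / second"), not figures from the FaRM paper:

```python
# Compare what a 10 Gbit/s wire could carry for small messages against
# what a CPU-bound RPC stack can actually process.

WIRE_BITS_PER_SEC = 10e9   # 10 gigabit Ethernet
MSG_BYTES = 100            # assumed size of a short RPC, headers included

wire_limit = WIRE_BITS_PER_SEC / (MSG_BYTES * 8)  # messages/sec the wire allows
cpu_limit = 300_000                               # assumed CPU-bound RPC rate

print(f"wire allows ~{wire_limit:,.0f} msgs/sec")            # ~12,500,000
print(f"CPU-bound RPC stack: ~{cpu_limit:,} msgs/sec")
print(f"wire capacity is ~{wire_limit / cpu_limit:.0f}x the CPU-bound rate")
```

So for 100-byte messages the wire could carry roughly 40x more than the CPU can handle, which is why FaRM bypasses the kernel and the remote CPU with RDMA.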

Two classes of concurrency control for transactions:

  • pessimistic: wait for a lock on first use of an object; hold locks until commit/abort. This is called two-phase locking. Conflicts cause delays.

  • optimistic: access objects without locking; commit "validates" to see if the accesses were OK.

    • Valid: do the writes
    • Invalid: abort

    This is called Optimistic Concurrency Control (OCC).

FaRM uses OCC. The reason: OCC lets FaRM read objects using one-sided RDMA reads; because of OCC, the server storing the object does not need to set a lock, or involve its CPU at all.
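The core OCC idea can be shown in a minimal single-process sketch: each object carries a version number, a transaction remembers the version it read, and commit succeeds only if nothing it touched has changed (or is locked) since. The `Obj`/`Tx` names are invented for illustration; this is not FaRM's API:

```python
class Obj:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.locked = False

class Tx:
    def __init__(self):
        self.reads = {}    # obj -> version seen at read time
        self.writes = {}   # obj -> new value, buffered locally

    def read(self, obj):
        self.reads[obj] = obj.version
        return obj.value

    def write(self, obj, value):
        self.writes[obj] = value

    def commit(self):
        # validate: every object we read is unlocked and unchanged
        for obj, seen in self.reads.items():
            if obj.locked or obj.version != seen:
                return False          # invalid -> abort
        for obj, value in self.writes.items():
            obj.value = value
            obj.version += 1          # stale readers will now fail validation
        return True

x = Obj(10)
t1, t2 = Tx(), Tx()
t1.write(x, t1.read(x) + 1)
t2.write(x, t2.read(x) + 1)
print(t1.commit())   # True: first committer wins
print(t2.commit())   # False: t2's read is now stale, so it aborts
print(x.value)       # 11
```

The conflicting transaction simply aborts and can retry; no one ever waits on a lock during reads.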

FaRM transaction API (simplified):

  txCreate()
  o = txRead(oid)  -- RDMA
  o.f += 1
  txWrite(oid, o)  -- purely local
  ok = txCommit()  -- Figure 4

What's in an oid: <region #, address>. The region # indexes a mapping to [ primary, backup1, … ]. The target NIC can use the address directly to read or write RAM, so the target CPU doesn't have to be involved.
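A toy version of this addressing scheme packs the two oid fields into one 64-bit integer and looks up the replica list by region number. The 16/48 bit split and the server names are assumptions for illustration, not FaRM's actual layout:

```python
REGION_BITS = 16   # high bits: region number (assumed split)
ADDR_BITS = 48     # low bits: address within the region

def make_oid(region, addr):
    assert region < (1 << REGION_BITS) and addr < (1 << ADDR_BITS)
    return (region << ADDR_BITS) | addr

def split_oid(oid):
    return oid >> ADDR_BITS, oid & ((1 << ADDR_BITS) - 1)

# region # -> [primary, backup1, ...]; hostnames are made up
region_map = {7: ["serverA", "serverB", "serverC"]}

oid = make_oid(7, 0x1000)
region, addr = split_oid(oid)
primary = region_map[region][0]
print(region, hex(addr), primary)   # 7 0x1000 serverA
```

The point of carrying a raw address is that the client's NIC can issue an RDMA read at that address on the primary without any software on the primary running.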

Server memory layout: regions, each an array of objects. Object layout: a header with a version # and a lock flag.
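Keeping the lock next to the version matters because the primary can then test-and-set both with a single atomic operation during LOCK processing. A sketch of such a header, with the lock flag packed into the top bit of a 64-bit version word (the bit position is an assumption for illustration):

```python
LOCK_BIT = 1 << 63

def header(version, locked=False):
    return version | (LOCK_BIT if locked else 0)

def is_locked(h):
    return bool(h & LOCK_BIT)

def version_of(h):
    return h & ~LOCK_BIT

h = header(42, locked=True)
print(is_locked(h), version_of(h))   # True 42
```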

Every region is replicated on one primary and f backups: f+1 replicas. Only the primary serves reads; all f+1 replicas see commits and writes. Replication yields availability with <= f failures, i.e. the region stays available as long as one replica is alive; better than Raft!

  • txRead: a one-sided RDMA read fetches the object directly from the primary's memory – fast! It also fetches the object's version number, to detect concurrent writes.

  • txWrite: must be preceded by txRead; just writes a local copy, with no communication.

  • transaction execution / commit protocol without failure – Figure 4

    • LOCK (first message in commit protocol)

      TC sends LOCK to the primary of each written object
      TC uses RDMA to append the LOCK record to its log at each primary
      the LOCK record contains the oid, the version # the transaction read, and the new value
      primary software polls its log, sees the LOCK, validates, and sends a "yes" or "no" reply message
      note: LOCK is both logged in the primary's NVRAM *and* part of an RPC exchange
      
  • what does primary CPU do on receipt of LOCK?

    (for each object) if the object is locked, or its version != what the transaction read, reply "no"; otherwise set the lock flag and reply "yes". Note: the primary does not block if the object is already locked.

  • TC waits for all LOCK reply messages

    If any reply is "no": abort. The TC sends ABORT to the primaries so they can release their locks, and returns "no" from txCommit().

  • TC sends COMMIT-PRIMARY to primary of each written object

    The TC uses RDMA to append to the primary's log. The TC waits only for the hardware ack – it does not wait for the primary to process the log entry – and then returns "yes" from txCommit().

  • what does primary do when it processes the COMMIT-PRIMARY in its log?

    copy the new value over the object's memory
    increment the object's version #
    clear the object's lock flag
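The failure-free commit path above (Figure 4) can be sketched as a single-process simulation: the TC sends LOCK for each written object, each primary validates the version and lock state without blocking, and on all-"yes" the TC sends COMMIT-PRIMARY to install the writes. RDMA log appends and message passing are simulated with direct function calls; all names are invented:

```python
class Obj:
    def __init__(self, value):
        self.value, self.version, self.locked = value, 0, False

def primary_on_lock(obj, version_read):
    # reply "no" without blocking if locked or stale; otherwise take the lock
    if obj.locked or obj.version != version_read:
        return False
    obj.locked = True
    return True

def primary_on_commit(obj, new_value):
    # install the write: copy value, bump version, release the lock
    obj.value = new_value
    obj.version += 1
    obj.locked = False

def tc_commit(writes):
    """writes: list of (obj, version_read, new_value)."""
    locked = []
    for obj, version_read, _ in writes:          # LOCK phase
        if primary_on_lock(obj, version_read):
            locked.append(obj)
        else:
            for o in locked:                     # ABORT: release locks taken so far
                o.locked = False
            return False
    for obj, _, new_value in writes:             # COMMIT-PRIMARY phase
        primary_on_commit(obj, new_value)
    return True

a = Obj(1)
print(tc_commit([(a, 0, 2)]))   # True: version matched, commit succeeds
print(tc_commit([(a, 0, 3)]))   # False: version is now stale, abort
print(a.value, a.version)       # 2 1
```

This omits backups, NVRAM logging, COMMIT-BACKUP, and truncation; it is only the validation and locking logic of the happy path.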
