Bài 35 / 66~25 phútThuật toán phân tánMiễn phí… lượt xem

Raft — đồng thuận leader & log replication

Q: Mô tả quá trình AppendEntries từ khi client gửi lệnh đến khi client nhận phản hồi. Đa số ở đây nghĩa là gì?

1. Client gửi lệnh (ví dụ 'SET x=5' ) tới leader. 2. Leader tạo log entry '(term, index, command)' , append vào log cục bộ. 3. Leader gửi 'AppendEntries' song song tới tất cả follower. 4. Mỗi follower kiểm tra consistency (prevLogIndex/prevLogTerm), append vào log, trả 'success' . 5. Khi leader nhận 'success' từ đủ đa số (bao gồm chính leader) — với 5 node thì cần 3 success — leader tăng 'commitIndex' , apply command vào state machine. 6. Leader trả lời client OK. 7. Trong AppendEntries tiếp theo (hoặc ngay lần này nếu có lệnh mới), leader gửi 'leaderCommit' tới follower → follower cũng apply. Đa số = 'floor(n/2) + 1' node, bao gồm leader. Với 5 node: 3. Với 3 node: 2. Lý do: hai tập đa số luôn overlap ít nhất 1 node — đảm bảo không có hai leader cùng commit khác nhau tại cùng index.

Thuật toán consensus dễ hiểu: bầu leader, nhân bản log, commit khi đa số xác nhận. Nền của etcd, Consul, TiKV.

TL;DR: Raft giải quyết bài toán consensus — nhiều node thống nhất một chuỗi lệnh (replicated log) dù có node chết hay mạng phân vùng. Raft chia thành 3 phần tách biệt: (1) leader election — dùng randomized timeout + term để bầu một leader duy nhất; (2) log replication — leader nhận lệnh từ client, ghi vào log rồi replicate sang follower, commit khi đa số (quorum) xác nhận; (3) safety — chỉ bầu node có log mới nhất, đảm bảo không bao giờ mất entry đã commit. Raft được thiết kế để dễ hiểu hơn Paxos — tách rõ từng phần, không có trường hợp đặc biệt ẩn. Nền tảng của etcd, Consul, TiKV, CockroachDB.

Năm 2013, Ongaro và Ousterhout đặt câu hỏi: tại sao không ai thật sự hiểu Paxos đủ để implement đúng mà không có bug? Họ thiết kế Raft từ đầu với nguyên tắc understandability: mọi quyết định thiết kế đều ưu tiên dễ hiểu hơn hiệu suất tối đa. Kết quả là một thuật toán có thể giải thích đầy đủ trong 30 trang — và đã được implement thành công trong hàng trăm hệ thống production.

1. Bài toán consensus — tại sao cần

Replicated state machine: giả sử bạn muốn một database chịu được node failure. Cách tiếp cận: chạy cùng database trên 5 node, mọi lệnh ghi (write) phải được áp dụng theo cùng thứ tự trên tất cả node. Nếu node 1 áp dụng lệnh A rồi B, node 2 áp dụng B rồi A → state khác nhau → inconsistent.

Consensus là đảm bảo tất cả node thống nhất về cùng một chuỗi lệnh (replicated log), dù có node chết giữa chừng hay mạng bị phân vùng tạm thời. Yêu cầu cốt lõi:

Safety: không bao giờ trả về kết quả sai — nếu hai node quyết định thì quyết định phải giống nhau.
Liveness: cuối cùng sẽ tiến được (không block mãi), miễn là đa số node còn hoạt động.

Raft đảm bảo safety tuyệt đối và liveness khi mạng ổn định (không có split-vote liên tục).

2. Các khái niệm cốt lõi

Term: đơn vị thời gian logic trong Raft — mỗi election tăng term lên 1. Term đóng vai trò như "thế hệ leader". Mọi message đều mang term; node nhận message từ term cũ hơn → bỏ qua.

Trạng thái node: mỗi node trong Raft luôn ở một trong ba trạng thái:

stateDiagram-v2
    [*] --> Follower: "khởi động"
    Follower --> Candidate: "election timeout"
    Candidate --> Leader: "nhận đa số vote"
    Candidate --> Follower: "nhận AppendEntries từ leader hợp lệ"
    Candidate --> Candidate: "election timeout mới (split vote)"
    Leader --> Follower: "phát hiện term cao hơn"

Follower: trạng thái mặc định. Nhận AppendEntries từ leader, vote cho candidate. Nếu không nhận heartbeat trong election_timeout → chuyển sang Candidate.
Candidate: đang xin vote. Gửi RequestVote tới tất cả node. Nếu nhận đa số → thành Leader.
Leader: nhận lệnh từ client, replicate sang follower, quyết định commit. Gửi heartbeat (AppendEntries rỗng) định kỳ để follower không timeout.

Log entry: mỗi entry gồm (term, index, command). index là vị trí trong log (1-based). term là term khi entry được tạo.

Commit index: vị trí log cao nhất đã được commit (đa số xác nhận). Chỉ entry đã commit mới được apply vào state machine.

3. Leader election

3.1 Randomized timeout — tránh split vote

Mỗi follower có election_timeout ngẫu nhiên trong khoảng [150ms, 300ms]. Node nào hết timeout trước → khởi động election, tăng term, tự vote cho mình, gửi RequestVote.

Tại sao ngẫu nhiên? Nếu timeout cố định, tất cả follower timeout cùng lúc → tất cả đều thành candidate → split vote (không ai đủ đa số). Timeout ngẫu nhiên đảm bảo trong phần lớn trường hợp chỉ một node timeout trước, kịp thu đủ vote trước khi các node khác timeout.

-- Tại mỗi follower, chạy liên tục
function election_timer(self):
    while true:
        đặt timeout T <- ngẫu nhiên trong [150ms, 300ms]
        chờ T milliseconds
        if không nhận heartbeat trong T:
            khởi_động_election(self)

function khởi_động_election(self):
    self.currentTerm <- self.currentTerm + 1
    self.state <- CANDIDATE
    self.votedFor <- self.nodeId        -- tự vote cho mình
    votes_received <- 1
    for each peer trong cluster:
        gửi RequestVote(term=self.currentTerm,
                        candidateId=self.nodeId,
                        lastLogIndex=self.log.lastIndex(),
                        lastLogTerm=self.log.lastTerm())

3.2 RequestVote — election restriction

function on_RequestVote(self, req):
    -- Từ chối nếu term cũ hơn
    if req.term < self.currentTerm:
        return VoteResponse(granted=false, term=self.currentTerm)

    -- Cập nhật term nếu nhận term mới hơn
    if req.term > self.currentTerm:
        self.currentTerm <- req.term
        self.state <- FOLLOWER
        self.votedFor <- null

    -- Chỉ vote nếu chưa vote trong term này VÀ log của candidate ít nhất mới bằng mình
    already_voted <- self.votedFor != null and self.votedFor != req.candidateId
    log_ok <- (req.lastLogTerm > self.log.lastTerm()) or
              (req.lastLogTerm = self.log.lastTerm() and
               req.lastLogIndex >= self.log.lastIndex())

    if not already_voted and log_ok:
        self.votedFor <- req.candidateId
        reset_election_timer(self)
        return VoteResponse(granted=true, term=self.currentTerm)
    else:
        return VoteResponse(granted=false, term=self.currentTerm)
// Time: O(n) message per election  Space: O(1) per node

Election restriction là điểm mấu chốt của safety: node chỉ vote cho candidate có log "ít nhất mới bằng" mình. "Mới hơn" nghĩa là: lastLogTerm lớn hơn, hoặc bằng term nhưng lastLogIndex lớn hơn hoặc bằng. Điều này đảm bảo leader mới luôn có tất cả entry đã committed — node nào có log cũ hơn sẽ không bao giờ được bầu.

4. Log replication

Sau khi được bầu, leader nhận lệnh từ client và replicate sang follower theo giao thức AppendEntries.

sequenceDiagram
    participant C as "Client"
    participant L as "Leader"
    participant F1 as "Follower 1"
    participant F2 as "Follower 2"
    participant F3 as "Follower 3"

    C->>L: "lệnh x=5"
    L->>L: "append (term=2, idx=4, x=5) vào log"
    L->>F1: "AppendEntries(entries=[...])"
    L->>F2: "AppendEntries(entries=[...])"
    L->>F3: "AppendEntries(entries=[...])"
    F1-->>L: "success"
    F2-->>L: "success"
    Note over L: "đa số (3/5) xác nhận → commit idx=4"
    L->>L: "apply x=5 vào state machine"
    L-->>C: "OK"
    L->>F1: "AppendEntries tiếp theo thông báo commitIndex=4"
    L->>F2: "AppendEntries tiếp theo thông báo commitIndex=4"
    L->>F3: "AppendEntries tiếp theo thông báo commitIndex=4"
    F1->>F1: "apply x=5"
    F2->>F2: "apply x=5"
    F3->>F3: "apply x=5"

-- Leader nhận lệnh từ client
function on_client_request(self, command):
    entry <- { term: self.currentTerm, index: self.log.nextIndex(), command: command }
    self.log.append(entry)
    broadcast AppendEntries(entry) tới tất cả follower
    chờ đến khi đa số follower xác nhận entry.index
    self.commitIndex <- entry.index
    apply command vào state machine
    trả lời client

-- AppendEntries handler tại follower
function on_AppendEntries(self, req):
    if req.term < self.currentTerm:
        return AppendResponse(success=false, term=self.currentTerm)

    -- term lớn hơn → cập nhật term và lùi về FOLLOWER (đặc tả an toàn Raft)
    if req.term > self.currentTerm:
        self.currentTerm <- req.term
        self.state <- FOLLOWER
        self.votedFor <- null

    reset_election_timer(self)   -- nhận heartbeat từ leader hợp lệ

    -- Log consistency check: entry trước phải khớp
    if req.prevLogIndex > 0:
        if self.log.length < req.prevLogIndex or
           self.log[req.prevLogIndex].term != req.prevLogTerm:
            return AppendResponse(success=false, term=self.currentTerm)

    -- Ghi entries vào log (xóa conflict nếu có)
    for each entry trong req.entries:
        if self.log[entry.index] exists and self.log[entry.index].term != entry.term:
            self.log.truncate(entry.index)  -- xóa từ đây về sau
        self.log.append(entry)

    -- Cập nhật commitIndex
    if req.leaderCommit > self.commitIndex:
        self.commitIndex <- min(req.leaderCommit, self.log.lastIndex())
        apply tất cả entry tới commitIndex vào state machine

    return AppendResponse(success=true, term=self.currentTerm)
// Time: O(n * log_entries) per round  Space: O(log_size) per node

4.1 Log matching property

Raft duy trì bất biến quan trọng: nếu hai log có cùng index và term tại một entry, thì mọi entry trước đó cũng giống nhau. Lý do: leader chỉ tạo một entry duy nhất tại mỗi (index, term), và entries không bao giờ được sửa sau khi ghi (chỉ truncate khi follower phát hiện conflict với leader).

prevLogIndex + prevLogTerm trong AppendEntries là "chứng minh" entry trước khớp — follower từ chối nếu không khớp, leader tự động giảm nextIndex[follower] để thử lại từ entry trước đó.

5. Safety — election restriction phòng mất data

Tình huống nguy hiểm: leader cũ (term 1) commit entry tại index 5. Sau đó leader cũ crash. Node nào được bầu làm leader mới?

Nếu không có election restriction: node với log ngắn hơn có thể được bầu → log của nó không có entry index 5 → leader mới có thể ghi đè index 5 bằng entry khác → mất data đã committed.

Election restriction ngăn điều này: node muốn được bầu phải có log "ít nhất mới bằng" đa số. Entry đã committed được đa số xác nhận → đa số có entry đó. Muốn được đa số vote → phải có log mới bằng đa số → phải có entry đó. Vậy leader mới luôn có mọi entry đã committed.

5.1 Không commit entry từ term cũ trực tiếp

Một subtlety quan trọng: leader mới không commit entry từ term cũ bằng cách đếm replica. Thay vào đó, leader append một no-op entry (log entry rỗng) ở term hiện tại vào đầu nhiệm kỳ, commit entry đó → tự động kéo theo commit mọi entry trước đó. Cơ chế này tránh một edge case phức tạp trong Raft paper (Figure 8) nơi entry từ term cũ có thể bị overwrite dù đã có đa số replica. Bản chất an toàn (Raft §5.4.2): leader KHÔNG commit entry term cũ chỉ bằng đếm replica; chỉ khi một entry term HIỆN TẠI trên cùng log được commit thì prefix (gồm entry term cũ) mới commit gián tiếp. No-op chỉ là tối ưu để commit prefix sớm, không bắt buộc.

6. Xử lý failure scenarios

Kịch bản	Raft xử lý thế nào
Follower crash	Leader tiếp tục retry AppendEntries vô hạn. Khi follower restart, log sẽ được đồng bộ lại từ `nextIndex[follower]` mà leader duy trì.
Leader crash trước khi commit	Entry chưa commit sẽ không được apply. Leader mới có thể không có entry đó → entry bị mất (đúng — chưa commit nên OK). Client retry sẽ gửi lại lệnh.
Leader crash sau khi commit	Entry đã committed → đa số có entry đó → leader mới (phải thắng election tức là log đủ mới) sẽ có entry đó → không mất.
Network partition	Minority partition (bao gồm old leader) không thể commit gì (không đủ quorum). Majority partition bầu leader mới, tiếp tục hoạt động. Khi partition heal: old leader thấy term cao hơn → tự chuyển thành follower, log của nó bị overwrite.

7. Pitfall

Pitfall 1 — Bỏ qua election restriction khi vote

-- SAI: vote cho bất kỳ candidate nào có term cao hơn
function on_RequestVote_wrong(self, req):
    if req.term > self.currentTerm and self.votedFor = null:
        return VoteResponse(granted=true)

-- DUNG: kiem tra log cua candidate phai moi bang hoac moi hon log cua minh
function on_RequestVote_correct(self, req):
    log_ok <- (req.lastLogTerm > self.log.lastTerm()) or
              (req.lastLogTerm = self.log.lastTerm() and
               req.lastLogIndex >= self.log.lastIndex())
    if not already_voted and log_ok:
        return VoteResponse(granted=true)

Bỏ election restriction cho phép node với log cũ trở thành leader → ghi đè entry đã committed → mất data. Đây là nguồn gốc của nhiều Raft implementation bug trong thực tế.

Pitfall 2 — Commit entry của term cũ trực tiếp

-- SAI: dem replica cua entry tu term cu, commit khi dat da so
-- Vi du: entry index=5, term=1, da co 3/5 replica sau khi leader term=2 restart
-- Leader term=2 thay "a, 3 replica roi, commit luon"
-- Raft paper Figure 8 cho thay day co the sai

-- DUNG: leader moi append no-op entry o term hien tai truoc
-- Khi no-op commit (dat da so o term hien tai), moi entry cu cung duoc commit
-- No-op lan truyen commitIndex xuong follower trong AppendEntries tiep theo

Commit entry của term cũ trực tiếp là sai vì đa số replica tại thời điểm hiện tại không đủ bảo đảm an toàn cho entry cũ — có thể có leader khác (cùng lúc đó) đã ghi đè slot đó ở term khác. Commit qua no-op ở term hiện tại loại bỏ trường hợp này.

Pitfall 3 — Không reset election timer khi nhận RequestVote hợp lệ

-- SAI: chi reset timer khi nhan AppendEntries, khong reset khi grant vote
function on_RequestVote_bad(self, req):
    -- grant vote...
    return VoteResponse(granted=true)
    -- Quen: khong reset election_timer

-- DUNG: reset timer khi grant vote de candidate co du thoi gian thu thap vote
function on_RequestVote_correct(self, req):
    -- grant vote...
    reset_election_timer(self)   -- quan trong!
    return VoteResponse(granted=true)

Không reset timer khi grant vote khiến follower có thể timeout và khởi động election mới ngay khi candidate đang thu thập vote — gây split vote lặp đi lặp lại, cluster không bao giờ bầu được leader.

8. So sánh với Paxos

Thuộc tính	Paxos (Multi-Paxos)	Raft
Độ phức tạp	Rất phức tạp, nhiều trường hợp đặc biệt	Đơn giản hơn, tách biệt 3 phần rõ ràng
Leader	Không bắt buộc có leader duy nhất rõ ràng	Luôn có 1 leader duy nhất tại một thời điểm
Log gaps	Cho phép holes trong log, phức tạp xử lý	Không có holes — log liên tục
Membership change	Không được mô tả trong bài báo gốc	Joint consensus hoặc single-server changes được đặc tả
Implementation difficulty	Cao — Chubby (Google), ZooKeeper mất nhiều năm	Thấp hơn đáng kể — nhiều correct implementation trong 1-2 ngàn dòng code
Ứng dụng	ZooKeeper (ZAB, variant), Chubby	etcd, Consul, TiKV, CockroachDB, RethinkDB

Raft không nhanh hơn Paxos — cùng số round-trip cần thiết. Ưu điểm là correctness dễ đạt hơn: người implement ít có cơ hội bỏ sót edge case vì thuật toán không có edge case ẩn.

9. Ứng dụng thực tế

etcd: key-value store Raft-based, nền tảng của Kubernetes (lưu cluster state). Mọi kubectl apply đều qua etcd Raft cluster. 3 hoặc 5 node etcd là tiêu chuẩn — đủ chịu 1 hoặc 2 node failure.

Consul (HashiCorp): service discovery + configuration. Dùng Raft cho consistent catalog, gossip (SWIM) cho membership — hai lớp kết hợp: gossip scalable, Raft consistent.

TiKV (PingCAP): distributed key-value store nền của TiDB. Raft group per region — mỗi region (96 MB data) có 3-replica Raft group riêng → scale horizontal bằng cách tăng số region.

CockroachDB: distributed SQL. Raft per range (chunk of data). Mọi transaction đều qua Raft consensus → serializable isolation trên nhiều node.

10. Liên hệ các bài khác

Quorum — đọc/ghi đa số: Raft commit khi đa số follower xác nhận — đây chính là quorum (majority quorum). Bài quorum giải thích tại sao đa số (n/2 + 1) đủ để đảm bảo overlap, Raft áp dụng trực tiếp nguyên tắc đó.
Gossip protocol: Gossip đạt eventual consistency trong O(log N) vòng — không đủ mạnh cho consensus. Raft đảm bảo strong consistency nhưng cần round-trip đồng bộ. Consul kết hợp cả hai: gossip cho membership (fast, scalable), Raft cho configuration (consistent).
Vector clock — thứ tự nhân quả: Raft dùng term như một vector clock đơn giản (scalar clock) để phát hiện stale messages. Khác vector clock đa chiều, term trong Raft chỉ cần so sánh số nguyên vì chỉ có 1 leader tại một thời điểm.
Merkle tree — kiểm tra nhất quán: Sau khi log được replicated qua Raft, snapshot của state machine có thể được verify bằng Merkle tree — đặc biệt khi follower cần download snapshot từ leader thay vì replay toàn bộ log.
Case study Cassandra & etcd: Thấy Raft trong production: etcd log compaction, snapshot transfer, cluster membership change (add/remove node) và cách Kubernetes dựa vào etcd cho cluster state.

📚 Deep Dive

📚 Deep Dive — Nguồn gốc & tham khảo

Bài báo gốc:

Ongaro, D. & Ousterhout, J. (2014) — "In Search of an Understandable Consensus Algorithm", USENIX ATC. Bài báo Raft. Extended version (Ongaro PhD thesis) tại raft.github.io — đặc tả đầy đủ nhất bao gồm membership change và log compaction.

Visualization:

Raft Visualization — animation tương tác cho phép crash node, partition network và xem leader election + log replication hoạt động. Bắt buộc xem trước khi implement.

Paxos gốc để so sánh:

Lamport, L. (1998) — "The Part-Time Parliament", ACM TOCS. Khét tiếng khó đọc — Lamport sau đó viết "Paxos Made Simple" (2001) ngắn hơn.
Lamport, L. (2001) — "Paxos Made Simple" — 14 trang, dễ đọc hơn nhiều.

Triển khai tham khảo:

etcd/raft — Go implementation được dùng trong production, code sạch: github.com/etcd-io/raft.
MIT 6.824 Distributed Systems Lab 2 — implement Raft từ đầu bằng Go, bài tập chuẩn để hiểu sâu.

Tóm tắt

Consensus = nhiều node thống nhất cùng chuỗi lệnh (replicated log), chịu được node failure.
Raft tách thành 3 phần độc lập: leader election, log replication, safety.
Leader election: randomized timeout tránh split vote; mỗi term có tối đa 1 leader; election restriction (log mới nhất) đảm bảo leader không mất data.
Log replication: leader append rồi gửi AppendEntries; commit khi đa số xác nhận; commitIndex lan xuống follower trong round tiếp theo.
Safety: chỉ node có log mới nhất mới được bầu; leader không commit entry term cũ trực tiếp mà qua no-op ở term hiện tại.
Raft không nhanh hơn Paxos — ưu điểm là correctness dễ đạt hơn, ít edge case ẩn.
Nền tảng của etcd (Kubernetes), Consul, TiKV, CockroachDB.

Tự kiểm tra

Tại sao election timeout phải ngẫu nhiên thay vì cố định? Điều gì xảy ra nếu timeout cố định?

▸

Nếu timeout cố định, tất cả follower hết timeout gần như cùng lúc → tất cả đều chuyển sang Candidate và gửi RequestVote cùng lúc → mỗi node tự vote cho mình → không ai đủ đa số → split vote. Tình huống lặp lại ở term tiếp theo, cluster không bao giờ bầu được leader.

Timeout ngẫu nhiên trong khoảng [150ms, 300ms] đảm bảo xác suất cao rằng một node timeout trước các node khác đủ sớm để gửi RequestVote và nhận phản hồi trước khi node thứ hai timeout. Node đầu tiên thu đủ đa số vote, trở thành leader, gửi heartbeat → reset timer các follower khác. Khoảng [150ms, 300ms] được chọn để đủ xa nhau nhưng vẫn ngắn hơn nhiều so với round-trip timeout để đảm bảo không có hai node timeout cùng lúc trong phần lớn trường hợp.

Election restriction hoạt động như thế nào và tại sao nó đảm bảo không mất entry đã committed?

▸

Election restriction: node chỉ vote cho candidate có log "ít nhất mới bằng" của mình. "Mới hơn" được so sánh theo thứ tự: trước tiên so lastLogTerm (term của entry cuối cùng), nếu bằng nhau thì so lastLogIndex (độ dài log).

Tại sao đảm bảo không mất committed entry: khi một entry được committed, đã có đa số (quorum) node chứa entry đó trong log. Để được bầu làm leader mới, candidate cần đa số vote — quorum này chắc chắn overlap với quorum đã commit entry (vì hai tập đa số trong cùng cluster luôn có ít nhất 1 node chung). Node chung đó sẽ so sánh log và chỉ vote cho candidate có log ít nhất mới bằng mình (bao gồm entry đã committed). Vậy chỉ candidate có entry đó mới được đủ vote.

Mô tả quá trình AppendEntries từ khi client gửi lệnh đến khi client nhận phản hồi. Đa số ở đây nghĩa là gì?

▸

1. Client gửi lệnh (ví dụ SET x=5) tới leader. 2. Leader tạo log entry (term, index, command), append vào log cục bộ. 3. Leader gửi AppendEntries song song tới tất cả follower. 4. Mỗi follower kiểm tra consistency (prevLogIndex/prevLogTerm), append vào log, trả success. 5. Khi leader nhận success từ đủ đa số (bao gồm chính leader) — với 5 node thì cần 3 success — leader tăng commitIndex, apply command vào state machine. 6. Leader trả lời client OK. 7. Trong AppendEntries tiếp theo (hoặc ngay lần này nếu có lệnh mới), leader gửi leaderCommit tới follower → follower cũng apply.

Đa số = floor(n/2) + 1 node, bao gồm leader. Với 5 node: 3. Với 3 node: 2. Lý do: hai tập đa số luôn overlap ít nhất 1 node — đảm bảo không có hai leader cùng commit khác nhau tại cùng index.

Điều gì xảy ra với old leader khi network partition được khắc phục? Tại sao old leader không gây corruption?

▸

Trong partition: minority partition chứa old leader tiếp tục nhận request client nhưng không thể commit (không đủ quorum). Mọi append entries đều "pending" mãi. Majority partition bầu leader mới (term cao hơn) và tiếp tục phục vụ.

Khi partition heal: old leader gửi AppendEntries tới follower ở majority side. Follower trả về term mới hơn term của old leader. Old leader nhận term cao hơn → tự chuyển về Follower, cập nhật currentTerm. Log của old leader (entries pending trong minority partition chưa commit) sẽ bị overwrite khi leader mới gửi AppendEntries với đúng entries.

Không gây corruption vì các entries old leader "commit" trong minority partition thực ra không commit (không đủ quorum) → client chưa nhận OK → client sẽ retry khi kết nối lại. Entries chưa commit bị overwrite là hoàn toàn đúng theo spec Raft.

Tại sao leader không commit entry của term cũ trực tiếp? Mô tả Figure 8 của Raft paper.

▸

Kịch bản: leader S1 (term 2) replicate entry tại index 2 lên S1 và S2 (2/5 node), rồi crash. S5 được bầu làm leader (term 3), ghi entry khác vào index 2. S5 crash. S1 restart, được bầu lại (term 4). S1 thấy entry cũ index 2 đã có 3 replica (S1, S2, S3 sync từ S1). Nếu S1 đếm replica và commit entry term 2 → đó là sai.

Tại sao sai: sau khi S1 commit (theo giả định sai), S5 có thể được bầu lại (log của S5 có thể vẫn valid theo election restriction nếu term 3 mới hơn term 2), ghi đè index 2 → mất data "đã committed". Giải pháp Raft: leader mới append no-op entry ở term 4, commit no-op khi đủ đa số. Việc commit no-op ở term 4 kéo theo commitIndex tăng qua index 2 một cách an toàn — vì đến lúc đó, S5 không thể được bầu nữa (log của S5 cũ hơn đa số vốn đã commit tới index 2+).

So sánh Raft và Paxos: Raft có ưu điểm gì và đánh đổi gì? Khi nào vẫn nên dùng Paxos?

▸

Raft ưu điểm: tách biệt 3 vấn đề (leader election, log replication, safety) → implement đúng dễ hơn. Leader duy nhất rõ ràng tại mỗi thời điểm → không có ambiguity. Log không có holes → đơn giản xử lý. Membership change được đặc tả trong bài báo gốc. Nhiều correct open-source implementation để tham khảo.

Raft đánh đổi: đòi hỏi strong leader (leader là bottleneck throughput trong một số workload). Multi-Paxos linh hoạt hơn: cho phép log holes, nhiều proposer song song → throughput cao hơn trong một số trường hợp. Khi vẫn dùng Paxos-family: ZooKeeper dùng ZAB (Zookeeper Atomic Broadcast, Paxos variant) vì lịch sử, không dễ migrate. Spanner của Google dùng Paxos variant vì optimize cho geo-distributed với latency bounds đặc thù. Raft phù hợp hơn cho hệ thống mới, team muốn implement từ đầu, hoặc ưu tiên correctness over throughput tuning.

Với cụm 5 node Raft, chịu được tối đa bao nhiêu node failure đồng thời? Tại sao không thể chịu được 3 node failure?

▸

Chịu được tối đa 2 node failure. Quorum cần thiết để commit = floor(5/2) + 1 = 3. Nếu 2 node chết, 3 node còn lại đủ quorum → cluster hoạt động bình thường. Nếu 3 node chết, chỉ còn 2 node — không đủ quorum 3 → cluster ngừng nhận write (không thể commit) để đảm bảo safety.

Tại sao không thể chịu 3: giả sử cho phép 2 node commit. Với 5 node bị partition thành 2 nhóm (3 và 2), nhóm 3 node commit entry A, nhóm 2 node commit entry B tại cùng index → inconsistency. Yêu cầu quorum floor(n/2) + 1 đảm bảo bất kỳ hai quorum nào cũng overlap ít nhất 1 node — không thể có hai quorum disjoint cùng commit khác nhau. Đây là lý do số node nên là lẻ (3, 5, 7): với số chẵn, fault tolerance không tăng nhưng cost tăng.

⁂

Bài tiếp theo: Mini-challenge — consistent hashing ring

Bài này có giúp bạn hiểu bản chất không?

Hỏi đáp về bài này

Chưa có câu hỏi

Đặt câu hỏi

Có gì chưa rõ trong bài? Đặt câu hỏi đầu tiên — câu trả lời từ cộng đồng giúp bạn (và người sau).

Đặt câu hỏi đầu tiên

← Bài trước

Gossip protocol — lan truyền trạng thái

Bài tiếp

Mini-challenge — consistent hashing ring có virtual node