The Byzantine Generals Problem paper gave us the number of nodes a system must contain in order to deal with
m faulty nodes (
The original solution assumes the following:
- Every message that is sent is delivered correctly.
- The receiver of a messages knows who sent it.
- The absence of a message can be detected.
In Practical Byzantine Fault Tolerance, Castro and Liskov claim to provide a state-machine replication protocol that survives Byzantine faults in an asynchronous network. The term “state-machine replication” describes generic replicated computation. If you can replicate a state-machine, you can essentially replicate any ordered sequence of operations. You’ll see that a lot in distributed systems literature. Byzantine faults are completely arbitrary faults: a node might crash, restart, send conflicting messages to different nodes, and lie about information. Other nodes have no idea whether a given node has gone rogue or not. By constructing a system resilient to such a wide range of faults, one constructs a reliable system without having to iterate over specific types of crashes.
The system guarantees safety and liveness in the face of
floor((n - 1) / 3) failures. By safety, the authors mean that the system satisies linearizability.
- Linearizability: operations submitted to the system as a whole occur in the order they were submitted. Since requests are submitted to a master replica, the master replica will execute operations in the order (by wall-clock time) that they were received. If a master replica fails, the operations may not occur. This guarantee is difficult to make in situations where requests may be submitted to any replica. A request will either complete or not complete at all. A request will never result in inconsistent state across the network.
By liveness, the authors mean that clients eventually receive replies to their requests. Think of “clients” as nodes interacting with the PBFT system, not included in the system model. In order to circumvent the FLP result, the authors recognize that they can only guarantee liveness in the face of partial synchrony, which is not exactly asynchrony. In an asynchronous system, the network delay and processor speed differences (of nodes) are unknown and unbounded. In a partially synchronous system, the delay and speed differences are unknown, but bounded.
We also assume that each node has a private/public keypair and knows of every other node’s private/public keypair.
The client interacts with the system in the following way:
- Client sends a request to the primary.
- Primary multicasts the request to the backups.
- Replicas execute and send response directly to the client.
- If the client receives
mis the tolerated number of faults (
floor((n - 1) / 3)).
Internally, the algorithm works as follows:
The client sends a request to the leader. The leader sends a pre-prepare message to all backups, which contains:
- A view number: can be thought of as the leader’s “epoch”. It is deterministic: for view
v, the leader is
v mod N, where
Nis the number of nodes in the system.
- A sequence number
- A hash of the request
- The client’s signature
- The client’s request (it’s a hash in the paper, but I fear that is vulnerable to replay attacks)
- The primary’s signature
A backup accepts a pre-prepare if the signatures are correct, it has not accepted a pre-prepare for the same sequence number with a different digest, and the backup is in the given view (it recognizes the current master as leader).
- The master can’t lie about the message’s content
- The master can’t replay messages (with my addition)
- The master cannot send pre-prepares to a subset of nodes.
This step allows the master to totally order events using sequence numbers within views.
- Once a replica has received a valid pre-prepare, it will refuse to accept pre-prepares with lower sequence/view numbers.
Later we will see what happens when a master is faulty.
Every backup then broadcasts a prepare message to all nodes, which looks the same as a pre-prepare except it includes the broadcaster’s ID and signature. A node (including the primary) accepts a prepare if it includes valid signatures and the numbers are valid.
A message has been prepared if a node has received (and stored)
2m valid prepares for a given message. (Note that if we assume this node is also honest, we have our
2m+1 honest nodes.) Since we assume that at least
2m+1 nodes are honest and we know that honest nodes don’t send conflicting prepares (for a given sequence/view number), we can be sure that honest nodes agree upon the message to be committed.
2m+1nodes are honest,
2m+1honest nodes will have “prepared” a message.
- The prepare exchange ensures nodes agree upon the order of events within a view, since any new pre-prepare must have a higher sequence number. They establish a “new minimum”.
A replica broadcasts a commit message once it has finished the prepare phase (a message is prepared).
Once a replica receives
2m+1 commit messages, it is safe to execute the client request. However, a client cannot execute a request until all messages with lower sequence numbers have executed (and therefore committed). This ensures the aforementioned safety property: requests are executed in the order they were submitted.
As we’ve seen, if
2m+1 nodes are honest, the algorithm functions as desired. However, what if the master is faulty? For example, in the pre-prepare step, the master can refuse to send messages to certain nodes. These nodes can either 1) wait until they receive a prepare or 2) do nothing. If they do nothing, the client will eventually fail to receive the requisite responses to their query. They will resend the query to all replicas.
If a replica has already processed a request (it’s been fully committed) if returns its reply. If not, it sends a reminder to the master. If the replica doesn’t receive progress on said request, the replica will assume the master is faulty, timeout, and will broadcast a view change message. This view change message includes the new leader (deterministic, remember) and view number.
If the to-be primary receives
2m view change messages (
2m nodes have timed the leader out) it sends out a new view message, announcing its dominion.
Comparison to Byzantine Generals Signed Solution
I was puzzled by the section in the original paper which details a solution for signed messages which can handle an arbitrary number of Byzantine failures.
Why are we still dealing with
2m+1 faults in the PBFT paper? Aren’t we using signed messages?
We are using signed messages to ensure that a replica knows who actually sent a message, which is an assumption in the original paper.
The arbitrary-failure-signed-messages solution in the original paper is slow. It requires roughly
O(n!) messages, whereas PBFT requires