In the field of computer science, the Paxos algorithm is notorious for how difficult it is to understand. I had to learn the Paxos algorithm in my distributed systems class. I even have "implemented" it by translating Leslie Lamport's TLA+ to Python. But I didn't understand it until much much later.

Now I have a better understanding of Paxos than I used to, I want to explain it to other people. Not because I'd like to help people, rather, I find that explaining things is a very good way to find blind spots in my own understanding.

So, where do we start? Personally, I dislike explanations that start with a step-by-step breakdown of the algorithm, followed by a proof of why those steps do what they claim to do. Instead, I much prefer to start with the problem the algorithm tries to solve, then iteratively come up with a solution together with the reader. So that's what I am going to do. And now you understand the title.

Small disclaimer: The glossaries used in this article is different from what is commonly used for Paxos. I just picked the ones that made the most sense for my narrative.

The problem

The distributed consensus problem is widely useful, so the reader probably doesn't need to be motivated. Here I will just simply state the problem.

There is a group of agents (let's call them $\sc{CLIENT}s$), who want to choose a number among their selections. Any number is fine, as long as everyone agrees on the same number.

Here, there are a few assumptions we will make to make this problem meaningful:

  • All the agents - including but not limited to the $\sc{CLIENT}s$, as we will add more types of agents later - are well-behaved. Meaning they all execute the prescribed algorithms faithfully, and don't maliciously try to trick other agents. (If you like jargons: Byzantine failures don't occur.)
  • Agents can talk to each other by sending each other messages, but the messages they send to each other could take arbitrarily long before reaching their destination, and might get lost (but never altered).

The agents could also "fail". However, failing is equivalent to all messages sent to/from that agent being lost forever. So whether we have this assumption or not won't change the algorithm we come up with.

Also, to not complicate things, we are only solving the "single-round" consensus problem in this article, meaning as the output of this algorithm, all of the $\sc{CLIENT}s$ will get a single number which they agree on.

Solution searching adventure

Iteration 0

When trying to solve a complex problem such as this one, it's usually a good idea to start by simplifying the problem. As a start, let's just ignore the need to be reliable entirely.

If we throw reliability out the window, it should be easy to come up with a very simple solution: we add an agent (let's call it $\sc{COORDINATOR}$). The $\sc{CLIENTS}$ send whatever number they pick to the $\sc{COORDINATOR}$ in a $\sc{PROPOSAL}(client_i, x)$ message, where $x$ is the number proposed by the $i$-th $\sc{CLIENT}$. The $\sc{COORDINATOR}$ picks an arbitrary proposal (say, $x'$), and informs the other $\sc{CLIENT}s$ about this decision. Specifically, the $\sc{COORDINATOR}$ will just reply with a $\sc{CHOSEN}(x')$ message to all the $\sc{PROPOSAL}(\ldots)$ messages it has received and will receive.

If we assume no messages ever get lost, it is quite easy to see that every $\sc{CLIENT}$ will get a number. And because only one number is ever chosen, they will all get the same number.

It is also easy to see why this solution is impractical: it has a single point of failure. Once the singular $\sc{COORDINATOR}$ fails, no further progress can be made.

Iteration 1

To improve this almost looks easy at first glance: just add more $\sc{COORDINATOR}s$!

Sure, more $\sc{COORDINATOR}s$ would remove the single point of failure. However, if there are more than one $\sc{COORDINATOR}s$, they might individually make different decisions, which results in the $\sc{CLIENT}s$ having disagreement.

What if we let the $\sc{COORDINATOR}s$ reach an agreement among themselves before responding? But wait, doesn't that sound familiar? Having a group of agents reaching an agreement, that's exactly what we added the $\sc{COORDINATOR}s$ to solve. We just made the problem cyclic.

Let's take a step back. Is there a way for the clients to reach an agreement without having the $\sc{COORDINATOR}s$ communicate with each other?

In other words, among the decisions of the $\sc{COORDINATOR}s$, is there an deterministic algorithm to pick out a specific one that is robust against message losses?

This might sound hard, but it's actually quite simple: pick the decision that is backed by more than half of the $\sc{COORDINATOR}s$.

There can't be two decisions both with more than half of the $\sc{COORDINATOR}s$ backing them; and if a decision doesn't have that many $\sc{COORDINATOR}s$ backing it, it won't appear to have more backing $\sc{COORDINATOR}s$ through message losses.

Since this approach resembles a majority vote, let's call $\sc{COORDINATOR}$ decisions $\sc{VOTE}(coord_i, x)$ from now on, where $x$ is the number picked by the $i$-th $\sc{COORDINATOR}$. Each $\sc{COORDINATOR}$ has a single vote, because each of them only makes a single decision.

Obviously, our solution cannot be infinitely reliable. If more than half of the $\sc{COORDINATOR}s$ went down, there will never be a majority reached. But this is already vastly better than our first solution, and the reliability scales with the number of $\sc{COORDINATOR}s$. So we will call it good enough.

Sadly, this solution doesn't actually work: there might not be a majority at all! For example, it's possible that three of the proposals each get a third of the votes. We would have a stalemate in that case.

Iteration 2

Again, a solution seems straightforward: just try again in case of a stalemate.

But then again, things aren't that simple.

First of all, the $\sc{COORDINATOR}s$ need to be made aware of a retry. Otherwise, because each $\sc{COORDINATOR}$ only has one vote, they won't be able to vote again even if the $\sc{CLIENT}s$ retry.

To do that, we attach an attempt id to all the messages sent. i.e. $\sc{PROPOSAL}(client_i, x)$ becomes $\sc{PROPOSAL}(\#attempt, client_i, x)$, and so forth. Each time a $\sc{CLIENT}$ retries, it bumps $\#attempt$ to the maximum $\#attempt$ it knows of plus 1. And the $\sc{COORDINATOR}s$ should only responds to messages with the most recent $\#attempt$.

Hopefully the intent of the $\#attempt$ number is clear. (Let me know if not.)

Are we good now? Unfortunately, no. Consider this scenario:

There were 2 clients. They proposed their numbers, the $\sc{COORDINATOR}$ voted on them and all agreed on a single number, $x_1$, all is good. But, all of the $\sc{VOTE}(\ldots)$ messages got lost on the way to $client_2$, while $client_1$ received all of the messages just fine. At this point, $client_1$ thought $x_1$ is the number, but $client_2$ went on to retry. The $\sc{COORDINATOR}s$ voted again, and got $x_2$. This time, all the messages sent to $client_1$ got lost.

And behold, we got the two clients to disagree.

There is an important insight to be had here. Whenever a $\sc{COORDINATOR}$, say $coord_i$, sends out a $\sc{VOTE}(\ldots, coord_i, x)$, there is a chance that some $\sc{CLIENT}$ would adopt $x$. If $coord_i$ ever sends out two votes with different $x$, there is a chance that some of the $\sc{CLIENT}s$ would disagree.

In other words, once a $\sc{COORDINATOR}$ has revealed its vote, it has to stick to it.

This seems to run contrary to our attempt: if the $\sc{COORDINATOR}s$ cannot change their votes, what's the point of retrying? A stalemate will be a stalemate forever.

Looks like we reached a dead end with this type of voting. It appears the problem stems from the fact that the $\sc{COORDINATOR}s$ have to commit to their votes.

So, what if we introduce a form of non-commitment voting?

Iteration 3

Let's explore this idea. Say, the $\sc{COORDINATOR}s$ could now send a $\sc{TENTATIVE}\sc{VOTE}(\#attempt, coord_i, x)$ message, to tentatively vote for $x$.

Obviously, the $\sc{CLIENT}s$ couldn't adopt $x$ right away. So what's this vote good for?

Ah, right, it could get us to a majority.

It is correct that tentative votes don't lead directly to an agreement among $\sc{CLIENT}s$, but it can show us when a majority has formed among the $\sc{COORDINATOR}s$.

Once a $\sc{CLIENT}$ sees a majority tentative vote, it can then message the $\sc{COORDINATOR}s$ to ask for an actual vote. (Let's call this message $\sc{PLEASE}\sc{VOTE}(\#attempt, client_i)$). Intuitively, the $\sc{COORDINATOR}s$ have to make the same vote in the actual vote as their tentative votes.

If all goes well, we would get a majority and an agreement. If there is no majority, the $\sc{COORDINATOR}s$ won't even start a vote, so they are free to change their mind. So the $\sc{CLIENT}s$ could start another attempt which might have a different outcome.

What if things don't go well? What if the $\sc{PLEASE}\sc{VOTE}$ messages weren't received by some of the $\sc{COORDINATOR}s$? In that case, some of the $\sc{COORDINATOR}s$ would have voted, and their decisions cannot be changed. That is to say, in all subsequent attempts, these $\sc{COORDINATOR}s$ will always vote for what they have voted for, whether it's a tentative vote, or the actual vote. But that doesn't create a problem for us. There was a majority in the tentative votes, and now we solidified part of the tentative votes. There is at least one way we can still reach a majority in the next round: everyone votes the same as they did in this round. And we can prove this inductively for all future rounds.

From this, we can have a rough image of how the algorithm functions: as attempts are being made, more and more $\sc{COORDINATOR}s$ start to make up their mind which number they will commit to, while making sure a majority could still be reached. Eventually, after all the $\sc{COORDINATOR}s$ have made up their minds, by induction there must be a majority among them. From that point on, they will just repeatedly broadcast this decision to the $\sc{CLIENT}s$, until all the $\sc{CLIENT}s$ have got that message.

And Viola, we have a working algorithm.

The actual algorithm

Let's clean up our thoughts, and condense the description of our algorithm so it's easy to understand.

First, there is the $\#attempt$ number that is attached to every message. This number is bumped every time a new attempt is made. If a $\sc{CLIENT}$ sees a message with a $\#attempt$ bigger then its most recent $\#attempt$, it knows a new attempt has been initiated, so it would abort its current attempt and participate in the newer one. If a $\sc{COORDINATOR}$ sees a message with a $\#attempt$ smaller than the biggest $\#attempt$ it has ever seen, it would know that message is stale, so it will drop the message.

With that out of the way, let's describe what happens in an attempt.

Each attempt can be split into two phases:

  • Phase 1: The $\sc{CLIENT}s$ each send its $\sc{PROPOSAL}(\ldots)$ to the $\sc{COORDINATOR}s$. The $\sc{COORDINATOR}s$ reply with a $\sc{TENTATIVE}\sc{VOTE}(\ldots, x)$. Each $\sc{CLIENT}$ waits for the tentative votes until they reach a majority. If a majority is not reached, retry.
  • Phase 2: If a majority is reached, the $\sc{CLIENT}s$ send $\sc{PLEASE}\sc{VOTE}$, and the $\sc{COORDINATOR}s$ actually vote. Their actual votes would be the same as their respective tentative votes. Each $\sc{CLIENT}$ waits for the votes until they reach a majority, and then adopt the majority number.

Back to Paxos

Our algorithm does look a bit different from the official Paxos. For one, the name of the agents are different. What we call $\sc{CLIENT}s$, Lamport calls $\sc{PROPOSER}s$; and $\sc{COORDINATOR}s$, $\sc{ACCEPTOR}s$.

Besides that, there are protocol differences too.

Firstly, the $\sc{COORDINATOR}s$ don't have to send the $\sc{TENTATIVE}\sc{VOTE}(\ldots)$ to everyone, they just need to send it to the $\sc{CLIENT}$ they agree with. This way we won't have every $\sc{CLIENT}s$ sending $\sc{PLEASE}\sc{VOTE}$ at the same time, that would be inefficient.

After, we notice that the proposed number is unnecessarily sent multiple times in phase 1 and phase 2. The phase 1 is used to reach a majority, the proposed number is not actually important in that phase. So we remove the $x$ from $\sc{PROPOSAL}$; and in $\sc{TENTATIVE}\sc{VOTE}$, instead voting for a number, they vote for a $\sc{CLIENT}$, by sending the tentative vote only to that $\sc{CLIENTS}$. Finally, after a client received a majority of tentative votes, it sends a $\sc{PLEASE}\sc{VOTE}(x)$, so all the $\sc{COORDINATOR}s$ got that message will vote $x$. Of course, if a $\sc{COORDINATOR}$ has already voted in a previous round, it has to tell the $\sc{CLIENT}$, so it could pick the already voted $x$, otherwise its $\sc{PLEASE}\sc{VOTE}(x)$ will be wasted, as the $\sc{COORDINATOR}s$ couldn't change their minds.

(The modified algorithm has slightly better property. In our algorithm, we just make sure a majority is still possible after each attempt; in Paxos, each round the $\sc{COORDINATOR}s$ that vote will all vote for the same number.)

With this little changes, we can map our algorithm back to Paxos:

Agents:

  • $\sc{CLIENT}$ => $\sc{PROPOSER}$ (which makes proposals) and $\sc{LEARNER}$ (which adopts the resulting number)
  • $\sc{COORDINATOR}$ => $\sc{ACCEPTOR}$

Messages:

  • $\sc{PROPOSAL}(\#attempt, client_i)$ => $\sc{PREPARE}(\#attempt, client_i)$
  • $\sc{TENTATIVE}\sc{VOTE}(\#attempt, coord_i)$ => $\sc{PROMISE}(\#attempt, coord_i)$
  • $\sc{PLEASE}\sc{VOTE}(\#attempt, client_i, x)$ => $\sc{ACCEPT}(\#attempt, client_i, x)$
  • $\sc{VOTE}(\#attempt, coord_i, x)$ => $\sc{ACCEPTED}(\#attempt, coord_i, x)$