Distributed Locking to Build Reliable Systems
Why distributed locking is important when building reliable systems
First, we need to understand the distributed systems problem space, so I'm going to talk about the CAP theorem. The CAP theorem is a fundamental result in distributed systems that describes the trade-offs we make when designing a fault-tolerant system: of consistency, availability, and partition tolerance, we can only guarantee two of the three during failures such as network partitions.
- A consistent system guarantees that every read returns the most recent write or an error.
- An available system guarantees that every request receives a non-error response, with no guarantee that the response reflects the most recent write.
- A partition-tolerant system guarantees that it continues to operate even while messages between nodes are dropped or delayed.
Now let's take a real-world example that gives us a little insight into why consistency is important in our systems.
Imagine we have a trip that we're about to offer to a driver. At the same time, the driver accepts the trip and the rider attempts to cancel it. Our system receives these two requests in parallel, and after each one is processed we're in two very different states: in one, the driver has been assigned to the trip; in the other, the trip is cancelled. When they're processed in parallel, which state wins? Which one is the source of truth? With no consistency guarantee in our system, there's no way to know which state ends up persisted in the data layer and becomes our source of truth.
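To make the race concrete, here is a minimal sketch in Python, with threads standing in for the two concurrent requests (the trip structure and handler names are illustrative, not our actual code):

```python
import threading
import time

# Shared trip state, mutated by two request handlers with no synchronization.
trip = {"state": "OFFERED"}

def driver_accept():
    if trip["state"] == "OFFERED":   # check...
        time.sleep(0.01)             # window where the other request interleaves
        trip["state"] = "ACCEPTED"   # ...then act: a classic check-then-act race

def rider_cancel():
    if trip["state"] == "OFFERED":
        time.sleep(0.01)
        trip["state"] = "CANCELLED"

t1 = threading.Thread(target=driver_accept)
t2 = threading.Thread(target=rider_cancel)
t1.start(); t2.start(); t1.join(); t2.join()

# Either request can win; the final state depends purely on thread scheduling.
print(trip["state"])
```

Run it a few times and the final state flips between the two outcomes, which is exactly the ambiguity described above.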
We ensure consistency in our system by acquiring a lock. In this example it's the same order of events: a trip is offered to a driver, the rider cancels, and the driver accepts. But this time there's a difference. When the rider's cancellation comes in, we acquire a lock on that entity, which forces the driver's accept transaction to wait for the lock. During this time we process the rider cancellation as we did previously and determine the trip state to be cancelled. We then release the lock, the other transaction (the driver's accept) acquires it, and we process that transaction against the now-cancelled trip.
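Here is the same sketch with a per-entity lock added. This still uses an in-process `threading.Lock` for illustration; in a distributed system the lock itself must be distributed, which is the subject of the rest of this post:

```python
import threading

trip = {"state": "OFFERED"}
trip_lock = threading.Lock()   # one lock per trip entity

def rider_cancel():
    with trip_lock:                        # acquire the entity's lock first
        if trip["state"] == "OFFERED":
            trip["state"] = "CANCELLED"

def driver_accept():
    with trip_lock:                        # waits if the other transaction holds it
        if trip["state"] == "OFFERED":     # sees the earlier transaction's result
            trip["state"] = "ACCEPTED"

t1 = threading.Thread(target=rider_cancel)
t2 = threading.Thread(target=driver_accept)
t1.start(); t2.start(); t1.join(); t2.join()

# The transactions run serially: whichever acquires the lock second
# observes the first one's outcome instead of racing past it.
print(trip["state"])
```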
In distributed systems there are multiple ways to implement locking, and each implementation comes with trade-offs. We'll talk about the two solutions we built: Ringpop and the Distributed Lock Manager (DLM).
The first solution is in-memory locks combined with sharding. Sharding ensures that only one worker handles requests for a given entity, and in-memory locks ensure those requests are not processed in parallel. Ringpop is an open-source sharding solution. It uses a gossip protocol to maintain cluster membership, and it leverages consistent hashing to minimize the number of keys that must be rebalanced when our application cluster changes. Using this consistent hash ring, Ringpop provides a method to forward a request to the node that should own it.
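To illustrate the consistent-hashing idea, here is a minimal ring sketch (this is illustrative, not Ringpop's actual implementation; node names and the trip key are made up):

```python
import bisect
import hashlib

class HashRing:
    """Each node is hashed onto a ring (with virtual nodes to smooth the
    distribution); a key is owned by the first node clockwise from its hash."""

    def __init__(self, nodes, replicas=100):
        self.ring = []                           # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):            # virtual nodes
                h = self._hash(f"{node}#{i}")
                bisect.insort(self.ring, (h, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, "")) % len(self.ring)  # wrap around
        return self.ring[i][1]

ring = HashRing(["worker-1", "worker-2", "worker-3"])
print(ring.owner("trip:abc123"))   # deterministic owner for this trip key
```

Because only the virtual nodes belonging to a joining or leaving worker move, most keys keep their owner when the cluster changes, which is the rebalancing property mentioned above.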
Now let's talk a little about the trade-offs we make when we use Ringpop. Unfortunately, requests coming into an application built on top of Ringpop incur extra hops. A request for trip A will hit some worker, which determines whether it owns the trip; if it does not, it forwards the request to the worker it thinks owns that trip. A request can actually be proxied around the cluster multiple times before it finds the worker that should own it. This is a symptom of Ringpop's eventual consistency: during membership changes, nodes can briefly disagree about who owns what.
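Reusing the `HashRing` sketch above, here is roughly what owner-or-forward routing looks like (the helper is hypothetical; real Ringpop forwards requests over its own RPC transport):

```python
def handle_request(node, ring_views, key, request, hops=0, max_hops=5):
    """Owner-or-forward: each node routes using its own (possibly stale) view."""
    owner = ring_views[node].owner(key)
    if owner == node:
        return f"{node} handled {request!r} for {key}"
    if hops >= max_hops:
        raise RuntimeError("ring views diverged; giving up")
    # Forward to the node we believe owns the key. Its ring view may differ
    # from ours during membership changes, so it may forward yet again.
    return handle_request(owner, ring_views, key, request, hops + 1, max_hops)

# Every node shares an identical view here, so at most one forward occurs;
# with divergent views, the hop count can grow.
views = {n: HashRing(["worker-1", "worker-2", "worker-3"])
         for n in ["worker-1", "worker-2", "worker-3"]}
print(handle_request("worker-1", views, "trip:abc123", "cancel"))
```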
Here is how that plays out. First, a request comes into our cluster and is forwarded to the worker that should own it. Then that worker goes down and can no longer handle the request, so the coordinating node forwards it to the new owning worker, which starts processing it. When the original worker comes back up, it regains ownership of that key (because ownership is consistently hashed), and it may start processing another request that came in and was forwarded to it. These two requests are now handled in parallel: Ringpop can't guarantee 100% consistency during these changes because it has to propagate cluster changes through its gossip protocol, and that takes time.
The other solution is the DLM, a decentralized locking interface for applications. With it, we don't need to worry about sharding clusters to get a locking guarantee. We move lock-state management out of the application layer and into the infrastructure: a service can now receive a request for a transaction on any of its nodes, acquire a lock from the DLM, execute the transaction, and then release the lock. This keeps the application very stateless and very lightweight.
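To make that flow concrete, here is a hypothetical in-process stand-in for a DLM client (the class, method names, and fencing token are all made up for illustration; a real client would call the DLM service over the network):

```python
import threading
import uuid

class DLMClient:
    """Illustrative stand-in for a lock-service client: one lock per entity."""

    def __init__(self):
        self._locks = {}
        self._mu = threading.Lock()

    def acquire(self, entity_id):
        with self._mu:
            lock = self._locks.setdefault(entity_id, threading.Lock())
        lock.acquire()                  # blocks until we hold the entity's lock
        return uuid.uuid4().hex         # token identifying this lock holder

    def heartbeat(self, entity_id, token):
        pass                            # no-op in this in-process sketch

    def release(self, entity_id, token):
        self._locks[entity_id].release()

def handle_transaction(dlm, entity_id, transaction):
    token = dlm.acquire(entity_id)      # serialize all work on this entity
    try:
        return transaction()            # safe: we exclusively own the entity
    finally:
        dlm.release(entity_id, token)   # always release, even on failure
```

Because any node can run this acquire/execute/release flow, the application itself no longer needs to route requests to a specific owning worker.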
Now I want to give a really high-level overview of what the DLM architecture looks like. These are the three major components:
- The DLM application layer
- Cluster management (a coordinator service, i.e. ZooKeeper)
- The state persistence layer (Cassandra)
DLM uses compare-and-set writes to persist lock state consistently.
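Here is a minimal in-memory sketch of what compare-and-set gives us (the store is illustrative; in the real system this would be a conditional write against the persistence layer, such as a Cassandra lightweight transaction):

```python
import threading

class CasStore:
    def __init__(self):
        self._data = {}
        self._mu = threading.Lock()   # stands in for the database's atomicity

    def compare_and_set(self, key, expected, new):
        """Write `new` only if the current value still equals `expected`."""
        with self._mu:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

store = CasStore()
# Two workers race to take the same lock; only one CAS can succeed.
print(store.compare_and_set("lock:trip:42", None, "worker-A"))  # True
print(store.compare_and_set("lock:trip:42", None, "worker-B"))  # False
```

The losing writer learns immediately that someone else holds the lock, rather than silently overwriting the winner's state.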
Let's dive a little into how routing works for DLM. ZooKeeper reads the DLM cluster state and pushes it to the client application, where the client keeps a mapping of partition ownerships, so that when we need to send a request to a DLM worker we know exactly where to send it.
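A sketch of that client-side routing, assuming a fixed partition count and an ownership map the client has received from the coordinator (the map shape, hash choice, and worker names are illustrative):

```python
import hashlib

NUM_PARTITIONS = 4
# Ownership map pushed to the client by the coordinator (illustrative).
partition_owners = {0: "dlm-1", 1: "dlm-2", 2: "dlm-3", 3: "dlm-1"}

def partition_for(entity_id):
    h = int(hashlib.md5(entity_id.encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS

def dlm_worker_for(entity_id):
    # Route directly to the owning DLM worker: no extra hops needed.
    return partition_owners[partition_for(entity_id)]

print(dlm_worker_for("trip:abc123"))
```

Because the client already knows the owner, lock requests go to the right DLM worker on the first try, unlike the gossip-based forwarding we saw with Ringpop.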
Let's have a look at a normal request lifecycle with DLM. First, when a request comes into the application, before the application starts handling it, it tries to acquire a lock for the entity it is processing. Once it has acquired that lock, it starts sending heartbeats to the DLM to signal that the transaction is still alive and the entity is still being worked on. The reason we do this is that we want to know as soon as possible when the application dies: if the worker running a transaction on an entity is no longer available, we don't want to wait for some hard-coded TTL to expire before another worker can make progress; we can quickly release the lock and let the next worker start. And of course, at the end of the transaction we release the lock in the DLM so work can be done somewhere else.
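Putting it together, here is a sketch of that lifecycle using the hypothetical `DLMClient` from earlier, with a background heartbeat thread (the heartbeat interval is made up):

```python
import threading

def run_with_lock(dlm, entity_id, transaction, heartbeat_interval=1.0):
    token = dlm.acquire(entity_id)
    stop = threading.Event()

    def heartbeat():
        # Tell the DLM this transaction is still alive, so the lock can be
        # released promptly if this worker dies (no waiting on a long TTL).
        while not stop.is_set():
            dlm.heartbeat(entity_id, token)
            stop.wait(heartbeat_interval)

    hb = threading.Thread(target=heartbeat, daemon=True)
    hb.start()
    try:
        return transaction()
    finally:
        stop.set()
        hb.join()
        dlm.release(entity_id, token)   # normal completion: free the entity

dlm = DLMClient()
print(run_with_lock(dlm, "trip:42", lambda: "accepted"))
```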
I hope you enjoyed the blog. Please let me know if you have any questions. Thanks for reading. Happy learning!