At SendBird, we rely on Redis not only to cache objects from our SQL database but also to act as a primary data store for transient objects like a user's online status. To the platform team, the availability of Redis is crucial for maintaining the overall health of our chat infrastructure.
Since reader endpoints are only accessible as individual nodes in AWS ElastiCache, applications need to load balance read requests themselves. This level of granular control might sound good but if a reader node goes down it means that an application needs a health checking mechanism to wait for the node to recover. Here I'll describe the journey to eliminate single points of failure in our Redis deployment by implementing an active health checker with HAProxy.
In our production environments, we use ElastiCache Redis with cluster mode disabled. To emulate the data sharding abilities of cluster mode, we have Python application-level logic to perform consistent hashing on ElastiCache clusters. We size the clusters to have at least one read replica (in addition to a master) available, so that we can scale reads horizontally by selecting a reader by round-robin.
Now, most of the time this setup works great because ElastiCache allows us to talk to each reader replica (each replica is a full replica) by its own endpoint and gives the writer (either manually selected or a promoted reader with the lowest replication lag) its own endpoint.
This setup doesn't work as well, however, if a Redis node goes down. In that situation, there are two cases:
Of the two cases, not having a writer has a much larger impact on applications but the time-to-recovery is reasonable: about a minute. Not having a reader is less important (just use a different one!) but the time-to-recovery is much longer--as long as ten minutes in some cases.
Our problem lies in the fact that since we talk to each reader endpoint randomly, there is a 1/N chance (where N = number of replicas in a cluster) that read requests will fail until the reader node recovers.
Consider the case above where the writer endpoint points to Replica 1 until about 06:00 when it unexpectedly goes down. The uptick in connections to Replica 2 occurs fast and indicates that the writer endpoint failed over and now Replica 2 is the master.
But because our configuration files are static we still have services attempting to talk to Replica 1 as a reader endpoint. All of those services will fail to connect (with a TimeoutError in our case) which will cause health checks to fail and eventually wake everyone up to deal with a problem that can't be fixed much in the moment (the solution being to wait for ElastiCache to recover the node and hope that nobody needs to read ).
Our first thought was to track failed requests and remove endpoints if they went over some threshold. We already use a fork of the the standard Python Redis library, redis-py, to workaround a bug in connection management (andymccurdy/redis-py#886), so it seemed natural to add health checking and avoid adding another system to our infrastructure.
The design of this passive (error counting) health checker would be such that we would use a custom connection class for every Redis client that increments a sliding window count of recent connection failures for an endpoint. Then, in the future, all future connections to a cluster would avoid an unhealthy node during some cooldown period.
However, we didn't end up merging this approach because
Next, we considered twemproxy to fix the redundant health checking issue. We would even get connection multiplexing for free!
But we also avoided this approach because
Unlike Lyft, we didn't solve our problem with Envoy because it would have been too large a change at the time; however, we are actively interested in moving to it.
So at this point, we knew that we wanted a system with the following characteristics:
We settled on HAProxy because it fit all of our criteria and has Redis-level health checking which means that an endpoint will only be marked as healthy if it is up and successfully speaks the Redis protocol (See in the docs here and code here).
Here's a snippet of our base configuration file where the server endpoints would be resolvable ElastiCache names in production.
With this approach, we also get connection-aware round-robin balancing. Plus the simple configuration format and the need to synchronize settings between our application and HAProxy motivated us to move to an easier-to-maintain settings file format that removes over a thousand lines of code (!) from our main repository.
Note that now in our applications settings file we only have to keep track of two endpoints:
Since making this change in all of our regions we've been able to restart, delete, and failover Redis nodes without a hitch!
Because of the way AWS ElastiCache serves Redis reader endpoints as individual node endpoints instead of a unified reader endpoint, it forces end-users to distribute load across readers manually. If a node goes down then a naïve solution to this problem will lead to a bunch of critical application failures for up to 10 minutes. We recently got a bit smarter by adding HAProxy to detect when this failure starts and resolves, and now the platform team can sleep much better now that this point-of-failure is gone.
We're currently hiring (https://sendbird.com/careers) so If this sort of problem seems interesting to you we'd love to chat!