Using ELB for your backend microservices? Seeing intermittent connectivity issues, partial outages across your instances, or other unexplainable failures?

TL;DR Respect DNS!

At Curalate, we are nearly two years into building out our backend microservices, and we are using Finagle for the majority of these new services. When we first started operating them in production, one recurring problem was meeting our availability goals. The solution came down to understanding the interaction between the AWS Elastic Load Balancer (ELB) and DNS, and the impact it can have on your services. Almost all of our outages boiled down to not handling DNS changes properly on our end, and the three issues discussed in this post should help you avoid the same mistakes.

You must learn from the mistakes of others. You can’t possibly live long enough to make them all yourself. – Samuel Levenson

Issue 1: The default Java configuration uses an infinite DNS cache TTL

For security reasons, the Java folks set the default DNS cache TTL to be FOREVER. Think about this for a second and you’ll realize that this configuration won’t work well for a dynamic, cloud-based services environment where the IP addresses that DNS resolves actually change (often quite frequently). If you think this might be your problem, compare the set of IPs the service clients were using during the problematic timeframe against the set of IPs after the issue passes. To record the live traffic you can use tcpdump and then analyze it with Wireshark (covered in a future blog post!). To get the latest set of ELB IPs, check your ELB DNS name or the Route 53 entry pointing to it with:

host [dns_name] 
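
You can also check which IPs the JVM itself resolves (and therefore what its own DNS cache currently holds) with a couple of lines of code. This is a minimal sketch using the standard java.net.InetAddress API; the object name and hostname are placeholders:

import java.net.InetAddress

object ResolveElbIps {
  def main(args: Array[String]): Unit = {
    // Placeholder: substitute your ELB DNS name or the Route 53 alias pointing to it.
    val host = "my-service-elb.example.com"
    // getAllByName goes through the JVM resolver, so the result reflects the JVM's own DNS cache.
    InetAddress.getAllByName(host).foreach(addr => println(addr.getHostAddress))
  }
}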

To set the TTL, modify the following value in the java.security file in your JRE home directory (e.g. ./Contents/Home/jre/lib/security/java.security):

Default:

#networkaddress.cache.ttl=-1

Change to:

networkaddress.cache.ttl=10

The AWS recommendation is to set the TTL to 60s. However, it’s not clear to me from the documentation that a 60-second TTL guarantees you will never point at stale load balancer IPs. Does the ELB guarantee that it keeps old load balancer instances available for at least 60s after it updates DNS to point to a new set? Why risk it? We now use a 10 second TTL and we haven’t detected any performance degradation for our scenarios in doing so.
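
If editing java.security on every host is inconvenient, the same property can be set programmatically at JVM startup, as long as it runs before the first lookup gets cached. A minimal sketch (the object name is ours; the property is the same networkaddress.cache.ttl shown above):

import java.security.Security

object DnsTtlConfig {
  // Call this first thing in main(), before any DNS lookups happen.
  // Equivalent to setting networkaddress.cache.ttl=10 in java.security.
  def init(): Unit = {
    Security.setProperty("networkaddress.cache.ttl", "10")
  }
}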

If you don’t rely on the Java Runtime, then double check how your runtime handles DNS TTLs by default; there could be similar default behavior.

Issue 2: Your service client or framework is not respecting the DNS TTL

As mentioned, we are using Finagle as our backend services framework. Like many service and database client frameworks, Finagle manages a connection pool layer to decrease the latency of connecting to the same destination servers (in this case, the ELB instances). The recommended pattern is to create the client object with its connection pool once per process, then have each client request take a connection from the pool and return it when done. In the ELB scenario, where all requests go to the same load balancer instance, this works especially well, and you can enable network keep-alive to further reduce the latency of each request.

So what’s the problem? The issue with these client frameworks and connection pools is that they don’t necessarily handle DNS changes, so you can get stuck with a stale IP address.

To get around this, we evaluated a few solutions:

  1. Detect connection failure and shut down the instance the client is running on. Let the auto-scaling mechanism of wherever the client lives kick in with fresh EC2 instances and up-to-date DNS. This was too aggressive for our current system. We are still early in our microservices journey, and the clients of our backend services are monolithic web applications serving many different request loads; taking down instances for availability blips from a single ELB doesn’t make sense for us. Also, the health checks on the web applications consuming the backend services are themselves simple ELB health checks, and we don’t have the ability to implement rules like “kill instances until fewer than 50% are alive.”
  2. Finagle supports an extensibility point where we can plug in a DNS cache, as explained in this gist from a Stack Overflow post. This option wasn’t available to us at the time because we were on Finagle 6.22, but we’ll revisit it now that we are on 6.33. It looks more elegant than where we ended up.
  3. Detect connection failure, create a new client, and retry. Note that these retries are handled outside of the Finagle client framework, since we found that Finagle’s built-in retries were not picking up the new DNS value either.

The sample Scala code below shows a Finagle client wrapper; it should be encapsulated in a helper of some sort, and if you already have simple retry helper code it can fit in there. Some extra care is taken so that we don’t create a bunch of new client objects when a DNS change occurs.

// Assumed exception types; adjust to the timeout/write exceptions your client actually throws.
import java.nio.channels.UnresolvedAddressException

import com.twitter.finagle.ChannelWriteException
import com.twitter.util.TimeoutException

object SampleServiceClientWrapper {
 private val CLIENT_MIN_LIFETIME_MILLIS = 2000
 private val NUM_RETRIES = 3
 private val SLEEP_BETWEEN_RETRIES_MILLIS = 200

 // Reference to the client that is replaced in case of connectivity failure.
 // @volatile so request threads see a client swapped in by resetClient().
 @volatile private var client = createClient()
 private var lastClientResetTime = System.currentTimeMillis

 // The wrapped request method...“BYO” retry pattern.
 def makeSomeRequest(x: Long, y: Long): Long = {
   var retryCount = NUM_RETRIES
   while (true) {
     try {
       return client.makeSomeRequest(x, y)
     } catch {
       case e: Throwable if retryableException(e) && retryCount > 0 => {
         retryCount -= 1
         Thread.sleep(SLEEP_BETWEEN_RETRIES_MILLIS)
         resetClient()
       }
     }
   }
   // Unreachable: the loop exits only via return or a rethrown exception,
   // but the compiler still needs an expression of type Long here.
   throw new IllegalStateException("unreachable")
 }

 // Finagle client builder code that returns a typed client for making requests.
 private def createClient(): SampleServiceClient = {
   clientFactory.buildClient(...)
 }

 // recreate the client if we haven't done so in the immediate past
 private def resetClient(): Unit = synchronized {
   if (System.currentTimeMillis - lastClientResetTime > CLIENT_MIN_LIFETIME_MILLIS) {
     client = createClient()
     lastClientResetTime = System.currentTimeMillis
   } else {
     // Client was reset recently. Do nothing. Trace. Log. 
   }
 }

 // Handle all the exceptions that could be thrown if
 // it can't connect to the load balancer
 private def retryableException(throwable: Throwable): Boolean = {
   throwable match {
     case e: TimeoutException => true
     case e: ChannelWriteException => true
     case e: UnresolvedAddressException => true
     case _ => false
   }
 }
}

It’s not perfect, but it has worked for us. A variation would be to have a background thread check DNS for the ELB host name and signal a proactive client reset upon detecting a change. But since stale DNS entries are not the only intermittent network problem, we were already retrying on most of these exceptions anyway.
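
For reference, a proactive watcher along those lines could look something like the sketch below: a daemon thread re-resolves the ELB host name on an interval and invokes a callback (e.g. something like the resetClient above) whenever the resolved IP set changes. The class, host name, and interval are placeholders, a sketch rather than something we run in production:

import java.net.InetAddress

// Hypothetical DNS watcher: polls the host name and fires onChange when the resolved IP set changes.
class DnsChangeWatcher(host: String, intervalMillis: Long, onChange: () => Unit) {
  @volatile private var lastIps: Set[String] = Set.empty

  private val poller = new Thread(new Runnable {
    def run(): Unit = {
      while (true) {
        try {
          val ips = InetAddress.getAllByName(host).map(_.getHostAddress).toSet
          if (lastIps.nonEmpty && ips != lastIps) onChange()
          lastIps = ips
        } catch {
          case _: Exception => // resolution failed; try again on the next interval
        }
        Thread.sleep(intervalMillis)
      }
    }
  })
  poller.setDaemon(true)
  poller.start()
}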

Issue 3: Wildly inconsistent request rates

If your request load varies wildly throughout the day, it can amplify any DNS issues: the more scaling operations the ELB has to do, the more often it will switch out load balancer instances and change IPs. Incidentally, this makes for a good end-to-end test if you are rolling out new backend services: vary the load dramatically over hours of the day and see how the success rate of your client requests holds up.
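
As an illustration, a throwaway load driver that ramps the request rate up and down can surface these problems without waiting for organic traffic. This is a rough sketch, not a real load testing tool; the rates, hold time, and the call into the SampleServiceClientWrapper from earlier are placeholders:

object RampLoadTest {
  def main(args: Array[String]): Unit = {
    // Placeholder rates: ramp from 1 up to 50 requests per second and back down.
    val rates = (1 to 50) ++ (50 to 1 by -1)
    for (rps <- rates) {
      val start = System.currentTimeMillis
      // Hold each rate for a minute; stretch this out to hours for a more realistic test.
      while (System.currentTimeMillis - start < 60000) {
        try {
          SampleServiceClientWrapper.makeSomeRequest(1L, 2L)
        } catch {
          case e: Exception => println(s"request failed: ${e.getMessage}")
        }
        // Serial loop, so the achieved rate is approximate (request latency is not accounted for).
        Thread.sleep(1000L / rps)
      }
    }
  }
}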

One issue with request spikes is that you could overload the capacity of the ELB before it has time to adjust its scale. This AWS article describes that scaling can take between 1 and 7 minutes. If that isn’t sufficient to handle your load spikes and you know the expected load characteristics, you can contact AWS Support and file a request to have them “pre-warm” specific ELBs with a certain configuration. We haven’t needed this yet because our backend services still have relatively low throughput and our latency requirements aren’t that strict, but I expect this to be an issue in the future.

Conclusion

If you’re just starting to scale up your fleet of microservices, learn from our mistakes and get your DNS caching right. It’ll save a lot of time chasing down issues.


Further details from AWS documentation

“Before a client sends a request to your load balancer, it resolves the load balancer’s domain name using a Domain Name System (DNS) server. The DNS entry is controlled by Amazon, because your instances are in the amazonaws.com domain. The Amazon DNS servers return one or more IP addresses to the client. These are the IP addresses of the load balancer nodes for your load balancer. As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. Note that the DNS entry also specifies the time-to-live (TTL) as 60 seconds, which ensures that the IP addresses can be remapped quickly in response to changing traffic.”

How Elastic Load Balancing Works

“If clients do not re-resolve the DNS at least once per minute, then the new resources Elastic Load Balancing adds to DNS will not be used by clients. This can mean that clients continue to overwhelm a small portion of the allocated Elastic Load Balancing resources, while overall Elastic Load Balancing is not being heavily utilized. This is not a problem that can occur in real-world scenarios, but it is a likely problem for load testing tools that do not offer the control needed to ensure that clients are re-resolving DNS frequently.”

Best Practices in Evaluating Elastic Load Balancing