Tuning AWS SQS Job Queues

On a recent project, we were debugging a slow user experience during file upload. After investigating, we found that the main culprit was our queue configuration. We were using Amazon’s Simple Queue Service (SQS) for queueing, and this post covers our debugging process, the lessons we learned about tuning SQS, and some more general takeaways about background jobs and queue design.

Our use case

The user flow in question is a contract signing flow where the user uploads a file to be prepared for signing. From the user’s perspective, they upload a file, wait some amount of time (ideally only a few seconds), and then a real-time event arrives (via a websocket) to alert them that the file has been fully processed and is ready to be signed.

It sounds like a simple process, but in reality users were waiting anywhere from 10 seconds to a couple of minutes, and they sometimes received multiple (duplicate) events telling them the file was ready. Something was very wrong.

Behind the scenes

We are working in a microservices architecture where we strive to communicate asynchronously (via job queues) where possible. This particular file pipeline involved the cooperation of a handful of internal services as well as communication over the network to multiple external services for file persistence and contract signing management.

Summary of our queue system

Before we dive into fixing things, let’s review how we use SQS to manage asynchronous communication and background jobs. Our services publish messages to SNS topics (the “Message Producer” below), and these SNS topics have SQS queues (the “Queue” below) subscribed to them. We then have daemons (i.e., workers, or the “Message Consumer” below) that poll the SQS queue, waiting for messages to process.

[Figure: SQS queue subscribed to an SNS topic]
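
To make the consumer side concrete, here is a minimal sketch of one polling pass, not our actual worker code. It assumes a boto3-style SQS client is passed in and that raw message delivery is disabled on the SNS subscription, so each SQS body is a JSON envelope with the published text under "Message". The names poll_once, handle and queue_url are hypothetical, and the wait_seconds and max_messages defaults are illustrative rather than the values we ran with.

```python
import json


def poll_once(sqs_client, queue_url, handle, wait_seconds=20, max_messages=10):
    """Receive one batch from the queue, handle each message, then delete it.

    Returns the number of messages handled successfully.
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows 1 to 10 per call
        WaitTimeSeconds=wait_seconds,      # long polling, 0 to 20 seconds
    )
    handled = 0
    # The "Messages" key is absent when nothing was received.
    for message in response.get("Messages", []):
        # Assumption: SNS wrapped the payload in its JSON envelope
        # (raw message delivery is off), so unwrap the "Message" field.
        envelope = json.loads(message["Body"])
        try:
            handle(envelope["Message"])
        except Exception:
            # Do not delete: the message becomes visible again once its
            # visibility timeout expires, and a consumer will retry it.
            continue
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled
```

A daemon would call something like this in a loop, passing a client created with boto3.client("sqs"). Deleting only after the handler succeeds is what lets the queue redeliver failed jobs.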

Check out AWS’ docs for more info. And now back to our problem!

Debugging steps

Now that we understand how our queues are set up, let’s investigate our specific problem. We centralize the logs from all of our services and use correlation IDs (i.e., request IDs) to allow for easy request tracing. We took the request ID of a known slow file upload and here’s what we found in the logs:

None of these traits were desirable, and they pointed us in the direction of slow queueing.

Anatomy of an SQS Message

Now that we have some more specific queue symptoms to target, let’s take a look at how an SQS message gets processed. When a message is pushed into a queue (SendMessage from a producer), it is “hidden” from consumers for DelaySeconds seconds. After that delay, the next ReceiveMessage request (from a consumer) will see the message. The consumer then has VisibilityTimeout seconds to work on the message before the queue makes it visible again to other consumers (essentially a timeout mechanism). When (if ever) the consumer completes its work on the message, it calls DeleteMessage on the queue (not shown here), and the job is marked as complete and removed from the queue.

[Figure: anatomy of an SQS message]
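
The lifecycle described above can be sketched as a toy in-memory model. This is not the AWS API: ToyQueue and its methods are hypothetical names, time is passed in explicitly as now (in seconds), and deletion uses a plain message ID where real SQS uses a receipt handle. It only illustrates how the delay and the visibility timeout decide when a message can be received.

```python
class ToyQueue:
    """In-memory model of SQS message visibility over time (not the AWS API)."""

    def __init__(self, delay_seconds=0, visibility_timeout=30):
        self.delay_seconds = delay_seconds
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> {"body", "visible_at"}
        self.next_id = 0

    def send_message(self, body, now):
        # A new message stays hidden until the delay has elapsed.
        self.next_id += 1
        self.messages[self.next_id] = {
            "body": body,
            "visible_at": now + self.delay_seconds,
        }
        return self.next_id

    def receive_message(self, now):
        # Hand out the first visible message and hide it for the timeout.
        for message_id, message in self.messages.items():
            if message["visible_at"] <= now:
                message["visible_at"] = now + self.visibility_timeout
                return message_id, message["body"]
        return None

    def delete_message(self, message_id):
        # The consumer finished the job, so the message is removed for good.
        self.messages.pop(message_id, None)


queue = ToyQueue(delay_seconds=5, visibility_timeout=30)
job_id = queue.send_message("process file", now=0)
queue.receive_message(now=3)   # None: still inside the delay
queue.receive_message(now=5)   # (1, "process file"): first delivery
queue.receive_message(now=20)  # None: in flight, hidden from other consumers
queue.receive_message(now=35)  # (1, "process file"): timeout expired, redelivered
queue.delete_message(job_id)   # work finished, nothing left to receive
```

The redelivery at 35 seconds is the interesting case: a consumer that takes longer than the visibility timeout never gets to delete the message in time, so a second consumer picks up the same job. That is one way duplicate processing can arise.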

Solutions

Now that we know how a message gets processed, let’s revisit our problems from above and work through (surprisingly simple) solutions for each one.

With each of these changes, our file processing experience got better and better. As a side benefit, our entire queueing system got faster. While the issues in our code and queue configuration were not obvious at the outset, we learned a lot from this exercise and found that SQS is not necessarily “simple”, but it is very powerful.