Tuning AWS SQS Job Queues
Sep 25, 2016 · 5 minute read

On a project recently, we were debugging a slow user experience during file upload and, after investigating, found that the culprit was mainly our queue configuration. We were using Amazon's Simple Queue Service (SQS) for queueing, and this post goes over our debugging process and the lessons learned for tuning SQS, along with some more general takeaways about background jobs and queue design.
Our use case
The user flow in question here is a contract signing flow where the user uploads a file to be prepared for signing. From the user's perspective, they upload a file, wait some amount of time (ideally only a few seconds), and then a real-time event arrives (via a websocket) alerting them that the file has been fully processed and is ready to be signed.
It sounds like a simple process, but what we were seeing in reality was users waiting anywhere from 10 seconds to a couple of minutes, sometimes receiving multiple (duplicate) events letting them know the file was ready. Something was very wrong.
Behind the scenes
We are working in a microservices architecture where we strive to communicate asynchronously (via job queues) where possible. This particular file pipeline involved the cooperation of a handful of internal services as well as communication over the network to multiple external services for file persistence and contract signing management.
Summary of our queue system
Before we dive into fixing things, we'll review how we use SQS for managing asynchronous communication and background jobs. Our services publish messages to SNS topics (the message producers), and these SNS topics have SQS queues subscribed to them. We then have daemons (i.e., workers or message consumers) that poll the SQS queues waiting for messages to process.
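For concreteness, here's a minimal sketch of that SNS-to-SQS wiring in Python with boto3. The topic and queue names are hypothetical placeholders, not our actual resource names:

```python
# Sketch of the SNS -> SQS fan-out described above, using boto3.
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# The producer side publishes to a topic...
topic_arn = sns.create_topic(Name="file-events")["TopicArn"]

# ...and each consuming service owns a queue subscribed to that topic.
queue_url = sqs.create_queue(QueueName="file-processor")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Allow the topic to deliver into the queue, then subscribe.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }],
        })
    },
)
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```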
Check out AWS’ docs for more info. And now back to our problem!
Debugging steps
Now that we understand how our queues are set up, let's investigate our specific problem. We centralize the logs from all of our services and use correlation IDs (i.e., request IDs) to allow for easy request tracing. We took a request ID from a known slow file upload process and here's what we found in the logs:
- there was a consistent gap (~2 seconds) between a message being pushed through SNS and an SQS queue consumer picking up the message.
- there were sometimes multiple instances of the same job running concurrently (doing redundant work).
- there were sometimes large gaps (~30-90 seconds) between different steps in the file processing pipeline.
- there was a long (~30 second) gap between a job failing (due to a network blip, for example) and that same job retrying.
None of these traits were desirable, and they pointed us in the direction of slow queueing.
Anatomy of an SQS Message
Now that we have some more specific queue symptoms targeted, let's take a look at how an SQS message gets processed. When a message gets pushed into a queue (`SendMessage` from a producer), the message is "hidden" from consumers for `DelaySeconds` seconds. After `DelaySeconds`, the next `ReceiveMessage` request (from a consumer) will see the message. The consumer then has `VisibilityTimeout` seconds to work on the message before the queue makes the message visible again to other consumers (essentially a timeout mechanism). When (if ever) the consumer completes its work on the message, it writes back to the queue with `DeleteMessage` and the job is marked as complete and removed from the queue.
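Here's that lifecycle as a short boto3 sketch; the queue URL, message body, and `process()` handler are hypothetical placeholders:

```python
# A minimal walk through the SendMessage -> ReceiveMessage -> DeleteMessage
# lifecycle described above, using boto3.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/file-processor"  # placeholder


def process(body):
    print("processing", body)  # stand-in for the real job


# SendMessage: the producer pushes a message. DelaySeconds hides it
# from consumers for that many seconds.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"file_id": "abc123"}',
    DelaySeconds=0,
)

# ReceiveMessage: a consumer polls for work. A received message stays
# invisible to other consumers until the visibility timeout expires.
resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,  # long polling
)

for msg in resp.get("Messages", []):
    process(msg["Body"])
    # DeleteMessage: acknowledge success so the queue removes the message.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```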
Solutions
Now that we know how a message gets processed, let’s revisit our problems from above and work through (surprisingly simple) solutions for each one.
- gap between publication and consumption:
  - We had configured `DelaySeconds` to be 2. No one could remember why, but we were arbitrarily delaying every job by two seconds. In our pipeline we were running 5 different jobs, so we were artificially slowing down our user flow by 10 seconds before doing any meaningful work. We set `DelaySeconds` to 0 (see the sketch after this list). This change sped up our time between publication and consumption but revealed some race conditions where we published messages referencing resources that were not yet committed to the database. Details on this issue are for another post, but check out the event queue pattern for some ideas about how to handle this type of race condition.
- concurrent job execution:
  - We had set `VisibilityTimeout` to 30 seconds, as this number seemed like a reasonable upper bound on job runtime. What we found was that we had jobs taking longer than 30 seconds that had not errored. In other words, we were doing good (and correct) work that was taking longer than 30 seconds. Because our timeout was set to 30 seconds, a job would be chugging along, but the queue would think it had failed and allow consumers to start the same job again, leading to concurrent work (and duplicate event notifications to the user). There are two solutions (both of which we implemented) to this problem:
    - 1) increase the `VisibilityTimeout` for this queue, and
    - 2) make the job faster.
- gap between pipeline steps:
  - We discovered that during this flow, queues would back up with messages (because some jobs took so long). The solution here was simple: add more consumers (in our case, this meant running more daemon processes).
- gap between job retries:
  - We found that we were not properly `nack`ing back to the queue when a job did actually fail, due to the race conditions we mentioned or network partitions. The former was in our control (don't write bugs) but the latter was out of our control (#distributedsystems). When a job did fail, we wanted to retry faster than waiting 30 seconds. We updated our daemon processes to write back to the queue upon failure, setting the `VisibilityTimeout` of the specific failed message (via `ChangeMessageVisibility`) to 5 seconds so that we'd retry after 5 seconds instead of 30 (see the sketch after this list). A next, more sophisticated step would be to implement an exponential backoff, but this change is good enough for now.
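Here are the `DelaySeconds`/`VisibilityTimeout` changes and the fast-retry "nack" sketched with boto3. The queue URL is a placeholder, and the 120-second visibility timeout is an assumed value for illustration, not necessarily the number we chose:

```python
# Sketches of the queue configuration fixes and the per-message "nack"
# described above, using boto3.
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/file-processor"  # placeholder

# Stop delaying every job, and give legitimately long jobs more headroom
# before the queue re-releases their messages to other consumers.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "DelaySeconds": "0",
        "VisibilityTimeout": "120",  # assumed new upper bound on job runtime
    },
)


# On failure, "nack" by shrinking this one message's visibility timeout
# so it becomes visible again (and retries) after 5 seconds instead of 30.
def nack(receipt_handle):
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=5,
    )
```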
With each of these changes, our file processing experience got better and better. As a side benefit, our entire queueing system got faster. While the issues discovered in our code and queue configuration were not obvious from the outset, we learned a lot from this exercise and found that SQS is not necessarily "simple" but very powerful.