More infrastructure doesn't fix using the wrong infrastructure

This is a continuation of my previous post where I talked about the challenges of using serverless/Function as a Service (FaaS) compute systems for ETL (Extract, Transform, Load) jobs. It sat in my drafts folder for a long time, so I just decided to publish it as is.

I used to work at AWS, and predictably we used a lot of AWS cloud services. When an engineer goes looking for a compute platform, they’ll often go straight to AWS Lambda because “it’s Serverless”, with the justification that it’s simple and the alternatives are too complex to be worth considering.

In this post, I introduce another “heavily inspired by true events” problem space in which Serverless was the wrong choice. A queue processing system in AWS Lambda seems like a classic example of a Serverless use case. Let’s walk through the problem.

Queue Processing Lambda

A pattern I saw was using AWS Lambda as a queue processor feeding into another system with limited throughput. The target service(s) had various constraints, such as a maximum number of tps (transactions per second) or a maximum number of in-flight operations before they would reject requests.

In a naïve implementation (I saw this in several systems), SQS feeds into the Lambda function, which then calls the service. If the call gets throttled, retry until it succeeds. A rough sketch of that handler is below.
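As a loose illustration (not any particular team’s code), the handler tends to look something like the following. DownstreamClient, startWorkflow, and ThrottledException are hypothetical stand-ins for the limited-throughput service being called:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

// Naive SQS-triggered handler: call the downstream service for each record and
// blindly retry whenever it throttles us.
public class NaiveQueueHandler implements RequestHandler<SQSEvent, Void> {
  private final DownstreamClient client = new DownstreamClient(); // hypothetical client

  @Override
  public Void handleRequest(SQSEvent event, Context context) {
    for (SQSEvent.SQSMessage record : event.getRecords()) {
      while (true) {
        try {
          client.startWorkflow(record.getBody()); // hypothetical downstream call
          break;
        } catch (ThrottledException e) {
          // "Retry until it succeeds": no backoff, no awareness of the Lambda
          // timeout, and no awareness of the other invocations doing the same thing.
        }
      }
    }
    // Returning normally tells Lambda the whole batch succeeded; throwing fails it all.
    return null;
  }
}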

Do you see any problems with this approach? Hopefully a few bells will be going off.

  • If you get a throttle, how many times do you retry? How quickly should you retry?
  • What happens if the number of retries exceeds the Lambda function timeout and it gets cut off in the middle of processing?
  • Is the batch size big or small? If it’s too big, will it exceed the short-term throttling window?
  • What happens if Lambda starts scaling out horizontally because you have more inbound messages, or because the retry delays are making each invocation take longer so it thinks it needs more functions?
  • What happens if the throttle period means you can’t do any work for several minutes? Lambda will continue to deliver messages to your function even when you know you can’t do anything with them, causing failed deliveries, and presumably you have a DLQ configured (you do, don’t you?)
  • Is the process idempotent?

Let’s take an example service with a limit of 5 tps, where each request kicks off a process that runs for 5-20 minutes, and there’s a maximum of 10 workflows allowed for this client. If you didn’t think about this and used the defaults, you’ll get 10 messages in that batch. You pass those to the API sequentially, the API responds in <100ms, and on the 6th request you get throttled. Depending on the periodicity of the throttling algorithm, it could clear up immediately or take a few seconds. Not terrible.

Let’s make it worse. Say you’ve got a huge queue of messages. AWS Lambda automatically scales you out from 1 to 10 concurrent invocations. Now you’re trying to do 10 invocations × 10 requests per second = 100 requests per second, with roughly 95% of them getting throttled.

Reduce concurrency

Is this fixable without moving off AWS Lambda? What knobs do we have? Well, we could limit the number of concurrent function invocations to 1, which would constrain the maximum tps from:

5 (event source mapping batch size) × 1,000 concurrent invocations × (1 second / 100 ms average latency) = 50,000 tps

to:

1-5 (event source mapping batch size) × 1 concurrent invocation × (1 second / 100 ms average latency) = up to 50 tps

That reduces the worst case quite a bit, but it still exceeds the 5 tps limit in my hypothetical example.
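As a rough sketch of turning those knobs with the AWS SDK for Java (v1 here; the function name and event source mapping UUID are placeholders), you would cap the function’s reserved concurrency and shrink the batch size:

import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.PutFunctionConcurrencyRequest;
import com.amazonaws.services.lambda.model.UpdateEventSourceMappingRequest;

public class ConcurrencyKnobs {
  public static void main(String[] args) {
    AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

    // Cap the function at a single concurrent invocation
    lambda.putFunctionConcurrency(new PutFunctionConcurrencyRequest()
        .withFunctionName("queue-worker")                 // placeholder function name
        .withReservedConcurrentExecutions(1));

    // Shrink the SQS batch so a single invocation can't burst past the limit
    lambda.updateEventSourceMapping(new UpdateEventSourceMappingRequest()
        .withUUID("00000000-0000-0000-0000-000000000000") // placeholder mapping UUID
        .withBatchSize(1));
  }
}

The same settings can be applied from the console or infrastructure-as-code; the point is that concurrency and batch size are about the only real throttles Lambda gives you here.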

More problems with throttling

That keeps it under the TPS limit, but there’s still a way it can fail. The service could start load-shedding non-important traffic. In this example, the service also enforces a maximum number of workflows that can be in flight at once (or perhaps per day). Even if you stayed at or below the 5 TPS limit, you could still get throttled.

Q: Why don’t you just scale the other service up? 5 tps seems awfully low.

A: There are numerous reasons why it might not be possible to scale up another service. For example, it could be a third-party organization, it could be shedding low-priority traffic during a peak, or it could have hard limitations like dealing with physical equipment. Imagine a shipping system that can ship a lot of small boxes, but can only fit a few pallet-sized items in a truck before you have to wait for a new truck.

If you continue to be throttled, you have to slow down and retry until the requests succeed, but the clock is ticking on that Lambda function timeout. If the downstream service is saying no more traffic for 15+ minutes (yes, this can really happen), then you have three options:

  1. Fail the batch entirely, return immediately to SQS
  2. Keep trying periodically and see if any of the requests get through
  3. Just sleep for the entire 15 minute interval

If you chose #1, you have no back-off and are going to immediately hit the service again, and those messages are now closer to going into the DLQ (even though there’s nothing wrong with the messages themselves). You’ll immediately fetch new messages that are not going to succeed and penalize them with failed delivery attempts. After enough trips through this loop, they’ll exceed the max delivery attempts and go into the DLQ.

If you chose #2, it’s getting closer. One or two messages could succeed, but make sure you keep track of the time and ACK the successful messages before exiting so Lambda can consider those done. Also make sure your message visibility timeout is >15 minutes, or keep the pending messages alive with sqs:ChangeMessageVisibility so no other invocation can grab them. (Technically we’re single-threaded, but if the receipt handle expires, you can’t ACK the message.) A sketch of extending visibility follows.
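A minimal sketch of that visibility extension, assuming the AWS SDK for Java v1 (the queue URL is a placeholder); you’d call this for any message you’re still holding while waiting out the throttle:

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

public class VisibilityExtender {
  private static final String QUEUE_URL =
      "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"; // placeholder

  private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

  // Push the message's visibility timeout out another 20 minutes so no other
  // consumer receives it while we wait for the downstream service to recover.
  public void extendVisibility(String receiptHandle) {
    sqs.changeMessageVisibility(QUEUE_URL, receiptHandle, 20 * 60);
  }
}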

Options #2 and #3 are similar. If you know for certain that no new traffic will get through, option #3 reduces the useless traffic sent to the downstream service.

The next problem is that messages have a finite number of delivery attempts before they go to a DLQ, combined with the 15 minute time limit on Lambda functions. Throttling is a global problem and applies to all messages. We don’t want to penalize an individual message just because the entire system is blocked.
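For reference, that delivery-attempt budget comes from the queue’s redrive policy. A minimal sketch of configuring it with the AWS SDK for Java v1 (the queue URL and DLQ ARN are placeholders):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SetQueueAttributesRequest;

public class RedrivePolicyExample {
  public static void main(String[] args) {
    AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

    // After 5 failed delivery attempts SQS moves the message to the DLQ,
    // even if every "failure" was just the downstream service throttling us.
    String redrivePolicy =
        "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\","
        + "\"maxReceiveCount\":\"5\"}";

    sqs.setQueueAttributes(new SetQueueAttributesRequest()
        .withQueueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/my-queue")
        .addAttributesEntry("RedrivePolicy", redrivePolicy));
  }
}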

So clearly there are cases where a Serverless FaaS compute environment will cause an algorithm to perform poorly. The inability to control when a function exits and when a new process starts up causes a cascade of failures where you can’t process the backlog of requests. The fundamental problem is that the system (Lambda’s control plane) can’t react by slowing down how it pulls messages from the queue and hands them to your function. It’s completely unaware of the inner workings of your system and treats every error the same.

Taking control of the process life-cycle

To break free from this problem, we have to take control of the entire control loop and decide when to fetch messages, how many messages to fetch, and how to react to errors. Not to worry, this is easier than it sounds.

The process starts to look something like:

// Pseudocode sketch: amazonSQSClient.receiveMessages() stands in for an SQS
// long-poll receive, and ThrottleException stands in for the downstream
// service's throttling error.
public static void main(String[] args) {
  int numMessagesInBatch = 5;
  while (true) {
    // We decide when to fetch and how many messages to take
    List<Message> messages = amazonSQSClient.receiveMessages(numMessagesInBatch);

    try {
      for (Message message : messages) {
        // Process each message, then ACK (delete) it from the queue
      }
    } catch (ThrottleException e) {
      // Ack any messages that already succeeded
      // Sleep and optionally reduce numMessagesInBatch before the next fetch
    } catch (RuntimeException e) {
      // Handle other errors as appropriate
    }
  }
}
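Because this loop owns the fetch, it can sleep through a 15+ minute throttle without burning delivery attempts: messages it hasn’t received yet simply stay in the queue, and the ones it is holding can have their visibility extended. Batch size and polling rate become knobs you tune in response to the downstream service rather than settings Lambda’s control plane applies blindly. The trade-off is that the loop needs somewhere long-lived to run, such as a container or an instance, which is exactly the point of taking control of the life-cycle.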

Conclusion

Picking the simple solution or using a technology that you’re already familiar with can sometimes be the pragmatic choice.

Picking the right tool for the job is important, and sometimes what appears to be a simple solution ends up being more complicated once you factor in all the failure modes. Serverless is great for a class of problems. If you have a service API that’s infrequently called, the problems above don’t appear because failures apply back-pressure to the caller to slow down and retry. If you’re processing a queue and you’re likely to cause throttling while processing, Lambda and FaaS-based systems can fall over. Don’t forget to leverage specialized big-data compute systems like Glue, EMR, and even Batch if you’re in that situation.
