More infrastructure doesn't fix using the wrong infrastructure

I work at AWS, and predictably we use a lot of AWS cloud services. In many cases, when an engineer needs a compute platform, they’ll go straight to AWS Lambda because “it’s serverless,” with the justification that it’s simple and the best option no matter what, and without wanting to explore alternatives.

The FaaS (Functions as a Service) compute style is great for a certain category of system problems: ones in which you don’t need strict control over how the work executes. AWS Lambda only exposes limited controls, and depending on what your workload is like, you could run into unexpected scaling and failure modes. There are alternative compute environments that avoid those limitations, and you should know about them.

Sure, there is a cost to understanding a new service, like having to learn a new query language. What I’ve realized is that not all engineers assign the same weight to that cost. The new and unfamiliar carries uncertainty, and different people weigh that uncertainty differently.

In this post I make the claim that FaaS platforms aren’t the best solution for everything, and that you should learn to recognize the situations where they are a good fit and where they are not.

The Example - Data Processing Lambda

In this example, you need to take a file from S3, iterate over it, and do some useful work on it. Maybe you save the result to another file, maybe you push it into a service. Another developer wants to run a Lambda function that does this. It could work, right? Lambda can fetch files, process them, and write them back somewhere else. What’s wrong with that?

Let’s take a look at an example implementation as a Lambda function:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.exception.SdkException;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Request and Response are the handler's own input/output types (definitions not shown)
public class Program implements RequestHandler<Request, Response> {
    public Response handleRequest(Request event, Context context) {
        S3Client s3 = S3Client.create();
        String fileContent = readS3File(s3, "inputs/productsStatic.csv");
        String output = multiplyValue(fileContent);
        writeOutputToS3(s3, "s3://outputbucket/foobar", output);
        return null;
    }

    private static String readS3File(S3Client s3, String key) {
        GetObjectRequest request = GetObjectRequest.builder()
                .bucket("awsexamplebucket-streaming-demo2")
                .key(key)
                .build();
        ResponseBytes<GetObjectResponse> bytes;
        try {
            bytes = s3.getObjectAsBytes(request);
        } catch (SdkException e) {
            // TODO: Handle errors and retries
            throw e;
        }
        return bytes.asUtf8String();
    }

    private static int performBusinessLogic(int value) {
        // Here's the actual business logic right here
        return value * 10;
    }

    private static String multiplyValue(String fileContent) {
        Scanner scanner = new Scanner(fileContent);
        StringBuilder output = new StringBuilder();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] values = line.split(",");
            // TODO: Handle errors with the input file
            String multipliedValue = String.valueOf(performBusinessLogic(Integer.parseInt(values[1])));
            output.append(multipliedValue).append("\n");
        }
        return output.toString();
    }

    private static void writeOutputToS3(S3Client s3, String filePath, String content) {
        try {
            // Lambda only allows writes under /tmp
            Path tempFile = Paths.get("/tmp/output.csv");
            Files.writeString(tempFile, content);
            s3.putObject(PutObjectRequest.builder()
                    .bucket("outputbucket")
                    .key("foobar/output.csv")
                    .build(),
                    RequestBody.fromFile(tempFile));
        } catch (IOException e) {
            // TODO: Handle errors and retries
        }
    }
}

Is that bad? Should you push back? You might think it’s pretty simple to call s3.getObject, load it, parse it, etc. Why bother with something else? Well, what do you do if the data set starts getting larger and you run into the 15-minute Lambda time limit?

Just split it into two jobs and have one Lambda function perform part of the work, then the second Lambda function perform the rest. (Think that’s crazy? This actually happened in a project I worked on.) Now you have an orchestration problem. What happens if the first function succeeds but the second fails? You have to set up monitoring rules for both and try to handle retries.

Or maybe you try to run multiple threads to process in parallel. Now you’ve got another kind of orchestration: batching up work for each thread, waiting for responses, and making sure your thread count is scaled to the number of cores your Lambda function actually gets.
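To make that concrete, here’s a rough sketch of the kind of thread-pool plumbing you’d end up owning inside the function. None of this is in the example above; ParallelProcessor, processChunk, and the chunk sizing are hypothetical stand-ins wrapping the same multiply-by-ten logic:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelProcessor {
    // Hypothetical sketch: split the input lines into chunks and process each chunk on its own thread.
    public static String processInParallel(List<String> lines) throws Exception {
        // You now own the question of how many threads the function's vCPU allocation can support.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunkSize = (lines.size() + threads - 1) / threads;
            List<Future<String>> futures = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                List<String> chunk = lines.subList(start, Math.min(start + chunkSize, lines.size()));
                Callable<String> task = () -> processChunk(chunk);
                futures.add(pool.submit(task));
            }
            StringBuilder output = new StringBuilder();
            for (Future<String> future : futures) {
                // Waiting, time-outs, and partial failures are all on you.
                output.append(future.get());
            }
            return output.toString();
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for the per-line multiply-by-ten business logic from the Lambda example.
    private static String processChunk(List<String> chunk) {
        StringBuilder out = new StringBuilder();
        for (String line : chunk) {
            String[] values = line.split(",");
            out.append(Integer.parseInt(values[1]) * 10).append("\n");
        }
        return out.toString();
    }
}

Notice how little of that is the business logic; almost all of it is scheduling and bookkeeping that the platform won’t do for you.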

Can you solve these problems? Yes. Should you? Probably not. Why? Because they’re not the actual business problem. The system is there to provide an answer; everything else is just getting in the way. Using the wrong tool from the toolbox makes your life harder. Picking the right tool, i.e. the right framework, makes it easier.

Trying another compute environment

What about AWS Batch or AWS Fargate? These are two managed compute environments (AWS still owns the underlying OS), but you own the main method, so you have more control over the life-cycle of your process. They do away with the 15-minute limit and give you some control over what kind of host you’re running on.

That means you don’t have to worry about 15-minute timeouts. But you still have other edge cases: the call to S3 could fail, the file format could change unexpectedly, etc. Your code has to handle that, if not at first, then eventually at 3am when somebody gets paged because it broke or changed.
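Even something as basic as retrying the S3 read becomes code you own. Here’s a hedged sketch of what that might look like with the AWS SDK for Java v2 (RetryingReader is hypothetical, the attempt count and backoff are arbitrary, and the SDK’s own retry configuration may already cover part of this):

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.exception.SdkException;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class RetryingReader {
    // Hypothetical sketch of a retry loop around the S3 read.
    public static String readWithRetries(S3Client s3, String bucket, String key) throws InterruptedException {
        GetObjectRequest request = GetObjectRequest.builder().bucket(bucket).key(key).build();
        SdkException lastError = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
                return bytes.asUtf8String();
            } catch (SdkException e) {
                lastError = e;
                // Back off before the next attempt; deciding which errors are even retryable is also on you.
                Thread.sleep(1000L * attempt);
            }
        }
        throw lastError;
    }
}

And that’s just one edge case for one call; the file-format checks, partial writes, and alerting still need the same treatment.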

The ETL Job

Looking at it from a different perspective, this could be an ETL (Extract, Transform, Load) job. An ETL job is a high-level concept that takes an input file, transforms it in some way, maybe joins it with another file, then writes it to another file. Exactly what the code above does.

This is exactly the kind of thing that ETL services such as AWS Glue or AWS EMR are great at. In AWS Glue, this would look like the following (some lines removed; see here for the full example). AWS Glue uses Spark, and you can either use conventional SQL or define your own Spark Scala functions. AWS EMR can run several different big data compute frameworks, but for this post we’re going to use Spark because I’m familiar with it.

object Program {
  def main(sysArgs: Array[String]): Unit = {
    // Yes this is boiler-plate code not relevant for your business logic
    // But it's smaller and limited to just the start
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._

    val input = sparkSession.read
      .format("csv")
      .option("header", "true")
      .load("s3://awsexamplebucket-streaming-demo2/inputs/productsStatic.csv")

    input.select(($"value" * 10) as "value")
      .write
      .save("s3://outputbucket/foobar")
  }
}

There’s some boiler-plate in here just setting up the Spark system, but the actual business logic is simple. It’s a declarative model: you describe what you want done and the framework makes it happen, in contrast to the previous example’s imperative model, where you describe exactly how it happens.

Is it really cheaper to go learn Spark if you aren’t already familiar with it? If it were just this small snippet of code, probably not. But systems rarely stay the way they were originally envisioned, and the growing pains may or may not show up in a prototype you build just to try it out.

When you’re writing everything from scratch, you have to deal with all the minutiae of the problem. Did you add a retry? Should you run it in multiple threads for performance? How do you parse the CSV file? It gets really easy to just layer more and more on top, because the alternative is unknown: either you don’t know it exists, or you don’t know how hard it’ll be to use.

Soon enough, you’re spending your time fixing issues in the system that aren’t the important parts (the business logic.) And personally, as a developer, I don’t enjoy running around fixing those kinds of issues.

But Spark SQL is weird and hard

A common response I get about frameworks like Spark is that Scala is confusing, or Spark SQL influences too much of your code.

Yes, the Spark SQL functions are different from native Java, Python, or Scala, and it can get a bit complicated to model everything inside of Spark SQL. The following is what the FizzBuzz problem looks like:

// Assumes a SparkSession named spark is already in scope
import org.apache.spark.sql.functions.{lit, when}
import spark.implicits._

val df = spark.range(1, 101).toDF("value")

val mod3 = $"value".mod(lit(3)) === 0
val mod5 = $"value".mod(lit(5)) === 0

df.withColumn("fizzbuzz",
  when(
    mod3.and(mod5),
    lit("fizzbuzz")
  )
  .when(mod3, lit("fizz"))
  .when(mod5, lit("buzz"))
  .otherwise($"value".cast("string"))
)

Going 100% into that style of coding trades one problem for another. You went from imperative code that doesn’t handle edge cases to code where the execution framework heavily dictates your coding style. For some engineers, this will also be confusing.

There are some solutions. You can develop a hybrid code style where you pull business logic out into conventional Scala, Python, or Java, because Spark Scala has full interop with JVM languages and PySpark has interop with Python.

import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = spark.read.parquet("s3://blah")

val fizzbuzz = (value: Int) => {
  if (value % 3 == 0 && value % 5 == 0) {
    "fizzbuzz"
  } else if (value % 3 == 0) {
    "fizz"
  } else if (value % 5 == 0) {
    "buzz"
  } else {
    value.toString
  }
}

df.select(udf[String, Int](fizzbuzz)($"value"))

If you have a large block of business logic and you only want Spark to handle input/output, task distribution, retries, etc., you can just jump out to native Java/Scala/Python code to perform the business logic.

import org.apache.spark.sql.Row
import spark.implicits._ // provides the Encoder for the String output

val df = spark.read.parquet("s3://blah")

def fizzBuzzPartition(rows: Iterator[Row]): Iterator[String] = {
  rows.map { row =>
    val value = row.getInt(0)
    // Imagine fizzbuzz() is calling out to existing business logic
    fizzbuzz(value)
  }
}

val out = df.mapPartitions(fizzBuzzPartition)

Should you always use the “best” technology?

Nah. Even if a technology appears to abstract away all your problems, you can’t ignore the other costs: learning a new language, a new framework, new failure modes, and new operational monitoring requirements. Your teammates have to learn something new too, and that adds a lot of friction. Some engineers don’t want to, or don’t have the energy to, learn something new.

You have to weigh the benefits against the costs. If it’s a component you’re not going to touch very frequently, err on the side of technologies that are extremely simple and easy for others to understand. Pick ones that are consistent with the rest of your systems.

If it is a key component that you’re going to continue to develop on, then it’s worth spending some extra time to understand these other technologies.

Conclusion

Picking frameworks and technologies is like picking the right tool from the toolbox. Sure, you can probably make a squeaky door hinge stop with some WD-40, but what you really want is a proper lubricant. The right framework is important because software in production has a lot of hidden edge cases that have to be handled.

Don’t always pick the fancy technology. Weigh the benefits and the costs. Using technology you already know is a big benefit, and having to learn something new should be counted as a cost. But don’t be overly confident in your ability to identify and handle all edge cases.

If you feel like you’re bending a system or framework to get it to work, take a step back and ask yourself if you’re using the right solution.

