
Spark on AWS Lambda: An Apache Spark runtime for AWS Lambda


Spark on AWS Lambda (SoAL) is a framework that runs Apache Spark workloads on AWS Lambda. It’s designed for both batch and event-based workloads, handling data payload sizes from 10 KB to 400 MB. The framework is ideal for batch analytics workloads from Amazon Simple Storage Service (Amazon S3) and event-based streaming from Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Kinesis. It integrates seamlessly with platforms like Apache Iceberg, Delta Lake, Apache Hudi, Amazon Redshift, and Snowflake, offering a low-cost and scalable data processing solution. SoAL lets you run a data-processing engine like Apache Spark while taking advantage of the benefits of serverless architecture, such as auto scaling and pay-per-use compute, for analytics workloads.

This post highlights the SoAL architecture, provides infrastructure as code (IaC), offers step-by-step instructions for setting up the SoAL framework in your AWS account, and outlines SoAL architectural patterns for enterprises.

Solution overview

Apache Spark offers cluster mode and local mode deployments, with the former incurring latency due to cluster initialization and warmup. Although Apache Spark’s cluster-based engines are commonly used for data processing, especially with ACID frameworks, for payloads under 50 MB they carry high resource overhead and perform slower than the lighter-weight Pandas framework. Compared to cluster mode, local mode provides faster initialization and better performance for small analytics workloads. In the SoAL framework, Apache Spark local mode is optimized for small analytics workloads, and cluster mode is optimized for larger analytics workloads, making it a versatile framework for enterprises.

We provide an AWS Serverless Application Model (AWS SAM) template, available in the GitHub repo, to deploy the SoAL framework in an AWS account. The AWS SAM template builds the Docker image, pushes it to the Amazon Elastic Container Registry (Amazon ECR) repository, and then creates the Lambda function. The AWS SAM template expedites the setup and adoption of the SoAL framework for AWS customers.

SoAL architecture

The SoAL framework provides local mode and containerized Apache Spark running on Lambda. In the SoAL framework, the Lambda function runs from a Docker container image with Apache Spark and the AWS dependencies installed. On invocation, the SoAL framework’s Lambda handler fetches the PySpark script from an S3 folder and submits the Spark job on Lambda. The logs for the Spark jobs are recorded in Amazon CloudWatch.
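As a rough sketch, the handler’s flow might look like the following. This is an illustrative outline, not the framework’s actual source: the helper names, the `/tmp` path, and the `--event` argument convention are assumptions (the `SCRIPT_BUCKET` and `SCRIPT_LOCATION` environment variable names are the ones used later in this post).

```python
import json
import os


def build_spark_submit_cmd(local_script_path: str, event: dict) -> list:
    """Build the spark-submit invocation for a fetched PySpark script.

    The Lambda event travels to the script as a named argument
    (serialized as JSON), as described above.
    """
    return [
        "spark-submit",
        "--master", "local[*]",  # SoAL runs Spark in local mode on Lambda
        local_script_path,
        "--event", json.dumps(event),
    ]


def handler(event, context):
    # Bucket and key of the PySpark script, supplied via environment variables.
    bucket = os.environ["SCRIPT_BUCKET"]
    key = os.environ["SCRIPT_LOCATION"]
    local_path = "/tmp/spark_script.py"
    # In the real handler, boto3's S3 client would download the script here:
    # boto3.client("s3").download_file(bucket, key, local_path)
    cmd = build_spark_submit_cmd(local_path, event)
    # ...and subprocess.run(cmd, check=True) would then submit the Spark job.
    return cmd
```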

For both streaming and batch tasks, the Lambda event is sent to the PySpark script as a named argument. By using a container-based image cache along with Lambda’s warm instance features, the overall JVM warmup time drops from approximately 70 seconds to under 30 seconds. The framework performs well with batch payloads up to 400 MB and with streaming data from Amazon MSK and Kinesis. The per-session costs for any given analytics workload depend on the number of requests, the run duration, and the memory configured for the Lambda functions.
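On the script side, the PySpark job can recover the event from that named argument. A minimal sketch, assuming the event is passed as a JSON string under an `--event` flag (the flag name is an assumption):

```python
import argparse
import json


def parse_lambda_event(argv):
    """Recover the Lambda event passed to the PySpark script as a named argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--event", type=str, default="{}",
                        help="JSON-serialized Lambda event")
    args = parser.parse_args(argv)
    return json.loads(args.event)


# Inside the PySpark script:
# event = parse_lambda_event(sys.argv[1:])
# records = event.get("Records", [])  # e.g. MSK/Kinesis records for streaming invocations
```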

The following diagram illustrates the SoAL architecture.

Enterprise architecture

The PySpark script is developed in standard Spark and is compatible with the SoAL framework, Amazon EMR, Amazon EMR Serverless, and AWS Glue. If needed, you can run the same PySpark scripts in cluster mode on Amazon EMR, EMR Serverless, and AWS Glue. For analytics workloads between a few KB and 400 MB, use the SoAL framework on Lambda; for larger analytics workloads over 400 MB, run the same PySpark script on cluster-based AWS tools like Amazon EMR, EMR Serverless, and AWS Glue. The extensible script and architecture make SoAL a scalable framework for enterprise analytics workloads. The following diagram illustrates this architecture.
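The size-based routing described above can be captured in a simple dispatch rule. The 400 MB threshold comes from this post; the function name and return values are illustrative:

```python
def choose_engine(payload_bytes: int) -> str:
    """Pick a Spark runtime from payload size, per the sizing guidance above.

    Up to 400 MB -> SoAL on Lambda (Spark local mode);
    larger payloads -> a cluster-based service (EMR, EMR Serverless, or AWS Glue).
    """
    SOAL_LIMIT = 400 * 1024 * 1024  # 400 MB, SoAL's documented upper bound
    if payload_bytes <= SOAL_LIMIT:
        return "soal-lambda"
    return "cluster"  # Amazon EMR, EMR Serverless, or AWS Glue
```

Because the same standard PySpark script runs unchanged on all of these engines, only the submission target changes with payload size.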

Prerequisites

To implement this solution, you need an AWS Identity and Access Management (IAM) role with permissions for AWS CloudFormation, Amazon ECR, Lambda, and AWS CodeBuild.

Set up the solution

To set up the solution in an AWS account, complete the following steps:

  1. Clone the GitHub repository to local storage and change to the CloudFormation folder within the cloned repository:
    git clone https://github.com/aws-samples/spark-on-aws-lambda.git

  2. Run the AWS SAM template sam-imagebuilder.yaml using the following command with the stack name and framework of your choice. In this example, the framework is Apache Hudi:
    sam deploy --template-file sam-imagebuilder.yaml --stack-name spark-on-lambda-image-builder --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM --resolve-s3 --parameter-overrides 'ParameterKey=FRAMEWORK,ParameterValue=HUDI'

The command deploys a CloudFormation stack called spark-on-lambda-image-builder and runs a CodeBuild project that builds and pushes the Docker image with the latest tag to Amazon ECR. The FRAMEWORK parameter accepts a value for each supported open-source framework (Delta Lake, Apache Hudi, and Apache Iceberg).

  3. After the stack has been successfully deployed, copy the ECR repository URI (spark-on-lambda-image-builder) that is displayed in the output of the stack.
  4. Deploy the AWS SAM Lambda package with the required Region and ECR repository:
    sam deploy --template-file sam-template.yaml --stack-name spark-on-lambda-stack --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM --resolve-s3 --image-repository <accountno>.dkr.ecr.us-east-1.amazonaws.com/sparkonlambda-spark-on-lambda-image-builder --parameter-overrides 'ParameterKey=ScriptBucket,ParameterValue=<Provide the S3 bucket of the script> ParameterKey=SparkScript,ParameterValue=<provide the S3 folder location> ParameterKey=ImageUri,ParameterValue=<accountno>.dkr.ecr.us-east-1.amazonaws.com/sparkonlambda-spark-on-lambda-image-builder:latest'

This command creates the Lambda function with the container image from the ECR repository. An output file packaged-template.yaml is created in the local directory.

  5. Optionally, to publish the AWS SAM application to the AWS Serverless Application Repository, run the following command. This shares the AWS SAM template through the AWS Serverless Application Repository GUI so that other developers can deploy it quickly in the future.
    sam publish --template packaged-template.yaml

After you run this command, a Lambda function is created using the SoAL framework runtime.

  6. To test it, use the PySpark scripts from the spark-scripts folder. Place the sample script and the accomodations.csv dataset in an S3 folder and provide the location via the Lambda environment variables SCRIPT_BUCKET and SCRIPT_LOCATION.

When the Lambda function is invoked, it downloads the PySpark script from the S3 folder to the container’s local storage and runs it on the SoAL framework container using spark-submit. The Lambda event is also passed to the PySpark script.

Clean up

The resources deployed by the AWS SAM template incur costs. Delete the Docker image from Amazon ECR, delete the Lambda function, and remove all the files or scripts from the S3 location. You can also use the following command to delete the stack:

sam delete --stack-name spark-on-lambda-stack

Conclusion

The SoAL framework enables you to run Apache Spark serverless tasks on AWS Lambda efficiently and cost-effectively. Beyond cost savings, it ensures swift processing times for small to medium files. As a holistic enterprise vision, SoAL seamlessly bridges the gap between big and small data processing, using the power of the Apache Spark runtime across both Lambda and other cluster-based AWS resources.

Follow the steps in this post to use the SoAL framework in your AWS account, and leave a comment if you have any questions.


About the authors

John Cherian is a Senior Solutions Architect (SA) at Amazon Web Services who helps customers with strategy and architecture for building solutions on AWS.

Emerson Antony is a Senior Cloud Architect at Amazon Web Services who helps customers with implementing AWS solutions.

Kiran Anand is a Principal AWS Data Lab Architect at Amazon Web Services who helps customers with big data and analytics architecture.