Bite-Sized Serverless


ML Inference using Lambda's 10 GB Ephemeral Storage

Lambda - Advanced (300)
On March 25th, 2022, AWS released increased ephemeral storage for Lambda Functions. Previously limited to 512 MB, Lambda Functions can now use up to a whopping 10 GB of temporary disk space. This opens up new possibilities, like dynamically loading machine learning (ML) models for inference. In this Bite we will show how to implement ML inference in a Lambda Function, using MXNet, GluonCV, and dynamic models.
This project uses S3 event notifications, delivered through EventBridge, to trigger our inference Lambda Function when a new image is uploaded. The Lambda Function downloads the image, loads a GluonCV ML model into its ephemeral storage, uses the model to detect objects in the image, and uploads a new image with the detected objects drawn on it back to S3.
The source code for this project can be found in the GitHub project bitesizedserverless/ml-inference-in-lambda. It contains a complete CDK project, Dockerfile, and Lambda code ready to adjust and deploy to your requirements.
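The fields the handler needs from that trigger can be sketched in plain Python. This is a hypothetical helper (parse_s3_event is not part of the project or any AWS SDK); the field names follow the shape of the EventBridge "Object Created" S3 event.

```python
def parse_s3_event(event: dict) -> tuple:
    """Extract the bucket name and object key from an EventBridge
    'Object Created' S3 event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


# Example event, trimmed down to the fields we actually use
sample_event = {
    "detail-type": "Object Created",
    "source": "aws.s3",
    "detail": {
        "bucket": {"name": "input-output-bucket"},
        "object": {"key": "uploads/cat.jpg"},
    },
}

bucket, key = parse_s3_event(sample_event)
```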

Introducing GluonCV

In their own words, GluonCV (Gluon Computer Vision) "provides implementations of state-of-the-art (SOTA) deep learning algorithms in computer vision. It aims to help engineers, researchers, and students quickly prototype products, validate new ideas and learn computer vision".
Simply put, they develop, train, and host ready-made computer vision models so others don't have to. They also provide an API that lets engineers configure inference simply by naming the model they want to use; GluonCV takes care of (almost all) the rest. In our project, GluonCV is used as follows:
```python
model = model_zoo.get_model(
    GLUON_MODEL,
    pretrained=DATASET,
    root="/tmp",
    ctx=mx.cpu(),
)
```
This simple piece of code downloads a pre-trained model and stores it in the /tmp directory. Next, we load an image and initialize the model. The model requires the image to be of a specific size. The correct transformations (presets.rcnn) are also readily available in the Gluon framework.
```python
transformed_img, orig_img = data.transforms.presets.rcnn.load_test(local_image_name)

# MXNet schedules this computation asynchronously; the results are
# materialized when their values are accessed
box_ids, scores, bboxes = model(transformed_img)
```
Finally, we use the model and matplotlib to draw bounding boxes around detected objects.
```python
utils.viz.plot_bbox(  # plot the inference results on the original image
    orig_img,
    bboxes[0],
    scores[0],
    box_ids[0],
    thresh=0.7,
    class_names=model.classes,
    linewidth=1,
)

local_output_name = f"/{LOCAL_OUTPUT_PATH}/{im_name}"
plt.savefig(local_output_name)
```
The file saved with plt.savefig is displayed below. It contains a box around every detected object, and a label to indicate what has been detected.

Increasing Lambda Ephemeral Storage

The Lambda Function is configured to use a specific pre-trained model. Which model to use is specified through environment variables.
```python
GLUON_MODEL = os.environ["GLUON_MODEL"]
BUCKET = os.environ["BUCKET"]
DATASET = os.environ.get("DATASET", True)
```
The environment variables themselves are specified on the Lambda Function. This allows us to quickly change models, without rebuilding and redeploying the Lambda Function's application code.
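Swapping models then comes down to updating the function's environment, for example with the AWS SDK's update_function_configuration call. The sketch below only builds the request parameters; the function name and the replacement model are illustrative assumptions, not values from the project.

```python
def build_model_update(function_name: str, gluon_model: str, dataset: str) -> dict:
    """Build the kwargs for boto3's lambda_client.update_function_configuration,
    swapping the model the function loads at runtime."""
    return {
        "FunctionName": function_name,
        "Environment": {
            "Variables": {
                "GLUON_MODEL": gluon_model,
                "DATASET": dataset,
            }
        },
    }


# Usage: boto3.client("lambda").update_function_configuration(**params)
params = build_model_update("InferenceFunction", "yolo3_darknet53_coco", "True")
```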
The model used in this example (faster_rcnn_fpn_syncbn_resnest269_coco) is rather large. It is trained on the 80 categories in the COCO 2017 training set and has very high accuracy. When downloaded to disk, it takes up about 503 MB. When we try to load this model in a Lambda Function with the default 512 MB scratch space, it leads to the following error.
```
[ERROR] OSError: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/var/task/", line 19, in handler
    model = model_zoo.get_model(
  File "/var/task/gluoncv/model_zoo/", line 413, in get_model
    net = _models[name](**kwargs)
```
But after setting the new EphemeralStorage value to 1 GB in CDK, downloading the model succeeded without a hitch.
```python
# Set the ephemeral storage size (not supported in L2 constructs yet)
inference_function_l1: lambda_.CfnFunction = (
    inference_function.node.default_child
)
inference_function_l1.add_property_override(
    property_path="EphemeralStorage", value={"Size": 1024}
)
```
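Before downloading a model, a guard like the following could verify that the configured scratch space is actually sufficient. This is a generic sketch, not part of the project's code; shutil.disk_usage is standard library.

```python
import shutil


def fits_on_disk(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


# A ~503 MB model leaves almost no headroom in the default 512 MB /tmp,
# so checking before the download fails is cheaper than catching OSError.
MODEL_SIZE = 503 * 1024 * 1024
```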

Lambda Container Images

The Lambda Function uses the MXNet, Gluon, and Matplotlib frameworks. These libraries are far too big for a Lambda Layer, which is bound by the 250 MB (unzipped) deployment package limit. To make the packages available to the Lambda Function, we use Lambda Container Images, introduced in December 2020. These container images can hold up to 10 GB of application code and dependencies, which is more than enough for the roughly 700 MB used by this function.
Creating a Lambda Container Image in CDK is very easy. All we need is a Dockerfile for our Function and a DockerImageCode in CDK. The Dockerfile looks like this:
```dockerfile
# AWS-provided Python base image for Lambda
FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
RUN yum install libgomp libquadmath -y

# Copy function code
COPY main.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler
CMD [ "main.handler" ]
```
As you can see, we install the dependencies from requirements.txt and add a few additional packages with yum. In CDK, we only need to reference the directory containing the Dockerfile:
```python
inference_function = lambda_.DockerImageFunction(
    scope=self,
    id="InferenceFunction",
    code=lambda_.DockerImageCode.from_image_asset("lambda/functions/inference"),
    environment={
        "MPLCONFIGDIR": "/tmp",
        "GLUON_MODEL": "faster_rcnn_fpn_syncbn_resnest269_coco",
        "DATASET": "b7d778f5",
        "BUCKET": input_output_bucket.ref,
    },
    memory_size=4096,
    timeout=cdk.Duration.minutes(2),
)
```
And voilĂ , the CDK will take care of the rest for us. When we execute a cdk synth or cdk deploy, it will use the Dockerfile to build a container. It will then upload the container to ECR, and finally, it will deploy the Lambda Function by referencing the container it just uploaded.

Timing and performance

Of course, the Lambda Function has to spend some time loading the model when a new execution environment is being spun up. Luckily, the Gluon models are stored on S3, so loading a 500 MB model generally takes between 17 and 20 seconds. This is the cold start duration for new inferences. With the Lambda Function configured at 4096 MB of memory (and the associated three vCPUs), inference itself takes an average of 27.33 seconds per image.
When configured at 10240 MB of memory, and thus with six unthrottled vCPUs, inference takes an average of 11.55 seconds per image. At 8846 MB (the lowest value at which a Lambda Function gets six vCPUs), inference is equally fast: there is no real performance difference between the 10240 MB and 8846 MB configurations.


Because we run the Lambda Function from a container image, we do not get a free init phase. As such, the number of inferences per execution environment influences the cost of the solution: if an execution environment is used for only one inference, you pay for the cold start plus that single inference. If an execution environment is used 100 times, you pay for a single cold start plus 100 inferences, which results in a lower average cost per inference. In the table below you can find an overview of inference cost at different configurations. All costs are calculated with Ireland (eu-west-1) prices.
| Storage | Memory | Inferences per execution environment | Execution $ / 1000 inferences | Storage $ / 1000 inferences | Total $ / 1000 inferences |
| --- | --- | --- | --- | --- | --- |
| 1 GB | 4096 MB | 100 | $1.8361 | $0.0019 | $1.8379 |
| 1 GB | 4096 MB | 50 | $1.8500 | $0.0019 | $1.8519 |
| 1 GB | 4096 MB | 10 | $1.9618 | $0.0020 | $1.9638 |
| 1 GB | 4096 MB | 1 | $3.2190 | $0.0033 | $3.2223 |
| 1 GB | 8846 MB | 100 | $1.6934 | $0.0017 | $1.6951 |
| 1 GB | 8846 MB | 50 | $1.7236 | $0.0018 | $1.7253 |
| 1 GB | 8846 MB | 10 | $1.9649 | $0.0020 | $1.9669 |
| 1 GB | 8846 MB | 1 | $4.6801 | $0.0048 | $4.6848 |
| 2 GB | 4096 MB | 100 | $1.8361 | $0.0056 | $1.8417 |
| 2 GB | 4096 MB | 50 | $1.8500 | $0.0057 | $1.8557 |
| 2 GB | 4096 MB | 10 | $1.9618 | $0.0060 | $1.9678 |
| 2 GB | 4096 MB | 1 | $3.2190 | $0.0099 | $3.2288 |
| 2 GB | 8846 MB | 100 | $1.6934 | $0.0052 | $1.6986 |
| 2 GB | 8846 MB | 50 | $1.7236 | $0.0053 | $1.7289 |
| 2 GB | 8846 MB | 10 | $1.9649 | $0.0060 | $1.9709 |
| 2 GB | 8846 MB | 1 | $4.6801 | $0.0143 | $4.6944 |
| 10 GB | 4096 MB | 100 | $1.8361 | $0.0356 | $1.8716 |
| 10 GB | 4096 MB | 50 | $1.8500 | $0.0359 | $1.8859 |
| 10 GB | 4096 MB | 10 | $1.9618 | $0.0380 | $1.9998 |
| 10 GB | 4096 MB | 1 | $3.2190 | $0.0624 | $3.2814 |
| 10 GB | 8846 MB | 100 | $1.6934 | $0.0328 | $1.7262 |
| 10 GB | 8846 MB | 50 | $1.7236 | $0.0334 | $1.7570 |
| 10 GB | 8846 MB | 10 | $1.9649 | $0.0381 | $2.0030 |
| 10 GB | 8846 MB | 1 | $4.6801 | $0.0907 | $4.7708 |
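The amortization effect above can be sketched in a few lines of Python. The per-second price is the Ireland x86 rate; the cold start and inference durations below are illustrative assumptions in the range measured earlier, not the exact figures behind the table.

```python
GB_SECOND_PRICE = 0.0000166667  # $ per GB-second, x86, eu-west-1
MEMORY_GB = 4.096               # 4096 MB configuration
COLD_START_S = 18.0             # assumed model download time (cold start)
INFERENCE_S = 27.33             # assumed inference duration at 4096 MB


def cost_per_1000_inferences(inferences_per_environment: int) -> float:
    """Amortize one cold start over N inferences and scale to 1000 inferences."""
    per_second = MEMORY_GB * GB_SECOND_PRICE
    amortized_cold_start = COLD_START_S * per_second / inferences_per_environment
    return (amortized_cold_start + INFERENCE_S * per_second) * 1000


# The more inferences an execution environment serves, the lower the
# amortized cost per inference.
```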

Lambda Inference versus SageMaker Serverless Inference

Fun fact: the price for serverless, event-driven inference in Lambda Functions looks to be cheaper than Amazon SageMaker Serverless Inference (preview). The latter is priced at $0.000080 per second at 4096 MB, which translates to $2.1864 per 1000 inferences, versus $1.8361 in the table above.
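The comparison works out as follows, using the 27.33-second average inference duration measured earlier at 4096 MB:

```python
SAGEMAKER_PRICE_PER_SECOND = 0.000080  # $ at 4096 MB, Serverless Inference (preview)
INFERENCE_DURATION_S = 27.33           # average inference duration at 4096 MB

# Cost of 1000 inferences on SageMaker Serverless Inference
sagemaker_per_1000 = SAGEMAKER_PRICE_PER_SECOND * INFERENCE_DURATION_S * 1000
# → $2.1864, versus $1.8361 for the equivalent Lambda configuration
```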

General observations on Ephemeral Storage pricing

The pricing for both memory/CPU and ephemeral storage is based on GB-seconds: you pay for the configured value in gigabytes, multiplied by the number of seconds your functions run. This leads to a pricing formula with the following variables:
  • Configured amount of memory in GB (M)
  • Memory price (Mp), which is $0.0000166667 for every GB-second (for x86 Functions in Ireland)
  • Configured amount of ephemeral storage in GB (S)
  • Ephemeral storage price (Sp), which is $0.000000034 for every GB-second (in Ireland)
  • Execution time in seconds (X)
The formula for Lambda cost is then: cost = (M * Mp + (S - 0.5) * Sp) * X. The 0.5 is subtracted because the first 512 MB of ephemeral storage is free. We purposely left out the request pricing ($0.20 per 1M requests) because it is not influenced by the memory / storage configuration.
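The formula translates directly into code. A minimal sketch with the Ireland x86 prices quoted above:

```python
MP = 0.0000166667      # $ per GB-second of memory (x86, Ireland)
SP = 0.000000034       # $ per GB-second of ephemeral storage (Ireland)
FREE_STORAGE_GB = 0.5  # the first 512 MB of ephemeral storage is free


def lambda_cost(memory_gb: float, storage_gb: float, seconds: float) -> float:
    """cost = (M * Mp + (S - 0.5) * Sp) * X, excluding request pricing."""
    return (memory_gb * MP + (storage_gb - FREE_STORAGE_GB) * SP) * seconds


# Example: 4 GB memory, 1 GB storage, 1,000,000 seconds of execution time
total = lambda_cost(4.0, 1.0, 1_000_000)
```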
In the table below you can find the execution prices with various memory and storage configurations, at Ireland pricing, for 1,000,000 seconds of execution time.
| Mem / CPU | Storage | Mem / CPU $ | Storage $ | Total $ | Mem / CPU $% | Storage $% |
| --- | --- | --- | --- | --- | --- | --- |
This table clearly shows that ephemeral storage is priced much lower than function memory. In the most extreme case (512 MB of memory, 10 GB of storage) the storage pricing is 3.73% of the total execution cost. When memory and storage are configured as equal values, the storage component is only 0.19% of the total execution cost.
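Both percentages can be reproduced from the same pricing formula. A sketch using the Ireland rates quoted above (the 512 MB free tier is subtracted from storage; the "equal values" case is illustrated with 10 GB of memory and 10 GB of storage):

```python
MP = 0.0000166667  # $ per GB-second of memory (x86, Ireland)
SP = 0.000000034   # $ per GB-second of ephemeral storage (Ireland)


def storage_share(memory_gb: float, storage_gb: float) -> float:
    """Fraction of the execution cost attributable to ephemeral storage."""
    mem_cost = memory_gb * MP
    storage_cost = (storage_gb - 0.5) * SP  # first 512 MB is free
    return storage_cost / (mem_cost + storage_cost)


extreme = storage_share(0.5, 10.0)  # 512 MB memory, 10 GB storage → ~3.73%
equal = storage_share(10.0, 10.0)   # memory and storage both 10 GB → ~0.19%
```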


In this Bite we have seen how the new configurable Lambda Ephemeral Storage allows us to store large ML models on the Function's scratch disk. This enables new use cases, such as dynamically loading models into our Lambda Function. Depending on your use case, the performance of ML inference in Lambda can be quite acceptable. For example, 11 seconds of inference for a complex and broad model can have its place in many automated batch processes. Of course, it's a bit slow when users are synchronously waiting on a response, so in these use cases you're probably better off with a smaller, more specialized model with lower latency. These models don't require a larger scratch disk, so this article likely doesn't apply to them.
In the pricing section, we've seen that the larger ephemeral storage has no significant impact on pricing. Considering that processes which use a lot of scratch storage will generally also need some CPU and memory, it's safe to assume that the lion's share of their cost will be attributed to execution time, not storage.

CDK Project

The services and code described in this Bite are available as a Python AWS Cloud Development Kit (CDK) project. Within the project, execute a cdk synth to generate CloudFormation templates. Then deploy these templates to your AWS account with a cdk deploy. For your convenience, ready-to-use CloudFormation templates are also available in the cdk.out folder. For further instructions on how to use the CDK, see Getting started with the AWS CDK.
