Gradual rollouts with AWS Lambda

Konstantin Borisov

Backend developer

When deploying a new version of your product's backend, the worst-case scenario is that the new version has a bug that breaks all of your product's clients. An effective way to mitigate this risk is to use a gradual rollout deployment strategy.

In this post, we'll demonstrate a modern way to implement gradual rollout in AWS, by deploying a sample CDK app to AWS Lambda.

How it works

When we deploy a new version of the Lambda function, the new version will only get a small percentage of the overall traffic. If the new version does not produce errors, it will eventually be rolled out to handle all incoming traffic.

A gradual rollout requires a full copy of the entire system (or at least the entire workload) and a mechanism for switching traffic between the new deployment and the stable deployment. This can be tricky, even when using Infrastructure as Code, due to various system limitations and the orchestration required. However, AWS Lambda has a few built-in features which make gradual rollouts much simpler.

Specifically, AWS Lambda supports versioning of functions and weighted traffic distribution between versions. Additionally, AWS CodeDeploy supports Lambda deployment groups to automate the deployment process. Let's jump right into the code to see how we can achieve this.
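To make the idea of weighted traffic distribution concrete, here is a minimal sketch of how a weighted choice between two versions could behave. It is a plain TypeScript model for illustration only, not how Lambda implements routing internally; the names `AliasWeights` and `pickVersion` are hypothetical.

```typescript
// Hypothetical model of a weighted alias: version label -> share of traffic (0..1).
type AliasWeights = Record<string, number>;

// Pick the version that serves one request. `roll` is a number in [0, 1),
// e.g. Math.random(); it is a parameter so the function is deterministic in tests.
function pickVersion(weights: AliasWeights, roll: number): string {
  let cumulative = 0;
  let lastVersion = "";
  for (const [version, weight] of Object.entries(weights)) {
    cumulative += weight;
    lastVersion = version;
    if (roll < cumulative) {
      return version;
    }
  }
  // Guard against floating-point drift: fall back to the last version.
  return lastVersion;
}

// Example: the stable version keeps 90% of requests, the new one gets 10%.
const weights: AliasWeights = { stable: 0.9, canary: 0.1 };
const counts: Record<string, number> = { stable: 0, canary: 0 };
for (let i = 0; i < 10000; i++) {
  counts[pickVersion(weights, Math.random())] += 1;
}
console.log(counts); // approximately a 90/10 split
```

With weights like these, only about one request in ten reaches the new version, so a bug in it affects a small share of clients.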

Code

Here is a sample AWS CDK stack that contains a single Lambda function:

deployment_config-stack.ts
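import * as cdk from "aws-cdk-lib";
import { aws_cloudwatch, aws_codedeploy, aws_lambda as lambda } from "aws-cdk-lib";
import { Construct } from "constructs";

export class DeploymentConfigStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // The Lambda function whose versions will be rolled out gradually
    const myLambda = new lambda.Function(this, "Lambda", {
      runtime: lambda.Runtime.NODEJS_20_X,
      code: lambda.Code.fromAsset("my-lambda"),
      handler: "my-lambda.handler",
    });
  }
}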

The code for the Lambda function is simple:

my-lambda/my-lambda.ts
exports.handler = async function (event: any) {
  // Generate an error conditionally to test the rollback
  if (event['error']) {
    throw new Error('Test error')
  }
  return {statusCode: 200, body: `Hello, CDK! You've hit a lambda`}
}

The Lambda code allows us to purposely trigger an execution error, which will allow us to emulate an incorrect deployment.
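Before deploying, the error switch can be checked locally. The sketch below repeats the handler logic with a narrow event type and calls it with and without the error key; the `invoke` helper and the `TestEvent` type are hypothetical names introduced only for this example.

```typescript
// Same logic as my-lambda/my-lambda.ts, with a narrow event type for local use.
type TestEvent = { error?: unknown };

async function handler(event: TestEvent): Promise<{ statusCode: number; body: string }> {
  // Generate an error conditionally to test the rollback
  if (event.error) {
    throw new Error("Test error");
  }
  return { statusCode: 200, body: "Hello, CDK! You've hit a lambda" };
}

// Hypothetical helper: run the handler and report the status or the error message.
async function invoke(event: TestEvent): Promise<string> {
  try {
    const response = await handler(event);
    return `ok ${response.statusCode}`;
  } catch (err) {
    return `failed: ${(err as Error).message}`;
  }
}

invoke({}).then((result) => console.log(result)); // ok 200
invoke({ error: "any value" }).then((result) => console.log(result)); // failed: Test error
```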

The following code defines a Lambda deployment group for the new deployment, which we'll call the canary deployment. (Don't worry if you don't understand exactly what this code is doing — we'll explain it in a bit.)

deployment_config-stack.ts
configureDeployment(params: {
  lambda: lambda.Function;
  deploymentConfig?: aws_codedeploy.ILambdaDeploymentConfig;
  deploymentGroupName?: string;
}): lambda.QualifiedFunctionBase {
  // Version that represents the function's current code and configuration
  const newVersion = params.lambda.currentVersion;

  // Alias that clients call; CodeDeploy shifts its traffic between versions
  const alias = new lambda.Alias(params.lambda, "CanaryAlias", {
    aliasName: "live",
    version: newVersion,
  });

  new aws_codedeploy.LambdaDeploymentGroup(params.lambda, "LambdaDeploymentGroup", {
    alias,
    deploymentConfig: params.deploymentConfig,
    // If this alarm fires during the deployment, the deployment is rolled back
    alarms: [
      new aws_cloudwatch.Alarm(params.lambda, "DeploymentAlarm", {
        metric: alias.metricErrors({ period: cdk.Duration.minutes(1) }),
        threshold: 10,
        alarmDescription:
          `${params.lambda.functionName} ${newVersion.version}` +
          " canary deployment failure alarm",
        evaluationPeriods: 1,
      }),
    ],
  });

  return alias;
}

There are three main parts here:

  1. A Lambda alias named “live”. Clients call the Lambda through this alias, and it handles the traffic switching between versions for us.
  2. A Lambda deployment group. This is a CodeDeploy resource that takes over during deployment and shifts the alias's traffic from the old version to the new one. The crucial part here is the deploymentConfig parameter, which sets the deployment strategy and speed. For example, with LambdaDeploymentConfig.LINEAR_10PERCENT_EVERY_1MINUTE, traffic shifts to the new version in 10% steps, one step per minute, so the rollout takes about 10 minutes. To roll out a new version quickly (for a hotfix or in a development environment), you can use LambdaDeploymentConfig.ALL_AT_ONCE, which switches all traffic immediately.
  3. An alarm to automatically roll back the deployment. This is a CloudWatch alarm on the alias's error count, evaluated over 1-minute periods. If the number of errors in a period reaches the threshold, the alarm fires and the deployment is rolled back.
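The alarm condition from item 3 can be sketched as a small pure function. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes the comparison is "greater than or equal to the threshold" (the CDK default when no comparison operator is given) and ignores missing data points. The name `isInAlarm` is hypothetical.

```typescript
// Simplified model: the alarm fires when the error count reaches the threshold
// in each of the most recent `evaluationPeriods` one-minute periods.
// Assumption: ">=" comparison, and no special handling of missing data.
function isInAlarm(
  errorsPerMinute: number[],
  threshold: number,
  evaluationPeriods: number
): boolean {
  if (evaluationPeriods < 1 || errorsPerMinute.length < evaluationPeriods) {
    return false; // not enough data to evaluate
  }
  const recent = errorsPerMinute.slice(-evaluationPeriods);
  return recent.every((errors) => errors >= threshold);
}

// With threshold 10 and evaluationPeriods 1, as in the stack above:
console.log(isInAlarm([0, 3, 12], 10, 1)); // true: 12 errors in the latest minute
console.log(isInAlarm([12, 3], 10, 1)); // false: only 3 errors in the latest minute
```

A lower threshold makes rollbacks more sensitive; a higher one tolerates a background level of errors.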

Gradual rollout in action

Let's deploy our code to AWS to see how it works. There is no change in the deployment process; we still run the standard cdk deploy command. However, as it runs, you might notice a difference: the stack now takes much longer than usual to deploy. Why? Because the deployment now involves a gradual traffic shift, and CloudFormation will keep the stack in the UPDATE_IN_PROGRESS state until the deployment completes.

If you use the LambdaDeploymentConfig.LINEAR_10PERCENT_EVERY_1MINUTE setting, then the deployment of a single Lambda will take an additional 10 minutes. It will start with 10% of Lambda requests going to the new version, then increase to 20% after a minute, and so on.
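The traffic schedule described above can be written out as a short calculation. This is an illustrative sketch of a linear schedule, not CodeDeploy code; `linearSchedule` and the `Step` type are hypothetical names.

```typescript
type Step = { minute: number; newVersionPercent: number };

// Build the sequence of traffic shifts for a linear rollout: the new version
// starts at `stepPercent` and gains `stepPercent` more every `intervalMinutes`.
function linearSchedule(stepPercent: number, intervalMinutes: number): Step[] {
  if (stepPercent <= 0 || intervalMinutes <= 0) {
    throw new Error("stepPercent and intervalMinutes must be positive");
  }
  const steps: Step[] = [];
  let percent = 0;
  let minute = 0;
  while (percent < 100) {
    percent = Math.min(100, percent + stepPercent);
    steps.push({ minute, newVersionPercent: percent });
    minute += intervalMinutes;
  }
  return steps;
}

// 10% every minute: 10% at minute 0, 20% at minute 1, ..., 100% at minute 9.
const schedule = linearSchedule(10, 1);
console.log(schedule.length); // 10
console.log(schedule[2]); // { minute: 2, newVersionPercent: 30 }
```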

We can open the Lambda in the AWS Management Console and look at the result. First, we can see the live alias, which should now be used for all calls. It points to two versions simultaneously:

Figure: Two active versions for the alias

If we check the alias, we will be able to see the weights:

Figure: Version weights

Here, you can see that the latest version #32 receives 30% of all the requests, while the stable version #30 currently gets 70%.

Please note that there is a version #31, but it is not in use. Why? Because it was part of a previous canary Lambda deployment that was rolled back. The live alias still points to the stable version #30, but if the new version #32 proves to be stable, it will become the new stable version and receive 100% of the traffic.

We can observe that the deployment time is just over 10 minutes. The deployment itself took only about 2 seconds, and the rest of the time was spent on the gradual traffic shift:

Figure: Deployment time

One might wonder why it takes 10 minutes, even though we start from 10% rather than 0%. This is because even when 100% of the traffic has been switched to the new version, CodeDeploy waits for one more minute to ensure the version is stable. If errors occur during that minute, even the version with 100% of the traffic will be reverted.

After the deployment is completed, we can see that all the traffic goes to the new version, #32:

Figure: Fully deployed version

Automatic rollback

Even with just the gradual deployment itself, you are safer. During the deployment, you can monitor logs and cancel the update if something looks wrong. But our example includes more than that; we have configured automatic rollback if too many errors occur.

In our Lambda function, there is a conditional that throws an error if the error key is present in the payload. For this test, the threshold in the CloudWatch alarm configuration is set to 1 (rather than the 10 shown in the stack code above), so a single error will trigger a rollback.

Now let's do a deployment and, while it's running, go to the live alias and test it with a payload that contains an error key:

payload
{ "error": "any value" }

This will cause Lambda to fail with our custom error:

Figure: Test error in the Lambda function

The CloudWatch alarm, which has a threshold of 1 error per minute, goes to the “In alarm” state:

Figure: Alarm state

This triggers a rollback of the stack. Note that the alarm does not fire immediately; you will need to wait about a minute to see the effect:

Figure: Rollback after failure

And that's basically it. We have a gradual rollout of new Lambda versions and an automatic rollback in the case of an error!

Wrapping up

This article demonstrated how to reduce the risk when deploying new versions of your AWS Lambda functions, using a gradual rollout deployment. If your application's real infrastructure includes more than just Lambda functions (e.g. containers running on ECS), you should consider implementing a gradual rollout for those resources as well.

One cool thing that gradual rollouts enable is the ability to deploy to all of your environments (test, staging, production, etc.) from a single CI pipeline. If the rollout succeeds in test, the pipeline begins to deploy to staging. If that succeeds as well, the pipeline deploys to production. This deployment strategy allows you to get code into production faster, while keeping the risk low.
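The promotion logic of such a pipeline can be sketched as a loop that stops at the first failed rollout. This is only a model of the control flow; `promote` and the `Deploy` callback are hypothetical, and in a real pipeline the callback would run the deployment for that environment and report whether it completed or was rolled back.

```typescript
// Hypothetical callback: deploys to one environment, returns true on success
// and false if the rollout failed or was rolled back.
type Deploy = (environment: string) => boolean;

// Deploy to each environment in order; stop at the first failure so that a
// bad version never reaches the later environments.
function promote(environments: string[], deploy: Deploy): string[] {
  const deployed: string[] = [];
  for (const environment of environments) {
    if (!deploy(environment)) {
      break;
    }
    deployed.push(environment);
  }
  return deployed;
}

// Example: the rollout is rolled back in staging, so production is never touched.
const reached = promote(["test", "staging", "production"], (env) => env !== "staging");
console.log(reached); // [ 'test' ]
```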