OpenTelemetry tracing for Java AWS Lambda
Introduction
In modern cloud-native applications, observability is crucial for understanding your services' health, performance, and usage patterns. Distributed tracing is one of the most important observability techniques, as it allows you to visualize and monitor the flow of requests across microservices, functions, and other components in your architecture.
For AWS Lambda applications, which are stateless and ephemeral, distributed tracing becomes particularly useful to track the lifecycle of requests and troubleshoot performance issues. OpenTelemetry, an open-source, vendor-agnostic framework, offers a powerful solution for capturing and sending trace data.
In this post, we will explore how to set up OpenTelemetry tracing in Java-based AWS Lambda functions, compare different approaches, measure their performance, and arrive at the most efficient solution.
Why Use OpenTelemetry Tracing for AWS Lambda?
AWS Lambda provides a highly scalable, event-driven platform for running stateless functions. However, Lambda functions' stateless and ephemeral nature presents challenges for observability.
By using OpenTelemetry for tracing, you can:
- Track end-to-end request flows: Monitor the complete lifecycle of an invocation, even if it spans multiple Lambda functions or services.
- Debug cold starts: Capture the latency of Lambda cold starts and optimize resource allocation.
- Correlate traces across services: Get a unified view of interactions between Lambda functions, other AWS services (e.g., DynamoDB, SQS), and external services.
- Identify performance bottlenecks: Visualize latency and bottlenecks across the application lifecycle.
Prerequisites
For this research, I installed the OpenTelemetry Collector on an EC2 instance in a public subnet and exposed the gRPC and HTTP protocols:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317" # Standard OTLP gRPC port
      http:
        endpoint: "0.0.0.0:4318" # Standard OTLP HTTP port
Our Lambda will send traces to the OpenTelemetry Collector, and the collector will then push them to a backend (in my case, Datadog).
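The snippet above only defines receivers; a working collector config also needs an exporter and a pipeline wiring them together. Here is a minimal sketch of what the full file could look like (the Datadog exporter settings are assumptions for my setup; swap in the exporter for your backend):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]  # note: no processors, so spans are forwarded synchronously
```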
All the source code is available in my GitHub repository.
Approach 1: OpenTelemetry Lambda layer
OpenTelemetry already provides Lambda layers for almost all runtimes. Let's try it!
The OpenTelemetry Lambda instrumentation and the OpenTelemetry SDK are bundled into the java/lib directory so that they are available on the classpath of the Lambda function. No code change is needed to instrument the execution of your function, but you will need to set the AWS_LAMBDA_EXEC_WRAPPER environment variable to point to the appropriate wrapper for your handler type.
I'm using Terraform, and to add this layer we just need two environment variables and one layer:
resource "aws_lambda_function" "example_lambda_with_layer" {
  function_name    = var.function-name
  source_code_hash = filebase64sha256(local.lambda_payload_filename)
  runtime          = "java21"
  handler          = "com.filichkin.lambda.SampleHandler::handleRequest"
  filename         = local.lambda_payload_filename
  role             = aws_iam_role.lambda_role.arn
  memory_size      = 256
  timeout          = 20

  environment {
    variables = {
      AWS_LAMBDA_EXEC_WRAPPER     = "/opt/otel-handler"
      OTEL_EXPORTER_OTLP_ENDPOINT = "http://my-ec2-public-ip:4317"
    }
  }

  layers = [
    "arn:aws:lambda:eu-west-1:901920570463:layer:aws-otel-java-wrapper-amd64-ver-1-32-0:3"
  ]
}
Let's deploy a HelloWorld Lambda and check the result:
public class SampleHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent apiGatewayProxyRequestEvent, Context context) {
        System.out.println("lambda is invoked");
        APIGatewayProxyResponseEvent apiGatewayProxyResponseEvent = new APIGatewayProxyResponseEvent();
        apiGatewayProxyResponseEvent.setBody("hello-world");
        return apiGatewayProxyResponseEvent;
    }
}
Result
The traces were sent with 0 code changes.
But
Cold start: 8 seconds for a 256 MB Lambda!
Warm start: 65 ms on average for a 256 MB Lambda
Is the problem the OpenTelemetry layer?
Yes: without the layer, the cold start is about 600 ms.
Approach 2: Lambda + OpenTelemetry Java SDK
To get started with OpenTelemetry tracing in Java AWS Lambda, you need to add the necessary dependencies.
1. Adding Dependencies
I highly recommend using the BOM for OpenTelemetry dependencies, especially when you instrument several different libraries:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry.instrumentation</groupId>
      <artifactId>opentelemetry-instrumentation-bom</artifactId>
      <version>2.10.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
Then we need to add two dependencies:
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-sdk</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
2. Initializing OpenTelemetry SDK
You’ll need to set up the OpenTelemetry SDK to capture traces and send them to your desired backend. Below is an example of how to initialize the OpenTelemetry SDK in an AWS Lambda function.
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class SampleHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    private static final String OTEL_ENDPOINT = "http://your-otel-host:4317";

    static {
        // Set up the OpenTelemetry SDK
        BatchSpanProcessor spanProcessor = BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder().setEndpoint(OTEL_ENDPOINT).build()).build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(spanProcessor).build();
        OpenTelemetrySdk.builder().setTracerProvider(tracerProvider).buildAndRegisterGlobal();
    }

    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("LambdaTracer");

    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent apiGatewayProxyRequestEvent,
                                                      com.amazonaws.services.lambda.runtime.Context context) {
        // Start a span to trace the Lambda invocation
        Span span = tracer.spanBuilder("LambdaInvocation").startSpan();
        try {
            APIGatewayProxyResponseEvent apiGatewayProxyResponseEvent = new APIGatewayProxyResponseEvent();
            apiGatewayProxyResponseEvent.setBody("hello world");
            return apiGatewayProxyResponseEvent;
        } finally {
            span.end(); // End the span
        }
    }
}
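One thing worth improving in the snippet above is the hardcoded endpoint. A small sketch of reading it from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable with a local fallback (the OtelEndpoint class and its names are my own illustration, not part of the SDK):

```java
// Sketch: resolve the collector endpoint from the environment instead of hardcoding it.
final class OtelEndpoint {
    static final String DEFAULT_ENDPOINT = "http://localhost:4317";

    // Split out for testability: the raw value normally comes from System.getenv(...)
    static String resolve(String raw) {
        return (raw == null || raw.isBlank()) ? DEFAULT_ENDPOINT : raw;
    }

    static String resolve() {
        return resolve(System.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"));
    }
}
```

The static initializer can then call `OtelEndpoint.resolve()` instead of using the `OTEL_ENDPOINT` constant, which also matches the variable the Lambda layer approach already uses.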
The Lambda works fine; however, I don't see any traces.
Why doesn't the Lambda send traces?
Lambda freezes the container when no new requests are coming in. When this happens, all processes in the container are frozen (see cgroup freezer for details). If a process uses a timer to flush buffered data, the timer won't fire until the container thaws. The interval between the frozen and thawed states is unpredictable, ranging from seconds to hours, so the data buffered in the OpenTelemetry SDK and Collector will arrive with unpredictable latency.
The solution is to flush the data out before the Lambda function returns, since the container is never frozen in the middle of an invocation. To achieve this:
- The SDK side needs to call forceFlush() at the end of the Lambda invocation.
- The Collector needs to remove all processors from its config, keeping a single thread from the receiver to the exporter. This way, telemetry data is flushed to the backend synchronously at the end of every invocation; the side effect is slightly increased invocation time.
This is how it looks in the code:
private static final BatchSpanProcessor SPAN_PROCESSOR = BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder().setEndpoint(OTEL_ENDPOINT).build()).build();
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent apiGatewayProxyRequestEvent, com.amazonaws.services.lambda.runtime.Context context) {
// Start a span to trace the Lambda invocation
Span span = tracer.spanBuilder("LambdaInvocation").startSpan();
try {
APIGatewayProxyResponseEvent apiGatewayProxyResponseEvent = new APIGatewayProxyResponseEvent();
apiGatewayProxyResponseEvent.setBody("hello world");
return apiGatewayProxyResponseEvent;
} finally {
span.end();// End the span
SPAN_PROCESSOR.forceFlush().join(10, TimeUnit.SECONDS); //wait until spans are published, 10 sec max
}
}
After this change, we see the trace in our system:
Result:
Cold start: 3.8 s
Warm start: 7–10 ms
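As a side note, the end-the-span-then-flush step in the finally block will be repeated in every handler, so it can be factored into a tiny reusable wrapper. A stdlib-only sketch (the Traced class is hypothetical, not part of the OpenTelemetry SDK; in a real handler the finisher would be `() -> { span.end(); SPAN_PROCESSOR.forceFlush().join(10, TimeUnit.SECONDS); }`):

```java
import java.util.function.Supplier;

// Hypothetical helper: run the handler body, then always run the
// "end span + force flush" step, even if the body throws.
final class Traced {
    static <T> T callAndFlush(Supplier<T> body, Runnable finisher) {
        try {
            return body.get();
        } finally {
            finisher.run(); // spans are exported before Lambda can freeze the container
        }
    }
}
```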
Approach 3: Lambda + OpenTelemetry Java SDK + SnapStart + runtime hooks
Since we have plain Java code, we are free to use the SnapStart feature. I already have a post about it, which you can read here.
To enable SnapStart, we just need to add two parameters ("snap_start" and "publish = true") to our Terraform resource:
resource "aws_lambda_function" "lambda-snapstart" {
  function_name    = "${var.function-name}-snapstart"
  source_code_hash = filebase64sha256(local.lambda_payload_filename_snapstart)
  runtime          = "java21"
  handler          = "com.filichkin.lambda.SampleHandler::handleRequest"
  filename         = local.lambda_payload_filename_snapstart
  role             = aws_iam_role.lambda_role.arn
  memory_size      = 256
  timeout          = 20

  snap_start {
    apply_on = "PublishedVersions"
  }

  publish = true
}
In addition, we need to do priming in the Java code using runtime hooks.
Runtime hooks
These pre- and post-hooks give developers a way to react to the snapshotting process.
The Java managed runtime uses the open-source Coordinated Restore at Checkpoint (CRaC) project to provide hook support. The managed Java runtime contains a customized CRaC context implementation that calls your Lambda function’s runtime hooks before completing snapshot creation and after restoring the execution environment from a snapshot.
So let's implement the CRaC Resource interface and register the context in our constructor:
public class SampleHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent>, Resource {
    ...

    public SampleHandler() {
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Prime the tracing pipeline before the snapshot is taken
        Span span = tracer.spanBuilder("SnapStart").startSpan();
        span.end(); // End the span
        SPAN_PROCESSOR.forceFlush().join(10, TimeUnit.SECONDS);
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
    }
}
Result
Cold start: 670 ms
Warm start: 7 ms
Let's compare the three solutions:

Approach                            Cold start   Warm start
OpenTelemetry Lambda layer          ~8 s         ~65 ms
OpenTelemetry Java SDK              ~3.8 s       7–10 ms
SDK + SnapStart + runtime hooks     ~670 ms      ~7 ms
Summary
- The OpenTelemetry Lambda layer works with zero code changes, but it is extremely slow.
- Java has a stable OpenTelemetry SDK, but it adds about 3 seconds of latency to a simple Lambda's cold start.
- SnapStart, as expected, solves the cold-start problem and is the best solution for OpenTelemetry with Java AWS Lambda.