APM Guidance

Introduction

The objective of this guidance document is to educate the reader on the Application Performance Monitoring (APM) space. The goal is to provide enough material to have an informed conversation about it, select a product, and start a high-level implementation design.

Application Monitoring

There are several angles from which one can approach application monitoring, but all of them rely on the application being observable. The three main sources of information that make an application observable are the following (there is a much longer discussion here):

  • Application logs: contain the state of the app as well as the actions being undertaken.

  • Application metrics: contain various metrics about the technology stacks in which the app is running, as well as some app-specific metrics, if the app has been instrumented.

  • Tracing: contains information about how time is spent during the processing of a request.

Using these sources of information, we can build the following types of application monitoring:

  • Application liveness: determines whether the application is alive and responding appropriately. Used to set off alerts and start automatic recovery procedures.

  • Application SLA: determines whether the application is responding within the agreed SLAs. Sometimes this involves an external service that can generate requests from multiple geographies and accounts for lags introduced by going through the internet.

  • Application Performance Monitoring: provides introspection into the inner workings of the application. Generates a series of statistics on the general status of the runtime (memory used, threads, methods called, etc.). Can also be used to profile the application and find internal bottlenecks.

  • Distributed Tracing: provides visibility on how time is spent when processing a request. This is particularly useful in a microservices architecture, in which requests are processed by several microservices.

  • Business Activity Monitoring: provides visibility on business key performance indicators (KPIs) that are influenced by the given application. Can be used to create alerts in situations where the application is performing fine from a technical perspective (for example: the app is up and running and SLAs are honored) but not performing from a business perspective (for example: checks are not being processed).

This list is not exhaustive.

This document focuses on Application Performance Monitoring (APM) and Distributed Tracing, and on what changes in these spaces with the introduction of containers and microservices architectures.

Application Performance Monitoring

Application Performance Monitoring is about being able to answer these types of questions:

  • Why did the container blow up?

  • Why am I running out of memory?

  • Why is this use case taking so long to execute?

  • Why am I running out of JDBC connections?

APM is nowadays achieved by instrumenting the application so that it generates metrics about what is going on inside the process running it. Those metrics are collected into a central database, which is then used to build dashboards and drive alerting.

[Figure: APM architecture]

Normally an application is instrumented by adding a probe that captures the main application events and metrics. In Java this is done via an agent, usually activated with a command-line argument when the JVM starts. Other runtimes may have different approaches.
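To make the idea of a probe more concrete, here is a minimal sketch in Python using only the standard library. It is not how any particular APM agent works: the names `collect_runtime_metrics` and `CallTimer` are hypothetical, and a real agent would capture far more and ship the data to a collector rather than keep it in memory.

```python
import functools
import threading
import time
import tracemalloc


def collect_runtime_metrics():
    """Return a snapshot of a few process-level metrics as a flat dict."""
    # tracemalloc only reports allocations made after tracing was started.
    if not tracemalloc.is_tracing():
        tracemalloc.start()
    current_bytes, peak_bytes = tracemalloc.get_traced_memory()
    return {
        "timestamp_seconds": time.time(),
        "traced_memory_bytes": current_bytes,
        "traced_memory_peak_bytes": peak_bytes,
        "threads_active": threading.active_count(),
    }


class CallTimer:
    """Hypothetical probe: counts calls and accumulates time per operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = {}
        self.seconds = {}

    def timed(self, name):
        """Decorator that records every call of the wrapped function under `name`."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                started = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    elapsed = time.perf_counter() - started
                    with self._lock:
                        self.calls[name] = self.calls.get(name, 0) + 1
                        self.seconds[name] = self.seconds.get(name, 0.0) + elapsed
            return wrapper
        return decorator
```

A function decorated with `@timer.timed("checkout")` then shows up in `timer.calls` and `timer.seconds`, which is the kind of data that helps answer "why is this use case taking so long to execute?".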

In the container world there are two ways of installing APM tools.

Agent included in the container image

In this architecture the container image of the application that needs to be monitored has the agent onboard. Upon start of the container, the agent will be activated and start communicating with the metrics collector.

[Figure: container APM architecture 1, agent included in the container image]

Note that this approach requires specially crafted images and reduces the ability to reuse externally prepared images such as the s2i images from Red Hat.

Prometheus instrumentation

Prometheus is becoming a very popular metrics collector and going forward will be officially supported in OpenShift for collecting infrastructure-related metrics. A possible approach to APM is to have Prometheus also collect application metrics.

Spring Boot can expose metrics directly via the Actuator library. Integrations between Actuator and Prometheus exist in the open source space, for example here.

Prometheus has the ability to automatically start scraping pods that carry some specific annotations, so the setup for this configuration is minimal. On the other hand, Prometheus does not currently offer an APM-specific dashboard.
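When Prometheus scrapes a pod, it issues an HTTP request (by default to the `/metrics` path) and expects a plain-text response in its exposition format. The sketch below renders that format by hand with the Python standard library, purely to show what the scraper reads. The function names are hypothetical, and in practice an existing client library or the Actuator integration would produce this output.

```python
def escape_label_value(value):
    """Escape backslash, newline and double quote, as the text format requires."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace('"', '\\"')
    )


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort the labels so the output is stable between scrapes.
            label_text = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_metric("http_requests_total", "counter", "Total HTTP requests.", [({"method": "GET", "code": "200"}, 42)])` yields a `# HELP` line, a `# TYPE` line, and one sample line with the labels `code` and `method`.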

Agent deployed as a daemon set

In this approach the agents are deployed as daemon sets. The containers in this daemon set are privileged and can inspect the other containers running on the same node. Once a new container starts, the agent sends all available information of that container to the metrics collector.

[Figure: container APM architecture 2, agent deployed as a daemon set]

Note that in this case very little additional work is required in order to enable APM and that no additional image configuration is needed.

Market players

Major market players in APM are:

For more information about the APM tools landscape see Gartner’s “Magic Quadrant” report on APM tools (document code G00298377), available through the Gartner site.

It is important to be aware that there is a trend for pure infrastructure monitoring tools to move into the APM space and for APM tools to move into the infrastructure space. In fact, a new generation of tools tries to do both. Some examples are:

When choosing a product, you may want to consider whether it is part of the OpenShift Primed list.

Below is a series of links to help you get started with each of the mentioned products.

Distributed Tracing

Distributed tracing is about understanding how time is spent between all of the hops a request goes through in order to be completed. Distributed tracing is not new but it becomes more relevant in a microservices architecture as the average number of hops per request increases.
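The core data structure behind this is the span: a timed unit of work that carries a trace identifier shared by every hop of the same request, plus a reference to its parent span. The following is a minimal, single-threaded sketch of that idea in Python. The `Tracer` class and its field names are hypothetical and do not reproduce the OpenTracing or Jaeger APIs.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-memory tracer; not thread-safe, for illustration only."""

    def __init__(self):
        self.finished_spans = []
        self._stack = []  # spans currently open, innermost last

    @contextmanager
    def span(self, operation, trace_id=None):
        parent = self._stack[-1] if self._stack else None
        record = {
            # Child spans inherit the trace id; a root span gets a new one
            # unless the caller passes the id received from an upstream hop.
            "trace_id": parent["trace_id"] if parent else (trace_id or uuid.uuid4().hex),
            "span_id": uuid.uuid4().hex[:16],
            "parent_id": parent["span_id"] if parent else None,
            "operation": operation,
        }
        self._stack.append(record)
        started = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_seconds"] = time.perf_counter() - started
            self._stack.pop()
            self.finished_spans.append(record)
```

Wrapping each hop in `with tracer.span("..."):` produces a tree of spans. Across process boundaries, the trace id would be propagated with the request (for instance in an HTTP header) and passed in as `trace_id`.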

The current distributed tracing standard is OpenTracing, which was recently accepted by the CNCF and is therefore likely to be widely embraced.

The reference implementation of OpenTracing is Jaeger (also part of CNCF).

The general architecture of Jaeger is the following:

[Figure: Jaeger architecture]

A Jaeger client (client libraries exist for various languages) generates trace information and sends it over local UDP to a Jaeger agent running as a sidecar. The Jaeger agent performs the necessary sampling and throttling, then sends the traces to the central Jaeger collector, which stores them in Cassandra. Once the traces are stored, they can be visualized using the UI.
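The client-to-agent hop can be sketched as follows. This is a simplified illustration, not the Jaeger protocol: the payload here is JSON for readability, whereas the real client uses its own wire encoding, and the function names and the trace-id-based sampling rule are assumptions made for the example.

```python
import json
import socket


def send_span(span, agent_address):
    """Client side: fire-and-forget one span to the local agent over UDP."""
    payload = json.dumps(span).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, agent_address)
    return len(payload)


def should_sample(trace_id_hex, rate):
    """Deterministic probabilistic sampling keyed on the trace id.

    All spans of one trace get the same decision, so traces are kept
    or dropped as a whole. `rate` is a fraction between 0.0 and 1.0.
    """
    bucket = int(trace_id_hex[:8], 16) / 2**32  # value in [0, 1)
    return bucket < rate


def receive_span(agent_socket, sample_rate):
    """Agent side: read one datagram and return the span if it is sampled."""
    data, _sender = agent_socket.recvfrom(65535)
    span = json.loads(data.decode("utf-8"))
    return span if should_sample(span["trace_id"], sample_rate) else None
```

UDP keeps the client cheap: it never blocks waiting for the agent, at the cost of possibly losing spans, which is usually acceptable for tracing data.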

Instructions on how to install Jaeger in OpenShift can be found here.

The UI presents the results as follows:

[Figure: Jaeger UI]

Helloworld-MSA is an example of how to instrument an application with Jaeger.

If you are building a service mesh with Istio, note that Istio integrates with Jaeger out of the box, as explained here.