It’s often thought that the core value of Application Performance Monitoring (APM) lies with Development or Application Support teams. However, it is becoming an increasingly important tool for effective Capacity and Service Management on modern cloud infrastructure.

Capacity Management in the cloud can easily catch organisations out. Cloud platforms such as AWS give us far more flexibility and dynamic control over the scale (and therefore cost) of our hosted services, but with that flexibility comes complexity and more immediate decisions to make.

With inadequate monitoring, those decisions become risky enough that you will probably have to operate with a substantial buffer in the running capacity of your services. That buffer represents money you didn’t need to spend.

Combine this with the dynamic nature of cloud services and sizing (or oversizing) has a much more immediate effect on the operational cost of your environments. Simply put, the opportunity to capitalise on cost savings through optimisation is only as good as your visibility into the services you manage.

We at KCOM tell our customers that the standard high-level metrics from the cloud platform, or traditional OS-level metrics, may not be enough to make intelligent decisions.

Without APM it becomes very difficult to assess the full service impact of making capacity adjustments, because doing so means considering the full scope of a user’s interaction with your system and the interactions between all the services within it.

Even when using services such as AWS EC2 Auto Scaling, defining appropriate triggers and thresholds for scale-up and scale-down events, whilst considering every nuance of a user’s interaction with your service, can be very difficult. Monitoring and continually re-evaluating the success of those triggers against service growth and behavioural changes in either the application or the user base becomes an essential step in capacity and availability management.
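To make the trigger-and-threshold idea concrete, here is a minimal sketch of the kind of decision logic a scaling policy encodes. The metric names and threshold values are illustrative assumptions only; in practice an AWS Auto Scaling policy would be driven by CloudWatch alarms rather than application code, and the point of the sketch is that a user-facing signal (latency) is evaluated alongside the OS-level one (CPU).

```python
def scaling_decision(avg_cpu_pct, p95_latency_ms,
                     cpu_high=70.0, cpu_low=30.0, latency_high=500.0):
    """Return 'scale_up', 'scale_down', or 'hold' for one evaluation period.

    Latency is checked alongside CPU because user-facing impact is not
    always visible in OS-level metrics alone. All thresholds here are
    hypothetical examples, not recommendations for any specific service.
    """
    # Scale up if either the infrastructure metric or the user-experience
    # metric breaches its upper threshold.
    if avg_cpu_pct >= cpu_high or p95_latency_ms >= latency_high:
        return "scale_up"
    # Only scale down when both signals are comfortably low.
    if avg_cpu_pct <= cpu_low and p95_latency_ms < latency_high:
        return "scale_down"
    return "hold"
```

Re-evaluating the policy against behavioural change then amounts to replaying recent metric history through this logic and checking that the decisions it would have made still match what the service actually needed.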

It might all look fine in the load tests you run, but who’s been caught out before when your load tests don’t behave quite like a real user?

This all becomes even more relevant when deploying via a Continuous Integration and Continuous Deployment (CI/CD) process. Often it won’t be practical to perform a full-scale load test as part of the release process; at best, a performance test may be run in a lower environment as part of an automated test framework.

Perhaps a new release introduces a capacity bottleneck, such as a particularly heavy database query triggered by a specific user interaction, or a fault condition that only appears when full, distributed load is applied.

Traditional OS-level metrics such as CPU and memory utilisation may alert you to a symptom but not give you a clear indication of the cause. Correlation with more specific application monitoring is required to achieve this.

An APM tool will increase the chances of being able to quickly establish the root cause and therefore the actual change that caused the issue. In situations where your CI/CD model has a high cadence it’s quite possible that the problem might not have been introduced in the last iteration of the release cycle.

Some APM tools such as New Relic can be integrated within the release process itself and provide build markers on their monitoring graphs and visualisations to assist.
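As a sketch of what that integration can look like, the snippet below builds the JSON body for a New Relic deployment marker and shows, in a comment, how a CI step might POST it. The endpoint and payload shape follow New Relic's public REST API (v2) deployments documentation, but treat the details as assumptions and check the current docs; `app_id`, `api_key` and `sha` are hypothetical CI variables.

```python
import json  # used when serialising the payload for the POST shown below


def deployment_payload(revision, description, user):
    """Build the JSON body for a New Relic deployment marker."""
    return {"deployment": {
        "revision": revision,       # e.g. the git SHA from the CI build
        "description": description,
        "user": user,               # e.g. the CI service account
    }}


# In a CI step you might then POST it (requires a REST API key and the
# numeric application ID from New Relic):
#
#   requests.post(
#       f"https://api.newrelic.com/v2/applications/{app_id}/deployments.json",
#       headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
#       data=json.dumps(deployment_payload(sha, "release 1.2.3", "ci-bot")),
#   )
```

Once recorded, each release appears as a vertical marker on the application's charts, which makes it far easier to tie a change in behaviour back to the deployment that introduced it.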

At KCOM we use products such as New Relic and Datadog, but whatever tooling you select, it is clear that a service- and application-focused monitoring approach can yield significant benefits when operating cloud-hosted infrastructure.
