What is Observability?
Observability is defined as the ability of a system to provide information about its state to external requests. Over the last few decades, the growing complexity of software systems has increased the need for ways to bolster the observability of applications, microservices, clusters, and so on. The goal is to monitor the state of these applications to resolve unforeseen failures and to optimize performance. The optimizations can vary - some are aimed at cost reduction, some at application performance, and others at user experience. With the correct data, engineers, developers, SREs, and managers can make the necessary decisions to deliver business value.
Observability vs Monitoring
In conversation, observability and monitoring are often used interchangeably. However, there is a slight, yet significant, difference between the two. Monitoring typically involves instrumenting an application to collect specific data points. Observability, on the other hand, allows the user to “explore” data published by the application without the need to instrument it further. In other words, observability is the “openness” of an application, while monitoring is setting up data points to be extracted.
“You can increase the observability of an application by monitoring it.”
What is an Example of Observability?
Let’s use two examples of observability - one that is easy to understand from daily life, and another that applies to modern software applications.
The Observability of a Vehicle
A vehicle is equipped with several sensors that give the owner information about the system's current state. The dashboard has a gauge for fuel, speed, temperature, tire pressure, motor status, etc.
The driver can easily observe the gauges and make the right decision - “Should we stop for fuel in 10 miles?”, “What speed do we need to maintain on the highway?”, “We need to verify the rear left tire as the pressure warning is ON.”
The Observability of a Software Microservice
The analogy above applies to a microservice in the software engineering world. Once IT deploys the application into a cloud container, engineers can observe numerous metrics tied to the application, infrastructure, and external inputs/outputs.
The most basic metrics will convey the performance of an application - “Is it running, idle, stopped, faulted?”, “How many requests is the application getting, and how many is it able to process?”, “Which systems are sending requests to this application?” These data points are primarily meant for SREs monitoring the service and the underlying infrastructure.
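As a rough illustration of these basic metrics, the sketch below tracks the service state, the number of requests received and processed, and which systems are calling. The class and field names (ServiceMetrics, record_request, snapshot) are hypothetical and chosen only for this example; a real service would usually rely on a metrics library instead.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class ServiceState(Enum):
    # The four states mentioned in the text.
    RUNNING = "running"
    IDLE = "idle"
    STOPPED = "stopped"
    FAULTED = "faulted"


@dataclass
class ServiceMetrics:
    """Hypothetical in-memory holder for the most basic service metrics."""
    state: ServiceState = ServiceState.IDLE
    received: int = 0
    processed: int = 0
    callers: Counter = field(default_factory=Counter)

    def record_request(self, caller: str, ok: bool) -> None:
        # Every request counts as received; only successful ones as processed.
        self.state = ServiceState.RUNNING
        self.received += 1
        self.callers[caller] += 1
        if ok:
            self.processed += 1

    def snapshot(self) -> dict:
        # A plain dict is easy to expose through an endpoint or a log line.
        return {
            "state": self.state.value,
            "received": self.received,
            "processed": self.processed,
            "callers": dict(self.callers),
        }
```

An SRE-facing dashboard could poll snapshot() periodically; the gap between received and processed is a first hint that the service is falling behind.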
As mentioned above, observability metrics can extend to the code and allow developers to optimize the service. “Can we increase throughput?”, “Are we sending the right information to the API calls?”, “Can we package data differently and thus reduce the computing requirements of the service?”
What are the Benefits of Observability?
As outlined in our example above, observability is critical to the success of modern IT teams. However, the benefits it drives may vary based on the company's needs, the nature of the applications, and the role or team it is being presented to.
Application Performance Monitoring
Once an application is deployed to a staging environment, observability will provide metrics dictating code acceptance into production. Understanding how the application performs through various user-defined tests before it’s rolled out to production is essential. Once the application is exposed to real users, these metrics drive decisions of software teams in the first hours of deployment and of SRE teams long-term. They ensure the uptime and performance of the application - as mentioned above, they look for the state, requests, key metrics, and user data.
Infrastructure Monitoring
Various tools are available and are used in modern software deployment. Containers, load balancers, databases, machine learning modules, and hundreds of other services are available for software engineers to use in their applications. These infrastructure elements are critical to the success of software-driven companies and thus need to be monitored. Observability of infrastructure services allows engineers to understand the demand for the service, the scalability of the service, the need to purchase more / less compute power, the ability to deploy the application to other vital geographical areas of the business, etc.
Business Analytics
One of the most challenging elements to observe is business metrics. The questions answered here will depend on the nature of the application/service; some examples include “How long did the user take to complete the check-out process?”, “Which information was presented to the customer at the time of opening the application?”, and “What percentage of the users chose product B instead of product A?”. These questions are difficult to answer, as they can rarely be answered without context and industry/application knowledge. Engineers need to understand which metrics they’re looking for and how to properly instrument and capture the data to make sense of it. Furthermore, this data often needs to be manipulated before it’s presented to the intended audience, who may not have an extensive understanding of the application.
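To make two of these questions concrete, here is a minimal sketch that computes the median check-out time and the share of users who chose product B. The event records, their field names, and the numbers are invented purely for illustration; real data would come from whatever instrumentation the team has put in place.

```python
from statistics import median

# Hypothetical check-out events; the fields and values are made up.
events = [
    {"user": "u1", "checkout_seconds": 42.0, "product": "A"},
    {"user": "u2", "checkout_seconds": 58.0, "product": "B"},
    {"user": "u3", "checkout_seconds": 35.0, "product": "B"},
    {"user": "u4", "checkout_seconds": 65.0, "product": "A"},
]


def median_checkout_seconds(records):
    """Median time users took to complete the check-out process."""
    return median(r["checkout_seconds"] for r in records)


def share_choosing(records, product):
    """Percentage of users who chose the given product."""
    if not records:
        return 0.0
    chosen = sum(1 for r in records if r["product"] == product)
    return 100.0 * chosen / len(records)
```

The calculation itself is trivial; the hard part, as noted above, is knowing which events to capture and how to present the result to an audience that does not know the application.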
CI / CD Process Automation
The DevOps Pipeline is a process that involves software automation tools that take the code written by developers through a series of tests before moving it into production. An application's observability pairs well with several of these tools and facilitates the process of getting the code into production systems. As discussed above, this is accomplished by having metrics against which tests can be applied. The more tests we run, the more reliable the application, and the less time is spent troubleshooting, modifying, and redeploying.
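One way to picture "metrics against which tests can be applied" is a gate that compares measured values from a staging run with agreed limits. The sketch below is an assumption about how such a check might look; the function name gate_violations and the metric names are hypothetical and not tied to any particular CI/CD tool.

```python
def gate_violations(measured, limits):
    """Return a list of human-readable violations; empty means the gate passes.

    measured: metric name -> value observed in staging
    limits:   metric name -> maximum allowed value
    """
    violations = []
    for name, limit in limits.items():
        value = measured.get(name)
        if value is None:
            # A missing measurement is treated as a failure, not a pass.
            violations.append(f"{name}: no measurement")
        elif value > limit:
            violations.append(f"{name}: {value} exceeds limit {limit}")
    return violations
```

A pipeline step could call this after the staging tests and block promotion to production whenever the returned list is not empty.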
How is Observability Implemented?
Every software engineering organization strives to provide more data points about their deployments. However, there’s a point at which it starts to become overwhelming. The rule of thumb is to strive to capture and capitalize on the observability metrics you can use in your analysis without spending too much time trying to gather the ones you won’t use for the foreseeable future. This point is crucial as you converse about implementing observability with your team/organization.
The keywords of observability are metrics, logs, and traces. Metrics are key performance indicators of an application or microservice - they provide you with knowledge based on previous failures. In other words, they are typically implemented after you know about a potential failure point. Logs are the second most important tool for monitoring and observability. They typically provide a more granular view than metrics. However, the sheer volume of information can also be confusing. A long list of logs often needs to be analyzed by an expert to make sense of it and understand what led to the problem at hand. Traces are the last piece of the puzzle. The difference between traces and metrics or logs is that traces typically span multiple applications. A trace follows a message or user request through several different services or applications to capture and reveal information about its path, duration, requests, and so on. The goal of a trace is to understand and optimize a series of events rather than a single point.
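The idea of a trace following one request through several services can be sketched with a shared trace ID and a timed span per service. This is a simplified, in-process illustration using only the standard library; the two "services", the span helper, and the spans list are hypothetical stand-ins for real services and a real trace backend.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # stand-in for a trace backend; spans are appended as they finish


@contextmanager
def span(trace_id, service, operation):
    """Record how long one operation took as part of a given trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "service": service,
            "operation": operation,
            "duration_s": time.perf_counter() - start,
        })


def inventory_service(trace_id):
    with span(trace_id, "inventory", "reserve_item"):
        time.sleep(0.01)  # simulated work


def checkout_service(trace_id):
    # The same trace ID is handed to the downstream call.
    with span(trace_id, "checkout", "place_order"):
        inventory_service(trace_id)


trace_id = uuid.uuid4().hex
checkout_service(trace_id)
```

Because both spans carry the same trace ID, they can later be grouped to show the path of the request and where the time was spent, which is what distinguishes a trace from an isolated metric or log line.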
Now that we understand the three building blocks of observability, how are they implemented? Various tools can be used - some metrics and logs are provided out of the box by the services and infrastructure providers. For example, an AWS compute instance will provide metrics and logs to the user through Amazon CloudWatch. They can be accessed programmatically and used to understand the current state of the underlying asset and ways to optimize it. Next, we can look at the services running our application. Docker, Kubernetes, EKS, and other container tools provide additional information about the application running within them. At the application level, engineers can use various tools to instrument their code. This is done by running scripts that execute at specified times within the application and send data to the APIs defined by the developer. Lastly, traces are set up by teams across applications. There are tools to deploy traces, track their performance, and simplify how the tokens used to capture this data are passed between various services.
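Application-level instrumentation can be sketched as a decorator that times a function and hands the measurement to a developer-defined destination. Everything here is an assumption made for illustration: the instrumented decorator, the sink callable, and the resize_image function are hypothetical, and a list stands in for the API the data would normally be sent to.

```python
import functools
import time


def instrumented(sink):
    """Wrap a function so each call reports its name, status, and duration to sink."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise  # the failure is reported, not swallowed
            finally:
                sink({
                    "name": func.__name__,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


collected = []  # stand-in for a developer-defined API endpoint


@instrumented(collected.append)
def resize_image(width, height):
    # Placeholder for real application work.
    return width * height
```

Calling resize_image(4, 5) returns its normal result and appends one record to collected; swapping collected.append for a function that posts to a metrics endpoint would not require touching the instrumented code.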
Observability Challenges
We’ve described some of the ways observability can be implemented. However, numerous challenges have been alluded to in the post. In this section, our goal is to shine a light on them and provide discussion points as you and your team choose which tools to implement to increase the observability of your software.
Observability Overhead
Every application that is deployed has various costs - costs to develop and costs to maintain. Observability is no different. Depending on the methodologies implemented by your team, it takes time and resources to implement observability tools into every application/service properly. The balance lies between the data you can utilize and the time spent to implement it. During regular operation, you’ll incur a cost for running the services that provide and store the data. A typical overhead can be as high as 20%. In other words, you’ll need to pay for computing and storage services to access these metrics, logs, and traces. Lastly, tool fees are associated with observability - each application that “simplifies” the process of getting these metrics to your team will have a cost associated with it. Choose carefully, as we’ve seen numerous examples of overengineered solutions and metrics not used by the teams paying for them.
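As a back-of-the-envelope sketch of this cost, the function below adds a percentage overhead to tool fees. It assumes the overhead is expressed as a percentage of the monthly cost of running the application, which the text does not spell out, and the dollar amounts in the usage note are hypothetical.

```python
def monthly_observability_cost(base_cost, overhead_percent, tool_fees=0.0):
    """Estimate the monthly observability spend.

    base_cost:        monthly cost of running the application itself
    overhead_percent: extra compute and storage for metrics, logs, and traces,
                      assumed here to be a percentage of base_cost
    tool_fees:        fixed monthly fees for observability tooling
    """
    overhead = base_cost * overhead_percent / 100
    return overhead + tool_fees
```

For example, with a hypothetical base cost of 10,000 per month and the 20% upper figure mentioned above, monthly_observability_cost(10_000, 20, tool_fees=500) comes to 2,500, which is the kind of number worth comparing against how many of the collected metrics are actually used.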
Observability Complexity
Although we can get several metrics and logs out of the box, custom observability metrics are difficult to implement. Why is that so? As we've briefly discussed in the section above, business decisions will depend on the industry, the nature of the business, the team, and the knowledge of the company's needs. Therefore, it becomes increasingly difficult to find experts who possess technical and business knowledge to implement observability to this extent.
Conclusion on Observability
Data in software development is precious. The ability to understand and optimize an application based on data allows engineers to drive business decisions more efficiently. Metrics, logs, and traces are deployed to aid in troubleshooting, deployment, and software development. They carry an overhead that can quickly balloon and put a burden on the organization.