Observability: What you need to know and more

The word "Observability" (or its abbreviation, o11y) gets thrown around a lot these days. So what is it? And is it really just a fancy word for "Monitoring" that became trendy because of those "Thought Leaders"?

Monitoring is a well-explored concept. You define a set of KPIs that need to be checked at a fixed interval. You set up agents to collect the metrics and logs that support those KPIs, visualize them on dashboards, and perhaps set up alerts to notify you once they cross some threshold. Easy, right? Monitoring has been around for ages and has proven its benefits numerous times.
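As a rough illustration of that loop, here is a minimal Python sketch of a threshold check. The KPI name `cpu_percent`, the 90.0 threshold, and the sample values are all made up for the example. A real setup would read samples from a collection agent and send a notification instead of printing.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    kpi: str          # hypothetical KPI name
    threshold: float  # alert when a sample goes above this value


def breaches(rule, samples):
    """Return the samples that exceed the rule's threshold."""
    return [value for value in samples if value > rule.threshold]


# Illustrative values only, not real measurements.
rule = ThresholdRule(kpi="cpu_percent", threshold=90.0)
samples = [42.0, 95.5, 88.0, 91.2]

alerts = breaches(rule, samples)
for value in alerts:
    print(f"ALERT: {rule.kpi} = {value} (threshold {rule.threshold})")
```

Note that the rule only fires for a KPI and a threshold that someone thought of in advance.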

We can define observability as the next level of monitoring. While monitoring is great, it does face some challenges in today's fast-paced environments:

  • You can't monitor what you don't know about (i.e., the unknown unknowns problem). You need a well-defined problem first, and only then can you start monitoring it.

  • It's a reactive approach to problems. You discover a problem only after it has already impacted you.

  • It suffers from silos. Multiple dashboards serve the same purpose (and normally the numbers don't match between them), every system is monitored separately, and there are more metrics and log types than anyone knows what to do with. As a result, every team normally depends on its own metrics and logs (i.e., there is no single source of truth).

These challenges are direct results of how quickly the scale of organizations changes, and the concept of observability was introduced to address them.

Observability is all about providing a complete view of the current state of your environment. It does this using the three pillars of observability:

  • Log files: Records of events that happen in every component of the systems in your environment. They are normally timestamped and tagged with a log level.

  • Metrics: Measurements of KPIs in your systems. They are collected and aggregated over suitable time windows.

  • Traces: Records that show the relation between different components of different systems as events propagate through them.
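To make the three pillars a bit more concrete, the Python sketch below models each one as a small record. It then uses a shared trace ID to rebuild the path of one request across services and to pull up the matching error logs. All field names, service names, and values are assumptions for illustration and do not come from any specific tool. The path reconstruction also assumes each span has at most one child.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    timestamp: float
    level: str
    message: str
    trace_id: str  # links the log line to a request


@dataclass
class MetricPoint:
    timestamp: float
    name: str
    value: float


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the first span of a trace
    service: str


def service_path(spans, trace_id):
    """List the services one request passed through, in call order."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    child_of = {s.parent_id: s for s in in_trace}  # one child per span assumed
    path = []
    current = child_of.get(None)
    while current is not None:
        path.append(current.service)
        current = child_of.get(current.span_id)
    return path


# Hypothetical telemetry for two requests.
spans = [
    Span("t1", "a", None, "gateway"),
    Span("t1", "b", "a", "orders"),
    Span("t1", "c", "b", "payments"),
    Span("t2", "d", None, "gateway"),
]
logs = [
    LogRecord(1.0, "INFO", "order received", "t1"),
    LogRecord(1.2, "ERROR", "card declined", "t1"),
    LogRecord(2.0, "INFO", "health check", "t2"),
]
metric = MetricPoint(1.2, "payment_failures_total", 1.0)

path = service_path(spans, "t1")
errors = [r.message for r in logs if r.trace_id == "t1" and r.level == "ERROR"]
print(" -> ".join(path), "|", errors, "|", metric.name, metric.value)
```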

Once you collect this data (called telemetry data), you need to store it in a single data store. The first difference between observability and monitoring is that observability prefers storing data in the same place to facilitate exploration and analysis. Since the size and diversity of telemetry data is one of the challenges facing monitoring, observability's answer is to provide strong data governance and a data catalog for the collected data. This enables different teams to find the telemetry data they are looking for and to remove any unnecessary data, so resources are not wasted on it.
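A data catalog can be pictured as a searchable registry of telemetry datasets. The Python sketch below is a toy version, assuming a hypothetical `Catalog` class and made-up dataset names, owners, and read counts. It shows the two uses mentioned above: finding data by kind or tag, and listing datasets nobody reads as candidates for removal.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    kind: str    # "logs", "metrics", or "traces"
    owner: str   # team responsible for the dataset
    tags: set = field(default_factory=set)
    reads_last_90_days: int = 0  # hypothetical usage counter


class Catalog:
    """Toy in-memory catalog; a real one would sit on top of the data store."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, kind=None, tag=None):
        """Return names of datasets matching the given kind and/or tag."""
        return sorted(
            e.name
            for e in self._entries.values()
            if (kind is None or e.kind == kind)
            and (tag is None or tag in e.tags)
        )

    def unused(self):
        """Return names of datasets with no recorded reads."""
        return sorted(
            e.name for e in self._entries.values() if e.reads_last_90_days == 0
        )


# Illustrative entries only.
catalog = Catalog()
catalog.register(DatasetEntry("checkout_logs", "logs", "payments", {"checkout"}, 120))
catalog.register(DatasetEntry("checkout_latency", "metrics", "payments", {"checkout"}, 45))
catalog.register(DatasetEntry("legacy_debug_logs", "logs", "platform", {"legacy"}, 0))

print(catalog.search(tag="checkout"))
print(catalog.search(kind="logs"))
print(catalog.unused())
```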

One important principle of observability is making telemetry data easy to explore and analyze. There should be no need to create a new custom dashboard or view every time we need to look at a piece of information. Dashboards should be used only when a piece of information needs to be accessed frequently and viewed in relation to other information.

However, what truly differentiates observability from monitoring is that observability is all about being proactive rather than reactive. By collecting all the telemetry data in the same data store, governing it, providing a catalog for it, and enabling easy exploration and analysis, we enable teams to detect bugs easily and predict incidents before they happen. We also enable them to analyze behaviors and identify patterns that could lead to new features and products.
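One simple way to look for trouble without a predefined threshold is to compare each new value with its own recent history. The Python sketch below flags points that sit far from the mean of a trailing window (a basic z-score check). The window size, the limit of 3.0, and the latency values are assumptions chosen for the example. This is only one of many possible techniques, and it is not a full prediction model.

```python
from statistics import mean, stdev


def anomalies(values, window=5, z_limit=3.0):
    """Return indexes of values that deviate strongly from the trailing window."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat history to avoid dividing by zero.
        if sigma > 0 and abs(values[i] - mu) / sigma > z_limit:
            flagged.append(i)
    return flagged


# Made-up request latencies in milliseconds.
latencies = [100, 102, 98, 101, 99, 100, 103, 180, 101, 100]

flagged = anomalies(latencies)
for i in flagged:
    print(f"unusual latency at index {i}: {latencies[i]} ms")
```

Because the baseline comes from the data itself, nobody has to guess the right threshold for each metric in advance.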

To sum things up, observability means:

  • Collect logs, metrics and traces from all your systems and applications in an automated and scalable way.

  • Store all collected telemetry data in the same data store(s) while managing permissions and security.

  • Provide data governance and data cataloging for the collected data.

  • Enable easy exploration and analysis of the collected data.

  • Enable the deployment of machine learning models and the building of dashboards and alerts on top of the collected data.

  • Measure the degree of observability of your systems regularly and update your policies and KPIs.

We are still at the beginning of the observability journey, but it will be great to see what level of observability we can reach and what kind of insights we can extract.