Monitoring

September 14, 2023


I was asked recently which monitoring tools I recommend. Right now, that’s a difficult question to answer because there isn’t a single complete solution that I am happy to recommend. I believe you need a suite of log aggregation, telemetry/metrics and observability tools to get to the bottom of what is going on in your systems. With the widespread use of microservices, we now have many hosts, and we need better tooling to stitch together what happened when there is an outage or issue.

I have used a couple of open-source ‘monitoring’ tools in the wild: Grafana and Prometheus. Grafana, as the name suggests, focuses on data visualisation and provides a user-friendly interface for creating and displaying dashboards; Prometheus is designed for metric collection and alerting. Prometheus has a powerful query language (PromQL) and can collect a wide range of metrics, while Grafana is more flexible regarding data sources and visualisation options. The two are often used together, with Grafana reading from Prometheus as a data source, so how much you lean on each depends on the specific monitoring needs of the organisation.

Some reasons to use Grafana include:

User-friendly interface for creating and displaying dashboards

Flexible data sources and visualisation options

Ability to customise and share dashboards with other team members

Integration with numerous data sources, including Prometheus, Elasticsearch, and InfluxDB

Some reasons to use Prometheus include:

Powerful query language for data analysis and alerting

Wide range of metrics for monitoring system performance

Ability to scrape metrics from various sources, including applications and services

Integration with various data visualisation tools, including Grafana
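To make the scraping point concrete: Prometheus collects metrics by periodically fetching a plain-text page from each target over HTTP, conventionally at a /metrics path. The sketch below only shows roughly what that text looks like for a counter. It is a minimal illustration using the standard library, not the official client; the function names, the metric name and the labels are all hypothetical, and it does not escape the help text.

```python
# Minimal sketch of the Prometheus text exposition format for a counter.
# Function names, metric names and labels are hypothetical examples.

def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter; `samples` maps a tuple of (label, value) pairs to a number."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        if labels:
            rendered = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            # A sample without labels is just the metric name and its value.
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling render_counter("messages_processed_total", "Messages processed.", {(("queue", "orders"),): 5}) produces a HELP line, a TYPE line and one sample line with the queue label. In a real service you would normally use an existing Prometheus client library rather than building this text by hand.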

Remember, though, that all the pretty dashboards in the world don't help if you have an outage in the middle of the night when no one is looking at them.

In fact, all of the dashboards that have stuck in my mind had a single thing on them:

The usage graph was the same every day - until it wasn’t.
The number of messages per minute was never noticeable - until it said 0.
A map of the airport always had planes moving on it - until there weren’t any.
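Each of those "until it wasn't" moments is something a machine can watch for instead of a person. As a rough sketch, assuming you already have a series of messages-per-minute readings, a flatline check might look like the following. The class name FlatlineAlert, the threshold and the three-sample window are hypothetical choices for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FlatlineAlert:
    """Fire when the most recent readings have all dropped to the threshold or below."""
    threshold: float = 0.0   # readings at or below this count as "flat"
    consecutive: int = 3     # how many flat readings in a row before alerting

    def should_fire(self, samples: Sequence[float]) -> bool:
        # Not enough history yet: stay quiet rather than raise a false alarm.
        if self.consecutive < 1 or len(samples) < self.consecutive:
            return False
        recent = samples[-self.consecutive:]
        return all(s <= self.threshold for s in recent)
```

With the defaults, FlatlineAlert().should_fire([120, 118, 0, 0, 0]) is true, while a single zero reading in an otherwise busy series is not enough. Requiring several consecutive readings is a deliberate trade-off: it delays the alert slightly but avoids paging someone over one missed sample.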

If you want a more integrated monitoring and alerting tool, then I’d recommend looking at Dynatrace and Opsview, both of which are in widespread use.

Dynatrace is an AI-powered monitoring tool that provides end-to-end visibility into application performance, user experience, and infrastructure. It offers automatic root cause analysis and real-time alerts, making it easy for teams to troubleshoot issues and optimise performance. Dynatrace also provides various integrations with other tools and platforms, including Kubernetes and AWS. In my experience it has been the easiest to set up, so if you have a large estate of servers or services to monitor, it’s probably the solution you need. The trade-off is that it’s opinionated about the range of things it can monitor, so it’s relatively difficult to add functionality.

Opsview, by contrast, is a more traditional infrastructure monitoring platform that allows teams to monitor their entire infrastructure, including servers, networks, and applications. It offers customisable dashboards, alerts, and reports, and it integrates with various tools, including Nagios plugins. This makes it easier to add bespoke or customised plugins for your application or for tools you use, such as Kafka. Opsview also offers automation features, allowing teams to automate tasks and workflows.
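To give a feel for what a bespoke plugin involves, the Nagios plugin convention is simple: print one status line, optionally followed by performance data after a pipe character, and exit with 0 for OK, 1 for WARNING, 2 for CRITICAL or 3 for UNKNOWN. The sketch below follows that convention for a Kafka consumer-lag check. The read_consumer_lag function is a hypothetical placeholder that returns a fixed value, and the thresholds are made-up numbers; how you actually obtain the lag depends on your Kafka tooling.

```python
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_lag(lag, warn, crit):
    """Return (exit_code, status_line) for a lag reading; None means the reading failed."""
    if lag is None:
        return UNKNOWN, "KAFKA_LAG UNKNOWN - could not read consumer lag"
    if lag >= crit:
        code = CRITICAL
    elif lag >= warn:
        code = WARNING
    else:
        code = OK
    # Text after the pipe is performance data: label=value;warn;crit;min
    line = f"KAFKA_LAG {STATUS_NAMES[code]} - lag is {lag} messages | lag={lag};{warn};{crit};0"
    return code, line


def read_consumer_lag():
    """Hypothetical placeholder: replace with a real query against your Kafka tooling."""
    return 42


def main():
    code, line = evaluate_lag(read_consumer_lag(), warn=1000, crit=5000)
    print(line)
    return code

# As a standalone plugin script, finish with: sys.exit(main())
```

Keeping the threshold logic in evaluate_lag, separate from the code that fetches the reading, means the decision can be tested without a running Kafka cluster.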

Ultimately, the choice between Dynatrace and Opsview will depend on the organisation's specific needs. Dynatrace may be the better fit for organisations that want AI-powered monitoring and real-time alerts with little setup. Opsview may be the better fit for organisations that need a comprehensive IT infrastructure monitoring platform with customisable dashboards, custom plugins and automation capabilities.
