Logging

Logging was always the thing that was missed way back in the distant monolith past. I once had a job where my role in life was to sit in meetings with all of the project managers (that’s what we called Scrum masters back then) and ask, ‘Have you sorted the logging yet?’ Invariably, we’d get to a week after go-live, we’d have sorted the monitoring (because it’s essential to leave that until post-release, obviously), and then the first Incident would happen. We’d have no idea what was happening because there would be no helpful logging.

So that’s my first definition of proper logging - you should be able to diagnose any issues and resolve them by working your way through the logging.

There are other things such as metrics - the kind of thing you would use Prometheus for - and observability tools like New Relic or Honeycomb, which help you pinpoint whereabouts in your whole flow of services an issue has arisen.

One logging item often missed is a line that says very clearly ‘Service started’. Some tools, such as Ansible, run and then output a recap of any failed tasks from their playbooks, which is most helpful: if something has gone wrong, you need to know exactly which thing it is so you can fix it. Just try to make that as easy as possible for yourselves.
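
As a minimal sketch of what that could look like (assuming a Python service using the standard library’s logging module; the service name, version and port are placeholders):

```
import logging

# Standard-library logging; the service name, version and port are placeholders.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("payments-service")


def main() -> None:
    # The very first line out of the service should be unambiguous:
    # it is up, and this is the build that is running.
    log.info("Service started: version=%s port=%d", "1.4.2", 8080)
    # ... wire up routes, start the server, etc.


if __name__ == "__main__":
    main()
```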

It is essential to make clear when the service has hit any issues that it cannot deal with - these should be logged as errors. Please don’t log things that aren’t errors at the ERROR log level. Just try to make things easy for yourself.
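
As a rough rule of thumb: ERROR for things the service cannot deal with, WARNING for things it coped with, INFO for normal operation. A small sketch, with an entirely hypothetical payment-provider call to illustrate the distinction:

```
import logging

log = logging.getLogger("payments-service")  # placeholder service name


class TransientProviderError(Exception):
    """Hypothetical error raised when the payment provider is briefly unavailable."""


def submit_to_payment_provider(order_id: str) -> None:
    """Stand-in for a real downstream call."""


def charge_card(order_id: str) -> bool:
    try:
        submit_to_payment_provider(order_id)
    except TransientProviderError:
        # The service coped (it will retry later), so this is a WARNING, not an ERROR.
        log.warning("Payment provider unavailable, will retry: order=%s", order_id)
        return False
    except Exception:
        # Something the service genuinely cannot deal with - this is what ERROR is for.
        log.error("Payment failed: order=%s", order_id, exc_info=True)
        raise
    # Normal operation belongs at INFO.
    log.info("Payment accepted: order=%s", order_id)
    return True
```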

Any significant events must be stored somewhere to be replayed or inspected so the team can establish how the system got into its current state.
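
One common approach, sketched here with Python’s standard json and logging modules (the event name and fields are invented for illustration), is to log significant events as structured records so they can be searched and replayed later:

```
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("audit")


def record_event(event_type: str, **fields) -> None:
    """Emit a significant event as one JSON line so it can be searched,
    inspected or replayed later to reconstruct how the system got here."""
    event = {"ts": time.time(), "event": event_type, **fields}
    log.info(json.dumps(event))


# The event name and fields here are invented for illustration.
record_event("order_status_changed", order_id="A123", old="pending", new="shipped")
```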

You really shouldn’t be shelling onto boxes to inspect logs anymore. There are several reasons for this:

Security

Even in a ‘you built it, you run it’ setup, the developers who wrote the code generally should not be able to SSH onto production boxes. It simply isn’t allowed in organisations that have implemented compliance frameworks such as SOX and PCI DSS.

Operability

It’s not practical to hop from one service running on an instance to the next, and that’s if there even is an instance. In these days of serverless microservices, there is often no there, there, so you need to be able to access all of the logs in one central, separate place. I’ve also had issues where the terminal software kept crashing the box if it got overloaded, which is perfect if you’re trying to resolve a production outage.

Availability

If your service, or even your server, is thrashing about, coming up and immediately dying, you will have no way of retrieving the logs before the next crash. Far better, again, to have a highly available logging server where you can look at what is happening without worrying that the ground will vanish from under you.

Two of the best-known log management tools are the ELK stack and Splunk. They provide benefits such as centralised log collection and search, real-time log analysis, and alerting. ELK (Elasticsearch, Logstash, Kibana) is an open-source stack that lets you collect, process, and analyse log data from various sources, and it provides powerful search capabilities and visualisations through its Kibana interface. The downside of hosting your own ELK stack is that it needs a bit of TLC: you will need to understand things like shards and clustering to manage upgrades, for example.

On the other hand, Splunk is a proprietary (expensive!) solution offering similar log management features but with added functionalities such as machine learning and security analytics. Both tools can help improve troubleshooting, monitoring, and performance analysis of systems and applications.
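
As a rough sketch of what ‘one central place’ can look like with the ELK stack (this assumes an Elasticsearch instance at localhost:9200 and an index name chosen purely for illustration; in practice you would normally use a shipper such as Filebeat or Logstash rather than hand-rolled HTTP calls):

```
import datetime
import json
from urllib.request import Request, urlopen

# Assumed endpoint and index name - adjust for your own cluster.
ES_URL = "http://localhost:9200/app-logs/_doc"


def ship_log(level: str, message: str, service: str) -> None:
    """Send one log record to the central store instead of leaving it on the box."""
    doc = {
        "@timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "level": level,
        "message": message,
        "service": service,
    }
    req = Request(
        ES_URL,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urlopen(req)  # a real shipper would batch, retry and handle failures


ship_log("INFO", "Service started", "payments-service")
```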

One way to make sure your log management solution is actually earning its keep is to set up alerts for specific events or errors, so you are notified immediately when something goes wrong and can take action to resolve it. Regularly reviewing and analysing your logs also helps you spot trends or patterns that may point to potential issues before they become significant problems. And testing your logging setup in a staging environment before deploying it to production helps ensure it works as intended.
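
One lightweight way to do that, sketched here as a custom Python logging handler (the webhook URL is a placeholder for whatever your alerting tool accepts), is to fire a notification the moment anything is logged at ERROR or above:

```
import json
import logging
from urllib.request import Request, urlopen

WEBHOOK_URL = "https://alerts.example.com/hook"  # placeholder alerting endpoint


class AlertOnErrorHandler(logging.Handler):
    """Sends a notification whenever a record at ERROR or above is logged."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)

    def emit(self, record: logging.LogRecord) -> None:
        payload = {"service": record.name, "message": record.getMessage()}
        try:
            urlopen(Request(
                WEBHOOK_URL,
                data=json.dumps(payload).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            ))
        except OSError:
            # Never let the alerting path take the service down.
            pass


logging.getLogger().addHandler(AlertOnErrorHandler())
logging.getLogger("payments-service").error("Payment provider unreachable")
```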
