🪵 Monitoring & Logging

Updated at 2021-03-10 15:58

This note is about logging and monitoring. A log entry records a machine- and human-readable description of an event.

Consolidated monitoring is essential. You simply can't keep track of multiple instances of multiple services by SSHing into each host.

It is recommended to monitor:

Software logs aggregated in one place
CPU usage over time
Memory usage over time
Disk usage over time
Response times over time
Error count over time
For all of the above, track the average, minimum, and maximum across instances, as well as the values for each individual instance.
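As a rough sketch of what "average, min, max or individual instance" could look like, here is a small aggregation over samples grouped by instance. The function name `summarize` and the instance names are made up for illustration; a real setup would get this from the monitoring system itself.

```python
from statistics import mean


def summarize(samples_by_instance):
    """Aggregate one metric across all instances and per individual instance."""
    all_values = [
        value
        for values in samples_by_instance.values()
        for value in values
    ]
    return {
        "average": mean(all_values),
        "min": min(all_values),
        "max": max(all_values),
        # Per-instance averages make it possible to spot one misbehaving host.
        "per_instance": {
            name: mean(values) for name, values in samples_by_instance.items()
        },
    }
```

Calling `summarize({"web-1": [10, 20], "web-2": [90]})` shows both the overall picture and that `web-2` is the outlier.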

I would also strongly advocate tracking service and application metrics; at the very least, track which features are actually being used.

Semantic monitoring is a good addition: send a fake payload every X minutes, then wait and verify the results. If there is an issue, notify the developers.
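A minimal sketch of one such check, assuming the system under test can be driven through three callables that you supply yourself: `submit` sends the fake payload, `fetch_result` polls for the outcome, and `notify` alerts the developers. All of these names, and the timeout defaults, are hypothetical.

```python
import time
from typing import Callable, Optional


def run_semantic_check(
    submit: Callable[[dict], str],
    fetch_result: Callable[[str], Optional[dict]],
    notify: Callable[[str], None],
    payload: dict,
    expected: dict,
    timeout_seconds: float = 30.0,
    poll_seconds: float = 1.0,
) -> bool:
    """Send a fake payload, wait for the result, and notify on any mismatch."""
    job_id = submit(payload)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        result = fetch_result(job_id)
        if result is not None:
            if result == expected:
                return True
            notify(f"semantic check failed: expected {expected!r}, got {result!r}")
            return False
        time.sleep(poll_seconds)  # result not ready yet, keep waiting
    notify(f"semantic check timed out after {timeout_seconds} seconds")
    return False
```

Running this "every X minutes" is left to whatever scheduler is already around, e.g. cron or a periodic job in the monitoring system.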

All your application and operational metrics should be in a single place. These are frequently cross-referenced, and it will be hard to find patterns otherwise, e.g. that using ordering automation causes large disk usage spikes.

Correlation IDs are important, especially in microservice monitoring. A correlation ID identifies a chain of calls: the first call generates a UUID, and it is passed along with every subsequent request in the chain.

[12:12:12] [my-service] [correlation-id] [log-level] message
or something along those lines.
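One possible way to get that log line format with Python's standard `logging` module is sketched below. It keeps the correlation ID in a `contextvars.ContextVar` and uses a `logging.Filter` to copy it onto each record. The helper names `start_or_continue_chain` and `build_logger` are made up for this sketch, and how the ID travels between services (for example in a request header) is left out.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for the call chain currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def start_or_continue_chain(incoming_id=None):
    """Reuse the caller's ID if one was passed along, otherwise create one."""
    cid = incoming_id or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def build_logger(service_name, stream=None):
    """Create a logger that prints [time] [service] [correlation-id] [level]."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        f"[%(asctime)s] [{service_name}] [%(correlation_id)s] "
        "[%(levelname)s] %(message)s",
        datefmt="%H:%M:%S",
    ))
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger(service_name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

The first service in the chain calls `start_or_continue_chain()` with no argument; every downstream service calls it with the ID it received, so all their log lines can be joined on the same value.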

A service health check should report the status of the service itself and of its downstream services from its own point of view. This helps spot interconnectivity issues that can be hard to notice otherwise.
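A small sketch of such a health report, assuming each downstream dependency can be probed with a callable that returns `True` when it is reachable. The function name `health_report` and the status strings are illustrative, not a standard.

```python
from typing import Callable, Dict


def health_report(
    service_name: str,
    downstream_checks: Dict[str, Callable[[], bool]],
) -> dict:
    """Report this service plus each downstream service as seen from here."""
    downstream = {}
    for name, check in downstream_checks.items():
        try:
            downstream[name] = "ok" if check() else "failing"
        except Exception as error:
            # A probe that raises is treated as an unreachable dependency.
            downstream[name] = f"unreachable: {error}"
    healthy = all(status == "ok" for status in downstream.values())
    return {
        "service": service_name,
        "status": "ok" if healthy else "degraded",
        "downstream": downstream,
    }
```

If two services both report "ok" for themselves but one lists the other as unreachable, the problem is most likely in the network between them.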

Logging

Logging is the single greatest feature you can add to a software system. It is great for hunting bugs and figuring out what the hell is going on. Creating a solid logging system should be one of the first things you do in a new project. Make the whole process of generating log messages as easy as possible so programmers will actually write log events.

Each log entry should have a machine-readable event code for filtering. Prefer string codes that humans can still understand.

account_created

Each log entry should also have a human-readable description for debugging.

User created an account: Ruksi Laine (87b0fb59-ec25-48cd-a6f0-b5ce9bd24a56).
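As a sketch of how the event code and the description can live in the same entry, the snippet below writes each entry as one JSON line and filters on the code. JSON lines are just one convenient choice here, and `format_event` and `filter_events` are made-up helper names.

```python
import json


def format_event(event_code: str, description: str, **fields) -> str:
    """Build one log line with a filterable code and a readable description."""
    entry = {"event": event_code, "description": description, **fields}
    return json.dumps(entry, ensure_ascii=False)


def filter_events(lines, event_code):
    """Return the parsed entries whose event code matches exactly."""
    entries = (json.loads(line) for line in lines)
    return [entry for entry in entries if entry["event"] == event_code]
```

Filtering on `account_created` then needs no parsing of the free-form description, which stays available for the human doing the debugging.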

Always record log event times in UTC and use the ISO 8601 format.

2013-02-27T01:10Z
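A minimal sketch of producing such timestamps with the standard `datetime` module. The helper name `utc_timestamp` is made up, and this version includes seconds, which ISO 8601 also allows.

```python
from datetime import datetime, timezone


def utc_timestamp(moment=None):
    """Format a moment as an ISO 8601 UTC timestamp with a trailing Z."""
    moment = moment or datetime.now(timezone.utc)
    # astimezone converts an aware datetime from any zone to UTC.
    # A naive datetime would be interpreted as local time, so pass aware ones.
    as_utc = moment.astimezone(timezone.utc)
    return as_utc.strftime("%Y-%m-%dT%H:%M:%SZ")
```

A local time of 03:10 at UTC+2 on 2013-02-27 comes out as `2013-02-27T01:10:00Z`, so entries from hosts in different time zones sort correctly together.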

Always include host information, even if it is an offline application: some kind of identifier for this particular software installation. It should also contain the software version.

Always include the current "user". A user can be a person or a computer in this context.

Add a lot of debugging information on errors. Include the source code location where the event happened, e.g. file name and line number. Include the context in which the event happened, e.g. the call stack, a.k.a. the function call chain that caused the error.

You might also want to identify the software instance, e.g. by process ID or session ID.
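The context fields described above can be sketched with the standard `logging` module roughly like this. A `logging.LoggerAdapter` stamps host, version, and user on every entry, the formatter adds the process ID and the file name and line number, and `exception()` appends the call stack. `APP_VERSION`, `build_logger`, and `split_bill` are hypothetical names, and the field layout is only one option.

```python
import logging
import socket

APP_VERSION = "1.4.2"  # hypothetical version string


def build_logger(name, user, stream=None):
    """Return a logger that stamps host, version, user, and pid on each entry."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s host=%(host)s version=%(version)s pid=%(process)d "
        "user=%(user)s %(levelname)s %(filename)s:%(lineno)d %(message)s"
    ))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    context = {
        "host": socket.gethostname(),
        "version": APP_VERSION,
        "user": user,
    }
    # The adapter passes the context dict as `extra` on every log call.
    return logging.LoggerAdapter(logger, context)


def split_bill(log, total, people):
    """Example operation that logs rich context when it fails."""
    try:
        return total / people
    except ZeroDivisionError:
        # exception() logs at ERROR level and appends the full call stack.
        log.exception("split_bill failed: total=%r people=%r", total, people)
        return None
```

Because the context is attached once when the logger is built, individual log calls stay short, which keeps writing log events easy.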

You should log using four logging levels:

Debug: fine-grained messages about low-level events, not recorded by default
Info: all working as expected, just notifying the state of the system
Warning: system has not failed yet but might fail soon
Error: mission critical failure, aborting current operation
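To make the four levels concrete, here is a small sketch using the standard `logging` module, whose `DEBUG`, `INFO`, `WARNING`, and `ERROR` levels map directly onto the list above. The disk usage scenario and the 90% threshold are made up for illustration.

```python
import logging


def report_disk_usage(logger, used_percent):
    """Pick the log level from how close the disk is to failing writes."""
    # Debug: low-level detail, dropped unless the logger is set to DEBUG.
    logger.debug("disk usage sampled: %d%%", used_percent)
    if used_percent >= 100:
        # Error: the current operation cannot continue.
        logger.error("disk full, aborting write")
        return False
    if used_percent >= 90:
        # Warning: still working, but likely to fail soon.
        logger.warning("disk almost full: %d%% used", used_percent)
    else:
        # Info: everything works as expected.
        logger.info("disk usage normal: %d%% used", used_percent)
    return True
```

With the logger level set to `logging.INFO`, the debug line is not recorded; switching the level to `logging.DEBUG` brings it back without touching the code.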