Monitoring & Logging
This note is about logging and monitoring. Logging records a machine- and human-readable description of an event.
Consolidated monitoring is essential. You simply can't keep track of multiple instances of multiple services by SSHing into the hosts.
It is recommended to monitor:
- Software logs aggregated in one place
- CPU usage over time
- Memory usage over time
- Disk usage over time
- Response times over time
- Error count over time
- All of the above as average, min and max across instances, or per individual instance
I would also strongly advocate tracking service and application metrics. At the very least, track which features are actually being used.
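As a rough illustration of the last bullet, the per-instance readings of one metric can be reduced to an average, min and max while keeping the individual values around. The instance names and CPU numbers below are made up for the example.

```python
from statistics import mean

def summarize(samples: dict[str, float]) -> dict[str, float]:
    """Reduce one metric's per-instance readings to average, min and max."""
    values = list(samples.values())
    return {"avg": mean(values), "min": min(values), "max": max(values)}

# Hypothetical CPU usage (percent) reported by three instances at one moment.
cpu_by_instance = {"web-1": 35.0, "web-2": 80.0, "web-3": 50.0}
summary = summarize(cpu_by_instance)
```

The average alone would hide that `web-2` is running hot, which is why the max and the per-instance view are worth keeping.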
Semantic monitoring is a good addition. Send a fake payload every X minutes, then wait and verify the results. If there is an issue, notify the developers.
All your application and operational metrics should be in a single place. These are frequently cross-referenced and it will be hard to find patterns otherwise, e.g. that using ordering automation causes large disk usage spikes.
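A minimal sketch of one semantic check, assuming the system under test can be reached through a `send_payload` callable and that a healthy response carries `"status": "accepted"`. Both are assumptions for the example, as are the payload fields and the `notify` callable.

```python
from typing import Callable

def run_semantic_check(
    send_payload: Callable[[dict], dict],
    notify: Callable[[str], None],
) -> bool:
    """Send one synthetic payload, verify the result, notify on failure."""
    # Hypothetical fake payload, flagged so it can be told apart from real traffic.
    fake_payload = {"synthetic": True, "item": "test-item", "quantity": 1}
    try:
        result = send_payload(fake_payload)
    except Exception as exc:
        notify(f"Semantic check failed: {exc!r}")
        return False
    if result.get("status") != "accepted":
        notify(f"Semantic check failed: unexpected result {result!r}")
        return False
    return True
```

Running this every X minutes is left to whatever scheduler is already in use, for example cron or a simple loop with a sleep.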
Correlation IDs are important, especially in microservice monitoring. A correlation ID identifies a chain of calls. The first call generates a UUID and it is passed along with every subsequent request. Log lines could then look something along these lines:
[12:12:12] [my-service] [correlation-id] [log-level] message
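One way this could look with Python's standard `logging` module is sketched below. The `X-Correlation-ID` header name is an assumption (a common convention, not a standard), and the helper names are made up for the example.

```python
import contextvars
import logging
import uuid

# Holds the ID of the call chain currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

def start_or_continue_chain(incoming_headers: dict) -> str:
    """Reuse the caller's ID if present; the first call in a chain creates one."""
    cid = incoming_headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

def outgoing_headers() -> dict:
    """Headers to add to every downstream request made during this chain."""
    return {"X-Correlation-ID": correlation_id.get()}

def build_logger(name: str = "my-service") -> logging.Logger:
    """Logger whose lines follow the bracketed layout shown above."""
    handler = logging.StreamHandler()
    handler.addFilter(CorrelationIdFilter())
    handler.setFormatter(logging.Formatter(
        "[%(asctime)s] [%(name)s] [%(correlation_id)s] [%(levelname)s] %(message)s"
    ))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Searching the aggregated logs for one correlation ID then returns the whole chain of calls across services.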
A service health check should report the status of the service itself and the status of its downstream services from its own point of view. This makes it possible to spot interconnectivity issues that can otherwise be hard to notice.
Logging is the single greatest feature you can add to a software system. It is great for hunting bugs and figuring out what the hell is going on. Creating a solid logging system should be one of the first things you do in a new project. Make the whole process of generating log messages as easy as possible so programmers will actually write log events.
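A sketch of such a health report, assuming each downstream dependency can be probed with a small callable that returns `True` when reachable. The function name, the report layout and the status strings are made up for the example.

```python
from typing import Callable

def health_report(service: str, downstream_checks: dict[str, Callable[[], bool]]) -> dict:
    """Report this service's status and how it sees each downstream service."""
    downstream = {}
    for name, check in downstream_checks.items():
        try:
            downstream[name] = "ok" if check() else "unreachable"
        except Exception:
            # A probe that raises counts as unreachable from this service's view.
            downstream[name] = "unreachable"
    # If this code runs at all, the service itself is up and answering.
    return {"service": service, "status": "ok", "downstream": downstream}
```

Comparing these reports between services can show, for instance, that a database is reachable from one service but not from another.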
Each log entry should have a machine-readable event code for filtering. Prefer string codes that humans can still understand.
Each log entry should also have a human-readable description for debugging, for example:
User created an account: Ruksi Laine (87b0fb59-ec25-48cd-a6f0-b5ce9bd24a56).
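A small sketch of combining both in one entry, assuming entries are written as JSON lines. The event codes such as `user.account.created` and the field names are hypothetical.

```python
import json

def make_entry(event: str, description: str, **fields) -> str:
    """Serialize one log entry with an event code and a description."""
    return json.dumps({"event": event, "description": description, **fields})

def filter_by_event(lines: list[str], event: str) -> list[dict]:
    """Return the parsed entries whose event code matches exactly."""
    entries = [json.loads(line) for line in lines]
    return [entry for entry in entries if entry["event"] == event]

lines = [
    make_entry(
        "user.account.created",  # hypothetical string code, readable by humans too
        "User created an account: Ruksi Laine (87b0fb59-ec25-48cd-a6f0-b5ce9bd24a56).",
        user_id="87b0fb59-ec25-48cd-a6f0-b5ce9bd24a56",
    ),
    make_entry("user.login.failed", "Login failed for an unknown user name."),
]
created = filter_by_event(lines, "user.account.created")
```

Machines filter on the stable `event` code, so the description can be reworded freely without breaking any filters.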
Always use UTC as log event time and use ISO 8601 format.
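In Python, a UTC timestamp in ISO 8601 format can be produced with the standard library alone. The helper name is made up, and the millisecond precision is just one reasonable choice.

```python
from datetime import datetime, timezone

def utc_timestamp() -> str:
    """Current time in UTC as an ISO 8601 string with an explicit offset."""
    return datetime.now(timezone.utc).isoformat(timespec="milliseconds")
```

The result ends with `+00:00`, so the time zone is never left to guesswork when logs from several hosts are merged.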
Always include host information, even if it is an offline application: some kind of identifier for this particular software installation. It should also contain the software version.
Always include the current "user". A user can be a person or a computer in this context.
Add a lot of debugging information on errors. Include the source code location where the event happened, e.g. file name and line number. Include the context in which the event happened, e.g. the call stack, a.k.a. the chain of function calls that caused the error.
You might also want to identify the software instance, e.g. with a process ID or session ID.
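With Python's `logging` module, the source location comes from format fields and the call stack from `logger.exception`. The sketch below assumes a made-up `parse_quantity` function as the failing operation and a made-up logger name.

```python
import logging

def build_error_logger(stream) -> logging.Logger:
    """Logger that prefixes each line with file name, line number and function."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s %(filename)s:%(lineno)d %(funcName)s: %(message)s"
    ))
    logger = logging.getLogger("error-demo")
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.DEBUG)
    return logger

def parse_quantity(raw: str, logger: logging.Logger):
    """Hypothetical operation that fails on bad input."""
    try:
        return int(raw)
    except ValueError:
        # logger.exception logs at ERROR level and appends the current traceback.
        logger.exception("Could not parse quantity %r", raw)
        return None
```

Calling `parse_quantity("twelve", logger)` writes the error line followed by the full traceback to the given stream.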
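The identifying fields from the last few paragraphs could be gathered once and attached to every entry. This is only a sketch: the field names, the `APP_VERSION` value and the idea of passing in an installation identifier are assumptions.

```python
import os
import socket

APP_VERSION = "1.4.2"  # hypothetical version string for illustration

def base_context(user: str, installation_id: str) -> dict:
    """Fields that identify where and for whom a log entry was produced."""
    return {
        "installation": installation_id,  # identifies this software installation
        "host": socket.gethostname(),
        "version": APP_VERSION,
        "user": user,                     # a person or another computer
        "pid": os.getpid(),               # identifies this running instance
    }
```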
You should log using four logging levels:
- Debug: fine-grained messages about low-level events, not recorded by default
- Info: all working as expected, just notifying the state of the system
- Warning: system has not failed yet but might fail soon
- Error: mission critical failure, aborting current operation
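These four levels map directly onto Python's standard `logging` levels. A minimal sketch, with a made-up logger name and made-up messages:

```python
import logging

def configure_logger(name: str, debug: bool = False) -> logging.Logger:
    """Record Info and above by default; Debug only when asked for."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    return logger

logging.basicConfig(format="[%(levelname)s] %(message)s")
logger = configure_logger("my-service")

logger.debug("cache lookup for key %s", "user:42")   # dropped: below Info
logger.info("service started")                       # all working as expected
logger.warning("disk usage at %d%%", 91)             # not failed yet, might soon
logger.error("payment provider unreachable, aborting order")
```

Passing `debug=True` turns the fine-grained messages on when they are needed for troubleshooting.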