☁️ Cloud Infrastructure - Basics
Have your infrastructure as code (IaC). This lets you define and test your setup in a single place. It also keeps your infrastructure maintainable: you can deploy to alternative locations, version-control your setup, and roll back changes.
Most cloud providers have their own template languages for defining infrastructure, but I'd recommend using open solutions.
Terraform = cloud-agnostic infrastructure provisioning tool
OpenTofu = open-source fork of Terraform
# configuration management
Ansible = you define how systems relate to each other
Chef = you define the "steps" to reach the desired state
Puppet = you define the "state" and Puppet generates the steps to get there
SaltStack = like Puppet
Chef, Puppet = pull and execute
SaltStack, Ansible = push to execute
# provider-specific IaC tools
AWS CloudFormation
Azure Resource Manager
Google Cloud Deployment Manager
OpenStack Heat
Test your IaC definitions. Most tools have a dry-run mode or some mock/test library; even a simple lint or smoke test is fine. Set this up in your project's CI, such as GitHub Actions.
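As a rough illustration of "even a simple lint is fine", here is a minimal sketch in Python. It assumes a CloudFormation-style JSON template with a top-level `Resources` mapping; the function name `lint_template` and the specific rules are made up for this example. It does not replace a tool's own checks such as `terraform validate` or `cfn-lint`.

```python
import json
from typing import List


def lint_template(template_text: str) -> List[str]:
    """Return a list of human-readable problems found in a JSON template."""
    try:
        template = json.loads(template_text)
    except json.JSONDecodeError as exc:
        return [f"template is not valid JSON: {exc.msg}"]

    if not isinstance(template, dict):
        return ["template must be a JSON object"]

    resources = template.get("Resources")
    if not isinstance(resources, dict) or not resources:
        return ["template defines no Resources"]

    problems = []
    for name, body in resources.items():
        # Every resource needs a Type so the provider knows what to create.
        if not isinstance(body, dict) or "Type" not in body:
            problems.append(f"resource {name!r} has no Type")
    return problems
```

In CI, a step could run this over every template file and fail the build when the returned list is non-empty.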
Monitor application health. How do you know whether your application is up and running? Most cloud providers offer health checks, such as AWS Route 53 Health Checks, that notify you when something is broken. Once you have this information, you can make the system self-repairing, but still trigger at least a warning notification.
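To make the idea concrete, here is a small Python sketch of what a health check does under the hood: probe an endpoint and raise a notification after several consecutive failures. The names `http_probe` and `HealthMonitor`, the threshold of 3, and the notification text are assumptions for illustration, not how any particular provider implements it.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


class HealthMonitor:
    """Flags a target as unhealthy after N consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3,
                 notify: Callable[[str], None] = print) -> None:
        self.probe = probe
        self.failure_threshold = failure_threshold
        self.notify = notify
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one probe; return False once the threshold is reached."""
        if self.probe():
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        if self.consecutive_failures == self.failure_threshold:
            # Notify once per outage, when the threshold is first hit.
            self.notify(f"unhealthy after {self.consecutive_failures} failed probes")
        return self.consecutive_failures < self.failure_threshold
```

A scheduler would call `tick()` at a fixed interval, for example with `HealthMonitor(lambda: http_probe("https://example.com/health"))`, where the URL is a placeholder. The `notify` callback is also the place to hook in a self-repair action while still sending the warning.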
Use centralized high-level error reporting. You want to get notified and see the high-level application logs to understand what was happening during an error.
Sentry
Use centralized low-level logging. Suppose one of your users reports an error. You want to see the low-level logs of what was happening in the environment at that time, but you may have no way of knowing which of your hundreds of servers was involved.
AWS CloudWatch
Azure Monitor
Google Cloud Logging (formerly Stackdriver Logging)
Vector + a sink like AWS S3
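Centralized logs are easier to search when each line is machine-readable and says where it came from. The following Python sketch emits one JSON object per log line, tagged with the host and a service name, which a log shipper can then forward. The field names and the helpers `JsonLineFormatter` and `build_logger` are assumptions for this example, not a format any of the tools above require.

```python
import json
import logging
import socket


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line, tagged with the host."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service
        self.host = socket.gethostname()

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "host": self.host,
            "message": record.getMessage(),
        })


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines (to stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

With the `host` field in every entry, a query in the central store can narrow hundreds of servers down to the ones that logged around the time of the reported error.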
Give an identity to incoming requests and triggered events. Tracking a series of connected events between services is tedious unless each request carries an identity (the "session" or the "trace"). This request identity should be recorded through the whole lifecycle of the event, at least in a way that lets you backtrack to "what triggered this event" if required.
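One way to sketch this in Python: reuse the ID the caller sent, mint one if there is none, keep it in a context variable, stamp it on every log record, and pass it on to downstream calls. The header name `X-Request-ID` and the helper names below are assumptions for illustration; a real system may prefer an established tracing standard instead.

```python
import contextvars
import logging
import uuid
from typing import Dict

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name, not a requirement
_request_id = contextvars.ContextVar("request_id", default="-")


def begin_request(headers: Dict[str, str]) -> str:
    """Reuse the caller's ID if present, otherwise mint a new one."""
    request_id = headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    _request_id.set(request_id)
    return request_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to calls made on behalf of the current request."""
    return {REQUEST_ID_HEADER: _request_id.get()}


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = _request_id.get()
        return True
```

Attach `RequestIdFilter()` to a log handler and include `%(request_id)s` in its format string; searching the centralized logs for one ID then shows the whole chain of events across services.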