Distributed Systems
Distributed systems are systems that are composed of multiple independent components that communicate with each other over a network. They are designed to be scalable, fault-tolerant, and highly available.
Scalability
- Horizontal scaling is adding more machines to increase capacity.
- Vertical scaling is making your machines bigger to increase capacity.
Horizontal scaling can be done "infinitely" by adding more machines, while vertical scaling has a physical limit. But vertical scaling is usually cheaper and can take you farther than you'd think.
Consistency
Consistency means how each node retains a specific piece of data in a distributed system. There are four types of consistency:
Eventual consistency means that data of each node gets consistent eventually, but not propagated instantly. This allows low latency at the risk of returning stale data.
Strong consistency means that data that comes to one node gets instantly written to other nodes too. This allows all data to be in sync at the cost of higher latency.
Weak consistency means that data that comes to one node gets instantly written to subset of nodes e.g. closest or designated. Offers some middle ground in latency vs data-syncing.
SLA
Service Level Agreements (SLA) tell you how reliable a service is. When selecting or building a distributed system, note down the SLAs.
- Availability, % of the time the service is operational.
- Accuracy, is it ok for some data to be inaccurate or lost?
- Capacity, what load should the service be able to support?
- Latency; the time that the service should respond?
System availability is capped by the availability it relies on. So you need duplicate resources to act as backups e.g. servers.
Data Durability
Data durability means that once data has been written to the system, it will be available going forward, even if the node it was written to go offline or crashes.
Data durability is usually achieved with some for of replication of data to multiple nodes.
Message Persistence
Message persistence means that when a failure happens on a node processing a message, the message will still be there to be processed after the failure is resolved.
Message persistence is usually achieved at the message queue level by having a durable message queue like Kafka.
Idempotency
Idempotency means that no matter how many times a specific request is received, the actual execution of this request only happens once.
User pays for the service, but loses connection and resents the payment after a few minutes. Idempotent system will charge the user only once.
Idempotency can be achieved with database-level locking.
Optimistic locking means that when you read a record, you also take note of its version (timestamp or record hash). When you write the record back, you make sure that the version hasn't changed. If the version is different, it is said to be dirty, and you abort the operation. Optimistic locking is better when you expect only a small number of collisions.
Pessimistic locking means that when you read a record, it is exclusive only to you and none else can work on it. This is more complicated to design to avoid deadlocks. Pessimistic locking is better if you expect a large number of collisions.
Sharding
The term "sharding" comes from Ultima Online, where the game world was split into multiple "shards" to improve performance.
Sharding is the most common strategy how a dataset can be saved on separate nodes when all of your data cannot fit on a single node. Each shard acts as the single source for a subset of your dataset.
One node contains one or more partitions of your dataset. But the partitioning itself doesn't mean that they are stored on separate nodes, they can also be used for improving query speeds on large tables.
Sharding is built on top of horizontal partitioning so separate rows can be stored on separate nodes.
Columnar storage is built on top of vertical partitioning so that separate columns can be stored on separate nodes like how AWS Redshift works.
Quorum means that a certain number of nodes need to get the same result for an operation to be successful. This is commonly used when data and computation are replicated to multiple nodes.
Actor Model
Actor model is software design approach where you think of all running processes as "actors" that communicate with each other. This simplifies how you plan and build a distributed system.
Each actor can do a limited set of actions; create other actors, send messages to other actors and processing messages in various ways.
There are many actor frameworks e.g. Akka.