🪠Data Pipelines - Storage Services
What is the type of your data?
- Transactional: in-memory and database records.
- Files: document, log, image, video, audio and other binary files.
- Events: messages and data streams.
What is the temperature of your data?
Data temperature is your assumption about how frequently and how quickly you need to access the data. It effectively sets the limits for data size and cost: warmer data costs more per GB but is faster to access.
| | Hot | Warm | Cold |
|---|---|---|---|
| Volume | MB - GB | GB - TB | PB - EB |
| Item Size | KB | KB - MB | KB - TB |
| Request Rate | High | Medium | Low |
| Latency | ms | ms - sec | min - hours |
| Cost | High - Medium | Medium - Low | Low |
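The latency figures above can be read as a rough decision rule. The following is a minimal Python sketch of that idea. The function name `data_temperature` and the two cut-off values (100 ms and 1 minute) are assumptions made for illustration; the source only gives the ranges "ms", "ms - sec" and "min - hours".

```python
def data_temperature(max_latency_ms: float) -> str:
    """Classify data as hot, warm or cold from the tolerated access latency.

    The thresholds are illustrative assumptions, not values from the source.
    """
    if max_latency_ms < 0:
        raise ValueError("latency must be non-negative")
    if max_latency_ms < 100:        # assumed: well under a second -> hot
        return "hot"
    if max_latency_ms < 60_000:     # assumed: under a minute -> warm
        return "warm"
    return "cold"                   # minutes to hours


print(data_temperature(5))          # hot
print(data_temperature(2_000))      # warm
print(data_temperature(3_600_000))  # cold
```

In practice you would also weigh volume, request rate and cost, since the latency alone does not settle the temperature.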
Transactional Storages
Transactional storages are commonly called just databases.
If you are storing transactional data, you should consider these services:
- Redis/Memcached: open source in-memory database.
- PostgreSQL/MySQL: open source SQL database.
- MongoDB/Cassandra/HBase: open source NoSQL database.
- Amazon ElastiCache: managed in-memory database.
- Amazon DynamoDB: managed NoSQL database.
- Amazon RDS: managed SQL database.
- Amazon Elasticsearch Service: managed search database (Elasticsearch).
Data structures, access patterns and data sizes hint which storage to use:
| Data Structure | Database |
|---|---|
| Key-Value | In-memory or NoSQL |
| Fixed Schema | SQL or NoSQL |
| JSON | Search or NoSQL |
| No Structure | S3 or Glacier |

| Access Pattern | Database |
|---|---|
| Put/Get (hot) | In-memory |
| Put/Get (warm or cold) | NoSQL |
| Simple relationships (1:N, M:N) | NoSQL |
| Multi-table joins, transactions | SQL |
| Searching for Specific Trait | Search |

| | In-memory | NoSQL | SQL | Search |
|---|---|---|---|---|
| Latency | ms | ms | ms, sec | ms, sec |
| Volume | GB | GB-PB | GB-64TB | GB-TB |
| Item Size | KB | 400KB max | 64KB max | KB but 2GB max |
| Request Rate | High | High | Medium | Medium |
| Cost/GB | High | Medium | Medium | Medium |
| Durability | Low | High | High | Medium |
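The access-pattern hints above can be expressed as a simple lookup. This is a sketch only: the names `DATABASE_FOR_ACCESS_PATTERN`, `MAX_ITEM_SIZE_KB` and `suggest_database` are hypothetical, and the item-size limits are the ones listed above, with 2GB converted on the assumption that 1GB = 1024 * 1024 KB.

```python
# Access pattern -> database class, copied from the hints above.
DATABASE_FOR_ACCESS_PATTERN = {
    "put/get (hot)": "In-memory",
    "put/get (warm or cold)": "NoSQL",
    "simple relationships (1:n, m:n)": "NoSQL",
    "multi-table joins, transactions": "SQL",
    "searching for specific trait": "Search",
}

# Maximum item size per database class in KB.
# In-memory has no explicit maximum listed, so it is left out.
MAX_ITEM_SIZE_KB = {
    "NoSQL": 400,
    "SQL": 64,
    "Search": 2 * 1024 * 1024,  # 2GB, assuming 1GB = 1024 * 1024 KB
}


def suggest_database(access_pattern: str, item_size_kb: float) -> str:
    """Return the database class for an access pattern, checking item size."""
    key = access_pattern.strip().lower()
    if key not in DATABASE_FOR_ACCESS_PATTERN:
        raise ValueError(f"unknown access pattern: {access_pattern!r}")
    database = DATABASE_FOR_ACCESS_PATTERN[key]
    limit = MAX_ITEM_SIZE_KB.get(database)
    if limit is not None and item_size_kb > limit:
        raise ValueError(
            f"{item_size_kb} KB exceeds the {limit} KB item limit of {database}"
        )
    return database


print(suggest_database("Put/Get (hot)", 2))                     # In-memory
print(suggest_database("Multi-table joins, transactions", 10))  # SQL
```

A lookup like this only encodes the first-order hint; volume, request rate and durability still need to be checked by hand.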
File Storages
If you are storing files, you should consider these services:
- Apache Hadoop HDFS: a distributed file system for hot data.
- AWS S3: the Swiss army knife of all files, for warm data.
- AWS S3 Infrequent Access: the S3 extension for less frequently accessed data.
- AWS Glacier: freezes your ice cold data to lower the costs.
S3 is a plausible choice in almost all file-storage cases. Choosing any other service is mainly a performance or cost optimization. The benefits of S3:
- Zero maintenance.
- Unlimited number of files.
- Unlimited volume of data.
- Natively supported by almost everything; even big data frameworks like Spark, Hive and Presto.
- Automatic backups behind the scenes.
- Easily secured with SSL and optional encryption at rest.
- Cheap compared to the benefits.
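To make the split between S3, S3 Infrequent Access and Glacier concrete, here is a small Python sketch that picks an S3 storage class from the data temperature and builds the arguments for an upload. The helper `put_object_args`, the temperature labels, and the bucket and key names are hypothetical; `STANDARD`, `STANDARD_IA` and `GLACIER` are existing values of the S3 `StorageClass` parameter.

```python
# Temperature label -> S3 storage class, following the services listed earlier.
S3_STORAGE_CLASS = {
    "warm": "STANDARD",           # AWS S3
    "infrequent": "STANDARD_IA",  # AWS S3 Infrequent Access
    "cold": "GLACIER",            # AWS Glacier
}


def put_object_args(bucket: str, key: str, body: bytes, temperature: str) -> dict:
    """Build keyword arguments for an S3 upload with a matching storage class."""
    if temperature == "hot":
        # The list above suggests HDFS, not S3, for hot files.
        raise ValueError("hot data: consider HDFS instead of S3")
    if temperature not in S3_STORAGE_CLASS:
        raise ValueError(f"unknown temperature: {temperature!r}")
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": S3_STORAGE_CLASS[temperature],
    }


args = put_object_args("example-bucket", "logs/2017-09-12.gz", b"...", "infrequent")
print(args["StorageClass"])  # STANDARD_IA
```

With boto3 installed and credentials configured, the returned dictionary could be passed on as `boto3.client("s3").put_object(**args)`.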
Message and Stream Storages
Message and stream storages decouple producers from consumers. Producers create events, consumers process them, and the storage sits in between so that neither side depends on the other.
Message and stream storages are temporary storages. They are meant to act as buffers that allow scaling. Most of these services have a retention time, which is how long messages stay in the storage.
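The buffering and retention behaviour can be illustrated with a tiny in-memory stand-in. This is only a sketch of the concept, not how any of the services below are implemented. The class `RetentionBuffer` and its methods are hypothetical, and the clock is passed in explicitly so that the example is deterministic (timestamps are assumed to be non-decreasing).

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class RetentionBuffer:
    """In-memory stand-in for a message storage with a retention time."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        # (timestamp, message) pairs, oldest first.
        self._items: Deque[Tuple[float, Any]] = deque()

    def put(self, message: Any, now: float) -> None:
        """Producer side: append a message stamped with the current time."""
        self._items.append((now, message))

    def get_all(self, now: float) -> List[Any]:
        """Consumer side: drop expired messages, then drain what is left."""
        cutoff = now - self.retention_seconds
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()
        messages = [message for _, message in self._items]
        self._items.clear()
        return messages


buffer = RetentionBuffer(retention_seconds=60)
buffer.put("event-1", now=0)
buffer.put("event-2", now=50)
print(buffer.get_all(now=70))  # ['event-2'], event-1 is older than 60 seconds
```

The producer never waits for the consumer here, which is the decoupling described above; the price is that a consumer that falls behind by more than the retention time loses messages.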
If your data is streamed events, you should consider these services:
- Amazon Kinesis Streams: managed stream storage and processing.
- Amazon Kinesis Firehose: data forwarding to S3, Redshift or Elasticsearch.
- Amazon DynamoDB: managed NoSQL database with stream support.
- Apache Kafka: open-source streaming platform.
If you are sending simple messages, you should consider these services:
- Amazon SQS: managed message queue service, with optional FIFO queues.
- Amazon SNS: managed publish/subscribe service, with optional retries.
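| | K.Streams | K.Firehose | DynamoDB | Kafka | SQS | SNS |
|---|---|---|---|---|---|---|
| Ordered | Yes | No | Yes | Yes | Config | Yes |
| Delivery | ≥1 | ≥1 | 1 | ≥1 | ≥1 or 1 | ≥0 |
| Retention *1 | 7d | None | 24h | 7d *2 | 14d | None |
| Scale | No Limit | No Limit | No Limit | No Limit | No Limit *3 | No Limit |
| Parallel *4 | Yes | No | Yes | Yes | No | Yes |
| Streaming *5 | Yes | No | Yes | Yes | No | No |
| Item Size | 1MB | 1MB | 400KB | 1MB *6 | 256KB | 256KB |
| Cost | High | Low | Low | Medium | Medium | Low |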
*1) Retention means how long messages stay in the storage.
*2) Kafka retention is configurable; use multiple weeks if you wish.
*3) FIFO queues are limited to 300 send, receive, or delete operations per second, but batching 10 messages per operation raises this to 3,000 messages per second.
*4) Parallel means parallel consumption of topics/queues.
*5) Streaming means streaming to Amazon EMR (Elastic MapReduce).
*6) Kafka message size is configurable, at least up to 40MB, but remember to change the limit on every producer, broker, and consumer.
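The comparison above can be turned into a small filter that lists the services meeting a set of requirements. This is a sketch under stated assumptions: the `Service` type and `candidates` function are hypothetical, 1MB is taken as 1024 KB, SQS is treated as ordered because FIFO queues can be configured, and Kafka uses the default figures from the comparison even though its retention and message size are configurable.

```python
from typing import List, NamedTuple


class Service(NamedTuple):
    name: str
    ordered: bool
    retention_hours: int  # 0 means the service does not retain messages
    parallel: bool
    max_item_kb: int


SERVICES = [
    Service("Kinesis Streams", True, 7 * 24, True, 1024),
    Service("Kinesis Firehose", False, 0, False, 1024),
    Service("DynamoDB", True, 24, True, 400),
    Service("Kafka", True, 7 * 24, True, 1024),  # defaults, configurable
    Service("SQS", True, 14 * 24, False, 256),   # ordered only with FIFO queues
    Service("SNS", True, 0, True, 256),
]


def candidates(
    need_ordered: bool,
    min_retention_hours: int,
    item_kb: int,
    need_parallel: bool = False,
) -> List[str]:
    """Return the names of services that satisfy all given requirements."""
    return [
        s.name
        for s in SERVICES
        if (s.ordered or not need_ordered)
        and s.retention_hours >= min_retention_hours
        and s.max_item_kb >= item_kb
        and (s.parallel or not need_parallel)
    ]


print(candidates(need_ordered=True, min_retention_hours=24, item_kb=500))
# ['Kinesis Streams', 'Kafka']
```

Cost and delivery guarantees are left out of the filter on purpose; they usually decide between the remaining candidates.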
Sources
- Big Data Architectural Pattern, AWS Loft Big Data Day, 2017-09-12