🪠Data Pipelines - Processor Services
Favor immutable (stateless) services to process your data: simple endpoints, microservices, distributed jobs and AWS Lambda functions. Anything that doesn't save state between operations qualifies.
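As a minimal sketch of what "doesn't save state between operations" can look like, here is a Lambda-style handler whose output depends only on its input event. The event shape (a `records` list with an `email` field) and the `mask_email` helper are assumptions made up for illustration, not part of any AWS API.

```python
import json


def mask_email(value: str) -> str:
    """Hypothetical masking rule: hide the local part, keep the domain."""
    _local, sep, domain = value.partition("@")
    if not sep:
        # Not a recognizable address, so mask it entirely.
        return "***"
    return "***@" + domain


def handler(event, context=None):
    """Lambda-style entry point.

    No module-level variables are read or written, so every invocation
    with the same event produces the same result on any instance.
    """
    records = event.get("records", [])
    # Build new dicts instead of mutating the incoming records.
    masked = [{**r, "email": mask_email(r.get("email", ""))} for r in records]
    return {"statusCode": 200, "body": json.dumps(masked)}
```

Because the handler keeps nothing between calls, it can be retried or run in parallel on any number of instances without coordination.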
Avoid overwriting data. When transforming, cleaning or masking data, avoid overwriting the original data when the source is S3. Keeping the original data allows more flexible backfilling when your transforms change in the future, e.g. when you find a bug.
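One way to keep the originals untouched is to write each transform's output under its own versioned prefix. The sketch below only computes the target key. The `raw/` and `derived/` prefixes and the `derived_key` function are hypothetical naming conventions chosen for this example.

```python
from pathlib import PurePosixPath


def derived_key(source_key: str, transform: str, version: int) -> str:
    """Build an output key under a separate, versioned prefix.

    Assumed layout: originals live under "raw/", outputs go to
    "derived/<transform>/v<version>/" and mirror the source path.
    """
    parts = PurePosixPath(source_key).parts
    # Drop a leading "raw" segment so the derived key mirrors the rest.
    if parts and parts[0] == "raw":
        parts = parts[1:]
    return str(PurePosixPath("derived", transform, f"v{version}", *parts))
```

If a bug is found in version 1 of a transform, the fixed version 2 can be rerun over the unchanged `raw/` objects and written next to the old output.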
How soon you need to process the incoming data hints at which processor to use:
- Batch Analyzing Past: use Amazon EMR/Hadoop (MapReduce/Hive/Pig/Spark)
- Interactively Analyzing Past: use Amazon EMR/Hadoop (Presto/Spark), Amazon Redshift or Amazon Athena.
- Processing Messages Near Real-time: use a custom Amazon SQS consumer application.
- Processing Events Real-time: use Amazon EMR/Hadoop (Spark Streaming), AWS Lambda, Apache Storm or a custom Amazon Kinesis Streams consumer application.
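The list above can be read as a simple lookup from workload type to candidate tools. The sketch below encodes exactly that list. The workload labels (`batch`, `interactive`, `messages`, `events`) and the `suggest_processors` function are names invented for this example.

```python
# Candidate tools per workload type, copied from the list above.
PROCESSORS = {
    "batch": ["Amazon EMR/Hadoop (MapReduce/Hive/Pig/Spark)"],
    "interactive": [
        "Amazon EMR/Hadoop (Presto/Spark)",
        "Amazon Redshift",
        "Amazon Athena",
    ],
    "messages": ["Custom Amazon SQS consumer application"],
    "events": [
        "Amazon EMR/Hadoop (Spark Streaming)",
        "AWS Lambda",
        "Apache Storm",
        "Custom Amazon Kinesis Streams consumer application",
    ],
}


def suggest_processors(workload: str) -> list[str]:
    """Return the candidate tools for a workload label (hypothetical labels)."""
    try:
        # Return a copy so callers cannot modify the shared table.
        return list(PROCESSORS[workload])
    except KeyError:
        known = ", ".join(sorted(PROCESSORS))
        raise ValueError(f"unknown workload {workload!r}; expected one of: {known}")
```

The lookup only narrows the options. Choosing among the candidates still depends on cost, team skills and hosting preferences.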
Interactive analytic tool differences:
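|            | Redshift  | Athena    | Presto    | Spark           |
|------------|-----------|-----------|-----------|-----------------|
| Uses       | Warehouse | Querying  | Querying  | General Purpose |
| Serverless | No        | Yes       | No        | No              |
| Storage    | Internal  | S3        | S3/HDFS   | S3/HDFS         |
| Volume     | GB-PB     | Unlimited | Unlimited | Unlimited       |
| Cost       | High      | Medium    | Low       | Low             |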
All the listed real-time event processing tools are scalable and reliable. The main differences are in how you work with them, which languages they support and whether you want a managed service or to host it yourself.
There are dozens of other transform applications and services, depending on your use case. These are commonly called ETL (extract, transform, load) services or applications.
Sources
- Big Data Architectural Pattern, AWS Loft Big Data Day, 2017-09-12