🪠 Data Pipelines
Processor Services

Updated at 2017-09-16 21:06

Favor immutable services to process your data: simple endpoints, microservices, distributed jobs and AWS Lambda functions; anything that doesn't save state between operations.
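As a minimal sketch of what "doesn't save state between operations" can look like, the processor below is a pure function: its output depends only on its input, and it never mutates the record it receives. The names `mask_email` and `process_record`, and the `email` field, are hypothetical and chosen only for illustration.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    if not domain:
        # Not a recognizable address; hide it completely.
        return "***"
    return f"{local[:1]}***@{domain}"


def process_record(record: dict) -> dict:
    """Return a masked copy of the record; the input is left untouched."""
    result = dict(record)  # shallow copy, so the caller's dict is not mutated
    if "email" in result:
        result["email"] = mask_email(result["email"])
    return result


if __name__ == "__main__":
    print(process_record({"id": 1, "email": "alice@example.com"}))
```

Because nothing is kept between calls, the same function could run behind an endpoint, in a distributed job or in a Lambda function, and re-running it on the same input gives the same result.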

Avoid overwriting data. When transforming, cleaning or masking data, avoid overwriting the original data when the source is S3. Keeping the original data makes backfilling more flexible when your transforms change in the future, e.g. when you find a bug.
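One way to avoid overwriting is to write transformed objects under a separate, versioned key prefix instead of back to the source key. The sketch below only computes the destination key; the `raw` and `clean` prefixes, the version label and the function name `transformed_key` are assumptions for illustration, not a layout prescribed by the source.

```python
from pathlib import PurePosixPath


def transformed_key(source_key: str, transform_version: str,
                    raw_prefix: str = "raw", out_prefix: str = "clean") -> str:
    """Map a raw object key to a versioned output key, never to the same key."""
    parts = PurePosixPath(source_key).parts
    if not parts or parts[0] != raw_prefix:
        raise ValueError(f"expected a key under '{raw_prefix}/': {source_key!r}")
    # Keep the rest of the path so outputs line up with their originals.
    return str(PurePosixPath(out_prefix, transform_version, *parts[1:]))


if __name__ == "__main__":
    print(transformed_key("raw/2017/09/16/events.json", "v2"))
```

With this layout, fixing a bug in a transform means bumping the version label and re-running over the untouched `raw/` objects, while earlier outputs stay available for comparison.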

How soon you need to process the incoming data hints at which processor to use:

  • Batch Analyzing Past: use Amazon EMR/Hadoop (MapReduce/Hive/Pig/Spark).
  • Interactively Analyzing Past: use Amazon EMR/Hadoop (Presto/Spark), Amazon Redshift or Amazon Athena.
  • Processing Messages Near Real-time: use a custom Amazon SQS consumer application.
  • Processing Events Real-time: use Amazon EMR/Hadoop (Spark Streaming), AWS Lambda, Apache Storm or a custom Amazon Kinesis Streams consumer application.
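
The list above can be expressed as a small lookup table, for example to document the choice in a pipeline's configuration. This is only an illustrative sketch: the key names (`batch`, `interactive`, `near-real-time`, `real-time`) and the function `suggest_processors` are hypothetical, and the values simply restate the options listed above.

```python
from typing import Dict, List

# Processing need -> candidate processors, restating the list above.
PROCESSOR_HINTS: Dict[str, List[str]] = {
    "batch": ["Amazon EMR/Hadoop (MapReduce/Hive/Pig/Spark)"],
    "interactive": [
        "Amazon EMR/Hadoop (Presto/Spark)",
        "Amazon Redshift",
        "Amazon Athena",
    ],
    "near-real-time": ["custom Amazon SQS consumer application"],
    "real-time": [
        "Amazon EMR/Hadoop (Spark Streaming)",
        "AWS Lambda",
        "Apache Storm",
        "custom Amazon Kinesis Streams consumer application",
    ],
}


def suggest_processors(need: str) -> List[str]:
    """Return the candidate processors for a processing need."""
    try:
        return PROCESSOR_HINTS[need]
    except KeyError:
        known = ", ".join(sorted(PROCESSOR_HINTS))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(suggest_processors("near-real-time"))
```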

Interactive analytic tool differences:

                Redshift    Athena      Presto      Spark
Uses            Warehouse   Querying    Querying    General Purpose
Serverless      No          Yes         No          No
Storage         Internal    S3          S3/HDFS     S3/HDFS
Volume          GB-PB       Unlimited   Unlimited   Unlimited
Cost            High        Medium      Low         Low

All the listed real-time event processing tools are scalable and reliable. The main differences are in how you work with them, which languages they use and whether you want a managed service or want to host it yourself.

There are dozens of other transform applications and services, depending on your use case. These are commonly called ETL (extract, transform, load) services or applications.

Sources

  • Big Data Architectural Pattern, AWS Loft Big Data Day, 2017-09-12