☁️ AWS
AWS for Big Data

Updated at 2016-01-02 14:11

This extends data science notes as an example implementation.


  1. Data collection through Kinesis, which pushes the data to S3.
  2. Data storage using S3.
  3. Data processing and exploratory analysis using EMR.
  4. Analytics warehouse using Redshift, a database optimized for analytical queries.
  5. Data visualization can be done with JavaScript plotting libraries and Lambda.
  6. TODO: Try Amazon Machine Learning

EMR: AWS service for running Hadoop clusters
  Hadoop: an operating system for big data; different variants exist
    YARN: resource manager that distributes work across a cluster.
    Spark: in-memory data processing jobs.
    Spark Streaming: take streaming data and manipulate it.
    Spark SQL: use HiveQL to create Spark jobs.

Redshift: a data warehouse optimized for querying large datasets.

Kinesis: an easily scaled buffer for incoming data.

Kinesis Firehose: can push incoming data straight to Redshift.


Create a Kinesis stream to queue the incoming data:

aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 1

Create an S3 bucket to store the incoming raw data:

aws s3 mb s3://access-log-production

Launch a 2-node EMR cluster with Spark and Hive:

aws emr create-cluster \
  --name "test-cluster" \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --release-label emr-4.1.0 \
  --ec2-attributes KeyName=my-ssh-keyname \
  --use-default-roles \
  --applications Name=Hive Name=Spark

Launch a single-node Redshift cluster:

aws redshift create-cluster \
  --cluster-identifier test \
  --db-name test_storage \
  --node-type dc1.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password YOUR-PASSWORD \
  --publicly-accessible \
  --port 8192

Fill the Kinesis stream with data, e.g. from your application's access logs.
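One option, assuming the AWS CLI is configured with credentials; the log line format and partition key are assumptions for illustration:

```shell
# Build one fake Apache-style access-log line (format is an assumption).
LINE="192.0.2.1 - - [$(date +'%d/%b/%Y:%H:%M:%S %z')] \"GET /index.html HTTP/1.1\" 200 512"

# Push it onto the stream created above (requires AWS credentials).
aws kinesis put-record \
  --stream-name AccessLogStream \
  --partition-key "$(hostname)" \
  --data "$LINE"
```

In production you would tail the real access log and batch records with `aws kinesis put-records` instead of pushing them one at a time.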

Log in to the EMR cluster and download the Kinesis client for Spark:

  • ssh -o TCPKeepAlive=yes -o ServerAliveInterval=30 -i ~/.ssh/my-key hadoop@your-emr
    • the keepalive options prevent the session from timing out
  • wget the Kinesis client jar for Spark; it allows reading streaming data from Kinesis
  • Write a Spark job in the Spark REPL, started with spark-shell --jars ...
    • Import required libraries to the REPL.
    • Configure to load from Kinesis and save to S3.
    • Start as many workers as there are Kinesis shards.
    • The job populates your S3 bucket with processed data.
  • Start exploratory analysis with Spark SQL REPL spark-sql.
    • Use EMRFS to connect the EMR cluster straight to S3.
    • Create a new external table with the expected columns, parsing instructions, and location, e.g. LOCATION 's3://my-bucket/access-log-raw'.
    • Now you can query the raw S3 data with SQL, e.g. SELECT * FROM access_log_raw LIMIT 1;
  • Now you have an EMR cluster for processing (cluster size determines speed) and S3 for storage (effectively unlimited).
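The Spark-job step above might be sketched in the spark-shell REPL roughly like this. This is a sketch for the Spark 1.x / EMR 4.x era; the app name, region, endpoint, and S3 output path are assumptions, and the exact KinesisUtils.createStream signature should be checked against your Spark version:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// spark-shell provides `sc`; batch every 10 seconds.
val ssc = new StreamingContext(sc, Seconds(10))

// One receiver per Kinesis shard; union the streams if shard-count > 1.
val stream = KinesisUtils.createStream(
  ssc, "AccessLogApp", "AccessLogStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)

// Decode the raw bytes and dump each non-empty batch to S3.
stream.map(bytes => new String(bytes))
  .foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd.saveAsTextFile(
        s"s3://access-log-production/access-log-raw/${time.milliseconds}")
    }
  }

ssc.start()
```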
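The external-table step in spark-sql might look like the following; the column names and the parsing regex are assumptions for a common Apache access-log format:

```sql
-- Hypothetical schema; adjust the regex to match your actual log format.
CREATE EXTERNAL TABLE access_log_raw (
  host STRING, identity STRING, user STRING, request_time STRING,
  request STRING, status STRING, size STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] "([^"]*)" ([^ ]*) ([^ ]*)'
)
LOCATION 's3://my-bucket/access-log-raw';

SELECT * FROM access_log_raw LIMIT 1;
```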

Connect to Redshift:

  • psql -h REDSHIFT-ENDPOINT -p 8192 -U master test_storage
    • or SQL Workbench/J with PostgreSQL 8.x JDBC drivers
  • Create access_logs table with SQL.
  • Execute COPY statement from S3 to Redshift.
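The table and COPY steps, sketched with assumed column names and placeholder S3 path and IAM role:

```sql
-- Hypothetical schema for processed access logs.
CREATE TABLE access_logs (
  host VARCHAR(64),
  request_time TIMESTAMP,
  request VARCHAR(2048),
  status SMALLINT,
  size INTEGER
)
SORTKEY (request_time);

-- Bulk-load tab-separated files from S3; path and role ARN are placeholders.
COPY access_logs
  FROM 's3://access-log-production/access-log-processed'
  CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopyRole'
  DELIMITER '\t';
```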

S3 server-side encryption would be pretty easy to enable. Encrypted S3 access is roughly 25% slower.

Make sure your storage and compute nodes are separated. Using EMR and S3 does this well.

Parquet is a good data format. Use it.

Partition and version your tables. It helps with debugging problems and allows more intelligent migrations from one version to another.