AWS - AWS for Big Data
Updated at 2016-01-02 12:11
This extends the data science notes with an example implementation.
Overview
- Data collection through Kinesis, which pushes data to S3.
- Data storage using S3.
- Data processing and exploratory analysis using EMR.
- Analytics warehouse using Redshift, a database optimized for analytics.
- Data visualization can be done with JavaScript plotting libraries and Lambda.
- TODO: Try Amazon Machine Learning
EMR: AWS service for running Hadoop clusters
Hadoop: operating system for big data, different variants exist
YARN: resource manager that distributes work across a cluster.
Spark: in-memory data processing jobs.
Spark Streaming: take streaming data and manipulate it.
Spark SQL: use HiveQL to create Spark jobs.
Redshift: a data warehouse optimized for querying large datasets.
Kinesis: an easily scaling buffer for incoming data.
Kinesis Firehose: can push incoming data straight to Redshift.
Setup
Create a Kinesis stream to queue the incoming data:
aws kinesis create-stream \
--stream-name AccessLogStream \
--shard-count 1
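Stream creation is asynchronous; you can wait until the stream is active before pushing data into it:
aws kinesis wait stream-exists \
    --stream-name AccessLogStream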
Create an S3 bucket to store the incoming raw data:
aws s3 mb s3://access-log-production
Launch a 2-node EMR cluster with Spark and Hive:
aws emr create-cluster \
--name "test-cluster" \
--instance-type m3.xlarge \
--instance-count 2 \
--release-label emr-4.1.0 \
--ec2-attributes KeyName=my-ssh-keyname \
--use-default-roles \
--applications Name=Hive Name=Spark
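Cluster startup takes a few minutes; check its state using the cluster id returned by create-cluster:
aws emr describe-cluster --cluster-id YOUR-CLUSTER-ID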
Launch a single-node Redshift cluster:
aws redshift create-cluster \
--cluster-identifier test \
--db-name teststorage \
--node-type dc1.large \
--cluster-type single-node \
--master-username master \
--master-user-password YOUR-PASSWORD \
--publicly-accessible \
--port 8192
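Once the cluster is available, look up its endpoint (needed for the psql connection below):
aws redshift describe-clusters --cluster-identifier test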
Fill the Kinesis stream with data, for example by pushing access log lines from your web servers.
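A minimal sketch for testing the pipeline with a single hand-written log line (AWS CLI v1 accepts the raw string; v2 expects the --data value base64-encoded):
aws kinesis put-record \
    --stream-name AccessLogStream \
    --partition-key host1 \
    --data '127.0.0.1 - - [02/Jan/2016:12:00:00 +0000] "GET / HTTP/1.1" 200 2326'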
Log in to the EMR cluster and download the Kinesis Client Library for Spark:
ssh -o TCPKeepAlive=yes -o ServerAliveInterval=30 \
    -i ~/.ssh/my-key hadoop@your-emr
wget http://repo1.maven.org/maven2/com/amazonaws/amazon-kinesis-client/1.6.0/amazon-kinesis-client-1.6.0.jar
- allows reading streaming data from Kinesis
- Write a Spark job in the Spark REPL using
spark-shell --jars ...
- Import required libraries to the REPL.
- Configure to load from Kinesis and save to S3.
- Start as many workers as there are Kinesis shards.
- This populates your S3 bucket with processed data (see the sketch below).
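A minimal Scala sketch of such a job for the spark-shell, assuming Spark's Kinesis connector (spark-streaming-kinesis-asl) is available on the cluster alongside the KCL jar downloaded above; the app name, region, endpoint, and S3 prefix are placeholders:
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 10-second micro-batches on top of the REPL's existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(10))

// One receiver per Kinesis shard, unioned into a single stream.
val shardCount = 1
val streams = (1 to shardCount).map { _ =>
  KinesisUtils.createStream(ssc, "AccessLogApp", "AccessLogStream",
    "kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.TRIM_HORIZON, Seconds(10),
    StorageLevel.MEMORY_AND_DISK_2)
}
val accessLogs = ssc.union(streams).map(bytes => new String(bytes, "UTF-8"))

// Each batch lands in S3 as plain text for later SQL queries.
accessLogs.saveAsTextFiles("s3://access-log-production/access-log-raw/batch")

ssc.start()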
- Start exploratory analysis with the Spark SQL REPL:
spark-sql
- Use EMRFS to connect the EMR cluster straight to S3.
- Create a new external table with the expected columns, parsing instructions, and location (a sketch follows this list):
LOCATION 's3://access-log-production/access-log-raw'
- Now you can query the raw S3 data with SQL:
SELECT * FROM access_log_raw LIMIT 1;
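A sketch of that external table for combined-format Apache access logs; the column names and regex are illustrative, not a required schema:
CREATE EXTERNAL TABLE access_log_raw (
  host STRING, identity STRING, request_user STRING, request_time STRING,
  request STRING, status STRING, size STRING, referrer STRING, agent STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://access-log-production/access-log-raw';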
- Now you have an EMR cluster for processing (cluster size determines speed) and S3 for storage (effectively unlimited).
Connect to Redshift:
psql -h REDSHIFT-ENDPOINT -p 8192 -U master teststorage
- or use SQL Workbench/J with PostgreSQL 8.x JDBC drivers
- Create an access_logs table with SQL.
- Execute a COPY statement to load the processed data from S3 into Redshift (a sketch of both follows).
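A minimal sketch of both steps; the columns mirror the external table above, and the S3 path and credentials are placeholders:
CREATE TABLE access_logs (
  host VARCHAR(256), identity VARCHAR(256), request_user VARCHAR(256),
  request_time VARCHAR(256), request VARCHAR(4096), status VARCHAR(16),
  size VARCHAR(16), referrer VARCHAR(4096), agent VARCHAR(4096)
);

COPY access_logs
FROM 's3://access-log-production/access-log-processed/'
CREDENTIALS 'aws_access_key_id=YOUR-KEY;aws_secret_access_key=YOUR-SECRET'
DELIMITER '\t';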
S3 encryption would be pretty easy to implement. Encrypted S3 is roughly 25% slower.
Make sure your store and compute nodes are separated. Using EMR and S3 does this well.
Parquet is a good data format. Use it.
Partition and version your tables. This helps with debugging problems and allows more intelligent migrations from one version to another.
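For example, a versioned, partitioned Parquet table in HiveQL (the table name, columns, and partition keys are illustrative):
CREATE EXTERNAL TABLE access_logs_v2 (
  host STRING, request STRING, status INT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://access-log-production/access-logs-v2';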