Apache Spark

Updated at 2017-09-03 22:23

Spark is a big data processing engine. It is usually used with Hadoop, which handles the data management. Spark's approach to big data processing almost always performs better than Hadoop MapReduce.

To try it, install Java 1.8 and Spark.

brew cask install java
java -version               # => version 1.8.0_112

brew install apache-spark
spark-shell --version       # => version 2.2.0
pyspark                     # => Spark can be used with Python, Scala, etc
# navigate to http://localhost:4040/ to see your jobs while the shell is open

The Spark shell can be used to schedule jobs across a cluster of machines. For now, you have a cluster of one machine. Try copy-pasting the following into the shell, and you should see a new job pop up in the Spark web console.

import re

words_rdd = sc.textFile("/usr/share/dict/words")

def check_regex(word):
    # match four-letter words built only from the letters a, b, e, n
    return re.match('^[aben]{4}$', word)

out = words_rdd.filter(check_regex).take(5)
# => [u'anan', u'anba', u'anna', u'baba', u'babe']

The sc used in the script above stands for SparkContext, the object used to interact with your Spark cluster. For example, the sc.textFile method creates a distributed dataset that can point to a local file or to an hdfs://, s3n://, kfs://, etc. URI.

# /usr/share/dict/words MapPartitionsRDD[6] at textFile at NativeMetho...

These datasets are called Resilient Distributed Datasets (RDDs). Methods such as filter, take and map on them are scheduled to run distributed across your cluster. RDDs are read-only data structures whose partitions are stored across the cluster machines; "editing" an RDD just creates a new RDD and leaves the old one intact.

SparkContext (sc) must be initialized. In the example above, sc is initialized automatically by PySpark, but if you run the script using a normal Python interpreter, you have to import and configure it yourself.

from pyspark import SparkContext, SparkConf

# "my-app" and "local[*]" are example values; use your own
# application name and master URL (e.g. spark://host:7077)
conf = SparkConf().setAppName("my-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

# PySpark requires the same minor version of Python in both driver and workers.

DataFrames are like RDDs, but they have a schema. The schema allows making structured queries against the data.

# in the same session...
words_df = words_rdd.map(lambda x: (x, )).toDF(['word'])
out = words_df.filter("word like '%ing'")