Spark is a big data processing engine. It is usually used alongside Hadoop, which handles the data management. Spark's approach to big data processing almost always outperforms Hadoop MapReduce.
```sh
brew cask install java
java -version          # => version 1.8.0_112
brew install apache-spark
spark-shell --version  # => version 2.2.0
pyspark                # Spark can be used with Python, Scala, etc.
# navigate to http://localhost:4040/ to see your jobs while the shell is open
```
The Spark shell can be used to schedule jobs across a cluster of machines. Right now you have a cluster of one machine. Try pasting the following into the shell, and you should see a new job pop up in the Spark web console.
```python
import re

words_rdd = sc.textFile("/usr/share/dict/words")

def check_regex(word):
    return re.match('^[aben][aben][aben][aben]$', word)

out = words_rdd.filter(check_regex)
out.take(5)
# => [u'anan', u'anba', u'anna', u'baba', u'babe']
```
The `sc` used in the script above stands for SparkContext, which is used to interact with your Spark cluster. For example, the `sc.textFile` method creates a distributed dataset which can point to a local file:

```python
print(words_rdd)
# /usr/share/dict/words MapPartitionsRDD at textFile at NativeMetho...
```
These datasets are called Resilient Distributed Datasets (RDDs). All methods such as `map` on distributed datasets are scheduled to be distributed across your cluster. Essentially, all RDDs are read-only data structures stored across the cluster machines. "Editing" an RDD just creates a new RDD and leaves the old one intact.
The Spark context (`sc`) must be initialized before use. In our example above, `sc` is initialized automatically by PySpark, but if you run the script using a normal Python interpreter, you have to import and configure it yourself:

```python
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
# PySpark requires the same minor version of Python in both driver and workers.
```
DataFrames are like RDDs, but they have a schema. A schema allows making structured queries against the data.
```python
# in the same session...
words_df = words_rdd.map(lambda x: (x, )).toDF(['word'])
out = words_df.filter("word like '%ing'")
out.take(5)
```
- Spark 2.2.0 Docs - Spark Programming Guide
- Highly Parallel Programming with Apache Spark, Linux Magazine September 2017