DSLab Week 7
Getting Started with…
In this 3-week module, we will investigate how to scale up analysis to a cluster of machines using
the Apache Spark distributed computing framework.
Who am I?
- Rok Roškar
- PhD in Astrophysics
- I currently work on the SDSC Platform team at ETH Zürich
- I used Spark extensively in a previous job for various academic projects
What is Spark?
“a general-purpose distributed computation framework”
a few key features:
- interactive data exploration
- keeps data in-memory - good for loop-intensive algorithms
where is it being used?
- mostly internet applications (recommendation engines, usage analysis etc.)
- classic Big Data use cases e.g. text analysis
- some academic fields, notably neuroscience
Project stats

Why use Spark?
Spark is just one solution that facilitates analysis on large data.
Other options:
- using the Message Passing Interface (MPI) library
- similar frameworks e.g. Apache Flink (more streaming-specific)
- Python-specific Dask (nice abstraction for scaling python-native applications)
Sometimes being clever delivers the best results…
Flexibility of Spark runtime
Spark’s flexibility is what makes it so popular.
The Spark runtime can be deployed on:
- a single machine (local)
- a set of pre-defined machines (stand-alone)
- a dedicated Hadoop-aware scheduler (YARN/Mesos)
- “cloud”, e.g. Amazon EC2
Incremental and interactive development
The typical development workflow is to start small (local) and scale up to one of the other deployment options as your needs and resources grow.
In addition, you can run applications on any of these platforms either
- interactively through a shell (or a Jupyter notebook as we’ll see)
- in batch mode
No code changes are needed to switch between these modes of deployment!
## Spark Architecture Overview
### The things that make distributed computing hard:
1. distributing work to the available resources
2. orchestrating task execution
3. collecting results
This is what a "framework" like Spark does for us.
At its most basic, it consists of a **driver** and **workers**.
**Driver**
* coordinates the work to be done
* keeps track of tasks
* collects metrics about the tasks (disk IO, memory, etc.)
* communicates with the workers (and the user)
**Workers**
* receive tasks to be done from the driver
* store data in memory or on disk
* perform calculations
* return results to the driver
The user's access point to this Spark universe is the **Spark Context**, which provides an interface to generate RDDs.
## Basic Data Abstraction: the RDD (Resilient Distributed Dataset)
An RDD is the primary interface and cornerstone of every Spark application.
* keeps track of data distribution across the workers
* provides an interface to the user to access and operate on the data
* can rebuild data upon failure
* keeps track of lineage
* is immutable
As a Spark user, you write applications that feed data into RDDs and then transform them into something useful.
## RDD transformations and actions
Once an RDD is created, it is **immutable** - it can only be transformed into
a new RDD via a *transformation*.
A transformation, however, does not trigger any computation; it only updates the
DAG (the directed acyclic graph recording how each RDD is derived).
Calculations are triggered by *actions*.
## Transformations
* `distinct`: only retain the unique elements of the entire RDD
* `filter`: only keep those elements for which the filter function evaluates to `True`
* `flatMap`: like `map`, but each input element can yield zero or more output elements, so the result may contain a different number of items than the original data
* `map`: the most basic transformation with 1:1 correspondence to original data
* `mapPartitions`: similar to `map` but applied on a per-partition basis
(the function receives an iterator over a partition and should return an iterable, e.g. a generator)
* `reduceByKey`: group elements by key and keep the data distributed
* `sortBy`: sort using the provided function
Transformations are evaluated "lazily": they are only executed once an *action* is performed.
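For instance, assuming a `SparkContext` named `sc` is available (its initialization is shown below), a minimal sketch of chaining a few of these transformations on a toy RDD might look like this; none of these lines triggers any computation:

```python
words = sc.parallelize(["spark", "is", "lazy", "spark", "rdd"])

# each call only adds a node to the DAG -- nothing is computed yet
unique = words.distinct()                         # drop duplicate words
long_words = unique.filter(lambda w: len(w) > 3)  # keep words longer than 3 characters
pairs = long_words.map(lambda w: (w, 1))          # turn each word into a (word, 1) pair
```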
## Actions
* `collect`: pulls all elements of the RDD to the driver (often a bad idea!!)
* `collectAsMap`: like `collect` but returns a dictionary for key/value RDDs
* `countByKey`/`countByValue`
* `first`: returns the first element of the RDD to the driver
* `reduce`: reduces the entire RDD to a single value
* `take`: yields a desired number of items to the driver
Don't worry, you will soon get to practice
with most of these!
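As a quick taste, continuing the toy sketch from above (the values in the comments are indicative; the order of elements is not guaranteed):

```python
pairs.collect()         # e.g. [('spark', 1), ('lazy', 1)] -- only now is anything computed
long_words.first()      # a single element, e.g. 'spark'
words.countByValue()    # e.g. {'spark': 2, 'is': 1, 'lazy': 1, 'rdd': 1}
words.map(len).reduce(lambda a, b: a + b)  # total number of characters in all words
```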
## Lineage
* When an RDD is transformed, this **transformation** is not automatically
carried out.
* Instead, the system remembers how to get from one RDD to another and only
executes whatever is needed for the **action** that is being done.
* This allows one to build up a complex "pipeline" and easily tweak or rerun it in its entirety.
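You can inspect the lineage Spark has recorded for any RDD with its `toDebugString` method (the exact output format depends on the Spark version). A small sketch:

```python
rdd = sc.parallelize(range(100))
evens_squared = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# a textual description of the RDD and its recursive dependencies;
# in a notebook, simply displaying the result is enough
evens_squared.toDebugString()
```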
### Initializing Spark
```python
import pyspark
sc = pyspark.SparkContext()
```
This step launches the Spark runtime and connects the application to the
master. This creates a driver which can now be used to dispatch work to the
resources allocated for the application.
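By default the context picks up whatever master is configured for your environment; for quick local experiments you could instead request a local master explicitly when creating the context. A small sketch (the application name is arbitrary):

```python
import pyspark

# run locally using 2 worker threads; "local[*]" would use all available cores
sc = pyspark.SparkContext(master="local[2]", appName="week7-sandbox")
```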
### Parallelize
```python
data = list(range(100))   # any Python collection, e.g. a list of numbers
rdd = sc.parallelize(data)
```
<img src="figs/parallelize.svg" width=700>
### map
```python
def square(x):
    return x * x
rdd = sc.parallelize(data)
rdd_squared = rdd.map(square)
```
<img src="figs/map_lineage.svg" height=500px>
## Caching
* RDD evaluations are *lazy*
* whenever an action is performed, the entire lineage graph is recalculated
* unless! an intermediate RDD is cached -- then it is only calculated once and reused from memory each subsequent time
* this allows for good performance when iterating on an RDD is required
```python
rdd = sc.parallelize(data)
rdd2 = rdd.map(square)
rdd2.cache()
```
<img src="figs/cache.svg" height=500px>
```python
rdd = sc.parallelize(data)
rdd2 = rdd.map(square)
rdd2.cache()
rdd3 = rdd2.map(square)  # reuses the cached rdd2 instead of recomputing it
```
<img src="figs/cache_map.svg" height=500px>
## Partitioning
* data of each RDD is partitioned and each partition is assigned to an executor
* each partition in a transformation results in a task
* there may be many more tasks than cores in the system, which allows for good utilization by breaking the overall load into fine-grained pieces.
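You can control the number of partitions when an RDD is created and inspect how the elements are distributed. A small sketch (`glom` collects the elements of each partition into a list):

```python
rdd = sc.parallelize(range(10), numSlices=4)

rdd.getNumPartitions()   # 4
rdd.glom().collect()     # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```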
#### Time for the basic Spark tutorial!
Head to the GitLab repo for this week:
https://git-dslab.epfl.ch/dslab2018/week7-intro-to-spark
Follow the instructions in the `README.md` to get set up.
## Lab notebooks
There are three notebooks for this week in the `notebooks/` directory. They should be done in
this order:
1. `python-refresher.ipynb`
2. `spark-intro.ipynb`
3. `gutenberg.ipynb`
## Analyzing the Gutenberg corpus
The [Gutenberg Project](http://www.gutenberg.org/) is a large free
repository of books and other media in different languages (but primarily
English).
We will use it to do some basic text analysis using key/value PairRDDs in Spark.
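As a preview of the PairRDD pattern (a minimal sketch on a toy stand-in for the corpus, not the actual notebook code):

```python
# toy (ID, text) pairs standing in for the real corpus
books = sc.parallelize([(1, "to be or not to be"),
                        (2, "to thine own self be true")])

word_counts = (books
               .flatMap(lambda kv: kv[1].split())  # drop the ID, split the text into words
               .map(lambda word: (word, 1))        # build (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))   # sum the counts per word

word_counts.take(3)  # e.g. [('to', 3), ('be', 3), ('or', 1)]
```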
## The data
I have already pre-processed the data and created an RDD for you to use. The
RDD, which consists of `(ID, text)` key-value pairs, can be found at
```
hdfs:///datasets/gutenberg/gutenberg_rdd
```
This RDD will form the basis of the work in the notebook.
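How you read it back depends on how it was serialized; assuming it was written with `saveAsPickleFile` (the notebook has the exact call to use), loading it would look roughly like this:

```python
gutenberg = sc.pickleFile("hdfs:///datasets/gutenberg/gutenberg_rdd")

# peek at a single (ID, text) pair
gutenberg.take(1)
```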
## The goal
The eventual goal is to produce something like the [Google NGram
viewer](https://books.google.com/ngrams), but for the Gutenberg corpus.
Because the notebook is quite long, its last part is already
filled in for you, but feel free to run it!