CS186-L21: MapReduce and Spark

Motivation

just scaling up relational databases (a bigger and bigger single machine) is challenging :s, hence the move to scaling out across many cheap machines

MapReduce Data and Programming Model

Target /databasel21/image.png

Map phase

/databasel21/image-1.png the map function keeps no state across input records, so it can be parallelized easily
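as a concrete sketch (using Hadoop's Java MapReduce API; the class name and tokenization here are my own, not necessarily what lecture showed), a word-count mapper emits (word, 1) for each word and keeps nothing between calls:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: called once per input line, emits (word, 1) pairs.
// No state survives between calls, so input splits can run on many machines at once.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String w : line.toString().split("\\s+"))
            if (!w.isEmpty())
                ctx.write(new Text(w), ONE);
    }
}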

Reduce phase

/databasel21/image-2.png for example, to count the number of occurrences of each word in the input, the reduce function sums up the values for each key /databasel21/image-3.png
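a matching hedged sketch of the reducer: the framework groups all values by key before calling it, so summing is all that is left to do:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: receives (word, [1, 1, ...]) after the shuffle
// and emits one (word, total) pair per distinct word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));
    }
}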

Implementation of MapReduce

fault tolerance

by writing intermediate results to disk…

  • mappers write their output to local disk
  • reducers read the mappers' output from those disks and combine it; if a reduce task fails, it is simply restarted on another server

implementation

/databasel21/image-4.png how to handle stragglers? run backup copies of the remaining slow tasks on other machines and take whichever copy finishes first /databasel21/image-5.png

Implementing Relational Operators

/databasel21/image-6.png

/databasel21/image-7.png

/databasel21/image-8.png

/databasel21/image-9.png
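as one hedged illustration of how a relational operator maps onto MR (a reduce-side equi-join of R(a, b) and S(a, c) on column a; the CSV layout and the R.csv / S.csv file naming are assumptions, not from the slides):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: key each CSV tuple on its join attribute (assumed to be column 0)
// and tag it with the relation it came from.
class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
            throws IOException, InterruptedException {
        String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String tag = file.startsWith("R") ? "R" : "S"; // assumption: inputs named R.csv / S.csv
        String[] t = line.toString().split(",", 2);
        ctx.write(new Text(t[0]), new Text(tag + ":" + t[1]));
    }
}

// Reduce: all tuples sharing a join key arrive together;
// emit the cross product of the R-side and S-side tuples for that key.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> vals, Context ctx)
            throws IOException, InterruptedException {
        List<String> r = new ArrayList<>(), s = new ArrayList<>();
        for (Text v : vals)
            (v.toString().startsWith("R:") ? r : s).add(v.toString().substring(2));
        for (String rt : r)
            for (String st : s)
                ctx.write(key, new Text(rt + "," + st));
    }
}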

Introduction to Spark

why does MR suck?

  • hard to express more complex, multi-stage queries
  • slow, because all intermediate results are written to disk

/databasel21/image-10.png

/databasel21/image-11.png

Programming in Spark

collections in Spark (RDDs) /databasel21/image-12.png

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("MyApp").getOrCreate();
JavaRDD<String> lines = spark.read().textFile("input.txt").javaRDD();
JavaRDD<String> errors = lines.filter(line -> line.contains("error")); // lazy: only records the transformation
List<String> result = errors.collect(); // eager action: triggers the actual computation

similar steps in Spark and MR /databasel21/image-13.png

Persistence

/databasel21/image-14.png API in Java /databasel21/image-15.png
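a minimal sketch of persisting an RDD in Java, continuing the lines/errors snippet from above; without persist(), each action would recompute errors from the input:

import org.apache.spark.storage.StorageLevel;

errors.persist(StorageLevel.MEMORY_ONLY()); // or errors.cache(): mark the RDD as worth keeping in memory
long total = errors.count(); // first action computes errors and caches its partitions
long diskErrors = errors.filter(l -> l.contains("disk")).count(); // later actions reuse the cached partitions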

Spark 2.0

has a DataFrame API 😲

and a Dataset API 😲

/databasel21/image-16.png

like pandas in DATA100 Python!
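a minimal DataFrame sketch in Java showing that pandas-like chained style (people.json and its name/age columns are made up here):

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("DfDemo").getOrCreate();
Dataset<Row> people = spark.read().json("people.json"); // hypothetical input with name/age columns
people.filter(col("age").gt(21)) // SQL-style predicate, optimized before execution
      .groupBy("age")
      .count()
      .show();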