Tuesday, 24 January 2017

MapReduce vs Spark

This post summarizes a paper comparing MapReduce and Spark. Spark has become very popular and can, in general, achieve better results than vanilla MapReduce. Here are the main points of the comparison made in the paper.

  • Spark does as much as possible in memory, so it needs a lot of memory to process big files.
  • In general it is easier to work with Spark than with MapReduce, although several tools are available that make working with MapReduce easier.
  • The main causes of Spark's speedups are the efficiency of its hash-based aggregation component in the "combine" stage, and the reduced CPU and disk overheads that come from RDD caching.
  • Spark uses Resilient Distributed Datasets (RDDs), in-memory data structures used to cache intermediate data across a set of nodes (see the sketch right after this list).
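
As a minimal sketch of those last two points, here is a small Scala word count on the RDD API; the input path, application name and local master are placeholders. cache() keeps the intermediate RDD in executor memory, and reduceByKey performs the map-side, hash-based "combine" before the shuffle.

    import org.apache.spark.{SparkConf, SparkContext}

    object CacheAndCombineSketch {
      def main(args: Array[String]): Unit = {
        // Local master and input path are placeholders for this sketch.
        val conf = new SparkConf().setAppName("cache-and-combine").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // cache() keeps the tokenized records in executor memory, so later
        // stages re-read them from RAM instead of re-scanning the file.
        val words = sc.textFile("hdfs:///data/input.txt")
          .flatMap(_.split("\\s+"))
          .cache()

        // reduceByKey combines values per key on the map side before the
        // shuffle (the "combine" step mentioned above), so less data
        // crosses the network.
        val counts = words.map(w => (w, 1L)).reduceByKey(_ + _)
        counts.take(10).foreach(println)

        sc.stop()
      }
    }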

In Spark, a job may consist of many stages whose boundaries are defined by shuffle dependencies, whereas a MapReduce job has exactly one map stage and one reduce stage.
Spark and MapReduce take a similar amount of time in the reduce tasks, because the reduce stage is network-bound and the amount of data to shuffle is similar in both cases.
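
To make the stage boundaries concrete, here is a small sketch meant for spark-shell (where sc is predefined); the input path and the comma-separated field layout are made up. Each shuffle dependency closes one stage and opens the next, and nothing runs until the final action.

    // Run in spark-shell, where `sc` is predefined; the input path and the
    // comma-separated field layout are made up for illustration.
    val lines  = sc.textFile("hdfs:///logs/events")        // narrow transformations...
    val pairs  = lines.map(l => (l.split(",")(0), 1L))     // ...stay in the same stage
    val counts = pairs.reduceByKey(_ + _)                  // shuffle dependency: stage boundary
    val ranked = counts.map { case (k, c) => (c, k) }
      .sortByKey(ascending = false)                        // second shuffle: third stage
    ranked.take(5)                                         // the action finally runs the job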


So, on which points does Spark win over MapReduce?

  1. Spark is faster than MapReduce in task initialization.
  2. Spark is circa 3x faster in input read and map operations.
  3. Spark is circa 6x faster in the combine stage, because its hash-based combine is more efficient than MapReduce's sort-based combine (see the configuration sketch right after this list).
  4. Spark's in-memory collection has low overhead, but MapReduce can be faster on bigger files. For example, with 500 GB of input data, MapReduce is faster because it reads only splits of the input file, while Spark reads the whole file; in that case there is significant CPU overhead for swapping pages in the OS buffer cache.
  5. Both Spark and MapReduce are CPU-bound in the map stage. For Spark, disk I/O is significantly reduced in the map stage compared to the sampling stage, even though the map stage also scans the whole input file.
  6. Spark does not support overlapping the shuffle write and shuffle read stages; supporting this overlap in the future could improve performance.
  7. MapReduce is slower than Spark on these two points:
        1. The load time in MapReduce is much higher than in Spark.
        2. The total time for reading the input (Read) and for applying the map function to it (Map) is higher than in Spark.
  8. Spark performs better than MapReduce thanks to these two points:
        1. Spark reads part of the input from the OS buffer cache, since its sampling stage scans the whole input file. MapReduce, on the other hand, only partially reads the input file during sampling, so the OS buffer cache is not very effective during its map stage.
        2. MapReduce collects the map output in a map-side buffer before flushing it to disk, whereas Spark's hash-based shuffle writer writes each map output record directly to disk, which reduces latency.
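
The hash-based combine mentioned in point 3 corresponds to the shuffle implementation that the Spark 1.x generation (the one this comparison refers to) let you select explicitly. A minimal configuration sketch, assuming a Spark 1.x deployment; the application name and master are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    // Spark 1.x exposed the shuffle implementation as a setting; the hash-based
    // writer is the one credited above for the faster combine step. Later Spark
    // releases removed this option and kept only the sort-based shuffle.
    val conf = new SparkConf()
      .setAppName("shuffle-choice-sketch")   // placeholder name
      .setMaster("local[*]")                 // placeholder master
      .set("spark.shuffle.manager", "hash")  // or "sort"
    val sc = new SparkContext(conf)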

Here is a list of advantages of using Spark.

Spark is fast: it executes batch processing jobs about 10 to 100 times faster than the Hadoop MapReduce framework, largely by cutting down on the number of reads and writes to disk. RDDs (Resilient Distributed Datasets) let you keep data in memory and spill it to disk only when required, and there are no synchronization barriers that could slow down the process.
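
A minimal sketch of that spill-only-if-needed behaviour, again assuming spark-shell (sc predefined) and a hypothetical input path: the MEMORY_AND_DISK storage level keeps partitions in memory and writes to disk only those that do not fit.

    import org.apache.spark.storage.StorageLevel

    // Run in spark-shell (`sc` predefined); the input path is a placeholder.
    // MEMORY_AND_DISK keeps partitions in RAM and spills to disk only those
    // that do not fit.
    val events = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Both actions below reuse the persisted partitions instead of
    // re-reading and re-parsing the input file.
    val total    = events.count()
    val firstRow = events.first()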

Spark is easier to manage, for two reasons: (i) with Spark it is possible to run different kinds of workloads in the same process, so when various workloads interact it is easier to manage and secure them, something that is a limitation of MapReduce. Streaming, batch processing and machine learning can all run in the same cluster with Spark. (ii) Spark Streaming is based on Discretized Streams, which propose a new model for doing windowed computations on streams using micro-batches.
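
A minimal Spark Streaming sketch of that micro-batch model; the socket host and port, the batch interval and the window sizes are all made up for illustration. Each 2-second micro-batch is a small RDD, and the windowed count covers the last 30 seconds, sliding every 10 seconds.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batches are formed every 2 seconds; the windowed count covers the
    // last 30 seconds and slides every 10 seconds. Host and port of the socket
    // source are placeholders.
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
    val windowedCounts = words.map(w => (w, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
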
Spark has a caching system that enables lower-latency computations by caching partial results in the memory of its distributed workers. MapReduce is completely disk-oriented, so it does not have a caching mechanism as efficient as Spark's.
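
Caching pays off most when the same partial result is reused several times, for instance in iterative jobs. A small sketch assuming spark-shell (sc predefined), a made-up edge file and an arbitrary number of iterations:

    // Run in spark-shell (`sc` predefined). The edge file and the number of
    // iterations are made up; the point is that every iteration re-reads the
    // cached partitions from worker memory instead of from HDFS.
    val edges = sc.textFile("hdfs:///data/graph-edges")
      .map(_.split("\t"))
      .map(a => (a(0), a(1)))
      .cache()

    for (i <- 1 to 5) {
      val outDegrees = edges.mapValues(_ => 1).reduceByKey(_ + _)
      println(s"iteration $i: ${outDegrees.count()} distinct source vertices")
    }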

RDDs are the main abstraction of Spark. They allow recovery from failed nodes by re-computation of the DAG. Storing a Spark job as a DAG also allows lazy computation of RDDs and lets Spark's optimization engine schedule the flow in ways that can make a big difference in performance.
Spark has per-task retries and speculative execution, just like MapReduce.
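
A small sketch of that lazy evaluation, again for spark-shell, with a made-up input path and field layout: every transformation only extends the lineage, and the single action at the end triggers the whole DAG, which is also what gets replayed if a node fails.

    // Run in spark-shell (`sc` predefined); path and field layout are made up.
    // Every line below is a lazy transformation: Spark only records the lineage.
    val parsed = sc.textFile("hdfs:///data/ratings.csv").map(_.split(","))
    val valid  = parsed.filter(_.length == 3)
    val scores = valid.map(r => (r(0), r(2).toDouble))

    // The single action triggers the whole DAG at once, letting the scheduler
    // pipeline the narrow transformations into one stage. If an executor is
    // lost, only its partitions are recomputed by replaying this lineage.
    println(scores.reduceByKey(_ + _).count())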

Spark needs at least as much memory as the data it has to process, because data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory. On the other hand, considering Spark's benchmarks, it should be more cost-effective, because less hardware can perform the same tasks much faster (especially in the cloud, where compute power is paid per use).
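
How much memory Spark actually gets is a deployment choice. A small configuration sketch: the keys are standard Spark settings, while the numeric values are placeholders to be sized against the data you want to keep in memory.

    import org.apache.spark.SparkConf

    // Sizing sketch: the memory available to Spark is a deployment choice.
    // The keys are standard Spark settings; the values are placeholders. The
    // conf would be passed to a SparkContext or supplied through spark-submit.
    val conf = new SparkConf()
      .setAppName("memory-sizing-sketch")
      .set("spark.executor.memory", "8g")      // heap available to each executor
      .set("spark.executor.instances", "20")   // number of executors to request (YARN)
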
https://hackernoon.com/hn-images/1*klFtrGr8U-XmyiZ1CJx-0w.png
