Sparking Up Apache Hadoop
By Jim Scott, Director of Enterprise Strategy & Architecture, MapR Technologies
Everyone who works with Big Data is looking for an easier, faster way to derive more value from their projects. Running up to 100 times faster than the current default processing framework, Apache Spark is rapidly becoming the preferred way to achieve that goal.
"Apache Spark is rapidly becoming the preferred way to achieve that goal"
Apache Spark is a general-purpose compute engine that was specifically architected to process Big Data as efficiently as possible. The previous default processing framework, Hadoop MapReduce, is a solid performer, but its decade-old technology struggles to keep up with current Big Data demands. One noticeable issue is MapReduce's slow batch processing, which bogs down when challenged with a steady flow of real-time data.
Spark delivers measurable performance uplift and enables running batch, interactive, and streaming jobs on the cluster using the same unified framework. It supports rapid application development for Big Data and allows for code reuse across applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance.
Before we dig a little deeper into the details of these features, let’s take a look at a few key Apache Spark concepts.
Resilient Distributed Datasets (RDDs) represent the data coming into a system in an object format that allows computations on top of it. Spark provides a simple programming abstraction, allowing developers to design applications as operations on RDDs. RDDs are spread across the cluster and can be stored in memory or on disk. Spark uses the RDD model to transparently keep data in memory and persist it to disk only when necessary. Reducing disk reads and writes noticeably speeds up data processing: applications in Hadoop clusters run up to 100 times faster in memory, and 10 times faster even when running on disk.
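The two RDD ideas just described, lazy recomputation and optional in-memory caching, can be sketched in plain Python. The MiniRDD class below is a toy illustration only, not Spark's actual API; the names and structure are invented for this sketch.

```python
class MiniRDD:
    """Toy sketch of an RDD: a lazy, re-computable view over a dataset.
    (Illustration of the idea only -- not the real Spark API.)"""

    def __init__(self, compute):
        self._compute = compute   # function producing a fresh iterator of records
        self._cached = None       # in-memory copy, filled only on request

    def map(self, f):
        # Transformation: returns a new lazy dataset; no data moves yet.
        return MiniRDD(lambda: (f(x) for x in self._iter()))

    def cache(self):
        # Keep the computed records in memory so later actions skip recompute.
        self._cached = list(self._compute())
        return self

    def _iter(self):
        return iter(self._cached) if self._cached is not None else self._compute()

    def collect(self):
        # Action: materialize the results.
        return list(self._iter())

doubled = MiniRDD(lambda: iter(range(5))).map(lambda x: x * 2).cache()
print(doubled.collect())   # [0, 2, 4, 6, 8]
print(doubled.collect())   # served from the in-memory copy this time
```

Because an RDD remembers how to recompute itself, a lost partition can be rebuilt from its lineage rather than restored from a disk checkpoint, which is where the "resilient" in the name comes from.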
Transformations are actions performed on RDDs to produce other resilient RDDs. Examples of transformations include map, filter, and groupByKey.
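For intuition, here is what those three transformations do, mimicked on an ordinary Python list of key/value pairs. This is plain Python standing in for the Spark operations of the same name, not Spark code.

```python
from itertools import groupby

# Toy data: (key, value) pairs, as a Spark pair RDD would hold.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# map: transform every record.
doubled = [(k, v * 2) for k, v in pairs]

# filter: keep only records matching a predicate.
evens = [(k, v) for k, v in pairs if v % 2 == 0]

# groupByKey: gather all values that share a key.
grouped = {k: [v for _, v in grp]
           for k, grp in groupby(sorted(pairs), key=lambda kv: kv[0])}

print(doubled)  # [('a', 2), ('b', 4), ('a', 6), ('b', 8)]
print(evens)    # [('b', 2), ('b', 4)]
print(grouped)  # {'a': [1, 3], 'b': [2, 4]}
```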
Actions are requests for answers from the system. Spark uses lazy evaluation, so RDDs are loaded and pushed into the system only when there is an action to be performed (in contrast with eager or greedy evaluation).
Apache Spark's Big Data Benefits
Spark adds new speed to Big Data across the spectrum from programming applications to performance.
Spark offers in-memory performance and combines batch and streaming workflows for operational and analytical workloads on a single cluster in a high-performing, highly scalable way. Leverage the complete Spark stack to build complex ETL pipelines that can merge streaming, machine learning, and SQL operations all in one program.
Spark optimizes both how computations are performed and where they are placed, using a Directed Acyclic Graph (DAG). Its general-purpose execution framework with in-memory pipelining speeds up end-to-end application performance. For many applications, this results in a performance improvement of five to 100 times. Batch applications run 10 to 100 times faster in production environments. Spark's caching system makes it well-suited for highly iterative jobs.
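The pipelining idea can be shown with ordinary Python iterators: chained eager steps each materialize a full intermediate result, while fused lazy steps stream every element through the whole chain in a single pass. This is a rough analogue, under stated assumptions, of how Spark fuses consecutive narrow transformations inside one DAG stage, rather than a description of its scheduler.

```python
# Eager style: each step builds a complete intermediate list -- extra memory
# and extra passes, roughly what chained MapReduce jobs do via disk.
step1 = [x + 1 for x in range(10)]
step2 = [x * 2 for x in step1]

# Pipelined style: the two stages are fused, so each element flows through
# the whole chain before the next one is touched, and no intermediate
# collection is ever materialized.
pipeline = (x * 2 for x in (x + 1 for x in range(10)))
result = list(pipeline)
print(result)   # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```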
Additionally, Spark provides a complete library of programming APIs that can be used to build applications at a rapid pace in Java, Python, or Scala. Data scientists and developers will increase productivity with the ability to create rapid prototypes and workflows that reuse code across batch, interactive, and streaming applications. Spark jobs can require as little as one-tenth of the number of lines of code as MapReduce.
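The brevity claim is easy to illustrate with word count, the canonical MapReduce example. The plain-Python version below mirrors the flatMap-then-count style a Spark job would use, in a handful of lines where classic MapReduce needs separate mapper and reducer classes plus driver boilerplate.

```python
from collections import Counter

lines = ["spark makes big data simple", "big data big results"]

# Split every line into words (the flatMap step), then tally them
# (the reduceByKey analogue).
words = (w for line in lines for w in line.split())
counts = Counter(words)

print(counts["big"])    # 3
print(counts["data"])   # 2
```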
The version of Spark released in June 2015 includes Spark MLlib, a production-ready machine learning library that includes a set of widely used algorithms for preparing and transforming data. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
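The "pipeline" idea those higher-level APIs formalize is simply a chain of fit/transform stages applied in order. The sketch below illustrates that pattern in plain Python; the Scale stage and the class names are invented for this illustration and are not MLlib's actual API.

```python
class Scale:
    """Toy stage: learns the max of the data, then rescales to [0, 1]."""
    def fit(self, data):
        self.max = max(data)        # learn a parameter from the data
        return self

    def transform(self, data):
        return [x / self.max for x in data]

class Pipeline:
    """Toy pipeline: runs each stage's fit, then its transform, in order."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

scaled = Pipeline([Scale()]).fit_transform([1, 2, 4])
print(scaled)   # [0.25, 0.5, 1.0]
```

Packaging preprocessing and model stages behind one uniform interface is what lets a whole workflow be trained, tuned, and reused as a single object.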
Spark Does Not Replace Hadoop!
Given all of these benefits, some might wonder if Spark can completely replace Hadoop. The answer is a clear-cut and resounding no. Spark is a component, not a complete solution, and it was designed to run on top of Hadoop as a more robust alternative to the traditional batch Hadoop MapReduce model.
Spark provides an application framework to write Big Data applications, but it does not have its own file system and must populate its own Resilient Distributed Dataset (RDD) structure to process data. It needs to run in tandem with a storage or NoSQL system. To get the best out of Spark, run a Hadoop distribution that includes and supports the complete Spark stack: Spark, Spark SQL, Spark Streaming, GraphX, and MLlib.
As a part of the Hadoop ecosystem, Spark adds more capabilities to Hadoop’s core data warehousing and offline analysis strengths. Integrate Spark into a Hadoop cluster to benefit from Spark’s capabilities, including better performance for existing workloads, and the ability to run complex workloads—such as machine learning and data streaming—that were unsupportable or highly inefficient under Hadoop alone.