APAC CIOOutlook

    Sparking Up Apache Hadoop

    Jim Scott, Director of Enterprise Strategy & Architecture, MapR Technologies


    Everyone who works with Big Data is looking for an easier, faster way to derive more value from their projects. Offering up to 100 times the performance of the current default processing framework, Apache Spark is rapidly becoming the preferred way to achieve that goal.


    Apache Spark is a general-purpose compute engine that was specifically architected to process Big Data as efficiently as possible. The previous default processing framework, Hadoop MapReduce, is a solid performer, but its decade-old technology is struggling to keep up with current Big Data demands. One noticeable issue is MapReduce’s slow batch processing, which bogs down under a heavy flow of real-time data.

    Spark delivers a measurable performance uplift and enables running batch, interactive, and streaming jobs on the cluster using the same unified framework. It supports rapid application development for Big Data and allows for code reuse across applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance.

    Before we dig a little deeper into the details of these features, let’s take a look at a few key Apache Spark concepts.

    Resilient Distributed Datasets (RDDs) represent the data coming into a system in an object format that allows computations to be run on it. Spark provides a simple programming abstraction, allowing developers to design applications as operations on RDDs. RDDs are spread across the cluster and can be stored in memory or on disk. Spark uses the RDD model to transparently store data in memory and persist it to disk only when necessary. Reducing disk reads and writes noticeably speeds up data processing: applications in Hadoop clusters run up to 100 times faster in memory, and 10 times faster even when running on disk.
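To make the idea of a dataset "spread across the cluster" concrete, here is a minimal sketch in plain Python. It is not Spark code: the helper names `partition` and `map_partitions` are hypothetical, and the "cluster" is simply a list of in-memory slices. It only illustrates the pattern of computing on each partition separately and then combining the partial results.

```python
from typing import Callable, List


def partition(data: List[int], num_partitions: int) -> List[List[int]]:
    """Split data into num_partitions slices (round-robin assignment)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(parts: List[List[int]],
                   fn: Callable[[int], int]) -> List[List[int]]:
    """Apply fn to every element, one partition at a time."""
    return [[fn(x) for x in part] for part in parts]


# Hypothetical data set held entirely in memory, split into 3 partitions.
parts = partition(list(range(10)), 3)
squared = map_partitions(parts, lambda x: x * x)

# Each partition contributes a partial sum; the partial sums are then combined.
partial_sums = [sum(part) for part in squared]
total = sum(partial_sums)
print(parts, partial_sums, total)
```

In a real cluster each partition would live on a different node, and only the small partial results would need to travel over the network.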

    Transformations are operations applied to RDDs that produce new RDDs. Examples of transformations include map, filter, and groupByKey.

    Actions are requests for answers from the system. Spark uses lazy evaluation, so RDDs are loaded and pushed into the system only when there is an action to be performed (in contrast with eager or greedy evaluation).
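The split between transformations and actions can be sketched with a toy class in plain Python. This is an illustration only, assuming a hypothetical `ToyRDD` class and a hypothetical `load` function. It is not Spark's implementation, but it mimics the behavior described above: `map` and `filter` only record work, and the data source is read when an action such as `collect` runs.

```python
from typing import Any, Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        self._source = source  # called only when an action runs

    # Transformations: return a new ToyRDD without touching the data.
    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: pull data through the recorded chain and return an answer.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


loads: List[str] = []  # records every time the source is actually read


def load() -> Iterable[int]:
    loads.append("read")
    return range(1, 11)


rdd = ToyRDD(load)
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
loads_before_action = len(loads)  # still 0: nothing has been read yet

result = evens_squared.collect()  # the action triggers the whole chain
loads_after_action = len(loads)   # now 1
print(loads_before_action, result, loads_after_action)
```

Deferring the work until an action is requested is what gives an engine the chance to look at the whole chain of transformations before deciding how to run it.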

    Apache Spark's Big Data Benefits

    Spark adds new speed to Big Data across the spectrum from programming applications to performance.

    Spark offers in-memory performance and combines directed and streaming workflows for operational and analytical workloads on a single cluster in a high-performing, highly scalable way. Leverage the complete Spark stack to build complex ETL pipelines that can merge streaming, machine learning, and SQL operations all in one program.

    Spark optimizes both how computations are executed and where they are placed by using a Directed Acyclic Graph (DAG). Its general-purpose execution framework with in-memory pipelining speeds up end-to-end application performance. For many applications, this results in a performance improvement of five to 100 times. Batch applications run 10 to 100 times faster in production environments. Spark’s caching system makes it well suited to highly iterative jobs.
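Why caching helps iterative jobs can be shown with a small plain-Python sketch. Everything here is hypothetical (the `CachedDataset` class, the `slow_load` function, and the toy "refine the estimate" loop), and it does not use Spark. It only counts how often the data source is read when the same dataset is reused across iterations, with and without an in-memory cache.

```python
from typing import Callable, List, Optional, Tuple


class CachedDataset:
    """Toy dataset that optionally keeps its loaded records in memory."""

    def __init__(self, loader: Callable[[], List[float]], cache: bool):
        self._loader = loader
        self._cache_enabled = cache
        self._cached: Optional[List[float]] = None

    def records(self) -> List[float]:
        if self._cache_enabled and self._cached is not None:
            return self._cached          # served from memory
        data = self._loader()            # simulated expensive read
        if self._cache_enabled:
            self._cached = data
        return data


def run_iterations(cache: bool, iterations: int = 5) -> Tuple[int, float]:
    reads = 0

    def slow_load() -> List[float]:
        nonlocal reads
        reads += 1
        return [1.0, 2.0, 3.0, 4.0]

    dataset = CachedDataset(slow_load, cache)
    estimate = 0.0
    for _ in range(iterations):
        # Iterative refinement: move the estimate halfway toward the mean.
        data = dataset.records()
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return reads, estimate


reads_uncached, est_uncached = run_iterations(cache=False)
reads_cached, est_cached = run_iterations(cache=True)
print(reads_uncached, reads_cached, est_cached)
```

Both runs produce the same estimate, but the cached run reads the source once instead of once per iteration, which is the effect that matters when each read means going back to disk.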

    Additionally, Spark provides a complete library of programming APIs that can be used to build applications at a rapid pace in Java, Python, or Scala. Data scientists and developers can increase productivity by creating rapid prototypes and workflows that reuse code across batch, interactive, and streaming applications. Spark jobs can require as little as one-tenth as many lines of code as MapReduce.
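Part of the reason such jobs can be short is the functional style: a job is written as a chain of small operations on a collection. The sketch below is a word count written in that style in plain Python. It is not Spark code; `reduce_by_key` is a hypothetical helper named after the flatMap, map, and reduceByKey shape that a word count usually takes, and the input lines are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Tuple


def reduce_by_key(pairs: Iterable[Tuple[str, int]],
                  combine: Callable[[int, int], int]) -> Dict[str, int]:
    """Merge all values that share a key using the combine function."""
    result: Dict[str, int] = {}
    for key, value in pairs:
        result[key] = combine(result[key], value) if key in result else value
    return result


# Hypothetical input lines.
lines = ["spark runs on hadoop", "spark runs in memory"]

words = (word for line in lines for word in line.split())  # "flatMap" step
pairs = ((word, 1) for word in words)                      # "map" step
counts = reduce_by_key(pairs, lambda a, b: a + b)          # "reduceByKey" step
print(counts)
```

The whole job is three short steps, each describing what should happen to the data rather than how to schedule it.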

    The version of Spark released in June 2015 includes Spark MLlib, a production-ready machine learning pipeline with a set of widely used algorithms for preparing and transforming data. MLlib consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.

    Spark Does Not Replace Hadoop!

    Given all of these benefits, some might wonder if Spark can completely replace Hadoop. The answer is a clear-cut and resounding no. Spark is a component, not a complete solution, and it was designed to run on top of Hadoop as a more robust alternative to the traditional batch Hadoop MapReduce model.

    Spark provides an application framework for writing Big Data applications, but it does not have its own file system and must populate its own Resilient Distributed Dataset (RDD) structure to process data. It needs to run in tandem with a storage or NoSQL system. To get the best out of Spark, run a Hadoop distribution that includes and supports the complete Spark stack: Spark, Spark SQL, Spark Streaming, GraphX, and MLlib.

    As a part of the Hadoop ecosystem, Spark adds more capabilities to Hadoop’s core data warehousing and offline analysis strengths. Integrate Spark into a Hadoop cluster to benefit from Spark’s capabilities, including better performance for existing workloads, and the ability to run complex workloads—such as machine learning and data streaming—that were unsupportable or highly inefficient under Hadoop alone.
