Making Hadoop Run Faster


5 min. read
Nati Shalom August 21, 2012
One of the challenges in processing data is that the speed at which data arrives is often much faster than the speed at which we can process it. The problem becomes even more pronounced in the context of Big Data, where the volume of data keeps growing, the need for insights grows with it, and the processing required therefore becomes ever more complex.

Batch Processing to the Rescue
Hadoop was designed to deal with this challenge in the following ways:
1. Use a distributed file system: this lets us spread the load and grow the system as needed.
2. Optimize for write speed: writes are first logged and only later processed, which keeps ingestion fast.
3. Use batch processing (Map/Reduce) to balance the rate of the incoming data feeds against the processing speed, as the word-count sketch below illustrates.
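To make the batch model concrete, here is the canonical word count against Hadoop's mapreduce API. It is a minimal sketch: the input and output paths are illustrative, and a real job would add input validation and tuning.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A classic batch job: it reads everything that has accumulated in the
// distributed file system and processes it in one pass. No result is
// visible until the whole job completes.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);              // emit (word, 1) per token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // one total per word
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the day's raw logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. the day's report
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}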
Batch Processing Challenges
The challenge with batch processing is that it assumes the feeds come in bursts. If our data arrives continuously, the entire assumption and architecture behind batch processing start to break down.
If we widen the batch window, we add latency between the time data comes in and the time it actually reaches our reports and insights. Moreover, the available hours are finite: in many systems the batch runs once a day, on the assumption that most of the processing can be done during off-peak hours. But as the volume grows, the time it takes to process the data grows with it, until it hits the limit of the hours in a day and we are left with a continuously growing backlog. In addition, if we hit a failure during processing, we may not have enough time left to re-process.
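A back-of-the-envelope sketch of that failure mode; the 18-hour starting duration and the 1% daily growth below are made-up illustration numbers, not measurements:

// Hypothetical illustration: a nightly batch that initially takes 18 hours,
// with input volume (and hence runtime) growing 1% per day. Once the runtime
// crosses the 24-hour window, the backlog can only grow.
public class BatchWindow {
  public static void main(String[] args) {
    double hours = 18.0;        // assumed initial batch duration
    final double growth = 1.01; // assumed 1% daily growth in data volume
    for (int day = 1; day <= 365; day++) {
      hours *= growth;
      if (hours > 24.0) {
        System.out.printf("Day %d: the batch needs %.1f h, more than the 24 h window%n",
                          day, hours);
        break;
      }
    }
  }
}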
Speed Things Up Through Stream-Based Processing
The concept of stream-based processing is fairly simple. Instead of logging the data first and then processing it, we can process it as it comes in.
A good analogy is a manufacturing pipeline. Think about a car manufacturing pipeline: compare a process in which all the raw parts are collected at the plant and assembled piece by piece, versus one in which each unit is pre-packaged at the supplier, so that only pre-packaged parts are sent to the assembly line. Which method is faster?
Data processing is just like any pipeline. Putting stream-based processing at the front is analogous to pre-packaging our parts before they reach the assembly line, which in our case is the Hadoop batch processing system.
As in manufacturing, even if we pre-package the parts at the supplier, we still need an assembly line to put them all together. In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work it has to handle, and to make the work that does reach Hadoop easier, and thus faster, to process.
An in-memory system makes a good foundation for stream processing; as Curt Monash points out in his research, traditional databases will eventually end up in RAM. An example of how this can work in the context of real-time analytics for Big Data is provided in this case study, where we demonstrate processing Twitter feeds with stream-based processing that then feeds a Big Data database serving the historical aggregated view, as described in the diagram below.
[Diagram: Twitter feeds flowing through stream-based processing into a Big Data database that serves the historical aggregated view]
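A minimal sketch of the idea behind that case study; the HashtagCounter class below is a hypothetical stand-in for the stream processor, which in a real deployment would run inside an in-memory data grid or a stream processing framework:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Instead of logging raw tweets and counting them later in a batch job,
// keep running counts as events arrive and periodically flush the compact
// aggregates to the Big Data store that serves the historical view.
public class HashtagCounter {
  private final Map<String, Long> counts = new ConcurrentHashMap<>();

  // Called once per tweet, as it arrives on the stream.
  public void onTweet(String text) {
    for (String token : text.split("\\s+")) {
      if (token.startsWith("#")) {
        counts.merge(token.toLowerCase(), 1L, Long::sum);
      }
    }
  }

  // Called on a timer: ship small aggregates downstream instead of raw events.
  // (A sketch only; a production version would need an atomic snapshot-and-reset.)
  public Map<String, Long> flush() {
    Map<String, Long> snapshot = new ConcurrentHashMap<>(counts);
    counts.clear();
    return snapshot; // the "pre-packaged parts" handed to the batch layer
  }
}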
Faster Processing the Google Way: Using Stream-Based Processing Instead of Map/Reduce
Due to a lack of alternatives at the time, many Big Data systems today use Map/Reduce in areas where it was never a good fit in the first place. A good example is using Map/Reduce to maintain a global search index: with Map/Reduce we basically rebuild the entire index on every run, when it would make far more sense to update it with changes as they come in.
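The difference is easy to see in miniature. The toy in-memory inverted index below is not Percolator's actual API, just an illustration: the incremental path pays a cost proportional to the single changed document, while the rebuild path pays for the entire corpus on every run.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy inverted index: term -> ids of the documents containing it.
public class IncrementalIndex {
  private final Map<String, Set<String>> index = new HashMap<>();
  private final Map<String, String> docs = new HashMap<>(); // id -> text

  // Incremental path: touch only the postings of the one changed document.
  public void upsert(String docId, String text) {
    remove(docId); // drop stale postings for this document
    docs.put(docId, text);
    for (String term : text.toLowerCase().split("\\W+")) {
      if (!term.isEmpty()) {
        index.computeIfAbsent(term, t -> new HashSet<>()).add(docId);
      }
    }
  }

  public void remove(String docId) {
    String old = docs.remove(docId);
    if (old == null) {
      return;
    }
    for (String term : old.toLowerCase().split("\\W+")) {
      Set<String> postings = index.get(term);
      if (postings != null) {
        postings.remove(docId);
      }
    }
  }

  // Batch path, for contrast: re-index every document even if only one changed.
  public void rebuildAll() {
    index.clear();
    Map<String, String> snapshot = new HashMap<>(docs);
    docs.clear();
    snapshot.forEach(this::upsert);
  }
}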
Google moved a large part of its index processing from Map/Reduce to a more real-time processing model, as noted in this recent post:

So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!

…Some datasets simply never stop growing… it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.

Final Notes
We can make our Hadoop system run faster by pre-processing some of the work before it ever reaches Hadoop. We can also move the workloads for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and handle them with stream processing, as Google did.

Interestingly enough, I recently found out that Twitter Storm now offers an option to integrate an in-memory data store into Storm through the Trident-State project. The combination of the two makes a lot of sense, and it's something we're currently looking at right now, so stay tuned.
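For reference, here is roughly what that combination looks like, adapted from the public Trident word-count example. MemoryMapState is Trident's built-in in-memory state, and the point of Trident-State is that a data-grid-backed state factory can plug in at the same spot; the package names are those of the Storm releases current at the time of writing.

import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

// Streaming word count whose running totals live in a pluggable state store.
// Swapping MemoryMapState.Factory for a grid-backed factory is exactly the
// kind of integration the Trident-State project enables.
public class StreamingWordCount {

  // Splits each sentence tuple into one tuple per word.
  public static class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
      for (String word : tuple.getString(0).split(" ")) {
        collector.emit(new Values(word));
      }
    }
  }

  public static TridentState buildWordCounts(TridentTopology topology) {
    // A test spout that replays a few sentences forever, standing in for a live feed.
    FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("four score and seven years ago"));
    spout.setCycle(true);

    return topology.newStream("sentences", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), // in-memory state
                             new Count(),
                             new Fields("count"));
  }
}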
