Batch Processing to the Rescue
Hadoop was designed to deal with this challenge in the following ways:
1. Use a distributed file system: This enables us to spread the load and grow our system as needed.
2. Optimize for write speed: The Hadoop architecture was designed so that writes are first logged and only later processed, which keeps ingestion of incoming data fairly fast.
3. Use batch processing (Map/Reduce) to balance the rate of the incoming data feeds with the processing speed (a minimal Map/Reduce sketch follows this list).
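To make the batch model concrete, here is a minimal word-count-style Map/Reduce sketch using the standard org.apache.hadoop.mapreduce API; the class names (TokenMapper, SumReducer) are ours, not part of Hadoop, and a real job would also need a driver that configures and submits it.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs over the logged input files, one split at a time.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // emit (word, 1) for every token
        }
    }
}

// Reduce phase: runs only after the map phase has finished over the whole batch.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final count per word
    }
}
```

The key point is the barrier between the two phases: nothing is reduced until the entire batch has been mapped, which is exactly the assumption that continuous feeds break.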
The challenge with batch processing is that it assumes the feeds arrive in bursts. If our data feeds arrive on a continuous basis, the assumptions behind the batch-processing architecture start to break down.
The concept of stream-based processing is fairly simple: instead of logging the data first and then processing it, we process it as it comes in.
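A minimal sketch of the idea (the class and method names here are hypothetical, purely to contrast the two approaches): rather than appending each event to a log and aggregating it later in a batch job, the state is updated the moment the event arrives.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stream-processing loop: state is updated per event as it arrives,
// instead of logging everything first and running a batch job later.
public class RunningCounter {
    private final Map<String, Long> countsByKey = new HashMap<String, Long>();

    // Called once per incoming event, at the moment it arrives.
    public void onEvent(String key) {
        Long current = countsByKey.get(key);
        countsByKey.put(key, current == null ? 1L : current + 1);
    }

    // The aggregate is always up to date -- no batch run needed to read it.
    public long countFor(String key) {
        Long current = countsByKey.get(key);
        return current == null ? 0L : current;
    }
}
```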
As in manufacturing, even if the parts are pre-packaged at the manufacturer, we still need an assembly line to put them together. In the same way, stream-based processing is not meant to replace our Hadoop system, but rather to reduce the amount of work the system has to deal with, and to make the work that does reach Hadoop easier, and thus faster, to process.
Faster Processing the Google Way: Using Stream-Based Processing Instead of Map/Reduce
Due to a lack of alternatives at the time, many Big Data systems today use Map/Reduce in areas where it was never a very good fit in the first place. A good example is using Map/Reduce to maintain a global search index: with Map/Reduce we essentially rebuild the entire index on every run, when it would make more sense to update it with changes as they come in.
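The difference is easy to see in code. Below is a hypothetical in-memory inverted index (the names are ours, not from any library): the batch approach throws the index away and rebuilds it from the full document set on every run, while the incremental approach touches only the documents that changed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical inverted index: term -> ids of the documents containing it.
public class InvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Batch style: discard the old index and rebuild it from every document.
    public void rebuild(Map<String, String> allDocuments) {
        postings.clear();
        for (Map.Entry<String, String> doc : allDocuments.entrySet()) {
            addDocument(doc.getKey(), doc.getValue());
        }
    }

    // Incremental style: apply only the new or modified document as it comes in.
    public void addDocument(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new HashSet<String>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }
}
```

A real incremental indexer also has to handle modified and deleted documents, which is where the secondary indices mentioned in the Percolator paper come in.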
Google moved a large part of its index processing from Map/Reduce to a more real-time processing model, as noted in this recent post:
So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!
…Some datasets simply never stop growing… it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.
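For reference, this is roughly what per-event processing looks like in Storm: a bolt that keeps a running count for each word as tuples stream through it. The sketch assumes the pre-Apache backtype.storm API; the bolt name and field names are ours.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A Storm bolt that updates a per-word count for every tuple as it arrives,
// rather than waiting for a batch job to run over a log of tuples.
public class WordCountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        Long count = counts.get(word);
        count = (count == null) ? 1L : count + 1;
        counts.put(word, count);
        // Downstream bolts see the updated count immediately.
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```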
Final Notes
We can make our Hadoop system run faster by pre-processing some of the work before it reaches Hadoop. We can also move the workloads for which batch processing isn't a good fit out of the Hadoop Map/Reduce system and into stream processing, as Google did.
Interestingly enough, I recently found out that Twitter Storm now offers an option to integrate an in-memory data store into Storm through the Trident-State project. The combination of the two makes a lot of sense, and it is something we're currently looking at, so stay tuned.
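As a rough sketch of where that integration point sits (using Trident's built-in MemoryMapState and toy FixedBatchSpout from the Trident tutorial as stand-ins; a real deployment would plug its own in-memory data store in behind a StateFactory):

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;

public class TridentStateSketch {
    public static void main(String[] args) {
        // A toy spout that replays a few tuples; a real feed would come from a queue.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("word"), 3,
                new Values("apple"), new Values("banana"), new Values("apple"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        // persistentAggregate keeps the running counts in whatever State the
        // StateFactory provides -- here an in-process map, but this is the hook
        // where an external in-memory data store would be plugged in.
        TridentState wordCounts = topology.newStream("words", spout)
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));
        // The topology would then be built and submitted to a cluster as usual.
    }
}
```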