In a stream processing model, data is processed as it arrives at the Big Data system. In a batch processing model, the system runs analytics only after the data has been stored – typically through a map/reduce style of processing.
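The difference between the two models can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not any particular system's API: the batch version runs a map/reduce-style word count over data that is already stored, while the stream version updates the same result incrementally as each record arrives.

```python
def batch_word_count(stored_records):
    """Batch model: the data set is already stored; process it all at once."""
    counts = {}
    for record in stored_records:                # "map" over every stored record
        for word in record.split():
            counts[word] = counts.get(word, 0) + 1  # "reduce" by key
    return counts

def stream_word_count(record, counts):
    """Stream model: fold each record into the running result as it arrives."""
    for word in record.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# The stream version, fed the same records one by one, converges on the
# same answer as the batch version - but the result is available continuously.
stored = ["big data systems", "stream data"]
live_counts = {}
for record in stored:                            # simulate records arriving over time
    stream_word_count(record, live_counts)
assert live_counts == batch_word_count(stored)
```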
Facebook and Twitter are good examples of stream processing models in use today, as I outlined in a previous post entitled Facebook’s vs Twitter’s Approach to Real-Time Analytics.
According to a recent survey of approximately 250 respondents, there is a clear trend toward stream processing as a way to speed up analytics, with a significant jump in the model's popularity compared with last year: the share of organizations planning to use stream processing in 2014 has more than doubled, from 10% last year to 24%.
This goes hand in hand with the fact that real-time analytics is also becoming more mainstream, as I pointed out in my 2014 Predictions.
Another interesting data point from the survey is that roughly 43% of respondents defined real-time as sub-second, while 42% defined it as sub-minute. This split is notable because the two definitions can lead to quite different approaches to implementing stream processing.
For example, real-time is defined differently by Facebook and Twitter. Twitter chose Storm as its event processing engine, allowing it to process events at sub-second resolution. Facebook, meanwhile, defines real-time as a 30-second batch window and chose a logging-based approach to fit that need.
Based on the survey, it appears that both approaches are valid, and the choice between them can be made according to how real-time your analysis needs to be.
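The two approaches above can be sketched side by side. This is a hypothetical illustration of the pattern, not either company's actual code: the event-centric version handles each event the moment it arrives, while the log-centric version buffers events and flushes them as one batch per time window (e.g. 30 seconds).

```python
import time

def process_per_event(event, handle):
    """Event-centric (Storm-style): handle each event immediately,
    giving sub-second latency per event."""
    handle(event)

class MicroBatcher:
    """Log-centric (Facebook-style): buffer incoming events and flush
    them as a single batch once the time window has elapsed."""
    def __init__(self, window_seconds, handle_batch):
        self.window = window_seconds          # e.g. 30.0 for a 30-second window
        self.handle_batch = handle_batch
        self.buffer = []
        self.last_flush = time.monotonic()

    def on_event(self, event):
        self.buffer.append(event)
        if time.monotonic() - self.last_flush >= self.window:
            self.flush()

    def flush(self):
        if self.buffer:
            self.handle_batch(self.buffer)    # one call per window, not per event
            self.buffer = []
        self.last_flush = time.monotonic()
```

The trade-off is latency versus per-event overhead: the per-event path pays a handler call for every event, while the micro-batch path amortizes that cost across the window at the price of up to a window's worth of delay.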
[Figure: Facebook Log-Centric Stream Processing vs. Twitter Event-Centric Stream Processing]
In-Memory Data Grids become a more popular choice for real-time processing
According to the survey, 64% of respondents indicated that they plan to use In-Memory-based solutions to deliver their real-time analytics processing. This is consistent with last year's survey by Ventana Research.
I believe we will see an even bigger movement in this direction, as the cost/performance ratio of In-Memory Data Grids and In-Memory Databases improves significantly through the combination of RAM and flash devices, which together offer a compelling alternative to disk-based solutions on both performance and cost.
Big Data in the Cloud increased to 56%
New developments in cloud infrastructure, such as bare-metal support in the new OpenStack releases, along with high-memory and flash-disk instance options, are removing many of the technical barriers to running I/O-intensive workloads like Big Data in the cloud.
Indeed, according to the survey, there is a significant shift towards the use of Big Data in the cloud, with 56% of organizations either already running or planning to run their Big Data in the cloud in 2014.
Where do we go from here? How does this affect GigaSpaces’ future roadmap?
There are multiple areas in the GigaSpaces roadmap that aim to address this demand.
Real-Time Processing through Storm Integration – In this project, we integrate Storm on top of an in-memory backend that serves as both the stream-processing state and the data store.
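The idea behind this integration can be sketched as follows. All class and method names here are hypothetical illustrations (not XAP's or Storm's actual API): a Storm-style bolt processes one tuple at a time and writes its running state directly into an in-memory store, so the same memory-resident data serves both the processing pipeline and analytics queries, with no separate disk-backed store in the hot path.

```python
class InMemoryStore:
    """Stands in for an in-memory data grid: holds both the bolt's
    running state and the queryable results (hypothetical sketch)."""
    def __init__(self):
        self.data = {}

    def increment(self, key):
        self.data[key] = self.data.get(key, 0) + 1

    def read(self, key):
        return self.data.get(key, 0)

class CountBolt:
    """Storm-style bolt: processes one tuple at a time, persisting its
    state straight into the in-memory store rather than to disk."""
    def __init__(self, store):
        self.store = store

    def execute(self, word):
        self.store.increment(word)

store = InMemoryStore()
bolt = CountBolt(store)
for word in ["storm", "storm", "grid"]:
    bolt.execute(word)
# Analytics can now query the same memory-resident state immediately.
```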
Support for Flash Disk – Integration with flash storage is planned for our upcoming XAP release and will include support for SanDisk, Fusion-io, and other flash devices.
Big Data in the Cloud through Cloudify – We’ve been continuously extending our Big Data portfolio and recently added support for Storm and Cognos, alongside our existing support for Hadoop, MongoDB, Cassandra, ElasticSearch, etc.