I’ve been involved recently with a POC where we had to load a large amount of data into the Data Grid and later perform some complex queries. I took a flight from NY to LA to present the POC and had only a few hours on the plane to build the POC code from scratch. Building the code that performs the queries was easy: just create the relevant space domain POJO classes (based on the database tables the prospect provided), implement an executor that performs the SQL query and returns the relevant data set, and finally reduce it on the client side. The ABC of Map-Reduce – cool.
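The query side can be sketched in plain Java. This is only an illustration of the pattern, not the actual Data Grid API: the `Quote` class, `executeOnPartition` and `reduce` names are all hypothetical – in the real POC the task runs remotely inside each partition, and here the "partitions" are just local lists.

```java
import java.util.ArrayList;
import java.util.List;

public class QuerySketch {

    // A market-data entry as it might look in the space (hypothetical POJO).
    static class Quote {
        final String currency;
        final double price;
        Quote(String currency, double price) { this.currency = currency; this.price = price; }
    }

    // The "task" side of the pattern: each partition runs the same query
    // against its own local data and returns a partial result.
    static List<Quote> executeOnPartition(List<Quote> partitionData, String currency) {
        List<Quote> matches = new ArrayList<>();
        for (Quote q : partitionData) {
            if (q.currency.equals(currency)) matches.add(q);
        }
        return matches;
    }

    // The client-side reduce step: merge the partial results from all partitions.
    static List<Quote> reduce(List<List<Quote>> partialResults) {
        List<Quote> merged = new ArrayList<>();
        for (List<Quote> part : partialResults) merged.addAll(part);
        return merged;
    }
}
```

The point of the pattern is that the filtering work happens next to the data in each partition, and only the (much smaller) partial results travel back to the client for the reduce step.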
The problem was how to load a large amount of data without the hassle of creating a database and loading its contents into the Data Grid. A simple solution I managed to build very quickly was a data generator utility that simply pushes data into the Data Grid. The nice thing here is that I could adapt this data generator to push data into a specific Data Grid partition based on a given partition ID. This is how I could imitate what would happen when a remote client writes data into the Data Grid, or when data is loaded from a database into the Data Grid once it is started.
The idea was simple: since the data was partitioned based on the Currency field (it was a market data application POC), I created groups of currencies (based on their hash codes) that belong to the same logical partition. The number of groups was identical to the number of Data Grid partitions (identified at runtime – so the code was totally dynamic). The data generator would pick a random currency from the list of currencies belonging to a specific group. The only thing that needed to be passed to the data generator was the partition ID.
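The grouping step can be sketched like this. The exact routing function is grid-specific, so this assumes the common `abs(hashCode) % partitionCount` scheme; the class and method names are mine, not part of any Data Grid API:

```java
import java.util.ArrayList;
import java.util.List;

public class CurrencyPartitioner {

    // Map a currency to its logical partition, assuming the grid routes
    // on abs(hashCode) % partitionCount (a common scheme, but grid-specific).
    static int partitionFor(String currency, int partitionCount) {
        return Math.abs(currency.hashCode() % partitionCount);
    }

    // Build one currency group per partition; the generator later picks
    // random currencies from the group matching its target partition ID.
    static List<List<String>> groupByPartition(List<String> currencies, int partitionCount) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            groups.add(new ArrayList<>());
        }
        for (String c : currencies) {
            groups.get(partitionFor(c, partitionCount)).add(c);
        }
        return groups;
    }
}
```

Since the partition count is discovered at runtime, the same grouping code works unchanged whether the grid has 4 partitions or 40.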
To trigger the data load process I used… yes – an Executor implementation. Since the Task implementation runs within the partition, it can retrieve its “hosting” partition ID and pass it to the data generator to create data that fits the hosting partition.
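A rough sketch of the generator side, again with illustrative names rather than the actual Task API: the task learns its hosting partition ID and asks the generator only for currencies from that partition’s group, so every generated object’s routing field keeps the write local.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LoadTaskSketch {

    private static final Random RANDOM = new Random();

    // Pick a random currency from the group belonging to the given partition,
    // so the generated object routes back to the partition running the task.
    static String randomCurrencyFor(int partitionId, List<List<String>> currencyGroups) {
        List<String> group = currencyGroups.get(partitionId);
        return group.get(RANDOM.nextInt(group.size()));
    }

    // The generator loop: build 'count' routing values for the hosting
    // partition (just the currency strings here, for brevity; the real
    // generator would build full market-data objects and write them).
    static List<String> generate(int partitionId, int count, List<List<String>> currencyGroups) {
        List<String> routingValues = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            routingValues.add(randomCurrencyFor(partitionId, currencyGroups));
        }
        return routingValues;
    }
}
```

Because every partition runs its own copy of the task in parallel and never writes outside itself, the load scales with the number of partitions instead of bottlenecking on a single client.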
With the above technique I managed to load a large amount of data (a few million objects) within a few seconds (pushing about 100,000 objects per second into a Data Grid with 4 partitions) on my dual-core Dell laptop.