As the leader of the GigaSpaces quality team, I often find myself struggling to reproduce a customer issue in our environment.
When a customer submits an issue, the first priority is to track it down and solve it. But my role is also to make sure that the issue does not recur in future versions, which means I have to create a test scenario (preferably automated) to verify the fix.
Distributed applications make this difficult: so many factors determine the behavior of a running system that it can be very hard to reproduce a customer-reported issue in our local environment and track it down.
It usually involves a tedious manual process of starting multiple nodes at specific times, digging into the logs of each node, etc.
That’s why I am so excited about our new administration and monitoring API.
XAP 7.0 will provide a powerful new API that lets you automate, monitor and control the entire GigaSpaces XAP runtime environment.
It has neat features like starting GigaSpaces infrastructure components, deploying and undeploying processing units, relocating running instances to a different machine, monitoring host, JVM and application level information, and much more.
The main motivation behind this API was to allow our customers to better monitor their running system and proactively change the runtime state of the application when needed (e.g. create additional processing unit instances when load increases). But another great use for it is to help us track down and reproduce customer issues.
Recently, one of our customers found a very intermittent and unpredictable bug related to failover. The customer lost a processing unit instance after a very intensive failover test scenario.
To reproduce this slippery bug we used the new Admin API and managed to create a full-blown system test in less than 10 minutes. Another nice feature of this API is its Groovy bindings (personally, I preferred Groovy in this case because of its closure support and the ability to create event listeners very easily).
The scenario included starting 4 Grid Service Containers (GSCs) and 1 Grid Service Manager (GSM) on each of the two test machines. We deployed a partitioned space with 4 primary partitions and 1 backup for each.
Then we terminated all the processes on one of the machines every minute until the problem was reproduced. We did this by using the new GigaSpaces Agent component which enables the API user to start GSCs and GSMs.
I ran the agent on each machine and then, using the API, started 2 global lookup services, 2 global GSMs and 4 local (i.e. on each machine) GSCs. I then ran the following Groovy script to reproduce the issue:
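import java.util.concurrent.TimeUnit
import org.openspaces.admin.AdminFactory
import org.openspaces.admin.gsa.GridServiceContainerOptions
import org.openspaces.admin.gsa.GridServiceManagerOptions
import org.openspaces.admin.gsc.GridServiceContainer
import org.openspaces.admin.gsm.GridServiceManager
import org.openspaces.admin.pu.ProcessingUnit
import org.openspaces.admin.pu.ProcessingUnitDeployment

// create an Admin instance
admin = new AdminFactory().addGroup("myGroup").createAdmin()
admin.machines.waitFor 2 // wait for 2 machines
admin.gridServiceAgents.waitFor 2 // wait for 2 gs agents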
…
// Do the following for each machine (i is the machine index):
// run 4 GSCs:
GridServiceContainer gsc1 = admin.machines.machines[i].gridServiceAgent.
        startGridServiceAndWait(new GridServiceContainerOptions())
println "Started GSC PID " + gsc1.getVirtualMachine().getDetails().getPid() +
        " On machine " + admin.machines.machines[i].getHostName()
// start 3 more GSCs similar to the above
…
// run 1 GSM:
admin.machines.machines[i].gridServiceAgent.
startGridServiceAndWait(new GridServiceManagerOptions())
…
// deploy the processing unit
ProcessingUnit processingUnitX = admin.gridServiceManagers.deploy(
        new ProcessingUnitDeployment("MyProcessingUnit").
                numberOfInstances(4).numberOfBackups(1))
processingUnitX.waitFor 8 // 4 primaries + 4 backups
println "Instances ${processingUnitX.numberOfInstances}, Backups ${processingUnitX.numberOfBackups}"
for (int i = 0; i < 200; i++) {
    println "Starting Round " + (i + 1)
    // kill all GSCs on one machine, alternating between the two machines
    admin.machines.machines[i % 2].gridServiceContainers.each { GridServiceContainer gsc ->
        println "Kill GSC with PID " + gsc.getVirtualMachine().getDetails().getPid()
        gsc.kill()
    }
    // kill the GSM on the same machine
    admin.machines.machines[i % 2].gridServiceManagers.each { GridServiceManager gsm ->
        println "Kill GSM with PID " + gsm.getVirtualMachine().getDetails().getPid()
        gsm.kill()
    }
    // restart the GSM and the 4 GSCs on that machine
    load1GSM(i % 2)
    load4GSCs(i % 2)
    // fail the test if the 8 instances are not all running again within 60 seconds
    assert admin.processingUnits.getProcessingUnit("MyProcessingUnit").waitFor(8, 60, TimeUnit.SECONDS)
}
admin.close()
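The script calls load1GSM and load4GSCs, whose bodies are not part of the listing. Here is a minimal sketch of what such helpers might look like. It assumes they simply repeat the agent start calls used earlier in the script, and that admin is a script-level variable (it is assigned without a type or def, so Groovy places it in the script binding, where script methods can see it). The bodies are an illustration, not the original code:

```groovy
import org.openspaces.admin.gsa.GridServiceContainerOptions
import org.openspaces.admin.gsa.GridServiceManagerOptions

// Hypothetical helper: start one GSM on the given machine through its agent.
// Relies on 'admin' being defined in the script binding.
def load1GSM(int machineIndex) {
    def agent = admin.machines.machines[machineIndex].gridServiceAgent
    def gsm = agent.startGridServiceAndWait(new GridServiceManagerOptions())
    println "Started GSM PID " + gsm.getVirtualMachine().getDetails().getPid()
}

// Hypothetical helper: start four GSCs on the given machine through its agent.
def load4GSCs(int machineIndex) {
    def agent = admin.machines.machines[machineIndex].gridServiceAgent
    4.times {
        def gsc = agent.startGridServiceAndWait(new GridServiceContainerOptions())
        println "Started GSC PID " + gsc.getVirtualMachine().getDetails().getPid()
    }
}
```

Because startGridServiceAndWait blocks until the new process is discovered, each round of the loop only reaches the waitFor assertion once the killed services have been started again.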
Summary
As you can see, this API gives the user a lot of power: you can programmatically test, manage and monitor your application. The issue was reproduced and solved quickly, to the satisfaction of the customer. The scenario has also been added as an automated test to our regular regression test cycle, which runs daily.
From a quality perspective, we now have a silver bullet for creating full-blown system tests in a very easy manner. In fact, our new testing platform is based on this API and the GigaSpaces agents.
This allows us to take a customer’s application and reproduce the exact scenario in a few minutes.