I recently created an integration with Sun Grid Engine. This was easy to do, requiring only a listener for the JMX events that our product produces.

As you may know, with GigaSpaces you can create watches for the services you deploy. These watches are populated with information coming from getter methods on those services. For example, you could have a service that exposes the method getBacklog(), which returns a long.
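As a rough illustration of such a service, here is a minimal sketch in plain Java. The class name AdaptiveOrderProcessor (borrowed from the service named in this post), the queue-based design, and the submit/processNext methods are all assumptions made for the example; only the getBacklog() getter comes from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical service: the class name and the queue-based design are
// assumptions for illustration, not code from the actual integration.
class AdaptiveOrderProcessor {
    // Orders that have been accepted but not yet processed.
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    // Accept a new order for later processing.
    void submit(String orderId) {
        pending.add(orderId);
    }

    // Process one pending order; returns false if there was nothing to do.
    boolean processNext() {
        return pending.poll() != null;
    }

    // The getter a watch could sample: the number of orders still waiting.
    public long getBacklog() {
        return pending.size();
    }
}
```

The point of the sketch is only that the watched value is an ordinary getter returning a long; how busy the service is can be read off without any extra instrumentation.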
Then in your pu.xml you set up the watch for that property:

<os-sla:sla max-instances-per-vm="2">
    <os-sla:scale-up-policy monitor="backlog" max-instances="12" high="200" />
    <os-sla:cpu high=".65" />
    <os-sla:memory high=".75" />
</os-sla:sla>

In this example, when the value returned from the getBacklog() method goes over 200, it triggers a scaling event, which adds another instance of the service named adaptiveOrderProcessor to the running system.

NOTE: the scaling event affects the entire population of the processing unit of which that service is a part. To scale one service only, you must define that service alone in its own pu.xml file.

If (as in this example) the service has a limit on the number of instances that can run in a single GSC, then at some point the available GSCs will not be enough to host all the possible instances. The per-GSC limit is expressed here:

<os-sla:sla max-instances-per-vm="2">

and the overall ceiling here:

<os-sla:scale-up-policy monitor="backlog" max-instances="12" high="200" />

We state that we want a maximum of 12 instances, with at most 2 per GSC, so we need 6 GSCs to host them all.

What I did with Sun was to write a JMX listener that listens for "ProvisionerFailureEvents", which are created by the GSM. A scaling, relocation, or failover event provokes the GSM to seek a host for the service. When the GSM cannot find a suitable host because there are not enough GSCs running, it sends out a ProvisionerFailureEvent, which is what my code listens for. When the event hits, my code simply calls the API of the grid technology in question and asks for the creation of a new GSC.

In other words, the service in question says: "Help me, I must relocate", or "Help me, I must fail over", or "Help me, I must have more of me running on the network because one of me is not enough!" The GSM says, "I will start you somewhere else..." but then the GSM says, "Oh, golly! There is nowhere else to start you!" And then the GSM says, "Help me, someone!" and sends the ProvisionerFailureEvent to JMX, hoping some force in the universe will care.
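The listener described above can be sketched with the standard javax.management API. This is not the code from the actual integration: the notification type string, the GridProvisioner interface, and its requestNewContainer() method are assumptions made for illustration, and the real type emitted by the GSM should be checked before relying on the filter.

```java
import javax.management.Notification;
import javax.management.NotificationListener;

// Hypothetical hook into the grid manager's API, e.g. a call that asks
// the grid to start one more GSC. Not part of any real product API.
interface GridProvisioner {
    void requestNewContainer();
}

// Reacts to provisioning failures by asking the grid for a new container.
class ProvisionFailureListener implements NotificationListener {
    // Assumed notification type; verify what the GSM actually emits.
    static final String FAILURE_TYPE = "ProvisionerFailureEvent";

    private final GridProvisioner provisioner;

    ProvisionFailureListener(GridProvisioner provisioner) {
        this.provisioner = provisioner;
    }

    @Override
    public void handleNotification(Notification notification, Object handback) {
        String type = notification.getType();
        // Ignore every notification that is not a provisioning failure.
        if (type != null && type.contains(FAILURE_TYPE)) {
            provisioner.requestNewContainer();
        }
    }
}
```

To receive real events, the listener would be registered with MBeanServerConnection.addNotificationListener(name, listener, null, null), where name is the ObjectName of the GSM's MBean. That name is not given in this post and would need to be looked up, for example in a JMX console.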
Once the universe shows an interest and starts a new GSC, the GSM will retry the scaling, failover, or relocation effort and use that new resource, allowing the declared SLA to be satisfied.

Bottom line: it is simple to integrate GigaSpaces with any grid management solution that exposes an API, and that integration enables the dynamic addition of resources so an application can be relocated, failed over, or scaled on the fly.

Other watches you might set up include:

getLocalTimeOfDay(), where the value measured causes a relocation event that could move applications to new machines in different time zones, allowing you to "follow the sun".

getMemoryConsumption(), where the value measured causes a relocation event that moves applications to new GSCs that have more memory.

getCPULevel(), where the value measured causes a relocation event that moves applications to new GSCs with more CPU capacity.

Again, the choice of scaling or relocating in response to an event is yours, but it also depends on the type of service you are affecting. If the service involved has an embedded space in the same pu.xml file, you cannot scale it using scaling events. Instead, you must relocate the processing unit, which allows you to scale up to the number of partitions you defined for that space when it was deployed.

Example: you deploy a space and worker to the system and define the space as

<os-sla:sla cluster-schema="partitioned-sync2backup" number-of-instances="24" number-of-backups="1" max-instances-per-vm="1">

This means you have 24 partitions defined, each with one backup. You first deploy this processing unit to 3 GSCs, each having 2 GB of RAM, where you run 16 instances in each GSC (8 partitions and 8 backups). These instances run happily until they start running low on memory. At that point a watch on one of the workers could trigger a relocation event, which asks the GSM to move one of the instances of the PU to a new GSC.
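The instance counts in these examples follow from simple arithmetic, which can be checked with a small sketch. The SlaMath class and its method names are hypothetical helpers written for this post; they are not part of GigaSpaces.

```java
// Hypothetical helper for checking the deployment arithmetic in this post.
class SlaMath {
    // Total PU instances: each partition has one primary plus its backups.
    static int totalInstances(int partitions, int backupsPerPartition) {
        return partitions * (1 + backupsPerPartition);
    }

    // Containers needed when each container hosts at most perContainer instances.
    static int containersNeeded(int instances, int perContainer) {
        return (instances + perContainer - 1) / perContainer; // ceiling division
    }

    public static void main(String[] args) {
        int total = totalInstances(24, 1);
        System.out.println(total);                   // 48 instances in all
        System.out.println(total / 3);               // 16 instances per GSC on 3 GSCs
        System.out.println(containersNeeded(12, 2)); // 6 GSCs for the scaling example
    }
}
```

The first two numbers match the partitioned deployment above (24 primaries plus 24 backups spread over 3 GSCs), and the last matches the earlier scaling example with max-instances="12" and max-instances-per-vm="2".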
Presuming there is a GSC with the necessary memory and CPU available, the GSM will relocate a PU to the new GSC. What counts as available is defined in the following section, which specifies that the PU will be deployed only to a target GSC where no more than 25% of the CPU and memory is utilized:

<os-sla:requirements>
    <os-sla:memory high=".25" />
</os-sla:requirements>

With that relocation, the information and work start to spread out. Eventually, if there are enough GSCs available, the system could span 48 GSCs (one instance per GSC), each having 2 GB of RAM, so the system that started with 6 GB of RAM and maybe 6 cores could grow over time to be housed in 96 GB of RAM and use 96 cores!

This kind of relocation could be expanded further by moving the instances to a different class of machine, for example one of the tremendous new Sun machines such as the Sun T5240 or Sun M9000, or an Azul box. The possibilities are almost endless!

In my opinion, scaling services automagically (and adding additional resources to a running system) can be automated effectively today. Relocation of busy resources is more likely to be driven by human operators than by automated rules, because it requires balancing a myriad of unexpected factors discovered only at the moment of relocation. Still, the rules could be put in place as a last resort for the times when the humans are asleep at the helm.

Cheers, Owen.