I recently had the interesting challenge of moving a traditional JEE shop into the next generation of application server technology. You might be surprised to hear that the transformation took me only four days (two of them spent on Hibernate issues!). The application was an online gaming application, originally implemented in EJB 2.0 and running on JBoss; before long I had it running on GigaSpaces XAP as an easily scalable cloud of processing units. Keep in mind that this was a full-blown staging environment with real data.
The motivation for the change was a big bottleneck – the database (Oracle RAC) cluster was limited to a throughput of 15 transactions / second, where each transaction had to persist around 60 objects. The update rate was also limited to a fairly low threshold. Using the existing architecture there was no way to go beyond this level of throughput. But moving to GigaSpaces did the trick – running the same business logic on the GigaSpaces infrastructure (using the same hardware) immediately boosted performance to 1500 transactions / second, while guaranteeing future scalability.
The original architecture and flow
The infrastructure consisted of an Apache load balancer, multiple Tomcat instances, three JBoss servers as a scalable service layer, and Oracle RAC. The JEE design used the Struts framework, stateless session beans, and a DAO interface with a JDBC DAO implementation. A Sun JMS Grid server was used to expose the services via JMS to external systems.
Originally, the system had JMS clients and a web tier sending “tickets” to the three JBoss servers using JMS. The business logic service ran in an EJB 2.0 stateless session bean, and its job was to process the ticket, calculate its odds, and then save the result to the database cluster using JDBC.
The immediate problem with this approach was that, as I said, there was an upper limit of 15 transactions / second and nothing could be done about it. But there were other, less obvious problems: the system was occasionally losing messages because the JMS Grid server could not be clustered, and on top of that there was no distributed transaction mechanism that spanned both messaging and the service/data layer. This could have been solved with XA transactions, but that would only have created a bigger performance headache.
Plus, compared to the GigaSpaces deployment framework, the existing system really did not respond well to failure. If one of the JBoss instances failed, it had to be restarted manually, it took a long time to boot up, and it took even longer until the smart proxy on the clients recognized the new instance. This wasn’t considered critical, given the stateless services and the rather small grid of three Unix machines, but it was definitely disruptive and threatened the application’s SLA. The chief architect remarked that they expected steady growth that would triple their cluster within a year, at which point the self-healing and scale-out mechanisms would be a clear advantage.
Step 1: Resolving the database bottleneck
To solve the data bottleneck, I first implemented the existing DAO interface with a GigaSpaces DAO implementation that uses the GigaSpaces Enterprise Data Grid as a distributed cache. This immediately boosted read performance and relieved some of the database load.
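As a rough illustration of the idea, here is a minimal sketch of a space-backed DAO written against the OpenSpaces GigaSpace API. The Ticket class, its fields and the DAO method names are hypothetical stand-ins for the application's real model and DAO contract; the point is simply that the DAO talks to the Data Grid with plain object reads and writes, with no SQL and no ORM on this path.

    import com.gigaspaces.annotation.pojo.SpaceId;
    import org.openspaces.core.GigaSpace;

    // A minimal, hypothetical domain class. In the real application the existing
    // POJOs were reused as-is, since the space stores plain objects.
    class Ticket {
        private Long id;
        private Double odds;
        private Boolean processed;

        public Ticket() {
        }

        @SpaceId(autoGenerate = false)
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }

        public Double getOdds() { return odds; }
        public void setOdds(Double odds) { this.odds = odds; }

        public Boolean getProcessed() { return processed; }
        public void setProcessed(Boolean processed) { this.processed = processed; }
    }

    // Sketch of a space-backed implementation of the application's DAO contract.
    public class GigaSpacesTicketDao {

        private final GigaSpace gigaSpace; // injected clustered space proxy

        public GigaSpacesTicketDao(GigaSpace gigaSpace) {
            this.gigaSpace = gigaSpace;
        }

        public void save(Ticket ticket) {
            // Plain object write to the in-memory Data Grid.
            gigaSpace.write(ticket);
        }

        public Ticket findById(Long id) {
            // Template matching: non-null fields of the template are the criteria.
            Ticket template = new Ticket();
            template.setId(id);
            return gigaSpace.read(template);
        }
    }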
However, the real scalability issue was the update rate, and in order to tackle that, I needed to make the persistency layer asynchronous. In other words, I wanted to simply remove the database from the latency path so it wouldn’t limit the system’s throughput. The GigaSpaces Persistency as a Service (PaaS) approach did the trick quite nicely. I deployed a GigaSpaces Mirror Service, an async mechanism that receives data from the Data Grid and feeds the DB in the background. This had a big impact, because up until this point the whole system had to wait for the database to persist each and every transaction, and now the system could operate at in-memory speeds, while data was persisted to the Oracle cluster in the background.
If you’re wondering how this architecture maintains reliability, this is done by replicating the data between different nodes of the Data Grid. The data is kept in memory, but it is highly available because it is always kept in more than one place and constantly synchronized. Plus, the Mirror Service needs to acknowledge that data was successfully committed to the DB, and until it does, the Data Grid nodes keep all updates in an in-memory buffer, which is also kept highly available. So even if any of the components fails – the DB, the Mirror Service, or any of the Data Grid nodes – no data is lost. For more details see Persistence as a Service in the GigaSpaces documentation.
The mapping between the mirror space and the DB was done with Hibernate, which I chose because of the out-of-the-box support and the Hibernate tools that auto-generated the hbm.xml files (using an existing JDBC connection). But Hibernate can easily be replaced by any other object-relational mapping utility without changing anything in the application (not even the DAO), because in the new architecture Hibernate sits on the back end, defining the async persistency from the Mirror Service to the DB. On the front end, the application communicates with the Data Grid, which is object-oriented, so it doesn’t have to worry about ORM at all.
One problem I experienced was the encapsulation of exceptions thrown by the DB; the Mirror Service implements a retry mechanism to cope with situations such as a database shutdown or logical SQL errors. To solve this I used the exception interface in the latest version of GigaSpaces, 6.5, which can bubble up specific exceptions and reflect the persistency status.
Using the same three machines (quad-core Solaris servers) that were originally running the three JBoss instances of the staging environment, simply changing the data layer from synchronous persistency to the Oracle cluster to asynchronous persistency with a GigaSpaces Data Grid boosted performance from 15 to 1500 transactions / second – 100 times faster! It feels really good to see these kinds of results, especially when it takes only two days to get there. :)
Step 2: Moving from JBoss and EJB to a processing unit
I saw three strong motivations to move to a “processing unit” model: simplicity (fewer moving parts, less configuration, and fewer integrations), re-initialization and dynamic deployment capabilities in case of system crashes and disconnections, and, perhaps most importantly, removing the need to distribute transactions between the client, service, and database.
A processing unit is a self-sustaining component that includes the business service as a POJO, collocated with the data it needs to do its work and with messaging facilities that deliver events directly to it. Both the data and the messaging facilities are provided by the GigaSpaces space, which is lightweight and can be collocated with the business logic services inside the processing unit. The space instances are clustered with one another, which allows the different processing units to appear as one big “cloud” as far as the client (and also the developers building and testing the system) is concerned.
When I was talking about the GigaSpaces Data Grid earlier, those Data Grid instances are in fact GigaSpaces space instances. So to achieve a processing unit model, all I needed to do was package a space and a business logic instance together in a single Spring application context, which is the processing unit.
Luckily, the business logic service in this application was a POJO encased in the stateless session bean. I reused that POJO and defined it as the business logic part of the processing unit. Here is a sketch of what the final service looked like:
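The listing below is only an illustrative sketch, reusing the hypothetical Ticket class from the DAO example above together with the OpenSpaces polling-container annotations: the container polls the collocated space for unprocessed tickets and hands each one to the same odds-calculation logic that used to live inside the session bean.

    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    // The former session-bean POJO, now registered as a polling event listener
    // on the space that is collocated with it inside the processing unit.
    @EventDriven
    @Polling
    public class TicketProcessor {

        @EventTemplate
        public Ticket unprocessedTicket() {
            // The polling container takes tickets matching this template from the space.
            Ticket template = new Ticket();
            template.setProcessed(Boolean.FALSE);
            return template;
        }

        @SpaceDataEvent
        public Ticket processTicket(Ticket ticket) {
            // Business logic unchanged: calculate the odds for the ticket.
            ticket.setOdds(calculateOdds(ticket));
            ticket.setProcessed(Boolean.TRUE);
            // The returned ticket is written back to the space, from where the
            // Mirror Service persists it asynchronously to the Oracle cluster.
            return ticket;
        }

        private Double calculateOdds(Ticket ticket) {
            // Placeholder for the real odds calculation.
            return 1.0;
        }
    }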
I used a simple JavaSpaces call (the write method) to replace the EJB call, as you can see in the sketch below. GigaSpaces Remoting, which is based on Spring remoting, can also do the trick.
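Sketched out, the client-side change boils down to a single write; the space URL, the lookup style, and the Ticket fields here are again hypothetical.

    import org.openspaces.core.GigaSpace;
    import org.openspaces.core.GigaSpaceConfigurer;
    import org.openspaces.core.space.UrlSpaceConfigurer;

    // Client-side sketch: instead of looking up the stateless session bean, the
    // web tier (or JMS client) writes the ticket straight into the clustered space.
    public class TicketSubmitter {

        public static void main(String[] args) {
            // Connect to the clustered space; "ticketSpace" is a hypothetical name.
            GigaSpace gigaSpace = new GigaSpaceConfigurer(
                    new UrlSpaceConfigurer("jini://*/*/ticketSpace").space()).gigaSpace();

            Ticket ticket = new Ticket();
            ticket.setId(42L);
            ticket.setProcessed(Boolean.FALSE);

            // One write replaces the old EJB invocation; the collocated polling
            // listener in one of the processing units picks the ticket up from here.
            gigaSpace.write(ticket);
        }
    }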
Finally, I used the GigaSpaces SLA-driven containers. The whole processing unit is deployed in an SLA-driven container. This is great because it gives you automatic re-initialization of services in case of failure; even if one of the machines fails, the processing unit can be relocated to an available machine and can immediately continue working there. It also allows hot failover to backup processing units (processing units always back each other up, using memory-based replication). Plus, the processing unit, inside the SLA-driven container, can be deployed on just about any available machine.
Keep in mind that this whole setup doesn’t require special load balancers and proxies or any special hardware – GigaSpaces provides an internal proxy implementation with a configurable cluster schema which is abstracted from the business logic.
Step 3: Integrating with JMS Grid
Messaging can sometimes cause an additional bottleneck, but this wasn’t the case here. The motivation for the change was to supply reliable messaging, because in the original architecture there were sporadic message losses due to the un-clustered JMS solution based on Sun JMS Grid.
The requirement was to align with the JMS standard and implement message listeners. Implementing a GigaSpaces JMS service provided a scalable message bus in a non-intrusive manner. This was done using the same processing units I discussed above. The GigaSpaces space within the processing unit has a built-in JMS implementation, so it can represent virtual queues and is able to receive and send ordinary JMS messages.
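To show how non-intrusive this is, here is a plain JMS listener sketch; nothing in the code is GigaSpaces-specific, and the destination it consumes from would simply be a virtual queue backed by the space. The payload handling is a hypothetical placeholder.

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.ObjectMessage;

    // Standard JMS listener: the external systems keep sending ordinary JMS
    // messages; only the provider underneath is now the space's built-in JMS
    // implementation instead of the Sun JMS Grid server.
    public class TicketMessageListener implements MessageListener {

        public void onMessage(Message message) {
            try {
                if (message instanceof ObjectMessage) {
                    Object payload = ((ObjectMessage) message).getObject();
                    // Hand the payload over to the business logic / the space.
                    System.out.println("Received ticket message: " + payload);
                }
            } catch (JMSException e) {
                // In the real application this would be logged and handled properly.
                e.printStackTrace();
            }
        }
    }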
For further reading on how to configure JMS using the Spring-based JMS converter, please refer to the OpenSpaces Data Example JMS Data Feeder configuration.
To summarize…
Integrating XAP into a traditional J2EE environment was an interesting and challenging experience. Because this was a full staging environment, the transition was not perfectly transparent and couldn’t solve spaghetti code problems. However, I was able to move to the processing unit model with a minimum of peripheral code changes, and most importantly, with zero changes to business logic code. In the end, it’s the same application, which still works with data and messaging in much the same way – GigaSpaces just provides an abstracted data and messaging layer that solves the old JEE bottlenecks. This is why it was relatively easy and quick to make a full transition from JEE to the next-generation application server.
I’d be happy to hear your thoughts and share your experience with app servers!
Mickey