Multi-tenancy is a term used often in the context of SaaS applications. In general, multi-tenancy refers to the ability to run multiple users of an application on shared infrastructure. The main motivation for doing this is efficiency: reducing the cost per user in comparison to a dedicated system where each user has their own dedicated environment.
Over the past few months I’ve had discussions with various SaaS providers, in which I tried to learn about the main challenges that SaaS providers face when they deal with multi-tenancy. As it happens, I came across an interesting article on this very topic, Why Multi-tenancy Matters in the Cloud by Alok Misra. The article later sparked an interesting discussion thread on the cloud mailing list, which I also find to be a good reference for diverse thoughts on this subject.
In this post I want to summarize some of my findings on multi-tenancy, as well as suggest a model that can make multi-tenancy significantly simpler.
The common challenge: Multi-tenancy at the data layer
There are many approaches and levels of multi-tenancy, depending on the application layer and type of application you’re dealing with. For example, trying to separate user accounts in a CRM application is quite different from separating various batch processing jobs in a risk management application. That said, what seems to be a common challenge in almost every SaaS application is multi-tenancy at the data layer. To put it simply, the challenge is how to share data resources (a database, in most cases) among multiple users, while at the same time ensuring data isolation between those users, as if they were running on completely physically separated servers.
In the remaining part of this post I’ll address how multi-tenancy is currently being dealt with at the data layer, and how the existing challenges can be met.
How is multi-tenancy being done today?
For this part of the story, I think it might be useful to first look at the two extreme approaches: the fine-grained approach and the coarse-grained approach.
Fine-grained multi-tenancy (Salesforce.com model)
For many SaaS providers the “poster child” of multi-tenancy is Salesforce.com. There is a very useful videocast explaining the Salesforce approach to multi-tenancy. In a nutshell, their approach is to share the same database instance between different users, and to define the data model so that each table carries a column identifying the customer ID. In addition, a separate metadata table is used to translate every query into a user-specific query before the query hits the database.
Pros:

- Very fine-grained multi-tenancy, resulting in optimum cost efficiency per user.

Cons:

- Complexity: Requires changes in the data model.
- Doesn’t fit existing applications: They require a complete rewrite.
- Modest degree of isolation: Users sharing the same resource can impact one another under heavy load. A maintenance or patch release required for a particular user might affect other users. In a more extreme scenario, there is also exposure to a higher degree of vulnerability; for example, a bug in the query mapping could point one user to data belonging to other users simply by picking up the wrong customer ID.
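To make the mechanics concrete, here is a minimal sketch of that kind of metadata-driven query rewriting. The table names and the helper itself are hypothetical, not Salesforce’s actual implementation, and a real system would parameterize the SQL rather than interpolating values into it:

```python
# Minimal sketch of metadata-driven query scoping over a shared database.
# The mapping and helper are illustrative, not Salesforce's actual schema.

SHARED_TABLES = {
    # logical (per-tenant) table name -> physical shared table name
    "accounts": "all_accounts",
    "contacts": "all_contacts",
}

def scope_query(logical_table, tenant_id, where=None):
    """Rewrite a tenant's query so it only sees rows tagged with its ID."""
    physical = SHARED_TABLES[logical_table]
    tenant_filter = "tenant_id = %d" % tenant_id
    clause = "%s AND (%s)" % (tenant_filter, where) if where else tenant_filter
    return "SELECT * FROM %s WHERE %s" % (physical, clause)
```

The point of the sketch is the failure mode mentioned above: every query’s safety hinges on the `tenant_id` filter being injected correctly, so a single bug in this layer leaks one tenant’s data to another.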
Coarse-grained multi-tenancy (hosted service model)
Most hosting providers (including hosted IT providers) enable outsourcing of certain IT applications. Efficiency is gained by sharing maintenance overhead, labor costs, and other data center costs such as network, electricity, and more.
Pros:

- No code changes are required, resulting in greater simplicity.
- High degree of isolation.

Cons:

- Lower efficiency and therefore limited cost margins.
Challenges with today’s approach to multi-tenancy
The challenge with the fine-grained model is that it is fairly complex to implement and maintain. If you have an existing application, it requires a complete rewrite and also forces fairly significant changes in your existing data model. The challenge with the coarse-grained model is that cost margins per user are still rather high. In a competitive environment, even a single competitor offering services similar to yours using the fine-grained approach could easily put you out of business.
The elasticity challenge
The other challenge that neither approach addresses today is dynamic elasticity. What happens when a customer grows beyond their allocated shared capacity? Both approaches seem to rely on certain capacity assumptions, and can scale by splitting users among their shared resources. However, neither approach seems to handle a situation where a particular user’s capacity grows beyond the allocated resources.
Jayarama Shenoy provides a good introduction to that challenge in his comment in the discussion thread:
So, how does elasticity work in a multi-tenancy case? Say 100 tenants share an app on a server (by which I would think they go all the way down to a shared database). Then, say, five of them grow quite rapidly such that they need to be moved off that server, and the other 95 can continue to exist on the original server.
So ‘something’ in the application has to know how to pick off the records that belong to these five, copy it off into a second instance of that database and kill, move & restore application service for those tenants. (Of course, without being privy to the data contained in those records). And of course there’s an implied load balancer element in front of all this to keep up appearances to the clients.
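That “something” can be sketched in a few lines. The following is a hypothetical illustration of the move Shenoy describes, with the two databases modeled as plain dicts and the load balancer reduced to a routing table; a real migration would also have to keep writes consistent during the copy:

```python
# Hypothetical sketch of moving one hot tenant to a second database instance:
# copy the tenant's rows, flip the routing entry the load balancer consults,
# then remove the rows from the original instance.

def migrate_tenant(tenant_id, source, target, routes):
    """Move all rows belonging to tenant_id from source to target."""
    rows = [r for r in source["rows"] if r["tenant_id"] == tenant_id]
    target["rows"].extend(rows)            # 1. copy the tenant's records
    routes[tenant_id] = target["name"]     # 2. cut traffic over to the copy
    source["rows"] = [r for r in source["rows"]
                      if r["tenant_id"] != tenant_id]  # 3. clean up the source
```

Even in this toy form, the hard parts Shenoy alludes to are visible: the migration code must find tenant records without inspecting their contents, and the routing flip has to be coordinated so clients never notice.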
Based on this analysis I believe that the main challenge with the current approaches is that we have too few options available. No option seems to provide a good solution to the elasticity challenge, and this is what is making multi-tenancy “so damn hard”. This is especially true for existing businesses with existing products.
The solution: Making multi-tenancy simple
Learning from the past operating system experience
Some of us probably remember the days when Bill Gates came up with DOS, which was basically a single-user system. At the time, if you wanted to provide a multi-user application you had to write it yourself, and sometimes that required going down to the core interrupt level and programming in assembly language. This made multi-user systems on DOS very uncommon. Unix, on the other hand, was built for multitasking, multi-user environments, and therefore it was quite common to build multi-user systems in the Unix environment. Later on, when Windows NT came out, all that changed. Today we can’t even imagine a scenario where we would build an application that wouldn’t support multiple users.
As long as the underlying infrastructure (the operating system in the example, and the database in our specific discussion) is not built for multi-tenancy, trying to solve the multi-tenancy challenge at the application level is doomed to be as complex as writing a multi-user system on DOS.
In fact, once multi-user support was dealt with properly at the operating system level, writing a multi-user application became significantly simpler and therefore widely adopted. The same should apply to databases as well.
The database is the problem!
Almost all of the attempts I’ve seen in both the Salesforce approach and the hosted environment approach were built under the assumption that the database is a fixed “concrete” construct that cannot be broken down easily into “smaller bricks”, let alone scale dynamically. It is therefore not surprising that most of the attempts to deal with multi-tenancy have tried to address the challenge outside the database.
Now, imagine what your approach to the challenge would be if a database could be easily broken down into small “bricks” without huge overhead, and could scale dynamically.
Multi-tenant data service
Similar to the experience we had with operating systems, I believe that multi-tenancy has to be dealt with at the infrastructure level; in our specific discussion, that would be the database. In this section I’d like to outline the principles of a multi-tenant data service:
- Simplicity: With a multi-tenant database, applications can be written against a single database tenant just as they are today. How tenants are allocated and shared with other tenants would be completely abstracted from the application.
- Efficiency: Efficiency is maintained by the fact that all data tenants can share the same hardware (memory, CPU) with other tenants.
- Dynamic scaling: Each database tenant can be distributed and scaled dynamically across multiple machines, if necessary, to meet demand. This includes moving data to other machines when needed. Scaling must be supported seamlessly, without any downtime.
- Support for multiple isolation/sharing levels: With a multi-tenant database, the level of sharing/isolation is set on demand, based on the specific requirements. A user can choose a dedicated tenant, in which case the multi-tenant database allocates tenants on a set of machines dedicated to that specific user. Clearly, the benefit of this is isolation, at the cost of efficiency. At the same time, another type of sharing is available: choosing to share with any individual or group under the user’s control. In other words, rather than dictating either extreme (full isolation or full sharing) for all users, with a multi-tenant database we can dynamically choose the level that best fits our purpose.
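The principles above can be sketched as a small allocation API. This is a hypothetical illustration, not a real product interface: the class and method names are invented, and a real service would of course manage actual database processes rather than a list of machine names.

```python
# Hypothetical sketch of a data service supporting both isolation levels.
# Names here are illustrative, not a real product API.

DEDICATED = "dedicated"
SHARED = "shared"

class MultiTenantDataService:
    def __init__(self, machines):
        self.pool = list(machines)   # machines shared by all shared tenants
        self.reserved = {}           # tenant name -> its dedicated machines

    def create_tenant(self, name, isolation=SHARED, machines=1):
        """Allocate a tenant; dedicated tenants get machines of their own."""
        if isolation == DEDICATED:
            # carve machines out of the shared pool for this tenant only
            self.reserved[name] = self.pool[:machines]
            self.pool = self.pool[machines:]
            return self.reserved[name]
        return self.pool  # shared tenants coexist on the remaining pool
```

The key point is that the application only ever sees its own tenant; whether that tenant lands on dedicated or shared machines is a deployment-time choice, not a code change.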
Where are we today?
Cloud providers such as Amazon, Microsoft, and Google have already realized that they can’t expect their users to deal with multi-tenancy themselves, so they’ve come up with data services that were designed with multi-tenancy in mind. Amazon recently introduced Amazon RDS (Relational Database as a Service) and Microsoft introduced SQL Azure. Both are an attempt to take existing databases (MySQL in the case of Amazon and SQL Server in the case of Microsoft) and add some of the characteristics that I discussed above. For example, in both cases the user works with a single tenant and is completely abstracted from the underlying resource sharing optimizations.
And yet, both solutions are still fairly limited when it comes to providing elasticity: with the Amazon solution, you’re expected to scale up (use a bigger machine) and go through a period of downtime when you switch from one degree of scaling to another. Google AppEngine has taken a different approach. Instead of relying on an existing RDBMS and wrapping it with multi-tenancy and dynamic scaling, they took their own NoSQL alternative (BigTable), which was already built for dynamic scaling, and provided SQL semantics on top of it through a mapping layer. As with RDS and SQL Azure, the user is completely abstracted from the way Google manages multi-tenancy in their underlying infrastructure.
Neither Amazon, Microsoft, nor Google provides flexibility with regard to the isolation level, and more importantly, all of these solutions seem to be completely tied to their own cloud data centers, which means that I cannot port them to my own local environment. From a throughput efficiency perspective, SQL Azure and Google seem to trade throughput for scaling by introducing an additional layer of indirection (a load balancer in the case of Azure), which makes their multi-tenant solutions fit only low-end applications.
At GigaSpaces we took an approach very similar to Google’s with our new Elastic Middleware Services: we took our existing distributed NoSQL In-Memory Data Grid and used it as a multi-tenant data service.
That is where the similarity ends. Here are the differences:
- You can easily download and install GigaSpaces in your local environment, or use it in the cloud, and get the full multi-tenancy and dynamic scaling experience. We provide an open SPI that enables you to plug in any resource pool, ranging from a simple SSH call to a complete virtualization pool.
- You can control the level of isolation, scaling, and other SLA properties directly through an API.
- Our data services are designed for extreme efficiency by maximizing the utilization of memory resources. This normally translates into the ability to run at least 10x more user transactions on the same hardware, which in turn translates into huge cost savings.
- We provide the same consistent approach for other layers of the stack such as messaging and map/reduce through the same environment.