Scaling Out MySQL


Nati Shalom March 29, 2008
17 minute read

With the recent acquisition of MySQL by Sun, there has been talk about the MySQL open source database now becoming relevant to large enterprises, presumably because it now benefits from Sun's global support, professional services and engineering organizations. In a blog post about the acquisition, Sun CEO Jonathan Schwartz wrote that this is one of his objectives.

While the organizational aspects may have been addressed by the acquisition, MySQL faces some technology limitations which hinder its ability to compete in the enterprise. Like other relational databases, MySQL becomes a scalability bottleneck because it introduces contention among distributed application components.

There are basically two approaches to this challenge that I'll touch on in this post:

1. Scale your database through database clustering

2. Scale your application, while leaving your existing database untouched, by front-ending the database with an In-Memory Data Grid (IMDG) or caching technology. The database acts as a persistence store in the background. I refer to this approach as Persistence as a Service (PaaS).

While both options are valid (with pros and cons), in this post I’ll focus mostly on the second approach, which introduces some thought-provoking ideas for addressing the challenge.
Disclaimer: While there are various alternative in-memory data grid products, such as Oracle Coherence and IBM ObjectGrid, in this post I'll focus on the GigaSpaces solution because, for obvious reasons, I happen to know it best. Having said that, I try to cover the core principles presented here in generic terms as much as possible.

Scaling your database through database clustering:
There are two main approaches for addressing scalability through database clustering:

  • Database replication is used to address concurrent access to the same data. Replication enables us to load-balance access to the shared data elements among multiple replicated database instances. In this way we can distribute the load across database servers and maintain performance even as the number of concurrent users increases.

    Limitations:

  • Limited to “read mostly” scenarios: when it comes to inserts and updates, replication overhead may be a bigger constraint than working with a single server (especially with synchronous replication).
  • Performance: constrained by disk I/O performance.
  • Consistency: asynchronous replication leads to inconsistency, as each database instance might hold a different version of the data. The alternative, synchronous replication, may cause significant latency.
  • Utilization/Capacity: replication assumes that all nodes hold the entire data set. This creates two problems: 1) each table holds a large amount of data, which increases query/index complexity; 2) we need to provision (and pay for) more storage capacity in direct proportion to the number of replicated database instances.
  • Complexity: most database replication implementations are hard to configure and are known to cause stability issues.
  • Non-standard: each database product has different replication semantics, configuration and setup. Moving from one implementation to another might become a nightmare.
  • Database partitioning (“sharding”): database shards/partitions distribute the data across multiple nodes, so that each node holds only part of the data. This is a better approach for scaling both read and write operations, and it makes more efficient use of capacity, as it reduces the volume of data in each database instance (a minimal routing sketch follows the limitations list below).

Limitations:

  • Limited to applications whose data can be easily partitioned.
  • Performance: we are still constrained by disk I/O performance.
  • Requires changes to the data model: we need to modify the database schema to fit a partitioned model. Many database implementations require that knowledge of which partition the data resides in be exposed to the application code, which brings us to the next point.
  • Requires changes to application code: a different model is required for executing aggregated queries (map/reduce and the like).
  • Static: in most database implementations, adding or changing partitions involves downtime and re-partitioning.
  • Complex: setting up database partitions is a fairly complex task, due to the number of moving parts and the potential for failure during the process.
  • Non-standard: as with replication, each database product has different partitioning semantics, configuration and setup. Partitioning introduces even more severe limitations, as it often requires changes to our database schema and application code when moving from one database product to another.
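To make that last point concrete, here is a minimal, product-agnostic sketch of the kind of routing logic a sharded application typically ends up owning; all names here are illustrative, not taken from any particular product:

```java
// Minimal sketch of application-level shard routing: the application itself
// must know which database instance holds a given key.
import javax.sql.DataSource;
import java.util.List;

public class ShardRouter {
    private final List<DataSource> shards; // one JDBC DataSource per database partition

    public ShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    // Route by a stable hash of the partitioning attribute (e.g. a customer ID).
    // Note that adding a shard changes the modulus -- exactly the static
    // re-partitioning problem described in the list above.
    public DataSource shardFor(String routingKey) {
        return shards.get(Math.floorMod(routingKey.hashCode(), shards.size()));
    }
}
```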

Time for a change – is database clustering the best we can do?

The fundamental problems with both database replication and database partitioning are the reliance on the performance of the file system/disk and the complexity involved in setting up database clusters. No matter how you turn it around, file systems are fairly ineffective when it comes to concurrency and scaling. This is pure physics: how fast can disk storage be when every data access must go through serialization/de-serialization to files, as well as mapping from binary format to a usable format? And how concurrent can it be when every file access relies on moving a physical disk head between different file sectors? This puts hard limits on latency. In addition, latency is often severely affected by lack of scalability. Putting the two together makes file systems, and the databases that rely heavily on them, suffer from limited performance and scalability.

These database patterns evolved under the assumption that memory is scarce and expensive, and that network bandwidth is a bottleneck. Today, memory resources are abundant and available at relatively low cost. So is bandwidth. These two facts allow us to do things differently than we used to, when file systems were the only economically feasible option.

Scaling through In-Memory Caching/Data Grid

It is not surprising that, to enhance scalability and performance, many Web 2.0 sites use an in-memory caching solution as a front-end to the database. One popular solution is memcached, a simple open source distributed caching solution that uses a protocol-level interface to reference data that resides in an external memory server. Memcached enables rudimentary caching and is designed for read-mostly scenarios. It is used mainly as an addition to the LAMP stack.

The simplicity of memcached is both an advantage and a drawback. Memcached is very limited in functionality. For example, it doesn't support transactions, advanced query semantics, or a local cache. In addition, its protocol-based approach requires the application to be explicitly exposed to the cache topology, i.e., it needs to be aware of each server host and explicitly map operations to a specific node. These limitations prevent us from fully exploiting the memory resources available to us. Instead, we still rely heavily on the database for most operations.

Enter in-memory data grids.

In-memory data grids (IMDGs) provide object-based database capabilities in memory, and support core database functionality such as advanced indexing and querying, transactional semantics and locking. IMDGs also abstract the data topology from the application code. With this approach, the database is not completely eliminated, but is put in the *right* place. I refer to this model as Persistence as a Service (PaaS). I covered the core principles of this model in this post. Below I'll respond to some of the typical questions I am asked when I present this approach.

How does Persistence as a Service work?

With PaaS, we keep the existing databases as-is: same data, same schema and so on. We use a “memory cloud” (i.e., an in-memory data grid) as a front-end to the database. The IMDG loads its initial state from the database, and from that point on acts as the “system of record” for our application. In other words, all updates and queries are handled by the IMDG. The IMDG is also responsible for keeping the database in sync. To reduce performance overhead, synchronization with the database is done asynchronously, and the rate at which the database is kept in sync is configurable.

The in-memory data model can be different from the one stored in the database. In most cases, the memory-based data model will be partitioned to gain maximum scalability and performance, while the database remains unchanged.
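As a rough illustration of this flow, here is a minimal write-behind sketch in plain Java; it is not any specific IMDG API, and all class and method names are made up for the example:

```java
// Illustrative write-behind sketch: the in-memory map acts as the system of
// record, and updates are flushed to the database asynchronously in batches.
import java.util.*;
import java.util.concurrent.*;

public class WriteBehindStore<K, V> {
    private final Map<K, V> grid = new ConcurrentHashMap<>();        // "system of record"
    private final BlockingQueue<K> dirtyKeys = new LinkedBlockingQueue<>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public WriteBehindStore(long flushIntervalMs) {
        // The flush interval corresponds to the configurable sync rate above.
        flusher.scheduleAtFixedRate(this::flush, flushIntervalMs,
                                    flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    public void put(K key, V value) {
        grid.put(key, value);    // clients see the update immediately
        dirtyKeys.add(key);      // the database is synced later, in the background
    }

    public V get(K key) {
        return grid.get(key);    // reads never touch the database
    }

    private void flush() {
        List<K> batch = new ArrayList<>();
        dirtyKeys.drainTo(batch);
        // A real implementation would now write the batch to the database,
        // e.g. via JDBC; see the batching sketch later in this post.
    }
}
```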


How does PaaS improve performance compared to a relational database?
Performance gains over relational databases are achieved because:

  • PaaS relies on memory as the system of record, and memory is significantly faster and more concurrent than file systems.
  • Data can be accessed by reference, i.e., there is no need for continuous serialization of data, as with a file system.
  • Data manipulation is performed directly on the in-memory objects. Complex manipulation is easily achieved by running either Java/.Net/C++ code or a SQL query. There is no need for serialization/de-serialization of data or network calls during the process.
  • Reduced contention: instead of placing all data in a single table, and consequently having many clients access that table, we split it into many small tables, each of which is accessed by a smaller number of clients.
  • Parallel aggregated queries: queries that need to span multiple partitions to perform join/sum/max operations can be executed in parallel across the nodes. Because the queries run on smaller data sets, the actual operation on each node takes less time. In addition, because the queries execute on multiple machines, they leverage the full CPU and memory power of those machines.
  • In-process local cache: read-mostly operations are cached in the client application's local address space, so subsequent reads are executed locally (see the sketch after this list).
  • Avoiding Object-Relational Mapping (ORM): read operations are performed directly from memory in object format, so there is no O/R mapping overhead at this level. O/R mapping happens in the background, either during the initial load process or during the asynchronous update of the database.
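Here is a generic read-through sketch of the in-process local cache idea; the remote lookup is a placeholder for a call into the IMDG client, and real products add the invalidation and eviction this sketch omits:

```java
// Generic "near cache" sketch: the first read fetches from the remote grid;
// later reads are served from the client's own address space.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class LocalCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLookup;   // placeholder for an IMDG client call

    public LocalCache(Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    public V get(K key) {
        // computeIfAbsent keeps the remote call off the hot path after the
        // first access.
        return local.computeIfAbsent(key, remoteLookup);
    }
}
```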

If you keep the database in sync, isn't your solution limited by database performance?

No, because:

  • Data is sent asynchronously and in batches.
  • Updates are performed in parallel by all partitions.
  • Updates to the database are executed by a mirror service collocated on the same machine as the database. This reduces network overhead to the database and enables database-specific optimizations, such as batch operations (see the JDBC sketch after this list).
  • The database is not used for high-availability purposes. This means that in-flight transactions are not stored in the database, only the end results of business transactions, which reduces the number of updates sent to the underlying database. Also keep in mind that queries don't really hit the database, only updates and inserts. All this together means that the IMDG acts as a smart buffer for the database. It is common for the IMDG to receive 10x more read/update hits than the underlying database sees.
  • The database and the application are now decoupled, giving you more options for optimization. For example, in scenarios where writing to the database is required to ensure the durability of the data, you can store the data directly in a persistent log. The log can be updated at a relatively high rate, and the data can be read from the persistent log back into the database as an offline operation. With these options in place we can easily get to 30,000 to 40,000 updates per second with a single instance of MySQL. If this is not sufficient, you can always add database clustering to speed up database access.
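To ground the batch-operations point, this is roughly what a mirror-style batched update looks like at the plain JDBC level; the orders table and Order type are invented for this example:

```java
// Many in-memory updates are collected and applied in a single round trip.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchWriter {

    // Minimal record type for the example.
    public record Order(long id, String status) {}

    public void writeBatch(Connection conn, List<Order> orders) throws SQLException {
        String sql = "UPDATE orders SET status = ? WHERE id = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Order order : orders) {
                ps.setString(1, order.status());
                ps.setLong(2, order.id());
                ps.addBatch();       // accumulate instead of a round trip per row
            }
            ps.executeBatch();       // one call applies the whole batch
        }
    }
}
```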

Doesn’t
asynchronous replication mean that data might be lost in case of failure?

No, because asynchronous replication refers to the transfer of data between the IMDG and the database. The IMDG itself maintains in-memory backups that are synchronously updated. This means that if one of the nodes in a partitioned cluster fails before the replication to the underlying database takes place, its backup will be able to instantly continue from that exact point.

What happens if one of my memory partitions fails?

The backup of that partition takes over and becomes the primary. The data grid's cluster-aware proxy redirects the failed operation to the hot backup implicitly. This enables a smooth transition for the client application during failure, as if nothing happened. Each primary node may have multiple backups to further reduce the chance of total failure. In addition, the cluster manager detects the failure and provisions a new backup instance on one of the available machines.

What happens if the database fails?

The IMDG maintains a log of all updates and can replay them as soon as the database becomes available again. It is important to note that during this time the system continues to operate unaffected. The end user will not notice the failure!

How do I maintain transactional integrity?

The IMDG supports the standard two-phase commit protocol and XA transactions. Having said that, this model should be avoided as much as possible, because it introduces dependencies among multiple partitions and creates a single point of distributed synchronization in our system. Using a classic distributed transaction model doesn't take advantage of the full linear scalability potential of the partitioned topology. Instead, the recommended approach is to break transactions into small, loosely coupled services, each of which can be resolved within a single partition. Each partition can maintain transaction integrity using local transactions. This model ensures that in partial failure scenarios the system is kept in a consistent state.

How is transactional integrity maintained with the database?

As noted above, distributed transactions can introduce a severe performance and scalability bottleneck, especially if executed against the database. In addition, attempting to execute transactions with the database violates one of the core principles behind PaaS: asynchronous updates to the database. To avoid this overhead, the IMDG ensures that transactions are resolved purely in-memory and are sent to the database in a single batch. If the update to the database fails, the system will retry the operation until it succeeds.

What types of queries are supported?

  • Template matching – matching objects based on class name, class hierarchy, and attribute values.
  • SQL – supports range queries, ‘like’ semantics, etc.
  • Continuous queries – through a combination of notifications and SQL.
  • Parallel queries (a.k.a. map/reduce) – queries that are not designated for a specific partition are automatically broadcast to all partitions, and the results are implicitly aggregated on the client side.
  • Iterator – iterate through a large result set of data.
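As a rough sketch of the first two query styles, loosely following the GigaSpaces OpenSpaces API (package and method names here are from memory and may differ between versions):

```java
// Hedged example of template matching and SQL queries against the grid.
import com.j_spaces.core.client.SQLQuery;
import org.openspaces.core.GigaSpace;

public class QueryExamples {

    // Minimal POJO for the example. Integer (not int) matters for template
    // matching: a null field acts as a wildcard.
    public static class Person {
        private String lastName;
        private Integer age;
        public String getLastName() { return lastName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
        public Integer getAge() { return age; }
        public void setAge(Integer age) { this.age = age; }
    }

    public void run(GigaSpace gigaSpace) {
        // Template matching: matches any Person whose lastName is "Smith".
        Person template = new Person();
        template.setLastName("Smith");
        Person match = gigaSpace.read(template);

        // SQL semantics: range queries and 'like' are supported.
        SQLQuery<Person> query =
                new SQLQuery<Person>(Person.class, "age > 30 AND lastName like 'S%'");
        Person[] smiths = gigaSpace.readMultiple(query, Integer.MAX_VALUE);
    }
}
```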

This model relies heavily on partitioning. How do I handle queries that need to span multiple partitions?

Aggregated queries are executed in parallel on all partitions. You can combine this model with stored procedure-like queries to perform more advanced manipulations, such as sum and max (see the sketch below).
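Here is a generic scatter-gather sketch of such a parallel aggregation, written with plain java.util.concurrent rather than any product API; each partition computes a partial result over its own, smaller data set, and the client reduces the partials:

```java
// Generic scatter-gather sketch of a parallel aggregated query (a sum
// across partitions).
import java.util.List;
import java.util.concurrent.*;

public class ParallelSum {

    // Stand-in for a partition that can compute a partial aggregate over its
    // own (smaller) slice of the data.
    public interface Partition {
        long localSum();
    }

    public long totalOrderValue(List<Partition> partitions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        try {
            // Scatter: each partial sum runs against its own partition.
            List<Future<Long>> partials = pool.invokeAll(
                    partitions.stream()
                              .map(p -> (Callable<Long>) p::localSum)
                              .toList());
            // Gather: the client reduces the partial results.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real grid the scatter step runs inside the partition processes themselves, next to the data; the thread pool here merely stands in for that.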

What about stored procedures and prepared statements?

Because the data is stored in memory, we avoid the use of a proprietary language for stored procedures. Instead, we can use either native Java/.Net/C++ or dynamic languages, such as Groovy and JRuby, to manipulate the data in memory. The IMDG provides native support for executing dynamic languages, routes the query to where the data resides, and enables aggregation of the results back to the client. A reducer can be invoked on the client side to execute second-level aggregation.

Can I change these prepared statements and stored procedure equivalents without bringing down the data?

Yes. When you change a script, it is reloaded to the server while the server is up, without the need to bring down the data. The same capability exists in case you need to refresh collocated service code on the server side.

Do I need to change my application code to use an IMDG?

It depends. There are cases in which introducing an IMDG can be completely seamless, and there are cases in which you will need to go through a rewrite, depending on the programming model:

For each option, the nature of the integration with the IMDG and its comments/limitations:

  • Hibernate 2nd-level cache: Seamless. Best fit for read-mostly applications. Limited performance gain, as it still relies heavily on the underlying database.
  • JDBC: Seamless, but limited. SQL commands written against the IMDG are guaranteed to run with other JDBC resources. It doesn't support full SQL-92, so existing applications may require code changes. Recommended for monitoring and administration; not recommended for application development, as it introduces unnecessary O/R mapping complexity.
  • HashMap: Seamless. Extensions such as timeout and transaction support are available as well.
  • GigaSpaces Spring DAO: Partially seamless. Abstracts transaction handling from the code. The domain model is based on POJOs and therefore doesn't require explicit changes, only annotations (annotations can also be provided through an external XML file). If your application already uses a DAO pattern, only the DAO itself needs to change, which narrows the scope of changes required to use an IMDG-specific interface. This option is highly recommended for best performance and scalability.

What topologies are supported?

Replicated (synchronous or asynchronous), partitioned, and partitioned-with-backup.

Do I need to change my code if I switch from one topology to another?

No. The topology is abstracted from the application code. The only caveat is that your code needs to be implemented with partitioning in mind, i.e., moving from a central server or a replicated topology to partitioning doesn't require changes to the code as long as your data includes an attribute that acts as a routing index (see the sketch below).
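For example, a minimal entry with a routing attribute might look like this; the annotations follow the GigaSpaces convention, though the exact package names are from memory and may vary by version:

```java
// Sketch of an entry with a routing attribute. All orders with the same
// customerId hash to the same partition, so the topology can change without
// touching the application logic.
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceRouting;

public class Order {
    private String id;
    private String customerId;

    public Order() {}   // POJO entries need a default constructor

    @SpaceId(autoGenerate = false)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    @SpaceRouting       // the routing index mentioned above
    public String getCustomerId() { return customerId; }
    public void setCustomerId(String customerId) { this.customerId = customerId; }
}
```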

How are IMDGs and PaaS different from in-memory databases (IMDBs)?

The relational model prevents us from taking full advantage of the fact that the data is stored as objects in memory. For example, when we use in-memory storage in an IMDG, we don't need the O/R mapping layer. In addition, we don't need separate languages to perform data manipulation; we can use native application code, or dynamic languages, for that purpose.

Moreover, one of the fundamental problems with in-memory databases is that relational SQL semantics are not geared to deal with distributed data models. For example, an application that was written for a central server and uses joins, which often maintain references among tables, or even aggregated queries such as Sum and Max, doesn't map well to a distributed data model. This is why many existing IMDB implementations support only very basic topologies and often require significant changes to the data schema and application code. This reduces the motivation for using in-memory relational databases, as they lack transparency.

The GigaSpaces in-memory data grid implementation, for example, exposes a JDBC interface and provides SQL query support. Applications can therefore benefit from the best of both worlds: you can read and write objects directly through the GigaSpaces API, query those same objects using SQL semantics, and view and manipulate the entire data set using regular database viewers.
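Because of that JDBC interface, standard tooling can query the in-memory objects directly. A hedged sketch follows; the driver URL below is a placeholder, not the actual GigaSpaces connection string, so consult the product documentation for the real values:

```java
// Querying the grid through its JDBC interface with standard tooling.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class GridJdbcExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:<grid-driver>://localhost/mySpace";  // placeholder URL
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT lastName, age FROM Person WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("lastName") + " " + rs.getInt("age"));
            }
        }
    }
}
```
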
Can I use an existing Hibernate mapping to map data from the database to the IMDG?
Yes. In addition, with PaaS, the Hibernate mapping overhead is reduced as most of it happens in the background, during initial load or during the asynchronous update to the database.

Can I use PaaS with .Net or C++ applications?

Yes. Starting with GigaSpaces 6.5, both Hibernate (Java) and nHibernate (.Net) are supported. C++ applications defer to the default Hibernate implementation. In addition, with GigaSpaces' new integration with Microsoft Excel, .Net users can easily access data in the IMDG directly from their Excel spreadsheets without writing code!

Final words

While this approach is generic and can be applied to any database product, MySQL is the most interesting to discuss, as it is widely adopted by those who need cost-effective scalability the most, such as web services, social networks and other Web 2.0 applications. In addition, MySQL has faced several challenges in penetrating large enterprises. With the acquisition by Sun, MySQL becomes a viable option for such organizations, but it still requires the capabilities mentioned above to compete effectively with rival databases. The combination of IMDG/PaaS with MySQL provides a good solution for addressing some of the bigger challenges in cloud-based deployments. More on that in a future post.
