Most of the people I speak with seem fairly confused on this subject, and tend to use query semantics and architecture interchangeably. So I thought that a good start would be to provide a quick overview of what each query term stands for in the context of the NoSQL world. Then, I’ll try to break some common misconceptions — which led me to come up with the YeSQL term.
Common Query Semantics in the Post Only-SQL World
The following are some of the common query semantics in the NoSQL world. For those interested in code examples, I've linked each category to the relevant GigaSpaces reference API.
- Key/Value query: Key/Value query, as the name suggests, is probably the most basic form of query. Each data item is associated with a unique identifier (key). In the NoSQL world, memcached is one of the most common implementations of such an interface. A common pattern for performing complex queries with memcached is to defer them to an underlying database that acts as a search engine: the result of such a query is a key or set of keys, which is then used to fetch the corresponding values from the memcached data store. Key/Value stores gained new momentum in the post-SQL world because they lend themselves fairly naturally to partitioning and distribution, which is a key piece of making a data store scalable. In other words, people were willing to trade the rich query functionality of most traditional RDBMSs for scalability, even if that left them with only basic query support.
- Document-based query: The roots of this model are in the search engine world, where it is very common to store documents of many different types, even if each one represents a completely different object. In the NoSQL world, a document is not the typical Word or PowerPoint file that you would see in a search engine, but rather an object in the form of JSON or XML, or a binary object associated with a set of key/values, as in the case of Cassandra. In SQL terms, a document can be seen as a blob associated with a set of keys, each indexed independently and maintaining a reference to that blob. Each blob can be of a different type (tables), and each blob can have a different set of associated indexes (keys). Matching is done through the associated indexes. The result set often includes multiple types, each containing a different set of data. Because the indexes and the blob don't need to conform to a strict structure of rows and tables, the model is referred to as "schemaless": you can have different versions of the same type, and add fields to new types, without having to modify any table or update older versions of the data. Examples that support the document model are CouchDB and MongoDB.
- Template query: Template queries were common in JavaSpaces and even in later versions of Hibernate. With template-based matching, you can fetch an object based on its class type or inheritance hierarchy, as well as on the values of its attributes. More object-oriented variants of template matching also let you match on specific items within an object-graph attribute. GigaSpaces is one of the better-known implementations that support the JavaSpaces template query model.
- Map/Reduce: Map/Reduce is often used to perform aggregated queries on a distributed data store; a simple scenario would be Max or Sum. In such a scenario, the query needs to be executed independently in each partition (Map) and the partial results aggregated back at the client (Reduce). An implicit Map/Reduce implementation takes a query request and spreads its execution transparently; the client gets the aggregated result as if it were a single query. The explicit model allows you to execute code in free form, where you control the mapping model (which call goes to which data), the code that runs in each node (aka tasks), and the aggregation of the results. In a typical Hadoop implementation, Map/Reduce is often done through the explicit model. Frameworks like Hive and Pig provide an abstraction layer that handles that process implicitly.
- SQL query: If you think about it, SQL is yet another dynamic language, one specifically designed for complex data management. With SQL, data is typically organized in tables and rows. Some SQL query semantics, such as joins and distributed transactions, are known anti-patterns for scalability, and it is mostly this aspect of the language that associates SQL with scalability limitations. Examples of NoSQL implementations that support SQL are Google Bigtable using JPA, Hive/Hadoop, MongoDB, and GigaSpaces. I will discuss in further detail below what that actually means.
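As a minimal sketch of the Key/Value pattern described above (the class and method names are my own, not memcached's or any product's API):

```java
import java.util.*;

// Minimal in-memory sketch of the Key/Value pattern; illustrative only.
class KeyValueSketch {
    private final Map<String, String> store = new HashMap<>();

    public void put(String key, String value) { store.put(key, value); }

    public String get(String key) { return store.get(key); }

    // The "defer complex queries elsewhere" pattern: an external index
    // (a database used as a search engine) returns a set of keys, and
    // the values are then fetched from the Key/Value store.
    public List<String> multiGet(List<String> keys) {
        List<String> values = new ArrayList<>();
        for (String k : keys) values.add(store.get(k));
        return values;
    }
}
```

Note that the store itself only ever sees key lookups; anything richer has to happen outside it, which is exactly what makes it easy to partition.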
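The document model can be sketched the same way; again the names are hypothetical, and a real store would maintain an inverted index per field rather than scanning:

```java
import java.util.*;

// Sketch of the document model: schemaless "blobs" (plain maps here),
// queried through whatever fields each document happens to have.
class DocumentStoreSketch {
    private final Map<String, Map<String, Object>> docs = new HashMap<>();

    public void save(String id, Map<String, Object> doc) { docs.put(id, doc); }

    // Documents of different "types" (different field sets) live side by
    // side; a query matches any document whose field equals the value.
    public List<Map<String, Object>> findByField(String field, Object value) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> d : docs.values())
            if (value.equals(d.get(field))) result.add(d);
        return result;
    }
}
```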
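Template matching can be illustrated in a few lines. Entries are modeled as maps and null plays the wildcard role, which mirrors the JavaSpaces idea without being its actual API:

```java
import java.util.*;

// Sketch of JavaSpaces-style template matching: a template names a type
// and some attribute values, and null acts as a wildcard. Illustrative only.
class TemplateSketch {
    static boolean matches(String entryType, Map<String, Object> entry,
                           String templateType, Map<String, Object> template) {
        if (templateType != null && !templateType.equals(entryType)) return false;
        for (Map.Entry<String, Object> e : template.entrySet()) {
            if (e.getValue() == null) continue;                 // wildcard
            if (!e.getValue().equals(entry.get(e.getKey()))) return false;
        }
        return true;
    }
}
```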
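The implicit Map/Reduce flow for an aggregate query (Sum across partitions) can be sketched as follows, with partitions modeled as plain lists:

```java
import java.util.*;

// Sketch of an aggregate query executed map/reduce style: each partition
// computes a partial sum (map), and the client combines the partial
// results (reduce).
class MapReduceSketch {
    static int sum(List<List<Integer>> partitions) {
        int total = 0;
        for (List<Integer> partition : partitions) {
            int partial = 0;              // "map": runs inside each partition
            for (int v : partition) partial += v;
            total += partial;             // "reduce": aggregate at the client
        }
        return total;
    }
}
```

In a real system the inner loop runs in parallel on each node; only the partial results travel back to the client.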
YeSQL — There’s Nothing Wrong with SQL!
Now that we've covered some of the concepts behind query formats, it becomes more apparent that there is nothing really wrong with SQL. Like many languages, SQL gives you a fairly long rope with which to hang yourself if you choose to, but that is true of almost any language. If you design your data model to fit a distributed model, you may find that SQL can be a fairly useful format for managing your data. Good examples are Hive/Pig/HBase and Google JPA/Bigtable. In both cases the underlying data store is a scalable Key/Value store, but the front-end query language happens to be SQL-based. MongoDB aims toward a similar goal, with the main difference that it provides SQL-like support and doesn't fully comply with any of the existing standards.
It's About the Architecture, Stupid!
NoSQL implementations such as Hive/HBase as well as JPA/Bigtable are good examples of how next-generation databases can support both linear scalability and a SQL API.
The key is the decoupling of the query semantics from the underlying data-store as illustrated in the diagram below:
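A minimal sketch of that decoupling, with two independent query front ends sharing one key/value substrate (all names are illustrative, and the toy "field = value" parser merely stands in for a SQL WHERE clause; none of this reflects Bigtable's or Hive's actual design):

```java
import java.util.*;

// Decoupling query semantics from storage: one underlying key/value
// substrate, two query front ends layered on top of it.
class DecoupledStoreSketch {
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    public void put(String key, Map<String, String> row) { rows.put(key, row); }

    // Front end 1: plain Key/Value lookup.
    public Map<String, String> get(String key) { return rows.get(key); }

    // Front end 2: SQL-like semantics over the same substrate.
    // Accepts a toy "field = value" clause and scans for matches.
    public List<Map<String, String>> select(String whereClause) {
        String[] parts = whereClause.split("=", 2);
        String field = parts[0].trim(), value = parts[1].trim();
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : rows.values())
            if (value.equals(r.get(field))) out.add(r);
        return out;
    }
}
```

The point of the sketch is that neither front end needs to know how the other works, and the storage layer stays a partition-friendly key/value map throughout.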
Supporting a SQL API on top of a NoSQL data store in Google Bigtable
Convergence is Underway
Last week I spent some time at the Hadoop Summit. Hadoop created a fairly generic substrate that has led to an innovative ecosystem around it. There are already many frameworks that provide different levels of abstraction over the way Hadoop manages data, in both querying and processing, such as Hive, Cascading, and Pig. Many of them provide tools that the original creator of Hadoop never even thought of.
Which brings me to the point that we can apply the same decoupling pattern that I mentioned above to support a document model in connection with SQL. In other words, I believe that going forward, most of the leading databases will support all of the semantics listed above, and we won't have to choose a database implementation just because it supports a certain query language.
We’ve already seen a similar trend with dynamic languages. In the past, a language had to come with a full stack of tools, compiler, libraries, and development tools behind it, making the selection of a particular language quite strategic. Today, a JVM in Java or a CLR in .Net provides a common substrate that can support a large variety of dynamic languages on top of the same JVM runtime. Good examples are Groovy and Java or Jruby.
As I pointed out throughout this post, SQL is actually a fairly good query language and will continue to serve a major role in the post only-SQL world. However, the concept of one size fits all doesn't hold up. The data management world is going to be built from a variety of tools and data management languages, each serving a particular purpose. Ideally, we should be able to access any data using any of several query languages, regardless of how it was stored. For example, I should be able to store a JSON object using a document model and then, at any time, query that JSON object using SQL query semantics or a simple Key/Value API.