There are various ways to implement a search engine. The typical approach uses a crawler: the crawler continuously scans for content, collects changes and indexes them. The indexing service is the centerpiece of the architecture and enables quick matching of keywords to search results.
In some cases this technique is not applicable. Take eCommerce applications, for example: when you submit a bid on an auction, you expect to see it immediately popping up in the search results. This calls for a rather different architecture than the one typically used for Internet applications. The key is how fast we can put new data into the index server. Sounds simple, right? Well, most index servers are highly optimized for fast reads, however they tend to be quite expensive on write operations.
This is where it makes sense to put the index server in-memory. An in-memory index enables both fast writes and fast searches. There is one small caveat, however: memory is limited in capacity and is not considered reliable, i.e. if the memory fails, the data stored in it is gone. This is where In-Memory Data Grids come to the rescue. An In-Memory Data Grid (IMDG) addresses both the capacity and the reliability of memory. Capacity is addressed by breaking the data into multiple partitions; reliability is achieved by keeping at least one copy of each partition in another memory instance. This is exactly what Shay Banon did in his Compass project, where he goes into additional detail about how this model works.
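To make the partitioning-plus-backup idea concrete, here is a minimal sketch (all class and method names are hypothetical, not the GigaSpaces API): keys are hash-routed across partitions for capacity, and every write is mirrored to a backup copy for reliability.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an in-memory data grid: data is hash-partitioned
// for capacity, and each partition keeps a backup copy for reliability.
public class PartitionedGrid {
    private final int partitions;
    private final Map<String, String>[] primaries;
    private final Map<String, String>[] backups;

    @SuppressWarnings("unchecked")
    public PartitionedGrid(int partitions) {
        this.partitions = partitions;
        this.primaries = new HashMap[partitions];
        this.backups = new HashMap[partitions];
        for (int i = 0; i < partitions; i++) {
            primaries[i] = new HashMap<>();
            backups[i] = new HashMap<>();
        }
    }

    // Route a key to a partition by hashing (how capacity scales out).
    private int route(String key) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    // Every write goes to the primary and its backup (reliability).
    public void put(String key, String value) {
        int p = route(key);
        primaries[p].put(key, value);
        backups[p].put(key, value);
    }

    // If the primary copy is lost, the backup still serves the read.
    public String get(String key) {
        int p = route(key);
        String v = primaries[p].get(key);
        return v != null ? v : backups[p].get(key);
    }

    // Simulate losing a primary partition's memory.
    public void failPrimary(int partition) {
        primaries[partition].clear();
    }
}
```

Adding partitions increases capacity; adding backups per partition increases availability, which is exactly the scaling model Banon describes below.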
Quoting Banon from his excellent post, Collocated Indexing and Distributed Search with GigaSpaces:
“This type of integration takes collocation of indexing and searching to a new level. Indexing and Search operations are performed in a collocated manner in memory making them extremely fast. Scalability is easily handled by adding more partitions, and high availability is provided by adding backups to each partition”
Shay and I had various discussions about that model, and I personally see great potential in this type of offering. As many batch analytics applications move to real-time analytics, real-time search will become more common for many applications.
Using the search engine as an advertising channel – Adwords
Once you have your own search running, you've created a potential commerce area without even noticing. Different suppliers can compete on specific keywords and on the real estate (location on the page) they'll get in the search results (first, second, up, down...). Unlike TV commercials, which sell ads for specific timeslots, Adwords provides an advertising channel per search click! One common way to take advantage of Adwords is Google Adsense, which enables you to put Google ads on your site and get rewarded for the clicks that your site generates. Having said that, the limitation of this model is that it relies on the fact that you're using Google as your search engine. If you are looking for real-time search that is tailored to your site, this is not going to work. So if you've already come to the conclusion that you need your own custom search engine, you might as well implement your own Adwords to drive more revenue out of it.
Adwords is a very interesting model: it's basically a sophisticated bidding system. The algorithm matches the supplier's phrase against the phrase in the search query. The matching first looks at those bidders with sufficient budget in their account; it then looks at the track record of each supplier and rates the supplier based on the number of click-throughs their ads received. For those who were selected, another algorithm determines the order and specific location in which the ads appear on the screen. One of the simplest criteria is an exact match of keywords, i.e. ads with an exact match appear higher up in the list.
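The steps above — budget filter, phrase match, click-through-weighted ranking with an exact-match boost — can be sketched as follows. This is a simplified illustration with hypothetical names and scoring weights, not the actual Adwords algorithm:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified sketch of an Adwords-style matcher: filter bidders by
// remaining budget, require a phrase match, then rank by bid weighted
// by click-through rate, boosting exact keyword matches.
public class AdMatcher {
    public static class Ad {
        public final String keyword;
        public final double bid;     // price per click
        public final double ctr;     // historical click-through rate
        public final double budget;  // remaining account budget
        public Ad(String keyword, double bid, double ctr, double budget) {
            this.keyword = keyword;
            this.bid = bid;
            this.ctr = ctr;
            this.budget = budget;
        }
    }

    public static List<Ad> match(String query, List<Ad> ads) {
        List<Ad> result = new ArrayList<>();
        for (Ad ad : ads) {
            // Step 1: only bidders with sufficient budget participate.
            if (ad.budget < ad.bid) continue;
            // Step 2: the ad keyword must appear in the search phrase.
            if (!query.contains(ad.keyword)) continue;
            result.add(ad);
        }
        // Step 3: rank by expected revenue (bid x CTR); exact matches
        // get an (arbitrarily chosen) 2x boost so they appear higher.
        result.sort(Comparator.comparingDouble(
                (Ad ad) -> ad.bid * ad.ctr * (query.equals(ad.keyword) ? 2.0 : 1.0))
                .reversed());
        return result;
    }
}
```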
The challenge with implementing your own Adwords is that this matching process needs to happen in real time during the search. So how can this be done?
If we consider Shay's comment above, it should be fairly clear. If we already have the index in-memory, and search can be executed within the index server itself, why can't we use the same idea for matching Adwords?
Well, the idea behind Space Based Architecture really came from these types of bidding scenarios. We realized that executing the matching algorithm collocated with the data would speed up the matching time significantly and improve response time. Below, I sketched a simple diagram that illustrates how this model would work.
The indexed data will be stored in one partitioned cluster and the Adwords data in another.
The search services will run collocated with the index service, as Banon suggested. In a similar way, the Adwords matching services will run collocated with the Adwords data.
A search request gets to the search portal. The search portal executes the search query as well as the Adwords matching in parallel.
It uses a map/reduce execution, meaning it is able to aggregate all matching results from each partition.
To enable this type of operation you can use either the Service Virtualization Framework, in case you'd like a strongly typed query service, or the executors framework, in case you're looking for ad-hoc task execution.
You can use Futures as the return value to simplify executing the search and the Adwords matching in parallel. A Future enables you to fork those two tasks and then collect the results using the future handle.
If you would like to process the results as they arrive, you can use AsyncResultFilter for that purpose (a typical use case is printing results on your page while still waiting for the rest of the results to arrive). For more information and options on using parallel execution to speed up the search and matching process, refer to the executors framework.
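The fork/collect pattern described above can be illustrated with plain java.util.concurrent Futures. This is a self-contained sketch, not the GigaSpaces executors API; the search and Adwords services are hypothetical stubs standing in for the real partition-collocated tasks:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of forking the search query and the Adwords matching in
// parallel, then collecting both results through Future handles.
public class ParallelSearch {
    // Hypothetical stand-ins for the real search and Adwords services.
    static List<String> searchIndex(String query) {
        return List.of("result-1 for " + query, "result-2 for " + query);
    }
    static List<String> matchAds(String query) {
        return List.of("ad-1 for " + query);
    }

    public static List<List<String>> execute(String query) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Fork: submit both tasks; each returns a Future handle.
            Future<List<String>> search = pool.submit(() -> searchIndex(query));
            Future<List<String>> ads = pool.submit(() -> matchAds(query));
            // Collect: block on each handle; the two tasks ran in parallel.
            return List.of(search.get(), ads.get());
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real deployment each task would itself fan out across the partitions map/reduce style, with the grid aggregating the per-partition results before the Futures complete.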
A real-life scenario
One of the real-life scenarios that I came across recently is Rednano. Rednano is a local portal in Singapore that provides localized search capabilities. Interestingly enough, they built their portal on Spring and Hibernate. They were also looking for a way to build their new add-on services, such as Adwords, in a scalable manner, and GigaSpaces was pretty much the closest solution they could find to meet their needs.
According to Patrick Ng, CTO of Rednano, they were able to integrate GigaSpaces into their existing Spring architecture in a matter of two days! Patrick provides interesting insight into their selection process in an online presentation from a Cloud event in Singapore.
While writing these lines, I came across some interesting news: Rednano wins award for breaking new ground. Quoting from the news item:
Local search and directory engine rednano.sg has won a prestigious international award given to companies with effective online search technologies.
The search engine, a collaboration between Singapore Press Holdings and Schibsted ASA, won the Digital Market Award at the Fastforward’09 business and technology conference in Las Vegas earlier this week.
The annual award is given to companies which constantly ‘create new and value-added services for its users’ in the fast-moving technology world.
Congratulations to Rednano on their success!