A recent post on GigaSpaces on TheServerSide generated some interest in our new 5.2 feature, which allows handling slow consumers, so I thought I'd explain the issue in more detail.
Slow consumer is a term used to describe a state in which a publisher/writer can write faster than a consumer can receive messages. This situation can lead to backlogs on the server side and excessive thread consumption, which can eventually exhaust the server's memory and threads. In pure in-memory data grids this can be a serious issue that limits the Space's ability to reliably disseminate data to a large pool of clients.
Traditionally there are several methods for handling slow consumers:
1. block/slow the producer
2. drop the slow consumer
3. spool messages to disk
4. discard messages for the slow consumer
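To make the trade-offs concrete, here is a minimal, generic sketch of the four policies applied to a bounded per-consumer buffer. This is an illustrative Python sketch, not GigaSpaces API; all class and method names are invented for the example.

```python
from collections import deque
from enum import Enum

class Policy(Enum):
    BLOCK_PRODUCER = 1    # Option 1: back-pressure the writer
    DROP_CONSUMER = 2     # Option 2: disconnect the laggard
    SPOOL_TO_DISK = 3     # Option 3: overflow to persistent storage
    DISCARD_MESSAGES = 4  # Option 4: drop the oldest buffered events

class ConsumerChannel:
    """A bounded server-side buffer for one consumer (illustrative only)."""
    def __init__(self, policy, limit=3):
        self.policy = policy
        self.limit = limit
        self.buffer = deque()
        self.spool = []          # stands in for a disk file
        self.connected = True

    def publish(self, event):
        """Returns False only when the producer must block (Option 1)."""
        if not self.connected:
            return True          # consumer already dropped; nothing to do
        if len(self.buffer) < self.limit:
            self.buffer.append(event)
            return True
        # Buffer is full -- apply the slow-consumer policy:
        if self.policy is Policy.BLOCK_PRODUCER:
            return False         # caller must retry later
        if self.policy is Policy.DROP_CONSUMER:
            self.connected = False
        elif self.policy is Policy.SPOOL_TO_DISK:
            self.spool.append(event)
        elif self.policy is Policy.DISCARD_MESSAGES:
            self.buffer.popleft()
            self.buffer.append(event)
        return True
```

Note how the costs show up directly in the sketch: blocking stalls the producer for everyone, spooling trades memory for disk I/O, and the two "drop" variants sacrifice either the consumer or its messages.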
All of the options above can be very costly in terms of performance and/or reliability. The only options that guarantee reliability and consistency are Options #1 and #3.
Both carry a very high price in performance and scalability that affects not just the slow consumers, but the entire cluster. Option #3 is more popular because it only indirectly affects the performance of other, non-slow consumers, whereas in Option #1 the effect is direct and costly.
This is how slow consumers were handled in the messaging world – the question that I wanted to raise is whether an In-Memory-Data-Grid (IMDG) in general, and space clusters in particular, changes any of the existing assumptions?
To answer this question, we need to take a step back and define the real need: The need is to maintain a consistent local image on the client-side, even in cases of connection failure and/or slow consumers.
One way to address it, as described by Option #3 above, is to ensure that all the events that led to the current state are stored and replayed when needed. Another approach, not mentioned above, would simply be to get a snapshot of the current state from the server and continue to receive updates from that point on. The latter was not really an option with traditional messaging systems because the notion of 'current state' didn't exist. A Space, on the other hand, maintains the current state by its core nature, and therefore it makes much more sense to handle slow consumers this way. In other words, when a local view is detected as a slow consumer, it is disconnected from the server. The client then reconnects to the server and reinitializes its current state from a snapshot. The capability to maintain such a client-side view is a new feature in GigaSpaces Version 5.2, which we call Continuous Local View.
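The snapshot-then-stream recovery idea can be sketched in a few lines. This is a conceptual illustration only; the `Server`/`LocalView` classes and the version counter below are invented for the sketch and are not the GigaSpaces API.

```python
class Server:
    """Holds the authoritative current state plus a version counter."""
    def __init__(self):
        self.state = {}
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.state[key] = value
        return (self.version, key, value)   # the update event

    def snapshot(self):
        # Return the current state together with its version,
        # so the client knows which updates are already reflected.
        return dict(self.state), self.version

class LocalView:
    """Client-side view that recovers via snapshot instead of replay."""
    def __init__(self, server):
        self.server = server
        self.reconnect()

    def reconnect(self):
        # On (re)connect: take the current snapshot, then continue
        # applying update events from that version onward.
        self.state, self.version = self.server.snapshot()

    def apply(self, event):
        version, key, value = event
        if version <= self.version:
            return               # already reflected in the snapshot
        self.state[key] = value
        self.version = version
```

The key point is that recovery cost is proportional to the size of the current state, not to the number of events missed while the consumer was slow or disconnected.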
In addition to the messaging scenario above, the issue of slow consumers is relevant to other Space cluster communication channels, specifically Notification (an ad-hoc communication channel from a server to its clients) and Local-Cache/Local-View (which utilizes notification to maintain a local view on the client side).
Replication, however, works differently and uses a combination of some of the options mentioned above. Even though this is not a new feature, it's important to bring it up in this discussion to complete the picture.
A replication channel is a communication channel between two Spaces. In this scenario, if one or more of the target Spaces is slower than the sender, the sending Space maintains a dedicated pool and buffers the sent events for the slow Space over time. The buffer can be limited in size to avoid running out of server-side memory. Unlike persistent buffers, in-memory buffers contain only references to the data and not a deep copy of it, which enables a relatively large buffer size with minimal memory consumption. If the buffer becomes full, the slow Space will detect that event (it will see that the server queue no longer holds the data it is waiting for) and will initiate a complete recovery: it will take the current snapshot from the sender, clear its buffers, and continue from that point onward.
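The buffer-then-recover behavior can be modeled roughly as follows. Again, this is an illustrative Python sketch with invented names, not the actual replication implementation: the buffer holds only (sequence, key) references into the sender's state, and the target falls back to a full snapshot when it detects that the event it needs has already been evicted.

```python
from collections import deque

class ReplicationChannel:
    """Bounded in-memory redo buffer holding references, not copies."""
    def __init__(self, source, capacity=4):
        self.source = source       # the sending Space's state, by reference
        self.capacity = capacity
        self.buffer = deque()      # (sequence, key) pairs -- references only
        self.seq = 0

    def record(self, key):
        self.seq += 1
        if len(self.buffer) == self.capacity:
            self.buffer.popleft()  # oldest entry falls off the buffer
        self.buffer.append((self.seq, key))

class TargetSpace:
    """The (possibly slow) receiving Space."""
    def __init__(self, channel):
        self.channel = channel
        self.state = {}
        self.applied = 0

    def poll(self):
        """Apply pending events, or detect a gap and fully recover."""
        buf = self.channel.buffer
        if buf and buf[0][0] > self.applied + 1:
            # The event we need was evicted: take a full snapshot instead.
            self.state = dict(self.channel.source)
            self.applied = self.channel.seq
            buf.clear()
            return "recovered"
        for seq, key in list(buf):
            if seq > self.applied:
                self.state[key] = self.channel.source[key]
                self.applied = seq
        return "replayed"
```

Because the buffer stores references rather than deep copies, a full buffer costs little memory; the price of eviction is paid only by the slow Space, in the form of a one-time snapshot recovery.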
We can also achieve a level of flexibility similar to that provided by the replication channel (a slow-consumer policy that implicitly maintains a buffer up to a certain size limit per consumer) in the two notification-based scenarios: the Notification Channel and the Local-Cache/Local-View. Once a consumer exceeds its buffer limit, a reconnect is enforced.