Last time we discussed the most common usability concerns for enterprise-scale monitoring. This time around I’ll introduce you to highly configurable time series database and data management tools that can help you avoid these types of situations altogether and keep your company online.
Given that even minutes of downtime can be costly for your business, you have to design systems that are highly available. The last thing you want in a troubled data center is for your monitoring tools to go offline too, for reasons you can’t identify.
Enterprise-Scale Monitoring: High Availability Via Replication
So let’s begin by discussing how the GigaSpaces XAP in-memory data grid can be configured to use InfluxDB, an open source platform for managing time-series data at scale.
InfluxDB provides configurable, built-in resiliency. By choosing a replication factor and fronting the installation with a load balancer, you can choose the level of resiliency that matches your business requirements.
In highly regulated environments (PCI, HIPAA, and financial services), it may be necessary to retain large series of observations. Where business needs are less stringent, you may be able to get away with less retention.
Time series databases allow you to decide how much data to retain.
As new data arrives at your InfluxDB endpoint, data older than the retention window is expunged. The size of this sliding window and the replication factor determine how much data you store, and on how many nodes.
Taken together, configurable time windows and replication parameters let you match the cost of monitoring to your business requirements.
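To make the trade-off concrete, here is a minimal Python sketch of a sliding retention window plus a back-of-the-envelope storage estimate. It illustrates the idea only and is not how InfluxDB implements retention internally. The class name, function name, and all numbers are hypothetical.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the points that fall inside the retention window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.points = deque()  # (timestamp, value) pairs, oldest first

    def append(self, timestamp, value):
        # Assumes timestamps arrive in non-decreasing order.
        self.points.append((timestamp, value))
        cutoff = timestamp - self.retention_seconds
        # Expunge everything that has slid out of the window.
        while self.points and self.points[0][0] <= cutoff:
            self.points.popleft()


def stored_bytes(points_per_second, bytes_per_point,
                 retention_seconds, replication_factor):
    """Rough footprint: one full window of data per replica (assumed model)."""
    return (points_per_second * bytes_per_point
            * retention_seconds * replication_factor)
```

For instance, under these assumptions 1,000 points per second at 16 bytes each, kept for one hour on two replicas, comes to `stored_bytes(1000, 16, 3600, 2)` bytes. Doubling either the window or the replication factor doubles the footprint.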
A Real-World Example
Now let’s turn our attention to a reference customer, a very large e-commerce business. They use XAP’s Premium Edition data fabric to host multiple low-latency, high-throughput applications that are critical to their operations.
For this business, uptime is an existential concern.
To give a sense of the scale of their operations, tens of terabytes of data are under management in redundant data centers running on about 150 physical servers.
For purposes of this example, we will focus on inventory management, barcode, and customer cart applications.
Most real-world applications have specific runtime profiles.
For example, if you’re running an inventory management system, the behavior will be read-mostly, with particular interest paid to SKU count updates, where row-level locking is a concern.
In this case, a custom metric around “hot SKU” contention can be monitored as a KPI.
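As a rough illustration, a “hot SKU” metric could be derived by counting lock waits per SKU and emitting the top offenders in InfluxDB line protocol. This is a sketch under assumptions: the measurement name `hot_sku_contention`, the field `lock_waits`, and the function itself are hypothetical, and this is not XAP’s own metrics API. It also assumes SKU ids and host names contain no spaces or commas, which would otherwise need escaping.

```python
from collections import Counter


def hot_sku_points(contended_skus, host, timestamp_ns, top_n=3):
    """Return line-protocol strings for the most contended SKUs.

    contended_skus: iterable of SKU ids, one entry per lock wait observed.
    """
    counts = Counter(contended_skus)
    lines = []
    for sku, waits in counts.most_common(top_n):
        # Layout: measurement,tag_set field_set timestamp
        # The trailing "i" marks lock_waits as an integer field.
        lines.append(
            f"hot_sku_contention,host={host},sku={sku} "
            f"lock_waits={waits}i {timestamp_ns}"
        )
    return lines
```

Keeping the SKU as a tag makes it easy to group by SKU in a dashboard, at the cost of one series per SKU, which is why the sketch only reports the top few.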
Nifty, right? But it gets even better.
Once required metrics have been identified and stored, we need to make operational use of them. XAP provides simple, customizable visualizations.
Provide Smart Dashboards to Get Metrics
By integrating Grafana, you can provide smart dashboards from which to slice and dice all of the important metrics, including your KPIs. You can use it to interpret XAP metrics and create graphs that are fully interactive and editable.
You won’t need any expensive software or multiple programs to get the job done. Simply build dashboards that match your application architecture.
XAP Hosts Dashboard
XAP is deployed on hosts: physical, virtual, or in the cloud. So we represent host metrics in a Hosts dashboard. The rationale for this is that certain metrics – OS swap, CPU %, network traffic – don’t make much sense in a cluster-wide or aggregated context. Each measurement applies to exactly one host.
The following XAP hosts dashboard has 6 panels. Each line in the graphs corresponds to a single host (of which there are 44). Hovering over any one of them with your pointer brings up more information about the measurement. In all cases displayed in this post, we’re representing a 3-hour moving average.
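For readers unfamiliar with the smoothing used in these panels, a trailing 3-hour moving average can be sketched in a few lines of Python. In practice the dashboard or query layer computes this for you; the function below is a hypothetical stand-in that only shows the arithmetic.

```python
from collections import deque


def moving_average(samples, window_seconds=3 * 3600):
    """Yield (timestamp, average) over a trailing time window.

    samples: iterable of (timestamp_seconds, value), sorted by timestamp.
    """
    window = deque()
    total = 0.0
    for ts, value in samples:
        window.append((ts, value))
        total += value
        # Drop samples that are now older than the window.
        while window[0][0] <= ts - window_seconds:
            _, old = window.popleft()
            total -= old
        yield ts, total / len(window)
```

A short spike in CPU % moves the averaged line only slightly, which is why these graphs are better for spotting sustained shifts than momentary peaks.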
Now let’s model an operational use case and see how it works in real time.
If the system were to rebalance after an unplanned vMotion event, redistributed workloads might be moved to a host that is already performing a more or less constant amount of work. We would expect an increase in that host’s resource usage (CPU and/or RAM, depending upon the nature of the added workload).
So we look at the Hosts dashboard. We see an increase in CPU here, but at 80%, it’s likely to remain stable enough for our Ops folk to rebalance.
XAP Activity Dashboard
The next thing to look at would be application performance. We collect all XAP API call metrics in a XAP Activity Dashboard.
Since the Inventory Service is the most mission-critical in case of system downtime, we’ll concentrate on that. We have reason to be worried, because one of the Inventory Service Processing Unit Instances is collocated with two Customer Cart Instances.
As mentioned earlier, inventory applications inherently contend for SKU number access, particularly at time of update (when an item is sold).
So we look at read calls, update calls and their ratios. Since this system was intensively performance tested ahead of time, we know in advance the maximum throughput for the inventory subsystem.
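The check described here boils down to simple arithmetic, sketched below in Python. The function name and the figures in the usage note are hypothetical. The maximum throughput is whatever your own performance tests established.

```python
def inventory_load(reads_per_sec, updates_per_sec, max_ops_per_sec):
    """Summarize the read/update mix and remaining headroom."""
    total = reads_per_sec + updates_per_sec
    # With no updates at all, the ratio is unbounded.
    ratio = reads_per_sec / updates_per_sec if updates_per_sec else float("inf")
    return {
        "read_update_ratio": ratio,
        "utilization": total / max_ops_per_sec,
        "headroom_ops_per_sec": max_ops_per_sec - total,
    }
```

For example, 9,000 reads and 1,000 updates per second against a tested ceiling of 20,000 operations per second gives a 9:1 ratio at 50% utilization. A falling ratio at rising utilization would suggest update contention is becoming the bottleneck.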
Then we verify that the database mirror’s redo log is not growing so fast that it will fill up before the issue is resolved.
Again, redo log growth has been measured in advance, so we’re here to verify what we knew already.
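The redo log check can be expressed as a linear estimate of time remaining, as in the Python sketch below. It assumes a fixed capacity and a constant growth rate, both of which are simplifications. The function names and the numbers in the usage note are hypothetical.

```python
def seconds_until_full(current_entries, capacity_entries, growth_per_sec):
    """Linear estimate of time left before the redo log reaches capacity."""
    if growth_per_sec <= 0:
        return float("inf")  # not growing, so it never fills
    remaining = max(capacity_entries - current_entries, 0)
    return remaining / growth_per_sec


def survives_outage(current_entries, capacity_entries,
                    growth_per_sec, expected_resolution_sec):
    """True if the log should last at least until the expected resolution."""
    time_left = seconds_until_full(
        current_entries, capacity_entries, growth_per_sec
    )
    return time_left >= expected_resolution_sec
```

With 200,000 entries in a log sized for 1,000,000 and growth of 400 entries per second, the estimate is 2,000 seconds. That covers a 30-minute fix but not a one-hour one.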
XAP Services Dashboard
Finally, we want to check that core XAP services and systems are functioning appropriately. Our final dashboard collects XAP Framework metrics:
Taken together, XAP Premium Edition’s integrations with InfluxDB and Grafana provide a turnkey solution to your monitoring needs.
This allows you to do a cost-benefit analysis at design time to determine the most cost-efficient configuration sufficient to meet your business SLAs.
Other benefits include downtime reduction, licensing cost reduction, and continuous system status visibility. Plus, you decide which bits are most meaningful for your business.