InsightEdge Platform and XAP 15.0
Now more than ever, enterprises are investing in machine learning analytics solutions to gain value from their operational and historical data, improve business decision-making, and create new revenue streams. But many organizations struggle to move their machine learning models into production. They lack the speed, scale, and accuracy needed to build the data pipelines that feed feature vectors and continuously retrain models to maintain the required accuracy.
What is Machine Learning Operations (MLOps) and How Can it Help?
Businesses are adopting Machine Learning Operations (MLOps), a practice for collaboration and communication between data scientists and the operations or production team that helps manage the production machine learning lifecycle.
Similar to DevOps in the software development world, MLOps looks to increase automation and improve the quality of production ML while also focusing on business and regulatory requirements. This collaboration includes the ability to deploy machine learning (ML) projects using today’s production infrastructures like Spark and Kubernetes, both on-premise and in the cloud.
What Problems Will MLOps Solve? The GigaSpaces Approach
Adopting an MLOps approach will help organizations close the loop between gaining insight and turning it into actionable business value. GigaSpaces’ release 15.0 provides key features to allow businesses to do just that:
6 New Features Introduced in Release 15.0
1. Easy Monitoring with New GigaSpaces Ops Manager
Monitor applications for performance issues using ML-centric capabilities such as infrastructure monitoring and alerts. The Ops Manager is GigaSpaces’ new enterprise-class management platform for GigaSpaces-based clusters running on cloud, on-premise, or hybrid infrastructure.
2. Enhanced Intelligent Data Tiering
Retrieve your frequently accessed cold data up to 100X faster with GigaSpaces AnalyticsXtreme. The new batch indexing feature in the AnalyticsXtreme module automatically optimizes historical data storage by moving data between the frequent-access (cold data) and infrequent-access (archive data) tiers. AnalyticsXtreme Batch Indexing improves performance and cuts costs as data access patterns change.
3. Kubernetes as a Core Platform for MLOps: Kubernetes Space-based Remoting
Easily deploy machine learning projects from any platform on modern production infrastructures such as Kubernetes and Spark, in any cloud or on-premise. GigaSpaces InsightEdge now provides a native smart Space client in Kubernetes that supports remote CRUD operations, task execution, and event-driven analytics. The smart client is aware of the distributed cluster topology, which enables high throughput, fast serialization, and automatic load balancing.
4. GigaSpaces Connector for Apache Kafka
The official GigaSpaces Connector for Apache Kafka has been released and also verified by Confluent, following the guidelines set forth by Confluent’s Verified Integrations Program.
5. Tableau Server Support
Tableau Server provides direct, authorized access to live Space objects and documents, allowing users to build and run complex filtered queries against InsightEdge. For more on how to accelerate the visualization of operational data, analytics, and ML insights in real time, get a copy of the GigaSpaces and Tableau white paper.
6. Dynamic Schema Versioning
InsightEdge now supports dynamic schema definition by enabling the writing and updating of data without a predefined schema. Easily make changes to the data model while ensuring its compatibility with JDBC and BI tools. With dynamic schema versioning, organizations can integrate code more reliably, deploy faster, and lower administration overhead.
Download release 15.0 from GigaSpaces Download Center
Getting Started With MLOps and GigaSpaces’ Version 15.0
GigaSpaces Ops Manager
The GigaSpaces Ops Manager is an enterprise-grade monitoring and administration tool that provides visibility into the components of an organization’s system, starting at the cluster level and drilling through to individual services and service instances. It enables continuous development and monitoring of microservices and machine learning pipelines, so businesses can maintain accurate data models and ensure that small problems are resolved before they affect general performance. Users can view performance metrics and system alerts at the cluster level to immediately identify problem areas.
Users can also view the services that are running in the cluster, with the ability to drill down to individual instances to get additional performance metrics for analysis and troubleshooting.
Intelligent Data Tiering
With the new AnalyticsXtreme Batch Indexing feature, GigaSpaces’ smart indexing of aged data has gotten smarter. Organizations can have even more control over the life cycle of their data by differentiating between cold data (which has aged but is still frequently accessed) and archived data (which is needed for historical purposes but is infrequently accessed, and therefore doesn’t impact the performance of queries). With the help of GigaSpaces’ MemoryXtend, organizations can control the cost and memory footprint for hot and warm data. In addition, thanks to GigaSpaces’ AnalyticsXtreme, businesses can define the exact time when warm data becomes cold and move it to data object stores like Hadoop or Cloud Object Stores.
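The tier distinctions described above can be sketched in a few lines of Python. This is only an illustration, not GigaSpaces code: the function name `classify_tier`, the age thresholds, and the reads-per-day cut-off are all hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical policy values, chosen only for illustration.
WARM_AFTER = timedelta(days=1)     # age at which hot data becomes warm
COLD_AFTER = timedelta(days=30)    # age at which warm data becomes cold
FREQUENT_READS_PER_DAY = 1.0       # aged data read less often than this is archive


def classify_tier(written_at, now, reads_per_day):
    """Return 'hot', 'warm', 'cold', or 'archive' for one record."""
    age = now - written_at
    if age < WARM_AFTER:
        return "hot"
    if age < COLD_AFTER:
        return "warm"
    # Aged data: still frequently accessed means cold, otherwise archive.
    if reads_per_day >= FREQUENT_READS_PER_DAY:
        return "cold"
    return "archive"


now = datetime(2020, 1, 31)
print(classify_tier(now - timedelta(days=90), now, reads_per_day=10))   # cold
print(classify_tier(now - timedelta(days=90), now, reads_per_day=0.1))  # archive
```

The point of the sketch is that age alone separates hot, warm, and aged data, while access frequency separates cold data from archive data.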
Batch indexing is supported on any cloud-based object store/data lake that is compatible with Apache Hive, and is based on the ability to store data according to partition. Further bucketing the data in “time slices” enables enhancing queries that match the index period, so InsightEdge only has to retrieve specific partitions instead of scanning the entire batch layer.
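The pruning idea can be illustrated with a small, self-contained Python sketch. It is not the AnalyticsXtreme implementation. The daily partition layout and the helper names `daily_partitions` and `prune` are assumptions made for the example.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """All partition keys (one per day) in the half-open range [start, end)."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days)]


def prune(partitions, query_from, query_to):
    """Keep only the partitions whose day falls inside the query window."""
    return [p for p in partitions if query_from <= p < query_to]


# Hypothetical batch layer: one partition per day for the year 2019.
all_parts = daily_partitions(date(2019, 1, 1), date(2020, 1, 1))

# A query restricted to one week only needs the partitions for those days.
week = prune(all_parts, date(2019, 6, 3), date(2019, 6, 10))

print(len(all_parts), len(week))  # 365 7
```

A query whose time predicate lines up with the partition boundaries touches 7 partitions here instead of 365. That is the effect the time-slice index is meant to achieve.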
An Example of GigaSpaces’ AnalyticsXtreme Batch Indexing Feature
If a company chooses to store its data by year and partition it by month, it can define an index with the necessary granularity (day, week, etc.) taking into account the need to balance performance vs. the cost/storage for the batch index footprint.
For instance, a stock trading site can amass impressive amounts of data. It queries this data frequently, no matter where it’s stored (RAM, Persistent Memory, SSD, or Data Lakes), and expects a quick response. Scanning the entire batch layer (all partitions) can take a long time to return results.
The following example shows how InsightEdge can accelerate these queries. It starts with a query that specifies a time and date outside the index period defined in InsightEdge.
Without InsightEdge, the query takes over 4 minutes when scanning 2,500 partitions in Hive.
With InsightEdge, when the query is modified to use a time and date that leverage the intelligent indexing, only 130 partitions are relevant, so the query takes ~3 seconds: roughly an 80X performance improvement.
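The quoted figures can be checked with a few lines of arithmetic. The sketch below treats "over 4 minutes" as exactly 240 seconds (a lower bound) and "~3 seconds" as exactly 3 seconds. Both are simplifying assumptions.

```python
full_scan_seconds = 4 * 60     # "over 4 minutes", taken as a lower bound
indexed_seconds = 3            # "~3 seconds"
full_scan_partitions = 2500
indexed_partitions = 130

speedup = full_scan_seconds / indexed_seconds
fraction_scanned = indexed_partitions / full_scan_partitions
partition_ratio = full_scan_partitions / indexed_partitions

print(f"speedup: {speedup:.0f}x")                       # speedup: 80x
print(f"partitions scanned: {fraction_scanned:.1%}")    # partitions scanned: 5.2%
print(f"partition reduction: {partition_ratio:.1f}x")   # partition reduction: 19.2x
```

Under these assumptions the indexed query reads about 5% of the partitions and runs about 80 times faster.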
When you can separate non-essential data from data that you need to access more regularly, you can realize order-of-magnitude performance improvements. AnalyticsXtreme Batch Indexing improves speed and scalability for large data sets.
Additional Features
Space-Based Remoting for Kubernetes
Now that Space-based remoting is supported for Kubernetes, your application can use remote invocations of microservices within the Space (that resides in a data pod) using either of the following two remoting mechanisms.
Executor-Based Remoting
This task-based mechanism is the most commonly used type. The client submits a specific task type, which executes on the Space side and calls the invoked service.
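The task-based pattern can be sketched with the Python standard library. This toy does not use the GigaSpaces API. `ToySpace`, `InvokeTask`, and `ScoringService` are hypothetical names, and a thread pool stands in for execution on the Space side.

```python
from concurrent.futures import ThreadPoolExecutor


class ToySpace:
    """Stand-in for a Space that hosts services and runs submitted tasks."""

    def __init__(self):
        self._services = {}
        self._pool = ThreadPoolExecutor(max_workers=2)

    def register(self, name, service):
        self._services[name] = service

    def submit(self, task):
        # The task runs next to the services: it receives the local registry.
        return self._pool.submit(task, self._services)

    def close(self):
        self._pool.shutdown()


class InvokeTask:
    """A task that names the service method to call and carries its arguments."""

    def __init__(self, service, method, *args):
        self.service, self.method, self.args = service, method, args

    def __call__(self, services):
        target = getattr(services[self.service], self.method)
        return target(*self.args)


class ScoringService:
    def score(self, features):
        return sum(features) / len(features)


space = ToySpace()
space.register("scoring", ScoringService())
future = space.submit(InvokeTask("scoring", "score", [1.0, 2.0, 3.0]))
result = future.result()   # block until the remote-side call has finished
space.close()
print(result)  # 2.0
```

The caller only describes the invocation. The service lookup and the call itself happen where the service lives, and the caller receives a future for the result.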
Event-Driven Remoting
This mechanism utilizes the GigaSpaces polling container and may be preferable if the actual service isn’t located in the same pod as the Space.
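A polling-based invocation flow can be sketched as follows. This is a simplified illustration of the general pattern, not the GigaSpaces polling container. The queue, the `polling_worker` function, and `EchoService` are hypothetical stand-ins.

```python
import queue
import threading

invocations = queue.Queue()   # stands in for invocation entries written to the Space
responses = {}                # request id -> result
stop = threading.Event()


def polling_worker(service):
    """Poll for invocation entries, call the service, and store each result."""
    while True:
        try:
            req_id, method, args = invocations.get(timeout=0.1)
        except queue.Empty:
            if stop.is_set():
                return
            continue
        responses[req_id] = getattr(service, method)(*args)
        invocations.task_done()


class EchoService:
    def shout(self, text):
        return text.upper()


worker = threading.Thread(target=polling_worker, args=(EchoService(),))
worker.start()

invocations.put((1, "shout", ("hello",)))   # the client writes an invocation entry
invocations.join()                          # wait until the entry has been processed
stop.set()
worker.join()

print(responses[1])  # HELLO
```

Because the caller and the service communicate only through entries in a shared store, the service that picks up the work does not have to run in the same process as the store.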
Space-based remoting provides high availability and fast performance, which supports the MLOps need for auto-recovery and fast data ingestion.
Confluent-Approved GigaSpaces Connector for Apache Kafka
GigaSpaces and Confluent have partnered to deliver scalable, always-on event streaming and processing capabilities, so that organizations can execute real-time analytics as part of their workflows and applications. The official GigaSpaces Connector for Apache Kafka has been released and also verified by Confluent, following the guidelines set forth by Confluent’s Verified Integrations Program.
The GigaSpaces Kafka connector enables simple integration with Kafka streaming, ensuring that GigaSpaces-based applications, acting as either the consumer or the producer, can easily provide the following:
- High performance for data stream ingestion and low latency
- Scalability
- Always-on reliability
- Real-time analytics and machine learning
Kafka acts as the central hub for real-time streams of data, which are processed using InsightEdge and Spark Streaming. After the data is processed, the results can be published into yet another Kafka topic or stored in HDFS or databases.
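The consume, process, and publish flow can be sketched without any Kafka client library. In the toy below an in-memory dictionary stands in for Kafka topics. The topic names `trades` and `trade-alerts`, the `process` rule, and the helper functions are hypothetical and do not reflect the connector's API.

```python
from collections import defaultdict

# In-memory stand-in for Kafka: topic name -> list of messages.
topics = defaultdict(list)


def produce(topic, message):
    topics[topic].append(message)


def process(event):
    """Placeholder for the analytics step (for example, scoring with a model)."""
    return {"symbol": event["symbol"], "alert": event["price"] > 100}


def run_pipeline(source, sink):
    """Consume every event from `source`, process it, and publish to `sink`."""
    for event in topics[source]:
        produce(sink, process(event))


produce("trades", {"symbol": "ABC", "price": 120})
produce("trades", {"symbol": "XYZ", "price": 80})
run_pipeline("trades", "trade-alerts")

print(topics["trade-alerts"])
```

In a real deployment the processing step would run in InsightEdge and Spark Streaming, and the results could equally be written to HDFS or a database instead of a second topic.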
Summary
According to Gartner’s 2019 report on Machine Learning and AI, although developing ML models has become much easier, seamless data preparation and model deployment and operations continue to be elusive for most organizations.
By applying DevOps principles for machine learning, organizations can help bring business interest back to the forefront of ML operations. Data scientists can now work through the lens of organizational interest with clear direction and measurable benchmarks.
GigaSpaces’ latest release continues to meet the needs of the market by helping enterprises run their ML models in production. Version 15.0 focuses on improving the deployment, monitoring, and management of machine learning projects on modern production infrastructures like Kubernetes and Spark, across on-premise, cloud, hybrid, and multi-cloud environments.
Additional Links
- Download release 15.0 from the GigaSpaces Download Center.
- Read more about what’s new in release 15.0 on our Documentation website.