Risk management systems are in widespread use across many organizations, ingesting data and producing reports on levels of business risk – from strategic, compliance and operational risks to financial and reputational risks. But are these reports delivering sufficient insight on a timely basis?
Consider financial services, insurance applications, payment processing systems, train and airline route simulations, smart airports and self-driving cars. All leverage risk management systems, but are challenged because the relevant data is constantly changing, being updated and growing at a rapid pace.
Whether organizations run long, overnight extract-transform-load (ETL) batch jobs or generate live reports on raw data immediately upon user request – locally or in the cloud – they need a strong distributed engine that brings calculation and retrieval times as close as possible to real time.
Organizations can also optimize decision-making and further automate workflows by adding user simulations, such as “what if” scenarios, and machine learning algorithms that run in the background for continuous learning, so that the most up-to-date model is always in production.
It’s all about making deeper insight available on-demand – enabling users to query and run any report, without investing in customized development.
This article discusses the unified architecture below, which depicts how InsightEdge – the fastest big data analytics processing platform – functions as the fast data tier, supporting both push and pull capabilities from the archive tier (Hadoop or cloud storage) and delivering continuous and on-demand deep insights on data.
Figure 1: Architecture for on-demand deep insights
Financial Services Applications: The Real-Time Challenge
Real-Time Risk Management
Real-time risk management systems have many intertwined labor/resource-intensive moving parts. This poses the following challenges:
- Lack of workflow visibility.
- Potential loss of data from cascading failures – some due to sync challenges – exposing many risk factors.
- Delayed batch processing, making the results unusable for real-time scenarios.
Figure 2: Typical real-time risk management system architecture
Real-Time Trade Execution
Modern trade execution requires a new standard – a low latency trading platform with an ultra-fast pricing engine. The ability to easily scale in order to support velocity changes and linear volume growth is essential, alongside APIs that can ingest heterogeneous external data sources.
However, high performance is limited by database constraints such as resource overload, single points of bottleneck and concurrent connection limits that lead to too many active peer-to-peer (P2P) connections. Furthermore, development interdependencies can cause a rolling effect across subsystems in the overall architecture.
Figure 3: Typical trade execution architecture
Market Data Real-Time Analytics & Machine Learning
Many options are available for market data analysis and for consolidating services on unified platforms. The selection and design of the solution architecture are influenced by the requirements for rapid, informed decisions and mitigation of risks at the time of transaction. The main requirement is real-time insight for timely decision-making, increasing business agility and flexibility.
However, a number of challenges exist, including:
- Multiple network hops and computation spread across multiple grids, which are difficult to maintain.
- Constantly increasing data (by terabytes) from a variety of sources.
- Difficulty in accelerating time-to-market and enriching customer experience, because of the growth in internal and external data sources that need to be converged.
- Compliance with constantly-emerging industry regulations and standards.
- Reducing fraud exposure in real-time, which is imperative for up-to-date system management and optimal operations.
Figure 4: Typical real-time analytics and machine learning architecture
The Real-Time “Operational” Challenge
Financial services organizations are leveraging data from many segregated sources, as shown in the following table:
| Retail & Consumer Banking | Investment Banking | Credit Card Banking & Payment Gateway |
|---|---|---|
| Bank transactions | Trade data | Card transactions |
| Customer data | Customer data | Customer data |
| ATM activity | Web logs | Online activity |
| Online activity | Research / publications | Demographic / census data |
| Mobile activity | Market data | Marketing / CRM |
| Demographic / census data | Communications / documentation | Integration with retailers / loyalty |
| Marketing / CRM | | Social / sentiment |
| Social / sentiment | | |
Table 1: Examples of different data sources leveraged by Financial Services Organizations
However, all of this data is currently handled in a similar way, slowing overall data ingestion, processing and analytics.
Furthermore, the traditional data warehousing approach is a prolonged batch process that eliminates the ability to query live data throughout the workflow; intraday batch processing and the querying of stale data are the de facto reality in many financial organizations. By definition, this latency renders the data valueless for timely responses. That is why real-time operation is a fundamental element in any distributed calculation engine.
The VaR Challenge
In the financial world, risk management involves analyzing the market and credit risk that an investment bank or its clients take onto their balance sheet during transactions or trades.
Value at Risk (VaR) is a measure of the risk of loss from investments. It estimates how much a set of investments might lose (with a given probability), given normal market conditions, in a set time period such as a day. VaR is typically used by companies and regulators in the financial industry to gauge the amount of assets needed to cover possible losses.
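To make the measure concrete, a one-day VaR can be estimated with a simple historical simulation: sort the historical daily profit-and-loss (P&L) values and read off the loss at the chosen percentile. The following minimal sketch uses hypothetical P&L figures for illustration only:

import java.util.Arrays;

public class HistoricalVaR {

    // Estimates VaR from historical daily P&L values; confidence is e.g. 0.99 for 99% VaR.
    // The result is reported as a positive loss amount.
    public static double valueAtRisk(double[] dailyPnl, double confidence) {
        double[] sorted = dailyPnl.clone();
        Arrays.sort(sorted); // ascending: the worst losses (most negative) come first
        int index = (int) Math.floor((1.0 - confidence) * sorted.length);
        return -sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        // Hypothetical daily P&L history (e.g. in $M), for illustration only
        double[] pnl = {1.2, -0.8, 0.4, -2.5, 0.9, -1.7, 0.3, -0.2, 2.1, -3.4};
        System.out.printf("99%% one-day VaR: %.2f%n", valueAtRisk(pnl, 0.99));
    }
}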
VaR reporting requires a common data dictionary, consistent and accurate analytics outcomes, and consistent risk and PnL aggregation and reporting. This presents a series of technical challenges, particularly because:
- Traditional databases are of limited value when the VaR data cannot be linearly aggregated.
- Most risk-reporting warehouses pre-aggregate all frequently-used dimensions and use the pre-aggregated values to report VaR, leaving only a limited, fixed set of available views. Additionally, calculating VaR along different dimensions, or for a custom set of assets, requires a new aggregation job, resulting in delays of hours or even days.
- A polyglot architecture is nearly impossible, because a multi-codebase infrastructure requires a separate code repository and maintenance effort for each datastore/cache.
- New analytics and aggregations require new views and schema changes. Only standard slice-and-dice operations and simple aggregation functions can be used for reporting, because of a shallow schema with limited analytical capabilities.
Achieving the holy grail of analytics – live querying without pre-aggregation of the entire risk profile in the data model – allows users to ask any question through the reporting layer and run any report without requesting custom analytics jobs, making deeper insight available on-demand.
Addressing the Need for a Real-Time Distributed Calculation Engine
A Distributed Calculation Engine addresses the aforementioned challenges by:
- Calculating Net Present Value (NPV) for large amounts of trades in real time, using an in-memory compute grid.
- Integrating an intelligent, fast, in-memory MapReduce programming model to direct calculations to distributed nodes, reducing network traffic and lowering the load on each calculation node; and performing lazy data loads in batch mode, optimizing database access when a cache miss occurs.
- Performing NPV calculations where the trades used for the calculation are divided into several books which can represent different types of trades, markets, customers, etc.
- Using Apache Zeppelin to run Apache Spark workloads and functions on the raw data in the data fabric. This enables the use of any Spark library – such as MLlib – directly on the data, and eliminates the need for excessive data shuffling and network I/O.
For example, an NPV calculation over six years of cash flows can be performed using the following code:
public void calculateNPV(double rate, Trade trade) {
    // Per-period discount factor derived from the percentage rate
    double disc = 1.0 / (1.0 + (rate / 100));
    CacheFlowData cf = trade.getCacheFlowData();
    // Discount six years of cash flows back to present value (nested/Horner form)
    double NPV = cf.getCacheFlowYear0()
            + disc * (cf.getCacheFlowYear1()
            + disc * (cf.getCacheFlowYear2()
            + disc * (cf.getCacheFlowYear3()
            + disc * (cf.getCacheFlowYear4()
            + disc * cf.getCacheFlowYear5()))));
    trade.setNPV(NPV);
}
This can be described using the following formula:
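NPV = CF0 + CF1/(1 + r) + CF2/(1 + r)^2 + CF3/(1 + r)^3 + CF4/(1 + r)^4 + CF5/(1 + r)^5

where CFt is the cash flow in year t and r is the discount rate (rate/100 in the code above).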
The real-time distributed calculation engine powers the running of the NPV calculation while simultaneously enabling brokers to run active queries from Apache Zeppelin, leveraging Apache Spark API (using Scala, Python and R) on top of the raw data and calculations.
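As a simple illustration of such an ad-hoc query, the following sketch uses Spark’s Java API; the table name “trades” and its registration as a Spark view are assumptions for the example rather than platform specifics:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BrokerQuery {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broker-npv-query")
                .getOrCreate();

        // Assumes the live trade data has already been exposed to Spark as a
        // table or temporary view named "trades" (an assumption for this sketch).
        Dataset<Row> npvByBook = spark.sql(
                "SELECT book, SUM(npv) AS total_npv FROM trades GROUP BY book ORDER BY total_npv DESC");

        npvByBook.show();
    }
}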
Even though the ability to run heavy workloads while allowing brokers to query live data is considered a necessity for live business operations, in reality it is non-existent in many organizations that rely on traditional workflows and legacy architectures.
Machine Learning Architecture for Automation and Accurate Analysis
Increasing Efficiency
Efficiency is often overlooked by organizations, because abundant hardware resources appear to cope with the ever-increasing volume of data sources.
Machine learning technology can be a powerful ally in the quest for better risk management. Traditional software applications predict creditworthiness based on static information from loan applications and financial reports, whereas machine learning technology can enhance accuracy by analyzing the applicant’s financial status as it is modified by current market trends and relevant news items.
However, many organizations still rely on manual processes and have not adopted efficient automated processes powered by machine learning. By taking on a substantial share of the account-monitoring burden, machine learning systems can change existing work patterns and shift the focus of investment managers to tasks where their involvement adds more value, such as client services.
Implementing Real-time Predictive Analysis
An example of such efficiency gains is leveraging real-time predictive analysis on huge amounts of data to detect coordinated activity by rogue investors across multiple accounts. This is a simple task for machine learning, yet a capability that is nearly impossible for investment managers to deliver manually, especially in real time.
To understand this further, let’s consider the four levels of analytics that organizations strive to achieve:
- Descriptive Analytics: What’s happening in my business? Provide comprehensive, accurate and live data analysis with effective visualization, enabling human intelligence to understand different business aspects.
- Diagnostic Analytics: Why is it happening? Enable drilling-down to the root cause, isolation of all confounding information and understanding why something is happening.
- Predictive Analytics: What’s likely to happen? Use historical patterns, algorithms and technology to predict specific outcomes and automate decisions for business strategies that have remained fairly consistent over time.
- Prescriptive Analytics: What do I need to do? Apply advanced analytical techniques to make specific recommendations on what actions and strategies are needed, based on champion/challenger testing of strategy outcomes.
Figure 5: Complexity vs. value in the four levels of analytics that organizations strive to achieve
How to Build a Live Risk Result Store
We’ve seen that the mitigation of operational risk is one of the most critical concerns in financial organizations. Today’s reconciliation models are labor and resource intensive, resulting in processes that just take too long.
Batch processing is often performed overnight, leaving the organization without crucial information until the next business day. In more catastrophic cases, the inability to scale infrastructure to meet growing business needs can result in failure to meet regulatory requirements, or even revenue loss.
Live Operational Intelligence
A multi-tiered live risk result store for live operational intelligence – detecting anomalies, delivering traceability, real-time underwriting and validating financial data – can provide crucial intelligence for better business operations, enhanced customer service and regulatory compliance.
This involves creating a policy that defines which data is no longer needed for real-time operational workflows and transfers it to data lakes and warehouses such as Hadoop, S3, Blob Storage and Snowflake, while keeping that data completely available for workloads that do not require real-time performance, such as historical or end-of-year simulations.
The following architecture presents GigaSpaces’ InsightEdge powering a “Live Risk Result Store” by operationalizing the archive/batch tier.
Figure 6: Architecture outline for multi-tiered live risk result store
This architecture delivers:
- An intelligent multi-tier approach which automatically leverages RAM, SSD and Hadoop for data distribution according to defined business policies and frequency of access.
- Distributed microservices that ingest, process and analyze all data models and run distributed calculations in master-worker patterns (a simplified sketch of this pattern follows the list).
- A composable map-reduce paradigm across tens of thousands of nodes (each node is a running process).
- A single platform, multi-tier approach with weeks of data in memory; months of data on SSD; years of data on persistent memory; and everything else on data lake or cloud storage.
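The following simplified, single-process sketch illustrates the master-worker pattern mentioned above; class and field names are illustrative, and in the platform the workers would be distributed grid nodes rather than local threads:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class NpvMasterWorker {

    // Illustrative trade holder: a book identifier and a pre-calculated NPV.
    static final class Trade {
        final String book;
        final double npv;
        Trade(String book, double npv) { this.book = book; this.npv = npv; }
    }

    // Master: partitions trades by book, fans the aggregation out to workers, merges the results.
    static Map<String, Double> npvByBook(List<Trade> trades) throws Exception {
        Map<String, List<Trade>> books = trades.stream()
                .collect(Collectors.groupingBy(t -> t.book));

        ExecutorService workers = Executors.newFixedThreadPool(4);
        try {
            // Workers: each book is aggregated independently and in parallel.
            Map<String, Future<Double>> partials = new HashMap<>();
            books.forEach((book, bookTrades) -> partials.put(book,
                    workers.submit(() -> bookTrades.stream().mapToDouble(t -> t.npv).sum())));

            // Master: collect and merge the partial results.
            Map<String, Double> result = new HashMap<>();
            for (Map.Entry<String, Future<Double>> e : partials.entrySet()) {
                result.put(e.getKey(), e.getValue().get()); // blocks until the worker finishes
            }
            return result;
        } finally {
            workers.shutdown();
        }
    }
}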
Scaling and Accessing Historical Data
Real-time risk management involves many aspects, from scaling to accessing historical data. VaR can be executed in many ways. Some algorithms require a small set of data but run thousands of simultaneous scenarios; others, such as historical simulations, require access to historical data which is not usually stored in a fast access tier. Different scenarios are often run on different solution stacks.
Many organizations also have limitations on the scope (complexity or time range) of each query, depending on the types of caching, databases and data lakes storing the data. This results in a tightly-coupled environment that limits the business. Furthermore, VaR is not additive: the VaR of a portfolio containing assets A and B does not equal the sum of the asset A VaR and the asset B VaR, because the tail loss of the combined portfolio generally differs from the sum of the individual tail losses. When this is combined with intraday pre-aggregations, running a “what-if” scenario in real-time with current solution stacks – as shown in the following example – is not possible.
/* What If I sell GOOG */
$ select clientAccount, valueAtRisk(arraySum(pnls), 99.0)
  from positions
  where riskFactor <> 'GOOG'
  group by clientAccount

/* Change confidence level */
$ select clientAccount, valueAtRisk(arraySum(pnls), 95.0)
  from positions
  group by clientAccount
On the other hand, such “what-if” scenarios are possible using the proposed multi-tiered live risk result store architecture.
Solution for Lambda Architecture
Lambda architecture is used in systems requiring both real-time analytics and big data volumes. The idea is to store data in both a speed layer (using a traditional fast database or IMDG) and a batch layer such as HDFS or S3. Data is streamed into both layers, with the speed layer holding a sliding window of the recent data, e.g., last 24 hours or last 7 days. Users are then responsible for querying the appropriate layer according to the business needs, and in some cases have to query both layers and merge the query results.
Figure 7: Typical Lambda architecture
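To illustrate the burden this dual-layer setup places on the user, the following sketch queries a speed-layer store and a batch-layer store separately over JDBC and merges the results by hand; the connection URLs, table and column names are purely illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class LambdaMergeQuery {

    // Sums trade notionals per book from one layer and merges them into the running totals.
    static void accumulate(String jdbcUrl, String sql, Map<String, Double> totals) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                totals.merge(rs.getString("book"), rs.getDouble("total"), Double::sum);
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        String sql = "SELECT book, SUM(notional) AS total FROM trades GROUP BY book";
        Map<String, Double> totals = new HashMap<>();

        // Speed layer: the recent sliding window (e.g. last 24 hours) in the fast store (illustrative URL).
        accumulate("jdbc:speedlayer://fast-store:1234", sql, totals);
        // Batch layer: the full history in the data lake, e.g. Hive over HDFS/S3 (illustrative URL).
        accumulate("jdbc:hive2://hive-server:10000/", sql, totals);

        // The user is responsible for de-duplication, time-window alignment and merging.
        totals.forEach((book, total) -> System.out.println(book + " -> " + total));
    }
}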
Standard multi-tiered architectures involving polyglot storage and caching stacks present many technical challenges, including:
- Lifecycle management of the speed/batch layers.
- Query complexity (a lot of work for users).
- Multiple code bases for multiple products.
- High availability (HA) of each component.
- HA of the entire workflow.
In contrast, the proposed GigaSpaces architecture workflow is simple and requires very little custom development: after data is ingested from the selected message broker (which is itself optional, since data can be ingested directly into the platform), the data tiering lifecycle is managed automatically from within the GigaSpaces platform – both downstream and upstream – via a unified API.
Figure 8: GigaSpaces architecture workflow
More information on this process can be found in the AnalyticsXtreme: Data Lake Accelerator ebook, which describes how the data journey is simplified for faster and smarter insight to action.
For example, the following configuration can be used to automatically offload pre-2018 stock data (which is no longer required for the operational workflow) to HDFS over Hive:
lambda.policy.trade.class=policies.ThresholdArchivePolicy
lambda.policy.trade.table=model.v1.StockData
lambda.policy.trade.threshold-column=stock_date
lambda.policy.trade.threshold-value=2018-01-01
lambda.policy.trade.threshold-date-format=yyyy-MM-dd
lambda.policy.trade.batch-data-source.class=com.gigaspaces.lambda.JdbcBatchDataSource
lambda.policy.trade.batch-data-source.url=jdbc:hive2://hive-server:10000/;ssl=false
Despite the archiving, the data remains completely available for workloads such as historical simulations or end-of-year simulations. Furthermore, a dynamic property such as “the last 90 days” can be used as a typical business policy.
After the data is loaded into InsightEdge and Hadoop, a range of queries can be executed, starting with a simple SELECT – as illustrated below.
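For illustration, queries of the following general form exercise the three cases shown below, depending on where their date predicate falls relative to the 2018-01-01 threshold in the policy above (SQL shown as Java string constants; the non-threshold date is an arbitrary example):

public class TieredQueries {
    // Newer than the threshold: served entirely from the fast data tier (RAM/SSD).
    static final String SPEED_ONLY =
            "SELECT * FROM model.v1.StockData WHERE stock_date >= '2018-01-01'";

    // Older than the threshold: served from the archive tier (Hive/HDFS).
    static final String ARCHIVE_ONLY =
            "SELECT * FROM model.v1.StockData WHERE stock_date < '2018-01-01'";

    // Spans the threshold: both tiers are accessed, but a single unified result is returned.
    static final String BOTH_TIERS =
            "SELECT * FROM model.v1.StockData WHERE stock_date >= '2017-06-01'";
}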
In the first example, the queried data is more recent than the threshold defined in the above policy; it is therefore located on RAM and SSD and is fetched in 37 milliseconds from the fast data tier, without any need to access the archive tier.
Figure 9: Running a query that accesses only the speed layer (data on both RAM and SSD) in 37 milliseconds
In the second example, the queried data is older than the threshold defined in the policy; it is therefore located on the archive tier and is seamlessly fetched from there in 1545 milliseconds, without any need to access the fast data tier.
Figure 10: Running the same query on data from before the date defined in the policy in 1545 milliseconds
In the third example, the queried data both predates and postdates the policy threshold; it is therefore located on both tiers and is seamlessly fetched in 1530 milliseconds from the speed layer and the archive tier. Only one query is needed to receive a unified result.
Figure 11: Unified result received when running the query on data that is split between the batch and speed tiers
The significance of running complex queries via a unified API is clear:
- Queries are simplified and an order of magnitude faster on hot, warm and cold data.
- Data selection, retrieval and other computations are effortless, without prior planning concerning the data location or the syntax needed to retrieve it.
- Historical queries run easily, irrespective of whether the data is stored in RAM, on SSD or in an archive tier (Hadoop, S3, Azure Blob Storage, etc.).
- Stock predictions based on historical values – such as a Monte Carlo simulation – are easy to run (see the sketch below).
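A basic sketch of such a Monte Carlo stock-price simulation follows; it uses geometric Brownian motion with placeholder drift and volatility values, which in practice would be estimated from the historical prices held in the archive tier:

import java.util.Random;

public class StockMonteCarlo {

    // Simulates one price path of the given length using geometric Brownian motion.
    static double[] simulatePath(double lastPrice, double dailyDrift, double dailyVol, int days, Random rnd) {
        double[] path = new double[days + 1];
        path[0] = lastPrice;
        for (int t = 1; t <= days; t++) {
            double shock = rnd.nextGaussian();
            path[t] = path[t - 1] * Math.exp(dailyDrift - 0.5 * dailyVol * dailyVol + dailyVol * shock);
        }
        return path;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // Placeholder parameters; in practice drift and volatility would be estimated
        // from historical returns stored in the archive tier.
        double lastPrice = 100.0, dailyDrift = 0.0002, dailyVol = 0.015;
        int days = 100, paths = 10_000;

        double sumFinal = 0;
        for (int i = 0; i < paths; i++) {
            sumFinal += simulatePath(lastPrice, dailyDrift, dailyVol, days, rnd)[days];
        }
        System.out.printf("Expected price after %d days: %.2f%n", days, sumFinal / paths);
    }
}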
Figure 12: Stock price prediction simulation for the next 100 days
The following figure depicts how data from multiple sources – such as trades, positions, market data and derivatives (Forex, Interest Rate, Equity) – is streamed through Kafka and ETL tools to InsightEdge. Using defined business policies, this data is intelligently tiered between RAM, persistent memory, SSD and external data storage technologies such as Hadoop and cloud (using AnalyticsXtreme). From the instant the data is ingested into InsightEdge, it is immediately available for continuous and on-demand queries, interactive reports and simulations.
Figure 13: Streaming of data from multiple sources through Kafka and ETL tools to InsightEdge
Summary
GigaSpaces unified platform provides high-throughput data ingestion, computation, analytics, and retrieval. Integrating years of advanced in-memory computing know-how, the software platform transitions data from specialized silos to a converged data platform that collects not only transactions but also clickstreams, service logs, geo-locations, social and more, shortening timelines to operationalization and production.
Figure 14: GigaSpaces unified platform
Leveraging and simplifying the Lambda architecture, the unified software platform delivers a range of benefits:
- Convergence of both analytical and transactional processing.
- Easy deployment, monitoring and management.
- Elimination of multiple data silos serving different business aspects.
- Shortening the gap from ingestion to actionable insights.
- Building of a “Live Risk Result Store” by operationalizing the archive/batch tier using GigaSpaces AnalyticsXtreme.
- Integration of a multi-tier approach using RAM, SSD and archived storage tier for data distribution according to business needs and frequency of access.
- Intelligent data lifecycle management between RAM, SSD and Hadoop/cloud storage (for both downstream and upstream).
- Creation of a distributed microservices platform that serves as the core multi-model store to run distributed calculations in master-worker patterns and composable map-reduce paradigms across tens of thousands of nodes (e.g. running processes).
- Rich, out-of-the-box features such as data storage replication and disk persistence.
- Out-of-the-box, active-active multi-data-center replication & DR.