The global Natural Language Processing (NLP) market is expected to grow from USD 10.2 billion in 2019 to USD 26.4 billion by 2024, at a Compound Annual Growth Rate (CAGR) of 21.0% during the forecast period, according to the market research report "Natural Language Processing Market." Meanwhile, Gartner predicts that by 2020, 50% of analytical queries will be generated via search, voice, or NLP.
As NLP gains traction, queries no longer have to be programmed into an analytics tool. This makes data analysis more approachable and enables new classes of users, such as office workers and support staff, to take advantage of analytics software.
What is Natural Language Processing?
Natural Language Processing is a branch of artificial intelligence dealing with the interaction between humans and computers using natural language. The ultimate aim of NLP is to read, understand, and derive meaning from human language in a valuable way. Most NLP techniques rely on machine learning to extract meaning from human languages.
NLP is an interdisciplinary field of computer science and linguistics, i.e., the interaction between computers and speech/text of human natural languages. It focuses on the ability of computers to understand human language, and already powers many things that we take for granted in our daily computer interactions, such as personal assistants (think Siri, Cortana, and Google Assistant); auto-complete functionalities in search engines (think Google and Bing); spellchecking in browsers, smartphones, messaging, IDEs and desktop apps; and machine translation (think Google Translate).
Natural Language Processing has two major use cases: understanding human speech and extracting its meaning, and unlocking unstructured data in documents and databases by extracting relevant values and making this information available for decision support and analytics.
What is Natural Language Processing based on?
In order to understand human language, you must understand not only the words but the concepts that connect them to create meaning. For that reason, NLP requires applying algorithms to recognize and extract the rules of natural language so that raw language data is transformed into a machine-understandable form. These techniques underpin applications such as sentiment analysis, topic extraction, automatic text summarization, entity recognition, and part-of-speech tagging.
Sentiment analysis determines the emotional tone behind words, in order to gain an understanding of the attitudes, opinions, and emotions expressed in an online mention. Also known as opinion mining – deriving the opinion or attitude of a speaker – it’s the automated process of classifying opinions as positive or negative.
For example, during the final episode of "Game of Thrones," sentiment analysis was run on Twitter feeds to rate the various characters in the episode. This involved counting the tweets about each character and using sentiment analysis to determine whether each tweet's context was positive, negative, or neutral. It probably won't come as a surprise that Jon and Daenerys were the top two in the rating.
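To make the idea concrete, here is a minimal sketch of such a tally in plain Python, with an invented word-list scorer standing in for a real sentiment model and made-up tweets:

```python
from collections import Counter

# Hypothetical stand-in for a real sentiment model: a tiny word-list scorer.
POSITIVE = {"great", "love", "amazing", "hero"}
NEGATIVE = {"terrible", "hate", "awful", "betrayal"}

def classify(tweet: str) -> str:
    words = set(tweet.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def rate_characters(tweets):
    """tweets: iterable of (character, text) pairs -> per-character sentiment counts."""
    ratings = {}
    for character, text in tweets:
        ratings.setdefault(character, Counter())[classify(text)] += 1
    return ratings

# Invented sample tweets for illustration only.
tweets = [
    ("Jon", "Jon was a hero tonight, love him"),
    ("Daenerys", "That betrayal was terrible to watch"),
    ("Jon", "great episode for Jon"),
]
print(rate_characters(tweets))
```

A production system would replace `classify` with a trained model, but the aggregation logic is the same.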
In the example below, a pre-trained sentiment analysis pipeline is loaded after starting a Spark session. When pipeline.annotate is called on 'Harry Potter is a great movie', a regular Python dictionary is returned, and the sentiment result prints ['positive'].
The result object has an entry for each stage in the NLP pipeline that has been completed, and this is all that is required to perform sentiment analysis with a pre-trained pipeline.
Figure 2: Example of easily importing a Spark NLP library for sentiment analysis using Spark framework
What is Spark NLP and How it Can Power Smarter Decisions from Training to Production
The availability of NLP technology is growing. Libraries such as TensorFlow, scikit-learn, Keras, PyTorch, and Spark NLP provide or integrate NLP capabilities and are used in the financial services, insurance, healthcare, life science, retail, and technology sectors.
Spark NLP is an open-source natural language processing library built on top of Apache Spark and Spark ML. It offers an API that integrates easily with ML Pipelines and is commercially supported by John Snow Labs. Because the library is built directly on Spark ML, a Spark NLP pipeline is the same class as a Spark ML pipeline, which offers several advantages:
- Unification of NLP and ML pipelines into one pipeline that can be distributed across a cluster.
- End-to-end execution planning by Spark for everything from loading data and running NLP algorithms to performing feature engineering, training machine learning models, and running inference.
- Use of some of the Spark ML functionality.
The Spark NLP library covers the functionality of earlier libraries like NLTK, spaCy, and CoreNLP, including the ability to split sentences, tokenize, stem, find parts of speech, recognize entities, build a dependency tree between the words in a sentence, match dates in text, and check spelling, becoming the first Scala library to offer such scope. This enables distributed computation and good use of clusters.
Since all pipeline steps are performed as part of the Spark pipeline, no other libraries are required to build an entire pipeline within Spark NLP and take advantage of the scaling and the performance benefits.
The Spark NLP library has full APIs in Java, Scala, and Python. Under the hood, its deep learning components use TensorFlow, enabling the use of GPUs to train models and removing the need to rely on pre-trained pipelines or even pre-trained models.
The machine learning networks, pipelines, and architectures can be customized independently and can run on a distributed system – a cloud-based Spark deployment or a basic Hadoop deployment connecting to YARN or HDFS.
NLP Use Case: Automated Image Preprocessing for Layout Detection
You can use OCR to detect layouts that typify certain documents, such as invoices or reports intended for human consumption. Detecting the layout of the different sections is a good alternative to extracting sentences or taking an entire line as an important feature. This can greatly assist RPA (Robotic Process Automation) initiatives that deal with invoices, insurance claims, and loan requests.
Figure 3: Health use-case of how OCR is used with Spark NLP
GigaSpaces InsightEdge built-in Spark Distribution
Our In-Memory Platform provides a way to deliver ML model training and inferencing quickly and at scale, which is needed in order to properly leverage ML in business operations.
The bundling of an NLP technology such as Spark NLP with InsightEdge delivers an end-to-end solution that covers speed, accuracy, and scalability, from training and research to the production pipeline and execution of the model in a production system. Furthermore, it enables ongoing, continuous model retraining and deployment, ensuring that the model fits the characteristics of the production environment and is able to adjust.
Machine and deep learning are achieved through an open API of Spark ML, streaming, SQL, and GraphX. Rich deep learning support includes numeric computing via tensors and the loading of pre-trained Caffe or Torch models, while various NLP, OCR, text classification and similarity, sentiment analysis, and other libraries can be easily integrated using multiple programming languages such as Java, Scala, Python, and R.
A Spark distribution is built in and packaged with InsightEdge, and any type of Spark cluster – such as Cloudera, Amazon EMR, or Azure HDInsight – runs on the same resources as the high-performance data store.
Figure 4: InsightEdge Built-In Spark Distribution
Consequently, it’s possible to run the built-in Spark libraries such as Spark SQL, Spark streaming, ML and GraphX, as well as the Spark NLP open-source project by John Snow Labs. This includes various out-of-the-box pre-trained NLP algorithms, as well as OCR, text classification, text similarity, sentiment analysis, and others for implementation in the production pipeline.
This integration of the Spark in-memory solution with the InsightEdge in-memory multi-model data store as its core delivers the ability to:
- Increase Speed: InsightEdge minimizes data movement to practically zero, since the actual Spark models run in the same memory space as the actual data residing on the distributed system. Advanced indexing also provides 30x faster read performance, along with optimizations based on filtering and aggregations.
- Increase Resiliency: Spark storage persistence is based on a filesystem (HDFS) or cloud blob storage (Amazon S3 and Azure ADLS), which means that every time you restart a cluster or a Spark executor, the data is lost and must be reloaded. With InsightEdge's high availability, there is no need to reload the data and recovery is instantaneous.
- Optimize TCO: Spark usually runs on immutable data (append-only), stored as copies/snapshots that significantly increase the storage footprint. InsightEdge lets you run analytics and ML directly on mutable data (objects, documents, key-value pairs).
Each partition has at least one Spark executor for every Spark job, running the model or business logic co-located within the same memory space, which provides the extreme performance required for real-time use cases. For example, much of the speed optimization when running Spark on top of InsightEdge comes from the ability to run a predicate pushdown or aggregation pushdown, i.e., push the filtering or the aggregation down to the distributed data store instead of loading all the data into Spark, thereby increasing resource efficiency and speed.
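The benefit of pushdown can be illustrated with a toy in-memory store in plain Python (hypothetical classes, not the InsightEdge API):

```python
# Toy illustration of predicate pushdown; classes and data are invented.
class DataStore:
    """Stands in for the distributed in-memory store holding the records."""
    def __init__(self, rows):
        self.rows = rows

    def scan(self):                      # no pushdown: ship every row to the engine
        return list(self.rows)

    def scan_filtered(self, predicate):  # pushdown: filter where the data lives
        return [r for r in self.rows if predicate(r)]

rows = [{"ticket": i, "status": "open" if i % 10 == 0 else "closed"}
        for i in range(1000)]
store = DataStore(rows)

# Without pushdown the engine receives 1,000 rows and filters them itself;
# with pushdown only the 100 matching rows ever leave the store.
no_pushdown = [r for r in store.scan() if r["status"] == "open"]
with_pushdown = store.scan_filtered(lambda r: r["status"] == "open")
assert no_pushdown == with_pushdown
print(len(with_pushdown))  # 100
```

Both paths return the same result; the difference is how much data crosses the boundary between the store and the compute engine.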
Use case: The Insight-Driven Call Center
The following is an example of how InsightEdge and NLP bring value to enterprises.
A call center at a large European financial institution sought to:
- Improve customer experience with automatic call routing that avoids cumbersome IVRs.
- Improve customer experience with quicker first-call resolution.
- Reduce costs and optimize operations by reducing average handling times.
This required:
- High performance: ingestion of millions of CRM cases and data from other repositories into a unified analytics platform.
- Millisecond latency: real-time application of machine learning models to meet customer demands for immediate response times.
- Continuous machine learning training: provision of smart insights that constantly adapt to changing conditions.
InsightEdge integration with Intel’s BigDL – a deep learning framework that works with Apache Spark – simplifies and accelerates AI innovation. The combined solution forms an enhanced insight platform offering a distributed deep learning framework that empowers insight-driven organizations and delivers important key benefits, including:
- Cost savings: BigDL eliminates the need for dense specialized hardware for deep learning by utilizing a low-cost compute infrastructure based on Intel Xeon Scalable processors that can train and run large-scale deep learning workloads without relying on GPUs.
- Simplicity: By nature, deep learning scenarios are complex and require advanced training workflows. InsightEdge's simplified analytics stack, leveraging BigDL and Apache Spark, eliminates cluster and component sprawl, radically minimizes the number of moving parts, and capitalizes on existing Spark competency.
- Scalability: The integration allows organizations to innovate on text mining, image recognition, and advanced predictive analytics workflows from a handful of machines to thousands of nodes, in the cloud or on-premises, using the same application assets and deployment lifecycle.
In the call center, InsightEdge deep learning delivers intelligent routing, utilizing NLP (model training, prediction, and tuning) to automatically route each call to the right agent for an optimal, personalized experience.
This involves the following sequence:
- The caller speaks through a web interface.
- The browser converts the caller's speech to text and sends it to the controller.
- The controller writes the data to a Kafka topic.
- A Spark job listens on the Kafka topic and creates a prediction using the text classification model.
- The BigDL prediction is sent to InsightEdge.
- The InsightEdge event processor listens for prediction data and routes the call session to the appropriate agent.
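The sequence above can be sketched with hypothetical stand-ins, where a plain queue plays the role of the Kafka topic and simple keyword rules stand in for the BigDL classifier:

```python
import queue

# Hypothetical stand-ins for the components in the routing sequence;
# a queue.Queue plays the role of the Kafka topic.
kafka_topic = queue.Queue()

# Assumed topic-to-agent mapping, for illustration only.
AGENT_BY_TOPIC = {"mortgage": "agent-loans", "card": "agent-cards"}

def controller(transcribed_text):
    """The controller writes the transcribed speech to the topic."""
    kafka_topic.put(transcribed_text)

def classify(text):
    """Stand-in for the BigDL text classification model (keyword rules here)."""
    return "mortgage" if "mortgage" in text.lower() else "card"

def route_next_call():
    """Consume from the topic, predict, and route to an agent."""
    text = kafka_topic.get()
    prediction = classify(text)
    return AGENT_BY_TOPIC[prediction]

controller("I have a question about my mortgage payment")
print(route_next_call())  # agent-loans
```

In production, the classifier is a trained model and the routing decision is handled by the InsightEdge event processor, but the message flow is the same.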
This is based on the following architecture:
Figure 5: Intelligent Routing Architecture
The NLP/InsightEdge integration also delivers call center agent assistance by automatically providing agents with articles and knowledge documents based on the actual conversation with the customer.
This involves the following sequence:
- The caller is routed to the appropriate agent.
- A case is automatically created in the CRM.
- The event triggers a Spark job.
- The Spark job runs text similarity against the CRM cases.
- The five most similar cases are presented to the agent.
- The agent follows the suggested action.
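As an illustration of the text-similarity step, here is a toy bag-of-words cosine ranking in plain Python; the CRM case texts are invented, and the real system runs this as a Spark job at scale:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(new_case, crm_cases, k=5):
    """Return the k CRM cases most similar to the new case."""
    query = vectorize(new_case)
    scored = [(cosine(query, vectorize(c)), c) for c in crm_cases]
    return [c for score, c in sorted(scored, key=lambda p: p[0], reverse=True)[:k]]

# Invented CRM case texts for illustration.
crm_cases = [
    "card blocked after foreign transaction",
    "mortgage rate question",
    "lost card replacement request",
    "card blocked wrong pin",
    "account statement missing",
    "mobile app login failure",
]
print(most_similar("my card was blocked", crm_cases))
```

A production deployment would use richer features (TF-IDF or embeddings) and distributed scoring, but the ranking principle is the same.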
Below is a view of the contact center agent’s application which displays the 5 most similar CRM tickets based on the text-similarity score.
Figure 6: Sample Web Application
The flow involves ingesting data from the CRM into InsightEdge, which then trains and builds the model based on similarities to other data and models. On average, within about 50 milliseconds, the five most similar cases are presented to the agent. This also results in an overall operational efficiency improvement, leading to 10x faster resolution time.
- Time to search and find similar cases: ~50 milliseconds
- Immediate agent response time: Mean time to resolution 5-10 times faster.
- Continuous retraining of machine learning model: 27 minutes for background training of 2 million records, ensuring that the model is up-to-date with the CRM tickets every 30 minutes.
- Operational efficiency: Agents are able to handle at least 5 times more calls every shift.
Operationalizing machine learning models is challenging. Speed, scale and accuracy are essential for model training, the move to production and feeding the ML pipeline.
Learn more about how to gain value from NLP at the speed and scale of your business by viewing our webinar or reading more here.