What is natural language processing?
Natural Language Processing (NLP), a branch of AI, aims primarily to narrow the gap between the capabilities of a human and a machine. Using artificial intelligence and machine learning techniques, NLP translates languages such as English on the fly into commands that computers can understand and process.
Understanding human language requires understanding both words and the concepts that link them together to create meaning. As a result, NLP spans multiple applications, such as automatic text summarization, topic extraction, entity recognition, part-of-speech tagging, and sentiment analysis.
What is Spark NLP?
Spark NLP is an open-source library, started just over two years ago, with the goal of providing state-of-the-art NLP to the open-source community. It offers libraries and full APIs in Python, Java, and Scala. It evolved from the growth of deep learning in NLP and the optimization of Apache Spark, which lets Spark-based libraries run one or two orders of magnitude faster on the same hardware.
Furthermore, it’s built directly on Spark ML, meaning that a Spark NLP pipeline is the same class as a Spark ML pipeline, so NLP stages can be combined with other Spark ML stages in a single pipeline.
Here are 5 Great Examples of Natural Language Processing Using Spark NLP
1. Sentiment Analysis
A year ago, using the Spark NLP open-source library required a much deeper understanding of Spark and even TensorFlow. However, since a set of pre-trained pipelines was delivered at the start of 2019, it’s possible to import the library, start it just like a Spark session in the backend, and load a ready-made pipeline in a few lines.
For sentiment analysis, a pre-trained English sentiment analysis pipeline is loaded. When pipeline.annotate is called on ‘Harry Potter is a great movie’, a regular Python dictionary is returned, and printing its sentiment entry gives [‘positive’].
The result object has an entry for each stage in the NLP pipeline that has been completed, and this is all that is required to perform sentiment analysis with a pre-trained pipeline.
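The steps described above can be sketched in Python as follows. This is a minimal sketch rather than the original listing: the pipeline name "analyze_sentiment" and the "sentiment" result key are assumptions based on the description, and the helper function names are hypothetical. The Spark NLP import is deferred into the loader so that the lookup helper can be exercised without a Spark installation.

```python
def load_sentiment_pipeline():
    """Start Spark NLP and download a pre-trained English sentiment pipeline.

    Assumption: the pipeline is published under the name "analyze_sentiment".
    Requires the spark-nlp package and network access.
    """
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()  # starts a Spark session in the backend
    return PretrainedPipeline("analyze_sentiment", lang="en")


def sentiment_labels(pipeline, text):
    """Annotate one string and return the entry of the sentiment stage."""
    result = pipeline.annotate(text)  # a regular Python dictionary
    # Assumption: the sentiment stage writes its output under "sentiment".
    return result["sentiment"]


# Usage (needs Spark NLP installed and the pipeline downloadable):
#   pipeline = load_sentiment_pipeline()
#   print(sentiment_labels(pipeline, "Harry Potter is a great movie"))
```

Keeping the annotate call behind a small helper also makes it easy to swap in another pre-trained pipeline later, since only the pipeline name and the result key change.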
2. Entity Recognition
Remember the TV show “Friends”? Well, consider looking for Chandler and Monica meeting in Central Perk. The NLP algorithm has to know that Chandler and Monica are people and that Central Perk is a location, without using a dictionary. This requires recognizing that Central Perk is a fictional location and that Chandler is a person, and not the City of Chandler in Arizona (because he is meeting with someone). Furthermore, in English “meeting in” can really only be followed by a location, so a dictionary is not required.
Applying deep learning-based named entity recognition with BERT embeddings follows the same pattern as sentiment analysis. Loading the pre-trained pipeline loads the pre-trained models and the BERT embeddings, setting up everything in the backend.
When pipeline.annotate is called on ‘Harry Potter is a great movie’, the result shows that the first two tokens, Harry and Potter, are part of a person. Again, this is basically all the code required to enable the recognition of people, places, and organizations.
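To make the per-token result easier to read, the tags can be merged into entity spans. The sketch below is illustrative only: it assumes the annotate result exposes parallel lists of tokens and IOB-style tags (for example under keys such as "token" and "ner"), and the tag values shown are hand-written to match the description above, not captured pipeline output. The function name group_entities is hypothetical.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that carry the same entity type into spans.

    Assumes IOB-style tags such as "B-PER", "I-PER", and "O".
    Returns a list of (text, entity_type) tuples.
    """
    entities = []
    current_words, current_type = [], None
    for token, tag in zip(tokens, tags):
        label = None if tag == "O" else tag.partition("-")[2]
        starts_new = label is not None and (tag.startswith("B-") or label != current_type)
        # Close the open span when we leave an entity or a new one begins.
        if (label is None or starts_new) and current_words:
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if label is not None:
            current_words.append(token)
            current_type = label
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-written tags for illustration (assumed format, not real pipeline output).
tokens = ["Harry", "Potter", "is", "a", "great", "movie"]
tags = ["I-PER", "I-PER", "O", "O", "O", "O"]
print(group_entities(tokens, tags))
```

The same helper handles the “Friends” sentence: given person tags on Chandler and Monica and location tags on Central and Perk, it yields three spans.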
3. Spell Checking and Correction
Another popular feature of the Spark NLP library is spell checking and correction, which can be used from Scala as well (in general, the Scala and Python APIs are identical). This involves loading the pre-trained, machine learning-based spell-checking pipeline for English.
If the words “great” and “movie” are misspelled in the text passed to the pipeline’s annotate call for “Harry Potter is a great movie”, the tokens are returned corrected when querying the spelling column of the result. Spell checking algorithms apply different statistical techniques to guess the most likely intended word and automatically correct misspellings, and there are 18 pre-trained libraries with pre-trained pipelines.
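One such statistical technique can be sketched in a few lines. The example below is a simplified frequency-based corrector in the style of Peter Norvig's well-known spelling corrector, not the model Spark NLP ships: it generates every string one edit away from the input and keeps the known word with the highest count. The word counts are invented toy values.

```python
import string
from collections import Counter

# Toy word counts (invented for illustration); a real corrector would
# estimate these from a large corpus.
WORD_COUNTS = Counter({
    "harry": 20, "potter": 20, "is": 500, "a": 900,
    "great": 50, "grate": 2, "movie": 40,
})


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])


def correct_tokens(tokens):
    """Correct each lowercase token independently."""
    return [correct(token) for token in tokens]


print(correct_tokens(["harry", "potter", "is", "a", "graet", "moive"]))
```

Note that “graet” is one transposition away from both “great” and “grate”; the counts decide which is more likely, which is the sense in which the correction is statistical.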
4. Automated Image Preprocessing
Spark NLP also comes with a distributed OCR engine, which includes some additional features in the pipeline that substantially improve accuracy. First, it has three algorithms for automated image preprocessing (rotation, scaling, and erosion), which prepare the image for the actual extraction of the strings.
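Of the three preprocessing steps, erosion is probably the least familiar. The sketch below shows plain binary erosion with a 3x3 neighbourhood on a small grid, in pure Python. It only illustrates the general technique under simple assumptions (1 marks foreground, pixels outside the image count as background) and is not Spark NLP's OCR implementation.

```python
def erode(image):
    """Binary erosion with a 3x3 square neighbourhood.

    A pixel stays 1 only if it and all eight neighbours are 1.
    Pixels outside the image are treated as background (0).
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            keep = 1
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if not inside or image[ny][nx] == 0:
                        keep = 0
            out[y][x] = keep
    return out


# A 3x3 blob plus one isolated speck of noise in the last column.
speckled = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = erode(speckled)
for row in cleaned:
    print(row)
```

Erosion removes the isolated speck entirely and shrinks the blob to its centre pixel, which is why it is useful for cleaning scan noise before character recognition.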
5. Layout Detection
The Spark NLP OCR implementation enables the detection of layouts that typify certain documents such as invoices or reports intended for human consumption. Detection of the layout of the different sections is a good alternative to extracting sentences or taking an entire line as an important feature.
Accuracy is also significantly affected by what’s done with the text after the image has been converted to a set of strings. What matters most is how the text is broken into sentences. Just consider legal documents, particularly scanned legal documents, which contain page breaks, headers and footers, lists, and enumerations, all of which interfere with correct sentence splitting.
In NLP, incorrect sentence segmentation prevents everything else from working, because tokenization and part-of-speech tagging will fail. A dependency tree of the sentence cannot be built, and entity recognition becomes far less accurate.
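A small example shows why layout residue matters. In the sketch below, a sentence is interrupted by a page number and a repeated header; removing that residue before splitting restores the sentence. The cleanup rules and the regular-expression splitter are deliberately naive assumptions for illustration and are not how Spark NLP detects sentences; the sample text is invented.

```python
import re


def clean_ocr_lines(lines, boilerplate):
    """Drop blank lines, known headers/footers, and bare page numbers,
    then join the remaining hard-wrapped lines into one string."""
    kept = []
    for line in lines:
        text = line.strip()
        if not text or text in boilerplate:
            continue
        if re.fullmatch(r"(Page\s+)?\d+", text):
            continue
        kept.append(text)
    return " ".join(kept)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Invented sample of OCR output spanning a page break.
ocr_lines = [
    "ACME Corp - Confidential",
    "The tenant shall pay rent on the first",
    "Page 1",
    "ACME Corp - Confidential",
    "day of each month. Late fees apply.",
]
text = clean_ocr_lines(ocr_lines, boilerplate={"ACME Corp - Confidential"})
print(split_sentences(text))
```

Treating each raw line as a sentence would instead yield five fragments, two of them headers, and every downstream stage would inherit that error.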
Spark NLP has been trained on OCR documents to take such common problems, as well as commonly formatted business documents, into account. Furthermore, it includes an OCR-specific spell correction model. This is a similar algorithm to regular spell correction, but with a model trained to correct OCR-specific spelling mistakes, such as mistaking the digit 1 for the letter “l” (a spelling mistake that humans very rarely make when writing).
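The idea behind OCR-specific correction can be illustrated with a hand-written confusion table. The sketch below tries every substitution of visually similar characters and accepts a candidate only when exactly one of them is a known word. The confusion table and vocabulary are made-up assumptions; the real model learns such confusions from data rather than from a fixed table.

```python
from itertools import product

# Characters an OCR engine commonly confuses (illustrative, not exhaustive).
CONFUSIONS = {
    "1": "1lI", "l": "l1",
    "0": "0O", "O": "O0",
    "5": "5S", "S": "S5",
}


def ocr_candidates(token):
    """All strings obtained by swapping confusable characters in token."""
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    return {"".join(chars) for chars in product(*options)}


def correct_ocr_token(token, vocabulary):
    """Replace token only if exactly one confusable variant is a known word."""
    if token in vocabulary:
        return token
    matches = ocr_candidates(token) & vocabulary
    return next(iter(matches)) if len(matches) == 1 else token


vocabulary = {"claim", "loan", "invoice"}
print([correct_ocr_token(t, vocabulary) for t in ["c1aim", "1oan", "2019"]])
```

Restricting candidates to known confusions keeps genuine numbers such as “2019” untouched, which a general-purpose spell checker might not guarantee.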
Sentiment analysis, text classification, and text similarity analysis of unstructured text sources such as emails, social media, or news feed posts can give businesses key insights into what’s behind customer decisions and behavior.
It’s a great tool that is already being leveraged for different types of business use cases. Retail companies use it to analyze reviews of their products; financial companies use it to analyze news feeds, understand market trends, and support trading; and airlines use it to analyze Facebook and Twitter feeds and posts in order to understand customer complaints and requests. OCR can be leveraged for RPA initiatives that optimize the handling of loan requests, insurance claims, and invoices.
GigaSpaces’ InsightEdge In-Memory Platform offers an optimal solution for delivering ML model training and inferencing at the speed and scale needed to leverage ML in ongoing business processes, such as the contact center, as discussed in this post.
Learn more about how to gain value from NLP at the speed and scale of your business by viewing our webinar.
This post was co-written by David Talby and Yoav Einav. David is CTO at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Yoav is VP Product at GigaSpaces. He drives product management, technology vision, and go-to-market activities for GigaSpaces. Prior to joining GigaSpaces, Yoav filled various leading product management roles at Iguazio and Qwilt, mapping the product strategy and roadmap while providing technical leadership regarding architecture and implementation.