
How Leveraging Industry-Specific Text Data for ML Can Improve Business Operations

It’s no secret that many businesses generate text in the course of normal operations. For example, a hospital system may generate hundreds or thousands of reports a day, and a school could accumulate a large collection of text over a period of years. Organizations that harness this text data using machine learning can make a positive impact on business operations.

Using Text Data & NLP

Businesses can use text data to understand consumer sentiment. Those that use sentiment analysis to inform their operations can increase overall customer satisfaction, improve customer lifetime value, and reduce churn.

The process of finding insights in text data is known as Natural Language Processing, commonly abbreviated NLP. The first goal of using NLP for machine learning is to transform text data, which is understandable to humans, into values that are understandable by machine learning algorithms. The transformed data can be used for machine learning directly or be combined with other data types like tabular numerical data or images prior to making predictions with machine learning.
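As a rough illustration of this transformation, the sketch below turns short texts into count vectors (a simple bag-of-words representation, chosen here only for clarity) and appends them to tabular values. The sample reports, the tabular columns, and the function names are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(document, vocabulary):
    """Return a list of token counts, one entry per vocabulary column."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical free-text reports.
reports = ["patient reports mild pain", "patient reports no pain"]
vocabulary = build_vocabulary(reports)
text_features = [vectorize(report, vocabulary) for report in reports]

# Hypothetical tabular columns for the same records (age, length of stay).
tabular_features = [[54, 2], [61, 5]]

# One combined numeric row per record, ready for a machine learning algorithm.
combined = [text + table for text, table in zip(text_features, tabular_features)]
```

Each row in `combined` holds the word counts followed by the tabular values, so a single model can learn from both kinds of data.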

Embeddings: transforming text into numeric representations

While there are many ways to transform text into numeric representations, one of the most powerful, well-understood, and widely used techniques is embeddings. The widely used approach to forming embeddings from words and documents dates to 2013 and was pioneered by a Google team led by Tomáš Mikolov. More recent research has produced more complex methods that use contextual and subword information, but they do not change the underlying idea of representing text numerically for machine learning.

Text embeddings are primarily used because they capture not only the words but also the relationships between them. More advanced embeddings may capture word context as well. Because these methods learn from raw text, embeddings can be pre-trained without labeled data and reused in a variety of applications. These models tend to be both large and very general. For example, one available model from Google contains vectors for 3 million words and phrases, trained on a Google News dataset of about 100 billion words. If the text being analyzed consists primarily of these words and phrases, the pre-trained model can transform the data with little effort, and the resulting embeddings can be used directly in machine learning.
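To make the idea of "relationships between words" concrete, here is a minimal sketch that compares word vectors with cosine similarity and averages them into a document vector. The three-dimensional vectors are invented for illustration; real pre-trained embeddings typically have hundreds of dimensions and would be loaded from a model file rather than typed in.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embed_document(text, embeddings):
    """Average the vectors of the known words; return None if no word is known."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dimensions = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dimensions)]

# Invented toy vectors, standing in for a pre-trained embedding table.
toy_embeddings = {
    "doctor": [0.9, 0.8, 0.1],
    "nurse": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

related = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["nurse"])
unrelated = cosine_similarity(toy_embeddings["doctor"], toy_embeddings["invoice"])
document_vector = embed_document("the doctor and nurse", toy_embeddings)
```

In this toy table, "doctor" and "nurse" point in nearly the same direction while "invoice" does not, so `related` is much larger than `unrelated`. Averaging word vectors is only one simple way to represent a whole document.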

Using Cloud-Based NLP APIs

Richer text embeddings are the basis for advanced cloud-based NLP APIs, including Google Natural Language, Amazon Comprehend, and Azure Cognitive Services. These cloud services offer a range of features, including sentiment analysis, named entity recognition, and dependency parsing. The services can also be connected to machine learning tools in cloud ecosystems, such as Google’s AI Platform or Amazon SageMaker. The pre-trained models behind these services are likely to generalize across many industries.

The Challenge of Specialized Text

However, highly specialized text, like that found in medicine, is not likely to be well represented in the training sets used to build these embeddings. Models trained on Wikipedia or news articles are not likely to cover medical or business language that is important in those settings but rarely used outside them. For example, news articles rarely reference P.R.N. (from the Latin pro re nata, meaning “as needed”) or MQL (Marketing Qualified Lead), which are commonly used in medicine and business, respectively. While pre-trained embeddings may still work in these cases, more specialized models are likely to be more appropriate.
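One quick way to gauge whether a general-purpose model fits a specialized corpus is to measure how many tokens it actually recognizes. The sketch below assumes the pre-trained vocabulary is available as a plain set of lowercase words; the vocabulary, the sample note, and the function names are all hypothetical.

```python
def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    stripped = (tok.strip(",;:!?()") for tok in text.lower().split())
    return [tok for tok in stripped if tok]

def vocabulary_coverage(text, known_vocabulary):
    """Return (fraction of tokens found in the vocabulary, sorted missing tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0, []
    covered = sum(1 for tok in tokens if tok in known_vocabulary)
    missing = sorted({tok for tok in tokens if tok not in known_vocabulary})
    return covered / len(tokens), missing

# Hypothetical stand-in for the vocabulary of a general pre-trained model.
general_vocabulary = {"take", "one", "tablet", "for", "pain"}

# Hypothetical clinical note containing a domain abbreviation.
note = "Take one tablet p.r.n. for pain"
coverage, missing = vocabulary_coverage(note, general_vocabulary)
```

Here the abbreviation is the only token the toy vocabulary misses. On a real corpus, a low coverage score or a long list of missing domain terms would suggest that a specialized or custom model is worth considering.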

To fill this gap in the important area of medicine, Amazon has released Amazon Comprehend Medical. This service can perform many functions that are unique to the medical domain, such as identifying medication dosage and route of administration, using models that are optimized for medical text. Another provider of industry-specific text analytics is John Snow Labs. Their Spark NLP technology includes premium models and datasets for NLP geared towards biomedical applications. In addition, popular open-source models include SciBERT and scispaCy, both from the Allen Institute for Artificial Intelligence. These latter models refine earlier pre-trained models with scientific-domain data that was not well represented in them.

It is also possible to fit custom models using cloud-based technologies from the major providers mentioned above. For on-premises work, open-source tools are available to train models on new data. These custom models can be refinements of existing models or can be trained from scratch. In all cases, custom models require industry-specific datasets supplied by the user and the in-house knowledge to train or fine-tune and then deploy the models. The decision to use a pre-trained model, refine an existing one, or fit a new one from scratch depends heavily on the specific application and the available data. Using a custom model is an effective way of leveraging industry-specific text in machine learning applications, allowing insights to be extracted directly from the text.

Proven results

John Snow Labs and Amazon Comprehend Medical list multiple clients, including Roche, Johnson & Johnson, PwC, and Fred Hutch. PwC reports that their clients using Amazon Comprehend Medical have significantly faster throughput in identifying medically relevant events. Another specialized text-analytics provider, focused on customer-generated text data, is Clarabridge. Their client KitchenAid reported an 85% increase in social media engagement and a 90% increase in social media fan base using Clarabridge technology.

As these examples illustrate, there is an outstanding opportunity to use industry-specific text data and machine learning to drive insights that make a business impact. Companies that embrace text data and ML can increase efficiency, grow engagement, and improve customer lifetime value.