Experts estimate that as much as 90 percent of the data that is generated daily is unstructured data. This includes everything from documents to text files, videos, audio files, social media, presentations, blog posts, photos and more.

It’s a challenge to extract information from this daunting volume of digital data so companies can realize business value. Technologies such as BI, AI, Machine Learning, RPA, NLP and NLU try to make it more accessible, understandable and actionable. A key first step is to identify highly precise subsets of documents with relevant context and content for use in training data, analysis and automation.

Up to 80% of the effort behind these systems is spent on this time-consuming process of finding, cleaning and reorganizing huge amounts of messy data. The systems require accurate input to ensure they don’t use documents about “viral Tweets” if their focus is scientific research about “viral infections” like COVID-19.

But mining unstructured text data isn’t simple. To work effectively, a solution must be fine-tuned to meet your specific needs and address the quirks in your textual content.

DataScava addresses these types of issues and others by helping you to use your own business and domain language to mine your unstructured text data. It’s for Data Professionals, Subject Matter Experts, Business Users and Software Engineers. You don’t have to be a Data Scientist to use it.


4 Problems DataScava Addresses


DataScava is a data-agnostic, self-service tool that works as an alternative or adjunct to NLP and NLU to help you curate, search, filter, tag, match and route your raw unstructured text data. There are four main problems we address.

Garbage In, Garbage Out

Data-driven systems require high-quality data to succeed. Bad data produces bad output. Preparing high-quality data is an enormous effort and requires extensive input from your subject matter experts. Our solution enables seamless collaboration between non-technical and technical people, speeding up the process and providing a rapid path to efficiency.


NLP and NLU use fuzzy logic to interpret language and simplify it to a form that the average person would understand. The business language and jargon used in your organization confuse this process, and therefore its output, especially when common words are used to mean something very different. The financial industry is one example, with specialized lingo such as “yours, mine, put and call.”


Today’s solutions strive to be explainable, auditable and transparent yet many still do a lot of processing in a black box “under the hood.” This can leave users thinking “I don’t really know why my system did what it did.” DataScava uses a white-box approach that you can see, control, and measure, displaying color-coded highlights and scores for all defined topics found in your content.

Hard Coding

With many solutions, end users cannot easily tweak the output after implementation. Instead, the whole process needs to be repeated. SMEs have to once again curate the input, evaluate the output, train data scientists in their business, have programmers reprogram the system (which may end up in the hands of your competitors), and then evaluate the output again. This takes an exceptionally long time, but you cannot have an expertly trained system without constantly improving it.


DataScava is designed for constant improvement and allows the whole iterative process to happen as a matter of course. Because of its accuracy and transparency, tweaking the model becomes simple and obvious.


DSLP/WTS as an Adjunct/Alternative to BI, AI, ML, RPA, NLP, NLU

NLP and NLU analyze words, then phrases, sentences, and, if you are lucky, entire paragraphs. It’s a bottom-up approach that processes information you hope leads to relevant output.

DataScava does not interpret input data, disambiguate natural language, or infer what you’re looking for. It measures and finds what you are looking for, and it shows a graphical and numeric representation of it, akin to an oscilloscope in electronics. This informs the data discovery process by providing corpus-wide measurements of each topic of interest.

Armed with these precise measurements, users gain great insight into their unstructured textual data based on personalized criteria they control.

Business applications for today’s data-driven systems are designed to make recommendations, route data to a defined destination, and/or trigger a process. Each of these tasks depends on a set of predefined triggers that the system has been trained to react to. DataScava can accurately route and trigger while providing insight into the process.

Some automated systems require a high degree of precision and auditable results. Being right “most of the time” is not enough for them, because mistakes can be disastrous. And while they can handle complex inputs quickly, one challenge we have seen is that they are given “multi-intent” input but react only to a single primary intent.

Our solution supports “not” conditions, which provide the complexity that multi-intent applications require (e.g., “a” implies “b,” unless “c” or “d” is present, in which case it means “e” if “f” is not present, in which case it means “g” . . . ).
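To make the multi-intent example concrete, here is a minimal Python sketch of that one rule. It is only an illustration of nested "not" conditions, not DataScava's implementation: the function name `resolve_intent` is hypothetical, and it assumes one reading of the rule, namely that "e" applies when "f" is absent and "g" applies when "f" is present.

```python
from typing import Optional, Set


def resolve_intent(topics: Set[str]) -> Optional[str]:
    """Return the intent for a document, given the set of topics found in it.

    Hypothetical helper encoding one reading of the rule:
    "a" implies "b", unless "c" or "d" is present, in which case
    it means "e" if "f" is not present, and "g" if "f" is present.
    """
    if "a" not in topics:
        return None  # the rule only applies when "a" was found
    if "c" in topics or "d" in topics:
        # Exception branch: the outcome depends on the absence of "f".
        return "e" if "f" not in topics else "g"
    return "b"  # default meaning of "a"


if __name__ == "__main__":
    for found in ({"a"}, {"a", "c"}, {"a", "d", "f"}, {"x"}):
        print(sorted(found), "->", resolve_intent(found))
```

Writing the exceptions as explicit branches keeps every outcome traceable to the condition that produced it, which is the kind of auditability the paragraph above calls for.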




Domain-Specific Language Processing and Weighted Topic Scoring

DataScava helps bridge those gaps by ensuring that input is relevant, thus improving the quality of results while reducing the risk of inappropriate analysis and badly informed decisions. All the while, it increases your business and data teams’ efficiency.

Domain-Specific Language Processing (DSLP) and patented Weighted Topic Scoring (WTS) leverage users’ subject matter expertise and their business and domain language to extract highly precise information and present it in context.

Because they make unstructured text data more accessible, more understandable and, above all, more useful, they are a powerful alternative or adjunct to NLP, NLU, Semantic and Boolean Search.
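The general idea of scoring documents against weighted, user-defined topics can be sketched in a few lines of Python. This is not the patented WTS method, whose details are not described here. It is a simplified stand-in that sums weight times occurrence count, and the topic names, phrases and weights are all made up for illustration.

```python
import re
from typing import Dict

# Hypothetical topic definitions: each topic maps domain phrases to weights.
TOPICS: Dict[str, Dict[str, float]] = {
    "viral_infection": {"viral infection": 3.0, "covid-19": 2.0, "virus": 1.0},
    "social_media": {"viral tweet": 3.0, "retweet": 1.0, "followers": 1.0},
}


def count_phrase(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of phrase that start at a word boundary.

    Only the start is anchored, so simple suffixes such as plurals also match.
    """
    pattern = r"(?<!\w)" + re.escape(phrase)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def score_document(text: str) -> Dict[str, float]:
    """Return one score per topic: the sum of weight * occurrence count."""
    return {
        topic: sum(weight * count_phrase(text, phrase)
                   for phrase, weight in phrases.items())
        for topic, phrases in TOPICS.items()
    }


if __name__ == "__main__":
    doc = ("The study tracked viral infection rates during COVID-19. "
           "The virus spread quickly.")
    print(score_document(doc))
```

Because the phrases and weights are plain data, a subject matter expert could adjust them and immediately see how the scores change, without retraining anything.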


Why We Don’t Use Natural Language Processing or Semantics

NLP and Semantics are powerful and versatile technologies, but proponents with real-world experience will acknowledge that NLP is hard. In general, NLP is useful for relatively simple tasks, such as automating a phone attendant’s call routing or a chatbot’s responses.

AI can take it a step further, but it certainly doesn’t summarize an entire document, and it doesn’t provide any means to compare, measure and filter documents so that users can view and adjust the output in normal use.

In addition, business use cases for AI solutions applied to large bodies of textual data largely focus on routing documents to a particular destination and/or taking some action based on their contents (initiate a process, set an alert, send an email). This requires summarizing the document overall.

DataScava works at the document level, summarizing textual content in a usable, numerical form for routing purposes or to trigger an action using a process that is adjustable by users.
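As a rough illustration of routing on a numerical, document-level summary, the sketch below sends a document to the first destination whose topic threshold it meets. The topic names, thresholds and destination names are hypothetical, and the per-topic scores are assumed to have been computed already.

```python
from typing import Dict, List, Tuple

# Hypothetical routing table: (topic, minimum score, destination), checked in order.
# Users can adjust the thresholds or reorder the rules without touching the logic.
ROUTES: List[Tuple[str, float, str]] = [
    ("compliance", 5.0, "legal-review-queue"),
    ("trading", 3.0, "trading-desk-inbox"),
]
DEFAULT_DESTINATION = "manual-triage"


def route(scores: Dict[str, float]) -> str:
    """Return the destination of the first rule whose threshold the document meets."""
    for topic, minimum, destination in ROUTES:
        if scores.get(topic, 0.0) >= minimum:
            return destination
    return DEFAULT_DESTINATION  # nothing matched strongly enough


if __name__ == "__main__":
    print(route({"compliance": 6.5, "trading": 4.0}))
    print(route({"trading": 3.0}))
    print(route({"trading": 1.0}))
```

Keeping the routing rules in a table rather than in code is one way to make the process adjustable by users, as described above.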



Navigational vs. Research Search

DataScava excels at navigational search, in which users want to “navigate” to the most relevant documents overall, using scored topics, keywords, and phrases that are in context, visible and under your control.

In his paper on Semantic Search, Ramanathan V. Guha, responsible for products such as Google Custom Search, distinguished between two very different kinds of searches, navigational and research, as follows:

  • “Navigational Search: the user is using the search engine as a navigation tool to navigate to a particular intended document. In this class of searches, the user provides the search engine a phrase or combination of words which s/he expects to find in the documents. There is no straightforward, reasonable interpretation of these words as denoting a concept. In such cases, the user is using the search engine as a navigation tool to navigate to a particular intended document. We are not interested in this class of searches.
  • Research Search: the user provides the search engine with a phrase that is intended to denote an object about which the user is trying to gather/research information. There is no particular document which the user knows about and is trying to get to. Rather, the user is trying to locate a number of documents which together will provide the desired information. Semantic search lends itself well with this approach that is closely related with exploratory search.”

The GitHub article “What is Semantic Search?” references Guha’s work and discusses why Semantic Search is not well-suited to navigational search.



Open Architecture and Highly Customizable Platform

DataScava’s open architecture makes it simple to connect and share data via SQL or the REST API in event-driven models. Whether used on its own or integrated with your existing business applications, the platform is highly customizable.
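To give a feel for what sharing scored results over SQL could look like, here is a small Python sketch using an in-memory SQLite database. The table name `document_scores`, its columns and the sample rows are assumptions made for illustration only. They are not DataScava's actual schema or API, which this article does not describe.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical table of topic scores."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document_scores (doc_id TEXT, topic TEXT, score REAL)"
    )
    conn.executemany(
        "INSERT INTO document_scores VALUES (?, ?, ?)",
        [
            ("doc-1", "viral_infection", 6.0),
            ("doc-2", "social_media", 5.0),
            ("doc-3", "viral_infection", 4.5),
            ("doc-4", "viral_infection", 1.0),
        ],
    )
    conn.commit()
    return conn


def top_documents(conn: sqlite3.Connection, topic: str,
                  min_score: float) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs for a topic at or above min_score, best first."""
    rows = conn.execute(
        "SELECT doc_id, score FROM document_scores "
        "WHERE topic = ? AND score >= ? ORDER BY score DESC, doc_id",
        (topic, min_score),
    )
    return [(doc_id, score) for doc_id, score in rows]


if __name__ == "__main__":
    connection = build_demo_db()
    print(top_documents(connection, "viral_infection", 4.0))
```

A downstream application could run a query like this on a schedule or in response to an event, then act only on the documents that clear the threshold.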

From the start, data sources you designate are accumulated in the data store, to be indexed and re-indexed as you adjust and improve the model you use.

Users identify the principal free-form inputs for the system—such as business reports, reference data, surveys, news, journals or research papers—and also the desired outputs of data for use in other platforms.