Real-time mining of unstructured textual content isn’t simple. To work effectively, a solution must be fine-tuned to meet your organization’s specific needs and address the quirks in your company’s information.

To add value, applications require a vocabulary that accurately captures the definitions, context, and nuance of your business and the way it uses language. To work properly, data-driven systems require a tremendous amount of standardized, labeled and otherwise “structured” data.

DataScava does not use numerical information as part of its process. Our scope is solely unstructured textual data, most of which is natural language.


DataScava Measures Your Unstructured Text Data

AI, Machine Learning and RPA systems are created to automate the decision process. The assumption is that you can train a program to think like a human thinks in a particular field or domain.

The big advantage is that a machine can work 24/7, can be replicated thousands of times, and processes incoming information in milliseconds.

DataScava is not an AI, Machine Learning or RPA Solution. It does not interpret input data — it measures it.

Armed with these precise measurements, users gain great insight into their unstructured textual data based on personalized criteria they control.


DataScava as an Adjunct to AI, Machine Learning and RPA

There are four main problems we address:

  1. Garbage In, Garbage Out
    To train an AI system, you need to prepare thousands of pages of input, and that input needs to be curated. If a company does not put time into this process, the result will be vastly inferior. This is an enormous effort, and it requires in-depth business expertise (i.e., your senior people). DataScava, on the other hand, works on messy data and provides a unique and powerful tool to speed up the curation process.
  2. Accuracy
    Natural language engines use fuzzy logic and are programmed to interpret language and simplify it to a form the average person would understand. Domain jargon, especially when common words are used to mean something different in your business (the Financial Industry is especially prone to this: yours, mine, put, call . . . ), confuses this process and, thereby, the output.
  3. Visibility
    AI systems do a lot of processing “under the hood” and leave users thinking, “I don’t really know why it did what it did.” DataScava highlights the text that contributed to the measurement of that document’s content.
  4. Hard Coding
    The end user really cannot tweak the AI engine after it is implemented. Your senior people have to curate the input, evaluate the output, train data scientists in their business (expertise which ends up in the hands of your competitors), reprogram the system, and then your senior people must evaluate it again. This takes an exceptionally long time, yet you cannot have an expertly trained system without constantly improving it.

DataScava is likewise designed for constant improvement, but it allows the whole iterative process to happen as a matter of course. Because of the accuracy and visibility mentioned above, tweaking the model becomes simple and obvious.


DataScava as an Alternative to AI, Machine Learning and RPA

AI analyzes words, then phrases, sentences, and if you are lucky, entire paragraphs. It is a bottom-up approach that processes information you hope leads to relevant output.

DataScava measures data. It produces a numeric representation of your unstructured data, akin to an oscilloscope reading in electronics, and this informs the data discovery process by providing corpus-wide measurements of each topic of interest.

We would argue that most business applications of AI perform two tasks: they route data to a defined destination and/or trigger a process.

Furthermore, those tasks rely on a set of predefined triggers the system has been trained to react to. DataScava can route and trigger as well, and it does so with greater accuracy and insight into the process.

Some automated business applications require a high degree of precision and must provide auditable results. Robotic Process Automation and financial industry applications cannot be exactly right “most of the time.” Mistakes can be disastrous.

There is a perception that AI can handle complex inputs quickly. In fact, one limitation we have found in AI systems is that, when provided with “multi-intent” input, they react to only a single primary intent.

DataScava supports “not” logic, which provides any required complexity for multi-intent applications (e.g., “a” implies “b,” unless “c” or “d” is present, in which case it means “e” if “f” is not present, in which case it means “g” . . . ).
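This kind of conditional, exception-laden rule can be sketched as ordinary code. The function and term names below are illustrative assumptions for clarity only, not DataScava’s actual implementation:

```python
# Illustrative sketch of "not"-style multi-intent rules, NOT DataScava's
# actual implementation: a term's meaning depends on which other terms
# are present or absent in the text.

def interpret(terms: set) -> str:
    """Resolve 'a' using presence/absence rules like those described above."""
    if "a" not in terms:
        return "no match"
    if "c" in terms or "d" in terms:
        # The exception terms change the meaning of 'a', and a further
        # "not" condition on 'f' refines it again.
        return "g" if "f" in terms else "e"
    return "b"

print(interpret({"a"}))            # -> b
print(interpret({"a", "c"}))       # -> e
print(interpret({"a", "d", "f"}))  # -> g
```

Because each condition is explicit, the rule set stays visible and adjustable by the user, in contrast to intent resolution buried inside a trained model.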


Domain-Specific Language Processing and Weighted Topic Scoring

DataScava helps bridge those gaps by ensuring that input is relevant, thus improving the quality of results while reducing the risk of inappropriate analysis and badly informed decisions. All the while, it increases your business and data teams’ efficiency.

Domain-Specific Language Processing (DSLP) and patented Weighted Topic Scoring (WTS) leverage users’ subject matter expertise and business and domain language to extract highly precise information and present it in context.
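The general idea of weighted topic scoring can be illustrated with a minimal sketch. The topics, term lists, weights, and scoring formula here are assumptions invented for illustration; DataScava’s patented method is not shown:

```python
# Hypothetical sketch of weighted topic scoring: count occurrences of
# user-defined domain terms per topic and combine them with user-set
# weights. Topics, terms, and the formula are illustrative only.

import re

TOPICS = {
    "equities":     {"terms": ["put", "call", "option"], "weight": 2.0},
    "fixed_income": {"terms": ["bond", "yield", "coupon"], "weight": 1.0},
}

def score_document(text: str) -> dict:
    """Return a per-topic numeric score for one document."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {}
    for topic, spec in TOPICS.items():
        hits = sum(words.count(term) for term in spec["terms"])
        scores[topic] = hits * spec["weight"]
    return scores

doc = "The desk sold a put and a call; bond yield fell."
print(score_document(doc))  # -> {'equities': 4.0, 'fixed_income': 2.0}
```

Because the terms and weights are plain data that users control, the measurement is transparent: every score can be traced back to the exact words that produced it.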

Because they make unstructured text data more accessible, more understandable and, above all, more useful, they are a powerful alternative or adjunct to NLP, NLU, Semantic and Boolean Search.


Why We Don’t Use Natural Language Processing or Semantics

Although NLP and Semantics are powerful and versatile technologies, proponents with real-world experience will acknowledge that NLP is hard. In general, NLP is useful for relatively simple tasks, such as automating a phone attendant’s call routing or a chatbot’s responses.

AI can take it a step further, but it certainly doesn’t summarize an entire document or provide any means to compare, measure and filter so that users can view and adjust the output in normal use.

In addition, business use cases for AI solutions for large bodies of textual data largely focus on routing documents to a particular destination and/or taking some action based on their contents (initiate a process, set an alert, send an email).  This requires summarizing the document overall.

DataScava works at the document level, summarizing textual content in a usable, numerical form for routing purposes or to trigger an action using a process that is adjustable by users.
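Routing and triggering on such a numeric summary might look like the following sketch; the thresholds, topic names, and actions are invented for illustration and are not DataScava’s actual behavior:

```python
# Illustrative routing: send a document to a destination and/or trigger
# an action based on its per-topic scores. Thresholds, topic names, and
# actions are hypothetical.

def route(scores: dict, threshold: float = 3.0) -> list:
    """Decide destination and side actions from per-topic scores."""
    best_topic = max(scores, key=scores.get)
    actions = []
    if scores[best_topic] >= threshold:
        actions.append(f"route to {best_topic} queue")
    else:
        # No topic scored strongly enough for automatic routing.
        actions.append("route to manual review")
    if scores.get("compliance", 0) > 0:
        actions.append("trigger compliance alert")
    return actions

print(route({"equities": 4.0, "fixed_income": 2.0, "compliance": 0.0}))
# -> ['route to equities queue']
```

Since routing decisions reduce to threshold comparisons on visible scores, the results are auditable: one can always state which measurement sent a document where.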


Navigational vs. Research Search

DataScava excels at navigational search – when users want to “navigate” to the most relevant documents overall – using scored topics, keywords, and phrases that are in context, visible and under your control.

In his paper on Semantic Search, Ramanathan V. Guha, responsible for products such as Google Custom Search, distinguished between two very different kinds of searches, navigational and research, as follows:

  • Navigational Search:  the user provides the search engine a phrase or combination of words which s/he expects to find in the documents. There is no straightforward, reasonable interpretation of these words as denoting a concept. In such cases, the user is using the search engine as a navigation tool to navigate to a particular intended document. We are not interested in this class of searches.
  • Research Search:  the user provides the search engine with a phrase that is intended to denote an object about which the user is trying to gather/research information. There is no particular document which the user knows about and is trying to get to. Rather, the user is trying to locate a number of documents which together will provide the desired information. Semantic search lends itself well to this approach, which is closely related to exploratory search.

The GitHub article “What is Semantic Search?” references Guha’s work and discusses why Semantic Search is not well-suited to navigational search.


Open Architecture and Highly Customizable Platform

DataScava’s open architecture makes it simple to connect and share data via SQL or the REST API in event-driven models. Whether used on its own or integrated with your existing business applications, the platform is highly customizable.
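An event-driven integration over a REST API could be sketched as below. The endpoint URL, payload fields, and header are hypothetical placeholders, not DataScava’s documented API:

```python
# Hypothetical sketch of pushing a document to a REST endpoint in an
# event-driven flow. The URL, payload shape, and headers are
# placeholders, not DataScava's documented API.

import json
import urllib.request

def build_index_request(doc_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one document."""
    payload = json.dumps({"id": doc_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url="https://example.internal/api/documents",  # placeholder URL
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_index_request("doc-42", "Quarterly risk report")
print(req.method, req.get_header("Content-type"))  # POST application/json
```

In an event-driven model, a request like this would be fired whenever a designated data source delivers a new document, with the response feeding the data store for indexing.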

From the start, data sources you designate are accumulated in the data store, to be indexed and re-indexed as you adjust and improve the model you use.

Users identify the principal free-form inputs for the system—such as business reports, reference data, surveys, news, journals or research papers—as well as the desired data outputs for use in other platforms.