Our unstructured data miner leverages your business and domain-specific language to pinpoint the high-value text data you need for applications in AI, LLMs, ML, RPA, BI, Talent, Research, BAU, and more. In “DataScava: How It Pinpoints Unstructured Text Data Using Your Business Language,” we explain how it helps people unlock the full potential of their data by defining the abstract topics and themes that represent their own business and subject matter expertise and applying both to big data sets in real time, all while keeping the Human in Command in the era of AI and advanced analytics.

Articles by Scott Spangler

DataScava commissioned a series of six articles from Scott Spangler, former IBM Watson Health Researcher, Chief Data Scientist, and author of the book “Mining the Talk: Unlocking the Business Value in Unstructured Information.” Scott discusses how and why DataScava’s patented, precise approach to mining unstructured text data perfectly complements real-world big data applications in AI, LLMs, ML, RPA, BI, Research, Talent, and BAU. He also contrasts our Tailored Topics Taxonomies, Domain-Specific Language Processing, and Weighted Topic Scoring methodologies with standard approaches such as NLP.

“Machines in the Conversation: The Case for a More Data-Centric AI,” published in CDO Magazine, excerpted from a longer article with product information.

“Executive Q&A: DataScava, AI and ML”

“The Key Ingredients for Game-Changing Business Intelligence (BI) from Unstructured Textual Data”

“Consistent High-Quality Robot Process Automation (RPA) Requires Deep Customer Understanding”

“Who’s in Charge of Your Business: The Humans or the Machines?”

Scott’s first article discusses:

  • The pitfalls of using a fully automated approach to critical decision-making.
  • The desirability of having a parallel human-machine partnership that regulates and monitors the inputs and outputs of automated approaches.
  • The three basic ingredients that are needed to make that hybrid process successful and how DataScava implements each of these components.

Here’s an excerpt:

“Algorithms will be more effective in the long run if they are part of a more holistic framework that includes user-controlled domain-specific ontologies, statistical analysis, and rule-based reasoning strategies. These are the basic ingredients that a tool like DataScava provides.”

DataScava . . .

“Is a robot ally in humanity’s struggle for control of how we utilize big data to make decisions. By providing tools for defining the key underlying topics and rules that govern important concepts of the business needs, it evens the playing field so that machine learning no longer has to have the final say on critical business decisions.

Can supervise the process based on human-provided expertise and determine which data to use for training and which to avoid, as well as in which situations to trust deep learning decisions and when to fall back on more rule-based approaches. Such processes put the humans back in charge and allow the machines to serve their intended role as adjuncts and trusted advisors.

In partnership with a trained human mind – can act effectively as a tool for giving the left brain an equal say in big data decision-making tasks.

Can play a leading role in helping businesses manage and maintain their big data more efficiently using information ontologies, statistics with visualization and rule-based approaches.

Perfectly complements existing approaches to unlocking the value of unstructured text data – by helping companies to model higher-level intents and purposes behind the labeling and classification of data – by defining the abstract topics and themes that represent their own business and subject matter expertise – and by applying both to big data sets real-time.

Provides a practical, easy-to-use tool-set for defining the critical business ontologies that provide the critical bridge between unstructured text data analysis using standard data science techniques and the human expertise that gives your business its competitive edge.

When a deep learning system and DataScava agree on a classification, that’s ideal because then we now have a plausible explanation for why the deep learning algorithm decided the way it did.

Can help data professionals and business people use machine and human intelligence together to make their messy unstructured text data more accessible, understandable and actionable.”

“Mine Unstructured Text Data Using Your Business Language and Domain Expertise with DataScava”

Experts estimate that as much as 90 percent of the data that is generated daily is unstructured data. This includes everything from documents to text files, emails, videos, audio files, social media, presentations, blog posts, photos, and more.

It’s a challenge to extract information from this daunting volume of digital data so companies can realize business value. Technologies such as AI, Machine Learning, RPA, NLP, and NLU try to make it more accessible, understandable, and actionable. A key first step is to identify highly precise subsets of documents with relevant context and content for use in training data, analysis, and automation.

Up to 80% of a data team’s effort goes into this time-consuming process of finding, cleaning, and reorganizing huge amounts of messy data, because these systems require accurate input: a project focused on scientific research about “viral infections” like COVID-19 should not be fed documents about “viral Tweets.”

But mining unstructured text data isn’t simple. To work effectively, a solution must be fine-tuned to meet your specific needs and address the quirks in your textual content.

DataScava addresses these types of issues and others by helping experts in Data, Business, and Software use their own business language and domain expertise to mine unstructured text data.

Read the full article

“7 Ways Mining Unstructured Text Data with DataScava is Different”

It’s always been true in computing that garbage in is garbage out.

This certainly is the case when it comes to working with unstructured text data, which by 2022 will make up more than 90 percent of the world’s electronic information, delivered in business reports, research papers, emails, and other formats – all written in different styles and using terms whose meaning differs from sector to sector.

It’s become an axiom that data scientists spend 80 percent of their time dealing with data preparation problems — leaving only 20 percent for algorithm development, model training and tuning, and machine learning.


Use Your Business Language and Subject Matter Expertise

Today, the technology exists to use your business language and subject matter expertise to quickly and accurately mine raw text so you can unlock its value.

It’s a patented self-service system that curates, searches, filters and routes raw unstructured text data to make it more accessible, understandable and actionable.

A tool that uses your company’s domain-specific language and topics, searches and filters tuned to your business and always in your control.

And works as an alternative or adjunct to NLP/NLU.

It’s called DataScava.

Read the full article

“How DataScava Mines Unstructured Text Data Using Your Business Language”

To work properly, data-driven systems require a tremendous amount of standardized, labeled, and otherwise “structured” data. Yet by 2022, more than 90% of the world’s electronic information will be unstructured, all of it written in different styles and using terms whose meaning differs from sector to sector.

DataScava curates, searches, filters or routes messy unstructured text data to make it more accessible, understandable and actionable. It uses your business language and domain-specific topics we work with you to define, measure and prioritize. And you don’t need to be a Data Scientist to use it.

It’s an automated self-service SME-driven system for Business Users, Data Professionals and Software Engineers that keeps the human in command. Ease of use and transparency enable collaboration between nontechnical and technical people and provide a rapid path to efficiency.

Read the full article

“Domain-Specific Language Processing Mines Value from Unstructured Data”

Check out our guest post about how our Domain-Specific Language Processing (DSLP) and patented Weighted Topic Scoring (WTS) can be your alternative or an adjunct to NLP to mine your business data in real-time, published by KDnuggets, a leading site on AI, big data, data mining, data science and machine learning.

Here’s an excerpt:

“Processing unstructured text data in real-time is challenging when applying NLP or NLU. Find out how Domain-Specific Language Processing can be your alternative or an adjunct to NLP to mine valuable information from data by following your guidance and using the language of your business.

Real-time mining of unstructured textual content isn’t simple. To work effectively, a solution must be fine-tuned to meet your organization’s specific needs and address the quirks in your company’s information. To add value, applications require a vocabulary that accurately captures the definitions, context, and nuance of your business and the way it uses language.

Consequently, unstructured textual data needs to be organized before it can be put to use. That’s a huge challenge. For data-driven systems based on artificial intelligence, machine learning or other advanced applications, natural language processing and natural language understanding are supposed to be the solution, either by themselves or as hybrid models that plug in industry terms or use complex Boolean logic. But none of these are easy to use or implement, and without extensive programming and training, they often process data incorrectly.

There is, however, a simpler approach to addressing these challenges, one that does not require NLP or NLU. It’s called “Domain-Specific Language Processing,” or DSLP, and it uses the language of your business to mine unstructured textual data.”


Read the full article

“How DataScava’s Intelligent Approach Adds Value to Unstructured Data”

Check out this article written by our CTO John Harney about how DataScava mines unstructured textual data using our Domain-Specific Language Processing and patented Weighted Topic Scoring:

Real-time mining of unstructured textual content isn’t simple. Available solutions don’t work well unless they’re fine-tuned to meet your specific needs and address the unique quirks in your company’s information. To truly add value, your applications, whether AI-driven or not, require a vocabulary that captures the definitions, context and nuance of your business.

DataScava’s unstructured data miner provides fast and effective solutions for leveraging the explosive growth of unstructured data. The sheer volume of information—even some small businesses must manage thousands of pages each day—challenges companies to find the most efficient and accurate way to read data, index it, extract what they need and quickly route it to its proper destination.

Conventional wisdom says that taking full advantage of information requires a “data-driven system” based on “artificial intelligence,” “machine learning” or some other application. But without modification, these technologies often process information quickly but incorrectly. Needless to say, machines that use inaccurate data for machine learning, AI insights or business decisions, or that pass data to the wrong destination, don’t do anyone much good.

That’s where DataScava comes in.

Power from Unstructured Data 

DataScava’s proprietary search engine extracts actionable, highly precise and industry-specific information from the content you already have. Our technology is data-agnostic and provides a true competitive advantage that maximizes data’s transformative power.

Simple to use and transparent, DataScava is accessible to a wide range of technical and non-technical users. It encourages cross-discipline collaboration and facilitates a faster path to improved efficiency and productivity. It also automates time-consuming data preparation, positively impacting your bottom line.

Leveraging their own subject matter expertise, users can easily set DataScava’s filters. Those filters then work around the clock, continually increasing the system’s capabilities and providing measurable benefits. Our solution can stand on its own, integrate with other systems or curate the data needed for AI and machine learning projects. However you employ it, DataScava provides an automated, simple and precise way for business people and data professionals to search, index, score and match unstructured textual content.

Domain-Specific Language Processing and Weighted Topic Scoring

DataScava is the only product to offer Domain-Specific Language Processing with our patented Weighted Topic Scoring, which provide highly precise results you can see, control and measure. Users select topics of interest, weight their significance, adjust them on the fly and rank the resulting output. They can then home in further, using multi-level sort to drill down and surface key results while the system automatically mines and matches new data as it arrives. Search templates, editable topic libraries, percentile rankings and “not” capability allow them to extract the exact data they need.
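To make the idea of weighting topics and ranking output concrete, here is a minimal sketch in Python. It is not DataScava’s patented Weighted Topic Scoring; it simply assumes a score equal to each topic’s keyword hit count multiplied by a user-chosen weight, with a “not” topic that disqualifies a document. The names `Topic`, `topic_hits`, `score` and `rank`, and the sample topics, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Topic:
    """A user-defined topic (hypothetical structure for illustration)."""
    name: str
    keywords: List[str]
    weight: float = 1.0     # user-assigned significance
    exclude: bool = False   # "not" capability: any hit disqualifies the document


def topic_hits(text: str, topic: Topic) -> int:
    """Count whole-word, case-insensitive keyword occurrences for one topic."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", lowered))
        for kw in topic.keywords
    )


def score(text: str, topics: List[Topic]) -> Optional[float]:
    """Return the weighted score, or None if an excluded topic appears."""
    total = 0.0
    for topic in topics:
        hits = topic_hits(text, topic)
        if topic.exclude:
            if hits:
                return None
            continue
        total += topic.weight * hits
    return total


def rank(docs: Dict[str, str], topics: List[Topic]) -> List[Tuple[str, float]]:
    """Score every document, drop excluded ones, and sort best first."""
    scored = [(doc_id, score(text, topics)) for doc_id, text in docs.items()]
    kept = [(doc_id, s) for doc_id, s in scored if s is not None]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


topics = [
    Topic("trading", ["trading", "derivatives", "options"], weight=2.0),
    Topic("markets", ["markets", "equity"], weight=1.0),
    Topic("social media", ["tweets", "viral"], exclude=True),
]
docs = {
    "a": "Index options trading desk reports derivatives losses.",
    "b": "Viral tweets about options for weekend travel.",
    "c": "Derivatives trading outage hits markets.",
}
print(rank(docs, topics))
```

Changing a weight and calling `rank` again is the simplest form of “what-if” analysis: the same documents are re-ranked under the new priorities, and each score can be traced back to the keywords that produced it.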

Domain-Specific Language Processing provides a powerful alternative to Semantic Search and Natural Language Processing. It’s also more effective than Boolean search because users can prioritize their search criteria, automate it and perform “what-if” analysis. That’s especially important today, when most users struggle with even basic queries. As Marti A. Hearst wrote in her book Search User Interfaces, “studies have shown time and again that most users have difficulty specifying queries in Boolean format and often misjudge what the results will be.”

Data Curation With Your Own Vocabulary

Out of the box, AI systems act much like new graduates starting their first job: Their vocabulary includes no real knowledge of how the business works. They don’t account for nuance or context. Starting up, they’re only as effective as their “education” has prepared them to be.

Consider WordNet, a “large lexical database of English” developed at Princeton University. The system groups words into sets of “cognitive synonyms” that share a discrete concept. According to its creators, WordNet links not just words, but specific senses of words. That allows it to deliver narrowly defined results, grouped in ways that go beyond the simple semantic relationships found in a thesaurus.

It sounds impressive. But is it practical? Searching “financial” on WordNet returns just one result: “fiscal.” That’s certainly narrow, but is it useful? What about “finance,” “monetary,” or “economic”? WordNet keeps all of those possible matches on the shelf. By comparison, our financial topic has over 250 associated keywords.

In DataScava, all topics and keywords can be easily viewed, edited, renamed, deleted or added to by any user. As a result, your own specialist nomenclature may be adopted across all of your systems, whether they’re AI-driven or not.

But that’s simply a foundation block. As implementation begins, we partner with you to create company-specific topics, their keywords and a customized library of search templates. “Weighted Topic Scoring” returns precise matches of existing, new or modified data automatically. In addition, DataScava indexes the textual content of your files in real-time, generating metadata such as topic scores, percentile rankings and data tags.
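As a rough illustration of the kind of metadata described here, the sketch below takes per-document topic scores and derives percentile rankings and tags from them. It is an assumption-laden example, not DataScava’s implementation: the percentile is defined as the share of documents scoring at or below a given document, a tag is applied when a score meets a hypothetical `tag_threshold`, and the function names are invented for this example.

```python
from bisect import bisect_right
from typing import Dict, List


def percentile_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Percentage of documents scoring at or below each document."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {doc: 100.0 * bisect_right(ordered, s) / n for doc, s in scores.items()}


def build_metadata(
    topic_scores: Dict[str, Dict[str, float]], tag_threshold: float = 1.0
) -> Dict[str, dict]:
    """Attach topic scores, per-topic percentiles and tags to each document."""
    topics: List[str] = sorted({t for per_doc in topic_scores.values() for t in per_doc})
    metadata = {
        doc: {"topic_scores": dict(per_doc), "percentiles": {}, "tags": []}
        for doc, per_doc in topic_scores.items()
    }
    for topic in topics:
        # A document with no score for a topic is treated as scoring 0.
        column = {doc: per_doc.get(topic, 0.0) for doc, per_doc in topic_scores.items()}
        for doc, pct in percentile_ranks(column).items():
            metadata[doc]["percentiles"][topic] = pct
            if column[doc] >= tag_threshold:
                metadata[doc]["tags"].append(topic)
    return metadata


topic_scores = {
    "doc1": {"finance": 4.0, "legal": 0.0},
    "doc2": {"finance": 1.0, "legal": 3.0},
    "doc3": {"finance": 0.0},
    "doc4": {"finance": 2.0, "legal": 1.0},
}
for doc, meta in build_metadata(topic_scores).items():
    print(doc, meta)
```

Because the percentiles are computed per topic across the whole collection, they would need to be refreshed as new documents arrive; how often to do that is a design choice this sketch leaves open.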

Better Results in a Straightforward Way

DataScava also addresses the logic and flexibility challenges posed by the current state of AI. For example, Natural Language Processing systems have been trained to recognize, derive meaning from and respond to specific words and phrases. However, they may ignore words they don’t recognize or misinterpret ones they do. And while unstructured data rarely appears in standard form, NLP is based on the standard use of language. But the grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. Without extensive programming, it can’t process code words, jargon and unpredictable changes to sentence structure.

Understanding how language is used in the real world is critical to accurate parsing and deriving correct meaning. An NLP system might know the word “options” and the location of “Summit, N.J.,” but still be challenged to process “The Summit Derivatives Trading System suffered an outage in New Jersey, causing massive losses in Index Options.”

DataScava solves such issues with a rich library of domain-specific topics, fine-tuned to your needs during implementation. Users can constantly correct and adjust the model without bastardizing the system’s vocabulary. Instead, their input enhances it.
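One way to picture how a domain-specific vocabulary handles the “Summit” sentence is to match multi-word business phrases before single words. The sketch below is only an illustration under that assumption; the `VOCABULARY` entries, the topic names and the `tag_phrases` function are hypothetical and are not taken from DataScava’s topic library.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical vocabulary mapping a phrase to the topic it signals.
VOCABULARY: Dict[str, str] = {
    "summit derivatives trading system": "trading systems",
    "index options": "derivatives",
    "outage": "incidents",
    "summit": "non-financial",
}


def tag_phrases(text: str, vocabulary: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (phrase, topic) pairs, trying longer phrases before shorter ones."""
    # Regex alternation is tried left to right, so sorting by length makes
    # "summit derivatives trading system" win over the bare word "summit".
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(1).lower(), vocabulary[m.group(1).lower()])
        for m in pattern.finditer(text)
    ]


sentence = (
    "The Summit Derivatives Trading System suffered an outage in New Jersey, "
    "causing massive losses in Index Options."
)
print(tag_phrases(sentence, VOCABULARY))
print(tag_phrases("They reached the summit of the mountain.", VOCABULARY))
```

In the first sentence the word “Summit” is consumed as part of the longer system name, so it is never tagged on its own; in the second it falls through to the single-word entry.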

We Keep the Human in Command

In essence, NLP is a translator that uses fuzzy logic to identify known constructs and phrases, then substitutes more complex phrases to give the system explicit instructions. At the same time, it removes a cornerstone of human language: nuance. “The blood-red sun birthed a new day as it rose over the horizon” becomes “the sun rose.”

That kind of translation is appropriate for relatively simple tasks, such as automating a phone attendant’s call routing. However, it’s not particularly useful for processing the more complex undertakings human beings perform every day.

For a system to complete more-than-basic tasks, it requires the addition of human subject matter expertise. That means incorporating additional specialist words and phrases of interest into the dictionary. Usability issues aside, Boolean phrases may or may not be relevant, can suppress useful results and create output that’s either too broad or too narrow. Despite its use of sophisticated search terms, adding keywords doesn’t necessarily improve results. It simply bastardizes the NLP vocabulary and may negatively impact the information produced by the AI “black box.”

You have to wonder: if there’s an “intelligent” process within AI that “understands what you mean,” and AI applies further “intelligence” to it, how does that logic incorporate these added conditions? Do these exceptions make the system more intelligent or do they hamper its operation? If the user adds a word such as “summit,” as referenced above, to mean “Summit trading system,” does the AI engine always assume “summit” is linked to financial trading? What does it do with phrases such as “they reached the summit of the mountain” or “the heads of government held a summit to discuss their options”?

Can these issues be addressed? Of course. But doing so usually involves modifying the AI model—an expensive effort given the amount of time and money required. And even with training, many users won’t ever understand the way NLP “thinks” or how new keywords can impact a Boolean search. The potential results include inefficiency, lost productivity and relying on incorrect insights to make ineffective decisions.

Why DataScava Makes Sense 

DataScava addresses these concerns by simplifying the user experience and powering search results that are more accurate in every sense of the word.

  • As it works, DataScava finds, measures and highlights references based on your interests and priorities. Those references can be tweaked as desired in real time.
  • Over time, DataScava’s white box allows you to develop and maintain your own nomenclature, a vocabulary that’s specific to you, remains proprietary and can be used across other systems as necessary. As you deploy new systems, there’s no reason to repeat their training.
  • DataScava eliminates the need for continual manual curation, evaluation and costly reprogramming. It works automatically to mine your data based on criteria you control.
  • Our results can be audited. Unlike the typical AI “black box,” DataScava allows users to see precisely why it produced a given outcome in any situation.
  • DataScava serves your interests alone. In most cases, deploying an AI system involves training a vendor’s product, which can then be resold to your competitors. With DataScava, your intelligence stays in-house and constantly improves.

To work properly, AI systems require a tremendous amount of standardized, labeled and otherwise “structured” data. Yet by 2022, more than 90 percent of the world’s electronic information will be unstructured, delivered in business reports, research papers, emails, user comments—all written in different styles and using terms whose meaning differs from sector to sector.

DataScava bridges that gap by creating high-quality input that artificial intelligence and machine learning systems can use to improve the quality of their own output. That increases the accuracy of searches, reduces the risk of inappropriate analysis and badly informed decisions, and increases your data team’s efficiency. With no black box, it generates the kind of auditable output that helps compliance officers sleep better.

By extracting industry-specific information, DataScava makes unstructured data more accessible, more understandable and, above all, more useful.

“Let’s Admit It: We’re a Long Way from Using Real ‘Intelligence’ in AI”

With the growth of AI systems and unstructured data, there is a need for an independent means of curating and evaluating data and measuring output, one that does not depend on the natural language constructs of AI and that offers a point of comparison for how the data is processed. Here’s an excerpt from a blog post on this subject, written by our CTO John Harney and published by big data site KDnuggets:

“For anyone worrying about machines taking over the world, I have reassuring news: The idea of artificial intelligence has been overcome by hype. I don’t mean to belittle AI’s promise or even its existing capabilities. The technology allows organizations to put data to use in ways we could only imagine not that long ago.

“It’s revolutionized the way executives approach strategic planning. But very often lately—when I’m in meetings, reading research papers or listening to an expert’s presentation—I can’t shake the feeling that to many people, terms like “AI,” “machine learning” and “cognitive computing” have become answers unto themselves.

“Today, solutions providers put statements like “AI-driven” or “harnessing the power of machine learning” at the core of their sales pitch. The buzzwords are certainly getting through. One colleague tells the story of a client calling “to make sure AI was included” in their data analysis project. Business people have been sold on the notion that today’s cutting-edge systems analyze data in a black box, then spit out reliable insights. How? They just do.”

Read the full article here: “Let’s Admit It: We’re a Long Way from Using Real ‘Intelligence’ in AI”


“Busting A Buzzword: Semantic Search”

DataScava’s Non-Semantic Search was featured in a piece by the renowned RecruitingTools news blog written by Katrina Kibben.

“What we really need, and I only know one company that does this (shout out to TalentBrowser, powered by DataScava, and founders Janet Dwyer and John Harney) is a completely customizable white box ‘profile’ search built on input and personalized rules that you the user control, not a black box semantic search engine that thinks it knows what you ‘really mean.’ Profile search allows you to specify many individual topics in a search, with thresholds (minimums) to be met by each topic. This twofold process bubbles the best candidates right to the top.”
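The “profile” search described in that quote can be sketched in a few lines of Python. This is an illustration only: it assumes each candidate already has a numeric score per topic, treats the thresholds as minimums every topic must meet, and, as a further assumption not stated in the quote, ranks the qualifying candidates by their total score across the searched topics. The function name `profile_search` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple


def profile_search(
    candidates: Dict[str, Dict[str, float]], minimums: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Keep candidates meeting every topic minimum, then rank by total score."""
    qualified = []
    for name, scores in candidates.items():
        # Step 1: every searched topic must reach its threshold.
        if all(scores.get(topic, 0) >= minimum for topic, minimum in minimums.items()):
            # Step 2: rank on the searched topics only (an assumed ranking rule).
            total = sum(scores.get(topic, 0) for topic in minimums)
            qualified.append((name, total))
    return sorted(qualified, key=lambda pair: pair[1], reverse=True)


candidates = {
    "Ana": {"python": 8, "sql": 5, "finance": 3},
    "Ben": {"python": 9, "sql": 1, "finance": 7},
    "Cy": {"python": 6, "sql": 6, "finance": 6},
}
minimums = {"python": 5, "sql": 4, "finance": 2}
print(profile_search(candidates, minimums))
```

Ben has the highest single-topic score but misses the `sql` minimum, so he is filtered out before ranking. That is the practical difference between a threshold-based profile and a search that only sums relevance.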

Here’s the full report