Check out our latest post “Mine Unstructured Text Data Using Your Business and Domain Language with DataScava.”
Experts estimate that as much as 90 percent of the data that is generated daily is unstructured data. This includes everything from documents to text files, videos, audio files, social media, presentations, blog posts, photos and more.
It’s a challenge to extract information from this daunting volume of digital data so companies can realize business value. Technologies such as AI, Machine Learning, RPA, NLP and NLU try to make it more accessible, understandable and actionable. A key first step is to identify highly precise subsets of documents with relevant context and content for use in training data, analysis and automation.
Up to 80% of their efforts are spent in this time-consuming process — finding, cleaning and reorganizing huge amounts of messy data — because these systems require accurate input to ensure they don’t use documents about “viral Tweets” if their focus is scientific research about “viral infections” like COVID-19.
But mining unstructured text data isn’t simple. To work effectively, a solution must be fine-tuned to meet your specific needs and address the quirks in your textual content.
DataScava addresses these types of issues and others by helping you to use your own business and domain language to mine your unstructured text data. It’s for Data Professionals, Subject Matter Experts, Business Users and Software Engineers. You don’t have to be a Data Scientist to use it.
It’s always true been true in computing that garbage in is garbage out.
This certainly is the case when it comes to working with unstructured text data, which by 2022 will make up more than 90 percent of the world’s electronic information, delivered in business reports, research papers, emails, and other formats – all written in different styles and using terms whose meaning differ from sector to sector.
It’s become an axiom that data scientists spend 80 percent of their time dealing with data preparation problems — leaving only 20 percent for algorithm development, model training and tuning, and machine learning.
Use Your Business Language and Subject Matter Expertise
Today, the technology exists to use your business language and subject matter expertise to quickly and accurately mine raw text so you can unlock its value.
It’s a patented self-service system that curates, searches, filters and routes raw unstructured text data to make it more accessible, understandable and actionable.
A tool that uses your company’s domain-specific language and topics, searches and filters tuned to your business and always in your control.
And works as an alternative or adjunct to NLP/NLU.
It’s called DataScava.
To work properly, data-driven systems require a tremendous amount of standardized, labeled, and otherwise “structured” data. Yet by 2022, more than 90% of the world’s electronic information will be unstructured, all of it written in different styles and using terms whose meaning differs from sector to sector.
DataScava curates, searches, filters or routes messy unstructured text data to make it more accessible, understandable and actionable. It uses your business language and domain-specific topics we work with you to define, measure and prioritize. And you don’t need to be a Data Scientist to use it.
It’s an automated self-service SME-driven system for Business Users, Data Professionals and Software Engineers that keeps the human in command. Ease of use and transparency enable collaboration between nontechnical and technical people and provide a rapid path to efficiency.
Check out this LinkedIn Post to learn more.
Check out our guest post about how our Domain-Specific Language Processing (DSLP) and patented Weighted Topic Scoring (WTS) can be your alternative or an adjunct to NLP to mine your business data in real-time, published by KDnuggets, a leading site on AI, big data, data mining, data science and machine learning.
Here’s an excerpt:
“Processing unstructured text data in real-time is challenging when applying NLP or NLU. Find out how Domain-Specific Language Processing can be your alternative or an adjunct to NLP to mine valuable information from data by following your guidance and using the language of your business.
Real-time mining of unstructured textual content isn’t simple. To work effectively, a solution must be fine-tuned to meet your organization’s specific needs and address the quirks in your company’s information. To add value, applications require a vocabulary that accurately captures the definitions, context, and nuance of your business and the way it uses language.
Consequently, unstructured textual data needs to be organized before it can be put to use. That’s a huge challenge. For data-driven systems based on artificial intelligence, machine learning or other advanced applications, natural language processing and natural language understanding are supposed to be the solution, either by themselves or as hybrid models that plug in industry terms or use complex Boolean logic. But none of these are easy to use or implement, and without extensive programming and training, they often process data incorrectly.
There is, however, a simpler approach to addressing these challenges, one that does not require NLP or NLU. It’s called “Domain-Specific Language Processing,” or DSLP, and it uses the language of your business to mine unstructured textual data.”
Check out this article written by our CTO John Harney about how DataScava mines unstructured textual data using our Domain-Specific Language Processing and patented Weighted Topic Scoring:
Real-time mining of unstructured textual content isn’t simple. Available solutions don’t work well unless they’re fine-tuned to meet your specific needs and address the unique quirks in your company’s information. To truly add value, your applications, whether AI-driven or not, require a vocabulary that captures the definitions, context and nuance of your business.
DataScava’s unstructured data miner provides fast and effective solutions for leveraging the explosive growth of unstructured data. The sheer volume of information—even some small businesses must manage thousands of pages each day—challenges companies to find the most efficient and accurate way to read data, index it, extract what they need and quickly route it to its proper destination.
Conventional wisdom says that taking full advantage of information requires a “data-driven system” based on “artificial intelligence, “machine learning” or some other application. But without modification, these technologies often process information quickly, but incorrectly. Needless to say, machines that use inaccurate data for machine learning, AI insights, business decisions, or that pass data to the wrong destination, don’t do anyone much good.
That’s where DataScava comes in.
Power from Unstructured Data
DataScava’s proprietary search engine extracts actionable, highly precise and industry-specific information from the content you already have. Our technology is data-agnostic and provides a true competitive advantage that maximizes data’s transformative power.
Simple to use and transparent, DataScava is accessible to a wide range of technical and non-technical users. It encourages cross-discipline collaboration and facilitates a faster path to improved efficiency and productivity. It also automates time-consuming data preparation, positively impacting your bottom line.
Leveraging their own subject matter expertise, users can easily set DataScava’s filters. They then work around the clock, continually increasing the system’s capabilities and providing measurable benefits. Our solution can stand on its own, integrate with other systems or curate the data needed for AI and machine learning projects. However you employ it, DataScava provides an automated, simple and precise way for business people and data professionals to search, index, score and match unstructured textual content.
Domain-Specific Language Processing and Weighted Topic Scoring
DataScava is the only product to offer Domain-Specific Language Processing with our patented Weighted Topic Scoring, which provide highly precise results you can see, control and measure. Users select topics of interest, weight their significance, adjust them on-the-fly and rank the resulting output. They can then hone in further, using multi-level sort to drill down and surface key results while the system automatically mines and matches new data as it arrives. Search templates, editable topic libraries, percentile rankings and “not” capability allow them to extract the exact data they need.
Domain-Specific Language Processing provides a powerful alternative to Semantic Search and Natural Language Processing. It’s also more effective than Boolean search because users can prioritize their search criteria, automate it and perform “what-if” analysis. That’s especially important today, when most users struggle with even basic queries. As Marti A. Hearst wrote in her book Search User Interfaces, “studies have shown time and again that most users have difficulty specifying queries in Boolean format and often misjudge what the results will be.”
Data Curation With Your Own Vocabulary
Out of the box, AI systems act much like new graduates starting their first job: Their vocabulary includes no real knowledge of how the business works. They don’t account for nuance or context. Starting up, they’re only as effective as their “education” has prepared them to be.
Consider Wordnet, a “large lexical database of English” developed at Princeton University. The system groups words into sets of “cognitive synonyms” that share a discrete concept. According to its creators, Wordnet links not just words, but specific senses of words. That allows it to deliver narrowly defined results, grouped in ways that go beyond the simple semantic relationships found in a thesaurus.
It sounds impressive. But is it practical? Searching “financial” on Wordnet returns just one result: “fiscal.” That’s certainly narrow, but is it useful? What about “finance,” “monetary,” or “economic?” Wordnet keeps all of those possible matches on the shelf. By comparison, our financial
In DataScava, all topics and keywords can be easily viewed, edited, renamed, deleted or added to by any user. As a result, your own specialist nomenclature may be adopted across all of your systems, whether they’re AI-driven or not.
But that’s simply a foundation block. As implementation begins, we partner with you to create company-specific topics, their keywords and a customized library of search templates. “Weighted Topic Scoring” returns precise matches of existing, new or modified data automatically. In addition, DataScava indexes the textual content of your files in real-time, generating metadata such as topic scores, percentile rankings and data tags.
Better Results in a Straightforward Way
DataScava also addresses the logic and flexibility challenges posed by the current state of AI. For example, Natural Language Processing systems have been trained to recognize, derive meaning from and respond to specific words and phrases. However, they may ignore words they don’t recognize or misinterpret ones they do. And while unstructured data rarely appears in standard form, NLP is based on the standard use of language. But the grammar for natural languages is ambiguous and typical sentences have multiple possible analyses. Without extensive programming, it can’t process code words, jargon and unpredictable changes to sentence structure.
Understanding how language is used in the real world is critical to accurate parsing and deriving correct meaning. An NLP system might know the word “options” and the location of “Summit, N.J.,” but still be challenged to process “The Summit Derivatives Trading System suffered an outage in New Jersey, causing massive losses in Index Options.”
DataScava solves such issues with a rich library of domain-specific topics, fine-tuned to your needs during implementation. Users can constantly correct and adjust the model without bastardizing the system’s vocabulary. Instead, their input enhances it.
We Keep the Human in Command
In essence, NLP is a translator that uses fuzzy logic to identify known constructs and phrases, then substitutes more complex phrases to give the system explicit instructions. At the same time, it removes a cornerstone of human language: nuance. “The blood-red sun birthed a new day as it rose over the horizon” becomes “the sun rose.”
That kind of translation is appropriate for relatively simple tasks, such as automating a phone attendant’s call routing. However, it’s not particularly useful for processing the more complex undertakings human beings perform every day.
For a system to complete more-than-basic tasks, it requires the addition of human subject matter expertise. That means incorporating additional specialist words and phrases of interest into the dictionary. Usability issues aside, Boolean phrases may or may not be relevant, can suppress useful results and create output that’s either too broad or too narrow. Despite its use of sophisticated search terms, adding keywords doesn’t necessarily improve results. It simply bastardizes the NLP vocabulary and may negatively impact the information produced by the AI “black box.”
You have to wonder —if there’s an “intelligent” process within AI that “understands what you mean,” and AI applies further “intelligence” to it—how that logic incorporates these added conditions? Do these exceptions make the system more intelligent or do they hamper its operation? If the user adds a word such as “summit,” as referenced above, to mean “Summit trading system,” does the AI engine always assume “summit” is linked to financial trading? What does it do with phrases such as “they reached the summit of the mountain” or “the heads of government held a summit to discuss their options.”?
Can these issues be addressed? Of course. But doing so usually involves modifying the AI model—an expensive effort given the amount of time and money required. And even with training, many users won’t ever understand the way NLP “thinks” or how new keywords can impact a Boolean search. The potential results include inefficiency, lost productivity and relying on incorrect insights to make ineffective decisions.
Why DataScava Makes Sense
DataScava addresses these concerns by simplifying the user experience and powering search results that are more accurate in every sense of the word.
- As it works DataScava finds, measures and highlights references based on your interests and priorities. Those references can be tweaked as desired in real time.
- Over time, DataScava’s white box allows you to develop and maintain your own nomenclature, a vocabulary that’s specific to you, remains proprietary and can be used across other systems as necessary. As you deploy new systems, there’s no reason to repeat their training.
- DataScava eliminates the need for continual manual curation, evaluation and costly reprogramming. It works automatically to mine your data based on criteria you control.
- Our results can be audited. Unlike the typical AI “black box,” DataScava allows users to see precisely why it produced a given outcome in any situation.
- DataScava serves your interests alone. In most cases, deploying an AI system involves training a vendor’s product, which can then be resold to your competitors. With DataScava, your intelligence stays in-house and constantly improves.
To work properly, AI systems require a tremendous amount of standardized, labeled and otherwise “structured” data. Yet by 2022, more than 90 percent of the world’s electronic information will be unstructured, delivered in business reports, research papers, emails, user comments—all written in different styles and using terms whose meaning differs from sector to sector.
DataScava bridges that gap by creating high quality input that artificial intelligence and machine learning systems can use to improve the quality of their own output. That increases the accuracy of searches, reduces the risk of inappropriate analysis and badly informed decisions, and increases your data team’s efficiency. With no black box, it generates the kind of auditable output that helps compliance officers sleep better.
By extracting industry-specific information, DataScava makes unstructured data more accessible, more understandable and, above all, more useful.
With the growth of AI systems and unstructured data, there is a need for an independent means of data curation, evaluation and measurement of output that does not depend on the natural language constructs of AI and creates a comparative method of how the data is processed. Here’s an excerpt from a blog post on this subject written by our CTO John Harney published by big data site KDNuggets:
“For anyone worrying about machines taking over the world, I have reassuring news: The idea of artificial intelligence has been overcome by hype. I don’t mean to belittle AI’s promise or even its existing capabilities. The technology allows organizations to put data to use in ways we could only imagine not that long ago.
“It’s revolutionized the way executives approach strategic planning. But very often lately—when I’m in meetings, reading research papers or listening to an expert’s presentation—I can’t shake the feeling that to many people, terms like “AI,” “machine learning” and “cognitive computing” have become answers unto themselves.
“Today, solutions providers put statements like “AI-driven” or “harnessing the power of machine learning” at the core of their sales pitch. The buzzwords are certainly getting through. One colleague tells the story of a client calling “to make sure AI was included” in their data analysis project. Business people have been sold on the notion that today’s cutting-edge systems analyze data in a black box, then spit out reliable insights. How? They just do.”
Read the full article here: https://www.kdnuggets.com/2018/Let’s Admit It: We’re a Long Way from Using Real ‘Intelligence’ in AI
DataScava’s Non-Semantic Search was featured in a piece by the renowned RecruitingTools news blog written by Katrina Kibben.
“What we really need, and I only know one company that does this (shout out to TalentBrowser, powered by DataScava, and founders Janet Dwyer and John Harney) is a completely customizable white box ‘profile’ search built on input and personalized rules that you the user control, not a black box semantic search engine that thinks it knows what you ‘really mean.’ Profile search allows you to specify many individual topics in a search, with thresholds (minimums) to be met by each topic. This twofold process bubbles the best candidates right to the top.”
Here’s the full report http://recruitingtools.com/semantic-search/