Experts estimate that 90% of the digital data generated daily is unstructured. Unlocking business value from this overwhelming volume of data requires identifying precise, relevant information efficiently.
DataScava is an advanced unstructured text data miner built on patented matching technology. Using your business and domain-specific language, it lets you define and refine the data you need in real time. It helps you find, measure, filter, match, route, sort, and rank textual content automatically for use in AI, LLMs, ML, RPA, BI, Research, BAU, TA, and more, and it produces results you can see, control, and audit.
How It Works
DataScava’s DSTopics, DSIndex, and DSMatch use our proprietary Domain-Specific Language Processing (DSLP), Tailored Topics Taxonomies (TTT), and Weighted Topic Scoring (WTS) methodologies to process unstructured text with user-defined business logic. Unlike traditional NLP, DSLP ensures explainable, tailored results by categorizing documents and surfacing relevant files in context.
Context is key, and DataScava achieves it through our patented “Profile Matching of Unstructured Documents,” named after a Contour Profile Gauge carpentry tool used to model complex shapes that can’t be measured linearly.
At the heart of DataScava are three core pillars:
DSTopics
DSTopics leverages TTT to define and model domain-specific features within heterogeneous text. Users can import, select, create, and edit specialized taxonomies to incorporate their business language and expertise. It ensures precise categorization, offering the customized and flexible vocabulary logic necessary for complex document processing.
It enables real-time import, creation, and refinement of domain-specific topics and their associated keywords and key phrases. Users can add, delete, and edit Topics on the fly, which triggers DSIndex to rescore and rematch files against the updated Topic.
Examples: A medical taxonomy would include “Covid-19” under the topic “Viruses,” while an IT taxonomy would exclude it. An investment banking taxonomy will have topics with key terms such as commodities, fixed income derivatives, credit default swaps, CDOs, IRD, traders, Calypso, and Wall Street, while a retail banking taxonomy will have key terms like checking account, savings account, debit cards, credit cards, mortgages, annuity, and personal loans.
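To make the "Viruses" example concrete, here is a minimal Python sketch of how a taxonomy might be represented as topics mapped to key terms. It is an illustration only, not DataScava's implementation: the dictionary layout, the `matched_topics` helper, and every key term other than “Covid-19” are assumptions.

```python
import re

# Hypothetical taxonomies: topic name -> key terms and key phrases.
# Only "Covid-19" under "Viruses" comes from the text; the rest is illustrative.
MEDICAL = {"Viruses": ["covid-19", "influenza"]}
IT = {"Viruses": ["malware", "trojan", "ransomware"]}


def matched_topics(text, taxonomy):
    """Return {topic: [key terms found]} for topics with at least one hit."""
    lowered = text.lower()
    hits = {}
    for topic, terms in taxonomy.items():
        found = [
            term
            for term in terms
            # Match whole terms only, so "trojan" does not match "trojans-x".
            if re.search(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)", lowered)
        ]
        if found:
            hits[topic] = found
    return hits


note = "Patient tested positive for Covid-19 last week."
print(matched_topics(note, MEDICAL))  # {'Viruses': ['covid-19']}
print(matched_topics(note, IT))       # {}
```

The same document lands under “Viruses” in the medical taxonomy and nowhere in the IT one, which is the behavior the example describes.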
Continuous refinement of user-defined topics and their key terms ensures the system improves and adapts to your business needs over time.
DSIndex
DSIndex applies DSLP, TTT, and WTS to focus on the language of your industry and organization, generating numeric measurements of text at the file level. Tailored for precision, it identifies user-defined jargon and integrates domain-specific expertise into the language processing pipeline, surfacing the most relevant files from even the largest datasets.
It creates structured metadata for advanced analysis in BI and visualization tools. Users can select or create custom templates with Weighted Topic Scoring to analyze text with precise context and depth.
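As a rough illustration of "numeric measurements of text at the file level," the sketch below scores one document per topic by summing weighted key-term counts. This is a simple stand-in under stated assumptions, not the patented WTS method: the weights, the `topic_scores` function, and the sample document are all hypothetical.

```python
import re


def count_term(text, term):
    """Count whole-term, case-insensitive occurrences of term in text."""
    pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
    return len(re.findall(pattern, text.lower()))


def topic_scores(text, weighted_taxonomy):
    """For each topic, sum weight * occurrences over its key terms."""
    return {
        topic: sum(weight * count_term(text, term) for term, weight in terms.items())
        for topic, terms in weighted_taxonomy.items()
    }


# Hypothetical weights; the key terms are taken from the banking examples.
TAXONOMY = {
    "Investment Banking": {
        "credit default swaps": 3,
        "fixed income derivatives": 3,
        "traders": 1,
    },
    "Retail Banking": {"checking account": 2, "mortgages": 2, "personal loans": 2},
}

doc = "Traders priced credit default swaps while other traders hedged mortgages."
scores = topic_scores(doc, TAXONOMY)
print(scores)  # {'Investment Banking': 5, 'Retail Banking': 2}
```

A per-file dictionary like `scores` is the kind of structured metadata that could then be loaded into a BI or visualization tool.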
Why It’s Different: While NLP attempts to infer meaning, DSIndex measures and categorizes complex domain-specific logic—tailored to your unique requirements.
DSMatch
DSMatch uses WTS to mine indexed text data, synthesize multiple weighted topics into cohesive categories, categorize and classify documents, and prioritize results based on user-defined required and desired score thresholds. It uses your defined priorities to deliver further refined and transparent outcomes via multi-level rank and sort capabilities to home in on relevance and importance.
It automates text mining, sorting, and routing, displaying results in intuitive grids with visible topic scores and dual bar charts for transparency. The ability to segment and weight multiple topics of interest based on user-defined criteria results in highly accurate output.
Key Features:
- Users define minimum and/or maximum topic score thresholds for “required” and “nice-to-have” criteria.
- It uses WTS to classify, tag, match, filter, and route documents based on your preferences.
- Visual tools highlight matched topic scores and color-coded key term scores for easy refinement, sorting, and filtering.
- Continuous classification ensures new text files are automatically processed and matched to open DSMatch templates.
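The threshold logic in the features above can be sketched as follows. This is a minimal illustration, assuming per-document topic scores are already available; the `MatchTemplate` class, the `(min, max)` tuple layout, and the ranking rule (desired criteria met, then total score) are assumptions rather than DataScava's actual behavior.

```python
from dataclasses import dataclass, field


@dataclass
class MatchTemplate:
    # topic -> (min_score, max_score); max_score of None means no upper bound
    required: dict = field(default_factory=dict)
    desired: dict = field(default_factory=dict)


def in_range(score, bounds):
    low, high = bounds
    return score >= low and (high is None or score <= high)


def match(documents, template):
    """Keep documents that meet every required threshold, then rank them
    by how many desired thresholds they meet, breaking ties by total score."""
    results = []
    for name, scores in documents.items():
        if not all(in_range(scores.get(t, 0), b) for t, b in template.required.items()):
            continue  # fails a required criterion
        desired_met = sum(
            in_range(scores.get(t, 0), b) for t, b in template.desired.items()
        )
        results.append((name, desired_met, sum(scores.values())))
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results


# Hypothetical per-file topic scores.
docs = {
    "a.txt": {"Investment Banking": 5, "Retail Banking": 2},
    "b.txt": {"Investment Banking": 1, "Retail Banking": 6},
    "c.txt": {"Investment Banking": 8, "Retail Banking": 0},
}
template = MatchTemplate(
    required={"Investment Banking": (3, None)},
    desired={"Retail Banking": (1, None)},
)
for row in match(docs, template):
    print(row)
# ('a.txt', 1, 7)
# ('c.txt', 0, 8)
```

Here `b.txt` is dropped because it misses the required minimum, and `a.txt` outranks `c.txt` because it also satisfies the nice-to-have criterion.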
Visualization and Presentation
- Dual Bar Charts: Display Topic scores and thresholds for each document, making gaps and matches immediately visible. Topic keywords and key phrases are identified and displayed in color-coded highlights, enabling quick identification of their associated Topics.
- Third-Party Integration: DataScava outputs metadata that can be used in other systems and visualized in Qlik, Tableau, or similar business intelligence tools for advanced analysis.
Screenshots: File Match; Color-Coded Topic Key Terms; Tailored Topics Taxonomies; DSMatches Grid.
Why Choose DataScava?
Transparency Over Black-Box AI:
DataScava empowers users to understand and control their data, providing clear explanations for its outputs.
Handles Messy Data:
DataScava classifies, tags, and labels unstructured text without requiring extensive cleansing. Non-relevant documents are ignored after classification.
Human in Command:
Unlike traditional AI that replaces human effort, DataScava enhances it, ensuring users remain in control while benefiting from automation.
Domain-Specific Taxonomies:
Pre-configured taxonomies are available for Financial, Technology, and Talent Analytics domains, with customizable options to suit your unique needs.
Deployment Options
On-Premises: DataScava reads and writes index data to your existing database.
Cloud-Based: Hosted on an AWS cloud database, ensuring scalability and security.
REST API Integration: Access index values through a REST API GET call for seamless integration.