Experts estimate that as much as 90% of the digital data generated daily is unstructured. This includes emails, documents, incident reports, text files, videos, contracts, customer chats, audio files, social media, presentations, blog posts, photos, and more. Extracting information from this daunting volume of data, and realizing business value from it, is a challenge.

DataScava is an unstructured text data miner built on patented technology that uses business and domain-specific language you control to help you identify the high-value data you need to use with applications in AI, Machine Learning, RPA, Business Intelligence, Research, Operations, Talent, and more. Use it to curate, search, filter, match, visualize, and route heterogeneous textual content to unlock its value.

Our technology finds and filters the most relevant documents or information from large unstructured datasets or raw text snippets, providing precise results you can see and measure. Use it to mine unstructured text data based on source, content, intents, interests, and other criteria you define, control, and weight. Our tools can be shaped and molded to fit each organization’s unique data universe and problem space.


Our Approach

DataScava is a tool for modeling and capturing features and topics within raw text using your organization’s proprietary taxonomy and a rules-based approach. This allows it to be highly customized around your vocabulary and to support the specific business logic that complex document processing requires.


Domain-Specific Language Processing (DSLP)

Typically, raw textual content is in the form of natural language, so the tools that handle it employ Natural Language Processing (NLP). This is not so with DataScava. Our Domain-Specific Language Processing (DSLP), Weighted Topic Scoring (WTS), and Tailored Topics Taxonomies (TTT) methodologies work as an alternative or adjunct to NLP. They generate searchable document indices and metadata about unstructured text.

Unlike NLP, DSLP does not use all word forms for taxonomy topic key terms. It uses only the exact words that are specified (e.g., “bank” is not the same as “banking” or “banks”). There is also a “not” capability (e.g., “bank,” but not “food bank”) and the ability to set topic default score minimums and key term factors.
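The exact-term and “not” behavior described above can be illustrated with a minimal Python sketch. This is not DataScava's implementation: the function name `count_exact`, the whole-word regular expression, and the case-insensitive comparison are all assumptions made for illustration.

```python
import re

def count_exact(term: str, text: str, exclude: tuple[str, ...] = ()) -> int:
    """Count whole-word occurrences of `term`, skipping any occurrence
    that falls inside one of the `exclude` phrases (the "not" capability).
    Matching is case-insensitive here, which is an assumption."""
    def spans(phrase: str) -> list[tuple[int, int]]:
        # (?<!\w) and (?!\w) require the phrase to stand alone, so "bank"
        # does not match inside "banks" or "banking" (no stemming).
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]

    blocked = [span for phrase in exclude for span in spans(phrase)]
    hits = 0
    for start, end in spans(term):
        inside_excluded = any(b_start <= start and end <= b_end
                              for b_start, b_end in blocked)
        if not inside_excluded:
            hits += 1
    return hits

text = ("The bank approved the loan. Banks and banking differ. "
        "She volunteers at a food bank.")
print(count_exact("bank", text))                          # 2
print(count_exact("bank", text, exclude=("food bank",)))  # 1
```

In this sketch “Banks” and “banking” never count toward “bank”, and the “food bank” mention is dropped only when the exclusion is supplied.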



Weighted Topic Scoring (WTS)

Beyond the topic level, DataScava provides another level of abstraction: a proprietary method called Weighted Topic Scoring (WTS). WTS synthesizes multiple topics into a cohesive, user-defined category, then scores and ranks documents by how well each one fits that category.
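Because WTS is proprietary, its actual formula is not given here. As a rough illustration of the general idea only, the sketch below combines per-topic scores with user-chosen weights and ranks documents by the result. The weighted sum, the topic names, the weights, and the document scores are all hypothetical.

```python
def weighted_topic_score(topic_scores: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Combine per-topic scores into one category score.
    A plain weighted sum is an assumption, not the WTS formula."""
    return sum(weight * topic_scores.get(topic, 0.0)
               for topic, weight in weights.items())

# Hypothetical category built from three topics with user-chosen weights.
category_weights = {"Derivatives": 3.0, "Risk Management": 2.0, "Python": 1.0}

# Hypothetical per-document topic scores, as an indexer might produce them.
documents = {
    "doc_a": {"Derivatives": 4, "Python": 2},
    "doc_b": {"Risk Management": 1, "Python": 5},
    "doc_c": {"Derivatives": 1, "Risk Management": 3},
}

category_scores = {doc_id: weighted_topic_score(scores, category_weights)
                   for doc_id, scores in documents.items()}
ranked = sorted(category_scores, key=category_scores.get, reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Changing a weight changes the ranking in a way the user can trace back to individual topic scores.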



Tailored Topics Taxonomies (TTT)

DataScava Tailored Topics Taxonomies facilitate the creation, selection, and definition of company and domain-specific topics and associated key terms to capture how you think about your operations, technology, products, and brand.


DataScava Highlighter

DataScava Highlighter supports the taxonomy topics refinement effort. In other systems, such as those built on NLP, Machine Learning, or AI, end users are often in the dark about why a snippet of input data results in specific output. When you view source data in DataScava, by contrast, the topic key terms that contribute to one or more topic scores are highlighted and color-coded to their topics, while terms that do not contribute are left unmarked. This is how users identify errors and improve the accuracy of the DataScava Topics on an ongoing basis. Just as businesses evolve, so must ontologies.
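To make the idea of highlighting concrete, here is a small Python sketch that marks each contributing key term with its topic. It is only an illustration: the `highlight` function, the bracket notation (standing in for color-coding), and the sample taxonomy are hypothetical and do not describe the product's interface.

```python
import re

def highlight(text: str, taxonomy: dict[str, list[str]]) -> str:
    """Wrap each exact key-term match as "[Topic: term]".
    Text that matches no key term is left untouched."""
    term_to_topic = {term.lower(): topic
                     for topic, terms in taxonomy.items()
                     for term in terms}
    # Try longer terms first so a multi-word term wins over a shorter one.
    ordered = sorted(term_to_topic, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)

    def mark(match: re.Match) -> str:
        topic = term_to_topic[match.group(1).lower()]
        return f"[{topic}: {match.group(1)}]"

    return pattern.sub(mark, text)

# Hypothetical taxonomy with two topics.
taxonomy = {"Banking": ["bank", "loan"], "Programming": ["python"]}
marked = highlight("The bank issued a loan to a Python shop; banks compete.",
                   taxonomy)
print(marked)
# The [Banking: bank] issued a [Banking: loan] to a [Programming: Python] shop; banks compete.
```

A reviewer reading the marked text can see that “banks” did not contribute to any score and decide whether it should be added as a key term.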





DSLP Taxonomies for Financial and IT Domains




DSLP Taxonomies for Talent Mining and Skills Analytics


DataScava Indexers

DataScava Indexers use DSLP to generate precise numeric scores that measure the frequency of topic key terms in the raw text data. The scoring does not use NLP functions such as fuzzy search, stemming, stop word removal, POS tagging, or lemmatization. Only textual data that contains topic key terms gets a score in the topic; textual content that does not mention a topic key term is not indexed.
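A simplified sketch of this indexing behavior follows. It assumes single-word key terms, lowercase token comparison, and raw counts as scores; the function `index_documents`, the taxonomy, and the sample documents are hypothetical and not taken from the product.

```python
import re
from collections import Counter

def index_documents(docs: dict[str, str],
                    taxonomy: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    """Score each document per topic by counting exact key-term tokens.
    Documents that mention no key term are left out of the index."""
    index = {}
    for doc_id, text in docs.items():
        # Plain tokenization only: no stemming, lemmatization, or fuzzy matching.
        tokens = Counter(re.findall(r"\w+", text.lower()))
        scores = {}
        for topic, terms in taxonomy.items():
            score = sum(tokens[term.lower()] for term in terms)
            if score > 0:
                scores[topic] = score
        if scores:
            index[doc_id] = scores
    return index

# Hypothetical taxonomy and documents.
taxonomy = {"Databases": ["sql", "postgres"], "Cloud": ["aws"]}
docs = {
    "resume_1": "SQL and Postgres; more SQL tuning on AWS.",
    "resume_2": "Postgresql administration",
    "memo_3": "Lunch menu for Friday.",
}
index = index_documents(docs, taxonomy)
print(index)  # {'resume_1': {'Databases': 3, 'Cloud': 1}}
```

Note that “Postgresql” does not count toward the key term “postgres”, so `resume_2` stays out of the index until a user adds that term to the topic.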

Ongoing refinements and exceptions are handled by adding, deleting, and editing topics and their key terms. For example, the topic “Viruses” in the medical domain would have the key term “Covid-19,” but in the IT domain it would not. Over time, DataScava Topics are continually refined in a measurable way at the direction of users to produce increasingly accurate results and eliminate incorrectly indexed textual content.


Data Visualization

DataScava Indexes produce numeric score measurements of what the raw unstructured textual data contains, in the typical structured data format used by business intelligence and data visualization tools. This enables a world of analysis not possible using NLP, Machine Learning, or AI. On its own, this topic-score metadata can support many of the applications for which intelligent systems are used.
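As an illustration of what a structured format for topic scores could look like, the sketch below flattens a document-to-topic score mapping into one row per document and topic and writes it as CSV, a layout most business intelligence tools can load. The column names, the `to_csv` helper, and the sample scores are assumptions, not DataScava's export format.

```python
import csv
import io

def to_rows(index: dict[str, dict[str, int]]) -> list[dict]:
    """Flatten {document: {topic: score}} into one row per (document, topic)."""
    return [{"document": doc_id, "topic": topic, "score": score}
            for doc_id, scores in sorted(index.items())
            for topic, score in sorted(scores.items())]

def to_csv(index: dict[str, dict[str, int]]) -> str:
    """Render the flattened rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["document", "topic", "score"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(to_rows(index))
    return buffer.getvalue()

# Hypothetical topic scores for two documents.
sample_index = {
    "doc_b": {"Cloud": 4},
    "doc_a": {"Risk": 2, "Cloud": 1},
}
print(to_csv(sample_index))
```

The long, one-row-per-score layout is chosen here because it pivots and aggregates easily in spreadsheet and dashboard tools.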


Profile Matching

Context is key, and DSLP, WTS, and TTT work together to achieve it through “Profile Matching,” named after a carpentry tool used to model complex shapes that can’t be measured linearly. Customizable company and domain-specific WTS templates use topic scores metadata generated by DataScava Indexers to define the precise context to any degree of complexity and depth.

Proponents of NLP contend that it “knows” what language means or intends, but that’s a massive challenge within specialized domains. DataScava does not program a machine to learn; it trains the machine to categorize complex human logic into domain-specific use cases. Consider how intricate subject matter expertise can be. A subject matter expert (SME) can think, “I want either a lot of A or B or maybe C if D is present, but I don’t want anything that includes F or G unless there’s H, but if I have A and C, the FG exception is less important as long as K is there.”
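The SME statement quoted above can be read as an explicit rule over topic scores. The Python sketch below encodes one possible reading of it. The threshold for “a lot” and the decision to waive the F/G exclusion entirely (the quote only calls it “less important”) are simplifying assumptions, and `matches_profile` is a hypothetical name.

```python
def matches_profile(scores: dict[str, int], lot: int = 5) -> bool:
    """One possible reading of the SME rule over topic scores A..K.
    `lot` is an assumed threshold for "a lot of" a topic."""
    def has(topic: str, minimum: int = 1) -> bool:
        return scores.get(topic, 0) >= minimum

    # "a lot of A or B, or maybe C if D is present"
    wanted = has("A", lot) or has("B", lot) or (has("C") and has("D"))
    # "nothing that includes F or G unless there's H"
    excluded = (has("F") or has("G")) and not has("H")
    # "if I have A and C, the FG exception is less important as long as K is there"
    # (treated here as waiving the exclusion, which is a simplification)
    waived = has("A") and has("C") and has("K")
    return wanted and (not excluded or waived)

print(matches_profile({"A": 6}))                            # True
print(matches_profile({"A": 6, "F": 1}))                    # False
print(matches_profile({"A": 6, "F": 1, "H": 1}))            # True
print(matches_profile({"A": 6, "C": 1, "F": 1, "K": 1}))    # True
```

Writing the rule out this way makes each clause visible and testable, which is the point of capturing the expert's logic rather than inferring it.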



One of the most labor-intensive parts of training and deploying machine learning and AI systems involves cleansing and classifying raw unstructured textual data. DataScava works directly with messy data and so does not require cleansing to successfully classify, tag, and label it using DSLP, WTS, and TTT. Furthermore, documents that do not contain defined domain-specific topics can be ignored after classification, so the need to cleanse data is lessened.

DataScava Highlighter uses this classification data to empower human-in-the-loop (HITL) verification of training data and to improve the accuracy of the taxonomy topics and their key terms on an ongoing basis. In addition, because it automates the classification process, DataScava can be used during training, in everyday system operation, and to assess output.



Machine Learning or AI users often find they do not understand what the system did, why it did it, and whether it gave the correct results. DataScava doesn’t attempt to create output in a hidden, black-box internal process. Instead, it empowers users to identify any subset of the raw data they’re specifically interested in. While intelligent systems replace human effort, DataScava informs and empowers it by providing the logical analytical scaffolding and boundaries to keep the human in control.