Experts estimate that as much as 90% of the digital data generated daily is unstructured. This includes everything from emails to documents, incident reports, text files, videos, contracts, customer chats, audio files, social media, presentations, blog posts, photos and more. It’s a challenge to extract information from this daunting volume of data to realize business value.

DataScava is an unstructured text data miner built on patented technology that uses business and domain-specific language you control to curate, search, filter, match, label, tag and route heterogeneous textual content for use in big data applications in AI, Machine Learning, RPA, Business Intelligence, Research and Talent.

It finds and filters the most relevant documents or information from large unstructured datasets or snippets of raw text, providing highly precise results you can see and measure. Use it to mine unstructured text data based on source, content, intents, interests and other criteria you define, control and weight.

All of our tools can be shaped and molded to fit each organization’s unique data universe and problem space.


How We Do It

DataScava is a tool for modeling and capturing features and topics within raw text using your organization’s own proprietary taxonomy and a rules-based approach. This allows it to be highly customized around your vocabulary and to support the specific business logic needed for complex document processing.


Domain-Specific Language Processing (DSLP)

Normally, raw textual content is in the form of natural language, so the tools that handle it employ Natural Language Processing (NLP). This is not so with DataScava. Our Domain-Specific Language Processing (DSLP), Weighted Topic Scoring (WTS), and Tailored Topics Taxonomies (TTT) methodologies work as an alternative or adjunct to NLP. They generate metadata about unstructured text and searchable document indices, producing results you can see, control, and measure.

Unlike NLP, DSLP does not use all word forms for taxonomy topic key terms; it uses the exact words that are specified (e.g., “bank” is not the same as “banking” or “banks”). There is also a “not” capability (e.g., “bank,” but not “food bank”) and the ability to set topic default score minimums and key term factors.
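The exact-term and “not” behavior described above can be illustrated with a minimal sketch. This is not DataScava’s implementation; the function name count_exact, the exclude parameter and the case-insensitive comparison are all assumptions made for illustration.

```python
import re

def count_exact(text, term, exclude=()):
    """Count whole-word occurrences of `term` (hypothetical helper).

    Assumptions: matching is case-insensitive, and any occurrence that sits
    inside one of the `exclude` phrases is ignored (the "not" capability).
    No stemming is applied, so "banking" and "banks" never match "bank".
    """
    lowered = text.lower()
    # Blank out excluded phrases first so the words inside them cannot match.
    for phrase in exclude:
        lowered = re.sub(r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)", " ", lowered)
    return len(re.findall(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)", lowered))

sample = "The bank approved the loan. Banking fees rose. The food bank reopened. Banks compete."
plain = count_exact(sample, "bank")                           # 2
filtered = count_exact(sample, "bank", exclude=["food bank"])  # 1
```

In this sketch only the standalone word “bank” is counted, and the second call drops the occurrence inside “food bank.”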



Tailored Topics Taxonomies (TTT)

Tailored Topics Taxonomies facilitate the creation, selection and definition of company and domain-specific topics and associated key terms to capture how you think about your operations, technology, products and brand.




Weighted Topic Scoring (WTS)

Beyond the topic level, DataScava provides another level of abstraction: a proprietary method called Weighted Topic Scoring (WTS). It synthesizes multiple topics into a cohesive category, then scores and ranks documents to determine how well each one fits into the user-defined category.
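Because WTS is proprietary, its actual formula is not described here. As a rough stand-in, the idea of combining several topic scores into one category score and ranking documents by it can be sketched with a plain weighted sum. The topic names, weights and the category_score function are hypothetical.

```python
def category_score(topic_scores, weights):
    """Combine per-topic scores into one category score.

    Assumption: a simple weighted sum stands in for the proprietary method.
    Topics missing from a document count as zero.
    """
    return sum(weight * topic_scores.get(topic, 0) for topic, weight in weights.items())

# Hypothetical per-document topic scores.
docs = {
    "doc_a": {"Python": 4, "Trading": 2},
    "doc_b": {"Python": 1, "Trading": 5, "Marketing": 3},
    "doc_c": {"Marketing": 6},
}

# Hypothetical category definition: a negative weight penalizes an unwanted topic.
weights = {"Python": 2.0, "Trading": 1.0, "Marketing": -0.5}

# Rank documents from best to worst fit for the category.
ranked = sorted(docs, key=lambda d: category_score(docs[d], weights), reverse=True)
# ranked == ["doc_a", "doc_b", "doc_c"]
```

Here doc_a scores 10.0, doc_b scores 5.5 and doc_c scores -3.0, so the ranking follows directly from the weights the user chose.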


DataScava Highlighter

DataScava’s Highlighter supports the taxonomy topics refinement effort. When you view source data in DataScava, the topic key terms that contribute to one or more topic scores are highlighted and color-coded to their topics, while terms that do not contribute are left unmarked. This is how users identify errors and improve the accuracy of the DataScava Topics on an ongoing basis. Just as businesses evolve, so must ontologies. In other systems, such as those built on NLP, Machine Learning or AI, end users are often in the dark about why a snippet of input data results in specific output.
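To make the highlighting idea concrete, here is a small sketch that marks each contributing key term with its topic name in brackets, in place of color. The highlight function and the sample taxonomy are hypothetical, and the sketch assumes each key term belongs to exactly one topic, which is a simplification.

```python
import re

def highlight(text, taxonomy):
    """Wrap every matched key term as "[Topic: term]" (hypothetical helper).

    Assumptions: whole-term, case-insensitive matching, and one topic per term.
    """
    term_topic = {term.lower(): topic for topic, terms in taxonomy.items() for term in terms}
    # Longest terms first so multi-word phrases win over shorter overlapping terms.
    alternatives = "|".join(re.escape(t) for t in sorted(term_topic, key=len, reverse=True))
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda m: f"[{term_topic[m.group(1).lower()]}: {m.group(1)}]", text)

taxonomy = {"Banking": ["bank", "loan"], "Viruses": ["Covid-19"]}
marked = highlight("The bank paused loan reviews during Covid-19.", taxonomy)
# marked == "The [Banking: bank] paused [Banking: loan] reviews during [Viruses: Covid-19]."
```

Words that carry no marker contributed nothing to any topic score, which is what lets a reviewer spot a missing or misplaced key term.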




DataScava Indexers use DSLP to generate precise numeric scores that measure the frequency of topic key terms in the raw text data. The scoring does not use NLP functions such as fuzzy search, stemming, stop word removal, POS tagging or lemmatization. Only textual data that contains topic key terms gets a score in the topic; textual content that does not mention a topic key term is not indexed.
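A frequency-based index of this kind can be sketched in a few lines. The index_document function and the sample taxonomy below are hypothetical, and the score is assumed to be a raw count of key-term occurrences; the actual scoring may differ.

```python
import re

# Hypothetical taxonomy: topic name -> list of exact key terms.
taxonomy = {
    "Banking": ["bank", "loan"],
    "Viruses": ["Covid-19", "influenza"],
}

def index_document(text, taxonomy):
    """Return {topic: score} for topics whose key terms appear in `text`.

    Assumptions: the score is a plain occurrence count, matching is
    case-insensitive and whole-term, with no stemming or fuzzy matching.
    Topics with no hits are omitted, so a document with no key terms
    yields an empty dict and is effectively not indexed.
    """
    lowered = text.lower()
    scores = {}
    for topic, terms in taxonomy.items():
        hits = sum(
            len(re.findall(r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)", lowered))
            for term in terms
        )
        if hits:
            scores[topic] = hits
    return scores

banking_doc = index_document("The bank issued a loan; another bank declined.", taxonomy)
unrelated_doc = index_document("Weather was mild today.", taxonomy)
# banking_doc == {"Banking": 3}; unrelated_doc == {}
```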

Ongoing refinements and exceptions are handled by adding, deleting and editing topics and their key terms. For example, a topic “Viruses” in the medical domain would have the key term “Covid-19” but in the I.T. domain would not. Over time, DataScava Topics are continually refined in a measurable way at the direction of users to produce increasingly accurate results and eliminate incorrectly indexed textual content.


Data Visualization

DataScava Indexes express what the raw unstructured textual data contains as numeric scores, in the typical structured data format used by business intelligence and data visualization tools. This enables a world of analysis not possible using NLP, Machine Learning or AI. On its own, this topic scores metadata can support many of the applications intelligent systems are used for.
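As an illustration of what a structured format could look like, the sketch below flattens per-document topic scores into a CSV table with one row per document and one column per topic, a shape most BI and visualization tools can load. The scores_to_csv function, the column layout and the sample data are assumptions, not DataScava’s export format.

```python
import csv
import io

def scores_to_csv(index):
    """Render {document: {topic: score}} as CSV text (hypothetical layout).

    Assumption: missing topic scores are written as 0 so every row has
    the same columns.
    """
    topics = sorted({topic for scores in index.values() for topic in scores})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document"] + topics)
    for doc_id in sorted(index):
        writer.writerow([doc_id] + [index[doc_id].get(topic, 0) for topic in topics])
    return buffer.getvalue()

# Hypothetical topic scores for two documents.
index = {"doc_a": {"Banking": 3}, "doc_b": {"Banking": 1, "Viruses": 2}}
table = scores_to_csv(index)
# table == "document,Banking,Viruses\ndoc_a,3,0\ndoc_b,1,2\n"
```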


Profile Matching

Context is key, and DSLP, WTS and TTT work together to ensure it is achieved through “Profile Matching,” named after a carpentry tool used to model complex shapes that can’t be measured linearly. Customizable company and domain-specific WTS templates use topic scores metadata generated by DataScava Indexers to define precise context to any degree of complexity and depth.

Proponents of NLP contend that it “knows” what language means or intends, but that’s a huge challenge within specialized domains. DataScava does not program a machine to learn; it trains the machine to categorize complex human logic into domain-specific use cases. Think about how complex subject matter expertise is. An SME can think, “I want either a lot of A or B or maybe C if D is present, but I don’t want anything that includes F or G unless there’s H, but if I have A and C, the FG exception is less important as long as K is there.”
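The SME statement above can be written down as explicit rules over topic scores. The sketch below is one possible reading, not how DataScava encodes it: “a lot” is assumed to mean a score of at least 5, “less important” is simplified to waiving the exclusion entirely, and the function name matches_profile is hypothetical.

```python
def matches_profile(scores, a_lot=5):
    """Evaluate the example SME rule against {topic: score} (one interpretation).

    Assumptions: `a_lot` is an arbitrary threshold, and the softened
    "FG exception" is modeled as a full waiver when A, C and K are present.
    """
    def has(topic):
        return scores.get(topic, 0) > 0

    def lots(topic):
        return scores.get(topic, 0) >= a_lot

    # "a lot of A or B, or maybe C if D is present"
    wanted = lots("A") or lots("B") or (has("C") and has("D"))
    # "nothing that includes F or G unless there's H"
    blocked = (has("F") or has("G")) and not has("H")
    # "if I have A and C, the FG exception is less important as long as K is there"
    waived = has("A") and has("C") and has("K")
    return wanted and (not blocked or waived)

matches_profile({"A": 6})                            # True
matches_profile({"A": 6, "F": 1})                    # False: F present without H
matches_profile({"A": 6, "F": 1, "H": 1})            # True: H lifts the exclusion
matches_profile({"A": 6, "C": 1, "K": 1, "G": 2})    # True: A, C and K waive it
```

Writing the rule this way keeps every condition visible and editable, so a subject matter expert can adjust a threshold or an exception without retraining anything.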



One of the most labor-intensive parts of training and deploying machine learning and AI systems involves cleansing and classifying raw unstructured textual data. DataScava works directly with messy data and so does not require cleansing to successfully classify, tag and label it using DSLP, WTS and TTT. After classification, documents that do not contain defined domain-specific topics can thus be ignored, so the need to cleanse data is reduced.

DataScava Highlighter uses this classification data to support human-in-the-loop (HITL) verification of training data and to improve the accuracy of the taxonomy topics and their key terms on an ongoing basis. By automating the classification process, DataScava can be used not only during the training process but also during everyday system operation and to assess output.



Users of Machine Learning or AI often find they do not understand what the system did, why it did it, or whether it gave the correct results. DataScava doesn’t attempt to create output in a hidden, black-box internal process. Instead, it empowers users to identify any subset of the raw data they’re specifically interested in. While intelligent systems replace human effort, DataScava informs and empowers it by providing the logical analytical scaffolding and boundaries to keep the human in control.