Experts estimate that as much as 90% of the digital data generated daily is unstructured. This includes everything from emails to documents, text files, videos, contracts, customer chats, audio files, social media, presentations, blog posts, photos and more. It’s a challenge to extract information from this daunting volume of data to realize business value.

DataScava is an unstructured text data miner that uses business and domain-specific language you control to curate, search, filter, tag, match and route heterogeneous textual content for use in big data applications such as AI, Machine Learning, RPA, Business Intelligence, Research, Talent Matching and other systems.

It finds and filters the most relevant documents or information from large unstructured data sets, providing highly precise results you can see and measure. You don’t need to be a data scientist to use it.

Domain-Specific Language Processing (DSLP)

Raw textual content is normally written in natural language, so the tools that handle it typically employ Natural Language Processing (NLP). DataScava takes a different approach.

We use our proprietary Domain-Specific Language Processing (DSLP) and patented Weighted Topic Scoring (WTS) methodologies, which work as an alternative or adjunct to NLP to convert unstructured text data into precisely structured derived data and searchable document indices.


Weighted Topic Scoring (WTS)

DataScava Components

DataScava Topics

DataScava Topics facilitates the creation, selection and definition of personalized company and domain-specific topics and their associated key terms, using DSLP. Unlike NLP, DataScava Topics does not match all word forms of a topic key term; it uses only the exact words specified (e.g., “bank” is not the same as “banking” or “banks”). There is also a “not” capability (e.g., “bank,” but not “food bank”), along with the ability to set topic default score minimums and key term factors.
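As a rough illustration of the exact-match and “not” behavior described above, here is a minimal Python sketch. It is not DataScava's implementation, which is proprietary and not described in detail here. The `Topic` class, the `match_counts` function, whole-word matching and case-insensitivity are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Hypothetical topic definition: exact key terms plus 'not' phrases."""
    name: str
    key_terms: list[str]
    not_phrases: list[str] = field(default_factory=list)


def _whole_word(term: str) -> str:
    """Regex that matches the term only when it is not part of a longer word."""
    return r"(?<!\w)" + re.escape(term) + r"(?!\w)"


def match_counts(topic: Topic, text: str) -> dict[str, int]:
    """Count exact key-term hits, ignoring hits inside 'not' phrases."""
    # Blank out excluded phrases first so "food bank" cannot count as "bank".
    for phrase in topic.not_phrases:
        text = re.sub(_whole_word(phrase), " ", text, flags=re.IGNORECASE)
    return {
        term: len(re.findall(_whole_word(term), text, flags=re.IGNORECASE))
        for term in topic.key_terms
    }


banking = Topic("Banking", key_terms=["bank"], not_phrases=["food bank"])
sample = "The bank approved the loan. Banks and banking differ. She volunteers at a food bank."
counts = match_counts(banking, sample)
print(counts)
```

In this sketch only the first sentence contributes a hit: “Banks” and “banking” are different word forms, and “food bank” is excluded by the “not” phrase.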

DataScava Indexers

DataScava Indexers use DSLP to generate precise numeric scores that measure the frequency of topic key terms in the raw text data. Scoring does not use NLP functions such as fuzzy search, stemming, stop word removal, POS tagging or lemmatization. Only textual data that contains a topic’s key terms receives a score for that topic; textual content that does not mention any topic key term is not indexed. Ongoing refinements and exceptions are handled by adding, deleting and editing topics and their key terms. For example, a topic “Viruses” in the medical domain would have the key term “Covid,” but in the I.T. domain it would not. Over time, DataScava Topics are continually refined in a measurable way at the direction of users to produce increasingly accurate results and eliminate incorrectly indexed textual content.
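The scoring idea above can be sketched as a weighted key-term count. This is an assumption-laden illustration, not the patented WTS method: the formula (count multiplied by a per-term factor, then summed), the function name `index_document`, and the I.T. key terms “malware” and “trojan” are all hypothetical.

```python
import re


def term_count(term: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of the exact term."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def index_document(text: str, topics: dict[str, dict[str, float]]) -> dict[str, float]:
    """Score text per topic; topics with no key-term hits get no entry."""
    scores = {}
    for topic_name, term_factors in topics.items():
        # Assumed formula: sum of (occurrences x key term factor).
        score = sum(term_count(term, text) * factor
                    for term, factor in term_factors.items())
        if score > 0:
            scores[topic_name] = score
    return scores


# The same topic name carries different key terms in each domain.
medical_topics = {"Viruses": {"Covid": 1.0, "influenza": 1.0}}
it_topics = {"Viruses": {"malware": 1.0, "trojan": 2.0}}  # hypothetical terms

note = "Covid cases rose while influenza cases fell. Covid testing continues."
medical_scores = index_document(note, medical_topics)
it_scores = index_document(note, it_topics)
print(medical_scores, it_scores)
```

Under these assumptions the note scores in the medical “Viruses” topic and receives no score at all in the I.T. one, which mirrors the domain-specific refinement described above.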

DataScava Highlighter

DataScava Highlighter supports the topics refinement effort. Unlike NLP, Machine Learning or AI, when you view source data in DataScava, you can see which topic key terms contribute to the score measurement. In other systems, end users are in the dark about why a snippet of input data results in specific output. DataScava Highlighter clearly shows which topic key terms contribute to scores because they are highlighted, while those that do not contribute are not. This is how errors are identified and how users improve the accuracy of the DataScava Topics on an ongoing basis. Just as businesses evolve, so must ontologies.
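To make the highlighting idea concrete, the following sketch wraps each contributing key term in a marker so a reviewer can see what drove a score. The `highlight` function and the `**` marker are hypothetical choices for this example and say nothing about how the actual Highlighter renders text.

```python
import re


def highlight(text: str, key_terms: list[str], mark: str = "**") -> str:
    """Wrap whole-word occurrences of key terms in marker strings."""
    if not key_terms:
        return text
    # Longest terms first so multi-word terms win over their substrings.
    ordered = sorted(key_terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern.sub(lambda m: f"{mark}{m.group(1)}{mark}", text)


highlighted = highlight("The bank hired banking analysts.", ["bank"])
print(highlighted)
```

Here “bank” is marked and “banking” is left alone, so a reviewer can tell at a glance that only the exact key term contributed.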

DataScava Data Visualization

DataScava Indexes produce numeric score measurements of what the raw unstructured textual data contains, in the typical structured data format used by business intelligence and data visualization tools. This enables a range of analysis not possible using NLP, Machine Learning or AI. On its own, this metadata can support many of the applications intelligent systems are used for.
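One plausible shape for such structured output is a long-format table with one row per document and topic. The sketch below writes per-document topic scores as CSV using Python's standard `csv` module; the column names (`doc_id`, `topic`, `score`) and the function `scores_to_csv` are assumptions for illustration, not DataScava's actual export format.

```python
import csv
import io


def scores_to_csv(doc_scores: dict[str, dict[str, float]]) -> str:
    """Flatten {doc_id: {topic: score}} into long-format CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["doc_id", "topic", "score"])
    # Sort for a stable row order regardless of input order.
    for doc_id in sorted(doc_scores):
        for topic in sorted(doc_scores[doc_id]):
            writer.writerow([doc_id, topic, doc_scores[doc_id][topic]])
    return buffer.getvalue()


table = scores_to_csv({
    "doc-2": {"Banking": 4.0},
    "doc-1": {"Viruses": 3.0, "Banking": 1.0},
})
print(table)
```

A table in this form can be loaded directly by most business intelligence and charting tools, then filtered or aggregated by topic.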

DataScava Profile Matching

Context is key, and DSLP and WTS are how DataScava ensures it is achieved. Proponents of NLP contend that it “knows” what language means or intends, but that’s a huge challenge within specialized domains. DataScava does not program a machine to learn; it trains the machine to categorize complex human logic into domain-specific use cases. Think about how complex subject matter expertise is. An SME can think “I want either a lot of A or B or maybe C if D is present, but I don’t want anything that includes F or G unless there’s H, but if I have A and C, the FG exception is less important as long as K is there.”

DSLP and WTS ensure correct context using our patented methodology called “Profile Matching,” named after a carpentry tool used to model complex shapes that can’t be measured linearly. Customizable company and domain-specific WTS templates use DataScava Topics scores generated by DataScava Indexers to define precise context to any degree of complexity and depth.
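The SME rule quoted above can be written as a predicate over topic scores. The sketch below is one possible reading, not the patented Profile Matching method. It assumes that “a lot” means a score at or above a hypothetical threshold `a_lot`, that the phrase applies to both A and B, and that “less important” means the F/G exclusion is waived entirely.

```python
def sme_profile(scores: dict[str, float], a_lot: float = 5.0) -> bool:
    """Apply the SME rule from the text to a document's topic scores."""

    def score(topic: str) -> float:
        return scores.get(topic, 0.0)

    def has(topic: str) -> bool:
        return score(topic) > 0

    # "either a lot of A or B or maybe C if D is present"
    wanted = score("A") >= a_lot or score("B") >= a_lot or (has("C") and has("D"))

    # "nothing that includes F or G unless there's H"
    blocked = (has("F") or has("G")) and not has("H")

    # "if I have A and C, the FG exception is less important as long as K is there"
    # Simplification: waive the block completely in that case.
    if has("A") and has("C") and has("K"):
        blocked = False

    return wanted and not blocked


examples = [
    {"A": 6},                            # a lot of A
    {"A": 6, "F": 1},                    # blocked by F
    {"A": 6, "F": 1, "H": 1},            # H lifts the block
    {"A": 6, "C": 1, "F": 1, "K": 1},    # A + C + K waive the block
    {"C": 2},                            # C without D is not enough
]
results = [sme_profile(example) for example in examples]
print(results)
```

A real template would presumably use graded weights rather than a hard waiver, but even this simplified version shows how topic scores let layered conditions be expressed and checked explicitly.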

DataScava Classifier

One of the most labor-intensive parts of training and deploying machine learning and AI systems involves cleansing and classifying raw unstructured textual data. DataScava works directly with messy data, so it does not require cleansing to successfully classify, tag and label content using DSLP and WTS. After classification, documents that do not contain defined domain-specific topics can be ignored, which further reduces the need to cleanse data.
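As a simplified picture of classifying messy text without a cleansing pass, the sketch below labels each raw document with every topic whose key terms appear in it and sets aside documents that match nothing. The function `label_documents`, the sample documents and the topic definitions are hypothetical, and real classification with DSLP and WTS is certainly more involved.

```python
import re


def label_documents(
    docs: dict[str, str], topics: dict[str, list[str]]
) -> tuple[dict[str, list[str]], list[str]]:
    """Split raw documents into labeled ones and ones with no topic hits."""
    labeled, ignored = {}, []
    for doc_id, text in docs.items():
        labels = [
            topic
            for topic, terms in topics.items()
            if any(
                re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text, re.IGNORECASE)
                for term in terms
            )
        ]
        if labels:
            labeled[doc_id] = labels
        else:
            ignored.append(doc_id)
    return labeled, ignored


# Raw input is used as-is: markup debris and odd spacing are left in place.
docs = {
    "a": "<p>RE: FW:   wire transfer  -- the BANK confirmed</p>",
    "b": "lunch menu ~~ tuesday ~~ soup",
}
topics = {"Banking": ["bank", "wire transfer"], "Viruses": ["Covid"]}
labeled, ignored = label_documents(docs, topics)
print(labeled, ignored)
```

Document “b” never matches a topic, so it lands in the ignored list and would not need any cleansing effort.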

DataScava Highlighter uses this classification data to support human-in-the-loop (HITL) verification of training data and to improve the accuracy of the taxonomy topics and their key terms on an ongoing basis. Because the classification process is automated, DataScava can be used not only during the training process but also during everyday system operation and to assess output.

DataScava Output

Users of Machine Learning or AI often find they do not understand what the system did, why it did it, or whether it gave the correct results. DataScava doesn’t attempt to create output through a hidden, black-box internal process. Instead, it empowers users to identify any subset of the raw data they’re specifically interested in. While intelligent systems replace human effort, DataScava informs and empowers it by providing the logical analytical scaffolding and boundaries to keep the human in control.