
Semantic Fingerprints: Natural Language for Enterprise


One of the most important features of a business website is its search tool. Visitors looking for the services or products on offer need to be able to find what they are looking for, but people rarely phrase queries the way the search engine expects. The current state of the art is based on statistical modeling rather than natural language understanding, and that leaves businesses scrambling to make their site’s search fit their typical user.

“A whole bunch of problems are implied by this,” Francisco Webber, CEO of Cortical.io, said. “The statistical processing model is hard to apply to a business or enterprise environment. Solutions tend to be labor-intensive, expensive, and insecure. I ended up changing the approach completely by taking a brain-based one. The human brain is the reference for natural language processing, so by understanding how the brain does it, we can get insight into how a computer can do it. This works much better than traditional methods and takes much less effort.”

Webber calls this representation of language according to the principles of the human cortex “a semantic fingerprint of text.” Cortical.io’s engine works with Boolean logic and does not rely on further machine learning. The system can represent any text in any language as a semantic fingerprint and use it to classify, filter, or search documents in a business context.


The team came out of an initial startup specializing in patent retrieval. This was sold off to partners, and part of the team regrouped to start Cortical.io in 2012.

“We used that experience to provide a better solution,” Webber said. Research funding helped them create the first prototype. In 2013, they found a business angel whose investment in that prototype brought total funding to $6.5 million, and they have since built a second version.

“To convert a word into a semantic fingerprint, we use a Boolean vector,” Webber said. “It is laid out as 128 x 128 bits, about 16K features. Our specific process adds a topology to the distribution of features, making fingerprints comparable. For example, the word ‘organ’ has a semantic fingerprint that overlaps with both ‘piano’ and ‘liver,’ because the word can be used in different contexts. Our engine is trained to discern which ‘organ’ is meant based on a collection of reference materials.”
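To make the idea concrete, here is a minimal Python sketch of that representation; the bit positions are invented purely for illustration, and Cortical.io’s actual positions come from its trained semantic map. A fingerprint is treated as the set of active cells on a 128 x 128 grid, and similarity is simply the number of cells two fingerprints share.

```python
# Minimal sketch of the idea, not Cortical.io's implementation: a semantic
# fingerprint is the set of active positions on a 128 x 128 grid
# (16,384 possible features). The positions below are invented.

GRID = 128 * 128  # 16,384 possible features per fingerprint

# Hypothetical fingerprints: each is a small set of active bit positions.
organ = {12, 87, 341, 902, 4410, 7811, 9000, 15002}
piano = {87, 341, 1201, 4410, 5555, 9000, 11020, 14500}
liver = {12, 902, 3030, 7811, 8088, 12000, 15002, 16000}

def overlap(fp_a: set[int], fp_b: set[int]) -> int:
    """Semantic similarity as the number of shared active bits."""
    return len(fp_a & fp_b)

# 'organ' shares bits with both 'piano' and 'liver', reflecting its two senses.
print(overlap(organ, piano))  # 4
print(overlap(organ, liver))  # 4
print(overlap(piano, liver))  # 0
```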

Cortical.io defines a semantic map onto which every context is folded, and the mapping holds for every word that occurs within that context. The company selected about 400,000 Wikipedia pages as a reference collection covering essentially every word. This lets them convert any word into a fingerprint and then combine the fingerprints of all the words in a sentence into a fingerprint of the sentence itself. Words can thus be compared to sentences and paragraphs, since any fingerprint can be compared to any other fingerprint.
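The aggregation step can be sketched roughly as follows. How the real semantic map assigns grid positions to contexts is not described here, so this sketch simply hashes each context to a position, and it sparsifies the combined text fingerprint by keeping the most frequently shared bits; both choices are assumptions made for illustration.

```python
# Rough sketch of building word fingerprints from contexts and combining
# them into a text fingerprint. The hash-based layout and the sparsification
# rule are stand-ins, not Cortical.io's actual method.
from collections import Counter
from hashlib import sha1

GRID = 128 * 128

def context_position(context_id: str) -> int:
    """Assumed stand-in for the semantic map: hash a context to a grid cell."""
    return int(sha1(context_id.encode()).hexdigest(), 16) % GRID

def word_fingerprint(contexts: list[str]) -> set[int]:
    """A word's fingerprint: the positions of all contexts containing it."""
    return {context_position(c) for c in contexts}

def text_fingerprint(word_fps: list[set[int]], max_bits: int = 2000) -> set[int]:
    """Combine word fingerprints into one text fingerprint.

    Takes the union of all active bits, sparsified to the bits shared by
    the most words (the exact sparsification rule is an assumption).
    """
    counts = Counter(bit for fp in word_fps for bit in fp)
    return {bit for bit, _ in counts.most_common(max_bits)}
```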


“We can throw a fingerprint at the engine and look for all contexts where that fingerprint can occur,” Webber said, “so for the word ‘organ,’ the algorithm finds all the contexts in which it appears, including ‘liver’ and ‘piano’. The building blocks are all packaged into a library so we can scale to any problem you have. We can cope with terabytes of data because the algorithm is perfectly parallel. It’s the fastest you can do on modern computers.”
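One plausible way to picture that lookup is a prebuilt reverse index from grid positions to the reference contexts active at each position; the context names below are invented for the sketch.

```python
# Sketch of "find all contexts where a fingerprint can occur", assuming a
# reverse index from grid position to reference contexts (names invented).
from collections import defaultdict

# bit position -> contexts from the reference collection active at that position
reverse_index: dict[int, set[str]] = defaultdict(set)
reverse_index[12].update({"anatomy", "transplant surgery"})
reverse_index[87].update({"church music", "keyboard instruments"})
reverse_index[341].update({"keyboard instruments"})
reverse_index[902].update({"anatomy"})

def contexts_for(fingerprint: set[int]) -> set[str]:
    """All reference contexts touched by any active bit of the fingerprint."""
    found: set[str] = set()
    for bit in fingerprint:
        found |= reverse_index.get(bit, set())
    return found

organ = {12, 87, 341, 902}
print(contexts_for(organ))
# {'anatomy', 'transplant surgery', 'church music', 'keyboard instruments'}
```

Because each bit, and each document, can be looked up independently, this kind of lookup parallelizes naturally, which is consistent with Webber’s point about coping with terabytes of data.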

This semantic fingerprint system is more efficient than other natural language processing systems: it has many more features and is more fine-grained. Webber said this is the effect of trying to do things the way the brain does them. This flexibility means that search engines can be easily tailored to a specific site, such as the high-precision search needed for manuals. Car manuals can run to a thousand pages or more, and the ordinary consumer doesn’t always know the vocabulary to search for in the manual. Semantic fingerprinting can close that vocabulary gap without lengthy input from engineers.
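As a rough illustration of how fingerprint overlap can bridge that vocabulary gap, assume the engine has already produced fingerprints for a consumer’s query and for each manual section; the fingerprints below are placeholders, not real output.

```python
# Illustrative sketch of fingerprint-based manual search: ranking reduces to
# counting shared bits, even when the query uses none of the manual's wording.

def rank_sections(query_fp: set[int],
                  sections: dict[str, set[int]]) -> list[tuple[str, int]]:
    """Return manual sections ordered by semantic overlap with the query."""
    scored = [(name, len(query_fp & fp)) for name, fp in sections.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical fingerprints for a query ("car won't start") and two sections.
query = {5, 19, 230, 4500, 9001}
sections = {
    "Ignition system troubleshooting": {5, 19, 230, 777, 9001},
    "Adjusting the headrests": {31, 4042, 15990},
}
print(rank_sections(query, sections)[0][0])  # Ignition system troubleshooting
```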

Customer support offers another example: support cases often solve similar problems. Traditional methods of matching them are complex, but once documents are adequately fingerprinted, duplicate cases can be matched in all their variations, and new support cases can be resolved with complete references to any pertinent prior data.

“Filtering in the social media space is a great example of the value of semantic fingerprinting,” Webber said, “because we want a 360-degree view of the brand. We want every tweet relevant to smartphones, and we want to process it with an analytics package. The stream runs at about 20,000 tweets per second; matching every tweet against 200 keywords with traditional methods means 4,000,000 comparisons per second. We can do 50,000 fingerprints per second and pick out the interesting messages.”
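One way to picture that filtering step, with placeholder fingerprints and an arbitrary overlap threshold standing in for whatever a production system would use:

```python
# Sketch of the filtering idea: instead of matching each tweet against 200
# keywords, compare its fingerprint once against a single topic fingerprint
# and keep it if the overlap clears a threshold. Values below are placeholders.

def is_relevant(tweet_fp: set[int], topic_fp: set[int], min_overlap: int = 3) -> bool:
    """Keep a tweet when its fingerprint shares enough bits with the topic."""
    return len(tweet_fp & topic_fp) >= min_overlap

def filter_stream(tweets, topic_fp: set[int]):
    """Yield only the tweets whose fingerprints overlap the topic fingerprint."""
    for tweet_text, tweet_fp in tweets:
        if is_relevant(tweet_fp, topic_fp):
            yield tweet_text

smartphone_topic = {7, 42, 311, 2048, 5000, 8100}
stream = [
    ("new phone battery lasts two days", {7, 42, 311, 900}),
    ("great pasta recipe", {100, 2222, 14000}),
]
print(list(filter_stream(stream, smartphone_topic)))  # keeps only the first tweet
```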

Other uses for semantic fingerprinting include highly individualized news feeds and product recommendations. Since the filtering is inexpensive, it can be offered on a per-user basis. Broadcast companies could cluster TV captions and, with live data from Nielsen, find out whether a particular topic made people stay with or leave a show. HR departments could use semantic fingerprints instead of manually screening resumes, even when matching an English job description against a French resume.

“Most of our customers come from the financial space,” Webber said, “in something like email compliance monitoring. We can find out if a message needs to be inspected for fraud or regulated items. A big New York bank is a large customer for contract intelligence. On the West Coast, a network manufacturer uses it for support functionality. Everything is rooted in the fact that we convert text into semantic fingerprints and use overlap measurements to derive business value.”

Webber said the competition has had a marketing effect of its own. IBM Watson, the best-known name in natural language processing and their main competitor, uses a conventional statistical approach.

“We get called by customers who say they’ve tried everything, building it themselves with Elasticsearch and IBM Watson, but their results are not good enough. If you get 20,000 false positives per day, you need a big team to filter through them. They come to us, and a couple of weeks later they realize our technology does high-precision, high-volume work with low false positives. Our technology solves their problem.”

This semantic approach can handle higher volumes of data and filter for meaning rather than keywords in any technology sector where a language model is used, such as speech-to-text.

“Everybody is talking about bots, but a bot doesn’t actually understand what people tell it,” Webber said. “In any application where it makes sense to understand what the text means, our technology performs well.”

Currently, Cortical.io is used in finance, fintech, insurtech, manufacturing of complex goods, and consumer goods companies. Their lean team of 15 people handles about 20 concurrent projects and has capacity for more.

“Our main approach is to create a proof-of-concept for the customer,” Webber said. “In the future, we plan smaller packages for smaller companies who want to run it as a service in the cloud. Our current focus is on enterprise deployments, at about $100K per node per year. For very large companies, we also offer a flat fee of $1.7M. We project a return this year of $5M.”

The semantic engine could serve as a component that integrates with any other documentation system, and it has many potential applications. Startups and small companies could be given an API to the functionality on a pay-per-use basis, without up-front investment or needing to know how it works. In the future, it may be implemented in hardware, enabling web-scale applications.

“Advertising is not the future of the internet,” Webber said. “The future is in real personalization, getting a system so smart you can see the world the way you want to see it instead of the way some statistical engine puts it in front of you.”

For an enterprise that uses language to communicate with customers and potential customers, this opens the door wider so more can come in.

Author:

Written by Nicki Jacoby.

