Humans respond based on all of our senses – we take our conversational cues from far more than just “Hey Siri”. Previously, this is what distinguished ‘human’ interactions, but we’re fast entering a world in which seeing, understanding and reacting is no longer the domain of humans alone. The technology driving this progression – computer vision – is developing at pace.

In this article, UneeQ’s Lead Scientist, Dr Jason Catchpole, explains what’s involved in ensuring Digital Humans have the ability to ‘see’; ‘understand’; and ‘react’ so that we’re able to go from “What can I help you with?” to “Great to see you, Jason. There’s a hole in that shirt – can I order you a new one?” If we can integrate this kind of intelligence into the digital fabric that overlays our physical world, the possibilities are endless.

Image Recognition – ‘Seeing’ What’s There

Computers process visual information very differently from humans. To recognize and classify objects, they look for meaningful features such as line segments, boundaries, shapes and other visual cues. This is why, when pointed at a cupboard, a Digital Human would classify it as such and (hopefully) not end up talking to it.
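To make "looking for line segments and boundaries" concrete, here is a minimal sketch of one classic feature extractor – Sobel edge detection – written with NumPy. This is an illustrative example, not UneeQ's pipeline; the toy image and function name are our own.

```python
import numpy as np

def sobel_edges(image: np.ndarray) -> np.ndarray:
    """Return per-pixel edge strength for a 2-D grayscale image.

    Convolving with the Sobel kernels responds strongly wherever
    intensity changes sharply -- i.e. at the line segments and
    boundaries described above as "meaningful features".
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = image[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()  # horizontal gradient
            gy[i, j] = (patch * ky).sum()  # vertical gradient
    return np.hypot(gx, gy)  # gradient magnitude

# A toy 6x6 "image": dark left half, bright right half.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
# Edge strength peaks along the vertical boundary between the two halves.
```

Features like these – edges, corners, textures – were hand-designed in classical computer vision; as the next section notes, deep learning now learns them automatically.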

Thanks to recent improvements in computer vision and machine learning, computers are more capable than ever of understanding what is happening in an image or video – enabling huge leaps forward, particularly in how people interface with machines in customer service ‘roles’. Deep neural networks (deep learning) allow them to learn features in a way more akin to how the human brain works. These networks process the individual pixels of an image, and researchers feed them as many pre-labelled images as they can in order to “teach” them to recognize similar images. This is incredibly important for Digital Humans, which need to quickly and accurately recognize when they’re talking to a human – in all their many varieties of appearance.
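The "feed pre-labelled images to teach the network" loop can be sketched at its absolute smallest: a single sigmoid unit trained on raw pixels of tiny synthetic images. Real systems use deep convolutional networks with millions of parameters; this toy (our own construction, with made-up 8×8 images) only illustrates the training principle.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(label: int) -> np.ndarray:
    """Synthetic 8x8 'photo': bright top half for class 0, bright bottom for class 1."""
    img = rng.normal(0.0, 0.1, (8, 8))
    if label == 0:
        img[:4, :] += 1.0
    else:
        img[4:, :] += 1.0
    return img.ravel()

# A pre-labelled training set, as described above.
labels = rng.integers(0, 2, 200)
X = np.stack([make_image(y) for y in labels])

# One sigmoid unit over raw pixels -- the smallest possible "network".
w = np.zeros(64)
b = 0.0
for _ in range(300):                       # gradient-descent "teaching" loop
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - labels                      # gradient of the cross-entropy loss
    w -= 0.1 * X.T @ grad / len(labels)
    b -= 0.1 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
accuracy = (preds == labels).mean()        # near-perfect on this easy toy task
```

Scaling this idea up – more layers, convolution instead of raw pixels, vastly larger labelled datasets – is what lets modern networks recognize faces across the full variety of human appearance.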

It’s this ability that is enabling accurate biometric identification and making authentication processes far less painful. Need to open your banking app? Easy – just use your face. It’s also what’s driving the disruption of traditional online shopping. Seen a cap you like? Upload a photo and find the same one using Amazon’s in-app tech. Taken a step further, Digital Humans could see you and provide fashion advice based on your current outfit, or even color matching.

Situational Awareness – ‘Understanding’ What’s There

Situational awareness is an important ability for Digital Humans, driven by complex sets of bespoke algorithms tailored to the products they’re applied to. Nuances of conversation such as eye contact are natural to us – but need to be taught to computers. To simulate convincing interaction, a Digital Human needs to determine whether the user is talking to it or to someone else in the room (or their cat). If other people in the room are having a conversation, the Digital Human needs to determine when it is being addressed so it doesn’t butt in. All of these aspects of conversation rely on computer vision techniques for detecting, tracking and recognizing faces.
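As a rough illustration of how those face-tracking signals might feed an "is the user addressing me?" decision, here is a simple heuristic sketch. The observation fields, thresholds and function names are hypothetical – real systems combine many more cues (audio direction, gaze estimation, lip reading) with learned models.

```python
from dataclasses import dataclass

@dataclass
class FaceObservation:
    """Per-frame signals a face tracker might provide (hypothetical fields)."""
    gaze_toward_camera: float   # 0.0 (looking away) .. 1.0 (direct eye contact)
    mouth_moving: bool          # simple visual speech-activity cue
    seconds_facing: float       # how long this face has been oriented at the screen

def is_addressing_digital_human(obs: FaceObservation,
                                gaze_threshold: float = 0.7,
                                dwell_threshold: float = 0.5) -> bool:
    """Heuristic: treat the user as addressing the Digital Human only when
    they are speaking AND holding eye contact, so side conversations in
    the room are ignored rather than interrupted."""
    return (obs.mouth_moving
            and obs.gaze_toward_camera >= gaze_threshold
            and obs.seconds_facing >= dwell_threshold)

# Someone chatting to a friend while glancing away: not addressing us.
side_talk = FaceObservation(gaze_toward_camera=0.2, mouth_moving=True, seconds_facing=3.0)
# Someone speaking while holding eye contact: addressing us.
direct = FaceObservation(gaze_toward_camera=0.9, mouth_moving=True, seconds_facing=1.2)
```

Requiring speech and sustained eye contact together is what stops the Digital Human from butting in on the side conversation described above.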

This means that if a parent is chatting to a Digital Human in a banking branch about ordering a new card, the Digital Human is able to distinguish when the parent is addressing it and when they’ve turned to the busy toddler next to them (“Just a moment, I need to order my new card…”). The Digital Human would also need to take cues from the parent’s body language, such as when to pause the chat to allow them time to pacify the toddler.

Emotion Recognition – ‘Reacting’ To What’s There  

Once a Digital Human has made eye contact, emotion recognition is key to sustaining natural conversation and ensuring a successful outcome for the user. Humans do this automatically, but computers rely on complex computational methodologies to get this right.

Unlike audio-only interfaces such as Siri and Google Assistant, Digital Humans can exploit their ‘senses’ to better understand the user and what is going on in the user’s environment, increasing the likelihood of a successful outcome. Because a Digital Human has ‘eyes’, it’s natural for the user to assume it can ‘see’ them (via cameras) – whereas enabling a camera on an audio or text-based interface can leave the user feeling ‘spied’ on. This visual information is a rich resource, opening many doors to improve the experience and enabling the system to intuit how the conversation is going and whether it needs to adapt. In a customer service environment, recognizing a client’s frustration early on and proactively raising ways to solve it is key. Using computer vision, it will soon be possible to build up data sets around customer satisfaction – including common pain points, frustration ‘indicators’ and self-learning behaviors to correct each of these.
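The "recognize frustration early and adapt" logic can be sketched as a small decision rule sitting downstream of an emotion recognizer. The per-frame score dictionary and thresholds here are invented for illustration; the point is that the system waits for frustration to persist across several readings before intervening, so a single frown doesn't derail the conversation.

```python
def should_offer_help(emotion_scores: dict, history: list,
                      threshold: float = 0.6, patience: int = 3) -> bool:
    """Decide whether to proactively raise ways to solve a problem.

    `emotion_scores` stands in for the per-frame output of an emotion
    recognizer (a probability per emotion). We escalate only once
    frustration has stayed above `threshold` for `patience` consecutive
    readings.
    """
    history.append(emotion_scores.get("frustration", 0.0))
    recent = history[-patience:]
    return len(recent) == patience and all(s >= threshold for s in recent)

history = []
frames = [{"frustration": 0.2}, {"frustration": 0.7},
          {"frustration": 0.8}, {"frustration": 0.9}]
decisions = [should_offer_help(f, history) for f in frames]
# Intervenes only once frustration persists: [False, False, False, True]
```

Smoothing over several frames like this trades a little responsiveness for far fewer false alarms, which matters in a customer-service setting where an unnecessary “You seem upset…” is itself frustrating.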

Fast forward a few more years, and Digital Humans will comfortably be able to chat with two clients at once – for example, a couple buying their first home. If one of them was anxious to understand the implications of an interest rate hike on the monthly budget, the Digital Human would be able to respond with empathy, taking its cues from the question as well as the look of concern – or even a glance at a wristwatch, indicating that a brief explanation would be preferred because the client has limited time.

How we ‘talk’ to computers is changing. Powered by computer vision, Digital Humans will irrevocably change how companies embody themselves and connect with and serve customers – opening up ways to connect with them on an emotional level to drive loyalty and conversion. It will also break open new opportunities in the humanitarian space. Some of these are already starting to emerge in the form of Digital Mental Health Coaches. “Hey Siri!” is, quite literally, just the beginning.