Image source: Baxter robot: http://news.mit.edu/2017/robot-learns-to-follow-orders-like-alexa-0830
While voice interaction is on track to revolutionize digital interfaces the way the touchscreen did, language processing still has its limitations. In particular, today’s systems can handle only a finite set of specific commands and lack the contextual understanding that occurs in a human conversation.
In a relevant and timely development, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have worked to build a better processing system so that a robot does not need step-by-step instructions and can instead infer meaning from the context of the commands and its environment.
Thus was born ComText, short for “commands in context,” a processing system that allows robots to understand contextual information such as language cues and the surrounding environment.
In natural language use, it’s common to say something like “pick it up.” While a person would likely be able to understand what “it” is based on the context of the situation, today’s digital assistants or robots would need more information for comprehension because the command lacks specificity.
MIT explains: “Picking it up means being able to see and identify objects, understand commands, recognize that the ‘it’ in question is the tool you put down, go back in time to remember the moment when you put down the tool, and distinguish the tool you put down from other ones of similar shapes and sizes.”
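As a rough illustration of what that kind of pronoun resolution involves, the sketch below keeps a time-ordered log of observed events and resolves “it” to the object of the most recent “put down” event. This is a hypothetical simplification for clarity, not ComText’s actual implementation; the event log, object labels, and the resolve_it function are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """One observed interaction, e.g. the user putting down a tool."""
    time: float   # seconds since the session started
    action: str   # e.g. "put_down", "pick_up"
    obj: str      # label from the object detector, e.g. "tape_measure"

# Hypothetical episodic memory: events the robot has observed, oldest first.
event_log = [
    Event(time=12.0, action="pick_up",  obj="screwdriver"),
    Event(time=30.5, action="put_down", obj="screwdriver"),
    Event(time=41.2, action="put_down", obj="tape_measure"),
]

def resolve_it(events: list[Event]) -> Optional[str]:
    """Resolve 'it' to the object of the most recent put-down event."""
    for event in reversed(events):   # walk backward in time
        if event.action == "put_down":
            return event.obj
    return None                      # no referent found; ask for clarification

print(resolve_it(event_log))  # -> "tape_measure"
```

Even this toy version shows why memory matters: without a record of what happened and when, there is nothing for the pronoun to attach to.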
While current digital assistants like Alexa and Siri are revolutionizing how we interact with technology, for robotic personal assistants to advance to the next generation, this type of contextual understanding is essential.
The CEO of RAGE Frameworks, Venkat Srinivasan, breaks down this interaction challenge into three main points.
First, many AI tools like IBM Watson and Google’s AlphaGo have difficulty processing human language because “the large majority of the current implementations approach text as data, not as language.”
The second point is context: “The right context can only be developed if the technology focuses on the language structure, not just on the words in the text, as most current technologies seem to be doing.”
The last challenge is logic: the “traceability of reasoning that the solution deploys to reach its conclusion.”
To develop ComText, a group of researchers used a “probabilistic model that enables incremental grounding of natural language,” according to the full research paper.
“The main contribution is this idea that robots should have different kinds of memory, just like people,” says Andrei Barbu, a lead researcher. “We have the first mathematical formulation to address this issue, and we’re exploring how these two types of memory play and work off of each other.”
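To make the two kinds of memory concrete, here is a minimal, hypothetical sketch rather than ComText’s actual formulation: declarative memory holds facts asserted in language (such as which box is “mine”), episodic memory records what was observed and when, and a simple score combines the two to ground a command like “pick up my box.” The object names, weights, and scoring heuristic are all illustrative assumptions.

```python
# Declarative memory: facts asserted in earlier utterances ("The box is mine.").
declarative = {("box_1", "owner"): "me", ("box_2", "owner"): "other_person"}

# Episodic memory: how recently each object was seen on the table (seconds ago).
episodic = {"box_1": 5.0, "box_2": 120.0, "mug_1": 3.0}

def score(obj: str, wanted_type: str, wanted_owner: str) -> float:
    """Heuristic stand-in for a probabilistic grounding score."""
    type_match = 1.0 if obj.startswith(wanted_type) else 0.0
    owner_match = 1.0 if declarative.get((obj, "owner")) == wanted_owner else 0.0
    recency = 1.0 / (1.0 + episodic.get(obj, float("inf")))  # prefer recently seen
    return type_match * (0.6 * owner_match + 0.4 * recency)

# Ground "pick up my box": rank candidates by how well they fit both memories.
candidates = sorted(episodic, key=lambda o: score(o, "box", "me"), reverse=True)
print(candidates[0])  # -> "box_1"
```

The point of the toy score is only to show the interplay the researchers describe: a statement made minutes ago (declarative) and an observation made seconds ago (episodic) both shape which object the command refers to.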
To test the tool, the researchers used a two-armed humanoid called the Baxter Research Robot that observed the workspace using a cross-calibrated Kinect version 2 RGB-D sensor running at ~20 Hz with 1080×760 resolution. The setup was outfitted with an Amazon Echo Dot to convert spoken commands into text.
To study how effectively the machine can assess contextual cues, independent operators were asked to direct the robot to perform five tasks, resulting in 96 short videos of the user-robot interactions. After analyzing the results, researchers found that 90.2% to 94.7% of the time, the inferred command was executed with the “correct action on the correct objects in the intended location.”
Failure occurred mostly due to errors in perception, either from an obstructed view or the movement of objects directly toward or away from the camera.
The experiment successfully demonstrated ComText’s ability to gather facts from previous voice statements, combine them with visual observations, and track the movements of objects. This accrued knowledge is then refined over time through further interactions and observations.
While the ethical questions around AI bias remain, ComText makes significant progress toward a robot with a more human-like capacity for interaction. However, the reality is that we’re still a long way from a fully functioning assistant robot that can comprehend the nuances of human interaction.
Rohan Paul, a lead researcher, says, “At the moment, we’re not building products.” Instead, the team is continuing to develop the ability for machines to gather context and reach conclusions on a larger scale. “What we really want to do is get humans and robots together to build something more complex,” said Paul.