Visual recognition has improved significantly thanks to recent advances in deep visual representations. In its most popular form, recognition is performed on web data, such as images and videos uploaded by users to platforms like YouTube or Facebook. However, perception is inherently tied to action, and active perception is vital for robotics: robots perceive in order to act and act in order to perceive. At RE•WORK's Deep Learning in Robotics Summit in San Francisco last month, Georgia Gkioxari of Facebook AI Research (FAIR) introduced recent work under the name "Embodied Vision". The term is used in contrast to conventional Computer Vision: rather than analyzing static images, Embodied Vision refers to the perceptual and cognitive abilities of an embodied agent (a robot) acting in its environment. The emphasis is not only on the robot perceiving the objects around it but also on understanding their meaning, as a human does.
FAIR trains robots from three perspectives:
To perform this task, a wide range of AI techniques is required for the robot's brain: Perception, Language Understanding, Navigation, Commonsense Reasoning, and Grounding of words and actions. Gkioxari's team succeeded in building an Embodied Question Answering (EmbodiedQA) model and executing the task in the 3D virtual environment House3D.
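At its core, the EmbodiedQA task follows a simple loop: receive a question, navigate the environment until the answer becomes visible, then answer. The toy sketch below makes that loop concrete in a one-dimensional "hallway"; every class and name here is illustrative, not FAIR's actual code or the House3D API.

```python
# Toy sketch of an EmbodiedQA episode (illustrative only, not FAIR's code).
# The agent walks forward until perception reports something, then answers.

class HallwayEnv:
    """Agent starts at cell 0; the answer ('red door') sits at cell 3."""
    def __init__(self):
        self.pos = 0
        self.objects = {3: "red door"}

    def reset(self):
        self.pos = 0
        return self.observe()

    def observe(self):
        # Perception: what, if anything, is visible at the current cell.
        return self.objects.get(self.pos)

    def step(self, action):
        if action == "forward":
            self.pos += 1
        return self.observe()

def embodied_qa_episode(env, max_steps=10):
    obs = env.reset()
    for _ in range(max_steps):
        if obs is not None:        # target found: stop navigating
            return obs             # ...and answer the question
        obs = env.step("forward")  # otherwise keep exploring
    return "unknown"

print(embodied_qa_episode(HallwayEnv()))  # -> red door
```

The real model replaces the hand-coded `observe`/`step` logic with learned vision and navigation modules, but the stop-and-answer control flow is the same.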
In this model, the robot's brain consists of a Planner and a Controller, both trained with deep reinforcement learning. The Planner is the commander: it decides the direction of travel (forward, backward, left, right). The Controller is the executor: it determines how far to advance (the number of steps) according to the Planner's instruction. The Planner is built on a Long Short-Term Memory (LSTM) network and, as noted above, is trained by deep reinforcement learning, acquiring common sense through trial and error much as humans do.
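The division of labor between the two modules can be sketched as follows. In the actual model the Planner and Controller are LSTM networks trained with deep reinforcement learning; in this runnable sketch both are hand-coded rules so the control flow stands on its own, and all function names are illustrative assumptions.

```python
# Schematic Planner-Controller split (illustrative; the real modules
# are learned LSTMs, not these hand-coded rules).

def planner(position, goal):
    """Commander: choose a direction of travel toward the goal."""
    dx = goal[0] - position[0]
    dy = goal[1] - position[1]
    if abs(dx) >= abs(dy):
        return ("right" if dx > 0 else "left") if dx else "stop"
    return "forward" if dy > 0 else "back"

def controller(position, goal, direction):
    """Executor: decide how many steps to take in that direction."""
    moves = {"right": (1, 0), "left": (-1, 0),
             "forward": (0, 1), "back": (0, -1)}
    if direction == "stop":
        return position
    dx, dy = moves[direction]
    # Advance until the chosen axis is aligned with the goal.
    steps = abs(goal[0] - position[0]) if dx else abs(goal[1] - position[1])
    return (position[0] + dx * steps, position[1] + dy * steps)

def navigate(start, goal):
    position = start
    while position != goal:
        direction = planner(position, goal)               # Planner commands
        position = controller(position, goal, direction)  # Controller executes
    return position

print(navigate((0, 0), (3, 2)))  # -> (3, 2)
```

The point of the hierarchy is that the Planner reasons at the coarse level of directions while the Controller handles the fine-grained step counts, which is exactly the split the paragraph above describes.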
FAIR is developing intelligent robots through these studies. AI is evolving rapidly: image recognition now exceeds human performance, and AI stunned the world by defeating the human champion at Go. Yet for all this impressive ability, AI is far from truly intelligent. It does not understand the meaning of objects (e.g., what a cat is) and can perform only narrow tasks such as Go (AlphaGo cannot drive a car). Today's robots cannot even move around a house the way a human does. In other words, the development of AI that can think intelligently like a human has stalled, with no breakthrough in sight.
For this reason, FAIR is developing AI with an entirely different approach: training it in a 3D virtual environment that simulates the real world, with the aim of having it learn complex tasks on its own. By learning in such an environment, the AI is meant to develop human-like vision, hold natural conversations, plan ahead, and ultimately reason intelligently. This requires a virtual world that looks like the real one, so FAIR is building 3D environments rendered as faithfully as if the inside of a house had been photographed. OpenAI and Google DeepMind are taking the same approach, and the race to develop deep reinforcement learning in elaborate virtual environments is intensifying.
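The "trial and error in a simulated world" idea can be illustrated at miniature scale with tabular Q-learning, a classical precursor of deep reinforcement learning: the agent acts, observes a reward, and updates its value estimates. Deep RL replaces the table below with a neural network, but the learning loop is the same. All parameters and the corridor environment are illustrative.

```python
# Tabular Q-learning on a 5-cell corridor (illustrative stand-in for
# deep RL): the agent learns by trial and error to walk to the reward.

import random

N_STATES = 5          # cells 0..4; the reward sits at cell 4
ACTIONS = (-1, +1)    # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):                      # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        if random.random() < EPSILON:     # explore a random action
            a = random.choice(ACTIONS)
        else:                             # exploit current knowledge
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s2 == N_STATES - 1 else 0.0
        best_next = max(q[(s2, b)] for b in ACTIONS)
        # Standard Q-learning update toward reward + discounted future value.
        q[(s, a)] += ALPHA * (reward + GAMMA * best_next - q[(s, a)])
        s = s2

# After training, the greedy policy walks straight toward the goal.
policy = [max(ACTIONS, key=lambda act: q[(s, act)]) for s in range(N_STATES - 1)]
print(policy)
```

Early episodes are long random walks; once the reward propagates back through the table, the agent heads straight for the goal — the same learn-by-acting dynamic, writ small, that FAIR exploits in rich 3D environments.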
Making robots' brains intelligent would fundamentally change human life. Facebook developed the virtual assistant "M" but has since stopped offering it as a product. M was conceived as a hotel-concierge-like service that could answer any question, but the range of conversation topics was too wide for the AI to handle. Embodied Vision is an important foundational technology for virtual assistants and smart speakers. Moreover, if this research goes well, a roadmap for home-robot development will come into view, and the market is watching whether Facebook will build intelligent home robots.