Google is using Gemini AI to train its robots to navigate spaces and complete tasks more capably. In a recent research paper, the DeepMind robotics team described how Gemini 1.5 Pro's long context window, which determines how much information an AI model can process at once, lets users communicate with its RT-2 robots more easily through natural-language instructions.
The researchers film a video tour of a designated area, such as a home or office, then use Gemini 1.5 Pro to have the robot "watch" the footage and learn about its surroundings. The robot can then carry out commands based on what it has observed, responding with verbal and/or image outputs. For example, when a user holds up a phone and asks, "Where can I charge this?", the robot can guide them to a power outlet. DeepMind's Gemini-powered robot reportedly achieved a 90 percent success rate on more than 50 user instructions in an operating area of over 9,000 square feet.
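To illustrate the general pattern (this is a minimal sketch using Google's public google-generativeai Python SDK, not DeepMind's actual research pipeline), one can upload a tour video, let Gemini 1.5 Pro ingest it, and then ask a navigation question grounded in the footage; the file name, API key, and prompt here are placeholders:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Upload a video tour of the space; the long context window lets
# the model take in the entire walkthrough at once.
tour = genai.upload_file("office_tour.mp4")  # hypothetical file
while tour.state.name == "PROCESSING":  # wait for server-side processing
    time.sleep(5)
    tour = genai.get_file(tour.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Ask a navigation question grounded in the tour footage.
response = model.generate_content([
    tour,
    "Based on this tour, where can someone charge a phone, "
    "and how would they get there from the entrance?",
])
print(response.text)
```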
The researchers also found "preliminary evidence" that Gemini 1.5 Pro enables the droids to plan tasks that go beyond simple navigation. For example, when a user with a pile of Coke cans on their desk asks the droid whether their favorite drink is available, Gemini "knows that the robot should navigate to the fridge, inspect if there are Cokes, and then return to the user to report the result," according to the researchers. DeepMind says it plans to investigate these results further.
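For a flavor of what eliciting such a plan might look like, here is a hedged sketch, again against the public SDK rather than DeepMind's research stack; the prompt wording and scenario framing are assumptions modeled on the paper's example:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

# Ask the model to decompose a vague user request into an ordered
# plan a mobile robot could execute before answering.
response = model.generate_content(
    "A user with empty Coke cans on their desk asks: 'Is my favorite "
    "drink still available?' You control a mobile robot in this office. "
    "List, as numbered steps, what the robot should do before answering."
)
print(response.text)  # e.g. navigate to fridge -> check for Coke -> report back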
The video demonstrations Google provided are impressive, though the research paper acknowledges that the clean cuts after the droid acknowledges each request conceal the fact that the robot takes between 10 and 30 seconds to process each instruction.