Robots Learn to Mimic Human Actions with Language Model Integration


TEHRAN (Tasnim) – Scientists at the University of Tokyo have achieved a breakthrough by integrating large language models (LLMs) with a humanoid robot, marking a significant stride toward more humanlike gestures that does not rely on conventional, hardware-driven controls.

Alter3, an updated version of a humanoid robot first introduced in 2016, is at the center of this work. Using GPT-4, the researchers direct the robot through a range of actions, including taking selfies, tossing a ball, munching popcorn, and even strumming an air guitar.

Previously, each of these actions had to be coded individually. Integrating GPT-4 removes that constraint by allowing the robot to understand and act on natural-language instructions.

According to the researchers' recent study, work on AI-driven robots has so far focused mainly on basic human-machine communication inside a computer, using LLMs to produce lifelike responses. The new integration goes further, enabling direct control by mapping verbal descriptions of human actions onto the robot's body through program code, which the authors describe as a groundbreaking shift in robotic technology.

Alter3 has 43 axes of musculoskeletal movement that allow intricate upper-body motions and detailed facial expressions. Although the robot cannot walk and remains stationary, it can produce vivid gestures.

Previously, coordinating its many joints meant writing repetitive, low-level code for every action. The researchers say that LLM integration frees them from this labor-intensive process.

Now they can describe a desired movement in plain language, and the LLM generates Python code that runs on Alter3's android engine. Alter3 can also retain these activities in memory, allowing the researchers to refine them over time so that movements become faster, smoother, and more precise.
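In rough outline, that text-to-motion loop might look like the sketch below. The OpenAI chat call is a real API, but the system prompt, the set_axis command format, and the send_to_android_engine helper are illustrative assumptions, since the article does not detail Alter3's actual control interface.

    # Minimal sketch of the text-to-motion loop, assuming the OpenAI Python SDK.
    # The robot-side helper and the axis-command format are hypothetical placeholders.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM_PROMPT = (
        "You control a humanoid robot with 43 movement axes. "
        "Reply only with Python code that calls set_axis(axis_id, value) "
        "to pose the robot for the user's instruction."
    )

    def instruction_to_code(instruction: str) -> str:
        """Ask the LLM to turn a natural-language instruction into motion code."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": instruction},
            ],
        )
        return response.choices[0].message.content

    def send_to_android_engine(code: str) -> None:
        """Hypothetical stand-in for executing the generated code on the robot."""
        print("Generated motion code:\n", code)

    if __name__ == "__main__":
        send_to_android_engine(instruction_to_code("Take a selfie with a joyful smile."))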

One example the authors provide describes natural-language instructions given to Alter3 for taking a selfie. The instructions translate into dynamic movements and expressions, such as adopting a joyful smile, widening the eyes in excitement, and positioning the "phone" hand to mimic a selfie.
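As a purely illustrative guess at what the generated pose code for that selfie instruction could look like (the axis numbers, value ranges, and set_axis stub below are assumptions, not Alter3's documented API):

    # Hypothetical example of LLM-generated pose code for the selfie instruction.
    # Axis numbers, value ranges, and the set_axis stub are illustrative only.
    def set_axis(axis_id: int, value: float) -> None:
        """Stub for commanding one of the robot's 43 movement axes (0.0 to 1.0)."""
        print(f"axis {axis_id} -> {value}")

    def take_selfie_pose() -> None:
        set_axis(15, 0.8)  # raise the right arm to hold the imaginary phone
        set_axis(18, 0.6)  # bend the elbow so the "phone" faces the head
        set_axis(3, 0.9)   # widen the eyes to show excitement
        set_axis(7, 0.7)   # curl the mouth corners into a joyful smile
        set_axis(1, 0.2)   # tilt the head slightly toward the phone hand

    take_selfie_pose()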

The researchers emphasize that utilizing LLMs in robotics research redefines human-robot collaboration, envisaging more intelligent, adaptable, and personable robotic entities. Injecting humor into Alter3's activities, like simulating eating someone else's popcorn and conveying embarrassment, underscores the potential for enhanced human-robot interaction.

Furthermore, the camera-equipped Alter3 can learn from human responses, in a manner the researchers liken to neonatal imitation in newborns. This zero-shot learning capacity of GPT-4-integrated robots stands to reshape human-robot boundaries, paving the way for more sophisticated and adaptable robotic entities.

The paper, "From Text to Motion: Grounding GPT-4 in a Humanoid Robot 'Alter3,'" authored by Takahide Yoshida, Atsushi Masumori, and Takashi Ikegami, is accessible on the preprint server arXiv.