Walk-the-Talk: LLM-driven Pedestrian Motion Generation

Exploring the capabilities of Walk-the-Talk and its potential applications in autonomous systems, robotics, and animation.

Walk-the-Talk in Action

A non-attentive pedestrian

Example prompt: A pedestrian crossing the street suddenly stops upon encountering an oncoming vehicle from the left, apologises, and gives way by making hand gestures

A cop monitoring an accident

Example prompt: A person waving their right hand, acknowledging and guiding a vehicle

A jaywalker emerging between parked vehicles

Example prompt: A jaywalking pedestrian crosses the street while looking both ways, suddenly stops upon encountering an oncoming vehicle from the left, apologises, and walks back anxiously

An intoxicated VRU causing a nuisance

Example prompt: A drunk pedestrian on the street performs random intoxicated motion.
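The prompts above are free-form natural-language scenario descriptions. As a minimal, hypothetical sketch of how such prompts could drive a text-to-motion generator, the snippet below maps each prompt to a motion clip, i.e. a sequence of per-frame skeleton poses that a simulator such as CARLA could replay. Note that `generate_motion`, `MotionClip`, and the 22-joint / 196-frame defaults are illustrative assumptions loosely modeled on HumanML3D conventions, not the released Walk-the-Talk API.

```python
# Hypothetical sketch: turning textual scenario prompts into pedestrian motion clips.
# The generator here is a placeholder that returns a static pose so the sketch runs;
# a real implementation would encode the prompt (e.g. with an LLM-based text encoder)
# and decode a full motion sequence.

from dataclasses import dataclass
from typing import List


@dataclass
class MotionClip:
    fps: int                          # playback rate of the generated clip
    joints: List[List[List[float]]]   # (frames, joints, xyz) skeleton poses


def generate_motion(prompt: str, num_frames: int = 196, num_joints: int = 22) -> MotionClip:
    """Placeholder text-to-motion generator (ignores `prompt`, returns a static pose)."""
    frame = [[0.0, 0.0, 0.0] for _ in range(num_joints)]
    return MotionClip(fps=20, joints=[frame for _ in range(num_frames)])


prompts = [
    "A pedestrian crossing the street suddenly stops upon encountering an oncoming "
    "vehicle from the left, apologises, and gives way by making hand gestures",
    "A person waving their right hand, acknowledging and guiding a vehicle",
    "A drunk pedestrian on the street performs random intoxicated motion.",
]

clips = [generate_motion(p) for p in prompts]   # one motion clip per scenario
```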

Walk-the-Talk Framework Overview

Walk-the-Talk framework overview

Paper Abstract

In the field of autonomous driving, a key challenge is the "reality gap": transferring knowledge gained in simulation to real-world settings. Despite various approaches to mitigate this gap, there is a notable absence of solutions targeting agent behavior generation, which is crucial for mimicking the spontaneous, erratic, and realistic actions of traffic participants. Recent advancements in Generative AI have enabled the representation of human activities in semantic space and the generation of realistic human motion from textual descriptions. Despite current limitations such as modality constraints, motion sequence length, resource demands, and data specificity, there is an opportunity to innovate and apply these techniques in the intelligent vehicles domain. We propose Walk-the-Talk, a motion generator utilizing Large Language Models (LLMs) to produce reliable pedestrian motions for high-fidelity simulators like CARLA. Thus, we contribute to autonomous driving simulations by aiming to scale realistic, diverse long-tail agent motion data, currently a gap in training datasets. We employ Motion Capture (MoCap) techniques to develop the Walk-the-Talk dataset, which illustrates a broad spectrum of pedestrian behaviors in street-crossing scenarios, ranging from standard walking patterns to extreme behaviors such as drunk walking and near-crash incidents. By utilizing this new dataset within an LLM, we facilitate the creation of realistic pedestrian motion sequences, a capability previously unattainable. Additionally, our findings demonstrate that leveraging the Walk-the-Talk dataset enhances cross-domain generalization and significantly improves the Fréchet Inception Distance (FID) score by approximately 15% on the HumanML3D dataset.
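For context, the Fréchet Inception Distance mentioned above compares the Gaussian statistics of feature embeddings extracted from real and generated motion; lower values indicate that the generated motion's feature distribution lies closer to the real data. Its standard form is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
             + \operatorname{Tr}\!\bigl( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \bigr)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the real and generated feature embeddings, respectively.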

Citation

@INPROCEEDINGS{10588860,
  author={Ramesh, Mohan and Flohr, Fabian B.},
  booktitle={2024 IEEE Intelligent Vehicles Symposium (IV)},
  title={Walk-the-Talk: LLM driven pedestrian motion generation},
  year={2024},
  pages={3057-3062},
  keywords={Legged locomotion;Training;Pedestrians;Generative AI;Large language models;Semantics;Motion capture},
  doi={10.1109/IV55156.2024.10588860}
}