Robotics 35
☆ UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
We introduce UniOcc, a comprehensive, unified benchmark for occupancy
forecasting (i.e., predicting future occupancies based on historical
information) and current-frame occupancy prediction from camera images. UniOcc
unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and
high-fidelity driving simulators (i.e., CARLA, OpenCOOD), providing 2D/3D
occupancy labels with per-voxel flow annotations and support for cooperative
autonomous driving. For evaluation, unlike existing studies that rely on
suboptimal pseudo labels, UniOcc incorporates novel metrics
that do not depend on ground-truth occupancy, enabling robust assessment of
additional aspects of occupancy quality. Through extensive experiments on
state-of-the-art models, we demonstrate that large-scale, diverse training data
and explicit flow information significantly enhance occupancy prediction and
forecasting performance.
comment: 14 pages; Dataset: https://huggingface.co/datasets/tasl-lab/uniocc;
Code: https://github.com/tasl-lab/UniOcc
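To make the label format concrete, here is a minimal sketch of a voxelized occupancy frame with per-voxel flow, the kind of annotation UniOcc describes; the grid shape, class IDs, and field layout are illustrative assumptions, not the dataset's actual schema.

```python
import numpy as np

# Hypothetical single-frame layout: semantic occupancy plus per-voxel 3D flow.
# Grid resolution and class IDs are assumptions, not UniOcc's schema.
X, Y, Z = 200, 200, 16
occupancy = np.zeros((X, Y, Z), dtype=np.uint8)    # class per voxel (0 = free)
flow = np.zeros((X, Y, Z, 3), dtype=np.float32)    # per-voxel motion in m/frame

# A moving vehicle: its occupied voxels carry a shared forward flow vector.
occupancy[100:110, 95:105, 0:4] = 1                # class 1 = "vehicle" (assumed)
flow[100:110, 95:105, 0:4] = np.array([0.5, 0.0, 0.0], dtype=np.float32)

# Forecasting then means predicting future occupancy grids from a history
# of such (occupancy, flow) frames.
history = [(occupancy, flow)]
```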
☆ Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation
Abhiram Maddukuri, Zhenyu Jiang, Lawrence Yunliang Chen, Soroush Nasiriany, Yuqi Xie, Yu Fang, Wenqi Huang, Zu Wang, Zhenjia Xu, Nikita Chernyadev, Scott Reed, Ken Goldberg, Ajay Mandlekar, Linxi Fan, Yuke Zhu
Large real-world robot datasets hold great potential to train generalist
robot models, but scaling real-world human data collection is time-consuming
and resource-intensive. Simulation has great potential in supplementing
large-scale data, especially with recent advances in generative AI and
automated data generation tools that enable scalable creation of robot behavior
datasets. However, training a policy solely in simulation and transferring it
to the real world often demands substantial human effort to bridge the reality
gap. A compelling alternative is to co-train the policy on a mixture of
simulation and real-world datasets. Preliminary studies have recently shown
this strategy to substantially improve the performance of a policy over one
trained on a limited amount of real-world data. Nonetheless, the community
lacks a systematic understanding of sim-and-real co-training and what it takes
to reap the benefits of simulation data for real-robot learning. This work
presents a simple yet effective recipe for utilizing simulation data to solve
vision-based robotic manipulation tasks. We derive this recipe from
comprehensive experiments that validate the co-training strategy on various
simulation and real-world datasets. Using two domains--a robot arm and a
humanoid--across diverse tasks, we demonstrate that simulation data can enhance
real-world task performance by an average of 38%, even with notable differences
between the simulation and real-world data. Videos and additional results can
be found at https://co-training.github.io/
comment: Project website: https://co-training.github.io/
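To make the co-training setup concrete, the sketch below samples each training batch from a fixed mixture of simulation and real-world data. The mixing ratio and list-based datasets are illustrative assumptions, not the paper's recipe.

```python
import random

def cotrain_batches(sim_data, real_data, sim_ratio=0.8, batch_size=32):
    """Yield batches that mix simulated and real samples at a fixed ratio.

    sim_ratio is a hypothetical knob; the paper's experiments study how
    such mixing choices affect real-world task performance.
    """
    n_sim = int(batch_size * sim_ratio)
    while True:
        batch = (random.choices(sim_data, k=n_sim)
                 + random.choices(real_data, k=batch_size - n_sim))
        random.shuffle(batch)
        yield batch

# Usage: a large simulated dataset supplements a small real one.
sim = [("sim", i) for i in range(10_000)]
real = [("real", i) for i in range(200)]
first_batch = next(cotrain_batches(sim, real))
```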
☆ Pro-Routing: Proactive Routing of Autonomous Multi-Capacity Robots for Pickup-and-Delivery Tasks
We consider a multi-robot setting, where we have a fleet of multi-capacity
autonomous robots that must service spatially distributed pickup-and-delivery
requests with fixed maximum wait times. Requests can either be scheduled ahead
of time or enter the system in real time. In this setting, stability
for a routing policy is defined as the cost of the policy being uniformly
bounded over time. Most previous work either solves the problem offline to
theoretically maintain stability or considers dynamically arriving requests
at the expense of theoretical guarantees on stability. In this paper, we
aim to bridge this gap by proposing a novel proactive rollout-based routing
framework that adapts to real-time demand while still provably maintaining the
stability of the learned routing policy. We derive provable stability
guarantees for our method by proposing a fleet sizing algorithm that obtains a
sufficiently large fleet that ensures stability by construction. To validate
our theoretical results, we consider a case study on real ride requests for
Harvard's evening Van System. We also evaluate the performance of our framework
using the currently deployed smaller fleet size. In this smaller setup, we
compare against the currently deployed routing algorithm, greedy heuristics,
and Monte-Carlo-Tree-Search-based algorithms. Our empirical results show that
our framework maintains stability when we use the sufficiently large fleet size
found in our theoretical results. For the smaller currently deployed fleet
size, our method services 6% more requests than the closest baseline while
reducing median passenger wait times by 33%.
comment: 25 pages, 7 figures, and 1 table
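The fleet-sizing idea can be caricatured as follows: grow the fleet until the simulated policy cost stays uniformly bounded over the horizon. The cost hook, bound, and toy model below are placeholders, not the paper's algorithm or guarantees.

```python
def minimal_stable_fleet(simulate_cost, max_fleet=100, horizon=1000, bound=1e4):
    """Return the smallest fleet size whose cost stays under `bound`.

    simulate_cost(fleet_size, t) is a hypothetical hook returning the
    cumulative policy cost after t steps with the given fleet size.
    """
    for fleet_size in range(1, max_fleet + 1):
        costs = (simulate_cost(fleet_size, t) for t in range(1, horizon + 1))
        if max(costs) < bound:          # uniformly bounded over the horizon
            return fleet_size
    raise RuntimeError("no stable fleet size found up to max_fleet")

# Toy cost: undersized fleets accumulate backlog linearly; adequate ones don't.
toy_cost = lambda k, t: (10 - k) * t if k < 10 else 50.0
print(minimal_stable_fleet(toy_cost, bound=100))    # -> 10
```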
☆ AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World
Scalable and reproducible policy evaluation has been a long-standing
challenge in robot learning. Evaluations are critical to assess progress and
build better policies, but evaluation in the real world, especially at a scale
that would provide statistically reliable results, is costly in terms of human
time and hard to obtain. Evaluation of increasingly generalist robot policies
requires an increasingly diverse repertoire of evaluation environments, making
the evaluation bottleneck even more pronounced. To make real-world evaluation
of robotic policies more practical, we propose AutoEval, a system to
autonomously evaluate generalist robot policies around the clock with minimal
human intervention. Users interact with AutoEval by submitting evaluation jobs
to the AutoEval queue, much like how software jobs are submitted with a cluster
scheduling system, and AutoEval will schedule the policies for evaluation
within a framework supplying automatic success detection and automatic scene
resets. We show that AutoEval can nearly fully eliminate human involvement in
the evaluation process, permitting around-the-clock evaluations, and that the
evaluation results correspond closely to ground-truth evaluations conducted by
hand. To facilitate the evaluation of generalist policies in the robotics
community, we provide public access to multiple AutoEval scenes in the popular
BridgeData robot setup with WidowX robot arms. In the future, we hope that
AutoEval scenes can be set up across institutions to form a diverse and
distributed evaluation network.
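The queue-style workflow reads like a tiny cluster scheduler; the stub below shows the shape of that interaction. All names and callbacks are illustrative, not AutoEval's API.

```python
import queue

eval_queue = queue.Queue()

def submit(policy_name, episodes=20):
    """Enqueue an evaluation job, as one would submit a cluster job."""
    eval_queue.put({"policy": policy_name, "episodes": episodes})

def run_scheduler(rollout, detect_success, reset_scene):
    """Drain the queue; the callbacks stand in for robot-side automation."""
    while not eval_queue.empty():
        job = eval_queue.get()
        successes = 0
        for _ in range(job["episodes"]):
            trajectory = rollout(job["policy"])     # execute on the real robot
            successes += int(detect_success(trajectory))
            reset_scene()                           # automatic scene reset
        print(job["policy"], successes / job["episodes"])

submit("bridge_policy_v1")                          # hypothetical policy name
run_scheduler(rollout=lambda p: [],
              detect_success=lambda t: True,
              reset_scene=lambda: None)
```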
☆ Pseudo-Random UAV Test Generation Using Low-Fidelity Path Simulator
Simulation-based testing provides a safe and cost-effective environment for
verifying the safety of Uncrewed Aerial Vehicles (UAVs). However, simulation
can be resource-consuming, especially when High-Fidelity Simulators (HFS) are
used. To optimise simulation resources, we propose a pseudo-random test
generator that uses a Low-Fidelity Simulator (LFS) to estimate UAV flight
paths. This work simplifies the PX4 autopilot HFS to develop a LFS, which
operates one order of magnitude faster than the HFS. Test cases predicted to
cause safety violations in the LFS are subsequently validated using the HFS.
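The two-stage pipeline amounts to cheap screening followed by expensive confirmation; the sketch below shows that structure with a toy safety check. The mission format and violation predicate are invented for illustration.

```python
import random

def generate_test_case(rng):
    """Pseudo-random mission: a short list of 2D waypoints (illustrative)."""
    return [(rng.uniform(-100, 100), rng.uniform(-100, 100)) for _ in range(5)]

def screen_with_lfs(cases, lfs_violates):
    """Keep only the cases the low-fidelity simulator flags as unsafe."""
    return [c for c in cases if lfs_violates(c)]

rng = random.Random(42)
cases = [generate_test_case(rng) for _ in range(1000)]

# Toy LFS check: flag missions whose path passes near a keep-out zone.
lfs_violates = lambda wps: any(abs(x) < 5 and abs(y) < 5 for x, y in wps)
suspects = screen_with_lfs(cases, lfs_violates)

# Only the suspects would be re-run in the high-fidelity PX4 simulator.
print(f"{len(suspects)} of {len(cases)} cases forwarded to the HFS")
```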
☆ Reinforcement Learning for Safe Autonomous Two Device Navigation of Cerebral Vessels in Mechanical Thrombectomy
Harry Robertshaw, Benjamin Jackson, Jiaheng Wang, Hadi Sadati, Lennart Karstensen, Alejandro Granados, Thomas C Booth
Purpose: Autonomous systems in mechanical thrombectomy (MT) hold promise for
reducing procedure times, minimizing radiation exposure, and enhancing patient
safety. However, current reinforcement learning (RL) methods only reach the
carotid arteries, are not generalizable to other patient vasculatures, and do
not consider safety. We propose a safe dual-device RL algorithm that can
navigate beyond the carotid arteries to cerebral vessels.
Methods: We used the Simulation Open Framework Architecture to represent the
intricacies of cerebral vessels, and a modified Soft Actor-Critic RL algorithm
to learn, for the first time, the navigation of micro-catheters and
micro-guidewires. We incorporate patient safety metrics into our reward
function by integrating guidewire tip forces. Inverse RL is used with
demonstrator data on 12 patient-specific vascular cases.
Results: Our simulation demonstrates successful autonomous navigation within
unseen cerebral vessels, achieving a 96% success rate, 7.0 s procedure time, and
0.24 N mean forces, well below the proposed 1.5 N vessel rupture threshold.
Conclusion: To the best of our knowledge, our proposed autonomous system for
MT two-device navigation reaches cerebral vessels, considers safety, and is
generalizable to unseen patient-specific cases for the first time. We envisage
future work will extend the validation to vasculatures of different complexity
and on in vitro models. While our contributions pave the way towards deploying
agents in clinical settings, safety and trustworthiness will be crucial
elements to consider when proposing new methodology.
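One plausible way to fold guidewire tip force into an RL objective, in the spirit of the safety shaping described above, is sketched below; the weights, bonuses, and penalty structure are invented, with only the 1.5 N threshold taken from the abstract.

```python
def shaped_reward(progress_m, tip_force_n, reached_target,
                  force_weight=0.5, rupture_threshold_n=1.5):
    """Reward navigation progress while penalizing guidewire tip force.

    The 1.5 N rupture threshold is quoted in the abstract; everything
    else here is an illustrative assumption.
    """
    reward = progress_m                      # distance gained toward target
    reward -= force_weight * tip_force_n     # continuous safety penalty
    if tip_force_n >= rupture_threshold_n:
        reward -= 100.0                      # hard penalty at the safety limit
    if reached_target:
        reward += 10.0
    return reward

print(shaped_reward(progress_m=0.02, tip_force_n=0.24, reached_target=False))
```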
☆ Graph Neural Network-Based Predictive Modeling for Robotic Plaster Printing
Diego Machain Rivera, Selen Ercan Jenny, Ping Hsun Tsai, Ena Lloret-Fritschi, Luis Salamanca, Fernando Perez-Cruz, Konstantinos E. Tatsis
This work proposes a Graph Neural Network (GNN) modeling approach to predict
the resulting surface from a particle-based fabrication process. The latter
consists of spray-based printing of cementitious plaster on a wall and is
facilitated with the use of a robotic arm. The predictions are computed using
the robotic arm trajectory features, such as position, velocity and direction,
as well as the printing process parameters. The proposed approach, based on a
particle representation of the wall domain and the end effector, allows for the
adoption of a graph-based solution. The GNN model consists of an
encoder-processor-decoder architecture and is trained using data from
laboratory tests, while the hyperparameters are optimized by means of a
Bayesian scheme. The aim of this model is to act as a simulator of the printing
process and ultimately to be used for generating the robotic arm trajectory
and optimizing the printing parameters, toward the realization of
an autonomous plastering process. The performance of the proposed model is
assessed in terms of the prediction error against unseen ground truth data,
which shows its generality in varied scenarios, as well as in comparison with
the performance of an existing benchmark model. The results demonstrate a
significant improvement over the benchmark model, with notably better
performance and enhanced error scaling across prediction steps.
☆ HACTS: a Human-As-Copilot Teleoperation System for Robot Learning
Teleoperation is essential for autonomous robot learning, especially in
manipulation tasks that require human demonstrations or corrections. However,
most existing systems only offer unilateral robot control and lack the ability
to synchronize the robot's status with the teleoperation hardware, preventing
real-time, flexible intervention. In this work, we introduce HACTS
(Human-As-Copilot Teleoperation System), a novel system that establishes
bilateral, real-time joint synchronization between a robot arm and
teleoperation hardware. This simple yet effective feedback mechanism, akin to a
steering wheel in autonomous vehicles, enables the human copilot to intervene
seamlessly while collecting action-correction data for future learning.
Implemented using 3D-printed components and low-cost, off-the-shelf motors,
HACTS is both accessible and scalable. Our experiments show that HACTS
significantly enhances performance in imitation learning (IL) and reinforcement
learning (RL) tasks, boosting IL recovery capabilities and data efficiency, and
facilitating human-in-the-loop RL. HACTS paves the way for more effective and
interactive human-robot collaboration and data collection, advancing the
capabilities of robot manipulation.
☆ COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) tasks have gained prominence within
artificial intelligence research due to their potential application in fields
like home assistants. Many contemporary VLN approaches, while based on
transformer architectures, have increasingly incorporated additional components
such as external knowledge bases or map information to enhance performance.
These additions, while boosting performance, also lead to larger models and
increased computational costs. In this paper, to achieve both high performance
and low computational costs, we propose a novel architecture with the
COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates
state-space modules and transformer modules, and incorporates two
VLN-customized selective state space modules: the Round Selective Scan (RSS)
and the Cross-modal Selective State Space Module (CS3). RSS facilitates
comprehensive inter-modal interactions within a single scan, while the CS3
module adapts the selective state space module into a dual-stream architecture,
thereby enhancing the acquisition of cross-modal interactions. Experimental
validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not
only demonstrate competitive navigation performance of our model but also show
a significant reduction in computational costs.
☆ Toward Anxiety-Reducing Pocket Robots for Children
A common denominator for most therapy treatments for children who suffer from
an anxiety disorder is daily practice routines to learn techniques needed to
overcome anxiety. However, applying those techniques while experiencing anxiety
can be highly challenging. This paper presents the design, implementation, and
pilot study of a tactile hand-held pocket robot AffectaPocket, designed to work
alongside therapy as a focus object to facilitate coping during an anxiety
attack. The robot does not require daily practice to be used, has a small form
factor, and has been designed for children 7 to 12 years old. The pocket robot
works by sensing when it is being held and attempts to shift the child's focus
by presenting them with a simple three-note rhythm-matching game. We conducted
a pilot study of the pocket robot involving four children aged 7 to 10 years,
and then a main study with 18 children aged 6 to 8 years; neither study
involved children with anxiety. Both studies aimed to assess the reliability of
the robot's sensor configuration, its design, and the effectiveness of the user
tutorial. The results indicate that the morphology and sensor setup performed
adequately and the tutorial process enabled the children to use the robot with
little practice. This work demonstrates that the presented pocket robot could
represent a step toward developing low-cost accessible technologies to help
children suffering from anxiety disorders.
comment: 8 pages
☆ Learning 3D-Gaussian Simulators from RGB Videos
Learning physics simulations from video data requires maintaining spatial and
temporal consistency, a challenge often addressed with strong inductive biases
or ground-truth 3D information -- limiting scalability and generalization. We
introduce 3DGSim, a 3D physics simulator that learns object dynamics end-to-end
from multi-view RGB videos. It encodes images into a 3D Gaussian particle
representation, propagates dynamics via a transformer, and renders frames using
3D Gaussian splatting. By jointly training inverse rendering with a dynamics
transformer using a temporal encoding and merging layer, 3DGSim embeds physical
properties into point-wise latent vectors without enforcing explicit
connectivity constraints. This enables the model to capture diverse physical
behaviors, from rigid to elastic and cloth-like interactions, along with
realistic lighting effects that also generalize to unseen multi-body
interactions and novel scene edits.
☆ SALT: A Flexible Semi-Automatic Labeling Tool for General LiDAR Point Clouds with Cross-Scene Adaptability and 4D Consistency
We propose a flexible Semi-Automatic Labeling Tool (SALT) for general LiDAR
point clouds with cross-scene adaptability and 4D consistency. Unlike recent
approaches that rely on camera distillation, SALT operates directly on raw
LiDAR data, automatically generating pre-segmentation results. To achieve this,
we propose a novel zero-shot learning paradigm, termed data alignment, which
transforms LiDAR data into pseudo-images by aligning with the training
distribution of vision foundation models. Additionally, we design a
4D-consistent prompting strategy and 4D non-maximum suppression module to
enhance SAM2, ensuring high-quality, temporally consistent pre-segmentation.
SALT surpasses the latest zero-shot methods by 18.4% PQ on SemanticKITTI and
achieves nearly 40-50% of human annotator performance on our newly collected
low-resolution LiDAR data and on combined data from three LiDAR types,
significantly boosting annotation efficiency. We anticipate that SALT's
open-sourcing will catalyze substantial expansion of current LiDAR datasets and
lay the groundwork for the future development of LiDAR foundation models. Code
is available at https://github.com/Cavendish518/SALT.
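A common generic step toward treating LiDAR as a pseudo-image is spherical (range-image) projection; the sketch below shows that transform only, not SALT's data-alignment procedure, and the assumed vertical field of view is invented.

```python
import numpy as np

def lidar_to_range_image(points, h=64, w=1024, max_range=80.0):
    """Project an (N, 3) point cloud to an h x w range image.

    A generic spherical projection; SALT's alignment to the training
    distribution of vision foundation models goes beyond this.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                 # [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1, 1))

    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = ((1 - (pitch + 0.4363) / 0.8727) * h).astype(int)  # ~±25° FOV, assumed
    v = np.clip(v, 0, h - 1)

    img = np.zeros((h, w), dtype=np.float32)
    img[v, u] = np.clip(r / max_range, 0, 1)               # normalized range
    return img

cloud = np.random.uniform(-40, 40, size=(100_000, 3))
print(lidar_to_range_image(cloud).shape)                   # (64, 1024)
```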
☆ A Reactive Framework for Whole-Body Motion Planning of Mobile Manipulators Combining Reinforcement Learning and SDF-Constrained Quadratic Programming
As an important branch of embodied artificial intelligence, mobile
manipulators are increasingly applied in intelligent services, but their
redundant degrees of freedom also limit efficient motion planning in cluttered
environments. To address this issue, this paper proposes a hybrid learning and
optimization framework for reactive whole-body motion planning of mobile
manipulators. We develop the Bayesian distributional soft actor-critic
(Bayes-DSAC) algorithm to improve the quality of value estimation and the
convergence performance of the learning. Additionally, we introduce a quadratic
programming method constrained by the signed distance field to enhance the
safety of the obstacle avoidance motion. We conduct experiments and compare
against standard benchmarks. The experimental results verify that our
proposed framework significantly improves the efficiency of reactive whole-body
motion planning, reduces the planning time, and improves the success rate of
motion planning. Additionally, the proposed reinforcement learning method
ensures a rapid learning process in the whole-body planning task. The novel
framework allows mobile manipulators to adapt to complex environments more
safely and efficiently.
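The SDF-constrained QP can be caricatured in one reactive step: track a desired configuration subject to a linearized clearance constraint. The formulation below is a generic stand-in, not the paper's controller.

```python
import numpy as np
from scipy.optimize import minimize

def sdf_constrained_step(q, q_des, sdf, sdf_grad, margin=0.05):
    """One reactive step: track q_des while keeping SDF clearance.

    Solves  min ||q + dq - q_des||^2  s.t.  sdf(q) + grad(q)·dq >= margin,
    a linearized stand-in for an SDF-constrained quadratic program.
    """
    cons = {"type": "ineq",
            "fun": lambda dq: sdf(q) + sdf_grad(q) @ dq - margin}
    res = minimize(lambda dq: np.sum((q + dq - q_des) ** 2),
                   x0=np.zeros_like(q), constraints=[cons])
    return q + res.x

# Toy 2D "robot": one spherical obstacle at the origin, radius 1.
sdf = lambda q: np.linalg.norm(q) - 1.0
sdf_grad = lambda q: q / np.linalg.norm(q)

q = np.array([2.0, 0.5])
q_next = sdf_constrained_step(q, q_des=np.array([0.0, 0.0]),
                              sdf=sdf, sdf_grad=sdf_grad)
print(q_next, sdf(q_next))   # the step stops at the safety margin
```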
☆ Video-based Traffic Light Recognition by Rockchip RV1126 for Autonomous Driving
Real-time traffic light recognition is fundamental for autonomous driving
safety and navigation in urban environments. While existing approaches rely on
single-frame analysis from onboard cameras, they struggle with complex
scenarios involving occlusions and adverse lighting conditions. We present
ViTLR, a novel video-based end-to-end neural network that processes
multiple consecutive frames to achieve robust traffic light detection and state
classification. The architecture leverages a transformer-like design with
convolutional self-attention modules, which is optimized specifically for
deployment on the Rockchip RV1126 embedded platform. Extensive evaluations on
two real-world datasets demonstrate that ViTLR achieves
state-of-the-art performance while maintaining real-time processing
capabilities (>25 FPS) on RV1126's NPU. The system shows superior robustness
across temporal stability, varying target distances, and challenging
environmental conditions compared to existing single-frame approaches. We have
successfully integrated ViTLR into an ego-lane traffic light
recognition system using HD maps for autonomous driving applications. The
complete implementation, including source code and datasets, is made publicly
available to facilitate further research in this domain.
comment: Accepted by IEEE IV'25
☆ A Benchmark for Vision-Centric HD Mapping by V2I Systems
Autonomous driving faces safety challenges due to a lack of global
perspective and the semantic information of vectorized high-definition (HD)
maps. Information from roadside cameras can greatly expand the map perception
range through vehicle-to-infrastructure (V2I) communications. However, no
real-world dataset is yet available for studying onboard map vectorization
under vehicle-infrastructure cooperation.
To advance research on online HD mapping for Vehicle-Infrastructure
Cooperative Autonomous Driving (VICAD), we release a real-world dataset, which
contains collaborative camera frames from both vehicles and roadside
infrastructures, and provides human annotations of HD map elements. We also
present an end-to-end neural framework, V2I-HD, leveraging vision-centric
V2I systems to construct vectorized maps. To reduce computation costs and
further deploy V2I-HD on autonomous vehicles, we introduce a directionally
decoupled self-attention mechanism to V2I-HD. Extensive experiments show that
V2I-HD has superior performance in real-time inference speed, as tested on our
real-world dataset. Abundant qualitative results also demonstrate stable and
robust map construction quality with low cost in complex and various driving
scenes. As a benchmark, both source codes and the dataset have been released at
OneDrive for the purpose of further study.
comment: Accepted by IEEE IV'25
☆ MAER-Nav: Bidirectional Motion Learning Through Mirror-Augmented Experience Replay for Robot Navigation
Deep Reinforcement Learning (DRL) based navigation methods have demonstrated
promising results for mobile robots, but suffer from limited action flexibility
in confined spaces. Conventional DRL approaches predominantly learn
forward-motion policies, causing robots to become trapped in complex
environments where backward maneuvers are necessary for recovery. This paper
presents MAER-Nav (Mirror-Augmented Experience Replay for Robot Navigation), a
novel framework that enables bidirectional motion learning without requiring
explicit failure-driven hindsight experience replay or reward function
modifications. Our approach integrates a mirror-augmented experience replay
mechanism with curriculum learning to generate synthetic backward navigation
experiences from successful trajectories. Experimental results in both
simulation and real-world environments demonstrate that MAER-Nav significantly
outperforms state-of-the-art methods while maintaining strong forward
navigation capabilities. The framework effectively bridges the gap between the
comprehensive action space utilization of traditional planning methods and the
environmental adaptability of learning-based approaches, enabling robust
navigation in scenarios where conventional DRL methods consistently fail.
comment: 8 pages, 8 figures
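The mirror augmentation can be illustrated on a single transition: a successful forward maneuver is reflected into a synthetic backward one and both are stored. The state/action layout below is an assumption, not MAER-Nav's exact scheme.

```python
import math

def mirror_transition(state, action, reward, next_state):
    """Synthesize a backward-motion transition from a forward one.

    Assumed layout: action = (linear_vel, angular_vel), and state[0] is the
    goal bearing relative to the robot's heading. A real implementation
    would reflect the full sensor observation consistently.
    """
    lin, ang = action
    mirrored_action = (-lin, -ang)                     # drive backward, mirrored turn
    flip = lambda s: [s[0] + math.pi] + list(s[1:])    # goal now behind the robot
    return flip(state), mirrored_action, reward, flip(next_state)

replay_buffer = []
forward = ([0.0, 1.0], (0.5, 0.1), 1.0, [0.0, 0.5])
replay_buffer += [forward, mirror_transition(*forward)]  # store both variants
```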
☆ Less is More: Contextual Sampling for Nonlinear Data-Enabled Predictive Control IROS 2025
Data-enabled Predictive Control (DeePC) is a powerful data-driven approach
for predictive control without requiring an explicit system model. However, its
high computational cost limits its applicability to real-time robotic systems.
For robotic applications such as motion planning and trajectory tracking,
real-time control is crucial. Nonlinear DeePC either relies on large datasets
or learns the nonlinearities to ensure predictive accuracy, leading to high
computational complexity. This work introduces contextual sampling, a novel
data selection strategy to handle nonlinearities for DeePC by dynamically
selecting the most relevant data at each time step. By reducing the dataset
size while preserving prediction accuracy, our method improves the computational
efficiency of DeePC for real-time robotic applications. We validate our
approach for autonomous vehicle motion planning. For a dataset size of 100
sub-trajectories, Contextual sampling DeePC reduces tracking error by 53.2%
compared to Leverage Score sampling. Additionally, Contextual sampling reduces
maximum computation time by 87.2% compared to using the full dataset of 491
sub-trajectories while achieving comparable tracking performance. These results
highlight the potential of Contextual sampling to enable real-time, data-driven
control for robotic systems.
comment: Submitted to IROS 2025 on March 1st
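The contextual-sampling step reduces, in the simplest reading, to a nearest-neighbor selection over stored sub-trajectories at every time step; the distance metric and data layout below are illustrative choices, not the paper's.

```python
import numpy as np

def contextual_sample(current_state, sub_trajectories, k=100):
    """Select the k sub-trajectories most relevant to the current state.

    sub_trajectories: list of (initial_state, data_columns) pairs; relevance
    here is plain Euclidean distance, an illustrative choice.
    """
    dists = [np.linalg.norm(current_state - s0) for s0, _ in sub_trajectories]
    idx = np.argsort(dists)[:k]
    return [sub_trajectories[i] for i in idx]

# Usage: 491 stored sub-trajectories, keep the 100 nearest each time step.
rng = np.random.default_rng(0)
dataset = [(rng.normal(size=4), rng.normal(size=(4, 20))) for _ in range(491)]
selected = contextual_sample(rng.normal(size=4), dataset, k=100)
print(len(selected))   # 100
```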
☆ ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos ICRA 2025
Many recent advances in robotic manipulation have come through imitation
learning, yet these rely largely on mimicking a particularly hard-to-acquire
form of demonstrations: those collected on the same robot in the same room with
the same objects as the trained policy must handle at test time. In contrast,
large pre-recorded human video datasets demonstrating manipulation skills
in-the-wild already exist, which contain valuable information for robots. Is it
possible to distill a repository of useful robotic skill policies out of such
data without any additional requirements on robot-specific demonstrations or
exploration? We present ZeroMimic, the first such system, which generates
immediately deployable image goal-conditioned skill policies for several common
categories of manipulation tasks (opening, closing, pouring, pick&place,
cutting, and stirring), each capable of acting upon diverse objects and across
diverse unseen task setups. ZeroMimic is carefully designed to exploit recent
advances in semantic and geometric visual understanding of human videos,
together with modern grasp affordance detectors and imitation policy classes.
After training ZeroMimic on the popular EpicKitchens dataset of ego-centric
human videos, we evaluate its out-of-the-box performance in varied real-world
and simulated kitchen settings with two different robot embodiments,
demonstrating its impressive abilities to handle these varied tasks. To enable
plug-and-play reuse of ZeroMimic policies on other task setups and robots, we
release software and policy checkpoints of our skill policies.
comment: ICRA 2025. Project website: https://zeromimic.github.io/
☆ GenSwarm: Scalable Multi-Robot Code-Policy Generation and Deployment via Language Models
Wenkang Ji, Huaben Chen, Mingyang Chen, Guobin Zhu, Lufeng Xu, Roderich Groß, Rui Zhou, Ming Cao, Shiyu Zhao
The development of control policies for multi-robot systems traditionally
follows a complex and labor-intensive process, often lacking the flexibility to
adapt to dynamic tasks. This has motivated research on methods to automatically
create control policies. However, these methods require iterative processes of
manually crafting and refining objective functions, thereby prolonging the
development cycle. This work introduces GenSwarm, an end-to-end system
that leverages large language models to automatically generate and deploy
control policies for multi-robot tasks based on simple user instructions in
natural language. As a multi-language-agent system, GenSwarm achieves zero-shot
learning, enabling rapid adaptation to altered or unseen tasks. The white-box
nature of the code policies ensures strong reproducibility and
interpretability. With its scalable software and hardware architectures,
GenSwarm supports efficient policy deployment on both simulated and real-world
multi-robot systems, realizing an instruction-to-execution end-to-end
functionality that could prove valuable for robotics specialists and
non-specialists alike. The code of the proposed GenSwarm system is available
online: https://github.com/WindyLab/GenSwarm.
☆ Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning IROS 2025
Grasp-based manipulation tasks are fundamental to robots interacting with
their environments, yet gripper state ambiguity significantly reduces the
robustness of imitation learning policies for these tasks. Data-driven
solutions face the challenge of high real-world data costs, while simulation
data, despite its low costs, is limited by the sim-to-real gap. We identify the
root cause of gripper state ambiguity as the lack of tactile feedback. To
address this, we propose a novel approach employing pseudo-tactile as feedback,
inspired by the idea of using a force-controlled gripper as a tactile sensor.
This method enhances policy robustness without additional data collection and
hardware involvement, while providing a noise-free binary gripper state
observation for the policy and thus facilitating pure simulation learning to
unleash the power of simulation. Experimental results across three real-world
grasp-based tasks demonstrate the necessity, effectiveness, and efficiency of
our approach.
comment: 8 pages, 5 figures, submitted to IROS 2025, project page:
https://yifei-y.github.io/project-pages/Pseudo-Tactile-Feedback/
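The pseudo-tactile signal can be sketched as a single threshold on a force-controlled gripper: if the fingers stall wider than commanded, something is held. The interface and 2 mm threshold are illustrative assumptions.

```python
def gripper_state(commanded_width_m, measured_width_m, gap_threshold_m=0.002):
    """Return 1 if an object blocks the gripper, else 0.

    A force-controlled gripper stalls on contact, so a measured width
    noticeably larger than the commanded (closed) width implies an object
    is held. The 2 mm threshold is an illustrative assumption.
    """
    return int(measured_width_m - commanded_width_m > gap_threshold_m)

print(gripper_state(0.0, 0.035))   # 1: gripper stalled on a 35 mm object
print(gripper_state(0.0, 0.001))   # 0: gripper closed fully, nothing held
```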
☆ Trajectory Planning for Automated Driving using Target Funnels
Self-driving vehicles rely on sensory input to monitor their surroundings and
continuously adapt to the most likely future road course. Predictive trajectory
planning is based on snapshots of the (uncertain) road course as a key input.
Under noisy perception data, estimates of the road course can vary
significantly, leading to indecisive and erratic steering behavior. To overcome
this issue, this paper introduces a predictive trajectory planning algorithm
with a novel objective function: instead of targeting a single reference
trajectory based on the most likely road course, tracking a series of target
reference sets, called a target funnel, is considered. The proposed planning
algorithm integrates probabilistic information about the road course, and thus
implicitly considers regular updates to road perception. Our solution is
assessed in a case study using real driving data collected from a prototype
vehicle. The results demonstrate that the algorithm maintains tracking accuracy
and substantially reduces undesirable steering commands in the presence of
noisy road perception, achieving a 56% reduction in input costs compared to a
certainty equivalent formulation.
comment: accepted to European Control Conference 2025 (ECC25)
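The funnel objective can be illustrated with interval target sets: a state inside the stage's set costs nothing, so small perception-driven shifts of the reference do not trigger corrective steering. Interval sets and scalar states stand in for whatever set representation the paper uses.

```python
def funnel_cost(trajectory, funnel):
    """Sum of violations of per-stage target intervals.

    trajectory: list of scalar lateral positions, one per stage.
    funnel: list of (lo, hi) target sets; being inside a set costs nothing.
    """
    cost = 0.0
    for x, (lo, hi) in zip(trajectory, funnel):
        if x < lo:
            cost += (lo - x) ** 2
        elif x > hi:
            cost += (x - hi) ** 2
    return cost

# The funnel widens with lookahead distance, reflecting growing uncertainty.
funnel = [(-0.1, 0.1), (-0.2, 0.2), (-0.4, 0.4), (-0.8, 0.8)]
print(funnel_cost([0.05, 0.15, -0.5, 0.0], funnel))   # only stage 2 violates
```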
☆ Towards a cognitive architecture to enable natural language interaction in co-constructive task learning
This research addresses the question of which characteristics a cognitive
architecture must have to leverage the benefits of natural language in
Co-Constructive Task Learning (CCTL). To provide context, we first discuss
Interactive Task Learning (ITL), the mechanisms of the human memory system, and
the significance of natural language and multi-modality. Next, we examine the
current state of cognitive architectures, analyzing their capabilities to
inform a concept of CCTL grounded in multiple sources. We then integrate
insights from various research domains to develop a unified framework. Finally,
we conclude by identifying the remaining challenges and requirements necessary
to achieve CCTL in Human-Robot Interaction (HRI).
comment: 8 pages, 5 figures, submitted to: IEEE RO-MAN 2025
☆ Towards Benchmarking and Assessing the Safety and Robustness of Autonomous Driving on Safety-critical Scenarios
Jingzheng Li, Xianglong Liu, Shikui Wei, Zhijun Chen, Bing Li, Qing Guo, Xianqi Yang, Yanjun Pu, Jiakai Wang
Autonomous driving has made significant progress in both academia and
industry, including performance improvements in perception tasks and the
development of end-to-end autonomous driving systems. However, the safety and
robustness assessment of autonomous driving has not received sufficient
attention. Current evaluations of autonomous driving are typically conducted in
natural driving scenarios. However, many accidents often occur in edge cases,
also known as safety-critical scenarios. These safety-critical scenarios are
difficult to collect, and there is currently no clear definition of what
constitutes a safety-critical scenario. In this work, we explore the safety and
robustness of autonomous driving in safety-critical scenarios. First, we
provide a definition of safety-critical scenarios, including static traffic
scenarios such as adversarial attack scenarios and natural distribution shifts,
as well as dynamic traffic scenarios such as accident scenarios. Then, we
develop an autonomous driving safety testing platform to comprehensively
evaluate autonomous driving systems, encompassing not only the assessment of
perception modules but also system-level evaluations. Our work systematically
constructs a safety verification process for autonomous driving, providing
technical support for the industry to establish a standardized test framework and
reduce risks in real-world road deployment.
☆ A Survey of Reinforcement Learning-Based Motion Planning for Autonomous Driving: Lessons Learned from a Driving Task Perspective
Zhuoren Li, Guizhe Jin, Ran Yu, Zhiwen Chen, Nan Li, Wei Han, Lu Xiong, Bo Leng, Jia Hu, Ilya Kolmanovsky, Dimitar Filev
Reinforcement learning (RL), with its ability to explore and optimize
policies in complex, dynamic decision-making tasks, has emerged as a promising
approach to addressing motion planning (MoP) challenges in autonomous driving
(AD). Despite rapid advancements in RL and AD, a systematic description and
interpretation of the RL design process tailored to diverse driving tasks
remains underdeveloped. This survey provides a comprehensive review of RL-based
MoP for AD, focusing on lessons from task-specific perspectives. We first
outline the fundamentals of RL methodologies, and then survey their
applications in MoP, analyzing scenario-specific features and task requirements
to shed light on their influence on RL design choices. Building on this
analysis, we summarize key design experiences, extract insights from various
driving task applications, and provide guidance for future implementations.
Additionally, we examine the frontier challenges in RL-based MoP, review recent
efforts to address these challenges, and propose strategies for overcoming
unresolved issues.
comment: 21 pages, 5 figures
♻ ☆ Robust Nonprehensile Object Transportation with Uncertain Inertial Parameters
We consider the nonprehensile object transportation task known as the
waiter's problem - in which a robot must move an object on a tray from one
location to another - when the transported object has uncertain inertial
parameters. In contrast to existing approaches that completely ignore
uncertainty in the inertia matrix or which only consider small parameter
errors, we are interested in pushing the limits of the amount of inertial
parameter uncertainty that can be handled. We first show how constraints that
are robust to inertial parameter uncertainty can be incorporated into an
optimization-based motion planning framework to transport objects while moving
quickly. Next, we develop necessary conditions for the inertial parameters to
be realizable on a bounding shape based on moment relaxations, allowing us to
verify whether a trajectory will violate the constraints for any realizable
inertial parameters. Finally, we demonstrate our approach on a mobile
manipulator in simulations and real hardware experiments: our proposed robust
constraints consistently successfully transport a 56 cm tall object with
substantial inertial parameter uncertainty in the real world, while the
baseline approaches drop the object while transporting it.
comment: 8 pages, 7 figures. Published in IEEE Robotics and Automation Letters
♻ ☆ CALMM-Drive: Confidence-Aware Autonomous Driving with Large Multimodal Model
Decision-making and motion planning constitute critical components for
ensuring the safety and efficiency of autonomous vehicles (AVs). Existing
methodologies typically adopt two paradigms: decision then planning or
generation then scoring. However, the former architecture often suffers from
decision-planning misalignment that incurs risky situations. Meanwhile, the
latter struggles to balance short-term operational metrics (e.g., immediate
motion smoothness) with long-term tactical goals (e.g., route efficiency),
resulting in myopic or overly conservative behaviors. To address these issues,
we introduce CALMM-Drive, a novel Confidence-Aware Large Multimodal Model (LMM)
empowered Autonomous Driving framework. Our approach integrates driving
task-oriented Chain-of-Thought (CoT) reasoning coupled with Top-K confidence
elicitation, which facilitates high-level reasoning to generate multiple
candidate decisions with their confidence levels. Furthermore, we propose a
novel planning module that integrates a diffusion model for trajectory
generation and a hierarchical refinement process to find the optimal
trajectory. This framework enables the selection over trajectory candidates
accounting for both low-level solution quality and high-level tactical
confidence, which avoids the risks within one-shot decisions and overcomes the
limitations in short-sighted scoring mechanisms. Comprehensive evaluations in
nuPlan closed-loop simulation environments demonstrate the competitive
performance of CALMM-Drive across both common and long-tail benchmarks,
showcasing a significant advancement in the integration of uncertainty in
LMM-empowered AVs. The code will be released upon acceptance.
comment: 14 pages, 7 figures
♻ ☆ Tactile Ergodic Coverage on Curved Surfaces
In this article, we present a feedback control method for tactile coverage
tasks, such as cleaning or surface inspection. These tasks are challenging to
plan due to complex continuous physical interactions. In these tasks, the
coverage target and progress can be easily measured using a camera and encoded
in a point cloud. We propose an ergodic coverage method that operates directly
on point clouds, guiding the robot to spend more time on regions requiring more
coverage. For robot control and contact behavior, we use geometric algebra to
formulate a task-space impedance controller that tracks a line while
simultaneously exerting a desired force along that line. We evaluate the
performance of our method in kinematic simulations and demonstrate its
applicability in real-world experiments on kitchenware. Our source codes,
experimental data, and videos are available as open access at
https://sites.google.com/view/tactile-ergodic-control/
♻ ☆ Fast and Accurate Task Planning using Neuro-Symbolic Language Models and Multi-level Goal Decomposition
In robotic task planning, symbolic planners using rule-based representations
like PDDL are effective but struggle with long-sequential tasks in complicated
environments due to an exponentially increasing search space. Meanwhile, LLM-based
approaches, which are grounded in artificial neural networks, offer faster
inference and commonsense reasoning but suffer from lower success rates. To
address the limitations of the current symbolic (slow speed) or LLM-based
approaches (low accuracy), we propose a novel neuro-symbolic task planner that
decomposes complex tasks into subgoals using LLM and carries out task planning
for each subgoal using either symbolic or MCTS-based LLM planners, depending on
the subgoal complexity. This decomposition reduces planning time and improves
success rates by narrowing the search space and enabling LLMs to focus on more
manageable tasks. Our method significantly reduces planning time while
maintaining high success rates across task planning domains, as well as
real-world and simulated robotics environments. More details are available at
http://graphics.ewha.ac.kr/LLMTAMP/.
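The routing logic reads as: decompose with an LLM, then dispatch each subgoal to the symbolic planner or the MCTS-based LLM planner by a complexity estimate. The stub below shows that control flow; all callbacks and the threshold are placeholders, not the paper's criteria.

```python
def plan_task(task, llm_decompose, estimate_complexity,
              symbolic_plan, mcts_llm_plan, threshold=5):
    """Decompose with an LLM, then route each subgoal by complexity.

    The callbacks are placeholders for the paper's components; `threshold`
    is an invented switch point between the two planners.
    """
    plan = []
    for subgoal in llm_decompose(task):
        if estimate_complexity(subgoal) <= threshold:
            plan += symbolic_plan(subgoal)      # fast, exact on small problems
        else:
            plan += mcts_llm_plan(subgoal)      # scales to larger search spaces
    return plan

# Toy usage with stub components.
steps = plan_task(
    "set the table",
    llm_decompose=lambda t: ["fetch plates", "arrange cutlery"],
    estimate_complexity=lambda g: len(g.split()),
    symbolic_plan=lambda g: [f"PDDL:{g}"],
    mcts_llm_plan=lambda g: [f"MCTS:{g}"],
)
print(steps)
```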
♻ ☆ Fast Online Learning of CLiFF-maps in Changing Environments ICRA
Maps of dynamics are effective representations of motion patterns learned
from prior observations, with recent research demonstrating their ability to
enhance various downstream tasks such as human-aware robot navigation,
long-term human motion prediction, and robot localization. Current advancements
have primarily concentrated on methods for learning maps of human flow in
environments where the flow is static, i.e., not assumed to change over time.
In this paper we propose an online update method of the CLiFF-map (an advanced
map of dynamics type that models motion patterns as velocity and orientation
mixtures) to actively detect and adapt to human flow changes. As new
observations are collected, our goal is to update a CLiFF-map to effectively
and accurately integrate them, while retaining relevant historic motion
patterns. The proposed online update method maintains a probabilistic
representation in each observed location, updating parameters by continuously
tracking sufficient statistics. In experiments using both synthetic and
real-world datasets, we show that our method is able to maintain accurate
representations of human motion dynamics, contributing to high performance in
downstream flow-compliant planning tasks, while being orders of magnitude
faster than the comparable baselines.
comment: Accepted to the 2025 IEEE International Conference on Robotics and
Automation (ICRA)
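"Continuously tracking sufficient statistics" has a standard minimal form, shown below with Welford's online mean/variance update for one observed location; a real CLiFF-map fits semi-wrapped mixtures over speed and orientation rather than a single Gaussian.

```python
class OnlineVelocityStats:
    """Running mean/variance of observed speeds at one grid location.

    Welford's online update; a single Gaussian here stands in for the
    CLiFF-map's mixture over speed and orientation.
    """
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, speed):
        self.n += 1
        delta = speed - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (speed - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 1 else 0.0

cell = OnlineVelocityStats()
for s in [1.2, 1.4, 1.3, 0.9]:        # new observations stream in
    cell.update(s)
print(cell.mean, cell.variance)
```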
♻ ☆ Grasping a Handful: Sequential Multi-Object Dexterous Grasp Generation
We introduce the sequential multi-object robotic grasp sampling algorithm
SeqGrasp that can robustly synthesize stable grasps on diverse objects using
the robotic hand's partial Degrees of Freedom (DoF). We use SeqGrasp to
construct the large-scale Allegro Hand sequential grasping dataset SeqDataset
and use it for training the diffusion-based sequential grasp generator
SeqDiffuser. We experimentally evaluate SeqGrasp and SeqDiffuser against the
state-of-the-art non-sequential multi-object grasp generation method MultiGrasp
in simulation and on a real robot. The experimental results demonstrate that
SeqGrasp and SeqDiffuser reach an 8.71%-43.33% higher grasp success rate than
MultiGrasp. Furthermore, SeqDiffuser is approximately 1000 times faster at
generating grasps than SeqGrasp and MultiGrasp.
comment: 8 pages, 7 figures
♻ ☆ Dynamic High-Order Control Barrier Functions with Diffuser for Safety-Critical Trajectory Planning at Signal-Free Intersections
Planning safe and efficient trajectories through signal-free intersections
presents significant challenges for autonomous vehicles (AVs), particularly in
dynamic, multi-task environments with unpredictable interactions and an
increased possibility of conflicts. This study aims to address these challenges
by developing a unified, robust, adaptive framework to ensure safety and
efficiency across three distinct intersection movements: left-turn, right-turn,
and straight-ahead. Existing methods often struggle to reliably ensure safety
and effectively learn multi-task behaviors from demonstrations in such
environments. This study proposes a safety-critical planning method that
integrates Dynamic High-Order Control Barrier Functions (DHOCBF) with a
diffusion-based model, called Dynamic Safety-Critical Diffuser (DSC-Diffuser).
The DSC-Diffuser leverages task-guided planning to enhance efficiency, allowing
the simultaneous learning of multiple driving tasks from real-world expert
demonstrations. Moreover, the incorporation of goal-oriented constraints
significantly reduces displacement errors, ensuring precise trajectory
execution. To further ensure driving safety in dynamic environments, the
proposed DHOCBF framework dynamically adjusts to account for the movements of
surrounding vehicles, offering enhanced adaptability and reduced
conservatism compared to traditional control barrier functions. Validity
evaluations of DHOCBF, conducted through numerical simulations, demonstrate its
robustness in adapting to variations in obstacle velocities, sizes,
uncertainties, and locations, effectively maintaining driving safety across a
wide range of complex and uncertain scenarios. Comprehensive performance
evaluations demonstrate that DSC-Diffuser generates realistic, stable, and
generalizable policies, providing flexibility and reliable safety assurance in
complex multi-task driving scenarios.
comment: 11 figures, 5 tables, 15 pages
♻ ☆ Mitigating Covariate Shift in Imitation Learning for Autonomous Vehicles Using Latent Space Generative World Models ICRA 2025
Alexander Popov, Alperen Degirmenci, David Wehr, Shashank Hegde, Ryan Oldja, Alexey Kamenev, Bertrand Douillard, David Nistér, Urs Muller, Ruchi Bhargava, Stan Birchfield, Nikolai Smolyanskiy
We propose the use of latent space generative world models to address the
covariate shift problem in autonomous driving. A world model is a neural
network capable of predicting an agent's next state given past states and
actions. By leveraging a world model during training, the driving policy
effectively mitigates covariate shift without requiring an excessive amount of
training data. During end-to-end training, our policy learns how to recover
from errors by aligning with states observed in human demonstrations, so that
at runtime it can recover from perturbations outside the training distribution.
Additionally, we introduce a novel transformer-based perception encoder that
employs multi-view cross-attention and a learned scene query. We present
qualitative and quantitative results, demonstrating significant improvements
upon prior state of the art in closed-loop testing in the CARLA simulator, as
well as showing the ability to handle perturbations in both CARLA and NVIDIA's
DRIVE Sim.
comment: 8 pages, 6 figures, updated in March 2025, original published in
September 2024, for ICRA 2025 submission, for associated video file, see
https://youtu.be/7m3bXzlVQvU
♻ ☆ Beyond Omakase: Designing Shared Control for Navigation Robots with Blind People
Rie Kamikubo, Seita Kayukawa, Yuka Kaniwa, Allan Wang, Hernisa Kacorri, Hironobu Takagi, Chieko Asakawa
Autonomous navigation robots can increase the independence of blind people
but often limit user control, following what is called in Japanese an "omakase"
approach where decisions are left to the robot. This research investigates ways
to enhance user control in social robot navigation, based on two studies
conducted with blind participants. The first study, involving structured
interviews (N=14), identified crowded spaces as key areas with significant
social challenges. The second study (N=13) explored navigation tasks with an
autonomous robot in these environments and identified design strategies across
different modes of autonomy. Participants preferred an active role, termed the
"boss" mode, where they managed crowd interactions, while the "monitor" mode
helped them assess the environment, negotiate movements, and interact with the
robot. These findings highlight the importance of shared control and user
involvement for blind users, offering valuable insights for designing future
social navigation robots.
comment: Preprint, ACM CHI Conference on Human Factors in Computing Systems
(CHI 2025)
♻ ☆ Scalable Multi-modal Model Predictive Control via Duality-based Interaction Predictions
We propose a hierarchical architecture designed for scalable real-time Model
Predictive Control (MPC) in complex, multi-modal traffic scenarios. This
architecture comprises two key components: 1) RAID-Net, a novel attention-based
Recurrent Neural Network that predicts relevant interactions along the MPC
prediction horizon between the autonomous vehicle and the surrounding vehicles
using Lagrangian duality, and 2) a reduced Stochastic MPC problem that
eliminates irrelevant collision avoidance constraints, enhancing computational
efficiency. Our approach is demonstrated in a simulated traffic intersection
with interactive surrounding vehicles, showcasing a 12x speed-up in solving the
motion planning problem. A video demonstrating the proposed architecture in
multiple complex traffic scenarios can be found here:
https://youtu.be/-pRiOnPb9_c. GitHub:
https://github.com/MPC-Berkeley/hmpc_raidnet
comment: Accepted at IEEE Intelligent Vehicles Symposium 2024
♻ ☆ Joint Moment Estimation for Hip Exoskeleton Control: A Generalized Moment Feature Generation Method
Hip joint moments during walking are the key foundation for hip exoskeleton
assistance control. Most recent studies have shown estimating hip joint moments
instantaneously offers many advantages compared to generating assistive
torque profiles based on gait estimation, such as simple sensor requirements
and adaptability to variable walking speeds. However, existing joint moment
estimation methods still suffer from a lack of personalization, leading to
estimation accuracy degradation for new users. To address the challenges, this
paper proposes a hip joint moment estimation method based on generalized moment
features (GMF). A GMF generator is constructed to learn GMF of the joint moment
which is invariant to individual variations while remaining decodable into
joint moments through a dedicated decoder. Utilizing this well-featured
representation, a GRU-based neural network is used to predict GMF with joint
kinematics data, which can easily be acquired by hip exoskeleton encoders. The
proposed estimation method achieves a root mean square error of 0.1180 Nm/kg
under 28 walking speed conditions on a treadmill dataset, an improvement of
6.5% over the model without body parameter fusion and of 8.3% over the
conventional fusion model with body parameters. Furthermore, the proposed method
was employed on a hip exoskeleton with only encoder sensors and achieved an
average 20.5% metabolic reduction (p<0.01) for users compared to assist-off
condition in level-ground walking.
comment: 13 pages, 10 figures, Submitted to Biomimetic Intelligence and
Robotics