Robotics 36
☆ LiDPM: Rethinking Point Diffusion for Lidar Scene Completion
Training diffusion models that work directly on lidar points at the scale of
outdoor scenes is challenging due to the difficulty of generating fine-grained
details from white noise over a broad field of view. The latest works
addressing scene completion with diffusion models tackle this problem by
reformulating the original DDPM as a local diffusion process. This contrasts
with the common practice of operating at the level of objects, where vanilla DDPMs
are currently used. In this work, we close the gap between these two lines of
work. We identify approximations in the local diffusion formulation, show that
they are not required to operate at the scene level, and that a vanilla DDPM
with a well-chosen starting point is enough for completion. Finally, we
demonstrate that our method, LiDPM, leads to better results in scene completion
on SemanticKITTI. The project page is https://astra-vision.github.io/LiDPM.
comment: Accepted to IEEE IV 2025
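The key claim above — that a vanilla DDPM suffices for scene completion if the reverse process starts from a well-chosen intermediate timestep rather than pure noise — can be illustrated with a toy sketch. Everything here is illustrative: the noise schedule, the starting step `t0`, and the stand-in denoiser (a zero predictor) are assumptions, not the paper's model.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def noise_to_step(x0, t, rng):
    """Forward-diffuse clean points x0 to timestep t, i.e. sample q(x_t | x_0)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def ddpm_reverse(x_t, t0, eps_model, rng):
    """Vanilla DDPM ancestral sampling from timestep t0 down to 0."""
    x = x_t
    for t in range(t0, -1, -1):
        eps_hat = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
        else:
            x = mean
    return x

rng = np.random.default_rng(0)
partial_scan = rng.standard_normal((256, 3))   # stand-in for a partial lidar scan
t0 = 300                                       # well-chosen intermediate starting step
x_t0 = noise_to_step(partial_scan, t0, rng)    # initialize from the known scan, not white noise
completed = ddpm_reverse(x_t0, t0, eps_model=lambda x, t: np.zeros_like(x), rng=rng)
```

The point of the sketch is the initialization: sampling begins at `t0 < T` from a noised version of the observed scan, so the reverse chain needs no local reformulation.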
☆ Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation RSS
Bimanual manipulation is a challenging yet crucial robotic capability,
demanding precise spatial localization and versatile motion trajectories.
Existing approaches fall
into two categories: keyframe-based strategies, which predict gripper poses in
keyframes and execute them via motion planners, and continuous control methods,
which estimate actions sequentially at each timestep. The keyframe-based method
lacks inter-frame supervision, struggling to perform consistently or execute
curved motions, while the continuous method suffers from weaker spatial
perception. To address these issues, this paper introduces an end-to-end
framework PPI (keyPose and Pointflow Interface), which integrates the
prediction of target gripper poses and object pointflow with continuous
action estimation. These interfaces enable the model to effectively attend to
the target manipulation area, while the overall framework guides diverse and
collision-free trajectories. By combining interface predictions with continuous
action estimation, PPI demonstrates superior performance in diverse bimanual
manipulation tasks, providing enhanced spatial localization and sufficient
flexibility in handling movement restrictions. In extensive evaluations, PPI
significantly outperforms prior methods in both simulated and real-world
experiments, achieving state-of-the-art performance with a +16.1% improvement
on the RLBench2 simulation benchmark and an average of +27.5% gain across four
challenging real-world tasks. Notably, PPI exhibits strong stability, high
precision, and remarkable generalization capabilities in real-world scenarios.
Project page: https://yuyinyang3y.github.io/PPI/
comment: Published at Robotics: Science and Systems (RSS) 2025
☆ Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control ICRA 2025
Haochen Wang, Zhiwei Shi, Chengxi Zhu, Yafei Qiao, Cheng Zhang, Fan Yang, Pengjie Ren, Lan Lu, Dong Xuan
Learning-based methods, such as imitation learning (IL) and reinforcement
learning (RL), can produce excellent control policies for challenging agile
robot tasks, such as sports robots. However, no existing work has harmonized
learning-based policies with model-based methods to reduce training complexity
and ensure safety and stability in agile badminton robot control. In this
paper, we introduce HAMLET, a novel hybrid control system for agile
badminton robots. Specifically, we propose a model-based strategy for chassis
locomotion, which provides a base for the arm policy. We introduce a
physics-informed "IL+RL" training framework for the learning-based arm policy.
In this training framework, a model-based strategy with privileged information
is used to guide arm policy training during both the IL and RL phases. In
addition, we train the critic model during the IL phase to alleviate the
performance drop issue
when transitioning from IL to RL. We present results on our self-engineered
badminton robot, achieving 94.5% success rate against the serving machine and
90.7% success rate against human players. Our system can be easily generalized
to other agile mobile manipulation tasks such as agile catching and table
tennis. Our project website: https://dreamstarring.github.io/HAMLET/.
comment: Accepted to ICRA 2025. Project page:
https://dreamstarring.github.io/HAMLET/
☆ Robotic Task Ambiguity Resolution via Natural Language Interaction
Language-conditioned policies have recently gained substantial adoption in
robotics as they allow users to specify tasks using natural language, making
them highly versatile. While much research has focused on improving the action
prediction of language-conditioned policies, reasoning about task descriptions
has been largely overlooked. Ambiguous task descriptions often lead to
downstream policy failures due to misinterpretation by the robotic agent. To
address this challenge, we introduce AmbResVLM, a novel method that grounds
language goals in the observed scene and explicitly reasons about task
ambiguity. We extensively evaluate its effectiveness in both simulated and
real-world domains, demonstrating superior task ambiguity detection and
resolution compared to recent state-of-the-art baselines. Finally, real robot
experiments show that our model improves the performance of downstream robot
policies, increasing the average success rate from 69.6% to 97.1%. We make the
data, code, and trained models publicly available at
https://ambres.cs.uni-freiburg.de.
☆ BIM-Constrained Optimization for Accurate Localization and Deviation Correction in Construction Monitoring
Asier Bikandi, Muhammad Shaheer, Hriday Bavle, Jayan Jevanesan, Holger Voos, Jose Luis Sanchez-Lopez
Augmented reality (AR) applications for construction monitoring rely on
real-time environmental tracking to visualize architectural elements. However,
construction sites present significant challenges for traditional tracking
methods due to featureless surfaces, dynamic changes, and drift accumulation,
leading to misalignment between digital models and the physical world. This
paper proposes a BIM-aware drift correction method to address these challenges.
Instead of relying solely on SLAM-based localization, we align "as-built"
detected planes from the real-world environment with "as-planned"
architectural planes in BIM. Our method performs robust plane matching and
computes a transformation (TF) between SLAM (S) and BIM (B) origin frames using
optimization techniques, minimizing drift over time. By incorporating BIM as
prior structural knowledge, we can achieve improved long-term localization and
enhanced AR visualization accuracy in noisy construction environments. The
method is evaluated through real-world experiments, showing significant
reductions in drift-induced errors and optimized alignment consistency. On
average, our system achieves a reduction of 52.24% in angular deviations and a
reduction of 60.8% in the distance error of the matched walls compared to the
initial manual alignment by the user.
☆ Unifying Complementarity Constraints and Control Barrier Functions for Safe Whole-Body Robot Control
Rafael I. Cabral Muchacho, Riddhiman Laha, Florian T. Pokorny, Luis F. C. Figueredo, Nilanjan Chakraborty
Safety-critical whole-body robot control demands reactive methods that ensure
collision avoidance in real-time. Complementarity constraints and control
barrier functions (CBF) have emerged as core tools for ensuring such safety
constraints, and each represents a well-developed field. Despite addressing
similar problems, their connection remains largely unexplored. This paper
bridges this gap by formally proving the equivalence between these two
methodologies for sampled-data, first-order systems, considering both single
and multiple constraint scenarios. By demonstrating this equivalence, we
provide a unified perspective on these techniques. This unification has
theoretical and practical implications, facilitating the cross-application of
robustness guarantees and algorithmic improvements between complementarity and
CBF frameworks. We discuss these synergistic benefits and motivate future work
in the comparison of the methods in more general cases.
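For readers unfamiliar with one side of the equivalence, a minimal discrete-time CBF safety filter for a first-order system can be written in closed form when there is a single constraint. The obstacle geometry, class-K gain `alpha`, and point-mass dynamics below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def cbf_filter(x, u_des, obstacle, radius, alpha=1.0):
    """Project u_des onto {u : dh/dt >= -alpha * h(x)} for a single constraint.

    For first-order dynamics x_dot = u and h(x) = ||x - obstacle||^2 - radius^2,
    the CBF-QP with one affine constraint reduces to this closed-form projection.
    """
    d = x - obstacle
    h = np.dot(d, d) - radius**2      # h(x) >= 0 defines the safe set
    grad_h = 2.0 * d                  # dh/dx
    slack = np.dot(grad_h, u_des) + alpha * h
    if slack >= 0:                    # desired input already satisfies the constraint
        return u_des
    # Minimal correction along grad_h to make the constraint active
    return u_des - slack * grad_h / np.dot(grad_h, grad_h)

x = np.array([1.0, 0.0])
u_des = np.array([-2.0, 0.0])         # pushes straight at the obstacle
u_safe = cbf_filter(x, u_des, obstacle=np.zeros(2), radius=0.5)
```

The complementarity view of the same problem activates the constraint force only on contact; the paper's result is that, for sampled-data first-order systems, the two produce equivalent safe inputs.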
☆ Flying through cluttered and dynamic environments with LiDAR
Navigating unmanned aerial vehicles (UAVs) through cluttered and dynamic
environments remains a significant challenge, particularly when dealing with
fast-moving or suddenly appearing obstacles. This paper introduces a complete
LiDAR-based system designed to enable UAVs to avoid various moving obstacles in
complex environments. Benefiting from the high computational efficiency of
perception and planning, the system can operate in real time using onboard
computing resources with low latency. For dynamic environment perception, we
have integrated our previous work, M-detector, into the system. M-detector
ensures that moving objects of different sizes, colors, and types are reliably
detected. For dynamic environment planning, we incorporate dynamic object
predictions into the integrated planning and control (IPC) framework, namely
DynIPC. This integration allows the UAV to utilize predictions about dynamic
obstacles to effectively evade them. We validate our proposed system through
both simulations and real-world experiments. In simulation tests, our system
outperforms state-of-the-art baselines across several metrics, including
success rate, time consumption, average flight time, and maximum velocity. In
real-world trials, our system successfully navigates through forests, avoiding
moving obstacles along its path.
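The coupling of obstacle prediction with planning can be sketched with a constant-velocity prediction model checked against a candidate trajectory. The prediction model, clearance threshold, and 2-D setup are illustrative assumptions, not the DynIPC formulation.

```python
import numpy as np

def first_collision(traj, obs_pos, obs_vel, dt, clearance):
    """Return the first timestep where the UAV path meets the predicted
    obstacle position (constant-velocity forecast), or -1 if clear."""
    for t, p in enumerate(traj):
        predicted = obs_pos + obs_vel * (t * dt)
        if np.linalg.norm(p - predicted) < clearance:
            return t
    return -1

dt = 0.1
# UAV flying along +x at 1 m/s; obstacle approaching head-on at 1 m/s
traj = np.stack([np.array([0.1 * t, 0.0]) for t in range(50)])
t_hit = first_collision(traj, obs_pos=np.array([4.0, 0.0]),
                        obs_vel=np.array([-1.0, 0.0]), dt=dt, clearance=0.3)
```

A planner that only checks the obstacle's current position would see a 4 m gap; folding the forecast into the check exposes the head-on encounter, which is the information the planner needs to evade.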
☆ Object Pose Estimation by Camera Arm Control Based on the Next Viewpoint Estimation
We have developed a new method to estimate a Next Viewpoint (NV) which is
effective for pose estimation of simple-shaped products for product display
robots in retail stores. Pose estimation methods using Neural Networks (NN)
based on an RGBD camera are highly accurate, but their accuracy significantly
decreases when the camera acquires few texture and shape features at the
current viewpoint. However, it is difficult for previous mathematical
model-based methods to estimate an effective NV, because simple-shaped objects
have few shape features. We therefore focus on the relationship between pose
estimation and NV estimation: the more accurate the pose estimate, the more
accurate the NV estimate. Accordingly, we develop a new pose estimation NN
that estimates the NV simultaneously. Experimental results showed that our NV
estimation achieved a pose estimation success rate of 77.3%, which was 7.4
points higher than the mathematical model-based NV calculation. Moreover, we
verified that the robot using our method displayed 84.2% of products.
☆ Longitudinal Control for Autonomous Racing with Combustion Engine Vehicles
Usually, a controller for path- or trajectory tracking is employed in
autonomous driving. Typically, these controllers generate high-level commands
like longitudinal acceleration or force. However, vehicles with combustion
engines expect different actuation inputs. This paper proposes a longitudinal
control concept that translates high-level trajectory-tracking commands to the
required low-level vehicle commands such as throttle, brake pressure and a
desired gear. We chose a modular structure to easily integrate different
trajectory-tracking control algorithms and vehicles. The proposed control
concept enables a close tracking of the high-level control command. An
anti-lock braking system, traction control, and brake warmup control also
ensure a safe operation during real-world tests. We provide experimental
validation of our concept using real-world data with longitudinal accelerations
reaching up to $25 \, \frac{\mathrm{m}}{\mathrm{s}^2}$. The experiments were
conducted using the EAV24 racecar during the first event of the Abu Dhabi
Autonomous Racing League on the Yas Marina Formula 1 Circuit.
comment: 8 pages, 9 Figures
☆ Bias-Eliminated PnP for Stereo Visual Odometry: Provably Consistent and Large-Scale Localization
In this paper, we first present a bias-eliminated weighted (Bias-Eli-W)
perspective-n-point (PnP) estimator for stereo visual odometry (VO) with
provable consistency. Specifically, leveraging statistical theory, we develop
an asymptotically unbiased and $\sqrt{n}$-consistent PnP estimator that
accounts for varying 3D triangulation uncertainties, ensuring that the relative
pose estimate converges to the ground truth as the number of features
increases. Next, on the stereo VO pipeline side, we propose a framework that
continuously triangulates contemporary features for tracking new frames,
effectively decoupling temporal dependencies between pose and 3D point errors.
We integrate the Bias-Eli-W PnP estimator into the proposed stereo VO pipeline,
creating a synergistic effect that enhances the suppression of pose estimation
errors. We validate the performance of our method on the KITTI and Oxford
RobotCar datasets. Experimental results demonstrate that our method: 1)
achieves significant improvements in both relative pose error and absolute
trajectory error in large-scale environments; 2) provides reliable localization
under erratic and unpredictable robot motions. The successful implementation of
the Bias-Eli-W PnP in stereo VO indicates the importance of information
screening in robotic estimation tasks with high-uncertainty measurements,
shedding light on diverse applications where PnP is a key ingredient.
comment: 10 pages, 7 figures
☆ S2S-Net: Addressing the Domain Gap of Heterogeneous Sensor Systems in LiDAR-Based Collective Perception
Collective Perception (CP) has emerged as a promising approach to overcome
the limitations of individual perception in the context of autonomous driving.
Various approaches have been proposed to realize collective perception;
however, the Sensor2Sensor domain gap that arises from the utilization of
different sensor systems in Connected and Automated Vehicles (CAVs) remains
mostly unaddressed. This is primarily due to the paucity of datasets containing
heterogeneous sensor setups among the CAVs. The recently released SCOPE
datasets address this issue by providing data from three different LiDAR
sensors for each CAV. This study is the first to tackle the Sensor2Sensor
domain gap in vehicle-to-vehicle (V2V) collective perception. First, we present
our sensor-domain robust architecture S2S-Net. Then an in-depth analysis of the
Sensor2Sensor domain adaptation capabilities of S2S-Net on the SCOPE dataset is
conducted. S2S-Net demonstrates the capability to maintain very high
performance in unseen sensor domains and achieves state-of-the-art results on
the SCOPE dataset.
☆ Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot RSS
Yufeng Chi, Qiayuan Liao, Junfeng Long, Xiaoyu Huang, Sophia Shao, Borivoje Nikolic, Zhongyu Li, Koushil Sreenath
Despite significant interest and advancements in humanoid robotics, most
existing commercially available hardware remains high-cost, closed-source, and
non-transparent within the robotics community. This lack of accessibility and
customization hinders the growth of the field and the broader development of
humanoid technologies. To address these challenges and promote democratization
in humanoid robotics, we demonstrate Berkeley Humanoid Lite, an open-source
humanoid robot designed to be accessible, customizable, and beneficial for the
entire community. The core of this design is a modular 3D-printed gearbox for
the actuators and robot body. All components can be sourced from widely
available e-commerce platforms and fabricated using standard desktop 3D
printers, keeping the total hardware cost under $5,000 (based on U.S. market
prices). The design emphasizes modularity and ease of fabrication. To address
the inherent limitations of 3D-printed gearboxes, such as reduced strength and
durability compared to metal alternatives, we adopted a cycloidal gear design,
which provides an optimal form factor in this context. Extensive testing was
conducted on the 3D-printed actuators to validate their durability and
alleviate concerns about the reliability of plastic components. To demonstrate
the capabilities of Berkeley Humanoid Lite, we conducted a series of
experiments, including the development of a locomotion controller using
reinforcement learning. These experiments successfully showcased zero-shot
policy transfer from simulation to hardware, highlighting the platform's
suitability for research validation. By fully open-sourcing the hardware
design, embedded code, and training and deployment frameworks, we aim for
Berkeley Humanoid Lite to serve as a pivotal step toward democratizing the
development of humanoid robotics. All resources are available at
https://lite.berkeley-humanoid.org.
comment: Accepted in Robotics: Science and Systems (RSS) 2025
☆ Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives
Learning grinding skills from human craftsmen via imitation learning has
become a key research topic in robotic machining. Due to their strong
generalization and robustness to external disturbances, Dynamical Movement
Primitives (DMPs) offer a promising approach for robotic grinding skill
learning. However, directly applying DMPs to grinding tasks faces challenges,
such as low orientation accuracy, unsynchronized position-orientation-force,
and limited generalization for surface trajectories. To address these issues,
this paper proposes a robotic grinding skill learning method based on geodesic
length DMPs (Geo-DMPs). First, a normalized 2D weighted Gaussian kernel and
intrinsic mean clustering algorithm are developed to extract geometric features
from multiple demonstrations. Then, an orientation manifold distance metric
removes the time dependency in traditional orientation DMPs, enabling accurate
orientation learning via Geo-DMPs. A synchronization encoding framework is
further proposed to jointly model position, orientation, and force using a
geodesic length-based phase function. This framework enables robotic grinding
actions to be generated between any two surface points. Experiments on robotic
chamfer grinding and free-form surface grinding validate that the proposed
method achieves high geometric accuracy and generalization in skill encoding
and generation. To our knowledge, this is the first attempt to use DMPs for
jointly learning and generating grinding skills in position, orientation, and
force on model-free surfaces, offering a novel path for robotic grinding.
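The DMP machinery underlying the approach can be sketched for a single degree of freedom. This is the standard time-driven discrete DMP with a canonical system; the paper's contribution — a geodesic-length-based phase synchronizing position, orientation, and force — is replaced here by the generic phase variable, and all gains and basis counts are illustrative.

```python
import numpy as np

def dmp_rollout(y0, goal, weights, tau=1.0, dt=0.01, alpha=25.0, alpha_s=4.0):
    """Integrate a 1-DOF discrete DMP: critically damped spring to the goal,
    modulated by a learned forcing term that vanishes as the phase decays."""
    beta = alpha / 4.0                                  # critical damping
    centers = np.exp(-alpha_s * np.linspace(0, 1, len(weights)))
    widths = 1.0 / (np.diff(centers, append=centers[-1])**2 + 1e-6)
    y, yd, s = float(y0), 0.0, 1.0
    traj = []
    for _ in range(int(1.0 / dt)):
        psi = np.exp(-widths * (s - centers)**2)        # radial basis activations
        f = s * (goal - y0) * np.dot(psi, weights) / (psi.sum() + 1e-10)
        ydd = alpha * (beta * (goal - y) - yd) + f
        yd += ydd * dt / tau
        y += yd * dt / tau
        s += -alpha_s * s * dt / tau                    # canonical system: s goes 1 -> 0
        traj.append(y)
    return np.array(traj)

# With zero forcing weights the DMP reduces to a critically damped reach to the goal
traj = dmp_rollout(y0=0.0, goal=1.0, weights=np.zeros(10))
```

Replacing the time-driven phase `s` with a monotone function of geodesic length along the surface, as Geo-DMPs do, is what removes the time dependency and ties the three modalities to surface progress.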
☆ Simultaneous Collision Detection and Force Estimation for Dynamic Quadrupedal Locomotion
In this paper we address the simultaneous collision detection and force
estimation problem for quadrupedal locomotion using joint encoder information
and the robot dynamics only. We design an interacting multiple-model Kalman
filter (IMM-KF) that estimates the external force exerted on the robot and
multiple possible contact modes. The method is invariant to any gait pattern
design. Our approach leverages pseudo-measurement information of the external
forces based on the robot dynamics and encoder information. Based on the
estimated contact mode and external force, we design a reflex motion and an
admittance controller for the swing leg to avoid collisions by adjusting the
leg's reference motion. Additionally, we implement a force-adaptive model
predictive controller to enhance balancing. Simulation ablation studies and
experiments show the efficacy of the approach.
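The IMM idea — maintaining probabilities over discrete contact hypotheses and updating them with a Bayes step — can be shown in miniature for two modes and a scalar force-residual pseudo-measurement. The transition matrix, noise levels, and expected contact force below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def imm_step(mu, transition, likelihoods):
    """One IMM mode-probability cycle: Markov prediction, then Bayes update."""
    mu_pred = transition.T @ mu           # predicted mode probabilities
    mu_post = likelihoods * mu_pred       # weight by measurement likelihoods
    return mu_post / mu_post.sum()

def gaussian_lik(r, sigma):
    """Gaussian likelihood of residual r under a zero-mean model with std sigma."""
    return np.exp(-0.5 * (r / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

transition = np.array([[0.95, 0.05],      # sticky switching between
                       [0.10, 0.90]])     # "no contact" and "contact"
mu = np.array([0.9, 0.1])                 # prior: probably no contact
residual = 8.0                            # observed external-force residual [N]
liks = np.array([gaussian_lik(residual, 1.0),          # "no contact" expects ~0 N
                 gaussian_lik(residual - 10.0, 3.0)])  # "contact" expects ~10 N
mu = imm_step(mu, transition, liks)
```

A single large residual flips the posterior to the contact mode, which is the trigger the swing-leg reflex and admittance controller act on.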
☆ MAT-DiSMech: A Discrete Differential Geometry-based Computational Tool for Simulation of Rods, Shells, and Soft Robots
Accurate and efficient simulation tools are essential in robotics, enabling
the visualization of system dynamics and the validation of control laws before
committing resources to physical experimentation. Developing physically
accurate simulation tools is particularly challenging in soft robotics, largely
due to the prevalence of geometrically nonlinear deformation. A variety of
robot simulators tackle this challenge by using simplified modeling techniques
-- such as lumped mass models -- which lead to physical inaccuracies in
real-world applications. On the other hand, high-fidelity simulation methods
for soft structures, like finite element analysis, offer increased accuracy but
lead to higher computational costs. In light of this, we present a Discrete
Differential Geometry-based simulator that provides a balance between physical
accuracy and computational speed. Building on an extensive body of research on
rod and shell-based representations of soft robots, our tool provides a pathway
to accurately model soft robots in a computationally tractable manner. Our
open-source MATLAB-based framework is capable of simulating the deformations of
rods, shells, and their combinations, primarily utilizing implicit integration
techniques. The software design is modular for the user to customize the code,
for example, add new external forces and impose custom boundary conditions. The
implementations for prevalent forces encountered in robotics, including
gravity, contact, kinetic and viscous friction, and aerodynamic drag, have been
provided. We provide several illustrative examples that showcase the
capabilities and validate the physical accuracy of the simulator. The
open-source code is available at
https://github.com/StructuresComp/dismech-matlab. We anticipate that the
proposed simulator can serve as an effective digital twin tool, enhancing the
Sim2Real pathway in soft robotics research.
comment: Total 25 pages, 8 figures, open-source code available at
https://github.com/StructuresComp/dismech-matlab
☆ AUTHENTICATION: Identifying Rare Failure Modes in Autonomous Vehicle Perception Systems using Adversarially Guided Diffusion Models
Autonomous Vehicles (AVs) rely on artificial intelligence (AI) to accurately
detect objects and interpret their surroundings. However, even when trained
using millions of miles of real-world data, AVs are often unable to detect rare
failure modes (RFMs). The problem of RFMs is commonly referred to as the
"long-tail challenge", because the data distribution includes many instances
that occur only rarely. In this paper, we present a novel approach that
utilizes advanced generative and explainable AI techniques to aid in
understanding RFMs. Our methods can be used to enhance the robustness and
reliability of AVs when combined with both downstream model training and
testing. We extract segmentation masks for objects of interest (e.g., cars) and
invert them to create environmental masks. These masks, combined with carefully
crafted text prompts, are fed into a custom diffusion model. We leverage the
Stable Diffusion inpainting model guided by adversarial noise optimization to
generate images containing diverse environments designed to evade object
detection models and expose vulnerabilities in AI systems. Finally, we produce
natural language descriptions of the generated RFMs that can guide developers
and policymakers to improve the safety and reliability of AV systems.
comment: 8 pages, 10 figures. Accepted to IEEE Conference on Artificial
Intelligence (CAI), 2025
☆ Advancing Frontiers of Path Integral Theory for Stochastic Optimal Control
Stochastic Optimal Control (SOC) problems arise in systems influenced by
uncertainty, such as autonomous robots or financial models. Traditional methods
like dynamic programming are often intractable for high-dimensional, nonlinear
systems due to the curse of dimensionality. This dissertation explores the path
integral control framework as a scalable, sampling-based alternative. By
reformulating SOC problems as expectations over stochastic trajectories, it
enables efficient policy synthesis via Monte Carlo sampling and supports
real-time implementation through GPU parallelization.
We apply this framework to six classes of SOC problems: Chance-Constrained
SOC, Stochastic Differential Games, Deceptive Control, Task Hierarchical
Control, Risk Mitigation of Stealthy Attacks, and Discrete-Time LQR. A sample
complexity analysis for the discrete-time case is also provided. These
contributions establish a foundation for simulator-driven autonomy in complex,
uncertain environments.
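The core path-integral step — reformulating the control update as an exponentially weighted expectation over sampled trajectories — can be sketched MPPI-style for a 1-D double integrator. The dynamics, cost, temperature `lam`, and noise scale are illustrative, and the toy loop below stands in for the GPU-parallel sampling the dissertation describes.

```python
import numpy as np

def path_integral_update(u_nom, x0, horizon, samples, lam, sigma, dt, rng):
    """One path-integral control update: roll out noisy controls, then take the
    exponentially cost-weighted average of the perturbations."""
    noise = sigma * rng.standard_normal((samples, horizon))
    costs = np.zeros(samples)
    for k in range(samples):
        x, v = x0
        for t in range(horizon):
            a = u_nom[t] + noise[k, t]
            v += a * dt
            x += v * dt
            costs[k] += x**2 + 0.1 * v**2        # drive state to the origin
    w = np.exp(-(costs - costs.min()) / lam)     # softmin importance weights
    w /= w.sum()
    return u_nom + w @ noise                     # weighted noise average

rng = np.random.default_rng(0)
u = np.zeros(20)
for _ in range(5):                               # a few optimization iterations
    u = path_integral_update(u, x0=(1.0, 0.0), horizon=20, samples=256,
                             lam=1.0, sigma=0.5, dt=0.1, rng=rng)
```

Because each sampled rollout is independent, the outer loop over `samples` is exactly the part that parallelizes across GPU threads in a real-time implementation.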
♻ ☆ Demonstrating CavePI: Autonomous Exploration of Underwater Caves by Semantic Guidance
Enabling autonomous robots to safely and efficiently navigate, explore, and
map underwater caves is of significant importance to water resource management,
hydrogeology, archaeology, and marine robotics. In this work, we demonstrate
the system design and algorithmic integration of a visual servoing framework
for semantically guided autonomous underwater cave exploration. We present the
hardware and edge-AI design considerations to deploy this framework on a novel
AUV (Autonomous Underwater Vehicle) named CavePI. The guided navigation is
driven by a computationally light yet robust deep visual perception module,
delivering a rich semantic understanding of the environment. Subsequently, a
robust control mechanism enables CavePI to track the semantic guides and
navigate within complex cave structures. We evaluate the system through field
experiments in natural underwater caves and spring-water sites and further
validate its ROS (Robot Operating System)-based digital twin in a simulation
environment. Our results highlight how these integrated design choices
facilitate reliable navigation under feature-deprived, GPS-denied, and
low-visibility conditions.
comment: V4, 17 pages
♻ ☆ Efficient Iterative Proximal Variational Inference Motion Planning
Motion planning under uncertainty can be cast as a stochastic optimal control
problem where the optimal posterior distribution has an explicit form. To
approximate this posterior, this work frames an optimization problem in the
space of Gaussian distributions by solving a Variational Inference (VI) in the
path distribution space. For linear-Gaussian stochastic dynamics, we propose a
proximal algorithm to solve for an optimal Gaussian proposal iteratively. The
computational bottleneck is evaluating the gradients with respect to the
proposal over a dense trajectory. We exploit the sparse motion planning factor
graph and Gaussian Belief Propagation (GBP), allowing for parallel computing of
these gradients on Graphics Processing Units (GPUs). We term this novel
paradigm Parallel Gaussian Variational Inference Motion Planning (P-GVIMP).
Building on the efficient algorithm for linear Gaussian systems, we then
propose an iterative paradigm based on Statistical Linear Regression (SLR)
techniques to solve motion planning for nonlinear stochastic systems, where the
P-GVIMP serves as a sub-routine for the linearized time-varying system. We
validate the proposed framework on various robotic systems, demonstrating
significant speedups achieved by leveraging parallel computation and
successful planning solutions for nonlinear systems under uncertainty. An
open-sourced implementation is presented at https://github.com/hzyu17/VIMP.
comment: 13 pages
♻ ☆ Robot Pouring: Identifying Causes of Spillage and Selecting Alternative Action Parameters Using Probabilistic Actual Causation
In everyday life, we perform tasks (e.g., cooking or cleaning) that involve a
large variety of objects and goals. When confronted with an unexpected or
unwanted outcome, we take corrective actions and try again until achieving the
desired result. The reasoning performed to identify a cause of the observed
outcome and to select an appropriate corrective action is a crucial aspect of
human reasoning for successful task execution. Central to this reasoning is the
assumption that a factor is responsible for producing the observed outcome. In
this paper, we investigate the use of probabilistic actual causation to
determine whether a factor is the cause of an observed undesired outcome.
Furthermore, we show how the actual causation probabilities can be used to find
alternative actions to change the outcome. We apply the probabilistic actual
causation analysis to a robot pouring task. When spillage occurs, the analysis
indicates whether a task parameter is the cause and how it should be changed to
avoid spillage. The analysis requires a causal graph of the task and the
corresponding conditional probability distributions. To fulfill these
requirements, we perform a complete causal modeling procedure (i.e., task
analysis, definition of variables, determination of the causal graph structure,
and estimation of conditional probability distributions) using data from a
realistic simulation of the robot pouring task, covering a large combinatorial
space of task parameters. Based on the results, we discuss the implications of
the variables' representation and how the alternative actions suggested by the
actual causation analysis would compare to the alternative solutions proposed
by a human observer. The practical use of the analysis of probabilistic actual
causation to select alternative action parameters is demonstrated.
comment: 20 pages, 13 figures
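The analysis step — comparing the probability of the undesired outcome under the observed parameter value against an alternative, given a causal model — can be shown with a tiny interventional query. The conditional probability table and variable names (pouring speed, fill level) are illustrative assumptions, not the paper's learned model.

```python
import numpy as np

# Illustrative P(spill | pouring_speed, cup_fill):
# rows = speed (0 = slow, 1 = fast), cols = fill level (0 = low, 1 = high)
p_spill = np.array([[0.05, 0.20],
                    [0.30, 0.80]])
p_fill = np.array([0.5, 0.5])            # marginal over the fill level

def p_spill_do(speed):
    """P(spillage | do(speed)), marginalizing over the fill level."""
    return float(p_spill[speed] @ p_fill)

observed_speed = 1                        # fast pouring; spillage was observed
alternative = 0                           # candidate corrective action: pour slowly
drop = p_spill_do(observed_speed) - p_spill_do(alternative)
```

A large `drop` indicates the parameter is a strong candidate cause and that the alternative value is a promising corrective action; the paper's analysis performs this kind of comparison over the full causal graph rather than a single table.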
♻ ☆ Deployment-friendly Lane-changing Intention Prediction Powered by Brain-inspired Spiking Neural Networks
Accurate and real-time prediction of surrounding vehicles' lane-changing
intentions is a critical challenge in deploying safe and efficient autonomous
driving systems in open-world scenarios. Existing high-performing methods
remain hard to deploy due to their high computational cost, long training
times, and excessive memory requirements. Here, we propose an efficient
lane-changing intention prediction approach based on brain-inspired Spiking
Neural Networks (SNN). By leveraging the event-driven nature of SNN, the
proposed approach enables us to encode the vehicle's states in a more efficient
manner. Comparison experiments conducted on HighD and NGSIM datasets
demonstrate that our method significantly improves training efficiency and
reduces deployment costs while maintaining comparable prediction accuracy.
Particularly, compared to the baseline, our approach reduces training time by
75% and memory usage by 99.9%. These results validate the efficiency and
reliability of our method in lane-changing predictions, highlighting its
potential for safe and efficient autonomous driving systems while offering
significant advantages in deployment, including reduced training time, lower
memory usage, and faster inference.
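The event-driven state encoding the abstract refers to can be sketched with leaky integrate-and-fire (LIF) neurons turning continuous vehicle features into spike trains. The membrane decay, threshold, and feature values are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def lif_encode(values, steps=50, tau=0.9, threshold=1.0):
    """Encode each input value as a spike train: larger inputs charge the
    membrane faster and therefore fire more often."""
    v = np.zeros_like(values, dtype=float)
    spikes = np.zeros((steps, len(values)))
    for t in range(steps):
        v = tau * v + values        # leaky integration of the input current
        fired = v >= threshold
        spikes[t] = fired
        v[fired] = 0.0              # reset membrane after a spike
    return spikes

# Normalized vehicle-state features (e.g., lateral offset, relative speed, heading)
state = np.array([0.2, 0.5, 0.9])
trains = lif_encode(state)
rates = trains.mean(axis=0)         # firing rates preserve the feature ordering
```

Because computation happens only at spikes, downstream layers process sparse binary events instead of dense activations, which is where the training-time and memory savings come from.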
♻ ☆ Learning Type-Generalized Actions for Symbolic Planning IROS
Symbolic planning is a powerful technique to solve complex tasks that require
long sequences of actions and can equip an intelligent agent with complex
behavior. The downside of this approach is the necessity for suitable symbolic
representations describing the state of the environment as well as the actions
that can change it. Traditionally such representations are carefully
hand-designed by experts for distinct problem domains, which limits their
transferability to different problems and environment complexities. In this
paper, we propose a novel concept to generalize symbolic actions using a given
entity hierarchy and observed similar behavior. In a simulated grid-based
kitchen environment, we show that type-generalized actions can be learned from
few observations and generalize to novel situations. Incorporating an
additional on-the-fly generalization mechanism during planning, unseen task
combinations, involving longer sequences, novel entities and unexpected
environment behavior, can be solved.
comment: IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS) 2023
♻ ☆ CoPAL: Corrective Planning of Robot Actions with Large Language Models ICRA
Frank Joublin, Antonello Ceravola, Pavel Smirnov, Felix Ocker, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Stephan Hasler, Daniel Tanneberg, Michael Gienger
In the pursuit of fully autonomous robotic systems capable of taking over
tasks traditionally performed by humans, the complexity of open-world
environments poses a considerable challenge. Addressing this imperative, this
study contributes to the field of Large Language Models (LLMs) applied to task
and motion planning for robots. We propose a system architecture that
orchestrates a seamless interplay between multiple cognitive levels,
encompassing reasoning, planning, and motion generation. At its core lies a
novel replanning strategy that handles physically grounded, logical, and
semantic errors in the generated plans. We demonstrate the efficacy of the
proposed feedback architecture, particularly its impact on executability,
correctness, and time complexity via empirical evaluation in the context of a
simulation and two intricate real-world scenarios: blocks world, barman, and
pizza preparation.
comment: IEEE International Conference on Robotics and Automation (ICRA) 2024
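The replanning strategy described in this abstract can be sketched as a plan-verify-replan loop: a grounded checker rejects plans with physical errors and feeds the error text back to the planner. The scenario, the `mock_planner` stand-in for the LLM, and the string-based action encoding below are all illustrative assumptions, not CoPAL's implementation.

```python
def mock_planner(goal, feedback=None):
    """Hypothetical stand-in for an LLM planner: proposes a plan and,
    once it receives error feedback, returns a corrected one."""
    if feedback is None:
        return ["pour(bottle, glass)"]
    return ["open(bottle)", "pour(bottle, glass)"]

def check(plan, state):
    """Grounded verifier: returns None if the plan is executable,
    otherwise a textual error that is fed back to the planner."""
    s = dict(state)
    for step in plan:
        if step == "open(bottle)":
            s["bottle_open"] = True
        elif step == "pour(bottle, glass)" and not s["bottle_open"]:
            return f"physical error at '{step}': the bottle is closed"
    return None

def plan_with_feedback(goal, state, max_rounds=3):
    """Corrective replanning loop: plan, verify, replan on error."""
    feedback = None
    for _ in range(max_rounds):
        plan = mock_planner(goal, feedback)
        feedback = check(plan, state)
        if feedback is None:
            return plan
    raise RuntimeError("no executable plan within the round budget")

print(plan_with_feedback("serve water", {"bottle_open": False}))
# -> ['open(bottle)', 'pour(bottle, glass)']
```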
♻ ☆ To Help or Not to Help: LLM-based Attentive Support for Human-Robot Group Interactions IROS
Daniel Tanneberg, Felix Ocker, Stephan Hasler, Joerg Deigmoeller, Anna Belardinelli, Chao Wang, Heiko Wersing, Bernhard Sendhoff, Michael Gienger
How can a robot provide unobtrusive physical support within a group of
humans? We present Attentive Support, a novel interaction concept for robots to
support a group of humans. It combines scene perception, dialogue acquisition,
situation understanding, and behavior generation with the common-sense
reasoning capabilities of Large Language Models (LLMs). In addition to
following user instructions, Attentive Support is capable of deciding when and
how to support the humans, and when to remain silent so as not to disturb the
group. With a diverse set of scenarios, we show and evaluate the robot's
attentive behavior, which supports and helps the humans when required while not
disturbing them when no help is needed.
comment: IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS) 2024
♻ ☆ DYNUS: Uncertainty-aware Trajectory Planner in Dynamic Unknown Environments
Kota Kondo, Mason Peterson, Nicholas Rober, Juan Rached Viso, Lucas Jia, Jialin Chen, Harvey Merton, Jonathan P. How
This paper introduces DYNUS, an uncertainty-aware trajectory planner designed
for dynamic unknown environments. Operating in such settings presents many
challenges -- most notably, because the agent cannot predict the ground-truth
future paths of obstacles, a previously planned trajectory can become unsafe at
any moment, requiring rapid replanning to avoid collisions.
Recently developed planners have used soft-constraint approaches to achieve
the necessary fast computation times; however, these methods do not guarantee
collision-free paths even with static obstacles. In contrast, hard-constraint
methods ensure collision-free safety, but typically have longer computation
times.
To address these issues, we propose three key contributions. First, the DYNUS
Global Planner (DGP) and Temporal Safe Corridor Generation operate in
spatio-temporal space and handle both static and dynamic obstacles in the 3D
environment. Second, the Safe Planning Framework leverages a combination of
exploratory, safe, and contingency trajectories to flexibly re-route when
potential future collisions with dynamic obstacles are detected. Finally, the
Fast Hard-Constraint Local Trajectory Formulation uses a variable elimination
approach to reduce the problem size and enable faster computation by
pre-computing dependencies between free and dependent variables while still
ensuring collision-free trajectories.
We evaluated DYNUS in a variety of simulations, including dense forests,
confined office spaces, cave systems, and dynamic environments. Our experiments
show that DYNUS achieves a success rate of 100% and travel times that are
approximately 25.0% faster than state-of-the-art methods. We also evaluated
DYNUS on multiple platforms -- a quadrotor, a wheeled robot, and a quadruped --
in both simulation and hardware experiments.
comment: 20 pages, 30 figures, Under review at IEEE Transactions on Robotics
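The variable-elimination idea mentioned in this abstract, substituting dependent variables out of the equality constraints so the optimizer only sees the free ones, can be shown on a toy equality-constrained quadratic program. The 3-variable problem below is an illustration, not the DYNUS trajectory formulation.

```python
import numpy as np

# Toy QP: minimize ||x||^2 subject to A x = b.
A = np.array([[1.0, 1.0, 1.0],    # x0 + x1 + x2 = 1
              [1.0, 0.0, -1.0]])  # x0 - x2 = 0
b = np.array([1.0, 0.0])

free = [0]     # keep x0 as the free variable
dep = [1, 2]   # express x1, x2 in terms of x0

Ad = A[:, dep]                    # square and invertible here
Af = A[:, free]
# x_dep = Ad^{-1} (b - Af x_free) = c + M x_free
c = np.linalg.solve(Ad, b)
M = -np.linalg.solve(Ad, Af)

# Reduced objective ||x_f||^2 + ||c + M x_f||^2 is unconstrained:
# setting its gradient to zero gives (I + M^T M) x_f = -M^T c.
H = np.eye(len(free)) + M.T @ M
xf = np.linalg.solve(H, -M.T @ c)

# Recover the full, constraint-satisfying solution.
xd = c + M @ xf
x = np.empty(3)
x[free] = xf
x[dep] = xd
print(x)  # -> approximately [1/3, 1/3, 1/3]
```

Only the free variables enter the solve, so the problem shrinks by the number of equality constraints, which is the source of the speedup the abstract refers to.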
♻ ☆ Empirical Comparison of Four Stereoscopic Depth Sensing Cameras for Robotics Applications
Depth sensing is an essential technology in robotics and many other fields.
Many depth sensing (or RGB-D) cameras are available on the market and selecting
the best one for your application can be challenging. In this work, we tested
four stereoscopic RGB-D cameras that sense the distance by using two images
from slightly different views. We empirically compared four cameras (Intel
RealSense D435, Intel RealSense D455, StereoLabs ZED 2, and Luxonis OAK-D Pro)
in three scenarios: (i) planar surface perception, (ii) plastic doll
perception, and (iii) household object perception (YCB dataset). We recorded and
evaluated more than 3,000 RGB-D frames for each camera. For table-top robotics
scenarios with distance to objects up to one meter, the best performance is
provided by the D435 camera that is able to perceive with an error under 1 cm
in all of the tested scenarios. For longer distances, the other three models
perform better, making them more suitable for some mobile robotics
applications. OAK-D Pro additionally offers integrated AI modules (e.g., object
and human keypoint detection). Overall, the ZED 2 is the best camera, keeping
the error under 3 cm even at 4 meters. However, it is not a standalone
device and requires a computer with a GPU for depth data acquisition. All data
(more than 12,000 RGB-D frames) are publicly available at
https://rustlluk.github.io/rgbd-comparison.
♻ ☆ Dexterous Manipulation through Imitation Learning: A Survey
Shan An, Ziyu Meng, Chao Tang, Yuning Zhou, Tengyu Liu, Fangqiang Ding, Shufang Zhang, Yao Mu, Ran Song, Wei Zhang, Zeng-Guang Hou, Hong Zhang
Dexterous manipulation, which refers to the ability of a robotic hand or
multi-fingered end-effector to skillfully control, reorient, and manipulate
objects through precise, coordinated finger movements and adaptive force
modulation, enables complex interactions similar to human hand dexterity. With
recent advances in robotics and machine learning, there is a growing demand for
these systems to operate in complex and unstructured environments. Traditional
model-based approaches struggle to generalize across tasks and object
variations due to the high dimensionality and complex contact dynamics of
dexterous manipulation. Although model-free methods such as reinforcement
learning (RL) show promise, they require extensive training, large-scale
interaction data, and carefully designed rewards for stability and
effectiveness. Imitation learning (IL) offers an alternative by allowing robots
to acquire dexterous manipulation skills directly from expert demonstrations,
capturing fine-grained coordination and contact dynamics while bypassing the
need for explicit modeling and large-scale trial-and-error. This survey
provides an overview of dexterous manipulation methods based on imitation
learning, details recent advances, and addresses key challenges in the field.
Additionally, it explores potential research directions to enhance IL-driven
dexterous manipulation. Our goal is to offer researchers and practitioners a
comprehensive introduction to this rapidly evolving domain.
comment: 22 pages, 5 figures
♻ ☆ MUVO: A Multimodal Generative World Model for Autonomous Driving with Geometric Representations
World models for autonomous driving have the potential to dramatically
improve the reasoning capabilities of today's systems. However, most works
focus on camera data, with only a few that leverage lidar data or combine both
to better represent autonomous vehicle sensor setups. In addition, raw sensor
predictions are less actionable than 3D occupancy predictions, but there are no
works examining the effects of combining both multimodal sensor data and 3D
occupancy prediction. In this work, we perform a set of experiments with a
MUltimodal World Model with Geometric VOxel representations (MUVO) to evaluate
different sensor fusion strategies to better understand the effects on sensor
data prediction. We also analyze potential weaknesses of current sensor fusion
approaches and examine the benefits of additionally predicting 3D occupancy.
comment: Daniel Bogdoll and Yitian Yang contributed equally. Accepted for
publication at IV 2025
♻ ☆ Latent Representations for Visual Proprioception in Inexpensive Robots
Robotic manipulation requires explicit or implicit knowledge of the robot's
joint positions. Precise proprioception is standard in high-quality industrial
robots but is often unavailable in inexpensive robots operating in unstructured
environments. In this paper, we ask: to what extent can a fast, single-pass
regression architecture perform visual proprioception from a single external
camera image, available even in the simplest manipulation settings? We explore
several latent representations, including CNNs, VAEs, ViTs, and bags of
uncalibrated fiducial markers, using fine-tuning techniques adapted to the
limited data available. We evaluate the achievable accuracy through experiments
on an inexpensive 6-DoF robot.
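The regression step at the heart of this abstract, mapping a latent image representation to joint angles in a single pass, can be sketched with ridge regression on synthetic latents. The latent dimensionality, the linear model, and the data below are assumptions for illustration; the paper evaluates learned representations such as CNNs, VAEs, and ViTs.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 16))        # latent codes of 200 training images
W_true = rng.normal(size=(16, 6))     # hypothetical latent -> 6-DoF mapping
Q = Z @ W_true + 0.01 * rng.normal(size=(200, 6))  # noisy joint angles

# Ridge regression: W = (Z^T Z + lam I)^{-1} Z^T Q
lam = 1e-3
W = np.linalg.solve(Z.T @ Z + lam * np.eye(16), Z.T @ Q)

# Proprioception from a new image is then a single matrix product.
z_new = rng.normal(size=16)
q_pred = z_new @ W
err = np.abs(q_pred - z_new @ W_true).max()
print(f"max joint-angle error: {err:.4f} rad")
```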
♻ ☆ Label-Free Model Failure Detection for Lidar-based Point Cloud Segmentation
Autonomous vehicles drive millions of miles on the road each year. Under such
circumstances, deployed machine learning models are prone to failure both in
seemingly normal situations and in the presence of outliers. However, in the
training phase, they are only evaluated on small validation and test sets,
which are unable to reveal model failures due to their limited scenario
coverage. While it is difficult and expensive to acquire large and
representative labeled datasets for evaluation, large-scale unlabeled datasets
are typically available. In this work, we introduce label-free model failure
detection for lidar-based point cloud segmentation, taking advantage of the
abundance of unlabeled data available. We leverage different data
characteristics by training a supervised and self-supervised stream for the
same task to detect failure modes. We perform a large-scale qualitative
analysis and present LidarCODA, the first publicly available dataset with
labeled anomalies in real-world lidar data, for an extensive quantitative
analysis.
comment: Daniel Bogdoll, Finn Sartoris, and Vincent Geppert contributed
equally. Accepted for publication at IV 2025
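The two-stream idea described in this abstract lends itself to a simple disagreement test: when the supervised and self-supervised predictions diverge strongly on the same unlabeled frame, the frame is flagged as a likely failure case. The threshold, label format, and flagging rule below are illustrative assumptions, not the paper's exact detection criterion.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of points on which the two streams disagree."""
    assert len(labels_a) == len(labels_b)
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def flag_failures(frames, threshold=0.2):
    """frames: list of (supervised_labels, self_supervised_labels) pairs.
    Returns the indices of frames whose disagreement exceeds the threshold."""
    return [i for i, (a, b) in enumerate(frames)
            if disagreement_rate(a, b) > threshold]

frames = [
    (["road"] * 10,              ["road"] * 10),   # streams agree: fine
    (["road"] * 5 + ["car"] * 5, ["road"] * 10),   # 50% disagreement
]
print(flag_failures(frames))  # -> [1]
```

No ground-truth labels enter the computation, which is what makes the approach applicable to large unlabeled fleets of drive data.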
♻ ☆ QUART-Online: Latency-Free Large Multimodal Language Model for Quadruped Robot Learning ICRA 2025
Xinyang Tong, Pengxiang Ding, Yiguo Fan, Donglin Wang, Wenjie Zhang, Can Cui, Mingyang Sun, Han Zhao, Hongyin Zhang, Yonghao Dang, Siteng Huang, Shangke Lyu
This paper addresses the inherent inference latency challenges associated
with deploying multimodal large language models (MLLM) in quadruped
vision-language-action (QUAR-VLA) tasks. Our investigation reveals that
conventional parameter reduction techniques ultimately impair the performance
of the language foundation model during the action instruction tuning phase,
making them unsuitable for this purpose. We introduce a novel latency-free
quadruped MLLM, dubbed QUART-Online, designed to enhance inference
efficiency without degrading the performance of the language foundation model.
By incorporating Action Chunk Discretization (ACD), we compress the original
action representation space, mapping continuous action values onto a smaller
set of discrete representative vectors while preserving critical information.
Subsequently, we fine-tune the MLLM to integrate vision, language, and
compressed actions into a unified semantic space. Experimental results
demonstrate that QUART-Online operates in tandem with the existing MLLM system,
achieving real-time inference in sync with the underlying controller frequency,
significantly boosting the success rate across various tasks by 65%. Our
project page is https://quart-online.github.io.
comment: Accepted to ICRA 2025; Github page: https://quart-online.github.io
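The compression step described in this abstract, mapping continuous action chunks onto a small set of discrete representative vectors, can be sketched with nearest-centroid quantization. The k-means procedure, codebook size, and chunk dimensions below are stand-in assumptions; the abstract does not specify how QUART-Online's ACD codebook is built.

```python
import numpy as np

rng = np.random.default_rng(1)
chunks = rng.uniform(-1, 1, size=(500, 8))   # 500 continuous action chunks

def kmeans(X, k=16, iters=20):
    """Plain k-means: learn k representative vectors and assign each
    chunk to its nearest one."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

codebook, tokens = kmeans(chunks)
# Each continuous chunk is now one discrete token the MLLM can emit;
# decoding looks the token up in the codebook.
recon = codebook[tokens]
print("mean quantization error:",
      np.linalg.norm(chunks - recon, axis=1).mean())
```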
♻ ☆ Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
Learning robust and generalizable world models is crucial for enabling
efficient and scalable robotic control in real-world environments. In this
work, we introduce a novel framework for learning world models that accurately
capture complex, partially observable, and stochastic dynamics. The proposed
method employs a dual-autoregressive mechanism and self-supervised training to
achieve reliable long-horizon predictions without relying on domain-specific
inductive biases, ensuring adaptability across diverse robotic tasks. We
further propose a policy optimization framework that leverages world models for
efficient training in imagined environments and seamless deployment in
real-world systems. This work advances model-based reinforcement learning by
addressing the challenges of long-horizon prediction, error accumulation, and
sim-to-real transfer. By providing a scalable and robust framework, the
introduced methods pave the way for adaptive and efficient robotic systems in
real-world applications.
♻ ☆ NGM-SLAM: Gaussian Splatting SLAM with Radiance Field Submap
SLAM systems based on Gaussian Splatting have garnered attention due to their
capabilities for rapid real-time rendering and high-fidelity mapping. However,
current Gaussian Splatting SLAM systems usually struggle with large scene
representation and lack effective loop closure detection. To address these
issues, we introduce NGM-SLAM, the first 3DGS-based SLAM system that utilizes
neural radiance field submaps for progressive scene expression, effectively
integrating the strengths of neural radiance fields and 3D Gaussian Splatting.
We utilize neural radiance field submaps as supervision and achieve
high-quality scene expression and online loop closure adjustments through
Gaussian rendering of fused submaps. Our results on multiple real-world scenes
and large-scale scene datasets demonstrate that our method can achieve accurate
hole filling and high-quality scene expression, supporting monocular, stereo,
and RGB-D inputs, and achieving state-of-the-art scene reconstruction and
tracking performance.
comment: 9 pages, 4 figures
♻ ☆ MARFT: Multi-Agent Reinforcement Fine-Tuning
LLM-based Multi-Agent Systems (LaMAS) have demonstrated remarkable capabilities
in addressing complex, agentic tasks requiring multifaceted reasoning and
collaboration, from generating high-quality presentation slides to conducting
sophisticated scientific research. Meanwhile, RL has been widely recognized for
its effectiveness in enhancing agent intelligence, but limited research has
investigated the fine-tuning of LaMAS using foundational RL techniques.
Moreover, the direct application of MARL methodologies to LaMAS introduces
significant challenges, stemming from the unique characteristics and mechanisms
inherent to LaMAS. To address these challenges, this article presents a
comprehensive study of LLM-based MARL and proposes a novel paradigm termed
Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a universal
algorithmic framework tailored for LaMAS, outlining the conceptual foundations,
key distinctions, and practical implementation strategies. We begin by
reviewing the evolution from RL to Reinforcement Fine-Tuning, setting the stage
for a parallel analysis in the multi-agent domain. In the context of LaMAS, we
elucidate critical differences between MARL and MARFT. These differences
motivate a transition toward a novel, LaMAS-oriented formulation of RFT.
Central to this work is the presentation of a robust and scalable MARFT
framework. We detail the core algorithm and provide a complete, open-source
implementation to facilitate adoption and further research. The latter sections
of the paper explore real-world application perspectives and open challenges
in MARFT. By bridging theoretical underpinnings with practical methodologies,
this work aims to serve as a roadmap for researchers seeking to advance MARFT
toward resilient and adaptive solutions in agentic systems. Our implementation
of the proposed framework is publicly available at:
https://github.com/jwliao-ai/MARFT.
comment: 36 pages
♻ ☆ Fast Online Adaptive Neural MPC via Meta-Learning
Data-driven model predictive control (MPC) has demonstrated significant
potential for improving robot control performance in the presence of model
uncertainties. However, existing approaches often require extensive offline
data collection and computationally intensive training, limiting their ability
to adapt online. To address these challenges, this paper presents a fast online
adaptive MPC framework that leverages neural networks integrated with
Model-Agnostic Meta-Learning (MAML). Our approach focuses on few-shot
adaptation of residual dynamics - capturing the discrepancy between nominal and
true system behavior - using minimal online data and gradient steps. By
embedding these meta-learned residual models into a computationally efficient
L4CasADi-based MPC pipeline, the proposed method enables rapid model
correction, enhances predictive accuracy, and improves real-time control
performance. We validate the framework through simulation studies on a Van der
Pol oscillator, a Cart-Pole system, and a 2D quadrotor. Results show
significant gains in adaptation speed and prediction accuracy over both nominal
MPC and nominal MPC augmented with a freshly initialized neural network,
underscoring the effectiveness of our approach for real-time adaptive robot
control.
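The few-shot adaptation step sketched in this abstract, correcting a residual-dynamics model with a handful of gradient steps on freshly collected data, can be illustrated on a toy linear residual model. The model class, data, and step counts below are assumptions for illustration; they stand in for the MAML inner loop, not for the paper's L4CasADi pipeline.

```python
import numpy as np

W_true = np.array([[0.5, -0.2],
                   [0.1,  0.3]])          # "true" residual dynamics r(x) = W x
X = np.array([[1., 0.], [0., 1.],
              [1., 1.], [1., -1.]])       # a handful of online samples
R = X @ W_true.T                          # observed residuals at those states

W = np.zeros((2, 2))                      # stands in for the meta-learned init
lr = 0.1
for _ in range(50):                       # a few fast inner-loop steps
    pred = X @ W.T
    grad = (pred - R).T @ X / len(X)      # gradient of the mean squared error
    W -= lr * grad

print("max residual-model error:", np.abs(W - W_true).max())
```

The adapted residual model is then added to the nominal dynamics inside the MPC prediction, improving accuracy without retraining the full model offline.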
♻ ☆ SHIFT Planner: Speedy Hybrid Iterative Field and Segmented Trajectory Optimization with IKD-tree for Uniform Lightweight Coverage
This paper introduces a comprehensive planning and navigation framework that
addresses the limitations of existing systems by integrating semantic mapping,
adaptive coverage planning, dynamic obstacle avoidance, and precise trajectory
tracking. Our
framework begins by generating panoptic occupancy local semantic maps and
accurate localization information from data aligned between a monocular camera,
IMU, and GPS. This information is combined with input terrain point clouds or
preloaded terrain information to initialize the planning process. We propose
the Radiant Field-Informed Coverage Planning algorithm, which utilizes a
diffusion field model to dynamically adjust the robot's coverage trajectory and
speed based on environmental attributes such as dirtiness and dryness. By
modeling the spatial influence of the robot's actions using a Gaussian field,
the algorithm ensures a speed-optimized, uniform coverage trajectory while
varying environmental conditions.
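The Gaussian-field modeling described above can be sketched as follows: each visited waypoint deposits a Gaussian footprint on a grid, and a cell counts as covered once the accumulated field exceeds a threshold. The footprint width, threshold, and path below are illustrative assumptions, not the SHIFT planner's parameters.

```python
import numpy as np

sigma = 0.5   # assumed footprint width of one robot action, in meters
grid = np.stack(np.meshgrid(np.linspace(0, 4, 41),
                            np.linspace(0, 4, 41)), axis=-1)

def coverage_field(waypoints, grid, sigma):
    """Accumulate the Gaussian influence of each visited waypoint."""
    field = np.zeros(grid.shape[:2])
    for p in waypoints:
        d2 = ((grid - p) ** 2).sum(axis=-1)
        field += np.exp(-d2 / (2 * sigma ** 2))
    return field

# A dense lawnmower-style visit pattern over the 4 m x 4 m area:
steps = np.arange(0, 4.5, 0.5)
path = [(x, y) for y in steps for x in steps]

field = coverage_field(path, grid, sigma)
covered = (field >= 0.5).mean()   # fraction of cells above the threshold
print(f"covered fraction: {covered:.2f}")  # -> 1.00
```

Attributes such as dirtiness could scale either the deposited footprint or the required threshold per cell, which is how speed and coverage density can adapt to the environment.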