Robotics 43
☆ MADR: MPC-guided Adversarial DeepReach
Hamilton-Jacobi (HJ) Reachability offers a framework for generating safe
value functions and policies in the face of adversarial disturbance, but is
limited by the curse of dimensionality. Physics-informed deep learning is able
to overcome this infeasibility, but itself suffers from slow and inaccurate
convergence, primarily due to weak PDE gradients and the complexity of
self-supervised learning. A few recent works have demonstrated that
enriching the self-supervision process with regular supervision (based on the
nature of the optimal control problem) greatly accelerates convergence and
improves solution quality; however, these approaches have been limited to
single-player problems and simple games. In this work, we introduce MADR:
MPC-guided Adversarial
DeepReach, a general framework to robustly approximate the two-player, zero-sum
differential game value function. In doing so, MADR yields the corresponding
optimal strategies for both players in zero-sum games as well as safe policies
for worst-case robustness. We test MADR on a multitude of high-dimensional
simulated and real robotic agents with varying dynamics and games, finding that
our approach significantly outperforms state-of-the-art baselines in
simulation and produces impressive results in hardware.
comment: 8 pages, under review
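A minimal sketch (not the authors' code) of the general recipe the abstract describes: a self-supervised PDE-residual loss combined with a supervised regression loss on value targets produced by an external solver such as MPC. The network interface, residual, and data below are placeholders.

```python
import torch

def combined_value_loss(value_net, pde_residual_fn, colloc_t, colloc_x,
                        sup_t, sup_x, sup_v, sup_weight=1.0):
    """Self-supervised HJ-PDE residual loss plus supervised regression onto
    externally generated (e.g., MPC rollout) value targets."""
    pde_loss = pde_residual_fn(value_net, colloc_t, colloc_x).pow(2).mean()
    sup_loss = (value_net(sup_t, sup_x) - sup_v).pow(2).mean()
    return pde_loss + sup_weight * sup_loss

# Toy usage with a stand-in network and a stand-in residual (not a real HJ PDE).
net = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
value = lambda t, x: net(torch.cat([t, x], dim=-1)).squeeze(-1)
residual = lambda v, t, x: v(t, x) - 0.0          # placeholder residual
t, x = torch.rand(64, 1), torch.rand(64, 2)
print(combined_value_loss(value, residual, t, x, t, x, torch.zeros(64)))
```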
☆ Online Object-Level Semantic Mapping for Quadrupeds in Real-World Environments
We present an online semantic object mapping system for a quadruped robot
operating in real indoor environments, turning sensor detections into named
objects in a global map. During a run, the mapper integrates range geometry
with camera detections, merges co-located detections within a frame, and
associates repeated detections into persistent object instances across frames.
Objects remain in the map when they are out of view, and repeated sightings
update the same instance rather than creating duplicates. The output is a
compact object layer that can be queried (class, pose, and confidence), is
integrated with the occupancy map, and is readable by a planner. In on-robot tests,
the layer remained stable across viewpoint changes.
comment: Published at the Italian Conference on Robotics and Intelligent
Machines (I-RIM) 3D, 2025
☆ Sharing the Load: Distributed Model-Predictive Control for Precise Multi-Rover Cargo Transport
For autonomous cargo transportation, teams of mobile robots can provide more
operational flexibility than a single large robot. In these scenarios,
precision in both inter-vehicle distance and path tracking is key. With this
motivation, we develop a distributed model-predictive controller (MPC) for
multi-vehicle cargo operations that builds on the precise path-tracking of
lidar teach and repeat. To carry cargo, a following vehicle must maintain a
Euclidean distance offset from a lead vehicle regardless of the path curvature.
Our approach uses a shared map to localize the robots relative to each other
without GNSS or direct observations. We compare our approach to a centralized
MPC and a baseline approach that directly measures the inter-vehicle distance.
The distributed MPC shows equivalent nominal performance to the more complex
centralized MPC. Using a direct measurement of the relative distance between
the leader and follower shows improved tracking performance in close-range
scenarios but struggles with long-range offsets. The operational flexibility
provided by distributing the computation makes it well suited for real
deployments. We evaluate four types of convoyed path trackers with over 10 km
of driving in a coupled convoy. With convoys of two and three rovers, the
proposed distributed MPC method runs in real time, allowing map-based convoying
to keep spacing within 20 cm of the target in various conditions.
comment: 8 pages, 4 figures
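A minimal sketch, under assumed notation, of the kind of follower objective the abstract describes: penalize deviation of the Euclidean inter-vehicle distance from a target offset alongside path-tracking error. The weights and cost structure are illustrative, not the paper's controller.

```python
import numpy as np

def follower_stage_cost(follower_xy, leader_xy, path_point_xy,
                        target_offset, w_dist=1.0, w_path=1.0):
    # Deviation of the Euclidean inter-vehicle distance from the target offset.
    dist_err = np.linalg.norm(follower_xy - leader_xy) - target_offset
    # Deviation from the taught path (here a single nearest path point).
    path_err = np.linalg.norm(follower_xy - path_point_xy)
    return w_dist * dist_err**2 + w_path * path_err**2

# Example: follower 4.8 m behind a leader with a 5.0 m target offset.
print(follower_stage_cost(np.array([0.0, 0.0]), np.array([4.8, 0.0]),
                          np.array([0.0, 0.1]), target_offset=5.0))
```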
☆ Event-Grounding Graph: Unified Spatio-Temporal Scene Graph from Robotic Observations
A fundamental aspect for building intelligent autonomous robots that can
assist humans in their daily lives is the construction of rich environmental
representations. While advances in semantic scene representations have enriched
robotic scene understanding, current approaches lack a connection between
spatial features and dynamic events, e.g., connecting "the blue mug" to the
event "washing a mug". In this work, we introduce the event-grounding graph
(EGG), a framework that grounds event interactions in the spatial features of a
scene. This
representation allows robots to perceive, reason, and respond to complex
spatio-temporal queries. Experiments using real robotic data demonstrate EGG's
capability to retrieve relevant information and respond accurately to human
inquiries concerning the environment and events within. Furthermore, the EGG
framework's source code and evaluation dataset are released as open-source at:
https://github.com/aalto-intelligent-robotics/EGG.
comment: Submitted to RA-L
☆ Towards An Adaptive Locomotion Strategy For Quadruped Rovers: Quantifying When To Slide Or Walk On Planetary Slopes
Alberto Sanchez-Delgado, João Carlos Virgolino Soares, David Omar Al Tawil, Alessia Li Noce, Matteo Villa, Victor Barasuol, Paolo Arena, Claudio Semini
Legged rovers provide enhanced mobility compared to wheeled platforms,
enabling navigation on steep and irregular planetary terrains. However,
traditional legged locomotion might be energetically inefficient and
potentially dangerous to the rover on loose and inclined surfaces, such as
crater walls and cave slopes. This paper introduces a preliminary study that
compares the Cost of Transport (CoT) of walking and torso-based sliding
locomotion for quadruped robots across different slopes, friction conditions
and speed levels. By identifying intersections between walking and sliding CoT
curves, we aim to define threshold conditions that may trigger transitions
between the two strategies. The methodology combines physics-based simulations
in Isaac Sim with particle interaction validation in ANSYS-Rocky. Our results
represent an initial step towards adaptive locomotion strategies for planetary
legged rovers.
comment: Published at the 18th Symposium on Advanced Space Technologies in
Robotics and Automation (ASTRA 2025)
☆ Least Restrictive Hyperplane Control Barrier Functions
Control Barrier Functions (CBFs) can provide provable safety guarantees for
dynamic systems. However, finding a valid CBF for a system of interest is often
non-trivial, especially if the shape of the unsafe region is complex and the
CBFs are of higher order. A common solution to this problem is to make a
conservative approximation of the unsafe region in the form of a
line/hyperplane, and use the corresponding conservative Hyperplane-CBF when
deciding on safe control actions. In this letter, we note that conservative
constraints are only a problem if they prevent us from doing what we want.
Thus, instead of first choosing a CBF and then choosing a safe control with
respect to the CBF, we optimize over a combination of CBFs and safe controls to
get as close as possible to our desired control, while still having the safety
guarantee provided by the CBF. We call the corresponding CBF the least
restrictive Hyperplane-CBF. Finally, we also provide a way of creating a smooth
parameterization of the CBF-family for the optimization, and illustrate the
approach on a double integrator dynamical system with acceleration constraints,
moving through a group of arbitrarily shaped static and moving obstacles.
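A minimal sketch of the idea, not the letter's formulation: for a single-integrator robot and one circular obstacle, search over candidate hyperplanes, apply each candidate's CBF constraint to the desired control, and keep the (hyperplane, control) pair whose safe control is closest to the desired one. The dynamics, obstacle model, and sampling-based search are simplifying assumptions.

```python
import numpy as np

def least_restrictive_control(x, u_des, c, r, alpha=1.0, n_candidates=64):
    best_u, best_cost = None, np.inf
    for theta in np.linspace(0.0, 2 * np.pi, n_candidates, endpoint=False):
        n = np.array([np.cos(theta), np.sin(theta)])  # unit normal of candidate plane
        h = n @ (x - c) - r                           # hyperplane-CBF value (tangent to obstacle)
        if h < 0.0:                                   # robot not on the safe side: skip candidate
            continue
        # CBF condition for x_dot = u:  n.u + alpha * h >= 0
        slack = n @ u_des + alpha * h
        u = u_des if slack >= 0.0 else u_des - slack * n   # minimal projection onto constraint
        cost = np.linalg.norm(u - u_des)
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

# Robot at the origin, obstacle at (2, 0) with radius 0.5, desired motion toward it.
print(least_restrictive_control(np.zeros(2), np.array([1.0, 0.0]),
                                np.array([2.0, 0.0]), 0.5))
```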
☆ C-SWAP: Explainability-Aware Structured Pruning for Efficient Neural Networks Compression BMVC2025
Neural network compression has gained increasing attention in recent years,
particularly in computer vision applications, where the need for model
reduction is crucial for overcoming deployment constraints. Pruning is a widely
used technique that promotes sparsity in model structures, e.g., weights,
neurons, and layers, reducing size and inference costs. Structured pruning is
especially important as it allows for the removal of entire structures, which
further accelerates inference time and reduces memory overhead. However, it can
be computationally expensive, requiring iterative retraining and optimization.
To overcome this problem, recent methods consider a one-shot setting, which
applies pruning directly post-training. Unfortunately, they often lead to a
considerable drop in performance. In this paper, we focus on this issue by
proposing a novel one-shot pruning framework that relies on explainable deep
learning. First, we introduce a causal-aware pruning approach that leverages
cause-effect relations between model predictions and structures in a
progressive pruning process. It allows us to efficiently reduce the size of the
network, ensuring that the removed structures do not degrade the performance of
the model. Then, through experiments conducted on convolutional neural network
and vision transformer baselines, pre-trained on classification tasks, we
demonstrate that our method consistently achieves substantial reductions in
model size, with minimal impact on performance, and without the need for
fine-tuning. Overall, our approach outperforms its counterparts, offering the
best trade-off. Our code is available on GitHub.
comment: 10 pages, BMVC2025
☆ A Compositional Paradigm for Foundation Models: Towards Smarter Robotic Agents
Luigi Quarantiello, Elia Piccoli, Jack Bell, Malio Li, Giacomo Carfì, Eric Nuertey Coleman, Gerlando Gramaglia, Lanpei Li, Mauro Madeddu, Irene Testa, Vincenzo Lomonaco
The birth of Foundation Models brought unprecedented results in a wide range
of tasks, from language to vision, to robotic control. These models are able to
process huge quantities of data, and can extract and develop rich
representations, which can be employed across different domains and modalities.
However, they still have issues in adapting to dynamic, real-world scenarios
without retraining the entire model from scratch. In this work, we propose the
application of Continual Learning and Compositionality principles to foster the
development of more flexible, efficient and smart AI solutions.
☆ Quadrupeds for Planetary Exploration: Field Testing Control Algorithms on an Active Volcano
Shubham Vyas, Franek Stark, Rohit Kumar, Hannah Isermann, Jonas Haack, Mihaela Popescu, Jakob Middelberg, Dennis Mronga, Frank Kirchner
Missions such as the Ingenuity helicopter have shown the advantages of using
novel locomotion modes to increase the scientific return of planetary
exploration missions. Legged robots can further expand the reach and capability
of future planetary missions by traversing more difficult terrain than wheeled
rovers, such as jumping over cracks on the ground or traversing rugged terrain
with boulders. To develop and test algorithms for using quadruped robots, the
AAPLE project was carried out at DFKI. As part of the project, we conducted a
series of field experiments on the Aeolian island of Vulcano, an active
stratovolcano near Sicily, Italy. The experiments focused on validating
newly developed state-of-the-art adaptive optimal control algorithms for
quadrupedal locomotion in a high-fidelity analog environment for Lunar and
Martian surfaces. This paper presents the technical approach, test plan,
software architecture, field deployment strategy, and evaluation results from
the Vulcano campaign.
comment: Presented at 18th Symposium on Advanced Space Technologies in
Robotics and Automation (ASTRA)
☆ Flexbee: A Grasping and Perching UAV Based on Soft Vector-Propulsion Nozzle
The aim of this paper is to design a new type of grasping and perching
unmanned aerial vehicle (UAV), called Flexbee, which features a soft
vector-propulsion nozzle (SVPN). Compared to previous UAVs, Flexbee integrates
flight, grasping, and perching functionalities into the four SVPNs. This
integration offers advantages including decoupled position and attitude
control, high structural reuse, and strong adaptability for grasping and
perching. A dynamics model of Flexbee has been developed, and the
nonlinear coupling issue of the moment has been resolved through linearization
of the equivalent moment model. A hierarchical control strategy was used to
design controllers for the two operational modes of Flexbee. Finally, flight,
grasping, and perching experiments were conducted to validate Flexbee's
kinematic capabilities and the effectiveness of the control strategy.
comment: 11 pages, 17 figures
☆ EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval NeurIPS 2025
Object-goal navigation (ObjNav) tasks an agent with navigating to the
location of a specific object in an unseen environment. Embodied agents
equipped with large language models (LLMs) and online constructed navigation
maps can perform ObjNav in a zero-shot manner. However, existing agents heavily
rely on giant cloud-hosted LLMs, e.g., GPT-4, while directly switching to small
LLMs, e.g., LLaMA3.2-11b, leads to significant success-rate drops due to
limited model capacity for understanding complex navigation maps, which
prevents deploying ObjNav on local devices. At the same time, the long prompt
introduced by the navigation map description causes high planning latency
on local devices. In this paper, we propose EfficientNav to enable on-device
efficient LLM-based zero-shot ObjNav. To help the smaller LLMs better
understand the environment, we propose semantics-aware memory retrieval to
prune redundant information in navigation maps. To reduce planning latency, we
propose discrete memory caching and attention-based memory clustering to
efficiently save and re-use the KV cache. Extensive experimental results
demonstrate that EfficientNav achieves 11.1% improvement in success rate on
HM3D benchmark over GPT-4-based baselines, and demonstrates 6.7x real-time
latency reduction and 4.7x end-to-end latency reduction over GPT-4 planner. Our
code will be released soon.
comment: NeurIPS 2025
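A minimal sketch of semantics-aware memory retrieval as described above, not EfficientNav's implementation: map entries are ranked by similarity of their labels to the goal object and only the top entries are kept in the prompt. The token-overlap similarity is a toy placeholder for a real semantic embedding.

```python
def token_overlap(a, b):
    # Toy similarity: Jaccard overlap of lowercase words.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def prune_map_prompt(map_entries, goal, top_k=3, sim=token_overlap):
    """map_entries: list of (label, description) pairs already in the map."""
    scored = sorted(map_entries, key=lambda e: sim(e[0], goal), reverse=True)
    return "\n".join(f"- {label}: {desc}" for label, desc in scored[:top_k])

entries = [("kitchen counter", "room 2, near sink"),
           ("sofa", "living room, by the window"),
           ("coffee table", "living room, center"),
           ("bathroom sink", "room 3")]
print(prune_map_prompt(entries, goal="coffee table", top_k=2))
```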
☆ Efficient Model-Based Reinforcement Learning for Robot Control via Online Learning
We present an online model-based reinforcement learning algorithm suitable
for controlling complex robotic systems directly in the real world. Unlike
prevailing sim-to-real pipelines that rely on extensive offline simulation and
model-free policy optimization, our method builds a dynamics model from
real-time interaction data and performs policy updates guided by the learned
dynamics model. This efficient model-based reinforcement learning scheme
significantly reduces the number of samples required to train control
policies, enabling direct training on real-world rollout data. Training
directly on real data reduces the influence of simulation bias and facilitates
the search for
high-performance control policies. We adopt online learning analysis to derive
sublinear regret bounds under standard stochastic online optimization
assumptions, providing formal guarantees on performance improvement as more
interaction data are collected. Experimental evaluations were performed on a
hydraulic excavator arm and a soft robot arm, where the algorithm demonstrates
strong sample efficiency compared to model-free reinforcement learning methods,
reaching comparable performance within hours. Robust adaptation to shifting
dynamics was also observed when the payload condition was randomized. Our
approach paves the way toward efficient and reliable on-robot learning for a
broad class of challenging control tasks.
☆ MPC-based motion planning for non-holonomic systems in non-convex domains
Motivated by the application of using model predictive control (MPC) for
motion planning of autonomous mobile robots, a form of output tracking MPC for
non-holonomic systems and with non-convex constraints is studied. Although the
advantages of using MPC for motion planning have been demonstrated in several
papers, in most of the available fundamental literature on output tracking MPC
it is assumed, often implicitly, that the model is holonomic and that the
state or output constraints are convex. Thus, in application-oriented
publications, empirical results dominate and the topic of proving completeness,
in particular under which assumptions the target is always reached, has
received comparatively little attention. To address this gap, we present a
novel MPC formulation that guarantees convergence to the desired target under
realistic assumptions, which can be verified in relevant real-world scenarios.
comment: Preprint of ECC 2025 submission
☆ Biomechanically consistent real-time action recognition for human-robot interaction
Wanchen Li, Kahina Chalabi, Sabbah Maxime, Thomas Bousquet, Robin Passama, Sofiane Ramdani, Andrea Cherubini, Vincent Bonnet
This paper presents a novel framework for real-time human action recognition
in industrial contexts, using standard 2D cameras. We introduce a complete
pipeline for robust and real-time estimation of human joint kinematics, input
to a temporally smoothed Transformer-based network, for action recognition. We
rely on a new dataset including 11 subjects performing various actions, to
evaluate our approach. Unlike most of the literature, which relies on joint
center positions (JCP) and operates offline, ours uses biomechanical priors,
e.g., joint angles, for fast and robust real-time recognition. Moreover, joint
angles make the proposed method agnostic to sensor and subject poses as well as
to anthropometric differences, and ensure robustness across environments and
subjects. Our proposed learning model outperforms the best baseline model, also
running in real time, across various metrics. It achieves 88% accuracy and
shows strong generalization for subjects not facing the cameras.
Finally, we demonstrate the robustness and usefulness of our technique, through
an online interaction experiment, with a simulated robot controlled in
real-time via the recognized actions.
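A minimal sketch of the biomechanical feature the abstract relies on: a joint angle computed from three joint-center positions, which is invariant to camera pose and limb lengths. The three-point interface is an assumption.

```python
import numpy as np

def joint_angle(proximal, joint, distal):
    # Angle at `joint` between the segments joint->proximal and joint->distal.
    u = np.asarray(proximal, float) - np.asarray(joint, float)
    v = np.asarray(distal, float) - np.asarray(joint, float)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

# Elbow flexed at roughly 90 degrees (shoulder, elbow, wrist positions).
print(joint_angle([0, 1, 0], [0, 0, 0], [1, 0, 0]))  # -> 90.0
```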
☆ MMRHP: A Miniature Mixed-Reality HIL Platform for Auditable Closed-Loop Evaluation
Validation of autonomous driving systems requires a trade-off between test
fidelity, cost, and scalability. While miniaturized hardware-in-the-loop (HIL)
platforms have emerged as a promising solution, a systematic framework
supporting rigorous quantitative analysis is generally lacking, limiting their
value as scientific evaluation tools. To address this challenge, we propose
MMRHP, a miniature mixed-reality HIL platform that elevates miniaturized
testing from functional demonstration to rigorous, reproducible quantitative
analysis. The core contributions are threefold. First, we propose a systematic
three-phase testing process oriented toward the Safety of the Intended
Functionality (SOTIF) standard, providing actionable guidance for identifying the
performance limits and triggering conditions of otherwise correctly functioning
systems. Second, we design and implement a HIL platform centered around a
unified spatiotemporal measurement core to support this process, ensuring
consistent and traceable quantification of physical motion and system timing.
Finally, we demonstrate the effectiveness of this solution through
comprehensive experiments. The platform itself was first validated, achieving a
spatial accuracy of 10.27 mm RMSE and a stable closed-loop latency baseline of
approximately 45 ms. Subsequently, an in-depth Autoware case study leveraged
this validated platform to quantify its performance baseline and identify a
critical performance cliff at an injected latency of 40 ms. This work shows
that a structured process, combined with a platform offering a unified
spatio-temporal benchmark, enables reproducible, interpretable, and
quantitative closed-loop evaluation of autonomous driving systems.
☆ PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion
State-of-the-art perceptive Reinforcement Learning controllers for legged
robots either (i) impose oscillator or IK-based gait priors that constrain the
action space, add bias to the policy optimization and reduce adaptability
across robot morphologies, or (ii) operate "blind", struggling to anticipate
hind-leg terrain and remaining brittle to noise. In this paper, we
propose Phase-Guided Terrain Traversal (PGTT), a perception-aware deep-RL
approach that overcomes these limitations by enforcing gait structure purely
through reward shaping, thereby reducing inductive bias in policy learning
compared to oscillator/IK-conditioned action priors. PGTT encodes per-leg phase
as a cubic Hermite spline that adapts swing height to local heightmap
statistics and adds a swing-phase contact penalty, while the policy acts
directly in joint space supporting morphology-agnostic deployment. Trained in
MuJoCo (MJX) on procedurally generated stair-like terrains with curriculum and
domain randomization, PGTT achieves the highest success under push disturbances
(median +7.5% vs. the next best method) and on discrete obstacles (+9%), with
comparable velocity tracking, and converging to an effective policy roughly 2x
faster than strong end-to-end baselines. We validate PGTT on a Unitree Go2
using a real-time LiDAR elevation-to-heightmap pipeline, and we report
preliminary results on ANYmal-C obtained with the same hyperparameters. These
findings indicate that terrain-adaptive, phase-guided reward shaping is a
simple and general mechanism for robust perceptive locomotion across platforms.
comment: 9 pages, 9 figures, 2 tables
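A minimal sketch, not PGTT's exact formulation, of a per-leg swing-height profile built from cubic Hermite segments whose apex is scaled by a local heightmap statistic. The specific statistic and gains are assumptions.

```python
import numpy as np

def hermite(p0, p1, m0, m1, t):
    # Cubic Hermite interpolation between values p0, p1 with slopes m0, m1.
    h00 = 2*t**3 - 3*t**2 + 1
    h10 = t**3 - 2*t**2 + t
    h01 = -2*t**3 + 3*t**2
    h11 = t**3 - t**2
    return h00*p0 + h10*m0 + h01*p1 + h11*m1

def swing_height(phase, local_heights, base_clearance=0.08, gain=1.0):
    """phase in [0, 1]; local_heights: heightmap samples under the swing leg."""
    # Apex adapts to local terrain roughness (here: height range under the leg).
    apex = base_clearance + gain * (np.max(local_heights) - np.min(local_heights))
    if phase < 0.5:   # lift: 0 -> apex, zero slope at both ends
        return hermite(0.0, apex, 0.0, 0.0, phase / 0.5)
    return hermite(apex, 0.0, 0.0, 0.0, (phase - 0.5) / 0.5)  # lower: apex -> 0

print(swing_height(0.5, np.array([0.00, 0.03, 0.06])))  # apex over a 6 cm step
```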
☆ Coverage-Recon: Coordinated Multi-Drone Image Sampling with Online Map Feedback
This article addresses collaborative 3D map reconstruction using multiple
drones. Achieving high-quality reconstruction requires capturing images of
keypoints within the target scene from diverse viewing angles, and coverage
control offers an effective framework to meet this requirement. Meanwhile,
recent advances in real-time 3D reconstruction algorithms make it possible to
render an evolving map during flight, enabling immediate feedback to guide
drone motion. Building on this, we present Coverage-Recon, a novel coordinated
image sampling algorithm that integrates online map feedback to improve
reconstruction quality on-the-fly. In Coverage-Recon, the coordinated motion of
drones is governed by a Quadratic Programming (QP)-based angle-aware coverage
controller, which ensures multi-viewpoint image capture while enforcing safety
constraints. The captured images are processed in real time by the NeuralRecon
algorithm to generate an evolving 3D mesh. Mesh changes across the scene are
interpreted as indicators of reconstruction uncertainty and serve as feedback
to update the importance index of the coverage control as the map evolves. The
effectiveness of Coverage-Recon is validated through simulation and
experiments, demonstrating both qualitatively and quantitatively that
incorporating online map feedback yields more complete and accurate 3D
reconstructions than conventional methods. Project page:
https://htnk-lab.github.io/coverage-recon/
comment: Submitted to IEEE Transactions on Control Systems Technology (under
review). Project page: https://htnk-lab.github.io/coverage-recon/
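A minimal sketch, under assumed notation, of using mesh change as feedback: cells of the coverage domain whose reconstructed mesh changed substantially regain importance and attract further viewpoints. The exact update rule is not given in the abstract, so this decay-plus-gain form is an assumption.

```python
import numpy as np

def update_importance(importance, mesh_change, decay=0.9, gain=5.0):
    """importance, mesh_change: per-cell arrays over the coverage domain."""
    # Cells with little recent mesh change are considered settled and decay;
    # cells with large change are treated as uncertain and regain importance.
    return decay * importance + gain * mesh_change

imp = np.full(4, 1.0)
change = np.array([0.00, 0.02, 0.10, 0.00])   # hypothetical per-cell mesh change
print(update_importance(imp, change))
```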
☆ MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning
Integrating visual-language instructions into visuomotor policies is gaining
momentum in robot learning for enhancing open-world generalization. Despite
promising advances, existing approaches face two challenges: limited language
steerability when no generated reasoning is used as a condition, or significant
inference latency when reasoning is incorporated. In this work, we introduce
MoTVLA, a mixture-of-transformers (MoT)-based vision-language-action (VLA)
model that integrates fast-slow unified reasoning with behavior policy
learning. MoTVLA preserves the general intelligence of pre-trained VLMs
(serving as the generalist) for tasks such as perception, scene understanding,
and semantic planning, while incorporating a domain expert, a second
transformer that shares knowledge with the pretrained VLM, to generate
domain-specific fast reasoning (e.g., robot motion decomposition), thereby
improving policy execution efficiency. By conditioning the action expert on
decomposed motion instructions, MoTVLA can learn diverse behaviors and
substantially improve language steerability. Extensive evaluations across
natural language processing benchmarks, robotic simulation environments, and
real-world experiments confirm the superiority of MoTVLA in both fast-slow
reasoning and manipulation task performance.
☆ MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang, Huang Huang, Josiah Wong, Sujay Garlanka, Cem Gokmen, Ruohan Zhang, Weiyu Liu, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
Imitation learning from large-scale, diverse human demonstrations has proven
effective for training robots, but collecting such data is costly and
time-consuming. This challenge is amplified for multi-step bimanual mobile
manipulation, where humans must teleoperate both a mobile base and two
high-degree-of-freedom arms. Prior automated data generation frameworks have
addressed static bimanual manipulation by augmenting a few human demonstrations
in simulation, but they fall short for mobile settings due to two key
challenges: (1) determining base placement to ensure reachability, and (2)
positioning the camera to provide sufficient visibility for visuomotor
policies. To address these issues, we introduce MoMaGen, which formulates data
generation as a constrained optimization problem that enforces hard constraints
(e.g., reachability) while balancing soft constraints (e.g., visibility during
navigation). This formulation generalizes prior approaches and provides a
principled foundation for future methods. We evaluate MoMaGen on four
multi-step bimanual mobile manipulation tasks and show that it generates
significantly more diverse datasets than existing methods. Leveraging this
diversity, MoMaGen can train successful imitation learning policies from a
single source demonstration, and these policies can be fine-tuned with as few
as 40 real-world demonstrations to achieve deployment on physical robotic
hardware. More details are available at our project page: momagen.github.io.
comment: Project website: momagen.github.io. The first four authors contribute
equally
♻ ☆ Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
Autonomous exploration in complex multi-agent reinforcement learning (MARL)
with sparse rewards critically depends on providing agents with effective
intrinsic motivation. While artificial curiosity offers a powerful
self-supervised signal, it often confuses environmental stochasticity with
meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform
novelty bias, treating all unexpected observations equally. However, peer
behavior novelty, which encodes latent task dynamics, is often overlooked,
resulting in suboptimal exploration in decentralized, communication-free MARL
settings. To this end, inspired by how human children adaptively calibrate
their own exploratory behaviors via observing peers, we propose a novel
approach to enhance multi-agent exploration. We introduce CERMIC, a principled
framework that empowers agents to robustly filter noisy surprise signals and
guide exploration by dynamically calibrating their intrinsic curiosity with
inferred multi-agent context. Additionally, CERMIC generates
theoretically-grounded intrinsic rewards, encouraging agents to explore state
transitions with high information gain. We evaluate CERMIC on benchmark suites
including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that
exploration with CERMIC significantly outperforms SoTA algorithms in
sparse-reward environments.
♻ ☆ Seeing through Uncertainty: Robust Task-Oriented Optimization in Visual Navigation
Visual navigation is a fundamental problem in embodied AI, yet practical
deployments demand long-horizon planning capabilities to address
multi-objective tasks. A major bottleneck is data scarcity: policies learned
from limited data often overfit and fail to generalize out of distribution
(OOD). Existing neural network-based agents typically increase architectural
complexity, which paradoxically becomes counterproductive in the small-sample
regime. This paper introduces NeuRO, an integrated learning-to-optimize
framework that tightly
couples perception networks with downstream task-level robust optimization.
Specifically, NeuRO addresses core difficulties in this integration: (i) it
transforms noisy visual predictions under data scarcity into convex uncertainty
sets using Partially Input Convex Neural Networks (PICNNs) with conformal
calibration, which directly parameterize the optimization constraints; and (ii)
it reformulates planning under partial observability as a robust optimization
problem, enabling uncertainty-aware policies that transfer across environments.
Extensive experiments on both unordered and sequential multi-object navigation
tasks demonstrate that NeuRO establishes SoTA performance, particularly in
generalization to unseen environments. Our work thus presents a significant
advancement for developing robust, generalizable autonomous agents.
♻ ☆ Rethink Repeatable Measures of Robot Performance with Statistical Query
For a general standardized testing algorithm designed to evaluate a specific
aspect of a robot's performance, several key expectations are commonly imposed.
Beyond accuracy (i.e., closeness to a typically unknown ground-truth reference)
and efficiency (i.e., feasibility within acceptable testing costs and equipment
constraints), one particularly important attribute is repeatability.
Repeatability refers to the ability to consistently obtain the same testing
outcome when similar testing algorithms are executed on the same subject robot
by different stakeholders, across different times or locations. However,
achieving repeatable testing has become increasingly challenging as the
components involved grow more complex, intelligent, diverse, and, most
importantly, stochastic. While related efforts have addressed repeatability at
ethical, hardware, and procedural levels, this study focuses specifically on
repeatable testing at the algorithmic level. Specifically, we target the
well-adopted class of testing algorithms in standardized evaluation:
statistical query (SQ) algorithms (i.e., algorithms that estimate the expected
value of a bounded function over a distribution using sampled data). We propose
a lightweight, parameterized, and adaptive modification applicable to any SQ
routine, whether based on Monte Carlo sampling, importance sampling, or
adaptive importance sampling, that makes it provably repeatable, with
guaranteed bounds on both accuracy and efficiency. We demonstrate the
effectiveness of the proposed approach across three representative scenarios:
(i) established and widely adopted standardized testing of manipulators, (ii)
emerging intelligent testing algorithms for operational risk assessment in
automated vehicles, and (iii) developing use cases involving command tracking
performance evaluation of humanoid robots in locomotion tasks.
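A minimal sketch of a plain Monte Carlo statistical-query (SQ) routine, i.e., estimating the expected value of a bounded function from sampled data, which is the class of testing algorithms the abstract targets. The paper's repeatability modification is not described above, so none is shown here.

```python
import random

def sq_estimate(sample, f, n, f_min=0.0, f_max=1.0):
    """Estimate E[f(X)] from n draws of sample(); f is assumed bounded."""
    total = 0.0
    for _ in range(n):
        total += min(f_max, max(f_min, f(sample())))   # enforce the stated bounds
    return total / n

# Two independent runs of the same query on the same stochastic subject
# generally disagree slightly -- the gap a repeatable SQ routine must control.
subject = lambda: random.gauss(0.0, 1.0)          # stand-in for a stochastic robot test
success = lambda x: 1.0 if x > 0.5 else 0.0       # bounded performance indicator
print(sq_estimate(subject, success, 2000), sq_estimate(subject, success, 2000))
```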
♻ ☆ Towards Versatile Humanoid Table Tennis: Unified Reinforcement Learning with Prediction Augmentation
Muqun Hu, Wenxi Chen, Wenjing Li, Falak Mandali, Zijian He, Renhong Zhang, Praveen Krisna, Katherine Christian, Leo Benaharon, Dizhi Ma, Karthik Ramani, Yan Gu
Humanoid table tennis (TT) demands rapid perception, proactive whole-body
motion, and agile footwork under strict timing -- capabilities that remain
difficult for unified controllers. We propose a reinforcement learning
framework that maps ball-position observations directly to whole-body joint
commands for both arm striking and leg locomotion, strengthened by predictive
signals and dense, physics-guided rewards. A lightweight learned predictor, fed
with recent ball positions, estimates future ball states and augments the
policy's observations for proactive decision-making. During training, a
physics-based predictor supplies precise future states to construct dense,
informative rewards that lead to effective exploration. The resulting policy
attains strong performance across varied serve ranges (hit rate ≥ 96% and
success rate ≥ 92%) in simulations. Ablation studies confirm that both the
learned predictor and the predictive reward design are critical for end-to-end
learning. Deployed zero-shot on a physical Booster T1 humanoid with 23 revolute
joints, the policy produces coordinated lateral and forward-backward footwork
with accurate, fast returns, suggesting a practical path toward versatile,
competitive humanoid TT.
♻ ☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles.
Although end-to-end approaches derive control commands directly from raw data,
interpreting these decisions remains challenging, especially in complex urban
scenarios. This is mainly attributed to very deep neural networks with
non-linear decision boundaries, making it challenging to grasp the logic behind
AI-driven decisions. This paper presents a method to enhance interpretability
while optimizing control commands in autonomous driving. To address this, we
propose loss functions that promote the interpretability of our model by
generating sparse and localized feature maps. The feature activations allow us
to explain which image regions contribute to the predicted control command. We
conduct comprehensive ablation studies on the feature extraction step and
validate our method on the CARLA benchmarks. We also demonstrate that our
approach improves interpretability, which correlates with reducing infractions,
yielding a safer, high-performance driving model. Notably, our monocular,
non-ensemble model surpasses the top-performing approaches from the CARLA
Leaderboard by achieving lower infraction scores and the highest route
completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of
Out-of-Label Hazards in Autonomous Driving (2COOOL)
♻ ☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Recent vision-language-action (VLA) models for multi-task robotic
manipulation commonly rely on static viewpoints and shared visual encoders,
which limit 3D perception and cause task interference, hindering robustness and
generalization. In this work, we propose Task-Aware View Planning (TAVP), a
framework designed to overcome these challenges by integrating active view
planning with task-specific representation learning. TAVP employs an efficient
exploration policy, accelerated by a novel pseudo-environment, to actively
acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE)
visual encoder to disentangle features across different tasks, boosting both
representation fidelity and task generalization. By learning to see the world
in a task-aware way, TAVP generates more complete and discriminative visual
representations, demonstrating significantly enhanced action prediction across
a wide array of manipulation challenges. Extensive experiments on RLBench tasks
show that our proposed TAVP model achieves superior performance over
state-of-the-art fixed-view approaches. Visual results and code are provided
at: https://hcplab-sysu.github.io/TAVP.
comment: 14 pages, 8 figures, project page: https://hcplab-sysu.github.io/TAVP
♻ ☆ Dynamic object goal pushing with mobile manipulators through model-free constrained reinforcement learning ICRA 2025
Non-prehensile pushing to move and reorient objects to a goal is a versatile
loco-manipulation skill. In the real world, the object's physical properties
and friction with the floor contain significant uncertainties, which makes the
task challenging for a mobile manipulator. In this paper, we develop a
learning-based controller for a mobile manipulator to move an unknown object to
a desired position and yaw orientation through a sequence of pushing actions.
The proposed controller for the robotic arm and the mobile base motion is
trained using a constrained Reinforcement Learning (RL) formulation. We
demonstrate its capability in experiments with a quadrupedal robot equipped
with an arm. The learned policy achieves a success rate of 91.35% in simulation
and at least 80% on hardware in challenging scenarios. Through our extensive
hardware experiments, we show that the approach demonstrates high robustness
against unknown objects of different masses, materials, sizes, and shapes. It
reactively discovers the pushing location and direction, thus achieving
contact-rich behavior while observing only the pose of the object.
Additionally, we demonstrate the adaptive behavior of the learned policy
towards preventing the object from toppling.
comment: presented at ICRA 2025, Video:
https://youtu.be/wGAdPGVf9Ws?si=pi83ONWofHHqbFG0
♻ ☆ Generation of Uncertainty-Aware Emergent Concepts in Factorized 3D Scene Graphs via Graph Neural Networks
Jose Andres Millan-Romera, Muhammad Shaheer, Miguel Fernandez-Cortizas, Martin R. Oswald, Holger Voos, Jose Luis Sanchez-Lopez
Enabling robots to autonomously discover emergent spatial concepts (e.g.,
rooms) from primitive geometric observations (e.g., planar surfaces) within 3D
Scene Graphs is essential for robust indoor navigation and mapping. These
graphs provide a hierarchical metric-semantic representation in which such
concepts are organized. To further enhance graph-SLAM performance, Factorized
3D Scene Graphs incorporate these concepts as optimization factors that
constrain relative geometry and enforce global consistency. However, both
stages of this process remain largely manual: concepts are typically derived
using hand-crafted, concept-specific heuristics, while factors and their
covariances are likewise manually designed. This reliance on manual
specification limits generalization across diverse environments and scalability
to new concept classes. This paper presents, for the first time, a
learning-based method to generate online spatial emergent concepts as
optimizable factors within a SLAM backend, reducing the need to handcraft both
concept generation and the definition of their corresponding factors and
covariances. In both simulated and real indoor scenarios, our approach improves
complex concept detection by 20.7% and 5.3%, trajectory estimation by 19.2%,
and map reconstruction by 12.3% and 3.8%, respectively, highlighting the
benefits of this integration for robust and adaptive spatial understanding.
comment: Submitted to IEEE Robotics and Automation Letters (RA-L)
♻ ☆ From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment
Generalizing to long-horizon manipulation tasks in a zero-shot setting
remains a central challenge in robotics. Current approaches based on
multimodal foundation models, despite their capabilities, typically fail to
decompose high-level
commands into executable action sequences from static visual input alone. To
address this challenge, we introduce Super-Mimic, a hierarchical framework that
enables zero-shot robotic imitation by directly inferring procedural intent
from unscripted human demonstration videos. Our framework is composed of two
sequential modules. First, a Human Intent Translator (HIT) parses the input
video using multimodal reasoning to produce a sequence of language-grounded
subtasks. These subtasks then condition a Future Dynamics Predictor (FDP),
which employs a generative model that synthesizes a physically plausible video
rollout for each step. The resulting visual trajectories are dynamics-aware,
explicitly modeling crucial object interactions and contact points to guide the
low-level controller. We validate this approach through extensive experiments
on a suite of long-horizon manipulation tasks, where Super-Mimic significantly
outperforms state-of-the-art zero-shot methods by over 20%. These results
establish that coupling video-driven intent parsing with prospective dynamics
modeling is a highly effective strategy for developing general-purpose robotic
systems.
comment: More details and videos can be found at:
https://yipko.com/super-mimic
♻ ☆ VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Coordinating multiple embodied agents in dynamic environments remains a core
challenge in artificial intelligence, requiring both perception-driven
reasoning and scalable cooperation strategies. While recent works have
leveraged large language models (LLMs) for multi-agent planning, only a few
have begun to explore vision-language models (VLMs) for visual reasoning. However,
these VLM-based approaches remain limited in their support for diverse
embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical
benchmark tailored for embodied multi-agent cooperation, featuring three
structured levels: agent activation, task planning, and trajectory perception.
VIKI-Bench includes diverse robot embodiments, multi-view visual observations,
and structured supervision signals to evaluate reasoning grounded in visual
inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a
two-stage framework that fine-tunes a pretrained vision-language model (VLM)
using Chain-of-Thought annotated demonstrations, followed by reinforcement
learning under multi-level reward signals. Our extensive experiments show that
VIKI-R significantly outperforms baseline methods across all task levels.
Furthermore, we show that reinforcement learning enables the emergence of
compositional cooperation patterns among heterogeneous agents. Together,
VIKI-Bench and VIKI-R offer a unified testbed and method for advancing
multi-agent, visual-driven cooperation in embodied AI systems.
comment: Project page: https://faceong.github.io/VIKI-R/
♻ ☆ Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles
Àlmos Veres-Vitàlyos, Genis Castillo Gomez-Raya, Filip Lemic, Daniel Johannes Bugelnig, Bernhard Rinner, Sergi Abadal, Xavier Costa-Pérez
Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for
navigating indoor and hard-to-reach areas, yet their significant constraints in
payload and autonomy have largely prevented their use for complex tasks like
high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we
introduce a novel system architecture that enables fully autonomous,
high-fidelity 3D scanning of static objects using UAVs weighing under 100
grams. Our core innovation lies in a dual-reconstruction pipeline that creates
a real-time feedback loop between data capture and flight control. A
near-real-time (near-RT) process uses Structure from Motion (SfM) to generate
an instantaneous pointcloud of the object. The system analyzes the model
quality on the fly and dynamically adapts the UAV's trajectory to intelligently
capture new images of poorly covered areas. This ensures comprehensive data
acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline
employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR)
approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB)
location data to achieve superior accuracy. We implemented and validated this
architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both
single- and multi-UAV configurations, conclusively show that dynamic trajectory
adaptation consistently improves reconstruction quality over static flight
paths. This work demonstrates a scalable and autonomous solution that unlocks
the potential of miniaturized UAVs for fine-grained 3D reconstruction in
constrained environments, a capability previously limited to much larger
platforms.
comment: 13 pages, 16 figures, 3 tables, 45 references
♻ ☆ Learn2Decompose: Learning Problem Decomposition for Efficient Sequential Multi-object Manipulation Planning
We present an efficient task and motion replanning approach for sequential
multi-object manipulation in dynamic environments. Conventional Task And Motion
Planning (TAMP) solvers experience an exponential increase in planning time as
the planning horizon and number of objects grow, limiting their applicability
in real-world scenarios. To address this, we propose learning problem
decompositions from demonstrations to accelerate TAMP solvers. Our approach
consists of three key components: goal decomposition learning, computational
distance learning, and object reduction. Goal decomposition identifies the
necessary sequences of states that the system must pass through before reaching
the final goal, treating them as subgoal sequences. Computational distance
learning predicts the computational complexity between two states, enabling the
system to identify the temporally closest subgoal from a disturbed state.
Object reduction minimizes the set of active objects considered during
replanning, further improving efficiency. We evaluate our approach on three
benchmarks, demonstrating its effectiveness in improving replanning efficiency
for sequential multi-object manipulation tasks in dynamic environments.
comment: Extension of RAL version: added PR2 Whole-body kitchen task and
detailed discussion on limitations in main text; added pseudocode and
robustness analysis of our approach, and formal analysis on why and when task
goals are decomposable in appendix
♻ ☆ VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching NeurIPS 2025
Vision-Language-Action (VLA) models have demonstrated strong multi-modal
reasoning capabilities, enabling direct action generation from visual
perception and language instructions in an end-to-end manner. However, their
substantial computational cost poses a challenge for real-time robotic control,
where rapid decision-making is essential. This paper introduces VLA-Cache, a
training-free inference acceleration method that reduces computational overhead
by adaptively caching and reusing static visual tokens across frames.
Exploiting the temporal continuity in robotic manipulation, VLA-Cache
identifies minimally changed tokens between adjacent frames and reuses their
cached key-value representations, thereby circumventing redundant computations.
Additionally, to maintain action precision, VLA-Cache selectively re-computes
task-relevant tokens that are environmentally sensitive, ensuring the fidelity
of critical visual information. To further optimize efficiency, we introduce a
layer-adaptive token reuse strategy that dynamically adjusts the reuse ratio
based on attention concentration across decoder layers, prioritizing critical
tokens for recomputation. Extensive experiments on two simulation platforms
(LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache
achieves up to 1.7x speedup in CUDA latency and a 15% increase in control
frequency, with negligible loss on task success rate. The code and videos can
be found at our project page: https://vla-cache.github.io.
comment: Accepted to NeurIPS 2025
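A minimal sketch, not VLA-Cache's code, of the token-selection step described above: tokens that changed little between adjacent frames are marked for KV-cache reuse, and only the rest are recomputed. The distance threshold is an assumption, and the paper additionally protects task-relevant tokens from reuse.

```python
import numpy as np

def select_reusable_tokens(prev_tokens, curr_tokens, threshold=0.05):
    """prev_tokens, curr_tokens: (num_tokens, dim) patch embeddings."""
    change = np.linalg.norm(curr_tokens - prev_tokens, axis=1)
    return change < threshold          # True -> reuse cached KV for this token

prev = np.random.default_rng(0).normal(size=(8, 4))
curr = prev.copy()
curr[2] += 1.0                          # only token 2 actually changed
reuse = select_reusable_tokens(prev, curr)
print(reuse, "-> recompute tokens:", np.flatnonzero(~reuse))
```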
♻ ☆ Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning NeurIPS 2025
Symmetry is pervasive in robotics and has been widely exploited to improve
sample efficiency in deep reinforcement learning (DRL). However, existing
approaches primarily focus on spatial symmetries, such as reflection, rotation,
and translation, while largely neglecting temporal symmetries. To address this
gap, we explore time reversal symmetry, a form of temporal symmetry commonly
found in robotics tasks such as door opening and closing. We propose Time
Reversal symmetry enhanced Deep Reinforcement Learning (TR-DRL), a framework
that combines trajectory reversal augmentation and time reversal guided reward
shaping to efficiently solve temporally symmetric tasks. Our method generates
reversed transitions from fully reversible transitions, identified by a
proposed dynamics-consistent filter, to augment the training data. For
partially reversible transitions, we apply reward shaping to guide learning,
according to successful trajectories from the reversed task. Extensive
experiments on the Robosuite and MetaWorld benchmarks demonstrate that TR-DRL
is effective in both single-task and multi-task settings, achieving higher
sample efficiency and stronger final performance compared to baseline methods.
comment: Accepted in NeurIPS 2025
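A minimal sketch of trajectory-reversal augmentation as described above, not TR-DRL's code: transitions accepted by a dynamics-consistency filter are added again in time-reversed form. How the reversed action, reward, and filter are defined is task-specific, so all three are passed in here.

```python
def augment_with_reversals(transitions, reverse_action, is_reversible,
                           reversed_reward):
    augmented = list(transitions)
    for (s, a, r, s_next) in transitions:
        if is_reversible(s, a, s_next):                 # fully reversible transitions only
            augmented.append((s_next, reverse_action(s, a, s_next),
                              reversed_reward(s, a, r, s_next), s))
    return augmented

# Toy 1-D example: moving by +d reverses to moving by -d.
data = [(0.0, +0.1, 1.0, 0.1), (0.1, +0.2, 1.0, 0.3)]
print(augment_with_reversals(
    data,
    reverse_action=lambda s, a, s_next: -a,
    is_reversible=lambda s, a, s_next: True,
    reversed_reward=lambda s, a, r, s_next: r,
))
```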
♻ ☆ RAPID Hand Prototype: Design of an Affordable, Fully-Actuated Biomimetic Hand for Dexterous Teleoperation IROS2025
This paper addresses the scarcity of affordable, fully-actuated five-fingered
hands for dexterous teleoperation, which is crucial for collecting large-scale
real-robot data within the "Learning from Demonstrations" paradigm. We
introduce the prototype version of the RAPID Hand, the first low-cost,
20-degree-of-actuation (DoA) dexterous hand that integrates a novel
anthropomorphic actuation and transmission scheme with an optimized motor
layout and structural design to enhance dexterity. Specifically, the RAPID Hand
features a universal phalangeal transmission scheme for the non-thumb fingers
and an omnidirectional thumb actuation mechanism. Prioritizing affordability,
the hand employs 3D-printed parts combined with custom gears for easier
replacement and repair. We assess the RAPID Hand's performance through
quantitative metrics and qualitative testing in a dexterous teleoperation
system, which is evaluated on three challenging tasks: multi-finger retrieval,
ladle handling, and human-like piano playing. The results indicate that the
RAPID Hand's fully actuated 20-DoF design holds significant promise for
dexterous teleoperation.
comment: Accepted by IROS2025
♻ ☆ DDBot: Differentiable Physics-based Digging Robot for Unknown Granular Materials
Automating the manipulation of granular materials poses significant
challenges due to complex contact dynamics, unpredictable material properties,
and intricate system states. Existing approaches often fail to achieve
efficiency and accuracy in such tasks. To fill the research gap, this paper
studies the small-scale and high-precision granular material digging task with
unknown physical properties. A new framework, named differentiable digging
robot (DDBot), is proposed to manipulate granular materials, including sand and
soil.
Specifically, we equip DDBot with a differentiable physics-based simulator,
tailored for granular material manipulation, powered by GPU-accelerated
parallel computing and automatic differentiation. DDBot can perform efficient
differentiable system identification and high-precision digging skill
optimisation for unknown granular materials, which is enabled by a
differentiable skill-to-action mapping, a task-oriented demonstration method,
gradient clipping and line search-based gradient descent.
Experimental results show that DDBot can efficiently (converge within 5 to 20
minutes) identify unknown granular material dynamics and optimise digging
skills, with high-precision results in zero-shot real-world deployments,
highlighting its practicality. Benchmark results against state-of-the-art
baselines also confirm the robustness and efficiency of DDBot in such digging
tasks.
comment: Accepted as a regular paper by the IEEE Transactions on Robotics
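A minimal sketch of the optimization ingredients named above (gradient clipping plus a backtracking line search), shown on a toy quadratic rather than the differentiable digging simulator; it is not DDBot's optimizer.

```python
import numpy as np

def clipped_line_search_gd(loss, grad, x0, steps=50, clip=1.0,
                           step0=1.0, shrink=0.5, min_step=1e-6):
    x = np.asarray(x0, float)
    for _ in range(steps):
        g = grad(x)
        norm = np.linalg.norm(g)
        if norm > clip:                       # gradient clipping
            g = g * (clip / norm)
        step, fx = step0, loss(x)
        while step > min_step and loss(x - step * g) >= fx:
            step *= shrink                    # backtracking line search
        x = x - step * g
    return x

loss = lambda x: np.sum((x - 3.0) ** 2)
grad = lambda x: 2.0 * (x - 3.0)
print(clipped_line_search_gd(loss, grad, np.zeros(2)))   # converges to ~[3., 3.]
```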
♻ ☆ IR2: Implicit Rendezvous for Robotic Exploration Teams under Sparse Intermittent Connectivity
Information sharing is critical in time-sensitive and realistic multi-robot
exploration, especially for smaller robotic teams in large-scale environments
where connectivity may be sparse and intermittent. Existing methods often
overlook such communication constraints by assuming unrealistic global
connectivity. Other works account for communication constraints (by maintaining
close proximity or line of sight during information exchange), but are often
inefficient. For instance, preplanned rendezvous approaches typically involve
unnecessary detours resulting from poorly timed rendezvous, while pursuit-based
approaches often result in short-sighted decisions due to their greedy nature.
We present IR2, a deep reinforcement learning approach to information sharing
for multi-robot exploration. Leveraging attention-based neural networks trained
via reinforcement and curriculum learning, IR2 allows robots to effectively
reason about the longer-term trade-offs between disconnecting for solo
exploration and reconnecting for information sharing. In addition, we propose a
hierarchical graph formulation to maintain a sparse yet informative graph,
enabling our approach to scale to large-scale environments. We present
simulation results in three large-scale Gazebo environments, which show that
our approach yields 6.6-34.1% shorter exploration paths when compared to
state-of-the-art baselines, and lastly deploy our learned policy on hardware.
Our simulation training and testing code is available at
https://ir2-explore.github.io.
comment: © 20XX IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
♻ ☆ When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Xianzheng Ma, Brandon Smart, Yash Bhalgat, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu
As large language models (LLMs) evolve, their integration with 3D spatial
data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for
understanding and interacting with physical spaces. This survey provides a
comprehensive overview of the methodologies enabling LLMs to process,
understand, and generate 3D data. Highlighting the unique advantages of LLMs,
such as in-context learning, step-by-step reasoning, open-vocabulary
capabilities, and extensive world knowledge, we underscore their potential to
significantly advance spatial comprehension and interaction within embodied
Artificial Intelligence (AI) systems. Our investigation spans various 3D data
representations, from point clouds to Neural Radiance Fields (NeRFs). It
examines their integration with LLMs for tasks such as 3D scene understanding,
captioning, question-answering, and dialogue, as well as LLM-based agents for
spatial reasoning, planning, and navigation. The paper also includes a brief
review of other methods that integrate 3D and language. The meta-analysis
presented in this paper reveals significant progress yet underscores the
necessity for novel approaches to harness the full potential of 3D-LLMs. Hence,
with this paper, we aim to chart a course for future research that explores and
expands the capabilities of 3D-LLMs in understanding and interacting with the
complex 3D world. To support this survey, we have established a project page
where papers related to our topic are organized and listed:
https://github.com/ActiveVisionLab/Awesome-LLM-3D.
comment: 2nd version update to Jun.2025
♻ ☆ Distilling LLM Prior to Flow Model for Generalizable Agent's Imagination in Object Goal Navigation
The Object Goal Navigation (ObjectNav) task challenges agents to locate a
specified object in an unseen environment by imagining unobserved regions of
the scene. Prior approaches rely on deterministic and discriminative models to
complete semantic maps, overlooking the inherent uncertainty in indoor layouts
and limiting their ability to generalize to unseen environments. In this work,
we propose GOAL, a generative flow-based framework that models the semantic
distribution of indoor environments by bridging observed regions with
LLM-enriched full-scene semantic maps. During training, spatial priors inferred
from large language models (LLMs) are encoded as two-dimensional Gaussian
fields and injected into target maps, distilling rich contextual knowledge into
the flow model and enabling more generalizable completions. Extensive
experiments demonstrate that GOAL achieves state-of-the-art performance on MP3D
and Gibson, and shows strong generalization in transfer settings to HM3D. Codes
and pretrained models are available at https://github.com/Badi-Li/GOAL.
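A minimal sketch, not GOAL's code, of encoding an LLM-suggested location prior as a two-dimensional Gaussian field added to one channel of a semantic map, the injection mechanism described above. Grid size, sigma, and the channel layout are assumptions.

```python
import numpy as np

def gaussian_field(grid_shape, center, sigma):
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def inject_prior(semantic_map, channel, center, sigma=3.0, weight=1.0):
    """semantic_map: (C, H, W) array; adds a Gaussian prior to one class channel."""
    out = semantic_map.copy()
    out[channel] += weight * gaussian_field(semantic_map.shape[1:], center, sigma)
    return out

m = np.zeros((2, 16, 16))                           # two hypothetical semantic channels
m2 = inject_prior(m, channel=1, center=(4, 10))     # e.g., "bed likely near cell (4, 10)"
print(m2[1].max(), m2[1, 4, 10])
```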
♻ ☆ RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
Existing end-to-end autonomous driving (AD) algorithms typically follow the
Imitation Learning (IL) paradigm, which faces challenges such as causal
confusion and an open-loop gap. In this work, we propose RAD, a 3DGS-based
closed-loop Reinforcement Learning (RL) framework for end-to-end Autonomous
Driving. By leveraging 3DGS techniques, we construct a photorealistic digital
replica of the real physical world, enabling the AD policy to extensively
explore the state space and learn to handle out-of-distribution scenarios
through large-scale trial and error. To enhance safety, we design specialized
rewards to guide the policy in effectively responding to safety-critical events
and understanding real-world causal relationships. To better align with human
driving behavior, we incorporate IL into RL training as a regularization term.
We introduce a closed-loop evaluation benchmark consisting of diverse,
previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves
stronger performance in most closed-loop metrics, particularly exhibiting a 3x
lower collision rate. Abundant closed-loop results are presented in the
supplementary material. Code is available at https://github.com/hustvl/RAD to
facilitate future research.
comment: Code: https://github.com/hustvl/RAD
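A minimal sketch of the IL-as-regularization idea mentioned in the abstract: the closed-loop RL objective is augmented with an imitation term that keeps predicted actions near logged human driving. The loss form and the weight are assumptions for illustration only.

```python
import torch.nn.functional as F

def rad_policy_loss(rl_objective, pred_actions, expert_actions, il_weight=0.1):
    """
    rl_objective: scalar RL loss from closed-loop rollouts in the 3DGS replica.
    pred_actions / expert_actions: tensors of policy and human driving actions.
    il_weight: hypothetical coefficient balancing the imitation regularizer.
    """
    il_loss = F.smooth_l1_loss(pred_actions, expert_actions)
    return rl_objective + il_weight * il_loss
```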
♻ ☆ SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models
Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang
World models allow agents to simulate the consequences of actions in imagined
environments for planning, control, and long-horizon decision-making. However,
existing autoregressive world models struggle with visually coherent
predictions due to disrupted spatial structure, inefficient decoding, and
inadequate motion modeling. In response, we propose Scale-wise Autoregression
with Motion PrOmpt (SAMPO), a hybrid framework that combines visual autoregressive
modeling for intra-frame generation with causal modeling for next-frame
generation. Specifically, SAMPO integrates temporal causal decoding with
bidirectional spatial attention, which preserves spatial locality and supports
parallel decoding within each scale. This design significantly enhances both
temporal consistency and rollout efficiency. To further improve dynamic scene
understanding, we devise an asymmetric multi-scale tokenizer that preserves
spatial details in observed frames and extracts compact dynamic representations
for future frames, optimizing both memory usage and model performance.
Additionally, we introduce a trajectory-aware motion prompt module that injects
spatiotemporal cues about object and robot trajectories, focusing attention on
dynamic regions and improving temporal consistency and physical realism.
Extensive experiments show that SAMPO achieves competitive performance in
action-conditioned video prediction and model-based control, improving
generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's
zero-shot generalization and scaling behavior, demonstrating its ability to
generalize to unseen tasks and benefit from larger model sizes.
comment: 22 pages, 15 figures
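A minimal sketch of the decoding order the abstract describes: frames are generated causally in time, while each frame's token maps are produced coarse to fine, with every token of a scale predicted in one parallel step. The callables and data layout are assumptions, not the paper's actual interfaces.

```python
def scalewise_rollout(context, num_future_frames, scales, predict_scale, detokenize):
    """
    context: list of token maps from observed frames.
    scales: token-map resolutions ordered coarse to fine, e.g. [4, 8, 16].
    predict_scale(context, frame_idx, scale) -> token map for that scale
        (all tokens of the scale produced in parallel).
    detokenize(scale_maps) -> decoded frame.
    """
    frames = []
    for t in range(num_future_frames):        # causal across frames
        scale_maps = []
        for s in scales:                      # coarse-to-fine within a frame
            scale_maps.append(predict_scale(context, t, s))
        frames.append(detokenize(scale_maps))
        context = context + scale_maps        # condition the next frame
    return frames
```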
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Conventional end-to-end (E2E) driving models are effective at generating
physically plausible trajectories, but often fail to generalize to long-tail
scenarios due to the lack of essential world knowledge to understand and reason
about surrounding environments. In contrast, Vision-Language-Action (VLA)
models leverage world knowledge to handle challenging cases, but their limited
3D reasoning capability can lead to physically infeasible actions. In this work,
we introduce DiffVLA++, an enhanced autonomous driving framework that
explicitly bridges cognitive reasoning and E2E planning through metric-guided
alignment. First, we build a VLA module directly generating semantically
grounded driving trajectories. Second, we design an E2E module with a dense
trajectory vocabulary that ensures physical feasibility. Third, and most
critically, we introduce a metric-guided trajectory scorer that guides and
aligns the outputs of the VLA and E2E modules, thereby integrating their
complementary strengths. On the ICCV 2025 Autonomous Grand Challenge
leaderboard, DiffVLA++ achieves an EPDMS of 49.12.
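A minimal sketch of a metric-guided scorer of the kind the abstract describes: candidate trajectories from the VLA and E2E modules are ranked by a shared weighted score over rule-based sub-metrics. The specific sub-metrics and weights are assumptions for illustration.

```python
def metric_guided_select(vla_trajs, e2e_trajs, sub_metrics, weights):
    """
    sub_metrics: dict name -> callable(traj) -> float in [0, 1]
                 (e.g. hypothetical collision, drivable-area, comfort checks).
    weights:     dict name -> float importance of each sub-metric.
    Returns the highest-scoring trajectory across both candidate sets.
    """
    def score(traj):
        return sum(weights[name] * metric(traj) for name, metric in sub_metrics.items())

    candidates = list(vla_trajs) + list(e2e_trajs)
    return max(candidates, key=score)
```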
♻ ☆ LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models IROS 2025
Referential grounding in outdoor driving scenes is challenging due to large
scene variability, many visually similar objects, and dynamic elements that
complicate resolving natural-language references (e.g., "the black car on the
right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf
vision-language models for fine-grained attribute extraction with large
language models for symbolic reasoning. LLM-RG processes an image and a
free-form referring expression by using an LLM to extract relevant object types
and attributes, detecting candidate regions, generating rich visual descriptors
with a VLM, and then combining these descriptors with spatial metadata into
natural-language prompts that are input to an LLM for chain-of-thought
reasoning to identify the referent's bounding box. Evaluated on the Talk2Car
benchmark, LLM-RG yields substantial gains over both LLM and VLM-based
baselines. Additionally, our ablations show that adding 3D spatial cues further
improves grounding. Our results demonstrate the complementary strengths of VLMs
and LLMs, applied in a zero-shot manner, for robust outdoor referential
grounding.
comment: Human-aware Embodied AI Workshop @ IROS 2025
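A minimal sketch of the zero-shot pipeline the abstract outlines, with the LLM, VLM, and detector passed in as callables; the prompt wording, return formats, and index-parsing step are assumptions, not the authors' implementation.

```python
def refer_ground(image, expression, llm, vlm, detect):
    """
    llm(prompt) -> str, vlm(image, prompt) -> str,
    detect(image, class_names) -> list of candidate boxes.
    """
    # 1. LLM extracts the object types/attributes mentioned in the expression.
    classes = [c.strip() for c in
               llm(f"List the object types mentioned in: '{expression}'").split(",")]
    # 2. Detect candidate regions for those classes.
    candidates = detect(image, classes)
    if not candidates:
        return None
    # 3. Describe each candidate with the VLM and attach spatial metadata.
    lines = []
    for i, box in enumerate(candidates):
        desc = vlm(image, f"Describe the object in box {box}")
        lines.append(f"Candidate {i} at box {box}: {desc}")
    # 4. Chain-of-thought selection of the referent.
    answer = llm(f"Referring expression: '{expression}'\n" + "\n".join(lines)
                 + "\nThink step by step, then answer with the candidate index only.")
    digits = "".join(ch for ch in answer if ch.isdigit())
    idx = int(digits) if digits else 0
    return candidates[min(idx, len(candidates) - 1)]
```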
♻ ☆ NEBULA: Do We Evaluate Vision-Language-Action Agents Correctly?
The evaluation of Vision-Language-Action (VLA) agents is hindered by coarse
end-task success metrics, which fail to provide precise skill diagnosis or
measure robustness to real-world perturbations. This challenge is
exacerbated by a fragmented data landscape that impedes reproducible research
and the development of generalist models. To address these limitations, we
introduce NEBULA, a unified ecosystem for single-arm manipulation that enables
diagnostic and reproducible evaluation. NEBULA features a novel dual-axis
evaluation protocol that combines fine-grained capability tests for precise
skill diagnosis with systematic stress tests that measure robustness. A
standardized API and a large-scale, aggregated dataset are provided to reduce
fragmentation and support cross-dataset training and fair comparison. Using
NEBULA, we demonstrate that top-performing VLAs struggle with key capabilities
such as spatial reasoning and dynamic adaptation, which are consistently
obscured by conventional end-task success metrics. By measuring both what an
agent can do and when it does so reliably, NEBULA provides a practical
foundation for robust, general-purpose embodied agents.
comment: Homepage: https://vulab-ai.github.io/NEBULA-Alpha/
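A minimal sketch of a dual-axis evaluation loop in the spirit of the abstract: per-skill success rates on capability suites plus success under systematic perturbations. The suite structure, perturbation interface, and aggregation are assumptions for illustration.

```python
def dual_axis_report(agent, capability_suites, perturbations, run_episode):
    """
    capability_suites: dict skill name -> list of task specs.
    perturbations: list of named callables, each mapping a task spec to a perturbed spec.
    run_episode(agent, task) -> bool success.
    Returns per-skill success rates and robustness under each perturbation.
    """
    report = {"capability": {}, "robustness": {}}
    for skill, tasks in capability_suites.items():
        report["capability"][skill] = sum(run_episode(agent, t) for t in tasks) / len(tasks)
    all_tasks = [t for tasks in capability_suites.values() for t in tasks]
    for perturb in perturbations:
        successes = sum(run_episode(agent, perturb(t)) for t in all_tasks)
        report["robustness"][perturb.__name__] = successes / len(all_tasks)
    return report
```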