Robotics 32
☆ Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Imitation learning has emerged as a promising approach towards building
generalist robots. However, scaling imitation learning for large robot
foundation models remains challenging due to its reliance on high-quality
expert demonstrations. Meanwhile, large amounts of video data depicting a wide
range of environments and diverse behaviors are readily available. This data
provides a rich source of information about real-world dynamics and
agent-environment interactions. Leveraging this data directly for imitation
learning, however, has proven difficult due to the lack of action annotation
required for most contemporary methods. In this work, we present Unified World
Models (UWM), a framework that allows for leveraging both video and action data
for policy learning. Specifically, a UWM integrates an action diffusion process
and a video diffusion process within a unified transformer architecture, where
independent diffusion timesteps govern each modality. We show that by simply
controlling each diffusion timestep, UWM can flexibly represent a policy, a
forward dynamics model, an inverse dynamics model, and a video generator. Through simulated
and real-world experiments, we show that: (1) UWM enables effective pretraining
on large-scale multitask robot datasets with both dynamics and action
predictions, resulting in more generalizable and robust policies than imitation
learning, (2) UWM naturally facilitates learning from action-free video data
through independent control of modality-specific diffusion timesteps, further
improving the performance of finetuned policies. Our results suggest that UWM
offers a promising step toward harnessing large, heterogeneous datasets for
scalable robot learning, and provides a simple unification between the often
disparate paradigms of imitation learning and world modeling. Videos and code
are available at https://weirdlabuw.github.io/uwm/.
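The timestep-control idea in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' released code: the names `Mode`, `timesteps_for`, and `T_MAX` are hypothetical, as is the convention that timestep 0 means a modality is supplied clean (used as conditioning) and `T_MAX` means it is pure noise (effectively ignored).

```python
from enum import Enum

T_MAX = 1000  # assumed number of diffusion steps (hypothetical value)


class Mode(Enum):
    POLICY = "policy"
    VIDEO_GENERATOR = "video_generator"
    FORWARD_DYNAMICS = "forward_dynamics"
    INVERSE_DYNAMICS = "inverse_dynamics"


def timesteps_for(mode, t):
    """Return (action_timestep, video_timestep) for sampling step t.

    Assumed convention: 0 = modality given clean (conditioning),
    T_MAX = modality fully noised (ignored), t = modality being denoised.
    """
    if not 0 <= t <= T_MAX:
        raise ValueError("t must lie in [0, T_MAX]")
    table = {
        Mode.POLICY: (t, T_MAX),           # denoise actions, ignore future video
        Mode.VIDEO_GENERATOR: (T_MAX, t),  # denoise video, ignore actions
        Mode.FORWARD_DYNAMICS: (0, t),     # denoise video given clean actions
        Mode.INVERSE_DYNAMICS: (t, 0),     # denoise actions given clean video
    }
    return table[mode]
```

Under this reading, one network serves all four roles; only the pair of timesteps passed to it changes.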
☆ BT-ACTION: A Test-Driven Approach for Modular Understanding of User Instruction Leveraging Behaviour Trees and LLMs
Natural language instructions are often abstract and complex, requiring
robots to execute multiple subtasks even for seemingly simple queries. For
example, when a user asks a robot to prepare avocado toast, the task involves
several sequential steps. Moreover, such instructions can be ambiguous or
infeasible for the robot or may exceed the robot's existing knowledge. While
Large Language Models (LLMs) offer strong language reasoning capabilities to
handle these challenges, effectively integrating them into robotic systems
remains a key challenge. To address this, we propose BT-ACTION, a test-driven
approach that combines the modular structure of Behavior Trees (BT) with LLMs
to generate coherent sequences of robot actions for following complex user
instructions, specifically in the context of preparing recipes in a
kitchen-assistance setting. We evaluated BT-ACTION in a comprehensive user
study with 45 participants, comparing its performance to direct LLM prompting.
Results demonstrate that the modular design of BT-ACTION helped the robot make
fewer mistakes and increased user trust, and participants showed a significant
preference for the robot leveraging BT-ACTION. The code is publicly available
at https://github.com/1Eggbert7/BT_LLM.
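For readers unfamiliar with behavior trees, the modular structure mentioned in this abstract can be sketched as follows. This is a generic, simplified behavior tree (success/failure only, no RUNNING status), not the BT-ACTION implementation; the class names and the toast example with its leaf names are hypothetical.

```python
class Action:
    """Leaf node: runs a callable that returns True (success) or False (failure)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def tick(self, log):
        log.append(self.name)  # record the order in which leaves were tried
        return bool(self.fn())


class Sequence:
    """Succeeds only if every child succeeds; stops at the first failure."""

    def __init__(self, children):
        self.children = children

    def tick(self, log):
        return all(child.tick(log) for child in self.children)


class Fallback:
    """Succeeds as soon as one child succeeds; tries children in order."""

    def __init__(self, children):
        self.children = children

    def tick(self, log):
        return any(child.tick(log) for child in self.children)


def build_toast_tree(has_avocado):
    # Hypothetical recipe tree: if slicing avocado fails, ask the user instead.
    return Sequence([
        Action("toast_bread", lambda: True),
        Fallback([
            Action("slice_avocado", lambda: has_avocado),
            Action("ask_user_for_substitute", lambda: True),
        ]),
        Action("assemble_toast", lambda: True),
    ])
```

Because each subtree can be ticked and checked in isolation, this structure lends itself to the kind of test-driven development the abstract describes.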
☆ Robot-Led Vision Language Model Wellbeing Assessment of Children
Nida Itrat Abbasi, Fethiye Irmak Dogan, Guy Laban, Joanna Anderson, Tamsin Ford, Peter B. Jones, Hatice Gunes
This study presents a novel robot-led approach to assessing children's mental
wellbeing using a Vision Language Model (VLM). Inspired by the Child
Apperception Test (CAT), the social robot NAO presented children with pictorial
stimuli to elicit their verbal narratives of the images, which were then
evaluated by a VLM in accordance with CAT assessment guidelines. The VLM's
assessments were systematically compared to those provided by a trained
psychologist. The results reveal that while the VLM demonstrates moderate
reliability in identifying cases with no wellbeing concerns, its ability to
accurately classify assessments with clinical concern remains limited.
Moreover, although the model's performance was generally consistent when
prompted with varying demographic factors such as age and gender, a
significantly higher false positive rate was observed for girls, indicating
potential sensitivity to the gender attribute. These findings highlight both the
promise and the challenges of integrating VLMs into robot-led assessments of
children's wellbeing.
☆ Autonomous Human-Robot Interaction via Operator Imitation
Sammy Christen, David Müller, Agon Serifi, Ruben Grandia, Georg Wiedebach, Michael A. Hopkins, Espen Knoop, Moritz Bächer
Teleoperated robotic characters can perform expressive interactions with
humans, relying on the operators' experience and social intuition. In this
work, we propose to create autonomous interactive robots, by training a model
to imitate operator data. Our model is trained on a dataset of human-robot
interactions, where an expert operator is asked to vary the interactions and
mood of the robot, while the operator commands as well as the pose of the human
and robot are recorded. Our approach learns to predict continuous operator
commands through a diffusion process and discrete commands through a
classifier, all unified within a single transformer architecture. We evaluate
the resulting model in simulation and with a user study on the real system. We
show that our method enables simple autonomous human-robot interactions that
are comparable to the expert-operator baseline, and that users can recognize
the different robot moods as generated by our model. Finally, we demonstrate a
zero-shot transfer of our model onto a different robotic platform with the same
operator interface.
☆ A Planning Framework for Stable Robust Multi-Contact Manipulation
While modeling multi-contact manipulation as a quasi-static mechanical
process transitioning between different contact equilibria, we propose
formulating it as a planning and optimization problem, explicitly evaluating
(i) contact stability and (ii) robustness to sensor noise. Specifically, we
conduct a comprehensive study on multi-manipulator control strategies, focusing
on dual-arm execution in a planar peg-in-hole task and extending it to the
Multi-Manipulator Multiple Peg-in-Hole (MMPiH) problem to explore increased
task complexity. Our framework employs Dynamic Movement Primitives (DMPs) to
parameterize desired trajectories and Black-Box Optimization (BBO) with a
comprehensive cost function incorporating friction cone constraints, squeeze
forces, and stability considerations. By integrating parallel scenario
training, we enhance the robustness of the learned policies. To evaluate the
friction cone cost in experiments, we test the optimal trajectories computed
for various contact surfaces, i.e., with different coefficients of friction.
The stability cost is explained analytically, and its necessity is tested in
simulation. The robustness performance is quantified through variations of hole
pose and chamfer size in simulation and experiment. Results demonstrate that
our approach achieves consistently high success rates in both the single
peg-in-hole and multiple peg-in-hole tasks, confirming its effectiveness and
generalizability. The video can be found at https://youtu.be/IU0pdnSd4tE.
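The abstract does not give the form of its friction cone cost. As one possible illustration, assuming a planar contact with a Coulomb friction model, a penalty term could look like the sketch below; the function name and the hinge-style penalty are assumptions, not the authors' cost function.

```python
def friction_cone_violation(f_normal, f_tangential, mu):
    """Amount by which a planar contact force leaves the Coulomb friction cone.

    The contact sticks when |f_tangential| <= mu * f_normal. The return value
    is 0.0 inside the cone and grows linearly with the violation outside it.
    A negative (pulling) normal force is always penalized.
    """
    if mu < 0:
        raise ValueError("friction coefficient must be non-negative")
    return max(0.0, abs(f_tangential) - mu * f_normal)
```

A black-box optimizer could sum such a term over all contacts and time steps, alongside the other cost components the abstract lists.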
☆ A Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders
With the rapid development of 3D printing, the demand for personalized and
customized production on the manufacturing line is steadily increasing.
Efficient merging of printing workpieces can significantly enhance the
processing efficiency of the production line. To address this challenge, a Large
Language Model (LLM)-driven method is established in this paper for the
autonomous merging of 3D printing work orders, integrated with a
memory-augmented learning strategy. In industrial scenarios, both device and
order features are modeled into LLM-readable natural language prompt templates,
and an order-device matching tool is developed along with a merging interference
checking module. By incorporating a self-memory learning strategy, an
intelligent agent for autonomous order merging is constructed, resulting in
improved accuracy and precision in order allocation. The proposed method
effectively leverages the strengths of LLMs in industrial applications while
reducing hallucination.
comment: 6 pages, 5 figures
☆ Industrial Internet Robot Collaboration System and Edge Computing Optimization
In a complex environment, safely avoiding all obstacles without collision
places high demands on a mobile robot's intelligence level.
Given that the information such as the position and geometric characteristics
of obstacles is random, the control parameters of the robot, such as velocity
and angular velocity, are also prone to random deviations. To address this
issue in the framework of the Industrial Internet Robot Collaboration System,
this paper proposes a global path control scheme for mobile robots based on
deep learning. First of all, the dynamic equation of the mobile robot is
established. According to the linear velocity and angular velocity of the
mobile robot, its motion behaviors are divided into obstacle-avoidance
behavior, target-turning behavior, and target-approaching behavior.
Subsequently, the neural network method in deep learning is used to build a
global path planning model for the robot. On this basis, a fuzzy controller is
designed with the help of a fuzzy control algorithm to correct the deviations
that occur during path planning, thereby achieving optimized control of the
robot's global path. In addition, considering edge computing optimization, the
proposed model can process local data at the edge device, reducing the
communication burden between the robot and the central server, and improving
the real-time performance of path planning. The experimental results show that
for the mobile robot controlled by the research method in this paper, the
deviation distance of the path angle is within 5 cm, the deviation convergence
can be completed within 10 ms, and the planned path is shorter. This indicates
that the proposed scheme can effectively improve the global path planning
ability of mobile robots in the industrial Internet environment and promote the
collaborative operation of robots through edge computing optimization.
☆ Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, Rongtao Xu, Shibiao Xu
Robot vision has greatly benefited from advancements in multimodal fusion
techniques and vision-language models (VLMs). We systematically review the
applications of multimodal fusion in key robotic vision tasks, including
semantic scene understanding, simultaneous localization and mapping (SLAM), 3D
object detection, navigation and localization, and robot manipulation. We
compare VLMs based on large language models (LLMs) with traditional multimodal
fusion methods, analyzing their advantages, limitations, and synergies.
Additionally, we conduct an in-depth analysis of commonly used datasets,
evaluating their applicability and challenges in real-world robotic scenarios.
Furthermore, we identify critical research challenges such as cross-modal
alignment, efficient fusion strategies, real-time deployment, and domain
adaptation, and propose future research directions, including self-supervised
learning for robust multimodal representations, transformer-based fusion
architectures, and scalable multimodal frameworks. Through a comprehensive
review, comparative analysis, and forward-looking discussion, we provide a
valuable reference for advancing multimodal perception and interaction in
robotic vision. A comprehensive list of studies in this survey is available at
https://github.com/Xiaofeng-Han-Res/MF-RV.
comment: 27 pages, 11 figures, survey paper submitted to Information Fusion
☆ Adaptive path planning for efficient object search by UAVs in agricultural fields
This paper presents an adaptive path planner for object search in
agricultural fields using UAVs. The path planner uses a high-altitude coverage
flight path and plans additional low-altitude inspections when the detection
network is uncertain. The path planner was evaluated in an offline simulation
environment containing real-world images. We trained a YOLOv8 detection network
to detect artificial plants placed in grass fields to showcase the potential of
our path planner. We evaluated the effect of different detection certainty
measures, optimized the path planning parameters, and investigated the effects of
localization errors and different numbers of objects in the field. The YOLOv8
detection confidence worked best to differentiate between true and false
positive detections and was therefore used in the adaptive planner. The optimal
parameters of the path planner depended on the distribution of objects in the
field: when the objects were uniformly distributed, more low-altitude
inspections were needed compared to a non-uniform distribution of objects,
resulting in a longer path length. The adaptive planner proved to be robust
against localization uncertainty. When increasing the number of objects, the
flight path length increased, especially when the objects were uniformly
distributed. When the objects were non-uniformly distributed, the adaptive path
planner yielded a shorter path than a low-altitude coverage path, even with
a high number of objects. Overall, the presented adaptive path planner found
non-uniformly distributed objects in a field faster than a coverage path
planner while achieving comparable detection accuracy. The path planner is
made available at https://github.com/wur-abe/uav_adaptive_planner.
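The core decision described in this abstract, revisiting a location at low altitude only when the detector is uncertain, can be sketched with a simple confidence band. This is an illustrative assumption, not the released planner: the function name, the tuple layout, and both threshold values are hypothetical.

```python
def plan_inspections(detections, accept_above=0.8, reject_below=0.2):
    """Pick detections that warrant a low-altitude inspection.

    detections: iterable of (x, y, confidence) from the high-altitude pass.
    Detections with confidence >= accept_above are trusted as true positives,
    those below reject_below are discarded, and the uncertain band in between
    is returned as a list of (x, y) inspection targets.
    """
    if not 0.0 <= reject_below <= accept_above <= 1.0:
        raise ValueError("need 0 <= reject_below <= accept_above <= 1")
    return [(x, y) for x, y, conf in detections
            if reject_below <= conf < accept_above]
```

A wider band triggers more inspections and a longer flight path, which matches the trade-off the abstract reports for uniformly distributed objects.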
☆ CHARMS: Cognitive Hierarchical Agent with Reasoning and Motion Styles
To address the current challenges of low intelligence and simplistic vehicle
behavior modeling in autonomous driving simulation scenarios, this paper
proposes the Cognitive Hierarchical Agent with Reasoning and Motion Styles
(CHARMS). The model can reason about the behavior of other vehicles like a
human driver and respond with different decision-making styles, thereby
improving the intelligence and diversity of the surrounding vehicles in the
driving scenario. By introducing the Level-k behavioral game theory, the paper
models the decision-making process of human drivers and employs deep
reinforcement learning to train the models with diverse decision styles,
simulating different reasoning approaches and behavioral characteristics.
Building on the Poisson cognitive hierarchy theory, this paper also presents a
novel driving scenario generation method. The method controls the proportion of
vehicles with different driving styles in the scenario using Poisson and
binomial distributions, thus generating controllable and diverse driving
environments. Experimental results demonstrate that CHARMS not only exhibits
superior decision-making capabilities as ego vehicles, but also generates more
complex and diverse driving scenarios as surrounding vehicles. We will release
code for CHARMS at https://github.com/WUTAD-Wjy/CHARMS.
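In Poisson cognitive hierarchy theory, the share of agents reasoning at level k follows a Poisson distribution with mean tau. A minimal sketch of how such proportions could be computed for a finite number of levels is given below; the function name and the truncation-plus-renormalization step are assumptions, and the abstract's binomial component is not modeled here.

```python
import math


def level_k_proportions(tau, max_level):
    """Proportions of level-0..max_level agents under a truncated Poisson(tau).

    The Poisson probabilities for k = 0..max_level are renormalized so that
    they sum to 1 over the levels actually simulated.
    """
    if tau < 0 or max_level < 0:
        raise ValueError("tau and max_level must be non-negative")
    weights = [math.exp(-tau) * tau ** k / math.factorial(k)
               for k in range(max_level + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Raising tau shifts the mix toward higher reasoning levels, which gives one knob for controlling scenario composition.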
☆ Estimating Scene Flow in Robot Surroundings with Distributed Miniaturized Time-of-Flight Sensors
Tracking motions of humans or objects in the surroundings of the robot is
essential to improve safe robot motions and reactions. In this work, we present
an approach for scene flow estimation from low-density and noisy point clouds
acquired from miniaturized Time of Flight (ToF) sensors distributed on the
robot body. The proposed method clusters points from consecutive frames and
applies Iterative Closest Point (ICP) to estimate a dense motion flow, with
additional steps introduced to mitigate the impact of sensor noise and
low-density data points. Specifically, we employ a fitness-based classification
to distinguish between stationary and moving points and an inlier removal
strategy to refine geometric correspondences. The proposed approach is
validated in an experimental setup where 24 ToF sensors are used to estimate the
velocity of an object moving at different controlled speeds. Experimental
results show that the method consistently approximates the direction of the
motion and its magnitude with an error which is in line with sensor noise.
comment: 7 pages, 5 figures, 2 tables, 1 algorithm
☆ On learning racing policies with reinforcement learning
Fully autonomous vehicles promise enhanced safety and efficiency. However,
ensuring reliable operation in challenging corner cases requires control
algorithms capable of performing at the vehicle limits. We address this
requirement by considering the task of autonomous racing and propose solving it
by learning a racing policy using Reinforcement Learning (RL). Our approach
leverages domain randomization, actuator dynamics modeling, and policy
architecture design to enable reliable and safe zero-shot deployment on a real
platform. Evaluated on the F1TENTH race car, our RL policy not only surpasses a
state-of-the-art Model Predictive Control (MPC), but, to the best of our
knowledge, also represents the first instance of an RL policy outperforming
expert human drivers in RC racing. This work identifies the key factors driving
this performance improvement, providing critical insights for the design of
robust RL-based control strategies for autonomous vehicles.
☆ All-day Depth Completion via Thermal-LiDAR Fusion
Depth completion, which estimates dense depth from sparse LiDAR and RGB
images, has demonstrated outstanding performance in well-lit conditions.
However, due to the limitations of RGB sensors, existing methods often struggle
to achieve reliable performance in harsh environments, such as heavy rain and
low-light conditions. Furthermore, we observe that ground truth depth maps
often suffer from large missing measurements in adverse weather conditions such
as heavy rain, leading to insufficient supervision. In contrast, thermal
cameras are known for providing clear and reliable visibility in such
conditions, yet research on thermal-LiDAR depth completion remains
underexplored. Moreover, the characteristics of thermal images, such as
blurriness, low contrast, and noise, bring unclear depth boundary problems. To
address these challenges, we first evaluate the feasibility and robustness of
thermal-LiDAR depth completion across diverse lighting (e.g., well-lit,
low-light), weather (e.g., clear-sky, rainy), and environment (e.g., indoor,
outdoor) conditions, by conducting extensive benchmarks on the MS$^2$ and ViViD
datasets. In addition, we propose a framework that utilizes COntrastive
learning and Pseudo-Supervision (COPS) to enhance depth boundary clarity and
improve completion accuracy by leveraging a depth foundation model in two key
ways. First, COPS enforces a depth-aware contrastive loss between different
depth points by mining positive and negative samples using a monocular depth
foundation model to sharpen depth boundaries. Second, it mitigates the issue of
incomplete supervision from ground truth depth maps by leveraging foundation
model predictions as dense depth priors. We also provide in-depth analyses of
the key challenges in thermal-LiDAR depth completion to aid in understanding
the task and encourage future research.
☆ X-Capture: An Open-Source Portable Device for Multi-Sensory Learning
Understanding objects through multiple sensory modalities is fundamental to
human perception, enabling cross-sensory integration and richer comprehension.
For AI and robotic systems to replicate this ability, access to diverse,
high-quality multi-sensory data is critical. Existing datasets are often
limited by their focus on controlled environments, simulated objects, or
restricted modality pairings. We introduce X-Capture, an open-source, portable,
and cost-effective device for real-world multi-sensory data collection, capable
of capturing correlated RGBD images, tactile readings, and impact audio. With a
build cost under $1,000, X-Capture democratizes the creation of multi-sensory
datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we
curate a sample dataset of 3,000 total points on 500 everyday objects from
diverse, real-world environments, offering both richness and variety. Our
experiments demonstrate the value of both the quantity and the sensory breadth
of our data for both pretraining and fine-tuning multi-modal representations
for object-centric tasks such as cross-sensory retrieval and reconstruction.
X-Capture lays the groundwork for advancing human-like sensory representations
in AI, emphasizing scalability, accessibility, and real-world applicability.
comment: Project page: https://xcapture.github.io/
☆ MinkOcc: Towards real-time label-efficient semantic occupancy prediction
Developing 3D semantic occupancy prediction models often relies on dense 3D
annotations for supervised learning, a process that is both labor- and
resource-intensive, underscoring the need for label-efficient or even
label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D
semantic occupancy prediction framework for cameras and LiDARs that proposes a
two-step semi-supervised training procedure. Here, a small dataset of
explicit 3D annotations warm-starts the training process; then, the
supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and
images -- semantically labelled through vision foundation models. MinkOcc
effectively utilizes these sensor-rich supervisory cues and reduces reliance on
manual labeling by 90% while maintaining competitive accuracy. In addition,
the proposed model incorporates information from LiDAR and camera data through
early fusion and leverages sparse convolution networks for real-time
prediction. With its efficiency in both supervision and computation, we aim to
extend MinkOcc beyond curated datasets, enabling broader real-world deployment
of 3D semantic occupancy prediction in autonomous driving.
comment: 8 pages
☆ Bipedal Robust Walking on Uneven Footholds: Piecewise Slope LIPM with Discrete Model Predictive Control
This study presents an enhanced theoretical formulation for bipedal
hierarchical control frameworks under uneven terrain conditions. Specifically,
owing to the inherent limitations of the Linear Inverted Pendulum Model (LIPM)
in handling terrain elevation variations, we develop a Piecewise Slope LIPM
(PS-LIPM). This innovative model enables dynamic adjustment of the Center of
Mass (CoM) height to align with topographical undulations during single-step
cycles. Another contribution is a generalized Angular Momentum-based
LIPM (G-ALIP) for CoM velocity compensation using Centroidal Angular Momentum
(CAM) regulation. Building upon these advancements, we derive the DCM
step-to-step dynamics for the Model Predictive Control (MPC) formulation, enabling
simultaneous optimization of step position and step duration. A hierarchical
control framework integrating MPC with a Whole-Body Controller (WBC) is
implemented for bipedal locomotion across uneven stepping stones. The results
validate the efficacy of the proposed hierarchical control framework and the
theoretical formulation.
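As background for the model this abstract extends, the standard constant-height LIPM has a closed-form solution, sketched below. This is the textbook baseline, not the proposed PS-LIPM or G-ALIP; the function names are hypothetical, and positions are measured relative to the stance foot.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def lipm_state(x0, v0, com_height, t):
    """CoM position and velocity after time t under the constant-height LIPM.

    Dynamics: x'' = (g / z) * x, with x measured from the stance foot.
    """
    omega = math.sqrt(G / com_height)
    x = x0 * math.cosh(omega * t) + (v0 / omega) * math.sinh(omega * t)
    v = x0 * omega * math.sinh(omega * t) + v0 * math.cosh(omega * t)
    return x, v


def dcm(x, v, com_height):
    """Divergent component of motion: xi = x + v / omega."""
    omega = math.sqrt(G / com_height)
    return x + v / omega
```

Within a single step the DCM grows as xi(t) = xi(0) * exp(omega * t), which is why step position and step duration are natural decision variables for a step-to-step controller.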
☆ Adapting World Models with Latent-State Dynamics Residuals
Simulation-to-reality reinforcement learning (RL) faces the critical
challenge of reconciling discrepancies between simulated and real-world
dynamics, which can severely degrade agent performance. A promising approach
involves learning corrections to simulator forward dynamics represented as a
residual error function; however, this operation is impractical with
high-dimensional states such as images. To overcome this, we propose ReDRAW, a
latent-state autoregressive world model pretrained in simulation and calibrated
to target environments through residual corrections of latent-state dynamics
rather than of explicit observed states. Using this adapted world model, ReDRAW
enables RL agents to be optimized with imagined rollouts under corrected
dynamics and then deployed in the real world. In multiple vision-based MuJoCo
domains and a physical robot visual lane-following task, ReDRAW effectively
models changes to dynamics and avoids overfitting in low-data regimes where
traditional transfer methods fail.
comment: 15 pages, 11 figures. Project website at https://redraw.jblanier.net/
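The general pattern of correcting a pretrained dynamics model with a learned residual can be sketched as follows. This is a generic illustration using plain Python lists as latent vectors; the abstract does not specify ReDRAW's latent representation or how its residual is applied, so the additive form and all names here are assumptions.

```python
def corrected_latent_step(sim_dynamics, residual, latent, action):
    """One step of simulator-pretrained latent dynamics plus a residual.

    sim_dynamics and residual are callables (latent, action) -> list of floats.
    Only the residual would be fitted on target-environment data.
    """
    predicted = sim_dynamics(latent, action)
    correction = residual(latent, action)
    if len(predicted) != len(correction):
        raise ValueError("dynamics and residual outputs must match in size")
    return [p + c for p, c in zip(predicted, correction)]


def imagined_rollout(sim_dynamics, residual, latent, actions):
    """Roll the corrected model forward without touching the real system."""
    trajectory = [list(latent)]
    for action in actions:
        latent = corrected_latent_step(sim_dynamics, residual, latent, action)
        trajectory.append(latent)
    return trajectory
```

Keeping the pretrained model frozen and fitting only a small residual is one common way to limit overfitting when target-environment data is scarce.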
☆ Designing Effective Human-Swarm Interaction Interfaces: Insights from a User Study on Task Performance
In this paper, we present a systematic method of design for human-swarm
interaction interfaces, combining theoretical insights with empirical
evaluation. We first derived ten design principles from existing literature,
applied them to key information dimensions identified through goal-directed task
analysis, and developed a tablet-based interface for a target search task. We
then conducted a user study with 31 participants where humans were required to
guide a robotic swarm to a target in the presence of three types of hazards
that pose a risk to the robots: Distributed, Moving, and Spreading. Performance
was measured based on the proximity of the robots to the target and the number
of deactivated robots at the end of the task. Results indicate that at least
one robot was brought closer to the target in 98% of tasks, demonstrating the
interface's success in fulfilling the primary objective of the task. Additionally,
in nearly 67% of tasks, more than 50% of the robots reached the target.
Moreover, performance was notably better in moving-hazard scenarios.
Additionally, the interface appeared to help minimize robot deactivation, as
evidenced by nearly 94% of tasks where participants managed to keep more than
50% of the robots active, ensuring that most of the swarm remained operational.
However, its effectiveness varied across hazards, with robot deactivation being
lowest in distributed hazard scenarios, suggesting that the interface provided
the most support in these conditions.
comment: 8 pages, 4 figures, 5 tables
☆ Model Predictive Control with Visibility Graphs for Humanoid Path Planning and Tracking Against Adversarial Opponents ICRA
In this paper we detail the methods used for obstacle avoidance, path
planning, and trajectory tracking that helped us win the adult-sized,
autonomous humanoid soccer league in RoboCup 2024. Our team was undefeated for
all seated matches and scored 45 goals over 6 games, winning the championship
game 6 to 1. During the competition, a major challenge for collision avoidance
was the measurement noise coming from bipedal locomotion and a limited field of
view (FOV). Furthermore, obstacles would sporadically jump in and out of our
planned trajectory. At times our estimator would place our robot inside a hard
constraint. Any planner in this competition must also be computationally
efficient enough to re-plan and react in real time. This motivated our approach
to trajectory generation and tracking. In many scenarios, both long-term and
short-term planning are needed. To efficiently find a long-term general path
that avoids all obstacles, we developed DAVG (Dynamic Augmented Visibility
Graphs). DAVG focuses on essential path planning by setting certain regions to
be active based on obstacles and the desired goal pose. By augmenting the
states in the graph, turning angles are considered, which is crucial for a
large soccer-playing robot as turning may be more costly. A trajectory is
formed by linearly interpolating between discrete points generated by DAVG. A
modified version of model predictive control (MPC), called cf-MPC
(Collision-Free MPC), is then used to track this trajectory. This ensures short-term
planning. Without having to switch formulations, cf-MPC takes into account the
robot dynamics and collision-free constraints. Without a hard switch, the
control input can smoothly transition in cases where the noise places our robot
inside a constraint boundary. The nonlinear formulation runs at approximately
120 Hz, while the quadratic version achieves around 400 Hz.
comment: This is a preprint version. This paper has been accepted to IEEE
International Conference on Robotics and Automation (ICRA) 2025. The final
published version will be available on IEEE Xplore
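The step of forming a trajectory by linearly interpolating between discrete planner waypoints can be sketched as below. This is an illustrative helper, not the authors' code: the function name, the 2D tuple waypoints, and the `spacing` parameter are assumptions, and turning angles from the augmented graph states are not handled.

```python
import math


def interpolate_path(waypoints, spacing):
    """Densify a polyline of (x, y) waypoints by linear interpolation.

    Each segment is split into equal pieces no longer than `spacing`.
    The original waypoints are always kept in the output.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    if not waypoints:
        return []
    points = [waypoints[0]]
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        steps = max(1, math.ceil(length / spacing))
        for i in range(1, steps + 1):
            s = i / steps  # fraction of the way along this segment
            points.append((x0 + s * (x1 - x0), y0 + s * (y1 - y0)))
    return points
```

The resulting dense reference could then be handed to a tracking controller, which is the role cf-MPC plays in the abstract.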
♻ ☆ SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko
Reasoning about motion and space is a fundamental cognitive capability that
is required by multiple real-world applications. While many studies highlight
that large multimodal language models (MLMs) struggle to reason about space,
they only focus on static spatial relationships, and not dynamic awareness of
motion and space, i.e., reasoning about the effect of egocentric and object
motions on spatial relationships. Manually annotating such object and camera
movements is expensive. Hence, we introduce SAT, a simulated spatial aptitude
training dataset comprising both static and dynamic spatial reasoning across
175K question-answer (QA) pairs and 20K scenes. Complementing this, we also
construct a small (150 image-QAs) yet challenging dynamic spatial test set
using real-world images. Leveraging our SAT datasets and 6 existing static
spatial benchmarks, we systematically investigate what improves both static and
dynamic spatial awareness. Our results reveal that simulations are surprisingly
effective at imparting spatial aptitude to MLMs that translates to real images.
We show that perfect annotations in simulation are more effective than existing
approaches of pseudo-annotating real images. For instance, SAT training
improves a LLaVA-13B model by an average of 11% and a LLaVA-Video-7B model by an
average of 8% on multiple spatial benchmarks, including our real-image dynamic
test set and spatial reasoning on long videos -- even outperforming some large
proprietary models. While reasoning over static relationships improves with
synthetic training data, there is still considerable room for improvement for
dynamic reasoning questions.
comment: Project webpage: https://arijitray.com/SAT/
♻ ☆ Scaling Laws in Scientific Discovery with AI and Robot Scientists
Pengsong Zhang, Heng Zhang, Huazhe Xu, Renjun Xu, Zhenting Wang, Cong Wang, Animesh Garg, Zhibin Li, Arash Ajoudani, Xinyu Liu
Scientific discovery is poised for rapid advancement through advanced
robotics and artificial intelligence. Current scientific practices face
substantial limitations as manual experimentation remains time-consuming and
resource-intensive, while multidisciplinary research demands knowledge
integration beyond individual researchers' expertise boundaries. Here, we
envision an autonomous generalist scientist (AGS) concept that combines agentic AI
and embodied robotics to automate the entire research lifecycle. This system
could dynamically interact with both physical and virtual environments while
facilitating the integration of knowledge across diverse scientific
disciplines. By deploying these technologies throughout every research stage --
spanning literature review, hypothesis generation, experimentation, and
manuscript writing -- and incorporating internal reflection alongside external
feedback, this system aims to significantly reduce the time and resources
needed for scientific discovery. Building on the evolution from virtual AI
scientists to versatile generalist AI-based robot scientists, AGS promises
groundbreaking potential. As these autonomous systems become increasingly
integrated into the research process, we hypothesize that scientific discovery
might adhere to new scaling laws, potentially shaped by the number and
capabilities of these autonomous systems, offering novel perspectives on how
knowledge is generated and evolves. The adaptability of embodied robots to
extreme environments, paired with the flywheel effect of accumulating
scientific knowledge, holds the promise of continually pushing beyond both
physical and intellectual frontiers.
♻ ☆ GRACE: Generating Socially Appropriate Robot Actions Leveraging LLMs and Human Explanations ICRA
When operating in human environments, robots need to handle complex tasks
while both adhering to social norms and accommodating individual preferences.
For instance, based on common sense knowledge, a household robot can predict
that it should avoid vacuuming during a social gathering, but it may still be
uncertain whether it should vacuum before or after having guests. In such
cases, integrating common-sense knowledge with human preferences, often
conveyed through human explanations, is fundamental yet challenging for
existing systems. In this paper, we introduce GRACE, a novel approach
addressing this while generating socially appropriate robot actions. GRACE
leverages common sense knowledge from LLMs, and it integrates this knowledge
with human explanations through a generative network. The bidirectional
structure of GRACE enables robots to refine and enhance LLM predictions by
utilizing human explanations and makes robots capable of generating such
explanations for human-specified actions. Our evaluations show that integrating
human explanations boosts GRACE's performance, where it outperforms several
baselines and provides sensible explanations.
comment: 2025 IEEE International Conference on Robotics & Automation (ICRA),
Supplementary video: https://youtu.be/GTNCC1GkiQ4
♻ ☆ Online Hybrid-Belief POMDP with Coupled Semantic-Geometric Models and Semantic Safety Awareness
Robots operating in complex and unknown environments frequently require
geometric-semantic representations of the environment to safely perform their
tasks. While inferring the environment, they must account for many possible
scenarios when planning future actions. Since objects' class types are discrete
and the robot's self-pose and the objects' poses are continuous, the
environment can be represented by a hybrid discrete-continuous belief which is
updated according to models and incoming data. Prior probabilities and
observation models representing the environment can be learned from data using
deep learning algorithms. Such models often couple environmental semantic and
geometric properties. As a result, semantic variables are interconnected,
causing semantic state space dimensionality to increase exponentially. In this
paper, we consider planning under uncertainty using partially observable Markov
decision processes (POMDPs) with hybrid semantic-geometric beliefs. The models
and priors consider the coupling between semantic and geometric variables.
Within POMDP, we introduce the concept of semantically aware safety. Obtaining
representative samples of the theoretical hybrid belief, required for
estimating the value function, is very challenging. As a key contribution, we
develop a novel form of the hybrid belief and leverage it to draw
representative samples. We show that under certain conditions, the value
function and probability of safety can be calculated efficiently with an
explicit expectation over all possible semantic mappings. Our simulations show
that our estimates of the objective function and probability of safety achieve
similar levels of accuracy compared to estimators that run exhaustively on the
entire semantic state-space using samples from the theoretical hybrid belief.
Nevertheless, the complexity of our estimators is polynomial rather than
exponential.
comment: 18 pages, 11 figures
♻ ☆ MI-HGNN: Morphology-Informed Heterogeneous Graph Neural Network for Legged Robot Contact Perception ICRA 2025
We present a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN)
for learning-based contact perception. The architecture and connectivity of the
MI-HGNN are constructed from the robot morphology, in which nodes and edges are
robot joints and links, respectively. By incorporating the morphology-informed
constraints into a neural network, we improve a learning-based approach using
model-based knowledge. We apply the proposed MI-HGNN to two contact perception
problems, and conduct extensive experiments using both real-world and simulated
data collected using two quadruped robots. Our experiments demonstrate the
superiority of our method in terms of effectiveness, generalization ability,
model efficiency, and sample efficiency. Our MI-HGNN improved the performance
of a state-of-the-art model that leverages robot morphological symmetry by 8.4%
with only 0.21% of its parameters. Although MI-HGNN is applied to contact
perception problems for legged robots in this work, it can be seamlessly
applied to other types of multi-body dynamical systems and has the potential to
improve other robot learning frameworks. Our code is made publicly available at
https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.
comment: 6 pages, 5 figures; This work has been accepted to ICRA 2025 and will
soon be published
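The abstract's idea of deriving graph connectivity from robot morphology (joints as nodes, links as edges) can be illustrated with a minimal sketch. This is not the authors' implementation: the quadruped layout, the leg and joint names, and the extra `base` and `foot` node types are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical quadruped layout (an assumption, not taken from the paper):
# one floating base and four legs with three joints and one foot each.
LEGS = ["FL", "FR", "RL", "RR"]
JOINTS_PER_LEG = ["hip", "thigh", "calf"]


def build_morphology_graph():
    """Return (nodes, edges): nodes map name -> node type, edges are links."""
    nodes = {"base": "base"}
    edges = []
    for leg in LEGS:
        parent = "base"
        for joint in JOINTS_PER_LEG:
            name = f"{leg}_{joint}"
            nodes[name] = "joint"
            edges.append((parent, name))  # a link connects parent and child
            parent = name
        foot = f"{leg}_foot"
        nodes[foot] = "foot"
        edges.append((parent, foot))
    return nodes, edges


def adjacency(edges):
    """Undirected adjacency sets, the structure a GNN would pass messages over."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


nodes, edges = build_morphology_graph()
adj = adjacency(edges)
```

Because the node types are heterogeneous, a network built on this graph could share weights per node type, which is one plausible reason a morphology-derived model can stay small.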
♻ ☆ ArtFormer: Controllable Generation of Diverse 3D Articulated Objects CVPR 2025
This paper presents a novel framework for modeling and conditional generation
of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing
methods are often limited to using predefined structures or retrieving shapes
from static datasets. To address these challenges, we parameterize an
articulated object as a tree of tokens and employ a transformer to generate
both the object's high-level geometry code and its kinematic relations.
Subsequently, each sub-part's geometry is further decoded using a
signed-distance-function (SDF) shape prior, facilitating the synthesis of
high-quality 3D shapes. Our approach enables the generation of diverse objects
with high-quality geometry and a varying number of parts. Comprehensive
experiments on conditional generation from text descriptions demonstrate the
effectiveness and flexibility of our method.
comment: CVPR 2025. impl. repo: https://github.com/ShuYuMo2003/ArtFormer
♻ ☆ A nonlinear real time capable motion cueing algorithm based on deep reinforcement learning
Hendrik Scheidel, Camilo Gonzalez, Houshyar Asadi, Tobias Bellmann, Andreas Seefried, Shady Mohamed, Saeid Nahavandi
In motion simulation, motion cueing algorithms (MCAs) are used for the trajectory
planning of the motion simulator platform (MSP), where workspace limitations prevent
direct reproduction of reference trajectories. Strategies such as motion
washout, which return the platform to its center, are crucial in these
settings. For serial robotic MSPs with highly nonlinear workspaces, it is
essential to maximize the efficient utilization of the MSP's kinematic and
dynamic capabilities. Traditional approaches, including classical washout
filtering and linear model predictive control, fail to consider
platform-specific, nonlinear properties, while nonlinear model predictive
control, though comprehensive, imposes high computational demands that hinder
real-time, pilot-in-the-loop application without further simplification. To
overcome these limitations, we introduce a novel approach using deep
reinforcement learning (DRL) for motion cueing, demonstrated here for the first time
in a 6-degree-of-freedom (6-DOF) setting with full consideration of the MSP's kinematic
nonlinearities. Previous work by the authors successfully demonstrated the
application of DRL to a simplified 2-DOF setup, which did not consider
kinematic or dynamic constraints. This approach has been extended to all 6 DOF
by incorporating a complete kinematic model of the MSP into the algorithm, a
crucial step for enabling its application on a real motion simulator. The
training of the DRL-MCA is based on Proximal Policy Optimization in an
actor-critic implementation combined with an automated hyperparameter
optimization. After detailing the necessary training framework and the
algorithm itself, we provide a comprehensive validation, demonstrating that the
DRL-MCA achieves competitive performance against established algorithms.
Moreover, it generates feasible trajectories by respecting all system
constraints and meets all real-time requirements with low...
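For context, the classical washout filtering that the abstract names as a traditional baseline can be sketched as a first-order high-pass filter: onset cues pass through, while sustained accelerations decay so the platform can drift back toward its center. This is a textbook-style illustration, not the DRL-based method of the paper; the function name, the time constant `tau`, and the sample time `dt` are illustrative assumptions.

```python
def washout_highpass(accel, dt, tau):
    """Discrete first-order high-pass filter over a list of accelerations.

    Sustained inputs are "washed out" (decay toward zero) while changes
    in the input are passed through. tau is the filter time constant in
    seconds and dt the sample time in seconds (both assumed values).
    """
    alpha = tau / (tau + dt)
    out = []
    prev_in = 0.0
    prev_out = 0.0
    for a in accel:
        y = alpha * (prev_out + a - prev_in)
        out.append(y)
        prev_in, prev_out = a, y
    return out


# A sustained 1 m/s^2 reference acceleration held for 20 s at 100 Hz.
step = [1.0] * 2000
cue = washout_highpass(step, dt=0.01, tau=1.0)
```

With a step input, the commanded cue starts near the reference value and then decays, which is the washout behavior; a linear filter like this ignores the platform-specific nonlinear workspace limits that the abstract highlights.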
♻ ☆ 6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting
Efficient and accurate object pose estimation is an essential component for
modern vision systems in many applications such as Augmented Reality,
autonomous driving, and robotics. While research in model-based 6D object pose
estimation has delivered promising results, model-free methods are hindered by
the high computational load in rendering and inferring consistent poses of
arbitrary objects in a live RGB-D video stream. To address this issue, we
present 6DOPE-GS, a novel method for online 6D object pose estimation and
tracking with a single RGB-D camera by effectively leveraging advances in
Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of
Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses
and 3D object reconstruction. To achieve the necessary efficiency and accuracy
for live tracking, our method uses incremental 2D Gaussian Splatting with an
intelligent dynamic keyframe selection procedure to achieve high spatial object
coverage and prevent erroneous pose updates. We also propose an opacity
statistic-based pruning mechanism for adaptive Gaussian density control, to
ensure training stability and efficiency. We evaluate our method on the HO3D
and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of
state-of-the-art baselines for model-free simultaneous 6D pose tracking and
reconstruction while providing a 5× speedup. We also demonstrate the
method's suitability for live, dynamic object tracking and reconstruction in a
real-world setting.
♻ ☆ R+X: Retrieval and Execution from Everyday Human Videos ICRA
We present R+X, a framework which enables robots to learn skills from long,
unlabelled, first-person videos of humans performing everyday tasks. Given a
language command from a human, R+X first retrieves short video clips containing
relevant behaviour, and then executes the skill by conditioning an in-context
imitation learning method (KAT) on this behaviour. By leveraging a Vision
Language Model (VLM) for retrieval, R+X does not require any manual annotation
of the videos, and by leveraging in-context learning for execution, robots can
perform commanded skills immediately, without requiring a period of training on
the retrieved videos. Experiments studying a range of everyday household tasks
show that R+X succeeds at translating unlabelled human videos into robust robot
skills, and that R+X outperforms several recent alternative methods. Videos and
code are available at https://www.robot-learning.uk/r-plus-x.
comment: Published at the IEEE International Conference on Robotics and
Automation (ICRA) 2025
♻ ☆ A Framework for Adapting Human-Robot Interaction to Diverse User Groups
Theresa Pekarek Rosin, Vanessa Hassouna, Xiaowen Sun, Luca Krohm, Henri-Leon Kordt, Michael Beetz, Stefan Wermter
To facilitate natural and intuitive interactions with diverse user groups in
real-world settings, social robots must be capable of addressing the varying
requirements and expectations of these groups while adapting their behavior
based on user feedback. While previous research often focuses on specific
demographics, we present a novel framework for adaptive Human-Robot Interaction
(HRI) that tailors interactions to different user groups and enables individual
users to modulate interactions through both minor and major interruptions. Our
primary contributions include the development of an adaptive, ROS-based HRI
framework with an open-source code base. This framework supports natural
interactions through advanced speech recognition and voice activity detection,
and leverages a large language model (LLM) as a dialogue bridge. We validate
the efficiency of our framework through module tests and system trials,
demonstrating its high accuracy in age recognition and its robustness to
repeated user inputs and plan changes.
comment: Published in the Proceedings of the 16th International Conference on
Social Robotics (ICSR) 2024
♻ ☆ STREAK: Streaming Network for Continual Learning of Object Relocations under Household Context Drifts
In real-world settings, robots are expected to assist humans across diverse
tasks and still continuously adapt to dynamic changes over time. For example,
in domestic environments, robots can proactively help users by fetching needed
objects based on learned routines, which they infer by observing how objects
move over time. However, data from these interactions are inherently
non-independent and non-identically distributed (non-i.i.d.), e.g., a robot
assisting multiple users may encounter varying data distributions as
individuals follow distinct habits. This creates a challenge: integrating new
knowledge without catastrophic forgetting. To address this, we propose STREAK
(Spatio Temporal RElocation with Adaptive Knowledge retention), a continual
learning framework for real-world robotic learning. It leverages a streaming
graph neural network with regularization and rehearsal techniques to mitigate
context drifts while retaining past knowledge. Our method is time- and
memory-efficient, enabling long-term learning without retraining on all past
data, which becomes infeasible as data grows in real-world interactions. We
evaluate STREAK on the task of incrementally predicting human routines over 50+
days across different households. Results show that it effectively prevents
catastrophic forgetting while maintaining generalization, making it a scalable
solution for long-term human-robot interactions.
♻ ☆ HEROS: Hierarchical Exploration with Online Subregion Updating for 3D Environment Coverage
We present an autonomous exploration system for efficient coverage of unknown
environments. First, a rapid environment preprocessing method is introduced to
provide environmental information for subsequent exploration planning. Then,
the whole exploration space is divided into multiple subregion cells, each with
varying levels of detail. The subregion cells are capable of decomposition and
updating online, effectively characterizing dynamic unknown regions with
variable resolution. Finally, the hierarchical planning strategy treats
subregions as basic planning units and computes an efficient global coverage
path. Guided by the global path, the local path that sequentially visits the
viewpoint set is refined to provide an executable path for the robot. This
hierarchical planning from coarse to fine steps reduces the complexity of the
planning scheme while improving exploration efficiency. The proposed method is
compared with state-of-the-art methods in benchmark environments. Our approach
demonstrates superior efficiency in completing exploration while using lower
computational resources.
♻ ☆ Beyond Non-Expert Demonstrations: Outcome-Driven Action Constraint for Offline Reinforcement Learning
We address the challenge of offline reinforcement learning using realistic
data, specifically non-expert data collected through sub-optimal behavior
policies. Under such circumstances, the learned policy must be safe enough to
manage distribution shift while maintaining sufficient flexibility to deal with
non-expert (bad) demonstrations from offline data. To tackle this issue, we
introduce a novel method called Outcome-Driven Action Flexibility (ODAF), which
seeks to reduce reliance on the empirical action distribution of the behavior
policy, hence reducing the negative impact of those bad demonstrations. To be
specific, a new conservative reward mechanism is developed to deal with
distribution shift by evaluating actions according to whether their outcomes
meet safety requirements, i.e., remaining within the state support area, rather than
solely depending on the actions' likelihood based on offline data. Besides
theoretical justification, we provide empirical evidence on widely used MuJoCo
and various maze benchmarks, demonstrating that our ODAF method, implemented
using uncertainty quantification techniques, effectively tolerates unseen
transitions for improved "trajectory stitching," while enhancing the agent's
ability to learn from realistic non-expert data.