Abstract
Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on expert programming tailored to each product and to fixed settings, which makes them inherently inflexible to product changes and insufficiently robust to variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision-Language Model (VLM) to decompose human demonstration videos into subtasks, from which Behavior Trees are generated. During execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in dynamic environments, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions.
Methods
The Video-to-BT framework consists of three tightly coupled modules (Planning, Perception, and Execution) that together form a closed-loop system.
Planning. A two-stage VLM pipeline is employed: Stage A interprets a human demonstration video to infer a sequence of subtasks along with inter-object constraints, while Stage B structures and compiles these subtasks into an executable Behavior Tree (BT), where each subtree is iteratively validated via a BT simulator. Human-in-the-loop (HITL) verification is embedded in both stages to ensure safety and reliability.
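To make the idea of validating a plan in simulation more concrete, the following is a minimal sketch, not the authors' BT simulator. It assumes each subtask is described by hypothetical `pre`, `add`, and `delete` fact lists and checks, by symbolic simulation, that every precondition is satisfied when its subtask is reached. All subtask and fact names are illustrative.

```python
def validate_plan(subtasks, initial_facts):
    """Symbolically simulate subtasks in order.

    Returns a list of (subtask name, missing preconditions); an empty
    list means every precondition held when its subtask was reached.
    The pre/add/delete schema is an assumption made for this sketch.
    """
    facts = set(initial_facts)
    problems = []
    for task in subtasks:
        missing = set(task["pre"]) - facts
        if missing:
            problems.append((task["name"], sorted(missing)))
        # Apply the subtask's effects: remove deleted facts, then add new ones.
        facts = (facts - set(task.get("delete", ()))) | set(task["add"])
    return problems


# Hypothetical three-step plan in which the gear is never picked up.
plan = [
    {"name": "pick_shaft", "pre": ["shaft_on_table"],
     "add": ["shaft_held"], "delete": ["shaft_on_table"]},
    {"name": "insert_shaft", "pre": ["shaft_held"],
     "add": ["shaft_in_base"], "delete": ["shaft_held"]},
    {"name": "mount_gear", "pre": ["shaft_in_base", "gear_held"],
     "add": ["gear_on_shaft"], "delete": ["gear_held"]},
]
problems = validate_plan(plan, ["shaft_on_table", "gear_on_table"])
print(problems)
```

In a pipeline like the one described, a report of this kind could plausibly be returned to the VLM or shown to the human reviewer so that the missing subtask is added before the subtree is accepted.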
Perception. The module maintains a real-time world state ω, modeled as a set of object properties and inter-object relations. It updates ω by continuously performing object recognition and scene interpretation, which lets the system track environmental changes caused by external disturbances.
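One possible way to represent such a world state is sketched below. The `WorldState` class, its field layout, and the relation names (`on`, `inserted_in`) are assumptions made for illustration and do not come from the framework itself.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Illustrative world state: object properties plus inter-object relations."""
    # object name -> {property name -> value}
    properties: dict = field(default_factory=dict)
    # set of (relation, subject, object) triples
    relations: set = field(default_factory=set)

    def update(self, properties, relations):
        """Overwrite the state with the latest perception result."""
        self.properties = {name: dict(props) for name, props in properties.items()}
        self.relations = set(relations)

    def holds(self, relation, subject, obj):
        """Check whether a relation is currently observed."""
        return (relation, subject, obj) in self.relations


def changed_relations(before, after):
    """Return (added, removed) relation triples between two snapshots."""
    return after - before, before - after


# Two hypothetical perception updates, before and after an insertion.
omega = WorldState()
omega.update({"gear": {"pose": (0.1, 0.2, 0.0)}}, {("on", "gear", "table")})
previous = set(omega.relations)
omega.update({"gear": {"pose": (0.3, 0.2, 0.05)}}, {("inserted_in", "gear", "shaft")})
added, removed = changed_relations(previous, omega.relations)
```

Comparing consecutive snapshots in this way is one simple means of noticing that a disturbance has changed the scene between two perception updates.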
Execution. The module drives the BT by recursively propagating ticks in a depth-first, left-to-right order, dynamically scheduling low-level action primitives (e.g., grasping, assembling) based on the current world state. Upon detecting execution failure, the system automatically triggers VLM-driven replanning to refine the BT and recover from disturbances.
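The depth-first, left-to-right tick propagation can be illustrated with a minimal sketch of standard Sequence and Fallback nodes. This is a generic BT illustration under simplifying assumptions, not the framework's executor: actions complete instantly (real primitives would typically return RUNNING while in progress), the world state is a plain dictionary, and the node and key names are hypothetical. Pairing each action with a condition check in a Fallback is a common BT idiom and is assumed here for the example.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3


class Condition:
    """Leaf that checks a predicate on the world state."""
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate

    def tick(self, world, trace):
        trace.append(self.name)
        return Status.SUCCESS if self.predicate(world) else Status.FAILURE


class Action:
    """Leaf that applies an effect; assumed to succeed instantly in this sketch."""
    def __init__(self, name, effect):
        self.name, self.effect = name, effect

    def tick(self, world, trace):
        trace.append(self.name)
        self.effect(world)
        return Status.SUCCESS


class Sequence:
    """Ticks children left to right; stops at the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children

    def tick(self, world, trace):
        for child in self.children:
            status = child.tick(world, trace)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS


class Fallback:
    """Ticks children left to right; stops at the first non-FAILURE child."""
    def __init__(self, children):
        self.children = children

    def tick(self, world, trace):
        for child in self.children:
            status = child.tick(world, trace)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE


tree = Sequence([
    Fallback([Condition("gear_held?", lambda w: w["held"]),
              Action("grasp_gear", lambda w: w.update(held=True))]),
    Fallback([Condition("gear_assembled?", lambda w: w["assembled"]),
              Action("assemble_gear", lambda w: w.update(assembled=True))]),
])

world = {"held": False, "assembled": False}
first_trace, second_trace = [], []
first_status = tree.tick(world, first_trace)    # runs both actions
second_status = tree.tick(world, second_trace)  # conditions already hold
```

Because every tick re-evaluates the conditions against the current world state, an action is skipped once its goal already holds and is scheduled again if a disturbance undoes it, which is the source of the reactivity described above.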
The tight coupling of these three modules forms a closed-loop architecture that integrates high-level cognitive planning with low-level reactive control, enabling robots to learn long-horizon assembly tasks from non-expert demonstration videos and execute them robustly in dynamic environments without extensive expert programming.
Human Demonstration Examples
Behavior Tree Visualizations
Real World Demo
The video is playing at three times the normal speed.