🏭 Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning

Yunsheng Tian¹, Joshua Jacob¹, Yijiang Huang², Jialiang Zhao¹, Edward Gu¹, Pingchuan Ma¹, Annan Zhang¹, Farhad Javid³, Branden Romero¹, Sachin Chitta³, Shinjiro Sueda⁴, Hui Li³, Wojciech Matusik¹

¹MIT CSAIL ²ETH Zurich ³Autodesk Research ⁴Texas A&M University

Fabrica is an autonomous robotic assembly system capable of planning and executing multi-step contact-rich assembly of general objects without human demonstrations.

Abstract

Multi-part assembly poses significant challenges for robots, which must execute long-horizon, contact-rich manipulation while generalizing across complex geometries. We present Fabrica, a dual-arm robotic system capable of end-to-end planning and control for autonomous assembly of general multi-part objects. For planning over long horizons, we develop hierarchies of precedence, sequence, grasp, and motion planning with automated fixture generation, enabling general multi-step assembly on any dual-arm robot. The planner is made efficient through a parallelizable design and is optimized for downstream control stability. For contact-rich assembly steps, we propose a lightweight reinforcement learning framework that trains generalist policies across object geometries, assembly directions, and grasp poses, guided by equivariance and residual actions obtained from the plan. These policies transfer zero-shot to the real world and succeed on 80% of assembly steps. For systematic evaluation, we propose a benchmark suite of multi-part assemblies resembling industrial and daily objects across diverse categories and geometries. By integrating efficient global planning and robust local control, we showcase the first system to achieve complete and generalizable real-world multi-part assembly without domain knowledge or human demonstrations.

Video Introduction

Overview

Our robotic system achieves general and precise multi-part assembly using standard dual-arm robots, seamlessly integrating holistic planning and learning-based control.

Starting from minimal inputs including just part geometries and robot configurations, our multi-stage planner autonomously generates assembly sequences, robust grasps, customized fixtures, and collision-free motion plans.

We train RL policies for precise contact-rich assembly steps with minimal yet effective designs, enabling the system to adapt to variations in object geometries and grasp poses.

Feasible, Efficient, and Optimal Planning

Planning is the core of our system: it efficiently generates feasible, and where possible optimal, long-horizon assembly sequences and grasps.

Part Precedence Planning

We identify all possible assembly orders for a set of parts by determining their precedence relationships.

Parts are grouped into precedence tiers using physics-based motion planning, and the tiers are then converted into a precedence graph that encodes the orderings required for collision-free assembly.

Identifying the precedence relationships is fast and autonomous, reducing computation for the following planning stages.
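
As a minimal sketch of how precedence relationships constrain assembly orders, the Python snippet below starts from an already-known precedence graph and derives both the tiers and every order consistent with it. The part names, the `PREREQS` dictionary, and both function names are hypothetical and chosen only for illustration; the physics-based motion planning that would produce the graph in the real system is not shown.

```python
from typing import Dict, List, Set


def precedence_tiers(prereqs: Dict[str, Set[str]]) -> List[List[str]]:
    """Group parts into tiers: a part joins a tier once all its prerequisites are placed."""
    placed: Set[str] = set()
    tiers: List[List[str]] = []
    while len(placed) < len(prereqs):
        tier = sorted(p for p, deps in prereqs.items()
                      if p not in placed and deps <= placed)
        if not tier:
            raise ValueError("cyclic precedence constraints")
        tiers.append(tier)
        placed.update(tier)
    return tiers


def valid_orders(prereqs: Dict[str, Set[str]]) -> List[List[str]]:
    """Enumerate every assembly order consistent with the precedence graph."""
    orders: List[List[str]] = []

    def extend(prefix: List[str], placed: Set[str]) -> None:
        if len(prefix) == len(prereqs):
            orders.append(list(prefix))
            return
        for part in sorted(prereqs):
            # A part may be added only after everything it depends on.
            if part not in placed and prereqs[part] <= placed:
                prefix.append(part)
                placed.add(part)
                extend(prefix, placed)
                placed.remove(part)
                prefix.pop()

    extend([], set())
    return orders


# Hypothetical assembly: the base goes first and the lid goes last.
PREREQS = {
    "base": set(),
    "gear": {"base"},
    "shaft": {"base"},
    "lid": {"gear", "shaft"},
}

print(precedence_tiers(PREREQS))
print(valid_orders(PREREQS))
```

Even in this toy case the graph prunes the 24 possible permutations of four parts down to two, which illustrates why resolving precedence first reduces work for the later planning stages.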

Dual-Arm Grasp Filtering

Finding valid grasp pairs for assembling and holding parts is crucial to avoid interference between arms and parts during assembly in a constrained space.

Using any grasp planner for single arms, we sample candidate grasps offline and then evaluate collision and reachability in simulation. The evaluations are conducted efficiently in parallel to obtain feasible grasp pairs for each part following any assembly sequence.
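
The filtering step can be sketched roughly as follows. This is an illustration under strong assumptions, not the system's implementation: the `Grasp` class, the `is_feasible` check, and the `reach` and `clearance` thresholds are all hypothetical, and a toy geometric test (distance to each robot base and separation between grippers) stands in for the simulated collision and reachability evaluation. A thread pool is used only to show that each pair can be evaluated independently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Grasp:
    name: str
    position: Point  # gripper position in the world frame, in meters


def is_feasible(pair: Tuple[Grasp, Grasp], hold_base: Point, move_base: Point,
                reach: float = 0.85, clearance: float = 0.10) -> bool:
    """Toy stand-in for a simulated check: both arms reach, grippers stay apart."""
    hold, move = pair
    reachable = (dist(hold.position, hold_base) <= reach
                 and dist(move.position, move_base) <= reach)
    separated = dist(hold.position, move.position) >= clearance
    return reachable and separated


def filter_grasp_pairs(hold_grasps: List[Grasp], move_grasps: List[Grasp],
                       hold_base: Point, move_base: Point
                       ) -> List[Tuple[Grasp, Grasp]]:
    """Evaluate every (holding, moving) grasp pair independently and keep the feasible ones."""
    pairs = list(product(hold_grasps, move_grasps))
    with ThreadPoolExecutor() as pool:
        flags = list(pool.map(lambda p: is_feasible(p, hold_base, move_base), pairs))
    return [p for p, ok in zip(pairs, flags) if ok]


# Hypothetical candidates for one assembly step.
HOLD = [Grasp("h0", (0.5, -0.05, 0.2)), Grasp("h1", (1.2, 0.0, 0.2))]
MOVE = [Grasp("m0", (0.5, 0.10, 0.2)), Grasp("m1", (0.5, -0.02, 0.2))]
feasible = filter_grasp_pairs(HOLD, MOVE, (0.0, -0.4, 0.0), (0.0, 0.4, 0.0))
print([(h.name, m.name) for h, m in feasible])
```

Here `h1` is out of reach and `m1` sits too close to `h0`, so only the pair (`h0`, `m0`) survives.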

Dual-Arm Sequence-Grasp Optimization

Beyond kinematically valid grasps, it is crucial to plan the optimal assembly sequence and grasp choices that maximize the success rate (or stability) when executing the plan with uncertainties in the real world.

We construct a tree where nodes represent partial assembly states and recursively expand this tree to explore valid grasp sequences. We define a grasp stability score that considers key factors such as part supportiveness, grasp switch frequency, torque stability, and contact quality.

The optimal grasp sequences with the best cumulative scores are determined efficiently by dynamic programming.
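
To make the dynamic-programming idea concrete, here is a simplified sketch for a fixed assembly sequence. It assumes each step already has a per-grasp stability score and that changing grasps between consecutive steps costs a constant penalty; the function name, the scores, and the penalty value are hypothetical placeholders, and the real score combines more factors than this.

```python
from typing import Dict, List, Tuple


def best_grasp_sequence(step_scores: List[Dict[str, float]],
                        switch_penalty: float) -> Tuple[float, List[str]]:
    """Pick one grasp per step to maximize total score minus grasp-switch penalties."""
    # best[g] = (best cumulative score of any plan ending in grasp g, that plan)
    best = {g: (s, [g]) for g, s in step_scores[0].items()}
    for scores in step_scores[1:]:
        new_best = {}
        for g, s in scores.items():
            candidates = []
            for prev, (total, plan) in best.items():
                penalty = 0.0 if prev == g else switch_penalty
                candidates.append((total - penalty + s, plan + [g]))
            # Only the best way of reaching g matters for later steps.
            new_best[g] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Hypothetical scores for three assembly steps and candidate grasps A, B, C.
STEP_SCORES = [
    {"A": 0.9, "B": 0.6},
    {"A": 0.5, "B": 0.8},
    {"B": 0.7, "C": 0.9},
]
total, plan = best_grasp_sequence(STEP_SCORES, switch_penalty=0.25)
print(plan, round(total, 2))
```

Greedily taking the top-scoring grasp at every step would give A, B, C and pay two switch penalties; the dynamic program instead keeps grasp B for the last two steps because the cumulative score is higher.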

Simple and General Policy Learning

We develop a simple but highly effective recipe for learning generalist assembly policies that handle variations in part geometries and grasp poses. The recipe is both simplified and strengthened by the outputs of the planning stages.

Learning a general single-step assembly policy.

Equivariance: Humans naturally reuse the same assembly skills across different objects, regardless of their poses or motions. Similarly, we map all assembly motions in the world frame into equivalent top-down insertions in the path-centric task frame, allowing the RL agent to perceive them in a unified way.

Residual action: We find that guidance from the planned open-loop action helps learning by injecting prior knowledge about the coarse assembly direction. Thus, our policy outputs only the corrective action on top of the open-loop action, which warm-starts policy learning and typically leads to faster and better convergence.

Simple observation, action, and reward: Thanks to planning, which gives us the stable grasps, proper coordinate frame, open-loop action, and proximity to assembled pose, the policy can be trained with a simple reward function that only considers the distance to the target pose without extra engineering. In addition, we take only the part pose as observation and the desired delta part pose as action.
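
The three ingredients above can be sketched together in a few lines. This is an illustrative simplification, not the training code: it handles positions only (the policy described here works with full part poses), and the function names, the `scale` of the residual, and the example numbers are assumptions. The rotation uses Rodrigues' formula to map the planned insertion direction to straight down, so a sideways insertion looks like a top-down one in the task frame.

```python
from math import sqrt
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]
DOWN: Vec = (0.0, 0.0, -1.0)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Sequence[float]) -> Vec:
    n = sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def rotate_to_down(v: Sequence[float], insert_dir: Sequence[float]) -> Vec:
    """Apply to v the rotation that maps insert_dir onto DOWN (Rodrigues' formula)."""
    a = unit(insert_dir)
    axis = cross(a, DOWN)
    s, c = sqrt(dot(axis, axis)), dot(a, DOWN)
    if s < 1e-9:
        # Already pointing down: identity. Pointing up: half-turn about x.
        return (v[0], v[1], v[2]) if c > 0 else (v[0], -v[1], -v[2])
    k = unit(axis)
    kxv, kv = cross(k, v), dot(k, v)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kv * (1 - c) for i in range(3))


def to_task_frame(part_pos: Vec, target_pos: Vec, insert_dir: Vec) -> Vec:
    """Equivariance: express the part position relative to the target, insertion axis down."""
    rel = tuple(p - t for p, t in zip(part_pos, target_pos))
    return rotate_to_down(rel, insert_dir)


def residual_action(planned_delta: Vec, policy_output: Vec, scale: float = 0.005) -> Vec:
    """Residual action: planned open-loop step plus a small bounded correction."""
    return tuple(p + scale * max(-1.0, min(1.0, r))
                 for p, r in zip(planned_delta, policy_output))


def distance_reward(part_pos_task: Vec) -> float:
    """Reward is just the negative distance to the assembled pose (task-frame origin)."""
    return -sqrt(dot(part_pos_task, part_pos_task))


# Hypothetical sideways insertion along world +x, with 2 mm of lateral error.
obs = to_task_frame((0.35, 0.002, 0.30), (0.40, 0.0, 0.30), (1.0, 0.0, 0.0))
act = residual_action((0.0, 0.0, -0.01), (0.0, -0.4, 0.0))
print(obs, act, distance_reward(obs))
```

In the task frame the part sits about 5 cm above the target with the same 2 mm lateral offset, so the policy sees the same kind of observation it would see for a vertical insertion, and its output only nudges the planned downward step sideways.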

Easy and Comprehensive Benchmarking

We develop a diverse benchmark suite spanning furniture, toys, and industrial equipment. The assemblies include both top-down and sideways insertions and are feasible for standard dual-arm robots with parallel grippers.

Automated Grasp-Aware Pickup Fixture Generation

To enable precise and efficient robotic pickup, as well as easy benchmarking, we propose an automated fixture design method tailored to each part's planned grasp pose, based on geometry processing and an iterative bin-packing algorithm. This software-hardware co-design approach eliminates extra reorientation or grasp adjustments by aligning each part optimally for direct, top-down pickup.
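
As a rough illustration of the packing side of this idea only, the sketch below lays out rectangular part footprints on a square tray and grows the tray until everything fits. It is an assumption-laden stand-in: the text above does not spell out the packing algorithm, so a simple next-fit shelf heuristic is used here, and the footprint sizes, margins, and function names are hypothetical. The geometry processing that derives each footprint from the planned grasp pose is not shown.

```python
from typing import Dict, Optional, Tuple

Footprint = Tuple[float, float]  # (width, depth) of a part's top-down outline, in mm
Corner = Tuple[float, float]     # lower-left corner of a placed footprint, in mm


def shelf_pack(footprints: Dict[str, Footprint], tray_w: float, tray_d: float,
               margin: float) -> Optional[Dict[str, Corner]]:
    """Next-fit shelf packing; returns placements, or None if the tray is too small."""
    placements: Dict[str, Corner] = {}
    x = y = margin
    shelf_depth = 0.0
    # Deepest parts first, so each shelf is sized by its first part.
    for name, (w, d) in sorted(footprints.items(), key=lambda kv: -kv[1][1]):
        if x + w + margin > tray_w:  # current shelf is full: open a new one
            x, y = margin, y + shelf_depth + margin
            shelf_depth = 0.0
        if x + w + margin > tray_w or y + d + margin > tray_d:
            return None
        placements[name] = (x, y)
        x += w + margin
        shelf_depth = max(shelf_depth, d)
    return placements


def pack_with_growing_tray(footprints: Dict[str, Footprint], start: float = 100.0,
                           step: float = 20.0, margin: float = 5.0,
                           max_size: float = 1000.0
                           ) -> Tuple[float, Dict[str, Corner]]:
    """Iterate over square tray sizes and return the first one where all parts fit."""
    size = start
    while size <= max_size:
        placements = shelf_pack(footprints, size, size, margin)
        if placements is not None:
            return size, placements
        size += step
    raise ValueError("parts do not fit within the maximum tray size")


# Hypothetical footprints for three parts.
FOOTPRINTS = {"base": (80.0, 60.0), "gear": (40.0, 40.0), "shaft": (20.0, 70.0)}
tray_size, layout = pack_with_growing_tray(FOOTPRINTS)
print(tray_size, layout)
```

The margin keeps clearance between neighboring parts, which in a physical fixture would correspond to room for the gripper fingers during top-down pickup.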

Results

Planning Multi-Step Assembly in Simulation

Our system can plan multi-step assembly sequences in simulation, generating robust grasps and fixture designs for each part. The planner efficiently handles complex precedence relationships and generates collision-free motion plans for dual-arm robots.

The planner is designed to be generalizable, capable of handling various object geometries and assembly tasks without requiring task-specific tuning. It can also be parallelized to handle multiple assembly tasks simultaneously, making it suitable for high-throughput industrial applications.

Executing Multi-Step Assembly in Real World

We show that simulated plans transfer successfully to the real world: the planner generates robust grasps, fixture designs, and global motions, while the learned policies handle the precise, local, contact-rich assembly steps.

We demonstrate several key findings from our real-world experiments:

Our system can successfully execute multi-step assembly tasks in the real world without domain knowledge or human demonstrations, achieving the best success rates with minimal human intervention, thanks to the integration of planning and learning.
Generalist policies trained on diverse assemblies can effectively transfer to novel assemblies, achieving comparable performance to specialist policies, thanks to the short horizon and equivariant representations.
The planner's ability to generate proper sequences and robust grasps significantly contributes to the success of the assembly tasks. A good global planner can maximize the local policy's performance in long-horizon and contact-rich tasks!

While there are still many limitations to address to achieve industrial-grade robustness in such a long-horizon and precise task, we hope to provide the first simple but effective baseline and an easily reproducible benchmark towards tackling this challenge. For an in-depth discussion, please refer to Section 7 of our paper.

VLM for Visual Insertion Alignment

To further address insertion misalignments observed during the initial insertion attempt, we integrate a vision-language model (VLM) to provide corrective alignment feedback in the form of discrete actions before the next policy trial.

We show that even with a low-cost camera setup, the VLM effectively discerns alignment cues from coarse visual features. This is particularly noteworthy given that occlusions and visual clutter are prevalent in multi-part assemblies, where small positional errors can accumulate over successive steps. Exciting future work could bridge VLMs' perceptual understanding with robust robotic control, enabling human-level dexterous and precise assembly capabilities.
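
One way to picture "corrective feedback in the form of discrete actions" is the small sketch below, which turns a free-text model reply into one of a fixed set of actions and applies the matching offset before the next attempt. Everything specific here is hypothetical: the action vocabulary, the 2 mm step, the axis conventions, and the function names are illustrative assumptions, and no actual VLM is queried.

```python
from typing import Dict, Tuple

Vec = Tuple[float, float, float]

STEP_M = 0.002  # hypothetical correction step of 2 mm

# Hypothetical action vocabulary; the axis conventions are assumptions.
ACTION_TO_OFFSET: Dict[str, Vec] = {
    "move_left": (0.0, STEP_M, 0.0),
    "move_right": (0.0, -STEP_M, 0.0),
    "move_forward": (STEP_M, 0.0, 0.0),
    "move_backward": (-STEP_M, 0.0, 0.0),
    "no_change": (0.0, 0.0, 0.0),
}


def parse_action(reply: str) -> str:
    """Return the first known action named in the reply, or 'no_change' if none is found."""
    text = reply.lower()
    hits = [(text.find(a), a) for a in ACTION_TO_OFFSET if a in text]
    return min(hits)[1] if hits else "no_change"


def corrected_start(start_xyz: Vec, reply: str) -> Vec:
    """Shift the start position of the next policy trial by the chosen offset."""
    dx, dy, dz = ACTION_TO_OFFSET[parse_action(reply)]
    return (start_xyz[0] + dx, start_xyz[1] + dy, start_xyz[2] + dz)


# Hypothetical reply text standing in for the model's answer.
reply = "The peg sits too far to the right of the hole, so MOVE_LEFT."
print(parse_action(reply), corrected_start((0.4, 0.0, 0.3), reply))
```

Restricting the output to a small vocabulary keeps the interface robust: an unrecognized reply falls back to `no_change` rather than producing an arbitrary motion.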

Acknowledgement

This work is funded by Autodesk, and in part by NSF 1846368 & 2313076. Yijiang Huang is supported by the SNSF Ambizione program. The authors would like to thank Bingjie Tang and Lars Ankile for insightful discussions during the RL environment setup, Xiang Zhang and Yotto Koga for valuable advice on the early physical setup, the members of Ted Adelson's lab at MIT for their support with the physical Panda robot, and the MIT SuperCloud and Lincoln Laboratory for HPC resources.

BibTeX

@misc{tian2025fabricadualarmassemblygeneral,
      title={Fabrica: Dual-Arm Assembly of General Multi-Part Objects via Integrated Planning and Learning}, 
      author={Yunsheng Tian and Joshua Jacob and Yijiang Huang and Jialiang Zhao and Edward Gu and Pingchuan Ma and Annan Zhang and Farhad Javid and Branden Romero and Sachin Chitta and Shinjiro Sueda and Hui Li and Wojciech Matusik},
      year={2025},
      eprint={2506.05168},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2506.05168}, 
}