AI-driven visual monitoring of industrial assembly tasks

Mattia Nardon¹, Stefano Messelodi¹, Antonio Granata²,
Fabio Poiesi¹, Alberto Danese², Davide Boscaini¹

¹Fondazione Bruno Kessler, ²Meccanica del Sarca s.p.a.

We present ViMAT, a novel system for the real-time visual monitoring of industrial assembly tasks. Given prior knowledge of the assembly instructions (top left) and synthetic CAD models of the assembly components (bottom left), ViMAT integrates an AI-driven perception module, which extracts visual observations from real-world video streams (top right), with a probabilistic reasoning module that predicts the assembly state from these observations (bottom center-right).

Abstract

Visual monitoring of industrial assembly tasks is critical for ensuring worker safety and preventing equipment damage caused by procedural errors. Although commercial solutions exist, they typically require rigid workspace setups or the application of visual markers to simplify the problem. We introduce ViMAT, a novel AI-driven system for real-time visual monitoring of assembly tasks that operates without these constraints. ViMAT combines a perception module that extracts visual observations from multi-view video streams with a reasoning module that infers the most likely action being performed based on the observed assembly state and prior task knowledge. We validate ViMAT on two assembly tasks, involving the replacement of LEGO components and the reconfiguration of hydraulic press molds, demonstrating its effectiveness through quantitative and qualitative analysis in challenging real-world scenarios characterized by partial and uncertain visual observations.

ViMAT's overview

Overview of ViMAT. Multi-view video frames are processed by the perception module (pink) to detect assembly components using a detector trained on a synthetic dataset generated by the digital twin module (green). These detections, along with prior task knowledge (assembly instructions), are passed to the probabilistic reasoning module (azure) to estimate the action being performed.
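To illustrate how detections and prior task knowledge can be combined probabilistically, here is a minimal sketch of a discrete Bayes filter over assembly states. It is not ViMAT's actual reasoning module, whose model is not specified on this page. The linear task structure, the function names, and the parameters `p_advance`, `p_hit` and `p_false` are all assumptions made for illustration.

```python
import numpy as np

def linear_task_transition(n_states, p_advance=0.2):
    """Transition matrix T[i, j] = P(next = j | current = i) for a task
    whose steps are executed strictly in order (an assumed task structure)."""
    T = np.zeros((n_states, n_states))
    for s in range(n_states):
        if s + 1 < n_states:
            T[s, s] = 1.0 - p_advance      # operator is still on this step
            T[s, s + 1] = p_advance        # operator completed the step
        else:
            T[s, s] = 1.0                  # final state is absorbing
    return T

def detection_likelihood(expected, detected, p_hit=0.8, p_false=0.05):
    """Likelihood of a binary detection vector under each assembly state.

    expected: (n_states, n_components) booleans, True where a component
              should be mounted in that state (from the assembly instructions).
    detected: (n_components,) booleans from the perception module.
    p_hit < 1 models missed detections (e.g. occlusion); p_false models
    spurious detections. Both values are illustrative assumptions.
    """
    expected = np.asarray(expected, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    p_det = np.where(expected, p_hit, p_false)
    per_component = np.where(detected, p_det, 1.0 - p_det)
    return per_component.prod(axis=1)

def forward_step(belief, transition, likelihood):
    """One filter update: predict with the task model, then reweight
    by the observation likelihood and renormalize."""
    predicted = transition.T @ belief
    posterior = predicted * likelihood
    total = posterior.sum()
    if total == 0.0:
        return predicted / predicted.sum()  # observation carries no information
    return posterior / total

# Three states for a two-component task: nothing mounted, first, both.
expected = [[0, 0], [1, 0], [1, 1]]
belief = np.array([1.0, 0.0, 0.0])
T = linear_task_transition(3)
belief = forward_step(belief, T, detection_likelihood(expected, [1, 0]))
print(belief)
```

Because `p_hit` is below 1, a component that is temporarily hidden lowers the probability of the states that expect it without ruling them out, which is one simple way to cope with partial and uncertain observations.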

LEGO Scenario

In this scenario, the assembly task consists of modifying a LEGO structure by adding or removing elements according to a predefined goal configuration. Components are picked from an input tray and placed onto the structure, while removed pieces are relocated to an output tray. A human operator performs the task, and the system continuously observes the scene through multiple calibrated RGBD sensors, identifying and tracking the components to estimate the most probable current assembly configuration.
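Calibrated RGBD sensors make it possible to express detections from different views in one shared coordinate frame. The sketch below shows the standard pinhole back-projection that such a setup typically relies on. It is an illustrative assumption, not a description of ViMAT's pipeline: the function names, the camera-to-world convention for `R` and `t`, and the simple averaging in `fuse_views` are hypothetical.

```python
import numpy as np

def backproject(u, v, depth, K, R, t):
    """Lift pixel (u, v) with metric depth into the shared world frame.

    K: 3x3 camera intrinsics; R (3x3) and t (3,) map camera coordinates
    to world coordinates (assumed convention).
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray with unit z in camera frame
    point_cam = depth * ray                         # 3D point in camera frame
    return R @ point_cam + t                        # 3D point in world frame

def fuse_views(points):
    """Average the world-frame estimates of one component seen from several views."""
    return np.mean(np.stack(points), axis=0)

# Example with made-up intrinsics: the pixel at the principal point,
# 2 m away, seen by a camera shifted 1 m along the world x axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
p = backproject(320, 240, 2.0, K, np.eye(3), np.array([1.0, 0.0, 0.0]))
print(p)
```

Once every detection is expressed in the same frame, a component that is occluded in one view can still be located from another.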

Visualization of the digital LEGO components and the corresponding input/output trays used in the LEGO scenario.

Demonstration of ViMAT in action on the LEGO assembly scenario. The video showcases real-time visual monitoring and reasoning: RGBD camera views with object detections (top and bottom-left), and a live bar chart (bottom-right) representing the system's estimated probabilities for each assembly state.

BibTeX

@inproceedings{nardon2025vimat,
  title={AI-driven visual monitoring of industrial assembly tasks},
  author={Nardon, Mattia and Messelodi, Stefano and Granata, Antonio and Poiesi, Fabio and Danese, Alberto and Boscaini, Davide},
  booktitle={Proceedings of the International Conference on Image Analysis and Processing (ICIAP)},
  year={2025}
}

Acknowledgments

This work has been partially funded by the Provincia Autonoma di Trento (Italy) under L.P. 6/99, as part of the NEXTMAG project.