Category: Seminars and Conferences
State: Current
May 21st, 2.30 pm

Advancing Instance-Level Perception: End-to-End Sequence Modeling for Tracking and Efficient

Online Seminar

Online seminar "Advancing Instance-Level Perception: End-to-End Sequence Modeling for Tracking and Efficient", held by Dr. Mattia Segù.

When: May 21st, 2.30 pm
Where: online at https://tinyurl.com/yc296vdp

Speaker: Mattia Segù - ETH Zurich
Organiser: Prof. Tatiana Tommasi - DAUIN

The event is part of the "Ellis Turin Talk" online seminar series, organized in collaboration with the Artificial Intelligence Hub of the Politecnico di Torino and the Vandal Research Group of the Department of Control and Computer Engineering.

Abstract: Instance-level perception (the ability to localize, segment, and classify individual objects over time) is fundamental to systems that interact with the physical world. Recent advances in model architectures and data quality have enabled unified models capable of detecting and segmenting objects across both closed-set categories and free-form referring expressions. However, existing approaches struggle to scale to end-to-end instance tracking and to adapt efficiently to edge deployment, posing key challenges for real-world applications. In this talk, I will present two recent works, SambaMOTR and MOBIUS, that push the boundaries of instance-level perception by addressing multi-object tracking and efficient segmentation. SambaMOTR enables end-to-end multi-object tracking by leveraging Samba, a set-of-sequences model that captures long-range dependencies, tracklet interactions, and temporal occlusions, improving robustness in dynamic environments with complex motion. MOBIUS makes vision-language instance segmentation scalable through a bottleneck encoder for efficient scale and modality fusion, and a language-guided calibration loss for adaptive decoder pruning, reducing inference time by up to 75% while maintaining state-of-the-art performance across both high-end and mobile devices. Through the lens of these two works, I will explore how sequence modeling and efficient multi-modal perception can be leveraged to develop scalable, real-time object perception models, enabling robust tracking and segmentation in complex environments.
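As a rough illustration of the kind of sequence modeling the abstract refers to, the sketch below keeps one recurrent temporal state per tracklet and adds a simple interaction step across tracklets at every frame. It is a minimal, hypothetical PyTorch example: the class name TrackletSequenceModel, the GRU-cell update, and the mean-pooling interaction are assumptions made purely for illustration and do not reflect the actual Samba or SambaMOTR architecture presented in the talk.

# Hypothetical sketch: per-tracklet sequence modeling with a simple
# cross-tracklet interaction step. Not the SambaMOTR/Samba architecture;
# it only illustrates maintaining one temporal state per tracked object
# and letting tracklets exchange information at every frame.
import torch
import torch.nn as nn


class TrackletSequenceModel(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        # Recurrent update of each tracklet's temporal state (stand-in
        # for a long-range sequence model).
        self.cell = nn.GRUCell(feat_dim, hidden_dim)
        # Simple interaction: mix each state with the mean of all states
        # (stand-in for modeling tracklet-to-tracklet interactions).
        self.interact = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, detections: torch.Tensor, states: torch.Tensor) -> torch.Tensor:
        """detections: (num_tracklets, feat_dim) per-frame features.
        states: (num_tracklets, hidden_dim) states from the previous frame.
        Returns the updated per-tracklet states."""
        states = self.cell(detections, states)        # temporal update
        context = states.mean(dim=0, keepdim=True)    # shared context
        context = context.expand_as(states)
        states = self.interact(torch.cat([states, context], dim=-1))
        return torch.tanh(states)


if __name__ == "__main__":
    model = TrackletSequenceModel()
    num_tracklets = 5
    states = torch.zeros(num_tracklets, 256)
    for _ in range(10):                                  # 10 video frames
        frame_features = torch.randn(num_tracklets, 256) # detector output
        states = model(frame_features, states)
    print(states.shape)  # torch.Size([5, 256])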