MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation

Linyan Yang¹,³,², Mark Weber¹,³, Tobias Fischer²,
Dengxin Dai², Laura Leal-Taixé⁴, Marc Pollefeys⁵, Daniel Cremers¹,³ and Luc Van Gool²
¹TU Munich, ²ETH Zurich, ³Munich Center for Machine Learning, ⁴NVIDIA, ⁵Microsoft
ECCV 2024
Qualitative Results

Previous SotA UDA methods such as MIC [1] struggle to segment fine structures (top row) and tend to oversegment objects with ambiguous appearance (bottom row). We therefore propose MICDrop, which improves semantic segmentation UDA with depth estimates, as these capture fine structures and are consistent within object boundaries.

Abstract

Unsupervised Domain Adaptation (UDA) is the task of bridging the domain gap between a labeled source domain, e.g., synthetic data, and an unlabeled target domain. We observe that current UDA methods show inferior results on fine structures and tend to oversegment objects with ambiguous appearance. To address these shortcomings, we propose to leverage geometric information, i.e., depth predictions, as depth discontinuities often coincide with segmentation boundaries. We show that naively incorporating depth into current UDA methods does not fully exploit the potential of this complementary information. To this end, we present MICDrop, which learns a joint feature representation by masking image encoder features while inversely masking depth encoder features. With this simple yet effective complementary masking strategy, we enforce the use of both modalities when learning the joint feature representation. To aid this process, we propose a feature fusion module to improve both global as well as local information sharing while being robust to errors in the depth predictions. We show that our method can be plugged into various recent UDA methods and consistently improve results across standard UDA benchmarks, obtaining new state-of-the-art performances.

Results

Method

Our framework is built on a novel cross-modality complementary dropout technique together with a tailored masking schedule. We foster cross-modal feature learning by corrupting RGB and depth features in a complementary manner, forcing each modality to fill in the information masked in the other. To integrate information from both modalities effectively, we additionally propose a cross-modality feature fusion module.
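
As an illustration, the sketch below shows one way such a complementary feature dropout could look in PyTorch. The function name, patch size, and masking ratio are assumptions chosen for clarity, not the released implementation.

    import torch
    import torch.nn.functional as F

    def complementary_dropout(img_feat, depth_feat, mask_ratio=0.5, patch=16):
        """Illustrative sketch: drop non-overlapping regions of image and depth features."""
        B, _, H, W = img_feat.shape
        # Sample a patch-level binary mask and upsample it to the feature resolution.
        mh, mw = max(H // patch, 1), max(W // patch, 1)
        keep = (torch.rand(B, 1, mh, mw, device=img_feat.device) > mask_ratio).float()
        keep = F.interpolate(keep, size=(H, W), mode="nearest")
        # Complementary masking: regions hidden in the image stream stay visible in the
        # depth stream and vice versa, so both modalities are needed to recover the scene.
        return img_feat * keep, depth_feat * (1.0 - keep)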

Architecture

Figure: architecture overview with image and depth encoders, cross-modality complementary dropout applied at each feature resolution, fusion blocks, and the decoder.

In our training pipeline, source and target images are fed through the student encoders. Our proposed cross-modality complementary dropout is then applied to the corresponding features at each feature resolution. Finally, the masked features are passed through our fusion block, followed by the decoder, to make the final prediction.
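
A minimal sketch of this forward pass is given below. The encoder, fusion, and decoder modules as well as the masking function are placeholders passed in from outside; this is a conceptual wiring of the pipeline described above, not the actual implementation.

    import torch.nn as nn

    class MICDropForward(nn.Module):
        """Hypothetical wiring of the pipeline: encoders, per-level masking, fusion, decoder."""

        def __init__(self, img_encoder, depth_encoder, fusion_blocks, decoder, masking_fn):
            super().__init__()
            self.img_encoder = img_encoder        # returns a list of multi-scale features
            self.depth_encoder = depth_encoder    # features at the same resolutions
            self.fusion_blocks = nn.ModuleList(fusion_blocks)
            self.decoder = decoder
            self.masking_fn = masking_fn          # e.g. the complementary_dropout sketch above

        def forward(self, image, depth):
            img_feats = self.img_encoder(image)
            depth_feats = self.depth_encoder(depth)
            fused = []
            for img_f, dep_f, fuse in zip(img_feats, depth_feats, self.fusion_blocks):
                # Complementary dropout at every feature resolution, then cross-modality fusion.
                img_f, dep_f = self.masking_fn(img_f, dep_f)
                fused.append(fuse(img_f, dep_f))
            # The decoder consumes the fused multi-scale features to predict segmentation logits.
            return self.decoder(fused)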


Training Scheme

Figure: student-teacher training scheme with strongly and weakly augmented views and cross-modal feature masking.

For training, we adopt a student-teacher framework. Following standard practice, we present the student with a heavily augmented view and the teacher with a weakly augmented view of the image. To improve cross-modal information exchange, we introduce a cross-modal masking strategy that masks the learned representations of the different modalities at the feature level. Conceptually, this prevents masked features from being recovered from within the feature pyramid of the same modality. Our method is therefore designed to foster the transfer of complementary information and to promote the learning of potentially redundant information, which in turn increases robustness and reduces sensitivity to domain-specific appearance changes.
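
For concreteness, the sketch below shows two standard ingredients of such a student-teacher scheme: an exponential-moving-average (EMA) update of the teacher and pseudo-label generation from the weakly augmented target view. The function names and the momentum value are assumptions for illustration, not details taken from the paper.

    import torch

    @torch.no_grad()
    def ema_update(teacher, student, momentum=0.999):
        # Teacher weights track an exponential moving average of the student weights,
        # as is common in self-training UDA pipelines; the momentum value is an assumption.
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p.detach(), alpha=1.0 - momentum)

    @torch.no_grad()
    def target_pseudo_labels(teacher, weak_image, weak_depth):
        # The weakly augmented target view is fed to the teacher; its argmax predictions
        # serve as pseudo-labels for the student's strongly augmented, feature-masked view.
        logits = teacher(weak_image, weak_depth)
        return logits.argmax(dim=1)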

BibTeX

@article{yang2024micdrop,
  title={{MICDrop}: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation},
  author={Yang, Linyan and Hoyer, Lukas and Weber, Mark and Fischer, Tobias and Dai, Dengxin and Leal-Taix{\'e}, Laura and Pollefeys, Marc and Cremers, Daniel and Van Gool, Luc},
  journal={arXiv preprint arXiv:2408.16478},
  year={2024}
}