ROSITA

ROSITA Framework

**Overview of the ROSITA framework:** Test samples containing weak and strong OOD data arrive one at a time. Each image feature is matched against the text-based classifier, and the resulting confidence score is used to distinguish weak from strong OOD samples through a simple LDA-based OOD classifier. If a sample is identified as reliable, the corresponding feature bank is updated according to this classification, and the proposed test-time objective is optimized to update the LayerNorm parameters of the vision encoder.
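The scoring and OOD-classification steps in the overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `max_softmax_confidence` and `lda_threshold` are hypothetical, the logit scale of 100 is assumed from CLIP's usual setting, and the "LDA-based classifier" is read here as a one-dimensional threshold search that maximizes Fisher's criterion over a bank of past confidence scores.

```python
import math


def max_softmax_confidence(image_feat, text_feats, scale=100.0):
    """Return (predicted class index, confidence) for one image feature.

    Logits are scaled cosine similarities between the image feature and
    each class text embedding; `scale` is an assumed CLIP-style logit scale.
    Inputs are assumed to be non-zero vectors.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = normalize(image_feat)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t)))
              for t in text_feats]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred]


def lda_threshold(scores):
    """Pick the 1-D split of `scores` that maximizes Fisher's criterion.

    Returns the midpoint threshold between the two resulting groups, or
    None if no split exists (fewer than two distinct scores).
    """
    s = sorted(scores)
    best_t, best_j = None, -1.0
    for i in range(1, len(s)):
        if s[i - 1] == s[i]:
            continue                     # cannot split between equal scores
        lo, hi = s[:i], s[i:]
        mu_lo = sum(lo) / len(lo)
        mu_hi = sum(hi) / len(hi)
        within = (sum((x - mu_lo) ** 2 for x in lo)
                  + sum((x - mu_hi) ** 2 for x in hi))
        between = (mu_hi - mu_lo) ** 2
        j = between / (within + 1e-12)   # Fisher ratio: between / within scatter
        if j > best_j:
            best_j, best_t = j, 0.5 * (s[i - 1] + s[i])
    return best_t
```

Under this reading, a sample whose confidence exceeds the threshold would be treated as weak OOD (a known class under distribution shift), and one below it as strong OOD (an unseen class). How reliability is decided on top of that split is not specified in this overview.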

Abstract

We propose a novel framework to address the challenging real-world task of Single Image Test Time Adaptation in an open and dynamic environment. We leverage large-scale Vision Language Models like CLIP to enable real-time adaptation on a per-image basis without access to source data or ground-truth labels. Since the deployed model can also encounter unseen classes in an open world, we first employ a simple and effective Out-of-Distribution (OOD) detection module to distinguish between weak and strong OOD samples. We propose a novel contrastive-learning-based objective to enhance the discriminability between weak and strong OOD samples by utilizing small, dynamically updated feature banks. Finally, we also employ a classification objective for adapting the model using the reliable weak OOD samples. The proposed framework, ROSITA, combines these components, enabling continuous online adaptation of Vision Language Models on a single-image basis. Extensive experimentation on diverse domain adaptation benchmarks validates the effectiveness of the proposed framework.
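A small, dynamically updated feature bank of the kind mentioned above could be sketched as a fixed-capacity first-in-first-out queue with nearest-neighbour lookup. This is only an illustration under assumptions: the class name `FeatureBank`, the capacity, the FIFO eviction policy, and the cosine-similarity retrieval are all hypothetical choices and are not taken from the abstract, which does not describe how the banks are maintained or queried.

```python
import math
from collections import deque


def _normalize(v):
    """L2-normalize a non-zero vector given as a list of floats."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


class FeatureBank:
    """Fixed-capacity FIFO store of L2-normalized features (hypothetical)."""

    def __init__(self, capacity=64):
        # deque with maxlen drops the oldest entry once capacity is reached
        self.items = deque(maxlen=capacity)

    def add(self, feat):
        self.items.append(_normalize(feat))

    def nearest(self, feat, k=5):
        """Return up to k stored features most cosine-similar to `feat`."""
        q = _normalize(feat)
        ranked = sorted(
            self.items,
            key=lambda f: sum(a * b for a, b in zip(q, f)),
            reverse=True,
        )
        return ranked[:k]

    def __len__(self):
        return len(self.items)
```

In such a setup one bank would hold reliable weak OOD features and another reliable strong OOD features, so that neighbours from the matching bank could serve as positives and neighbours from the other bank as negatives in a contrastive loss. That pairing is an assumption about how the banks might be used, not a description of the proposed objective.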

BibTeX


@misc{sreenivas2024effectiveness,
      title={Effectiveness of Vision Language Models for Open-world Single Image Test Time Adaptation},
      author={Manogna Sreenivas and Soma Biswas},
      year={2024},
      eprint={2406.00481},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}