SegAssist

Abstract

In dynamic environments, deployed models often encounter unfamiliar objects and distribution shifts that challenge their generalization abilities. This work addresses Incremental Test Time Adaptation (ITTA) of Vision-Language Models (VLMs), tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional TTA approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single-image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation-assisted active labeling module, termed SegAssist, which is training-free and repurposes the VLM's segmentation capabilities to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real-world scenarios, where continuous adaptation to emerging data is essential.

Incremental Test Time Adaptation

Consider a model deployed in the real world, navigating a dynamic environment where it encounters familiar objects in unfamiliar settings and new objects it has never seen before. Traditional TTA methods handle covariate shifts but cannot accommodate dynamically emerging new classes. We introduce ITTA, a protocol that requires the model to simultaneously:

Continuously adapt to covariate shifts
Recognize and incorporate new classes as they appear
Operate on a single-image basis without access to batches or labeled data
Maintain performance across seen classes and a growing set of dynamically incorporated unseen classes

**Figure 1. Incremental Test Time Adaptation.** A test stream in the real world can consist of samples with covariate shift (painting style) and label shift (unseen classes such as *Egret* and *Tiger*). Uncertain samples are queried for active labeling; new classes are added to the seen-class set when confirmed by an oracle. SegAssist is a plug-in module that favours selection of unseen class samples for active labeling.
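The protocol in Figure 1 can be read as a single-pass loop over the test stream. The sketch below is illustrative only: `itta_stream`, the `is_uncertain` callback, and the toy stream are hypothetical names, and it assumes the oracle returns the true class name of every queried sample. It omits the actual VLM prediction and adaptation steps.

```python
from typing import Callable, Iterable, List, Set, Tuple


def itta_stream(
    stream: Iterable[Tuple[str, str]],
    seen_classes: Set[str],
    is_uncertain: Callable[[str, Set[str]], bool],
    budget: int,
) -> Tuple[Set[str], List[str]]:
    """Process samples one at a time, querying an oracle for uncertain ones.

    `stream` yields (sample_id, true_label) pairs; the true label is only
    read when the sample is sent to the oracle (assumption of this sketch).
    """
    classes = set(seen_classes)  # growing classification space
    queried: List[str] = []
    for sample_id, true_label in stream:
        if budget > 0 and is_uncertain(sample_id, classes):
            budget -= 1
            queried.append(sample_id)
            # The oracle reveals the label; a new class is added if unseen.
            classes.add(true_label)
    return classes, queried


# Toy stream: "b" and "d" are flagged as uncertain by a stand-in criterion.
toy_stream = [("a", "dog"), ("b", "egret"), ("c", "dog"), ("d", "tiger")]
final_classes, queried_ids = itta_stream(
    toy_stream,
    seen_classes={"dog"},
    is_uncertain=lambda sid, cls: sid in {"b", "d"},
    budget=2,
)
```

With a budget of 2, both flagged samples are queried and the class set grows from one class to three.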

The table below situates ITTA among closely related research directions. ITTA is the only setting that handles all of: training-free adaptation, single-image testing, covariate shift, label shift, and a growing classification space \(\mathcal{C}_s + \mathcal{C}_u\).

Training-free	TTA	Covariate shift	Label shift	Single image	Task space	Methods
✗	✓	✓	✗	✗	\(\mathcal{C}_s\)	TENT, CoTTA, ROID
✓	✓	✓	✗	✓	\(\mathcal{C}_s\)	TPT, TDA, DPE
✗	✓	✓	✓	✗	\(\mathcal{C}_s+1\)	Open-world TTA
✓	✗	✗	✓	✗	\(\mathcal{C}_s+\mathcal{C}_u\)	Class Incremental Learning
✓	✓	✓	✓	✓	\(\mathcal{C}_s+\mathcal{C}_u\)	Proposed ITTA (Ours)

Table 1. Comparison of the proposed ITTA framework with existing research directions.

SegAssist: Segmentation-Assisted Active Labeling

**Figure 2. SegAssist pipeline.** Uncertain samples are further filtered by performing pixel-wise segmentation using the top-K predicted classes plus a "background" class. A sample is forwarded to the oracle for labeling only if the majority of its pixels are classified as "background", prioritizing unseen-class samples.

Step 1: Pixel-wise segmentation of uncertain samples

For an uncertain sample \(x_t\), let \(\text{top}K(x_t)\) denote the top-\(K\) class names predicted by CLIP. We define \(\mathcal{C}_\text{seg}(x_t) = \text{top}K(x_t) \cup \{\text{"background"}\}\). Using patch features \(f_t(i,j)\) from the last self-attention layer, each patch is assigned a class label:

\begin{align} \hat{y}_t(i,j) = \arg\max_{c \,\in\, \mathcal{C}_\text{seg}(x_t)} \langle f_t(i,j),\, t_c \rangle \end{align}

where \(t_c\) is the text embedding of class \(c\). Patch predictions are upsampled via bilinear interpolation to produce the full segmentation map \(S(x_t)\).
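As a minimal sketch of the patch-level assignment above, assuming the patch features and text embeddings are already available as NumPy arrays (the function name `segment_patches` and the toy values are hypothetical), the argmax over inner products can be written as follows. The bilinear upsampling step is omitted, so the result stays at patch-grid resolution.

```python
import numpy as np


def segment_patches(patch_feats: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """Assign each patch the class whose text embedding has the largest inner product.

    patch_feats: (H, W, D) patch features f_t(i, j).
    text_embeds: (C, D) text embeddings t_c for the classes in C_seg(x_t).
    Returns an (H, W) array of class indices into C_seg(x_t).
    """
    scores = patch_feats @ text_embeds.T  # (H, W, C) inner products
    return scores.argmax(axis=-1)


# Toy example: two top-K classes plus "background" as the last index.
class_names = ["dog", "cat", "background"]
text_embeds = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
patch_feats = np.array(
    [[[1.0, 0.0], [0.0, 1.0]],
     [[-1.0, -1.0], [-1.0, -1.0]]]
)
patch_labels = segment_patches(patch_feats, text_embeds)
```

In this toy grid the top row is assigned to "dog" and "cat", and the bottom row to "background".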

Step 2: Background ratio for active sample selection

We compute the proportion of pixels classified as "background":

\begin{align} B(x_t) = \frac{1}{|S(x_t)|} \sum_{i=1}^{|S(x_t)|} \mathbb{I}\bigl[S_i(x_t) = \text{"background"}\bigr] \end{align}

A sample \(x_t\) is selected for oracle querying only if \(B(x_t) > \alpha\), with \(\alpha = 0.95\). This ensures the limited active-labeling budget is spent primarily on genuinely novel, unseen-class samples. SegAssist requires no additional training and adds negligible computational overhead, as patch features are already computed during CLIP inference.
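The selection rule can be sketched in a few lines, assuming the segmentation map is an integer array of class indices and that the index of "background" is known (`background_ratio`, `select_for_oracle`, and `background_idx` are hypothetical names used only for illustration):

```python
import numpy as np


def background_ratio(seg_map: np.ndarray, background_idx: int) -> float:
    """B(x_t): fraction of pixels labeled as background."""
    return float(np.mean(seg_map == background_idx))


def select_for_oracle(seg_map: np.ndarray, background_idx: int, alpha: float = 0.95) -> bool:
    """Forward the sample to the oracle only if B(x_t) > alpha."""
    return background_ratio(seg_map, background_idx) > alpha


# Toy 10x10 maps with class indices {0, 1} and background index 2.
mostly_background = np.full((10, 10), 2)
mostly_background[0, :4] = 0          # 96 of 100 pixels remain background
mostly_known_class = np.full((10, 10), 2)
mostly_known_class[:, :5] = 1         # only 50 of 100 pixels are background

query_first = select_for_oracle(mostly_background, background_idx=2)
query_second = select_for_oracle(mostly_known_class, background_idx=2)
```

The first map exceeds the threshold and would be queried; the second is explained well by a known class and would be skipped.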

Incremental Class Detection Delay (ICDD)

We propose a novel metric, Incremental Class Detection Delay (ICDD), to quantify how promptly an ITTA system discovers new classes. It compares the ground-truth class introduction curve \(n_\text{gt}(t)\) with the detected-class curve \(n_\text{det}(t)\) over normalized time \(t \in [0,1]\):

\begin{align} \text{ICDD} = \text{AUC}(n_\text{gt}(t)) - \text{AUC}(n_\text{det}(t)) \end{align}

A lower ICDD (closer to zero) means the system detects new classes almost as soon as they appear. A higher ICDD indicates significant detection delay and consequent performance degradation. ICDD and Harmonic Mean (HM) accuracy are complementary: ICDD captures detection timeliness while HM measures classification correctness after detection.
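A small numerical sketch may help make the definition concrete. It assumes both curves are given as cumulative class counts sampled once per test sample, and approximates each AUC over \(t \in [0,1]\) by the mean of the sampled values (a step-function integral). The helper names and the toy curves are hypothetical, and the scaling used for the reported numbers is not specified here, so this sketch does not reproduce their scale.

```python
from typing import Sequence


def step_auc(counts: Sequence[float]) -> float:
    """Area under a step curve sampled at equally spaced points on [0, 1]."""
    return sum(counts) / len(counts)


def icdd(n_gt: Sequence[float], n_det: Sequence[float]) -> float:
    """ICDD = AUC(n_gt) - AUC(n_det) for curves of equal length."""
    if len(n_gt) != len(n_det):
        raise ValueError("curves must be sampled at the same time steps")
    return step_auc(n_gt) - step_auc(n_det)


# Toy curves: three new classes appear; two are detected one step late.
n_gt = [0, 1, 1, 2, 2, 3]
n_det = [0, 0, 1, 1, 2, 3]
delay = icdd(n_gt, n_det)          # (9 - 7) / 6
no_delay = icdd(n_gt, n_gt)        # identical curves give zero
```

A detector that tracks the ground-truth curve exactly has an ICDD of zero, and every step of lag adds to the gap between the two areas.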

**Figure 3. ICDD metric.** Given a budget of 3 samples, MSP-based selection chooses 2 seen-class and 1 unseen-class sample, exhausting the budget before later new classes arrive. SegAssist prioritizes unseen samples, detecting all three new classes as they are introduced. The bottom plot shows the detection curve \(n_\text{det}(t)\) vs. ground truth \(n_\text{gt}(t)\); the shaded area is the ICDD.

Experimental Results

We evaluate SegAssist by plugging it into three VLM-based TTA baselines: ZS-Eval (zero-shot CLIP), TDA (CVPR'24), and DPE (NeurIPS'24), combined with three active labeling strategies: Random, Entropy, and MSP. Experiments are run on ImageNet-R, ImageNet-A (ImageNet-Adversarial), Clipart, and Painting (DomainNet), using CLIP ViT-B/16.
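For readers unfamiliar with the Entropy and MSP strategies, the sketch below shows the standard definitions of the two uncertainty scores computed from class logits: the Shannon entropy of the softmax distribution, and one minus the maximum softmax probability. How the scores are thresholded or ranked to pick samples is not specified here, and the function names are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def msp_uncertainty(logits: np.ndarray) -> float:
    """1 - maximum softmax probability; higher means less confident."""
    return 1.0 - float(softmax(logits).max())


def entropy_uncertainty(logits: np.ndarray) -> float:
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = softmax(logits)
    return float(-(probs * np.log(probs + 1e-12)).sum())


confident = np.array([10.0, 0.0, 0.0, 0.0])
ambiguous = np.array([0.0, 0.0, 0.0, 0.0])

msp_scores = (msp_uncertainty(confident), msp_uncertainty(ambiguous))
entropy_scores = (entropy_uncertainty(confident), entropy_uncertainty(ambiguous))
```

Both scores rank the uniform prediction as more uncertain than the peaked one; neither distinguishes a hard seen-class sample from an unseen-class sample, which is the gap the background-ratio filter is meant to address.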

Base TTA	Method	ImageNet-R HM ↑	ImageNet-R ICDD ↓	ImageNet-Adversarial HM ↑	ImageNet-Adversarial ICDD ↓	Clipart HM ↑	Clipart ICDD ↓	Painting HM ↑	Painting ICDD ↓
ZS-Eval	Random	49.57	25.90	23.60	45.33	22.04	75.30	32.54	68.41
	Entropy	59.89	21.99	21.11	43.83	48.25	61.88	64.34	43.89
	MSP	62.92	17.21	24.59	41.47	47.65	62.04	70.59	39.14
	SegAssist	64.05	16.01	23.94	41.89	49.08	61.20	71.21	38.56
TDA	Random	53.33	25.90	16.58	45.33	22.54	75.30	32.66	68.41
	Entropy	64.06	19.52	18.78	44.95	58.79	49.10	67.27	46.08
	MSP	70.38	13.25	19.27	43.93	58.99	50.68	68.83	39.31
	SegAssist	70.48	13.07	20.02	43.93	60.25	48.96	69.20	41.27
DPE	Random	54.90	25.90	28.52	45.33	23.36	75.30	33.49	68.41
	Entropy	0.19	56.05	4.44	57.88	15.50	80.27	0.00	81.81
	MSP	68.09	19.64	24.62	47.18	65.35	49.00	71.45	34.70
	SegAssist	69.39	16.38	33.42	44.98	65.49	49.00	72.40	33.12

Comparison of Harmonic Mean (HM ↑) and Incremental Class Detection Delay (ICDD ↓) across active labeling and TTA methods on the ITTA benchmark, evaluated on ImageNet-R, ImageNet-Adversarial, Clipart, and Painting. SegAssist matches or improves on the MSP baseline in most settings; the exceptions are ZS-Eval on ImageNet-Adversarial (both metrics) and the ICDD of TDA on Painting.

BibTeX

@InProceedings{sreenivas2025segassist,
  author    = {Sreenivas, Manogna and Biswas, Soma},
  title     = {Segmentation Assisted Incremental Test Time Adaptation in an Open World},
  booktitle = {British Machine Vision Conference},
  year      = {2025}
}