SANTA

Abstract

Adapting a trained model to perform satisfactorily on continually changing test environments is an important and challenging task. In this work, we propose a novel framework, SANTA, which aims to satisfy the following requirements for online adaptation: (1) it should work effectively for different (even small) batch sizes; (2) it should continue to work well on the source domain; (3) it should have minimal tunable hyperparameters and storage requirements. Given a pre-trained network trained on source domain data, the proposed framework modifies the affine parameters of the batch normalization layers using source-anchoring-based self-distillation. This ensures that the model incorporates knowledge from the newly encountered domains without catastrophically forgetting the previously seen domains. We also propose a source-prototype-driven contrastive alignment to ensure natural grouping of the target samples while maintaining the already learnt semantic information. Extensive evaluation on three benchmark datasets under challenging settings demonstrates the effectiveness of SANTA for real-world applications.

Problem Setting

In Continual Test-Time Adaptation (CTTA), a model trained on source domain \( \mathcal{D}_s = \{(x_i, y_i)\}_{i=1}^{n_s} \sim P_s \) must adapt to a stream of test batches drawn from a dynamically changing distribution \( P_t^{(1)} \neq P_t^{(2)} \neq \ldots \neq P_s \), without accessing any source data. This is a challenging fully-online setting because:

Small batch sizes: Practical online deployment uses small batches (e.g., 10–25), yet most prior CTTA methods are validated only at batch size 200.
Catastrophic forgetting: Continuous adaptation to new domains may degrade performance on previously seen domains, including the source.
Hyperparameter sensitivity: Without a validation set at test time, methods requiring extensive hyperparameter tuning are impractical.

Existing approaches like CoTTA store 3N parameters (source, teacher, student models) and require hyperparameters for stochastic restoration. SANTA addresses all three challenges with a framework that stores only 1.02N parameters (source + BN parameters of the adapting model), requires no restoration, and has a single hyperparameter (temperature for contrastive loss).

Proposed Method: SANTA

Only the BN affine parameters of the adapting model \( f_\theta \) are updated during test time. The adapting model is initialised from the source model \( f_{\theta_s} \). At each step, the BN statistics of both models are set to the statistics of the current test batch; the source model with its statistics corrected in this way is called the Target Corrected Source (TCS) model \( f_{\theta_s}^k \). SANTA has two components:

Fig. 1: SANTA framework. The original image and its augmentation pass through both the adapting model and source model. (1) Source-anchoring loss uses TCS model predictions as anchors. (2) Source prototype guided target alignment enforces semantically meaningful clustering in the feature space.
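The BN handling described above can be illustrated with a minimal pure-Python sketch. It assumes a single 1-D batch-normalization layer over feature vectors; the real models use convolutional BN layers, and the batch values and the `gamma`/`beta` numbers below are made up for illustration. Both branches normalize with the statistics of the current test batch, and only the affine parameters of the adapting branch would be trained.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature with the current batch statistics,
    then apply the affine parameters gamma (scale) and beta (shift)."""
    n = len(batch)
    dims = len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        col = [row[d] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n  # biased batch variance
        for i in range(n):
            out[i][d] = gamma[d] * (col[i] - mean) / math.sqrt(var + eps) + beta[d]
    return out

# Hypothetical test batch whose features are shifted relative to the source domain.
test_batch = [[10.0, -3.0], [12.0, -1.0], [14.0, 1.0], [16.0, 3.0]]

# TCS branch: frozen source affine parameters, statistics from the test batch.
gamma_src, beta_src = [1.0, 1.0], [0.0, 0.0]
tcs_out = batch_norm(test_batch, gamma_src, beta_src)

# Adapting branch: same statistics; gamma/beta are the only trainable values.
gamma_adapt, beta_adapt = [1.1, 0.9], [0.05, -0.05]
adapt_out = batch_norm(test_batch, gamma_adapt, beta_adapt)
```

Because the two branches share the normalization step, they differ only through the affine parameters, which is why so few values need to be stored for the adapting model.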

Source Anchoring (SA Loss)

The TCS model provides domain-invariant predictions for the current batch. We use its outputs as soft anchors for the adapting model via self-distillation. Let \( p_{ij} \) and \( a_{ij} \) denote the \( j \)-th class scores from the adapting and TCS models for image \( x_i \). The base source-anchoring loss is:

\[ \mathcal{L}'_{SA} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} p_{ij} \log(a_{ij}) \tag{1} \]

Using augmented copies of the test images makes the model more robust. Let \( q_{ij} \) be the adapting model's score for the augmented version. The complete SA loss is:

\[ \mathcal{L}_{SA} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \left( p_{ij}\log(a_{ij}) + q_{ij}\log(a_{ij}) \right) \tag{2} \]

This avoids storing a separate teacher model, requires no restoration hyperparameter, and directly uses the adapting model for prediction.
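As a rough illustration of Eq. (2), the following sketch evaluates the SA loss on hand-written class scores. The logits, the `softmax` helper and the small `eps` added inside the logarithm are assumptions made for this example, not part of the method description.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sa_loss(p, q, a, eps=1e-12):
    """Eq. (2): -(1/N) * sum_i sum_j (p_ij + q_ij) * log(a_ij)."""
    n = len(p)
    total = 0.0
    for p_i, q_i, a_i in zip(p, q, a):
        for p_ij, q_ij, a_ij in zip(p_i, q_i, a_i):
            total += (p_ij + q_ij) * math.log(a_ij + eps)  # eps guards log(0)
    return -total / n

# Hypothetical logits for N = 2 images and C = 3 classes.
p = [softmax(l) for l in [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]]  # adapting model, original images
q = [softmax(l) for l in [[1.8, 0.7, 0.2], [0.1, 1.2, 0.6]]]  # adapting model, augmented images
a = [softmax(l) for l in [[2.5, 0.3, 0.1], [0.0, 2.0, 0.2]]]  # TCS model (anchors)

loss = sa_loss(p, q, a)

# Sanity check: with uniform anchors the loss reduces to 2 * log(C),
# because each row of p and q sums to one.
uniform = [[1.0 / 3.0] * 3 for _ in range(2)]
loss_uniform_anchor = sa_loss(p, q, uniform)
```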

Target Alignment (TA Loss)

The goal is to make features gradually domain-invariant as the model encounters data from different domains. The adapting model \( f_\theta^k = h \circ g_\phi^k \) is decomposed into a feature extractor \( g_\phi^k \) and a fixed classifier \( h \). A projection head \( p_\psi \) maps the extracted features onto a \( d \)-dimensional hypersphere:

\[ z_i = p_\psi \circ g_\phi^k(x_i) \tag{3} \]

The base contrastive loss pairs each sample \( z_i \) with its augmented view \( z_{N+i} \). Since the loss is minimized, it is the negative log-ratio:

\[ \mathcal{L}_{con} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(z_i \cdot z_{N+i}/\tau)}{\sum_{k=1, k\neq i}^{2N}\exp(z_i \cdot z_k/\tau)} \tag{4} \]
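A minimal sketch of the pairing in Eq. (4) follows. It is written as a negative log-softmax, so that minimizing it pulls each sample toward its augmented view; treat that sign convention, the temperature `tau=0.1`, and the toy 2-D embeddings as assumptions of this example.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def contrastive_loss(z, tau=0.1):
    """z holds 2N unit vectors: z[i] is sample i, z[N + i] its augmented view.
    Returns the mean negative log-softmax of the positive pair."""
    n = len(z) // 2
    total = 0.0
    for i in range(n):
        pos = math.exp(dot(z[i], z[n + i]) / tau)
        # Denominator runs over all 2N embeddings except z[i] itself.
        denom = sum(math.exp(dot(z[i], z[k]) / tau) for k in range(2 * n) if k != i)
        total += -math.log(pos / denom)
    return total / n

# Augmented views close to their originals ...
z_aligned = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]]
# ... versus augmented views assigned to the wrong sample.
z_swapped = [normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.9, 0.1]]]

loss_aligned = contrastive_loss(z_aligned)
loss_swapped = contrastive_loss(z_swapped)
```

The aligned configuration yields the smaller loss, which is the behaviour the alignment term relies on.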

To ensure target clusters align with the source feature space, the nearest source prototype \( \pi(x_i) \) is used as a third view:

\[ \pi(x_i) = \{\pi_t \mid t = \arg\max_c \cos(\pi_c,\, g_\phi^k(x_i))\} \tag{5} \]
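The selection in Eq. (5) amounts to an arg-max over cosine similarities. The sketch below assumes the class prototypes are already available as plain vectors (how they are computed is not covered here), and the numbers are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    d = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return d / (norm_u * norm_v)

def nearest_prototype(feature, prototypes):
    """Return (class index t, prototype pi_t) maximizing cos(pi_c, feature)."""
    t = max(range(len(prototypes)), key=lambda c: cosine(prototypes[c], feature))
    return t, prototypes[t]

# Hypothetical source prototypes for three classes and one target feature.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feature = [0.2, 3.0, 0.4]

t, proto = nearest_prototype(feature, prototypes)
```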

The source prototype guided Target Alignment loss uses two positive views per sample: the augmented image \( z_{N+i} \) and the nearest source prototype \( z_{2N+i} \). As with Eq. (4), the loss is the negative log-ratio:

\[ \mathcal{L}_{TA} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log \frac{\exp(z_i \cdot z_{N+i}/\tau)}{\sum_{k=1, k\neq i}^{3N}\exp(z_i \cdot z_k/\tau)} + \log \frac{\exp(z_i \cdot z_{2N+i}/\tau)}{\sum_{k=1, k\neq i}^{3N}\exp(z_i \cdot z_k/\tau)}\right] \tag{6} \]
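The three-view structure of Eq. (6) can be sketched as below. The example assumes the prototype views are already projected into the same space as the sample embeddings, uses a negative log-softmax so that lower values mean better alignment, and picks `tau=0.1` and toy 2-D vectors purely for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def ta_loss(z, tau=0.1):
    """z holds 3N unit vectors: z[i] is sample i, z[N + i] its augmented view,
    z[2N + i] its nearest source prototype. Averages over the 2N positive pairs."""
    n = len(z) // 3
    total = 0.0
    for i in range(n):
        # Shared denominator over all 3N embeddings except z[i] itself.
        denom = sum(math.exp(dot(z[i], z[k]) / tau) for k in range(3 * n) if k != i)
        for pos_index in (n + i, 2 * n + i):
            pos = math.exp(dot(z[i], z[pos_index]) / tau)
            total += -math.log(pos / denom)
    return total / (2 * n)

samples = [[1.0, 0.0], [0.0, 1.0]]
augmented = [[0.9, 0.1], [0.1, 0.9]]
protos_matched = [[1.0, 0.0], [0.0, 1.0]]   # each sample paired with a nearby prototype
protos_swapped = [[0.0, 1.0], [1.0, 0.0]]   # each sample paired with a distant prototype

z_matched = [normalize(v) for v in samples + augmented + protos_matched]
z_swapped = [normalize(v) for v in samples + augmented + protos_swapped]

loss_matched = ta_loss(z_matched)
loss_swapped = ta_loss(z_swapped)
```

Pairing each sample with a prototype that lies close to it gives the lower loss, so minimizing this term draws target features toward the source prototype they already resemble.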

Final Loss

The BN affine parameters \( \phi \) and projection head \( \psi \) are updated at each test batch by minimizing:

\[ \mathcal{L}_{SANTA} = \mathcal{L}_{SA} + \mathcal{L}_{TA} \tag{7} \]

The adapting model is used directly for prediction and is robust enough to update continuously without any restoration back to the source model.

Results

Comparison with State-of-the-Art

Method	CIFAR-10C	CIFAR-100C	ImageNet-C (5k)	ImageNet-C (50k)
Source	43.5	46.4	82.0	82.0
BN Stats Adapt	20.4	35.4	68.6	68.5
TENT-Continual	20.7	60.9	62.6	91.0
CoTTA	16.2	32.5	62.7	69.7
RMT	17.3	30.4	60.2	69.9
SANTA	16.1 ±0.06	30.3 ±0.05	60.1 ±0.06	60.3 ±0.07

Table 1: Mean error % (lower is better) across all 15 corruptions at severity 5. Batch sizes: 200 / 200 / 64 for CIFAR-10C / CIFAR-100C / ImageNet-C. SANTA has the lowest error in every column. Its error is nearly unchanged from ImageNet-C 5k to 50k (60.1 to 60.3), whereas CoTTA and RMT rise to about 70% and TENT diverges to 91%.

Ablation Study

Method	CIFAR-10C (batch size)						CIFAR-100C (batch size)						ImageNet-C (batch size)
	200	150	100	50	25	10	200	150	100	50	25	10	64	32	16	8
SA (orig only)	20.1	20.1	20.4	20.7	21.7	24.8	32.6	32.8	33.2	34.0	35.7	40.7	62.7	64.9	69.4	77.8
SA (orig + aug)	17.1	17.2	17.4	18.0	19.0	22.5	31.4	31.6	32.0	32.8	34.7	39.9	61.3	63.8	68.5	76.8
SANTA (SA + TA)	16.1	16.2	16.5	17.2	18.3	22.0	30.4	30.5	30.8	31.9	33.9	39.1	60.2	63.2	68.5	78.5

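Table 2: Ablation across batch sizes. Adding augmentation to SA lowers the error in every setting and gives the larger gain at the largest batch size of each dataset. Adding TA lowers the error further in all CIFAR-10C and CIFAR-100C settings and on ImageNet-C at batch sizes 64 and 32; at batch size 16 it gives no change (68.5), and at batch size 8 the error rises from 76.8 to 78.5.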
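The per-component effect in Table 2 can be checked with simple arithmetic. The sketch below does this for the ImageNet-C columns, with values copied from the table; a positive number means the added component lowered the error.

```python
# ImageNet-C error rates (%) from Table 2, ordered by batch size.
batch_sizes = [64, 32, 16, 8]
sa_orig = [62.7, 64.9, 69.4, 77.8]   # SA (orig only)
sa_aug = [61.3, 63.8, 68.5, 76.8]    # SA (orig + aug)
santa = [60.2, 63.2, 68.5, 78.5]     # SANTA (SA + TA)

# Error reduction from adding augmentation to SA, then from adding TA.
aug_gain = [round(o - a, 1) for o, a in zip(sa_orig, sa_aug)]
ta_gain = [round(a - s, 1) for a, s in zip(sa_aug, santa)]

for bs, g_aug, g_ta in zip(batch_sizes, aug_gain, ta_gain):
    print(f"batch {bs:>2}: augmentation {g_aug:+.1f}, TA {g_ta:+.1f}")
```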

Parameter Efficiency

Method	Total Params	Trainable Params	% Trainable
BN-Stats	6.9M	0	0%
TENT	6.9M	25K	0.37%
CoTTA	20.7M (3N)	6.9M	33.3%
RMT	13.9M (2N)	6.9M	50.0%
SANTA	7.05M (1.02N)	173K	2.5%

Table 3: Storage comparison for CIFAR-100C with ResNeXt-29. SANTA stores only the source model plus BN affine parameters (1.02N), vs 3N for CoTTA. SANTA is also ~10x faster than CoTTA at inference (0.22s vs 2.2s per batch of 200).
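The ratios quoted for SANTA follow from the Table 3 entries. A short check, using only the numbers given in the table and caption (variable names are ours):

```python
# Values from Table 3 (CIFAR-100C, ResNeXt-29).
source_params = 6.9e6       # N: parameters of one full model
santa_total = 7.05e6        # source model + BN affine parameters of the adapting model
santa_trainable = 173e3     # BN affine parameters updated at test time

storage_ratio = santa_total / source_params            # reported as 1.02N
trainable_pct = 100 * santa_trainable / santa_total    # reported as 2.5%

# Per-batch inference time (batch of 200), in seconds.
cotta_time, santa_time = 2.2, 0.22
speedup = cotta_time / santa_time                      # reported as ~10x

print(f"storage: {storage_ratio:.2f}N, trainable: {trainable_pct:.1f}%, speedup: {speedup:.0f}x")
```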

BibTeX

@article{chakrabarty2023santa,
  author    = {Chakrabarty, Goirik and Sreenivas, Manogna and Biswas, Soma},
  title     = {SANTA: Source Anchoring Network and Target Alignment for Continual Test Time Adaptation},
  journal   = {Transactions on Machine Learning Research},
  year      = {2023}
}