We propose the DIFC (Data Imbalance and Feature Confusion) module, which addresses both challenges in an integrated manner. DIFC replaces the standard cross-entropy classification loss in a UDA framework with a two-phase loss that first handles source imbalance and then adapts to target confusion.
Standard cross-entropy training on imbalanced data leads to poor margins for minority classes. We adopt Label Distribution Aware Margin (LDAM) loss, which enforces class-dependent margins: larger margins for minority classes relative to majority classes. Given source logits \( z_s = [z_1, \ldots, z_K]^\top \), the standard CE loss is:
\[ \mathcal{L}_\text{CE}(z_s, y_s) = -\log \frac{\exp(z_{y_s})}{\sum_{k=1}^{K} \exp(z_k)} \tag{2} \]
The data imbalance (DI) loss modifies CE by subtracting a class-dependent margin from the true-class logit, with margins \( \Delta \in \mathbb{R}^K \):
\[ \mathcal{L}_\text{DI}(z_s, y_s; \Delta) = -\log \frac{e^{z_{y_s} - \Delta_{y_s}}}{e^{z_{y_s} - \Delta_{y_s}} + \sum_{k \neq y_s} e^{z_k}}, \quad \text{where } \Delta_k = \frac{\gamma}{N_k^{1/4}} \tag{3} \]
Here \( N_k \) is the number of source training samples in class \( k \), and \( \gamma \) is a hyperparameter that normalizes the margins. Following LDAM, we apply a deferred re-weighting scheme after a few initial epochs. Although effective for supervised imbalance, Eq. (3) computes the margins from the source label distribution, which may not match that of the target data.
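The margin computation and loss in Eq. (3) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `ldam_margins` and `di_loss`, the default value of `gamma`, and the batch-mean reduction are assumptions made for the example.

```python
import numpy as np

def ldam_margins(class_counts, gamma=0.5):
    """Class-dependent margins Delta_k = gamma / N_k^(1/4), as in Eq. (3).

    The default gamma is a placeholder, not a value from the text.
    """
    counts = np.asarray(class_counts, dtype=np.float64)
    return gamma / counts ** 0.25

def di_loss(logits, labels, margins):
    """Batch-mean DI loss: cross-entropy after subtracting the margin
    Delta_y from the true-class logit only."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    margins = np.asarray(margins, dtype=np.float64)
    rows = np.arange(len(labels))
    shifted = logits.copy()
    shifted[rows, labels] -= margins[labels]
    # Numerically stable log-softmax over the margin-adjusted logits.
    shifted -= shifted.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()
```

With all margins set to zero, `di_loss` reduces to the standard CE loss of Eq. (2); a positive margin on the true class increases the loss, and minority classes (small \( N_k \)) receive the largest margins.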
The final goal is to correctly classify the target features. Minority classes in the target may differ from those in the source, and inherent perceptual similarity between certain expressions causes additional confusion. We quantify this using a class confusion score estimated from model predictions on unlabelled target data.
Target samples are grouped into pseudo-class sets \( S_k \) based on current model predictions:
\[ S_k = \{x_t \mid \hat{y}_t = k\}, \quad \text{where} \quad \hat{y}_t = \arg\max(p_t) \tag{4} \]
For each class \( k \), the confusion score is the inverse of the average gap between the largest and second-largest softmax scores, \( p^1_t \) and \( p^2_t \), over all samples predicted as class \( k \):
\[ \text{conf}_k = \frac{1}{\mathbb{E}_{x_t \in S_k} |p^1_t - p^2_t|} \tag{5} \]
A large average gap means the model is confident about class \( k \) (low confusion); a small gap means class \( k \) is frequently confused with another class (high confusion). Classes are sorted in decreasing order of confusion to obtain ordered indices \( C \). The DI margins are then updated with exponentially decaying increments to emphasize the most confusing classes:
\[ \Delta'_{C[j]} = \Delta_{C[j]} + \frac{\epsilon}{2^{j-1}}, \quad j = 1, \ldots, K, \qquad \mathcal{L}_\text{DIFC} = \mathcal{L}_\text{DI}(z_s, y_s; \Delta') \tag{6} \]
Here \( \epsilon \) controls the margin increase for the most confusing class \( C[1] \). This formulation learns class boundaries that account for source imbalance while simultaneously reducing target confusion.
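The confusion scores of Eqs. (4) and (5) and the margin update of Eq. (6) can be sketched as below. The function names, the default `epsilon`, and the treatment of classes that receive no target predictions (assigned a score of 0, so they rank last) are assumptions made for this example; the text does not specify how empty pseudo-class sets are handled.

```python
import numpy as np

def confusion_scores(probs, eps=1e-12):
    """conf_k from Eq. (5), computed from target softmax outputs (n, K).

    Classes with an empty pseudo-class set S_k get score 0 (an assumption).
    """
    probs = np.asarray(probs, dtype=np.float64)
    num_classes = probs.shape[1]
    pseudo = probs.argmax(axis=1)              # pseudo-labels, Eq. (4)
    top2 = np.sort(probs, axis=1)[:, -2:]      # [second largest, largest]
    gap = top2[:, 1] - top2[:, 0]              # |p^1 - p^2| per sample
    conf = np.zeros(num_classes)
    for k in range(num_classes):
        mask = pseudo == k
        if mask.any():
            conf[k] = 1.0 / max(gap[mask].mean(), eps)
    return conf

def update_margins(margins, conf, epsilon=0.1):
    """Eq. (6): add epsilon / 2^(j-1) to the j-th most confusing class."""
    order = np.argsort(-conf, kind="stable")   # ordered indices C
    increments = epsilon / 2.0 ** np.arange(len(order))
    new_margins = np.array(margins, dtype=np.float64)
    new_margins[order] += increments
    return new_margins
```

The most confusing class receives the full increment \( \epsilon \), the next \( \epsilon/2 \), and so on, so the total added margin stays below \( 2\epsilon \) regardless of the number of classes.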
We integrate DIFC with Maximum Classifier Discrepancy (MCD), which uses a feature extractor \( F \) and two classifiers \( C_1, C_2 \). MCD alternates among three objectives:
\[
\min_{F,C_1,C_2} \; \mathcal{L}_\text{cls}(C_1(F(x_s)), y_s) + \mathcal{L}_\text{cls}(C_2(F(x_s)), y_s)
\]
\[
\max_{C_1,C_2} \; \|C_1(F(x_t)) - C_2(F(x_t))\|_1
\]
\[
\min_{F} \; \|C_1(F(x_t)) - C_2(F(x_t))\|_1 \tag{7}
\]
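The discrepancy term shared by the second and third objectives can be illustrated with a short NumPy sketch. It assumes the \( \ell_1 \) distance is taken between the two classifiers' softmax outputs and averaged over the batch; the text writes the norm directly on \( C_i(F(x_t)) \), so both the softmax and the reduction are assumptions of this example, as are the function names.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def classifier_discrepancy(logits_1, logits_2):
    """Batch-mean L1 distance between the two classifiers' class
    probabilities on the same target features."""
    p1 = softmax(np.asarray(logits_1, dtype=np.float64))
    p2 = softmax(np.asarray(logits_2, dtype=np.float64))
    return np.abs(p1 - p2).sum(axis=1).mean()
```

In training, the classifiers would be updated to increase this quantity on target samples while the feature extractor is updated to decrease it; the sketch only evaluates the value and does not perform either update.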
DIFC replaces the classification loss \( \mathcal{L}_\text{cls} \): DI loss is used for the initial \( T_1 \) epochs to stabilize source discrimination, then the full DIFC loss is used for epochs \( T_1+1 \) to \( T_2 \). Confusion scores are computed using the ensemble prediction:
\[ p_{t,i} = \text{softmax}(C_i(F(x_t))), \quad p_t = \tfrac{1}{2}(p_{t,1} + p_{t,2}), \quad \hat{y}_t = \arg\max_k(p_t) \tag{8} \]
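The ensemble prediction of Eq. (8) and the two-phase schedule can be sketched as follows. The names `ensemble_prediction` and `loss_phase`, the 1-indexed epoch convention, and the error raised outside the range \( 1 \) to \( T_2 \) are assumptions introduced for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_prediction(logits_1, logits_2):
    """Eq. (8): average the two classifiers' softmax outputs and take
    the argmax as the pseudo-label."""
    p1 = softmax(np.asarray(logits_1, dtype=np.float64))
    p2 = softmax(np.asarray(logits_2, dtype=np.float64))
    p = 0.5 * (p1 + p2)
    return p, p.argmax(axis=1)

def loss_phase(epoch, t1, t2):
    """Return which classification loss is active at a 1-indexed epoch:
    'DI' for epochs 1..T1, 'DIFC' for epochs T1+1..T2."""
    if epoch < 1 or epoch > t2:
        raise ValueError("epoch must lie in 1..T2")
    return "DI" if epoch <= t1 else "DIFC"
```

The averaged probabilities returned by `ensemble_prediction` are the \( p_t \) from which pseudo-class sets and confusion scores are computed during the DIFC phase.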