Shortcut to Nowhere: Demystifying Deep Spurious Regression

Guanrong Xu¹, Jessica Li¹, Hao Wang², Yuzhe Yang¹
¹University of California, Los Angeles  |  ²Rutgers University
[Figure: DSR transferability visualization]

From aligned to misaligned spurious attributes: varying target distribution similarity across attributes reveals how learned feature-space embeddings reflect distributional relationships, motivating MDS-based information sharing across related attributes.

📰 News

2026-05-28 Paper released on arXiv!
2026-05-27 Project website is live!
2026-05-26 Code released on GitHub.

Abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training yet unreliable under deployment shifts. Models that regress targets using such shortcuts may fail catastrophically at test time.

Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group–label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute–label confounding, addressing continuous spurious correlations, and generalizing to all attribute–label combinations at test time.
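To make the failure mode concrete, here is a toy sketch on purely synthetic data (not one of the paper's datasets). It assumes a categorical attribute whose groups cluster around different target values in training, while the test split covers every attribute–target combination. The function name `make_split`, the group means, and the noise levels are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n, confounded):
    """Synthetic regression split (hypothetical): noisy core feature + one-hot attribute."""
    attr = rng.integers(0, 3, size=n)
    if confounded:
        # Training: targets cluster by attribute, so the attribute is a shortcut.
        y = np.array([2.0, 5.0, 8.0])[attr] + rng.normal(0.0, 0.5, size=n)
    else:
        # Test: targets are independent of the attribute (all combinations occur).
        y = rng.uniform(0.0, 10.0, size=n)
    core = y + rng.normal(0.0, 2.0, size=n)  # reliable but noisy signal
    X = np.column_stack([core, np.eye(3)[attr]])
    return X, y

X_train, y_train = make_split(2000, confounded=True)
X_test, y_test = make_split(2000, confounded=False)

# Ordinary least squares: the fit leans on the attribute because it is
# the least noisy predictor of the target in the training split.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mae = float(np.mean(np.abs(X_train @ coef - y_train)))
test_mae = float(np.mean(np.abs(X_test @ coef - y_test)))
print("weight on core feature:", round(float(coef[0]), 3))
print(f"train MAE = {train_mae:.2f}, shifted-test MAE = {test_mae:.2f}")
```

In this toy setting the fitted model puts little weight on the core feature, and its error grows sharply once the attribute no longer tracks the target.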

Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on real-world DSR datasets spanning computer vision, environmental sensing, and LLM regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

Our Methods

Label MDS (LMDS)

LMDS builds a cross-group kernel by applying Multidimensional Scaling (MDS) to pairwise Wasserstein distances between attribute–target distributions. It uses the resulting cross-group similarity to reweight samples, correcting for spurious attribute–target confounding in the label space. It is applied once at dataset construction, so it adds no training overhead.
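The label-space pipeline can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the page does not specify the kernel or the reweighting rule, so the Gaussian kernel, the kernel-smoothed inverse-frequency weights over target bins, and the names `wasserstein_1d`, `classical_mds`, `lmds_weights`, `n_bins`, and `sigma` are all hypothetical. The 1-D Wasserstein distance is written out in NumPy to keep the sketch dependency-light.

```python
import numpy as np

def wasserstein_1d(u, v):
    """1-Wasserstein distance between two 1-D empirical distributions."""
    u, v = np.sort(u), np.sort(v)
    grid = np.sort(np.concatenate([u, v]))
    cdf_u = np.searchsorted(u, grid[:-1], side="right") / len(u)
    cdf_v = np.searchsorted(v, grid[:-1], side="right") / len(v)
    return float(np.sum(np.abs(cdf_u - cdf_v) * np.diff(grid)))

def classical_mds(D, dim=2):
    """Embed points so Euclidean distances approximate the distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]           # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def lmds_weights(y, attr, n_bins=10, dim=2, sigma=1.0):
    """Assumed rule: inverse of kernel-smoothed (attribute, target-bin) counts."""
    groups = np.unique(attr)
    g = np.searchsorted(groups, attr)            # group index per sample
    m = len(groups)
    D = np.array([[wasserstein_1d(y[g == a], y[g == b]) for b in range(m)]
                  for a in range(m)])
    Z = classical_mds(D, dim)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))         # cross-group kernel (assumed Gaussian)
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    b = np.digitize(y, edges[1:-1])              # target-bin index in [0, n_bins - 1]
    counts = np.zeros((m, n_bins))
    np.add.at(counts, (g, b), 1.0)
    smoothed = K @ counts                        # borrow counts from similar groups
    w = 1.0 / smoothed[g, b]
    return w / w.mean(), K

# Toy data: groups A and B have similar target distributions, C does not.
rng = np.random.default_rng(0)
attr = np.repeat(["A", "B", "C"], 200)
y = np.concatenate([rng.normal(20, 5, 200), rng.normal(22, 5, 200),
                    rng.normal(60, 5, 200)])
weights, K = lmds_weights(y, attr, sigma=5.0)
print(np.round(K, 2))
```

Because the weights depend only on labels and attributes, they can be computed once and stored with the dataset, which matches the "no training overhead" property described above.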

Feature MDS (FMDS)

FMDS extracts model representations during training, computes per-attribute centroids, and builds a cross-group kernel via MDS to dynamically update sample weights. Unlike LMDS, FMDS operates in feature space and continuously adapts as the model's representations evolve throughout training.
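A rough sketch of the feature-space variant is given below, again under assumptions rather than as the released code. The weight update rule (inverse of a similarity-weighted effective group size), the Gaussian kernel, and the names `fmds_weights` and `sigma` are hypothetical. Random arrays stand in for the model's representations, which a real training loop would extract from the network each time the weights are refreshed.

```python
import numpy as np

def classical_mds(D, dim=2):
    """Embed points so Euclidean distances approximate the distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dim]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def fmds_weights(features, attr, dim=2, sigma=1.0):
    """Assumed rule: down-weight groups that have many similar neighbours."""
    groups = np.unique(attr)
    g = np.searchsorted(groups, attr)
    m = len(groups)
    # Per-attribute centroids of the current representations.
    centroids = np.stack([features[g == k].mean(axis=0) for k in range(m)])
    D = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    Z = classical_mds(D, dim)
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))         # cross-group kernel (assumed Gaussian)
    sizes = np.bincount(g, minlength=m).astype(float)
    effective = K @ sizes                        # own size + similarity-weighted neighbours
    w = 1.0 / effective[g]
    return w / w.mean()

# Simulated training: groups 0 and 1 stay close in feature space, group 2 is far.
rng = np.random.default_rng(0)
attr = np.repeat([0, 1, 2], 50)
centers = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0]])
for epoch in range(3):
    # Stand-in for features extracted from the model at this epoch.
    features = centers[attr] * (1 + epoch) + rng.normal(0.0, 0.1, size=(150, 2))
    weights = fmds_weights(features, attr)
    print(epoch, np.round([weights[attr == k].mean() for k in range(3)], 2))
```

Recomputing the weights inside the loop is what makes this variant dynamic: as the centroids move, the kernel and the weights move with them.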

LMDS + FMDS

The two techniques can be combined for full static + dynamic cross-group correction, leveraging both label-space and feature-space signals simultaneously.
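The page does not say how the two sets of weights are merged. One simple possibility, shown here purely as an assumption, is to multiply the static and dynamic weights, rescale them to mean 1, and apply the result in a weighted L1 loss. The helper names `combine_weights` and `weighted_l1` are hypothetical.

```python
import numpy as np

def combine_weights(static_w, dynamic_w):
    """Multiply the two weightings and rescale to mean 1 (an assumed rule)."""
    w = np.asarray(static_w, dtype=float) * np.asarray(dynamic_w, dtype=float)
    return w / w.mean()

def weighted_l1(pred, target, w):
    """Sample-weighted mean absolute error."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.mean(w * diff))

static_w = [0.5, 1.0, 1.5]    # e.g. label-space weights, fixed before training
dynamic_w = [1.0, 1.0, 2.0]   # e.g. feature-space weights, refreshed during training
w = combine_weights(static_w, dynamic_w)
loss = weighted_l1([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], w)
print(np.round(w, 3), round(loss, 4))
```

Keeping the mean weight at 1 leaves the overall loss scale, and therefore the effective learning rate, roughly unchanged when the dynamic weights are updated.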

Benchmarks

We curate four real-world DSR datasets spanning computer vision, environmental sensing, and large language models (LLMs), and compare against a comprehensive set of baselines:

ERM, Resample, SqrtReWeight, ReWeight, CBLoss, DANN, Rank-N-Contrast (RnC), Label Distribution Smoothing (LDS), GroupDRO
| Dataset | Task | Spurious Attribute |
|---|---|---|
| UTKFace | Age regression | Race |
| SkyFinder | Temperature regression | Camera Location |
| PovertyMap | Poverty index regression | Country |
| CodeNet | CPU runtime regression | Coding Language |

Results

UTKFace — Age Regression

Test error overall, by attribute (Average / Worst), and by shot (Many / Medium / Few / Zero, each Avg / Worst).

| Algorithm | Overall | Attribute Average | Attribute Worst | Many Avg | Many Worst | Medium Avg | Medium Worst | Few Avg | Few Worst | Zero Avg | Zero Worst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 7.39 ±0.1 | 7.26 ±0.1 | 9.19 ±0.2 | 4.34 ±0.2 | 18.61 ±2.3 | 6.24 ±0.1 | 19.18 ±0.7 | 7.19 ±0.1 | 13.16 ±1.9 | 9.83 ±0.2 | 73.56 ±7.3 |
| Resample | 7.64 ±0.1 | 7.50 ±0.1 | 9.40 ±0.3 | 4.72 ±0.1 | 19.71 ±1.2 | 6.20 ±0.1 | 17.12 ±0.9 | 7.00 ±0.3 | 13.03 ±1.4 | 10.43 ±0.4 | 76.58 ±7.1 |
| SqrtReWeight | 7.31 ±0.1 | 7.18 ±0.1 | 9.02 ±0.3 | 4.48 ±0.2 | 16.30 ±2.3 | 6.10 ±0.1 | 16.01 ±0.8 | 6.90 ±0.1 | 13.18 ±1.1 | 9.78 ±0.4 | 62.10 ±8.2 |
| ReWeight | 8.42 ±0.1 | 8.25 ±0.1 | 9.62 ±0.2 | 6.21 ±0.1 | 17.82 ±0.9 | 7.06 ±0.1 | 18.00 ±1.4 | 7.65 ±0.1 | 14.78 ±1.4 | 10.88 ±0.3 | 86.59 ±4.6 |
| CBLoss | 8.37 ±0.1 | 8.19 ±0.1 | 9.61 ±0.1 | 6.16 ±0.1 | 18.55 ±1.8 | 6.98 ±0.1 | 18.18 ±1.6 | 7.56 ±0.2 | 15.23 ±1.3 | 10.86 ±0.2 | 87.54 ±2.1 |
| DANN | 7.97 ±0.1 | 7.82 ±0.1 | 9.69 ±0.2 | 4.63 ±0.2 | 20.62 ±1.4 | 6.65 ±0.1 | 20.32 ±0.5 | 7.88 ±0.1 | 15.90 ±1.1 | 10.67 ±0.3 | 76.86 ±4.7 |
| RnC | 7.38 ±0.1 | 7.25 ±0.1 | 9.22 ±0.2 | 4.35 ±0.1 | 19.31 ±1.1 | 6.15 ±0.0 | 17.64 ±1.3 | 7.12 ±0.1 | 12.55 ±0.9 | 9.91 ±0.2 | 63.69 ±6.2 |
| LDS | 7.23 ±0.1 | 7.09 ±0.1 | 8.90 ±0.1 | 4.58 ±0.1 | 16.59 ±1.9 | 6.12 ±0.0 | 16.36 ±0.6 | 7.01 ±0.2 | 13.22 ±0.6 | 9.46 ±0.3 | 71.49 ±10.6 |
| GroupDRO | 7.43 ±0.1 | 7.29 ±0.1 | 9.02 ±0.1 | 4.94 ±0.1 | 16.69 ±1.7 | 6.13 ±0.1 | 17.78 ±1.5 | 6.86 ±0.2 | 12.71 ±0.9 | 9.90 ±0.1 | 71.79 ±11.7 |
| L-MDS | 7.30 ±0.1 | 7.17 ±0.1 | 8.94 ±0.2 | 4.66 ±0.2 | 19.40 ±2.1 | 6.03 ±0.1 | 17.62 ±1.0 | 6.79 ±0.2 | 12.51 ±1.2 | 9.79 ±0.2 | 76.32 ±6.2 |
| F-MDS | 7.22 ±0.1 | 7.08 ±0.1 | 8.71 ±0.2 | 4.65 ±0.2 | 17.49 ±1.9 | 6.08 ±0.1 | 17.91 ±0.2 | 6.71 ±0.2 | 12.43 ±1.4 | 9.54 ±0.4 | 68.81 ±8.1 |
| L-MDS + F-MDS | 7.42 ±0.1 | 7.29 ±0.1 | 9.04 ±0.2 | 4.55 ±0.2 | 17.55 ±1.8 | 6.15 ±0.1 | 17.73 ±1.8 | 7.02 ±0.1 | 12.88 ±0.6 | 9.96 ±0.3 | 77.43 ±9.3 |
| Ours (best) vs. ERM | +0.17 | +0.18 | +0.48 | -0.21 | +1.12 | +0.21 | +1.56 | +0.48 | +0.73 | +0.29 | +4.75 |
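The by-attribute columns report an average and a worst-case error across attribute groups. The sketch below shows one way such summaries could be computed. It assumes mean absolute error (the page does not name the metric) and an unweighted average over groups. The function name `error_summary` and the toy inputs are hypothetical.

```python
import numpy as np

def error_summary(y_true, y_pred, attr):
    """Overall error plus average and worst error across attribute groups."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    attr = np.asarray(attr)
    # Mean absolute error within each attribute group.
    per_group = {str(a): float(err[attr == a].mean()) for a in np.unique(attr)}
    return {
        "overall": float(err.mean()),                          # averaged over samples
        "attr_avg": float(np.mean(list(per_group.values()))),  # averaged over groups
        "attr_worst": max(per_group.values()),                 # worst single group
    }

# Toy example: group "a" has three small errors, group "b" one large error.
summary = error_summary(
    y_true=[0.0, 0.0, 0.0, 0.0],
    y_pred=[1.0, 1.0, 1.0, 5.0],
    attr=["a", "a", "a", "b"],
)
print(summary)  # {'overall': 2.0, 'attr_avg': 3.0, 'attr_worst': 5.0}
```

The per-sample and per-group averages differ whenever group sizes are unequal, which is one possible reason the Overall and Attribute Average columns do not coincide.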

SkyFinder — Temperature Regression

Test error overall, by attribute (Average / Worst), and by shot (Many / Medium / Few / Zero, each Avg / Worst).

| Algorithm | Overall | Attribute Average | Attribute Worst | Many Avg | Many Worst | Medium Avg | Medium Worst | Few Avg | Few Worst | Zero Avg | Zero Worst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 3.68 ±0.0 | 3.41 ±0.0 | 5.95 ±0.1 | 2.27 ±0.0 | 6.45 ±0.6 | 2.94 ±0.0 | 12.21 ±1.2 | 4.49 ±0.0 | 25.08 ±0.9 | 5.22 ±0.0 | 29.78 ±1.8 |
| Resample | 3.62 ±0.0 | 3.35 ±0.0 | 5.76 ±0.1 | 2.71 ±0.1 | 7.28 ±0.3 | 3.00 ±0.0 | 13.37 ±1.2 | 4.23 ±0.0 | 19.23 ±0.6 | 4.97 ±0.1 | 36.23 ±4.0 |
| SqrtReWeight | 3.53 ±0.0 | 3.26 ±0.0 | 5.85 ±0.1 | 2.34 ±0.1 | 6.91 ±0.8 | 2.95 ±0.0 | 11.69 ±0.8 | 4.17 ±0.0 | 23.76 ±1.0 | 4.75 ±0.1 | 32.65 ±1.7 |
| ReWeight | 4.25 ±0.0 | 3.91 ±0.0 | 7.13 ±0.2 | 4.03 ±0.1 | 10.55 ±0.8 | 3.88 ±0.0 | 15.84 ±1.3 | 4.52 ±0.0 | 21.90 ±0.4 | 5.13 ±0.1 | 33.09 ±1.6 |
| CBLoss | 4.23 ±0.1 | 3.90 ±0.1 | 7.24 ±0.2 | 4.07 ±0.2 | 10.39 ±0.8 | 3.86 ±0.1 | 15.04 ±2.1 | 4.50 ±0.1 | 20.16 ±0.6 | 5.12 ±0.1 | 30.88 ±1.0 |
| DANN | 4.04 ±0.1 | 3.76 ±0.1 | 6.75 ±0.2 | 2.57 ±0.0 | 7.98 ±0.7 | 3.32 ±0.1 | 12.60 ±1.3 | 4.83 ±0.1 | 24.84 ±0.4 | 5.56 ±0.1 | 31.02 ±1.6 |
| RnC | 3.49 ±0.0 | 3.24 ±0.0 | 5.69 ±0.1 | 2.42 ±0.0 | 7.21 ±0.5 | 2.90 ±0.1 | 11.99 ±0.8 | 4.14 ±0.0 | 19.65 ±1.2 | 4.71 ±0.1 | 30.86 ±2.2 |
| LDS | 3.85 ±0.1 | 3.56 ±0.0 | 6.44 ±0.4 | 2.39 ±0.0 | 7.95 ±0.6 | 3.12 ±0.0 | 13.51 ±1.0 | 4.69 ±0.1 | 21.71 ±0.9 | 5.26 ±0.1 | 33.98 ±2.8 |
| GroupDRO | 3.62 ±0.0 | 3.35 ±0.0 | 6.00 ±0.1 | 2.34 ±0.0 | 6.64 ±0.5 | 2.90 ±0.0 | 12.43 ±1.1 | 4.42 ±0.1 | 25.07 ±1.4 | 5.04 ±0.0 | 29.66 ±1.5 |
| L-MDS | 3.54 ±0.0 | 3.27 ±0.0 | 5.81 ±0.2 | 2.38 ±0.0 | 6.86 ±0.6 | 2.95 ±0.0 | 11.63 ±1.0 | 4.17 ±0.0 | 23.63 ±0.7 | 4.78 ±0.0 | 31.47 ±2.3 |
| F-MDS | 3.56 ±0.0 | 3.29 ±0.0 | 5.81 ±0.2 | 2.33 ±0.1 | 6.44 ±0.3 | 2.97 ±0.0 | 11.86 ±0.4 | 4.22 ±0.0 | 21.40 ±1.1 | 4.74 ±0.0 | 30.47 ±2.0 |
| L-MDS + F-MDS | 3.58 ±0.0 | 3.30 ±0.0 | 5.78 ±0.1 | 2.39 ±0.1 | 6.28 ±0.6 | 2.97 ±0.0 | 12.18 ±0.6 | 4.23 ±0.1 | 22.30 ±1.3 | 4.82 ±0.1 | 32.99 ±2.8 |
| Ours (best) vs. ERM | +0.14 | +0.14 | +0.17 | -0.06 | +0.17 | -0.01 | +0.58 | +0.32 | +3.68 | +0.48 | -0.69 |

PovertyMap — Poverty Index Regression

Test error overall, by attribute (Average / Worst), and by shot (Many / Medium / Few / Zero, each Avg / Worst).

| Algorithm | Overall | Attribute Average | Attribute Worst | Many Avg | Many Worst | Medium Avg | Medium Worst | Few Avg | Few Worst | Zero Avg | Zero Worst |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 0.504 ±0.0 | 0.502 ±0.0 | 0.679 ±0.0 | 0.256 ±0.0 | 0.504 ±0.1 | 0.335 ±0.0 | 1.356 ±0.1 | 0.494 ±0.0 | 2.452 ±0.1 | 0.744 ±0.0 | 1.996 ±0.1 |
| Resample | 0.506 ±0.0 | 0.503 ±0.0 | 0.710 ±0.0 | 0.385 ±0.0 | 0.781 ±0.2 | 0.391 ±0.0 | 1.383 ±0.1 | 0.463 ±0.0 | 2.247 ±0.1 | 0.737 ±0.0 | 2.019 ±0.1 |
| SqrtReWeight | 0.512 ±0.0 | 0.509 ±0.0 | 0.670 ±0.0 | 0.375 ±0.0 | 0.679 ±0.2 | 0.376 ±0.0 | 1.441 ±0.1 | 0.478 ±0.0 | 2.233 ±0.1 | 0.753 ±0.0 | 2.037 ±0.1 |
| ReWeight | 0.522 ±0.0 | 0.520 ±0.0 | 0.750 ±0.0 | 0.485 ±0.1 | 0.805 ±0.1 | 0.431 ±0.0 | 1.426 ±0.1 | 0.464 ±0.0 | 2.088 ±0.2 | 0.748 ±0.0 | 2.012 ±0.1 |
| CBLoss | 0.515 ±0.0 | 0.513 ±0.0 | 0.720 ±0.0 | 0.450 ±0.0 | 0.856 ±0.2 | 0.420 ±0.0 | 1.447 ±0.1 | 0.467 ±0.0 | 2.142 ±0.1 | 0.729 ±0.0 | 2.029 ±0.1 |
| DANN | 0.689 ±0.1 | 0.685 ±0.1 | 0.869 ±0.0 | 0.796 ±0.1 | 0.996 ±0.1 | 0.574 ±0.1 | 1.638 ±0.1 | 0.598 ±0.1 | 1.926 ±0.1 | 1.003 ±0.1 | 2.191 ±0.1 |
| RnC | 0.494 ±0.0 | 0.490 ±0.0 | 0.675 ±0.0 | 0.304 ±0.0 | 0.559 ±0.1 | 0.290 ±0.0 | 1.103 ±0.1 | 0.486 ±0.0 | 2.320 ±0.1 | 0.773 ±0.0 | 2.153 ±0.2 |
| LDS | 0.501 ±0.0 | 0.499 ±0.0 | 0.712 ±0.0 | 0.331 ±0.0 | 0.717 ±0.1 | 0.336 ±0.0 | 1.458 ±0.1 | 0.501 ±0.0 | 2.276 ±0.1 | 0.714 ±0.0 | 2.049 ±0.1 |
| GroupDRO | 0.492 ±0.0 | 0.489 ±0.0 | 0.648 ±0.0 | 0.376 ±0.1 | 0.844 ±0.2 | 0.319 ±0.0 | 1.245 ±0.1 | 0.470 ±0.0 | 2.382 ±0.1 | 0.757 ±0.0 | 2.016 ±0.1 |
| L-MDS | 0.486 ±0.0 | 0.484 ±0.0 | 0.666 ±0.0 | 0.271 ±0.0 | 0.535 ±0.2 | 0.336 ±0.0 | 1.417 ±0.1 | 0.467 ±0.0 | 2.385 ±0.1 | 0.720 ±0.0 | 1.987 ±0.1 |
| F-MDS | 0.488 ±0.0 | 0.485 ±0.0 | 0.670 ±0.0 | 0.278 ±0.0 | 0.554 ±0.1 | 0.327 ±0.0 | 1.307 ±0.1 | 0.477 ±0.0 | 2.492 ±0.1 | 0.719 ±0.0 | 2.057 ±0.0 |
| L-MDS + F-MDS | 0.492 ±0.0 | 0.490 ±0.0 | 0.642 ±0.0 | 0.352 ±0.1 | 0.834 ±0.1 | 0.332 ±0.0 | 1.369 ±0.1 | 0.483 ±0.0 | 2.283 ±0.2 | 0.715 ±0.0 | 2.015 ±0.1 |
| Ours (best) vs. ERM (%) | +3.57% | +3.59% | +5.45% | -5.86% | -6.15% | +2.39% | +3.61% | +5.47% | +6.89% | +3.90% | +0.45% |
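The "Ours (best) vs. ERM" rows appear to be the ERM error minus the lowest error among L-MDS, F-MDS, and L-MDS + F-MDS, so positive values are improvements. The UTKFace and SkyFinder rows report an absolute difference, and the PovertyMap row reports the difference relative to ERM, in percent. The check below reproduces three entries from the tables; the helper name `gain_vs_erm` is hypothetical.

```python
def gain_vs_erm(erm, ours, relative=False):
    """ERM error minus the best (lowest) error among our variants."""
    gain = erm - min(ours)
    return 100.0 * gain / erm if relative else gain

# UTKFace, Overall: ERM 7.39; L-MDS 7.30, F-MDS 7.22, L-MDS + F-MDS 7.42
utk_overall = round(gain_vs_erm(7.39, [7.30, 7.22, 7.42]), 2)

# PovertyMap, Overall: ERM 0.504; ours 0.486, 0.488, 0.492 (relative, in %)
pov_overall = round(gain_vs_erm(0.504, [0.486, 0.488, 0.492], relative=True), 2)

# PovertyMap, Many Avg: ERM 0.256; ours 0.271, 0.278, 0.352 (relative, in %)
pov_many = round(gain_vs_erm(0.256, [0.271, 0.278, 0.352], relative=True), 2)

print(utk_overall, pov_overall, pov_many)  # 0.17 3.57 -5.86
```

A negative entry, as in the Many Avg column, means that ERM outperforms all three variants on that split.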

CodeNet — CPU Runtime Regression

Test error overall, by attribute (Average / Worst), and by shot (Many / Medium / Few, each Avg / Worst).

| Algorithm | Overall | Attribute Average | Attribute Worst | Many Avg | Many Worst | Medium Avg | Medium Worst | Few Avg | Few Worst |
|---|---|---|---|---|---|---|---|---|---|
| ERM | 268.7 ±2.8 | 269.0 ±2.6 | 350.3 ±11.0 | 165.4 ±2.8 | 228.7 ±8.9 | 268.8 ±3.6 | 398.4 ±13.1 | 529.8 ±5.2 | 711.3 ±21.2 |
| ReWeight | 253.7 ±2.8 | 253.5 ±2.5 | 306.3 ±9.1 | 179.4 ±3.4 | 253.6 ±13.0 | 249.4 ±3.6 | 374.3 ±19.0 | 463.4 ±6.2 | 616.5 ±18.3 |
| SqrtReWeight | 248.2 ±2.5 | 248.3 ±2.6 | 299.4 ±8.9 | 179.2 ±3.1 | 247.2 ±11.5 | 242.1 ±3.5 | 328.2 ±12.4 | 444.6 ±6.3 | 609.0 ±23.5 |
| CBLoss | 251.9 ±2.6 | 251.8 ±2.6 | 301.3 ±9.6 | 161.3 ±2.8 | 229.2 ±11.5 | 253.9 ±3.5 | 333.9 ±15.2 | 472.9 ±5.9 | 624.0 ±16.2 |
| DANN | 276.0 ±2.8 | 276.4 ±2.6 | 348.8 ±11.0 | 148.3 ±2.6 | 228.9 ±10.7 | 292.9 ±3.4 | 427.0 ±15.2 | 551.4 ±4.7 | 716.9 ±16.3 |
| LDS | 263.1 ±2.7 | 263.3 ±2.8 | 322.4 ±9.5 | 178.6 ±3.3 | 275.6 ±13.1 | 261.4 ±3.7 | 365.5 ±12.1 | 484.6 ±6.4 | 686.3 ±24.4 |
| L-MDS | 243.4 ±2.7 | 243.1 ±2.7 | 299.0 ±9.3 | 163.7 ±3.3 | 257.7 ±14.0 | 245.0 ±3.9 | 321.8 ±11.4 | 440.2 ±6.5 | 623.4 ±16.1 |
| F-MDS | 250.5 ±2.6 | 250.4 ±2.5 | 287.2 ±8.7 | 196.0 ±3.3 | 279.6 ±10.4 | 235.6 ±3.5 | 306.0 ±11.9 | 429.0 ±2.6 | 592.2 ±25.7 |
| L-MDS + F-MDS | 249.4 ±2.5 | 249.2 ±2.6 | 299.5 ±10.4 | 205.4 ±3.4 | 309.0 ±12.7 | 231.2 ±3.3 | 306.0 ±12.0 | 413.5 ±6.1 | 622.6 ±18.2 |
| Ours (best) vs. ERM | +25.3 | +25.9 | +63.1 | +1.7 | -29.0 | +37.6 | +92.4 | +116.3 | +119.1 |

Paper

BibTeX

@article{xu2026shortcut,
  title   = {Shortcut to Nowhere: Demystifying Deep Spurious Regression},
  author  = {Xu, Guanrong and Li, Jessica and Wang, Hao and Yang, Yuzhe},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}