Shortcut to Nowhere: Demystifying Deep Spurious Regression

Guanrong Xu¹, Jessica Li¹, Hao Wang², Yuzhe Yang¹

¹University of California, Los Angeles | ²Rutgers University

Figure: From aligned to misaligned spurious attributes. Varying the similarity of target distributions across attributes reveals how learned feature-space embeddings reflect distributional relationships, motivating MDS-based information sharing across related attributes.

📰 News

2026-06-01 Paper released on arXiv!

2026-05-27 Project website is live!

2026-05-26 Code released on GitHub.

Abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training yet unreliable under deployment shifts. Models that regress targets through such shortcuts may fail catastrophically at test time.

Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group–label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute–label confounding, addressing continuous spurious correlations, and generalizing to all attribute–label combinations at test time.

Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on real-world DSR datasets spanning computer vision, environmental sensing, and LLM regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

Our Methods

Label MDS (L-MDS)

L-MDS computes pairwise Wasserstein distances between the per-attribute target distributions and builds a cross-group kernel from them via Multidimensional Scaling (MDS). It uses this cross-group similarity to reweight samples, correcting for spurious attribute–target confounding in the label space. The weights are computed once at dataset construction, so the method adds no training overhead.
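The label-space pipeline described above (pairwise Wasserstein distances, MDS, cross-group kernel, sample weights) can be sketched roughly as follows. This is a minimal illustration, not the released implementation: the quantile-based distance approximation, classical (Torgerson) MDS, the Gaussian kernel with its `bandwidth`, and the inverse kernel-smoothed group count used as the weight are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b, n_quantiles=101):
    """Approximate the 1-D Wasserstein-1 distance by comparing empirical quantiles."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    return float(np.mean(np.abs(np.quantile(a, q) - np.quantile(b, q))))

def classical_mds(dist, n_components=2):
    """Classical MDS: embed points so Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def label_space_weights(targets, attrs, bandwidth=1.0):
    """Static per-sample weights from label-space similarity between attribute groups."""
    groups = np.unique(attrs)
    idx = np.searchsorted(groups, attrs)  # group index of every sample
    k = len(groups)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = wasserstein_1d(targets[idx == i], targets[idx == j])
    emb = classical_mds(dist, n_components=min(2, k - 1))
    sq = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq / (2.0 * bandwidth ** 2))  # assumed Gaussian cross-group kernel
    counts = np.bincount(idx, minlength=k).astype(float)
    group_w = 1.0 / (kernel @ counts)  # assumed rule: inverse kernel-smoothed group size
    w = group_w[idx]
    return w / w.mean(), kernel

# Toy data: groups 0 and 1 have similar target distributions, group 2 does not.
rng = np.random.default_rng(0)
targets = np.concatenate([rng.normal(0.0, 1.0, 500),
                          rng.normal(0.1, 1.0, 50),
                          rng.normal(5.0, 1.0, 50)])
attrs = np.repeat([0, 1, 2], [500, 50, 50])
weights, kernel = label_space_weights(targets, attrs)
```

In this toy setup the small group whose targets resemble the large group borrows its effective sample size through the kernel, while the small, dissimilar group does not and is therefore up-weighted the most. Because the weights depend only on targets and attributes, they can be computed once before training.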

Feature MDS (F-MDS)

F-MDS extracts model representations during training, computes per-attribute centroids, and builds a cross-group kernel via MDS to dynamically update sample weights. Unlike L-MDS, F-MDS operates in feature space and continuously adapts as the model's representations evolve throughout training.
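A feature-space counterpart can be sketched in the same spirit: recompute per-attribute centroids from the current representations, embed their pairwise distances with MDS, and refresh the weights. As before, this is only a sketch under assumptions: Euclidean centroid distances, classical MDS, a Gaussian kernel with a fixed `bandwidth`, and the inverse kernel-smoothed group count as the weight are illustrative choices, and `feature_space_weights` is a hypothetical name. The two feature arrays stand in for representations extracted at two different points in training.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical MDS: embed points so Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def feature_space_weights(features, attrs, bandwidth=1.0):
    """Per-sample weights from the similarity of per-attribute feature centroids."""
    groups = np.unique(attrs)
    idx = np.searchsorted(groups, attrs)  # group index of every sample
    k = len(groups)
    centroids = np.stack([features[idx == g].mean(axis=0) for g in range(k)])
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    emb = classical_mds(dist, n_components=min(2, k - 1))
    sq = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq / (2.0 * bandwidth ** 2))  # assumed Gaussian cross-group kernel
    counts = np.bincount(idx, minlength=k).astype(float)
    group_w = 1.0 / (kernel @ counts)  # assumed rule: inverse kernel-smoothed group size
    w = group_w[idx]
    return w / w.mean(), kernel

rng = np.random.default_rng(0)
attrs = np.repeat([0, 1, 2], [200, 20, 20])
noise = 0.1 * rng.normal(size=(attrs.size, 4))

# Snapshot A: groups 0 and 1 nearly coincide in feature space, group 2 is far away.
centers_a = np.array([[0.0] * 4, [0.1] * 4, [5.0] * 4])
w_a, k_a = feature_space_weights(centers_a[attrs] + noise, attrs)

# Snapshot B: later in training, all three groups are well separated.
centers_b = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
w_b, k_b = feature_space_weights(centers_b[attrs] + noise, attrs)
```

In a training loop, a function like this would be called periodically (for example once per epoch) on freshly extracted features, so the weights track the representation as it changes; the two snapshots above yield different weights for the same samples.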

L-MDS + F-MDS

The two techniques can be combined for full static + dynamic cross-group correction, leveraging both label-space and feature-space signals simultaneously.

Benchmarks

We curate four real-world DSR datasets spanning computer vision, environmental sensing, and large language models (LLMs), and compare against a comprehensive set of baselines:

ERM, Resample, SqrtReWeight, ReWeight, CBLoss, DANN, Rank-N-Contrast (RnC), Label Distribution Smoothing (LDS), GroupDRO

Dataset	Task	Spurious Attribute
UTKFace	Age regression	Race
SkyFinder	Temperature regression	Camera Location
PovertyMap	Poverty index regression	Country
CodeNet	CPU runtime regression	Coding Language

Results

UTKFace — Age Regression

Algorithm	Overall	By attribute (Average)	By attribute (Worst)	Many-shot (Avg)	Many-shot (Worst)	Medium-shot (Avg)	Medium-shot (Worst)	Few-shot (Avg)	Few-shot (Worst)	Zero-shot (Avg)	Zero-shot (Worst)
ERM	7.39 ±0.1	7.26 ±0.1	9.19 ±0.2	4.34 ±0.2	18.61 ±2.3	6.24 ±0.1	19.18 ±0.7	7.19 ±0.1	13.16 ±1.9	9.83 ±0.2	73.56 ±7.3
Resample	7.64 ±0.1	7.50 ±0.1	9.40 ±0.3	4.72 ±0.1	19.71 ±1.2	6.20 ±0.1	17.12 ±0.9	7.00 ±0.3	13.03 ±1.4	10.43 ±0.4	76.58 ±7.1
SqrtReWeight	7.31 ±0.1	7.18 ±0.1	9.02 ±0.3	4.48 ±0.2	16.30 ±2.3	6.10 ±0.1	16.01 ±0.8	6.90 ±0.1	13.18 ±1.1	9.78 ±0.4	62.10 ±8.2
ReWeight	8.42 ±0.1	8.25 ±0.1	9.62 ±0.2	6.21 ±0.1	17.82 ±0.9	7.06 ±0.1	18.00 ±1.4	7.65 ±0.1	14.78 ±1.4	10.88 ±0.3	86.59 ±4.6
CBLoss	8.37 ±0.1	8.19 ±0.1	9.61 ±0.1	6.16 ±0.1	18.55 ±1.8	6.98 ±0.1	18.18 ±1.6	7.56 ±0.2	15.23 ±1.3	10.86 ±0.2	87.54 ±2.1
DANN	7.97 ±0.1	7.82 ±0.1	9.69 ±0.2	4.63 ±0.2	20.62 ±1.4	6.65 ±0.1	20.32 ±0.5	7.88 ±0.1	15.90 ±1.1	10.67 ±0.3	76.86 ±4.7
RnC	7.38 ±0.1	7.25 ±0.1	9.22 ±0.2	4.35 ±0.1	19.31 ±1.1	6.15 ±0.0	17.64 ±1.3	7.12 ±0.1	12.55 ±0.9	9.91 ±0.2	63.69 ±6.2
LDS	7.23 ±0.1	7.09 ±0.1	8.90 ±0.1	4.58 ±0.1	16.59 ±1.9	6.12 ±0.0	16.36 ±0.6	7.01 ±0.2	13.22 ±0.6	9.46 ±0.3	71.49 ±10.6
GroupDRO	7.43 ±0.1	7.29 ±0.1	9.02 ±0.1	4.94 ±0.1	16.69 ±1.7	6.13 ±0.1	17.78 ±1.5	6.86 ±0.2	12.71 ±0.9	9.90 ±0.1	71.79 ±11.7
L-MDS	7.30 ±0.1	7.17 ±0.1	8.94 ±0.2	4.66 ±0.2	19.40 ±2.1	6.03 ±0.1	17.62 ±1.0	6.79 ±0.2	12.51 ±1.2	9.79 ±0.2	76.32 ±6.2
F-MDS	7.22 ±0.1	7.08 ±0.1	8.71 ±0.2	4.65 ±0.2	17.49 ±1.9	6.08 ±0.1	17.91 ±0.2	6.71 ±0.2	12.43 ±1.4	9.54 ±0.4	68.81 ±8.1
L-MDS + F-MDS	7.42 ±0.1	7.29 ±0.1	9.04 ±0.2	4.55 ±0.2	17.55 ±1.8	6.15 ±0.1	17.73 ±1.8	7.02 ±0.1	12.88 ±0.6	9.96 ±0.3	77.43 ±9.3
Ours (best) vs. ERM (error reduction)	+0.17	+0.18	+0.48	-0.21	+1.12	+0.21	+1.56	+0.48	+0.73	+0.29	+4.75

SkyFinder — Temperature Regression

Algorithm	Overall	By attribute (Average)	By attribute (Worst)	Many-shot (Avg)	Many-shot (Worst)	Medium-shot (Avg)	Medium-shot (Worst)	Few-shot (Avg)	Few-shot (Worst)	Zero-shot (Avg)	Zero-shot (Worst)
ERM	3.68 ±0.0	3.41 ±0.0	5.95 ±0.1	2.27 ±0.0	6.45 ±0.6	2.94 ±0.0	12.21 ±1.2	4.49 ±0.0	25.08 ±0.9	5.22 ±0.0	29.78 ±1.8
Resample	3.62 ±0.0	3.35 ±0.0	5.76 ±0.1	2.71 ±0.1	7.28 ±0.3	3.00 ±0.0	13.37 ±1.2	4.23 ±0.0	19.23 ±0.6	4.97 ±0.1	36.23 ±4.0
SqrtReWeight	3.53 ±0.0	3.26 ±0.0	5.85 ±0.1	2.34 ±0.1	6.91 ±0.8	2.95 ±0.0	11.69 ±0.8	4.17 ±0.0	23.76 ±1.0	4.75 ±0.1	32.65 ±1.7
ReWeight	4.25 ±0.0	3.91 ±0.0	7.13 ±0.2	4.03 ±0.1	10.55 ±0.8	3.88 ±0.0	15.84 ±1.3	4.52 ±0.0	21.90 ±0.4	5.13 ±0.1	33.09 ±1.6
CBLoss	4.23 ±0.1	3.90 ±0.1	7.24 ±0.2	4.07 ±0.2	10.39 ±0.8	3.86 ±0.1	15.04 ±2.1	4.50 ±0.1	20.16 ±0.6	5.12 ±0.1	30.88 ±1.0
DANN	4.04 ±0.1	3.76 ±0.1	6.75 ±0.2	2.57 ±0.0	7.98 ±0.7	3.32 ±0.1	12.60 ±1.3	4.83 ±0.1	24.84 ±0.4	5.56 ±0.1	31.02 ±1.6
RnC	3.49 ±0.0	3.24 ±0.0	5.69 ±0.1	2.42 ±0.0	7.21 ±0.5	2.90 ±0.1	11.99 ±0.8	4.14 ±0.0	19.65 ±1.2	4.71 ±0.1	30.86 ±2.2
LDS	3.85 ±0.1	3.56 ±0.0	6.44 ±0.4	2.39 ±0.0	7.95 ±0.6	3.12 ±0.0	13.51 ±1.0	4.69 ±0.1	21.71 ±0.9	5.26 ±0.1	33.98 ±2.8
GroupDRO	3.62 ±0.0	3.35 ±0.0	6.00 ±0.1	2.34 ±0.0	6.64 ±0.5	2.90 ±0.0	12.43 ±1.1	4.42 ±0.1	25.07 ±1.4	5.04 ±0.0	29.66 ±1.5
L-MDS	3.54 ±0.0	3.27 ±0.0	5.81 ±0.2	2.38 ±0.0	6.86 ±0.6	2.95 ±0.0	11.63 ±1.0	4.17 ±0.0	23.63 ±0.7	4.78 ±0.0	31.47 ±2.3
F-MDS	3.56 ±0.0	3.29 ±0.0	5.81 ±0.2	2.33 ±0.1	6.44 ±0.3	2.97 ±0.0	11.86 ±0.4	4.22 ±0.0	21.40 ±1.1	4.74 ±0.0	30.47 ±2.0
L-MDS + F-MDS	3.58 ±0.0	3.30 ±0.0	5.78 ±0.1	2.39 ±0.1	6.28 ±0.6	2.97 ±0.0	12.18 ±0.6	4.23 ±0.1	22.30 ±1.3	4.82 ±0.1	32.99 ±2.8
Ours (best) vs. ERM (error reduction)	+0.14	+0.14	+0.17	-0.06	+0.17	-0.01	+0.58	+0.32	+3.68	+0.48	-0.69

PovertyMap — Poverty Index Regression

Algorithm	Overall	By attribute (Average)	By attribute (Worst)	Many-shot (Avg)	Many-shot (Worst)	Medium-shot (Avg)	Medium-shot (Worst)	Few-shot (Avg)	Few-shot (Worst)	Zero-shot (Avg)	Zero-shot (Worst)
ERM	0.504 ±0.0	0.502 ±0.0	0.679 ±0.0	0.256 ±0.0	0.504 ±0.1	0.335 ±0.0	1.356 ±0.1	0.494 ±0.0	2.452 ±0.1	0.744 ±0.0	1.996 ±0.1
Resample	0.506 ±0.0	0.503 ±0.0	0.710 ±0.0	0.385 ±0.0	0.781 ±0.2	0.391 ±0.0	1.383 ±0.1	0.463 ±0.0	2.247 ±0.1	0.737 ±0.0	2.019 ±0.1
SqrtReWeight	0.512 ±0.0	0.509 ±0.0	0.670 ±0.0	0.375 ±0.0	0.679 ±0.2	0.376 ±0.0	1.441 ±0.1	0.478 ±0.0	2.233 ±0.1	0.753 ±0.0	2.037 ±0.1
ReWeight	0.522 ±0.0	0.520 ±0.0	0.750 ±0.0	0.485 ±0.1	0.805 ±0.1	0.431 ±0.0	1.426 ±0.1	0.464 ±0.0	2.088 ±0.2	0.748 ±0.0	2.012 ±0.1
CBLoss	0.515 ±0.0	0.513 ±0.0	0.720 ±0.0	0.450 ±0.0	0.856 ±0.2	0.420 ±0.0	1.447 ±0.1	0.467 ±0.0	2.142 ±0.1	0.729 ±0.0	2.029 ±0.1
DANN	0.689 ±0.1	0.685 ±0.1	0.869 ±0.0	0.796 ±0.1	0.996 ±0.1	0.574 ±0.1	1.638 ±0.1	0.598 ±0.1	1.926 ±0.1	1.003 ±0.1	2.191 ±0.1
RnC	0.494 ±0.0	0.490 ±0.0	0.675 ±0.0	0.304 ±0.0	0.559 ±0.1	0.290 ±0.0	1.103 ±0.1	0.486 ±0.0	2.320 ±0.1	0.773 ±0.0	2.153 ±0.2
LDS	0.501 ±0.0	0.499 ±0.0	0.712 ±0.0	0.331 ±0.0	0.717 ±0.1	0.336 ±0.0	1.458 ±0.1	0.501 ±0.0	2.276 ±0.1	0.714 ±0.0	2.049 ±0.1
GroupDRO	0.492 ±0.0	0.489 ±0.0	0.648 ±0.0	0.376 ±0.1	0.844 ±0.2	0.319 ±0.0	1.245 ±0.1	0.470 ±0.0	2.382 ±0.1	0.757 ±0.0	2.016 ±0.1
L-MDS	0.486 ±0.0	0.484 ±0.0	0.666 ±0.0	0.271 ±0.0	0.535 ±0.2	0.336 ±0.0	1.417 ±0.1	0.467 ±0.0	2.385 ±0.1	0.720 ±0.0	1.987 ±0.1
F-MDS	0.488 ±0.0	0.485 ±0.0	0.670 ±0.0	0.278 ±0.0	0.554 ±0.1	0.327 ±0.0	1.307 ±0.1	0.477 ±0.0	2.492 ±0.1	0.719 ±0.0	2.057 ±0.0
L-MDS + F-MDS	0.492 ±0.0	0.490 ±0.0	0.642 ±0.0	0.352 ±0.1	0.834 ±0.1	0.332 ±0.0	1.369 ±0.1	0.483 ±0.0	2.283 ±0.2	0.715 ±0.0	2.015 ±0.1
Ours (best) vs. ERM (relative error reduction, %)	+3.57%	+3.59%	+5.45%	-5.86%	-6.15%	+2.39%	+3.61%	+5.47%	+6.89%	+3.90%	+0.45%

CodeNet — CPU Runtime Regression

Algorithm	Overall	By attribute (Average)	By attribute (Worst)	Many-shot (Avg)	Many-shot (Worst)	Medium-shot (Avg)	Medium-shot (Worst)	Few-shot (Avg)	Few-shot (Worst)
ERM	268.7 ±2.8	269.0 ±2.6	350.3 ±11.0	165.4 ±2.8	228.7 ±8.9	268.8 ±3.6	398.4 ±13.1	529.8 ±5.2	711.3 ±21.2
ReWeight	253.7 ±2.8	253.5 ±2.5	306.3 ±9.1	179.4 ±3.4	253.6 ±13.0	249.4 ±3.6	374.3 ±19.0	463.4 ±6.2	616.5 ±18.3
SqrtReWeight	248.2 ±2.5	248.3 ±2.6	299.4 ±8.9	179.2 ±3.1	247.2 ±11.5	242.1 ±3.5	328.2 ±12.4	444.6 ±6.3	609.0 ±23.5
CBLoss	251.9 ±2.6	251.8 ±2.6	301.3 ±9.6	161.3 ±2.8	229.2 ±11.5	253.9 ±3.5	333.9 ±15.2	472.9 ±5.9	624.0 ±16.2
DANN	276.0 ±2.8	276.4 ±2.6	348.8 ±11.0	148.3 ±2.6	228.9 ±10.7	292.9 ±3.4	427.0 ±15.2	551.4 ±4.7	716.9 ±16.3
LDS	263.1 ±2.7	263.3 ±2.8	322.4 ±9.5	178.6 ±3.3	275.6 ±13.1	261.4 ±3.7	365.5 ±12.1	484.6 ±6.4	686.3 ±24.4
L-MDS	243.4 ±2.7	243.1 ±2.7	299.0 ±9.3	163.7 ±3.3	257.7 ±14.0	245.0 ±3.9	321.8 ±11.4	440.2 ±6.5	623.4 ±16.1
F-MDS	250.5 ±2.6	250.4 ±2.5	287.2 ±8.7	196.0 ±3.3	279.6 ±10.4	235.6 ±3.5	306.0 ±11.9	429.0 ±2.6	592.2 ±25.7
L-MDS + F-MDS	249.4 ±2.5	249.2 ±2.6	299.5 ±10.4	205.4 ±3.4	309.0 ±12.7	231.2 ±3.3	306.0 ±12.0	413.5 ±6.1	622.6 ±18.2
Ours (best) vs. ERM (error reduction)	+25.3	+25.9	+63.1	+1.7	-29.0	+37.6	+92.4	+116.3	+119.1
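
The final row of each results table is consistent with the following reading, which is inferred from the reported numbers rather than stated on the page: for every column, take the lowest error among L-MDS, F-MDS, and L-MDS + F-MDS, and subtract it from the ERM error (divided by the ERM error for the PovertyMap percentages). Positive values therefore mean lower error than ERM. The helper name `gain_vs_erm` is hypothetical; the inputs are copied from the tables.

```python
def gain_vs_erm(erm_error, our_errors, relative=False):
    """Error reduction of the best of our variants relative to ERM.

    Positive values mean the best variant has lower error than ERM.
    With relative=True the reduction is a percentage of the ERM error.
    """
    gain = erm_error - min(our_errors)
    return 100.0 * gain / erm_error if relative else gain

# Inputs are (ERM, [L-MDS, F-MDS, L-MDS + F-MDS]) for one table column.
utkface_overall = gain_vs_erm(7.39, [7.30, 7.22, 7.42])
utkface_many_avg = gain_vs_erm(4.34, [4.66, 4.65, 4.55])
povertymap_overall = gain_vs_erm(0.504, [0.486, 0.488, 0.492], relative=True)
codenet_few_worst = gain_vs_erm(711.3, [623.4, 592.2, 622.6])

print(f"{utkface_overall:+.2f}")      # +0.17
print(f"{utkface_many_avg:+.2f}")     # -0.21
print(f"{povertymap_overall:+.2f}%")  # +3.57%
print(f"{codenet_few_worst:+.1f}")    # +119.1
```

Note that the best variant is chosen per column, so the row does not describe any single configuration.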
