Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training yet unreliable under deployment shifts. Models that regress targets on such shortcuts may fail catastrophically at test time.
Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group–label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute–label confounding, addressing continuous spurious correlations, and generalizing to all attribute–label combinations at test time.
Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on real-world DSR datasets spanning computer vision, environmental sensing, and LLM regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.
L-MDS builds a cross-group kernel from pairwise Wasserstein distances between attribute–target distributions via Multidimensional Scaling (MDS). It uses the resulting cross-group similarity to reweight samples, correcting for spurious attribute–target confounding in the label space. It is applied once at dataset construction, so it adds no training overhead.
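The label-space reweighting described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the quantile-based 1-D Wasserstein estimate, classical (Torgerson) MDS, the Gaussian kernel with a `bandwidth` parameter, and the inverse kernel-smoothed group count used as the weight are all choices made for this example, and every function name is hypothetical.

```python
import numpy as np

def wasserstein_1d(u, v, n_quantiles=101):
    """Approximate the 1-D Wasserstein-1 distance by comparing empirical quantiles."""
    q = np.linspace(0.0, 1.0, n_quantiles)
    return float(np.mean(np.abs(np.quantile(u, q) - np.quantile(v, q))))

def classical_mds(dist, n_components=2):
    """Classical MDS: coordinates whose Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def label_space_weights(targets, attrs, bandwidth=1.0):
    """Per-sample weights from cross-group similarity of target distributions."""
    targets, attrs = np.asarray(targets, dtype=float), np.asarray(attrs)
    groups = np.unique(attrs)
    per_group = [targets[attrs == g] for g in groups]
    k = len(groups)
    # Pairwise Wasserstein distances between per-attribute target distributions.
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            dist[i, j] = dist[j, i] = wasserstein_1d(per_group[i], per_group[j])
    # Embed the groups with MDS, then build a Gaussian cross-group kernel (assumption).
    coords = classical_mds(dist)
    sq = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq / (2.0 * bandwidth ** 2))
    # Assumed weighting rule: inverse of the kernel-smoothed group size.
    counts = np.array([len(t) for t in per_group], dtype=float)
    group_weight = 1.0 / (kernel @ counts)
    weights = group_weight[np.searchsorted(groups, attrs)]
    return weights / weights.mean()

# Toy data: two large groups with similar targets, one small group far away.
rng = np.random.default_rng(0)
targets = np.concatenate([rng.normal(0, 1, 200), rng.normal(0.1, 1, 200), rng.normal(5, 1, 20)])
attrs = np.repeat([0, 1, 2], [200, 200, 20])
weights = label_space_weights(targets, attrs)
print(weights[attrs == 0][0], weights[attrs == 2][0])
```

Because the weights depend only on targets and attributes, they can be computed once and stored with the dataset, which matches the "no training overhead" property stated above.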
F-MDS extracts model representations during training, computes per-attribute centroids, and builds a cross-group kernel via MDS to dynamically update sample weights. Unlike L-MDS, F-MDS operates in feature space and adapts continuously as the model's representations evolve during training.
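A feature-space counterpart might look like the sketch below, which would be called periodically (for example once per epoch) on the current representations. It is illustrative only: Euclidean distances between centroids, classical MDS, the Gaussian kernel, and the inverse kernel-smoothed group count are assumptions made for this example, and `mds_kernel` and `feature_space_weights` are hypothetical names.

```python
import numpy as np

def mds_kernel(points, n_components=2, bandwidth=1.0):
    """Pairwise distances -> classical MDS coordinates -> Gaussian kernel."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    n = len(points)
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    coords = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
    sq = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def feature_space_weights(features, attrs, bandwidth=1.0):
    """Recompute per-sample weights from the current per-attribute centroids."""
    features, attrs = np.asarray(features, dtype=float), np.asarray(attrs)
    groups = np.unique(attrs)
    centroids = np.stack([features[attrs == g].mean(axis=0) for g in groups])
    kernel = mds_kernel(centroids, bandwidth=bandwidth)
    counts = np.array([(attrs == g).sum() for g in groups], dtype=float)
    # Assumed weighting rule: inverse of the kernel-smoothed group size.
    group_weight = 1.0 / (kernel @ counts)
    weights = group_weight[np.searchsorted(groups, attrs)]
    return weights / weights.mean()

# Stand-in for representations extracted from a model at some epoch.
rng = np.random.default_rng(0)
features = np.concatenate([
    rng.normal(0.0, 1.0, (200, 8)),
    rng.normal(0.1, 1.0, (200, 8)),
    rng.normal(5.0, 1.0, (20, 8)),
])
attrs = np.repeat([0, 1, 2], [200, 200, 20])
weights = feature_space_weights(features, attrs)
print(weights[attrs == 0][0], weights[attrs == 2][0])
```

In a training loop, the returned weights would replace the previous ones in the weighted loss, so the correction tracks the representations as they change.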
The two techniques can be combined for full static + dynamic cross-group correction, leveraging both label-space and feature-space signals simultaneously.
We curate four real-world DSR datasets spanning computer vision, environmental sensing, and large language models (LLMs), and compare against a comprehensive set of baselines:
| Dataset | Task | Spurious Attribute |
|---|---|---|
| UTKFace | Age regression | Race |
| SkyFinder | Temperature regression | Camera Location |
| PovertyMap | Poverty index regression | Country |
| CodeNet | CPU runtime regression | Coding Language |
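
The results report test error overall and broken down by spurious attribute (average and worst). A hedged sketch of how such numbers could be computed is shown below. It assumes mean absolute error as the metric and a macro-average over attribute groups; neither is stated here, and `grouped_errors` is a hypothetical helper.

```python
import numpy as np

def grouped_errors(y_true, y_pred, attrs):
    """Overall error plus average and worst error across attribute groups."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    attrs = np.asarray(attrs)
    # One mean absolute error per spurious-attribute group (assumed metric).
    per_group = {g: float(abs_err[attrs == g].mean()) for g in np.unique(attrs)}
    return {
        "overall": float(abs_err.mean()),
        "attr_avg": float(np.mean(list(per_group.values()))),
        "attr_worst": max(per_group.values()),
    }

result = grouped_errors(
    y_true=[10, 20, 30, 40],
    y_pred=[12, 20, 30, 50],
    attrs=["a", "a", "a", "b"],
)
print(result)
```

In this toy case the single sample in group `b` dominates the worst-group error while barely moving the overall error, which is why the worst-group columns are reported separately.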

| Algorithm | Overall | By attribute (Average) | By attribute (Worst) | By shot: Many (Avg) | By shot: Many (Worst) | By shot: Medium (Avg) | By shot: Medium (Worst) | By shot: Few (Avg) | By shot: Few (Worst) | By shot: Zero (Avg) | By shot: Zero (Worst) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 7.39 ±0.1 | 7.26 ±0.1 | 9.19 ±0.2 | 4.34 ±0.2 | 18.61 ±2.3 | 6.24 ±0.1 | 19.18 ±0.7 | 7.19 ±0.1 | 13.16 ±1.9 | 9.83 ±0.2 | 73.56 ±7.3 |
| Resample | 7.64 ±0.1 | 7.50 ±0.1 | 9.40 ±0.3 | 4.72 ±0.1 | 19.71 ±1.2 | 6.20 ±0.1 | 17.12 ±0.9 | 7.00 ±0.3 | 13.03 ±1.4 | 10.43 ±0.4 | 76.58 ±7.1 |
| SqrtReWeight | 7.31 ±0.1 | 7.18 ±0.1 | 9.02 ±0.3 | 4.48 ±0.2 | 16.30 ±2.3 | 6.10 ±0.1 | 16.01 ±0.8 | 6.90 ±0.1 | 13.18 ±1.1 | 9.78 ±0.4 | 62.10 ±8.2 |
| ReWeight | 8.42 ±0.1 | 8.25 ±0.1 | 9.62 ±0.2 | 6.21 ±0.1 | 17.82 ±0.9 | 7.06 ±0.1 | 18.00 ±1.4 | 7.65 ±0.1 | 14.78 ±1.4 | 10.88 ±0.3 | 86.59 ±4.6 |
| CBLoss | 8.37 ±0.1 | 8.19 ±0.1 | 9.61 ±0.1 | 6.16 ±0.1 | 18.55 ±1.8 | 6.98 ±0.1 | 18.18 ±1.6 | 7.56 ±0.2 | 15.23 ±1.3 | 10.86 ±0.2 | 87.54 ±2.1 |
| DANN | 7.97 ±0.1 | 7.82 ±0.1 | 9.69 ±0.2 | 4.63 ±0.2 | 20.62 ±1.4 | 6.65 ±0.1 | 20.32 ±0.5 | 7.88 ±0.1 | 15.90 ±1.1 | 10.67 ±0.3 | 76.86 ±4.7 |
| RnC | 7.38 ±0.1 | 7.25 ±0.1 | 9.22 ±0.2 | 4.35 ±0.1 | 19.31 ±1.1 | 6.15 ±0.0 | 17.64 ±1.3 | 7.12 ±0.1 | 12.55 ±0.9 | 9.91 ±0.2 | 63.69 ±6.2 |
| LDS | 7.23 ±0.1 | 7.09 ±0.1 | 8.90 ±0.1 | 4.58 ±0.1 | 16.59 ±1.9 | 6.12 ±0.0 | 16.36 ±0.6 | 7.01 ±0.2 | 13.22 ±0.6 | 9.46 ±0.3 | 71.49 ±10.6 |
| GroupDRO | 7.43 ±0.1 | 7.29 ±0.1 | 9.02 ±0.1 | 4.94 ±0.1 | 16.69 ±1.7 | 6.13 ±0.1 | 17.78 ±1.5 | 6.86 ±0.2 | 12.71 ±0.9 | 9.90 ±0.1 | 71.79 ±11.7 |
| L-MDS | 7.30 ±0.1 | 7.17 ±0.1 | 8.94 ±0.2 | 4.66 ±0.2 | 19.40 ±2.1 | 6.03 ±0.1 | 17.62 ±1.0 | 6.79 ±0.2 | 12.51 ±1.2 | 9.79 ±0.2 | 76.32 ±6.2 |
| F-MDS | 7.22 ±0.1 | 7.08 ±0.1 | 8.71 ±0.2 | 4.65 ±0.2 | 17.49 ±1.9 | 6.08 ±0.1 | 17.91 ±0.2 | 6.71 ±0.2 | 12.43 ±1.4 | 9.54 ±0.4 | 68.81 ±8.1 |
| L-MDS + F-MDS | 7.42 ±0.1 | 7.29 ±0.1 | 9.04 ±0.2 | 4.55 ±0.2 | 17.55 ±1.8 | 6.15 ±0.1 | 17.73 ±1.8 | 7.02 ±0.1 | 12.88 ±0.6 | 9.96 ±0.3 | 77.43 ±9.3 |
| Ours (best) vs. ERM | +0.17 | +0.18 | +0.48 | -0.21 | +1.12 | +0.21 | +1.56 | +0.48 | +0.73 | +0.29 | +4.75 |

| Algorithm | Overall | By attribute (Average) | By attribute (Worst) | By shot: Many (Avg) | By shot: Many (Worst) | By shot: Medium (Avg) | By shot: Medium (Worst) | By shot: Few (Avg) | By shot: Few (Worst) | By shot: Zero (Avg) | By shot: Zero (Worst) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 3.68 ±0.0 | 3.41 ±0.0 | 5.95 ±0.1 | 2.27 ±0.0 | 6.45 ±0.6 | 2.94 ±0.0 | 12.21 ±1.2 | 4.49 ±0.0 | 25.08 ±0.9 | 5.22 ±0.0 | 29.78 ±1.8 |
| Resample | 3.62 ±0.0 | 3.35 ±0.0 | 5.76 ±0.1 | 2.71 ±0.1 | 7.28 ±0.3 | 3.00 ±0.0 | 13.37 ±1.2 | 4.23 ±0.0 | 19.23 ±0.6 | 4.97 ±0.1 | 36.23 ±4.0 |
| SqrtReWeight | 3.53 ±0.0 | 3.26 ±0.0 | 5.85 ±0.1 | 2.34 ±0.1 | 6.91 ±0.8 | 2.95 ±0.0 | 11.69 ±0.8 | 4.17 ±0.0 | 23.76 ±1.0 | 4.75 ±0.1 | 32.65 ±1.7 |
| ReWeight | 4.25 ±0.0 | 3.91 ±0.0 | 7.13 ±0.2 | 4.03 ±0.1 | 10.55 ±0.8 | 3.88 ±0.0 | 15.84 ±1.3 | 4.52 ±0.0 | 21.90 ±0.4 | 5.13 ±0.1 | 33.09 ±1.6 |
| CBLoss | 4.23 ±0.1 | 3.90 ±0.1 | 7.24 ±0.2 | 4.07 ±0.2 | 10.39 ±0.8 | 3.86 ±0.1 | 15.04 ±2.1 | 4.50 ±0.1 | 20.16 ±0.6 | 5.12 ±0.1 | 30.88 ±1.0 |
| DANN | 4.04 ±0.1 | 3.76 ±0.1 | 6.75 ±0.2 | 2.57 ±0.0 | 7.98 ±0.7 | 3.32 ±0.1 | 12.60 ±1.3 | 4.83 ±0.1 | 24.84 ±0.4 | 5.56 ±0.1 | 31.02 ±1.6 |
| RnC | 3.49 ±0.0 | 3.24 ±0.0 | 5.69 ±0.1 | 2.42 ±0.0 | 7.21 ±0.5 | 2.90 ±0.1 | 11.99 ±0.8 | 4.14 ±0.0 | 19.65 ±1.2 | 4.71 ±0.1 | 30.86 ±2.2 |
| LDS | 3.85 ±0.1 | 3.56 ±0.0 | 6.44 ±0.4 | 2.39 ±0.0 | 7.95 ±0.6 | 3.12 ±0.0 | 13.51 ±1.0 | 4.69 ±0.1 | 21.71 ±0.9 | 5.26 ±0.1 | 33.98 ±2.8 |
| GroupDRO | 3.62 ±0.0 | 3.35 ±0.0 | 6.00 ±0.1 | 2.34 ±0.0 | 6.64 ±0.5 | 2.90 ±0.0 | 12.43 ±1.1 | 4.42 ±0.1 | 25.07 ±1.4 | 5.04 ±0.0 | 29.66 ±1.5 |
| L-MDS | 3.54 ±0.0 | 3.27 ±0.0 | 5.81 ±0.2 | 2.38 ±0.0 | 6.86 ±0.6 | 2.95 ±0.0 | 11.63 ±1.0 | 4.17 ±0.0 | 23.63 ±0.7 | 4.78 ±0.0 | 31.47 ±2.3 |
| F-MDS | 3.56 ±0.0 | 3.29 ±0.0 | 5.81 ±0.2 | 2.33 ±0.1 | 6.44 ±0.3 | 2.97 ±0.0 | 11.86 ±0.4 | 4.22 ±0.0 | 21.40 ±1.1 | 4.74 ±0.0 | 30.47 ±2.0 |
| L-MDS + F-MDS | 3.58 ±0.0 | 3.30 ±0.0 | 5.78 ±0.1 | 2.39 ±0.1 | 6.28 ±0.6 | 2.97 ±0.0 | 12.18 ±0.6 | 4.23 ±0.1 | 22.30 ±1.3 | 4.82 ±0.1 | 32.99 ±2.8 |
| Ours (best) vs. ERM | +0.14 | +0.14 | +0.17 | -0.06 | +0.17 | -0.01 | +0.58 | +0.32 | +3.68 | +0.48 | -0.69 |

| Algorithm | Overall | By attribute (Average) | By attribute (Worst) | By shot: Many (Avg) | By shot: Many (Worst) | By shot: Medium (Avg) | By shot: Medium (Worst) | By shot: Few (Avg) | By shot: Few (Worst) | By shot: Zero (Avg) | By shot: Zero (Worst) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERM | 0.504 ±0.0 | 0.502 ±0.0 | 0.679 ±0.0 | 0.256 ±0.0 | 0.504 ±0.1 | 0.335 ±0.0 | 1.356 ±0.1 | 0.494 ±0.0 | 2.452 ±0.1 | 0.744 ±0.0 | 1.996 ±0.1 |
| Resample | 0.506 ±0.0 | 0.503 ±0.0 | 0.710 ±0.0 | 0.385 ±0.0 | 0.781 ±0.2 | 0.391 ±0.0 | 1.383 ±0.1 | 0.463 ±0.0 | 2.247 ±0.1 | 0.737 ±0.0 | 2.019 ±0.1 |
| SqrtReWeight | 0.512 ±0.0 | 0.509 ±0.0 | 0.670 ±0.0 | 0.375 ±0.0 | 0.679 ±0.2 | 0.376 ±0.0 | 1.441 ±0.1 | 0.478 ±0.0 | 2.233 ±0.1 | 0.753 ±0.0 | 2.037 ±0.1 |
| ReWeight | 0.522 ±0.0 | 0.520 ±0.0 | 0.750 ±0.0 | 0.485 ±0.1 | 0.805 ±0.1 | 0.431 ±0.0 | 1.426 ±0.1 | 0.464 ±0.0 | 2.088 ±0.2 | 0.748 ±0.0 | 2.012 ±0.1 |
| CBLoss | 0.515 ±0.0 | 0.513 ±0.0 | 0.720 ±0.0 | 0.450 ±0.0 | 0.856 ±0.2 | 0.420 ±0.0 | 1.447 ±0.1 | 0.467 ±0.0 | 2.142 ±0.1 | 0.729 ±0.0 | 2.029 ±0.1 |
| DANN | 0.689 ±0.1 | 0.685 ±0.1 | 0.869 ±0.0 | 0.796 ±0.1 | 0.996 ±0.1 | 0.574 ±0.1 | 1.638 ±0.1 | 0.598 ±0.1 | 1.926 ±0.1 | 1.003 ±0.1 | 2.191 ±0.1 |
| RnC | 0.494 ±0.0 | 0.490 ±0.0 | 0.675 ±0.0 | 0.304 ±0.0 | 0.559 ±0.1 | 0.290 ±0.0 | 1.103 ±0.1 | 0.486 ±0.0 | 2.320 ±0.1 | 0.773 ±0.0 | 2.153 ±0.2 |
| LDS | 0.501 ±0.0 | 0.499 ±0.0 | 0.712 ±0.0 | 0.331 ±0.0 | 0.717 ±0.1 | 0.336 ±0.0 | 1.458 ±0.1 | 0.501 ±0.0 | 2.276 ±0.1 | 0.714 ±0.0 | 2.049 ±0.1 |
| GroupDRO | 0.492 ±0.0 | 0.489 ±0.0 | 0.648 ±0.0 | 0.376 ±0.1 | 0.844 ±0.2 | 0.319 ±0.0 | 1.245 ±0.1 | 0.470 ±0.0 | 2.382 ±0.1 | 0.757 ±0.0 | 2.016 ±0.1 |
| L-MDS | 0.486 ±0.0 | 0.484 ±0.0 | 0.666 ±0.0 | 0.271 ±0.0 | 0.535 ±0.2 | 0.336 ±0.0 | 1.417 ±0.1 | 0.467 ±0.0 | 2.385 ±0.1 | 0.720 ±0.0 | 1.987 ±0.1 |
| F-MDS | 0.488 ±0.0 | 0.485 ±0.0 | 0.670 ±0.0 | 0.278 ±0.0 | 0.554 ±0.1 | 0.327 ±0.0 | 1.307 ±0.1 | 0.477 ±0.0 | 2.492 ±0.1 | 0.719 ±0.0 | 2.057 ±0.0 |
| L-MDS + F-MDS | 0.492 ±0.0 | 0.490 ±0.0 | 0.642 ±0.0 | 0.352 ±0.1 | 0.834 ±0.1 | 0.332 ±0.0 | 1.369 ±0.1 | 0.483 ±0.0 | 2.283 ±0.2 | 0.715 ±0.0 | 2.015 ±0.1 |
| Ours (best) vs. ERM (%) | +3.57% | +3.59% | +5.45% | -5.86% | -6.15% | +2.39% | +3.61% | +5.47% | +6.89% | +3.90% | +0.45% |

| Algorithm | Overall | By attribute (Average) | By attribute (Worst) | By shot: Many (Avg) | By shot: Many (Worst) | By shot: Medium (Avg) | By shot: Medium (Worst) | By shot: Few (Avg) | By shot: Few (Worst) |
|---|---|---|---|---|---|---|---|---|---|
| ERM | 268.7 ±2.8 | 269.0 ±2.6 | 350.3 ±11.0 | 165.4 ±2.8 | 228.7 ±8.9 | 268.8 ±3.6 | 398.4 ±13.1 | 529.8 ±5.2 | 711.3 ±21.2 |
| ReWeight | 253.7 ±2.8 | 253.5 ±2.5 | 306.3 ±9.1 | 179.4 ±3.4 | 253.6 ±13.0 | 249.4 ±3.6 | 374.3 ±19.0 | 463.4 ±6.2 | 616.5 ±18.3 |
| SqrtReWeight | 248.2 ±2.5 | 248.3 ±2.6 | 299.4 ±8.9 | 179.2 ±3.1 | 247.2 ±11.5 | 242.1 ±3.5 | 328.2 ±12.4 | 444.6 ±6.3 | 609.0 ±23.5 |
| CBLoss | 251.9 ±2.6 | 251.8 ±2.6 | 301.3 ±9.6 | 161.3 ±2.8 | 229.2 ±11.5 | 253.9 ±3.5 | 333.9 ±15.2 | 472.9 ±5.9 | 624.0 ±16.2 |
| DANN | 276.0 ±2.8 | 276.4 ±2.6 | 348.8 ±11.0 | 148.3 ±2.6 | 228.9 ±10.7 | 292.9 ±3.4 | 427.0 ±15.2 | 551.4 ±4.7 | 716.9 ±16.3 |
| LDS | 263.1 ±2.7 | 263.3 ±2.8 | 322.4 ±9.5 | 178.6 ±3.3 | 275.6 ±13.1 | 261.4 ±3.7 | 365.5 ±12.1 | 484.6 ±6.4 | 686.3 ±24.4 |
| L-MDS | 243.4 ±2.7 | 243.1 ±2.7 | 299.0 ±9.3 | 163.7 ±3.3 | 257.7 ±14.0 | 245.0 ±3.9 | 321.8 ±11.4 | 440.2 ±6.5 | 623.4 ±16.1 |
| F-MDS | 250.5 ±2.6 | 250.4 ±2.5 | 287.2 ±8.7 | 196.0 ±3.3 | 279.6 ±10.4 | 235.6 ±3.5 | 306.0 ±11.9 | 429.0 ±2.6 | 592.2 ±25.7 |
| L-MDS + F-MDS | 249.4 ±2.5 | 249.2 ±2.6 | 299.5 ±10.4 | 205.4 ±3.4 | 309.0 ±12.7 | 231.2 ±3.3 | 306.0 ±12.0 | 413.5 ±6.1 | 622.6 ±18.2 |
| Ours (best) vs. ERM | +25.3 | +25.9 | +63.1 | +1.7 | -29.0 | +37.6 | +92.4 | +116.3 | +119.1 |
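
The "Ours (best) vs. ERM" rows appear to follow a simple rule: ERM's error minus the lowest error among L-MDS, F-MDS, and L-MDS + F-MDS, so a positive value means lower error than ERM. The table that reports percentages divides that difference by ERM's error. The sketch below checks this reading against the Overall column of the first table and of the percentage table; `gain_vs_erm` is a hypothetical helper written for this example.

```python
def gain_vs_erm(erm, ours, relative=False):
    """Error reduction of the best of our variants relative to ERM.

    Positive means the best variant has lower error than ERM.
    """
    best = min(ours.values())
    diff = erm - best
    return 100.0 * diff / erm if relative else diff

# Overall column, first results table (absolute difference).
abs_gain = gain_vs_erm(7.39, {"L-MDS": 7.30, "F-MDS": 7.22, "L-MDS + F-MDS": 7.42})

# Overall column, table whose last row is reported in percent.
rel_gain = gain_vs_erm(
    0.504, {"L-MDS": 0.486, "F-MDS": 0.488, "L-MDS + F-MDS": 0.492}, relative=True
)

print(f"{abs_gain:+.2f}")   # +0.17
print(f"{rel_gain:+.2f}%")  # +3.57%
```

Negative entries in those rows therefore mark columns where ERM has lower error than all three proposed variants.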
@article{xu2026shortcut,
title = {Shortcut to Nowhere: Demystifying Deep Spurious Regression},
author = {Xu, Guanrong and Li, Jessica and Wang, Hao and Yang, Yuzhe},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}