OSF: On Pre-training and Scaling of Sleep Foundation Models

1University of California, Los Angeles    2Emory University

We identify three critical insights for building sleep foundation models, and our pre-trained OSF consistently achieves state-of-the-art performance on diverse downstream sleep tasks.

News

2026-02-23 Project website is live!
2026-02-23 Code released on GitHub.

Abstract

Polysomnography (PSG) provides the gold standard for sleep assessment but suffers from substantial heterogeneity across recording devices and cohorts. There have been growing efforts to build general-purpose foundation models (FMs) for sleep physiology, yet these efforts lack an in-depth understanding of the pre-training process and scaling patterns that lead to more generalizable sleep FMs.

To fill this gap, we curate a massive corpus of 166,500 hours of sleep recordings from nine public sources and establish SleepBench, a comprehensive, fully open-source benchmark. Leveraging SleepBench, we systematically evaluate four families of self-supervised pre-training objectives and uncover three critical findings: (1) existing FMs fail to generalize to missing channels at inference; (2) channel-invariant feature learning is essential for pre-training; and (3) scaling sample size, model capacity, and multi-source data mixture consistently improves downstream performance.

With an enhanced pre-training and scaling recipe, we introduce OSF, a family of sleep FMs that achieves state-of-the-art performance across nine datasets on diverse sleep and disease prediction tasks. Further analysis of OSF also reveals intriguing properties in sample efficiency, hierarchical aggregation, and cross-dataset scaling.

Motivation

Which pre-training and scaling design choices truly improve the generalization of sleep FMs, especially under cohort shift and missing-channel inference?

Sleep foundation models promise to unify diverse recording setups and patient populations, but current approaches have not been systematically evaluated under realistic deployment scenarios. OSF addresses this gap through comprehensive benchmarking and principled pre-training design.

Key Findings

Through systematic evaluation on SleepBench, we uncover three critical insights that guide the design of more robust and generalizable sleep foundation models.

Finding 1: Missing-Channel Inference is Challenging

Existing sleep FMs fail to generalize under missing-channel inference, motivating pre-training designs that explicitly handle channel incompleteness.

Finding 2: Channel-Invariant Pre-training Improves Robustness

Explicitly encouraging channel-invariant feature learning during pre-training improves robustness and downstream transfer, particularly for contrastive and distillation-based methods.

Finding 3: Scaling Laws Hold in Sleep Data

Scaling laws emerge in sleep data; jointly scaling model and data size yields the strongest gains across diverse downstream tasks.

Missing-Channel Inference

Missing Channel Inference
Figure: Inference with full versus missing channels. Existing sleep FMs fail on samples with channels missing at inference time.
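A common way to run a multi-channel model when some channels are absent is to zero-fill the missing signals and pass an availability mask alongside the input. The minimal NumPy sketch below illustrates this convention; the function name and interface are illustrative, not taken from the OSF codebase.

```python
import numpy as np

def infer_with_missing_channels(x, present):
    """Prepare a recording for missing-channel inference.

    x:       (channels, time) array of signals
    present: per-channel booleans; False marks a missing channel

    Returns the recording with absent channels zero-filled,
    plus the boolean mask so the model can ignore those rows.
    """
    mask = np.asarray(present, dtype=bool)
    x_masked = np.where(mask[:, None], x, 0.0)
    return x_masked, mask
```

A model pre-trained only on complete inputs never sees such zero-filled rows, which is one intuition for why full-channel FMs degrade in this setting.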

Channel-Invariant Pre-training

Augmentation Strategies
Figure: Illustration of considered augmentations. We consider time-wise masking and channel masking strategies.
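The two masking strategies in the figure can be sketched in a few lines of NumPy. This is a generic illustration of time-wise and channel masking, assuming a `(channels, time)` layout; hyperparameters and function names are not from the paper.

```python
import numpy as np

def time_mask(x, span, rng):
    """Time-wise masking: zero out one contiguous span of timesteps
    across all channels."""
    x = x.copy()
    start = rng.integers(0, x.shape[1] - span + 1)
    x[:, start:start + span] = 0.0
    return x

def channel_mask(x, p_drop, rng):
    """Channel masking: zero out each channel independently with
    probability p_drop."""
    x = x.copy()
    drop = rng.random(x.shape[0]) < p_drop
    x[drop, :] = 0.0
    return x
```

Channel masking during pre-training directly simulates the missing-channel condition seen at inference, which is one plausible reason it encourages channel-invariant features.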

Scaling Behavior

Model Scaling
Figure: Scaling behavior. Linear probing results on hypopnea detection show that OSF improves with both model capacity and pre-training sample size.
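Scaling curves of this kind are typically summarized by fitting a power law, error ≈ a · N^(−b), via linear regression in log-log space. The sketch below shows the fitting procedure on synthetic numbers; the values are illustrative, not OSF results.

```python
import numpy as np

# Hypothetical (pre-training sample size, downstream error) pairs
# generated from error = 2.0 * N**(-0.1)
n = np.array([1e3, 1e4, 1e5, 1e6])
err = 2.0 * n ** -0.1

# Power-law fit: log(err) = log(a) - b * log(n)
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
b, a = -slope, np.exp(intercept)
```

The fitted exponent `b` quantifies how quickly downstream error falls as pre-training data grows, letting different model sizes be compared on a common footing.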

Experiment Results

OSF is evaluated across nine datasets on diverse sleep staging and event detection tasks. It consistently achieves state-of-the-art performance under linear probing, few-shot learning, and fine-tuning settings.

Task Types

Sleep Staging

Best overall performance on multi-class sleep stage classification across diverse cohorts.

Event Detection

Superior detection of sleep events including arousal, hypopnea, oxygen desaturation, and central apnea.

Disease Prediction

Embeddings extracted by OSF capture disease-related information better than those of prior sleep FMs.

Evaluation Settings

Linear Probing

Strong performance with frozen features, demonstrating high-quality representations.
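Linear probing trains only a linear classifier on top of frozen embeddings. As a self-contained illustration, the sketch below fits a closed-form ridge-regression probe on one-hot labels; the actual OSF evaluation protocol may differ.

```python
import numpy as np

def linear_probe(feats, labels, n_classes, l2=1e-2):
    """Fit a ridge-regression linear probe on frozen features.

    feats:  (n_samples, dim) frozen embeddings
    labels: (n_samples,) integer class labels
    """
    Y = np.eye(n_classes)[labels]                     # one-hot targets
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    return W

def probe_predict(W, feats):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)
```

Because the encoder stays frozen, probe accuracy isolates the quality of the pre-trained representations from any benefit of task-specific fine-tuning.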

Fine-tuning

Further gains when adapting to specific downstream tasks with full model training.

Few-shot Learning

Strong sample efficiency when adapting to downstream tasks from limited labeled data.

Main Results

Main Results Table
Table: Sleep staging and sleep event detection. OSF achieves the best overall performance among all compared methods.

We evaluate models on four sleep event detection tasks. As shown in the table, OSF achieves state-of-the-art performance on both sleep staging and event detection under linear probing and fine-tuning.

Missing-Channel Robustness

Missing Channel Results
Table: Linear probing results under realistic missing-channel settings. OSF is more robust to missing channels.

OSF consistently outperforms SleepFM across these missing-channel settings. Specifically, (1) OSF makes better use of the available channels. With brain-activity channels only, it achieves stronger sleep staging and arousal detection, suggesting stronger brain-related representations. Similarly, with respiratory channels only, it achieves stronger performance on hypopnea and oxygen desaturation.

(2) OSF is more robust when key modalities are missing. When respiratory signals are removed, both methods degrade on hypopnea and oxygen desaturation, but OSF remains consistently better. Conversely, when brain-related channels are unavailable, sleep staging becomes much harder for both models; nevertheless, OSF better uses the remaining channels and yields stronger performance.

BibTeX

@article{shuai2026osf,
  title={OSF: On Pre-training and Scaling of Sleep Foundation Models},
  author={Shuai, Zitao and Xu, Zongzhe and Yang, David and Wang, Wei and Yang, Yuzhe},
  journal={arXiv preprint},
  year={2026}
}