We present SleepLM, a family of sleep-language foundation models that align, interpret, and interact with human sleep through natural language. Despite the critical role of sleep in health, existing learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and cannot describe, query, or generalize to novel sleep phenomena.
SleepLM bridges natural language and multimodal polysomnography (PSG), enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline used to curate the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. We further present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions.
Extensive experiments on real-world sleep understanding tasks show that SleepLM outperforms state-of-the-art methods in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities, including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks.
Humans move between two distinct states: waking life, structured by perception and language, and sleep, expressed through dense, continuous physiology. Making sense of sleep therefore requires a mapping from physiology to language.
Current computational methods are predominantly discriminative and confined to closed label spaces (e.g., sleep stages or events), lacking the capacity for open-ended description. SleepLM aims to bridge this gap by learning a mapping between PSG and language at scale, enabling interactive, open-ended sleep analysis.
SleepLM is built on the Reconstructive Contrastive Captioner (ReCoCa) framework, designed to learn joint representations of sleep PSG and text. The architecture consists of three key components:
A multimodal signal encoder that captures the unique morphology of different sensor channels (EEG, EOG, etc.) using channel-independent patch embedding followed by interleaved temporal-attention and channel-attention blocks.
A lightweight reconstruction decoder that rebuilds the original signal from encoder latents. This acts as a regularizer, ensuring the model retains physiological fidelity that could be lost when training on sparse text alignment alone.
A caption decoder that generates targeted captions via cross-attention between encoder latents and text token embeddings. A learnable token [m] conditions generation on a specific physiological system (Brain, Respiratory, Cardiac, Somatic), enabling controllable output.
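As an illustration, the encoder's channel-independent patching and interleaved attention can be sketched in plain NumPy. All sizes here (4 channels, 100-sample patches, 32-dim tokens) and the single-head, projection-free attention are simplifying assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # single-head self-attention over the second-to-last (sequence) axis
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

# Hypothetical sizes: 4 channels, 30 s at 100 Hz, 1 s patches, 32-dim tokens
C, T, P, D = 4, 3000, 100, 32
signal = rng.standard_normal((C, T))

# Channel-independent patch embedding: the same projection W for every channel
W = rng.standard_normal((P, D)) / np.sqrt(P)
tokens = signal.reshape(C, T // P, P) @ W      # (C, 30, D)

# Interleaved attention: temporal within each channel, then across channels
tokens = self_attention(tokens)                # attend over the 30 patches
tokens = np.swapaxes(self_attention(np.swapaxes(tokens, 0, 1)), 0, 1)  # over the 4 channels
print(tokens.shape)  # (4, 30, 32)
```

Keeping the patch projection shared across channels lets one encoder handle sensors with very different morphology, while the alternating attention axes mix information first within and then across channels.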
L_total = λ_con · L_con + λ_rec · L_rec + λ_cap · L_cap
where L_con, L_rec, and L_cap are the contrastive alignment, signal reconstruction, and caption generation losses, weighted by the corresponding λ coefficients.
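A minimal sketch of the combined objective, assuming a symmetric InfoNCE contrastive loss, MSE reconstruction, and next-token cross-entropy for captioning; the λ weights and the exact loss forms are assumptions, not the paper's reported configuration:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(sig, txt, tau=0.07):
    # symmetric InfoNCE: matched (signal, text) pairs lie on the diagonal
    sig = sig / np.linalg.norm(sig, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = sig @ txt.T / tau
    i = np.arange(len(logits))
    return -(log_softmax(logits, 1)[i, i].mean() + log_softmax(logits.T, 1)[i, i].mean()) / 2

def reconstruction_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)       # MSE between signal and its reconstruction

def caption_loss(logits, token_ids):
    # next-token cross-entropy for the caption decoder
    return -log_softmax(logits, -1)[np.arange(len(token_ids)), token_ids].mean()

rng = np.random.default_rng(0)
B, D, V, L = 8, 16, 50, 12                 # batch, embed dim, vocab, caption length
L_con = contrastive_loss(rng.standard_normal((B, D)), rng.standard_normal((B, D)))
L_rec = reconstruction_loss(rng.standard_normal((B, 100)), rng.standard_normal((B, 100)))
L_cap = caption_loss(rng.standard_normal((L, V)), rng.integers(0, V, L))
lam_con = lam_rec = lam_cap = 1.0          # assumed weights; not the paper's values
L_total = lam_con * L_con + lam_rec * L_rec + lam_cap * L_cap
print(L_total)
```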
SleepLM is evaluated along four complementary axes of sleep understanding and generalization. It outperforms a comprehensive set of baselines, including strong general-purpose LLMs (Gemini 2.5 Pro, DeepSeek-R1) and finetuned VLMs (Qwen3-VL-8B, LLaVA-Next).
Strong performance without task-specific finetuning, demonstrating that the learned signal and text alignment transfers to standard sleep staging and event-related recognition settings.
Supports both text-to-signal and signal-to-text retrieval, enabling natural language search over PSG segments and interpretation of retrieved physiology with matched captions.
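Retrieval in a shared embedding space reduces to cosine-similarity ranking. A toy sketch with random stand-in vectors (the embeddings, dimensions, and segment index are hypothetical, not model outputs):

```python
import numpy as np

def retrieve(query_emb, corpus_embs, k=3):
    # cosine-similarity ranking in the shared signal-text embedding space
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

rng = np.random.default_rng(1)
corpus = rng.standard_normal((100, 32))             # stand-ins for 100 PSG segment embeddings
query = corpus[42] + 0.1 * rng.standard_normal(32)  # text query aligned with segment 42
idx, scores = retrieve(query, corpus)
print(idx[0])  # 42: the matching segment ranks first
```

The same routine works in both directions: rank signal embeddings against a text query (text-to-signal) or rank caption embeddings against a signal query (signal-to-text).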
Achieves high data efficiency in low-label regimes, where simple linear probes on top of pretrained representations perform competitively with state-of-the-art SSL baselines.
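A linear probe keeps the pretrained encoder frozen and fits only a linear classifier on its embeddings. A minimal closed-form ridge version on synthetic stand-in features (the data, labels, and separable structure are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for pretrained segment embeddings:
# two classes, separable along the first embedding dimension
n, d = 200, 16
X = rng.standard_normal((n, d))
y = (X[:, 0] > 0).astype(int)

# Linear probe: closed-form one-hot ridge regression; the encoder stays frozen
Y = np.eye(2)[y]
Xb = np.hstack([X, np.ones((n, 1))])                      # append a bias column
W = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(d + 1), Xb.T @ Y)
pred = (Xb @ W).argmax(axis=1)
acc = (pred == y).mean()
print(acc)  # high accuracy from only a linear map on top of the features
```

Because only W is learned, the probe needs very few labels, which is what makes it a useful measure of representation quality in low-label regimes.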
Remains robust when evaluating on held-out concepts and datasets, indicating that the model captures reusable physiological semantics rather than memorizing a closed label set.
We select two qualitative views that clarify what the model learns and where its strengths come from. In the paper, we also demonstrate other intriguing capabilities of SleepLM, including precise targeted generation, localization sensitivity to event segmentation, scaling behavior, and full-night index aggregation.
Caption generation quality: SleepLM produces clinically consistent descriptions that reflect both high-level sleep state and fine-grained localized events, while a strong general-purpose LLM baseline may introduce incorrect associations or miss event localization tied to the underlying signal physiology.
Embedding-space continuity: For a fixed PSG query, retrieved captions form a smooth semantic gradient in the embedding space. The most similar results describe physiologically matching states, and progressively less similar results shift toward increasingly different states. This supports the view that embedding distance correlates with physiological similarity, enabling meaningful comparison by semantic proximity.
@article{xu2026sleeplm,
title={SleepLM: Natural-Language Intelligence for Human Sleep},
author={Xu, Zongzhe and Shuai, Zitao and Mozaffari, Eideen and Aysola, Ravi Shankar and Kumar, Rajesh and Yang, Yuzhe},
journal={arXiv preprint},
year={2026}
}