Pretraining

Definition

AI-generated

Pretraining is the initial, large-scale training phase in which a model learns general-purpose representations from broad data before it is fine-tuned or prompted for a downstream task. Common objectives include next-token prediction for text, masked modeling, contrastive learning on images, and denoising autoencoding.
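As a rough illustration of the next-token prediction objective, the sketch below builds (context, target) pairs from a token sequence and scores a sequence by its average negative log-likelihood. The add-one-smoothed bigram model is a stand-in chosen for brevity, not how large models are actually trained, and the function names `next_token_pairs` and `bigram_nll` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def bigram_nll(train_tokens, eval_tokens, vocab):
    """Average negative log-likelihood of eval_tokens under a bigram model.

    The model is fit on train_tokens with add-one smoothing. This is a toy
    stand-in (an assumption for illustration) for a neural network trained
    with a cross-entropy loss on next-token prediction.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(train_tokens, train_tokens[1:]):
        counts[prev][nxt] += 1

    pairs = list(zip(eval_tokens, eval_tokens[1:]))
    if not pairs:
        raise ValueError("eval_tokens must contain at least two tokens")

    total = 0.0
    for prev, nxt in pairs:
        numerator = counts[prev][nxt] + 1
        denominator = sum(counts[prev].values()) + len(vocab)
        total -= math.log(numerator / denominator)
    return total / len(pairs)


if __name__ == "__main__":
    corpus = "a b a b a b".split()
    print(next_token_pairs(["a", "b", "c"]))
    print(bigram_nll(corpus, ["a", "b"], vocab={"a", "b"}))
```

A lower average negative log-likelihood means the model assigns higher probability to the tokens that actually follow each context, which is the quantity that next-token pretraining minimizes.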

Topics

LLM and Agents, Machine learning concepts

Why it matters in GWAS

DNA, protein, and single-cell foundation models depend on their pretraining corpora and tokenization choices, both of which shape the model’s inductive bias. Interpreting GWAS results with such models still requires a sound association study design, regardless of the domain the model was pretrained on.
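To make the point about tokenization concrete, here is a minimal sketch of k-mer tokenization, one commonly used scheme for DNA sequence models. The function name `kmer_tokenize` and its `k` and `stride` parameters are assumptions for illustration and are not taken from any specific model's tokenizer.

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA string into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing bases that do not fill a complete k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    print(kmer_tokenize("ACGTAC", k=3, stride=1))  # overlapping
    print(kmer_tokenize("ACGTAC", k=3, stride=3))  # non-overlapping
```

The same sequence yields different token sets under the two settings, so a single-base variant touches up to `k` tokens with overlapping k-mers but only one with non-overlapping k-mers. This is one way a tokenization choice can change what a pretrained model sees at a variant site.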

Example usage

"The methods explicitly include pretraining to support interpretation of the main findings."

References

Devlin J, et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. NAACL.
Brown T, et al. (2020). Language models are few-shot learners. NeurIPS.


Last updated 2026-04-05 (UTC)
