Masked Attention¶

Definition

AI-generated

Masked attention is self-attention (or a head within multi-head attention) where an attention mask prevents some positions from attending to others—most famously a causal mask in autoregressive decoders so token *i* cannot attend to future tokens *j > i*, and masks in bidirectional encoders that hide held-out spans for pretraining objectives (e.g. masked language modeling).

Topics

LLM and Agents Machine learning concepts

Why it matters in GWAS¶

Encoder–decoder setups (sequence to phenotype labels, protein structure heads) use masks to separate modalities or padding; mis-specified padding masks can leak batch structure into predictions. Literature on DNA transformers should state whether training used causal (GPT-style) or bidirectional (BERT-style) masking because transfer behavior differs.

Example usage¶

"The decoder used causal masked attention during fine-tuning so the model only conditioned on upstream sequence when scoring the variant site."

References¶

Vaswani A, et al. (2017). Attention is all you need. NeurIPS.

← Markov chain Monte Carlo (MCMC) Maternally expressed gene (MEG) →

Last updated 2026-04-05 (UTC · Git history)

Masked Attention¶

Why it matters in GWAS¶

Example usage¶

Related terms¶

References¶