Skip to content

Masked Attention

Definition
AI-generated

Masked attention is self-attention (or a head within multi-head attention) where an attention mask prevents some positions from attending to others—most famously a causal mask in autoregressive decoders so token *i* cannot attend to future tokens *j > i*, and masks in bidirectional encoders that hide held-out spans for pretraining objectives (e.g. masked language modeling).

Why it matters in GWAS

Encoderdecoder setups (sequence to phenotype labels, protein structure heads) use masks to separate modalities or padding; mis-specified padding masks can leak batch structure into predictions. Literature on DNA transformers should state whether training used causal (GPT-style) or bidirectional (BERT-style) masking because transfer behavior differs.

Example usage

"The decoder used causal masked attention during fine-tuning so the model only conditioned on upstream sequence when scoring the variant site."

References

  • Vaswani A, et al. (2017). Attention is all you need. NeurIPS.

Last updated (UTC · Git history)