Allele
Definition
An allele is one of the alternative forms of a genetic variant (e.g., SNP) at a specific genomic position. For a bi-allelic variant, there are two possible alleles (e.g., A or G). Each individual carries two alleles at each autosomal position (one from each parent), which together form their genotype (e.g., AA, AG, or GG).
In GWAS and genetic analysis, alleles are the fundamental units used to:
- Measure genetic variation within and between populations
- Test associations between genetic variants and phenotypes
- Calculate allele frequencies and genotype distributions
- Estimate effect sizes and odds ratios
The specific nucleotide or sequence variant at a position represents an allele, and different alleles can have different effects on phenotypes, disease risk, or other traits of interest.
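As a minimal sketch of how genotypes reduce to allele counts and frequencies, the Python snippet below tallies alleles from a handful of made-up diploid genotypes at one bi-allelic site. The function name `allele_frequencies` and the genotype list are illustrative assumptions, not part of any GWAS tool.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotype strings such as 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        # Each diploid genotype contributes two allele observations.
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Hypothetical genotypes for five individuals at one A/G SNP.
genotypes = ["AA", "AG", "AG", "GG", "AA"]
freqs = allele_frequencies(genotypes)
print(freqs)
```

Five individuals contribute ten allele observations here (six A, four G), so the frequencies are computed per chromosome, not per person.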
Related concepts
Concepts: Major/Minor/Reference/Alternative/Risk/Effect Allele, Allele0/Allele1, A1/A2, Ancestral/Derived Allele
Understanding allele terminology is crucial in GWAS analysis. These concepts are often confused, leading to errors or misunderstandings.
Allele Naming is Highly Inconsistent
Allele naming conventions are inconsistent across software, databases, and file formats. The same term (e.g., "reference allele", "A1", "Allele0") can have different meanings depending on the context, software version, or data source.
Always check the documentation of your data source, software, or database to determine which allele is which. Never assume that naming conventions are consistent across different tools or datasets. When in doubt:
- Consult the official documentation
- Check allele frequency information if available
- Verify with example data or test cases
- Cross-reference with known reference genomes when possible
This inconsistency is a common source of errors in GWAS analysis, so extra caution is essential.
Three Groups of Allele Concepts
First Group: Major and Minor Allele (Frequency-based)
Major and minor alleles are defined relative to a specific population (in practice, a specific sample of that population). The allele with the highest frequency is the major allele, and the one with the second-highest frequency is the minor allele. Most SNPs are bi-allelic, so one allele is major and the other is minor. For tri-allelic or quad-allelic SNPs (sites with three or four observed bases), the minor allele is still the second most frequent allele.
Key points:
- The distinction between major and minor is based on allele frequency in a specific population of a certain size
- PLINK1.9 uses major and minor allele concepts. The software automatically calculates frequencies and may reorder alleles when processing raw data
- When using PLINK1.9's --freq option, the output shows MAF (minor allele frequency), which will not exceed 0.5
- In PLINK1.9, A1 is the minor allele and A2 is the major allele, so MAF refers to the frequency of A1 (minor allele)
Example PLINK1.9 output:
CHR SNP A1 A2 MAF NCHROBS
1 SNP1 T C 0.1258 10000
1 SNP2 A G 0.1258 10000
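The frequency-based definition can be sketched in a few lines of Python. The helper `major_minor` below is a hypothetical illustration (it is not how PLINK is implemented), and the allele counts are made up so that they reproduce the MAF of SNP1 in the example output above.

```python
def major_minor(allele_counts):
    """Rank alleles by count; return (major, minor, minor_allele_frequency)."""
    # Sort alleles from most to least frequent; ties keep their input order.
    ranked = sorted(allele_counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(allele_counts.values())
    major = ranked[0][0]
    minor, minor_count = ranked[1]  # second most frequent allele
    return major, minor, minor_count / total


# Hypothetical counts over 10000 observed chromosomes (MAF 0.1258).
major, minor, maf = major_minor({"T": 1258, "C": 8742})
print(major, minor, maf)  # C T 0.1258
```

Because the minor allele is the second most frequent one, its frequency cannot exceed 0.5, and the same function covers a tri-allelic site without modification.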
Second Group: Reference (ref) and Alternative (alt) Allele (Reference-genome-based)
Reference allele refers to the allele at that position in a specific reference genome. All other alleles at that position are called alternative alleles. Note: reference and alternative alleles are unrelated to frequency - the only determining factor is the chosen reference genome. While reference genome alleles are often major alleles, this is coincidental and should not be used to equate major and reference alleles. Some reference alleles can be minor alleles in a given population.
Unlike PLINK1.9, PLINK2 uses reference and alternative allele concepts and does not automatically reorder alleles based on frequency. With PLINK2's --freq option, the output reports the alternative allele frequency (not MAF), which can take any value in [0, 1].
Example PLINK2 output:
#CHROM ID REF ALT ALT_FREQS OBS_CT
1 SNP1 T C 0.8742 10000
1 SNP2 G A 0.1258 10000
In PLINK2, reference and alternative alleles are clearly distinguished. For example, in the above SNPs:
- SNP1: T is the ref allele (from reference genome) but is the minor allele in this population, while C is the alt allele but is the major allele
- SNP2: G is the ref allele and major allele, while A is the alt allele and minor allele
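The reasoning for these two SNPs can be written as a short Python sketch that converts an alternative allele frequency into a minor allele and MAF. The function `minor_allele_from_alt_freq` is a hypothetical helper for bi-allelic sites only; the rows are copied from the example output above.

```python
def minor_allele_from_alt_freq(ref, alt, alt_freq):
    """Return (minor_allele, maf) for a bi-allelic site given the ALT frequency."""
    if alt_freq <= 0.5:
        return alt, alt_freq
    # ALT is the more common allele, so REF is the minor allele.
    # Rounding only removes floating-point noise from the subtraction.
    return ref, round(1.0 - alt_freq, 6)


# (ID, REF, ALT, ALT_FREQS) taken from the example PLINK2 output.
rows = [("SNP1", "T", "C", 0.8742), ("SNP2", "G", "A", 0.1258)]
results = {snp: minor_allele_from_alt_freq(ref, alt, f) for snp, ref, alt, f in rows}
for snp, (allele, maf) in results.items():
    print(snp, allele, maf)
```

Both SNPs end up with the same MAF even though their ALT_FREQS differ, which is why an alternative allele frequency should never be read as a MAF without checking which allele it describes.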
Tip: You can align your data's ref and alt alleles with the corresponding reference genome using PLINK2:
plink2 \
--bfile testfile \
--fa hg19.fasta \
--ref-from-fa \
--make-bed \
--out testfile_fa
Third Group: Reference and Risk/Effect Allele (Association-test-based)
The meaning of "reference allele" changes again here. When "reference allele" is paired with "risk/effect allele", it means the baseline allele in GWAS association testing (the non-risk or non-effect allele), against which effect sizes (beta or odds ratio) are estimated. However, some software uses "reference allele" to mean the effect allele itself. This usage is independent of the ref/alt pair above, although recent studies often align the association-test reference allele with the reference-genome allele to avoid confusion. (Note: early studies often used the minor allele as the reference allele in association testing, which is a further source of confusion.) Because the term is so overloaded, don't rely on the name; check which allele the reported effect size refers to.
Risk allele is straightforward - it's the allele that contributes to disease occurrence. In complex disease research, risk alleles are often minor alleles, but exceptions exist. The concept of effect allele is similar - it's the allele whose effect on disease or phenotype we want to study, so it's usually the allele that contributes to the phenotype or disease. The "effect" column in association test results refers to the effect of the effect allele.
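Since an effect size only has meaning relative to its effect allele, switching the effect allele requires re-expressing the statistics. The sketch below illustrates this for a bi-allelic variant; the record layout and field names (`effect_allele`, `other_allele`, `beta`, `eaf`) are assumptions made for this example, not a standard file format.

```python
import math


def flip_effect_allele(record):
    """Swap effect and non-effect alleles and re-express the statistics."""
    flipped = dict(record)
    flipped["effect_allele"] = record["other_allele"]
    flipped["other_allele"] = record["effect_allele"]
    flipped["beta"] = -record["beta"]     # effect on the log scale changes sign
    flipped["eaf"] = 1.0 - record["eaf"]  # frequency of the new effect allele
    return flipped


# Hypothetical association result: each copy of T raises the trait by 0.2.
record = {"snp": "SNP1", "effect_allele": "T", "other_allele": "C",
          "beta": 0.2, "eaf": 0.25}
flipped = flip_effect_allele(record)

# For a binary trait, beta is a log odds ratio, so the odds ratio inverts.
odds_ratio = math.exp(record["beta"])
flipped_odds_ratio = math.exp(flipped["beta"])
print(flipped["effect_allele"], flipped["beta"], flipped_odds_ratio)
```

The standard error and p-value are unchanged by the swap; only the sign of beta (equivalently, the reciprocal of the odds ratio) and the effect allele frequency change.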
Additional Allele Naming Conventions
Allele0 and Allele1
Allele0 and Allele1 use 0-indexed numbering:
- Allele0: Typically refers to the reference allele
- Allele1: Typically refers to the alternative allele
Caveats:
- The numbering is 0-indexed; under this convention the reference allele takes index 0
- Alternative alleles are numbered sequentially (1, 2, 3... for multi-allelic sites)
- This convention does not indicate frequency - Allele0 may be major or minor depending on the population
- Always verify what Allele0 and Allele1 represent in your specific dataset or software
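The 0-indexed numbering described above matches how VCF-style genotype strings refer to alleles. As a hedged illustration, the hypothetical function `decode_genotype` below maps allele indices back to bases; it assumes the reference allele is index 0 and that alternative alleles follow in order, which should be verified for any particular dataset.

```python
def decode_genotype(ref, alts, gt):
    """Translate a VCF-style GT string such as '0/1' into allele bases."""
    alleles = [ref] + list(alts)  # index 0 = REF, 1.. = ALT alleles in order
    separator = "|" if "|" in gt else "/"  # phased vs. unphased genotype
    # A '.' index marks a missing call and is returned as None.
    return [None if i == "." else alleles[int(i)] for i in gt.split(separator)]


print(decode_genotype("G", ["A", "T"], "0/2"))  # ['G', 'T']
```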
Allele1 and Allele2 (A1/A2)
A1 and A2 are numeric designations for the two alleles at a bi-allelic site:
- A1: Often refers to the minor allele (lower frequency) in frequency-based systems
- A2: Often refers to the major allele (higher frequency) in frequency-based systems
Caveats:
- A1/A2 designation is often frequency-based and may differ from reference/alternative alleles
- Some software may automatically reorder alleles based on frequency, so A1/A2 do not necessarily correspond to reference/alternative alleles in the reference genome
- The meaning of A1/A2 can vary between software and datasets - always check the documentation or frequency information to understand which is which
- A1 is not always the minor allele - it depends on the convention used by the specific software or dataset
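When documentation is unclear, the frequency column can serve as a quick sanity check on what A1 means. The sketch below is an assumption-laden heuristic, not a definitive test: `a1_minor_fraction` and the `a1_freq` field are hypothetical names, and the rows are invented.

```python
def a1_minor_fraction(rows):
    """Return the fraction of variants whose A1 frequency is at most 0.5."""
    at_most_half = sum(1 for row in rows if row["a1_freq"] <= 0.5)
    return at_most_half / len(rows)


# Hypothetical summary rows; a1_freq is the reported frequency of A1.
rows = [
    {"snp": "SNP1", "a1": "T", "a2": "C", "a1_freq": 0.1258},
    {"snp": "SNP2", "a1": "A", "a2": "G", "a1_freq": 0.8742},
]
fraction = a1_minor_fraction(rows)
print(fraction)
```

A fraction of 1.0 is consistent with A1 being assigned by frequency (minor allele); a value well below 1.0 suggests A1 follows some other rule, such as the reference genome or the effect allele.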
Ancestral and Derived Allele
Ancestral and derived alleles are evolutionary concepts:
- Derived allele: The allele that arose from a mutation from the ancestral state
- Ancestral allele: The allele that was present in the common ancestor (often inferred from outgroup species)
Caveats:
- Derived alleles are not necessarily minor alleles - they can be major alleles in some populations
- The derived allele frequency (DAF) can range from 0 to 1, unlike MAF which is capped at 0.5
- The ancestral allele is sometimes used as a proxy for the reference allele, but they are distinct concepts: the reference genome does not always carry the ancestral state
- Ancestral/derived information requires phylogenetic inference and may have uncertainty, especially for older mutations or when outgroup information is limited
- Ancestral and derived status is defined relative to the inferred common ancestor, not to a population; what varies between populations is the frequency of the derived allele
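To make the contrast between DAF and MAF concrete, the sketch below derives a DAF from an alternative allele frequency once the ancestral allele is known. The function `derived_allele_frequency` is a hypothetical helper for bi-allelic sites, and the ancestral calls passed to it are invented for illustration.

```python
def derived_allele_frequency(ref, alt, alt_freq, ancestral):
    """Return (derived_allele, daf) for a bi-allelic site, or None if unresolved."""
    if ancestral == ref:
        return alt, alt_freq
    if ancestral == alt:
        # REF is the derived allele; rounding removes floating-point noise.
        return ref, round(1.0 - alt_freq, 6)
    return None  # ancestral state unknown or matching neither allele


# Same site, two hypothetical ancestral calls.
print(derived_allele_frequency("T", "C", 0.8742, "T"))
print(derived_allele_frequency("T", "C", 0.8742, "C"))
```

With T as the ancestral allele, the derived allele C has a DAF of 0.8742, a value that a MAF could never take.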
Summary
Understanding these three groups of concepts will help you navigate allele terminology with confidence:
| Concept Group | Definition | Basis | Key Caveat |
|---|---|---|---|
| Major/Minor | Major: highest-frequency allele; Minor: second-highest-frequency allele | Frequency in a specific population | Population-specific; may differ from reference/alternative |
| Reference/Alternative | Reference: allele in reference genome; Alternative: other alleles at that position | Reference genome | Unrelated to frequency; reference may be minor in some populations |
| Reference/Risk/Effect | Reference: non-risk/non-effect allele (baseline); Risk/Effect: allele with effect on phenotype | GWAS association testing context | Meaning varies by software; check effect size interpretation |
Important reminders:
| Common Misconception | Reality |
|---|---|
| Major = Reference | They can differ - reference is genome-based, major is frequency-based |
| Minor = Risk/Effect | They can differ - risk/effect is context-dependent |
| Reference allele meaning is consistent | Always check what effect size refers to in your software |
| Naming conventions are standardized | Allele naming is highly inconsistent - always check source documentation |

Modern best practice: align the reference allele in association testing with the reference genome for consistency.
Critical Reminder
Due to the inconsistent naming conventions across software and databases, always verify allele designations by checking the documentation of your specific data source or software. Do not assume naming conventions are universal.