Phasing

The human genome is diploid. How variants are distributed between the two homologous chromosomes can affect the interpretation of genotype data, for example in allele-specific expression, context-informed annotation, and loss-of-function (LoF) compound heterozygous events.

Example

(Figure from SHAPEIT5)

In the illustration above, when LoF variants are present on both copies of a gene, the gene is considered knocked out.
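As a rough illustration of why phase matters here, the sketch below classifies two heterozygous LoF variants in the same gene from their phased genotypes. It assumes VCF-style phased genotypes (`0|1`, `1|0`), where the allele left of `|` is on one haplotype and the allele on the right is on the other. The function name `lof_configuration` is made up for this example and is not part of any tool.

```shell
# Classify two heterozygous LoF variants in one gene from phased genotypes.
# Assumption: genotypes are VCF-style phased calls, "0|1" or "1|0".
lof_configuration() {
    local gt1="$1" gt2="$2" gt
    # The question only makes sense for two phased heterozygous calls.
    for gt in "$gt1" "$gt2"; do
        case "$gt" in
            "0|1"|"1|0") ;;
            *) echo "not_applicable"; return 0 ;;
        esac
    done
    if [ "$gt1" = "$gt2" ]; then
        echo "cis"      # both LoF alleles on the same copy; the other copy is intact
    else
        echo "trans"    # one LoF allele on each copy; both copies are disrupted
    fi
}

lof_configuration "0|1" "1|0"   # prints: trans
lof_configuration "0|1" "0|1"   # prints: cis
```

With unphased calls (`0/1` and `0/1`) the two cases cannot be told apart, which is exactly the information phasing adds.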

Trio data and long-read sequencing can resolve haplotypes directly, but such data are not always available. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are treated as independent, an unphased genotype can be modeled as the sum of two haplotypes.
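To make the "sum of two haplotypes" idea concrete, here is a minimal sketch. It assumes biallelic sites coded 0 (reference) and 1 (alternate), with each haplotype written as a string of alleles. The function name `haps_to_genotype` is hypothetical and only serves this illustration.

```shell
# Add two haplotypes site by site to get unphased genotype dosages (0, 1 or 2).
# Assumption: each haplotype is a string of 0/1 alleles of equal length.
haps_to_genotype() {
    awk -v h1="$1" -v h2="$2" 'BEGIN {
        if (length(h1) != length(h2)) {
            print "length mismatch" > "/dev/stderr"
            exit 1
        }
        g = ""
        for (i = 1; i <= length(h1); i++)
            g = g (substr(h1, i, 1) + substr(h2, i, 1))   # dosage at site i
        print g
    }'
}

haps_to_genotype 0101 1100   # prints: 1201
haps_to_genotype 1101 0100   # prints: 1201
```

Two different haplotype pairs give the same genotype `1201`. With k heterozygous sites there are 2^(k-1) compatible pairs, and statistical phasing tries to pick the most likely one.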

Recent methods incorporate long IBD sharing, local haplotypes, and similar information to make phasing tractable for large datasets. You can read the following methods if you are interested.

How to do phasing

In most cases, phasing is just a preliminary step before imputation, and the details of how the phasing goes are of little concern. Still, there are several considerations, such as reference-based versus reference-free phasing, large versus small sample size, and the rare-variant cutoff. No single method fits all cases best.

Here I show one example using EAGLE2.

eagle \
    --vcf=target.vcf.gz \
    --geneticMapFile=genetic_map_hg19_withX.txt.gz \
    --chrom=19 \
    --outPrefix=target.eagle \
    --numThreads=10
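
A quick sanity check after phasing is to count how many genotype calls use the phased separator `|` rather than `/`. The sketch below is one way to do that with awk. It assumes GT is the first FORMAT subfield (as the VCF convention requires when GT is present) and that the output file is named `target.eagle.vcf.gz`, following the `--outPrefix` above. The function name `count_phasing` is made up for this example.

```shell
# Count phased ("|") and unphased ("/") genotype calls in a VCF read from stdin.
# Assumption: GT is the first subfield of each sample column.
count_phasing() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            for (i = 10; i <= NF; i++) {   # sample columns start at column 10
                split($i, f, ":")
                if (f[1] ~ /\|/) phased++
                else if (f[1] ~ /\//) unphased++
            }
        }
        END { printf "phased=%d unphased=%d\n", phased, unphased }
    '
}

# Tiny inline VCF with two samples and two sites, for demonstration only.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n19\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n19\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t1|0\n' | count_phasing
# prints: phased=3 unphased=1

# On the real output (file name is an assumption based on --outPrefix):
# gzip -dc target.eagle.vcf.gz | count_phasing
```

If the phasing ran as expected, the unphased count should be zero or close to it.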