Skip to content

Phasing

Human genome is diploid. Distribution of variants between homologous chromosomes can affect the interpretation of genotype data, such as allele specific expression, context-informed annotation, loss-of-function compound heterozygous events.

Example

image

( SHAPEIT5 )

In the above illustration, when LoF variants are on both copies of a gene, the gene is thought knocked out

Trio data and long read sequencing can solve the haplotyping problem. That is not always possible. Statistical phasing is based on the Li & Stephens Markov model. The haploid version of this model (see Imputation) is easier to understand. Because the maternal and paternal haplotypes are independent, unphased genotype could be constructed by the addition of two haplotypes.

Recent methods had incorporated long IBD sharing, local haplotypes, etc, to make it tractable for large datasets. You could read the following methods if you are interested.

How to do phasing

In most of the cases, phasing is just a pre-step of imputation, and we do not care about how the phasing goes. But there are several considerations, like reference-based or reference-free, large and small sample size, rare variants' cutoff. There is no single method that could best fit all cases.

We will show two examples:

  • Reference-based phasing using SHAPEIT2
  • Cohort-based phasing using Eagle2

Phasing using SHAPEIT2

Here, we show an example using SHAPEIT2, which is another commonly used tool for haplotype phasing. SHAPEIT2 is usually used for small cohorts.

Note

We will do a phasing for 1KG EAS samples using 1KG as reference, which does not make any sense. This is just to demonstrate how to do reference-based phasing using SHAPEIT2.

Quote

Delaneau, O., Zagury, J. F., & Marchini, J. (2013). Improved whole-chromosome phasing for disease and population genetic studies. Nature methods, 10(1), 5-6.

We will conduct a reference-based phasing using SHAPEIT2 for chromosome 22.

First, download SHPAEIT2 from SHPAEIT2 website. Unzip and add it to your environment. We also need reference files (1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) b37 June 2014) and genetic maps, which are also available on SHPAEIT2 website

Next, for simplicity, we extract only JPT samples and common variants on chr22 from the clean datasets obtained in the previous section.

inputbed=../04_Data_QC/sample_data.clean
jptsample=../01_Dataset/JPT.sample
inputbedchr22=./sample_data.chr22.clean

plink2 \
        --bfile ${inputbed} \
        --make-bed \
        --keep ${jptsample} \
        --maf 0.05 \
        --chr 22 \
        --out ${inputbedchr22}

Tip

Usually, we also remove G/C or A/T SNPs before phasing.

Then, it is needed to check the alignment of SNPs between your genotype file and the reference file.

out=./1KG.JPT.chr22.phased.shapeit2.reference_based
outputhaps=${out}.haps
outputsample=${out}.sample
outputlog=${out}
outputlogcheck=${out}.check

geneticmap=~/tools/shapeit2/shapeit.v2.904.3.10.0-693.11.6.el7.x86_64/reference/ALL.integrated_phase1_SHAPEIT_16-06-14.nomono/genetic_map_chr22_combined_b37.txt
inputrefhap=~/tools/shapeit2/shapeit.v2.904.3.10.0-693.11.6.el7.x86_64/reference/ALL.integrated_phase1_SHAPEIT_16-06-14.nomono/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.nomono.haplotypes.gz
inputreflegend=~/tools/shapeit2/shapeit.v2.904.3.10.0-693.11.6.el7.x86_64/reference/ALL.integrated_phase1_SHAPEIT_16-06-14.nomono/ALL.chr22.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.nomono.legend.gz
inputrefsample=~/tools/shapeit2/shapeit.v2.904.3.10.0-693.11.6.el7.x86_64/reference/ALL.integrated_phase1_SHAPEIT_16-06-14.nomono/ALL.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.sample

shapeit -check \
        -B ${inputbedchr22} \
        -M ${geneticmap} \
        --input-ref ${inputrefhap} ${inputreflegend} ${inputrefsample} \
        --output-log ${outputlogcheck}
This command will generate a list of variants for exclusion. See Phasing with a reference panel Step2.

Finally, excluded the mismatched variants and run phasing using 1KG PhaseI EAS samples as reference.

excludesnp=${outputlogcheck}.snp.strand.exclude
echo "EAS" > group.list
includegrp=group.list

shapeit --input-bed ${inputbedchr22} \
        --input-map ${geneticmap} \
        --input-ref ${inputrefhap} ${inputreflegend} ${inputrefsample} \
        --output-max ${outputhaps} ${outputsample} \
        --output-log ${outputlog} \
        --exclude-snp  ${excludesnp} \
        --thread 1 \
        --include-grp ${includegrp} \
        --seed 123 \
        --states 200 \
        --window 2

This command will generate a haplotype file .hap and a sample file .sample. We need to convert the files to VCF, compress and index the VCF for downstream analysis.

outputvcf=${out}.vcf

shapeit \
    -convert \
    --input-haps ${outputhaps} ${outputsample} \
    --output-vcf ${outputvcf}

bgzip ${outputvcf} && \
tabix -p vcf ${outputvcf}.gz

A look at the phased VCF:

| indicates that the genotypes are phased.

##fileformat=VCFv4.1
##fileDate=16102024_11h58m32s
##source=SHAPEIT2.v904
##log_file=shapeit_16102024_11h58m32s_53660203-b6af-49ba-a054-ae0fdd303a78.log
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA18939 NA18940 NA18941 NA18942 NA18943 NA18944
NA18945 NA18946 NA18947 NA18948 NA18949 NA18950 NA18951 NA18952 NA18953 NA18954 NA18956 NA18957 NA18959 NA18960 NA18961
NA18962 NA18964 NA18965 NA18966 NA18967 NA18968 NA18969 NA18970 NA18971 NA18972 NA18973 NA18974 NA18975 NA18976 NA18977
NA18978 NA18979 NA18980 NA18981 NA18982 NA18983 NA18984 NA18985 NA18986 NA18987 NA18988 NA18989 NA18990 NA18991 NA18992
NA18993 NA18994 NA18995 NA18997 NA18998 NA18999 NA19000 NA19001 NA19002 NA19003 NA19004 NA19005 NA19006 NA19007 NA19009
NA19010 NA19011 NA19012 NA19054 NA19055 NA19056 NA19057 NA19058 NA19059 NA19060 NA19062 NA19063 NA19064 NA19065 NA19066
NA19067 NA19068 NA19070 NA19072 NA19074 NA19075 NA19076 NA19077 NA19078 NA19079 NA19080 NA19081 NA19082 NA19083 NA19084
NA19085 NA19086 NA19087 NA19088 NA19089 NA19090 NA19091
22      16051453        22:16051453:A:C C       A       .       PASS    .       GT      1|1     1|1     1|1     1|1
1|1     1|1     0|1     1|1     1|1     1|1     1|1     0|1     0|1     1|1     1|1     1|1     1|1     1|1     1|1
1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     1|1
0|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     1|1     0|1     1|1     1|1     1|1     1|1
1|1     0|1     1|1     0|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1
1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     0|1     1|1     1|1     1|1     0|1
1|1     1|1     0|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     1|1     0|1     1|1     1|1     1|1
1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1

Phasing using Eagle

Next, we will use Eagle2 as an example for cohort-based phasing.

Quote

Loh, P. R., Danecek, P., Palamara, P. F., Fuchsberger, C., A Reshef, Y., K Finucane, H., ... & L Price, A. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nature genetics, 48(11), 1443-1448.

Download Eagle from Eagle2 website. Unzip and add it to your environment. We also need genetic maps, which are also available on Eagle2 website

Cohort-based phasing (without reference) using eagle2. Eagle2

inputbedchr22=./sample_data.chr22.clean
geneticmap=~/tools/eagle/genetic_map_hg19_withX.txt.gz
out=./1KG.JPT.chr22.phased.eagle2.cohort_based

eagle \
    --bfile=${inputbedchr22} \
    --geneticMapFile=${geneticmap} \
    --outPrefix=${out} \
    --maxMissingPerSnp=1 \
    --maxMissingPerIndiv=1 \
    --numThreads=4 \
    --chrom=22

Use shapeit2 to convert .hap and .sample to VCF

outputhaps=${out}.haps.gz
outputsample=${out}.sample
outputvcf=${out}.vcf

shapeit \
    -convert \
    --input-haps ${outputhaps} ${outputsample} \
    --output-vcf ${outputvcf}

bgzip ${outputvcf} && \
tabix -p vcf ${outputvcf}.gz
##fileformat=VCFv4.1
##fileDate=16102024_17h46m51s
##source=SHAPEIT2.v904
##log_file=shapeit_16102024_17h46m51s_16aef2b1-de11-4c78-b8a6-f15bf1c2d14b.log
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA18939 NA18940 NA18941 NA18942 NA18943 NA18944NA18945  NA18946 NA18947 NA18948 NA18949 NA18950 NA18951 NA18952 NA18953 NA18954 NA18956 NA18957 NA18959 NA18960 NA18961NA18962  NA18964 NA18965 NA18966 NA18967 NA18968 NA18969 NA18970 NA18971 NA18972 NA18973 NA18974 NA18975 NA18976 NA18977NA18978  NA18979 NA18980 NA18981 NA18982 NA18983 NA18984 NA18985 NA18986 NA18987 NA18988 NA18989 NA18990 NA18991 NA18992NA18993  NA18994 NA18995 NA18997 NA18998 NA18999 NA19000 NA19001 NA19002 NA19003 NA19004 NA19005 NA19006 NA19007 NA19009NA19010  NA19011 NA19012 NA19054 NA19055 NA19056 NA19057 NA19058 NA19059 NA19060 NA19062 NA19063 NA19064 NA19065 NA19066NA19067  NA19068 NA19070 NA19072 NA19074 NA19075 NA19076 NA19077 NA19078 NA19079 NA19080 NA19081 NA19082 NA19083 NA19084NA19085  NA19086 NA19087 NA19088 NA19089 NA19090 NA19091
22      16051453        22:16051453:A:C C       A       .       PASS    .       GT      1|1     1|1     1|1     1|1    1|1      1|1     0|1     1|1     1|1     1|1     1|1     1|0     1|0     1|1     1|1     1|1     1|1     1|1     1|1    1|1      1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     1|1    0|1      1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     1|1     0|1     1|1     1|1     1|1     1|1    1|1      0|1     1|1     1|0     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1    1|1      1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|1     0|1     1|1     1|1     1|1     0|1    1|1      1|1     1|0     1|1     1|1     1|1     1|1     1|1     1|1     1|0     1|1     1|0     1|1     1|1     1|1    1|1      1|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1